CN110800024B - Method and electronic device for estimating current posture of hand - Google Patents

Method and electronic device for estimating current posture of hand

Info

Publication number
CN110800024B
Authority
CN
China
Prior art keywords
hand
pose
pixels
signed distance
candidate
Prior art date
Legal status
Active
Application number
CN201880036103.XA
Other languages
Chinese (zh)
Other versions
CN110800024A (en)
Inventor
Jonathan James Taylor
Vladimir Tankovich
Danhang Tang
Cem Keskin
Adarsh Prakash Murthy Kowdle
Philip L. Davidson
Shahram Izadi
David Kim
Current Assignee
Google LLC
Original Assignee
Google LLC
Priority date
Filing date
Publication date
Priority claimed from US 15/994,563 (US10614591B2)
Application filed by Google LLC
Priority to CN202110829328.4A (CN113762068A)
Publication of CN110800024A
Application granted
Publication of CN110800024B

Classifications

    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods
    • G06T 7/149 Segmentation; Edge detection involving deformable models, e.g. active contour models
    • G06T 7/162 Segmentation; Edge detection involving graph-based methods
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06T 7/251 Analysis of motion using feature-based methods involving models
    • G06T 7/74 Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G06T 7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/20076 Probabilistic image processing
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30196 Human being; Person

Abstract

The electronic device (100) estimates a pose of the hand (120) by volumetrically deforming a signed distance field (330) using a skinned tetrahedral mesh (410) to locate a local minimum of an energy function (810), wherein the local minimum corresponds to the pose of the hand. The electronic device identifies the pose of the hand by fitting an implicit surface model (305) of the hand to pixels (320) of a depth image (115) corresponding to the hand. The electronic device warps space from a base pose to a deformed pose using the skinned tetrahedral mesh to define an explicitly expressed signed distance field from which a hand tracking module derives candidate poses of the hand. The electronic device then minimizes an energy function based on the distance of each respective pixel to identify the candidate pose that is closest to the pose of the hand.

Description

Method and electronic device for estimating current posture of hand
Background
Hand tracking allows articulated hand gestures to be used as an input mechanism for virtual reality and augmented reality systems, thereby supporting a more immersive user experience. Generative hand tracking systems capture images and depth data of a user's hand and fit a generative model to the captured images or depth data. To fit the model to the captured data, the hand tracking system defines and optimizes an energy function to find a minimum corresponding to the correct hand pose. However, conventional hand tracking systems often have accuracy and latency issues, which can lead to an unsatisfactory user experience.
Drawings
The present disclosure may be better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.
Fig. 1 is a schematic diagram illustrating a hand tracking system that estimates a current pose of a hand based on a depth image in accordance with at least one embodiment of the present disclosure.
Fig. 2 is a schematic diagram illustrating a hand tracking module of the hand tracking system of fig. 1 configured to estimate a current pose of a hand based on a depth image in accordance with at least one embodiment of the present disclosure.
FIG. 3 is a schematic diagram illustrating interpolating a pre-computed grid of signed distances to generate a smoothed signed distance field for estimating distance from a point to a model according to at least one embodiment of the present disclosure.
Fig. 4 is a schematic diagram illustrating a basic pose of a skinned tetrahedral volume mesh in accordance with at least one embodiment of the present disclosure.
Fig. 5 is a schematic diagram illustrating a deformation posture of a tetrahedral volume mesh in accordance with at least one embodiment of the present disclosure.
Fig. 6 is a schematic diagram illustrating a two-dimensional cross-section of an end of a finger contained in a basic pose within a triangular mesh in accordance with at least one embodiment of the present disclosure.
Fig. 7 is a schematic diagram illustrating a two-dimensional cross-section of an end of a finger contained within a deformed triangular mesh in a query pose, according to at least one embodiment of the present disclosure.
Fig. 8 is a schematic diagram of an energy function of a distance between each point of a three-dimensional (3D) point cloud of depth images and a candidate pose, according to at least one embodiment of the present disclosure.
Fig. 9 is a flowchart illustrating a method of estimating a current pose of a hand based on a captured depth image in accordance with at least one embodiment of the present disclosure.
Fig. 10 is a flow diagram illustrating a method of minimizing an energy function by initialization using a pose from a previous frame and one or more poses derived from a coarse overall predicted pose, in accordance with at least one embodiment of the present disclosure.
Fig. 11 is a flow diagram illustrating a method of predicting a rough overall predicted pose of a hand in accordance with at least one embodiment of the present disclosure.
Detailed Description
The following description is intended to convey a thorough understanding of the present disclosure by providing numerous specific embodiments and details, including estimating hand pose by volumetrically deforming a signed distance field based on a skinned tetrahedral mesh. It is to be understood, however, that the disclosure is not limited to these specific embodiments and details, which are exemplary only, and that the scope of the disclosure is accordingly intended to be limited only by the appended claims and equivalents thereof. It should also be appreciated that in light of known systems and methods, those skilled in the art will be able to utilize the present disclosure for its intended purposes and benefits in any number of alternative embodiments, depending upon specific design and other needs.
Fig. 1-11 illustrate techniques for estimating the pose of at least one hand by volumetrically deforming a signed distance field using a skinned tetrahedral mesh to locate a local minimum of an energy function, wherein the local minimum corresponds to the hand pose. A hand tracking module receives a depth image of the hand from a depth camera and identifies the pose of the hand by fitting an implicit surface model of the hand, defined as the zero crossing of an explicitly expressed signed distance function, to the pixels of the depth image corresponding to the hand. The hand tracking module fits the model to the pixels by first volumetrically warping the pixels into the base pose and then interpolating a 3D grid of pre-computed signed distance values to estimate the distance to the implicit surface model. The volume warp is performed using a skinned tetrahedral mesh. The hand tracking module warps space from the base pose to the deformed pose using the skinned tetrahedral mesh to define an explicitly expressed signed distance field from which the hand tracking module derives candidate poses for the hand. By warping pixels from the deformed pose back to the base pose, explicit generation of the explicitly expressed signed distance function is avoided, since the distance to the surface can be estimated by interpolating the 3D grid of pre-computed signed distance values. The hand tracking module then minimizes an energy function based on the distance of each corresponding pixel to identify the candidate pose that is closest to the pose of the hand.
In some embodiments, the hand tracking module uses the pose from the previous frame (i.e., the depth image immediately preceding the current depth image) to initialize a candidate pose. Hand tracking systems utilize depth cameras with very high frame rates to minimize the difference between the true pose in the previous frame and the true pose in the current frame. In some embodiments, the hand tracking module also initializes a candidate pose with a predicted pose. To predict the pose, the hand tracking module segments the pixels of the depth image based on the probability of each pixel representing the left hand, the right hand, or the background. The hand tracking module generates a three-dimensional (3D) point cloud for at least one of the left and right hands based on the corresponding pixels and predicts an overall orientation of the hand based on a comparison of the 3D point cloud to a plurality of known poses to generate a predicted current pose.
Fig. 1 illustrates a hand tracking system 100 in accordance with at least one embodiment of the present disclosure, the hand tracking system 100 configured to support hand tracking functionality of AR/VR applications using depth sensor data. The hand tracking system 100 can include a user portable mobile device such as a tablet computer, a cellular telephone with computing capabilities (e.g., "smart phone"), a Head Mounted Display (HMD), a notebook computer, a Personal Digital Assistant (PDA), a gaming system remote control, a television remote control, a camera accessory with or without a screen, and so forth. In other embodiments, hand tracking system 100 can include another type of mobile device, such as an automobile, a robot, a remote drone, or other onboard device, and the like. For ease of illustration, the hand tracking system 100 is generally described herein in the exemplary context of a mobile device, such as a tablet computer or smartphone; however, hand tracking system 100 is not limited to these exemplary embodiments. According to at least one embodiment of the present disclosure, hand tracking system 100 includes a hand tracking module 110, the hand tracking module 110 estimating a current pose 140 of hand 120 based on a depth image 115 captured by depth camera 105. In this example, the hand 120 is the right hand making a tap gesture with the thumb and index finger extended and the remaining fingers bent down to the palm.
In one embodiment, the depth camera 105 uses a modulated light projector (not shown) to project a modulated light pattern into the local environment and uses one or more imaging sensors 106 to capture reflected light of the modulated light pattern reflected off objects in the local environment 112. These modulated light patterns can be spatially modulated light patterns as well as temporally modulated light patterns. The captured reflection of the modulated light pattern is referred to herein as a "depth image" 115. In some embodiments, the depth camera 105 calculates the depth of the object, i.e., the distance of the object from the depth camera 105, based on an analysis of the depth image 115.
The hand tracking module 110 receives the depth image 115 from the depth camera 105 and identifies a pose of the hand 120 by fitting a hand model to pixels of the depth image 115 corresponding to the hand 120. In some embodiments, the model is parameterized by 28 values (e.g., four joints for each of the five fingers, two degrees of freedom at the wrist, and six degrees of freedom for the overall orientation). In some embodiments, the hand tracking module 110 parameterizes the overall rotation of the model using a quaternion, so that the pose vector θ has 29 dimensions. The hand tracking module 110 segments and backprojects from the depth image 115 a set of 3D data points {x_n}_{n=1}^N corresponding to the hand 120. The hand tracking module 110 then fits the parameterized implicit surface model, formulated as the zero crossing of a well-defined signed distance function D(x, θ), to the set of 3D data points {x_n}_{n=1}^N. The hand tracking module 110 minimizes the distance from each 3D data point to the surface by minimizing the energy

E_data(θ) = Σ_{n=1}^{N} D(x_n, θ)^2,

where E_data(θ) is the energy of the pose θ, D(x_n, θ) is the distance from each 3D data point x_n to the nearest point of the surface model in pose θ, and N is the number of 3D data points in the set.
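As a minimal illustration of this data term (a sketch, not the implementation described in this disclosure), the following Python snippet evaluates E_data(θ) given any callable that returns the signed distance to the model surface; the `signed_distance` callable and the spherical stand-in model are assumptions made for the example only.

```python
import numpy as np

def data_energy(points, pose, signed_distance):
    """Sum of squared distances from each 3D data point to the model surface.

    points:          (N, 3) array of back-projected hand pixels.
    pose:            candidate pose parameters (opaque to this function).
    signed_distance: callable (point, pose) -> signed distance to the surface.
    """
    return sum(signed_distance(x, pose) ** 2 for x in points)

# Toy usage: a unit sphere centered at the origin stands in for the hand model.
sphere_sdf = lambda x, pose: np.linalg.norm(x) - 1.0
cloud = np.random.default_rng(0).normal(size=(100, 3))
print(data_energy(cloud, None, sphere_sdf))
```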
To facilitate greater accuracy and efficiency in minimizing the energy, the hand tracking module 110 defines the distance D(x, θ) to the implicit surface of the hand model in a manner that is relatively easy and fast to compute. The hand tracking module 110 constructs a tetrahedral mesh (not shown) and skins its vertices to a skeleton (not shown). By expressing x in terms of its barycentric coordinates within the tetrahedra of the mesh, the hand tracking module 110 defines a function that warps space from the base pose to the deformed pose, as described in more detail below. Based on the deformation, the hand tracking module 110 defines an explicitly expressed signed distance field. A point in the space of the current pose can be warped back to the base pose, where the distance to the surface can be efficiently estimated by interpolating a pre-computed 3D grid of signed distances. The hand tracking module 110 uses this as part of its processing to quickly estimate the current pose 140 of the hand 120.
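The following sketch illustrates the barycentric construction of such a warp for a single tetrahedron (the vertex data are made up for the example; a full implementation would first locate the tetrahedron that contains the point):

```python
import numpy as np

def warp_point(x, V_base, V_deformed):
    """Warp point x from the base pose to the deformed pose through one tetrahedron.

    V_base, V_deformed: 3x4 matrices whose columns are the tetrahedron's vertices
    in the base pose and in the deformed pose, respectively.
    """
    A = np.vstack([V_base, np.ones(4)])            # affine system for barycentric coords
    b = np.linalg.solve(A, np.append(x, 1.0))      # barycentric coordinates of x
    return V_deformed @ b                          # same coordinates, deformed vertices

# Toy usage: a unit tetrahedron stretched by 2x along the x-axis.
V0 = np.array([[0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0]])
V1 = V0.copy()
V1[0] *= 2.0
print(warp_point(np.array([0.25, 0.25, 0.25]), V0, V1))   # -> [0.5, 0.25, 0.25]
```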
In some embodiments, hand tracking module 110 uses current pose estimate 140 to update graphical data 135 on display 130. In some embodiments, the display 130 is a physical surface, such as a tablet, a mobile phone, a smart device, a display monitor, one or more arrays of display monitors, a laptop, a sign, etc., or a projection onto a physical surface. In some embodiments, display 130 is planar. In some embodiments, display 130 is curved. In some embodiments, display 130 is a virtual surface, such as a three-dimensional projection or holographic projection of an object in space including virtual reality and augmented reality. In some embodiments where display 130 is a virtual surface, the virtual surface is displayed within the user's HMD. The location of the virtual surface may be relative to a stationary object (such as a wall or furniture) within the user's local environment 112.
Fig. 2 is a schematic diagram illustrating a hand tracking module 110 of the hand tracking system 100 of fig. 1 in accordance with at least one embodiment of the present disclosure. The hand tracking module 110 includes a memory 205, a pixel segmenter 210, a re-initializer 215, an interpolator 220, and a volume deformer 225. Each of these modules represents hardware, software, or a combination thereof configured to perform the operations described herein. Hand tracking module 110 is configured to receive depth image 115 from a depth camera (not shown) and generate current pose estimate 140 based on depth image 115.
The memory 205 is a memory device generally configured to store data, and thus may be a random access memory (RAM) module, a non-volatile memory device (e.g., flash memory), or the like. The memory 205 may form part of the memory hierarchy of the hand tracking system 100 and may include other memory modules, such as additional caches not shown in FIG. 1. The memory 205 is configured to receive and store the depth image 115 from a depth camera (not shown).
The pixel segmenter 210 is configured to segment the pixels of the depth image 115 into pixels corresponding to the left hand, the right hand, and the background. In some embodiments, the pixel segmenter 210 assigns to each pixel of the depth image 115 probabilities p_left, p_right, p_bg ∈ [0, 1] of corresponding to the left hand, the right hand, and the background, respectively, to generate a probability map P. In some embodiments, the pixel segmenter 210 thresholds the probability map at a high value η_high ∈ [0, 1], convolves the output with a large-bandwidth Gaussian filter, and then finds the location of the maximum, which the pixel segmenter 210 assigns as the hand location. The pixel segmenter 210 then thresholds the probability map at a smaller value η_low and keeps the pixels whose 3D locations lie within a radius of the detected hand location to segment the hand pixels.
In some embodiments, the pixel segmenter 210 also trains a random decision forest (RDF) classifier to produce P. The RDF classifier (not shown) employs depth- and translation-invariant features that threshold the depth difference of two pixels at depth-normalized offsets around the center pixel. For each pixel p at coordinate (u, v) in the depth image I, each split node in the tree evaluates the following function:

f(I, u, v) = I(u + Δu_1/Γ, v + Δv_1/Γ) − I(u + Δu_2/Γ, v + Δv_2/Γ) > τ,

where Γ = I(u, v), (Δu_1, Δv_1) and (Δu_2, Δv_2) are the two offsets, and τ is the threshold of the split node. In some embodiments, to enrich the pool of features for rotation-invariant subtasks, such as a single outstretched hand, the pixel segmenter 210 introduces a new family of rotation-invariant features that threshold the average depth of two concentric rings:

(1/K_1) R(u, v, r_1, I) − (1/K_2) R(u, v, r_2, I) > τ,

where R(u, v, r, I) is the sum of the K depth pixels found on a ring of depth-scaled radius r around the center pixel. In some embodiments, the pixel segmenter 210 approximates the ring with a fixed number of points k:

R(u, v, r, I) ≈ Σ_{j=1}^{k} I(u + (r/Γ) cos(2πj/k), v + (r/Γ) sin(2πj/k)).

In some embodiments, the pixel segmenter 210 additionally defines a unary version of the feature that compares the ring average against the depth of the center pixel:

(1/k) R(u, v, r, I) − Γ > τ.
in training, the pixel segmenter 210 samples from a pool of binary and unary rotation-related and invariant features based on the learned previous pose. In some embodiments, for each feature considered, pixel segmenter 210 uniformly samples multiple τ values from a fixed range and selects the value that maximizes the information gain. The pixel segmentor 210 outputs a segmented depth image R per hand.
In some embodiments, the pixel segmenter 210 uses a convolutional neural network (CNN), a random decision forest (RDF), or both to generate a probability map that encodes, for each pixel, the probability that the pixel belongs to the left hand, the right hand, or the background, respectively. To detect the right hand, the pixel segmenter 210 temporarily sets to zero the entries of the probability map P_right that are lower than the high value η_high ∈ [0, 1]. The pixel segmenter 210 convolves the output with a large-bandwidth Gaussian filter and then uses the location of the maximum. The pixel segmenter 210 then removes outliers from the original segmentation P_right by setting to zero any pixel whose probability is less than η_low ∈ [0, η_high] or whose 3D location is not contained within a radius of the detected hand location. The pixel segmenter 210 thus ensures that pixels far from the most prominent hand (e.g., pixels on the hands of other people in the background) do not contaminate the segmentation, while allowing the machine learning method to discard nearby pixels (e.g., pixels of the user's chest) that are identified as not belonging to the hand. The pixel segmenter 210 uses the depth camera 105 parameters to backproject the pixels that pass these tests into 3D space to form a point cloud {x_n}_{n=1}^N, thereby defining the energy E_data(θ) = Σ_{n=1}^{N} D(x_n, θ)^2.
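As an illustration of this detection and outlier-removal pipeline (a sketch under assumed thresholds, radius, and pinhole intrinsics; none of these values or names come from this disclosure):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def segment_right_hand(P_right, depth, fx, fy, cx, cy,
                       eta_high=0.8, eta_low=0.3, radius=0.15):
    """Return a 3D point cloud for the most prominent right hand."""
    # 1. Suppress weak responses and locate the most prominent hand.
    strong = np.where(P_right >= eta_high, P_right, 0.0)
    blurred = gaussian_filter(strong, sigma=15.0)          # large-bandwidth filter
    v0, u0 = np.unravel_index(np.argmax(blurred), blurred.shape)

    # 2. Back-project every pixel using the pinhole intrinsics.
    v, u = np.indices(depth.shape)
    z = depth
    pts = np.stack(((u - cx) * z / fx, (v - cy) * z / fy, z), axis=-1)

    # 3. Keep pixels that are confident enough and within `radius` of the hand center.
    center = pts[v0, u0]
    keep = (P_right >= eta_low) & (np.linalg.norm(pts - center, axis=-1) <= radius) & (z > 0)
    return pts[keep]                                        # (N, 3) point cloud

# Toy usage with a synthetic probability map and depth image.
P = np.zeros((120, 160)); P[40:80, 60:100] = 0.9
D = np.full((120, 160), 2.0); D[40:80, 60:100] = 0.6
cloud = segment_right_hand(P, D, fx=200.0, fy=200.0, cx=80.0, cy=60.0)
print(cloud.shape)
```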
The re-initializer 215 receives the segmented depth image R from the pixel segmenter 210. When the hand tracking module 110 loses track of the hand 120 of FIG. 1, the re-initializer 215 resets the hand tracking module 110 by generating a coarse overall predicted pose. In some embodiments, the hand tracking module 110 uses the coarse overall predicted pose as a candidate pose for the hand. In some embodiments, the re-initializer 215 estimates a six degree of freedom (6DOF) hand pose using an RDF by locating three joints on the palm of the hand, which is assumed to be planar. The three joints are the wrist joint q_w, the base of the index-finger metacarpophalangeal (MCP) joint q_i, and the base of the little-finger MCP joint q_p. The re-initializer 215 locates the three joints by evaluating each pixel p in R to produce a single vote for each joint's three-dimensional (3D) offset from p. The trees of the RDF are trained with regression targets to minimize the spread of the votes in the leaves. Each pixel votes for all joints, and the votes are summed separately to form a vote distribution for each joint. The re-initializer 215 selects the mode of each distribution as the final estimate of the three joints. In some embodiments, the re-initializer 215 converts the three joints to a reinitialization pose by setting the overall translation to q_w and deriving the overall orientation by finding the orientation of the three-dimensional triangle defined by the three joints. The re-initializer 215 then randomly samples a set of finger poses from a pose prior to generate the coarse overall predicted pose.
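A sketch of one way a 6DOF reinitialization pose could be assembled from the three voted palm joints; the orientation convention chosen here is an assumption for illustration only, not the construction mandated by this disclosure.

```python
import numpy as np

def palm_6dof(q_w, q_i, q_p):
    """Derive a rough 6DOF pose from the wrist, index-MCP and pinky-MCP joints.

    Translation is the wrist position; orientation is built from the plane of
    the (assumed planar) palm triangle.
    """
    q_w, q_i, q_p = map(np.asarray, (q_w, q_i, q_p))
    x_axis = q_i - q_w
    x_axis /= np.linalg.norm(x_axis)                 # wrist -> index MCP
    normal = np.cross(q_i - q_w, q_p - q_w)
    normal /= np.linalg.norm(normal)                 # palm normal
    y_axis = np.cross(normal, x_axis)                # completes a right-handed frame
    R = np.column_stack((x_axis, y_axis, normal))    # 3x3 rotation matrix
    return R, q_w                                    # (orientation, translation)

# Toy usage with three made-up joint positions (meters).
R, t = palm_6dof([0.0, 0.0, 0.5], [0.0, 0.1, 0.5], [0.05, 0.08, 0.5])
print(np.round(R, 3), t)
```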
The interpolator 220 precomputes a 3D grid of signed distance values in the base pose θ_0 and uses cubic interpolation to define the signed distance to the surface, D̃(x), for an arbitrary point x ∈ R^3. Cubic interpolation provides smooth first and second derivatives with respect to x. Thus, using cubic interpolation, the signed distance field smoothly captures the details of the model.
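A sketch of this precompute-and-interpolate step, using SciPy's spline-based map_coordinates for the cubic interpolation; the grid resolution, extent, and the spherical stand-in model are assumptions made for the example.

```python
import numpy as np
from scipy.ndimage import map_coordinates

# Precompute a dense grid of signed distances for a stand-in model (a unit sphere)
# over the cube [-1.5, 1.5]^3 in the base pose.
res, lo, hi = 64, -1.5, 1.5
axis = np.linspace(lo, hi, res)
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
sdf_grid = np.sqrt(X**2 + Y**2 + Z**2) - 1.0

def sdf_base(points):
    """Cubic interpolation of the precomputed grid at arbitrary 3D points (M, 3)."""
    # Convert world coordinates to (fractional) grid indices.
    idx = (np.asarray(points) - lo) / (hi - lo) * (res - 1)
    return map_coordinates(sdf_grid, idx.T, order=3, mode="nearest")

# The interpolated field is smooth, so derivatives can be taken by finite differences
# (or analytically from the spline coefficients).
print(sdf_base([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.2, 0.0]]))
```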
The volume deformer 225 carries the signed distance field of the interpolator 220 into an arbitrary pose θ by using a linearly skinned tetrahedral mesh as a volume warp of the signed distance field. Rather than explicitly generating a deformed signed distance function, the volume deformer 225 can efficiently warp points in the current pose back into the base pose, so that the distance to the implicit surface and its derivatives can be quickly estimated by the interpolator. The volume deformer 225 defines the deformation of the vertices of the tetrahedral mesh via linear blend skinning.
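A minimal sketch of linear blend skinning applied to mesh vertices, assuming per-vertex weights and per-bone homogeneous transforms are given (the data layout and toy values are illustrative, not taken from this disclosure):

```python
import numpy as np

def linear_blend_skinning(vertices, weights, bone_transforms):
    """Deform vertices of a (tetrahedral) mesh with linear blend skinning.

    vertices:        (V, 3) rest-pose vertex positions.
    weights:         (V, B) skinning weights, each row summing to 1.
    bone_transforms: (B, 4, 4) homogeneous bone transforms for the target pose.
    """
    V = np.hstack([vertices, np.ones((len(vertices), 1))])        # homogeneous coords
    blended = np.einsum("vb,bij->vij", weights, bone_transforms)  # per-vertex blended transform
    deformed = np.einsum("vij,vj->vi", blended, V)                # apply it to each vertex
    return deformed[:, :3]

# Toy usage: two vertices influenced by two bones (identity and a 90-degree rotation).
rot = np.eye(4)
rot[:3, :3] = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
verts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
w = np.array([[1.0, 0.0], [0.5, 0.5]])
print(linear_blend_skinning(verts, w, np.stack([np.eye(4), rot])))
```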
Strictly speaking, the tetrahedral mesh defines a warp y = W(x, θ) from the base pose to the deformed pose. The function is largely invertible, in that the set of points in the base pose that deform to a given point in the current pose typically has size one, unless the deformation causes tetrahedra to self-intersect. In the latter case, the ambiguity is resolved by simply picking the point in the base pose that has the smaller absolute distance to the implicit surface defined by the interpolator 220. This defines a function W^{-1}(x, θ) that warps space from the deformed pose to the base pose. Therefore, for any pose θ, the distance to the surface D(x, θ) is defined as

D(x, θ) = D̃(W^{-1}(x, θ)),

which can be easily evaluated without explicitly generating a dense signed distance field in the deformed pose. The tetrahedral mesh thus carries the details of the signed distance field into different poses. Tetrahedral mesh warping introduces artifacts only at the connection points, which can be addressed by compressing the tetrahedral mesh only at the connection points.
The hand tracking module 110 composes the precomputed signed distance field D̃(x) from the interpolator 220 with the volume deformation W(x, θ) from the skinned volumetric tetrahedral mesh to define a well-defined signed distance field

D(x, θ) = D̃(W^{-1}(x, θ)),

which yields an estimated distance to the surface at a point x in the estimated pose. The hand tracking module 110 uses the explicitly expressed signed distance field D(x, θ) to define an energy function

E_data(θ) = Σ_{n=1}^{N} D(x_n, θ)^2,

although other terms encoding prior knowledge can be incorporated.
In some embodiments, the hand tracking module 110 first uses the pose θ_prev output by the system in the previous frame to initialize a candidate pose θ. In some embodiments, the hand tracking module 110 initializes other candidate poses θ using the coarse overall predicted pose θ_pred generated by the re-initializer 215. In some embodiments, the depth camera (not shown) employs a high frame rate so that the difference between the pose θ_prev in the previous frame and the true pose in the current frame is minimal. By minimizing the energy function, the hand tracking module 110 generates the current pose estimate 140.
In some embodiments, to track two hands, the hand tracking module 110 jointly optimizes the pose Θ = {θ_left, θ_right} and a set of right-hand assignments γ = {γ_n}_{n=1}^N, with γ_n ∈ {0, 1}, which implicitly defines the set of left-hand assignments {1 − γ_n}_{n=1}^N. The hand tracking module 110 then formulates the total energy to be optimized as

E(Θ, γ) = Σ_{n=1}^{N} [ γ_n (D(x_n, θ_right)^2 + ψ_n^{right}) + (1 − γ_n)(D(x_n, θ_left)^2 + ψ_n^{left}) ],

where ψ_n^{right} and ψ_n^{left} are penalties output by the segmentation forest for assigning data point n to the right-hand and left-hand pose, respectively. To optimize this function, the hand tracking module 110 alternates between Θ and γ, updating Θ with a Levenberg update and updating γ by considering whether assigning each data point to the left hand or the right hand reduces the energy.
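The γ-update described above has a simple per-point form: assign each point to whichever hand explains it at lower cost. A sketch (the cost expression follows the reconstruction above; all names are placeholders):

```python
import numpy as np

def update_assignments(d_right, d_left, psi_right, psi_left):
    """Greedy per-point assignment step of the alternating optimization.

    d_right / d_left:     arrays of |D(x_n, theta_right)| and |D(x_n, theta_left)|.
    psi_right / psi_left: per-point segmentation penalties for each hand.
    Returns gamma_n = 1 where the right-hand explanation is cheaper.
    """
    cost_right = d_right ** 2 + psi_right
    cost_left = d_left ** 2 + psi_left
    return (cost_right <= cost_left).astype(int)

# Toy usage with made-up distances and zero penalties for five points.
gamma = update_assignments(
    d_right=np.array([0.01, 0.30, 0.02, 0.50, 0.05]),
    d_left=np.array([0.40, 0.02, 0.30, 0.03, 0.06]),
    psi_right=np.zeros(5), psi_left=np.zeros(5))
print(gamma)   # -> [1 0 1 0 1]
```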
FIG. 3 illustrates interpolating pixels 320 of a depth image based on a pre-computed distance function to generate a smoothed signed distance field (SDF) 330 for estimating, in the base pose θ_0, the distance from a pixel 320 to the model 305, according to at least one embodiment of the disclosure. The interpolator 220 of FIG. 2 precomputes a dense grid 310 of signed distances 315 at the base pose θ_0. The interpolator 220 then uses cubic interpolation to define a signed-distance-to-surface function 325, D̃(x), for any point x ∈ R^3 in the neutral or base pose. Precomputing and interpolating the grid of signed distances 315 reduces the computational burden of evaluating the distance D(x, θ) and smoothly captures the high-frequency details of the model 305.
Fig. 4 illustrates a basic pose 400 of a tetrahedral volume mesh 410 of the volume deformer 225 of fig. 2, wherein the vertices are skinned to the dense SDF 330 of fig. 3, in accordance with at least one embodiment of the present disclosure. The skinned tetrahedral mesh 410 transfers the details of the dense SDF 330 to different poses. The skinned tetrahedral mesh 410 introduces artifacts only at the connection points. In some embodiments, the skinned tetrahedral mesh 410 is compressed, for example, at the connection points 415, 420, 425, while the dense SDF 330 represents the geometry of the pose in other regions. In some embodiments, a volume deformer (not shown) applies any mesh skinning technique to deform the individual SDF 330. Thus, the deformation function and the detail representation are decoupled, allowing the coarse tetrahedral mesh to be used to deliver the detailed static geometry represented by the SDF 330. This also presents the possibility of modifying the static geometry in the SDF 330 online without having to modify the deformation function.
Fig. 5 illustrates a deformation pose 500 of the tetrahedral volume mesh 410 of fig. 4 in accordance with at least one embodiment of the present disclosure. The volume deformer 225 of fig. 2 warps a point x to W(x, θ) using the tetrahedral volume mesh 410. Thus, the volume deformer 225 uses the tetrahedral volume mesh 410 to provide a function y = W(x, θ) that warps space from the base pose to the deformed pose. This function is largely invertible, so it is also possible to define a function x = W^{-1}(y, θ) that warps space from the deformed pose to the base pose. This allows the hand tracking module 110 to avoid explicitly warping the signed distance function into a new pose and generating it densely, which would be very expensive to do repeatedly while searching for the correct pose. Instead, the hand tracking module 110 can estimate the distance D(x, θ) from a point x to the implicit surface in any pose θ by warping x back into the base pose, where the distance to the surface can be quickly estimated by interpolating a precomputed 3D grid of signed distance values. Furthermore, because the warp and the signed distance field are differentiable almost everywhere, the hand tracking module 110 can also quickly query the derivatives to enable a fast local search over an energy function defined in terms of the distance to the surface.
Fig. 6 illustrates a two-dimensional (2D) cross-section of the tip of a finger 605 contained within a triangular mesh 610 in the base pose, in accordance with at least one embodiment of the present disclosure. The tetrahedral volume mesh 410 of figs. 4 and 5 is depicted as a 2D-equivalent triangular mesh 610 for ease of reference. Triangular mesh 610 includes triangles 614, 616, 618, 620, 622, 624, 626, and 628.
Fig. 7 illustrates a 2D cross-section of the tip of the finger 605 of fig. 6 contained within a deformed triangular mesh 710 in a query pose θ, in accordance with at least one embodiment of the present disclosure. Triangular meshes in 2D are analogous to tetrahedral meshes in 3D and are therefore used to illustrate the technique more simply. The tetrahedral mesh (shown as triangular mesh 710) includes tetrahedra (shown as triangles) 714, 716, 718, 720, 722, 724, 726, and 728, which correspond to the tetrahedra (or triangles) 614, 616, 618, 620, 622, 624, 626, and 628, respectively, of fig. 6. When the mesh 710 is deformed, each tetrahedron (or triangle) 714, 716, 718, 720, 722, 724, 726, and 728 defines an affine transformation between the base pose of fig. 6 and the query pose θ. This defines the volume warp W(x, θ) from the base pose to the query pose. Using the inverse affine transform of each tetrahedron (or triangle), one can attempt to define the unwarping function W^{-1}(x, θ). Using this approach, the volume deformer 225 of fig. 2 implicitly defines the signed distance field D(x, θ), as further described herein. For a query point x (e.g., point 730) that falls within the warped mesh 710, the tetrahedron (or triangle) τ containing that point can use its inverse affine transform to send the query point to B_τ(x, θ), where the distance to the implicitly encoded surface can be queried as D̃(B_τ(x, θ)). For a point that falls outside the deformed mesh 710 (e.g., point 732), the volume deformer 225 first measures the distance to the closest point contained in the tetrahedral mesh. The volume deformer 225 then adds to this distance the distance obtained by evaluating the distance of that closest point to the surface using the aforementioned technique.
In more detail, for any point x, the volume deformer 225 uses the closest point

q_τ(x, θ) = V_τ(θ) b_τ(x, θ),

where τ is the tetrahedron (or triangle) containing the closest point, V_τ(θ) is a matrix whose columns store the positions, in pose θ, of the four vertices of the tetrahedron τ (or the three vertices of the triangle τ), and b_τ(x, θ) holds the barycentric coordinates of the closest point within the tetrahedron (or triangle) τ in pose θ. In some embodiments, the volume deformer 225 warps the closest point back to the base pose, that is,

B_τ(x, θ) = V_τ(θ_0) b_τ(x, θ),

to query its distance to the implicitly encoded surface. When the query point x lies inside the tetrahedral mesh, q_τ(x, θ) = x; when x lies outside the tetrahedral mesh (e.g., point 732), the volume deformer accounts for the additional distance between q_τ(x, θ) and x. In some cases, the deformation of the tetrahedral mesh causes the query point x to fall into multiple overlapping tetrahedra, so that the volume warp is not strictly invertible. The volume deformer 225 therefore resolves this by defining the set of tetrahedra (or triangles) containing x as

T(x, θ) = {τ : x lies inside tetrahedron (or triangle) τ in pose θ}.
The volume deformer 225 then selects the tetrahedron (or triangle) τ*(x, θ) that will be used to warp the point back into the base pose, i.e.,

τ*(x, θ) = argmin_{τ ∈ T(x, θ)} |D̃(B_τ(x, θ))|   if T(x, θ) is non-empty,
τ*(x, θ) = argmin_{τ} ||x − q_τ(x, θ)||            otherwise.

The first case selects the tetrahedron (or triangle) that yields the smallest absolute distance to the surface when the point is warped back to the base pose. The second case selects the tetrahedron (or triangle) that is closest to the point in the current pose. The volume deformer 225 then defines the well-defined signed distance function to the surface as

D(x, θ) = ||x − q_{τ*}(x, θ)|| + D̃(B_{τ*}(x, θ)),

wherein the first term measures the distance to the closest point in the selected tetrahedron (or triangle) and the second term warps that closest point back to the base pose to evaluate the signed distance, and thus its distance to the surface.
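A sketch of evaluating D(x, θ) through a single tetrahedron under the formulation above (the closest-point step is approximated by clamping barycentric coordinates, and the base-pose SDF is a stand-in sphere; a full implementation would also apply the selection rule τ*(x, θ) over all containing tetrahedra):

```python
import numpy as np

def barycentric(p, V):
    """Barycentric coordinates of point p w.r.t. a tetrahedron with vertex matrix V (3x4)."""
    A = np.vstack([V, np.ones(4)])               # 4x4: append the affine constraint
    return np.linalg.solve(A, np.append(p, 1.0))

def distance_via_tet(x, V_pose, V_base, sdf_base):
    """Evaluate D(x, theta) through one tetrahedron.

    V_pose, V_base: 3x4 vertex matrices of the tetrahedron in the deformed and base poses.
    sdf_base:       signed distance function in the base pose.
    """
    b = barycentric(x, V_pose)
    # Crude exterior handling: clamp negative coordinates and renormalize (an
    # approximation of the true closest point q_tau on the tetrahedron).
    b_clamped = np.clip(b, 0.0, None)
    b_clamped /= b_clamped.sum()
    q = V_pose @ b_clamped                        # closest point in the deformed pose
    B = V_base @ b_clamped                        # same point warped back to the base pose
    return np.linalg.norm(x - q) + sdf_base(B)

# Toy usage: an undeformed tetrahedron and a stand-in spherical base-pose SDF.
V = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
sphere = lambda p: np.linalg.norm(p) - 0.5
print(distance_via_tet(np.array([0.2, 0.2, 0.2]), V, V, sphere))
```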
Thus, as τ*(x, θ) jumps from one tetrahedron (or triangle) to another, the volume deformer 225 divides space into a discrete set of cells. When x falls within at least one tetrahedron (or triangle), the volume deformer 225 maps the space in the current pose back to the base pose using the affine transformation defined by the selected tetrahedron (or triangle) for SDF evaluation. When x falls outside the tetrahedral mesh 410 (or triangular mesh 710), the volume deformer 225 selects the closest tetrahedron (or triangle) and similarly warps the closest point on its boundary into the base pose using the affine transformation for SDF evaluation. The volume deformer 225 adds the distance from x to that closest point on the tetrahedron boundary to this value to compensate for query points outside the tetrahedral mesh. In some embodiments, the volume deformer 225 adds more tetrahedra (or triangles) to smooth out bulging around the joints.
Fig. 8 is a schematic diagram of an energy function 810 based on the distance, given by the well-defined signed distance function, between each point of a three-dimensional (3D) point cloud derived from the depth image 115 of fig. 1 and a candidate pose, according to at least one embodiment of the present disclosure. The hand tracking module 110 of figs. 1 and 2 generates the energy function 810 to evaluate how well a candidate hand pose θ explains the points of the 3D point cloud. The hand tracking module 110 defines the energy function as

E_data(θ) = Σ_{n=1}^{N} D(x_n, θ)^2.

The well-defined signed distance field allows the distances and derivatives of D(x, θ) to be queried quickly. As a result, both the value and the descent direction of the above energy function can be queried quickly, so that a fast local search can be performed from an initialization pose.
In some embodiments, the hand tracking module 110 performs a local search to minimize the energy over candidate poses initialized with poses from previous frames 820 of the depth camera 105 of fig. 1. In some embodiments, the depth camera 105 is a high-frame-rate depth camera, so that the pose in the previous frame 825 is likely to be close to the true pose in the current frame due to the short time interval between frames. Quickly minimizing the aforementioned energy function facilitates processing depth frames at a high frame rate. In some embodiments, the hand tracking module 110 also initializes a candidate pose with the coarse overall predicted pose 830 generated by the re-initializer 215. By initializing candidate poses from one or both of the pose of the previous frame and the coarse overall predicted pose 830, the hand tracking module 110 avoids local minima of the energy function 810.
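As a sketch of this multi-initialization local search (plain gradient descent with a numerical gradient stands in for the Levenberg-style update actually used; the energy, step size, and pose dimensionality are illustrative):

```python
import numpy as np

def local_search(energy, theta0, step=0.05, iters=100, eps=1e-5):
    """Simple gradient descent on the pose energy from one initialization."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(iters):
        grad = np.array([(energy(theta + eps * e) - energy(theta - eps * e)) / (2 * eps)
                         for e in np.eye(len(theta))])   # central-difference gradient
        theta -= step * grad
    return theta

def estimate_pose(energy, theta_prev, theta_pred):
    """Run a local search from each initialization and keep the lower-energy result."""
    candidates = [local_search(energy, t) for t in (theta_prev, theta_pred)]
    return min(candidates, key=energy)

# Toy usage: a 2D "pose" with a quadratic energy whose minimum is at (1, -2).
E = lambda th: (th[0] - 1.0) ** 2 + (th[1] + 2.0) ** 2
print(np.round(estimate_pose(E, [0.0, 0.0], [3.0, 3.0]), 2))
```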
Fig. 9 is a flow diagram illustrating a method 900 of estimating a current pose of a hand based on a captured depth image in accordance with at least one embodiment of the present disclosure. At step 902, the depth camera 105 of FIG. 1 captures a depth image 115 of the hand 120. At step 904, the interpolator 220 of the hand tracking module 110 defines the dense signed distance field 330 based on the depth image 115. At step 906, the volume deformer 225 volumetrically deforms the dense signed distance field 330 based on the tetrahedral mesh 410. At step 908, the volume deformer 225 defines an explicitly expressed signed distance function based on the volume deformation of the dense signed distance field 330. At step 910, the hand tracking module 110 minimizes the energy function 810 to estimate the current pose 140 by utilizing the deformer and interpolator, which allow extremely fast querying of the distance to the implicit surface and the corresponding derivative in any pose.
Fig. 10 is a flow diagram illustrating a method 1000 of minimizing the energy function 810 over candidate poses initialized by a pose in a previous frame 825 and a coarse overall predicted pose 830, in accordance with at least one embodiment of the present disclosure. At step 1002, the hand tracking module 110 sets the pose from the previous frame 825 as a first initialization of the candidate pose. At step 1004, the hand tracking module 110 sets the coarse overall predicted pose 830 as a second initialization of the candidate pose. At step 1006, the hand tracking module 110 utilizes the explicitly expressed signed distance function to provide a fast local search from each initialization. At step 1008, the hand tracking module 110 estimates the current pose 140 as the candidate pose with the minimum energy function 810.
Fig. 11 is a flow diagram illustrating a method 1100 of generating the coarse overall predicted pose 830 of the hand 120 in accordance with at least one embodiment of the present disclosure. At step 1102, the memory 205 receives the depth image 115. At step 1104, the pixel segmenter 210 segments the pixels of the depth image 115 into pixels corresponding to the left hand, the right hand, and the background. At step 1106, each segmented pixel votes for a location on the palm of the hand 120 to generate a point cloud. At step 1108, the re-initializer 215 finds the center of each point cloud to generate the coarse overall predicted pose 830 of the hand 120.
In some embodiments, certain aspects of the techniques described above may be implemented by one or more processors of a processing system executing software. The software includes one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer-readable storage medium. The software can include instructions and certain data that, when executed by one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. The non-volatile computer-readable storage medium can include, for example, magnetic or optical disk storage, solid state storage such as flash memory, cache, Random Access Memory (RAM) or other non-volatile storage, and so forth. Executable instructions stored on a non-transitory computer-readable storage medium may be in source code, assembly language code, object code, or other instruction format that is interpreted or otherwise executable by one or more processors.
Computer-readable storage media can include any storage media, or combination of storage media, that can be accessed by a computer system during use to provide instructions and/or data to the computer system. Such storage media can include, but is not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-ray disc), magnetic media (e.g., floppy disk, magnetic tape, or hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer-readable storage medium can be embedded in a computing system (e.g., system RAM or ROM), fixedly attached to a computing system (e.g., a hard disk drive), removably attached to a computing system (e.g., an optical disk or Universal Serial Bus (USB)-based flash memory), or coupled to a computer system via a wired or wireless network (e.g., network accessible storage (NAS)).
Note that not all of the activities or elements described above in the general description are required, that a portion of a particular activity or device may not be required, and that one or more other activities or included elements may be performed in addition to those described above. Still further, the order in which activities are listed is not necessarily the order in which the activities are performed. Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. The benefits, advantages, solutions to problems, and any feature or features that may cause any benefit, advantage, or solution to occur or become more pronounced, however, are not to be construed as critical, required, or essential features of any or all the claims. Moreover, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the disclosed subject matter. Accordingly, the protection sought herein is as set forth in the claims below.

Claims (20)

1. A method for estimating a current pose of a hand, comprising:
capturing a depth image of at least one hand of a user at a depth camera, the depth image comprising a plurality of pixels; and
identifying a current pose of the at least one hand by fitting an implicit surface model of the hand to a subset of the plurality of pixels, the fitting comprising:
interpolating a dense grid of pre-computed signed distances to define a first signed distance function;
deforming the signed distance function over a volume based on a skinned tetrahedral mesh associated with the candidate pose to define an explicitly expressed signed distance field;
querying the explicitly-expressed signed distance field by warping points back to a basic pose in which the dense grid of pre-computed signed distances can be interpolated; and
estimating a current pose of the hand based on the explicitly expressed signed distance field.
2. The method of claim 1, wherein the subset of pixels is identified by assigning each pixel of the depth image as a probability corresponding to a right hand, a left hand, or a background to generate a probability map.
3. The method of claim 1, further comprising: initializing the candidate pose with a first pose and a second pose.
4. The method of claim 3, wherein the first pose is based on a pose from a previous frame.
5. The method of claim 4, wherein the second pose is based on a coarse overall predicted pose.
6. The method of claim 5, wherein the coarse overall predicted pose is generated based on:
generating a three-dimensional (3D) point cloud of the hand based on the subset of the plurality of pixels; and
predicting an overall orientation of the hand based on a comparison of the 3D point cloud to a plurality of known poses.
7. The method of claim 6, wherein generating the 3D point cloud comprises: voting for a location on the palm of the hand by each pixel of the subset of the plurality of pixels.
8. A method for estimating a current pose of a hand, comprising:
capturing a plurality of consecutive frames of depth images of the hand at a depth camera, each depth image comprising a plurality of pixels;
generating a three-dimensional (3D) point cloud based on a subset of the plurality of pixels;
minimizing an energy function based on a distance between each point of the 3D point cloud and an implicitly defined surface of the hand in a candidate pose, wherein the candidate pose is generated based on:
fitting a parameterized implicit surface model of a hand to a set of points of the 3D point cloud to pre-compute a signed distance function by minimizing a distance from each point in the set of points to the surface;
defining an explicitly expressed signed distance function as warping over a volume using a skinned tetrahedral mesh; and
evaluating distances and derivatives of points to the implicitly defined surface of the hand by warping points back to a base pose and interpolating a grid of pre-computed signed distance values; and
calculating a direction of descent of the energy function that measures a fit of the candidate pose based on the estimated distance and derivative of the points; and
estimating a current pose of the hand based on the candidate poses that produce the minimized energy function.
9. The method of claim 8, wherein the candidate pose is initialized by a first initialization and a second initialization.
10. The method of claim 9, wherein the first initialization is based on a pose of the hand estimated for a frame immediately preceding a current frame.
11. The method of claim 10, wherein the second initialization is based on a coarse overall predicted pose.
12. The method of claim 11, wherein the coarse overall predicted pose is generated based on a prediction of an overall orientation of the hand based on a comparison of the 3D point cloud to a plurality of known poses.
13. The method of claim 8, wherein generating the 3D point cloud comprises: voting for a location on the palm of the hand by each pixel in the subset of the plurality of pixels.
14. An electronic device, comprising:
a user-facing depth camera to capture a plurality of consecutive frames of depth images of at least one hand of a user, each depth image comprising a plurality of pixels; and
a processor configured to:
identifying a current pose of the at least one hand by fitting an implicitly defined surface model of the hand in candidate poses to a subset of the plurality of pixels, the fitting comprising:
interpolating a dense 3D grid of pre-computed signed distance values to define a first signed distance function;
defining an explicitly expressed signed distance function as warping over a volume using a skinned tetrahedral mesh;
evaluating the distance and derivative of a point to the implicitly defined surface of the hand by warping the point back to a base pose and interpolating a grid of pre-computed signed distance values;
calculating a direction of descent of an energy function measuring a fit of the candidate pose based on the distance and derivative of the points, enabling a local search to be performed; and
estimating the current pose based on the explicitly expressed signed distance function.
15. The electronic device of claim 14, wherein the processor is further configured to: identifying a subset of the pixels by encoding, for each pixel of the depth image, a probability that the pixel belongs to one of a right hand, a left hand, or a background to generate a probability map.
16. The electronic device of claim 14, wherein the processor is further configured to: initialize the candidate pose with a first pose and a second pose.
17. The electronic device of claim 16, wherein the first pose is based on a pose of a frame immediately preceding a current frame.
18. The electronic device of claim 17, wherein the second pose is based on a coarse overall predicted pose.
19. The electronic device of claim 18, wherein the processor is further configured to:
generating a three-dimensional (3D) point cloud of the hand based on the subset of the plurality of pixels; and
predicting an overall orientation of the hand based on a comparison of the 3D point cloud to a plurality of known poses to generate the coarse overall predicted pose.
20. The electronic device of claim 19, wherein the processor is further configured to: generating the 3D point cloud by voting for a location on a palm of the hand by each pixel of the subset of the plurality of pixels.
CN201880036103.XA 2018-05-31 2018-07-27 Method and electronic device for estimating current posture of hand Active CN110800024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110829328.4A CN113762068A (en) 2018-05-31 2018-07-27 Method and electronic device for estimating current posture of hand

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US15/994,563 US10614591B2 (en) 2017-05-31 2018-05-31 Hand tracking based on articulated distance field
US15/994,563 2018-05-31
PCT/US2018/044045 WO2018223155A1 (en) 2017-05-31 2018-07-27 Hand tracking based on articulated distance field

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110829328.4A Division CN113762068A (en) 2018-05-31 2018-07-27 Method and electronic device for estimating current posture of hand

Publications (2)

Publication Number Publication Date
CN110800024A CN110800024A (en) 2020-02-14
CN110800024B true CN110800024B (en) 2021-08-10

Family

ID=69425354

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201880036103.XA Active CN110800024B (en) 2018-05-31 2018-07-27 Method and electronic device for estimating current posture of hand
CN202110829328.4A Pending CN113762068A (en) 2018-05-31 2018-07-27 Method and electronic device for estimating current posture of hand

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110829328.4A Pending CN113762068A (en) 2018-05-31 2018-07-27 Method and electronic device for estimating current posture of hand

Country Status (1)

Country Link
CN (2) CN110800024B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160086349A1 (en) * 2014-09-23 2016-03-24 Microsoft Corporation Tracking hand pose using forearm-hand model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102262438A (en) * 2010-05-18 2011-11-30 微软公司 Gestures and gesture recognition for manipulating a user-interface
CN103858148A (en) * 2011-05-27 2014-06-11 高通股份有限公司 Planar mapping and tracking for mobile devices
CN105654492A (en) * 2015-12-30 2016-06-08 哈尔滨工业大学 Robust real-time three-dimensional (3D) reconstruction method based on consumer camera
CN107992858A (en) * 2017-12-25 2018-05-04 深圳市唯特视科技有限公司 A kind of real-time three-dimensional gesture method of estimation based on single RGB frame

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Taylor, Jonathan, et al. "User-specific hand modeling from monocular depth sequences." 2014 IEEE Conference on Computer Vision and Pattern Recognition, 23 June 2014, pp. 644-651. *

Also Published As

Publication number Publication date
CN110800024A (en) 2020-02-14
CN113762068A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
US11030773B2 (en) Hand tracking based on articulated distance field
CN111328396B (en) Pose estimation and model retrieval for objects in images
US10460463B2 (en) Modelling a three-dimensional space
US10394318B2 (en) Scene analysis for improved eye tracking
Sridhar et al. Fast and robust hand tracking using detection-guided optimization
US8849017B2 (en) Image processing apparatus, image processing method, program, and recording medium for learning from moving images
Čehovin et al. Robust visual tracking using an adaptive coupled-layer visual model
Holte et al. View-invariant gesture recognition using 3D optical flow and harmonic motion context
KR20190038808A (en) Object detection of video data
JP2023526566A (en) fast and deep facial deformation
WO2014126711A1 (en) Model-based multi-hypothesis target tracker
JP2023549821A (en) Deformable neural radiance field
EP4055561A1 (en) Object detection device, method, and systerm
CN114422832A (en) Anchor virtual image generation method and device
CN110800024B (en) Method and electronic device for estimating current posture of hand
KR101909326B1 (en) User interface control method and system using triangular mesh model according to the change in facial motion
EP3593323B1 (en) High speed, high-fidelity face tracking
CN114489341A (en) Gesture determination method and apparatus, electronic device and storage medium
Cristina et al. Model-free non-rigid head pose tracking by joint shape and pose estimation
Chun et al. 3D star skeleton for fast human posture representation
Ravikumar Lightweight Markerless Monocular Face Capture with 3D Spatial Priors
Wang Generating 3D Faces by Tracking and Pose Estimation in Video Streams
Zhu et al. Hand detection and tracking in an active vision system
Liu et al. Realtime dynamic 3D facial reconstruction for monocular video in-the-wild
CN116391208A (en) Non-rigid 3D object modeling using scene flow estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant