EP3631759A1 - Hand tracking based on articulated distance field - Google Patents

Hand tracking based on articulated distance field

Info

Publication number
EP3631759A1
Authority
EP
European Patent Office
Prior art keywords
pose
hand
pixels
signed distance
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP18755602.2A
Other languages
English (en)
French (fr)
Inventor
Jonathan James Taylor
Vladimir Tankovich
Danhang Tang
Cem Keskin
Adarsh Prakash Murthy Kowdle
Philip L. Davidson
Shahram Izadi
David Kim
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Publication of EP3631759A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/149 Segmentation; Edge detection involving deformable models, e.g. active contour models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/162 Segmentation; Edge detection involving graph-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74 Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G06V40/11 Hand-related biometrics; Hand pose recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G06V40/113 Recognition of static hand signs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20076 Probabilistic image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Definitions

  • Hand tracking allows articulated hand gestures to be used as an input mechanism for virtual reality and augmented reality systems, thereby supporting a more immersive user experience.
  • a generative hand tracking system captures images and depth data of the user's hand and fits a generative model to the captured image or depth data. To fit the model to the captured data, the hand tracking system defines and optimizes an energy function to find a minimum that corresponds to the correct hand pose.
  • conventional hand tracking systems typically have accuracy and latency issues that can result in an unsatisfying user experience.
  • FIG. 1 is a diagram illustrating a hand tracking system estimating a current pose of a hand based on a depth image in accordance with at least one embodiment of the present disclosure.
  • FIG. 2 is a diagram illustrating a hand tracking module of the hand tracking system of FIG. 1 configured to estimate a current pose of a hand based on a depth image in accordance with at least one embodiment of the present disclosure.
  • FIG. 3 is a diagram illustrating interpolation of a grid of precomputed signed distances to generate a smooth signed distance field for estimating a distance from a point to a model in accordance with at least one embodiment of the present disclosure.
  • FIG. 4 is a diagram illustrating a base pose of a skinned tetrahedral volumetric mesh in accordance with at least one embodiment of the present disclosure.
  • FIG. 5 is a diagram illustrating a deformed pose of the tetrahedral volumetric mesh in accordance with at least one embodiment of the present disclosure.
  • FIG. 6 is a diagram illustrating a two-dimensional cross-section of the end of a finger in a base pose contained inside a triangular mesh in accordance with at least one embodiment of the present disclosure.
  • FIG. 7 is a diagram illustrating a two-dimensional cross-section of the end of a finger in a query pose contained inside a deformed triangular mesh in accordance with at least one embodiment of the present disclosure.
  • FIG. 8 is a diagram of an energy function based on a distance between each point of a three-dimensional (3D) point cloud based on a depth image and a candidate pose in accordance with at least one embodiment of the present disclosure.
  • FIG. 9 is a flow diagram illustrating a method of estimating a current pose of a hand based on a captured depth image in accordance with at least one embodiment of the present disclosure.
  • FIG. 10 is a flow diagram illustrating a method of minimizing an energy function by initializing using the pose from the previous frame and one or more poses derived from a coarse global predicted pose in accordance with at least one embodiment of the present disclosure.
  • FIG. 11 is a flow diagram illustrating a method of predicting a coarse global predicted pose of a hand in accordance with at least one embodiment of the present disclosure.
  • FIGs. 1-11 illustrate techniques for estimating a pose of at least one hand by volumetrically deforming a signed distance field using a skinned tetrahedral mesh to locate a local minimum of an energy function, wherein the local minimum corresponds to the hand pose.
  • a hand tracking module receives depth images of a hand from a depth camera and identifies a pose of the hand by fitting an implicit surface model of a hand, defined as the zero crossings of an articulated signed distance function, to the pixels of a depth image that correspond to the hand. The hand tracking module fits the model to the pixels by first
  • the volumetric warp is performed using a skinned tetrahedral mesh.
  • the hand tracking module uses the skinned tetrahedral mesh to warp space from a base pose to a deformed pose to define an articulated signed distance field from which the hand tracking module derives candidate poses of the hand.
  • Explicitly generating the articulated signed distance function is, however, avoided, by instead warping the pixels from the deformed pose to the base pose where the distance to the surface can be estimated by interpolating the precomputed 3D grid of signed distance values.
  • the hand tracking module then minimizes the energy function based on the distance of each corresponding pixel so as to identify the candidate pose that most closely approximates the pose of the hand.
  • the hand tracking module initializes the candidate poses using the pose from the previous frame, that is, the depth image immediately preceding the current depth image.
  • the hand tracking system leverages a depth camera with an extremely high frame rate to minimize the difference between the true pose from the previous frame and the true pose in the current frame.
  • the hand tracking module further initializes the candidate poses by a predicted pose. To predict a pose, the hand tracking module segments the pixels of the depth images based on a probability for each pixel representing a left hand, a right hand, or a background.
  • the hand tracking module generates a three-dimensional (3D) point cloud of at least one of the left hand and the right hand based on the corresponding pixels and predicts a global orientation of the hand based on a comparison of the 3D point cloud to a plurality of known poses to generate the predicted current pose.
  • FIG. 1 illustrates a hand tracking system 100 configured to support hand tracking functionality for AR/VR applications, using depth sensor data in accordance with at least one embodiment of the present disclosure.
  • the hand tracking system 100 can include a user-portable mobile device, such as a tablet computer, computing-enabled cellular phone (e.g., a "smartphone"), a head-mounted display (HMD), a notebook computer, a personal digital assistant (PDA), a gaming system remote, a television remote, camera attachments with or without a screen, and the like.
  • the hand tracking system 100 can include another type of mobile device, such as an automobile, robot, remote-controlled drone or other airborne device, and the like.
  • the hand tracking system 100 is generally described herein in the example context of a mobile device, such as a tablet computer or a smartphone; however, the hand tracking system 100 is not limited to these example implementations.
  • the hand tracking system 100 includes a hand tracking module 110 estimating a current pose 140 of a hand 120 based on a depth image 115 captured by a depth camera 105 in accordance with at least one embodiment of the present disclosure.
  • the hand 120 is a right hand making a pointing gesture, with the thumb and index finger extended and the remaining fingers curled down to the palm.
  • the depth camera 105 uses a modulated light projector (not shown) to project modulated light patterns into the local environment, and uses one or more imaging sensors 106 to capture reflections of the modulated light patterns as they reflect back from objects in the local environment 112.
  • modulated light patterns can be either spatially-modulated light patterns or temporally-modulated light patterns.
  • the captured reflections of the modulated light patterns are referred to herein as "depth images" 115.
  • the depth camera 105 calculates the depths of the objects, that is, the distances of the objects from the depth camera 105, based on the analysis of the depth images 115.
  • the hand tracking module 110 receives a depth image 115 from the depth camera 105 and identifies a pose of the hand 120 by fitting a hand model to the pixels of the depth image 115 that correspond to the hand 120.
  • the model is parameterized by 28 values (e.g., four joint articulations of each of the five fingers, two degrees of freedom at the wrist, and six degrees of freedom for global orientation).
  • the hand tracking module 110 parameterizes the global rotation of the model using a quaternion so that the pose vector θ is 29-dimensional.
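  • As a minimal sketch of the parameter count described above (20 finger articulations, 2 wrist degrees of freedom, 3 translation values, and a 4-value quaternion in place of 3 rotation angles), the pose vector could be packed as follows. The class name `HandPose`, the field names, and the ordering of the entries are assumptions made for illustration and are not taken from the disclosure.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class HandPose:
    """Hypothetical container; field names and ordering are assumptions."""
    translation: List[float]    # 3 values: global position
    rotation_quat: List[float]  # 4 values: global rotation as a quaternion (w, x, y, z)
    wrist: List[float]          # 2 degrees of freedom at the wrist
    fingers: List[List[float]]  # 5 fingers x 4 joint articulations each

    def to_vector(self) -> List[float]:
        """Flatten to a 29-dimensional pose vector: 3 + 4 + 2 + 20."""
        norm = math.sqrt(sum(q * q for q in self.rotation_quat))
        quat = [q / norm for q in self.rotation_quat]  # normalize to a unit quaternion
        flat_fingers = [angle for finger in self.fingers for angle in finger]
        return list(self.translation) + quat + list(self.wrist) + flat_fingers


rest = HandPose(
    translation=[0.0, 0.0, 0.5],
    rotation_quat=[2.0, 0.0, 0.0, 0.0],  # deliberately unnormalized input
    wrist=[0.0, 0.0],
    fingers=[[0.0] * 4 for _ in range(5)],
)
theta = rest.to_vector()
print(len(theta))
```

  • Replacing three rotation angles with a four-value quaternion is what raises the count from 28 to 29; the normalization step keeps the rotation part on the unit sphere.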
  • the hand tracking module 110 segments a set of 3D data points corresponding to the hand 120 out of the depth image 115 and back projects them into 3D space.
  • the hand tracking module 110 then fits a parameterized implicit surface model S(θ) ⊆ ℝ³, formulated as the zero crossings of an articulated signed distance function, to the set of 3D data points.
  • the hand tracking module 110 defines the distance D(x, θ) to an implicit surface of the hand model in a way that is relatively easy and fast to compute.
  • the hand tracking module 110 builds a tetrahedral mesh (not shown) and skins the vertices to a skeleton (not shown).
  • the hand tracking module 110 defines a function that warps the space from a base pose to a deformed pose, as is described in more detail below. Based on the deformed pose, the hand tracking module 110 defines an articulated signed distance field.
  • a point in the space of the current pose can be warped back to the base pose where the distance to the surface can be estimated efficiently by interpolating a precomputed 3D grid of signed distances.
  • the hand tracking module 110 leverages this as part of its process to rapidly estimate a current pose 140 of the hand 120.
  • the hand tracking module 110 uses the current pose estimate 140 to update graphical data 135 on a display 130.
  • the display 130 is a physical surface, such as a tablet, mobile phone, smart device, display monitor, array(s) of display monitors, laptop, signage and the like or a projection onto a physical surface.
  • the display 130 is planar. In some embodiments, the display 130 is curved. In some embodiments, the display 130 is a virtual surface, such as a three-dimensional or holographic projection of objects in space including virtual reality and augmented reality. In some embodiments in which the display 130 is a virtual surface, the virtual surface is displayed within an HMD of a user. The location of the virtual surface may be relative to stationary objects (such as walls or furniture) within the local environment 112 of the user.
  • FIG. 2 is a diagram illustrating the hand tracking module 110 of the hand tracking system 100 of FIG. 1 in accordance with at least one embodiment of the present disclosure.
  • the hand tracking module 110 includes a memory 205, a pixel segmenter 210, a reinitializer 215, an interpolator 220, and a volumetric deformer 225. Each of these modules represents hardware, software, or a combination thereof, configured to execute the operations as described herein.
  • the hand tracking module 110 is configured to receive a depth image 115 from the depth camera (not shown) and to generate a current pose estimate 140 based on the depth image 115.
  • the memory 205 is a memory device generally configured to store data, and therefore may be a random access memory (RAM) memory module, non-volatile memory device (e.g., flash memory), and the like.
  • the memory 205 may form part of a memory hierarchy of the hand tracking system 100 and may include other memory modules, such as additional caches not illustrated at FIG. 1.
  • the memory 205 is configured to receive and store the depth image 115 from the depth camera (not shown).
  • the pixel segmenter 210 is a module configured to segment the pixels of the depth image 115 into pixels corresponding to a left hand, a right hand, and a background. In some embodiments, the pixel segmenter 210 assigns each pixel of the depth image 115 a probability of corresponding to a left hand p_left, a right hand p_right, and a background p_bg, each in [0,1], to produce a probability map P. In some embodiments, the pixel segmenter 210 thresholds P with a high threshold value in [0,1], convolves the output with a large bandwidth Gaussian filter, and then finds the location of the maximum value, which the pixel segmenter 210 assigns as a hand position.
  • the pixel segmenter 210 then thresholds P with a smaller threshold value and intersects P with a sphere of radius r_sphere ∈ ℝ to segment the hand pixels.
  • the pixel segmenter 210 also trains a Randomized Decision Forest (RDF) classifier to produce P.
  • the RDF classifier (not shown) employs depth and translation invariant features which threshold the depth difference of two pixels at depth-normalized offsets around the central pixel. For each pixel p at coordinate (u, v) on a depth image I, each split node in the tree evaluates a function in which the offsets are normalized by the depth I(u, v), Δu_i and Δv_i are the two offsets, and the result is compared against the threshold for that split node.
  • the pixel segmenter 210 introduces a new rotationally invariant family of features, which threshold the average depth of two concentric rings:
  • R(u, v, r, I) is the sum over K depth pixels found on a ring of depth-scaled radius r around the central pixel.
  • the pixel segmenter 210 approximates the ring with a fixed number of points k:
  • the pixel segmenter 210 additionally defines a unary version of this feature as follows: ( ⁇ £ ⁇ _ ⁇ > ⁇ (5)
  • the pixel segmenter 210 samples from a pool of binary and unary rotationally dependent and invariant features based on a learned prior pose. In some embodiments, for each considered feature, the pixel segmenter 210 uniformly samples multiple threshold values from a fixed range and selects the value that maximizes the information gain. The pixel segmenter 210 outputs a segmented depth image R per hand.
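  • The split features described above can be illustrated with a short sketch. Because the feature equations are not legible in this text, the exact forms below (a difference of two depth probes at offsets divided by the central depth, and a difference of two ring averages sampled at k points) are assumptions based on the prose; `depth_at`, `depth_difference_test`, `ring_mean`, and `ring_test` are hypothetical names.

```python
import math
from typing import List, Tuple

Depth = List[List[float]]


def depth_at(image: Depth, u: float, v: float, background: float = 10.0) -> float:
    """Nearest-pixel depth lookup; reads outside the image return a far background depth."""
    col, row = int(round(u)), int(round(v))
    if 0 <= row < len(image) and 0 <= col < len(image[0]):
        return image[row][col]
    return background


def depth_difference_test(image: Depth, u: float, v: float,
                          offset1: Tuple[float, float], offset2: Tuple[float, float],
                          threshold: float) -> bool:
    """Assumed binary feature: difference of two probes at depth-normalized offsets."""
    d = depth_at(image, u, v)
    a = depth_at(image, u + offset1[0] / d, v + offset1[1] / d)
    b = depth_at(image, u + offset2[0] / d, v + offset2[1] / d)
    return (a - b) > threshold


def ring_mean(image: Depth, u: float, v: float, radius: float, k: int = 8) -> float:
    """Average depth over k points on a ring whose radius is scaled by the central depth."""
    scaled = radius / depth_at(image, u, v)
    total = 0.0
    for i in range(k):
        angle = 2.0 * math.pi * i / k
        total += depth_at(image, u + scaled * math.cos(angle), v + scaled * math.sin(angle))
    return total / k


def ring_test(image: Depth, u: float, v: float, radius1: float, radius2: float,
              threshold: float, k: int = 8) -> bool:
    """Assumed rotationally invariant feature: difference of two concentric ring averages."""
    return (ring_mean(image, u, v, radius1, k) - ring_mean(image, u, v, radius2, k)) > threshold


# Toy 9x9 depth image: a near disc (depth 1.0) on a far background (depth 3.0).
image = [[1.0 if (c - 4) ** 2 + (r - 4) ** 2 <= 4 else 3.0 for c in range(9)] for r in range(9)]
inner = ring_mean(image, 4, 4, radius=1.0)
outer = ring_mean(image, 4, 4, radius=4.0)
on_hand = ring_test(image, 4, 4, radius1=4.0, radius2=1.0, threshold=1.0)
edge = depth_difference_test(image, 4, 4, (3.0, 0.0), (0.0, 0.0), threshold=1.0)
```

  • Sampling the ring at evenly spaced angles is what makes the second feature insensitive to in-plane rotation of the hand, whereas the two-probe feature depends on the direction of its offsets.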
  • the pixel segmenter 210 uses a convolutional neural network (CNN) or a randomized decision forest (RDF) or both to produce a probability map that encodes for each pixel, the probability of the pixel belonging to the left hand, the right hand, and the background, respectively.
  • the pixel segmenter 210 temporarily sets to zero all values of the probability map p_right that are below a high threshold value in [0,1].
  • the pixel segmenter 210 convolves the output with a large bandwidth Gaussian filter, and then uses the location of the maximum value as the hand detection.
  • the pixel segmenter 210 then removes outliers from the original segmentation p_right by setting to zero the value of any pixels whose probability is less than a lower threshold value (between 0 and the high threshold value) or whose 3D location is not contained in a sphere of radius r_sphere ∈ ℝ around the hand detection.
  • the pixel segmenter 210 thus ensures that pixels far from the most prominent hand (e.g., pixels on other people's hands in the background) do not contaminate the segmentation while allowing the machine learning method to discard nearby pixels that are recognized as not belonging to the hand (e.g., pixels on the user's chest).
  • the pixel segmenter 210 back projects the pixels that pass the test into 3D space using the depth camera 105 parameters to form a point cloud.
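  • The detection and outlier-removal steps above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: a pinhole camera model with hypothetical intrinsics (fx, fy, cx, cy), a small separable Gaussian with zero padding, and hypothetical function names (`gaussian_blur`, `back_project`, `segment_hand`); none of these details are specified in the text.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def blur_rows(img: Image, kernel: List[float]) -> Image:
    """Convolve every row with a 1D kernel, treating pixels outside the image as zero."""
    radius, width = len(kernel) // 2, len(img[0])
    return [[sum(k * row[c + i - radius] for i, k in enumerate(kernel)
                 if 0 <= c + i - radius < width) for c in range(width)] for row in img]


def gaussian_blur(img: Image, sigma: float) -> Image:
    """Separable Gaussian smoothing: blur rows, transpose, blur rows again, transpose back."""
    radius = int(3 * sigma)
    raw = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(raw)
    kernel = [k / total for k in raw]

    def transpose(m: Image) -> Image:
        return [list(col) for col in zip(*m)]

    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Assumed pinhole back-projection of pixel (u, v) with the given depth."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def segment_hand(prob: Image, depth: Image, high: float, low: float,
                 sphere_radius: float, intrinsics: Tuple[float, float, float, float],
                 sigma: float = 1.0):
    # Step 1: zero out everything below the high threshold, smooth, and take the peak.
    gated = [[p if p >= high else 0.0 for p in row] for row in prob]
    blurred = gaussian_blur(gated, sigma)
    pixels = [(r, c) for r in range(len(prob)) for c in range(len(prob[0]))]
    peak_row, peak_col = max(pixels, key=lambda rc: blurred[rc[0]][rc[1]])
    center = back_project(peak_col, peak_row, depth[peak_row][peak_col], *intrinsics)
    # Step 2: keep pixels above the low threshold that lie inside the sphere.
    points = []
    for r, c in pixels:
        point = back_project(c, r, depth[r][c], *intrinsics)
        if prob[r][c] >= low and math.dist(point, center) <= sphere_radius:
            points.append(point)
    return (peak_row, peak_col), points


# Toy data: a 3x3 hand blob near the camera and one confident pixel on a distant hand.
rows, cols = 6, 8
prob = [[0.05] * cols for _ in range(rows)]
depth = [[3.0] * cols for _ in range(rows)]
for r in range(1, 4):
    for c in range(1, 4):
        prob[r][c], depth[r][c] = 0.95, 0.5
prob[4][7], depth[4][7] = 0.9, 2.0

peak, cloud = segment_hand(prob, depth, high=0.9, low=0.5, sphere_radius=0.15,
                           intrinsics=(100.0, 100.0, 4.0, 3.0))
```

  • In this toy case the distant pixel passes both probability thresholds but is rejected by the sphere test, which mirrors the stated purpose of keeping other people's hands out of the segmentation.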
  • the reinitializer 215 receives the segmented depth image R from the pixel segmenter 210.
  • the reinitializer 215 resets the hand tracking module 110 by generating a coarse global predicted pose when the hand tracking module 110 loses track of the hand 120 of FIG. 1.
  • the hand tracking module 110 uses the coarse global predicted pose as a candidate pose of the hand.
  • the reinitializer 215 uses an RDF to estimate the six degrees of freedom (6DOF) hand pose by locating three joints on the palm, which is assumed to be planar. The three joints are the wrist joint q_w, the base of the metacarpophalangeal (MCP) joint q, and the base of the pinky MCP q_p.
  • the reinitializer 215 locates the three joints by evaluating each pixel p in R to produce a single vote for the three-dimensional (3D) offset of each joint relative to p.
  • the trees of the RDF are trained with a regression objective to minimize the vote variance in the leaves.
  • the reinitializer 215 selects the modes of the distributions as final estimates for the three joints.
  • the reinitializer 215 converts the three joints into a reinitialization pose by setting the global translation to q_w and deriving the global orientation by finding the orientation of the three-dimensional triangle defined by the three joints.
  • the reinitializer 215 then samples a set of finger poses randomly from the prior pose to generate the coarse global predicted pose.
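  • One way to turn three palm joints into a global translation and orientation is to build an orthonormal frame from the triangle they span. The sketch below is an assumption about how that could be done: the choice of which edge becomes the first axis, the handedness of the frame, and the name `pose_from_palm_joints` are illustrative and not taken from the text.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def normalize(a: Vec) -> Vec:
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def pose_from_palm_joints(q_wrist: Vec, q_mcp: Vec, q_pinky: Vec):
    """Return (translation, axes): the wrist position and an orthonormal palm frame."""
    x_axis = normalize(sub(q_mcp, q_wrist))                    # along wrist -> MCP joint
    normal = normalize(cross(x_axis, sub(q_pinky, q_wrist)))   # perpendicular to the palm plane
    y_axis = cross(normal, x_axis)                             # completes the frame
    return q_wrist, (x_axis, y_axis, normal)


translation, axes = pose_from_palm_joints(
    (0.0, 0.0, 0.5), (0.0, 0.1, 0.5), (0.05, 0.08, 0.5))
```

  • The three axes can be stacked into a rotation matrix (or converted to a quaternion) to fill the global orientation entries of a candidate pose, with the finger entries sampled separately.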
  • Tricubic interpolation gives access to smooth first and second order derivatives with respect to x.
  • the signed distance field smoothly captures details of the model using tricubic interpolation.
  • the volumetric deformer 225 uses a linear skinned tetrahedral mesh to define a signed distance field into an arbitrary pose θ as a volumetric warp of the signed distance field of the interpolator 220. Instead of explicitly generating the deformed signed distance function, the volumetric deformer 225 can efficiently warp a point in the current pose back into the base pose so the distance to the implicit surface, and its derivatives, can be rapidly estimated by the interpolator.
  • the volumetric deformer 225 defines the deformation of the vertices of the tetrahedral mesh via linear blend skinning.
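  • Linear blend skinning moves each vertex by a weighted average of per-bone rigid transforms. The sketch below shows the idea in 2D with two bones; the helper names (`rotation_about`, `apply`, `skin_vertex`), the weights, and the joint location are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Affine = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def rotation_about(pivot: Point, angle: float) -> Affine:
    """2x3 affine matrix that rotates by `angle` (radians) around `pivot`."""
    c, s = math.cos(angle), math.sin(angle)
    px, py = pivot
    return ((c, -s, px - c * px + s * py), (s, c, py - s * px - c * py))


def apply(transform: Affine, v: Point) -> Point:
    return (transform[0][0] * v[0] + transform[0][1] * v[1] + transform[0][2],
            transform[1][0] * v[0] + transform[1][1] * v[1] + transform[1][2])


def skin_vertex(v: Point, weights: Sequence[float], bones: Sequence[Affine]) -> Point:
    """Linear blend skinning: weighted sum of the vertex transformed by each bone."""
    x = y = 0.0
    for w, bone in zip(weights, bones):
        bx, by = apply(bone, v)
        x += w * bx
        y += w * by
    return (x, y)


identity: Affine = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
bend = rotation_about((1.0, 0.0), math.pi / 2)  # second bone bends 90 degrees at the joint
bones = [identity, bend]

tip = skin_vertex((2.0, 0.0), (0.0, 1.0), bones)      # fully attached to the bent bone
blended = skin_vertex((2.0, 0.0), (0.5, 0.5), bones)  # shared equally between both bones
pivot = skin_vertex((1.0, 0.0), (0.5, 0.5), bones)    # the joint itself does not move
```

  • Applying the same blend to the vertices of a tetrahedral mesh deforms each tetrahedron, and each deformed tetrahedron then carries its own affine map between the base pose and the current pose.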
  • the function is largely invertible, such that the set of points in the base pose that deform to a point in the current pose typically contains one point, unless the deformation causes tetrahedra to self-intersect.
  • the ambiguity is resolved by simply picking the point in the base pose with a smaller absolute distance to the implicit surface as defined by the interpolator 220. This thus defines a function W⁻¹(x, θ) that warps the space from the deformed pose to the base pose.
  • D(x, θ) = D̃(W⁻¹(x, θ)), where D̃ denotes the interpolated signed distance in the base pose.
  • the tetrahedral mesh warp introduces artifacts only at articulation points, which can be addressed by densifying the tetrahedral mesh only at the articulation points.
  • the hand tracking module 110 initializes the candidate poses θ first using the pose θ_prev output from the system in the previous frame. In some embodiments, the hand tracking module 110 initializes further candidate poses θ by using a coarse global predicted pose θ_pred generated by the reinitializer 215.
  • the depth camera (not shown) employs a high frame rate, such that the difference between the pose θ_prev in the previous frame and the true pose in the current frame is minimized.
  • the hand tracking module 110 generates a current pose estimate 140.
  • E ⁇ Q E ( ⁇ ' « '; ⁇ ( ⁇ )) + E Y) + + (1 - ⁇ 7») ⁇
  • γ_n^right and γ_n^left are penalties output from the segmentation forest for assigning data point n to the right and the left hand pose, respectively.
  • the hand tracking module 110 alternates between θ and Y, updating θ with Levenberg updates and updating Y by discretely considering whether assigning the data point to the left or right hand will lower the energy.
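  • The discrete half of the alternation described above can be sketched as a per-point comparison. Since the energy equation is not legible in this text, the per-point cost used here (squared distance to the hand surface plus the segmentation penalty for that hand) is an assumption, as are the function name `assign_points` and the toy numbers.

```python
from typing import List, Sequence


def assign_points(dist_left: Sequence[float], dist_right: Sequence[float],
                  penalty_left: Sequence[float], penalty_right: Sequence[float]) -> List[str]:
    """For each data point, pick the hand whose assumed per-point cost is lower."""
    labels = []
    for d_l, d_r, p_l, p_r in zip(dist_left, dist_right, penalty_left, penalty_right):
        cost_left = d_l * d_l + p_l    # assumed cost of explaining the point with the left hand
        cost_right = d_r * d_r + p_r   # assumed cost of explaining the point with the right hand
        labels.append("left" if cost_left < cost_right else "right")
    return labels


# Distances of three data points to each hand surface under the current pose estimates,
# and hypothetical penalties from the segmentation stage.
labels = assign_points(
    dist_left=[0.01, 0.30, 0.05],
    dist_right=[0.40, 0.02, 0.05],
    penalty_left=[0.0, 0.5, 0.2],
    penalty_right=[0.5, 0.0, 0.0],
)
```

  • With the assignments fixed, the continuous pose update can then be run on each hand using only the points assigned to it, and the two steps repeat.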
  • FIG. 3 illustrates interpolation of a pixel 320 of a depth image based on a precomputed distance function to generate a smooth signed distance field (SDF) 330 for estimating a distance 325 from the pixel 320 to a model 305 in a base pose θ₀ in accordance with at least one embodiment of the present disclosure.
  • the interpolator 220 of FIG. 2 precomputes a dense grid 310 of signed distances 315 in the base pose θ₀.
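  • A precomputed grid of signed distances can be queried smoothly with tricubic interpolation. The sketch below uses a separable Catmull-Rom cubic, which is one common way to build a tricubic interpolant and is an assumption here (the text does not say which cubic is used); the sphere-shaped base model, the grid size, and the names `cubic` and `tricubic` are also illustrative.

```python
import math
from typing import List

Grid = List[List[List[float]]]


def cubic(p0: float, p1: float, p2: float, p3: float, t: float) -> float:
    """Catmull-Rom cubic through p1 (t = 0) and p2 (t = 1)."""
    return p1 + 0.5 * t * (p2 - p0 + t * (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3
                                          + t * (3.0 * (p1 - p2) + p3 - p0)))


def tricubic(grid: Grid, x: float, y: float, z: float) -> float:
    """Interpolate grid[i][j][k] (samples at integer coordinates) at the point (x, y, z)."""
    n = len(grid)
    i, j, k = math.floor(x), math.floor(y), math.floor(z)
    tx, ty, tz = x - i, y - j, z - k

    def sample(a: int, b: int, c: int) -> float:
        # Clamp indices so queries near the border reuse the outermost samples.
        a, b, c = (min(max(q, 0), n - 1) for q in (a, b, c))
        return grid[a][b][c]

    planes = []
    for di in range(-1, 3):
        rows = [cubic(*(sample(i + di, j + dj, k + dk) for dk in range(-1, 3)), tz)
                for dj in range(-1, 3)]
        planes.append(cubic(*rows, ty))
    return cubic(*planes, tx)


# Precompute signed distances to a sphere of radius 3 centered in a 9x9x9 grid.
size, center, radius = 9, 4.0, 3.0
grid = [[[math.sqrt((i - center) ** 2 + (j - center) ** 2 + (k - center) ** 2) - radius
          for k in range(size)] for j in range(size)] for i in range(size)]

on_axis = tricubic(grid, 5.5, 4.0, 4.0)    # true signed distance is -1.5 (inside)
off_axis = tricubic(grid, 6.5, 6.5, 4.0)   # true signed distance is sqrt(12.5) - 3 (outside)
```

  • Because each 1D cubic is a polynomial in the fractional coordinate, the interpolant has continuous first derivatives with respect to the query point, which is the property a gradient-based pose search relies on.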
  • FIG. 4 illustrates a base pose 400 of a tetrahedral volumetric mesh 410 of the volumetric deformer 225 of FIG. 2 with vertices skinned to the dense SDF 330 of FIG. 3 in accordance with at least one embodiment of the present disclosure.
  • the skinned tetrahedral mesh 410 transforms the detail of the dense SDF 330 into different poses.
  • the skinned tetrahedral mesh 410 introduces artifacts only at articulation points.
  • the skinned tetrahedral mesh 410 is densified at the articulation points, e.g., 415, 420, 425, while the dense SDF 330 represents the geometry of the pose in other areas.
  • the volumetric deformer (not shown) applies arbitrary mesh skinning techniques to deform a single SDF 330.
  • FIG. 5 illustrates a deformed pose 500 of the tetrahedral volumetric mesh 410 of FIG. 4 in accordance with at least one embodiment of the present disclosure.
  • the volumetric deformer 225 of FIG. 2 uses the tetrahedral volumetric mesh 410 to warp a point x to W(x, θ).
  • the hand tracking module 110 can estimate the distance D(x, θ) of a point x to the implicit surface in any pose θ by instead warping x back into the base pose where the distance to the surface can be rapidly evaluated by interpolating a precomputed 3D grid of signed distance values. Further, as the warp and the signed distance field are differentiable almost everywhere, the hand tracking module 110 can also rapidly query derivatives to enable rapid local search of energy functions defined in terms of distances to the surface.
  • FIG. 6 illustrates a two-dimensional (2D) cross-section of the end of a finger 605 in a base pose contained inside a triangular mesh 610 in accordance with at least one embodiment of the present disclosure.
  • the tetrahedral volumetric mesh 410 of FIGS. 4 and 5 is depicted as a 2D equivalent triangular mesh 610 for ease of reference.
  • the triangular mesh 610 includes triangles 614, 616, 618, 620, 622, 624, 626, and 628.
  • FIG. 7 illustrates a 2D cross-section of the end of the finger 605 of FIG. 6 in a query pose θ contained inside a deformed triangular mesh 710 in accordance with at least one embodiment of the present disclosure.
  • a triangular mesh in 2D is the analogue of a tetrahedral mesh in 3D and is thus used to more simply illustrate the technique.
  • the tetrahedral mesh (illustrated as triangular mesh 710) includes tetrahedra (illustrated as triangles 714, 716, 718, 720, 722, 724, 726, and 728), which correspond to tetrahedra (or triangles) 614, 616, 618, 620, 622, 624, 626, and 628, respectively, of FIG. 6.
  • each tetrahedron (or triangle) 714, 716, 718, 720, 722, 724, 726, and 728 defines an affine transform between the base pose of FIG. 6 and the query pose θ. This defines a volumetric warp W(x, θ) from the base pose to the query pose.
  • the volumetric deformer 225 of FIG. 2 implicitly defines a signed distance field D(x, θ) as described further herein.
  • for a query point x (e.g., point 730), a tetrahedron (or triangle) τ that contains the point can use its inverse affine transform to send the query point to B_τ(x, θ), where the distance to the implicitly encoded surface can be queried as D̃(B_τ(x, θ)).
  • a point y e.g.
  • the volumetric deformer 225 first measures the distance to the closest point contained in the tetrahedral mesh. To this distance, the volumetric deformer 225 then adds the distance obtained by evaluating the distance of this closest point to the surface using the
  • the tetrahedron (or triangle) containing the closest point
  • ⁇ ⁇ ( ⁇ ) ⁇ E 3x4 or E 2x3
  • ⁇ ⁇ ( ⁇ , ⁇ ) ⁇ E 4 or ⁇ ⁇ ( ⁇ , ⁇ )
  • ⁇ ⁇ ( ⁇ , ⁇ ) ⁇ ⁇ ( ⁇ 0 ) ⁇ ⁇ ( ⁇ , ⁇ ) to query its distance to the implicitly encoded surface.
  • q_τ(x, θ) = x
  • x lies outside the tetrahedral mesh (e.g., point 732)
  • the volumetric deformer accounts for the additional distance between q_τ(x, θ) and x.
  • the deformation of the tetrahedral mesh causes the query point x to fall in multiple overlapping tetrahedra, causing the volumetric warp to not be strictly invertible.
  • the volumetric deformer 225 therefore resolves this issue by defining the set of tetrahedra (or triangles) that contain x as
  • the volumetric deformer 225 then chooses the tetrahedron (or triangle) τ*(x, θ) that will be used to warp the point back into the base pose as
  • the first case selects the containing tetrahedron (or triangle) that warps the point back to the position of minimum absolute distance to the surface in the base pose.
  • the second case selects the tetrahedron (or triangle) that the point is closest to in the current pose.
  • the volumetric deformer 225 divides the space into a discrete set of cells as τ*(x, θ) jumps from one tetrahedron (or triangle) to another.
  • the volumetric deformer 225 uses an affine transform defined by the selected tetrahedron (or triangle) to map the space in the current pose back into the base pose for SDF evaluation.
  • the volumetric deformer 225 selects the closest tetrahedron (or triangle) and similarly uses the affine transform to warp the closest point on the closest tetrahedron's boundary into the base pose for SDF evaluation.
  • the volumetric deformer 225 adds to this value the distance from x to the closest point on the tetrahedron boundary to compensate for the query point being outside the tetrahedral mesh.
  • the volumetric deformer 225 adds more tetrahedra (or triangles) to smooth out bumps around joints.
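  • The warp-back query described above can be sketched in 2D, using triangles in place of tetrahedra as in FIGs. 6 and 7. This is an illustrative version under stated assumptions: the base-pose distance is an analytic circle rather than an interpolated grid, the deformation is a simple translation, and the names (`barycentric`, `closest_on_triangle`, `articulated_distance`) are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def barycentric(p: Point, tri: Triangle) -> Tuple[float, float, float]:
    (ax, ay), (bx, by), (cx, cy) = tri
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    l0 = ((by - cy) * (p[0] - cx) + (cx - bx) * (p[1] - cy)) / det
    l1 = ((cy - ay) * (p[0] - cx) + (ax - cx) * (p[1] - cy)) / det
    return (l0, l1, 1.0 - l0 - l1)


def from_barycentric(w: Sequence[float], tri: Triangle) -> Point:
    return (sum(wi * v[0] for wi, v in zip(w, tri)), sum(wi * v[1] for wi, v in zip(w, tri)))


def closest_on_segment(p: Point, a: Point, b: Point) -> Point:
    dx, dy = b[0] - a[0], b[1] - a[1]
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    t = min(1.0, max(0.0, t))
    return (a[0] + t * dx, a[1] + t * dy)


def closest_on_triangle(p: Point, tri: Triangle) -> Point:
    if min(barycentric(p, tri)) >= 0.0:
        return p  # the point is already inside the triangle
    edges = [closest_on_segment(p, tri[i], tri[(i + 1) % 3]) for i in range(3)]
    return min(edges, key=lambda q: math.dist(p, q))


def articulated_distance(x: Point, deformed: Sequence[Triangle], base: Sequence[Triangle],
                         base_sdf: Callable[[Point], float]) -> float:
    # Case 1: x lies in one or more deformed triangles. Warp it back through each one
    # and keep the base-pose distance with the smallest absolute value.
    inside = []
    for tri_d, tri_b in zip(deformed, base):
        weights = barycentric(x, tri_d)
        if min(weights) >= 0.0:
            inside.append(base_sdf(from_barycentric(weights, tri_b)))
    if inside:
        return min(inside, key=abs)
    # Case 2: x is outside the mesh. Use the closest triangle in the current pose,
    # warp its closest point back, and add the gap between x and that point.
    best = None
    for tri_d, tri_b in zip(deformed, base):
        q = closest_on_triangle(x, tri_d)
        gap = math.dist(x, q)
        if best is None or gap < best[0]:
            best = (gap, q, tri_d, tri_b)
    gap, q, tri_d, tri_b = best
    return gap + base_sdf(from_barycentric(barycentric(q, tri_d), tri_b))


def circle_sdf(p: Point) -> float:
    """Stand-in for the interpolated base-pose SDF: a unit circle at the origin."""
    return math.hypot(p[0], p[1]) - 1.0


base = [((-2.0, -2.0), (2.0, -2.0), (2.0, 2.0)), ((-2.0, -2.0), (2.0, 2.0), (-2.0, 2.0))]
deformed = [tuple((vx + 3.0, vy) for vx, vy in tri) for tri in base]  # shifted by (3, 0)

d_center = articulated_distance((3.0, 0.0), deformed, base, circle_sdf)
d_inside_mesh = articulated_distance((4.5, 0.0), deformed, base, circle_sdf)
d_outside_mesh = articulated_distance((6.0, 0.0), deformed, base, circle_sdf)
```

  • The deformed distance field is never built explicitly: every query is answered by one barycentric solve per candidate triangle followed by a lookup in the base pose.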
  • FIG. 8 is a diagram of an energy function 810 of a distance between each point of a three-dimensional (3D) point cloud based on the depth image 115 of FIG. 1 and a candidate pose based on the articulated signed distance function in accordance with at least one embodiment of the present disclosure.
  • the hand tracking module 110 of FIGS. 1 and 2 generates the energy function 810 to evaluate how well the points of the 3D point cloud are explained by the candidate hand pose θ.
  • the hand tracking module 110 defines the energy function as
  • the articulated signed distance field defined above allows D(x, θ) to be rapidly queried for distances and derivatives.
  • the energy function above can be rapidly queried for both its value and descent directions so that rapid local search can be performed from initialization poses.
  • the hand tracking module 110 performs a local search to minimize the energy by bounding the candidate pose by the pose from the previous frame 825 of the depth camera 105 of FIG. 1.
  • the depth camera 105 is a high frame rate depth camera, such that the pose in the previous frame 825 is extremely likely to be close to the true pose in the current frame due to the short time interval between the frames. Rapidly minimizing the aforementioned energy function facilitates processing of depth frames at a high frame rate.
  • the hand tracking module 110 further initializes the candidate pose by the coarse global predicted pose 830 generated by the reinitializer 215. By initializing the candidate pose by one or both of the pose of the previous frame and the coarse global predicted pose 830, the hand tracking module 110 avoids local minima of the energy function 810.
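To make the idea of querying the energy for both its value and a descent direction concrete, here is a small sketch. The energy formula is not reproduced in this text, so the sketch assumes a sum of squared distances, E(θ) = Σ D(x_n, θ)². The "hand" is replaced by a hypothetical toy model (a sphere whose pose θ is its center), and the step size and iteration count are illustrative choices.

```python
import numpy as np

RADIUS = 1.0  # hypothetical toy model: a sphere whose "pose" is its center

def distance_and_grad(points, theta):
    """Signed distance D(x, theta) of each point to the sphere, and dD/dtheta."""
    diff = points - theta
    norm = np.linalg.norm(diff, axis=1)
    d = norm - RADIUS
    grad = -diff / norm[:, None]
    return d, grad

def energy_and_grad(points, theta):
    """Assumed energy: sum of squared distances, with its analytic gradient."""
    d, g = distance_and_grad(points, theta)
    return np.sum(d ** 2), 2.0 * (d[:, None] * g).sum(axis=0)

def local_search(points, theta0, step=0.5, iters=100):
    """Plain gradient descent started from an initialization pose."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(iters):
        _, grad = energy_and_grad(points, theta)
        theta -= step * grad / len(points)
    return theta
```

Starting `local_search` from the previous frame's pose works here because, at a high frame rate, that pose already lies close to the minimum. A production tracker would likely use a second-order method and a full articulated pose vector in place of plain gradient descent on a 3-vector.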
  • FIG. 9 is a flow diagram illustrating a method 900 of estimating a current pose of a hand based on a captured depth image in accordance with at least one embodiment of the present disclosure.
  • the depth camera 105 of FIG. 1 captures a depth image 115 of the hand 120.
  • the interpolator 220 of the hand tracking module 110 defines a dense signed distance field 330 based on the depth image 115.
  • the volumetric deformer 225 volumetrically deforms the dense signed distance field 330 based on the tetrahedral mesh 510.
  • the volumetric deformer 225 defines the articulated signed distance function based on the volumetric deformation of the dense signed distance field 330.
  • the hand tracking module 110 minimizes the energy function 810 to estimate the current pose 140 by exploiting the deformer and interpolator, which allow extremely rapid querying of distances to the implicit surface, and of the corresponding derivatives, in arbitrary poses.
  • FIG. 10 is a flow diagram illustrating a method 1000 of minimizing the energy function 810 for a candidate pose that is initialized by the pose in the previous frame 825 and a coarse global predicted pose 830 in accordance with at least one embodiment of the present disclosure.
  • the hand tracking module 110 sets the pose from the previous frame 825 as a first initialization of the candidate pose.
  • the hand tracking module 110 sets the coarse global predicted pose 830 as a second initialization of the candidate pose.
  • the hand tracking module 110 leverages an articulated signed distance function to provide rapid local search from each initialization.
  • the hand tracking module 110 estimates the current pose 140 as the candidate pose with the minimum energy function 810.
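The steps of method 1000 amount to a multi-start local search that keeps the lowest-energy result. The sketch below illustrates only that selection logic. The one-dimensional `toy_energy`, its gradient, the step size, and the initialization values are hypothetical stand-ins, chosen so that the energy has a shallow local minimum and a deeper global one.

```python
def toy_energy(theta):
    """Hypothetical 1-D energy: local minimum near -0.81, global minimum at 2."""
    return (theta + 1.0) ** 2 * (theta - 2.0) ** 2 + 0.5 * (theta - 2.0) ** 2

def toy_energy_grad(theta):
    """Analytic derivative of toy_energy."""
    return (theta - 2.0) * (4.0 * theta ** 2 + 2.0 * theta - 1.0)

def local_search(energy_grad, theta0, step=0.02, iters=500):
    """Gradient descent from a single initialization."""
    theta = float(theta0)
    for _ in range(iters):
        theta -= step * energy_grad(theta)
    return theta

def estimate_pose(energy, energy_grad, initializations):
    """Run a local search from every initialization and keep the lowest energy."""
    candidates = [local_search(energy_grad, t0) for t0 in initializations]
    return min(candidates, key=energy)

previous_frame_pose = -1.2    # stand-in for the pose from the previous frame
coarse_global_pose = 1.5      # stand-in for the coarse global predicted pose
current_pose = estimate_pose(
    toy_energy, toy_energy_grad, [previous_frame_pose, coarse_global_pose]
)
```

In this toy setup the search started from `previous_frame_pose` settles in the shallow minimum, while the one started from `coarse_global_pose` reaches the deeper one, so taking the minimum over both recovers from the poor initialization.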
  • FIG. 11 is a flow diagram illustrating a method 1100 of generating a coarse global predicted pose 830 of a hand 120 in accordance with at least one embodiment of the present disclosure.
  • the memory 205 receives a depth image 115.
  • the pixel segmenter 210 segments the pixels of the depth image 115 into pixels corresponding to the left hand, the right hand, and the background.
  • the reinitializer 215 finds the center of each point cloud to generate the coarse global predicted pose 830 of the hand 120.
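The segmentation-then-centroid step can be sketched as follows. The label values, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the function names are assumptions for illustration. The text does not say how pixels are back-projected or how the segmenter itself works, so the label image is simply taken as an input.

```python
import numpy as np

BACKGROUND, LEFT, RIGHT = 0, 1, 2  # hypothetical label values

def backproject(depth, fx, fy, cx, cy):
    """Turn a depth image into an (H, W, 3) array of camera-space points."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # v: row index, u: column index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def coarse_hand_centers(depth, labels, fx, fy, cx, cy):
    """Center of the point cloud of each segmented hand, if present."""
    points = backproject(depth, fx, fy, cx, cy)
    centers = {}
    for name, label in (("left", LEFT), ("right", RIGHT)):
        mask = (labels == label) & (depth > 0)   # skip pixels with no depth
        if mask.any():
            centers[name] = points[mask].mean(axis=0)
    return centers
```

Each returned center gives only a rough position for a hand. It serves as a coarse starting point for the local search rather than as a full pose estimate.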
  • certain aspects of the techniques described above may be implemented by one or more processors of a processing system executing software.
  • the software comprises one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer readable storage medium.
  • the software can include the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above.
  • the non-transitory computer readable storage medium can include, for example, a magnetic or optical disk storage device, solid state storage devices such as Flash memory, a cache, random access memory (RAM), or other volatile or non-volatile memory device or devices, and the like.
  • the executable instructions stored on the non-transitory computer readable storage medium may be in source code, assembly language code, object code, or other instruction format that is interpreted or otherwise executable by one or more processors.
  • a computer readable storage medium may include any storage medium, or combination of storage media, accessible by a computer system during use to provide instructions and/or data to the computer system.
  • Such storage media can include, but are not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-Ray disc), magnetic media (e.g., floppy disc, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or Flash memory), or microelectromechanical systems (MEMS)-based storage media.
  • the computer readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based Flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
EP18755602.2A 2017-05-31 2018-07-27 Hand tracking based on articulated distance field Withdrawn EP3631759A1 (de)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201762513199P 2017-05-31 2017-05-31
US15/994,563 US10614591B2 (en) 2017-05-31 2018-05-31 Hand tracking based on articulated distance field
PCT/US2018/044045 WO2018223155A1 (en) 2017-05-31 2018-07-27 Hand tracking based on articulated distance field

Publications (1)

Publication Number Publication Date
EP3631759A1 (de) 2020-04-08

Family

ID=63209668

Family Applications (1)

Application Number Title Priority Date Filing Date
EP18755602.2A Withdrawn EP3631759A1 (de) 2017-05-31 2018-07-27 Hand tracking based on articulated distance field

Country Status (3)

Country Link
US (2) US10614591B2 (de)
EP (1) EP3631759A1 (de)
WO (1) WO2018223155A1 (de)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10416755B1 (en) 2018-06-01 2019-09-17 Finch Technologies Ltd. Motion predictions of overlapping kinematic chains of a skeleton model used to control a computer system
US11474593B2 (en) 2018-05-07 2022-10-18 Finch Technologies Ltd. Tracking user movements to control a skeleton model in a computer system
US11009941B2 (en) 2018-07-25 2021-05-18 Finch Technologies Ltd. Calibration of measurement units in alignment with a skeleton model to control a computer system
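CN109145803B (zh) * 2018-08-14 2022-07-22 京东方科技集团股份有限公司 Gesture recognition method and apparatus, electronic device, and computer-readable storage medium
CN110221690B (zh) * 2019-05-13 2022-01-04 Oppo广东移动通信有限公司 Gesture interaction method and apparatus based on AR scene, storage medium, and communication terminal
CN111433783B (zh) * 2019-07-04 2023-06-06 深圳市瑞立视多媒体科技有限公司 Hand model generation method, apparatus, terminal device, and hand motion capture method
CN110503646B (zh) * 2019-08-29 2022-03-25 联想(北京)有限公司 Image processing method and apparatus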
US10976863B1 (en) * 2019-09-19 2021-04-13 Finch Technologies Ltd. Calibration of inertial measurement units in alignment with a skeleton model to control a computer system based on determination of orientation of an inertial measurement unit from an image of a portion of a user
US11175729B2 (en) 2019-09-19 2021-11-16 Finch Technologies Ltd. Orientation determination based on both images and inertial measurement units
US11182909B2 (en) * 2019-12-10 2021-11-23 Google Llc Scalable real-time hand tracking
EP4172938A4 (de) * 2020-06-26 2024-04-03 INTEL Corporation Apparatus and method for three-dimensional pose estimation
CN116189308B (zh) * 2023-03-09 2023-08-01 杰能科世智能安全科技(杭州)有限公司 Method, system, and storage medium for detecting an unmanned aerial vehicle operator

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5454043A (en) * 1993-07-30 1995-09-26 Mitsubishi Electric Research Laboratories, Inc. Dynamic and static hand gesture recognition through low-level image analysis
WO2010099034A1 (en) * 2009-02-25 2010-09-02 Honda Motor Co., Ltd. Capturing and recognizing hand postures using inner distance shape contexts
US20110289455A1 (en) 2010-05-18 2011-11-24 Microsoft Corporation Gestures And Gesture Recognition For Manipulating A User-Interface
US9020187B2 (en) 2011-05-27 2015-04-28 Qualcomm Incorporated Planar mapping and tracking for mobile devices
US9344707B2 (en) * 2011-06-29 2016-05-17 Microsoft Technology Licensing, Llc Probabilistic and constraint based articulated model fitting
US9002099B2 (en) * 2011-09-11 2015-04-07 Apple Inc. Learning-based estimation of hand and finger pose
US9734393B2 (en) * 2012-03-20 2017-08-15 Facebook, Inc. Gesture-based control system
EP3107070B1 (de) * 2014-02-14 2020-04-29 Sony Interactive Entertainment Inc. Information processing device and information processing method
KR101687017B1 (ko) * 2014-06-25 2016-12-16 한국과학기술원 Apparatus and method for estimating hand position using a head-worn color depth camera, and bare-hand interaction system using the same
US20160086349A1 (en) * 2014-09-23 2016-03-24 Microsoft Corporation Tracking hand pose using forearm-hand model
US9747717B2 (en) * 2015-05-13 2017-08-29 Intel Corporation Iterative closest point technique based on a solution of inverse kinematics problem
KR102317247B1 (ko) * 2015-06-15 2021-10-26 한국전자통신연구원 Apparatus and method for augmented-reality-based hand interaction using image information
US10318008B2 (en) * 2015-12-15 2019-06-11 Purdue Research Foundation Method and system for hand pose detection
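CN105654492B (zh) 2015-12-30 2018-09-07 哈尔滨工业大学 Robust real-time three-dimensional reconstruction method based on a consumer-grade camera
CN107992858A (zh) 2017-12-25 2018-05-04 深圳市唯特视科技有限公司 Real-time three-dimensional hand pose estimation method based on a single RGB frame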

Also Published As

Publication number Publication date
US20180350105A1 (en) 2018-12-06
US11030773B2 (en) 2021-06-08
US10614591B2 (en) 2020-04-07
WO2018223155A1 (en) 2018-12-06
US20200193638A1 (en) 2020-06-18

Similar Documents

Publication Publication Date Title
US11030773B2 (en) Hand tracking based on articulated distance field
US20230108253A1 (en) Hand skeleton learning, lifting, and denoising from 2d images
US10854012B1 (en) Concealing loss of distributed simultaneous localization and mapping (SLAM) data in edge cloud architectures
Taylor et al. Articulated distance fields for ultra-fast tracking of hands interacting
US10394318B2 (en) Scene analysis for improved eye tracking
EP2880633B1 (de) Animating objects using the human body
CN107111746B (zh) Model fitting from raw time-of-flight images
US8994652B2 (en) Model-based multi-hypothesis target tracker
EP3475875A1 (de) Detection of objects in video data
US11244506B2 (en) Tracking rigged polygon-mesh models of articulated objects
US11240525B2 (en) Systems and methods for video encoding acceleration in virtual, augmented, and mixed reality (xR) applications
US20200380719A1 (en) Resolving incorrect distributed simultaneous localization and mapping (slam) data in edge cloud architectures
EP3593323B1 (de) Fast, high-fidelity face tracking
EP3639193A1 (de) Human feedback in 3D model fitting
US10656722B2 (en) Sensor system for collecting gestural data in two-dimensional animation
Figueroa et al. A combined approach toward consistent reconstructions of indoor spaces based on 6D RGB-D odometry and KinectFusion
CN110800024B (zh) Method and electronic device for estimating the current pose of a hand
CN116686006A (zh) Three-dimensional scan registration based on a deformable model
EP3480789B1 (de) Dynamic graceful degradation of augmented-reality effects
Ravikumar Lightweight markerless monocular face capture with 3d spatial priors
Zanuttigh et al. Human Pose Estimation and Tracking

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20190520

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20210111

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20210514

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230525