CN118071923A - Method and apparatus for generating three-dimensional gestures of a hand - Google Patents

Method and apparatus for generating three-dimensional gestures of a hand

Info

Publication number
CN118071923A
Authority
CN
China
Prior art keywords
hand
triangle
thumb
finger
palm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410169501.6A
Other languages
Chinese (zh)
Inventor
Onur Güleryüz (奥努尔·居莱尔于兹)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Publication of CN118071923A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03 Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/0304 Detection arrangements using opto-electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/60 Analysis of geometric attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G06V20/653 Three-dimensional objects by matching three-dimensional models, e.g. conformal mapping of Riemann surfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03 Recognition of patterns in medical or anatomical images
    • G06V2201/033 Recognition of patterns in medical or anatomical images of skeletal patterns

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

Methods and apparatus for generating a three-dimensional pose of a hand are disclosed. A processor identifies keypoints [115, 120, 125, 130] on the hand in a two-dimensional image [100] captured by a camera [215]. The locations of the keypoints are used to access a look-up table (LUT) 230, which represents potential poses of the hand as a function of the locations of the keypoints, to determine the three-dimensional pose of the hand. In some embodiments, the keypoints include the locations of the tips of the fingers and thumb, the joints connecting the phalanges of the fingers and thumb, the palm knuckles representing the points of attachment of the fingers and thumb to the palm of the hand, and a wrist location indicating the point of attachment of the hand to the forearm. Some embodiments of the LUT represent the 2D coordinates of each finger and the thumb in a corresponding finger pose plane [405].

Description

Method and apparatus for generating three-dimensional gestures of a hand
Description of the divisional application
This application is a divisional application of the Chinese patent application No. 201880033400.9, filed on October 15, 2018.
Technical Field
The present disclosure relates to hand skeleton learning, lifting, and denoising from 2D images.
Background
The position of a human body part, particularly a human hand, in three-dimensional (3D) space is a useful driver for many applications. Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR) applications use representations of a user's hands to facilitate interaction with virtual objects, selection of items from virtual menus, placement of objects in the user's virtual hand, provision of a user interface by drawing a menu on one hand and selecting elements of the menu with the other hand, and so forth. Gesture interaction adds another way to interact with automated home assistants such as Google Home or Nest. Security or monitoring systems use a 3D representation of a person's hand or other body part to detect and signal abnormal situations. In general, a 3D representation of the position of a human hand or other body part provides a mode of interaction or detection that replaces or supplements existing modes such as voice communication, touch screens, keyboards, and computer mice. However, computer systems do not always implement 3D imaging devices. For example, devices such as smartphones, tablet computers, and head-mounted devices (HMDs) typically implement lightweight imaging devices such as two-dimensional (2D) cameras.
Drawings
The present disclosure may be better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.
Fig. 1 is a two-dimensional (2D) image of a hand according to some embodiments.
Fig. 2 is a block diagram of a processing system configured to acquire a 2D image of a hand and generate a 3D pose of the hand based on the 2D image, in accordance with some embodiments.
FIG. 3 illustrates a palm triangle and a thumb triangle representing a portion of a skeletal model of a hand, in accordance with some embodiments.
Fig. 4 illustrates a finger pose in a corresponding finger pose plane, according to some embodiments.
FIG. 5 illustrates a skeletal model of a finger in a finger pose plane, according to some embodiments.
Fig. 6 is a representation of a look-up table (LUT) for looking up the 2D coordinates of a finger in a finger pose plane based on the relative positions of the tip of the finger and the palm knuckle, according to some embodiments.
Fig. 7 illustrates 2D coordinates of a finger with the relative positions of the fingertip and palm knuckle represented by the circles shown in Fig. 6, in accordance with some embodiments.
Fig. 8 is a flow chart of a method of configuring a LUT that maps the relative positions of the tip of a finger and the palm knuckle to the 2D coordinates of the finger, according to some embodiments.
Fig. 9 is a flow chart of a method of lifting a 3D pose of a hand from a 2D image of the hand, according to some embodiments.
Fig. 10 is an illustration of iterative denoising of 3D keypoints lifted from a 2D image of a hand, according to some embodiments.
Fig. 11 is a flow chart of a method of denoising keypoints extracted from a 2D image of a hand, according to some embodiments.
Detailed Description
A 3D representation of the hand (referred to herein as a skeletal model) is generated in real time from a 2D image of the hand by identifying keypoints on the hand in the 2D image and using the positions of the keypoints to access a look-up table, which represents potential 3D poses of the hand as a function of the positions of the keypoints, to determine the 3D pose and position of the hand. In some embodiments, the keypoints include the positions of the tips of the fingers and thumb, the joints connecting the phalanges of the fingers and thumb, the palm knuckles representing the point of attachment of each finger and the thumb to the palm, and a wrist position indicating the point of attachment of the hand to the user's forearm. The look-up table includes a finger pose look-up table representing the 2D coordinates of each finger (or thumb) in a corresponding finger pose plane as a function of the position of the tip of the finger (or thumb) relative to the corresponding palm knuckle. The phalange lengths of the fingers and thumb are determined from a set of training images of the hand. The finger pose look-up table is generated based on these lengths and on anatomical constraints on the range of motion of the joints connecting the phalanges. The palm is represented by a palm triangle and a thumb triangle, which are defined by corresponding sets of vertices. The palm triangle has a vertex at the wrist location opposite the side of the triangle that includes the palm knuckles of the fingers. The thumb triangle has a vertex at the wrist location, a vertex at the palm knuckle of the thumb, and a vertex at the palm knuckle of the index finger. Parameters defining the palm triangle and thumb triangle are also determined from the set of training images.
In operation, a skeletal model of the hand is determined from the 2D image of the hand using the 2D coordinates determined from the finger pose look-up table and the directions of the palm triangle and thumb triangle. The fingers and thumb are anatomically constrained to move in corresponding pose planes that have fixed directions relative to the palm triangle and thumb triangle, respectively. For example, the index finger in the 2D image is constrained to lie in a finger pose plane that connects the palm knuckle of the index finger with the tip of the index finger. The 2D coordinates of a finger in its finger pose plane are determined by accessing the corresponding finger pose look-up table using the relative positions of the fingertip and the palm knuckle. The 3D pose of the finger is then determined by rotating the 2D coordinates based on the direction of the palm triangle. The 3D pose of the thumb is determined by rotating the 2D pose of the thumb (determined from a finger pose look-up table) according to the direction of the thumb triangle. In some embodiments, noisy values of the keypoints in the 2D image are used to determine the 3D pose of the hand by determining a 3D skeletal model of the hand from the 2D image based on an initial estimate of the keypoints, as described above. The 3D positions of the keypoints indicated by the skeletal model are modified based on projections of those 3D positions along the lines connecting the original 2D keypoints to the vanishing point associated with the 2D image. The modified 3D locations of the keypoints are then used to update the skeletal model, as described above, and the process is iterated until convergence.
Some embodiments of the techniques disclosed herein have been validated on different data sets and achieve greater than 80% correctly identified keypoints (up to 98% in some cases) when the results are not aligned with the ground truth data prior to comparison. Aligning the results with the ground truth data prior to comparison further increases the percentage of correct keypoints.
Fig. 1 is a two-dimensional (2D) image 100 of a hand 105 according to some embodiments. The hand 105 is represented by a skeletal model 110, which models the fingers, thumb, and palm of the hand 105 as a collection of interconnected keypoints. In the illustrated embodiment, the keypoints include the tips 115 of the fingers and thumb (only one indicated by a reference numeral for clarity), the joints 120 (only one indicated for clarity) connecting the phalanges 125 (only one indicated for clarity) of the fingers and thumb, the palm knuckles 130 (only one indicated for clarity) representing the point of attachment of each finger and the thumb to the palm, and the wrist location 135 indicating the point of attachment of the hand to the user's forearm.
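For concreteness, the keypoints described above can be organized under a fixed index scheme. The sketch below is an illustrative assumption rather than part of the disclosure: the 21-keypoint ordering, the names, and the `keypoints_2d_from_detector` placeholder are hypothetical.

```python
# Hypothetical 21-keypoint hand layout implied by the description:
# 1 wrist + 5 palm knuckles + 2 intermediate joints per digit + 5 tips.
WRIST = 0
PALM_KNUCKLES = {"thumb": 1, "index": 5, "middle": 9, "ring": 13, "little": 17}
JOINTS = {digit: (k + 1, k + 2) for digit, k in PALM_KNUCKLES.items()}  # joints 120
TIPS = {digit: k + 3 for digit, k in PALM_KNUCKLES.items()}             # tips 115

def keypoints_2d_from_detector(image):
    """Placeholder for any 2D hand-keypoint detector; expected to return a
    (21, 2) array of pixel coordinates ordered as defined above."""
    raise NotImplementedError("plug in a 2D hand keypoint detector here")
```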
Fig. 2 is a block diagram of a processing system 200 configured to acquire a 2D image of a hand 205 and generate a 3D pose of the hand based on the 2D image, in accordance with some embodiments. Generating the 3D pose of the hand 205 from the 2D image is referred to as "lifting" the 3D pose of the hand 205 from the 2D image. In the illustrated embodiment, the 3D pose of the hand 205 is represented by a skeletal model 210, such as the skeletal model 110 shown in Fig. 1. For clarity, the following discussion uses the hand 205 as an example of a body part. Some embodiments of the techniques discussed herein are equally applicable to lifting 3D poses of other body parts from corresponding 2D images. For example, the processing system 200 can lift a 3D pose of a foot, arm, leg, head, or other body part, or a combination thereof, from a 2D image of the corresponding body part.
The processing system 200 includes an image acquisition device such as a camera 215. Examples of image acquisition devices for implementing the camera 215 include a red-green-blue (RGB) camera, such as a camera implemented in a mobile phone or tablet computer that executes virtual reality or augmented reality applications, and an RGB camera that performs depth estimation using one or more depth sensors. In some embodiments, the camera 215 is a lightweight RGB camera that is implemented in a small form factor and consumes a small amount of power.
The camera 215 acquires a 2D image of the hand 205 and stores information representing the 2D image in the memory 220. The processor 225 is able to access the information representing the 2D image from the memory 220 and perform operations including learning, lifting, and denoising. The learning phase includes generating one or more look-up tables (LUTs) 230 using training images of the hand 205. For example, the lengths of the phalanges of the fingers and thumb are determined from a set of training images of the hand 205. The LUTs 230 are generated based on these lengths and on anatomical constraints on the range of motion of the joints connecting the phalanges, and are then stored in the memory 220. Parameters such as the vertices defining the palm triangle and thumb triangle are also determined from the set of training images and stored in the memory 220.
During the lifting phase, the processor 225 generates the skeletal model 210 in real time from the 2D image of the hand by identifying keypoints on the hand 205 in the 2D image. The processor uses the locations of the keypoints to access the 2D coordinates of the fingers and thumb from the LUTs 230, which store the 2D coordinates of each finger and the thumb as a function of the relative positions of the fingertip and the palm knuckle, and thereby determines the 3D pose and location of the hand 205. The processor 225 determines the 3D poses of the fingers by rotating the 2D coordinates based on the direction of the palm triangle, and determines the 3D pose of the thumb by rotating the 2D pose of the thumb (determined from the finger pose look-up table) according to the direction of the thumb triangle.
Some embodiments of the processor 225 are configured to denoise noisy values of keypoints extracted from 2D images of the hand 205. The denoising stage is an iterative process. Initially, the processor 225 determines a 3D pose of the hand by determining a 3D skeletal model of the hand from the 2D image based on an initial estimate of the noisy keypoints. The processor 225 then modifies the 3D locations of the keypoints indicated by the skeletal model based on projections of those 3D locations along the lines connecting the original noisy keypoints to the vanishing point associated with the 2D image. The vanishing point is determined based on parameters characterizing the camera 215. The processor 225 updates the values of the noisy keypoints based on the modified 3D locations of the keypoints indicated by the skeletal model and then iterates the process until the noisy keypoints satisfy a corresponding convergence criterion.
Fig. 3 illustrates a palm triangle 300 and a thumb triangle 305 representing a portion of a skeletal model of a hand, in accordance with some embodiments. Palm triangle 300 and thumb triangle 305 represent portions of skeletal model 110 shown in fig. 1 and skeletal model 210 shown in fig. 2.
The palm triangle 300 is defined by a vertex at the wrist location 310 and the palm knuckles 311, 312, 313, 314 of the hand (collectively referred to herein as "the palm knuckles 311-314"). The plane containing the palm triangle 300 is defined by unit vectors 315, 316, which are represented by the parameters u_I and u_L, respectively. The distance 320 from the wrist location 310 to the palm knuckle 311 of the index finger is represented by the parameter I, and the distance 325 from the wrist location 310 to the palm knuckle 314 of the little finger is represented by the parameter L. Thus, the position of the palm knuckle 311 relative to the wrist location 310 is given by a vector having direction u_I and magnitude I, and the position of the palm knuckle 314 relative to the wrist location 310 is given by a vector having direction u_L and magnitude L. The position of the palm knuckle 312 of the middle finger is defined as:
λ_m I u_I + (1 - λ_m) L u_L
where λ_m is a parameter associated with the middle finger. The position of the palm knuckle 313 of the ring finger is defined as:
(1 - λ_r) I u_I + λ_r L u_L
where λ_r is a parameter associated with the ring finger. The values of the parameters defining the palm triangle 300 are learned from 2D images of the hand while the hand is held in a set of training poses.
The thumb triangle 305 is defined by a vertex at the wrist location 310, a vertex at the palm knuckle 311 of the index finger, and a vertex at the palm knuckle 330 of the thumb. The plane containing the thumb triangle 305 is defined by unit vectors 315, 335, which are represented by the parameters u_I and u_T, respectively. As discussed herein, the distance 320 from the wrist location 310 to the palm knuckle 311 of the index finger is represented by the parameter I. The distance 340 from the wrist location 310 to the palm knuckle 330 of the thumb is represented by the parameter T. Thus, the position of the palm knuckle 330 relative to the wrist location 310 is given by a vector having direction u_T and magnitude T. The thumb triangle 305 differs from the palm triangle 300 in that the thumb triangle 305 is compressible and may have zero area. The values of the parameters defining the thumb triangle 305 are learned from 2D images of the hand while the hand is held in the set of training poses.
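A minimal sketch of how the learned parameters (I, L, T, u_I, u_L, u_T, λ_m, λ_r) determine the palm-knuckle positions relative to the wrist, directly following the two interpolation formulas above; the function name and return layout are illustrative assumptions.

```python
def palm_and_thumb_triangles(wrist, u_I, u_L, u_T, I, L, T, lam_m, lam_r):
    """Return the palm-knuckle positions implied by the learned triangle
    parameters (an illustrative sketch; names and layout are assumptions).

    wrist: numpy 3-vector; u_I, u_L, u_T: unit 3-vectors (315, 316, 335);
    I, L, T: learned distances (320, 325, 340); lam_m, lam_r: per-hand
    interpolation parameters for the middle and ring fingers."""
    index_k  = wrist + I * u_I                                     # palm knuckle 311
    little_k = wrist + L * u_L                                     # palm knuckle 314
    middle_k = wrist + lam_m * I * u_I + (1.0 - lam_m) * L * u_L   # palm knuckle 312
    ring_k   = wrist + (1.0 - lam_r) * I * u_I + lam_r * L * u_L   # palm knuckle 313
    thumb_k  = wrist + T * u_T                                     # palm knuckle 330
    palm_triangle  = (wrist, index_k, little_k)
    thumb_triangle = (wrist, index_k, thumb_k)   # may collapse to zero area
    knuckles = {"thumb": thumb_k, "index": index_k, "middle": middle_k,
                "ring": ring_k, "little": little_k}
    return palm_triangle, thumb_triangle, knuckles
```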
Fig. 4 illustrates a finger pose 400 in a corresponding finger pose plane 405, according to some embodiments. The finger pose plane 405 is anatomically constrained to maintain a substantially fixed orientation relative to the plane 410, so the movement of the finger is constrained to lie substantially within the corresponding finger pose plane 405. The finger pose plane 405 shown in Fig. 4 represents some embodiments of the plane of motion of an index finger, middle finger, ring finger, little finger, or thumb. If the finger pose plane 405 represents the plane of motion of an index finger, middle finger, ring finger, or little finger, the plane 410 is the plane containing a palm triangle of the hand, such as the palm triangle 300 shown in Fig. 3. If the finger pose plane 405 represents the plane of motion of a thumb, the plane 410 is the plane containing a thumb triangle of the hand, such as the thumb triangle 305 shown in Fig. 3.
A finger of the hand is represented by a skeletal model 415 of the finger. The skeletal model 415 is characterized by the position of the fingertip 420 relative to the palm knuckle 425 of the finger. As discussed below, the position of the fingertip 420 relative to the palm knuckle 425 determines the 2D coordinates defining the position of the skeletal model 415 of the finger in the finger pose plane 405.
The direction of the plane 410 is characterized by the vector 430, which is defined as the vector perpendicular to the plane 410. The direction defined by the vector 430 is determined by comparing the size of the palm triangle (or thumb triangle) in the 2D image of the hand to the size of a training representation of the palm triangle (or thumb triangle), such as the representations discussed above with reference to Fig. 3. The direction of the finger pose plane 405 is characterized by the vector 435, which is defined as the vector perpendicular to the vector 430 that lies in the finger pose plane 405. The 3D pose of the finger in the 2D image is generated by rotating the skeletal model 415 of the finger based on the directions determined by the vectors 430, 435.
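The rotation of the planar finger model into 3D can be sketched as follows, assuming the convention that the 2D x-axis of the finger pose plane runs along vector 435 and the 2D y-axis along vector 430; the coordinate convention and function name are assumptions, not specified by the text.

```python
import numpy as np

def finger_plane_to_3d(palm_knuckle_3d, v_normal, v_inplane, finger_xy):
    """Map 2D coordinates from a finger pose plane into 3D.

    palm_knuckle_3d: 3D origin of the plane (the finger's palm knuckle).
    v_normal:  vector 430, perpendicular to the palm (or thumb) triangle plane.
    v_inplane: vector 435, perpendicular to v_normal and lying in the finger
               pose plane.
    finger_xy: (N, 2) joint coordinates in the plane, x along v_inplane and
               y along v_normal (an assumed convention)."""
    n = np.asarray(v_normal, dtype=float)
    t = np.asarray(v_inplane, dtype=float)
    n /= np.linalg.norm(n)
    t /= np.linalg.norm(t)
    xy = np.asarray(finger_xy, dtype=float)
    origin = np.asarray(palm_knuckle_3d, dtype=float)
    return origin + np.outer(xy[:, 0], t) + np.outer(xy[:, 1], n)
```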
Fig. 5 illustrates a skeletal model 500 of a finger in a finger pose plane, in accordance with some embodiments. The skeletal model 500 represents some embodiments of the skeletal model 415 shown in Fig. 4. The skeletal model 500 also represents portions of some embodiments of the skeletal model 110 shown in Fig. 1 and the skeletal model 210 shown in Fig. 2. The skeletal model 500 includes a palm knuckle 501, a first joint knuckle 502, a second joint knuckle 503, and a fingertip 504. The skeletal model 500 is characterized by the length of the phalange 510 between the palm knuckle 501 and the first joint knuckle 502, the length of the phalange 515 between the first joint knuckle 502 and the second joint knuckle 503, and the length of the phalange 520 between the second joint knuckle 503 and the fingertip 504.
The values of the lengths of the phalanges 510, 515, 520 are learned from a set of training images of the hand held in a set of training poses. In some embodiments, the values of the lengths are learned by extracting keypoints corresponding to the palm knuckles, joint knuckles, and fingertips from the set of training images. Outlier filtering is performed using techniques such as the median or the median absolute deviation to find and reject outlier keypoints. The values of the lengths are then fitted to the displacements of the keypoints in the set of training images using techniques that include quadratic programming.
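A hedged sketch of the length learning described above: outliers are rejected with a median-absolute-deviation test, and the surviving per-image distances are summarized with a median. The median fit is a simplification of the quadratic-programming fit mentioned in the text.

```python
import numpy as np

def robust_phalange_length(joint_a, joint_b, mad_threshold=3.0):
    """Estimate one phalange length from keypoint tracks.

    joint_a, joint_b: (num_images, D) arrays of the two keypoints bounding the
    phalange across the training images (D = 2 or 3). Observations failing a
    median-absolute-deviation test are rejected, then the length is the median
    of the surviving per-image distances (a simplification of the
    quadratic-programming fit mentioned in the text)."""
    dists = np.linalg.norm(np.asarray(joint_a, dtype=float) -
                           np.asarray(joint_b, dtype=float), axis=1)
    med = np.median(dists)
    mad = np.median(np.abs(dists - med)) + 1e-9
    keep = np.abs(dists - med) / mad < mad_threshold
    return float(np.median(dists[keep]))
```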
The position of the fingertip 504 relative to the palm knuckle 501 is determined by the set of angles at the palm knuckle 501, the first joint knuckle 502, and the second joint knuckle 503. The first angle 525 represents the angle between the phalange 510 and the plane of the palm triangle (or thumb triangle), as indicated by the dashed line 530. The second angle 535 represents the angle between the phalange 510 and the phalange 515. The third angle 540 represents the angle between the phalange 515 and the phalange 520. The ranges of the angles 525, 535, 540 are anatomically constrained to a limited set of values that are substantially the same, with minor variations, across different hands. For example, the angles 525, 535, 540 are constrained to the range between 0° and 90°.
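Given the three constrained angles, the fingertip position in the finger pose plane follows from planar forward kinematics. The sketch below assumes that angle 525 is measured from the palm (or thumb) plane and that angles 535 and 540 are relative bends at the next two knuckles; the sign convention is an assumption.

```python
import numpy as np

def finger_forward_kinematics(lengths, angles):
    """Planar forward kinematics for one finger in its pose plane.

    lengths: (l1, l2, l3) phalange lengths (segments 510, 515, 520).
    angles:  (a1, a2, a3) in radians; a1 is the flexion of phalange 510
             relative to the palm (or thumb) plane (angle 525), and a2, a3 are
             the relative bends at the next two knuckles (angles 535, 540).
    Returns the 2D positions of the first knuckle, second knuckle, and
    fingertip relative to the palm knuckle (x along the extended-finger
    direction, y out of the palm plane)."""
    l1, l2, l3 = lengths
    a1, a2, a3 = angles
    p0 = np.zeros(2)
    p1 = p0 + l1 * np.array([np.cos(a1), np.sin(a1)])
    p2 = p1 + l2 * np.array([np.cos(a1 + a2), np.sin(a1 + a2)])
    p3 = p2 + l3 * np.array([np.cos(a1 + a2 + a3), np.sin(a1 + a2 + a3)])
    return p1, p2, p3
```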
Fig. 6 is a representation of a LUT 600 for looking up the 2D coordinates of a finger in a finger pose plane based on the relative positions of the tip of the finger and the palm knuckle, according to some embodiments. The vertical axis of the LUT 600 represents the displacement of the tip of the finger relative to the palm knuckle in the vertical direction. The horizontal axis of the LUT 600 represents the displacement of the tip of the finger relative to the palm knuckle in the horizontal direction. The closed curve 605 represents the outer boundary of the possible locations of the tip of the finger relative to the palm knuckle. The closed curve 605 is therefore determined by the lengths of the phalanges of the finger and by the anatomical constraints on the relative angles between the phalanges due to the limited range of motion of the corresponding joints. Positions within the closed curve 605 represent the possible relative positions of the tip of the finger and the palm knuckle.
The LUT 600 for a particular hand is determined using a set of training images of the hand in a set of predetermined poses. To account for the different lengths of the phalanges in different hands, the set of training images is defined to include locations near the boundary defined by the closed curve 605. Most positions within the closed curve 605 uniquely determine the 2D coordinates of the finger. However, some embodiments of the closed curve 605 include a small set of degenerate cases in which a single point within the closed curve 605 maps to more than one set of 2D coordinates of the finger. Other information, such as the previous position of the finger, depth information, or shading and illumination information, may be used to disambiguate between the different sets of 2D coordinates.
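The LUT and the reachable region bounded by the closed curve 605 can be approximated by densely sampling the constrained angles with the forward-kinematics sketch above and binning the resulting fingertip displacements. The cell size, sampling density, and dictionary layout below are illustrative assumptions.

```python
import numpy as np
from itertools import product

def build_finger_pose_lut(lengths, num_angle_samples=40, cell_size=0.02):
    """Approximate the finger-pose LUT by sampling the constrained angles
    (0 to 90 degrees each) and binning fingertip displacements into cells.
    Each cell stores candidate (first knuckle, second knuckle, tip) coordinate
    sets; most cells hold one candidate, degenerate cells may hold several.
    (A real implementation would merge near-duplicate candidates.)"""
    angles = np.linspace(0.0, np.pi / 2.0, num_angle_samples)
    lut = {}
    for a1, a2, a3 in product(angles, repeat=3):
        p1, p2, tip = finger_forward_kinematics(lengths, (a1, a2, a3))
        key = (int(round(tip[0] / cell_size)), int(round(tip[1] / cell_size)))
        lut.setdefault(key, []).append(np.stack([p1, p2, tip]))
    return lut

def lookup_finger_pose(lut, tip_relative, cell_size=0.02):
    """Return the candidate joint coordinates stored for a fingertip
    displacement (relative to the palm knuckle), or an empty list."""
    key = (int(round(tip_relative[0] / cell_size)),
           int(round(tip_relative[1] / cell_size)))
    return lut.get(key, [])
```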
In some embodiments, the information in the LUT 600 is used to determine when two or more distinct poses result in the same or similar sets of projected 2D finger coordinates, e.g., when one or more keypoints derived from the LUT 600 for one 3D pose are the same as or similar to one or more keypoints derived from the LUT 600 for another 3D pose. Signals may then be generated to identify the distinct poses that have the same or similar projected 2D coordinates. The LUT 600 may also be used to convert 2D labels into 3D poses of, for example, a hand without collecting new data. In some embodiments, confidence scores are derived for the distinct poses that may be generated from the same or similar sets of projected 2D coordinates. For example, the distance from the current pose to the farthest pose having the same or similar 2D coordinates is used to generate a confidence score, such as a high confidence score if the distance is zero (or less than a threshold distance) and a low confidence score if the distance is greater than the threshold distance. In some embodiments, disambiguation of the distinct poses is performed based on the confidence scores generated for the keypoints or 2D coordinates of the distinct poses. For example, in some cases, a human uses the image to check or confirm that the 3D pose lifted from the 2D labels is correct. The image can also be used to generate more accurate data by choosing among the different possible solutions.
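One possible realization of the confidence score described above, scoring a pose by its distance to the farthest LUT candidate that shares the same projected fingertip position; the specific formula is an assumption, not the patent's.

```python
import numpy as np

def degenerate_pose_confidence(candidates, current_pose, distance_threshold=0.1):
    """candidates: candidate joint-coordinate arrays stored in one LUT cell;
    current_pose: the candidate currently selected.
    Returns a confidence in [0, 1] that is high when every candidate sharing
    the projected fingertip position is close to the current pose."""
    if len(candidates) <= 1:
        return 1.0
    farthest = max(float(np.linalg.norm(c - current_pose)) for c in candidates)
    if farthest <= distance_threshold:
        return 1.0
    return distance_threshold / farthest   # decays as the ambiguity grows
```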
Fig. 7 illustrates 2D coordinates of a finger with the relative positions of the fingertip and palm knuckle represented by the circles 1, 2, 3, 4, and 5 in Fig. 6, according to some embodiments. As shown in the skeletal model 705, circle 1 indicates the relative position between the tip of the finger and the palm knuckle corresponding to an extended finger. As shown in the skeletal model 710, circle 2 indicates the relative position between the tip and the palm knuckle corresponding to the tip of the finger bent 90° at the second joint knuckle of the finger. As shown in the skeletal model 715, circle 3 indicates the relative position between the tip and the palm knuckle corresponding to the tip curled under the horizontally extended phalange connecting the palm knuckle and the first joint knuckle. As shown in the skeletal model 720, circle 4 indicates the relative position between the tip and the palm knuckle corresponding to the tip curled around the vertically extended phalange connecting the palm knuckle and the first joint knuckle. As shown in the skeletal model 725, circle 5 indicates the relative position between the tip and the palm knuckle corresponding to a finger extended vertically downward.
Fig. 8 is a flow chart of a method 800 of configuring a LUT that maps the relative positions of the tip of a finger and the palm knuckle to the 2D coordinates of the finger, according to some embodiments. Method 800 is used to train some embodiments of LUT 230 shown in fig. 2 and LUT 600 shown in fig. 6. Thus, the method 800 is performed by some embodiments of the processor 225 shown in FIG. 2.
At block 805, a 2D image of the hand held in a pose from the training set is captured. For example, the 2D image may be captured by the camera 215 shown in Fig. 2. The 2D image is stored in a memory such as the memory 220 shown in Fig. 2.
At block 810, the processor identifies keypoints in the 2D image of the hand. As discussed herein, the keypoints include the locations of the tips of the fingers and thumb, the joints connecting the phalanges of the fingers and thumb, the palm knuckles representing the point of attachment of each finger and the thumb to the palm, and the wrist location indicating the point of attachment of the hand to the user's forearm. Techniques for identifying keypoints in 2D images are known in the art and, in the interest of clarity, are not discussed further herein.
At block 815, the processor determines the lengths of the phalanges of the fingers and thumb of the hand based on the keypoints, for example, using the quadratic programming discussed herein.
At block 820, the processor configures the LUT based on the length of the phalanges and other anatomical constraints on the relative positions of the fingertip and palm knuckles. The processor stores the LUT in a memory such as memory 220 shown in fig. 2.
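Putting blocks 805-820 together, a hypothetical training driver might look like the following; it reuses the sketches introduced earlier (`keypoints_2d_from_detector`, `robust_phalange_length`, `build_finger_pose_lut`), all of which are assumed helpers rather than APIs defined by the disclosure.

```python
def train_hand_model(training_images,
                     digits=("thumb", "index", "middle", "ring", "little")):
    """Blocks 805-820: identify keypoints, estimate phalange lengths, and
    configure one finger-pose LUT per digit (lengths are measured in image
    units here for simplicity; the text fits them across many poses)."""
    tracks = [keypoints_2d_from_detector(img) for img in training_images]   # block 810
    luts = {}
    for digit in digits:
        k = PALM_KNUCKLES[digit]
        j1, j2 = JOINTS[digit]
        tip = TIPS[digit]
        lengths = (                                                          # block 815
            robust_phalange_length([t[k] for t in tracks], [t[j1] for t in tracks]),
            robust_phalange_length([t[j1] for t in tracks], [t[j2] for t in tracks]),
            robust_phalange_length([t[j2] for t in tracks], [t[tip] for t in tracks]),
        )
        luts[digit] = build_finger_pose_lut(lengths)                         # block 820
    return luts
```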
Fig. 9 is a flow diagram of a method 900 of lifting a 3D pose of a hand from a 2D image of the hand, according to some embodiments. The method 900 is implemented in some embodiments of the processing system 200 shown in Fig. 2. In the illustrated embodiment, LUTs mapping the positions of the tips of the fingers and thumb relative to the corresponding palm knuckles to 2D finger coordinates have already been generated for the hand, for example, in accordance with some embodiments of the method 800 illustrated in Fig. 8. Other parameters representing a skeletal model of the hand have therefore also been determined, such as the lengths of the phalanges, the parameters defining the palm triangle of the hand, and the parameters defining the thumb triangle of the hand.
At block 905, the processor identifies keypoints in the 2D image of the hand. The processor then estimates the transformation of the hand in 3D space based on the keypoints. Some embodiments of the processor estimate the transformation by comparing the parameters defining the skeletal model of the hand with the values of the corresponding parameters in the 2D image. For example, the processor may compare the lengths of the phalanges of the fingers and thumb in the skeletal model with the lengths of the corresponding phalanges in the 2D image to account for the perspective projection and back-projection of the hand in the 2D image.
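The comparison of model phalange lengths with their 2D counterparts can be realized, for example, as a weak-perspective scale estimate; this is an assumed simplification of the transformation estimation described for block 905.

```python
import numpy as np

def estimate_weak_perspective_scale(model_lengths, observed_lengths_px):
    """Relate the metric skeletal model to the 2D image by comparing
    corresponding phalange lengths (model units vs. pixels). The median ratio
    is robust to individual foreshortened phalanges."""
    model = np.asarray(model_lengths, dtype=float)
    observed = np.asarray(observed_lengths_px, dtype=float)
    return float(np.median(observed / np.maximum(model, 1e-9)))
```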
At block 915, the processor learns the directions of the palm triangle and the thumb triangle. Some embodiments of the processor learn these directions by comparing the parameters defining the palm triangle and thumb triangle with portions of the 2D image. The directions of the palm triangle and the thumb triangle are characterized by corresponding vectors defined as perpendicular to the planes of the palm triangle and the thumb triangle.
At block 920, the processor learns the directions of the finger pose planes of the fingers and thumb. The direction of each finger pose plane is characterized by a corresponding vector that is perpendicular to the vector defining the corresponding palm triangle or thumb triangle and that lies in the corresponding finger pose plane.
At block 925, the processor determines the 2D finger coordinates of the fingers and thumb based on the LUTs and the relative positions of the tips of the fingers and thumb and the corresponding palm knuckles.
At block 930, the processor generates a 3D skeletal model representing the 3D pose of the hand. To generate the 3D skeletal model, the processor rotates the 2D coordinates of the fingers and thumb based on the directions of the palm triangle and thumb triangle, respectively. The 3D skeletal model is determined by combining the palm triangle, the direction of the palm triangle, the thumb triangle, the direction of the thumb triangle, and the rotated 2D finger coordinates of the fingers and thumb.
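A sketch of the lifting stage as a whole; the `hand_model` interface (orientation fitting, stored LUTs, per-digit geometry accessors) is hypothetical, since the text describes the orientation fit only as a comparison of triangle sizes between the 2D image and the trained model.

```python
import numpy as np

def lift_hand_pose(keypoints_2d, hand_model):
    """Blocks 905-930 as a sketch: lift a 3D skeletal model from 2D keypoints.
    hand_model is a hypothetical container for the learned parameters
    (per-digit LUTs, triangle parameters) and for orientation-fitting routines
    that the text describes only at a high level."""
    # Blocks 905-920: estimate the transformation and the triangle / plane directions.
    palm_normal, thumb_normal = hand_model.fit_triangle_orientations(keypoints_2d)
    skeleton_3d = {"wrist": hand_model.wrist_3d()}
    for digit in ("thumb", "index", "middle", "ring", "little"):
        knuckle_3d = hand_model.palm_knuckle_3d(digit)
        normal = thumb_normal if digit == "thumb" else palm_normal
        inplane = hand_model.finger_plane_direction(digit, keypoints_2d, normal)
        # Block 925: planar joint coordinates from the finger-pose LUT.
        tip_rel = hand_model.tip_relative_to_knuckle(digit, keypoints_2d)
        candidates = lookup_finger_pose(hand_model.luts[digit], tip_rel)
        finger_xy = candidates[0]   # degenerate cells would need disambiguation
        # Block 930: rotate the planar coordinates into 3D and assemble the skeleton.
        joints_2d = np.vstack([[0.0, 0.0], finger_xy])   # knuckle plus three joints
        skeleton_3d[digit] = finger_plane_to_3d(knuckle_3d, normal, inplane, joints_2d)
    return skeleton_3d
```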
Fig. 10 is a diagram 1000 of iterative denoising of 3D keypoints lifted from 2D images of a hand, according to some embodiments. The iterative process depicted in the diagram 1000 is implemented in some embodiments of the processing system 200 shown in Fig. 2. The diagram 1000 shows an image plane 1005 of a camera such as the camera 215 shown in Fig. 2. An image captured by the camera is projected onto the image plane 1005. The characteristics of the camera also determine a vanishing point 1010, which is an abstract point on the image plane 1005 at which the 2D projections of parallel lines in 3D space converge.
Initially, a keypoint 1015 is extracted from the 2D image. In the illustrated embodiment, the 2D image is noisy, and the initial estimate of the keypoint 1015 is not necessarily at the correct position in the image of the hand. A 3D skeletal model of the hand is lifted from the 2D image based on the noisy keypoint 1015 and other potentially noisy keypoints (not shown in Fig. 10) extracted from the 2D image, for example, according to some embodiments of the method 800 shown in Fig. 8 and the method 900 shown in Fig. 9. The 3D skeletal model of the hand is used to determine a 3D keypoint 1020 corresponding to the same position on the hand as the keypoint 1015.
The 3D keypoint 1020, referred to herein as a skeleton-compliant keypoint, does not necessarily coincide with a perspective projection of the initial keypoint 1015, because the skeleton-compliant keypoint 1020 does not necessarily lie on the line 1025 between the initial keypoint 1015 and the vanishing point 1010. A modified 3D keypoint 1030, referred to herein as a camera-compatible keypoint, is therefore determined by projecting the skeleton-compliant keypoint 1020 onto the line 1025. The process is then iterated by setting the value of the initial keypoint 1015 equal to the modified 3D keypoint 1030, thereby updating the value of the initial keypoint 1015. The process is iterated until convergence criteria for the keypoint (and any other noisy keypoints in the 2D image) are satisfied.
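The projection step can be sketched as the closest point on line 1025 to the skeleton-compliant keypoint, with the noisy keypoint and the vanishing point placed on the image plane at z equal to the focal length in camera coordinates; that placement is an assumed convention, since the text only states that the line connects the two points.

```python
import numpy as np

def project_onto_camera_line(skeleton_kp_3d, noisy_kp_2d, vanishing_pt_2d, focal_length):
    """Project a skeleton-compliant 3D keypoint (1020) onto the line (1025)
    joining the noisy image keypoint (1015) to the vanishing point (1010),
    yielding the camera-compatible keypoint (1030). Placing both 2D points on
    the image plane z = focal_length in camera coordinates is an assumed
    convention."""
    p0 = np.array([noisy_kp_2d[0], noisy_kp_2d[1], focal_length], dtype=float)
    p1 = np.array([vanishing_pt_2d[0], vanishing_pt_2d[1], focal_length], dtype=float)
    d = p1 - p0
    d /= np.linalg.norm(d)
    t = float(np.dot(np.asarray(skeleton_kp_3d, dtype=float) - p0, d))
    return p0 + t * d   # closest point on line 1025 to the skeleton keypoint
```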
Fig. 11 is a flow chart of a method 1100 of denoising keypoints extracted from a 2D image of a hand, according to some embodiments. Method 1100 is performed in some embodiments of processing system 200 shown in fig. 2.
At block 1105, the processor generates a 3D skeletal model of the hand based on noisy keypoints extracted from the 2D image. In some embodiments, the 3D skeletal model is generated according to embodiments of the method 800 shown in Fig. 8 and the method 900 shown in Fig. 9.
At block 1110, the processor identifies a first set of 3D keypoints that are consistent with the 3D skeletal model of the hand. For example, the first set of 3D keypoints represents keypoints corresponding to the tips of the fingers and thumb, the joints of the fingers and thumb, the palm knuckles of the fingers and thumb, and the wrist position defined by the 3D skeletal model of the hand. In some embodiments, the first set of 3D keypoints includes the skeleton-compliant keypoint 1020 shown in Fig. 10.
At block 1115, the processor identifies a second set of 3D keypoints based on the first set of 3D keypoints and the vanishing point associated with the image. As discussed herein, the vanishing point is determined based on the characteristics of the camera that acquired the 2D image. In some embodiments, the second set of 3D keypoints includes the camera-compatible keypoint 1030 shown in Fig. 10.
At block 1120, the processor modifies the noisy keypoints extracted from the 2D image based on the second set of 3D keypoints. For example, the values of the noisy keypoints are updated to be equal to the corresponding values in the second set of 3D keypoints.
At decision block 1125, the processor determines whether the values of the noisy keypoints have converged, e.g., based on a convergence criterion for the noisy keypoints. If not, the method 1100 returns to block 1105 and generates an updated 3D skeletal model based on the modified values of the noisy keypoints. If the processor determines that the values have converged, the method 1100 proceeds to termination block 1130 and ends.
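A compact sketch of the full loop of method 1100, reusing the earlier sketches (`lift_hand_pose`, `project_onto_camera_line`); the `camera` object with `vanishing_point`, `focal_length`, and `project` members, and the `keypoints_from_skeleton` accessor, are hypothetical interfaces.

```python
import numpy as np

def denoise_keypoints(noisy_kps_2d, hand_model, camera, max_iters=10, tol=1e-3):
    """Method 1100 as a sketch: alternate lifting a skeleton from the current
    keypoints (block 1105), reading off skeleton-compliant 3D keypoints
    (block 1110), snapping them to the camera lines toward the vanishing point
    (block 1115), and updating the 2D keypoints (block 1120) until they
    converge (block 1125)."""
    kps = np.asarray(noisy_kps_2d, dtype=float)
    for _ in range(max_iters):
        skeleton = lift_hand_pose(kps, hand_model)                       # block 1105
        kps_3d = hand_model.keypoints_from_skeleton(skeleton)            # block 1110
        snapped = [project_onto_camera_line(p, kp, camera.vanishing_point,
                                            camera.focal_length)
                   for p, kp in zip(kps_3d, kps)]                        # block 1115
        updated = np.array([camera.project(p) for p in snapped])         # block 1120
        if np.max(np.linalg.norm(updated - kps, axis=1)) < tol:          # block 1125
            return updated
        kps = updated
    return kps
```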
In some embodiments, certain aspects of the techniques described above are implemented by one or more processors of a processing system executing software. The software includes a set of one or more executable instructions stored or otherwise tangibly embodied on a non-transitory computer-readable storage medium. The software may include the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. The non-transitory computer-readable storage medium may include, for example, a magnetic or optical disk storage device, a solid state storage device such as flash memory, a cache, random access memory (RAM), or other non-volatile memory device or devices, and the like. The executable instructions stored on the non-transitory computer-readable storage medium may be in source code, assembly language code, object code, or another instruction format that is interpreted or otherwise executable by the one or more processors.
A computer-readable storage medium may include any storage medium, or combination of storage media, accessible by a computer system during use to provide instructions and/or data to the computer system. Such storage media may include, but are not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-ray disc), magnetic media (e.g., floppy disk, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer-readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).
Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities or elements may be performed or included in addition to those described. Still further, the order in which the activities are listed is not necessarily the order in which they are performed. Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features of any or all of the claims. Moreover, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design shown herein, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified, and all such variations are considered within the scope of the disclosed subject matter. Accordingly, the protection sought herein is as set forth in the claims below.

Claims (20)

1. A method, comprising:
identifying, at a processor, keypoints on a hand in a two-dimensional (2D) image captured by a camera;
identifying, at the processor, one or more hand feature triangles based on the identified keypoints, wherein a first hand feature triangle of the one or more hand feature triangles comprises a thumb triangle having vertices corresponding to a first set of identified keypoints including a wrist position, a palm knuckle position of a thumb of the hand, and a palm knuckle position of an index finger; and
determining, at the processor, a three-dimensional (3D) pose of the hand based on the directions of the one or more hand feature triangles.
2. The method of claim 1, wherein a second hand feature triangle of the one or more hand feature triangles comprises a palm triangle having features corresponding to a second set of identified keypoints, wherein the palm triangle has a vertex at the wrist position and a side opposite the vertex at the wrist position, the side comprising positions associated with two or more palm knuckles of fingers of the hand.
3. The method of claim 2, wherein determining, at the processor, the 3D pose of the hand based on the direction of the one or more hand feature triangles comprises generating a 3D skeletal model representing the 3D pose of the hand by combining the direction of the thumb triangle and the direction of the palm triangle.
4. The method of claim 2, wherein determining, at the processor, the 3D pose of the hand includes identifying 2D coordinates associated with a finger and the thumb of the hand.
5. The method of claim 4, wherein the identified 2D coordinates associated with the finger and thumb of the hand are determined by accessing a corresponding finger or thumb look-up table (LUT).
6. The method of claim 5, wherein determining, at the processor, the 3D pose of the hand comprises determining the 3D pose of the finger and thumb by rotating the identified 2D coordinates based on the direction of the palm triangle and the direction of the thumb triangle, respectively.
7. A method, comprising:
identifying, at a processor, keypoints on a hand in a two-dimensional (2D) image captured by a camera;
identifying, at the processor, one or more hand feature triangles based on the identified keypoints, wherein a first hand feature triangle of the one or more hand feature triangles comprises a palm triangle having features corresponding to a first set of identified keypoints, wherein the palm triangle has a vertex at a wrist position and a side opposite the vertex at the wrist position, the side comprising positions associated with two or more palm knuckles of fingers of the hand; and
determining, at the processor, a three-dimensional (3D) pose of the hand based on the directions of the one or more hand feature triangles.
8. The method of claim 7, wherein a second hand feature triangle of the one or more hand feature triangles comprises a thumb triangle having vertices corresponding to a second set of identified keypoints including the wrist position, a palm knuckle position of a thumb of the hand, and a palm knuckle position of an index finger.
9. The method of claim 8, wherein determining, at the processor, the 3D pose of the hand based on the direction of the one or more hand feature triangles comprises generating a 3D skeletal model representing the 3D pose of the hand by combining the direction of the thumb triangle and the direction of the palm triangle.
10. The method of claim 8, wherein determining, at the processor, the 3D pose of the hand comprises identifying 2D coordinates associated with a finger and the thumb of the hand.
11. The method of claim 10, wherein the identified 2D coordinates associated with the finger and thumb of the hand are determined by accessing a corresponding finger or thumb look-up table (LUT).
12. The method of claim 11, wherein determining, at the processor, the 3D pose of the hand comprises determining the 3D pose of the finger and thumb by rotating the identified 2D coordinates based on the direction of the palm triangle and the direction of the thumb triangle, respectively.
13. An apparatus comprising a processor configured to:
identify keypoints on a hand in a two-dimensional (2D) image captured by a camera;
identify one or more hand feature triangles based on the identified keypoints, wherein a first hand feature triangle of the one or more hand feature triangles comprises:
a thumb triangle having vertices corresponding to a first set of identified keypoints including a wrist position, a palm knuckle position of a thumb of the hand, and a palm knuckle position of an index finger; or
a palm triangle having features corresponding to a second set of identified keypoints, wherein the palm triangle has a vertex at the wrist position and a side opposite the vertex at the wrist position, the side comprising positions associated with two or more palm knuckles of fingers of the hand; and
determine a three-dimensional (3D) pose of the hand based on the directions of the one or more hand feature triangles.
14. The apparatus of claim 13, wherein the one or more hand feature triangles comprise two hand feature triangles comprising the thumb triangle and the palm triangle.
15. The apparatus of claim 14, wherein the thumb triangle and the palm triangle share a common side and two vertices corresponding to the wrist position and the palm knuckle position of the index finger.
16. The apparatus of claim 14, wherein determining the 3D pose of the hand based on the direction of the one or more hand feature triangles comprises generating a 3D skeletal model representing the 3D pose of the hand by combining the direction of the thumb triangle and the direction of the palm triangle.
17. The apparatus of claim 14, wherein determining the 3D pose of the hand comprises identifying 2D coordinates associated with a finger and the thumb of the hand.
18. The apparatus of claim 17, wherein the identified 2D coordinates associated with the finger and thumb of the hand are determined by accessing a corresponding finger or thumb look-up table (LUT).
19. The apparatus of claim 18, wherein determining the 3D pose of the hand comprises determining the 3D pose of the finger and thumb by rotating the identified 2D coordinates based on the direction of the palm triangle and the direction of the thumb triangle, respectively.
20. The apparatus of claim 19, wherein determining the 3D pose of the hand comprises generating a 3D skeletal model representing the 3D pose of the hand by combining the palm triangle, a direction of the palm triangle, and a direction of the thumb triangle.
CN202410169501.6A 2017-12-13 2018-10-15 Method and apparatus for generating three-dimensional gestures of a hand Pending CN118071923A (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201762598306P 2017-12-13 2017-12-13
US62/598,306 2017-12-13
US16/112,264 2018-08-24
US16/112,264 US11544871B2 (en) 2017-12-13 2018-08-24 Hand skeleton learning, lifting, and denoising from 2D images
CN201880033400.9A CN111492367B (en) 2017-12-13 2018-10-15 Method and apparatus for generating three-dimensional gestures of a hand
PCT/US2018/055809 WO2019118058A1 (en) 2017-12-13 2018-10-15 Hand skeleton learning, lifting, and denoising from 2d images

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201880033400.9A Division CN111492367B (en) 2017-12-13 2018-10-15 Method and apparatus for generating three-dimensional gestures of a hand

Publications (1)

Publication Number Publication Date
CN118071923A true CN118071923A (en) 2024-05-24

Family

ID=66697067

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201880033400.9A Active CN111492367B (en) 2017-12-13 2018-10-15 Method and apparatus for generating three-dimensional gestures of a hand
CN202410169501.6A Pending CN118071923A (en) 2017-12-13 2018-10-15 Method and apparatus for generating three-dimensional gestures of a hand

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201880033400.9A Active CN111492367B (en) 2017-12-13 2018-10-15 Method and apparatus for generating three-dimensional gestures of a hand

Country Status (5)

Country Link
US (2) US11544871B2 (en)
EP (1) EP3724810A1 (en)
KR (1) KR102329781B1 (en)
CN (2) CN111492367B (en)
WO (1) WO2019118058A1 (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934065B (en) * 2017-12-18 2021-11-09 虹软科技股份有限公司 Method and device for gesture recognition
US11113887B2 (en) * 2018-01-08 2021-09-07 Verizon Patent And Licensing Inc Generating three-dimensional content from two-dimensional images
US11573641B2 (en) * 2018-03-13 2023-02-07 Magic Leap, Inc. Gesture recognition system and method of using same
US20190362128A1 (en) * 2018-05-23 2019-11-28 Wen-Kuei Liu Knuckle-print identification system
US10902638B2 (en) * 2018-09-28 2021-01-26 Wipro Limited Method and system for detecting pose of a subject in real-time
KR102203933B1 (en) * 2018-11-26 2021-01-15 재단법인 실감교류인체감응솔루션연구단 Method and apparatus for motion capture interface using multiple fingers
US10867441B2 (en) * 2019-02-15 2020-12-15 Microsoft Technology Licensing, Llc Method and apparatus for prefetching data items to a cache
CN114391160A (en) * 2019-09-09 2022-04-22 斯纳普公司 Hand pose estimation from stereo camera
US11288841B2 (en) * 2019-10-17 2022-03-29 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for patient positioning
CN112686084A (en) * 2019-10-18 2021-04-20 宏达国际电子股份有限公司 Image annotation system
CN112767300B (en) * 2019-10-18 2024-07-09 宏达国际电子股份有限公司 Method for automatically generating hand annotation data and method for calculating bone length
CN110942007B (en) * 2019-11-21 2024-03-05 北京达佳互联信息技术有限公司 Method and device for determining hand skeleton parameters, electronic equipment and storage medium
CN110991319B (en) * 2019-11-29 2021-10-19 广州市百果园信息技术有限公司 Hand key point detection method, gesture recognition method and related device
CN115443445A (en) * 2020-02-26 2022-12-06 奇跃公司 Hand gesture input for wearable systems
KR102343363B1 (en) 2020-03-06 2021-12-24 경북대학교 산학협력단 Method for generating rotated hand bone 2d projection image from 1 hand bone 2d projection image, recording medium and device for performing the method
CN112083800B (en) * 2020-07-24 2024-04-30 青岛小鸟看看科技有限公司 Gesture recognition method and system based on adaptive finger joint rule filtering
KR20220039440A (en) * 2020-09-22 2022-03-29 삼성전자주식회사 Display apparatus and method for controlling the display apparatus
US11301041B1 (en) 2020-09-30 2022-04-12 International Business Machines Corporation Hand tremor accessible using augmented reality user interface
CN112861606A (en) * 2020-12-24 2021-05-28 北京航空航天大学 Virtual reality hand motion recognition and training method based on skeleton animation tracking
CN113384291A (en) * 2021-06-11 2021-09-14 北京华医共享医疗科技有限公司 Medical ultrasonic detection method and system
CN113807323B (en) * 2021-11-01 2022-12-09 北京大学 Accurate hand function evaluation system and method based on image recognition
KR102367584B1 (en) * 2021-11-04 2022-02-25 주식회사 티지 Automatic video surveillance system using skeleton video analysis technique
CN115830635A (en) * 2022-12-09 2023-03-21 南通大学 PVC glove identification method based on key point detection and target identification
KR20240109482A (en) * 2023-01-04 2024-07-11 삼성전자주식회사 An augmented reality device for obtaining three-dimensional position information of joints of user's hand and a method for operating the same

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6738065B1 (en) * 1999-08-10 2004-05-18 Oshri Even-Zohar Customizable animation system
US7106898B2 (en) * 1999-12-06 2006-09-12 California Institute Of Technology 3D scanning using shadows
US8872899B2 (en) * 2004-07-30 2014-10-28 Extreme Reality Ltd. Method circuit and system for human to machine interfacing by hand gestures
US20080221487A1 (en) * 2007-03-07 2008-09-11 Motek Bv Method for real time interactive visualization of muscle forces and joint torques in the human body
US8600166B2 (en) * 2009-11-06 2013-12-03 Sony Corporation Real time hand tracking, pose classification and interface control
US9478058B2 (en) * 2012-08-06 2016-10-25 CELSYS, Inc. Object correcting apparatus and method and computer-readable recording medium
US9936128B2 (en) * 2015-05-20 2018-04-03 Google Llc Automatic detection of panoramic gestures
US20170193289A1 (en) 2015-12-31 2017-07-06 Microsoft Technology Licensing, Llc Transform lightweight skeleton and using inverse kinematics to produce articulate skeleton

Also Published As

Publication number Publication date
US20230108253A1 (en) 2023-04-06
EP3724810A1 (en) 2020-10-21
KR20200011425A (en) 2020-02-03
CN111492367A (en) 2020-08-04
US20190180473A1 (en) 2019-06-13
KR102329781B1 (en) 2021-11-22
WO2019118058A1 (en) 2019-06-20
US11544871B2 (en) 2023-01-03
CN111492367B (en) 2024-03-05

Similar Documents

Publication Publication Date Title
CN111492367B (en) Method and apparatus for generating three-dimensional gestures of a hand
Yao et al. Contour model-based hand-gesture recognition using the Kinect sensor
JP6079832B2 (en) Human computer interaction system, hand-to-hand pointing point positioning method, and finger gesture determination method
CN111694429A (en) Virtual object driving method and device, electronic equipment and readable storage
Hasan et al. Hand gesture modeling and recognition using geometric features: a review
Prisacariu et al. 3D hand tracking for human computer interaction
Wen et al. A robust method of detecting hand gestures using depth sensors
EP2956908A1 (en) Model-based multi-hypothesis target tracker
WO2018223155A1 (en) Hand tracking based on articulated distance field
JP6618276B2 (en) Information processing apparatus, control method therefor, program, and storage medium
JP2019096113A (en) Processing device, method and program relating to keypoint data
JP6562752B2 (en) Information processing apparatus, control method therefor, program, and storage medium
Jiang et al. independent hand gesture recognition with Kinect
Park et al. Hand tracking with a near-range depth camera for virtual object manipulation in an wearable augmented reality
JP2017084307A (en) Information processing device, control method therefor, program, and recording medium
Roy et al. Real time hand gesture based user friendly human computer interaction system
Liang et al. Hand pose estimation by combining fingertip tracking and articulated ICP
Ghosh et al. Real-time 3d markerless multiple hand detection and tracking for human computer interaction applications
Bai et al. Poster: Markerless fingertip-based 3D interaction for handheld augmented reality in a small workspace
Yao et al. Real-time hand gesture recognition using RGB-D sensor
Ehlers et al. Self-scaling Kinematic Hand Skeleton for Real-time 3D Hand-finger Pose Estimation.
JP2017049662A (en) Information processing apparatus, control method thereof, program, and storage medium
CN112183155A (en) Method and device for establishing action posture library, generating action posture and identifying action posture
KR102346294B1 (en) Method, system and non-transitory computer-readable recording medium for estimating user's gesture from 2d images
She et al. An Integrated Approach of Real-time Hand Gesture Recognition Based on Feature Points

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination