WO2022226432A1

WO2022226432A1 - Hand gesture detection methods and systems with hand prediction

Info

Publication number: WO2022226432A1
Application number: PCT/US2022/030356
Authority: WO
Inventors: Yang Zhou
Original assignee: Innopeak Technology, Inc.
Priority date: 2022-05-20
Filing date: 2022-05-20
Publication date: 2022-10-27

Abstract

The present invention is directed to extended reality systems and methods. In an exemplary embodiment, two-dimensional hand images are captured. Two-dimensional key points are identified using the two-dimensional hand images. The two-dimensional key points are mapped to three-dimensional key points. Hand prediction is performed using the threedimensional key points. There are other embodiments as well.

Description

HAND GESTURE DETECTION METHODS AND SYSTEMS WITH HAND

PREDICTION

BACKGROUND OF THE INVENTION [0001] The present invention is directed to extended reality systems and methods.

[0002] Over the last decade, extended reality (XR) devices — including both augmented reality (AR) devices and virtual reality (VR) devices — have become increasingly popular. Important design considerations and challenges for XR devices include performance, cost, and power consumption. Due to various limitations, existing XR devices have been inadequate for reasons further explained below.

[0003] It is desired to have new and improved XR systems and methods thereof.

BRIEF SUMMARY OF THE INVENTION [0004] The present invention is directed to extended reality systems and methods. In an exemplary embodiment, two-dimensional hand images are captured. Two-dimensional key points are identified using the two-dimensional hand images. The two-dimensional key points are mapped to three-dimensional key points. Hand prediction is performed using the three-dimensional key points. There are other embodiments as well.

[0005] A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a method for hand prediction, which includes capturing a plurality of images containing at least a first hand, the plurality of images being in a two-dimensional (2D) space, the plurality of images including a current image and a previous image. The method also includes identifying a plurality of previous 2D key points using the previous image. The method also includes identifying a plurality of current 2D key points using the current image. The method also includes mapping the plurality of previous 2D key points to a plurality of previous three- dimensional (3D) key points. The method also includes mapping the plurality of current 2D key points to a plurality of current 3D key points. The method also includes generating a plurality of 3D predicted key points in a 3D space using the plurality of previous 3d key points and the plurality of current 3D key points. The method also includes mapping the plurality of 3D predicted key points to a plurality of predicted 2D key points. The method also includes identifying false hand detection using the plurality of predicted 2D key points. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

[0006] Implementations may include one or more of the following features. The method may include projecting the plurality of previous 2D key points to the 3D space. The plurality of images further contains at least a second hand, the plurality of 3d predicted key points being associated with the first hand and the second hand. The method may include defining a bound box enclosing the first hand in the current image and tracking the bound box using the plurality of predicted 2D key points. The bound box is defined with a left top corner location and a bottom right comer location, the bound box including at least a ten percent margin area surrounding the first hand. The plurality of 3D predicted key points are assigned confidence values, the method may include detecting a non-hand object using the confidence values. The method may include tracking the first hand using the plurality of predicted 2D key points. The method may include initiating a hand tracking process upon detecting the first hand. The method may include calculating coordinate changes between the plurality of previous 3D key points and the plurality of current 3D key points, each of the 3D key points may include three coordinates. The method may include a plurality of 3D vectors using the plurality of previous 3D key points and the plurality of current 3D key points. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

[0007] One general aspect includes an extended reality apparatus that includes a housing having a front side and a rear side. The apparatus also includes a first camera configured on the front side, the first camera being configured to capture a plurality of two-dimensional (2D) images at a predefined frame rate, the plurality of 2D images including a current image and a previous image. The apparatus also includes a display configured on the rear side of the housing. The apparatus also includes a memory coupled to the first camera and being configured to store the plurality of 2D images. The apparatus also includes a processor coupled to the memory. The apparatus also includes where the processor is configured to: identify a plurality of 2D key points associated with a hand using at least the current image and the previous image, mapping the plurality of 2D key points to plurality of three- dimensional (3D) key points, providing a hand prediction using at least the plurality of 3D key points. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

[0008] Implementations may include one or more of the following features. The processor may include a neural processing unit configured to detect the hand using a first image captured by the first camera. The apparatus may include a second camera, the first camera being positioned on a left side of the housing, the second camera being positioned on a right side of the housing. The processor is further configured to track the hand. The processor is further configured to: generate a plurality of predicted 3d key points, map the plurality of predicted 3d key points to a plurality of predicted 2d key points. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

[0009] One general aspect includes a method for hand tracking, which includes capturing a first image. The method also includes detecting at least a first hand in the first image. The method also includes capturing a plurality of images containing at least the first hand, the plurality of images being in a two-dimensional (2D) space, the plurality of images including a current image and a previous image. The method also includes identifying a plurality of 2D key points using the plurality of images. The method also includes mapping the plurality of 2D key points to a plurality of three-dimensional (3D) key points. The method also includes generating a plurality of 3d predicted key points in a 3d space using the plurality of 3d key points. The method also includes mapping the plurality of 3d predicted key points to a plurality of predicted 2d key points. The method also includes tracking the first hand using the plurality of 3d predicted key points. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. [0010] Implementations may include one or more of the following features. The method may include calculating confidence values for the plurality of 3d predicted key points and identifying false hand detection using at least the confidence values. The method may include identifying a change between the first image and a second image. The method may include performing a deep learning process using the first image for hand detection. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

[0011] It is to be appreciated that embodiments of the present invention provide many advantages over conventional techniques. Among other things, hand shape prediction techniques allow for more accurate and efficient hand tracking and bound box. Additionally, h hand shape prediction techniques according to embodiments of the present invention can be performed in conjunction with hand gesture identification techniques. [0012] Embodiments of the present invention can be implemented in conjunction with existing systems and processes. For example, hand shape calibration techniques according to the present invention can be used in a wide variety of XR systems, including XR devices that are equipped with ranging components. Additionally, various techniques according to the present invention can be adopted into existing XR systems via software or firmware update. There are other benefits as well.

[0013] The present invention achieves these benefits and others in the context of known technology. However, a further understanding of the nature and advantages of the present invention may be realized by reference to the latter portions of the specification and attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS [0014] Figure 1A is a simplified diagram illustrating extended reality (XR) apparatus 115n according to embodiments of the present invention.

[0015] Figure IB is a simplified block diagram illustrating components of extended reality apparatus 115n according to embodiments of the present invention.

[0016] Figure 2 is a simplified diagram illustrating fields of view of cameras on extended reality apparatus 210 according to embodiments of the present invention.

[0017] Figure 3A is a simplified diagram illustrating key points defined on a right hand according to embodiments of the present invention.

[0018] Figure 3B is a simplified diagram illustrating an exemplary hand gesture according to embodiments of the present invention.

[0019] Figure 4 is a simplified block diagram illustrating functional blocks of extended reality apparatus according to embodiments of the present invention.

[0020] Figure 5 is a simplified block diagram illustrating function modules in a hand gesture detection algorithm according to embodiments of the present invention.

[0021] Figure 6 is a simplified flow diagram illustrating a process for method for hand prediction according to embodiments of the present invention.

[0022] Figure 7 is a simplified diagram illustrating hand prediction using 3D hand key points according to embodiments of the present invention.

[0023] Figure 8 is a simplified diagram illustrating an exemplary state machine for hand tracking according to embodiments of the present invention. DETAILED DESCRIPTION OF THE INVENTION [0024] The present invention is directed to extended reality systems and methods. In an exemplary embodiment, two-dimensional hand images are captured. Two-dimensional key points are identified using the two-dimensional hand images. The two-dimensional key points are mapped to three-dimensional key points. Hand prediction is performed using the three-dimensional key points. There are other embodiments as well.

[0025] With the advent of virtual reality and augmented reality applications, gesture-based control schemes are becoming more and more popular. In recent years, commercial depth camera-based 3D hand tracking on AR Glass has been popular, with direct 3D measurements to hands. Conventional research typically focuses on RGB camera-based hand tracking algorithms, and there has been limited research work on a practical hand tracking system as compared to an algorithm.

[0026] Among other features, the ability to accurately and efficiently reconstruct the motion of the human hand from images promises exciting new applications in immersive virtual and augmented realities, robotic control, and sign language recognition. There has been great progress in recent years, especially with the arrival of deep learning technology. However, it remains a challenging task due to unconstrained global and local pose variations, frequent occlusion, local self-similarity, and a high degree of articulation. In various hand detection processes according to the present invention, the output includes a bound box and confidence values. In hand tracking processes hand detection could be unstable, thereby lowering the overall performance and accuracy. For example, existing hand detection methods typically use convolutional network based method, which uses a small model to explain various hand data in the world, and the end result is often unsatisfactory. It is to be appreciated that hand prediction method is used — in addition to or instead of hand detection — to reduce the problems of missing bound boxes and false positive bound boxes. For example, the problem of “missing bound box” refers to missing the detection of a true hand image, and the problem of “false positive bound box” refers to incorrectly detecting non-hand objects as hands. In the hand tracking process, the threshold of restrictions for defining the bond box can lead to different results. Too many restrictions lead to a missing bound box, while too few restrictions lead to a false bond box. To properly select the threshold, a machine learning algorithm may be configured into the system. After applying the present invention, the missing bound boxes and the false-positive bound boxes in the results of real-time 3D hand tracking can be reduced to nearly zero.

[0027] According to various embodiments, hand prediction mechanisms involve using the 3D hand key points of the previous two frames (e.g., images captured by a camera(s) at the previous two timestamps), and outputting 2D bound boxes enclosing the hand and the corresponding confidence value. For example, the hand prediction results are used as inputs in bound box tracking. In certain implementations, a state machine is used for bound box tracking and determines if the hand prediction should be used or not.

[0028] The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

[0029] In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

[0030] The reader’ s attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification, (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

[0031] Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of’ or “act of’ in the Claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.

[0032] Please note, if used, the labels left, right, front, back, top, bottom, forward, reverse, clockwise and counter-clockwise have been used for convenience purposes only and are not intended to imply any particular fixed direction. Instead, they are used to reflect relative locations and/or directions between various portions of an object. [0033] Figure 1A is a simplified diagram (top view) illustrating extended reality apparatus 115n according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. It is to be understood that the term “extended reality” (ER) is broadly defined, which includes virtual reality (VR), augmented reality (AR), and/or other similar technologies. For example, ER apparatus 115 as shown can be configured as VR, AR, or others. Depending on the specific implementation, ER apparatus 115 may include small housing for AR applications or relatively larger housing for VR applications. Cameras 180 A and 180B are configured on the front side of apparatus 115. For example, cameras 180A and 180B are respectively mounted on the left and right sides of the ER apparatus 115. In various applications, additional cameras may be configured below cameras 180A and 180B to provide an additional field of view and range estimation accuracy. For example, cameras 180A and 180B both include ultrawide angle or fisheye lenses that offer large fields of view, and they share a common field of view. Due to the placement of the two cameras, the parallax — which is a known factor — of the two cameras can be used to estimate subject distance. Display 185 is configured on the backside of ER apparatus 115. For example, display 185 may be a semitransparent display that overlays information on an optical lens in AR applications. In VR implementations, display 185 may include a non-transparent display. [0034] Figure IB is a simplified block diagram illustrating components of extended reality apparatus 115 according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. In some embodiments, an XR headset (e.g., AR headset 115n as shown, or the like) might include, without limitation, at least one of processor 150, data store 155, speaker(s) or earpiece(s) 160, eye-tracking sensor(s) 165, light source(s) 170, audio sensor(s) or microphone(s) 175, front or front-facing cameras 180, display 185, and/or communication interfacel90, and/or the like.

[0035] In some instances, the processor 150 might communicatively be coupled (e.g., via a bus, via wired connectors, or via electrical pathways (e.g., traces and/or pads, etc.) of printed circuit boards ("PCBs") or integrated circuits ("ICs"), and/or the like) to each of one or more of the data store 155, the speaker(s) or earpiece(s) 160, the eye tracking sensor(s) 165, the light source(s) 170, the audio sensor(s) or microphone(s) 175, the front camera(s) 180, display 185, and/or the communication interface 190, and/or the like. In various embodiments, data store 155 may include dynamic random-access memory (DRAM) and/or non-volatile memory. For example, images captured by cameras 180 may be temporarily stored at the DRAM for processing, and executable instructions (e.g., hand shape calibration and hand gesture identification algorithms) may be stored at the non-volatile memory. In various embodiments, data store 155 may be implemented as a part of the processor 150 in a system-on-chip (SoC) arrangement.

[0036] The eye tracking sensor(s) 165 - which might include, without limitation, at least one of one or more cameras, one or more motion sensors, or one or more tracking sensors, and/or the like - track where the user's eyes are looking, which in conjunction with computation processing by the processor 150 to compare with images or videos taken in front of the ER apparatus 115. The audio sensor(s) 175 might include, but is not limited to, microphones, sound sensors, noise sensors, and/or the like, and might be used to receive or capture voice signals, sound signals, and/or noise signals, or the like.

[0037] The front cameras 180 include their respective lenses and sensors used to capture images or video of an area in front of the ER apparatus 115. For example, front cameras 180 include cameras 180A and 180B as shown in Figure IB, and they are configured respectively on the left and right sides of the housing. In various implementations, the sensors of the front cameras may be low-resolution monochrome sensors, which are not only energy efficient (without color filter and color processing thereof), but also relatively inexpensive, both in terms of device size and cost.

[0038] Figure 2 is a simplified diagram illustrating fields of view of cameras on extended reality apparatus 210 according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Left camera 180A is mounted on the left side of the ER apparatus housing 210, and right camera 180B is mounted on the right side of the ER apparatus housing 210. Each of the cameras has an ultrawide angle or fisheye lens that is capable of capturing a wide field of view. For example, camera 180A has a field of view on the left with an angle of 0L, and camera 180B has a field of view on the right with an angle of OR. Hands or other objects can be detected by either camera.

[0039] Hand detection is a prerequisite for hand gesture identification. An XR device detects a hand when at least one of its cameras captures a hand image. For example, hand 221 is detectable only through images captured camera by 180A, and hand 223 is detectable only by images captured by camera 180B. When hand 222 is positioned at region 220, it is within the common field of view of both cameras 180A and 180B, and images from either camera can be used for hand detection, and depth calculation and other computations may also be performed. In use, the hand might move in and out of the FOVs of both cameras. Hand detection, hand tracking, and hand prediction processes may be performed. For example, as hand 221 moves to the left and out of the FOV of camera 180A, a hand prediction mechanism according to an embodiment of the prevention invention may reduce “false positive” hand identifications. Additionally, as a hand is being tracked, a bound box (i.e., area within an image captured by camera 180A or camera 180B) enclosing the hand is updated using — among other techniques — hand prediction techniques.

[0040] Now referring back to Figure IB. The processor 150, in various embodiments, is configured to perform hand detection and hand prediction processes. In various embodiments, processor 150 includes central processing unit (CPU), graphic processing unit (GPU), and neural processing unit (NPU). For example, hand detection processes may be performed by NPU, and hand prediction may be performed by CPU and/or NPU.

[0041] In AR applications, the field of view of each front camera 180 overlaps with a field of view of the eye of the user 120, the captured images or video. The display screen(s) and/or projector(s) 185 may be used to display or project the generated image overlays (and/or to display a composite image or video that combines the generated image overlays superimposed over images or video of the actual area). The communication interface 190 provides wired or wireless communication with other devices and/or networks. For example, communication interface 190 may be connected to a computer for tether operations, where the computer provides the processing power needed for graphic-intensive applications.

[0042] Figure 3A is a simplified diagram illustrating key points defined on a right hand according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. As an example, key points 0-19 are assigned to different regions of a user’s right hand. Based on the locations of these key points, hand gestures may be determined. For example, by identifying the relative positions of these key points from 0 to 19, different hand gestures can be determined. In various embodiments, the relative positions (e.g., as measured in pixel distances) of the key points are calibrated during the initial hand shape calibration process, which allows for more accuracy during hand gesture identification processes.

[0043] Figure 3B is a simplified diagram illustrating an exemplary hand gesture according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. As an example, hand images captured by the device cameras as two-dimensional (2D) images may be mapped into a 3D space for processing. In various embodiments, depth information and calibration parameters (e.g., hand shape) may be used to map 2D images into the 3D space. Since hands travel in 3D space, hand prediction processes are performed using 3D vectors that are based on the 3D coordinates. As shown in Figure 3B, for the hand gesture on the left, 21 (i.e., 0-20) key points are obtained. As an example, a left-hand gesture is translated into 3D key points, through which the extended reality device may identify the gesture as an “OK” sign. Additional processes may be performed as well.

[0044] Figure 4 is a simplified block diagram illustrating functional blocks of extended reality apparatus according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. The system pipeline of an extended reality device 400 in Figure 4 may include the functional components, which may correspond to various parts of device 115 in Figure IB, as shown. On the front-end, the sensors — such as the right fisheye camera 401, left fisheye camera 402, and inertia measuring unit (IMU) 403 — capture images and other information and send the captured data to sensor processor 411 (e.g., a lightweight embedding CPU, such as processor 150 in Figure IB). The sensor processor 411 performs various simple image processing (e.g., denoising, exposure control, and others), and then pack the processed data to an XR server 421. For example, XR server 421 is implemented to function as a data consumer and to deliver the data to various algorithms, such as 3D hand tracking 431, 6DoF 441, and others, i.e., 451. The position of a 3D hand tracking algorithm 431 is configured after XR server 421 as shown, and it is followed by APP module 432. In various embodiments, 3D hand tracking algorithm 431 utilizes hand detection and hand prediction techniques.

[0045] The unity APP 432 receives the hand tracking results for different purposes, such as gaming, manipulation of virtual objects, and others. Additional functions such as object compositor 433, system render 434, asynchronous time warp (ATW) 435, and display 436 may be configured as shown. Depending on the implementation, there may be other functional blocks as well.

[0046] Figure 5 is a simplified block diagram illustrating function modules in a hand gesture detection and prediction algorithms according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. According to various embodiments, a hand tracking system 500, which may be implemented with device 150 shown in Figure IB, uses dual hand tracking processes for left (1) and right (r) hand. For example, system 500 provides real-time (i.e., 30 frames per second) hand tracking on edge device, and it operates as a 3D hand tracking system. Stereo fisheye cameras are used to obtain left and right images with known parallax calibration. The system includes various sets of algorithms that include hand acquisition 501, hand detection 502, hand prediction 503r and 5031, bound box tracking 504r and 5041, 2D hand key point detection 505r (506r) and 5051 (5061), 3D hand key point detection 507r and 5071, hand gesture recognition 508r and 5081, and hand shape calibration 570. In various embodiments, hand prediction algorithm is a part of bound box tracking process, [0047] Figure 6 is a simplified flow diagram illustrating a process for a method for hand prediction according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, modified, replaced, overlapped, and/or rearranged, and should not limit the scope of the claims.

[0048] At step 610, two-dimensional (2D) hand images are captured, and the captured images include at least a previous image and a current image. According to various embodiments, images are captured by 2D cameras, and distance information may or may not be available. For example, the terms “previous image” and “current image” refer to two images that are captured at consecutive time intervals, where the current image is currently being processed, and the previous image is the most recent. For example, captured images are stored in a memory in chronological order, and thus can be easily retrieved for processing.

[0049] At step 620, the previous and current 2D key points are respectively identified by using the previous and current images. In various embodiments, the 2D key points are first used in a hand detection process, and hand prediction processes are only performed after one or more hands are detected. Depending on the implementation, various types of image recognition algorithms may be used. For example, a machine learning algorithm may be employed for the image identifying process. A bond box enclosing the first hand in the current image is defined with a left top corner location and bottom right comer location. According to various embodiments, the bond box includes at least a ten percent margin area surrounding the first hand.

[0050] At step 630, the previous and current 2D key points are mapped to previous and current 3D key points respectively. For example, the hand tracking process is initiated upon detecting a hand, and since the hand moves in 3D space — while only the 2D images of the hand are captured — hand tracking is performed in 3D space. In various embodiments, 2D key points are projected into 3D space using information such as hand shape, hand distance, and hand size. For example, the use of 3D key points in hand prediction is illustrated in Figure 7.

[0051] Figure 7 is a simplified diagram illustrating hand prediction using 3D hand key points according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. The 3D hand points as shown, from both the previous image P_t-i and current image P_t, are obtained by converting 2D hand key points from images captured by cameras. As shown in Figure 7, the previous 3D hand key points are in frame P_t-i, and the current 3D key points are in frame P_t. Both the previous 3D hand key points and current 3D key points are generated from key 2D key points at step 630.

[0052] At step 640, step 630 generated previous 3D key points and current 3D key points are used to generate a set of predicted 3D key points, which corresponds to a predicted hand location. In various embodiments, vectors between two corresponding previous and current 3D hand key points are calculated and used to generate the predicated 3D key points. For example, each of these vectors includes change in coordinate values of x, y, and z axis. In a specific embodiment, where the direction of hand movement is assumed to be substantially linear and the speed of hand movement is assumed to be substantially constant, the predicted 3D key points can be easily extrapolated (e.g., applying the same differences of each key point coordinate). For example, with a previous 3D key point being (1, 1, 1) and a current 3D key point being (4, 5, 6), the predicted 3D key point would be (7, 9, 11). It is to be appreciated that linear extrapolation is a relatively simple calculation and can be performed by various types of processors. In various embodiments, other types of extrapolation mechanisms may be used, and more than two sets of 3D key points may be used for prediction. The calculation for 3D key points prediction can be performed in real-time and satisfy a predetermined performance requirement (e.g., 30 frames per second or faster). In various embodiments, confidence values are calculated using convolutional neural network, which may be performed by one or more NPUs.

[0053] As an example, 3D key points prediction is illustrated in Figure 7. As shown, the inputs of the prediction process are 3D hand key points of two timestamps, which include previous frame (t-1) and current frame (t). For example, image P_t-i is the image captured by the camera at previous timestamp t-1, and image P_t is the image captured by camara at the current timestamp t. The extrapolation is used to predict the hand 3D key points on next time stamp t+1 (e.g., P_t+i = 2*P_t - P_{t i}). For example, the 3D hand key points are in the format of 21 key points in 3D rectilinear space (i.e., x, y, z), wherein each point (x, y, z) is an individual 3D position of the key point. As explained above, when the extrapolation formula P_t+i = 2*Pn - P_t is used, it is assumed that the hand is moving at constant velocity. More complicated formulae can be used to apply acceleration and directional changes to the prediction.

[0054] In addition to generating the predicted 3D key points, confidence values may be calculated for each of the predicted key points. For example, each predicted 3D key point is assigned with a confidence value between 0 and 1. The assigned confidence values can be used in various ways. For example, a predicted key point with a low confidence value may be discarded. Depending on the implementation, confidence values for the predicted 3D key points can be calculated in many ways. The total confidence value for 21 key points is between 0 and 21, and the predicted frame may be discarded if the total confidence value is below a predetermined threshold value.

[0055] Now referring back to Figure 6. At step 650, the predicted 3D key points are mapped to predicted 2D key points. Additionally, confidence values are assigned to the predicted 3D key points. It is to be understood that depending on the application and use,

3D or 2D key points may be used. For example, for hand gesture identification, 3D key points are used (e.g., see Figure 5, blocks 507 and 508). Hand prediction can be used for bound box tracking (e.g., see Figure 5, blocks 503 and 504), and for this application, 2D key points are more useful. For example, the predicted 3D key points are converted to 2D key points that can be used in bound box tracking. For example, Figure 7 illustrates that the predicted 3D key points for frame P_t+i are mapped to 2D key points. In various embodiments, the predicted 3D key points are projected to 2D space. As shown in Figure 7, the predicted image P_t+i (e.g., 21 key points in (x, y, z) space) is projected to 2D hand key points (21*(u, v)) as output.

[0056] At step 660, the predicted 2D key points are used to track the bound box. In various embodiments, the total confidence value of the predicted 2D or 3D key points may be used to identify false hand detection. For example, a low confidence value (e.g., below 11 out of 21) could indicate that the predicted 3D key points (and the corresponding 2D key points) are likely to be incorrect — possibly resulting from false hand detection — and they should not be used for applications such as bound box tracking and hand gesture detection. [0057] The predicted 2D key points may be used to facilitate bound box tracking in various ways. As explained above, hand detection could be unreliable for various reasons, and the predicted key points — with their confidence values — can be used to identify “false positive” hand detection. In various implementations, hand prediction mechanisms according to the embodiments of the present invention can be both accurate and efficient. For example, the hand prediction methods performed in 3D space that is used — in addition to or instead of hand detection — can reduce the problems of missing bound box and false positive bound box. In various embodiments, the threshold of restrictions (e.g., using confidence values) for defining the bond box can lead to different results and may be calibrated depending on the use case (e.g., dark vs. bright environment). The threshold of incorrect hand detection may use a machine learning algorithm. It is to be appreciated that the hand prediction process when used in conjunction with bound box can greatly improve performance; the missing bound boxes and the false-positive bound boxes in the results of real-time 3D hand tracking can be reduced to nearly zero.

[0058] In various embodiments, the predicted 2D key points are used to define and update bound box size and location. For example, bound box is delineated around the predicted 2D key points with a predetermined margin (e.g., 10 to 20% around the outermost key points). The predicated bound box may change size and shape as well (e.g., when a fist changes to a palm, or vice versa).

[0059] As an example, steps illustrated in Figure 6 may be performed by the XR device 115 illustrated in Figure IB. Camera module 180 may be used to capture images, as described in step 610. Hand detection and hand prediction processes may be performed by the processor 150. In a specific implementation, an exemplary hand prediction process is illustrated in Figure 8. Figure 8 is a simplified diagram illustrating an exemplary state machine for hand tracking according to embodiments of the present invention. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps illustrated in Figure 8 may be added, removed, repeated, modified, overlapped, rearranged, and replaced, which should not limit the scope of the claims.

[0060] At block 810, the hand tracking is in a “Dead” state, wherein processes such as hand detection and hand tracking are not performed. For example, an XR device may be at the 810 state when it is idle (e.g., images stay the same, no movement or other types of change). The XR device would stay in this “Dead” state until it is activated (e.g., movement or change of images detected in the left or right image).

[0061] At block 820, the XR device is initiated and ready to perform various tasks such as hand detection and hand tracking. As a part of the initialization process at block 820, cameras are active and capturing images, which are stored for processing. [0062] At block 830, hand tracking is performed, which includes hand detection (block 831) and hand prediction (block 832). Hand detection 831 may be repeated until a hand is detected in one of the left and right images. As explained above, hand detection process 881 may incorrectly identify a hand. Hand prediction process 832 is performed once a hand is detected, using at least a previous frame and a current frame. As a part of the hand tracking process, hand prediction 832 may be repeated until the hand is lost or not within a bound box, within which the hand prediction process can be performed. For example, if the hand that is being tracked is no longer in the bound box, hand detection 831 may be performed to define a new bound box; hand detection 831 may also determine that the hand is no longer present, and proceed to block 840. In this invention, the hand detections of beginning two timestamps are mainly used for hand tracking, then the hand predictions are heavily used. [0063] At block 840, the XR device is in the “Lost” state, where the hand is no longer detected. For example, the hand prediction process 832 may identify a “false positive” hand detection and determines that no hand is present. In the lost state, various XR components and processes may still be active to detect hand movements, which may move back to block 830 to perform hand detection if a movement is detected in the images (e.g., a difference between two consecutive time stamps). For example, block 840 runs a loop (as shown) for a predetermined time before moving to the “Dead” state in block 810. In certain embodiments, blocks 810 and 840 may be implemented (or programmed) as the same state. [0064] As an example, pseudo code for a hand prediction process mechanism according to the present invitation is provided below: bool handPredictionValid() {

// Continus detection score over 0.97 for several frames, the hand is stably tracked, we use hand prediction, in place for hand detection warmup = detecton.conf > 0.97 ? warmup + 1 : 0; return warmup >= 3;

}

// When hand prediction will be used, inside a thread while loop, right hand is symmetric to left hand void leftHandPoseO { while (!algStop) { if (lefthand->handPredictionValid()) { prediction.3D = getHandPredictioin3D(3DKp[t], (3DKp[t-l]) prediction.bbox = getLeftHandPredictioinBBox(prediction.3D) lefthand.bbox = prediction.bbox } else { lefthand.bbox = detection.bbox

}

// Apply 2D hand CNN leftCamera2DKp, rightCamera2DKp = run2DCnn(leftCameraImage, rightCameralmage)

// Apply 3D hand CNN, get the 3DKp on time frame t 3DKp(t+l) = run3DCnn(leftCameraImage, rightCameralmage, leftCamera2DKp, rightCamera2DKp)

}

// How the prediction is computed void getHandPredictioin3D() {

3DKp(t+l) = 2 * 3DKp(t) - 3DKp(t-l)

}

// The projection and bbox margin void getLeftHandPredictioinBBox() {

2DKp(t+l) = camera.project(3DKp(t+l)) xmin = 2DKp(t+l).x.min ymin = 2DKp(t+l).x.min xmax = 2DKp(t+l).y.max ymax = 2DKp(t+l).y.max wm = (xmax - xmin) * 0.2 // width margin hm = (ymax - ymin) * 0.2 // height margin xmin = xmin - wm ymin = ymin - hm xmax = xmax + wm ymax = ymax + hm

} [0065] As an example, state machine 800 may be stored as instructions executed by a processor, which may include different computational cores (e.g., NPU and GPU). For example, hand detection process 831 and hand prediction process 832 may be performed by an NPU.

[0066] Now referring back to Figure 5. System 500 enables a set of outputs including 3D hand key points in blocks 507r and 5071. For example, hand key points are illustrated in Figures 3A and 3B. It is to be noted that while the captured images are two-dimensional, hand gesture detection is performed using 3D hand key points. For example, results and/or intermediate calculations obtained in blocks 5031 and 503r may be used in hand gesture identification processes. For example, 2D to 3D mapping may be performed between blocks 5051 and 5071 or obtained from blocks 5031 and 503r. Calibration parameters may be used in the mapping process.

[0067] System 500, as shown, includes five components: main thread 501, hand detection thread 502, right hand thread 502r, left hand thread 5021, and hand shape calibration thread 570. These components interact with one another.

[0068] As an example, main thread 501 is used for copying the images captured by right fisheye camera 501r and left fisheye camera 5011 to the local memory of the system.

[0069] The hand detection thread 502 waits for the right fisheye image and left fisheye image. Once the images have been received, the hand detection thread 502 may use a hand detection convolutional network on the right fisheye image and left fisheye image. For example, hand detection thread 502 outputs a confidence value and bounding box for the right hand and left hand.

[0070] The right hand thread 502r and left hand thread 5021 may be implemented symmetrically, and they respectively receive thread inputs from the right fisheye image and the left fisheye image. They also rely on their respective bound box tracking (i.e., blocks 504r and 5041). For example, confidence values and bound box tracking may be used to generate 3D hand key points that allows for the identification of hand gesture types.

[0071] The hand bound box threads 504r and 5041 provide tracking, and their inputs the bound box sizes (and shapes), confidence values, and bound box prediction values from hand prediction blocks 503r and 5031. The hand bound box threads 504r and 5041 output, among other things, hand status (e.g., does it exist or not) and bound box data.

[0072] As shown in Figure 5, if a hand exists (as determined in block 504r and 5041), the 2D hand key point detection (e.g., blocks 505r and/or 5051) crops the hand out using the bound box from hand bound box tracking on the captured images. For example, the captured images are cropped out are resized to a predetermined size (e.g., 96 pixels by 96 pixels, which is a small size that is optimized for efficient processing). The 2D hand key point detection (e.g., blocks 505r and 5051) uses a 2D key point detection convolutional network on the resized image, and outputs the 2D hand key points. As described above, 2D key points, if exist, are mapped to 3D key points for hand gesture detection. [0073] While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention which is defined by the appended claims.

Claims

WHAT IS CLAIMED IS:

1. A method for hand prediction, the method comprising: capturing a plurality of images containing at least a first hand, the plurality of images being in a two-dimensional (2D) space, the plurality of images including a current image and a previous image; identifying a plurality of previous 2D key points using the previous image; identifying a plurality of current 2D key points using the current image; mapping the plurality of previous 2D key points to a plurality of previous three- dimensional (3D) key points; mapping the plurality of current 2D key points to a plurality of current 3D key points; generating a plurality of 3D predicted key points in a 3D space using the plurality of previous 3D key points and the plurality of current 3D key points; mapping the plurality of 3D predicted key points to a plurality of predicted 2D key points; and deciding a potential false hand detection using the plurality of predicted 2D key points.

2. The method of claim 1 further comprising projecting the plurality of previous 2D key points to the 3D space.

3. The method of claim 2 wherein the plurality of images further contains at least a second hand, the plurality of 3D predicted key points being associated with the first hand and the second hand.

4. The method of claim 1 further comprising: defining a bound box enclosing the first hand in the current image; tracking the bound box using the plurality of predicted 2D key points.

5. The method of claim 4 wherein the bound box is defined with a left top comer location and a bottom right comer location, the bound box including at least a ten percent margin area surrounding the first hand.

6. The method of claim 1 wherein the plurality of 3D predicted key points are assigned confidence values, the method further comprising detecting a non-hand object using the confidence values.

7. The method of claim 1 further comprising tracking the first hand using the plurality of predicted 2D key points.

8. The method of claim 1 further comprising initiating a hand tracking process upon detecting the first hand.

9. The method of claim 1 further comprising calculating coordinate changes between the plurality of previous 3D key points and the plurality of current 3D key points, each of the 3D key points comprises three coordinates.

10. The method of claim 1 further comprising a plurality of 3D vectors using the plurality of previous 3D key points and the plurality of current 3D key points.

11. An extended reality apparatus comprising: a housing having a front side and a rear side; a first camera configured on the front side, the first camera being configured to capture a plurality of two-dimensional (2D) images at a predefined frame rate, the plurality of 2D images including a current image and a previous image; a display configured on the rear side of the housing; a memory coupled to first camera and being configured to store the plurality of 2D images; and a processor coupled to the memory; wherein the processor is configured to: identify a plurality of 2D key points associated with a hand using at least the current image and the previous image; map the plurality of 2D key points to plurality of three-dimensional (3D) key points; and provide a hand prediction using at least the plurality of 3D key points.

12. The apparatus of claim 11 wherein the processor comprises a neural processing unit configured to detect the hand using a first image captured by the first camera.

13. The apparatus of claim 11 further comprising a second camera, the first camera being positioned on a left side of the housing, the second camera being positioned on a right side of the housing.

14. The apparatus of claim 11 wherein the processor is further configured to track the hand.

15. The apparatus of claim 11 wherein the processor is further configured to: generate a plurality of predicted 3D key points; map the plurality of predicted 3D key points to a plurality of predicted 2D key points.

16. A method for hand tracking, the method comprising: capturing a first image; detecting at least a first hand in the first image; capturing a plurality of images containing at least the first hand, the plurality of images being in a two-dimensional (2D) space, the plurality of images including a current image and a previous image; identifying a plurality of 2D key points using the plurality of images; mapping the plurality of 2D key points to a plurality of three-dimensional (3D) key points; generating a plurality of 3D predicted key points in a 3D space using the plurality of 3D key points; mapping the plurality of 3D predicted key points to a plurality of predicted 2D key points; and tracking the first hand using the plurality of 3D predicted key points.

17. The method of claim 16 further comprising: calculating confidence values for the plurality of 3D predicted key points; identifying false hand detection using at least the confidence values.

18. The method of claim 16 further comprising identifying a change between the first image and a second image.

19. The method of claim 16 further comprising performing a deep learning process using the first image for hand detection.