WO2022256189A1

WO2022256189A1 - Hand gesture detection methods and systems with optimized hand detection

Info

Publication number: WO2022256189A1
Application number: PCT/US2022/030353
Authority: WO
Inventors: Yang Zhou
Original assignee: Innopeak Technology, Inc.
Priority date: 2022-05-20
Filing date: 2022-05-20
Publication date: 2022-12-08

Abstract

The present invention is directed to extended reality systems and methods. In an exemplary embodiment, hand images that are captured by at least two cameras are used in a hand detection process, which is performed on the images that are alternatively seletected from the images captured by the at least two cameras. Hand tracking is performed once a hand is detected. There are other embodiments as well.

Description

HAND GESTURE DETECTION METHODS AND SYSTEMS WITH OPTIMIZED HAND DETECTION BACKGROUND OF THE INVENTION [0001] The present invention is directed to extended reality systems and methods. [0002] Over the last decade, extended reality (XR) devices—including both augmented reality (AR) devices and virtual reality (VR) devices—have become increasingly popular. Important design considerations and challenges for XR devices include performance, cost, and power consumption. Due to various limitations, existing XR devices have been inadequate for reasons further explained below. [0003] It is desired to have new and improved XR systems and methods thereof. BRIEF SUMMARY OF THE INVENTION [0004] The present invention is directed to extended reality systems and methods. In an exemplary embodiment, hand images that are captured by at least two cameras are used in a hand detection process, which is performed on the images that are alternatively selected from the images captured by the at least two cameras. Hand tracking is performed once a hand is detected. There are other embodiments as well. [0005] A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a method for hand shape detection. The method also includes capturing a first plurality of images by a left camera. The method also includes capturing a second plurality of images by a right camera. The method also includes storing the first plurality of images and the second plurality of images at a memory. The method also includes selecting alternatively between the first plurality of images and the second plurality of images, the first plurality of images including a first image associated with a first timestamp, the second plurality images including a second image associated with a second timestamp, the first timestamp and the second timestamp being consecutive timestamps. The method also includes performing a first hand detection process using the first image. The method also includes performing a second hand detection process using the second image based on an absence of a hand detected in the first hand detection process. The method also includes performing a hand prediction process if the hand is detected in the first hand detection process. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. [0006] Implementations may include one or more of the following features. The hand prediction process uses the first image and the second image. The method further includes generating a frame index, the frame index being updated based at least on timestamps. The first hand detection process is performed by a neural processing unit. The method may include initiating a hand tracking process using the first image and the second image if the hand is detected. The method may include performing a third hand detection process if the hand is determined to be lost. The method may include performing a third hand detection process using a third image and a fourth hand detection process using a fourth image, the first plurality of images including the third image, the second plurality of images including the fourth image. The method may include tracking the hand in a bound box if a hand is detected. The first detection process and the second hand detection are performed within a predetermined time interval associated frame rate of at least twenty frames per second. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium. [0007] One general aspect includes an extended reality apparatus that includes a housing, a housing may include a front side and a rear side. The apparatus also includes a left camera may include a first sensor and a first lens, the first lens being characterized by a first field of view, the left camera being configured on a left region of the front side, the left camera being configured to capture a first plurality of images, the first plurality of images including a first image having a first timestamp. The apparatus also includes a right camera may include a second sensor and a second lens, the second lens being characterized by a second field of view, the right camera being configured on a right region of the front side, the second field of view sharing a common field of view with the first field of view, the right camera being configured to capture a second plurality of images, the second plurality of images including a second image having a second timestamp, a difference between the first timestamp and the second timestamp being within a predetermined interval. The apparatus also includes a display configured on the rear side of the housing. The apparatus also includes a memory coupled to the first sensor and the second sensor and being configured to store the first plurality of images and the second plurality of images. The apparatus also includes a first processor coupled to the memory. The apparatus also includes a second processor coupled to the first processor and the memory, the second processor including a neural processing unit. The apparatus also includes where: the first processor is configured to select alternatively between the first plurality of images and the second plurality of images, the second processor is configured to perform a first hand detection process using the first image, the second processor is further configured to perform a second hand detection using the second image in response of an absence of a hand determined in the first hand detection process, and the second processor is further configured to perform a hand prediction process if the hand is detected in the first hand detection process. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. [0008] Implementations may include one or more of the following features. The apparatus where the second processor is configured to perform hand detection processes at a rate of at least 20 frames per second. The second processor is further configured to detect an absence of the hand from the first plurality of images or the second plurality of images. The first lens may include a fisheye lens. The first sensor may include a monochrome sensor having less than one million pixels. The left camera is characterized by a capture rate of at least 15 frames per second. The predetermined interval is less than 30 ms. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer- accessible medium. [0009] One general aspect includes a method for hand shape detection. The method includes capturing a first plurality of images by a left camera. The method also includes capturing a second plurality of images by a right camera. The method also includes storing the first plurality of images and the second plurality of images at a memory. The method also includes selecting alternatively between the first plurality of images and the second plurality of images. The method also includes performing a first hand detection process using a first selected image. The method also includes performing a second hand detection process using the second selected image based on detecting an absence of a hand using the first hand detection process. The method also includes tracking the hand in a bound box if the hand is detected in the first hand detection process or the second hand detection process. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. [0010] Implementations may include one or more of the following features. The method may include performing a hand prediction process. The first pluralty of images and the second plurality of images are selected using a frame index number. The method may include switching from a hand detection process to a hand prediction process if the hand is not detected in a third selected image. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium. [0011] It is to be appreciated that embodiments of the present invention provide many advantages over conventional techniques. Among other things, hand detection techniques are both efficient in both power and computations. Additionally, hand shape detection techniques according to embodiments of the present invention can be performed at high frame rates and satisfy various performance requirements of XR devices. [0012] Embodiments of the present invention can be implemented in conjunction with existing systems and processes. For example, hand shape detection techniques according to the present invention can be used in a wide variety of XR systems. Additionally, various techniques according to the present invention can be adopted into existing XR systems via software or firmware update. There are other benefits as well. [0013] The present invention achieves these benefits and others in the context of known technology. However, a further understanding of the nature and advantages of the present invention may be realized by reference to the latter portions of the specification and attached drawings. BRIEF DESCRIPTION OF THE DRAWINGS [0014] Figure 1A is a simplified diagram illustrating extended reality (XR) apparatus 115n according to embodiments of the present invention. [0015] Figure 1B is a simplified block diagram illustrating components of extended reality apparatus 115n according to embodiments of the present invention. [0016] Figure 2 is a simplified diagram illustrating fields of view of cameras on extended reality apparatus 210 according to embodiments of the present invention. [0017] Figure 3 is a simplified diagram illustrating key points defined on a right hand according to embodiments of the present invention. [0018] Figure 4 is a simplified block diagram illustrating functional blocks of extended reality apparatus according to embodiments of the present invention. [0019] Figure 5 is a simplified block diagram illustrating function modules in a hand gesture detection algorithm according to embodiments of the present invention. [0020] Figure 6 is a simplified flow diagram illustrating a process for a method for hand detection according to embodiments of the present invention. [0021] Figure 7 is a simplified flow diagram illustrating a process of a method for hand tracking according to embodiments of the present invention. DETAILED DESCRIPTION OF THE INVENTION [0022] The present invention is directed to extended reality systems and methods. In an exemplary embodiment, hand images that are captured by at least two cameras are used in a hand detection process, which is performed on the images that are alternatively selected from the images captured by the at least two cameras. Hand tracking is performed once a hand is detected. There are other embodiments as well. [0023] With the advent of virtual reality and augmented reality applications, gesture-based control schemes are becoming more and more popular. The ability to reconstruct the motion of the human hand accurately and efficiently from images promises exciting new applications in immersive virtual and augmented realities, robotic control, and sign language recognition. There has been great progress in recent years, especially with the arrival of deep learning technology. However, it remains a challenging task due to unconstrained global and local pose variations, frequent occlusion, local self-similarity, and a high degree of articulation. In recent years, commercial depth camera-based 3D hand tracking on AR Glass has been popular, with direct 3D measurements to hands. Conventional research typically focuses on RGB camera- based hand tracking algorithms, and there has been limited research work on a practical hand tracking system as compared to an algorithm. [0024] Embodiments of the present invention provide a complete hand tracking system for AR-Glass. The complete system enables features that include real-time tracking, dual-hand tracking, shape calibration, and simultaneous hand gesture recognition. In various embodiments, stereo fisheye cameras with low-power sensors are used with fisheye lenses to capture a large field of view (FoV). It is to be appreciated that embodiments of the present invention use the stereo fisheye (or ultrawide angle) cameras instead of the ToF modules for distance determination. Implemented with stereo cameras, embodiments of the present invention provide a complete system-wide solution, which may involve features such as real- time on edge devices (e.g., mobile phones, embedding devices), to enable dual-hand tracking and hand gesture recognition simultaneously, and to adjust the hand scale to match true hand. Hand detection processes according to embodiments of the invention take advantage of the combined wide FoV of two (or more) cameras, and they are performed alternatively using images from the left and right cameras. For example, left and right images are selected in a round-robin fashion for hand detection to reduce computational complexity—compared to hand detection processes using both left and right images—by half. In a specific implementation, hand detection processes operate at a speed of 18 ms per timestamp, which satisfies a desired frame rate of 30 frames per second by a large margin. It is to be appreciated the hand detection mechanisms of the present invention can be implemented with other ER configurations, such as a four-camera setup. [0025] The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. [0026] In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention. [0027] The reader’s attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification, (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features. [0028] Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of” or “act of” in the Claims herein is not intended to invoke the provisions of 35 U.S.C.112, Paragraph 6. [0029] Please note, if used, the labels left, right, front, back, top, bottom, forward, reverse, clockwise and counter-clockwise have been used for convenience purposes only and are not intended to imply any particular fixed direction. Instead, they are used to reflect relative locations and/or directions between various portions of an object. [0030] Figure 1A is a simplified diagram (top view) illustrating extended reality apparatus 115n according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. It is to be understood that the term “extended reality” (ER) is broadly defined, which includes virtual reality (VR), augmented reality (AR), and/or other similar technologies. For example, ER apparatus 115 as shown can be configured as VR, AR, or others. Depending on the specific implementation, ER apparatus 115 may include small housing for AR applications or relatively larger housing for VR applications. Cameras 180A and 180B are configured on the front side of apparatus 115. For example, cameras 180A and 180B are respectively mounted on the left and right sides of the ER apparatus 115. In various applications, additional cameras may be configured below cameras 180A and 180B to provide an additional field of view and range estimation accuracy. For example, cameras 180A and 180B both include ultrawide angle or fisheye lenses that offer large fields of view, and they share a common field of view. Due to the placement of the two cameras, the parallax—which is a known factor—of the two cameras can be used to estimate subject distance. The combined field of view of two wide angles allows for hand detection as one or more hands enter. Display 185 is configured on the backside of ER apparatus 115. For example, display 185 may be a semitransparent display that overlays information on an optical lens in AR applications. In VR implementations, display 185 may include a non-transparent display. [0031] Figure 1B is a simplified block diagram illustrating components of extended reality apparatus 115 according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. In some embodiments, an XR headset (e.g., AR headset 115n as shown, or the like) might include, without limitation, at least one of processor 150, data store 155, speaker(s) or earpiece(s) 160, eye-tracking sensor(s) 165, light source(s) 170, audio sensor(s) or microphone(s) 175, front or front-facing cameras 180, display 185, and/or communication interface190, and/or the like. [0032] In some instances, the processor 150 might communicatively be coupled (e.g., via a bus, via wired connectors, or via electrical pathways (e.g., traces and/or pads, etc.) of printed circuit boards ("PCBs") or integrated circuits ("ICs"), and/or the like) to each of one or more of the data store 155, the speaker(s) or earpiece(s) 160, the eye tracking sensor(s) 165, the light source(s) 170, the audio sensor(s) or microphone(s) 175, the front camera(s) 180, display 185, and/or the communication interface 190, and/or the like. In various embodiments, data store 155 may include dynamic random-access memory (DRAM) and/or non-volatile memory. For example, images captured by cameras 180 may be temporarily stored at the DRAM for processing, and executable instructions (e.g., hand shape calibration and hand gesture identification algorithms) may be stored at the non-volatile memory. In various embodiments, data store 155 may be implemented as a part of the processor 150 in a system-on-chip (SoC) arrangement. [0033] In various embodiments, processor 150 includes different types of processing units, such as central processing unit (CPU) 151 and neural processing unit (NPU) 152. Processor 150 may additionally include a graphic processing unit. Different types of processing units are optimized for different types of computations. For example, CPU 151 handles various types of system functions, such as managing cameras 180 and moving captured images to data store 155. NPU 152 is optimized for convolutional neural network and predictive models. In certain embodiments, NPU 152 is specifically to perform ER-related calculations, such as hand tracking, gesture identification, image recognition, and/or others. Optimized for neural processing, NPU 152 may consume a relatively large amount of power in operation, and for certain applications (such as hand detection) it may be unable to operate at real-time speed (e.g., 30 frames per second). To utilize NPU 152 efficiently and to reduce overall power consumption, embodiments of the present invention employ a round-robin scheme (as described in further details below), through which a hand detection process alternates between images captured by the left and right cameras—as opposed to process two contemporaneous images. For example, efficient allocation of NPU for hand detections allows for, among other things, real-time performance of at 30 frames per second. [0034] The eye tracking sensor(s) 165 – which might include, without limitation, at least one of one or more cameras, one or more motion sensors, or one or more tracking sensors, and/or the like – track where the user's eyes are looking, which in conjunction with computation processing by the processor 150, the computing system 105a or 105b, and/or an AI system to compare with images or videos taken in front of the ER apparatus 115. The audio sensor(s) 175 might include, but is not limited to, microphones, sound sensors, noise sensors, and/or the like, and might be used to receive or capture voice signals, sound signals, and/or noise signals, or the like. [0035] The front cameras 180 include their respective lenses and sensors used to capture images or video of an area in front of the ER apparatus 115. For example, front cameras 180 include cameras 180A and 180B as shown in Figure 1B, and they are configured respectively on the left and right sides of the housing. In various implementations, the sensors of the front cameras may be low-resolution monochrome sensors, which are not only energy efficient (without color filter and color processing thereof), but also relatively inexpensive, both in terms of device size and cost. [0036] Figure 2 is a simplified diagram illustrating fields of view of cameras on extended reality apparatus 210 according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Left camera 180A is mounted on the left side of the ER apparatus housing 210, and right camera 180B is mounted on the right side of the ER apparatus housing 210. Each of the cameras has an ultrawide angle or fisheye lens that is capable of capturing a wide field of view. For example, camera 180A has a field of view on the left with an angle of ^L, and camera 180B has a field of view on the right with an angle of ^R. Hands or other objects can be detected by either camera. For example, hand 221 is within the field of the view of camera 180A, and hand 223 is within the field of view of camera 180B. In various embodiments, when a hand is only within the FOV of a single camera, the amount of processing is limited, and calculations such as distance estimation and 3D mapping may not be available. Region 220 is within the FOV of both cameras, and additional processing can be performed for objects that are positioned within region 220. For example, region 220 may be referred to as a “common FOV”, which is defined by both FOV and distance; that is, within the predetermined distance (e.g., arm length of a user) and FOV, additional calculations and processing can be performed. Among other things, because hand 222 within the FOVs of both cameras 180A and 180B, the differences between images captured by the two cameras can be used to approximate the distance between hand 222 and housing 210. For example, when hand 222 is positioned in region 220 as shown, processes such as hand shape calibration, three-dimensional mapping, and others can be performed. In certain embodiments, a shape calibration process—as described in further detail below—is implemented as a part of the initial calibration process, and a user is prompted to position her hand to region 220 for shape calibration, and the calibration parameters generating during this process are later used for other calculations and processes. [0037] Now referring back to Figure 1B. The processor 150, in various embodiments, is configured to calculate hand shape calibration parameters based on a left image captured by the left camera (e.g., camera 180A) and a right image captured by the right camera (e.g., camera 180B). The left image includes a hand (e.g., hand 222) positioned in the first position of the left image and within the common field of view (e.g., region 220). The right image includes the hand (e.g., hand 222) positioned in a second position of the right image and within the common field of view. During hand detection process, images captured by cameras 180A and 180B are processed alternatively; that is, only one of two images that were captured at the same time (e.g., sharing the same timestamp or within a predetermined time interval) is used in hand detection process. For example, detection of either hand 221 or hand 223 would cause the ER device to transition to a hand tracking and/or gesture identification mode. In various embodiments, detection of hand 222 in the common field of view 220 using one of the images would trigger using both left and right images captured at the same time for hand shape calibration, hand gesture identification, and/or other processes. For example, cameras 180A and 180B are configured to capture images at a predetermined frame rate (e.g., 24 frames per second or 30 frames per second). To perform hand detection using both left and right images, an NPU might need to process images at double time (e.g., 48 frames per second or 60 frames per second respectively). By using a round-robin scheme in hand detection processes, the NPU workload is reduced to the frame rate of a single camera capture rate (e.g., 24 frames per second or 30 frames per second). [0038] In AR applications, the field of view of each of the front cameras 180 overlaps with a field of view of an eye of the user 120, the captured images or video. The display screen(s) and/or projector(s) 185 may be used to display or project the generated image overlays (and/or to display a composite image or video that combines the generated image overlays superimposed over images or video of the actual area). The communication interface 190 provides wired or wireless communication with other devices and/or networks. For example, communication interface 190 may be connected to a computer for tether operations, where the computer provides the processing power needed for graphic-intensive applications. [0039] Figure 3 is a simplified diagram illustrating key points defined on a right hand according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. As an example, key points 0-19 are assigned to different regions of a user’s right hand. Based on the locations of these key points, hand gestures may be determined. For example, by identifying the relative positions of these key points from 0 to 19, different hand gestures can be determined. For example, key points are used (e.g., in a CNN by an NPU) in hand detection, tracking, and gesture identification processes. [0040] Figure 4 is a simplified block diagram illustrating functional blocks of extended reality apparatus according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. The system pipeline of an extended reality device 400 in Figure 4 may include the functional components, which may correspond to various parts of device 115 in Figure 1B, as shown. On the front-end, the sensors—such as the right fisheye camera 401, left fisheye camera 402, and inertia measuring unit (IMU) 403—capture images and other information and send the captured data to sensor processor 411 (e.g., a lightweight embedding CPU, such as processor 150 in Figure 1B). The sensor processor 411 performs various simple image processing (e.g., denoising, exposure control, and others), and then pack the processed data to an XR server 421. In various embodiments, sensor processor 411 is implemented using CPU or GPU, which is generally not as computationally expensive as NPUs. For example, XR server 421 is implemented to function as a data consumer and to deliver the data to various algorithms 451, such as 3D hand tracking 431, 6DoF 441, and others. In an embodiment, algorithm 451 includes hand detection algorithms, which are performed by one or more NPUs. The position of a 3D hand tracking algorithm 431 is configured after XR server 421 as shown, and it is followed by APP module 432. For example, unity APP 432 receives the hand tracking results for different purposes, such as gaming, manipulation of virtual objects, and others. Additional functions such as system render 434, asynchronous time warp (ATW) 435, and display 436 may be configured as shown. Depending on the implementation, there may be other functional blocks as well. [0041] Figure 5 is a simplified block diagram illustrating function modules in a hand gesture detection algorithm according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. According to various embodiments, a hand tracking system 500, which may be implemented with device 150 shown in Figure 1B, uses dual hand tracking processes for left (l) and right (r) hand. For example, system 500 provides real-time (i.e., 30 frames per second) hand tracking on edge device, and it operates as a 3D hand tracking system. Stereo fisheye cameras are used to obtained left and right images with known parallax calibration. The system includes various sets of algorithms that include hand acquisition 501, hand detection 502, hand prediction 503r and 503l, bound box tracking 504r and 504l, 2D hand key point detection 505r and 505l, 3D hand key point detection 507r and 507l, hand gesture recognition 508r and 508l, and hand shape calibration 570. [0042] System 500 enables a set of outputs including 3D hand key points in blocks 507r 507l. For example, hand key points are illustrated in Figure 3. It is to be noted that while the captured images are two-dimensional, hand gesture detection is performed using 3D hand key points. For example, 2D to 3D mapping is performed between block 505l and 507l, and calibration parameters may be used in the mapping process. [0043] System 500, as shown, includes five components: main thread 501, hand detection thread 502, right hand thread 502r, left hand thread 502l, and hand shape calibration thread 570. These components interact with one another. [0044] As an example, main thread 501 is used for copying the images captured by the right fisheye camera 501r and left fisheye camera 501l to the local memory of the system. [0045] The hand detection thread 502 waits for the right fisheye image and left fisheye image. Once the images have been received, the hand detection thread 502 may use a hand detection convolutional network on the right fisheye image and left fisheye image. For example, hand detection thread 502 outputs a confidence value and bounding box for the right hand and left hand. In various embodiment, images captured by the left and right cameras are selected alternatively for the hand detection process. [0046] Figure 6 is a simplified flow diagram illustrating a process for a method for hand detection according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps illustrated in Figure 6 may be added, removed, repeated, modified, overlapped, rearranged, and replaced, which should not limit the scope of the claims. [0047] At step 601, hand images are respectively captured by the left and right cameras. For example, the hand may be positioned inside the image captured by the left camera or the right camera. However, it is to be noted that if the hand is positioned within the common view, processes such as hand tracking and/or hand shape calibration may be performed once the hand is detected. For example, when hand 222 (see Figure 2) is positioned in the common FOV 220, thereby allowing both cameras 180A and 180B to record their respective images, with hand 222 at slightly different positions of the left and right frames, and the difference in positions allows for distance calculation or approximation. According to various embodiments, the hand detection process is repeated (e.g., steps 601 to 605) until the hand is detected. [0048] At step 602, left and right images are stored in the memory. For example, a buffer memory may be used to store the left and right images. In various embodiment, the captured images are stored temporarily for the purpose of processing, and may first be stored in volatile memory and later transferred to non-volatile memory. It is to be noted, the left and right images are stored with their respective metadata, which includes timestamp information that allows for selective hand detection processes (e.g., only one of the left and right images within a predetermined time interval is used for hand detection process). [0049] At step 603, select alternatively between the first plurality of images and the second plurality of images, the first plurality of images including a first image associated with a first timestamp, the second plurality images including a second image associated with a second timestamp, the first timestamp and the second timestamp being consecutive timestamps. Depending on the implementation, alternative selection between the left and right images can be performed in many ways. In a specific embodiment, images are assigned frame numbers for alternative selection, where a frame index variable and its remainder are used to select between left and right images. For example, the variable FrameIndex is used to record the current frame number for the image selection process, where the variable FrameIndex is linked to the image timestamp. For example, within a given time interval, left and right images may share the same timestamp (or both timestamps are within the given time interval), the right image is selected when FrameIndex %2 = 0, and the left image is selected when FrameIndex %2 = 1. For the next time interval (or the next timestamp), FrameIndex is increased by 1 (i.e., FrameIndex=FrameIndex+1), thereby switching from even to odd or from odd to even, and the alternative image (compared to the previous time interval) is thus selected using the FrameIndex remainder calculation. It is to be appreciated that alternative selection between left and right images can be implemented in many ways according to embodiments of the present invention. [0050] A step 604, hand detection is performed on the selected image. For example, depending on the FrameIndex value, which updates with timestamps, left or right images may be selected as explained. And only the selected image within a time interval (or per the same timestamp) is processed for hand detection. By performing hand detection using a single image within a time interval, the computation cost is greatly reduced (compared to processes involving both left and right images within the same time interval). Additionally, given that hand detection often involves complex computation, an XR device may not be able to process both left and right images within a time interval (e.g., 33 ms for a frame rate of 30 frames per second), performing hand detection using a single image per time interval can help maintain a desired frame rate (e.g., 30 frames per second). [0051] At step 605, whether the hand is detected is determined. In various embodiments, the bound box used in hand detection is generated by locating the hand image and cropping the captured images. The first hand detection process is performed by a neural processing unit. If the hand, as determined in step 605, is detected, step 606 and subsequent hand tracking steps are performed. On the other hand, if the hand is not detected, it goes back to step 601 for additional hand image capture and processing. It is to be noted that there might variations as to which steps are performed. For example, images may be captured and stored in batch, and the hand detection process may be repeated at step 602. Additionally, left and right images may be queued using the frame indices describe above for processing (e.g., providing the function of alternative selection at step 603), and the hand detection process 604 may be performed in multiple iterations before steps 601, 602, or 603 are repeated. [0052] The computation and energy saving with hand detection processes may no longer be needed in various situations. As explained below, hand tracking, hand gesture identification, and other processes may be performed once a hand is detected. In such situation, the variable FrameIndex is not updated, and both left and right images are used, and a bound box around the hand location may be used for hand tracking. [0053] At step 606, the hand tracking is performed. A hand tracking process is initiated using the first image and the second image if the hand is detected. The method further comprises tracking the hand in a bound box. For example, various techniques and processes may be performed, such as hand prediction, shape calibration, 2D-3D key point conversion, hand gesture identification, and others (e.g., as shown in Figure 5). It is to be noted that once an XR device is in the hand tracking mode, the presence of the hand is assumed, and therefore the hand detection process is not performed until the hand is no longer captured by either the left or the right camera, and thus is “lost.” [0054] At step 607, whether the hand tracking is lost is determined. If hand tracking, as determined in step 607, is not lost, go back step 606 and keep the hand tracking. On the other hand, if the hand tracking is lost, it goes back to step 601 for additional hand image capture and processing. It is to be understood that the hand may be lost in various ways, and may require different actions by the XR device. For example, the lost state may be associated with a hand temporarily exiting the combined FoV of two cameras; in certain situations, the XR device may enter into an inactive state (e.g., too dark to capture images that are usable for hand detection/tracking). [0055] Figure 7 is a simplified flow diagram illustrating a process for method for hand tracking according to embodiments of the present invention. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps illustrated in Figure 7 may be added, removed, repeated, modified, overlapped, rearranged, and replaced, which should not limit the scope of the claims. [0056] At block 710, the hand tracking is in a “Dead” state, wherein processes such as hand detection and hand tracking are not performed. For example, an XR device may be in the 710 state when it is idle (e.g., images stay the same, no movement or other types of change). The XR device would stay in this “Dead” state until it is activated (e.g., movement or change of images detected in the left or right image). [0057] At block 720, the XR device is initiated and ready to perform various tasks such as hand detection and hand tracking. As a part of the initialization process at block 720, left and right cameras are active and capturing images, which are stored for processing. In various implementations, variables such as FrameIndex are initialized to facilitate image selection for processing. Other processes may be performed as well. [0058] At block 730, hand tracking is performed, which includes hand detection (block 731) and hand prediction (block 732). Hand detection 731 is performed using an image selected between left and right images per a time interval (or timestamp). Hand detection 731 may be repeated until a hand is detected in one of the left and right images. Hand prediction 732 is performed once a hand is detected. As a part of the hand tracking process, hand prediction 732 may be repeated until the hand is lost or not within a bound box, within which hand prediction process can be performed. For example, if the hand that is being tracked is no longer in the bound box, hand detection 731 may be performed to define a new bound box; hand detection 731 may also determine that the hand is no longer present, and proceed to block 740. [0059] At block 740, the XR device is in the “Lost” state, where the hand is no longer detected. In the lost state, various XR components and processes may still be active to detect hand movements, which may move back to block 730 to perform hand detection if a movement is detected in the images (e.g., a difference between two consecutive frames). For example, block 740 runs a loop (as shown) for a predetermined time before moving the “Dead” state in block 710. In certain embodiments, blocks 710 and 740 may be implemented (or programmed) as the same state. [0060] As an example, pseudo-code for a hand detection mechanism, which may be implemented with Figures 6 and 7, according to the present invention is provided below: init trackingFrameIdx = 0 // when algorithm starts // Inside detection trhead function: func applyHandDet() { while (!algStop) { left_image, right_image = get_image_from_camera() // When to apply hand detection if (!algStop && ((!lefthand->handPredictionValid()) || (!righthand- >handPredictionValid()))) { // Detection network runDetNet(left_image, right_image) } } } // Inside detection func runDetNet(left_image, right_image) { trackingFrameIdx = updateTrackingFrameIdx() // Round-robin selection viewID = trackingFrameSelection(trackingFrameIdx) // Only run detection cnn on only one image, on each time point T if (viewID == left) { lefthand.exist, lefthand.bbox = runDetCnn(left_image) } else if (viewID == right) { righthand.exist, righthand.bbox = runDetCnn(right_image) } } // Inside updateTrackingFrameIdx func updateTrackingFrameIdx() { if(!lefthand->handPredictionValid() && lefthand->isNewBorn()) // isNewBorn: the first frame detection output hand exists return if(!righthand->handPredictionValid() && righthand->isNewBorn()) return trackingFrameIdx = (trackingFrameIdx % INT_MAX) + 1 } // Inside trackingFrameSelection func trackingFrameSelection(trackingFrameIdx) { return trackingFrameIdx % 2; } [0061] Now referring back to Figure 5. The right hand thread 502r and left hand thread 502l may be implemented symmetrically, and they respectively receive thread inputs from the right fisheye image and the left fisheye image. They also rely on their respective bound box tracking (i.e., blocks 504r and 504l). For example, confidence values and bound box tracking may be used to generate 3D hand key points that allow for the identification of hand gesture types. [0062] The hand bound box threads 504r and 504l provide tracking, and their inputs the bound box sizes confidence values, and bound box prediction values from hand prediction blocks 503r and 503l. The hand bound box threads 504r and 504l output, among other things, hand status (e.g., does it exist or not) and bound box data. [0063] As shown in Figure 5, if hand exists (as determined in block 504r and 504l), the 2D hand key point detection (e.g., blocks 505r and/or 505l) crops the hand out using the bound box from hand bound box tracking on the captured images. For example, the captured images are cropped out are resized to a predetermined size (e.g., 96 pixels by 96 pixels, which is a small size that is optimized for efficient processing). The 2D hand key point detection (e.g., blocks 505r and 505l) uses a 2D key point detection convolutional network on the resized image, and outputs the 2D hand key points. As described above, 2D key points, if exist, are mapped to 3D key points for hand gesture detection. [0064] While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention which is defined by the appended claims.

Claims

WHAT IS CLAIMED IS: 1. A method for hand shape detection, the method comprising: capturing a first plurality of images by a left camera; capturing a second plurality of images by a right camera; storing the first plurality of images and the second plurality of images at a memory; selecting alternatively between the first plurality of images and the second plurality of images, the first plurality of images including a first image associated with a first timestamp, the second plurality images including a second image associated with a second timestamp, the first timestamp and the second timestamp being consecutive timestamps; performing a first hand detection process using the first image; determining an absence of a hand in the first hand detection process; performing a second hand detection process using the second image; and performing a hand prediction process in response to a detected hand in the second hand detection process.

2. The method of claim 1 wherein the hand prediction process uses the first image and the second image.

3. The method of claim 1 further comprising generating a frame index, the frame index being updated based at least on timestamps.

4. The method of claim 1 wherein the first hand detection process is performed by a neural processing unit.

5. The method of claim 4 further comprising initiating a hand tracking process using the first image and the second image if the hand is detected.

6. The method of claim 1 further comprising performing a third hand detection process if the hand is determined to be lost.

7. The method of claim 1 further comprising performing a third hand detection process using a third image and a fourth hand detection process using a fourth image, the first plurality of images including the third image, the second plurality of images including the fourth image.

8. The method of claim 1 further comprising tracking the hand in a bound box if a hand is detected.

9. The method of claim 1 wherein the first detection process and the second hand detection are performed within a predetermined time interval associated frame rate of at least twenty frames per second.

10. An extended reality apparatus comprising: a housing, a housing comprising a front side and a rear side; a left camera comprising a first sensor and a first lens, the first lens being characterized by a first field of view, the left camera being configured on a left region of the front side, the left camera being configured to capture a first plurality of images, the first plurality of images including a first image having a first timestamp; a right camera comprising a second sensor and a second lens, the second lens being characterized by a second field of view, the right camera being configured on a right region of the front side, the second field of view sharing a common field of view with the first field of view, the right camera being configured to capture a second plurality of images, the second plurality of images including a second image having a second timestamp, a difference between the first timestamp and the second timestamp being within a predetermined interval; a display configured on the rear side of the housing; a memory coupled to the first sensor and the second sensor and being configured to store the first plurality of images and the second plurality of images; and a first processor coupled to the memory; and a second processor coupled to the first processor and the memory, the second processor including a neural processing unit; wherein: the first processor is configured to select alternatively between the first plurality of images and the second plurality of images; the second processor is configured to perform a first hand detection process using the first image; the second processor is further configured to perform a second hand detection using the second image based on an absence of a hand determined in the first hand detection process; and the second processor is further configured to perform a hand prediction process if the hand is detected in the first hand detection process.

11. The apparatus of claim 10 wherein the second processor is configured to perform hand detection processes at a rate of at least 20 frames per second.

12. The apparatus of claim 10 wherein the second processor is further configured to detect an absence of the hand from the first plurality of images or the second plurality of images.

13. The apparatus of claim 10 wherein the first lens comprises a fisheye lens.

14. The apparatus of claim 10 wherein the first sensor comprises a monochrome sensor having less than one million pixels.

15. The apparatus of claim 10 wherein the left camera is characterized by a capture rate of at least 15 frames per second.

16. The apparatus of claim 10 wherein the predetermined interval is less then 30 ms.

17. A method for hand shape detection, the method comprising: capturing a first plurality of images by a left camera; capturing a second plurality of images by a right camera; storing the first plurality of images and the second plurality of images at a memory; selecting alternatively between the first plurality of images and the second plurality of images; performing a first hand detection process using a first selected image; performing a second hand detection process using the second selected image based on an absence of a hand detected in the first hand detection process; and tracking the hand in a bound box if the hand is detected in the first hand detection process or the second hand detection process.

18. The method of claim 17 further comprising performing a hand prediction process.

19. The method of claim 17 wherein the first pluralty of images and the second plurality of images are selected using a frame index number.

20. The method of claim 17 further comprising switching from a hand detection process to a hand prediction process if the hand is not detected in a third selected image.