WO2024049463A1 - Virtual keyboard - Google Patents

Virtual keyboard

Info

Publication number
WO2024049463A1
WO2024049463A1 PCT/US2022/075622 US2022075622W WO2024049463A1 WO 2024049463 A1 WO2024049463 A1 WO 2024049463A1 US 2022075622 W US2022075622 W US 2022075622W WO 2024049463 A1 WO2024049463 A1 WO 2024049463A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
egocentric
virtual keyboard
hand
view
Prior art date
Application number
PCT/US2022/075622
Other languages
English (en)
Inventor
Dongeek Shin
Sean Kyungmok Bae
Shumin Zhai
Original Assignee
Google Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google Llc filed Critical Google Llc
Priority to PCT/US2022/075622 priority Critical patent/WO2024049463A1/fr
Publication of WO2024049463A1 publication Critical patent/WO2024049463A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/16Constructional details or arrangements
    • G06F1/1613Constructional details or arrangements for portable computers
    • G06F1/163Wearable computers, e.g. on a belt
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/0304Detection arrangements using opto-electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback

Definitions

  • the present disclosure relates to augmented reality and, more specifically, to a virtual keyboard for input to an augmented reality device.
  • An augmented reality (AR) device, such as AR glasses, may obtain inputs from a user in a variety of ways to interact with the AR device and/or the information presented by the AR device. New and improved inputs for humans to interact with these AR devices may be desirable.
  • a system for augmented reality interaction can include AR glasses and a supplemental device, such as a smart phone, that is communicatively coupled to the AR glasses.
  • the AR glasses may be configured to determine an egocentric point-of-view of a user.
  • Image data (e.g., images) of at least one hand of the user in the egocentric point-of-view may be captured by a camera of the AR glasses.
  • image data of a virtual keyboard in the egocentric point-of-view of the user may be generated by a processor of the AR glasses and combined with (e.g., added to) the image data of at least one hand to form egocentric images.
  • These egocentric images can include a hand (or hands) of the user interacting with the virtual keyboard.
  • the egocentric images may be processed and analyzed to detect finger positions and movements as the user types on the keyboard. Based on these positions and movements, keystrokes may be detected and identified.
  • Supplemental information about the user and the environment may be captured by the supplemental device to aid in this detection and identification.
  • images from the smart phone can include the hands of the user interacting with the virtual keyboard from an alternate point of view.
  • the alternate point of view images may be analyzed to detect finger positions and movements from a different perspective so that keystrokes can be better detected. For example, a fingertip position/movement obscured in an egocentric image may be available in the images from the supplemental device.
  • the techniques described herein relate to a method including: determining, using a position sensor of an augmented reality device, an egocentric point-of-view of a user of the augmented reality device; generating, using a processor of the augmented reality device, image data of a virtual keyboard in the egocentric point-of-view of the user; capturing, using a camera of the augmented reality device, image data of a hand of the user in the egocentric point-of-view of the user; combining the image data of the virtual keyboard and the image data of the hand of the user to generate egocentric images of the hand of the user and the virtual keyboard; determining egocentric information as positions and movements of the hand of the user relative to the virtual keyboard based on the egocentric images; determining a supplemental information corresponding to the hand of the user and the virtual keyboard using at least one supplemental device; combining the egocentric information and the supplemental information to obtain combined information; and identifying keystrokes on the virtual keyboard based on the combined information.
  • the techniques described herein relate to an augmented reality (AR) device, including: a position sensor configured to determine an egocentric point-of-view of a user of the AR device; a heads-up display configured to display image data of a virtual keyboard generated in the egocentric point-of-view of the user; a camera configured to capture image data of a hand of a user in the egocentric point-of-view of the user; and a processor configured to: combine the image data of the virtual keyboard and the image data of the hand of the user to generate an egocentric information including the hand of the user and the virtual keyboard; receive supplemental information from at least one supplemental device communicatively coupled to the AR device, the supplemental information corresponding to the hand of the user and the virtual keyboard; combine the egocentric information and the supplemental information to obtain combined information; and identify keystrokes on the virtual keyboard based on the combined information.
  • the techniques described herein relate to an augmented reality (AR) system, including: AR glasses configured to: determine an egocentric point-of-view of a user; generate image data of a virtual keyboard in the egocentric point-of-view of the user; capture image data of a hand of a user in the egocentric point-of-view of the user; and combine the image data of the virtual keyboard and the image data of the hand of the user to generate egocentric images including the hand of the user and the virtual keyboard; and at least one supplemental device communicatively coupled to the AR glasses, the at least one supplemental device configured to: capture supplemental information corresponding to the hand of the user and the virtual keyboard; and transmit the supplemental information to the AR glasses, wherein the AR glasses are further configured to: combine egocentric information determined based on the egocentric images and the supplemental information to obtain combined information; and identify keystrokes on the virtual keyboard based on the combined information.
  • FIG. 1 is a perspective view of AR glasses according to a possible implementation of the present disclosure.
  • FIG. 2 illustrates an egocentric view of a virtual keyboard through AR glasses according to a possible implementation of the present disclosure.
  • FIG. 3 A illustrates an egocentric view through AR glasses for a first head position according to a possible implementation of the present disclosure.
  • FIG. 3B illustrates an egocentric view through AR glasses for a second head position according to a possible implementation of the present disclosure.
  • FIG. 4 illustrates an egocentric view through AR glasses of a user interacting with a virtual keyboard according to a possible implementation of the present disclosure.
  • FIG. 5 illustrates a portion of a keystroke identification process according to a possible implementation of the present disclosure.
  • FIG. 6 illustrates a computing environment for an AR device according to a possible implementation of the present disclosure.
  • FIG. 7 illustrates a system for keystroke identification of a virtual keyboard according to a possible implementation of the present disclosure.
  • FIG. 8 illustrates a multiview arrangement according to a possible implementation of the present disclosure.
  • FIG. 9 illustrates a spectrogram of audio for keystroke detection according to a possible implementation of the present disclosure.
  • FIG. 10 illustrates a spectrogram of sensed vibrations for keystroke detection according to a possible implementation of the present disclosure.
  • FIG. 11 is a flowchart of a method for identifying keystrokes from a virtual keyboard according to an implementation of the present disclosure.
  • one criterion for the utility of an input may be its versatility. For example, an input, which can convey a wide range of intents in a wide range of applications, may be a more versatile input than a tap gesture customized for a particular application.
  • Another criterion for the utility of an input may be its ease of use (i.e., usability).
  • an input which is known (and therefore intuitive) to a user may be easier to use than a set of gestures, which require training.
  • Another criterion for the utility of an input may be its accuracy. For example, inputs that do not accurately convey the intent of a user may easily become frustrating.
  • Another criterion for the utility of an input may be its efficiency (i.e., speed). For example, users have come to expect no noticeable lag between an input and the device’s response.
  • the present disclosure describes a virtual keyboard as an input for an AR/VR device that is versatile, usable, accurate, and efficient.
  • a user’s experience with the accuracy, layout, and feedback (i.e., tactile, auditory) of physical keyboards can set expectations high for a virtual keyboard in an AR/VR environment.
  • a virtual keyboard may be expected to provide fast key entry (e.g., >15 words per minute (wpm)) and for multi-word interactions (e.g., messaging application, email response, password entry, document editing, etc.) either by (i) typing and/or (ii) performing gestures connecting keys. Certain features may be helpful for a virtual keyboard in an AR/VR environment to meet these expectations of fast and convenient text interaction.
  • a touch interaction can include a user tapping a surface (e.g., desktop, palm, etc.) to type. This tapping can help a user know when a key has been pressed (i.e., tactile feedback). Additionally, the tapping can generate a sound corresponding to the typing (i.e., audio feedback).
  • the present disclosure describes a virtual keyboard for an AR/VR environment that is rendered on a physical surface to provide the touch interaction.
  • Another feature of a virtual keyboard, which may be desirable for fast and convenient text interactions is the support for two-handed typing. Two-handed typing may be familiar to a user and therefore can increase typing speed. Further, it may be desirable for a virtual keyboard to match a physical keyboard size and layout to enhance the familiarity for a user.
  • the present disclosure describes a virtual keyboard for an AR/VR environment that supports two-handed typing.
  • Another feature of a virtual keyboard, which may be desirable for fast and convenient text interactions, is that it is accurate. Users may be highly sensitive to any accuracy diminished below the essentially 100% accuracy that a physical keyboard can provide.
  • keystroke detection errors can arise. Keystroke detection errors may include missed keystrokes (e.g., nothing typed when J pressed), added keystrokes (e.g., H typed when nothing pressed), or wrong keystrokes (e.g., H typed when J pressed).
  • the present disclosure describes a virtual keyboard for an AR/VR environment that reduces keystroke detection errors (i.e., increases accuracy) through the use of additional devices to help detect keystrokes.
  • specialty equipment refers to equipment used primarily (e.g., exclusively) for the typing input.
  • specialty equipment can refer to devices (e.g., sensors, markers, tags, etc.) that are worn on the wrists/forearms (e.g., bracelets, bands), hands (e.g., gloves), or fingers (e.g., rings) for the purpose of detecting keystroke gestures.
  • specialty equipment can refer to a projector camera device used in place of a physical keyboard.
  • Specialty equipment does not refer to computing devices generally available to a user in a computing environment, for example, AR glasses, smart phones, smart watches, and the like.
  • the present disclosure describes a virtual keyboard for an AR/VR environment, which does not require equipment specialized for keystroke detection but rather can use computing devices already available to a user.
  • a virtual keyboard input that can convey the intents of a user to the AR/VR device is described herein.
  • the virtual keyboard solves, at least, the technical problem of inaccurate keystroke identification by using multiple devices to detect and identify keystrokes.
  • the disclosed approach has the technical effect of providing a virtual keyboard that provides the desirable features listed above.
  • the disclosed virtual keyboard can be utilized in an AR and/or VR environment created by a head worn device, such as AR glasses, VR headset, or XR goggles. In what follows, a possible implementation using AR glasses will be discussed in detail.
  • FIG. 1 is a perspective view of AR glasses according to a possible implementation of the present disclosure.
  • the AR glasses 100 are configured to be worn on a head and face of a user.
  • the AR glasses 100 include a right earpiece 101 and a left earpiece 102 that are supported by the ears of a user.
  • the AR glasses further include a bridge portion 103 that is supported by the nose of the user so that a left lens 104 and a right lens 105 can be positioned in front of a left eye of the user and a right eye of the user, respectively.
  • the portions of the AR glasses can be collectively referred to as the frame of the AR glasses.
  • the frame of the AR glasses can contain electronics to enable function.
  • the frame may include a battery, a processor, a memory (e.g., non-transitory computer readable medium), electronics to support sensors (e.g., cameras, depth sensors, etc.), at least one position sensor (e.g., an inertial measurement unit) and interface devices (e.g., speakers, display, network adapter, etc.).
  • the AR glasses may display and sense an environment relative to a coordinate system 130.
  • the coordinate system 130 can be aligned with the head of a user wearing the AR glasses.
  • the eyes of the user may be along a line in a horizontal (e.g., x-direction) direction of the coordinate system 130.
  • the AR glasses 100 can further include a heads-up display (i.e., HUD) configured to display visual information at a lens (or lenses) of the AR glasses.
  • the heads-up display may present AR data (e.g., images, graphics, text, icons, etc.) on a portion 115 of a lens (or lenses) of the AR glasses so that a user may view the AR data as the user looks through a lens of the AR glasses. In this way, the AR data can overlap with the user’s view of the environment.
  • the portion 115 can correspond to (i.e., substantially match) area(s) of the right lens 105 and/or left lens 104.
  • the AR glasses 100 can include a position sensor that is configured to determine a head-pose of a user. Based on this head-pose, and possibly other sensor data, an egocentric point-of-view of the user (i.e., egocentric view of the user) may be determined.
  • the AR glasses 100 can include a camera 110 that is directed to a camera field-of-view that overlaps with the natural field-of-view of the eyes of the user when the glasses are worn.
  • the camera 110 can provide images of a view aligned with a point-of-view (POV) of a user (i.e., an egocentric view of the user).
  • the AR glasses 100 can further include a depth sensor 111 (e.g., LIDAR, structured light, time-of-flight, depth camera) that is directed to a depth-sensor field-of-view that overlaps with the natural field-of-view of the eyes of a user when the glasses are worn.
  • Data from the depth sensor 111 and/or the camera 110 can be used to measure depths in a field-of-view (i.e., region of interest) of the user (i.e., wearer).
  • the camera field-of-view and the depth-sensor field-of-view may be calibrated so that depths (i.e., ranges) of objects in images from the camera 110 can be determined in depth images, where pixel values correspond with depths measured at positions corresponding to the pixel positions.
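  • As an illustrative sketch (not specified by the disclosure), such a depth image can be mapped to 3D positions by back-projecting each pixel through a pinhole camera model; the intrinsics below are hypothetical placeholder values.

```python
import numpy as np

def backproject(depth_image, fx, fy, cx, cy):
    """Convert a depth image (meters per pixel) into a point cloud
    in the camera coordinate frame using a pinhole camera model."""
    h, w = depth_image.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_image
    x = (u - cx) * z / fx   # horizontal offset from the optical center
    y = (v - cy) * z / fy   # vertical offset from the optical center
    return np.stack([x, y, z], axis=-1)  # shape (h, w, 3)

# Example with hypothetical intrinsics for a 640x480 depth sensor.
depth = np.full((480, 640), 0.75)          # a flat surface 0.75 m away
points = backproject(depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
```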
  • the AR glasses 100 can further include an eye-tracking sensor.
  • the eye tracking sensor can include a right-eye camera and/or a left-eye camera 121.
  • a left-eye camera 121 can be located in a portion of the frame so that a left FOV 123 of the left-eye camera 121 includes the left eye of the user when the AR glasses are worn.
  • the AR glasses 100 can further include one or more microphones.
  • the one or more microphones can be spaced apart on the frames of the AR glasses.
  • the AR glasses can include a first microphone 131 and a second microphone 132.
  • the microphones may be configured to operate together as a microphone array.
  • the microphone array can be configured to apply sound localization to determine directions of the sounds relative to the AR glasses.
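  • The disclosure does not detail the localization technique; one common approach is to estimate the time difference of arrival (TDOA) between two microphones by cross-correlation, as in the sketch below. The microphone spacing and sample rate are assumed values for illustration only.

```python
import numpy as np

def estimate_direction(sig_left, sig_right, mic_spacing=0.14, fs=16000, c=343.0):
    """Estimate the azimuth of a sound source from the time difference
    of arrival between two microphones using cross-correlation."""
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)    # lag in samples
    tdoa = lag / fs                                  # lag in seconds
    # Clamp to the physically possible range before taking arcsin.
    s = np.clip(tdoa * c / mic_spacing, -1.0, 1.0)
    return np.degrees(np.arcsin(s))                  # angle from broadside
```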
  • the AR glasses may further include a left speaker 141 and a right speaker 142 configured to transmit audio to the user. Additionally, or alternatively, transmitting audio to a user may include transmitting the audio over a wireless communication link 145 to a listening device (e.g., hearing aid, earbud, etc.). For example, the AR glasses may transmit audio to a left wireless earbud 146 and to a right earbud 147.
  • FIG. 2 illustrates an egocentric view of a virtual keyboard through AR glasses according to a possible implementation of the present disclosure.
  • the AR glasses 100 can be configured to display, on a heads-up display in a lens of the AR glasses, virtual objects so that as the user views an environment through the AR glasses 100 both real objects and virtual objects can be seen by the user.
  • the egocentric view of the user wearing AR glasses includes real objects and virtual objects.
  • a user wearing the AR glasses 100 can view a virtual object corresponding to a keyboard (i.e., a virtual keyboard 200) along with the hands 210 of the user.
  • the camera 110 (i.e., egocentric camera) of the AR glasses may be configured to capture this egocentric view, including the hands 210 and a region of the real world for displaying the virtual keyboard 200.
  • the AR glasses may be configured to create an egocentric image of the hand of the user and the virtual keyboard using this egocentric view.
  • a processor of the AR glasses may be configured to combine this egocentric view captured by the camera with the virtual keyboard 200 in an egocentric image that includes the hand (or hands) of the user interacting with the virtual keyboard, such as shown in FIG. 2.
  • the processor of the AR glasses may be configured to create egocentric images of a hand of a user and a virtual keyboard using images captured with a camera directed at an egocentric point-of-view of the user.
  • the AR glasses 100 may be configured to capture a video stream of egocentric views (i.e., egocentric images).
  • the egocentric images of the video stream may be analyzed (e.g., using machine vision algorithms) to determine interactions (e.g., keystrokes, key gestures) between the hands 210 and the virtual keyboard 200. Accordingly, the virtual keyboard may be said to be vision-based because of the egocentric views used for determining virtual keyboard interactions.
  • FIG. 3A illustrates a first egocentric view 301 of an environment as seen through AR glasses from a first perspective of a user (i.e., a first pose 310 of the user).
  • FIG. 3B illustrates a second egocentric view 302 of the environment as seen through AR glasses from a second perspective of the user (i.e., a second pose 320 of the user).
  • the perspective (i.e., viewpoint) can be determined from a position (i.e., x, y, z) and/or orientation (i.e., yaw, pitch, roll) of the user's head.
  • the combination of position and orientation can be referred to as the pose of the user’s head.
  • For the first perspective (i.e., FIG. 3A), the user's head is in a first pose 310, and for the second perspective (i.e., FIG. 3B), the user's head is in a second pose 320.
  • the first egocentric view includes a desk 350 and a mobile phone 330 on a surface defined by the desk (i.e., desktop).
  • the desk 350 and the mobile phone 330 are real objects in the environment.
  • the first egocentric view 301 further includes a virtual keyboard 340.
  • the virtual keyboard 340 is world-locked in the AR environment so that it appears on the desktop at a first location within an area of the desk 350 in the same way that the mobile phone 330 appears on the desktop at a second location within the area of the desk 350.
  • the second egocentric view includes the desk 350, the mobile phone 330, and the virtual keyboard.
  • the virtual keyboard 340 is world-locked in the AR environment so that its position relative to the boundaries of the second egocentric view 302 is shifted according to the second pose 320, but the relative positions of the mobile phone 330 and the virtual keyboard 340 within the second egocentric view 302 are not changed.
  • the AR glasses may compute a global coordinate system for real and virtual objects. As the pose of a user changes, then the differences between the user and the global coordinate system may be used to update the rendered position of virtual objects.
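  • As a minimal sketch of this idea (not an implementation from the disclosure), a world-locked point can be re-projected into the current egocentric image from the head pose, here assuming a pinhole camera with hypothetical intrinsics and a pose given as a rotation matrix and translation vector in the global coordinate system.

```python
import numpy as np

def project_world_point(p_world, R_wc, t_wc, fx, fy, cx, cy):
    """Project a world-locked 3D point into the current egocentric image.
    R_wc (3x3) and t_wc (3,) describe the head/camera pose in world coordinates."""
    # Transform the world point into the camera frame (inverse of the pose).
    p_cam = R_wc.T @ (p_world - t_wc)
    # Pinhole projection to pixel coordinates.
    u = fx * p_cam[0] / p_cam[2] + cx
    v = fy * p_cam[1] / p_cam[2] + cy
    return np.array([u, v])   # pixel where the world-locked virtual key is rendered
```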
  • World locking may be useful for assigning each key of the virtual keyboard 340 an area at a particular location of the desktop (or, more generally, in the global coordinate system).
  • the world locking analysis may further include transforming a distorted layout of the keyboard captured from an egocentric view into a standard layout as if captured from directly above the keyboard (i.e., bird’s eye view). This transformation may standardize a layout of the virtual keyboard used for keystroke identification. This standard layout may not significantly change as the user changes perspectives.
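  • Such a view normalization can be modeled as a planar homography. The sketch below uses OpenCV and assumes the four corners of the virtual keyboard in the egocentric image have already been located; the pixel coordinates and canvas size are hypothetical.

```python
import cv2
import numpy as np

# Corners of the virtual keyboard as they appear in the egocentric image
# (hypothetical pixel coordinates), ordered TL, TR, BR, BL.
src = np.float32([[210, 310], [540, 300], [600, 420], [160, 435]])

# Corners of the standard (bird's-eye) layout, e.g., a 600x200 canvas.
dst = np.float32([[0, 0], [600, 0], [600, 200], [0, 200]])

H = cv2.getPerspectiveTransform(src, dst)

# Map a detected fingertip pixel from the egocentric view into the
# standard layout before comparing it against key locations.
fingertip = np.float32([[[355, 390]]])
fingertip_std = cv2.perspectiveTransform(fingertip, H)
```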
  • FIG. 4 illustrates an egocentric view through AR glasses of a user interacting with a virtual keyboard according to a possible implementation of the present disclosure.
  • an egocentric view of a user 410 wearing AR glasses 400 is shown.
  • the egocentric view 401 includes the left hand 440 and the right hand 430 of the user typing on a virtual keyboard 445.
  • the hands of the user and the virtual keyboard are located at a surface 450 (e.g., table, desk, counter, etc.).
  • a user can tap a location of the desktop assigned to the key in order to virtually press the key.
  • a caption 460 of the typing can also be displayed to a user to show which keys have been pressed.
  • the AR glasses may analyze an egocentric image to compute skeletal topologies for each hand, namely a left skeletal topology 436 and a right skeletal topology 435.
  • Each skeletal topology includes nodes (dots) corresponding with key points on the hands and edges connecting corresponding nodes.
  • an egocentric image of a hand may be analyzed to recognize key points on the hand and then place corresponding nodes at the image locations of key points.
  • the placement of nodes may be estimated. For example, when a portion of the hand in an egocentric image is obscured, nodes may be placed approximately according to rules based on the anatomy of a typical hand and/or the movement of the hand.
  • a keystroke may be identified when a portion of a skeletal topology, such as a node (e.g., fingertip node), intersects a key area (i.e., key) of the virtual keyboard.
  • the caption 460 indicates that the user 410 is typing the word “HELLO” on the virtual keyboard 445.
  • the letter “E” is obscured by the left hand 440 and the letters “L” and “O” are obscured by the right hand 430.
  • estimating node positions and node motions relative to these keys based strictly on the egocentric view 401 may be problematic.
  • FIG. 5 illustrates a portion of a keystroke identification process according to a possible implementation of the present disclosure.
  • a raw layout 510 (i.e., distorted layout) of the virtual keyboard, as captured from the egocentric view, may be transformed into a standard layout 520 (i.e., normalized layout, undistorted layout).
  • Each key in the standard layout may then be assigned a key point (i.e., key location).
  • a skeletal topology which includes nodes corresponding to fingertips (i.e., fingertip nodes), may be computed for a hand of the user (and transformed to the standard layout).
  • a keystroke may be triggered when a fingertip node 530 and a key location 540 are at a distance at, or within, a threshold distance.
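  • A minimal sketch of this threshold test is shown below, with both the fingertip node and the key locations expressed in the standard layout; the key coordinates and threshold value are illustrative assumptions.

```python
import numpy as np

def detect_keystroke(fingertip_xy, key_locations, threshold=12.0):
    """Return the key whose location is within `threshold` (standard-layout
    units, e.g., pixels) of the fingertip node, or None if no key is close enough."""
    best_key, best_dist = None, threshold
    for key, xy in key_locations.items():
        dist = np.linalg.norm(np.asarray(fingertip_xy) - np.asarray(xy))
        if dist <= best_dist:
            best_key, best_dist = key, dist
    return best_key

# Example: a fingertip near the assumed location of the "J" key.
keys = {"H": (300, 100), "J": (340, 100), "K": (380, 100)}
print(detect_keystroke((338, 104), keys))   # -> "J"
```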
  • a problem with this approach is that many spurious keystrokes can be generated as the finger moves past keys. Further, it may not always be possible to determine this distance accurately. For example, sometimes fingertip nodes of the user and/or key locations of the virtual keyboard are obscured (e.g., by the hands of the user).
  • the disclosed virtual keyboard addresses this technical problem by obtaining additional information from supplemental devices (i.e., supplemental sources), which may be common to a computing environment for the AR glasses.
  • FIG. 6 illustrates a computing environment for an AR device (shown as AR glasses 600) according to a possible implementation of the present disclosure.
  • the AR glasses 600 can include a wireless interface (i.e., wireless module) that can be configured to communicate wirelessly with other devices (i.e., supplemental devices).
  • the wireless communication may occur over a wireless communication channel 601.
  • the wireless communication channel may use a variety of wireless protocols, including (but not limited to) WiFi, Bluetooth, ultra-wideband (UWB), and mobile technology (4G, 5G).
  • the supplemental devices in the computing environment may include (but are not limited to) a smart watch 610, a mobile phone 620, a laptop computer 630, a tablet 640, a smart home device 660 (e.g., smart thermostat, home hub, etc.).
  • the AR glasses 600 may be in communication with a network 650 (e.g., the cloud). Communication with the network 650 may expand the possible supplemental devices from which the AR glasses 600 can receive information (e.g., images, video, sound).
  • the AR glasses 600 may be configured to communicate via the network 650 to receive images from a camera 651 (e.g., CCTV camera) and sounds from a microphone 652 corresponding to an environment of the user.
  • the camera 651 and microphone 652 may be part of a home monitoring system (e.g., smart hub, smart thermostat, etc.).
  • the supplemental devices may collect supplemental information about the user and/or the virtual keyboard in the environment. Supplemental information, collected by supplemental devices, while the user types on the virtual keyboard can be transmitted to the AR glasses. This supplemental information can complement and/or enhance the egocentric information in keystroke identification.
  • the AR glasses may be configured to combine (i.e., fuse) the egocentric information and the supplemental information to obtain (i.e., generate, calculate) combined information and use the combined information to identify keystrokes. Keystroke identification based on combined information may be more accurate than keystroke identification based only on egocentric information.
  • FIG. 7 illustrates a system for keystroke identification of a virtual keyboard according to a possible implementation of the present disclosure.
  • the AR system 700 includes a first source of audio-visual information implemented as an AR device 701 (e.g., AR glasses).
  • the AR device 701 is configured to capture egocentric information 711 (e.g., egocentric images) of a hand (or hands) of a user and the virtual keyboard from an egocentric point-of-view.
  • the AR system 700 further includes a second source of audio-visual information implemented as at least one supplemental device 702.
  • the at least one supplemental device 702 is configured to capture supplemental information 712 corresponding to the hand (or hands) of the user and the virtual keyboard.
  • the at least one supplemental device 702 may be commodity devices used with the AR glasses, including (but not limited to) a mobile phone (i.e., smartphone), wearables (e.g., a smart watch), and home devices (e.g., a smart hub).
  • the supplemental information 712 can include audio, visual, and/or tactile information about the user and/or the virtual keyboard.
  • the egocentric information 711 may be input to an egocentric hand-tracking algorithm 721 (e.g., egocentric neural network), which may determine positions and movements of a finger (or fingers) of the hand (or hands) of the user. Determining the positions and movements may include generating a skeletal topology of the user’s hand (or hands). Determining the positions and movements may further include transforming the skeletal topology to a standard layout, such as described previously for the virtual keyboard (FIG. 5). The determined positions and movements may be relative to a global coordinate system shared by the AR device 701 (i.e., egocentric device) and the at least one supplemental device 702.
  • the egocentric hand-tracking algorithm 721 may include a media-pipe single-shot detector (SSD) and a hand key-point reconstruction model to digitize hand/finger positions from the egocentric camera of the AR glasses.
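  • The sketch below illustrates this kind of hand key-point extraction using the MediaPipe Hands solution; it is offered as a stand-in for the detector and reconstruction models referenced above, not as the specific models of the disclosure.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
FINGERTIP_IDS = [4, 8, 12, 16, 20]   # thumb tip through pinky tip landmarks

def fingertip_pixels(frame_bgr):
    """Return (x, y) pixel coordinates of fingertip landmarks for each detected hand."""
    h, w = frame_bgr.shape[:2]
    with mp_hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
        result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    tips = []
    for hand in result.multi_hand_landmarks or []:
        # Landmarks are normalized to [0, 1]; scale to pixel coordinates.
        tips.append([(hand.landmark[i].x * w, hand.landmark[i].y * h)
                     for i in FINGERTIP_IDS])
    return tips
```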
  • the supplemental information 712 may be input to a complementary tracking algorithm 722, such as a complementary view hand-tracking algorithm (e.g., neural network).
  • the complementary tracking algorithm 722 may include software instructions recalled from a memory and used to configure a processor to determine position information and/or temporal information that can disambiguate finger movements/positions unavailable (e.g., obscured) in the egocentric information 711.
  • the complementary tracking algorithm 722 may determine positions and movements of a finger (or fingers) of a hand (or hands) of the user from images captured by a camera for an alternate point-of-view that is different from the egocentric point-of- view.
  • This determination may further include transforming the skeletal topology to a standard layout 520, such as described previously for the virtual keyboard (FIG. 5).
  • the complementary tracking algorithm 722 may further include (or alternatively include) capturing audio information, such as the sounds of fingers tapping, to determine times at which keystrokes occur.
  • the complementary tracking algorithm 722 may further include (or alternatively include) capturing motion information, such as vibrations created by fingers tapping, to determine times at which keystrokes occur.
  • the complementary tracking algorithm 722 may accommodate any number of supplemental devices and may add and remove supplemental devices as needed. When no supplemental devices are present, the AR system 700 may use the egocentric hand-tracking algorithm 721 alone.
  • the AR system 700 further includes a fusion block 730 configured to evaluate the positions and movements of a finger (or fingers) determined by the egocentric hand-tracking algorithm 721 with any available complementary audiovisual information (i.e., supplemental information) determined by the complementary tracking algorithm 722 in order to determine letters corresponding to keystrokes.
  • the fusion block 730 may include a neural network, heuristics, and/or sensor fusion algorithms to perform the analysis of the multiple sources of information.
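  • One simple fusion heuristic, offered as a sketch rather than the disclosed algorithm, is to gate keystroke decisions on tap evidence from the supplemental devices and take a weighted combination of per-key confidence scores from the two trackers; the weights and threshold below are illustrative assumptions.

```python
def fuse_scores(ego_scores, supp_scores, tap_detected, w_ego=0.6, w_supp=0.4):
    """Combine per-key scores from the egocentric and supplemental trackers.
    `ego_scores` and `supp_scores` map key labels to confidences in [0, 1];
    `tap_detected` is True when audio/IMU evidence indicates a tap occurred."""
    if not tap_detected:
        return None                  # no tap evidence -> suppress spurious keystrokes
    keys = set(ego_scores) | set(supp_scores)
    if not keys:
        return None
    fused = {k: w_ego * ego_scores.get(k, 0.0) + w_supp * supp_scores.get(k, 0.0)
             for k in keys}
    best = max(fused, key=fused.get)
    return best if fused[best] > 0.5 else None   # 0.5 is an illustrative threshold
```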
  • One possible implementation of the AR system 700 includes AR glasses and a mobile phone in a multiview arrangement.
  • FIG. 8 illustrates a multiview arrangement according to a possible implementation of the present disclosure.
  • a user 801 is interacting with a virtual keyboard 805 (and a virtual display) using AR glasses 800.
  • the computing environment of the AR glasses 800 includes a mobile phone 830.
  • a sensor (or sensors) on the mobile phone 830 may be configured to provide additional (i.e., supplemental) sensing modalities for use with keystroke identification.
  • the AR glasses 800 include an egocentric camera configured to capture images of the hands of the user 801 interacting with the virtual keyboard in an egocentric field-of-view 810.
  • the mobile phone 830 includes a supplemental camera configured to capture supplemental images of the hands of the user interacting with the virtual keyboard in a supplemental field-of-view 820.
  • the egocentric field-of-view 810 and the supplemental field-of-view 820 can be different so that the hand (e.g., finger) positions and/or hand (e.g., finger) movements that are obscured in the egocentric field-of-view may be visible in the supplemental field-of-view image (and vice versa).
  • the mobile phone 830 may include a microphone to detect audio from the typing (or gesturing) of the user 801.
  • the mobile phone may include an inertial measurement unit (IMU) that can detect vibrations of the surface (i.e., desktop), on which the virtual keyboard is rendered, and on which the mobile phone rests.
  • the audio and the sensed vibrations may be rendered as spectrograms for keystroke detection.
  • FIG. 9 illustrates a spectrogram of audio for keystroke detection according to a possible implementation of the present disclosure.
  • Audio as the user interacts with the virtual keyboard may include sounds of the user typing. For example, a user may create a sound when the surface on which the virtual keyboard is rendered is tapped. Detecting times at which these tapping sounds occur may help to determine a keystroke even when the fingers of the user are obscured in an image. Thus, combining (i.e., fusing) the detected keystroke times with the egocentric information may improve an accuracy of the keystroke identification.
  • Detecting keystroke times from an audio signal may include creating a spectrogram of the captured audio.
  • the spectrogram is an image 900 that illustrates amplitudes as grayscale or color values in the image 900. As shown, lighter pixels in the image 900 are relatively higher amplitudes and darker pixels are relatively lower amplitudes.
  • the image 900 includes a frequency axis 901 (i.e., vertical axis) and a time axis 902 (i.e., horizontal axis).
  • the dimension of the image 900 along the time axis 902 represents a window 910 of time.
  • the spectrogram may include a series (i.e., stream) of images as the window is shifted in time.
  • the image 900 may be processed to detect sounds corresponding to keystrokes. For example, thresholds and/or other comparisons may be used to recognize peaks corresponding to a keystroke.
  • the image 900 of the spectrogram shows five peaks 920 detected as keystrokes.
  • the keystroke times may be used to identify keystrokes by isolating times in other information. For example, egocentric information may be analyzed at the isolated times to identify keystrokes.
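  • A sketch of the spectrogram-based tap-time detection described above, using SciPy, is shown below; the sample rate, frequency band, and peak-picking parameters are illustrative assumptions rather than values from the disclosure.

```python
import numpy as np
from scipy.signal import spectrogram, find_peaks

def detect_tap_times(audio, fs=16000):
    """Return times (seconds) of candidate keystroke taps in an audio signal."""
    f, t, Sxx = spectrogram(audio, fs=fs, nperseg=256, noverlap=128)
    # Sum energy in a band where tap transients are assumed prominent (1-6 kHz).
    band = (f >= 1000) & (f <= 6000)
    energy = Sxx[band].sum(axis=0)
    # Peaks well above the mean energy, at least ~100 ms apart.
    peaks, _ = find_peaks(energy,
                          height=energy.mean() + 3 * energy.std(),
                          distance=int(0.1 * fs / 128))
    return t[peaks]
```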
  • FIG. 10 illustrates a spectrogram of sensed vibrations for keystroke detection according to a possible implementation of the present disclosure.
  • the user interacting with the virtual keyboard may generate vibrations while typing (or gesturing). For example, a user may create a vibration when the surface on which the virtual keyboard is rendered is tapped.
  • Detecting times at which these sensed vibrations occur may help to determine a keystroke even when the fingers of the user are obscured in an image.
  • combining (i.e., fusing) the detected keystroke times with the egocentric information may improve an accuracy of the keystroke identification.
  • the spectrogram shown in FIG. 10 is of a PCA component of a 100 Hz 3-axis smartphone accelerometer signal captured while the user was tapping a surface near the smartphone. The five taps are visually clear, with a distinctive spectral signature per tap.
  • the spectral signature has a two-band pattern. The spectral signature may depend on the surface material and the specific structure-borne sound propagation physics.
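  • A sketch of computing such a PCA component from a raw 3-axis accelerometer stream is shown below; the resulting 1-D signal could then be fed to the same spectrogram-based tap detector sketched for audio.

```python
import numpy as np

def first_pca_component(accel_xyz):
    """Project a 3-axis accelerometer stream (N x 3 array, e.g., sampled at
    100 Hz) onto its first principal component to obtain a 1-D vibration signal."""
    centered = accel_xyz - accel_xyz.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]
```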
  • Detecting keystroke times from a vibration signal may include creating a spectrogram of the sensed vibrations.
  • the spectrogram is an image 1000 that illustrates amplitudes as grayscale or color values in the image 1000. As shown, lighter pixels in the image 1000 are relatively higher amplitudes and darker pixels are relatively lower amplitudes.
  • the image 1000 includes a frequency axis 1001 (i.e., vertical axis) and a time axis 1002 (i.e., horizontal axis).
  • the dimension of the image 1000 along the time axis 1002 represents a window of time.
  • the spectrogram may include a series (i.e., stream) of images as the window is shifted in time.
  • the image 1000 may be processed to detect vibrations corresponding to keystrokes. For example, thresholds and/or other comparisons may be used to recognize a tap vibration corresponding to a keystroke.
  • the image 1000 of the spectrogram shows five detections. Each detection may correspond to keystroke times. The keystroke times may be used to identify keystrokes by isolating times in other information. For example, egocentric information may be analyzed at the isolated times to identify keystrokes.
  • FIG. 11 is a flowchart of a method for identifying keystrokes from a virtual keyboard according to an implementation of the present disclosure.
  • the method 1100 includes creating 1110 egocentric images of a hand (or hands) of a user and a virtual keyboard using images captured with a camera directed at an egocentric point-of-view of the user (i.e., egocentric camera).
  • the egocentric images may be in a raw layout that corresponds to the perspective of the user.
  • the method may include transforming the virtual keyboard to a standard layout (i.e., bird’s eye view).
  • the method 1100 further includes determining 1120 egocentric information based on the egocentric images.
  • the egocentric information may include positions of the left hand, right hand, or both hands.
  • the egocentric information may include positions of one or more fingers on either (or both) hands.
  • the egocentric information may include motion (or motions) of the hand (or hands) or a finger (or fingers).
  • the motion may be a velocity or an acceleration of a fingertip.
  • Determining the position and/or motion of the hand (or hands) and finger (or fingers) may include generating 1130 a skeletal topology corresponding to the hand/fingers.
  • the skeletal topology may be transformed to a standard layout.
  • the skeletal topology may include nodes and edges. The nodes may be located relative to a global coordinate system that is shared between devices.
  • the method 1100 may include deciding 1140 if one or more supplemental devices 1170 are available. If no supplemental devices are available, the method 1100 includes identifying 1190 keystrokes based on egocentric information.
  • identifying keystrokes includes identifying a keystroke when a distance between a node (e.g., fingertip node) of the skeletal topology and a key point (e.g., center of a key) on the virtual keyboard (e.g., in a global coordinate system) satisfy a criterion (e.g., are within a threshold distance).
  • the method 1100 includes capturing and/or receiving 1160 supplemental information from the supplemental device (or supplemental devices).
  • the supplemental information can include audio, visual, or tactile information corresponding to the hand (or hands) of the user and/or the virtual keyboard.
  • the method further includes combining 1150 the egocentric information and the supplemental information to form (i.e., generate) combined information.
  • the method may then include identifying 1180 keystrokes based on the combined information.
  • the identifying may include localizing tap events in time and then searching for finger movements and/or finger locations at the time of tap events to determine which key was pressed.
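  • A sketch of this tap-gated identification is shown below, assuming tap times from the audio/vibration detector and a time-stamped fingertip track in the standard layout; the distance threshold and data layout are illustrative.

```python
import numpy as np

def identify_keystrokes(tap_times, fingertip_track, key_locations, max_dist=15.0):
    """For each detected tap time, look up the fingertip position closest in time
    and report the nearest key within `max_dist` (standard-layout units).
    `fingertip_track` is a list of (timestamp, (x, y)) samples."""
    times = np.array([t for t, _ in fingertip_track])
    typed = []
    for tap in tap_times:
        idx = int(np.argmin(np.abs(times - tap)))      # sample nearest the tap
        pos = np.asarray(fingertip_track[idx][1])
        key, dist = min(((k, np.linalg.norm(pos - np.asarray(xy)))
                         for k, xy in key_locations.items()),
                        key=lambda kv: kv[1])
        if dist <= max_dist:
            typed.append(key)
    return typed
```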
  • the method described above may be implemented as a computer program product tangibly embodied on a non-transitory computer-readable medium and comprising instructions that, when executed, are configured to cause at least one processor to perform one or more operations of the method.
  • the computer program product (also known as a computer program, module, program, software, software application, or code) includes machine instructions for a programmable processor.
  • machine-readable medium refers to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal.
  • machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, an aspect includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent "about,” it will be understood that the particular value forms another aspect. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint.
  • a singular form may, unless definitely indicating a particular case in terms of the context, include a plural form.
  • Spatially relative terms (e.g., over, above, upper, under, beneath, below, lower, and so forth) may be used herein for ease of description to describe the relationship of one element to another.
  • the relative terms above and below can, respectively, include vertically above and vertically below.
  • the term adjacent can include laterally adjacent to or horizontally adjacent to.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Interaction with an augmented reality (AR) device can be improved using a virtual keyboard to support complex tasks, such as sending emails, editing documents, and the like. Such a virtual keyboard can be world-locked in the AR environment to a surface such as a desk or kitchen table. Keystroke identification accuracy for the virtual keyboard can suffer, however, when the only source of information for keystroke detection is egocentric images captured by the AR device. Disclosed are methods and systems for improving the accuracy of keystroke detection and identification through the use of multiple sources of information available in an AR environment.
PCT/US2022/075622 2022-08-30 2022-08-30 Virtual keyboard WO2024049463A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2022/075622 WO2024049463A1 (fr) 2022-08-30 2022-08-30 Virtual keyboard

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2022/075622 WO2024049463A1 (fr) 2022-08-30 2022-08-30 Virtual keyboard

Publications (1)

Publication Number Publication Date
WO2024049463A1 true WO2024049463A1 (fr) 2024-03-07

Family

ID=83457553

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/075622 WO2024049463A1 (fr) 2022-08-30 2022-08-30 Clavier virtuel

Country Status (1)

Country Link
WO (1) WO2024049463A1 (fr)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100214267A1 (en) * 2006-06-15 2010-08-26 Nokia Corporation Mobile device with virtual keypad
US20180350150A1 (en) * 2017-05-19 2018-12-06 Magic Leap, Inc. Keyboards for virtual, augmented, and mixed reality display systems
US20190265781A1 (en) * 2018-02-28 2019-08-29 Logitech Europe S.A. Precision tracking of user interaction with a virtual input device
US20210124417A1 (en) * 2019-10-23 2021-04-29 Interlake Research, Llc Wrist worn computing device control systems and methods

Similar Documents

Publication Publication Date Title
US11995774B2 (en) Augmented reality experiences using speech and text captions
US11670267B2 (en) Computer vision and mapping for audio applications
US11699271B2 (en) Beacons for localization and content delivery to wearable devices
EP2946266B1 (fr) Method and body-worn device for providing a virtual input interface
CN110488974B (zh) Method and wearable apparatus for providing a virtual input interface
US10269222B2 (en) System with wearable device and haptic output device
US20140198130A1 (en) Augmented reality user interface with haptic feedback
US11869156B2 (en) Augmented reality eyewear with speech bubbles and translation
US11360550B2 (en) IMU for touch detection
KR20170133754A (ko) Smart glasses device based on motion recognition
US20210405363A1 (en) Augmented reality experiences using social distancing
KR20150110285A (ko) Method for providing a virtual input interface in a wearable device, and wearable device therefor
CN103914128A (zh) Head-mounted electronic device and input method
WO2020260085A1 (fr) Method, computer program and head-mounted device for triggering an action, method and computer program for a computing device, and computing device
KR20140069660A (ko) User interface apparatus and method based on image superposition
US20230367118A1 (en) Augmented reality gaming using virtual eyewear beams
CN103713387A (zh) Electronic device and capture method
WO2024049463A1 (fr) Virtual keyboard
US11797100B1 (en) Systems and methods for classifying touch events based on relative orientation
US11733789B1 (en) Selectively activating a handheld device to control a user interface displayed by a wearable device
US12013985B1 (en) Single-handed gestures for reviewing virtual content
KR20170093057A (ko) Method and apparatus for processing hand gesture commands for a media-centric wearable electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22777887

Country of ref document: EP

Kind code of ref document: A1