WO2023069591A1 - Object-based dual cursor input and guiding system

Object-based dual cursor input and guiding system

Info

Publication number: WO2023069591A1
Application number: PCT/US2022/047236
Authority: WIPO (PCT)
Prior art keywords: indicator, target, scene, radius, user
Other languages: French (fr)
Inventors: Qi XIONG, Yi Xu
Original assignee: Innopeak Technology, Inc.
Application filed by Innopeak Technology, Inc.
Publication of WO2023069591A1

Classifications

    • G PHYSICS — G06 COMPUTING; CALCULATING OR COUNTING — G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F 3/03 Arrangements for converting the position or the displacement of a member into a coded form
    • G06F 3/0304 Detection arrangements using opto-electronic means

Definitions

  • This application relates generally to user interface technology including, but not limited to, methods, systems, and non-transitory computer-readable media for detecting user gestures and visualizing user actions in an extended reality environment.
  • Gesture control is an important component of the user interface on modern-day electronic devices.
  • For devices with touch screens, touch gestures are used for invoking a graphic, icon, or pointer to point, select, or trigger user interface elements on two-dimensional displays (e.g., display monitors, computer screens).
  • Common touch gestures include tap, double tap, swipe, pinch, zoom, rotate, etc.
  • Each touch gesture is typically associated with a certain user interface function.
  • In contrast, touchless air gestures are used to implement certain user interface functions for electronic devices having no touch screens, e.g., head-mounted displays including virtual reality headsets, augmented reality glasses, mixed reality headsets, and the like.
  • These devices having no touch screens can include front-facing cameras or miniature radars to track human hands in real time.
  • For example, some head-mounted displays have implemented hand tracking functions to complete user interaction including selecting, clicking, and typing on a virtual keyboard.
  • Air gestures can also be used on the devices with touch screens when a user’s hands are not available to touch the screen (e.g., while preparing a meal, the user can use air gestures to scroll down a recipe so that the user does not need to touch the device screen with wet hands). It would be beneficial to have an efficient mechanism to visualize user gestures and enable user interactions in the context of extended reality (e.g., virtual, augmented, and mixed realities).
  • Various embodiments of this application are directed to methods, systems, devices, and non-transitory computer-readable media for enabling user interaction that assists a user’s action on target objects, provides input in a user interface, and controls objects in an extended reality application.
  • An input mechanism (e.g., hands, laser pointer, sensor glove) is applied with an electronic device (e.g., a head-mounted display (HMD)) that executes the extended reality application.
  • the extended reality application displays a graphical user interface (GUI) on which the input mechanism and virtual objects are visualized to guide the input mechanism to approach a target.
  • the electronic device executes the extended reality application to display a user interface including two cursor-like representations of the input mechanism and a virtual object targeted by the input mechanism.
  • a dual-cursor representation includes a primary cursor (i.e., a source indicator) and an assistive cursor (i.e., a target indicator).
  • the assistive cursor shows a target position that is associated with the virtual object and configured to receive a user interaction.
  • the primary and assistive cursors have different states indicating a distance between the input mechanism and the target object.
  • when the input mechanism on which the extended reality application relies reaches a final position (which overlaps or is distinct from the target position), the primary and assistive cursors reach a final state (e.g., entirely overlap with each other).
  • Each indicator optionally has a 2D shape or 3D volume.
  • a method is implemented by an electronic device for enabling user interaction in extended reality.
  • the method includes executing an extended reality application including a graphical user interface (GUI), detecting and tracking a user gesture in a scene, and determining a target position of the user gesture in the scene.
  • the method further includes displaying a target indicator and a source indicator on the GUI.
  • the target indicator is rendered at the target position of the user gesture on the GUI.
  • the source indicator has a relative distance from the target indicator on the GUI, and the relative distance corresponds to (e.g., is proportional to, is linearly correlated with) a distance between a current position and an end position of the user gesture in the scene.
  • the end position and target position of the user gesture are optionally identical to or distinct from each other, and the target position of the user gesture is based on a user’s intention determined from tracked positions of the user gesture.
  • the user gesture is made by a user’s hand, another body part, or an object held by the user, and each of the hand, other body part, and object is used as an interaction mechanism.
  • the method further includes comparing the relative distance between the target indicator and the source indicator with a distance threshold. Displaying of the target indicator and source indicator is initiated in accordance with a determination that the relative distance is less than the distance threshold.
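  • For illustration only, the following minimal Python sketch shows one way the threshold-gated display described above could be realized; the helper names and numeric values (e.g., update_indicators, DISTANCE_THRESHOLD, the pixel scale) are assumptions for this sketch and are not taken from the disclosure.

```python
# Minimal sketch (not the patent's implementation) of the threshold-gated
# display described above. DISTANCE_THRESHOLD, the pixel scale, and the
# helper names are assumptions for illustration.
from dataclasses import dataclass
import math

DISTANCE_THRESHOLD = 0.30  # meters in scene space; assumed value


@dataclass
class DualCursorState:
    show_indicators: bool       # whether the source/target indicators are drawn
    relative_distance_px: float


def update_indicators(current_pos, end_pos, scale_px_per_m=800.0):
    """Map the scene-space distance between the gesture's current and end
    positions to an on-screen relative distance, and decide whether the
    source and target indicators should be displayed."""
    scene_distance = math.dist(current_pos, end_pos)          # 3D distance in meters
    return DualCursorState(
        show_indicators=scene_distance < DISTANCE_THRESHOLD,   # gate on the threshold
        relative_distance_px=scene_distance * scale_px_per_m,  # proportional mapping
    )


# Example: hand about 12 cm from the inferred end position -> indicators shown.
state = update_indicators((0.0, 0.0, 0.5), (0.05, 0.1, 0.45))
```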
  • the target indicator includes a contour enclosing an area on the GUI. The area corresponds to a region of the scene in which the target position of the user gesture is located with a threshold probability or above. The area has a center corresponding to a location having a greater probability for the target position than remaining locations in the scene.
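  • The contour described above can be read as a confidence region around the estimated target position. The sketch below illustrates one possible construction, assuming recent target estimates are modeled with an isotropic 2D Gaussian; the model choice and function name are assumptions of this illustration.

```python
# Sketch of one possible contour construction: treat recent estimates of the
# target position as samples of an isotropic 2D Gaussian and return the circle
# expected to contain the true target with at least `threshold_prob`.
import numpy as np


def target_contour(recent_estimates, threshold_prob=0.90):
    """Return (center, radius); the center is the most probable location and
    the circle covers the target position with probability >= threshold_prob."""
    pts = np.asarray(recent_estimates, dtype=float)
    center = pts.mean(axis=0)                 # highest-probability location
    sigma = pts.std(axis=0).mean() + 1e-6     # isotropic spread estimate
    # For a 2D isotropic Gaussian, P(r <= R) = 1 - exp(-R**2 / (2 * sigma**2)).
    radius = sigma * np.sqrt(-2.0 * np.log(1.0 - threshold_prob))
    return center, radius
```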
  • the method further includes in accordance with a determination that the target indicator and the source indicator substantially overlap on the GUI, identifying a user action on an object associated with the target position in the scene.
  • the method further includes displaying a three-dimensional (3D) virtual object at a target portion of the GUI corresponding to the target position of the scene and displaying the source indicator and the target indicator conformally on the 3D virtual object.
  • some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating an electronic device configured to process content data (e.g., image data), in accordance with some embodiments.
  • Figures 3A-3C are three example situations of displaying user hand gestures on a graphical user interface (GUI) in an extended reality application, in accordance with some embodiments.
  • Figures 4A and 4B are two example user interfaces displaying a target indicator and a source indicator in a scene of an extended reality application, in accordance with some embodiments.
  • Figures 5A and 5B are two example scenes of an extended reality application in each of which a user gesture selects an object, in accordance with some embodiments.
  • Figure 6 is a flow diagram of an example method for enabling user interaction in extended reality, in accordance with some embodiments.
  • Extended reality includes augmented reality (AR) in which virtual objects are overlaid on a view of a real physical world, virtual reality (VR) that includes only virtual content, and mixed reality (MR) that combines both AR and VR and in which a user is allowed to interact with real-world and virtual objects.
  • AR is an interactive experience of a real-world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information, e.g., across multiple sensory modalities including visual, auditory, haptic, somatosensory, and olfactory.
  • the electronic device renders visual content corresponding to a scene on a graphical user interface (GUI) of the extended reality application, and detects and tracks a user gesture in the scene.
  • the electronic device visualizes the user gesture by displaying a target indicator and a source indicator jointly on the GUI with the visual content.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, laptop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone).
  • the one or more client devices 104 include a head-mounted display (HMD) 104D configured to render extended reality content.
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provides system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, processes the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • storage 106 may store video content (including visual and audio content), static visual content, and/or inertial sensor data.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., formed by the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C.
  • the networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other electronic systems that route data and messages.
  • the HMD 104D includes one or more cameras (e.g., a visible light camera, a depth camera), a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera(s) and microphone are configured to capture video and audio data from a scene of the HMD 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures gestures of a user wearing the HMD 104D.
  • the microphone records ambient sound, including user’s voice commands.
  • both video or static visual data captured by the visible light camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses (i.e., device positions and orientations).
  • the video, static image, audio, or inertial sensor data captured by the HMD 104D are processed by the HMD 104D, server(s) 102, or both to recognize the device poses.
  • In some embodiments, both depth data (e.g., a depth map and a confidence map) captured by the depth camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the depth and inertial sensor data captured by the HMD 104D are processed by the HMD 104D, server(s) 102, or both to recognize the device poses.
  • the device poses are used to control the HMD 104D itself or interact with an application (e.g., a gaming application) executed by the HMD 104D.
  • the display of the HMD 104D displays a user interface.
  • the recognized or predicted device poses are used to render virtual objects with high fidelity, and the user gestures captured by the camera are used to interact with visual content on the user interface.
  • SLAM techniques are applied in the data processing environment 100 to process video data, static image data, or depth data captured by the HMD 104D with inertial sensor data. Device poses are recognized and predicted, and a scene in which the HMD 104D is located is mapped and updated.
  • the SLAM techniques are optionally implemented by HMD 104D independently or by both of the server 102 and HMD 104D jointly.
  • Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments.
  • the electronic system 200 includes a server 102, a client device 104 (e.g., HMD 104D in Figure 1), a storage 106, or a combination thereof.
  • the electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
  • the client device 104 includes an inertial measurement unit (IMU) 280 integrating sensor data captured by multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space.
  • the one or more inertial sensors of the IMU 280 include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware-dependent tasks;
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices), where in some embodiments the user application(s) 224 include an extended reality application configured to present multi-media content associated with extended reality;
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
  • Pose determination and prediction module 230 for determining and predicting a pose of the client device 104 (e.g., HMD 104D), where in some embodiments, the pose is determined and predicted jointly by the pose determination and prediction module 230 and data processing module 228, and the module 230 further includes an SLAM module 232 for mapping a scene where a client device 104 is located and identifying a pose of the client device 104 within the scene using image and IMU sensor data;
  • Pose-based rendering module 238 for rendering virtual objects on top of a field of view of the camera 260 of the client device 104 or creating mixed, virtual, or augmented reality content using images captured by the camera 260, where the virtual objects are rendered and the mixed, virtual, or augmented reality content are created from a perspective of the camera 260 (i.e., from a point of view of the camera 260) based on a camera pose of the camera 260; and
  • the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206 optionally, stores a subset of the modules and data structures identified above.
  • memory 206 optionally, stores additional modules and data structures not described above.
  • Figures 3A-3C are three example situations 300, 330, and 360 of displaying user hand gestures on a graphical user interface (GUI) 310 in an extended reality application 225, in accordance with some embodiments.
  • the extended reality application 225 is executed by an electronic device (e.g., HMD 104D) and displays a graphical user interface (GUI) 310 for rendering visual content corresponding to a scene and enabling user interaction.
  • Extended reality includes AR, VR, and MR.
  • AR displays a real physical world in the scene, and virtual objects are overlaid on a view of the scene.
  • VR displays a virtual scene that includes only virtual content (e.g., virtual objects).
  • AR is an interactive experience of a real-world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information, e.g., across multiple sensory modalities including visual, auditory, haptic, somatosensory, and olfactory.
  • the extended reality application 225 creates an immersive user experience for the user via the GUI 310.
  • the electronic device While executing the extended reality application 225, the electronic device detects and tracks a user gesture in the scene.
  • the user gesture is made by a user’s hand, another body part, or an object held by the user, and each of the hand, other body part, and object is used as an interaction mechanism.
  • the extended reality application 225 functions jointly with the input processing module 220 and user interface module 218 in Figure 2 to detect and track the user gesture.
  • the user gesture includes a hand gesture. The hand gesture is performed by a user hand itself or by a hand holding an object, e.g., a wand, a pencil, a laser pointer, and a hand wearing a sensor glove.
  • the extended reality application 225 is further associated with a data processing model 250 that is applied to process images of hand gestures based on deep learning neural networks and identify information of the hand gesture.
  • the information of the hand gesture includes at least a current position 302A and a target position 302T of the user gesture in the scene.
  • a user hand moves from the current position 302A towards an end position 302B, and stops at the end position 302B.
  • the target position 302T is extended and extrapolated from a moving path created from the current position 302A.
  • the target position 302T overlaps with the end position 302B.
  • the target position 302T is distinct from the end position 302B.
  • a target indicator 304B is rendered at the target position 302T of the user gesture on the GUI, and a source indicator 304A corresponds to the current position 302A.
  • the source indicator 304A and target indicator 304B are displayed, e.g., jointly with the visual content on the GUI 310.
  • the source indicator 304A has a relative distance from the target indicator 304B on the GUI 310. The relative distance corresponds to (e.g., is proportional to, is linearly correlated with) a distance between the current position 302A and the end position 302B of the user gesture in the scene.
  • the relative distance on the GUI 310 is a projection of the distance between the current position 302A and the end position 302B of the hand gesture in the scene onto a display that enables the GUI 310.
  • as the hand gesture moves towards the end position 302B, the relative distance decreases, and the target indicator 304B rendered at the target position 302T is optionally fixed or dynamically adjusted.
  • the distance between the current and end positions 302A and 302B aligned along a line of sight is rendered into the relative distance of the source and target indicators 304A and 304B directly facing the user, allowing a user gesture that moves away from or towards the user to be visualized in a 2D plane or a 3D space in a more user-friendly manner.
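  • As an illustration of the projection described above, the sketch below maps the 3D current and end positions to pixel coordinates with a pinhole camera model and measures their on-screen offset; the intrinsic parameters are assumed values, not values from the disclosure.

```python
# Illustration of the projection described above: both scene positions are
# projected with a pinhole camera model and the pixel offset is used as the
# on-screen relative distance. The intrinsics below are assumed values.
import numpy as np

K = np.array([[800.0,   0.0, 640.0],   # fx,  0, cx  (assumed intrinsics)
              [  0.0, 800.0, 360.0],   #  0, fy, cy
              [  0.0,   0.0,   1.0]])


def project(point_cam):
    """Project a 3D point in camera coordinates onto the display, in pixels."""
    p = K @ np.asarray(point_cam, dtype=float)
    return p[:2] / p[2]


def relative_distance_px(current_pos, end_pos):
    """Pixel distance between the projected source indicator (current
    position) and the projected end/target position."""
    return float(np.linalg.norm(project(current_pos) - project(end_pos)))
```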
  • the target indicator 304B includes a contour enclosing an area on the GUI 310.
  • the area corresponds to a region of the scene in which the target position 302T of the user gesture is located with a threshold probability or above.
  • the area has a center corresponding to a location having a greater probability for the target position 302T than remaining locations in the scene.
  • the source indicator 304A and target indicator 304B appear on the GUI 310 when a gesture visualization criterion is satisfied.
  • the gesture visualization criterion includes a distance threshold.
  • the electronic device compares the relative distance between the target indicator 304B and the source indicator 304A with the distance threshold. Displaying of the target indicator 304B and source indicator 304A is initiated in accordance with a determination that the relative distance is less than the distance threshold.
  • the target position 302T is not stable.
  • the gesture visualization criterion requires that a variation of the target position 302T be stabilized for a predetermined duration of time (e.g., within 100 ms). Displaying of the target indicator 304B and source indicator 304A is initiated in accordance with a determination that the target position 302T is stabilized for the predetermined duration of time.
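  • The stabilization criterion above can be implemented as a simple time-window check, as in the hedged sketch below; the tolerance value and class name are assumptions of this illustration.

```python
# Hedged sketch of the stabilization criterion: show the indicators only after
# the target estimate has stayed within a small tolerance for the predetermined
# duration (e.g., 100 ms). The tolerance and class name are assumptions.
import math
import time

STABLE_DURATION_S = 0.100    # predetermined duration of time
STABLE_TOLERANCE_M = 0.02    # allowed wobble of the target estimate (assumed)


class TargetStabilizer:
    def __init__(self):
        self._anchor = None          # position the estimate is compared against
        self._anchor_time = None

    def is_stable(self, target_pos, now=None):
        now = time.monotonic() if now is None else now
        if self._anchor is None or math.dist(target_pos, self._anchor) > STABLE_TOLERANCE_M:
            # The estimate moved too much: restart the stability window.
            self._anchor, self._anchor_time = target_pos, now
            return False
        return (now - self._anchor_time) >= STABLE_DURATION_S
```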
  • the target indicator 304B includes a first circle having a first radius R1.
  • the source indicator 304A includes a second circle having a second radius R2.
  • the second circle of the source indicator 304A is not concentric with the first circle of the target indicator 304B.
  • Each of the first and second circles includes one of a solid circle or an open circle.
  • the relative distance of the source indicator 304A from the target indicator 304B is measured from a first center of the first circle to a second center of the second circle.
  • the target indicator 304B includes a first circle having a first radius R1.
  • the source indicator 304A includes a second circle that has a second radius R2.
  • the second circle of the source indicator 304A is concentric with the first circle of the target indicator 304B.
  • the GUI 310 displays the second circle, while displaying the first circle based on the target position 302T.
  • the second radius R2 is greater than the first radius R1.
  • the relative distance of the source indicator 304A from the target indicator 304B is optionally equal to the second radius R2 of the second circle of the source indicator 304A or a difference between the second and first radii, i.e., R2-R1.
  • the GUI 310 displays the second circle that has a third radius R3, while displaying the first circle based on the target position 302T.
  • the relative distance of the source indicator 304A from the target indicator 304B is optionally equal to the third radius R3 or a difference between the third and first radii, i.e., R3-R1.
  • the third radius R3 is less than the second radius R2, indicating that the hand gesture is moving closer to the target position 302T in the scene. Additionally, in some embodiments, the electronic device determines an average hand speed based on a difference of the second and third radii (i.e., R2-R3) and a duration of time between the first and second instants of time.
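  • The following sketch illustrates the concentric-circle behavior and the average-hand-speed estimate described above; the mapping from scene distance to radius (px_per_m) is an assumed scale factor, not a value from the disclosure.

```python
# Sketch of the concentric-circle variant: the source circle's radius shrinks
# toward the target circle's radius as the hand closes in, and an average hand
# speed follows from the radius change over time.
def source_radius(scene_distance_m, target_radius_px, px_per_m=400.0):
    """Second-circle radius (R2, later R3): target radius plus a term
    proportional to the remaining scene-space distance."""
    return target_radius_px + scene_distance_m * px_per_m


def average_hand_speed(r_earlier_px, r_later_px, dt_s, px_per_m=400.0):
    """Average speed between two instants from the radius difference,
    e.g., (R2 - R3) converted back to meters and divided by the elapsed time."""
    return (r_earlier_px - r_later_px) / px_per_m / dt_s


# Example: the radius shrank from 120 px to 80 px over 0.25 s -> 0.4 m/s.
speed = average_hand_speed(120.0, 80.0, 0.25)
```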
  • the current position 302A of the hand gesture stops at the end position 302B and does not reach, or change with respect to, the target position 302T.
  • the current position 302A overlaps the end position 302B and is aligned with the target position 302T.
  • the GUI 310 displays a single indicator 304 indicating that the source indicator 304A and the target indicator 304B overlap with each other.
  • the current position 302A of the hand gesture reaches the target position 302T.
  • the GUI 310 displays a single indicator 304 indicating that the source indicator 304A and the target indicator 304B overlap with each other.
  • the electronic device identifies a user action on an object associated with the target position 302T in the scene.
  • a virtual object is located at the target position 302T in the scene, and the user action is associated with the virtual object.
  • a real physical object is located at the target position 302T in the scene, and the user action is associated with the physical object.
  • an actionable affordance item is located at the target position 302T in the scene, and the user action initiates an executable program associated with the actionable affordance item.
  • the object associated with the target position 302T is displayed on the GUI 310 concurrently with the target indicator 304B.
  • the target indicator 304B is overlaid at least partially on the object.
  • the user gesture selects the object when the source and target indicators 304 overlap.
  • the user applies a subsequent hand gesture to define a subsequent action (e.g., deletion, copy, display of supplemental information) on the selected object.
  • a menu is overlaid on top of the object, allowing the user to select one of a list of commands to modify the object.
  • the user gesture adjusts (e.g., zooms in or out, moves) display of the GUI 310 with respect to the object located at the target position 302T.
  • each of the target indicator 304B and source indicator 304A includes a portion of a respective 3D sphere that is displayed conformally with a real or virtual object in a scene.
  • the target indicator 304B includes a portion of a first 3D sphere having a first radius R1.
  • the source indicator 304A includes a portion of a second 3D sphere that has a second radius R2 and is concentric with the first 3D sphere.
  • the portion of the second 3D sphere is displayed, while the portion of the first 3D sphere is displayed based on the target position.
  • the second radius R2 is greater than the first radius R1.
  • the relative distance of the source indicator 304A from the target indicator 304B is equal to the second radius R2.
  • the relative distance may be equal to a difference between the second radius R2 and the first radius R1.
  • a portion of a third 3D sphere that has a third radius R3 is displayed, while the portion of the first 3D sphere is displayed based on the target position.
  • the relative distance of the source indicator 304A from the target indicator 304B is equal to the third radius R3, and the third radius R3 is less than the second radius R2.
  • the relative distance may be equal to a difference between the third radius R3 and the first radius R1.
  • the electronic device determines an average hand speed based on a difference of the second and third radii R2 and R3 and a duration of time between the first and second instants of time.
  • each of the first, second, and third 3D spheres is an imaginary sphere and is not entirely rendered for display. Rather, each imaginary sphere intersects the real or virtual object with a respective portion (e.g., an intersection circle) of the imaginary sphere, and the respective intersection portion of the imaginary sphere (e.g., the intersection circle of each 3D sphere) is displayed on the real or virtual object.
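  • As an illustration of the intersection-circle rendering described above, the sketch below computes the circle where an imaginary sphere meets a locally planar object surface; the planar-surface approximation is an assumption of the sketch.

```python
# Illustration of the intersection-circle rendering: assuming the object
# surface is locally planar, compute the circle where an imaginary sphere
# meets that plane; only this circle would be drawn conformally on the object.
import numpy as np


def sphere_plane_intersection(center, radius, plane_point, plane_normal):
    """Return (circle_center, circle_radius) of the sphere/plane intersection,
    or None if the sphere does not reach the surface."""
    n = np.asarray(plane_normal, dtype=float)
    n /= np.linalg.norm(n)
    # Signed distance from the sphere center to the plane.
    d = float(np.dot(np.asarray(center, dtype=float) - np.asarray(plane_point, dtype=float), n))
    if abs(d) > radius:
        return None                                        # sphere misses the surface
    circle_center = np.asarray(center, dtype=float) - d * n   # foot of the perpendicular
    circle_radius = (radius ** 2 - d ** 2) ** 0.5              # Pythagoras in the cross-section
    return circle_center, circle_radius
```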
  • Figures 4A and 4B are two example user interfaces 400 displaying a target indicator 304B and a source indicator 304A in a scene of an extended reality application 225, in accordance with some embodiments.
  • the source indicator 304A and target indicator 304B are projected onto virtual objects 402 for three-dimensional (3D) display effects.
  • a hand gesture 410 approaches a surface of a virtual object 402 at a target position 302T.
  • a source projection 404A of the source indicator 304A and a target projection 404B of the target indicator 304B get closer to each other in Figure 4A.
  • the source indicator 304A and the target indicator 304B overlap with each other in Figure 4B.
  • the end position 302B and target position 302T of the user gesture optionally overlap with each other or differ from each other.
  • the source indicator 304A and target indicator 304B are displayed with the virtual objects 402, when and after the hand gesture 410 approaches the end position 302B. More specifically, in some embodiments, the electronic device displays a three-dimensional (3D) virtual object 402 at a target portion of the GUI corresponding to the target position 302T of the scene.
  • the source indicator 304A and the target indicator 304B are displayed conformally on the 3D virtual object 402.
  • each of the target indicator 304B and source indicator 304A includes a respective 2D circle that is displayed as a respective projection 404A or 404B conformally with the virtual object 402.
  • each of the target indicator 304B and source indicator 304A includes a portion of a respective 3D sphere 404A or 404B that is displayed conformally with the virtual object 402.
  • each of the target indicator 304B and source indicator 304A has a 2D shape (e.g., a square) distinct from the 2D circle or a 3D volume (e.g., a cube) distinct from the 3D sphere.
  • each of the first, second, and third 3D spheres is an imaginary sphere and is not entirely rendered for display. Rather, each imaginary sphere intersects the real or virtual object with a respective portion (e.g., an intersection circle) of the imaginary sphere, and the respective intersection portion of the imaginary sphere (e.g., the intersection circle of each 3D sphere) is displayed on the real or virtual object.
  • Figures 5A and 5B are two example scenes 500 and 550 of an extended reality application 225 in each of which a user gesture selects an object, in accordance with some embodiments.
  • the user gesture is made by a user’s hand, another body part, or an object held by the user, and each of the hand, other body part, and object is used as an interaction mechanism.
  • the extended reality application 225 is executed by an electronic device (e.g., an HMD 104D) and displays a GUI 310.
  • the electronic device tracks locations 502 of the user gesture in a scene during a duration of time (e.g., between a first instant t1 and a second instant t2), and projects a path 504 of the user gesture in the scene based on the tracked locations 502 of the user gesture.
  • locations of a user shoulder or head are tracked to determine the target position 302T.
  • the electronic device determines that the path 504 intersects a user actionable affordance item 508, real object, or a virtual object in the scene.
  • the target position 302T is identified based on a location of the user actionable affordance item 508, real object, or virtual object in the scene.
  • a current position 302A of a hand gesture is tracked along locations on a path 502A and reaches an end position 302B.
  • the path 502A is approximated with a path 504A.
  • the path 504 A extends to intersect with a tree 506, and the tree 506 is selected to be associated with the target position 302T.
  • the tree 506 is optionally a real tree captured by a camera of the electronic device and displayed in the scene in real time.
  • the tree 506 is optionally a real tree seen by the user via optical-see through display.
  • the tree 506 is optionally a virtual tree rendered in the scene by the electronic device.
  • the target indicator 304B is overlaid at least partially on the tree 506.
  • a user fully completes the hand gesture at a final instant, and the hand gesture reaches the tree 506, i.e., the end position 302B overlaps with the target position 302T. Conversely, in some embodiments, a user fully completes the hand gesture at a final instant, and the hand gesture does not reach the tree 506 while an extended line of the path 504A reaches the tree 506, i.e., the end position 302B is separate from the target position 302T. At the final instant, the source indicator 304A and target indicator 304B overlap with each other on the GUI 310.
  • the target position 302T is not stable.
  • the gesture visualization criterion requires that a variation of the target position 302T be stabilized for a predetermined duration of time (e.g., within 100 ms) before the source indicator 304A or target indicator 304B is displayed on the GUI 310.
  • a current position 302A of a hand gesture is tracked along locations on a path 502B and reaches an end position 302B’.
  • the path 502B is approximated with a path 504B.
  • the path 504B extends to intersect with a user actionable affordance item 508 before the path 504B reaches the tree 506.
  • the user actionable affordance item 508 is selected to be associated with the target position 302T’.
  • the target indicator 304B is overlaid at least partially on the user actionable affordance item 508.
  • in response to a hand gesture reaching the end position 302B, the user actionable affordance item 508 is selected to initiate an executable program associated with the user actionable affordance item 508.
  • a current position 302A of a hand gesture is tracked along locations on a path 502C and reaches an end position 302B”.
  • the path 502C is approximated with a path 504C.
  • the path 504C extends to intersect with a dog 510 sitting in front of the tree 506, before the path 504C reaches the tree 506.
  • the dog 510 is selected to be associated with the target position 302T”.
  • the target indicator 304B is overlaid at least partially on the dog 510.
  • the dog 510 is optionally a real dog appearing in a physical world or a virtual dog rendered in an MR, AR or VR scene.
  • the electronic device includes an HMD 104D, and user interfaces are normally constructed in a 3D space.
  • the user interfaces reflect a distance and a spatial relation between a current position 302A of a hand gesture and an end position 302B associated with a target position 302T of an object targeted by the hand gesture.
  • a dual-indicator system includes a source indicator (also called primary cursor) 304A showing the current position for user interaction on a virtual object and a target indicator (also called assistive cursor) 304B.
  • the distinguishing states of the two indicators indicate the distance between an input agent (e.g., a hand gesture) and the end position 302B or targeted object.
  • when the input agent in an extended reality system reaches the end position 302B or touches a surface for interaction, the corresponding source and target indicators 304A and 304B reach a final state (e.g., in which the indicators 304A and 304B entirely overlap with each other).
  • Figure 6 is a flow diagram of an example method 600 for enabling user interaction in extended reality, in accordance with some embodiments.
  • the method 600 is described as being implemented by an electronic device (e.g., an HMD 104D).
  • Method 600 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 6 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other nonvolatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
  • the electronic device executes (602) an extended reality application 225 including a graphical user interface (GUI).
  • GUI graphical user interface
  • the electronic device detects and tracks (606) a user gesture in a scene.
  • the scene is a real scene associated with a physical world, and the visual content is rendered to enable augmented reality (AR) or mixed reality (MR) based on the scene.
  • AR augmented reality
  • MR mixed reality
  • the scene is a virtual scene associated with a virtual world, and the visual content is rendered to enable virtual reality (VR) or MR based on the scene.
  • the user gesture is optionally a hand gesture or a gesture enabled by an input agent, such as a magic wand, a pencil, a laser pointer, or a marker.
  • the electronic device determines (608) a target position 302T of the user gesture in the scene, e.g., in accordance with tracking of the user gesture, and displays (610) a target indicator 304B and a source indicator 304A on the GUI.
  • the target indicator 304B indicates (612) the target position of the user gesture in the scene.
  • the target position reflects the user’s intention determined from the tracked user gesture.
  • the source indicator 304A corresponds (614) to a current position of the user gesture and has a relative distance from the target indicator 304B on the GUI, and the relative distance corresponds to (e.g., is proportional to, is linearly correlated with) a distance between the current position 302A and an end position 302B of the hand gesture in the scene.
  • the target indicator 304B is optionally fixed or dynamically adjusted.
  • the relative distance on the GUI is proportional to or linearly correlated with the distance between the current position 302A and end position 302B in the scene.
  • the distance between the current and end positions 302A and 302B aligned along a line of sight is rendered into the relative distance of the source and target indicators 304A and 304B directly facing the user, allowing a user gesture that moves away from or towards the user to be visualized in a 2D plane or a 3D space in a more user-friendly manner.
  • the electronic device compares (616) the relative distance between the target indicator 304B and source indicator 304A with a distance threshold. Displaying of the target indicator 304B and source indicator 304A is initiated in accordance with a determination that the relative distance is less than the distance threshold.
  • the target indicator 304B includes (618) a contour enclosing an area on the GUI. The area corresponds to a region of the scene in which the target position 302T of the user gesture is located with a threshold probability or above. The area has a center corresponding to a location having a greater probability for the target position 302T than remaining locations in the scene.
  • the electronic device obtains an input of the target position 302T of the user gesture provided by an input mechanism, thereby determining the target position 302T.
  • the input mechanism includes a laser pointer or a sensor glove.
  • joint locations of a user hand can be tracked by an algorithm or by sensors of the sensor glove.
  • the electronic device identifies (620) a user action on an object associated with the target position 302T in the scene.
  • each of the target and source indicators corresponds to a respective location, and locations of the target and source indicators are identical.
  • each of the target and source indicators corresponds to a respective area, and a difference of the areas of the target and source indicators is within 5% of the area of the target or source indicator 304A.
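  • A minimal sketch of the overlap test described above follows, assuming circular indicators; the position epsilon and the 5% area-tolerance parameterization are illustrative assumptions.

```python
# Minimal sketch of the overlap test, assuming circular indicators: the
# indicators are treated as overlapping when their centers coincide (within a
# small epsilon) and their areas differ by at most 5% of either area.
import math


def indicators_overlap(src_center, src_radius, tgt_center, tgt_radius,
                       pos_eps_px=1.0, area_tolerance=0.05):
    same_location = math.dist(src_center, tgt_center) <= pos_eps_px
    src_area = math.pi * src_radius ** 2
    tgt_area = math.pi * tgt_radius ** 2
    similar_area = abs(src_area - tgt_area) <= area_tolerance * min(src_area, tgt_area)
    return same_location and similar_area
```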
  • the object includes an affordance item, and the user action is configured to activate the affordance item.
  • the object associated with the target position 302T is one of a user actionable affordance item or a virtual object.
  • the electronic device displays (622) the object associated with the target position 302T on the GUI concurrently with the target indicator 304B, and the target indicator 304B is overlaid at least partially on the object.
  • the electronic device displays a three-dimensional (3D) virtual object at a target portion of the GUI corresponding to the target position 302T of the scene, and displays the source indicator 304A and the target indicator 304B conformally on the 3D virtual object.
  • the electronic device tracks locations of the user gesture in the scene during a duration of time, and projects a path of the user gesture in the scene based on the tracked locations of the user gesture. For example, the path of the user gesture follows a moving direction of a finger, body part, or object held by a user. In some situations, the tracked locations of the user gesture are determined based on locations of user shoulders or heads. The electronic device determines that the path intersects a user actionable affordance item, a real object, or a virtual object in the scene and identifies the target position 302T based on a location of the user actionable affordance item, real object, or virtual object in the scene.
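  • The path-projection and intersection step above can be sketched as a simple ray cast, as below; the least-squares direction fit and the bounding-sphere object representation are assumptions of this illustration, not the disclosed method.

```python
# Hedged sketch of the target-selection step: fit a direction to the tracked
# gesture locations, extend the path as a ray, and take the first object
# (affordance item, real object, or virtual object) the ray intersects.
import numpy as np


def fit_ray(tracked_points):
    """Least-squares direction through the tracked gesture locations."""
    pts = np.asarray(tracked_points, dtype=float)
    origin = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - origin)
    direction = vt[0]
    if np.dot(direction, pts[-1] - pts[0]) < 0:   # orient along the motion
        direction = -direction
    return origin, direction


def first_hit(origin, direction, objects):
    """objects: iterable of (name, center, radius) bounding spheres.
    Returns (distance, name, center) of the nearest intersected object, or None."""
    best = None
    for name, center, radius in objects:
        oc = np.asarray(center, dtype=float) - origin
        t = float(np.dot(oc, direction))          # closest approach along the ray
        if t < 0:
            continue                              # object lies behind the gesture path
        miss_sq = float(np.dot(oc, oc)) - t * t   # squared miss distance from the ray
        if miss_sq <= radius ** 2 and (best is None or t < best[0]):
            best = (t, name, np.asarray(center, dtype=float))
    return best   # the hit object's location gives the target position
```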
  • the target indicator 304B includes a first circle having a first radius R1.
  • the source indicator 304A includes a second circle having a second radius R2 and not concentric with the first circle.
  • the relative distance of the source indicator 304A from the target indicator 304B is measured from a first center of the first circle to a second center of the second circle. The relative distance decreases as the user hand moves towards the target location, and increases as the user hand moves away from the target location.
  • the target indicator 304B includes a first circle having a first radius R1.
  • the source indicator 304A includes a second circle that has a second radius R2 and is concentric with the first circle.
  • the electronic device displays the second circle, while displaying the first circle based on the target position 302T.
  • the second radius R2 is greater than the first radius R1.
  • the relative distance of the source indicator 304A from the target indicator 304B is equal to the second radius R2.
  • the relative distance may be equal to a difference between the second radius R2 and the first radius R1.
  • the electronic device displays the second circle that has a third radius R3, while displaying the first circle based on the target position 302T.
  • the relative distance of the source indicator 304A from the target indicator 304B is equal to the third radius R3, and the third radius R3 is less than the second radius R2.
  • the relative distance may be equal to a difference between the third radius R3 and the first radius R1.
  • the electronic device determines an average hand speed based on a difference of the second and third radii R2 and R3 and a duration of time between the first and second instants of time.
  • the second circle optionally includes a center and an edge.
  • the target indicator includes a portion of a first 3D sphere having a first radius.
  • the source indicator includes a portion of a second 3D sphere that has a second radius and is concentric with the first 3D sphere.
  • the portion of the second 3D sphere is displayed, while the portion of the first 3D sphere is displayed based on the target position.
  • the second radius is greater than the first radius.
  • the relative distance of the source indicator from the target indicator is equal to the second radius.
  • the relative distance may be equal to a difference between the second radius R2 and the first radius R1.
  • a portion of a third 3D sphere that has a third radius is displayed, while the portion of the first 3D sphere is displayed based on the target position.
  • the relative distance of the source indicator from the target indicator is equal to the third radius, and the third radius is less than the second radius.
  • the relative distance may be equal to a difference between the third radius R3 and the first radius R1.
  • the electronic device determines an average hand speed based on a difference of the second and third radii R2 and R3 and a duration of time between the first and second instants of time. It is noted that the portion of each of the first, second, and third 3D spheres is displayed conformally with the virtual object 402.
  • each of the first, second, and third 3D spheres is an imaginary sphere and is not entirely rendered for display. Rather, each imaginary sphere intersects the real or virtual object with a respective portion (e.g., an intersection circle) of the imaginary sphere, and the respective intersection portion of the imaginary sphere (e.g., the intersection circle of each 3D sphere) is displayed on the real or virtual object.
  • the extended reality application 225 includes an augmented reality application, and the target and source indicators 304A and 304B are displayed on the scene that is directly seen through a lens or captured in visual content of a stream of video.
  • the electronic device renders (604) visual content corresponding to the scene with the target indicator 304B and source indicator 304A on the GUI.
  • the extended reality application 225 includes a virtual reality application, and the visual content includes a stream of virtual content.
  • the scene corresponds to a virtual world that is captured by the stream of virtual content.
  • the electronic device renders (604) visual content corresponding to the scene with the target indicator 304B and source indicator 304A on the GUI.
  • the extended reality application 225 includes a mixed reality application, and the visual content includes a stream of mixed real and virtual content.
  • the scene corresponds to a mixed environment that is captured by the stream of mixed real and virtual content.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Abstract

This application is directed to a user interface for extended reality. An electronic device executes an extended reality application including a graphical user interface (GUI). A user gesture is detected and tracked in a scene, and a target position of the user gesture is determined in the scene. The electronic device displays a target indicator and a source indicator on the GUI. The target indicator is rendered at the target position of the user gesture on the GUI, and the source indicator corresponds to a current position of the hand gesture and has a relative distance from the target indicator on the GUI. The relative distance corresponds to a distance between the current position and an end position of the user gesture in the scene.

Description

Object-based Dual Cursor Input and Guiding System
RELATED APPLICATION
[0001] This application claims priority to U.S. Provisional Patent Application No. 63/270,479, titled “Object-based Dual Cursor Input & Guiding System,” filed on October 21, 2021, which is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
[0002] This application relates generally to user interface technology including, but not limited to, methods, systems, and non-transitory computer-readable media for detecting user gestures and visualizing user actions in an extended reality environment.
BACKGROUND
[0003] Gesture control is an important component of the user interface on modern-day electronic devices. For devices with touch screens, touch gestures are used for invoking a graphic, icon, or pointer to point, select, or trigger user interface elements on two-dimensional displays (e.g., display monitors, computer screens). Common touch gestures include tap, double tap, swipe, pinch, zoom, rotate, etc. Each touch gesture is typically associated with a certain user interface function. In contrast, touchless air gestures are used to implement certain user interface functions for electronic devices having no touch screens, e.g., head-mounted displays including virtual reality headsets, augmented reality glasses, mixed reality headsets, and the like. These devices having no touch screens can include front-facing cameras or miniature radars to track human hands in real time. For example, some head-mounted displays have implemented hand tracking functions to complete user interaction including selecting, clicking, and typing on a virtual keyboard. Air gestures can also be used on the devices with touch screens when a user’s hands are not available to touch the screen (e.g., while preparing a meal, the user can use air gestures to scroll down a recipe so that the user does not need to touch the device screen with wet hands). It would be beneficial to have an efficient mechanism to visualize user gestures and enable user interactions in the context of extended reality (e.g., virtual, augmented, and mixed realities).
SUMMARY
[0004] Various embodiments of this application are directed to methods, systems, devices, and non-transitory computer-readable media for enabling user interaction that assists a user’s action on target objects, provides input in a user interface, and controls objects in an extended reality application. An input mechanism (e.g., hands, laser pointer, sensor glove) is used with an electronic device (e.g., a head-mounted display (HMD)) that executes the extended reality application. The extended reality application displays a graphical user interface (GUI) on which the input mechanism and virtual objects are visualized to guide the input mechanism to approach a target. Specifically, the electronic device executes the extended reality application to display a user interface including two cursor-like representations of the input mechanism and a virtual object targeted by the input mechanism. A dual-cursor representation includes a primary cursor (i.e., a source indicator) and an assistive cursor (i.e., a target indicator). The assistive cursor shows a target position that is associated with the virtual object and configured to receive a user interaction. The primary and assistive cursors have different states indicating a distance between the input mechanism and the target object. When the input mechanism on which the extended reality application relies reaches a final position (which overlaps or is distinct from the target position), the primary and assistive cursors reach a final state (e.g., entirely overlap with each other). Each indicator optionally has a 2D shape or 3D volume. By these means, a user’s postural and positional perception of the input mechanism is improved.
[0005] In one aspect, a method is implemented by an electronic device for enabling user interaction in extended reality. The method includes executing an extended reality application including a graphical user interface (GUI), detecting and tracking a user gesture in a scene, and determining a target position of the user gesture in the scene. The method further includes displaying a target indicator and a source indicator on the GUI. The target indicator is rendered at the target position of the user gesture on the GUI. The source indicator has a relative distance from the target indicator on the GUI, and the relative distance corresponds to (e.g., is proportional to, is linearly correlated with) a distance between a current position and an end position of the user gesture in the scene. It is noted that the end position and target position of the user gesture are optionally identical to or distinct from each other and that the target position of the user gesture is based on a user’s intention determined from tracked positions of the user gesture. In some embodiments, the user gesture is made by a user’s hand, another body part, or an object held by the user, and each of the hand, other body part, and object is used as an interaction mechanism.
[0006] In some embodiments, the method further includes comparing the relative distance between the target indicator and source indicator with a distance threshold. Displaying of the target indicator and source indicator is initiated in accordance with a determination that the relative distance is less than the distance threshold. In some embodiments, the target indicator includes a contour enclosing an area on the GUI. The area corresponds to a region of the scene in which the target position of the user gesture is located with a threshold probability or above. The area has a center corresponding to a location having a greater probability for the target position than remaining locations in the scene. In some embodiments, the method further includes, in accordance with a determination that the target indicator and the source indicator substantially overlap on the GUI, identifying a user action on an object associated with the target position in the scene. In some embodiments, the method further includes displaying a three-dimensional (3D) virtual object at a target portion of the GUI corresponding to the target position of the scene and displaying the source indicator and the target indicator conformally on the 3D virtual object.
[0007] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
[0008] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
[0009] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof.
Additional embodiments are discussed in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
[0011] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0012] Figure 2 is a block diagram illustrating an electronic device configured to process content data (e.g., image data), in accordance with some embodiments.
[0013] Figures 3A-3C are three example situations of displaying user hand gestures on a graphical user interface (GUI) in an extended reality application, in accordance with some embodiments.
[0014] Figures 4A and 4B are two example user interfaces displaying a target indicator and a source indicator in a scene of an extended reality application, in accordance with some embodiments.
[0015] Figures 5A and 5B are two example scenes of an extended reality application in each of which a user gesture selects an object, in accordance with some embodiments.
[0016] Figure 6 is a flow diagram of an example method for enabling user interaction in extended reality, in accordance with some embodiments.
[0017] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0018] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
[0019] Various embodiments of this application are directed to visualizing user gestures and enabling user interaction that assists a user’s action on target objects, provides input in a user interface, and controls the target objects in an extended reality application. The extended reality application is executed by an electronic device (e.g., a head-mounted display that is configured to display extended reality content). Extended reality includes augmented reality (AR) in which virtual objects are overlaid on a view of a real physical world, virtual reality (VR) that includes only virtual content, and mixed reality (MR) that combines both AR and VR and in which a user is allowed to interact with real-world and virtual objects. More specifically, AR is an interactive experience of a real-world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information, e.g., across multiple sensory modalities including visual, auditory, haptic, somatosensory, and olfactory. In various embodiments of this application, the electronic device renders visual content corresponding to a scene on a graphical user interface (GUI) of the extended reality application, and detects and tracks a user gesture in the scene. The electronic device visualizes the user gesture by displaying a target indicator and a source indicator jointly on the GUI with the visual content. By these means, the user gesture is recognized and visualized to a user of the electronic device, allowing the user to interact with the extended reality application smoothly and enjoy a seamless immersive experience in the extended reality.
[0020] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, laptop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). In some implementations, the one or more client devices 104 include a head-mounted display (HMD) 104D configured to render extended reality content. Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104. For example, storage 106 may store video content (including visual and audio content), static visual content, and/or inertial sensor data.
[0021] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., formed by the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.
[0022] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other electronic systems that route data and messages.
[0023] The HMD 104D includes one or more cameras (e.g., a visible light camera, a depth camera), a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera(s) and microphone are configured to capture video and audio data from a scene of the HMD 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures gestures of a user wearing the HMD 104D. In some situations, the microphone records ambient sound, including the user’s voice commands. In some situations, both video or static visual data captured by the visible light camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses (i.e., device positions and orientations). The video, static image, audio, or inertial sensor data captured by the HMD 104D are processed by the HMD 104D, server(s) 102, or both to recognize the device poses. Alternatively, in some embodiments, both depth data (e.g., depth map and confidence map) captured by the depth camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The depth and inertial sensor data captured by the HMD 104D are processed by the HMD 104D, server(s) 102, or both to recognize the device poses. The device poses are used to control the HMD 104D itself or interact with an application (e.g., a gaming application) executed by the HMD 104D. In some embodiments, the display of the HMD 104D displays a user interface. The recognized or predicted device poses are used to render virtual objects with high fidelity, and the user gestures captured by the camera are used to interact with visual content on the user interface.
[0024] In some embodiments, SLAM techniques are applied in the data processing environment 100 to process video data, static image data, or depth data captured by the HMD 104D with inertial sensor data. Device poses are recognized and predicted, and a scene in which the HMD 104D is located is mapped and updated. The SLAM techniques are optionally implemented by the HMD 104D independently or by both of the server 102 and the HMD 104D jointly.
[0025] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., HMD 104D in Figure 1), a storage 106, or a combination thereof. The electronic system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
[0026] Optionally, the client device 104 includes an inertial measurement unit (IMU) 280 integrating sensor data captured by multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space. Examples of the one or more inertial sensors of the IMU 280 include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.
[0027] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
• One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices), where the user application(s) 224 includes an extended reality application configured to present multi-media content associated with extended reality;
• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
• Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
• Pose determination and prediction module 230 for determining and predicting a pose of the client device 104 (e.g., HMD 104D), where in some embodiments, the pose is determined and predicted jointly by the pose determination and prediction module 230 and data processing module 228, and the module 230 further includes an SLAM module 232 for mapping a scene where a client device 104 is located and identifying a pose of the client device 104 within the scene using image and IMU sensor data;
• Pose-based rendering module 238 for rendering virtual objects on top of a field of view of the camera 260 of the client device 104 or creating mixed, virtual, or augmented reality content using images captured by the camera 260, where the virtual objects are rendered and the mixed, virtual, or augmented reality content are created from a perspective of the camera 260 (i.e., from a point of view of the camera 260) based on a camera pose of the camera 260; and
  • One or more databases 240 for storing at least data including one or more of:
    o Device settings 242 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 244 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 246 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
    o Training data 248 for training one or more data processing models 250;
    o Data processing model(s) 250 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques;
    o Pose data database 252 for storing pose data of the camera 260; and
    o Content data and results 254 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 250 locally at the client device 104 or remotely at the server 102 to provide the associated results to be presented on client device 104, and include the candidate images.
[0028] Optionally, the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
[0029] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0030] Figures 3A-3C are three example situations 300, 330, and 360 of displaying user hand gestures on a graphical user interface (GUI) 310 in an extended reality application 225, in accordance with some embodiments. The extended reality application 225 is executed by an electronic device (e.g., HMD 104D) and displays a graphical user interface (GUI) 310 for rendering visual content corresponding to a scene and enabling user interaction. Extended reality includes AR, VR, and MR. AR displays a real physical world in the scene, and virtual objects are overlaid on a view of the scene. VR displays a virtual scene that includes only virtual content (e.g., virtual objects). MR combines AR and VR content in the same scene, and a user is allowed to interact with real-world and virtual objects. Specifically, AR is an interactive experience of a real-world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information, e.g., across multiple sensory modalities including visual, auditory, haptic, somatosensory, and olfactory. As such, the extended reality application 225 creates an immersive user experience for the user via the GUI 310.
[0031] While executing the extended reality application 225, the electronic device detects and tracks a user gesture in the scene. In some embodiments, the user gesture is made by a user’s hand, another body part, or an object held by the user, and each of the hand, other body part, and object is used as an interaction mechanism. Specifically, the extended reality application 225 functions jointly with the input processing module 220 and user interface module 218 in Figure 2 to detect and track the user gesture. In some embodiments, the user gesture includes a hand gesture. The hand gesture is performed by a user hand itself, by a hand holding an object (e.g., a wand, a pencil, or a laser pointer), or by a hand wearing a sensor glove. In some situations, the extended reality application 225 is further associated with a data processing model 250 that is applied to process images of hand gestures based on deep learning neural networks and identify information of the hand gesture. The information of the hand gesture includes at least a current position 302A and a target position 302T of the user gesture in the scene. A user hand moves from the current position 302A towards an end position 302B, and stops at the end position 302B. The target position 302T is extrapolated from a moving path created from the current position 302A. In some situations, the target position 302T overlaps with the end position 302B. In some situations, the target position 302T is distinct from the end position 302B.
[0032] A target indicator 304B is rendered at the target position 302T of the user gesture on the GUI, and a source indicator 304A corresponds to the current position 302A. The source indicator 304A and target indicator 304B are displayed, e.g., jointly with the visual content on the GUI 310. The source indicator 304A has a relative distance from the target indicator 304B on the GUI 310. The relative distance corresponds to (e.g., is proportional to, is linearly correlated with) a distance between the current position 302A and the end position 302B of the user gesture in the scene. Specifically, in some embodiments, the relative distance on the GUI 310 is a projection, onto a display that enables the GUI 310, of the distance between the current position 302A and the end position 302B of the hand gesture in the scene. As the user gesture moves from the current position 302A towards the end position 302B, the relative distance decreases, and the target indicator 304B rendered at the target position 302T is optionally fixed or dynamically adjusted. By these means, the distance between the current and end positions 302A and 302B aligned along a line of sight is rendered into the relative distance of the source and target indicators 304A and 304B directly facing the user, allowing the user gesture that moves away from or towards a user to be visualized in a 2D plane or a 3D space in a more user-friendly manner.
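By way of illustration only, one possible realization of this linear mapping is sketched below. The helper name indicator_offset, the fixed screen direction, and the pixels-per-meter gain are assumptions introduced for the example; they are not part of the disclosure.

```python
import numpy as np

def indicator_offset(current_pos, end_pos, target_px, gain_px_per_m=400.0):
    """Map the remaining 3D travel of the gesture onto a 2D separation between
    the source and target indicators (a linear-mapping sketch)."""
    current_pos = np.asarray(current_pos, dtype=float)   # 3D position in meters
    end_pos = np.asarray(end_pos, dtype=float)           # 3D end position in meters

    remaining_m = float(np.linalg.norm(end_pos - current_pos))
    separation_px = gain_px_per_m * remaining_m          # linearly correlated distance

    # For the sketch, place the source indicator along a fixed screen direction
    # from the target; only the separation encodes the gesture's progress.
    direction = np.array([1.0, 0.0])
    source_px = np.asarray(target_px, dtype=float) + separation_px * direction
    return source_px, separation_px
```

Under this mapping, the on-screen separation shrinks to zero exactly when the tracked hand closes the remaining 3D distance, so the two indicators coincide at the end of the gesture.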
[0033] In some embodiments, the target indicator 304B includes a contour enclosing an area on the GUI 310. The area corresponds to a region of the scene in which the target position 302T of the user gesture is located with a threshold probability or above. The area has a center corresponding to a location having a greater probability for the target position 302T than remaining locations in the scene.
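As one way such a contour could be sized (an illustrative assumption, not the disclosed method), the target-position estimate can be modeled as an isotropic 2D Gaussian; the circular contour that encloses the target with a given probability then has a closed-form radius.

```python
import math

def contour_radius(sigma_px, prob=0.9):
    """Radius of a circular contour that contains the estimated target position
    with probability `prob`, assuming an isotropic 2D Gaussian estimate with
    standard deviation sigma_px: P(||x - mean|| <= r) = 1 - exp(-r^2 / (2 * sigma^2))."""
    return sigma_px * math.sqrt(-2.0 * math.log(1.0 - prob))

# Example: contour_radius(12.0, 0.9) is roughly 25.8 px; the contour is drawn
# around the most probable location, i.e., the mean of the estimate.
```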
[0034] In some embodiments, the source indicator 304A and target indicator 304B appear on the GUI 310 when a gesture visualization criterion is satisfied. For example, the gesture visualization criterion includes a distance threshold. The electronic device compares the relative distance between the target indicator 304B and the source indicator 304A with the distance threshold. Displaying of the target indicator 304B and source indicator 304A is initiated in accordance with a determination that the relative distance is less than the distance threshold. In some embodiments, the target position 302T is not stable. The gesture visualization criterion requires that a variation of the target position 302T be stabilized for a predetermined duration of time (e.g., within 100 ms). Displaying of the target indicator 304B and source indicator 304A is initiated in accordance with a determination that the target position 302T is stabilized for the predetermined duration of time.
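A sketch of such a gesture visualization criterion is given below. The class name, the jitter radius used to judge stability, and the default thresholds are illustrative assumptions; only the 100 ms dwell time echoes the example above.

```python
import time

class GestureVisualizationCriterion:
    """Decides when to start drawing the indicators: the relative distance must be
    below a threshold and the target estimate must have stayed within a jitter
    radius for a minimum dwell time (e.g., 100 ms)."""

    def __init__(self, distance_threshold_px=200.0, jitter_px=15.0, dwell_s=0.1):
        self.distance_threshold_px = distance_threshold_px
        self.jitter_px = jitter_px
        self.dwell_s = dwell_s
        self._last_target = None
        self._stable_since = None

    def update(self, relative_distance_px, target_px, now=None):
        now = time.monotonic() if now is None else now
        # Restart the stability timer whenever the target estimate jumps.
        if (self._last_target is None
                or _dist(target_px, self._last_target) > self.jitter_px):
            self._stable_since = now
            self._last_target = target_px
        stable = (now - self._stable_since) >= self.dwell_s
        close_enough = relative_distance_px < self.distance_threshold_px
        return stable and close_enough

def _dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
```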
[0035] Referring to Figure 3A, in some embodiments, the target indicator 304B includes a first circle having a first radius R1, and the source indicator 304A includes a second circle having a second radius R2. The second circle of the source indicator 304A is not concentric with the first circle of the target indicator 304B. Each of the first and second circles includes one of a solid circle or an open circle. The relative distance of the source indicator 304A from the target indicator 304B is measured from a first center of the first circle to a second center of the second circle.
[0036] Referring to Figure 3B, in some embodiments, the target indicator 304B includes a first circle having a first radius R1, and the source indicator 304A includes a second circle that has a second radius R2. The second circle of the source indicator 304A is concentric with the first circle of the target indicator 304B. At a first instant of time, the GUI 310 displays the second circle, while displaying the first circle based on the target position 302T. The second radius R2 is greater than the first radius R1. At the first instant of time, the relative distance of the source indicator 304A from the target indicator 304B is optionally equal to the second radius R2 of the second circle of the source indicator 304A or a difference between the second and first radii, i.e., R2-R1. Further, in some embodiments, at a second instant of time that follows the first instant of time, the GUI 310 displays the second circle that has a third radius R3, while displaying the first circle based on the target position 302T. At the second instant of time, the relative distance of the source indicator 304A from the target indicator 304B is optionally equal to the third radius R3 or a difference between the third and first radii, i.e., R3-R1. The third radius R3 is less than the second radius R2, indicating that the hand gesture is moving closer to the target position 302T in the scene. Additionally, in some embodiments, the electronic device determines an average hand speed based on a difference of the second and third radii (i.e., R2-R3) and a duration of time between the first and second instants of time.
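The concentric-circle encoding and the derived average hand speed can be sketched as follows, assuming the same linear pixels-per-meter gain as the earlier sketch; the gain value and function names are illustrative rather than prescribed.

```python
def source_circle_radius(r_target_px, remaining_m, gain_px_per_m=400.0):
    """Concentric-circle encoding: the source circle shares the target circle's
    center, and its radius shrinks toward the target radius R1 as the remaining
    3D travel approaches zero."""
    return r_target_px + gain_px_per_m * remaining_m

def average_hand_speed_mps(r2_px, r3_px, dt_s, gain_px_per_m=400.0):
    """Average hand speed recovered from how quickly the source circle shrinks:
    (R2 - R3) pixels over dt_s seconds, converted back to meters per second."""
    return (r2_px - r3_px) / (gain_px_per_m * dt_s)

# Example: shrinking from 160 px to 120 px in 0.2 s with a 400 px/m gain gives
# (160 - 120) / (400 * 0.2) = 0.5 m/s.
```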
[0037] In some embodiments not shown, the current position 302 A of the hand gesture stops at the end position 302B and does not reach, or change with respect to, the target position 302T. The current position 302A overlaps the end position 302B and is aligned with the target position 302T. The GUI 310 displays a single indicator 304 indicating that the source indicator 304 A and the target indicator 304B overlap with each other. Conversely, referring to Figure 3C, in some embodiments, the current position 302A of the hand gesture reaches the target position 302T. The GUI 310 displays a single indicator 304 indicating that the source indicator 304 A and the target indicator 304B overlap with each other. In some embodiments, in accordance with a determination that the target indicator 304B and the source indicator 304A substantially overlap with each other on the GUI 310, the electronic device identifies a user action on an object associated with the target position 302T in the scene. Further, in some embodiments, a virtual object is located at the target position 302T in the scene, and the user action is associated with the virtual object. Alternatively, in some embodiments, a real physical object is located at the target position 302T in the scene, and the user action is associated with the physical object. Alternatively, in some embodiments, an actionable affordance item is located at the target position 302T in the scene, and the user action initiates an executable program associated with the actionable affordance item. In some situations, the object associated with the target position 302T is displayed on the GUI 310 concurrently with the target indicator 304B. The target indicator 304B is overlaid at least partially on the object.
[0038] In some embodiments, the user gesture selects the object when the source and target indicators 304 overlap. The user applies a subsequent hand gesture to define a subsequent action (e.g., deletion, copy, display of supplemental information) on the selected object. In an example, a menu is overlaid on top of the object, allowing the user to select one of a list of commands to modify the object. In some embodiments, when the source and target indicators 304 overlap, the user gesture adjusts (e.g., zooms in or out, moves) display of the GUI 310 with respect to the object located at the target position 302T.
[0039] Referring to Figures 3A-3C, each of the target indicator 304B and source indicator 304A includes a portion of a respective 3D sphere that is displayed conformally with a real or virtual object in a scene. In some embodiments (Figure 3B), the target indicator 304B includes a portion of a first 3D sphere having a first radius R1, and the source indicator 304A includes a portion of a second 3D sphere that has a second radius R2 and is concentric with the first 3D sphere. At a first instant of time, the portion of the second 3D sphere is displayed, while the portion of the first 3D sphere is displayed based on the target position. The second radius R2 is greater than the first radius R1. At the first instant of time, the relative distance of the source indicator 304A from the target indicator 304B is equal to the second radius R2. Alternatively, the relative distance may be equal to a difference between the second radius R2 and the first radius R1.
[0040] Further, in some embodiments, at a second instant of time that follows the first instant of time, a portion of a third 3D sphere that has a third radius R3 is displayed, while the portion of the first 3D sphere is displayed based on the target position. At the second instant of time, the relative distance of the source indicator 304A from the target indicator 304B is equal to the third radius R3, and the third radius R3 is less than the second radius R2. Alternatively, the relative distance may be equal to a difference between the third radius R3 and the first radius R1. Additionally, in some embodiments, the electronic device determines an average hand speed based on a difference of the second and third radii R2 and R3 and a duration of time between the first and second instants of time.
[0041] It is noted that each of the first, second, and third 3D spheres includes an imaginary sphere and is not entirely rendered for display. Rather, each imaginary sphere intersects the real or virtual object with a respective portion (e.g., an intersection circle) of the imaginary sphere, and the respective intersection portion of the imaginary sphere (e.g., the intersection circle of each 3D sphere) is displayed on the real or virtual object.
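For a locally planar patch of the object, the visible intersection circle of such an imaginary sphere can be computed as in the sketch below. This is a geometric illustration only; an actual renderer would typically evaluate the intersection per triangle or in a shader, and the function name is an assumption.

```python
import numpy as np

def sphere_surface_intersection(center, radius, plane_point, plane_normal):
    """Intersection circle of an imaginary sphere with a locally planar patch of
    the real or virtual object. Returns (circle_center, circle_radius), or None
    if the sphere does not reach the surface."""
    n = np.asarray(plane_normal, dtype=float)
    n /= np.linalg.norm(n)
    center = np.asarray(center, dtype=float)

    d = float(np.dot(center - np.asarray(plane_point, dtype=float), n))  # signed distance
    if abs(d) > radius:
        return None                      # sphere too small: nothing to draw yet
    circle_center = center - d * n       # foot of the perpendicular on the plane
    circle_radius = (radius ** 2 - d ** 2) ** 0.5
    return circle_center, circle_radius
```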
[0042] Figures 4A and 4B are two example user interfaces 400 displaying a target indicator 304B and a source indicator 304A in a scene of an extended reality application 225, in accordance with some embodiments. The source indicator 304 A and target indicator 304B are projected onto virtual objects 402 for three-dimensional (3D) display effects. When a hand gesture 410 approaches a surface of a virtual object 402 at a target position 302T, a source projection 404 A of the source indicator 304 A and a target projection 404B of the target indicator 304B get closer to each other in Figure 4 A. When the hand gesture 410 reaches an end position 302B, the source indicator 304A and the target indicator 304B overlap with each other in Figure 4B. The end position 302B and target position 302T of the user gesture optionally overlap with each other or differ from each other. The source indicator 304A and target indicator 304B are displayed with the virtual objects 402, when and after the hand gesture 410 approaches the end position 302B. More specifically, in some embodiments, the electronic device displays a three-dimensional (3D) virtual object 402 at a target portion of the GUI corresponding to the target position 302T of the scene. The source indicator 304A and the target indicator 304B are displayed conformally on the 3D virtual object 402.
[0043] In some embodiments, each of the target indicator 304B and source indicator 304A includes a respective 2D circle that is displayed as a respective projection 404A or 404B conformally with the virtual object 402. Alternatively, referring to Figures 4A and 4B, each of the target indicator 304B and source indicator 304A includes a portion of a respective 3D sphere 404A or 404B that is displayed conformally with the virtual object 402. It is noted that in some embodiments, each of the target indicator 304B and source indicator 304A has a 2D shape (e.g., a square) distinct from the 2D circle or a 3D volume (e.g., a cube) distinct from the 3D sphere. In some embodiments, each of the first, second, and third 3D spheres includes an imaginary sphere and is not entirely rendered for display. Rather, each imaginary sphere intersects the real or virtual object with a respective portion (e.g., an intersection circle) of the imaginary sphere, and the respective intersection portion of the imaginary sphere (e.g., the intersection circle of each 3D sphere) is displayed on the real or virtual object.
[0044] Figures 5A and 5B are two example scenes 500 and 550 of an extended reality application 225 in each of which a user gesture selects an object, in accordance with some embodiments. In some embodiments, the user gesture is made by a user’s hand, another body part, or an object held by the user, and each of the hand, other body part, and object is used as an interaction mechanism. The extended reality application 225 is executed by an electronic device (e.g., an HMD 104D) and displays a GUI 310. In some embodiments, the electronic device tracks locations 502 of the user gesture in a scene during a duration of time (e.g., between a first instant t1 and a second instant t2), and projects a path 504 of the user gesture in the scene based on the tracked locations 502 of the user gesture. In some embodiments, locations of a user shoulder or head are tracked to determine the target position 302T. The electronic device determines that the path 504 intersects a user actionable affordance item 508, a real object, or a virtual object in the scene. The target position 302T is identified based on a location of the user actionable affordance item 508, real object, or virtual object in the scene.
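A sketch of this path extrapolation and intersection test follows. The straight-line extrapolation over the sample window and the bounding-sphere test are simplifying assumptions, and estimate_target with its argument layout is an illustrative name rather than the disclosed implementation.

```python
import numpy as np

def estimate_target(samples, candidates):
    """Extrapolate the tracked gesture path and pick the first scene object it
    intersects. `samples` is a list of recent 3D gesture positions (oldest
    first); `candidates` is a list of (object_id, center, bounding_radius)."""
    pts = np.asarray(samples, dtype=float)
    origin = pts[-1]
    direction = pts[-1] - pts[0]                  # overall motion over the window
    norm = np.linalg.norm(direction)
    if norm < 1e-6:
        return None                               # gesture has not moved enough
    direction /= norm

    best = None
    for obj_id, center, r in candidates:
        oc = np.asarray(center, dtype=float) - origin
        t = float(np.dot(oc, direction))          # closest approach along the ray
        if t < 0:
            continue                              # object lies behind the motion
        miss = np.linalg.norm(oc - t * direction)
        if miss <= r and (best is None or t < best[1]):
            best = (obj_id, t, origin + t * direction)
    if best is None:
        return None
    obj_id, _, target_position = best
    return obj_id, target_position
```

Returning the closest-approach point as the target position is a simplification; an implementation could instead return the entry point on the object's surface.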
[0045] Referring to Figure 5A, a current position 302A of a hand gesture is tracked along locations on a path 502A and reaches an end position 302B. The path 502A is approximated with a path 504A. The path 504A extends to intersect with a tree 506, and the tree 506 is selected to be associated with the target position 302T. The tree 506 is optionally a real tree captured by a camera of the electronic device and displayed in the scene in real time. The tree 506 is optionally a real tree seen by the user via an optical see-through display. The tree 506 is optionally a virtual tree rendered in the scene by the electronic device. The target indicator 304B is overlaid at least partially on the tree 506. In some embodiments, a user fully completes the hand gesture at a final instant C, and the hand gesture reaches the tree 506, i.e., the end position 302B overlaps with the target position 302T. Conversely, in some embodiments, a user fully completes the hand gesture at a final instant C, and the hand gesture does not reach the tree 506 while an extended line of the path 504A reaches the tree 506, i.e., the end position 302B is separate from the target position 302T. At the final instant C, the source indicator 304A and target indicator 304B overlap with each other on the GUI 310. Additionally, in some situations, during an initial period of time (e.g.,
t1-t2),
the path 504A is dynamically adjusted in real time based on the path 502A of the hand gesture. At this initial period of time (e.g., t1-t2), the target position 302T is not stable. The gesture visualization criterion requires that a variation of the target position 302T be stabilized for a predetermined duration of time (e.g., within 100 ms) before the source indicator 304A or target indicator 304B are displayed on the GUI 310.
[0046] Referring to Figure 5B, a current position 302A of a hand gesture is tracked along locations on a path 502B and reaches an end position 302B’. The path 502B is approximated with a path 504B. The path 504B extends to intersect with a user actionable affordance item 508 before the path 504B reaches the tree 506. The user actionable affordance item 508 is selected to be associated with the target position 302T’. The target indicator 304B is overlaid at least partially on the user actionable affordance item 508. In some embodiments, in response to a hand gesture reaching the end position 302B’, the user actionable affordance item 508 is selected to initiate an executable program associated with the user actionable affordance item 508.
[0047] Alternatively, a current position 302A of a hand gesture is tracked along locations on a path 502C and reaches an end position 302B”. The path 502C is approximated with a path 504C. The path 504C extends to intersect with a dog 510 sitting in front of the tree 506, before the path 504C reaches the tree 506. The dog 510 is selected to be associated with the target position 302T”. The target indicator 304B is overlaid at least partially on the dog 510. The dog 510 is optionally a real dog appearing in a physical world or a virtual dog rendered in an MR, AR or VR scene.
[0048] In some embodiments, the electronic device includes an HMD 104D, and user interfaces are normally constructed in a 3D space. The user interfaces reflect a distance and a spatial relation between a current position 302A of a hand gesture and an end position 302B associated with a target position 302T of an object targeted by the hand gesture. A dual-indicator system includes a source indicator (also called primary cursor) 304A showing the current position for user interaction on a virtual object and a target indicator (also called assistive cursor) 304B. The distinguishing states of the two indicators indicate the distance between an input agent (e.g., a hand gesture) and the end position 302B or targeted object. When the input agent in an extended reality system reaches the end position 302B or touches a surface for interaction, the corresponding source and target indicators 304A and 304B reach a final state (e.g., in which the indicators 304A and 304B entirely overlap with each other).
[0049] Regular display of hands or other virtual representations of input agents occupies a large portion of a display area. Simply removing the rendered image may affect the user’s spatial perception of the input agent in a virtual space. The dual-indicator system reflects the relative position of the input agent against virtual objects without rendering a representation of the input agents. Moreover, compared to traditional mouse-cursor interaction, the system works with multiple objects simultaneously, providing a more solid perception of the spatial position of the input agent. The form of the indicators 304A and 304B in the previous example is circular. However, a similar result can be achieved by changing the form of both indicators 304A and 304B. The shapes of the two cursors are not necessarily the same as long as they reach a recognizable final state (e.g., become fully overlapped, parallel, concentric, highlighted, glowing) at the end of the approaching process.
[0050] Figure 6 is a flow diagram of an example method 600 for enabling user interaction in extended reality, in accordance with some embodiments. For convenience, the method 600 is described as being implemented by an electronic device (e.g., an HMD 104D). Method 600 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 6 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other nonvolatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 600 may be combined and/or the order of some operations may be changed.
[0051] The electronic device executes (602) an extended reality application 225 including a graphical user interface (GUI). The electronic device detects and tracks (606) a user gesture in a scene. In some embodiments, the scene is a real scene associated with a physical world, and the visual content is rendered to enable augmented reality (AR) or mixed reality (MR) based on the scene. Alternatively, in some embodiments, the scene is a virtual scene associated with a virtual world, and the visual content is rendered to enable virtual reality (VR) or MR based on the scene. The user gesture is optionally a hand gesture or a gesture enabled by an input agent, such as a magic wand, a pencil, a laser pointer, or a marker. The electronic device determines (608) a target position 302T of the user gesture in the scene, e.g., in accordance with tracking of the user gesture, and displays (610) a target indicator 304B and a source indicator 304A on the GUI. The target indicator 304B indicates (612) the target position of the user gesture in the scene. The target position reflects the user’s intention determined from the tracked user gesture. The source indicator 304A corresponds (614) to a current position of the user gesture and has a relative distance from the target indicator 304B on the GUI, and the relative distance corresponds to (e.g., is proportional to, is linearly correlated with) a distance between the current position 302A and an end position 302B of the user gesture in the scene. The target indicator 304B is optionally fixed or dynamically adjusted. For example, the relative distance on the GUI is proportional to or linearly correlated with the distance between the current position 302A and the end position 302B in the scene. As a result, the distance between the current and end positions 302A and 302B aligned along a line of sight is rendered into the relative distance of the source and target indicators 304A and 304B directly facing the user, allowing the user gesture that moves away from or towards a user to be visualized in a 2D plane or a 3D space in a more user-friendly manner.
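The per-frame flow of method 600 can be sketched as below. The tracker, scene, and gui objects and all of their methods are hypothetical stand-ins for the device's hand-tracking, scene, and rendering modules, and the gain and overlap tolerance are assumed values; the sketch only illustrates the ordering of the tracking, target determination, display, and action steps.

```python
import numpy as np

def dual_cursor_frame(tracker, scene, gui, gain_px_per_m=400.0, overlap_px=5.0):
    """One GUI update of the dual-cursor method (a sketch with placeholder
    tracker/scene/gui objects, not the disclosed implementation)."""
    current = tracker.current_position()                       # 3D gesture position
    hit = scene.intersect_path(tracker.recent_samples())       # (obj_id, target, end) or None
    if current is None or hit is None:
        return
    obj_id, target_position, end_position = hit

    # Relative on-screen distance as a linear mapping of the remaining 3D travel.
    remaining_m = float(np.linalg.norm(np.asarray(end_position) - np.asarray(current)))
    separation_px = gain_px_per_m * remaining_m

    target_px = gui.project(target_position)                   # target indicator location
    gui.draw_target_indicator(target_px)
    gui.draw_source_indicator(target_px, separation_px)

    # When the two indicators substantially overlap, treat it as a user action
    # (e.g., selection) on the object associated with the target position.
    if separation_px <= overlap_px:
        gui.trigger_action(obj_id)
```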
[0052] In some embodiments, the electronic device compares (616) the relative distance between the target indicator 304B and source indicator 304A with a distance threshold. Displaying of the target indicator 304B and source indicator 304A is initiated in accordance with a determination that the relative distance is less than the distance threshold.
[0053] In some embodiments, the target indicator 304B includes (618) a contour enclosing an area on the GUI. The area corresponds to a region of the scene in which the target position 302T of the user gesture is located with a threshold probability or above. The area has a center corresponding to a location having a greater probability for the target position 302T than remaining locations in the scene.
[0054] In some embodiments, the electronic device obtains an input of the target position 302T of the user gesture provided by an input mechanism, thereby determining the target position 302T. Optionally, the input mechanism includes a laser pointer or a sensor glove. Specifically, joint locations of a user hand can be tracked by an algorithm or by sensors of the sensor glove.
[0055] In some embodiments, in accordance with a determination that the target indicator 304B and the source indicator 304A substantially overlap on the GUI, the electronic device identifies (620) a user action on an object associated with the target position 302T in the scene. Optionally, each of the target and source indicators corresponds to a respective location, and locations of the target and source indicators are identical. Optionally, each of the target and source indicators corresponds to a respective area, and a difference of the areas of the target and source indicators is within 5% of the area of the target indicator 304B or source indicator 304A. In some embodiments, the object includes an affordance item, and the user action is configured to activate the affordance item. Further, in some embodiments, the object associated with the target position 302T is one of a user actionable affordance item or a virtual object. The electronic device displays (622) the object associated with the target position 302T on the GUI concurrently with the target indicator 304B, and the target indicator 304B is overlaid at least partially on the object.
[0056] In some embodiments, the electronic device displays a three-dimensional (3D) virtual object at a target portion of the GUI corresponding to the target position 302T of the scene, and displays the source indicator 304A and the target indicator 304B conformally on the 3D virtual object.
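A sketch of such a "substantially overlap" test for circular indicators is shown below, using the 5% area tolerance mentioned above together with an assumed center tolerance; the function name and the center tolerance are illustrative.

```python
import math

def substantially_overlap(source, target, area_tol=0.05, center_tol_px=2.0):
    """Overlap test for circular indicators: the centers (nearly) coincide and
    the two areas differ by no more than area_tol of either area. `source` and
    `target` are (center_xy, radius_px) tuples."""
    (sc, sr), (tc, tr) = source, target
    center_dist = math.hypot(sc[0] - tc[0], sc[1] - tc[1])

    area_s = math.pi * sr ** 2
    area_t = math.pi * tr ** 2
    area_diff_ok = abs(area_s - area_t) <= area_tol * min(area_s, area_t)

    return center_dist <= center_tol_px and area_diff_ok
```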
[0057] In some embodiments, the electronic device tracks locations of the user gesture in the scene during a duration of time, and projects a path of the user gesture in the scene based on the tracked locations of the user gesture. For example, the path of the user gesture follows a moving direction of a finger, body part, or object held by a user. In some situations, the tracked locations of the user gesture are determined based on locations of user shoulders or heads. The electronic device determines that the path intersects a user actionable affordance item, a real object, or a virtual object in the scene and identifies the target position 302T based on a location of the user actionable affordance item, real object, or virtual object in the scene.
[0058] Referring to Figure 3A, in some embodiments, the target indicator 304B includes a first circle having a first radius R1. The source indicator 304A includes a second circle having a second radius R2 and not concentric with the first circle. The relative distance of the source indicator 304A from the target indicator 304B is measured from a first center of the first circle to a second center of the second circle. The relative distance decreases as the user hand moves towards the target location, and increases as the user hand moves away from the target location.
[0059] In some embodiments, the target indicator 304B includes a first circle having a first radius R1, and the source indicator 304A includes a second circle that has a second radius R2 and is concentric with the first circle. At a first instant of time, the electronic device displays the second circle, while displaying the first circle based on the target position 302T. The second radius R2 is greater than the first radius R1. At the first instant of time, the relative distance of the source indicator 304A from the target indicator 304B is equal to the second radius R2. Alternatively, the relative distance may be equal to a difference between the second radius R2 and the first radius R1. Further, in some embodiments, at a second instant of time that follows the first instant of time, the electronic device displays the second circle that has a third radius R3, while displaying the first circle based on the target position 302T. At the second instant of time, the relative distance of the source indicator 304A from the target indicator 304B is equal to the third radius R3, and the third radius R3 is less than the second radius R2. Alternatively, the relative distance may be equal to a difference between the third radius R3 and the first radius R1. Additionally, in some embodiments, the electronic device determines an average hand speed based on a difference of the second and third radii R2 and R3 and a duration of time between the first and second instants of time. The second circle optionally includes a center and an edge. The edge encloses an area where the current position 302A of the user hand is located with a threshold probability or above, and the center has a greater probability for the current position 302A than remaining locations on the GUI.
[0060] In some embodiments, the target indicator includes a portion of a first 3D sphere having a first radius, and the source indicator includes a portion of a second 3D sphere that has a second radius and is concentric with the first 3D sphere. At a first instant of time, the portion of the second 3D sphere is displayed, while the portion of the first 3D sphere is displayed based on the target position. The second radius is greater than the first radius. At the first instant of time, the relative distance of the source indicator from the target indicator is equal to the second radius. Alternatively, the relative distance may be equal to a difference between the second radius R2 and the first radius R1. Further, in some embodiments, at a second instant of time that follows the first instant of time, a portion of a third 3D sphere that has a third radius is displayed, while the portion of the first 3D sphere is displayed based on the target position. At the second instant of time, the relative distance of the source indicator from the target indicator is equal to the third radius, and the third radius is less than the second radius. Alternatively, the relative distance may be equal to a difference between the third radius R3 and the first radius R1. Additionally, in some embodiments, the electronic device determines an average hand speed based on a difference of the second and third radii R2 and R3 and a duration of time between the first and second instants of time. It is noted that the portion of each of the first, second, and third 3D spheres is displayed conformally with the virtual object 402. In some embodiments, each of the first, second, and third 3D spheres includes an imaginary sphere and is not entirely rendered for display.
Rather, each imaginary sphere intersects the real or virtual object with a respective portion (e.g., an intersection circle) of the imaginary sphere, and the respective intersection portion of the imaginary sphere (e.g., the intersection circle of each 3D sphere) is displayed on the real or virtual object.
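As a non-limiting sketch of the concentric variant of paragraphs [0059] and [0060], the code below maps the remaining hand-to-target distance to the radius of the source indicator and derives an average hand speed from the difference of the second and third radii over the elapsed time; the function names, the clamping choice, and the sample values are illustrative assumptions rather than part of the disclosure.

```python
def source_radius_from_distance(remaining_distance: float, target_radius: float) -> float:
    # The source-indicator radius encodes the relative distance to the target,
    # clamped so it never shrinks below the target-indicator radius R1.
    return max(remaining_distance, target_radius)

def average_hand_speed(r_earlier: float, r_later: float, dt: float) -> float:
    # Average speed derived from how much the source radius shrank (e.g., R2 - R3)
    # over the duration of time between the two instants.
    return (r_earlier - r_later) / dt

r1 = 0.05                                      # first radius R1 (target indicator)
r2 = source_radius_from_distance(0.40, r1)     # second radius R2 at the first instant
r3 = source_radius_from_distance(0.25, r1)     # third radius R3 at the second instant
print(average_hand_speed(r2, r3, dt=0.5))      # ~0.3 distance units per second
```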
[0061] In some embodiments, the extended reality application 225 includes an augmented reality application, and the target and source indicators 304B and 304A are displayed on the scene that is directly seen through a lens or captured in visual content of a stream of video.
[0062] In some embodiments, the electronic device renders (604) visual content corresponding to the scene with the target indicator 304B and source indicator 304A on the GUI. The extended reality application 225 includes a virtual reality application, and the visual content includes a stream of virtual content. The scene corresponds to a virtual world that is captured by the stream of virtual content.
[0063] In some embodiments, the electronic device renders (604) visual content corresponding to the scene with the target indicator 304B and source indicator 304A on the GUI. The extended reality application 225 includes a mixed reality application, and the visual content includes a stream of mixed real and virtual content. The scene corresponds to a mixed environment that is captured by the stream of mixed real and virtual content.
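The following sketch illustrates, purely as an assumed structure, how rendering of the scene and the two indicators might branch on whether the extended reality application of paragraphs [0061]-[0063] is an augmented, virtual, or mixed reality application; the XRMode enum, the render_frame function, and the PrintCompositor stand-in are hypothetical and not part of the disclosure.

```python
from enum import Enum, auto

class XRMode(Enum):
    AUGMENTED = auto()   # scene seen through a lens or captured in a stream of video
    VIRTUAL = auto()     # visual content is a stream of virtual content (virtual world)
    MIXED = auto()       # visual content is a stream of mixed real and virtual content

class PrintCompositor:
    """Stand-in compositor that logs what would be drawn each frame."""
    def draw_content(self, scene_frame):
        print(f"render scene content: {scene_frame}")
    def draw_overlay(self, target_indicator, source_indicator):
        print(f"render indicators: target={target_indicator}, source={source_indicator}")

def render_frame(mode, scene_frame, target_indicator, source_indicator, compositor):
    if mode is XRMode.AUGMENTED:
        # See-through AR: the physical scene is already visible, so only the
        # target and source indicators are composited over it.
        compositor.draw_overlay(target_indicator, source_indicator)
    else:
        # VR and MR: render the virtual or mixed content stream first,
        # then the two indicators on top.
        compositor.draw_content(scene_frame)
        compositor.draw_overlay(target_indicator, source_indicator)

render_frame(XRMode.VIRTUAL, "frame#0", "target@302T", "source@302A", PrintCompositor())
```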
[0064] It should be understood that the particular order in which the operations in Figure 6 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to enable user interaction in extended reality. Additionally, it should be noted that details of other processes described above with respect to Figures 3A-5B are also applicable in an analogous manner to method 600 described above with respect to Figure 6. For brevity, these details are not repeated here.
[0065] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
[0066] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
[0067] The foregoing description, for purposes of explanation, has been provided with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art to understand and practice them.
[0068] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is:
1. A method implemented by an electronic device for enabling user interaction in extended reality, comprising: executing an extended reality application including a graphical user interface (GUI); detecting and tracking a user gesture in a scene; determining a target position of the user gesture in the scene; and displaying a target indicator and a source indicator on the GUI, the target indicator rendered at the target position of the user gesture on the GUI, the source indicator corresponding to a current position of the user gesture and having a relative distance from the target indicator on the GUI, wherein the relative distance corresponds to a distance between the current position and an end position of the user gesture in the scene.
2. The method of claim 1, further comprising: comparing the relative distance between the target indicator and the source indicator with a distance threshold; wherein displaying of the target indicator and source indicator is initiated in accordance with a determination that the relative distance is less than the distance threshold.
3. The method of claim 1 or 2, wherein the target indicator includes a contour enclosing an area on the GUI, the area corresponding to a region of the scene in which the target position of the user gesture is located with a threshold probability or above, the area having a center corresponding to a location having a greater probability for the target position than remaining locations in the scene.
4. The method of any of the preceding claims, further comprising: obtaining an input of the target position of the user gesture provided by an input mechanism, thereby determining the target position.
5. The method of any of the preceding claims, further comprising: in accordance with a determination that the target indicator and the source indicator substantially overlap on the GUI, identifying a user action on an object associated with the target position in the scene.
6. The method of claim 5, wherein the object associated with the target position is one of a user actionable affordance item or a virtual object, further comprising: displaying the object associated with the target position on the GUI concurrently with the target indicator, wherein the target indicator is overlaid at least partially on the object.
7. The method of any of claims 1-4, further comprising: displaying a three-dimensional (3D) virtual object at a target portion of the GUI corresponding to the target position of the scene; and displaying the source indicator and the target indicator conformally on the 3D virtual object.
8. The method of any of claims 1-4, the method further comprising: tracking locations of the user gesture in the scene during a duration of time; projecting a path of the user gesture in the scene based on the tracked locations of the user gesture; determining that the path intersects a user actionable affordance item, a real object, or a virtual object in the scene; and identifying the target position based on a location of the user actionable affordance item, real object, or virtual object in the scene.
9. The method of any of claims 1-8, wherein: the target indicator includes a first circle having a first radius; the source indicator includes a second circle having a second radius and not concentric with the first circle; and the relative distance of the source indicator from the target indicator is measured from a first center of the first circle to a second center of the second circle.
10. The method of any of claims 1-8, wherein the target indicator includes a first circle having a first radius, and the source indicator includes a second circle that has a second radius and is concentric with the first circle, displaying the target indicator and the source indicator on the GUI further comprising: at a first instant of time, displaying the second circle, while displaying the first circle based on the target position, the second radius greater than the first radius, wherein at the first instant of time, the relative distance of the source indicator from the target indicator is equal to the second radius.
11. The method of claim 10, displaying the target indicator and the source indicator on the GUI further comprising: at a second instant of time that follows the first instant of time, displaying the second circle that has a third radius, while displaying the first circle based on the target position, wherein at the second instant of time, the relative distance of the source indicator from the target indicator is equal to the third radius, the third radius less than the second radius.
12. The method of any of claims 1-8, wherein the target indicator includes a portion of a first 3D sphere having a first radius, and the source indicator includes a portion of a second 3D sphere that has a second radius and is concentric with the first 3D sphere, displaying the target indicator and the source indicator on the GUI further comprising: at a first instant of time, displaying the portion of the second 3D sphere, while displaying the portion of the first 3D sphere based on the target position, the second radius greater than the first radius, wherein at the first instant of time, the relative distance of the source indicator from the target indicator is equal to the second radius.
13. The method of claim 12, displaying the target indicator and the source indicator on the GUI further comprising: at a second instant of time that follows the first instant of time, displaying a portion of a third 3D sphere that has a third radius, while displaying the portion of the first 3D sphere based on the target position, wherein at the second instant of time, the relative distance of the source indicator from the target indicator is equal to the third radius, the third radius less than the second radius.
14. The method of any of claims 1-13, wherein the extended reality application includes an augmented reality application, and the target and source indicators are displayed on the scene that is directly seen through a lens or captured in visual content of a stream of video, the scene corresponding to a physical world.
15. The method of any of claims 1-13, further comprising rendering visual content corresponding to the scene with the target indicator and source indicator on the GUI, wherein the extended reality application includes a virtual reality application, and wherein the visual content includes a stream of virtual content, and the scene corresponds to a virtual world that is captured by the stream of virtual content.
16. The method of any of claims 1-13, further comprising rendering visual content corresponding to the scene with the target indicator and source indicator on the GUI, wherein the extended reality application includes a mixed reality application, and wherein the visual content includes a stream of mixed real and virtual content, and the scene corresponds to a mixed environment that is captured by the stream of mixed real and virtual content.
17. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-16.
18. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-16.
PCT/US2022/047236 2021-10-21 2022-10-20 Object-based dual cursor input and guiding system WO2023069591A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163270479P 2021-10-21 2021-10-21
US63/270,479 2021-10-21

Publications (1)

Publication Number Publication Date
WO2023069591A1 true WO2023069591A1 (en) 2023-04-27

Family

ID=86059678

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/047236 WO2023069591A1 (en) 2021-10-21 2022-10-20 Object-based dual cursor input and guiding system

Country Status (1)

Country Link
WO (1) WO2023069591A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180300940A1 (en) * 2017-04-17 2018-10-18 Intel Corporation Augmented reality and virtual reality feedback enhancement system, apparatus and method

Similar Documents

Publication Publication Date Title
US11714592B2 (en) Gaze-based user interactions
US10606609B2 (en) Context-based discovery of applications
US20190172266A1 (en) Rendering virtual objects in 3d environments
JP6013583B2 (en) Method for emphasizing effective interface elements
JP7092028B2 (en) Information processing equipment, information processing methods, and programs
JP6072237B2 (en) Fingertip location for gesture input
JP5807686B2 (en) Image processing apparatus, image processing method, and program
CN109997098B (en) Apparatus, associated method and associated computer-readable medium
JP7059934B2 (en) Information processing equipment, information processing methods, and programs
CN112783700A (en) Computer readable medium for network-based remote assistance system
US11367416B1 (en) Presenting computer-generated content associated with reading content based on user interactions
TW202324041A (en) User interactions with remote devices
CN108369451B (en) Information processing apparatus, information processing method, and computer-readable storage medium
US20230400956A1 (en) Displaying Representations of Environments
JPWO2015198729A1 (en) Display control apparatus, display control method, and program
WO2023069591A1 (en) Object-based dual cursor input and guiding system
KR102156175B1 (en) Interfacing device of providing user interface expoliting multi-modality and mehod thereof
US20230065077A1 (en) Displaying a Rendered Volumetric Representation According to Different Display Modes
WO2023086102A1 (en) Data visualization in extended reality
WO2023219612A1 (en) Adaptive resizing of manipulatable and readable objects
US20230095282A1 (en) Method And Device For Faciliating Interactions With A Peripheral Device
US20230162450A1 (en) Connecting Spatially Distinct Settings
CN112578983A (en) Finger-oriented touch detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22884460

Country of ref document: EP

Kind code of ref document: A1