CN117193538A - Interface for rendering avatars in a three-dimensional environment - Google Patents


Info

Publication number
CN117193538A
Authority
CN
China
Prior art keywords
user
computer system
facial
display
prompt
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311319264.9A
Other languages
Chinese (zh)
Inventor
R·T·G·伯顿
G·I·布彻
K·E·S·鲍尔莱因
S·O·勒梅
J·瑞克瓦德
W·A·索伦蒂诺三世
G·耶基斯
D·D·达尔根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apple Inc
Original Assignee
Apple Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/667,350 (published as US 2022/0262080 A1)
Application filed by Apple Inc
Priority claimed from PCT/US2022/016451 (published as WO 2022/177900 A1)
Publication of CN117193538A


Abstract

The present disclosure relates to interfaces for rendering avatars in three-dimensional environments. In some embodiments, the computer system displays a user interface for registering one or more features of a user of the computer system. In some embodiments, the computer system displays visual effects associated with the avatar in an XR environment. In some embodiments, the computer system displays objects having different visual characteristics in an XR environment. In some embodiments, the computer system switches between different presentation modes associated with users represented in the XR environment. In some embodiments, the computer system displays the avatar in an XR environment.

Description

Interface for rendering avatars in a three-dimensional environment
The present application is a divisional application of the patent application with application number 202280015249.2, filed on February 15, 2022, and entitled "Interface for rendering avatars in a three-dimensional environment".
Cross Reference to Related Applications
The present application claims priority from U.S. provisional application No. 63/149,989, entitled "INTERFACES FOR PRESENTING AVATARS IN THREE-DIMENSIONAL ENVIRONMENTS", filed on February 16, 2021, and U.S. application No. 17/667,350, entitled "INTERFACES FOR PRESENTING AVATARS IN THREE-DIMENSIONAL ENVIRONMENTS", filed on February 8, 2022, each of which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates generally to computer systems in communication with a display generation component and optionally one or more input devices providing a computer-generated experience, including, but not limited to, electronic devices providing virtual reality and mixed reality experiences via a display.
Background
In recent years, the development of computer systems for augmented reality has increased significantly. An example augmented reality environment includes at least some virtual elements that replace or augment the physical world. Input devices (such as cameras, controllers, joysticks, touch-sensitive surfaces, and touch screen displays) for computer systems and other electronic computing devices are used to interact with the virtual/augmented reality environment. Exemplary virtual elements include virtual objects such as digital images, video, text, icons, and control elements (such as buttons and other graphics).
Disclosure of Invention
Some methods and interfaces for interacting with environments (e.g., applications, augmented reality environments, mixed reality environments, and virtual reality environments) that include at least some virtual elements are cumbersome, inefficient, and limited. For example, systems that provide insufficient feedback for actions associated with virtual objects, systems that require a series of inputs to achieve a desired result in an augmented reality environment, and systems in which manipulating virtual objects is complex, tedious, and error-prone create a significant cognitive burden on the user and detract from the experience of the virtual/augmented reality environment. In addition, these methods take longer than necessary, wasting energy of the computer system. This latter consideration is particularly important in battery-operated devices.
Accordingly, there is a need for a computer system with improved methods and interfaces to provide a user with a computer-generated experience, thereby making user interactions with the computer system more efficient and intuitive for the user. Such methods and interfaces optionally complement or replace conventional methods for providing an augmented reality experience to a user. Such methods and interfaces reduce the number, extent, and/or nature of inputs from a user by helping the user understand the association between the inputs provided and the response of the device to those inputs, thereby forming a more efficient human-machine interface.
The disclosed system reduces or eliminates the above-described drawbacks and other problems associated with user interfaces of computer systems that are in communication with a display generation component and optionally one or more input devices. In some embodiments, the computer system is a desktop computer with an associated display. In some embodiments, the computer system is a portable device (e.g., a notebook computer, a tablet computer, or a handheld device). In some embodiments, the computer system is a personal electronic device (e.g., a wearable electronic device such as a watch or a head-mounted device). In some embodiments, the computer system has a touch pad. In some embodiments, the computer system has one or more cameras. In some implementations, the computer system has a touch-sensitive display (also referred to as a "touch screen" or "touch screen display"). In some embodiments, the computer system has one or more eye tracking components. In some embodiments, the computer system has one or more hand tracking components. In some embodiments, the computer system has, in addition to the display generation component, one or more output devices including one or more haptic output generators and one or more audio output devices. In some embodiments, the computer system has a graphical user interface (GUI), one or more processors, memory, and one or more modules, programs, or sets of instructions stored in the memory for performing multiple functions. In some embodiments, the user interacts with the GUI through stylus and/or finger contacts and gestures on the touch-sensitive surface, movements of the user's eyes and hands in space relative to the GUI (and/or computer system) or the user's body (as captured by cameras and other motion sensors), and/or voice inputs (as captured by one or more audio input devices). In some embodiments, the functions performed through the interactions optionally include image editing, drawing, presenting, word processing, spreadsheet making, game playing, telephoning, video conferencing, e-mailing, instant messaging, workout support, digital photographing, digital video recording, web browsing, digital music playing, note taking, and/or digital video playing. Executable instructions for performing these functions are optionally included in a transitory and/or non-transitory computer-readable storage medium or other computer program product configured for execution by one or more processors.
There is a need for an electronic device with improved methods and interfaces to interact with a three-dimensional environment. Such methods and interfaces may supplement or replace conventional methods for interacting with a three-dimensional environment. Such methods and interfaces reduce the amount, degree, and/or nature of input from a user and result in a more efficient human-machine interface. For battery-powered computing devices, such methods and interfaces conserve power and increase the time interval between battery charges.
It is noted that the various embodiments described above may be combined with any of the other embodiments described herein. The features and advantages described in this specification are not all-inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter.
Drawings
For a better understanding of the various described embodiments, reference should be made to the following detailed description taken in conjunction with the following drawings, in which like reference numerals designate corresponding parts throughout the several views.
FIG. 1 is a block diagram illustrating an operating environment of a computer system for providing an extended reality (XR) experience, according to some embodiments.
FIG. 2 is a block diagram illustrating a controller of a computer system configured to manage and coordinate a user's XR experience, according to some embodiments.
FIG. 3 is a block diagram illustrating a display generation component of a computer system configured to provide a visual component of an XR experience to a user, according to some embodiments.
FIG. 4 is a block diagram illustrating a hand tracking unit of a computer system configured to capture gesture inputs of a user, according to some embodiments.
Fig. 5 is a block diagram illustrating an eye tracking unit of a computer system configured to capture gaze input of a user, according to some embodiments.
Fig. 6 is a flow diagram illustrating a glint-assisted gaze tracking pipeline, in accordance with some embodiments.
Fig. 7A-7H illustrate a user interface for registering one or more features of a user of a computer system, according to some embodiments.
FIG. 8 is a flowchart illustrating an exemplary method for registering one or more features of a user of a computer system, according to some embodiments.
Fig. 9A-9F illustrate various visual effects associated with an avatar presented in an XR environment, according to some embodiments.
FIG. 10 is a flowchart illustrating an exemplary method for displaying visual indicators on a hand of an avatar in an XR environment, according to some embodiments.
Fig. 11 is a flowchart illustrating an exemplary method for displaying objects having different visual characteristics in an XR environment, according to some embodiments.
Fig. 12A-12E illustrate various presentation modes associated with users represented in an XR environment, according to some embodiments.
Fig. 13A and 13B are flowcharts illustrating exemplary methods for switching between different presentation modes associated with users represented in an XR environment, according to some embodiments.
FIG. 14 is a flowchart of an exemplary method for displaying an avatar in an XR environment, according to some embodiments.
Detailed Description
According to some embodiments, the present disclosure relates to a user interface for providing an extended reality (XR) experience to a user.
The systems, methods, and GUIs described herein improve user interface interactions with virtual/augmented reality environments in a variety of ways.
In some embodiments, the computer system switches between different presentation modes associated with users represented in the XR environment. The computer system is in communication with a display generation component and an external computer system associated with a first user. The computer system displays, via the display generation component, a communication user interface that includes a representation of the first user of the external computer system in a first presentation mode, wherein the communication user interface displays the representation of the first user in an extended reality environment, and the representation of the first user has a shape that visually reacts to changes in movement of a first portion of the first user detected by the external computer system while in the first presentation mode. While displaying the representation of the first user in the first presentation mode, the computer system receives first data from the external computer system, the first data indicating movement of the first portion of the first user, and, in response to receiving the first data, modifies the shape of the representation of the first user based on the movement of the first portion of the first user. After modifying the shape of the representation of the first user, the computer system receives second data indicating that the representation of the first user is to be displayed in a second presentation mode different from the first presentation mode. In response to receiving the second data, the computer system displays, via the display generation component, the representation of the first user in the second presentation mode, wherein the representation of the first user has a shape that does not visually react to changes in movement of the first portion of the first user detected by the external computer system while in the second presentation mode. While displaying the representation of the first user in the second presentation mode, the computer system receives third data indicating that the first user has moved from a first location in the physical environment to a second location in the physical environment that is different from the first location in the physical environment, and, in response to receiving the third data, displays the representation of the first user moving from a first location in the extended reality environment to a second location in the extended reality environment that is different from the first location in the extended reality environment.
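The mode-switching behavior described above can be sketched in a few lines of code. The sketch below is illustrative only; the type names, the data layout, and the way the three kinds of data are delivered are assumptions, not the implementation described in this disclosure.

```swift
enum PresentationMode { case expressive, nonExpressive }

struct UserRepresentation {
    var mode: PresentationMode
    var shapeParameters: [Float]        // simplified stand-in for the avatar's shape
    var location: SIMD3<Float>

    // First data: movement of a portion of the remote user (e.g., face or hands).
    // The shape only reacts while in the first ("expressive") presentation mode.
    mutating func applyMovementData(_ delta: [Float]) {
        guard mode == .expressive, !delta.isEmpty else { return }
        for i in shapeParameters.indices {
            shapeParameters[i] += delta[i % delta.count]
        }
    }

    // Second data: a request to switch presentation modes.
    mutating func applyModeChange(to newMode: PresentationMode) {
        mode = newMode
    }

    // Third data: a new location; honored in either mode, so the representation
    // still moves through the environment when the remote user walks around.
    mutating func applyLocationData(_ newLocation: SIMD3<Float>) {
        location = newLocation
    }
}

var avatar = UserRepresentation(mode: .expressive, shapeParameters: [0, 0, 0], location: .zero)
avatar.applyMovementData([0.1, 0.2, 0.3])        // shape reacts in the first mode
avatar.applyModeChange(to: .nonExpressive)
avatar.applyMovementData([0.5])                  // ignored: shape is non-reactive in the second mode
avatar.applyLocationData(SIMD3<Float>(1, 0, 2))  // still moves in the environment
```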
In some embodiments, the computer system displays the avatar in an XR environment. The computer system is in communication with a display generation component and an external computer system associated with a first user. In response to receiving a request to display a representation of the first user in an extended reality environment, in accordance with a determination that a set of glasses display criteria is satisfied, the computer system displays, via the display generation component, the representation of the first user in the extended reality environment and displays, via the display generation component, a representation of glasses positioned on the representation of the first user in the extended reality environment. In accordance with a determination that the set of glasses display criteria is not satisfied, the computer system displays, via the display generation component, the representation of the first user in the extended reality environment without displaying, in the extended reality environment, the representation of glasses positioned on the representation of the first user.
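A compact sketch of this conditional display logic follows. The particular criteria and type names are assumptions chosen for illustration; the text does not enumerate what the glasses display criteria are.

```swift
// Hypothetical stand-ins for the avatar scene and the glasses display criteria.
struct AvatarScene {
    var layers: [String] = []
    mutating func show(_ layer: String) { layers.append(layer) }
}

struct GlassesDisplayCriteria {
    var userWearsGlasses: Bool          // assumed criterion
    var glassesEnabledInSettings: Bool  // assumed criterion
    var isSatisfied: Bool { userWearsGlasses && glassesEnabledInSettings }
}

func displayRepresentation(of user: String,
                           criteria: GlassesDisplayCriteria,
                           in scene: inout AvatarScene) {
    scene.show("representation of \(user)")
    if criteria.isSatisfied {
        scene.show("glasses positioned on the representation of \(user)")
    }
    // Otherwise the representation is displayed without glasses.
}
```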
In some embodiments, the computer system displays a user interface for registering one or more features of a user of the computer system. The computer system is in communication with a display generation component and one or more cameras. During a registration process that includes capturing facial data of the user via the one or more cameras, the computer system displays, via the display generation component, a registration interface for registering one or more features of the user, including: outputting a first prompt to position a first set of one or more facial features of the user in a first predefined set of one or more facial expressions; and outputting a second prompt to position a second set of one or more facial features of the user in a second predefined set of one or more facial expressions different from the first predefined set of one or more facial expressions.
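The prompting flow reads more clearly as a short sketch. Everything below, including the specific prompt wording, the facial features named, and the face-capture callback, is hypothetical and only illustrates the sequence of prompts described above.

```swift
// A prompt asks the user to put certain facial features into certain expressions.
struct FacialPrompt {
    let features: [String]
    let expressions: [String]
}

// Runs the prompts in order, capturing facial data after each one.
// `capture` stands in for the camera pipeline that produces face data.
func runRegistration(prompts: [FacialPrompt],
                     capture: (FacialPrompt) -> [Float]) -> [[Float]] {
    var facialData: [[Float]] = []
    for prompt in prompts {
        print("Position your \(prompt.features.joined(separator: " and ")) " +
              "like this: \(prompt.expressions.joined(separator: ", "))")
        facialData.append(capture(prompt))
    }
    return facialData
}

// Two prompts, mirroring the first and second prompts in the text.
let captured = runRegistration(
    prompts: [
        FacialPrompt(features: ["eyebrows"], expressions: ["raised"]),
        FacialPrompt(features: ["mouth"], expressions: ["smile", "open wide"])
    ],
    capture: { _ in [0.0, 0.0, 0.0] }   // placeholder facial data
)
```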
In some embodiments, the computer system displays visual effects associated with the avatar in an XR environment. The computer system is in communication with a display generation component and one or more sensors. The computer system displays, via the display generation component, a user feature indicator interface that includes a set of one or more visual indicators corresponding to a detected position of a set of one or more features of a user's hand in a physical environment, wherein the set of one or more visual indicators is displayed in an extended reality environment and has a first display position corresponding to a first detected position of the set of one or more features of the user's hand in the physical environment. The computer system detects, via the one or more sensors, movement of at least one feature in the set of one or more features of the user's hand. In response to detecting the movement of the at least one feature in the set of one or more features of the user's hand, the computer system updates the display of the user feature indicator interface, including: in accordance with a determination that the set of one or more features of the user's hand moves to a second detected position in the physical environment, displaying, via the display generation component, the set of one or more visual indicators in the extended reality environment having a second display position that corresponds to the second detected position of the set of one or more features of the user's hand in the physical environment; and in accordance with a determination that the set of one or more features of the user's hand moves to a third detected position in the physical environment that is different from the second detected position, displaying, via the display generation component, the set of one or more visual indicators in the extended reality environment having a third display position that corresponds to the third detected position of the set of one or more features of the user's hand in the physical environment, wherein the third display position in the extended reality environment is different from the second display position in the extended reality environment.
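In code, the update rule amounts to "the indicators' display positions track the detected positions of the hand features." The sketch below assumes a trivial identity mapping from physical-space coordinates to display coordinates; a real system would apply its own scene registration transform.

```swift
// Hypothetical mapping from a detected physical-space position to a display
// position in the XR environment (identity here for simplicity).
func displayPosition(forDetected p: SIMD3<Float>) -> SIMD3<Float> { p }

// Visual indicators whose display positions follow the detected positions of
// the tracked hand features.
struct HandFeatureIndicators {
    var displayPositions: [SIMD3<Float>] = []

    // Called whenever the sensors report new detected positions for the hand features.
    mutating func update(detectedPositions: [SIMD3<Float>]) {
        displayPositions = detectedPositions.map(displayPosition(forDetected:))
    }
}
```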
In some embodiments, the computer system displays objects having different visual characteristics in an XR environment. The computer system is in communication with a display generation component and an external computer system associated with a first user. The computer system displays, via the display generation component, a representation of the first user in an extended reality environment, wherein the representation of the first user is displayed in the extended reality environment having a first pose and a shape based on a shape of at least a portion of the first user, and wherein the shape of the representation of the first user is visualized with a first set of visual characteristics. The computer system receives first data that includes data indicating a change in pose of the first user and, in response to receiving the first data, updates an appearance of the representation of the first user in the extended reality environment, including: in accordance with a determination that the first data includes an indication that a first portion of the first user is contacting an object, displaying in the extended reality environment: the representation of the first user having a second pose based on the change in pose of the first user, wherein the shape of the representation of the first user is visualized with the first set of visual characteristics; and a representation of the object having a shape based on a shape of at least a portion of the object, wherein the shape of the representation of the object is visualized with a second set of visual characteristics different from the first set of visual characteristics.
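A short sketch of this rendering rule follows; the two visual-characteristic sets (opacity and particle density) are invented for illustration and are not taken from this disclosure.

```swift
// A set of visual characteristics used to visualize a shape.
struct VisualCharacteristics {
    var opacity: Float
    var particleDensity: Float
}

let avatarCharacteristics = VisualCharacteristics(opacity: 0.9, particleDensity: 0.2)
let objectCharacteristics = VisualCharacteristics(opacity: 0.5, particleDensity: 0.8) // distinct set

// Pose-change data received from the external computer system.
struct PoseUpdate {
    var newPose: [Float]
    var isContactingObject: Bool
}

// Returns the items to draw after a pose update: always the avatar in its new
// pose, plus the contacted object (with its own characteristics) when contact
// is indicated.
func itemsToDisplay(after update: PoseUpdate) -> [(name: String, characteristics: VisualCharacteristics)] {
    var items = [(name: "representation of the first user", characteristics: avatarCharacteristics)]
    if update.isContactingObject {
        items.append((name: "representation of the object", characteristics: objectCharacteristics))
    }
    return items
}
```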
Fig. 1-6 provide a description of an exemplary computer system for providing an XR experience to a user. Fig. 7A-7H illustrate a user interface for registering one or more features of a user of a computer system, according to some embodiments. FIG. 8 is a flowchart illustrating an exemplary method for registering one or more features of a user of a computer system, in accordance with various embodiments. Fig. 7A to 7H are diagrams for illustrating the process in fig. 8. Fig. 9A-9F illustrate various visual effects associated with an avatar in an XR environment, according to some embodiments. FIG. 10 is a flowchart illustrating an exemplary method for displaying visual indicators on a hand of an avatar in an XR environment, according to some embodiments. Fig. 11 is a flowchart illustrating an exemplary method for displaying objects having different visual characteristics in an XR environment, according to some embodiments. Fig. 9A to 9F are for illustrating the process in fig. 10 and 11. Fig. 12A-12E illustrate various presentation modes associated with users represented in an XR environment, according to some embodiments. Fig. 13A and 13B are flowcharts illustrating exemplary methods for switching between different presentation modes associated with users represented in an XR environment, according to some embodiments. FIG. 14 is a flowchart of an exemplary method for displaying an avatar in an XR environment, according to some embodiments. Fig. 12A to 12E are diagrams for illustrating the processes in fig. 13A, 13B, and 14.
The processes described below enhance operability of a device and make user-device interfaces more efficient (e.g., by helping a user provide appropriate input and reducing user error in operating/interacting with the device) through various techniques, including by providing improved visual feedback to the user, reducing the number of inputs required to perform an operation, providing additional control options without cluttering the user interface with additional display controls, performing an operation when a set of conditions has been met without further user input, improving privacy and/or security, and/or additional techniques. These techniques also reduce power usage and extend battery life of the device by enabling a user to use the device faster and more efficiently.
Furthermore, in methods described herein in which one or more steps are contingent upon one or more conditions having been met, it should be understood that the described method can be repeated in multiple repetitions so that, over the course of the repetitions, all of the conditions upon which steps in the method are contingent have been met in different repetitions of the method. For example, if a method requires performing a first step if a condition is satisfied, and a second step if the condition is not satisfied, a person of ordinary skill would appreciate that the claimed steps are repeated until the condition has been both satisfied and not satisfied, in no particular order. Thus, a method described as having one or more steps that are contingent upon one or more conditions having been met could be rewritten as a method that is repeated until each of the conditions described in the method has been met. This, however, is not required of system or computer-readable-medium claims in which the system or computer-readable medium contains instructions for performing the contingent operations based on the satisfaction of the corresponding one or more conditions, and is thus capable of determining whether the contingency has or has not been satisfied without explicitly repeating the steps of the method until all of the conditions upon which steps in the method are contingent have been met. A person of ordinary skill in the art would also understand that, similar to a method with contingent steps, a system or computer-readable storage medium can repeat the steps of the method as many times as needed to ensure that all of the contingent steps have been performed.
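As a purely illustrative example of the point above (not drawn from the claims), a method with two condition-dependent steps can simply be run over multiple repetitions so that each branch is taken in some repetition:

```swift
// One repetition of a method whose step depends on whether a condition is met.
func performMethod(conditionMet: Bool) -> String {
    conditionMet ? "performed first step" : "performed second step"
}

// Repeating the method so that, across repetitions, both the "condition met"
// and "condition not met" branches are exercised, in no particular order.
for conditionMet in [true, false] {
    print(performMethod(conditionMet: conditionMet))
}
```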
In some embodiments, as shown in fig. 1, an XR experience is provided to a user via an operating environment 100 that includes a computer system 101. The computer system 101 includes a controller 110 (e.g., processors of a portable electronic device or a remote server), a display generation component 120 (e.g., a Head Mounted Device (HMD), a display, a projector, a touch screen, etc.), one or more input devices 125 (e.g., an eye tracking device 130, a hand tracking device 140, other input devices 150), one or more output devices 155 (e.g., a speaker 160, a haptic output generator 170, and other output devices 180), one or more sensors 190 (e.g., an image sensor, a light sensor, a depth sensor, a haptic sensor, an orientation sensor, a proximity sensor, a temperature sensor, a position sensor, a motion sensor, a speed sensor, etc.), and optionally one or more peripheral devices 195 (e.g., a home appliance, a wearable device, etc.). In some implementations, one or more of the input device 125, the output device 155, the sensor 190, and the peripheral device 195 are integrated with the display generation component 120 (e.g., in a head-mounted device or a handheld device).
When describing an XR experience, various terms are used to differentially refer to several related but distinct environments that the user may sense and/or interact with (e.g., with inputs detected by the computer system 101 generating the XR experience, such inputs causing the computer system generating the XR experience to generate audio, visual, and/or tactile feedback corresponding to the various inputs provided to the computer system 101). The following is a subset of these terms:
physical environment: a physical environment refers to a physical world in which people can sense and/or interact without the assistance of an electronic system. Physical environments such as physical parks include physical objects such as physical trees, physical buildings, and physical people. People can directly sense and/or interact with a physical environment, such as by visual, tactile, auditory, gustatory, and olfactory.
Extended reality: In contrast, an extended reality (XR) environment refers to a wholly or partially simulated environment that people sense and/or interact with via an electronic system. In XR, a subset of a person's physical motions, or representations thereof, is tracked, and, in response, one or more characteristics of one or more virtual objects simulated in the XR environment are adjusted in a manner consistent with at least one law of physics. For example, an XR system may detect a person's head turning and, in response, adjust the graphical content and sound field presented to the person in a manner similar to how such views and sounds would change in a physical environment. In some situations (e.g., for accessibility reasons), adjustments to characteristics of virtual objects in an XR environment may be made in response to representations of physical motions (e.g., voice commands). A person may sense and/or interact with an XR object using any of their senses, including sight, hearing, touch, taste, and smell. For example, a person may sense and/or interact with audio objects that create a 3D or spatial audio environment that provides the perception of point audio sources in 3D space. As another example, audio objects may enable audio transparency that selectively incorporates ambient sounds from the physical environment with or without computer-generated audio. In some XR environments, a person may sense and/or interact with only audio objects.
Examples of XR include virtual reality and mixed reality.
Virtual reality: a Virtual Reality (VR) environment refers to a simulated environment designed to be based entirely on computer-generated sensory input for one or more senses. The VR environment includes a plurality of virtual objects that a person can sense and/or interact with. For example, computer-generated images of trees, buildings, and avatars representing people are examples of virtual objects. A person may sense and/or interact with virtual objects in the VR environment through a simulation of the presence of the person within the computer-generated environment and/or through a simulation of a subset of the physical movements of the person within the computer-generated environment.
Mixed reality: in contrast to VR environments designed to be based entirely on computer-generated sensory input, a Mixed Reality (MR) environment refers to a simulated environment designed to introduce sensory input from a physical environment or a representation thereof in addition to including computer-generated sensory input (e.g., virtual objects). On a virtual continuum, a mixed reality environment is any condition between, but not including, a full physical environment as one end and a virtual reality environment as the other end. In some MR environments, the computer-generated sensory input may be responsive to changes in sensory input from the physical environment. In addition, some electronic systems for rendering MR environments may track the position and/or orientation relative to the physical environment to enable virtual objects to interact with real objects (i.e., physical objects or representations thereof from the physical environment). For example, the system may cause the motion such that the virtual tree appears to be stationary relative to the physical ground.
Examples of mixed reality include augmented reality and augmented virtuality.
Augmented reality: an Augmented Reality (AR) environment refers to a simulated environment in which one or more virtual objects are superimposed over a physical environment or a representation of a physical environment. For example, an electronic system for presenting an AR environment may have a transparent or translucent display through which a person may directly view the physical environment. The system may be configured to present the virtual object on a transparent or semi-transparent display such that a person perceives the virtual object superimposed over the physical environment with the system. Alternatively, the system may have an opaque display and one or more imaging sensors that capture images or videos of the physical environment, which are representations of the physical environment. The system combines the image or video with the virtual object and presents the composition on an opaque display. A person utilizes the system to indirectly view the physical environment via an image or video of the physical environment and perceive a virtual object superimposed over the physical environment. As used herein, video of a physical environment displayed on an opaque display is referred to as "pass-through video," meaning that the system captures images of the physical environment using one or more image sensors and uses those images when rendering an AR environment on the opaque display. Further alternatively, the system may have a projection system that projects the virtual object into the physical environment, for example as a hologram or on a physical surface, such that a person perceives the virtual object superimposed on top of the physical environment with the system. An augmented reality environment also refers to a simulated environment in which a representation of a physical environment is transformed by computer-generated sensory information. For example, in providing a passthrough video, the system may transform one or more sensor images to apply a selected viewing angle (e.g., a viewpoint) that is different from the viewing angle captured by the imaging sensor. As another example, the representation of the physical environment may be transformed by graphically modifying (e.g., magnifying) portions thereof such that the modified portions may be representative but not real versions of the original captured image. For another example, the representation of the physical environment may be transformed by graphically eliminating or blurring portions thereof.
Augmented virtuality: An augmented virtuality (AV) environment refers to a simulated environment in which a virtual or computer-generated environment incorporates one or more sensory inputs from the physical environment. The sensory inputs may be representations of one or more characteristics of the physical environment. For example, an AV park may have virtual trees and virtual buildings, but people with faces realistically reproduced from images taken of physical people. As another example, a virtual object may adopt the shape or color of a physical object imaged by one or more imaging sensors. As a further example, a virtual object may adopt shadows consistent with the position of the sun in the physical environment.
Viewpoint-locked virtual object: A virtual object is viewpoint-locked when the computer system displays the virtual object at the same location and/or position in the user's viewpoint, even as the user's viewpoint shifts (e.g., changes). In embodiments in which the computer system is a head-mounted device, the user's viewpoint is locked to the forward-facing direction of the user's head (e.g., the user's viewpoint is at least a portion of the user's field of view when the user is looking straight ahead); thus, the user's viewpoint remains fixed even as the user's gaze shifts, without moving the user's head. In embodiments in which the computer system has a display generation component (e.g., a display screen) that can be repositioned relative to the user's head, the user's viewpoint is the augmented reality view that is presented to the user on the display generation component of the computer system. For example, a viewpoint-locked virtual object that is displayed in the upper left corner of the user's viewpoint when the user's viewpoint is in a first orientation (e.g., with the user's head facing north) continues to be displayed in the upper left corner of the user's viewpoint even as the user's viewpoint changes to a second orientation (e.g., with the user's head facing west). In other words, the location and/or position at which the viewpoint-locked virtual object is displayed in the user's viewpoint is independent of the user's position and/or orientation in the physical environment. In embodiments in which the computer system is a head-mounted device, the user's viewpoint is locked to the orientation of the user's head, such that the virtual object is also referred to as a "head-locked virtual object."
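The geometric intuition can be written down in a few lines. The following is only a sketch under simplifying assumptions (a single rigid viewpoint transform, no display latency), not the system's actual rendering math: a viewpoint-locked object keeps a fixed offset in view space, so its world position is recomputed from the current viewpoint every frame.

```swift
import simd

// World position of a viewpoint-locked object: the fixed view-space offset is
// rotated into world space and attached to the current viewpoint, so the object
// never moves within the user's viewpoint as the head turns or translates.
func worldPosition(ofViewpointLockedOffset offset: SIMD3<Float>,
                   viewpointPosition: SIMD3<Float>,
                   viewpointOrientation: simd_quatf) -> SIMD3<Float> {
    viewpointPosition + viewpointOrientation.act(offset)
}
```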
Environment-locked visual object: when a computer system displays a virtual object at a location and/or position in a user's point of view, the virtual object is environment-locked (alternatively, "world-locked"), the location and/or position being based on (e.g., selected and/or anchored to) a location and/or object in a three-dimensional environment (e.g., a physical environment or virtual environment) with reference to the location and/or object. As the user's point of view moves, the position and/or object in the environment relative to the user's point of view changes, which results in the environment-locked virtual object being displayed at a different position and/or location in the user's point of view. For example, an environmentally locked virtual object that locks onto a tree immediately in front of the user is displayed at the center of the user's viewpoint. When the user's viewpoint is shifted to the right (e.g., the user's head is turned to the right) such that the tree is now to the left of center in the user's viewpoint (e.g., the tree positioning in the user's viewpoint is shifted), the environmentally locked virtual object that is locked onto the tree is displayed to the left of center in the user's viewpoint. In other words, the position and/or orientation at which the environment-locked virtual object is displayed in the user's viewpoint depends on the position and/or orientation of the object and/or the position at which the virtual object is locked in the environment. In some embodiments, the computer system uses a stationary frame of reference (e.g., a coordinate system anchored to a fixed location and/or object in the physical environment) in order to determine the location of the virtual object that displays the environmental lock in the viewpoint of the user. The environment-locked virtual object may be locked to a stationary portion of the environment (e.g., a floor, wall, table, or other stationary object), or may be locked to a movable portion of the environment (e.g., a representation of a vehicle, animal, person, or even a portion of a user's body such as a user's hand, wrist, arm, or foot that moves independent of the user's point of view) such that the virtual object moves as the point of view or the portion of the environment moves to maintain a fixed relationship between the virtual object and the portion of the environment.
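By contrast, an environment-locked object stores its position in a stationary world frame, as described above. The sketch below (under the same simplifying assumptions as the previous one) derives where that fixed world point lands in the current viewpoint, which is why it drifts across the view as the viewpoint moves.

```swift
import simd

// View-space position of an environment-locked (world-locked) object: the fixed
// world anchor is transformed by the inverse of the current viewpoint transform,
// so its position in the user's viewpoint changes whenever the viewpoint moves.
func viewSpacePosition(ofWorldAnchor worldAnchor: SIMD3<Float>,
                       viewpointPosition: SIMD3<Float>,
                       viewpointOrientation: simd_quatf) -> SIMD3<Float> {
    viewpointOrientation.inverse.act(worldAnchor - viewpointPosition)
}
```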
In some implementations, the environmentally or view-locked virtual object exhibits an inert follow-up behavior that reduces or delays movement of the environmentally or view-locked virtual object relative to movement of a reference point that the virtual object follows. In some embodiments, the computer system intentionally delays movement of the virtual object when detecting movement of a reference point (e.g., a portion of the environment, a viewpoint, or a point fixed relative to the viewpoint, such as a point between 5cm and 300cm from the viewpoint) that the virtual object is following while exhibiting inert follow-up behavior. For example, when a reference point (e.g., the portion or viewpoint of the environment) moves at a first speed, the virtual object is moved by the device to remain locked to the reference point, but moves at a second speed that is slower than the first speed (e.g., until the reference point stops moving or slows down, at which point the virtual object begins to catch up with the reference point). In some embodiments, when the virtual object exhibits inert follow-up behavior, the device ignores small movements of the reference point (e.g., ignores movements of the reference point below a threshold amount of movement, such as movements of 0 to 5 degrees or movements of 0 to 50 cm). For example, when a reference point (e.g., the portion or point of view of the environment to which the virtual object is locked) moves a first amount, the distance between the reference point and the virtual object increases (e.g., because the virtual object is being displayed so as to maintain a fixed or substantially fixed position relative to the portion of the environment or point of view other than the reference point to which the virtual object is locked), and when the reference point (e.g., the portion or point of view of the environment to which the virtual object is locked) moves a second amount that is greater than the first amount, the distance between the reference point and the virtual object initially increases (e.g., because the virtual object is being displayed so as to maintain a fixed or substantially fixed position relative to the portion of the environment other than the point of view or point to which the virtual object is locked), and then decreases as the amount of movement of the reference point increases above a threshold (e.g., an "inertia following" threshold) because the virtual object is moved by the computer system to maintain a fixed or substantially fixed position relative to the reference point. In some embodiments, maintaining a substantially fixed position of the virtual object relative to the reference point includes the virtual object being displayed within a threshold distance (e.g., 1cm, 2cm, 3cm, 5cm, 15cm, 20cm, 50 cm) of the reference point in one or more dimensions (e.g., up/down, left/right, and/or forward/backward relative to the position of the reference point).
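One simple way to realize the lazy-follow behavior described above is a dead zone plus fractional catch-up. The constants and the specific smoothing rule below are illustrative assumptions, not the thresholds given in the text:

```swift
import simd

// A position that lazily follows a reference point: small reference movements
// (within the dead zone) are ignored, and larger movements are followed at a
// fraction of the remaining distance per update, so the object lags behind and
// then catches up when the reference point slows down or stops.
struct LazyFollower {
    var position: SIMD3<Float>
    var deadZone: Float = 0.05        // ignore reference movement below ~5 cm (assumed)
    var catchUpFraction: Float = 0.2  // fraction of the gap closed per update (assumed)

    mutating func update(referencePoint: SIMD3<Float>) {
        let offset = referencePoint - position
        guard simd_length(offset) > deadZone else { return } // small movement: stay put
        position += offset * catchUpFraction                 // larger movement: trail behind, then catch up
    }
}
```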
Hardware: there are many different types of electronic systems that enable a person to sense and/or interact with various XR environments. Examples include head-mounted systems, projection-based systems, head-up displays (HUDs), vehicle windshields integrated with display capabilities, windows integrated with display capabilities, displays formed as lenses designed for placement on a human eye (e.g., similar to contact lenses), headphones/earphones, speaker arrays, input systems (e.g., wearable or handheld controllers with or without haptic feedback), smart phones, tablet computers, and desktop/laptop computers. The head-mounted system may have one or more speakers and an integrated opaque display. Alternatively, the head-mounted system may be configured to accept an external opaque display (e.g., a smart phone). The head-mounted system may incorporate one or more imaging sensors for capturing images or video of the physical environment, and/or one or more microphones for capturing audio of the physical environment. The head-mounted system may have a transparent or translucent display instead of an opaque display. The transparent or translucent display may have a medium through which light representing an image is directed to the eyes of a person. The display may utilize digital light projection, OLED, LED, uLED, liquid crystal on silicon, laser scanning light sources, or any combination of these techniques. The medium may be an optical waveguide, a holographic medium, an optical combiner, an optical reflector, or any combination thereof. In one embodiment, the transparent or translucent display may be configured to selectively become opaque. Projection-based systems may employ retinal projection techniques that project a graphical image onto a person's retina. The projection system may also be configured to project the virtual object into the physical environment, for example as a hologram or on a physical surface.
In some embodiments, the controller 110 is configured to manage and coordinate the XR experience of the user. In some embodiments, controller 110 includes suitable combinations of software, firmware, and/or hardware. The controller 110 is described in more detail below with reference to fig. 2. In some implementations, the controller 110 is a computing device that is in a local or remote location relative to the scene 105 (e.g., physical environment). For example, the controller 110 is a local server located within the scene 105. As another example, the controller 110 is a remote server (e.g., cloud server, central server, etc.) located outside of the scene 105. In some implementations, the controller 110 is communicatively coupled with the display generation component 120 (e.g., HMD, display, projector, touch-screen, etc.) via one or more wired or wireless communication channels 144 (e.g., bluetooth, IEEE 802.11x, IEEE 802.16x, IEEE 802.3x, etc.). In another example, the controller 110 is included within a housing (e.g., a physical enclosure) of the display generation component 120 (e.g., an HMD or portable electronic device including a display and one or more processors, etc.), one or more of the input devices 125, one or more of the output devices 155, one or more of the sensors 190, and/or one or more of the peripheral devices 195, or shares the same physical housing or support structure with one or more of the above.
In some embodiments, display generation component 120 is configured to provide an XR experience (e.g., at least a visual component of the XR experience) to a user. In some embodiments, display generation component 120 includes suitable combinations of software, firmware, and/or hardware. The display generating section 120 is described in more detail below with respect to fig. 3. In some embodiments, the functionality of the controller 110 is provided by and/or combined with the display generating component 120.
According to some embodiments, display generation component 120 provides an XR experience to a user when the user is virtually and/or physically present within scene 105.
In some embodiments, the display generating component is worn on a portion of the user's body (e.g., on his/her head, on his/her hand, etc.). As such, display generation component 120 includes one or more XR displays provided for displaying XR content. For example, in various embodiments, the display generation component 120 encloses a field of view of a user. In some embodiments, display generation component 120 is a handheld device (such as a smart phone or tablet computer) configured to present XR content, and the user holds the device with a display facing the user's field of view and a camera facing scene 105. In some embodiments, the handheld device is optionally placed within a housing that is worn on the head of the user. In some embodiments, the handheld device is optionally placed on a support (e.g., tripod) in front of the user. In some embodiments, display generation component 120 is an XR room, housing, or room configured to present XR content, wherein the user does not wear or hold display generation component 120. Many of the user interfaces described with reference to one type of hardware for displaying XR content (e.g., a handheld device or a device on a tripod) may be implemented on another type of hardware for displaying XR content (e.g., an HMD or other wearable computing device). For example, a user interface showing interactions with XR content triggered based on interactions occurring in a space in front of a handheld device or a tripod-mounted device may similarly be implemented with an HMD, where the interactions occur in the space in front of the HMD and responses to the XR content are displayed via the HMD. Similarly, a user interface showing interaction with XR content triggered based on movement of a handheld device or tripod-mounted device relative to a physical environment (e.g., a scene 105 or a portion of a user's body (e.g., a user's eye, head, or hand)) may similarly be implemented with an HMD, where the movement is caused by movement of the HMD relative to the physical environment (e.g., the scene 105 or a portion of the user's body (e.g., a user's eye, head, or hand)).
While relevant features of the operating environment 100 are shown in fig. 1, those of ordinary skill in the art will recognize from this disclosure that various other features are not shown for the sake of brevity and so as not to obscure more relevant aspects of the exemplary embodiments disclosed herein.
Fig. 2 is a block diagram of an example of a controller 110 according to some embodiments. While certain specific features are shown, those of ordinary skill in the art will appreciate from the disclosure that various other features are not shown for the sake of brevity and so as not to obscure more pertinent aspects of the embodiments disclosed herein. To this end, as a non-limiting example, in some embodiments, the controller 110 includes one or more processing units 202 (e.g., microprocessors, application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs), graphics Processing Units (GPUs), central Processing Units (CPUs), processing cores, etc.), one or more input/output (I/O) devices 206, one or more communication interfaces 208 (e.g., universal Serial Bus (USB), IEEE 802.3x, IEEE 802.11x, IEEE 802.16x, global system for mobile communications (GSM), code Division Multiple Access (CDMA), time Division Multiple Access (TDMA), global Positioning System (GPS), infrared (IR), bluetooth, ZIGBEE, and/or similar types of interfaces), one or more programming (e.g., I/O) interfaces 210, memory 220, and one or more communication buses 204 for interconnecting these components and various other components.
In some embodiments, one or more of the communication buses 204 include circuitry that interconnects and controls communications between system components. In some implementations, the one or more I/O devices 206 include at least one of a keyboard, a mouse, a touch pad, a joystick, one or more microphones, one or more speakers, one or more image sensors, one or more displays, and the like.
Memory 220 includes high-speed random access memory such as Dynamic Random Access Memory (DRAM), static Random Access Memory (SRAM), double data rate random access memory (DDR RAM), or other random access solid state memory devices. In some embodiments, memory 220 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. Memory 220 optionally includes one or more storage devices located remotely from the one or more processing units 202. Memory 220 includes a non-transitory computer-readable storage medium. In some embodiments, memory 220 or a non-transitory computer readable storage medium of memory 220 stores the following programs, modules, and data structures, or a subset thereof, including optional operating system 230 and XR experience module 240.
Operating system 230 includes instructions for handling various basic system services and for performing hardware-related tasks. In some embodiments, XR experience module 240 is configured to manage and coordinate single or multiple XR experiences of one or more users (e.g., single XR experiences of one or more users, or multiple XR experiences of a respective group of one or more users). To this end, in various embodiments, the XR experience module 240 includes a data acquisition unit 241, a tracking unit 242, a coordination unit 246, and a data transmission unit 248.
In some embodiments, the data acquisition unit 241 is configured to acquire data (e.g., presentation data, interaction data, sensor data, location data, etc.) from at least the display generation component 120 of fig. 1, and optionally from one or more of the input device 125, the output device 155, the sensor 190, and/or the peripheral device 195. To this end, in various embodiments, the data acquisition unit 241 includes instructions and/or logic for instructions as well as heuristics and metadata for heuristics.
In some embodiments, tracking unit 242 is configured to map scene 105 and track at least the location/position of display generation component 120 relative to scene 105 of fig. 1, and optionally relative to one or more of tracking input device 125, output device 155, sensor 190, and/or peripheral device 195. To this end, in various embodiments, the tracking unit 242 includes instructions and/or logic for instructions as well as heuristics and metadata for heuristics. In some embodiments, tracking unit 242 includes a hand tracking unit 244 and/or an eye tracking unit 243. In some embodiments, the hand tracking unit 244 is configured to track the location/position of one or more portions of the user's hand, and/or the motion of one or more portions of the user's hand relative to the scene 105 of fig. 1, relative to the display generating component 120, and/or relative to a coordinate system defined relative to the user's hand. The hand tracking unit 244 is described in more detail below with respect to fig. 4. In some embodiments, the eye tracking unit 243 is configured to track the positioning or movement of the user gaze (or more generally, the user's eyes, face, or head) relative to the scene 105 (e.g., relative to the physical environment and/or relative to the user (e.g., the user's hand)) or relative to XR content displayed via the display generating component 120. The eye tracking unit 243 is described in more detail below with respect to fig. 5.
In some embodiments, coordination unit 246 is configured to manage and coordinate XR experiences presented to a user by display generation component 120, and optionally by one or more of output device 155 and/or peripheral device 195. For this purpose, in various embodiments, coordination unit 246 includes instructions and/or logic for instructions as well as heuristics and metadata for heuristics.
In some embodiments, the data transmission unit 248 is configured to transmit data (e.g., presentation data, location data, etc.) to at least the display generation component 120, and optionally to one or more of the input device 125, the output device 155, the sensor 190, and/or the peripheral device 195. For this purpose, in various embodiments, the data transmission unit 248 includes instructions and/or logic for instructions as well as heuristics and metadata for heuristics.
While the data acquisition unit 241, tracking unit 242 (e.g., including eye tracking unit 243 and hand tracking unit 244), coordination unit 246, and data transmission unit 248 are shown as residing on a single device (e.g., controller 110), it should be understood that in other embodiments, any combination of the data acquisition unit 241, tracking unit 242 (e.g., including eye tracking unit 243 and hand tracking unit 244), coordination unit 246, and data transmission unit 248 may reside in a single computing device.
Furthermore, FIG. 2 is a functional description of various features that may be present in a particular implementation, as opposed to a schematic of the embodiments described herein. As will be appreciated by one of ordinary skill in the art, the individually displayed items may be combined and some items may be separated. For example, some of the functional blocks shown separately in fig. 2 may be implemented in a single block, and the various functions of a single functional block may be implemented by one or more functional blocks in various embodiments. The actual number of modules and the division of particular functions, and how features are allocated among them, will vary depending upon the particular implementation, and in some embodiments, depend in part on the particular combination of hardware, software, and/or firmware selected for a particular implementation.
Fig. 3 is a block diagram of an example of display generation component 120 according to some embodiments. While certain specific features are shown, those of ordinary skill in the art will appreciate from the disclosure that various other features are not shown for the sake of brevity and so as not to obscure more pertinent aspects of the embodiments disclosed herein. For this purpose, as a non-limiting example, in some embodiments, the display generation component 120 (e.g., HMD) includes one or more processing units 302 (e.g., microprocessors, ASIC, FPGA, GPU, CPU, processing cores, etc.), one or more input/output (I/O) devices and sensors 306, one or more communication interfaces 308 (e.g., USB, FIREWIRE, THUNDERBOLT, IEEE 802.3x, IEEE 802.11x, IEEE 802.16x, GSM, CDMA, TDMA, GPS, IR, BLUETOOTH, ZIGBEE, and/or similar types of interfaces), one or more programming (e.g., I/O) interfaces 310, one or more XR displays 312, one or more optional internally and/or externally facing image sensors 314, memory 320, and one or more communication buses 304 for interconnecting these components and various other components.
In some embodiments, one or more communication buses 304 include circuitry for interconnecting and controlling communications between various system components. In some embodiments, the one or more I/O devices and sensors 306 include an Inertial Measurement Unit (IMU), an accelerometer, a gyroscope, a thermometer, one or more physiological sensors (e.g., blood pressure monitor, heart rate monitor, blood oxygen sensor, blood glucose sensor, etc.), one or more microphones, one or more speakers, a haptic engine, and/or one or more depth sensors (e.g., structured light, time of flight, etc.), and/or the like.
In some embodiments, one or more XR displays 312 are configured to provide an XR experience to a user. In some embodiments, one or more XR displays 312 correspond to holographic, digital Light Processing (DLP), liquid Crystal Displays (LCD), liquid crystal on silicon (LCoS), organic light emitting field effect transistors (OLET), organic Light Emitting Diodes (OLED), surface conduction electron emitting displays (SED), field Emission Displays (FED), quantum dot light emitting diodes (QD-LED), microelectromechanical systems (MEMS), and/or similar display types. In some embodiments, one or more XR displays 312 correspond to diffractive, reflective, polarizing, holographic, etc. waveguide displays. For example, the display generation component 120 (e.g., HMD) includes a single XR display. In another example, display generation component 120 includes an XR display for each eye of the user. In some embodiments, one or more XR displays 312 are capable of presenting MR and VR content. In some implementations, one or more XR displays 312 can present MR or VR content.
In some embodiments, the one or more image sensors 314 are configured to acquire image data corresponding to at least a portion of the user's face including the user's eyes (and may be referred to as an eye tracking camera). In some embodiments, the one or more image sensors 314 are configured to acquire image data corresponding to at least a portion of the user's hand and optionally the user's arm (and may be referred to as a hand tracking camera). In some implementations, the one or more image sensors 314 are configured to face forward in order to acquire image data corresponding to a scene that a user would see in the absence of the display generating component 120 (e.g., HMD) (and may be referred to as a scene camera). The one or more optional image sensors 314 may include one or more RGB cameras (e.g., with Complementary Metal Oxide Semiconductor (CMOS) image sensors or Charge Coupled Device (CCD) image sensors), one or more Infrared (IR) cameras, and/or one or more event-based cameras, etc.
Memory 320 includes high-speed random access memory such as DRAM, SRAM, DDR RAM or other random access solid state memory devices. In some embodiments, memory 320 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. Memory 320 optionally includes one or more storage devices located remotely from the one or more processing units 302. Memory 320 includes a non-transitory computer-readable storage medium. In some embodiments, memory 320 or a non-transitory computer readable storage medium of memory 320 stores the following programs, modules, and data structures, or a subset thereof, including optional operating system 330 and XR presentation module 340.
Operating system 330 includes processes for handling various basic system services and for performing hardware-related tasks. In some embodiments, XR presentation module 340 is configured to present XR content to a user via one or more XR displays 312. For this purpose, in various embodiments, the XR presentation module 340 includes a data acquisition unit 342, an XR presentation unit 344, an XR map generation unit 346, and a data transmission unit 348.
In some embodiments, the data acquisition unit 342 is configured to at least acquire data (e.g., presentation data, interaction data, sensor data, location data, etc.) from the controller 110 of fig. 1. For this purpose, in various embodiments, the data acquisition unit 342 includes instructions and/or logic for instructions and heuristics and metadata for heuristics.
In some embodiments, XR presentation unit 344 is configured to present XR content via one or more XR displays 312. For this purpose, in various embodiments, XR presentation unit 344 includes instructions and/or logic for instructions and heuristics and metadata for heuristics.
In some embodiments, XR map generation unit 346 is configured to generate an XR map based on the media content data (e.g., a 3D map of a mixed reality scene or a physical environment in which computer-generated objects may be placed to generate an augmented reality map). For this purpose, in various embodiments, XR map generation unit 346 includes instructions and/or logic for the instructions as well as heuristics and metadata for the heuristics.
In some embodiments, the data transmission unit 348 is configured to transmit data (e.g., presentation data, location data, etc.) to at least the controller 110, and optionally one or more of the input device 125, the output device 155, the sensor 190, and/or the peripheral device 195. For this purpose, in various embodiments, the data transmission unit 348 includes instructions and/or logic for instructions and heuristics and metadata for heuristics.
Although the data acquisition unit 342, the XR presentation unit 344, the XR map generation unit 346, and the data transmission unit 348 are shown as residing on a single device (e.g., the display generation component 120 of fig. 1), it should be understood that in other embodiments, any combination of the data acquisition unit 342, the XR presentation unit 344, the XR map generation unit 346, and the data transmission unit 348 may be located in separate computing devices.
Furthermore, fig. 3 is intended more as a functional description of the various features that may be present in a particular embodiment than as a structural schematic of the embodiments described herein. As will be appreciated by one of ordinary skill in the art, the individually displayed items may be combined and some items may be separated. For example, some of the functional blocks shown separately in fig. 3 may be implemented in a single block, and the various functions of a single functional block may be implemented by one or more functional blocks in various embodiments. The actual number of modules, the division of particular functions, and how features are allocated among them will vary depending upon the particular implementation and, in some embodiments, depend in part on the particular combination of hardware, software, and/or firmware selected for a particular implementation.
Fig. 4 is a schematic illustration of an exemplary embodiment of a hand tracking device 140. In some embodiments, the hand tracking device 140 (fig. 1) is controlled by the hand tracking unit 244 (fig. 2) to track the position/location of one or more portions of the user's hand, and/or the movement of one or more portions of the user's hand relative to the scene 105 of fig. 1 (e.g., relative to a portion of the physical environment surrounding the user, relative to the display generating component 120, or relative to a portion of the user (e.g., the user's face, eyes, or head), and/or relative to a coordinate system defined relative to the user's hand). In some implementations, the hand tracking device 140 is part of the display generation component 120 (e.g., embedded in or attached to a head-mounted device). In some embodiments, the hand tracking device 140 is separate from the display generation component 120 (e.g., in a separate housing or attached to a separate physical support structure).
In some implementations, the hand tracking device 140 includes an image sensor 404 (e.g., one or more IR cameras, 3D cameras, depth cameras, and/or color cameras, etc.) that captures three-dimensional scene information including at least a human user's hand 406. The image sensor 404 captures the hand image with sufficient resolution to enable the fingers and their respective locations to be distinguished. The image sensor 404 typically captures images of other parts of the user's body, and possibly also all parts of the body, and may have a zoom capability or a dedicated sensor with increased magnification to capture images of the hand with a desired resolution. In some implementations, the image sensor 404 also captures 2D color video images of the hand 406 and other elements of the scene. In some implementations, the image sensor 404 is used in conjunction with other image sensors to capture the physical environment of the scene 105, or as an image sensor that captures the physical environment of the scene 105. In some embodiments, the image sensor 404, or a portion thereof, is positioned relative to the user or the user's environment in a manner that uses the field of view of the image sensor to define an interaction space in which hand movements captured by the image sensor are considered input to the controller 110.
In some embodiments, the image sensor 404 outputs a sequence of frames containing 3D mapping data (and, in addition, possible color image data) to the controller 110, which extracts high-level information from the mapping data. This high-level information is typically provided via an Application Program Interface (API) to an application program running on the controller, which drives the display generating component 120 accordingly. For example, a user may interact with software running on the controller 110 by moving his hand 406 and changing his hand pose.
In some implementations, the image sensor 404 projects a speckle pattern onto a scene that includes the hand 406 and captures an image of the projected pattern. In some implementations, the controller 110 calculates 3D coordinates of points in the scene (including points on the surface of the user's hand) by triangulation based on lateral offsets of the blobs in the pattern. This approach is advantageous because it does not require the user to hold or wear any kind of beacon, sensor or other marker. The method gives the depth coordinates of points in the scene relative to a predetermined reference plane at a specific distance from the image sensor 404. In this disclosure, it is assumed that the image sensor 404 defines an orthogonal set of x-axis, y-axis, z-axis such that the depth coordinates of points in the scene correspond to the z-component measured by the image sensor. Alternatively, the image sensor 404 (e.g., a hand tracking device) may use other 3D mapping methods, such as stereoscopic imaging or time-of-flight measurements, based on single or multiple cameras or other types of sensors.
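The disparity-to-depth relationship underlying this triangulation can be illustrated with a minimal sketch. The focal length, baseline, and function name below are assumptions chosen for the example, not values from the disclosure.

```python
# Minimal sketch of depth-from-disparity for a structured-light sensor.
# focal_length_px and baseline_m are assumed example intrinsics.

def depth_from_disparity(disparity_px: float,
                         focal_length_px: float = 580.0,
                         baseline_m: float = 0.075) -> float:
    """Return depth z (meters) for a projected spot whose pattern is laterally
    offset by disparity_px pixels relative to the reference pattern."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the sensor")
    return focal_length_px * baseline_m / disparity_px

# Example: a spot shifted by 58 pixels maps to roughly 0.75 m from the sensor.
print(round(depth_from_disparity(58.0), 3))
```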
In some implementations, the hand tracking device 140 captures and processes a time series containing a depth map of the user's hand as the user moves his hand (e.g., the entire hand or one or more fingers). Software running on the image sensor 404 and/or a processor in the controller 110 processes the 3D mapping data to extract image block descriptors of the hand in these depth maps. The software may match these descriptors with image block descriptors stored in database 408 based on previous learning processes in order to estimate the pose of the hand in each frame. The pose typically includes the 3D position of the user's hand joints and finger tips.
The software may also analyze the trajectory of the hand and/or fingers over multiple frames in the sequence to identify gestures. The pose estimation functions described herein may alternate with motion tracking functions, such that image block-based pose estimation is performed only once every two (or more) frames, while tracking is used to find changes in the pose that occur over the remaining frames. Pose, motion, and gesture information are provided to an application running on the controller 110 via the APIs described above. The program may move and modify images presented on the display generation component 120, for example, in response to pose and/or gesture information, or perform other functions.
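As a rough illustration of alternating full pose estimation with cheaper frame-to-frame tracking, the sketch below assumes estimator and tracker objects with estimate/update methods; these names and the every-other-frame cadence are illustrative assumptions, not the disclosed implementation.

```python
# Sketch of interleaving descriptor-based pose estimation with motion tracking.

def process_depth_frames(frames, estimator, tracker, full_every_n: int = 2):
    poses = []
    last_pose = None
    for i, depth_map in enumerate(frames):
        if last_pose is None or i % full_every_n == 0:
            # Full estimation: match image block (patch) descriptors against
            # the learned database (e.g., database 408) to find the pose.
            last_pose = estimator.estimate(depth_map)
        else:
            # Cheaper update: track changes to the previously found pose.
            last_pose = tracker.update(depth_map, last_pose)
        poses.append(last_pose)
    return poses
```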
In some implementations, the gesture includes an air gesture. An air gesture is a motion of a portion of a user's body (e.g., a head, one or more arms, one or more hands, one or more fingers, and/or one or more legs) through the air that is detected without the user touching an input element that is part of a device (e.g., computer system 101, one or more input devices 125, and/or hand tracking device 140) (or independently of an input element that is part of the device), including a motion of the user's body relative to an absolute reference (e.g., angle of the user's arm relative to the ground or distance of the user's hand relative to the ground), movement relative to another portion of the user's body (e.g., movement of the user's hand relative to the user's shoulder, movement of one hand of the user relative to the other hand of the user, and/or movement of the user's finger relative to another finger or portion of the user's hand), and/or absolute movement of a portion of the user's body (e.g., a flick gesture comprising a predetermined amount and/or speed of movement of the hand in a predetermined pose, or a shake gesture comprising a predetermined speed or amount of rotation of a portion of the user's body).
In some embodiments, the input gestures used in the various examples and embodiments described herein include air gestures performed by movement of a user's finger relative to other fingers or portions of the user's hand for interacting with an XR environment (e.g., a virtual or mixed reality environment). In some embodiments, the air gesture is a gesture that is detected without the user touching an input element that is part of the device (or independent of an input element that is part of the device) and based on a detected movement of a portion of the user's body through the air, including a movement of the user's body relative to an absolute reference (e.g., an angle of the user's arm relative to the ground or a distance of the user's hand relative to the ground), a movement relative to another portion of the user's body (e.g., a movement of the user's hand relative to the user's shoulder, a movement of one hand of the user relative to the other hand of the user, and/or a movement of the user's finger relative to another finger or portion of the user's hand), and/or an absolute movement of a portion of the user's body (e.g., a flick gesture that includes a predetermined amount and/or speed of movement of the hand in a predetermined pose, or a shake gesture that includes a predetermined speed or amount of rotation of a portion of the user's body).
In some embodiments where the input gesture is an air gesture (e.g., in the absence of physical contact with an input device that provides the computer system with information about which user interface element is the target of the user input, such as contact with a user interface element displayed on a touch screen, or contact with a mouse or touchpad to move a cursor to the user interface element), the gesture takes into account the user's attention (e.g., gaze) to determine the target of the user input (e.g., for direct input, as described below). Thus, in embodiments involving air gestures, the input gesture is, for example, detected attention (e.g., gaze) toward a user interface element in combination (e.g., simultaneously) with movement of the user's finger and/or hand to perform pinch and/or tap inputs, as described below.
In some implementations, an input gesture directed to a user interface object is performed with direct or indirect reference to the user interface object. For example, user input is performed directly on a user interface object according to performing input with a user's hand at a location corresponding to the location of the user interface object in a three-dimensional environment (e.g., as determined based on the user's current viewpoint). In some implementations, upon detecting a user's attention (e.g., gaze) to a user interface object, an input gesture is performed indirectly on the user interface object in accordance with a position of a user's hand not being at the position corresponding to the position of the user interface object in the three-dimensional environment while the user is performing the input gesture. For example, for a direct input gesture, the user can direct the user's input to the user interface object by initiating the gesture at or near a location corresponding to the display location of the user interface object (e.g., within 0.5cm, 1cm, 5cm, or within a distance between 0 and 5cm measured from the outer edge of the option or the center portion of the option). For indirect input gestures, a user can direct the user's input to a user interface object by focusing on the user interface object (e.g., by looking at the user interface object), and while focusing on an option, the user initiates an input gesture (e.g., at any location detectable by the computer system) (e.g., at a location that does not correspond to the display location of the user interface object).
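A simple way to picture the direct/indirect distinction is a routing check like the one below; the 5 cm radius echoes the example distances above, while the object fields, gaze input, and function names are assumptions for illustration only.

```python
# Sketch of routing an air gesture as direct or indirect input.

DIRECT_RADIUS_M = 0.05  # example threshold, cf. the 0-5 cm range above


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def resolve_gesture_target(gesture_position, ui_objects, gazed_object):
    """Return (target, mode) given the hand position at gesture start, the
    displayed UI objects, and the object the user's gaze is on (or None)."""
    for obj in ui_objects:
        if euclidean(gesture_position, obj.position) <= DIRECT_RADIUS_M:
            return obj, "direct"         # gesture began at/near the object itself
    if gazed_object is not None:
        return gazed_object, "indirect"  # attention selects the target instead
    return None, "none"
```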
In some embodiments, the input gestures (e.g., air gestures) used in the various examples and embodiments described herein include pinch inputs and tap inputs for interacting with a virtual or mixed reality environment. For example, pinch and tap inputs described below are performed as air gestures.
In some implementations, the pinch input is part of an air gesture that includes one or more of: a pinch gesture, a long pinch gesture, a pinch-and-drag gesture, or a double pinch gesture. For example, a pinch gesture as an air gesture includes movement of two or more fingers of a hand to make contact with one another, optionally followed immediately by a break in contact with each other (e.g., within 0 to 1 second). A long pinch gesture, which is an air gesture, includes movement of two or more fingers of a hand into contact with each other for at least a threshold amount of time (e.g., at least 1 second) before a break in contact with each other is detected. For example, a long pinch gesture includes a user holding a pinch gesture (e.g., where two or more fingers make contact), and the long pinch gesture continues until a break in contact between the two or more fingers is detected. In some implementations, a double pinch gesture as an air gesture includes two (or more) pinch inputs (e.g., performed by the same hand) detected in immediate succession of one another (e.g., within a predefined period of time). For example, the user performs a first pinch input (e.g., a pinch input or a long pinch input), releases the first pinch input (e.g., breaks contact between two or more fingers), and performs a second pinch input within a predefined period of time (e.g., within 1 second or within 2 seconds) after releasing the first pinch input.
In some implementations, the pinch-and-drag gesture as an air gesture includes a pinch gesture (e.g., a pinch gesture or a long pinch gesture) that is performed in conjunction with (e.g., follows) a drag input that changes a position of a user's hand from a first position (e.g., a start position of the drag) to a second position (e.g., an end position of the drag). In some implementations, the user holds the pinch gesture while the drag input is performed, and releases the pinch gesture (e.g., opens their two or more fingers) to end the drag gesture (e.g., at the second location). In some implementations, pinch input and drag input are performed by the same hand (e.g., a user pinch two or more fingers to contact each other and move the same hand to a second position in the air with a drag gesture). In some embodiments, the input gesture as an over-the-air gesture includes an input (e.g., pinch and/or tap input) performed using two hands of the user, e.g., the input gesture includes two (e.g., or more) inputs performed in conjunction with each other (e.g., simultaneously or within a predefined time period).
In some implementations, a tap input (e.g., directed at a user interface element) performed as an air gesture includes movement of a user's finger toward the user interface element, movement of the user's hand toward the user interface element (optionally with the user's finger extended toward the user interface element), a downward motion of the user's finger (e.g., mimicking a mouse click motion or a tap on a touch screen), or other predefined movement of the user's hand. In some embodiments, a tap input performed as an air gesture is detected based on movement characteristics of the finger or hand performing the tap gesture: movement of the finger or hand away from the user's point of view and/or toward the object that is the target of the tap input, followed by an end of the movement. In some embodiments, the end of the movement is detected based on a change in the movement characteristics of the finger or hand performing the tap gesture (e.g., an end of movement away from the user's point of view and/or toward the object that is the target of the tap input, a reversal of the direction of movement of the finger or hand, and/or a reversal of the direction of acceleration of movement of the finger or hand).
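The pinch family above can be pictured as a small classifier over finger-contact intervals. The thresholds mirror the example values in the text (roughly a 1 second hold for a long pinch and a 1 second window for a double pinch); the data format and function name are assumptions.

```python
# Sketch of classifying pinch-style air gestures from finger-contact intervals.

LONG_PINCH_HOLD_S = 1.0        # example long-pinch hold time
DOUBLE_PINCH_WINDOW_S = 1.0    # example window between successive pinches


def classify_pinches(contact_events):
    """contact_events: chronological list of (start_time, end_time) intervals
    during which two or more fingers of the hand were in contact."""
    gestures = []
    previous_end = None
    for start, end in contact_events:
        kind = "long_pinch" if (end - start) >= LONG_PINCH_HOLD_S else "pinch"
        if (kind == "pinch" and previous_end is not None
                and start - previous_end <= DOUBLE_PINCH_WINDOW_S
                and gestures and gestures[-1] == "pinch"):
            gestures[-1] = "double_pinch"   # two quick pinches merge into one gesture
        else:
            gestures.append(kind)
        previous_end = end
    return gestures

# Two quick pinches become a double pinch; a 1.5 s hold becomes a long pinch.
print(classify_pinches([(0.0, 0.2), (0.5, 0.7), (2.0, 3.5)]))
```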
In some embodiments, the determination that the user's attention is directed to a portion of the three-dimensional environment is based on detection of gaze directed to that portion (optionally, without other conditions). In some embodiments, the portion of the three-dimensional environment to which the user's attention is directed is determined based on detecting a gaze directed to the portion of the three-dimensional environment with one or more additional conditions, such as requiring the gaze to be directed to the portion of the three-dimensional environment for at least a threshold duration (e.g., dwell duration) and/or requiring the gaze to be directed to the portion of the three-dimensional environment when the point of view of the user is within a distance threshold from the portion of the three-dimensional environment, such that the device determines the portion of the three-dimensional environment to which the user's attention is directed, wherein if one of the additional conditions is not met, the device determines that the attention is not directed to the portion of the three-dimensional environment to which the gaze is directed (e.g., until the one or more additional conditions are met).
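The attention check described here (gaze plus optional dwell and viewpoint-distance conditions) can be sketched as follows; the dwell and distance thresholds, the region fields, and the sample format are assumptions for illustration.

```python
# Sketch of determining whether attention is directed to a region of the
# three-dimensional environment.

DWELL_S = 0.3                    # assumed minimum dwell duration
MAX_VIEWPOINT_DISTANCE_M = 3.0   # assumed viewpoint distance threshold


def center_distance(viewpoint, region):
    return sum((a - b) ** 2 for a, b in zip(viewpoint, region.center)) ** 0.5


def attention_directed(gaze_samples, region, viewpoint, now):
    """gaze_samples: chronological list of (timestamp, region_id) gaze hits."""
    if center_distance(viewpoint, region) > MAX_VIEWPOINT_DISTANCE_M:
        return False                      # additional condition not met
    dwell_start = None
    for t, region_id in gaze_samples:
        if region_id == region.identifier:
            dwell_start = t if dwell_start is None else dwell_start
        else:
            dwell_start = None            # gaze left the region; restart dwell
    return dwell_start is not None and (now - dwell_start) >= DWELL_S
```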
In some embodiments, detection of the ready state configuration of the user or a portion of the user is detected by the computer system. Detection of a ready state configuration of a hand is used by a computer system as an indication that a user may be ready to interact with the computer system using one or more air gesture inputs (e.g., pinch, tap, pinch and drag, double pinch, long pinch, or other air gestures described herein) performed by the hand. For example, the ready state of the hand is determined based on whether the hand has a predetermined hand shape (e.g., a pre-pinch shape in which the thumb and one or more fingers extend and are spaced apart in preparation for making a pinch or grasp gesture, or a pre-flick in which the one or more fingers extend and the palm faces away from the user), based on whether the hand is in a predetermined position relative to the user's point of view (e.g., below the user's head and above the user's waist and extending at least 15cm, 20cm, 25cm, 30cm, or 50cm from the body), and/or based on whether the hand has moved in a particular manner (e.g., toward an area above the user's waist and in front of the user's head or away from the user's body or legs). In some implementations, the ready state is used to determine whether an interactive element of the user interface is responsive to an attention (e.g., gaze) input.
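A ready-state test combining the shape, position, and motion cues listed above might look like the sketch below; the HandObservation fields, the shape vocabulary, and the 15 cm extension threshold (taken from the example distances in the text) are all illustrative assumptions.

```python
# Sketch of a hand "ready state" check.

from dataclasses import dataclass


@dataclass
class HandObservation:
    shape: str                    # e.g., "pre_pinch", "open", "fist"
    below_head: bool
    above_waist: bool
    extension_from_body_m: float
    moving_toward_interaction_zone: bool


def is_ready_state(hand: HandObservation) -> bool:
    has_ready_shape = hand.shape == "pre_pinch"   # thumb and finger(s) spread, ready to pinch
    in_ready_position = (hand.below_head and hand.above_waist
                         and hand.extension_from_body_m >= 0.15)
    return has_ready_shape and (in_ready_position or hand.moving_toward_interaction_zone)
```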
In some embodiments, the software may be downloaded to the controller 110 in electronic form, over a network, for example, or may alternatively be provided on tangible non-transitory media, such as optical, magnetic, or electronic memory media. In some embodiments, database 408 is also stored in a memory associated with controller 110. Alternatively or in addition, some or all of the described functions of the computer may be implemented in dedicated hardware, such as a custom or semi-custom integrated circuit or a programmable Digital Signal Processor (DSP). Although the controller 110 is shown in fig. 4, for example, as a separate unit from the image sensor 404, some or all of the processing functions of the controller may be performed by a suitable microprocessor and software or by dedicated circuitry within the housing of the image sensor 404 (e.g., a hand tracking device) or other devices associated with the image sensor 404. In some embodiments, at least some of these processing functions may be performed by a suitable processor integrated with display generation component 120 (e.g., in a television receiver, handheld device, or head mounted device) or with any other suitable computerized device (such as a game console or media player). The sensing functionality of the image sensor 404 may likewise be integrated into a computer or other computerized device to be controlled by the sensor output.
Fig. 4 also includes a schematic diagram of a depth map 410 captured by the image sensor 404, according to some embodiments. As described above, the depth map comprises a matrix of pixels having corresponding depth values. Pixels 412 corresponding to the hand 406 have been segmented from the background and wrist in the map. The brightness of each pixel within the depth map 410 is inversely proportional to its depth value (i.e., the measured z-distance from the image sensor 404), where the gray shade becomes darker with increasing depth. The controller 110 processes these depth values to identify and segment components of the image (i.e., a set of adjacent pixels) that have human hand features. These features may include, for example, overall size, shape, and frame-to-frame motion from a sequence of depth maps.
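A minimal sketch of this kind of depth-based segmentation and brightness mapping is shown below, assuming a NumPy depth map in meters; the depth band used to isolate the hand is an arbitrary example.

```python
# Sketch of segmenting hand pixels from a depth map and rendering depth as
# brightness (darker gray = larger depth), in the spirit of depth map 410.

import numpy as np


def segment_hand(depth_map_m: np.ndarray, near_m: float = 0.3, far_m: float = 0.8):
    """Return a boolean hand mask and an 8-bit visualization image."""
    mask = (depth_map_m >= near_m) & (depth_map_m <= far_m)
    z = np.clip(depth_map_m, near_m, far_m)
    brightness = 255.0 * (far_m - z) / (far_m - near_m)   # inverse relation to depth
    vis = np.zeros(depth_map_m.shape, dtype=np.uint8)
    vis[mask] = brightness[mask].astype(np.uint8)
    return mask, vis
```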
Fig. 4 also schematically illustrates the hand bones 414 that the controller 110 eventually extracts from the depth map 410 of the hand 406, according to some embodiments. In fig. 4, the hand skeleton 414 is superimposed over the hand background 416 that has been segmented from the original depth map. In some embodiments, key feature points of the hand and optionally on the wrist or arm connected to the hand (e.g., points corresponding to knuckles, finger tips, palm centers, ends of the hand connected to the wrist, etc.) are identified and located on the hand bones 414. In some embodiments, the controller 110 uses the positions and movements of these key feature points on the plurality of image frames to determine a gesture performed by the hand or a current state of the hand according to some embodiments.
Fig. 5 illustrates an exemplary embodiment of the eye tracking device 130 (fig. 1). In some embodiments, eye tracking device 130 is controlled by eye tracking unit 243 (fig. 2) to track the positioning and movement of the user gaze relative to scene 105 or relative to XR content displayed via display generation component 120. In some embodiments, the eye tracking device 130 is integrated with the display generation component 120. For example, in some embodiments, when display generating component 120 is a head-mounted device (such as a headset, helmet, goggles, or glasses) or a handheld device placed in a wearable frame, the head-mounted device includes both components that generate XR content for viewing by a user and components for tracking the user's gaze with respect to the XR content. In some embodiments, the eye tracking device 130 is separate from the display generation component 120. For example, when the display generating component is a handheld device or an XR chamber, the eye tracking device 130 is optionally a device separate from the handheld device or XR chamber. In some embodiments, the eye tracking device 130 is a head mounted device or a portion of a head mounted device. In some embodiments, the head-mounted eye tracking device 130 is optionally used in combination with a display generating component that is also head-mounted or a display generating component that is not head-mounted. In some embodiments, the eye tracking device 130 is not a head mounted device and is optionally used in conjunction with a head mounted display generating component. In some embodiments, the eye tracking device 130 is not a head mounted device and optionally is part of a non-head mounted display generating component.
In some embodiments, the display generation component 120 uses a display mechanism (e.g., a left near-eye display panel and a right near-eye display panel) to display frames including left and right images in front of the user's eyes, thereby providing a 3D virtual view to the user. For example, the head mounted display generating component may include left and right optical lenses (referred to herein as eye lenses) located between the display and the user's eyes. In some embodiments, the display generation component may include or be coupled to one or more external cameras that capture video of the user's environment for display. In some embodiments, the head mounted display generating component may have a transparent or translucent display and the virtual object is displayed on the transparent or translucent display through which the user may directly view the physical environment. In some embodiments, the display generation component projects the virtual object into the physical environment. The virtual object may be projected, for example, on a physical surface or as a hologram, such that an individual uses the system to observe the virtual object superimposed over the physical environment. In this case, separate display panels and image frames for the left and right eyes may not be required.
As shown in fig. 5, in some embodiments, the eye tracking device 130 (e.g., a gaze tracking device) includes at least one eye tracking camera (e.g., an Infrared (IR) or Near Infrared (NIR) camera) and an illumination source (e.g., an IR or NIR light source, such as an array or ring of LEDs) that emits light (e.g., IR or NIR light) toward the user's eye. The eye-tracking camera may be directed toward the user's eye to receive IR or NIR light from the illumination source that is reflected directly off the eye, or alternatively may be directed toward "hot" mirrors located between the user's eye and the display panel that reflect IR or NIR light from the eye to the eye-tracking camera while allowing visible light to pass through. The eye tracking device 130 optionally captures images of the user's eyes (e.g., as a video stream captured at 60-120 frames per second (fps)), analyzes the images to generate gaze tracking information, and communicates the gaze tracking information to the controller 110. In some embodiments, both eyes of the user are tracked separately by the respective eye tracking camera and illumination source. In some embodiments, only one eye of the user is tracked by the respective eye tracking camera and illumination source.
In some embodiments, the eye tracking device 130 is calibrated using a device-specific calibration process to determine parameters of the eye tracking device for the particular operating environment 100, such as 3D geometry and parameters of LEDs, cameras, hot mirrors (if present), eye lenses, and display screens. The device-specific calibration procedure may be performed at the factory or another facility prior to delivering the AR/VR equipment to the end user. The device-specific calibration process may be an automatic calibration process or a manual calibration process. According to some embodiments, the user-specific calibration process may include an estimation of eye parameters of a specific user, such as pupil position, foveal position, optical axis, visual axis, eye distance, etc. According to some embodiments, once the device-specific parameters and the user-specific parameters are determined for the eye-tracking device 130, the images captured by the eye-tracking camera may be processed using a glint-assisted method to determine the current visual axis and gaze point of the user relative to the display.
As shown in fig. 5, the eye tracking device 130 (e.g., 130A or 130B) includes an eye lens 520 and a gaze tracking system including at least one eye tracking camera 540 (e.g., an Infrared (IR) or Near Infrared (NIR) camera) positioned on a side of the user's face on which eye tracking is performed, and an illumination source 530 (e.g., an IR or NIR light source such as an array or ring of NIR Light Emitting Diodes (LEDs)) that emits light (e.g., IR or NIR light) toward the user's eyes 592. The eye-tracking camera 540 may be directed toward a mirror 550 (which reflects IR or NIR light from the eye 592 while allowing visible light to pass) located between the user's eye 592 and the display 510 (e.g., left or right display panel of a head-mounted display, or display of a handheld device, projector, etc.) (e.g., as shown in the top portion of fig. 5), or alternatively may be directed toward the user's eye 592 to receive reflected IR or NIR light from the eye 592 (e.g., as shown in the bottom portion of fig. 5).
In some implementations, the controller 110 renders AR or VR frames 562 (e.g., left and right frames for left and right display panels) and provides the frames 562 to the display 510. The controller 110 uses the gaze tracking input 542 from the eye tracking camera 540 for various purposes, such as for processing the frames 562 for display. The controller 110 optionally estimates the gaze point of the user on the display 510 based on gaze tracking input 542 acquired from the eye tracking camera 540 using a glint-assisted method or other suitable method. The gaze point estimated from the gaze tracking input 542 is optionally used to determine the direction in which the user is currently looking.
Several possible use cases of the current gaze direction of the user are described below and are not intended to be limiting. As an exemplary use case, the controller 110 may render virtual content differently based on the determined direction of the user's gaze. For example, the controller 110 may generate virtual content in a foveal region determined according to a current gaze direction of the user at a higher resolution than in a peripheral region. As another example, the controller may position or move virtual content in the view based at least in part on the user's current gaze direction. As another example, the controller may display particular virtual content in the view based at least in part on the user's current gaze direction. As another exemplary use case in an AR application, the controller 110 may direct an external camera used to capture the physical environment of the XR experience to focus in the determined direction. The autofocus mechanism of the external camera may then focus on an object or surface in the environment that the user is currently looking at on display 510. As another example use case, the eye lens 520 may be a focusable lens, and the controller uses the gaze tracking information to adjust the focus of the eye lens 520 such that the virtual object the user is currently looking at has the appropriate vergence to match the convergence of the user's eyes 592. The controller 110 may utilize the gaze tracking information to direct the eye lens 520 to adjust the focus such that the approaching object the user is looking at appears at the correct distance.
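The foveated-rendering use case can be illustrated by choosing a render scale per screen tile from its angular distance to the gaze direction; the angular thresholds and scale factors below are assumptions, not values from the disclosure.

```python
# Sketch of gaze-driven (foveated) resolution selection.

import math


def resolution_scale_for_tile(tile_direction, gaze_direction,
                              foveal_deg: float = 10.0, mid_deg: float = 30.0) -> float:
    """Both directions are unit 3-vectors from the viewpoint; returns a scale factor."""
    dot = sum(a * b for a, b in zip(tile_direction, gaze_direction))
    angle_deg = math.degrees(math.acos(max(-1.0, min(1.0, dot))))
    if angle_deg <= foveal_deg:
        return 1.0      # foveal region: full resolution
    if angle_deg <= mid_deg:
        return 0.5      # mid periphery
    return 0.25         # far periphery
```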
In some embodiments, the eye tracking device is part of a head mounted device that includes a display (e.g., display 510), two eye lenses (e.g., eye lens 520), an eye tracking camera (e.g., eye tracking camera 540), and a light source (e.g., light source 530 (e.g., IR or NIR LED)) mounted in a wearable housing. The light source emits light (e.g., IR or NIR light) toward the user's eye 592. In some embodiments, the light sources may be arranged in a ring or circle around each of the lenses, as shown in fig. 5. In some embodiments, for example, eight light sources 530 (e.g., LEDs) are arranged around each lens 520. However, more or fewer light sources 530 may be used, and other arrangements and locations of light sources 530 may be used.
In some implementations, the display 510 emits light in the visible range and does not emit light in the IR or NIR range, and thus does not introduce noise into the gaze tracking system. Note that the position and angle of the eye tracking camera 540 are given by way of example and are not intended to be limiting. In some implementations, a single eye tracking camera 540 is located on each side of the user's face. In some implementations, two or more NIR cameras 540 may be used on each side of the user's face. In some implementations, a camera 540 with a wider field of view (FOV) and a camera 540 with a narrower FOV may be used on each side of the user's face. In some implementations, a camera 540 operating at one wavelength (e.g., 850 nm) and a camera 540 operating at a different wavelength (e.g., 940 nm) may be used on each side of the user's face.
The embodiment of the gaze tracking system as shown in fig. 5 may be used, for example, in computer-generated reality, virtual reality, and/or mixed reality applications to provide a user with a computer-generated reality, virtual reality, augmented reality, and/or augmented virtual experience.
Fig. 6 illustrates a glint-assisted gaze tracking pipeline in accordance with some embodiments. In some embodiments, the gaze tracking pipeline is implemented by a glint-assisted gaze tracking system (e.g., an eye tracking device 130 as shown in fig. 1 and 5). The glint-assisted gaze tracking system may maintain a tracking state. Initially, the tracking state is off or "no". When in the tracking state, the glint-assisted gaze tracking system uses previous information from a previous frame when analyzing the current frame to track pupil contours and glints in the current frame. When not in the tracking state, the glint-assisted gaze tracking system attempts to detect pupils and glints in the current frame and, if successful, initializes the tracking state to "yes" and continues with the next frame in the tracking state.
As shown in fig. 6, the gaze tracking camera may capture left and right images of the left and right eyes of the user. The captured image is then input to the gaze tracking pipeline for processing beginning at 610. As indicated by the arrow returning to element 600, the gaze tracking system may continue to capture images of the user's eyes, for example, at a rate of 60 to 120 frames per second. In some embodiments, each set of captured images may be input to a pipeline for processing. However, in some embodiments or under some conditions, not all captured frames are pipelined.
At 610, for the currently captured image, if the tracking state is yes, the method proceeds to element 640. At 610, if the tracking state is no, the image is analyzed to detect a user's pupil and glints in the image, as indicated at 620. At 630, if the pupil and glints are successfully detected, the method proceeds to element 640. Otherwise, the method returns to element 610 to process the next image of the user's eye.
At 640, if proceeding from element 610, the current frame is analyzed to track pupils and glints based in part on previous information from the previous frame. At 640, if proceeding from element 630, a tracking state is initialized based on the pupil and glints detected in the current frame. The results of the processing at element 640 are checked to verify that the results of the tracking or detection can be trusted. For example, the results may be checked to determine whether the pupil and a sufficient number of glints for performing gaze estimation are successfully tracked or detected in the current frame. At 650, if the results cannot be trusted, the tracking state is set to no at element 660 and the method returns to element 610 to process the next image of the user's eye. At 650, if the results are trusted, the method proceeds to element 670. At 670, the tracking state is set to yes (if not already yes), and pupil and glint information is passed to element 680 to estimate the gaze point of the user.
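The control flow of elements 610-680 can be read as a small state machine. The sketch below is only an illustration of that flow; the detector, tracker, and estimator objects and their methods are assumed interfaces, not APIs from the disclosure.

```python
# Sketch of the glint-assisted gaze tracking loop of FIG. 6.

def gaze_pipeline(camera_frames, detector, tracker, estimator):
    tracking = False              # tracking state initially "no"
    previous = None
    for frame in camera_frames:                                    # 600/610
        if tracking:
            result = tracker.track(frame, previous)                # 640, arriving from 610
        else:
            detection = detector.detect_pupil_and_glints(frame)    # 620
            if detection is None:                                  # 630: detection failed
                continue                                           # back to 610 for the next image
            result = tracker.initialize(frame, detection)          # 640, arriving from 630
        if not result.trustworthy:                                 # 650: results not trusted
            tracking = False                                       # 660
            continue
        tracking = True                                            # 670
        previous = result
        yield estimator.estimate_gaze_point(result)                # 680
```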
Fig. 6 is intended to serve as one example of an eye tracking technique that may be used in a particular implementation. As will be appreciated by one of ordinary skill in the art, other eye tracking techniques, currently existing or developed in the future, may be used in place of or in combination with the glint-assisted eye tracking techniques described herein in computer system 101 for providing an XR experience to a user, according to various embodiments.
In this disclosure, various input methods are described with respect to interactions with a computer system. When one input device or input method is used to provide an example and another input device or input method is used to provide another example, it should be understood that each example may be compatible with and optionally utilize the input device or input method described with respect to the other example. Similarly, various output methods are described with respect to interactions with a computer system. When one output device or output method is used to provide an example and another output device or output method is used to provide another example, it should be understood that each example may be compatible with and optionally utilize the output device or output method described with respect to the other example. Similarly, the various methods are described with respect to interactions with a virtual environment or mixed reality environment through a computer system. When examples are provided using interactions with a virtual environment, and another example is provided using a mixed reality environment, it should be understood that each example may be compatible with and optionally utilize the methods described with respect to the other example. Thus, the present disclosure discloses embodiments that are combinations of features of multiple examples, without the need to list all features of the embodiments in detail in the description of each example embodiment.
User interface and associated process
Attention is now directed to embodiments of a user interface ("UI") and associated processes that may be implemented on a computer system, such as a portable multifunction device or a head-mounted device, in communication with a display generating component and (optionally) one or more sensors (e.g., cameras).
The present disclosure relates to an exemplary process for representing a user in an XR environment. Fig. 7A-7H and 8 depict examples in which a user is registered for representation in an XR environment. Fig. 9A-9F, 10, and 11 depict examples in which various visual effects associated with an avatar are presented in an XR environment. Fig. 12A-12E, 13A-13B, and 14 depict examples of various presentation modes associated with users represented in an XR environment. As described above, the processes disclosed herein are implemented using a computer system (e.g., computer system 101 in fig. 1).
Fig. 7A to 7H depict a registration process for registering features of the user 700. The enrollment process involves capturing data representing various aspects of the user 700, such as physical features (e.g., facial features), facial expressions, feature movements, skin colors, clothing, and glasses, or other data that may be used to design and/or manipulate a displayed representation of the user 700 in an XR environment. In some embodiments, user 700 may be represented in an XR environment as, for example, an avatar or audio representation, as discussed in more detail below with respect to fig. 9A-9F and 12A-12E.
Fig. 7A depicts a user 700 holding an electronic device 701 that includes a display 702 and a camera 703. The user 700 is looking at the device 701 and is wearing glasses 707 and an orange shirt 709. Electronic device 701 is a computer system (e.g., computer system 101 in FIG. 1).
In fig. 7A, device 701 is displaying a check-in interface 704 for guiding user 700 through a check-in process. The enrollment interface 704 includes a camera view 705 that shows a representation of the image and/or depth data captured from the camera 703. In the embodiment shown in fig. 7A, camera view 705 includes a representation 700a of user 700 (including a representation 709a of shirt 709 and a representation 707A of eyewear 707 worn by user 700). Registration interface 704 also includes various prompts that instruct user 700 to complete portions of the registration process, as discussed in more detail below.
In the embodiment depicted in fig. 7A, the registration interface 704 includes a prompt 706 that instructs the user 700 to hold their head stationary and to move the device 701 so as to scan the user's face, and in some embodiments other portions of the user's body, such as the user's head. The device 701 performs a scan by collecting image data and/or depth data representing the face/head of the user. In some embodiments, this collected data is referred to herein as facial data. In addition, because device 701 detects that user 700 is wearing glasses, prompt 706 also instructs user 700 to remove glasses 707 in order to collect facial data that more accurately represents the contours of the user's face and head. In some embodiments, the prompt displayed on registration interface 704 may include additional instructions. For example, if a user's long hair covers a portion of their head or face, the prompt may include instructions to pull their hair back to expose a hidden portion of the head/face (e.g., the ear).
Fig. 7B depicts the user 700 taking off the glasses 707 and the mobile device 701 scanning their face as indicated by the prompt 706. In some embodiments, the device 701 instructs the user to keep their head stationary in order to reduce movement of any potential glare on the user's face, which movement may affect the facial data collected from the scan. The registration interface 704 also includes a progress indicator 708 that is updated to show the progress of the scan as the device 701 collects facial data representing the user's face and/or head.
Fig. 7C depicts an alternative embodiment of the face/head scan in fig. 7A and 7B. In the embodiment shown in fig. 7C, registration interface 704 includes a prompt 710 that instructs user 700 to move their head in a circle to complete the face/head scan. In this embodiment, the user 700 holds the device 701 in front of them while moving their head so that different parts of the head are visible to the camera 703, which captures facial data from the user's face/head as it moves around a circle.
Fig. 7D-7G depict portions of a registration process in which a user is prompted to perform various facial expressions while the device 701 captures (e.g., via the camera 703) facial data of the user 700. The device 701 prompts the user 700 to make different facial expressions in order to capture facial data representing movements and poses of the user's facial features of each of these facial expressions. This face data may be used (in conjunction with face data obtained from a face/head scan in some embodiments), for example, to inform creation and operation of an avatar for representing the user in an XR environment. The cues depicted in fig. 7D-7G represent an exemplary embodiment of the registration process. As such, the enrollment process may include a greater number of cues, use different cues, or use different combinations of cues in order to obtain sufficient facial data for enrolling the physical features of user 700.
In fig. 7D, device 701 displays a check-in interface 704 with a camera preview 712 (similar to camera preview 705) and a prompt 714 indicating that user 700 smiles. After the prompt 714 is displayed, the device 701 evaluates the collected facial data (e.g., via the camera 703) and determines whether the facial data indicates that the facial expression made by the user matches the prompt displayed in the registration interface 704. After the device 701 determines that the user 700 is making the requested facial expression (smile), the device 701 stops the display of the prompt 714 and confirms that the user has made the requested facial expression, for example, by displaying a confirmation indication 716, as shown in fig. 7E.
In fig. 7F, the device 701 displays a check-in interface 704 with a camera preview 712 and a prompt 718 prompting the user 700 to speak "o". After displaying the cues 718, the device 701 evaluates the collected facial data (e.g., via the camera 703) and determines whether the facial data indicates that the facial expression made by the user matches the cues displayed in the registration interface 704. After the device 701 determines that the user 700 is making the requested facial expression (say "o"), the device 701 stops the display of the prompt 718 and confirms that the user has made the requested facial expression, for example, by displaying a confirmation indication 719, as shown in fig. 7G.
In some embodiments, such as those depicted in fig. 7D and 7E, the prompt displayed by the device 701 in the registration interface 704 is an instruction for the user to make a particular facial expression (such as a smile). In some embodiments, such as those depicted in fig. 7F and 7G, these prompts are instructions for the user to speak a particular phrase or word, such as speaking an "o." The embodiments depicted in fig. 7D-7G are examples of specific prompts that the device 701 may use to register user features and are not intended to be limiting. For example, the prompts may include instructions to make different facial expressions (such as frowning, squinting, and/or surprised expressions). Similarly, the prompts may include instructions to speak other phrases and/or words. When the user's facial features pose and/or move while making a requested facial expression or speaking a requested word or phrase, device 701 captures the movement/pose of the facial features, detects additional facial features exposed by that movement/pose, and uses the captured facial data to register the user's features, such as the user's face, mouth, tongue, lips, nose, etc., so that those features can be properly represented in an XR environment. For example, by prompting the user to smile or say "o," the device 701 may determine the appearance of the user's teeth, the movement of the user's lips, whether the user has a dimple, and other information useful for modeling and/or controlling the movement of an avatar that accurately reflects the user's physical features in an XR environment.
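The prompt-and-verify flow of FIGS. 7D-7G can be pictured with a short sketch. The prompt list, the classifier, and the ui/camera calls below are assumptions introduced for illustration; they are not the interfaces of device 701.

```python
# Sketch of prompting for expressions, verifying the captured face data, and
# replacing each prompt with a confirmation indication.

EXPRESSION_PROMPTS = [
    ("smile", "Smile"),        # cf. prompt 714
    ("say_o", 'Say "O"'),      # cf. prompt 718
]


def run_expression_enrollment(ui, camera, classifier, timeout_s: float = 10.0):
    captured = {}
    for expression_id, prompt_text in EXPRESSION_PROMPTS:
        ui.show_prompt(prompt_text)
        face_data = camera.capture_until(
            lambda frame: classifier.matches(frame, expression_id),
            timeout_s=timeout_s)
        if face_data is not None:
            ui.hide_prompt()
            ui.show_confirmation()     # cf. confirmation indications 716 and 719
            captured[expression_id] = face_data
    return captured
```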
As shown in fig. 7H, after device 701 captures facial data from a user's facial expression, device 701 displays a registration interface 704 with a prompt 720 instructing user 700 to select various appearance options and then scan other physical features (e.g., hands) of user 700 using a separate device (e.g., a headset). The different appearance options shown in the check-in interface 704 include a height option 722, a presentation option 724, and a glasses option 726. Each of these appearance options is discussed in more detail below.
Height option 722 is adjustable to indicate the height of user 700. In some embodiments, height option 722 is omitted and the height of user 700 is determined based on data collected from other sources, such as headphones, sensors, wearable devices, or other components that are capable of accessing the height of the user.
Representation options 724 include an audio option 724a and an avatar option 724b. These presentation options are selectable to determine a presentation mode for presenting user 700 in an XR environment. When audio option 724a is selected, user 700 is represented by an audio representation in an XR environment. When avatar option 724b is selected (as depicted in FIG. 7H), user 700 is represented by an avatar in an XR environment. Different presentation options are discussed in more detail below with respect to fig. 9A-9F and fig. 12A-12E.
The glasses options 726 include a no-glasses option 726a, a rectangular frame option 726b, a translucent frame option 726c, and a headset option 726d. Glasses options 726 are used to customize the appearance of an avatar representing user 700 in an XR environment. For example, the avatar is depicted in the XR environment as having glasses corresponding to the selected glasses option. If no-glasses option 726a is selected, the avatar is depicted without glasses. Similarly, if headset option 726d is selected, the avatar is depicted as wearing a headset device (e.g., an HMD). In some embodiments, the glasses options 726 are displayed, or available for selection, only when avatar option 724b is selected. A glasses option 726 may be selected manually by the user 700 or automatically by the device 701. For example, if device 701 determines that user 700 is not wearing glasses at any time during the enrollment process, device 701 automatically selects no-glasses option 726a. Similarly, if device 701 determines that user 700 is wearing glasses at any time during the enrollment process, device 701 automatically selects a glasses option (or creates a glasses option) that optionally matches the glasses detected on the user during the enrollment process. In the embodiment depicted in fig. 7H, the device 701 detects the user's glasses 707 and thus selects rectangular frame option 726b, which is similar in style to glasses 707.
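Automatic preselection of a glasses option from what was detected during enrollment could look like the sketch below; the detection result format and option identifiers are illustrative assumptions keyed to the option numbers above.

```python
# Sketch of automatically preselecting a glasses appearance option.

def preselect_glasses_option(detected_glasses):
    """detected_glasses: None if no glasses were seen during enrollment,
    otherwise e.g. {"frame": "rectangular", "translucent": False}."""
    if detected_glasses is None:
        return "no_glasses"             # cf. option 726a
    if detected_glasses.get("translucent"):
        return "translucent_frame"      # cf. option 726c
    if detected_glasses.get("frame") == "rectangular":
        return "rectangular_frame"      # cf. option 726b
    return "custom"                     # or create a new option matching the detected glasses
```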
When the desired appearance options are selected, the user 700 may select the continue affordance 728 to begin registering other portions of their body with a separate device. For example, the user may put on a headset device (e.g., an HMD such as discussed above with respect to display generation component 120) and use the headset (in particular, one or more cameras integrated with the headset) to collect image and/or depth data of other physical features of user 700 (e.g., the user's hands, feet, torso, arms, shoulders, etc.). As another example, the user may use another device (such as the electronic device 901 shown in fig. 9A) to collect image and/or depth data of other physical features of the user 700. In some embodiments, a separate device (e.g., the headset or device 901) may be used to display additional prompts. For example, a prompt may be displayed on a display component of the headset device that instructs the user to bend their fingers while the camera of the headset device captures data of the user's hand and fingers. Similar to the facial data collected via device 701, the data collected from the separate devices is used to register features of user 700 that may be used to create, model, and/or control various features of an avatar used to represent user 700 in an XR environment.
In the embodiment shown in fig. 7A-7H, the device 701 is a smart phone. However, in some embodiments, the registration process may be performed using other devices or components for interacting with the user and/or the XR environment (such as computer system 101 in fig. 1 or device 901 in fig. 9A). Such a device may be used in place of, or in addition to, device 701.
Additional description regarding fig. 7A-7H is provided below with respect to method 800 described with respect to fig. 8.
FIG. 8 is a flowchart of an exemplary method 800 for registering one or more features of a user of a computer system, according to some embodiments. The method 800 occurs at a computer system (e.g., 101; 701) (e.g., a smart phone, a tablet, a head-mounted display generating component) in communication with a display generating component (e.g., 702) (e.g., a visual output device, a 3D display, a display having at least a transparent or translucent portion on which an image may be projected (e.g., a see-through display), a projector, a heads-up display, a display controller) and one or more cameras (e.g., 703) (e.g., an infrared camera; a depth camera; a visible light camera).
During a registration process that includes capturing, via the one or more cameras (e.g., 703), facial data (e.g., image data, sensor data, and/or depth data representing the size, shape, position, pose, color, depth, or other characteristics of one or more features of the user's face) of a user (e.g., 700) of the computer system, the computer system (e.g., 701) displays (802), via the display generating component (e.g., 702), a registration interface (e.g., 704) for registering one or more features of the user (e.g., biometric features; the user's face; head and/or facial features such as hair, eyes, nose, ears, mouth, eyebrows, facial hair, skin, etc.; characteristics of features such as hair color, hair texture, hairstyle, eye color, skin tone, etc.; items worn by the user such as a hat, glasses, a shirt, etc.).
As part of displaying a registration interface (e.g., 704) for registering one or more features of a user (e.g., 700), a computer system (e.g., 701) outputs (804) a first prompt (e.g., 706; 710; 714; 718) (e.g., a visual, audible, and/or tactile prompt) (e.g., prompting the user to make a particular facial expression (e.g., smiling, squinting, a surprised expression, etc.) and/or to speak a particular phrase or word) to locate a first set of one or more of the user's facial features in a first predefined set of one or more facial expressions.
As part of displaying the registration interface (e.g., 704) for registering one or more features of the user (e.g., 700), the computer system (e.g., 701) outputs (806) a second prompt (e.g., 706; 710; 714; 718) to locate a second set of one or more of the user's facial features (which, in some embodiments, includes one or more facial features from the first set) in a second predefined set of one or more facial expressions different from the first predefined set of one or more facial expressions (e.g., prompting the user to make a different particular facial expression and/or speak a different particular phrase or word). Outputting a first prompt to locate a first set of one or more of the user's facial features in a first predefined set of one or more facial expressions, and outputting a second prompt to locate a second set of one or more of the user's facial features in a second predefined set of one or more facial expressions different from the first predefined set, improves the speed and accuracy of the registration process by providing feedback to the user of the computer system indicating a particular set of instructions for moving the user's face to obtain facial data for registering the one or more features of the user. Providing improved feedback enhances the operability of the computer system, increases the speed and accuracy of the enrollment process, and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user errors in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, the first predefined set of one or more facial expressions is a particular facial expression (e.g., a smile) and the second predefined set of one or more facial expressions is an expression made while speaking a particular phrase or word (e.g., "o"), or vice versa.
In some implementations, the computer system (e.g., 701) outputs a first prompt (e.g., 714) in accordance with a determination that the first set of enrollment criteria is not met (e.g., the first set of facial data has not been captured (e.g., has not been captured within a predetermined period of time)). In some implementations, the computer system outputs a second prompt (e.g., 718) in accordance with a determination that the first set of enrollment criteria is met and the second set of enrollment criteria is not met (e.g., the first set of facial data has been captured and the second set of facial data has not been captured (e.g., has not been captured within a predetermined period of time)). Outputting the first prompt in accordance with a determination that the first set of registration criteria is not met and outputting the second prompt in accordance with a determination that the first set of registration criteria is met and the second set of registration criteria is not met provides feedback to a user of the computer system indicating whether the user has met criteria for moving the user's face in order to obtain facial data for registering one or more features of the user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, after outputting the first prompt (e.g., 714), the computer system (e.g., 701) captures a first set of facial data of the user (e.g., 700) via the one or more cameras (e.g., capturing facial data of the user while the user is making the first facial expression (e.g., locating the first set of one or more facial features in the first predefined set of one or more facial expressions)). In some implementations, after outputting the second prompt (e.g., 718), the computer system captures a second set of facial data of the user via the one or more cameras (e.g., captures facial data of the user while the user is making the second facial expression (e.g., locates the second set of one or more facial features in the second predefined set of one or more facial expressions)).
In some embodiments, after capturing a first set of facial data of the user (e.g., 700) via the one or more cameras (e.g., 703) (in some embodiments, in response to capturing the first set of facial data, and in accordance with a determination that the first set of facial data of the user meets a first set of expression criteria (e.g., the data is identified as corresponding to a first type of facial expression (e.g., a smile; the expression corresponding to the first prompt))), the computer system (e.g., 701) ceases display of the first prompt (e.g., no longer displaying prompt 714, as depicted in FIG. 7E). In some embodiments, after capturing a second set of facial data of the user via the one or more cameras (in some embodiments, in response to capturing the second set of facial data, and in accordance with a determination that the second set of facial data of the user meets a second set of expression criteria (e.g., the data is identified as corresponding to a second type of facial expression (e.g., a frown; the expression corresponding to the second prompt))), the computer system ceases display of the second prompt (e.g., no longer displaying prompt 718). Ceasing display of the first prompt after the first set of facial data has been captured via the one or more cameras, and ceasing display of the second prompt after the second set of facial data has been captured via the one or more cameras, provides feedback to the user of the computer system indicating that facial data for registering the one or more features of the user has been captured in accordance with the instructions in the first prompt and the instructions in the second prompt. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
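By way of a non-limiting illustration, the criteria-gated prompt sequencing described above (a prompt is output while its registration criteria are unmet and its display ceases once captured facial data matches the prompted expression) could be sketched as follows. The Swift types, names, and prompt text below are hypothetical and are not drawn from the embodiments described herein.

```swift
// Hypothetical sketch of a criteria-gated registration prompt sequence.
enum RegistrationPrompt: String {
    case smile = "Please smile"
    case sayPhrase = "Please say 'O'"
}

struct RegistrationStep {
    let prompt: RegistrationPrompt
    var captured = false          // stands in for this step's registration criteria
}

final class RegistrationFlow {
    private var steps = [RegistrationStep(prompt: .smile),
                         RegistrationStep(prompt: .sayPhrase)]

    // Returns the prompt to display, or nil once all criteria are met.
    var currentPrompt: RegistrationPrompt? {
        steps.first(where: { !$0.captured })?.prompt
    }

    // Called after each capture; ceases display of the current prompt once
    // the captured facial data matches the prompted expression.
    func process(capturedDataMatchesPromptedExpression match: Bool) {
        guard let index = steps.firstIndex(where: { !$0.captured }), match else { return }
        steps[index].captured = true
    }
}
```

In this sketch, the expression-matching check itself is abstracted to a Boolean; any facial-data analysis could drive it.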
In some embodiments, the first predefined set of one or more facial expressions is selected from the group consisting of a smile, a frown, a squint, and a surprised expression (e.g., with the mouth and eyes widened) (e.g., as shown by prompt 714 in FIG. 7D).
In some implementations, the second prompt (e.g., 718) includes prompting the user (e.g., 700) to speak a set of one or more words (e.g., a word or phrase (e.g., "say 'o'" as indicated by prompt 718 in fig. 7F)). In some embodiments, the user is instructed to speak a particular word or phrase such that the user's face will achieve a particular facial expression when speaking, and the computer system (e.g., 701) captures facial data of the user while the user is speaking.
In some implementations, as part of displaying the registration interface (e.g., 704) for registering one or more features of the user (e.g., 700), the computer system (e.g., 701) outputs a third prompt (e.g., 706; 710) (e.g., a visual, audible, and/or tactile prompt) to change the position of the user's head (e.g., a prompt to move the user's head such that a different portion of the head is within the field of view of the one or more cameras (e.g., 703)). Outputting a third prompt to change the position of the user's head provides feedback to the user of the computer system indicating a particular set of instructions for moving the user's head to obtain facial data for registering one or more features of the user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, the computer system (e.g., 701) outputs the third prompt (e.g., 706; 710) before at least one of the first prompt (e.g., 714) or the second prompt (e.g., 718) (e.g., the prompts for different facial expressions are output after the prompt to move the user's head).
In some implementations, as part of displaying the registration interface (e.g., 704) for registering one or more features of the user (e.g., 700), the computer system (e.g., 701) outputs a fourth prompt (e.g., 706) (e.g., a visual, audible, and/or tactile prompt) to change the position of the one or more cameras (e.g., 703) relative to the user's head while keeping the user's head stationary (e.g., a prompt to move the one or more cameras around the user's head without moving the user's head). Outputting a fourth prompt to change the position of the one or more cameras relative to the head of the user while keeping the head of the user stationary provides feedback to the user of the computer system indicating a particular set of instructions for moving the one or more cameras relative to the head of the user to reduce the effects of glare while obtaining facial data for registering one or more features of the user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently. In some embodiments, moving the user's head causes glare to move across the head as the head moves. Variations in glare location can cause problems with facial data capture. Thus, to avoid these problems, the computer system (e.g., 701) prompts the user (e.g., 700) to move the camera (e.g., 703; 701) without moving the user's head, allowing the camera to capture facial data of the user's head from different angles without changing the position of any glare relative to the user's head.
In some implementations, as part of displaying the registration interface (e.g., 704) for registering one or more features of the user (e.g., 700), the computer system (e.g., 701) outputs a fifth prompt (e.g., 722) (e.g., a visual, audible, and/or tactile prompt) to indicate the user's height. Outputting a fifth prompt to indicate the user's height provides feedback to the user of the computer system indicating a particular set of instructions for providing data for registering one or more characteristics of the user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently. In some embodiments, in conjunction with outputting the fifth prompt, the computer system displays one or more user interface objects (e.g., a text input field; a virtual keyboard or keypad; a slider bar) for inputting the height.
In some embodiments, as part of displaying the registration interface (e.g., 704) for registering one or more features of the user (e.g., 700), the computer system (e.g., 701) outputs a sixth prompt (e.g., 706) (e.g., a visual, auditory, and/or tactile prompt) to remove a set of eyeglasses (e.g., 707) (e.g., eyeglasses, framed eyeglasses, framed corrective lenses, framed decorative lenses, framed protective lenses) from the user's face for at least a portion of the registration process. Outputting a sixth prompt to take a set of glasses off the user's face for at least a portion of the registration process provides feedback to the user of the computer system indicating a particular set of instructions for eliminating the effect of wearing the glasses while obtaining facial data for registering one or more features of the user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently. In some embodiments, the sixth prompt is output in accordance with a determination that the user is currently wearing the set of eyeglasses (e.g., based on data captured by the one or more cameras).
In some implementations, an avatar is generated (e.g., at the computer system (e.g., 701); at another computer system (e.g., 901 and/or 901a discussed below)) using at least a portion of the facial data captured during the registration process. In some embodiments, an avatar (e.g., 919 and/or 1220 discussed below) is displayed using an external computer system (e.g., 901;901 a) that is different from the computer system (e.g., a computer system different from the computer system used to perform the registration process) (e.g., a headset device used to interact in an augmented reality and/or virtual reality environment). In some implementations, the registration process is performed using a first device (e.g., 701) (e.g., a smart phone) and an avatar generated from the registration process is displayed (e.g., in an augmented reality environment) using a different device (e.g., 901 a) (e.g., a headset device). In some embodiments, using different devices for the registration process allows the computer system to delegate specific registration tasks to devices better equipped to facilitate those tasks.
In some embodiments, as part of displaying the registration interface (e.g., 704) for registering one or more features of the user (e.g., 700), the computer system (e.g., 701) outputs a seventh prompt (e.g., 720) (e.g., a visual, auditory, and/or tactile prompt) to capture a pose of a non-facial feature of the user (e.g., a prompt to register a non-facial feature such as an ear, arm, hand, upper body, etc.). Outputting a seventh prompt to capture a pose of the user's non-facial features provides feedback to the user of the computer system indicating a particular set of instructions for registering one or more of the user's non-facial features. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, if the user's hair covers their ear, a prompt (e.g., 706;710;714;718; 720) instructs the user to pull their hair back to expose the ear so that the ear can be scanned (e.g., capturing data representing the size, shape, position, pose, color, depth, or other characteristics of the ear). In some implementations, the prompt (e.g., 720) instructs the user to wear a device (e.g., a headset) to capture a pose of the non-facial feature. For example, a user (e.g., 700) is instructed to wear headphones to scan their hand. In some implementations, the user is prompted to move the non-facial features during enrollment. For example, the user is instructed to bend their finger while scanning their hand.
It is noted that the details of the process described above with reference to method 800 (e.g., fig. 8) also apply in a similar manner to methods 1000, 1100, 1300, and 1400 described below. For example, methods 1000, 1100, 1300, and/or 1400 optionally include one or more of the features of the various methods described above with reference to method 800. For the sake of brevity, these details are not repeated hereinafter.
Fig. 9A-9F, 10, and 11 depict examples in which various visual effects associated with an avatar are presented in an XR environment.
Fig. 9A depicts a user 700 holding an electronic device 901, which is a computer system for viewing an XR environment (e.g., computer system 101 of fig. 1). The device 901 includes a camera 904 (e.g., a rear camera), a display 902 as shown in fig. 9B, and a camera 903 (e.g., a front camera). In some implementations, camera 904 is used to capture image and/or depth data of a physical environment for rendering an XR environment using display 902. For example, in fig. 9A, user 700 positions hand 700-1 within the field of view of camera 904 for interaction with an XR environment. In some embodiments, device 901 is a tablet computer. However, device 901 may alternatively be another type of electronic device capable of viewing an XR environment, such as a smart phone or a headset device.
FIG. 9B shows device 901 and device 901a displaying an interface depicting an XR environment. Device 901a is similar to device 901 and includes similar features to device 901, including a display 902a, a camera 903a, and in some embodiments a camera similar to camera 904 positioned on an opposite side of device 901a. Device 901a is used by a second user (e.g., a user represented by avatar 922 on device 901 and rendering 918-1 on device 901 a) to view an XR environment. In some embodiments, the user 700 and the second user are in the same physical environment (e.g., the same room). In some embodiments, the user 700 and the second user are in different physical environments (e.g., different rooms or geographic locations).
Device 901 displays via display 902 an XR interface 906, which is an interface for viewing an XR session of XR environment 905. XR interface 906 includes a rendering of XR environment 905 using image and/or depth data captured via camera 904 (e.g., camera 904 is currently selected to capture image/depth data for rendering the XR environment). XR interface 906 optionally includes control options 907 and camera preview 908. Control option 907 can be selected to perform various operations such as muting audio (e.g., audio at device 901), flipping camera views (e.g., switching from a view including data captured from camera 904 to a view including data captured from camera 903), and terminating an XR session. The camera preview 908 provides a rendering of data captured within the field of view of the camera that is not currently selected to capture data for rendering the XR environment. For example, in FIG. 9B, camera preview 908 provides a rendering 908-1 of user 700 captured via camera 903.
Device 901 displays an XR environment 905 with representations of physical objects physically present in the physical environment of user 700 and within the field of view of camera 904. The representations of the physical objects include a bottle 910, a table 912, and a user's hand 914 (with fingers 914-1 through 914-5) (the hand 700-1 of the user 700 is located in front of the camera 904, as shown in fig. 9A). In the embodiment depicted in fig. 9B, the representations of the physical objects are displayed as a pass-through video of the physical environment. For example, in some embodiments, hand 914 is a passthrough video feed source for hand 700-1. In some embodiments, device 901 includes a transparent display component and the physical object is visible through the transparent display component due to its transparent nature. In some implementations, when the device 901 operates in a full virtual mode (e.g., VR mode), the device 901 renders the physical object as a virtual object. For example, in such an embodiment, hand 914 is a virtual representation of user hand 700-1. The position, posture, movement, or other aspect of the hand 914 (and/or fingers 914-1 through 914-5) is determined based on the corresponding position, posture, movement, or other aspect of the user's physical hand 700-1. However, for simplicity, reference is sometimes made to hand 914 (and/or fingers 914-1 through 914-5) when describing corresponding positions, gestures, movements, or other aspects of the user's physical hand 700-1 and/or the user's physical fingers.
The device 901 also displays an XR environment 905 having virtual objects rendered by the device 901 in the XR environment. The virtual objects include a highlighting 920 and an avatar 922. Avatar 922 is a representation (e.g., a virtual representation) of the second user in XR environment 905. In some embodiments, avatar 922 is rendered at device 901 based on data received at and/or obtained by device 901 and/or device 901 a. Highlighting 920 is a visual hand effect (e.g., a visual indicator) displayed around the perimeter of a portion of user's hand 914 that is within the field of view of camera 904 and rendered on display 902. The displayed visual hand effects (such as highlighting 920 and other effects discussed below) indicate that the device 901 recognizes the user's hand 914 as a hand. This provides feedback to user 700 indicating that device 901 recognizes hand 700-1 and is therefore responsive to hand movements. Various attributes of highlighting 920 are described below. However, it should be understood that these attributes apply in a similar manner to other visual hand effects described herein unless otherwise indicated.
As user hand 700-1 moves within the field of view of camera 904, device 901 displays highlighting 920 moving with hand 914. In some implementations, the amount of highlighting 920 displayed varies based on the amount of hand 914 visible on the display 902. For example, as more of the user's hand 700-1 moves into the field of view of camera 904, a greater portion of hand 914 is displayed on display 902, and a greater amount of highlighting 920 is displayed around the perimeter of hand 914 as it moves farther onto the screen. Similarly, as user hand 700-1 moves out of the field of view of camera 904, the portion of hand 914 displayed on display 902 decreases, and less highlighting 920 is displayed around its perimeter as hand 914 moves off the screen. In some embodiments, when the user manipulates the pose of their hand 700-1 (e.g., making a fist, making a grasping gesture, crossing their fingers, etc.), highlighting 920 conforms to the changing perimeter of hand 914 as the hand pose is manipulated. In some embodiments, device 901 displays other visual hand effects in addition to or instead of highlighting 920. These other visual hand effects are discussed in more detail below, including with reference to FIGS. 9C-9F, 10, and 11.
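As a non-limiting illustration of the behavior just described, the amount of highlighting could be tied to how much of the hand is on screen, for example as in the following sketch. The Swift types and the fraction-of-hand metric are hypothetical assumptions, not part of the embodiments above.

```swift
// Hypothetical sketch: scale the highlighting to the on-screen portion of the hand.
struct HandObservation {
    var outlinePoints: [(x: Double, y: Double)]  // perimeter samples of the displayed hand (e.g., 914)
    var visibleFraction: Double                  // 0.0 ... 1.0 of the physical hand (e.g., 700-1) in view
}

// Returns the subset of perimeter points to highlight; as more of the hand enters
// the camera's field of view, more highlighting is drawn around its perimeter.
func highlightPoints(for hand: HandObservation) -> [(x: Double, y: Double)] {
    let clamped = max(0.0, min(1.0, hand.visibleFraction))
    let count = Int(Double(hand.outlinePoints.count) * clamped)
    return Array(hand.outlinePoints.prefix(count))
}
```

Because the outline points are resampled each frame, the same approach also lets the highlight conform to a changing hand pose.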
Device 901a shows an XR interface 916 similar to XR interface 906. The XR interface 916 includes control options 917 (similar to control options 907) and a camera preview 918 (similar to camera preview 908), which provides a rendering 918-1 of the second user captured via camera 903 a. The XR interface 916 depicts an XR environment 915, which is an XR environment rendered on display 902a and displayed to a second user during an XR session.
As depicted in FIG. 9B, device 901a displays XR environment 915 with avatar 919. Avatar 919 is a representation of user 700 in XR environment 915. In the embodiments depicted herein, avatar 919 is a virtual avatar having virtual features such as virtual shirt 919-1 and virtual hand 919-2. In the embodiment depicted in FIG. 9B, XR environment 915 does not include a representation of the physical objects depicted on device 901 (e.g., the second user is in a different physical environment than user 700). In FIGS. 9B-9F, device 901a displays XR environment 915 with avatar 919 having avatar hand 919-2, which is a virtual representation of user hand 700-1 (e.g., similar to hand 914), but does not display visual hand effects (e.g., highlighting 920) on avatar hand 919-2.
In some implementations, one or more attributes of the appearance of avatar 919 are determined based on profile settings, appearance settings, registration data, and/or data obtained at device 901 (e.g., data collected from one or more cameras/sensors of device 901 indicating the position, posture, appearance, etc. of user 700 (or a portion thereof (e.g., 700-1)). In some embodiments, data collected from device 901 is transmitted to device 901a and used to determine various attributes of the appearance of avatar 919 or other aspects of XR environment 915. In some embodiments, one or more attributes of the appearance of avatar 919 are determined based on data collected from device 901 a. For example, if user 700 and the second user are in the same room, device 901a may determine a pose of avatar 919 based on a pose of user 700 within a field of view of a camera of device 901a, as discussed in more detail below.
In some embodiments, as discussed above with respect to FIGS. 7A-7H, portions of avatar 919 may be derived from a registration of user 700. For example, in FIG. 9B, avatar 919 is depicted as wearing shirt 919-1, which represents the same orange-yellow shirt 709 that user 700 is wearing during the enrollment process, even though user 700 is currently wearing a different shirt, as shown in FIG. 9A and preview 908-1. In some embodiments, avatar 919 has an appearance determined based on various appearance settings selected by user 700. For example, avatar 919 is not depicted as wearing glasses because user 700 does not select glasses option 726a discussed above with respect to FIG. 7H.
In some embodiments, the appearance of avatar 919 is determined based on data collected in real-time using, for example, device 901. For example, avatar 919 is depicted with its left hand raised to model the pose of user hand 914 based on the position of user hand 700-1 detected using camera 904 of device 901. In some embodiments, avatar 919 may have an appearance (e.g., pose) determined based on data collected from other sources, such as camera 903 of device 901. For example, when user 700 opens their mouth, camera 903 detects the open mouth. This data is transferred to device 901a, which then displays avatar 919 with the mouth open in a similar manner. As another example, device 901 may determine from data collected via camera 903 that user 700 is wearing glasses, and in response, update avatar appearance settings to select a glasses appearance option (e.g., option 726 b) for avatar 919. The update to the appearance settings is then detected by device 901a, which then updates the display of avatar 919 to include the selected glasses.
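As a non-limiting illustration, the way an avatar's appearance can draw on registration data, user-selected appearance settings, and live sensor data at once could be sketched as follows. The Swift types, property names, and example traits are hypothetical and are not drawn from the embodiments described herein.

```swift
// Hypothetical sketch of combining appearance sources for an avatar such as 919.
struct EnrollmentData { var shirtColor: String }          // e.g., captured during registration
struct AppearanceSettings { var wearsGlasses: Bool }      // e.g., user-selected options
struct LiveCapture { var mouthOpen: Bool; var leftHandRaised: Bool }  // e.g., from device cameras

struct AvatarAppearance {
    var shirtColor: String
    var showsGlasses: Bool
    var mouthOpen: Bool
    var leftHandRaised: Bool
}

// Static traits come from registration and settings; pose-like traits come from live data.
func makeAvatarAppearance(enrollment: EnrollmentData,
                          settings: AppearanceSettings,
                          live: LiveCapture) -> AvatarAppearance {
    AvatarAppearance(shirtColor: enrollment.shirtColor,
                     showsGlasses: settings.wearsGlasses,
                     mouthOpen: live.mouthOpen,
                     leftHandRaised: live.leftHandRaised)
}
```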
Fig. 9C depicts an embodiment similar to that in fig. 9B, except that user 700 has moved his hand 700-1 toward the bottle (as depicted by the position of hand 914 relative to bottle 910), and the visual hand effect is now depicted as a highlighted point indicator 930 located on the fingertips of fingers 914-1 through 914-5. In response to the detected movement of user hand 700-1, device 901 updates the display of XR interface 906 to depict hand 914 extending toward bottle 910, and device 901a updates the display of XR interface 916 to depict avatar 919 moving avatar hand 919-2 in a similar manner.
Similar to highlighting 920, the highlighted point indicators 930 are a visual hand effect that changes based on movement of the user's hand 700-1 (specifically, movement of the fingers). Device 901 displays a highlighted point indicator 930 at the tip of each of fingers 914-1 through 914-5 of hand 914. As a fingertip moves, the corresponding point indicator 930 moves accordingly.
In some implementations, the device 901 modifies the displayed visual hand effect in response to detecting a particular gesture performed by the user's hand 700-1. For example, in fig. 9D, user 700 performs a pinch gesture with hand 700-1, as depicted by hand 914. In response to detecting the pinch gesture, device 901 increases the display size and brightness of the highlighted point indicators 930-1 and 930-2. The modification to the visual hand effect provides feedback to the user 700 indicating that the gesture was recognized by the device 901. In some implementations, the device 901 responds to the gesture by performing one or more operations associated with the gesture (rather than modifying the point indicators 930-1 and 930-2).
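A non-limiting sketch of this kind of gesture-driven emphasis appears below: when a pinch is recognized, the indicators on the pinching fingers are enlarged and brightened. The Swift types, finger indices, and scaling values are hypothetical assumptions.

```swift
// Hypothetical sketch: emphasize the fingertip indicators involved in a pinch.
struct PointIndicator {
    var fingerIndex: Int
    var radius: Double
    var brightness: Double   // 0.0 ... 1.0
}

// When a pinch between thumb (index 0) and index finger (index 1) is detected,
// increase the size and brightness of the corresponding indicators (e.g., 930-1, 930-2).
func applyPinchFeedback(to indicators: [PointIndicator], pinchDetected: Bool) -> [PointIndicator] {
    guard pinchDetected else { return indicators }
    return indicators.map { indicator in
        var updated = indicator
        if indicator.fingerIndex == 0 || indicator.fingerIndex == 1 {
            updated.radius *= 1.5
            updated.brightness = min(1.0, indicator.brightness + 0.3)
        }
        return updated
    }
}
```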
As depicted in FIG. 9D, device 901a modifies the display of XR interface 916 to depict avatar 919 performing a pinch gesture, but not displaying a visual hand effect.
In fig. 9E, device 901 detects that user 700 is holding a bottle and, in response, displays a hand 914 holding bottle 910. In the embodiment depicted in fig. 9E and 9F, the visual hand effect is now displayed as particles 940 appearing on the fingers of the hand 914. As shown in fig. 9F, as the user's hand moves, the particles drag with the moving finger. In some embodiments, particles 940 have an animated appearance, moving or shifting around the respective fingers of hand 914.
As discussed above, data collected from device 901 and/or device 901a may be used to determine a gesture of user 700. Similarly, such data may be used to determine that user 700 is holding a physical object, namely bottle 910. In response to determining that the user is holding the physical object, device 901a updates the display of XR interface 916 to include rendered bottle 945 in hand 919-2 of avatar 919. Rendered bottle 945 is a representation of a physical bottle held by user 700 that does not have the same appearance as bottle 910. For example, rendered bottle 945 is shown having a different shape than bottle 910. Additionally, a rendered bottle 945 is shown in fig. 9E, with altered visual characteristics (e.g., represented by hatching 947) distinguishing it from avatar 919.
In some implementations, the visual characteristics include one or more of blur amount, opacity, color, visual smoothness, attenuation, particle density, resolution, or other visual parameters. By comparing one or more of the visual characteristics of rendered bottle 945 to the visual characteristics of avatar 919 (e.g., avatar hands 919-2), the changed visual characteristics distinguish the appearance of rendered bottle 945 from the appearance of avatar 919. For example, rendered bottle 945 may be displayed with greater (or lesser) blur than avatar 919. As another example, rendered bottle 945 may be displayed with a low amount of particle density such that rendered bottle 945 appears to be a loose collection of particles with more and/or larger gaps between the particles forming the bottle when compared to avatar 919 presented with densely packed particles with fewer and/or smaller gaps. As another example, rendered bottle 945 may be displayed with less visual smoothing than avatar 919. As another example, rendered bottle 945 may be displayed with a more pixelated appearance than avatar 919. It should be appreciated that the foregoing examples of changing visual characteristics may be switched with respect to rendered bottles and avatars. For example, instead of displaying rendered bottle 945 in greater pixelation, rendered bottle 945 may be displayed with a less pixelated appearance than avatar 919.
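As a non-limiting illustration of varying such parameters, one could derive the held object's characteristics from the avatar's by offsetting one or more of them, as in the sketch below. The Swift type, parameter names, and offset values are hypothetical and not taken from the embodiments above.

```swift
// Hypothetical sketch of visual parameters that could be varied to set a rendered
// held object (e.g., 945) apart from the avatar (e.g., 919).
struct VisualCharacteristics {
    var blur: Double            // larger = blurrier
    var opacity: Double         // 0.0 ... 1.0
    var particleDensity: Double // particles per unit area
    var smoothing: Double       // amount of visual smoothing
}

// Offsetting any parameter in either direction distinguishes the two appearances.
func distinguishedCharacteristics(from avatar: VisualCharacteristics) -> VisualCharacteristics {
    VisualCharacteristics(blur: avatar.blur + 0.4,
                          opacity: max(0.0, avatar.opacity - 0.2),
                          particleDensity: avatar.particleDensity * 0.5,
                          smoothing: avatar.smoothing * 0.5)
}
```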
As user 700 moves the physical bottle, devices 901 and 901a modify their respective XR interfaces based on the detected movement. For example, when user 700 tilts the vial in fig. 9F, device 901 detects movement within the field of view of camera 904 and, in response, updates XR interface 906 to display that hand 914 is tilting vial 910 accordingly. As the hand 914 moves, the device 901 also displays particles 940 that move with the fingers of the hand 914 and have a tailing effect indicated by particles 940-1.
Device 901a modifies the display of avatar 919 and rendered bottle 945 (e.g., the position of rendered bottle 945) based on the detected movements of user hand 700-1 and the bottle. In some implementations, the device 901a displays a rendered bottle 945 having the appearance (e.g., shape and altered visual characteristics) depicted in fig. 9E when moved.
In some embodiments, device 901a displays a rendered bottle having an appearance that is generated based on a library of image data (e.g., images, video, etc.) or other data that is not received from device 901 and that is usable to generate a rendering of a physical object being held by user 700. In the embodiment depicted in fig. 9F, device 901a replaces rendered bottle 945 with rendered bottle 948. Rendered bottle 948 has a different shape than bottle 910 (and rendered bottle 945) because rendered bottle 948 is presented based on a library of image data rather than data captured for a physical bottle (e.g., image data). In some embodiments, the rendered bottle 948 has a realistic appearance. In some embodiments, the rendered bottle 948 has altered visual characteristics. In some embodiments, rendered bottle 948 has the same visual characteristics as avatar 919.
In some implementations, the device 901 selectively displays visual hand effects (e.g., highlighting 920, point indicators 930, particles 940) based on the position, pose, or shape of the user's hand 700-1. For example, in some embodiments, device 901 does not display visual hand effects unless user hand 700-1 is within a predefined area of the field of view of a camera (e.g., camera 904), or if the hands are otherwise determined to be relevant (e.g., the user is looking at their hands). In some embodiments, the device 901 does not display visual hand effects based on the currently enabled visual hand effects and the pose of the user's hand 700-1. For example, if the user's hand is fist-shaped, the user's fingertip is not displayed, and thus the device 901 does not display the point indicator 930 on the fingertip of the hand. In some embodiments, device 901 does not display any visual hand effect when hand 700-1 has a particular pose (e.g., fist) or is otherwise determined to be irrelevant to a particular scene.
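A non-limiting sketch of this relevance gating follows; the Swift types and the particular relevance criteria chosen (region of the field of view, gaze, pose) are hypothetical assumptions drawn from the examples above rather than a definitive implementation.

```swift
// Hypothetical sketch of gating visual hand effects on relevance.
enum HandPose { case open, pointing, fist }

struct HandState {
    var pose: HandPose
    var inPredefinedRegion: Bool   // hand within a designated part of the camera's field of view
    var userIsLookingAtHand: Bool
}

// Fingertip indicators are suppressed for a fist (no fingertips visible), and all
// effects are suppressed when the hand is judged not relevant.
func shouldShowFingertipIndicators(for hand: HandState) -> Bool {
    let relevant = hand.inPredefinedRegion || hand.userIsLookingAtHand
    return relevant && hand.pose != .fist
}
```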
In some embodiments, the device (e.g., device 901) is a headset device and/or the camera (e.g., camera 904) is positionally offset (e.g., vertically) from the display (e.g., display 902), and the visual hand effect is displayed along a predicted line of sight such that, when viewed on the display, the visual hand effect is aligned with the user's line of sight and appears to be positioned on the user's hand.
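As a non-limiting illustration, such line-of-sight alignment can be treated as intersecting the eye-to-hand ray with the display plane. The sketch below assumes eye-space coordinates with the display lying in the plane z = screenDistance, and assumes the hand position has already been converted from the camera's frame using the known camera-to-eye offset; these assumptions and the Swift types are hypothetical.

```swift
// Hypothetical sketch: place the indicator where the eye-to-hand line of sight
// crosses the display plane, compensating for a camera offset from the display.
struct Vec3 { var x, y, z: Double }

func indicatorScreenPosition(handInEyeSpace h: Vec3, screenDistance d: Double) -> (x: Double, y: Double)? {
    guard h.z > 0 else { return nil }   // hand must be in front of the viewer
    let t = d / h.z                     // scale factor along the eye-to-hand ray
    return (x: h.x * t, y: h.y * t)     // intersection with the plane z = d
}
```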
For additional description of FIGS. 9A-9F, see methods 1000 and 1100 described below with respect to FIGS. 10 and 11.
FIG. 10 is a flowchart of an exemplary method 1000 for displaying visual indicators on a hand of an avatar in an XR environment, according to some embodiments. The method occurs at a computer system (e.g., 101; 901) (e.g., smart phone, tablet, head mounted display generating component) in communication with a display generating component (e.g., 902) (e.g., visual output device, 3D display, display having at least a transparent or translucent portion on which an image may be projected (e.g., see-through display), projector, heads-up display, display controller) and one or more sensors (e.g., 903; 904) (e.g., infrared camera; depth camera, visible light camera).
The computer system (e.g., 901) displays (1002) a user characteristic indicator interface (e.g., 906) via a display generation component (e.g., 902). The user characteristic indicator interface includes (1004) a set of one or more visual indicators (e.g., 920; 930; 940) (e.g., virtual objects and/or visual effects) corresponding to a detected position (e.g., a position detected via the one or more sensors) of a set of one or more features (e.g., 914-1;914-2;914-3;914-4; 914-5) (e.g., the hand; a portion of the hand; one or more fingers; one or more portions (e.g., fingertips, knuckles) of one or more fingers) of a hand (e.g., 700-1; 914) of a user (e.g., a user of the computer system) in a physical environment. The set of one or more visual indicators is displayed in the augmented reality environment (e.g., 905) and has a first display position corresponding to (e.g., co-located; based on; overlapping) a first detected position of the set of one or more features of the user's hand (e.g., 700-1) in the physical environment (e.g., 920 displayed around the hand 914 in FIG. 9B; 930 displayed on the fingers 914-1 through 914-5 in FIG. 9C; 940 displayed on the finger 914-5 in FIG. 9E). In some embodiments, the set of one or more visual indicators is displayed in the interface so as to overlap (e.g., overlay) the first detected position from the perspective of the user so as to appear to the user to be positioned on at least one feature of the set of one or more features of the user's hand. In some embodiments, the computer system displays a visual indicator in a line of sight (e.g., predicted/estimated line of sight) of the user between the user and the user's hand (or portion thereof) so as to appear to the user to be positioned on the user's hand (or portion thereof) in an augmented reality environment. In some embodiments, the one or more sensors are used to detect the position of the user's hand, and the user's perspective is a view angle that is different from the perspective of the one or more sensors (e.g., cameras) capturing the position of the user's hand in the physical environment.
The computer system (e.g., 901) detects (1006), via one or more sensors (e.g., 904), movement (e.g., change in position, change in posture, gesture, etc.) of at least one feature (e.g., 914-1;914-2;914-3;914-4; 914-5) of the user's hand (e.g., 700-1; 914) of the set of one or more features of the user's hand. In some implementations, the computer system receives data (e.g., depth data, image data, sensor data (e.g., image data from a camera)) indicative of a change in position (e.g., physical position, orientation, gesture, movement, etc.) of at least a portion of a user's hand in a physical environment.
In response to detecting movement of at least one feature (e.g., 914-1;914-2;914-3;914-4; 914-5) of the user's hand (e.g., 700-1; 914) of the set of one or more features, the computer system (e.g., 901) updates (1008) the display of the user feature indicator interface (e.g., 906).
As part of updating the display of the user characteristic indicator interface (e.g., 906), and in accordance with a determination that the set of one or more features (e.g., 914-1;914-2;914-3;914-4; 914-5) of the hand (e.g., 914) of the user (e.g., 700) has moved (in some embodiments, from the first detected position) to a second detected position in the physical environment (e.g., FIG. 9D) (e.g., the user's hand is detected moving from a first position in the physical environment to a second position in the physical environment), the computer system (e.g., 901) displays (1010), via the display generating component (e.g., 902), the set of one or more visual indicators (e.g., 920;930; 940) having, in the augmented reality environment, a second display position corresponding to the second detected position of the set of one or more features of the user's hand in the physical environment. In some embodiments, displaying the set of one or more visual indicators includes displaying one or more of the visual indicators that are moved so as to appear to the user to move in unison with one or more features of the user's hand. In some embodiments, the set of one or more visual indicators is displayed as positioned in the interface so as to overlap (e.g., overlay) the second detected position from the perspective of the user so as to appear to the user to be positioned on at least one feature of the set of one or more features of the user's hand.
As part of updating the display of the user characteristic indicator interface (e.g., 906), and in accordance with determining one or more characteristics (e.g., 914-1;914-2;914-3;914-4; 914-5) of the user's hand (e.g., 700) to move (in some embodiments, from a first detected position) to a third detected position in the physical environment that is different from the second detected position (e.g., fig. 9E), the computer system (e.g., 901) displays (1012) the set of one or more visual indicators (e.g., 920;930; 940) in the augmented reality environment via the display generating component (e.g., 902) that has a third display position that corresponds to a third detected position in the physical environment of the set of one or more characteristics of the user's hand, wherein the third display position in the augmented reality environment is different from the second display position in the augmented reality environment. Displaying the set of one or more visual indicators in the augmented reality environment having a second display position or a third display position corresponding to a second detected position or a third detected position of the set of one or more features of the user's hand in the physical environment provides feedback to the user of the computer system indicating the detected position of the set of one or more features of the user's hand and increases the accuracy of the visual indicators displayed in the augmented reality environment by taking into account movement of the set of one or more features of the user's hand in the physical environment. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, as part of detecting movement of at least one feature (e.g., 914-1;914-2;914-3;914-4; 914-5) of the user's hand (e.g., 914) in the set of one or more features of the user's hand (e.g., 700), the computer system (e.g., 901) detects (e.g., via the one or more sensors (e.g., 904)) an amplitude and/or direction of movement of the at least one feature of the user's hand in the set of one or more features of the user's hand. In some embodiments, displaying the set of one or more visual indicators (e.g., 920;930; 940) having the second display position in the augmented reality environment includes displaying the set of one or more visual indicators moving from the first display position to the second display position, wherein movement from the first display position to the second display position is based on (e.g., characteristics of movement (e.g., speed, magnitude, direction) are based on) the detected magnitude and/or direction of movement of at least one feature of the user's hand in the set of one or more features of the user's hand. In some implementations, displaying the set of one or more visual indicators having the third display position in the augmented reality environment includes displaying the set of one or more visual indicators moving from the first display position to the third display position, wherein movement from the first display position to the third display position is based on (e.g., characteristics of movement (e.g., speed, magnitude, direction) based on) the detected magnitude and/or direction of movement of at least one feature of the user's hand in the set of one or more features of the user's hand. Displaying movement of the set of one or more visual indicators moving from the first display position to the second display position or the third display position in the augmented reality environment based on the detected magnitude and/or direction of movement of at least one feature of the set of one or more features of the user's hand provides feedback to the user of the computer system indicating the detected position of the set of one or more features of the user's hand and increases the accuracy of the displayed visual indicators by taking into account the magnitude and/or direction of movement of the at least one feature of the user's hand. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
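As a non-limiting illustration, moving each indicator by the detected magnitude and direction of its tracked feature's movement could be sketched as follows; the Swift types, per-feature keying, and scale parameter are hypothetical assumptions.

```swift
// Hypothetical sketch: each indicator follows its hand feature by the detected
// magnitude and direction of that feature's movement.
struct Vec2 { var x, y: Double }

func updateIndicators(positions: [Int: Vec2],          // indicator position per tracked feature (e.g., fingertip index)
                      detectedMovement: [Int: Vec2],    // per-feature movement, already in display coordinates
                      displayScale: Double = 1.0) -> [Int: Vec2] {
    var updated = positions
    for (feature, delta) in detectedMovement {
        if let current = positions[feature] {
            updated[feature] = Vec2(x: current.x + delta.x * displayScale,
                                    y: current.y + delta.y * displayScale)
        }
    }
    return updated
}
```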
In some embodiments, one or more of these visual indicators (e.g., 920;930; 940) are displayed as moving so as to appear to the user to move in unison with one or more corresponding features (e.g., 914-1;914-2;914-3;914-4; 914-5) of the user's hand (e.g., 914).
In some embodiments, the display generation component includes a transparent display component (e.g., a see-through display on which content is displayed (e.g., projected) and through which the physical environment is visible due to the transparent nature of the display), and the set of one or more visual indicators (e.g., 920;930; 940) is displayed at a location on the transparent display component that is predicted (e.g., estimated; predicted by the computer system (e.g., 901)) to be along a line of sight (e.g., a predicted/estimated line of sight) between the eyes of the user (e.g., 700) and the set of one or more features (e.g., 914-1;914-2;914-3;914-4; 914-5) of the hand, such that the visual indicators projected onto the transparent display appear, due to their positioning, to be located on the user's hand, which remains visible through the display. Displaying the set of one or more visual indicators at a location on the transparent display component that is predicted to be along a line of sight between the eyes of the user and the detected location of the set of one or more features of the hand provides feedback to the user of the computer system indicating the detected location of the set of one or more features of the user's hand and increases the accuracy of the displayed visual indicators by accounting for the visual offset between the user's line of sight and the perspective of the sensor detecting the location of the set of one or more features of the user's hand. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, the one or more sensors (e.g., 904) are used to detect the position of the user's hand (e.g., 914), and the view angle of the user (e.g., 700) is a different view angle than the view angle of the one or more sensors (e.g., cameras) capturing the position of the user's hand in the physical environment.
In some embodiments, displaying the set of one or more visual indicators (e.g., 920;930; 940) includes displaying a virtual highlighting effect (e.g., 920) in the augmented reality environment (e.g., 905) at a location corresponding to (e.g., at or near) a peripheral region (e.g., see FIG. 9B) of the set of one or more features of the user's (e.g., 700) hand (e.g., 914) (e.g., the visual indicators are displayed as highlighting around at least a portion of the user's hand). Displaying a virtual highlighting effect at a location corresponding to a peripheral region of the set of one or more features of the user's hand provides feedback to the user of the computer system indicating a detected location of at least a portion of the set of one or more features of the user's hand. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, as part of displaying the set of one or more visual indicators (e.g., 920;930; 940) having the second display position in the augmented reality environment (e.g., 905), the computer system (e.g., 901) displays the set of one or more visual indicators (e.g., 940) moving from the first display position to the second display position (e.g., FIG. 9F). In some embodiments, as part of displaying the set of one or more visual indicators (e.g., 920;930; 940) having the second display position in the augmented reality environment (e.g., 905), the computer system displays a second set of one or more visual indicators (e.g., 940-1) (e.g., particles; particle effects; residual traces of indicators left after the set of one or more visual indicators move), the second set of one or more visual indicators following (e.g., dragging; moving along the same path behind the set of one or more visual indicators) the set of one or more visual indicators as the set of one or more visual indicators moves from the first display position to the second display position.
In some embodiments, as part of displaying the set of one or more visual indicators (e.g., 920;930; 940) having a third display position in the augmented reality environment (e.g., 905), the computer system (e.g., 901) displays the set of one or more visual indicators (e.g., 940) moving from the first display position to the third display position. In some embodiments, as part of displaying the set of one or more visual indicators having a third display position in the XR environment, the computer system displays a third set of one or more visual indicators (e.g., 940-1) (e.g., particles; particle effects; residual traces of indicators left after the set of one or more visual indicators move), the third set of one or more visual indicators following (e.g., dragging; moving along the same path behind the set of one or more visual indicators) the set of one or more visual indicators as the set of one or more visual indicators moves from the first display position to the third display position (e.g., FIG. 9F). The second or third set of one or more visual indicators following the set of one or more visual indicators is displayed as the set of one or more visual indicators moves from the first display position to the second or third display position, which provides feedback to a user of the computer system indicating the detected position and movement of the user's finger. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
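A non-limiting sketch of such a trailing effect is to keep a short history of recent indicator positions and render the older entries as fading particles behind the moving indicator. The Swift type, buffer length, and fade rule below are hypothetical assumptions.

```swift
// Hypothetical sketch: a residual trace of recent indicator positions rendered as
// trailing particles (e.g., 940-1) behind the moving indicator.
struct Trail {
    private(set) var points: [(x: Double, y: Double)] = []
    let maxLength: Int

    init(maxLength: Int = 12) { self.maxLength = maxLength }

    mutating func record(_ point: (x: Double, y: Double)) {
        points.append(point)
        if points.count > maxLength { points.removeFirst() }
    }

    // Older points fade out, giving the residual-trace appearance described above.
    func particleOpacities() -> [Double] {
        let n = points.count
        return (0..<n).map { Double($0 + 1) / Double(n) }
    }
}
```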
In some embodiments, the at least one feature of the hand (e.g., 914) of the user (e.g., 700) is the tip of a finger (e.g., 914-1;914-2;914-3;914-4; 914-5) of the user's hand. In some embodiments, displaying the set of one or more visual indicators (e.g., 920;930; 940) includes displaying a highlighting effect (e.g., 930) (e.g., a highlighted dot or bulb) in the augmented reality environment (e.g., 905) at a location corresponding to (e.g., at or near) the finger tip of the user's hand (e.g., the visual indicators are displayed as highlighted dots or bulbs located at the user's fingertips) (e.g., a plurality of fingers are detected, and the visual indicators are displayed as highlighted dots or bulbs located at each of the detected fingertips). Displaying the highlighting effect at a location corresponding to a finger tip of the user's hand provides feedback to the user of the computer system indicating the detected location of the user's fingertip. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, the user characteristic indicator interface (e.g., 906) is displayed in accordance with a determination that the computer system (e.g., 901) is ready to accept input based on the position and/or movement of the user's (e.g., 700) hand (e.g., 914) (e.g., the user's hand is in a position and/or orientation in which it can provide input to the electronic device) (e.g., the set of one or more features of the user's hand meets a set of indicator display criteria (e.g., the computer system displays the visual indicators only when the user is active with respect to their hand (e.g., when the user is looking at their hand, the user's hand is in a predefined pose, and/or the user's hand is in a predefined area of the field of view of the one or more sensors and/or display))). Displaying the one or more visual indicators in accordance with a determination that the device is ready to accept input based on the position and/or movement of the user's hand conserves computational resources by eliminating the need to perform calculations for tracking the user's hand and displaying the visual indicators unless the device is ready to accept such input. Reducing the computational workload enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user errors in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, in accordance with a determination that the user's (e.g., 700) hand (e.g., 914) is inactive (e.g., a determination that the user's hand does not meet movement criteria (e.g., a sufficient degree of movement) and/or a determination (e.g., prediction) that the user's gaze is not currently directed to the user's hand and/or the user's hand is not currently within the predicted user's field of view), the computer system (e.g., 901) ceases to display the visual indicator (e.g., 920;930; 940), or in some embodiments, ceases to display the user characteristic indicator interface (e.g., 906).
In some embodiments, the device (e.g., computer system; 901) is ready to accept input based on the position and/or movement of the user's (e.g., 700) hand (e.g., 914) when it is determined that the user is looking at the hand (e.g., the computer system determines and/or predicts that the user's gaze is directed to a determined location of the user's hand).
In some embodiments, the device (e.g., computer system; 901) is ready to accept input based on the position and/or movement of the user's (e.g., 700) hand (e.g., 914) when it is determined that the hand is performing at least one of a set of one or more predefined gestures (e.g., the computer system determines that the user's hand has a predefined pose (e.g., hand open, finger pointing, etc.)).
In some implementations, the feature indicator interface (e.g., 906) is displayed in accordance with a determination that a first set of display criteria is met (e.g., the first set of display criteria is met when the mixed reality display mode is enabled). In some implementations, in accordance with a determination that the second set of display criteria is met (e.g., the second set of display criteria is met when the virtual reality display mode is enabled), the computer system (e.g., 901) displays a virtual representation of the user's hand in the virtual reality environment (e.g., fully virtually displays the user's hand in the VR environment) via the display generating component (e.g., 902). Displaying a virtual representation of a user's hand in a virtual reality environment provides feedback to a user of a computer system regarding the detected position of the user's hand in the virtual environment. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, when the virtual environment is displayed, the display generating component (e.g., 902) is opaque and does not transmit light or video from the physical environment in at least a portion of the display generating component that is displaying the virtual representation of the hand. In some embodiments, the computer system (e.g., 901) foregoes displaying the set of one or more visual indicators (e.g., 920;930; 940) when the second set of display criteria is met. In some embodiments, the computer system continues to display the set of one or more visual indicators with the virtual representation of the user's hand when the second set of display criteria is met.
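As a non-limiting illustration, the choice between showing indicators over the real hand in a mixed-reality mode and showing a fully virtual hand in a virtual-reality mode could be sketched as follows; the Swift enumerations and the optional keep-indicators flag are hypothetical assumptions.

```swift
// Hypothetical sketch: choose how the user's hand is presented based on the display mode.
enum DisplayMode { case mixedReality, virtualReality }

enum HandPresentation {
    case indicatorsOnPassthrough   // e.g., highlighting 920 / indicators 930 over the real hand
    case virtualHand               // fully virtual representation of the hand
    case virtualHandWithIndicators // virtual hand with the indicators retained
}

func handPresentation(for mode: DisplayMode, keepIndicatorsInVR: Bool = false) -> HandPresentation {
    switch mode {
    case .mixedReality:
        return .indicatorsOnPassthrough
    case .virtualReality:
        // Some embodiments forgo the indicators in VR; others keep them with the virtual hand.
        return keepIndicatorsInVR ? .virtualHandWithIndicators : .virtualHand
    }
}
```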
In some implementations, a computer system (e.g., 901) communicates with an external computer system (e.g., 901 a) (e.g., an external computer system associated with a first user (e.g., being operated by the first user (e.g., a user who is conducting a communication session (e.g., augmented reality, virtual reality, and/or video conference) with a user of the computer system)). When the computer system (e.g., 901) displays a user characteristic indicator interface (e.g., 906) comprising the set of one or more visual indicators (e.g., 920;930; 940) via the display generation component (e.g., 902), the external computer system (e.g., 901 a) displays a virtual representation of the user's hand (e.g., 919-2) in an augmented reality environment (e.g., 915) (in some embodiments, the set of one or more visual indicators and/or image data (e.g., camera image data) of the hand is not displayed). Displaying a user characteristic indicator interface comprising the set of one or more visual indicators while the external computer system displays a virtual representation of the user's hand in an augmented reality environment provides feedback to the user of the external computer system indicating where the user's hand is and how they are moving, while also providing visual feedback to the user of the computer system as to the position and movement of their hand. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, a user (e.g., 700) of a computer system (e.g., 901) views an augmented reality environment (e.g., 905) with a visual indicator (e.g., 920;930; 940) positioned on a passthrough view of the user's hand (e.g., 914) (e.g., due to the transparent nature of the display; due to video passthrough of the user's hand), and other users viewing the augmented reality environment (e.g., 915) view a virtual representation of the user's hand (e.g., 919-2) (e.g., with or without a visual indicator, and do not display image data of the physical hand).
In some implementations, the computer system (e.g., 901) detects at least one gesture (e.g., fig. 9D) of a set of predefined gestures (e.g., a pointing gesture, a snap gesture, a pinch gesture, a grab gesture, a predefined movement of a user's hand and/or finger) via the one or more sensors (e.g., 904). In response to detecting the at least one gesture, the computer system modifies an appearance (e.g., increases brightness, changes a shape of the visual indicator, displays additional visual indicators and/or portions thereof, removes a displayed portion of the visual indicator) of the set of one or more visual indicators (e.g., 930-1; 930-2). Modifying the appearance of the set of one or more visual indicators in response to detecting the at least one gesture provides feedback to a user of the computer system indicating whether the gesture was recognized by the computer system. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, the modified appearance of the set of one or more visual indicators (e.g., 930-1; 930-2) is temporary. For example, a temporary increase in brightness of a visual indicator that indicates that a gesture was recognized by a computer system (e.g., 901). In some implementations, the computer system continues to move the visual indicator based on movement of the user's hand (e.g., 914) and also modifies the appearance of the visual indicator when one of the gestures is recognized.
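The behavior described above (a temporary appearance change acknowledging a recognized gesture, while the indicators continue to track the hand) can be sketched in plain Swift. The gesture names, types, and timing constant below are assumptions for illustration; the patent does not specify an API.

```swift
import Foundation

// Hypothetical types; not part of any system actually described in the patent.
enum PredefinedGesture { case point, snap, pinch, grab }

struct VisualIndicator {
    var position: SIMD3<Float>
    var brightness: Double        // 0.0 ... 1.0
}

final class HandIndicatorController {
    private(set) var indicators: [VisualIndicator]
    private let baseBrightness = 0.6
    private let recognizedBrightness = 1.0

    init(indicators: [VisualIndicator]) {
        self.indicators = indicators
    }

    // Indicators keep following the tracked hand features on every frame.
    func handFeaturesDidMove(to positions: [SIMD3<Float>]) {
        for (i, p) in positions.enumerated() where i < indicators.count {
            indicators[i].position = p
        }
    }

    // A recognized gesture temporarily brightens the indicators, then the
    // appearance reverts so the change reads as an acknowledgement.
    func didRecognize(_ gesture: PredefinedGesture) {
        for i in indicators.indices { indicators[i].brightness = recognizedBrightness }
        DispatchQueue.main.asyncAfter(deadline: .now() + 0.25) { [weak self] in
            guard let self = self else { return }
            for i in self.indicators.indices { self.indicators[i].brightness = self.baseBrightness }
        }
    }
}
```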
In some embodiments, as part of displaying the set of one or more visual indicators (e.g., 920;930; 940), the computer system (e.g., 901) displays (e.g., an opaque display; a non-transparent display; a display on which a video feed source of a user's hand is displayed and through which the physical environment is not visible due to the opaque nature of the display) via a display generating component (e.g., 902), the visual indicators on the video feed source (e.g., a passthrough video feed source) of the set of one or more features of the user's hand (e.g., 914) positioned in the physical environment. The set of one or more visual indicators displayed on the video feed source of the set of one or more features positioned on the user's hand in the physical environment provide feedback to the user of the computer system indicating the detected position of the set of one or more features of the user's hand relative to the video feed source of the set of one or more features of the hand. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
It is noted that the details of the process described above with reference to method 1000 (e.g., fig. 10) also apply in a similar manner to methods 800, 1100, 1300, and 1400 described herein. For example, methods 800, 1100, 1300, and/or 1400 optionally include one or more of the features of the various methods described above with reference to method 1000. For the sake of brevity, these details are not repeated hereinafter.
Fig. 11 is a flowchart illustrating an exemplary method 1100 for displaying objects having different visual characteristics in an XR environment, according to some embodiments. The method occurs at a computer system (e.g., 101;901 a) (e.g., a smartphone, a tablet, a head-mounted display generating component) in communication with a display generating component (e.g., 902 a) (e.g., a visual output device, a 3D display, a display having at least a transparent or translucent portion on which an image may be projected (e.g., a see-through display), a projector, a heads-up display, a display controller) and an external computer system (e.g., 901) associated with a first user (e.g., 700) (e.g., being operated by the first user (e.g., a user who is conducting a communication session (e.g., an augmented reality and/or video conference) with a user of the computer system).
The computer system (e.g., 901 a) displays (1102) a representation (e.g., 919) (e.g., an avatar; a virtual avatar (e.g., the avatar is a virtual representation of at least a portion of the first user)) of the first user (e.g., 700) (e.g., a user in a physical environment) in an augmented reality environment (e.g., 915) via a display generation component (e.g., 902 a). In some embodiments, the avatar is displayed in the augmented reality environment in place of the first user. The representation of the first user is displayed in an augmented reality environment (e.g., 915) with a first pose (e.g., physical position, orientation, gesture, etc.) and a shape based on the shape of at least a portion of the first user (e.g., avatar 919 has avatar hands 919-2 (e.g., with the same pose as the user's hands) based on the shape of the user's hands (e.g., 914)). The shape of the representation of the first user is visualized (e.g., visually represented) with a first set of visual characteristics (e.g., a rendered set of one or more visual parameters of the avatar; blur amount, opacity, color, visual smoothing, attenuation/density, resolution, etc.).
In some implementations, the representation of the first user (e.g., 919) is displayed as having a mode (e.g., virtual presence mode) in which the first user (e.g., 700) is represented in the augmented reality environment (e.g., 915) by a rendering (e.g., avatar) having personified features (e.g., head, arm, leg, hand, etc.) or as an animated character (e.g., human; cartoon character; anthropomorphic construct of a non-human character, such as a dog, robot, etc.). In some embodiments, the representation of the first user is displayed as having the same gesture as the first user. In some implementations, the representation of the first user is displayed as having a portion (e.g., 919-2) that has the same pose as a corresponding portion (e.g., 914) of the first user. In some implementations, the representation of the first user is an avatar (e.g., a virtual avatar) that changes pose in response to a detected change in pose of at least a portion of the first user in the physical environment. For example, the avatar is displayed in the augmented reality environment (e.g., 915) as an animated character simulating the detected movement of the first user in the physical environment.
The computer system (e.g., 901 a) receives (1104) first data (e.g., depth data, image data, sensor data (e.g., image data from a camera)) including data indicative of a change in a pose (e.g., physical position, orientation, gesture, movement, etc.) of a first user (e.g., 700) (e.g., a change in a pose of the first user in a physical environment). In some implementations, the data includes sensor data (e.g., image data from a camera (e.g., 904; 903), movement data from an accelerometer, location data from a GPS sensor, data from a proximity sensor, data from a wearable device (e.g., a watch; a headset device)). In some embodiments, the sensor may be connected to or integrated with a computer system (e.g., 901a; 901). In some embodiments, the sensor may be an external sensor (e.g., a sensor of a different computer system (e.g., another user's electronic device)).
In response to receiving the first data, the computer system (e.g., 901 a) updates (1106) an appearance of the representation (e.g., 919) of the first user in the augmented reality environment (e.g., 915) (e.g., based on at least a portion of the first data). Updating the appearance of the representation of the first user in the augmented reality environment includes displaying (1108) the items referenced in steps 1110 and 1112 of the method 1100 in the augmented reality environment in accordance with a determination that the first data includes an indication (e.g., data indicating a change in posture) that the first portion of the first user (e.g., 914) (e.g., a physical hand of the first user) is contacting (e.g., touching, holding, grasping, manipulating, interacting, etc.) the object (e.g., 910) (e.g., a physical object in the physical environment) (in some embodiments, the first portion of the first user was not previously determined to be contacting an object in the physical environment prior to receiving the first data).
At 1110, the computer system (e.g., 901 a) displays a representation (e.g., 919) of the first user having a second pose (e.g., the pose in fig. 9E) based on the pose (e.g., at least one of the magnitude or direction) change (e.g., fig. 9E) of the first user (e.g., the avatar's pose is updated by the magnitude and/or direction corresponding to the magnitude and/or direction of the first user's pose change). The shape of the representation (e.g., 919) of the first user is visualized with a first set of visual characteristics (e.g., as depicted in fig. 9E).
At 1112, the computer system (e.g., 901 a) displays a representation (e.g., 945; 948) of the object having a shape (e.g., three-dimensional shape) based on the shape of at least a portion of the object (e.g., 910) (e.g., the representation of the object has a shape similar to the shape of the physical object or a portion thereof). The shape of the representation of the object is visualized with a second set of visual characteristics (e.g., 947) that is different from the first set of visual characteristics. Displaying a representation of a first user having a second pose that varies based on the pose of the first user and that is visualized with a first set of visual characteristics, and displaying a representation of an object having a shape that is based on the shape of at least a portion of the object and that is visualized with a second set of visual characteristics that is different from the first set of visual characteristics, this providing feedback to a user of the computer system that the first user is contacting the object in the physical environment and that the object is separate from the first user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
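A minimal sketch of the branch in steps 1108-1112, using hypothetical Swift types (the patent names no implementation): the user's representation always follows the pose change and keeps its characteristic set, while a representation of the contacted object is drawn with a different characteristic set only when the received data indicates contact.

```swift
import simd

// Hypothetical rendering types for illustration only.
struct Pose { var position: SIMD3<Float>; var orientation: simd_quatf }
struct ObjectShape { var vertices: [SIMD3<Float>] }

struct VisualCharacteristics {
    var blur: Float
    var particleDensity: Float
    var smoothing: Float
    var pixelation: Float
}

protocol AvatarRenderer {
    func drawUserRepresentation(pose: Pose, style: VisualCharacteristics)
    func drawObjectRepresentation(shape: ObjectShape, style: VisualCharacteristics)
}

// One characteristic set for the user, a different one for contacted objects.
// The numeric values are placeholders.
let userStyle   = VisualCharacteristics(blur: 0.1, particleDensity: 0.9, smoothing: 0.8, pixelation: 0.1)
let objectStyle = VisualCharacteristics(blur: 0.4, particleDensity: 0.5, smoothing: 0.4, pixelation: 0.3)

func updateRepresentations(userPose: Pose,
                           contactedObjectShape: ObjectShape?,   // nil when no contact is indicated
                           renderer: AvatarRenderer) {
    // Step 1110: the user's representation adopts the updated pose.
    renderer.drawUserRepresentation(pose: userPose, style: userStyle)

    // Step 1112: the object's representation is drawn, with a distinct
    // characteristic set, only while contact is indicated; otherwise it is omitted.
    if let shape = contactedObjectShape {
        renderer.drawObjectRepresentation(shape: shape, style: objectStyle)
    }
}
```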
In some implementations, the physical object (e.g., 910) has a relative position with respect to a first portion of a first user (e.g., 914) in the physical environment, and the representation of the object (e.g., 945) is displayed in the augmented reality environment (e.g., 915) with the same relative position with respect to the representation of the first portion of the first user (e.g., 919-2) in the augmented reality environment.
In some embodiments, as part of updating the appearance of the representation (e.g., 919) of the first user in the augmented reality environment (e.g., 915) and in accordance with a determination that the first data does not include an indication that the first portion (e.g., 914) of the first user (e.g., 700) is contacting the object (e.g., 910) (e.g., the first user does not contact the object) (e.g., the user 700 does not hold the bottle in fig. 9D), the computer system (e.g., 901 a) displays the representation of the first user (e.g., 919) having the second pose (e.g., fig. 9D) in the augmented reality environment (e.g., 915) based on the pose change of the first user (e.g., 700). The shape of the representation (e.g., 919) of the first user is visualized with the first set of visual characteristics. The computer system also foregoes displaying a representation (e.g., 945) of the object having a shape based on the shape of at least a portion of the object and visualized with a second set of visual characteristics different from the first set of visual characteristics in the augmented reality environment (e.g., if the user does not contact the object, then the representation of the object is not displayed in the augmented reality environment (e.g., see fig. 9D)). Displaying a representation of the first user having the second pose based on the pose change of the first user, and forgoing display of a representation of the object having a shape based on at least a portion of the shape of the object and visualized with a second set of visual characteristics different from the first set of visual characteristics, provides feedback to the user of the computer system that the first user is not touching the object in the physical environment, and reduces computational effort by eliminating computations for rendering the representation of the object in the augmented reality environment. Providing improved feedback and reduced computational effort enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping a user provide appropriate input and reducing user errors in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the system by enabling the user to use the computer system more quickly and efficiently.
In some implementations, the first set of visual characteristics includes a first amount of blur (e.g., a first degree of blur or sharpness) of the shape of the representation (e.g., 919) of the first user, and the second set of visual characteristics (e.g., 947) includes a second amount of blur of the shape of the representation (e.g., 945) of the object that is different from (e.g., greater than; less than) the first amount of blur (e.g., the shape of the representation of the object is displayed with greater blur (less sharpness) or less blur (greater sharpness) than the shape of the representation of the first user. Displaying the shape of the representation of the first user visualized in a different amount of blur than the shape of the representation of the object provides feedback to the user of the computer system that the first user is contacting the object in the physical environment and that the object is separate (e.g., distinct) from the first user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, the first set of visual characteristics includes a first density of particles (e.g., an amount and/or size of spaces between particles) that make up a shape of the representation of the first user (e.g., 919), and the second set of visual characteristics (e.g., 947) includes a second density of particles that make up a shape of the representation of the object (e.g., 945) that is different from (e.g., greater than; less than) the first density (e.g., the shape of the representation of the object is displayed with a greater particle density (e.g., smaller and/or fewer gaps between particles) or a lesser particle density (e.g., larger and/or more gaps between particles) than the shape of the representation of the first user). Displaying the shape of the representation of the first user visualized at a different particle density than the shape of the representation of the object provides feedback to the user of the computer system that the first user is contacting the object in the physical environment and that the object is separate (e.g., distinct) from the first user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, the first set of visual characteristics includes a first visual smoothness amount (e.g., image smoothing) of a shape of the representation (e.g., 919) of the first user, and the second set of visual characteristics (e.g., 947) includes a second visual smoothness amount of a shape of the representation (e.g., 945) of the object that is different (e.g., greater than; less than) the first visual smoothness amount (e.g., the shape of the representation of the first user is displayed with greater visual smoothness (e.g., image smoothing) or less visual smoothness than the shape of the representation of the object). Displaying the shape of the representation of the first user visualized in a different visual smoothness amount than the shape of the representation of the object provides feedback to the user of the computer system that the first user is contacting the object in the physical environment and that the object is separate (e.g., distinct) from the first user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, the first set of visual characteristics includes a first amount of pixelation (e.g., resolution; size of particles comprising the shape of the representation of the first user) of the shape of the first user, and the second set of visual characteristics (e.g., 947) includes a second amount of pixelation of the shape of the representation of the object (e.g., 945) that is different from (e.g., greater than; less than) the first amount of pixelation (e.g., the shape of the representation of the first user is displayed in less pixelation (e.g., higher resolution) or greater pixelation (e.g., lower resolution) than the shape of the representation of the object). Displaying the shape of the representation of the first user visualized in a pixelated amount different from the shape of the representation of the object provides feedback to the user of the computer system that the first user is contacting the object in the physical environment and that the object is separate (e.g., distinct) from the first user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, the representation of the object (e.g., 948) is based at least in part on data (e.g., image data; model data) from a library of objects (e.g., libraries accessible by a computer system and/or an external computer system). In some embodiments, a computer system (e.g., 901 a) and/or an external computer system (e.g., 901) determines the identity of an object (e.g., 910) and determines matching data from the object library based on the identity of the object. Displaying representations of objects based on data from an object library provides feedback to a user of the computer system that the object is identified from the object library and displayed using data from the object library, and reduces computational effort by eliminating computations for rendering representations of objects based on other data (e.g., data detected by the computer system in real-time). Providing improved feedback and reduced computational effort enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping a user provide appropriate input and reducing user errors in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the system by enabling the user to use the computer system more quickly and efficiently.
In some implementations, the representation of the first user (e.g., 919-1) is based at least in part on data (e.g., image data) from a registration process (e.g., such as the registration process discussed with respect to fig. 7A-7H) of the first user (e.g., 700). In some implementations, the computer system (e.g., 901 a) displays a representation (e.g., 919) of the first user having an appearance based on image data from the registration process, but not image data from another source (e.g., an image library). Displaying a representation of the first user based on data from the first user's enrollment process enhances the user system experience by providing a more realistic look to the first user, enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user errors in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some embodiments, when the computer system (e.g., 901 a) is displaying the representation of the first user (e.g., 919) and the representation of the object (e.g., 945) via the display generation component (e.g., 902 a), the external computer system (e.g., 901) foregoes displaying the representation of the first user (e.g., 919) and the representation of the object (e.g., 945) (e.g., the first user sees a perspective view of their hands and physical objects, but not the representation of the first user's hands and the representation of the object).
In some implementations, when displaying a representation (e.g., 945) of an object having a first position (e.g., in FIG. 9E) and a representation (e.g., 919; 919-2) of a first user having a second gesture (e.g., in FIG. 9E), the computer system (e.g., 901 a) receives second data including data indicating movement of at least a first portion of the first user (e.g., 914). In response to receiving the second data, the computer system updates a display of a representation (e.g., 945; 948) of the object and a representation (e.g., 919; 919-2) of the first user in the augmented reality environment (e.g., 915). As part of updating the representation of the object and the display of the representation of the first user, the computer system displays the representation of the first user (e.g., 919) having a third gesture based on movement of at least the first portion of the user (e.g., gestures in fig. 9F) (e.g., the representation of the first user moves based on movement of the first user's hand) (e.g., the shape of the representation of the first user is visualized with a first set of visual characteristics). The computer system also displays a representation (e.g., 945; 948) of the object having the second position (e.g., the representation of the object moves with the first user's hand) based on movement of at least the first portion of the user (e.g., the position in FIG. 9F) (e.g., the shape of the object is visualized with the second set of visual characteristics). Displaying a representation of the first user having the third gesture based on movement of at least the first portion of the user and displaying a representation of the object having the second position provides feedback to the user of the computer system that the first user continues to contact the object and has moved the object to a different position. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
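One way to read the update described in this paragraph, as a hedged sketch with assumed types: while contact persists, the same hand-movement delta is applied to both the representation of the user's hand and the representation of the held object, so the two move together.

```swift
import simd

// Hypothetical state for the displayed representations.
struct SceneState {
    var handRepresentationPosition: SIMD3<Float>
    var heldObjectRepresentationPosition: SIMD3<Float>?   // nil when nothing is held
}

// Applies the second data indicating movement of the first portion of the user.
func applyHandMovement(delta: SIMD3<Float>, to state: inout SceneState) {
    state.handRepresentationPosition += delta                // third pose of the user's representation
    if state.heldObjectRepresentationPosition != nil {
        state.heldObjectRepresentationPosition! += delta     // second position of the object's representation
    }
}
```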
It is noted that the details of the process described above with reference to method 1100 (e.g., fig. 11) also apply in a similar manner to methods 800, 1000, 1300, and 1400 described herein. For example, methods 800, 1000, 1300, and/or 1400 optionally include one or more of the features of the various methods described above with reference to method 1100. For the sake of brevity, these details are not repeated hereinafter.
Fig. 12A-12E, 13A-13B, and 14 depict examples of various presentation modes associated with users represented in an XR environment.
Fig. 12A depicts a physical environment 1200 that includes a user 700 standing in front of a device 901 (at least partially within the field of view of a camera 904), with a head 700-3 facing forward, with a hand 700-2 raised, while participating in an XR session with a second user in a manner similar to that discussed above with respect to fig. 9A-9F. Device 901 displays an XR interface 1206 similar to XR interface 906 via display 902. XR interface 1206 includes an XR environment 1205 (similar to XR environment 905) and a control option 1207 (similar to control option 907). As shown in FIG. 12A, XR environment 1205 currently includes an avatar 1222 (similar to avatar 922) that represents the presence of the second user in the XR environment.
Fig. 12A also depicts device 901a displaying an XR interface 1216 similar to XR interface 916 via display 902A. XR interface 1216 includes a preview 1218 with a rendering 1218-1 of a second user located within the field of view of camera 903 a. The XR interface 1216 also shows an XR environment 1215 (similar to XR environment 915) and control options 1217. In the embodiment depicted in FIG. 12A, XR environment 1215 currently includes a representation of user 700 in the form of avatar 1220 (similar to avatar 919). Devices 901 and 901a display XR interfaces 1206 and 1216, respectively, in a manner similar to that described above with respect to fig. 9A-9F. For brevity, these details are not repeated below.
In the embodiment depicted in FIG. 12A, avatar 1220 includes portions 1220-1, 1220-2, 1220-3, and 1220-4 that are displayed as a virtual representation of user 700. Portion 1220-1 forms the left forearm and hand of the avatar and has an appearance (e.g., position, pose, orientation, color, movement, etc.) determined based on various aspects (e.g., position, pose, orientation, color, shape, etc.) of the user's left forearm and hand 700-1, e.g., detected by camera 904 of device 901. Similarly, portion 1220-2 forms the right forearm and hand of the avatar and has an appearance determined based on various aspects of the user's right forearm and hand 700-2. Portion 1220-3 forms the head and shoulder regions of the avatar and has an appearance determined based on various aspects of the user's head 700-3 and shoulders. Portion 1220-4 forms the remainder of avatar 1220 and has a visual appearance that is different from the visual appearance of portions 1220-1 through 1220-3. For example, as shown in FIG. 12A, portions 1220-4 have an appearance formed by elements 1225 having various colors and optional shapes (optionally different shapes; optionally overlapping or non-overlapping), while portions 1220-1 through 1220-3 have an appearance that visually represents (e.g., resembles a shape having one or more character features) a corresponding portion of user 700. For example, portion 1220-2 has the same shape and pose as the corresponding portion of user 700 (e.g., user's right forearm and hand 700-2). In some embodiments, portions 1220-4 (or sub-portions thereof) have an amorphous shape formed by element 1225. In some embodiments, portion 1220-4 has the shape of one or more character features, such as a torso, elbows, legs, etc. In some embodiments, element 1225 (or a subset thereof) creates a visual effect (e.g., a defocusing effect) shaped to form one or more character features. In some embodiments, the color of element 1225 in portion 1220-4 corresponds to the color of clothing worn by user 700 in physical environment 1200. In some embodiments, the color of element 1225 in portion 1220-4 is automatically selected by device 901 and/or device 901 a. For example, in some embodiments, the color of element 1225 in portion 1220-4 is selected to match the color of a garment (e.g., shirt 709) worn by user 700 during the registration process described above with respect to fig. 7A-7H. In some embodiments, the color of element 1225 in portion 1220-4 is selected to have a warm palette, while the color of device 901 or other aspects of device 901a (e.g., the representation of a system element) is selected to have a cold palette. In some embodiments, portion 1220-4 is not displayed. In some embodiments, only portions of portion 1220-4 are displayed, such as a subset of elements 1225 immediately adjacent to portions 1220-1, 1220-2, and/or 1220-3. In some embodiments, portion 1220-4 represents the following portions of avatar 1220: for that portion, the appearance (e.g., gesture) of the corresponding portion of user 700 is unknown, undetected, or insufficient data (or less than a threshold amount of data) to determine the appearance.
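The split between portions 1220-1 through 1220-3 and portion 1220-4 can be illustrated with a sketch: portions whose corresponding body parts are detected with enough data are rendered from that data, while the remainder is rendered as colored elements. All names and the confidence threshold are assumptions for illustration only; the patent only states that insufficient data leads to the element-based appearance.

```swift
// Hypothetical per-portion tracking data.
struct TrackedPortion {
    let name: String            // e.g. "leftForearmAndHand"
    let confidence: Double      // how much pose data is available, 0...1
    let color: String           // sampled from the user (e.g. shirt color) or auto-selected
}

enum RenderedPortion {
    case bodyShaped(name: String)            // drawn with the user's shape and pose
    case elementCloud(color: String)         // drawn as variously colored elements
}

let dataThreshold = 0.5   // assumed cutoff; the patent only says "insufficient data"

func renderPortions(_ portions: [TrackedPortion]) -> [RenderedPortion] {
    portions.map { portion -> RenderedPortion in
        portion.confidence >= dataThreshold
            ? .bodyShaped(name: portion.name)
            : .elementCloud(color: portion.color)
    }
}
```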
As indicated above, device 901a displays a representation of user 700 (e.g., avatar 1220) in XR environment 1215 based at least in part on various appearance settings that indicate aspects of the appearance of the representation of user 700. For reference, these appearance settings are depicted in an appearance settings interface 1204 that is shown as being displayed by device 701 (e.g., using display 702). Appearance settings interface 1204 includes various appearance settings similar to those depicted in fig. 7H for controlling the appearance of the representation of user 700 in an XR environment. For example, appearance settings interface 1204 includes representation options 1224 (similar to representation options 724) and eyewear options 1226 (similar to eyewear options 726). As shown in fig. 12A, avatar option 1224b and glasses-free option 1226a are selected. Thus, the representation of user 700 has the appearance of an avatar that does not include glasses, as shown by avatar 1220 displayed on device 901 a.
In fig. 12B, user 700 has rotated their head 700-3 and lowered their right arm, placed hand 700-2 on their side, and selected rectangular box option 1226B in appearance settings interface 1204. Thus, device 901a updates XR interface 1216 based on the gesture change of user 700 and the updated appearance settings to display avatar 1220 with an updated appearance. In particular, device 901a displays avatar 1220 with portion 1220-2 lowered, as depicted in FIG. 12B, and portion 1220-3 updated to show the head-turning sides of the avatar and glasses 1221 displayed on the face of the avatar.
In some implementations, portions of the avatar 1220 change shape based on changes in the pose of the user 700. For example, the portion 1220-2 shown in FIG. 12B is displayed with the hand relaxed, changing the displayed shape (e.g., contour; geometry; outline) of the portion 1220-2 as compared to the shape of the portion 1220-2 when the fingers are extended in the raised position, as shown in FIG. 12A. In some embodiments, as user 700 moves, portions of the user move in or out of the field of view of camera 904, such that different portions of the user are detected (e.g., by device 901), and avatar 1220 is updated accordingly. For example, in FIG. 12B, more of the user's right forearm is visible to camera 904, and thus the shape of portion 1220-2 changes because more of the avatar's right forearm is represented in portion 1220-2 (and less of portion 1220-4 is displayed, because some of the elements 1225 previously included in portion 1220-4 are no longer displayed, and the corresponding portion of avatar 1220 is now included in the forearm of portion 1220-2).
In some embodiments, the eyeglasses 1221 include a frame portion, but do not include arms or temple pieces, as shown in fig. 12B. In some embodiments, the glasses 1221 have an appearance corresponding to the selected glasses option. For example, in fig. 12B, the glasses 1221 are rectangular frames having the same appearance as the glasses depicted in the rectangular frame option 1226B. In some embodiments, the glasses 1221 have a default appearance that is not based on the appearance of the user glasses 707. In some embodiments, the eyeglasses 1221 have an appearance that corresponds to eyeglasses 707 detected on the user 700. In some embodiments, the eyewear option is automatically selected (e.g., by device 701, device 901, and/or device 901 a). For example, the device 901 detects glasses 707 on the face of the user, and in response, changes the appearance settings to select the rectangular frame option 1226b. In some embodiments, rectangular frame option 1226b is selected because it most accurately depicts the appearance of glasses 707 on user 700. In some embodiments, rectangular frame option 1226b is manually selected by user 700. In some implementations, in response to detecting the glasses 707 on the user's face during at least a portion of the enrollment process, the display of the avatar glasses (e.g., 1221) is automatically enabled (and one of the glasses options is selected) (e.g., by device 701).
In fig. 12C, the user 700 remains stationary while speaking and selects the semi-transparent frame option 1226C in the appearance settings interface 1204. Thus, device 901a updates XR interface 1216 based on the updated appearance settings to display avatar 1220 with an updated appearance. Specifically, the device 901a displays the avatar 1220 with glasses 1221 having the appearance of being updated to a translucent frame, as displayed on the avatar's face in fig. 12C. Because user 700 is not moving, device 901a does not change the shape of portions 1220-1 through 1220-4. However, the user 700 is speaking, so the device 901a displays that the avatar's mouth is moving without changing the shape of portions of the avatar 1220. In addition, device 901a does not change the appearance of any portion of avatar 1220 (including portion 1220-4 and element 1225) in response to audio detected (e.g., by device 901 and/or device 901 a) from speech of user 700.
In fig. 12D, audio option 1224a is selected and device 901a updates XR interface 1216 to display that the representation of user 700 is transitioning from an avatar representation to an audio representation. In fig. 12D, the transition is depicted as an animation, where portions 1220-1 through 1220-3 are replaced or overlaid by element 1225, and when the user is in audio representation mode, element 1225 begins to move around to change the shape of the representation of user 700 to a two-dimensional or three-dimensional shape (e.g., cube, sphere, or spheroid) representing user 700 in XR environment 1215. During the transition, elements 1225 move together while some elements begin to overlap and others disappear as the represented shape transitions to the cube shape depicted in fig. 12E.
While user 700 is participating in an XR session in an audio presentation mode, audio from user 700 is transferred to the devices of other users participating in the XR session (e.g., device 901a of the second user), and the representation of user 700 is displayed as an audio representation that does not change shape in response to movement of user 700. For example, as the user 700 moves (e.g., walks, lifts the hand 700-2 and/or rotates the head 700-3), the audio representation remains the same geometry. In some implementations, device 901a displays an audio representation of movement around XR environment 1215 based on movement of user 700 in physical environment 1200. For example, as user 700 walks around physical environment 1200, device 901a optionally similarly displays an audio representation (e.g., audio representation 1230-1) of the movement (e.g., changing position) in XR environment 1215. Various examples of audio representations of user 700 are depicted in fig. 12E, each associated with a different set of conditions detected in the physical environment. Each example of these audio representations is shown as a cube. However, it should be understood that the audio representation may have different forms, such as spheres, spheroids, amorphous three-dimensional shapes, and the like.
In the implementations described herein, various features of the audio representation of user 700 in XR environment 1215 are described with reference to a particular audio representation, such as audio representation 1230-1. However, it should be understood that reference to a particular audio representation is not intended to limit the features described to that particular audio representation. Thus, the various features described with respect to a particular audio representation may be similarly applied to other audio representations described herein (e.g., audio representations 1230-2 through 1230-4). For brevity, these details are not repeated herein.
In some embodiments, the audio representation 1230-1 is formed from a collection of particles 1235 having different sizes and colors. In some embodiments, particles 1235 are similar to elements 1225. In some embodiments, the color of the particles 1235 corresponds to the color of clothing worn by the user 700 and/or the skin tone of the user 700 in the physical environment 1200. In some embodiments, the color of particles 1235 is automatically selected by device 901 and/or device 901 a. For example, in some embodiments, the color is selected to match the color of a garment (e.g., shirt 709) worn by user 700 during the registration process described above with respect to fig. 7A-7H. In some embodiments, the color of particles 1235 is selected to have a warm palette, while the color of device 901 or other aspects of device 901a (e.g., the representation of a system element), such as a virtual assistant, is selected to have a cool palette. In some embodiments, the particles 1235 can have different forms, such as rectangular, square, circular, spherical, and the like.
In some embodiments, the particles 1235 move along the surface of the audio representation 1230-1, thereby changing size and optionally changing shape. For example, in some embodiments, the particles 1235 change position and size as part of the progressive animation of the audio representation 1230-1. In this way, the audio representation 1230-1 changes appearance over time regardless of whether the user 700 is speaking. In some embodiments, the audio representations 1230-1, 1230-2, 1230-3, and 1230-4 represent different appearances of a single audio representation as depicted at different points in time, and the particles 1235 forming the audio representation have different positions, sizes, and colors, thereby showing the appearance of the audio representation over time due to animation.
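A sketch of the progressive animation described above, under the assumption that each particle's position and size drift over time independent of whether the user is speaking; the drift function and numeric ranges are illustrative only.

```swift
import Foundation

struct Particle {
    var angle: Double       // position along the surface of the audio representation
    var size: Double
    var driftSpeed: Double
}

// Advances the animation by one frame; called continuously, not only while
// audio from the user is detected.
func advance(_ particles: inout [Particle], deltaTime: Double) {
    for i in particles.indices {
        particles[i].angle += particles[i].driftSpeed * deltaTime
        // The size oscillates gently so the representation's appearance keeps changing over time.
        particles[i].size = 0.02 + 0.01 * sin(particles[i].angle)
    }
}
```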
As described above, the audio representations depicted in fig. 12E correspond to different appearances of the audio representations based on the determined locations and/or behaviors of the user 700 in the physical environment 1200 at different times. For example, when user 700 is at location 1200-1, device 901a displays an XR interface 1216 in which audio representation 1230-1 represents user 700 in XR environment 1215, facing away from the camera (e.g., camera 904) of device 901, as shown in FIG. 12E. Similarly, when user 700 is at location 1200-2, device 901a displays audio representation 1230-2. When user 700 is at location 1200-3, device 901a displays audio representation 1230-3. When user 700 is at location 1200-4, device 901a displays audio representation 1230-4. In some implementations, different locations (e.g., 1200-1 to 1200-4) in the physical environment 1200 correspond to different depths from the camera of the device 901. For example, position 1200-2 represents a greater distance from the camera than position 1200-4, and is therefore depicted in fig. 12E as having a smaller size. In some embodiments, positions 1200-1, 1200-3, and 1200-4 all have similar distances from the camera.
The audio representation 1230-1 includes an element 1232, which is a two-dimensional (or substantially two-dimensional) feature that associates the audio representation with the user 700. For example, in fig. 12E, element 1232 is a letter combination that includes the initials of user 700. In some embodiments, the element 1232 may include, instead of or in addition to the initials, the user's first and/or last name or other identifying information, such as the user's telephone number, email address, user name, and the like.
In some embodiments, multiple users may participate in an XR session, and for each user participating in the XR session, audio representation 1230-1 appears to face the respective user, regardless of whether user 700 is actually facing the respective user in a physical environment or an XR environment. For example, in FIG. 12E, although user 700 is facing away from the second user, device 901a displays an audio representation 1230-1 (including element 1232) that is facing the second user in XR environment 1215, giving the second user the appearance that user 700 is facing the second user, thereby interacting with and/or communicating with them in XR environment 1215 via audio representation 1230-1. In some embodiments, additional users, such as third (or fourth, fifth, sixth, etc.) users are also participating in an XR session with user 700 and the second user. For each of these additional users, the audio representation of user 700 has the same appearance as audio representation 1230-1, such that the audio representation (including element 1232) appears to face that particular user. In some embodiments, audio representation 1230-1 appears to face the corresponding user even when the user moves (changes orientation) around the XR environment.
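The "always facing each viewer" behavior is essentially per-viewer billboarding. A sketch under assumed types: each participant's device orients the audio representation (and its identifying element, such as the initials of element 1232) toward that participant's own viewpoint, regardless of the represented user's actual orientation.

```swift
import Foundation
import simd

// Computes an orientation that turns the audio representation toward a viewer.
// Yaw-only billboarding is assumed here so the identifying element stays upright.
func billboardOrientation(representationPosition: SIMD3<Float>,
                          viewerPosition: SIMD3<Float>) -> simd_quatf {
    let toViewer = viewerPosition - representationPosition
    let yaw = atan2f(toViewer.x, toViewer.z)          // angle around the vertical axis
    return simd_quatf(angle: yaw, axis: SIMD3<Float>(0, 1, 0))
}

// Each viewer's device evaluates this with its own viewpoint, so every
// participant sees the representation (including element 1232) facing them.
```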
In some embodiments, device 901a displays audio representation 1230-1 at a location in XR environment 1215 that corresponds to the location of user's head 700-3 in physical environment 1200 (e.g., 1200-1) and/or the location at which the avatar's head would be displayed if the user were represented by avatar 1220 in XR environment 1215. By displaying the audio representation 1230-1 at the position of the user and/or avatar's head, the audio representation 1230-1 remains aligned with the line of sight of the user 700 such that the second user appears (from the perspective of the user 700) to remain in eye contact when the second user is looking at the audio representation 1230-1. In some implementations, device 901a displays audio representation 1230-1 at a location in XR environment 1215 that corresponds to a perceived or determined spatial location of an audio source in XR environment 1215 (e.g., corresponding to audio from user 700).
In some implementations, various attributes of the element 1232 are used to indicate information about the location and/or position of the user 700 within the physical environment 1200 or XR environment 1205. For example, in some implementations, the size of element 1232 is used to convey the distance of user 700 from the camera of device 901. For example, when the user 700 is at the location 1200-2, the device 901a displays an audio representation 1230-2 having the same size as the audio representation 1230-1, but wherein the element 1232 has a smaller size (when compared to the size of the element 1232 in the audio representation 1230-1) to convey a greater distance of the user 700 from the camera. Thus, as user 700 walks away from the camera from location 1200-1 to location 1200-2, device 901a displays that the audio representation moves from the location of audio representation 1230-1 to the location of audio representation 1230-2, wherein element 1232 shrinks in size as user 700 moves away from the camera. Conversely, a larger size of the element 1232 in the audio representation 1230-1 indicates that the user 700 is closer to the camera when he is at the position 1200-1. In some implementations, the device 901a modifies the displayed size of the entire audio representation (including the element 1232) to indicate a change in distance of the user 700 from the camera.
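A sketch of the distance cue described here, with assumed scaling constants: the overall representation keeps its size while the size of element 1232 is scaled down as the user's distance from the camera grows (an alternative, also described above, scales the whole representation instead).

```swift
// Maps the user's distance from the camera to a display scale for element 1232.
// The reference distance and clamping range are illustrative assumptions.
func elementScale(forDistance distance: Float,
                  referenceDistance: Float = 1.0) -> Float {
    guard distance > 0 else { return 1.0 }
    let scale = referenceDistance / distance     // farther away -> smaller element
    return min(max(scale, 0.25), 1.0)            // clamp so the initials stay legible
}
```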
In some implementations, the device 901a modifies the audio representation in response to detecting audio from the user 700 (when the user 700 is speaking). In some embodiments, the modification includes a change in the size, brightness, or other visual characteristic of the displayed audio representation. For example, in FIG. 12E, audio representation 1230-3 represents a larger size audio representation that is temporarily displayed in response to detecting that user 700 is speaking at location 1200-3. In some implementations, the audio representation expands and contracts in synchronization with detected changes in audio characteristics of the user 700 speaking (e.g., changes in tone, pitch, volume, etc.). In some implementations, device 901a changes other visual characteristics of the audio representation 1230-3 in response to the audio, such as pulsing the brightness of the displayed audio representation. In some embodiments, in response to detecting that the user 700 speaks, the device 901a modifies the visual characteristics of the audio representation 1230-3 but does not modify the visual characteristics of the avatar 1220, as discussed above.
In some embodiments, when the audio of user 700 (e.g., as detected at device 901) is muted, device 901a modifies the appearance of the audio representation. For example, in FIG. 12E, device 901a displays an audio representation 1230-4 with mute icon 1240 to indicate that the audio of user 700 was muted while user 700 was at location 1200-4.
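A combined sketch of the speech- and mute-related behaviors described in the two preceding paragraphs (the sizes, threshold, and normalized audio-level scale are assumptions): the representation expands and brightens with the detected audio level while the user speaks, and shows a mute indicator when the user's audio is muted.

```swift
struct AudioRepresentationAppearance {
    var scale: Float = 1.0
    var brightness: Float = 1.0
    var showsMuteIcon: Bool = false
}

// audioLevel is assumed to be a normalized 0...1 level of the user's detected speech.
func updateAppearance(audioLevel: Float,
                      isMuted: Bool) -> AudioRepresentationAppearance {
    var appearance = AudioRepresentationAppearance()
    appearance.showsMuteIcon = isMuted
    if !isMuted && audioLevel > 0.05 {
        // Expand and brighten in sync with the detected audio characteristics.
        appearance.scale = 1.0 + 0.3 * audioLevel
        appearance.brightness = 1.0 + 0.5 * audioLevel
    }
    return appearance
}
```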
In some embodiments, when an avatar representation (e.g., avatar 1220) is not available, device 901a displays the representation of user 700 as an audio representation (e.g., audio representation 1230-1). In some implementations, the avatar representation is not available if conditions are insufficient to present the avatar in the XR environment (e.g., poor lighting in the environment 1200 and/or the environment of the second user) or if there is insufficient data to render the avatar representation. In some embodiments, if the user 700 does not perform the registration operations described above with respect to fig. 7A-7H, there is insufficient data to render the avatar representation.
For additional description of fig. 12A-12E, see methods 1300 and 1400 described below with respect to fig. 13A-13B and 14.
Fig. 13A-13B are flowcharts of an exemplary method 1300 for switching between different presentation modes associated with users represented in an XR environment, according to some embodiments. The method 1300 occurs at a computer system (e.g., 101;901 a) (e.g., a smartphone, a tablet, a head-mounted display generating component) in communication with a display generating component (e.g., 902 a) (e.g., a visual output device, a 3D display, a display having at least a transparent or translucent portion on which an image may be projected (e.g., a see-through display), a projector, a heads-up display, a display controller) and an external computer system (e.g., 901) associated with a first user (e.g., 700) (e.g., being operated by the first user (e.g., a user who is conducting a communication session (e.g., an augmented reality and/or video conference) with a user of the computer system) (e.g., a second user)).
The computer system (e.g., 901 a) displays (1302), via the display generation component (e.g., 902 a), a communication user interface (e.g., 1216) including a representation (e.g., 1220) of the first user (e.g., an animated representation; an avatar representation; a virtual avatar (e.g., a virtual representation of at least a portion of the first user)) in a first presentation mode (e.g., indicated by 1224 b), wherein the first user is represented in an augmented reality environment by a rendering having human or personified features (e.g., head, arm, leg, hand, etc.) or as an animated character (e.g., human; cartoon character; personified formations of non-human characters such as dogs, robots, etc.). In some embodiments, the virtual avatar is displayed in the augmented reality environment in place of the first user. In some embodiments, the representation of the first user is displayed as having the same gesture as the first user. In some embodiments, the representation of the first user is displayed as a portion (e.g., 1220-1;1220-2; 1220-3) having the same gesture as the corresponding portion (e.g., 700-1;700-2; 700-3) of the first user.
The communication user interface (e.g., 1216) displays (1304) the representation of the first user (e.g., 1220) in an augmented reality environment (e.g., 1215) (e.g., the computer system (e.g., 901 a) displays the communication user interface with the representation of the first user in the augmented reality environment). While in the first presentation mode (e.g., 1224 b), the representation of the first user (e.g., 1220) is displayed (1306) (e.g., by the computer system, via the display generation component (e.g., 902 a)) as visually reacting to changes in movement of a first portion (e.g., 700-1;700-2; 700-3) (e.g., a portion of the first user's hand, such as a palm or finger) of the first user (e.g., 700) detected by the external computer system (e.g., 901) (e.g., the representation of the user visually reacts in response to movement of the user's hand detected in the physical environment and/or the augmented reality environment).
When a computer system (e.g., 901 a) displays (1308) a representation of a first user (e.g., 1220) in a first presentation mode (e.g., 1224 b), the computer system receives (1310) first data (e.g., depth data, image data, sensor data (e.g., image data from a camera)) from an external computer system (e.g., 901; 904) indicating movement of a first portion (e.g., 700-1;700-2; 700-3) of the first user (e.g., 700). In some implementations, the first data includes sensor data (e.g., image data from a camera (e.g., 904), movement data from an accelerometer, location data from a GPS sensor, data from a proximity sensor, data from a wearable device (e.g., watch, headset)).
When the computer system (e.g., 901 a) displays (1308) the representation of the first user (e.g., 1220) in the first presentation mode (e.g., 1224 b), and in response to receiving the first data, the computer system modifies (1312) a shape of the representation of the first user based on movement (e.g., magnitude and/or direction of movement) of the first portion (e.g., 700-1;700-2; 700-3) of the first user (e.g., 700) (e.g., displays a greater or lesser amount of the representation (e.g., avatar) of the first user; changes a shape of a portion (e.g., 1220-2; 1220-3) of the representation of the user; changes a geometry of a portion of the representation of the user; changes a contour and/or shape of an appearance of the representation of the user) (e.g., see FIG. 12B).
After modifying the shape of the representation of the first user (e.g., 1220), the computer system (e.g., 901 a) receives (1314) second data (e.g., from an external computer system (e.g., 901); via input at the computer system (e.g., 701)), the second data indicating that the representation of the first user is to be displayed in a second presentation mode (e.g., indicated by 1224 a) (e.g., an audio presentation mode; wherein the first user is represented in an augmented reality environment by a rendering (e.g., 1230-1;1230-2;1230-3; 1230-4) (e.g., icon, letter combination) that does not have anthropomorphic features and/or is an inanimate object, the second presentation mode being different from the first presentation mode. In some implementations, the computer system receives (e.g., from an external computer system) an indication that the first user has transitioned its representation from being in the first presentation mode to being in the second presentation mode.
In response to receiving the second data, the computer system (e.g., 901 a) displays (1316) a representation (e.g., 1230-1;1230-2;1230-3; 1230-4) of the first user in a second presentation mode via a display generation component (e.g., 902 a), wherein the representation of the first user has a shape (e.g., appearance; geometry (e.g., disk or sphere; cube; rectangular prism)) that is visually unreactive to movement changes of the first portion (e.g., 700-1;700-2; 700-3) of the first user (e.g., 700) detected by the external computer system (e.g., 901) when in the second presentation mode (e.g., the representation of the user does not visually react in response to movement of the user's hand detected in the physical environment and/or the augmented reality environment when in the second presentation mode).
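A sketch of the mode-dependent behavior in steps 1306-1316, under assumed types: in the first (avatar) presentation mode, hand-movement data modifies the displayed shape; in the second (audio) presentation mode, the same data leaves the shape untouched.

```swift
enum PresentationMode { case avatar, audio }   // first and second presentation modes

struct UserRepresentation {
    var mode: PresentationMode
    var shapeOffsets: [SIMD3<Float>]   // a stand-in for the displayed shape
}

// First data: movement of the first portion (e.g., the user's hand).
func apply(handMovement: SIMD3<Float>, to representation: inout UserRepresentation) {
    switch representation.mode {
    case .avatar:
        // The shape visually reacts to the movement (step 1312).
        for i in representation.shapeOffsets.indices {
            representation.shapeOffsets[i] += handMovement
        }
    case .audio:
        // The shape is visually unreactive to the hand movement (step 1316);
        // only the representation's position in the environment follows the user.
        break
    }
}
```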
When the computer system (e.g., 901 a) displays (1318) a representation (e.g., 1230-1;1230-2;1230-3; 1230-4) of the first user in the second presentation mode, the computer system receives (1320) third data (e.g., from an external computer system (e.g., 901), from a sensor that detects movement or positioning, via input at the computer system) indicating movement of the first user (e.g., 700) from a first location (e.g., 1200-1;1200-2;1200-3; 1200-4) in a physical environment (e.g., in the physical environment of the first user) to a second location (e.g., 1200-1;1200-2;1200-3; 1200-4) in the physical environment that is different from the first location in the physical environment. In some implementations, the third data includes sensor data (e.g., image data from a camera, movement data from an accelerometer, position data from a GPS sensor, data from a proximity sensor, data from a wearable device (e.g., a watch, a headset device). In some embodiments, the sensor may be connected to or integrated with a computer system. In some embodiments, the sensor may be an external sensor (e.g., a sensor of a different computer system (e.g., an external computer system)).
When the computer system (e.g., 901 a) displays (1318) the representation of the first user (e.g., 1230-1;1230-2;1230-3; 1230-4) in the second presentation mode, and in response to receiving the third data, the computer system displays (1322) the representation of the first user moving from a first location in the augmented reality environment (e.g., 1215) (e.g., the location of 1230-1 in FIG. 12E) to a second location in the augmented reality environment (e.g., the location of 1230-2 in FIG. 12E) that is different from the first location in the augmented reality environment. Displaying the representation of the first user moving from a first location in the augmented reality environment to a second location in the augmented reality environment in response to receiving the third data provides feedback to a user of the computer system that the first user is moving about their physical environment and that the movement about the physical environment corresponds to the movement of the representation of the first user in the augmented reality environment. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, a first location (e.g., the location 1230-1 in fig. 12E) in the augmented reality environment (e.g., 1215) represents a first location (e.g., 1200-1) of the first user (e.g., 700) in the physical environment (e.g., 1200) of the first user, and a second location (e.g., the location 1230-2 in fig. 12E) in the augmented reality environment represents a second location (e.g., the location 1200-2) of the first user (e.g., the representation of the first user moves around the augmented reality environment to represent physical movement of the first user around the physical environment of the first user).
In some implementations, when in the second presentation mode (e.g., 1224 a), the representation of the first user (e.g., 1230-2; 1232) (e.g., a portion of the representation of the first user (e.g., 1232)) is displayed to change size to indicate a relative position of the representation of the first user with respect to a user of the computer system (e.g., 901 a) (e.g., the second user) as the representation of the first user moves toward or away from the user of the computer system. For example, as the representation of the first user moves away from the user of the computer system, the representation of the first user is displayed as a shrink in size. Conversely, as the representation of the first user moves closer to the user of the computer system, the representation of the first user is displayed as increasing in size.
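For illustration only, the position-mirroring and distance-based sizing behavior described above might be sketched as follows; this code is not part of the disclosed embodiments, and all names (Point3D, CoarseRepresentation, updateCoarseRepresentation), the reference distance, and the clamping values are hypothetical assumptions:

```swift
// All types and names below are hypothetical; the patent does not define a data model.
struct Point3D { var x: Double; var y: Double; var z: Double }

struct CoarseRepresentation {
    var position: Point3D      // position in the XR environment
    var displayScale: Double   // 1.0 == nominal size
}

// Map a tracked physical location into the XR environment and scale the
// second-mode representation with its distance to the viewer, so it grows
// as the first user approaches and shrinks as the first user moves away.
func updateCoarseRepresentation(
    _ rep: inout CoarseRepresentation,
    trackedPhysicalLocation: Point3D,
    physicalToXR: (Point3D) -> Point3D,   // calibration/mapping supplied elsewhere
    viewerPosition: Point3D,
    referenceDistance: Double = 1.5       // distance at which scale == 1.0 (assumed)
) {
    rep.position = physicalToXR(trackedPhysicalLocation)
    let dx = rep.position.x - viewerPosition.x
    let dy = rep.position.y - viewerPosition.y
    let dz = rep.position.z - viewerPosition.z
    let distance = (dx * dx + dy * dy + dz * dz).squareRoot()
    // Clamp so the representation never vanishes or dominates the view.
    rep.displayScale = min(2.0, max(0.25, referenceDistance / max(distance, 0.01)))
}
```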
In some implementations, the first portion (e.g., 700-1; 700-2) of the first user (e.g., 700) includes at least a portion of the first user's hand (e.g., is the user's hand; is detected and/or recognized by at least the external computer system (e.g., 901) as at least a portion of the user's hand).
In some implementations, in response to receiving the second data, the computer system (e.g., 901 a) displays an animation (e.g., sequential graphics transition) of the representation of the first user (e.g., 700) transitioning from the first presentation mode (e.g., 1224 b) to the second presentation mode (e.g., 1224 a) via the display generation component (e.g., 902 a) (see, e.g., fig. 12D). In some embodiments, the transition is depicted as an animation in which particles (e.g., 1225) (e.g., from a defocusing effect) forming a representation (e.g., 1230) of the first user in the first presentation mode move together to form a representation (e.g., 1220-1) of the first user in the second presentation mode.
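One way to picture the particle transition described above is the minimal sketch below; it is illustrative only, and the Particle type, stepTransition function, and per-frame progress value are assumptions rather than details taken from the disclosure:

```swift
// Hypothetical particle model; the patent describes the effect only at a high level.
struct Particle {
    var position: SIMD3<Float>
    var target: SIMD3<Float>   // where this particle sits in the second-mode shape
}

// One animation step: every particle moves a fraction of the remaining way
// toward its target, so the first-mode avatar appears to dissolve and
// re-gather into the abstract second-mode representation.
func stepTransition(_ particles: inout [Particle], progressPerFrame t: Float = 0.12) {
    for i in particles.indices {
        particles[i].position += (particles[i].target - particles[i].position) * t
    }
}
```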
In some implementations, the representation (e.g., 1230-1) of the first user in the second presentation mode includes a set of one or more colors (e.g., on the particle 1235) selected (e.g., automatically; without user input; by the computer system) based on one or more colors (e.g., a set of colors determined based on the appearance of the user) associated with the first user (e.g., 700). In some embodiments, the one or more colors associated with the first user include a color of clothing (e.g., 709) worn by the first user in the physical environment (e.g., 1200), a color of clothing worn by the first user during a registration process (e.g., a registration process as discussed with respect to fig. 7A-7H), a color of clothing worn by the first user that is representative of an XR environment, a color of skin tone of the first user, and/or a color of skin tone of the representation of the first user. In some embodiments, data representing the color of the representation of the first user in the second presentation mode is provided to the computer system (e.g., 901 a) by the external computer system (e.g., 901; 701). In some embodiments, a second user, different from the first user, is represented in a second presentation mode using a color associated with the second user that is different from the color associated with the first user.
In some implementations, the representation (e.g., 1230-1) of the first user in the second presentation mode (e.g., 1224 a) includes a set of one or more colors (e.g., on the particle 1235) selected (e.g., automatically; without user input; by the computer system) from a predetermined set of palettes (e.g., a predetermined set of colors that are not determined based on the appearance of the user). Displaying a representation of a first user having a set of one or more colors selected from a predetermined set of palettes in a second presentation mode reduces computational resources consumed by the computer system by reducing the amount of user input required to display the representation of the first user in the second presentation mode, eliminating the need for color sampling of the user, and/or by eliminating problems that may occur when colors associated with the user are undesirable (e.g., black and/or white may obscure the appearance of the representation of the first user in the second presentation mode) or undetected (e.g., during registration). Reducing the computational effort enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user errors in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, the representation (e.g., 1230-1) of the first user in the second presentation mode (e.g., 1224 a) includes a set of one or more colors (e.g., on the particles 1235) selected (e.g., automatically; without user input; by the computer system) from a set of warm color palettes (e.g., a set of warm hues (e.g., colors based on orange, red, and/or yellow; colors having a lower color temperature relative to average)). Displaying the representation of the first user with a set of one or more colors selected from the set of warm color palettes in the second presentation mode provides feedback to the user of the computer system that the representation of the first user represents a person, even when the representation of the first user is not in the personified configuration. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, a computer system (e.g., 1200) displays a representation of a system element (e.g., a virtual assistant; a visual representation of something other than a first user) via a display generation component (e.g., 1202), where the representation of the system element includes a set of one or more colors selected (e.g., automatically; without user input; by the computer system) from a set of cold color palettes (e.g., a set of cold color hues (e.g., colors based on blue, green, and/or purple; colors having a higher color temperature relative to an average)). Displaying a representation of a system element having a set of one or more colors selected from a set of cold color palettes provides feedback to a user of the computer system that the representation of the system element represents something other than a person (e.g., does not represent another user in an augmented reality environment). Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
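The palette behavior described in the preceding paragraphs (user-derived or predetermined warm colors for representations of people, cooler colors for system elements) can be pictured with the following illustrative sketch; the HSBColor type, ParticleColorSource enumeration, specific hue values, and fallback logic are hypothetical and not taken from the disclosure:

```swift
// Hypothetical color model and palettes; the specific hues are assumptions.
struct HSBColor { var hue: Double; var saturation: Double; var brightness: Double }  // hue in degrees

enum ParticleColorSource {
    case sampledFromUser([HSBColor])   // e.g., clothing or skin-tone colors captured at enrollment
    case predeterminedPalette          // fixed palette, no sampling required
}

let assumedWarmPalette = [HSBColor(hue: 20, saturation: 0.7, brightness: 0.9),
                          HSBColor(hue: 45, saturation: 0.6, brightness: 0.95)]
let assumedCoolPalette = [HSBColor(hue: 210, saturation: 0.6, brightness: 0.9),
                          HSBColor(hue: 160, saturation: 0.5, brightness: 0.85)]

// Representations of people use warm hues (sampled or predetermined);
// system elements such as a virtual assistant use cool hues, so a viewer
// can tell the two apart at a glance.
func particleColors(for source: ParticleColorSource, isSystemElement: Bool) -> [HSBColor] {
    if isSystemElement { return assumedCoolPalette }
    switch source {
    case .sampledFromUser(let colors) where !colors.isEmpty:
        return colors
    case .sampledFromUser(_), .predeterminedPalette:
        // Fall back to the warm palette when no usable colors were captured.
        return assumedWarmPalette
    }
}
```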
In some implementations, when a representation of a first user (e.g., 1230-1) is displayed in a second presentation mode (e.g., 1224 a), a computer system (e.g., 901 a) receives data representing audio (e.g., speech) received from the first user (e.g., 700; 901). In response to receiving information representative of audio received from the first user, the computer system modifies an appearance of the representation of the first user (e.g., size, color, shape, brightness, and/or pulsation pattern of the representation of the first user in the second presentation mode and/or particles forming the representation of the first user in the second presentation mode) in response to a detected change in one or more characteristics of the audio received from the first user (e.g., audio characteristics such as tone, volume, pitch, etc.) (e.g., the representation of the first user in the second presentation mode changes appearance by changing size, color, shape, brightness, and/or pulsation as the first user's speech changes when the first user speaks). Modifying the appearance of the representation of the first user in the second presentation mode in response to a detected change in one or more characteristics of the audio received from the first user provides feedback to the user of the computer system that the first user is speaking, even when the representation of the first user is not in a personified configuration. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, when a representation of a first user (e.g., 1220) is displayed in a first presentation mode (e.g., 1224 b), a computer system (e.g., 901 a) receives data representing audio (e.g., speech) received from the first user (e.g., 700; 901). In response to receiving information representative of audio received from the first user, the computer system forgoes modifying the appearance of the representation of the first user in the first presentation mode (e.g., the color, shape, brightness, and/or pulsing pattern of particles forming the representation of the first user in the first presentation mode) in response to a detected change in one or more characteristics of the audio received from the first user (e.g., audio characteristics such as tone, volume, pitch, etc.) (e.g., the particles forming the representation of the first user in the first presentation mode do not change appearance as the first user's speech changes when the first user speaks). In some implementations, when the first user speaks, the representation of the first user in the first presentation mode optionally changes appearance by moving a mouth feature or performing another action mimicking the movement of speaking (e.g., see fig. 12C), but the particles (e.g., 1225) forming the representation of the first user do not otherwise change appearance. Conversely, when the representation of the first user is in the second presentation mode (1230-3), the particles (e.g., 1235) forming the representation of the first user change appearance by, for example, changing color, brightness, and/or pulsing behavior.
In some embodiments, the representation (e.g., 1230-1;1230-2;1230-3; 1230-4) of the first user in the second presentation mode (e.g., 1224 a) changes at least a portion of its appearance (e.g., particles (e.g., 1235) forming the representation of the first user in the second presentation mode move in a predetermined pattern) independent of audio output by the first user (e.g., 700) (e.g., audio output by the external computer system (e.g., 901) and/or provided to the computer system (e.g., 901 a)). In some embodiments, displaying the representation of the first user in the second presentation mode includes: after a predetermined period of time in which audio from the first user above a predetermined level is not detected (e.g., no audio data of the first user is received; the first user is not speaking), the computer system (e.g., 901 a) modifies the appearance of the representation of the first user in the second presentation mode in a predetermined manner (e.g., the appearance of the representation of the first user in the second presentation mode gradually changes over time (e.g., the particles forming the representation of the first user in the second presentation mode move in the predetermined pattern) regardless of whether the first user is speaking).
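A minimal sketch of the audio-driven appearance behavior described in the last three paragraphs is shown below; it is illustrative only, and the ParticleAppearance and PresentationMode types, the audio threshold, the three-second quiet period, and the brightness mapping are assumptions rather than values from the disclosure:

```swift
import Foundation

// Hypothetical particle/audio model, not taken from the patent.
struct ParticleAppearance {
    var brightness: Double   // 0...1
    var pulsePhase: Double   // radians, drives a slow idle shimmer
}

enum PresentationMode {
    case anthropomorphic   // "first" presentation mode: full avatar
    case coarse            // "second" presentation mode: non-anthropomorphic shape
}

func updateParticles(
    _ particles: inout ParticleAppearance,
    mode: PresentationMode,
    audioLevel: Double?,          // nil when the remote user is muted or silent
    silentDuration: TimeInterval, // seconds since audio was last above threshold
    deltaTime: TimeInterval
) {
    guard case .coarse = mode else {
        // In the first presentation mode the particles themselves do not react;
        // speech is conveyed by mouth movement on the avatar instead.
        return
    }
    if let level = audioLevel, level > 0.05 {
        // Louder speech -> brighter particles (threshold and mapping are assumptions).
        particles.brightness = min(1.0, 0.4 + 0.6 * level)
    } else if silentDuration > 3.0 {
        // After a quiet period, keep the representation alive with a slow,
        // predetermined pulsing pattern that is independent of audio.
        particles.pulsePhase += 0.8 * deltaTime
        particles.brightness = 0.5 + 0.1 * sin(particles.pulsePhase)
    }
}
```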
In some implementations, the representation of the first user (e.g., 1230-1) in the second presentation mode (e.g., 1224 a) includes elements (e.g., 1232) (e.g., letter combinations; initials of the first user) having a two-dimensional or substantially two-dimensional appearance (e.g., appearance that does not convey depth, a flat appearance; appearance that is not modeled as having depth in an augmented reality environment). In some embodiments, the representation of the first user in the second presentation mode (e.g., 1230-1) has a three-dimensional appearance (e.g., spherical shape, curved lens shape, rectangular prism shape, cube shape, etc.), and the elements displayed on the representation of the first user in the second presentation mode have a two-dimensional appearance or substantially two-dimensional appearance (e.g., two-dimensional text having a thickness or visual effect that may give some degree of three-dimensional appearance).
In some embodiments, an external computer system (e.g., 901) communicates with a second external computer system associated with a second user (e.g., a third user). In some embodiments, as part of displaying the representation of the first user (e.g., 1230-1) in the second presentation mode (e.g., 1224 a), the computer system (e.g., 901 a) displays an element (e.g., 1232) (e.g., the first user's initials) having the first position in the augmented reality environment (e.g., 1215) via the display generating component (e.g., 902 a), the element facing the user of the computer system in the augmented reality environment (e.g., oriented toward the user's point of view). In some embodiments, the second external computer system displays a representation of the first user in a second presentation mode (e.g., similar to 1230-1), including displaying an element (e.g., similar to 1232) having a second location (e.g., different from the first location) in the augmented reality environment that faces the second user in the augmented reality environment (e.g., displays a representation of the first user in the second presentation mode such that the element appears to the user of the computer system to face the user of the computer system in the augmented reality environment and appears to the second user to face the second user in the augmented reality environment). In some embodiments, the element is displayed differently for each user viewing the augmented reality environment and receiving a transmission of the representation of the first user in the second presentation mode such that the element appears to each user to face the user. In some embodiments, the display generation component displays that the element changes position to face an active user in the augmented reality environment. For example, when a user in an augmented reality environment begins speaking, the element moves (e.g., rotates) to face the user who is speaking.
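The per-viewer orientation of the element can be pictured with the short sketch below; it is illustrative only, assumes a simplified yaw-only rotation, and the GroundPoint type and monogramYaw functions are hypothetical:

```swift
import Foundation

// Simplified 2D (yaw-only) math with hypothetical types; a real renderer
// would use the platform's transform types instead.
struct GroundPoint { var x: Double; var z: Double }

// Yaw angle (radians) that turns the monogram element at `elementPosition`
// toward a given viewer. Each participant's device evaluates this locally,
// so every viewer sees the initials facing their own point of view.
func monogramYaw(elementPosition: GroundPoint, facing viewer: GroundPoint) -> Double {
    atan2(viewer.x - elementPosition.x, viewer.z - elementPosition.z)
}

// Optional behavior from the text: when another participant starts speaking,
// the element can instead rotate toward the active speaker.
func monogramYaw(elementPosition: GroundPoint,
                 localViewer: GroundPoint,
                 activeSpeaker: GroundPoint?) -> Double {
    monogramYaw(elementPosition: elementPosition, facing: activeSpeaker ?? localViewer)
}
```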
In some implementations, when a representation (e.g., 1230-1) of a first user having a first display size is displayed in a second presentation mode (e.g., 1224 a), the computer system (e.g., 901 a) receives fourth data (in some implementations, third data) (e.g., data indicating movement of the first user in the physical environment of the first user (e.g., from location 1200-1 to location 1200-2)) from an external computer system (e.g., 901). In response to receiving the fourth data, the computer system (e.g., 901 a) displays an element (e.g., 1232) that changes (e.g., grows or shrinks) from a second display size (e.g., the size of the element 1232 depicted in 1230-1) to a third display size (e.g., the size of the element 1232 depicted in 1230-2) that is different from the second display size (e.g., the size of the representation of the first user in the second presentation mode remains constant while the size of the element changes (e.g., based on the movement of the first user in the physical environment)). When the representation of the first user in the second presentation mode is displayed as having the first display size, displaying the element changing from the second display size to a third display size different from the second display size provides feedback to the user of the computer system that the first user is moving toward or away from the user of the computer system. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, the representation (e.g., 1230-1) of the first user in the second presentation mode also changes size. For example, the size of the representation of the first user in the second presentation mode may become larger or smaller to indicate the relative distance of the first user (e.g., 700) from the user of the computer system (e.g., 901 a) in the augmented reality environment (e.g., 1215).
In some implementations, the representation of the first user (e.g., 1230-4) in the second presentation mode (e.g., 1224 a) includes a visual indication (e.g., 1240) (e.g., a glyph) of a mute state of the first user (e.g., 700) (e.g., a state of whether audio detectable by the external computer system (e.g., 901) is being output by (or provided to) the computer system (e.g., 901 a)). Displaying the visual indication of the mute state of the first user provides feedback to the user of the computer system indicating whether the audio of the first user is muted. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, the representation (e.g., 1230-1) of the first user in the second presentation mode (e.g., 1224 a) includes a visual indication (e.g., 1232) of the identity (e.g., name or initials; text indication) of the first user (e.g., 700). Displaying the visual indication of the identity of the first user provides feedback to the user of the computer system identifying the first user when the first user is not otherwise identifiable in the augmented reality environment. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, the representation (e.g., 1220) of the first user in the first presentation mode (e.g., 1224 b) includes an avatar having avatar head features (e.g., a portion of portion 1220-3). In some implementations, displaying the representation (e.g., 1230-1) of the first user in the second rendering mode (e.g., 1224 a) includes stopping the display of the avatar (e.g., 1220) and displaying the representation (e.g., 1230-1) of the first user in the second rendering mode at a first position that overlaps a second position previously occupied by the avatar's head feature (e.g., displaying the representation of the first user in the second rendering mode at or near the position where the avatar's head is positioned when the first user transitions from the first rendering mode to the second rendering mode). Displaying the representation of the first user in the second mode at a first position that overlaps a second position previously occupied by the avatar head feature provides feedback to the user of the computer system of the position of the first user's face and aligns the representation of the first user with a focal plane of the user of the computer system such that, from the perspective of the first user, the user of the computer system appears to be in eye contact with the representation of the first user in the augmented reality environment. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently. In some embodiments, the position of the avatar's head is determined based on perceived or determined spatial locations of audio sources in an augmented reality environment.
It is noted that the details of the process described above with reference to method 1300 (e.g., fig. 13A-13B) also apply in a similar manner to methods 800, 1000, 1100, and 1400 described herein. For example, methods 800, 1000, 1100, and/or 1400 optionally include one or more of the features of the various methods described above with reference to method 1300. For the sake of brevity, these details are not repeated hereinafter.
FIG. 14 is a flowchart of an exemplary method 1400 for displaying an avatar in an XR environment, according to some embodiments. The method 1400 occurs at a computer system (e.g., 101;901 a) (e.g., a smartphone, a tablet, a head-mounted display generating component) in communication with a display generating component (e.g., 902 a) (e.g., a visual output device, a 3D display, a display having at least a transparent or translucent portion on which an image may be projected (e.g., a see-through display), a projector, a heads-up display, a display controller) and an external computer system (e.g., 901) associated with a first user (e.g., 700) (e.g., being operated by the first user (e.g., a user who is conducting a communication session (e.g., an augmented reality and/or video conference) with a user of the computer system).
In method 1400, in response to receiving (1402) a request to display a representation (e.g., 1220) (e.g., an avatar; a virtual avatar (e.g., a virtual representation of at least a portion of the first user that is displayed in place of the first user)) of a first user (e.g., 700) (e.g., a user of the external computer system) in an augmented reality environment (e.g., 1215), the computer system (e.g., 901 a), in some embodiments, performs the following.
In accordance with a determination (1404) that a set of glasses display criteria is met (e.g., a user setting (e.g., 1226b;1226c;1226 d) for displaying glasses (e.g., eyeglasses; framed corrective lenses; framed decorative lenses; framed protective lenses) is enabled; glasses (e.g., 707) were detected on the first user (e.g., 700) during a registration process (e.g., as discussed with respect to FIGS. 7A-7H); display of the glasses is manually enabled by the first user; display of the glasses is automatically enabled by the computer system or another computer system (e.g., 701;901 a); the first user is known to wear glasses), the computer system (e.g., 901 a) displays (1406) a representation (e.g., 1220) of the first user in the augmented reality environment (e.g., 1215) via the display generation component (e.g., displays an avatar in the augmented reality environment). In some embodiments, the representation of the first user is displayed as having a mode (e.g., 1224 b) (e.g., virtual presence mode) in which the first user is represented in the augmented reality environment by a rendering (e.g., an avatar) having human or anthropomorphic features (e.g., head, arms, legs, hands, etc.), or as an animated character (e.g., a human; a cartoon character; an anthropomorphic construct of a non-human character such as a dog, robot, etc.). In some implementations, the representation of the first user (e.g., 1220) is displayed with the same pose as the first user (e.g., 700). In some implementations, the representation of the first user is displayed as having a portion (e.g., 1220-2) that has the same pose as the corresponding portion (e.g., 700-2) of the first user. In some implementations, the representation of the first user is an avatar (e.g., a virtual avatar) that changes pose in response to a detected change in pose of at least a portion of the first user in the physical environment. For example, the avatar is displayed in the augmented reality environment as an animated character simulating the detected movement of the first user in the physical environment.
In accordance with a determination that the set of glasses display criteria is met, the computer system (e.g., 901 a) displays (1408), via the display generation component (e.g., 902 a), a representation of glasses (e.g., 1221) (e.g., avatar glasses) positioned on the representation (e.g., 1220) of the first user in the augmented reality environment (e.g., displaying an avatar in the augmented reality environment with the glasses positioned in front of the avatar's eyes (e.g., instead of displaying an avatar with a headset device over its eyes)).
In accordance with a determination (1410) that the set of glasses display criteria is not met (e.g., option 1226a is selected in fig. 12A), the computer system (e.g., 901 a) displays (1412) a representation of the first user (e.g., 1220) in an augmented reality environment (e.g., 1215) via a display generating component (e.g., 902A) without displaying a representation of glasses positioned on the representation of the first user in the augmented reality environment (e.g., see fig. 12A) (e.g., forgoing displaying a representation of glasses positioned on the representation of the first user in the augmented reality environment (e.g., displaying the same avatar in the augmented reality environment but not wearing glasses in front of their eyes)). In an augmented reality environment, representations of glasses positioned on a representation of a first user are selectively displayed based on whether the set of glasses display criteria is met, which provides feedback to a user of a computer system regarding an appearance of the first user (such as whether the first user wears glasses), and improves human-machine interaction by providing a more realistic appearance of the representation of the first user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
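For illustration, the determination of whether the set of glasses display criteria is met might be sketched as follows; the GlassesDisplaySettings structure, the treat-any-satisfied-criterion-as-sufficient rule, and the helper functions are assumptions, not details of the disclosed embodiments:

```swift
// Hypothetical settings model; the patent lists example criteria but no data structure.
struct GlassesDisplaySettings {
    var userEnabledGlasses: Bool          // e.g., a toggle corresponding to options 1226b-1226d
    var glassesDetectedAtEnrollment: Bool // e.g., eyewear seen during the FIG. 7A-7H flow
    var autoEnabledBySystem: Bool
}

// In this sketch any one satisfied criterion counts as meeting the set of
// glasses display criteria; the patent leaves the exact combination open.
func glassesDisplayCriteriaMet(_ settings: GlassesDisplaySettings) -> Bool {
    settings.userEnabledGlasses || settings.glassesDetectedAtEnrollment || settings.autoEnabledBySystem
}

func describeAvatarRendering(settings: GlassesDisplaySettings) -> String {
    glassesDisplayCriteriaMet(settings)
        ? "display the avatar with a glasses representation positioned over its eyes"
        : "display the avatar without a glasses representation"
}
```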
In some embodiments, the set of eyewear display criteria includes criteria that are met when user settings (e.g., 1226b;1226c;1226 d) (e.g., selectable options (e.g., toggle switches) in a user settings interface (e.g., 1204; 704)) are enabled (e.g., enabled by a first user (e.g., 700)) for displaying a representation of eyewear (e.g., 1221).
In some embodiments, the set of eyewear display criteria includes criteria that are met when a set of eyewear (e.g., 707) (e.g., a set of eyewear being worn by a user) is detected (e.g., automatically; by an external computer system (e.g., 901; 701)) during a registration process (e.g., an external computer system detects that a first user is wearing or holding a set of eyewear during a registration process (e.g., the registration process discussed with respect to FIGS. 7A-7H)).
In some embodiments, as part of displaying a representation of glasses (e.g., 1221) positioned on a representation (e.g., 1220) of a first user in an augmented reality environment (e.g., 1215), a computer system (e.g., 901 a) performs the following. In accordance with a determination that the first user (e.g., 700) has selected an option (e.g., 1226B) for a first appearance (e.g., a first appearance of a representation of glasses (e.g., glasses 1221 in fig. 12B) (e.g., a previous manual selection of the first user) (e.g., the first user currently selects/enables the first appearance option), the computer system displays a representation of glasses (e.g., 1221) having the first appearance (e.g., as depicted in fig. 12B). In accordance with a determination that the first user has selected an option (e.g., 1226C) for a second appearance (e.g., a second appearance of a representation of glasses that is different from the first appearance (e.g., glasses 1221 in fig. 12C) (e.g., the first user currently selects/enables the second appearance option), the computer system displays a representation of glasses (e.g., 1221) having the second appearance (e.g., as depicted in fig. 12C). Displaying a representation of glasses having a first appearance or a second appearance according to which option the first user has selected provides feedback to a user of the computer system regarding the appearance of the first user, such as the appearance of glasses worn by the first user, and improves human-machine interaction by providing a more realistic appearance of the representation of the first user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
In some implementations, the first appearance is based on the appearance of a display generating component (e.g., a headset component; e.g., an augmented reality headset) of the computer system (e.g., option 1226d depicted in fig. 7H and 12A). In some embodiments, the representation of the glasses (e.g., 1221) has the appearance of a headphone device.
In some embodiments, as part of displaying a representation of glasses (e.g., 1221) positioned on a representation (e.g., 1220) of a first user in an augmented reality environment (e.g., 1215), a computer system (e.g., 901 a) performs the following. In accordance with a determination that a third appearance criterion is met (e.g., data indicative of having detected a third appearance of the representation of the glasses (e.g., input data; data from a camera (e.g., similar to 904) or sensor of the computer system; data from an external computer system (e.g., 901; 701) (e.g., automatically; detected by the computer system and/or the external computer system (e.g., during a registration process)), the computer system (e.g., 901 a) displays a representation of the glasses (e.g., 1221) having a third appearance selected based on the glasses (e.g., 707) detected on the face of the user (e.g., 700) (e.g., prior to placement of the augmented reality headset, such as during a registration process for using the augmented reality headset). In some embodiments, the third appearance is an eyeglass appearance that is automatically detected by a computer system (e.g., 701) (e.g., an external computer system), for example, during a registration process. For example, during enrollment, the computer system detects that the user is wearing glasses with a thick frame, and thus automatically selects an appearance for the glasses representation that is similar to the appearance of the detected glasses (e.g., with a thick frame). Displaying a representation of glasses having a third appearance in accordance with a determination that the third appearance criterion is met provides feedback to a user of the computer system regarding an appearance of the first user, such as an appearance of glasses worn by the first user, and improves human-machine interaction by providing a more realistic appearance of the representation of the first user. Providing improved feedback enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user error in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
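One way to picture how an appearance for the glasses representation might be chosen from the options described above is the sketch below; the GlassesAppearance and GlassesSelectionState types and the precedence order (explicit user selection, then enrollment-detected glasses, then a default style) are assumptions for illustration only:

```swift
// Hypothetical appearance options, loosely mirroring options 1226b-1226d; the
// precedence order below is an assumption for illustration only.
enum GlassesAppearance {
    case firstStyle                    // a first predefined appearance
    case secondStyle                   // a second predefined appearance
    case headsetLike                   // resembles the display generation component
    case derivedFromDetected(frameThickness: Double)  // abstracted from enrollment capture
}

struct GlassesSelectionState {
    var userSelection: GlassesAppearance?   // explicit choice, if any
    var detectedFrameThickness: Double?     // measured during enrollment, if available
}

// An explicit user choice wins; otherwise fall back to an appearance abstracted
// from the glasses detected on the user's face; otherwise use a default style.
func resolveGlassesAppearance(_ state: GlassesSelectionState) -> GlassesAppearance {
    if let chosen = state.userSelection {
        return chosen
    }
    if let thickness = state.detectedFrameThickness {
        return .derivedFromDetected(frameThickness: thickness)
    }
    return .firstStyle
}
```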
In some implementations, a first user (e.g., 700) is associated with a set of eyeglasses (e.g., 707) having a first set of appearance characteristics (e.g., style, size, color, shape, hue). In some embodiments, the first user is associated with the set of eyeglasses when the set of eyeglasses is detected and/or selected during the registration process of the first user. In some embodiments, the representation of the glasses (e.g., 1221) has a second set of appearance characteristics different from the first set of appearance characteristics by omitting one or more visual details of the set of glasses (e.g., the representation of the glasses is an abstract representation of the set of glasses associated with the first user).
In some implementations, the representation of the glasses (e.g., 1221) has a translucent appearance (e.g., as depicted in fig. 12C) (e.g., such that the appearance of the representation of the user (e.g., 1220), the appearance of the one or more representations of the virtual objects, and/or the appearance of the one or more physical objects are visible through the representation of the glasses, wherein the appearance includes one or more of a shape, color, number, or size of the objects.
In some embodiments, the representation of the glasses (e.g., 1221) positioned on the representation of the first user (e.g., 1220) in the augmented reality environment (e.g., 1215) includes a representation of one or more edge portions of the glasses (e.g., as depicted in fig. 12B and 12C) (e.g., lens frames, optionally with or without lenses) and does not include a representation of a temple portion (e.g., one or more arms) of the glasses (e.g., the displayed representation of the glasses does not include the arms or temples of the glasses). Displaying the representation of the glasses without displaying the representation of the temple portions of the glasses reduces computing resources consumed by the computer system by eliminating the need to consider the positioning and display of the temple portions of the representation of the glasses. Reducing the computational effort enhances the operability of the computer system and makes the user-system interface more efficient (e.g., by helping the user provide proper input and reducing user errors in operating/interacting with the computer system), which in turn reduces power usage and extends battery life of the computer system by enabling the user to use the system more quickly and efficiently.
It is noted that the details of the process described above with reference to method 1400 (e.g., fig. 14) also apply in a similar manner to methods 800, 1000, 1100, and 1300 described above. For example, methods 800, 1000, 1100, and/or 1300 optionally include one or more features of the various methods described above with reference to method 1400.
In some embodiments, aspects and/or operations of methods 800, 1000, 1100, 1300, and 1400 may be interchanged, substituted, and/or added between the methods. For the sake of brevity, these details are not repeated here.
The foregoing description, for purposes of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application to thereby enable others skilled in the art to best utilize the invention and various described embodiments with various modifications as are suited to the particular use contemplated.
As described above, one aspect of the present technology is to collect and use data from various sources to improve the XR experience of the user. The present disclosure contemplates that in some examples, such collected data may include personal information data that uniquely identifies or may be used to contact or locate a particular person. Such personal information data may include demographic data, location-based data, telephone numbers, email addresses, tweet IDs, home addresses, data or records related to the user's health or fitness level (e.g., vital sign measurements, medication information, exercise information), date of birth, or any other identifying or personal information.
The present disclosure recognizes that the use of such personal information data in the present technology may be used to benefit users. For example, personal information data may be used to improve the XR experience of the user. In addition, the present disclosure contemplates other uses for personal information data that are beneficial to the user. For example, health and fitness data may be used to provide insight into the overall health of a user, or may be used as positive feedback to individuals using technology to pursue health goals.
The present disclosure contemplates that entities responsible for collecting, analyzing, disclosing, transmitting, storing, or otherwise using such personal information data will adhere to established privacy policies and/or privacy practices. In particular, such entities should exercise and adhere to privacy policies and practices that are recognized as meeting or exceeding industry or government requirements for maintaining the privacy and security of personal information data. Such policies should be readily accessible to the user and should be updated as the collection and/or use of the data changes. Personal information from users should be collected for legal and reasonable use by entities and not shared or sold outside of these legal uses. In addition, such collection/sharing should be performed after informed consent is received from the user. In addition, such entities should consider taking any necessary steps to defend and secure access to such personal information data and to ensure that others who have access to personal information data adhere to their privacy policies and procedures. In addition, such entities may subject themselves to third party evaluations to prove compliance with widely accepted privacy policies and practices. In addition, policies and practices should be adjusted to collect and/or access specific types of personal information data and to suit applicable laws and standards including specific considerations of jurisdiction. For example, in the united states, the collection or acquisition of certain health data may be governed by federal and/or state law, such as the health insurance flow and liability act (HIPAA); while health data in other countries may be subject to other regulations and policies and should be processed accordingly. Thus, different privacy practices should be maintained for different personal data types in each country.
In spite of the foregoing, the present disclosure also contemplates embodiments in which a user selectively prevents use or access to personal information data. That is, the present disclosure contemplates that hardware elements and/or software elements may be provided to prevent or block access to such personal information data. For example, with respect to an XR experience, the present technology may be configured to allow a user to choose to "opt-in" or "opt-out" to participate in the collection of personal information data during or at any time after registration with a service. In another example, the user may choose not to provide data for the customized service. In yet another example, the user may choose to limit the length of time that data is maintained or to prohibit development of the customized service altogether. In addition to providing the "opt-in" and "opt-out" options, the present disclosure also contemplates providing notifications related to accessing or using personal information. For example, the user may be notified that his personal information data will be accessed when the application is downloaded, and then be reminded again just before the personal information data is accessed by the application.
Further, it is an object of the present disclosure that personal information data should be managed and processed to minimize the risk of inadvertent or unauthorized access or use. Risk can be minimized by limiting data collection and by deleting data once it is no longer needed. In addition, and when applicable, including in certain health-related applications, data de-identification may be used to protect the privacy of the user. De-identification may be facilitated by removing specific identifiers (e.g., date of birth, etc.), controlling the amount or specificity of stored data (e.g., collecting location data at a city level instead of at an address level), controlling how data is stored (e.g., aggregating data among users), and/or other methods, as appropriate.
Thus, while the present disclosure broadly covers the use of personal information data to implement one or more of the various disclosed embodiments, the present disclosure also contemplates that the various embodiments may be implemented without accessing such personal information data. That is, various embodiments of the present technology do not fail to function properly due to the lack of all or a portion of such personal information data. For example, an XR experience may be generated by inferring preferences based on non-personal information data or absolute minimum metrics of personal information, such as content requested by a device associated with the user, other non-personal information available to the service, or publicly available information.

Claims (41)

1. A method, the method comprising:
at a computer system in communication with a display generation component and one or more cameras:
during a registration process including capturing facial data of a user via the one or more cameras, displaying a registration interface for registering one or more features of the user via the display generating component, comprising:
outputting a first prompt to position a first set of one or more facial features of the user in a first predefined set of one or more facial expressions, wherein the first predefined set of one or more facial expressions includes a first particular facial expression; and
outputting a second prompt to position a second set of one or more facial features of the user in a second predefined set of one or more facial expressions different from the first predefined set of one or more facial expressions, wherein the second predefined set of one or more facial expressions includes a second particular facial expression different from the first particular facial expression.
2. The method according to claim 1, wherein:
the first prompt is output in accordance with a determination that a first set of registration criteria is not met, and the second prompt is output in accordance with a determination that the first set of registration criteria is met and a second set of registration criteria is not met.
3. The method according to claim 1 or 2, comprising:
after outputting the first prompt, capturing a first set of facial data of the user via the one or more cameras; and
after outputting the second prompt, a second set of facial data of the user is captured via the one or more cameras.
4. A method according to claim 3, further comprising:
after capturing the first set of facial data of the user via the one or more cameras, stopping display of the first prompt; and
After capturing the second set of facial data of the user via the one or more cameras, display of the second prompt is stopped.
5. The method of claim 1 or 2, wherein the first predefined set of one or more facial expressions is selected from the group consisting of smiling, frowning, oblique eyes, and surprised expressions.
6. The method of claim 1 or 2, wherein the second prompt includes prompting the user to speak a set of one or more words.
7. The method of claim 1 or 2, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
outputting a third prompt to change the position of the user's head.
8. The method of claim 7, wherein the third hint is output before at least one of the first hint or the second hint.
9. The method of claim 1 or 2, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
a fourth prompt is output to change the position of the one or more cameras relative to the user's head while holding the user's head stationary.
10. The method of claim 1 or 2, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
a fifth prompt is output to indicate the height of the user.
11. The method of claim 1 or 2, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
for at least a portion of the enrollment process, a sixth prompt is output to remove a set of eyeglasses from the user's face.
12. The method according to claim 1 or 2, wherein:
generating an avatar using at least a portion of the facial data captured during the enrollment process, and
the avatar is displayed using an external computer system different from the computer system.
13. The method of claim 1 or 2, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
a seventh prompt is output to capture a pose of a non-facial feature of the user.
14. A computer-readable storage medium storing one or more programs configured to be executed by one or more processors of a computer system in communication with a display generation component and one or more cameras, the one or more programs comprising instructions for:
During a registration process including capturing facial data of a user via the one or more cameras, displaying a registration interface for registering one or more features of the user via the display generating component, comprising:
outputting a first prompt to position a first set of one or more facial features of the user in a first predefined set of one or more facial expressions, wherein the first predefined set of one or more facial expressions includes a first particular facial expression; and
outputting a second prompt to position a second set of one or more facial features of the user in a second predefined set of one or more facial expressions different from the first predefined set of one or more facial expressions, wherein the second predefined set of one or more facial expressions includes a second particular facial expression different from the first particular facial expression.
15. The computer-readable storage medium of claim 14, wherein:
the first prompt is output in accordance with a determination that a first set of registration criteria is not met, and the second prompt is output in accordance with a determination that the first set of registration criteria is met and a second set of registration criteria is not met.
16. The computer-readable storage medium of claim 14 or 15, the one or more programs further comprising instructions for:
after outputting the first prompt, capturing a first set of facial data of the user via the one or more cameras; and
after outputting the second prompt, a second set of facial data of the user is captured via the one or more cameras.
17. The computer-readable storage medium of claim 16, the one or more programs further comprising instructions for:
after capturing the first set of facial data of the user via the one or more cameras, stopping display of the first prompt; and
after capturing the second set of facial data of the user via the one or more cameras, display of the second prompt is stopped.
18. The computer-readable storage medium of claim 14 or 15, wherein the first predefined set of one or more facial expressions is selected from the group consisting of smiling, frowning, oblique eye, and surprise expression.
19. The computer-readable storage medium of claim 14 or 15, wherein the second prompt includes prompting the user to speak a set of one or more words.
20. The computer-readable storage medium of claim 14 or 15, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
outputting a third prompt to change the position of the user's head.
21. The computer-readable storage medium of claim 20, wherein the third hint is output before at least one of the first hint or the second hint.
22. The computer-readable storage medium of claim 14 or 15, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
a fourth prompt is output to change the position of the one or more cameras relative to the user's head while holding the user's head stationary.
23. The computer-readable storage medium of claim 14 or 15, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
a fifth prompt is output to indicate the height of the user.
24. The computer-readable storage medium of claim 14 or 15, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
For at least a portion of the enrollment process, a sixth prompt is output to remove a set of eyeglasses from the user's face.
25. The computer-readable storage medium of claim 14 or 15, wherein:
generating an avatar using at least a portion of the facial data captured during the enrollment process, and
the avatar is displayed using an external computer system different from the computer system.
26. The computer-readable storage medium of claim 14 or 15, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
a seventh prompt is output to capture a pose of a non-facial feature of the user.
27. A computer system, wherein the computer system is configured to communicate with a display generation component and one or more cameras, the computer system comprising:
one or more processors; and
a memory storing one or more programs configured to be executed by the one or more processors, the one or more programs comprising instructions for:
during a registration process including capturing facial data of a user via the one or more cameras, displaying a registration interface for registering one or more features of the user via the display generating component, comprising:
outputting a first prompt to position a first set of one or more facial features of the user in a first predefined set of one or more facial expressions, wherein the first predefined set of one or more facial expressions includes a first particular facial expression; and
outputting a second prompt to position a second set of one or more facial features of the user in a second predefined set of one or more facial expressions different from the first predefined set of one or more facial expressions, wherein the second predefined set of one or more facial expressions includes a second particular facial expression different from the first particular facial expression.
28. The computer system of claim 27, wherein:
the first prompt is output in accordance with a determination that a first set of registration criteria is not met, and the second prompt is output in accordance with a determination that the first set of registration criteria is met and a second set of registration criteria is not met.
29. The computer system of claim 27 or 28, the one or more programs further comprising instructions for:
after outputting the first prompt, capturing a first set of facial data of the user via the one or more cameras; and
After outputting the second prompt, a second set of facial data of the user is captured via the one or more cameras.
30. The computer system of claim 29, the one or more programs further comprising instructions for:
after capturing the first set of facial data of the user via the one or more cameras, stopping display of the first prompt; and
after capturing the second set of facial data of the user via the one or more cameras, display of the second prompt is stopped.
31. The computer system of claim 27 or 28, wherein the first predefined set of one or more facial expressions is selected from the group consisting of smiling, frowning, oblique eye, and surprised expression.
32. The computer system of claim 27 or 28, wherein the second prompt includes prompting the user to speak a set of one or more words.
33. The computer system of claim 27 or 28, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
outputting a third prompt to change the position of the user's head.
34. The computer system of claim 33, wherein the third hint is output before at least one of the first hint or the second hint.
35. The computer system of claim 27 or 28, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
a fourth prompt is output to change the position of the one or more cameras relative to the user's head while holding the user's head stationary.
36. The computer system of claim 27 or 28, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
a fifth prompt is output to indicate the height of the user.
37. The computer system of claim 27 or 28, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
for at least a portion of the enrollment process, a sixth prompt is output to remove a set of eyeglasses from the user's face.
38. The computer system of claim 27 or 28, wherein:
generating an avatar using at least a portion of the facial data captured during the enrollment process, and
The avatar is displayed using an external computer system different from the computer system.
39. The computer system of claim 27 or 28, wherein displaying the enrollment interface for enrolling one or more features of the user further comprises:
a seventh prompt is output to capture a pose of a non-facial feature of the user.
40. A computer system configured to communicate with a display generation component and one or more cameras, comprising:
apparatus for performing the method of any one of claims 1 to 13.
41. A computer program product comprising one or more programs configured to be executed by one or more processors of a computer system in communication with a display generating component and one or more cameras, the one or more programs comprising instructions for performing the method of any of claims 1-13.
CN202311319264.9A 2021-02-16 2022-02-15 Interface for rendering avatars in a three-dimensional environment Pending CN117193538A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US63/149,989 2021-02-16
US17/667,350 US20220262080A1 (en) 2021-02-16 2022-02-08 Interfaces for presenting avatars in three-dimensional environments
US17/667,350 2022-02-08
CN202280015249.2A CN116868152A (en) 2021-02-16 2022-02-15 Interface for rendering avatars in a three-dimensional environment
PCT/US2022/016451 WO2022177900A1 (en) 2021-02-16 2022-02-15 Interfaces for presenting avatars in three-dimensional environments

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202280015249.2A Division CN116868152A (en) 2021-02-16 2022-02-15 Interface for rendering avatars in a three-dimensional environment

Publications (1)

Publication Number Publication Date
CN117193538A true CN117193538A (en) 2023-12-08

Family

ID=88236375

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202311319264.9A Pending CN117193538A (en) 2021-02-16 2022-02-15 Interface for rendering avatars in a three-dimensional environment
CN202280015249.2A Pending CN116868152A (en) 2021-02-16 2022-02-15 Interface for rendering avatars in a three-dimensional environment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202280015249.2A Pending CN116868152A (en) 2021-02-16 2022-02-15 Interface for rendering avatars in a three-dimensional environment

Country Status (1)

Country Link
CN (2) CN117193538A (en)

Also Published As

Publication number Publication date
CN116868152A (en) 2023-10-10

Similar Documents

Publication Publication Date Title
US20220253136A1 (en) Methods for presenting and sharing content in an environment
US11733769B2 (en) Presenting avatars in three-dimensional environments
US11567625B2 (en) Devices, methods, and graphical user interfaces for interacting with three-dimensional environments
US20220262080A1 (en) Interfaces for presenting avatars in three-dimensional environments
US11893154B2 (en) Systems, methods, and graphical user interfaces for updating display of a device relative to a user's body
US20230384907A1 (en) Methods for relative manipulation of a three-dimensional environment
US20240077937A1 (en) Devices, methods, and graphical user interfaces for controlling avatars within three-dimensional environments
US20240094882A1 (en) Gestures for selection refinement in a three-dimensional environment
US20230343049A1 (en) Obstructed objects in a three-dimensional environment
US20230316674A1 (en) Devices, methods, and graphical user interfaces for modifying avatars in three-dimensional environments
US20230334808A1 (en) Methods for displaying, selecting and moving objects and containers in an environment
US20230171484A1 (en) Devices, methods, and graphical user interfaces for generating and displaying a representation of a user
CN117193538A (en) Interface for rendering avatars in a three-dimensional environment
US20230103161A1 (en) Devices, methods, and graphical user interfaces for tracking mitigation in three-dimensional environments
US20230384860A1 (en) Devices, methods, and graphical user interfaces for generating and displaying a representation of a user
US20230350539A1 (en) Representations of messages in a three-dimensional environment
EP4275108A1 (en) Interfaces for presenting avatars in three-dimensional environments
US20240104871A1 (en) User interfaces for capturing media and manipulating virtual objects
US20230152935A1 (en) Devices, methods, and graphical user interfaces for presenting virtual objects in virtual environments
US20240103636A1 (en) Methods for manipulating a virtual object
US20240028177A1 (en) Devices, methods, and graphical user interfaces for interacting with media and three-dimensional environments
US20240029377A1 (en) Devices, Methods, and Graphical User Interfaces for Providing Inputs in Three-Dimensional Environments
US20240036699A1 (en) Devices, Methods, and Graphical User Interfaces for Processing Inputs to a Three-Dimensional Environment
US20230315385A1 (en) Methods for quick message response and dictation in a three-dimensional environment
US20240104819A1 (en) Representations of participants in real-time communication sessions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination