WO2025254920A1 - Pose-based facial expressions - Google Patents
Pose-based facial expressions
- Publication number
- WO2025254920A1 (PCT/US2025/031396)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- user
- data
- model
- expression
- body pose
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/10—Image acquisition
- G06V10/12—Details of acquisition arrangements; Constructional details thereof
- G06V10/14—Optical characteristics of the device performing the acquisition or on the illumination arrangements
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
- G06V40/176—Dynamic expression
Definitions
- the present disclosure generally relates to artificial intelligence (AI) applications, and more particularly to pose-based facial expressions.
- Facial expressions are a form of nonverbal communication that involves one or more motions or positions of the muscles beneath the skin of the face. These movements are believed to convey the emotional state of an individual to observers. Human faces are remarkably capable of a vast range of expressions, such as showing fear to send signals of alarm, interest to draw others toward an opportunity, or fondness and kindness to increase closeness.
- AI has revolutionized the field of body movement tracking, opening new possibilities in various sectors such as fitness, healthcare, gaming, and animation.
- AI-powered motion-capture and body-tracking technologies have made it possible to generate three-dimensional (3D) animations from video in seconds.
- AI-powered body scanning technologies are being used to track and analyze users’ exercise routines. These systems can provide real-time feedback on the user’s form and technique, helping to prevent injuries and improve workout efficiency. Also, AI-powered body tracking allows for more realistic and dynamic character movements in the field of animation and gaming. Moreover, AI-powered body posture detection and motion tracking are also being used in healthcare for enhanced exercise experiences.
- a method for inferring avatar facial expressions from captured user body pose data comprising: accessing a first set of data comprising facial expressions and a second set of data comprising body poses, wherein each body pose in the second set of data is mapped to at least one facial expression in the first set of data; training, based on the mappings between the first set of data and the second set of data, an artificial-intelligence (AI) model to infer facial expressions when the AI model receives at least one or more body poses; receiving one or more body pose indications; applying the AI model to the one or more body pose indications and receiving, from the AI model based on the training, an inference of a facial expression; and causing an avatar to affect an expression based on the facial expression inferred by the AI model.
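The claimed training-and-inference loop can be sketched as follows. This is a minimal stand-in, not the disclosed implementation: it assumes body pose indications arrive as fixed-length feature vectors, and it uses a nearest-neighbor lookup in place of a trained AI model. All poses, labels, and values are hypothetical.

```python
import math

# Hypothetical training data: each body-pose feature vector (e.g., flattened
# joint coordinates) is mapped to at least one facial-expression label.
POSE_TO_EXPRESSION = [
    ([0.9, 0.9, 0.1, 0.1], "elated"),    # both hands raised
    ([0.1, 0.1, 0.8, 0.2], "anxious"),   # stop gesture
    ([0.5, 0.9, 0.5, 0.1], "friendly"),  # peace sign
    ([0.2, 0.2, 0.9, 0.9], "angry"),     # punching pose
]

def _distance(a, b):
    """Euclidean distance between two pose feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def infer_expression(pose_indication, training=POSE_TO_EXPRESSION):
    """Infer a facial expression from a body-pose indication by
    nearest-neighbor lookup over the pose-to-expression mappings."""
    _, label = min(training, key=lambda pair: _distance(pair[0], pose_indication))
    return label

# An incoming pose indication close to the "hands raised" pose; the avatar
# would then be driven to affect the inferred expression.
expression = infer_expression([0.85, 0.95, 0.15, 0.1])
```

A production system would replace the lookup with a trained classifier, but the interface — pose indication in, expression label out — is the same.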
- the second set of data is based on images or video clips of body poses and each mapping for a body pose, corresponding to an image or video clip, is based on facial expressions determined at the time the image or video clip was captured.
- the second set of data further comprises, associated with one or more of the body poses, biometric data including a heart rate or a blood pressure; wherein the training of the AI model is further based on the association between the biometric data and one or more body poses mapped to facial expressions; wherein the received one or more body pose indications are associated with biometric data; and wherein the applying the AI model to the one or more body pose indications further includes applying the AI model to the biometric data associated with the received one or more body pose indications to infer the facial expression received from the AI model.
- the second set of data further comprises, associated with one or more of the body poses, voice data; wherein the training of the AI model is further based on the association between the voice data and one or more body poses mapped to facial expressions; wherein the received one or more body pose indications are associated with a voice recording; and wherein the applying the AI model to the one or more body pose indications further includes applying the AI model to the data based on the voice recording associated with the received one or more body pose indications to infer the facial expression received from the AI model.
- the method further comprises: determining that a user, on which the one or more body pose indications are based, is engaged in a competition; wherein the expression affected by the avatar is further based on the determining that the user is engaged in the competition.
- the method further comprises: determining an expression of one or more users in a vicinity of a user, on which the one or more body pose indications are based; wherein the expression affected by the avatar is further based on the determined expression of the one or more users in the vicinity of the user.
- the determining the expression of the one or more users in a vicinity of the user is in response to: determining that the one or more users has a specified type of relationship, in a social graph, to the user; or determining that there is a record of one or more historical interactions between the one or more users and the user.
- the method further comprises: identifying an above-threshold level of activity of a user, on which the one or more body pose indications are based; and in response to the identifying of the above-threshold level of activity, further causing the avatar to affect an increased activity expression comprising one or more of: flaring nostrils, an accelerated rate of chest and/or neck breathing animation, an altered skin tone, or any combination thereof.
- the method further comprises: computing a period of time based on one or more of: an age of the user, a weight of the user, a determined intensity of the activity of the user, or any combination thereof; and identifying an end of the activity of the user; wherein the increased activity expression is maintained for the computed period of time after the end of the activity of the user.
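The period-of-time computation recited above could take a form like the following sketch. Both the formula and its coefficients are illustrative assumptions, not values from the disclosure.

```python
def recovery_period_seconds(age_years, weight_kg, intensity):
    """How long (in seconds) the increased-activity expression is maintained
    after the end of the activity. Coefficients are placeholders.

    intensity is a unitless factor in [0, 1] from the activity classifier.
    """
    base = 30.0                                    # baseline cool-down time
    age_factor = 1.0 + age_years / 100.0           # older users recover more slowly
    weight_factor = 1.0 + max(0.0, weight_kg - 70.0) / 200.0
    return base * intensity * age_factor * weight_factor

# e.g., a 40-year-old, 80 kg user after a high-intensity (0.9) session:
period = recovery_period_seconds(40, 80, 0.9)
```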
- the identifying the level of activity of the user is based on calculated body motion velocities and/or motion vectors for the user.
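Calculating body motion velocities from consecutive pose frames, as recited above, might look like this sketch, assuming poses arrive as ordered lists of 2D keypoints (the frame format and threshold semantics are assumptions):

```python
def joint_velocities(prev_keypoints, curr_keypoints, dt):
    """Per-joint speed (units/second) between two consecutive pose frames.
    Each frame is a list of (x, y) keypoints in the same joint order."""
    return [
        ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5 / dt
        for (px, py), (cx, cy) in zip(prev_keypoints, curr_keypoints)
    ]

def above_activity_threshold(prev_keypoints, curr_keypoints, dt, threshold):
    """True when the peak joint speed exceeds the activity threshold,
    which would trigger the increased-activity expression."""
    return max(joint_velocities(prev_keypoints, curr_keypoints, dt)) > threshold
```

The per-joint differences also double as the motion vectors mentioned later in the description.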
- the one or more body pose indications are based on images from a virtual camera that uses an AI engine to determine the user’s body positioning; and wherein the method further comprises adjusting parameters of the virtual camera causing the virtual camera to frame the user’s activities for improved pose capture.
- a computer-readable storage medium storing instructions, for inferring avatar facial expressions from captured user body pose data, the instructions, when executed by a computing system, cause the computing system to: access a first set of data comprising facial expressions and a second set of data comprising body poses, wherein each body pose in the second set of data is mapped to at least one facial expression in the first set of data; train, based on the mappings between the first set of data and the second set of data, an artificial-intelligence (AI) model to infer facial expressions when the AI model receives at least one or more body poses; receive one or more body pose indications; apply the AI model to the one or more body pose indications and receive, from the AI model based on the training, an inference of a facial expression; and cause an avatar to affect an expression based on the facial expression inferred by the AI model.
- the second set of data is based on images or video clips of body poses and each mapping for a body pose, corresponding to an image or video clip, is based on facial expressions determined at the time the image or video clip was captured.
- the second set of data further comprises, associated with one or more of the body poses, biometric data including a heart rate or a blood pressure; wherein the training of the AI model is further based on the association between the biometric data and one or more body poses mapped to facial expressions; wherein the received one or more body pose indications are associated with biometric data; and wherein the applying the AI model to the one or more body pose indications further includes applying the AI model to the biometric data associated with the received one or more body pose indications to infer the facial expression received from the AI model.
- the instructions, when executed, further cause the computing system to: determine that a user, on which the one or more body pose indications are based, is engaged in a competition; wherein the expression affected by the avatar is further based on the determining that the user is engaged in the competition.
- the instructions, when executed, further cause the computing system to: determine an expression of one or more users in a vicinity of a user, on which the one or more body pose indications are based; wherein the expression affected by the avatar is further based on the determined expression of the one or more users in the vicinity of the user; and wherein the determining the expression of the one or more users in the vicinity of the user is in response to: determining that the one or more users has a specified type of relationship, in a social graph, to the user; or determining that there is a record of one or more historical interactions between the one or more users and the user.
- a computing system for inferring avatar facial expressions from captured user body pose data
- the computing system comprising: one or more processors; and one or more memories storing instructions that, when executed by the one or more processors, cause the computing system to: access a first set of data comprising facial expressions and a second set of data comprising body poses, wherein each body pose in the second set of data is mapped to at least one facial expression in the first set of data; train, based on the mappings between the first set of data and the second set of data, an artificial-intelligence (AI) model to infer facial expressions when the AI model receives at least one or more body poses; receive one or more body pose indications; apply the AI model to the one or more body pose indications and receive, from the AI model based on the training, an inference of a facial expression; and cause an avatar to affect an expression based on the facial expression inferred by the AI model.
- the instructions, when executed, further cause the computing system to: identify an above-threshold level of activity of a user, on which the one or more body pose indications are based; and in response to the identifying of the above-threshold level of activity, further cause the avatar to affect an increased activity expression comprising one or more of: flaring nostrils, an accelerated rate of chest and/or neck breathing animation, an altered skin tone, or any combination thereof.
- the instructions, when executed, further cause the computing system to: compute a period of time based on one or more of: an age of the user, a weight of the user, a determined intensity of the activity of the user, or any combination thereof; and identify an end of the activity of the user; wherein the increased activity expression is maintained for the computed period of time after the end of the activity of the user.
- the one or more body pose indications are based on images from a virtual camera that uses an AI engine to determine the user’s body positioning; and wherein the instructions, when executed, further cause the computing system to adjust parameters of the virtual camera causing the virtual camera to frame the user’s activities for improved pose capture.
- FIG. 1 is a high-level block diagram illustrating a network architecture within which some aspects of the subject technology are implemented.
- FIG. 2 is a block diagram illustrating details of a system including a client device and a server, as discussed herein.
- FIG. 3 is a block diagram illustrating examples of application modules used in the client device of FIG. 2, according to some embodiments.
- FIG. 4 is a screen shot illustrating an example of a facial expression inferred from a form of a hand-in-the-air body gesture, according to some embodiments.
- FIG. 5 is a screen shot illustrating an example of a facial expression inferred from a form of a stop body gesture, according to some embodiments.
- FIG. 6 is a screen shot illustrating an example of a facial expression inferred from a form of a peace sign body gesture, according to some embodiments.
- FIG. 7 is a screen shot illustrating an example of a facial expression inferred from a form of a punching body gesture, according to some embodiments.
- FIG. 8 is a flow diagram illustrating an example of a method of inferring facial expression from body gestures, according to some embodiments.
- FIG. 9 is a flow diagram illustrating an example of a method of inferring facial expression from body poses, according to some embodiments.
- FIG. 10 is a block diagram illustrating an overview of devices on which some implementations can operate.
- FIG. 11 is a block diagram illustrating an overview of an environment in which some implementations can operate.
- not all of the depicted components in each figure may be required, and one or more implementations may include additional components not shown in a figure. Variations in the arrangement and type of the components may be made without departing from the scope of the subject disclosure. Additional components, different components, or fewer components may be utilized within the scope of the subject disclosure.
- a device of the subject technology includes an extra-reality (XR) headset comprising a processor configured to execute machine-learning (ML) instructions, memory configured to store a first set of data and a communications module configured to access a cloud storage including a second set of data.
- the ML instructions are configured to train an AI model to infer facial expressions based on at least one of the first set of data or the second set of data.
- an apparatus comprises an XR headset including a processor configured to execute ML instructions, memory configured to store a first set of data and a communications module configured to access a cloud storage including a second set of data. At least one of the first set of data or the second set of data includes a plurality of facial expressions.
- the ML instructions are configured to train an AI model to infer at least one body pose based on at least one of the first set of data or the second set of data.
- a method of the subject technology includes executing, by a processor, ML instructions, retrieving a first set of data from memory, and obtaining, by a communication module, from a cloud storage a second set of data. At least one of the first set of data or the second set of data includes a plurality of facial expressions and body poses.
- the ML instructions are configured to train an AI model to infer at least one body pose based on at least one of the first set of data or the second set of data.
- the facial expression and/or appearance can be driven in a fitness activity while the user is working out or is engaged in a sport such as running, jumping, punching or any other activity that involves high velocity motions.
- the measured user’s biometric data including a heart rate or a blood pressure may be used as an indication of working out and cause the avatar to breathe heavily, for example, expressed by nostril flaring or chest and/or neck being animated.
- the indication of working out can be expressed by changing of the color of the skin of the avatar, for example, by turning the color to red to signal getting hot.
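The two workout indications above — heavy breathing with nostril flaring, and a red skin-tone shift — can be sketched as a mapping from measured heart rate to avatar animation parameters. The thresholds, scaling, and parameter names are illustrative assumptions.

```python
def workout_cues(heart_rate_bpm, resting_bpm=60):
    """Map a measured heart rate to avatar 'working out' cues: a breathing
    animation rate, nostril flaring, and a skin-tone shift toward red."""
    effort = max(0.0, (heart_rate_bpm - resting_bpm) / 100.0)  # 0 at rest
    return {
        "breaths_per_minute": 12 + 20 * min(effort, 1.0),  # 12 at rest, up to 32
        "nostril_flare": effort > 0.5,                      # flare when working hard
        "skin_redness": min(effort, 1.0),                   # 0.0 .. 1.0 red shift
    }

cues = workout_cues(150)  # e.g., during a run
```

Blood pressure, or a fused estimate from several biometric signals, could replace heart rate as the effort input.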
- the facial expression can be used to drive plausible body poses by using face tracking.
- the body poses can change based on the facial expression.
- a body movement indicating an activity can be driven by sensed cues such as the color of the skin of the avatar turning red, flaring of the nostrils, or movement of the chest or the neck of the avatar.
- the generation of the body motions can be valuable when only the face of the user is tracked, for example, by a mobile camera, but the body of the user is not in the field of view of the camera. This may happen when the user is an avatar in Horizon with only phone access.
- Embodiments of the disclosed technology may include or be implemented in conjunction with an extra reality system.
- Artificial reality or extra reality (XR) is a form of reality that has been adjusted in some manner before presentation to a user, which may include, e.g., virtual reality (VR), augmented reality (AR), mixed reality (MR), hybrid reality, or some combination and/or derivatives thereof.
- Extra reality content may include completely generated content or generated content combined with captured content (e.g., real-world photographs).
- the extra reality content may include video, audio, haptic feedback, or some combination thereof, any of which may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer).
- extra reality may be associated with applications, products, accessories, services, or some combination thereof, that are, e.g., used to create content in an extra reality and/or used in (e.g., perform activities in) an extra reality.
- the extra reality system that provides the extra reality content may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, a "cave" environment or other projection system, or any other hardware platform capable of providing extra reality content to one or more viewers.
- Virtual reality refers to an immersive experience where a user's visual input is controlled by a computing system.
- Augmented reality refers to systems where a user views images of the real world after they have passed through a computing system.
- a tablet with a camera on the back can capture images of the real world and then display the images on the screen on the opposite side of the tablet from the camera. The tablet can process and adjust or “augment” the images as they pass through the system, such as by adding virtual objects.
- Mixed reality or “MR” refers to systems where light entering a user's eye is partially generated by a computing system and partially comprises light reflected off objects in the real world.
- a MR headset could be shaped as a pair of glasses with a pass-through display, which allows light from the real world to pass through a waveguide that simultaneously emits light from a projector in the MR headset, allowing the MR headset to present virtual objects intermixed with the real objects the user can see.
- "Artificial reality,” “extra reality,” or “XR,” as used herein, refers to any of VR, AR, MR, or any combination or hybrid thereof.
- FIG. 1 is a high-level block diagram illustrating a network architecture 100 within which some aspects of the subject technology are implemented.
- the network architecture 100 may include servers 130 and a database 152, communicatively coupled with multiple client devices 110 via a network 150.
- Client devices 110 may include, but are not limited to, laptop computers, desktop computers, and the like, and/or mobile devices such as smart phones, palm devices, video players, headsets (e.g., extra-reality (XR) headsets), tablet devices, and the like.
- the network 150 may include, for example, any one or more of a local area network (LAN), a wide area network (WAN), the Internet, and the like. Further, the network 150 may include, but is not limited to, any one or more of the following network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, tree or hierarchical network, and the like.
- FIG. 2 is a block diagram illustrating details of a system 200 including a client device and a server, as discussed herein.
- the system 200 includes at least one client device 110, at least one server 130 of the network architecture 100, a database 252 and the network 150.
- the client device 110 and the server 130 are communicatively coupled over network 150 via respective communications modules 218-1 and 218-2 (hereinafter, collectively referred to as “communications modules 218”).
- Communications modules 218 are configured to interface with network 150 to send and receive information, such as requests, uploads, messages, and commands to other devices on the network 150.
- Communications modules 218 can be, for example, modems or Ethernet cards, and may include radio hardware and software for wireless communications (e.g., via electromagnetic radiation, such as radiofrequency (RF), nearfield communications (NFC), Wi-Fi, and Bluetooth radio technology).
- the client device 110 may be coupled with an input device 214 and with an output device 216.
- a user may interact with the client device 110 via the input device 214 and the output device 216.
- Input device 214 may include a mouse, a keyboard, a pointer, a touchscreen, a microphone, a joystick, a virtual joystick, a touchscreen display that a user may use to interact with client device 110, or the like.
- the input device 214 may include cameras, microphones, and sensors, such as touch sensors, acoustic sensors, inertial motion units and other sensors configured to provide input data to an XR system.
- Output device 216 may be a screen display, a touchscreen, a speaker, and the like.
- the client device 110 may also include a camera 210 (e.g., a smart camera), a processor 212-1, memory 220-1 and the communications module 218-1.
- the camera 210 is in communication with the processor 212-1 and the memory 220-1.
- the processor 212-1 is configured to execute instructions stored in a memory 220-1, and to cause the client device 110 to perform at least some operations in methods consistent with the present disclosure.
- the memory 220-1 may further include application 222, configured to run in the client device 110 and couple with input device 214, output device 216 and the camera 210.
- the application 222 may be downloaded by the user from the server 130, and/or may be hosted by the server 130.
- the application 222 includes specific instructions which, when executed by processor 212-1, cause operations to be performed according to methods described herein.
- the application 222 runs on an operating system (OS) installed in client device 110.
- application 222 may run within a web browser.
- the processor 212-1 is configured to control a graphical user interface (GUI) for the user of one of the client devices 110 accessing the server 130.
- the camera 210 is a virtual camera using an AI engine that can understand the user’s body positioning and intent, which is different from existing smart cameras that simply keep the user in frame.
- the camera 210 can adjust the camera parameters based on the user’s actions, providing the best framing for the user’s activities.
- the camera 210 can work with highly realistic avatars, which could represent the user or a celebrity in a virtual environment by mimicking the appearance and behavior of real humans as closely as possible.
- the camera 210 can work with stylized avatars, which can represent the user based on artistic or cartoon-like representations.
- the camera 210 leverages body tracking to understand the user’s actions and adjust the camera 210 accordingly. This provides a new degree of freedom and control for the user, allowing for a more immersive and interactive experience.
- the camera 210 is AI-based and can be trained to understand the way to frame a user’s avatar, for example, in a video communication application such as Messenger, WhatsApp, Instagram, and the like.
- the camera 210 can leverage body tracking, action recognition, and/or scene understanding to adjust the virtual camera features (e.g., position, rotation, focal length, aperture) for framing the user’s avatar according to the context of the video call.
- the camera 210 can determine the right camera position for different scenarios such as when the user is whiteboarding versus writing at a desk (overhead camera) or exercising. Each of these scenarios would require a different setup that could be inferred if the AI engine of the camera 210 can understand the context.
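The scenario-dependent setups described above can be sketched as a lookup from a recognized activity context to virtual-camera parameters (position, rotation, focal length, aperture). The preset names and values below are hypothetical, not taken from the disclosure.

```python
# Illustrative mapping from a recognized activity context to virtual-camera
# parameters; the action-recognition step that produces the context label
# is assumed to exist elsewhere in the system.
CAMERA_PRESETS = {
    "whiteboarding": {"position": "front",    "focal_length_mm": 35, "aperture_f": 4.0},
    "desk_writing":  {"position": "overhead", "focal_length_mm": 28, "aperture_f": 2.8},
    "exercising":    {"position": "wide",     "focal_length_mm": 24, "aperture_f": 5.6},
}

def frame_for_context(context, presets=CAMERA_PRESETS):
    """Pick virtual-camera parameters for a recognized context, falling back
    to a default conversational framing when the context is unknown."""
    default = {"position": "front", "focal_length_mm": 50, "aperture_f": 2.0}
    return presets.get(context, default)
```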
- the database 252 may store data and files associated with the server 130 from the application 222.
- the client device 110 collects data, including but not limited to video and images, for upload to server 130 using the application 222, to store in the database 252.
- the server 130 includes a memory 220-2, a processor 212-2, an application program interface (API) layer 215 and communications module 218-2.
- processors 212-1 and 212-2, and memories 220-1 and 220-2, will be collectively referred to, respectively, as “processors 212” and “memories 220.”
- the processors 212 are configured to execute instructions stored in memories 220.
- memory 220-2 includes an applications engine 232.
- the applications engine 232 may be configured to perform operations and methods according to aspects of embodiments.
- the applications engine 232 may share or provide features and resources with the client device, including multiple tools associated with data, image, video collection, capture, or applications that use data, images, or video retrieved with the application engine 232 (e.g., the application 222).
- the user may access the applications engine 232 through the application 222, installed in a memory 220-1 of client device 110. Accordingly, the application 222 may be installed by server 130 and perform scripts and other routines provided by server 130 through any one of multiple tools. Execution of the application 222 may be controlled by processor 212-1.
- FIG. 3 is a block diagram illustrating examples of application 222 used by the client device of FIG. 2, according to some embodiments.
- the application 222 includes several application modules including, but not limited to, a video chat module 310, a messaging module 320 and an AI module 340.
- the video chat module 310 is responsible for operations of video chat applications such as Facebook Messenger, Zoom Meeting, Facetime, Skype, and the like and can control speakers, microphones, video recorders, audio recorders and similar devices.
- the messaging module 320 is responsible for operations of messaging applications such as WhatsApp, Facebook Messenger, Signal, Telegram and the like and can control devices such as cameras and microphones and similar devices.
- the AI module 340 may include a number of AI models.
- AI models apply different algorithms to relevant data inputs to achieve the tasks, or outputs, for which the model has been programmed.
- An AI model can be defined by its ability to autonomously make decisions or predictions, rather than simulate human intelligence. Different types of AI models are better suited for specific tasks, or domains, for which their particular decision-making logic is most useful or relevant. Complex systems often employ multiple models simultaneously, using ensemble learning techniques like bagging, boosting or stacking.
- AI models can automate decision-making, but only models capable of machine learning (ML) are able to autonomously optimize their performance over time. While all ML models are AI, not all AI involves ML. The most elementary AI models are a series of if-then-else statements, with rules programmed explicitly by a data scientist. Machine learning models use statistical AI rather than symbolic AI. Whereas rule-based AI models must be explicitly programmed, ML models are trained by applying their mathematical frameworks to a sample dataset whose data points serve as the basis for the model’s future real-world predictions.
- the subject technology can use a system consisting of one or more ML models trained over time using a large database (e.g., database 252 of FIG. 2).
- the system can be trained to learn what the face looked like when the body engaged in certain activity.
- the system can use action recognition to understand the action that the user is doing and then drive the face to imitate or infer what the user’s expression would be during these activities.
- the system can be multimodal, using both body movements and the tonality of the user’s voice to drive facial expressions.
- the system when the user is engaged in a sports activity, the system can adapt to the genre of the sport activity, changing expressions based on the activity, such as boxing.
- the system could also consider hand interactions and scene understanding to infer facial expressions to be driven.
- the output of the system is the inference of a facial expression, which could potentially be modified in post-processing steps.
- the system can return to a neutral, idle state after an intense activity, but it could also infer that the user just burned a significant number of calories and might be breathing hard or flushed.
- the system can maintain the inferred facial expression for a certain period of time after an intense activity, based on factors such as the age and weight of the user and the intensity of the workout.
- the body poses may be used to drive the facial expression, either wholesale or as an overlay.
- the system can calculate body motion velocities and understand motion vectors, to infer the strain that can be displayed on the face (e.g., squat, jump, jab or cross, kick, leap).
- the system can combine body gesture with audio expression to derive a new facial expression.
- the expressions that are additive and can maintain lip sync quality may be authored and saved by the AI module.
- the system can consider social factors, e.g., in conjunction with a social graph. For example, if a user is competing with others, they might try to suppress their expressions. The system may use the user’s social graph to attenuate the intensity of the expression. The system could also consider the expressions of other people around the person. For example, if a friend’s avatar is super happy, the user may want to support them and be happy as well. This is referred to as body mimicry. In some implementations, the system can go beyond audio-driven lip sync. For example, the system may use audio to drive facial expressions and body gestures. In some implementations, given environment awareness, the scene understanding can be used as an input for a most plausible expression. In some implementations, people or social graphs (e.g., users’ relationship to other avatars) can be used to infer expression according to relationships and historical interaction.
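The social suppression and mimicry behaviors described above can be sketched as a post-processing step on the inferred expression intensity. The 0.5 suppression factor and 0.25 mimicry blend are illustrative assumptions, not values from the disclosure.

```python
def attenuate_expression(intensity, in_competition, friend_expressions=()):
    """Adjust an inferred expression intensity (0..1) using social context:
    suppress it during competition, and nudge it toward the strongest
    expression of nearby friends' avatars (mimicry)."""
    if in_competition:
        intensity *= 0.5                       # users mask expressions when competing
    if friend_expressions:
        strongest = max(friend_expressions)    # mimic nearby friends' avatars
        intensity += 0.25 * max(0.0, strongest - intensity)
    return min(intensity, 1.0)
```

Relationship type from the social graph, or the history of interactions with a nearby user, could scale these factors per user rather than applying them uniformly.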

- FIG. 4 is a screen shot 400 illustrating an example of a facial expression inferred from a form of a hand-in-the-air body gesture, according to some embodiments.
- FIG. 4 shows several example hand-in-the-air body gestures that are self-explanatory.
- the Al module 340 of FIG. 3 can be trained with these body gestures and similar ones to infer a facial expression that is indicative of, for example, an elated, formed, delighted or excited expression.
- FIG. 5 is a screen shot 500 illustrating an example of a facial expression inferred from a form of a stop body gesture, according to some embodiments.
- stop body gestures are shown in FIG. 5. These body gestures are just examples and are self-explanatory.
- the Al module 340 of FIG. 3 can be trained with these body gestures and similar ones to infer a facial expression that is indicative of, for example, a concerned, anxious, upset, or nervous expression.
- FIG. 6 is a screen shot 600 illustrating an example of a facial expression inferred from a form of a peace-sign body gesture, according to some embodiments.
- FIG. 6 depicts multiple examples of peace-sign body gestures that are self-explanatory.
- the Al module 340 of FIG. 3 can be trained with these body gestures and similar ones to infer a facial expression that is indicative of, for example, a happy, friendly or agreeable expression.
- FIG. 7 is a screen shot 700 illustrating an example of a facial expression inferred from a form of a punching body gesture, according to some embodiments.
- several punching body gestures are shown in FIG. 7; these are just example body gestures and are self-explanatory.
- the Al module 340 of FIG. 3 can be trained with these body gestures and similar ones to infer a facial expression that is indicative of, for example, an anger, rage, or aggression expression.
- FIG. 8 is a flow diagram illustrating an example of a method 800 for inferring facial expression from body gestures, according to some embodiments.
- the method 800 includes executing, by a processor (e.g., 212-1 of FIG. 2), ML instructions (810), retrieving a first set of data from memory (e.g., 220-1 of FIG. 2) (820), and obtaining, by a communication module (e.g., 218-1 of FIG. 2), from a cloud storage, a second set of data (830). At least one of the first set of data or the second set of data includes a plurality of facial expressions and body poses.
- the ML instructions are configured to train an Al model (e.g., from 340 of FIG. 3) to infer at least one body pose based on at least one of the first set of data or the second set of data.
- FIG. 9 is a flow diagram illustrating an example of a method 900 for inferring avatar facial expressions from captured user body pose data.
- process 900 can access a first set of data comprising facial expressions and a second set of data comprising body poses. Each body pose in the second set of data can be mapped to at least one facial expression in the first set of data.
- the second set of data can be based on images or video clips of body poses and each mapping for a body pose, corresponding to an image or video clip, can be based on facial expressions determined at the time the image or video clip was captured.
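A toy version of the pose-to-expression mapping described above might look like the following; the pose and expression labels are chosen purely for illustration (the gesture families echo FIGS. 4-7):

```python
# Each body pose label is mapped to the facial expression observed
# when the corresponding image or video clip was captured.
pose_expression_map = {
    "hands_in_air": "elated",
    "stop_gesture": "concerned",
    "peace_sign":   "happy",
    "punch":        "angry",
}

def expression_for(pose, default="neutral"):
    """Look up the mapped expression, falling back to a neutral state."""
    return pose_expression_map.get(pose, default)
```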
- the one or more body pose indications can be based on images from a virtual camera that uses an AI engine to determine the user’s body positioning, and process 900 can include adjusting parameters of the virtual camera, causing the virtual camera to frame the user’s activities for improved pose capture.
- the second set of data can further include, associated with one or more of the body poses, biometric data including a heart rate or a blood pressure. In some cases, in addition to the body pose data, the second set of data can further include, associated with one or more of the body poses, voice data.
- process 900 can train, based on the mappings between the first set of data and the second set of data, an artificial-intelligence (Al) model to infer facial expressions when the Al model receives at least one or more body poses.
- the training of the Al model can further be based on associations between biometric data, from the second set, and one or more body poses mapped to facial expressions.
- the training of the AI model can further be based on associations between voice data, from the second set, and one or more body poses mapped to facial expressions.
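The disclosure leaves the model architecture open; as a stand-in, the following sketch trains a nearest-centroid classifier over combined pose and biometric feature vectors (the feature layout, labels, and the choice of classifier are all assumptions for illustration):

```python
import math

def train(samples):
    """samples: list of (feature_vector, expression_label) pairs.
    Returns a per-label centroid acting as the trained 'model'."""
    sums, counts = {}, {}
    for vec, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def infer(model, vec):
    """Infer the expression whose centroid is nearest the input features."""
    return min(model, key=lambda lab: math.dist(model[lab], vec))
```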
- process 900 can receive one or more body pose indications.
- the received one or more body pose indications are associated with biometric data and/or a voice recording.
- process 900 can apply the Al model to the one or more body pose indications and can receive, from the Al model based on the training, an inference of a facial expression.
- applying the Al model to the one or more body pose indications further includes applying the Al model to biometric data associated with the received one or more body pose indications to infer the facial expression received from the Al model.
- applying the Al model to the one or more body pose indications further includes applying the Al model to data based on a voice recording associated with the received one or more body pose indications to infer the facial expression received from the Al model.
- process 900 can cause an avatar to affect an expression based on the facial expression inferred by the Al model. For example, process 900 can cause the avatar to smile, frown, raise its eyebrows, blink, perform motions corresponding to speaking certain phonemes, etc.
- process 900 can determine an expression of one or more users in a vicinity of a user, on which the one or more body pose indications are based, where the expression affected by the avatar is further based on the determined expression of the one or more users in the vicinity of the user.
- the determining of the expression of the one or more users in a vicinity of the user can be in response to determining that the one or more users have a specified type of relationship, in a social graph, to the user or determining that there is a record of one or more historical interactions between the one or more users and the user.
- process 900 can determine that a user, on which the one or more body pose indications are based, is engaged in a competition, where the expression affected by the avatar is further based on the determining that the user is engaged in the competition.
- process 900 can identify above a threshold level of activity of a user, on which the one or more body pose indications are based and, in response to identifying above the threshold level of activity, can further cause the avatar to affect an increased activity expression.
- the increased activity expression can be one or more of: flaring nostrils, an accelerated rate of chest and/or neck breathing animation, or an altered skin tone.
- process 900 can compute a period of time based on one or more of: an age of the user, a weight of the user, a determined intensity of the activity of the user, or any combination thereof, and can identify an end of the activity of the user, where process 900 can cause the avatar to maintain the increased activity expression for the computed period of time after the end of the activity of the user.
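The disclosure names the inputs to this computation (age, weight, intensity) but not the formula; the coefficients and baselines below are purely illustrative assumptions:

```python
def cooldown_seconds(age_years, weight_kg, intensity, base=30.0):
    """Seconds to hold the exertion expression after the activity ends.
    intensity is assumed to be normalized to [0, 1]."""
    age_factor = 1.0 + max(0, age_years - 30) * 0.02     # assumed: +2% per year over 30
    weight_factor = 1.0 + max(0, weight_kg - 70) * 0.01  # assumed: +1% per kg over 70
    return base * intensity * age_factor * weight_factor
```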
- identifying the level of activity of the user is based on calculated body motion velocities and/or motion vectors for the user.
- An aspect of the subject technology is directed to a device including an XR headset comprising a processor configured to execute ML instructions, memory configured to store a first set of data and a communications module configured to access a cloud storage including a second set of data.
- the ML instructions are configured to train an Al model to infer facial expressions based on at least one of the first set of data or the second set of data.
- the first set of data and the second set of data comprise images or video clips of body poses.
- the body poses are provided by AI-powered body scanning.
- the body poses comprise body motions in at least one of a social activity or a physical activity including a sports activity or a fitness activity.
- the body poses are indicative of emotional states in one of a plurality of contexts.
- the first set of data or the second set of data further comprise audio including environment sounds, music or voice.
- the first set of data or the second set of data further comprise a measured user’s biometric data including a heart rate or a blood pressure used to indicate an intensity of a physical activity.
- the facial expressions include elated, formed, delighted or excited expressions inferred from a hand-in-the-air body gesture.
- the facial expressions include concerned, anxious, upset, or nervous expressions inferred from a form of a stop body gesture.
- the facial expressions include happy, friendly or agreeable expressions inferred from a form of a peace-sign body gesture.
- the facial expressions include anger, rage or aggression expressions inferred from a form of a punching body gesture.
- an apparatus comprising an XR headset including a processor configured to execute ML instructions, memory configured to store a first set of data and a communications module configured to access a cloud storage including a second set of data. At least one of the first set of data or the second set of data includes a plurality of facial expressions.
- the ML instructions are configured to train an Al model to infer at least one body pose based on at least one of the first set of data or the second set of data.
- the plurality of facial expressions comprises elated, formed, delighted, excited, happy, friendly, agreeable, concerned, anxious, upset, nervous, anger, rage, or aggression expressions, as well as nostril flaring, animated chest and neck breathing, or a change of skin color.
- the at least one body pose comprises one or more of a hand-in-the-air body gesture, a stop body gesture, a peace-sign body gesture and a punching body gesture.
- the at least one body pose is indicative of an emotional state in one of a plurality of contexts, wherein the at least one body pose comprises body motions in at least one of a social activity or a physical activity including a sports activity or a fitness activity.
- the first set of data or the second set of data further comprise a measured user’s biometric data including a heart rate or a blood pressure used to indicate an intensity of a physical activity.
- the first set of data or the second set of data further comprise audio including environment sounds, music or voice.
- Yet another aspect of the subject technology is directed to a method including executing, by a processor, ML instructions, retrieving a first set of data from memory, and obtaining, by a communication module, from a cloud storage a second set of data. At least one of the first set of data or the second set of data includes a plurality of facial expressions and body poses.
- the ML instructions are configured to train an Al model to infer at least one body pose based on at least one of the first set of data or the second set of data.
- the ML instructions are configured to train an Al model to infer at least one facial expression based on at least one of the first set of data or the second set of data.
- the first set of data or the second set of data further comprise a measured user’s biometric data including a heart rate or a blood pressure used to indicate an intensity of a physical activity, and audio including environment sounds, music or voice.
- FIG. 10 is a block diagram illustrating an overview of devices on which some implementations of the disclosed technology can operate.
- the devices can comprise hardware components of a device 1000 that infers avatar facial expressions from captured user body pose data.
- Device 1000 can include one or more input devices 1020 that provide input to the Processor(s) 1010 (e.g., CPU(s), GPU(s), HPU(s), etc.), notifying it of actions.
- the actions can be mediated by a hardware controller that interprets the signals received from the input device and communicates the information to the processors 1010 using a communication protocol.
- Input devices 1020 include, for example, a mouse, a keyboard, a touchscreen, an infrared sensor, a touchpad, a wearable input device, a camera- or image-based input device, a microphone, or other user input devices.
- Processors 1010 can be a single processing unit or multiple processing units in a device or distributed across multiple devices. Processors 1010 can be coupled to other hardware devices, for example, with the use of a bus, such as a PCI bus or SCSI bus.
- the processors 1010 can communicate with a hardware controller for devices, such as for a display 1030.
- Display 1030 can be used to display text and graphics. In some implementations, display 1030 provides graphical and textual visual feedback to a user.
- display 1030 includes the input device as part of the display, such as when the input device is a touchscreen or is equipped with an eye direction monitoring system. In some implementations, the display is separate from the input device.
- Examples of display devices are: an LCD display screen, an LED display screen, a projected, holographic, or augmented reality display (such as a heads-up display device or a head-mounted device), and so on.
- Other I/O devices 1040 can also be coupled to the processor, such as a network card, video card, audio card, USB, FireWire or other external device, camera, printer, speakers, CD-ROM drive, DVD drive, disk drive, or Blu-Ray device.
- the device 1000 also includes a communication device capable of communicating wirelessly or wire-based with a network node.
- the communication device can communicate with another device or a server through a network using, for example, TCP/IP protocols.
- Device 1000 can utilize the communication device to distribute operations across multiple network devices.
- the processors 1010 can have access to a memory 1050 in a device or distributed across multiple devices.
- a memory includes one or more of various hardware devices for volatile and non-volatile storage, and can include both read-only and writable memory.
- a memory can comprise random access memory (RAM), various caches, CPU registers, read-only memory (ROM), and writable non-volatile memory, such as flash memory, hard drives, floppy disks, CDs, DVDs, magnetic storage devices, tape drives, and so forth.
- a memory is not a propagating signal divorced from underlying hardware; a memory is thus non-transitory.
- Memory 1050 can include program memory 1060 that stores programs and software, such as an operating system 1062, pose-based facial expression system 1064, and other application programs 1066.
- Memory 1050 can also include data memory 1070 that can include application data, configuration data, settings, user options or preferences, etc., which can be provided to the program memory 1060 or any element of the device 1000.
- Some implementations can be operational with numerous other computing system environments or configurations.
- Examples of computing systems, environments, and/or configurations that may be suitable for use with the technology include, but are not limited to, personal computers, server computers, handheld or laptop devices, cellular telephones, wearable electronics, gaming consoles, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, or the like.
- FIG. 11 is a block diagram illustrating an overview of an environment 1100 in which some implementations of the disclosed technology can operate.
- Environment 1100 can include one or more client computing devices 1105A-D, examples of which can include device 1000.
- Client computing devices 1105 can operate in a networked environment using logical connections through network 1130 to one or more remote computers, such as a server computing device.
- server 1110 can be an edge server which receives client requests and coordinates fulfillment of those requests through other servers, such as servers 1120A-C.
- Server computing devices 1110 and 1120 can comprise computing systems, such as device 1000. Though each server computing device 1110 and 1120 is displayed logically as a single server, server computing devices can each be a distributed computing environment encompassing multiple computing devices located at the same or at geographically disparate physical locations. In some implementations, each server 1120 corresponds to a group of servers.
- Client computing devices 1105 and server computing devices 1110 and 1120 can each act as a server or client to other server/client devices.
- Server 1110 can connect to a database 1115.
- Servers 1120A-C can each connect to a corresponding database 1125A-C.
- each server 1120 can correspond to a group of servers, and each of these servers can share a database or can have their own database.
- Databases 1115 and 1125 can warehouse (e.g., store) information. Though databases 1115 and 1125 are displayed logically as single units, databases 1115 and 1125 can each be a distributed computing environment encompassing multiple computing devices, can be located within their corresponding server, or can be located at the same or at geographically disparate physical locations.
- Network 1130 can be a local area network (LAN) or a wide area network (WAN), but can also be other wired or wireless networks.
- Network 1130 may be the Internet or some other public or private network.
- Client computing devices 1105 can be connected to network 1130 through a network interface, such as by wired or wireless communication. While the connections between server 1110 and servers 1120 are shown as separate connections, these connections can be any kind of local, wide area, wired, or wireless network, including network 1130 or a separate public or private network.
- servers 1110 and 1120 can be used as part of a social network.
- the social network can maintain a social graph and perform various actions based on the social graph.
- a social graph can include a set of nodes (representing social networking system objects, also known as social objects) interconnected by edges (representing interactions, activity, or relatedness).
- a social networking system object can be a social networking system user, nonperson entity, content item, group, social networking system page, location, application, subject, concept representation or other social networking system object, e.g., a movie, a band, a book, etc.
- Content items can be any digital data such as text, images, audio, video, links, webpages, minutia (e.g.
- content items can be social network items or parts of social network items, such as posts, likes, mentions, news items, events, shares, comments, messages, other notifications, etc.
- Subjects and concepts, in the context of a social graph, comprise nodes that represent any person, place, thing, or idea.
- a social networking system can enable a user to enter and display information related to the user's interests, age/date of birth, location (e.g., longitude/latitude, country, region, city, etc.), education information, life stage, relationship status, name, a model of devices typically used, languages identified as ones the user is facile with, occupation, contact information, or other demographic or biographical information in the user's profile. Any such information can be represented, in various implementations, by a node or edge between nodes in the social graph.
- a social networking system can enable a user to upload or create pictures, videos, documents, songs, or other content items, and can enable a user to create and schedule events. Content items can be represented, in various implementations, by a node or edge between nodes in the social graph.
- a social networking system can enable a user to perform uploads or create content items, interact with content items or other users, express an interest or opinion, or perform other actions.
- a social networking system can provide various means to interact with non-user objects within the social networking system. Actions can be represented, in various implementations, by a node or edge between nodes in the social graph. For example, a user can form or join groups, or become a fan of a page or entity within the social networking system.
- a user can create, download, view, upload, link to, tag, edit, or play a social networking system object.
- a user can interact with social networking system objects outside of the context of the social networking system. For example, an article on a news web site might have a “like” button that users can click.
- the interaction between the user and the object can be represented by an edge in the social graph connecting the node of the user to the node of the object.
- a user can use location detection functionality (such as a GPS receiver on a mobile device) to “check in” to a particular location, and an edge can connect the user's node with the location's node in the social graph.
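The node-and-edge structure described above can be sketched as a small adjacency-set graph, with typed edges covering both friendships and interactions such as check-ins (the class and edge-type names are illustrative assumptions):

```python
class SocialGraph:
    """Toy undirected social graph: nodes connected by typed edges."""

    def __init__(self):
        self.edges = {}  # node -> set of (neighbor, edge_type)

    def add_edge(self, a, b, edge_type):
        """Connect two nodes with an edge of the given type (both directions)."""
        self.edges.setdefault(a, set()).add((b, edge_type))
        self.edges.setdefault(b, set()).add((a, edge_type))

    def neighbors(self, node, edge_type=None):
        """Nodes adjacent to `node`, optionally filtered by edge type."""
        return {n for n, t in self.edges.get(node, set())
                if edge_type is None or t == edge_type}
```

For example, a "check in" creates an edge between a user node and a location node, while an accepted friend request creates a "friend" edge between two user nodes.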
- a social networking system can provide a variety of communication channels to users.
- a social networking system can enable a user to email, instant message, or text/SMS message, one or more other users. It can enable a user to post a message to the user's wall or profile or another user's wall or profile. It can enable a user to post a message to a group or a fan page. It can enable a user to comment on an image, wall post or other content item created or uploaded by the user or another user. And it can allow users to interact (e.g., via their personalized avatar) with objects or other avatars in an artificial reality environment, etc.
- a user can post a status message to the user's profile indicating a current event, state of mind, thought, feeling, activity, or any other present-time relevant communication.
- a social networking system can enable users to communicate both within, and external to, the social networking system. For example, a first user can send a second user a message within the social networking system, an email through the social networking system, an email external to but originating from the social networking system, an instant message within the social networking system, an instant message external to but originating from the social networking system, provide voice or video messaging between users, or provide an artificial reality environment where users can communicate and interact via avatars or other digital representations of themselves.
- a first user can comment on the profile page of a second user, or can comment on objects associated with a second user, e.g., content items uploaded by the second user.
- Social networking systems enable users to associate themselves and establish connections with other users of the social networking system.
- two users (e.g., social graph nodes) can become friends or, “connections,” within the context of the social networking system. For example, a friend request from a “John Doe” to a “Jane Smith,” which is accepted by “Jane Smith,” is a social connection.
- the social connection can be an edge in the social graph.
- Being friends or being within a threshold number of friend edges on the social graph can allow users access to more information about each other than would otherwise be available to unconnected users. For example, being friends can allow a user to view another user's profile, to see another user's friends, or to view pictures of another user. Likewise, becoming friends within a social networking system can allow a user greater access to communicate with another user, e.g., by email (internal and external to the social networking system), instant message, text message, phone, or any other communicative interface. Being friends can allow a user access to view, comment on, download, endorse or otherwise interact with another user's uploaded content items. Establishing connections, accessing user information, communicating, and interacting within the context of the social networking system can be represented by an edge between the nodes representing two social networking system users.
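The phrase "within a threshold number of friend edges" suggests a bounded graph traversal; one plausible sketch is a breadth-first search over a friend adjacency map, stopping at the hop limit (the data layout is an assumption for illustration):

```python
from collections import deque

def within_hops(friends, start, target, max_hops):
    """True if `target` is reachable from `start` in at most `max_hops`
    friend edges, given friends as a node -> list-of-neighbors map."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return True
        if hops < max_hops:
            for nxt in friends.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return False
```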
- users with common characteristics can be considered connected (such as a soft or implicit connection) for the purposes of determining social context for use in determining the topic of communications.
- users who belong to a common network are considered connected.
- users who attend a common school, work for a common company, or belong to a common social networking system group can be considered connected.
- users with common biographical characteristics are considered connected. For example, the geographic region users were born in or live in, the age of users, the gender of users and the relationship status of users can be used to determine whether users are connected.
- users with common interests are considered connected.
- users' movie preferences, music preferences, political views, religious views, or any other interest can be used to determine whether users are connected.
- users who have taken a common action within the social networking system are considered connected.
- users who endorse or recommend a common object, who comment on a common content item, or who RSVP to a common event can be considered connected.
- a social networking system can utilize a social graph to determine users who are connected with or are similar to a particular user in order to determine or evaluate the social context between the users.
- the social networking system can utilize such social context and common attributes to facilitate content distribution systems and content caching systems to predictably select content items for caching in cache appliances associated with specific social network accounts.
- the word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. Phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, an implementation, the implementation, another implementation, some implementations, one or more implementations, an embodiment, the embodiment, another embodiment, some embodiments, one or more embodiments, a configuration, the configuration, another configuration, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof, and the like are for convenience and do not imply that a disclosure relating to such phrase(s) is essential to the subject technology or that such disclosure applies to all configurations of the subject technology.
- a disclosure relating to such phrase(s) may apply to all configurations, or one or more configurations.
- a disclosure relating to such phrase(s) may provide one or more examples.
- a phrase such as an aspect or some aspects may refer to one or more aspects and vice versa, and this applies similarly to other foregoing phrases.
- a reference to an element in the singular is not intended to mean “one and only one” unless specifically stated, but rather “one or more.”
- Pronouns in the masculine include the feminine and neuter gender (e.g., her and its) and vice versa.
- the term “some” refers to one or more.
- Underlined and/or italicized headings and subheadings are used for convenience only, do not limit the subject technology, and are not referred to in connection with the interpretation of the description of the subject technology. Relational terms such as first and second and the like may be used to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions.
- aspects of the subject matter described in this disclosure can be implemented to realize one or more of the following potential advantages.
- the described techniques may be implemented to support a range of benefits and significant advantages of the disclosed pose-based facial expression system.
Abstract
A device of the subject technology comprises an extra-reality (XR) headset including a processor configured to execute machine-learning (ML) instructions, memory configured to store a first set of data, and a communications module configured to access a cloud storage including a second set of data. The ML instructions are configured to train an artificial-intelligence (AI) model to infer facial expressions based on at least one of the first set of data or the second set of data.
Description
POSE-BASED FACIAL EXPRESSIONS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims benefit of and priority to US Provisional Patent Application No. 63/656,199, filed on 05 June 2024, and US Non-Provisional Patent Application No. 19/201,253, filed on 7 May 2025.
TECHNICAL FIELD
[0002] The present disclosure generally relates to artificial intelligence (AI) applications, and more particularly to pose-based facial expressions.
BACKGROUND
[0003] Facial expressions are a form of nonverbal communication that involves one or more motions or positions of the muscles beneath the skin of the face. These movements are believed to convey the emotional state of an individual to observers. Human faces are exquisitely capable of a vast range of expressions, such as showing fear to send signals of alarm, interest to draw others toward an opportunity, or fondness and kindness to increase closeness.
[0004] AI has revolutionized the field of body movement tracking, opening new possibilities in various sectors such as fitness, healthcare, gaming, and animation. AI-powered motion-capture and body-tracking technologies have made it possible to generate three-dimensional (3D) animations from video in seconds. These systems use AI to analyze and interpret physical movements and postures, providing valuable data regarding a user’s physical condition and progress. They are accessible and easy to use, requiring only a standard webcam or smartphone camera.
[0005] For example, in the fitness industry, AI-powered body scanning technologies are being used to track and analyze users’ exercise routines. These systems can provide real-time feedback on the user’s form and technique, helping to prevent injuries and improve workout efficiency. Also, AI-powered body tracking allows for more realistic and dynamic character movements in the field of animation and gaming. Moreover, AI-powered body posture detection and motion tracking are also being used in healthcare for enhanced exercise experiences.
SUMMARY
[0006] According to an aspect of the present invention, there is provided a method for inferring avatar facial expressions from captured user body pose data, the method comprising: accessing a first set of data comprising facial expressions and a second set of data comprising body poses, wherein each body pose in the second set of data is mapped to at least one facial expression in the first set of data; training, based on the mappings between the first set of data and the second set of data, an artificial-intelligence (Al) model to infer facial expressions when the Al model receives at least one or more body poses; receiving one or more body pose indications; applying the Al model to the one or more body pose indications and receiving, from the Al model based on the training, an inference of a facial expression; and causing an avatar to affect an expression based on the facial expression inferred by the Al model.
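The mapping-and-inference flow recited above can be illustrated, purely as a sketch, with a toy nearest-neighbor mapping from body-pose feature vectors to expression labels; the pose vectors, labels, and function names below are hypothetical stand-ins for a trained Al model, not part of the disclosed system:

```python
import math

# Hypothetical training data: each "body pose" is a small feature vector
# (e.g., normalized joint angles) mapped to a labeled facial expression.
TRAINING_POSES = {
    "hands_in_air": ([0.9, 0.8, 0.10], "excited"),
    "stop_gesture": ([0.1, 0.7, 0.90], "worried"),
    "punch":        ([0.8, 0.2, 0.95], "angry"),
}

def infer_expression(pose_vector):
    """Return the expression label of the nearest known pose (1-NN)."""
    best_label, best_dist = None, math.inf
    for vec, label in TRAINING_POSES.values():
        dist = math.dist(pose_vector, vec)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

A real implementation would replace the lookup table with a learned model trained on the mapped pose/expression datasets, but the input/output contract — pose features in, expression label out — is the same.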
[0007] Optionally, the second set of data is based on images or video clips of body poses and each mapping for a body pose, corresponding to an image or video clip, is based on facial expressions determined at the time the image or video clip was captured.
[0008] Optionally, the second set of data further comprises, associated with one or more of the body poses, biometric data including a heart rate or a blood pressure; wherein the training of the Al model is further based on the association between the biometric data and one or more body poses mapped to facial expressions; wherein the received one or more body pose indications are associated with biometric data; and wherein the applying the Al model to the one or more body pose indications further includes applying the Al model to the biometric data associated with the received one or more body pose indications to infer the facial expression received from the Al model.
[0009] Optionally, the second set of data further comprises, associated with one or more of the body poses, voice data; wherein the training of the Al model is further based on the association between the voice data and one or more body poses mapped to facial expressions; wherein the received one or more body pose indications are associated with a voice recording; and wherein the applying the Al model to the one or more body pose indications further includes applying the Al model to the data based on the voice recording associated with the received one or more body pose indications to infer the facial expression received from the Al model.
[0010] Optionally, the method further comprises: determining that a user, on which the one or more body pose indications are based, is engaged in a competition; wherein the expression affected by the avatar is further based on the determining that the user is engaged in the competition.
[0011] Optionally, the method further comprises: determining an expression of one or more users in a vicinity of a user, on which the one or more body pose indications are based; wherein the expression affected by the avatar is further based on the determined expression of the one or more users in the vicinity of the user.
[0012] Optionally, the determining the expression of the one or more users in a vicinity of the user is in response to: determining that the one or more users has a specified type of relationship, in a social graph, to the user; or determining that there is a record of one or more historical interactions between the one or more users and the user.
[0013] Optionally, the method further comprises: identifying above a threshold level of activity of a user, on which the one or more body pose indications are based; and in response to the identifying above the threshold level of activity, further causing the avatar to affect an increased activity expression comprising one or more of: flaring nostrils, an accelerated rate of chest and/or neck breathing animation, an altered skin tone, or any combination thereof.
[0014] Optionally, the method further comprises computing a period of time based on one or more of: an age of the user, a weight of the user, a determined intensity of the activity of the user, or any combination thereof; and identifying an end of the activity of the user; wherein the increased activity expression is maintained for the computed period of time after the end of the activity of the user.
[0015] Optionally, the identifying the level of activity of the user is based on calculated body motion velocities and/or motion vectors for the user.
[0016] Optionally, the one or more body pose indications are based on images from a virtual camera that uses an Al engine to determine the user’s body positioning; and wherein the method further comprises adjusting parameters of the virtual camera causing the virtual camera to frame the user’s activities for improved pose capture.

[0017] According to a further aspect of the present invention, there is provided a computer-readable storage medium storing instructions, for inferring avatar facial expressions from captured user body pose data, the instructions, when executed by a computing system, cause the computing system to: access a first set of data comprising facial expressions and a second set of data comprising body poses, wherein each body pose in the second set of data is mapped to at least one facial expression in the first set of data; train, based on the mappings between the first set of data and the second set of data, an artificial-intelligence (Al) model to infer facial expressions when the Al model receives at least one or more body poses; receive one or more body pose indications; apply the Al model to the one or more body pose indications and receive, from the Al model based on the training, an inference of a facial expression; and cause an avatar to affect an expression based on the facial expression inferred by the Al model.
[0018] Optionally, the second set of data is based on images or video clips of body poses and each mapping for a body pose, corresponding to an image or video clip, is based on facial expressions determined at the time the image or video clip was captured.
[0019] Optionally, the second set of data further comprises, associated with one or more of the body poses, biometric data including a heart rate or a blood pressure; wherein the training of the Al model is further based on the association between the biometric data and one or more body poses mapped to facial expressions; wherein the received one or more body pose indications are associated with biometric data; and wherein the applying the Al model to the one or more body pose indications further includes applying the Al model to the biometric data associated with the received one or more body pose indications to infer the facial expression received from the Al model.
[0020] Optionally, the instructions, when executed, further cause the computing system to: determine that a user, on which the one or more body pose indications are based, is engaged in a competition; wherein the expression affected by the avatar is further based on the determining that the user is engaged in the competition.
[0021] Optionally, the instructions, when executed, further cause the computing system to: determine an expression of one or more users in a vicinity of a user, on which the one or more body pose indications are based; wherein the expression affected by the avatar is further based on the determined expression of the one or more users in the vicinity of the user; and wherein the determining the expression of the one or more users in a vicinity of the user is in response to: determining that the one or more users has a specified type of relationship, in a social graph, to the user; or determining that there is a record of one or more historical interactions between the one or more users and the user.
[0022] According to a further aspect of the present invention, there is provided a computing system for inferring avatar facial expressions from captured user body pose data, the computing system comprising: one or more processors; and one or more memories storing instructions that, when executed by the one or more processors, cause the computing system to: access a first set of data comprising facial expressions and a second set of data comprising body poses, wherein each body pose in the second set of data is mapped to at least one facial expression in the first set of data; train, based on the mappings between the first set of data and the second set of data, an artificial-intelligence (Al) model to infer facial expressions when the Al model receives at least one or more body poses; receive one or more body pose indications; apply the Al model to the one or more body pose indications and receive, from the Al model based on the training, an inference of a facial expression; and cause an avatar to affect an expression based on the facial expression inferred by the Al model.
[0023] Optionally, the instructions, when executed, further cause the computing system to: identify above a threshold level of activity of a user, on which the one or more body pose indications are based; and in response to the identifying above the threshold level of activity, further cause the avatar to affect an increased activity expression comprising one or more of: flaring nostrils, an accelerated rate of chest and/or neck breathing animation, an altered skin tone, or any combination thereof.

[0024] Optionally, the instructions, when executed, further cause the computing
system to: compute a period of time based on one or more of: an age of the user, a weight of the user, a determined intensity of the activity of the user, or any combination thereof; and identify an end of the activity of the user; wherein the increased activity expression is maintained for the computed period of time after the end of the activity of the user.
[0025] Optionally, the one or more body pose indications are based on images from a virtual camera that uses an Al engine to determine the user’s body positioning; and wherein the instructions, when executed, further cause the computing system to adjust parameters of the virtual camera causing the virtual camera to frame the user’s activities for improved pose capture.
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] The accompanying drawings, which are included to provide further understanding and are incorporated in and constitute a part of this specification, illustrate disclosed embodiments and together with the description serve to explain the principles of the disclosed embodiments.
[0027] FIG. 1 is a high-level block diagram illustrating a network architecture within which some aspects of the subject technology are implemented.
[0028] FIG. 2 is a block diagram illustrating details of a system including a client device and a server, as discussed herein.
[0029] FIG. 3 is a block diagram illustrating examples of application modules used in the client device of FIG. 2, according to some embodiments.
[0030] FIG. 4 is a screen shot illustrating an example of a facial expression inferred from a form of a hand-in-the-air body gesture, according to some embodiments.
[0031] FIG. 5 is a screen shot illustrating an example of a facial expression inferred from a form of a stop body gesture, according to some embodiments.
[0032] FIG. 6 is a screen shot illustrating an example of a facial expression inferred from a form of a peace sign body gesture, according to some embodiments.
[0033] FIG. 7 is a screen shot illustrating an example of a facial expression inferred from a form of a punching body gesture, according to some embodiments.
[0034] FIG. 8 is a flow diagram illustrating an example of a method of inferring facial expression from body gestures, according to some embodiments.
[0035] FIG. 9 is a flow diagram illustrating an example of a method of inferring facial expression from body poses, according to some embodiments.
[0036] FIG. 10 is a block diagram illustrating an overview of devices on which some implementations can operate.
[0037] FIG. 11 is a block diagram illustrating an overview of an environment in which some implementations can operate.
[0038] In one or more implementations, not all of the depicted components in each figure may be required, and one or more implementations may include additional components not shown in a figure. Variations in the arrangement and type of the components may be made without departing from the scope of the subject disclosure. Additional components, different components, or fewer components may be utilized within the scope of the subject disclosure.
DETAILED DESCRIPTION
[0039] According to some embodiments, a device of the subject technology includes an extra-reality (XR) headset comprising a processor configured to execute machine-learning (ML) instructions, memory configured to store a first set of data and a communications module configured to access a cloud storage including a second set of data. The ML instructions are configured to train an Al model to infer facial expressions based on at least one of the first set of data or the second set of data.
[0040] According to some embodiments, an apparatus comprises an XR headset including a processor configured to execute ML instructions, memory configured to store a first set of data and a communications module configured to access a cloud storage including a second set of data. At least one of the first set of data or the second set of data includes a plurality of facial expressions. The ML instructions are configured to train an Al model to infer at least one body pose based on at least one of the first set of data or the second set of data.
[0041] According to some embodiments, a method of the subject technology includes executing, by a processor, ML instructions, retrieving a first set of data from memory, and obtaining, by a communication module, from a cloud storage a second set of data. At least one of the first set of data or the second set of data includes a plurality of facial expressions and body poses. The ML instructions are configured to train an Al model to infer at least one body pose based on at least one of the first set of data or the second set of data.
[0042] In the following detailed description, numerous specific details are set forth to provide a full understanding of the present disclosure. It will be apparent, however, to one ordinarily skilled in the art, that the embodiments of the present disclosure may be practiced without some of these specific details. In other instances, well-known structures and techniques have not been shown in detail so as not to obscure the disclosure.
[0043] In some aspects, the subject technology is directed to pose-based facial expressions. The disclosed technique provides capabilities for facial expression, for example, by inferring facial expression from body gestures using Al resources. The disclosed solution drives facial expression based on body tracking motions. In some aspects, the subject technology ties the facial expression to a number of features such as body pose, body motion, social context, application context. In some implementations, the above-mentioned features can be combined with audio and video tracking to better infer the facial expression.
[0044] In some aspects, the facial expression and/or appearance can be driven in a fitness activity while the user is working out or is engaged in a sport such as running, jumping, punching or any other activity that involves high velocity motions. In some aspects, the measured user’s biometric data including a heart rate or a blood pressure may be used as an indication of working out and cause the avatar to breathe heavily, for example, expressed by nostril flaring or chest and/or neck being animated. In some aspects, the indication of working out can be expressed by changing of the color of the skin of the avatar, for example, by turning the color to red to signal getting hot.
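As an illustrative sketch only, the biometric signals described above (e.g., heart rate) might be mapped to exertion cues such as nostril flaring, breathing-animation rate, and skin reddening; the field names, resting/maximum heart-rate defaults, and scaling factors below are assumptions, not values from the disclosure:

```python
def exertion_expression(heart_rate_bpm, resting_bpm=60, max_bpm=190):
    """Map a measured heart rate to hypothetical avatar exertion cues (0..1)."""
    # Normalize heart rate into a 0..1 exertion intensity, clamped to range.
    intensity = (heart_rate_bpm - resting_bpm) / (max_bpm - resting_bpm)
    intensity = max(0.0, min(1.0, intensity))
    return {
        "nostril_flare": intensity,                # drive nostril-flare blend
        "breathing_rate_scale": 1.0 + intensity,   # faster chest/neck animation
        "skin_redness": intensity * 0.6,           # shift skin tone toward red
    }
```

For example, a mid-workout heart rate yields a partially flared, faster-breathing, slightly flushed avatar, while a resting heart rate leaves the expression neutral.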
[0045] In some aspects, the facial expression can be used to drive plausible body poses by using face tracking. In this case, the body poses can change based on the facial expression. For example, a body movement indicating an activity can be driven by facial cues such as the avatar’s skin turning red, the nostrils flaring, or the chest or neck moving. Generating body motions in this way is valuable when only the face of the user is tracked, for example, by a mobile camera, while the body of the user is not in the field of view of the camera. This may happen when the user is represented by an avatar in a virtual world such as Horizon with only phone access.
[0046] Embodiments of the disclosed technology may include or be implemented in conjunction with an extra reality system. Artificial reality or extra reality (XR) is a form of reality that has been adjusted in some manner before presentation to a user, which may include, e.g., virtual reality (VR), augmented reality (AR), mixed reality (MR), hybrid reality, or some combination and/or derivatives thereof. Extra reality content may include completely generated content or generated content combined with captured content (e.g., real-world photographs). The extra reality content may include video, audio, haptic feedback,
or some combination thereof, any of which may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer). Additionally, in some embodiments, extra reality may be associated with applications, products, accessories, services, or some combination thereof, that are, e.g., used to create content in an extra reality and/or used in (e.g., perform activities in) an extra reality. The extra reality system that provides the extra reality content may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, a "cave" environment or other projection system, or any other hardware platform capable of providing extra reality content to one or more viewers.
[0047] "Virtual reality" or "VR," as used herein, refers to an immersive experience where a user's visual input is controlled by a computing system. "Augmented reality" or "AR" refers to systems where a user views images of the real world after they have passed through a computing system. For example, a tablet with a camera on the back can capture images of the real world and then display the images on the screen on the opposite side of the tablet from the camera. The tablet can process and adjust or "augment" the images as they pass through the system, such as by adding virtual objects. "Mixed reality" or "MR" refers to systems where light entering a user's eye is partially generated by a computing system and partially composes light reflected off objects in the real world. For example, a MR headset could be shaped as a pair of glasses with a pass-through display, which allows light from the real world to pass through a waveguide that simultaneously emits light from a projector in the MR headset, allowing the MR headset to present virtual objects intermixed with the real objects the user can see. "Artificial reality," "extra reality," or "XR," as used herein, refers to any of VR, AR, MR, or any combination or hybrid thereof.
[0048] Examples of additional descriptions of XR technology which may be used with the disclosed technology are provided in U.S. Patent Application No. 18/488,482, titled, “Voice-enabled Virtual Object Disambiguation and Controls in Artificial Reality,” which is herein incorporated by reference. Any patents, patent applications, and other references noted above are incorporated herein by reference. Aspects can be modified, if necessary, to employ the systems, functions, and concepts of the various references described above to provide yet
further implementations. If statements or subject matter in a document incorporated by reference conflicts with statements or subject matter of this application, then this application shall control.
[0049] Turning now to the figures, FIG. 1 is a high-level block diagram illustrating a network architecture 100 within which some aspects of the subject technology are implemented. The network architecture 100 may include servers 130 and a database 152, communicatively coupled with multiple client devices 110 via a network 150. Client devices 110 may include, but are not limited to, laptop computers, desktop computers, and the like, and/or mobile devices such as smart phones, palm devices, video players, headsets (e.g., extra-reality (XR) headsets), tablet devices, and the like.
[0050] The network 150 may include, for example, any one or more of a local area network (LAN), a wide area network (WAN), the Internet, and the like. Further, the network 150 may include, but is not limited to, any one or more of the following network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, tree or hierarchical network, and the like.
[0051] FIG. 2 is a block diagram illustrating details of a system 200 including a client device and a server, as discussed herein. The system 200 includes at least one client device 110, at least one server 130 of the network architecture 100, a database 252 and the network 150. The client device 110 and the server 130 are communicatively coupled over network 150 via respective communications modules 218-1 and 218-2 (hereinafter, collectively referred to as “communications modules 218”). Communications modules 218 are configured to interface with network 150 to send and receive information, such as requests, uploads, messages, and commands to other devices on the network 150. Communications modules 218 can be, for example, modems or Ethernet cards, and may include radio hardware and software for wireless communications (e.g., via electromagnetic radiation, such as radiofrequency (RF), nearfield communications (NFC), Wi-Fi, and Bluetooth radio technology).
[0052] The client device 110 may be coupled with an input device 214 and with an output device 216. A user may interact with the client device 110 via the input device 214 and the output device 216. Input device 214 may include a mouse, a keyboard, a pointer, a touchscreen, a microphone, a joystick, a virtual joystick, a
touchscreen display that a user may use to interact with client device 110, or the like. In some embodiments, the input device 214 may include cameras, microphones, and sensors, such as touch sensors, acoustic sensors, inertial motion units and other sensors configured to provide input data to an XR system. Output device 216 may be a screen display, a touchscreen, a speaker, and the like.
[0053] The client device 110 may also include a camera 210 (e.g., a smart camera), a processor 212-1, memory 220-1 and the communications module 218-1. The camera 210 is in communication with the processor 212-1 and the memory 220-1. The processor 212-1 is configured to execute instructions stored in a memory 220-1, and to cause the client device 110 to perform at least some operations in methods consistent with the present disclosure. The memory 220-1 may further include application 222, configured to run in the client device 110 and couple with input device 214, output device 216 and the camera 210. The application 222 may be downloaded by the user from the server 130, and/or may be hosted by the server 130. The application 222 includes specific instructions which, when executed by processor 212-1, cause operations to be performed according to methods described herein. In some embodiments, the application 222 runs on an operating system (OS) installed in client device 110. In some embodiments, application 222 may run within a web browser. In some embodiments, the processor 212-1 is configured to control a graphical user interface (GUI) for the user of one of the client devices 110 accessing the server 130.
[0054] In some embodiments, the camera 210 is a virtual camera using an Al engine that can understand the user’s body positioning and intent, which is different from existing smart cameras that simply keep the user in frame. The camera 210 can adjust the camera parameters based on the user’s actions, providing the best framing for the user’s activities. The camera 210 can work with highly realistic avatars, which could represent the user or a celebrity in a virtual environment by mimicking the appearance and behavior of real humans as closely as possible. In some embodiments, the camera 210 can work with stylized avatars, which can represent the user based on artistic or cartoon-like representations. In some embodiments, the camera 210 leverages body tracking to understand the user’s actions and adjust the camera 210 accordingly. This
provides a new degree of freedom and control for the user, allowing for a more immersive and interactive experience.
[0055] In some embodiments, the camera 210 is Al based and can be trained to understand the way to frame a user’s avatar, for example, in a video communication application such as Messenger, WhatsApp, Instagram, and the like. The camera 210 can leverage body tracking, action recognition, and/or scene understanding to adjust the virtual camera features (e.g., position, rotation, focal length, aperture) for framing the user’s avatar according to the context of the video call. For example, the camera 210 can determine the right camera position for different scenarios such as when the user is whiteboarding versus writing at a desk (overhead camera) or exercising. Each of these scenarios would require a different setup that could be inferred if the Al engine of the camera 210 can understand the context.
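A minimal sketch of the context-to-camera-parameter idea described above; the preset names and parameter values are invented for illustration, whereas a real system would derive position, rotation, focal length, and aperture from body tracking, action recognition, and scene understanding:

```python
# Hypothetical mapping from a recognized activity to virtual-camera settings.
CAMERA_PRESETS = {
    "whiteboarding": {"position": "front_wide", "focal_length_mm": 24},
    "desk_writing":  {"position": "overhead",   "focal_length_mm": 35},
    "exercising":    {"position": "full_body",  "focal_length_mm": 18},
}

def frame_for_activity(activity):
    """Pick virtual-camera parameters for the detected activity,
    falling back to a safe default framing when the context is unknown."""
    return CAMERA_PRESETS.get(
        activity, {"position": "front_wide", "focal_length_mm": 28}
    )
```

The table stands in for the Al engine’s learned policy: given a context label (e.g., "desk_writing"), it selects an overhead setup rather than a frontal one.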
[0056] The database 252 may store data and files associated with the server 130 from the application 222. In some embodiments, the client device 110 collects data, including but not limited to video and images, for upload to server 130 using the application 222, to store in the database 252.
[0057] The server 130 includes a memory 220-2, a processor 212-2, an application program interface (API) layer 215 and communications module 218-2. Hereinafter, the processors 212-1 and 212-2, and memories 220-1 and 220-2, will be collectively referred to, respectively, as “processors 212” and “memories 220.” The processors 212 are configured to execute instructions stored in memories 220. In some embodiments, memory 220-2 includes an applications engine 232. The applications engine 232 may be configured to perform operations and methods according to aspects of embodiments. The applications engine 232 may share or provide features and resources with the client device, including multiple tools associated with data, image, video collection, capture, or applications that use data, images, or video retrieved with the application engine 232 (e.g., the application 222). The user may access the applications engine 232 through the application 222, installed in a memory 220-1 of client device 110. Accordingly, the application 222 may be installed by server 130 and perform scripts and other routines provided by server 130 through any one of multiple tools. Execution of the application 222 may be controlled by processor 212-1 .
[0058] FIG. 3 is a block diagram illustrating examples of application 222 used
by the client device of FIG. 2, according to some embodiments. The application 222 includes several application modules including, but not limited to, a video chat module 310, a messaging module 320 and an Al module 340. The video chat module 310 is responsible for operations of video chat applications such as Facebook Messenger, Zoom Meeting, Facetime, Skype, and the like and can control speakers, microphones, video recorders, audio recorders and similar devices. The messaging module 320 is responsible for operations of messaging applications such as WhatsApp, Facebook Messenger, Signal, Telegram and the like and can control devices such as cameras and microphones and similar devices.
[0059] The Al module 340 may include a number of Al models. Al models apply different algorithms to relevant data inputs to achieve the tasks or outputs for which they have been programmed. An Al model can be defined by its ability to autonomously make decisions or predictions, rather than simulate human intelligence. Different types of Al models are better suited for specific tasks, or domains, for which their particular decision-making logic is most useful or relevant. Complex systems often employ multiple models simultaneously, using ensemble learning techniques such as bagging, boosting, or stacking.
[0060] Al models can automate decision-making, but only models capable of machine learning (ML) are able to autonomously optimize their performance over time. While all ML models are Al, not all Al involves ML. The most elementary Al models are a series of if-then-else statements, with rules programmed explicitly by a data scientist. Machine learning models use statistical Al rather than symbolic Al. Whereas rule-based Al models must be explicitly programmed, ML models are trained by applying their mathematical frameworks to a sample dataset whose data points serve as the basis for the model’s future real-world predictions.
[0061] The subject technology can use a system consisting of one or more ML models trained over time using a large database (e.g., database 252 of FIG. 2). In some implementations, the system can be trained to learn what the face looked like when the body engaged in certain activity. In some implementations, the system can use action recognition to understand the action that the user is doing and then drive the face to imitate or infer what the user’s expression would be during these activities. In some implementations, the system can be multimodal, using both body movements and the tonality of the user’s voice to drive facial
expressions. In some implementations, when the user is engaged in a sports activity, the system can adapt to the genre of the sport activity, changing expressions based on the activity, such as boxing.
[0062] In some implementations, the system could also consider hand interactions and scene understanding to infer facial expressions to be driven. The output of the system is the inference of a facial expression, which could potentially be modified in post-processing steps. In some implementations, the system can return to a neutral, idle state after an intense activity, but it could also infer that the user just burned a significant number of calories and might be breathing hard or flushed. In some implementations, the system can maintain the inferred facial expression for a certain period of time after an intense activity, based on factors such as the age and weight of the user and the intensity of the workout. In some implementations, the body poses may be used to drive the facial expression, either wholesale or as an overlay. In some implementations, the system can calculate body motion velocities and understand motion vectors, to infer the strain that can be displayed on the face (e.g., squat, jump, jab or cross, kick, leap). In some implementations, the system can combine body gesture with audio expression to derive a new facial expression. The expressions that are additive and can maintain lip sync quality may be authored and saved by the Al module.
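Two of the computations described above — per-joint motion speed from tracked keypoints, and the post-activity period during which the exertion expression is maintained based on age, weight, and workout intensity — can be sketched as follows; the constants and weightings are hypothetical placeholders, not disclosed values:

```python
import math

def mean_joint_speed(prev_pts, curr_pts, dt):
    """Average speed of tracked joints between two pose frames
    (distance units per second), a proxy for activity intensity."""
    speeds = [math.dist(p, c) / dt for p, c in zip(prev_pts, curr_pts)]
    return sum(speeds) / len(speeds)

def exertion_hold_seconds(age, weight_kg, intensity):
    """Hypothetical heuristic for how long the 'winded' expression persists
    after the activity ends, scaling with age, weight, and intensity (0..1)."""
    base = 10.0  # assumed baseline hold time in seconds
    return base * intensity * (1 + (age - 25) / 100) * (1 + (weight_kg - 70) / 200)
```

A higher mean joint speed would map to more visible strain on the face, and the hold time keeps the avatar breathing hard or flushed briefly after an intense workout instead of snapping back to an idle state.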
[0063] In some implementations, the system can consider social factors, e.g., in conjunction with a social graph. For example, if a user is competing with others, they might try to suppress their expressions. The system may use the user’s social graph to attenuate the intensity of the expression. The system could also consider the expressions of other people around the person. For example, if a friend’s avatar is super happy, the user may want to support them and be happy as well. This is referred to as body mimicry. In some implementations, the system can go beyond audio-driven lip sync. For example, the system may use audio to drive facial expressions and body gestures. In some implementations, given environment awareness, the scene understanding can be used as an input for a most plausible expression. In some implementations, people or social graphs (e.g., users’ relationship to other avatars) can be used to infer expression according to relationships and historical interaction.
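The social adjustments described above — suppressing expressions during competition and mimicking the mood of nearby friends — can be sketched as a simple intensity blend; the suppression factor and blend coefficients are illustrative assumptions:

```python
def adjust_expression_intensity(base_intensity, in_competition, friend_moods):
    """Hypothetical social adjustment of an inferred expression intensity (0..1).

    - Competition attenuates the expression (the user may suppress it).
    - Nearby friends' moods pull the intensity toward their average
      (the "body mimicry" behavior).
    """
    intensity = base_intensity * (0.5 if in_competition else 1.0)
    if friend_moods:
        avg = sum(friend_moods) / len(friend_moods)
        intensity = 0.7 * intensity + 0.3 * avg  # blend toward friends' mood
    return max(0.0, min(1.0, intensity))
```

In practice the competition flag and friend list would come from application context and the user’s social graph, respectively.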
[0064] FIG. 4 is a screen shot 400 illustrating an example of a facial expression inferred from a form of a hand-in-the-air body gesture, according to some
embodiments. FIG. 4 shows several example hand-in-the-air body gestures that are self-explanatory. The Al module 340 of FIG. 3 can be trained with these body gestures and similar ones to infer a facial expression that is indicative of, for example, an elated, thrilled, delighted or excited expression.
[0065] FIG. 5 is a screen shot 500 illustrating an example of a facial expression inferred from a form of a stop body gesture, according to some embodiments. Several examples of stop body gestures are shown in FIG. 5. These body gestures are just examples and are self-explanatory. The AI module 340 of FIG. 3 can be trained with these body gestures and similar ones to infer a facial expression that is indicative of, for example, a worried, anxious, upset, or nervous expression.
[0066] FIG. 6 is a screen shot 600 illustrating an example of a facial expression inferred from a form of a peace-sign body gesture, according to some embodiments. FIG. 6 depicts multiple examples of peace-sign body gestures that are self-explanatory. The AI module 340 of FIG. 3 can be trained with these body gestures and similar ones to infer a facial expression that is indicative of, for example, a happy, friendly or agreeable expression.
[0067] FIG. 7 is a screen shot 700 illustrating an example of a facial expression inferred from a form of a punching body gesture, according to some embodiments. Several examples of punching body gestures are shown in FIG. 7, which are just example body gestures and are self-explanatory. The AI module 340 of FIG. 3 can be trained with these body gestures and similar ones to infer a facial expression that is indicative of, for example, an anger, rage or aggression expression.

[0068] FIG. 8 is a flow diagram illustrating an example of a method 800 for inferring facial expression from body gestures, according to some embodiments. The method 800 includes executing, by a processor (e.g., 212-1 of FIG. 2), ML instructions (810), retrieving a first set of data from memory (e.g., 220-1 of FIG. 2) (820), and obtaining, by a communication module (e.g., 218-1 of FIG. 2), from a cloud storage a second set of data (830). At least one of the first set of data or the second set of data includes a plurality of facial expressions and body poses. The ML instructions are configured to train an AI model (e.g., from 340 of FIG. 3) to infer at least one body pose based on at least one of the first set of data or the second set of data.
[0069] FIG. 9 is a flow diagram illustrating an example of a method 900 for inferring avatar facial expressions from captured user body pose data.
[0070] At block 902, process 900 can access a first set of data comprising facial expressions and a second set of data comprising body poses. Each body pose in the second set of data can be mapped to at least one facial expression in the first set of data. In some implementations, the second set of data can be based on images or video clips of body poses, and each mapping for a body pose, corresponding to an image or video clip, can be based on facial expressions determined at the time the image or video clip was captured. In some implementations, the one or more body pose indications can be based on images from a virtual camera that uses an AI engine to determine the user’s body positioning, and process 900 can include adjusting parameters of the virtual camera, causing the virtual camera to frame the user’s activities for improved pose capture.
[0071] In some cases, in addition to the body pose data, the second set of data can further include, associated with one or more of the body poses, biometric data including a heart rate or a blood pressure. In some cases, in addition to the body pose data, the second set of data can further include, associated with one or more of the body poses, voice data.
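A minimal, assumed data layout for the two data sets described at blocks 902 and [0071] might look like the following; the field names are illustrative only, and attaching optional biometric or voice features to each mapped pose sample is one way (not the only way) to realize the description:

```python
# Hypothetical sample layout: body-pose samples mapped to facial-expression
# labels, optionally carrying biometric data, plus a helper that flattens
# mapped samples into (features, label) training pairs.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PoseSample:
    pose_keypoints: list             # flattened body-pose coordinates
    expression_label: str            # mapped facial expression
    heart_rate: Optional[float] = None
    blood_pressure: Optional[tuple] = None
    voice_features: Optional[list] = None

def build_training_pairs(samples):
    """Turn mapped samples into (features, label) pairs for training."""
    pairs = []
    for s in samples:
        features = list(s.pose_keypoints)
        if s.heart_rate is not None:
            features.append(s.heart_rate)  # fold biometrics into features
        pairs.append((features, s.expression_label))
    return pairs

data = [
    PoseSample([0.1, 0.9, 0.2], "elated", heart_rate=110.0),
    PoseSample([0.5, 0.1, 0.4], "worried"),
]
pairs = build_training_pairs(data)
print(len(pairs), pairs[0][1])  # 2 elated
```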
[0072] At block 904, process 900 can train, based on the mappings between the first set of data and the second set of data, an artificial-intelligence (AI) model to infer facial expressions when the AI model receives one or more body poses. In some implementations, the training of the AI model can further be based on associations between biometric data, from the second set, and one or more body poses mapped to facial expressions. In some cases, the training of the AI model can further be based on associations between voice data, from the second set, and one or more body poses mapped to facial expressions.
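The disclosure does not fix a model architecture, so block 904 can be sketched with any supervised classifier; a nearest-neighbour stand-in keeps the example dependency-free (a production system would presumably use a learned network instead):

```python
# Dependency-free stand-in for the AI model of block 904: memorize the
# pose-to-expression mappings, then infer by nearest training pose.
class NearestNeighborExpressionModel:
    def __init__(self):
        self._examples = []  # list of (pose_vector, expression_label)

    def train(self, mappings):
        """mappings: iterable of (pose_vector, facial_expression) pairs."""
        self._examples = list(mappings)

    def infer(self, pose_vector):
        """Return the expression mapped to the closest training pose."""
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(self._examples,
                   key=lambda ex: sq_dist(ex[0], pose_vector))[1]

model = NearestNeighborExpressionModel()
model.train([
    ([1.0, 1.0], "elated"),    # hands-in-the-air style pose
    ([0.0, 0.5], "worried"),   # stop-gesture style pose
])
print(model.infer([0.9, 1.1]))  # elated
```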
[0073] At block 906, process 900 can receive one or more body pose indications. In some cases, the received one or more body pose indications are associated with biometric data and/or a voice recording.
[0074] At block 908, process 900 can apply the AI model to the one or more body pose indications and can receive, from the AI model based on the training, an inference of a facial expression. In some implementations, applying the AI model to the one or more body pose indications further includes applying the AI model to biometric data associated with the received one or more body pose indications to infer the facial expression received from the AI model. In some cases, applying the AI model to the one or more body pose indications further includes applying the AI model to data based on a voice recording associated with the received one or more body pose indications to infer the facial expression received from the AI model.
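Blocks 906-908 can be sketched as applying a pose-derived inference together with associated biometric data. The fusion rule below (a high heart rate raising the expression’s intensity) and all thresholds are assumptions made for illustration; the disclosure only states that biometric data can inform the inference:

```python
# Hedged sketch: combine a pose-derived expression label with biometric
# context. Threshold and intensity values are invented for illustration.
from typing import Optional

def infer_expression(pose_label: str, heart_rate: Optional[float]) -> dict:
    """Combine a pose-derived expression label with biometric context."""
    result = {"expression": pose_label, "intensity": 0.5}
    if heart_rate is not None and heart_rate > 120:
        result["intensity"] = 0.9  # vigorous activity strengthens the display
    return result

print(infer_expression("excited", 140.0))
```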
[0075] At block 910, process 900 can cause an avatar to affect an expression based on the facial expression inferred by the AI model. For example, process 900 can cause the avatar to smile, frown, raise its eyebrows, blink, perform motions corresponding to speaking certain phonemes, etc.
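One plausible way (an assumption, not part of the specification) to realize block 910 is to translate an inferred expression label into avatar blendshape weights that an animation rig can consume; the blendshape names below are hypothetical:

```python
# Hypothetical mapping from inferred expression labels to avatar blendshape
# weights, scaled by an overall intensity. Names and weights are invented.
EXPRESSION_BLENDSHAPES = {
    "elated":  {"mouth_smile": 1.0, "brow_raise": 0.7, "eye_wide": 0.4},
    "worried": {"brow_furrow": 0.8, "mouth_frown": 0.5},
    "angry":   {"brow_furrow": 1.0, "jaw_clench": 0.6, "nostril_flare": 0.5},
}

def avatar_weights(expression: str, intensity: float = 1.0) -> dict:
    """Scale the expression's blendshape weights by an overall intensity."""
    shapes = EXPRESSION_BLENDSHAPES.get(expression, {})
    return {name: w * intensity for name, w in shapes.items()}

print(avatar_weights("elated", 0.5))
```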
[0076] In some implementations, process 900 can determine an expression of one or more users in a vicinity of a user, on which the one or more body pose indications are based, where the expression affected by the avatar is further based on the determined expression of the one or more users in the vicinity of the user. In some cases, determining the expression of the one or more users in the vicinity of the user is in response to determining that the one or more users has a specified type of relationship, in a social graph, to the user or determining that there is a record of one or more historical interactions between the one or more users and the user.
[0077] In some implementations, process 900 can determine that a user, on which the one or more body pose indications are based, is engaged in a competition, where the expression affected by the avatar is further based on the determining that the user is engaged in the competition. In some implementations, process 900 can identify above a threshold level of activity of a user, on which the one or more body pose indications are based and, in response to identifying above the threshold level of activity, can further cause the avatar to affect an increased activity expression. For example, the increased activity expression can be one or more of: flaring nostrils, an accelerated rate of chest and/or neck breathing animation, or an altered skin tone. In some cases, process 900 can compute a period of time based on one or more of: an age of the user, a weight of the user, a determined intensity of the activity of the user, or any combination thereof, and can identify an end of the activity of the user, where process 900 can cause the avatar to maintain the increased activity expression for the computed period of time after the end of the activity of the user. In some cases, identifying the level of activity of the user is based on calculated body motion velocities and/or motion vectors for the user.
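The hold behavior described above can be sketched as a function of the user’s age, weight, and workout intensity. The formula and constants below are invented for illustration; the disclosure only says that such factors are combined to compute the period:

```python
# Illustrative sketch: seconds to keep the flushed/heavy-breathing expression
# after activity ends. The formula and constants are assumptions.
def recovery_hold_seconds(age_years: float, weight_kg: float,
                          intensity: float) -> float:
    """Compute how long to hold the increased-activity expression.

    intensity is a normalized 0..1 workout intensity.
    """
    base = 20.0 * intensity                              # harder workouts hold longer
    age_factor = 1.0 + max(0.0, age_years - 30) / 100.0  # older users recover slower
    weight_factor = 1.0 + max(0.0, weight_kg - 70) / 200.0
    return base * age_factor * weight_factor

print(round(recovery_hold_seconds(age_years=40, weight_kg=90, intensity=0.8), 1))
```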
[0078] An aspect of the subject technology is directed to a device including an XR headset comprising a processor configured to execute ML instructions, memory configured to store a first set of data and a communications module configured to access a cloud storage including a second set of data. The ML instructions are configured to train an AI model to infer facial expressions based on at least one of the first set of data or the second set of data.
[0079] In some implementations, the first set of data and the second set of data comprise images or video clips of body poses.
[0080] In one or more implementations, the body poses are provided by AI-powered body scanning.
[0081] In some implementations, the body poses comprise body motions in at least one of a social activity or a physical activity including a sports activity or a fitness activity.
[0082] In one or more implementations, the body poses are indicative of emotional states in one of a plurality of contexts.
[0083] In some implementations, the first set of data or the second set of data further comprise audio including environment sounds, music or voice.
[0084] In one or more implementations, the first set of data or the second set of data further comprise a measured user’s biometric data including a heart rate or a blood pressure used to indicate an intensity of a physical activity.
[0085] In some implementations, the facial expressions include elated, thrilled, delighted or excited expressions inferred from a hand-in-the-air body gesture.
[0086] In one or more implementations, the facial expressions include worried, anxious, upset, or nervous expressions inferred from a form of a stop body gesture.
[0087] In some implementations, the facial expressions include happy, friendly or agreeable expressions inferred from a form of a peace-sign body gesture.
[0088] In one or more implementations, the facial expressions include anger, rage or aggression expressions inferred from a form of a punching body gesture.
[0089] Another aspect of the subject technology is directed to an apparatus comprising an XR headset including a processor configured to execute ML instructions, memory configured to store a first set of data and a communications module configured to access a cloud storage including a second set of data. At least one of the first set of data or the second set of data includes a plurality of
facial expressions. The ML instructions are configured to train an AI model to infer at least one body pose based on at least one of the first set of data or the second set of data.
[0090] In some implementations, the plurality of facial expressions comprises elated, thrilled, delighted, excited, happy, friendly, agreeable, worried, anxious, upset, nervous, anger, rage, aggression expressions, nostril flaring, chest and neck being animated or changing of a skin color.
[0091] In one or more implementations, the at least one body pose comprises one or more of a hand-in-the-air body gesture, a stop body gesture, a peace-sign body gesture and a punching body gesture.
[0092] In some implementations, the at least one body pose is indicative of an emotional state in one of a plurality of contexts, wherein the at least one body pose comprises body motions in at least one of a social activity or a physical activity including a sports activity or a fitness activity.
[0093] In one or more implementations, the first set of data or the second set of data further comprise a measured user’s biometric data including a heart rate or a blood pressure used to indicate an intensity of a physical activity.
[0094] In some implementations, the first set of data or the second set of data further comprise audio including environment sounds, music or voice.
[0095] Yet another aspect of the subject technology is directed to a method including executing, by a processor, ML instructions, retrieving a first set of data from memory, and obtaining, by a communication module, from a cloud storage a second set of data. At least one of the first set of data or the second set of data includes a plurality of facial expressions and body poses. The ML instructions are configured to train an AI model to infer at least one body pose based on at least one of the first set of data or the second set of data.
[0096] In one or more implementations, the ML instructions are configured to train an AI model to infer at least one facial expression based on at least one of the first set of data or the second set of data.
[0097] In some implementations, the first set of data or the second set of data further comprise a measured user’s biometric data including a heart rate or a blood pressure used to indicate an intensity of a physical activity, and audio including environment sounds, music or voice.
[0098] Figure 10 is a block diagram illustrating an overview of devices on
which some implementations of the disclosed technology can operate. The devices can comprise hardware components of a device 1000 that infers avatar facial expressions from captured user body pose data. Device 1000 can include one or more input devices 1020 that provide input to the Processor(s) 1010 (e.g., CPU(s), GPU(s), HPU(s), etc.), notifying it of actions. The actions can be mediated by a hardware controller that interprets the signals received from the input device and communicates the information to the processors 1010 using a communication protocol. Input devices 1020 include, for example, a mouse, a keyboard, a touchscreen, an infrared sensor, a touchpad, a wearable input device, a camera- or image-based input device, a microphone, or other user input devices.
[0099] Processors 1010 can be a single processing unit or multiple processing units in a device or distributed across multiple devices. Processors 1010 can be coupled to other hardware devices, for example, with the use of a bus, such as a PCI bus or SCSI bus. The processors 1010 can communicate with a hardware controller for devices, such as for a display 1030. Display 1030 can be used to display text and graphics. In some implementations, display 1030 provides graphical and textual visual feedback to a user. In some implementations, display 1030 includes the input device as part of the display, such as when the input device is a touchscreen or is equipped with an eye direction monitoring system. In some implementations, the display is separate from the input device. Examples of display devices are: an LCD display screen, an LED display screen, a projected, holographic, or augmented reality display (such as a heads-up display device or a head-mounted device), and so on. Other I/O devices 1040 can also be coupled to the processor, such as a network card, video card, audio card, USB, firewire or other external device, camera, printer, speakers, CD- ROM drive, DVD drive, disk drive, or Blu-Ray device.
[0100] In some implementations, the device 1000 also includes a communication device capable of communicating wirelessly or wire-based with a network node. The communication device can communicate with another device or a server through a network using, for example, TCP/IP protocols. Device 1000 can utilize the communication device to distribute operations across multiple network devices.
[0101] The processors 1010 can have access to a memory 1050 in a device or distributed across multiple devices. A memory includes one or more of various
hardware devices for volatile and non-volatile storage, and can include both read-only and writable memory. For example, a memory can comprise random access memory (RAM), various caches, CPU registers, read-only memory (ROM), and writable non-volatile memory, such as flash memory, hard drives, floppy disks, CDs, DVDs, magnetic storage devices, tape drives, and so forth. A memory is not a propagating signal divorced from underlying hardware; a memory is thus non-transitory. Memory 1050 can include program memory 1060 that stores programs and software, such as an operating system 1062, pose-based facial expression system 1064, and other application programs 1066. Memory 1050 can also include data memory 1070 that can include application data, configuration data, settings, user options or preferences, etc., which can be provided to the program memory 1060 or any element of the device 1000.
[0102] Some implementations can be operational with numerous other computing system environments or configurations. Examples of computing systems, environments, and/or configurations that may be suitable for use with the technology include, but are not limited to, personal computers, server computers, handheld or laptop devices, cellular telephones, wearable electronics, gaming consoles, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, or the like.
[0103] Figure 11 is a block diagram illustrating an overview of an environment 1100 in which some implementations of the disclosed technology can operate. Environment 1100 can include one or more client computing devices 1105A-D, examples of which can include device 1000. Client computing devices 1105 can operate in a networked environment using logical connections through network 1130 to one or more remote computers, such as a server computing device.
[0104] In some implementations, server 1110 can be an edge server which receives client requests and coordinates fulfillment of those requests through other servers, such as servers 1120A-C. Server computing devices 1110 and 1120 can comprise computing systems, such as device 1000. Though each server computing device 1110 and 1120 is displayed logically as a single server, server computing devices can each be a distributed computing environment
encompassing multiple computing devices located at the same or at geographically disparate physical locations. In some implementations, each server 1120 corresponds to a group of servers.
[0105] Client computing devices 1105 and server computing devices 1110 and 1120 can each act as a server or client to other server/client devices. Server 1110 can connect to a database 1115. Servers 1120A-C can each connect to a corresponding database 1125A-C. As discussed above, each server 1120 can correspond to a group of servers, and each of these servers can share a database or can have their own database. Databases 1115 and 1125 can warehouse (e.g., store) information. Though databases 1115 and 1125 are displayed logically as single units, databases 1115 and 1125 can each be a distributed computing environment encompassing multiple computing devices, can be located within their corresponding server, or can be located at the same or at geographically disparate physical locations.
[0106] Network 1130 can be a local area network (LAN) or a wide area network (WAN), but can also be other wired or wireless networks. Network 1130 may be the Internet or some other public or private network. Client computing devices 1105 can be connected to network 1130 through a network interface, such as by wired or wireless communication. While the connections between server 1110 and servers 1120 are shown as separate connections, these connections can be any kind of local, wide area, wired, or wireless network, including network 1130 or a separate public or private network.
[0107] In some implementations, servers 1110 and 1120 can be used as part of a social network. The social network can maintain a social graph and perform various actions based on the social graph. A social graph can include a set of nodes (representing social networking system objects, also known as social objects) interconnected by edges (representing interactions, activity, or relatedness). A social networking system object can be a social networking system user, nonperson entity, content item, group, social networking system page, location, application, subject, concept representation or other social networking system object, e.g., a movie, a band, a book, etc. Content items can be any digital data such as text, images, audio, video, links, webpages, minutia (e.g. indicia provided from a client device such as emotion indicators, status text snippets, location indictors, etc.), or other multi-media. In various
implementations, content items can be social network items or parts of social network items, such as posts, likes, mentions, news items, events, shares, comments, messages, other notifications, etc. Subjects and concepts, in the context of a social graph, comprise nodes that represent any person, place, thing, or idea.
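The social graph described above (objects as nodes, interactions and relationships as edges) can be sketched as a simple adjacency structure with a breadth-first check for whether two users are within a threshold number of friend edges; this is a simplified, assumed structure, not the production schema:

```python
# Minimal social-graph sketch: undirected friend edges plus a BFS helper
# that checks whether two users are within max_hops friend edges.
from collections import deque

class SocialGraph:
    def __init__(self):
        self.edges = {}  # node -> set of connected nodes

    def add_edge(self, a, b):
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def within_hops(self, a, b, max_hops):
        """Breadth-first search: is b within max_hops friend edges of a?"""
        frontier, seen = deque([(a, 0)]), {a}
        while frontier:
            node, hops = frontier.popleft()
            if node == b:
                return True
            if hops < max_hops:
                for nxt in self.edges.get(node, ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, hops + 1))
        return False

g = SocialGraph()
g.add_edge("john", "jane")
g.add_edge("jane", "alex")
print(g.within_hops("john", "alex", 2))  # True
print(g.within_hops("john", "alex", 1))  # False
```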
[0108] A social networking system can enable a user to enter and display information related to the user's interests, age/date of birth, location (e.g. longitude/latitude, country, region, city, etc.), education information, life stage, relationship status, name, a model of devices typically used, languages identified as ones the user is facile with, occupation, contact information, or other demographic or biographical information in the user's profile. Any such information can be represented, in various implementations, by a node or edge between nodes in the social graph. A social networking system can enable a user to upload or create pictures, videos, documents, songs, or other content items, and can enable a user to create and schedule events. Content items can be represented, in various implementations, by a node or edge between nodes in the social graph.
[0109] A social networking system can enable a user to perform uploads or create content items, interact with content items or other users, express an interest or opinion, or perform other actions. A social networking system can provide various means to interact with non-user objects within the social networking system. Actions can be represented, in various implementations, by a node or edge between nodes in the social graph. For example, a user can form or join groups, or become a fan of a page or entity within the social networking system. In addition, a user can create, download, view, upload, link to, tag, edit, or play a social networking system object. A user can interact with social networking system objects outside of the context of the social networking system. For example, an article on a news web site might have a “like” button that users can click. In each of these instances, the interaction between the user and the object can be represented by an edge in the social graph connecting the node of the user to the node of the object. As another example, a user can use location detection functionality (such as a GPS receiver on a mobile device) to “check in” to a particular location, and an edge can connect the user's node with the location's node in the social graph.
[0110] A social networking system can provide a variety of communication
channels to users. For example, a social networking system can enable a user to email, instant message, or text/SMS message one or more other users. It can enable a user to post a message to the user's wall or profile or another user's wall or profile. It can enable a user to post a message to a group or a fan page. It can enable a user to comment on an image, wall post or other content item created or uploaded by the user or another user. And it can allow users to interact (e.g., via their personalized avatar) with objects or other avatars in an artificial reality environment, etc. In some embodiments, a user can post a status message to the user's profile indicating a current event, state of mind, thought, feeling, activity, or any other present-time relevant communication. A social networking system can enable users to communicate both within, and external to, the social networking system. For example, a first user can send a second user a message within the social networking system, an email through the social networking system, an email external to but originating from the social networking system, an instant message within the social networking system, an instant message external to but originating from the social networking system, provide voice or video messaging between users, or provide an artificial reality environment where users can communicate and interact via avatars or other digital representations of themselves. Further, a first user can comment on the profile page of a second user, or can comment on objects associated with a second user, e.g., content items uploaded by the second user.

[0111] Social networking systems enable users to associate themselves and establish connections with other users of the social networking system. When two users (e.g., social graph nodes) explicitly establish a social connection in the social networking system, they become “friends” (or, “connections”) within the context of the social networking system.
For example, a friend request from a “John Doe” to a “Jane Smith,” which is accepted by “Jane Smith,” is a social connection. The social connection can be an edge in the social graph. Being friends or being within a threshold number of friend edges on the social graph can allow users access to more information about each other than would otherwise be available to unconnected users. For example, being friends can allow a user to view another user's profile, to see another user's friends, or to view pictures of another user. Likewise, becoming friends within a social networking system can allow a user greater access to communicate with another user, e.g., by email (internal and external to the social networking system), instant message, text message, phone,
or any other communicative interface. Being friends can allow a user access to view, comment on, download, endorse or otherwise interact with another user's uploaded content items. Establishing connections, accessing user information, communicating, and interacting within the context of the social networking system can be represented by an edge between the nodes representing two social networking system users.
[0112] In addition to explicitly establishing a connection in the social networking system, users with common characteristics can be considered connected (such as a soft or implicit connection) for the purposes of determining social context for use in determining the topic of communications. In some embodiments, users who belong to a common network are considered connected. For example, users who attend a common school, work for a common company, or belong to a common social networking system group can be considered connected. In some embodiments, users with common biographical characteristics are considered connected. For example, the geographic region users were born in or live in, the age of users, the gender of users and the relationship status of users can be used to determine whether users are connected. In some embodiments, users with common interests are considered connected. For example, users' movie preferences, music preferences, political views, religious views, or any other interest can be used to determine whether users are connected. In some embodiments, users who have taken a common action within the social networking system are considered connected. For example, users who endorse or recommend a common object, who comment on a common content item, or who RSVP to a common event can be considered connected. A social networking system can utilize a social graph to determine users who are connected with or are similar to a particular user in order to determine or evaluate the social context between the users. The social networking system can utilize such social context and common attributes to facilitate content distribution systems and content caching systems to predictably select content items for caching in cache appliances associated with specific social network accounts.
[0113] In some implementations, the word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or
advantageous over other embodiments. Phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, an implementation, the implementation, another implementation, some implementations, one or more implementations, an embodiment, the embodiment, another embodiment, some embodiments, one or more embodiments, a configuration, the configuration, another configuration, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof and the like are for convenience and do not imply that a disclosure relating to such phrase(s) is essential to the subject technology or that such disclosure applies to all configurations of the subject technology. A disclosure relating to such phrase(s) may apply to all configurations, or one or more configurations. A disclosure relating to such phrase(s) may provide one or more examples. A phrase such as an aspect or some aspects may refer to one or more aspects and vice versa, and this applies similarly to the other foregoing phrases.
[0114] A reference to an element in the singular is not intended to mean “one and only one” unless specifically stated, but rather “one or more.” Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. The term “some” refers to one or more. Underlined and/or italicized headings and subheadings are used for convenience only, do not limit the subject technology, and are not referred to in connection with the interpretation of the description of the subject technology. Relational terms such as first and second and the like may be used to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions. All structural and functional equivalents to the elements of the various configurations described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and intended to be encompassed by the subject technology. Moreover, nothing disclosed herein is intended to be dedicated to the public, regardless of whether such disclosure is explicitly recited in the above description. No clause element is to be construed under the provisions of 35 U.S.C. §112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or, in the case of a method clause, the element is recited using the phrase “step for.”
[0115] While this specification contains many specifics, these should not be
construed as limitations on the scope of what may be described, but rather as descriptions of particular implementations of the subject matter. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially described as such, one or more features from a described combination can in some cases be excised from the combination, and the described combination may be directed to a sub-combination or variation of a sub-combination.
[0116] The subject matter of this specification has been described in terms of particular aspects, but other aspects can be implemented and are within the scope of the following clauses. For example, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. The actions recited in the clauses can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the aspects described above should not be understood as requiring such separation in all aspects, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0117] The title, background, brief description of the drawings, abstract, and drawings are hereby incorporated into the disclosure and are provided as illustrative examples of the disclosure, not as restrictive descriptions. It is submitted with the understanding that they will not be used to limit the scope or meaning of the clauses. In addition, in the detailed description, it can be seen that the description provides illustrative examples, and the various features are grouped together in various implementations for the purpose of streamlining the disclosure. The method of disclosure is not to be interpreted as reflecting an
intention that the described subject matter requires more features than are expressly recited in each clause. Rather, as the clauses reflect, inventive subject matter lies in less than all features of a single disclosed configuration or operation. The clauses are hereby incorporated into the detailed description, with each clause standing on its own as a separately described subject matter.
[0118] Aspects of the subject matter described in this disclosure can be implemented to realize one or more of the following potential advantages. The described techniques may be implemented to support a range of benefits of the disclosed system, such as driving avatar facial expressions from captured body pose data when direct facial capture is unavailable or impractical.
[0119] As used herein, the phrase “at least one of” preceding a series of items, with the terms “and” or “or” to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item).
[0120] To the extent that the term “include,” “have,” or the like is used in the description or the claims, such term is intended to be inclusive in a manner similar to the term “comprise” as “comprise” is interpreted when employed as a transitional word in a claim.
[0121] A reference to an element in the singular is not intended to mean “one and only one” unless specifically stated, but rather “one or more.” All structural and functional equivalents to the elements of the various configurations described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and intended to be encompassed by the subject technology. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the above description.
[0122] While this specification contains many specifics, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of particular implementations of the subject matter. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting
in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Claims
1. A method for inferring avatar facial expressions from captured user body pose data, the method comprising:
   accessing a first set of data comprising facial expressions and a second set of data comprising body poses, wherein each body pose in the second set of data is mapped to at least one facial expression in the first set of data;
   training, based on the mappings between the first set of data and the second set of data, an artificial-intelligence (AI) model to infer facial expressions when the AI model receives at least one or more body poses;
   receiving one or more body pose indications;
   applying the AI model to the one or more body pose indications and receiving, from the AI model based on the training, an inference of a facial expression; and
   causing an avatar to affect an expression based on the facial expression inferred by the AI model.
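By way of a non-limiting illustration of the pipeline recited in claim 1 (access mapped data sets, infer an expression from a new pose indication, apply it to an avatar), the following sketch substitutes a toy nearest-neighbour lookup for the trained AI model; all data values and names are hypothetical and are not taken from the disclosure:

```python
import math

# Toy training data: each body pose (a flat keypoint vector) is mapped
# to a facial-expression label, mirroring the two data sets of claim 1.
POSE_TO_EXPRESSION = {
    (0.0, 0.0, 1.0, 1.0): "sad",      # slumped posture
    (1.0, 1.0, 0.0, 0.0): "joyful",   # arms raised
    (0.5, 0.5, 0.5, 0.5): "neutral",  # relaxed stance
}

def infer_expression(pose_indication):
    """Stand-in for the trained model: nearest-neighbour lookup over the
    pose -> expression mappings."""
    nearest = min(POSE_TO_EXPRESSION,
                  key=lambda pose: math.dist(pose, pose_indication))
    return POSE_TO_EXPRESSION[nearest]

def apply_to_avatar(avatar, pose_indication):
    """Cause the avatar to affect the inferred expression."""
    avatar["expression"] = infer_expression(pose_indication)
    return avatar

avatar = apply_to_avatar({"expression": None}, (0.9, 1.1, 0.1, 0.0))
```

A production system would replace the lookup with a model trained on the mappings, but the data flow (pose indication in, expression out, avatar updated) is the same.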
2. The method of claim 1, wherein the second set of data is based on images or video clips of body poses and each mapping for a body pose, corresponding to an image or video clip, is based on facial expressions determined at the time the image or video clip was captured.
3. The method of any preceding claim, and any one or more of:
   a) wherein the second set of data further comprises, associated with one or more of the body poses, biometric data including a heart rate or a blood pressure;
   wherein the training of the AI model is further based on the association between the biometric data and one or more body poses mapped to facial expressions;
   wherein the received one or more body pose indications are associated with biometric data; and
   wherein the applying the AI model to the one or more body pose indications further includes applying the AI model to the biometric data associated with the received one or more body pose indications to infer the facial expression received from the AI model; or
   b) wherein the second set of data further comprises, associated with one or more of the body poses, voice data;
   wherein the training of the AI model is further based on the association between the voice data and one or more body poses mapped to facial expressions;
   wherein the received one or more body pose indications are associated with a voice recording; and
   wherein the applying the AI model to the one or more body pose indications further includes applying the AI model to the data based on the voice recording associated with the received one or more body pose indications to infer the facial expression received from the AI model.
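The optional modalities of claim 3 can be read as additional input features supplied to the model alongside the pose vector. The following is a minimal, non-limiting sketch; the normalization constants and default values are assumptions, as the disclosure specifies no feature encoding:

```python
def build_feature_vector(pose, heart_rate=None, blood_pressure=None,
                         voice_energy=None):
    """Concatenate a pose vector with optional biometric and voice features,
    in the spirit of claim 3. Absent modalities fall back to neutral
    defaults so the same model input shape is preserved."""
    features = list(pose)
    features.append((heart_rate or 70.0) / 200.0)        # normalized heart rate
    features.append((blood_pressure or 120.0) / 200.0)   # normalized systolic BP
    features.append(voice_energy if voice_energy is not None else 0.0)
    return features
```

An elevated heart rate thus shifts the model input even when the pose alone is ambiguous, which is the point of associating biometric data with pose indications.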
4. The method of any preceding claim, further comprising:
   determining that a user, on which the one or more body pose indications are based, is engaged in a competition;
   wherein the expression affected by the avatar is further based on the determining that the user is engaged in the competition.
5. The method of any preceding claim, further comprising:
   determining an expression of one or more users in a vicinity of a user, on which the one or more body pose indications are based;
   wherein the expression affected by the avatar is further based on the determined expression of the one or more users in the vicinity of the user, in which case optionally the determining the expression of the one or more users in a vicinity of the user is in response to:
   determining that the one or more users has a specified type of relationship, in a social graph, to the user; or
   determining that there is a record of one or more historical interactions between the one or more users and the user.
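The optional gating conditions of claim 5 (a specified social-graph relationship, or a record of historical interactions) can be sketched as a simple predicate. All identifiers and data below are hypothetical placeholders, not structures from the disclosure:

```python
# Hypothetical stand-ins for a social graph and an interaction record.
SOCIAL_GRAPH = {("alice", "bob"): "friend"}    # directed edges with a type
INTERACTION_LOG = {("alice", "carol"): 3}      # historical interaction counts

def should_consider_nearby_user(user, nearby_user,
                                required_relationship="friend"):
    """Return True if the nearby user's expression should influence the
    avatar, per either optional condition of claim 5."""
    edge = (user, nearby_user)
    has_relationship = SOCIAL_GRAPH.get(edge) == required_relationship
    has_history = INTERACTION_LOG.get(edge, 0) > 0
    return has_relationship or has_history
```

Only when the predicate holds would the system go on to determine the nearby user's expression and factor it into the avatar's expression.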
6. The method of any preceding claim, further comprising:
   identifying above a threshold level of activity of a user, on which the one or more body pose indications are based; and
   in response to the identifying above the threshold level of activity, further causing the avatar to affect an increased activity expression comprising one or more of: flaring nostrils, an accelerated rate of chest and/or neck breathing animation, an altered skin tone, or any combination thereof.
7. The method of claim 6, further comprising:
   computing a period of time based on one or more of: an age of the user, a weight of the user, a determined intensity of the activity of the user, or any combination thereof; and
   identifying an end of the activity of the user;
   wherein the increased activity expression is maintained for the computed period of time after the end of the activity of the user.
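Claim 7's computed period can be read as a recovery window that scales with age, weight, and activity intensity. The disclosure fixes no formula, so every coefficient in the following sketch is an illustrative assumption:

```python
def recovery_period_seconds(age_years, weight_kg, intensity):
    """Compute how long the increased-activity expression is maintained
    after the end of the activity, per claim 7.
    intensity: normalized activity intensity in [0, 1].
    All coefficients are hypothetical, not from the disclosure."""
    base = 30.0                                           # baseline recovery, s
    age_factor = 1.0 + max(0.0, age_years - 30) * 0.02    # slower over 30
    weight_factor = 1.0 + max(0.0, weight_kg - 70) * 0.01 # slower over 70 kg
    return base * age_factor * weight_factor * (0.5 + intensity)

# After the end of the activity is identified, the avatar would keep the
# flaring-nostril / accelerated-breathing animation for this many seconds.
cooldown = recovery_period_seconds(age_years=45, weight_kg=80, intensity=0.8)
```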
8. The method of claim 6 or 7, wherein the identifying the level of activity of the user is based on calculated body motion velocities and/or motion vectors for the user.
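The "calculated body motion velocities and/or motion vectors" of claim 8 can be sketched as finite differences over tracked keypoints between frames; the threshold value and units below are assumptions, as the disclosure does not fix them:

```python
def motion_vectors(prev_keypoints, cur_keypoints, dt):
    """Per-keypoint motion vectors between two pose frames (units/second)."""
    return [((cx - px) / dt, (cy - py) / dt)
            for (px, py), (cx, cy) in zip(prev_keypoints, cur_keypoints)]

def activity_level(prev_keypoints, cur_keypoints, dt):
    """Mean keypoint speed, used as a scalar activity level (claim 8)."""
    vecs = motion_vectors(prev_keypoints, cur_keypoints, dt)
    speeds = [(vx * vx + vy * vy) ** 0.5 for vx, vy in vecs]
    return sum(speeds) / len(speeds)

ACTIVITY_THRESHOLD = 2.0  # assumed units/second; the disclosure fixes no value

def is_above_threshold(prev_keypoints, cur_keypoints, dt):
    """Identify above-threshold activity (claim 6) from two pose frames."""
    return activity_level(prev_keypoints, cur_keypoints, dt) > ACTIVITY_THRESHOLD
```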
9. The method of any preceding claim, wherein the one or more body pose indications are based on images from a virtual camera that uses an Al engine to determine the user’s body positioning; and wherein the method further comprises adjusting parameters of the virtual camera causing the virtual camera to frame the user’s activities for improved pose capture.
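The camera-parameter adjustment of claim 9 (framing the user's activities for improved pose capture) amounts to fitting the virtual camera's view to the detected keypoints. A minimal sketch follows; the padding ratio and crop-tuple convention are assumptions, not details from the disclosure:

```python
def frame_user(keypoints, pad_ratio=0.2):
    """Return virtual-camera crop parameters (x, y, width, height) that
    bound the user's keypoints with padding, so moving limbs stay in frame."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    pad_x, pad_y = w * pad_ratio, h * pad_ratio
    return (min(xs) - pad_x, min(ys) - pad_y,
            w + 2 * pad_x, h + 2 * pad_y)

# Re-run per frame: as the user's pose changes, the crop tracks the activity.
crop = frame_user([(10, 20), (50, 100), (30, 60)])
```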
10. A computer-readable storage medium storing instructions for inferring avatar facial expressions from captured user body pose data that, when executed by a computing system, cause the computing system to:
   access a first set of data comprising facial expressions and a second set of data comprising body poses, wherein each body pose in the second set of data is mapped to at least one facial expression in the first set of data;
   train, based on the mappings between the first set of data and the second set of data, an artificial-intelligence (AI) model to infer facial expressions when the AI model receives at least one or more body poses;
   receive one or more body pose indications;
   apply the AI model to the one or more body pose indications and receive, from the AI model based on the training, an inference of a facial expression; and
   cause an avatar to affect an expression based on the facial expression inferred by the AI model.
11. The computer-readable storage medium of claim 10, and any one or more of:
   a) wherein the second set of data is based on images or video clips of body poses and each mapping for a body pose, corresponding to an image or video clip, is based on facial expressions determined at the time the image or video clip was captured; or
   b) wherein the second set of data further comprises, associated with one or more of the body poses, biometric data including a heart rate or a blood pressure;
   wherein the training of the AI model is further based on the association between the biometric data and one or more body poses mapped to facial expressions;
   wherein the received one or more body pose indications are associated with biometric data; and
   wherein the applying the AI model to the one or more body pose indications further includes applying the AI model to the biometric data associated with the received one or more body pose indications to infer the facial expression received from the AI model.
12. The computer-readable storage medium of claim 10 or 11, and any one or more of:
   a) wherein the instructions, when executed, further cause the computing system to:
   determine that a user, on which the one or more body pose indications are based, is engaged in a competition;
   wherein the expression affected by the avatar is further based on the determining that the user is engaged in the competition; or
   b) wherein the instructions, when executed, further cause the computing system to:
   determine an expression of one or more users in a vicinity of a user, on which the one or more body pose indications are based;
   wherein the expression affected by the avatar is further based on the determined expression of the one or more users in the vicinity of the user; and
   wherein the determining the expression of the one or more users in a vicinity of the user is in response to:
   determining that the one or more users has a specified type of relationship, in a social graph, to the user; or
   determining that there is a record of one or more historical interactions between the one or more users and the user.
13. A computing system for inferring avatar facial expressions from captured user body pose data, the computing system comprising:
   one or more processors; and
   one or more memories storing instructions that, when executed by the one or more processors, cause the computing system to:
   access a first set of data comprising facial expressions and a second set of data comprising body poses, wherein each body pose in the second set of data is mapped to at least one facial expression in the first set of data;
   train, based on the mappings between the first set of data and the second set of data, an artificial-intelligence (AI) model to infer facial expressions when the AI model receives at least one or more body poses;
   receive one or more body pose indications;
   apply the AI model to the one or more body pose indications and receive, from the AI model based on the training, an inference of a facial expression; and
   cause an avatar to affect an expression based on the facial expression inferred by the AI model.
14. The computing system of claim 13, and any one or more of:
   a) wherein the instructions, when executed, further cause the computing system to:
   identify above a threshold level of activity of a user, on which the one or more body pose indications are based; and
   in response to the identifying above the threshold level of activity, further cause the avatar to affect an increased activity expression comprising one or more of: flaring nostrils, an accelerated rate of chest and/or neck breathing animation, an altered skin tone, or any combination thereof; or
   b) wherein the instructions, when executed, further cause the computing system to:
   compute a period of time based on one or more of: an age of the user, a weight of the user, a determined intensity of the activity of the user, or any combination thereof; and
   identify an end of the activity of the user;
   wherein the increased activity expression is maintained for the computed period of time after the end of the activity of the user.
15. The computing system of claim 13 or 14, wherein the one or more body pose indications are based on images from a virtual camera that uses an AI engine to determine the user’s body positioning; and wherein the instructions, when executed, further cause the computing system to adjust parameters of the virtual camera, causing the virtual camera to frame the user’s activities for improved pose capture.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463656199P | 2024-06-05 | 2024-06-05 | |
| US63/656,199 | 2024-06-05 | ||
| US19/201,253 US20250378616A1 (en) | 2024-06-05 | 2025-05-07 | Pose-Based Facial Expressions |
| US19/201,253 | 2025-05-07 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025254920A1 true WO2025254920A1 (en) | 2025-12-11 |
Family
ID=96091410
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2025/031396 Pending WO2025254920A1 (en) | 2024-06-05 | 2025-05-29 | Pose-based facial expressions |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025254920A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20150310263A1 (en) * | 2014-04-29 | 2015-10-29 | Microsoft Corporation | Facial expression tracking |
| US20190108668A1 (en) * | 2016-03-11 | 2019-04-11 | Sony Interactive Entertainment Europe Limited | Virtual Reality |
| US20230410398A1 (en) * | 2022-06-20 | 2023-12-21 | The Education University Of Hong Kong | System and method for animating an avatar in a virtual world |
Non-Patent Citations (1)
| Title |
|---|
| JOSEPH T KIDER JR ET AL: "A data-driven appearance model for human fatigue", COMPUTER ANIMATION, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 5 August 2011 (2011-08-05), pages 119 - 128, XP058007014, ISBN: 978-1-4503-0923-3, DOI: 10.1145/2019406.2019423 * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25731936; Country of ref document: EP; Kind code of ref document: A1 |