WO2023230638A2 - Reduced-latency communication using behavior prediction - Google Patents
Reduced-latency communication using behavior prediction
- Publication number
- WO2023230638A2 (PCT/US2023/073582)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- user
- data
- participating
- behavior
- time
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L12/00—Data switching networks
- H04L12/02—Details
- H04L12/16—Arrangements for providing special services to substations
- H04L12/18—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
- H04L12/1813—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
- H04L12/1827—Network arrangements for conference optimisation or adaptation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/40—Support for services or applications
- H04L65/403—Arrangements for multi-party communication, e.g. for conferences
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/60—Network streaming of media packets
- H04L65/75—Media network packet handling
- H04L65/765—Media network packet handling intermediate
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/80—Responding to QoS
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/478—Supplemental services, e.g. displaying phone caller identification, shopping application
- H04N21/4788—Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L12/00—Data switching networks
- H04L12/02—Details
- H04L12/16—Arrangements for providing special services to substations
- H04L12/18—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
- H04L12/1813—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
- H04L12/1822—Conducting the conference, e.g. admission, detection, selection or grouping of participants, correlating users to one or more conference sessions, prioritising transmission
Abstract
A system and method for low-latency communication in networked applications. User behaviors of participants in a networked meeting or virtual environment application are captured at a time T for use in the networked application. A next or future behavior at a time greater than T is predicted, and the prediction data is forwarded to the processing devices of other users in the application. The prediction data is used to render the behaviors of the user on the processing devices of those other users. The capture and creation of prediction data can be performed continuously or limited to cases where significant latency is detected.
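The loop summarized above (capture at time T, predict a behavior for a time greater than T, forward the prediction data, repeat) can be sketched in code. The following Python fragment is an illustrative sketch only; the names `session`, `predictor`, `capture_user_data`, `predict`, and `send_to_participants`, and the fixed horizon `delta`, are hypothetical placeholders and are not part of the disclosure.

```python
# Illustrative sketch of the capture/predict/forward loop described in the
# abstract. All object and method names here are hypothetical placeholders.
def prediction_loop(session, predictor, delta):
    """Continuously capture, predict, and forward one user's behavior."""
    while session.user_is_participating():
        # Receive user data containing a behavior captured at time T.
        t, user_data = session.capture_user_data()

        # Predict the user's next behavior at a time greater than T
        # (here T + delta, chosen to cover transmission and processing delay).
        predicted_behavior = predictor.predict(user_data, horizon=delta)

        # Generate prediction data and transmit it to the processing
        # devices of the other participating users.
        prediction_data = {"timestamp": t + delta, "behavior": predicted_behavior}
        session.send_to_participants(prediction_data)
```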
Description
REDUCED-LATENCY COMMUNICATION USING BEHAVIOR PREDICTION Inventors: Ning Lu, Liang Peng, Hong Heather Yu FIELD [0001] The disclosure generally relates to improving the quality of audio/visual interaction in virtual environments over communication networks. BACKGROUND [0002] The expanded use of real-time video conferencing and virtual world applications requires significant data bandwidth for the user devices involved. Video conferencing typically involves group video conferences with two or more attendees who can all see and communicate with each other in real time. Virtual world applications allow representations of users to move about and communicate in a virtual environment and interact with other users. Such virtual worlds are the basis for the creation of a “metaverse,” generally defined as a single, shared, immersive, persistent, 3D virtual space where humans experience life in ways they could not in the physical world. Typically, a user participates in the virtual world or metaverse through the use of a virtual representation of themselves, sometimes referred to as an avatar -- an icon or figure representing a particular person in the virtual world. Each such application is a networked application, transmitting information between devices over a public or private network, or a combination of public and private networks, where the transmission of data introduces latency into the application. [0003] Users can move about within the virtual world by any number of control mechanisms. Applications have been developed to allow image and audio capture devices to provide capture data of the user to control movement and communication
within the virtual world. Such capture data can include video, audio, and three-dimensional image data, which is communicated to a service provider processing device and/or other users. As such capture data is constantly being updated in real time, the user experience in such applications is greatly dependent upon the bandwidth of each of the users. SUMMARY [0004] One general aspect includes a computer implemented method of reducing latency in a networked application. The computer implemented method includes receiving user data for at least one of the users participating in a networked meeting application, the user data including a behavior occurring at a time T. The method includes predicting a next behavior for the user at a time greater than T based on the user data. The method also includes generating user prediction data including the next behavior at the time greater than T; transmitting the user prediction data to other users participating in the networked meeting application; and continuously repeating the receiving, predicting, generating, and transmitting while the at least one user is participating in the networked application. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. [0005] Implementations may include one or more of the following features. Optionally, in any preceding aspect, the computer implemented method includes the predicting being performed by a neural network. Optionally, in any preceding aspect, the computer implemented method further includes predicting a next environment change for an environment of the user at a time greater than T; generating the user prediction data including the environment change at the time greater than T; and transmitting the user prediction data including the environment change to other users participating in the networked meeting application. Optionally, in any preceding aspect, the computer implemented method includes determining if the prediction data for the behavior matches the actual behavior received for user data received at the time greater than T and, if the prediction data does not match, generating corrected data. Optionally, in
any preceding aspect, the computer implemented method includes each participating user having an associated processing device and where the method is performed on a service host communicating with each associated processing device. Optionally, in any preceding aspect, the computer implemented method includes the predicting may include predicting a future behavior in the form of a video image, text, or audio sound (or environmental changes) which is a number N frames ahead of frames in the captured data. Optionally, in any preceding aspect, the computer implemented method includes predicting a time Δ ahead, where Δ multiplied by the video framerate is a tunable variable of the prediction engine. Optionally, in any preceding aspect, the computer implemented method includes each user having an associated processing device and where the method is performed on one or more of the participating user’s processing devices. Optionally, in any preceding aspect, the computer implemented method includes the networked meeting application comprising a virtual environment application where a representation of all participating users is rendered in a virtual environment on a processing device associated with the participating user, and the method further includes: calculating an artificial delay; the predicting may include predicting a next behavior at time T plus the artificial delay; and the transmitting may include generating user prediction data at the time T plus the artificial delay. Optionally, in any preceding aspect, the computer implemented method includes a behavior which is any one or more of: human motion, speech, facial expressions, and action. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium. [0006] One general aspect includes a user equipment device. The user equipment device includes a storage medium which may include computer instructions. The device also includes a display device. The device also includes one or more processors coupled to communicate with the storage medium, where the one or more processors execute the instructions to cause the system to: receive user data for at least one of the users participating in a networked meeting application, the user data including a behavior occurring at a time T; based on the user data, predict a next behavior for the user at a time greater than T; generate user prediction data including the next behavior at the time greater
than T; transmit the user prediction data to other users participating in the networked meeting application; and continuously repeat the receiving, predicting, generating, and transmitting while the at least one user is participating in the networked application. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. [0007] Implementations may include one or more of the following features. Optionally, in any preceding aspect the one or more processors execute the instructions to cause the device to predict using a neural network. Optionally, in any preceding aspect the behavior is any one or more of: human motion, speech, facial expressions, and action. Optionally, in any preceding aspect the one or more processors execute the instructions to cause the device to predict a next environment change for an environment of the user at a time greater than T; generate the user prediction data including the environment change at the time greater than T; and transmit the user prediction data including the environment change to other users participating in the networked meeting application. Optionally, in any preceding aspect the one or more processors execute the instructions to cause the device to further determine if the prediction data for the behavior matches the actual behavior received for user data received at the time greater than T and, if the prediction data does not match, generate corrected data. Optionally, in any preceding aspect each participating user has an associated processing device and the one or more processors execute the instructions on a service host communicating with each associated processing device. Optionally, in any preceding aspect the one or more processors execute the instructions to cause the device to predict a future behavior in the form of a video image or audio sound which is a number N frames ahead of frames in the captured data. Optionally, in any preceding aspect the one or more processors execute the instructions to cause the device to predict a time Δ ahead, where Δ multiplied by the video framerate is a tunable variable of a prediction engine. Optionally, in any preceding aspect each user has an associated processing device and wherein the one or more processors execute the instructions on one or more of the participating user’s processing devices. Optionally, in any preceding aspect the networked meeting
application may include a virtual environment application where a representation of all participating users is rendered in a virtual environment on a processing device associated with the participating user, and where the instructions further include: calculating an artificial delay; the predicting may include predicting a next behavior at time T plus the artificial delay; and the transmitting comprises generating user prediction data at the time T plus the artificial delay. [0008] Another aspect includes a non-transitory computer-readable medium storing computer instructions for rendering a representation of a user using user data transmitted over a network. The non-transitory computer-readable medium storing computer instructions includes instructions for receiving user data for at least one of the users participating in a networked meeting application, the user data including a behavior occurring at a time T. The instructions also include predicting a next behavior for the user at a time greater than T based on the user data. The instructions also include generating user prediction data including the next behavior at the time greater than T; transmitting the user prediction data to other users participating in the networked meeting application; and continuously repeating the receiving, predicting, generating, and transmitting while the at least one user is participating in the networked application. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods. [0009] Implementations may include one or more of the following features. Optionally, in any preceding aspect the non-transitory computer-readable medium includes computer instructions where the predicting is performed by a neural network. Optionally, in any preceding aspect the behavior is any one or more of: human motion, speech, facial expressions, and action. Optionally, in any preceding aspect the non-transitory computer-readable medium includes computer instructions to cause the device to perform predicting a next environment change for an environment of the user at a time greater than T; generating the user prediction data including the environment change at the time greater than T; and transmitting the user prediction data including the environment change to other users participating in the networked meeting application. Optionally, in any preceding aspect the non-transitory computer-readable
medium includes computer instructions for determining if the prediction data for the behavior matches the actual behavior received for user data received at the time greater than T and, if the prediction data does not match, generating corrected data. Optionally, in any preceding aspect each participating user has an associated processing device and the non-transitory computer-readable medium includes computer instructions performed on a service host communicating with each associated processing device. Optionally, in any preceding aspect the non-transitory computer-readable medium includes computer instructions where the predicting includes predicting a future behavior in the form of a video image or audio sound which is a number N frames ahead of frames in the captured data. Optionally, in any preceding aspect the non-transitory computer-readable medium includes computer instructions where the predicting may include predicting a time Δ ahead, where Δ multiplied by the video framerate is a tunable variable of a prediction engine. Optionally, in any preceding aspect each user has an associated processing device, and the non-transitory computer-readable medium includes computer instructions performed on one or more of the participating user’s processing devices. Optionally, in any preceding aspect the networked application may include a virtual environment application where a representation of all participating users is rendered in a virtual environment on a processing device associated with the participating user, and the non-transitory computer-readable medium includes computer instructions including: calculating an artificial delay; the predicting may include predicting a next behavior at time T plus the artificial delay; and the transmitting may include generating user prediction data at the time T plus the artificial delay. [0010] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background.
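The latency and prediction-horizon quantities referred to in the Summary can be collected in one place. The following restates relationships that appear later in this document; the framerate symbol f is introduced here for illustration only and is not used in the disclosure itself.

```latex
% One-way latency without prediction (FIG. 4), with upload delay U,
% download delay D, and processing delay P:
\Delta_A = U_A + D_B + P, \qquad \Delta_B = U_B + D_A + P
% The predictor looks ahead N frames, or equivalently a time \Delta,
% where f denotes the video framerate (f introduced here for illustration):
N = \Delta \cdot f
% Prediction horizons used to hide latency in the described embodiments:
% server-side predictor (FIG. 5):          \Delta = D_B + P
% client-side sending predictor (FIG. 7):  \Delta = C_A + P
```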
BRIEF DESCRIPTION OF THE DRAWINGS [0011] Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying figures, in which like references indicate the same or similar elements. [0012] FIG. 1A illustrates a two-dimensional interface of an online meeting application. [0013] FIG.1B illustrates an interface of a photorealistic virtual world conference application showing a first example of a display of a virtual environment. [0014] FIG.2 illustrates an example of a network environment for implementing a photorealistic real-time metaverse conference. [0015] FIG.3 illustrates participant devices with a block diagram of components of a network processing device. [0016] FIG. 4 illustrates the latencies introduced in a conventional capture, processing and display system. [0017] FIG.5 illustrates a low latency communication system utilizing a service host comprising a processing device and a predictor. [0018] FIG. 6 is a flowchart illustrating a method which may be performed by the service host. [0019] FIG. 7 illustrates another implementation of a low latency communication system. [0020] FIG.8 illustrates yet another implementation of a low latency communication system. [0021] FIG.9 illustrates a further implementation of a low latency communication system.
[0022] FIG.10 illustrates another implementation of a low latency system suitable for use in a real-time virtual world application. [0023] FIG. 11 is a flowchart of a method performed by the predictors in the embodiment of FIG.10. [0024] FIGS. 12A and 12B illustrate the human kinematic model, which is a representation of the human body used to analyze, study, or simulate how the body moves. [0025] FIG.12C illustrates a set of facial landmarks. [0026] FIG.13 illustrates a method for creating a predictor using a neural network. WRITTEN DESCRIPTION [0027] The present disclosure and embodiments address real-time, online, audio-video interactions in networked applications which transmit large amounts of data over a network. The disclosure is particularly applicable to networked meeting and conferencing applications, and networked virtual environment applications where representations of users (whether audio, video or virtual representations) are rendered to all users participating in a group meeting or virtual environment. As used herein, a “networked meeting application” is a software platform that facilitates remote or virtual meetings through the internet, including virtual world or metaverse applications where users meet in a virtual environment through representations of themselves such as avatars. These applications typically provide functionalities for video conferencing, audio communication, real-time messaging, file sharing, and often additional features such as screen sharing, whiteboarding, and scheduling. [0028] The communication of audio/visual and rendering data over the network, and the processing required to render the representations of users in the application, introduces latency between the capture of a user’s behaviors on their own processing device, and the rendering of the behaviors of the user on the processing devices of other users participating in the application environment. This disclosure is directed to
a system for reducing this latency by predicting the behaviors of users based on data captured by a user’s processing device, and generating prediction data which allows other users’ devices to render the actions of the user on those devices with less latency relative to the captured data. [0029] In embodiments, the virtual environment may comprise a “metaverse.” In one implementation, a virtual environment serves as a dedicated conferencing environment. In embodiments, the virtual environment or live, online meeting application captures at least one user’s speech, facial expressions, and/or motions and relays that information to other users in the environment or meeting. In embodiments where the technology is applicable to virtual environments, the environment may be provided by a service host processing device operated by a service providing entity (or “service provider”) and accessed by users operating client processing devices which connect to the real-time virtual environment application provided by the service provider on the service host computing device or devices. In other implementations, the virtual conferencing application acts as an “always on” metaverse. In the context of this description, the term “metaverse” means a spatial computing platform that provides virtual digital experiences in an environment that acts as an alternative to or a replica of the real world. The metaverse can include social interactions, currency, trade, economy, and property ownership. The appearance of users in the virtual environment described may comprise an avatar (a representation of the user) or a photorealistic appearance of the user. Where the appearance of users in the virtual environment is a photorealistic appearance, neural radiance field rendering technology or other photorealistic three-dimensional reconstruction technologies may be used. A metaverse may be persistent, or non-persistent such that it exists only during hosted conferences. [0030] FIG. 1A illustrates an interface 100 of one type of online meeting or conference application showing a first example of audio/visual information which may be presented in the interface. Interface 100 is sometimes referred to as a “meeting room” interface, where a video, picture, or other representation of each of the participating users is displayed. Interface 100 includes a presenter window 110 showing a focused attendee 120 who may be speaking or presenting, and attendee
display windows 130 showing other connected attendees of the real-time meeting. In this example, the presenter window 110 shows an attendee but may also include text, video or shared screen information. The attendee display windows may be arranged in different locations or may not be shown. In addition, the placement of the windows may differ in various embodiments. It should be understood that in embodiments, the presenter window may occupy the entire screen. [0031] The presenter window 110 may show a live action, real-time video of the presenter (the speaker), while other information is displayed on another portion of the display. The video and audio of the speaker are captured by a processing device at the speaker’s location. The processing device may have a microphone, two-dimensional or three-dimensional image capture device (i.e., a camera) and other input/output devices, such as a keyboard, mouse, display device, and/or touchscreen interface. It should be further understood that although eight attendees are illustrated in windows 130, any number of users may be attending the presentation. [0032] FIG. 1B illustrates a two-dimensional interface 150 of a real-time virtual environment application showing a first example of audio/visual information which may be presented in the interface 150. Interface 150 illustrates a dynamic scene of a virtual environment which in this example is a virtual “meeting room,” where a representation 160, 170 of each of two participating users is displayed. (It will be understood that the environment may contain numerous participants and is not limited to two participants.) In this example, the virtual environment comprises a room with a table around which the attendees 160, 170 are facing each other. The interface 150 in this example may be presented in a two-dimensional display window on a processing device such as a computer with a display, a mobile device, or a tablet with a display. In embodiments, the interface 150 may present a participant’s first-person view of the environment 180 (where the view is neither that of attendee representation 160 nor 170 in this example). In other embodiments, the interface may be displayed in a virtual reality device so that the viewing attendee is immersed in the virtual environment 180. In such embodiments, where the view is that of the perspective of the user’s representation in the virtual environment, that user may not perceive the totality of that user’s rendered representation. As in the real world, such a user will only see other user
representations and those portions of their own representation (arms, legs, body) that a user would normally see if that user were in a real-world environment. In still other embodiments, the interface 150 may include the entirety of the representation of the viewing participant as one of the attendee representations 160, 170 rendered in the virtual environment. It should be understood that in embodiments, an environment interface is provided for each of the attendees/participants in a virtual environment. [0033] In the application interface 150, the attendees 160, 170 may be represented as avatars of the user, or may be photorealistic representations of the user based on image and audio data of the user which is captured by a processing device associated with the user. In this context, a photorealistic representation is a visual representation of the user which is rendered as a photographically realistic rendering of the user, such that interactions between people, surfaces and light are perceived in a lifelike view. To be realistic, the represented scene may have a consistent shadow configuration, virtual objects within the interface must look natural, and the illumination of these virtual objects needs to resemble the illumination of the real objects. A photorealistic rendered image, if generated by a computer, is indistinguishable from a photograph of the same scene. In addition to rendering a virtual background and rendering human participants (users), the system may also generate representations of non-human objects, such as cellphones, computers, furniture, etc., that are often used by the users in the real world or often appear in a real-world conferencing room or social meeting. [0034] Although the technology discussed herein will be hereinafter described with respect to its use in a real-time virtual world application, it will be recognized that the technology may be used with any real-time application where bandwidth limitations can reduce the user experience with the application. [0035] FIG.2 illustrates an example of a network environment for implementing a real-time virtual environment application. Network environment 200 includes one or more service host processing devices 240a – 240d. As described herein, each service host 240 (one embodiment of which is shown in FIG. 3) may have different configurations depending on the configuration of the system. Also shown in FIG.2 are
a plurality of network nodes 220a - 220d and user (or client) processing devices 203, 208, 212, and 213. The service hosts 240a – 240d may be part of a cloud service 250, which in various embodiments may provide cloud computing services which are dedicated to providing services described herein to enable the real-time virtual environment application. Nodes 220a - 220d may comprise a switch, router, processing device, or other network-coupled processing device which may or may not include data storage capability, allowing cached data to be stored in the node for distribution to devices utilizing the real-time virtual environment application. In other embodiments, additional levels of network nodes other than those illustrated in FIG.2 are utilized. In other embodiments, fewer network nodes are utilized and, in some embodiments, comprise basic network switches having no available caching memory. In still other embodiments, the meeting servers are not part of a cloud service but may comprise one or more servers which are operated by a single enterprise, such that the network environment is owned and contained by a single entity (such as a corporation) where the host and attendees are all connected via the private network of the entity. Lines between the processing devices 203, 208, 212, and 213, network nodes 220a - 220d and meeting servers 240a – 240d represent network connections of a network which may be wired or wireless and which comprise one or more public and/or private networks. An example of node devices 220a - 220d is illustrated in FIG. 18. [0036] Each of the processing devices 203, 208, 212, and 213 may provide and receive user data for each of the users participating in a meeting using the networked meeting application. As used herein, user data may include any data required by a networked meeting application, including video conferencing applications and virtual environment (or “metaverse”) applications, to render real-time or near-real-time representations of users participating in a meeting using the application. The representation may be a video image, audio of the user, and/or an avatar representing the user. This data generally includes real-time communication data which may include voice and video data used to render a representation of the user on devices. User data may include textual communication data used, for example, in a chat function, screen sharing data, and whiteboard and annotation data. In a virtual
environment application, user data also includes real-time communication and interaction data, spatial data about the environment and the locations of all objects, and behavioral data about user behaviors within the environment. [0037] Each of the processing devices includes an audio and video capture system to provide the real-time communication data in the form of user motion, audio, and visual data through one or more of the network nodes 220a - 220d and the cloud service 250 via a network (represented as lines interconnecting the devices). In FIG. 2, device 212 is illustrated as a desktop computer processing device with a camera 209 and display device 210 in communication therewith. Device 212 is associated with participating user 211 and the camera 209 is configured to capture audio and video data of participating user 211. Also illustrated is a tablet processing device 203 associated with participating user 202 and a mobile processing device 213 associated with participating user 214. The tablet processing device 203 and mobile processing device 213 may include integrated audio and video sensing devices, and integrated displays. Another user processing device may include a head mounted display 205 and associated processing device 208. In embodiments, the processing device 208 may be integrated into the head mounted display 205. A camera 207 is configured to communicate with processing device 208 to capture user 204. It should be understood that any type of processing device may fulfill the role of user processing device 203, 208, 212, and 213 and there may be any combination of different types of processing devices participating in a virtual world. [0038] At each client processing device 203, 208, 212, and 213, an image capture system may capture user movement, appearance and pose (in two-dimensional or three-dimensional (3D) motion data), user audio, user environment data (including the background of the environment the user is present in and objects therein), and, in some embodiments, the user’s position in the real world relative to a real-world coordinate system. This user data is provided by the respective processing device to a service host (or, in embodiments, directly to other client devices). User data is associated with corresponding timestamp data (T). The timestamp data (T) may be acquired from a universal time source or synchronized between the various elements of the system discussed herein. Each image capture system at client processing
devices 203, 208, 212, and 213 works independently and in parallel to provide the user data to the service hosts. [0039] The user data for a given user is displayed on other users’ processing device displays in a virtual environment (or a meeting interface) as described above. The rendering of each user’s data in other users’ displays optimally occurs in near real-time. As a practical matter, there is a delay between the capture and display of user data due to latency, bandwidth, and processing times. [0040] In the aforementioned description, each of the cameras may be two-dimensional video cameras, three-dimensional depth capture cameras, or a combination of both, and may incorporate audio capture devices (e.g., microphones). Each of the aforementioned devices may have additional input/output devices that contribute to capture data from each of the users, such as dedicated wireless or wired microphones connected to the processing devices. [0041] In embodiments, one user may serve as a meeting host or organizer who invites others or configures a virtual conferencing environment using the real-time virtual environment application. In a real-time virtual environment, all participant devices may contribute to the environment data. In other embodiments, a host processing device may be a standalone service host server, connected via a network to participant processing devices. It should be understood that there may be any number of processing devices operating as participant devices for attendees of the real-time meeting, with one participant device generally associated with one attendee (although multiple attendees may use a single device in other embodiments). [0042] In FIG.2, in one example, user data may be sent by a source device, such as device 208, through the source processing device’s network interface and directed to the other participant devices 203, 212, and 213 through, for example, one or more service hosts 240a - 240d and nodes 220a – 220d. Within the cloud service 250 the data may be distributed according to the workload of each of the service hosts 240 and can be sent from the service hosts directly to a client or through one or more of the network nodes 220a – 220d. In embodiments, the network nodes 220a - 220d may include processors and memory, allowing the nodes to cache data (if needed) from
the real-time virtual environment application. In other embodiments, the network nodes do not have the ability to cache data. In further embodiments, user data may be exchanged directly between participant devices and not through network nodes or routed between participant devices through network nodes without passing through service hosts. In other embodiments, peer-to-peer communication for the meeting application may be utilized. [0043] FIG.3 illustrates the aforementioned participant devices along with a block diagram of components of a network processing device 540. Although only four client processing devices are illustrated (as in FIG.2), it will be understood that numerous processing devices may be utilized in accordance with the technology disclosed herein. [0044] In embodiments, network processing device 540 functions as a service host 240, and various embodiments may utilize all of the components shown, or only a subset of the components, and levels of integration may vary from device to device. In other embodiments, as described herein, the network processing device 540 may comprise a client device and include an image capture system and display (not shown) and be associated with a meeting participant. Furthermore, device 540 may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc. The device 540 may comprise a central processing unit (CPU) 310, a memory 320, a mass storage device 330, and an I/O interface 360 connected to a bus 370. The bus 370 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, or the like. A network interface 360 enables the network processing device to communicate over a network 380 with other processing devices such as those described herein. [0045] The mass storage device 330 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus 370. The mass storage device 330 may comprise, for example, one or more of a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like. The mass storage device 330
includes instructions which, when executed by the CPU (or processor), cause the processor to perform the methods described herein. The mass storage 330 may include code comprising instructions for causing the CPU to implement the components of the real-time virtual environment application 315a, the predictor 365a and user data 350a. In other embodiments, user data 350a is not stored and/or is only buffered for processing in accordance with the embodiments described herein. Instances of the virtual environment application are present in memory 320 when executed by the CPU 310. [0046] The CPU 310 may comprise any type of electronic data processor. Memory 320 may comprise any type of system memory such as static random-access memory (SRAM), dynamic random-access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like. In an embodiment, memory 320 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs. In embodiments, the memory 320 is non-transitory. In one embodiment, the memory 320 includes computer readable instructions that are executed by the processor(s) 310 to implement embodiments of the disclosed technology, including the real-time virtual environment or real-time online meeting application 315a and a predictor 365b. The mass storage 330 may also store motion data 350a captured by the image capture systems of participant devices. [0047] As discussed with respect to FIGs. 5 - 11, a predictor 365a/365b acts to predict behaviors in the form of text, movement, sound and expressions of a user, and provide predicted text, movement, sound and/or expressions of a user to other users in the virtual environment. Instances of the virtual environment/meeting application 315b and predictor 365b may be present in memory 320. Individual instances of a predictor may operate for each participant user: predictor 365c for user 202, predictor 365d for user 204, predictor 365e for user 211 and predictor 365f for user 214. Each predictor for each user may include a speech prediction module 365g, a body motion module 365h and a facial expression/facial motion module 365i. In embodiments, not all prediction modules may be utilized. Although only four users are shown, it will be understood that additional users and predictor instances may be handled by service host 240. In FIG. 3, the audio and speech predictor neural network (NN) 365f, a
body motion predictor neural network 365g and a facial expression/facial motion predictor neural network 365h are shown for user 214, but it should be understood that each instance of a user predictor may have all or a subset of the audio and speech predictor neural network 365g, a body motion predictor neural network 365h and a facial expression/facial motion predictor neural network 365j. [0048] Each instance of a predictor for each user (for example speech predictor 365g, body motion predictor 365h and a facial expression/facial motion module 365i) comprises a neural network which is trained on an appropriate human body, facial, or speech model (for example, those illustrated below with respect to FIGs.12A – 12C). [0049] FIG.4 illustrates a conventional capture, processing and display scenario wherein two user processing devices provide user data to a service host 240, which processes and relays the information to another user. In FIG.4, the respective capture, timing and delay factors are illustrated for two users “A” and “B”, each using a respective user processing device 212a, 212b having a configuration as described above with respect to device 212 (having an associated display 210 and camera 209) and each having a respective service client application 540a, 540b which may comprise a client application for the virtual environment application 315 or, in other embodiments, a real-time online meeting application. In the embodiments discussed herein, data is captured and displayed on devices relative to a time “T”, which may be considered a universal time which is the same amongst all devices participating in the networked application. There are a number of known mechanisms for ensuring that all such devices maintain time T in synchronization with other participating devices. [0050] As illustrated at service client 540a, device 212a will capture movements of user A at time T and (optionally, depending on the embodiment) display user A’s own movements in real time on device 212a. Device 212a transmits user A’s user data A to a service host 440. A first (or upload) delay (UA) will occur during this transmission. Thus, the input for processor 410 at host 440 is that of user data A at time T less the transmission time UA (or A @ T – UA). Because some processing time occurs on service host 440, an additional processing delay (P) may be added. The data output for display by service client 540b to user “B’s” device 212b is thus the original user
data A at time T less the transmission time UA and less the processing time P or (A @ T – UA – P). An additional (or download) delay DB is added during transmission of user data A to service client 540b on device 212b. Thus, the display of real time communication and behaviors of user A at service client 540b is that of the capture at time T less the transmission times UA and DB and the processing time P or (A @ T – UA – DB – P). The total latency from B to A is ΔB = UB + DA + P, and from A to B is ΔA = UA + DB + P. [0051] The same delays are introduced from user B to user A, with user B captured at time T; user B’s user data B arriving at T – UB and being output at T – UB – P and being displayed to user A at T – UB – DA – P on client 540a. [0052] FIG.5 illustrates a display system utilizing the service host 240 comprising a processing device 540 of FIG.3 and a predictor 365. The predictor predicts a future behavior which, when in the form of a video image or audio sound, is either N frames ahead or equivalently a time Δ ahead. N (or equivalently Δ multiplied by the video framerate) may comprise a tunable variable of the prediction engine. The smaller N or Δ, the better the prediction accuracy. Details on the prediction engine are discussed below. [0053] As illustrated at service client 540a in FIG. 5, device 212a will capture movements of user A at time T and display user A’s own movements in real time on device 212a. Device 212a then transmits user A’s user data A to service host 240 and the input for processor 310 at host 240 is that of user A at time T less the transmission time UA (A @ T – UA). After the predictor 365j associated with user A on service host 240 receives the user data A for user A, the prediction data pA is output to service client 540b and displayed at time T or (pA @T). Processing time P and transmission delay DB are accounted for in the output prediction data pA, such that pA reflects a predicted behavior of user A at T + DB + P, so that the pA data displayed at time T on service client 540b incorporates any delay in the prediction of the user behavior and no delay is seen by user B at service client 540b. Likewise, the input for predictor 365k, associated with user B and running on processor 310 at host 240, is user data B at time T less the transmission time UB (B @ T – UB). After the predictor on service host 240 receives the user data B for user B, the predicted
future video data pB is output to service client 540a and displayed at time T or (pB @T). Processing time P and transmission delay DA are accounted for in the prediction data pB such that pB reflects a predicted behavior of user B at T + DA + P, and the display of data pB at service client 540a at time T removes this delay. In the embodiment of FIG. 5, each service client may render or display its associated respective user’s captured data (with service client 540a displaying user A’s user data and service client 540b displaying user B’s user data). [0054] Thus, the predictor allows the predicted image/video/sound data to be rendered or displayed at the receiving devices in synchronization with time T. In embodiments, the prediction data includes any data necessary to seamlessly continue the display of a user and/or the user’s environment at other processing devices in the meeting or virtual environment. Thus, prediction data may have the same format as the user data transmitted in the system and thus may include any data required by a networked meeting application, including video conferencing applications and virtual environment (or “metaverse”) applications, to render the representations of users participating in a meeting using the application performing their predicted next behavior. Motion data may be motion data of the user and/or include camera motion data, object motion data, and/or environment motion data. In embodiments, the system compensates for camera and environment motion as well as appearance, pose, and lighting. In other embodiments, the predictor operates on only human motion and expression prediction, to minimize rendering artifacts. [0055] FIG.6 illustrates a method which may be performed by the service host (or, in other embodiments described herein, other processing devices) in accordance with the foregoing scenario. At 605, data is received from a user of the virtual environment or online meeting application representing the user’s behavior at a time T, where a “behavior” comprises any human motion, speech, expressions, text and/or action, including all observable activity of a person, including both physical movements and verbal communication, in text, video and/or audio format. At 610, based on the data received, a prediction is made of the next behavior. At 612, the method generates prediction data regarding the behavior at a time greater than T, and in one embodiment, at T + Δ. At 615, the prediction data is transmitted to
processing devices of other users for processing and rendering at the device of each other user participating in the networked application. [0056] At 620, a determination is made as to whether the prediction data (pA/pB) matches the actual behavior of a user in newly captured data which is received after the user data received at time T. In some instances, the prediction data may be incorrect. For example, one may predict a continued walking motion of a user when in fact the user has stopped walking. In this example, the actual behavior is that the user stopped while the prediction is that the user continued walking. In general, the difference between T and T + Δ will likely be very small, and in some cases on the order of less than a second. If the prediction data matches the user data actually received at 625, the method continues continuously receiving and predicting data by repeating the receiving, predicting, generating, and transmitting. If not, then at 630, the predictor must compensate for the difference between the prediction data and the captured data. Such correction may occur by updating the next prediction data (for user data received at T + Δ) to re-render the prediction with corrected motion. In this sense, corrected data compensating for the difference between the prediction data and the captured data can allow a client device to render the display of the user in a corrected state. Corrected data may comprise any motion or change to user data previously transmitted to other users which transforms the rendering of the user from an incorrect predicted behavior to the user’s actual behavior. In one embodiment, this may simply be a jump between the predicted behavior of the user and the actual behavior of the user. In other embodiments, corrected data may render user behaviors between the incorrect predicted behavior and the actual behavior, and accelerate the motion of the user between the behaviors (appearing, in essence, to “fast forward” motion between the states) in order to reach the actual behavior of the user. [0057] In embodiments, the method of FIG. 6 may be performed continuously during the meeting. In other embodiments, the method of FIG.6 may only be utilized when the latencies introduced between user devices become significant. For example, studies have shown that humans do not recognize a delay of 10 ms or less between a visual representation of a musical performance and the audio of the performance. Where the latency in the systems described herein exceeds a threshold for human recognition, the low latency techniques may be used. Below the threshold, devices may use the user data generated at the capturing user device.
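The decision logic of FIG. 6 (steps 620 - 630) and the threshold-based gating described in the preceding paragraph can be sketched as follows. This is an illustrative sketch only; the names (behaviors_match, build_correction, measured_latency, latency_threshold) and the simple correction handling are assumptions rather than details specified by the disclosure.

```python
# Illustrative sketch of the prediction check/correction of FIG. 6 (620-630)
# and the latency-threshold gating described in the text. All names here are
# hypothetical assumptions.
def handle_new_capture(session, predictor, prediction_sent, actual_user_data,
                       latency_threshold):
    # Only use the low-latency prediction path when the measured latency
    # exceeds a threshold for human recognition; otherwise forward the
    # captured user data directly.
    if session.measured_latency() <= latency_threshold:
        session.send_to_participants(actual_user_data)
        return

    # Step 620: does the earlier prediction match the newly captured behavior?
    if predictor.behaviors_match(prediction_sent, actual_user_data):
        # Step 625: prediction was correct; continue the receive/predict/
        # generate/transmit loop with the new capture.
        session.send_to_participants(predictor.predict_next(actual_user_data))
    else:
        # Step 630: prediction was wrong (e.g., the user stopped walking);
        # generate corrected data so other devices can move the rendered user
        # from the incorrect predicted behavior to the actual behavior.
        corrected = predictor.build_correction(prediction_sent, actual_user_data)
        session.send_to_participants(corrected)
```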
[0057] In embodiments, the method of FIG. 6 may be performed continuously during the meeting. In other embodiments, the method of FIG. 6 may only be utilized when the latencies introduced between user devices become significant. For example, studies have shown that humans do not recognize a delay of 10 ms or less between a visual representation of a musical performance and the audio of the performance. Where the latency in the systems described herein exceeds a threshold for human recognition, the low latency techniques may be used. Below the threshold, devices may use the user data generated at the capturing user device. [0058] FIG. 7 illustrates another implementation of a low latency communication system. In FIG. 7, service clients 712a and 712b may be applications running on processing devices 212a and 212b, respectively (not shown), in a manner similar to that illustrated in FIG. 5. In FIG. 7, the predictors 765a and 765b associated with respective users A and B are enabled on the client processing devices. User A's movements are captured at time T and user A's own movements are displayed in real time by service client 712a. After the predictor 765a associated with user A on service client 712a receives the user data A for user A, it generates predicted future data pA which is output to service client 712b and displayed at time T or (pA @T). Processing time P and a total transmission delay CA are accounted for in the prediction data pA such that the output is the prediction pA at T + CA + P and the prediction displayed at service client 712b is pA at T. (In FIGs. 7-9, the total transmission time C is represented as a single time frame but may comprise any number of hops through a network which contribute to the communication delay C.) Similarly, the input for predictor 765b associated with user B running on service client 712b is a prediction data state at pB @ T + UB + DA (processing time P and transmission delays UB and DA are accounted for in the prediction data pB), and after transmission to service client 712a, the data displayed at 712a is the prediction at T (pB @T). As in the embodiment of FIG. 5, each service client may render or display its associated respective user's user data (with service client 712a displaying user data A and service client 712b displaying user data B). In the embodiment of FIG. 7, the method shown in FIG. 8 may run on each respective processing device. Although only two processing devices are shown for illustration, it will be recognized that any number of processing devices may be utilized in the embodiment of FIG. 7. [0059] FIG. 8 illustrates another implementation of a low latency communication system. The embodiment of FIG. 8 is an implementation in which no service host is utilized and both processing devices have sufficient processing power to each run their own predictor. The embodiment of FIG. 8 is similar to that of FIG. 7 except that both processing devices run receiving predictors. In FIG. 8, a receiving predictor 865a
and a receiving predictor 865b run on the respective service clients 812a and 812b. User A's movements are captured at time T and user A's own movements are displayed in real time by service client 812a. The user data A is transmitted to service client 812b and received by the receiving predictor at T minus communication delay CA. Predictor 865a generates predicted future data pA (at T plus delay CA plus processing time P) which is output to service client 812b and displayed at time T or (pA @T). The user data B is transmitted to service client 812a and received by receiving predictor 865b, which generates predicted future data pB (compensating for delay CA plus processing time P) and outputs it to service client 812a for display at time T or (pB @T). [0060] FIG. 9 illustrates another implementation of a low latency communication system. The embodiment of FIG. 9 is particularly suited to implementations where no service host is utilized and one or more processing devices has limited processing power, while other processing devices participating in the virtual environment or meeting have greater processing power. In FIG. 9, processing device 212a (associated with user A) has more processing power than processing device 203 (associated with user B). In this example, processing device 203 is illustrated as a mobile device but may be any processing device with limited processing power relative to other devices. As shown in FIG. 9, both a sending predictor 765a and a receiving predictor 865a run on service client 912 and processing device 212a. As in FIG. 7, user A's movements are captured at time T and user A's own movements are displayed in real time by the local service client (here, service client 912). After the sending predictor 765a receives the user data A for user A, it generates predicted future data pA which is output to service client 914 and displayed at time T or (pA @T). Processing time P and transmission delay CA are accounted for in the prediction data pA such that the output is the prediction pA at T + CA + P and the prediction displayed at service client 914 is pA at T. Service client 914 may receive user A's prediction data pA and display the data on the display of processing device 203. User B's data is captured by the service client 914 and transmitted (with delay CB) to service client 912 and receiving predictor 865a. Receiving predictor 865a generates prediction data pB based on the received data for B at time T - CB, with the prediction data pB being the user behavior at time T plus the delay CA and processing time P.
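In each of the arrangements of FIGs. 7-9, the prediction horizon of a sending or receiving predictor is the sum of the relevant communication delay C and the processing time P. By way of illustration only, and assuming loosely synchronized clocks between the participating devices (an assumption of this sketch, not a requirement of the embodiments), such a horizon might be tracked as follows:

```python
import time

class HorizonEstimator:
    """Tracks smoothed estimates of the one-way communication delay C and the
    predictor processing time P, so a predictor knows how far ahead of the
    captured time T its prediction must reach (C + P)."""

    def __init__(self, smoothing=0.9):
        self.smoothing = smoothing
        self.delay = 0.0        # smoothed one-way delay C, in seconds
        self.processing = 0.0   # smoothed processing time P, in seconds

    def _smooth(self, current, sample):
        return self.smoothing * current + (1.0 - self.smoothing) * sample

    def observe_arrival(self, remote_capture_time):
        # age of the received user data when it arrives (assumes loosely
        # synchronized clocks between the participating devices)
        sample = max(time.time() - remote_capture_time, 0.0)
        self.delay = self._smooth(self.delay, sample)

    def observe_processing(self, seconds):
        self.processing = self._smooth(self.processing, seconds)

    def horizon(self):
        return self.delay + self.processing
```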
[0061] FIG. 10 illustrates another implementation of a low latency system suitable for use in a real-time virtual world (or "metaverse") application. In a virtual environment application, User A and User B are likely required to be rendered in the same scene on each device; because the combined scene is generated remotely and then sent back to each individual device, the display of User A is no longer instant even for user A. Recall that the total latency from user B to A is ΔB = UB + DA + P, and from user A to B is ΔA = UA + DB + P, in a conventional communication system (FIG. 4). In the embodiment of FIG. 10, an artificial delay is introduced by selecting a delay time D' between DA and DB such that the artificial delay is (D' – DA) and (D' – DB), respectively, and the system does not display a predicted frame/movement/sound ahead of the artificial delay. In this embodiment, the artificial delay is optimally minimized to the lowest delay necessary to accommodate simultaneously displaying the prediction data as close to the real-time capture of each user's data as possible. As shown in FIG. 10, the input to the service host is user data A at time T less the transmission time UA or A @ T – UA, and user data B at time T less the transmission time UB or B @ T – UB. The prediction engines 1065a and 1065b generate prediction motion data pA for user A and pB for user B at time T + D' + P such that the output to each respective service client 1012 and 1014 is pA @ T + D' + P and pB @ T + D' + P. At each service client, the displayed data is the prediction data (movement, speech, etc.) for both users at time T less the artificial delay such that, for user A, data p(A+B) is displayed at min(T–(DA–D'),T) and, for user B, data p(A+B) is displayed at min(T–(DB–D'),T). As will be understood, the artificial delay scales as the number of users increases and may be adjusted depending on the placement of the predictors. For example, where a predictor is on a user device, or a server is relatively close to (or has a short latency with respect to) a user, the delays UA and DA can be small or even zero.
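By way of illustration only, the selection of D' and the resulting display times might be expressed as follows. Choosing the midpoint of the per-user downlink delays is merely one possible policy and is an assumption of this sketch, not a limitation of the embodiment.

```python
def choose_artificial_delay(downlink_delays):
    """Pick a delay D' between the smallest and largest per-user downlink
    delays (here, their midpoint), keeping the added artificial delay as
    small as practical."""
    lo, hi = min(downlink_delays.values()), max(downlink_delays.values())
    return (lo + hi) / 2.0

def display_time(T, user_downlink_delay, artificial_delay):
    """Time of the behavior shown to a given user: min(T - (D_user - D'), T)."""
    return min(T - (user_downlink_delay - artificial_delay), T)

# example with two users: D_A = 20 ms, D_B = 80 ms
delays = {"A": 0.020, "B": 0.080}
d_prime = choose_artificial_delay(delays)          # 0.050
show_a = display_time(10.0, delays["A"], d_prime)  # 10.0 (capped at T)
show_b = display_time(10.0, delays["B"], d_prime)  # 9.97
```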
[0062] FIG. 11 is a flowchart of a method performed by the predictors in the embodiment of FIG. 10. At 1105, the predictor continuously receives user data from all of the participating users. At 1110, the system computes the artificial delay by determining the maximum delay for all the users and selecting a delay D' to add to the latency delays, where D' is selected to be as small as needed to compensate for the latency and processing delays introduced by the various participating users. At 1115, for each user, user data of the user at time T is received. At 1120, based on the received data at 1115, a predicted next movement, sound and/or expression is calculated for the user at time T plus the artificial delay (i.e., p(A+B) at min(T–(Duser–D'),T)). At 1125, the prediction data calculated at 1120 is forwarded to all participating users for rendering (since each user is simultaneously participating in the virtual environment). At 1130, a determination is made as to whether the prediction data with delay matches the actual captured data at time min(T–(Duser–D'),T). If not, the method reconciles the user data with the prediction data and transmits the reconciled data to all users for rendering. If so, the method loops to step 1115. [0063] FIGs. 12A and 12B illustrate the human kinematic model 1200, which is a representation of the human body used to analyze, study, or simulate how the body moves. As shown in FIG. 12B, the model 1200 is typically created by representing the human body as a system of linked rigid bodies, or segments. Each body segment is linked to others at joints such as the elbow, shoulder, or hip. To implement a predictor for human body motion, a kinematic model is chosen, and motion training data is normalized for scale and rotation so that all movements are in a standard coordinate system and at a standard scale. Next, a machine learning model is trained to predict future movements. Several types of models may be used for this, alone or in combination with each other, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), Gated Recurrent Units (GRUs), Temporal Convolutional Networks (TCNs), Transformers, Graph Convolutional Networks (GCNs), and Generative Adversarial Networks (GANs). While the kinematic model of FIGs. 12A and 12B is illustrated in two dimensions (2D), in embodiments, the systems and methods herein may utilize three-dimensional (3D) kinematic models. [0064] Once the model is trained, it can be used to predict future motion. Given the current and past positions of the joints, the model uses the patterns it learned during training to predict their positions in the next frame. The input to the model is the sequence of joint positions in the past, and the output is the predicted positions in the future. While FIGs. 12A and 12B illustrate full-body skeleton-based motion prediction models, more detailed motion models that track finer body movements, such as hand and finger motion, may be utilized in embodiments where prediction of such fine human movements is relevant to the application.
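By way of illustration only, a recurrent model of the kind listed above (here a GRU, implemented with the PyTorch library) might map a window of past joint positions to the next frame. The choice of 17 two-dimensional joints and a 30-frame window are assumptions of this sketch, not features of the disclosed model.

```python
import torch
import torch.nn as nn

class JointSequencePredictor(nn.Module):
    """GRU that maps a window of past (normalized) joint positions to the
    predicted joint positions of the next frame."""

    def __init__(self, num_joints=17, dims=2, hidden_size=128):
        super().__init__()
        self.input_size = num_joints * dims
        self.gru = nn.GRU(self.input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, self.input_size)

    def forward(self, past):                  # past: (batch, frames, joints * dims)
        output, _ = self.gru(past)
        return self.head(output[:, -1, :])    # next frame: (batch, joints * dims)

# illustrative usage: predict the next frame from a 30-frame window
model = JointSequencePredictor()
window = torch.randn(1, 30, 17 * 2)   # normalized 2D joint coordinates
next_frame = model(window)            # shape (1, 34)
```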
[0065] FIG. 12C illustrates a set of 2D facial landmarks. Facial landmarks refer to key points on a face, such as the corners of the eyes, the edges of the lips, and the tip of the nose. By identifying these points and tracking how they move over time, it is possible to model and predict facial movements. As with predicting movement using the kinematic model, a dataset of facial images including a variety of expressions, angles, and lighting conditions is used to train a machine learning model to predict the landmark positions given an image of a face. The model is trained using a large number of images with known landmark positions, learning to associate certain visual features with certain positions. Once the model is trained, the predictor can be used to track facial movements in real time, taking the sequence of past landmark positions as input to predict the future positions. While 2D facial landmarks are illustrated in FIG. 12C, in embodiments, the systems and methods herein may utilize 3D facial landmarks.
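By way of illustration only, a constant-velocity extrapolation of tracked landmarks might serve as a simple baseline for the trained landmark predictor described above; the array shapes are assumptions of this sketch.

```python
import numpy as np

def predict_landmarks(history, steps_ahead=1):
    """Constant-velocity extrapolation of 2D facial landmarks.

    history: array-like of shape (frames, num_landmarks, 2) holding past
    landmark positions; returns the positions `steps_ahead` frames ahead.
    A trained network, as described above, would replace this extrapolation."""
    history = np.asarray(history, dtype=float)
    if history.shape[0] < 2:
        return history[-1]
    velocity = history[-1] - history[-2]      # per-frame displacement
    return history[-1] + steps_ahead * velocity
```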
[0066] Similar predictive methods may be applied to human facial models, speech and voice prediction, and audio prediction. Next word and sentence prediction is currently used in numerous types of word processing applications. A combination of next-phoneme, next-word, next-sentence, and contextual prediction can be used in the current application. [0067] FIG. 13 illustrates a method for creating a predictor (such as body motion predictor 365h) using a neural network. At 1305, training data is collected for the predictor. FIG. 13 illustrates one embodiment of creating a predictor for motion prediction. The training and creation of predictors for other types of data, including facial movements and expressions, sounds, words, voice, and the like, is similar but uses data and modeling relevant to the data being predicted. [0068] For a body motion predictor, the data may comprise motion capture data including a variety of human movements illustrating positions of the various joints in the body over time. Collecting the data may further include normalizing the data with respect to scale and rotation so that all movements are in a standard coordinate system and at a standard scale. At 1310, a kinematic model is selected (or constructed). There are a number of known kinematic models which may be used, or one may be created with joints and limbs represented as a hierarchical structure of linked segments. At 1315, features are extracted from the training data; this may include processing the training data to identify and isolate the human figures and movements. At 1320, the predictor neural network is trained to predict future movements. The input to the model is the sequence of joint positions in the past, and the output is the predicted positions in the future. At 1325, once the model is trained, it can be used to predict future motion. Given the current and past positions of the joints, the model uses the patterns it learned during training to predict their positions in the next frame. [0069] Prediction at 1325 (i.e., step 610) requires obtaining new human motion data for a user which is collected in real time (step 605). The data is fed into an input layer of the trained neural network. In embodiments, if pre-processing of the training data occurred, the same pre-processing should be performed on the new human motion data. The data forward propagates through the neural network, layer by layer. The weights and biases in each layer are those optimized during the training phase. The network produces an output sequence corresponding to predicted future states of human motion. This could be positions, velocities, or other kinematic variables. The process is repeated for each new sequence or sliding window of motion data to generate a continuous stream of predictions. The predicted motion is then synthesized (step 612) into predicted user data which can be consumed by the networked application for other users of the application.
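By way of illustration only, the sliding-window inference described in paragraph [0069] might be organized as follows; the `model.predict` call and the 30-frame window are hypothetical placeholders for the trained network.

```python
from collections import deque

def streaming_predictions(frames, model, window_size=30):
    """Feed each new sliding window of (pre-processed) joint positions through
    the trained network and yield the predicted future state, producing a
    continuous stream of predictions (steps 605, 610 and 612)."""
    window = deque(maxlen=window_size)
    for frame in frames:                        # frame: flattened joint coordinates
        window.append(frame)
        if len(window) == window_size:
            # `model.predict` is a hypothetical interface to the trained predictor
            yield model.predict(list(window))
```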
[0070] In another alternative, the user behavior data may give rise to a prediction that the user environment is about to change. For example, user motion toward a door in the environment may give rise to a prediction that the user environment will change from a room 180 to a building exterior. In embodiments, the body motion predictor may include data allowing prediction of an environment change from a current environment to a next environment. Prediction of new environments may be advantageous in metaverse applications where the data required for rendering a new virtual environment can be transmitted before the user behavior (i.e., motion) moves the user into the new environment, allowing the environment change to be rendered more quickly. [0071] A predictor (such as predictor 365g for audio or speech) is trained using audio recordings stored as audio files which may be segmented into frames, each containing a fixed number of audio samples. Audio features such as spectrograms or raw waveforms may be identified and extracted, and the features organized into sequences that serve as input for training the predictor neural network. The input layer of the neural network accepts the sequence of feature vectors, each representing a frame of speech data. Deeper layers of the predictor neural network extract patterns from the input data and predict the subsequent sequence of features that represent the future state of the speech or audio signal. In the prediction phase, new audio samples of user data (step 605) are provided to the predictor neural network and the network calculates a prediction of a subsequent sequence, which could represent the next few milliseconds to seconds of speech data. Predicted features may be transformed back into an audio signal (i.e., predicted user data at step 612) through transformations or speech synthesis algorithms, depending on the application. [0072] For the purposes of this document, it should be noted that the dimensions of the various features depicted in the figures may not necessarily be drawn to scale. [0073] For purposes of this document, reference in the specification to "an embodiment," "one embodiment," "some embodiments," or "another embodiment" may be used to describe different embodiments or the same embodiment. [0074] For the purposes of this document, a connection may be a direct connection or an indirect connection (e.g., via one or more other parts). In some cases, when an element is referred to as being connected or coupled to another element, the element may be directly connected to the other element or indirectly connected to the other element via intervening elements. When an element is referred to as being directly connected to another element, then there are no intervening elements between the element and the other element. Two devices are "in communication" if they are directly or indirectly connected so that they can communicate electronic signals between them.
[0075] Although the present disclosure has been described with reference to specific features and embodiments thereof, it is evident that various modifications and combinations can be made thereto without departing from the scope of the disclosure. The specification and drawings are, accordingly, to be regarded simply as an illustration of the disclosure as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present disclosure. [0076] The technology described herein can be implemented using hardware, software, or a combination of both hardware and software. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by a computer. A computer readable medium or media does (do) not include propagated, modulated, or transitory signals. [0077] Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated, or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired
connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media. [0078] In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/storage devices, peripherals and/or communication interfaces. [0079] It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications, and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details. [0080] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special purpose
computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. [0081] The description of the present disclosure has been presented for purposes of illustration and description but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated. [0082] For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device. [0083] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Claims
CLAIMS What is claimed is: 1. A computer implemented method of reducing latency in a networked application, comprising: receiving user data for at least one user participating in a networked meeting application, the user data including a behavior of the user occurring at a time T; based on the user data, calculating a next behavior for the user at a time greater than T; generating user prediction data for the networked meeting application, the user prediction data enabling a rendering of the next behavior of the at least one user at the time greater than T; transmitting the user prediction data to other users participating in the networked meeting application; and continuously repeating the receiving, predicting, generating and transmitting while the at least one user is participating in the networked application.
2. The computer implemented method of claim 1 wherein the predicting is performed by a neural network.
3. The computer implemented method of any of claims 1 -2 wherein the behavior is any one or more of: human motion, speech, facial expressions, and action.
4. The computer implemented method of any of claims 1 – 2 wherein the method further includes predicting a next environment change for an environment of the user at a time greater than T; generating the user prediction data including the environment change at the time greater than T; and transmitting the user prediction data including the environment change to other users participating in the networked meeting application.
5. The computer implemented method of any of claims 1 - 3 wherein the method further includes determining if the user prediction data for the behavior matches the actual behavior received for user data received at the time greater than T and, if the user prediction data does not match, generating corrected data.
6. The computer implemented method of any of claims 1 – 5 wherein each participating user has an associated processing device and wherein the method is performed on a service host communicating with each associated processing device.
7. The computer implemented method of any of claims 1 – 5 wherein each user has an associated processing device and wherein the method is performed on one or more of the participating user’s associated processing devices.
8. The computer implemented method of any of claims 1 – 5 wherein the networked meeting application comprises a virtual environment application wherein a representation of all participating users is rendered in a virtual environment on a processing device associated with the participating user, and wherein the method further includes: calculating an artificial delay; the predicting comprises predicting a next behavior at time T plus the artificial delay; and the transmitting comprises generating user prediction data at the time T plus the artificial delay.
9. The computer implemented method of any of claims 1 – 7 wherein the predicting comprises predicting a future behavior comprising a video image or audio sound which is a number N frames ahead of frames in the user data.
10. The computer implemented method of claim 9 wherein the predicting comprises predicting a time Δ ahead, where Δ multiplied by the video framerate is a tunable variable of a prediction engine.
11. A user equipment device, comprising: a storage medium comprising computer instructions; a display device; one or more processors coupled to communicate with the storage medium, wherein the one or more processors execute the instructions to cause the device to: receive user data for at least one user participating in a networked meeting application, the user data including a behavior occurring at a time T; based on the user data, calculate a next behavior for the user at a time greater than T; generate user prediction data for the networked meeting application, the user prediction data enabling a rendering of the next behavior at the time greater than T; transmit the user prediction data to other users participating in the networked meeting application; and continuously repeat the receiving, predicting, generating and transmitting while the at least one user is participating in the networked meeting application.
12. The user equipment device of claim 11 wherein the predicting is performed by a neural network.
13. The user equipment device of any of claims 11 - 12 wherein the behavior is any one or more of: human motion, speech, facial expressions, and action.
14. The user equipment device of any of claims 11 – 12 wherein the one or more processors execute the instructions to cause the device to predict a next environment change for an environment of the user at a time greater than T; generate the user prediction data including the environment change at the time greater than T; and transmit the user prediction data including the environment change to other users participating in the networked meeting application.
15. The user equipment device of any of claims 11 - 13 wherein the one or more processors execute the instructions to cause the device to determine if the user prediction data for the behavior matches the actual behavior received for user data received at
the time greater than T and, if the user prediction data does not match, generating corrected data.
16. The user equipment device of any of claims 11 – 15 wherein each participating user has an associated processing device and wherein the one or more processors execute the instructions on a service host communicating with each associated processing device.
17. The user equipment device of any of claims 11 - 15 wherein each user has an associated processing device and wherein the one or more processors execute the instructions on one or more of the participating user’s associated processing devices.
18. The user equipment device of any of claims 11 - 15 wherein the networked meeting application comprises a virtual environment application wherein a representation of all participating users is rendered in a virtual environment on a processing device associated with the participating user, and wherein the one or more processors execute the instructions to: calculate an artificial delay; predict a next behavior at time T plus the artificial delay; and transmit user prediction data at the time T plus the artificial delay.
19. The user equipment device of any of claims 11 - 18 wherein the one or more processors execute the instructions to predict a future behavior comprising a video image or audio sound which is a number N frames ahead of frames in the user data.
20. The user equipment device of claim 19 wherein the one or more processors execute the instructions to predict the prediction data at a time Δ ahead, where Δ multiplied by the video framerate is a tunable variable of a prediction engine.
21. A non-transitory computer-readable medium storing computer instructions for rendering a representation of a user from user data transmitted over a network, that
when executed by one or more processors, cause the one or more processors to perform the steps of: receiving user data for at least one user participating in a networked meeting application, the user data including a behavior occurring at a time T; based on the user data, calculating a next behavior for the user at a time greater than T; generating user prediction data for the networked meeting application, the user prediction data enabling a rendering of the next behavior at the time greater than T; transmitting the user prediction data to other users participating in the networked meeting application; and continuously repeating the receiving, predicting, generating and transmitting while the at least one user is participating in the networked meeting application.
22. The non-transitory computer-readable medium storing computer instructions of claim 21 wherein the predicting is performed by a neural network.
23. The non-transitory computer-readable medium storing computer instructions of any of claims 21 - 22 wherein the behavior is any one or more of: human motion, speech, facial expressions, and action.
24. The non-transitory computer-readable medium storing computer instructions of any of claims 21 – 22 wherein the method further includes predicting a next environment change for an environment of the user at a time greater than T; generating the user prediction data including the environment change at the time greater than T; and transmitting the user prediction data including the environment change to other users participating in the networked meeting application.
25. The non-transitory computer-readable medium storing computer instructions of any of claims 21 - 23 wherein the instructions further include instructions determining if the user prediction data for the behavior matches the actual behavior received for user data received at the time greater than T and, if the user prediction data does not match, generating corrected data.
26. The non-transitory computer-readable medium storing computer instructions of any of claims 21 - 25 wherein each participating user has an associated processing device and wherein the instructions further include instructions performed on a service host communicating with each associated processing device.
27. The non-transitory computer-readable medium storing computer instructions of any of claims 21 - 25 wherein each user has an associated processing device and wherein the instructions further include instructions performed on one or more of the participating user’s associated processing devices.
28. The non-transitory computer-readable medium storing computer instructions of any of claims 21 - 25 wherein the networked meeting application comprises a virtual environment application wherein a representation of all participating users is rendered in a virtual environment on a processing device associated with the participating user, and wherein the instructions further include instructions: calculating an artificial delay; the predicting comprises predicting a next behavior at time T plus the artificial delay; and the transmitting comprises generating user prediction data at the time T plus the artificial delay.
29. The non-transitory computer-readable medium storing computer instructions of any of claims 21 - 28 wherein the predicting comprises predicting a future behavior comprising a video image or audio sound which is a number N frames ahead of frames in the user data.
30. The non-transitory computer-readable medium storing computer instructions of claim 29 wherein the predicting comprises predicting a behavior at a time Δ ahead, where Δ multiplied by the video framerate is a tunable variable of a prediction engine.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2023/073582 WO2023230638A2 (en) | 2023-09-06 | 2023-09-06 | Reduced-latency communication using behavior prediction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2023/073582 WO2023230638A2 (en) | 2023-09-06 | 2023-09-06 | Reduced-latency communication using behavior prediction |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2023230638A2 true WO2023230638A2 (en) | 2023-11-30 |
WO2023230638A3 WO2023230638A3 (en) | 2024-05-10 |
Family
ID=88506723
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2023/073582 WO2023230638A2 (en) | 2023-09-06 | 2023-09-06 | Reduced-latency communication using behavior prediction |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023230638A2 (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10699461B2 (en) * | 2016-12-20 | 2020-06-30 | Sony Interactive Entertainment LLC | Telepresence of multiple users in interactive virtual space |
CN109831638B (en) * | 2019-01-23 | 2021-01-08 | 广州视源电子科技股份有限公司 | Video image transmission method and device, interactive intelligent panel and storage medium |
US11805157B2 (en) * | 2020-05-12 | 2023-10-31 | True Meeting Inc. | Sharing content during a virtual 3D video conference |
- 2023-09-06 WO PCT/US2023/073582 patent/WO2023230638A2/en unknown
Also Published As
Publication number | Publication date |
---|---|
WO2023230638A3 (en) | 2024-05-10 |