WO2024163018A1 - Virtual conferencing system using multi-neural-radiance-field synchronized rendering - Google Patents


Info

Publication number: WO2024163018A1
Authority: WIPO (PCT)
Prior art keywords: user, data, virtual environment, rendering, representation
Application number: PCT/US2023/063554
Other languages: French (fr)
Inventors: Chuanyue SHEN, Letian ZHANG, Zhangsihao YANG, Liang Peng, Hong Heather Yu, Masood Mortazavi
Original assignee: Futurewei Technologies, Inc.
Application filed by Futurewei Technologies, Inc.
Publication of WO2024163018A1 (en)

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • H04N7/157Conference systems defining a virtual conference space and using avatars or agents

Description

  • FIELD The disclosure generally relates to improving the quality of audio/visual interaction in virtual environments over communication networks.
  • BACKGROUND [0003]
  • Video conferencing typically involves group video conferences with two or more attendees who can all see and communicate with each other in real-time.
  • Popular conferencing applications involve the use of two-dimensional video representations of the attendees and thus benefit from fast and reliable network connections for effective conferencing.
  • Efforts have been made to build virtual worlds where users can move about and communicate in a virtual environment and interact with other users.
  • Such virtual worlds are the basis for the creation of a “metaverse,” generally defined as a single, shared, immersive, persistent, 3D virtual space where humans experience life in ways they could not in the physical world.
  • A user participates in the virtual world or metaverse through the use of a virtual representation of themselves, sometimes referred to as an avatar: an icon or figure representing a particular person in the virtual world.
  • Users can utilize such virtual worlds for meetings and conferences but are limited by their representation in virtual form, which traditionally has not been a photorealistic representation of the user, nor has the virtual environment been a photorealistic environment.
  • One aspect includes a computer implemented method of rendering representations of a plurality of users in a virtual environment.
  • the computer implemented method includes generating a representation of the virtual environment from environment data specifying at least a virtual coordinate system for the virtual environment and background data using neural rendering.
  • Neural rendering is the synthesis of photorealistic images and video content using artificial neural networks that have been trained on a dataset of images or videos of objects or scenes similar to those being synthesized.
  • The method also includes generating, using neural rendering, a representation of each of the plurality of users, each user having an associated representation; receiving user data for at least one of the users participating in the virtual environment, where the user data may include a user position and 3D key point data of the at least one user, the 3D key point data reflecting a user pose in a real world environment associated with the user; determining a position in the virtual environment for each of the plurality of users, the position in the virtual environment determined based on the received user data for the at least one user; and rendering the representation of each of the plurality of users in a dynamic scene of the virtual environment.
  • Implementations may include the computer implemented method where the user data is encoded and/or compressed, and the user data is received over a network from a client processing device associated with each user. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the 3D key point data of each user indicates real-time motions of each user and the rendering includes rendering the real-time motions of each user to the associated representation in the virtual environment.
  • Implementations may include the computer implemented method of any of the foregoing embodiments further including accessing image data for each of the plurality of users and training a neural radiance field network for each user for the neural rendering using the image data. Implementations may include the computer implemented method of any of the foregoing embodiments further including accessing user profile information and training the neural radiance field network for each user for the neural rendering using the user profile information. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the rendering includes dynamically lighting the virtual environment and the associated representation of each user. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the rendering includes determining a physical plausibility for one or more objects and/or one or more representations of users in the virtual environment.
  • Implementations may include the computer implemented method of any of the foregoing embodiments wherein the rendering includes determining a physical plausibility for one or more objects and/or one or more representations of users in the virtual environment. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the generating a representation of a user may include a free-viewpoint rendering method of a volumetric representation of a person. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the free-viewpoint rendering method is implemented on at least one fully fused multi-layer perceptron.
  • Implementations may include the computer implemented method of any of the foregoing embodiments wherein the fully fused multi-layer perceptron is implemented using a tiny Compute Unified Device Architecture neural network framework. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
  • One general aspect includes a user equipment device.
  • the user equipment device includes a storage medium and may include computer instructions.
  • the device also includes an image capture system and a display device.
  • The device also includes one or more processors coupled to communicate with the storage medium, where the one or more processors execute the instructions to cause the system to: capture, using the image capture system, at least 3D key point data of a user of the user equipment; encode and compress user data including a user position in a real world coordinate system and the 3D key point data; output the user data to a renderer via a network; receive dynamic scene data that may include a photorealistic representation of a virtual environment generated using a neural radiance field, the virtual environment having a virtual coordinate system for the virtual environment and background data, and a photorealistic representation of each of a plurality of users in the virtual environment, each user having an associated representation, each user’s associated user data converted to a position in the virtual environment; and render the dynamic scene data on the display device.
  • Implementations may include the user equipment device of any of the foregoing embodiments where the 3D key point data of each user indicates real-time motion of the user of the user equipment, and the dynamic scene data includes data rendering the real-time motions of each user to the associated representation of the user in the virtual environment. Implementations may include the user equipment device of any of the foregoing embodiments further including transmitting image data for the user to provide training data for a neural radiance field network for the neural rendering. Implementations may include the user equipment device of any of the foregoing embodiments further including transmitting user profile information for training the neural radiance field network.
  • Implementations may include the user equipment device of any of the foregoing embodiments where the dynamic scene data includes dynamically lighting the virtual environment and the associated representation of each user.
  • One general aspect includes a non-transitory computer-readable medium storing computer instructions for rendering a photorealistic virtual environment occupied by a plurality of users.
  • The computer instructions include generating a photorealistic representation of the virtual environment from environment data specifying at least a virtual coordinate system for the virtual environment and background data using neural rendering.
  • The instructions also include generating, using neural rendering, a photorealistic representation of each of the plurality of users, each user having an associated photorealistic representation; receiving user data, which may include at least user position data and 3D key point data, and a user profile of each user in the virtual environment; converting the received user data to a position in the virtual environment; and rendering the associated photorealistic representation of each of the plurality of users in a dynamic scene of the virtual environment.
  • Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments where the user data is encoded and/or compressed, and the user data is received over a network from a client processing device associated with each user.
  • Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the 3D key point data of each user indicates real-time motions of each user, and the rendering includes rendering the real-time motions of each user to the associated representation in the virtual environment. Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the instructions further cause the one or more processors to perform the steps of accessing image data for each of the plurality of users and training a neural radiance field network for each user for the neural rendering using the image data.
  • Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the instructions further cause the one or more processors to perform the steps of accessing user profile information and training the neural radiance field network for each user for the neural rendering using the user profile information.
  • Implementations may include the non-transitory computer- readable medium of any of the foregoing embodiments wherein the rendering includes dynamically lighting the virtual environment and the associated representation of each user.
  • Implementations may include the computer implemented method of any of the foregoing embodiments wherein the rendering includes determining a physical plausibility for one or more objects and/or one or more representations of users in the virtual environment.
  • Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the rendering includes determining a physical plausibility for one or more objects and/or one or more representations of users in the virtual environment. Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the generating a photorealistic representation of a user may include a free-viewpoint rendering method of a volumetric representation of a person. Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the free-viewpoint rendering method is implemented on at least one fully fused multi-layer perceptron.
  • Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the fully fused multi-layer perceptron uses a tiny Compute Unified Device Architecture neural network framework.
  • FIG. 1 illustrates an interface of a photorealistic virtual world conference application showing a first example of a display of a virtual environment.
  • FIG.2 illustrates an example of a network environment for implementing a photorealistic real-time metaverse conference.
  • FIG.3 is a block diagram of a service host and a plurality of clients.
  • FIG.4 is a flow chart illustrating a process performed by a user client device in accordance with the virtual conferencing system.
  • FIG.5 is a flow chart illustrating a process performed by a service host to train a human NeRF renderer.
  • FIG.6 is a flow chart illustrating a process performed by a service host in accordance with the virtual conferencing system to render the virtual conference data.
  • FIG.7 is a flow chart illustrating a process performed by a user client device in accordance with the virtual conferencing system.
  • FIG.8 is a block diagram of an alternative service host architecture.
  • FIG.9 is a diagram illustrating the training flow for a human NeRF renderer.
  • FIG. 10 is a diagram illustrating the real-time rendering flow for a NeRF renderer.
  • FIG.11 is a ladder diagram illustrating data flow between user devices and a service host.
  • FIG.12A is a diagram illustrating Human NeRF formation.
  • FIG.12B is a diagram illustrating a pose correction MLP.
  • FIG.12C is a diagram illustrating a skeleton motion network.
  • FIG.12D is a diagram illustrating a residual non-rigid motion MLP.
  • FIG.12E is a diagram illustrating a NeRF network for obtaining color and density.
  • FIG.12F is a diagram illustrating parallel processing in accordance with the disclosure.
  • FIG.13 is a flowchart illustrating dynamic composition.
  • FIG.14 is a data flow diagram illustrating dynamic composition.
  • FIG.15 is a data flow diagram illustrating dynamic relighting.
  • FIG.16 is a block diagram of a network processing device that can be used to implement various embodiments.
  • FIG.17 is a block diagram of a network processing device that can be used to implement various embodiments of a meeting server or network node.
  • FIG.18 is a block diagram illustrating an example of a network device, or node, such as those shown in the network of FIG.2.
  • The present disclosure and embodiments address real-time, online, audio-video interactions in a virtual environment.
  • the virtual environment may comprise a “metaverse.”
  • a virtual environment serves as a dedicated conferencing environment.
  • The present disclosure and embodiments provide a system and method for generating and displaying a photorealistic virtual environment which includes representations of a plurality of users in the virtual environment using neural rendering.
  • the representations comprise photorealistic representations of each user.
  • A user’s position, movements, and audio expressed in a real-world environment are captured by a user processing device associated with the user and translated to a representation of the user in a virtual environment or metaverse.
  • the representation of each user along with each user’s audio data are rendered in the virtual environment for perception by other users participating in the virtual environment, generally by rendering a dynamic scene of the environment on each user’s processing device.
  • The user’s position in the virtual environment is derived using encoded and compressed user data, which may include user position data and 3D key point data, sent from the associated user processing device to a service host.
  • the service host provides the neural rendering processing to render the representation of each of the plurality of users in a dynamic scene of the virtual environment.
  • Conferences in the virtual environment may be scheduled and hosted by users operating from client processing devices which access a real-time virtual environment application provided by a service provider on a service host computing device or devices.
  • the virtual conferencing application acts as an “always on” metaverse.
  • the term “metaverse” means a spatial computing platform that provides virtual digital experiences in an environment that acts as an alternative to or a replica of the real-world.
  • the metaverse can include social interactions, currency, trade, economy, and property ownership.
  • the appearance of users in the virtual environment described herein is a photorealistic appearance, replicating each user’s real-world appearance using neural radiance field rendering technology.
  • a metaverse may be persistent or non-persistent such that it exists only during hosted conferences.
  • the system herein uses neural rendering to generate photorealistic representations of users and a virtual environment by using neural networks to learn how to simulate lighting, reflection, and refraction that occur in the real world.
  • neural rendering generally, a 3D model of the object or scene is created, which includes information about the geometry, surface materials, and lighting.
  • The neural network is trained on a large dataset of images or videos that capture similar objects or scenes under different lighting conditions.
  • the neural network learns to predict how light interacts with the 3D model based on its parameters and the lighting conditions.
  • the network can be used to generate new images or videos of the 3D model under different lighting and viewing conditions.
  • the neural network's ability to learn complex relationships between the input data and the output images or videos enables it to generate highly realistic images of 3D objects and scenes, even under challenging lighting conditions.
  • The present disclosure and embodiments employ a free-viewpoint rendering system that is capable of synthesizing a novel view from any viewpoint.
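  • As a high-level illustration of the training step described above, the sketch below shows a generic photometric training loop in PyTorch; the model interface, dataset layout, and hyperparameters are assumptions for illustration and are not taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def train_neural_renderer(model, dataset, epochs=10, lr=5e-4):
    """Generic neural-rendering training loop: the network learns to reproduce
    captured images of similar scenes given camera and lighting parameters."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for cameras, lighting, target_images in dataset:    # captured training views
            rendered = model(cameras, lighting)              # predicted images
            loss = F.mse_loss(rendered, target_images)       # photometric loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```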
  • FIG. 1 illustrates a two-dimensional interface 100 of an online virtual conference application showing a first example of audio/visual information which may be presented in the interface 100.
  • Interface 100 illustrates a dynamic scene of a virtual environment, which in this example is a virtual “meeting room”, where a representation 110, 120 of each of the participants is displayed.
  • Interface 100 includes a presentation window 110 showing two attendee representations 110, 120 in the virtual environment 130 where the meeting room may comprise an environment for an online meeting.
  • the virtual environment comprises a room with a table around which the attendees 110, 120 are facing each other.
  • the interface 100 in this example may be presented on a two-dimensional display window presented on a processing device such as a computer with a display, a mobile device, or a tablet with a display.
  • The interface 100 may present a participant, first-person view of the environment 130 (where the view is neither that of attendee representation 110 nor 120 in this example).
  • The interface may be displayed in a virtual reality device so that the viewing attendee is immersed in the virtual environment 130.
  • Where the view is that of the perspective of the user’s representation in the virtual environment, that user may not perceive the totality of that user’s rendered representation.
  • Such a user will only see other user representations and those portions of their own representation (arms, legs, body) that the user would normally see if that user were in a real-world environment.
  • the interface 100 may include the entirety of the representation of the viewing participant as one of the attendee representations 110, 120 rendered in the virtual environment. It should be understood that in embodiments, an environment interface is provided for each of the attendees/participants in a virtual environment.
  • each attendee representation comprises a photorealistic representation of a human attendee (user) associated with the representation.
  • a photorealistic representation is a visual representation of the user which is rendered as a photographically realistic rendering of the user, such that interactions between people, surfaces and light are perceived in a lifelike view.
  • the represented scene may have a consistent shadow configuration, virtual objects within the interface must look natural, and the illumination of these virtual objects needs to resemble the illumination of the real objects.
  • A photorealistic rendered image, if generated by a computer, is indistinguishable from a photograph of the same scene.
  • The technology disclosed herein presents techniques to capture multiple users in real-time and bring them into virtual environment 130 in a photorealistic manner.
  • The system may also generate representations of non-human objects, such as cellphones, computers, furniture, etc. that are often used by the users in the real world or often appear in a real-world conferencing room or social meeting.
  • FIG.2 illustrates an example of a network environment for implementing a real-time virtual environment application.
  • Network environment 200 includes one or more service host processing devices 240a – 240d.
  • Each service host 240 may have different configurations depending on the configuration of the system.
  • Also shown in FIG. 2 are a plurality of network nodes 220a - 220d and user (or client) processing devices 210, 212, 214, 216.
  • the service hosts 240a – 240d may be part of a cloud service 250, which in various embodiments may provide cloud computing services which are dedicated to the real-time virtual environment application.
  • Nodes 220a - 220d may comprise a switch, router, processing device, or other network- coupled processing device which may or may not include data storage capability, allowing cached data to be stored in the node for distribution to devices utilizing the real-time virtual environment application.
  • additional levels of network nodes other than those illustrated in FIG.2 are utilized.
  • fewer network nodes are utilized and, in some embodiments, comprise basic network switches having no available caching memory.
  • In embodiments, the service hosts are not part of a cloud service but may comprise one or more meeting servers which are operated by a single enterprise, such that the network environment is owned and contained by a single entity (such as a corporation) where the host and attendees are all connected via the private network of the entity.
  • Lines between the processing devices 210, 212, 214, 216, network nodes 220a - 220d and meeting servers 240a – 240d represent network connections of a network which may be wired or wireless and which comprise one or more public and/or private networks.
  • An example of node devices 220a - 220d is illustrated in FIG.18.
  • Each of the processing devices 210, 212, 214, 216 may provide and receive real-time user motion, audio, and visual data through one or more of the network nodes 220a- 220d and the cloud service 250 via the network.
  • device 210 is illustrated as a desktop computer processing device with a camera 211 and virtual display device 205 in communication therewith. Also illustrated are three examples of participant devices, including a tablet processing device 212, a desktop computer processing device 214 and a mobile processing device 216.
  • Tablet processing device 212 and mobile processing device 216 may include integrated image sensing devices, and desktop computer 214 may include an integrated or non-integrated image processing device (not shown) which may be a camera or other image sensor. It should be understood that any type of processing device may fulfill the role of a user processing device and there may be any combination of different types of processing devices participating in a virtual conference hosted by one or more service hosts 240a – 240d. It should be further understood that tablet processing device 212, a desktop computer processing device 214 and a mobile processing device 216 may also include displays which render the virtual environment described herein as well as image capture devices included therein or attached thereto.
  • one user may serve as a meeting host or conference organizer who invites others or configures a virtual conferencing environment using the real-time virtual environment application.
  • all participant devices may contribute to environment data.
  • a host processing device may be a standalone service host server, connected via a network to participant processing devices. It should be understood that there may be any number of processing devices operating as participant devices for attendees of the real-time meeting, with one participant device generally associated with one attendee (although multiple attendees may use a single device in other embodiments).
  • Each device in FIG.2 may send and receive real-time virtual environment application data.
  • user motion and audio/visual data may be sent by a source device, such as device 210, through the source processing device’s network interface and directed to the other participant devices 212, 214, 216 though, for example, one or more service hosts 240a - 240d.
  • the data is distributed according to the workload of each of the service hosts 240 and can be sent from the service hosts directly to a client or through one or more of the network nodes 220a – 220d.
  • the network nodes 220a - 220d may include processors and memory, allowing the nodes to cache data from the real-time virtual environment application. In other embodiments, the network nodes do not have the ability to cache data.
  • FIG.3 is a block diagram of components of a service host and a plurality of client processing devices illustrating a system for providing a photorealistic real-time virtual environment application in a network.
  • FIG. 3 illustrates one example of technology which enables the photorealistic real-time virtual environment application using a Neural Radiance Field (NeRF) renderer 380.
  • real-world movements, actions, and audio by individual users 310, 320, 330 in their respective real-world environments are translated into movements, actions and audio in a virtual conferencing environment 130.
  • Environment 130 is rendered by renderer 380 based on data obtained from client processing devices 305, 315, 325, and 335.
  • Prior to rendering the virtual conferencing environment, renderer 380 is trained with individual user data, as discussed herein.
  • client processing devices 305, 315 and 325 are all illustrated as desktop computers, each having an associated display and image capture device.
  • each client processing device in a real-time virtual environment application environment should include an image capture component and a display component.
  • Client processing device 335 is illustrated as a mobile device which in embodiments includes an integrated display and image capture device.
  • Client processing devices 305, 315, 325 and 335 are respectively associated with an individual user 310, 320, 330 and 340, though in other embodiments an individual processing device may be associated with multiple users.
  • an image capture system captures a user’s profile, three-dimensional (3D) key points, and the user’s position in the real-world relative to a real-world coordinate system.
  • Each capture is associated with corresponding timestamp data.
  • the timestamp data may be acquired from a universal time source or synchronized between the various elements of the system discussed herein.
  • Each image capture system at client processing device 305, 315, 325 and 335 works independently and in parallel to provide the user profile information, three dimensional key points, and user positions at multiple timestamps to the service hosts.
  • a user profile contains a user’s appearance, position and styling preferences. For example, a user may provide information regarding clothing preferences in environment 130 of real-time virtual environment application. The user’s appearance may be created for the virtual environment 130 using one or more static images gathered from various viewpoints, or in one or more motion videos captured from various 3D viewpoints, taking into account preferences specified by the user in the user’s profile information. User appearances may also be created based on descriptions saved in the user profiles.
  • the 3D key points identify the user’s body poses in the real-world coordinate system, while the user’s position in the real-world is defined by user data (including user profile, historical rendering data and position data) from an associated user processing device, with the user data and key point data from the real-world coordinate system allowing positioning and posing of the photorealistic representation of the user within the virtual environment.
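  • For concreteness, the sketch below shows one possible structure for a single timestamped user capture of the kind described above; the field names and types are illustrative assumptions, not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class UserCapture:
    """One timestamped user capture sent from a client to the object database."""
    user_id: str
    timestamp: float                                 # universal or synchronized timestamp
    position: Tuple[float, float, float]             # user position in real-world coordinates
    keypoints_3d: List[Tuple[float, float, float]]   # 3D key points describing the body pose
    profile: dict = field(default_factory=dict)      # appearance, position, and styling preferences
```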
  • The object database 345 stores relevant user information (user profile, key points, and position) for each user participating in the real-time virtual environment application. In the embodiment of FIG.3, the object database is shown as a central database on the service host 240, but the location of the database can vary based on system requirements. Also shown in FIG.3 is a theme database 350.
  • the themes database 350 comprises a global system database that stores virtual backgrounds in a background database 360 and virtual environment data in an environment database 355.
  • the background database 360 comprises rendering information for the virtual environment 130 including the virtual world background and a three-dimensional virtual environment coordinate system which can be used to position the virtual representation of the user in the virtual environment.
  • the environment 130 is a room including a table, but as should be understood, the environment may comprise any three-dimensional representation of a world where individuals may move about in the context of the technology discussed herein.
  • Environment database 355 may include lighting information for the virtual environment.
  • each user’s position in the real-world coordinate system is translated to the virtual environment coordinate position.
  • The user’s position in the real world is considered when positioning a representation of the user in the virtual environment.
  • the virtual environment rendering considers the proper positioning of all the participants when placing photorealistic representations of each user in the virtual environment.
  • A given user’s position in the virtual space may be defined based on rules or regulations of the meeting/the system and/or user interaction with the system.
  • the user’s virtual world position may also take into consideration the user profile and preferences, as well as the user’s position and pose in the real world.
  • The human body rendering process, described below, reconstructs a photorealistic human model and enables novel pose rendering of the human.
  • The rendering system accepts a single continuous 5D coordinate as input (a spatial location (x, y, z) and viewing direction (θ, φ)) and can output different representations for the same point when viewed from different angles.
  • Environment database 355 includes scene lighting information and other data augmenting the rendering of the virtual environment 130 where the virtual conferencing takes place.
  • a meeting host or organizer may choose the background and environment for the virtual conference. For example, the host or organizer may choose to have a virtual conference in a virtual conference room during daylight (or at night), all of which are rendered by the virtual conferencing application.
  • Data from the object database is provided to position converter 365.
  • the converter 365 transforms a user’s position in their respective real-world environment to a position in the virtual conferencing environment with the aid of background data 360 and environment data 355 from the theme database.
  • Converter 365 maps the coordinates between the real world and the virtual world, and thus is used to plausibly place a user from the real world into the virtual world.
  • An instance 365a, 365b, 365c of converter 365 is respectively created for each user 310, 320, 330.
  • separate converter applications are provided on client processing devices or separate converter applications for each user may be run on one or more of the service hosts 240.
  • Each instance of a converter works independently and in parallel with other instances to provide data to the renderer 380.
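  • A minimal sketch of one such converter instance is shown below, assuming the real-to-virtual mapping can be expressed as a rigid or similarity transform chosen from the theme and background data; the transform values and seat position are illustrative assumptions only.

```python
import numpy as np

def real_to_virtual(p_real, rotation, translation, scale=1.0):
    """Map a user position from the real-world coordinate system to the
    virtual environment coordinate system (one converter instance per user)."""
    return scale * (rotation @ np.asarray(p_real, dtype=float)) + translation

# Example: place a user standing 1 m in front of their camera at an assumed
# seat position around the virtual conference table.
seat_anchor = np.array([2.0, 0.0, 1.5])   # assumed seat location in the virtual room
alignment = np.eye(3)                      # assumed alignment between the two frames
virtual_position = real_to_virtual([0.0, 0.0, 1.0], alignment, seat_anchor)
```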
  • Renderer 380 receives theme and refactorized user information and outputs a real-time photorealistic virtual scene of the virtual environment.
  • The renderer 380 provides NeRF formation, dynamic composition, and dynamic relighting in a scene 100 of a virtual environment 130.
  • Renderer 380 is trained with sufficient data to create the rendering of the representations of individual users and the virtual environment 130.
  • the output of the renderer 380 is provided to each of the client processing devices 305, 315, 325 and 335 and the virtual environment scene 100 rendered as appropriate given the type of display available on each client device.
  • A NeRF may comprise one or more fully-connected neural networks that can generate novel views of complex 3D scenes, based on a partial set of 2D images, motion videos or frames of motion videos. It is trained to reproduce input views of a scene and works by taking input images representing a scene and interpolating between them to render one complete scene.
  • a NeRF network is trained to map directly from viewing direction and spatial location (5D input) to opacity and color (4D output), using volume rendering to render new views.
  • NeRF is a computationally intensive algorithm, and processing of complex scenes can take hours or days. As such, in embodiments it is desirable to include NeRF components discussed herein on the service host devices where processing power may be greater than at individual user processing devices.
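  • The sketch below traces this 5D-to-4D mapping for a single camera ray, assuming a hypothetical nerf_mlp callable that returns color and density for batched sample positions and view directions; the compositing follows standard volume rendering and is not a verbatim implementation of the disclosed renderer.

```python
import torch

def render_ray(nerf_mlp, origin, direction, n_samples=64, near=0.1, far=4.0):
    """Query a NeRF along one ray and composite the samples into a pixel color."""
    t = torch.linspace(near, far, n_samples)             # depths of the samples
    points = origin + t[:, None] * direction             # (N, 3) sample positions
    dirs = direction.expand(n_samples, 3)                 # (N, 3) viewing direction
    rgb, sigma = nerf_mlp(points, dirs)                   # 5D input -> color + density
    delta = (far - near) / n_samples                      # spacing between samples
    alpha = 1.0 - torch.exp(-sigma * delta)               # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)  # transmittance
    weights = alpha * trans                                # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)             # composited pixel color
```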
  • Renderer 380 utilizes techniques described in Mildenhall, Ben et al. “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.” European Conference on Computer Vision (2020) and Weng, Chung-Yi et al. “HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 16189-16199 (hereinafter “Weng et al.”) (each of the foregoing documents is hereby incorporated by reference in its entirety herein) to synthesize views of complex scenes and humans using a continuous volumetric scene function from input views and/or motion video.
  • Renderer 380 uses a non-convolutional deep network, whose input may be a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)), and whose output is the volume density and view-dependent emitted radiance at that spatial location.
  • the skeletal motion translation 1240 uses a convolutional neural network.
  • Other components in the renderer 380 may all be based on non-convolutional neural networks.
  • renderer 380 can synthesize photorealistic details of a user’s body, as seen from various camera angles that may not exist in the input images or video, as well as minute details such as cloth textures and facial appearance.
  • The motion field is decomposed into skeletal rigid and non-rigid motions, with rigid motion derived from convolutional neural networks and non-rigid motion derived from the non-convolutional networks.
  • real-time user data in the form of 3D key points reflecting motion of a user and user position reflecting location of a user in real world can be used to render a photorealistic representation of the user in the virtual environment in the photorealistic dynamic scene on user processing devices.
  • 3D key point data are mainly used to describe body poses of a human.
  • the 3D key point data are collected to project user body poses in the real-world environment into virtual world.
  • the function of converters 365a – 365c (or 375a-d) is to provide a point-to-point mapping between the real world and virtual world.
  • Converters 365a-d determine the user position in the virtual environment according to the user position in the real-world environment.
  • a photorealistic dynamic representation of the user in the virtual environment is completed by combining the user body poses in the virtual environment and the user position in the virtual environment.
  • FIG.4 is a flow chart illustrating a process performed by a client processing device in accordance with the virtual conferencing system.
  • the method of FIG. 4 comprises providing user data to the converter 365 and renderer 380 to either train the renderer 380 or allow the renderer 380 and converter 365 to create the photorealistic virtual environment.
  • At 410, user data comprising, at least in part, user images is captured.
  • The images may be a plurality of still images, user video, or frames extracted from user video. 3D key points may be gathered from the user images or captured separately using, for example, a depth image capture device.
  • the 3D key points for user spatial locations are points in the image that define locations on a user’s body in the real-world coordinate system that are used to render motion of the user.
  • Image capture may comprise two-dimensional static images, motion video of the user and/or depth images of the user.
  • User position data may also be captured from user image data or separately by other sensors (such as, for example, a depth image capture system).
  • a timestamp is associated with each captured image or video.
  • the timestamp may be a universal timestamp referenced by all client processing devices capturing images such as, for example, a timestamp referenced to a universal time source.
  • the user position in the real-world environment occupied by the user is determined relative to a coordinate system mapped to the user’s real-world environment.
  • user profile data may be retrieved.
  • user profile data may comprise user preferences indicating how the user’s photorealistic appearance can be modified in the virtual environment, such as clothing preferences and the like.
  • the user profile and/or key point data may be compressed for transmission to a service host and at 460 the client processing device outputs the user profile, user position, and key points to the object database for use by the converter 365 and renderer 380.
  • the process may continue repeating the capture step 410, timestamp association 420, position determination 430, compression/encoding 450 and outputting steps 460 as needed based on the application (training or real-time rendering).
  • capture step 410, timestamp association 420, position determination 430, compression/encoding 450 and outputting steps 460 are repeated continuously for the duration of the user’s presence in the virtual environment.
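  • A minimal sketch of this client-side loop appears below; the capture_system and send_to_host interfaces are assumptions standing in for the image capture system and network interface, and JSON plus zlib are used only as one example of encoding and compression.

```python
import json
import time
import zlib

def client_stream(capture_system, send_to_host, running):
    """Repeat the capture (410), timestamp (420), position (430),
    compression/encoding (450), and output (460) steps of FIG. 4."""
    while running():
        keypoints_3d, position = capture_system.capture()       # 410 / 430
        packet = {
            "timestamp": time.time(),                            # 420 (universal timestamp)
            "position": position,                                # real-world coordinates
            "keypoints_3d": keypoints_3d,                        # body-pose key points
        }
        payload = zlib.compress(json.dumps(packet).encode())     # 450
        send_to_host(payload)                                    # 460
```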
  • FIG.5 is a flow chart illustrating a process performed by a service host (or other processing device) to train a renderer 380.
  • FIG.6 is a flow chart illustrating a process performed by a service host in accordance with the virtual conferencing system to render the virtual conference data during a real-time conference.
  • user key point data and profiles are collected at 620.
  • the user profile and initial key point data may be retrieved from the object database and/or retrieved in real time from the data output by the user processing device.
  • user key point data is retrieved in real time in order to allow the renderer 380 to mimic user movements in the real-world to the representation of the user in the real-time virtual environment 130.
  • The real-world user data is converted into virtual positioning within the virtual environment.
  • the theme environment and background data are accessed at 640 and the virtual environment is rendered at 650 with representations of each user participating in the real-time virtual environment application.
  • FIG.7 is a flow chart illustrating a process performed by a user client device in accordance with the virtual conferencing system.
  • FIG.8 illustrates an alternative architecture to the architecture illustrated in the system shown in FIG. 3.
  • In the architecture of FIG. 8, the renderer 380 and theme database 350 are provided on the service host 240’, but rather than a central object database stored on the service host as in the embodiment of service host 240 of FIG.3, the user profile information, key points and user position data are stored in user databases 810, 812, 814, 816, each of which is associated with a user processing device.
  • each user database 810, 812, 814, 816 may be stored on the processing device; in alternative embodiments, each user database may be stored in a database on an intermediate processing device or network node.
  • Converters 375a – 375d may similarly be provided on client processing devices, edge servers or other processing devices, network processing devices, or intermediate processing devices. Data from the themes database 350 is provided to each of the converters 375a – 375d to allow for conversion of the user’s real-world environment to the position in the virtual environment. As noted above, the renderer 380 requires substantial processing power and thus may be implemented in a cloud computing environment including one or more service hosts 240 or 240’.
  • FIG. 9 is a diagram illustrating components of a renderer 380 during a training process, including untrained human NeRFs 922, 928, 934, a dynamic composition module 942 and dynamic relighting module 944, a background NeRF, and training or formation modules, including NeRF formation modules 920, 926 and 932 respectively associated with human NeRFs 922, 928, 934, and NeRF formation module 938 associated with background NeRF 940.
  • Although three human NeRFs are illustrated in FIG.9, it will be understood that any number “N” of human NeRFs (one per user) may be included in the system.
  • the process flow demonstrates the training of each NeRF network to allow creation of a photorealistic dynamic scene 100 on a client processing device.
  • a universal timestamp 902 is utilized for each component of data 904, 906, 908, 912, and 914.
  • Each set of user data 904, 906, 908 may contain a user profile and 3D key points, as discussed above.
  • Historical user data in the form of 3D key points or a previously trained neural radiance field network may contribute to creation of a corresponding trained human NeRF 922, 928, 934 for each user.
  • A NeRF formation training process 920, 926, 932 acts on the corresponding untrained human NeRF 922, 928, 934 and creates a trained Human NeRF (NeRFs 1022, 1028, 1034 in FIG. 10).
  • background data 912 is utilized by NeRF formation process 938 to train a background NeRF 940.
  • a rendering of the user representation generated by each human NeRF 922, 928, 934 may be combined with each respective user position in the virtual conferencing environment 130 and provided to a dynamic composition module 942 to combine a rendering of the environment background from the background NeRF 940 with the respective user positions to create the representations of each of the users in the virtual environment 130.
  • a dynamic relighting process 944 utilizes environment data 914 to add realistic shadows and highlights to the virtual environment 130 as displayed in dynamic scene 100.
  • The resulting trained human NeRFs 1022, 1028, 1034 (FIG. 10) have completed a training process and are now considered “pre-trained” to optimize rendering for each individual user and output a photorealistic view of the user from a number of viewpoints.
  • FIG. 10 is a diagram illustrating the real-time rendering flow for renderer 380.
  • the human NeRFs 1022, 1028, 1034 for each user (User 1, User 2, .... User N) have been trained as discussed with respect to FIG.9.
  • a real-time data timestamp 1002 is used by the components of the renderer 380.
  • the real-time data universal timestamp 1002 allows synchronization of user key point data and positions within the virtual environment 130 of the real-time virtual environment application.
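  • One simple way to use the universal timestamp for synchronization is sketched below; the per-user stream layout and selection rule are assumptions, since the disclosure only requires that per-user data be aligned by timestamp.

```python
def synchronized_keypoints(user_streams):
    """Pick, for each user, the key-point packet whose universal timestamp is
    closest to the slowest user's most recent packet (the common frame time)."""
    newest = {uid: max(p["timestamp"] for p in pkts) for uid, pkts in user_streams.items()}
    frame_time = min(newest.values())          # the slowest stream defines the frame time
    return {
        uid: min(pkts, key=lambda p: abs(p["timestamp"] - frame_time))
        for uid, pkts in user_streams.items()
    }
```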
  • User 3D key points 1004, 1006, 1008 are captured in real-time for any user participating in a virtual environment in the real-time virtual environment application.
  • the key points are provided for the respective user from a client processing device (e.g., devices 305, 315 and 325, each having an associated display and image capture system) and are provided to pre-trained respective human NeRFs 1022, 1028, 1034 for each user.
  • the human NeRFs 1022, 1028, 1034 for each user render a photorealistic representation of the user which will be combined with the background and environment data in the virtual environment and represented to each user as a photorealistic dynamic scene 100, following the movements and audio inputs provided by and to each user.
  • A converter determines the respective user’s position in the metaverse 1024, 1030, 1036, and the user’s position in the metaverse is combined with the rendering data and provided to the pre-trained dynamic composition module 1042 and dynamic relighting module 1044.
  • the background data 1012 is provided to the pre-trained background NeRF 1040, which provides background rendering data to the dynamic composition module 1042.
  • both background data 1012 and environment data 1014 are optional where, for example, the virtual environment is chosen to have a static background or static environment.
  • the pre-trained dynamic composition module 1042 combines rendering data from each user and the respective user’s position in the virtual environment to compose the virtual environment.
  • the rendered virtual environment is modified by the pre-trained dynamic relighting module 1044 and outputs the photorealistic dynamic scene 100 in a format suitable for display on the associated display device of client processing devices 305, 315 and 325.
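  • The real-time flow just described can be summarized by the sketch below; background_nerf, human_nerfs, composite, and relight are assumed interfaces standing in for modules 1040, 1022/1028/1034, 1042, and 1044, and the render signatures are illustrative.

```python
def compose_dynamic_scene(background_nerf, human_nerfs, poses, positions,
                          camera, composite, relight):
    """Render the background and each pre-trained human NeRF, place each user at
    their converted virtual position, then apply dynamic relighting."""
    scene = background_nerf.render(camera)                        # background rendering data
    for uid, human_nerf in human_nerfs.items():
        user_layer = human_nerf.render(camera, pose=poses[uid])   # photorealistic user rendering
        scene = composite(scene, user_layer, positions[uid])      # dynamic composition (1042)
    return relight(scene)                                         # dynamic relighting (1044)
```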
  • the technology optimizes the amount of data needed to provide the metaverse environment using several data optimization techniques. Following the initial composition of each dynamic scene, rendering of the movements of users based on real-time 3D key point data may be performed by inference.
  • the system herein utilizes a single camera video/image acquisition system, thereby reducing hardware costs and data usage.
  • The motion acquisition data optionally comprises full-body movement sequences of a dynamic human subject, and the acquired information is uploaded into the user database.
  • the motion info can be in several formats. For example, one simple format can be a monocular video or a sequence of video frames. Alternatively, appearance profile and 3D skeletal key points may be extracted from video and uploaded. In embodiments, historical data can be leveraged to accelerate data communication in future conferencing events.
  • FIG.11 is a ladder diagram illustrating data flow between user devices and a service host. Although two user devices 305, 315 and one service host 240 are illustrated, it will be understood that any number of user processing devices and service hosts may be utilized in the present system.
  • 3D key point data which is extracted from a user via an image capture device 305b at an associated client processing device (e.g., device 305) is provided to the service host 240 for training.
  • the key point data may be encoded and/or compressed.
  • Key point data 1120 from another capture device 315b associated with a client processing device (e.g., device 315) is provided to the service host for training.
  • Rendered scene data 1110 and 1115 is returned to the respective client processing devices for display on associated displays 305a, 315a following training.
  • each device issues a service request 1135, 1140, to request entry into the virtual environment. The request is acknowledged to indicate that the client device has entered the service.
  • Each device submits real-time 3D key point data 1145, 1150, respectively.
  • key point data is encoded and/or compressed by the client processing device and provided at 1145 to a service host 240 for processing.
  • Virtual environment data, including the real-time movements of the user, is provided to both device 315 and device 305 at 1150 for display thereon.
  • key point data is encoded and/or compressed by the client processing device 315 and provided at 1155 to a service host 240 for processing.
  • audio data from each user may be forwarded to the service hosts and played in the virtual environment in synchronization with rendering the photorealistic representation of each user in the real-time virtual environment application.
  • FIG. 12A is a diagram illustrating NeRF formation of a photorealistic representation of users in the virtual environment by one human NeRF 1022 of the renderer 380 based on the techniques described in Weng, Chung-Yi et al.
  • FIG. 12A illustrates the technology as used herein with certain modifications to the aforementioned techniques.
  • Techniques from human animation and NeRF rendering are combined to create photorealistic view synthesis from any viewpoint.
  • a renderer 380 interprets the human shape, pose, and camera parameters, and employs neural networks to learn a human motion field with aid of a pose corrector (pose correction MLP 1225). The renderer also leverages an acceleration approach to speed up its training and inference.
  • the human NeRF formation network 1022 takes a video frame (or image or series of images) of a user 1215 in an observation space 1217 as input and optimizes for canonical appearance, represented as a continuous field as well as a motion field mapping from observation space 1217 to canonical space 1255.
  • The motion field 1235 is decomposed into skeletal rigid motion 1240 (M_skel) and residual non-rigid motion 1250 (M_res), represented as a discrete grid and a continuous field, respectively.
  • M_skel represents skeleton-driven deformation, and M_res starts from the skeleton-driven deformation and produces an offset Δx to it.
  • the method additionally refines a body pose 1212 through pose correction MLP 1225. A loss is imposed between the rendering result and the input image.
  • Three multi-layer perceptrons (MLPs) and a convolutional neural network (CNN) are used in the human NeRF formation.
  • An MLP is a fully connected feedforward artificial neural network which consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function.
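  • For concreteness, a minimal MLP of this form is sketched below in PyTorch; the layer sizes are illustrative assumptions (for example, a positionally encoded 3D point mapped to a color and density).

```python
import torch.nn as nn

# Input layer, one hidden layer with a nonlinear activation, and an output layer.
mlp = nn.Sequential(
    nn.Linear(63, 256),   # input layer (e.g., a positionally encoded 3D point)
    nn.ReLU(),            # nonlinear activation on the hidden neurons
    nn.Linear(256, 256),  # hidden layer
    nn.ReLU(),
    nn.Linear(256, 4),    # output layer (e.g., RGB color plus density)
)
```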
  • A parametric human body model (e.g., an SMPL-based model) is utilized to obtain estimations of camera parameters K, pose θ, and shape β of the human body given in the monocular video.
  • The Skinned Multi-Person Linear Model (SMPL) is a realistic 3D model of the human body that is based on skinning and blend shapes and is learned from thousands of 3D body scans.
  • a SMPL model is a skinned vertex-based model that accurately represents a wide variety of body shapes in natural human poses. The parameters of the model are learned from data including the rest pose template, blend weights, pose-dependent blend shapes, identity-dependent blend shapes, and a regressor from vertices to joint locations.
  • a SMPL oPtimization IN the loop (SPIN) deep network may be used for regressing SMPL body shape and pose parameters directly from an image. SPIN uses a training method that combines a bottom-up deep network with a top-down, model-based, fitting method.
  • Pre-trained weights of the SPIN model are utilized. (Although monocular video is discussed herein, video or images captured by multi-camera systems may be utilized in embodiments herein.)
  • Compared to the SMPL model, which relies solely on regression, SPIN combines iterative optimization and deep-network regression to estimate human poses more accurately.
  • These SPIN estimates (camera parameters, pose and human body shape) are used as the initial input parameters into the framework, where the pose parameters, particularly, are gradually refined through a pose correction MLP 1225 during training.
  • a motion field of the human subject is created by two neural networks -- skeletal motion translation network 1240 and non-rigid motion MLP 1250.
  • Skeletal motion translation network 1240 is a CNN that learns the skeletal rigid motion, but it is not a full representation of the motion field since it cannot interpret the non-rigid contents.
  • Non-rigid motion MLP 1250 is a fully fused MLP which accounts for non-rigid residual motion.
  • Inverse linear blend skinning is used to transform the motions in observation space to canonical space.
  • An MLP 1260 is used to learn the color and density maps, which are then volume-rendered into images.
  • The MLPs involved in the framework are accelerated using parallel computing such as a fully fused Compute Unified Device Architecture (CUDA) kernel in a tiny Compute Unified Device Architecture neural network (tiny-cuda-nn) framework, as discussed below.
  • Starting from the poses estimated from pre-trained weights of parametric models (e.g., SPIN), a pose correction MLP 1225 is used to learn an adjustment for better pose alignment.
  • The network parameters of MLP 1225 are optimized to provide updates Δθ_0, ..., Δθ_K conditioned on θ_0, ..., θ_K. Optimizing the network parameters leads to faster convergence compared to directly optimizing Δθ_0, ..., Δθ_K.
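  • A hedged sketch of such a pose correction network is given below; the layer widths and the per-joint axis-angle pose parameterization are assumptions, and only the idea of predicting updates Δθ conditioned on the initial estimates is taken from the text above.

```python
import torch.nn as nn

class PoseCorrectionMLP(nn.Module):
    """Predicts per-joint updates Δθ_0, ..., Δθ_K conditioned on the initial pose estimate."""
    def __init__(self, n_joints=24, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_joints * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_joints * 3),
        )

    def forward(self, theta_init):            # (batch, n_joints * 3) axis-angle pose
        delta = self.net(theta_init)          # predicted corrections Δθ
        return theta_init + delta             # refined pose used downstream
```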
  • Skeleton motion translation network 1240 represents skeletal motion volumetrically based on an inverse linear blend skinning algorithm that warps the points in observation space to canonical space (equivalent to warping an observed pose θ_o to a predefined canonical pose θ_c) in a form as follows:

    M_skel(X, θ_o) = Σ_i w_i^o(X) (G_i X)

  • where w_i^o is the blend weight for the i-th bone in the observation space, X is the 3D coordinate of a point, and G_i is the skeletal motion basis for the i-th bone.
  • FIG. 12C illustrates one embodiment of a skeletal motion network 1240 used to generate M_skel.
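  • A small numerical sketch of this inverse linear blend skinning warp is shown below; representing each skeletal motion basis G_i as a 4x4 homogeneous transform is an assumption for illustration.

```python
import numpy as np

def inverse_lbs(x_obs, blend_weights, bone_transforms):
    """Warp a point from observation space to canonical space:
    x_can = sum_i w_i^o(x) * (G_i x), with each G_i a 4x4 homogeneous transform."""
    x_h = np.append(np.asarray(x_obs, dtype=float), 1.0)    # homogeneous coordinate
    x_can = np.zeros(3)
    for w_i, G_i in zip(blend_weights, bone_transforms):     # one term per bone
        x_can += w_i * (G_i @ x_h)[:3]
    return x_can
```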
  • The residual non-rigid motion is represented as

    M_res(x_skel, θ_o) = MLP_res(γ(x_skel), θ_o)     (7)

  • where x_skel represents points in the skeletal motion field M_skel, and γ is a standard positional encoding function.
  • the residual motion network 1250 is activated after the skeletal motion network. This avoids overfitting the residual motion network to the input and undermining the contribution of the skeletal motion field.
  • the color and density are then obtained from the NeRF MLP as MLPnerf(γ(xfinal))   (8), where γ is a standard positional encoding function and xfinal = xskel + xres represents the points in the motion field considering skeleton and residual non-rigid motions (xskel represents the points in the skeleton motion field and xres represents the points in the residual non-rigid motion field).
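  • A minimal sketch of the standard positional encoding γ used in Eqs. (7) and (8), applied to xfinal = xskel + xres, is shown below; the number of frequency bands is an illustrative choice.

```python
import numpy as np

def positional_encoding(x, num_bands=10):
    """Standard NeRF-style encoding: gamma(x) = (sin(2^k * pi * x), cos(2^k * pi * x))."""
    freqs = (2.0 ** np.arange(num_bands)) * np.pi        # (L,)
    angles = x[..., None] * freqs                        # (..., 3, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)                # (..., 3 * 2L)

x_skel = np.array([[0.1, 0.2, 0.3]])     # point after the skeletal warp
x_res = np.array([[0.01, -0.02, 0.0]])   # residual non-rigid offset from MLP 1250
x_final = x_skel + x_res                 # point actually fed to the NeRF MLP
print(positional_encoding(x_final).shape)   # (1, 60) for 10 frequency bands
```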
  • An example of the NeRF network 1275 is illustrated in FIG. 12E.
  • a stratified sampling approach is applied inside the bounding box, and an augmentation method is applied to further improve sampling efficiency.
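  • A minimal NumPy sketch of stratified sampling of depths along a ray is shown below; the near/far bounds are placeholders and would normally come from intersecting the ray with the bounding box.

```python
import numpy as np

def stratified_samples(near, far, num_samples, rng):
    """Split [near, far] into equal bins and draw one jittered depth per bin."""
    edges = np.linspace(near, far, num_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    return lower + (upper - lower) * rng.random(num_samples)

rng = np.random.default_rng(0)
t_vals = stratified_samples(near=0.5, far=4.0, num_samples=64, rng=rng)
# ray_points = origin + t_vals[:, None] * direction  would give the 3D sample positions
print(t_vals[:4])
```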
  • the augmentation method uses the denominator of Eq.
  • the HumanNeRF free-viewpoint rendering techniques used herein replace the MLPs discussed in the prior art with fully fused MLPs.
  • the fully fused MLP is an MLP implemented, in one embodiment, as a single processor kernel that is designed such that the only slow global memory accesses are reading and writing the network inputs and outputs.
  • Each MLP implementation can be specifically tailored to the network architecture and the processor used, and is mapped to the memory hierarchy of that processor.
  • a Tiny CUDA neural network framework (tiny-cuda-nn) is used with a graphics processing unit (GPU).
  • CUDA is a parallel computing platform and application programming interface (API) that allows software to use certain types of GPUs for general purpose processing.
  • Tiny-cuda-nn is a minimal implementation of a neural network framework in CUDA C++. It provides an implementation of feedforward neural networks and enables the user to train a model on a GPU for faster computation. Tiny-cuda-nn provides a lightweight and simple tool for neural networks and GPU programming.
  • a given batch of input vectors may be partitioned into block-column segments that are processed by a single thread block each.
  • the thread blocks independently evaluate the network by alternating between weight-matrix multiplication and element-wise application of an activation function. By making the thread blocks small enough that all intermediate neuron activations fit into on-chip shared memory, traffic to slow global memory is minimized.
  • each warp of the thread block computes the matrix product of a single block-row.
  • MLP 1260 is further optimized using a downsizing "fully fused" MLP 1265, a multiresolution hash encoding 1270, and a NeRF MLP 1275 based on a Tiny CUDA neural network framework.
  • MLP 1265 downsizes the input to a smaller dimension so that encoding 1270 can accept the reduced-dimension input, and is implemented using a fully fused kernel.
  • the combination of a downsizing MLP 1265, encoding 1270 and NeRF (fully-fused MLP) 1275 comprises one embodiment. In other embodiments, the downsizing MLP 1265 and encoding 1270 may be omitted.
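  • The downsize-encode-render arrangement of MLP 1265, encoding 1270, and NeRF MLP 1275 can be sketched structurally as follows in plain PyTorch; here a frequency encoding and ordinary linear layers stand in for the multiresolution hash encoding and the fully fused tiny-cuda-nn kernels of the described embodiment, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FrequencyEncoding(nn.Module):
    """Stand-in for the multiresolution hash encoding 1270 (frequency encoding here)."""
    def __init__(self, in_dim, num_bands=6):
        super().__init__()
        self.register_buffer('freqs', (2.0 ** torch.arange(num_bands)) * torch.pi)
        self.out_dim = in_dim * num_bands * 2

    def forward(self, x):
        angles = x[..., None] * self.freqs
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

class NerfPipeline(nn.Module):
    """Downsizing MLP -> encoding -> NeRF MLP producing (RGB, density)."""
    def __init__(self, in_dim=63, down_dim=3):
        super().__init__()
        self.downsize = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                      nn.Linear(64, down_dim))          # stand-in for MLP 1265
        self.encoding = FrequencyEncoding(down_dim)                     # stand-in for encoding 1270
        self.nerf = nn.Sequential(nn.Linear(self.encoding.out_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 64), nn.ReLU(),
                                  nn.Linear(64, 4))                     # stand-in for NeRF MLP 1275

    def forward(self, features):
        h = self.nerf(self.encoding(self.downsize(features)))
        rgb, sigma = torch.sigmoid(h[..., :3]), torch.relu(h[..., 3:])
        return rgb, sigma

rgb, sigma = NerfPipeline()(torch.randn(1024, 63))
print(rgb.shape, sigma.shape)   # torch.Size([1024, 3]) torch.Size([1024, 1])
```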
  • the encoding technique at 1270 may comprise the multiresolution hash encoding referenced in the paper by Müller, Thomas et al.
  • FIG. 12F illustrates the mechanism of the fully fused neural networks that leverage the parallelism of modern GPUs. As shown in FIG. 12F, given a batch of input vectors 1209, a regular MLP evaluation corresponds to alternating weight-matrix multiplication and element-wise application of the activation function. In contrast, a fully fused MLP at 1280 in FIG. 12F parallelizes the workload by partitioning the batch into chunks, each processed by a single thread block, as described below.
  • each warp of the thread block computes one block-row (striped area 1290) of H′i+1 by first loading the corresponding striped weights in Wi into registers and then multiplying the striped weights by all block-columns of Hi.
  • each thread block loads the weight matrix (e.g., Wi) from global memory exactly once, while frequent accesses to Hi are through fast shared memory.
  • In FIG. 12F, at 1275, a regular MLP evaluation for a given batch of input vectors corresponds to alternating weight-matrix multiplication and element-wise application of the activation function.
  • the fully fused MLP achieves accelerated performance by parallelizing the workload. It partitions the batch into 128-element-wide chunks and processes each chunk with a single thread block.
  • each thread block transforms the i-th layer Hi into the pre-activated next layer H′i+1.
  • Hi is diced into 16×16 elements to match the size of one type of processor core utilized in one embodiment, where such a processor may utilize hardware acceleration technology such as the TensorCore hardware-accelerated half-precision matrix multiplier available from NVIDIA Corporation, Santa Clara, California.
  • Each warp of the thread block computes one 16×128 block-row (e.g., the striped area) of H′i+1.
  • the computation is done by first loading the corresponding 16×64 striped weights in Wi into registers and then multiplying the striped weights by all 64×16 block-columns of Hi. Therefore, each thread block loads the weight matrix (e.g., Wi) from global memory exactly once, while the multiple passes are over layer Hi, which is located in fast shared memory.
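  • The tiling arithmetic just described can be emulated in NumPy as below to show that the per-warp block-row products reproduce the full layer computation; the actual speedup comes from the CUDA kernel keeping Hi in on-chip shared memory and registers, which this sketch does not model.

```python
import numpy as np

WIDTH, CHUNK, ROWS = 64, 128, 16      # layer width, batch chunk width, block-row height
rng = np.random.default_rng(0)
W_i = rng.normal(size=(WIDTH, WIDTH))        # weight matrix of layer i
H_i = rng.normal(size=(WIDTH, CHUNK))        # one 128-element-wide chunk of layer i

# Each "warp" computes one 16x128 block-row of the pre-activation H'_{i+1}
# from the corresponding 16x64 stripe of W_i and all block-columns of H_i.
H_next = np.empty_like(H_i)
for warp in range(WIDTH // ROWS):
    stripe = W_i[warp * ROWS:(warp + 1) * ROWS, :]       # 16x64 striped weights
    H_next[warp * ROWS:(warp + 1) * ROWS, :] = stripe @ H_i

assert np.allclose(H_next, W_i @ H_i)        # the tiling matches the full matmul
print(H_next.shape)                          # (64, 128)
```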
  • FIG. 13 is a flowchart illustrating dynamic composition by dynamic composition module 942.
  • FIG.14 is a diagram illustrating dynamic composition of human objects into a background.
  • the diagram of FIG. 14 and insertion of human objects into a background is performed in accordance with a modification of the techniques described in B. Yang et al., "Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  • the scene rendering uses an object-compositional neural radiance field and produces realistic rendering for a clustered and real-world scene.
  • a two-pathway architecture is utilized in which a scene branch 1440 encodes the scene geometry and appearance, and an object branch 1460 encodes each standalone object conditioned on learnable object activation codes.
  • a scene-guided training strategy solves 3D space ambiguity in the occluded regions and learns sharp boundaries for each object.
  • the scene branch takes the spatial coordinate x, the interpolated scene voxel features fscn at x, and the ray direction d as input, and outputs the color cscn and opacity σscn of the scene.
  • the object branch takes additional object voxel features fobj as well as an object activation code lobj to condition the output so that it contains only the color cobj and opacity σobj for a specific object at its original location, with everything else removed.
  • positional encoding γ(·) is applied to both the scene voxel feature fscn interpolated from the eight (8) nearest vertices and the space coordinate x to produce the hybrid space embedding.
  • This hybrid space embedding, along with the embedded directions γ(d), is fed into the scene branch and the object branch.
  • the scene branch function Fscn outputs the opacity σscn and color cscn of the scene at x.
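  • A structural PyTorch sketch of the two-pathway idea is shown below; the feature dimensions, layer sizes, and the way the object activation code is concatenated are illustrative assumptions, not the architecture of the cited work.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Shared branch structure: inputs -> (view-dependent color, opacity)."""
    def __init__(self, in_dim):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 128), nn.ReLU())
        self.sigma = nn.Linear(128, 1)
        self.color = nn.Sequential(nn.Linear(128 + 3, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, feats, view_dir):
        h = self.trunk(feats)
        sigma = torch.relu(self.sigma(h))                               # opacity
        rgb = torch.sigmoid(self.color(torch.cat([h, view_dir], dim=-1)))
        return rgb, sigma

# Illustrative sizes for the hybrid space embedding, scene/object voxel
# features, and the learnable per-object activation code.
EMB, F_SCN, F_OBJ, L_OBJ = 63, 32, 32, 16
scene_branch = Branch(EMB + F_SCN)
object_branch = Branch(EMB + F_SCN + F_OBJ + L_OBJ)

x_emb = torch.randn(8, EMB); f_scn = torch.randn(8, F_SCN)
f_obj = torch.randn(8, F_OBJ); l_obj = torch.randn(8, L_OBJ)
d = torch.randn(8, 3)

c_scn, s_scn = scene_branch(torch.cat([x_emb, f_scn], -1), d)                   # whole scene
c_obj, s_obj = object_branch(torch.cat([x_emb, f_scn, f_obj, l_obj], -1), d)    # one object only
```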
  • FIG.15 is a flow diagram illustrating dynamic relighting in a trained dynamic relighting module.
  • the dynamic relighting module uses a modification of the techniques discussed in Xiuming Zhang, Pratul P. Srinivasan, et al., "NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown Illumination."
  • the technique discussed therein, termed "NeRFactor," represents the shape and spatially-varying reflectance of an object as a set of 3D fields, each parameterized by MLPs 1510, 1515, 1520, 1525 and 1535, whose weights are optimized so as to explain the set of observed input images.
  • NeRFactor outputs, at each 3D location xsurf on the object's surface, the surface normal n, the light visibility in any direction v(ωi), the albedo a, and the reflectance (BRDF) that together explain the observed appearance. These factors are provided to a renderer 1545 to provide free-viewpoint relighting (with shadows). By recovering the object's geometry and reflectance, NeRFactor enables free-viewpoint relighting (with shadows) and material editing. The NeRFactor techniques discussed in Zhang et al. are modified in that the NeRF techniques described therein are replaced with the HumanNeRF techniques of Weng, Chung-Yi et al.
  • the input to the NeRFactor model comprises the output of human NeRF 1022 (the optimized volume density σ), which creates input images to compute initial geometry (though using Multi-View Stereo (MVS) geometry as initialization also works).
  • NeRFactor optimizes a neural radiance field MLP that maps from any 3D spatial coordinate and 2D viewing direction to the volume density at that 3D location and color emitted by particles at that location along the 2D viewing direction.
  • NeRFactor leverages NeRF's estimated geometry by "distilling" it into a continuous surface representation that is used to initialize NeRFactor's geometry.
  • NeRFactor optimizes NeRF to compute the expected surface location along any camera ray, the surface normal at each point on the object’s surface, and the visibility of light arriving from any direction at each point on the object’s surface.
  • in the notation used herein, x denotes 3D locations, ωi the light direction, ωo the viewing direction, and φd, θh, θd the Rusinkiewicz coordinates.
  • a surface normal MLP 1530 computes the surface normal in the canonical space nc.
  • a second MLP 1535 computes a surface normal in observation space n.
  • a light visibility MLP 1510 computes the visibility v to each light source by marching through human NeRF's 1022 σ-volume from the point to each light location.
  • the visibility function is parameterized as an MLP 1510 that maps from a surface location xsurf and a light direction ωi to the light visibility v: fv : (xsurf, ωi) → v.
  • Reflectance is handled by the albedo MLP 1520 and BRDF identity MLP 1515.
  • a BRDF identity MLP 1515 predicts spatially-varying BRDFs for all the surface points in the plausible space of real-world BRDFs.
  • the albedo MLP 1520 parameterizes the albedo a at any surface location xsurf as an MLP fa : xsurf → a.
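  • To illustrate how a renderer such as 1545 might combine these factors, the following NumPy sketch performs single-bounce shading from discrete point lights using the normal, visibility, albedo, and a crude specular term standing in for the predicted BRDF; all factor values are random placeholders, and the real system also handles environment lighting and the full BRDF model.

```python
import numpy as np

def shade(normal, albedo, visibility, light_dirs, light_rgb, view_dir, specular):
    """Single-bounce shading at one surface point from discrete point lights.

    normal:     (3,) surface normal n at x_surf
    albedo:     (3,) diffuse albedo a
    visibility: (L,) light visibility v(x_surf, w_i) per light
    light_dirs: (L, 3) unit directions w_i toward each light
    light_rgb:  (L, 3) light intensities
    specular:   scalar stand-in for the BRDF's non-diffuse response
    """
    cos_terms = np.clip(light_dirs @ normal, 0.0, None)          # n . w_i
    half = light_dirs + view_dir
    half /= np.linalg.norm(half, axis=-1, keepdims=True)
    spec = specular * np.clip(half @ normal, 0.0, None) ** 64    # crude specular lobe
    brdf = albedo / np.pi + spec[:, None]                        # diffuse + specular reflectance
    return (visibility[:, None] * brdf * light_rgb * cos_terms[:, None]).sum(axis=0)

rng = np.random.default_rng(0)
L = 4
dirs = rng.normal(size=(L, 3)); dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
rgb = shade(np.array([0.0, 0.0, 1.0]), np.array([0.6, 0.5, 0.4]),
            rng.random(L), dirs, np.full((L, 3), 2.0),
            np.array([0.0, 0.0, 1.0]), specular=0.04)
print(rgb)   # outgoing radiance toward the viewer at this surface point
```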
  • FIG.16 is a block diagram of a network processing device that can be used to implement various embodiments, including a client processing device such as devices 210, 212, 214, 216 or network nodes 220a – 220d.
  • network device 1600 may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc.
  • the network device 1600 may comprise a processing unit 1601 equipped with one or more input/output devices, such as network interfaces, storage interfaces, and the like.
  • the processing device may include a central processing unit (CPU) 1610, a memory 1620, a mass storage device 1630, and an I/O interface 1660 connected to a bus 1670.
  • the bus 1670 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, or the like.
  • a network interface 1650 enables the network processing device to communicate over a network 1680 with other processing devices such as those described herein.
  • the I/O interface is illustrated as connected to a display device 1665 and an image capture device 1655.
  • the CPU 1610 may comprise any type of electronic data processor.
  • Memory 1620 may comprise any type of system memory such as static random-access memory (SRAM), dynamic random-access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like.
  • memory 1620 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs.
  • the memory 1620 is non-transitory.
  • the memory 1620 includes computer readable instructions that are executed by the processor(s) 1610 to implement embodiments of the disclosed technology, including the real-time virtual environment application 1625a, which may itself include a rendering engine 1625b and converter 1675.
  • the functions of the real-time virtual environment application 1625a as operable on a client processing device are described herein in various flowcharts and figures.
  • the mass storage device 1630 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus 1670.
  • the mass storage device 1630 may comprise, for example, one or more of a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like.
  • An instance of the user DB 810a may be stored on the mass storage device 1630 in embodiments of the technology such as that described with respect to FIG.8.
  • the mass storage 1630 may also include code comprising instructions for causing the CPU to implement the components of the real-time virtual environment application illustrated in memory 1620 of the client processing device.
  • FIG.17 is a block diagram of a network processing device that can be used to implement various embodiments of a service host 240. Specific network devices may utilize all of the components shown, or only a subset of the components, and levels of integration may vary from device to device.
  • the memory 1620 includes components of the real-time virtual environment application 1715a, a rendering engine 1715b implementing a renderer 380 discussed herein, converters 1715c implementing converters 310, 320, 330 discussed herein, and environment rendering data 1715d provided from the object database and theme database.
  • the mass storage device 1630 includes the theme database 350a, comprising instances of a background database 360a, an environment database 355a, and an object database 345a.
  • the mass storage 1630 may also include code comprising instructions for causing the CPU to implement the components of the real-time virtual environment application illustrated in memory 1620 of the service host 240.
  • device 240 may include both a CPU 1610 and a dedicated GPU 1710.
  • a GPU is a type of processing unit that enables very efficient parallel processing of data.
  • while GPUs may be used in a video card or the like for computer graphics, GPUs have found much broader applications.
  • FIG. 18 is a block diagram illustrating examples of details of a network device, or node, such as those shown in the network of FIG. 2.
  • a node 1800 may comprise a router, switch, server, or other network device, according to an embodiment.
  • the node 1800 can correspond to one of the nodes 220a- 220d.
  • the router or other network node 1800 can be configured to implement or support embodiments of the technology disclosed herein.
  • the node may store the user databases 810a (which may comprise any of 810, 812, 814, 816 illustrated in FIG.8) and/or may execute one or more converters discussed herein.
  • the node 1800 may comprise a number of receiving input/output (I/O) ports 1810, a receiver 1812 for receiving packets, a number of transmitting I/O ports 1830 and a transmitter 1832 for forwarding packets. Although shown separated into an input section and an output section in FIG.18, often these will be I/O ports 1810 and 1830 that are used for both down-stream and up-stream transfers and the receiver 1812 and transmitter 1832 will be transceivers.
  • the node 1800 can also include a processor 1820 that can be formed of one or more processing circuits and a memory or storage section 1822.
  • the storage 1822 can be variously embodied based on available memory technologies and, in this embodiment, is shown to have a recovery data cache 1870, which could be formed from a volatile RAM memory such as SRAM or DRAM, and long-term storage 1826, which can be formed of non-volatile memory such as flash NAND memory or other memory technologies.
  • Storage 1822 can be used for storing both data and instructions for implementing aspects of the technology discussed herein, in particular instructions causing the processor 1820 to perform the functions of caching database data or executing converters as described herein. More specifically, the processor(s) 1820, including the programmable content forwarding plane 1828, can be configured to implement embodiments of the disclosed technology. In accordance with certain embodiments, the storage 1822 stores computer readable instructions that are executed by the processor(s) 1820 to implement embodiments of the disclosed technology.
  • a connection may be a direct connection or an indirect connection (e.g., via one or more other parts).
  • when an element is referred to as being connected or coupled to another element, the element may be directly connected to the other element or indirectly connected to the other element via intervening elements.
  • when an element is referred to as being directly connected to another element, then there are no intervening elements between the element and the other element.
  • Two devices are “in communication” if they are directly or indirectly connected so that they can communicate electronic signals between them.
  • the technology described herein can be implemented using hardware, software, or a combination of both hardware and software.
  • the software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein.
  • the processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media.
  • Computer readable media may comprise computer readable storage media and communication media.
  • Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by a computer.
  • a computer readable medium or media does (do) not include propagated, modulated, or transitory signals.
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated, or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
  • some or all of the software can be replaced by dedicated hardware logic components.
  • illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc.
  • In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors.
  • the one or more processors can be in communication with one or more computer readable media/ storage devices, peripherals and/or communication interfaces.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A system and method for generating and displaying a photorealistic virtual environment which may act as an online conferencing system or a metaverse environment using neural rendering. A user's position in a real-world environment is translated to the virtual environment, and motions of the user along with audio data are rendered in the virtual environment for perception by other users in the virtual environment. User position and motion are relayed using encoded and compressed user data from an associated user processing device to a service host. The service host uses neural rendering processing to render a representation of each user in the virtual environment, providing dynamic lighting and physically plausible placement of representations of users and objects in the virtual environment. Data to render the dynamic scene is returned to each processing device associated with a user to allow rendering of the dynamic scene on the user processing device.

Description

FTRW-01394WO0 6000548PCT01 - 1 - VIRTUAL CONFERENCING SYSTEM USING MULTI-NEURAL-RADIANCE-FIELD SYNCHRONIZED RENDERING Inventor: Chuanyue Shen Letian Zhang Zhangsihao Yang Liang Peng Hong Heather Yu Masood Mortazavi CLAIM OF PRIORITY [0001] This application claims priority to U.S. Provisional Patent Application No. 63/482,707, entitled “VIRTUAL CONFERENCING SYSTEM USING MULTI-NEURAL- RADIANCE-FIELD SYNCHRONIZED RENDERING”, filed February 1, 2023, which application is incorporated by reference herein in its entirety. FIELD [0002] The disclosure generally relates to improving the quality of audio/visual interaction in virtual environments over communication networks. BACKGROUND [0003] The use of real-time video conferencing applications has expanded considerably in recent years. Video conferencing may typically involve group video conferences with two or more attendees who can all see and communicate with each other in real-time. Popular conferencing applications involve the use of two- dimensional video representations of the attendees and thus benefit from fast and reliable network connections for effective conferencing. [0004] Efforts have been made to build virtual worlds where users can move about and communicate in a virtual environment and interact with other users. Such virtual worlds are the basis for the creation of a “metaverse,” generally defined as a single, FTRW-01394WO0 6000548PCT01 - 2 - shared, immersive, persistent, 3D virtual space where humans experience life in ways they could not in the physical world. Typically, a user participates in the virtual world or metaverse through the use of a virtual representation of themselves, sometimes referred to as an avatar -- an icon or figure representing a particular person in the virtual world. [0005] Users can utilize such virtual worlds for meetings and conferences but are limited in the representation of the user by their representation in virtual form, which traditionally has not been a photorealistic representation of the user, nor is the virtual environment a photorealistic environment. SUMMARY [0006] One aspect includes a computer implemented method of rendering representations of a plurality of users in a virtual environment. The computer implemented method includes generating a representation of the virtual environment from environment data specifying at least a virtual coordinate system for the virtual environment and background data using neural rendering. Generally, neural rendering is the synthesizing photorealistic images and video content using artificial neural networks which have been trained on a dataset of images or videos of similar objects or scenes to that which is synthesized. The method also includes generating, using neural rendering, a representation of each of the plurality of users, each user having an associated representation; receiving user data for at least one of the users participating in the virtual environment, where the user data may include a user position and 3D key point data of the at least one user, the 3D key point data reflecting a user pose in a real world environment associated with the user; determining a position in the virtual environment for each of the plurality of users, the position in the virtual environment determined based on the received user data for the at least one user ; and rendering the representation of each of the plurality of users in a dynamic scene of the virtual environment. 
Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on FTRW-01394WO0 6000548PCT01 - 3 - one or more computer storage devices, each configured to perform the actions of the methods. [0007] Implementations may include the computer implemented method where the user data is encoded and/or compressed, and the user data is received over a network from a client processing device associated with each user. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the 3D key point data of each user indicates real-time motions of each user and the rendering includes rendering the real-time motions of each user to the associated representation in the virtual environment. Implementations may include the computer implemented method of any of the foregoing embodiments further including accessing image data for each of the plurality of users and training a neural radiance field network for each user for the neural rendering using the image data. Implementations may include the computer implemented method of any of the foregoing embodiments further including accessing user profile information and training the neural radiance field network for each user for the neural rendering using the user profile information. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the rendering includes dynamically lighting the virtual environment and the associated representation of each user. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the rendering includes determining a physical plausibility for one or more objects and/or one or more representations of users in the virtual environment. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the rendering includes determining a physical plausibility for one or more objects and/or one or more representations users in the virtual environment. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the generating a representation of a user may include a free-viewpoint rendering method of a volumetric representation of a person. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the free-viewpoint rendering method is implemented on at least one fully fused multi-layer perceptron. Implementations may include the computer implemented method of any of the foregoing embodiments FTRW-01394WO0 6000548PCT01 - 4 - wherein the fully fused multi-layer perceptron is implemented using a tiny Compute Unified Device Architecture neural network framework. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium. [0008] One general aspect includes a user equipment device. The user equipment device includes a storage medium and may include computer instructions. The device also includes an image capture system and a display device. 
The device also includes one or more processors coupled to communicate with the storage medium, where the one or more processors execute the instructions to cause the system to: capture, using the image capture system, at least 3D key point data of a user of the user equipment ; encode and compress user data including a user position in a real world coordinate system and the 3D key point data; outputting the user data to a renderer via a network; receive dynamic scene data may include a photorealistic representation of a virtual environment generated using a neural radiance field, the virtual environment having a virtual coordinate system for the virtual environment and background data, and a photorealistic representation of each of a plurality of users in the virtual environment, each user having an associated representation, each user’s associated user data converted to a position in the virtual environment; and render the dynamic scene data on the display device. [0009] Implementations may include the user equipment device of any of the foregoing embodiments where the 3D key point data of each user indicates real-time motion of the user of the user equipment, and the dynamic scene data includes data rendering the real-time motions of each user to the associated representation if the user in the virtual environment. Implementations may include the user equipment device of any of the foregoing embodiments further including transmitting image data for the user to provide training data for a neural radiance field network for the neural rendering. Implementations may include the user equipment device of any of the foregoing embodiments further including transmitting user profile information for training the neural radiance field network. Implementations may include the user equipment device of any of the foregoing embodiments where the dynamic scene data FTRW-01394WO0 6000548PCT01 - 5 - includes dynamically lighting the virtual environment and the associated representation of each user. [0010] One general aspect includes a non-transitory computer-readable medium storing computer instructions for rendering a photorealistic virtual environment occupied by a plurality of users. The non-transitory computer-readable medium storing computer instructions include generating a photorealistic representation of the virtual environment from environment data specifying at least a virtual coordinate system for the virtual environment and background data using neural rendering. The instructions also include generating, using neural rendering, a photorealistic representation of each of the plurality of users, each user having an associated photorealistic representation; receiving user data may include receiving at least user position data and 3D key point data, and a user profile of each user in the virtual environment; converting the received user data to a position the virtual environment; and rendering the associated photorealistic representation of each of the plurality of users in a dynamic scene of the virtual environment. [0011] Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments where the user data is encoded and/or compressed, and the user data is received over a network from a client processing device associated with each user. 
Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the 3D key point data of each user indicates real-time motions of each user, and the rendering includes rendering the real-time motions of each user to the associated representation in the virtual environment. Implementations may include the non-transitory computer- readable medium of any of the foregoing embodiments wherein the non-transitory computer-readable medium the instructions further causing the one or more processors to perform the steps of accessing image data for each of the plurality of users and training a neural radiance field network for each user for the neural rendering using the image data. Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the non- transitory computer-readable medium includes the instructions further causing the one or more processors to perform the steps of accessing user profile information and FTRW-01394WO0 6000548PCT01 - 6 - training the neural radiance field network for each user for the neural rendering using the user profile information. Implementations may include the non-transitory computer- readable medium of any of the foregoing embodiments wherein the rendering includes dynamically lighting the virtual environment and the associated representation of each user. Implementations may include the computer implemented method of any of the foregoing embodiments wherein the rendering includes determining a physical plausibility for one or more objects and/or one or more representations of users in the virtual environment. Implementations may include the non-transitory computer- readable medium of any of the foregoing embodiments wherein the rendering includes determining a physical plausibility for one or more objects and/or one or more representations users in the virtual environment. Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the generating a photorealistic representation of a user may include a free- viewpoint rendering method of a volumetric representation of a person. Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the free-viewpoint rendering method is implemented on at least one fully fused multi-layer perceptron. Implementations may include the non-transitory computer-readable medium of any of the foregoing embodiments wherein the fully fused multi-layer perceptron uses a tiny Compute Unified Device Architecture neural network framework. [0012] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background. FTRW-01394WO0 6000548PCT01 - 7 - BRIEF DESCRIPTION OF THE DRAWINGS [0013] Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying figures for which like references indicate the same or similar elements. [0014] FIG. 1 illustrates an interface of a photorealistic virtual world conference application showing a first example of a display of a virtual environment. 
[0015] FIG.2 illustrates an example of a network environment for implementing a photorealistic real-time metaverse conference. [0016] FIG.3 is a block diagram of a service host and a plurality of clients. [0017] FIG.4 is a flow chart illustrating a process performed by a user client device in accordance with the virtual conferencing system. [0018] FIG.5 is a flow chart illustrating a process performed by a service host to train a human NeRF renderer. [0019] FIG.6 is a flow chart illustrating a process performed by a service host in accordance with the virtual conferencing system to render the virtual conference data. [0020] FIG.7 is a flow chart illustrating a process performed by a user client device in accordance with the virtual conferencing system. [0021] FIG.8 is a block diagram of an alternative service host architecture. [0022] FIG.9 is a diagram illustrating the training flow for a human NeRF renderer. [0023] FIG. 10 is a diagram illustrating the real-time rendering flow for a NeRF renderer. [0024] FIG.11 is a ladder diagram illustrating data flow between user devices and a service host. [0025] FIG.12A is a diagram illustrating Human NeRF formation. FTRW-01394WO0 6000548PCT01 - 8 - [0026] FIG.12B is a diagram illustrating a pose correction MLP. [0027] FIG.12C is a diagram illustrating a skeleton motion network. [0028] FIG.12D is a diagram illustrating a residual non-rigid motion MLP. [0029] FIG.12E is a diagram illustrating a NeRF network for obtaining color and density. [0030] FIG.12F is a diagram illustrating parallel processing in accordance with the disclosure. [0031] FIG.13 is a flowchart illustrating dynamic composition. [0032] FIG.14 is a data flow diagram illustrating dynamic composition. [0033] FIG.15 is a data flow diagram illustrating dynamic relighting. [0034] FIG.16 is a block diagram of a network processing device that can be used to implement various embodiments. [0035] FIG.17 is a block diagram of a network processing device that can be used to implement various embodiments of a meeting server or network node. [0036] FIG.18 is a block diagram illustrating an example of a network device, or node, such as those shown in the network of FIG.2. DETAILED DESCRIPTION [0037] The present disclosure and embodiments address real-time, online, audio- video interactions in a virtual environment. In embodiments, the virtual environment may comprise a “metaverse.” In one implementation, a virtual environment serves as a dedicated conferencing environment. [0038] The present disclosure and embodiments provide a system and method for generating and displaying a photorealistic virtual environment which include representations of a plurality of users in the virtual environment using neural rendering. FTRW-01394WO0 6000548PCT01 - 9 - In embodiments, the representations comprise photorealistic representations of each user. A user’s position, movements and audio expressed in a real-world environment is captured by a user processing device associated with the user, and translated to a representation of the user in a virtual environment or metaverse. The representation of each user along with each user’s audio data are rendered in the virtual environment for perception by other users participating in the virtual environment, generally by rendering a dynamic scene of the environment on each user’s processing device. 
The user’s position in the virtual environment is derived using encoded and compressed user data which may include user position data and 3D key point data from the associated user processing device to a service host. The service host provides the neural rendering processing to render the representation of each of the plurality of users in a dynamic scene of the virtual environment. Data to render the dynamic scene is returned to each processing device associated with a user to allow rendering of the dynamic scene on the user processing device. [0039] Conferences in the virtual environment may be scheduled and hosted by users operating from client processing devices which access a real-time virtual environment application provided by a service provider on service host computing device or devices. In other implementations, the virtual conferencing application acts as an “always on” metaverse. In the context of this description, the term “metaverse” means a spatial computing platform that provides virtual digital experiences in an environment that acts as an alternative to or a replica of the real-world. The metaverse can include social interactions, currency, trade, economy, and property ownership. In a unique aspect, the appearance of users in the virtual environment described herein is a photorealistic appearance, replicating each user’s real-world appearance using neural radiance field rendering technology. A metaverse may be persistent or non- persistent such that it exists only during hosted conferences. [0040] In embodiments, the system herein uses neural rendering to generate photorealistic representations of users and a virtual environment by using neural networks to learn how to simulate lighting, reflection, and refraction that occur in the real world. In neural rendering generally, a 3D model of the object or scene is created, which includes information about the geometry, surface materials, and lighting. Then, FTRW-01394WO0 6000548PCT01 - 10 - the neural network is trained on a large dataset of images or videos that capture similar objects or scenes under different lighting conditions. During training, the neural network learns to predict how light interacts with the 3D model based on its parameters and the lighting conditions. Once the network is trained, it can be used to generate new images or videos of the 3D model under different lighting and viewing conditions. Overall, the neural network's ability to learn complex relationships between the input data and the output images or videos enables it to generate highly realistic images of 3D objects and scenes, even under challenging lighting conditions. The present disclosure and embodiments employ a free-viewpoint rendering system that is capable to synthesize any novel view from any viewpoint. The system disclosed herein provides an immersive and photorealistic experience for virtual conferencing and meetings in a metaverse using, in embodiments, single-view video or multi-image acquisition using a single camera, thereby lowering cost, reducing communication bandwidth over current systems, and providing greater accessibility for different types of processing devices. In embodiments, free-viewpoint synthesis of human dynamics is utilized using the captured single-view multiple images or video, and provides accelerated photorealistic rendering of photorealistic representations of users in the metaverse. [0041] FIG. 
1 illustrates a two-dimensional interface 100 of an online virtual conference application showing a first example of audio/visual information which may be presented in the interface 100. Interface 100 illustrates a dynamic scene of a virtual environment which in this example is a virtual “meeting room” , where a representation 110, 120 of each of the participants is displayed. Interface 100 includes a presentation window 110 showing two attendee representations 110, 120 in the virtual environment 130 where the meeting room may comprise an environment for an online meeting. In this example, the virtual environment comprises a room with a table around which the attendees 110, 120 are facing each other. The interface 100 in this example may be presented on a two-dimensional display window presented on a processing device such as a computer with a display, a mobile device, or a tablet with a display. In embodiments, the interface 100 may present a participant, first person view of the environment 130 (where the view is neither that of attendee representation 110 nor FTRW-01394WO0 6000548PCT01 - 11 - 120 in this example). In other embodiments, the interface may be displayed in a virtual reality device to that the viewing attendee is immersed in the virtual environment 130. In such embodiments, where the view is that of the perspective of the user’s representation in the virtual environment, that user may not perceive the totality of that user’s rendered representation. As in the real-world, such user will only see other user representations and those portions of their own representation (arms, legs, body) that a user would normally see if that user were in a real-world environment. In still other embodiments, the interface 100 may include the entirety of the representation of the viewing participant as one of the attendee representations 110, 120 rendered in the virtual environment. It should be understood that in embodiments, an environment interface is provided for each of the attendees/participants in a virtual environment. [0042] In the application interface 100, while the attendees 110, 120 are illustrated as avatars, in accordance with this technology, in dynamic scene displayed in the interface 100, each attendee representation comprises a photorealistic representation of a human attendee (user) associated with the representation. In this context, a photorealistic representation is a visual representation of the user which is rendered as a photographically realistic rendering of the user, such that interactions between people, surfaces and light are perceived in a lifelike view. To be realistic, the represented scene may have a consistent shadow configuration, virtual objects within the interface must look natural, and the illumination of these virtual objects needs to resemble the illumination of the real objects. A photorealistic rendered image, if generated by a computer, is indistinguishable from a photograph of the same scene. The technology disclosed herein presents techniques to capture multiple users in real- time and bring them in virtual environment 130 in a photorealistic manner. In addition to rendering a virtual background and rendering human participants (users), the system may also generate representations of non-human objects, such as cellphones, computers, furniture, etc. that are often used by the users in real world or often appear in a real-world conferencing room or social meeting. 
[0043] FIG.2 illustrates an example of a network environment for implementing a real-time virtual environment application. Network environment 200 includes one or more service host processing devices 240a – 240d. As described herein, each service FTRW-01394WO0 6000548PCT01 - 12 - host 240 (one embodiment of which is shown in FIG.3 and another in FIG.8) may have different configurations depending on the configuration of the system. Also shown in FIG. 2 are a plurality of network nodes 220a - 220d and user (or client) processing devices 210, 212, 214, 216. The service hosts 240a – 240d may be part of a cloud service 250, which in various embodiments may provide cloud computing services which are dedicated to the real-time virtual environment application. Nodes 220a - 220d may comprise a switch, router, processing device, or other network- coupled processing device which may or may not include data storage capability, allowing cached data to be stored in the node for distribution to devices utilizing the real-time virtual environment application. In other embodiments, additional levels of network nodes other than those illustrated in FIG.2 are utilized. In other embodiments, fewer network nodes are utilized and, in some embodiments, comprise basic network switches having no available caching memory. In still other embodiments, the meeting servers are not part of a cloud service but may comprise one or more meeting servers which are operated by a single enterprise, such that the network environment is owned and contained by a single entity (such as a corporation) where the host and attendees are all connected via the private network of the entity. Lines between the processing devices 210, 212, 214, 216, network nodes 220a - 220d and meeting servers 240a – 240d represent network connections of a network which may be wired or wireless and which comprise one or more public and/or private networks. An example of node devices 220a - 220d is illustrated in FIG.18. [0044] Each of the processing devices 210, 212, 214, 216 may provide and receive real-time user motion, audio, and visual data through one or more of the network nodes 220a- 220d and the cloud service 250 via the network. In FIG. 2, device 210 is illustrated as a desktop computer processing device with a camera 211 and virtual display device 205 in communication therewith. Also illustrated are three examples of participant devices, including a tablet processing device 212, a desktop computer processing device 214 and a mobile processing device 216. It should be understood that the tablet processing device 212 and mobile processing device 216 may include integrated image sensing devices, and desktop computer 214 may include an integrated or non-integrated image processing device (not shown) which may be a FTRW-01394WO0 6000548PCT01 - 13 - camera or other image sensor. It should be understood that any type of processing device may fulfill the role of a user processing device and there may be any combination of different types of processing devices participating in a virtual conference hosted by one or more service hosts 240a – 240d. It should be further understood that tablet processing device 212, a desktop computer processing device 214 and a mobile processing device 216 may also include displays which render the virtual environment described herein as well as image capture devices included therein or attached thereto. 
[0045] In embodiments, one user may serve as a meeting host or conference organizer who invites others or configures a virtual conferencing environment using the real-time virtual environment application. In real-time virtual environment, all participant devices may contribute to environment data. In other embodiments, a host processing device may be a standalone service host server, connected via a network to participant processing devices. It should be understood that there may be any number of processing devices operating as participant devices for attendees of the real-time meeting, with one participant device generally associated with one attendee (although multiple attendees may use a single device in other embodiments). [0046] Each device in FIG.2 may send and receive real-time virtual environment application data. In one example, user motion and audio/visual data may be sent by a source device, such as device 210, through the source processing device’s network interface and directed to the other participant devices 212, 214, 216 though, for example, one or more service hosts 240a - 240d. Within the cloud service 250 the data is distributed according to the workload of each of the service hosts 240 and can be sent from the service hosts directly to a client or through one or more of the network nodes 220a – 220d. In embodiments, the network nodes 220a - 220d may include processors and memory, allowing the nodes to cache data from the real-time virtual environment application. In other embodiments, the network nodes do not have the ability to cache data. In further embodiments, real-time virtual environment application data may be exchanged directly between participant devices and not through network nodes or routed between participant devices through network nodes without passing FTRW-01394WO0 6000548PCT01 - 14 - through service hosts. In other embodiments, peer-to-peer communication of real-time content and recovery content may be utilized. [0047] FIG.3 is a block diagram of components of a service host and a plurality of client processing devices illustrating a system for providing a photorealistic real-time virtual environment application in a network. FIG. 3 illustrates one example of technology which enables the photorealistic real-time virtual environment application using a Neural Radiance Field (NeRF) renderer 380. In this example, real-world movements, actions, and audio by individual users 310, 320, 330 in their respective real-world environments are translated into movements, actions and audio in a virtual conferencing environment 130. Environment 130 is rendered by renderer 380 based on data obtained from client processing devices 305, 315, 325, and 335. Prior to rendering the virtual conferencing environment, renderer 380 is trained with individual user data, as discussed herein. [0048] In this example, client processing devices 305, 315 and 325 are all illustrated as desktop computers, each having an associated display and image capture device. In embodiments, each client processing device in a real-time virtual environment application environment should include an image capture component and a display component. Client processing device 335 is illustrated as a mobile device which in embodiments includes in integrated display and image capture device. Client processing devices 305, 315, 325 and 335 are respectively associated with an individual user 310, 320, 330 and 340, though in other embodiments an individual processing device may be associated with multiple users. 
[0049] At each client processing device 305, 315, 325 and 335, an image capture system captures a user’s profile, three-dimensional (3D) key points, and the user’s position in the real-world relative to a real-world coordinate system. Each capture is associated with corresponding timestamp data. The timestamp data may be acquired from a universal time source or synchronized between the various elements of the system discussed herein. Each image capture system at client processing device 305, 315, 325 and 335 works independently and in parallel to provide the user profile information, three dimensional key points, and user positions at multiple timestamps FTRW-01394WO0 6000548PCT01 - 15 - to the service hosts. A user profile contains a user’s appearance, position and styling preferences. For example, a user may provide information regarding clothing preferences in environment 130 of real-time virtual environment application. The user’s appearance may be created for the virtual environment 130 using one or more static images gathered from various viewpoints, or in one or more motion videos captured from various 3D viewpoints, taking into account preferences specified by the user in the user’s profile information. User appearances may also be created based on descriptions saved in the user profiles. The 3D key points identify the user’s body poses in the real-world coordinate system, while the user’s position in the real-world is defined by user data (including user profile, historical rendering data and position data) from an associated user processing device, with the user data and key point data from the real-world coordinate system allowing positioning and posing of the photorealistic representation of the user within the virtual environment. [0050] The object database 345 stores relevant user information (user profile, key points, and position) for each user participating in the real-time virtual environment application. In the embodiment of FIG.3, the object database is shown as a central database on the service host 240, but the location of the database can vary based on system requirements. [0051] Also shown in FIG.3 is a theme database 350. The themes database 350 comprises a global system database that stores virtual backgrounds in a background database 360 and virtual environment data in an environment database 355. The background database 360 comprises rendering information for the virtual environment 130 including the virtual world background and a three-dimensional virtual environment coordinate system which can be used to position the virtual representation of the user in the virtual environment. In the example of FIG. 1, the environment 130 is a room including a table, but as should be understood, the environment may comprise any three-dimensional representation of a world where individuals may move about in the context of the technology discussed herein. Environment database 355 may include lighting information for the virtual environment. In embodiments, each user’s position in the real-world coordinate system is translated to the virtual environment coordinate position. The user’s position FTRW-01394WO0 6000548PCT01 - 16 - in the real world is considered when positioning a representation of the user in the virtual environment. The virtual environment rendering considers the proper positioning of all the participants when placing photorealistic representations of each user in the virtual environment. 
A given user’s position in the virtual space maybe defined based on rules or regulations of the meeting/the system and/or user interaction with the system. The user’s virtual world position may also take into consideration the user profile and preferences, as well as the user’s position and pose in the real world. The human body rendering process, described below, reconstructs a photorealistic human model and enables novel pose rendering of the human. As described herein, the rendering system accepts a single continuous 5D coordinate system as input (a spatial location (x, y, z) and viewing direction (θ,ϕ) and can output different representations for the same point when viewed from different angles. Environment database 355 includes scene lighting information and other data augmenting the rendering of the virtual environment 130 where the virtual conferencing takes place. When establishing a virtual conference, a meeting host or organizer may choose the background and environment for the virtual conference. For example, the host or organizer may choose to have a virtual conference in a virtual conference room during daylight (or at night), all of which are rendered by the virtual conferencing application. [0052] Data from the object database is provided to position converter 365. The converter 365 transforms a user’s position in their respective real-world environment to a position in the virtual conferencing environment with the aid of background data 360 and environment data 355 from the theme database. Converter 365 maps the coordinates between real world and virtual world, and thus is used to position the user in real world plausibly into virtual world. An instance 365a, 365b, 365c of converter 365 is respectively created for each user 310, 320, 330. In other embodiments, separate converter applications are provided on client processing devices or separate converter applications for each user may be run on one or more of the service hosts 240. Each instance of a converter works independently and in parallel with other instances to provide data to the renderer 380. [0053] Renderer 380 receives theme and refactorized user information and outputs a real-time photorealistic virtual scene of the virtual environment. The renderer 380 FTRW-01394WO0 6000548PCT01 - 17 - provides NeRF formation, dynamic composition, and dynamic relighting in a scene 100 of a virtual environment 130. Renderer 380 is trained with sufficient data to create the rendering of the representation’s individual users and the virtual environment 130. The output of the renderer 380 is provided to each of the client processing devices 305, 315, 325 and 335 and the virtual environment scene 100 rendered as appropriate given the type of display available on each client device. A NeRF may comprise one or more fully-connected neural networks that can generate novel views of complex 3D scenes, based on a partial set of 2D images, motion videos or frames of motion videos. It is trained to reproduce input views of a scene and works by taking input images representing a scene and interpolating between them to render one complete scene. [0054] A NeRF network is trained to map directly from viewing direction and spatial location (5D input) to opacity and color (4D output), using volume rendering to render new views. NeRF is a computationally intensive algorithm, and processing of complex scenes can take hours or days. 
As such, in embodiments it is desirable to include the NeRF components discussed herein on the service host devices, where processing power may be greater than at individual user processing devices. [0055] Renderer 380 utilizes techniques described in Mildenhall, Ben et al. “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.” European Conference on Computer Vision (2020) and Weng, Chung-Yi et al. “HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 16189-16199 (hereinafter “Weng et al.”) (each of the foregoing documents is hereby incorporated by reference in its entirety herein) to synthesize views of complex scenes and humans using a continuous volumetric scene function from input views and/or motion video. Renderer 380 uses a non-convolutional deep network, whose input may be a single continuous 5D coordinate (a spatial location (x, y, z) and viewing direction (θ, Φ)), and whose output is the volume density and view-dependent emitted radiance at that spatial location. In HumanNeRF 1022 (discussed below with respect to FIG. 12), the skeletal motion translation 1240 uses a convolutional neural network. Other components in the renderer 380 may all be based on non-convolutional neural networks. Using the aforementioned techniques, renderer 380 can synthesize photorealistic details of a user’s body, as seen from various camera angles that may not exist in the input images or video, as well as minute details such as cloth textures and facial appearance. The motion field is decomposed into skeletal rigid and non-rigid motions, with the rigid motion derived from the convolutional neural network and the non-rigid motion derived from the non-convolutional networks. [0056] Once trained on a user’s photorealistic details, real-time user data in the form of 3D key points reflecting motion of the user, and a user position reflecting the location of the user in the real world, can be used to render a photorealistic representation of the user in the virtual environment in the photorealistic dynamic scene on user processing devices. This user data, combined with audio data from the user, allows users to communicate within the virtual environment. 3D key point data are mainly used to describe the body poses of a human. The 3D key point data are collected to project user body poses in the real-world environment into the virtual world. The function of converters 365a – 365c (or 375a – 375d) is to provide a point-to-point mapping between the real world and the virtual world. The converters determine the user position in the virtual environment according to the user position in the real-world environment. A photorealistic dynamic representation of the user in the virtual environment is completed by combining the user body poses in the virtual environment and the user position in the virtual environment. [0057] FIG. 4 is a flow chart illustrating a process performed by a client processing device in accordance with the virtual conferencing system. The method of FIG. 4 comprises providing user data to the converter 365 and renderer 380 to either train the renderer 380 or allow the renderer 380 and converter 365 to create the photorealistic virtual environment. At 410, user data comprising, at least in part, user images, is captured by an image capture system associated with the client processing device.
The images may be a plurality of still images, user video, or frames extracted from user video.3D key points may be gathered from the user images or captured separately using, for example, a depth image capture device. The 3D key points for user spatial locations are points in the image that define locations on a user’s body in the real-world coordinate system that are used to render motion of the user. Image FTRW-01394WO0 6000548PCT01 - 19 - capture may comprise two dimensional static images, motion video of the user and/or depth images of the user. User position data may also be captured from user image data or separately by other sensors (such as, for example, a depth image capture system). At 420, a timestamp is associated with each captured image or video. The timestamp may be a universal timestamp referenced by all client processing devices capturing images such as, for example, a timestamp referenced to a universal time source. At 430, the user position in the real-world environment occupied by the user is determined relative to a coordinate system mapped to the user’s real-world environment. At 440, user profile data may be retrieved. In embodiments, user profile data may comprise user preferences indicating how the user’s photorealistic appearance can be modified in the virtual environment, such as clothing preferences and the like. At 450, the user profile and/or key point data may be compressed for transmission to a service host and at 460 the client processing device outputs the user profile, user position, and key points to the object database for use by the converter 365 and renderer 380. The process may continue repeating the capture step 410, timestamp association 420, position determination 430, compression/encoding 450 and outputting steps 460 as needed based on the application (training or real-time rendering). [0058] During a real-time rendering operation, capture step 410, timestamp association 420, position determination 430, compression/encoding 450 and outputting steps 460 are repeated continuously for the duration of the user’s presence in the virtual environment. [0059] FIG.5 is a flow chart illustrating a process performed by a service host (or other processing device) to train a renderer 380. At 510, for each user in the virtual conference, the user’s user profile, key points and captured position with timestamp is retrieved and at 530 is used to train the renderer 380 (discussed with respect to FIG. 9, below). There are a number of NeRF components of renderer 380 which are trained as discussed herein. These include human NeRF networks (922, 928, 934), NeRF networks in the dynamic composition module 942 and NeRF networks in the dynamic relighting module 944. FTRW-01394WO0 6000548PCT01 - 20 - [0060] FIG.6 is a flow chart illustrating a process performed by a service host in accordance with the virtual conferencing system to render the virtual conference data during a real-time conference. At 610, for each user in a real-time virtual environment application, user key point data and profiles are collected at 620. The user profile and initial key point data may be retrieved from the object database and/or retrieved in real time from the data output by the user processing device. During user participation in the real-time virtual environment application, user key point data is retrieved in real time in order to allow the renderer 380 to mimic user movements in the real-world to the representation of the user in the real-time virtual environment 130. 
At 630, the real-world user data is converted into virtual positioning within the virtual environment. The theme environment and background data are accessed at 640 and the virtual environment is rendered at 650 with representations of each user participating in the real-time virtual environment application. Again, it should be noted that in the real-time virtual environment dynamic scene 100 which is presented to an individual user whose representation is in the virtual environment, that user may not “see” all of that user’s rendered representation; rather, the user will only see other users and the portions of their own representation (arms, legs, body) that a user would normally see if that user were in a real-world environment. The rendered environment with user motion is output to user processing devices at 660. At 670, steps 610, 620, 630 and 660 are repeated for user motion and actions within a virtual conference. [0061] FIG. 7 is a flow chart illustrating a process performed by a user client device in accordance with the virtual conferencing system. Render data is received at 710, and for each user participating in a real-time virtual environment application within a virtual environment, the representation of the virtual environment including other user representations is rendered on the display device associated with a user’s processing device at 730. When the user position or actions are updated at 740, the rendering of the user (and of the environment, if it has changed) is updated. [0062] FIG. 8 illustrates an alternative architecture to that illustrated in the system shown in FIG. 3. In the architecture of FIG. 8, the renderer 380 and theme database 350 (including the environment database 355 and background database 360) are provided on the service host 240’, but rather than a central object database stored on the service host as in the embodiment of service host 240 of FIG. 3, the user profile information, key points and user position data are stored in user databases 810, 812, 814, 816, each of which is associated with a user processing device. In one embodiment, each user database 810, 812, 814, 816 may be stored on the processing device; in alternative embodiments, each user database may be stored in a database on an intermediate processing device or network node. Converters 375a – 375d may similarly be provided on client processing devices, edge servers, network processing devices, or intermediate processing devices. Data from the theme database 350 is provided to each of the converters 375a – 375d to allow for conversion of the user’s position in the real-world environment to a position in the virtual environment. As noted above, the renderer 380 requires substantial processing power and thus may be implemented in a cloud computing environment including one or more service hosts 240 or 240’. [0063] FIG. 9 is a diagram illustrating components of a renderer 380 during a training process, including untrained human NeRFs 922, 928, 934, a dynamic composition module 942 and dynamic relighting module 944, a background NeRF, and training or formation modules, including NeRF formation modules 920, 926 and 932 respectively associated with human NeRFs 922, 928, 934, and NeRF formation module 938 associated with background NeRF 940. Although three human NeRFs are illustrated in FIG. 9, it will be understood that any number “N” of human NeRFs (one per user) may be included in the system.
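The structure of FIG. 9, one human NeRF per user plus a shared background NeRF feeding dynamic composition and relighting, can be summarized in the following sketch. All class and method names are hypothetical placeholders for the trained modules and show only how the per-user fields scale with N.

```python
class RendererSketch:
    """Illustrative container mirroring FIG. 9/10: per-user human NeRFs plus a
    background NeRF feeding dynamic composition and dynamic relighting."""

    def __init__(self, background_nerf, compose_module, relight_module):
        self.human_nerfs = {}              # user_id -> trained human NeRF
        self.background_nerf = background_nerf
        self.compose = compose_module
        self.relight = relight_module

    def add_user(self, user_id, human_nerf):
        self.human_nerfs[user_id] = human_nerf

    def render(self, keypoints_by_user, positions_by_user,
               background_data, environment_data):
        # Render each user's representation from their real-time 3D key points.
        user_renderings = {uid: nerf(keypoints_by_user[uid])
                           for uid, nerf in self.human_nerfs.items()}
        background = self.background_nerf(background_data)
        scene = self.compose(background, user_renderings, positions_by_user)
        return self.relight(scene, environment_data)
```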
In FIG.9, the process flow demonstrates the training of each NeRF network to allow creation of a photorealistic dynamic scene 100 on a client processing device. In order to accomplish this rendering within the constraints of existing network bandwidths and transmission rates, advance training of a human NeRF 922, 928, 934 for each user occurs, and a background NeRF for the virtual environment is utilized. A universal timestamp 902 is utilized for each component of data 904, 906, 908, 912, and 914. Each set of user data 904, 906, 908 may contain a user profile and 3D key points, as discussed above. Historical user data in the form of 3D key points or a previously trained neural radiance field network may contribute to creation of a corresponding trained human NeRF 922, 928, 934 for each user. For each user, based on the user data 904, 906, 908, a NeRF formation training FTRW-01394WO0 6000548PCT01 - 22 - process 920, 926, 932 acts on the corresponding un-trained human NeRF 922, 928, 934 and creates a trained Human NeRF (NeRFs 1022, 1028, 1034 in FIG. 10). Similarly, background data 912 is utilized by NeRF formation process 938 to train a background NeRF 940. During training, a rendering of the user representation generated by each human NeRF 922, 928, 934 may be combined with each respective user position in the virtual conferencing environment 130 and provided to a dynamic composition module 942 to combine a rendering of the environment background from the background NeRF 940 with the respective user positions to create the representations of each of the users in the virtual environment 130. A dynamic relighting process 944 utilizes environment data 914 to add realistic shadows and highlights to the virtual environment 130 as displayed in dynamic scene 100. [0064] The resulting trained human NeRFs 1022, 1028, 1034 (FIG. 10) have completed a training process are now considered “pre-trained” to optimize rendering for each individual user and output a photorealistic view of the user from a number of viewpoints. Similarly, the dynamic composition module and dynamic relighting module 944 are pre-trained. [0065] FIG. 10 is a diagram illustrating the real-time rendering flow for renderer 380. In FIG.10, the human NeRFs 1022, 1028, 1034 for each user (User 1, User 2, .... User N) have been trained as discussed with respect to FIG.9. A real-time data timestamp 1002 is used by the components of the renderer 380. The real-time data universal timestamp 1002 allows synchronization of user key point data and positions within the virtual environment 130 of the real-time virtual environment application. User 3D key points 1004, 1006, 1008 are captured in real-time for any user participating in a virtual environment in the real-time virtual environment application. The key points are provided for the respective user from a client processing device (e.g., devices 305, 315 and 325, each having an associated display and image capture system) and are provided to pre-trained respective human NeRFs 1022, 1028, 1034 for each user. The human NeRFs 1022, 1028, 1034 for each user render a photorealistic representation of the user which will be combined with the background and environment data in the virtual environment and represented to each user as a photorealistic dynamic scene 100, following the movements and audio inputs provided by and to each user. 
Using FTRW-01394WO0 6000548PCT01 - 23 - the background data 1012, a converter determines the respective user’s position in the metaverse 1024, 1030, 1036 and the user’s position in the metaverse 1024, 1030, 1036 is combined with the rendering data to the pretrained dynamic composition module 1042 and dynamic relighting module 1044. The background data 1012 is provided to the pre-trained background NeRF 1040, which provides background rendering data to the dynamic composition module 1042. In embodiments, both background data 1012 and environment data 1014 are optional where, for example, the virtual environment is chosen to have a static background or static environment. If one chooses to change the background and/or environment during a virtual event, or create a new event, different background data and environment data are provided to the pre-trained background NeRF 1040. The pre-trained dynamic composition module 1042 combines rendering data from each user and the respective user’s position in the virtual environment to compose the virtual environment. The rendered virtual environment is modified by the pre-trained dynamic relighting module 1044 and outputs the photorealistic dynamic scene 100 in a format suitable for display on the associated display device of client processing devices 305, 315 and 325. [0066] As will be understood by reference to FIGs.9 and 10, the amount of data required for rendering a virtual environment in a photorealistic manner is significant. The technology optimizes the amount of data needed to provide the metaverse environment using several data optimization techniques. Following the initial composition of each dynamic scene, rendering of the movements of users based on real-time 3D key point data may be performed by inference. In one aspect, the system herein utilizes a single camera video/image acquisition system, thereby reducing hardware costs and data usage. The motion acquisition data optionally comprises full- body movement sequences of a dynamic human subject and uploads the acquired info into the user database. The motion info can be in several formats. For example, one simple format can be a monocular video or a sequence of video frames. Alternatively, appearance profile and 3D skeletal key points may be extracted from video and uploaded. In embodiments, historical data can be leveraged to accelerate data communication in future conferencing events. FTRW-01394WO0 6000548PCT01 - 24 - [0067] FIG.11 is a ladder diagram illustrating data flow between user devices and a service host. Although two user devices 305, 315 and one service host 240 are illustrated, it will be understood that any number of user processing devices and service hosts may be utilized in the present system. At 1105, 3D key point data which is extracted from a user via an image capture device 305b at an associated client processing device (e.g., device 305) is provided to the service host 240 for training. As noted above, the key point data may be encoded and/or compressed. Similarly, key point data 1120 from another capture device 315b associated with a client processing device (e.g., device 315) is provided to the service host for training. Rendered scene data 1110 and 1115 is returned to the respective client processing devices for display on associated displays 305a, 315a following training. 
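The per-frame payload exchanged in FIG. 11, time-stamped 3D key points together with the user position, optionally compressed before upload, might be serialized as in the following sketch. The field names and the use of JSON with zlib compression are assumptions for illustration and do not reflect a disclosed wire format.

```python
import json
import time
import zlib

def pack_keypoint_frame(user_id, keypoints_3d, position):
    """Serialize and compress one frame of 3D key point data for upload.

    keypoints_3d : list of (x, y, z) joint locations in the real-world
                   coordinate system.
    position     : the user's real-world position for the same frame.
    """
    frame = {
        "user_id": user_id,
        "timestamp": time.time(),   # stand-in for the universal timestamp
        "keypoints": keypoints_3d,
        "position": position,
    }
    return zlib.compress(json.dumps(frame).encode("utf-8"))

def unpack_keypoint_frame(payload):
    return json.loads(zlib.decompress(payload).decode("utf-8"))

# Round trip with a three-joint pose.
blob = pack_keypoint_frame("user-1",
                           [[0.0, 1.6, 0.2], [0.1, 1.2, 0.2], [0.0, 0.9, 0.2]],
                           [1.0, 0.0, 2.5])
print(unpack_keypoint_frame(blob)["keypoints"])
```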
[0068] When users associated with client processing devices 305, 315 wish to participate in the real-time virtual environment application, each device issues a service request 1135, 1140, to request entry into the virtual environment. The request is acknowledged to indicate that the client device has entered the service. Each device submits real-time 3D key point data 1145, 1150, respectively. In embodiments, in order to reduce the bandwidth required for the 3D key point data, key point data is encoded and/or compressed by the client processing device and provided at 1145 to a service host 240 for processing. After generating virtual environment data including the real- time movements of the user, (in this case user 1 from device 305, ) virtual environment data is provided to both device 315 and device 305 at 1150 for display thereon. Similarly, key point data is encoded and/or compressed by the client processing device 315 and provided at 1155 to a service host 240 for processing. It should be understood that in addition to 3D key point data for each of the users, audio data from each user may be forwarded to the service hosts and played in the virtual environment in synchronization with rendering the photorealistic representation of each user in the real-time virtual environment application. After generating virtual environment data including the real-time movements of the user, (in this case user 2 from device 315, ) virtual environment data is provided to both device 315 and device 305 at 1160 for display thereon. Again, rendering of movement data, changes in lighting or other aspects of the environment following the initial rendering of the background and user FTRW-01394WO0 6000548PCT01 - 25 - representations in the virtual environment is performed by inference. Clients exit the virtual environment by issuing a disconnect request 1165 or 1170. [0069] FIG. 12A is a diagram illustrating NeRF formation of a photorealistic representation of users in the virtual environment by one human NeRF 1022 of the renderer 380 based on the techniques described in Weng, Chung-Yi et al. “HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 16189-16199 as modified by the techniques described in Thomas Müller, Fabrice Rousselle, Jan Novák, and Alexander Keller.2021. Real-time neural radiance caching for path tracing. ACM Trans. Graph.40, 4, Article 36 (August 2021). (Each of the foregoing documents is incorporated by reference in its entirety herein). FIG. 12A illustrates the technology as used herein with certain modifications to the techniques described aforementioned. In FIG.12A, arts from human animation and NeRF rendering are combined to create photorealistic view synthesis from any viewpoint. Using the motion data as input, a renderer 380 interprets the human shape, pose, and camera parameters, and employs neural networks to learn a human motion field with aid of a pose corrector (pose correction MLP 1225). The renderer also leverages an acceleration approach to speed up its training and inference. [0070] The human NeRF formation network 1022 takes a video frame (or image or series of images) of a user 1215 in an observation space 1217 as input and optimizes for canonical appearance, represented as a continuous field as well as a motion field mapping from observation space 1217 to canonical space 1255. 
The motion field 1235 is decomposed into skeletal rigid motion 1240 (Mskel) and residual non-rigid motion 1250 (Mres) and is represented as a discrete grid and a continuous field, respectively. Mskel represents skeleton-driven deformation and Mres starts from the skeleton-driven deformation and produces an offset Δx to it. The method additionally refines a body pose 1212 through pose correction MLP 1225. A loss is imposed between the rendering result and the input image. Three multi-layer perceptrons (MLPs) and a convolutional neural network (CNN) are used in the human NeRF formation. An MLP is a fully connected feedforward artificial network which consists of at least FTRW-01394WO0 6000548PCT01 - 26 - three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. [0071] Given a sequence of images from, for example, a monocular video, a parametric human body model is utilized (e.g. an SMPL-based model), to obtain estimations of camera parameters K, pose θ and shape β of the human body given in the monocular video. In one embodiment, a Skinned Multi-Person Linear Model (SMPL) may be used. SMPL is a realistic 3D model of the human body that is based on skinning and blend shapes and is learned from thousands of 3D body scans. A SMPL model is a skinned vertex-based model that accurately represents a wide variety of body shapes in natural human poses. The parameters of the model are learned from data including the rest pose template, blend weights, pose-dependent blend shapes, identity-dependent blend shapes, and a regressor from vertices to joint locations. In another alternative, a SMPL oPtimization IN the loop (SPIN) deep network may be used for regressing SMPL body shape and pose parameters directly from an image. SPIN uses a training method that combines a bottom-up deep network with a top-down, model-based, fitting method. Pre-trained weights of the SPIN model are utilized. (Although monocular video is discussed herein, video or images captured by multi-camera systems may be utilized in embodiments herein.) Compared to the SMPL model which relies solely on regression, SPIN combines iterative optimization and deep-network regression to estimate human poses more accurately. These SPIN estimates (camera parameters, pose and human body shape) are used as the initial input parameters into the framework, where the pose parameters, particularly, are gradually refined through a pose correction MLP 1225 during training. A motion field of the human subject is created by two neural networks -- skeletal motion translation network 1240 and non-rigid motion MLP 1250. Skeletal motion translation network 1240 is a CNN that learns the skeletal rigid motion, but it is not a full representation of the motion field since it cannot interpret the non-rigid contents. Non-rigid motion MLP 1250 is a fully fused MLP which accounts for non-rigid residual motion. [0072] Body motion M may be represented as addition of skeletal-driven motion Mskel and a non-rigid residual field Mres, M = Mskel + Mres. During the skeletal motion estimation, inverse linear blend skinning is to transform the motions in observation FTRW-01394WO0 6000548PCT01 - 27 - space to canonical space. An MLP 1260 is used to learn the color and density maps, which are then volume-rendered into images. 
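The decomposition just described, warping an observation-space sample by the skeletal motion, adding a learned non-rigid offset, and querying the canonical NeRF for color and density, can be sketched as below. The three callables are placeholders for the trained networks of FIG. 12A, and the toy stand-ins exist only to exercise the flow.

```python
import torch

def observation_to_canonical(x_obs, pose, skel_motion, res_motion):
    """Map observation-space points to canonical space (M = Mskel + Mres).

    skel_motion : callable implementing the skeletal (inverse LBS) warp.
    res_motion  : callable predicting a pose-conditioned non-rigid offset.
    """
    x_skel = skel_motion(x_obs, pose)        # skeleton-driven deformation
    x_res = res_motion(x_skel, pose)         # residual non-rigid offset
    return x_skel + x_res

def query_canonical_nerf(nerf_mlp, x_canonical):
    """Return (color, density) at canonical-space points."""
    out = nerf_mlp(x_canonical)              # (N, 4): r, g, b, sigma
    return out[..., :3], out[..., 3]

# Toy stand-ins: identity warp, zero offset, and an arbitrary field.
skel = lambda x, p: x
res = lambda x, p: torch.zeros_like(x)
mlp = lambda x: torch.cat([torch.sigmoid(x), x.norm(dim=-1, keepdim=True)], dim=-1)
color, density = query_canonical_nerf(
    mlp, observation_to_canonical(torch.randn(5, 3), None, skel, res))
```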
The MLPs involved in the framework are accelerated using parallel computing, such as fully fused Compute Unified Device Architecture (CUDA) kernels in a tiny Compute Unified Device Architecture neural network (tiny-cuda-nn) framework, as discussed below. [0073] Each body pose θ can be represented as a combination of K joints J and corresponding K joint angles Ω = {ω0, . . ., ωK}. The poses estimated from pre-trained weights of parametric models (e.g., SPIN) do not have sufficient accuracy and may lead to pose mismatch. Thus, a pose correction MLP 1225 is used to learn an adjustment for a better pose alignment. One embodiment of MLP 1225 is shown in FIG. 12B. Joints J are estimated from images using the SPIN model, and an update is optimized relative to each of the K joint angles, ΔΩ = {Δω0, . . ., ΔωK}. The network parameters of MLP 1225 are optimized to provide the updates Δω0, . . ., ΔωK conditioned on ω0, . . ., ωK. Optimizing the network parameters leads to faster convergence compared to directly optimizing Δω0, . . ., ΔωK:

ΔΩ = MLPθ(Ω) (1)

[0074] Each θ is updated to θo from the corresponding joints, joint angles, and relative joint angle updates:

θo = (J, ΔΩ ⊗ Ω) (2)

[0075] Skeletal motion translation network 1240 represents skeletal motion volumetrically based on an inverse linear blend skinning algorithm that warps the points in observation space to canonical space (equivalent to warping an observed pose θo to a predefined canonical pose θc) in the following form:
Mskel(x) = Σ_{i=1}^{K} w_o^i(x) Gi(x) (3)
[0076] where w_o^i is the blend weight for the i-th bone in the observation space, x is the 3D coordinate of a point (i.e., the spatial location of the point), and Gi is the skeletal motion basis for the i-th bone. Gi may be defined as

Gi(x) = Ri x + ti (4)

[0077] with Ri and ti calculated from the corresponding body pose θo, and with w_o^i obtained by first solving for the canonical blend weight w_c^i and then deriving:
w_o^i(x) = w_c^i(Gi(x)) / Σ_{k=1}^{K} w_c^k(Gk(x)) (5)
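Read together, equations (3) through (5) compute a weighted sum of per-bone rigid transforms, with the observation-space weights derived from the canonical blend weights. The sketch below is illustrative only; it assumes the canonical weight volume is available as a callable and is not the disclosed network.

```python
import numpy as np

def skeletal_warp(x, rotations, translations, canonical_weights):
    """Inverse linear blend skinning warp of one observation-space point x (Eqs. 3-5).

    rotations, translations : per-bone (K, 3, 3) and (K, 3) arrays from pose θo.
    canonical_weights       : callable returning the K canonical blend weights
                              w_c at a canonical-space point.
    """
    K = rotations.shape[0]
    g = np.stack([rotations[i] @ x + translations[i] for i in range(K)])  # G_i(x), Eq. (4)
    w_c = np.array([canonical_weights(g[i])[i] for i in range(K)])        # w_c^i(G_i(x))
    w_o = w_c / (w_c.sum() + 1e-8)                                        # Eq. (5)
    return (w_o[:, None] * g).sum(axis=0)                                 # Eq. (3)

# Two-bone toy example with uniform canonical weights.
R = np.stack([np.eye(3), np.eye(3)])
t = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
uniform_weights = lambda p: np.array([0.5, 0.5])
print(skeletal_warp(np.array([0.0, 1.0, 0.0]), R, t, uniform_weights))
```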
[0078] FIG. 12C illustrates one embodiment of a skeletal motion network 1240 used to generate Mskel. A CNN 1240 is used to generate a weight volume Wc(x), which contains the set of canonical blend weights w_c^i(x), from a random constant latent variable z, and to optimize the network parameters:

Wc(x) = CNNskel(x; z) (6)

With Gi and w_o^i derived from the above equations, the skeletal motion field Mskel can be obtained. [0079] A residual motion field Mres is estimated to account for the non-rigid deformation content that is not explained by the skeletal motion field, such as the shifting and folding of clothes. Residual motion is modeled as a pose-dependent deformation field. [0080] Specifically, MLP 1250, an example of which is illustrated in FIG. 12D, is used to learn a non-rigid deformation offset conditioned on the skeletal motion field. Adding the non-rigid offset to the skeletal motion completes the motion field:

Mres(xskel, θo) = MLPres(γ(xskel), θo) (7)

[0081] where xskel represents points in the skeletal motion field Mskel, and γ is a standard positional encoding function. [0082] The residual motion network 1250 is activated after the skeletal motion network. This avoids overfitting the residual motion network to the input and undermining the contribution of the skeletal motion field. When the residual motion network 1250 joins, a coarse-to-fine approach is employed for the residual motion network, with a truncated Hann window applied to the frequency bands of the positional encoding. After a certain number of training iterations, the encoding is set back to the full frequency bands of the positional encoding. [0083] The dynamic human in canonical space 1255 is represented as a continuous field, and the color c = (r, g, b) and density σ are derived using a NeRF-like MLP network 1260. With the obtained c and σ, NeRF volume rendering can be used to produce photorealistic representations of users in a virtual environment:

c, σ = MLPnerf(γ(xfinal)) (8)

[0084] where γ is a standard positional encoding function and xfinal = xskel + xres (representing the points in the motion field considering both skeletal and residual non-rigid motions, where xskel represents the points in the skeletal motion field and xres represents the points in the residual non-rigid motion field). [0085] An example of the NeRF network 1275 is illustrated in FIG. 12E. [0086] Since the bounding box of a human performer can be estimated from the image, a stratified sampling approach is applied inside the bounding box and an augmentation method is applied to further improve sampling efficiency. The augmentation method uses the denominator of Eq. (5) (above) to approximate the likelihood of a sample being part of the human performer. [0087] As noted above, in embodiments herein the HumanNeRF free-viewpoint rendering techniques replace the MLPs discussed in the prior art with fully fused MLPs. A fully fused MLP is an MLP implemented, in one embodiment, as a single processor kernel that is designed such that the only slow global memory accesses are reading and writing the network inputs and outputs. Each MLP can be specifically tailored to the network architecture and to the processor used, and is mapped to the memory hierarchy of that processor. In one embodiment, noted above, a tiny CUDA neural network framework (tiny-cuda-nn) is used with a graphics processing unit (GPU).
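As one illustration only, a field of this kind might be instantiated through the tiny-cuda-nn PyTorch bindings roughly as follows. The configuration keys follow the framework's published examples and may differ by version, a CUDA-capable GPU is required, and this is not the specific network of FIG. 12A.

```python
import torch
import tinycudann as tcnn  # tiny-cuda-nn PyTorch bindings

# Multiresolution hash encoding feeding a narrow fully fused MLP that maps a
# 3D canonical-space position to color and density (cf. 1265/1270/1275).
field = tcnn.NetworkWithInputEncoding(
    n_input_dims=3,
    n_output_dims=4,                  # r, g, b, sigma
    encoding_config={
        "otype": "HashGrid",
        "n_levels": 16,
        "n_features_per_level": 2,
        "log2_hashmap_size": 19,
        "base_resolution": 16,
        "per_level_scale": 2.0,
    },
    network_config={
        "otype": "FullyFusedMLP",     # register/shared-memory resident kernel
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 64,              # narrow width, as discussed below
        "n_hidden_layers": 2,
    },
)

out = field(torch.rand(1024, 3, device="cuda"))   # (1024, 4) predictions
```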
CUDA is a parallel computing platform and application programming interface (API) that allows software to use certain types of GPUs for general purpose processing. Tiny-cuda-nn is a minimal implementation of a neural network framework in CUDA C++. It provides feedforward neural networks that can be trained on a GPU for faster computation, and offers a lightweight, simple tool for neural network and GPU programming. In an embodiment of FIG. 12A using CUDA technology, a given batch of input vectors (of the pose model, for example) may be partitioned into block-column segments that are each processed by a single thread block. The thread blocks independently evaluate the network by alternating between weight-matrix multiplication and element-wise application of an activation function. By making the thread blocks small enough such that all intermediate neuron activations fit into on-chip shared memory, traffic to slow global memory is minimized. Within a matrix multiplication, each warp of the thread block computes the matrix product of a single block-row. [0088] MLP 1260 is further optimized by a downsizing "fully fused" MLP 1265, a multiresolution hash encoding 1270, and a NeRF MLP 1275 based on a tiny CUDA neural network framework. MLP 1265 downsizes the input to a smaller dimension such that encoding 1270 can accept the input with the reduced dimension, and is implemented using a fully fused kernel. The combination of a downsizing MLP 1265, encoding 1270, and NeRF (fully fused MLP) 1275 comprises one embodiment. In other embodiments, the downsizing MLP 1265 and encoding 1270 may be omitted. The encoding technique at 1270 may comprise that referenced in Müller, Thomas et al. “Instant neural graphics primitives with a multiresolution hash encoding.” ACM Transactions on Graphics (TOG) 41 (2022): 1 – 15. [0089] The fully fused MLPs increase training and inference speed, since they take full advantage of fast on-chip memory and minimize traffic to “slow” global memory. FIG. 12F illustrates the mechanism of the fully fused neural networks that leverage the parallelism of modern GPUs. As shown in FIG. 12F, given a batch of input vectors 1209, a regular MLP evaluation corresponds to alternating weight-matrix multiplication and element-wise application of the activation function. In contrast, a fully fused MLP at 1280 in FIG. 12F partitions the given batch of input vectors into block-column segments and processes each segment by a single thread block. The width of the fully fused MLP is narrow, thus enabling full utilization of fast on-chip memory. Specifically, the accelerated performance is achieved by the MLP weight matrices fitting into registers and the intermediate neuron activations fitting into shared memory. For a matrix multiplication H′i+1 = Wi · Hi (FIG. 12F at 1285), each warp of the thread block computes one block-row (striped area 1290) of H′i+1 by first loading the corresponding striped weights in Wi into registers and then multiplying the striped weights by all block-columns of Hi. Therefore, each thread block loads the weight matrix (e.g., Wi) from global memory exactly once, while frequent accesses to Hi are through fast shared memory. [0090] In FIG. 12F, at 1275 a regular MLP evaluation for a given batch of input vectors corresponds to alternating weight-matrix multiplication and element-wise application of the activation function.
At 1280, the fully fused MLP achieves accelerated performance by parallelizing the workload. It partitions the batch into 128-element-wide chunks and processes each chunk by a single thread block. The fully fused MLP may be narrow (in one embodiment Mhidden = Min = 64 neurons wide; in other alternatives it could be 16, 32, or 128 neurons wide), thus allowing the MLP weight matrices to fit into registers and the intermediate 64 × 128 neuron activations to fit into shared memory. Within a matrix multiplication, at 1285, each thread block transforms the i-th layer Hi into the pre-activated next layer H′i+1. Hi is diced into 16 × 16 elements to match the size of one type of processor core utilized in one embodiment, where such a processor may utilize hardware acceleration technology such as the TensorCore hardware-accelerated half-precision matrix multiplier available from NVIDIA Corporation, Santa Clara, California. Each warp of the thread block computes one 16 × 128 block-row (e.g., the striped area) of H′i+1. The computation is done by first loading the corresponding 16 × 64 striped weights in Wi into registers and then multiplying the striped weights by all 64 × 16 block-columns of Hi. Therefore, each thread block loads the weight matrix (e.g., Wi) from global memory exactly once, while the multiple passes are over layer Hi, which is located in fast shared memory. [0091] FIG. 13 is a flowchart illustrating dynamic composition by dynamic composition module 942. At step 1310, the renderings generated by each human NeRF are inserted into the background generated by the background NeRF. At 1320, physical plausibility is configured (including collision and occlusion) for all users and objects in the virtual environment. At 1330, dynamic scene rendering occurs, and the dynamic scene is output at 1340. [0092] FIG. 14 is a diagram illustrating dynamic composition of human objects into a background. The diagram of FIG. 14 and the insertion of human objects into a background are in accordance with a modification of the techniques described in B. Yang et al., "Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13759-13768 (incorporated by reference in its entirety herein). The scene rendering uses an object-compositional neural radiance field and produces realistic rendering for a cluttered, real-world scene. As shown in FIG. 14, a two-pathway architecture is utilized in which a scene branch 1440 encodes the scene geometry and appearance, and an object branch 1460 encodes each standalone object conditioned on learnable object activation codes. A scene-guided training strategy solves 3D space ambiguity in the occluded regions and learns sharp boundaries for each object. [0093] The scene branch takes the spatial coordinate x, the interpolated scene voxel features fscn at x, and the ray direction d as input, and outputs the color cscn and opacity σscn of the scene. The object branch takes additional object voxel features fobj as well as an object activation code lobj to condition the output so that it contains only the color cobj and opacity σobj for a specific object at its original location, with everything else removed. For each point x sampled along a camera ray, positional encoding γ(·) is applied to both the scene voxel feature fscn, interpolated from the eight (8) nearest vertices, and the space coordinate x to result in the hybrid space embedding.
This hybrid space embedding, along with the embedded directions γ(d), is fed into the scene branch and the object branch. The scene branch function Fscn can output the opacity σscn and color cscn of the scene at x. For the object branch function Fobj, the embedded object voxel feature γ(fobj) and the object activation code lobj are added to the input. Fobj helps to broaden the ability to learn the decomposition and is shared by all the objects, while lobj identifies the feature space for a particular object and is possessed by each individual object. Taking the object activation code lobj as a condition, the object branch precisely outputs the color cobj and opacity σobj for the desired object. [0094] FIG. 15 is a flow diagram illustrating dynamic relighting in a trained dynamic relighting module. In one embodiment, the dynamic relighting module uses a modification of the techniques discussed in Xiuming Zhang, Pratul P. Srinivasan, Boyang Deng, Paul Debevec, William T. Freeman, and Jonathan T. Barron. 2021. NeRFactor: neural factorization of shape and reflectance under an unknown illumination. ACM Trans. Graph. 40, 6, Article 237 (December 2021) (incorporated by reference in its entirety herein, hereinafter “Zhang et al.”). [0095] The technique discussed therein, termed “NeRFactor,” represents the shape and spatially-varying reflectance of an object as a set of 3D fields, each parameterized by MLPs 1510, 1515, 1520, 1525 and 1535, whose weights are optimized so as to explain the set of observed input images. After optimization, NeRFactor outputs, at each 3D location xsurf on the object’s surface, the surface normal n, the light visibility v(ωi) in any direction, the albedo a, and the reflectance zBRDF that together explain the observed appearance. These factors are provided to a renderer 1545 to provide free-viewpoint relighting (with shadows). By recovering the object’s geometry and reflectance, NeRFactor enables free-viewpoint relighting (with shadows) and material editing. [0096] The NeRFactor techniques discussed in Zhang et al. are modified in that: the techniques described in NeRF are modified using the techniques of Weng, Chung-Yi et al. “HumanNeRF”; two MLPs are used instead of one to predict a surface normal n; and each 3D surface location of the HumanNeRF, xsurf, is in the canonical space. [0097] The input to the NeRFactor model comprises the output of human NeRF 1022 (the optimized volume density σ), which creates input images to compute the initial geometry (though using Multi-View Stereo (MVS) geometry as initialization also works). [0098] NeRFactor optimizes a neural radiance field MLP that maps from any 3D spatial coordinate and 2D viewing direction to the volume density at that 3D location and the color emitted by particles at that location along the 2D viewing direction. NeRFactor leverages NeRF’s estimated geometry by “distilling” it into a continuous surface representation that is used to initialize NeRFactor’s geometry. In particular, NeRFactor optimizes NeRF to compute the expected surface location along any camera ray, the surface normal at each point on the object’s surface, and the visibility of light arriving from any direction at each point on the object’s surface. [0099] In FIG. 15, x denotes 3D locations, ωi the light direction, ωo the viewing direction, and φd, θh, θd the Rusinkiewicz coordinates.
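With this notation, the way the recovered factors combine at render time can be written as a single-bounce reflection sum over discretized light directions: the outgoing radiance at xsurf is approximately Σi v(ωi) · Li(ωi) · BRDF · max(0, n · ωi). The sketch below is a generic illustration of that sum and is not the renderer 1545 itself.

```python
import numpy as np

def shade_point(normal, albedo, visibility, light_dirs, light_radiance, brdf):
    """Single-bounce shading of one surface point from its recovered factors.

    visibility     : (L,) light visibility toward each discretized direction.
    light_dirs     : (L, 3) unit light directions.
    light_radiance : (L, 3) incoming radiance per direction.
    brdf           : callable returning the (3,) non-diffuse BRDF value for a
                     direction; the Lambertian term albedo / pi is added to it.
    """
    cosines = np.clip(light_dirs @ normal, 0.0, None)        # max(0, n . wi)
    lambertian = albedo / np.pi
    out = np.zeros(3)
    for v, d, li, c in zip(visibility, light_dirs, light_radiance, cosines):
        out += v * c * li * (lambertian + brdf(d))           # shadowed reflection
    return out

# One overhead light, fully visible, purely Lambertian surface.
print(shade_point(np.array([0.0, 1.0, 0.0]), np.array([0.6, 0.4, 0.3]),
                  np.array([1.0]), np.array([[0.0, 1.0, 0.0]]),
                  np.array([[3.0, 3.0, 3.0]]), lambda d: np.zeros(3)))
```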
[00100] Given the trained human NeRF 1022, the dynamic relighting module computes the location at which a ray r(t) = o + td, from a camera origin o along direction d, is expected to terminate according to the human NeRF’s optimized volume density σ, to provide xsurf. [00101] A surface normal MLP 1530 computes the surface normal in the canonical space, nc. A second MLP 1535 computes the surface normal in observation space, n. [00102] A light visibility MLP 1510 computes the visibility v to each light source by marching through the human NeRF’s 1022 σ-volume from the point to each light location. The visibility function is parameterized as an MLP 1510 that maps from a surface location xsurf and a light direction ωi to the light visibility v: fv : (xsurf, ωi) ↦ v. [00103] Reflectance is handled by the albedo MLP 1520 and the BRDF identity MLP 1515. The BRDF identity MLP 1515 predicts spatially-varying BRDFs for all the surface points in the plausible space of real-world BRDFs. [00104] The albedo MLP 1520 parameterizes the albedo a at any surface location xsurf as an MLP fa : xsurf ↦ a. [00105] The final BRDF produced at MLP 1525 is the sum of the Lambertian component and the learned non-diffuse reflectance. Given the surface normal, the visibility for all light directions, the albedo, and the BRDF at each point xsurf, as well as the estimated lighting, the renderer 1545 renders an image with realistic lighting relative to the virtual environment. [00106] FIG. 16 is a block diagram of a network processing device that can be used to implement various embodiments, including a client processing device such as devices 210, 212, 214, 216 or network nodes 220a – 220d. Specific network devices may utilize all of the components shown, or only a subset of the components, and levels of integration may vary from device to device. Furthermore, network device 1600 may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc. The network device 1600 may comprise a processing unit 1601 equipped with one or more input/output devices, such as network interfaces, storage interfaces, and the like. The processing device may include a central processing unit (CPU) 1610, a memory 1620, a mass storage device 1630, and an I/O interface 1660 connected to a bus 1670. The bus 1670 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, or the like. A network interface 1650 enables the network processing device to communicate over a network 1680 with other processing devices such as those described herein. The I/O interface is illustrated as connected to a display device 1665 and an image capture device 1655. [00107] The CPU 1610 may comprise any type of electronic data processor. Memory 1620 may comprise any type of system memory such as static random-access memory (SRAM), dynamic random-access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like. In an embodiment, memory 1620 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs. In embodiments, the memory 1620 is non-transitory.
In one embodiment, the memory 1620 includes computer readable instructions that are executed by the processor(s) 1320 to implement embodiments of the disclosed technology, including the real-time virtual environment application 1625a which may itself include a rendering engine 1625b, FTRW-01394WO0 6000548PCT01 - 36 - and converter 1675. The functions of the meeting application 1625a as operable on a client processing device are described herein in various flowcharts and Figures. [00108] The mass storage device 1630 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus 1670. The mass storage device 1630 may comprise, for example, one or more of a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like. An instance of the user DB 810a may be stored on the mass storage device 1630 in embodiments of the technology such as that described with respect to FIG.8. The mass storage 1630 may also include code comprising instructions for causing the CPU to implement the components of the real-time virtual environment application illustrated in memory 1620 of the client processing device. [00109] FIG.17 is a block diagram of a network processing device that can be used to implement various embodiments of a service host 240. Specific network devices may utilize all of the components shown, or only a subset of the components, and levels of integration may vary from device to device. In FIG.17, like numbers represent like parts with respect to those of FIG. 16. In one embodiment, the memory 1620 includes components of the real-time virtual environment application 1715a, a rendering engine 1715b implementing a renderer 380 discussed herein, converters 1715c implementing converters 310, 320, 330 discussed herein, and environment rendering data 1715d provided from the object database and theme database. The mass storage device 1630 includes the theme database 350a comprising instances background database 365a, and environment database 355a, and object database 345a.. The mass storage 1630 may also include code comprising instructions for causing the CPU to implement the components of the real-time virtual environment application illustrated in memory 1620 of the service host 240. Optionally, in embodiments, device 240 may include both a CPU 1610 and a dedicated GPU 1710. As understood, a GPU is a type of processing unit that enables very efficient parallel processing of data. Although GPUs may be used in a video card or the like for computer graphics, GPUs have found much broader applications. FTRW-01394WO0 6000548PCT01 - 37 - [00110] FIG. 18 is a block diagram illustrating examples of details of a network device, or node, such as those shown in the network of FIG. 3. A node 1800 may comprise a router, switch, server, or other network device, according to an embodiment. The node 1800 can correspond to one of the nodes 220a- 220d. The router or other network node 1800 can be configured to implement or support embodiments of the technology disclosed herein. For example, the node may store the user databases 810a (which may comprise any of 810, 812, 814, 816 illustrated in FIG.8) and/or may execute one or more converters discussed herein. The node 1800 may comprise a number of receiving input/output (I/O) ports 1810, a receiver 1812 for receiving packets, a number of transmitting I/O ports 1830 and a transmitter 1832 for forwarding packets. 
Although shown separated into an input section and an output section in FIG.18, often these will be I/O ports 1810 and 1830 that are used for both down-stream and up-stream transfers and the receiver 1812 and transmitter 1832 will be transceivers. Together I/O ports 1810, receiver 1812, I/O ports 1830, and transmitter 1832 can be collectively referred to as a network interface that is configured to receive and transmit packets over a network. [00111] The node 1800 can also include a processor 1820 that can be formed of one or more processing circuits and a memory or storage section 1822. The storage 1822 can be variously embodied based on available memory technologies and in this embodiment and is shown to have recovery data cache 1870, which could be formed from a volatile RAM memory such as SRAM or DRAM, and long-term storage 1826, which can be formed of non-volatile memory such as flash NAND memory or other memory technologies. [00112] Storage 1822 can be used for storing both data and instructions for implementing aspects of the technology discussed herein. In particular, instructions causing the processor 1820 to perform the functions of caching database data or executing converters as described herein. [00113] More specifically, the processor(s) 1820, including the programmable content forwarding plane 1828, can be configured to implement embodiments of the disclosed technology described below. In accordance with certain embodiments, the FTRW-01394WO0 6000548PCT01 - 38 - storage 1822 stores computer readable instructions that are executed by the processor(s) 1820 to implement embodiments of the disclosed technology. It would also be possible for embodiments of the disclosed technology described below to be implemented, at least partially, using hardware logic components, such as, but not limited to, Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. [00114] For the purposes of this document, it should be noted that the dimensions of the various features depicted in the figures may not necessarily be drawn to scale. [00115] For purposes of this document, reference in the specification to “an embodiment,” “one embodiment,” “some embodiments,” or “another embodiment” may be used to describe different embodiments or the same embodiment. [00116] For the purposes of this document, a connection may be a direct connection or an indirect connection (e.g., via one or more other parts). In some cases, when an element is referred to as being connected or coupled to another element, the element may be directly connected to the other element or indirectly connected to the other element via intervening elements. When an element is referred to as being directly connected to another element, then there are no intervening elements between the element and the other element. Two devices are “in communication” if they are directly or indirectly connected so that they can communicate electronic signals between them. [00117] Although the present disclosure has been described with reference to specific features and embodiments thereof, it is evident that various modifications and combinations can be made thereto without departing from the scope of the disclosure. 
The specification and drawings are, accordingly, to be regarded simply as an illustration of the disclosure as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present disclosure. FTRW-01394WO0 6000548PCT01 - 39 - [00118] The technology described herein can be implemented using hardware, software, or a combination of both hardware and software. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by a computer. A computer readable medium or media does (do) not include propagated, modulated, or transitory signals. [00119] Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated, or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media. [00120] In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application- specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one FTRW-01394WO0 6000548PCT01 - 40 - embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/ storage devices, peripherals and/or communication interfaces. [00121] It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. 
Indeed, the subject matter is intended to cover alternatives, modifications, and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details. [00122] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. [00123] The description of the present disclosure has been presented for purposes of illustration and description but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the FTRW-01394WO0 6000548PCT01 - 41 - disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated. [00124] For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device. [00125] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

FTRW-01394WO0 6000548PCT01 - 42 - CLAIMS What is claimed is: 1. A computer implemented method of rendering representations of a plurality of users in a virtual environment, comprising: receiving user data for at least one of the users participating in the virtual environment, the user data comprising at least a user position and 3D key point data of the at least one user, the 3D key point data reflecting a user pose in a real-world environment; generating a virtual environment from environment data specifying at least a virtual coordinate system for the virtual environment and background data using neural rendering; generating, using neural rendering, a representation of each of the plurality of users, each user having an associated representation; determining a position in the virtual environment for each of the plurality of users, the position in the virtual environment determined based on the received user data; and rendering the representation of each of the plurality of users in a dynamic scene of the virtual environment. 2. The computer implemented method of claim 1 wherein the user data is encoded and/or compressed, and the user data is received over a network from a client processing device associated with each user. 3. The computer implemented method of any of claims 1 and 2 wherein receiving user data further includes receiving at least user profile data specifying user preferences regarding the representation of the user. 4. The computer implemented method of any of claims 1 through 3 wherein the 3D key point data of each user indicates real-time motions of each user, and the FTRW-01394WO0 6000548PCT01 - 43 - rendering includes rendering the real-time motions of each user to the associated representation in the virtual environment. 5. The computer implemented method of any of claims 1 through 4 further including accessing image data for each of the plurality of users and training a neural radiance field network for each user for the neural rendering using the image data. 6. The computer implemented method of claim 5 further including accessing user profile information and training the neural radiance field network for each user for the neural rendering using the user profile information. 7. The computer implemented method of any of claims 1 through 6 wherein the rendering includes dynamically lighting the virtual environment and the associated representation of each user. 8. The computer implemented method of any of claims 1 through 7 wherein the rendering includes determining a physical plausibility for one or more objects and/or one or more representations users in the virtual environment. 9. The computer implemented method of any of claims 1 through 8 wherein the generating a photorealistic representation of a user comprises a free-viewpoint rendering method of a volumetric representation of a person. 10. The computer implemented method of claim 9 wherein the free-viewpoint rendering method is implemented on at least one fully fused multi-layer perceptron. 11. The computer implemented method of claim 10 wherein the fully fused multi- layer perceptron uses at least one tiny Compute Unified Device neural network framework. FTRW-01394WO0 6000548PCT01 - 44 - 12. The computer implemented method of claims 1 through 11 wherein the generating a photorealistic representation and receiving user data for at least one of the users is based on a monocular series of images. 13. 
13. A user equipment device, comprising:
a storage medium comprising computer instructions;
an image capture system;
a display device;
one or more processors coupled to communicate with the storage medium, wherein the one or more processors execute the instructions to cause the system to:
capture, using the image capture system, user data comprising at least 3D key point data of a user of the user equipment;
encode and compress user data including a user position in a real world coordinate system and the 3D key point data;
output the user data to a renderer via a network;
receive dynamic scene data comprising a photorealistic representation of a virtual environment generated using a neural radiance field, the virtual environment having a virtual coordinate system for the virtual environment and background data, and a photorealistic representation of each of a plurality of users in the virtual environment, each user having an associated representation, each user's associated user data converted to a position in the virtual environment; and
render the dynamic scene data on the display device.

14. The user equipment device of claim 13 wherein the 3D key point data of each user indicates real-time motion of the user of the user equipment, and the dynamic scene data includes data rendering the real-time motions of each user to the associated representation of the user in the virtual environment.

15. The user equipment device of any of claims 13 and 14 further including transmitting image data of the user to provide training data for a neural radiance field network for the neural rendering.

16. The user equipment device of claim 15 further including transmitting user profile information for training the neural radiance field network.

17. The user equipment device of any of claims 13 through 15 wherein the dynamic scene data includes dynamic lighting of the virtual environment and the associated representation of each user.

18. The user equipment device of any of claims 13 through 17 wherein the dynamic scene data includes a physical plausibility for one or more objects and/or one or more photorealistic representations of users in the virtual environment.

19. The user equipment device of any of claims 13 through 18 wherein the user data comprises a monocular series of images.

20. A non-transitory computer-readable medium storing computer instructions for rendering a photorealistic virtual environment occupied by a plurality of users, that when executed by one or more processors, cause the one or more processors to perform the steps of:
generating the virtual environment from environment data specifying at least a virtual coordinate system for the virtual environment and background data using neural rendering;
generating, using neural rendering, a photorealistic representation of each of the plurality of users, each user having an associated photorealistic representation;
receiving user data comprising at least user position and 3D key point data, and a user profile of each user in the virtual environment;
converting the received user data to a position in the virtual environment; and
rendering the associated photorealistic representation of each of the plurality of users in a dynamic scene of the virtual environment.

21. The non-transitory computer-readable medium of claim 20 wherein the user data is encoded and/or compressed, and the user data is received over a network from a client processing device associated with each user.
22. The non-transitory computer-readable medium of any of claims 20 and 21 wherein the 3D key point data of each user indicates real-time motions of each user, and the rendering includes rendering the real-time motions of each user to the associated representation in the virtual environment.

23. The non-transitory computer-readable medium of any of claims 20 through 22, the instructions further causing the one or more processors to perform the steps of accessing image data for each of the plurality of users and training a neural radiance field network for each user for the neural rendering using the image data.

24. The non-transitory computer-readable medium of claim 23, the instructions further causing the one or more processors to perform the steps of accessing user profile information and training the neural radiance field network for each user for the neural rendering using the user profile information.

25. The non-transitory computer-readable medium of any of claims 20 through 24 wherein the rendering includes dynamically lighting the virtual environment and the associated representation of each user.

26. The non-transitory computer-readable medium of any of claims 20 through 25 wherein the rendering includes determining a physical plausibility for each of one or more objects and representations of users in the virtual environment.

27. The non-transitory computer-readable medium of any of claims 20 through 26 wherein the generating a photorealistic representation of a user comprises a free-viewpoint rendering method of a volumetric representation of a person.

28. The non-transitory computer-readable medium of claim 27 wherein the free-viewpoint rendering method is implemented on at least one fully fused multi-layer perceptron.

29. The non-transitory computer-readable medium of claim 28 wherein the fully fused multi-layer perceptron uses a tiny Compute Unified Device Architecture multi-layer perceptron.

30. The non-transitory computer-readable medium of any of claims 20 through 29 wherein the generating a photorealistic representation and receiving user data for at least one of the users is based on a monocular or multi-camera view series of images.
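By way of non-limiting illustration of the rendering flow recited in claim 1, the following Python sketch shows one possible server-side arrangement: per-user radiance-field models render each participant from the received position and 3D key point data, and the results are composited into a dynamic scene of the virtual environment. The class and function names (UserPacket, EnvironmentNeRF, UserNeRF, to_virtual_position, render_dynamic_scene) are hypothetical placeholders that do not appear in the specification; trained models would replace the stubbed render methods.

```python
# Illustrative sketch only; UserPacket, EnvironmentNeRF, and UserNeRF are
# hypothetical names, not structures taken from the specification.
from dataclasses import dataclass
from typing import Dict, List

import numpy as np


@dataclass
class UserPacket:
    user_id: str
    position: np.ndarray        # user position in real-world coordinates, shape (3,)
    keypoints_3d: np.ndarray    # 3D key points reflecting the user pose, shape (K, 3)


class EnvironmentNeRF:
    """Placeholder for a neural-radiance-field model of the background."""
    def render(self, camera_pose: np.ndarray) -> np.ndarray:
        return np.zeros((720, 1280, 3))  # stubbed background RGB frame


class UserNeRF:
    """Placeholder for a per-user neural-radiance-field model."""
    def render(self, keypoints_3d: np.ndarray, camera_pose: np.ndarray) -> np.ndarray:
        return np.zeros((720, 1280, 4))  # stubbed RGBA layer for this user


def to_virtual_position(real_position: np.ndarray) -> np.ndarray:
    """Map a real-world position into the virtual coordinate system (stub)."""
    return real_position  # identity mapping, for illustration only


def render_dynamic_scene(packets: List[UserPacket],
                         env_model: EnvironmentNeRF,
                         user_models: Dict[str, UserNeRF],
                         camera_pose: np.ndarray) -> np.ndarray:
    """Composite per-user renderings over the environment rendering."""
    frame = env_model.render(camera_pose)
    for pkt in packets:
        # virtual_pos would drive where the user is placed in the scene;
        # placement details are elided in this sketch.
        virtual_pos = to_virtual_position(pkt.position)
        layer = user_models[pkt.user_id].render(pkt.keypoints_3d, camera_pose)
        alpha = layer[..., 3:4]
        frame = alpha * layer[..., :3] + (1.0 - alpha) * frame
    return frame
```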
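Claims 5, 6, 23, and 24 recite training a neural radiance field network for each user from that user's image data. The sketch below, assuming a PyTorch-style training loop, shows the general shape of such per-user fitting; sample_rays and volume_render are hypothetical stubs standing in for ray generation and volumetric integration, and the placeholder network is not the specification's model.

```python
# High-level sketch of fitting a per-user radiance-field network to that
# user's image data; sample_rays and volume_render are hypothetical stubs.
import torch


def sample_rays(user_images: torch.Tensor, n_rays: int = 1024):
    """Stub: sample ray origins, directions, and ground-truth colours."""
    return torch.rand(n_rays, 3), torch.rand(n_rays, 3), torch.rand(n_rays, 3)


def volume_render(model: torch.nn.Module, origins: torch.Tensor,
                  directions: torch.Tensor) -> torch.Tensor:
    """Stub: a full NeRF would march and integrate samples along each ray."""
    return torch.sigmoid(model(torch.cat([origins, directions], dim=-1)))


def train_user_nerf(model: torch.nn.Module, user_images: torch.Tensor,
                    steps: int = 1000, lr: float = 1e-3) -> torch.nn.Module:
    """Fit one user's network so neural rendering can reproduce that user."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        origins, directions, target_rgb = sample_rays(user_images)
        loss = torch.nn.functional.mse_loss(
            volume_render(model, origins, directions), target_rgb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model


# Example usage with a small placeholder network standing in for a per-user NeRF.
user_model = torch.nn.Sequential(
    torch.nn.Linear(6, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
user_model = train_user_nerf(user_model, user_images=torch.zeros(10, 3, 256, 256))
```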
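Claims 10-11 and 28-29 recite implementing the free-viewpoint rendering on at least one fully fused multi-layer perceptron using a tiny Compute Unified Device Architecture (tiny-cuda-nn) framework. The sketch below assumes the publicly available tinycudann Python bindings and a CUDA-capable GPU; the hash-grid encoding and the particular hyperparameters are illustrative assumptions rather than values taken from the specification.

```python
# Sketch assuming the open-source tiny-cuda-nn PyTorch bindings
# (https://github.com/NVlabs/tiny-cuda-nn) and a CUDA-capable GPU; the
# hyperparameters below are assumptions, not values from the specification.
import torch
import tinycudann as tcnn

# Multiresolution hash encoding feeding a fully fused MLP, a common setup
# for fast neural-radiance-field queries.
model = tcnn.NetworkWithInputEncoding(
    n_input_dims=3,              # 3D sample position
    n_output_dims=4,             # RGB + density
    encoding_config={
        "otype": "HashGrid",
        "n_levels": 16,
        "n_features_per_level": 2,
        "log2_hashmap_size": 19,
        "base_resolution": 16,
        "per_level_scale": 2.0,
    },
    network_config={
        "otype": "FullyFusedMLP",  # fully fused multi-layer perceptron
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 64,
        "n_hidden_layers": 2,
    },
)

# Query a batch of sample positions in [0, 1); values are placeholders.
positions = torch.rand(4096, 3, device="cuda")
rgb_density = model(positions)   # shape: (4096, 4)
```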
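Claim 13 recites a user equipment pipeline that captures 3D key point data, encodes and compresses it together with a real-world position, sends it to a renderer over a network, and displays the returned dynamic scene. A minimal sketch of that loop follows; capture_keypoints, send_packet, receive_scene, and display are hypothetical placeholders for the capture, transport, and display interfaces, not APIs from the specification.

```python
# Minimal client-loop sketch; all callables passed in are hypothetical
# placeholders for the user equipment's capture, transport, and display stack.
import json
import zlib

import numpy as np


def capture_keypoints() -> np.ndarray:
    """Stub for the image capture system producing 3D key points, shape (K, 3)."""
    return np.zeros((33, 3))


def encode_user_data(user_id: str, position: np.ndarray,
                     keypoints: np.ndarray) -> bytes:
    """Encode and compress the user position and 3D key point data."""
    payload = {
        "user_id": user_id,
        "position": position.tolist(),
        "keypoints_3d": keypoints.tolist(),
    }
    return zlib.compress(json.dumps(payload).encode("utf-8"))


def client_step(user_id: str, position: np.ndarray,
                send_packet, receive_scene, display) -> None:
    """One iteration of the capture -> send -> receive -> render loop."""
    keypoints = capture_keypoints()
    send_packet(encode_user_data(user_id, position, keypoints))
    dynamic_scene = receive_scene()   # photorealistic frame from the renderer
    display(dynamic_scene)
```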
PCT/US2023/063554 2023-02-01 2023-03-02 Virtual conferencing system using multi-neural-radiance-field synchronized rendering WO2024163018A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363482707P 2023-02-01 2023-02-01
US63/482,707 2023-02-01

Publications (1)

Publication Number Publication Date
WO2024163018A1 true WO2024163018A1 (en) 2024-08-08

Family

ID=85772750

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/063554 WO2024163018A1 (en) 2023-02-01 2023-03-02 Virtual conferencing system using multi-neural-radiance-field synchronized rendering

Country Status (1)

Country Link
WO (1) WO2024163018A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170148267A1 (en) * 2015-11-25 2017-05-25 Joseph William PARKER Celebrity chase virtual world game system and method
US20180174347A1 (en) * 2016-12-20 2018-06-21 Sony Interactive Entertainment LLC Telepresence of multiple users in interactive virtual space
US20220230375A1 (en) * 2021-01-19 2022-07-21 Krikey, Inc. Three-dimensional avatar generation and customization

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
AYUSH TEWARI ET AL: "Advances in Neural Rendering", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 30 March 2022 (2022-03-30), XP091192464 *
B. YANG ET AL.: "Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering", 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2021, pages 13759 - 13768, XP034092164, DOI: 10.1109/ICCV48922.2021.01352
MILDENHALL, BEN ET AL.: "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.", EUROPEAN CONFERENCE ON COMPUTER VISION, 2020
MULLER, THOMAS ET AL.: "Instant neural graphics primitives with a multiresolution hash encoding.", ACM TRANSACTIONS ON GRAPHICS (TOG), vol. 41, 2022, pages 1 - 15, XP058942157, DOI: 10.1145/3528223.3530127
TEWARI A ET AL: "State of the Art on Neural Rendering", COMPUTER GRAPHICS FORUM : JOURNAL OF THE EUROPEAN ASSOCIATION FOR COMPUTER GRAPHICS, WILEY-BLACKWELL, OXFORD, vol. 39, no. 2, 13 July 2020 (2020-07-13), pages 701 - 727, XP071545885, ISSN: 0167-7055, DOI: 10.1111/CGF.14022 *
THOMAS MULLER, FABRICE ROUSSELLE, JAN NOVAK, ALEXANDER KELLER: "Real-time neural radiance caching for path tracing", ACM TRANS. GRAPH., vol. 40, August 2021 (2021-08-01), pages 4
WENG, CHUNG-YI ET AL., HUMANNERF
WENG, CHUNG-YI ET AL.: "HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video", 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, pages 16189 - 16199, XP034195555, DOI: 10.1109/CVPR52688.2022.01573
WENG, CHUNG-YI ET AL.: "HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video.", 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, pages 16189 - 16199, XP034195555, DOI: 10.1109/CVPR52688.2022.01573
ZHANG, ACM TRANS. GRAPH., vol. 40, December 2021 (2021-12-01), pages 6

Similar Documents

Publication Publication Date Title
Tewari et al. State of the art on neural rendering
CN111656407B (en) Fusing, texturing and rendering views of a dynamic three-dimensional model
US11610122B2 (en) Generative adversarial neural network assisted reconstruction
US11288857B2 (en) Neural rerendering from 3D models
US11087521B1 (en) Systems and methods for rendering avatars with deep appearance models
Alatan et al. Scene representation technologies for 3DTV—A survey
US20240212252A1 (en) Method and apparatus for training video generation model, storage medium, and computer device
US11451758B1 (en) Systems, methods, and media for colorizing grayscale images
US20230419600A1 (en) Volumetric performance capture with neural rendering
US20240020915A1 (en) Generative model for 3d face synthesis with hdri relighting
Lu et al. 3D real-time human reconstruction with a single RGBD camera
Nguyen-Ha et al. Free-viewpoint rgb-d human performance capture and rendering
Ren et al. Facial geometric detail recovery via implicit representation
Ouyang et al. Real-time neural character rendering with pose-guided multiplane images
Dai et al. PBR-Net: Imitating physically based rendering using deep neural network
CN116758202A (en) Human hand image synthesis method, device, electronic equipment and storage medium
CN115497029A (en) Video processing method, device and computer readable storage medium
WO2024163018A1 (en) Virtual conferencing system using multi-neural-radiance-field synchronized rendering
Zell et al. Volumetric video-acquisition, compression, interaction and perception
Shen et al. Gaussian Time Machine: A Real-Time Rendering Methodology for Time-Variant Appearances
CN110689602A (en) Three-dimensional face reconstruction method, device, terminal and computer readable storage medium
CN116958451B (en) Model processing, image generating method, image generating device, computer device and storage medium
Shen et al. Envisioning a Next Generation Extended Reality Conferencing System with Efficient Photorealistic Human Rendering
近藤生也 et al. 3D Physical State Prediction and Visualization using Deep Billboard
Du et al. Temporal residual neural radiance fields for monocular video dynamic human body reconstruction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23713557

Country of ref document: EP

Kind code of ref document: A1