WO2023232267A1 - Supporting an immersive communication session between communication devices - Google Patents

Supporting an immersive communication session between communication devices

Info

Publication number
WO2023232267A1
Authority
WO
WIPO (PCT)
Prior art keywords
communication device
real
immersive
participant
stream
Application number
PCT/EP2022/065275
Other languages
French (fr)
Inventor
Ali El Essaili
Natalya TYUDINA
Esra AKAN
Jörg Christian EWERT
Alvin Jude Hari Haran
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2022/065275 priority Critical patent/WO2023232267A1/en
Publication of WO2023232267A1 publication Critical patent/WO2023232267A1/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/141: Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147: Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals

Definitions

  • the present disclosure relates to a method performed by a network entity for supporting an immersive communication session between communication devices, a method performed by a first communication device for supporting an immersive communication session with at least one second communication device, a method performed by a second communication device for supporting an immersive communication session with at least one first communication device, a network entity for supporting an immersive communication session between communication devices, a first communication device for supporting an immersive communication session with at least one second communication device, a second communication device for supporting an immersive communication session with at least one first communication device, and a system for supporting an immersive communication session.
  • Point clouds may be used for augmented reality (AR) or virtual reality (VR) at the receiver device (e.g., a client) to provide an immersive user experience.
  • Point clouds may be generated by transmitter devices using depth cameras such as the "Intel RealSense” or "Microsoft Kinect”.
  • A point cloud is a set of 3D points, wherein each point is represented by a position (x, y, z) and attributes such as color (e.g., RGB).
  • Point clouds may comprise thousands or millions of points per frame, providing a fine-grained representation of captured objects, including environmental objects such as the physical surroundings, objects, or humans.
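  • For illustration only (not part of the disclosure), the point-cloud representation described above can be sketched as a simple data structure; the class and field names below are arbitrary choices:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Point:
    # 3D position of the point
    x: float
    y: float
    z: float
    # color attribute, e.g., 8-bit RGB
    r: int
    g: int
    b: int

# A point cloud is simply a (possibly very large) set of such points per frame.
PointCloud = List[Point]

frame: PointCloud = [
    Point(0.10, 1.75, 2.30, 210, 180, 160),  # e.g., a point on the participant's face
    Point(0.12, 1.74, 2.31, 208, 179, 158),
]
```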
  • V-PCC video-based point cloud compression
  • MPEG Moving Picture Experts Group
  • V-PCC decomposes a point cloud into 2D patches that can be encoded using existing 2D video codecs (i.e., encoder and decoder) such as high efficiency video coding (HEVC).
  • HEVC high efficiency video coding
  • Heuristic segmentation approaches are used to minimize mapping distortions.
  • the resulting patches are packed into images that represent the geometry or depth map (e.g., longitudinal distance to the points) and texture (e.g., color attributes).
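  • As a deliberately simplified sketch of the projection idea behind V-PCC (this is not the actual V-PCC patch segmentation or packing), points can be projected onto a 2D grid so that depth and color are stored as a geometry image and a texture image suitable for a 2D codec such as HEVC; the sketch reuses the Point class above and assumes arbitrary grid and scale parameters:

```python
import numpy as np

def project_to_images(points, width=256, height=256, scale=100.0):
    """Project 3D points onto the x/y plane to obtain a geometry (depth) image
    and a texture (color) image, keeping the nearest point per pixel.
    This is a toy orthographic projection, not the V-PCC patch segmentation."""
    depth = np.full((height, width), np.inf, dtype=np.float32)   # geometry image
    color = np.zeros((height, width, 3), dtype=np.uint8)         # texture image
    for p in points:
        u = int(p.x * scale) % width
        v = int(p.y * scale) % height
        if p.z < depth[v, u]:          # keep the point closest to the camera
            depth[v, u] = p.z
            color[v, u] = (p.r, p.g, p.b)
    depth[np.isinf(depth)] = 0.0       # empty pixels
    # depth and color could now be fed to a 2D encoder (e.g., HEVC).
    return depth, color
```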
  • the immersive communication requires a specific type of content, e.g., point clouds.
  • 3D-captured data such as point clouds and meshes are difficult to encode, which requires dedicated resources and is computationally intensive.
  • rate adaptation approaches typically used in 2D streaming such as Dynamic Adaptive Streaming over HTTP (DASH) are not available for real-time 3D visual communications.
  • DASH Dynamic Adaptive Streaming over HTTP
  • T. Stockhammer "Dynamic Adaptive Streaming over HTTP - Design Principles and Standards", ACM conference on Multimedia systems, 2011, describes existing DASH specifications available from the Third Generation Partnership Project (3GPP) and in a draft version also from MPEG.
  • 3GPP Third Generation Partnership Project
  • DASH is neither specific nor suited to 3D content.
  • the lack of network adaptation for streaming 3D content responsive to changes in link quality results in service interruptions if the network has too few resource reserves, or leads to inefficient usage of network resources if the network holds too many reserve resources.
  • US 9,883,144 B2 describes an approach for replacing a video with an avatar in case of bad network communication.
  • the avatar is not based on a live representation of the user, i.e., there is no live capture, no facial landmarks, and no control of the avatar.
  • a method performed by a network entity for supporting an immersive communication session between communication devices comprises receiving a first stream in the immersive communication session from at least one first communication device among the communication devices.
  • the first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session.
  • the method further comprises sending the real-time 3D visual representation or real-time 3D immersive media rendered based on the real-time 3D visual representation to at least one second communication device of the communication devices.
  • the method further comprises generating an avatar model of a or the participant of the immersive communication session based on the 3D visual representation received from the at least one first communication device.
  • the method further comprises sending the generated avatar model to the at least one second communication device.
  • the method further comprises, in response to a shortage of resources for the first stream, sending a control message to the at least one first communication device.
  • the control message is indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery (CGI) of the participant.
  • CGI computer-generated imagery
  • the method further comprises receiving a second stream in the immersive communication session from the at least one first communication device.
  • the second stream comprises real-time values of avatar parameters controlling motions of the avatar model (e.g., facial expressions) of the participant.
  • the method further comprises sending the real-time values to the at least one second communication device.
  • the switching of the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant may also be referred to as switching from the first stream to the second stream.
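  • As a schematic, hypothetical sketch of the control flow summarized above (the transport API, function names, and message format are illustrative assumptions, not part of any standard or of the disclosure), the network entity may be pictured as follows:

```python
class NetworkEntity:
    """Illustrative control logic: relay the first (photographic) stream,
    generate and distribute the avatar model, and switch to the second
    (avatar-controlling) stream on a resource shortage."""

    def __init__(self, transport):
        self.transport = transport          # hypothetical send/receive API
        self.avatar_model = None

    def on_first_stream(self, frame, rx_devices):
        # Forward the real-time 3D visual representation (or rendered media).
        for rx in rx_devices:
            self.transport.send(rx, frame)
        # Generate the avatar model from the received 3D representation once.
        if self.avatar_model is None:
            self.avatar_model = self.generate_avatar_model(frame)
            for rx in rx_devices:
                self.transport.send(rx, self.avatar_model)

    def on_resource_shortage(self, tx_device):
        # Control message indicative of switching from the first to the second stream.
        self.transport.send(tx_device, {"control": "switch_to_avatar"})

    def on_second_stream(self, avatar_values, rx_devices):
        # Relay the real-time values of the avatar parameters.
        for rx in rx_devices:
            self.transport.send(rx, avatar_values)

    def generate_avatar_model(self, frame):
        # Placeholder for avatar-model generation from the 3D representation.
        return {"type": "avatar_model", "source_frame": id(frame)}
```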
  • the resources (e.g., anywhere along the path of the first stream) may be associated with the performance of one or more links on the path, optionally wherein the performance comprises at least one of data rate, bandwidth, latency, and other quality indicators.
  • the switching from the first stream to the second stream can mitigate the shortage of the resources for the first stream.
  • the shortage of resources may be due to an insufficient performance of a link (e.g., a radio link) along the path of the first stream. In other words, the first stream cannot be supported by the link due to the shortage of resources.
  • the real-time 3D visual representation of the participant may be a real-time 3D photographic representation of the participant.
  • the real-time 3D photographic representation may be captured by a camera and/or may comprise pixels or voxels of the participant.
  • the voxels may comprise pixels with a depth information.
  • the camera may operate in the visual and/or infrared spectrum.
  • the first stream may also be referred to as a photographic media stream.
  • the 3D visual representation may be a captured 3D visual representation of the participant.
  • the network entity may be configured to decode and/or process the received first stream to generate the avatar model.
  • the avatar model may be a photorealistic representation of the participant.
  • the avatar model may be a representation of the participant that is animated according to the avatar parameters controlling the motions of the avatar model (e.g., based on detected facial expressions or other sensory data).
  • the avatar model may enable a renderer (e.g., the network entity and/or at least one or each of the at least one second communication device) to render the real-time 3D computer-generated imagery (CGI) of the participant based on the combination of the avatar model and the values of the avatar parameters.
  • the avatar parameters may be used as input to the avatar model for generating the 3D CGI of the participant.
  • the CGI may comprise graphics generated by a computer according to the avatar model, wherein the state (e.g., a position of the head or a facial expression) of the avatar model is specified by the avatar parameters.
  • the CGI may be animated.
  • the CGI may be a photorealistic representation of the participant.
  • the second stream may also be referred to as an avatar-controlling stream.
  • the CGI of the participant (according to the avatar model and the second stream) may be a generated 3D graphics representation of the participant.
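  • A minimal sketch, assuming the avatar model and the avatar parameters are plain dictionaries, of how a renderer could combine the (static) avatar model with the per-frame parameter values from the second stream to obtain the CGI:

```python
def render_cgi(avatar_model, avatar_params):
    """Combine the (static) avatar model with the per-frame avatar parameters
    to obtain one frame of the 3D CGI of the participant.
    Both arguments are plain dictionaries here purely for illustration."""
    frame = {
        "mesh": avatar_model["mesh"],                    # e.g., a rigged 3D mesh
        "head_pose": avatar_params.get("head_pose"),     # e.g., rotation angles
        "expression": avatar_params.get("expression"),   # e.g., blend-shape weights
    }
    return frame

# Per-frame usage: the model is sent once, the parameters arrive in the second stream.
model = {"mesh": "rigged_participant_mesh"}
cgi_frame = render_cgi(model, {"head_pose": (0.0, 0.1, 0.0), "expression": {"smile": 0.7}})
```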
  • the at least one first communication device may also be referred to as a transmitting device (TX device), which may be wirelessly connected to the network entity.
  • the at least one second communication device may also be referred to as a receiving device (RX device), which may be wirelessly connected to the network entity.
  • the communication devices may also be referred to as (e.g., two or more) immersive communication devices, e.g., since they are configured for at least one of establishing the immersive communication session, capturing the first stream and the second stream, and receiving the first stream and the second stream and using them to render the immersive communication session.
  • Any one of the above wireless connections may operate according to a 3GPP radio access technology (RAT), e.g., according to 3GPP Long Term Evolution (LTE) or 3GPP New Radio (NR), or Wi-Fi/WLAN.
  • RAT 3GPP radio access technology
  • LTE 3GPP Long Term Evolution
  • NR 3GPP New Radio
  • At least one or each of the at least one TX device may be, or may comprise, a first capturing device configured to capture (e.g., optically) a scene including the participant, e.g., responsive to establishing of the immersive communication session.
  • at least one or each of the at least one TX device may be, or may comprise, a second capturing device (e.g., one or more sensors) configured to capture the real-time values of the avatar parameters, e.g., responsive to the control message.
  • the first and/or second capturing device may be comprised in the at least one TX device.
  • At least one or each of the at least one TX device is a terminal device configured for wired or wireless device-to-device communication (e.g., using a sidelink, Bluetooth, or Wi-Fi/WLAN) with such a (e.g., first and/or second) capturing device.
  • the terminal device (e.g., the TX device) and the (e.g., first and/or second) capturing device may be separate devices.
  • the first capturing device may comprise a 3D camera.
  • the second capturing device may comprise gloves with sensors (e.g., extensometers for detecting positions of the hand and fingers) or a headset with sensors (e.g., acceleration sensors for detecting motions of the head).
  • At least one or each of the at least one RX device may be, or may comprise, a displaying device configured to display (i.e., to visually output) a scene including the participant based on the first and/or second stream, e.g., responsive to the control message.
  • the displaying device may be embedded in the at least one RX device.
  • at least one or each of the at least one RX device may be a terminal device configured for wired or wireless device-to-device communication (e.g., using a sidelink, Bluetooth, or Wi-Fi/WLAN) with such a displaying device.
  • the terminal device (e.g., the RX device) and the first and/or second capturing device may be separate devices.
  • the displaying device may comprise virtual reality (VR) glasses.
  • Any one of the above terminal devices may be a user equipment (UE) according to 3GPP LTE or 3GPP NR.
  • any one of the above wired device-to-device communications may comprise a serial bus interface and/or may use a communication protocol according to PCI Express (PCIe) and DisplayPort (DP), e.g., for a Thunderbolt hardware interface, or a communication protocol according to a universal serial bus (USB) according to the USB Implementers Forum (USB-IF) and/or a hardware interface according to USB-C.
  • PCIe PCI Express
  • DP DisplayPort
  • USB universal serial bus
  • USB-IF USB Implementers Forum
  • USB-C USB Type-C
  • any one of the above wireless device-to-device communications may comprise a 3GPP sidelink (SL, e.g., according to the 3GPP document TS 23.303, version 17.0.0) or may operate according to IEEE 802.15.1 or Bluetooth Tethering of the Bluetooth Special Interest Group (SIG) or Wi-Fi Direct according to the IEEE 802.11 or the Wi-Fi Alliance.
  • SL 3GPP sidelink
  • SIG Bluetooth Special Interest Group
  • Wi-Fi Direct according to the IEEE 802.11 or the Wi-Fi Alliance.
  • the participant of the immersive communication session may be a physical object in a 3D space captured at the first communication device, in particular a person.
  • the 3D space may comprise a background or surrounding and a foreground or center.
  • the physical object represented by the avatar model may be located in the foreground or center of the 3D space.
  • the participants represented by the respective streams may be arranged in the same 3D space.
  • immersive media may be an umbrella term that encompasses the 3D visual representation of the participant (e.g., a sequence of photographic 3D images or a photographic 3D video) comprised in the first stream and/or the 3D CGI resulting from the avatar model in combination with the real-time values of the avatar parameters comprised in the second stream.
  • each of the first stream and the second stream may be referred to as immersive media, e.g., after providing the at least one second communication device with the avatar model.
  • the rendered participant may be located and oriented (e.g., 6DoF of the object) in a scene (e.g., a virtual room) that is rendered for the other participant as a viewer, which is also mapped, i.e., located and oriented (e.g., 6DoF of the subject), in the same virtual room (e.g., even if no immersive representation of the other participant is sent in the other direction to the participant for an asymmetric XR session).
  • a scene (which may be a monoscopic or stereoscopic image) rendered based on the first or second stream may always depend on the location and orientation of the participant and/or the location and orientation of a viewer (i.e., another participant).
  • the 3D visual representation may comprise any volumetric media, e.g., a depth map image, a plenoptic image, an omnidirectional image, a cloud of 3D points (also referred to as 3D point cloud) or 3D meshes (e.g., meshes in a 3D space).
  • the meshes may comprise a triangulation of a (e.g., curved) surface in 3D space.
  • the immersive media may encompass any media that supports a (e.g., stereoscopic) view depending on the position of the viewer.
  • the immersive media may comprise virtual reality (VR) media, augmented reality (AR) media, or mixed reality (MR) media.
  • VR virtual reality
  • AR augmented reality
  • MR mixed reality
  • the view of the participant (e.g., by means of the 3D visual representation and/or the avatar model) may depend on both the position of the participant in a virtual room (e.g., a virtual meeting room) using the first communication device and a position of a participant's viewer (e.g., another participant) in the virtual room using the second communication device.
  • the shortage of resources for the first stream may be due to a reduction of (e.g., available) resources at the network entity and/or an increase of a resource requirement of the first stream.
  • real-time may refer to a stream of data that is sent with an end-to-end latency that is less than a latency threshold for real-time conversational services or interactive communications, e.g., equal to or less than 150 ms.
  • the method (e.g., according to the first method aspect) may further comprise establishing the immersive communication session between the communication devices.
  • the immersive communication session may be initiated by the first or second communication devices, e.g., the one which communicates with the network entity.
  • the establishing may comprise receiving an establishing request from at least one or each of the communication devices.
  • the established immersive communication session may persist before, during, and/or after the switching.
  • both the first stream and the second stream may be sent and received in the same immersive communication session.
  • the network may comprise at least one of a control media application function (AF) and a data media application server (AS).
  • AF control media application function
  • AS data media application server
  • the control media AF may perform the establishing of the immersive communication session.
  • the control media AF may control the establishing of the immersive communication session.
  • the data media AS may perform the generating of the avatar model.
  • the AS may correspond to a cloud computing entity or an edge computing entity in the network.
  • the shortage of resources may occur along a communication path of the first stream.
  • the shortage of resources may include at least one of a shortage of radio resources of a radio access network (RAN) providing radio access for the at least one first communication device, a shortage of radio resources of a or the RAN providing radio access for the at least one second communication device, a shortage of a transport capacity of the network entity, a shortage of radio resources (e.g., high latency) in a core network (CN) (e.g., serving the RAN) or other node outside of the RAN, and a shortage of computational resources for rendering of the real-time 3D immersive media based on the first stream.
  • RAN radio access network
  • CN core network
  • the transport capacity may correspond to a (e.g., maximum) bit rate provided by the network entity.
  • the switching is triggered if a bit rate required by the first stream is greater than the maximum bit rate of the network entity.
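  • A minimal sketch of this trigger condition, assuming the required bit rate of the first stream and the transport capacity of the network entity are known in bits per second:

```python
def shortage_of_transport_capacity(required_bit_rate_bps: float,
                                   max_bit_rate_bps: float) -> bool:
    """Return True if the first stream cannot be supported, i.e., if the bit rate
    required by the first stream exceeds the transport capacity of the network entity."""
    return required_bit_rate_bps > max_bit_rate_bps

# Example: a 40 Mbit/s point-cloud stream over a 25 Mbit/s transport triggers the switch.
if shortage_of_transport_capacity(40e6, 25e6):
    pass  # send the control message indicative of switching to the avatar model
```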
  • the shortage of resources at the network entity may comprise a shortage of computation resources for the rendering of the real-time data.
  • the resources for the first stream may comprise at least one of computational resources for processing the first stream and radio resources for a radio communication of the immersive communication session.
  • Multiple first streams may be received from multiple first communication devices among the communication devices (e.g., according to the first method aspect).
  • the resources of the shortage (i.e., the resources that are subject to the shortage) may include computational resources for rendering a scene composed of the multiple real-time 3D visual representations of the multiple participants.
  • the method may further comprise, in response to the shortage of resources for the first stream, sending the control message to the at least one second communication device.
  • the second stream may be (e.g., implicitly) indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant.
  • Receiving the second stream at the at least one second communication device may trigger the at least one second communication device to switch the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant.
  • the mere reception of the second stream may be interpreted as an indicator, request, or instruction, for the switching.
  • the control message sent to the at least one first communication device may trigger one or more sensors at the at least one first communication device to capture the real-time values of the avatar parameters.
  • the control message sent to the at least one second communication device may trigger rendering (e.g., at the respective second communication device) the real-time 3D CGI of the participant based on the generated avatar model and the real-time values of the avatar parameters.
  • the method may further comprise at least one of saving the generated avatar model and monitoring the 3D visual representation of the received first stream for a change in the avatar model of the participant of the immersive communication session.
  • the network entity may comprise, or may be part of, a radio access network (RAN).
  • the RAN may provide radio access to at least one of the communication devices.
  • the network entity (e.g., according to the first method aspect) may comprise, or may be part of, a network node of a RAN.
  • the network node may serve at least one of the communication devices.
  • the network entity (e.g., according to the first method aspect) may comprise, or may be part of, a core network (CN).
  • the CN may transport the first and second streams between the communication devices and/or may perform mobility management for the communication devices.
  • the network entity (e.g., according to the first method aspect) may comprise, or may be part of, a local area network (LAN).
  • the real-time 3D visual representation of the participant may be at least one of encoded and compressed in the received first stream.
  • the avatar model may comprise a biomechanical model for the motions of the participant of the immersive communication session.
  • the motions of the avatar model controlled by the avatar parameters may comprise at least one of gestures, facial expressions, and head motion of the participant that is encoded in the values of the avatar parameters in the received second stream.
  • the method may further comprise or initiate at least one of: re-generating the avatar model of the participant of the immersive communication session or updating the generated and saved avatar model of the participant of the immersive communication session, and sending the re-generated or updated avatar model to the at least one second communication device.
  • the avatar model may be re-generated or updated after re-establishing the immersive communication session.
  • the avatar model may be re-generated or updated responsive to a change in the 3D visual representation being greater than a predefined threshold.
  • the re-generated or updated avatar model may be sent responsive to a change in the avatar model being greater than a predefined threshold.
  • the change (e.g., the change in the 3D visual representation being greater than a predefined threshold) may be detected by means of a filter.
  • the filter may be applied to the first stream.
  • the filter may be invariant under rotation and/or translation of the participant.
  • the filter may be invariant under facial expressions and/or motions that corresponds to a change in the avatar parameters.
  • the change may be a significant change in appearance.
  • the filter may output a scalar or a vector that is indicative of the change.
  • the change may trigger the updating of the avatar model if the scalar is greater than a predefined threshold or if a magnitude of the vector is greater than a predefined threshold or if a scalar product between the vector of the change and a predefined reference vector is greater than a predefined threshold.
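  • A sketch of the update trigger described above, assuming the filter output is available as a scalar or a vector and leaving the filter itself abstract:

```python
import numpy as np

def update_triggered(filter_output, threshold: float, reference=None) -> bool:
    """Decide whether the avatar model should be re-generated or updated.
    filter_output may be a scalar or a vector (array); reference is an optional
    predefined reference vector for the scalar-product criterion."""
    value = np.asarray(filter_output, dtype=float)
    if value.ndim == 0:                       # scalar output of the filter
        return float(value) > threshold
    if reference is not None:                 # scalar product with a reference vector
        return float(np.dot(value, np.asarray(reference, dtype=float))) > threshold
    return float(np.linalg.norm(value)) > threshold   # magnitude of the vector

# Examples for the three criteria mentioned above:
update_triggered(0.8, threshold=0.5)                       # scalar greater than threshold
update_triggered([0.3, 0.4], threshold=0.45)               # vector magnitude greater than threshold
update_triggered([0.3, 0.4], 0.2, reference=[1.0, 0.0])    # scalar product greater than threshold
```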
  • the avatar model or an update of the avatar model may be sent simultaneously with the sending of the real-time 3D visual representation, or upon establishing or re-establishing of the immersive communication session, to the at least one second communication device.
  • the real-time values of the avatar parameters may be derived from one or more depth-insensitive image sensors and/or one or more acceleration sensors capturing facial expressions or motions of the participant at the first communication device.
  • the one or more depth-insensitive image sensors may comprise at least one of: a camera for capturing a 2D image (e.g., by projecting light of the participant onto a 2D image sensor), and a filter for detecting facial landmarks in the 2D image of the participant.
  • the real-time 3D visual representation may be derived from one or more depth-sensitive image sensors at the first communication device.
  • the one or more depth-sensitive image sensors may comprise at least one of: a light field camera, an array of angle-sensitive pixels, a sensor for light-in-flight imaging, a sensor for light detection and ranging (LIDAR), a streak camera, and a device using structured light triangulation for depth sensing.
  • LIDAR light detection and ranging
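  • For illustration, a hypothetical sketch of deriving avatar-parameter values from a depth-insensitive sensor (a 2D camera image plus a facial-landmark filter); detect_landmarks is a placeholder for any landmark detector and is not a real library call:

```python
def derive_avatar_parameters(image_2d, detect_landmarks):
    """Derive real-time avatar-parameter values from a depth-insensitive sensor:
    a 2D camera image plus a facial-landmark filter. 'detect_landmarks' is a
    placeholder for any landmark detector returning named 2D points."""
    landmarks = detect_landmarks(image_2d)    # e.g., eye corners, mouth corners, nose tip
    mouth_open = landmarks["mouth_lower"][1] - landmarks["mouth_upper"][1]
    return {
        "head_pose": landmarks.get("head_pose"),    # if provided by the detector
        "expression": {"mouth_open": mouth_open},
        "landmarks": landmarks,                     # raw values for the second stream
    }
```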
  • the control message sent to the at least one first communication device may be further indicative of at least one of a type of sensors for capturing the 3D visual representation of the participant to be deactivated, a type of sensors for deriving the real-time values of the avatar parameters to be activated, handing over control from a or the control media AF at the network entity to a media session handler at the respective first communication device, sending a notification from a or the media session handler at the respective first communication device to a mixed reality application at the respective first communication device, switching a mixed reality run-time engine at the respective first communication device from capturing the 3D visual representation to deriving the real-time values of the avatar parameters, and switching one or more media access functions at the respective first communication device from an immersive media encoder encoding the 3D visual representation to a motion sensor encoder encoding the real-time values of the avatar parameters.
  • the control message sent to the at least one second communication device may be further indicative of at least one of a mixed reality run-time engine for pose correction and/or rendering of the real-time 3D visual representation of the participant to be deactivated, a mixed reality scene manager for pose correction and/or rendering of the real-time 3D CGI based on the generated avatar model and the real-time values of the avatar parameters to be activated, handing over control from a or the control media AF at the network entity to a media session handler at the respective second communication device, sending a notification from a or the media session handler at the respective second communication device to a mixed reality application at the respective second communication device, switching a mixed reality run-time engine at the respective second communication device from rendering the real-time 3D visual representation to rendering the real-time 3D CGI based on the avatar model and the real-time values of the avatar parameters, and switching one or more media access functions at the respective second communication device from an immersive media decoder decoding the 3D visual representation to a motion sensor decoder decoding the real-time values of the avatar parameters.
  • the control message may be sent to a or the media session handler of the at least one first communication device and/or a or the media session handler of the at least one second communication device.
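  • A hedged sketch of possible control-message content covering the indications listed above; all field names and default values are illustrative assumptions, not a defined message format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SwitchControlMessage:
    """Illustrative content of the control message; all fields are examples."""
    switch_to: str = "avatar_model"                 # switch from the first to the second stream
    deactivate_sensors: List[str] = field(default_factory=lambda: ["3d_camera"])
    activate_sensors: List[str] = field(default_factory=lambda: ["motion_sensors", "facial_landmarks"])
    tx_media_function: str = "motion_sensor_encoder"   # replaces the immersive media encoder
    rx_media_function: str = "motion_sensor_decoder"   # replaces the immersive media decoder
    notify_application: bool = True                     # media session handler notifies the MR application
```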
  • Each of the first stream and the second stream may further comprise immersive audio.
  • the immersive audio may be unchanged during the switching.
  • the immersive audio may provide Immersive Voice and Audio Services (IVAS).
  • IVAS Immersive Voice and Audio Services
  • the shortage of resources may be determined based on network state information at the network entity.
  • the network state information may be received periodically or event-driven.
  • the avatar model may be generated by a or the data media application server (AS) of the network entity.
  • the data media AS may also be referred to as a media AS.
  • the shortage for switching may be determined by a or the control media application function (AF) of the network entity.
  • AF control media application function
  • any “radio device” may be a user equipment (UE).
  • UE user equipment
  • any one or each of the communication devices may be referred to, or may be part of, a radio device.
  • the technique may be applied in the context of 3GPP LTE or New Radio (NR).
  • NR New Radio
  • the technique may be implemented in accordance with a 3GPP specification, e.g., for 3GPP release 17 or 18 or later.
  • the technique may be implemented for 3GPP LTE or 3GPP NR.
  • the RAN may be implemented by one or more base stations (e.g., network nodes).
  • the RAN may be implemented according to the Global System for Mobile Communications (GSM), the Universal Mobile Telecommunications System (UMTS), 3GPP Long Term Evolution (LTE) and/or 3GPP New Radio (NR).
  • GSM Global System for Mobile Communications
  • UMTS Universal Mobile Telecommunications System
  • LTE 3GPP Long Term Evolution
  • NR 3GPP New Radio
  • a method performed by a first communication device for supporting an immersive communication session with at least one second communication device comprises sending a first stream in the immersive communication session to the at least one second communication device through a network entity.
  • the first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session.
  • the real-time 3D visual representation enables (e.g., triggers) the network entity to generate an avatar model of the participant and to send the generated avatar model to the at least one second communication device.
  • the method further comprises receiving a control message from the network entity.
  • the control message is indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery (CGI) of the participant.
  • the method further comprises sending a second stream in the immersive communication session to the at least one second communication device through the network entity.
  • the second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant.
  • the first communication device may be configured to capture the real-time 3D visual representation of a participant.
  • the first communication device may comprise any one of the first capturing devices mentioned in the context of the first method aspect.
  • the first communication device may comprise a (e.g., the above-mentioned) one or more depth-sensitive image sensors configured to capture the real-time 3D visual representation of a participant.
  • the first communication device may comprise a wired interface (e.g., according to any one of the above-mentioned wired device-to-device communications, particularly a serial bus, optionally according to the universal serial bus type C, USB-C) or a wireless interface (e.g., according to any one of the above-mentioned wireless device-to-device communications, particularly a sidelink, optionally according to the third generation partnership project, 3GPP) to the one or more depth-sensitive image sensors.
  • the first communication device may be configured to capture the real-time values for the avatar parameters.
  • the first communication device may comprise any one of the second capturing devices mentioned in the context of the first method aspect.
  • the first communication device may comprise a (e.g., the above-mentioned) one or more depth-insensitive image sensors or headsets configured to capture the real-time values for the avatar parameters.
  • the first communication device may comprise a wired interface (e.g., according to any one of the above-mentioned wired device-to-device communications, particularly a serial bus, optionally according to USB-C) or a wireless interface (e.g., according to any one of the above-mentioned wireless device-to-device communications, particularly a sidelink, optionally according to 3GPP) to the one or more depth-insensitive image sensors.
  • the method according to the second method aspect may further comprise any feature or step disclosed in the context of the first method aspect, or a feature or step corresponding thereto.
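  • As an illustrative, non-normative sketch of the second method aspect, the first communication device may stream the 3D visual representation until the control message arrives and then stream the avatar-parameter values; the session and capture callables below are placeholders:

```python
def run_tx_device(session, capture_3d, capture_avatar_values, control_messages):
    """Sketch of the first-communication-device behavior: stream the 3D visual
    representation until a control message indicates switching, then stream the
    real-time avatar-parameter values. All callables are placeholders."""
    use_avatar = False
    while session.active():
        msg = control_messages.poll()                       # non-blocking check
        if msg and msg.get("control") == "switch_to_avatar":
            use_avatar = True                               # switch first -> second stream
        if use_avatar:
            session.send_second_stream(capture_avatar_values())   # avatar parameters
        else:
            session.send_first_stream(capture_3d())               # e.g., a point cloud
```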
  • a method performed by a second communication device for supporting an immersive communication session with at least one first communication device comprises receiving a first stream in the immersive communication session from the at least one first communication device through a network entity.
  • the first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session.
  • the method further comprises rendering the 3D visual representation of the participant at the second communication device.
  • the method further comprises receiving an avatar model of the participant of the immersive communication session from the network entity.
  • the avatar model is generated based on the 3D visual representation of the participant for a real-time 3D computer-generated imagery (CGI) of the participant.
  • CGI computer-generated imagery
  • the method further comprises receiving a second stream in the immersive communication session from the at least one first communication device through the network entity.
  • the second stream comprises real-time values of avatar parameters controlling motions (e.g., facial expressions and/or other sensory expressions) of the avatar model of the participant.
  • the method further comprises rendering the CGI of the participant based on the received avatar model and the received real-time values of the avatar parameters at the second communication device.
  • the second communication device may be configured to render and/or display the real-time 3D visual representation of the participant.
  • the second communication device may comprise any one of the displaying devices mentioned in the context of the first or second method aspect.
  • the second communication device may comprise a holographic beamer or head-mounted optical display (also referred to as VR glasses) for displaying the real-time 3D visual representation of the participant.
  • the second communication device may comprise a wired interface (e.g., according to any one of the above-mentioned wired device-to-device communications, particularly a serial bus, optionally according to the universal serial bus type C, USB-C) or a wireless interface (e.g., according to any one of the above-mentioned wireless device-to-device communications, particularly a sidelink, optionally according to 3GPP) to the displaying device, e.g., the holographic beamer or the optical head-mounted display.
  • If the rendering of the real-time 3D visual representation is performed by the network entity (e.g., according to edge computing or cloud computing), the first stream as received at the second communication device may comprise rendered media resulting from the rendering.
  • the rendered media resulting from the rendering may be streamed to the second communication device with the displaying device being embedded in the second communication device (e.g., a mobile phone or VR glasses).
  • the rendered media resulting from the rendering may be streamed through the second communication device (e.g., a mobile phone) to the displaying device using any one of the wired or wireless communications between the second communication device (e.g., as a terminal device) and the displaying device.
  • the second communication device may be configured to render and/or display the real-time 3D CGI of the participant based on the generated avatar model and the real-time values for the avatar parameters.
  • the second communication device may comprise any one of the displaying devices for displaying the real-time 3D CGI, e.g., the holographic beamer or head-mounted optical display.
  • the second communication device may use any one of the wired interfaces (e.g., a serial bus, optionally according to USB-C) or the wireless interfaces (e.g., a sidelink, optionally according to 3GPP) for displaying the real-time 3D CGI.
  • the second stream as received at the second communication device may comprise rendered media resulting from the rendering.
  • the rendered media resulting from the rendering may be streamed to the second communication device with the displaying device being embedded in the second communication device (e.g., a mobile phone or VR glasses).
  • the rendered media resulting from the rendering may be streamed through the second communication device (e.g., a mobile phone) to the displaying device using any one of the wired or wireless communications between the second communication device (e.g., as a terminal device) and the displaying device.
  • the second communication device may save the avatar model received from the network entity.
  • the second communication device may cache and/or update the received avatar model.
  • the second communication device may further be configured to receive a regenerated or updated avatar model, e.g., upon re-establishment of the immersive communication session.
  • the second communication device may compare the updated avatar model with the saved avatar model. In case a difference between the updated avatar model and the saved avatar model is greater than a predefined threshold, the second communication device may save the avatar model (e.g., replacing the previously saved avatar model by the updated avatar model).
  • the method may further comprise decoding and/or processing the received first stream and/or the received second stream of the immersive communication session.
  • the method may further comprise receiving a control message indicative of switching from the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant.
  • the method according to the third method aspect may further comprise any feature or step of the first method aspect, or a feature or step corresponding thereto.
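  • As an illustrative, non-normative sketch of the third method aspect, including the caching and threshold-based replacement of the avatar model described above; the session, renderer, and difference metric are placeholders:

```python
def run_rx_device(session, renderer, model_cache, difference, threshold):
    """Sketch of the second-communication-device behavior: render the photographic
    3D representation from the first stream, cache the avatar model received from
    the network entity, and render the CGI once the second stream arrives.
    'difference' is a placeholder metric comparing two avatar models."""
    while session.active():
        event = session.receive()
        if event.kind == "first_stream":
            renderer.render_3d_representation(event.payload)
        elif event.kind == "avatar_model":
            cached = model_cache.get("avatar")
            if cached is None or difference(event.payload, cached) > threshold:
                model_cache["avatar"] = event.payload        # save/replace the model
        elif event.kind == "second_stream":
            renderer.render_cgi(model_cache["avatar"], event.payload)
```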
  • a network entity for supporting an immersive communication session between communication devices comprises memory operable to store instructions and processing circuitry operable to execute the instructions, such that the network entity is operable to receive a first stream in the immersive communication session from at least one first communication device among the communication devices.
  • the first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session.
  • the network entity is further operable to send the real-time 3D visual representation or immersive media rendered based on the real-time 3D visual representation to at least one second communication device of the communication devices.
  • the network entity is further operable to generate an avatar model of a participant of the immersive communication session based on the 3D visual representation received from the at least one first communication device.
  • the network entity is further operable to send the generated avatar model to the at least one second communication device.
  • the network entity is further operable to send, in response to a shortage of resources for the first stream, a control message to the at least one first communication device.
  • the control message is indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery (CGI) of the participant.
  • the network entity is further operable to receive a second stream in the immersive communication session from the at least one first communication device.
  • the second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant.
  • the network entity is further operable to send the real-time values to the at least one second communication device.
  • the network entity may be further operable to perform any one of the steps of the first method aspect.
  • a first communication device for supporting an immersive communication session with at least one second communication device is provided.
  • the first communication device comprises memory operable to store instructions and processing circuitry operable to execute the instructions, such that the first communication device is operable to send a first stream in the immersive communication session to the at least one second communication device through a network entity.
  • the first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session.
  • the real-time 3D visual representation enables the network entity to generate an avatar model of the participant and to send the generated avatar model to the at least one second communication device.
  • the first communication device is further operable to receive a control message from the network entity.
  • the control message is indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery (CGI) of the participant.
  • the first communication device is further operable to send a second stream in the immersive communication session to the at least one second communication device through the network entity.
  • the second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant.
  • the first communication device may further be operable to perform any one of the steps of the second method aspect.
  • a first communication device for supporting an immersive communication session with at least one second communication device is provided.
  • the first communication device is configured to send a first stream in the immersive communication session to the at least one second communication device through a network entity.
  • the first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session.
  • the real-time 3D visual representation enables the network entity to generate an avatar model of the participant and to send the generated avatar model to the at least one second communication device.
  • the first communication device is further configured to receive a control message from the network entity.
  • the control message is indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery (CGI) of the participant.
  • the first communication device is further configured to send a second stream in the immersive communication session to the at least one second communication device through the network entity.
  • the second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant.
  • the first communication device may further be configured to perform any one of the steps of the second method aspect.
  • a system for supporting an immersive communication session comprises a network entity comprising processing circuitry configured to execute the steps of the first method aspect.
  • the system further comprises at least one first communication device comprising a processing circuitry configured to execute the steps of the second method aspect.
  • the system further comprises at least one second communication device comprising a processing circuitry configured to execute the steps of the third method aspect.
  • a communication system including a host computer (e.g., the application server or the first communication device) is provided.
  • the host computer comprises a processing circuitry configured to provide user data, e.g., included in the first and/or second stream.
  • the host computer further comprises a communication interface configured to forward the user data to a cellular network (e.g., the RAN and/or the base station) for transmission to a UE.
  • a processing circuitry of the cellular network is configured to execute any one of the steps of the first method aspect.
  • the UE comprises a radio interface and processing circuitry, which is configured to execute any one of the steps of the second and/or third method aspects.
  • the communication system may further include the UE.
  • the cellular network may further include one or more base stations configured for radio communication with the UE and/or to provide a data link between the UE and the host computer using the first and/or second method aspects.
  • the processing circuitry of the host computer may be configured to execute a host application, thereby providing the first and/or second data and/or any host computer functionality described herein.
  • the processing circuitry of the UE may be configured to execute a client application associated with the host application.
  • Any one of the network entity (e.g., a base station as a RAN node or a CN node), the communication devices (e.g., UEs), the communication system, or any node or station for embodying the technique may further include any feature disclosed in the context of the method aspect, and vice versa.
  • any one of the units and modules disclosed herein may be configured to perform or initiate one or more of the steps of the method aspect.
  • Fig. 1 shows a schematic block diagram of an embodiment of a first communication device for supporting an immersive communication session
  • Fig. 2 shows a schematic block diagram of an embodiment of a second communication device for supporting an immersive communication session
  • Fig. 3 shows a schematic block diagram of an embodiment of a network entity for supporting an immersive communication session
  • Fig. 4 shows a schematic block diagram of system for supporting an immersive communication session
  • Fig. 5 shows a flowchart for a method for supporting an immersive communication session with at least one second communication device, which method may be implementable by the first communication device of Fig. 1;
  • Fig. 6 shows a flowchart for a method for supporting an immersive communication session with at least one first communication device, which method may be implementable by the second communication device of Fig. 2;
  • Fig. 7 shows a flowchart for a method for supporting an immersive communication session between communication devices, which method may be implementable by the network entity of Fig. 3;
  • Fig. 8 schematically illustrates a first example of a radio network comprising embodiments of the devices of Figs. 1, 2, and 3, for performing the methods of Figs. 5, 6, and 7, respectively;
  • Fig. 9 schematically shows the standalone architecture referred to as 5G STandalone AR (STAR) user equipment (UE);
  • STAR 5G STandalone AR
  • UE user equipment
  • Fig. 10 schematically shows the cloud/edge assisted device architecture, referred to as 5G EDGe-Dependent AR (EDGAR) UE according to the prior art;
  • EDGAR 5G EDGe-Dependent AR
  • Fig. 11 shows a procedure diagram for immersive communication for AR conversational services according to the prior art
  • Figs 12a and 12b show flowcharts for immersive communication session (e.g., shared AR conversational experience call flow) for receiving EDGAR UE;
  • Fig. 13 shows a generic example of an immersive communication session between communication devices according to the method of Fig. 7;
  • Fig. 14 shows the example of Fig. 13 in which the network entity, after completing generation of the avatar model, sends the generated avatar model to the second communication device;
  • Fig. 15 shows the example of Fig. 13 or 14 in which the network entity regenerates the avatar model based on the 3D visual representation received from the first communication device
  • Fig. 16 shows the example of any one of Figs. 13 to 15 in which the network entity, in response to a shortage of resources for the first stream, sends a control message to the first communication device indicative of switching from the first stream to the second stream of immersive communication session;
  • Fig. 17 shows a data flow for avatar model generation in an immersive communication session
  • Fig. 18 shows a data flow for avatar model caching in the immersive communication session
  • Fig. 19 shows a data flow for the avatar model update in the immersive communication session
  • Fig. 20 shows a data flow for switching to the avatar model in the immersive communication session;
  • Fig. 21 shows another example of an immersive communication session
  • Fig. 22 shows an exemplary usage of a Standalone AR device architecture according to Fig. 9;
  • Fig. 23 shows an exemplary usage of a cloud/edge assisted architecture according to Fig. 10;
  • Fig. 24 shows an exemplary call flow for toggling between 3D real-time streams and animated avatars
  • Fig. 25 shows an exemplary call flow for toggling to 3D communications based on Standalone AR device architecture
  • Fig. 26 shows a call flow for toggling to avatar model for a real-time 3D CGI of the participant in the immersive communication session
  • Fig. 27 shows the steps involved in switching from 3D communications to photorealistic avatars;
  • Fig. 28 shows the steps involved in switching from 3D communications to photorealistic avatars in a cloud/edge-assisted architecture;
  • Fig. 29 shows an exemplary flowchart for switching from 3D sensors to motion sensors and facial-landmark transmission at the first communication device;
  • Fig. 30 shows an exemplary flowchart of the procedures for the second communication device to switch from 3D real-time rendering to animated avatars.
  • Fig. 31 shows a schematic block diagram of a first communication device embodying the first communication device of Fig. 1;
  • Fig. 32 shows a schematic block diagram of a second communication device embodying the second communication device of Fig. 2;
  • Fig. 33 shows a schematic block diagram of the network entity of Fig. 3;
  • Fig. 34 schematically illustrates an example telecommunication network connected via an intermediate network to a host computer
  • Fig. 35 shows a generalized block diagram of a host computer communicating via a base station or radio device functioning as a gateway with a user equipment over a partially wireless connection;
  • Figs. 36 and 37 show flowcharts for methods implemented in a communication system including a host computer, a base station or radio device functioning as a gateway and a user equipment.
  • WLAN Wireless Local Area Network
  • 3GPP LTE e.g., LTE-Advanced or a related radio access technique such as MulteFire
  • Bluetooth according to the Bluetooth Special Interest Group (SIG), particularly Bluetooth Low Energy, Bluetooth Mesh Networking and Bluetooth broadcasting, Z-Wave according to the Z-Wave Alliance, or ZigBee based on IEEE 802.15.4.
  • SIG Bluetooth Special Interest Group
  • Fig. 1 schematically illustrates a block diagram of an embodiment of a first communication device for supporting an immersive communication session with at least one second communication device.
  • the first communication device is generically referred to by reference sign 100.
  • the first communication device 100 may also be referred to as transmitting device, TX device, capturing device, capturing user equipment (UE), and/or sender UE.
  • the first communication device 100 comprises a first stream sending module 102.
  • the first stream sending module 102 sends a first stream in the immersive communication session to the at least one second communication device through a network entity 300.
  • the first stream in the immersive communication session comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session.
  • the real-time 3D visual representation of a participant may comprise a background or surrounding and a foreground or center.
  • the first communication device 100 may be configured to capture the realtime 3D visual representation of a participant as the first stream.
  • the visual representation of the participant may be a photographic representation of the participant.
  • the first communication device 100 may comprise one or more first capturing devices configured to capture a scene, e.g., a scene including the participant, responsive to the establishing of the immersive communication session.
  • the photographic representation may be captured by a camera and/or may comprise pixels or voxels of the participant.
  • the voxels may comprise pixels with a depth information.
  • the camera may operate in the visual and/or infrared spectrum.
  • the first communication device 100 may be configured to capture the real-time 3D visual representation of a participant.
  • the first communication device may comprise one or more of the depth-sensitive image sensors configured to capture the real-time 3D visual representation of a participant.
  • the first communication device may comprise at least one of the wired interfaces (e.g., a serial bus, optionally according to the universal serial bus type C, USB-C) or at least one of the wireless interfaces (e.g., using Bluetooth or a sidelink, optionally according to the third generation partnership project, 3GPP) to the one or more depth-sensitive image sensors.
  • the real-time 3D visual representation is derived from one or more depth-sensitive image sensors at the first communication device 100.
  • the one or more depth-sensitive image sensors comprise at least one of a light field camera, an array of angle-sensitive pixels, a sensor for light-in-flight imaging, a sensor for light detection and ranging (LIDAR), a streak camera, and a device using structured light triangulation for depth sensing.
  • LIDAR light detection and ranging
  • the first stream may also be referred to as a 3D photographic media stream.
  • the 3D visual representation may be a captured 3D visual representation of the participant.
  • the first communication device 100 may encode the captured 3D visual representation of a participant in the first stream of the immersive communication session.
  • the first stream may enable the network entity 300 to generate an avatar model of a participant of the communication session based on the real-time 3D visual representation received from the first communication device 100.
  • the network entity 300 may send the generated avatar model to the at least one second communication device.
  • the avatar model is a computer-generated imagery (CGI) created visual content with computer software.
  • the avatar model may comprise a biomechanical model for the motions of the participant of the immersive communication session.
  • the motions of the avatar model controlled by the avatar parameters comprise at least one of gestures, facial expressions, and head motion of the participant that is encoded in the values of the avatar parameters in the received second stream.
  • the avatar model enables rendering a graphical representation of the participant.
  • the avatar may be controllable by a set of avatar parameters.
  • the avatar parameters may control the motions of avatar model (e.g., a gesture, a facial expression or any kind of movement of the avatar).
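• As a non-limiting illustration of such a parameter set, a single real-time sample of the avatar parameters could be structured as in the following sketch; the concrete parameterization (blendshape weights plus a head pose) is an assumption made for the example and not mandated by the technique:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class AvatarParameters:
    """One real-time sample of the avatar parameters (illustrative only).

    The concrete parameterization (blendshape weights plus a head pose) is an
    assumption; any encoding of gestures, facial expressions and head motion
    of the participant could be carried in the second stream.
    """
    timestamp_ms: int                                                    # capture time of the sample
    head_rotation: Tuple[float, float, float] = (0.0, 0.0, 0.0)          # yaw, pitch, roll in radians
    head_translation: Tuple[float, float, float] = (0.0, 0.0, 0.0)       # metres in the avatar frame
    blendshape_weights: Dict[str, float] = field(default_factory=dict)   # e.g. {"jawOpen": 0.3}

# One sample of the second stream could then look like this:
sample = AvatarParameters(
    timestamp_ms=1200,
    head_rotation=(0.05, -0.02, 0.0),
    blendshape_weights={"jawOpen": 0.30, "eyeBlinkLeft": 0.85},
)
```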
  • the first communication device 100 may comprise a second capturing device (e.g., one or more sensors) to capture the real-time values of the avatar parameters responsive to the control message.
• the first communication device 100 may further comprise a control reception module 104 that receives a control message from the network entity 300.
  • the control message may be indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D CGI of the participant.
  • the control message received from the network entity 300 may be further indicative of at least one of a type of sensors for capturing the 3D visual representation of the participant to be deactivated, a type of sensors for deriving the real-time values of the avatar parameters to be activated, sending a notification from a or the media session handler at the respective first communication device 100 to a mixed reality application at the respective first communication device 100, switching a mixed reality run-time engine at the respective first communication device 100 from capturing the 3D visual representation to deriving the real-time values of the avatar parameters, and switching one or more media access functions at the respective first communication device 100 from an immersive media encoder encoding the 3D visual representation to a motion sensor encoder encoding the real-time values of the avatar parameters.
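• The following minimal sketch illustrates how the first communication device 100 might act on such a control message; all message fields and the SenderDevice attributes are hypothetical placeholders rather than a normative signaling format or API:

```python
# Illustrative sketch only: the message fields and the SenderDevice attributes
# are hypothetical placeholders, not a normative 3GPP signaling format or API.

class SenderDevice:
    """Simplified stand-in for the capture/encode pipeline of device 100."""

    def __init__(self):
        self.active_sensors = {"depth_camera"}   # sensors capturing the 3D visual representation
        self.encoder = "immersive_media"         # media access function encoder currently in use
        self.notifications = []                  # notifications forwarded to the MR application

    def apply_switch(self, msg: dict) -> None:
        # Deactivate the sensors used for the first stream and activate the
        # sensors used to derive the real-time avatar parameter values.
        self.active_sensors -= set(msg.get("sensors_to_deactivate", ["depth_camera"]))
        self.active_sensors |= set(msg.get("sensors_to_activate", ["rgb_camera", "imu"]))
        # Notify the mixed reality application via the media session handler.
        self.notifications.append("switched to avatar-parameter capture")
        # Switch the media access function from the immersive media encoder
        # to the encoder for the avatar parameters (second stream).
        self.encoder = "avatar_parameters"

device = SenderDevice()
device.apply_switch({"action": "switch_to_avatar"})
```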
  • the first communication device 100 further comprises a second stream sending module 106.
  • the second stream sending module 106 sends a second stream of the immersive communication session to the at least one second communication device through the network entity 300.
• the second stream of the immersive communication session comprises a real-time stream of values of the avatar parameters controlling motions of the avatar model of the participant from the first communication device 100.
  • the first communication device may be configured to capture the real-time values for the avatar parameters.
  • the first communication device may comprise the one or more depth-insensitive image sensors configured to capture the real-time values for the avatar parameters.
  • the first communication device may comprise at least one of the wired interfaces (e.g., a serial bus, optionally according to USB-C) or at least one of the wireless interfaces (e.g., Bluetooth or a sidelink, optionally according to 3GPP) to the one or more depth-insensitive image sensors.
  • the second stream may also be referred to as an avatar-controlling stream and/or photo-realistic avatar and/or animated avatar and/or real-time avatar.
  • the 3D visual representation may be a generated 3D visual representation.
• the first communication device 100 may encode the real-time stream of values of the avatar parameters of a participant in the second stream of the immersive communication session.
  • the first communication device 100 may further comprise one or more first image sensors.
  • the real-time values of the avatar parameters may be derived from one or more first image sensors.
  • the one or more first image sensors may comprise depth-insensitive image sensors and/or one or more acceleration sensors capturing facial expressions or motions of the participant at the first communication device 100.
• the depth-insensitive image sensor may comprise a camera for projecting the participant onto a 2-dimensional (2D) image and/or a filter for detecting facial landmarks in the 2D image of the participant.
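• A minimal sketch of such a landmark-based derivation is given below; the mapping of two lip landmarks to a single "jawOpen" parameter is a deliberately simplified assumption, since a deployed system would typically use a trained model:

```python
import math
from typing import Dict, Tuple

Landmarks = Dict[str, Tuple[float, float]]  # named 2D facial landmarks in image coordinates

def derive_avatar_parameters(landmarks: Landmarks) -> Dict[str, float]:
    """Map detected 2D facial landmarks to an avatar parameter (simplified).

    The mouth opening is estimated from the vertical lip distance, normalised
    by the inter-ocular distance so that the value is roughly scale-invariant.
    """
    eye_dist = math.dist(landmarks["left_eye"], landmarks["right_eye"])
    mouth_gap = math.dist(landmarks["upper_lip"], landmarks["lower_lip"])
    jaw_open = min(1.0, mouth_gap / (0.5 * eye_dist)) if eye_dist > 0 else 0.0
    return {"jawOpen": jaw_open}

# Example with hypothetical landmark positions (pixel coordinates):
params = derive_avatar_parameters({
    "left_eye": (100.0, 120.0), "right_eye": (160.0, 120.0),
    "upper_lip": (130.0, 170.0), "lower_lip": (130.0, 185.0),
})
```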
  • Each of the first stream and the second stream may comprise immersive audio.
  • the immersive audio may be unchanged during the switching.
  • the first communication device 100 may be in communication with the network entity 300.
  • Fig. 2 schematically illustrates a block diagram of an embodiment of the second communication device for supporting an immersive communication session.
  • the second communication device is generically referred to by reference sign 200.
  • the second communication device 200 may also be referred to as receiving device, RX device, receiver UE, and/or receiving UE.
  • the second communication device 200 comprises a first stream reception module 202.
  • the first stream reception module 202 receives the first stream of the immersive communication session from the at least one first communication device 100 through the network entity 300.
• the first stream comprises a real-time 3D visual representation of a participant of the immersive communication session.
  • the second communication device 200 may render the 3D visual representation of a participant of the immersive communication session.
  • the second communication device 200 may be configured to render and/or display the real-time 3D visual representation of the participant.
  • the second communication device 200 may comprise at least one of the displaying devices (e.g., a holographic beamer or optical head-mounted display) for displaying the real-time 3D visual representation of the participant.
  • the second communication device 200 may comprise at least one of the wired interfaces (e.g., a serial bus, optionally according to the universal serial bus type C, USB-C) or at least one of the wireless interfaces (e.g., Bluetooth or a sidelink, optionally according to 3GPP) to the displaying device such as a holographic beamer or optical head-mounted display.
  • the second communication device 200 may decode and/or process the first stream of the immersive communication session.
  • the second communication device 200 further comprises an avatar model reception module 204.
  • the avatar model reception module 204 may receive the avatar model of the participant of the immersive communication session from the network entity 300.
  • the avatar model may be generated based on the 3D visual representation of the participant for a real-time 3D CGI of the participant.
  • the second communication device 200 may cache the avatar model during receiving and save the received avatar model.
  • the second communication device 200 further comprises a second stream reception module 206.
  • the second stream reception module 206 may receive the second stream of the immersive communication session from the at least one first communication device 100 through the network entity 300.
  • the second stream of the immersive communication session may comprise a real-time stream of values of the avatar parameters controlling motions of the avatar model of the participant.
  • the second communication device 200 may render the CGI of the participant based on the received avatar model and the received real-time values of the avatar parameters at the second device 200.
  • the second communication device 200 may be configured to render and/or display the CGI of the participant based on the generated avatar model and the real-time values for the avatar parameters, e.g., using any one of the displaying devices such as the holographic beamer or optical head-mounted display and/or using any one of the wired interfaces (e.g., a serial bus, optionally according to USB-C) or any one of the wireless interfaces (e.g., Bluetooth or a sidelink, optionally according to 3GPP).
  • Receiving the second stream at the at least one second communication device 200 may trigger the at least one second communication device to switch the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant.
  • the second communication device 200 may further be able to receive a regenerated or updated avatar model, e.g., upon re-establishment of the immersive communication session. Alternatively or in addition, the second communication device may 200 compare the updated avatar model with the saved avatar model. In case a difference between the updated avatar model and the saved avatar model is greater than a predefined threshold, the second communication device 200 may save the avatar model (e.g., replacing the previously saved avatar model by the updated avatar model). The second communication device 200 may update the avatar model and save the updated avatar model.
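• A minimal sketch of this caching and threshold comparison at the second communication device 200 is given below; the per-vertex distance metric and the default threshold value are assumptions for the example:

```python
import math
from typing import List, Tuple

Vertex = Tuple[float, float, float]

def model_difference(saved: List[Vertex], updated: List[Vertex]) -> float:
    """Mean per-vertex displacement between two avatar meshes of equal topology."""
    return sum(math.dist(a, b) for a, b in zip(saved, updated)) / max(len(saved), 1)

class AvatarCache:
    """Receiver-side cache: keep the saved model unless the update differs enough."""

    def __init__(self, threshold: float = 0.01):
        self.threshold = threshold
        self.saved_model: List[Vertex] = []

    def on_model_received(self, model: List[Vertex]) -> None:
        if not self.saved_model or model_difference(self.saved_model, model) > self.threshold:
            # Replace the previously saved avatar model by the updated one.
            self.saved_model = model
```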
  • the second communication device 200 may decode and/or process the second stream of the immersive communication session.
  • the second communication device 200 may render the graphical representation of the avatar model of the participant of the immersive communication session.
  • the second communication device 200 may be in communication with the network entity 300.
  • Fig. 3 schematically illustrates a block diagram of an embodiment of a network entity for supporting an immersive communication session between two or more communication devices (e.g., a first communication device 100 and a second communication device 200).
  • the network entity is generically referred to by reference sign 300.
  • the network entity 300 may also be referred to as network or cloud.
  • the network entity 300 may enable or initiate an immersive communication session between the communication devices 100/200.
  • the communication session may be triggered by the first or second communication devices 100/200, e.g., the one which communicates with the network entity 300.
  • the establishing may comprise receiving an establishing request from the respective one of the communication devices 100/200.
  • the network entity 300 may establish the immersive communication session between the communication devices 100/200.
  • the network entity 300 comprises a first stream module 302.
  • the first stream module 302 may receive a first stream in the immersive communication session from at least one first communication device 100 among the communication devices 100/200.
• the first stream comprises a real-time 3D visual representation of a participant of the immersive communication session.
  • the first stream may also be referred to as 3D visual representation.
  • the first stream module 302 may send the first stream of the immersive communication session received from the first communication device 100 to the at least one second communication device 200.
  • the network entity 300 may send the real-time 3D visual representation or real-time 3D immersive media rendered based on the real-time 3D visual representation to at least one second communication device 200 of the communication devices 100/200.
  • the network entity 300 may decode and/or process the first stream.
  • the network entity 300 may further encode the first stream before sending to the second communication device 200.
  • the network entity 300 further comprises an avatar model generation module 308.
  • the avatar model generation module 308 may generate an avatar model of the participant of the communication session based on the 3D visual representation received from the at least one first communication device 100.
  • the avatar model generation module 308 may send the generated avatar model to the at least one second communication device 200. Alternatively or in addition, the avatar model generation module 308 may save the generated avatar model.
  • the avatar model generation module 308 may regenerate the avatar model periodically and/or after re-establishing the immersive communication session.
  • the avatar model generation module 308 may compare the regenerated avatar model with the previously saved avatar model.
  • the module 308 may update the avatar model.
  • the avatar model generation module 308 may send the updated (e.g., re-generated) avatar model to the at least one second communication device 200.
  • the change (e.g., the change in the 3D visual representation being greater than a predefined threshold) in appearance may be detected by means of a filter.
  • the filter may be applied to the first stream of the immersive communication session.
  • the filter may be invariant under rotation and/or translation of the participant.
  • the filter may be invariant under facial expressions or any motion that corresponds to a change in the avatar parameters.
  • the network entity 300 may monitor the 3D visual representation of the received first stream for a change in the avatar model of the participant of the immersive communication session.
  • the change may be a significant change in appearance.
  • the filter may output a scalar or a vector that is indicative of the change.
  • the change may trigger the updating of the avatar model if the scalar is greater than a predefined threshold or if a magnitude of the vector is greater than a predefined threshold or if a scalar product between the vector of the change and a predefined reference vector is greater than a predefined threshold.
  • the change in appearance may be a change in outfit, hairstyle, accessories, glasses, etc.
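• The trigger conditions described above may, purely as an illustration, be evaluated as in the following sketch; the filter itself is assumed to be given, and only the scalar, magnitude, and scalar-product tests are shown:

```python
import math
from typing import Optional, Sequence, Union

def update_triggered(change: Union[float, Sequence[float]],
                     threshold: float = 1.0,
                     reference: Optional[Sequence[float]] = None) -> bool:
    """Evaluate the trigger conditions for updating the avatar model.

    `change` is the scalar or vector output of the appearance filter. The update
    is triggered if the scalar, the vector magnitude, or the scalar product with
    a predefined reference vector exceeds the predefined threshold.
    """
    if isinstance(change, (int, float)):
        return change > threshold
    if reference is not None:
        return sum(c * r for c, r in zip(change, reference)) > threshold
    return math.hypot(*change) > threshold

# Example: a vector-valued change compared against a magnitude threshold.
trigger = update_triggered([0.8, 0.9], threshold=1.0)
```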
• the network entity 300, in response to a shortage of resources for the first stream, may send a control message to the at least one first communication device 100.
  • the control message may be indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D CGI of the participant.
  • the switching from the first stream to the second stream can mitigate the shortage of the resources for the first stream.
  • the avatar model may be generated by a or the data media application server, AS, of the network entity 300.
• the shortage of resources triggering the switching may be determined by a or the control media application function, AF, of the network entity 300.
  • the control message sent to the at least one first communication device 100 may be indicative of handing over control from a or the control media AF at the network entity 300 to a media session handler at the respective first communication device 100.
  • the shortage of resources for the first stream may be due to a reduction of (e.g., available) resources at the network entity 300 and/or an increase of a resource requirement of the first stream.
  • the shortage of resources may occur anywhere along a communication path of the first stream.
  • the shortage of resources may comprise shortage of radio resources of a radio access network (RAN) providing radio access to the at least one first communication device 100 and/or the at least one second communication device 200.
  • the shortage of resources may further comprise a shortage of transport capacity of the network entity, and/or shortage of computational resources for rendering of the real-time 3D immersive media based on the first stream.
  • the resources for the first stream may comprise at least one of computational resources for processing the first stream and radio resources for a radio communication of the immersive communication session.
  • the transport capacity may correspond to a (e.g., maximum) bit rate provided by the network entity.
  • the switching is triggered if a bit rate required by the first stream is greater than the maximum bit rate of the network entity.
  • the shortage of resources at the network entity may comprise a shortage of computation resources for the rendering of the real-time data.
  • shortage of resources may include shortage of computational resources for rendering a scene composed of the multiple real-time 3D visual representations of the multiple participants.
  • the shortage of the resources may be a shortage of computational resources for composing multiple real-time 3D visual representations of multiple participants of the immersive communication received from the multiple first communication devices 100.
  • the composing may comprise arranging the multiple real-time 3D visual representations in the same 3D space.
  • the shortage of resources at the network entity comprises a shortage of computation resources for the rendering of the real-time data.
  • the shortage of resources is determined based on network state information at the network entity 300.
  • the network state information may be received periodically and/or event-driven.
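• A minimal sketch of such a shortage determination at the network entity 300 is shown below; the field names of the network state information and the bit-rate comparison are assumptions for the example:

```python
# Field names of the network state information ("max_bitrate_bps",
# "free_rendering_cores") are assumptions for this sketch.

def shortage_of_resources(network_state: dict, first_stream_bitrate_bps: float) -> bool:
    max_bitrate = network_state.get("max_bitrate_bps", float("inf"))
    free_compute = network_state.get("free_rendering_cores", 1)
    # Shortage if the required bit rate exceeds the transport capacity or if no
    # computational resources remain for rendering/composing the 3D scene.
    return first_stream_bitrate_bps > max_bitrate or free_compute <= 0

def maybe_switch_to_avatar(network_state: dict,
                           first_stream_bitrate_bps: float,
                           send_control_message) -> None:
    if shortage_of_resources(network_state, first_stream_bitrate_bps):
        send_control_message({"action": "switch_to_avatar"})

# Example: the first stream needs 45 Mbit/s but only 20 Mbit/s are available.
maybe_switch_to_avatar({"max_bitrate_bps": 20e6}, 45e6, print)
```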
  • the network entity 300 may optionally send a control message to the second communication device 200 in response to the shortage of resources for the first stream.
  • the control message may be indicative of the switching from the first stream (e.g., the real-time 3D visual representation of the participant within the immersive communication session) to the second stream of the immersive communication session (e.g., avatar model for the real-time 3D CGI of the participant).
• the control message sent to the at least one second communication device 200 may be further indicative of at least one of a mixed reality run-time engine for pose correction and/or rendering of the real-time 3D visual representation of the participant to be deactivated, a mixed reality scene manager for pose correction and/or rendering of the real-time 3D CGI based on the generated avatar model and the real-time values of the avatar parameters to be activated, handing over control from a or the control media AF at the network entity 300 to a media session handler at the respective second communication device 200, sending a notification from a or the media session handler at the respective second communication device 200 to a mixed reality application at the respective second communication device 200, switching a mixed reality run-time engine at the respective second communication device 200 from rendering the real-time 3D visual representation to rendering the real-time 3D CGI based on the avatar model and the real-time values of the avatar parameters, and switching one or more media access functions at the respective second communication device 200 from an immersive media decoder decoding the 3D visual representation to a motion sensor decoder decoding the real-time values of the avatar parameters.
  • the control message may be sent to a or the media session handler of the at least one first communication device 100 and/or a or the media session handler of the at least one second communication device 200.
  • the network entity 300 comprises a second stream module 318.
  • the second stream module 318 may receive a second stream of the immersive media from the first communication device 100.
  • the second stream module 318 may send the second stream of the immersive communication session to the second communication device 200.
  • the network entity 300 may decode and/or process the second stream.
  • the network entity 300 may further encode the second stream before sending to the second communication device 200.
  • the second stream may comprise real-time values of avatar parameters controlling motions of the avatar model of the participant.
  • the second stream may also be referred to as an avatar-controlling stream.
  • the 3D visual representation may be a generated 3D visual representation.
  • the first stream and the second stream may be received from multiple first communication devices 100.
  • the participants represented by the respective streams may be arranged in the same 3D space.
  • the first stream and the second stream may be sent from the network entity 300 to multiple communication devices 200.
• the first stream, as well as the second stream in combination with the avatar model, may enable the participant to be displayed (e.g., within the 3D space) immersively.
  • the expression "immersive media” may be an umbrella term that encompasses the 3D visual representation of a participant comprised in the first stream and/or the 3D CGI resulting from the avatar model in combination with the values of the avatar model comprised in the second stream.
  • each of the first stream and the second stream may be referred to as immersive media after providing the at least one second communication device with the avatar model.
• each of the first stream and the second stream may further comprise immersive audio.
  • the immersive audio may be unchanged during the switching.
  • the control message sent to the at least one first communication device 100 may trigger one or more sensors at the at least one first communication device 100 to capture the real-time values of the avatar parameters.
  • the control message sent to the at least one second communication device 200 triggers rendering the real-time 3D CGI of the participant based on the generated avatar model and the real-time values of the avatar parameters.
  • the network entity 300 may comprise, or be part of, at least one of a RAN, optionally wherein the RAN provides radio access to at least one of the communication devices, a network node of a RAN, optionally wherein the network node serves at least one of the communication devices, a core network, CN, optionally wherein the CN transports the first and second streams between the communication devices and/or performs mobility management for the communication devices, a local area network (LAN), a distributed network for edge computing, and a computing center.
  • Fig. 4 schematically illustrates a block diagram of an embodiment of a system for supporting an immersive communication session.
  • the system is generically referred to by reference sign 700.
  • the system for supporting an immersive communication session 700 may also be referred to as system.
• the system 700 comprises at least one first communication device 100 for sending an immersive communication session according to Fig. 1, at least one second communication device 200 for receiving the immersive communication session according to Fig. 2, and a network entity 300 according to Fig. 3 for supporting an immersive communication session between the communication devices 100 and 200.
• the first communication device 100 may send a first stream of the immersive communication session (e.g., a photographic media stream or 3D visual representation) to the at least one second communication device 200 through the network entity 300.
• the first communication device 100 may send a second stream of the immersive communication session (e.g., an avatar-controlling stream or a generated 3D visual representation) to the at least one second communication device 200 through the network entity 300.
  • the network entity 300 may generate an avatar model of a participant of the communication session based on the 3D visual representation received from the at least one first communication device 100.
  • the network entity 300 may send the generated avatar model to the at least one second communication device 200.
• the network entity 300, in response to a shortage of resources for the first stream, may send a control message to the at least one first communication device 100.
  • the control message may be indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D CGI of the participant.
• the second communication device 200 may receive the first stream of the immersive communication session (e.g., the photographic media stream or 3D visual representation) from the at least one first communication device 100 through the network entity 300.
• the second communication device 200 may receive the second stream of the immersive communication session (e.g., the avatar-controlling stream or generated 3D visual representation) from the at least one first communication device 100 through the network entity 300.
• the second communication device 200 may use the first stream and/or the second stream with the received avatar model to render the immersive communication session.
• Fig. 5 shows an example flowchart for a method 400 performed by a first communication device 100 in an immersive communication session.
  • the first communication device 100 may send the first stream in the immersive communication session to the at least one second communication device 200 through a network entity 300.
• the first stream may comprise a real-time 3D visual representation of a participant of the immersive communication session.
  • the real-time 3D visual representation may enable the network entity 300 to generate an avatar model of the participant and to send the generated avatar model to the at least one second communication device 200.
• in step 404, the first communication device 100 may receive a control message from the network entity 300.
  • the control message may be indicative of switching 406 the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D CGI of the participant
  • the first communication device 100 may send a second stream in the immersive communication session to the at least one second communication device through the network entity.
• the second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant.
  • the first communication device 100 may encode the first stream and the second stream in the immersive communication session before sending the first stream and the second stream to the at least one second communication device 200 through the network entity 300.
  • the method 400 may be performed by the first communication device 100.
  • the module 102 may perform the step 402
  • the module 104 may perform the step 404 (including the implicit step 406)
  • the module 106 may perform the step 408.
  • Fig. 6 shows an example flowchart for a method 500 performed by a second communication device 200 in an immersive communication session.
  • the second communication device 200 may receive a first stream in the immersive communication session from the at least one first communication device 100 through a network entity 300.
  • the first stream comprises a real-time 3D visual representation of a participant of the immersive communication session
  • the second communication device 200 may render the 3D visual representation of the participant at the second device.
  • the second communication device 200 may receive an avatar model of the participant of the immersive communication session from the network entity 300.
  • the avatar model may be generated based on the 3D visual representation of the participant for a real-time 3D CGI of the participant.
  • the second communication device 200 may save the avatar model received from the network entity 300.
  • the second communication device 200 may cache and/or update the received avatar model.
  • the second communication device 200 may receive a control message indicative of switching from the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant.
  • the second communication device 200 may receive a second stream in the immersive communication session from the at least one first communication device 100 through the network entity 300.
• the second stream may comprise real-time values of avatar parameters controlling motions of the avatar model of the participant.
• the second communication device 200 may decode and/or process the received 502 first stream and/or the received 510 second stream of the immersive communication session.
  • the second communication device 200 may render the CGI of the participant based on the received 506 avatar model and the received 510 real-time values of the avatar parameters at the second device 200.
  • the method 500 may be performed by the second communication device 200.
  • the modules 202, 204 and 206 may perform the steps 502 to 514.
• Fig. 7 shows an example flowchart for a method 600 performed by a network entity 300 for supporting an immersive communication session between communication devices.
  • the communication devices may comprise at least one embodiment of the communication device 100 and/or at least one embodiment of the communication device 200. Furthermore, at least one or each of the communication devices may embody the functionality of both the communication device 100 and the communication device 200, for example for a participant that is also a viewer of the scene including other participants.
  • the network entity 300 may establish the immersive communication session between the communication devices 100 and/or 200.
  • the network entity 300 may receive a first stream in the immersive communication session from at least one first communication device 100 among the communication devices 100 and/or 200.
• the first stream comprises a real-time 3D visual representation of a participant of the immersive communication session.
  • the network entity 300 may send 606 the real-time 3D visual representation or real-time 3D immersive media rendered based on the real-time 3D visual representation to at least one second communication device 200 of the communication devices 100 and/or 200.
  • the network entity 300 may generate 608 an avatar model of a participant of the immersive communication session based on the 3D visual representation received 604 from the at least one first communication device 100.
  • the network entity 300 may send the generated 608 avatar model to the at least one second communication device 200.
• the network entity 300, in response to a shortage of resources for the first stream, may send a control message to the at least one first communication device 100.
  • the control message may be indicative of switching (e.g., switching step 614) the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D CGI of the participant.
  • the network entity 300 may save the generated 608 avatar model.
  • the network entity 300 may also monitor the 3D visual representation of the received 604 first stream for a change in the avatar model of the participant of the immersive communication session.
  • the network entity 300 may receive a second stream in the immersive communication session from the at least one first communication device 100.
  • the second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant.
  • the network entity 300 may send 620 the real-time values to the at least one second communication device 200.
  • the network entity 300 may re-generate 622 the avatar model of the participant of the immersive communication session and/or update 624 the generated and saved avatar model of the participant of the immersive communication session.
  • the network entity 300 may send 610 the re-generated or updated avatar model to the at least one second communication device 200.
  • the avatar model may be re-generated or updated after re-establishing the immersive communication session, and/or wherein the avatar model may be regenerated 622 or updated 624 responsive to a change in the 3D visual representation being greater than a predefined threshold, and/or wherein the regenerated 622 or updated 624 avatar model may be sent responsive to a change in the avatar model being greater than a predefined threshold.
  • the method 600 may be performed by the network entity 300.
  • the modules 302 to 320 may perform the steps 602 to 624.
  • the technique may be applied to uplink (UL), downlink (DL) or direct communications between radio devices as the communication devices, e.g., using device-to-device (D2D) communications or sidelink (SL) communications.
  • Each of the communication devices 100 and/or 200 may be a radio device.
  • any radio device may be a mobile or portable station and/or any radio device wirelessly connectable to a base station or RAN, or to another radio device.
• the radio device may be a user equipment (UE) or a device for machine-type communication (MTC).
  • Two or more radio devices may be configured to wirelessly connect to each other, e.g., in an ad hoc radio network or via a 3GPP SL connection.
  • any base station may be a station providing radio access, may be part of a radio access network (RAN) and/or may be a node connected to the RAN for controlling the radio access.
  • the base station may be an access point, for example a Wi-Fi access point.
  • the 3D visual representation (i.e., 3D contents or AR contents) may be represented using meshes.
  • a polygon mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object in 3D computer graphics and solid modeling, e.g., according to the 3GPP document TR 26.928, version 16.1.0.
  • the faces usually comprise triangles (triangle mesh), quadrilaterals (quads), or other simple convex polygons ("n-gons"), since this simplifies rendering, but may also be more generally composed of concave polygons, or even polygons with holes.
  • Meshes can be rendered directly on graphics processing units (GPUs) that are highly optimized for mesh-based rendering.
  • the formats for 2D and 3D contents differ in that still image formats may be used for 2D media.
  • the 2D media may have metadata for each image or for a sequence of images.
  • pose information describes the rendering parameter of one image.
  • the frame rate or timestamp of each image are typically valid for a sequence of such images.
• 3D meshes and point clouds consist of thousands or millions of primitives such as vertices, edges, faces, attributes, and textures. Primitives are the basic elements of any volumetric presentation.
  • a vertex is a point in volumetric space, and contains position information, e.g., in terms of three axes in a Cartesian coordinate system.
  • a vertex may have one or more attributes. Color and reflectance are typical examples of attributes.
  • An edge is a line between two vertices.
• a face is a triangle or a rectangle formed by three or four vertices. The area of a face is filled with color interpolated from the vertex attributes or taken from textures.
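• For illustration only, a minimal in-memory representation of such primitives could look as follows; the concrete attribute set (here an RGB color per vertex) is an assumption:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Vertex:
    position: Tuple[float, float, float]           # x, y, z in a Cartesian coordinate system
    color: Tuple[int, int, int] = (255, 255, 255)  # optional attribute, e.g. RGB

@dataclass
class Mesh:
    vertices: List[Vertex] = field(default_factory=list)
    faces: List[Tuple[int, int, int]] = field(default_factory=list)  # triangles as vertex indices

# A single triangle as a minimal example of a volumetric primitive:
mesh = Mesh(
    vertices=[Vertex((0.0, 0.0, 0.0)), Vertex((1.0, 0.0, 0.0)), Vertex((0.0, 1.0, 0.0))],
    faces=[(0, 1, 2)],
)
```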
• the avatar models (i.e., animated avatars) may be used instead, e.g., according to the step 404 or 406, or the step 508, or the step 612 or 614.
• the animated avatars may be computer-based approaches that generate an avatar representation of people by creating a 3D model of a person, often using machine learning (e.g., a photo-realistic avatar, A. Richard et al.: "Audio- and Gaze-driven Facial Animation of Codec Avatars", arXiv:2008.05023v1 (2020)).
  • Avatar approaches are based on data (e.g., captured live) from an XR device (e.g., a mobile phone) including 2D images, microphone signals and sensor information from the capturing device.
  • a computer-generated avatar model of a person can be created and motions including facial expressions can be overlaid by means of the avatar model.
• artificial intelligence (AI) may use captured features, such as eye gaze or headset sensor information, so that motions including facial expressions can be overlaid by means of the avatar model.
• Meta, Fraunhofer, Spatial.io, ARcall.com, and Meetinvr.com are working on avatar-based communication.
  • Any embodiment may use AR and/or emerging media formats and/or a 5G AR device architecture, e.g., as defined by the Technical Specification Group for Service and System Aspects SA4 at 3GPP for study in the 3GPP document TR 26.998, version 1.1.1.
  • Table 1 lists AR conversational use-cases that are considered in the study.
  • the use-cases have different requirements on the network entity such as using different formats: 360 video, 3D immersive (e.g., point clouds, meshes), 2D images, and motion sensor data for animated avatars. This will have impacts on the procedures and end-to-end architecture.
  • the subject technique may be implemented using the above-mentioned type "Real-time 3D Communication" or an extension thereof.
• Both a standalone architecture and an edge-assisted and/or cloud-assisted architecture for the AR devices are defined in the study to support different use-cases and XR applications.
• the avatar controlled by the second stream is less demanding on the network entity resources.
• it may still provide a real and immersive experience, as can be achieved by real-time capturing and point clouds, by virtue of the session persistence.
• Network-assisted DASH is based on the concept of creating multiple representations of the same content at different bitrates, which is challenging in terms of computational complexity and transmission requirements when generating point clouds and 3D content, in particular for real-time communications. In contrast, the switching according to the subject technique can achieve session persistence under strongly varying resource availability (e.g., a varying link quality) or resource demands.
  • Fig. 8 shows an example of the system for supporting an immersive communication session.
• the first communication device 100 (e.g., a user equipment or a radio device) may be in a coverage cell 303 of a network node 301 (e.g., a gNB).
  • the second communication device 200 may be in another coverage cell 303 of another network node 301 (e.g., a gNB).
  • the first communication device 100 may send the first stream and/or the second stream of the immersive communication session via the radio network (e.g., through the network nodes 301) to the second communication device 200.
  • the network entity 300 may be part of the network node 301 and/or a network core (not shown here).
• the first communication device 100 may be configured to perform any one of the steps of the method 400 described in Figs. 1 and 5.
  • the network entity 300 may be configured to perform the steps of the method 600 described in Figs. 3 and 7.
  • the second communication device 200 may receive the first stream and/or the second stream of the immersive communication session via the radio network (e.g., through the network nodes 301) from the first communication device 100.
  • the second communication device 200 may be configured to perform the steps of the method 500 described in Figs. 2 and 6.
  • Fig. 9 shows an example of a standalone architecture (e.g., a fully virtualized, cloud-native architecture (CNA) that introduces new ways to develop, deploy, and manage services), referred to as 5G STandalone AR (STAR) user equipment (UE) (e.g., embodying any one of the communication devices 100 and/or 200) based on which the technique may be implemented.
  • Fig. 9 provides a basic extension of 5G Media streaming for immersive media communications using a STAR UE, when all essential AR/MR functions in a UE are available for typical media processing use cases. In addition to media delivery, also scene description data delivery is included.
  • Any aspect of the technique may use at least one of the following features:
  • the AR runtime includes the vision engine schematically shown in block 1 within AR Runtime.
• the vision engine performs processing for AR-related localization, mapping, 6DoF pose generation, object detection, etc., i.e., simultaneous localization and mapping (SLAM), object tracking, and media data objects.
• the main purpose of the vision engine is to "register" the device, i.e., the different sets of data from the real and virtual world are transformed into a single world coordinate system.
• the AR Scene Manager includes the Scene graph handler schematically shown in block 1, the compositor schematically shown in block 2, and the immersive media renderer (e.g., visual and audio media renderer) schematically shown in blocks 3 and 4 within the AR Scene Manager.
• the AR Scene Manager enables generation of one (monoscopic displays) or two (stereoscopic displays) eye buffers from the visual content, typically using GPUs.
• the Media Access Function includes the Media Session Handler that connects to, for example, 5G System network functions, typically in order to support the delivery and quality of service (QoS) requirements for the media delivery. This may include prioritization, QoS requests, edge capability discovery, etc.
  • the Media Access Function further includes 2D codecs, immersive media decoders, scene description delivery, and content delivery schematically shown in blocks 1 to 4, respectively.
  • Media Access Functions are divided in control on M5 (Media Session Handler and Media AF) and user data on M4 (Media Client and Media Application Server).
  • Service Announcement is triggered by AR/MR Application.
  • Service Access Information including Media Client entry or a reference to the Service Access Information is provided through the M8d interface.
  • Fig. 10 shows an example of a cloud-assisted and/or edge-assisted device architecture, referred to as 5G EDGe-Dependent AR (EDGAR) UE based on which the technique may be implemented.
  • Fig. 10 provides a basic extension of 5G Media Streaming communication for immersive media using an EDGAR UE as the communication device 100 or 200.
  • the edge will pre-render the media based on pose and interaction information received from the 5G EDGAR UE.
  • the 5G EDGAR UE may consume the same media assets from an immersive media server as the STAR UE according to Fig. 9, but the communication of the edge server to this immersive server is outside of the considered 5G Media Streaming architecture.
  • the architecture splits the AR runtime, AR scene manager, and media access functions between the device and a cloud.
  • the functionalities for immersive media decoding and rendering can be split between the AR device and a cloud.
• the AR Runtime of the 5G EDGAR UE includes the vision engine/SLAM, pose correction and sound field mapping schematically shown in blocks 1 to 3, respectively.
• the lightweight scene manager of the 5G EDGAR UE includes the basic scene graph handler and compositor schematically shown in blocks 1 to 2, respectively.
  • the media client of the media access function includes scene description delivery, content delivery and basic codecs schematically shown in blocks 1 to 3, respectively.
  • the media AS of the media delivery functions includes scene description, decoders, encoders and content delivery schematically shown in blocks 1 to 4, respectively.
  • the AR/MR application includes AR scene manager and AR functions, semantical perception, social integration, and media assets schematically shown in blocks 1 to 4, respectively.
  • the AR scene manager of the AR/MR application includes scene graph generator, immersive visual render and immersive audio render schematically shown in blocks 1 to 3, respectively.
  • Any aspect of the technique may implement at least one step of the following workflow for the first stream (e.g., for AR conversational services) and/or as explained in clause 6.5.4 of the 3GPP document TR 26.998, version 1.1.1. After the signaling procedure to setup a session, the at least one of the following steps may be applied (e.g., focused on data plane):
• the STAR UE 100 processes the immersive media to be transmitted:
  • the AR runtime function captures and processes the immersive media to be sent.
  • the AR runtime function passes the immersive media data to the AR-MTSI client.
  • the AR-MTSI client encodes the immersive media to be sent to the called party's STAR UE 200.
  • the STAR UE 100 and/or 200 has an AR call established with AR media traffic.
  • the STAR UE 200 processes the received immersive media.
• the AR-MTSI client decodes and processes the received immersive media.
  • the AR-MTSI client passes the immersive media data to the Scene Manager.
  • the Scene Manager renders the immersive media, which includes the registration of the AR content into the real world accordingly.
  • the capturing may be done by an external device (e.g., camera).
  • the processing and encoding may be done outside STAR UE (i.e., AR-MTSI client).
  • Fig. 11 shows a procedure for immersive communication for AR conversational services, based on which the subject technique may be implemented, e.g., based on or as an extension of clause 6.6.4 of the 3GPP document TR 26.998, version 1.1.1.
  • Fig. 11 describes a call flow for shared AR conversational services, wherein an edge server or cloud is creating a combined scene from multiple users (e.g., in a virtual conference room).
  • the immersive media processing function on the cloud/network receives the uplink streams from various devices (e.g., at least one first communication device 100) and composes a scene description defining the arrangement of individual participants in a single virtual conference room.
  • the scene description as well as the encoded media streams are delivered to each receiving participant (e.g., second communication device 200).
  • a receiving participant's 5G STAR UE receives, decodes, and processes the 3D video and audio streams, and renders them using the received scene description and the information received from its AR Runtime, creating an AR scene of the virtual conference room with all other participants.
• Whenever referring to noise or a signal-to-noise ratio (SNR), a corresponding step, feature or effect is also disclosed for noise and/or interference or a signal-to-interference-and-noise ratio (SINR).
  • Figs 12a and 12b show an example of a call flow for a shared AR conversational experience for a receiving EDGAR UE 200.
  • An embodiment of any aspect may implement at least one of the following procedure steps:
  • Session Establishment (between UEs 100 and 200 and Cloud 300): a) The AR/MR Application requests to start a session through EDGE. b) The EDGE negotiates with the Scene Composite Generator (SCG) and the sender UE to establish the session. c) The EDGE acknowledges the session establishment to the UE.
  • Loops 4, 5, 6, and 7 are run in parallel (order not relevant):
• AR uplink loop (the sent data is for spatial compute and localization of users and objects): a) The AR Runtime sends the AR data to the AR/MR Application. b) The AR/MR Application processes the data and sends it to the MAF. c) The MAF streams up the AR data to the EDGE.
• Shared experience loop (combining immersive media streams and spatial compute data in the Cloud): a) Parallel to 9, the sender UE streams its media streams up to Media Delivery (MD). b) The sender UE streams its AR data up to the Scene Graph Compositor (SGC). c) Using the AR data from various participants, the SCG creates the composited scene. d) The composited scene is delivered to the EDGE.
  • the media streams are delivered to the EDGE.
  • Media uplink loop (how uplink media data is provided to the cloud): a) The AR Runtime captures the media components and processes them. b) The AR Runtime sends the media data to the MAF. c) The MAF encodes the media. d) The MAF streams up the media streams to the EDGE.
  • Media downlink loop (data streaming from the cloud and rendering on AR device): a) The EDGE parses the scene description and media components, partially renders the scene, and creates a simple scene description as well as the media component. b) The simplified scene is delivered to the Media Client and Scene Manager.
  • a media presentation description describes segment information (timing, URL, media characteristics like video resolution and bit rates), and can be organized in different ways such as SegmentList, SegmentTemplate, SegmentBase and SegmentTimeline, depending on the use case.
• the avatar of existing solutions is not based on a live representation of the user, i.e., there is no live capture or facial landmarks, nor control of the avatar. Furthermore, there is no system architecture or signaling description of how such an approach would work for real-time AR communications.
• a system and method may switch within an immersive communication session from a live (e.g., real-time) holographic capture (e.g., the first stream) to avatars animated (e.g., via the second stream) according to user facial landmarks, e.g., by means of the communication signals, interfaces and/or XR device realizations described herein.
  • Fig. 13 shows a generic example of an immersive communication session (e.g., a holographic AR communications) between communication devices 100 and/or 200 according to the method 600.
• the first communication device 100 may capture real-time 3D data (e.g., a photographic representation or a real-time 3D visual representation) of a participant by itself or by means of connected devices (e.g., 3D cameras).
• the 3D data may be represented in the form of point clouds, meshes, color plus depth, or any other type of 3D data.
  • the first communication device 100 (e.g., transmitter device) may process the captured real-time 3D data.
  • the first communication device 100 may encode the captured real-time 3D data.
  • the first communication device 100 may send 402 a first stream in the immersive communication session to the second communication device 200 through the network entity 300.
  • the first stream comprises a real-time 3D visual representation (e.g., real-time 3D data or the encoded real-time 3D data) of a participant of an immersive communication session.
  • the network entity 300 may receive 604 the first stream in the immersive communication session from the first communication device 100.
  • the network entity 300 may generate 608 an avatar model of a participant of the immersive communication session based on the 3D visual representation received 604 from the first communication device 100.
• the avatar model may refer to a generated digital representation of a participant (known data formats include, e.g., the ".obj" format, FBX, etc.).
• the avatar model generation may use artificial intelligence (AI) or machine learning (ML) algorithms pre-trained on human datasets.
  • the network entity 300 may send 606 the real-time 3D visual representation or the real-time 3D immersive media rendered based on the real-time 3D visual representation to the second communication device 200, simultaneously with generating the avatar model (e.g., generating the avatar model does not interrupt the transmission of the first stream).
  • the second communication device 200 may receive 502 the first stream in the immersive communication session from the first communication device 100 through the network entity 300.
  • the second communication device 200 may decode 512 and process the received 502 first stream of the immersive communication session.
  • the second communication device 200 may render the received first stream and display an immersive communication session for the participant of the immersive communication session.
  • the second communication device 200 may further comprise a device or be connected to an external device (e.g., 3D glasses) for rendering an immersive communication session.
• Fig. 14 shows that the network entity 300 (e.g., cloud or network) of Fig. 13, after completing the generation of the avatar model, may send 610 the generated avatar model to the second communication device 200 (e.g., the receiver device) simultaneously with sending 606 the first stream (e.g., sending the avatar model does not interrupt the transmission of the first stream).
• the second communication device 200 may receive 506 the avatar model simultaneously with receiving 502 the first stream in the immersive communication session.
• the second communication device 200 may cache the avatar model while receiving it and save the received avatar model.
• the pipeline may continue to capture (e.g., at the first communication device), transmit (e.g., via the first communication device and/or the network entity) and render the 3D live stream (e.g., the real-time 3D visual representation of the participant) at the second communication device 200.
• Fig. 15 shows that the network entity 300 of any of Figs. 13 and 14 may re-generate 622 the avatar model in a predefined period of time based on the 3D visual representation received 604 from the at least one first communication device 100.
  • the network entity 300 may re-generate 622 the avatar model after re-establishing the communication session.
  • the network entity 300 may regenerate 622 the avatar model of a participant of the immersive communication session in response to a change detected in the participant by an image recognition module.
• the re-generating 622 of the avatar model of the participant may be performed simultaneously with the sending 606 of the first stream of the immersive communication session (e.g., re-generating the avatar model does not interrupt the transmission of the first stream).
  • the change (e.g., the change in the 3D visual representation being greater than a predefined threshold) may be detected by means of a filter.
  • the filter may be applied to the first stream.
  • the filter may be invariant under rotation and/or translation of the participant.
• the filter may be invariant under facial expressions and/or motions that correspond to a change in the avatar parameters.
• the change may be a significant change in appearance (e.g., a haircut, plastic surgery, glasses, a change in clothes, etc.).
  • the filter may output a scalar or a vector that is indicative of the change.
  • the change may trigger the updating of the avatar model if the scalar is greater than a predefined threshold or if a magnitude of the vector is greater than a predefined threshold or if a scalar product between the vector of the change and a predefined reference vector is greater than a predefined threshold.
• the value of the threshold may be based on perceptible changes, which can be ascertained using just-noticeable-difference (JND) type experiments.
  • the avatar model of the participant may be re-generated.
• the avatar model may be completely re-generated or incrementally re-generated based on the existing avatar model (e.g., by modifying the existing avatar model).
• the re-generated 622 avatar model of the participant may be sent to the second communication device simultaneously with sending 606 the first stream of the immersive communication session (e.g., sending the re-generated avatar model does not interrupt the transmission of the first stream at the receiving second communication device 200).
• the data packets comprising the avatar model and the data packets comprising the first stream may be sent on different channels (e.g., "simultaneously" at the level of data packets) or alternatingly.
  • the second communication device 200 may cache and save the received regenerated avatar model.
  • the second communication device 200 may save the regenerated avatar model.
  • the second communication device 200 may replace the generated avatar model with the re-generated avatar model (e.g., update the avatar model).
• Fig. 16 shows the network entity 300 reacting to a shortage of resources for the first stream (e.g., resource allocation, fairness, quality drops), or to the avatar model having reached a certain predefined quality and being cached at the second communication device 200.
  • the network entity 300 may send 612 a control message to the first communication device 100.
  • the control message may be indicative of switching 614 the real-time 3D visual representation of the participant (e.g., first stream) within the immersive communication session to the avatar model (e.g., the second stream) for a real-time 3D CGI of the participant.
  • the network entity 300 may send a message indicative of instructions to the first communication device 100 (e.g., capturing device or sender device).
  • the first communication device 100 may switch to capturing sensor information and detecting facial landmarks (e.g., avatar parameters) and send them (e.g., as the second stream) to the network entity 300 instead of capturing or transmitting the real-time 3D data (e.g., first stream).
  • upon receiving the second stream from the network entity 300, the second communication device 200 may switch to rendering the avatar model and animating it based on the avatar parameters received in the second stream.
  • the network entity 300 may further send a message indicative of instructions to the second communication device 200 (e.g., receiver device).
  • the second communication device 200 may switch to render the avatar model and animate the avatar model based on the received avatar parameters from the second stream.
  • the network entity 300 may receive 618 the second stream in the immersive communication session from the first communication device 100.
  • the second stream may comprise real-time values of avatar parameters controlling motions of the avatar model of the participant.
  • the network entity 300 may send 620 the second stream in the immersive communication session to the second communication device 200.
  • the second communication device 200 may render the avatar model using the avatar parameters (e.g., sensor data, facial landmarks) to display a 3D immersive communication.
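A minimal sketch, under assumed names, of the switching decision at the network entity 300 described above for Fig. 16; send_control_message stands in for transmitting the control message 612:

```python
from dataclasses import dataclass

@dataclass
class SessionState:
    resource_shortage: bool          # e.g., reported for the first stream
    avatar_quality: float            # quality of the generated avatar model
    avatar_cached_at_receiver: bool  # model already cached at device 200

def maybe_switch_to_avatar(state: SessionState, quality_threshold: float,
                           send_control_message) -> bool:
    """Send the switch instruction when the first stream can no longer be
    sustained, or when a sufficiently good avatar model is already cached."""
    if state.resource_shortage or (
            state.avatar_cached_at_receiver
            and state.avatar_quality >= quality_threshold):
        # control message 612: switch from the real-time 3D representation
        # (first stream) to the avatar model driven by avatar parameters
        # (second stream)
        send_control_message({"action": "switch_to_avatar"})
        return True
    return False
```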
  • Fig. 17 shows a data flow (including signaling flow) for avatar model generation in an immersive communication session.
  • the network entity 300 establishes the immersive communication session in steps 1 to 2.
  • Step 1 Call setup and control: Signaling, codec information, media formats for 3D communications.
  • Step 2 quality of service (QoS) requirements in terms of bitrate and latency for 3D communications.
  • Step 3 Real-time 3D data is captured from a camera or capturing sensors and provided to the AR run-time.
  • Step 4 real-time 3D data (e.g., first stream) is passed to the immersive media codec.
  • Step 5 Encoding (point cloud compression/video codec).
  • Step 6 Transmit compressed real-time 3D data (e.g., first stream) through the network entity.
  • Step 6a (first of additional steps compared to existing 3GPP workflows): Feed 3D stream to application server (AS) of the network entity for generating the avatar model. This step is repeated until the model is built.
  • Step 7 Decode real-time 3D data (e.g., first stream) (immersive media codec on the receiver side).
  • Step 8 real-time 3D data (e.g., first stream) is passed to scene manager for rendering.
  • Step 9 real-time 3D data (e.g., first stream) is rendered.
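As an illustrative sketch (assumed interfaces, not the disclosed implementation), steps 6 and 6a at the network entity could be combined per frame as follows, where avatar_generator exposes hypothetical add_frame()/is_built() methods:

```python
def relay_first_stream(frames, send_to_receiver, avatar_generator):
    """Forward the first stream (step 6) and feed it to the application
    server's avatar generation (step 6a) until the model is built."""
    for frame in frames:                        # compressed real-time 3D frames
        send_to_receiver(frame)                 # step 6: forward to device 200
        if not avatar_generator.is_built():
            avatar_generator.add_frame(frame)   # step 6a: repeated until built
```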
  • Fig. 18 shows the flow for avatar model caching in the immersive communication session.
  • Steps 6b and 6c After the avatar generation reaches a certain level of confidence (e.g., a pre-defined quality), the avatar model can be transmitted (Step 6b) to the second communication device 200 (e.g., receiving device) and cached (Step 6c) at the second communication device 200.
  • Steps 6a, 6b, and 6c are additional in comparison with the existing 3GPP workflows.
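Steps 6b and 6c could be sketched as follows; the confidence value, the send_to_receiver callback and the cache interface are assumptions for illustration:

```python
class AvatarCache:
    """Receiver-side cache for the avatar model (step 6c)."""
    def __init__(self):
        self._model = None

    def store(self, model_bytes: bytes) -> None:
        self._model = model_bytes        # replaces any previously cached model

    def retrieve(self) -> bytes:
        if self._model is None:
            raise LookupError("no avatar model cached yet")
        return self._model

def maybe_transmit_avatar(confidence: float, min_confidence: float,
                          model_bytes: bytes, send_to_receiver) -> bool:
    """Step 6b: transmit the model once generation reaches the required
    level of confidence (pre-defined quality)."""
    if confidence >= min_confidence:
        send_to_receiver(model_bytes)
        return True
    return False
```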
  • Fig. 19 shows the flow for the avatar model update in the immersive communication session.
  • In step 6a, the avatar model is re-generated based on the real-time 3D data to reflect changes in the participant.
  • the avatar model may be updated at regular intervals and the updated avatar model may be transmitted to the receiving device 200 in step 6b.
  • At least the steps 6a, 6b, and 6c, are additional in comparison with the existing 3GPP workflows.
  • Fig. 20 shows the flow for switching to the avatar model in the immersive communication session. This workflow has no corresponding existing 3GPP workflow.
  • Step 1 When the avatar model is generated and there is a need to switch (e.g., a trigger for the switching procedure is activated) to a lower bitrate (e.g., due to network changes, because the original bitrate allocated for the 3D stream is no longer available, or to fairly allocate resources across different streams), the control application function (AF) may decide to switch to avatar-based immersive communications.
  • Step 2 An instruction is sent 612 to the media session handler at the first communication device 100 and optionally to the second communication device 200 to switch to avatars.
  • Step 3 The AR application of the first communication device 100 and optionally of the second communication device 200 is notified about the change.
  • Step 4 Sensor and motion capture information, such as facial landmarks, is captured.
  • Step 5 The information is provided to the sensor and motion capture codec for compression.
  • Step 6 The sensor and motion capture information is compressed.
  • Step 7 The compressed information may be transmitted as the second stream to the second communication device 200 (e.g., receiving device).
  • Step 8 The receiving device 200 decodes the sensor and motion capture information (e.g., the second stream).
  • Step 9 The decoded information is provided to the scene manager for rendering.
  • Step 10 The scene manager retrieves the cached avatar model from the cache.
  • Step 11 The avatar is rendered based on the cached model and the motion information.
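Receiver-side steps 8 to 11 might be combined as in this sketch; the motion_codec, cache and scene_manager interfaces are hypothetical:

```python
def render_avatar_frame(packet, motion_codec, cache, scene_manager):
    """Decode one packet of the second stream and animate the cached avatar."""
    avatar_parameters = motion_codec.decode(packet)       # step 8: decode
    avatar_model = cache.retrieve()                       # step 10: cached model
    # steps 9 and 11: hand the decoded information to the scene manager,
    # which renders the avatar model animated by the motion information
    scene_manager.render_avatar(avatar_model, avatar_parameters)
```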
  • Fig. 21 shows an example of an immersive communication session (e.g., AR conferencing) and a device realization for an AR conferencing use case, which may be applied in any embodiment.
  • a system 700 for supporting an immersive communication session (e.g., real-time AR conferencing session) is shown in Fig. 21.
  • the user equipment (UE) A and UE B are two parties involved in a real-time AR conferencing session using immersive audio-visual communications.
  • UE A and UE B are located at different places and may produce immersive video using a first communication device 100 (e.g., a camera system embedded in a phone or glasses, or an external camera) and consume the received first and second streams on a second communication device 200 (e.g., AR glasses).
  • the two UEs agree on modes of communication during immersive communication session establishment, which include real-time 3D visual representation and real-time animated avatars of a participant of the immersive communication session.
  • the terms animated avatars and photo-realistic avatars are used interchangeably to refer to avatar representations based on augmenting information from the capturing side (e.g., the first communication device).
  • the network entity 300 may generate an avatar model during the real-time 3D communications as explained before.
  • the edge cloud control function (e.g., an application function) receives network information such as QoS bearer information, URLLC modes, or radio information, and determines when a switch is needed, also based on the maturity of the created avatar model.
  • the availability of the generated avatar model at the receiving UE can be another criterion. For instance, a decision to switch from 3D communications (e.g., the first stream) to motion sensor and facial landmark transmission (e.g., the second stream) is made and communicated to the UEs when the bandwidth is not sufficient to meet the QoS requirements.
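As a simple, non-normative illustration of such a decision, assuming the control function knows the available QoS bandwidth, the bit rate required by the first stream, and whether the avatar model is available at the receiving UE:

```python
def decide_mode(qos_bandwidth_kbps: float, required_3d_kbps: float,
                avatar_available_at_receiver: bool) -> str:
    """Return "first_stream" for real-time 3D communications or
    "second_stream" for motion-sensor/facial-landmark (avatar) communications.
    """
    if qos_bandwidth_kbps < required_3d_kbps and avatar_available_at_receiver:
        return "second_stream"   # bandwidth cannot meet the 3D QoS requirements
    return "first_stream"
```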
  • Fig. 22 shows an exemplary mapping of the technique to the case of a Standalone AR (e.g., STAR) device architecture according to Fig. 9.
  • the proposed toggling may involve at least one of the following steps, which are highlighted in Fig. 22 as dash-lined boxes (labeled in the format "X.", to be distinguished from the blocks described with reference to Fig. 9, which are labelled with a number in a solid-line box):
  • Capturing media in the AR run-time, such as switching from 3D immersive content to motion sensors and facial landmarks
  • Fig. 23 shows the mapping to cloud/edge assisted architecture (e.g., EDGAR) according to Fig. 10.
  • the steps 1 and 2 are similar to the STAR architecture in Fig. 22.
  • the encoding of media may be split between device and cloud due to the additional resources and hardware capabilities.
  • the decoding and rendering of media may be split between cloud and device as well.
  • Fig. 24 shows a call flow for toggling between 3D real-time streams (e.g., first stream) and animated avatars (e.g., second stream).
  • Step 1 UE A and UE B set up an initial communication session for immersive communications using 3D real-time captured streams. This is based on initial network conditions, e.g., during bearer setup between the two parties (e.g., two communication devices 100/200).
  • Step 2 UE A transmits the real-time captured 3D data stream (e.g., first stream).
  • Step 3 The cloud (e.g., network entity) receives network state information (e.g., periodically or event-driven) for example from the 5G system.
  • Step 4 The cloud (e.g., network entity) decides, based on the current network conditions, to switch to headset sensor information only.
  • the sensor information may be different from motion information and facial landmarks; sensor information can come from AR glasses, e.g., blinking, head shake/nod.
  • the facial landmarks in this case can come from the AR glasses, e.g., blinking, head shake/nod, and some set of facial landmarks.
  • Step 5 The cloud (e.g., network entity) may send an instruction to UE A indicating a switch to headset sensor information only.
  • Step 6 UE A may notify the cloud and UE B about the switch to headset sensor information and avatar-based rendering.
  • Step 7 UE A may transmit headset sensor information such as facial expressions to UE B for avatar rendering.
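Steps 5 to 7 at UE A could be sketched as below; the callback names (notify, capture_headset_sensors, transmit) are assumptions, not defined interfaces:

```python
def handle_switch_instruction(instruction: dict, notify,
                              capture_headset_sensors, transmit) -> None:
    """UE A side of steps 5 to 7 in Fig. 24 (hypothetical callbacks).

    instruction: control message from the cloud,
        e.g., {"action": "switch_to_avatar"}.
    notify: informs the cloud and UE B about the switch (step 6).
    capture_headset_sensors: yields headset sensor samples,
        e.g., facial expressions.
    transmit: sends one sample of the second stream towards UE B (step 7).
    """
    if instruction.get("action") != "switch_to_avatar":
        return
    notify({"event": "switched_to_headset_sensors"})   # step 6
    for sample in capture_headset_sensors():           # step 7
        transmit(sample)
```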
  • Fig. 25 shows call flow for toggling (i.e., switching) to 3D communications based on Standalone AR device architecture.
  • different configurations may be used for immersive communication between the AR device (e.g., communication devices 100 and/or 200) and the 5G core (e.g., network entity).
  • two different configurations for real-time 3D visual representation immersive communications and real-time photo-realistic avatars immersive communication may be considered.
  • Such configuration may be done between the immersive media codec and facial landmarks codecs at AR device (e.g., communication devices 100 and/or 200) and the 5G core (e.g., network entity), respectively.
  • the 5G core can correspond to an IMS core or any other 5G system (e.g., 5GMS).
  • the proposed approach is independent of the protocol realization. The communication session establishment may be realized in one or two sessions, i.e., immersive communications based on real-time 3D data capturing (e.g., the first stream) and avatar parameter capturing (e.g., the second stream) can happen within one session, with device capabilities for both cases exchanged, or two separate sessions may be needed.
  • an instruction is sent from the media AF to the media session handler at the AR device.
  • a notification may also be provided to the receiver AR device (e.g., the first communication device).
  • the media session handler informs the AR and/or MR application about the switch to 3D communications.
  • the AR run-time uses 3D stream capture and provides the 3D stream (e.g., the first stream), e.g., point clouds, to the immersive media codec that compresses the stream.
  • the compressed stream is provided to the receiver XR device where decoding and rendering take place.
  • Fig. 26 shows a call flow for toggling (i.e., switching) to the avatar model for a real-time 3D CGI of the participant (e.g., photo-realistic avatar) in the immersive communication session based on the Standalone AR device architecture (STAR).
  • the photo-realistic avatar switch instruction is provided by the media AF to the media session handler at the AR device.
  • a notification (e.g., message) is provided to the receiver AR device (e.g., the second communication device).
  • the AR/MR application is notified about the switch and the AR run-time switches to motion sensor capture.
  • the data is compressed by motion codecs and transmitted to the receiver AR device, which decodes the stream and provides it to the scene manager for avatar animation (e.g., an immersive communication session using the avatar model controlled with avatar parameters).
  • the receiver AR device may communicate with the media session handler and the cache to retrieve the avatar model.
  • Fig. 27 summarizes the steps involved in toggling from 3D communications to photo-realistic avatars.
  • Fig. 28 summarizes the steps involved in toggling from 3D communications to photo-realistic avatars when the cloud/edge performs part of the avatar rendering (cloud/edge assisted architecture).
  • the cloud may create an avatar model (e.g., representation) of a participant and transmit it to the AR glasses on UE B. Later, after switching to avatar model communication, only avatar parameters (e.g., animation signals) are transmitted in order to, for example, mimic blinking, talking or other movements.
  • the avatar model may also be created before the immersive communication session starts.
  • the AR glasses on UE B have the capability to switch between the two different streams, i.e., between the real-time 3D stream and the avatar, and vice versa. For this purpose, practically two different instances with pre-built functionality are needed: one game object for avatar processing and one for the 3D stream. This is possible on today's devices and rendering engines, e.g., Unity, as sketched below.
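A minimal sketch of such a toggle between the two pre-built rendering instances (the names are illustrative; an actual realization would live in the rendering engine):

```python
class RenderToggle:
    """Holds the two pre-built rendering instances on the receiving UE and
    toggles between them (illustrative interfaces)."""
    def __init__(self, stream_renderer, avatar_renderer):
        self._renderers = {"first_stream": stream_renderer,
                           "second_stream": avatar_renderer}
        self._active = "first_stream"

    def switch_to(self, mode: str) -> None:
        if mode not in self._renderers:
            raise ValueError(f"unknown mode: {mode}")
        self._active = mode

    def render(self, payload) -> None:
        # payload is a decoded 3D frame or decoded avatar parameters,
        # depending on the currently active mode
        self._renderers[self._active].render(payload)
```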
  • the network entity 300 may send a switch instruction to the UE to switch back to real-time 3D stream communication (switch up).
  • the network entity may switch between real-time 3D visual representation and the avatar model for a real-time 3D CGI of the participant in the immersive communication session back and forth according to the network resource status.
  • the headset sensor information, such as head movements (e.g., in the second stream), is provided.
  • the call flows shown in the figures above are, without limitation, for asymmetrical communication from UE A to UE B, where UE A is capturing and transmitting content and UE B is acting as a consumption device.
  • the procedure may be applied for symmetrical communication between both UEs where UE B is additionally capturing video that is sent to UE A.
  • the network entity 300 may send an instruction to both parties such that real-time 3D communications is active on both sides, avatar-based rendering is active on both sides, or 3D communications is active on one side and avatar-based rendering on the other side when network conditions vary.
  • the approach can be applicable to multi-party immersive communications (i.e., more than two parties), e.g., between one or more first communication devices 100 and one or more second communication devices 200.
  • the switching step may be triggered in the network entity 300 due to shortage of resources (e.g., limited resource availability for real-time 3D stream processing at the cloud, e.g., memory, CPU/GPU).
  • a media application may set up (e.g., establish) the immersive communication session between the UEs (e.g., at least two parties). Sender and receiver UEs can start the communication via an application.
  • the control entities on both UEs may be aware of the different modes for toggling during the session.
  • a media player may be part of the receiver UE (e.g., AR glasses or connected to AR glasses), where the content is rendered, and optionally, displayed.
  • Fig. 29 describes the switching from 3D to motion sensor and facial landmark transmission at the sender UE.
  • the sender UE switches off the depth sensor and continues with motion sensors and facial landmarks (referred to as RGB camera, color frames).
  • the cloud and receiver UE are notified about the switch.
  • the sender UE transmits motion sensors and facial landmarks to the network entity 300 (e.g., cloud).
  • Fig. 30 describes the procedures for the receiver UE to switch from 3D real-time rendering to animated avatars.
  • a notification to switch to animated avatars is received from the cloud.
  • the receiver UE switches 3D real-time rendering (e.g., an instance in the run-time engine) to avatar-based rendering (e.g., second instance in run-time engine).
  • avatar parameters (e.g., animation signals) are received, and the animation signals are used for avatar-based rendering on the AR glasses.
  • a method is provided for switching (e.g., toggling) between a real-time 3D visual representation and an animated avatar of the participant between two or more parties involved in an immersive communication session.
  • the avatar model is generated and updated by the network entity during the real-time 3D visual immersive communication session and may be used later on to switch to animated avatar immersive communication when network conditions (e.g., resources) are not sufficient to continue with real-time 3D visual immersive communications.
  • the network entity may switch back to the real-time 3D visual immersive communications.
  • This solution allows immersive communication sessions to be sustained even when the network conditions are not good. In current implementations where only 3D streams are used, if the network conditions are not good, the 3D stream has to be suspended, which reduces the user's experience.
  • the toggling is based on real-time captured data which allows for an improved user experience compared to current implementations.
  • Another current implementation is avatar-only usage, which provides a significantly lower user experience compared to a real-time 3D visual representation in the immersive communication session.
  • In the proposed solution according to the first, second and third method aspects, the best user experience (e.g., real-time 3D visual representation in the immersive communication session) is provided where possible, while the lower user experience (e.g., animated avatar) is provided as a fallback; this is still preferable to a scenario where no representation of the participant (sender) is provided or the immersive communication session is dropped completely.
  • According to the solution of the method aspects, the animated avatar representation of the participant is maintained in the immersive media communication as defined in the summary section, and the updated avatar ensures that the avatar stays as close as possible to the real photographic representation of the participant.
  • Fig. 31 shows a schematic block diagram for an embodiment of the first communication device 100.
  • the first communication device 100 comprises processing circuitry, e.g., one or more processors 3104 for performing the method 400 and memory 3106 coupled to the processors 3104.
  • the memory 3106 may be encoded with instructions that implement at least one of the modules 102, 104 and 106.
  • the one or more processors 3104 may be a combination of one or more of a microprocessor, controller, microcontroller, central processing unit, digital signal processor, application specific integrated circuit, field programmable gate array, or any other suitable computing device, resource, or combination of hardware, microcode and/or encoded logic operable to provide, either alone or in conjunction with other components of the first communication device 100, such as the memory 3106, transmitter functionality or first communication device functionality.
  • the one or more processors 3104 may execute instructions stored in the memory 3106. Such functionality may include providing various features and steps discussed herein, including any of the benefits disclosed herein.
  • the expression "the device being operative to perform an action” may denote the first communication device 100 being configured to perform the action.
  • the first communication device 100 may be embodied by a transmitting device 3100, e.g., functioning as a capturing device or a transmitting UE or capturing UE.
  • the transmitting station 3100 comprises a radio interface 3102 coupled to the first communication device 100 for radio communication with one or more receiving stations, e.g., functioning as a receiving base station 300 or a receiving UE 200.
  • Fig. 32 shows a schematic block diagram for an embodiment of the second communication device 200.
  • the second communication device 200 comprises processing circuitry, e.g., one or more processors 3204 for performing the method 500 and memory 3206 coupled to the processors 3204.
  • the memory 3206 may be encoded with instructions that implement at least one of the modules 202, 204 and 206.
  • the one or more processors 3204 may be a combination of one or more of a microprocessor, controller, microcontroller, central processing unit, digital signal processor, application specific integrated circuit, field programmable gate array, or any other suitable computing device, resource, or combination of hardware, microcode and/or encoded logic operable to provide, either alone or in conjunction with other components of the second communication device 200, such as the memory 3206, receiver functionality.
  • the one or more processors 3204 may execute instructions stored in the memory 3206.
  • Such functionality may include providing various features and steps discussed herein, including any of the benefits disclosed herein.
  • the expression "the device being operative to perform an action” may denote the second communication device 200 being configured to perform the action.
  • the second communication device 200 may be embodied by a receiving device 3200, e.g., functioning as a display device or a receiving UE.
  • the receiving device 3200 comprises a radio interface 3202 coupled to the second communication device 200 for radio communication with one or more transmitting stations, e.g., functioning as a transmitting base station 300 or a transmitting UE 100.
  • Fig. 33 shows a schematic block diagram for an embodiment of the network entity 300.
  • the network entity 300 comprises processing circuitry, e.g., one or more processors 3304 for performing the method 600 and memory 3306 coupled to the processors 3304.
  • the memory 3306 may be encoded with instructions that implement at least one of the modules 302 to 320.
  • the one or more processors 3304 may be a combination of one or more of a microprocessor, controller, microcontroller, central processing unit, digital signal processor, application specific integrated circuit, field programmable gate array, or any other suitable computing device, resource, or combination of hardware, microcode and/or encoded logic operable to provide, either alone or in conjunction with other components of the network entity 300, such as the memory 3306, network functionality.
  • the one or more processors 3304 may execute instructions stored in the memory 3306.
  • Such functionality may include providing various features and steps discussed herein, including any of the benefits disclosed herein.
  • the expression "the device being operative to perform an action” may denote the network entity 300 being configured to perform the action.
  • the network entity 300 may be embodied by a network entity 3300, e.g., functioning as a network.
  • the network entity 3300 comprises a radio interface 3302 coupled to the network entity 300 for (e.g., partially radio) communication with one or more transmitting stations, e.g., functioning as a transmitting base station or a transmitting UE 100 and with one or more receiving stations, e.g., functioning as a receiving base station or a receiving UE 200.
  • a communication system 3400 includes a telecommunication network 3410, such as a 3GPP-type cellular network, which comprises an access network 3411, such as a radio access network, and a core network 3414.
  • the access network 3411 comprises a plurality of base stations 3412a, 3412b, 3412c, such as NBs, eNBs, gNBs or other types of wireless access points, each defining a corresponding coverage area 3413a, 3413b, 3413c.
  • Each base station 3412a, 3412b, 3412c is connectable to the core network 3414 over a wired or wireless connection 3415.
  • a first user equipment (UE) 3491 located in coverage area 3413c is configured to wirelessly connect to, or be paged by, the corresponding base station 3412c.
  • a second UE 3492 in coverage area 3413a is wirelessly connectable to the corresponding base station 3412a. While a plurality of UEs 3491, 3492 are illustrated in this example, the disclosed embodiments are equally applicable to a situation where a sole UE is in the coverage area or where a sole UE is connecting to the corresponding base station 3412.
  • Any of the UEs 3491, 3492 may embody the first communication device 100 and/or the second communication device 200.
  • Any of base stations 3412 may embody the network entity 300.
  • the telecommunication network 3410 is itself connected to a host computer 3430, which may be embodied in the hardware and/or software of a standalone server, a cloud-implemented server, a distributed server or as processing resources in a server farm.
  • the host computer 3430 may be under the ownership or control of a service provider, or may be operated by the service provider or on behalf of the service provider.
  • the connections 3421, 3422 between the telecommunication network 3410 and the host computer 3430 may extend directly from the core network 3414 to the host computer 3430 or may go via an optional intermediate network 3420.
  • the intermediate network 3420 may be one of, or a combination of more than one of, a public, private, or hosted network; the intermediate network 3420, if any, may be a backbone network or the Internet; in particular, the intermediate network 3420 may comprise two or more sub-networks (not shown).
  • the communication system 3400 of Fig. 34 as a whole enables connectivity between one of the connected UEs 3491, 3492 and the host computer 3430.
  • the connectivity may be described as an over-the-top (OTT) connection 3450.
  • the host computer 3430 and the connected UEs 3491, 3492 are configured to communicate data and/or signaling via the OTT connection 3450, using the access network 3411, the core network 3414, any intermediate network 3420 and possible further infrastructure (not shown) as intermediaries.
  • the OTT connection 3450 may be transparent in the sense that the participating communication devices through which the OTT connection 3450 passes are unaware of routing of uplink and downlink communications.
  • a base station 3412 need not be informed about the past routing of an incoming downlink communication with data originating from a host computer 3430 to be forwarded (e.g., handed over) to a connected UE 3491.
  • the base station 3412 need not be aware of the future routing of an outgoing uplink communication originating from the UE 3491 towards the host computer 3430.
  • the performance or range of the OTT connection 3450 can be improved, e.g., in terms of increased throughput and/or reduced latency.
  • the host computer 3430 may indicate to the network entity 300 (e.g., the RAN or core network) or the first communication device 100 or the second communication device 200 (e.g., on an application layer) the QoS of the traffic. Based on the QoS indication the network entity may toggle (i.e., switch) between the real-time 3D visual representation and real-time animated avatar of the participant of the communication session.
  • a host computer 3510 comprises hardware 3515 including a communication interface 3516 configured to set up and maintain a wired or wireless connection with an interface of a different communication device of the communication system 3500.
  • the host computer 3510 further comprises processing circuitry 3518, which may have storage and/or processing capabilities.
  • the processing circuitry 3518 may comprise one or more programmable processors, application-specific integrated circuits, field programmable gate arrays or combinations of these (not shown) adapted to execute instructions.
  • the host computer 3510 further comprises software 3511, which is stored in or accessible by the host computer 3510 and executable by the processing circuitry 3518.
  • the software 3511 includes a host application 3512.
  • the host application 3512 may be operable to provide a service to a remote user, such as a UE 3530 connecting via an OTT connection 3550 terminating at the UE 3530 and the host computer 3510.
  • the host application 3512 may provide user data, which is transmitted using the OTT connection 3550.
  • the user data may depend on the location of the UE 3530.
  • the user data may comprise auxiliary information or precision advertisements (also: ads) delivered to the UE 3530.
  • the location may be reported by the UE 3530 to the host computer, e.g., using the OTT connection 3550, and/or by the base station 3520, e.g., using a connection 3560.
  • the communication system 3500 further includes a base station 3520 provided in a telecommunication system and comprising hardware 3525 enabling it to communicate with the host computer 3510 and with the UE 3530.
  • the hardware 3525 may include a communication interface 3526 for setting up and maintaining a wired or wireless connection with an interface of a different communication device of the communication system 3500, as well as a radio interface 3527 for setting up and maintaining at least a wireless connection 3570 with a UE 3530 located in a coverage area (not shown in Fig. 35) served by the base station 3520.
  • the communication interface 3526 may be configured to facilitate a connection 3560 to the host computer 3510.
  • the connection 3560 may be direct, or it may pass through a core network (not shown in Fig. 35).
  • the hardware 3525 of the base station 3520 further includes processing circuitry 3528, which may comprise one or more programmable processors, application-specific integrated circuits, field programmable gate arrays or combinations of these (not shown) adapted to execute instructions.
  • the base station 3520 further has software 3521 stored internally or accessible via an external connection.
  • the communication system 3500 further includes the UE 3530 already referred to.
  • Its hardware 3535 may include a radio interface 3537 configured to set up and maintain a wireless connection 3570 with a base station serving a coverage area in which the UE 3530 is currently located.
  • the hardware 3535 of the UE 3530 further includes processing circuitry 3538, which may comprise one or more programmable processors, application-specific integrated circuits, field programmable gate arrays or combinations of these (not shown) adapted to execute instructions.
  • the UE 3530 further comprises software 3531, which is stored in or accessible by the UE 3530 and executable by the processing circuitry 3538.
  • the software 3531 includes a client application 3532.
  • the client application 3532 may be operable to provide a service to a human or non-human user via the UE 3530, with the support of the host computer 3510.
  • an executing host application 3512 may communicate with the executing client application 3532 via the OTT connection 3550 terminating at the UE 3530 and the host computer 3510.
  • the client application 3532 may receive request data from the host application 3512 and provide user data in response to the request data.
  • the OTT connection 3550 may transfer both the request data and the user data.
  • the client application 3532 may interact with the user to generate the user data that it provides.
  • the host computer 3510, base station 3520 and UE 3530 illustrated in Fig. 35 may be identical to the host computer 3430, one of the base stations 3412a, 3412b, 3412c and one of the UEs 3491, 3492 of Fig. 34, respectively.
  • the inner workings of these entities may be as shown in Fig. 35, and, independently, the surrounding network topology may be that of Fig. 34.
  • the OTT connection 3550 has been drawn abstractly to illustrate the communication between the host computer 3510 and the UE 3530 via the base station 3520, without explicit reference to any intermediary devices and the precise routing of messages via these devices.
  • Network infrastructure may determine the routing, which it may be configured to hide from the UE 3530 or from the service provider operating the host computer 3510, or both. While the OTT connection 3550 is active, the network infrastructure may further take decisions by which it dynamically changes the routing (e.g., on the basis of load balancing consideration or reconfiguration of the network).
  • the wireless connection 3570 between the UE 3530 and the base station 3520 is in accordance with the teachings of the embodiments described throughout this disclosure.
  • One or more of the various embodiments improve the performance of OTT services provided to the UE 3530 using the OTT connection 3550, in which the wireless connection 3570 forms the last segment. More precisely, the teachings of these embodiments may reduce the latency and improve the data rate and thereby provide benefits such as better responsiveness and improved QoS.
  • a measurement procedure may be provided for the purpose of monitoring data rate, latency, QoS and other factors on which the one or more embodiments improve.
  • the measurement procedure and/or the network functionality for reconfiguring the OTT connection 3550 may be implemented in the software 3511 of the host computer 3510 or in the software 3531 of the UE 3530, or both.
  • sensors may be deployed in or in association with communication devices through which the OTT connection 3550 passes; the sensors may participate in the measurement procedure by supplying values of the monitored quantities exemplified above, or supplying values of other physical quantities from which software 3511, 3531 may compute or estimate the monitored quantities.
  • the reconfiguring of the OTT connection 3550 may include message format, retransmission settings, preferred routing etc.; the reconfiguring need not affect the base station 3520, and it may be unknown or imperceptible to the base station 3520. Such procedures and functionalities may be known and practiced in the art.
  • measurements may involve proprietary UE signaling facilitating the host computer's 3510 measurements of throughput, propagation times, latency and the like. The measurements may be implemented in that the software 3511, 3531 causes messages to be transmitted, in particular empty or "dummy" messages, using the OTT connection 3550 while it monitors propagation times, errors etc.
  • Fig. 36 is a flowchart illustrating a method implemented in a communication system, in accordance with one embodiment.
  • the communication system includes a host computer, a base station and a UE which may be those described with reference to Figs. 34 and 35. For simplicity of the present disclosure, only drawing references to Fig. 36 will be included in this paragraph.
  • the host computer provides user data.
  • the host computer provides the user data by executing a host application.
  • the host computer initiates a transmission carrying the user data to the UE.
  • the base station transmits to the UE the user data which was carried in the transmission that the host computer initiated, in accordance with the teachings of the embodiments described throughout this disclosure.
  • the UE executes a client application associated with the host application executed by the host computer.
  • Fig. 37 is a flowchart illustrating a method implemented in a communication system, in accordance with one embodiment.
  • the communication system includes a host computer, a base station and a UE which may be those described with reference to Figs. 34 and 35. For simplicity of the present disclosure, only drawing references to Fig. 37 will be included in this paragraph.
  • the host computer provides user data.
  • the host computer provides the user data by executing a host application.
  • the host computer initiates a transmission carrying the user data to the UE. The transmission may pass via the base station, in accordance with the teachings of the embodiments described throughout this disclosure.
  • the UE receives the user data carried in the transmission.
  • At least some embodiments of the technique allow for immersive communication sessions to be sustained even when the network conditions are not good.
  • in current implementations where only 3D streams are used, if the network conditions are not good, the 3D stream has to be suspended, which reduces the user's experience.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A technique for an immersive communication session between communication devices (100, 200) is described. As to a method aspect performed by a network entity (300) for supporting the immersive communication session, a first stream is received in the immersive communication session from at least one first communication device (100) among the communication devices (100, 200). The first stream comprises a real-time 3D visual representation of a participant of the immersive communication session. The real-time 3D visual representation is sent to at least one second communication device (200) of the communication devices (100, 200). An avatar model of the participant of the immersive communication session is generated based on the 3D visual representation. The generated (608) avatar model is sent to the at least one second communication device (200). In response to a shortage of resources for the first stream, a control message is sent to the at least one first communication device (100). The control message is indicative of switching (614) the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery (CGI) of the participant.

Description

SUPPORTING AN IMMERSIVE COMMUNICATION SESSION BETWEEN
COMMUNICATION DEVICES
Technical Field
The present disclosure relates to a method performed by a network entity for supporting an immersive communication session between communication devices, a method performed by a first communication device for supporting an immersive communication session with at least one second communication device, a method performed by a second communication device for supporting an immersive communication session with at least one first communication device, a network entity for supporting an immersive communication session between communication devices, a first communication device for supporting an immersive communication session with at least one second communication device, a second communication device for supporting in an immersive communication session with at least one first communication device, and a system for supporting an immersive communication session.
Background
Immersive communications typically rely on a particular media format such as point clouds or 3D meshes, which demands significantly more resources than 2D video. Point clouds may be used for augmented reality (AR) or virtual reality (VR) at the receiver device (e.g., a client) to provide an immersive user experience. Point clouds may be generated by transmitter devices using depth cameras such as the "Intel RealSense" or "Microsoft Kinect". Point clouds are a set of 3D points, wherein each point is represented by a position (x, y, z) and attributes such as color (e.g., RGB). Point clouds may comprise thousands or millions of points per frame providing a fine-grained representation of captured objects including environmental objects such as physical surrounding, objects or humans.
However, this results in huge bandwidth and storage demands.
Therefore, compression algorithms are developed to reduce the bitrate before transmitting over a network, such as video-based point cloud compression (V- PCC) specified by the Moving Picture Experts Group (MPEG). V-PCC decomposes a point cloud into 2D patches that can be encoded using existing 2D video codecs (i.e., encoder and decoder) such as high efficiency video coding (HEVC). Heuristic segmentation approaches are used to minimize mapping distortions. The resulting patches are packed into images that represent the geometry or depth map (e.g., longitudinal distance to the points) and texture (e.g., color attributes).
Accordingly, for an immersive communication over a network, a plurality of challenges needs to be addressed. Firstly, the immersive communication requires a specific type of content, e.g., point clouds. 3D-captured data such as point clouds and meshes are difficult to encode, which requires dedicated resources and is computationally intensive. Secondly, rate adaptation approaches typically used in 2D streaming such as Dynamic Adaptive Streaming over HTTP (DASH) are not available for real-time 3D visual communications. T. Stockhammer, "Dynamic Adaptive Streaming over HTTP - Design Principles and Standards", ACM conference on Multimedia systems, 2011, describes existing DASH specifications available from the Third Generation Partnership Project (3GPP) and in a draft version also from MPEG. However, DASH is neither specific nor suited to 3D content. The lack of network adaptation for streaming 3D content responsive to changes in link quality results in service interruption if the network has too small a resource reserve or leads to inefficient usage of network resources if the network has too large a resource reserve.
US 9,883,144 B2 describes an approach for replacing a video with an avatar in case of bad network communication. The avatar is not based on a live representation of the user, i.e., there is no live capture or facial landmarks, nor control of the avatar. Furthermore, there is no system architecture or description of signals showing how such an approach would work for real-time AR communications.
Summary
Accordingly, there is a need for supporting an immersive communication session technique that achieves session persistence under varying link qualities.
As to a first method aspect, a method performed by a network entity for supporting an immersive communication session between communication devices is provided. The method comprises receiving a first stream in the immersive communication session from at least one first communication device among the communication devices. The first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session. The method further comprises sending the real-time 3D visual representation or real-time 3D immersive media rendered based on the real-time 3D visual representation to at least one second communication device of the communication devices. The method further comprises generating an avatar model of a or the participant of the immersive communication session based on the 3D visual representation received from the at least one first communication device. The method further comprises sending the generated avatar model to the at least one second communication device. The method further comprises, in response to a shortage of resources for the first stream, sending a control message to the at least one first communication device. The control message is indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery (CGI) of the participant. The method further comprises receiving a second stream in the immersive communication session from the at least one first communication device. The second stream comprises real-time values of avatar parameters controlling motions of the avatar model (e.g., facial expressions) of the participant. The method further comprises sending the real-time values to the at least one second communication device.
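A high-level, non-normative sketch of the first method aspect at the network entity; all names (send, avatar_builder, resources_short) are illustrative assumptions rather than defined interfaces:

```python
def network_entity_method(first_stream, second_stream, receivers, senders,
                          send, avatar_builder, resources_short):
    """Sketch of the first method aspect (illustrative only).

    first_stream: iterable of real-time 3D visual representation frames.
    second_stream: iterable of real-time avatar parameter values.
    send(targets, payload): forwards a payload to communication devices.
    avatar_builder: accumulates frames, exposes is_built()/model().
    resources_short(): reports a shortage of resources for the first stream.
    """
    model_sent = False
    for frame in first_stream:
        send(receivers, frame)                       # forward 3D representation
        if not model_sent:
            avatar_builder.add_frame(frame)          # generate the avatar model
            if avatar_builder.is_built():
                send(receivers, avatar_builder.model())  # send generated model
                model_sent = True
        if model_sent and resources_short():
            send(senders, {"action": "switch_to_avatar"})  # control message
            break
    for values in second_stream:                     # real-time avatar values
        send(receivers, values)
```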
The switching of the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant may also be referred to as switching from the first stream to the second stream. The resources (e.g., anywhere along the path of the first stream) required for the first stream are typically substantially greater than the resources required for the second stream. For example, the resources may be associated with the performance of one or more links on the path, optionally wherein the performance comprises at least one of data rate, bandwidth, latency, and other quality indicators. The switching from the first stream to the second stream can mitigate the shortage of the resources for the first stream. For example, the shortage of resources may be due to an insufficient performance of a link (e.g., a radio link) along the path of the first stream. In other words, the first stream cannot be supported by the link due to the shortage of resources.
Herein, the real-time 3D visual representation of the participant may be a real-time 3D photographic representation of the participant. The real-time 3D photographic representation may be captured by a camera and/or may comprise pixels or voxels of the participant. For example, the voxels may comprise pixels with depth information. Alternatively or in addition, the camera may operate in the visual and/or infrared spectrum. The first stream may also be referred to as a photographic media stream. Alternatively or in addition, the 3D visual representation may be a captured 3D visual representation of the participant.
The network entity may be configured to decode and/or process the received first stream to generate the avatar model.
The avatar model may be a photorealistic representation of the participant. Alternatively or in addition, the avatar model may be a representation of the participant that is animated according to the avatar parameters controlling the motions of the avatar model (e.g., based on detected facial expressions or other sensory data).
The avatar model may enable (e.g., the network entity and/or at least one or each of the at least one second communication devices) to render the real-time 3D computer generated imagery (CGI) of the participant based on the combination of the avatar model and the values of the avatar parameters. In other words, the avatar parameters may be used as input to the avatar model for generating the 3D CGI of the participant. The CGI may comprise graphics generated by a computer according to the avatar model, wherein the state (e.g., a position of the head or a facial expression) of the avatar model is specified by the avatar parameters. As the avatar parameters may change in real-time according to the second stream, the CGI may be animated. The CGI may be a photorealistic representation of the participant.
The second stream may also be referred to as an avatar-controlling stream. Alternatively or in addition, the CGI of the participant (according to the avatar model and the second stream) may be a generated 3D graphics representation of the participant.
The at least one first communication device may also be referred to as a transmitting device (TX device), which may be wirelessly connected to the network entity. Alternatively or in addition, the at least one second communication device may also be referred to as a receiving device (RX device), which may be wirelessly connected to the network entity. Alternatively or in addition, the communication devices may also be referred to as (e.g., two or more) immersive communication devices, e.g., since they are configured for at least one of establishing the immersive communication session, capturing the first stream and the second stream, and receiving the first stream and the second stream and using them to render the immersive communication session.
Any one of the above wireless connections may operate according to a 3GPP radio access technology (RAT), e.g., according to 3GPP Long Term Evolution (LTE) or 3GPP New Radio (NR), or Wi-Fi/WLAN.
At least one or each of the at least one TX device may be, or may comprise, a first capturing device configured to capture (e.g., optically) a scene including the participant, e.g., responsive to establishing of the immersive communication session. Alternatively or in addition, at least one or each of the at least one TX device may be, or may comprise, a second capturing device (e.g., one or more sensors) configured to capture the real-time values of the avatar parameters, e.g., responsive to the control message. In a first variant of any embodiment, the first and/or second capturing device may be comprised in the at least one TX device. In a second variant of any embodiment, at least one or each of the at least one TX device is a terminal device configured for wired or wireless device-to-device communication (e.g., using a sidelink, Bluetooth, or Wi-Fi/WLAN) with such a (e.g., first and/or second) capturing device. The terminal device (e.g., the TX device) and the (e.g., first and/or second) capturing device may be separate devices.
In any embodiment, the first capturing device may comprise a 3D camera. Alternatively or in addition, the second capturing device may comprise gloves with sensors (e.g., extensometers for detecting positions of the hand and fingers) or a headset with sensors (e.g., acceleration sensors for detecting motions of the head).
At least one or each of the at least one RX device may be, or may comprise, a displaying device configured to display (i.e., to visually output) a scene including the participant based on the first and/or second stream, e.g., responsive to the control message. In a first variant of any embodiment, the displaying device may be embedded in the at least one RX device. In a second variant of any embodiment, at least one or each of the at least one RX device may be a terminal device configured for wired or wireless device-to-device communication (e.g., using a sidelink, Bluetooth, or Wi-Fi/WLAN) with such a displaying device. The terminal device (e.g., the RX device) and the first and/or second capturing device may be separate devices.
The displaying device may comprise virtual reality (VR) glasses. Any one of the above terminal devices may be a user equipment (UE) according to 3GPP LTE or 3GPP NR.
Any one of the above wired device-to-device communications may comprise a serial bus interface and/or may use a communication protocol according to PCI Express (PCIe) and DisplayPort (DP), e.g., for a Thunderbolt hardware interface, or a communication protocol according to a universal serial bus (USB) according to the USB Implementers Forum (USB-IF) and/or a hardware interface according to USB-C.
Any one of the above wireless device-to-device communications may comprise a 3GPP sidelink (SL, e.g., according to the 3GPP document TS 23.303, version 17.0.0) or may operate according to IEEE 802.15.1 or Bluetooth Tethering of the Bluetooth Special Interest Group (SIG) or Wi-Fi Direct according to the IEEE 802.11 or the Wi-Fi Alliance.
Herein, the expressions "three-dimensional" (or briefly: 3-dimensional or 3D), spatial, and volumetric, may be used interchangeably.
The participant of the immersive communication session may be a physical object in a 3D space captured at the first communication device, in particular a person. For example, the 3D space may comprise a background or surrounding and a foreground or center. The physical object represented by the avatar model may be located in the foreground or center of the 3D space. In the case of multiple first and/or second streams received from multiple first communication devices, the participants represented by the respective streams may be arranged in the same 3D space.
The first stream (i.e., the photographic media stream for the 3D visual representation), and/or the second stream (i.e., the avatar-controlling stream for the real-time values of the avatar parameters) in combination with the avatar model, may enable (e.g., the network entity and/or at least one or each of the at least one second communication devices) to display the participant (e.g., within the 3D space) immersively. Alternatively or in addition, the expression "immersive media" may be an umbrella term that encompasses the 3D visual representation of the participant (e.g., a sequence of photographic 3D images or a photographic 3D video) comprised in the first stream and/or the 3D CGI resulting from the avatar model in combination with the real-time values of the avatar parameters comprised in the second stream. For example, each of the first stream and the second stream may be referred to as immersive media, e.g., after providing the at least one second communication device with the avatar model.
For another participant of the immersive communication session (e.g., at the at least one second communication device) to have an "immersive" experience, the rendered participant may be located and oriented (e.g., 6DoF of the object) in a scene (e.g., a virtual room) that is rendered for the other participant as a viewer, which is also mapped, i.e., located and oriented (e.g., 6DoF of the subject), in the same virtual room (e.g., even if no immersive representation of the other participant is sent in the other direction to the participant for an asymmetric XR session). In other words, a scene (which may be a monoscopic or stereoscopic image) rendered based on the first or second stream may always depend on location and orientation of the participant and/or location and orientation of a viewer (i.e., another participant).
The 3D visual representation (e.g., the photographic 3D image or video) may comprise any volumetric media, e.g., a depth map image, a plenoptic image, an omnidirectional image, a cloud of 3D points (also referred to as 3D point cloud) or 3D meshes (e.g., meshes in a 3D space). The meshes may comprise a triangulation of a (e.g., curved) surface in 3D space.
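For illustration only, a single point of such a 3D point cloud carries a position and color attributes, e.g.:

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float   # position
    y: float
    z: float
    r: int     # color attributes (RGB)
    g: int
    b: int

# a point cloud frame is simply a (possibly very large) list of such points
frame = [Point(0.10, 0.20, 1.50, 200, 180, 160),
         Point(0.10, 0.21, 1.50, 201, 181, 161)]
```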
Alternatively or in addition, the immersive media may encompass any media that supports a (e.g., stereoscopic) view depending on the position of viewer. The immersive media may comprise virtual reality (VR) media, augmented reality (AR) media, or mixed reality (MR) media. For example, the view of the participant (e.g., by means of the 3D visual representation and/or the avatar model) may depend on both the position of the participant in a virtual room (e.g., a virtual meeting room) using the first communication device and a position of a participant's viewer (e.g., another participant) in the virtual room using the second communication device.
The shortage of resources for the first stream may be due to a reduction of (e.g., available) resources at the network entity and/or an increase of a resource requirement of the first stream.
Herein, real-time may refer to a stream of data that is sent with an end-to-end latency that is less than a latency threshold for real-time conversational services or interactive communications, e.g., equal to or less than 150 ms.

The method (e.g., according to the first method aspect) may further comprise establishing the immersive communication session between the communication devices.
The immersive communication session may be initiated by the first or second communication devices, e.g., the one which communicates with the network entity. For example, the establishing may comprise receiving an establishing request from at least one or each of the communication devices.
The established immersive communication session may persist before, during, and/or after the switching. Alternatively or in addition, both the first stream and the second stream may be sent and received in the same immersive communication session.
The network may comprise at least one of a control media application function (AF) and a data media application server (AS).
The control media AF may perform the establishing of the immersive communication session. For example, the control media AF may control the establishing of the immersive communication session. Alternatively or in addition, the data media AS may perform the generating of the avatar model. The AS may correspond to a cloud computing entity or an edge computing entity in the network.
The shortage of resources may occur along a communication path of the first stream. The shortage of resources may include at least one of a shortage of radio resources of a radio access network (RAN) providing radio access for the at least one first communication device, a shortage of radio resources of a or the RAN providing radio access for the at least one second communication device, a shortage of a transport capacity of the network entity, a shortage of resources (e.g., high latency) in a core network (CN) (e.g., serving the RAN) or other node outside of the RAN, and a shortage of computational resources for rendering of the real-time 3D immersive media based on the first stream.
The transport capacity may correspond to a (e.g., maximum) bit rate provided by the network entity. Alternatively or in addition, the switching may be triggered if a bit rate required by the first stream is greater than the maximum bit rate of the network entity. The shortage of resources at the network entity may comprise a shortage of computation resources for the rendering of the real-time data.
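For illustration only, the following is a minimal Python sketch of such a bit-rate-based trigger for the switch from the first stream to the second stream; the function name, the optional computation-share check, and all threshold values are hypothetical and merely exemplify the decision described above.

```python
def should_switch_to_avatar(required_bit_rate_bps: float,
                            max_transport_bit_rate_bps: float,
                            available_render_cpu_share: float = 1.0,
                            min_render_cpu_share: float = 0.2) -> bool:
    # A shortage of resources for the first stream is assumed if the stream
    # needs more bit rate than the network entity can transport, or if too
    # little computation is left for rendering the real-time 3D media.
    transport_shortage = required_bit_rate_bps > max_transport_bit_rate_bps
    compute_shortage = available_render_cpu_share < min_render_cpu_share
    return transport_shortage or compute_shortage

# Example: a 45 Mbit/s point-cloud stream over a 25 Mbit/s transport capacity.
print(should_switch_to_avatar(45e6, 25e6))  # True -> send the control message
```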
The resources for the first stream may comprise at least one of computational resources for processing the first stream and radio resources for a radio communication of the immersive communication session.
Multiple first streams may be received from multiple first communication devices among the communication devices (e.g., according to the first method aspect). The resources of the shortage (i.e., the resources that are subject to the shortage) may include computational resources for composing multiple real-time 3D visual representations of multiple participants of the immersive communication session received from the multiple first communication devices.
For example, the resources (which are subject to the shortage) may include computational resources for rendering a scene composed of the multiple real-time 3D visual representations of the multiple participants.
The method may further comprise, in response to the shortage of resources for the first stream, sending the control message to the at least one second communication device. Alternatively or in addition, the second stream may be (e.g., implicitly) indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant.
Receiving the second stream at the at least one second communication device may trigger the at least one second communication device to switch the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant. For example, the mere reception of the second stream may be interpreted as an indicator, request, or instruction, for the switching.
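For illustration only, the following minimal Python sketch shows how the mere reception of avatar parameter values (the second stream) could select the rendering path at a receiving device; the packet layout and the renderer callbacks are hypothetical stand-ins.

```python
# Hypothetical packet format: {"type": "avatar_parameters" | "3d_media", ...}.
def on_media_received(packet, render_3d_representation, render_avatar_cgi):
    if packet["type"] == "avatar_parameters":
        # Second stream: its mere reception implies the switch to the avatar CGI.
        render_avatar_cgi(packet["values"])
    else:
        # First stream: photographic real-time 3D visual representation.
        render_3d_representation(packet["frame"])

# Example with print functions standing in for the renderers.
on_media_received({"type": "avatar_parameters", "values": [0.1, 0.7]},
                  print, lambda v: print("avatar CGI with", v))
```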
The control message sent to the at least one first communication device may trigger one or more sensors at the at least one first communication device to capture the real-time values of the avatar parameters. The control message sent to the at least one second communication device may trigger rendering (e.g., at the respective second communication device) the real-time 3D CGI of the participant based on the generated avatar model and the real-time values of the avatar parameters. The method may further comprise at least one of saving the generated avatar model and monitoring the 3D visual representation of the received first stream for a change in the avatar model of the participant of the immersive communication session.
The network entity may comprise, or may be part of, a radio access network (RAN). For example, the RAN may provide radio access to at least one of the communication devices. Alternatively or in addition, the network entity (e.g., according to the first method aspect) may comprise, or may be part of, a network node of a RAN. For example, the network node may serve at least one of the communication devices. Alternatively or in addition, the network entity (e.g., according to the first method aspect) may comprise, or may be part of, a core network (CN). For example, the CN may transport the first and second streams between the communication devices and/or may perform mobility management for the communication devices. Alternatively or in addition, the network entity (e.g., according to the first method aspect) may comprise, or may be part of, a local area network (LAN). Alternatively or in addition, the network entity (e.g., according to the first method aspect) may comprise, or may be part of, a distributed network for edge computing and/or a computing center.
The real-time 3D visual representation of the participant may be at least one of encoded and compressed in the received first stream.
The avatar model may comprise a biomechanical model for the motions of the participant of the immersive communication session.
The motions of the avatar model controlled by the avatar parameters may comprise at least one of gestures, facial expressions, and head motion of the participant that is encoded in the values of the avatar parameters in the received second stream.
The method may further comprise or initiate at least one of: re-generating the avatar model of the participant of the immersive communication session or updating the generated and saved avatar model of the participant of the immersive communication session, and sending the re-generated or updated avatar model to the at least one second communication device. The avatar model may be re-generated or updated after re-establishing the immersive communication session. The avatar model may be re-generated or updated responsive to a change in the 3D visual representation being greater than a predefined threshold. The re-generated or updated avatar model may be sent responsive to a change in the avatar model being greater than a predefined threshold.
The change (e.g., the change in the 3D visual representation being greater than a predefined threshold) may be detected by means of a filter. For example, the filter may be applied to the first stream. The filter may be invariant under rotation and/or translation of the participant. Alternatively or in addition, the filter may be invariant under facial expressions and/or motions that correspond to a change in the avatar parameters.
The change may be a significant change in appearance. For example, the filter may output a scalar or a vector that is indicative of the change. The change may trigger the updating of the avatar model if the scalar is greater than a predefined threshold or if a magnitude of the vector is greater than a predefined threshold or if a scalar product between the vector of the change and a predefined reference vector is greater than a predefined threshold.
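For illustration only, the following minimal Python sketch implements the three threshold tests described above (scalar indicator, vector magnitude, and scalar product with a reference vector); the filter producing the change indicator is assumed to exist elsewhere, and the threshold value is hypothetical.

```python
from typing import Optional, Sequence, Union
import math

def avatar_update_needed(change: Union[float, Sequence[float]],
                         threshold: float = 0.3,
                         reference_vector: Optional[Sequence[float]] = None) -> bool:
    # Scalar filter output: compare directly against the threshold.
    if isinstance(change, (int, float)):
        return change > threshold
    # Vector filter output: either project onto a reference vector or use the magnitude.
    if reference_vector is not None:
        return sum(c * r for c, r in zip(change, reference_vector)) > threshold
    return math.sqrt(sum(c * c for c in change)) > threshold

print(avatar_update_needed(0.5))                # scalar above threshold -> update the model
print(avatar_update_needed([0.1, 0.05, 0.02]))  # small magnitude -> keep the current model
```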
The avatar model or an update of the avatar model may be sent simultaneously with the sending of the real-time 3D visual representation, or upon establishing or re-establishing of the immersive communication session, to the at least one second communication device.
The real-time values of the avatar parameters may be derived from one or more depth-insensitive image sensors and/or one or more acceleration sensors capturing facial expressions or motions of the participant at the first communication device. The one or more depth-insensitive image sensors may comprise at least one of: a camera for capturing a 2D image (e.g., by projecting light of the participant onto a 2D image sensor), and a filter for detecting facial landmarks in the 2D image of the participant.
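For illustration only, the following minimal Python sketch derives a few avatar parameters (e.g., mouth opening and head roll) from 2D facial landmarks assumed to have been detected by such a filter; the landmark names and the chosen parameters are hypothetical.

```python
import math

def derive_avatar_parameters(landmarks: dict) -> dict:
    """Map named 2D facial landmarks (pixel coordinates) to normalized avatar parameters."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    face_width = dist(landmarks["left_cheek"], landmarks["right_cheek"])
    mouth_open = dist(landmarks["upper_lip"], landmarks["lower_lip"]) / face_width
    mouth_wide = dist(landmarks["mouth_left"], landmarks["mouth_right"]) / face_width
    # Head roll estimated from the line connecting the outer eye corners.
    dx = landmarks["right_eye"][0] - landmarks["left_eye"][0]
    dy = landmarks["right_eye"][1] - landmarks["left_eye"][1]
    head_roll = math.degrees(math.atan2(dy, dx))
    return {"mouth_open": mouth_open, "mouth_wide": mouth_wide, "head_roll_deg": head_roll}

example = {"left_cheek": (80, 200), "right_cheek": (220, 200),
           "upper_lip": (150, 250), "lower_lip": (150, 270),
           "mouth_left": (120, 260), "mouth_right": (180, 260),
           "left_eye": (110, 150), "right_eye": (190, 155)}
print(derive_avatar_parameters(example))
```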
The real-time 3D visual representation may be derived from one or more depth-sensitive image sensors at the first communication device. The one or more depth-sensitive image sensors may comprise at least one of: a light field camera, an array of angle-sensitive pixels, a sensor for light-in-flight imaging, a sensor for light detection and ranging (LIDAR), a streak camera, and a device using structured light triangulation for depth sensing.

The control message sent to the at least one first communication device may be further indicative of at least one of a type of sensors for capturing the 3D visual representation of the participant to be deactivated, a type of sensors for deriving the real-time values of the avatar parameters to be activated, handing over control from a or the control media AF at the network entity to a media session handler at the respective first communication device, sending a notification from a or the media session handler at the respective first communication device to a mixed reality application at the respective first communication device, switching a mixed reality run-time engine at the respective first communication device from capturing the 3D visual representation to deriving the real-time values of the avatar parameters, and switching one or more media access functions at the respective first communication device from an immersive media encoder encoding the 3D visual representation to a motion sensor encoder encoding the real-time values of the avatar parameters.
The control message sent to the at least one second communication device may be further indicative of at least one of a mixed reality run-time engine for pose correction and/or rendering of the real-time 3D visual representation of the participant to be deactivated, a mixed reality scene manager for pose correction and/or rendering of the real-time 3D CGI based on the generated avatar model and the real-time values of the avatar parameters to be activated, handing over control from a or the control media AF at the network entity to a media session handler at the respective second communication device, sending a notification from a or the media session handler at the respective second communication device to a mixed reality application at the respective second communication device, switching a mixed reality run-time engine at the respective second communication device from rendering the real-time 3D visual representation to rendering the real-time 3D CGI based on the avatar model and the real-time values of the avatar parameters, and switching one or more media access functions at the respective second communication device from an immersive media decoder decoding the 3D visual representation to a motion sensor decoder decoding the real-time values of the avatar parameters.
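For illustration only, the following minimal Python sketch shows one possible way to carry the indications listed above in a control message; the field names do not define any protocol and are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SwitchControlMessage:
    session_id: str
    target: str                       # "first_device" or "second_device"
    switch_to: str = "avatar_cgi"     # switch from the 3D visual representation to the avatar CGI
    deactivate: list = field(default_factory=list)   # e.g., sensors, encoders, decoders to stop
    activate: list = field(default_factory=list)     # e.g., sensors, encoders, decoders to start
    hand_over_control_to: str = "media_session_handler"

to_sender = SwitchControlMessage(
    session_id="xr-session-42", target="first_device",
    deactivate=["depth_sensors", "immersive_media_encoder"],
    activate=["facial_landmark_filter", "motion_sensor_encoder"])

to_receiver = SwitchControlMessage(
    session_id="xr-session-42", target="second_device",
    deactivate=["immersive_media_decoder"],
    activate=["mixed_reality_scene_manager", "motion_sensor_decoder"])
```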
The control message may be sent to a or the media session handler of the at least one first communication device and/or a or the media session handler of the at least one second communication device.

Each of the first stream and the second stream may further comprise immersive audio. The immersive audio may be unchanged during the switching. Alternatively or in addition, the immersive audio may provide Immersive Voice and Audio Services (IVAS).
The shortage of resources may be determined based on network state information at the network entity. The network state information may be received periodically or event-driven.
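For illustration only, the following minimal Python sketch shows a shortage monitor that can be fed with network state information either event-driven (a callback per report) or periodically (polling); the report format and the callback interface are hypothetical.

```python
import time

class ShortageMonitor:
    """Evaluates network state reports for a shortage of resources for the first stream."""

    def __init__(self, max_bit_rate_bps: float, on_shortage):
        self.max_bit_rate_bps = max_bit_rate_bps
        self.on_shortage = on_shortage          # e.g., a function sending the control message

    def on_network_state(self, state: dict):
        # Event-driven: called whenever a new network state report arrives.
        if state.get("required_bit_rate_bps", 0.0) > self.max_bit_rate_bps:
            self.on_shortage(state)

    def poll(self, read_state, period_s: float = 1.0, rounds: int = 3):
        # Periodic: actively read the network state at a fixed period.
        for _ in range(rounds):
            self.on_network_state(read_state())
            time.sleep(period_s)

monitor = ShortageMonitor(25e6, on_shortage=lambda s: print("shortage detected:", s))
monitor.on_network_state({"required_bit_rate_bps": 40e6})   # event-driven report triggers the callback
```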
The avatar model may be generated by a or the data media application server (AS) of the network entity. The data media AS may also be referred to as a media AS.
Alternatively or in addition, the shortage for switching may be determined by a or the control media application function (AF) of the network entity.
Without limitation, for example in a 3GPP implementation, any "radio device" may be a user equipment (UE). Without limitation, any one or each of the communication devices may be referred to, or may be part of, a radio device.
The technique may be applied in the context of 3GPP LTE or New Radio (NR).
The technique may be implemented in accordance with a 3GPP specification, e.g., for 3GPP release 17 or 18 or later. The technique may be implemented for 3GPP LTE or 3GPP NR.
Whenever referring to the RAN, the RAN may be implemented by one or more base stations (e.g., network nodes). The RAN may be implemented according to the Global System for Mobile Communications (GSM), the Universal Mobile Telecommunications System (UMTS), 3GPP Long Term Evolution (LTE) and/or 3GPP New Radio (NR).
As to a second method aspect, a method performed by a first communication device for supporting an immersive communication session with at least one second communication device is provided. The method comprises sending a first stream in the immersive communication session to the at least one second communication device through a network entity. The first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session. The real-time 3D visual representation enables (e.g., triggers) the network entity to generate an avatar model of the participant and to send the generated avatar model to the at least one second communication device. The method further comprises receiving a control message from the network entity. The control message is indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery (CGI) of the participant. The method further comprises sending a second stream in the immersive communication session to the at least one second communication device through the network entity. The second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant.
The first communication device may be configured to capture the real-time 3D visual representation of a participant. For example, the first communication device may comprise any one of the first capturing devices mentioned in the context of the first method aspect. Alternatively or in addition, the first communication device may comprise a (e.g., the above-mentioned) one or more depth-sensitive image sensors configured to capture the real-time 3D visual representation of a participant. Alternatively or in addition, the first communication device may comprise a wired interface (e.g., according to any one of the above-mentioned wired device-to-device communications, particularly a serial bus, optionally according to the universal serial bus type C, USB-C) or a wireless interface (e.g., according to any one of the above-mentioned wireless device-to-device communications, particularly a sidelink, optionally according to the third generation partnership project, 3GPP) to the one or more depth-sensitive image sensors.
Alternatively or in addition, the first communication device may be configured to capture the real-time values for the avatar parameters. For example, the first communication device may comprise any one of the second capturing devices mentioned in the context of the first method aspect. Alternatively or in addition, the first communication device may comprise a (e.g., the above-mentioned) one or more depth-insensitive image sensors or headsets configured to capture the real-time values for the avatar parameters. Alternatively or in addition, the first communication device may comprise a wired interface (e.g., according to any one of the above-mentioned wired device-to-device communications, particularly a serial bus, optionally according to USB-C) or a wireless interface (e.g., according to any one of the above-mentioned wireless device-to-device communications, particularly a sidelink, optionally according to 3GPP) to the one or more depth-insensitive image sensors.

The method according to the second method aspect may further comprise any feature or step disclosed in the context of the first method aspect, or a feature or step corresponding thereto.
As to a third method aspect, a method performed by a second communication device for supporting an immersive communication session with at least one first communication device is provided. The method comprises receiving a first stream in the immersive communication session from the at least one first communication device through a network entity. The first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session. The method further comprises rendering the 3D visual representation of the participant at the second device. The method further comprises receiving an avatar model of the participant of the immersive communication session from the network entity. The avatar model is generated based on the 3D visual representation of the participant for a real-time 3D computer-generated imagery (CGI) of the participant. The method further comprises receiving a second stream in the immersive communication session from the at least one first communication device through the network entity. The second stream comprises real-time values of avatar parameters controlling motions (e.g., facial expressions and/or other sensory expressions) of the avatar model of the participant. Optionally, the method further comprises rendering the CGI of the participant based on the received avatar model and the received real-time values of the avatar parameters at the second device.
The second communication device may be configured to render and/or display the real-time 3D visual representation of the participant. For example, the second communication device may comprise any one of the displaying devices mentioned in the context of the first or second method aspect. Alternatively or in addition, the second communication device may comprise a holographic beamer or head-mounted optical display (also referred to as VR glasses) for displaying the real-time 3D visual representation of the participant. Alternatively or in addition, the second communication device may comprise a wired interface (e.g., according to any one of the above-mentioned wired device-to-device communications, particularly a serial bus, optionally according to the universal serial bus type C, USB-C) or a wireless interface (e.g., according to any one of the above-mentioned wireless device-to-device communications, particularly a sidelink, optionally according to 3GPP) to the displaying device, e.g., the holographic beamer or optical head-mounted display. If the rendering of the real-time 3D visual representation is performed by the network entity (e.g., according to edge computing or cloud computing), the first stream as received at the second communication device may comprise rendered media resulting from the rendering. The rendered media resulting from the rendering may be streamed to the second communication device with the displaying device being embedded in the second communication device (e.g., a mobile phone or VR glasses). Alternatively or in addition, the rendered media resulting from the rendering may be streamed through the second communication device (e.g., a mobile phone) to the displaying device using any one of the wired or wireless communications between the second communication device (e.g., as a terminal device) and the displaying device.
Alternatively or in addition, the second communication device may be configured to render and/or display the real-time 3D CGI of the participant based on the generated avatar model and the real-time values for the avatar parameters. For example, the second communication device may comprise any one of the displaying devices for displaying the real-time 3D CGI, e.g., the holographic beamer or head-mounted optical display. Alternatively or in addition, the second communication device may use any one of the wired interfaces (e.g., a serial bus, optionally according to USB-C) or the wireless interfaces (e.g., a sidelink, optionally according to 3GPP) for displaying the real-time 3D CGI.
If the rendering of the real-time 3D CGI of the participant is performed by the network entity (e.g., according to edge computing or cloud computing) based on the generated avatar model and the real-time values for the avatar parameters, the second stream as received at the second communication device may comprise rendered media resulting from the rendering. The rendered media resulting from the rendering may be streamed to the second communication device with the displaying device being embedded in the second communication device (e.g., a mobile phone or VR glasses). Alternatively or in addition, the rendered media resulting from the rendering may be streamed through the second communication device (e.g., a mobile phone) to the displaying device using any one of the wired or wireless communications between the second communication device (e.g., as a terminal device) and the displaying device.
Alternatively or in addition, the second communication device may save the avatar model received from the network entity. The second communication device may cache and/or update the received avatar model. The second communication device may further be configured to receive a regenerated or updated avatar model, e.g., upon re-establishment of the immersive communication session. Alternatively or in addition, the second communication device may compare the updated avatar model with the saved avatar model. In case a difference between the updated avatar model and the saved avatar model is greater than a predefined threshold, the second communication device may save the avatar model (e.g., replacing the previously saved avatar model by the updated avatar model).
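For illustration only, the following minimal Python sketch shows caching of the avatar model at the second communication device and replacement of the saved model only when an update differs by more than a predefined threshold; the model representation (a flat parameter list) and the difference metric are assumptions.

```python
import math

class AvatarModelCache:
    """Caches the avatar model and replaces it only on a sufficiently large update."""

    def __init__(self, threshold: float = 0.1):
        self.threshold = threshold
        self.saved_model = None                 # e.g., a flat list of model parameters

    def on_model_received(self, model) -> bool:
        if self.saved_model is None:
            self.saved_model = list(model)      # first model of the session: save it
            return True
        difference = math.sqrt(sum((a - b) ** 2
                                   for a, b in zip(model, self.saved_model)))
        if difference > self.threshold:
            self.saved_model = list(model)      # replace the previously saved model
            return True
        return False                            # keep the cached model

cache = AvatarModelCache()
print(cache.on_model_received([0.0] * 8))       # True: first model is saved
print(cache.on_model_received([0.01] * 8))      # False: change below the threshold
```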
The method (e.g., according to the third method aspect) may further comprise decoding and/or processing the received first stream and/or the received second stream of the immersive communication session.
The method may further comprise receiving a control message indicative of switching from the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant.
The method according to the third method aspect may further comprise any feature or step disclosed in the context of the first method aspect, or a feature or step corresponding thereto.
As to a first device aspect, a network entity for supporting an immersive communication session between communication devices is provided. The network entity comprises memory operable to store instructions and processing circuitry operable to execute the instructions, such that the network entity is operable to receive a first stream in the immersive communication session from at least one first communication device among the communication devices. The first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session. The network entity is further operable to send the real-time 3D visual representation or immersive media rendered based on the real-time 3D visual representation to at least one second communication device of the communication devices. The network entity is further operable to generate an avatar model of a participant of the immersive communication session based on the 3D visual representation received from the at least one first communication device. The network entity is further operable to send the generated avatar model to the at least one second communication device. The network entity is further operable to send, in response to a shortage of resources for the first stream, a control message to the at least one first communication device. The control message is indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery (CGI) of the participant. The network entity is further operable to receive a second stream in the immersive communication session from the at least one first communication device. The second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant. The network entity is further operable to send the real-time values to the at least one second communication device.
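For illustration only, the following minimal Python sketch summarizes the above behaviour of the network entity as a single control loop; all functions are stubs standing in for the actual media pipeline and signalling, and none of the names are defined interfaces.

```python
def handle_session(first_stream, second_stream, send_to_receivers,
                   send_control, generate_avatar_model, shortage_detected):
    avatar_model = None
    for frame in first_stream:                  # real-time 3D visual representation
        send_to_receivers(frame)
        if avatar_model is None:
            avatar_model = generate_avatar_model(frame)
            send_to_receivers(avatar_model)     # provide the receivers with the model
        if shortage_detected():
            # Shortage of resources for the first stream: instruct the sender to switch.
            send_control({"switch_to": "avatar_cgi"})
            break
    for values in second_stream:                # real-time values of the avatar parameters
        send_to_receivers(values)

# Example run with trivial stand-ins for the streams and callbacks.
handle_session(first_stream=["frame-1", "frame-2"], second_stream=[[0.2, 0.8]],
               send_to_receivers=print, send_control=print,
               generate_avatar_model=lambda f: "avatar-model",
               shortage_detected=lambda: True)
```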
The network entity may be further operable to perform any one of the steps of the first method aspect.
As to a second device aspect, a first communication device for supporting an immersive communication session with at least one second communication device is provided. The first communication device comprises memory operable to store instructions and processing circuitry operable to execute the instructions, such that the first communication device is operable to send a first stream in the immersive communication session to the at least one second communication device through a network entity. The first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session. The real-time 3D visual representation enables the network entity to generate an avatar model of the participant and to send the generated avatar model to the at least one second communication device. The first communication device is further operable to receive a control message from the network entity. The control message is indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery (CGI) of the participant. The first communication device is further operable to send a second stream in the immersive communication session to the at least one second communication device through the network entity. The second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant.
The first communication device may further be operable to perform any one of the steps of the second method aspect.

As to a further second device aspect, a first communication device for supporting an immersive communication session with at least one second communication device is provided. The first communication device is configured to send a first stream in the immersive communication session to the at least one second communication device through a network entity. The first stream comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session. The real-time 3D visual representation enables the network entity to generate an avatar model of the participant and to send the generated avatar model to the at least one second communication device. The first communication device is further configured to receive a control message from the network entity. The control message is indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery (CGI) of the participant. The first communication device is further configured to send a second stream in the immersive communication session to the at least one second communication device through the network entity. The second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant.
The first communication device may further be configured to perform any one of the steps of the second method aspect.
As to a further device aspect, a system for supporting an immersive communication session is provided. The system comprises a network entity comprising a processing circuitry configured to execute the steps of the first method aspect. The system further comprises at least one first communication device comprising a processing circuitry configured to execute the steps of the second method aspect. The system further comprises at least one second communication device comprising a processing circuitry configured to execute the steps of the third method aspect.
As to a still further aspect, a communication system including a host computer (e.g., the application server or the first communication device) is provided. The host computer comprises a processing circuitry configured to provide user data, e.g., included in the first and/or second stream. The host computer further comprises a communication interface configured to forward the user data to a cellular network (e.g., the RAN and/or the base station) for transmission to a UE. A processing circuitry of the cellular network is configured to execute any one of the steps of the first method aspect. Alternatively or in addition, the UE comprises a radio interface and processing circuitry, which is configured to execute any one of the steps of the second and/or third method aspects.
The communication system may further include the UE. Alternatively, or in addition, the cellular network may further include one or more base stations configured for radio communication with the UE and/or to provide a data link between the UE and the host computer using the first and/or second method aspects.
The processing circuitry of the host computer may be configured to execute a host application, thereby providing the first and/or second data and/or any host computer functionality described herein. Alternatively, or in addition, the processing circuitry of the UE may be configured to execute a client application associated with the host application.
Any one of the network entity (e.g., a base station as a RAN node or a CN node), the communication devices (e.g., UEs), the communication system or any node or station for embodying the technique may further include any feature disclosed in the context of the method aspect, and vice versa. Particularly, any one of the units and modules disclosed herein may be configured to perform or initiate one or more of the steps of the method aspect.
Brief Description of the Drawings
Further details of embodiments of the technique are described with reference to the enclosed drawings, wherein:
Fig. 1 shows a schematic block diagram of an embodiment of a first communication device for supporting an immersive communication session;
Fig. 2 shows a schematic block diagram of an embodiment of a second communication device for supporting an immersive communication session;
Fig. 3 shows a schematic block diagram of an embodiment of a network entity for supporting an immersive communication session;
Fig. 4 shows a schematic block diagram of a system for supporting an immersive communication session;

Fig. 5 shows a flowchart for a method for supporting an immersive communication session with at least one second communication device, which method may be implementable by the first communication device of Fig. 1;
Fig. 6 shows a flowchart for a method for supporting an immersive communication session with at least one first communication device, which method may be implementable by the second communication device of Fig. 2;
Fig. 7 shows a flowchart for a method for supporting an immersive communication session between communication devices, which method may be implementable by the network entity of Fig. 3;
Fig. 8 schematically illustrates a first example of a radio network comprising embodiments of the devices of Figs. 1, 2, and 3, for performing the methods of Figs. 5, 6, and 7, respectively;
Fig. 9 schematically shows the standalone architecture referred to as 5G STandalone AR (STAR) user equipment (UE);
Fig. 10 schematically shows the cloud/edge assisted device architecture, referred to as 5G EDGe-Dependent AR (EDGAR) UE according to the prior art;
Fig. 11 shows a procedure diagram for immersive communication for AR conversational services according to the prior art;
Figs. 12a and 12b show flowcharts for an immersive communication session (e.g., a shared AR conversational experience call flow) for a receiving EDGAR UE;
Fig. 13 shows a generic example of an immersive communication session between communication devices according to the method of Fig. 7;
Fig. 14 shows the example of Fig. 13 in which the network entity, after completing the generation of the avatar model, sends the generated avatar model to the second communication device;
Fig. 15 shows the example of Fig. 13 or 14 in which the network entity regenerates the avatar model based on the 3D visual representation received from the first communication device;

Fig. 16 shows the example of any one of Figs. 13 to 15 in which the network entity, in response to a shortage of resources for the first stream, sends a control message to the first communication device indicative of switching from the first stream to the second stream of the immersive communication session;
Fig. 17 shows a data flow for avatar model generation in an immersive communication session;
Fig. 18 shows a data flow for avatar model caching in the immersive communication session;
Fig. 19 shows a data flow for the avatar model update in the immersive communication session;
Fig. 20 shows a data flow for switching to the avatar model in the immersive communication session;
Fig. 21 shows another example of an immersive communication session;
Fig. 22 shows an exemplary usage of a Standalone AR device architecture according to Fig. 9;
Fig. 23 shows an exemplary usage of a cloud/edge assisted architecture according to Fig. 10;
Fig. 24 shows an exemplary call flow for toggling between 3D real-time streams and animated avatars;
Fig. 25 shows an exemplary call flow for toggling to 3D communications based on Standalone AR device architecture;
Fig. 26 shows a call flow for toggling to avatar model for a real-time 3D CGI of the participant in the immersive communication session;
Fig. 27 shows the steps involved to switch from the 3D communications to photorealistic avatars;
Fig. 28 shows the steps involved to switch from 3D communications to photorealistic avatars in a cloud/edge assisted architecture;

Fig. 29 shows an exemplary flowchart for switching from 3D sensors to motion sensors and facial landmarks transmission at the first communication device;
Fig. 30 shows an exemplary flowchart of the procedures for the second communication device to switch from 3D real-time rendering to animated avatars;
Fig. 31 shows a schematic block diagram of a first communication device embodying the first communication device of Fig. 1;
Fig. 32 shows a schematic block diagram of a second communication device embodying the second communication device of Fig. 2;
Fig. 33 shows a schematic block diagram of the network entity of Fig. 3;
Fig. 34 schematically illustrates an example telecommunication network connected via an intermediate network to a host computer;
Fig. 35 shows a generalized block diagram of a host computer communicating via a base station or radio device functioning as a gateway with a user equipment over a partially wireless connection; and
Figs. 36 and 37 show flowcharts for methods implemented in a communication system including a host computer, a base station or radio device functioning as a gateway and a user equipment.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as a specific network environment in order to provide a thorough understanding of the technique disclosed herein. It will be apparent to one skilled in the art that the technique may be practiced in other embodiments that depart from these specific details. Moreover, while the following embodiments are primarily described for a New Radio (NR) or 5G implementation, it is readily apparent that the technique described herein may also be implemented for any other radio communication technique, including a Wireless Local Area Network (WLAN) implementation according to the standard family IEEE 802.11, 3GPP LTE (e.g., LTE-Advanced or a related radio access technique such as MulteFire), for Bluetooth according to the Bluetooth Special Interest Group (SIG), particularly Bluetooth Low Energy, Bluetooth Mesh Networking and Bluetooth broadcasting, for Z-Wave according to the Z-Wave Alliance or for ZigBee based on IEEE 802.15.4.
Moreover, those skilled in the art will appreciate that the functions, steps, units and modules explained herein may be implemented using software functioning in conjunction with a programmed microprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP) or a general-purpose computer, e.g., including an Advanced RISC Machine (ARM). It will also be appreciated that, while the following embodiments are primarily described in context with methods and devices, the invention may also be embodied in a computer program product as well as in a system comprising at least one computer processor and memory coupled to the at least one processor, wherein the memory is encoded with one or more programs that may perform the functions and steps or implement the units and modules disclosed herein.
Fig. 1 schematically illustrates a block diagram of an embodiment of a first communication device for supporting an immersive communication session with at least one second communication device. The first communication device is generically referred to by reference sign 100. The first communication device 100 may also be referred to as transmitting device, TX device, capturing device, capturing user equipment (UE), and/or sender UE.
The first communication device 100 comprises a first stream sending module 102. The first stream sending module 102 sends a first stream in the immersive communication session to the at least one second communication device through a network entity 300. The first stream in the immersive communication session comprises a real-time 3-dimensional (3D) visual representation of a participant of the immersive communication session. The real-time 3D visual representation of a participant may comprise a background or surrounding and a foreground or center. The first communication device 100 may be configured to capture the real-time 3D visual representation of a participant as the first stream.
The visual representation of the participant may be a photographic representation of the participant. The first communication device 100 may comprise one or more first capturing devices configured to capture a scene and/or a scene including the participant responsive to the establishing of the immersive communication session. The photographic representation may be captured by a camera and/or may comprise pixels or voxels of the participant. For example, the voxels may comprise pixels with depth information. Alternatively or in addition, the camera may operate in the visual and/or infrared spectrum.
The first communication device 100 may be configured to capture the real-time 3D visual representation of a participant. For example, the first communication device may comprise one or more of the depth-sensitive image sensors configured to capture the real-time 3D visual representation of a participant. Alternatively or in addition, the first communication device may comprise at least one of the wired interfaces (e.g., a serial bus, optionally according to the universal serial bus type C, USB-C) or at least one of the wireless interfaces (e.g., using Bluetooth or a sidelink, optionally according to the third generation partnership project, 3GPP) to the one or more depth-sensitive image sensors.
The real-time 3D visual representation is derived from one or more depth-sensitive image sensors at the first communication device 100. The one or more depth-sensitive image sensors comprise at least one of a light field camera, an array of angle-sensitive pixels, a sensor for light-in-flight imaging, a sensor for light detection and ranging (LIDAR), a streak camera, and a device using structured light triangulation for depth sensing.
The first stream may also be referred to as a 3D photographic media stream. Alternatively or in addition, the 3D visual representation may be a captured 3D visual representation of the participant. The first communication device 100 may encode the captured 3D visual representation of a participant in the first stream of the immersive communication session.
The first stream may enable the network entity 300 to generate an avatar model of a participant of the communication session based on the real-time 3D visual representation received from the first communication device 100. The network entity 300 may send the generated avatar model to the at least one second communication device.
The avatar model is computer-generated imagery (CGI), i.e., visual content created with computer software. The avatar model may comprise a biomechanical model for the motions of the participant of the immersive communication session. The motions of the avatar model controlled by the avatar parameters comprise at least one of gestures, facial expressions, and head motion of the participant that is encoded in the values of the avatar parameters in the received second stream. The avatar model enables rendering a graphical representation of the participant. The avatar model may be controllable by a set of avatar parameters. The avatar parameters may control the motions of the avatar model (e.g., a gesture, a facial expression or any kind of movement of the avatar). The first communication device 100 may comprise a second capturing device (e.g., one or more sensors) to capture the real-time values of the avatar parameters responsive to the control message.
The first communication device 100 may further comprise a control reception module 104 that receives a control message from the network entity 300. The control message may be indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D CGI of the participant.
The control message received from the network entity 300 may be further indicative of at least one of a type of sensors for capturing the 3D visual representation of the participant to be deactivated, a type of sensors for deriving the real-time values of the avatar parameters to be activated, sending a notification from a or the media session handler at the respective first communication device 100 to a mixed reality application at the respective first communication device 100, switching a mixed reality run-time engine at the respective first communication device 100 from capturing the 3D visual representation to deriving the real-time values of the avatar parameters, and switching one or more media access functions at the respective first communication device 100 from an immersive media encoder encoding the 3D visual representation to a motion sensor encoder encoding the real-time values of the avatar parameters.
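For illustration only, the following minimal Python sketch shows how the first communication device 100 could react to such a control message by stopping the sensors and encoder of the first stream and starting those of the second stream; the sensor and encoder objects are hypothetical stand-ins assumed to provide start and stop methods.

```python
class FirstDeviceMediaPipeline:
    """Switches the capture/encoding path in response to the control message."""

    def __init__(self, depth_sensors, motion_sensors, immersive_encoder, motion_encoder):
        self.depth_sensors = depth_sensors          # capture the 3D visual representation
        self.motion_sensors = motion_sensors        # capture values of the avatar parameters
        self.immersive_encoder = immersive_encoder  # encodes the first stream
        self.motion_encoder = motion_encoder        # encodes the second stream
        self.active_encoder = immersive_encoder     # first stream is active by default

    def on_control_message(self, message: dict):
        if message.get("switch_to") == "avatar_cgi":
            self.depth_sensors.stop()               # deactivate sensors for the 3D visual representation
            self.immersive_encoder.stop()
            self.motion_sensors.start()             # activate sensors for the avatar parameter values
            self.motion_encoder.start()
            self.active_encoder = self.motion_encoder
```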
The first communication device 100 further comprises a second stream sending module 106. The second stream sending module 106 sends a second stream of the immersive communication session to the at least one second communication device through the network entity 300. The second stream of the immersive communication session comprises real-time values of the avatar parameters controlling motions of the avatar model of the participant from the first communication device 100.
Alternatively or in addition, the first communication device may be configured to capture the real-time values for the avatar parameters. For example, the first communication device may comprise the one or more depth-insensitive image sensors configured to capture the real-time values for the avatar parameters. Alternatively or in addition, the first communication device may comprise at least one of the wired interfaces (e.g., a serial bus, optionally according to USB-C) or at least one of the wireless interfaces (e.g., Bluetooth or a sidelink, optionally according to 3GPP) to the one or more depth-insensitive image sensors.
The second stream may also be referred to as an avatar-controlling stream and/or photo-realistic avatar and/or animated avatar and/or real-time avatar.
Alternatively or in addition, the 3D visual representation may be a generated 3D visual representation. The first communication device 100 may encode the real-time stream of values of the avatar parameters of a participant in the second stream of the immersive communication session.
The first communication device 100 may further comprise one or more first image sensors. The real-time values of the avatar parameters may be derived from one or more first image sensors. The one or more first image sensors may comprise depth-insensitive image sensors and/or one or more acceleration sensors capturing facial expressions or motions of the participant at the first communication device 100. The depth-insensitive image sensor may comprise a camera for projecting the participant onto a 2-dimensional (2D) image and/or a filter for detecting facial landmarks in the 2D image of the participant.
Each of the first stream and the second stream may comprise immersive audio. The immersive audio may be unchanged during the switching.
The first communication device 100 may be in communication with the network entity 300.
Fig. 2 schematically illustrates a block diagram of an embodiment of the second communication device for supporting an immersive communication session. The second communication device is generically referred to by reference sign 200. The second communication device 200 may also be referred to as receiving device, RX device, receiver UE, and/or receiving UE.
The second communication device 200 comprises a first stream reception module 202. The first stream reception module 202 receives the first stream of the immersive communication session from the at least one first communication device 100 through the network entity 300. The first stream comprises a real-time 3D visual representation of a participant of the immersive communication session. The second communication device 200 may render the 3D visual representation of a participant of the immersive communication session.
The second communication device 200 may be configured to render and/or display the real-time 3D visual representation of the participant. For example, the second communication device 200 may comprise at least one of the displaying devices (e.g., a holographic beamer or optical head-mounted display) for displaying the real-time 3D visual representation of the participant. Alternatively or in addition, the second communication device 200 may comprise at least one of the wired interfaces (e.g., a serial bus, optionally according to the universal serial bus type C, USB-C) or at least one of the wireless interfaces (e.g., Bluetooth or a sidelink, optionally according to 3GPP) to the displaying device such as a holographic beamer or optical head-mounted display.
The second communication device 200 may decode and/or process the first stream of the immersive communication session.
The second communication device 200 further comprises an avatar model reception module 204. The avatar model reception module 204 may receive the avatar model of the participant of the immersive communication session from the network entity 300. The avatar model may be generated based on the 3D visual representation of the participant for a real-time 3D CGI of the participant. The second communication device 200 may cache the avatar model during receiving and save the received avatar model.
The second communication device 200 further comprises a second stream reception module 206. The second stream reception module 206 may receive the second stream of the immersive communication session from the at least one first communication device 100 through the network entity 300. The second stream of the immersive communication session may comprise a real-time stream of values of the avatar parameters controlling motions of the avatar model of the participant. The second communication device 200 may render the CGI of the participant based on the received avatar model and the received real-time values of the avatar parameters at the second device 200.
Alternatively or in addition, the second communication device 200 may be configured to render and/or display the CGI of the participant based on the generated avatar model and the real-time values for the avatar parameters, e.g., using any one of the displaying devices such as the holographic beamer or optical head-mounted display and/or using any one of the wired interfaces (e.g., a serial bus, optionally according to USB-C) or any one of the wireless interfaces (e.g., Bluetooth or a sidelink, optionally according to 3GPP).
Receiving the second stream at the at least one second communication device 200 may trigger the at least one second communication device to switch the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant.
The second communication device 200 may further be able to receive a regenerated or updated avatar model, e.g., upon re-establishment of the immersive communication session. Alternatively or in addition, the second communication device 200 may compare the updated avatar model with the saved avatar model. In case a difference between the updated avatar model and the saved avatar model is greater than a predefined threshold, the second communication device 200 may save the avatar model (e.g., replacing the previously saved avatar model by the updated avatar model). The second communication device 200 may update the avatar model and save the updated avatar model.
The second communication device 200 may decode and/or process the second stream of the immersive communication session.
The second communication device 200 may render the graphical representation of the avatar model of the participant of the immersive communication session.
The second communication device 200 may be in communication with the network entity 300.
Fig. 3 schematically illustrates a block diagram of an embodiment of a network entity for supporting an immersive communication session between two or more communication devices (e.g., a first communication device 100 and a second communication device 200). The network entity is generically referred to by reference sign 300. The network entity 300 may also be referred to as network or cloud.
The network entity 300 may enable or initiate an immersive communication session between the communication devices 100/200. The communication session may be triggered by the first or second communication devices 100/200, e.g., the one which communicates with the network entity 300. For example, the establishing may comprise receiving an establishing request from the respective one of the communication devices 100/200. The network entity 300 may establish the immersive communication session between the communication devices 100/200.
The network entity 300 comprises a first stream module 302. The first stream module 302 may receive a first stream in the immersive communication session from at least one first communication device 100 among the communication devices 100/200. The first stream comprises a real-time 3D visual representation of a participant of the immersive communication session. The first stream may also be referred to as 3D visual representation.
The first stream module 302 may send the first stream of the immersive communication session received from the first communication device 100 to the at least one second communication device 200. The network entity 300 may send the real-time 3D visual representation or real-time 3D immersive media rendered based on the real-time 3D visual representation to at least one second communication device 200 of the communication devices 100/200.
The network entity 300 may decode and/or process the first stream. The network entity 300 may further encode the first stream before sending it to the second communication device 200.
The network entity 300 further comprises an avatar model generation module 308. The avatar model generation module 308 may generate an avatar model of the participant of the communication session based on the 3D visual representation received from the at least one first communication device 100.
The avatar model generation module 308 may send the generated avatar model to the at least one second communication device 200. Alternatively or in addition, the avatar model generation module 308 may save the generated avatar model.
The avatar model generation module 308 may regenerate the avatar model periodically and/or after re-establishing the immersive communication session. The avatar model generation module 308 may compare the regenerated avatar model with the previously saved avatar model. In response to a change in appearance in avatar model of the participant being greater than a predefined threshold, the module 308 may update the avatar model. The avatar model generation module 308 may send the updated (e.g., re-generated) avatar model to the at least one second communication device 200.
The change (e.g., the change in the 3D visual representation being greater than a predefined threshold) in appearance may be detected by means of a filter. The filter may be applied to the first stream of the immersive communication session. The filter may be invariant under rotation and/or translation of the participant. Alternatively or in addition, the filter may be invariant under facial expressions or any motion that corresponds to a change in the avatar parameters.
The network entity 300 may monitor the 3D visual representation of the received first stream for a change in the avatar model of the participant of the immersive communication session.
The change (e.g., change in appearance) may be a significant change in appearance. For example, the filter may output a scalar or a vector that is indicative of the change. The change may trigger the updating of the avatar model if the scalar is greater than a predefined threshold or if a magnitude of the vector is greater than a predefined threshold or if a scalar product between the vector of the change and a predefined reference vector is greater than a predefined threshold. For example, the change in appearance may be a change in outfit, hairstyle, accessories, glasses, etc.
The network entity 300, in response to a shortage of resources for the first stream, may send a control message to the at least one first communication device 100. The control message may be indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D CGI of the participant. The switching from the first stream to the second stream can mitigate the shortage of the resources for the first stream.
The avatar model may be generated by a or the data media application server, AS, of the network entity 300. The shortage for switching may be determined by a or the control media application function, AF, of the network entity 300.
The control message sent to the at least one first communication device 100 may be indicative of handing over control from a or the control media AF at the network entity 300 to a media session handler at the respective first communication device 100. The shortage of resources for the first stream may be due to a reduction of (e.g., available) resources at the network entity 300 and/or an increase of a resource requirement of the first stream. The shortage of resources may occur anywhere along a communication path of the first stream. The shortage of resources may comprise shortage of radio resources of a radio access network (RAN) providing radio access to the at least one first communication device 100 and/or the at least one second communication device 200. The shortage of resources may further comprise a shortage of transport capacity of the network entity, and/or shortage of computational resources for rendering of the real-time 3D immersive media based on the first stream. The resources for the first stream may comprise at least one of computational resources for processing the first stream and radio resources for a radio communication of the immersive communication session.
The transport capacity may correspond to a (e.g., maximum) bit rate provided by the network entity. Alternatively or in addition, the switching is triggered if a bit rate required by the first stream is greater than the maximum bit rate of the network entity.
The shortage of resources at the network entity may comprise a shortage of computation resources for the rendering of the real-time data. For example, shortage of resources may include shortage of computational resources for rendering a scene composed of the multiple real-time 3D visual representations of the multiple participants.
The shortage of the resources may be a shortage of computational resources for composing multiple real-time 3D visual representations of multiple participants of the immersive communication received from the multiple first communication devices 100. For example, the composing may comprise arranging the multiple real-time 3D visual representations in the same 3D space. Alternatively or in addition, the shortage of resources at the network entity comprises a shortage of computation resources for the rendering of the real-time data.
The shortage of resources is determined based on network state information at the network entity 300. The network state information may be received periodically and/or event-driven.
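A minimal sketch of how such a shortage determination could combine the network state information with the resource types named above is given below; the field names and load thresholds are illustrative assumptions only.

```python
def shortage_detected(network_state: dict, first_stream: dict) -> bool:
    """Hypothetical check combining transport, radio and compute resources.
    All dictionary keys are assumptions for illustration."""
    return (
        first_stream["required_bit_rate_bps"] > network_state["available_bit_rate_bps"]
        or network_state["radio_prb_load"] > 0.9    # shortage of RAN radio resources
        or network_state["render_gpu_load"] > 0.9   # shortage of rendering/composition compute
    )
```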
The network entity 300 may optionally send a control message to the second communication device 200 in response to the shortage of resources for the first stream. The control message may be indicative of the switching from the first stream (e.g., the real-time 3D visual representation of the participant within the immersive communication session) to the second stream of the immersive communication session (e.g., avatar model for the real-time 3D CGI of the participant).
The control message sent to the at least one second communication device 200 may be further indicative of at least one of a mixed reality run-time engine for pose correction and/or rendering of the real-time 3D visual representation of the participant to be deactivated, a mixed reality scene manager for pose correction and/or rendering of the real-time 3D CGI based on the generated avatar model and the real-time values of the avatar parameters to be activated, handing over control from a or the control media AF at the network entity 300 to a media session handler at the respective second communication device 200, sending a notification from a or the media session handler at the respective second communication device 200 to a mixed reality application at the respective second communication device 200, switching a mixed reality run-time engine at the respective second communication device 200 from rendering the real-time 3D visual representation to rendering the real-time 3D CGI based on the avatar model and the real-time values of the avatar parameters, switching one or more media access functions at the respective second communication device 200 from an immersive media decoder decoding the 3D visual representation to a motion sensor decoder decoding the real-time values of the avatar parameters.
The control message may be sent to a or the media session handler of the at least one first communication device 100 and/or a or the media session handler of the at least one second communication device 200.
The network entity 300 comprises a second stream module 318. The second stream module 318 may receive a second stream of the immersive media from the first communication device 100. The second stream module 318 may send the second stream of the immersive communication session to the second communication device 200. The network entity 300 may decode and/or process the second stream. The network entity 300 may further encode the second stream before sending to the second communication device 200. The second stream may comprise real-time values of avatar parameters controlling motions of the avatar model of the participant. The second stream may also be referred to as an avatar-controlling stream. Alternatively or in addition, the 3D visual representation may be a generated 3D visual representation.
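Purely as an illustration of what one sample of the avatar-controlling (second) stream might carry, a minimal data structure is sketched below; the concrete parameter set (head pose quaternion, blink and mouth values, named facial landmarks) is an assumption for illustration and not prescribed by the present disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class AvatarParameterFrame:
    """One real-time sample of the avatar-controlling (second) stream."""
    timestamp_ms: int
    head_pose: Tuple[float, float, float, float]   # quaternion (w, x, y, z)
    facial_landmarks: Dict[str, Tuple[float, float, float]] = field(default_factory=dict)
    blink_left: float = 0.0    # 0.0 = open, 1.0 = closed
    blink_right: float = 0.0
    mouth_open: float = 0.0
```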
The first stream and the second stream may be received from multiple first communication devices 100. The participants represented by the respective streams may be arranged in the same 3D space. The first stream and the second stream may be sent from the network entity 300 to multiple communication devices 200.
The first stream, as well as the second stream in combination with the avatar model, may enable displaying the participant (e.g., within the 3D space) immersively. Alternatively or in addition, the expression "immersive media" may be an umbrella term that encompasses the 3D visual representation of a participant comprised in the first stream and/or the 3D CGI resulting from the avatar model in combination with the values of the avatar parameters comprised in the second stream. For example, each of the first stream and the second stream may be referred to as immersive media after providing the at least one second communication device with the avatar model. Each of the first stream and the second stream may further comprise immersive audio. The immersive audio may be unchanged during the switching.
The control message sent to the at least one first communication device 100 may trigger one or more sensors at the at least one first communication device 100 to capture the real-time values of the avatar parameters. The control message sent to the at least one second communication device 200 triggers rendering the real-time 3D CGI of the participant based on the generated avatar model and the real-time values of the avatar parameters.
The network entity 300 may comprise, or be part of, at least one of a RAN, optionally wherein the RAN provides radio access to at least one of the communication devices, a network node of a RAN, optionally wherein the network node serves at least one of the communication devices, a core network, CN, optionally wherein the CN transports the first and second streams between the communication devices and/or performs mobility management for the communication devices, a local area network (LAN), a distributed network for edge computing, and a computing center. Fig. 4 schematically illustrates a block diagram of an embodiment of a system for supporting an immersive communication session. The system is generically referred to by reference sign 700. The system for supporting an immersive communication session 700 may also be referred to as system.
The system 700 comprises at least one first communication device 100 for sending an immersive communication session according to Fig. 1, at least one second communication device 200 for receiving the immersive communication session according to Fig. 2, and a network entity 300 according to Fig. 3 for supporting an immersive communication session between the communication devices 100 and 200. The first communication device 100 may send a first stream of the immersive communication session (e.g., a photographic media stream or 3D visual representation) to the at least one second communication device 200 through the network entity 300. The first communication device 100 may send a second stream of the immersive communication session (e.g., an avatar-controlling stream or generated 3D visual representation) to the at least one second communication device 200 through the network entity 300.
The network entity 300 may generate an avatar model of a participant of the communication session based on the 3D visual representation received from the at least one first communication device 100. The network entity 300 may send the generated avatar model to the at least one second communication device 200.
The network entity 300, in response to a shortage of resources for the first stream, may send a control message to the at least one first communication device 100. The control message may be indicative of switching the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D CGI of the participant.
The second communication device 200 may receive the first stream of the immersive communication session (e.g., the photographic media stream or 3D visual representation) from the at least one first communication device 100 through the network entity 300. The second communication device 200 may receive the second stream of the immersive communication session (e.g., the avatar-controlling stream or generated 3D visual representation) from the at least one first communication device 100 through the network entity 300.
The second communication device 200 may use the first stream and/or the second stream with the received avatar model to render the immersive communication session. Fig. 5 shows an example flowchart for a method 400 performed by a first communication device 100 in an immersive communication session.
In step 402, the first communication device 100 may send the first stream in the immersive communication session to the at least one second communication device 200 through a network entity 300. The first stream may comprise a real-time 3D visual representation of a participant of the immersive communication session. The real-time 3D visual representation may enable the network entity 300 to generate an avatar model of the participant and to send the generated avatar model to the at least one second communication device 200.
In step 404 the first communication device 100 may receive a control message from the network entity 300.
The control message may be indicative of switching 406 the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D CGI of the participant.
In step 408, the first communication device 100 may send a second stream in the immersive communication session to the at least one second communication device through the network entity. The second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant.
Alternatively or in addition, the first communication device 100 may encode the first stream and the second stream in the immersive communication session before sending the first stream and the second stream to the at least one second communication device 200 through the network entity 300.
The method 400 may be performed by the first communication device 100. For example, the module 102 may perform the step 402, the module 104 may perform the step 404 (including the implicit step 406), and the module 106 may perform the step 408.
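A minimal sketch of the method 400 control flow at the first communication device 100 is given below, assuming hypothetical capture and transport helpers; it only illustrates the switch from sending the first stream (step 402) to sending the second stream (step 408) upon the control message (steps 404/406).

```python
class FirstCommunicationDevice:
    """Sketch of the method 400 control flow; all helpers are hypothetical placeholders."""

    def __init__(self, transport, capture_3d, capture_avatar_parameters):
        self.transport = transport
        self.capture_3d = capture_3d
        self.capture_avatar_parameters = capture_avatar_parameters
        self.mode = "first_stream"          # real-time 3D visual representation

    def run_once(self):
        if self.mode == "first_stream":     # step 402: send the first stream
            self.transport.send("first_stream", self.capture_3d())
        else:                               # step 408: send the avatar-controlling stream
            self.transport.send("second_stream", self.capture_avatar_parameters())

    def on_control_message(self, message):  # steps 404/406: switch on request
        if message.get("switch_to") == "avatar":
            self.mode = "second_stream"
```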
Fig. 6 shows an example flowchart for a method 500 performed by a second communication device 200 in an immersive communication session.
In step 502, the second communication device 200 may receive a first stream in the immersive communication session from the at least one first communication device 100 through a network entity 300. The first stream comprises a real-time 3D visual representation of a participant of the immersive communication session. In step 504, the second communication device 200 may render the 3D visual representation of the participant at the second communication device 200.
In step 506 the second communication device 200 may receive an avatar model of the participant of the immersive communication session from the network entity 300. The avatar model may be generated based on the 3D visual representation of the participant for a real-time 3D CGI of the participant.
Alternatively or in addition, the second communication device 200 may save the avatar model received from the network entity 300. The second communication device 200 may cache and/or update the received avatar model.
In step 508, the second communication device 200 may receive a control message indicative of switching from the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant.
In step 510, the second communication device 200 may receive a second stream in the immersive communication session from the at least one first communication device 100 through the network entity 300. The second stream may comprise real-time values of avatar parameters controlling motions of the avatar model of the participant.
In step 512, the second communication device 200 may decode and/or process the received 502 first stream and/or the received 510 second stream of the immersive communication session.
In step 514, the second communication device 200 may render the CGI of the participant based on the received 506 avatar model and the received 510 real-time values of the avatar parameters at the second device 200.
The method 500 may be performed by the second communication device 200. For example, the modules 202, 204 and 206 may perform the steps 502 to 514.
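A corresponding sketch of the method 500 at the second communication device 200 is given below; the renderer objects and message fields are illustrative assumptions.

```python
class SecondCommunicationDevice:
    """Sketch of the method 500 rendering switch; helpers are hypothetical placeholders."""

    def __init__(self, immersive_renderer, avatar_renderer):
        self.immersive_renderer = immersive_renderer
        self.avatar_renderer = avatar_renderer
        self.avatar_model = None
        self.use_avatar = False

    def on_avatar_model(self, model):               # step 506: cache/save the avatar model
        self.avatar_model = model

    def on_control_message(self, message):          # step 508: switch rendering mode
        self.use_avatar = message.get("switch_to") == "avatar"

    def on_media(self, stream_id, payload):         # steps 502/510/512/514
        if self.use_avatar and stream_id == "second_stream":
            self.avatar_renderer.render(self.avatar_model, payload)   # CGI from avatar model
        elif stream_id == "first_stream":
            self.immersive_renderer.render(payload)                   # 3D visual representation
```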
Fig. 7 shows an example flowchart for a method 600 performed by a network entity 300 for supporting an immersive communication session between communication devices. The communication devices may comprise at least one embodiment of the communication device 100 and/or at least one embodiment of the communication device 200. Furthermore, at least one or each of the communication devices may embody the functionality of both the communication device 100 and the communication device 200, for example for a participant that is also a viewer of the scene including other participants.
In step 602, the network entity 300 may establish the immersive communication session between the communication devices 100 and/or 200.
In step 604, the network entity 300 may receive a first stream in the immersive communication session from at least one first communication device 100 among the communication devices 100 and/or 200. The first stream comprises a real-time 3D visual representation of a participant of the immersive communication session.
The network entity 300 may send 606 the real-time 3D visual representation or real-time 3D immersive media rendered based on the real-time 3D visual representation to at least one second communication device 200 of the communication devices 100 and/or 200.
The network entity 300 may generate 608 an avatar model of a participant of the immersive communication session based on the 3D visual representation received 604 from the at least one first communication device 100.
In step 610, the network entity 300 may send the generated 608 avatar model to the at least one second communication device 200.
In step 612, the network entity 300, in response to a shortage of resources for the first stream, may send a control message to the at least one first communication device 100. The control message may be indicative of switching (e.g., switching step 614) the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D CGI of the participant.
In step 616, the network entity 300 may save the generated 608 avatar model. The network entity 300 may also monitor the 3D visual representation of the received 604 first stream for a change in the avatar model of the participant of the immersive communication session.
In step 618, the network entity 300 may receive a second stream in the immersive communication session from the at least one first communication device 100. The second stream comprises real-time values of avatar parameters controlling motions of the avatar model of the participant. In step 620, the network entity 300 may send the real-time values to the at least one second communication device 200.
The network entity 300 may re-generate 622 the avatar model of the participant of the immersive communication session and/or update 624 the generated and saved avatar model of the participant of the immersive communication session.
The network entity 300 may send 610 the re-generated or updated avatar model to the at least one second communication device 200. Alternatively or in addition, the avatar model may be re-generated or updated after re-establishing the immersive communication session, and/or wherein the avatar model may be regenerated 622 or updated 624 responsive to a change in the 3D visual representation being greater than a predefined threshold, and/or wherein the regenerated 622 or updated 624 avatar model may be sent responsive to a change in the avatar model being greater than a predefined threshold.
The method 600 may be performed by the network entity 300. For example, the modules 302 to 320 may perform the steps 602 to 624.
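The orchestration of the method 600 at the network entity 300 may be sketched as follows; the helper callables for avatar generation, shortage detection and sending are hypothetical placeholders.

```python
class NetworkEntity:
    """Sketch of the method 600 orchestration (steps 602 to 624)."""

    def __init__(self, generate_avatar, shortage_detected, send):
        self.generate_avatar = generate_avatar      # returns None until the model is built
        self.shortage_detected = shortage_detected
        self.send = send
        self.avatar_model = None

    def on_first_stream(self, frame, receivers):            # steps 604/606/608/610
        for rx in receivers:
            self.send(rx, "first_stream", frame)
        if self.avatar_model is None:
            self.avatar_model = self.generate_avatar(frame)
            if self.avatar_model is not None:
                for rx in receivers:
                    self.send(rx, "avatar_model", self.avatar_model)

    def on_network_state(self, state, sender, receivers):   # steps 612/614
        if self.avatar_model is not None and self.shortage_detected(state):
            self.send(sender, "control", {"switch_to": "avatar"})
            for rx in receivers:
                self.send(rx, "control", {"switch_to": "avatar"})

    def on_second_stream(self, avatar_values, receivers):   # steps 618/620
        for rx in receivers:
            self.send(rx, "second_stream", avatar_values)
```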
The technique may be applied to uplink (UL), downlink (DL) or direct communications between radio devices as the communication devices, e.g., using device-to-device (D2D) communications or sidelink (SL) communications.
Each of the communication devices 100 and/or 200 may be a radio device. Herein, any radio device may be a mobile or portable station and/or any radio device wirelessly connectable to a base station or RAN, or to another radio device. For example, the radio device may be a user equipment (UE) or a device for machine-type communication (MTC). Two or more radio devices may be configured to wirelessly connect to each other, e.g., in an ad hoc radio network or via a 3GPP SL connection.
Furthermore, any base station may be a station providing radio access, may be part of a radio access network (RAN) and/or may be a node connected to the RAN for controlling the radio access. For example, the base station may be an access point, for example a Wi-Fi access point.
In any embodiment, the 3D visual representation (i.e., 3D contents or AR contents) may be represented using meshes. A polygon mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object in 3D computer graphics and solid modeling, e.g., according to the 3GPP document TR 26.928, version 16.1.0. The faces usually comprise triangles (triangle mesh), quadrilaterals (quads), or other simple convex polygons ("n-gons"), since this simplifies rendering, but may also be more generally composed of concave polygons, or even polygons with holes. Meshes can be rendered directly on graphics processing units (GPUs) that are highly optimized for mesh-based rendering.
An overview of formats for 3D or AR contents and/or corresponding codecs can be found in clause 4.4 of the 3GPP document TR 26.998, version 1.1.1. The formats for 2D and 3D contents differ in that still image formats may be used for 2D media. The 2D media may have metadata for each image or for a sequence of images. For example, pose information describes the rendering parameter of one image. The frame rate or timestamp of each image is typically valid for a sequence of such images. In contrast, 3D meshes and point clouds consist of thousands or millions of primitives such as vertices, edges, faces, attributes and textures. Primitives are the basic elements of any volumetric representation. A vertex is a point in volumetric space, and contains position information, e.g., in terms of three axes in a Cartesian coordinate system. A vertex may have one or more attributes. Color and reflectance are typical examples of attributes. An edge is a line between two vertices. A face is a triangle or a rectangle formed by three or four vertices. The area of a face is filled by the interpolated color of the vertex attributes or from textures. An overview of compression formats is described in clause 4.4.5 of the 3GPP document TR 26.998, version 1.1.1.
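For illustration, the primitives described above may be represented by data structures such as the following sketch; the field choices (an RGB color attribute, triangle faces) are assumptions and only one of many possible encodings.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Vertex:
    position: Tuple[float, float, float]              # x, y, z in a Cartesian system
    color: Tuple[int, int, int] = (255, 255, 255)     # optional attribute (RGB)

@dataclass
class Mesh:
    """Triangle mesh built from the primitives described above; a point cloud
    can be seen as the vertices alone, without edges or faces."""
    vertices: List[Vertex] = field(default_factory=list)
    faces: List[Tuple[int, int, int]] = field(default_factory=list)   # vertex indices

    def edges(self):
        """Derive the edge list from the faces (an edge is a line between two vertices)."""
        seen = set()
        for a, b, c in self.faces:
            for e in ((a, b), (b, c), (c, a)):
                seen.add(tuple(sorted(e)))
        return sorted(seen)
```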
To overcome the bandwidth limitations and latency challenges of transmitting 3D content as point clouds, the avatar models (i.e., animated avatars) are used instead according to the step 404 or 406, or the step 508, or the step 612 or 614.
The animated avatars may be computer-based approaches that generate an avatar representation of people by creating a 3D model of a person, often using machine learning (e.g., a photo-realistic avatar, A. Richard et al.: "Audio- and Gaze-driven Facial Animation of Codec Avatars", arXiv:2008.05023v1 (2020)). Avatar approaches are based on data (e.g., captured live) from an XR device (e.g., a mobile phone) including 2D images, microphone signals and sensor information from the capturing device. Using artificial intelligence (AI) and captured features such as eye gaze or headset sensor information, a computer-generated avatar model of a person can be created and motions including facial expressions can be overlaid by means of the avatar model. Several companies and organizations such as Meta, Fraunhofer, Spatial.io, ARcall.com and Meetinvr.com are working on avatar-based communication.
Any embodiment may use AR and/or emerging media formats and/or a 5G AR device architecture, e.g., as defined by the Technical Specification Group for Service and System Aspects SA4 at 3GPP for study in the 3GPP document TR 26.998, version 1.1.1. Below Table 1 lists AR conversational use-cases that are considered in the study. The use-cases have different requirements on the network entity such as using different formats: 360 video, 3D immersive (e.g., point clouds, meshes), 2D images, and motion sensor data for animated avatars. This will have impacts on the procedures and end-to-end architecture.
Table 1: AR conversational services
The subject technique may be implemented using the above-mentioned type "Real-time 3D Communication" or an extension thereof.
Both a standalone architecture and an edge-assisted and/or cloud-assisted architecture for the AR devices are defined in the study to support different use-cases and XR applications. The avatar controlled by the second stream is less demanding on the network entity resources. Alternatively or in addition, by virtue of the session persistence, it may provide a real and immersive experience comparable to that achieved by real-time capturing and point clouds.
Network-assisted DASH is based on the concept of creating multiple representations of the same content at different bitrates, which is challenging in terms of computational complexity and transmission requirements when generating point clouds and 3D content, in particular for real-time communications. In contrast, the switching according to the subject technique can achieve session persistence under strongly varying resource availability (e.g., a varying link quality) or resource demands.
Existing procedures for AR conversational services in the 3GPP document TR 26.998, version 17.0.0 consider workflows and interfaces for immersive communication and for immersive communication and spatial mapping in a cloud. Such existing procedures may be extended according to the subject technique for data flows or procedures for switching between immersive communication and facial expressions applied to animated avatars.
Fig. 8 shows an example of the system for supporting an immersive communication session. The first communication device 100 (e.g., a user equipment or a radio device) is in the coverage cell 303 of a network node 301 (e.g., a gNB), which may be comprised in or controlled by the network entity 300. The second communication device 200 (e.g., a user equipment or a radio device) may be in another coverage cell 303 of another network node 301 (e.g., a gNB).
The first communication device 100 may send the first stream and/or the second stream of the immersive communication session via the radio network (e.g., through the network nodes 301) to the second communication device 200. The network entity 300 may be part of the network node 301 and/or a network core (not shown here). The first communication device 100 may be configured to perform any one of the steps of the method 400 described in Figs. 1 and 5.
The network entity 300 may be configured to perform the steps of the method 600 described in Figs. 3 and 7. The second communication device 200 may receive the first stream and/or the second stream of the immersive communication session via the radio network (e.g., through the network nodes 301) from the first communication device 100.
The second communication device 200 may be configured to perform the steps of the method 500 described in Figs. 2 and 6.
Fig. 9 shows an example of a standalone architecture (e.g., a fully virtualized, cloud-native architecture (CNA) that introduces new ways to develop, deploy, and manage services), referred to as 5G STandalone AR (STAR) user equipment (UE) (e.g., embodying any one of the communication devices 100 and/or 200) based on which the technique may be implemented. Fig. 9 provides a basic extension of 5G Media streaming for immersive media communications using a STAR UE, when all essential AR/MR functions in a UE are available for typical media processing use cases. In addition to media delivery, also scene description data delivery is included.
Any aspect of the technique may use at least one of the following features:
• The AR runtime includes the vision engine schematically shown in block 1 within the AR Runtime. The vision engine performs processing for AR-related localization, mapping, 6DoF pose generation, object detection, etc., i.e., simultaneous localization and mapping (SLAM), object tracking, and media data objects. The main purpose of the vision engine is to "register" the device, i.e., the different sets of data from the real and virtual world are transformed into a single world coordinate system.
• The AR Scene Manager includes the scene graph handler schematically shown in block 1, the compositor schematically shown in block 2, and the immersive media renderer (e.g., visual and audio media renderer) schematically shown in blocks 3 and 4 within the AR Scene Manager. The AR Scene Manager enables generation of one (monoscopic displays) or two (stereoscopic displays) eye buffers from the visual content, typically using GPUs.
• The Media Access Function includes the Media Session Handler that connects to, for example, 5G System network functions, typically in order to support the delivery and quality of service (QoS) requirements for the media delivery. This may include prioritization, QoS requests, edge capability discovery, etc. The Media Access Function further includes 2D codecs, immersive media decoders, scene description delivery, and content delivery schematically shown in blocks 1 to 4, respectively. Media Access Functions are divided into control on M5 (Media Session Handler and Media AF) and user data on M4 (Media Client and Media Application Server).
• Service Announcement is triggered by AR/MR Application. Service Access Information including Media Client entry or a reference to the Service Access Information is provided through the M8d interface.
Fig. 10 shows an example of a cloud-assisted and/or edge-assisted device architecture, referred to as 5G EDGe-Dependent AR (EDGAR) UE, based on which the technique may be implemented. Fig. 10 provides a basic extension of 5G Media Streaming communication for immersive media using an EDGAR UE as the communication device 100 or 200. In this context, it is expected that the edge will pre-render the media based on pose and interaction information received from the 5G EDGAR UE. It is also highlighted that the 5G EDGAR UE may consume the same media assets from an immersive media server as the STAR UE according to Fig. 9, but the communication of the edge server to this immersive server is outside of the considered 5G Media Streaming architecture.
The architecture splits the AR runtime, AR scene manager, and media access functions between the device and a cloud. For instance, the functionalities for immersive media decoding and rendering can be split between the AR device and a cloud.
The AR Runtime of the 5G EDGAR UE includes the vision engine/SLAM, pose correction and sound field mapping schematically shown in blocks 1 to 3, respectively. The lightweight scene manager of the 5G EDGAR UE includes the basic scene graph handler and compositor schematically shown in blocks 1 to 2, respectively. The media client of the media access function includes scene description delivery, content delivery and basic codecs schematically shown in blocks 1 to 3, respectively. The media AS of the media delivery functions includes scene description, decoders, encoders and content delivery schematically shown in blocks 1 to 4, respectively.
The AR/MR application includes the AR scene manager and AR functions, semantical perception, social integration, and media assets schematically shown in blocks 1 to 4, respectively. The AR scene manager of the AR/MR application includes the scene graph generator, immersive visual renderer and immersive audio renderer schematically shown in blocks 1 to 3, respectively. Any aspect of the technique may implement at least one step of the following workflow for the first stream (e.g., for AR conversational services) and/or as explained in clause 6.5.4 of the 3GPP document TR 26.998, version 1.1.1. After the signaling procedure to set up a session, at least one of the following steps may be applied (e.g., focused on the data plane):
1) The STAR UE 100 processes the immersive media to be transmitted:
a) The AR runtime function captures and processes the immersive media to be sent.
b) The AR runtime function passes the immersive media data to the AR-MTSI client.
c) The AR-MTSI client encodes the immersive media to be sent to the called party's STAR UE 200.
2) The STAR UE 100 and/or 200 has an AR call established with AR media traffic.
3) The STAR UE 200 processes the received immersive media:
a) The AR-MTSI client decodes and processes the received immersive media.
b) The AR-MTSI client passes the immersive media data to the Scene Manager.
c) The Scene Manager renders the immersive media, which includes the registration of the AR content into the real world accordingly.
The capturing (e.g., point clouds, mesh or immersive audio) may be done by an external device (e.g., camera). In that case, the processing and encoding may be done outside STAR UE (i.e., AR-MTSI client).
Fig. 11 shows a procedure for immersive communication for AR conversational services, based on which the subject technique may be implemented, e.g., based on or as an extension of clause 6.6.4 of the 3GPP document TR 26.998, version 1.1.1.
Fig. 11 describes a call flow for shared AR conversational services, wherein an edge server or cloud is creating a combined scene from multiple users (e.g., in a virtual conference room). In multi-party AR conversational services, the immersive media processing function on the cloud/network receives the uplink streams from various devices (e.g., at least one first communication device 100) and composes a scene description defining the arrangement of individual participants in a single virtual conference room. The scene description as well as the encoded media streams are delivered to each receiving participant (e.g., second communication device 200). A receiving participant's 5G STAR UE (e.g., second communication device 200) receives, decodes, and processes the 3D video and audio streams, and renders them using the received scene description and the information received from its AR Runtime, creating an AR scene of the virtual conference room with all other participants.
Herein, whenever referring to noise or a signal-to-noise ratio (SNR), a corresponding step, feature or effect is also disclosed for noise and/or interference or a signal-to-interference-and-noise ratio (SINR).
Figs. 12a and 12b show an example of a call flow for a shared AR conversational experience for a receiving EDGAR UE 200. An embodiment of any aspect may implement at least one of the following procedure steps:
1) Session Establishment (between UEs 100 and 200 and Cloud 300):
a) The AR/MR Application requests to start a session through EDGE.
b) The EDGE negotiates with the Scene Composite Generator (SCG) and the sender UE to establish the session.
c) The EDGE acknowledges the session establishment to the UE.
2) Media pipeline configuration (part of the control procedure):
a) MAF configures its pipelines.
b) EDGE configures its pipelines.
3) The AR/MR Application requests the start of the session.
Loops 4, 5, 6, and 7 are run in parallel (order not relevant):
4) AR uplink loop (the sent data is for spatial compute and localization of users and objects):
a) The AR Runtime sends the AR data to the AR/MR Application.
b) The AR/MR Application processes the data and sends it to the MAF.
c) The MAF streams up the AR data to the EDGE.
5) Shared experience loop (combining immersive media streams and spatial compute data in the Cloud):
a) Parallel to 9, the sender UE streams its media streams up to Media Delivery (MD).
b) The sender UE streams its AR data up to the Scene Graph Compositor (SGC).
c) Using the AR data from various participants, the SCG creates the composited scene.
d) The composited scene is delivered to the EDGE.
e) The media streams are delivered to the EDGE.
6) Media uplink loop (how uplink media data is provided to the cloud):
a) The AR Runtime captures the media components and processes them.
b) The AR Runtime sends the media data to the MAF.
c) The MAF encodes the media.
d) The MAF streams up the media streams to the EDGE.
7) Media downlink loop (data streaming from the cloud and rendering on the AR device):
a) The EDGE parses the scene description and media components, partially renders the scene, and creates a simple scene description as well as the media component.
b) The simplified scene is delivered to the Media Client and Scene Manager.
c) Media stream loop:
i) The pre-rendered media components are streamed to the MAF.
ii) The MAF decodes the media streams.
iii) The Scene Manager parses the basic scene description and composes the scene.
iv) The AR manager, after correcting the pose, renders the immersive scene including the registration of AR content into the real world.
Embodiments of the technique can overcome at least one of the following problems or conditions:
- Immersive communication requires a new type of content (e.g., point clouds) which demands more network resources.
- 3D captured data such as point clouds and meshes is difficult to encode (resource and time intensive), which demands more computing resources.
- Lack of network adaptation of 3D-streaming solutions to network changes often results in quality fluctuations and resource waste. The bit rate adaptation approaches typically used in streaming (e.g., dynamic adaptive streaming over HTTP (DASH)) have not been designed or studied for 3D real-time communications. A media presentation description (MPD) describes segment information (timing, URL, media characteristics like video resolution and bit rates), and can be organized in different ways such as SegmentList, SegmentTemplate, SegmentBase and SegmentTimeline, depending on the use case.
- Avatar solutions are less demanding on the network, but they lack the real and immersive experience that can be achieved by real-time capturing and point clouds.
- Network-assisted DASH is based on the concept of creating multiple representations of the same content at different bitrates. This is challenging in terms of computational complexity and transmission requirements for generating point clouds and 3D content, in particular for real-time communications.
- The avatar of existing solutions is not based on a live representation of the user, i.e., there is no live capture, no facial landmarks and no control of the avatar. Furthermore, there are no system architecture or signal descriptions of how such an approach would work for real-time AR communications.
- Existing procedures for AR conversational services as described above consider workflows and interfaces for immersive media and for immersive media and spatial mapping in a cloud. There is no existing flow or procedure for switching between immersive media and facial expressions/animated avatars.
Further detailed embodiments to overcome at least some of the above-mentioned problems are described herein below, also as a system and methods, partially with reference to Figs. 1 to 7. A system and method may switch within an immersive communication session from live (e.g., real-time) holographic captures (e.g., the first stream) to avatars (e.g., the second stream) animated according to user facial landmarks, e.g., by the communication signals, interfaces and/or XR device realizations described herein.
Fig. 13 shows a generic example of an immersive communication session (e.g., holographic AR communications) between communication devices 100 and/or 200 according to the method 600. The first communication device 100 according to Figs. 1 and 5 may capture real-time 3D data (e.g., a photographic representation or a real-time 3D visual representation) of a participant by itself or by means of connected devices (e.g., 3D cameras). The 3D data may be represented in the form of point clouds, meshes, color plus depth, or any other type of 3D data. The first communication device 100 (e.g., transmitter device) may process the captured real-time 3D data. The first communication device 100 may encode the captured real-time 3D data.
The first communication device 100 may send 402 a first stream in the immersive communication session to the second communication device 200 through the network entity 300. The first stream comprises a real-time 3D visual representation (e.g., real-time 3D data or the encoded real-time 3D data) of a participant of an immersive communication session.
The network entity 300 may receive 604 the first stream in the immersive communication session from the first communication device 100. The network entity 300 may generate 608 an avatar model of a participant of the immersive communication session based on the 3D visual representation received 604 from the first communication device 100. The avatar model may refer to a generated digital representation of a participant (known data formats include, e.g., the ".obj" format, FBX, etc.). The avatar model generation may use artificial intelligence (AI) or machine learning (ML) algorithms pre-trained with a human dataset.
The network entity 300 may send 606 the real-time 3D visual representation or the real-time 3D immersive media rendered based on the real-time 3D visual representation to the second communication device 200, simultaneously with generating the avatar model (e.g., generating the avatar model does not interrupt the transmission of the first stream). The second communication device 200 may receive 502 the first stream in the immersive communication session from the first communication device 100 through the network entity 300. The second communication device 200 may decode 512 and process the received 502 first stream of the immersive communication session. The second communication device 200 may render the received first stream and display an immersive communication session for the participant of the immersive communication session. The second communication device 200 may further comprise a device or be connected to an external device (e.g., 3D glasses) for rendering an immersive communication session.
Fig. 14 shows that the network entity (e.g., cloud or network) of Fig. 13, after completing the generation of the avatar model, may send 610 the generated avatar model to the second communication device 200 (e.g., receiver device) simultaneously with sending 606 the first stream (e.g., sending the avatar model does not interrupt the transmission of the first stream).
The second communication device 200 may receive 506 the avatar model simultaneously with receiving 502 the first stream in the immersive communication session. The second communication device 200 may cache and save the received avatar model.
Simultaneously with sending 610, receiving 506, and caching the avatar model, the pipeline may continue to capture (e.g., at the first communication device), transmit (e.g., at the first communication device and/or the network entity) and render the 3D live stream (e.g., the real-time 3D visual representation of the participant) at the second communication device 200.
Fig. 15 shows that the network entity 300 of any of Figs. 13 and 14 may re-generate 622 the avatar model within a predefined period of time based on the 3D visual representation received 604 from the at least one first communication device 100. The network entity 300 may re-generate 622 the avatar model after re-establishing the communication session. The network entity 300 may re-generate 622 the avatar model of a participant of the immersive communication session in response to a change detected in the participant by an image recognition module. The re-generating 622 of the avatar model of the participant may be performed simultaneously with sending 606 the first stream of the immersive communication session (e.g., re-generating the avatar model does not interrupt the transmission of the first stream). The change (e.g., the change in the 3D visual representation being greater than a predefined threshold) may be detected by means of a filter. For example, the filter may be applied to the first stream. The filter may be invariant under rotation and/or translation of the participant. Alternatively or in addition, the filter may be invariant under facial expressions and/or motions that correspond to a change in the avatar parameters.
The change may be a significant change in appearance (e.g., haircut, plastic surgery, glasses, change in clothes, etc.). For example, the filter may output a scalar or a vector that is indicative of the change. The change may trigger the updating of the avatar model if the scalar is greater than a predefined threshold, if a magnitude of the vector is greater than a predefined threshold, or if a scalar product between the vector of the change and a predefined reference vector is greater than a predefined threshold.
The value of the threshold may be based on perceptible changes which can be ascertained using just noticeable difference (JND) type of experiments.
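One possible realization of such a rotation- and translation-invariant filter is sketched below, under the assumption that the appearance descriptor is a normalized color histogram of the point cloud colors (which ignores point positions and therefore the participant's pose) and that the threshold is a hypothetical JND-derived value.

```python
import numpy as np

def appearance_descriptor(colors_rgb, bins=8):
    """Normalized joint color histogram of one point cloud frame; it ignores
    point positions, so it is invariant under rotation and translation."""
    hist, _ = np.histogramdd(np.asarray(colors_rgb, dtype=float),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / max(hist.sum(), 1)

def appearance_changed(reference_colors, current_colors, jnd_threshold=0.15):
    """True if the L1 distance between descriptors exceeds a (hypothetical)
    just-noticeable-difference derived threshold."""
    d_ref = appearance_descriptor(reference_colors)
    d_cur = appearance_descriptor(current_colors)
    return float(np.abs(d_ref - d_cur).sum()) > jnd_threshold
```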
The avatar model of the participant may be re-generated. For example, the avatar model may be completely re-generated or incrementally re-generated based on the existing avatar model (e.g., by modifying the existing avatar model).
The re-generated 622 avatar model of the participant may be sent to the second communication device 200 simultaneously with sending 606 the first stream of the immersive communication session (e.g., sending the re-generated avatar model does not interrupt the transmission of the first stream at the receiving second communication device 200). For example, the data packets comprising the avatar model and the data packets comprising the first stream may be sent on different channels (e.g., "simultaneously" on the level of data packets) or alternatingly.
The second communication device 200 may cache and save the received re-generated avatar model. The second communication device 200 may replace the previously generated avatar model with the re-generated avatar model (e.g., update the avatar model).
Fig. 16 shows the network entity 300 responding to a shortage of resources for the first stream (e.g., resource allocation, fairness, quality drops) or to the avatar model having reached a certain predefined quality and being cached on the second communication device 200. The network entity 300 may send 612 a control message to the first communication device 100. The control message may be indicative of switching 614 the real-time 3D visual representation of the participant (e.g., the first stream) within the immersive communication session to the avatar model (e.g., the second stream) for a real-time 3D CGI of the participant.
The network entity 300 may send a message indicative of instructions to the first communication device 100 (e.g., the capturing or sender device). The first communication device 100 may switch to capturing sensor information and detecting facial landmarks (e.g., avatar parameters) and send them (e.g., as the second stream) to the network entity 300 instead of capturing and transmitting the real-time 3D data (e.g., the first stream).
The second communication device 200 may switch to render the avatar model and animate the avatar model based on the received avatar parameters from the second stream upon receiving the second stream from the network entity 300.
The network entity 300 may further send a message indicative of instructions to the second communication device 200 (e.g., receiver device). The second communication device 200 may switch to render the avatar model and animate the avatar model based on the received avatar parameters from the second stream.
The network entity 300 may receive 618 the second stream in the immersive communication session from the first communication device 100. The second stream may comprise real-time values of avatar parameters controlling motions of the avatar model of the participant. The network entity 300 may send 620 the second stream in the immersive communication session to the second communication device 200.
The second communication device 200 may render the avatar model with the avatar parameters (e.g., sensor data, facial landmarks) to display a 3D immersive communication.
Fig. 17 shows a data flow (including signaling flow) for avatar model generation in an immersive communication session. The network entity 300 establishes the immersive communication session in steps 1 to 2.
Step 1: Call setup and control: Signaling, codec information, media formats for 3D communications.
Step 2: Quality of service (QoS) requirements in terms of bitrate and latency for 3D communications.
Step 3: Real-time 3D data is captured from a camera or capturing sensors and provided to the AR run-time.
Step 4: real-time 3D data (e.g., first stream) is passed to the immersive media codec.
Step 5: Encoding (point cloud compression/video codec).
Step 6: Transmit compressed real-time 3D data (e.g., first stream) through the network entity.
Step 6a (first of additional steps compared to existing 3GPP workflows): Feed 3D stream to application server (AS) of the network entity for generating the avatar model. This step is repeated until the model is built.
Step 7: Decode real-time 3D data (e.g., first stream) (immersive media codec on the receiver side).
Step 8: real-time 3D data (e.g., first stream) is passed to scene manager for rendering.
Step 9: real-time 3D data (e.g., first stream) is rendered.
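The sender-side part of this workflow (steps 3 to 6a) may be sketched as follows; the camera, codec, network and application-server interfaces are hypothetical placeholders.

```python
def sender_first_stream_pipeline(camera, codec, network, avatar_server, model_ready):
    """One iteration of steps 3 to 6a of the avatar model generation workflow."""
    frame = camera.capture()                       # step 3: real-time 3D data capture
    encoded = codec.encode(frame)                  # steps 4-5: immersive media codec / encoding
    network.transmit("first_stream", encoded)      # step 6: transmit through the network entity
    if not model_ready():                          # step 6a: feed the AS with the 3D stream
        avatar_server.feed(frame)                  #          until the avatar model is built
```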
Fig. 18 shows the flow for avatar model caching in the immersive communication session.
In addition to the steps 3 to 9 of Fig. 17, after the avatar generation reaches a certain level of confidence (e.g., a pre-defined quality), the avatar model can be transmitted (step 6b) to the second communication device 200 (e.g., the receiving device) and cached (step 6c) at the second communication device 200. Steps 6a, 6b, and 6c are additional in comparison with the existing 3GPP workflows. Fig. 19 shows the flow for the avatar model update in the immersive communication session.
In addition to the steps of Fig. 18, in step 6a, the avatar model is re-generated based on the real-time 3D data to reflect changes in the participant. The avatar model may be updated at regular intervals and the updated avatar model may be transmitted to the receiving device 200 in step 6b. At least the steps 6a, 6b, and 6c are additional in comparison with the existing 3GPP workflows.
Fig. 20 shows the flow for switching to the avatar model in the immersive communication session. This workflow has no corresponding existing 3GPP workflow.
Step 1: When the avatar model is generated and there is a need to switch (e.g., a trigger for the switching procedure is activated) to a lower bitrate (e.g., due to network changes, because the original bitrate allocated for the 3D stream is no longer available, or to fairly allocate resources across different streams), the control application function (AF) may decide to switch to avatar immersive communications.
Step 2: An instruction is sent 612 to the media session handler at the first communication device 100 and optionally to the second communication device 200 to switch to avatars.
Step 3: The AR application of the first communication device 100 and optionally of the second communication device 200 is notified about the change.
Step 4: Sensor and motion capture information, such as facial landmarks, is captured.
Step 5: The information is provided to sensor and motion capture codec for compression.
Step 6: The sensor and motion capture and motion information is compressed.
Step 7: The compressed information may be transmitted as the second stream to the second communication device 200 (e.g., receiving device).
In the steps 5-7, 2D codecs or other relevant codecs may be used, or this step may be skipped.
Step 8: The receiving device 200 decodes the sensor and motion capture information (e.g., the second stream).
Step 9: The decoded information is provided to the scene manager for rendering.
Step 10: The scene manager retrieves cached avatar from cache.
Step 11: The avatar is rendered based on cached model and motion information.
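A condensed sender-side sketch of steps 1 to 7 of this switching workflow is given below; all interfaces (control AF, media session handler, AR application, sensors, motion codec, network) are hypothetical placeholders.

```python
def switch_to_avatar_flow(control_af, session_handler, ar_application,
                          sensors, motion_codec, network):
    """Sender-side sketch of the switch from the first stream to the second stream."""
    if control_af.needs_lower_bitrate():                   # step 1: AF decides to switch
        session_handler.notify_switch("avatar")            # step 2: instruction to session handler
        ar_application.on_mode_changed("avatar")           # step 3: AR application notified
        landmarks = sensors.capture_facial_landmarks()     # step 4: sensor/motion capture
        payload = motion_codec.compress(landmarks)         # steps 5-6: motion capture codec
        network.transmit("second_stream", payload)         # step 7: transmit the second stream
```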
Fig. 21 shows an example of an immersive communication session (e.g., AR conferencing) and device realization for AR conferencing use-case, which may be applied in any embodiment.
A system 700 for supporting an immersive communication session (e.g., a real-time AR conferencing session) is shown in Fig. 21. The user equipment (UE) A and UE B are two parties involved in a real-time AR conferencing session using immersive audio-visual communications. UE A and UE B are located at different places and may produce immersive video using a first communication device 100 (e.g., a camera system embedded in a phone or glasses, or an external camera) and consume the received first and second streams on a second communication device 200 (e.g., AR glasses).
The two UEs agree on modes of communication during establishment of the immersive communication session, which include the real-time 3D visual representation and real-time animated avatars of a participant of the immersive communication session. The terms animated avatar and photo-realistic avatar are used interchangeably to refer to avatar representations based on augmenting information from the capturing side (e.g., the first communication device).
The network entity 300 (e.g., an edge cloud application server) may generate an avatar model during the real-time 3D communications as explained before. In addition, the edge cloud control function (e.g., an application function) provides a switch instruction to the participating parties between the two modes of operation based on resource utilization, network conditions, device capabilities, etc. The edge cloud control function receives network information such as QoS bearer information, URLLC modes, or radio information, and determines when a switch is needed, also based on the maturity of the created avatar model. In addition, other criteria such as processing requirements and energy consumption information can be used for the switch. The availability of the generated avatar model at the receiving UE can be another criterion. For instance, a decision to switch from 3D communications (e.g., the first stream) to motion sensor and facial landmark data (e.g., the second stream) is made and communicated to the UEs when the bandwidth is not sufficient to meet the QoS requirements.
Fig. 22 shows an exemplary mapping of the technique to the case of a Standalone AR (e.g., STAR) device architecture according to Fig. 9. The proposed toggling (i.e., switching) may involve at least one of the following steps, which are highlighted in Fig. 22 as dash-lined boxes (labeled in the format "X.", to be distinguished from the blocks described with reference to Fig. 9 and labeled as a number in a solid-line box):
1. Toggling switch from media AF to media session handler and AR/MR application;
2. Capturing media in AR run time such as switching between 3D immersive content to motion sensors and facial landmarks;
3. Encoding media such as switching from immersive media codecs to motion sensors;
4. Transmission of encoded media to a receiving XR device;
5. Decoding of received media using motion sensors or immersive media codecs;
6. Rendering of media in AR scene manager.
Fig. 23 shows the mapping to cloud/edge assisted architecture (e.g., EDGAR) according to Fig. 10. The steps 1 and 2 are similar to the STAR architecture in Fig. 22. In step 3, the encoding of media may be split between device and cloud due to the additional resources and hardware capabilities. The decoding and rendering of media may be split between cloud and device as well.
The call flows and procedures to switch between 3D communications and animated/photo-realistic avatars are explained in detail in the following.
Fig. 24 shows a call flow for toggling between 3D real-time streams (e.g., first stream) and animated avatars (e.g., second stream).
Step 1: UE A and UE B set up an initial communication session for immersive communications using 3D real-time captured streams. This is based on initial network conditions, e.g., during bearer setup between the two parties (e.g., the two communication devices 100/200).
Step 2: UE A transmits the real-time captured 3D data stream (e.g., the first stream). The cloud (e.g., the network entity) may process the 3D data stream to generate an avatar model and/or transmit the avatar model to UE B at later stages.
Step 3: The cloud (e.g., network entity) receives network state information (e.g., periodically or event-driven) for example from the 5G system.
Step 4: The cloud (e.g., network entity) decides to switch to headset sensor information only based on the current network conditions. The sensor information may be different from motion information and facial landmarks; sensor information can come from AR glasses, e.g., blinking, head shake/nod. In the immersive communication session (e.g., AR conferencing) both parties (e.g., users) can wear AR glasses. The facial landmarks in this case can come from AR glasses e.g., blinking, head shake/nod and some set of facial landmarks.
Step 5: The cloud (e.g., the network entity) may send an instruction to UE A indicating the switch to headset sensor information only.
Step 6: UE A may notify the cloud and UE B about the switch to headset sensor information and avatar-based rendering.
Step 7: UE A may transmit headset sensor information such as facial expressions to UE B for avatar rendering.
Fig. 25 shows a call flow for toggling (i.e., switching) to 3D communications based on the Standalone AR device architecture. During session establishment, different configurations may be configured for immersive communication between the AR device (e.g., communication devices 100 and/or 200) and the 5G core (e.g., the network entity). In this case, two different configurations may be considered: real-time 3D visual representation immersive communications and real-time photo-realistic avatar immersive communications. Such configuration may be done between the immersive media codec and the facial landmarks codecs at the AR device (e.g., communication devices 100 and/or 200) and the 5G core (e.g., the network entity), respectively. The 5G core can correspond to an IMS core or any other 5G system (e.g., 5GMS). The proposed approach is independent of the protocol realization. The communication session establishment may be realized in one or two sessions, i.e., immersive communication based on real-time 3D data capturing (e.g., the first stream) and on avatar parameter capturing (e.g., the second stream) can happen within one session, with device capabilities for both cases exchanged, or two separate sessions may be needed.
When switching to real-time 3D visual representation immersive communications (e.g., 3D communication), an instruction is sent from the media AF to the media session handler at the AR device. The receiver AR device (e.g., the second communication device) is notified about the switch. The media session handler informs the AR and/or MR application about the switch to 3D communications. The AR run-time uses 3D stream capture and provides the 3D stream (e.g., the first stream), e.g., point clouds, to the immersive media codec that compresses the stream. The compressed stream is provided to the receiver XR device where decoding and rendering take place.
Fig. 26 shows a call flow for toggling (i.e., switching) to the avatar model for a real-time 3D CGI of the participant (e.g., a photo-realistic avatar) in the immersive communication session based on the Standalone AR device architecture (STAR). The photo-realistic avatar switch instruction is provided by the media AF to the media session handler at the AR device. A notification (e.g., a message) is provided to the receiver AR device (e.g., the second communication device). The AR/MR application is notified about the switch and the AR run-time switches to motion sensor capture. The data is compressed by motion codecs and transmitted to the receiver AR device that decodes the stream and provides it to the scene manager for avatar animation (e.g., an immersive communication session using the avatar model controlled with the avatar parameters). The receiver AR device may communicate with the media session handler and the cache to retrieve the avatar model.
Fig. 27 summarizes the steps involved for toggling (i.e., switching) from 3D communications to photo-realistic avatars.
Fig. 28 summarizes the steps involved for toggling from 3D communications to photo-realistic avatars when the cloud/edge performs part of the avatar rendering (cloud/edge assisted architecture).
In practice, when the first immersive communication session is established, the cloud may create an avatar model (e.g., representation) of a participant and transmit it to the AR glasses on UE B. Later, after switching to the avatar model communication, only avatar parameters (e.g., animation signals) are transmitted, for example to mimic blinking, talking or other movements. The avatar model may also be created before the immersive communication session starts.
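The following sketch contrasts the one-time avatar model transfer with the lightweight per-frame avatar parameters that are transmitted after the switch. The payload contents and field names are illustrative assumptions.

import json

def build_avatar_model_from_3d_stream(point_cloud_frames):
    # In practice the cloud would fit a photo-realistic model; here we only
    # record how many frames contributed to it.
    return {"type": "photo_realistic_avatar", "frames_used": len(point_cloud_frames)}

def avatar_parameter_frame(blink, mouth_open, head_yaw_deg):
    # Animation signals sent after the switch; only a few bytes per frame.
    return json.dumps({"blink": blink, "mouth": mouth_open, "yaw": head_yaw_deg})

model = build_avatar_model_from_3d_stream(point_cloud_frames=[{}, {}, {}])
print("sent once to UE B:", model)
for t in range(3):
    print("per-frame update:", avatar_parameter_frame(blink=(t % 2 == 0),
                                                      mouth_open=0.1 * t,
                                                      head_yaw_deg=2 * t))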
The AR glasses on UE B (e.g., the second communication device) have the capability to switch between the two different streams, i.e., from the real-time 3D stream to the avatar and vice versa. For this purpose, two different pre-built instances are needed in practice, e.g., one game object for avatar processing and one for the 3D stream. This is possible on today's devices and rendering engines, e.g., Unity.
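A minimal sketch of such a toggle between two pre-built rendering instances is given below; the class names are illustrative and do not correspond to the API of any particular rendering engine.

class RenderInstance:
    def __init__(self, name):
        self.name = name
        self.active = False

    def set_active(self, active):
        self.active = active
        print(f"{self.name}: {'enabled' if active else 'disabled'}")

class RenderToggle:
    def __init__(self):
        self.point_cloud_instance = RenderInstance("3d-stream-renderer")
        self.avatar_instance = RenderInstance("avatar-renderer")

    def use_3d_stream(self):
        self.avatar_instance.set_active(False)
        self.point_cloud_instance.set_active(True)

    def use_avatar(self):
        self.point_cloud_instance.set_active(False)
        self.avatar_instance.set_active(True)

toggle = RenderToggle()
toggle.use_3d_stream()   # default mode
toggle.use_avatar()      # fallback when resources are short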
When the network conditions improve again (e.g., the condition for a shortage of resources no longer holds), the network entity 300 (e.g., cloud) may send a switch instruction to the UE to switch back to real-time 3D stream communication (switch up). In other words, the network entity may switch back and forth between the real-time 3D visual representation and the avatar model for a real-time 3D CGI of the participant in the immersive communication session according to the network resource status.
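One possible way to avoid rapid oscillation between the two modes is a simple hysteresis around the switch decision, as in the following sketch. The thresholds are illustrative assumptions; any of the resource criteria described herein could be used instead.

class ModeSelector:
    SWITCH_DOWN_KBPS = 15_000   # below this: fall back to avatar mode
    SWITCH_UP_KBPS = 25_000     # above this: return to real-time 3D

    def __init__(self):
        self.mode = "real_time_3d"

    def update(self, measured_kbps):
        if self.mode == "real_time_3d" and measured_kbps < self.SWITCH_DOWN_KBPS:
            self.mode = "avatar"
        elif self.mode == "avatar" and measured_kbps > self.SWITCH_UP_KBPS:
            self.mode = "real_time_3d"
        return self.mode

selector = ModeSelector()
for sample in (30_000, 12_000, 18_000, 27_000):
    print(sample, "->", selector.update(sample))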
The headset sensor information, such as head movements (e.g., in the second stream), is provided.
The call flows shown in the figures above are, without limitation, for asymmetrical communication from UE A to UE B, where UE A is capturing and transmitting content and UE B is acting as a consumption device. The procedure may also be applied for symmetrical communication between both UEs, where UE B is additionally capturing video that is sent to UE A.
In case of symmetrical immersive communication, the network entity 300 may send an instruction to both parties such that, when network conditions vary, real-time 3D communications is active on both sides, avatar-based rendering is active on both sides, or 3D communications is active on one side and avatar-based rendering is active on the other side.
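For the symmetrical case, the mode may be selected independently per direction, as the following sketch illustrates; the threshold and the direction labels are assumptions for illustration.

def select_modes(uplink_a_kbps, uplink_b_kbps, threshold_kbps=20_000):
    return {
        "A_to_B": "real_time_3d" if uplink_a_kbps >= threshold_kbps else "avatar",
        "B_to_A": "real_time_3d" if uplink_b_kbps >= threshold_kbps else "avatar",
    }

# Mixed case: 3D stays active from A to B while B falls back to its avatar.
print(select_modes(uplink_a_kbps=30_000, uplink_b_kbps=8_000))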
The approach is also applicable to multi-party immersive communications (i.e., more than two parties), e.g., between one or more first communication devices 100 and one or more second communication devices 200. The switching step may be triggered in the network entity 300 due to a shortage of resources (e.g., limited resource availability for real-time 3D stream processing at the cloud, e.g., memory, CPU/GPU).
A media application may set up (e.g., establish) the immersive communication session between the UEs (e.g., at least two parties). Sender and receiver UEs can start the communication via an application.
The control entities on both UEs may be aware of the different modes for toggling during the session.
A media player may be part of the receiver UE (e.g., AR glasses or connected to AR glasses), where the content is rendered, and optionally, displayed.
Fig. 29 describes the switching from 3D to motion sensor and facial landmark transmission at the sender UE. After receiving a signal to switch to motion sensors and facial landmarks, the sender UE switches off the depth sensor and continues with the motion sensors and facial landmarks (derived from the RGB camera, i.e., color frames). The cloud and the receiver UE are notified about the switch. The sender UE transmits the motion sensor data and facial landmarks to the network entity 300 (e.g., cloud).
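The sender-UE behavior of Fig. 29 could be sketched as follows; the sensor interfaces and landmark values are placeholders, not an actual device API.

class SenderUE:
    def __init__(self):
        self.depth_sensor_on = True
        self.motion_capture_on = False

    def on_switch_signal(self):
        self.depth_sensor_on = False      # stop producing the 3D stream
        self.motion_capture_on = True     # RGB camera and IMU remain active
        self.notify_peers("switched to motion sensors and facial landmarks")

    def capture_and_send(self):
        if self.motion_capture_on:
            landmarks = self.extract_facial_landmarks_from_rgb()
            motion = {"head_pitch": 0.02, "head_yaw": -0.01}
            self.send_to_cloud({"landmarks": landmarks, "motion": motion})

    @staticmethod
    def extract_facial_landmarks_from_rgb():
        # Placeholder: a real device would run a landmark detector on color frames.
        return [(120, 80), (130, 82), (140, 81)]

    @staticmethod
    def notify_peers(message):
        print("notify:", message)

    @staticmethod
    def send_to_cloud(payload):
        print("uplink to cloud:", payload)

ue = SenderUE()
ue.on_switch_signal()
ue.capture_and_send()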
Fig. 30 describes the procedure for the receiver UE to switch from 3D real-time rendering to animated avatars. A notification to switch to animated avatars is received from the cloud. The receiver UE switches from 3D real-time rendering (e.g., one instance in the run-time engine) to avatar-based rendering (e.g., a second instance in the run-time engine). Avatar parameters (e.g., animation signals) are received from the network entity (e.g., cloud). The animation signals are used for avatar-based rendering on the AR glasses.
A method for switching (e.g., toggling) between a real-time 3D visual representation and an animated avatar of the participant between two or more parties involved in an immersive communication session is provided. The avatar model is generated and updated by the network entity during the real-time 3D visual immersive communication session and may be used later to switch to animated-avatar immersive communication when the network conditions (e.g., resources) are not sufficient to continue with real-time 3D visual immersive communications. At any time when the network resources have improved, the network entity may switch back to the real-time 3D visual immersive communications. This solution allows immersive communication sessions to be sustained even when the network conditions are not good. In current implementations where only 3D streams are used, if the network conditions are not good, the 3D stream has to be suspended, which reduces the user's experience. The toggling is based on real-time captured data, which allows for an improved user experience compared to current implementations.
Another current implementation is avatar-only usage, which has a significantly lower user experience compared to a real-time 3D visual representation in the immersive communication session. With the proposed solution according to the first, second and third method aspects, the best user experience (e.g., real-time 3D visual representation in the immersive communication session) is provided where possible, while the lower-quality user experience (e.g., animated avatar) is provided as a fallback, which is still preferable to a scenario where no representation of the participant (sender) is provided or the immersive communication session is dropped completely. Moreover, the animated avatar representation of the participant in the immersive communication session according to the solution of the method aspects is maintained in the immersive media communication as defined in the summary section, and the updated avatar guarantees the closest possible avatar to the real photographic representation of the participant.
Fig. 31 shows a schematic block diagram for an embodiment of the first communication device 100. The first communication device 100 comprises processing circuitry, e.g., one or more processors 3104 for performing the method 400 and memory 3106 coupled to the processors 3104. For example, the memory 3106 may be encoded with instructions that implement at least one of the modules 102, 104 and 106.
The one or more processors 3104 may be a combination of one or more of a microprocessor, controller, microcontroller, central processing unit, digital signal processor, application specific integrated circuit, field programmable gate array, or any other suitable computing device, resource, or combination of hardware, microcode and/or encoded logic operable to provide, either alone or in conjunction with other components of the first communication device 100, such as the memory 3106, transmitter functionality or first communication device functionality. For example, the one or more processors 3104 may execute instructions stored in the memory 3106. Such functionality may include providing various features and steps discussed herein, including any of the benefits disclosed herein. The expression "the device being operative to perform an action" may denote the first communication device 100 being configured to perform the action.
As schematically illustrated in Fig. 31, the first communication device 100 may be embodied by a transmitting station 3100, e.g., functioning as a capturing device, a transmitting UE or a capturing UE. The transmitting station 3100 comprises a radio interface 3102 coupled to the first communication device 100 for radio communication with one or more receiving stations, e.g., functioning as a receiving base station 300 or a receiving UE 200.
Fig. 32 shows a schematic block diagram for an embodiment of the second communication device 200. The second communication device 200 comprises processing circuitry, e.g., one or more processors 3204 for performing the method 500 and memory 3206 coupled to the processors 3204. For example, the memory 3206 may be encoded with instructions that implement at least one of the modules 202, 204 and 206.
The one or more processors 3204 may be a combination of one or more of a microprocessor, controller, microcontroller, central processing unit, digital signal processor, application specific integrated circuit, field programmable gate array, or any other suitable computing device, resource, or combination of hardware, microcode and/or encoded logic operable to provide, either alone or in conjunction with other components of the second communication device 200, such as the memory 3206, receiver functionality. For example, the one or more processors 3204 may execute instructions stored in the memory 3206. Such functionality may include providing various features and steps discussed herein, including any of the benefits disclosed herein. The expression "the device being operative to perform an action" may denote the second communication device 200 being configured to perform the action.
As schematically illustrated in Fig. 32, the second communication device 200 may be embodied by a receiving device 3200, e.g., functioning as a display device or a receiving UE. The receiving device 3200 comprises a radio interface 3202 coupled to the second communication device 200 for radio communication with one or more transmitting stations, e.g., functioning as a transmitting base station 300 or a transmitting UE 100.

Fig. 33 shows a schematic block diagram for an embodiment of the network entity 300. The network entity 300 comprises processing circuitry, e.g., one or more processors 3304 for performing the method 600 and memory 3306 coupled to the processors 3304. For example, the memory 3306 may be encoded with instructions that implement at least one of the modules 302 to 320.
The one or more processors 3304 may be a combination of one or more of a microprocessor, controller, microcontroller, central processing unit, digital signal processor, application specific integrated circuit, field programmable gate array, or any other suitable computing device, resource, or combination of hardware, microcode and/or encoded logic operable to provide, either alone or in conjunction with other components of the network entity 300, such as the memory 3306, network functionality. For example, the one or more processors 3304 may execute instructions stored in the memory 3306. Such functionality may include providing various features and steps discussed herein, including any of the benefits disclosed herein. The expression "the device being operative to perform an action" may denote the network entity 300 being configured to perform the action.
As schematically illustrated in Fig. 33, the network entity 300 may be embodied by a network entity 3300, e.g., functioning as a network. The network entity 3300 comprises a radio interface 3302 coupled to the network entity 300 for (e.g., partially radio) communication with one or more transmitting stations, e.g., functioning as a transmitting base station or a transmitting UE 100 and with one or more receiving stations, e.g., functioning as a receiving base station or a receiving UE 200.
With reference to Fig. 34, in accordance with an embodiment, a communication system 3400 includes a telecommunication network 3410, such as a 3GPP-type cellular network, which comprises an access network 3411, such as a radio access network, and a core network 3414. The access network 3411 comprises a plurality of base stations 3412a, 3412b, 3412c, such as NBs, eNBs, gNBs or other types of wireless access points, each defining a corresponding coverage area 3413a, 3413b, 3413c. Each base station 3412a, 3412b, 3412c is connectable to the core network 3414 over a wired or wireless connection 3415. A first user equipment (UE) 3491 located in coverage area 3413c is configured to wirelessly connect to, or be paged by, the corresponding base station 3412c. A second UE 3492 in coverage area 3413a is wirelessly connectable to the corresponding base station 3412a. While a plurality of UEs 3491, 3492 are illustrated in this example, the disclosed embodiments are equally applicable to a situation where a sole UE is in the coverage area or where a sole UE is connecting to the corresponding base station 3412.
Any of the UEs 3491, 3492 may embody the first communication device 100 and/or the second communication device 200. Any of base stations 3412 may embody the network entity 300.
The telecommunication network 3410 is itself connected to a host computer 3430, which may be embodied in the hardware and/or software of a standalone server, a cloud-implemented server, a distributed server or as processing resources in a server farm. The host computer 3430 may be under the ownership or control of a service provider, or may be operated by the service provider or on behalf of the service provider. The connections 3421, 3422 between the telecommunication network 3410 and the host computer 3430 may extend directly from the core network 3414 to the host computer 3430 or may go via an optional intermediate network 3420. The intermediate network 3420 may be one of, or a combination of more than one of, a public, private, or hosted network; the intermediate network 3420, if any, may be a backbone network or the Internet; in particular, the intermediate network 3420 may comprise two or more sub-networks (not shown).
The communication system 3400 of Fig. 34 as a whole enables connectivity between one of the connected UEs 3491, 3492 and the host computer 3430. The connectivity may be described as an over-the-top (OTT) connection 3450. The host computer 3430 and the connected UEs 3491, 3492 are configured to communicate data and/or signaling via the OTT connection 3450, using the access network 3411, the core network 3414, any intermediate network 3420 and possible further infrastructure (not shown) as intermediaries. The OTT connection 3450 may be transparent in the sense that the participating communication devices through which the OTT connection 3450 passes are unaware of routing of uplink and downlink communications. For example, a base station 3412 need not be informed about the past routing of an incoming downlink communication with data originating from a host computer 3430 to be forwarded (e.g., handed over) to a connected UE 3491. Similarly, the base station 3412 need not be aware of the future routing of an outgoing uplink communication originating from the UE 3491 towards the host computer 3430. By virtue of the method 600 being performed by any one of the UEs 3491 or 3492 and/or any one of the base stations 3412, the performance or range of the OTT connection 3450 can be improved, e.g., in terms of increased throughput and/or reduced latency. More specifically, the host computer 3430 may indicate to the network entity 300 (e.g., the RAN or core network) or the first communication device 100 or the second communication device 200 (e.g., on an application layer) the QoS of the traffic. Based on the QoS indication the network entity may toggle (i.e., switch) between the real-time 3D visual representation and real-time animated avatar of the participant of the communication session.
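As a purely illustrative sketch of the QoS-driven toggle mentioned above, a simple mapping from an indicated QoS class to a representation mode could look as follows; the QoS class names and the mapping are assumptions, not a 3GPP-defined table.

QOS_TO_MODE = {
    "premium_xr": "real_time_3d",
    "best_effort": "avatar",
}

def on_qos_indication(qos_class):
    # Unknown or degraded classes conservatively fall back to the avatar mode.
    return QOS_TO_MODE.get(qos_class, "avatar")

print(on_qos_indication("premium_xr"))   # -> real_time_3d
print(on_qos_indication("best_effort"))  # -> avatar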
Example implementations, in accordance with an embodiment of the UE, base station and host computer discussed in the preceding paragraphs, will now be described with reference to Fig. 35. In a communication system 3500, a host computer 3510 comprises hardware 3515 including a communication interface 3516 configured to set up and maintain a wired or wireless connection with an interface of a different communication device of the communication system 3500. The host computer 3510 further comprises processing circuitry 3518, which may have storage and/or processing capabilities. In particular, the processing circuitry 3518 may comprise one or more programmable processors, application-specific integrated circuits, field programmable gate arrays or combinations of these (not shown) adapted to execute instructions. The host computer 3510 further comprises software 3511, which is stored in or accessible by the host computer 3510 and executable by the processing circuitry 3518. The software 3511 includes a host application 3512. The host application 3512 may be operable to provide a service to a remote user, such as a UE 3530 connecting via an OTT connection 3550 terminating at the UE 3530 and the host computer 3510. In providing the service to the remote user, the host application 3512 may provide user data, which is transmitted using the OTT connection 3550. The user data may depend on the location of the UE 3530. The user data may comprise auxiliary information or precision advertisements (also: ads) delivered to the UE 3530. The location may be reported by the UE 3530 to the host computer, e.g., using the OTT connection 3550, and/or by the base station 3520, e.g., using a connection 3560.
The communication system 3500 further includes a base station 3520 provided in a telecommunication system and comprising hardware 3525 enabling it to communicate with the host computer 3510 and with the UE 3530. The hardware 3525 may include a communication interface 3526 for setting up and maintaining a wired or wireless connection with an interface of a different communication device of the communication system 3500, as well as a radio interface 3527 for setting up and maintaining at least a wireless connection 3570 with a UE 3530 located in a coverage area (not shown in Fig. 35) served by the base station 3520. The communication interface 3526 may be configured to facilitate a connection 3560 to the host computer 3510. The connection 3560 may be direct, or it may pass through a core network (not shown in Fig. 35) of the telecommunication system and/or through one or more intermediate networks outside the telecommunication system. In the embodiment shown, the hardware 3525 of the base station 3520 further includes processing circuitry 3528, which may comprise one or more programmable processors, application-specific integrated circuits, field programmable gate arrays or combinations of these (not shown) adapted to execute instructions. The base station 3520 further has software 3521 stored internally or accessible via an external connection.
The communication system 3500 further includes the UE 3530 already referred to. Its hardware 3535 may include a radio interface 3537 configured to set up and maintain a wireless connection 3570 with a base station serving a coverage area in which the UE 3530 is currently located. The hardware 3535 of the UE 3530 further includes processing circuitry 3538, which may comprise one or more programmable processors, application-specific integrated circuits, field programmable gate arrays or combinations of these (not shown) adapted to execute instructions. The UE 3530 further comprises software 3531, which is stored in or accessible by the UE 3530 and executable by the processing circuitry 3538. The software 3531 includes a client application 3532. The client application 3532 may be operable to provide a service to a human or non-human user via the UE 3530, with the support of the host computer 3510. In the host computer 3510, an executing host application 3512 may communicate with the executing client application 3532 via the OTT connection 3550 terminating at the UE 3530 and the host computer 3510. In providing the service to the user, the client application 3532 may receive request data from the host application 3512 and provide user data in response to the request data. The OTT connection 3550 may transfer both the request data and the user data. The client application 3532 may interact with the user to generate the user data that it provides.
It is noted that the host computer 3510, base station 3520 and UE 3530 illustrated in Fig. 35 may be identical to the host computer 3430, one of the base stations 3412a, 3412b, 3412c and one of the UEs 3491, 3492 of Fig. 34, respectively. This is to say, the inner workings of these entities may be as shown in Fig. 35, and, independently, the surrounding network topology may be that of Fig. 34.
In Fig. 35, the OTT connection 3550 has been drawn abstractly to illustrate the communication between the host computer 3510 and the UE 3530 via the base station 3520, without explicit reference to any intermediary devices and the precise routing of messages via these devices. Network infrastructure may determine the routing, which it may be configured to hide from the UE 3530 or from the service provider operating the host computer 3510, or both. While the OTT connection 3550 is active, the network infrastructure may further take decisions by which it dynamically changes the routing (e.g., on the basis of load balancing consideration or reconfiguration of the network).
The wireless connection 3570 between the UE 3530 and the base station 3520 is in accordance with the teachings of the embodiments described throughout this disclosure. One or more of the various embodiments improve the performance of OTT services provided to the UE 3530 using the OTT connection 3550, in which the wireless connection 3570 forms the last segment. More precisely, the teachings of these embodiments may reduce the latency and improve the data rate and thereby provide benefits such as better responsiveness and improved QoS.
A measurement procedure may be provided for the purpose of monitoring data rate, latency, QoS and other factors on which the one or more embodiments improve. There may further be an optional network functionality for reconfiguring the OTT connection 3550 between the host computer 3510 and UE 3530, in response to variations in the measurement results. The measurement procedure and/or the network functionality for reconfiguring the OTT connection 3550 may be implemented in the software 3511 of the host computer 3510 or in the software 3531 of the UE 3530, or both. In embodiments, sensors (not shown) may be deployed in or in association with communication devices through which the OTT connection 3550 passes; the sensors may participate in the measurement procedure by supplying values of the monitored quantities exemplified above, or supplying values of other physical quantities from which software 3511, 3531 may compute or estimate the monitored quantities. The reconfiguring of the OTT connection 3550 may include message format, retransmission settings, preferred routing etc.; the reconfiguring need not affect the base station 3520, and it may be unknown or imperceptible to the base station 3520. Such procedures and functionalities may be known and practiced in the art. In certain embodiments, measurements may involve proprietary UE signaling facilitating the host computer's 3510 measurements of throughput, propagation times, latency and the like. The measurements may be implemented in that the software 3511, 3531 causes messages to be transmitted, in particular empty or "dummy" messages, using the OTT connection 3550 while it monitors propagation times, errors etc.
Fig. 36 is a flowchart illustrating a method implemented in a communication system, in accordance with one embodiment. The communication system includes a host computer, a base station and a UE which may be those described with reference to Figs. 34 and 35. For simplicity of the present disclosure, only drawing references to Fig. 36 will be included in this paragraph. In a first step 3610 of the method, the host computer provides user data. In an optional substep 3611 of the first step 3610, the host computer provides the user data by executing a host application. In a second step 3620, the host computer initiates a transmission carrying the user data to the UE. In an optional third step 3630, the base station transmits to the UE the user data which was carried in the transmission that the host computer initiated, in accordance with the teachings of the embodiments described throughout this disclosure. In an optional fourth step 3640, the UE executes a client application associated with the host application executed by the host computer.
Fig. 37 is a flowchart illustrating a method implemented in a communication system, in accordance with one embodiment. The communication system includes a host computer, a base station and a UE which may be those described with reference to Figs. 34 and 35. For simplicity of the present disclosure, only drawing references to Fig. 37 will be included in this paragraph. In a first step 3710 of the method, the host computer provides user data. In an optional substep (not shown) the host computer provides the user data by executing a host application. In a second step 3720, the host computer initiates a transmission carrying the user data to the UE. The transmission may pass via the base station, in accordance with the teachings of the embodiments described throughout this disclosure. In an optional third step 3730, the UE receives the user data carried in the transmission.
As has become apparent from above description, at least some embodiments of the technique allow for immersive communication sessions to be sustained even when the network conditions are not good. In current implementations where only 3D streams are used, if the network conditions are not good, the 3D stream will have to be suspended which reduces the user's experience. Many advantages of the present invention will be fully understood from the foregoing description, and it will be apparent that various changes may be made in the form, construction and arrangement of the units and devices without departing from the scope of the invention and/or without sacrificing all of its advantages. Since the invention can be varied in many ways, it will be recognized that the invention should be limited only by the scope of the following claims.

Claims
1. A method (600) performed by a network entity (300) for supporting an immersive communication session between communication devices (100, 200), the method (600) comprising: receiving (604) a first stream in the immersive communication session from at least one first communication device (100) among the communication devices (100, 200), the first stream comprising a real-time 3-dimensional, 3D, visual representation of a participant of the immersive communication session, and sending (606) the real-time 3D visual representation or real-time 3D immersive media rendered based on the real-time 3D visual representation to at least one second communication device (200) of the communication devices (100, 200); generating (608) an avatar model of the participant of the immersive communication session based on the 3D visual representation received (604) from the at least one first communication device (100), and sending (610) the generated (608) avatar model to the at least one second communication device (200); in response to a shortage of resources for the first stream, sending (612) a control message to the at least one first communication device (100), the control message being indicative of switching (614) the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery, CGI, of the participant; and receiving (618) a second stream in the immersive communication session from the at least one first communication device (100), the second stream comprising real-time values of avatar parameters controlling motions of the avatar model of the participant, and sending (620) the real-time values to the at least one second communication device (200).
2. The method (600) of claim 1, the method (600) further comprising establishing (602) the immersive communication session between the communication devices (100, 200).
3. The method (600) of claim 1 or 2, wherein the network entity (300) comprises at least one of a control media access function, AF, and a data media application server, AS.
4. The method (600) of any one of claims 1 to 3, wherein the shortage of resources occurs along a communication path of the first stream, and/or wherein the shortage of resources includes at least one of: a shortage of radio resources of a radio access network, RAN, providing radio access to the at least one first communication device, a shortage of radio resources of a or the RAN providing radio access to the at least one second communication device, a shortage of a transport capacity of the network entity, and a shortage of computational resources for rendering of the real-time 3D immersive media based on the first stream.
5. The method (600) of any one of claims 1 to 4, wherein multiple first streams are received (604) from multiple first communication devices (100) among the communication devices (100, 200), and wherein the resources of the shortage include computational resources for composing multiple real-time 3D visual representations of multiple participants of the immersive communication session received from the multiple first communication devices.
6. The method (600) of any one of claims 1 to 5, further comprising, in response to the shortage of resources for the first stream, sending (612) the control message to the at least one second communication device (200), or wherein the second stream is indicative of switching (614) the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant.
7. The method (600) of any one of claims 1 to 6, wherein the control message sent (612) to the at least one first communication device (100) triggers one or more sensors at the at least one first communication device (100) to capture the real-time values of the avatar parameters, and/or wherein the control message sent (612) to the at least one second communication device (200) triggers rendering the real-time 3D CGI of the participant based on the generated (608) avatar model and the real-time values of the avatar parameters.
8. The method (600) of any one of claims 1 to 7, further comprising at least one of: saving (616) the generated (608) avatar model; and monitoring the 3D visual representation of the received (604) first stream for a change in the avatar model of the participant of the immersive communication session.
9. The method (600) of any one of claims 1 to 8, wherein the network entity (300) comprises, or is part of, at least one of: a radio access network, RAN, providing radio access to at least one of the communication devices; a network node of a RAN, the network node serving at least one of the communication devices; a core network, CN, transporting the first and second streams between the communication devices and/or performing mobility management for the communication devices; a local area network, LAN; a distributed network for edge computing; and a computing center.
10. The method (600) of any one of claims 1 to 9, wherein the real-time 3D visual representation of the participant is at least one of encoded and compressed in the received (604) first stream.
11. The method (600) of any one of claims 1 to 10, wherein the avatar model comprises a biomechanical model for the motions of the participant of the immersive communication session.
12. The method (600) of any one of claims 1 to 11, wherein the motions of the avatar model controlled by the avatar parameters comprise at least one of gestures, facial expressions, and head motion of the participant, that is encoded in the values of the avatar parameters in the received (618) second stream.
13. The method (600) of any one of claims 1 to 12, further comprising at least one of: re-generating (622) the avatar model of the participant of the immersive communication session or updating (624) the generated and saved avatar model of the participant of the immersive communication session; and sending (610) the re-generated or updated avatar model to the at least one second communication device (200), wherein the avatar model is re-generated or updated after re-establishing the immersive communication session, and/or wherein the avatar model is regenerated (622) or updated (624) responsive to a change in the 3D visual representation being greater than a predefined threshold, and/or wherein the regenerated (622) or updated (624) avatar model is sent responsive to a change in the avatar model being greater than a predefined threshold.
14. The method (600) of any one of claims 1 to 13, wherein the avatar model or an update of the avatar model is sent (610) simultaneously with the sending (606) of the real-time 3D visual representation, or upon establishing or re-establishing of the immersive communication session, to the at least one second communication device (200).
15. The method (600) of any one of claims 1 to 14, wherein the real-time values of the avatar parameters are derived from one or more depth-insensitive image sensors and/or one or more acceleration sensors capturing facial expressions or motions of the participant at the first communication device (100), wherein the one or more depth-insensitive image sensors comprise at least one of: a camera for projecting light of the participant onto a 2-dimensional, 2D, image; and a filter for detecting facial landmarks in the 2D image of the participant.
16. The method (600) of any one of claims 1 to 15, wherein the real-time 3D visual representation is derived from one or more depth-sensitive image sensors at the first communication device (100), optionally wherein the one or more depthsensitive image sensors comprise at least one of: a light field camera; an array of angle-sensitive pixels; a sensor for light-in- flight imaging; a sensor for light detection and ranging, LIDAR; a streak camera; and a device using structured light triangulation for depth sensing.
17. The method (600) of any one of claims 1 to 16, wherein the control message sent (614) to the at least one first communication device (100) is further indicative of at least one of: a type of sensors for capturing the 3D visual representation of the participant to be deactivated; a type of sensors for deriving the real-time values of the avatar parameters to be activated; handing over control from a or the control media AF at the network entity (300) to a media session handler at the respective first communication device (100); sending a notification from a or the media session handler at the respective first communication device (100) to a mixed reality application at the respective first communication device (100); switching a mixed reality run-time engine at the respective first communication device (100) from capturing the 3D visual representation to deriving the real-time values of the avatar parameters; and switching one or more media access functions at the respective first communication device (100) from an immersive media encoder encoding the 3D visual representation to a motion sensor encoder encoding the real-time values of the avatar parameters.
18. The method (600) of any one of claims 1 to 17, wherein the control message sent (614) to the at least one second communication device (200) is further indicative of at least one of: a mixed reality run-time engine for pose correction and/or rendering of the real-time 3D visual representation of the participant to be deactivated; a mixed reality scene manager for pose correction and/or rendering of the real-time 3D CGI based on the generated (608) avatar model and the real-time values of the avatar parameters to be activated; handing over control from a or the control media AF at the network entity (300) to a media session handler at the respective second communication device (200); sending a notification from a or the media session handler at the respective second communication device (200) to a mixed reality application at the respective second communication device (200); switching a mixed reality run-time engine at the respective second communication device (200) from rendering the real-time 3D visual representation to rendering the real-time 3D CGI based on the avatar model and the real-time values of the avatar parameters; switching one or more media access functions at the respective second communication device (200) from an immersive media decoder decoding the 3D visual representation to a motion sensor decoder decoding the real-time values of the avatar parameters.
19. The method (600) of claims 17 or 18, wherein the control message is sent to a or the media session handler of the at least one first communication device (100) and/or a or the media session handler of the at least one second communication device (200).
20. The method (600) of any one of claims 1 to 19, wherein each of the first stream and the second stream further comprises immersive audio, wherein the immersive audio is unchanged during the switching.

21. The method (600) of any one of claims 1 to 20, wherein the shortage of resources is determined based on network state information at the network entity (300), wherein the network state information is received periodically or event-driven.
22. The method (600) of any one of claims 1 to 21, wherein the avatar model is generated (608) by a or the data media application server, AS, of the network entity (300); and/or wherein the shortage for switching is determined by a or the control media application function, AF, of the network entity (300).
23. A method (400) performed by a first communication device (100) for supporting an immersive communication session with at least one second communication device (200), the method (400) comprising: sending (402) a first stream in the immersive communication session to the at least one second communication device (200) through a network entity (300), the first stream comprising a real-time 3-dimensional, 3D, visual representation of a participant of the immersive communication session, the real-time 3D visual representation enabling the network entity (300) to generate an avatar model of the participant and to send the generated avatar model to the at least one second communication device (200); receiving (404) a control message from the network entity (300), the control message being indicative of switching (406) the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery, CGI, of the participant; and sending (408) a second stream in the immersive communication session to the at least one second communication device (200) through the network entity (300), the second stream comprising real-time values of avatar parameters controlling motions of the avatar model of the participant.
24. The method (400) of claim 23, further comprising any feature or step of any one of the claims 1 to 23, or a feature or step corresponding thereto.
25. A method (500) performed by a second communication device (200) for supporting an immersive communication session with at least one first communication device (100), the method (500) comprising: receiving (502) a first stream in the immersive communication session from the at least one first communication device (100) through a network entity (300), the first stream comprising a real-time 3-dimensional, 3D, visual representation of a participant of the immersive communication session, and rendering (504) the 3D visual representation of the participant at the second device (200); receiving (506) an avatar model of the participant of the immersive communication session from the network entity (300), the avatar model being generated based on the 3D visual representation of the participant for a real-time 3D computer-generated imagery, CGI, of the participant; and receiving (510) a second stream in the immersive communication session from the at least one first communication device (100) through the network entity (300), the second stream comprising real-time values of avatar parameters controlling motions of the avatar model of the participant, and rendering (514) the CGI of the participant based on the received (506) avatar model and the received (510) real-time values of the avatar parameters at the second device (200).
26. The method (500) of claim 25, further comprising decoding (512) and/or processing the received (502) first stream and/or the received (510) second stream of the immersive communication session.

27. The method (500) of claim 25 or 26, further comprising receiving (508) a control message indicative of switching from the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for the real-time 3D CGI of the participant.
28. The method (500) of any one of claims 25 to 27, further comprising any feature or step of any one of the claims 1 to 22, or a feature or step corresponding thereto.
29. A network entity (300) for supporting an immersive communication session between communication devices (100, 200), the network entity (300) comprising memory operable to store instructions and processing circuitry operable to execute the instructions, such that the network entity (300) is operable to: receive (604) a first stream in the immersive communication session from at least one first communication device (100) among the communication devices (100, 200), the first stream comprising a real-time 3-dimensional, 3D, visual representation of a participant of the immersive communication session, and send (606) the real-time 3D visual representation or immersive media rendered based on the real-time 3D visual representation to at least one second communication device (200) of the communication devices (100, 200); generate (608) an avatar model of a participant of the immersive communication session based on the 3D visual representation received (604) from the at least one first communication device (100), and send (610) the generated (608) avatar model to the at least one second communication device (200); in response to a shortage of resources for the first stream, send (612) a control message to the at least one first communication device (100), the control message being indicative of switching (614) the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery, CGI, of the participant; and receive (618) a second stream in the immersive communication session from the at least one first communication device (100), the second stream comprising real-time values of avatar parameters controlling motions of the avatar model of the participant, and send (620) the real-time values to the at least one second communication device (200).
30. The network entity (300) of claim 29, further operable to perform the steps of any one of claims 2 to 22.
31. A first communication device (100) for supporting an immersive communication session with at least one second communication device (200), the first communication device (100) comprising memory operable to store instructions and processing circuitry operable to execute the instructions, such that the first communication device (100) is operable to: send (402) a first stream in the immersive communication session to the at least one second communication device (200) through a network entity (300), the first stream comprising a real-time 3-dimensional, 3D, visual representation of a participant of the immersive communication session, the real-time 3D visual representation enabling the network entity (300) to generate an avatar model of the participant and to send the generated avatar model to the at least one second communication device (200); receive (404) a control message from the network entity (300), the control message being indicative of switching (406) the real-time 3D visual representation of the participant within the immersive communication session to the avatar model for a real-time 3D computer-generated imagery, CGI, of the participant; and send (408) a second stream in the immersive communication session to the at least one second communication device (200) through the network entity (300), the second stream comprising real-time values of avatar parameters controlling motions of the avatar model of the participant.
32. The first communication device (100) of claim 31, further operable to perform the steps of claim 24.
33. A second communication device (200) for supporting an immersive communication session with at least one first communication device (100), the second communication device (200) comprising memory operable to store instructions and processing circuitry operable to execute the instructions, such that the second communication device (200) is operable to: receive (502) a first stream in the immersive communication session from the at least one first communication device (100) through a network entity (300), the first stream comprising a real-time 3-dimensional, 3D, visual representation of a participant of the immersive communication session, and render (504) the 3D visual representation of the participant at the second device (200); receive (506) an avatar model of the participant of the immersive communication session from the network entity (300), the avatar model being generated based on the 3D visual representation of the participant for a real-time 3D computer-generated imagery, CGI, of the participant; and receive (510) a second stream in the immersive communication session from the at least one first communication device (100) through the network entity (300), the second stream comprising real-time values of avatar parameters controlling motions of the avatar model of the participant, and render (514) the CGI of the participant based on the received (506) avatar model and the received (510) real-time values of the avatar parameters at the second device (200).
34. The second communication device (200) of claim 33, further operable to perform the steps of any one of claims 26 to 28.
35. A system (700) for supporting an immersive communication session, the system comprising: a network entity (300) comprising a processing circuitry configured to execute the steps of any one of claims 1 to 22; at least one first communication device (100) comprising a processing circuitry configured to execute the steps of any one of claims 23 to 24; and at least one second communication device (200) comprising a processing circuitry configured to execute the steps of any one of claims 25 to 28.