WO2022263709A1 - A method, an apparatus and a computer program product for video conferencing - Google Patents

A method, an apparatus and a computer program product for video conferencing

Info

Publication number
WO2022263709A1
Authority
WO
WIPO (PCT)
Prior art keywords
participant
video
videoconferencing
parameters
data set
Prior art date
Application number
PCT/FI2022/050309
Other languages
French (fr)
Inventor
Saba AHSAN
Yu You
Sujeet Shyamsundar Mate
Emre Baris Aksu
Igor Danilo Diego Curcio
Miska Matias Hannuksela
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy
Priority to EP22824364.8A (EP4356603A1)
Publication of WO2022263709A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/012 Head tracking input arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00 Data switching networks
    • H04L 12/02 Details
    • H04L 12/16 Arrangements for providing special services to substations
    • H04L 12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L 12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L 12/1822 Conducting the conference, e.g. admission, detection, selection or grouping of participants, correlating users to one or more conference sessions, prioritising transmission
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00 Data switching networks
    • H04L 12/02 Details
    • H04L 12/16 Arrangements for providing special services to substations
    • H04L 12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L 12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L 12/1827 Network arrangements for conference optimisation or adaptation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/41 Structure of client; Structure of client peripherals
    • H04N 21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N 21/4223 Cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/478 Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N 21/4788 Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting

Definitions

  • the present solution generally relates to video conferencing.
  • Devices that are able to capture image and video have evolved from devices capturing a limited angular field of view to devices capturing 360-degree content. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, which may be referred to as 360 degrees field of view. More precisely, the devices can capture a spherical field of view (i.e., 360 degrees in all spatial directions).
  • output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera.
  • the new capture and display paradigm, where the field of view is the entire sphere, is commonly referred to as virtual reality (VR) or omnidirectional video.
  • the omnidirectional video provides further possibilities for implementing a VR videoconference.
  • 360-degree video teleconferences can use 360-degree cameras at one location (e.g. a meeting room) that can be viewed immersively by remote participants.
  • a method comprising: obtaining a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; creating a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and delivering the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
  • An apparatus comprises means for: obtaining a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; creating a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and delivering the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
  • the apparatus comprises means for configuring said data set as a metadata stream specifying information for the video streams of each participant apparatus.
  • the apparatus comprises means for delivering the video streams of each participant apparatus along the metadata stream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
  • the apparatus comprises means for configuring said data set as a manifest file or an Extended Reality (XR) descriptor in the metadata stream.
  • the apparatus comprises means for configuring said data set as one or more data structures describing overlays, where an overlay corresponds to a participant’s video stream and an overlay position corresponds to the spatial position of the participant's video stream in the viewing space.
  • the apparatus comprises means for configuring said data set as one or more data structures describing overlays according to MPEG OMAF.
  • the apparatus comprises means for indicating that the participant's video streams originate from external overlay sources rather than being conveyed according to MPEG OMAF.
  • the apparatus comprises means for indicating that background visual media is transparent or absent in an OMAF presentation and only the overlays are to be displayed.
  • the participant video stream is a 2D video configured to be displayed using a billboarding effect.
  • the apparatus comprises means for applying real-time transport protocol (RTP) to provide the one or more participant apparatus provided with the display unit capable of displaying AR content with the video stream of other participant apparatus according to viewport-dependent delivery.
  • the apparatus comprises means for pausing the delivery of the video streams not being displayed in the viewport.
  • the apparatus comprises means for creating a virtual background for the virtual videoconferencing space, while maintaining the spatial positions of each participant apparatus in said virtual videoconferencing space.
  • the apparatus comprises means for defining spatial regions within the virtual videoconferencing space for each participant to allow movements of the participant within the respective spatial region.
  • An apparatus comprises at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; create a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and deliver the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
  • An apparatus comprises a camera configured to capture a video about a user of said apparatus; a display unit capable of displaying augmented reality (AR) content; means for determining a plurality of videoconferencing parameters of a video conference; means for sending said plurality of videoconferencing parameters to at least one participant apparatus of the videoconference; means for receiving a data set from said at least one participant apparatus of the videoconference, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in a virtual videoconferencing space created by said at least one participant apparatus and parameters of a video stream of each participant apparatus; and means for rendering, based on the data set, the virtual videoconferencing space on the display unit, where representations of users of each participant apparatus are rendered on their corresponding spatial positions.
  • An apparatus comprises a camera configured to capture a video about a user of said apparatus; a display unit capable of displaying augmented reality (AR) content; at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: determine a plurality of videoconferencing parameters of a video conference; send said plurality of videoconferencing parameters to at least one participant apparatus of the videoconference; receive a data set from said at least one participant apparatus of the videoconference, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in a virtual videoconferencing space created by said at least one participant apparatus and parameters of a video stream of each participant apparatus; and render, based on the data set, the virtual videoconferencing space on the display unit, where representations of users of each participant apparatus are rendered on their corresponding spatial positions.
  • FIG. 1 shows an example of an OMAF end-to-end system
  • Fig. 2a shows an example of image stitching, projection and region-wise packing
  • Fig. 2b shows an example of a process of forming a monoscopic equirectangular panorama picture
  • FIGs. 3a-3c show examples of a conference room with some participants
  • FIG. 4 shows an example of a teleconference setup
  • FIG. 5 shows a flow chart of a method for collage creation according to an aspect
  • Fig. 6 shows an example of various information and parameters used in the collage creation
  • FIGs. 7a - 7c illustrate an example of a virtual videoconference with four participants according to an embodiment
  • Fig. 8 shows an example of a virtual videoconference with two participants according to an embodiment
  • FIG. 9 shows an exemplified setup for a unidirectional RTP-based AR/MR teleconference according to an embodiment
  • Fig. 10 shows an apparatus according to an embodiment.
  • Omnidirectional may refer to media content that may have greater spatial extent than a field-of-view of a device rendering the content.
  • Omnidirectional content may, for example, cover substantially 360 degrees in the horizontal dimension and substantially 180 degrees in the vertical dimension, but omnidirectional may also refer to content covering less than 360 degree view in the horizontal direction and/or 180 degree view in the vertical direction.
  • a viewport may be defined as a region of omnidirectional image or video suitable for display and viewing by the user.
  • a current viewport (which may be sometimes referred simply as a viewport) may be defined as the part of the spherical video that is currently displayed and hence is viewable by the user(s).
  • a video rendered by an application on a head-mounted display (HMD) renders a portion of the 360-degree video, which is referred to as a viewport.
  • a viewport is a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display.
  • a viewport may be characterized by a horizontal field-of-view (VHFoV) and a vertical field-of-view (VVFoV).
  • the 360-degree space may be divided into a discrete set of viewports, each separated by a given distance (e.g., expressed in degrees), so that the omnidirectional space can be imagined as a map of overlapping viewports, and the viewport is switched discretely as the user changes his/her orientation while watching content with a head-mounted display (HMD).
  • the viewports can be imagined as adjacent non-overlapping tiles within the 360-degrees space.
  • the H.265 video codec implements the concept of tiles which may be used to realize this scenario (both overlapping and not).
  • a subset of the 360-degree video content covering the viewport (i.e., the current view orientation) may be transmitted at the best quality/resolution, while the remainder of the 360-degree video may be transmitted at a lower quality/resolution.
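To make the viewport-dependent delivery idea above concrete, the following Python sketch picks a quality per tile from the angular distance between the tile centre and the current viewport centre. The function names, the tile layout and the 60-degree threshold are illustrative assumptions, not part of the described system.

```python
# Illustrative sketch: per-tile quality selection for viewport-dependent delivery.
# Tile and viewport centres are (azimuth, elevation) in degrees.
import math

def angular_distance_deg(a1, e1, a2, e2):
    """Great-circle angle between two directions given as azimuth/elevation."""
    a1, e1, a2, e2 = map(math.radians, (a1, e1, a2, e2))
    cos_angle = (math.sin(e1) * math.sin(e2) +
                 math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def select_tile_qualities(tiles, viewport_center, high="high", low="low",
                          threshold_deg=60.0):
    """Tiles near the current viewport get the best quality, the rest a lower one."""
    decisions = {}
    for name, (az, el) in tiles.items():
        dist = angular_distance_deg(az, el, *viewport_center)
        decisions[name] = high if dist <= threshold_deg else low
    return decisions

# Example: four tiles around the equator, viewer looking at azimuth 0.
tiles = {"front": (0, 0), "right": (90, 0), "back": (180, 0), "left": (-90, 0)}
print(select_tile_qualities(tiles, viewport_center=(0, 0)))
```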
  • Fig. 1 illustrates the Omnidirectional Media Format (OMAF) system architecture.
  • the system can be situated in a video camera, or in a network server, for example.
  • an omnidirectional media (A) is acquired. If the OMAF system is part of the video source, the omnidirectional media (A) is acquired from the camera means. If the OMAF system is in a network server, the omnidirectional media (A) is acquired from a video source over a network.
  • the omnidirectional media comprises image data (Bi) and audio data (Ba), which are processed separately.
  • in image stitching, rotation, projection and region-wise packing, the images/video of the source media provided as input (Bi) are stitched to generate a sphere picture on a unit sphere per the global coordinate axes.
  • the unit sphere is then rotated relative to the global coordinate axes.
  • the amount of rotation to convert from the local coordinate axes to the global coordinate axes may be specified by the rotation angles indicated in a RotationBox.
  • the local coordinate axes of the unit sphere are the axes of the coordinate system that has been rotated.
  • the absence of the RotationBox indicates that the local coordinate axes are the same as the global coordinate axes.
  • the spherical picture on the rotated unit sphere is converted to a two-dimensional projected picture, for example using the equirectangular projection.
  • for stereoscopic content, two spherical pictures for the two views are converted to two constituent pictures, after which frame packing is applied to pack the two constituent pictures onto one projected picture. Rectangular region-wise packing can then be applied to obtain a packed picture from the projected picture.
  • the packed pictures (D) are then provided for video and image encoding to result in encoded image (Ei) and/or encoded video stream (Ev).
  • the audio of the source media is provided as input (Ba) to audio encoding that provides as an encoded audio (Ea).
  • the encoded data (Ei, Ev, Ea) are then encapsulated into a file for playback (F) and into segments for delivery (i.e. streaming) (Fs).
  • a real-world audio-visual scene (A) may be captured 200 by audio sensors as well as a set of cameras or a camera device with multiple lenses and sensors. The acquisition results in a set of digital image/video (Bi) and audio (Ba) signals. The cameras/lenses may cover all directions around the center point of the camera set or camera device, thus the name of 360-degree video.
  • Audio can be captured using many different microphone configurations and stored as several different content formats, including channel-based signals, static or dynamic (i.e. moving through the 3D scene) object signals, and scene-based signals (e.g., Higher Order Ambisonics).
  • the channel-based signals may conform to one of the loudspeaker layouts defined in CICP (Coding- Independent Code-Points).
  • the loudspeaker layout signals of the rendered immersive audio program may be binauralized for presentation via headphones.
  • the images (Bi) of the same time instance are stitched, projected, and mapped 201 onto a packed picture (D).
  • the input images of one time instance may be stitched to generate a projected picture representing one view.
  • An example of the image stitching, projection, and region-wise packing process for monoscopic content is illustrated in Fig. 2a.
  • Input images (Bi) are stitched and projected onto a three-dimensional projection structure that may for example be a unit sphere.
  • the projection structure may be considered to comprise one or more surfaces, such as plane(s) or part(s) thereof.
  • a projection structure may be defined as a three-dimensional structure consisting of one or more surfaces on which the captured VR image/video content is projected, and from which a respective projected picture can be formed.
  • the image data on the projection structure is further arranged onto a two-dimensional projected picture (C).
  • the term projection may be defined as a process by which a set of input images are projected onto a projected picture.
  • There may be a pre-defined set of representation formats of the projected picture including for example an equirectangular projection (ERP) format and a cube map projection (CMP) format. It may be considered that the projected picture covers the entire sphere.
  • a region-wise packing is then applied to map the projected picture (C) onto a packed picture (D). If the region-wise packing is not applied, the packed picture is identical to the projected picture, and this picture is given as input to image/video encoding. Otherwise, regions of the projected picture (C) are mapped onto a packed picture (D) by indicating the location, shape, and size of each region in the packed picture, and the packed picture (D) is given as input to image/video encoding.
  • region-wise packing may be defined as a process by which a projected picture is mapped to a packed picture.
  • packed picture may be defined as a picture that results from region-wise packing of a projected picture.
  • in the case of stereoscopic content, a projected picture represents two views (CL, CR), one for each eye.
  • Both views (CL, CR) can be mapped onto the same packed picture (D) and encoded by a traditional 2D video encoder.
  • each view of the projected picture can be mapped to its own packed picture, in which case the image stitching, projection, and region-wise packing is performed as illustrated in Fig. 2a.
  • a sequence of packed pictures of either the left view or the right view can be independently coded or, when using a multiview video encoder, predicted from the other view.
  • Input images (Bi) are stitched and projected onto two three-dimensional projection structures, one for each eye.
  • the image data on each projection structure is further arranged onto a two-dimensional projected picture (CL for left eye, CR for right eye), which covers the entire sphere.
  • Frame packing is applied to pack the left view picture and right view picture onto the same projected picture.
  • region-wise packing is then applied to pack the projected picture onto a packed picture, and the packed picture (D) is given as input to image/video encoding. If the region-wise packing is not applied, the packed picture is identical to the projected picture, and this picture is given as input to image/video encoding.
  • the image stitching, projection, and region-wise packing process can be carried out multiple times for the same source images to create different versions of the same content, e.g. for different orientations of the projection structure.
  • the region-wise packing process can be performed multiple times from the same projected picture to create more than one sequence of packed pictures to be encoded.
  • 360-degree panoramic content (i.e., images and video) covers the full 360-degree field-of-view horizontally, whereas the vertical field-of-view may vary and can be e.g. 180 degrees.
  • Panoramic image covering 360-degree field-of-view horizontally and 180-degree field-of-view vertically can be represented by a sphere that has been mapped to a two-dimensional image plane using equirectangular projection (ERP).
  • in the equirectangular projection, the horizontal coordinate may be considered equivalent to a longitude, and the vertical coordinate equivalent to a latitude, with no transformation or scaling applied.
  • panoramic content with 360-degree horizontal field-of-view but with less than 180-degree vertical field-of-view may be considered special cases of equirectangular projection, where the polar areas of the sphere have not been mapped onto the two-dimensional image plane.
  • panoramic content may have less than 360-degree horizontal field-of-view and up to 180-degree vertical field-of-view, while otherwise have the characteristics of equirectangular projection format.
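As a worked illustration of the equirectangular mapping described above (horizontal coordinate corresponding to longitude, vertical coordinate to latitude, no further transformation or scaling), here is a small Python sketch; the picture size and the exact sampling convention are assumptions.

```python
# Minimal sketch of the equirectangular (ERP) mapping: longitude maps linearly
# to the horizontal picture coordinate and latitude to the vertical one.
def sphere_to_erp(longitude_deg, latitude_deg, width, height):
    """Map sphere coordinates (lon in [-180,180], lat in [-90,90]) to ERP pixel coords."""
    u = (longitude_deg + 180.0) / 360.0 * width   # horizontal ~ longitude
    v = (90.0 - latitude_deg) / 180.0 * height    # vertical ~ latitude (top = +90)
    return u, v

def erp_to_sphere(u, v, width, height):
    """Inverse mapping from ERP pixel coordinates back to sphere coordinates."""
    longitude_deg = u / width * 360.0 - 180.0
    latitude_deg = 90.0 - v / height * 180.0
    return longitude_deg, latitude_deg

# A 4096x2048 ERP picture: the sphere point (0, 0) lands at the picture centre.
print(sphere_to_erp(0.0, 0.0, 4096, 2048))   # (2048.0, 1024.0)
```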
  • An example of the process of forming a monoscopic equirectangular panorama picture is illustrated in Fig. 2b.
  • a set of input images such as fisheye images of a camera array or a camera device with multiple lenses and sensors, is stitched onto a spherical image.
  • the spherical image is further projected onto a cylinder (without the top and bottom faces).
  • the cylinder is unfolded to form a two-dimensional projected picture.
  • one or more of the presented steps may be merged; for example, the input images may be directly projected onto a cylinder without an intermediate projection onto a sphere.
  • the projection structure for equirectangular panorama may be considered to be a cylinder that comprises a single surface.
  • 360-degree content can be mapped onto different types of solid geometrical structures, such as a polyhedron (i.e. a three-dimensional solid object containing flat polygonal faces, straight edges and sharp corners or vertices, e.g., a cube or a pyramid), a cylinder (by projecting a spherical image onto the cylinder, as described above with the equirectangular projection), a cylinder (directly, without projecting onto a sphere first), a cone, etc., and then unwrapped to a two-dimensional image plane.
  • a coordinate system may be defined through orthogonal coordinate axes, such as X (lateral), Y (vertical, pointing upwards), and Z (back-to-front axis, pointing outwards). Rotations around the axes may be defined and may be referred to as yaw, pitch, and roll. Yaw may be defined to rotate around the Y axis, pitch around the X axis, and roll around the Z axis. Rotations may be defined to be extrinsic, i.e., around the X, Y, and Z fixed reference axes. The angles may be defined to increase clockwise when looking from the origin towards the positive end of an axis.
  • the coordinate system specified can be used for defining the sphere coordinates, which may be referred to as azimuth (φ) and elevation (θ).
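The following sketch illustrates one way the sphere coordinates just described (azimuth, elevation) could be turned into a unit vector in the X/Y/Z system above; the exact sign conventions are an assumption here and would have to follow the governing specification.

```python
# Sketch: sphere coordinates (azimuth, elevation) to a unit vector with
# Y pointing upwards and Z pointing outwards, as described in the text.
import math

def sphere_to_unit_vector(azimuth_deg, elevation_deg):
    a = math.radians(azimuth_deg)
    e = math.radians(elevation_deg)
    x = math.cos(e) * math.sin(a)   # lateral
    y = math.sin(e)                 # vertical, pointing upwards
    z = math.cos(e) * math.cos(a)   # back-to-front, pointing outwards
    return (x, y, z)

# Azimuth 0, elevation 0 points straight ahead along +Z.
print(sphere_to_unit_vector(0.0, 0.0))   # (0.0, 0.0, 1.0)
```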
  • OMAF allows the omission of image stitching, projection, and region-wise packing, and encoding the image/video data in its captured format.
  • images (D) are considered the same as images (Bi) and a limited number of fisheye images per time instance are encoded.
  • the stitching process is not needed, since the captured signals are inherently immersive and omnidirectional.
  • the stitched images (D) are encoded 204 as coded images (Ei) or a coded video bitstream (Ev).
  • the captured audio (Ba) is encoded 202 as an audio bitstream (Ea).
  • the coded images, video, and/or audio are then composed 205 into a media file for file playback (F) or a sequence of an initialization segment and media segments for streaming (Fs), according to a particular media container file format.
  • the media container file format is the ISO base media file format.
  • the file encapsulator 205 also includes metadata into the file or the segments, such as projection and region-wise packing information assisting in rendering the decoded packed pictures.
  • the metadata in the file may include:
  • Region-wise packing information may be encoded as metadata in or along the bitstream, for example as region-wise packing SEI message(s) and/or as region-wise packing boxes in a file containing the bitstream.
  • the packing information may comprise a region-wise mapping from a pre-defined or indicated source format to the packed picture format, e.g. from a projected picture to a packed picture, as described earlier.
  • the region-wise mapping information may for example comprise for each mapped region a source rectangle (a.k.a. projected region) in the projected picture and a destination rectangle (a.k.a. packed region) in the packed picture.
  • the packing information may comprise one or more of the following: the orientation of the three-dimensional projection structure relative to a coordinate system, indication which projection format is used, region-wise quality ranking indicating the picture quality ranking between regions and/or first and second spatial region sequences, one or more transformation operations, such as rotation by 90, 180, or 270 degrees, horizontal mirroring, and vertical mirroring.
  • the semantics of the packing information may be specified in a manner that they are indicative, for each sample location within the packed regions of a decoded picture, of the respective spherical coordinate location.
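As an illustration of the region-wise mapping semantics above, the sketch below maps a sample location in a packed region back into the projected picture through a source ("projected") and destination ("packed") rectangle; the field names and the absence of rotation or mirroring are simplifying assumptions.

```python
# Illustrative sketch: mapping a sample in a packed region back to the
# corresponding location in the projected picture.
from dataclasses import dataclass

@dataclass
class Rect:
    left: float
    top: float
    width: float
    height: float

def packed_to_projected(x, y, packed: Rect, projected: Rect):
    """Map a sample location inside the packed region to the projected picture."""
    # Normalised position inside the packed (destination) rectangle ...
    nx = (x - packed.left) / packed.width
    ny = (y - packed.top) / packed.height
    # ... rescaled into the projected (source) rectangle.
    return (projected.left + nx * projected.width,
            projected.top + ny * projected.height)

# A packed region of 960x480 samples carrying a 1920x960 projected region.
packed = Rect(0, 0, 960, 480)
projected = Rect(0, 0, 1920, 960)
print(packed_to_projected(480, 240, packed, projected))   # (960.0, 480.0)
```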
  • the segments (Fs) may be delivered 206 using a delivery mechanism to a player.
  • the file that the file encapsulator outputs (F) is identical to the file that the file decapsulator inputs (F').
  • a file decapsulator 207 processes the file (F') or the received segments (F's) and extracts the coded bitstreams (E'a, E'v, and/or E'i) and parses the metadata.
  • the audio, video, and/or images are then decoded 208 into decoded signals (B'a for audio, and D' for images/video).
  • decoded packed pictures (D') are projected 210 onto the screen of a head-mounted display or any other display device 211 based on the current viewing orientation or viewport and the projection, spherical coverage, projection structure orientation, and region-wise packing metadata parsed from the file.
  • decoded audio (B'a) is rendered 210, e.g. through headphones 212, according to the current viewing orientation.
  • the current viewing orientation is determined by the head tracking and possibly also eye tracking functionality 208.
  • the current viewing orientation may also be used by the video and audio decoders 209 for decoding optimization.
  • a video rendered by an application on a HMD or on another display device renders a portion of the 360-degree video. This portion may be defined as a viewport.
  • a viewport may be understood as a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display.
  • a viewport may be defined as a part of the spherical video that is currently displayed.
  • a viewport may be characterized by horizontal and vertical fields-of-view (FOV or FoV).
  • a viewport may further be defined as a region of omnidirectional image or video suitable for display and viewing by the user.
  • a current viewport (which may sometimes be referred to simply as a viewport) may be defined as the part of the spherical video that is currently displayed and hence is viewable by the user(s). A viewpoint may be defined as the point or space from which the user views the scene; it usually corresponds to a camera position. Slight head motion may not imply a different viewpoint.
  • a viewing position may be defined as the position within a viewing space from which the user views the scene.
  • a viewing space may be defined as a 3D space of viewing positions within which rendering of image and video is enabled and VR experience is valid.
  • An overlay may be defined as a video clip, an image or text that is superimposed on top of a background, which may be referred to as background visual media and may be e.g. an omnidirectional video or image.
  • overlays can be controlled flexibly in OMAF.
  • overlay structures are extensible and new controls or properties can be specified in future versions or amendments of the OMAF standard.
  • Background visual media may be defined in OMAF as the omnidirectional video or image that is rendered on the unit sphere, and the term overlay source may refer to the visual content displayed as an overlay.
  • Sphere-relative 2D overlays where an overlay source is displayed on a plane of a given width and height.
  • the center point of the plane is located at given spherical coordinates and distance from the center of the unit sphere, and the plane can be rotated by given yaw, pitch and roll angles.
  • Sphere-relative omnidirectional overlays where an omnidirectional projection, such as ERP, has been used for an overlay source.
  • Sphere-relative omnidirectional overlays may, but need not, cover the entire sphere and are located at a given spherical location and distance from the center of the unit sphere.
  • 3D mesh overlays where both a 3D mesh and a mapping of an overlay source onto the 3D mesh are specified.
  • the 3D mesh can consist of parallelograms having any rotation and being located at any position within the unit sphere.
  • Viewport-relative overlays which are located on a given position within the viewport regardless of the viewing orientation.
  • the rendering process projects the sphere-relative overlays and the background visual media onto the viewport, which is then superimposed by the viewport-relative overlays. Since viewports can be of different shape and size in different player devices, the top-left corner position, width, and height of a viewport-relative overlay are provided as percentages relative to the viewport dimensions.
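The two placements discussed above can be illustrated with a short sketch: a sphere-relative 2D overlay centre derived from spherical coordinates and a distance, and a viewport-relative overlay rectangle derived from percentages of the viewport size. All function and parameter names here are illustrative assumptions.

```python
# Sketch of the two overlay placements: sphere-relative 2D and viewport-relative.
import math

def sphere_relative_overlay_centre(azimuth_deg, elevation_deg, distance):
    """Centre point of the overlay plane inside the unit-sphere coordinate system."""
    a, e = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (distance * math.cos(e) * math.sin(a),   # x (lateral)
            distance * math.sin(e),                 # y (up)
            distance * math.cos(e) * math.cos(a))   # z (outwards)

def viewport_relative_overlay_rect(left_pct, top_pct, width_pct, height_pct,
                                   viewport_w, viewport_h):
    """Pixel rectangle of a viewport-relative overlay given percentages."""
    return (left_pct / 100.0 * viewport_w,
            top_pct / 100.0 * viewport_h,
            width_pct / 100.0 * viewport_w,
            height_pct / 100.0 * viewport_h)

# An overlay covering the bottom-right quarter of a 1920x1080 viewport.
print(viewport_relative_overlay_rect(50, 50, 50, 50, 1920, 1080))
```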
  • OMAF enables two rendering modes for presenting sphere-relative overlays with background visual media.
  • in the first rendering mode, a viewing position that is in the center of the unit sphere is used for projecting the sphere-relative overlays and the background visual media onto the viewport.
  • in the second rendering mode, the viewing position is tracked and used for projecting the content onto the viewport.
  • the second rendering mode enables viewing overlays from different perspectives and peeking on the background appearing behind the overlays.
  • Sphere-relative overlays can be placed at given distances from the center of the unit sphere, which is perceivable through motion parallax.
  • Content authors can define a viewing space that specifies valid viewing positions around the center of the unit sphere.
  • OMAF enables specifying the viewing space boundaries as a cuboid, a sphere, a cylinder, or an ellipsoid.
  • sphere-relative overlays are located at a given distance from the center of the unit sphere.
  • a layering order can be given so that the player behavior is deterministic when several overlays are positioned at the same distance or when viewport-relative overlays overlap.
  • by default, overlays are opaque in OMAF.
  • a constant opacity or an alpha plane that specifies a pixel-wise opacity can be optionally provided.
  • OMAF the content author can specify separately per each overlay which types of user interactions are enabled.
  • the following user interaction types can be enabled or disabled in an OMAF file: changing the position, modifying the distance from the center of the unit sphere, switching the overlay on or off, tuning the opacity, resizing, rotating, cropping, and switching the overlay source to another one.
  • a textual label can be given for each overlay and utilized by a user interface to enable end-users to switch overlays on or off. Another way is to provide an associated sphere region that the user can select to turn an overlay on or off.
  • an overlay source can be a video track or an image item, in which case the overlay consists of the entire decoded picture. Since some player devices might not be capable of running several video decoder instances simultaneously, it is also possible to pack overlays spatially with the background visual media. In that case, an overlay source is specified as a rectangle within the decoded picture area. Furthermore, it is possible to indicate that an overlay source is defined by the recommended viewport timed metadata track. In yet another alternative, an overlay source is provided by external means, such as through a URL. The externally specified overlay source could be used to show content from a separate application within an OMAF presentation.
  • each overlay can be given a priority for rendering.
  • the highest priority value means that the overlay must be rendered.
  • the controls and properties for overlays can be static, i.e. remain constant for the entire duration of the overlay, or dynamic, i.e. signaled by a timed metadata track where the controls and properties are dynamically adjusted. For example, it is possible to move or resize an overlay as a function of time.
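As a simple illustration of such dynamic control, the sketch below linearly interpolates an overlay azimuth between two timed-metadata samples; the sample structure is an assumption for illustration only.

```python
# Sketch: a dynamic overlay property (here, azimuth) driven by timed metadata
# samples and linearly interpolated between them.
def interpolate_overlay_azimuth(t, samples):
    """samples: list of (time, azimuth_deg) pairs sorted by time."""
    if t <= samples[0][0]:
        return samples[0][1]
    for (t0, a0), (t1, a1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return a0 + w * (a1 - a0)
    return samples[-1][1]

# Overlay drifts from azimuth 0 to 30 degrees between t=0s and t=10s.
print(interpolate_overlay_azimuth(5.0, [(0.0, 0.0), (10.0, 30.0)]))   # 15.0
```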
  • OMAF also defines an OMAF overlay information (OVLY) DASH descriptor at Adaptation Set level.
  • the OVLY descriptor specifies overlays in DASH Representations associated with the descriptor.
  • the OVLY descriptor allows distinguishing, directly from the MPD file, Adaptation Sets that contain overlays from Adaptation Sets that contain background visual media.
  • the phrase along the bitstream may be defined to refer to out-of-band transmission, signaling, or storage in a manner that the out-of-band data is associated with the bitstream.
  • the phrase decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream.
  • an indication along the bitstream may refer to metadata in a container file that encapsulates the bitstream.
  • Available media file format standards include International Standards Organization (ISO) Base Media File Format (ISO/IEC 14496-12, which may be abbreviated ISOBMFF), Moving Picture Experts Group (MPEG)-4 file format (ISO/IEC 14496-14, also known as the MP4 format), file format for NAL (Network Abstraction Layer) unit structured video (ISO/IEC 14496-15) and High Efficiency Video Coding standard (HEVC or H.265/HEVC).
  • the aspects of the invention are not limited to ISOBMFF, but rather the description is given for one possible basis on top of which the embodiments of the disclosure may be partly or fully realized.
  • the embodiments of the present disclosure may also be implemented using other transport protocols, such as the Real-Time Transport Protocol (RTP) as described below.
  • the information transmitted to implement this invention can be transmitted using any protocol and any layers of the ISO (International Standardization Organization) OSI (Open Systems Interconnection) protocol stack.
  • High Efficiency Image File Format (HEIF) is a standard developed by the Moving Picture Experts Group (MPEG) for the storage of images and image sequences.
  • the standard facilitates file encapsulation of data coded according to High Efficiency Video Coding (HEVC) standard.
  • HEIF includes features building on top of the used ISO Base Media File Format (ISOBMFF).
  • the ISOBMFF structures and features are used to a large extent in the design of HEIF.
  • the basic design for HEIF comprises that still images are stored as items and image sequences are stored as tracks.
  • a transmission channel or a communication channel or a channel may refer to either a physical transmission medium, such as a wire, or to a logical connection over a multiplexed medium.
  • Real-time Transport Protocol (RTP) may be used for the real-time transport of timed media. RTP may operate on top of the User Datagram Protocol (UDP), which in turn may operate on top of the Internet Protocol (IP).
  • in RTP transport, media data is encapsulated into RTP packets.
  • each media type or media coding format has a dedicated RTP payload format.
  • An RTP session is an association among a group of participants communicating with RTP. It is a group communications channel which can potentially carry a number of RTP streams.
  • An RTP stream is a stream of RTP packets comprising media data.
  • An RTP stream is identified by an SSRC belonging to a particular RTP session.
  • SSRC refers to either a synchronization source or a synchronization source identifier that is the 32-bit SSRC field in the RTP packet header.
  • a synchronization source is characterized in that all packets from the synchronization source form part of the same timing and sequence number space, so a receiver device may group packets by synchronization source for playback. Examples of synchronization sources include the sender of a stream of packets derived from a signal source such as a microphone or a camera, or an RTP mixer.
  • Each RTP stream is identified by a SSRC that is unique within the RTP session.
  • an RTP stream is associated with each participant's incoming video stream to the RTP mixer. Based on the spatial metadata for each of the video streams, the viewer's field of view and viewport position (or orientation, or both), the RTP mixer can perform viewport-dependent RTP stream selection.
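A minimal sketch of such viewport-dependent RTP stream selection is given below: each incoming stream, identified by its SSRC, carries a spatial direction, and only streams falling inside the viewer's field of view are forwarded while the rest could be paused. The names and the simple rectangular field-of-view test are assumptions.

```python
# Illustrative sketch: viewport-dependent selection of participant RTP streams.
def in_viewport(stream_az, stream_el, vp_az, vp_el, hfov_deg, vfov_deg):
    """True if the stream direction lies within the viewer's field of view."""
    d_az = (stream_az - vp_az + 180.0) % 360.0 - 180.0   # wrap to [-180, 180]
    d_el = stream_el - vp_el
    return abs(d_az) <= hfov_deg / 2.0 and abs(d_el) <= vfov_deg / 2.0

def select_streams(streams, viewport):
    """streams: {ssrc: (azimuth, elevation)}; viewport: (az, el, hfov, vfov)."""
    forwarded, paused = [], []
    for ssrc, (az, el) in streams.items():
        (forwarded if in_viewport(az, el, *viewport) else paused).append(ssrc)
    return forwarded, paused

streams = {0x1111: (0.0, 0.0), 0x2222: (120.0, 0.0), 0x3333: (-30.0, 5.0)}
print(select_streams(streams, viewport=(0.0, 0.0, 90.0, 60.0)))
```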
  • a uniform resource identifier (URI) may be defined as a string of characters used to identify a name of a resource. Such identification enables interaction with representations of the resource over a network, using specific protocols.
  • a URI is defined through a scheme specifying a concrete syntax and associated protocol for the URI.
  • the uniform resource locator (URL) and the uniform resource name (URN) are forms of URI.
  • a URL may be defined as a URI that identifies a web resource and specifies the means of acting upon or obtaining the representation of the resource, specifying both its primary access mechanism and network location.
  • a URN may be defined as a URI that identifies a resource by name in a particular namespace. A URN may be used for identifying a resource without implying its location or how to access it.
  • a 360-degree conference can be a live meeting which is delivered to a receiver device(s) by the sender, wherein the sender is a video source, such as a 360-degree (i.e. omnidirectional) camera, or a system being operatively connected to a video source or comprising means to capture video.
  • the streamable content from the sender to the receiver comprises at least video or audio.
  • the purpose of the sender is to deliver video being captured forward to receiver device(s).
  • the sender may also comprise means for receiving at least audio data from receiver device(s), and output the received audio data to the participants of the event.
  • Virtual reality (VR) teleconferencing may consist of a virtual environment (e.g. a virtual meeting room) in which one or several users are embedded to create an experience of immersivity.
  • the experience can be created by using two or more cameras to capture each participant to track their movement and position with up to 6DOF.
  • the captured movement and position of the participants may be used to recreate their representations, such as their avatars, in the meeting room.
  • the participants may have an initial or default position in the meeting room with some degrees of freedom to move around within the environment.
  • the video streams from all participants are mixed with the virtual meeting room video, possibly using a centralized server, and a video stream specific to each participant is created and delivered to that participant.
  • the participant stream is created from a perspective of that user from their position in the room.
  • VR teleconferencing requires the use of VR equipment, e.g. head-mounted displays (HMDs), for a truly immersive experience.
  • Teleconferencing may also be implemented using either a Mixed Reality (MR) or an Augmented Reality (AR) application.
  • Mixed reality is the merging of real and virtual worlds to produce new environments and visualizations, where physical and digital objects co-exist and interact in real time.
  • Mixed reality may be considered a mixture of the physical world and the virtual world, whereas Augmented Reality (AR) mainly displays in the physical world, however with further information or objects added virtually.
  • Figs. 3a-3c show some examples for implementing virtual teleconferencing, where a group of participants is having a meeting in a conference room.
  • the conference room can be considered as a virtual conference system A with physical elements (i.e., camera 220, view screen 210, one or more physical participants), able to share content with and to receive data from remote participants.
  • the virtual conference system A may comprise a 360-degree (i.e., omnidirectional) camera 220 and a view screen 210.
  • two remote participants B, C also participate in the meeting through a conference call.
  • Physical participants of the virtual conference system A use the view screen 210 to display a shared presentation and/or video streams coming from the remote participants B, C.
  • the remote participants B, C may be shown on the view screen 210.
  • One of the remote participants B is using a head-mounted display for a 360-degree view of the conference content and a camera that captures his/her video.
  • One of the remote participants C uses a mobile phone to access the conference. The mobile phone is able to show a 360-degree video of the conference and to capture his/her video.
  • the conference call is set up without any media-aware network elements.
  • Both remote participants B, C send information about their viewport to the virtual conference system A, which in turn sends them a viewport-dependent video stream from the 360-degree camera 220.
  • the conference call is set up using a network function, which may be performed by media function such as a Media Resource Function (MRF) or a Media Control Unit (MCU) 801.
  • the virtual conference systems for multiple conference rooms X send 360-degree video to an MRF/MCU 801.
  • the rooms may choose to receive 2D video streams from other participants including one of the other rooms, which is displayed on the view screen 210 in the room.
  • the remote participants B, C can choose to view any one or none of the available 360-degree videos from the multiple rooms. Switching from one room to another may be triggered manually, or using other mechanisms, such as viewing direction or dominant speaker.
  • the MRF/MCU 801 may signal to pause the receiving of 360-degree video from any of the rooms that do not currently have any active viewers.
  • a virtual conference system for multiple conference rooms X sends 360-degree video to an MRF/MCU 801.
  • the rooms may choose to receive 2D video streams from other participants including one of the other rooms, which is displayed on the view screen 810 in the room.
  • the remote participants B, C can choose to view any one or none of the available 360-degree videos from the multiple rooms. Switching from one room to another may be triggered manually, or using other mechanisms, such as viewing direction or dominant speaker.
  • the MRF/MCU 801 may signal to pause the receiving of 360-degree video from any of the rooms that do not currently have any active viewers.
  • the 360-degree conference may also be completely virtual, where all the meeting participants are remote participants, i.e. receiver devices connecting to the conference via a network, and where the sender is a computer generating a virtual representation of the virtual conference and the remote participants.
  • Fig. 4 illustrates a further example of a set-up for 360-degree conference 900.
  • a 360-degree conference 900 comprises at least one sender, such as a 360-degree video system 910 recording the conference.
  • the 360-degree conference 900 comprises not only the live participants 915, but also one or more remote participants with receiver devices 930, 935 "participating" in the conference by establishing a connection, e.g. via a network 920, to the conference system, which makes it possible to watch the 360-degree video captured by the 360-degree video system 910.
  • the devices 930, 935 of the remote participants are referred to as “receiver devices” to emphasize their role as a receiving party for the streamed content. It is however appreciated that the receiver devices 930, 935 are also able to send data, for example audio.
  • the receiver devices 930, 935 can be virtual reality display systems, such as head mounted displays 935, but alternatively also (or in addition) displays 930 capable of showing only two-dimensional (2D) content.
  • A head-mounted display is a device capable of showing three-dimensional (3D) content.
  • a head mounted display may comprise two screen sections or two screens for displaying images for left and right eyes.
  • the displays are close to the eyes, and therefore lenses are used to make the images easily viewable and for spreading the images to cover as much as possible of the eyes’ field of view.
  • the HMD is attached to the head of the user so that it stays in place even when the user turns his head.
  • the device may have an orientation detecting module for determining the head movements and direction of the head.
  • the head mounted display gives a three-dimensional (3D) perception of the recorded/ streamed content to a user.
  • The user of the head-mounted display sees, at a given time instant, only a portion of the 360-degree content, referred to as a viewport, the size of which is defined by the vertical and horizontal fields-of-view of the HMD. Most of the audio objects of the immersive content may be visible in the viewport, while some audio objects may reside behind the user, therefore being non-visible in the viewport. As mentioned, the viewport is a portion of the 360-degree content. Therefore, "viewport" is a term for defining a region of omnidirectional image or video suitable for display and viewing by the user, and is defined as the part of the spherical video that is currently displayed and hence is viewable by the user(s).
  • the sender is illustrated as a 360-degree video system 910. It is appreciated, that instead of the 360-degree video system 910, the sender may be one of the conference participants sharing a 360-degree video, or the sender may be a middlebox or a network appliance, i.e. any computer networking device capable of media distribution. Examples of such middlebox are media gateway, conference server etc.
  • the sender may also be a streaming server configured to deliver the content via the MPEG DASH protocol. Depending on the capabilities of the sender, it may offer viewport-independent or viewport-dependent delivery. Furthermore, if viewport-dependent delivery is offered, the control signals for determining the viewport need to be delivered from the head-mounted display, or other device able to display 360-degree content, to the sender.
  • the viewport can be determined based on head and body movements of the user of the receiver device, such as the HMD 935.
  • the viewport may be sent from the receiver device 935 via a control channel to the sender, such as the 360-degree video system 910. Based on the received viewport, the sender is able to send video content relating to the determined viewport. It is appreciated that viewport control by means other than the HMD user's own movements can cause motion sickness for the viewers.
  • a conferencing system (comprising both sender and receiver) may support 3DoF.
  • a system supporting 3DoF allows rotational head motion around yaw, pitch and roll.
  • 6DoF (6 Degrees of Freedom) motion may be supported.
  • 6DoF allows further translational head or body movements, in addition to the rotational movements, e.g., translationally moving head or body up, down, right, left, forwards and backwards, including walking around a space in any direction.
  • 6DoF motion may be constrained to a defined space within which the user can move around (e.g., 3DoF+) or unconstrained.
  • Virtual reality teleconferencing detaches people from their own physical environment. This makes it difficult to create a conferencing experience when some of the participants are collocated.
  • the virtual meeting space is typically mapped to a 360-degree sphere, where the viewer is in the center and where the viewing (e.g. with an HMD) supports 3DoF or 3DoF+.
  • 360-degree video teleconferences can use 360-degree cameras at one location (e.g. a meeting room) that can be viewed immersively by remote participants.
  • a fully immersive VR experience would require realistic 3D capturing or 3D avatar generation/animation and 6DoF support.
  • a further problem in live capture of video streams for creating MR/VR experiences is that it typically relies on heavy centralized processing that can add further delays and complexity, and is therefore not ideal for creating a real-time conversational experience.
  • the method comprises obtaining (500) a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; creating (502) a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and delivering (504) the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
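The three steps of the method (obtain parameters, create the data set, deliver it to AR-capable participants) can be sketched as follows. The data classes, the circular seating layout and the field names are hypothetical; they only illustrate the flow, not the patent's actual data format.

```python
# Hypothetical sketch of the obtain / create / deliver flow described above.
import math
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ParticipantParams:          # obtained from each participant apparatus
    participant_id: str
    stream_uri: str               # where this participant's camera stream is available
    supports_ar_display: bool

@dataclass
class CollageDescription:         # the data set for the virtual conferencing space
    positions: Dict[str, tuple]   # participant_id -> (x, y, z) in the virtual space
    stream_params: Dict[str, str] # participant_id -> stream URI (or other stream info)

def create_data_set(params: List[ParticipantParams]) -> CollageDescription:
    """Assign a simple circular seating layout and record each stream's parameters."""
    n = len(params)
    positions = {p.participant_id: (math.cos(2 * math.pi * i / n),
                                    0.0,
                                    math.sin(2 * math.pi * i / n))
                 for i, p in enumerate(params)}
    streams = {p.participant_id: p.stream_uri for p in params}
    return CollageDescription(positions, streams)

def deliver(data_set: CollageDescription,
            params: List[ParticipantParams]) -> Dict[str, CollageDescription]:
    """Return the same data set keyed by each AR-capable participant it would be sent to."""
    return {p.participant_id: data_set for p in params if p.supports_ar_display}

params = [ParticipantParams("alice", "rtp://example/alice", True),
          ParticipantParams("bob", "rtp://example/bob", False)]
data_set = create_data_set(params)
print(list(deliver(data_set, params)))   # ['alice']
```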
  • videoconferencing parameters from the plurality of participant apparatus of a video conference are collected in an apparatus, which may be a media-aware network element, such as a Media Gateway (MGW) implementing a media function, such as a Media Resource Function (MRF) or a Media Control Unit (MCU).
  • the method may be implemented without any media-aware network elements, wherein the apparatus may be one of participant apparatus of the video conference.
  • a model of a virtual videoconferencing space is created, wherein the mutual spatial positions of each participant apparatus in respect to each other are defined.
  • a model of a layout of the virtual videoconferencing space is created, and sufficient information describing the layout is combined with parameters of the video stream of each participant apparatus into a data set.
  • a scene created by the model of a layout of the virtual videoconferencing space and the video stream of each participant apparatus is herein also referred to as a collage.
  • Such an apparatus may comprise means for obtaining a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; means for creating a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and means for delivering the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
  • the data set, which is herein also referred to as a collage description, is a declaration of the spatial position of each participant’s video stream without a virtual space definition.
  • Such apparatus may then define its own position in the virtual videoconferencing space and render an augmented view of the virtual videoconferencing space such that the other participants are placed at their respective positions according to the layout, as seen from that viewpoint.
  • the apparatus may further utilize the other parameters of the collage description to render the other participant streams into a mixed reality videoconferencing space combined with the user’s own physical space.
  • the actual layout rendered by the participant apparatus provided with the display unit capable of displaying AR content may be fully independent or bound to the local physical space (i.e. participant-specific).
  • the spatial relationships of all participant streams should preferably be maintained for a consistent experience among participants, even though the local representation layout may be individual and/or physically space-bound.
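As a purely illustrative sketch (not part of the specification), a collage description of this kind can be thought of as a small data structure that pairs each participant's spatial placement with the parameters of that participant's video stream; all field names below are assumptions chosen for readability.

```python
# Illustrative sketch (not from the specification): a minimal collage
# description holding, per participant, a spatial position in the shared
# layout and the parameters of that participant's video stream.
from dataclasses import dataclass, field
from typing import List


@dataclass
class StreamParams:
    uri: str            # e.g. an RTP or HTTP locator for the stream
    width: int
    height: int
    framerate: float


@dataclass
class ParticipantEntry:
    participant_id: str
    azimuth_deg: float  # angular position around the viewer
    depth_m: float      # rendering distance from the viewer
    stream: StreamParams


@dataclass
class CollageDescription:
    version: int
    participants: List[ParticipantEntry] = field(default_factory=list)


collage = CollageDescription(
    version=1,
    participants=[
        ParticipantEntry("U2", azimuth_deg=-45.0, depth_m=1.5,
                         stream=StreamParams("rtp://gw.example/u2", 1280, 720, 30.0)),
        ParticipantEntry("U3", azimuth_deg=45.0, depth_m=1.5,
                         stream=StreamParams("rtp://gw.example/u3", 1280, 720, 30.0)),
    ],
)
```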
  • a further aspect relates to a participant apparatus of a video conference.
  • Such an apparatus may comprise a camera configured to capture a video about a user of said apparatus; a display unit capable of displaying augmented reality (AR) content; means for determining a plurality of videoconferencing parameters of a video conference; means for sending said plurality of videoconferencing parameters to at least one participant apparatus of the videoconference; means for receiving a data set from said at least one participant apparatus of the videoconference, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in a virtual videoconferencing space created by said at least one participant apparatus and parameters of a video stream of each participant apparatus; and means for rendering, based on the data set, the virtual videoconferencing space on the display unit, where representations of users of each participant apparatus are rendered on their corresponding spatial positions.
  • the method comprises configuring said data set as a metadata stream specifying information for the video streams of each participant apparatus. Consequently, based on this information, the one or more participant apparatus provided with the display unit capable of displaying AR content may render the augmented view of the virtual videoconferencing space such that the video stream or other representation of each participant is shown at their respective position in the virtual videoconferencing space.
  • the metadata stream may comprise, for example, a reference, such as a URI, to the video streams of each participant apparatus, and the participant apparatus provided with the display unit capable of displaying AR content may fetch the respective video streams from the identified location.
  • the method comprises delivering the video streams of one or more participant apparatus along with the metadata stream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
  • the video streams of one or more participant apparatus, i.e. at least the video stream of the participant apparatus currently being displayed, may be automatically delivered to the participant apparatus provided with the display unit capable of displaying AR content, which may obtain identifications of each video stream and render the augmented view of the virtual videoconferencing space such that the video stream or other representation of each participant is shown at their respective position in the virtual videoconferencing space.
  • the method comprises configuring said data set as a manifest file or an Extended Reality (XR) descriptor in the metadata stream.
  • the manifest file may comprise the necessary definitions for the video streams of all participant apparatus.
  • the XR descriptor, in turn, may be defined for each video stream separately.
  • the manifest/XR description can be updated during the call if required, for example, when a new participant joins, or the positions or roles of participants change.
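The following is a hypothetical example of such a manifest carried in the metadata stream, referencing each participant's stream by URI and flagging it as externally sourced (e.g. delivered over RTP rather than inside the presentation); the JSON keys are invented for illustration and do not follow any particular standard.

```python
# Hypothetical manifest payload for the metadata stream; field names are
# illustrative, not taken from any standard.
import json

manifest = {
    "collage_version": 2,
    "participants": [
        {
            "id": "U1",
            "role": "chairman",
            "position": {"azimuth_deg": 0.0, "depth_m": 2.5},
            "video": {
                "uri": "rtp://mgw.example.com/streams/u1",
                "external_source": True,   # carried over RTP, not embedded in the presentation
                "resolution": "640x360",
            },
        },
    ],
}

print(json.dumps(manifest, indent=2))
```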
  • the method comprises configuring said data set as one or more data structures describing overlays, where an overlay corresponds to a participant’s video stream and an overlay position corresponds to the spatial position of the participant's video stream in the viewing space.
  • the method comprises configuring said data set as one or more data structures describing overlays according to MPEG OMAF.
  • an XR descriptor may describe overlays and may follow the syntax and semantics of OMAF overlay syntax structures.
  • any embodiment referring to an XR descriptor applies equally to an overlay syntax structure or descriptor.
  • the method further comprises indicating that the participant's videos originate from external overlay sources rather than being conveyed through OMAF. Consequently, the external overlay sources may use a different communication protocol stack than the OMAF presentation, such as an RTP/UDP/IP protocol stack for delivering the video.
  • the method comprises indicating that the background visual media is transparent or absent in an OMAF presentation and hence only the overlays should be displayed to obtain an XR experience.
  • the actual participant video streams of the collage to be delivered to the one or more participant apparatus may be implemented in various ways, which may be dependent on the capabilities of the capturing devices (cameras) of the participant apparatus and/or capabilities of the apparatus creating the collage and the collage description for possibly transcoding the video streams. It is noted that most of the embodiments described herein do not require any transcoding upon creating the collage and the collage description, thereby enabling a low latency and low complexity media gateway to deliver a collaborative or multi-party MR experience. However, in some embodiments transcoding of the participant video streams, e.g. of their resolution, may provide advantages, if the transcoding is supported by the apparatus operating as the media gateway.
  • the apparatus operating as the media gateway is herein also referred to as a collage creator.
  • a participant video stream can be a 3D/2D avatar that creates movement based on the user’s pose.
  • the use of an avatar supplemented with control information for rendering its movements based on the user’s pose and movements makes it possible to reduce the size and bitrate of the video stream significantly, thereby further contributing to the operation of a low latency and low complexity media gateway.
  • the participant video stream can also be a live captured 3D model of the user.
  • both “virtual” and “real” avatars may be used.
  • a virtual avatar is used when a live stream of the participant is not available. In case a person cannot participate at all, a “ghost” avatar may still be present to show the absence.
  • a special “robot” avatar can be used to provide a capture point for recording of the conference call for later viewing by any participant or non-participant or as a common point for viewing by a group of attendees (non-panelists or audience) that are not part of the speakers (panelists) in the conference.
  • a participant may use the tag “robot” in the SDP to indicate it wishes to join as a robot avatar.
  • the collage creator would assign an appropriate position to this “robot” attendee such that it has a good view of the full conference and participants.
  • the position of the “ghost” avatar may be used to record the meeting session so the absent person can replay it later from their perspective; the recording for the offline participants is handled by an application or application server (AS).
  • An externally sourced overlay or one provided by the AS may be used to indicate the position of the “robot” or “ghost” avatars to the other participants.
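A minimal sketch of how such a "robot" tag could be conveyed in an SDP offer is shown below; the attribute name a=xr-role is invented for this example and is not a registered SDP attribute.

```python
# Hypothetical SDP fragment: "a=xr-role:robot" is an invented attribute used
# only to illustrate how such a tag could be carried in the offer.
def build_offer(session_name: str, role_tag: str) -> str:
    lines = [
        "v=0",
        "o=- 0 0 IN IP4 192.0.2.10",
        f"s={session_name}",
        "t=0 0",
        "m=video 49170 RTP/AVP 96",
        "a=rtpmap:96 H264/90000",
        f"a=xr-role:{role_tag}",   # e.g. "robot" to request a recording viewpoint
    ]
    return "\r\n".join(lines) + "\r\n"


print(build_offer("MR conference", "robot"))
```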
  • the participant video stream is a 2D video configured to be displayed using a billboarding effect.
  • the participant (user) is extracted from his/her background, possibly along with any contiguous objects (e.g. a chair) so they can be displayed in MR.
  • Billboarding is a method used to render a flat (2D) object in XR.
  • the object is rendered at a distance from the viewer and its orientation may automatically be computed so that it always faces the viewing direction.
  • the distance may be derived in a manner that the object appears in a life-like size for the viewer.
  • Approximately known geometric distances, such as the inter-pupillary distance or body part lengths, may be used in determining the distance.
  • the distance may be determined by the collage creator and/or the receiving participant apparatus.
  • In billboarding, the object may be rendered on a curvature and/or may be rotated based on the viewing angle of the viewer to give the illusion of depth and not to appear flat.
  • the curvature can be defined at the center of the avatar plane and each participant can have different curvatures, with respect to their positions in the collage.
  • the use of billboarding enables participants with 2D capture devices to join an MR conference call with realistic, live captured images.
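A minimal billboarding sketch is given below: it computes the yaw that turns a flat avatar plane towards the viewer and a scale factor that makes the avatar appear life-size from an approximately known inter-pupillary distance; the pinhole-camera assumption and all numeric values are illustrative.

```python
# Minimal billboarding sketch: orient a flat avatar toward the viewer and
# scale it so the rendered inter-pupillary distance matches a typical ~63 mm.
import math


def billboard_yaw(viewer_xz, avatar_xz):
    """Yaw (radians) that rotates the avatar plane to face the viewer."""
    dx = viewer_xz[0] - avatar_xz[0]
    dz = viewer_xz[1] - avatar_xz[1]
    return math.atan2(dx, dz)


def life_size_scale(ipd_pixels, focal_px, distance_m, real_ipd_m=0.063):
    """Scale factor so eyes captured `ipd_pixels` apart appear life-size
    when the avatar plane is placed `distance_m` away (pinhole model)."""
    rendered_ipd_m = ipd_pixels * distance_m / focal_px
    return real_ipd_m / rendered_ipd_m


yaw = billboard_yaw(viewer_xz=(0.0, 0.0), avatar_xz=(1.0, 2.0))
scale = life_size_scale(ipd_pixels=40, focal_px=1000, distance_m=2.0)
print(round(math.degrees(yaw), 1), round(scale, 3))
```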
  • the extraction of a participant from a 2D video may be realized using alpha channel manipulation to make the portion of the video that is not part of the region-of-interest (e.g. not part of the participant’s body) transparent and render the video as an overlay.
  • An overlay is a spherical or 2D video that is rendered on a 360-degree video or viewport of an AR/VR device.
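The alpha-channel manipulation described above can be sketched as follows, assuming a person mask is available from a separate segmentation step; this is an illustrative example, not the specified processing chain.

```python
# Sketch: make everything outside the person mask transparent by adding an
# alpha channel (the mask is assumed to come from a separate segmentation step).
import numpy as np


def apply_cutout_alpha(rgb_frame: np.ndarray, person_mask: np.ndarray) -> np.ndarray:
    """rgb_frame: HxWx3 uint8, person_mask: HxW bool. Returns HxWx4 RGBA."""
    alpha = np.where(person_mask, 255, 0).astype(np.uint8)
    return np.dstack([rgb_frame, alpha])


frame = np.zeros((720, 1280, 3), dtype=np.uint8)
mask = np.zeros((720, 1280), dtype=bool)
mask[100:600, 400:900] = True           # stand-in for a real segmentation result
rgba = apply_cutout_alpha(frame, mask)
print(rgba.shape)                        # (720, 1280, 4)
```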
  • the apparatus operating as the media gateway is configured to use real-time transport protocol (RTP) to provide the one or more participant apparatus provided with the display unit capable of displaying AR content with the video stream of other participant apparatus according to viewport-dependent delivery.
  • the videoconference session may be established using session protocols, e.g. SDP (Session Description Protocol) and SIP (Session Initiation Protocol), HTTP (Hypertext Transfer Protocol) or any other suitable protocol.
  • the media streams may be carried using real-time transport protocols with media control protocols, e.g. RTP, RTCP (Real-time Control Protocol), SCTP (Stream Control Transmission Protocol), etc.
  • the session management can establish RTP streams with the appropriate video characteristics (resolution, size) for each of the participants.
  • the media gateway may update the stream properties and pause or resume certain streams, for example based on user pose information, without any additional media processing.
  • An RTP MGW can then control the streams based on the pose and viewport of the AR users for viewport-dependent delivery, i.e., only the video streams that are currently rendered in the user’s viewport are delivered and the remaining video streams may be paused. Consequently, there is no requirement for heavy centralized processing of the streams to create a combined stream from each participant’s perspective.
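A simplified sketch of the pause/resume decision is shown below: only streams whose collage azimuth falls within the reported viewport remain active. The actual RTP/RTCP signalling used to pause a stream is left abstract.

```python
# Sketch: decide which participant streams to pause based on viewport feedback.
def in_viewport(azimuth_deg: float, viewport_center_deg: float, hfov_deg: float) -> bool:
    diff = (azimuth_deg - viewport_center_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= hfov_deg / 2.0


def select_active(placements: dict, viewport_center_deg: float, hfov_deg: float = 90.0):
    """placements: participant id -> azimuth of that participant in the collage."""
    return {pid: ("send" if in_viewport(az, viewport_center_deg, hfov_deg) else "pause")
            for pid, az in placements.items()}


print(select_active({"U1": 0.0, "U3": 60.0, "U4": -120.0}, viewport_center_deg=10.0))
# {'U1': 'send', 'U3': 'pause', 'U4': 'pause'}
```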
  • the collage creation is a process that may be implemented in a network entity (centralized), at the edge or within one of the participant apparatus (e.g. a user equipment UE).
  • the collage is created based on one or more aspects including but not limited to the number of participants, the spatial characteristics of the physical environment of the participants and the characteristics of the video stream from each participant.
  • Figure 6 shows an example of various information and parameters used in the collage creation.
  • Upon establishing the videoconference session, the collage creator obtains some session level information, i.e. at least the number of participants and possibly some other session preferences, such as a usable bandwidth, etc.
  • Each participant provides its own participant level information to be used in the collage creation. This may involve the supported video properties, the distance of the object-of-interest (i.e. the user) from the camera, wherein the distance may be indicated explicitly or implicitly (derivable by the collage creator), and spatial information of the space of the participant.
  • the collage creator may have some internal settings controlling the collage creation.
  • the collage creator may be configured to maintain eye contact with each participant (i.e. user), for which billboarding may be advantageously utilized, as well as for displaying the participants in life-size dimensions.
  • the collage may preferably enable a virtual background to be individually applied by each participant.
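For illustration, a very small sketch of one possible placement heuristic is given below, spreading the remote participants evenly over the frontal field of the collage; a real collage creator would additionally weigh room geometry, roles and stream properties.

```python
# Sketch: spread N remote participants evenly across the frontal 180 degrees
# of each viewer's collage (geometry, roles and room shape ignored here).
def assign_azimuths(participant_ids, span_deg=180.0):
    n = len(participant_ids)
    if n == 1:
        return {participant_ids[0]: 0.0}
    step = span_deg / (n - 1)
    return {pid: -span_deg / 2.0 + i * step for i, pid in enumerate(participant_ids)}


print(assign_azimuths(["U2", "U1", "U3"]))   # {'U2': -90.0, 'U1': 0.0, 'U3': 90.0}
```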
  • the collage creator provides the session level information, the participant level information, such as the supported video properties of the one or more other participant apparatus, and the collage with appropriate metadata, such as an XR description, for each video stream of the other participants.
  • each video stream from a participant may be assigned an XR description.
  • the XR description defines a kind of anchor the stream should be rendered on.
  • an XR description may, for example, control a stream to be rendered onto a surface at floor level on the left-hand side of the viewport with particular dimensions.
  • the XR description is assigned by the collage creator based on the information received from the participants. Table 1 below discloses some aspects that may be defined for the XR description in the collage.
  • the collage creator may place participants with lower resolution video streams at a greater depth than other participants so that they can be rendered with smaller dimensions. This creates the perception that the participant is located farther away.
  • the depth is a function of the image/video resolution.
  • Other embodiments may link other scene properties with stream properties e.g. a low-resolution image may be rendered as a flat overlay on a wall so that it does not disrupt the realism of the other images that may be displayed with billboarding.
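The depth-as-a-function-of-resolution idea can be sketched as a simple mapping; the reference values below are illustrative only.

```python
# Sketch: place lower-resolution streams farther away so they can be rendered
# smaller without visible upscaling artefacts (constants are illustrative).
def depth_for_resolution(height_px: int, reference_height_px: int = 1080,
                         reference_depth_m: float = 1.5) -> float:
    return reference_depth_m * reference_height_px / max(height_px, 1)


print(depth_for_resolution(1080))  # 1.5
print(depth_for_resolution(360))   # 4.5 -> rendered farther away, hence smaller
```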
  • Figures 7a - 7c illustrate an example of a virtual videoconference with four participants, U1, U2, U3 and U4, provided with the display unit capable of displaying AR content.
  • the participant U1 may send its video stream at a lower resolution than the other participants, for example due to its limited video properties or as requested by the media gateway.
  • the collage creator may then create a collage placing U1 at a greater depth, as shown in Figure 7a.
  • the resultant collage does not need any detailed scene description apart from relative position of each participant with respect to the other and the depth levels.
  • Figures 7b and 7c show the virtual conference rooms of participants U2 and U3, respectively, where the other participants are displayed as AR.
  • user U2 sees an AR view of his/her own physical room, where an avatar (or other representation) of user U4 is displayed on the front-left side, user U1 is displayed in the middle but at a farther distance, and U3 on the front-right side. If the virtual conference room of U2 is shown as AR content, the furniture in front of U2 may also be displayed.
  • the participant apparatus is configured to create a virtual background for the virtual videoconferencing space, while maintaining the spatial positions of each participant apparatus in said virtual videoconferencing space.
  • a participant apparatus e.g. a UE
  • the participant apparatus may further share the VR mode 3D scene description with the other participant apparatus (UEs) via the same or a separate communication channel.
  • the other participant apparatus (UEs) may also switch from AR mode to VR mode and vice versa.
  • the virtual conference room of U2 may instead be shown as completely virtual reality (VR), wherein the positions of other users U1, U3 and U4 may be the same, but the background of the virtual conference room may be created and displayed as controlled by the user U2.
  • user U3 sees an AR view of his/her own physical room, where an avatar (or other representation) of user U2 is displayed on the front-left side, user U4 is displayed in the middle, and U1 on the front-right side but at a farther distance. If the virtual conference room of U3 is shown as AR content, the furniture on the back wall of the room and on the front-right side of U3 may also be displayed.
  • Figure 8 shows another example of a virtual videoconference with two participants, Andy and Bob, where based on the physical space and number of participants, a collage creator may choose to include a portion of the background of a participant when creating a collage instead of using a cutout of just the person of interest.
  • Bob and Andy establish a call from their living rooms that are equipped with 360-degree capturing devices configured to track the object of interest (OOI) e.g. a person. Both Bob and Andy are wearing AR devices. As there are only two participants, a 180-degree view of the room is displayed for each side.
  • the participant apparatus is configured to obtain control information about an extent of background to be captured around the object-of-interest.
  • the user may indicate the extent of background that can be captured and used within the collage.
  • the extent is marked by defining a boundary within the room using a sensor, such as a handheld controller connected to the AR device or using in-built sensors of the AR device.
  • the extent is defined as predefined levels, e.g., i) just the head and the body, ii) the head, the body and 10-degrees on either side, iii) the whole room, etc.
  • the apparatus operating as the media gateway is configured to define spatial regions within the virtual videoconferencing space for each participant to allow movements of the participant within the respective spatial region.
  • a collage creator may choose to define spatial boundaries (regions) for each participant to have some spatial constraints within which participants can move freely.
  • the spatial regions for the movement can be calculated by comparing the region area and the physical area of the participant so that all participants can have normalized movement. For example, when the same scene is created in two physical spaces that have different sizes, it may take longer for an object (participant) to move from one point of the scene to the other in a larger physical space than for an object in a smaller physical space.
  • the collage provides sufficient information to be able to render the motion in a realistic way for each participant apparatus (e.g. UE) in reference to their own created MR scene. Similar transformations may be applied to realistically represent gestures (pointing at a particular stream) and eye gaze.
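One possible way to normalize movement between differently sized spaces is sketched below: a displacement measured in the participant's physical region is rescaled to the spatial region reserved for that participant in the collage; the region sizes are assumed inputs.

```python
# Sketch: map a displacement measured in the participant's physical region
# onto the spatial region reserved for that participant in the collage.
def normalize_motion(displacement_m, physical_region_m, collage_region_m):
    """All arguments are (width, depth) tuples in metres."""
    sx = collage_region_m[0] / physical_region_m[0]
    sz = collage_region_m[1] / physical_region_m[1]
    return (displacement_m[0] * sx, displacement_m[1] * sz)


# Moving 1 m sideways in a 4 m wide room maps to 0.5 m in a 2 m wide region.
print(normalize_motion((1.0, 0.0), physical_region_m=(4.0, 3.0), collage_region_m=(2.0, 1.5)))
```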
  • FIG. 9 shows an exemplified setup for a unidirectional case, where UE-A is a participant apparatus equipped with an AR device that is receiving video streams from multiple remote participants via an RTP Media Gateway (MGW). It is noted that the same principle can be extended to bidirectional cases.
  • the RTP-based AR/MR teleconference without additional transcoding may in the unidirectional case comprise the following steps:
  • Each participant joins the conference by sending their capabilities to the session manager. For UE-A, this includes sending (la) spatial information about the room in addition to the rendering capabilities. The sending UEs would advertise (lb) their capabilities for sending the video.
  • a collage is created based on the received information.
  • the collage and stream properties are distributed to the participants and the MGW (3a). This step would include finalizing the SDP negotiation (3b) to set up a media channel for each stream. In the unidirectional case, the sending UEs do not need to be aware of the collage.
  • the sending UEs start sending the negotiated stream to the MGW. It is assumed in this case that the UE is able to set the appropriate alpha channels to make the background (any part of the picture that does not have the object of interest) transparent for 2D participant videos that are displayed as cutout with billboarding effect. In another embodiment, this may be done in the MGW.
  • the RTP MGW sends the video streams (5a) using viewport-dependent delivery (VDD) to UE-A, wherein the delivery of the video streams may be adjusted based on the viewport feedback (5b) received from UE-A.
  • the viewport-dependent delivery of the video streams of the one or more participant apparatus comprises pausing the delivery of the video streams not being displayed in the viewport.
  • the MR teleconference may be realized using pause/resume to disable RTP streams that are not in the viewport.
  • Such an embodiment may be implemented without any MGW or middle box.
  • the media gateway is a Media Resource Function (MRF) configured to carry out one or more media processing operations for at least one video stream of the one or more participant apparatus.
  • the media handling operations may comprise e.g., setting appropriate alpha channels and delivering the video stream as an overlay and various quality and bandwidth adaptations.
  • the media gateway is configured to carry out transcoding operations for at least one video stream of the one or more participant apparatus.
  • the collage MGW may, for example, change the resolutions of participant streams in real-time by transcoding the input streams.
  • one of the streams may remain as the original high-resolution stream, while the others are converted to low resolutions.
  • the different depth values, as described above, may be used for the high and low resolution streams.
  • ML-based video segmentation can be used to extract the person or real avatar.
  • the collage may support some degrees of motion, up to 3DoF+.
  • the movement may be automatically captured as long as the participant remains within the delivered field-of-view of the overlay.
  • the collage is configured to take this into account by not overlapping 2D/spherical overlays rendered in this manner, unless they vary sufficiently in depth.
  • the collage may be configured to vary the size of an avatar when the user moves in towards or away from the camera.
  • the collage may be configured to also define a fixed space in which the video/avatar may move as part of the XR description. In this case, the video/avatar would be rendered with enough occlusion-free space around the anchor to allow the movement. Support for higher degrees of motion may require updating the collage based on motion.
  • the method comprises assigning a videoconference role for one or more participant apparatus of the video conference; and configuring said data set to include the roles.
  • the participants may be assigned roles when creating the collage and the roles may be signaled in the collage description to the other participants.
  • Every UE may have its role definition, e.g. in the context of a virtual meeting, one UE may be “chairman” and the others can be “meeting participant”.
  • Different roles can have corresponding spatial meanings and affect the final representation/display. For example, a chairman may always be present (e.g. as a viewport-dependent fixed overlay) regardless of the pose or current viewport of the participant apparatus.
  • the role of a chairman may be associated with a specific position in the layout, e.g. the center of the collage.
  • the collage layout may be revised when the roles of the participants change. For example, B replaces A to become the new chairman of the meeting, whereupon B is placed at the center of the collage and A is placed with the rest of the participants on the sides.
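As an illustration of role-dependent rendering, the sketch below keeps the chairman as a viewport-fixed overlay while anchoring the other participants in the world; the anchoring vocabulary is invented for this example.

```python
# Sketch: choose an anchoring mode per participant based on the signalled role
# ("viewport" = always-visible overlay, "world" = fixed place in the room).
def anchor_for_role(role: str) -> str:
    return "viewport" if role == "chairman" else "world"


roles = {"A": "chairman", "B": "meeting participant", "C": "meeting participant"}
print({pid: anchor_for_role(r) for pid, r in roles.items()})
# {'A': 'viewport', 'B': 'world', 'C': 'world'}
```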
  • the XR description may be defined as part of a Session Description Protocol.
  • this session description protocol may be referred to by glTF (Graphics Language Transmission Format) and glTF extensions, such as the MPEG Scene Description.
  • the collage creator is the owner of all the nodes (avatars and overlays) and it creates and updates the scene description based on the information received from the participants about their respective media and physical space.
  • the XR description may be defined as part of a Scene Description protocol.
  • this scene description protocol may be based on glTF (Graphics Language Transmission Format) and glTF extensions, such as MPEG Scene Description.
  • the scene description protocols may also be based on approaches other than glTF.
  • the description of the physical space of the participants may be provided by a Spatial Computing Server, where the Spatial Computing Server has received this information from the participants or other sensors, UEs, etc.
  • the Collage Creator may use the file format, e.g. MPEG OMAF, to define the position and rendering of the overlay streams.
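For orientation only, a highly simplified glTF-style node for one participant overlay is shown below; the X_collage_video extension key is invented for this sketch and is not part of glTF or MPEG Scene Description.

```python
# Highly simplified, illustrative glTF-style node for one participant overlay.
# The "X_collage_video" extension key is invented for this sketch; a real
# deployment would use extensions defined by MPEG Scene Description.
import json

node = {
    "name": "participant_U3",
    "translation": [1.0, 0.0, -2.0],        # position in the shared layout
    "rotation": [0.0, 0.0, 0.0, 1.0],       # quaternion orientation
    "extensions": {
        "X_collage_video": {                 # hypothetical extension
            "uri": "rtp://mgw.example.com/streams/u3",
            "billboard": True,
        }
    },
}

print(json.dumps(node, indent=2))
```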
  • An example of an apparatus is shown in Fig. 10.
  • the apparatus 90 comprises a main processing unit 91, a memory 92, a user interface 94, and a communication interface 93.
  • the apparatus according to an embodiment, shown in Fig. 7, also comprises a camera module 95.
  • the memory 92 stores data including computer program code in the apparatus 90.
  • the computer program code is configured to implement the method according to flowchart of Fig. 4.
  • the camera module 95 receives input data, in the form of video stream, to be processed by the processor 91.
  • the communication interface 93 forwards processed data for example to a display of another device, such as an HMD.
  • the apparatus 90 is a video source comprising the camera module 95.
  • user inputs may be received from the user interface.
  • the user interface, like the camera module, is optional.
  • the various embodiments may provide advantages.
  • the embodiments provide a standard approach to map and lay out the remote participants with respect to their local physical environments.
  • An MR experience obtained according to the embodiments facilitates immersion, maintains awareness of the surroundings and facilitates a symmetric experience among participants.
  • a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
  • the computer program code comprises one or more operational characteristics.
  • Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises obtaining information of available quality features of a sender device for providing omnidirectional visual media content; selecting one or more of the available quality features; and providing an indication of the selected one or more of the available quality features.
  • the apparatus configured to create the collage may comprise at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; create a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and deliver the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
  • the participant apparatus may comprise a camera configured to capture a video about a user of said apparatus; a display unit capable of displaying augmented reality (AR) content; at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: determine a plurality of videoconferencing parameters of a video conference; send said plurality of videoconferencing parameters to at least one participant apparatus of the videoconference; receive a data set from said at least one participant apparatus of the videoconference, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in a virtual videoconferencing space created by said at least one participant apparatus and parameters of a video stream of each participant apparatus; and render, based on the data set, the virtual videoconferencing space on the display unit, where representations of users of each participant apparatus are rendered on their corresponding spatial positions.
  • a computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

Abstract

A method comprising obtaining a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; creating a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and delivering the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO CONFERENCING
Technical Field
[0001] The present solution generally relates to video conferencing.
Background
[0002] Devices that are able to capture image and video have evolved from devices capturing a limited angular field of view to devices capturing 360-degree content. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, which may be referred to as 360 degrees field of view. More precisely, the devices can capture a spherical field of view (i.e., 360 degrees in all spatial directions). In addition to the new types of image/video capturing devices, also new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is the entire sphere, is commonly referred to as virtual reality (VR) or omnidirectional video.
[0003] The omnidirectional video provides further possibilities for implementing a VR videoconference. 360-degree video teleconferences can use 360-degree cameras at one location (e.g. a meeting room) that can be viewed immersively by remote participants.
[0004] However, this creates an asymmetric experience for the remote and locally present participants. A further problem in live capture of video streams for creating virtual reality (VR) experiences is that it typically relies on heavy centralized processing that can add further delays and complexity, thereby being not ideal for creating a real-time conversational experience.
Summary
[0005] The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
[0006] Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.
[0007] According to a first aspect, there is provided a method comprising: obtaining a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; creating a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and delivering the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
[0008] An apparatus according to a second aspect comprises means for: obtaining a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; creating a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and delivering the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
[0009] According to an embodiment, the apparatus comprises means for configuring said data set as a metadata stream specifying information for the video streams of each participant apparatus. [0010] According to an embodiment, the apparatus comprises means for delivering the video streams of each participant apparatus along the metadata stream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
[0011] According to an embodiment, the apparatus comprises means for configuring said data set as a manifest file or an Extended Reality (XR) descriptor in the metadata stream.
[0012] According to an embodiment, the apparatus comprises means for configuring said data set as one or more data structures describing overlays, where an overlay corresponds to a participant’s video stream and an overlay position corresponds to the spatial position of the participant's video stream in the viewing space.
[0013] According to an embodiment, the apparatus comprises means for configuring said data set as one or more data structures describing overlays, where an overlay corresponds to a participant’s video stream and an overlay position corresponds to the spatial position of the participant's video stream in the viewing space.
[0014] According to an embodiment, the apparatus comprises means for configuring said data set as one or more data structures describing overlays according to MPEG OMAF.
[0015] According to an embodiment, the apparatus comprises means for indicating that the participant's video streams originate from external overlay sources rather than being conveyed according to MPEG OMAF.
[0016] According to an embodiment, the apparatus comprises means for indicating that background visual media is transparent or absent in an OMAF presentation and only the overlays are to be displayed.
[0017] According to an embodiment, the participant video stream is a 2D video configured to be displayed using a billboarding effect.
[0018] According to an embodiment, the apparatus comprises means for applying real-time transport protocol (RTP) to provide the one or more participant apparatus provided with the display unit capable of displaying AR content with the video stream of other participant apparatus according to viewport-dependent delivery.
[0019] According to an embodiment, the apparatus comprises means for pausing the delivery of the video streams not being displayed in the viewport.
[0020] According to an embodiment, the apparatus comprises means for creating a virtual background for the virtual videoconferencing space, while maintaining the spatial positions of each participant apparatus in said virtual videoconferencing space.
[0021] According to an embodiment, the apparatus comprises means for defining spatial regions within the virtual videoconferencing space for each participant to allow movements of the participant within the respective spatial region.
[0022] An apparatus according to a third aspect comprises at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; create a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and deliver the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content. [0023] An apparatus according to a fourth aspect comprises a camera configured to capture a video about a user of said apparatus; a display unit capable of displaying augmented reality (AR) content; means for determining a plurality of videoconferencing parameters of a video conference; means for sending said plurality of videoconferencing parameters to at least one participant apparatus of the videoconference; means for receiving a data set from said at least one participant apparatus of the videoconference, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in a virtual videoconferencing space created by said at least one participant apparatus and parameters of a video stream of each participant apparatus; and means for rendering, based on the data set, the virtual videoconferencing space on the display unit, where representations of users of each participant apparatus are rendered on their corresponding spatial positions.
[0024] An apparatus according to a fifth aspect comprises a camera configured to capture a video about a user of said apparatus; a display unit capable of displaying augmented reality (AR) content; at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: determine a plurality of videoconferencing parameters of a video conference; send said plurality of videoconferencing parameters to at least one participant apparatus of the videoconference; receive a data set from said at least one participant apparatus of the videoconference, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in a virtual videoconferencing space created by said at least one participant apparatus and parameters of a video stream of each participant apparatus; and render, based on the data set, the virtual videoconferencing space on the display unit, where representations of users of each participant apparatus are rendered on their corresponding spatial positions.
Description of the Drawings
[0025] In the following, various embodiments will be described in more detail with reference to the appended drawings, in which
[0026] Fig. 1 shows an example of an OMAF end-to-end system;
[0027] Fig. 2a shows an example of image stitching, projection and region-wise packing;
[0028] Fig. 2b shows an example of a process of forming a monoscopic equirectangular panorama picture;
[0029] Figs. 3a-3c show examples of a conference room with some participants;
[0030] Fig. 4 shows an example of a teleconference setup;
[0031] Fig. 5 shows a flow chart of a method for collage creation according to an aspect; [0032] Fig. 6 shows an example of various information and parameters used in the collage creation;
[0033] Figs. 7a - 7c illustrate an example of a virtual videoconference with four participants according to an embodiment;
[0034] Fig. 8 shows an example of a virtual videoconference with two participants according to an embodiment;
[0035] Fig. 9 shows an exemplified setup for a unidirectional RTP-based AR/MR teleconference according to an embodiment; and
[0036] Fig. 10 shows an apparatus according to an embodiment.
Description of Example Embodiments
[0037] In the following, several embodiments of the disclosure will be described in the context of one videoconferencing arrangement. It is to be noted that different embodiments may have applications widely in any environment where improvement of virtual videoconferencing is desired.
[0038] In the following, term “omnidirectional” may refer to media content that may have greater spatial extent than a field-of-view of a device rendering the content. Omnidirectional content may, for example, cover substantially 360 degrees in the horizontal dimension and substantially 180 degrees in the vertical dimension, but omnidirectional may also refer to content covering less than 360 degree view in the horizontal direction and/or 180 degree view in the vertical direction.
[0039] Immersive multimedia, such as omnidirectional content consumption, is more complex for the end user compared to the consumption of 2D content. This is due to the higher degree of freedom available to the end user. The freedom also results in more uncertainty. The MPEG Omnidirectional Media Format (OMAF) v1 standardized the omnidirectional streaming of single 3DoF (3 Degrees of Freedom) content, where the viewer is located at the centre of a unit sphere and has three degrees of freedom (Yaw-Pitch-Roll). OMAF includes means to optimize the Viewport Dependent Streaming (VDS) operations and bandwidth management.
[0040] A viewport may be defined as a region of an omnidirectional image or video suitable for display and viewing by the user. A current viewport (which may sometimes be referred to simply as a viewport) may be defined as the part of the spherical video that is currently displayed and hence is viewable by the user(s). At any point of time, a video rendered by an application on a head-mounted display (HMD) renders a portion of the 360-degree video, which is referred to as a viewport. Likewise, when viewing a spatial part of the 360-degree content on a conventional display, the spatial part that is currently displayed is a viewport. A viewport is a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display. A viewport may be characterized by a horizontal field-of-view (VHFoV) and a vertical field-of-view (VVFoV).
[0041] The 360-degree space may be divided into a discrete set of viewports, each separated by a given distance (e.g., expressed in degrees), so that the omnidirectional space can be imagined as a map of overlapping viewports, and the viewport is switched discretely as the user changes his/her orientation while watching content with a head-mounted display (HMD). When the overlapping between viewports is reduced to zero, the viewports can be imagined as adjacent non-overlapping tiles within the 360-degrees space. The H.265 video codec implements the concept of tiles which may be used to realize this scenario (both overlapping and not).
[0042] When streaming VR video, a subset of 360-degree video content covering the viewport (i.e., the current view orientation) may be transmitted at the best quality/resolution, while the remainder of the 360-degree video may be transmitted at a lower quality/resolution. This is what characterizes a VDS system, as opposed to a Viewport Independent Streaming system, where the omnidirectional video is streamed at high quality in all directions.
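A minimal sketch of such a viewport-dependent quality decision is given below, assuming a coarse tiling of the sphere; the tile grid, field-of-view values and quality labels are illustrative.

```python
# Sketch: viewport-dependent quality selection over a coarse tiling of the
# sphere (tile grid, FoV and quality labels are illustrative).
def angular_diff(a_deg: float, b_deg: float) -> float:
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def pick_qualities(tile_centers_deg, view_azimuth_deg, view_elevation_deg,
                   hfov_deg=90.0, vfov_deg=90.0):
    qualities = {}
    for tile, (az, el) in tile_centers_deg.items():
        inside = (angular_diff(az, view_azimuth_deg) <= hfov_deg / 2.0
                  and abs(el - view_elevation_deg) <= vfov_deg / 2.0)
        qualities[tile] = "high" if inside else "low"
    return qualities


tiles = {"front": (0, 0), "right": (90, 0), "back": (180, 0), "left": (-90, 0),
         "top": (0, 90), "bottom": (0, -90)}
print(pick_qualities(tiles, view_azimuth_deg=20.0, view_elevation_deg=0.0))
```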
[0043] Fig. 1 illustrates the Omnidirectional Media Format (OMAF) system architecture. The system can be situated in a video camera, or in a network server, for example. As shown in Figure 1, an omnidirectional media (A) is acquired. If the OMAF system is part of the video source, the omnidirectional media (A) is acquired from the camera means. If the OMAF system is in a network server, the omnidirectional media (A) is acquired from a video source over a network.
[0044] The omnidirectional media comprises image data (Bi) and audio data (Ba), which are processed separately. In image stitching, rotation, projection and region-wise packing, the images/video of the source media provided as input (Bi) are stitched to generate a sphere picture on a unit sphere per the global coordinate axes. The unit sphere is then rotated relative to the global coordinate axes. The amount of rotation to convert from the local coordinate axes to the global coordinate axes may be specified by the rotation angles indicated in a RotationBox. The local coordinate axes of the unit sphere are the axes of the coordinate system that has been rotated. The absence of the RotationBox indicates that the local coordinate axes are the same as the global coordinate axes. Then, the spherical picture on the rotated unit sphere is converted to a two-dimensional projected picture, for example using the equirectangular projection. When spatial packing of stereoscopic content is applied, two spherical pictures for the two views are converted to two constituent pictures, after which frame packing is applied to pack the two constituent pictures on one projected picture. Rectangular region-wise packing can then be applied to obtain a packed picture from the projected picture. The packed pictures (D) are then provided for video and image encoding to result in an encoded image (Ei) and/or an encoded video stream (Ev). The audio of the source media is provided as input (Ba) to audio encoding that provides an encoded audio (Ea). The encoded data (Ei, Ev, Ea) are then encapsulated into a file for playback (F) and delivery (i.e. streaming) (Fs).
[0045] A real-world audio-visual scene (A) may be captured 200 by audio sensors as well as a set of cameras or a camera device with multiple lenses and sensors. The acquisition results in a set of digital image/video (Bi) and audio (Ba) signals. The cameras/lenses may cover all directions around the center point of the camera set or camera device, thus the name of 360-degree video. [0046] Audio can be captured using many different microphone configurations and stored as several different content formats, including channel-based signals, static or dynamic (i.e. moving through the 3D scene) object signals, and scene-based signals (e.g., Higher Order Ambisonics). The channel-based signals may conform to one of the loudspeaker layouts defined in CICP (Coding- Independent Code-Points). In an omnidirectional media application, the loudspeaker layout signals of the rendered immersive audio program may be binauralized for presentation via headphones. [0047] The images (Bi) of the same time instance are stitched, projected, and mapped 201 onto a packed picture (D).
[0048] For monoscopic 360-degree video, the input images of one time instance may be stitched to generate a projected picture representing one view. An example of image stitching, projection, and region-wise packing process for monoscopic content is illustrated with Fig. 2a.
Input images (Bi) are stitched and projected onto a three-dimensional projection structure that may for example be a unit sphere. The projection structure may be considered to comprise one or more surfaces, such as plane(s) or part(s) thereof. A projection structure may be defined as a three-dimensional structure consisting of one or more surface(s) on which the captured VR image/video content is projected, and from which a respective projected picture can be formed. The image data on the projection structure is further arranged onto a two-dimensional projected picture (C). The term projection may be defined as a process by which a set of input images are projected onto a projected picture. There may be a pre-defined set of representation formats of the projected picture, including for example an equirectangular projection (ERP) format and a cube map projection (CMP) format. It may be considered that the projected picture covers the entire sphere.
[0049] Optionally, a region-wise packing is then applied to map the projected picture (C) onto a packed picture (D). If the region-wise packing is not applied, the packed picture is identical to the projected picture, and this picture is given as input to image/video encoding. Otherwise, regions of the projected picture (C) are mapped onto a packed picture (D) by indicating the location, shape, and size of each region in the packed picture, and the packed picture (D) is given as input to image/video encoding. The term region-wise packing may be defined as a process by which a projected picture is mapped to a packed picture. The term packed picture may be defined as a picture that results from region-wise packing of a projected picture. [0050] In the case of stereoscopic 360-degree video, as shown in an example of Fig. 2a, the input images of one time instance are stitched to generate a projected picture representing two views (CL, CR), one for each eye. Both views (CL, CR) can be mapped onto the same packed picture (D) and encoded by a traditional 2D video encoder. Alternatively, each view of the projected picture can be mapped to its own packed picture, in which case the image stitching, projection, and region-wise packing is performed as illustrated in Fig. 2a. A sequence of packed pictures of either the left view or the right view can be independently coded or, when using a multiview video encoder, predicted from the other view.
[0051] An example of image stitching, projection, and region-wise packing process for stereoscopic content where both views are mapped onto the same packed picture, as shown in Fig.
1 is described next in a more detailed manner. Input images (Bi) are stitched and projected onto two three-dimensional projection structures, one for each eye. The image data on each projection structure is further arranged onto a two-dimensional projected picture (CL for left eye, CR for right eye), which covers the entire sphere. Frame packing is applied to pack the left view picture and right view picture onto the same projected picture. Optionally, region-wise packing is then applied to pack the projected picture onto a packed picture, and the packed picture (D) is given as input to image/video encoding. If the region-wise packing is not applied, the packed picture is identical to the projected picture, and this picture is given as input to image/video encoding.
[0052] The image stitching, projection, and region-wise packing process can be carried out multiple times for the same source images to create different versions of the same content, e.g. for different orientations of the projection structure. Similarly, the region-wise packing process can be performed multiple times from the same projected picture to create more than one sequence of packed pictures to be encoded.
[0053] 360-degree panoramic content (i.e., images and video) covers horizontally (up to) the full 360-degree field-of-view around the capturing position of an imaging device. The vertical field-of-view may vary and can be e.g. 180 degrees. A panoramic image covering a 360-degree field-of-view horizontally and a 180-degree field-of-view vertically can be represented by a sphere that has been mapped to a two-dimensional image plane using equirectangular projection (ERP). In this case, the horizontal coordinate may be considered equivalent to a longitude, and the vertical coordinate may be considered equivalent to a latitude, with no transformation or scaling applied. In some cases panoramic content with a 360-degree horizontal field-of-view but with less than a 180-degree vertical field-of-view may be considered a special case of equirectangular projection, where the polar areas of the sphere have not been mapped onto the two-dimensional image plane. In some cases panoramic content may have less than a 360-degree horizontal field-of-view and up to a 180-degree vertical field-of-view, while otherwise having the characteristics of the equirectangular projection format. [0054] An example of the process of forming a monoscopic equirectangular panorama picture is illustrated in Fig. 2b. A set of input images, such as fisheye images of a camera array or a camera device with multiple lenses and sensors, is stitched onto a spherical image. The spherical image is further projected onto a cylinder (without the top and bottom faces). The cylinder is unfolded to form a two-dimensional projected picture. In practice one or more of the presented steps may be merged; for example, the input images may be directly projected onto a cylinder without an intermediate projection onto a sphere. The projection structure for equirectangular panorama may be considered to be a cylinder that comprises a single surface.
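The longitude/latitude equivalence mentioned above corresponds to a simple linear mapping between sphere angles and ERP picture coordinates; the sketch below uses one possible sign convention (conventions differ between specifications).

```python
# Minimal ERP mapping sketch: linear mapping between sphere angles and picture
# coordinates. This convention maps azimuth left-to-right from +180 to -180
# degrees and elevation top-to-bottom from +90 to -90 degrees.
def sphere_to_erp(azimuth_deg, elevation_deg, pic_width, pic_height):
    x = (0.5 - azimuth_deg / 360.0) * pic_width
    y = (0.5 - elevation_deg / 180.0) * pic_height
    return x, y


def erp_to_sphere(x, y, pic_width, pic_height):
    azimuth_deg = (0.5 - x / pic_width) * 360.0
    elevation_deg = (0.5 - y / pic_height) * 180.0
    return azimuth_deg, elevation_deg


print(sphere_to_erp(0.0, 0.0, 3840, 1920))   # picture centre: (1920.0, 960.0)
print(erp_to_sphere(0.0, 0.0, 3840, 1920))   # top-left corner: (180.0, 90.0)
```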
[0055] In general, 360-degree content can be mapped onto different types of solid geometrical structures, such as a polyhedron (i.e. a three-dimensional solid object containing flat polygonal faces, straight edges and sharp corners or vertices, e.g., a cube or a pyramid), a cylinder (by projecting a spherical image onto the cylinder, as described above with the equirectangular projection), a cylinder (directly without projecting onto a sphere first), a cone, etc. and then unwrapped to a two-dimensional image plane.
[0056] In some cases, panoramic content with a 360-degree horizontal field-of-view but with less than a 180-degree vertical field-of-view may be considered a special case of equirectangular projection, where the polar areas of the sphere have not been mapped onto the two-dimensional image plane. In some cases, a panoramic image may have less than a 360-degree horizontal field-of-view and up to a 180-degree vertical field-of-view, while otherwise having the characteristics of the equirectangular projection format.
[0057] In 360-degree systems, a coordinate system may be defined through orthogonal coordinate axes, such as X (lateral), Y (vertical, pointing upwards), and Z (back-to-front axis, pointing outwards). Rotations around the axes may be defined and may be referred to as yaw, pitch, and roll. Yaw may be defined to rotate around the Y axis, pitch around the X axis, and roll around the Z axis. Rotations may be defined to be extrinsic, i.e., around the X, Y, and Z fixed reference axes. The angles may be defined to increase clockwise when looking from the origin towards the positive end of an axis. The coordinate system specified can be used for defining the sphere coordinates, which may be referred to as azimuth (φ) and elevation (θ).
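The yaw/pitch/roll rotations can be written as standard rotation matrices about the Y, X and Z axes, as sketched below; right-handed matrices are used here, and the clockwise-positive convention of the text corresponds to negating the angles.

```python
# Sketch: rotation matrices for yaw (about Y), pitch (about X) and roll
# (about Z), composed as extrinsic rotations about the fixed axes.
import numpy as np


def yaw(a):    # rotation about Y
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])


def pitch(a):  # rotation about X
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])


def roll(a):   # rotation about Z
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])


def rotate(v, yaw_rad, pitch_rad, roll_rad):
    # Applied as roll, then pitch, then yaw about the fixed reference axes.
    return yaw(yaw_rad) @ pitch(pitch_rad) @ roll(roll_rad) @ v


print(rotate(np.array([0.0, 0.0, 1.0]), np.pi / 2, 0.0, 0.0))  # ~[1, 0, 0]
```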
[0058] Referring again to Fig. 1, OMAF allows the omission of image stitching, projection, and region-wise packing and encoding the image/video data in its captured format. In this case, images (D) are considered the same as images (Bi) and a limited number of fisheye images per time instance are encoded.
[0059] For audio, the stitching process is not needed, since the captured signals are inherently immersive and omnidirectional.
[0060] The stitched images (D) are encoded 204 as coded images (Ei) or a coded video bitstream (Ev). The captured audio (Ba) is encoded 202 as an audio bitstream (Ea). The coded images, video, and/or audio are then composed 205 into a media file for file playback (F) or a sequence of an initialization segment and media segments for streaming (Fs), according to a particular media container file format. In this specification, the media container file format is the ISO base media file format. The file encapsulator 205 also includes metadata into the file or the segments, such as projection and region-wise packing information assisting in rendering the decoded packed pictures.
[0061] The metadata in the file may include:
- the projection format of the projected picture,
- fisheye video parameters,
- the area of the spherical surface covered by the packed picture,
- the orientation of the projection structure corresponding to the projected picture relative to the global coordinate axes,
- region-wise packing information, and
- region-wise quality ranking (optional).
[0062] Region-wise packing information may be encoded as metadata in or along the bitstream, for example as region-wise packing SEI message(s) and/or as region-wise packing boxes in a file containing the bitstream. For example, the packing information may comprise a region-wise mapping from a pre-defined or indicated source format to the packed picture format, e.g. from a projected picture to a packed picture, as described earlier. The region-wise mapping information may for example comprise for each mapped region a source rectangle (a.k.a. projected region) in the projected picture and a destination rectangle (a.k.a. packed region) in the packed picture, where samples within the source rectangle are mapped to the destination rectangle and rectangles may for example be indicated by the locations of the top-left corner and the bottom-right corner. The mapping may comprise resampling. Additionally or alternatively, the packing information may comprise one or more of the following: the orientation of the three-dimensional projection structure relative to a coordinate system, an indication of which projection format is used, region-wise quality ranking indicating the picture quality ranking between regions and/or first and second spatial region sequences, and one or more transformation operations, such as rotation by 90, 180, or 270 degrees, horizontal mirroring, and vertical mirroring. The semantics of the packing information may be specified in a manner that they indicate, for each sample location within the packed regions of a decoded picture, the respective spherical coordinate location.
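As an illustration of the region-wise mapping described above, the following Python sketch models one mapped region and resolves the projected-picture location of a packed-picture sample. The field names and the purely linear resampling are assumptions of this sketch and do not reproduce the normative OMAF syntax.

```python
from dataclasses import dataclass

@dataclass
class RegionWiseMapping:
    """One mapped region: a source (projected) rectangle and a destination
    (packed) rectangle, each given by its top-left and bottom-right corners."""
    proj_top_left: tuple       # (x, y) in the projected picture
    proj_bottom_right: tuple
    packed_top_left: tuple     # (x, y) in the packed picture
    packed_bottom_right: tuple
    transform: str = "none"    # e.g. "none", "rot90", "mirror_h"

def packed_to_projected(mapping, x_packed, y_packed):
    """Return the projected-picture location of a packed-picture sample.

    Only the identity transform with linear resampling is sketched here;
    rotations and mirroring would modify the relative coordinates first.
    """
    px0, py0 = mapping.packed_top_left
    px1, py1 = mapping.packed_bottom_right
    sx0, sy0 = mapping.proj_top_left
    sx1, sy1 = mapping.proj_bottom_right
    rx = (x_packed - px0) / (px1 - px0)   # relative position inside the packed region
    ry = (y_packed - py0) / (py1 - py0)
    return sx0 + rx * (sx1 - sx0), sy0 + ry * (sy1 - sy0)

# A 960x960 packed region carrying a 1920x960 projected region (resampled):
m = RegionWiseMapping((0, 0), (1920, 960), (0, 1080), (960, 2040))
print(packed_to_projected(m, 480, 1560))  # -> (960.0, 480.0)
```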
[0063] The segments (Fs) may be delivered 206 using a delivery mechanism to a player.
[0064] The file that the file encapsulator outputs (F) is identical to the file that the file decapsulator inputs (F'). A file decapsulator 207 processes the file (F') or the received segments (F's) and extracts the coded bitstreams (E'a, E'v, and/or E'i) and parses the metadata. The audio, video, and/or images are then decoded 208 into decoded signals (B'a for audio, and D' for images/video). The decoded packed pictures (D') are projected 210 onto the screen of a head-mounted display or any other display device 211 based on the current viewing orientation or viewport and the projection, spherical coverage, projection structure orientation, and region-wise packing metadata parsed from the file. Likewise, decoded audio (B'a) is rendered 210, e.g. through headphones 212, according to the current viewing orientation. The current viewing orientation is determined by the head tracking and possibly also eye tracking functionality 208. Besides being used by the renderer 210 to render the appropriate part of the decoded video and audio signals, the current viewing orientation may also be used by the video and audio decoders 209 for decoding optimization.
[0065] The process described above is applicable to both live and on-demand use cases.
[0066] At any point in time, an application rendering video on an HMD or on another display device renders a portion of the 360-degree video. This portion may be defined as a viewport. A viewport may be understood as a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display.
[0067] According to another definition, a viewport may be defined as a part of the spherical video that is currently displayed. A viewport may be characterized by horizontal and vertical fields-of-view (FOV or FoV).
[0068] A viewport may further be defined as a region of an omnidirectional image or video suitable for display and viewing by the user. A current viewport (which may sometimes be referred to simply as a viewport) may be defined as the part of the spherical video that is currently displayed and hence is viewable by the user(s). A viewpoint may be defined as the point or space from which the user views the scene; it usually corresponds to a camera position. Slight head motion may not imply a different viewpoint. A viewing position may be defined as the position within a viewing space from which the user views the scene. A viewing space may be defined as a 3D space of viewing positions within which rendering of image and video is enabled and a VR experience is valid.
[0069] An overlay may be defined as a video clip, an image or text that is superimposed on top of a background, which may be referred to as background visual media and may be e.g. an omnidirectional video or image.
[0070] The appearance of overlays can be controlled flexibly in OMAF. Moreover, the overlay structures are extensible and new controls or properties can be specified in future versions or amendments of the OMAF standard. Background visual media may be defined in OMAF as the omnidirectional video or image that is rendered on the unit sphere, and the term overlay source may refer to the visual content displayed as an overlay.
[0071] The following types of overlays are specified in OMAF:
1) Sphere-relative 2D overlays, where an overlay source is displayed on a plane of a given width and height. The center point of the plane is located at given spherical coordinates and distance from the center of the unit sphere, and the plane can be rotated by given yaw, pitch and roll angles.
2) Sphere-relative omnidirectional overlays, where an omnidirectional projection, such as ERP, has been used for an overlay source. Sphere-relative omnidirectional overlays may, but need not, cover the entire sphere and are located at a given spherical location and distance from the center of the unit sphere.
3) Sphere-relative 3D mesh overlays, where both a 3D mesh and a mapping of an overlay source onto the 3D mesh are specified. The 3D mesh can consist of parallelograms having any rotation and being located at any position within the unit sphere.
4) Viewport-relative overlays, which are located at a given position within the viewport regardless of the viewing orientation. The rendering process projects the sphere-relative overlays and the background visual media onto the viewport, which is then superimposed by the viewport-relative overlays. Since viewports can be of different shape and size in different player devices, the top-left corner position, width, and height of a viewport-relative overlay are provided as percentages relative to the viewport dimensions.
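To make the percentage-based placement of viewport-relative overlays concrete, the following minimal sketch resolves such an overlay to pixel coordinates for a given viewport size. The function and argument names are illustrative assumptions, not OMAF syntax elements.

```python
def viewport_relative_overlay_region(viewport_w, viewport_h,
                                     left_pct, top_pct, width_pct, height_pct):
    """Resolve a viewport-relative overlay given in percent of the viewport."""
    x = round(viewport_w * left_pct / 100.0)    # top-left corner, x
    y = round(viewport_h * top_pct / 100.0)     # top-left corner, y
    w = round(viewport_w * width_pct / 100.0)
    h = round(viewport_h * height_pct / 100.0)
    return x, y, w, h

# The same overlay description adapts to differently sized viewports:
print(viewport_relative_overlay_region(1920, 1080, 70, 5, 25, 25))
print(viewport_relative_overlay_region(1280,  720, 70, 5, 25, 25))
```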
[0072] OMAF enables two rendering modes for presenting sphere-relative overlays with background visual media. In conventional 3DOF rendering, a viewing position that is in the center of the unit sphere is used for projecting the sphere-relative overlays and the background visual media onto the viewport. In the second rendering mode, the viewing position is tracked and used for projecting the content onto the viewport. When the second rendering mode is used with a head-mounted display, it may be referred to as head-tracked rendering. The second rendering mode enables viewing overlays from different perspectives and peeking at the background appearing behind the overlays. Sphere-relative overlays can be placed at given distances from the center of the unit sphere, which is perceivable through motion parallax. Content authors can define a viewing space that specifies valid viewing positions around the center of the unit sphere. OMAF enables specifying the viewing space boundaries as a cuboid, a sphere, a cylinder, or an ellipsoid.
[0073] As discussed above, sphere-relative overlays are located at a given distance from the center of the unit sphere. A layering order can be given so that the player behavior is deterministic when several overlays are positioned at the same distance or when viewport-relative overlays overlap.
[0074] By default, overlays are opaque in OMAF. However, either a constant opacity or an alpha plane that specifies a pixel-wise opacity can be optionally provided.
[0075] In OMAF, the content author can specify separately per each overlay which types of user interactions are enabled. The following user interaction types can be enabled or disabled in an OMAF file: changing the position, modifying the distance from the center of the unit sphere, switching the overlay on or off, tuning the opacity, resizing, rotating, cropping, and switching the overlay source to another one. A textual label can be given for each overlay and utilized by a user interface to enable end-users to switch overlays on or off. Another way is to provide an associated sphere region that the user can select to turn an overlay on or off.
[0076] In OMAF, an overlay source can be a video track or an image item, in which case the overlay consists of the entire decoded picture. Since some player devices might not be capable of running several video decoder instances simultaneously, it is also possible to pack overlays spatially with the background visual media. In that case, an overlay source is specified as a rectangle within the decoded picture area. Furthermore, it is possible to indicate that an overlay source is defined by the recommended viewport timed metadata track. In yet another alternative, an overlay source is provided by external means, such as through a URL. The externally specified overlay source could be used to show content from a separate application within an OMAF presentation.
[0077] In OMAF, the content author has two mechanisms to enable scaling the player-side complexity of overlay rendering. First, each overlay can be given a priority for rendering. The highest priority value means that the overlay must be rendered. Second, it is indicated whether a control or property associated with an overlay is essential or optional. For example, it can be indicated that overlay composition with an alpha plane is optional. In this case, if the player does not have enough resources to carry out the processing required for alpha planes, it is allowed to render an opaque overlay.
[0078] In OMAF, the controls and properties for overlays can be static, i.e. remain constant for the entire duration of the overlay, or dynamic, i.e. signaled by a timed metadata track where the controls and properties are dynamically adjusted. For example, it is possible to move or resize an overlay as a function of time.
[0079] OMAF also defines an OMAF overlay information (OVLY) DASH descriptor at Adaptation Set level. The OVLY descriptor specifies overlays in DASH Representations associated with the descriptor. The OVLY descriptor allows distinguishing, directly from the MPD file, Adaptation Sets that contain overlays from Adaptation Sets that contain background visual media.
[0080] The phrase along the bitstream (e.g. indicating along the bitstream) may be defined to refer to out-of-band transmission, signaling, or storage in a manner that the out-of-band data is associated with the bitstream. The phrase decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream. For example, an indication along the bitstream may refer to metadata in a container file that encapsulates the bitstream.
[0081] Available media file format standards include International Standards Organization (ISO) Base Media File Format (ISO/IEC 14496-12, which may be abbreviated ISOBMFF), Moving Picture Experts Group (MPEG)-4 file format (ISO/IEC 14496-14, also known as the MP4 format), file format for NAL (Network Abstraction Layer) unit structured video (ISO/IEC 14496-15) and High Efficiency Video Coding standard (HEVC or H.265/HEVC).
[0082] Some concepts, structures, and specifications of ISOBMFF are described below as an example of a container file format, based on which the embodiments may be implemented. The aspects of the invention are not limited to ISOBMFF, but rather the description is given for one possible basis on top of which the embodiments of the disclosure may be partly or fully realized. The embodiments of the present disclosure may also be implemented using other transport protocols, such as the Real-Time Transport Protocol (RTP) as described below. In general, the information transmitted to implement this invention can be transmitted using any protocol and any layers of the ISO (International Standardization Organization) OSI (Open Systems Interconnection) protocol stack.
[0083] High Efficiency Image File Format (HEIF) is a standard developed by the Moving Picture Experts Group (MPEG) for storage of images and image sequences. Among other things, the standard facilitates file encapsulation of data coded according to High Efficiency Video Coding (HEVC) standard. HEIF includes features building on top of the used ISO Base Media File Format (ISOBMFF).
[0084] The ISOBMFF structures and features are used to a large extent in the design of HEIF. The basic design for HEIF comprises that still images are stored as items and image sequences are stored as tracks.
[0085] A transmission channel or a communication channel or a channel may refer to either a physical transmission medium, such as a wire, or to a logical connection over a multiplexed medium.
[0086] Real-time Transport Protocol (RTP) is widely used for real-time transport of timed media such as audio and video. RTP may operate on top of the User Datagram Protocol (UDP), which in turn may operate on top of the Internet Protocol (IP). In RTP transport, media data is encapsulated into RTP packets. Typically, each media type or media coding format has a dedicated RTP payload format.
[0087] An RTP session is an association among a group of participants communicating with RTP. It is a group communication channel which can potentially carry a number of RTP streams. An RTP stream is a stream of RTP packets comprising media data. An RTP stream is identified by an SSRC belonging to a particular RTP session. SSRC refers to either a synchronization source or a synchronization source identifier, i.e. the 32-bit SSRC field in the RTP packet header. A synchronization source is characterized in that all packets from the synchronization source form part of the same timing and sequence number space, so a receiver device may group packets by synchronization source for playback. Examples of synchronization sources include the sender of a stream of packets derived from a signal source such as a microphone or a camera, or an RTP mixer. Each RTP stream is identified by an SSRC that is unique within the RTP session.
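As an illustration of how a receiver can group packets by synchronization source, the following sketch reads the 32-bit SSRC field from the RTP fixed header (RFC 3550 layout: two bytes of flags, two bytes of sequence number, four bytes of timestamp, then the SSRC). Error handling, CSRC lists and header extensions are omitted.

```python
import struct
from collections import defaultdict

def rtp_ssrc(packet: bytes) -> int:
    """Return the SSRC of an RTP packet (bytes 8..11 of the fixed header)."""
    if len(packet) < 12:
        raise ValueError("too short to be an RTP packet")
    return struct.unpack("!I", packet[8:12])[0]

def group_by_ssrc(packets):
    """Group received RTP packets per synchronization source for playback."""
    streams = defaultdict(list)
    for pkt in packets:
        streams[rtp_ssrc(pkt)].append(pkt)
    return streams
```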
[0088] In an implementation embodiment, an RTP stream is associated with each participant's incoming video stream to the RTP mixer. Based on the spatial metadata for each of the video streams, the viewer's field of view and viewport position (or orientation or both), the RTP mixer can perform viewport-dependent RTP stream selection.
[0089] A uniform resource identifier (URI) may be defined as a string of characters used to identify a name of a resource. Such identification enables interaction with representations of the resource over a network, using specific protocols. A URI is defined through a scheme specifying a concrete syntax and associated protocol for the URI. The uniform resource locator (URL) and the uniform resource name (URN) are forms of URI. A URL may be defined as a URI that identifies a web resource and specifies the means of acting upon or obtaining the representation of the resource, specifying both its primary access mechanism and network location. A URN may be defined as a URI that identifies a resource by name in a particular namespace. A URN may be used for identifying a resource without implying its location or how to access it.
[0090] In the following, embodiments for a 360-degree event, such as a 360-degree videoconference, teleconference or telepresence session, are discussed. However, as was mentioned, in addition to the 360-degree conference, the embodiments are suitable for other delivery solutions as well. Figs. 3a-3c represent various scenarios for a 360-degree teleconference. A 360-degree conference can be a live meeting which is delivered to receiver device(s) by the sender, wherein the sender is a video source, such as a 360-degree (i.e. omnidirectional) camera, or a system being operatively connected to a video source or comprising means to capture video. The streamable content from the sender to the receiver comprises at least video or audio. The purpose of the sender is to deliver the captured video forward to the receiver device(s). The sender may also comprise means for receiving at least audio data from the receiver device(s), and outputting the received audio data to the participants of the event.
[0091] Virtual reality (VR) teleconferencing may consist of a virtual environment (e.g. a virtual meeting room) in which one or several users are embedded to create an immersive experience. The experience can be created by using two or more cameras to capture each participant in order to track their movement and position with up to 6DOF. The captured movement and position of the participants may be used to recreate their representations, such as their avatars, in the meeting room. The participants may have an initial or default position in the meeting room with some degrees of freedom to move around within the environment. The video streams from all participants are mixed with the virtual meeting room video, possibly using a centralized server, and a video stream specific to each participant is created and delivered to that participant. The participant stream is created from the perspective of that user from their position in the room. VR teleconferencing requires the use of VR equipment, e.g. head mounted displays (HMD), for a truly immersive experience.
[0092] Teleconferencing may also be implemented using either a Mixed Reality (MR) or an Augmented Reality (AR) application. Mixed reality (MR) is the merging of real and virtual worlds to produce new environments and visualizations, where physical and digital objects co-exist and interact in real time. Mixed reality may be considered a mixture of the physical world and the virtual world, whereas Augmented Reality (AR) mainly takes place in the physical world, with further information or objects added virtually.
[0093] Figs. 3a-3c show some examples for implementing a virtual teleconference, where a group of participants is having a meeting in a conference room. The conference room can be considered as a virtual conference system A with physical elements (i.e., camera 220, view screen 210, one or more physical participants) being able to share content to and to receive data from remote participants. According to an example, the virtual conference system A may comprise a 360-degree (i.e., omnidirectional) camera 220 and a view screen 210. Two remote participants B, C also participate in the meeting through a conference call. Physical participants of the virtual conference system A use the view screen 210 to display a shared presentation and/or video streams coming from the remote participants B, C. Instead of the actual video stream from the remote participants B, C, their representations, such as their avatars, may be shown on the view screen 210. One of the remote participants B is using a head mounted display for having a 360-degree view of the conference content and a camera that captures his/her video. The other remote participant C uses a mobile phone to access the conference. The mobile phone is able to show a 360-degree video of the conference and to capture his/her video.
[0094] In the example of Fig. 3a, the conference call is set up without any media-aware network elements. Both remote participants B, C send information about their viewport to the virtual conference system A, which in turn sends them a viewport-dependent video stream from the 360-degree camera 220.
[0095] In the example of Fig. 3b, the conference call is set up using a network function, which may be performed by a media function such as a Media Resource Function (MRF) or a Media Control Unit (MCU) 801. The virtual conference systems for multiple conference rooms X send 360-degree video to an MRF/MCU 801. The rooms may choose to receive 2D video streams from other participants, including one of the other rooms, which is displayed on the view screen 210 in the room. The remote participants B, C can choose to view any one or none of the available 360-degree videos from the multiple rooms. Switching from one room to another may be triggered manually, or using other mechanisms, such as viewing direction or dominant speaker.
The MRF/MCU 801 may signal to pause receiving the 360-degree video from any of the rooms that do not currently have any active viewers.

[0096] In the example of Fig. 3c, the virtual conference systems for multiple conference rooms X send 360-degree video to an MRF/MCU 801. The rooms may choose to receive 2D video streams from other participants, including one of the other rooms, which is displayed on the view screen 810 in the room. The remote participants B, C can choose to view any one or none of the available 360-degree videos from the multiple rooms. Switching from one room to another may be triggered manually, or using other mechanisms, such as viewing direction or dominant speaker.
The MRF/MCU 801 may signal to pause receiving the 360-degree video from any of the rooms that do not currently have any active viewers.
[0097] The 360-degree conference may also be completely virtual, where all the meeting participants are remote participants, i.e. receiver devices connecting to the conference via a network, and where the sender is a computer generating a virtual representation of the virtual conference and the remote participants.
[0098] Fig. 4 illustrates a further example of a set-up for a 360-degree conference 900. In this example, the 360-degree conference 900 comprises at least one sender, such as a 360-degree video system 910 recording the conference. In addition, the 360-degree conference 900 comprises not only the live participants 915, but also one or more remote participants with receiver devices 930, 935 "participating" in the conference by establishing a connection e.g. via a network 920 to the conference system, which makes it possible to watch the 360-degree video captured by the 360-degree video system 910. In the following, the devices 930, 935 of the remote participants are referred to as "receiver devices" to emphasize their role as a receiving party for the streamed content. It is however appreciated that the receiver devices 930, 935 are also able to send data, for example audio.
[0099] The receiver devices 930, 935 can be virtual reality display systems, such as head mounted displays 935, but alternatively also (or in addition) displays 930 capable of showing only two-dimensional (2D) content.
[0100] A head mounted display is a device capable of showing three-dimensional (3D) content. For that purpose, a head mounted display may comprise two screen sections or two screens for displaying images for the left and right eyes. The displays are close to the eyes, and therefore lenses are used to make the images easily viewable and to spread the images to cover as much as possible of the eyes' field of view. The HMD is attached to the head of the user so that it stays in place even when the user turns his head. The device may have an orientation detecting module for determining the head movements and direction of the head. The head mounted display gives a three-dimensional (3D) perception of the recorded/streamed content to a user. The user of the head mounted display sees, at a given time instant, only a portion of the 360-degree content, referred to as the viewport, the size of which is defined by the vertical and horizontal fields-of-view of the HMD. Most of the audio objects of the immersive content may be visible in the viewport, while some audio objects may reside behind the user, therefore being non-visible in the viewport. As mentioned, the viewport is a portion of the 360-degree content. Therefore, "viewport" is a term for defining a region of an omnidirectional image or video suitable for display and viewing by the user, and is defined as the part of the spherical video that is currently displayed and hence is viewable by the user(s).
[0101] In Fig. 4, the sender is illustrated as a 360-degree video system 910. It is appreciated that instead of the 360-degree video system 910, the sender may be one of the conference participants sharing a 360-degree video, or the sender may be a middlebox or a network appliance, i.e. any computer networking device capable of media distribution. Examples of such a middlebox are a media gateway, a conference server etc. The sender may also be a streaming server configured to deliver the content via the MPEG DASH protocol. Depending on the capabilities of the sender, it may offer viewport-independent or viewport-dependent delivery. Furthermore, if viewport-dependent delivery is offered, the control signals for determining the viewport need to be delivered from the head mounted display, or other device able to display 360-degree content, to the sender.
[0102] The viewport can be determined based on head and body movements of the user of the receiver device, such as the HMD 935. The viewport may be sent from the receiver device 935 via a control channel to the sender, such as the 360-degree video system 910. Based on the received viewport, the sender is able to send video content relating to the determined viewport. It is appreciated that viewport control from means other than the HMD users' own movements can cause motion sickness for the viewers.
[0103] For determining the viewport, a conferencing system (comprising both sender and receiver) may support 3DoF. A system supporting 3DoF allows rotational head motion around yaw, pitch and roll. In another embodiment, 6DoF (6 Degrees of Freedom) motion may be supported. 6DoF allows further translational head or body movements, in addition to the rotational movements, e.g., translationally moving head or body up, down, right, left, forwards and backwards, including walking around a space in any direction. 6DoF motion may be constrained to a defined space within which the user can move around (e.g., 3DoF+) or unconstrained.
[0104] Virtual reality teleconferencing detaches people from their own physical environment. This makes it difficult to create a conferencing experience when some of the participants are collocated. In the 360-degree teleconferencing examples above, the virtual meeting space is typically mapped to a 360-degree sphere, where the viewer is in the center and where the viewing (e.g. with an HMD) supports 3DoF or 3DoF+. 360-degree video teleconferences can use 360-degree cameras at one location (e.g. a meeting room) that can be viewed immersively by remote participants. However, this creates an asymmetric experience for the remote and locally present participants. A fully immersive VR experience would require realistic 3D capturing or 3D avatar generation/animation and 6DoF support.

[0105] A further problem in live capture of video streams for creating MR/VR experiences is that it typically relies on heavy centralized processing that can add further delays and complexity, thereby not being ideal for creating a real-time conversational experience.
[0106] Now an improved method for virtual reality videoconferencing is introduced in order to at least alleviate the above problems.
[0107] The method according to an aspect, as shown in Figure 5, comprises obtaining (500) a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; creating (502) a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and delivering (504) the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
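As a rough illustration of the three steps above (obtaining the parameters, creating the data set, and delivering it as a bitstream), the following Python sketch builds a simple data set and serializes it for delivery. The field names, the circular layout policy and the JSON serialization are assumptions made only for this example.

```python
import json

def create_data_set(participant_parameters):
    """Create a data set (collage description) from the videoconferencing
    parameters reported by the participant apparatuses.

    `participant_parameters` maps a participant id to the parameters it sent,
    e.g. supported video properties; the even placement on a circle below is
    only an illustrative layout policy, not one mandated by the document.
    """
    n = max(len(participant_parameters), 1)
    data_set = {"participants": []}
    for i, (pid, params) in enumerate(sorted(participant_parameters.items())):
        data_set["participants"].append({
            "id": pid,
            "position_deg": 360.0 * i / n,      # spatial position in the space
            "depth_m": 2.0,                     # rendering distance (assumed)
            "video": params.get("video", {}),   # video stream parameters
        })
    return data_set

def deliver_data_set(data_set, send):
    """Deliver the data set as a bitstream to an AR-capable participant."""
    send(json.dumps(data_set).encode("utf-8"))

params = {"U1": {"video": {"width": 1280, "height": 720}},
          "U2": {"video": {"width": 1920, "height": 1080}}}
deliver_data_set(create_data_set(params),
                 send=lambda payload: print(len(payload), "bytes delivered"))
```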
[0108] Thus, videoconferencing parameters from the plurality of participant apparatus of a video conference are collected in an apparatus, which may be a media-aware network element, such as a Media Gateway (MGW) implementing a media function, such as a Media Resource Function (MRF) or a Media Control Unit (MCU). Alternatively, the method may be implemented without any media-aware network elements, wherein the apparatus may be one of the participant apparatus of the video conference.
[0109] Based on the collected videoconferencing parameters from the plurality of participant apparatus, a model of a virtual videoconferencing space is created, wherein the mutual spatial positions of each participant apparatus with respect to each other are defined. In other words, a model of a layout of the virtual videoconferencing space is created, and sufficient information describing the layout is combined with parameters of the video stream of each participant apparatus into a data set. A scene created by the model of a layout of the virtual videoconferencing space and the video stream of each participant apparatus is herein also referred to as a collage.
[0110] Thus, another aspect relates to an apparatus configured to create the collage. Such an apparatus may comprise means for obtaining a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; means for creating a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and means for delivering the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
[0111] The data set, which is herein also referred to as a collage description, is a declaration of the spatial position of each participant’s video stream without a virtual space definition. The data set (i.e. collage description) provides parameters that define how to render the video in mixed reality, e.g. depth, height above floor, nature of rendering surface or anchor type etc. The data set (i.e. collage description) is delivered as a bitstream to at least one or more participant apparatus provided with the display unit capable of displaying AR content.
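A single entry of such a collage description could be modelled as in the sketch below; every field name is an illustrative assumption mirroring the examples mentioned above (depth, height above floor, anchor type), not a syntax defined by this document.

```python
from dataclasses import dataclass, asdict

@dataclass
class XRDescription:
    """Illustrative per-participant entry of a collage description."""
    stream_uri: str            # where the participant's video can be fetched
    azimuth_deg: float         # spatial position relative to the other streams
    depth_m: float             # rendering distance from the viewer
    height_above_floor_m: float
    anchor_type: str           # e.g. "floor", "wall", "viewport"
    billboarding: bool = True  # render the 2D cutout facing the viewer
    curvature: float = 0.0     # 0.0 = flat rendering surface

collage = [
    XRDescription("rtp://example/u1", azimuth_deg=-40.0, depth_m=3.5,
                  height_above_floor_m=0.0, anchor_type="floor"),
    XRDescription("rtp://example/u3", azimuth_deg=40.0, depth_m=2.0,
                  height_above_floor_m=0.0, anchor_type="floor"),
]
print([asdict(entry) for entry in collage])
```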
[0112] Such an apparatus may then define its own position in the virtual videoconferencing space and render an augmented view of the virtual videoconferencing space such that the other participants are placed at their respective places according to the layout from that viewpoint. The apparatus may further utilize the other parameters of the collage description to render the other participant streams into a mixed reality videoconferencing space combined with his/her own physical space. Thus, the actual layout rendered by the participant apparatus provided with the display unit capable of displaying AR content may be fairly independent or bound to the physical space (i.e. participant-specific). However, the spatial relationships of all participant streams should preferably be maintained for a consistent experience among participants, even though the local representation layout may be individual and/or bound to the physical space.
[0113] A further aspect relates to a participant apparatus of a video conference. Such an apparatus may comprise a camera configured to capture a video about a user of said apparatus; a display unit capable of displaying augmented reality (AR) content; means for determining a plurality of videoconferencing parameters of a video conference; means for sending said plurality of videoconferencing parameters to at least one participant apparatus of the videoconference; means for receiving a data set from said at least one participant apparatus of the videoconference, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in a virtual videoconferencing space created by said at least one participant apparatus and parameters of a video stream of each participant apparatus; and means for rendering, based on the data set, the virtual videoconferencing space on the display unit, where representations of users of each participant apparatus are rendered on their corresponding spatial positions.
[0114] According to an embodiment, the method comprises configuring said data set as a metadata stream specifying information for the video streams of each participant apparatus. Consequently, based on this information, the one or more participant apparatus provided with the display unit capable of displaying AR content may render the augmented view of the virtual videoconferencing space such that the video stream or other representation of each participant is shown at their respective places in the virtual videoconferencing space. The metadata stream may comprise, for example, a reference, such as a URI, to the video streams of each participant apparatus, and the participant apparatus provided with the display unit capable of displaying AR content may fetch the respective video streams from the identified location.
[0115] According to an embodiment, the method comprises delivering the video streams of one or more participant apparatus along the metadata stream to said at least one participant apparatus provided with the display unit capable of displaying AR content. Accordingly, the video streams of one or more participant apparatus, i.e. at least the video streams of the participant apparatus currently being displayed, may be automatically delivered to the participant apparatus provided with the display unit capable of displaying AR content, which may obtain identifications of each video stream and render the augmented view of the virtual videoconferencing space such that the video stream or other representation of each participant is shown at their respective places in the virtual videoconferencing space.
[0116] According to an embodiment, the method comprises configuring said data set as a manifest file or an Extended Reality (XR) descriptor in the metadata stream. The manifest file may comprise the necessary definitions for the video streams of all participant apparatus. The XR descriptor, in turn, may be defined for each video stream separately. The manifest/XR description can be updated during the call if required, for example when a new participant joins, or when the positions or roles of participants change.
[0117] According to an embodiment, the method comprises configuring said data set as one or more data structures describing overlays, where an overlay corresponds to a participant's video stream and an overlay position corresponds to the spatial position of the participant's video stream in the viewing space. According to an embodiment, the method comprises configuring said data set as one or more data structures describing overlays according to MPEG OMAF. Hence, an XR descriptor may describe overlays and may follow the syntax and semantics of OMAF overlay syntax structures. Furthermore, any embodiment referring to an XR descriptor equally applies to an overlay syntax structure or descriptor.
[0118] According to an embodiment, the method further comprises indicating that the participants' videos originate from external overlay sources rather than being conveyed through OMAF. Consequently, the external overlay sources may use a different communication protocol stack than the OMAF presentation, such as an RTP/UDP/IP protocol stack for delivering the video.

[0119] According to an embodiment, the method comprises indicating that the background visual media is transparent or absent in an OMAF presentation and hence only the overlays should be displayed to obtain an XR experience.
[0120] The actual participant video streams of the collage to be delivered to the one or more participant apparatus may be implemented in various ways, which may depend on the capabilities of the capturing devices (cameras) of the participant apparatus and/or the capabilities of the apparatus creating the collage and the collage description for possibly transcoding the video streams. It is noted that most of the embodiments described herein do not require any transcoding upon creating the collage and the collage description, thereby enabling a low-latency and low-complexity media gateway to deliver a collaborative or multi-party MR experience. However, in some embodiments transcoding of the participant video streams, e.g. their resolution, may provide advantages, if the transcoding is supported by the apparatus operating as the media gateway. The apparatus operating as the media gateway is herein also referred to as a collage creator.
[0121] A participant video stream can be a 3D/2D avatar that creates movement based on the user's pose. The use of an avatar supplemented with control information for rendering its movements based on the user's pose and movements makes it possible to reduce the size and bitrate of the video stream significantly, thereby further contributing to the operation of a low-latency and low-complexity media gateway. The participant video stream can also be a live captured 3D model of the user.
[0122] In an embodiment, both “virtual” and “real” avatars may be used. A virtual avatar is used when a live stream of the participant is not available. In case a person cannot participate at all, a “ghost” avatar may still be present to show the absence. A special “robot” avatar can be used to provide a capture point for recording of the conference call for later viewing by any participant or non-participant or as a common point for viewing by a group of attendees (non-panelists or audience) that are not part of the speakers (panelists) in the conference. A participant may use the tag “robot” in the SDP to indicate it wishes to join as a robot avatar. The collage creator would assign an appropriate position to this “robot” attendee such that it has a good view of the full conference and participants. The position of the “ghost” avatar may be used to record the meeting session so the absent person can replay it later from their perspective; the recording for the offline participants is handled by an application or application server (AS). An externally sourced overlay or one provided by the AS may be used to indicate the position of the “robot” or “ghost” avatars to the other participants.
[0123] According to an embodiment, the participant video stream is a 2D video configured to be displayed using a billboarding effect. For the 2D video, the participant (user) is extracted from his/her background, possibly along with any contiguous objects (e.g. a chair), so that they can be displayed in MR. Billboarding is a method used to render a flat (2D) object in XR. The object is rendered at a distance from the viewer such that its orientation may automatically be computed so that it always faces the viewing direction. The distance may be derived in a manner that the object appears in a life-like size for the viewer. Approximately known geometric distances, such as the interpupillary distance or body part lengths, may be used in determining the distance. The distance may be determined by the collage creator and/or the receiving participant apparatus.

[0124] In billboarding, the object may be rendered on a curvature and/or may be rotated based on the viewing angle of the viewer to give the illusion of depth and not to appear flat. The curvature can be defined at the center of the avatar plane and each participant can have different curvatures, with respect to their positions in the collage. In another embodiment, the participant video/avatar which is currently in focus may be rendered as flat (curvature = 0), while the others (being out of focus) can be curved according to their Euclidean distances. The use of billboarding enables participants with 2D capture devices to join an MR conference call with realistic, live captured images.
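The following sketch illustrates the two computations mentioned above: orienting a flat overlay so that it faces the viewer, and deriving a rendering distance from an approximately known physical size (an assumed interpupillary distance is used as the reference here). It is an illustrative simplification under these assumptions, not a prescribed algorithm.

```python
import math

def billboard_yaw_deg(viewer_pos, overlay_pos):
    """Yaw (degrees) that turns a flat overlay to face the viewer.
    Positions are (x, z) on the horizontal plane; yaw is measured from +z."""
    dx = viewer_pos[0] - overlay_pos[0]
    dz = viewer_pos[1] - overlay_pos[1]
    return math.degrees(math.atan2(dx, dz))

def life_size_distance(real_size_m, angular_size_rad):
    """Distance at which an object of known physical size subtends the given
    angular size, so that the rendered person appears approximately life-size."""
    return real_size_m / (2.0 * math.tan(angular_size_rad / 2.0))

print(billboard_yaw_deg((0.0, 0.0), (1.0, 2.0)))
# Assumed interpupillary distance of ~0.063 m seen under a 2-degree angle:
print(life_size_distance(0.063, math.radians(2.0)))
```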
[0125] The extraction of a participant from a 2D video may be realized using alpha channel manipulation to make the portion of the video that is not part of the region-of-interest (e.g. not part of the participant's body) transparent and to render the video as an overlay. An overlay is a spherical or 2D video that is rendered on a 360-degree video or viewport of an AR/VR device.
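A minimal sketch of the alpha-channel manipulation described above: pixels outside a region-of-interest mask are made fully transparent, so that only the participant remains visible when the picture is composited as an overlay. Plain Python lists are used instead of any particular image library; the frame layout is an assumption of the example.

```python
def apply_cutout_alpha(rgba_rows, roi_mask_rows):
    """Set alpha to 0 (transparent) outside the region of interest.

    rgba_rows:     rows of [r, g, b, a] pixels,
    roi_mask_rows: rows of booleans, True where the pixel belongs to the
                   participant (e.g. produced by ML-based segmentation).
    """
    for rgba_row, mask_row in zip(rgba_rows, roi_mask_rows):
        for pixel, inside in zip(rgba_row, mask_row):
            pixel[3] = 255 if inside else 0  # opaque participant, transparent background
    return rgba_rows

# 2x2 example: only the top-left pixel belongs to the participant.
frame = [[[10, 20, 30, 255], [10, 20, 30, 255]],
         [[10, 20, 30, 255], [10, 20, 30, 255]]]
mask = [[True, False], [False, False]]
apply_cutout_alpha(frame, mask)
print(frame)
```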
[0126] According to an embodiment, the apparatus operating as the media gateway is configured to use real-time transport protocol (RTP) to provide the one or more participant apparatus provided with the display unit capable of displaying AR content with the video stream of other participant apparatus according to viewport-dependent delivery.
[0127] The videoconference session may be established using session protocols, e.g. SDP (Session Description Protocol) and SIP (Session Initiation Protocol), HTTP (Hypertext Transfer Protocol) or any other suitable protocol. The media streams may be carried using real-time transport protocols with media control protocols, e.g. RTP, RTCP (Real-time Control Protocol), SCTP (Stream Control Transmission Protocol), etc.
[0128] Once the collage is created, the session management can establish RTP streams with the appropriate video characteristics (resolution, size) for each of the participants. Based on the collage, the media gateway may update the stream properties and pause or resume certain streams, for example based on user pose information, without any additional media processing. An RTP MGW can then control the streams based on the pose and viewport of the AR users for viewport-dependent delivery, i.e., only the video streams that are currently rendered in the user's viewport are delivered and the remaining video streams may be paused. Consequently, there is no requirement for heavy centralized processing of the streams to create a combined stream from each participant's perspective.
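The viewport-dependent stream control described above can be sketched as follows. The `pause`/`resume` callables stand in for whatever signalling the media gateway actually uses, and the simple angular visibility test against the viewport width is an assumption of the sketch.

```python
def update_stream_states(streams, viewport_center_deg, viewport_width_deg):
    """Pause streams outside the viewport, resume the ones inside it.

    `streams` maps a participant id to a dict holding the stream's azimuth in
    the collage ('azimuth_deg') and 'pause' / 'resume' callables.
    """
    half = viewport_width_deg / 2.0
    for pid, s in streams.items():
        # Smallest signed angular difference to the viewport centre, in [-180, 180).
        diff = abs((s["azimuth_deg"] - viewport_center_deg + 180.0) % 360.0 - 180.0)
        if diff <= half:
            s["resume"]()   # currently rendered in the viewport
        else:
            s["pause"]()    # outside the viewport, delivery may be paused
```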
[0129] In the following, some embodiments relating to the collage creation are described more in detail.
[0130] The collage creation is a process that may be implemented in a network entity (centralized), at the edge or within one of the participant apparatus (e.g. a user equipment UE). The collage is created based on one or more aspects including but not limited to the number of participants, the spatial characteristics of the physical environment of the participants and the characteristics of the video stream from each participant.
[0131] Figure 6 shows an example of various information and parameters used in the collage creation. Upon establishing the videoconference session, the collage creator obtains some session level information, i.e. at least the number of participants and possibly some other session preferences, such as a usable bandwidth, etc. Each participant provides its own participant level information to be used in the collage creation. This may involve the supported video properties, the distance of the object-of-interest (i.e. the user) from the camera, wherein the distance may be indicated explicitly or implicitly (derivable by the collage creator), and spatial information of the space of the participant.
[0132] When creating the collage, the collage creator may have some internal settings controlling the collage creation. The collage creator may be configured to maintain eye contact with each participant (i.e. user), wherein billboarding may be advantageously utilized, as well as to display the participants in life-size dimensions. The collage may preferably enable a virtual background to be individually applied by each participant.
[0133] As the output, the collage creator provides the session level information and the participant level information, such as the supported video properties of the one or more other participant apparatus, and the collage with appropriate metadata, such as an XR description, of each video stream of other participants.
[0134] Thus, each video stream from a participant may be assigned an XR description. The XR description defines the kind of anchor the stream should be rendered on. For example, an XR description may control a stream to be rendered onto a surface at floor level on the left-hand side of the viewport with particular dimensions. The XR description is assigned by the collage creator based on the information received from the participants. Table 1 below discloses some aspects that may be defined for the XR description in the collage.
[Table 1 – aspects that may be defined for the XR description in the collage; the table is provided as an image in the original publication.]
[0135] According to an embodiment, participants with lower resolution video streams are placed at a greater depth than other participants. Thus, the collage creator may place participants with lower resolution video streams at a greater depth than other participants so that they can be rendered with smaller dimensions. This creates the perception that the participant is located farther away. In this case, the depth is a function of the image/video resolution. Other embodiments may link other scene properties with stream properties, e.g. a low-resolution image may be rendered as a flat overlay on a wall so that it does not disrupt the realism of the other images that may be displayed with billboarding.
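One way to express the dependency described above is sketched below: the rendering depth grows as the stream resolution falls relative to a reference resolution, so a lower-resolution participant is drawn smaller and appears farther away. The reference values and the linear rule are assumptions of this sketch, not values given in the document.

```python
def depth_from_resolution(stream_height_px, reference_height_px=1080,
                          reference_depth_m=2.0, max_depth_m=6.0):
    """Place lower-resolution streams at a greater depth.

    A stream at the reference resolution is rendered at the reference depth;
    a stream at half the vertical resolution is rendered roughly twice as far,
    capped at max_depth_m.
    """
    ratio = reference_height_px / max(stream_height_px, 1)
    return min(reference_depth_m * ratio, max_depth_m)

print(depth_from_resolution(1080))  # -> 2.0
print(depth_from_resolution(540))   # -> 4.0
print(depth_from_resolution(180))   # -> 6.0 (capped)
```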
[0136] Figures 7a-7c illustrate an example of a virtual videoconference with four participants, U1, U2, U3 and U4, provided with the display unit capable of displaying AR content. The participant U1 may send its video stream at a lower resolution than the other participants, for example due to its limited video properties or as requested by the media gateway. The collage creator may then create a collage placing U1 at a greater depth, as shown in Figure 7a. The resultant collage does not need any detailed scene description apart from the relative position of each participant with respect to the others and the depth levels.

[0137] Figures 7b and 7c show the virtual conference rooms of participants U2 and U3, respectively, where the other participants are displayed as AR. Real furniture in each participant's physical room is shown in brighter color; however, it is not delivered in the video streams to the other participants, but is only visible in each participant's own AR view. The actual anchor (placement) used for rendering each participant stream as AR is not assigned by the collage creator, but is chosen by the participant apparatus (UE) based on the XR description provided by the collage. Figures 7b and 7c show the curvature of the billboarding surface to be used for rendering the 2D overlays as curves below the placement of the avatars. Slight variations in depth may occur (e.g. in case the physical rooms of the participants have different depths) due to the actual layout and occlusions in the room.
[0138] Thus, user U2 sees an AR view of his/her own physical room, where an avatar (or other representation) of user U4 is displayed on the front-left side, user U1 is displayed in the middle but at a farther distance, and U3 on the front-right side. If the virtual conference room of U2 is shown as AR content, the furniture in front of U2 may also be displayed.
[0139] According to an embodiment, the participant apparatus is configured to create a virtual background for the virtual videoconferencing space, while maintaining the spatial positions of each participant apparatus in said virtual videoconferencing space. Thus, a participant apparatus (e.g. a UE) may convert his/her physical environment into a 3D scene description, thereby switching from AR mode to VR mode (virtual meeting mode). The participant apparatus may further share the VR mode 3D scene description with the other participant apparatus (UEs) via the same or a separate communication channel. The other participant apparatus (UEs) may also switch from AR mode to VR mode and vice versa.
[0140] Hence, the virtual conference room of U2 may instead be shown as completely virtual reality (VR), wherein the positions of the other users U1, U3 and U4 may be the same, but the background of the virtual conference room may be created and displayed as controlled by the user U2.
[0141] Maintaining the same mutual spatial positions of the users, user U3 sees an AR view of his/her own physical room, where an avatar (or other representation) of user U2 is displayed on the front-left side, user U4 is displayed in the middle, and U1 on the front-right side but at a farther distance. If the virtual conference room of U3 is shown as AR content, the furniture on the back wall of the room and on the front-right side of U3 may also be displayed.
[0142] Figure 8 shows another example of a virtual videoconference with two participants, Andy and Bob, where based on the physical space and the number of participants, a collage creator may choose to include a portion of the background of a participant when creating a collage instead of using a cutout of just the person of interest. In the example of Figure 8, Bob and Andy establish a call from their living rooms that are equipped with 360-degree capturing devices configured to track the object of interest (OOI), e.g. a person. Both Bob and Andy are wearing AR devices. As there are only two participants, a 180-degree view of the room is displayed for each side.
[0143] According to an embodiment, the participant apparatus is configured to obtain control information about the extent of the background to be captured around the object-of-interest. Thus, the user may indicate the extent of the background that can be captured and used within the collage. According to an embodiment, the extent is marked by defining a boundary within the room using a sensor, such as a handheld controller connected to the AR device, or using in-built sensors of the AR device. According to an embodiment, the extent is defined as predefined levels, e.g., i) just the head and the body, ii) the head, the body and 10 degrees on either side, iii) the whole room, etc.

[0144] According to an embodiment, the apparatus operating as the media gateway is configured to define spatial regions within the virtual videoconferencing space for each participant to allow movements of the participant within the respective spatial region. Thus, based on the virtual space and the number of participants, a collage creator may choose to define spatial boundaries (regions) for each participant to have some spatial constraints within which participants can move freely. The spatial regions for the movement can be calculated by comparing the region area and the physical area of the participant so that all participants can have normalized movement. For example, when the same scene is created in two physical spaces that have different sizes, it may take longer for an object (participant) to move from one point of the scene to the other in a larger physical space than for an object in a smaller physical space. The collage provides sufficient information to be able to render the motion in a realistic way for each participant apparatus (e.g. UE) in reference to their own created MR scene. Similar transformations may be applied to realistically represent gestures (pointing at a particular stream) and eye gaze.
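The movement normalization described above can be sketched as a simple scaling of physical motion into the spatial region assigned in the collage; the linear scaling by the ratio of the region size to the physical room size is an illustrative choice, not a rule taken from the document.

```python
def normalize_movement(physical_offset_m, physical_room_size_m, region_size_m):
    """Scale a participant's physical displacement into their collage region.

    physical_offset_m:    (dx, dz) movement measured in the participant's room,
    physical_room_size_m: (width, depth) of the usable physical space,
    region_size_m:        (width, depth) of the spatial region assigned in the
                          collage.
    """
    sx = region_size_m[0] / physical_room_size_m[0]
    sz = region_size_m[1] / physical_room_size_m[1]
    return physical_offset_m[0] * sx, physical_offset_m[1] * sz

# A 1 m step produces the same in-collage displacement for a participant in a
# large room and for one in a small room:
print(normalize_movement((1.0, 0.0), (6.0, 6.0), (2.0, 2.0)))
print(normalize_movement((1.0, 0.0), (3.0, 3.0), (1.0, 1.0)))
```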
[0145] As mentioned above, collages may be realized for an RTP-based AR/MR teleconference without additional transcoding. Figure 9 shows an exemplified setup for a unidirectional case, where UE-A is a participant apparatus equipped with an AR device that is receiving video streams from multiple remote participants via an RTP Media Gateway (MGW). It is noted that the same principle can be extended to bidirectional cases. The RTP-based AR/MR teleconference without additional transcoding may in the unidirectional case comprise the following steps:
1) Each participant joins the conference by sending their capabilities to the session manager. For UE-A, this includes sending (1a) spatial information about the room in addition to the rendering capabilities. The sending UEs would advertise (1b) their capabilities for sending the video.
2) A collage is created based on the received information.
3) The collage and stream properties are distributed to the participants and the MGW (3a). This step would include finalizing the SDP negotiation (3b) to set up a media channel for each stream. In the unidirectional case, the sending UEs do not need to be aware of the collage.
4) The sending UEs start sending the negotiated stream to the MGW. It is assumed in this case that the UE is able to set the appropriate alpha channels to make the background (any part of the picture that does not have the object of interest) transparent for 2D participant videos that are displayed as cutout with billboarding effect. In another embodiment, this may be done in the MGW.
5) The RTP MGW sends the video streams (5a) using viewport-dependent delivery (VDD) to UE-A, wherein the delivery of the video streams may be adjusted based on the viewport feedback (5b) received from UE-A.
[0146] According to an embodiment, the viewport-dependent delivery of the video streams of the one or more participant apparatus comprises pausing the delivery of the video streams not being displayed in the viewport. Thus, the MR teleconference may be realized using pause/resume to disable RTP streams that are not in the viewport. Such an embodiment may be implemented without any MGW or middle box.
[0147] According to an embodiment, the media gateway is a Media Resource Function (MRF) configured to carry out one or more media processing operations for at least one video stream of the one or more participant apparatus. The media handling operations may comprise e.g., setting appropriate alpha channels and delivering the video stream as an overlay and various quality and bandwidth adaptations.
[0148] According to an embodiment, the media gateway is configured to carry out transcoding operations for at least one video stream of the one or more participant apparatus. Thus, the collage MGW may, for example, change the resolutions of participant streams in real time by transcoding the input streams. Therein, one of the streams may remain as the original high-resolution stream, while the others are converted to low resolutions. Herein, the different depth values, as described above, may be used for the high and low resolution streams. ML-based video segmentation can be used to extract the person or real avatar. An SDP attribute "a=ML:alpha" may be included under the media description of the stream to indicate that alpha channel coding is present in the media stream being sent by the participant. In this attribute, ML refers to "machine learning", i.e. ML-controlled alpha channel coding that adjusts based on the background content contrast requirement. In the absence of ML-based segmentation or alpha channel coding, only "a=alpha" can be used.
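A hypothetical SDP media description carrying the attribute mentioned above could be assembled as in the sketch below. Apart from the "a=ML:alpha" / "a=alpha" attributes named in the text, every value (port, payload type, codec) is a placeholder chosen only for illustration.

```python
def build_media_description(port=49170, payload_type=96, ml_segmentation=True):
    """Assemble an illustrative SDP media description for a participant video
    stream whose background is removed via alpha channel coding."""
    lines = [
        f"m=video {port} RTP/AVP {payload_type}",
        f"a=rtpmap:{payload_type} H265/90000",            # placeholder codec
        "a=ML:alpha" if ml_segmentation else "a=alpha",   # attribute from the text
    ]
    return "\r\n".join(lines) + "\r\n"

print(build_media_description())
```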
[0149] The collage may support some degrees of motion, up to 3DOF+. In the case of a 2D video displayed as an overlay with a transparent background, the movement may be automatically captured as long as the participant remains within the delivered field-of-view of the overlay. The collage is configured to take this into account by not overlapping 2D/spherical overlays rendered in this manner, unless they vary sufficiently in depth. According to an embodiment, the collage may be configured to vary the size of an avatar when the user moves towards or away from the camera. According to an embodiment, the collage may also be configured to define a fixed space in which the video/avatar may move as part of the XR description. In this case, the video/avatar would be rendered with enough occlusion-free space around the anchor to allow the movement. Support for higher degrees of motion may require updating the collage based on motion.
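A simplified placement check reflecting the rule above (overlapping overlays are acceptable only if they vary sufficiently in depth) could look as follows; the rectangular angular footprint and the depth threshold are illustrative assumptions, not values defined by the embodiments.

```python
# Simplified placement check reflecting the rule above: two overlays may
# overlap in angular position only if they vary sufficiently in depth.
# The rectangular angular footprint and depth threshold are illustrative.
from dataclasses import dataclass

@dataclass
class Overlay:
    azimuth: float      # centre azimuth, degrees
    elevation: float    # centre elevation, degrees
    width: float        # angular width, degrees
    height: float       # angular height, degrees
    depth: float        # distance from the viewer, metres

def overlaps(a: Overlay, b: Overlay) -> bool:
    return (abs(a.azimuth - b.azimuth) < (a.width + b.width) / 2 and
            abs(a.elevation - b.elevation) < (a.height + b.height) / 2)

def placement_allowed(a: Overlay, b: Overlay, min_depth_gap=0.5) -> bool:
    """Overlapping overlays are allowed only with a sufficient depth difference."""
    return not overlaps(a, b) or abs(a.depth - b.depth) >= min_depth_gap

front = Overlay(0.0, 0.0, 20.0, 12.0, depth=2.0)
back = Overlay(5.0, 0.0, 20.0, 12.0, depth=3.0)   # overlaps, but further away
assert placement_allowed(front, back)
```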
[0150] According to an embodiment, the method comprises assigning a videoconference role for one or more participant apparatus of the video conference; and configuring said data set to include the roles. Hence, the participants may be assigned roles when creating the collage and the roles may be signaled in the collage description to the other participants. Every UE may have its own role definition; e.g. in the context of a virtual meeting, one UE may be the “chairman” and the others may be “meeting participants”. Different roles can have corresponding spatial meanings and affect the final representation/display. For example, a chairman may always be present (e.g. as a viewport-dependent fixed overlay) regardless of the pose or current viewport of the participant apparatus.
[0151] According to an embodiment, the role of a chairman may be associated with a specific position in the layout, e.g. the center of the collage. In this case, the collage layout may be revised when the roles of the participants change. For example, B replaces A to become the new chairman of the meeting, whereupon B is placed at the center of the collage and A is placed with the rest of the participants on the sides.
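A sketch of such role signalling and role-dependent layout revision is given below; the field names and the left/right placement heuristic are illustrative only and are not a normative part of the data set.

```python
# Sketch of role signalling and role-dependent layout revision. The field
# names and the left/right placement heuristic are illustrative only.
def build_layout(roles):
    """roles: mapping participant_id -> 'chairman' | 'meeting participant'."""
    chairman = next(pid for pid, role in roles.items() if role == "chairman")
    layout = {chairman: {"role": "chairman", "position": "center"}}
    for i, pid in enumerate(p for p in roles if p != chairman):
        layout[pid] = {"role": roles[pid],
                       "position": "left" if i % 2 == 0 else "right"}
    return layout

roles = {"A": "chairman", "B": "meeting participant", "C": "meeting participant"}
layout = build_layout(roles)

# B replaces A as chairman: the collage layout is revised accordingly.
roles["A"], roles["B"] = "meeting participant", "chairman"
layout = build_layout(roles)
```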
[0152] According to an embodiment, the XR description may be defined as part of a Session Description Protocol. According to an embodiment, this session description protocol may be referred to by glTF (Graphics Language Transmission Format) and glTF extensions, such as the MPEG Scene Description. In this case, the collage creator is the owner of all the nodes (avatars and overlays) and it creates and updates the scene description based on the information received from the participants about their respective media and physical space.
[0153] According to another embodiment, the XR description may be defined as part of a Scene Description protocol. According to an embodiment, this scene description protocol may be based on glTF (Graphics Language Transmission Format) and glTF extensions, such as the MPEG Scene Description. The scene description protocols may also be based on approaches other than glTF. According to an embodiment, the description of the physical space of the participants may be provided by a Spatial Computing Server, where the Spatial Computing Server has received this information from the participants or other sensors, UEs, etc.
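For illustration, a glTF-flavoured scene description for a collage with one billboarded 2D overlay and one avatar might resemble the following; the node names, extension names (prefixed X_COLLAGE_) and stream URIs are placeholders and do not reproduce the normative MPEG Scene Description schema.

```python
# Illustrative, glTF-flavoured scene description for a collage with one
# billboarded 2D overlay and one avatar. Node names, extension names
# (prefixed X_COLLAGE_) and stream URIs are placeholders; this is not the
# normative MPEG Scene Description schema.
import json

scene_description = {
    "scenes": [{"nodes": [0, 1]}],
    "nodes": [
        {   # 2D participant video rendered as a billboarded cutout overlay
            "name": "participant_B_overlay",
            "translation": [1.5, 0.0, -2.0],
            "extensions": {"X_COLLAGE_overlay": {"stream": "rtp://mgw/ue-b",
                                                 "alpha": True}},
        },
        {   # avatar of another participant
            "name": "participant_C_avatar",
            "translation": [-1.5, 0.0, -2.0],
            "extensions": {"X_COLLAGE_avatar": {"stream": "rtp://mgw/ue-c"}},
        },
    ],
}
print(json.dumps(scene_description, indent=2))
```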
[0154] In a streaming implementation, the Collage Creator may use the file format, e.g. MPEG OMAF, to define the position and rendering of the overlay streams.
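The sketch below only illustrates the kind of information (sphere-relative position, size, depth, source) that file-format-level overlay metadata can carry for a participant stream; the field names are assumptions and do not reproduce the actual OMAF overlay syntax.

```python
# The fields below only illustrate the kind of information (sphere-relative
# position, size, depth, source) that file-format-level overlay metadata can
# carry for a participant stream; they do not reproduce the OMAF syntax.
overlay_metadata = {
    "overlay_id": 1,
    "azimuth_deg": 30.0,
    "elevation_deg": 0.0,
    "width_deg": 20.0,
    "height_deg": 12.0,
    "depth_m": 2.0,
    "source": "external",   # the stream originates outside the OMAF presentation
}
```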
[0155] An example of an apparatus is shown in Fig. 10. Several functionalities can be carried out with a single physical device, e.g. in a single processor, if desired. The apparatus 90 comprises a main processing unit 91, a memory 92, a user interface 94, and a communication interface 93. The apparatus according to an embodiment, shown in Fig. 7, also comprises a camera module 95. The memory 92 stores data, including computer program code, in the apparatus 90. The computer program code is configured to implement the method according to the flowchart of Fig. 4. The camera module 95 receives input data, in the form of a video stream, to be processed by the processor 91.
The communication interface 93 forwards processed data, for example, to a display of another device, such as an HMD. When the apparatus 90 is a video source comprising the camera module 95, user inputs may be received from the user interface. If the apparatus 90 is a middlebox in a network, the user interface is optional, as is the camera module.
[0156] The various embodiments may provide advantages. For example, the embodiments provide a standard approach to map and lay out the remote participants with respect to their local physical environments. An MR experience obtained according to the embodiments facilitates immersion, maintains awareness of the surroundings, and provides a symmetric experience among participants.
[0157] The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The computer program code comprises one or more operational characteristics. Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises obtaining information of available quality features of a sender device for providing omnidirectional visual media content; selecting one or more of the available quality features; and providing an indication of the selected one or more of the available quality features.
[0158] Thus, the apparatus configured to create the collage may comprise at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; create a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and deliver the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
[0159] Similarly, the participant apparatus may comprise a camera configured to capture a video about a user of said apparatus; a display unit capable of displaying augmented reality (AR) content; at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: determine a plurality of videoconferencing parameters of a video conference; send said plurality of videoconferencing parameters to at least one participant apparatus of the videoconference; receive a data set from said at least one participant apparatus of the videoconference, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in a virtual videoconferencing space created by said at least one participant apparatus and parameters of a video stream of each participant apparatus; and render, based on the data set, the virtual videoconferencing space on the display unit, where representations of users of each participant apparatus are rendered on their corresponding spatial positions.
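A high-level, non-normative sketch of this participant-side flow is given below; the method names and the structure of the received data set are illustrative stand-ins, not an API defined by the embodiments.

```python
# High-level, non-normative sketch of the participant-side flow. The method
# names and the structure of the received data set are illustrative stand-ins,
# not an API defined by the embodiments.
class ParticipantApparatus:
    def __init__(self, camera, ar_display, session):
        self.camera = camera          # captures a video about the local user
        self.ar_display = ar_display  # display unit capable of showing AR content
        self.session = session        # channel to the collage-creating apparatus

    def run(self):
        params = self.determine_parameters()
        self.session.send(params)                   # videoconferencing parameters
        data_set = self.session.receive_data_set()  # spatial positions + stream parameters
        for participant, placement in data_set["positions"].items():
            # Render each remote participant at its signalled spatial position.
            self.ar_display.render(participant, placement,
                                   data_set["streams"][participant])

    def determine_parameters(self):
        return {"capabilities": {"ar": True}, "camera": self.camera.describe()}
```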
[0160] A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.
[0161] If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
[0162] Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
[0163] It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure, as defined in the appended claims.

Claims

1. An apparatus comprising means for: obtaining a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; creating a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and delivering the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
2. The apparatus according to claim 1, comprising means for: configuring said data set as a metadata stream specifying information for the video streams of each participant apparatus.
3. The apparatus according to claim 2, comprising means for: delivering the video streams of each participant apparatus along the metadata stream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
4. The apparatus according to claim 2 or 3, comprising means for: configuring said data set as a manifest file or an Extended Reality (XR) descriptor in the metadata stream.
5. The apparatus according to any preceding claim, comprising means for assigning a videoconference role for one or more participant apparatus of the video conference; and configuring said data set to include the roles.
6. The apparatus according to any preceding claim, comprising means for configuring said data set as one or more data structures describing overlays, where an overlay corresponds to a participant’s video stream and an overlay position corresponds to the spatial position of the participant's video stream in the viewing space.
7. The apparatus according to any preceding claim, comprising means for configuring said data set as one or more data structures describing overlays according to MPEG OMAF.
8. The apparatus according to claim 7, comprising means for indicating that the participant's video streams originate from external overlay sources rather than being conveyed according to MPEG OMAF.
9. The apparatus according to claim 7 or 8, comprising means for indicating that background visual media is transparent or absent in an OMAF presentation and only the overlays are to be displayed.
10. The apparatus according to any preceding claim, wherein the participant video stream is a 2D video configured to be displayed using a billboarding effect.
11. The apparatus according to any preceding claim, comprising means for applying real-time transport protocol (RTP) to provide the one or more participant apparatus provided with the display unit capable of displaying AR content with the video stream of other participant apparatus according to viewport-dependent delivery.
12. The apparatus according to claim 7, comprising means for pausing the delivery of the video streams not being displayed in the viewport.
13. The apparatus according to any preceding claim, comprising means for: creating a virtual background for the virtual videoconferencing space, while maintaining the spatial positions of each participant apparatus in said virtual videoconferencing space.
14. The apparatus according to any preceding claim, further comprising means for: defining spatial regions within the virtual videoconferencing space for each participant to allow movements of the participant within the respective spatial region.
15. A method, comprising: obtaining a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; creating a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and delivering the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
16. An apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a plurality of videoconferencing parameters from two or more participant apparatus of a video conference, wherein each of the participant apparatus is provided with a camera configured to capture a video about a user of said participant apparatus, and at least one participant apparatus is provided with a display unit capable of displaying augmented reality (AR) content; create a data set, based on the plurality of videoconferencing parameters, for a virtual videoconferencing space, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in said virtual videoconferencing space and parameters of a video stream of each participant apparatus; and deliver the data set as a bitstream to said at least one participant apparatus provided with the display unit capable of displaying AR content.
17. An apparatus comprising: a camera configured to capture a video about a user of said apparatus; a display unit capable of displaying augmented reality (AR) content; means for determining a plurality of videoconferencing parameters of a video conference; means for sending said plurality of videoconferencing parameters to at least one participant apparatus of the videoconference; means for receiving a data set from said at least one participant apparatus of the videoconference, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in a virtual videoconferencing space created by said at least one participant apparatus and parameters of a video stream of each participant apparatus; and means for rendering, based on the data set, the virtual videoconferencing space on the display unit, where representations of users of each participant apparatus are rendered on their corresponding spatial positions.
18. An apparatus comprising: a camera configured to capture a video about a user of said apparatus; a display unit capable of displaying augmented reality (AR) content; at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: determine a plurality of videoconferencing parameters of a video conference; send said plurality of videoconferencing parameters to at least one participant apparatus of the videoconference; receive a data set from said at least one participant apparatus of the videoconference, wherein the data set comprises at least parameters for spatial positions of each participant apparatus in a virtual videoconferencing space created by said at least one participant apparatus and parameters of a video stream of each participant apparatus; and render, based on the data set, the virtual videoconferencing space on the display unit, where representations of users of each participant apparatus are rendered on their corresponding spatial positions.
PCT/FI2022/050309 2021-06-16 2022-05-09 A method, an apparatus and a computer program product for video conferencing WO2022263709A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22824364.8A EP4356603A1 (en) 2021-06-16 2022-05-09 A method, an apparatus and a computer program product for video conferencing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20215712 2021-06-16
FI20215712 2021-06-16

Publications (1)

Publication Number Publication Date
WO2022263709A1 true WO2022263709A1 (en) 2022-12-22

Family

ID=84526915

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2022/050309 WO2022263709A1 (en) 2021-06-16 2022-05-09 A method, an apparatus and a computer program product for video conferencing

Country Status (2)

Country Link
EP (1) EP4356603A1 (en)
WO (1) WO2022263709A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150244987A1 (en) * 2012-09-28 2015-08-27 Alcatel Lucent Immersive videoconference method and system
US20210120066A1 (en) * 2019-12-06 2021-04-22 Intel Corporation Session description protocol (sdp) based signaling of camera calibration parameters

Also Published As

Publication number Publication date
EP4356603A1 (en) 2024-04-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22824364

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022824364

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2022824364

Country of ref document: EP

Effective date: 20240116