WO2021198550A1

WO2021198550A1 - A method, an apparatus and a computer program product for streaming conversational omnidirectional video

Info

Publication number: WO2021198550A1
Application number: PCT/FI2020/050797
Authority: WO
Inventors: Sujeet Shyamsundar Mate; Igor Curcio; Saba AHSAN
Original assignee: Nokia Technologies Oy
Priority date: 2020-04-03
Filing date: 2020-11-25
Publication date: 2021-10-07

Abstract

The embodiments relate to a method and to a technical equipment for implementing the method. The method comprises streaming conversational omnidirectional video from a sender device to a receiver device; receiving a message from the receiver device, wherein the message indicates a region (601) relating to a content in the omnidirectional video and a condition; detecting that a condition defined in the message is met; generating an overlay (610) comprising the content of the region being defined in the received message; and transmitting the generated overlay with the omnidirectional video to the receiver device to be overlaid on a current viewport (602) in the omnidirectional video.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR STREAMING CONVERSATIONAL OMNIDIRECTIONAL VIDEO

Technical Field

The present solution generally relates to streaming conversational omnidirectional video data.

Background

Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with relatively narrow field of view and displayed as a rectangular scene on flat displays. Such content is referred as “flat content”, or “flat image”, or “flat video” in this application. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

Summarv

The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.

Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.

According to a first aspect, there is provided a method comprising at least

- streaming conversational omnidirectional video from a sender device to a receiver device;

- receiving a message from the receiver device, wherein the message indicates a region relating to a content in the omnidirectional video and a condition;

- detecting that a condition defined in the message is met;

- generating an overlay comprising the content of the region being defined in the received message; and

- transmitting the generated overlay with the omnidirectional video to the receiver device to be overlaid on a current viewport in the omnidirectional video.

According to a second aspect, there is provided an apparatus comprising at least:

- means for streaming conversational omnidirectional video from a sender device to a receiver device;

- means for receiving a message from the receiver device, wherein the message indicates a region relating to a content in the omnidirectional video and a condition;

- means for detecting that a condition defined in the message is met;

- means for generating an overlay comprising the content of the region being defined in the received message; and

- means for transmitting the generated overlay with the omnidirectional video to the receiver device to be overlaid on a current viewport in the omnidirectional video.

According to a third aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to:

- stream conversational omnidirectional video from a sender device to a receiver device;

- receive a message from the receiver device, wherein the message indicates a region relating to a content in the omnidirectional video and a condition;

- detect that a condition defined in the message is met;

- generate an overlay comprising the content of the region being defined in the received message; and

- transmit the generated overlay with the omnidirectional video to the receiver device to be overlaid on a current viewport in the omnidirectional video.

According to a fourth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to at least:

- detect that a condition defined in the message is met;

According to an embodiment, the condition relates to a viewport of the receiver device.

According to an embodiment, information on the current viewport of the receiver device is received.

According to an embodiment, it is detected that the region defined in the received message is outside the current viewport of the receiver device, as a response of which the overlay is generated. According to an embodiment, the generated overlay to be rendered is indicated to be rendered as a viewport-relative overlay or as a sphere-relative overlay with a timeline pause.

According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.

Description of the Drawings

In the following, various embodiments will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an example of OMAF system architecture;

Figs 2a - 2c show examples of streaming conversational omnidirectional video;

Fig. 3 shows an example of an apparatus for content authoring;

Fig. 4 shows an example of an apparatus for content consumption;

Fig. 5 illustrates a problem behind the present solution;

Fig. 6 illustrates an example for a condition-region and condition- definition-triggered initiation of viewport locked overlay;

Fig. 7 shows an example of implementation steps for condition-based overlays;

Fig. 8 shows an example of sequence of signalling; and

Fig. 9 is a flowchart illustrating a method according to an embodiment.

Description of Example Embodiments

The present embodiments relates to rendering of overlays in omnidirectional conversational video applications. In particular, the present embodiments are related to scenario where the conditions for overlay rendering are dynamic. The motivation for the present solution is to enable condition-triggered rendering of overlays, which enables the possibility of live encoding to dynamically insert overlays over visual content.

Users may consume both videos and images as visual content. However, the consumption of videos and images have been independent on each other. The recent development of applications - such as immersive multimedia - has enabled new use cases where users consume both videos and images together.

Immersive multimedia - such as omnidirectional content consumption - is more complex for the end user compared to the consumption of 2D content. This is due to the higher degree of freedom available to the end user (e.g., three degrees of freedom for yaw, pitch and roll). This freedom also results in more uncertainty. The situation is further complicated when layers of content are rendered, e.g. in case of overlays

As used herein the term “omnidirectional” may refer to media content that has greater spatial extent than a field-of-view of a device rendering the content. Omnidirectional content may for example cover substantially 360 degrees in the horizontal dimension and substantially 180 degrees in the vertical dimension, but omnidirectional may also refer to content covering less than 360 degree view in the horizontal direction and/or 180 degree view in the vertical direction.

The MPEG Omnidirectional Media Format (OMAF) v1 standardized the omnidirectional streaming of single 3DoF content (where the viewer is located at the centre of a unit sphere and has three degrees of freedom (Yaw-Pitch- Roll).

In cube map projection format, spherical video is projected onto the six faces (a.k.a. sides) of a cube. The cube map may be generated e.g. by first rendering the spherical scene six times from a viewpoint, with the views defined by a 90 degree view frustum representing each cube face. The cube sides may be frame-packed into the same frame or each cube side may be treated individually (e.g., in encoding). There are many possible orders of locating cube sides onto a frame and/or cube sides may be rotated or mirrored. The frame width and height for frame-packing may be selected to fit the cube sides "tightly" e.g. at 3x2 cube side grid, or may include unused constituent frames e.g. at 4x3 cube side grid.

In general, 360-degree content can be mapped onto different types of solid geometrical structures, such as a polyhedron (that is, a three-dimensional solid object containing flat polygonal faces, straight edges and sharp corners or vertices, e.g., a cube or a pyramid), a cylinder (by projecting a spherical image onto the cylinder, as described above with the equirectangular projection), a cylinder (directly without projecting onto a sphere first), a cone, etc. and then unwrapped to a two-dimensional image plane. The two-dimensional image plane can also be regarded as a geometrical structure. In other words, 360- degree content can be mapped onto a first geometrical structure and further unfolded to a second geometrical structure. However, it may be possible to directly obtain the transformation to the second geometrical structure from the original 360-degree content or from other wide view visual content. In general, an omnidirectional projection format may be defined as a format to represent (up to) 360-degree content on a two-dimensional image plane. Examples of omnidirectional projection formats include the equirectangular projection format and the cubemap projection format.

OMAF defines formats for enabling the access and delivery of omnidirectional media. The media components are distributed (example multiple resolutions, bitrate/quality) among different bitstreams to provide the application, the freedom to choose between them for addressing various system challenges such as network bandwidth, Temporal and Spatial random access for user interaction.

The currently standardized Omnidirectional Media Format (OMAF) v2 enables the use of multiple omnidirectional and overlay videos and images. Figure 1 shows an example of OMAF system architecture. As shown in Figure 1 , an omnidirectional media (A) is acquired. The omnidirectional media comprises image data (Bi) and audio data (B_a), which are processed separately.

In image stitching, rotation, projection and region-wise packing, the images/video of the source media and provided as input (Bi) are stitched to generate a sphere picture on a unit sphere per the global coordinate axel. The unit sphere is then rotated relative to the global coordinate axes. The amount of rotation to convert from the local coordinate axes to the global coordinate axes may be specified by the rotation angles indicated in a RotationBox. The local coordinate axes of the unit sphere are the axes of the coordinate system that has been rotated. The absence of the RotationBox indicates that the local coordinate axes are the same as the global coordinate axes. Then, the spherical picture on the rotated unit sphere is converted to a two-dimensional projected picture, for example using the equirectangular projection. When spatial packing of stereoscopic content is applied, two spherical pictures for the two views are converted to two constituent pictures, after which frame packing is applied to pack the two constituent pictures on one projected picture. Rectangular region-wise packing can then be applied to obtain a packed picture from the projected picture. The packed pictures (D) are then provided for video and image encoding to result in encoded image (Ei) and/or encoded video stream (E_v).

The audio of the source media is provided as input (B_a) to audio encoding that provides as an encoded audio (E_a). The encoded data (Ei, E_v, E_a) are then encapsulated into file for playback (F) and delivery (i.e. streaming) (F_s).

In the OMAF player 200, a file decapsulator processes the files (F’, F’ s ) and extracts the coded bitstreams (E’i, EV, E’_a) and parses the metadata. The audio, video and/or images are then decoded into decoded data (D’, B’_a). The decoded pictures (D’) are projected onto a display according to the viewpoint and orientation sensed by a head/eye tracking device. Similarly, the decoded audio (B’_a) is rendered through loudspeakers/headphones.

A viewport may be defined as a region of omnidirectional image or video suitable for display and viewing by the user. A current viewport (which may be sometimes referred simply as a viewport) may be defined as the part of the spherical video that is currently displayed and hence is viewable by the user(s). At any point of time, a video rendered by an application on a head-mounted display (FIMD) renders a portion of the 360-degrees video, which is referred to as a viewport. Likewise, when viewing a spatial part of the 360-degree content on a conventional display, the spatial part that is currently displayed is a viewport. A viewport is a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display. A viewport may be characterized by a horizontal field-of-view (VHFoV) and a vertical field-of-view (WFoV). In the following, the horizontal field-of-view of the viewport will be abbreviated with FIFoV and, respectively, the vertical field-of-view of the viewport will be abbreviated with VFoV.

A sphere region may be defined as a region on a sphere that may be specified by four great circles or by two azimuth circles and two elevation circles and additionally by a tile angle indicating rotation along the axis originating from the sphere origin passing through the center point of the sphere region. A great circle may be defined as an intersection of the sphere and a plane that passes through the center point of the sphere. A great circle is also known as an orthodrome or Riemannian circle. An azimuth circle may be defined as a circle on the sphere connecting all points with the same azimuth value. An elevation circle may be defined as a circle on the sphere connecting all points with the same elevation value.

The Omnidirectional Media Format (“OMAF”) standard (ISO/IEC 23090-2) specifies a generic timed metadata syntax for sphere regions. A purpose for the timed metadata track is indicated by the track sample entry type. The sample format of all metadata tracks for sphere regions specified starts with a common part and may be followed by an extension part that is specific to the sample entry of the metadata track. Each sample specifies a sphere region.

One of the specific sphere region timed metadata tracks specified in OMAF is known as a recommended viewport timed metadata track, which indicates the viewport that should be displayed when the user does not have control of the viewing orientation or has released control of the viewing orientation. The recommended viewport timed metadata track may be used for indicating a recommended viewport based on a “director's cut” or based on measurements of viewing statistics. A textual description of the recommended viewport may be provided in the sample entry. The type of the recommended viewport may be indicated in the sample entry and may be among the following: a. A recommended viewport per the director's cut, e.g., a viewport suggested according to the creative intent of the content author or content provider; b. A recommended viewport selected based on measurements of viewing statistics; c. As defined by applications or external specifications.

Term “observation point or viewpoint” refers to a volume in a three-dimensional space for virtual reality audio/video acquisition or playback. A viewpoint is trajectory, such as a circle, a region, or a volume, around the centre point of a device or rig used for omnidirectional audio/video acquisition and the position of the observer's head in the three-dimensional space in which the audio and video tracks are located. In some cases, an observer's head position is tracked and the rendering is adjusted for head movements in addition to head rotations, and then a viewpoint may be understood to be an initial or reference position of the observer's head. In implementations utilizing DASH (Dynamic adaptive streaming over HTTP), each observation point may be defined as a viewpoint by a viewpoint property descriptor. The definition may be stored in ISOBMFF or OMAF type of file format. The delivery could be HLS (HTTP Live Streaming), RTSP/RTP (Real Time Streaming Protocol/Real-time Transport Protocol) streaming in addition to DASH.

Term “Viewpoint group” refers to one or more viewpoints that are either spatially related or logically related. The viewpoints in a Viewpoint group may be defined based on relative positions defined for each viewpoint with respect to a designated origin point of the group. Each Viewpoint group may also include a default viewpoint that reflects a default playback starting point when a user starts to consume audio-visual content in the Viewpoint group, without choosing a viewpoint, for playback. The default viewpoint may be the same as the designated origin point. In some embodiments, one viewpoint may be included in multiple Viewpoint groups.

Term “spatially related Viewpoint group” refers to viewpoints which have content that has a spatial relationship between them. For example, content captured by VR cameras at different locations in the same basketball court or a music concert captured from different locations on the stage.

Term “logically related Viewpoint group” refers to related viewpoints which do not have a clear spatial relationship, but are logically related. The relative position of logically related viewpoints are described based on the creative intent. For example, two viewpoints that are members of a logically related Viewpoint group may correspond to content from the performance area and the dressing room. Another example could be two viewpoints from the dressing rooms of the two competing teams that form a logically related Viewpoint group to permit users to traverse between both teams to see the player reactions.

Term “static Viewpoint” refers to a viewpoint that remains stationary during one virtual reality audio/video acquisition and playback session. For example, a static viewpoint may correspond with virtual reality audio/video acquisition performed by a fixed camera.

Term “dynamic Viewpoint” refers to a viewpoint that does not remain stationary during one virtual reality audio/video acquisition and playback session. For example, a dynamic Viewpoint may correspond with virtual reality audio/video acquisition performed by a moving camera on rails or a moving camera on a flying drone.

Term “overlay” refers to a visual media that is rendered over 360-degree video content.

Videos and/or images may be overlaid on an omnidirectional video and/or image. The coded overlaying video can be a separate stream or part of the bitstream of the currently rendered 360-degree video/image. A omnidirectional streaming system may overlay a video/image on top of the omnidirectional video/image being rendered. The overlaid two-dimensional video/image may have a rectangular grid or a non-rectangular grid. The overlaying process may cover the overlaid video/image or a part of the video/image or there may be some level of transparency/opacity or more than one level of transparency/opacity wherein the overlaid video/image may be seen under the overlaying video/image but with less brightness. In other words, there could be an associated level of transparency corresponding to the video/image in a foreground overlay and the video/image in the background (video/image of VR scene). The terms opacity and transparency may be used interchangeably.

The overlaid region may have one or more than one levels of transparency. For example, the overlaid region may have different parts with different levels of transparency. In accordance with an embodiment, the transparency level could be defined to be within a certain range, such as from 0 to 1 so that the smaller the value the smaller is the transparency, or vice versa.

Additionally, the content provider may choose to overlay a part of the same omnidirectional video over the current viewport of the user. The content provider may want to overlay the video based on the viewing condition of the user. For example, overlaying may be performed, if the user’s viewport does not match the content provider’s recommended viewport, where the recommended viewport is determined by the content creator. In this case, the client player logic overlays the content provider’s recommended viewport (as a preview window) on top of the current viewport of the user. It may also be possible to overlay the recommended viewport, if the user’s current viewport does not match, such that the position of the overlaid video is based on the direction in which the user is viewing. For example, overlaying the recommended viewport to the left of the display, if the recommended viewport is to the left of the user’s current viewport. It may also be possible to overlay the whole 360-degree video. Yet another example is to use the overlaying visual information as a guidance mechanism to guide the user towards the recommended viewport, for example guiding people who are hearing impaired.

There may be one or more conditions on when and how to display the visual overlay. Therefore, a rendering device may need to receive information which the rendering device may use to perform the overlaying as indicated by the signaled information.

One or more overlays may be carried in a single visual media track or a single image item. When more than one overlay is carried in a single track or image item, or when an overlay is carried with other media (e.g. background), a mapping of regions from the samples of the track or the image item to the overlay metadata may be provided, e.g. in or associated with the OverlayStruct.

When several tracks or image items are collectively carrying one or more overlays and/or the background visual media, a group of the tracks and image items may be indicated in a container file. For example, an entity group of ISOBMFF may be used for this purpose. “Support of Immersive Teleconferencing and Telepresence for Remote Terminals” (ITT4RT) has been defined in SP-180985. The objective of this is to specify VR support in MTSI in TS 26.114 and IMS-based Telepresence in TS 26.223 to enable support of an immersive experience for remote terminals joining teleconferencing and telepresence sessions. For MTSI, the work is expected to enable scenarios with two-way audio and one-way immersive video, e.g., a remote single user wearing an HMD participates to a conference will send audio and optionally 2D video (e.g., of a presentation, screen sharing and/or a capture of the user itself), but receives stereo or immersive voice/audio and immersive video captured by an omnidirectional camera in a conference room connected to a fixed network.

More specifically, this work aims to specify the following aspects for immersive video and immersive voice/audio support: a) Recommendations of audio and video codec configurations (e.g., profile, level, and encoding constraints of IVAS, EVS, HEVC, AVC as applicable) to deliver high quality VR experiences b) Constraints on media elementary streams and RTP encapsulation formats c) Recommendations of SDP configurations for negotiating of immersive video and voice/audio capabilities. For immersive voice and audio considerations using IVAS, this is dependent on specification of the IVAS RTP payload format to be developed as part of the IVAS Wl d) An appropriate signalling mechanism, e.g., RTP/RTCP-based, for indication of viewport information to enable viewport-dependent media processing and delivery

The RTP payload format and SDP parameters to be developed under the IVAS Wl will be considered to support use of the IVAS codec for immersive voice and audio. The RTP payload format and SDP parameters for HEVC will be considered to support immersive video.

For video codec(s), use of omnidirectional video specific Supplemental Enhancement Information (SEI) messages for carriage of metadata required for rendering of the omnidirectional video has been considered. In the following, examples for streamable conversational, 360-degree event, such as a 360-degree conference, teleconference, telepresence, are discussed. However, in addition to the 360-degree conference, the embodiments are suitable for other VR streaming solutions, as well. Figures 2a - 2c represent various scenarios for a 360-degree teleconference. A 360- degree conference can be a live meeting which is streamed to receiver device(s) by the sender, wherein the sender is a video source, such as a 360- degree (i.e. omnidirectional) camera, or a system being operatively connected to a video source or comprising means to record video. The streamable content from the sender to the receiver comprises at least video and audio. The purpose of the sender is to stream video being recorded forward to receiver device(s). The sender may also comprise means for receiving at least audio data from receiver device(s) and output the received audio data to the participants of the streamable event.

In Figures 2a - 2c a group of participants is having a meeting in a conference room. The conference room can be considered as a virtual conference system A with physical elements (i.e. camera 220, view screen 210, physical participants) being able to share content to and to receive data from remote participants. As mentioned, the virtual conference system A comprises at least a 360-degree (i.e. omnidirectional) camera 220 and a view screen 210. The meeting is also participated by two remote participants B, C through a conference call. Physical participants of the virtual conference system A use the view screen 210 to display a shared presentation and/or video streams coming from the remote participants B, C. One of the remote participants B is using a head mounted display for having a 360-degree view to conference content and a camera that captures his/her video. One of the remote participants C uses a mobile phone to access the conference. The mobile phone is able to show a 360-degree video on the conference and to capture his/her video.

In the example of Figure 2a, the conference call is set up without any media- aware network elements. Both remote participants B, C send information about their viewport orientation to the virtual conference system A, which in turn sends them a viewport-dependent video stream from the 360-degree camera 220. In the example of Figure 2b, the conference call is set up using a network function, which may be performed by either a Media Resource Function (MRF) or a Media Control Unit (MCU) 230. In this example, the MRF/MCU 230 receives a viewport-independent stream from the virtual conference system A. Both remote participants B, C send viewport orientation information to the MRF/MCU 230 and receive viewport-dependents streams from it. The A/V channel for conversational non-immersive content may also go through the MRF/MCU 230, as shown in Figure 2b. The example of Figure 2b aims to enable immersive experience for remote participants B, C joining the teleconference with two-way audio and one-way immersive video.

In the example of Figure 2c, a virtual conference system for multiple conference rooms X are sending 360-degree video to an MRF/MCU 230. The rooms may choose to receive 2D video streams from other participants including one of the other rooms, which is displayed on the view screen 210 in the room. The remote participants B, C can choose to view any one or none of the available 360-degree videos from the multiple rooms. Switching from one room to another may be triggered manually, or using other mechanisms, such as viewing direction or dominant speaker. The MRF/MCU 230 may signal to pause the receiving 360-degree video from any of the rooms that do not currently have any active viewers.

In some embodiments, the 360-degree conference can be completely virtual, where all the meeting participants are remote participants, i.e. receiver devices connecting to the conference via a network, and where the sender is a computer generating a virtual representation of the virtual conference and the remote participants.

The SDP signalling semantics has been defined as a solution for predefined region signalling, and it has the following applicability for overlays as well:

- Allow adding real-time overlay on the top of a pre-defined region. Client devices may use the pre-defined regions as hints for personalized overlay operations. Compared to handling overlays for all audience on the server side, a client device may utilize its own computing power to generate overlays on pre-defined regions. - Content to be overlaid on the predefined region may be encoded and transported separately with higher quality.

An MTSI sender supporting the Overlay’ feature can allow adding real-time overlay on top of a defined region and offer this capability in the SDP as part of the initial offer-answer negotiation. Defined regions for overlays can be offered by including the "a=overlay" attribute under the relevant media line corresponding to the related 360 video and overlay images. The following parameters can be provided in the attribute for each defined region:

The parameters below are aligned with “Sphere-relative two-dimensional overlay” specification in OMAF.

• OverlayJD - identifies the offered defined region for overlay

• Overlay_azimuth: Specifies the azimuth of the centre point of the sphere region corresponding to the offered region for overlay in units of 2^-16 degrees relative to the global coordinate axes.

• Overlay_elevation: Specifies the elevation of the centre point of the sphere region corresponding to the offered region for overlay in units of 2^-16 degrees relative to the global coordinate axes.

• Overlay_tilt: Specifies the tilt angle of the sphere region corresponding to the offered region for overlay, in units of 2^-16 degrees, relative to the global coordinate axes.

• Overlay_azimuth_range: Specifies the azimuth range of the sphere region corresponding to the offered region for overlay through the centre point of the sphere region in units of 2^-16 degrees.

• Overlay_elevation_range: Specifies the elevation range of the sphere region corresponding to the offered region for overlay through the centre point of the sphere region in units of 2^-16 degrees.

• Name- specifies the name of the offered region for overlay.

The parameters below are aligned with “Viewport-relative overlay” specification in OMAF.

OverlayJD - identifies the offered region for overlay • Overlay_rect_left_percent: Specifies the x-coordinate of the top-left comer of the rectangular region of the overlay to be rendered on the viewport in per cents relative to the width and height of the viewport. The values are indicated in units of 2^-16 in the range of 0 (indicating 0%), inclusive, up to but excluding 65536 (that indicates 100%).

• Overlay_rect_top_percent: Specifies the y-coordinate of the top-left corner of the rectangular region of the overlay to be rendered on the viewport in per cents relative to the width and height of the viewport. The values are indicated in units of 2^-16 in the range of 0 (indicating 0%), inclusive, up to but excluding 65536 (that indicates 100%).

• Overlay_rect_width_percent: Specifies the width of the top-left corner of the rectangular region of the overlay to be rendered on the viewport in per cents relative to the width and height of the viewport. The values are indicated in units of 2^-16 in the range of 0 (indicating 0%), inclusive, up to but excluding 65536 (that indicates 100%).

• Overlay_rect_height_percent: Specifies the height of the top-left corner of the rectangular region of the overlay to be rendered on the viewport in per cents relative to the width and height of the viewport. The values are indicated in units of 2^-16 in the range of 0 (indicating 0%), inclusive, up to but excluding 65536 (that indicates 100%).

NOTE: The size of overlay region over the viewport changes according to the viewport resolution and aspect ratio. However, the aspect ratio of the overlaid media is not intended to be changed.

• Disparity_in_percent: Specifies the disparity, in units of 2^-16, as a fraction of the width of the display window for one view. The value may be negative, in which case the displacement direction is reversed. This value is used to displace the region to the left on the left eye view and to the right on the right eye view.

• Name specifies the name of the offered region for overlay.

For the negotiated 360 video, a=overlay" attribute is expected to contain all of the above parameters for overlays under the m= line, each overlay image is expected to only contain the corresponding OverlayJD parameter as part of a=overlay" attribute. In response to the SDP offer with the set of offered regions provided using the "a=overlay" line(s), an MTSI client accepting Overlay’ can provide an SDP answer using the "a=overlay" line(s) containing the accepted set of regions. The accepted set of regions for overlays would be a subset of the offered set of regions.

A new SDP offer-answer negotiation can be performed to modify the set of regions for overlays. The MTSI sender may update all the content of defined regions, including the total number of overlay regions, and the spherical coordinates and name of each of the overlay regions.

An overlay may fall outside the user’s field of view (FOV), i.e., a viewport of a user becomes non-overlapping with the overlay. For example, after a user rotates during omnidirectional media content playback, the viewport of the user become non-overlapping with the visual overlay. Depending on the specific situation, it may be desirable to continue or pause the playback of the overlay when the user is not watching the overlay. For example, it may be desirable to pause a timeline of overlay playback until the overlay overlaps again with the user’s viewport. It may also be desirable to continue playback of the overlay even though the overlay is outside the user’s viewport.

An example of a device for content authoring, i.e. an apparatus according to an embodiment, is shown in Figure 3. The apparatus 90 comprises a main processing unit 91 , a memory 92, a user interface 94, a communication interface 93. The apparatus according to an embodiment, shown in Figure 3, also comprises a camera module 95. Alternatively, the apparatus may be configured to receive image and/or video data from an external camera device over a communication network. The memory 92 stores data including computer program code in the apparatus 90. The computer program code is configured to implement the method according various embodiments. The camera module 95 or the communication interface 93 receives data, in the form of images or video stream, to be processed by the processor 91 . The communication interface 93 forwards processed data, i.e. the image file, for example to a display of another device, such a virtual reality headset. When the apparatus 90 is a video source comprising the camera module 95, user inputs may be received from the user interface. An example of a device for content consumption, i.e. an apparatus according to another embodiment, is shown in Figure 4. The apparatus in this example is a virtual reality headset, such as a head-mounted display (HMD) 400 for stereo viewing. The head-mounted display 400 comprises two screen sections or two screens 420, 430 for displaying the left and right eye images. The displays 420, 430 are close to the eyes, and therefore lenses are used to make the images easily viewable and for spreading the images to cover as much as possible of the eyes’ field of view. The device is attached to the head of the user so that it stays in place even when the user turns his head. The device may have an orientation detecting module 410 for determining the head movements and direction of the head. The head-mounted display is able to show omnidirectional content (3DOF content) of the recorded/streamed image file to a user.

Examples for an application of omnidirectional video comprises a video call or a teleconference with video. Such a call or conference with omnidirectional video has a benefit of providing greater immersion for a user, and a possibility to explore the captured space in different directions. However, the user’s viewport has limited FOV, and consequently, whenever more than one events or objects of interest occur,

• they do not simultaneously occur within the user’s viewport or user’s field of view;

• they are temporally (partially) overlapping or within temporal intervals which do not permit comfortable movement of the user’s viewport or orientation (e.g., moving the head forth and back all the time, trying to watch the content from both events where only one is in the current user’s viewport).

In order to tackle this, prior to the present embodiments, the only option available for the user has been to choose one of the multiple important events or objects of interest. However, this leads to missing out some other events. Figure 5 illustrates a scenario relating to the problem. At time Ti, a user is viewing an object of interest POh in his/her current viewport 501 on an omnidirectional video. Subsequently, at time T2, another event of interest POI2 appears, whereupon the user changes his/her viewport to a viewport 502 and becomes oriented towards an object of interest PO . As a consequence, the user is not able to view the object of interest POI1 anymore.

The present embodiments aims to solve the problem, by providing conditional overlays for omnidirectional conversational video applications, such as video calls and video conferences. The present embodiments may be related to the playback with pause or with temporal gap based on the conditional overlay signaling parameter.

Before describing the solution any further, a term “conditional overlay” is defined. “Conditional overlay” (also referred to as “condition-triggered overlay”) refers to overlays which are initiated when a certain condition is true or false. For example, a part of the omnidirectional video is indicated as a conditional region, which indicates whether or not the conditional overlay should be inserted.

In the present embodiments conditional overlay are inserted/initiated during viewport dependent (or viewport independent) consumption of omnidirectional conversational video. The sender device, i.e. a content authoring device, is configured to signal a possibility to support conditional overlays during session negotiation (e.g., as a capability in the SDP (Session Description Protocol)). The receiver device, i.e. content consumption device, is configured to signal the condition for the overlay initiation. The condition for the overlay initiation comprises information on the condition region and the condition due to which the overlay should be inserted. The condition region is the part of the omnidirectional video which should be monitored in order to detect whether it is within or not within the receiver device’s viewport. This leads the sender device to initiate the conditional overlay containing at least a part of the indicated conditional region whenever the indicated condition is met. Consequently, the sender device is configured to deliver the overlay triggered due to non-overlap with conditional region as - according to an embodiment - viewport-locked overlay in the background video sent from the sender device to the receiver device.

According to another embodiment, the condition region may be turned into a sphere-locked overlay which pauses the playback if it is not in the user’s viewport. This enables the user to come back to earlier viewport and to review the content in the sphere-locked overlay.

Therefore, in general, the conditional overlay can be one of the following:

- condition region o viewport overlap o viewport non-overlap (default)

- viewport relative o if the conditional overlay is viewport relative overlay, it is always visible;

- sphere-relative o if the condition region is not in user’s viewport, the content in the conditional region is “paused” for the duration the user is not viewing the said condition region.

According to an embodiment, the conditional overlay may be triggered when the condition region indicated by the receiver device is occluded by other notification overlays (e.g., new message notification, incoming call notification, etc.).

The details of the present embodiments are discussed in relation to a couple of use case scenarios 1 and 2.

Example of a use case 1 :

User A with a limited FOV device is having an MTSI (Multimedia Telephony Services over IMS) session with another venue with omnidirectional video capture and transmission capability. The omnidirectional content shows a speaker describing feature of a car. The speaker is walking around while describing the car. The user A is watching with intent to concentrate to the talk of the speaker but would like to have the car visible whenever the speaker moves away from the car. The car is indicated as the condition region by the user A. Subsequently, whenever user A’s viewport while following the speaker is such that the car is not visible, the video covering the condition region, i.e. car region, is shown as an overlay embossed on top of the venue background video. In this example, the user watching the video with a receiver device defines the condition for the overlay initiation. The condition in this example comprises information on the condition region (i.e. a car) and the condition due to which the overlay should be inserted. The condition due to which the overlay is to be inserted is “the condition region is outside the current viewport”. Therefore, a method according to this example comprises steps for determining is the condition region within the viewport, and if not, generating an overlay comprising the content of the condition region; and rendering the generated overlay on the viewport to the omnidirectional video.

Use case scenario 2:

Within the 3GPP ITT4RT framework, a multiparty conference may comprise multiple 360-degree video senders (multiple conference rooms). It may be useful to display e.g. room B and C as overlay, when the user is not in either room. The rooms B, C can be defined as conditional regions depending on the user’s currently chosen room or viewpoint. Switching to any viewpoint (or room) initiates an overlay indicating activity of the other viewpoint (or room). Furthermore, the region of the room or viewpoint to be rendered as overlay is also signalled by the receiver device.

In this example, the user watching the video with a receiver device defines the condition for the overlay initiation. The condition in this example comprises information on the condition region (i.e. room B and C) and the condition due to which the overlay should be inserted. The condition due to which the overlay is to be inserted is “when the user is not located in a room corresponding to the condition region”. Therefore, a method according to this example comprises steps for determining a location of the user; comparing the location of the user to a location defined in the condition region; if the location of the user and location defined in the condition region is not the same, generating an overlay comprising the content of the condition region; and rendering the generated overlay on the viewport in the omnidirectional video.

Figure 6 illustrates an example on the condition region and condition-definition- triggered initiation of viewport-locked overlay. The user indicates object of interest POh as the condition region 601 at time TK. The condition is defined to be a non-overlapping viewport, i.e. whenever the condition region 601 is not overlapping with the current viewport. Subsequently, at time TK+L, user changes his/her viewport to viewport 602 comprising another object of interest POI2. Consequently, user’s viewport 602 has a new overlay 610 showing the content of condition region 601 , i.e. the object of interest POI1 at time TK+L.

The signalling may be carried out as described in the following:

A client for an immersive conversational video supporting also the conditional overlays is configured to include “conditional overlay” related signalling in SDP (Session Description Protocol) for all media streams corresponding to the background videos (or the main video in MTSI). SDP is a format for describing streaming media communications parameters. The SDP may be used in conjunction with RTP (Real-time Transport Protocol), RTSP (Real Time Streaming Protocol), or SIP (Session Initiation Protocol).

The support for “conditional overlays” may be indicated or offered by including RTCP (RTP Control Protocol) feedback attribute with the “conditional overlays” type under the relevant media line scope. The “conditional overlay” may be expressed with the following parameters 3gpp-conditionai-overiay by extending the parameters disclosed in a publication “IETF RFC 4585 (2006): "Extended RTP Profile for Real-time Transport Control Protocol (RTCP) - Based Feedback (RTP/AVPF)", J. Ott, S. Wenger, N. Sato, C. Burmeister and J. Rey”. A wildcard payload type (“^*”) indicates that the RTCP feedback attribute is applicable to all the offered payload types. a=rtcp-fb:* 3gpp-conditional-overlay

The offer which indicates support for conditional overlays also incorporates “sent conditional overlay” with the a=extmap attribute indicating the sent overlay coordinates by extending the parameters disclosed in a publication “IETF RFC 5285 (2008): "A General Mechanism for RTP Fleader Extensions", D. Singer, H. Desineni”. The “Sent conditional overlay” can be signalled as shown below: a=extmap: 7 urn:3gpp:conditional-overlay-sent The signalling of “condition region” request using RTCP message as specified in IETF RFC 4585 is signalled with the in-built default condition of “condition region outside the user’s viewport or field of view”. The RTCP feedback message is identified by PT (Payload Type)=PSFB (206) or which refers to payload specific feedback message. FMT (Feedback Message Type) shall be set to the value Ί0’ for conditional overlay feedback messages. The RTCP feedback method can utilize the immediate feedback as well as the early RTCP modes.

The FCI (Feedback Control Information) for conditional overlays shall be as follows. The FCI shall contain one condition region. The conditional overlay request may be composed of the following parameters:

- Azimuth specifies the yaw rotation with respect to the origin of the omnidirectional content coordinate system. The unit is degrees or in terms of pixel. If the unit is degrees, it is in the range of [0, 360].

- Elevation specifies the pitch rotation with respect to the origin of the omnidirectional content coordinate system. The unit is degrees or in terms of pixel. If the unit is degrees, it is in the range of [0, 180].

- Florizontal-extent (FlorExt) specifies the horizontal field of view of the conditional overlay covered in the original content. The unit is degrees or in terms of pixel. If the unit is degrees, it is in the range of [0, 360].

- Vertical-extent (VerExt) specifies the vertical field of view of the conditional overlay covered in the original content. The unit is degrees or in terms of pixel. If the unit is degrees, it is in the range of [0, 180].

- COND_OVLY_ID - identifies the conditional overlay as requested by the receiver device.

0 1 2 3

0 1 2 3 4 3 3 7 8 9 0 2 2 3 4 5 6 7 8 3 0 1 2 3 4 0 3 7 8 9 0 1

I Az imuth ( h i I Aziniut h ( i s i El evat ion ( h ) i El evation ( 1 ) i j RorExt t R ) ! HorExt { i ) j YerExt ( h ) | Ver Ext. { 3. ) j For each two-byte indication of Azimuth, Elevation, HorExt and VerExt parameters, the high byte (indicated by ‘(h)’ above) shall be followed by the low byte (indicated by ‘(I)’ above), where the low byte holds the least significant bits. The above azimuth and elevation ranges assume the coverage of entire sphere with 360 degrees for azimuth and 180 degrees for elevation. However, in different implementation embodiments, different coordinates can be used with suitable ranges. Hence the above should be taken as a non-limiting example.

According to an embodiment, only the azimuth and elevation are signalled in the FCI by the receiver device to indicate object of interest (OOI) or event of interest (EOI) information. Consequently, the sender device responds with the overlay extent.

‘Sent conditional overlay’ involves signalling from the sender device to the receiver device, which helps the receiver device to know the actually sent conditional overlay corresponding to the video transmitted by the sender device, i.e. which may or may not be exactly the same with the conditional overlay coordinates requested by the receiver device, but is the best match as determined by the sender device, so that the end user at the receiver device is able to see the requested conditional overlay. When ‘Sent conditional overlay’ is successfully negotiated, it is signalled by the sender device.

The sent conditional overlay corresponds (indicated via the URN urn:3gpp:cond-overlay-sent in the SDP negotiation), the signalling of the conditional overlay uses RTP header extensions as specified in IETF 5285 and shall carry the Azimuth, Elevation, HorExt and VerExt parameters corresponding to the actually sent conditional overlay. The one-byte form of the header is used. The values for the parameters Azimuth, Elevation, HorExt and VerExt are indicated using two bytes, with the following format:

0 2 3

0 1 2 3 4 5 3 7 2 9 0 1 2 3 4 5 6 7 3 3 0 1 2 3 4 5 6 7 6 9 0 1

÷ - -ί ^. i- - t - + - -ί ^. i- - t - + - -f ^. i- - t - + - -f ^. i- - T - + - -f ^. i- - T - + - i- - T - + - -i ^. i- - T - + - -i ^. i- - T - + i ID j Ie:v---7 j Azimut h ( h ! i Azi muth ( I ) I H2Ievai: :i or; ( h i i i Elevation (4)! HorExt (hi i HorEjr;: (J} i VerExt (5) i jVerExt. il) ! zero paddrng !

÷- + -- T . * - - - + -- T . · - - - + -- T · - - - + -- T · - - - + -- T · - - - + -- T . · - - - + -- T . · - - - + -- T * - !-

The 4-bit ID is the local identifier as defined in IETF RFC 5285. The length field takes the value 7 to indicate that 8 bytes follow. For each two-byte indication of the Azimuth, Elevation, FlorExt and VerExt parameters, the high byte (indicated by ‘(h)’ above) shall be followed by the low byte (indicated by ‘(I)’ above), where the low byte holds the least significant bits.

According to another embodiment, the sent conditional overlay corresponds (indicated via the URN urn:3gpp:cond-overlay-sent in the SDP negotiation), the signalling of the conditional overlay uses RTP header extensions as specified in IETF 5285, and shall carry the RenderLocation (RLoc) parameter in addition to the Azimuth, Elevation, FlorExt and VerExt parameters corresponding to the actually sent conditional overlay content. The one-byte form of the header is used. The values for the parameters Azimuth, Elevation, FlorExt, VerExt and RenderLocation are indicated using two bytes with the following format:

0 1 2 3

01 2 34 3 37 8 90 2 23 4 56 7 83 0 1 23 4 03 7 89 0 1 j ID i Io-n--9 i Azimuth (hijAzimuth ( ! ) S Elevation i n ) I jEievati-cc (i)j EorExt (h; j HorE;n (I) jVerExt ( h ) ! iVerExt (I) j RLoc (n; i RLoc il) S EeropaROlnc_{j i}

The 4-bit ID is the local identifier as defined in IETF 5285. The length field takes the value 9 to indicate that 10 bytes follow. For each two-byte indication of the Azimuth, Elevation, FlorExt, VerExt and RLoc parameters, the high byte (indicated by ‘(h)’ above) shall be followed by the low byte (indicated by ‘(I)’ above), where the low byte holds the least significant bits. RLoc(h) specify the coordinates from the top-left corner. The bits from RLoc(h)7 up to bit position Rloc(h)4 and bits from RLoc(h)3 up to bit position RLoc(h)o indicate the ratio of width percentage and height percentage of the overlay relative to the main or background video received by the receiver device. Thus, in this example, ,the unit is percentage of the viewport size. However, in other embodiments the unit may be in terms of pixel resolution.

Example of an SDP session negotiation between the sender device and the receiver device is given in below:

SDP Offer: m-video 49134 RTP/AVP 99 a-^~tce:p:1 RTP/AVPP a-pc.fg:I t:1 b-Ao : 315 RS'U b-RR·;2500 a-rtpmap:99 H264/90000 a--frntp:99 packeti ation-mode---0; profile-level-id-42e0Oc; \ sprop-parameter~sets^JOLgDJWgUHGAf1&----,KMiSgA----- a-^~rtcp^~3lb:* tra-ant 5000 a-rtcp-fb aack a—rtcp-fb:* tack pli a¾ tcp-fb:^l ccru fir^' a-----r: cr-fb;* com tmmbr a--=extmap: 4 urn:3gpp:video-orientation a-::top-fP·.* 3gpp-coaditional-over1ay a^eataap:7 nrn:3gpp·.cond-overlay-sen

SDP Answer: ίϋ-video 49154 RTP/AvPF 39 a^acfq:1 t-i b~RS 0 b----RR 2500 a---rtpmap *. 99 H 264 / 90000 a -- i!viitp : 99 packet. 3.za;:iori--niode---- Q ; proi il'E:- ].evel-id^~- 42e0 Qc ; \ sprop-paramekei -sat 3^909rG;9ΐnr9H62zί 1A-- , KM46qA------- a-- 1 c p - f P ·. * t rr-i n t 9009 a--- r c p ^~ f b : * n a c k a----^;rtcp- 3^ b : * nack p l i a^~-- r t c r -- f b c c m k i r a-----rtcp-fb : * cc:5 tm™br a---sxtmap *. 4 virr; ; 3gpp : \.bxjeo --oriem: afcior: a^i Dcp-fb b 3 qpp - c o n d i t i o i3 a 1 - o v e r I a y a---^;£xtmap : 7 urn 3gpp : cond-over lay-aerri:

According to an embodiment, in the conditional overlay request

• the condition-region can also be indicated with an object tracking identifier sent by the receiver device and known to the sender device. This enables also handling spatially dynamic objects or events.

• the condition to initiate overlay can be indicated such that the condition-region is within the user’s field of view to initiate overlay into the background video.

• the overlay can be sphere relative (i.e. visible only when to a particular user’s viewing orientation), instead of viewport locked overlay initiation.

Figure 7 illustrates an overview of the steps in implementation of condition- triggered overlays. Figure 7 shows a sender device (i.e. sender user equipment, sender UE) 701 and a receiver device (i.e. receiver user equipment, receiver UE) 702. The sender device 701 is configured to

- confirm the availability of condition-based overlay insertion capability;

- generate an SDP offer with the condition-based overlay capability; and

- send the SDP offer with condition-based overlay capability to the receiver device 702 for each of the relevant media lines.

The receiver device 702 is configured to

- include condition-based overlay capability in answer to indicate the support; and - define condition region and condition, wherein in some embodiments the condition can be pre-defined (e.g. non-overlapping viewport with condition region true).

As a response to the SIP/SDP answer from the receiver device 702, the sender device 701 prepares to receive RTCP feedback.

The receiver device 702 is configured to generate RTCP feedback message as early as RTCP or immediate viewport, and sends the condition region 710 the to the sender device 701 whenever the receiver device 702 defines the condition region, condition, etc. The sender device 701 is configured to process the RTCP feedback and generate confirmation response.

The receiver device 702 is then configured to transmit user’s viewport or field of view, or to transmit non-overlapping viewport indication 720 to the sender device 701 . The sender device 701 is thus configured to monitor user’s viewing viewport or field of view or wait for the non-overlapping viewport indication. When the condition is true for the condition region, the sender device 701 is configured to initiate viewport locked overlay with the background and deliver the overlay to the receiver device 702 with a confirmation 730 on the delivery. The receiver device 702 is configured to receive and render the video stream from the sender device 701 .

Figure 8 illustrates a sequence of signalling to implement condition-based overlays in ITT4RT between a sender and a receiver.

The signalling being described above should not be unnecessarily considered as a limitation for the scope of the embodiments. For example, the RTCP signalling can be carried over other types of packet extensions. Also the shown signalling over SDP (and via RTCP) can be carried using other protocols at any of the ISO OSI layers.

The method according to an embodiment is shown in Figure 9. The method generally comprises at least streaming 910 conversational omnidirectional video from a sender device to a receiver device; receiving 920 a message from the receiver device, wherein the message indicates a region relating to a content in the omnidirectional video and a condition; detecting 930 that a condition defined in the message is met; generating 940 an overlay comprising the content of the region being defined in the received message; and transmitting 950 the generated overlay with the omnidirectional video to the receiver device to be overlaid on a current viewport in the omnidirectional video. Each of the steps can be implemented by a respective module of a computer system.

An apparatus according to an embodiment comprises at least means for streaming conversational omnidirectional video from a sender device to a receiver device; means for receiving a message from the receiver device, wherein the message indicates a region relating to a content in the omnidirectional video and a condition; means for detecting that a condition defined in the message is met; means for generating an overlay comprising the content of the region being defined in the received message; and means for transmitting the generated overlay with the omnidirectional video to the receiver device to be overlaid on a current viewport in the omnidirectional video. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 9 according to various embodiments.

The various embodiments may provide advantages. For example, the present embodiments are extensible to non-static events or objects of interest monitoring. The present embodiments also provide flexible combination of condition, and condition-region is important for omnidirectional content which is inherently difficult to monitor due to limited FOV of human vision and resource constraints. The present embodiments also add versatility to cover more than one important events happening at the same time, and to monitor more than one person or object of interest simultaneously. Further, the present embodiments may improve situational awareness of the user of the receiver device. And finally, the present embodiments may assist in covering monitoring use cases for scenes which have multiple important activities happening in the scene.

The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The computer program code comprises one or more operational characteristics. Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises streaming conversational omnidirectional video from a sender device to a receiver device; receiving a message from the receiver device, wherein the message indicates a region relating to a content in the omnidirectional video and a condition; detecting that a condition defined in the message is met; generating an overlay comprising the content of the region being defined in the received message; and transmitting the generated overlay with the omnidirectional video to the receiver device to be overlaid on a current viewport in the omnidirectional video

A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims. It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as, defined in the appended claims.

Claims

Claims:

1 . A method, comprising:

- detecting that a condition defined in the message is met;

2. An apparatus comprising:

- means for detecting that a condition defined in the message is met;

3. The apparatus according to claim 2, wherein the condition relates to a viewport of the receiver device.

4. The apparatus according to claim 3, wherein the apparatus comprises means for receiving information on the current viewport of the receiver device.

5. The apparatus according to claim 4, wherein the apparatus comprises means for detecting that the region defined in the received message is outside the current viewport of the receiver device, as a response of which the overlay is generated.

6. The apparatus according to any of the claims 2 to 5, wherein the generated overlay to be rendered is indicated to be rendered as a viewport-relative overlay or as a sphere-relative overlay with a timeline pause.

7. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to:

- stream conversational omnidirectional video from a sender device to a receiver device; - receive a message from the receiver device, wherein the message indicates a region relating to a content in the omnidirectional video and a condition;

- detect that a condition defined in the message is met;

- generate an overlay comprising the content of the region being defined in the received message; and - transmit the generated overlay with the omnidirectional video to the receiver device to be overlaid on a current viewport in the omnidirectional video.