WO2023194648A1

WO2023194648A1 - A method, an apparatus and a computer program product for media streaming of immersive media

Info

Publication number: WO2023194648A1
Application number: PCT/FI2023/050061
Authority: WO
Inventors: Igor Danilo Diego Curcio; Saba AHSAN; Emre Baris Aksu; Burak Kara; Mehmet Necmettin AKCAY; Ali Cengiz BEGEN
Original assignee: Nokia Technologies Oy
Priority date: 2022-04-08
Filing date: 2023-01-31
Publication date: 2023-10-12

Abstract

The embodiments relate to a method comprising defining a viewport (2110); determining head motion characteristics (2120); adjusting points or an area within the viewport according to head motion characteristics (2130); and determining a quality of the viewport using tile qualities for tiles that correspond to the point or the area within the viewport (2140). The embodiments also relate to an apparatus and a computer program product for implementing the method.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR MEDIA STREAMING OF IMMERSIVE MEDIA Technical Field The present solution generally relates to streaming of immersive media. Background Devices that are able to capture image and video have evolved from devices capturing a limited angular field of view to devices capturing 360-degree content. These devices are able to capture visual and audio content all around them, i.e., they can capture the whole angular field of view, which may be referred to as 360 degrees field of view. More precisely, the devices can capture a spherical field of view (i.e., 360 degrees in all spatial directions). In addition to the new types of image/video capturing devices, also new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future. Summary The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention. Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims. According to a first aspect, there is provided an apparatus comprising means for defining a viewport wherein the viewport comprises a plurality of points and at least one tile and an area, wherein each tile has a certain quality; means for determining head motion characteristics; means for adjusting the points or the area within the viewport according to the determined head motion characteristics; and means for determining a quality of the viewport using tile qualities within the viewport. According to a second aspect, there is provided a method, comprising defining a viewport wherein the viewport comprises a plurality of points and at least one tile and an area, wherein each tile has a certain quality; determining head motion characteristics; adjusting the points or the area within the viewport according to the determined head motion characteristics; and determining a quality of the viewport using tile qualities within the viewport. According to a third aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: define a viewport wherein the viewport comprises a plurality of points and at least one tile and an area, wherein each tile has a certain quality; determine head motion characteristics; adjust the points or the area within the viewport according to the determined head motion characteristics; and determine a quality of the viewport using tile qualities within the viewport. According to a fourth aspect, there is provided computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: define a viewport wherein the viewport comprises a plurality of points and at least one tile and an area, wherein each tile has a certain quality; determine head motion characteristics; adjust the points or the area within the viewport according to the determined head motion characteristics; and determine a quality of the viewport using tile qualities within the viewport. According to an embodiment, the points are distributed according to head motion characteristics; each point is located in the viewport; each tile is located in the viewport; each tile’s quality is determined; each point’s projection is determined on tiles; a quality of a current viewport is determined using weights and tile qualities. According to an embodiment, a viewport area is reshaped according to head motion characteristics; a viewport quality is determined based on the tile qualities within the reshaped viewport area. According o an embodiment, an overall viewport quality of an immersive video is determined based on the determined qualities of several viewport. According to an embodiment, the overall viewport quality is determined by giving weights to several viewports according to the head motion characteristics. According to an embodiment, a head speed threshold is determined, and the overall viewport quality is determined according to the weights of viewports with respect to the determined head speed threshold. According to an embodiment, head motion characteristics comprises direction and speed of a head. According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium. Description of the Drawings In the following, various embodiments will be described in more detail with reference to the appended drawings, in which Fig.1 shows an example of an OMAF end-to-end system; Fig.2a shows an example of equirectangular projection of a 360-degree video; Fig.2b shows an example of a viewport within the projection of Figure 2a; Fig.3 shows an example of head motion within a viewport; Fig.4 shows an example of defining a quality based on center tile quality; Fig.5 shows an example of defining a quality based on average quality of all tiles within a viewport; Fig.6 shows an example of defining a weighted average quality based on tile areas in the viewport; Fig.7 shows an example of having weights distributed according to the user gaze data; Fig.8 shows an example of user gaze in a viewport during stationary head position; Fig.9 shows an example of a user gaze in a viewport during a fast head motion to the right; Fig.10 – 12 show examples of a user gaze in a viewport during slow motion to the right; Fig.13 shows again an example of a user gaze during stationary head position; Fig.14 shows again an example of a user gaze during head motion; Fig.15 shows an example of an area within a viewport; Fig.16 shows an example of adjusting the area according to slow head motion; Fig.17 shows a geometrically more computable area for a slow head motion; Fig.18 shows an example of a calculation of an area with the help of geometrically more computable area; Fig.19 shows an example of adjusting the area according to fast head motion; Fig.20 shows a geometrically more computable area for a fast head motion; Fig.21 is a flowchart illustrating a method according to an embodiment; and Fig.22 shows an apparatus according to an embodiment.

Embodiments The following description and drawings are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well- known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure can be, but not necessarily are, reference to the same embodiment and such references mean at least one of the embodiments. Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment in included in at least one embodiment of the disclosure. The present embodiments relate generally to immersive video or 360-degree video, and in particular to viewport dependent streaming. Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with relatively narrow field of view and displayed as a rectangular scene on flat displays. Such content is referred as “flat content”, or “flat image”, or “flat video” in this application. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed). Such a flat video is output by a display device capable of displaying two- dimensional content. More recently, new image and video capture devices have become available. These devices are able to capture visual and audio content all around them, i.e., they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output such as head-mounted displays, and other devices, allow a person to see the 360-degree visual content. 360-degree video or virtual reality (VR) video generally refers to video content that provides such a large field of view (FOV) that only a part of the video is displayed at a single point of time in typical displaying arrangements. For example, VR video may be viewed on a head-mounted display (HMD) that may be capable of displaying, e.g., about a 100-degree field of view. The head- mounted display is capable of showing three-dimensional (3D) content. For that purpose, a head-mounted display may comprise two screen sections or two screens for displaying images for left and right eyes. The displays are close to the eyes, and therefore lenses are used to make the images easily viewable and for spreading the images to cover as much as possible of the eyes’ field of view. The HMD is attached to the head of the user so that it stays in place even when the user turns his head. The device may have an orientation detecting module for determining the head movements and direction of the head. The head-mounted display gives a three-dimensional (3D) perception of the recorded/streamed content to a user. The user of the head-mounted display sees, at a given time instant, only a portion of 360- degree content, referred to as viewport, the size of which is being defined by the vertical and horizontal field-of-views of the HMD. Most of the audio sources of the immersive content may be visible in the viewport, while some audio sources may reside behind the user, therefore being non-visible in the viewport. The spatial subset of the VR video content to be displayed may be selected based on the orientation of the HMD. In another example, a typical flat-panel viewing environment is assumed, wherein, e.g., up to a 40-degree field-of-view may be displayed. When displaying wide-FOV content (e.g., fisheye) on such a display, a spatial subset may be displayed rather than the entire picture. Available media file format standards include International Standards Organization (ISO) base media file format (ISO/IEC 14496-12, which may be abbreviated ISOBMFF), Moving Picture Experts Group (MPEG)-4 file format (ISO/IEC 14496-14, also known as the MP4 format), file format for NAL (Network Abstraction Layer) unit structured video (ISO/IEC 14496-15) and High Efficiency Video Coding standard (HEVC or H.265/HEVC). Some concepts, structures, and specifications of ISOBMFF are described below as an example of a container file format, based on which the embodiments may be implemented. The aspects of the invention are not limited to ISOBMFF, but rather the description is given for one possible basis on top of which the invention may be partly or fully realized. In the following, term “omnidirectional” may refer to media content that may have greater spatial extent than a field-of-view of a device rendering the content. Omnidirectional content may for example cover substantially 360 degrees in the horizontal dimension and substantially 180 degrees in the vertical dimension, but omnidirectional may also refer to content covering less than 360-degree view in the horizontal direction and/or 180-degree view in the vertical direction. A panoramic image covering a 360-degree field-of-view horizontally and a 180- degree field-of-view vertically can be represented by a sphere that has been mapped to a two-dimensional image plane using the equirectangular projection (ERP). In this case, the horizontal coordinate may be considered equivalent to a longitude, and the vertical coordinate may be considered equivalent to a latitude, with no transformation or scaling applied. In some cases, panoramic content with a 360-degree horizontal field-of-view, but with less than a 180-degree vertical field-of-view may be considered special cases of equirectangular projection, where the polar areas of the sphere have not been mapped onto the two-dimensional image plane. In some cases, panoramic content may have less than a 360-degree horizontal field-of-view and up to a 180-degree vertical field-of-view, while otherwise having the characteristics of an equirectangular projection format. Immersive multimedia, such as omnidirectional content consumption is more complex for the end user compared to the consumption of 2D content. This is due to the higher degree of freedom available to the end user. The freedom also results in more uncertainty. The MPEG Omnidirectional Media Format (OMAF) v1 standardized the omnidirectional streaming of single 3DoF (3 Degrees of Freedom) content (where the viewer is located at the centre of a unit sphere and has three degrees of freedom (Yaw-Pitch-Roll). OMAF version 2 was published in July 2021. Among other things, OMAF v2 includes means to optimize the Viewport Dependent Streaming (VDS) operations, overlay management and bandwidth management. OMAF version 3 is close to completion and it includes, among other things, a specific profile for the Versatile Video Coding (VVC). A viewport may be defined as a region of omnidirectional image or video suitable for display and viewing by the user. A current viewport (which may be sometimes referred simply as a viewport) may be defined as the part of the spherical video that is currently displayed and hence is viewable by the user(s). At any point of time, a video rendered by an application on a head-mounted display (HMD) renders a portion of the 360-degrees video, which is referred to as a viewport. Likewise, when viewing a spatial part of the 360-degree content on a conventional display, the spatial part that is currently displayed is a viewport. A viewport is a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display. A viewport may be characterized by a horizontal field-of-view (VHFoV) and a vertical field-of-view (VVFoV). In the context of viewport-dependent 360-degree video streaming, the term tile may refer to an isolated region, which may be defined to depend only on the collocated isolated region in reference pictures and does not depend on any other picture regions in the current picture or in the reference pictures. Sometimes the term tile or the term tile sequence may refer to a sequence of collocated isolated regions. It is remarked that video codecs may use the term tile as a spatial picture partitioning unit, which might not have the same constraints or properties as an isolated region. An example how a tile in the context of viewport-dependent 360-degree video streaming can be obtained with the High Efficiency Video Coding (HEVC) standard is called a motion-constrained tile set (MCTS). An MCTS is such that the inter prediction process is constrained in encoding such that no sample value outside the MCTS, and no sample value at a fractional sample position that is derived using one or more sample values outside the motion- constrained tile set, is used for inter prediction of any sample within the motion- constrained tile set. Additionally, the encoding of an MCTS is constrained in a manner that motion vector candidates are not derived from blocks outside the MCTS. This may be enforced by turning off temporal motion vector prediction of HEVC, or by disallowing the encoder to use the temporal motion vector prediction (TMVP) candidate or any motion vector prediction candidate following the TMVP candidate in a motion vector candidate list for prediction units located directly left of the right tile boundary of the MCTS except the last one at the bottom right of the MCTS. In general, an MCTS may be defined to be a tile set that is independent of any sample values and coded data, such as motion vectors, that are outside the MCTS. An MCTS sequence may be defined as a sequence of respective MCTSs in one or more coded video sequences or alike. In some cases, an MCTS may be required to form a rectangular area. It should be understood that depending on the context, an MCTS may refer to the tile set within a picture or to the respective tile set in a sequence of pictures. The respective tile set may be, but in general need not be, collocated in the sequence of pictures. A motion-constrained tile set may be regarded as an independently coded tile set, since it may be decoded without the other tile sets. Another example how a tile in the context of viewport-dependent 360-degree video streaming can be obtained with the Versatile Video Coding (VVC) standard is described next. The VVC standard supports subpictures (a.k.a. sub-pictures). A subpicture may be defined as a rectangular region of one or more slices within a picture, wherein the one or more slices are complete. Consequently, a subpicture consists of one or more slices that collectively cover a rectangular region of a picture. The slices of a subpicture may be required to be rectangular slices. Partitioning of a picture to subpictures (a.k.a. a subpicture layout or a layout of subpictures) may be indicated in and/or decoded from an SPS. One or more of the following properties may be indicated (e.g., by an encoder) or decoded (e.g. by a decoder) or inferred (e.g., by an encoder and/or a decoder) for the subpictures collectively or per each subpicture individually: i) whether or not a subpicture is treated as a picture in the decoding process; in some cases, this property excludes in-loop filtering operations, which may be separately indicated/decoded/inferred; ii) whether or not in-loop filtering operations are performed across the subpicture boundaries. Treating a subpicture as a picture in a decoding process may comprise saturating the sample locations in inter prediction that would otherwise be outside the subpicture onto the subpicture boundary. When the boundaries of a subpicture are treated like picture boundaries, and sometimes also when loop filtering is turned off across subpicture boundaries, the subpicture may be regarded as a tile in the context of viewport-dependent 360° video streaming. When streaming VR video, a subset of 360-degree video content covering the viewport (i.e., the current view orientation) may be transmitted at the best quality/resolution, while the remaining of 360-degree video may be transmitted at a lower quality/resolution. This is what characterizes a VDS system, as opposed to a Viewport Independent Streaming system, where the omnidirectional video is streamed at same quality in all directions. VDS may be used for 360-degree video to optimize bandwidth utilization or viewport resolution: higher quality content is delivered for the current viewport of the player/user, whereas the remaining background area is either not delivered at all or delivered at a lower quality. VDS may degrade user experience if the new viewport after pose change (change in viewport orientation) is not updated fast enough due to application or network delays. The time it takes for the new viewport to be upgraded to the highest viewport quality after the pose changes, is defined as the motion-to-high-quality delay. Viewport prediction algorithms may be used to determine the position of future viewports based on prior head motion traces, content, metadata, current speed, and direction, etc. Figure 1 illustrates the OMAF system architecture. The system can be situated in a video camera, or in a network server, for example. As shown in Figure 1, an omnidirectional media (A) is acquired. If the OMAF system is part of the video source, the omnidirectional media (A) is acquired from the camera means. If the OMAF system is in a network server, the omnidirectional media (A) is acquired from a video source over the network. The omnidirectional media comprises image data (Bi) and audio data (Ba), which are processed separately. In image stitching, rotation, projection and region-wise packing, the images/video of the source media are provided as input (Bi) and stitched to generate a sphere picture on a unit sphere per the global coordinate axes. The unit sphere is then rotated relative to the global coordinate axes. The amount of rotation to convert from the local coordinate axes to the global coordinate axes may be specified by the rotation angles indicated in a RotationBox. The local coordinate axes of the unit sphere are the axes of the coordinate system that has been rotated. The absence of the RotationBox indicates that the local coordinate axes are the same as the global coordinate axes. Then, the spherical picture on the rotated unit sphere is converted to a two-dimensional projected picture, for example using the equirectangular projection. When spatial packing of stereoscopic content is applied, two spherical pictures for the two views are converted to two constituent pictures, after which frame packing is applied to pack the two constituent picture on one projected picture. Rectangular region-wise packing can then be applied to obtain a packed picture from the projected picture. The packed pictures (D) are then provided for video and image encoding to result in encoded image (Ei) and/or encoded video stream (Ev). The audio of the source media is provided as input (Ba) to audio encoding that provides as an output, the encoded audio (Ea). The encoded data (Ei, Ev, Ea) are then encapsulated into file for playback (F) and delivery (i.e., streaming) (Fs). In the OMAF player 200, such as in an HMD, a file decapsulator processes the files (F’, F’s) and extracts the coded bitstreams (E’i, E’v, E’a) and parses the metadata. The audio, video and/or images are then decoded into decoded data (D’, B’a). The decoded pictures (D’) are projected onto a display according to the viewport and orientation sensed by a head/eye tracking device. Similarly, the decoded audio (B’a) is rendered through loudspeakers/headphones. A basic building block in the ISO Base Media File Format (ISOBMFF) is called a box. Each box has a header and a payload. The box header indicates the type of the box and the size of the box in terms of bytes. Box type may be identified by an unsigned 32-bit integer, interpreted as a four-character code (4CC). A box may enclose other boxes, and the ISOBMFF specifies which box types are allowed within a box of a certain type. Furthermore, the presence of some boxes may be mandatory in each file, while the presence of other boxes may be optional. Additionally, for some box types, it may be allowable to have more than one box present in a file. Thus, the ISOBMFF may be considered to specify a hierarchical structure of boxes. According to the ISOBMFF, a file includes media data and metadata that are encapsulated into boxes. In files conforming to the ISOBMFF, the media data may be provided in one or more instances of MediaDataBox (‘mdat‘) and the MovieBox (‘moov’) may be used to enclose the metadata for timed media. In some cases, for a file to be operable, both of the ‘mdat’ and ‘moov’ boxes may be required to be present. The ‘moov’ box may include one or more tracks, and each track may reside in one corresponding TrackBox (‘trak’). Each track is associated with a handler, identified by a four-character code, specifying the track type. Video, audio, and image sequence tracks can be collectively called media tracks, and they contain an elementary media stream. Other track types comprise hint tracks and timed metadata tracks. Movie fragments may be used, e.g., when recording content to ISO files, e.g., in order to avoid losing data if a recording application crashes, runs out of memory space, or some other incident occurs. Without movie fragments, data loss may occur because the file format may require that all metadata, e.g., the movie box, be written in one contiguous area of the file. Furthermore, when recording a file, there may not be a sufficient amount of memory space (e.g., random access memory RAM) to buffer a movie box for the size of the storage available, and re-computing the contents of a movie box when the movie is closed may be too slow. Moreover, movie fragments may enable simultaneous recording and playback of a file using a regular ISO file parser. Furthermore, a smaller duration of initial buffering may be required for progressive downloading, e.g., simultaneous reception and playback of a file when movie fragments are used, and the initial movie box is smaller compared to a file with the same media content but structured without movie fragments. The movie fragment feature may enable splitting the metadata that otherwise might reside in the movie box into multiple pieces. Each piece may correspond to a certain period of time of a track. In other words, the movie fragment feature may enable interleaving file metadata and media data. Consequently, the size of the movie box may be limited. In some examples, the media samples for the movie fragments may reside in an ‘mdat’ box. For the metadata of the movie fragments, however, a ‘moof’ box may be provided. The ‘moof’ box may include the information for a certain duration of playback time that would previously have been in the ‘moov’ box. The ‘moov’ box may still represent a valid movie on its own, but in addition, it may include an ‘mvex’ box indicating that movie fragments will follow in the same file. The movie fragments may extend the presentation that is associated to the ‘moov’ box in time. Within the movie fragment there may be a set of track fragments, including anywhere from zero to a plurality per track. The track fragments may in turn include anywhere from zero to a plurality of track runs, each of which is a contiguous run of samples for that track (and hence are similar to chunks). Within these structures, many fields are optional and can be defaulted. The metadata that may be included in the ‘moof’ box may be limited to a subset of the metadata that may be included in a ‘moov’ box and may be coded differently in some cases. Details regarding the boxes that can be included in a ‘moof’ box may be found from the ISOBMFF specification. A self-contained movie fragment may be defined to consist of a ‘moof’ box and an ‘mdat’ box that are consecutive in the file order and where the mdat box contains the samples of the movie fragment (for which the ‘moof’ box provides the metadata) and does not contain samples of any other movie fragment (i.e., any other ‘moof’ box). Tracks comprise samples, such as audio or video frames. For video tracks, a media sample may correspond to a coded picture or an access unit. A media track refers to samples (which may also be referred to as media samples) formatted according to a media compression format (and its encapsulation to the ISO base media file format). A hint track refers to hint samples, containing cookbook instructions for constructing packets for transmission over an indicated communication protocol. A timed metadata track may refer to samples describing referred media and/or hint samples. The 'trak' box includes in its hierarchy of boxes the SampleDescriptionBox, which gives detailed information about the coding type used, and any initialization information needed for that coding. The SampleDescriptionBox contains an entry-count and as many sample entries as the entry-count indicates. The format of sample entries is track-type specific but derive from generic classes (e.g., VisualSampleEntry, AudioSampleEntry). The type of sample entry form that is used for derivation of the track-type specific sample entry format is determined by the media handler of the track. Extractor tracks are specified in the ISOBMFF encapsulation format of HEVC and AVC (Advanced Video Coding) bitstreams (ISO/IEC 14496-15). Samples in an extractor track contain instructions to reconstruct a valid HEVC or AVC bitstream by including rewritten parameter set and slice header information and referencing to byte ranges of coded video data in other tracks. Consequently, a player only needs to follow the instructions of an extractor track to obtain a decodable bitstream from tracks containing MCTSs. A uniform resource identifier (URI) may be defined as a string of characters used to identify a name of a resource. Such identification enables interaction with representations of the resource over a network, using specific protocols. A URI is defined through a scheme specifying a concrete syntax and associated protocol for the URI. The uniform resource locator (URL) and the uniform resource name (URN) are forms of URI. A URL may be defined as a URI that identifies a web resource and specifies the means of acting upon or obtaining the representation of the resource, specifying both its primary access mechanism and network location. A URN may be defined as a URI that identifies a resource by name in a particular namespace. A URN may be used for identifying a resource without implying its location or how to access it. Hypertext Transfer Protocol (HTTP) has been widely used for the delivery of real-time multimedia content over the Internet, such as in video streaming applications. Several commercial solutions for adaptive streaming over HTTP, such as Microsoft® Smooth Streaming, Apple® HTTP Live Streaming and Adobe® HTTP Dynamic Streaming, have been launched as well as standardization projects have been carried out. Adaptive HTTP streaming (AHS) was first standardized in Release 9 of 3rd Generation Partnership Project (3GPP) packet-switched streaming (PSS) service (3GPP TS 26.234 Release 9: "Transparent end-to-end packet-switched streaming service (PSS); protocols and codecs"). MPEG took 3GPP AHS Release 9 as a starting point for the MPEG DASH standard (ISO/IEC 23009-1: "Dynamic adaptive streaming over HTTP (DASH)-Part 1: Media presentation description and segment formats," International Standard, 2nd Edition, 2014). MPEG DASH and 3GP-DASH are technically close to each other and may therefore be collectively referred to as DASH. The viewport-dependent encoding and streaming can occur as or as packed video. In such approach, 360-degree image content is packed into the same frame with an emphasis (e.g., greater spatial area) on the viewport. The packed frames are encoded into a single bitstream. The viewport video is also known as tile-based encoding and streaming. In such an approach, 360-degree content is encoded and made available in a manner that enables selective streaming of viewports from different encodings. Projected pictures are encoded as several tiles. Several versions of the content are encoded at different bitrates and/or resolutions. Coded tile sequences are made available for streaming together with metadata describing the location of the tile on the omnidirectional video. Clients select which tiles are received so that the viewport is covered at higher quality and/or resolution than the other tiles. Players can choose tracks (or Representations) to be decoded and played based on the current viewing orientation. Viewport dependency can be achieved by having at least two quality areas: foreground (content in the current viewport) and background (i.e., content outside the current viewport in 360-degree video). Instead of two categories, the quality areas may comprise a third category: a margin around the viewport. Figure 2a illustrates an example, where each rectangle 201 represents a tile of a video represented as equirectangular projection of the 360-degree video. The area 205 represents the viewport, and Figure 2b illustrates zooming of the viewport 205. The viewport 205 may cover different tiles (for example, in Figure 2, nine tiles 202 are covered) even partially. In viewport-dependent streaming, the tiles (i.e., tiles 201) that are not within the viewport, are generally encoded and transmitted at lower quality, whereas the tiles that are entirely (e.g., tile 202a) or partially (e.g., tile 202b) within the viewport 205 are encoded at higher quality. The present embodiments aim to measure the quality of 360-degree video within a viewport in both presence of a head motion in a given direction, and in the absence of head motion, wherein the data concerning the head motion may be determined by a head-mounted display or other device being used for viewing the 360-degree video. The problem that the present embodiments aim to solve, is how to determine the average viewport quality during an immersive video experience. In general, the viewport may (not) be fully viewed by the users. The viewing approach depends on the user gaze. Generally, users tend to view area inside the circle 301 shown in Figure 3. However, this assumption changes upon head motion, i.e., the gaze changes based on whether the head is stationary or in motion. In Figure 3, the arrow 310 shows the head motion to the right direction. The user gaze is content dependent. For example, when watching tennis, the user gaze is typically horizontal, and when watching an air show, the user gaze is typically vertical and horizontal. The head motion produces a shift of the user gaze, since users tend to watch the part of the viewport in the direction of the motion. Another important aspect is that the content is perceived as unfocused (or blurred) to the human eye whenever the head is turned faster than a given head speed threshold. Thus calculation of the viewport quality will be challenging in the presence of head motion. The following four possible existing approaches to measure the video quality within a viewport can be used: 1. Center tile quality 2. Average quality of all tiles within the viewport 3. Weighted average quality based on the tile areas in the viewport 4. A solution called ProbGaze In the first example, i.e., Center tile quality, shown in Figure 4, the viewport quality is the quality of the tile at the center of viewport (Tile₅). This approach has the following disadvantages: - The user gaze can be on surrounding tiles. This would not capture the real viewport quality, and other tiles may have significant weight on the overall quality. In fact, for example, if Tile₄, Tile₆, Tile₈ are at lower quality than Tile₅, this significantly affects the user experience. Therefore, considering only the quality of Tile₅ does not reflect the real user experience in this viewport; - The user gaze (attention) may be shifted to the sides of the viewport during head motion; - There may be the blurring effect during fast head motion. In the second example, i.e., average quality of all tiles within the viewport, let ^ Q_i be the quality of Tile_i; ^ N be the tile count inside the viewport; ^{^ Viewport quality} In this case, the viewport quality is the average of the tile qualities. This is shown in an example of Figure 5. This approach may have the following disadvantages: - The solution assumes that all tiles have the same effect of the overall quality; - The solution suffers from the “Shifted user gaze during head motion” effects as described above; - The solution suffers from the “Blurring during fast head motion” effect as described above. In the third example, i.e., weighted average quality based on the tile areas in the viewport, let ^ Qi be the quality of Tilei; ^ Ri be the ratio of Tilei‘s area inside the viewport; ^ N be the tile count inside the viewport; ^ Viewport quality

The viewport quality is the weighted average of tile qualities. The weight is given by the area covered in the viewport by a specific tile. Figure 6 shows an example of this. This approach may have the following disadvantages: - The approach suffers from the “Shifted user gaze during head motion” effect as described above; - The approach suffers from the “Blurring during fast head motion” effect as described above. In the fourth example, i.e., ProbGaze, the weights are distributed according to the user gaze data. At first circles are distributed in the viewport with a polynomial density function based on the gaze heatmap. Then, the viewport area is sampled with uniformly distributed points on these circles. Later, the quality is calculated with the weighted averages of the quality values at these points. Figure 7 illustrates an example for this. Let ^ n be the number of circles; ^ m be the number of points on each circle; ^ ^_^^,^ be the quality of _^^^^^^ on _^^^^^^^;

In this case the viewport quality is the weighted average of different points in different circles. This approach may have the following disadvantages: - The approach assumes that the user gaze behavior is known a priori and remains the same even with head motion at different speeds; - The approach suffers from the “Blurring during fast head motion” effect as described above. The present embodiments provide two possible ways to calculate the quality within the viewport of a 360-degree tiled video: 1) Dynamic points; 2) Dynamic area. Dynamic points: The first alternative is designed on top of the ProbGaze (Figure 7), but is extended to enable a dynamic distribution of points depending on the head motion characteristics (e.g., the direction of motion, speed). In this case, different weights are assigned to individual viewports, while calculating the overall average viewport quality. For stationary head positions, the same rules as for the ProbGaze apply, and the points are distributed according to user gaze data as in Figure 7. For example, for stationary head position, the point distribution is not shifted and distributed as in ProbGaze (Figure 8). For the slow head motion to the right, the point distribution is shifted to the right on slow motion to the right direction. The point density is shifted between the middle and right border of the viewport. The situation is illustrated in Figures 10 – 12. For a fast head motion to the right, the point distribution is shifted to the right and the point density is much closer to the right border of the viewport (Figure 9). Figure 13 illustrates an example for the stationary case: ^ Let w be the width of the viewport; ^ Let h be the height of the viewport; ^ Let r_i be the radius of the ^^^^^^_^; ^ Let the speed unit be in degrees per second (dps). Figure 14 illustrates an example with head motion: ^ Let w be the width of the viewport; ^ Let h be the height of the viewport; ^ Let r_i be the radius of the ^^^^^^_^; ^ Let the speed unit be in degrees per second (dps); ^ Let xi^be the shifting amount of a ^^^^^^_^ in the horizontal direction from the center of the viewport; ^ Let yi be the shifting amount of a ^^^^^^_^ in the vertical direction from the center of the viewport; ^ w, h, ri, xi, yi are defined in terms of degree. Steps for carrying out this embodiment may comprise the following: 1. Distribute circles and points like in ProbGaze; 2. Shift circles/points to head motion and direction with a formula of (Figures

where ^^^^^_^ is the head motion speed in horizontal direction and ^^^^^_^ is the head motion speed in vertical direction. a. When the is 0 no case). Shifts:

in Figure 8. b. When the ^^^^^_^ is equal to or faster than 180 dps (i.e., maximum speed threshold that user can turn his/her head), shift all circles such that their sides in the direction of motion

c. When the ^^^^^_^ exceeds the threshold in vertical direction;

3_. Find each point’s location in terms of (_{^, ^}) in the viewport; 4. Find each tile’s quality and boundaries in terms of (_{^^^^, ^^^ℎ^, ^^^, ^^^^^^} ^{) in the viewport;} 5. Find each point’s projection on tiles; 6. Calculate the current viewport quality using weights and tile qualities; 7. Do steps 1-6 for all viewports during an immersive video session; 8. Calculate the overall viewport quality during an immersive video session using the Overall Viewport Quality Option 1 or the Overall Viewport Quality Option 2 (see below). It is to be noticed that in the above example, the speed of 180 dps is just an example of possible values. In some cases, the speed can be 60, or 90, or something else. Dynamic area: Figure 15 illustrates an idea of this alternative approach where an area of user interest is marked within the viewport based on the gaze position, the head motion or both. In Figure 15, the head is stationary and the gaze is centered, hence the marked area is circular with the center of the circle the same as the center of the viewport. This area is defined dynamically based on the head motion characteristics (e.g., direction and speed). The viewport quality may be calculated like in the “Weighted average quality of tile areas in the viewport” (existing methods, example 3) method, where the Viewport Quality is the weighted average of tile qualities, when • ^ be the number of tiles inside the viewport; • ^_^ be the quality of ^^^^_^; • ^^^^^^^^_^^^^ be the total area of the viewport; • ^_^ be the ^^^^_^’s coverage ratio to ^^^^^^^^_^^^^; • Viewport quality= ∑ ^ ^^^ ^_^ × ^_^ However, a difference between the presented solution and the existing solution is that only the coverage ratio of each ^^^^_^ inside the drawn shape is used while calculating the viewport quality. In Figures 15 - 20, ^^^^_^ indicates the area of ^^^^_^’s part inside the drawn shape. The ratio of ^^^^_^ to the area of the drawn shape gives ^_^ for ^^^^_^.Then, the formula for the presented solution is • ^ be the number of tiles inside the drawn shape; • ^_^ be the quality of ^^^^_^; • ^^^^_^ be the area of ^^^^_^ inside the drawn shape;

• Viewport quality= ∑ ^ ^^^ ^_^ × ^_^ The stationary case (no motion) in Figure 15, the viewport quality may be calculated without reshaping the drawn shape. As an example, shown in Figure 16, of slow motion to the right, the total area 1801 is dynamically reshaped based on the slow motion to the right. The viewport quality is calculated with Areai and Tilei. For instance, Tile1 has no effect on viewport quality calculated, whereas Tile5 has a significant effect. In order to be more geometrically computable, the triangle shape 1901 in Figure 17 can be taken into account as a possible approximation. The calculation can start with a triangle shape, and then add the area corresponding to an arc built on each of the sides of the triangle, with starting and ending point on each pair of edges of the triangle (See Figure 18). As an example, shown in Figures 19 – 20 (to be compared with Figures 16 – 17), of fast motion to the right, the shape gets closer to the right border. The viewport quality is calculated with Areai and Tilei. For instance, Tile1, Tile4,^Tile7 have no effect on viewport quality calculation. Tile5, Tile6 have a significant effect. Calculation of the Overall Viewport Quality Up to this point, a calculation of the quality of a single viewport has been disclosed. During an immersive video experience, there are lots of single viewports. For instance, for a 60 seconds video with 300ms segments size, there are 200 viewports to view. In the solution, a viewport and its quality are generally snapshot at each 300ms. This operation can be done less or more frequently. Thus, there is a need to find an average viewport quality to represent the overall viewport quality of an immersive video experience session. In the following, two options are presented: Option 1: • Each viewport has a different weight with inverse ratio of its head motion speed at time t . It is appreciated that there can also be other ways to assign the weights, and the given way is an example of one; • Let ^^_^, ^^_^, ^^_^, … , ^^_^ be viewport qualities where ^ is viewport count; • Let ^_^, ^_^, ^_^, … , ^_^ be head motion speed during viewports ^^_^, ^^_^, ^^_^, … .. , ^^_^ ; • Let ^_^, ^_^, ^_^, … , ^_^ be weights of viewports while calculating the overall viewport quality such that; 1. ^^ ^_^ ≥ 180 ^ℎ^^ ^_^ = 1 2. Otherwise ^_^ = 180 − ^^^^^_^ ^ Normalize each weight by _^ ∑ ^ _^^ such that

= 1. ^^^ • The overall quality is calculated by ∑ ^ ^_^^ ^^_^ × ^_^ It is appreciated that number of weights is equal to the number of viewports (^). Option 2 • A head speed threshold (e.g.,180 dps) is determined. When the head moves above this threshold, the content is blurred (in other words, not necessarily expected to be high quality by the user, so its impact on the overall viewport quality can be less) • Let ^^_^, ^^_^, ^^_^, … , ^^_^ be viewport qualities where ^ is the number of individual viewports • Let ^_^, ^_^, ^_^, … , ^_^ be head motion speed during viewports ^^_^, ^^_^, ^^_^, … .. , ^^_^ • Divide viewport qualities into two arrays such that • ^^_^: Viewport qualities array of viewport speeds above the head speed threshold • ^^_^: Viewport qualities array of viewport speeds below the head speed threshold • ^: Length of ^^_^; • ^: Length of ^^_^; • Let ^_^ , ^_^ be weights while calculating the overall viewport quality such that • ^_^ : weight of viewports below the head speed threshold (determined via subjective tests) • ^_^: weight of viewports above the head speed threshold (determined via subjective tests) • ^_^ > ^_^ such that ^ × ^_^ + ^ × ^_^ = 1 • Overall Viewport Quality:

^^_^^ × ^_^ • It is appreciated that the number of weights is 2. The previous embodiments are clarified by means of following examples, where ^ Viewport Qualities are 1.0, 2.0, 1.0, 4.0 ^ Number of viewports is 4; ^ Head Speeds (dps): 30, 60, 120, 200; ^ Head Speed Threshold: 180 dps With the solutions ^

2.0 With the Option 1: ^ Weights o ^[^_^ = 180 − 30, ^_^ = 180 − 60, ^_^ = 180 − 120, ^_^ = 1^] o Normalized: 0.45, ^_^ = 0.36, ^_^ = 0.18, ^_^ = 0.003^] such that ^∑ ^ ^_^^

= 1 ^ ^^^^^^^ ^^^^^^^ = ^∑ ^ ^_^^ ^^_^ × ^_^ = 1.0 × 0.45 + 2.0 × 0.36 + 1.0 × 0.18 + 4.0 × 0.003 = 1.36 With the option 2: ^ ^ :1 ^ ^: 3 ^ Recall that ^ × ^_^ + ^ × ^_^ = 1 then 3 × ^_^ + 1 × ^_^ = 1 ^ Let ^_^ = 2 × ^_^ ^ Then, ^_^ = 0.286, ^_^ = 0.143

0.143^] + ^[1.0 × 0.286 + 2.0 × 0.286 + 1.0 × 0.286^] = 1.71 Use of the present embodiments with dynamic adaptive streaming The method can be used for dynamic adaptive streaming for selective high- quality region and/or low-quality region for downloading 360-degree video. According to an embodiment, an algorithm may download the tiles that overlap with the dynamic area at high quality. According to another embodiment, an algorithm may download the files that have a certain extent of overlap (e.g. if the tile is at least 80% overlapping) with the dynamic area at high quality. The tiles are not overlapping with the area may be downloaded at low-quality or not downloaded at all. For non-tiled viewport-dependent delivery, the high-quality region may be defined in the area. Figure 21 illustrates a method according to an embodiment. The method generally comprises defining 2110 a viewport wherein the viewport comprises a plurality of points and at least one tile and an area, wherein each tile has a certain quality; determining 2120 head motion characteristics; adjusting 2130 the points or the area within the viewport according to the determined head motion characteristics; and determining 2140 a quality of the viewport using tile qualities within the viewport. Each of the steps can be implemented by a respective module of a computer system. An apparatus according to an embodiment comprises means for defining a viewport wherein the viewport comprises a plurality of points and at least one tile and an area, wherein each tile has a certain quality; means for determining head motion characteristics; means for adjusting the points or the area within the viewport according to the determined head motion characteristics; and means for determining a quality of the viewport using tile qualities within the viewport. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 21 according to various embodiments. An example of an apparatus is shown in Figure 22. Several functionalities can be carried out with a single physical device, e.g., in a single processor, if desired. The apparatus 90 comprises a main processing unit 91, a memory 92, a user interface 94, a communication interface 93. The apparatus according to an embodiment, shown in Figure 22, also comprises a camera module 95. The memory 92 stores data including computer program code in the apparatus 90. The computer program code is configured to implement the method according to flowchart of Figure 21. The camera module 95 receives input data, in the form of video stream, to be processed by the processor 91. The communication interface 93 forwards processed data for example to a display of another device, such an HMD. When the apparatus 90 is a video source comprising the camera module 95, user inputs may be received from the user interface. If the apparatus 90 is a middlebox in a network, the user interface is optional, such as the camera module. The various embodiments may provide advantages. For example, the present embodiments give experienced user content’s quality instead of traditional assumption-based approaches. Also, the present embodiments reflect user experience more closely compared to the existing approaches. The present embodiments provide a generic solution based on head motion. In addition, the present embodiments are applicable to negative viewport margins, and also applicable to current tile-based streaming approaches. The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments. If desired, the different functions discussed herein may be performed in a different order and/or concurrently with other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined. Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims. It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as, defined in the appended claims.

Claims

Claims: 1. An apparatus comprising: - means for defining a viewport wherein the viewport comprises a plurality of points and at least one tile and an area, wherein each tile has a certain quality; - means for determining head motion characteristics; - means for adjusting the points or the area within the viewport according to the determined head motion characteristics; and - means for determining a quality of the viewport using tile qualities within the viewport. 2. The apparatus according to claim 1, further comprising means for distributing the points according to head motion characteristics; means for locating each point in the viewport; means for locating each tile in the viewport; means for determining each tile’s quality; means for determining each point’s projection on tiles; means for determining a quality of a current viewport using weights and tile qualities. 3. The apparatus according to claim 1, further comprising means for reshaping a viewport area according to head motion characteristics; means for determining a viewport quality based on the tile qualities within the reshaped viewport area. 4. The apparatus according to any of the claims 1 to 3, further comprising means for determining an overall viewport quality of an immersive video based on the determined qualities of several viewports. 5. The apparatus according to claim 4, further comprising means for determining the overall viewport quality by giving weights to said several viewports according to the head motion characteristics. 6. The apparatus according to claim 4, further comprising means for determining a head speed threshold, and means for determining the overall viewport quality according to the weights of viewports with respect to the determined head speed threshold. 7. The apparatus according to any of the claims 1 to 6, wherein head motion characteristics comprises direction and speed of a head. 8. A method, comprising: - defining a viewport, wherein the viewport comprises a plurality of points and at least one tile and an area, wherein each tile has a certain quality; - determining head motion characteristics; - adjusting points or an area within the viewport according to the determined head motion characteristics; and - determining a quality of the viewport using tile qualities within the viewport. 9. The method according to claim 8, further comprising distributing the points according to head motion characteristics; locating each point in the viewport; locating each tile in the viewport; determining each tile’s quality; determining each point’s projection on tiles; determining a quality of a current viewport using weights and tile qualities. 10.The method according to claim 8, further comprising reshaping a viewport area according to head motion characteristics; determining a viewport quality based on the tile qualities within the reshaped viewport area. 11.The method according to any of the claims 8 to 10, further comprising determining an overall viewport quality of an immersive video based on the determined qualities of several viewports. 12.The method according to claim 11, further comprising determining the overall viewport quality by giving weights to said several viewports according to the head motion characteristics. The method according to claim 11, further comprising determining a head speed threshold, and determining the overall viewport quality according to the weights of viewports with respect to the determined head speed threshold. The method according to any of the claims 8 to 13, wherein head motion characteristics comprises direction and speed of a head. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: - define a viewport, wherein the viewport comprises a plurality of points and at least one tile and an area, wherein each tile has a certain quality; - determine head motion characteristics; - adjust the points or the area within the viewport according to the determined head motion characteristics; and - determine a quality of the viewport using tile qualities within the viewport.