WO2020068284A1 - Virtual reality (vr) viewpoint grouping - Google Patents


Info

Publication number
WO2020068284A1
Authority
WO
WIPO (PCT)
Prior art keywords
viewpoint
group
viewpoints
viewpoint group
identifier
Application number
PCT/US2019/045978
Other languages
French (fr)
Inventor
Ye-Kui Wang
Original Assignee
Futurewei Technologies, Inc.
Application filed by Futurewei Technologies, Inc. filed Critical Futurewei Technologies, Inc.
Publication of WO2020068284A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/21805Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234345Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/239Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests
    • H04N21/2393Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests involving handling client requests
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/2662Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/4728End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/65Transmission of management data between client and server
    • H04N21/658Transmission by the client directed to the server
    • H04N21/6587Control parameters, e.g. trick play commands, viewpoint selection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Definitions

  • the present disclosure is generally related to virtual reality (VR), also referred to as omnidirectional media, immersive media, and 360 degree video, and is specifically related to mechanisms for signaling group relationships between multiple VR video viewpoints.
  • VR video, often also referred to as 360 degree video or omnidirectional video, has many applications, including gaming, training, education, sports video, online shopping, adult entertainment, and so on.
  • the disclosure includes a method comprising: processing a VR video stream, wherein the VR video stream comprises a plurality of viewpoints included in a viewpoint group, wherein each of the viewpoints in the viewpoint group corresponds to one particular omnidirectional video camera for capturing an omnidirectional video at a particular location, and wherein the VR video stream contains information indicative of the viewpoint group; presenting a first viewport of a first viewpoint in the viewpoint group to a user; switching from the first viewpoint to a second viewpoint in the viewpoint group; and presenting a second viewport of the second viewpoint in the viewpoint group to the user based on the information indicative of the viewpoint group.
  • VR video can include multiple viewpoints. A user may desire to move between viewpoints.
  • a viewpoint group identifier which can be used by a decoder to determine a viewport at a second viewpoint based on a viewport used at a first viewpoint.
  • a viewpoint group ID may be included in data structures stored in a metadata track in some systems.
  • the metadata track may only be used for dynamic viewpoints (e.g., cameras that move). Accordingly, a viewpoint group ID stored in a metadata track may not be available for use with static viewpoints.
  • the present disclosure may include the viewpoint group ID in the track for static viewpoints, which extends the switching functionality with dynamic viewpoint selection to the static viewpoints.
  • another implementation of the aspect provides, wherein the information indicative of the viewpoint group comprises a viewpoint group identifier.
  • another implementation of the aspect provides, wherein the viewpoint group identifier is stored in a track containing the first viewpoint, the second viewpoint, or combinations thereof.
  • another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group box.
  • the switching functionality for dynamic viewpoints can be extended to the static tracks by including the viewpoint group structure object in the viewpoint group box. This pulls the viewpoint group ID into an object that is used by both static and dynamic viewpoints.
  • another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, and wherein the viewpoint group structure is included in the viewpoint group box.
  • another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint position structure, and wherein the viewpoint position structure is included in the viewpoint group box.
  • another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint information structure, and wherein the viewpoint information structure is included in the viewpoint group box.
  • the disclosure includes a method implemented in a decoder, the method comprising: receiving, by a receiver of the decoder, a VR video stream comprising a viewpoint group designated by a viewpoint identifier and including a plurality of viewpoints, wherein the viewpoint identifier is included in one or more tracks including one or more of the plurality of viewpoints; decoding, by a processor of the decoder, a first viewport of a first viewpoint of the viewpoint group for display; determining, by the processor, to switch to a second viewpoint; determining, by the processor, that the second viewpoint is included in the viewpoint group with the first viewpoint based on the viewpoint identifier; determining, by the processor, a second viewport of the second viewpoint based on the first viewport and the viewpoint identifier; and decoding, by the processor, the second viewport of the second viewpoint for display.
  • the disclosure includes a method implemented in an encoder, the method comprising: receiving, by a processor of the encoder, a plurality of viewpoints associated with a VR video stream, wherein the plurality of viewpoints are part of a viewpoint group; encoding, by the processor, a viewpoint group identifier of the viewpoint group in one or more tracks including one or more of the plurality of viewpoints, wherein the viewpoint group identifier indicates a second viewport of a second viewpoint in the viewpoint group is determinable based on a first viewport of a first viewpoint in the viewpoint group during a viewpoint switch; and forwarding, by the processor, the viewpoint group identifier in the one or more tracks in a bitstream toward a decoder for display.
  • another implementation of the aspect provides, wherein the viewpoint identifier is included in tracks including static viewpoints and a metadata track associated with dynamic viewpoints.
  • another implementation of the aspect provides, wherein the viewpoint identifier is included in a viewpoint group box.
  • another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, and wherein the viewpoint group structure is included in the viewpoint group box.
  • another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint position structure, and wherein the viewpoint position structure is included in the viewpoint group box.
  • another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint information structure, and wherein the viewpoint information structure is included in the viewpoint group box.
  • the disclosure includes a video coding device comprising: a processor, a receiver coupled to the processor, a transmitter coupled to the processor, and a memory coupled to the processor, the processor, transmitter, receiver, and memory configured to perform the method of any of the preceding aspects.
  • the disclosure includes a non-transitory computer readable medium comprising a computer program product for use by a video coding device, the computer program product comprising computer executable instructions stored on the non- transitory computer readable medium such that when executed by a processor cause the video coding device to perform the method of any of the preceding aspects.
  • the disclosure includes a decoder comprising: a receiving means for receiving a virtual reality (VR) video stream comprising a viewpoint group designated by a viewpoint identifier and including a plurality of viewpoints, wherein the viewpoint identifier is included in one or more tracks including one or more of the plurality of viewpoints; a decoding means for: decoding a first viewport of a first viewpoint of the viewpoint group for display, and decoding a second viewport of a second viewpoint for display; and a determination means for: determining to switch to the second viewpoint, determining that the second viewpoint is included in the viewpoint group with the first viewpoint based on the viewpoint identifier, and determining the second viewport of the second viewpoint based on the first viewport and the viewpoint identifier.
  • the disclosure includes an encoder comprising: a receiving means for receiving a plurality of viewpoints associated with a VR video stream, wherein the plurality of viewpoints are part of a viewpoint group; an encoding means for encoding a viewpoint group identifier of the viewpoint group in one or more tracks including one or more of the plurality of viewpoints, wherein the viewpoint group identifier indicates a second viewport of a second viewpoint in the viewpoint group is determinable based on a first viewport of a first viewpoint in the viewpoint group during a viewpoint switch; and a forwarding means for forwarding the viewpoint group identifier in the one or more tracks in a bitstream toward a decoder for display.
  • any one of the foregoing embodiments may be combined with any one or more of the other foregoing embodiments to create a new embodiment within the scope of the present disclosure.
  • FIG. 1 is a schematic diagram of an example system for VR based video coding.
  • FIG. 2 is a flowchart of an example method of coding a VR picture bitstream.
  • FIG. 3 is a flowchart of an example method of coding a video signal.
  • FIG. 4 is a schematic diagram of an example coding and decoding (codec) system for video coding.
  • FIG. 5 is a schematic diagram illustrating an example video encoder.
  • FIG. 6 is a schematic diagram illustrating an example video decoder.
  • FIG. 7 is a schematic diagram illustrating an example system for capturing VR video from multiple viewpoints.
  • FIG. 8 is a schematic diagram of example viewpoints in a viewpoint group.
  • FIG. 9 is a schematic diagram of an example VR video file for multiple viewpoints.
  • FIG. 10 is an embodiment of a method of performing viewpoint switching based on viewpoint group identification.
  • FIG. 11 is an embodiment of another method of performing viewpoint switching based on viewpoint group identification.
  • FIG. 12 is an embodiment of a method of encoding viewpoint group identification at an encoder to support viewpoint switching at a decoder.
  • FIG. 13 is a schematic diagram of an example video coding device.
  • FIG. 14 is a schematic diagram of an embodiment of a system for managing viewpoint switching based on viewpoint group identification.
  • Video coding standards include International Telecommunication Union Telecommunication Standardization Sector (ITU-T) document H.261, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Motion Picture Experts Group (MPEG)-1 Part 2, ITU-T H.262 or ISO/IEC MPEG-2 Part 2, ITU-T H.263, ISO/IEC MPEG-4 Part 2, Advanced Video Coding (AVC), also known as ITU-T H.264 or ISO/IEC MPEG-4 Part 10, and High Efficiency Video Coding (HEVC), also known as ITU-T H.265 or MPEG-H Part 2.
  • AVC includes extensions such as Scalable Video Coding (SVC), Multiview Video Coding (MVC) and Multiview Video Coding plus Depth (MVC+D), and three dimensional (3D) AVC (3D-AVC).
  • HEVC includes extensions such as Scalable HEVC (SHVC), Multiview HEVC (MV-HEVC), and 3D HEVC (3D-HEVC).
  • File format standards include the ISO base media file format (ISOBMFF) (ISO/IEC 14496-12, hereinafter "ISO/IEC 14496-12") and other file format standards derived from ISOBMFF, including the MPEG-4 file format (ISO/IEC 14496-14), the 3rd Generation Partnership Project (3GPP) file format (3GPP TS 26.244), and the AVC file format (ISO/IEC 14496-15, hereinafter "ISO/IEC 14496-15").
  • ISO/IEC 14496-12 specifies the ISO base media file format.
  • Other documents extend the ISO base media file format for specific applications. For instance, ISO/IEC 14496-15 describes the carriage of Network Abstraction Layer (NAL) unit structured video in the ISO base media file format.
  • H.264/AVC and HEVC are examples of NAL unit structured video.
  • ISO/IEC 14496-15 includes sections describing the carriage of H.264/AVC NAL units. Additionally, section 8 of ISO/IEC 14496-15 describes the carriage of HEVC NAL units. Thus, section 8 of ISO/IEC 14496-15 is said to describe the HEVC file format.
  • ISOBMFF is used as the basis for many codec encapsulation formats, such as the AVC File Format, as well as for many multimedia container formats, such as the MPEG-4 File Format, the 3GPP File Format, and the digital video broadcasting (DVB) File Format.
  • files structured according to ISOBMFF may be used for many purposes, including local media file playback, progressive downloading of a remote file, segments for Dynamic Adaptive Streaming over Hyper Text Transfer Protocol (HTTP) (DASH), containers for content to be streamed and corresponding packetization instructions, and recording of received real-time media streams.
  • Movie fragments defined in ISOBMFF can be used. In addition to continuous media, such as audio and video, static media, such as images, as well as metadata, can be stored in a file conforming to ISOBMFF.
  • Such file formats and streaming mechanisms can be employed to encode, signal, decode, and display a VR video.
  • a VR video can be recorded from multiple viewpoints.
  • a viewpoint is the position of a camera used to capture video.
  • multiple cameras can be positioned at multiple locations to record a scene, an event, etc.
  • such cameras may include a camera array and/or fisheye camera(s) capable of capturing wide angle video.
  • a VR camera mechanism can capture a sphere of video, or sub-portions thereof. Only a portion of the sphere may be displayed to a user. Such portion is displayed to the user on a viewport. Upon viewing, a user can control a viewing orientation from the viewpoint.
  • a VR video can be taken of a basketball game from multiple viewpoints on, around, and/or above the court.
  • a user may be allowed to view the game from a viewpoint of choice and at an orientation/angle of choice from the selected viewpoint.
  • a default viewing orientation/angle can be employed for each viewpoint. Accordingly, when a user switches to a viewpoint, the decoder can employ the default angle to orient the user until the user can select the desired viewing orientation.
  • This implementation has certain drawbacks. For example, a user may wish to pay attention to a particular object in a scene, such as a basketball or a particular player in a basketball game.
  • the user’s viewing angle is reset to the default value each time the user switches between viewpoints. Accordingly, a user viewing a basketball at a first viewpoint would be reoriented to a default angle upon switching to a second viewpoint. This would likely result in losing sight of the basketball. The user would then likely have to search for the current location of the basketball from the new viewpoint.
  • the result is that default viewing orientations may create discontinuities in a user’s viewing experience and create a poor viewing experience in some cases.
  • video data related to the viewpoints may be included in tracks of a video file.
  • a timed metadata track that contains data relevant to dynamic viewpoints can also be included in the video file.
  • a dynamic viewpoint is a viewpoint that moves over the course of the video
  • a static viewpoint is a viewpoint that remains in a constant position over the course of the video.
  • Correspondences between the viewpoints may be included in the timed metadata track.
  • Such information can indicate correspondences between viewpoints that are part of the same viewpoint group. Specifically, such information can denote that a spatial region at a first viewpoint orients toward the same location as a corresponding spatial region at a second viewpoint.
  • a user can switch between dynamic viewpoints.
  • the decoder can automatically orient the user toward a spatial region at the destination viewpoint that corresponds to the orientation selected by the user at the initial viewpoint.
  • a user watching a basketball at a first viewpoint can be automatically oriented toward the basketball upon switching to the second viewpoint. This allows the decoder to provide a consistent view to a user upon switching between viewpoints.
  • housing such data in the timed metadata track may limit such functionality to dynamic viewpoints.
  • the viewpoint group membership can be signaled by employing a viewpoint group identifier (ID), which is included in the track for each static (and/or dynamic) viewpoint.
  • the decoder can review the data from the track for the source viewpoint and from the track for the destination viewpoint for a viewpoint group ID. If the source viewpoint and the destination viewpoint have the same viewpoint group ID, correspondences between the viewpoints can be used to compute a destination viewport at the destination viewpoint based on the source viewport at the source viewpoint.
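  • As an illustration of the group-ID check just described, the following Python sketch shows how a player might decide whether to carry a viewing orientation across a viewpoint switch. The Viewpoint class, its field names, and the simple azimuth-offset correspondence are hypothetical; an actual player would derive the destination orientation from the signaled correspondence data rather than a single offset.

      from dataclasses import dataclass
      from typing import Optional

      @dataclass
      class Viewpoint:
          viewpoint_id: int
          group_id: Optional[int]      # viewpoint group ID parsed from the track (None if absent)
          yaw_offset_deg: float = 0.0  # hypothetical per-viewpoint azimuth correspondence

      DEFAULT_ORIENTATION = (0.0, 0.0)  # (azimuth, elevation) fallback when no group relation exists

      def viewport_after_switch(src, dst, src_orientation):
          """Return the (azimuth, elevation) to present at dst after switching from src."""
          same_group = (src.group_id is not None and src.group_id == dst.group_id)
          if not same_group:
              # No common viewpoint group ID: fall back to the default viewing orientation.
              return DEFAULT_ORIENTATION
          az, el = src_orientation
          # Same group: reuse the source orientation, adjusted by the (assumed) correspondence.
          return ((az + dst.yaw_offset_deg - src.yaw_offset_deg) % 360.0, el)

      # Example: a user looking toward azimuth 320 keeps tracking the same scene point.
      court_a = Viewpoint(viewpoint_id=1, group_id=7, yaw_offset_deg=0.0)
      court_b = Viewpoint(viewpoint_id=2, group_id=7, yaw_offset_deg=180.0)
      print(viewport_after_switch(court_a, court_b, (320.0, -5.0)))  # -> (140.0, -5.0)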
  • ISOBMFF employs various boxes to include video data in a standardized manner.
  • the viewpoint group IDs for viewpoints can be stored in the corresponding tracks by placing the viewpoint group ID in a viewpoint group box.
  • the viewpoint group ID is included in a viewpoint group structure, which is further included in the viewpoint group box.
  • the viewpoint group ID is included in a viewpoint group structure, which is further included in a viewpoint position structure, which is further included in the viewpoint group box.
  • the viewpoint group ID is included in a viewpoint group structure, which is further included in a viewpoint information structure, which is further included in the viewpoint group box.
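  • The nesting options above can be pictured with a small Python sketch that serializes a hypothetical viewpoint group box containing a nested viewpoint group structure. The box and structure names mirror the description, but the four-character code and exact field layout are illustrative assumptions, not the normative ISOBMFF/OMAF syntax.

      import struct

      def full_box(box_type: bytes, version: int, flags: int, payload: bytes) -> bytes:
          """Wrap a payload in an ISOBMFF-style full box: size, type, version, flags."""
          body = struct.pack(">B3s", version, flags.to_bytes(3, "big")) + payload
          return struct.pack(">I4s", 8 + len(body), box_type) + body

      def viewpoint_group_struct(group_id: int, description: str) -> bytes:
          # Hypothetical ViewpointGroupStruct: an 8-bit group ID plus a null-terminated label.
          return struct.pack(">B", group_id) + description.encode("utf-8") + b"\x00"

      def viewpoint_group_box(viewpoint_id: int, group_id: int, description: str) -> bytes:
          # Hypothetical 'vwpg' box carried in a static viewpoint's track, holding the
          # viewpoint ID and the nested viewpoint group structure (illustrative layout only).
          payload = struct.pack(">I", viewpoint_id) + viewpoint_group_struct(group_id, description)
          return full_box(b"vwpg", version=0, flags=0, payload=payload)

      box = viewpoint_group_box(viewpoint_id=2, group_id=7, description="courtside cameras")
      print(len(box), box[:8])  # total size and the (size, type) header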
  • FIG. 1 is a schematic diagram of an example system 100 for VR based video coding.
  • System 100 includes a multi-directional camera 101, a VR coding device 104 including an encoder 103, a decoder 107, and a rendering device 109.
  • the multi-directional camera 101 comprises an array of camera devices. Each camera device is pointed at a different angle so that the multi-directional camera 101 can take multiple directional video streams of the surrounding environment from a plurality of angles.
  • multi-directional camera 101 can take video of the environment as a sphere with the multi-directional camera 101 at the center of the sphere.
  • sphere and spherical video refers to both a geometrical sphere and sub-portions of a geometrical sphere, such as spherical caps, spherical domes, spherical segments, etc.
  • a multi-directional camera 101 may take one hundred and eighty degree video to cover half of the environment so that a production crew can remain behind the multi-directional camera 101.
  • a multi-directional camera 101 can also take video in three hundred sixty degrees (or any sub-portion thereof). However, a portion of the floor under the multi-directional camera 101 may be omitted, which results in video of less than a perfect sphere. Hence, the term sphere, as used herein, is a general term used for clarity of discussion and should not be considered limiting from a geometrical standpoint. It should be noted that in some examples a multi-directional camera 101 may include a camera that includes one or more fisheye lenses (e.g., instead of an array of cameras).
  • Video from the multi-directional camera 101 is forwarded to the VR coding device 104.
  • a VR coding device 104 may be a computing system including specialized VR coding software.
  • the VR coding device 104 may include an encoder 103 (a.k.a., a video encoder).
  • the encoder 103 can also be included in a separate computer system from the VR coding device 104.
  • the VR coding device 104 is configured to convert the multiple directional video streams into a single multiple directional video stream including the entire recorded area from all relevant angles. This conversion may be referred to as image stitching. For example, frames from each video stream that are captured at the same time can be stitched together to create a single spherical image. A spherical video stream can then be created from the spherical images.
  • image stitching For clarity of discussion, it should be noted that the terms frame, picture, and image may be used interchangeably herein unless specifically noted.
  • the spherical video stream can then be forwarded to the encoder 103 for compression.
  • An encoder 103 is a device and/or program capable of converting information from one format to another for purposes of standardization, speed, and/or compression.
  • Standardized encoders 103 are configured to encode rectangular and/or square images. Accordingly, the encoder 103 is configured to map each spherical image from the spherical video stream into a plurality of rectangular sub-pictures. The sub-pictures can then be placed in separate sub-picture video streams. As such, each sub-picture video stream displays a stream of images over time as recorded from a sub-portion of the spherical video stream.
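  • One common way to realize this mapping is an equirectangular projection followed by a tile grid. The sketch below assumes that projection and a fixed 4x2 grid, and shows how one projected spherical frame could be split into rectangular sub-pictures for independent encoding; the frame size and grid are illustrative assumptions.

      import numpy as np

      def equirectangular_subpictures(sphere_frame: np.ndarray, cols: int = 4, rows: int = 2):
          """Split an equirectangular frame (H x W x 3) into a rows x cols grid of sub-pictures.

          The frame is assumed to already be the projected (rectangular) form of the
          spherical image; each tile then covers 360/cols degrees of azimuth and
          180/rows degrees of elevation.
          """
          h, w, _ = sphere_frame.shape
          tile_h, tile_w = h // rows, w // cols
          tiles = {}
          for r in range(rows):
              for c in range(cols):
                  tiles[(r, c)] = sphere_frame[r * tile_h:(r + 1) * tile_h,
                                               c * tile_w:(c + 1) * tile_w]
          return tiles

      frame = np.zeros((1024, 2048, 3), dtype=np.uint8)  # one projected spherical frame
      subpics = equirectangular_subpictures(frame)
      print(len(subpics), subpics[(0, 0)].shape)  # 8 tiles of 512 x 512 pixels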
  • the encoder 103 can then encode each sub-picture video stream to compress the video stream to a manageable file size.
  • the encoding process is discussed in more detail below.
  • the encoder 103 partitions each frame from each sub-picture video stream into pixel blocks, compresses the pixel blocks by inter-prediction and/or intra-prediction to create coding blocks including prediction blocks and residual blocks, applies transforms to the residual blocks for further compression, and applies various filters to the blocks.
  • the compressed blocks as well as corresponding syntax are stored in bitstream(s), for example in ISOBMFF and/or in omnidirectional media format (OMAF).
  • OMAF specifies the omnidirectional media format for coding, storage, delivery, and rendering of omnidirectional media, including video, images, audio, and timed text.
  • OMAF may also be known as ISO/IEC International Standard 23090 Part 2 and/or MPEG N17827.
  • In an OMAF player, the user's viewing perspective is from the center of the sphere looking outward towards the inside surface of the sphere.
  • OMAF supports both projected omnidirectional video and fisheye omnidirectional video.
  • the VR coding device 104 may store the encoded bitstream(s) in memory, locally, and/or on a server, for communication to a decoder 107 on demand.
  • the data can be forwarded via a network 105, which may include the Internet, a mobile telecommunications network (e.g., a long term evolution (LTE) based data network), or other data communication data system.
  • the decoder 107 (a.k.a., a video decoder) is a device at a user’s location that is configured to reverse the coding process to reconstruct the sub-picture video streams from the encoded bitstream(s).
  • the decoder 107 also merges the sub-picture video streams to reconstruct the spherical video stream.
  • the spherical video stream, or sub-portions thereof, can then be forwarded to the rendering device 109.
  • the rendering device 109 is a device configured to display the spherical video stream to the user.
  • the rendering device 109 may include a head-mounted display (HMD) that attaches to the user’s head and covers the user’s eyes.
  • the rendering device 109 may include a screen for each eye, cameras, motion sensors, speakers, etc. and may communicate with the decoder 107 via wireless and/or wired connections.
  • the rendering device 109 may display a sub-portion of the spherical video stream to the user.
  • the sub-portion shown is based on a field of view (FOV) and/or viewport of the rendering device 109.
  • a FOV is the observable area of the recorded environment that is displayed to a user by the rendering device 109.
  • the FOV can be described as a conical projection between a user’s eye and extending into the virtual environment.
  • a viewport is a two dimensional plane upon which a three dimensional environment is projected.
  • a viewport describes the area of a portion of the virtual environment displayed on a screen or screens of a rendering device
  • a FOV describes the portion of the virtual environment seen by the user.
  • viewport and FOV may be used interchangeably in many cases, but may include different technical details.
  • a FOV can be described in terms of pixels, coordinates, and/or bounds while a viewport can be described in terms of angles.
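  • To make the distinction concrete, the sketch below converts a viewport given in angles into approximate pixel bounds (a FOV region) on an equirectangular picture. The picture dimensions and the simple linear angle-to-pixel mapping are assumptions for illustration only.

      def viewport_to_fov_pixels(center_az, center_el, h_fov, v_fov, width=3840, height=1920):
          """Map a viewport described by angles (degrees) to pixel bounds on an
          equirectangular picture of size width x height (simple linear mapping)."""
          def az_to_x(az):
              return int(((az % 360.0) / 360.0) * width)
          def el_to_y(el):
              return int(((90.0 - el) / 180.0) * height)  # +90 degrees (up) maps to row 0
          left = az_to_x(center_az - h_fov / 2.0)
          right = az_to_x(center_az + h_fov / 2.0)
          top = el_to_y(center_el + v_fov / 2.0)
          bottom = el_to_y(center_el - v_fov / 2.0)
          return left, right, top, bottom

      # A 90 x 60 degree viewport looking slightly below the horizon, away from the seam.
      print(viewport_to_fov_pixels(center_az=180.0, center_el=-10.0, h_fov=90.0, v_fov=60.0))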
  • the rendering device 109 may change the position of the FOV/viewport based on user head movement by employing the motion tracking sensors. This allows the user to see different portions of the spherical video stream depending on head movement.
  • the rendering device 109 may offset the FOV for each eye based on the user’s interpupillary distance (IPD) to create the impression of a three dimensional space.
  • the rendering device 109 may be a computer screen or television screen that changes a FOV/viewport based on user input.
  • FIG. 2 is a flowchart of an example method 200 of coding a VR picture bitstream as a plurality of sub-picture bitstreams, for example by employing the components of system 100.
  • At step 201, multiple directional video streams are captured by a multi-directional camera set, such as multi-directional camera 101.
  • the multiple directional video streams include views of an environment at various angles.
  • the multiple directional video streams may capture video from three hundred sixty degrees, one hundred eighty degrees, two hundred forty degrees, etc. around the camera in the horizontal plane.
  • the multiple directional video streams may also capture video from three hundred sixty degrees, one hundred eighty degrees, two hundred forty degrees, etc. around the camera in the vertical plane.
  • the result is to create video that includes information sufficient to cover a spherical area around the camera over some period of time.
  • each directional video stream includes a series of images taken at a corresponding angle.
  • the multiple directional video streams are synchronized by ensuring frames from each directional video stream that were captured at the same time domain position are processed together.
  • the frames from the directional video streams can then be stitched together in the space domain to create a spherical video stream.
  • each frame of the spherical video stream contains data taken from the frames of all the directional video streams that occur at a common temporal position.
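  • A minimal sketch of the time-domain synchronization step: frames from every directional stream are grouped by capture timestamp before stitching, so each spherical frame is built only from frames taken at the same instant. The data layout (lists of (timestamp, frame) pairs) is an assumption for illustration.

      from collections import defaultdict

      def group_frames_by_time(directional_streams):
          """directional_streams: list of lists of (timestamp, frame) pairs, one list per camera.
          Returns {timestamp: [frame_cam0, frame_cam1, ...]} keeping only timestamps for
          which every camera contributed a frame (those can be stitched into one sphere)."""
          buckets = defaultdict(dict)
          for cam_index, stream in enumerate(directional_streams):
              for timestamp, frame in stream:
                  buckets[timestamp][cam_index] = frame
          num_cams = len(directional_streams)
          return {t: [cams[i] for i in range(num_cams)]
                  for t, cams in buckets.items() if len(cams) == num_cams}

      streams = [[(0, "cam0_f0"), (33, "cam0_f1")],
                 [(0, "cam1_f0"), (33, "cam1_f1")]]
      print(group_frames_by_time(streams))  # frames at t=0 and t=33 are ready for stitching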
  • a fisheye lens may capture a single video stream at a wide angle.
  • a single multi directional stream may be captured at step 201, which may allow step 203 to be omitted in some cases.
  • the spherical video stream is mapped into rectangular sub-picture video streams.
  • This process may also be referred to as projecting the spherical video stream into rectangular sub-picture video streams.
  • encoders and decoders are generally designed to encode rectangular and/or square frames. Accordingly, mapping the spherical video stream into rectangular sub-picture video streams creates video streams that can be encoded and decoded by non-VR specific encoders and decoders, respectively. It should be noted that steps 203 and 205 are specific to VR video processing, and hence may be performed by specialized VR hardware, software, or combinations thereof.
  • the rectangular sub-picture video streams can be forwarded to an encoder, such as encoder 103.
  • the encoder then encodes the sub-picture video streams as sub picture bitstreams in a corresponding media file format.
  • each sub-picture video stream can be treated by the encoder as a video signal.
  • the encoder can encode each frame of each sub-picture video stream via inter-prediction, intra-prediction, etc. Such encoding and corresponding decoding as well as encoders and decoders are discussed in detail with respect to the FIGS below.
  • the sub-picture video streams can be stored in ISOBMFF. For example, the sub-picture video streams are captured at a specified resolution.
  • the sub-picture video streams can then be downsampled to various lower resolutions for encoding.
  • Each resolution can be referred to as a representation.
  • Lower quality representations lose image clarity while reducing file size. Accordingly, lower quality representations can be transmitted to a user using fewer network resources (e.g., time, bandwidth, etc.) than higher quality representations with an attendant loss of visual quality.
  • Each representation can be stored in a corresponding set of tracks. Hence, tracks can be sent to a user, where the tracks include the sub-picture bitstreams at various resolutions (e.g., visual quality).
  • a track is a container for a complete set of media related data for a specific playable media sequence.
  • a video track contains a complete set of video information for a playable video sequence at a corresponding resolution.
  • an audio track contains a complete set of audio information for a playable audio sequence at a corresponding sound quality.
  • a metadata track contains a complete set of metadata information usable to interpret media data in one or more specified tracks.
  • the sub-picture bitstreams can be sent to the decoder as tracks.
  • all sub-picture bitstreams are transmitted at the same quality by transmitting tracks from the same representation.
  • the tracks containing sub-picture bitstreams with data in the user’s FOV may be sent at higher resolutions by selecting higher quality representations.
  • Tracks containing sub-picture bitstreams with areas outside the user’s FOV can be sent at progressively lower resolutions by selecting lower quality representations. This may be referred to as viewport dependent coding.
  • the tracks may include relatively short video segments (e.g., about three seconds), and hence the representations selected for particular areas of the video can change over time based on changes in FOV. This allows quality to change as the user’s FOV changes.
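  • The sketch below illustrates viewport-dependent selection in miniature: for each sub-picture track, a higher-quality representation is requested when the tile overlaps the current FOV and a lower-quality one otherwise. The tile geometry and representation labels are assumptions; a real client would work from the streaming manifest.

      def overlaps(tile_bounds, fov_bounds):
          """Axis-aligned overlap test on (left, right, top, bottom) pixel bounds."""
          l1, r1, t1, b1 = tile_bounds
          l2, r2, t2, b2 = fov_bounds
          return not (r1 <= l2 or r2 <= l1 or b1 <= t2 or b2 <= t1)

      def pick_representations(tiles, fov_bounds, high="1080p", low="360p"):
          """tiles: {track_id: (left, right, top, bottom)} in projected-picture pixels.
          Returns the representation to request for each sub-picture track."""
          return {track_id: (high if overlaps(bounds, fov_bounds) else low)
                  for track_id, bounds in tiles.items()}

      tiles = {"tile_0": (0, 960, 0, 960), "tile_1": (960, 1920, 0, 960)}
      print(pick_representations(tiles, fov_bounds=(400, 1200, 100, 800)))
      # -> {'tile_0': '1080p', 'tile_1': '1080p'} since both tiles intersect the FOV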
  • a decoder such as decoder 107, receives the tracks containing the sub picture bitstreams.
  • the decoder can then decode the sub-picture bitstreams into sub-picture video streams for display.
  • the decoding process involves the reverse of the encoding process (e.g., using inter-prediction and intra-prediction), and is discussed in more detail with respect to the FIGS below.
  • the decoder can merge the sub-picture video streams into the spherical video stream for presentation on a rendering device.
  • the decoder can employ a so-called lightweight merging algorithm that selects frames from each sub-picture video stream that occur at the same presentation time and merges them together based on the position and/or angle associated with the corresponding sub-picture video stream.
  • the decoder may also employ filters to smooth edges between the sub-picture video streams, remove artifacts, etc.
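  • A minimal sketch of the lightweight merge: decoded sub-picture frames that share a presentation time are pasted back into a full projected frame at the grid position associated with their track, after which smoothing filters would be applied. The tile grid and frame sizes are assumptions carried over from the earlier tiling sketch.

      import numpy as np

      def merge_subpictures(decoded_tiles, rows=2, cols=4, tile_h=512, tile_w=512):
          """decoded_tiles: {(row, col): H x W x 3 array} for one presentation time.
          Returns the reassembled projected frame; edge-smoothing filters would follow."""
          frame = np.zeros((rows * tile_h, cols * tile_w, 3), dtype=np.uint8)
          for (r, c), tile in decoded_tiles.items():
              frame[r * tile_h:(r + 1) * tile_h, c * tile_w:(c + 1) * tile_w] = tile
          return frame

      tiles = {(r, c): np.full((512, 512, 3), 10 * (4 * r + c), dtype=np.uint8)
               for r in range(2) for c in range(4)}
      print(merge_subpictures(tiles).shape)  # (1024, 2048, 3)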
  • the decoder can then forward the spherical video stream to a rendering device, such as rendering device 109.
  • the rendering device renders a viewport of the spherical video stream for presentation to the user.
  • areas of the spherical video stream outside of the FOV at each point in time are not rendered.
  • the user can select and view a sub portion of the virtual environment as recorded, and hence can experience the virtual environment as if present at the time of recording.
  • FIG. 3 is a flowchart of an example method 300 of coding a video signal.
  • method 300 may receive a plurality of sub-picture video streams from step 205 of method 200.
  • Method 300 treats each sub-picture video stream as a video signal input.
  • Method 300 applies steps 301-317 to each sub-picture video stream in order to implement steps 207-211 of method 200.
  • the output video signal from method 300 includes the decoded sub-picture video streams, which can be merged and displayed according to steps 213 and 215 of method 200.
  • method 300 can be implemented on a system 100.
  • Method 300 encodes a video signal, for example including sub-picture video streams, at an encoder.
  • the encoding process compresses the video signal by employing various mechanisms to reduce the video file size. A smaller file size allows the compressed video file to be transmitted toward a user, while reducing associated bandwidth overhead.
  • the decoder then decodes the compressed video file to reconstruct the original video signal for display to an end user.
  • the decoding process generally mirrors the encoding process to allow the decoder to consistently reconstruct the video signal.
  • the video signal is input into the encoder.
  • the video signal may be an uncompressed video file stored in memory.
  • the video file may be captured by a video capture device, such as a video camera, and encoded to support live streaming of the video.
  • the video file may include both an audio component and a video component.
  • the video component contains a series of image frames that, when viewed in a sequence, gives the visual impression of motion.
  • the frames contain pixels that are expressed in terms of light, referred to herein as luma components (or luma samples), and color, which is referred to as chroma components (or color samples).
  • a frame may also be referred to as a picture, a sub-frame as a sub-picture, etc.
  • the video signal is partitioned into blocks.
  • Partitioning includes subdividing the pixels in each frame into square and/or rectangular blocks for compression.
  • the frame can first be divided into coding tree units (CTUs), which are blocks of a predefined size (e.g., sixty four pixels by sixty four pixels).
  • Coding trees may be employed to divide the CTUs into blocks and then recursively subdivide the blocks until configurations are achieved that support further encoding.
  • luma components of a frame may be subdivided until the individual blocks contain relatively homogenous lighting values.
  • chroma components of a frame may be subdivided until the individual blocks contain relatively homogenous color values. Accordingly, partitioning mechanisms vary depending on the content of the video frames.
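  • As a rough illustration of this content-dependent subdivision, the sketch below quad-splits a luma block until its sample variance is small or a maximum depth is reached. The variance threshold is an illustrative stand-in for the rate-distortion decisions a real encoder would make.

      import numpy as np

      def split_until_homogeneous(block, top, left, max_depth=4, var_threshold=100.0, depth=0):
          """Recursively quad-split a square luma block until it is relatively homogeneous
          or the maximum depth is reached. Returns (top, left, size) leaf blocks."""
          h, w = block.shape
          if depth == max_depth or float(np.var(block)) <= var_threshold or h <= 8:
              return [(top, left, h)]
          half = h // 2
          leaves = []
          for dy in (0, half):
              for dx in (0, half):
                  leaves += split_until_homogeneous(block[dy:dy + half, dx:dx + half],
                                                    top + dy, left + dx,
                                                    max_depth, var_threshold, depth + 1)
          return leaves

      ctu = np.random.default_rng(0).integers(0, 255, size=(64, 64)).astype(np.float64)
      print(len(split_until_homogeneous(ctu, top=0, left=0)))  # number of leaf blocks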
  • inter-prediction and/or intra-prediction may be employed.
  • Inter-prediction is designed to take advantage of the fact that objects in a common scene tend to appear in successive frames. Accordingly, a block depicting an object in a reference frame need not be repeatedly described in adjacent frames. Specifically, an object, such as a table, may remain in a constant position over multiple frames. Hence the table is described once and adjacent frames can refer back to the reference frame.
  • Pattern matching mechanisms may be employed to match objects over multiple frames. Further, moving objects may be represented across multiple frames, for example due to object movement or camera movement. As a particular example, a video may show an automobile that moves across the screen over multiple frames.
  • Motion vectors can be employed to describe such movement, or lack thereof.
  • a motion vector is a two-dimensional vector that provides an offset from the coordinates of an object in a frame to the coordinates of the object in a reference frame.
  • inter-prediction can encode an image block in a current frame as a set of motion vectors indicating an offset from a corresponding block in a reference frame.
  • Intra-prediction encodes blocks in a common frame. Intra-prediction takes advantage of the fact that luma and chroma components tend to cluster in a frame. For example, a patch of green in a portion of a tree tends to be positioned adjacent to similar patches of green. Intra-prediction employs multiple directional prediction modes (e.g., thirty three in HEVC), a planar mode, and a direct current (DC) mode. The directional modes indicate that a current block is similar/the same as samples of a neighbor block in a corresponding direction. Planar mode indicates that a series of blocks along a row/column (e.g., a plane) can be interpolated based on neighbor blocks at the edges of the row.
  • Planar mode in effect, indicates a smooth transition of light/color across a row/column by employing a relatively constant slope in changing values.
  • DC mode is employed for boundary smoothing and indicates that a block is similar/the same as an average value associated with samples of all the neighbor blocks associated with the angular directions of the directional prediction modes.
  • intra-prediction blocks can represent image blocks as various relational prediction mode values instead of the actual values.
  • inter-prediction blocks can represent image blocks as motion vector values instead of the actual values. In either case, the prediction blocks may not exactly represent the image blocks in some cases. Any differences are stored in residual blocks. Transforms may be applied to the residual blocks to further compress the file.
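  • The sketch below shows the prediction/residual split in miniature using the DC intra mode: the prediction is the average of the reconstructed left and top neighbor samples, and only the residual (block minus prediction) would be transformed and coded. The tiny block size and neighbor availability are simplified assumptions.

      import numpy as np

      def dc_intra_prediction(block, left_col, top_row):
          """Predict a block with the DC mode (mean of the left and top neighbor samples)
          and return (prediction, residual); the residual is what gets transformed/coded."""
          dc = float(np.mean(np.concatenate([left_col, top_row])))
          prediction = np.full(block.shape, dc)
          residual = block - prediction
          return prediction, residual

      block = np.array([[52.0, 55.0], [54.0, 57.0]])
      left = np.array([50.0, 51.0])    # reconstructed column to the left of the block
      top = np.array([53.0, 56.0])     # reconstructed row above the block
      pred, res = dc_intra_prediction(block, left, top)
      print(pred[0, 0], res)  # DC value 52.5 and the small residual to be coded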
  • various filtering techniques may be applied.
  • the filters are applied according to an in-loop filtering scheme.
  • the block based prediction discussed above may result in the creation of blocky images at the decoder. Further, the block based prediction scheme may encode a block and then reconstruct the encoded block for later use as a reference block.
  • the in-loop filtering scheme iteratively applies noise suppression filters, de-blocking filters, adaptive loop filters, and sample adaptive offset (SAO) filters to the blocks/frames.
  • The bitstream includes the data discussed above as well as any signaling data (e.g., syntax) desired to support proper video signal reconstruction at the decoder.
  • Such data may include partition data, prediction data, residual blocks, and various flags providing coding instructions to the decoder.
  • the bitstream may be stored in memory for transmission toward a decoder upon request, for example as a track and/or track fragment in ISOBMFF.
  • the bitstream may also be broadcast and/or multicast toward a plurality of decoders.
  • the creation of the bitstream is an iterative process. Accordingly, steps 301, 303, 305, 307, and 309 may occur continuously and/or simultaneously over many frames and blocks. The order shown is presented for clarity and ease of discussion, and is not intended to limit the video coding process to a particular order.
  • the decoder receives the bitstream and begins the decoding process at step 311.
  • the decoder can employ an entropy decoding scheme to convert the bitstream into corresponding syntax and video data.
  • the decoder employs the syntax data from the bitstream to determine the partitions for the frames at step 311. The partitioning should match the results of block partitioning at step 303.
  • Entropy encoding/decoding, which may be employed in step 311, is now described.
  • the encoder makes many choices during the compression process, such as selecting block partitioning schemes from several possible choices based on the spatial positioning of values in the input image(s). Signaling the exact choices may employ a large number of bins.
  • a bin is a binary value that is treated as a variable (e.g., a bit value that may vary depending on context).
  • Entropy coding allows the encoder to discard any options that are clearly not viable for a particular case, leaving a set of allowable options.
  • Each allowable option is then assigned a code word. The length of the code word is based on the number of allowable options (e.g., one bin for two options, two bins for three to four options, etc.).
  • the encoder then encodes the code word for the selected option. This scheme reduces the size of the code words as the code words are as big as desired to uniquely indicate a selection from a small sub-set of allowable options as opposed to uniquely indicating the selection from a potentially large set of all possible options.
  • the decoder then decodes the selection by determining the set of allowable options in a similar manner to the encoder. By determining the set of allowable options, the decoder can read the code word and determine the selection made by the encoder.
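  • A small sketch of the idea: the number of bins needed for a code word grows only with the number of allowable options, not with the full option set, so narrowing the allowable set (which the decoder can reproduce on its own) shrinks the signaling. The option counts are illustrative.

      import math

      def bins_needed(num_allowable_options: int) -> int:
          """Minimum number of bins needed to index one of num_allowable_options choices."""
          if num_allowable_options <= 1:
              return 0                  # a forced choice needs no signaling at all
          return math.ceil(math.log2(num_allowable_options))

      all_modes = 35   # e.g., every mode a codec defines for some decision
      allowable = 3    # modes still viable given the context the decoder also knows
      print(bins_needed(all_modes), bins_needed(allowable))  # 6 bins vs 2 bins per selection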
  • the decoder performs block decoding. Specifically, the decoder employs reverse transforms to generate residual blocks. Then the decoder employs the residual blocks and corresponding prediction blocks to reconstruct the image blocks according to the partitioning.
  • the prediction blocks may include both intra-prediction blocks and inter prediction blocks as generated at the encoder at step 305.
  • the reconstructed image blocks are then positioned into frames of a reconstructed video signal according to the partitioning data determined at step 311. Syntax for step 313 may also be signaled in the bitstream via entropy coding as discussed above.
  • At step 315, filtering is performed on the frames of the reconstructed video signal in a manner similar to step 307 at the encoder. For example, noise suppression filters, de-blocking filters, adaptive loop filters, and SAO filters may be applied to the frames to remove blocking artifacts.
  • the video signal can be forwarded for merging at step 317 and then output to a display, such as an HMD, for viewing by an end user.
  • FIG. 4 is a schematic diagram of an example coding and decoding (codec) system 400 for video coding.
  • codec system 400 provides functionality to support encoding and decoding sub-picture video streams according to methods 200 and 300. Further, codec system 400 can be employed to implement an encoder 103 and/or a decoder 107 of system 100.
  • Codec system 400 is generalized to depict components employed in both an encoder and a decoder.
  • Codec system 400 receives and partitions frames from a video signal (e.g., including a sub-picture video stream) as discussed with respect to steps 301 and 303 in operating method 300, which results in a partitioned video signal 401.
  • Codec system 400 then compresses the partitioned video signal 401 into a coded bitstream when acting as an encoder as discussed with respect to steps 305, 307, and 309 in method 300.
  • codec system 400 When acting as a decoder, codec system 400 generates an output video signal from the bitstream as discussed with respect to steps 311, 313, 315, and 317 in operating method 300.
  • the codec system 400 includes a general coder control component 411, a transform scaling and quantization component 413, an intra-picture estimation component 415, an intra-picture prediction component 417, a motion compensation component 419, a motion estimation component 421, a scaling and inverse transform component 429, a filter control analysis component 427, an in-loop filters component 425, a decoded picture buffer component 423, and a header formatting and context adaptive binary arithmetic coding (CABAC) component 431.
  • Such components are coupled as shown.
  • black lines indicate movement of data to be encoded/decoded while dashed lines indicate movement of control data that controls the operation of other components.
  • the components of codec system 400 may all be present in the encoder.
  • the decoder may include a subset of the components of codec system 400.
  • the decoder may include the intra picture prediction component 417, the motion compensation component 419, the scaling and inverse transform component 429, the in-loop filters component 425, and the decoded picture buffer component 423. These components are now described.
  • the partitioned video signal 401 is a captured video sequence that has been partitioned into blocks of pixels by a coding tree.
  • a coding tree employs various split modes to subdivide a block of pixels into smaller blocks of pixels. These blocks can then be further subdivided into smaller blocks.
  • the blocks may be referred to as nodes on the coding tree. Larger parent nodes are split into smaller child nodes. The number of times a node is subdivided is referred to as the depth of the node/coding tree.
  • the divided blocks can be included in coding units (CUs) in some cases.
  • a CU can be a sub-portion of a CTU that contains a luma block, red difference chroma (Cr) block(s), and a blue difference chroma (Cb) block(s) along with corresponding syntax instructions for the CU.
  • the split modes may include a binary tree (BT), triple tree (TT), and a quad tree (QT) employed to partition a node into two, three, or four child nodes, respectively, of varying shapes depending on the split modes employed.
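  • The child shapes produced by each split mode can be sketched as follows; the (width, height) bookkeeping is illustrative, showing that BT yields two halves, TT a 1/4-1/2-1/4 split, and QT four quadrants.

      def split_node(width, height, mode, direction="vertical"):
          """Return the (width, height) of the child nodes produced by a split mode.
          mode: 'BT' (binary), 'TT' (triple), or 'QT' (quad); direction applies to BT/TT."""
          if mode == "QT":
              return [(width // 2, height // 2)] * 4
          if mode == "BT":
              return ([(width // 2, height)] * 2 if direction == "vertical"
                      else [(width, height // 2)] * 2)
          if mode == "TT":
              if direction == "vertical":
                  return [(width // 4, height), (width // 2, height), (width // 4, height)]
              return [(width, height // 4), (width, height // 2), (width, height // 4)]
          raise ValueError("unknown split mode")

      for mode in ("BT", "TT", "QT"):
          print(mode, split_node(64, 64, mode))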
  • the partitioned video signal 401 is forwarded to the general coder control component 411, the transform scaling and quantization component 413, the intra picture estimation component 415, the filter control analysis component 427, and the motion estimation component 421 for compression.
  • the general coder control component 411 is configured to make decisions related to coding of the images of the video sequence into the bitstream according to application constraints. For example, the general coder control component 411 manages optimization of bitrate/bitstream size versus reconstruction quality. Such decisions may be made based on storage space/bandwidth availability and image resolution requests. The general coder control component 411 also manages buffer utilization in light of transmission speed to mitigate buffer underrun and overrun issues. To manage these issues, the general coder control component 411 manages partitioning, prediction, and filtering by the other components.
  • the general coder control component 411 may dynamically increase compression complexity to increase resolution and increase bandwidth usage or decrease compression complexity to decrease resolution and bandwidth usage. Hence, the general coder control component 411 controls the other components of codec system 400 to balance video signal reconstruction quality with bitrate concerns.
  • the general coder control component 411 creates control data, which controls the operation of the other components.
  • the control data is also forwarded to the header formatting and CABAC component 431 to be encoded in the bitstream to signal parameters for decoding at the decoder.
  • the partitioned video signal 401 is also sent to the motion estimation component 421 and the motion compensation component 419 for inter-prediction.
  • a frame or slice of the partitioned video signal 401 may be divided into multiple video blocks.
  • Motion estimation component 421 and the motion compensation component 419 perform inter-predictive coding of the received video block relative to one or more blocks in one or more reference frames to provide temporal prediction.
  • Codec system 400 may perform multiple coding passes, e.g., to select an appropriate coding mode for each block of video data.
  • Motion estimation component 421 and motion compensation component 419 may be highly integrated, but are illustrated separately for conceptual purposes.
  • Motion estimation performed by motion estimation component 421, is the process of generating motion vectors, which estimate motion for video blocks.
  • a motion vector for example, may indicate the displacement of a coded object relative to a predictive block.
  • a predictive block is a block that is found to closely match the block to be coded, in terms of pixel difference.
  • a predictive block may also be referred to as a reference block.
  • Such pixel difference may be determined by sum of absolute difference (SAD), sum of square difference (SSD), or other difference metrics.
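  • A brief sketch of a full-pel motion search using SAD as the pixel-difference metric: every candidate offset within a small window is scored, and the best offset becomes the motion vector. The exhaustive search and window size are simplifications; practical encoders use faster search patterns and fractional-pel refinement.

      import numpy as np

      def sad(a, b):
          return float(np.sum(np.abs(a.astype(np.int32) - b.astype(np.int32))))

      def full_pel_motion_search(current_block, reference_frame, block_top, block_left, search_range=8):
          """Exhaustive search over integer offsets; returns (best_mv, best_sad)."""
          h, w = current_block.shape
          best_mv, best_cost = (0, 0), float("inf")
          for dy in range(-search_range, search_range + 1):
              for dx in range(-search_range, search_range + 1):
                  top, left = block_top + dy, block_left + dx
                  if (top < 0 or left < 0 or
                          top + h > reference_frame.shape[0] or left + w > reference_frame.shape[1]):
                      continue
                  cost = sad(current_block, reference_frame[top:top + h, left:left + w])
                  if cost < best_cost:
                      best_mv, best_cost = (dx, dy), cost
          return best_mv, best_cost

      rng = np.random.default_rng(1)
      ref = rng.integers(0, 255, size=(64, 64), dtype=np.uint8)
      cur = ref[20:36, 24:40].copy()   # this 16x16 block's content sits at (20, 24) in the reference
      print(full_pel_motion_search(cur, ref, block_top=18, block_left=20))  # -> ((4, 2), 0.0)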
  • HEVC employs several coded objects including a CTU, coding tree blocks (CTBs), and CUs.
  • a CTU can be divided into CTBs, which can then be divided into CBs for inclusion in CUs.
  • a CU can be encoded as a prediction unit (PU) containing prediction data and/or a transform unit (TU) containing transformed residual data for the CU.
  • the motion estimation component 421 generates motion vectors, PUs, and TUs by using a rate-distortion analysis as part of a rate distortion optimization process. For example, the motion estimation component 421 may determine multiple reference blocks, multiple motion vectors, etc. for a current block/frame, and may select the reference blocks, motion vectors, etc. having the best rate-distortion characteristics. The best rate-distortion characteristics balance both quality of video reconstruction (e.g., amount of data loss by compression) with coding efficiency (e.g., size of the final encoding).
• codec system 400 may calculate values for sub-integer pixel positions of reference pictures stored in decoded picture buffer component 423. For example, video codec system 400 may interpolate values of one-quarter pixel positions, one-eighth pixel positions, or other fractional pixel positions of the reference picture. Therefore, motion estimation component 421 may perform a motion search relative to the full pixel positions and fractional pixel positions and output a motion vector with fractional pixel precision. The motion estimation component 421 calculates a motion vector for a PU of a video block in an inter-coded slice by comparing the position of the PU to the position of a predictive block of a reference picture. Motion estimation component 421 outputs the calculated motion vector as motion data to the header formatting and CABAC component 431 for encoding and motion to the motion compensation component 419.
• Motion compensation, performed by motion compensation component 419, may involve fetching or generating the predictive block based on the motion vector determined by motion estimation component 421. Again, motion estimation component 421 and motion compensation component 419 may be functionally integrated, in some examples. Upon receiving the motion vector for the PU of the current video block, motion compensation component 419 may locate the predictive block to which the motion vector points. A residual video block is then formed by subtracting pixel values of the predictive block from the pixel values of the current video block being coded, forming pixel difference values. In general, motion estimation component 421 performs motion estimation relative to luma components, and motion compensation component 419 uses motion vectors calculated based on the luma components for both chroma components and luma components. The predictive block and residual block are forwarded to transform scaling and quantization component 413.
  • the partitioned video signal 401 is also sent to intra-picture estimation component 415 and intra-picture prediction component 417.
  • intra-picture estimation component 415 and intra picture prediction component 417 may be highly integrated, but are illustrated separately for conceptual purposes.
  • the intra-picture estimation component 415 and intra-picture prediction component 417 intra-predict a current block relative to blocks in a current frame, as an alternative to the inter-prediction performed by motion estimation component 421 and motion compensation component 419 between frames, as described above.
  • the intra picture estimation component 415 determines an intra-prediction mode to use to encode a current block.
• intra-picture estimation component 415 selects an appropriate intra-prediction mode to encode a current block from multiple tested intra-prediction modes. The selected intra-prediction modes are then forwarded to the header formatting and CABAC component 431 for encoding. For example, the intra-picture estimation component 415 calculates rate-distortion values using a rate-distortion analysis for the various tested intra-prediction modes, and selects the intra-prediction mode having the best rate-distortion characteristics among the tested modes.
  • Rate-distortion analysis generally determines an amount of distortion (or error) between an encoded block and an original unencoded block that was encoded to produce the encoded block, as well as a bitrate (e.g., a number of bits) used to produce the encoded block.
  • the intra picture estimation component 415 calculates ratios from the distortions and rates for the various encoded blocks to determine which intra-prediction mode exhibits the best rate-distortion value for the block.
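• As a non-normative sketch of the rate-distortion analysis described above, the following Python snippet selects the candidate (for example, an intra-prediction mode) with the lowest Lagrangian cost J = D + lambda * R; the candidate names, distortion values, rates, and lambda are hypothetical and are not taken from any standard.
    def best_candidate(candidates, lam):
        # candidates: list of (name, distortion, rate_in_bits); pick the lowest J = D + lam * R.
        return min(candidates, key=lambda c: c[1] + lam * c[2])

    modes = [("planar", 1200.0, 35), ("dc", 1500.0, 20), ("angular_26", 900.0, 60)]
    print(best_candidate(modes, lam=10.0))  # ('angular_26', 900.0, 60), since J = 1500 is lowest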
  • intra-picture estimation component 415 may be configured to code depth blocks of a depth map using a depth modeling mode (DMM) based on rate-distortion optimization (RDO).
  • the intra-picture prediction component 417 may generate a residual block from the predictive block based on the selected intra-prediction modes determined by intra-picture estimation component 415 when implemented on an encoder or read the residual block from the bitstream when implemented on a decoder.
  • the residual block includes the difference in values between the predictive block and the original block, represented as a matrix.
  • the residual block is then forwarded to the transform scaling and quantization component 413.
  • the intra-picture estimation component 415 and the intra-picture prediction component 417 may operate on both luma and chroma components.
  • the transform scaling and quantization component 413 is configured to further compress the residual block.
  • the transform scaling and quantization component 413 applies a transform, such as a discrete cosine transform (DCT), a discrete sine transform (DST), or a conceptually similar transform, to the residual block, producing a video block comprising residual transform coefficient values. Wavelet transforms, integer transforms, sub-band transforms, or other types of transforms could also be used.
  • the transform may convert the residual information from a pixel value domain to a transform domain, such as a frequency domain.
  • the transform scaling and quantization component 413 is also configured to scale the transformed residual information, for example based on frequency.
  • Such scaling involves applying a scale factor to the residual information so that different frequency information is quantized at different granularities, which may affect final visual quality of the reconstructed video.
  • the transform scaling and quantization component 413 is also configured to quantize the transform coefficients to further reduce bitrate.
  • the quantization process may reduce the bit depth associated with some or all of the coefficients.
  • the degree of quantization may be modified by adjusting a quantization parameter.
  • the transform scaling and quantization component 413 may then perform a scan of the matrix including the quantized transform coefficients.
• the quantized transform coefficients are forwarded to the header formatting and CABAC component 431 to be encoded in the bitstream.
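• As a non-normative sketch of the scaling, quantization, and scanning steps described above, the following Python snippet applies a frequency-dependent divisor matrix, rounds to integer levels, and flattens the result in a simple raster scan; the coefficient and scaling values are invented, and real codecs use standardized transforms, scaling lists, and scan orders.
    import numpy as np

    coeffs = np.array([[160, 40, 8, 2],
                       [ 36, 10, 4, 1],
                       [  6,  3, 2, 0],
                       [  2,  1, 0, 0]], dtype=np.int64)

    # Larger divisors at higher frequencies quantize them more coarsely.
    scale = np.array([[ 8, 12, 16, 24],
                      [12, 16, 24, 32],
                      [16, 24, 32, 48],
                      [24, 32, 48, 64]], dtype=np.int64)

    quantized = np.round(coeffs / scale).astype(np.int64)
    dequantized = quantized * scale          # decoder-side reconstruction of the coefficients
    scanned = quantized.flatten().tolist()   # coefficients forwarded for entropy coding
    print(quantized)
    print(scanned)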
  • the scaling and inverse transform component 429 applies a reverse operation of the transform scaling and quantization component 413 to support motion estimation.
  • the scaling and inverse transform component 429 applies inverse scaling, transformation, and/or quantization to reconstruct the residual block in the pixel domain, e.g., for later use as a reference block which may become a predictive block for another current block.
  • the motion estimation component 421 and/or motion compensation component 419 may calculate a reference block by adding the residual block back to a corresponding predictive block for use in motion estimation of a later block/frame. Filters are applied to the reconstructed reference blocks to mitigate artifacts created during scaling, quantization, and transform. Such artifacts could otherwise cause inaccurate prediction (and create additional artifacts) when subsequent blocks are predicted.
  • the filter control analysis component 427 and the in-loop filters component 425 apply the filters to the residual blocks and/or to reconstructed image blocks.
  • the transformed residual block from the scaling and inverse transform component 429 may be combined with a corresponding prediction block from intra-picture prediction component 417 and/or motion compensation component 419 to reconstruct the original image block.
  • the filters may then be applied to the reconstructed image block.
  • the filters may instead be applied to the residual blocks.
  • the filter control analysis component 427 and the in-loop filters component 425 are highly integrated and may be implemented together, but are depicted separately for conceptual purposes. Filters applied to the reconstructed reference blocks are applied to particular spatial regions and include multiple parameters to adjust how such filters are applied.
  • the filter control analysis component 427 analyzes the reconstructed reference blocks to determine where such filters should be applied and sets corresponding parameters. Such data is forwarded to the header formatting and CABAC component 431 as filter control data for encoding.
  • the in-loop filters component 425 applies such filters based on the filter control data.
  • the filters may include a deblocking filter, a noise suppression filter, a SAO filter, and an adaptive loop filter. Such filters may be applied in the spatial/pixel domain (e.g., on a reconstructed pixel block) or in the frequency domain, depending on the example.
  • the filtered reconstructed image block, residual block, and/or prediction block are stored in the decoded picture buffer component 423 for later use in motion estimation as discussed above.
  • the decoded picture buffer component 423 stores and forwards the reconstructed and filtered blocks toward a display as part of an output video signal.
  • the decoded picture buffer component 423 may be any memory device capable of storing prediction blocks, residual blocks, and/or reconstructed image blocks.
  • the header formatting and CABAC component 431 receives the data from the various components of codec system 400 and encodes such data into a coded bitstream for transmission toward a decoder. Specifically, the header formatting and CABAC component 431 generates various headers to encode control data, such as general control data and filter control data. Further, prediction data, including intra-prediction and motion data, as well as residual data in the form of quantized transform coefficient data are all encoded in the bitstream. The final bitstream includes all information desired by the decoder to reconstruct the original partitioned video signal 401.
  • Such information may also include intra-prediction mode index tables (also referred to as codeword mapping tables), definitions of encoding contexts for various blocks, indications of most probable intra-prediction modes, an indication of partition information, etc.
  • Such data may be encoded by employing entropy coding.
  • the information may be encoded by employing context adaptive variable length coding (CAVLC), CABAC, syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding, or another entropy coding technique.
  • the coded bitstream may be transmitted to another device (e.g., a video decoder) or archived for later transmission or retrieval.
  • FIG. 5 is a block diagram illustrating an example video encoder 500.
  • Video encoder 500 may be employed to implement the encoding functions of codec system 400 and/or implement steps 301, 303, 305, 307, and/or 309 of method 300. Further, encoder 500 may be employed to implement steps 205-209 of method 200 as well as encoder 103.
  • Encoder 500 partitions an input video signal (e.g., a sub-picture video stream), resulting in a partitioned video signal 501, which is substantially similar to the partitioned video signal 401. The partitioned video signal 501 is then compressed and encoded into a bitstream by components of encoder 500.
  • the partitioned video signal 501 is forwarded to an intra-picture prediction component 517 for intra-prediction.
  • the intra-picture prediction component 517 may be substantially similar to intra-picture estimation component 415 and intra-picture prediction component 417.
  • the partitioned video signal 501 is also forwarded to a motion compensation component 521 for inter-prediction based on reference blocks in a decoded picture buffer component 523.
  • the motion compensation component 521 may be substantially similar to motion estimation component 421 and motion compensation component 419.
  • the prediction blocks and residual blocks from the intra-picture prediction component 517 and the motion compensation component 521 are forwarded to a transform and quantization component 513 for transformation and quantization of the residual blocks.
  • the transform and quantization component 513 may be substantially similar to the transform scaling and quantization component 413.
  • the transformed and quantized residual blocks and the corresponding prediction blocks (along with associated control data) are forwarded to an entropy coding component 531 for coding into a bitstream.
• the entropy coding component 531 may be substantially similar to the header formatting and CABAC component 431.
  • the transformed and quantized residual blocks and/or the corresponding prediction blocks are also forwarded from the transform and quantization component 513 to an inverse transform and quantization component 529 for reconstruction into reference blocks for use by the motion compensation component 521.
  • the inverse transform and quantization component 529 may be substantially similar to the scaling and inverse transform component 429.
  • In-loop filters in an in-loop filters component 525 are also applied to the residual blocks and/or reconstructed reference blocks, depending on the example.
  • the in-loop filters component 525 may be substantially similar to the filter control analysis component 427 and the in-loop filters component 425.
  • the in-loop filters component 525 may include multiple filters as discussed with respect to in-loop filters component 425.
  • the filtered blocks are then stored in a decoded picture buffer component 523 for use as reference blocks by the motion compensation component 521.
  • the decoded picture buffer component 523 may be substantially similar to the decoded picture buffer component 423.
  • the encoder 500 may encode video into one or more tracks. As discussed in more detail below, VR video can be recorded from multiple viewpoints. Video from each viewpoint can then be encoded in a corresponding set of tracks. This allows the decoder to swap between tracks based on user input, which allows a user to swap between viewpoints as desired. A user may wish to continuously watch a particular object or location in the virtual environment when switching between viewpoints. In order to allow the user to maintain a consistent view, the encoder 500 can be configured to encode data indicating correspondences between spatial regions of viewpoints that are part of the same viewpoint group, and hence have the same viewpoint group ID.
  • the decoder can determine the correspondences and determine the FOV and/or viewport used by the user at a first viewpoint when a viewpoint switch is requested.
  • the decoder can then determine a FOV/viewport at a second viewpoint that corresponds to the FOV/viewport used at the first viewpoint based on the correspondences encoded by the encoder 500.
  • the decoder can display a FOV/viewport at the second viewpoint that points toward the same location previously viewed by the user at the first viewpoint.
  • correspondences can be encoded in a timed metadata track.
  • FIG. 6 is a block diagram illustrating an example video decoder 600.
  • Video decoder 600 may be employed to implement the decoding functions of codec system 400 and/or implement steps 311, 313, 315, and/or 317 of operating method 300. Further, decoder 600 may be employed to implement steps 211-213 of method 200 as well as decoder 107.
  • Decoder 600 receives a plurality of tracks containing picture bitstreams and/or sub-picture bitstreams, for example from an encoder 500, generates a reconstructed output video signal, for example by merging sub-picture video streams into a spherical video stream, and forwards the spherical video stream for display to a user via a rendering device.
  • the bitstreams are received by an entropy decoding component 633.
  • the entropy decoding component 633 is configured to implement an entropy decoding scheme, such as CAVLC, CABAC, SBAC, PIPE coding, or other entropy coding techniques.
  • the entropy decoding component 633 may employ header information to provide a context to interpret additional data encoded as codewords in the bitstreams.
  • the decoded information includes any desired information to decode the video signal, such as general control data, filter control data, partition information, motion data, prediction data, and quantized transform coefficients from residual blocks.
  • the quantized transform coefficients are forwarded to an inverse transform and quantization component 629 for reconstruction into residual blocks.
  • the inverse transform and quantization component 629 may be similar to inverse transform and quantization component 529.
  • the reconstructed residual blocks and/or prediction blocks are forwarded to intra picture prediction component 617 for reconstruction into image blocks based on intra prediction operations.
  • the intra-picture prediction component 617 may be similar to intra picture estimation component 415 and intra-picture prediction component 417. Specifically, the intra-picture prediction component 617 employs prediction modes to locate a reference block in the frame and applies a residual block to the result to reconstruct intra-predicted image blocks.
  • the reconstructed intra-predicted image blocks and/or the residual blocks and corresponding inter-prediction data are forwarded to a decoded picture buffer component 623 via an in-loop filters component 625, which may be substantially similar to decoded picture buffer component 423 and in-loop filters component 425, respectively.
  • the in-loop filters component 625 filters the reconstructed image blocks, residual blocks, and/or prediction blocks, and such information is stored in the decoded picture buffer component 623.
  • Reconstructed image blocks from decoded picture buffer component 623 are forwarded to a motion compensation component 621 for inter-prediction.
  • the motion compensation component 621 may be substantially similar to motion estimation component 421 and/or motion compensation component 419. Specifically, the motion compensation component 621 employs motion vectors from a reference block to generate a prediction block and applies a residual block to the result to reconstruct an image block.
  • the resulting reconstructed blocks may also be forwarded via the in-loop filters component 625 to the decoded picture buffer component 623.
  • the decoded picture buffer component 623 continues to store additional reconstructed image blocks, which can be reconstructed into frames via the partition information. Such frames may also be placed in a sequence. The sequence is output toward a display as a reconstructed output video signal.
  • the decoder 600 may receive a set of tracks containing VR video recorded from multiple viewpoints. This allows the decoder 600 to swap between tracks based on user input, which allows a user to swap between viewpoints as desired. A user may wish to continuously watch a particular object or location in the virtual environment when switching between viewpoints. In order to allow the user to maintain a consistent view, the tracks may contain data indicating correspondences between spatial regions of viewpoints in the same viewpoint group, and hence including the same viewpoint group ID. This allows the decoder 600 to determine the correspondences and determine the FOV and/or viewport used by the user at a first viewpoint when a viewpoint switch is requested.
  • the decoder 600 can then determine a FOV/viewport at a second viewpoint that corresponds to the FOV/viewport used at the first viewpoint based on the correspondences encoded by the encoder. Accordingly, when the user switches between viewpoints, the decoder 600 can display a FOV/viewport at the second viewpoint that points toward the same location previously viewed by the user at the first viewpoint. For example, such correspondences can be encoded in a timed metadata track. Such correspondences can also be encoded in the tracks including the video data for the relevant viewpoints.
  • FIG. 7 is a schematic diagram illustrating an example system 700 for capturing VR video from multiple viewpoints 702, 703, and 704. Multiple viewpoints 702, 703, and 704 are included as an example. In other examples, less or more viewpoints may be provided. As shown, the system 700 is implemented to capture activity at a particular scene 701 (e.g., a stadium) using a plurality of cameras positioned at corresponding viewpoints 702, 703, and 704. The cameras may be similar to the multi-directional cameras 101 described above in connection with FIG. 1. In an embodiment, the cameras may capture VR videos in fixed positions at viewpoint 702 and viewpoint 703.
• viewpoints 702 and 703 are static viewpoints because the locations of the cameras used at viewpoints 702 and 703 do not move over the course of the video. Further, a separate camera has the ability to continuously change positions along a rail 705 in order to capture VR videos from a variety of different positions denoted as viewpoint 704. By sliding along the rail 705, the camera is able to capture the VR video from different positions, and hence viewpoint 704 may change over time. In practical applications, the camera at viewpoint 704 may be mounted in other ways in order to be moveable in one or more directions. Regardless of the mechanisms employed during filming, viewpoint 704 is a dynamic viewpoint because the location of the corresponding camera changes over the course of the video.
  • the cameras may each record a sphere of video looking outward from the perspective of the corresponding viewpoint 702, 703, and 704.
  • a viewpoint 702, 703, and 704 is the center of a sphere of video data as recorded from a specified location.
  • video (and audio) can be recorded from viewpoints 702, 703, and 704.
  • the video for each viewpoint can then be stored in a set of corresponding tracks.
  • video from a viewpoint 702 can be downsampled and stored at various resolutions in tracks as part of an adaptation set for viewpoint 702.
• Adaptation sets for viewpoints 703 and 704 can also be stored in corresponding tracks. Hence, a decoder can receive user input and, based on the user input, select an adaptation set with corresponding tracks for display.
  • the result is the user can experience VR video from a first viewpoint (e.g., viewpoint 702) at a first time and then switch to experience VR video from a second viewpoint (e.g., viewpoint 703 or 704) at a second time.
  • One mechanism to enable such a viewpoint switch is to provide a default orientation for each viewpoint 702, 703, and 704.
  • An orientation is a direction of view pointing outward from the center of a corresponding viewpoint 702, 703, and/or 704.
  • An orientation may be described in terms of angle, coordinates, etc.
  • a specified orientation may result in a corresponding FOV and viewport for viewing video from the viewpoint 702, 703, and/or 704.
• the default orientation approach causes a user to view a specified default FOV and viewport upon switching to a new viewpoint 702, 703, and/or 704. However, this may result in a negative user experience in some cases.
  • a user may wish to continuously view an object in the scene 701, such as a basketball, a particular player, a goal, etc. Such a consistency may not be possible using default orientations.
  • a user watching the ball at viewpoint 702 may wish to switch to viewpoint 704 to get a closer look.
  • the default orientation at viewpoint 704 may be toward the goal. In such a case, the user loses the ball upon switching and is forced to find the ball again.
  • the encoder can store spatial region correspondences between viewpoints 702, 703, and/or 704 that are part of the same viewpoint group.
  • the decoder can determine the orientation viewed by the user at viewpoint 702 upon switching to viewpoint 704 (or viewpoint 703 in other examples).
  • the decoder can then use the spatial region correspondences between viewpoint 702 and viewpoint 704 to determine an orientation at viewpoint 704 that matches the orientation at viewpoint 702.
  • the decoder can then employ the determined orientation at viewpoint 704 after making the switch.
  • the user is automatically oriented to the same location in the scene 701 after the switch between viewpoints 702, 703, and/or 704 as was viewed before the switch. For example, if the user is watching the ball at viewpoint 702, the user is automatically oriented to view the ball from viewpoint 704 upon switching.
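• As a non-normative sketch of this behavior, the following Python snippet looks up an encoder-signaled correspondence between a sphere region of the source viewpoint and a sphere region of the destination viewpoint, and falls back to a default orientation when no correspondence applies; the region names, orientation angles, and table layout are hypothetical and only illustrate the idea.
    # (source_viewpoint, source_region) -> (destination_viewpoint, destination_region)
    correspondences = {
        (702, "region_12"): (704, "region_3"),
        (702, "region_13"): (704, "region_4"),
    }

    # Hypothetical (yaw, pitch) in degrees for each (viewpoint, region).
    region_orientation = {
        (704, "region_3"): (35.0, -5.0),
        (704, "region_4"): (50.0, -5.0),
    }

    def orientation_after_switch(src_vp, src_region, dst_vp, default):
        dst = correspondences.get((src_vp, src_region))
        if dst is not None and dst[0] == dst_vp:
            return region_orientation[dst]   # keep looking at the same scene location
        return default                       # otherwise use the default orientation

    print(orientation_after_switch(702, "region_12", 704, default=(0.0, 0.0)))  # (35.0, -5.0)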
  • the spatial region correspondences are discussed in greater detail below.
  • FIG. 8 is a schematic diagram of example viewpoints 810 and 820 in a viewpoint group 800.
• a correspondence, as used herein, is an indication that two or more sphere regions 813 and 823 are spatially related such that viewing the sphere regions 813 and 823 from a related viewpoint 810 and 820, respectively, provides a view of the same object 830.
  • the correspondences support switching in the present disclosure.
• the correspondences shown in the schematic diagram can be used by an encoder 103, an encoder 500, a decoder 107, a decoder 600, and/or a codec system 400.
• the correspondences shown in the schematic diagram can describe relationships between viewpoints 702, 703, and/or 704.
• the correspondences shown in the schematic diagram can be encoded in a bitstream and used to support selection of tracks to decode and display, and hence can be used as part of methods 200 and 300.
  • Correspondences can be stored as viewpoint 810 and 820 pairs and/or sets.
  • Viewpoints 810 and 820 each include a sphere 812 and 822, respectively, of video content in associated tracks.
  • a user viewing video from a viewpoint 810 and 820 has access to a sphere 812 and 822, respectively, of video content.
  • the video content is depicted to the user by projecting a portion of the video content from the sphere 812 and 822, depending on the user’s viewpoint 810 and 820, onto a viewport based on the current orientation 811 and 821, respectively, of the user.
  • the spheres 812 and 822 are made up of a plurality of spatial regions, including a sphere region 813 and a corresponding sphere region 823.
• sphere regions 813 and 823 are specified spatial portions of spheres 812 and 822 around viewpoints 810 and 820, respectively.
  • a user may view content from a viewpoint 810 and 820 by employing an orientation 811 and 821, respectively.
  • an orientation 811 and 821 is a direction of view pointing outward from the center of a corresponding viewpoint 810 and 820, respectively.
• a sphere region 813 and/or 823 may be associated with a particular orientation 811 and/or 821, respectively.
  • an orientation 811 and 821 provides a viewing angle from the viewpoint 810 and 820 and defines a sphere region 813 and 823 based on an angle and a depth of field.
  • an orientation 811 and 821 may define a FOV from a viewpoint 810 and 820, respectively.
  • a sphere region 813 and 823 may define a viewport for a viewpoint 810 and 820, respectively.
  • an orientation 811 from a viewpoint 810 may be employed to allow a user to view an object 830 at a location 831.
  • an orientation 821 from a viewpoint 820 may be employed to allow a user to view the same object 830 at the location 831 from a different perspective.
  • the sphere regions 813 and 823 act as viewports onto the object 830.
  • a decoder can use a correspondence between sphere regions 813 and 823 to determine an orientation 811 at a viewpoint 810 from an orientation 821 at a viewpoint 820, and vice versa.
  • an encoder may encode the correspondences between sphere region 813 at viewpoint 810 and sphere region 823 at viewpoint 820 during VR video creation. A decoder can then use such information to maintain viewing consistency when switching between viewpoints 810 and 820.
  • a decoder can employ an orientation 811 from a viewpoint 810 to determine an orientation 821 at a viewpoint 820 as long as the viewpoints 810 and 820 are part of the same viewpoint group 800.
  • a viewpoint group 800 is a plurality of viewpoints 810 and 820 that share a correspondence between at least one sphere region 813 and/or 823.
  • a decoder can determine that viewpoints 810 and 820 are part of the same viewpoint group 800 when the viewpoints 810 and 820 share the same viewpoint group ID.
  • the viewpoint group ID can be stored by the viewpoints 810 and 820 as discussed below.
  • a set of tracks carrying media data of viewpoint 810 and/or 820 can be indicated by the track group of grouping type 'vipo'.
• Tracks belonging to the same viewpoint 810 and/or 820 have the same value of track group ID, denoted as track_group_id, for a track group type, denoted as track_group_type, of 'vipo'.
• the track_group_id of tracks from one viewpoint may differ from the track_group_id of tracks from any other viewpoint.
• If this track grouping is not indicated for any track in a file, the file is considered to contain content for one viewpoint only.
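• As a non-normative sketch of this grouping rule, the following Python snippet collects track IDs by track_group_id for track_group_type 'vipo', so that all tracks sharing a track_group_id are treated as one viewpoint, and treats a file without any such grouping as single-viewpoint content; the track records are invented for the example.
    from collections import defaultdict

    # (track_id, track_group_type, track_group_id) as read from TrackGroupTypeBoxes.
    tracks = [
        (1, "vipo", 100), (2, "vipo", 100),   # two representations of one viewpoint
        (3, "vipo", 101), (4, "vipo", 101),   # a second viewpoint
    ]

    viewpoints = defaultdict(list)
    for track_id, group_type, group_id in tracks:
        if group_type == "vipo":
            viewpoints[group_id].append(track_id)

    if not viewpoints:
        print("no 'vipo' grouping: file contains content for one viewpoint only")
    else:
        print(dict(viewpoints))   # {100: [1, 2], 101: [3, 4]}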
• the 'vipo' track grouping is carried in a viewpoint group box, denoted as ViewpointGroupBox(), which includes the viewpoint position structure, denoted as ViewpointPosStruct(), and other information.
  • the ViewpointPosStruct() can also be included in a viewpoint information structure, denoted as ViewpointInfoStruct(), which can be included in a dynamic viewpoint sample, denoted as DynamicViewpointSample().
  • DynamicViewpointSample() may be a sample syntax of a dynamic viewpoint timed metadata track that has a sample entry of type 'dyvp'.
  • a viewpoint group structure denoted as ViewpointGroupStruct()
• the ViewpointGroupStruct() may include syntax elements for a viewpoint group ID and a viewpoint group description, denoted as vwpt_group_id and vwpt_group_description, respectively.
• the vwpt_group_id indicates the identifier of a viewpoint group. All viewpoints in a viewpoint group share a common reference coordinate system.
• the vwpt_group_description is a null-terminated eight-bit Unicode transformation format (UTF-8) string which indicates the description of a viewpoint group.
  • a null string is allowed.
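• As a non-normative sketch of reading these two fields, the following Python snippet parses an unsigned 8-bit vwpt_group_id followed by a null-terminated UTF-8 vwpt_group_description from a byte buffer; the assumed field width and the example payload are illustrative and should be checked against the exact OMAF syntax in use.
    import struct

    def parse_viewpoint_group_struct(buf, offset=0):
        (vwpt_group_id,) = struct.unpack_from(">B", buf, offset)   # assumed 8-bit identifier
        offset += 1
        end = buf.index(b"\x00", offset)                # null terminator
        description = buf[offset:end].decode("utf-8")   # an empty string is allowed
        return vwpt_group_id, description, end + 1

    payload = bytes([7]) + "courtside group".encode("utf-8") + b"\x00"
    print(parse_viewpoint_group_struct(payload))   # (7, 'courtside group', 17)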
  • An OMAF player may be expected to start with the initial viewpoint timed metadata. Subsequently, if the user wishes to switch to a viewpoint group and the initial viewpoint information is not present, the OMAF player is expected to switch to the viewpoint with the least value of the viewpoint identifier in the viewpoint group.
  • a user may desire that the decoder determine the viewport (covering the same area or object) of the second viewpoint corresponding to the viewport of the first viewpoint presented to the user just before the switching occurs. This can be accomplished when two viewpoints belong to the same viewpoint group 800.
  • the decoder can then present the viewport of the second viewpoint to the user instead of following the initial viewpoint timed metadata for the second viewpoint (even when such timed metadata exists).
• Tracks belonging to the same viewpoint have the same value of track_group_id for track_group_type 'vipo', and the track_group_id of tracks from one viewpoint differs from the track_group_id of tracks from any other viewpoint.
• If this track grouping is not indicated for any track in a file, the file is considered to contain content for one viewpoint only.
• aligned(8) class ViewpointGroupBox extends TrackGroupTypeBox('vipo') { ViewpointPosStruct();
• Tracks that have the same value of track_group_id within TrackGroupTypeBox with track_group_type equal to 'vipo' belong to the same viewpoint.
• the track_group_id within TrackGroupTypeBox with track_group_type equal to 'vipo' is therefore used as the identifier of the viewpoint.
• ViewpointPosStruct() is defined below. viewpoint_label is a null-terminated UTF-8 string that provides a human-readable text label for the viewpoint.
  • the ViewpointInfoStruct() provides information of a viewpoint, including the position of the viewpoint and the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system using syntax as described below:
• bit(7) reserved = 0;
• bit(31) reserved = 0;
• the group_alignment_flag equal to 1 specifies that the viewpoint belongs to a separate coordinate system (with its own origin) for the alignment of viewpoint groups and the ViewpointGroupStruct is present.
• the group_alignment_flag equal to 0 specifies that the viewpoint belongs to the common reference coordinate system.
• viewpoint_pos_x, viewpoint_pos_y, and viewpoint_pos_z specify the position of the viewpoint (when the position of the viewpoint is static) or the initial position of the viewpoint (when the position of the viewpoint is dynamic), in units of 10^-1 millimeters, in 3D space with (0, 0, 0) as the centre of the common reference coordinate system.
• viewpoint_gpspos_present_flag equal to 1 indicates that viewpoint_gpspos_longitude, viewpoint_gpspos_latitude, and viewpoint_gpspos_altitude are present.
• viewpoint_gpspos_present_flag equal to 0 indicates that viewpoint_gpspos_longitude, viewpoint_gpspos_latitude, and viewpoint_gpspos_altitude are not present. viewpoint_gpspos_longitude indicates the longitude of the geolocation of the viewpoint in units of 2^-23 degrees. viewpoint_gpspos_longitude shall be in the range of -180 * 2^23 to 180 * 2^23 - 1, inclusive. Positive values represent eastern longitude and negative values represent western longitude. viewpoint_gpspos_latitude indicates the latitude of the geolocation of the viewpoint in units of 2^-23 degrees. viewpoint_gpspos_latitude shall be in the range of -90 * 2^23 to 90 * 2^23 - 1, inclusive.
• viewpoint_gcs_yaw, viewpoint_gcs_pitch, and viewpoint_gcs_roll specify the yaw, pitch, and roll rotation angles, respectively, of the X, Y, and Z axes of the global coordinate system of the viewpoint relative to the common reference coordinate system, in units of 2^-16 degrees.
• viewpoint_gcs_yaw shall be in the range of -180 * 2^16 to 180 * 2^16 - 1, inclusive.
• viewpoint_gcs_pitch shall be in the range of -90 * 2^16 to 90 * 2^16, inclusive.
• viewpoint_gcs_roll shall be in the range of -180 * 2^16 to 180 * 2^16 - 1, inclusive.
• vwpt_group_id indicates the identifier of a viewpoint group. All viewpoints in a viewpoint group share a common reference coordinate system.
• vwpt_group_description is a null-terminated UTF-8 string which indicates the description of a viewpoint group. A null string is allowed.
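• As a non-normative illustration of the fixed-point units above, the following Python snippet converts stored integer values to physical units: positions in units of 10^-1 millimeters, GPS coordinates in units of 2^-23 degrees, and rotation angles in units of 2^-16 degrees; the input values are hypothetical.
    def pos_to_meters(v):            # units of 10^-1 mm -> meters
        return v * 1e-4

    def gps_to_degrees(v):           # units of 2^-23 degrees
        return v / (1 << 23)

    def angle_to_degrees(v):         # units of 2^-16 degrees
        return v / (1 << 16)

    print(pos_to_meters(125000))             # 12.5 meters from the reference origin
    print(gps_to_degrees(-739246080))        # -88.125 degrees (western longitude)
    print(angle_to_degrees(90 * (1 << 16)))  # 90.0 degrees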
  • An OMAF player is expected to start with the initial viewpoint timed metadata as defined below. Subsequently, if the user wishes to switch to a viewpoint group and the initial viewpoint information is not present, the OMAF player is expected to switch to the viewpoint with the least value of the viewpoint identifier in the viewpoint group.
  • a timed metadata track having sample entry type 'rcvp' may contain zero or one SampleToGroupBox with grouping type equal to 'vwpt'.
  • This SampleToGroupBox represents the assignment of samples in this timed metadata (and consequently the corresponding samples in the media tracks) to viewpoints.
  • an accompanying SampleGroupDescriptionBox with the same grouping type shall be present, and contain the ID of the particular viewpoint this group of samples belong to.
• the sample group entry of grouping type equal to 'vwpt', named ViewpointEntry, is defined as follows:
• viewpoint_id indicates the viewpoint identifier of the viewpoint to which this group of samples belongs.
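• As a non-normative sketch of this sample grouping, the following Python snippet expands the (sample_count, group_description_index) runs of a SampleToGroupBox with grouping type 'vwpt' into a per-sample viewpoint ID using the ViewpointEntry list from the matching SampleGroupDescriptionBox; the run lengths, indices, and viewpoint IDs are invented for the example.
    runs = [(3, 1), (2, 2), (4, 1)]          # (sample_count, 1-based group_description_index)
    viewpoint_entries = {1: 702, 2: 703}     # group_description_index -> viewpoint_id

    per_sample_viewpoint = []
    for sample_count, group_index in runs:
        per_sample_viewpoint.extend([viewpoint_entries[group_index]] * sample_count)

    print(per_sample_viewpoint)   # [702, 702, 702, 703, 703, 702, 702, 702, 702]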
  • the dynamic viewpoint timed metadata track indicates the viewpoint parameters that are dynamically changing over time.
  • An OMAF player should use the signaled information as follows when starting playing back of one viewpoint after switching from another viewpoint. If there is a recommended viewing orientation explicitly signaled, the OMAF player is expected to parse this information and follow the recommended viewing orientation. Otherwise, the OMAF player is expected to keep the same viewing orientation as in the switching-from viewpoint just before the switching occurs.
  • the track sample entry type 'dyvp' shall be used.
  • the sample entry of this sample entry type is specified as follows:
• DynamicViewpointSampleEntry extends MetaDataSampleEntry('dyvp') {
• bit(7) reserved = 0;
• ViewpointPosStruct() is defined above but indicates the initial viewpoint position. dynamic_gcs_rotated_flag equal to 0 specifies that the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system remain unchanged in all samples referring to this sample entry. dynamic_gcs_rotated_flag equal to 1 specifies that the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system are indicated in the samples.
• ViewpointGlobalCoordinateSysRotationStruct() is defined above but indicates the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system for each sample referring to this sample entry.
  • the semantics of ViewpointInfoStruct() is specified above.
• the first sample shall have group_alignment_flag equal to 1.
  • the ViewpointGroupStruct() can be absent.
• When the ViewpointGroupStruct() is absent in a sample, it is inferred to be identical to the ViewpointGroupStruct() of the previous sample, in decoding order.
  • Metadata indicates the initial viewpoint that should be used. In the absence of this information, the initial viewpoint should be inferred to be the viewpoint that has the least value of viewpoint identifier among all viewpoints in the file.
  • the track sample entry type 'invp' shall be used.
  • the sample entry of this sample entry type is specified as follows:
• initial_viewpoint indicates the value of the viewpoint identifier of the initial viewpoint for the first sample to which this sample entry applies.
• initial_viewpoint indicates the value of the viewpoint identifier of the initial viewpoint for the sample.
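• As a non-normative sketch of this starting rule, the following Python snippet uses the signaled initial_viewpoint when available and otherwise falls back to the least viewpoint identifier in the file; the identifier values are hypothetical.
    def select_initial_viewpoint(all_viewpoint_ids, signaled_initial_viewpoint=None):
        if signaled_initial_viewpoint is not None:
            return signaled_initial_viewpoint      # explicitly signaled initial viewpoint
        return min(all_viewpoint_ids)              # fallback: least viewpoint identifier

    print(select_initial_viewpoint({704, 702, 703}))        # 702
    print(select_initial_viewpoint({704, 702, 703}, 704))   # 704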
  • the viewpoint group ID can be used to indicate a viewpoint group.
  • the viewpoint group ID is included in the viewpoint group structure, which is further included in the viewpoint information structure, which is further included in the dynamic viewpoint sample.
  • the dynamic viewpoint sample is included in the timed metadata track, which is available to dynamic viewpoints, such as viewpoint 704 in FIG. 7, but not to static viewpoints, such as viewpoints 702 and 703 in FIG. 7. Accordingly, the preceding implementation prevents the viewpoint group 800, as shown in FIG. 8, from being determined for static viewpoints, which prevents determining destination viewports based on source viewports.
• the present disclosure solves this issue by placing the vwpt_group_id into the ViewpointGroupBox(), which is included in the tracks for the corresponding viewpoints. Examples of such implementations are discussed in greater detail below.
  • FIG. 9 is a schematic diagram of an example VR video file 900 for multiple viewpoints.
  • VR video file 900 may be employed to contain correspondences between spatial regions of viewpoints in a viewpoint group, such as viewpoint group 800.
  • the VR video file 900 can be encoded and/or decoded by an encoder 103, an encoder 500, a decoder 107, a decoder 600, and/or a codec system 400.
  • the VR video file 900 can describe VR video from multiple viewpoints, such as viewpoints 702, 703, and/or 704.
  • the VR video file 900 can contain encoded VR video, and hence can be generated by an encoder and read by a decoder to support video display as part of methods 200 and 300.
  • the VR video file 900 can contain sets of tracks for corresponding viewpoints.
  • the VR video file 900 can contain a set of viewpoint A tracks 910, a set of viewpoint B tracks 920, a set of viewpoint C tracks 940, and a set of viewpoint D tracks 950.
  • such tracks can contain video data as captured from corresponding viewpoints, such as viewpoints 702, 703, 704, 810, and/or 820.
  • VR video recorded at a viewpoint is stored in a corresponding adaptation set.
• the VR video recorded at the viewpoint is downsampled to various lower resolutions. Then a track is generated for each resolution and included in the adaptation set.
  • the set of viewpoint A tracks 910, set of viewpoint B tracks 920, set of viewpoint C tracks 940, and set of viewpoint D tracks 950 contain the tracks associated with the adaptation set for the corresponding viewpoints.
  • the relevant tracks can then be forwarded to the decoder/client depending on the viewpoint selected by the user and the desired resolution based on the availability of network resources.
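• As a non-normative sketch of this selection, the following Python snippet picks, for the chosen viewpoint, the highest-bitrate track that fits the estimated available bandwidth and falls back to the lowest-bitrate track otherwise; the track names and bitrates are invented for the example.
    adaptation_sets = {
        "viewpoint_A": [("trackA_4k", 24000000), ("trackA_1080p", 8000000), ("trackA_720p", 4000000)],
        "viewpoint_B": [("trackB_4k", 24000000), ("trackB_1080p", 8000000)],
    }

    def select_track(viewpoint, available_bps):
        candidates = [t for t in adaptation_sets[viewpoint] if t[1] <= available_bps]
        if not candidates:
            # Nothing fits the current bandwidth estimate: take the lowest bitrate.
            return min(adaptation_sets[viewpoint], key=lambda t: t[1])
        return max(candidates, key=lambda t: t[1])

    print(select_track("viewpoint_A", 10000000))   # ('trackA_1080p', 8000000)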
  • the VR video file 900 can also contain a timed metadata track 960.
  • the timed metadata track 960 contains metadata relevant to one or more of the viewpoints and hence potentially to all of the tracks 910, 920, 940, and 950.
  • spatial region correspondences between viewpoints can be stored in the timed metadata track 960.
  • correspondences between each of the viewpoints can be stored toward the beginning of the timed metadata track 960.
  • Such information may be global in nature and can be used for the entire VR video file 900.
• If the spatial region correspondences change, for example due to viewpoint motion, viewpoints turning on/off, etc., such changes can be coded into the timed metadata track 960 at the temporal location in the VR video file 900 where such changes occur.
  • the timed metadata track 960 can be employed to contain the spatial region correspondences between the viewpoints over the entire length of the VR video file 900.
  • the spatial region correspondences in the timed metadata track 960 can then be used by a decoder when displaying VR video as contained in tracks 910, 920, 940, and 950.
  • viewpoint group IDs 962 for the viewpoints are stored in the timed metadata track 960.
  • decoders may not generally check the timed metadata track 960 when switching to or from a static viewpoint. Accordingly, the decoder may be unaware of the viewpoint group IDs 962 in the timed metadata track 960.
  • the present disclosure includes a viewpoint group ID 912, 922, 942, and 952, which are the group IDs for the viewpoints in the corresponding viewpoint tracks 910, 920, 940, and 950, respectively. In this way, both static and dynamic viewpoints have access to the viewpoint group IDs 912, 922, 942, and 952.
  • a file conforming to the HEVC file format may comprise a series of objects, called boxes.
  • a box may be an object-oriented building block defined by a unique type identifier and length.
  • a box is the elementary syntax structure in ISOBMFF and may include a four-character coded box type, a byte count of the box, and a payload.
  • a box may be a syntax structure comprising a coded box type, a byte count of the box, and a payload.
• all data in a file conforming to the HEVC file format may be contained within boxes and there may be no data in the file that is not in a box.
  • an ISOBMFF file may include a sequence of boxes, and boxes may contain other boxes.
  • the payload of a box may include one or more additional boxes.
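• As a non-normative sketch of this box layout, the following Python snippet walks the box headers of a byte buffer, reading the 32-bit size and four-character type and handling the 64-bit largesize and size-zero cases minimally; the sample buffer is synthetic, not a real media file.
    import struct

    def iter_boxes(buf):
        offset = 0
        while offset + 8 <= len(buf):
            size, box_type = struct.unpack_from(">I4s", buf, offset)
            header = 8
            if size == 1:    # 64-bit largesize follows the type field
                (size,) = struct.unpack_from(">Q", buf, offset + 8)
                header = 16
            elif size == 0:  # box extends to the end of the buffer
                size = len(buf) - offset
            yield box_type.decode("ascii"), offset + header, size - header
            offset += size

    data = struct.pack(">I4s", 16, b"ftyp") + b"isom\x00\x00\x00\x00" + struct.pack(">I4s", 8, b"moov")
    print(list(iter_boxes(data)))   # [('ftyp', 8, 8), ('moov', 24, 0)]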
  • a file conforming to ISOBMFF may include various types of boxes.
  • a file conforming to ISOBMFF may include a file type box, a media data box, a movie box, a movie fragment box, and so on.
  • a file type box includes file type and compatibility information.
  • a media data box may contain samples (e.g., coded pictures).
  • a Movie box (moov) contains metadata for continuous media streams present in the file. Each of the continuous media streams may be represented in the file as a track, which may be included in tracks 910, 920, 940, 950, and/or 960.
  • a movie box may contain metadata regarding a movie (e.g., logical and timing relationships between samples, and also pointers to locations of samples).
  • Movie boxes may include several types of sub-boxes.
  • the sub-boxes in a movie box may include one or more track boxes.
  • a track box may include information about an individual track of a movie.
  • a track box may include a track header box that specifies overall information of a single track.
  • a track box may include a media box that contains a media information box.
  • the media information box may include a sample table box that contains data indexing media samples in the track. Information in the sample table box may be used to locate samples in time and, for each of the samples of the track, a type, size, container, and offset into that container of the sample.
  • the metadata for a track is enclosed in a Track box ("trak"), while the media content of a track is either enclosed in a Media Data box (“mdat”) or directly in a separate file.
  • the media content for tracks comprises or consists of a sequence of samples, such as audio or video access units.
  • ISOBMFF specifies the following types of tracks: a media track, which contains an elementary media stream, a hint track, which either includes media transmission instructions or represents a received packet stream, and a timed metadata track, which comprises time- synchronized metadata.
  • the metadata for each track includes a list of sample description entries, each providing the coding or encapsulation format used in the track and the initialization data used for processing that format. Each sample is associated with one of the sample description entries of the track.
  • ISOBMFF enables specifying sample-specific metadata with various mechanisms. Specific boxes within the Sample Table box (stbl) may be employed for various functions.
• the stbl contains a sample table that contains the time and data indexing of the media samples in a track. Hence, the stbl can be used to locate samples in time, determine their type (e.g., I-frame or not), and determine their size, container, and offset into that container.
  • a Movie Fragment box is a top-level box.
  • a Movie Fragment box may contain one or more track fragment (traf) boxes. Within the Movie Fragment there is a set of track fragments, zero or more per track.
  • the track fragments in turn contain zero or more track runs, each of which documents a contiguous run of samples for that track. For instance, each track run may contain samples of pictures that are contiguous in a certain order, such as decoding order.
  • a track fragment comprises metadata for one or more track fragments.
  • a track fragment box may include a track fragment header box indicating a track ID, a base data offset, a sample description index, a default sample duration, a default sample size, and default sample flags.
  • a track fragment box may include one or more track fragment run boxes, each documenting a contiguous set of samples for a track.
  • a track fragment box may include syntax elements indicating a sample count, a data offset, sample flags, a sample duration, a sample size, sample composition time offset, and so on. Within these structures, many fields are optional and can be defaulted.
  • a VR video file 900 can contain many boxes in tracks 910, 920, 940, 950, and/or 960.
  • a sample description box (stsd) 911, 921, 941, and 951 can be employed to contain the viewpoint group IDs 912, 922, 942, and 952, respectively, for the corresponding tracks 910, 920, 940, and 950, respectively.
• each of the stsd 911, 921, 941, and 951 is a box that contains data that describes the samples contained in the corresponding track.
• the viewpoint group ID 912, 922, 942, and 952 can be placed in the corresponding tracks 910, 920, 940, and 950, respectively, by making changes to the code described with respect to FIG. 8. Specifically, the code can be modified to ensure the viewpoint grouping information for all viewpoints is always signaled regardless of whether a viewpoint is at a static position (where a static position is not associated with a timed metadata track with sample entry type 'dyvp'). In a first embodiment, the viewpoint group ID 912, 922, 942, and 952 is always signaled in a ViewpointGroupBox, either directly or indirectly.
  • the ViewpointGroupBox is included in the stsd 911, 921, 941, and 951, which brings the viewpoint group ID 912, 922, 942, and 952 into the tracks 910, 920, 940, and 950.
  • the viewpoint group identifier is included in a viewpoint group structure, and the viewpoint group structure is included in the viewpoint group box.
• class ViewpointGroupBox extends TrackGroupTypeBox('vipo') {
• group_alignment_flag equal to 1 specifies that the viewpoint belongs to a separate coordinate system [removed: (with its own origin) for the alignment of viewpoint groups and the] ViewpointGroupStruct is present.
• the group_alignment_flag equal to 0 specifies that the viewpoint belongs to the common reference coordinate system. [removed: When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.]
• vwpt_group_id indicates the identifier of a viewpoint group. All viewpoints in a viewpoint group share a common reference coordinate system. When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.
  • ViewpointGroupStruct() is always signaled in the ViewpointPosStruct() but not in the ViewpointInfoStruct().
  • the viewpoint group identifier is included in a viewpoint group structure
  • the viewpoint group structure is further included in a viewpoint position structure
  • the viewpoint position structure is further included in the viewpoint group box.
  • the viewpoint group ID 912, 922, 942, and 952 is included in the ViewpointGroupBox, which is included in the stsd 911, 921, 941, and 951.
  • the following changes to the code described with respect to FIG. 8 can implement such an embodiment.
• ViewpointInfoStruct() and ViewpointPosStruct() are changed as follows, where removed parts are shown in strikethrough (rendered here as [removed: ...]), and additions are shown in underline. aligned(8) ViewpointInfoStruct(gcs_rotation_flag) {
• ViewpointGroupStruct() ViewpointGroupStruct();
• bit(31) reserved = 0;
• group_alignment_flag equal to 1 specifies that the viewpoint belongs to a separate coordinate system [removed: (with its own origin) for the alignment of viewpoint groups and the] ViewpointGroupStruct is present.
• the group_alignment_flag equal to 0 specifies that the viewpoint belongs to the common reference coordinate system. [removed: When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.]
• the semantics of vwpt_group_id is changed as follows: vwpt_group_id indicates the identifier of a viewpoint group. All viewpoints in a viewpoint group share a common reference coordinate system. When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.
• the first sample shall have group_alignment_flag equal to 1.
• the ViewpointGroupStruct() can be absent.
• When the ViewpointGroupStruct() is absent in a sample, it is inferred to be identical to the ViewpointGroupStruct() of the previous sample, in decoding order.
• ViewpointGroupStruct() is always signaled in the ViewpointInfoStruct() but not in ViewpointPosStruct().
  • the viewpoint group identifier is included in a viewpoint group structure, which is further included in a viewpoint information structure, which is further included in the viewpoint group box.
  • the viewpoint group ID 912, 922, 942, and 952 is included in the ViewpointGroupBox, which is included in the stsd 911, 921, 941, and 951. This brings the viewpoint group ID 912, 922, 942, and 952 into the tracks 910, 920, 940, and 950.
  • the following changes to the code described with respect to FIG. 8 can implement such an embodiment.
• bit(7) reserved = 0;
• ViewpointGroupStruct() ViewpointGroupStruct();
• DynamicViewpointSampleEntry extends MetaDataSampleEntry('dyvp') {
• ViewpointInfoStruct() ViewpointInfoStruct(); ViewpointPosStruct() ViewpointPosStruct();
• group_alignment_flag equal to 1 specifies that the viewpoint belongs to a separate coordinate system [removed: (with its own origin) for the alignment of viewpoint groups and the] ViewpointGroupStruct is present.
• the group_alignment_flag equal to 0 specifies that the viewpoint belongs to the common reference coordinate system. [removed: When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.]
• vwpt_group_id indicates the identifier of a viewpoint group. All viewpoints in a viewpoint group share a common reference coordinate system. When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.
• the first sample shall have group_alignment_flag equal to 1.
• the ViewpointGroupStruct() can be absent.
• When the ViewpointGroupStruct() is absent in a sample, it is inferred to be identical to the ViewpointGroupStruct() of the previous sample, in decoding order.
  • FIG. 10 is an embodiment of a method 1000 of performing viewpoint switching based on viewpoint group identification, as discussed with respect to VR video file 900, between viewpoints in a viewpoint group, such as viewpoint group 800.
  • method 1000 may be employed by a decoder 107, a decoder 600, and/or a codec system 400.
  • Method 1000 can be employed to support viewpoint switching, for example between viewpoints 702, 703, and/or 704, when displaying a VR video file, such as VR video file 900, and hence can be employed to improve methods 200 and 300.
  • Method 1000 initiates when the decoder determines to switch from a source viewpoint to a destination viewpoint.
  • the video decoder processes a VR video stream.
• the VR video stream comprises a plurality of viewpoints included in a viewpoint group. Further, each of the viewpoints in the viewpoint group corresponds to one particular omnidirectional video camera for capturing an omnidirectional video at a particular location.
  • the VR video stream contains information indicative of the viewpoint group.
  • the information indicative of the viewpoint group may comprise a viewpoint group identifier.
  • the decoder presents a first viewport of a first viewpoint in the viewpoint group to a user.
• the decoder determines to switch from the first viewpoint to a second viewpoint in the viewpoint group at step 1005. This may occur in response to receiving user input, for example.
  • the decoder determines a second viewport of the second viewpoint based on the information indicative of the viewpoint group. For example, the decoder can obtain a viewpoint group identifier that is stored in a track containing the first viewpoint, the second viewpoint, or combinations thereof. As such, the decoder can use the viewpoint group identifier to determine whether the first viewpoint and the second viewpoint are part of the same viewpoint group, and hence whether the second viewport can validly be determined based on the first viewport. As the viewpoint group identifier can be stored in any track, the viewpoint group identifier can be contained in a track configured for a static viewpoint. Accordingly, the switching behavior of method 1000 can be employed when switching to and/or from static viewpoints, and not solely for dynamic viewpoints.
  • the viewpoint group identifier can be included in a viewpoint group box.
  • the viewpoint group identifier is included in a viewpoint group structure, which is included in the viewpoint group box.
  • the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint position structure, which is further included in the viewpoint group box.
  • the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint information structure, which is further included in the viewpoint group box.
  • FIG. 11 is an embodiment of another method 1100 of performing viewpoint switching based on viewpoint group identification, as discussed with respect to VR video file 900, between viewpoints in a viewpoint group, such as viewpoint group 800.
  • method 1100 may be employed by a decoder 107, a decoder 600, and/or a codec system 400.
  • Method 1100 can be employed to support viewpoint switching, for example between viewpoints 702, 703, and/or 704, when displaying a VR video file, such as VR video file 900, and hence can be employed to improve methods 200 and 300.
  • Method 1100 initiates at step 1101 upon receiving a VR video stream at a decoder.
  • the VR video stream comprises a viewpoint group designated by a viewpoint identifier.
  • the viewpoint group, and hence also the VR video stream, includes a plurality of viewpoints.
  • the viewpoint identifier is included in one or more tracks including one or more of the plurality of viewpoints.
  • the viewpoint identifier is included in tracks including static viewpoints and may also be included in a metadata track associated with dynamic viewpoints.
  • the decoder decodes a first viewport of a first viewpoint of the viewpoint group for display.
  • the decoder determines to switch to a second viewpoint at step 1105, for example based on user input.
  • the decoder determines that the second viewpoint is included in the viewpoint group with the first viewpoint based on the viewpoint identifier. Based on the determination that the second viewpoint is included in the viewpoint group with the first viewpoint, the decoder can determine a second viewport of the second viewpoint based on the first viewport and the viewpoint identifier at step 1109. The decoder then decodes the second viewport of the second viewpoint for display at step 1111.
  • the decoder can obtain a viewpoint group identifier that is stored in a track containing the first viewpoint, the second viewpoint, or combinations thereof.
  • the decoder can use the viewpoint group identifier to determine whether the first viewpoint and the second viewpoint are part of the same viewpoint group, and hence whether the second viewport can validly be determined based on the first viewport (see the sketch following this list).
  • as the viewpoint group identifier can be stored in any track, it can be contained in a track configured for a static viewpoint. Accordingly, the switching behavior of method 1100 can be employed when switching to and/or from static viewpoints, and not solely for dynamic viewpoints.
  • the viewpoint group identifier can be included in a viewpoint group box.
  • the viewpoint group identifier is included in a viewpoint group structure, which is included in the viewpoint group box. In another example, the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint position structure, which is further included in the viewpoint group box. In yet another example, the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint information structure, which is further included in the viewpoint group box.
  • FIG. 12 is an embodiment of a method 1200 of encoding viewpoint group identification at an encoder to support viewpoint switching at a decoder, as discussed with respect to VR video file 900, between viewpoints in a viewpoint group, such as viewpoint group 800.
  • method 1200 may be employed by an encoder 103, an encoder 500, and/or a codec system 400.
  • Method 1200 can be employed to support viewpoint switching, for example between viewpoints 702, 703, and/or 704, when displaying a VR video file, such as VR video file 900, and hence can be employed to improve methods 200 and 300.
  • Method 1200 can be initiated when encoding a VR video. Specifically, at step 1201, the encoder receives video from a plurality of viewpoints associated with a VR video stream. The plurality of viewpoints is part of a viewpoint group.
  • the encoder encodes a viewpoint group identifier of the viewpoint group in one or more tracks including one or more of the plurality of viewpoints.
  • the viewpoint group identifier can be encoded before, after, and/or contemporaneously with encoding the video from the viewpoints.
  • the presence of the viewpoint group identifier indicates viewpoints in the viewpoint group include viewports that display a common location in virtual space. Accordingly, a second viewport of a second viewpoint in the viewpoint group is determinable based on a first viewport of a first viewpoint in the viewpoint group during a viewpoint switch at a decoder.
  • the viewpoint identifier is included in tracks including static viewpoints and may also be included in a metadata track associated with dynamic viewpoints.
  • the video file containing the viewpoint group identifier in the one or more tracks can then be forwarded in a bitstream toward a decoder for display at step 1205.
  • the viewpoint group identifier in the tracks allows the decoder to use the viewpoint group identifier to determine whether the first viewpoint and the second viewpoint are part of the same viewpoint group, and hence whether the second viewport can validly be determined based on the first viewport.
  • as the viewpoint group identifier can be stored in any track, it can be contained in a track configured for a static viewpoint. Accordingly, method 1200 provides a mechanism to support the switching behavior of method 1100 when a decoder is switching to and/or from static viewpoints, and not solely for dynamic viewpoints.
  • the viewpoint group identifier can be included in a viewpoint group box.
  • the viewpoint group identifier is included in a viewpoint group structure, which is included in the viewpoint group box. In another example, the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint position structure, which is further included in the viewpoint group box. In yet another example, the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint information structure, which is further included in the viewpoint group box.
  • FIG. 13 is a schematic diagram of an example video coding device 1300 according to an embodiment of the disclosure.
  • the coding device 1300 is suitable for implementing the methods and processes disclosed herein.
  • the coding device 1300 comprises downstream ports 1310 and transceiver units (Tx/Rx) 1320 for transmitting and receiving data to and from a downstream direction; a processor, logic unit, or central processing unit (CPU) 1330 to process the data; upstream ports 1350 coupled to Tx/Rx 1320 for transmitting and receiving the data to and from an upstream direction; and a memory 1360 for storing the data.
  • the coding device 1300 may also comprise optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the downstream ports 1310, the Tx/Rx units 1320, and the upstream ports 1350 for egress or ingress of optical or electrical signals.
  • the processor 1330 is implemented by hardware and software.
  • the processor 1330 may be implemented as one or more CPU chips, cores (e.g., as a multi-core processor), field-programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), and digital signal processors (DSPs).
  • the processor 1330 is in communication with the downstream ports 1310, transceiver units 1320, upstream ports 1350, and memory 1360.
  • the processor 1330 comprises a coding module 1370.
  • the coding module 1370 implements the disclosed embodiments described above.
  • the coding module 1370 may implement an encoder 103, an encoder 500, a decoder 107, a decoder 600, and/or a codec system 400, depending on the example.
  • the coding module 1370 may implement method 200, method 300, method 1000, method 1100, and/or method 1200, depending on the example.
  • coding module 1370 may generate or decode a VR video file 900.
  • the coding module 1370 can encode or decode VR video based on a track that contains a viewpoint group identifier denoting that viewpoints, such as viewpoints 702, 703, 704, 810, and/or 820, are part of the same viewpoint group to support determining a destination viewport during viewpoint switching. Accordingly, the inclusion of the coding module 1370 provides a substantial improvement to the functionality of the coding device 1300 and effects a transformation of the coding device 1300 to a different state.
  • the coding module 1370 is implemented as instructions stored in the memory 1360 and executed by the processor 1330.
  • the video coding device 1300 may also include input and/or output (I/O) devices 1380 for communicating data to and from a user.
  • the I/O devices 1380 may include output devices such as a display for displaying video data, speakers for outputting audio data, etc.
  • the I/O devices 1380 may also include input devices, such as a keyboard, mouse, trackball, etc., and/or corresponding interfaces for interacting with such output devices.
  • the memory 1360 comprises one or more disks, tape drives, and solid-state drives and may be used as an over-flow data storage device, to store programs when such programs are selected for execution, and to store instructions and data that are read during program execution.
  • the memory 1360 may be volatile and/or non-volatile and may be read-only memory (ROM), random-access memory (RAM), ternary content-addressable memory (TCAM), and/or static random-access memory (SRAM).
  • FIG. 14 is a schematic diagram of an embodiment of a system 1400 for managing viewpoint switching based on viewpoint group identification.
  • the system 1400 is suitable for implementing the methods and processes disclosed herein, and may, for example, implement method 200, method 300, method 1000, method 1100, and/or method 1200, depending on the example.
  • the system 1400 includes a video encoder 1402.
  • the encoder 1402 comprises a receiving module 1401 for receiving a plurality of viewpoints associated with a VR video stream.
  • the plurality of viewpoints are part of a viewpoint group.
  • the encoder 1402 also comprises an encoding module 1403 for encoding a viewpoint group identifier of the viewpoint group in one or more tracks including one or more of the plurality of viewpoints.
  • the viewpoint group identifier indicates a second viewport of a second viewpoint in the viewpoint group is determinable based on a first viewport of a first viewpoint in the viewpoint group during a viewpoint switch.
  • the encoder 1402 also comprises a forwarding module 1405 for forwarding the viewpoint group identifier in the one or more tracks in a bitstream toward a decoder for display.
  • the encoder 1402 is further configured to perform other encoding related mechanisms as discussed herein.
  • the system 1400 also includes a video decoder 1410.
  • the decoder 1410 comprises a receiving module 1411 for receiving a VR video stream comprising a viewpoint group designated by a viewpoint identifier and including a plurality of viewpoints.
  • the viewpoint identifier is included in one or more tracks including one or more of the plurality of viewpoints.
  • the decoder 1410 also comprises a decoding module 1415 for decoding a first viewport of a first viewpoint of the viewpoint group for display, and decoding a second viewport of a second viewpoint for display.
  • the decoder 1410 also comprises a determining module 1413 for determining to switch to the second viewpoint, determining that the second viewpoint is included in the viewpoint group with the first viewpoint based on the viewpoint identifier, and determining the second viewport of the second viewpoint based on the first viewport and the viewpoint identifier.
  • the decoder 1410 is further configured to perform other decoding, display, and/or viewpoint switching related mechanisms as discussed herein.
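The following Python sketch illustrates the viewpoint group identifier comparison described above for methods 1000, 1100, and 1200. It is a non-normative illustration: the track dictionaries, the field name vwpt_group_id, and the function names are assumptions made for the example rather than structures defined by this disclosure or by any file format specification, and a complete player would also use the viewpoints' relative positions when deriving the destination viewport.

    # Hedged sketch: deciding whether the destination viewport can be derived from the
    # source viewport when switching between viewpoints. Field and function names are
    # hypothetical; tracks are modeled as plain dictionaries.

    def same_viewpoint_group(source_track, destination_track):
        """Orientation transfer is only valid when both tracks signal the same
        viewpoint group identifier (same common reference coordinate system)."""
        src_id = source_track.get("vwpt_group_id")
        dst_id = destination_track.get("vwpt_group_id")
        return src_id is not None and src_id == dst_id

    def select_destination_viewport(source_viewport, source_track, destination_track,
                                    default_viewport):
        if same_viewpoint_group(source_track, destination_track):
            # Same group: keep the user's viewing orientation so the object being
            # watched stays in view. A full implementation would also compensate for
            # the relative positions of the two viewpoints.
            return dict(source_viewport)
        # Different groups: position coordinates are not comparable, use the default.
        return dict(default_viewport)

    # Example usage with two static viewpoints in the same group:
    track_a = {"vwpt_group_id": 0}
    track_b = {"vwpt_group_id": 0}
    print(select_destination_viewport({"yaw": 30.0, "pitch": -5.0},
                                      track_a, track_b,
                                      {"yaw": 0.0, "pitch": 0.0}))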

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A mechanism for processing a virtual reality (VR) video stream is disclosed. The VR video stream comprises a plurality of viewpoints included in a viewpoint group. Each of the viewpoints in the viewpoint group corresponds to one particular omnidirectional video camera for capturing an omnidirectional video at a particular location. The VR video stream contains information indicative of the viewpoint group. A first viewport of a first viewpoint in the viewpoint group is presented to a user. The mechanism switches from the first viewpoint to a second viewpoint in the viewpoint group. The mechanism also determines a second viewport of the second viewpoint based on the information indicative of the viewpoint group.

Description

Virtual Reality (VR) Viewpoint Grouping
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This patent application claims the benefit of U.S. Provisional Patent Application No. 62/736,966, filed September 26, 2018 by Ye-Kui Wang, and titled “Virtual Reality (VR) Viewpoint Grouping,” which is hereby incorporated by reference.
TECHNICAL FIELD
[0002] The present disclosure is generally related to virtual reality (VR), also referred to as omnidirectional media, immersive media, and 360 degree video, and is specifically related to mechanisms for signaling group relationships between multiple VR video viewpoints.
BACKGROUND
[0003] Virtual reality (VR) is the ability to be virtually present in a non-physical world created by the rendering of natural and/or synthetic images and sounds correlated by the movements of the immersed user allowing the user to interact with that world. With the recent progress made in rendering devices, such as head mounted displays (HMD) and VR video (often also referred to as 360 degree video or omnidirectional video) creation, a significant quality of experience can be offered. VR applications include gaming, training, education, sports video, online shopping, adult entertainment, and so on.
SUMMARY
[0004] In an embodiment, the disclosure includes a method comprising: processing a VR video stream, wherein the VR video stream comprises a plurality of viewpoints included in a viewpoint group, wherein each of the viewpoints in the viewpoint group corresponds to one particular omnidirectional video camera for capturing an omnidirectional video at a particular location, and wherein the VR video stream contains information indicative of the viewpoint group; presenting a first viewport of a first viewpoint in the viewpoint group to a user; switching from the first viewpoint to a second viewpoint in the viewpoint group; and presenting a second viewport of the second viewpoint in the viewpoint group to a user based on the information indicative of the viewpoint group. For example, VR video can include multiple viewpoints. A user may desire to move between viewpoints. In some systems, default viewports are employed at each viewpoint. However, a user may view an object at a first viewport of a first viewpoint. Upon switching to a second viewpoint, the user may wish to continue viewing the object. In such a case, switching to a default viewport at the second viewpoint may be disorienting and/or provide a poor user experience. The present disclosure employs a viewpoint group identifier (ID) which can be used by a decoder to determine a viewport at a second viewpoint based on a viewport used at a first viewpoint. Such a viewpoint group ID may be included in data structures stored in a metadata track in some systems. However, the metadata track may only be used for dynamic viewpoints (e.g., cameras that move). Accordingly, a viewpoint group ID stored in a metadata track may not be available for use with static viewpoints. As such, the present disclosure may include the viewpoint group ID in the track for static viewpoints, which extends the switching functionality available for dynamic viewpoints to the static viewpoints.
[0005] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the information indicative of the viewpoint group comprises a viewpoint group identifier.
[0006] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the viewpoint group identifier is stored in a track containing the first viewpoint, the second viewpoint, or combinations thereof.
[0007] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the track is configured for a static viewpoint.
[0008] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group box. For example, the switching functionality for dynamic viewpoints can be extended to the static tracks by including the viewpoint group structure object in the viewpoint group box. This pulls the viewpoint group ID into an object that is used by both static and dynamic viewpoints.
[0009] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, and wherein the viewpoint group structure is included in the viewpoint group box.
[0010] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint position structure, and wherein the viewpoint position structure is included in the viewpoint group box.
[0011] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint information structure, and wherein the viewpoint information structure is included in the viewpoint group box.
[0012] In an embodiment, the disclosure includes a method implemented in a decoder, the method comprising: receiving, by a receiver of the decoder, a VR video stream comprising a viewpoint group designated by a viewpoint identifier and including a plurality of viewpoints, wherein the viewpoint identifier is included in one or more tracks including one or more of the plurality of viewpoints; decoding, by a processor of the decoder, a first viewport of a first viewpoint of the viewpoint group for display; determining, by the processor, to switch to a second viewpoint; determining, by the processor, that the second viewpoint is included in the viewpoint group with the first viewpoint based on the viewpoint identifier; determining, by the processor, a second viewport of the second viewpoint based on the first viewport and the viewpoint identifier; and decoding, by the processor, the second viewport of the second viewpoint for display.
[0013] In an embodiment, the disclosure includes a method implemented in an encoder, the method comprising: receiving, by a processor of the encoder, a plurality of viewpoints associated with a VR video stream, wherein the plurality of viewpoints are part of a viewpoint group; encoding, by the processor, a viewpoint group identifier of the viewpoint group in one or more tracks including one or more of the plurality of viewpoints, wherein the viewpoint group identifier indicates a second viewport of a second viewpoint in the viewpoint group is determinable based on a first viewport of a first viewpoint in the viewpoint group during a viewpoint switch; and forwarding, by the processor, the viewpoint group identifier in the one or more tracks in a bitstream toward a decoder for display.
[0014] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the viewpoint identifier is included in tracks including static viewpoints and a metadata track associated with dynamic viewpoints.
[0015] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the viewpoint identifier is included in a viewpoint group box.
[0016] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, and wherein the viewpoint group structure is included in the viewpoint group box.
[0017] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint position structure, and wherein the viewpoint position structure is included in the viewpoint group box.
[0018] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint information structure, and wherein the viewpoint information structure is included in the viewpoint group box.
[0019] In an embodiment, the disclosure includes a video coding device comprising: a processor, a receiver coupled to the processor, a transmitter coupled to the processor, and a memory coupled to the processor, the processor, transmitter, receiver, and memory configured to perform the method of any of the preceding aspects.
[0020] In an embodiment, the disclosure includes a non-transitory computer readable medium comprising a computer program product for use by a video coding device, the computer program product comprising computer executable instructions stored on the non- transitory computer readable medium such that when executed by a processor cause the video coding device to perform the method of any of the preceding aspects.
[0021] In an embodiment, the disclosure includes a decoder comprising: a receiving means for receiving a virtual reality (VR) video stream comprising a viewpoint group designated by a viewpoint identifier and including a plurality of viewpoints, wherein the viewpoint identifier is included in one or more tracks including one or more of the plurality of viewpoints; a decoding means for: decoding a first viewport of a first viewpoint of the viewpoint group for display, and decoding a second viewport of a second viewpoint for display; and a determination means for: determining to switch to the second viewpoint, determining that the second viewpoint is included in the viewpoint group with the first viewpoint based on the viewpoint identifier, and determining the second viewport of the second viewpoint based on the first viewport and the viewpoint identifier.
[0022] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the decoder is further configured to perform the method of any of the preceding aspects.
[0023] In an embodiment, the disclosure includes an encoder comprising: a receiving means for receiving a plurality of viewpoints associated with a VR video stream, wherein the plurality of viewpoints are part of a viewpoint group; an encoding means for encoding a viewpoint group identifier of the viewpoint group in one or more tracks including one or more of the plurality of viewpoints, wherein the viewpoint group identifier indicates a second viewport of a second viewpoint in the viewpoint group is determinable based on a first viewport of a first viewpoint in the viewpoint group during a viewpoint switch; and a forwarding means for forwarding the viewpoint group identifier in the one or more tracks in a bitstream toward a decoder for display.
[0024] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the encoder is further configured to perform the method of any of the preceding aspects.
[0025] For the purpose of clarity, any one of the foregoing embodiments may be combined with any one or more of the other foregoing embodiments to create a new embodiment within the scope of the present disclosure.
[0026] These and other features will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings and claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0027] For a more complete understanding of this disclosure, reference is now made to the following brief description, taken in connection with the accompanying drawings and detailed description, wherein like reference numerals represent like parts.
[0028] FIG. 1 is a schematic diagram of an example system for VR based video coding.
[0029] FIG. 2 is a flowchart of an example method of coding a VR picture bitstream.
[0030] FIG. 3 is a flowchart of an example method of coding a video signal.
[0031] FIG. 4 is a schematic diagram of an example coding and decoding (codec) system for video coding.
[0032] FIG. 5 is a schematic diagram illustrating an example video encoder.
[0033] FIG. 6 is a schematic diagram illustrating an example video decoder.
[0034] FIG. 7 is a schematic diagram illustrating an example system for capturing VR video from multiple viewpoints.
[0035] FIG. 8 is a schematic diagram of example viewpoints in a viewpoint group.
[0036] FIG. 9 is a schematic diagram of an example VR video file for multiple viewpoints.
[0037] FIG. 10 is an embodiment of a method of performing viewpoint switching based on viewpoint group identification.
[0038] FIG. 11 is an embodiment of another method of performing viewpoint switching based on viewpoint group identification.
[0039] FIG. 12 is an embodiment of a method of encoding viewpoint group identification at an encoder to support viewpoint switching at a decoder.
[0040] FIG. 13 is a schematic diagram of an example video coding device.
[0041] FIG. 14 is a schematic diagram of an embodiment of a system for managing viewpoint switching based on viewpoint group identification.
DETAILED DESCRIPTION
[0042] It should be understood at the outset that although an illustrative implementation of one or more embodiments are provided below, the disclosed systems and/or methods may be implemented using any number of techniques, whether currently known or in existence. The disclosure should in no way be limited to the illustrative implementations, drawings, and techniques illustrated below, including the exemplary designs and implementations illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.
[0043] Video coding standards include International Telecommunication Union Telecommunication Standardization Sector (ITU-T) document H.261, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Motion Picture Experts Group (MPEG)-1 Part 2, ITU-T H.262 or ISO/IEC MPEG-2 Part 2, ITU-T H.263, ISO/IEC MPEG-4 Part 2, Advanced Video Coding (AVC), also known as ITU-T H.264 or ISO/IEC MPEG-4 Part 10, and High Efficiency Video Coding (HEVC), also known as ITU-T H.265 or MPEG-H Part 2. AVC includes extensions such as Scalable Video Coding (SVC), Multiview Video Coding (MVC) and Multiview Video Coding plus Depth (MVC+D), and three dimensional (3D) AVC (3D-AVC). HEVC includes extensions such as Scalable HEVC (SHVC), Multiview HEVC (MV-HEVC), and 3D HEVC (3D-HEVC).
[0044] File format standards include the ISO base media file format (ISOBMFF) (ISO/IEC 14496-12, hereinafter “ISO/IEC 14496-12”) and other file format standards derived from ISOBMFF, including MPEG-4 file format (ISO/IEC 14496-14), 3rd Generation Partnership Project (3GPP) file format (3GPP TS 26.244), and AVC file format (ISO/IEC 14496-15, hereinafter “ISO/IEC 14496-15”). Thus, ISO/IEC 14496-12 specifies the ISO base media file format. Other documents extend the ISO base media file format for specific applications. For instance, ISO/IEC 14496-15 describes the carriage of Network Abstraction Layer (NAL) unit structured video in the ISO base media file format. H.264/AVC and HEVC, as well as their extensions, are examples of NAL unit structured video. ISO/IEC 14496-15 includes sections describing the carriage of H.264/AVC NAL units. Additionally, section 8 of ISO/IEC 14496-15 describes the carriage of HEVC NAL units. Thus, section 8 of ISO/IEC 14496-15 is said to describe the HEVC file format.
[0045] ISOBMFF is used as the basis for many codec encapsulation formats, such as the AVC File Format, as well as for many multimedia container formats, such as the MPEG-4 File Format, the 3GPP File Format, and the digital video broadcasting (DVB) File Format. In addition to continuous media, such as audio and video, static media, such as images, as well as metadata, can be stored in a file conforming to ISOBMFF. Files structured according to ISOBMFF may be used for many purposes, including local media file playback, progressive downloading of a remote file, segments for Dynamic Adaptive Streaming over Hyper Text Transfer Protocol (HTTP) (DASH), containers for content to be streamed and corresponding packetization instructions, and recording of received real-time media streams. Thus, although designed for storage, ISOBMFF can be employed for streaming, e.g., for progressive download or DASH. For streaming purposes, movie fragments defined in ISOBMFF can be used.
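As a rough illustration of how a reader walks the box structure of an ISOBMFF file, the following Python sketch parses box headers (a 32-bit size followed by a four-character type, with the usual 64-bit largesize and to-end-of-buffer size conventions). It is a simplified sketch, not a complete ISOBMFF parser, and the synthetic example box is fabricated for demonstration only.

    import struct

    def iter_boxes(data, offset=0, end=None):
        """Yield (box_type, payload_start, payload_end) for the ISOBMFF boxes found in
        a byte buffer. Simplified: handles 32-bit sizes, 64-bit largesize, and size 0
        (box runs to the end of the buffer); malformed sizes stop the walk."""
        end = len(data) if end is None else end
        while offset + 8 <= end:
            (size,) = struct.unpack_from(">I", data, offset)
            box_type = data[offset + 4:offset + 8].decode("ascii", "replace")
            header = 8
            if size == 1:                       # 64-bit largesize follows the type
                (size,) = struct.unpack_from(">Q", data, offset + 8)
                header = 16
            elif size == 0:                     # box runs to the end of the buffer
                size = end - offset
            if size < header:                   # malformed box, stop walking
                return
            yield box_type, offset + header, offset + size
            offset += size

    # Example with a tiny synthetic 'ftyp' box (8-byte header plus 8-byte payload):
    sample = struct.pack(">I4s", 16, b"ftyp") + b"isom" + b"\x00\x00\x00\x01"
    print(list(iter_boxes(sample)))             # [('ftyp', 8, 16)]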
[0046] Such file formats and streaming mechanisms can be employed to encode, signal, decode, and display a VR video. In some cases, a VR video can be recorded from multiple viewpoints. As used herein, a viewpoint is the position of a camera used to capture video. For example, multiple cameras can be positioned at multiple locations to record a scene, an event, etc. In a VR context, such cameras may include a camera array and/or fisheye camera(s) capable of capturing wide angle video. For example, a VR camera mechanism can capture a sphere of video, or sub-portions thereof. Only a portion of the sphere may be displayed to a user. Such portion is displayed to the user on a viewport. Upon viewing, a user can control a viewing orientation from the viewpoint. This allows the user to react to the filmed environment as if the user were present at the viewpoint at the time of filming. When multiple viewpoints are employed, the user may be allowed to switch between the viewpoints. This allows the user to virtually move around the scene. As an example, a VR video can be taken of a basketball game from multiple viewpoints on, around, and/or above the court. In this case, a user may be allowed to view the game from a viewpoint of choice and at an orientation/angle of choice from the selected viewpoint.
[0047] A default viewing orientation/angle can be employed for each viewpoint. Accordingly, when a user switches to a viewpoint, the decoder can employ the default angle to orient the user until the user can select the desired viewing orientation. This implementation has certain drawbacks. For example, a user may wish to pay attention to a particular object in a scene, such as a basketball or a particular player in a basketball game. When default viewing orientations are employed, the user’s viewing angle is reset to the default value each time the user switches between viewpoints. Accordingly, a user viewing a basketball at a first viewpoint would be reoriented to a default angle upon switching to a second viewpoint. This would likely result in losing sight of the basketball. The user would then likely have to search for the current location of the basketball from the new viewpoint. The result is that default viewing orientations may create discontinuities in a user’s viewing experience and create a poor viewing experience in some cases.
[0048] Another approach is to encode correspondences between VR viewpoints. For example, video data related to the viewpoints may be included in tracks of a video file. A timed metadata track that contains data relevant to dynamic viewpoints can also be included in the video file. As used herein, a dynamic viewpoint is a viewpoint that moves over the course of the video, while a static viewpoint is a viewpoint that remains in a constant position over the course of the video. Correspondences between the viewpoints may be included in the timed metadata track. Such information can indicate correspondences between viewpoints that are part of the same viewpoint group. Specifically, such information can denote that a spatial region at a first viewpoint orients toward the same location as a corresponding spatial region at a second viewpoint. Using such information, a user can switch between dynamic viewpoints. Specifically, the decoder can automatically orient the user toward a spatial region at the destination viewpoint that corresponds to the orientation selected by the user at the initial viewpoint. As a specific example, a user watching a basketball at a first viewpoint can be automatically oriented toward the basketball upon switching to the second viewpoint. This allows the decoder to provide a consistent view to a user upon switching between viewpoints. However, housing such data in the timed metadata track may limit such functionality to dynamic viewpoints.
[0049] Disclosed herein are modifications to the preceding viewpoint switching mechanism that extends such functionality to static viewpoints. Specifically, the viewpoint group membership can be signaled by employing a viewpoint group identifier (ID), which is included in the track for each static (and/or dynamic) viewpoint. Upon receiving a request to switch viewpoints, the decoder can review the data from the track for the source viewpoint and from the track for the destination viewpoint for a viewpoint group ID. If the source viewpoint and the destination viewpoint have the same viewpoint group ID, correspondences between the viewpoints can be used to compute a destination viewport at the destination viewpoint based on the source viewport at the source viewpoint. ISOBMFF employs various boxes to include video data in a standardized manner. The viewpoint group IDs for viewpoints can be stored in the corresponding tracks by placing the viewpoint group ID in a viewpoint group box. In one example, the viewpoint group ID is included in a viewpoint group structure, which is further included in the viewpoint group box. In another example, the viewpoint group ID is included in a viewpoint group structure, which is further included in a viewpoint position structure, which is further included in the viewpoint group box. In another example, the viewpoint group ID is included in a viewpoint group structure, which is further included in a viewpoint information structure, which is further included in the viewpoint group box.
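The following Python sketch, which is illustrative rather than normative, mirrors the nesting alternatives described in this paragraph. The dictionary keys (viewpoint_group_struct, viewpoint_position_struct, viewpoint_information_struct, vwpt_group_id) are assumed names standing in for the corresponding boxes and structures; actual ISOBMFF parsing is omitted.

    # Hedged sketch: locating the viewpoint group ID inside a viewpoint group box that
    # has already been parsed into nested dictionaries (layout is hypothetical).

    def find_viewpoint_group_id(viewpoint_group_box):
        """Return the viewpoint group ID from whichever nesting alternative is used."""
        candidates = (
            viewpoint_group_box.get("viewpoint_group_struct", {}),
            viewpoint_group_box.get("viewpoint_position_struct", {})
                               .get("viewpoint_group_struct", {}),
            viewpoint_group_box.get("viewpoint_information_struct", {})
                               .get("viewpoint_group_struct", {}),
        )
        for group_struct in candidates:
            if "vwpt_group_id" in group_struct:
                return group_struct["vwpt_group_id"]
        return None  # No group signaled for this (static or dynamic) viewpoint.

    def viewpoints_share_group(box_a, box_b):
        """A destination viewport may be computed from the source viewport only when
        both viewpoints carry the same (non-missing) viewpoint group ID."""
        gid_a = find_viewpoint_group_id(box_a)
        gid_b = find_viewpoint_group_id(box_b)
        return gid_a is not None and gid_a == gid_b

    source_box = {"viewpoint_group_struct": {"vwpt_group_id": 1}}
    dest_box = {"viewpoint_position_struct": {"viewpoint_group_struct": {"vwpt_group_id": 1}}}
    print(viewpoints_share_group(source_box, dest_box))   # True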
[0050] FIG. 1 is a schematic diagram of an example system 100 for VR based video coding. System 100 includes a multi-directional camera 101, a VR coding device 104 including an encoder 103, a decoder 107, and a rendering device 109. The multi-directional camera 101 comprises an array of camera devices. Each camera device is pointed at a different angle so that the multi-directional camera 101 can take multiple directional video streams of the surrounding environment from a plurality of angles. For example, multi-directional camera 101 can take video of the environment as a sphere with the multi-directional camera 101 at the center of the sphere. As used herein, sphere and spherical video refers to both a geometrical sphere and sub-portions of a geometrical sphere, such as spherical caps, spherical domes, spherical segments, etc. For example, a multi-directional camera 101 may take one hundred and eighty degree video to cover half of the environment so that a production crew can remain behind the multi-directional camera 101. A multi-directional camera 101 can also take video in three hundred sixty degrees (or any sub-portion thereof). However, a portion of the floor under the multi-directional camera 101 may be omitted, which results in video of less than a perfect sphere. Hence, the term sphere, as used herein, is a general term used for clarity of discussion and should not be considered limiting from a geometrical standpoint. It should be noted that in some examples a multi-directional camera 101 may include a camera that includes one or more fisheye lenses (e.g., instead of an array of cameras).
[0051] Video from the multi-directional camera 101 is forwarded to the VR coding device 104. A VR coding device 104 may be a computing system including specialized VR coding software. The VR coding device 104 may include an encoder 103 (a.k.a., a video encoder). In some examples, the encoder 103 can also be included in a separate computer system from the VR coding device 104. The VR coding device 104 is configured to convert the multiple directional video streams into a single multiple directional video stream including the entire recorded area from all relevant angles. This conversion may be referred to as image stitching. For example, frames from each video stream that are captured at the same time can be stitched together to create a single spherical image. A spherical video stream can then be created from the spherical images. For clarity of discussion, it should be noted that the terms frame, picture, and image may be used interchangeably herein unless specifically noted.
[0052] The spherical video stream can then be forwarded to the encoder 103 for compression. An encoder 103 is a device and/or program capable of converting information from one format to another for purposes of standardization, speed, and/or compression. Standardized encoders 103 are configured to encode rectangular and/or square images. Accordingly, the encoder 103 is configured to map each spherical image from the spherical video stream into a plurality of rectangular sub-pictures. The sub-pictures can then be placed in separate sub-picture video streams. As such, each sub-picture video stream displays a stream of images over time as recorded from a sub-portion of the spherical video stream. The encoder 103 can then encode each sub-picture video stream to compress the video stream to a manageable file size. The encoding process is discussed in more detail below. In general, the encoder 103 partitions each frame from each sub-picture video stream into pixel blocks, compresses the pixel blocks by inter-prediction and/or intra-prediction to create coding blocks including prediction blocks and residual blocks, applies transforms to the residual blocks for further compression, and applies various filters to the blocks. The compressed blocks as well as corresponding syntax are stored in bitstream(s), for example in ISOBMFF and/or in omnidirectional media format (OMAF). For example, OMAF specifies the omnidirectional media format for coding, storage, delivery, and rendering of omnidirectional media, including video, images, audio, and timed text. OMAF may also be known as ISO/IEC International Standard 23090 Part 2 and/or MPEG N17827. In an OMAF player the user's viewing perspective is from the center of the sphere looking outward towards the inside surface of the sphere. OMAF supports both projected omnidirectional video and fisheye omnidirectional video.
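To make the sphere-to-rectangle mapping concrete, the following Python sketch maps a viewing direction to a pixel of an equirectangular frame and assigns that pixel to a rectangular sub-picture on a simple grid. Equirectangular projection and the 4x2 grid are assumptions chosen for the example; OMAF permits other projections and sub-picture layouts.

    def sphere_to_equirect_pixel(yaw_deg, pitch_deg, width, height):
        """Map a viewing direction (yaw in [-180, 180], pitch in [-90, 90]) to a pixel
        in an equirectangular frame. Simplified illustration only."""
        u = (yaw_deg + 180.0) / 360.0          # 0 at yaw = -180, 1 at yaw = +180
        v = (90.0 - pitch_deg) / 180.0         # 0 at the north pole, 1 at the south pole
        x = min(int(u * width), width - 1)
        y = min(int(v * height), height - 1)
        return x, y

    def subpicture_index(x, y, width, height, cols=4, rows=2):
        """Assign a pixel to one rectangular sub-picture on a simple cols x rows grid."""
        return (y * rows // height) * cols + (x * cols // width)

    x, y = sphere_to_equirect_pixel(0.0, 0.0, 3840, 1920)   # centre of the frame
    print(x, y)                                             # 1920 960
    print(subpicture_index(x, y, 3840, 1920))               # sub-picture containing it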
[0053] The VR coding device 104 may store the encoded bitstream(s) in memory, locally, and/or on a server, for communication to a decoder 107 on demand. The data can be forwarded via a network 105, which may include the Internet, a mobile telecommunications network (e.g., a long term evolution (LTE) based data network), or other data communication system.
[0054] The decoder 107 (a.k.a., a video decoder) is a device at a user's location that is configured to reverse the coding process to reconstruct the sub-picture video streams from the encoded bitstream(s). The decoder 107 also merges the sub-picture video streams to reconstruct the spherical video stream. The spherical video stream, or sub-portions thereof, can then be forwarded to the rendering device 109. The rendering device 109 is a device configured to display the spherical video stream to the user. For example, the rendering device 109 may include a head-mounted display (HMD) that attaches to the user's head and covers the user's eyes. The rendering device 109 may include a screen for each eye, cameras, motion sensors, speakers, etc. and may communicate with the decoder 107 via wireless and/or wired connections. The rendering device 109 may display a sub-portion of the spherical video stream to the user. The sub-portion shown is based on a field of view (FOV) and/or viewport of the rendering device 109. A FOV is the observable area of the recorded environment that is displayed to a user by the rendering device 109. The FOV can be described as a conical projection extending from a user's eye into the virtual environment. A viewport is a two dimensional plane upon which a three dimensional environment is projected. Accordingly, a viewport describes the area of a portion of the virtual environment displayed on a screen or screens of a rendering device, while a FOV describes the portion of the virtual environment seen by the user. Hence, viewport and FOV may be used interchangeably in many cases, but may include different technical details. For example, a FOV can be described in terms of pixels, coordinates, and/or bounds while a viewport can be described in terms of angles. The rendering device 109 may change the position of the FOV/viewport based on user head movement by employing the motion tracking sensors. This allows the user to see different portions of the spherical video stream depending on head movement. Further, the rendering device 109 may offset the FOV for each eye based on the user's interpupillary distance (IPD) to create the impression of a three dimensional space. In other cases, the rendering device 109 may be a computer screen or television screen that changes a FOV/viewport based on user input.
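The following Python sketch shows one simple way to test whether a viewing direction falls inside a viewport described in terms of angles, as discussed above. The rectangular yaw/pitch test and the 90-degree default ranges are simplifying assumptions for illustration; a rendering device would use the actual projection geometry.

    def wrap_angle(deg):
        """Wrap an angle in degrees to the range [-180, 180)."""
        return (deg + 180.0) % 360.0 - 180.0

    def direction_in_viewport(yaw, pitch, centre_yaw, centre_pitch,
                              hor_fov=90.0, ver_fov=90.0):
        """Approximate containment test: treats the viewport as a yaw/pitch rectangle,
        which ignores spherical distortion near the poles but is a useful first cut."""
        d_yaw = abs(wrap_angle(yaw - centre_yaw))
        d_pitch = abs(pitch - centre_pitch)
        return d_yaw <= hor_fov / 2.0 and d_pitch <= ver_fov / 2.0

    print(direction_in_viewport(20.0, 10.0, 0.0, 0.0))    # True
    print(direction_in_viewport(200.0, 0.0, 0.0, 0.0))    # False (wraps to -160 degrees)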
[0055] FIG. 2 is a flowchart of an example method 200 of coding a VR picture bitstream as a plurality of sub-picture bitstreams, for example by employing the components of system 100. At step 201, a multi-directional camera set, such as multi-directional camera 101, is used to capture multiple directional video streams. The multiple directional video streams include views of an environment at various angles. For example, the multiple directional video streams may capture video from three hundred sixty degrees, one hundred eighty degrees, two hundred forty degrees, etc. around the camera in the horizontal plane. The multiple directional video streams may also capture video from three hundred sixty degrees, one hundred eighty degrees, two hundred forty degrees, etc. around the camera in the vertical plane. The result is to create video that includes information sufficient to cover a spherical area around the camera over some period of time.
[0056] At step 203, the multiple directional video streams are synchronized in the time domain. Specifically, each directional video stream includes a series of images taken at a corresponding angle. The multiple directional video streams are synchronized by ensuring frames from each directional video stream that were captured at the same time domain position are processed together. The frames from the directional video streams can then be stitched together in the space domain to create a spherical video stream. Hence, each frame of the spherical video stream contains data taken from the frames of all the directional video streams that occur at a common temporal position. It should be noted that a fisheye lens may capture a single video stream at a wide angle. Hence, when a fisheye lens is employed, a single multi-directional stream may be captured at step 201, which may allow step 203 to be omitted in some cases.
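A minimal Python sketch of the time-domain synchronization in step 203 is shown below, assuming each directional stream is available as a list of (timestamp, frame) pairs; the data layout is hypothetical and the spatial stitching itself is not shown.

    from collections import defaultdict

    def synchronize(directional_streams):
        """Group frames from several directional streams by capture timestamp so that
        frames taken at the same instant can be stitched into one spherical frame."""
        by_time = defaultdict(dict)
        for camera_id, stream in directional_streams.items():
            for timestamp, frame in stream:
                by_time[timestamp][camera_id] = frame
        # Keep only instants for which every camera contributed a frame.
        complete = {t: frames for t, frames in by_time.items()
                    if len(frames) == len(directional_streams)}
        return [complete[t] for t in sorted(complete)]

    streams = {"cam0": [(0, "f0a"), (40, "f1a")], "cam1": [(0, "f0b"), (40, "f1b")]}
    print(len(synchronize(streams)))   # 2 synchronized frame sets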
[0057] At step 205, the spherical video stream is mapped into rectangular sub-picture video streams. This process may also be referred to as projecting the spherical video stream into rectangular sub-picture video streams. As noted above, encoders and decoders are generally designed to encode rectangular and/or square frames. Accordingly, mapping the spherical video stream into rectangular sub-picture video streams creates video streams that can be encoded and decoded by non-VR specific encoders and decoders, respectively. It should be noted that steps 203 and 205 are specific to VR video processing, and hence may be performed by specialized VR hardware, software, or combinations thereof.
[0058] At step 207, the rectangular sub-picture video streams can be forwarded to an encoder, such as encoder 103. The encoder then encodes the sub-picture video streams as sub-picture bitstreams in a corresponding media file format. Specifically, each sub-picture video stream can be treated by the encoder as a video signal. The encoder can encode each frame of each sub-picture video stream via inter-prediction, intra-prediction, etc. Such encoding and corresponding decoding as well as encoders and decoders are discussed in detail with respect to the FIGS below. Regarding file format, the sub-picture video streams can be stored in ISOBMFF. For example, the sub-picture video streams are captured at a specified resolution. The sub-picture video streams can then be downsampled to various lower resolutions for encoding. Each resolution can be referred to as a representation. Lower quality representations lose image clarity while reducing file size. Accordingly, lower quality representations can be transmitted to a user using fewer network resources (e.g., time, bandwidth, etc.) than higher quality representations with an attendant loss of visual quality. Each representation can be stored in a corresponding set of tracks. Hence, tracks can be sent to a user, where the tracks include the sub-picture bitstreams at various resolutions (e.g., visual quality). As used herein, a track is a container for a complete set of media related data for a specific playable media sequence. For example, a video track contains a complete set of video information for a playable video sequence at a corresponding resolution. As another example, an audio track contains a complete set of audio information for a playable audio sequence at a corresponding sound quality. As yet another example, a metadata track contains a complete set of metadata information usable to interpret media data in one or more specified tracks.
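The following Python sketch lists the kind of representation (resolution tier) table that could be generated for one sub-picture stream before each tier is stored in its own track. The scale factors and the relative_quality field are illustrative placeholders, not values taken from any specification.

    def build_representations(base_width, base_height, scales=(1.0, 0.5, 0.25)):
        """Describe resolution tiers for one sub-picture stream; each tier would be
        encoded separately and stored in its own track."""
        reps = []
        for rep_id, scale in enumerate(scales):
            reps.append({
                "representation_id": rep_id,
                "width": int(base_width * scale),
                "height": int(base_height * scale),
                "relative_quality": scale,       # 1.0 = full capture resolution
            })
        return reps

    for rep in build_representations(1920, 1920):
        print(rep)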
[0059] At step 209, the sub-picture bitstreams can be sent to the decoder as tracks. In some examples, all sub-picture bitstreams are transmitted at the same quality by transmitting tracks from the same representation. In other cases, the tracks containing sub-picture bitstreams with data in the user's FOV may be sent at higher resolutions by selecting higher quality representations. Tracks containing sub-picture bitstreams with areas outside the user's FOV can be sent at progressively lower resolutions by selecting lower quality representations. This may be referred to as viewport dependent coding. The tracks may include relatively short video segments (e.g., about three seconds), and hence the representations selected for particular areas of the video can change over time based on changes in FOV. This allows quality to change as the user's FOV changes.
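A simple Python sketch of viewport dependent selection is given below: sub-pictures inside the user's FOV are assigned the highest-quality representation and the remaining sub-pictures a lower tier. The tier numbering and inputs are assumptions for the example; a practical player would also weigh available bandwidth.

    def select_representations(subpictures, fov_subpictures, num_tiers=3):
        """Return a representation tier per sub-picture: tier 0 (highest quality) for
        sub-pictures covering the FOV, the lowest tier for everything else."""
        choices = {}
        for sp in subpictures:
            choices[sp] = 0 if sp in fov_subpictures else num_tiers - 1
        return choices

    print(select_representations(range(8), fov_subpictures={2, 3}))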
[0060] At step 211, a decoder, such as decoder 107, receives the tracks containing the sub-picture bitstreams. The decoder can then decode the sub-picture bitstreams into sub-picture video streams for display. The decoding process involves the reverse of the encoding process (e.g., using inter-prediction and intra-prediction), and is discussed in more detail with respect to the FIGS below.
[0061] At step 213, the decoder can merge the sub-picture video streams into the spherical video stream for presentation on a rendering device. For example, the decoder can employ a so-called lightweight merging algorithm that selects frames from each sub-picture video stream that occur at the same presentation time and merges them together based on the position and/or angle associated with the corresponding sub-picture video stream. The decoder may also employ filters to smooth edges between the sub-picture video streams, remove artifacts, etc. The decoder can then forward the spherical video stream to a rendering device, such as rendering device 109.
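The following Python sketch illustrates a lightweight merge of decoded sub-pictures into one frame for a single presentation time. Frames are represented as 2-D lists purely to keep the example dependency-free; edge smoothing between sub-pictures is only indicated by a comment.

    def merge_subpictures(subpicture_frames, cols, rows, sub_w, sub_h):
        """Copy each decoded sub-picture into its slot of a full frame, keyed by the
        sub-picture index on a simple cols x rows grid."""
        frame = [[0] * (cols * sub_w) for _ in range(rows * sub_h)]
        for index, sub in subpicture_frames.items():
            x0, y0 = (index % cols) * sub_w, (index // cols) * sub_h
            for y in range(sub_h):
                frame[y0 + y][x0:x0 + sub_w] = sub[y]
            # Edge filtering between neighbouring sub-pictures would be applied here.
        return frame

    subs = {i: [[i] * 4 for _ in range(2)] for i in range(8)}   # 8 tiny sub-pictures
    merged = merge_subpictures(subs, cols=4, rows=2, sub_w=4, sub_h=2)
    print(len(merged), len(merged[0]))                          # 4 rows, 16 columns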
[0062] At step 215, the rendering device renders a viewport of the spherical video stream for presentation to the user. As mentioned above, areas of the spherical video stream outside of the FOV at each point in time are not rendered. As such, the user can select and view a sub-portion of the virtual environment as recorded, and hence can experience the virtual environment as if present at the time of recording.
[0063] FIG. 3 is a flowchart of an example method 300 of coding a video signal. For example, method 300 may receive a plurality of sub-picture video streams from step 205 of method 200. Method 300 treats each sub-picture video stream as a video signal input. Method 300 applies steps 301-317 to each sub-picture video stream in order to implement steps 207-211 of method 200. Hence, the output video signal from method 300 includes the decoded sub-picture video streams, which can be merged and displayed according to steps 213 and 215 of method 200. As such, method 300 can be implemented on a system 100.
[0064] Method 300 encodes a video signal, for example including sub-picture video streams, at an encoder. The encoding process compresses the video signal by employing various mechanisms to reduce the video file size. A smaller file size allows the compressed video file to be transmitted toward a user, while reducing associated bandwidth overhead. The decoder then decodes the compressed video file to reconstruct the original video signal for display to an end user. The decoding process generally mirrors the encoding process to allow the decoder to consistently reconstruct the video signal.
[0065] At step 301, the video signal is input into the encoder. For example, the video signal may be an uncompressed video file stored in memory. As another example, the video file may be captured by a video capture device, such as a video camera, and encoded to support live streaming of the video. The video file may include both an audio component and a video component. The video component contains a series of image frames that, when viewed in a sequence, gives the visual impression of motion. The frames contain pixels that are expressed in terms of light, referred to herein as luma components (or luma samples), and color, which is referred to as chroma components (or color samples). It should be noted that a frame may also be referred to as a picture, a sub-frame as a sub-picture, etc.
[0066] At step 303, the video signal is partitioned into blocks. Partitioning includes subdividing the pixels in each frame into square and/or rectangular blocks for compression. For example, in HEVC (also known as H.265 and MPEG-H Part 2) the frame can first be divided into coding tree units (CTUs), which are blocks of a predefined size (e.g., sixty four pixels by sixty four pixels). The CTUs contain both luma and chroma samples. Coding trees may be employed to divide the CTUs into blocks and then recursively subdivide the blocks until configurations are achieved that support further encoding. For example, luma components of a frame may be subdivided until the individual blocks contain relatively homogenous lighting values. Further, chroma components of a frame may be subdivided until the individual blocks contain relatively homogenous color values. Accordingly, partitioning mechanisms vary depending on the content of the video frames.
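The following Python sketch illustrates recursive block partitioning in the spirit of a coding tree: a square block is split until its samples are roughly homogeneous or a minimum size is reached. The homogeneity threshold and minimum size are arbitrary illustrative values, not values taken from HEVC.

    def split_block(samples, x, y, size, min_size=8, threshold=16):
        """Return a list of (x, y, size) leaf blocks for the square block at (x, y).
        'samples' is a 2-D list of luma values."""
        block = [samples[y + j][x + i] for j in range(size) for i in range(size)]
        if size <= min_size or max(block) - min(block) <= threshold:
            return [(x, y, size)]                      # leaf block, coded as a unit
        half = size // 2
        leaves = []
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(split_block(samples, x + dx, y + dy, half,
                                          min_size, threshold))
        return leaves

    flat_ctu = [[10] * 64 for _ in range(64)]          # perfectly flat 64x64 block
    print(len(split_block(flat_ctu, 0, 0, 64)))        # 1: no split needed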
[0067] At step 305, various compression mechanisms are employed to compress the image blocks partitioned at step 303. For example, inter-prediction and/or intra-prediction may be employed. Inter-prediction is designed to take advantage of the fact that objects in a common scene tend to appear in successive frames. Accordingly, a block depicting an object in a reference frame need not be repeatedly described in adjacent frames. Specifically, an object, such as a table, may remain in a constant position over multiple frames. Hence the table is described once and adjacent frames can refer back to the reference frame. Pattern matching mechanisms may be employed to match objects over multiple frames. Further, moving objects may be represented across multiple frames, for example due to object movement or camera movement. As a particular example, a video may show an automobile that moves across the screen over multiple frames. Motion vectors can be employed to describe such movement, or lack thereof. A motion vector is a two-dimensional vector that provides an offset from the coordinates of an object in a frame to the coordinates of the object in a reference frame. As such, inter-prediction can encode an image block in a current frame as a set of motion vectors indicating an offset from a corresponding block in a reference frame.
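A short Python sketch of inter-prediction with an integer motion vector is shown below: the prediction for a block is copied from a displaced block in a reference frame, and the residual is the sample-wise difference. Fractional-sample interpolation and boundary handling are omitted for brevity.

    def predict_block_inter(reference, x, y, width, height, mv_x, mv_y):
        """Copy the block displaced by the motion vector (mv_x, mv_y) from the
        reference frame to form the prediction for the block at (x, y)."""
        return [row[x + mv_x:x + mv_x + width]
                for row in reference[y + mv_y:y + mv_y + height]]

    def residual(current_block, predicted_block):
        """The residual (later transformed and coded) is the sample-wise difference."""
        return [[c - p for c, p in zip(crow, prow)]
                for crow, prow in zip(current_block, predicted_block)]

    ref = [[(r * 16 + c) % 255 for c in range(16)] for r in range(16)]
    pred = predict_block_inter(ref, x=4, y=4, width=4, height=4, mv_x=-2, mv_y=-2)
    print(pred[0])   # samples copied from position (2, 2) of the reference frame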
[0068] Intra-prediction encodes blocks in a common frame. Intra-prediction takes advantage of the fact that luma and chroma components tend to cluster in a frame. For example, a patch of green in a portion of a tree tends to be positioned adjacent to similar patches of green. Intra-prediction employs multiple directional prediction modes (e.g., thirty three in HEVC), a planar mode, and a direct current (DC) mode. The directional modes indicate that a current block is similar/the same as samples of a neighbor block in a corresponding direction. Planar mode indicates that a series of blocks along a row/column (e.g., a plane) can be interpolated based on neighbor blocks at the edges of the row. Planar mode, in effect, indicates a smooth transition of light/color across a row/column by employing a relatively constant slope in changing values. DC mode is employed for boundary smoothing and indicates that a block is similar/the same as an average value associated with samples of all the neighbor blocks associated with the angular directions of the directional prediction modes. Accordingly, intra-prediction blocks can represent image blocks as various relational prediction mode values instead of the actual values. Further, inter-prediction blocks can represent image blocks as motion vector values instead of the actual values. In either case, the prediction blocks may not exactly represent the image blocks in some cases. Any differences are stored in residual blocks. Transforms may be applied to the residual blocks to further compress the file.
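The following Python sketch shows DC-mode intra-prediction, the simplest of the modes described above: every sample of the block is predicted as the average of the reconstructed neighboring samples. The directional and planar modes are omitted, and the neighbor values in the example are arbitrary.

    def predict_block_dc(top_neighbors, left_neighbors, size):
        """Predict a size x size block as the average of the neighboring samples above
        and to the left; falls back to a mid-grey value when no neighbors exist."""
        neighbors = list(top_neighbors) + list(left_neighbors)
        dc = round(sum(neighbors) / len(neighbors)) if neighbors else 128
        return [[dc] * size for _ in range(size)]

    print(predict_block_dc([100, 102, 101, 99], [98, 100, 103, 100], size=4)[0])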
[0069] At step 307, various filtering techniques may be applied. In HEVC, the filters are applied according to an in-loop filtering scheme. The block based prediction discussed above may result in the creation of blocky images at the decoder. Further, the block based prediction scheme may encode a block and then reconstruct the encoded block for later use as a reference block. The in-loop filtering scheme iteratively applies noise suppression filters, de-blocking filters, adaptive loop filters, and sample adaptive offset (SAO) filters to the blocks/frames. These filters mitigate such blocking artifacts so that the encoded file can be accurately reconstructed. Further, these filters mitigate artifacts in the reconstructed reference blocks so that artifacts are less likely to create additional artifacts in subsequent blocks that are encoded based on the reconstructed reference blocks.
[0070] Once the video signal has been partitioned, compressed, and filtered, the resulting data is encoded in a bitstream at step 309. The bitstream includes the data discussed above as well as any signaling data (e.g., syntax) desired to support proper video signal reconstruction at the decoder. For example, such data may include partition data, prediction data, residual blocks, and various flags providing coding instructions to the decoder. The bitstream may be stored in memory for transmission toward a decoder upon request, for example as a track and/or track fragment in ISOBMFF. The bitstream may also be broadcast and/or multicast toward a plurality of decoders. The creation of the bitstream is an iterative process. Accordingly, steps 301, 303, 305, 307, and 309 may occur continuously and/or simultaneously over many frames and blocks. The order shown is presented for clarity and ease of discussion, and is not intended to limit the video coding process to a particular order.
[0071] The decoder receives the bitstream and begins the decoding process at step 311. For example, the decoder can employ an entropy decoding scheme to convert the bitstream into corresponding syntax and video data. The decoder employs the syntax data from the bitstream to determine the partitions for the frames at step 311. The partitioning should match the results of block partitioning at step 303. Entropy encoding/decoding, which may be employed in step 311, is now described. The encoder makes many choices during the compression process, such as selecting block partitioning schemes from several possible choices based on the spatial positioning of values in the input image(s). Signaling the exact choices may employ a large number of bins. As used herein, a bin is a binary value that is treated as a variable (e.g., a bit value that may vary depending on context). Entropy coding allows the encoder to discard any options that are clearly not viable for a particular case, leaving a set of allowable options. Each allowable option is then assigned a code word. The length of the code word is based on the number of allowable options (e.g., one bin for two options, two bins for three to four options, etc.). The encoder then encodes the code word for the selected option. This scheme reduces the size of the code words as the code words are as big as desired to uniquely indicate a selection from a small sub-set of allowable options as opposed to uniquely indicating the selection from a potentially large set of all possible options. The decoder then decodes the selection by determining the set of allowable options in a similar manner to the encoder. By determining the set of allowable options, the decoder can read the code word and determine the selection made by the encoder.
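The relationship between the number of allowable options and the codeword length described above can be sketched as follows. This is a simplified fixed-length illustration rather than the arithmetic coding (CABAC) actually used in HEVC, and the option names are hypothetical.
import math

def bins_needed(num_options):
    """Number of bins required to distinguish the allowable options."""
    return max(1, math.ceil(math.log2(num_options)))

def encode_choice(allowable, choice):
    """Encode a choice as a fixed-length bin string over the allowable subset."""
    index = allowable.index(choice)
    return format(index, '0{}b'.format(bins_needed(len(allowable))))

def decode_choice(allowable, bins):
    """The decoder re-derives the same allowable subset, so the index is enough."""
    return allowable[int(bins, 2)]

all_modes = ['QT', 'BT_H', 'BT_V', 'TT_H', 'TT_V', 'NO_SPLIT']   # hypothetical split modes
allowable = ['QT', 'BT_H', 'NO_SPLIT']                           # options left after context rules
bins = encode_choice(allowable, 'NO_SPLIT')                      # '10': two bins instead of three
assert decode_choice(allowable, bins) == 'NO_SPLIT'
print(bins)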
[0072] At step 313, the decoder performs block decoding. Specifically, the decoder employs reverse transforms to generate residual blocks. Then the decoder employs the residual blocks and corresponding prediction blocks to reconstruct the image blocks according to the partitioning. The prediction blocks may include both intra-prediction blocks and inter prediction blocks as generated at the encoder at step 305. The reconstructed image blocks are then positioned into frames of a reconstructed video signal according to the partitioning data determined at step 311. Syntax for step 313 may also be signaled in the bitstream via entropy coding as discussed above.
[0073] At step 315, filtering is performed on the frames of the reconstructed video signal in a manner similar to step 307 at the encoder. For example, noise suppression filters, de-blocking filters, adaptive loop filters, and SAO filters may be applied to the frames to remove blocking artifacts. Once the frames are filtered, the video signal can be forwarded for merging at step 317 and then output to a display, such as an HMD, for viewing by an end user.
[0074] FIG. 4 is a schematic diagram of an example coding and decoding (codec) system 400 for video coding. Specifically, codec system 400 provides functionality to support encoding and decoding sub-picture video streams according to methods 200 and 300. Further, codec system 400 can be employed to implement an encoder 103 and/or a decoder 107 of system 100.
[0075] Codec system 400 is generalized to depict components employed in both an encoder and a decoder. Codec system 400 receives and partitions frames from a video signal (e.g., including a sub-picture video stream) as discussed with respect to steps 301 and 303 in operating method 300, which results in a partitioned video signal 401. Codec system 400 then compresses the partitioned video signal 401 into a coded bitstream when acting as an encoder as discussed with respect to steps 305, 307, and 309 in method 300. When acting as a decoder, codec system 400 generates an output video signal from the bitstream as discussed with respect to steps 311, 313, 315, and 317 in operating method 300. The codec system 400 includes a general coder control component 411, a transform scaling and quantization component 413, an intra-picture estimation component 415, an intra-picture prediction component 417, a motion compensation component 419, a motion estimation component 421, a scaling and inverse transform component 429, a filter control analysis component 427, an in-loop filters component 425, a decoded picture buffer component 423, and a header formatting and context adaptive binary arithmetic coding (CABAC) component 431. Such components are coupled as shown. In FIG. 4, black lines indicate movement of data to be encoded/decoded while dashed lines indicate movement of control data that controls the operation of other components. The components of codec system 400 may all be present in the encoder. The decoder may include a subset of the components of codec system 400. For example, the decoder may include the intra-picture prediction component 417, the motion compensation component 419, the scaling and inverse transform component 429, the in-loop filters component 425, and the decoded picture buffer component 423. These components are now described.
[0076] The partitioned video signal 401 is a captured video sequence that has been partitioned into blocks of pixels by a coding tree. A coding tree employs various split modes to subdivide a block of pixels into smaller blocks of pixels. These blocks can then be further subdivided into smaller blocks. The blocks may be referred to as nodes on the coding tree. Larger parent nodes are split into smaller child nodes. The number of times a node is subdivided is referred to as the depth of the node/coding tree. The divided blocks can be included in coding units (CUs) in some cases. For example, a CU can be a sub-portion of a CTU that contains a luma block, red difference chroma (Cr) block(s), and blue difference chroma (Cb) block(s) along with corresponding syntax instructions for the CU. The split modes may include a binary tree (BT), triple tree (TT), and a quad tree (QT) employed to partition a node into two, three, or four child nodes, respectively, of varying shapes depending on the split modes employed. The partitioned video signal 401 is forwarded to the general coder control component 411, the transform scaling and quantization component 413, the intra-picture estimation component 415, the filter control analysis component 427, and the motion estimation component 421 for compression.
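A minimal sketch of the recursive coding-tree subdivision described above is given below. The split decision here uses a simple variance threshold purely as a stand-in; a real encoder chooses among BT, TT, and QT splits with rate-distortion optimization, so the criterion, block sizes, and depth limit are all assumptions for illustration.
import numpy as np

def quad_tree_split(block, top, left, size, depth, max_depth=3, var_threshold=500.0):
    """Recursively split a block into four child nodes while it looks 'busy'.

    Returns a list of (top, left, size, depth) leaf nodes. The variance test is
    only a stand-in for the encoder's real rate-distortion based decision."""
    region = block[top:top + size, left:left + size]
    if depth >= max_depth or size <= 4 or np.var(region) < var_threshold:
        return [(top, left, size, depth)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quad_tree_split(block, top + dy, left + dx, half,
                                      depth + 1, max_depth, var_threshold)
    return leaves

ctu = np.random.randint(0, 256, (64, 64))     # one CTU-sized block of luma samples
for leaf in quad_tree_split(ctu, 0, 0, 64, 0)[:4]:
    print(leaf)                               # (top, left, size, depth) of each leaf node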
[0077] The general coder control component 411 is configured to make decisions related to coding of the images of the video sequence into the bitstream according to application constraints. For example, the general coder control component 411 manages optimization of bitrate/bitstream size versus reconstruction quality. Such decisions may be made based on storage space/bandwidth availability and image resolution requests. The general coder control component 411 also manages buffer utilization in light of transmission speed to mitigate buffer underrun and overrun issues. To manage these issues, the general coder control component 411 manages partitioning, prediction, and filtering by the other components. For example, the general coder control component 411 may dynamically increase compression complexity to increase resolution and increase bandwidth usage or decrease compression complexity to decrease resolution and bandwidth usage. Hence, the general coder control component 411 controls the other components of codec system 400 to balance video signal reconstruction quality with bitrate concerns. The general coder control component 411 creates control data, which controls the operation of the other components. The control data is also forwarded to the header formatting and CABAC component 431 to be encoded in the bitstream to signal parameters for decoding at the decoder.
[0078] The partitioned video signal 401 is also sent to the motion estimation component 421 and the motion compensation component 419 for inter-prediction. A frame or slice of the partitioned video signal 401 may be divided into multiple video blocks. Motion estimation component 421 and the motion compensation component 419 perform inter-predictive coding of the received video block relative to one or more blocks in one or more reference frames to provide temporal prediction. Codec system 400 may perform multiple coding passes, e.g., to select an appropriate coding mode for each block of video data.
[0079] Motion estimation component 421 and motion compensation component 419 may be highly integrated, but are illustrated separately for conceptual purposes. Motion estimation, performed by motion estimation component 421, is the process of generating motion vectors, which estimate motion for video blocks. A motion vector, for example, may indicate the displacement of a coded object relative to a predictive block. A predictive block is a block that is found to closely match the block to be coded, in terms of pixel difference. A predictive block may also be referred to as a reference block. Such pixel difference may be determined by sum of absolute difference (SAD), sum of square difference (SSD), or other difference metrics. HEVC employs several coded objects including a CTU, coding tree blocks (CTBs), and CUs. For example, a CTU can be divided into CTBs, which can then be divided into CBs for inclusion in CUs. A CU can be encoded as a prediction unit (PU) containing prediction data and/or a transform unit (TU) containing transformed residual data for the CU. The motion estimation component 421 generates motion vectors, PUs, and TUs by using a rate-distortion analysis as part of a rate distortion optimization process. For example, the motion estimation component 421 may determine multiple reference blocks, multiple motion vectors, etc. for a current block/frame, and may select the reference blocks, motion vectors, etc. having the best rate-distortion characteristics. The best rate-distortion characteristics balance both quality of video reconstruction (e.g., amount of data loss by compression) with coding efficiency (e.g., size of the final encoding).
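To make the SAD-based search concrete, the sketch below performs a brute-force full search in a small window around the current block; the window size, block size, and integer-pixel-only precision are simplifications, since practical motion estimation also uses fast search patterns, fractional-pixel refinement, and rate-distortion costs.
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def full_search(current, reference, top, left, block=8, search=4):
    """Brute-force motion search in a +/- search window; returns (mv, best_sad)."""
    target = current[top:top + block, left:left + block]
    best_mv, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = top + dy, left + dx
            if ry < 0 or rx < 0 or ry + block > reference.shape[0] or rx + block > reference.shape[1]:
                continue
            cost = sad(target, reference[ry:ry + block, rx:rx + block])
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost

ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(ref, shift=3, axis=0)                   # content moved down by three rows
print(full_search(cur, ref, 24, 24))                  # expected motion vector (-3, 0) with SAD 0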
[0080] In some examples, codec system 400 may calculate values for sub-integer pixel positions of reference pictures stored in decoded picture buffer component 423. For example, video codec system 400 may interpolate values of one-quarter pixel positions, one-eighth pixel positions, or other fractional pixel positions of the reference picture. Therefore, motion estimation component 421 may perform a motion search relative to the full pixel positions and fractional pixel positions and output a motion vector with fractional pixel precision. The motion estimation component 421 calculates a motion vector for a PU of a video block in an inter-coded slice by comparing the position of the PU to the position of a predictive block of a reference picture. Motion estimation component 421 outputs the calculated motion vector as motion data to the header formatting and CABAC component 431 for encoding and motion to the motion compensation component 419.
[0081] Motion compensation, performed by motion compensation component 419, may involve fetching or generating the predictive block based on the motion vector determined by motion estimation component 421. Again, motion estimation component 421 and motion compensation component 419 may be functionally integrated, in some examples. Upon receiving the motion vector for the PU of the current video block, motion compensation component 419 may locate the predictive block to which the motion vector points. A residual video block is then formed by subtracting pixel values of the predictive block from the pixel values of the current video block being coded, forming pixel difference values. In general, motion estimation component 421 performs motion estimation relative to luma components, and motion compensation component 419 uses motion vectors calculated based on the luma components for both chroma components and luma components. The predictive block and residual block are forwarded to transform scaling and quantization component 413.
[0082] The partitioned video signal 401 is also sent to intra-picture estimation component 415 and intra-picture prediction component 417. As with motion estimation component 421 and motion compensation component 419, intra-picture estimation component 415 and intra-picture prediction component 417 may be highly integrated, but are illustrated separately for conceptual purposes. The intra-picture estimation component 415 and intra-picture prediction component 417 intra-predict a current block relative to blocks in a current frame, as an alternative to the inter-prediction performed by motion estimation component 421 and motion compensation component 419 between frames, as described above. In particular, the intra-picture estimation component 415 determines an intra-prediction mode to use to encode a current block. In some examples, intra-picture estimation component 415 selects an appropriate intra-prediction mode to encode a current block from multiple tested intra-prediction modes. The selected intra-prediction modes are then forwarded to the header formatting and CABAC component 431 for encoding.
[0083] For example, the intra-picture estimation component 415 calculates rate-distortion values using a rate-distortion analysis for the various tested intra-prediction modes, and selects the intra-prediction mode having the best rate-distortion characteristics among the tested modes. Rate-distortion analysis generally determines an amount of distortion (or error) between an encoded block and an original unencoded block that was encoded to produce the encoded block, as well as a bitrate (e.g., a number of bits) used to produce the encoded block. The intra-picture estimation component 415 calculates ratios from the distortions and rates for the various encoded blocks to determine which intra-prediction mode exhibits the best rate-distortion value for the block. In addition, intra-picture estimation component 415 may be configured to code depth blocks of a depth map using a depth modeling mode (DMM) based on rate-distortion optimization (RDO).
[0084] The intra-picture prediction component 417 may generate a residual block from the predictive block based on the selected intra-prediction modes determined by intra-picture estimation component 415 when implemented on an encoder or read the residual block from the bitstream when implemented on a decoder. The residual block includes the difference in values between the predictive block and the original block, represented as a matrix. The residual block is then forwarded to the transform scaling and quantization component 413. The intra-picture estimation component 415 and the intra-picture prediction component 417 may operate on both luma and chroma components.
[0085] The transform scaling and quantization component 413 is configured to further compress the residual block. The transform scaling and quantization component 413 applies a transform, such as a discrete cosine transform (DCT), a discrete sine transform (DST), or a conceptually similar transform, to the residual block, producing a video block comprising residual transform coefficient values. Wavelet transforms, integer transforms, sub-band transforms, or other types of transforms could also be used. The transform may convert the residual information from a pixel value domain to a transform domain, such as a frequency domain. The transform scaling and quantization component 413 is also configured to scale the transformed residual information, for example based on frequency. Such scaling involves applying a scale factor to the residual information so that different frequency information is quantized at different granularities, which may affect final visual quality of the reconstructed video. The transform scaling and quantization component 413 is also configured to quantize the transform coefficients to further reduce bitrate. The quantization process may reduce the bit depth associated with some or all of the coefficients. The degree of quantization may be modified by adjusting a quantization parameter. In some examples, the transform scaling and quantization component 413 may then perform a scan of the matrix including the quantized transform coefficients. The quantized transform coefficients are forwarded to the header formatting and CABAC component 431 to be encoded in the bitstream.
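The following sketch applies a textbook orthonormal 8x8 DCT and a single uniform quantization step to a residual block to show the transform-then-quantize flow described above; the actual HEVC integer transforms, frequency-dependent scaling lists, and quantization parameter handling are more involved, so the transform matrix and step size here are illustrative assumptions.
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (textbook form, not the HEVC integer transform)."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] *= 1.0 / np.sqrt(2)
    return m * np.sqrt(2.0 / n)

def transform_and_quantize(residual, step=4):
    d = dct_matrix(residual.shape[0])
    coeffs = d @ residual @ d.T                  # forward 2-D transform
    return np.round(coeffs / step).astype(np.int32)

def dequantize_and_inverse(levels, step=4):
    d = dct_matrix(levels.shape[0])
    return d.T @ (levels * step) @ d             # inverse 2-D transform

residual = np.random.randint(-20, 21, (8, 8)).astype(np.float64)
levels = transform_and_quantize(residual)
recon = dequantize_and_inverse(levels)
print(np.abs(recon - residual).max())            # only the quantization error remains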
[0086] The scaling and inverse transform component 429 applies a reverse operation of the transform scaling and quantization component 413 to support motion estimation. The scaling and inverse transform component 429 applies inverse scaling, transformation, and/or quantization to reconstruct the residual block in the pixel domain, e.g., for later use as a reference block which may become a predictive block for another current block. The motion estimation component 421 and/or motion compensation component 419 may calculate a reference block by adding the residual block back to a corresponding predictive block for use in motion estimation of a later block/frame. Filters are applied to the reconstructed reference blocks to mitigate artifacts created during scaling, quantization, and transform. Such artifacts could otherwise cause inaccurate prediction (and create additional artifacts) when subsequent blocks are predicted.
[0087] The filter control analysis component 427 and the in-loop filters component 425 apply the filters to the residual blocks and/or to reconstructed image blocks. For example, the transformed residual block from the scaling and inverse transform component 429 may be combined with a corresponding prediction block from intra-picture prediction component 417 and/or motion compensation component 419 to reconstruct the original image block. The filters may then be applied to the reconstructed image block. In some examples, the filters may instead be applied to the residual blocks. As with other components in FIG. 4, the filter control analysis component 427 and the in-loop filters component 425 are highly integrated and may be implemented together, but are depicted separately for conceptual purposes. Filters applied to the reconstructed reference blocks are applied to particular spatial regions and include multiple parameters to adjust how such filters are applied. The filter control analysis component 427 analyzes the reconstructed reference blocks to determine where such filters should be applied and sets corresponding parameters. Such data is forwarded to the header formatting and CABAC component 431 as filter control data for encoding. The in-loop filters component 425 applies such filters based on the filter control data. The filters may include a deblocking filter, a noise suppression filter, a SAO filter, and an adaptive loop filter. Such filters may be applied in the spatial/pixel domain (e.g., on a reconstructed pixel block) or in the frequency domain, depending on the example.
[0088] When operating as an encoder, the filtered reconstructed image block, residual block, and/or prediction block are stored in the decoded picture buffer component 423 for later use in motion estimation as discussed above. When operating as a decoder, the decoded picture buffer component 423 stores and forwards the reconstructed and filtered blocks toward a display as part of an output video signal. The decoded picture buffer component 423 may be any memory device capable of storing prediction blocks, residual blocks, and/or reconstructed image blocks.
[0089] The header formatting and CABAC component 431 receives the data from the various components of codec system 400 and encodes such data into a coded bitstream for transmission toward a decoder. Specifically, the header formatting and CABAC component 431 generates various headers to encode control data, such as general control data and filter control data. Further, prediction data, including intra-prediction and motion data, as well as residual data in the form of quantized transform coefficient data are all encoded in the bitstream. The final bitstream includes all information desired by the decoder to reconstruct the original partitioned video signal 401. Such information may also include intra-prediction mode index tables (also referred to as codeword mapping tables), definitions of encoding contexts for various blocks, indications of most probable intra-prediction modes, an indication of partition information, etc. Such data may be encoded by employing entropy coding. For example, the information may be encoded by employing context adaptive variable length coding (CAVLC), CABAC, syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding, or another entropy coding technique. Following the entropy coding, the coded bitstream may be transmitted to another device (e.g., a video decoder) or archived for later transmission or retrieval.
[0090] FIG. 5 is a block diagram illustrating an example video encoder 500. Video encoder 500 may be employed to implement the encoding functions of codec system 400 and/or implement steps 301, 303, 305, 307, and/or 309 of method 300. Further, encoder 500 may be employed to implement steps 205-209 of method 200 as well as encoder 103. Encoder 500 partitions an input video signal (e.g., a sub-picture video stream), resulting in a partitioned video signal 501, which is substantially similar to the partitioned video signal 401. The partitioned video signal 501 is then compressed and encoded into a bitstream by components of encoder 500.
[0091] Specifically, the partitioned video signal 501 is forwarded to an intra-picture prediction component 517 for intra-prediction. The intra-picture prediction component 517 may be substantially similar to intra-picture estimation component 415 and intra-picture prediction component 417. The partitioned video signal 501 is also forwarded to a motion compensation component 521 for inter-prediction based on reference blocks in a decoded picture buffer component 523. The motion compensation component 521 may be substantially similar to motion estimation component 421 and motion compensation component 419. The prediction blocks and residual blocks from the intra-picture prediction component 517 and the motion compensation component 521 are forwarded to a transform and quantization component 513 for transformation and quantization of the residual blocks. The transform and quantization component 513 may be substantially similar to the transform scaling and quantization component 413. The transformed and quantized residual blocks and the corresponding prediction blocks (along with associated control data) are forwarded to an entropy coding component 531 for coding into a bitstream. The entropy coding component 531 may be substantially similar to the header formatting and CABAC component 431.
[0092] The transformed and quantized residual blocks and/or the corresponding prediction blocks are also forwarded from the transform and quantization component 513 to an inverse transform and quantization component 529 for reconstruction into reference blocks for use by the motion compensation component 521. The inverse transform and quantization component 529 may be substantially similar to the scaling and inverse transform component 429. In-loop filters in an in-loop filters component 525 are also applied to the residual blocks and/or reconstructed reference blocks, depending on the example. The in-loop filters component 525 may be substantially similar to the filter control analysis component 427 and the in-loop filters component 425. The in-loop filters component 525 may include multiple filters as discussed with respect to in-loop filters component 425. The filtered blocks are then stored in a decoded picture buffer component 523 for use as reference blocks by the motion compensation component 521. The decoded picture buffer component 523 may be substantially similar to the decoded picture buffer component 423.
[0093] The encoder 500 may encode video into one or more tracks. As discussed in more detail below, VR video can be recorded from multiple viewpoints. Video from each viewpoint can then be encoded in a corresponding set of tracks. This allows the decoder to swap between tracks based on user input, which allows a user to swap between viewpoints as desired. A user may wish to continuously watch a particular object or location in the virtual environment when switching between viewpoints. In order to allow the user to maintain a consistent view, the encoder 500 can be configured to encode data indicating correspondences between spatial regions of viewpoints that are part of the same viewpoint group, and hence have the same viewpoint group ID. This allows the decoder to determine the correspondences and determine the FOV and/or viewport used by the user at a first viewpoint when a viewpoint switch is requested. The decoder can then determine a FOV/viewport at a second viewpoint that corresponds to the FOV/viewport used at the first viewpoint based on the correspondences encoded by the encoder 500. Accordingly, when the user switches between viewpoints, the decoder can display a FOV/viewport at the second viewpoint that points toward the same location previously viewed by the user at the first viewpoint. For example, such correspondences can be encoded in a timed metadata track. These concepts are discussed in greater detail below.
[0094] FIG. 6 is a block diagram illustrating an example video decoder 600. Video decoder 600 may be employed to implement the decoding functions of codec system 400 and/or implement steps 311, 313, 315, and/or 317 of operating method 300. Further, decoder 600 may be employed to implement steps 211-213 of method 200 as well as decoder 107. Decoder 600 receives a plurality of tracks containing picture bitstreams and/or sub-picture bitstreams, for example from an encoder 500, generates a reconstructed output video signal, for example by merging sub-picture video streams into a spherical video stream, and forwards the spherical video stream for display to a user via a rendering device.
[0095] The bitstreams are received by an entropy decoding component 633. The entropy decoding component 633 is configured to implement an entropy decoding scheme, such as CAVLC, CABAC, SBAC, PIPE coding, or other entropy coding techniques. For example, the entropy decoding component 633 may employ header information to provide a context to interpret additional data encoded as codewords in the bitstreams. The decoded information includes any desired information to decode the video signal, such as general control data, filter control data, partition information, motion data, prediction data, and quantized transform coefficients from residual blocks. The quantized transform coefficients are forwarded to an inverse transform and quantization component 629 for reconstruction into residual blocks. The inverse transform and quantization component 629 may be similar to inverse transform and quantization component 529.
[0096] The reconstructed residual blocks and/or prediction blocks are forwarded to intra-picture prediction component 617 for reconstruction into image blocks based on intra-prediction operations. The intra-picture prediction component 617 may be similar to intra-picture estimation component 415 and intra-picture prediction component 417. Specifically, the intra-picture prediction component 617 employs prediction modes to locate a reference block in the frame and applies a residual block to the result to reconstruct intra-predicted image blocks. The reconstructed intra-predicted image blocks and/or the residual blocks and corresponding inter-prediction data are forwarded to a decoded picture buffer component 623 via an in-loop filters component 625, which may be substantially similar to decoded picture buffer component 423 and in-loop filters component 425, respectively. The in-loop filters component 625 filters the reconstructed image blocks, residual blocks, and/or prediction blocks, and such information is stored in the decoded picture buffer component 623. Reconstructed image blocks from decoded picture buffer component 623 are forwarded to a motion compensation component 621 for inter-prediction. The motion compensation component 621 may be substantially similar to motion estimation component 421 and/or motion compensation component 419. Specifically, the motion compensation component 621 employs motion vectors from a reference block to generate a prediction block and applies a residual block to the result to reconstruct an image block. The resulting reconstructed blocks may also be forwarded via the in-loop filters component 625 to the decoded picture buffer component 623. The decoded picture buffer component 623 continues to store additional reconstructed image blocks, which can be reconstructed into frames via the partition information. Such frames may also be placed in a sequence. The sequence is output toward a display as a reconstructed output video signal.
[0097] The decoder 600 may receive a set of tracks containing VR video recorded from multiple viewpoints. This allows the decoder 600 to swap between tracks based on user input, which allows a user to swap between viewpoints as desired. A user may wish to continuously watch a particular object or location in the virtual environment when switching between viewpoints. In order to allow the user to maintain a consistent view, the tracks may contain data indicating correspondences between spatial regions of viewpoints in the same viewpoint group, and hence including the same viewpoint group ID. This allows the decoder 600 to determine the correspondences and determine the FOV and/or viewport used by the user at a first viewpoint when a viewpoint switch is requested. The decoder 600 can then determine a FOV/viewport at a second viewpoint that corresponds to the FOV/viewport used at the first viewpoint based on the correspondences encoded by the encoder. Accordingly, when the user switches between viewpoints, the decoder 600 can display a FOV/viewport at the second viewpoint that points toward the same location previously viewed by the user at the first viewpoint. For example, such correspondences can be encoded in a timed metadata track. Such correspondences can also be encoded in the tracks including the video data for the relevant viewpoints. These concepts are discussed in greater detail below.
[0098] FIG. 7 is a schematic diagram illustrating an example system 700 for capturing VR video from multiple viewpoints 702, 703, and 704. Multiple viewpoints 702, 703, and 704 are included as an example. In other examples, fewer or more viewpoints may be provided. As shown, the system 700 is implemented to capture activity at a particular scene 701 (e.g., a stadium) using a plurality of cameras positioned at corresponding viewpoints 702, 703, and 704. The cameras may be similar to the multi-directional cameras 101 described above in connection with FIG. 1. In an embodiment, the cameras may capture VR videos in fixed positions at viewpoint 702 and viewpoint 703. Hence, viewpoints 702 and 703 are static viewpoints because the locations of the cameras used at viewpoints 702 and 703 do not move over the course of the video. Further, a separate camera has the ability to continuously change positions along a rail 705 in order to capture VR videos from a variety of different positions denoted as viewpoint 704. By sliding along the rail 705, the camera is able to capture the VR video from different positions, and hence viewpoint 704 may change over time. In practical applications, the camera at viewpoint 704 may be mounted in other ways in order to be moveable in one or more directions. Regardless of the mechanisms employed during filming, viewpoint 704 is a dynamic viewpoint because the location of the corresponding camera changes over the course of the video.
[0099] The cameras may each record a sphere of video looking outward from the perspective of the corresponding viewpoint 702, 703, and 704. Hence, a viewpoint 702, 703, and 704 is the center of a sphere of video data as recorded from a specified location. For example, video (and audio) can be recorded from viewpoints 702, 703, and 704. The video for each viewpoint can then be stored in a set of corresponding tracks. For example, video from a viewpoint 702 can be downsampled and stored at various resolutions in tracks as part of an adaptation set for viewpoint 702. Adaptation sets for viewpoints 703 and 704 can also be stored in corresponding tracks. Hence, a decoder can receive user input and, based on the user input, select an adaptation set with corresponding tracks for display. This in turn allows a user to direct the decoder to switch between viewpoints 702, 703, and 704. The result is that the user can experience VR video from a first viewpoint (e.g., viewpoint 702) at a first time and then switch to experience VR video from a second viewpoint (e.g., viewpoint 703 or 704) at a second time.
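A sketch of the adaptation-set selection described above is shown below; the viewpoint names, track identifiers, resolutions, and bitrates are hypothetical, and the bandwidth rule is just one plausible client policy rather than anything mandated by the disclosure.
# Hypothetical adaptation sets: one per viewpoint, several downsampled representations each.
adaptation_sets = {
    'viewpoint_702': [('track_702_4k', 3840, 25_000_000),
                      ('track_702_hd', 1920, 8_000_000),
                      ('track_702_sd', 960, 2_500_000)],
    'viewpoint_703': [('track_703_hd', 1920, 8_000_000),
                      ('track_703_sd', 960, 2_500_000)],
}

def select_track(viewpoint, available_bps):
    """Pick the highest-bitrate representation of the chosen viewpoint that fits the link."""
    candidates = [r for r in adaptation_sets[viewpoint] if r[2] <= available_bps]
    if not candidates:
        return min(adaptation_sets[viewpoint], key=lambda r: r[2])  # fall back to the smallest
    return max(candidates, key=lambda r: r[2])

print(select_track('viewpoint_702', available_bps=10_000_000))      # ('track_702_hd', 1920, 8000000)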
[00100] One mechanism to enable such a viewpoint switch is to provide a default orientation for each viewpoint 702, 703, and 704. An orientation is a direction of view pointing outward from the center of a corresponding viewpoint 702, 703, and/or 704. An orientation may be described in terms of angle, coordinates, etc. A specified orientation may result in a corresponding FOV and viewport for viewing video from the viewpoint 702, 703, and/or 704. The default orientation approach causes a user to view a specified default FOV and viewport upon switching to a new viewpoint 702, 703, and/or 704. However, this may result in a negative user experience in some cases. For example, a user may wish to continuously view an object in the scene 701, such as a basketball, a particular player, a goal, etc. Such consistency may not be possible using default orientations. For example, a user watching the ball at viewpoint 702 may wish to switch to viewpoint 704 to get a closer look. However, the default orientation at viewpoint 704 may be toward the goal. In such a case, the user loses the ball upon switching and is forced to find the ball again.
[00101] In the present disclosure, the encoder can store spatial region correspondences between viewpoints 702, 703, and/or 704 that are part of the same viewpoint group. The decoder can determine the orientation viewed by the user at viewpoint 702 upon switching to viewpoint 704 (or viewpoint 703 in other examples). The decoder can then use the spatial region correspondences between viewpoint 702 and viewpoint 704 to determine an orientation at viewpoint 704 that matches the orientation at viewpoint 702. The decoder can then employ the determined orientation at viewpoint 704 after making the switch. In this manner, the user is automatically oriented to the same location in the scene 701 after the switch between viewpoints 702, 703, and/or 704 as was viewed before the switch. For example, if the user is watching the ball at viewpoint 702, the user is automatically oriented to view the ball from viewpoint 704 upon switching. The spatial region correspondences are discussed in greater detail below.
[00102] FIG. 8 is a schematic diagram of example viewpoints 810 and 820 in a viewpoint group 800. A correspondence, as used herein, is an indication that two or more sphere regions 813 and 823 are spatially related such that viewing the sphere regions 813 and 823 from a related viewpoint 810 and 820, respectively, provides a view of the same object 830. The correspondences support switching in the present disclosure. Hence, the correspondences shown in the schematic diagram can be used by an encoder 103, an encoder 500, a decoder 107, a decoder 600, and/or a codec system 400. Further, the correspondences shown in the schematic diagram can describe relationships between viewpoints 702, 703, and/or 704. In addition, the correspondences shown in the schematic diagram can be encoded in a bitstream and used to support selection of tracks to decode and display, and hence can be used as part of methods 200 and 300.
[00103] Correspondences can be stored as viewpoint 810 and 820 pairs and/or sets. Viewpoints 810 and 820 each include a sphere 812 and 822, respectively, of video content in associated tracks. Specifically, a user viewing video from a viewpoint 810 and 820 has access to a sphere 812 and 822, respectively, of video content. The video content is depicted to the user by projecting a portion of the video content from the sphere 812 and 822, depending on the user’s viewpoint 810 and 820, onto a viewport based on the current orientation 811 and 821, respectively, of the user. The spheres 812 and 822 are made up of a plurality of spatial regions, including a sphere region 813 and a corresponding sphere region 823. In this context, sphere regions 813 and 823 are specified spatial portions of spheres 812 and 822 around viewpoints
810 and 820, respectively. A user may view content from a viewpoint 810 and 820 by employing an orientation 811 and 821, respectively. As shown, an orientation 811 and 821 is a direction of view pointing outward from the center of a corresponding viewpoint 810 and 820, respectively. A sphere region 813 and/or 823 may be associated with a particular orientation
811 and/or 821, respectively. For example, an orientation 811 and 821 provides a viewing angle from the viewpoint 810 and 820 and defines a sphere region 813 and 823 based on an angle and a depth of field. Further, an orientation 811 and 821 may define a FOV from a viewpoint 810 and 820, respectively. In addition, a sphere region 813 and 823 may define a viewport for a viewpoint 810 and 820, respectively.
[00104] As shown, an orientation 811 from a viewpoint 810 may be employed to allow a user to view an object 830 at a location 831. Also, an orientation 821 from a viewpoint 820 may be employed to allow a user to view the same object 830 at the location 831 from a different perspective. The sphere regions 813 and 823 act as viewports onto the object 830. As such, a decoder can use a correspondence between sphere regions 813 and 823 to determine an orientation 811 at a viewpoint 810 from an orientation 821 at a viewpoint 820, and vice versa. Further, an encoder may encode the correspondences between sphere region 813 at viewpoint 810 and sphere region 823 at viewpoint 820 during VR video creation. A decoder can then use such information to maintain viewing consistency when switching between viewpoints 810 and 820.
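One possible in-memory representation of such correspondences, used only to illustrate the lookup a decoder could perform, is sketched below; the table layout, the viewpoint names, and the use of (azimuth, elevation) centres to stand in for sphere regions are assumptions for this example and are not the file-format syntax defined by the disclosure.
# Hypothetical correspondence table: (source viewpoint, source sphere region) ->
# (destination viewpoint, destination sphere region). Regions are given here as
# simple (azimuth, elevation) centres in degrees purely for illustration.
correspondences = {
    ('vp_810', (45.0, 10.0)): ('vp_820', (-120.0, 5.0)),
    ('vp_810', (90.0, 0.0)):  ('vp_820', (-75.0, -2.0)),
}

def corresponding_region(src_viewpoint, src_region, dst_viewpoint):
    """Look up the sphere region of dst_viewpoint that shows the same location."""
    dst = correspondences.get((src_viewpoint, src_region))
    if dst and dst[0] == dst_viewpoint:
        return dst[1]
    return None   # no correspondence signalled; fall back to a default orientation

# Switching from viewpoint 810 while the user looks at region (45, 10):
print(corresponding_region('vp_810', (45.0, 10.0), 'vp_820'))   # (-120.0, 5.0)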
[00105] Accordingly, a decoder can employ an orientation 811 from a viewpoint 810 to determine an orientation 821 at a viewpoint 820 as long as the viewpoints 810 and 820 are part of the same viewpoint group 800. A viewpoint group 800 is a plurality of viewpoints 810 and 820 that share a correspondence between at least one sphere region 813 and/or 823. A decoder can determine that viewpoints 810 and 820 are part of the same viewpoint group 800 when the viewpoints 810 and 820 share the same viewpoint group ID. In one example, the viewpoint group ID can be stored by the viewpoints 810 and 820 as discussed below.
[00106] In OMAF, a set of tracks carrying media data of viewpoint 810 and/or 820 can be indicated by the track group of grouping type 'vipo'. Tracks belonging to the same viewpoint 810 and/or 820 have the same value of track group ID, denoted as track_group_id, for a track group type, denoted as track_group_type, of 'vipo'. The track_group_id of tracks from one viewpoint may differ from the track_group_id of tracks from any other viewpoint. By default, when this track grouping is not indicated for any track in a file, the file is considered to contain content for one viewpoint only. Information for a viewpoint can be signaled within a viewpoint group box denoted as ViewpointGroupBox(), which includes the viewpoint position structure, denoted as ViewpointPosStruct(), and other information. The ViewpointPosStruct() can also be included in a viewpoint information structure, denoted as ViewpointInfoStruct(), which can be included in a dynamic viewpoint sample, denoted as DynamicViewpointSample(). DynamicViewpointSample() may be a sample syntax of a dynamic viewpoint timed metadata track that has a sample entry of type 'dyvp'.
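A reader-side sketch of this 'vipo' track grouping is shown below; the per-track dictionaries are a hypothetical stand-in for values parsed out of TrackGroupTypeBox, and only the grouping rule itself (tracks sharing a track_group_id of type 'vipo' form one viewpoint) comes from the text above.
from collections import defaultdict

# Hypothetical per-track metadata as a reader might extract it from TrackGroupTypeBox.
tracks = [
    {'track_id': 1, 'track_group_type': 'vipo', 'track_group_id': 101},
    {'track_id': 2, 'track_group_type': 'vipo', 'track_group_id': 101},
    {'track_id': 3, 'track_group_type': 'vipo', 'track_group_id': 102},
    {'track_id': 4, 'track_group_type': None,   'track_group_id': None},
]

def tracks_by_viewpoint(tracks):
    """Group tracks that share a 'vipo' track_group_id; that id identifies the viewpoint."""
    viewpoints = defaultdict(list)
    for t in tracks:
        if t['track_group_type'] == 'vipo':
            viewpoints[t['track_group_id']].append(t['track_id'])
    return dict(viewpoints)

print(tracks_by_viewpoint(tracks))   # {101: [1, 2], 102: [3]}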
[00107] Within the ViewpointInfoStruct(), among other information, a viewpoint group structure, denoted as ViewpointGroupStruct(), can be included when the group alignment flag, denoted as group_alignment_flag, is equal to one, which signals a group of viewpoints. The ViewpointGroupStruct() may include syntax elements for a viewpoint group ID and a viewpoint group description, denoted as vwpt_group_id and vwpt_group_description, respectively. The vwpt_group_id indicates the identifier of a viewpoint group. All viewpoints in a viewpoint group share a common reference coordinate system. The vwpt_group_description is a null-terminated eight-bit Unicode transformation format (UTF-8) string which indicates the description of a viewpoint group. A null string is allowed. An OMAF player may be expected to start with the initial viewpoint timed metadata. Subsequently, if the user wishes to switch to a viewpoint group and the initial viewpoint information is not present, the OMAF player is expected to switch to the viewpoint with the least value of the viewpoint identifier in the viewpoint group.
[00108] When switching from a first viewpoint, such as viewpoint 810, to a second viewpoint, such as viewpoint 820, a user may desire that the decoder determine the viewport (covering the same area or object) of the second viewpoint corresponding to the viewport of the first viewpoint presented to the user just before the switching occurs. This can be accomplished when two viewpoints belong to the same viewpoint group 800. The decoder can then present the viewport of the second viewpoint to the user instead of following the initial viewpoint timed metadata for the second viewpoint (even when such timed metadata exists).
[00109] An implementation of the preceding example is as follows. Tracks belonging to the same viewpoint have the same value of track_group_id for track_group_type 'vipo', and the track_group_id of tracks from one viewpoint differs from the track_group_id of tracks from any other viewpoint. By default, when this track grouping is not indicated for any track in a file, the file is considered to contain content for one viewpoint only.
aligned(8) class ViewpointGroupBox extends TrackGroupTypeBox('vipo') {
    ViewpointPosStruct();
    string viewpoint_label;
}
[00110] Tracks that have the same value of track_group_id within TrackGroupTypeBox with track_group_type equal to 'vipo' belong to the same viewpoint. The track_group_id within TrackGroupTypeBox with track_group_type equal to 'vipo' is therefore used as the identifier of the viewpoint. ViewpointPosStruct() is defined below. viewpoint_label is a null-terminated UTF-8 string that provides a human readable text label for the viewpoint.
[00111] The ViewpointInfoStruct() provides information of a viewpoint, including the position of the viewpoint and the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system using syntax as described below:
aligned(8) ViewpointInfoStruct(gcs_rotation_flag) {
    ViewpointPosStruct();
    if (gcs_rotation_flag)
        ViewpointGlobalCoordinateSysRotationStruct();
    unsigned int(1) group_alignment_flag;
    bit(7) reserved = 0;
    if (group_alignment_flag)
        ViewpointGroupStruct();
}

aligned(8) ViewpointPosStruct() {
    signed int(32) viewpoint_pos_x;
    signed int(32) viewpoint_pos_y;
    signed int(32) viewpoint_pos_z;
    unsigned int(1) viewpoint_gpspos_present_flag;
    bit(31) reserved = 0;
    if (viewpoint_gpspos_present_flag) {
        signed int(32) viewpoint_gpspos_longitude;
        signed int(32) viewpoint_gpspos_latitude;
        signed int(32) viewpoint_gpspos_altitude;
    }
}

aligned(8) class ViewpointGlobalCoordinateSysRotationStruct() {
    signed int(32) viewpoint_gcs_yaw;
    signed int(32) viewpoint_gcs_pitch;
    signed int(32) viewpoint_gcs_roll;
}

aligned(8) class ViewpointGroupStruct() {
    unsigned int(8) vwpt_group_id;
    string vwpt_group_description;
}
[00112] The group_alignment_flag equal to 1 specifies that the viewpoint belongs to a separate coordinate system (with its own origin) for the alignment of viewpoint groups and the ViewpointGroupStruct is present. The group_alignment_flag equal to 0 specifies that the viewpoint belongs to the common reference coordinate system. When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems. viewpoint_pos_x, viewpoint_pos_y, and viewpoint_pos_z specify the position of the viewpoint (when the position of the viewpoint is static) or the initial position of the viewpoint (when the position of the viewpoint is dynamic), in units of 10^-1 millimeters, in 3D space with (0, 0, 0) as the centre of the common reference coordinate system. If a viewpoint is associated with a timed metadata track with sample entry type 'dyvp', the position of the viewpoint is dynamic. Otherwise, the position of the viewpoint is static. In the former case, the dynamic position of the viewpoint is signaled in the associated timed metadata track with sample entry type 'dyvp'. viewpoint_gpspos_present_flag equal to 1 indicates that viewpoint_gpspos_longitude, viewpoint_gpspos_latitude, and viewpoint_gpspos_altitude are present. viewpoint_gpspos_present_flag equal to 0 indicates that viewpoint_gpspos_longitude, viewpoint_gpspos_latitude, and viewpoint_gpspos_altitude are not present. viewpoint_gpspos_longitude indicates the longitude of the geolocation of the viewpoint in units of 2^-23 degrees. viewpoint_gpspos_longitude shall be in the range of -180 * 2^23 to 180 * 2^23 - 1, inclusive. Positive values represent eastern longitude and negative values represent western longitude. viewpoint_gpspos_latitude indicates the latitude of the geolocation of the viewpoint in units of 2^-23 degrees. viewpoint_gpspos_latitude shall be in the range of -90 * 2^23 to 90 * 2^23 - 1, inclusive. Positive values represent northern latitude and negative values represent southern latitude. viewpoint_gpspos_altitude indicates the altitude of the geolocation of the viewpoint in units of millimeters above the World Geodetic System (WGS) 84 reference ellipsoid. viewpoint_gcs_yaw, viewpoint_gcs_pitch, and viewpoint_gcs_roll specify the yaw, pitch, and roll rotation angles, respectively, of the X, Y, and Z axes of the global coordinate system of the viewpoint relative to the common reference coordinate system, in units of 2^-16 degrees. viewpoint_gcs_yaw shall be in the range of -180 * 2^16 to 180 * 2^16 - 1, inclusive. viewpoint_gcs_pitch shall be in the range of -90 * 2^16 to 90 * 2^16, inclusive. viewpoint_gcs_roll shall be in the range of -180 * 2^16 to 180 * 2^16 - 1, inclusive. vwpt_group_id indicates the identifier of a viewpoint group. All viewpoints in a viewpoint group share a common reference coordinate system. vwpt_group_description is a null-terminated UTF-8 string which indicates the description of a viewpoint group. A null string is allowed. An OMAF player is expected to start with the initial viewpoint timed metadata as defined below. Subsequently, if the user wishes to switch to a viewpoint group and the initial viewpoint information is not present, the OMAF player is expected to switch to the viewpoint with the least value of the viewpoint identifier in the viewpoint group.
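The fixed-point units stated above can be illustrated with the small conversion helpers below; only the unit definitions (position in units of 10^-1 millimeters, GPS longitude and latitude in units of 2^-23 degrees, rotation angles in units of 2^-16 degrees) come from the semantics, while the raw field values in the example are hypothetical.
def viewpoint_position_metres(raw_x, raw_y, raw_z):
    """Convert viewpoint_pos_x/y/z (units of 10^-1 millimeters) to metres."""
    return tuple(v * 1e-4 for v in (raw_x, raw_y, raw_z))

def gps_degrees(raw_longitude, raw_latitude):
    """Convert viewpoint_gpspos_longitude/latitude (units of 2^-23 degrees) to degrees."""
    return raw_longitude / (1 << 23), raw_latitude / (1 << 23)

def gcs_rotation_degrees(raw_yaw, raw_pitch, raw_roll):
    """Convert viewpoint_gcs_yaw/pitch/roll (units of 2^-16 degrees) to degrees."""
    return tuple(v / (1 << 16) for v in (raw_yaw, raw_pitch, raw_roll))

# Hypothetical raw field values as they might be read from the structures above.
print(viewpoint_position_metres(25_000, -3_000, 0))                       # (2.5, -0.3, 0.0) metres
print(gps_degrees(raw_longitude=1_006_632_960, raw_latitude=411_041_792)) # (120.0, 49.0) degrees
print(gcs_rotation_degrees(5_898_240, -1_474_560, 0))                     # (90.0, -22.5, 0.0) degrees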
[00113] A timed metadata track having sample entry type 'rcvp' may contain zero or one SampleToGroupBox with grouping_type equal to 'vwpt'. This SampleToGroupBox represents the assignment of samples in this timed metadata (and consequently the corresponding samples in the media tracks) to viewpoints. When a SampleToGroupBox with grouping_type equal to 'vwpt' is present, an accompanying SampleGroupDescriptionBox with the same grouping_type shall be present, and contain the ID of the particular viewpoint this group of samples belongs to. The sample group entry of grouping_type equal to 'vwpt', named ViewpointEntry, is defined as follows:
class ViewpointEntry() extends SampleGroupDescriptionEntry('vwpt') {
    unsigned int(32) viewpoint_id;
}
viewpoint_id indicates the viewpoint identifier of the viewpoint this group of samples belongs to. The dynamic viewpoint timed metadata track indicates the viewpoint parameters that are dynamically changing over time.
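A sketch of resolving the 'vwpt' sample grouping is given below; it assumes the SampleToGroupBox has been parsed into (sample_count, group_description_index) runs and the SampleGroupDescriptionBox into a list of ViewpointEntry viewpoint_id values, with a group_description_index of zero meaning the sample is in no group, and the numbers themselves are hypothetical.
def viewpoint_for_sample(sample_index, run_entries, viewpoint_entries):
    """Resolve the viewpoint_id of a 0-based sample using 'vwpt' sample grouping.

    run_entries:       list of (sample_count, group_description_index) run-length entries,
                       where index 0 means 'not in any group' and indices are otherwise 1-based.
    viewpoint_entries: list of viewpoint_id values from the SampleGroupDescriptionBox."""
    cursor = 0
    for sample_count, group_description_index in run_entries:
        if sample_index < cursor + sample_count:
            if group_description_index == 0:
                return None
            return viewpoint_entries[group_description_index - 1]
        cursor += sample_count
    return None

runs = [(100, 1), (50, 2), (25, 1)]        # hypothetical runs of samples
viewpoint_ids = [101, 102]                 # ViewpointEntry.viewpoint_id values
print(viewpoint_for_sample(120, runs, viewpoint_ids))   # 102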
[00114] An OMAF player should use the signaled information as follows when starting playback of one viewpoint after switching from another viewpoint. If there is a recommended viewing orientation explicitly signaled, the OMAF player is expected to parse this information and follow the recommended viewing orientation. Otherwise, the OMAF player is expected to keep the same viewing orientation as in the switching-from viewpoint just before the switching occurs.
[00115] The track sample entry type 'dyvp' shall be used. The sample entry of this sample entry type is specified as follows:
class DynamicViewpointSampleEntry extends MetaDataSampleEntry('dyvp') {
    ViewpointPosStruct();
    unsigned int(1) dynamic_gcs_rotation_flag;
    bit(7) reserved = 0;
    if (!dynamic_gcs_rotation_flag)
        ViewpointGlobalCoordinateSysRotationStruct();
}
ViewpointPosStruct() is defined above but indicates the initial viewpoint position. dynamic_gcs_rotation_flag equal to 0 specifies that the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system remain unchanged in all samples referring to this sample entry. dynamic_gcs_rotation_flag equal to 1 specifies that the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system are indicated in the samples. ViewpointGlobalCoordinateSysRotationStruct() is defined above but indicates the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system for each sample referring to this sample entry.
[00116] The sample syntax of this sample entry type ('dyvp') is specified as follows:
aligned(8) DynamicViewpointSample() {
ViewpointInfoStruct(dynamic_gcs_rotation_flag);
}
[00117] The semantics of ViewpointInfoStruct() is specified above. The first sample shall have group_alignment_flag equal to 1. For subsequent samples when the group information does not change, the ViewpointGroupStruct() can be absent. When the ViewpointGroupStruct() is absent in a sample, it is inferred to be identical to the ViewpointGroupStruct() of the previous sample, in decoding order.
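The inheritance rule for an absent ViewpointGroupStruct() can be sketched as follows; the dictionary representation of a parsed DynamicViewpointSample() is a hypothetical convenience, and only the rule that a missing group structure is copied from the previous sample in decoding order comes from the text above.
def resolve_group_info(samples):
    """Fill in the effective viewpoint group info for each dynamic viewpoint sample.

    'samples' is a decoding-order list of dicts; a sample carries a 'group' entry only
    when group_alignment_flag is 1 and ViewpointGroupStruct() is present."""
    resolved, last_group = [], None
    for index, sample in enumerate(samples):
        if 'group' in sample:
            last_group = sample['group']
        elif index == 0:
            raise ValueError('first sample shall carry ViewpointGroupStruct()')
        resolved.append(last_group)
    return resolved

samples = [
    {'pos': (0, 0, 0), 'group': {'vwpt_group_id': 7, 'vwpt_group_description': 'court'}},
    {'pos': (0, 0, 100)},                       # group info unchanged, struct omitted
    {'pos': (0, 0, 200)},
]
print(resolve_group_info(samples)[2])           # inherited: {'vwpt_group_id': 7, ...}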
[00118] Metadata indicates the initial viewpoint that should be used. In the absence of this information, the initial viewpoint should be inferred to be the viewpoint that has the least value of viewpoint identifier among all viewpoints in the file. The initial viewpoint timed metadata track, when present, shall be indicated as being associated with all viewpoints in the file.
[00119] The track sample entry type 'invp' shall be used. The sample entry of this sample entry type is specified as follows:
class InitialViewpointSampleEntry extends MetaDataSampleEntry('invp') {
    unsigned int(32) id_of_initial_viewpoint;
}
id_of_initial_viewpoint indicates the value of the viewpoint identifier of the initial viewpoint for the first sample to which this sample entry applies.
[00120] The sample syntax of this sample entry type ('invp') is specified as follows:
aligned(8) InitialViewpointSample() {
    unsigned int(32) id_of_initial_viewpoint;
}
id_of_initial_viewpoint indicates the value of the viewpoint identifier of the initial viewpoint for the sample.
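Combining the 'invp' metadata with the fallback rule stated earlier, a player's choice of starting viewpoint might be sketched as follows; the set of viewpoint identifiers is hypothetical, and the helper simply prefers a signaled id_of_initial_viewpoint and otherwise falls back to the least viewpoint identifier.
def initial_viewpoint(all_viewpoint_ids, id_of_initial_viewpoint=None):
    """Pick the starting viewpoint for playback.

    Uses the signaled id_of_initial_viewpoint from the 'invp' timed metadata when present;
    otherwise falls back to the least viewpoint identifier among all viewpoints in the file."""
    if id_of_initial_viewpoint is not None and id_of_initial_viewpoint in all_viewpoint_ids:
        return id_of_initial_viewpoint
    return min(all_viewpoint_ids)

print(initial_viewpoint({101, 102, 205}, id_of_initial_viewpoint=205))   # 205
print(initial_viewpoint({101, 102, 205}))                                # 101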
[00121] As shown above, the viewpoint group ID can be used to indicate a viewpoint group. However, in such an implementation, the viewpoint group ID is included in the viewpoint group structure, which is further included in the viewpoint information structure, which is further included in the dynamic viewpoint sample. The dynamic viewpoint sample is included in the timed metadata track, which is available to dynamic viewpoints, such as viewpoint 704 in FIG. 7, but not to static viewpoints, such as viewpoints 702 and 703 in FIG. 7. Accordingly, the preceding implementation prevents the viewpoint group 800, as shown in FIG. 8, from being determined for static viewpoints, which prevents determining destination viewports based on source viewports. This is because, in the preceding implementation, the vwpt_group_id is not signaled in the ViewpointGroupBox(), and thus the viewpoint group information is not present for viewpoints with a static position, which are not associated with a timed metadata track with sample entry type 'dyvp'.
[00122] The present disclosure solves this issue by placing the vwpt_group_id into the ViewpointGroupBox(), which is included in the tracks for the corresponding viewpoints. Examples of such implementations are discussed in greater detail below.
[00123] FIG. 9 is a schematic diagram of an example VR video file 900 for multiple viewpoints. For example, VR video file 900 may be employed to contain correspondences between spatial regions of viewpoints in a viewpoint group, such as viewpoint group 800. Further, the VR video file 900 can be encoded and/or decoded by an encoder 103, an encoder 500, a decoder 107, a decoder 600, and/or a codec system 400. In addition, the VR video file 900 can describe VR video from multiple viewpoints, such as viewpoints 702, 703, and/or 704. Also, the VR video file 900 can contain encoded VR video, and hence can be generated by an encoder and read by a decoder to support video display as part of methods 200 and 300.
[00124] The VR video file 900 can contain sets of tracks for corresponding viewpoints. For example, the VR video file 900 can contain a set of viewpoint A tracks 910, a set of viewpoint B tracks 920, a set of viewpoint C tracks 940, and a set of viewpoint D tracks 950. As a specific example, such tracks can contain video data as captured from corresponding viewpoints, such as viewpoints 702, 703, 704, 810, and/or 820. For example, in a DASH context, VR video recorded at a viewpoint is stored in a corresponding adaptation set. The adaptation set is downsampled to various lower resolutions. Then a track is generated for each resolution of the adaptation set. In such a case, the set of viewpoint A tracks 910, set of viewpoint B tracks 920, set of viewpoint C tracks 940, and set of viewpoint D tracks 950 contain the tracks associated with the adaptation set for the corresponding viewpoints. The relevant tracks can then be forwarded to the decoder/client depending on the viewpoint selected by the user and the desired resolution based on the availability of network resources.
[00125] The VR video file 900 can also contain a timed metadata track 960. The timed metadata track 960 contains metadata relevant to one or more of the viewpoints and hence potentially to all of the tracks 910, 920, 940, and 950. As such, spatial region correspondences between viewpoints can be stored in the timed metadata track 960. For example, correspondences between each of the viewpoints can be stored toward the beginning of the timed metadata track 960. Such information may be global in nature and can be used for the entire VR video file 900. In the event that the spatial region correspondences change, for example due to viewpoint motion, viewpoints turning on/off, etc., such changes can be coded into the timed metadata track 960 at the temporal location in the VR video file 900 where such changes occur. Accordingly, the timed metadata track 960 can be employed to contain the spatial region correspondences between the viewpoints over the entire length of the VR video file 900. The spatial region correspondences in the timed metadata track 960 can then be used by a decoder when displaying VR video as contained in tracks 910, 920, 940, and 950.
[00126] The problem is that switching based on spatial region correspondences only works for viewpoints in the same viewpoint group. In the implementation discussed with respect to FIG. 8, viewpoint group IDs 962 for the viewpoints are stored in the timed metadata track 960. However, decoders may not generally check the timed metadata track 960 when switching to or from a static viewpoint. Accordingly, the decoder may be unaware of the viewpoint group IDs 962 in the timed metadata track 960. Accordingly, the present disclosure includes viewpoint group IDs 912, 922, 942, and 952, which are the group IDs for the viewpoints in the corresponding viewpoint tracks 910, 920, 940, and 950, respectively. In this way, both static and dynamic viewpoints have access to the viewpoint group IDs 912, 922, 942, and 952.
[00127] It should be noted that a file conforming to the HEVC file format may comprise a series of objects, called boxes. A box may be an object-oriented building block defined by a unique type identifier and length. A box is the elementary syntax structure in ISOBMFF and may include a four-character coded box type, a byte count of the box, and a payload. Specifically, a box may be a syntax structure comprising a coded box type, a byte count of the box, and a payload. In some instances, all data in a file conforming to the HEVC file format may be contained within boxes and there may be no data in the file that is not in a box. Thus, an ISOBMFF file may include a sequence of boxes, and boxes may contain other boxes. For instance, the payload of a box may include one or more additional boxes. A file conforming to ISOBMFF may include various types of boxes. For example, a file conforming to ISOBMFF may include a file type box, a media data box, a movie box, a movie fragment box, and so on. A file type box includes file type and compatibility information. A media data box may contain samples (e.g., coded pictures). A Movie box (moov) contains metadata for continuous media streams present in the file. Each of the continuous media streams may be represented in the file as a track, which may be included in tracks 910, 920, 940, 950, and/or 960. For instance, a movie box may contain metadata regarding a movie (e.g., logical and timing relationships between samples, and also pointers to locations of samples). Movie boxes may include several types of sub-boxes. The sub-boxes in a movie box may include one or more track boxes. A track box may include information about an individual track of a movie. A track box may include a track header box that specifies overall information of a single track. In addition, a track box may include a media box that contains a media information box. The media information box may include a sample table box that contains data indexing media samples in the track. Information in the sample table box may be used to locate samples in time and, for each of the samples of the track, a type, size, container, and offset into that container of the sample. Thus, the metadata for a track is enclosed in a Track box ("trak"), while the media content of a track is either enclosed in a Media Data box ("mdat") or directly in a separate file. The media content for tracks comprises or consists of a sequence of samples, such as audio or video access units.
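A minimal, non-normative sketch of walking the top-level boxes of an ISOBMFF file follows. It assumes the file content is available in memory and handles the standard size conventions (a 32-bit size field, a 64-bit largesize when the size equals 1, and a size of 0 meaning the box runs to the end of the container). The example input is synthetic and hypothetical.

import struct

def iter_boxes(data, offset=0, end=None):
    """Yield (box_type, payload_start, payload_end) for the boxes in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, = struct.unpack(">I", data[offset:offset + 4])
        box_type = data[offset + 4:offset + 8].decode("ascii", errors="replace")
        header = 8
        if size == 1:  # a 64-bit largesize follows the box type
            size, = struct.unpack(">Q", data[offset + 8:offset + 16])
            header = 16
        elif size == 0:  # the box extends to the end of the enclosing container
            size = end - offset
        if size < header:  # malformed box; stop rather than loop forever
            break
        yield box_type, offset + header, offset + size
        offset += size

# Example: a synthetic 16-byte 'ftyp' box followed by an 8-byte 'free' box.
data = struct.pack(">I4s4sI", 16, b"ftyp", b"isom", 0) + struct.pack(">I4s", 8, b"free")
for box_type, start, stop in iter_boxes(data):
    print(box_type, stop - start)  # payload sizes: ftyp 8, free 0

Container boxes such as moov and trak can be walked recursively by calling iter_boxes again on the payload range that the generator yields.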
[00128] ISOBMFF specifies the following types of tracks: a media track, which contains an elementary media stream, a hint track, which either includes media transmission instructions or represents a received packet stream, and a timed metadata track, which comprises time-synchronized metadata. The metadata for each track includes a list of sample description entries, each providing the coding or encapsulation format used in the track and the initialization data used for processing that format. Each sample is associated with one of the sample description entries of the track. ISOBMFF enables specifying sample-specific metadata with various mechanisms. Specific boxes within the Sample Table box (stbl) may be employed for various functions. For example, the stbl contains a sample table that contains the time and data indexing of the media samples in a track. Hence, the stbl can be used to locate samples in time, determine their type (e.g., I-frame or not), and determine their size, container, and offset into that container. A Movie Fragment box is a top-level box. A Movie Fragment box may contain one or more track fragment (traf) boxes. Within the Movie Fragment there is a set of track fragments, zero or more per track. The track fragments in turn contain zero or more track runs, each of which documents a contiguous run of samples for that track. For instance, each track run may contain samples of pictures that are contiguous in a certain order, such as decoding order. A track fragment box comprises metadata for one or more track fragments. For instance, a track fragment box may include a track fragment header box indicating a track ID, a base data offset, a sample description index, a default sample duration, a default sample size, and default sample flags. A track fragment box may include one or more track fragment run boxes, each documenting a contiguous set of samples for a track. For instance, a track fragment box may include syntax elements indicating a sample count, a data offset, sample flags, a sample duration, a sample size, sample composition time offset, and so on. Within these structures, many fields are optional and can be defaulted.
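As a simplified, hypothetical illustration of how the time indexing in the sample table can be used, the following sketch maps a presentation time to a sample index from stts-style run-length entries of (sample_count, sample_delta). It ignores edit lists, composition offsets, and fragmented files.

def sample_index_at(stts_entries, target_time, timescale):
    """Map a presentation time (seconds) to a sample index using stts-style
    (sample_count, sample_delta) run-length entries."""
    target = int(target_time * timescale)
    index, elapsed = 0, 0
    for sample_count, sample_delta in stts_entries:
        run_duration = sample_count * sample_delta
        if elapsed + run_duration > target:
            return index + (target - elapsed) // sample_delta
        elapsed += run_duration
        index += sample_count
    return index - 1  # clamp to the last sample

# Example: 300 samples of 1/30 s followed by 150 samples of 1/60 s (timescale 600).
print(sample_index_at([(300, 20), (150, 10)], target_time=10.5, timescale=600))  # prints 330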
[00129] As shown above, a VR video file 900 can contain many boxes in tracks 910, 920, 940, 950, and/or 960. In one example, a sample description box (stsd) 911, 921, 941, and 951 can be employed to contain the viewpoint group IDs 912, 922, 942, and 952, respectively, for the corresponding tracks 910, 920, 940, and 950, respectively. Each of the stsd 911, 921, 941, and 951 is a box that contains data describing the samples in the corresponding track.
[00130] The viewpoint group ID 912, 922, 942, and 952 can be placed in the corresponding tracks 910, 920, 940, and 950, respectively, by making changes to the code described with respect to FIG. 8. Specifically, the code can be modified to ensure the viewpoint grouping information for all viewpoints is always signaled regardless of whether a viewpoint is at a static position (where a static position is not associated with a timed metadata track with sample entry type 'dyvp').

[00131] In a first embodiment, the viewpoint group ID 912, 922, 942, and 952 is always signaled in a ViewpointGroupBox, either directly or indirectly. The ViewpointGroupBox is included in the stsd 911, 921, 941, and 951, which brings the viewpoint group ID 912, 922, 942, and 952 into the tracks 910, 920, 940, and 950. Specifically, the viewpoint group identifier is included in a viewpoint group structure, and the viewpoint group structure is included in the viewpoint group box. The following changes to the code described with respect to FIG. 8 can implement such an embodiment.
[00132] The syntaxes of ViewpointPosStruct(), ViewpointInfoStruct(), ViewpointGroupBox(), DynamicViewpointSampleEntry(), and DynamicViewpointSample() are changed as follows, where removed parts are shown in strikethrough, and additions are shown in underline:
aligned(8) class ViewpointGroupBox extends TrackGroupTypeBox('vipo') {
    ViewpointPosStruct();
    string viewpoint_label;
    ViewpointPosStruct();
    ViewpointGroupStruct();
}
[00133] Furthermore, the semantics of group_alignment_flag is removed:

group_alignment_flag equal to 1 specifies that the viewpoint belongs to a separate coordinate system (with its own origin) for the alignment of viewpoint groups and the ViewpointGroupStruct is present. The group_alignment_flag equal to 0 specifies that the viewpoint belongs to the common reference coordinate system. When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.
[00134] In addition, the semantics of vwpt_group_id is changed as follows:

vwpt_group_id indicates the identifier of a viewpoint group. All viewpoints in a viewpoint group share a common reference coordinate system. When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.
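As a non-normative illustration of the first embodiment, once a reader has parsed the ViewpointGroupBox of two tracks, the group check reduces to comparing the vwpt_group_id values. The Python dictionaries below are a hypothetical stand-in for the parsed boxes, not an actual file-format API.

def can_transfer_viewport(track_a, track_b):
    """Viewport orientation may be carried over only when both tracks advertise
    the same vwpt_group_id in their ViewpointGroupBox ('vipo')."""
    return track_a["vipo"]["vwpt_group_id"] == track_b["vipo"]["vwpt_group_id"]

track_a = {"vipo": {"vwpt_group_id": 7, "viewpoint_label": "stage-left"}}
track_b = {"vipo": {"vwpt_group_id": 7, "viewpoint_label": "stage-right"}}
print(can_transfer_viewport(track_a, track_b))  # True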
[00135] In a second embodiment, ViewpointGroupStruct() is always signaled in the ViewpointPosStruct() but not in the ViewpointInfoStruct(). In this approach, the viewpoint group identifier is included in a viewpoint group structure, the viewpoint group structure is further included in a viewpoint position structure, and the viewpoint position structure is further included in the viewpoint group box. In this way, the viewpoint group ID 912, 922, 942, and 952 is included in the ViewpointGroupBox, which is included in the stsd 911, 921, 941, and 951. This brings the viewpoint group ID 912, 922, 942, and 952 into the tracks 910, 920, 940, and 950. The following changes to the code described with respect to FIG. 8 can implement such an embodiment.
[00136] The syntaxes of ViewpointInfoStruct() and ViewpointPosStruct() are changed as follows, where removed parts are shown in strikethrough, and additions are shown in underline.

aligned(8) ViewpointInfoStruct(gcs_rotation_flag) {
    ViewpointPosStruct();
    if (gcs_rotation_flag)
        ViewpointGlobalCoordinateSysRotationStruct();
    unsigned int(1) group_alignment_flag;
    bit(7) reserved = 0;
    if (group_alignment_flag)
        ViewpointGroupStruct();
}

aligned(8) ViewpointPosStruct() {
    signed int(32) viewpoint_pos_x;
    signed int(32) viewpoint_pos_y;
    signed int(32) viewpoint_pos_z;
    unsigned int(1) viewpoint_gpspos_present_flag;
    bit(31) reserved = 0;
    if (viewpoint_gpspos_present_flag) {
        signed int(32) viewpoint_gpspos_longitude;
        signed int(32) viewpoint_gpspos_latitude;
        signed int(32) viewpoint_gpspos_altitude;
    }
    ViewpointGroupStruct();
}
[00137] Furthermore, the semantics of group_alignment_flag is removed:

group_alignment_flag equal to 1 specifies that the viewpoint belongs to a separate coordinate system (with its own origin) for the alignment of viewpoint groups and the ViewpointGroupStruct is present. The group_alignment_flag equal to 0 specifies that the viewpoint belongs to the common reference coordinate system. When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.

[00138] In addition, the semantics of vwpt_group_id is changed as follows:

vwpt_group_id indicates the identifier of a viewpoint group. All viewpoints in a viewpoint group share a common reference coordinate system. When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.
[00139] Finally, the following paragraph is removed:
The first sample shall have group_alignment_flag equal to 1. For subsequent samples when the group information does not change, the ViewpointGroupStruct() can be absent. When the ViewpointGroupStruct() is absent in a sample, it is inferred to be identical to the ViewpointGroupStruct() of the previous sample, in decoding order.
[00140] In a third embodiment, ViewpointGroupStruct() is always signaled in the ViewpointInfoStruct() but not in ViewpointPosStruct(). In this approach, the viewpoint group identifier is included in a viewpoint group structure, which is further included in a viewpoint information structure, which is further included in the viewpoint group box. In this way, the viewpoint group ID 912, 922, 942, and 952 is included in the ViewpointGroupBox, which is included in the stsd 911, 921, 941, and 951. This brings the viewpoint group ID 912, 922, 942, and 952 into the tracks 910, 920, 940, and 950. The following changes to the code described with respect to FIG. 8 can implement such an embodiment.
[00141] The syntaxes of ViewpointInfoStruct(), ViewpointGroupBox() and DynamicViewpointSampleEntry() are changed as follows, where removed parts are shown in strikethrough, and additions are shown in underline.

aligned(8) ViewpointInfoStruct(gcs_rotation_flag) {
    ViewpointPosStruct();
    ViewpointGroupStruct();
    unsigned int(1) gcs_rotation_flag;
    bit(7) reserved = 0;
    if (gcs_rotation_flag)
        ViewpointGlobalCoordinateSysRotationStruct();
    unsigned int(1) group_alignment_flag;
    bit(7) reserved = 0;
    if (group_alignment_flag)
        ViewpointGroupStruct();
}

aligned(8) class ViewpointGroupBox extends TrackGroupTypeBox('vipo') {
    ViewpointInfoStruct(); ViewpointPosStruct();
    string viewpoint_label;
}

class DynamicViewpointSampleEntry extends MetaDataSampleEntry('dyvp') {
    ViewpointInfoStruct(); ViewpointPosStruct();
    unsigned int(1) dynamic_gcs_rotation_flag;
    bit(7) reserved = 0;
    if (!dynamic_gcs_rotation_flag)
        ViewpointGlobalCoordinateSysRotationStruct();
}
[00142] Furthermore, the semantics of group_alignment_flag is removed:

group_alignment_flag equal to 1 specifies that the viewpoint belongs to a separate coordinate system (with its own origin) for the alignment of viewpoint groups and the ViewpointGroupStruct is present. The group_alignment_flag equal to 0 specifies that the viewpoint belongs to the common reference coordinate system. When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.
[00143] In addition, the semantics of vwpt_group_id is changed as follows:

vwpt_group_id indicates the identifier of a viewpoint group. All viewpoints in a viewpoint group share a common reference coordinate system. When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable, because the viewpoints belong to different coordinate systems.
[00144] Finally, the following paragraph is removed:
The first sample shall have group_alignment_flag equal to 1. For subsequent samples when the group information does not change, the ViewpointGroupStruct() can be absent. When the ViewpointGroupStruct() is absent in a sample, it is inferred to be identical to the ViewpointGroupStruct() of the previous sample, in decoding order.
[00145] FIG. 10 is an embodiment of a method 1000 of performing viewpoint switching based on viewpoint group identification, as discussed with respect to VR video file 900, between viewpoints in a viewpoint group, such as viewpoint group 800. Hence, method 1000 may be employed by a decoder 107, a decoder 600, and/or a codec system 400. Method 1000 can be employed to support viewpoint switching, for example between viewpoints 702, 703, and/or 704, when displaying a VR video file, such as VR video file 900, and hence can be employed to improve methods 200 and 300.
[00146] Method 1000 initiates when the decoder determines to switch from a source viewpoint to a destination viewpoint. At step 1001, the video decoder processes a VR video stream. The VR video stream comprises a plurality of viewpoints included in a viewpoint group. Further, each of the viewpoints in the viewpoint group corresponds to one particular omnidirectional video camera for capturing an omnidirectional video at a particular location. In addition, the VR video stream contains information indicative of the viewpoint group. For example, the information indicative of the viewpoint group may comprise a viewpoint group identifier.
[00147] At step 1003, the decoder presents a first viewport of a first viewpoint in the viewpoint group to a user. The decoder then determines to switch from the first viewpoint to a second viewpoint in the viewpoint group at step 1005. This may occur in response to receiving user input, for example.
[00148] At step 1007, the decoder determines a second viewport of the second viewpoint based on the information indicative of the viewpoint group. For example, the decoder can obtain a viewpoint group identifier that is stored in a track containing the first viewpoint, the second viewpoint, or combinations thereof. As such, the decoder can use the viewpoint group identifier to determine whether the first viewpoint and the second viewpoint are part of the same viewpoint group, and hence whether the second viewport can validly be determined based on the first viewport. As the viewpoint group identifier can be stored in any track, the viewpoint group identifier can be contained in a track configured for a static viewpoint. Accordingly, the switching behavior of method 1000 can be employed when switching to and/or from static viewpoints, and not solely for dynamic viewpoints. As discussed above, the viewpoint group identifier can be included in a viewpoint group box. In one example, the viewpoint group identifier is included in a viewpoint group structure, which is included in the viewpoint group box. In another example, the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint position structure, which is further included in the viewpoint group box. In yet another example, the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint information structure, which is further included in the viewpoint group box.
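The switching rule of steps 1005 through 1007 can be sketched as follows. This is a deliberately simplified, hypothetical model: it reuses the viewing orientation when the group identifiers match and falls back to a default viewport otherwise, whereas a complete implementation would also account for the relative positions of the viewpoints (or the spatial region correspondences) when deriving the second viewport.

def switch_viewpoint(current_vp, destination_vp, current_viewport):
    # Same viewpoint group: the viewpoints share a reference coordinate system,
    # so the user keeps looking toward the same location in the scene.
    if current_vp["vwpt_group_id"] == destination_vp["vwpt_group_id"]:
        return dict(current_viewport)
    # Different group: coordinates are not comparable, so use the destination's
    # default (e.g., initial) viewport instead.
    return destination_vp.get("default_viewport", {"azimuth": 0.0, "elevation": 0.0})

current_vp = {"vwpt_group_id": 7}
destination_vp = {"vwpt_group_id": 7}
print(switch_viewpoint(current_vp, destination_vp, {"azimuth": 30.0, "elevation": -5.0}))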
[00149] FIG. 11 is an embodiment of another method 1100 of performing viewpoint switching based on viewpoint group identification, as discussed with respect to VR video file 900, between viewpoints in a viewpoint group, such as viewpoint group 800. Hence, method 1100 may be employed by a decoder 107, a decoder 600, and/or a codec system 400. Method 1100 can be employed to support viewpoint switching, for example between viewpoints 702, 703, and/or 704, when displaying a VR video file, such as VR video file 900, and hence can be employed to improve methods 200 and 300.
[00150] Method 1100 initiates at step 1101 upon receiving a VR video stream at a decoder. The VR video stream comprises a viewpoint group designated by a viewpoint identifier. Further, the viewpoint group, and hence also the VR video stream, includes a plurality of viewpoints. The viewpoint identifier is included in one or more tracks including one or more of the plurality of viewpoints. The viewpoint identifier is included in tracks including static viewpoints and may also be included in a metadata track associated with dynamic viewpoints.
[00151] At step 1103, the decoder decodes a first viewport of a first viewpoint of the viewpoint group for display. The decoder then determines to switch to a second viewpoint at step 1105, for example based on user input. At step 1107, the decoder determines that the second viewpoint is included in the viewpoint group with the first viewpoint based on the viewpoint identifier. Based on the determination that the second viewpoint is included in the viewpoint group with the first viewpoint, the decoder can determine a second viewport of the second viewpoint based on the first viewport and the viewpoint identifier at step 1109. The decoder then decodes the second viewport of the second viewpoint for display at step 1111.
[00152] For example, the decoder can obtain a viewpoint group identifier that is stored in a track containing the first viewpoint, the second viewpoint, or combinations thereof. As such, the decoder can use the viewpoint group identifier to determine whether the first viewpoint and the second viewpoint are part of the same viewpoint group, and hence whether the second viewport can validly be determined based on the first viewport. As the viewpoint group identifier can be stored in any track, the viewpoint group identifier can be contained in a track configured for a static viewpoint. Accordingly, the switching behavior of method 1100 can be employed when switching to and/or from static viewpoints, and not solely for dynamic viewpoints. As discussed above, the viewpoint group identifier can be included in a viewpoint group box. In one example, the viewpoint group identifier is included in a viewpoint group structure, which is included in the viewpoint group box. In another example, the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint position structure, which is further included in the viewpoint group box. In yet another example, the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint information structure, which is further included in the viewpoint group box.

[00153] FIG. 12 is an embodiment of a method 1200 of encoding viewpoint group identification at an encoder to support viewpoint switching at a decoder, as discussed with respect to VR video file 900, between viewpoints in a viewpoint group, such as viewpoint group 800. Hence, method 1200 may be employed by an encoder 103, an encoder 500, and/or a codec system 400. Method 1200 can be employed to support viewpoint switching, for example between viewpoints 702, 703, and/or 704, when displaying a VR video file, such as VR video file 900, and hence can be employed to improve methods 200 and 300.
[00154] Method 1200 can be initiated when encoding a VR video. Specifically, at step 1201, the encoder receives video from a plurality of viewpoints associated with a VR video stream. The plurality of viewpoints is part of a viewpoint group.
[00155] At step 1203, the encoder encodes a viewpoint group identifier of the viewpoint group in one or more tracks including one or more of the plurality of viewpoints. For example, the viewpoint group identifier can be encoded before, after, and/or contemporaneously with encoding the video from the viewpoints. The presence of the viewpoint group identifier indicates viewpoints in the viewpoint group include viewports that display a common location in virtual space. Accordingly, a second viewport of a second viewpoint in the viewpoint group is determinable based on a first viewport of a first viewpoint in the viewpoint group during a viewpoint switch at a decoder. Further, the viewpoint identifier is included in tracks including static viewpoints and may also be included in a metadata track associated with dynamic viewpoints. The video file containing the viewpoint group identifier in the one or more tracks can then be forwarded in a bitstream toward a decoder for display at step 1205.
[00156] Placing the viewpoint group identifier in the tracks allows the decoder to use the viewpoint group identifier to determine whether the first viewpoint and the second viewpoint are part of the same viewpoint group, and hence whether the second viewport can validly be determined based on the first viewport. As the viewpoint group identifier can be stored in any track, the viewpoint group identifier can be contained in a track configured for a static viewpoint. Accordingly, method 1200 provides a mechanism to support the switching behavior of method 1100 when a decoder is switching to and/or from static viewpoints, and not solely for dynamic viewpoints. As discussed above, the viewpoint group identifier can be included in a viewpoint group box. In one example, the viewpoint group identifier is included in a viewpoint group structure, which is included in the viewpoint group box. In another example, the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint position structure, which is further included in the viewpoint group box. In yet another example, the viewpoint group identifier is included in a viewpoint group structure, which is included in a viewpoint information structure, which is further included in the viewpoint group box.
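On the encoder side, step 1203 amounts to recording the same group identifier in every viewpoint track. The sketch below models the tracks and their sample description metadata as plain Python dictionaries; the names attach_viewpoint_group_id, tracks, and the dictionary keys are hypothetical and do not correspond to any particular muxer API.

def attach_viewpoint_group_id(tracks, vwpt_group_id):
    """Record the viewpoint group identifier in every viewpoint track (here the
    dicts stand in for stsd metadata), so static as well as dynamic viewpoints carry it."""
    for track in tracks:
        track.setdefault("stsd", {})["vipo"] = {"vwpt_group_id": vwpt_group_id}
    return tracks

tracks = [{"name": "viewpoint_A"}, {"name": "viewpoint_B"}]
for t in attach_viewpoint_group_id(tracks, vwpt_group_id=7):
    print(t["name"], t["stsd"]["vipo"]["vwpt_group_id"])  # both tracks report group 7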
[00157] FIG. 13 is a schematic diagram of an example video coding device 1300 according to an embodiment of the disclosure. The coding device 1300 is suitable for implementing the methods and processes disclosed herein. The coding device 1300 comprises downstream ports 1310 and transceiver units (Tx/Rx) 1320 for transmitting and receiving data to and from a downstream direction; a processor, logic unit, or central processing unit (CPU) 1330 to process the data; upstream ports 1350 coupled to Tx/Rx 1320 for transmitting and receiving the data to and from an upstream direction; and a memory 1360 for storing the data. The coding device 1300 may also comprise optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the downstream ports 1310, the Tx/Rx units 1320, and the upstream ports 1350 for egress or ingress of optical or electrical signals.
[00158] The processor 1330 is implemented by hardware and software. The processor 1330 may be implemented as one or more CPU chips, cores (e.g., as a multi-core processor), field-programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), and digital signal processors (DSPs). The processor 1330 is in communication with the downstream ports 1310, transceiver units 1320, upstream ports 1350, and memory 1360. The processor 1330 comprises a coding module 1370. The coding module 1370 implements the disclosed embodiments described above. For example, the coding module 1370 may implement an encoder 103, an encoder 500, a decoder 107, a decoder 600, and/or a codec system 400, depending on the example. Further, the coding module 1370 may implement method 200, method 300, method 1000, method 1100, and/or method 1200, depending on the example. For example, coding module 1370 may generate or decode a VR video file 900. For example, the coding module 1370 can encode or decode VR video based on a track that contains a viewpoint group identifier denoting that viewpoints, such as viewpoints 702, 703, 704, 810, and/or 820, are part of the same viewpoint group to support determining a destination viewport during viewpoint switching. Accordingly, the inclusion of the coding module 1370 therefore provides a substantial improvement to the functionality of the coding device 1300 and effects a transformation of the coding device 1300 to a different state. Alternatively, the coding module 1370 is implemented as instructions stored in the memory 1360 and executed by the processor 1330.
[00159] The video coding device 1300 may also include input and/or output (I/O) devices 1380 for communicating data to and from a user. The I/O devices 1380 may include output devices such as a display for displaying video data, speakers for outputting audio data, etc. The I/O devices 1380 may also include input devices, such as a keyboard, mouse, trackball, etc., and/or corresponding interfaces for interacting with such output devices.
[00160] The memory 1360 comprises one or more disks, tape drives, and solid-state drives and may be used as an overflow data storage device, to store programs when such programs are selected for execution, and to store instructions and data that are read during program execution. The memory 1360 may be volatile and/or non-volatile and may be read-only memory (ROM), random-access memory (RAM), ternary content-addressable memory (TCAM), and static random-access memory (SRAM).
[00161] FIG. 14 is a schematic diagram of an embodiment of a system 1400 for managing viewpoint switching based on viewpoint group identification. The system 1400 is suitable for implementing the methods and processes disclosed herein, for example, may implement method 200, method 300, method 1000, method 1100, and/or method 1200, depending on the example. The system 1400 includes a video encoder 1402. The encoder 1402 comprises a receiving module 1401 for receiving a plurality of viewpoints associated with a VR video stream. The plurality of viewpoints are part of a viewpoint group. The encoder 1402 also comprises an encoding module 1403 for encoding a viewpoint group identifier of the viewpoint group in one or more tracks including one or more of the plurality of viewpoints. The viewpoint group identifier indicates a second viewport of a second viewpoint in the viewpoint group is determinable based on a first viewport of a first viewpoint in the viewpoint group during a viewpoint switch. The encoder 1402 also comprises a forwarding module 1405 for forwarding the viewpoint group identifier in the one or more tracks in a bitstream toward a decoder for display. The encoder 1402 is further configured to perform other encoding related mechanisms as discussed herein.
[00162] The system 1400 also includes a video decoder 1410. The decoder 1410 comprises a receiving module 1411 for receiving a VR video stream comprising a viewpoint group designated by a viewpoint identifier and including a plurality of viewpoints. The viewpoint identifier is included in one or more tracks including one or more of the plurality of viewpoints. The decoder 1410 also comprises a decoding module 1415 for decoding a first viewport of a first viewpoint of the viewpoint group for display, and decoding a second viewport of a second viewpoint for display. The decoder 1410 also comprises a determining module 1413 for determining to switch to the second viewpoint, determining that the second viewpoint is included in the viewpoint group with the first viewpoint based on the viewpoint identifier, and determining the second viewport of the second viewpoint based on the first viewport and the viewpoint identifier. The decoder 1410 is further configured to perform other decoding, display, and/or viewpoint switching related mechanisms as discussed herein.
[00163] While several embodiments have been provided in the present disclosure, it may be understood that the disclosed systems and methods might be embodied in many other specific forms without departing from the spirit or scope of the present disclosure. The present examples are to be considered as illustrative and not restrictive, and the intention is not to be limited to the details given herein. For example, the various elements or components may be combined or integrated in another system or certain features may be omitted, or not implemented.
[00164] In addition, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, components, techniques, or methods without departing from the scope of the present disclosure. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and may be made without departing from the spirit and scope disclosed herein.


CLAIMS

What is claimed is:
1. A method comprising:
processing, by a processor of a decoder, a virtual reality (VR) video stream, wherein the VR video stream comprises a plurality of viewpoints included in a viewpoint group, wherein each of the viewpoints in the viewpoint group corresponds to one particular omnidirectional video camera for capturing an omnidirectional video at a particular location, and wherein the VR video stream contains information indicative of the viewpoint group;
presenting, by the processor to a display of the decoder, a first viewport of a first viewpoint in the viewpoint group to a user;
switching, by the processor, from the first viewpoint to a second viewpoint in the viewpoint group; and
presenting, by the processor, a second viewport of the second viewpoint in the viewpoint group to a user based on the information indicative of the viewpoint group.
2. The method of claim 1, wherein the information indicative of the viewpoint group comprises a viewpoint group identifier.
3. The method of claim 2, wherein the viewpoint group identifier is stored in a track containing the first viewpoint, the second viewpoint, or combinations thereof.
4. The method of claim 3, wherein the track is configured for a static viewpoint.
5. The method of any of claims 2-4, wherein the viewpoint group identifier is included in a viewpoint group box.
6. The method of claim 5, wherein the viewpoint group identifier is included in a viewpoint group structure, and wherein the viewpoint group structure is included in the viewpoint group box.

7. The method of claim 5, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint position structure, and wherein the viewpoint position structure is included in the viewpoint group box.

8. The method of claim 5, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint information structure, and wherein the viewpoint information structure is included in the viewpoint group box.
9. A method implemented in a decoder, the method comprising:
receiving, by a receiver of the decoder, a virtual reality (VR) video stream comprising a viewpoint group designated by a viewpoint identifier and including a plurality of viewpoints, wherein the viewpoint identifier is included in one or more tracks including one or more of the plurality of viewpoints;
decoding, by a processor of the decoder, a first viewport of a first viewpoint of the viewpoint group for display;
determining, by the processor, to switch to a second viewpoint of the viewpoint group;
determining, by the processor, that the second viewpoint is included in the viewpoint group with the first viewpoint based on the viewpoint identifier;
determining, by the processor, a second viewport of the second viewpoint based on the first viewport and the viewpoint identifier; and
decoding, by the processor, the second viewport of the second viewpoint for display.
10. A method implemented in an encoder, the method comprising:
receiving, by a processor of the encoder, a plurality of viewpoints associated with a virtual reality (VR) video stream, wherein the plurality of viewpoints are part of a viewpoint group;
encoding, by the processor, a viewpoint group identifier of the viewpoint group in one or more tracks including one or more of the plurality of viewpoints, wherein the viewpoint group identifier indicates viewpoints in the viewpoint group include viewports that display a common location in virtual space; and
forwarding, by the processor, the encoded viewpoint group identifier in the one or more tracks in a bitstream toward a decoder for display.
11. The method of any of claims 9-10, wherein the viewpoint identifier is included in tracks including static viewpoints and a metadata track associated with dynamic viewpoints.
12. The method of any of claims 9-11, wherein the viewpoint identifier is included in a viewpoint group box.
13. The method of any of claims 9-12, wherein the viewpoint group identifier is included in a viewpoint group structure, and wherein the viewpoint group structure is included in the viewpoint group box.
14. The method of any of claims 9-12, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint position structure, and wherein the viewpoint position structure is included in the viewpoint group box.
15. The method of any of claims 9-12, wherein the viewpoint group identifier is included in a viewpoint group structure, wherein the viewpoint group structure is included in a viewpoint information structure, and wherein the viewpoint information structure is included in the viewpoint group box.
16. A video coding device comprising:
a processor, a receiver coupled to the processor, a transmitter coupled to the processor, and a memory coupled to the processor, the processor, transmitter, receiver, and memory configured to perform the method of any of claims 1-15.
17. A non-transitory computer readable medium comprising a computer program product for use by a video coding device, the computer program product comprising computer executable instructions stored on the non-transitory computer readable medium such that when executed by a processor cause the video coding device to perform the method of any of claims 1-15.
18. A decoder comprising:
a receiving means for receiving a virtual reality (VR) video stream comprising a viewpoint group designated by a viewpoint identifier and including a plurality of viewpoints, wherein the viewpoint identifier is included in one or more tracks including one or more of the plurality of viewpoints;
a decoding means for:
decoding a first viewport of a first viewpoint of the viewpoint group for display, and
decoding a second viewport of a second viewpoint for display; and
a determination means for:
determining to switch to the second viewpoint,
determining that the second viewpoint is included in the viewpoint group with the first viewpoint based on the viewpoint identifier, and
determining the second viewport of the second viewpoint based on the first viewport and the viewpoint identifier.
19. The decoder of claim 18, wherein the decoder is further configured to perform the method of any of claims 1-9 or 11-15.
20. An encoder comprising:
a receiving means for receiving a plurality of viewpoints associated with a virtual reality (VR) video stream, wherein the plurality of viewpoints are part of a viewpoint group;
an encoding means for encoding a viewpoint group identifier of the viewpoint group in one or more tracks including one or more of the plurality of viewpoints, wherein the viewpoint group identifier indicates a second viewport of a second viewpoint in the viewpoint group is determinable based on a first viewport of a first viewpoint in the viewpoint group during a viewpoint switch; and
a forwarding means for forwarding the viewpoint group identifier in the one or more tracks in a bitstream toward a decoder for display.
21. The encoder of claim 20, wherein the encoder is further configured to perform the method of any of claims 10-15.
PCT/US2019/045978 2018-09-26 2019-08-09 Virtual reality (vr) viewpoint grouping WO2020068284A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862736966P 2018-09-26 2018-09-26
US62/736,966 2018-09-26

Publications (1)

Publication Number Publication Date
WO2020068284A1 true WO2020068284A1 (en) 2020-04-02

Family

ID=69952095

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/045978 WO2020068284A1 (en) 2018-09-26 2019-08-09 Virtual reality (vr) viewpoint grouping

Country Status (1)

Country Link
WO (1) WO2020068284A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180020204A1 (en) * 2015-04-15 2018-01-18 Lytro, Inc. Data structures and delivery methods for expediting virtual reality playback
US20170344843A1 (en) * 2016-05-24 2017-11-30 Qualcomm Incorporated Advanced signaling of a most-interested region in an image
US20180164593A1 (en) * 2016-12-14 2018-06-14 Qualcomm Incorporated Viewport-aware quality metric for 360-degree video
US20180199042A1 (en) * 2017-01-06 2018-07-12 Mediatek Inc. Methods and apparatus for signaling viewports and regions of interest

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11558708B2 (en) * 2018-07-13 2023-01-17 Nokia Technologies Oy Multi-viewpoint multi-user audio user experience
CN113949829A (en) * 2021-10-15 2022-01-18 腾讯科技(深圳)有限公司 Media file packaging method, device, equipment and storage medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19866943

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19866943

Country of ref document: EP

Kind code of ref document: A1