WO2020190928A1 - High level syntax for immersive video coding

High level syntax for immersive video coding

Info

Publication number: WO2020190928A1
Authority: WO (WIPO, PCT)
Prior art keywords: depth, value, bit, cameras, flag
Application number: PCT/US2020/023122
Other languages: French (fr)
Inventors: Jill Boyce, Max Dmitrichenko, Atul Divekar
Original assignee: Intel Corporation
Priority date: March 19, 2019 (U.S. Provisional Application No. 62/820,760)
Application filed by Intel Corporation
Priority to CN202080015175.3A (publication CN113491111A)
Priority to BR112021016388A (publication BR112021016388A2)
Priority to JP2021548154A (publication JP2022524305A)
Priority to KR1020217026220A (publication KR20210130148A)
Priority to EP20773623.2A (publication EP3942797A4)
Publication of WO2020190928A1

Classifications

    • H04N 13/111: Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N 13/117: Transformation of image signals corresponding to virtual viewpoints, the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • H04N 13/128: Adjusting depth or disparity
    • H04N 13/161: Encoding, multiplexing or demultiplexing different image signal components
    • H04N 19/184: Adaptive coding characterised by the coding unit, the unit being bits, e.g. of the compressed video stream
    • H04N 19/36: Scalability techniques involving formatting the layers as a function of picture distortion after decoding, e.g. signal-to-noise [SNR] scalability
    • H04N 19/597: Predictive coding specially adapted for multi-view video sequence encoding
    • H04N 19/70: Syntax aspects related to video coding, e.g. related to compression standards
    • H04N 19/85: Pre-processing or post-processing specially adapted for video compression

Abstract

Techniques related to coding immersive video including multiple texture and depth views of a scene are discussed. Such techniques include reducing the bit-depth of a depth view using a piece-wise linear mapping prior to encode, inverse mapping a decoded low bit-depth depth view to the high bit-depth using the piece-wise linear mapping, and efficiently encoding and decoding camera parameters corresponding to the immersive video.

Description

HIGH LEVEL SYNTAX FOR IMMERSIVE VIDEO CODING
CLAIM OF PRIORITY
This application claims priority to U.S. Provisional Patent Application Serial No.
62/820,760, filed on March 19, 2019 and titled "HIGH LEVEL SYNTAX FOR IMMERSIVE VIDEO CODING", which is incorporated by reference in its entirety.
BACKGROUND
In compression / decompression (codec) systems, compression efficiency and video quality are important performance criteria. For example, visual quality is an important aspect of the user experience in many video applications and compression efficiency impacts the amount of memory storage needed to store video files and/or the amount of bandwidth needed to transmit and/or stream video content. A video encoder compresses video information so that more information can be sent over a given bandwidth or stored in a given memory space or the like.
The compressed signal or data is then decoded by a decoder that decodes or decompresses the signal or data for display to a user. In most implementations, higher visual quality with greater compression is desirable.
Immersive video and 3D video coding standards, such as the under-development MPEG 3DoF+ standard, encode both texture and depth representations of scenes, and utilize the coded depth to render intermediate view positions through view interpolation (which is also referred to as view synthesis). Legacy 2D video coding standards can be used to encode the texture and the depth separately. The most commonly used video coding standards can only support 8-bit or 10-bit input images while many depth representations have 16-bit representations. Furthermore, high level syntax is needed for coding efficiency. It is with respect to these and other considerations that the present improvements have been needed. Such improvements may become critical as the desire to encode and decode immersive and 3D video becomes more widespread.
BRIEF DESCRIPTION OF THE DRAWINGS
The material described herein is illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements. In the figures:
FIG. 1 is a block diagram of an example encoding system for encoding immersive video data; FIG. 2 is a block diagram of an example decoding system for decoding immersive video data;
FIG. 3 illustrates an example piece-wise linear mapping between higher bit-depth depth values and lower bit-depth depth values;
FIG. 4 illustrates an example process for reducing the bit-depth of a depth view using piece-wise linear mapping;
FIG. 5 illustrates generation of example piece-wise linear mapping based on depth value pixel counts in a depth view;
FIG. 6 illustrates an example process for defining a piece-wise linear mapping to map and inverse map depth views between high and low bit-depth values; FIG. 7 illustrates an example process for increasing the bit-depth of a depth view using piece-wise linear mapping;
FIG. 8 illustrates an example camera array and corresponding camera parameters for coding;
FIG. 9 illustrates an exemplary process for encoding immersive video data including camera parameters;
FIG. 10 illustrates an exemplary process for decoding immersive video data including camera parameters;
FIG. 11 is an illustrative diagram of an example system 1100 for coding immersive video data;
FIG. 12 is an illustrative diagram of an example system; and FIG. 13 illustrates an example device, all arranged in accordance with at least some implementations of the present disclosure.
DETAILED DESCRIPTION
One or more embodiments or implementations are now described with reference to the enclosed figures. While specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. Persons skilled in the relevant art will recognize that other configurations and arrangements may be employed without departing from the spirit and scope of the description. It will be apparent to those skilled in the relevant art that techniques and/or arrangements described herein may also be employed in a variety of other systems and applications other than what is described herein.
While the following description sets forth various implementations that may be manifested in architectures such as system-on-a-chip (SoC) architectures for example, implementation of the techniques and/or arrangements described herein are not restricted to particular architectures and/or computing systems and may be implemented by any architecture and/or computing system for similar purposes. For instance, various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or various computing devices and/or consumer electronic (CE) devices such as set top boxes, smart phones, etc., may implement the techniques and/or arrangements described herein. Further, while the following description may set forth numerous specific details such as logic implementations, types and interrelationships of system components, logic partitioning/integration choices, etc., claimed subject matter may be practiced without such specific details. In other instances, some material such as, for example, control structures and full software instruction sequences, may not be shown in detail in order not to obscure the material disclosed herein.
The material disclosed herein may be implemented in hardware, firmware, software, or any combination thereof. The material disclosed herein may also be implemented as instructions stored on a machine-readable medium, which may be read and executed by one or more processors. A machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable medium may include read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), and others.
References in the specification to "one implementation", "an implementation", "an example implementation", etc., indicate that the implementation described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same implementation. Further, when a particular feature, structure, or
characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other implementations whether or not explicitly described herein.
The terms "substantially," "close," "approximately," "near," and "about" generally refer to being within +/- 10% of a target value. For example, unless otherwise specified in the explicit context of their use, the terms "substantially equal," "about equal," and "approximately equal" mean that there is no more than incidental variation among things so described. In the art, such variation is typically no more than +/-10% of a predetermined target value. Unless otherwise specified, the use of the ordinal adjectives "first," "second," and "third," etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking or in any other manner.
Methods, devices, apparatuses, computing platforms, and articles are described herein related to immersive and 3D video coding.
As discussed, immersive video and 3D video coding standards may encode both texture and depth representations of scenes, and utilize the coded depth to render intermediate view positions through view interpolation or synthesis. As used herein, the term immersive video indicates video including multiple paired texture and depth views of a scene. Legacy 2D video coding standards may be used to encode the texture and the depth separately and typically support 8-bit or 10-bit input images. The techniques discussed herein improve quality of immersive video coding by applying a piece-wise linear mapping of depth views (e.g., depth images, frames, or maps) with 16-bit representation to 10-bit representation prior to encoding the depth with a 10-bit video coding standard, such as HEVC Main 10 profile. Although discussed with respect to HEVC Main 10 for the sake of clarity of presentation, any suitable codec may be applied. Furthermore, although discussed with respect to mapping 16-bit to 10-bit representations, the input and output bit-depths may be any suitable bit-depths. Additionally, high level syntax is provided to efficiently describe additional camera parameters used in view interpolation. Such modifications may be made to the HEVC video parameter set extension and/or a new profile may be provided to indicate support for the 3DoF+ standard. Again, although discussed with respect to HEVC for the sake of clarity of presentation, such modifications and/or new profiles may be implemented with respect to any suitable codec.
The techniques discussed herein improve video quality and coding efficiency for the under-development MPEG 3DoF+ standard, but may be applied to other immersive video codec systems that include texture and depth views and for which depth is available at a higher bit representation than what can be encoded. In some embodiments, at the encoder, prior to encoding of the depth views, the 16-bit representation is mapped to a 10-bit representation using a piece-wise linear mapping with metadata describing the piece-wise linear mapping signaled in the bitstream. At the decoder, after decoding the 10-bit depth views, the inverse mapping is performed to map back to the 16-bit representation, and the 16-bit representation is used for view synthesis. Notably, a 10-bit representation of depth may not be sufficient to fully represent the depth and can lead to visual artifacts when using the depth for view synthesis. Furthermore, rounding or discarding least significant bits to represent a 16-bit bit-depth as 10-bit also introduces artifacts. The disclosed techniques mitigate or resolve such problems using the discussed 16-bit to 10-bit mapping with metadata describing the piece-wise linear mapping signaled in the bitstream. The result is objective and subjective video quality improvement (for example, under the common test conditions of the under-development 3DoF+ standard).
Additionally, the techniques discussed herein provide efficient signaling for carrying metadata for the additional camera parameters in a multi-camera immersive video system. In some embodiments, a new HEVC immersive profile is provided to indicate support for the 3DoF+ standard. In some embodiments, a modification of the HEVC video parameter set (VPS) is proposed to indicate the use of an atlas. For example, an atlas may be formed such that a single picture to be encoded with a video codec such as HEVC has many regions that combine data from multiple different camera positions as discussed in Renaud Dore, Franck Thudor, "Outperforming 3DoF+ anchors: first evidence," M43504, July 2018. Notably, the discussed techniques reduce overhead for carrying camera parameters in multi-camera systems to enable more camera representations to be used in a bitstream, resulting in higher video quality. FIG. 1 is a block diagram of an example encoding system 100 for encoding immersive video data, arranged in accordance with at least some implementations of the present disclosure. FIG. 2 is a block diagram of an example decoding system 200 for decoding immersive video data, arranged in accordance with at least some implementations of the present disclosure. That is, FIGS. 1 and 2 illustrate block diagrams of an encode system and a client system (e.g., a decoding system or decode system), respectively.
With reference to FIG. 1, encoding system 100 provides video compression and encoding system 100 may be a portion of a video encode system implemented via a computer or computing device such as a server system or the like. For example, encoding system 100 receives video including, for example, texture and depth from a variety of views (e.g., each from a view or viewpoint of a scene) including texture video 101 and depth video 102 from base view 141, texture video 103 and depth video 104 from one of selected views 142, texture video 105 and depth video 106 from another one of selected views 142 (as well as any additional texture video and depth video from other selected views 142). Encoding system 100 encodes such texture video and depth video to generate a bitstream 131 that may be decoded by a decoder (as discussed below) to generate decompressed versions of such texture and depth video, and optionally generate a synthesized view of the scene represented by base view 141 and selected views 142. Notably, the synthesized view does not necessarily correspond to any such view and may therefore represent a view of the scene not encoded by encoding system 100.
For example, texture encode module 111 receives texture video 101 (e.g., a texture view) such that texture video 101 may have a bit-depth of 8 or 10 bits and texture encode module 111 may encode texture video 101 using a standards compliant codec and profile to generate a corresponding bitstream. Similarly, texture encode module 113 receives texture video 103 having a bit-depth of 8 or 10 bits and texture encode module 113 encodes texture video 103 using a standards compliant codec and profile to generate a corresponding bitstream and texture encode module 115 receives and encodes texture video 105 having a bit-depth of 8 or 10 bits to generate a corresponding bitstream. Notably, texture video 101, 103, 105 has a bit-depth that matches the bit-depth used by texture encode modules 111, 113, 115. Although discussed with respect to bit-depths of 8 or 10 bits, any standard bit-depths may be used.
However, depth video 102, 104, 106 is at a bit-depth that is not employed by a standard encode module. Notably, depth encode modules 112, 114, 116 also operate at a standard bit-depth of 8 or 10 bits but depth video 102, 104, 106 is received at a higher bit-depth such as 16 bits. As shown, prior to encode by depth encode modules 112, 114, 116, depth video 102, 104, 106 are each mapped by mapping modules 121, 123, 125, respectively, from the higher bit-depth (e.g., 16 bits) to a lower bit-depth that is also compliant with a standard encode profile (e.g., 10 bits or 8 bits with 10 bits being represented in FIG. 1). Such bit-depth mapping provides a depth view of 10 bits or 8 bits from a depth view of 16 bits, for example. Such bit-depth mapping is applied in a piece-wise linear fashion based on mapping parameters 161 (e.g., endpoints of line segments representing the piece-wise linear mapping function) that are selected by mapping parameter selection module 110 based on depth video 102, 104, 106, as is discussed further herein below.
As shown, resultant depth video (or depth views) including depth video 122, 124, 126 are generated based on depth video 102, 104, 106, respectively, such that depth video 122, 124, 126 are 8-bit or 10-bit bit-depth video or depth views (with 10 bits being illustrated). As used herein, the terms depth video and depth views are used interchangeably and indicate video having video frames that are depth frames inclusive of pixel-wise depth values. That is, each pixel indicates a depth value. Such lower bit-depth depth video 122, 124, 126 is then encoded by depth encode modules 112, 114, 116 using a standards based codec and profile such as HEVC Main 10 to generate corresponding bitstreams. Furthermore, as used herein, the terms texture view, texture video, and similar terms indicate color video with pixels thereof having a luma value and multiple color values, three color values, or any similar representation. For example, such texture video may include any suitable video frames, video pictures, etc. in any suitable resolution such as video graphics array (VGA), high definition (HD), Full-HD (e.g., 1080p), 4K resolution video, or 8K resolution. Such texture video may be in any suitable color space such as YUV. During encode, depth video and texture video may be divided into coding units, prediction units, and transform units, etc. and encoded using intra and inter techniques as is known in the art.
The resultant bitstreams from depth encode modules 112, 114, 116 (e.g., each representative of depth video) and texture encode modules 111, 113, 115 (e.g., each representative of texture video) are then received by multiplexer 119, which multiplexes the bitstreams into a bitstream 131 for storage, transmission, etc. and, ultimately, for use by a decoder as is discussed further herein below. Bitstream 131 may be compatible with a video compression-decompression (codec) standard such as, for example, HEVC. Although discussed herein with respect to HEVC, the disclosed techniques may be implemented with respect to any codec such as AVC (Advanced Video Coding/H.264/MPEG-4 Part 10), VVC (Versatile Video Coding/MPEG-I Part 3), VP8, VP9, Alliance for Open Media (AOMedia) Video 1 (AV1), the VP8/VP9/AV1 family of codecs, etc. Encoding system 100 may be implemented via any suitable device such as, for example, a server, a personal computer, a laptop computer, a tablet, a phablet, a smart phone, a digital camera, a gaming console, a wearable device, a display device, an all-in-one device, a two-in-one device, or the like. For example, as used herein, a system, device, computer, or computing device may include any such device or platform.
As discussed, in encoding system 100 of FIG. 1, multiple views with texture and depth are available, with the texture in 8 or 10 bits, and the depth in 16 bits. The encoder system selects a subset of views to encode. In the illustrated embodiment, an HEVC encoder is used. For the selected views, the texture is encoded at its available bit representation, and the depth is converted from the 16-bit representation, x, to the 10-bit representation, y. In prior systems, such conversion may be performed according to the following equation: y = round(1024 * x / 65536).
Alternatively, truncation instead of rounding may be employed and clipping may also be applied, for example, to avoid values outside of the allowable range, such as in the following equation: y = Clip3(0, 1023, int(1024 * x ÷ 65536 + 0.5)). Such prior techniques disadvantageously lose information in the bit-depth reduction.
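For illustration only, the following is a minimal Python sketch of such legacy truncation and rounding/clipping conversions; the function names are illustrative and not part of any standard:

    def truncate_depth_16_to_10(x: int) -> int:
        """Legacy truncation: discard the 6 least significant bits of the 16-bit code."""
        return x >> 6  # 65536 / 1024 = 64 = 2^6

    def round_clip_depth_16_to_10(x: int) -> int:
        """Legacy rounding with clipping, per y = Clip3(0, 1023, int(1024 * x / 65536 + 0.5))."""
        y = int(1024 * x / 65536 + 0.5)
        return max(0, min(1023, y))  # Clip3(0, 1023, y)

    # Example: the largest 16-bit depth code maps to the largest 10-bit code either way.
    assert truncate_depth_16_to_10(65535) == 1023
    assert round_clip_depth_16_to_10(65535) == 1023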
In contrast, as shown in FIG. 1, encoding system 100 converts depth video 102, 104, 106 from a higher bit-depth (e.g., a 16-bit representation) to a lower bit-depth (e.g., a 10-bit representation) via mapping modules 121, 123, 125 in encoding system 100 using a piece-wise linear mapping that is applied so that some ranges of depth values are represented at a finer granularity than other ranges of depth values. Such bit-depth mapping is discussed further herein below.
In some embodiments, mapping parameter selection module 110 of encoding system 100 selects the parameters for the piece-wise linear mapping. As shown, such mapping parameters 161 are provided to mapping modules 121, 123, 125 for implementation in the piece-wise linear mapping. Furthermore, mapping parameters 161 (e.g., metadata representing the mapping parameters) are signaled, encoded via metadata encode module 117, and multiplexed via multiplexer 119 along with the encoded bitstreams into bitstream 131 for use by a decoder. In an embodiment, a Supplemental Enhancement Information (SEI) message or other high level syntax structure in the video bitstream or in the systems layer may be used to carry the mapping metadata. In some embodiments, mapping parameters 161 (e.g., the mapping metadata) includes or represents a list of (x,y) coordinates of line segment endpoints, where x represents the depth 16-bit representation values and y represents the converted 10-bit representation values. Furthermore, camera array parameters 151, representative of camera parameters used to capture base view 141 and selected views 142 as well as any other views of a scene may be provided to a camera parameters encode module 118 for encode and the resultant bitstream is provided to multiplexer 119 for inclusion in bitstream 131. Camera array parameters 151 include parameters representative of the cameras during video capture including, for example, camera positions (e.g., x, y, z information), orientations of the cameras in space (e.g., yaw, pitch, and roll information or quaternion information), and camera characteristics such as focal lengths, field of view, and Znear and Zfar, which indicate the physical distances from the camera that the lowest and highest 16 bit values (0 and 65535) correspond to in the disparity depth files, etc. Camera parameters encode module 118 may advantageously encode such camera array parameters 151 to reduce redundancy and enhance efficiency as discussed further herein below.
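As a non-normative illustration, camera array parameters 151 might be organized per camera roughly as in the following Python sketch; the field names and values are assumptions for this example rather than syntax elements defined by any standard:

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class CameraParams:
        """Per-camera metadata of the kind carried in camera array parameters 151 (illustrative only)."""
        position: Tuple[float, float, float]     # camera center (x, y, z)
        orientation: Tuple[float, float, float]  # yaw, pitch, roll (a quaternion could be used instead)
        focal_length: float                      # focal length
        field_of_view: float                     # field of view (e.g., horizontal, in degrees)
        z_near: float                            # near depth bound of the coded 16-bit depth/disparity range
        z_far: float                             # far depth bound of the coded 16-bit depth/disparity range

    # A camera array is then simply a list of such records, one per capture position.
    camera_array = [
        CameraParams((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1546.7, 90.0, 0.3, 100.0),
        CameraParams((0.1, 0.0, 0.0), (0.0, 0.0, 0.0), 1546.7, 90.0, 0.3, 100.0),
    ]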
Turning now to FIG. 2, decoding system 200 may be implemented as a client system, for example. As with encoding system 100, decoding system 200 may be implemented via any suitable device such as, for example, a server, a personal computer, a laptop computer, a tablet, a phablet, a smart phone, a digital camera, a gaming console, a wearable device, a display device, an all-in-one device, a two-in-one device, or the like. For example, as used herein, a system, device, computer, or computing device may include any such device or platform. Decoding system 200 receives bitstream 131 from any source and decoding system 200 demultiplexes bitstream 131 for decode.
As shown, each bitstream demultiplexed from demultiplexer 219 is provided to a corresponding module including one of texture decode modules 211, 213, 215, depth decode modules 212, 214, 216, metadata decode module 217, and camera parameters decode module 218. As shown, texture decode module 211 receives and decodes a bitstream to generate texture video 201, which corresponds to texture video 101, texture decode module 213 receives and decodes a bitstream to generate texture video 203 corresponding to texture video 103, and texture decode module 215 receives and decodes a bitstream to generate texture video 205 corresponding to texture video 105. It is noted that texture video 201, 203, 205 corresponds to but does not perfectly replicate texture video 101, 103, 105 due to the lossy encode/decode processing. Such decodes may be performed using any suitable technique or techniques. Notably, such decodes provide texture video 201, 203, 205 at a bit-depth (e.g., 8 bit or 10 bit) that is suitable for use by a view synthesis module 231 in generating synthesized views 250 for use by a head mounted display (HMD) 251 (or other display device) based on head mounted display position information (HMD pos. info.) 252.
Furthermore, metadata decode module 217 receives a pertinent bitstream and decodes mapping parameters 161 for use by inverse mapping modules 221, 223, 225 of decoding system 200. Notably, such decode faithfully recreates mapping parameters 161, including a list of (x,y) coordinates of line segment endpoints or a similar data structure such that x represents the depth 16-bit representation values and y represents the converted 10-bit representation values. For example, inverse mapping modules 221, 223, 225 invert the processing performed by mapping modules 121, 123, 125 to generate higher bit-depth depth video from lower bit-depth depth video. In some embodiments, the line segment endpoints metadata is sent in a depth mapping SEI message, or other high level syntax or systems signaling. In some embodiments, more than one set of line segment endpoints are signaled. In some embodiments, the SEI message persists until it is cancelled or another SEI message with new parameters is sent to displace it. Exemplary SEI syntax and semantics are provided in Table 1 and the following discussion.
[Table 1: Exemplary SEI Syntax and Semantics. The syntax table image is not reproduced here; its syntax elements are described below.]
In some embodiments, the SEI message parameters persist in coding order until canceled. In Table 1, the parameters may be implemented as follows: dm_cancel_flag equal to 1 indicates that the persistence of any previous depth mapping SEI message is cancelled. dm_num_mapping_sets specifies the number of piece-wise linear mapping sets to be signaled. dm_num_segments[ i ] specifies the number of line segments in the i-th mapping set. dm_seg_x[ i ][ j ] and dm_seg_y[ i ][ j ] specify the j-th (x, y) line segment end point defined in the i-th mapping set.
As discussed further below, for each sample DecodedValue in a depth coded picture using the i-th mapping set, the variable InverseMappedValue is determined using the following:

    for( j = 0; j < dm_num_segments[ i ]; j++ ) {
        if( DecodedValue >= dm_seg_y[ i ][ j ] && DecodedValue <= dm_seg_y[ i ][ j + 1 ] )
            InverseMappedValue = Clip3( 0, 65535, int( dm_seg_x[ i ][ j ] + 0.5 +
                ( ( DecodedValue - dm_seg_y[ i ][ j ] ) ÷ ( dm_seg_y[ i ][ j + 1 ] - dm_seg_y[ i ][ j ] ) )
                * ( dm_seg_x[ i ][ j + 1 ] - dm_seg_x[ i ][ j ] ) ) )
    }
For example, the preceding pseudocode may be used, for each sample DecodedValue in a depth coded picture using the i-th mapping set, to determine the variable InverseMappedValue. Thereby, 10-bit depth values may be mapped to 16-bit depth values for use at the decoder system in view synthesis.
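The following Python sketch restates that inverse-mapping rule for a single decoded sample; seg_x and seg_y stand in for dm_seg_x[ i ][ . ] and dm_seg_y[ i ][ . ] and are assumed to be strictly increasing:

    def inverse_map_value(decoded_value: int, seg_x: list, seg_y: list) -> int:
        """Map one decoded low bit-depth depth sample back to the 16-bit range using the
        piece-wise linear segments signaled in the depth mapping SEI message (sketch only)."""
        for j in range(len(seg_y) - 1):
            if seg_y[j] <= decoded_value <= seg_y[j + 1]:
                t = (decoded_value - seg_y[j]) / (seg_y[j + 1] - seg_y[j])
                x = seg_x[j] + 0.5 + t * (seg_x[j + 1] - seg_x[j])
                return max(0, min(65535, int(x)))  # Clip3(0, 65535, ...)
        raise ValueError("decoded value lies outside the signaled mapping range")

    # Example with three segments (hypothetical endpoints): near depths get finer granularity.
    seg_x = [0, 16384, 32768, 65535]
    seg_y = [0, 640, 896, 1023]
    print(inverse_map_value(640, seg_x, seg_y))  # -> 16384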
With continued reference to FIG. 2, depth decode module 212 receives and decodes a bitstream to generate depth video 222, which corresponds to depth video 122 and such that depth video 222 is at a lower bit-depth such as 8 or 10 bits. Notably, depth video 222 may not be suitable for use by view synthesis module 231 due to the low bit-depth. Depth video 222 and mapping parameters 161 are received by inverse mapping module 221. Inverse mapping module 221 maps depth video 222 (at the lower bit-depth) to depth video 202 at a higher bit-depth such as 16 bits as illustrated. Depth video 202 corresponds to depth video 102, although it is not a perfect replication due to the lossy encode/decode processing. Notably, inverse mapping module 221 performs the inverse piece-wise linear mapping with respect to the piece-wise linear mapping performed by mapping module 121. As shown, mapping parameters 161 are used in both contexts. Such mapping parameters 161 retain more information in some depth ranges of the depth maps while losing more information in other depth ranges. Such piece-wise linear mapping may be contrasted with rounding or truncation techniques that, in effect, lose the same amount of information at all depth ranges. In a similar manner, depth decode module 214 receives and decodes a bitstream to generate depth video 224 corresponding to depth video 124 and such that depth video 224 is at the lower bit-depth (e.g., 8 or 10 bits). Depth video 224 and mapping parameters 161 are received by inverse mapping module 223, which maps depth video 224 to depth video 204 at the higher bit-depth using inverse piece-wise linear mapping techniques as discussed herein. Furthermore, other views are processed in a like manner. For example, depth decode module 216 receives and decodes a bitstream to generate depth video 226 corresponding to depth video 126 and such that depth video 226 is at the lower bit-depth. Depth video 226 and mapping parameters 161 are used by inverse mapping module 225 to map depth video 226 to depth video 206 at the higher bit-depth using inverse piece-wise linear mapping techniques.
Furthermore, camera parameters decode module 218 receives a pertinent bitstream and decodes camera array parameters 151 for use by view synthesis module 231. Such decode faithfully recreates camera array parameters 151, including, for example, parameters
representative of cameras used to generate the views used by view synthesis module 231. For example, camera array parameters 151 may include camera positions (e.g., x, y, z information), orientations of the cameras in space (e.g., yaw, pitch, and roll information or quaternion information), and camera characteristics such as focal lengths, field of view, Znear, Zfar, etc.
View synthesis module 231 receives camera array parameters 151 and head mounted display position information 252 or other information representative of a desired view within a recreated scene and view synthesis module 231 generates view 250. View 250 may include any suitable data structure for presentation of a scene via a display such as head mounted display 251. A user may be wearing head mounted display 251 and the user may be moving head mounted display 251 in space. Based on the position and orientation of head mounted display 251, head mounted display position information 252 is generated to identify a desired virtual view within a scene. Alternatively, the position and orientation from which the virtual view is desired (e.g., a virtual view camera or viewport) may be provided by another system or device for generation of view 250. View synthesis module 231 may generate view 250 based on texture video 201, 203, 205, depth video 202, 204, 206, camera array parameters 151, and head mounted display position information 252 (or other virtual view data) using any suitable technique or techniques. View 250 is then displayed to a user to provide an immersive viewer experience within a scene.
In the client system illustrated in FIG. 2, the bitstreams containing the compressed depth and texture views are decoded (e.g., demultiplexed and decoded). View synthesis is performed based on a selected view position and orientation (e.g., as identified by motion of a head mounted display or other view selection as indicated by HMD position information) to generate a view for a display viewport, using the decoded texture video and the decoded and inverse-mapped depth video. For example, the decoded depth 10-bit views may be converted to a reconstructed 16-bit representation (or higher bit integer or floating point representation), using the inverse piece-wise linear function. View synthesis may then be performed based on a selected view position and orientation (e.g., as identified by motion of a head mounted display or other view selection as indicated by HMD position information), using the reconstructed depth 16-bit representation data (or higher bit integer or floating point representation), and the Znear and Zfar parameters. In some embodiments, the client system may combine the inverse mapping and view synthesis operations, and may perform the inverse mapping calculation only as needed for the view synthesis operation.
As discussed, mapping modules 121, 123, 125 provide a piece-wise linear mapping from higher bit-depth depth values to lower bit-depth depth values and inverse mapping modules 221, 223, 225 invert the process to provide higher bit-depth depth values from lower bit-depth depth values. Discussion now turns to application of piece-wise linear mapping and inverse piece-wise linear mapping. In an embodiment, a list of line segment endpoints is selected by the encoding system and transmitted via the bitstream. In some embodiments, to represent N line segments, N+1 pairs of (x,y) data are used.
FIG. 3 illustrates an example piece-wise linear mapping 300 between higher bit-depth depth values and lower bit-depth depth values, arranged in accordance with at least some implementations of the present disclosure. In the example of FIG. 3, piece-wise linear mapping 300 utilizes 3 line segments 307, 308, 309, which are represented by (x0, y0), (x1, y1), (x2, y2), and (x3, y3). However, any number of line segments may be used. In some embodiments, a piece-wise linear function is used because it has a relatively low complexity for a decoder to implement the inverse mapping function and it does not require a large amount of metadata to signal. In some embodiments, the number of line segments of the piece-wise linear function is restricted to a predefined number to limit the decoder complexity and the metadata signaling overhead.
In the illustrated example, line segment 307 is defined as a line segment between endpoint 303 (x0, y0) and endpoint 304 (x1, y1). Similarly, line segment 308 is defined as a line segment between endpoint 304 (x1, y1) and endpoint 305 (x2, y2) and line segment 309 is defined as a line segment between endpoint 305 (x2, y2) and endpoint 306 (x3, y3). As used herein, the term endpoint indicates a point defining an end of a line segment.
Notably, piece-wise linear mapping 300 maps between high bit-depth values 302 and low bit-depth values 301. For example, high bit-depth values 302 may be 16-bit bit-depth values and low bit-depth values 301 may be 8-bit or 10-bit bit-depth values as discussed herein. In such examples, high bit-depth values 302 of a depth view (e.g., any of depth video 102, 104, 106) have a bit-depth of 16 bits and therefore have an available range of values from 0 to 65,535 (e.g., 2^16 - 1). Such high bit-depth values 302 of the depth view are then forward mapped to low bit-depth values 301 of a depth view (e.g., any of depth video 122, 124, 126) such that low bit-depth values 301 may be 8 bit-depth or 10 bit-depth, for example. In the context of 8 bit-depth, low bit-depth values 301 have an available range from 0 to 255 (e.g., 2^8 - 1). In the context of 10 bit-depth, low bit-depth values 301 have an available range from 0 to 1,023 (e.g., 2^10 - 1).
Furthermore, at least endpoints 304, 305 are between the minimum and maximum values of the components of high bit-depth values 302 and low bit-depth values 301. For example, the range of high bit-depth values 302 from 0 to 65,535 may define a horizontal or x component of piece-wise linear mapping 300 and the range of low bit-depth values 301 from 0 to 255 or 1,023 may define a vertical or y component of piece-wise linear mapping 300. Endpoints 303, 304, 305, 306 are between (0, 0) and (65,535, 255) or (65,535, 1,023), inclusive, such that endpoint 303 may be at (0, 0) and endpoint 306 may be at (65,535, 255) or (65,535, 1,023), although other final endpoints within the available ranges may be defined. Intermediary endpoints such as endpoints 304, 305 are then between such minimum and maximum endpoints such that the horizontal or x components of both endpoints 304, 305 exceed the minimum depth value of the higher bit-depth range (e.g., 0) and the vertical or y components of both endpoints 304, 305 exceed the minimum depth value of the lower bit-depth range (e.g., 0). Similarly, the horizontal or x components of both endpoints 304, 305 are less than the maximum depth value of the higher bit-depth range (e.g., 65,535) and the vertical or y components of both endpoints 304, 305 are less than the maximum depth value of the lower bit-depth range (e.g., 255 or 1,023). As used herein, the term horizontal component of a point corresponds to the coordinate along high bit-depth values 302 and the term vertical component of a point corresponds to the coordinate along low bit-depth values 301.
As shown, endpoints 303, 304 define line segment 307, endpoints 304, 305 define line segment 308, and endpoints 305, 306 define line segment 309 such that line segments 307, 308 share endpoint 304 and line segments 308, 309 share endpoint 305. As discussed, line segment 307 has a greater slope than line segment 308, which, in turn, has a greater slope than line segment 309. Such slopes are defined by endpoints 303, 304, 305, 306, and may be generated in response to depth value pixel counts within depth views used to generate piece-wise linear mapping 300, as discussed further herein below.
In some embodiments, forward mapping is performed by determining which of line segments 307, 308, 309 a particular high bit-depth value of high bit-depth values 302 corresponds to and then determining the mapped low bit-depth value by applying a linear mapping according to the endpoints of the line segments. For example, as shown with respect to high bit-depth value 311, a determination is made that high bit-depth value 311 lies within line segment 308. High bit-depth value 311 is then mapped to a low bit-depth value 312 using a linear mapping based on line segment 308. That is, high bit-depth value 311 is mapped to low bit-depth value 312 by applying a linear function defined by line segment 308 to high bit-depth value 311 to determine resultant low bit-depth value 312.
In some embodiments, a high bit-depth value (e.g., a 16-bit x value) is mapped to a low bit-depth value (e.g., an 8-bit or a 10-bit value), for x in the range of [xi, xi+1], according to Equation (1) as follows:

y = round(yi + ((x - xi) / (xi+1 - xi)) * (yi+1 - yi))     (1)

where x is in the range of [xi, xi+1] (and x is a high bit-depth value) and a linear mapping is provided between [xi, xi+1] and [yi, yi+1] (defining the line segment the high bit-depth value, x, lies within). For example, for high bit-depth value 311 (x), low bit-depth value 312 (y) may be determined as y = y1 + ((x - x1) / (x2 - x1)) * (y2 - y1).
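A small Python sketch of Equation (1), applied per sample, follows; the endpoint lists are hypothetical and would be carried in mapping parameters 161 in the system above:

    def forward_map_value(x: int, seg_x: list, seg_y: list) -> int:
        """Forward piece-wise linear mapping of one high bit-depth depth value x to the
        low bit-depth range, per Equation (1). seg_x and seg_y hold the shared line
        segment endpoints, e.g. seg_x = [x0, x1, x2, x3] and seg_y = [y0, y1, y2, y3]."""
        for i in range(len(seg_x) - 1):
            if seg_x[i] <= x <= seg_x[i + 1]:
                t = (x - seg_x[i]) / (seg_x[i + 1] - seg_x[i])
                return round(seg_y[i] + t * (seg_y[i + 1] - seg_y[i]))
        raise ValueError("x lies outside the mapping range")

    # Example (hypothetical endpoints): the first, steeper segment keeps more precision.
    print(forward_map_value(8192, [0, 16384, 32768, 65535], [0, 640, 896, 1023]))  # -> 320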
Inverse mapping is performed, at decoding system 200, using similar techniques. In some embodiments, inverse mapping is performed by determining which of line segments 307, 308,
309 a particular low bit-depth value of low bit-depth values 301 corresponds to and then determining the mapped high bit-depth value by applying a linear mapping (e.g., an inverse linear mapping) according to the endpoints of the line segments. For example, as shown with respect to low bit-depth value 312, a determination is made that low bit-depth value 312 lies within line segment 308. Low bit-depth value 312 is then mapped to a high bit-depth value 311 using a linear mapping, again based on line segment 308, by applying a linear function defined by line segment 308 to low bit-depth value 312 to determine resultant high bit-depth value 311.
In some embodiments, a low bit-depth value (e.g., an 8-bit or a 10-bit y value) is mapped to a high bit-depth value (e.g., a 16-bit value), for y in the range of [yi, yi+1], according to the inverse of Equation (1). For example, for low bit-depth value 312 (y), high bit-depth value 311 (x) may be determined as x = x1 + ((y - y1) / (y2 - y1)) * (x2 - x1).
Such mapping and/or inverse mapping may be implemented using any suitable computational technique or techniques. In some embodiments, a lookup table may be used to map specific input values (x values) to specific output values (y values) using a mapping table, for either or both of the forward or inverse mapping functions. In some embodiments, truncation instead of rounding may be used in the above equation. Clipping may also be applied in the above equation to, for example, avoid values outside of the allowable range.
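For instance, one way to realize such a lookup table approach is sketched below with NumPy; it assumes strictly increasing endpoint lists and is an optional optimization rather than a normative procedure:

    import numpy as np

    def build_luts(seg_x, seg_y):
        """Precompute lookup tables for both mapping directions so that each depth sample
        is converted with a single array access (assumes strictly increasing endpoints)."""
        seg_x = np.asarray(seg_x, dtype=np.float64)
        seg_y = np.asarray(seg_y, dtype=np.float64)
        # Forward LUT: one entry per possible 16-bit depth code.
        forward_lut = np.rint(np.interp(np.arange(65536), seg_x, seg_y)).astype(np.uint16)
        # Inverse LUT: one entry per possible 10-bit decoded code.
        inverse_lut = np.rint(np.interp(np.arange(1024), seg_y, seg_x)).astype(np.uint16)
        return forward_lut, inverse_lut

    forward_lut, inverse_lut = build_luts([0, 16384, 32768, 65535], [0, 640, 896, 1023])
    low = forward_lut[50000]   # map one 16-bit depth value to the 10-bit range
    high = inverse_lut[low]    # approximate recovery of the 16-bit value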
Notably, as a result of mapping from high bit-depth value 311 to low bit-depth value 312 and back again, high bit-depth value 311 is expected to lose accuracy. However, by applying piece-wise linear mapping 300 having line segments 307, 308, 309, the loss of accuracy may be advantageously varied across depth values 302. For example, line segments 307, 308, 309 having a greater slope will lose less accuracy than line segments 307, 308, 309 having smaller slope due to greater slope line segments having relatively more low bit-depth values (or bins) to map to as compared to smaller slope line segments. This can be seen visually with respect to line segments 307, 309. Line segment 307 has a greater slope and, as a result, high bit-depth values in the range of x0 to x1 have a relatively high number of corresponding low bit-depth values in the range of y0 to y1. This can be contrasted with line segment 309 which covers an equal size of high bit-depth values range from x2 to x3 (e.g., x1 - x0 = x3 - x2) but has a smaller low bit-depth values range from y2 to y3 to translate to (e.g., y1 - y0 > y3 - y2). Therefore, the mapping with respect to line segment 309 translates an equal number of high bit-depth values to a smaller number of low bit-depth values, causing more loss in accuracy, as reflected in the inverse mapping from low bit-depth values to high bit-depth values. Such line segments may be advantageous when a depth view has more objects and/or detail (e.g., edges) at smaller depths as compared to the objects at greater depths. Thereby, more detail is retained in the depth view by employing a piece-wise linear mapping. The selection of endpoints 303, 304, 305, 306 to define line segments 307, 308, 309 for optimal mapping and accuracy loss is discussed further herein below. It is noted that equally spaced endpoints in the x dimension are illustrated merely for the sake of clarity in presentation. Such endpoint spacing is not applied as a constraint in practice.
For example, in the embodiment illustrated in FIG. 3, the slope of line segment 307 is greater than the slope of line segment 309. In some embodiments, such slope differences may be generated in response to a histogram (for a corresponding depth image) indicating more samples (e.g., pixel samples) within the x0 ... x1 range than in the x2 ... x3 range. That is, in response to a greater number of samples in a first range relative to a second range, a first slope of the first range is greater than a second slope of the second range.
With reference to FIGS. 1 and 2, mapping modules 121, 123, 125 map depth video 102, 104, 106 (at a higher bit-depth) to depth video 122, 124, 126 (at a lower bit-depth) and inverse mapping modules 221, 223, 225 map depth video 222, 224, 226 (at a lower bit-depth) to depth video 202, 204, 206 (at a higher bit-depth) in accordance with the techniques discussed with respect to FIG. 3. For example, for each depth value, a mapping is performed as discussed with respect to high bit-depth value 311 and low bit-depth value 312.
FIG. 4 illustrates an example process 400 for reducing the bit-depth of a depth view using piece-wise linear mapping, arranged in accordance with at least some implementations of the present disclosure. Process 400 may include one or more operations 401-406 as illustrated, performed by encoding system 100 for example. Process 400 begins at operation 401, where bit-depth reduction is initiated. Bit-depth reduction may be performed, for example, by mapping modules 121, 123, 125 to map depth video 102, 104, 106 (at a higher bit-depth) to depth video 122, 124, 126 (at a lower bit-depth).
Processing continues at operation 402, where a depth view at a high or higher (relative to a lower target bit-depth) bit-depth is received. The depth view may have any suitable data structure such as a depth frame or image having a depth value (at the higher bit-depth) for each pixel thereof. In some embodiments, the depth view has depth values each with a bit-depth of 16 bits.
In such context, each depth value may be in the range of 0 to 65,535 as discussed. For example, at operation 402, a first depth view having depth values at a high bit-depth may be received such that the higher bit-depth has a large available range of values from a minimum value (e.g., 0) to a maximum value (e.g., 65,535).
Processing continues at operation 403, where line segment endpoints are attained for mapping the depth view at the higher bit-depth to a lower bit-depth such that the line segment endpoints define line segments for the mapping such as line segments 307, 308, 309. Such line segment endpoints and line segments may be attained or generated using any suitable technique or techniques such as those discussed with respect to FIGS. 5 and 6. In some embodiments, line segments are determined by generating a histogram of depth pixel value counts per depth value range using at least a portion of one or more depth views (e.g., depth frames) and recursively generating a plurality of line segments for the mapping including the first and second line segments to minimize a reconstruction error corresponding to the mapping and an inverse mapping.
Processing continues at operation 404, where the high or higher bit-depth depth view is mapped to the lower bit-depth to generate a depth view with a reduced bit-depth. The lower bit-depth may be any bit-depth such as 8 bits or 10 bits. Such mapping is performed according to the line segment endpoints attained at operation 403. For example, for each pixel of the higher bit-depth depth view, a determination may be made as to which line segment of the multiple line segments of the piece-wise linear mapping the pixel pertains (e.g., such that the higher bit-depth value is within the bit-depth value range of the line segment). A linear mapping according to the pertinent line segment is then applied to map the higher bit-depth value to the lower bit-depth depth value. Such processing may be repeated in series or in parallel for all pixels of the higher bit-depth depth view (e.g., one of depth video 102, 104, 106) to generate the lower bit-depth depth view (e.g., one of depth video 122, 124, 126) such as 8 bits (range of 0 to 255) or 10 bits (range of 0 to 1,023).
For example, with reference to FIG. 3, the depth values of the higher depth view may be mapped to depth values at a lower bit-depth that is less than the higher bit-depth to generate another depth view such that the lower bit-depth has an available range of values from a minimum value (e.g., 0) to a maximum value (e.g., 255 or 1,023), using first (e.g., (x1, y1)) and second (e.g., (x2, y2)) line segment endpoints to define a line segment for the mapping such that horizontal (e.g., x1) and vertical (e.g., y1) components of the first line segment endpoint exceed the minimum depth value of the higher bit-depth (e.g., 0) and the minimum depth value of the lower bit-depth (e.g., 0), respectively. Furthermore, the horizontal (e.g., x2) and vertical (e.g., y2) components of the second line segment endpoint are less than the maximum depth value of the higher bit-depth (e.g., 65,535) and the maximum depth value of the lower bit-depth (e.g., 255 or 1,023), respectively. Processing continues at operation 405, where the depth view at the lower bit-depth is output for encode using a standard encoder implementing a standard codec and/or profile. In some embodiments, the lower bit-depth depth view may be encoded using an HEVC Main Profile encoder. As shown, process 400 may end at operation 406. Process 400 may be performed in series or in parallel for any number of depth views. Such depth views may also be characterized as depth frames, depth pictures, or the like. As discussed herein, such reduced bit-depth depth views may be encoded using standard codecs and profiles due to such codecs and profiles being capable of processing the reduced bit-depth.
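As a non-normative illustration of operation 404, the mapping of an entire 16-bit depth frame can be vectorized; the sketch below uses NumPy interpolation over hypothetical line segment endpoints:

    import numpy as np

    def map_depth_frame(depth16: np.ndarray, seg_x, seg_y) -> np.ndarray:
        """Apply the piece-wise linear mapping defined by the line segment endpoints to
        every pixel of a 16-bit depth frame, producing a reduced bit-depth depth view."""
        mapped = np.interp(depth16.astype(np.float64), seg_x, seg_y)
        return np.rint(mapped).astype(np.uint16)  # values fall within [seg_y[0], seg_y[-1]]

    # Example with hypothetical endpoints giving near depths finer granularity than far ones.
    seg_x = [0, 16384, 32768, 65535]
    seg_y = [0, 640, 896, 1023]
    depth_frame = np.random.randint(0, 65536, size=(1080, 1920), dtype=np.uint16)
    reduced = map_depth_frame(depth_frame, seg_x, seg_y)   # 10-bit-range depth view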
As discussed with respect to operation 403, line segment endpoints may be generated at encoding system 100 to define a piece-wise linear mapping from higher bit-depth to lower bit-depth. Such line segment endpoints are used in the mapping and transmitted to decoding system 200 for use in decode. Discussion now turns to generating line segments and line segment endpoints for use in forward and inverse piece-wise linear mapping.
FIG. 5 illustrates generation of example piece-wise linear mapping 300 based on depth value pixel counts in a depth view, arranged in accordance with at least some implementations of the present disclosure. As shown in FIG. 5, the range of high bit-depth values 302 (e.g., 0 to 65,535) may be divided into a number of bins 505 such as histogram bins. Although illustrated with respect to 12 bins 505, any number may be used. For a particular portion of a depth view (e.g., a portion of one or more of depth views 102, 104, 106), pixels thereof are allocated to one of bins 505 based on their depth value. That is, a pixel count is generated for each of bins 505 based on the corresponding depth value range. Such pixel counts generate a histogram 501 indicative of depth value pixel counts. Such pixel counts are then used to generate endpoints and corresponding line segments as discussed further below.
As shown in FIG. 5, line segment 307 has a slope that is greater than a slope of line segment 308 responsive to a pixel count in bins 502 that correspond to line segment 307 having a greater value than a pixel count in bins 503 that correspond to line segment 308. Similarly, line segment 308 has a slope that is greater than a slope of line segment 309 responsive to the pixel count in bins 503 having a greater value than a pixel count in bins 504 that correspond to line segment 309. It is noted again that equally spaced endpoints in the x dimension are illustrated merely for the sake of clarity in presentation. Such endpoint spacing is not an applied constraint in practice, although such endpoint spacing may be implemented. Notably, the slopes of line segments 307, 308, 309 and the coordinates of endpoints 303, 304, 305, 306 are generated based on pixel counts in corresponding bins 502, 503, 504 as generated by histogram 501 or similar counting techniques. The portion of depth views 102, 104, 106 used to generate histogram 501 may be any suitable portion. In an embodiment, histogram 501 is generated on a per frame basis. In some embodiments, histogram 501 is generated using a group of pictures of one of depth views 102, 104, 106. In some embodiments, histogram 501 is generated based on only edge regions of a single frame or multiple frames. Corresponding line segments 307, 308, 309 and the coordinates of endpoints 303, 304, 305, 306 are then used in mapping and inverse mapping, with the coordinates being represented and transmitted in bitstream 131.
FIG. 6 illustrates an example process 600 for defining a piece-wise linear mapping to map and inverse map depth views between high and low bit-depth values, arranged in accordance with at least some implementations of the present disclosure. Process 600 may include one or more operations 601-606 as illustrated, performed by encoding system 100 for example. Process 600 begins at operation 601, where piece-wise linear mapping generation is initiated. Piece-wise linear mapping model generation may be performed, for example, by mapping parameter selection module 110 to generate mapping parameters 161.
Processing continues at operation 602, where depth view data for the mapping is attained. As discussed, a portion of one or more of depth views 102, 104, 106 is used for generating the piece-wise linear mapping. In some embodiments, a single frame is used to determine the piece-wise linear mapping. In some embodiments, a preselected number of frames are used. In some embodiments, a group of pictures (e.g., as defined in an encode structure) is used. In some embodiments, only edge portions of a single frame, a preselected number of frames, or a group of pictures are used. Furthermore, a single depth view may be used to generate mapping parameters for each view, or multiple views may be used to generate mapping parameters shared among the views.
As discussed, encoding system 100 selects the piece-wise linear mapping to be used such that a 10-bit representation of depth is used instead of a 16-bit representation because of the limitations of using 10-bit video encoders, such as HEVC Main 10 Profile. The same techniques may be used to convert 16-bit data to 8-bit data, in order to use an 8-bit video codec, such as HEVC Main profile. Furthermore, the discussed techniques may be used to convert n-bit data to m-bit data such that n exceeds m. When rounding or truncation of the 16-bit data to 10-bit data is performed, view synthesis quality is reduced due to coarser representation of the depth data. The additional step of encoding and decoding can further reduce the video quality. Video quality is more severely impacted at the edges of objects (i.e., where the depth values change between the object and background). In some embodiments, the selection method at encoding system 100 attempts to minimize the video quality reduction involved with reducing the bit representation of the depth data. In some embodiments, histograms are formed, which count the frequency of occurrence of particular depth values in the input depth views. The depth value ranges that occur most frequently are assigned to line segments with higher slope, and the depth value ranges that occur least frequently are assigned to line segments with lower slope. As discussed, higher slope means that more granularity of the representation is retained and, similarly, lower slope means that less granularity of the representation is retained.
In an embodiment, the histogram of depth values is determined separately for edge regions in the picture, by first applying an edge detection algorithm. The line segment endpoints can be selected based on the edge depth histogram, to prioritize giving higher slope to the most frequently occurring depth values occurring at object edges. For example, for a depth image, edge regions may be determined and edge histograms may be generated for the edge regions. Then, in response to a greater number of edge samples being in a first range of the histogram relative to a second range, a first slope of the first range is greater than a second slope of the second range. In some embodiments, edge regions of a depth view are determined and a histogram of depth pixel value counts is generated using pixels in the detected edge regions and exclusive of pixels outside the edge regions. Such a histogram may then be used to determine line segments of a piece-wise linear mapping as discussed below.
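A minimal sketch of such an edge-restricted histogram follows; a simple gradient-magnitude threshold stands in for the edge detection algorithm, and the threshold value is an assumption for illustration rather than a value from the source:

```python
import numpy as np

def edge_region_histogram(depth_high, num_bins=12, max_value=65535,
                          grad_thresh=256):
    """Build a depth-value histogram using only pixels in detected edge regions
    of a 2-D depth view, excluding pixels outside the edge regions."""
    d = depth_high.astype(np.int64)
    # Forward differences in the horizontal and vertical directions, padded so
    # the gradient arrays keep the shape of the depth view.
    gx = np.abs(np.diff(d, axis=1, append=d[:, -1:]))
    gy = np.abs(np.diff(d, axis=0, append=d[-1:, :]))
    edge_mask = (gx + gy) > grad_thresh
    counts, _ = np.histogram(
        d[edge_mask], bins=num_bins, range=(0, max_value + 1))
    return counts
```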
The encoder may select the line segments based on the histograms of any number of views and pictures. In some embodiments, it is advantageous to apply the same set of depth mapping parameters for an entire coded video sequence within the same random access period (e.g., an I picture and a number of additional non-I pictures), to avoid introducing inefficiency in inter-frame prediction that would be caused by using different sets of mapping parameters. In this case, it would be advantageous to use all of the pictures of all of the views in the random access period when selecting the mapping parameters. In an embodiment, the previously discussed histograms (using all samples or edge samples) are generated using all pictures of all views in a random access period or group of pictures. Processing continues at operation 603, where, using the selected depth view data, pixel counts over initial high depth value ranges are determined and at operation 604, where line segments (as defined by line segment endpoints) are recursively generated to minimize reconstruction error.
Techniques for recursively determining an optimal piece-wise linear mapping are now described. Such techniques use the histogram of the higher bit-depth (e.g., 16-bit) depth maps (either over a full frame, a number of frames, or only for edges) and the desired number of segments in the mapping as inputs. Such histograms and numbers of input segments may be determined using any suitable technique or techniques. In an embodiment, the desired number of segments is limited to a particular predefined value. The following techniques determine the optimal x- and y-coordinates for the endpoints (e.g., endpoints 303, 304, 305, 306) of the segments (line segments 307, 308, 309) of the piece-wise linear mapping (e.g., piece-wise linear mapping 300) using dynamic programming.
In the following discussion: (i) the first segment of the piece-wise linear mapping begins at (0,0), (ii) if a segment s starts at (i,k), the previous segment s-1 ends at (i-1, q) where q is k or k-1, and (iii) each segment’s slope ( # of output bins / # of input bins) is restricted to be <=1.
Let E(s,i,k) = minimum reconstruction error when input bins [0..i] map to output bins [0..k] using segments [0..s]. Denote by S(s,i,k) the corresponding optimal solution, which consists of the start and end points of each of the s+1 segments. That is, S is the piece-wise linear mapping that corresponds to the minimum reconstruction error. It is noted that the last segment (index s) ends at (i,k). Next, all possible starting locations (p,q) for this segment may be considered such that (p<=i and q<=k).
Notably, this problem has an optimal substructure property such that the optimal solution S(s,i,k) with error E(s,i,k) has segment s starting at a particular (p,q). Furthermore, the previous segment s-1 in the optimal solution ends at (m,n), which may be (p-1, q) or (p-1, q-1). Then, the optimal solution S(s,i,k) must consist of the optimal solution to (s-1,m,n), denoted S'(s-1,m,n), with error E'(s-1,m,n), followed by the segment s. (If S(s,i,k) did not contain S'(s-1,m,n), then the segments [0..s-1] in S(s,i,k) could be replaced with S'(s-1,m,n). This gives a new solution to (s,i,k) with a lower total error than E(s,i,k), because S'(s-1,m,n) had the lowest possible error of all solutions for (s-1,m,n).) The reconstruction error when segment s maps bins [p..i] to bins [q..k] is denoted by e(s,p,q,i,k). This gives a recursive solution for E(s,i,k) as shown in Equations (2):
E(s,i,k) = minimum over all feasible (p, q) of { E(s-1,m,n) + e(s,p,q,i,k) }, if s > 0
E(s,i,k) = e(0,0,0,i,k), if s = 0
(2)
where (p,q) is the start point of the current segment s, (m,n) is the end point of the previous segment s-1, m = p-1, and n is q or q-1.
Feasible (p,q) are then defined as follows. Segments are assumed to be flat (single output bin) or monotonically increasing. Each previous segment must cover at least one input bin, so the current segment with index s must start at p >= s. Each segment must have slope less than or equal to 1. And, each segment must cover at least one input bin. These constraints may be defined as shown in Equations (3):
s <= p <= i
0 <= q <= k
k-q+1 <= i-p+1 (slope constraint)
s <= i (each segment must cover at least one input bin)
(3)
The error function e(s,p,q,i,k) when segment s maps bins [p..i] to bins [q..k] is then defined as follows. First map input bins uniformly to output bins with definitions as provided in Equations (4):
Let I = i-p+1 = number of input bins
Let J = k-q+1 = number of output bins
Let f = floor(I/J)
Let r = I - f* J
(4) Then, the first r output bins are assigned f+1 input bins each, [0..f] to the first, [f+1, 2f+1] to the second, etc. The remaining J-r output bins are assigned f input bins each.
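The uniform assignment of Equations (4) may be sketched as follows (a Python sketch with illustrative names; it returns, for each output bin, the inclusive range of input bins assigned to it):

```python
def uniform_bin_assignment(p, i, q, k):
    """Map input bins [p..i] uniformly onto output bins [q..k] per Equations (4).

    Returns a list of (first_input_bin, last_input_bin) pairs, one per output
    bin: the first r output bins receive f+1 input bins each, the rest f each.
    """
    I = i - p + 1          # number of input bins
    J = k - q + 1          # number of output bins
    f = I // J             # floor(I / J)
    r = I - f * J
    ranges = []
    start = p
    for j in range(J):
        width = f + 1 if j < r else f
        ranges.append((start, start + width - 1))
        start += width
    return ranges
```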
Next, an error model is provided as follows. Let each input bin span a range D. It may be assumed that the true value of a pixel counted in an input bin i is uniformly distributed over [(i-1)·D .. i·D). Next, suppose that K input bins map to an output bin. Then the inverse mapping reconstructs the pixel to the value at the center of the K·D sized range of these input bins. The squared error se may then be given as shown in the pseudo-code provided by Equations (5):
For K even: Let t represent the bin distance from the center of the interval, for example with K = 6, t = {2, 1, 0, 0, 1, 2}:
se = integral (over 0..D) 1/D · (tD + x)^2 dx = t*t*D*D + t*D*D + D*D/3
For K = 1:
se = 2 · integral (over 0..D/2) 1/D · x^2 dx = D*D/12
For K odd, >1: Let t represent the bin distance from the center of the interval, for example with K = 7, t = {3, 2, 1, 0, 1, 2, 3}.
For t not equal to 0:
se = integral (over 0..D) 1/D · ((t-0.5)D + x)^2 dx = (t-0.5)^2*D*D + (t-0.5)*D*D + D*D/3
Total squared error for the output bin = sum_{over t} (histogram value * se(t))
For simplicity, D may be set to one, D = 1.
(5)
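The error model of Equations (5) may be sketched as follows; D is kept as a parameter, and since the t = 0 term for odd K greater than 1 is not spelled out above, the K = 1 value D*D/12 is assumed for it here:

```python
def bin_distances(K):
    """Bin distances from the center of an interval of K input bins,
    e.g., [2, 1, 0, 0, 1, 2] for K = 6 and [3, 2, 1, 0, 1, 2, 3] for K = 7."""
    half = K // 2
    if K % 2 == 0:
        return list(range(half - 1, -1, -1)) + list(range(half))
    return list(range(half, 0, -1)) + [0] + list(range(1, half + 1))

def output_bin_error(hist_slice, D=1.0):
    """Total squared reconstruction error when the K input bins whose pixel
    counts are given in hist_slice all map to a single output bin."""
    K = len(hist_slice)
    total = 0.0
    for count, t in zip(hist_slice, bin_distances(K)):
        if K == 1 or (t == 0 and K % 2 == 1):
            se = D * D / 12.0   # center bin (assumed value for odd K > 1)
        elif K % 2 == 0:
            se = t * t * D * D + t * D * D + D * D / 3.0
        else:
            u = t - 0.5
            se = u * u * D * D + u * D * D + D * D / 3.0
        total += count * se
    return total
```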
It is noted that an issue may arise when the current segment has a starting point whose y-coordinate (output bin) matches the y-coordinate of the end point of the previous segment. This causes more input bins to be mapped to the same output bin than when the previous segment alone was used, affecting the reconstruction error of the previous segment. One solution is to attribute this additional reconstruction error to the current segment. Another is to disallow this situation, so that the current segment can only start at a new output bin, at a possible cost to optimality.
Using such recursive techniques, line segment endpoints are generated based on input histogram data and a number of selected line segments.
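A sketch of the recursion of Equation (2) under the constraints of Equations (3) follows; it reuses uniform_bin_assignment() and output_bin_error() from the sketches above, works in bin indices rather than depth values, and is intended as an illustration of the dynamic program, not as a reference implementation:

```python
from functools import lru_cache

def select_endpoints(hist, num_segments, num_output_bins):
    """Select piece-wise linear mapping segments by dynamic programming.

    hist            : per-input-bin pixel counts (length = number of input bins).
    num_segments    : desired number of line segments.
    num_output_bins : number of output (lower bit-depth) bins.

    Returns (minimum error, tuple of segments), each segment given as its
    start and end bin coordinates ((p, q), (i, k)).
    """
    n_in = len(hist)

    def seg_error(p, q, i, k):
        # e(s, p, q, i, k): error when input bins [p..i] map to output bins [q..k].
        return sum(output_bin_error(hist[lo:hi + 1])
                   for lo, hi in uniform_bin_assignment(p, i, q, k))

    @lru_cache(maxsize=None)
    def E(s, i, k):
        # Minimum error when bins [0..i] map to [0..k] using segments [0..s].
        if k > i or i < s:                    # infeasible state
            return float('inf'), ()
        if s == 0:                            # first segment begins at (0, 0)
            return seg_error(0, 0, i, k), (((0, 0), (i, k)),)
        best_err, best_segs = float('inf'), ()
        for p in range(s, i + 1):             # feasible start points per Eqs. (3)
            for q in range(k + 1):
                if k - q + 1 > i - p + 1:     # slope constraint (<= 1)
                    continue
                for n in (q, q - 1):          # previous segment ends at (p-1, n)
                    if n < 0:
                        continue
                    prev_err, prev_segs = E(s - 1, p - 1, n)
                    if prev_err == float('inf'):
                        continue
                    total = prev_err + seg_error(p, q, i, k)
                    if total < best_err:
                        best_err = total
                        best_segs = prev_segs + (((p, q), (i, k)),)
        return best_err, best_segs

    return E(num_segments - 1, n_in - 1, num_output_bins - 1)
```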
Processing continues at operation 605, where the generated line segment endpoints are output for use in the forward mapping discussed with respect to mapping modules 121, 123, 125 and for encode into bitstream 131 and, ultimately, for use by decoding system 200 as discussed with respect to inverse mapping modules 221, 223, 225, and at end operation 606. Discussion now turns to use of such line segment endpoints in decode processing.
FIG. 7 illustrates an example process 700 for increasing the bit-depth of a depth view using piece-wise linear mapping, arranged in accordance with at least some implementations of the present disclosure. Process 700 may include one or more operations 701-706 as illustrated, performed by decoding system 200 for example. Process 700 begins at operation 701, where bit-depth increase is initiated. Bit-depth increase may be performed, for example, by inverse mapping modules 221, 223, 225 to map depth video 222, 224, 226 (at a lower bit-depth) to depth video 202, 204, 206 (at a higher bit-depth).
Processing continues at operation 702, where a depth view at a low or lower (relative to a higher target bit-depth) bit-depth is received and decoded. The depth view may have any suitable data structure such as a depth frame or image having a depth value (at the lower bit-depth) for each pixel thereof. In some embodiments, the depth view has depth values each with a bit-depth of 8 bits in a range of 0 to 255. In some embodiments, the depth view has depth values each with a bit-depth of 10 bits. In such a context, each depth value may be in the range of 0 to 1,023. For example, at operation 702, a first depth view having depth values at a low bit-depth may be received such that the low bit-depth has an available range of values from a minimum value (e.g., 0) to a maximum value (e.g., 255 or 1,023).
Processing continues at operation 703, where line segment endpoints are decoded for mapping the depth view at the lower bit-depth to a higher bit-depth such that the line segment endpoints define line segments for the mapping such as line segments 307, 308, 309. Such line segment endpoints and line segments may be decoded using any suitable technique or techniques such as those discussed with respect to FIG. 2. For example, the line segment endpoints may be decoded by metadata decode module 217 from a portion of a bitstream demultiplexed from bitstream 131. As discussed, the line segment endpoints and line segments defined thereby may seek to minimize reconstruction error in the inverse mapping used to increase depth values from the lower bit-depth to the higher bit-depth.
Processing continues at operation 704, where the low or lower bit-depth depth view is mapped to the higher bit-depth to generate a depth view with an increased bit-depth. The higher bit-depth may be any bit-depth such as 16 bits. Such mapping is performed according to the line segment endpoints decoded at operation 703. For example, for each pixel of the lower bit-depth depth view, a determination may be made as to which line segment of the multiple line segments of the piece-wise linear mapping the pixel pertains (e.g., such that the lower bit-depth value is within the bit-depth value range of the line segment). A linear mapping according to the pertinent line segment is then applied to map the lower bit-depth value to the higher bit-depth depth value. Such processing may be repeated in series or in parallel for all pixels of the lower bit-depth depth view (e.g., one of depth video 222, 224, 226, at 8 bits with a range of 0 to 255 or 10 bits with a range of 0 to 1,023) to generate the higher bit-depth depth view (e.g., one of depth video 202, 204, 206).
For example, with reference to FIG. 3, the depth values of the depth view may be mapped to depth values at a higher bit-depth that is greater than the lower bit-depth to generate a depth view such that the higher bit-depth has an available range of values from a minimum value (e.g., 0) to a maximum value (e.g., 65,535), using first (e.g., (x1, y1)) and second (e.g., (x2, y2)) line segment endpoints to define a line segment for the mapping such that horizontal (e.g., x1) and vertical (e.g., y1) components of the first line segment endpoint exceed the minimum depth value of the higher bit-depth (e.g., 0) and the minimum depth value of the lower bit-depth (e.g., 0), respectively. Furthermore, the horizontal (e.g., x2) and vertical (e.g., y2) components of the second line segment endpoint are less than the maximum depth value of the higher bit-depth (e.g., 65,535) and the maximum depth value of the lower bit-depth (e.g., 255 or 1,023), respectively.
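A minimal sketch of the decoder-side inverse mapping follows, assuming the decoded endpoints have strictly increasing vertical (lower bit-depth) components so that the inverse is well defined; names are illustrative:

```python
import numpy as np

def inverse_map_depth(depth_low, endpoints):
    """Map decoded lower bit-depth depth values back to the higher bit-depth
    using the decoded line segment endpoints.

    endpoints : list of (x, y) pairs with x in the higher bit-depth range and
                y in the lower bit-depth range, assumed strictly increasing in y.
    """
    xs = np.array([x for x, _ in endpoints], dtype=np.float64)
    ys = np.array([y for _, y in endpoints], dtype=np.float64)
    # Interpolate with the roles of the axes swapped: given y, recover x.
    recon = np.interp(depth_low.astype(np.float64), ys, xs)
    return np.rint(recon).astype(np.uint16)
```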
Processing continues at operation 705, where the depth view at the higher bit-depth is output for view synthesis processing as discussed herein. In some embodiments, the higher bit-depth depth view may be used to generate view 250, which is also generated based on a corresponding texture view as well as other depth and texture view pairs. As shown, process 700 may end at operation 706. Process 700 may be performed in series or in parallel for any number of depth views. Such depth views may also be characterized as depth frames, depth pictures, or the like. As discussed herein, such increased bit-depth depth views may be suitable for and may provide improved performance in the context of view synthesis.
With reference to FIGS. 1 and 2, discussion now turns to encode and decode of camera array parameters 151 via camera parameters encode module 118 and camera parameters decode module 218, respectively. Such coded camera array parameters may be transmitted between encoding system 100 and decoding system 200 using Camera Information SEI Messages. In some embodiments, metadata describing camera parameters, including the metadata that is currently in a json file and additional information, may also be carried in the same or a different SEI message. In some embodiments, it is advantageous to include these parameters in a separate SEI message because the depth mapping parameters may change more frequently than the camera parameters. In an embodiment, a 3DoFplus camera information SEI message represents the data more compactly than the json file does, saving bitrate to transmit it. For each camera, the SEI message may contain (x, y, z) position information and (yaw, pitch, roll or quaternion parameter based) orientation information. In an embodiment, a yaw_pitch_roll flag indicates whether the (yaw, pitch, roll) information is required to be sent, or if they all have default values of 0. For the (x, y, z) parameters, it is typical to have a regular camera spacing. In some embodiments, bitrate savings may be achieved by signaling the granularity of the signaled positions, for each of x, y, and z. For example, spacing values in x, y, and z may be provided for the camera positions for bitrate savings.
Furthermore, video discussed herein may include omnidirectional content or normal perspective content. In some embodiments, the camera information SEI message contains a flag to indicate which is used. For example, the camera information SEI message may include a flag indicating omnidirectional content or normal perspective content. Furthermore, in some implementations, all cameras in a multi-camera immersive media system have the same parameters, including field of view and Znear (a minimum physical depth attainable by the camera) and Zfar (a maximum physical depth attainable by the camera). In some embodiments, signaling overhead is reduced by sending a flag to indicate common values of those parameters can be signaled for all of the views, or if those parameters are to be explicitly signaled for each view. For example, the camera information SEI message may include a flag to indicate common camera parameters between all views or differing camera parameters. In the former example, a single set of corresponding camera parameters are provided and, in the latter example, multiple sets of camera parameters are provided. In examples where sets of cameras share camera parameters, only unique sets of camera parameters may be provided and the SEI message may contain indicators as to which views correspond to which set of camera parameters such that redundant sets of camera parameters are not sent.
FIG. 8 illustrates an example camera array 810 and corresponding camera parameters 800 for coding, arranged in accordance with at least some implementations of the present disclosure. For example, camera array 810 may be used to attain base view 141, selected views 142, and other views, if needed, of a scene. That is, camera array 810 may be used for immersive video generation or capture. In the example of FIG. 8, camera array 810 includes six cameras C1-C6 in a grid pattern such that cameras C1-C3 are in a first row and cameras C4-C6 are in a second row. However, camera array 810 may have any number of cameras in any spacing layout. In an embodiment, camera array 810 includes a single row of cameras. Furthermore, camera array 810 is illustrated such that cameras C1-C6 are all aligned in the x-y plane. However, cameras of camera array 810 may be positioned in any manner.
Each of cameras C1-C6 has corresponding camera parameters 811, 812, 813, 814, 815, 816, respectively. As shown, camera parameters 811 may include a position of camera C1, an orientation of camera C1, and imaging parameters for camera C1. As shown, position parameters include coordinates (x1, y1, z1) of camera C1. Orientation parameters include roll, pitch, and yaw values (r1, p1, yaw1) or quaternion parameters that define the orientation of camera C1. As shown with respect to camera C1, each of cameras C1-C6 may be positioned and oriented throughout 3D space with the position and orientation characterized as 6-DOF motion 835, which shows each of cameras C1-C6 may move with translation: forward/back (e.g., in a z-direction), up/down (e.g., in a y-direction), right/left (e.g., in an x-direction) and rotation: rotating with yaw (e.g., angle around the y-axis), roll (e.g., angle around the x-axis), and pitch (e.g., angle around the z-axis). Imaging parameters include one or more of a projection mode (PM1), a focal length (FL1), a field of view (FOV1), a Znear value (ZN1), and a Zfar value (ZF1). For example, the projection mode may indicate one of omnidirectional content, normal perspective content, or orthogonal content for the camera. That is, the projection mode may indicate one of an omnidirectional projection mode, a normal perspective projection mode, or an orthographic projection mode.
Similarly, camera parameters 812, 813, 814, 815, 816 include like position, orientation, and imaging parameters for each of cameras C2-C6, respectively. Notably, such camera parameters 811, 812, 813, 814, 815, 816 or portions thereof as represented by camera array parameters 151 are encoded into bitstream 131 and decoded by decoder system 200 for use in view synthesis. The techniques discussed herein provide for efficient coding of camera parameters 811, 812, 813, 814, 815, 816 of camera array 810.
As shown with respect to camera C1, system 110 may move throughout 3D space with motion characterized as 6-DOF motion 135, which shows system 110 may move with translation: forward/back (e.g., in an x-direction), up/down (e.g., in a y-direction), right/left (e.g., in a z-direction) and rotation: rotating with yaw (e.g., angle α around the z-axis), roll (e.g., angle β around the y-axis), and pitch (e.g., angle γ around the y-axis). As will be appreciated, 3D content such as VR frames may be generated based on a presumed line of vision (e.g., along the forward direction of the x-axis) of the user of system 110.
FIG. 9 illustrates an exemplary process 900 for encoding immersive video data including camera parameters, arranged in accordance with at least some implementations of the present disclosure. Process 900 may include one or more operations 901-908 as illustrated, performed by encoding system 100 for example. Process 900 begins at operation 901, where camera parameters coding is initiated. Camera parameters coding may be performed, for example, by camera parameters encode module 118 to encode camera parameters 800 for camera array 810 as represented by camera array parameters 151 into bitstream 131.
Processing continues at operation 902, where locations, orientations, and imaging parameters for cameras of a camera array are attained. Such locations, orientations, and imaging parameters may have any suitable data structure. Processing continues at operation 903, where a particular camera parameter type is selected for encode. For example, each of the discussed camera parameters may be processed in turn.
Processing continues at decision operation 904, where a determination is made as to whether, for the selected camera parameter type, a particular parameter value is shared by each of the cameras represented by the camera parameters. If not, processing continues at operation 905, where a flag indicating a shared parameter value for the cameras is set to off (e.g., 0) and the individual parameters for each of the cameras are separately coded.
If a parameter value is shared by each of the cameras for the camera parameter type (e.g., all of cameras C1-C6 have the same parameter value for that parameter type), processing continues at operation 906, where the flag indicating a shared parameter value for the cameras is set to on (e.g., 1) and the shared value is encoded. Notably, operations 904-906 provide for reduced overhead when each of the cameras has a shared parameter value. For example, if the focal length for each of the cameras is the same, a shared focal length flag may be set to on (e.g., 1) and the flag and the shared focal length may be coded. Such processing reduces overhead by not repeating the focal length for each camera. However, if the cameras do not share a focal length, the shared focal length flag is set to off and the focal length of each camera is coded. It is noted that such flags may be set for a number of cameras and then changed. For example, if cameras C1-C4 have the same parameter value and C5 and C6 differ, a flag of on may be set, the parameter value may be set for camera C1, and cameras C2-C4 may use the same parameter value until a flag of off is set, and the parameter values for C5 and C6 may then be individually signaled. For example, the shared parameter value flag may indicate the shared value is used until the flag is switched to off.
Such shared parameter value flags may be characterized simply as parameter value flags and may be labeled or identified based on the parameter value they correspond to. For example, a shared projection mode flag (e.g., a projection mode flag or a projection mode value flag) may indicate that all cameras C1-C6 share the same projection mode (e.g., one of an omnidirectional projection mode, a normal perspective projection mode, or an orthographic projection mode) based on the projection mode flag having a first value (e.g., 1) or that the cameras do not all have the same projection mode based on the projection mode flag having a second value (e.g., 0). A shared field of view flag (e.g., a field of view flag or a field of view shared value flag) may indicate that all cameras C1-C6 share the same field of view based on the field of view flag having a first value (e.g., 1) or that the cameras do not all have the same field of view based on the field of view flag having a second value (e.g., 0). In a similar manner, cameras having the same Znear and Zfar values may be signaled together or separately using a shared Znear flag (e.g., a Znear flag, a Znear shared value flag, or a minimum physical depth flag) and a shared Zfar flag (e.g., a Zfar flag, a Zfar shared value flag, or a maximum physical depth flag) or only a shared Zdistances flag (e.g., a Zdistances flag, a Zdistances shared value flag, or a physical depths flag) to indicate both Znear and Zfar are the same for all of cameras C1-C6. In the case of all such flagging, if the parameter value is shared by all cameras, an indicator of the single shared parameter value, the single shared parameter value itself, or the like is then coded in the bitstream. If not, a separate parameter value for each of the cameras may be signaled.
Furthermore, a shared parameter value flag may be used for multiple camera parameter values. For example, a single shared imaging flag may be used to indicate all of the projection mode, focal length, field of view, Znear value, and Zfar value are to be shared among the cameras. Such a single shared imaging flag is then followed by the parameter values for each of the projection mode, focal length, field of view, Znear value, and Zfar value, and such values are not signaled for the other cameras. Similarly, a single yaw, pitch, and roll flag or a quaternion flag may be signaled to indicate the yaw, pitch, and roll parameters or the quaternion parameters are the same for each of cameras C1-C6. In some embodiments, the values for the yaw, pitch, and roll parameters or the quaternion parameters (to be shared by all cameras) are then signaled. In some embodiments, a second flag may indicate the yaw, pitch, and roll parameters or the quaternion parameters are to be default values such as 0, 0, 0 for yaw, pitch, and roll. In some embodiments, the second flag is not used and, when the single yaw, pitch, and roll flag or quaternion flag is set to on (e.g., 1), the default values are presumed by the decoder.
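The shared-imaging-parameters pattern described above may be sketched as follows; the dictionary keys and the list standing in for a bitstream writer are assumptions for illustration and do not reflect the actual SEI message syntax:

```python
def encode_imaging_params(cameras, out):
    """Write a shared imaging flag followed by either one shared set of imaging
    parameters or one set per camera.

    cameras : list of dicts with 'projection_mode', 'focal_length',
              'field_of_view', 'z_near', and 'z_far' entries.
    out     : list of (name, value) pairs standing in for a bitstream writer.
    """
    keys = ('projection_mode', 'focal_length', 'field_of_view', 'z_near', 'z_far')
    shared = all(all(cam[k] == cameras[0][k] for k in keys) for cam in cameras[1:])
    out.append(('shared_imaging_flag', 1 if shared else 0))
    if shared:
        # A single set of imaging parameters applies to every camera.
        for k in keys:
            out.append((k, cameras[0][k]))
    else:
        # Signal the imaging parameters separately for each camera.
        for idx, cam in enumerate(cameras):
            for k in keys:
                out.append((f'{k}[{idx}]', cam[k]))
```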
In some embodiments, although the camera parameter values are not shared, they may be predicted using a single value. For example, when cameras C1-C6 are equally spaced in a row, a row position flag may be provided. Based on the row position flag, the shared z and y positions for the cameras may then be signaled and used as the same value for all of cameras C1-C6.
Furthermore, the x position for camera C1 may be signaled along with a constant x position spacing value (e.g., xc in FIG. 8) that is the same for all of the cameras. Thereby, the x positions of cameras C2-C6 may be generated using the x position of camera C1 and the constant x position spacing value (e.g., camera CN x position = xi + (N - 1)*xc) and without explicit signaling of the positions of cameras C2-C6.
Similarly, when cameras C1-C6 are equally spaced in a grid (as shown in FIG. 8), a grid position flag may be provided. Based on the grid position flag, the shared z position for the cameras may then be signaled and used as the same value for all of cameras C1-C6. Furthermore, the x and y positions for camera C1 may be signaled along with a constant x position spacing value (e.g., xc in FIG. 8) and a constant y position spacing value (e.g., yc in FIG. 8) that are the same for all of the cameras. Thereby, the x and y positions of cameras C2-C6 may be generated using the x and y positions of camera C1 and the constant x position spacing value and the constant y position spacing value (C2 is at ((xi + xc), (yi)), C4 is at ((xi), (yi + yc)), C5 is at ((xi + xc), (yi + yc)), and so on) and without explicit signaling of the positions of cameras C2-C6.
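For example, the positions implied by the grid signaling may be expanded as in the following sketch (a 3-column, 2-row grid matching the layout of FIG. 8; the spacing values in the commented example are illustrative):

```python
def grid_camera_positions(x_first, y_first, z_shared, xc, yc, num_cols, num_rows):
    """Reconstruct (x, y, z) positions of an equally spaced camera grid from the
    first camera's x and y positions, the shared z position, and the constant
    x and y spacing values, in row-major camera order (C1, C2, ...)."""
    positions = []
    for row in range(num_rows):
        for col in range(num_cols):
            positions.append((x_first + col * xc, y_first + row * yc, z_shared))
    return positions

# For camera array 810 (cameras C1-C6 in a 3-column, 2-row grid):
# positions = grid_camera_positions(x_first=0.0, y_first=0.0, z_shared=0.0,
#                                   xc=0.1, yc=0.1, num_cols=3, num_rows=2)
```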
Processing continues at decision operation 907, where a determination is made as to whether the camera parameter type selected at operation 903 is a last camera parameter type to be processed. If so, processing ends at operation 908. If not, processing continues at operation 903, as discussed above, until the last camera parameter type is processed. Although illustrated with respect to such recursive processing for the sake of clarity of presentation, process 900 may decode any number of camera parameter types in parallel.
As discussed, process 900 provides efficient signaling of camera parameters including camera positions ((x, y, z)), camera orientations ((r, p, yaw) or quaternion values), and camera imaging parameters (PM, FL, FOV, ZN, ZF). Such compression techniques may be combined in any suitable combination. Such compressed flags and values may be encoded into a bitstream and subsequently decoded as discussed with respect to FIG. 10 and used in view synthesis. In some embodiments, such encoding provides a part of an immersive video data encode and includes determining a camera projection mode for each of multiple cameras of a camera array
corresponding to immersive video generation or capture, generating a camera projection mode flag for the cameras, the camera projection mode flag having a first value (e.g., 1) in response to each of the cameras having a shared projection mode and a second value (e.g., 0) in response to any of the cameras having a projection mode other than the shared projection mode, and encoding the camera projection mode flag into a bitstream. In some embodiments, in response to the first camera projection mode flag having the first value, the bitstream further includes a single camera projection mode indicator indicative of the shared projection mode. In some embodiments, in response to the camera projection mode flag having the second value, the bitstream further comprises a plurality of camera projection mode indicators, one for each of the cameras.
In some embodiments, the encoding further includes determining a minimum physical depth and a maximum physical depth attainable by each of the cameras, generating a physical depth flag for the cameras, the physical depth flag indicative of whether the cameras have a shared minimum physical depth and/or a shared maximum physical depth, and encoding the physical depth flag into the bitstream. In some embodiments, the encoding further includes determining one of yaw, pitch, and roll parameters or quaternion parameters for each of the cameras, generating a yaw, pitch, and roll flag or a quaternion flag for the cameras, the yaw, pitch, and roll flag or quaternion parameters flag indicative of whether the cameras have shared yaw, pitch, and roll parameters or shared quaternion parameters, and encoding the yaw, pitch, and roll flag or the quaternion parameters flag into the bitstream. In some embodiments, the yaw, pitch, and roll parameters or quaternion parameters indicate default yaw, pitch and roll or default quaternion parameters flag values. In some embodiments, the encoding further includes determining a shared physical spacing value representative of a shared physical spacing between the cameras and encoding the shared physical spacing value into the bitstream.
FIG. 10 illustrates an exemplary process 1000 for decoding immersive video data including camera parameters, arranged in accordance with at least some implementations of the present disclosure. Process 1000 may include one or more operations 1001-1008 as illustrated, performed by decoding system 200 for example. Process 1000 begins at operation 1001, where camera parameters decoding is initiated. Camera parameters decoding may be performed, for example, by camera parameters decode module 218 to decode camera parameters 800 for camera array 810 as represented by camera array parameters 151 from bitstream 131.
Processing continues at operation 1002, where a bitstream including camera parameter flags and indicators is received and decoded using any suitable technique or techniques. Notably, the bitstream may include any flags, parameters, indicators, etc. as discussed with respect to FIG. 9, which may be decoded to reconstruct the camera parameters. Processing continues at operation 1003, where a first camera parameter type is selected. For example, each of the discussed camera parameters may be processed in turn.
Processing continues at decision operation 1004, where a determination is made as to whether, for the selected camera parameter type, a particular or shared parameter value flag for the selected camera parameter type is set to on (e.g., 1). If not, processing continues at operation 1005, where a camera parameter value is decoded for each of the cameras of the camera array. For example, due to the cameras not sharing a particular or shared parameter value, the value for each camera is then decoded separately.
If the particular or shared parameter value flag for the selected camera parameter type is set to on (e.g., 1), processing continues at operation 1006, where an indicator or value or the like of the particular or shared parameter value is decoded and set for all of the cameras in the camera array. That is, each camera is assigned the same value (or, in the context of position information, the position of each camera is determined using the decoded information as discussed below) for the particular or shared parameter value. Operations 1004-1006 provide for the decode of efficiently packed camera information such that shared camera information may be flagged and encoded once. For example, if the focal length for each of the cameras in the camera array is the same, a shared focal length flag (or focal length flag or the like) may be decoded to be on (e.g., 1) and the shared or particular focal length or an indicator thereof is decoded and assigned to each camera. However, if the shared focal length flag is decoded to be off (e.g., 0), the focal length of each camera is separately decoded. As with encode, in some embodiments, a shared flag of 1 may be set and indicators transmitted that are applied until the flag is switched to 0 and then new parameters are sent. For example, each camera may be assigned a flag of 1 to indicate the sharing is continued for the camera. The first camera with a differing focal length is then flagged to 0 and new parameters are sent. It is noted that such flags may be set for a number of cameras and then changed.
As discussed with respect to encode, such shared parameter value flags may be characterized simply as parameter value flags and may be labeled or identified based on the parameter value they correspond to. For example, a shared projection mode flag may indicate when all cameras share the same projection mode, a shared field of view flag may indicate when all cameras share the same field of view, a shared Znear flag, a shared Zfar flag, or a shared Zdistances flag may indicate the minimum physical depth attainable by the camera, the maximum physical depth attainable by the camera, or both, respectively, are the same for all of cameras. Again as discussed with respect to camera parameters encode, a shared parameter value flag may be used for multiple camera parameter values. For example, a single shared imaging flag may indicate all of the projection mode, focal length, field of view, Znear value, and Zfar value are the same for all the cameras. When such a shared imaging flag is decoded as on, indicators or parameter values for each of the projection mode, focal length, field of view, Znear value, and Zfar value are then decoded from the bitstream and applied to all of cameras.
Similarly, a shared yaw, pitch, and roll flag or a quaternion flag may be decoded and, when flagged as on, a single set of yaw, pitch, and roll parameters or quaternion parameters is decoded and applied to all the cameras. Alternatively, a shared yaw, pitch, and roll flag or a quaternion flag that is on may indicate that such parameters are shared by all the cameras and that the parameters are to be set to default values such as 0, 0, 0 for yaw, pitch, and roll.
Furthermore, in some embodiments, such flagging and parameter sharing may be used by the decoder to determine values at the decoder that are not explicitly encoded. For example, the bitstream may be decoded to determine a row position flag for the cameras is set to on. Based on the row position flag, shared z and y positions for the cameras are decoded and assigned to all of the cameras C1-C6. Furthermore, the x position for camera C1 (e.g., xi in FIG. 8) and a constant x position spacing value (e.g., xc) are decoded. Camera positions in the x-dimension for each of cameras C1-C6 may then be determined by the decoder such that camera C1 is at xi and cameras C2-C6 are at a position offset by xc with respect to a prior camera in the row (e.g., camera CN x position = xi + (N - 1)*xc) and without explicit signaling of the positions of cameras C2-C6. In some embodiments, the bitstream is decoded to determine a grid position flag is on. Based on the grid position flag, the shared z position for the cameras is decoded and assigned to all of cameras C1-C6. The x and y positions for camera C1 (e.g., xi and yi, respectively) and the constant x position spacing value (e.g., xc) and constant y position spacing value (e.g., yc) are decoded. Thereby, the x and y positions of cameras C2-C6 may be generated using the x and y positions of camera C1 and the constant x and y position spacing values.
Processing continues at decision operation 1007, where a determination is made as to whether the camera parameter type selected at operation 1003 is a last camera parameter type to be processed. If so, processing ends at operation 1008. If not, processing continues at operation 1003, as discussed above, until the last camera parameter type is processed. Although illustrated with respect to such recursive processing for the sake of clarity of presentation, process 1000 may decode any number of camera parameter types in parallel.
As discussed, process 1000 provides decode of camera parameters including camera positions ((x, y, z)), camera orientations ((r, p, yaw) or quaternion values), and camera imaging parameters (PM, FL, FOV, ZN, ZF). In some embodiments, such decoding provides a part of an immersive video data decode and includes decoding a bitstream to determine a depth view, a texture view, and a camera projection mode flag corresponding to a plurality of cameras of a camera array, the camera projection mode flag having a first value to indicate all of the cameras have a particular projection mode or a second value to indicate any of the cameras have a projection mode other than the particular projection mode, and generating a decoded view based on the depth view, the texture view, and the camera projection mode flag. In some embodiments, in response to the first camera projection mode flag having the first value, the bitstream further includes a single camera projection mode indicator indicative of the particular projection mode. In some embodiments, in response to the camera projection mode flag having the second value, the bitstream further includes a plurality of camera projection mode indicators, one for each of the cameras. In some embodiments, the shared camera projection mode is one of an omnidirectional projection mode, a normal perspective projection mode, or an orthographic projection mode.
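A sketch of the camera projection mode flag decode described above follows; the reader object with read_flag() and read_uint() helpers is a stand-in for a real bitstream reader, not an API defined in the source:

```python
def decode_projection_modes(reader, num_cameras):
    """Decode the camera projection mode flag and the per-camera projection
    modes: a single shared mode when the flag has the first value, otherwise
    one mode per camera."""
    if reader.read_flag():                # first value: shared projection mode
        shared_mode = reader.read_uint()  # e.g., omnidirectional, perspective, orthographic
        return [shared_mode] * num_cameras
    # Second value: one projection mode indicator per camera.
    return [reader.read_uint() for _ in range(num_cameras)]
```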
In some embodiments, the bitstream further includes one of a yaw, pitch, and roll flag or a quaternion flag for the cameras, the yaw, pitch, and roll flag or quaternion parameters flag indicative of whether the cameras have shared yaw, pitch, and roll parameters or shared quaternion parameters. In some embodiments, the bitstream further includes a physical depth flag for the cameras, the physical depth flag indicative of whether the cameras have a shared minimum physical depth and/or a shared maximum physical depth. In some embodiments, the decode process further includes setting, in response to the yaw, pitch, and roll flag or the quaternion flag indicating the cameras have shared yaw, pitch, and roll parameters or shared quaternion parameters, yaw, pitch, and roll parameters or quaternion parameters for each of the cameras to default yaw, pitch and roll parameters or default quaternion parameters. In some embodiments, the bitstream further includes a shared physical spacing value and the decode processing further includes determining a physical spacing between the cameras using the shared physical spacing value.
Discussion now turns to conformance to the 3DoF+ standard and to supporting atlases created to represent texture and depth as discussed herein.
HEVC uses a two-byte NAL (network abstraction layer) unit header, whose syntax is copied below. It includes 6 bits allocated to nuh_layer_id.
Table 2: Example NAL Unit Header
HEVC defines a base Video Parameter Set (VPS) in section 7, and a VPS extension in Annex F. Annex F describes an Independent Non-Base Layer Decoding (INBLD) capability, in which an independently coded layer may have a nuh_layer_id value that is not equal to 0. The base VPS includes a vps_max_layers_minus1 syntax element. The VPS extension includes the syntax elements splitting_flag, scalability_mask_flag[ i ], dimension_id_len_minus1[ j ], dimension_id[ i ][ j ], and direct_dependency_flag[ i ][ j ]. These syntax elements are used to indicate the characteristics of each layer, including derivation of the ScalabilityId variable, and any dependency relationship between layers. The ScalabilityId variable may be mapped to scalability dimensions, or layer characteristics, as follows.
Table 3: Mapping of ScalabilityId to Scalability Dimensions
Annex G defines the multi-view main profile, in which an enhancement layer is inter-layer predicted from the base layer.
In some embodiments, two atlases are created, to represent texture and depth. These two atlases may be coded as separate layers within the same bitstream, each with a different value of nuh_layer_id. The depth atlas layer is independent, and does not allow inter-layer prediction from the texture atlas layer. A new "Immersive" HEVC profile is provided herein. In an embodiment, the profile allows two independent layers, each of which individually conforms to the Main 10 (or Main) profile with INBLD (Independent Non-Base Layer Decoding) capability. In another embodiment, additional profiles are supported for each independent layer, in particular to allow monochrome representation of the depth, rather than the 4:2:0 chroma format required by the Main and Main 10 profiles.
Creation and definition of this profile enables bitstreams and decoders to indicate conformance to the 3DoF+ standard, which is necessary to ensure interoperability.
In some embodiments, a new entry in the Scalability mask table is provided, AtlasFlag, which must be set to indicate that the coded picture is an atlas, by using one of the reserved values. The existing DepthLayerFlag may be used to differentiate between the texture atlas layer and the depth atlas layer.
Table 4: Modification of mapping of ScalabilityId to scalability dimensions
The Immersive profile requires that the texture atlas layer have DepthLayerFlag equal to 0 and AtlasFlag equal to 1, and that the depth atlas layer have DepthLayerFlag equal to 1 and AtlasFlag equal to 1, as derived from the ScalabilityId value. The Immersive profile does not impose any additional requirements on the value of nuh_layer_id for the two layers, and the VPS provides flexibility in how to signal proper values of ScalabilityId.
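A simple check of the layer constraints stated above might look as follows (the field names are illustrative assumptions):

```python
def conforms_to_immersive_profile(layers):
    """Check that one layer is a texture atlas (DepthLayerFlag == 0, AtlasFlag == 1)
    and one layer is a depth atlas (DepthLayerFlag == 1, AtlasFlag == 1).

    layers : list of dicts with 'depth_layer_flag' and 'atlas_flag' entries.
    """
    has_texture_atlas = any(
        l['depth_layer_flag'] == 0 and l['atlas_flag'] == 1 for l in layers)
    has_depth_atlas = any(
        l['depth_layer_flag'] == 1 and l['atlas_flag'] == 1 for l in layers)
    return has_texture_atlas and has_depth_atlas
```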
Discussion now turns to the most bitrate-efficient method of using the VPS to properly set the ScalabilityId value for the Immersive profile. In some embodiments, nuh_layer_id of 0 is used for the texture atlas layer and nuh_layer_id of 1 is used for the depth atlas layer.
In some embodiments, splitting_flag is set equal to 1 to indicate a simple mapping of the bits of nuh_layer_id to multi-layer characteristics.
In some embodiments, scalability_mask_flag[ i ] is set equal to 000000010001 (with each bit representing an array value) to indicate the layer dimensions to be used are 0
(DepthLayerFlag) and 4 (for AtlasFlag).
In some embodiments, dimension_id_len_minus1[ 1 ] is set equal to 0 to indicate that only 1 bit is needed to represent AtlasFlag.
In some embodiments, direct_dependency_flag[ 1 ][ 0 ] is set to 0 to indicate that the depth atlas layer is independent of the texture atlas layer. For the two layers, Table 5 below describes values of relevant syntax elements and variables. Table values highlighted in gray (i.e., DepthLayerFlag, AtlasFlag, ScalabilityId, and direct_dependency_flag for the depth atlas) are required, while un-highlighted values (i.e., nuh_layer_id) may vary. In some embodiments, the values of DepthLayerFlag and AtlasFlag are required in the Immersive profile, while the nuh_layer_id value may vary.
Table 5: Syntax Elements and Variables
Discussion now turns to systems and devices for implementing the discussed techniques, encoders, and decoders. For example, any encoder or decoder discussed herein may be implemented via the system illustrated in FIG. 11, the system illustrated in FIG. 12, and/or the device illustrated in FIG. 13. Notably, the discussed techniques, encoders, and decoders may be implemented via any suitable device or platform such as a personal computer, a laptop computer, a tablet, a phablet, a smart phone, a digital camera, a gaming console, a wearable device, a display device, an all-in-one device, a two-in-one device, or the like.
Various components of the systems described herein may be implemented in software, firmware, and/or hardware and/or any combination thereof. For example, various components of the devices or systems discussed herein may be provided, at least in part, by hardware of a computing System-on-a-Chip (SoC) such as may be found in a computing system such as, for example, a smart phone. Those skilled in the art may recognize that systems described herein may include additional components that have not been depicted in the corresponding figures. For example, the systems discussed herein may include additional components that have not been depicted in the interest of clarity.
While implementation of the example processes discussed herein may include the undertaking of all operations shown in the order illustrated, the present disclosure is not limited in this regard and, in various examples, implementation of the example processes herein may include only a subset of the operations shown, operations performed in a different order than illustrated, or additional operations. In addition, any one or more of the operations discussed herein may be undertaken in response to instructions provided by one or more computer program products. Such program products may include signal bearing media providing instructions that, when executed by, for example, a processor, may provide the functionality described herein. The computer program products may be provided in any form of one or more machine-readable media. Thus, for example, a processor including one or more graphics processing unit(s) or processor core(s) may undertake one or more of the blocks of the example processes herein in response to program code and/or instructions or instruction sets conveyed to the processor by one or more machine-readable media. In general, a machine-readable medium may convey software in the form of program code and/or instructions or instruction sets that may cause any of the devices and/or systems described herein to implement at least portions of the devices or systems, or any other module or component as discussed herein.
As used in any implementation described herein, the term "module" refers to any combination of software logic, firmware logic, hardware logic, and/or circuitry configured to provide the functionality described herein. The software may be embodied as a software package, code and/or instruction set or instructions, and "hardware", as used in any implementation described herein, may include, for example, singly or in any combination, hardwired circuitry, programmable circuitry, state machine circuitry, fixed function circuitry, execution unit circuitry, and/or firmware that stores instructions executed by programmable circuitry. The modules may, collectively or individually, be embodied as circuitry that forms part of a larger system, for example, an integrated circuit (IC), system on-chip (SoC), and so forth.
FIG. 11 is an illustrative diagram of an example system 1100 for coding immersive video data, arranged in accordance with at least some implementations of the present disclosure. As shown in FIG. 11, system 1100 may include one or more central processing units (CPU) 1101, one or more graphics processing units (GPU) 1102, memory stores 1103, a display 1104, and a transmitter 1105. Also as shown, graphics processing unit 1102 may include or implement a bit depth mapping module 1111, a camera parameters coding module 1112, an encoder 1121, a decoder 1122, and/or a view synthesis module 1123. Such modules may be implemented to perform operations as discussed herein. In the example of system 1100, memory stores 1103 may store texture views, depth views, immersive video data, piece-wise linear mapping parameters, bitstream data, synthesized view data, camera array parameters, or any other data or data structure discussed herein. As shown, in some examples, bit depth mapping module 1111, camera parameters coding module 1112, encoder 1121, decoder 1122, and/or view synthesis module 1123 are implemented via graphics processing unit 1102. In other examples, one or more or portions of bit depth mapping module 1111, camera parameters coding module 1112, encoder 1121, decoder 1122, and/or view synthesis module 1123 are implemented via central processing units 1101 or an image processing unit (not shown) of system 1100. In yet other examples, one or more or portions of bit depth mapping module 1111, camera parameters coding module 1112, encoder 1121, decoder 1122, and/or view synthesis module 1123 are implemented via an imaging processing pipeline, graphics pipeline, or the like.
Graphics processing unit 1102 may include any number and type of graphics processing units, that may provide the operations as discussed herein. Such operations may be implemented via software or hardware or a combination thereof. For example, graphics processing unit 1102 may include circuitry dedicated to manipulate texture views, depth views, bitstream data, etc. obtained from memory stores 1103. Central processing units 1101 may include any number and type of processing units or modules that may provide control and other high level functions for system 1100 and/or provide any operations as discussed herein. Memory stores 1103 may be any type of memory such as volatile memory (e.g., Static Random Access Memory (SRAM),
Dynamic Random Access Memory (DRAM), etc.) or non-volatile memory (e.g., flash memory, etc.), and so forth. In a non-limiting example, memory stores 1103 may be implemented by cache memory. In an embodiment, one or more or portions of bit depth mapping module 1111, camera parameters coding module 1112, encoder 1121, decoder 1122, and/or view synthesis module 1123 are implemented via an execution unit (EU) of graphics processing unit 1102. The EU may include, for example, programmable logic or circuitry such as a logic core or cores that may provide a wide array of programmable logic functions. In an embodiment, one or more or portions of bit depth mapping module 1111, camera parameters coding module 1112, encoder 1121, decoder 1122, and/or view synthesis module 1123 are implemented via dedicated hardware such as fixed function circuitry or the like. Fixed function circuitry may include dedicated logic or circuitry and may provide a set of fixed function entry points that may map to the dedicated logic for a fixed purpose or function. In some embodiments, one or more or portions of bit depth mapping module 1111, camera parameters coding module 1112, encoder 1121, decoder 1122, and/or view synthesis module 1123 are implemented via an application specific integrated circuit (ASIC). The ASIC may include an integrated circuitry customized to perform the operations discussed herein. FIG. 12 is an illustrative diagram of an example system 1200, arranged in accordance with at least some implementations of the present disclosure. In various implementations, system 1200 may be a mobile device system although system 1200 is not limited to this context. For example, system 1200 may be incorporated into a personal computer (PC), laptop computer, ultra-laptop computer, tablet, touch pad, portable computer, handheld computer, palmtop computer, personal digital assistant (PDA), cellular telephone, combination cellular telephone/PDA, television, smart device (e.g., smart phone, smart tablet or smart television), mobile internet device (MID), messaging device, data communication device, cameras (e.g. point-and-shoot cameras, super zoom cameras, digital single-lens reflex (DSLR) cameras), a surveillance camera, a surveillance system including a camera, and so forth.
In various implementations, system 1200 includes a platform 1202 coupled to a display 1220. Platform 1202 may receive content from a content device such as content services device(s) 1230 or content delivery device(s) 1240 or other content sources such as image sensors 1219. For example, platform 1202 may receive image data as discussed herein from image sensors 1219 or any other content source. A navigation controller 1250 including one or more navigation features may be used to interact with, for example, platform 1202 and/or display 1220. Each of these components is described in greater detail below.
In various implementations, platform 1202 may include any combination of a chipset 1205, processor 1210, memory 1212, antenna 1213, storage 1214, graphics subsystem 1215, applications 1216, image signal processor 1217 and/or radio 1218. Chipset 1205 may provide intercommunication among processor 1210, memory 1212, storage 1214, graphics subsystem 1215, applications 1216, image signal processor 1217 and/or radio 1218. For example, chipset 1205 may include a storage adapter (not depicted) capable of providing intercommunication with storage 1214.
Processor 1210 may be implemented as a Complex Instruction Set Computer (CISC) or Reduced Instruction Set Computer (RISC) processors, x86 instruction set compatible processors, multi-core, or any other microprocessor or central processing unit (CPU). In various
implementations, processor 1210 may be dual-core processor(s), dual-core mobile processor(s), and so forth. Memory 1212 may be implemented as a volatile memory device such as, but not limited to, a Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), or Static RAM (SRAM).
Storage 1214 may be implemented as a non-volatile storage device such as, but not limited to, a magnetic disk drive, optical disk drive, tape drive, an internal storage device, an attached storage device, flash memory, battery backed-up SDRAM (synchronous DRAM), and/or a network accessible storage device. In various implementations, storage 1214 may include technology to increase the storage performance enhanced protection for valuable digital media when multiple hard drives are included, for example.
Image signal processor 1217 may be implemented as a specialized digital signal processor or the like used for image processing. In some examples, image signal processor 1217 may be implemented based on a single instruction multiple data or multiple instruction multiple data architecture or the like. In some examples, image signal processor 1217 may be characterized as a media processor. As discussed herein, image signal processor 1217 may be implemented based on a system on a chip architecture and/or based on a multi-core architecture.
Graphics subsystem 1215 may perform processing of images such as still or video for display. Graphics subsystem 1215 may be a graphics processing unit (GPU) or a visual processing unit (VPU), for example. An analog or digital interface may be used to communicatively couple graphics subsystem 1215 and display 1220. For example, the interface may be any of a High-Definition Multimedia Interface, DisplayPort, wireless HDMI, and/or wireless HD compliant techniques. Graphics subsystem 1215 may be integrated into processor 1210 or chipset 1205. In some implementations, graphics subsystem 1215 may be a stand-alone device communicatively coupled to chipset 1205.
The graphics and/or video processing techniques described herein may be implemented in various hardware architectures. For example, graphics and/or video functionality may be integrated within a chipset. Alternatively, a discrete graphics and/or video processor may be used. As still another implementation, the graphics and/or video functions may be provided by a general purpose processor, including a multi-core processor. In further embodiments, the functions may be implemented in a consumer electronics device.
Radio 1218 may include one or more radios capable of transmitting and receiving signals using various suitable wireless communications techniques. Such techniques may involve communications across one or more wireless networks. Example wireless networks include (but are not limited to) wireless local area networks (WLANs), wireless personal area networks (WPANs), wireless metropolitan area networks (WMANs), cellular networks, and satellite networks. In communicating across such networks, radio 1218 may operate in accordance with one or more applicable standards in any version.
In various implementations, display 1220 may include any television type monitor or display. Display 1220 may include, for example, a computer display screen, touch screen display, video monitor, television-like device, and/or a television. Display 1220 may be digital and/or analog. In various implementations, display 1220 may be a holographic display. Also, display 1220 may be a transparent surface that may receive a visual projection. Such projections may convey various forms of information, images, and/or objects. For example, such projections may be a visual overlay for a mobile augmented reality (MAR) application. Under the control of one or more software applications 1216, platform 1202 may display user interface 1222 on display 1220.
In various implementations, content services device(s) 1230 may be hosted by any national, international and/or independent service and thus accessible to platform 1202 via the Internet, for example. Content services device(s) 1230 may be coupled to platform 1202 and/or to display 1220. Platform 1202 and/or content services device(s) 1230 may be coupled to a network 1260 to communicate (e.g., send and/or receive) media information to and from network 1260. Content delivery device(s) 1240 also may be coupled to platform 1202 and/or to display 1220.
Image sensors 1219 may include any suitable image sensors that may provide image data based on a scene. For example, image sensors 1219 may include a semiconductor charge coupled device (CCD) based sensor, a complementary metal-oxide-semiconductor (CMOS) based sensor, an N-type metal-oxide-semiconductor (NMOS) based sensor, or the like. For example, image sensors 1219 may include any device that may detect information of a scene to generate image data.
In various implementations, content services device(s) 1230 may include a cable television box, personal computer, network, telephone, Internet enabled devices or appliance capable of delivering digital information and/or content, and any other similar device capable of uni-directionally or bi-directionally communicating content between content providers and platform 1202 and/or display 1220, via network 1260 or directly. It will be appreciated that the content may be communicated uni-directionally and/or bi-directionally to and from any one of the components in system 1200 and a content provider via network 1260. Examples of content may include any media information including, for example, video, music, medical and gaming information, and so forth.
Content services device(s) 1230 may receive content such as cable television
programming including media information, digital information, and/or other content. Examples of content providers may include any cable or satellite television or radio or Internet content providers. The provided examples are not meant to limit implementations in accordance with the present disclosure in any way.
In various implementations, platform 1202 may receive control signals from navigation controller 1250 having one or more navigation features. The navigation features of navigation controller 1250 may be used to interact with user interface 1222, for example. In various embodiments, navigation controller 1250 may be a pointing device that may be a computer hardware component (specifically, a human interface device) that allows a user to input spatial (e.g., continuous and multi-dimensional) data into a computer. Many systems, such as graphical user interfaces (GUI), televisions, and monitors, allow the user to control and provide data to the computer or television using physical gestures.
Movements of the navigation features of navigation controller 1250 may be replicated on a display (e.g., display 1220) by movements of a pointer, cursor, focus ring, or other visual indicators displayed on the display. For example, under the control of software applications 1216, the navigation features located on navigation controller 1250 may be mapped to virtual navigation features displayed on user interface 1222, for example. In various embodiments, navigation controller 1250 may not be a separate component but may be integrated into platform 1202 and/or display 1220. The present disclosure, however, is not limited to the elements or in the context shown or described herein.
In various implementations, drivers (not shown) may include technology to enable users to instantly turn on and off platform 1202 like a television with the touch of a button after initial boot-up, when enabled, for example. Program logic may allow platform 1202 to stream content to media adaptors or other content services device(s) 1230 or content delivery device(s) 1240 even when the platform is turned “off.” In addition, chipset 1205 may include hardware and/or software support for 5.1 surround sound audio and/or high definition 7.1 surround sound audio, for example. Drivers may include a graphics driver for integrated graphics platforms. In various embodiments, the graphics driver may comprise a peripheral component interconnect (PCI) Express graphics card.
In various implementations, any one or more of the components shown in system 1200 may be integrated. For example, platform 1202 and content services device(s) 1230 may be integrated, or platform 1202 and content delivery device(s) 1240 may be integrated, or platform 1202, content services device(s) 1230, and content delivery device(s) 1240 may be integrated, for example. In various embodiments, platform 1202 and display 1220 may be an integrated unit. Display 1220 and content service device(s) 1230 may be integrated, or display 1220 and content delivery device(s) 1240 may be integrated, for example. These examples are not meant to limit the present disclosure.
In various embodiments, system 1200 may be implemented as a wireless system, a wired system, or a combination of both. When implemented as a wireless system, system 1200 may include components and interfaces suitable for communicating over a wireless shared media, such as one or more antennas, transmitters, receivers, transceivers, amplifiers, filters, control logic, and so forth. An example of wireless shared media may include portions of a wireless spectrum, such as the RF spectrum and so forth. When implemented as a wired system, system 1200 may include components and interfaces suitable for communicating over wired communications media, such as input/output (I/O) adapters, physical connectors to connect the I/O adapter with a
corresponding wired communications medium, a network interface card (NIC), disc controller, video controller, audio controller, and the like. Examples of wired communications media may include a wire, cable, metal leads, printed circuit board (PCB), backplane, switch fabric, semiconductor material, twisted-pair wire, co-axial cable, fiber optics, and so forth.
Platform 1202 may establish one or more logical or physical channels to communicate information. The information may include media information and control information. Media information may refer to any data representing content meant for a user. Examples of content may include, for example, data from a voice conversation, videoconference, streaming video, electronic mail (“email”) message, voice mail message, alphanumeric symbols, graphics, image, video, text and so forth. Data from a voice conversation may be, for example, speech information, silence periods, background noise, comfort noise, tones and so forth. Control information may refer to any data representing commands, instructions or control words meant for an automated system. For example, control information may be used to route media information through a system, or instruct a node to process the media information in a predetermined manner. The embodiments, however, are not limited to the elements or in the context shown or described in FIG. 12.
As described above, system 1200 may be embodied in varying physical styles or form factors. FIG. 13 illustrates an example small form factor device 1300, arranged in accordance with at least some implementations of the present disclosure. In some examples, system 1200 may be implemented via device 1300. In other examples, other systems, components, or modules discussed herein or portions thereof may be implemented via device 1300. In various embodiments, for example, device 1300 may be implemented as a mobile computing device having wireless capabilities. A mobile computing device may refer to any device having a processing system and a mobile power source or supply, such as one or more batteries, for example.
Examples of a mobile computing device may include a personal computer (PC), laptop computer, ultra-laptop computer, tablet, touch pad, portable computer, handheld computer, palmtop computer, personal digital assistant (PDA), cellular telephone, combination cellular telephone/PDA, smart device (e.g., smartphone, smart tablet or smart mobile television), mobile internet device (MID), messaging device, data communication device, cameras (e.g. point-and-shoot cameras, super-zoom cameras, digital single-lens reflex (DSLR) cameras), and so forth.
Examples of a mobile computing device also may include computers that are arranged to be implemented by a motor vehicle or robot, or worn by a person, such as wrist computers, finger computers, ring computers, eyeglass computers, belt-clip computers, arm-band computers, shoe computers, clothing computers, and other wearable computers. In various embodiments, for example, a mobile computing device may be implemented as a smartphone capable of executing computer applications, as well as voice communications and/or data communications. Although some embodiments may be described with a mobile computing device implemented as a smartphone by way of example, it may be appreciated that other embodiments may be implemented using other wireless mobile computing devices as well. The embodiments are not limited in this context.
As shown in FIG. 13, device 1300 may include a housing with a front 1301 and a back 1302. Device 1300 includes a display 1304, an input/output (I/O) device 1306, a color camera 1321, a color camera 1322, and an integrated antenna 1308. In some embodiments, color camera 1321 and color camera 1322 attain planar images as discussed herein. In some embodiments, device 1300 does not include color cameras 1321 and 1322 and device 1300 attains input image data (e.g., any input image data discussed herein) from another device. Device 1300 also may include navigation features 1312. I/O device 1306 may include any suitable I/O device for entering information into a mobile computing device. Examples for I/O device 1306 may include an alphanumeric keyboard, a numeric keypad, a touch pad, input keys, buttons,
switches, microphones, speakers, voice recognition device and software, and so forth. Information also may be entered into device 1300 by way of microphone (not shown), or may be digitized by a voice recognition device. As shown, device 1300 may include color cameras 1321, 1322, and a flash 1310 integrated into back 1302 (or elsewhere) of device 1300. In other examples, color cameras 1321, 1322, and flash 1310 may be integrated into front 1301 of device 1300 or both front and back sets of cameras may be provided. Color cameras 1321, 1322 and a flash 1310 may be components of a camera module to originate color image data with IR texture correction that may be processed into an image or streaming video that is output to display 1304 and/or communicated remotely from device 1300 via antenna 1308 for example.
Various embodiments may be implemented using hardware elements, software elements, or a combination of both. Examples of hardware elements may include processors,
microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as IP cores, may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
While certain features set forth herein have been described with reference to various implementations, this description is not intended to be construed in a limiting sense. Hence, various modifications of the implementations described herein, as well as other implementations, which are apparent to persons skilled in the art to which the present disclosure pertains, are deemed to lie within the spirit and scope of the present disclosure.
In one or more first embodiments, a method for encoding immersive video data comprises receiving a first depth view having first depth values at a first bit-depth, the first bit-depth comprising a first available range of values from a first minimum value to a first maximum value, mapping the first depth values to second depth values at a second bit-depth that is less than the first bit-depth to generate a second depth view, the second bit-depth comprising a second available range of values from a second minimum value to a second maximum value, using first and second line segment endpoints to define a line segment for the mapping, wherein horizontal and vertical components of the first line segment endpoint exceed the first minimum depth value and the second minimum depth value, respectively, and encoding the second depth view and the first and second line segment endpoints into a bitstream.
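By way of illustration only, the piecewise-linear bit-depth mapping described above may be sketched in software as follows. The function name, endpoint values, and use of NumPy are assumptions for exposition and do not reflect any normative syntax or implementation; in this sketch, values outside the signaled endpoint range are simply clamped to the nearest endpoint.

import numpy as np

def map_depth_down(depth_hi, endpoints):
    # Map high-bit-depth depth values (e.g., 16-bit) to a lower bit-depth
    # (e.g., 8-bit) using a piecewise-linear curve; each pair of consecutive
    # (input, output) endpoints defines one line segment of the mapping.
    xs = np.asarray([p[0] for p in endpoints], dtype=np.float64)
    ys = np.asarray([p[1] for p in endpoints], dtype=np.float64)
    mapped = np.interp(depth_hi.astype(np.float64), xs, ys)  # clamps outside [xs[0], xs[-1]]
    return np.round(mapped).astype(np.uint16)

# Example with two segments: the first endpoint exceeds the minimum values and
# the last endpoint is below the maximum values, consistent with the embodiment;
# the interior endpoint spends more of the 8-bit range on nearer depths.
endpoints = [(500, 2), (20000, 96), (60000, 250)]
depth16 = np.asarray([0, 1000, 20000, 50000, 65535], dtype=np.uint16)
depth8 = map_depth_down(depth16, endpoints)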
In one or more second embodiments, further to the first embodiment, wherein horizontal and vertical components of the second line segment endpoint are less than the first maximum depth value and the second maximum depth value, respectively.
In one or more third embodiments, further to the first or second embodiments, the first and second line segment endpoints define a first line segment of the mapping, the mapping further comprising a second line segment defined by the second line segment endpoint and a third line segment endpoint.
In one or more fourth embodiments, further to any of the first through third embodiments, the first line segment has a first slope and the second line segment has a second slope less than the first slope in response to a first depth value pixel count for the first line segment exceeding a second depth pixel value count for the second line segment.
In one or more fifth embodiments, further to any of the first through fourth embodiments, the method further comprises determining at least the first and second line segments of the mapping by generating a histogram of depth pixel value counts per depth value range using at least a portion of the first depth view and recursively generating a plurality of line segments for the mapping including the first and second line segments to minimize a reconstruction error corresponding to the mapping and an inverse mapping.
In one or more sixth embodiments, further to any of the first through fifth embodiments, the method further comprises determining edge regions of the first depth view, wherein the histogram of depth pixel value counts is generated using pixels in the edge regions and exclusive of pixels outside the edge regions.
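A simplified software sketch of this histogram-driven segment selection is given below. The edge detector, bin count, threshold, and proportional codeword-allocation rule are illustrative assumptions only; the recursive search that minimizes reconstruction error is not reproduced here.

import numpy as np

def edge_region_histogram(depth, bins=64, edge_thresh=256.0):
    # Histogram of depth pixel counts per depth value range, restricted to
    # pixels in edge regions (here, where the local depth gradient magnitude
    # exceeds a threshold).
    gy, gx = np.gradient(depth.astype(np.float64))
    edge_mask = np.hypot(gx, gy) > edge_thresh
    counts, bin_edges = np.histogram(depth[edge_mask], bins=bins, range=(0, 65535))
    return counts, bin_edges

def allocate_codewords(counts, out_levels=256):
    # Give each depth value range a share of the output codewords proportional
    # to its edge-pixel count, so heavily populated ranges receive steeper
    # (higher-slope) line segments in the mapping.
    weights = counts / max(counts.sum(), 1)
    return np.maximum(1, np.round(weights * out_levels)).astype(int)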
In one or more seventh embodiments, further to any of the first through sixth embodiments, the method further comprises encoding a texture view corresponding to the first and second depth views into the bitstream.
In one or more eighth embodiments, further to any of the first through seventh embodiments, the first bit-depth is 16 bits, the second bit-depth is one of 8 bits or 10 bits, and the bitstream is a High Efficiency Video Coding (HEVC) compliant bitstream.
In one or more ninth embodiments, a method for decoding immersive video data comprises decoding a bitstream to determine a first depth view having first depth values at a first bit-depth and first and second line segment endpoints for mapping the first depth view to a second depth view, the first bit-depth comprising a first available range of values from a first minimum value to a first maximum value, mapping the first depth values, using the first and second line segment endpoints, to second depth values to generate a second depth view, the second depth values at a second bit-depth greater than the first bit-depth, the second bit-depth comprising a second available range of values from a second minimum value to a second maximum value, wherein horizontal and vertical components of the first line segment endpoint exceed the first minimum value and the second minimum value, respectively, and generating a synthesized view based on the second depth view.
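For illustration, the decoder-side inverse of the mapping sketched earlier may be written as follows, assuming the same (high-bit-depth, low-bit-depth) endpoint pairs have been recovered from the bitstream; this is a sketch, not the normative inverse mapping.

import numpy as np

def map_depth_up(depth_lo, endpoints):
    # Inverse piecewise-linear mapping: expand decoded low-bit-depth depth
    # values (e.g., 8 or 10 bits) back to the higher bit-depth (e.g., 16 bits)
    # by swapping the roles of the signaled endpoint components.
    xs = np.asarray([p[1] for p in endpoints], dtype=np.float64)  # low-bit-depth axis
    ys = np.asarray([p[0] for p in endpoints], dtype=np.float64)  # high-bit-depth axis
    return np.round(np.interp(depth_lo.astype(np.float64), xs, ys)).astype(np.uint16)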
In one or more tenth embodiments, further to the ninth embodiment, horizontal and vertical components of the second line segment endpoint are less than the first maximum depth value and the second maximum depth value, respectively.
In one or more eleventh embodiments, further to the ninth or tenth embodiments, the first and second line segment endpoints define a first line segment of the mapping, the mapping further comprising a second line segment defined by the second line segment endpoint and a third line segment endpoint.
In one or more twelfth embodiments, further to any of the ninth through eleventh embodiments, the first line segment has a first slope and the second line segment has a second slope less than the first slope.
In one or more thirteenth embodiments, further to any of the ninth through twelfth embodiments, the method further comprises decoding a texture view corresponding to the first and second depth views from the bitstream, wherein the synthesized view is generated based on the texture view.
In one or more fourteenth embodiments, further to any of the ninth through thirteenth embodiments, the first bit-depth is one of 8 bits or 10 bits, the second bit-depth is 16 bits, and the bitstream is a High Efficiency Video Coding (HEVC) compliant bitstream.
In one or more fifteenth embodiments, a method for encoding immersive video data comprises determining a camera projection mode for each of a plurality of cameras of a camera array corresponding to immersive video generation or capture, generating a camera projection mode flag for the cameras, the camera projection mode flag having a first value in response to each of the cameras having a particular projection mode and a second value in response to any of the cameras having a projection mode other than the particular projection mode, and encoding the camera projection mode flag into a bitstream.
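A minimal encoder-side sketch of this flag-plus-indicator pattern follows. The writer interface, field widths, mode codes, and flag polarity are assumptions chosen for exposition and are not the normative high level syntax.

OMNIDIRECTIONAL, PERSPECTIVE, ORTHOGRAPHIC = 0, 1, 2  # illustrative mode codes

def write_camera_projection_modes(writer, camera_modes):
    # Write a 1-bit camera projection mode flag; when every camera shares one
    # projection mode, a single shared indicator follows, otherwise one
    # indicator is written per camera. writer.write_bits(value, num_bits) is
    # an assumed bitstream-writer primitive.
    all_same = len(set(camera_modes)) == 1
    writer.write_bits(1 if all_same else 0, 1)
    if all_same:
        writer.write_bits(camera_modes[0], 2)
    else:
        for mode in camera_modes:
            writer.write_bits(mode, 2)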
In one or more sixteenth embodiments, further to the fifteenth embodiment, in response to the first camera projection mode flag having the first value, the bitstream further comprises a single camera projection mode indicator indicative of the particular projection mode.
In one or more seventeenth embodiments, further to the fifteenth or sixteenth embodiments, in response to the camera projection mode flag having the second value, the bitstream further comprises a plurality of camera projection mode indicators, one for each of the cameras.
In one or more eighteenth embodiments, further to any of the fifteenth through seventeenth embodiments, the camera projection modes comprise at least one of an omnidirectional projection mode, a normal perspective projection mode, or an orthographic projection mode.
In one or more nineteenth embodiments, further to any of the fifteenth through eighteenth embodiments, the method further comprises determining a minimum physical depth and a maximum physical depth attainable by each of the cameras, generating a physical depth flag for the cameras, the physical depth flag indicative of whether the cameras have a shared minimum physical depth and/or a shared maximum physical depth, and encoding the physical depth flag into the bitstream.
In one or more twentieth embodiments, further to any of the fifteenth through nineteenth embodiments, the method further comprises determining one of yaw, pitch, and roll parameters or quaternion parameters for each of the cameras, generating a yaw, pitch, and roll flag or a quaternion flag for the cameras, the yaw, pitch, and roll flag or quaternion parameters flag indicative of whether the cameras have shared yaw, pitch, and roll parameters or shared quaternion parameters, and encoding the yaw, pitch, and roll flag or the quaternion parameters flag into the bitstream.
In one or more twenty-first embodiments, further to any of the fifteenth through twentieth embodiments, the yaw, pitch, and roll parameters or quaternion parameters indicate default yaw, pitch, and roll parameter values or default quaternion parameter values.
In one or more twenty-second embodiments, further to any of the fifteenth through twenty-first embodiments, the method further comprises determining a shared physical spacing value representative of a shared physical spacing between the cameras and encoding the shared physical spacing value into the bitstream.
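The same flag-gated pattern can cover the remaining shared camera parameters. The sketch below is illustrative only: the writer primitives (write_bits, write_float), the camera attributes (z_near, z_far, rotation), and the millimeter unit for the shared spacing value are all assumptions rather than defined syntax.

def write_shared_camera_params(writer, cams, shared_spacing_mm=None):
    # Physical depth flag: set when every camera has the same attainable
    # minimum/maximum physical depth, so the near/far pair is written once.
    shared_depth = len({(c.z_near, c.z_far) for c in cams}) == 1
    writer.write_bits(1 if shared_depth else 0, 1)
    depth_pairs = [(cams[0].z_near, cams[0].z_far)] if shared_depth else [(c.z_near, c.z_far) for c in cams]
    for z_near, z_far in depth_pairs:
        writer.write_float(z_near)
        writer.write_float(z_far)

    # Rotation flag: set when all cameras share yaw, pitch, and roll (or
    # quaternion) parameters, in which case a decoder may fall back to defaults.
    shared_rotation = len({tuple(c.rotation) for c in cams}) == 1
    writer.write_bits(1 if shared_rotation else 0, 1)
    if not shared_rotation:
        for c in cams:
            for component in c.rotation:  # yaw, pitch, roll (or quaternion terms)
                writer.write_float(component)

    # Optional shared physical spacing value for a regularly spaced array.
    if shared_spacing_mm is not None:
        writer.write_bits(1, 1)
        writer.write_bits(int(shared_spacing_mm), 16)
    else:
        writer.write_bits(0, 1)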
In one or more twenty-third embodiments, a method for decoding immersive video data comprises decoding a bitstream to determine a depth view, a texture view, and a camera projection mode flag corresponding to a plurality of cameras of a camera array, the camera projection mode flag having a first value to indicate all of the cameras have a particular projection mode or a second value to indicate any of the cameras have a projection mode other than the particular projection mode and generating a decoded view based on the depth view, the texture view, and the camera projection mode flag.
In one or more twenty-fourth embodiments, further to the twenty-third embodiment, in response to the first camera projection mode flag having the first value, the bitstream further comprises a single camera projection mode indicator indicative of the particular projection mode.
In one or more twenty-fifth embodiments, further to the twenty-third or twenty-fourth embodiments, in response to the camera projection mode flag having the second value, the bitstream further comprises a plurality of camera projection mode indicators, one for each of the cameras.
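On the decoding side, the corresponding conditional parse mirrors the encoder sketch given earlier; reader.read_bits(num_bits) is an assumed bitstream-reader primitive and the 2-bit indicator width is an assumption.

def parse_camera_projection_modes(reader, num_cameras):
    # If the camera projection mode flag indicates a shared mode, one indicator
    # applies to every camera; otherwise one indicator is parsed per camera.
    if reader.read_bits(1):
        shared_mode = reader.read_bits(2)
        return [shared_mode] * num_cameras
    return [reader.read_bits(2) for _ in range(num_cameras)]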
In one or more twenty-sixth embodiments, further to any of the twenty-third through twenty-fifth embodiments, the shared camera projection mode comprises at least one of an omnidirectional projection mode, a normal perspective projection mode, or an orthographic projection mode.
In one or more twenty-seventh embodiments, further to any of the twenty-third through twenty-sixth embodiments, the bitstream further comprises a physical depth flag for the cameras, the physical depth flag indicative of whether the cameras have a shared minimum physical depth and/or a shared maximum physical depth.
In one or more twenty-eighth embodiments, further to any of the twenty-third through twenty-seventh embodiments, the bitstream further comprises one of a yaw, pitch, and roll flag or a quaternion flag for the cameras, the yaw, pitch, and roll flag or quaternion parameters flag indicative of whether the cameras have shared yaw, pitch, and roll parameters or shared quaternion parameters.
In one or more twenty-ninth embodiments, further to any of the twenty-third through twenty-eighth embodiments, the method further comprises setting, in response to the yaw, pitch, and roll flag or the quaternion flag indicating the cameras have shared yaw, pitch, and roll parameters or shared quaternion parameters, yaw, pitch, and roll parameters or quaternion parameters for each of the cameras to default yaw, pitch and roll parameters or default quaternion parameters.
In one or more thirtieth embodiments, further to any of the twenty-third through twenty-eighth embodiments, the bitstream further comprises a shared physical spacing value and the method further comprises determining a physical spacing between the cameras using the shared physical spacing value.
In one or more thirty-first embodiments, a device or system includes a memory and a processor to perform a method according to any one of the above embodiments.
In one or more thirty-second embodiments, at least one machine readable medium includes a plurality of instructions that in response to being executed on a computing device, cause the computing device to perform a method according to any one of the above embodiments.
In one or more thirty-third embodiments, an apparatus includes means for performing a method according to any one of the above embodiments.
It will be recognized that the embodiments are not limited to the embodiments so described, but can be practiced with modification and alteration without departing from the scope of the appended claims. For example, the above embodiments may include a specific combination of features. However, the above embodiments are not limited in this regard and, in various implementations, the above embodiments may include undertaking only a subset of such features, undertaking a different order of such features, undertaking a different combination of such features, and/or undertaking additional features than those features explicitly listed. The scope of the embodiments should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims

CLAIMS What is claimed is:
1. A method for encoding immersive video data comprising:
receiving a first depth view having first depth values at a first bit-depth, the first bit-depth comprising a first available range of values from a first minimum value to a first maximum value;
mapping the first depth values to second depth values at a second bit-depth that is less than the first bit-depth to generate a second depth view, the second bit-depth comprising a second available range of values from a second minimum value to a second maximum value, using first and second line segment endpoints to define a line segment for the mapping, wherein horizontal and vertical components of the first line segment endpoint exceed the first minimum depth value and the second minimum depth value, respectively; and
encoding the second depth view and the first and second line segment endpoints into a bitstream.
2. The method of claim 1, wherein horizontal and vertical components of the second line segment endpoint are less than the first maximum depth value and the second maximum depth value, respectively.
3. The method of claim 1 or 2, wherein the first and second line segment endpoints define a first line segment of the mapping, the mapping further comprising a second line segment defined by the second line segment endpoint and a third line segment endpoint.
4. The method of claim 3, wherein the first line segment has a first slope and the second line segment has a second slope less than the first slope in response to a first depth value pixel count for the first line segment exceeding a second depth pixel value count for the second line segment.
5. The method of claim 3 or 4, further comprising determining at least the first and second line segments of the mapping by:
generating a histogram of depth pixel value counts per depth value range using at least a portion of the first depth view; and
recursively generating a plurality of line segments for the mapping including the first and second line segments to minimize a reconstruction error corresponding to the mapping and an inverse mapping.
6. The method of claim 5, further comprising:
determining edge regions of the first depth view, wherein the histogram of depth pixel value counts is generated using pixels in the edge regions and exclusive of pixels outside the edge regions.
7. A method for decoding immersive video data comprising:
decoding a bitstream to determine a first depth view having first depth values at a first bit-depth and first and second line segment endpoints for mapping the first depth view to a second depth view, the first bit-depth comprising a first available range of values from a first minimum value to a first maximum value;
mapping the first depth values, using the first and second line segment endpoints, to second depth values to generate a second depth view, the second depth values at a second bit-depth greater than the first bit-depth, the second bit-depth comprising a second available range of values from a second minimum value to a second maximum value, wherein horizontal and vertical components of the first line segment endpoint exceed the first minimum value and the second minimum value, respectively; and
generating a synthesized view based on the second depth view.
8. The method of claim 7, wherein horizontal and vertical components of the second line segment endpoint are less than the first maximum depth value and the second maximum depth value, respectively.
9. The method of claim 7 or 8, wherein the first and second line segment endpoints define a first line segment of the mapping, the mapping further comprising a second line segment defined by the second line segment endpoint and a third line segment endpoint.
10. The method of claim 9, wherein the first line segment has a first slope and the second line segment has a second slope less than the first slope.
11. The method of any of claims 7 to 10, further comprising:
decoding a texture view corresponding to the first and second depth views from the bitstream, wherein the synthesized view is generated based on the texture view.
12. A method for encoding immersive video data comprising:
determining a camera projection mode for each of a plurality of cameras of a camera array
corresponding to immersive video generation or capture;
generating a camera projection mode flag for the cameras, the camera projection mode flag
having a first value in response to each of the cameras having a particular projection mode and a second value in response to any of the cameras having a projection mode other than the particular projection mode; and
encoding the camera projection mode flag into a bitstream.
13. The method of claim 12, wherein, in response to the first camera projection mode flag having the first value, the bitstream further comprises a single camera projection mode indicator indicative of the particular projection mode.
14. The method of claim 12, wherein, in response to the camera projection mode flag having the second value, the bitstream further comprises a plurality of camera projection mode indicators, one for each of the cameras.
15. The method of any of claims 12 to 14, further comprising:
determining a minimum physical depth and a maximum physical depth attainable by each of the cameras;
generating a physical depth flag for the cameras, the physical depth flag indicative of whether the cameras have a shared minimum physical depth and/or a shared maximum physical depth; and
encoding the physical depth flag into the bitstream.
16. The method of any of claims 12 to 15, further comprising:
determining one of yaw, pitch, and roll parameters or quaternion parameters for each of the
cameras;
generating a yaw, pitch, and roll flag or a quaternion flag for the cameras, the yaw, pitch, and roll flag or quaternion parameters flag indicative of whether the cameras have shared yaw, pitch, and roll parameters or shared quaternion parameters; and
encoding the yaw, pitch, and roll flag or the quaternion parameters flag into the bitstream.
17. The method of any of claims 12 to 16, further comprising:
determining a shared physical spacing value representative of a shared physical spacing between the cameras; and
encoding the shared physical spacing value into the bitstream.
18. A method for decoding immersive video data comprising:
decoding a bitstream to determine a depth view, a texture view, and a camera projection mode flag corresponding to a plurality of cameras of a camera array, the camera projection mode flag having a first value to indicate all of the cameras have a particular projection mode or a second value to indicate any of the cameras have a projection mode other than the particular projection mode; and
generating a decoded view based on the depth view, the texture view, and the camera projection mode flag.
19. The method of claim 18, wherein, in response to the first camera projection mode flag having the first value, the bitstream further comprises a single camera projection mode indicator indicative of the particular projection mode.
20. The method of claim 18, wherein, in response to the camera projection mode flag having the second value, the bitstream further comprises a plurality of camera projection mode indicators, one for each of the cameras.
21. The method of any of claims 18 to 20, wherein the bitstream further comprises a physical depth flag for the cameras, the physical depth flag indicative of whether the cameras have a shared minimum physical depth and/or a shared maximum physical depth.
22. The method of any of claims 18 to 21, wherein the bitstream further comprises one of a yaw, pitch, and roll flag or a quaternion flag for the cameras, the yaw, pitch, and roll flag or quaternion parameters flag indicative of whether the cameras have shared yaw, pitch, and roll parameters or shared quaternion parameters.
23. The method of any of claims 18 to 22, wherein the bitstream further comprises a shared physical spacing value and the method further comprises:
determining a physical spacing between the cameras using the shared physical spacing value.
24. A system comprising:
a memory; and
one or more processors to perform any of the methods of claims 1 to 23.
25. At least one machine readable medium comprising a plurality of instructions that, in response to being executed on a computing device, cause the computing device to perform any of the methods of claims 1 to 23.
PCT/US2020/023122 2019-03-19 2020-03-17 High level syntax for immersive video coding WO2020190928A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN202080015175.3A CN113491111A (en) 2019-03-19 2020-03-17 High level syntax for immersive video coding
BR112021016388A BR112021016388A2 (en) 2019-03-19 2020-03-17 Method, system and at least one machine-readable medium
JP2021548154A JP2022524305A (en) 2019-03-19 2020-03-17 High-level syntax for immersive video coding
KR1020217026220A KR20210130148A (en) 2019-03-19 2020-03-17 High-level syntax for coding immersive video
EP20773623.2A EP3942797A4 (en) 2019-03-19 2020-03-17 High level syntax for immersive video coding

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962820760P 2019-03-19 2019-03-19
US62/820,760 2019-03-19

Publications (1)

Publication Number Publication Date
WO2020190928A1 true WO2020190928A1 (en) 2020-09-24

Family

ID=72521170

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/023122 WO2020190928A1 (en) 2019-03-19 2020-03-17 High level syntax for immersive video coding

Country Status (6)

Country Link
EP (1) EP3942797A4 (en)
JP (1) JP2022524305A (en)
KR (1) KR20210130148A (en)
CN (1) CN113491111A (en)
BR (1) BR112021016388A2 (en)
WO (1) WO2020190928A1 (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9363535B2 (en) * 2011-07-22 2016-06-07 Qualcomm Incorporated Coding motion depth maps with depth range variation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120075534A1 (en) * 2010-09-28 2012-03-29 Sagi Katz Integrated low power depth camera and projection device
KR20120139608A (en) * 2011-06-17 2012-12-27 한국항공대학교산학협력단 Method and apparatus for video encoding and decoding
US20140218473A1 (en) 2013-01-07 2014-08-07 Nokia Corporation Method and apparatus for video coding and decoding
US20150213618A1 (en) * 2014-01-28 2015-07-30 Ittiam Systems (P) Ltd. System and method for tone mapping on high dynamic range images
US20180033176A1 (en) * 2016-07-28 2018-02-01 Cyberlink Corp. Systems and methods for rendering effects in 360 video
US20180098029A1 (en) * 2016-10-04 2018-04-05 Avaya Inc. Multi-mode video conferencing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3942797A4

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11917307B2 (en) 2021-08-30 2024-02-27 Nvidia Corporation Image signal processing pipelines for high dynamic range sensors

Also Published As

Publication number Publication date
EP3942797A4 (en) 2022-12-14
KR20210130148A (en) 2021-10-29
JP2022524305A (en) 2022-05-02
BR112021016388A2 (en) 2021-11-23
CN113491111A (en) 2021-10-08
EP3942797A1 (en) 2022-01-26

Similar Documents

Publication Publication Date Title
US20220159298A1 (en) IMMERSIVE VIDEO CODING TECHNIQUES FOR THREE DEGREE OF FREEDOM PLUS/METADATA FOR IMMERSIVE VIDEO (3DoF+/MIV) AND VIDEO-POINT CLOUD CODING (V-PCC)
US20180242016A1 (en) Deblock filtering for 360 video
US10075689B2 (en) Region-of-interest based 3D video coding
EP3439306A1 (en) Reference frame reprojection for video coding
US20180048901A1 (en) Determining chroma quantization parameters for video coding
US20150181216A1 (en) Inter-layer pixel sample prediction
US11722653B2 (en) Multi-pass add-on tool for coherent and complete view synthesis
US9860514B2 (en) 3D video coding including depth based disparity vector calibration
US10764592B2 (en) Inter-layer residual prediction
US10674151B2 (en) Adaptive in-loop filtering for video coding
US9532048B2 (en) Hierarchical motion estimation employing nonlinear scaling and adaptive source block size
US20160088298A1 (en) Video coding rate control including target bitrate and quality control
US20210258590A1 (en) Switchable scalable and multiple description immersive video codec
KR20230002318A (en) Patch-based video coding for machines
US9872026B2 (en) Sample adaptive offset coding
EP3771210A1 (en) Content and quantization adaptive coding structure decisions for video coding
WO2020190928A1 (en) High level syntax for immersive video coding
US9942552B2 (en) Low bitrate video coding
US20210136378A1 (en) Adaptive quality boosting for low latency video coding
US10484714B2 (en) Codec for multi-camera compression

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20773623

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021548154

Country of ref document: JP

Kind code of ref document: A

REG Reference to national code

Ref country code: BR

Ref legal event code: B01A

Ref document number: 112021016388

Country of ref document: BR

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020773623

Country of ref document: EP

Effective date: 20211019

ENP Entry into the national phase

Ref document number: 112021016388

Country of ref document: BR

Kind code of ref document: A2

Effective date: 20210818