WO2019138163A1

WO2019138163A1 - A method and technical equipment for encoding and decoding volumetric video

Info

Publication number: WO2019138163A1
Application number: PCT/FI2019/050026
Authority: WO
Inventors: Jaakko KERÄNEN; Kimmo Roimela; Emre Aksu; Johannes PYSTYNEN
Original assignee: Nokia Technologies Oy
Priority date: 2018-01-15
Filing date: 2019-01-14
Publication date: 2019-07-18

Abstract

The invention relates to a method and technical equipment, wherein the method comprises converting each frame of volumetric video to a set of three-dimensional voxel bricks, the three-dimensional bricks representing nodes in a node structure; converting each brick of the set of three-dimensional voxel bricks into one or more two-dimensional tile combinations; laying the one or more two-dimensional tile combinations onto respective two-dimensional video frames; storing nodes in the node structure in a metadata associated with the two- dimensional video frames; and encoding the two-dimensional video frames with two-dimensional video codec and encoding the associated metadata. The invention also relates to a method and technical equipment for decoding.

Description

A METHOD AND TECHNICAL EQUIPMENT FOR ENCODING AND DECODING VOLUMETRIC VIDEO

Technical Field

The present solution generally relates to virtual reality. In particular, the solution relates to a method, an apparatus and a computer program product for encoding and decoding volumetric video.

Background

Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with relatively narrow field of view, and displayed as a rectangular scene on flat displays. Such content is referred as“flat content”, or“flat image”, or“flat video” in this application. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being“immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

For volumetric video, a scene may be captured using one or more 3D (three- dimensional) cameras. The cameras are in different positions and orientations within a scene. One issue to take into account is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (with or without transferring it from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel. Summary

Now there has been invented an improved method and technical equipment implementing the method, by which the above problems are alleviated. Various aspects of the invention include a method, an apparatus, and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.

According to a first aspect, there is provided a method comprising converting each frame of volumetric video to a set of three-dimensional voxel bricks, the three- dimensional bricks representing nodes in a node structure; converting each brick of the set of three-dimensional voxel bricks into one or more two-dimensional tile combinations; laying the one or more two-dimensional tile combinations onto respective two-dimensional video frames; storing nodes in the node structure in a metadata associated with the two-dimensional video frames; and encoding the two- dimensional video frames with two-dimensional video codec and encoding the associated metadata.

According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to convert each frame of volumetric video to a set of three-dimensional voxel bricks, the three-dimensional bricks representing nodes in a node structure; convert each brick of the set of three-dimensional voxel bricks into one or more two-dimensional tile combinations; lay the one or more two-dimensional tile combinations onto respective two-dimensional video frames; store nodes in the node structure in a metadata associated with the two-dimensional video frames; and encode the two-dimensional video frames with two-dimensional video codec and encoding the associated metadata.

According to a third aspect, there is provided an apparatus comprising at least means for converting each frame of volumetric video to a set of three-dimensional voxel bricks, the three-dimensional bricks representing nodes in a node structure; means for converting each brick of the set of three-dimensional voxel bricks into one or more two- dimensional tile combinations; means for laying the one or more two-dimensional tile combinations onto respective two-dimensional video frames; means for storing nodes in the node structure in a metadata associated with the two-dimensional video frames; and means for encoding the two-dimensional video frames with two-dimensional video codec and encoding the associated metadata.

According to a fourth aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to convert each frame of volumetric video to a set of three-dimensional voxel bricks, the three-dimensional bricks representing nodes in a node structure; convert each brick of the set of three-dimensional voxel bricks into one or more two- dimensional tile combinations; lay the one or more two-dimensional tile combinations onto respective two-dimensional video frames; store nodes in the node structure in a metadata associated with the two-dimensional video frames; and encode the two- dimensional video frames with two-dimensional video codec and encoding the associated metadata.

According to an embodiment, each frame of a volumetric video content is converted to a sparse voxel octree, and one or more levels of the sparse voxel octree are gathered to the set of three-dimensional voxel bricks.

According to an embodiment, the set of three-dimensional voxel bricks are composed from the sparse voxel octree by determining the depth of the subtree of each node; finding nodes having a depth corresponding to a predefined brick size; and copying content of the found nodes into three-dimensional voxel bricks.

According to an embodiment, the two-dimensional tile combination is formed of tiles of at least two attributes.

According to an embodiment, said at least two attributes are any combination of the following: colour, depth, normal.

According to an embodiment, the metadata comprises three-dimensional voxel coordinates for three-dimensional voxel bricks and parameters for each two- dimensional tile combination in the two-dimensional video frame.

According to an embodiment, a three-dimensional voxel brick that is either an exact match of another brick or produced by transforming another brick is detected and included only once for encoding. According to an embodiment, depth range is adjusted on a per-tile basis, wherein two or more consecutive three-dimensional voxel bricks are encoded into a same tile.

According to an embodiment, a scene or an object of the volumetric video is subdivided into multiple two-dimensional video frames, and the multiple two-dimensional video frames are transmitted progressively starting from frames having low levels of details and proceeding to frames with finer details.

According to an embodiment, each tile is assigned a score based on how much unique information is contained in the tile.

According to an embodiment, it is determined whether a region of a depth tile has empty depth values, and the non-empty boundary values surrounding the region are examined to see if they are close to low or high end of the depth range, and filling the region with black or white respectively.

According to an embodiment, tiles are sorted in a tile frame buffer of the two- dimensional video so that similar tiles are adjacent to each other.

According to an embodiment, variable-sized tile combinations are generated from bricks of a set of three-dimensional voxel bricks comprising more than one levels of the sparse voxel octree.

According to a fifth aspect, there is provided a method for decoding, comprising decoding two-dimensional video frames with two-dimensional video decoder and decoding the associated metadata; decoding from the associated metadata nodes of a node structure; decoding from two-dimensional video frames one or more two- dimensional tile combinations; converting the one or more two-dimensional tile combinations into three-dimensional voxel bricks being represented by nodes in the node structure; and generating volumetric video content according to three- dimensional voxel bricks.

According to a sixth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to decode two-dimensional video frames with two-dimensional video decoder and decode the associated metadata; decode from the associated metadata nodes of a node structure; decode from two-dimensional video frames one or more two-dimensional tile combinations; convert the one or more two-dimensional tile combinations into three- dimensional voxel bricks being represented by nodes in the node structure; and generate volumetric video content according to three-dimensional voxel bricks.

According to a seventh aspect, there is provided an apparatus comprising at least means for decoding two-dimensional video frames with two-dimensional video decoder and decoding the associated metadata; means for decoding from the associated metadata nodes of a node structure; means for decoding from two-dimensional video frames one or more two-dimensional tile combinations; means for converting the one or more two-dimensional tile combinations into three-dimensional voxel bricks being represented by nodes in the node structure; and means for generating volumetric video content according to three-dimensional voxel bricks.

According to an eighth aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to decode two-dimensional video frames with two-dimensional video decoder and decode the associated metadata; decode from the associated metadata nodes of a node structure; decode from two-dimensional video frames one or more two-dimensional tile combinations; convert the one or more two-dimensional tile combinations into three-dimensional voxel bricks being represented by nodes in the node structure; and generate volumetric video content according to three- dimensional voxel bricks.

According to an embodiment, a scene or an object is built up from multiple two- dimensional video frames.

According to an embodiment, a colour of a voxel is recovered by averaging colours of neighboring voxels and using the average as a colour for the voxel.

Description of the Drawings

In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows a system according to an embodiment for generating and viewing volumetric video;

Fig. 2a shows a camera device according to an embodiment comprising two cameras; Fig. 2b shows a viewing device according to an embodiment;

Fig. 2c shows a camera according to an embodiment;

Fig. 3 shows an encoding process according to an embodiment;

Fig. 4 shows a decoding process according to an embodiment;

Fig. 5 shows an example of manipulation of volumetric video data;

Figs. 6a-c show examples of voxels for projection;

Fig. 7 shows an example of brick projection;

Fig. 8 is a flowchart of a method according to an embodiment;

Figs. 9a-b show examples of default and brick coding;

Fig. 10 shows an example of tile frame buffer contents;

Fig. 11 shows an example of four 16 x 16 tiles;

Fig. 12 shows bricks where depth is coded as an offset from a reference surface;

Fig. 13 is a flowchart illustrating a method for encoding according to an embodiment; and

Fig. 14 is a flowchart a method for decoding according to an embodiment.

Description of

Embodiments

In the following, several embodiments of the invention will be described in the context of video coding arrangement. It is to be noted however, that the invention is not limited to this particular arrangement. In fact, the different embodiments have applications widely in any environment where improvement of coding when switching between coded fields and frames is desired. For example, the invention may be applicable to video coding systems like streaming systems, DVD (Digital Versatile Disc) players, digital television receivers, personal video recorders, systems and computer programs on personal computers, handheld computers and communication devices, as well as network elements such as transcoders and cloud computing arrangements where video data is handled.

The present embodiments relate to real-time computer graphics, augmented reality (AR), and virtual reality (VR).

Fig. 1 shows a system and apparatuses for stereo viewing, that is, for 3D video and 3D audio digital capture and playback. The task of the system is that of capturing sufficient visual and auditory information from a specific location such that a convincing reproduction of the experience, or presence, of being in that location can be achieved by one or more viewers physically located in different locations and optionally at a time later in the future. Such reproduction requires more information that can be captured by a single camera or microphone, in order that a viewer can determine the distance and location of objects within the scene using their eyes and their ears. To create a pair of images with disparity, two camera sources are used. In a similar manner, for the human auditory system to be able to sense the direction of sound, at least two microphones are used (the commonly known stereo sound is created by recording two audio channels). The human auditory system can detect the cues, e.g. in timing difference of the audio signals to detect the direction of sound.

The system of Fig. 1 may consist of three main parts: image sources, a server and a rendering device. A video capture device SRC1 comprises one or more cameras CAM1 , CAM2, ..., CAMN with overlapping field of view so that regions of the view around the video capture device is captured from at least two cameras. The device SRC1 may comprise multiple microphones (not shown in Figure 1 ) to capture the timing and phase differences of audio originating from different directions. The device SRC1 may comprise a high resolution orientation sensor so that the orientation (direction of view) of the plurality of cameras can be detected and recorded. The device SRC1 comprises or is functionally connected to a computer processor and memory, the memory comprising computer program code for controlling the video capture device. The image stream captured by the video capture device may be stored on a memory device for use in another device, e.g. a viewer, and/or transmitted to a server using a communication interface. It needs to be understood that although a camera setup of three cameras is described here as part of the system, another type of setup may be used instead as part of the system.

Alternatively or in addition to the video capture device SRC1 creating an image stream, or a plurality of such, one or more sources SRC2 of synthetic images may be present in the system. Such sources of synthetic images may use a computer model of a virtual world to compute the various image streams it transmits. For example, the source SRC2 may compute N video streams corresponding to N virtual cameras located at a virtual viewing position. When such a synthetic set of video streams is used for viewing, the viewer may see a three-dimensional virtual world. The device SRC2 comprises or is functionally connected to a computer processor PROC2 and memory MEM2, the memory comprising computer program PROGR2 code for controlling the synthetic sources device SRC2. There may be a storage, processing and data stream serving network in addition to the capture device SRC1 . For example, there may be a server SERVER or a plurality of servers storing the output from the capture device SRC1 or computation device SRC2. The device SERVER comprises or is functionally connected to a computer processor PROC3 and memory MEM3, the memory comprising computer program PROGR3 code for controlling the server. The device SERVER may be connected by a wired or wireless network connection, or both, to sources SRC1 and/or SRC2, as well as the viewer devices VIEWER1 and VIEWER2 over the communication interface COMM3.

For viewing the captured or created video content, there may be one or more viewer devices VIEWER1 and VIEWER2. These devices may have a rendering module and a display module, or these functionalities may be combined in a single device. The devices may comprise or be functionally connected to a computer processor PROC4 and memory MEM4, the memory comprising computer program PROG4 code for controlling the viewing devices. The viewer (playback) devices may consist of a data stream receiver for receiving a video data stream from a server and for decoding the video data stream. The data stream may be received over a network connection through communications interface COMM4, or from a memory device MEM6 like a memory card CARD2. The viewer devices may have a graphics processing unit for processing of the data to a suitable format for viewing. The viewer VIEWER1 comprises a high-resolution stereo-image head-mounted display for viewing the rendered stereo video sequence. The head-mounted display may have an orientation sensor DET1 and stereo audio headphones. According to an embodiment, the viewer VIEWER2 comprises a display enabled with 3D technology (for displaying stereo video), and the rendering device may have a head-orientation detector DET2 connected to it. Alternatively, the viewer VIEWER2 may comprise a 2D display, since the volumetric video rendering can be done in 2D by rendering the viewpoint from a single eye instead of a stereo eye pair. Any of the devices (SRC1 , SRC2, SERVER, RENDERER, VIEWER1 , VIEWER2) may be a computer or a portable computing device, or be connected to such. Such rendering devices may have computer program code for carrying out methods according to various examples described in this text. Fig. 2a shows a camera device 200 for stereo viewing. The camera comprises two or more cameras that are configured into camera pairs 201 for creating the left and right eye images, or that can be arranged to such pairs. The distances between cameras may correspond to the usual (or average) distance between the human eyes. The cameras may be arranged so that they have significant overlap in their field-of-view. For example, wide-angel lenses of 180-degrees or more may be used, and there may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, or 20 cameras. The cameras may be regularly or irregularly spaced to access the whole sphere of view, or they may cover only part of the whole sphere. For example, there may be three cameras arranged in a triangle and having different directions of view towards one side of the triangle such that all three cameras cover an overlap area in the middle of the directions of view. As another example, 8 cameras having wide-angle lenses and arranged regularly at the corners of a virtual cube and covering the whole sphere such that the whole or essentially whole sphere is covered at all directions by at least 3 or 4 cameras. In Fig. 2a three stereo camera pairs 201 are shown.

Fig. 2b shows a head-mounted display (FIMD) for stereo viewing. The head-mounted display comprises two screen sections or two screens DISP1 and DISP2 for displaying the left and right eye images. The displays are close to the eyes, and therefore lenses are used to make the images easily viewable and for spreading the images to cover as much as possible of the eyes’ field of view. The device is attached to the head of the user so that it stays in place even when the user turns his head. The device may have an orientation detecting module ORDET1 for determining the head movements and direction of the head. The head-mounted display gives a three-dimensional (3D) perception of the recorded/streamed content to a user.

Fig. 2c illustrates a camera CAM1 . The camera has a camera detector CAMDET1 , comprising a plurality of sensor elements for sensing intensity of the light hitting the sensor element. The camera has a lens OBJ1 (or a lens arrangement of a plurality of lenses), the lens being positioned so that the light hitting the sensor elements travels through the lens to the sensor elements. The camera detector CAMDET1 has a nominal center point CP1 that is a middle point of the plurality of sensor elements, for example for a rectangular sensor the crossing point of the diagonals. The lens has a nominal center point PP1 , as well, lying for example on the axis of symmetry of the lens. The direction of orientation of the camera is defined by the line passing through the center point CP1 of the camera sensor and the center point PP1 of the lens. The direction of the camera is a vector along this line pointing in the direction from the camera sensor to the lens. The optical axis of the camera is understood to be this line CP1 -PP1 .

The system described above may function as follows. Time-synchronized video, audio and orientation data is first recorded with the capture device. This can consist of multiple concurrent video and audio streams as described above. These are then transmitted immediately or later to the storage and processing network for processing and conversion into a format suitable for subsequent delivery to playback devices. The conversion can involve post-processing steps to the audio and video data in order to improve the quality and/or reduce the quantity of the data while preserving the quality at a desired level. Finally, each playback device receives a stream of the data from the network, and renders it into a stereo viewing reproduction of the original location which can be experienced by a user with the head-mounted display and headphones.

A video codec consists of an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. Typically encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate). An example of an encoding process is illustrated in Figure 3. Figure 3 illustrates an image to be encoded (lⁿ); a predicted representation of an image block (P’ⁿ); a prediction error signal (Dⁿ ); a reconstructed prediction error signal (D’ⁿ); a preliminary reconstructed image (l’ⁿ); a final reconstructed image (R’ⁿ); a transform (T) and inverse transform (T^_1); a quantization (Q) and inverse quantization (Q^_1 ); entropy encoding (E); a reference frame memory (RFM); inter prediction (P^inter); intra prediction (P^intra); mode selection (MS) and filtering (F). An example of a decoding process is illustrated in Figure 4. Figure 4 illustrates a predicted representation of an image block (P’ⁿ); a reconstructed prediction error signal (D’ⁿ); a preliminary reconstructed image (l’ⁿ); a final reconstructed image (R’ⁿ); an inverse transform (T^-1); an inverse quantization (Q ¹); an entropy decoding (E^-1); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).

Figure 5 demonstrates an example of processing steps of manipulating volumetric video data, starting from raw camera frames (from various locations within the world) and ending with a frame rendered at a freely-selected 3D viewpoint. The starting point 510 is media content obtained from one or more camera devices. The media content may comprise raw camera frame images, depth maps, and camera 3D positions. The recorded media content, i.e. image data, is used to construct an animated 3D model 520 of the world. The viewer is then freely able to choose his/her position and orientation within the world when the volumetric video is being played back 530.

An octree is a tree data structure used to partition a three-dimensional space. Octrees are the three-dimensional analog of quadtrees. “A voxel octree” is a central data structure, i.e. a hierarchy of voxels to which the present embodiments are based. A voxel octree represents the volume as an 8-ary tree in multiple resolutions. A“sparse voxel octree” describes a volume of a space containing a set of solid voxels of varying sizes. Empty areas, i.e. empty subtrees, within the volume are absent from the tree, which is why it is called“sparse”.

“Voxel” of a three-dimensional world corresponds to a pixel of a two-dimensional world. Voxels exist in a 3D grid layout. A voxel or point may have number of attributes that describe its properties. One common attribute is colour. Other attributes can be opacity, a 3D surface normal vector, and parameters describing the surface material.

A volumetric video frame is a complete sparse voxel octree that models the world at a specific point in time in a video sequence. Voxel attributes are referenced in the sparse voxel octrees (e.g. color of a solid voxel), but can also be stored separately.

Volumetric video may be captured using one or more 3D cameras, as shown in Figure 1 . When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observer different parts of the world.

Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a 2D plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames.

Volumetric video refers to video that enables the viewer to move in six degrees of freedom: in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane.

Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR, for example.

One common representation of volumetric video is 3D point clouds, where each point of each 3D surface is described as a 3D point with colour and/or other attribute information such as surface normal or material reflectance. Point clouds (like any other 3D representations) can be converted into voxels. Voxels represent a volume as a 3D grid of volume elements, each volume element containing occupancy and the aforementioned colour and other attributes.

The present embodiments are targeted to converting volumetric video into a set of 2D (e.g. colour and geometry) tiles according to a sparse voxel octree structure. This results in a tile atlas, which can be encoded using a regular video codec, which provides benefits for both compression and rendering of the resulting video.

Each frame of the volumetric video content is converted to a sparse voxel octree using known techniques.

Then for each frame: the octree’s lowest levels are collected to a set of 3D voxel bricks. The 3D voxel bricks are composed from the sparse voxel octree by determining the depth of the subtree of each node; then for the chosen brick size, all nodes that have a corresponding depth (e.g. 16 x 16 x 16 bricks, 4 level subtree) are searched and found; the contents of the found nodes are copied into 3D voxel arrays. These arrays are then called 3D voxel bricks. The 3D voxel bricks may be separately allocated or reside in a larger cubic atlas.

Each 3D voxel brick is converted into one or more 2D tile pairs. The fewer tiles each brick produces, the better. The tile pairs can be e.g. colour-depth tile pairs. It is appreciated that in the present description term“tile-pairs” is used as an example, for simplicity. However, if there are additional attributes, each type of attribute will produce an additional tile (for example, colour-depth-normal tile triplet). Therefore, each 3D voxel brick can be converted in other embodiments into one or more two-dimensional tile combinations, wherein the combination comprises tile pairs, tile triplets, tile quads, etc.

The one or more 2D tile combinations are laid out onto 2D video frames and associated metadata is generated. The 2D video frames are encoded using a 2D video codec (e.g., HEVC). A sequence of such frames are encoded with a 2D video codec. Ideally, the tile allocation is optimized over a sequence of frames such as a GOP or an l-frame interval, so that the video codec can take full advantage of temporal prediction when encoding the tile atlas.

The octree node structure of the sparse voxel octree is stored in a separate stream of metadata. The metadata contains 3D voxel coordinates for bricks and parameters for each 2D tile pair in the frame, and is compressed losslessly.

In the following, a set of embodiments are described. Each of the embodiments contribute to the efficiency and quality of the solution. The embodiments can be applied separately or in combination the solution discussed above.

One of the embodiments relate to persistent 3D voxel brick atlas, where sparse voxel octree node coordinates are used for determining when to reuse a specific brick location between frames. This embodiment allows comparing the 3D voxel bricks over time to detect 3D transformations and 3D similarity. If there are redundant bricks, they only need to be included once in the encoded output. Such redundant bricks may be exact matches to another brick, or they may be produced by transforming another brick (translation, scaling, rotation). In practice, the encoder is able to output a mapping table that describes which 3D voxel bricks can be generated based on other 3D voxel bricks, after applying one or more transformations to the brick(s). A 3D voxel brick can also be generated based on a combination of two or more existing 3D voxel bricks. The assumption here is that the mapping table as a whole is smaller than including the total set of bricks in the output. This basically amounts to brick atlas compression and it can be done independently of anything that happens in the tile projection or tile buffering stages.

One of the embodiments relate to a step for converting bricks to tile combinations. In order to process 3D voxel bricks using a 2D video codec, the 3D voxel bricks are first converted to 2D tiles. Each 3D voxel brick will produce at least one colour-depth tile pair, or tile combination of other attributes. The 3D voxel brick contents are projected along the X, Y, or Z axis, either in positive or negative (front-to-back or back-to-front) order. Because a 3D voxel brick describes arbitrary 3D content, a 2D projection of it may cause some of the input voxels to be occluded and thus excluded.

Figures 6a-c show examples of voxels that are either occluded 610 during the projection or included 605 in the tile. Some directions may yield fewer or no occlusions (as in Figure 6c). The embodiments prefer those directions to other directions. One depth pixel may describe up to three distinct values since there are three colour channels in the frame. However, in practice, one depth value per pixel is easiest to encode and decode via a 2D video codec without introducing discrepancies. In other words, in this case the depth tile pixels use greyscale values.

Figure 7 shows an example, wherein when projecting a 3D voxel brick, some depth tile pixels may represent multiple depth values. A colour is only stored for the front-most voxel 700.

The smallest set of tiles where each tile contains the most amount of unique information not present in other tiles needs to be found. This can be achieved for example by the following method (also shown in Figure 8):

- project the brick using each of the six directions, producing a set of candidate colour-depth tiles 810

- for each candidate, determine the number of voxels that are missing compared to the original brick, and the number of voxels whose colour was not included in the tile 820

- if the number of lost voxels/colours is below the chosen threshold 830, choose the corresponding candidate to be used as the only tile representing the brick 835

- Otherwise 837, for each candidate, project the brick along each of the other directions, producing a second set of candidate tiles. In total, this will produce 6 x 5 colour-depth tiles

- For each combination of tiles from the first set and the second set, check how many voxels/colours are lost 840

- If the number of lost voxels/colours is below the chosen threshold 850, choose the corresponding best combination as the pair of colour-depth tiles to represent the brick 855

- Otherwise, continue similarly with a third direction, producing a total of 6 x 5 x 4 tiles 857

- Continue until the chose threshold is met, or all six directions have been projected 860

One of the embodiments relates to deep projection of tiles. This embodiment extends the range of depth values so more than one 3D voxel brick is covered inside a single tile pair. A common problem in the tile atlas is that the geometry inside a tile often crosses a brick boundary. This results in an otherwise continuous surface being split into two or more separate tiles. By adjusting the depth range on a per-tile basis so that e.g. two or four consecutive 3D voxel bricks are encoded in the same tile, this issue can be mitigated and the tiles can be made more coherent. This also allows the encoder to reduce the number of tiles in some cases, leading to better tile allocation and compression.

Figures 9a-b shows an embodiment of default and brick coding, respectively. Arrows denote coding direction and a depth range. The deep brick coding shown in Fig. 9b enables the middle section of the surface to be encoded as a single tile.

One of the embodiments relates to tile frame buffer. Fixed-size 2D video frames are used for transmitting the volumetric content. This means there is a known upper limit for the amount of data that can be transmitted per frame. For example, a 4K video frame (3840 x 2160) can fit 135 colour-depth tile pairs of size 16 x 16. Each pixel in the tiles corresponds to an input voxel. If every pixel in the tiles represents a unique voxel, this gives an upper limit of approximately 4.1 million voxels per frame. A scene/object larger than that may be subdivided into multiplexed or parallel 2D video streams, or may be transmitted progressively by starting with low levels of detail and proceeding to finer details in later frames. Figure 6 shows an example of what the tile frame buffer contents may look like.

Each tile is assigned a score based on how much unique information is contained in that tile. This enables a fixed tile budget to be fully utilized with prioritized tile allocation, in case all the tiles cannot fit inside the current frame.

Another aspect is that the video codec needs to be configured to behave more suitably for tile-based content. Possible artefacts are: colour bleeding near tile boundaries, where a tile is affected by colours from an adjacent tile; high-frequency details are lost, leading to blurrier surface colours and incorrect depth values; depth discrepancies due to slight differences compared to the original, causing pixels to be interpreted incorrect as empty/non-empty.

The magnitude of these 2D video coding artefacts depends on the bit rate used by the video codec (smaller bit rate causes more severe artefacts).

Figure 10 illustrates an example of tile frame buffer contents. The left-part 1010 contains colour tiles while the right-part 1020 contains the depth tiles. Each colour tile pairs with one depth tile at the corresponding coordinates on the other part.

One of the embodiments relate to depth value flood filling to improve visual quality. As a simple example, the depth values can be converted to an 8-bit range as follows: 0=empty, 46...196 depth values, 255=empty. However, non-linear range of depth values may yield better quality, because depth values that are near to the ends of the range will be subject to more errors during video encoding/decoding. Therefore, it is better to represent those error-prone depth values by a larger segment of the 8-bit range. For 16 x 16 bricks, the conversion can be done as follows ( depth is a value from 0...15):

D = 46 + depth ^* 10

Sharp ridges inside the depth tiles can be minimized by having empty at both the high and low ends of the range. This produces fewer encoding artefacts during the 2D video coding. To determine whether a particular region of empty depth values should use black or white, the non-empty boundary pixels surrounding the region are examined to see if they are closer to the low or high end of the depth range. If they are closer to the low end, the region is flood-filled with black; otherwise, white.

Figure 1 1 illustrates an example of four 16 x 16 tiles. The two tiles 1 1 10, 1 120 on the left use white as empty, while the two tiles 1 130, 1 140 on the right use black as empty. It should be noticed that on the left (tiles 1 1 10, 1 120), the surface slopes towards the high end of the range (lighter), while on the right (tiles 1 130, 1 140), the slope is toward the low end (darker).

One of the embodiments relate mitigating black/white flipping inside depth tiles. When a 3D surface moves inside a brick, the range of depth values it produces may cross from the low end of the range to the high end. When choosing where to place a new tile, if it flips between black and white empty values, the tile position may also be changed if it produces a smaller change at the new location.

One of the embodiments relate to gradient-based depth coding. Coding efficiency can be improved by adapting the depth quantization to the shape of the surface within each brick. This way, the depth tiles can be made more uniform, avoiding large discontinuities inside and between tiles, enhancing the video compression efficiency. A predefined surface shape or, rather, a depth quantization pattern can be selected from a fixed library with a few bits in the tile metadata. Figure 12 shows a few example bricks where depth is coded as an offset from a reference surface. In particular, in Figure 12 there are three examples of shape-based quantization in a brick with the depth coding axis running from top to bottom. The solid line is the reference surface, while the dashed and dotted lines represent 50% and 75% points in the depth offset quantization, respectively. One of the embodiments relate to mitigation of wrong colours which are caused by minor depth artefacts. After depths tiles have been decoded, their contents may not exactly match the original colour values. This leads to slight differences in the output (typically producing single off-surface voxels). This may cause the decoder to use colour values of pixels that were marked as empty in the input data. To avoid visible artefacts from this, the boundaries inside colour tiles are extruded by taking the non- empty pixels’ colour values and extending them one or more pixels into the empty area.

One of the embodiments relate to persistent tiles. The contents of the tile buffer are not cleared between frames. A video codec encodes the differences between the frames, so the objective is to minimize differences between frames. The encoder keeps the metadata of the old tiles in memory so it can be used for making decision about where to place future tiles.

When writing a tile to the buffer, a crucial aspect is placement of tiles over time. If the tile contents keep randomly changing from frame to frame, the video codec will either have to increase the bit rate or reduce the quality of the encoded data. Therefore, the tiles produced from a given brick must be placed in the same location that they were placed during the last frame inside the file frame buffer.

To choose where to place a new tile: compare it to existing tiles that are not in use in the current frame, and choose one that is most similar to the new tile. This can be done by comparing both the colour and depth tile contents and finding the minimum delta (sum over all pixels). When writing the colour tile, empty pixels in the new tile cause no changes in the tile buffer. However, in case there is empty space in the frame buffer (e.g. if the current level of details has fewer than 4 million voxels), empty tiles may be more preferable than old tiles.

One of the embodiments relate to brick and tile metadata. In order to decode the 2D tiles back to 3D content, and to enable tile sorting optimization in the encoding stage, tile metadata can be defined that maps each tile in the frame back to a 3D brick face in the original sparse voxel octree model. This includes the octree node coordinates of the brick (level, XYZ), and brick projection direction for each tile in the atlas. Alternatively, depending on the number of tiles per brick, it may be more efficient to encode a separate brick array (level, XYZ), and then reference that from a tile metadata array (brick index, face index). During a sequence of frames, the brick coordinates also change less frequently than the tile metadata. One of the embodiments relate to sorting tiles. In order for the video codec to be able to employ spatial prediction during coding, the tiles may be sorted in the tile frame buffer so that similar tiles are adjacent to each other. Ideally, surfaces in the buffer are completely continuous, but this is not possible in the general case of complex 3D environments. Sorting can be done during l-frames to maximize continuous surface are of the object. One criterion for tile similarity can be the direction of motion occurring inside the tile; tiles whose content is moving in the same direction should be placed adjacent to each other. According to an embodiment, the orientation and spatial coordinates of tiles can also be used as a sorting criteria, for example by sorting similarly oriented tiles together so that tiles with adjacent spatial coordinates form continuous spans in the atlas.

One of the embodiments relate to variable-sized tiles. While 3D voxel bricks may be generated in one size, the tiles that get laid out in the tile frame buffer can be aggregates or fragments of the bricks. This helps to preserve larger segments of the original surfaces as they are converted to tiles. For example, the tile conversion begins at level N of the octree (targeting 64 x 64 tiles), and if that fails to produce an acceptable tile representation, attempts level N+1 (32 x 32 tiles), etc. The tile buffer may be filled in a manner that subdivides the space starting with the largest tile size (e.g., 64 x 64) to match how a 2D video codec subdivides the frame. Note that even though a large tile may not fully capture all the voxel data of a brick, a few smaller-sized additional tiles may be enough to account for the missing data - in other words, after a large tile has been produced, tile conversion can still continue to smaller sub-tiles for those sub- regions where voxels were lost.

One of the embodiments relate to two-pass video encoding. In order to optimize the bit allocations to frames and parts of the video frame - preserving visual quality of difficult- to-encode regions, and saving bits on easy-to-encode regions - a two-pass encoding scheme can be utilized. In the first pass, the encoder can examine all the video frames and identify the spatial locations that require mode bit budget, and hence higher quality. In another embodiments, such regions that require high bitrate can be roughly pre- configured (e.g. texture regions and depths regions can be two high-level regions), and given to the two-pass encoding as an input. In the second pass, the encoder utilizes the pre-allocated bit budgets and quality parameters to encode the video into the final bitstream.

One of the embodiments relate to separately encoded colour and depth tile frames. The tile frame buffer can be split into two sub-video frames: texture-only and depth- only video frames. Each video frame can then be processed and encoded separately with different quality, bitrate and encoder parameters for delivery of optimized quality. Additional sub-frames may also be used for any additional attributes present in the input data.

Decoding the tile frames

The tiles and associated metadata can be used to reconstruct the original sparse voxel octree and/or point cloud. The operation is highly parallelizable.

The decoding method according to an embodiment comprises the following:

- decode the video frame(s) containing the tile combination of specific attributes of the frame;

- decode the metadata of the frame;

- allocate a set of 3D bricks (the number of the bricks to be allocated is specified in the metadata);

- for each tile combination:

o read the projection direction and brick index in the tile metadata; o look up the output brick using the brick index;

o for each pixel in the colour tile:

^■ read the depth value(s) in the corresponding depth tile pixel. If a depth value is outside the used depth range (e.g., 46...196), the pixel is considered empty and is skipped.

^■ given the projection direction, determine the XYZ coordinates that correspond to the pixel inside the brick.

^■ for the front-most depth, write the voxel colour to the brick.

^■ for any additional depth values of the pixel, store a placeholder voxel in the brick (voxel with undefined colours); unless the voxel in question has already received a colour from another tile.

- to reconstruct the point cloud; for each brick in the atlas:

o check the brick’s voxel coordinates from the brick metadata;

o translate each non-empty voxel’s coordinates (XYZ + level inside the brick), relative to the brick coordinates (XYZ + level), to 3D model space coordinates.

- to reconstruct the voxel octree, for each brick in the atlas:

o if not yet constructed, construct node hierarchy leading down to the brick’s voxel coordinates

o for each non-empty voxel of the brick: ^■ if not yet constructed, construct node hierarchy leading down to the voxel’s parent node;

^■ place the voxel inside the parent node.

According to an embodiment, lost voxel colours are recovered. In case there are multiple depth values stored per pixel (in depth tile colour channels), there is still only one colour value stored in the colour tile for that pixel. Recovering colour values for the voxels other than the front-most one: as a post-processing step, within the 3D brick, the neighbourhood of a voxel can be examined when the neighbourhood voxels have a known colour. These colours can be averaged and used as the colour for a voxel without a known colour. It is to be noticed that a colourless voxel may also receive a colour from another tile where the same brick has been projected from another direction.

After a point cloud has been reconstructed, the decoder may apply further post- processing filters to alleviate remaining errors in the decoded data. For example, single points that are not surrounded by other points on a continuous surface are likely the product of distorted depth values, and should be removed from the decoded point cloud. Inversely, gaps in otherwise continuous surfaces could be filled with new points. Note that this kind of filtering can be also done in real time during rendering.

Viewpoint-dependent streaming

There is a fixed upper limit to the amount of data that can fit inside a 2D video frame, and thus if the entire volumetric content does not fit inside that limit, it needs to be chosen which subset to include. The embodiments discussed so far are independent of the viewer. However, if the characteristics of the viewport and how the volumetric content is being rendered are taken into account, the data to include in the tile frames may be optimized further.

According to an embodiment, the level of details in the object and/or scene is reduced until the data fits in the tile buffer. Reductions can be applied uniformly to the entire object/scene, or in regions chosen to be less important.

According to an embodiment a view-dependent brick is based on viewing location and/or direction. Bricks can be generated for an arbitrary level of the octree by combining data from multiple pre-generated bricks. This allows fine-tuning the resolution of the data being transferred. According to an embodiment, a limited 2D viewport on a low-resolution screen requires lower LOD (Level of Detail) levels and has a narrower potentially visible set of bricks compared to a 3D HMD viewer where the entire surroundings must be rendered with relatively high level of detail. A 3D HMD may also imply 6-DOF (Six Degrees of Freedom) viewing setup, which makes it important to include more information about surface materials, reflected geometry, etc.

Rendering without reconstructing the voxel octree

3D brick index is known and thus its 3D location is known. Points can be directly projected to world from 2D decoded tiles with their depth values. The cubic shape of a brick is beneficial for optimization before and during rendering. Bricks can be culled from view and additional depth testing optimizations can be used because it is known that all the points inside a brick map fall into a predefined cubic volume in world. In addition, bricks can be sorted based on a view to further reduce the rendering time of brick data.

There are several rendering techniques for outputting the points without reconstructing the voxel octree. For example, fixed-function rendering can be done with point based primitive rendering or creating surface polygons (quads) in vertex shader. Polygon based rendering is scalable and does not necessarily need any hole filling. In addition, new point rendering technique, scattering points atomically in compute shader, is another possibility of rendering this kind of content.

Figure 13 is a flowchart illustrating a method for encoding according to an embodiment. A method comprises converting 1310 each frame of volumetric video to a set of three- dimensional voxel bricks, the three-dimensional bricks representing nodes in a node structure; converting 1320 each brick of the set of three-dimensional voxel bricks into one or more two-dimensional tile combinations; laying 1330 the one or more two- dimensional tile combinations onto respective two-dimensional video frames; storing 1340 nodes in the node structure in a metadata associated with the two-dimensional video frames; and encoding 1350 the two-dimensional video frames with two- dimensional video codec and encoding the associated metadata.

An apparatus according to an embodiment comprises means for converting each frame of volumetric video to a set of three-dimensional voxel bricks, the three- dimensional bricks representing nodes in a node structure; means for converting each brick of the set of three-dimensional voxel bricks into one or more two-dimensional tile combinations; means for laying the one or more two-dimensional tile combinations onto respective two-dimensional video frames; means for storing nodes in the node structure in a metadata associated with the two-dimensional video frames; and means for encoding the two-dimensional video frames with two-dimensional video codec and encoding the associated metadata. The means comprises at least one processor, and a memory including a computer program code. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method for encoding.

The computer program code comprises one or more operational characteristics. Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises converting each frame of volumetric video to a set of three-dimensional voxel bricks, the three-dimensional bricks representing nodes in a node structure; converting each brick of the set of three-dimensional voxel bricks into one or more two-dimensional tile combinations; laying the one or more two-dimensional tile combinations onto respective two-dimensional video frames; storing nodes in the node structure in a metadata associated with the two-dimensional video frames; and encoding the two- dimensional video frames with two-dimensional video codec and encoding the associated metadata

Figure 14 is a flowchart illustrating a method for decoding according to an embodiment. A method comprises decoding 1410 two-dimensional video frames with two- dimensional video decoder and decoding the associated metadata; decoding 1420 from the associated metadata nodes of a node structure; decoding 1430 from two- dimensional video frames one or more two-dimensional tile combinations; converting 1440 the one or more two-dimensional tile combinations into three-dimensional voxel bricks being represented by nodes in the node structure; and generating 1450 volumetric video content according to three-dimensional voxel bricks.

An apparatus according to an embodiment comprises means for decoding two- dimensional video frames with two-dimensional video decoder and decoding the associated metadata; means for decoding from the associated metadata nodes of a node structure; means for decoding from two-dimensional video frames one or more two-dimensional tile combinations; means for converting the one or more two- dimensional tile combinations into three-dimensional voxel bricks being represented by nodes in the node structure; and means for generating volumetric video content according to three-dimensional voxel bricks. The means comprises at least one processor, and a memory including a computer program code. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method for encoding.

The computer program code comprises one or more operational characteristics. Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises decoding two-dimensional video frames with two-dimensional video decoder and decoding the associated metadata; decoding from the associated metadata nodes of a node structure; decoding from two-dimensional video frames one or more two- dimensional tile combinations; converting the one or more two-dimensional tile combinations into three-dimensional voxel bricks being represented by nodes in the node structure; and generating volumetric video content according to three- dimensional voxel bricks.

The various embodiments may provide advantages. For example, the solution presented here is agnostic to the scene topology: the micro-projection nature of the encoding makes handling of occlusions and complex scenes a non-issue compared to methods based on larger projections that may target representing a single character only. The present solution is also applicable both when volumetric content is being captured (to compress content viewpoint-independently), and when volumetric content is being streamed for viewing (optimizing for a known viewer).

Yet, a further advantage is that the tile layout is very well suited to GPU processing. The implementation can use parallel processing because each brick and tile is independent of each other. GPUs are particularly good at parallel processing of large amounts of data. For example, voxel octree can be constructed on the GPU in parallel; bricks can be composed on the GPU in parallel; bricks can be converted to tiles in parallel; tile-to-tile comparisons can be made in parallel. Also, view-dependent streaming and the capability to directly render a 3D view from the tile atlas enable a very efficient architecture for transcoding and delivering view-dependent substreams from a very large model to client devices.

The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with other. Furthermore, if desired, one or more of the above- described functions and embodiments may be optional or may be combined. Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims. It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as, defined in the appended claims.

Claims

Claims:

1 . A method, comprising:

- converting each frame of volumetric video to a set of three-dimensional voxel bricks, the three-dimensional bricks representing nodes in a node structure;

- converting each brick of the set of three-dimensional voxel bricks into one or more two-dimensional tile combinations;

- laying the one or more two-dimensional tile combinations onto respective two- dimensional video frames;

- storing nodes in the node structure in a metadata associated with the two-dimensional video frames; and

- encoding the two-dimensional video frames with two-dimensional video codec and encoding the associated metadata.

2. The method according to claim 1 , wherein each frame of a volumetric video content is converted to a sparse voxel octree, and one or more levels of the sparse voxel octree are gathered to the set of three-dimensional voxel bricks.

3. The method according to claim 2, wherein the set of three-dimensional voxel bricks are composed from the sparse voxel octree by

- determining the depth of the subtree of each node;

-finding nodes having a depth corresponding to a predefined brick size; and

- copying content of the found nodes into three-dimensional voxel bricks.

4. The method according to claim 1 or 2 or 3, wherein the two-dimensional tile combination is formed of tiles of at least two attributes.

5. The method according to claim 4, wherein said at least two attributes are any combination of the following: colour, depth, normal.

6. The method according to any of the claims 1 to 5, wherein the metadata comprises three-dimensional voxel coordinates for three-dimensional voxel bricks and parameters for each two-dimensional tile combination in the two-dimensional video frame.

7. The method according to any of the claims 1 to 6, further comprising detecting a three-dimensional voxel brick that is either an exact match of another brick or produced by transforming another brick, and including such brick only once for encoding.

8. The method according to any of the claims 1 to 7, further comprising adjusting depth range on a per-tile basis, wherein two or more consecutive three-dimensional voxel bricks are encoded into a same tile.

9. The method according to any of the claims 1 to 8, further comprising subdividing a scene or an object of the volumetric video into multiple two-dimensional video frames, and transmitting the multiple two-dimensional video frames progressively starting from frames having low levels of details and proceeding to frames with finer details.

10. The method according to any of the claims 1 to 9, further comprising assigning each tile a score based on how much unique information is contained in the tile.

1 1 . The method according to any of the claims 1 to 10, further comprising determining whether a region of a depth tile has empty depth values, and examining the non-empty boundary values surrounding the region to see if they are close to low or high end of the depth range, and filling the region with black or white respectively.

12. The method according to any of the claims 1 to 1 1 , further comprising sorting tiles in a tile frame buffer of the two-dimensional video so that similar tiles are adjacent to each other.

13. The method according to any of the claims 1 to 12, further comprising generating variable-sized tile combinations from bricks of a set of three-dimensional voxel bricks comprising more than one levels of the sparse voxel octree.

14. A method for decoding, comprising

- decoding two-dimensional video frames with two-dimensional video decoder and decoding the associated metadata;

- decoding from the associated metadata nodes of a node structure;

- decoding from two-dimensional video frames one or more two-dimensional tile combinations;

- converting the one or more two-dimensional tile combinations into three-dimensional voxel bricks being represented by nodes in the node structure; and

- generating volumetric video content according to three-dimensional voxel bricks.

15. The method according to claim 14, further comprising building up a scene or an object from multiple two-dimensional video frames.

16. The method according to claim 14 or 15, further comprising recovering a colour of a voxel by averaging colours of neighboring voxels and using the average as a colour for the voxel.

17. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the method steps according to claims 1 to 13.

18. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the method steps according to claims 14 to 16.

19. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to perform at least the method steps according to claims 1 to 13 20. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to perform at least the method steps according to claims 14 to 16.