EP4133719A1

EP4133719A1 - A method, an apparatus and a computer program product for volumetric video coding

Info

Publication number: EP4133719A1
Application number: EP21783737.6A
Authority: EP
Inventors: Deepa NAIK; Sebastian Schwarz; Kimmo Roimela; Vinod Kumar MALAMAL VADAKITAL
Original assignee: Nokia Technologies Oy
Current assignee: Nokia Technologies Oy
Priority date: 2020-04-09
Filing date: 2021-04-01
Publication date: 2023-02-15
Also published as: WO2021205068A1

Abstract

The embodiments relate to a method for video encoding and decoding, and to technical equipment for the same. The method for encoding comprises receiving (410) a point cloud with a number of visual attributes; identifying (420) areas of the point cloud according to their SLF activity and classifying the areas accordingly into different classes; generating (430) two-dimensional patches from the points and, by using the classification information, assigning each generated patch to a corresponding class; selecting (440) a set of cameras for patches according to the SLF activity of a patch; generating (450) a number of attribute video streams into a bitstream, where at least for the first attribute video stream one camera view is packed for all patches, and where at least for the last attribute video stream two or more camera views are packed for patches having a high activity; and encoding (460) information on the selected cameras into a bitstream.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VOLUMETRIC VIDEO CODING

The project leading to this application has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 783162. The JU receives support from the European Union’s Horizon 2020 research and innovation program and Netherlands, Czech Republic, Finland, Spain, Italy.

Technical Field

The present solution generally relates to volumetric video encoding and decoding.

Background

Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with relatively narrow field of view and displayed as a rectangular scene on flat displays. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

For volumetric video, a scene may be captured using one or more 3D (three-dimensional) cameras. The cameras are in different positions and orientations within a scene. One issue to consider is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (with or without transferring it from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel.

Summary

The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.

Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.

According to a first aspect, there is provided a method for encoding, the method comprising:

- receiving a point cloud with a number of visual attributes, where the number represents different viewing angles;

- identifying areas of the point cloud according to their surface light field (SLF) activity and classifying the areas accordingly into one or more different classes;

- generating two-dimensional patches from the points of the point cloud and, by using the classification information, assigning each generated patch to a corresponding class;

- selecting a set of cameras for patches according to the SLF activity of a patch;

- generating a number of attribute video streams into a bitstream, where at least for the first attribute video stream one camera view is packed for all patches, and where at least for the last attribute video stream two or more camera views are packed for patches having a high activity; and

- encoding information on the selected cameras into a bitstream.

According to a second aspect, there is provided a method for decoding, the method comprising

- receiving a bitstream;

- decoding from the bitstream a number of attribute video streams and information on selected cameras;

- reconstructing cameras for the patches by using the camera views that are available for a patch according to its surface light field (SLF) activity;

- determining three-dimensional positions of points from the two-dimensional patches;

- reconstructing a three-dimensional point cloud according to the points and their three-dimensional positions; and

- rendering the reconstructed point cloud.

According to a third aspect, there is provided an apparatus for encoding, comprising:

- means for receiving a point cloud with a number of visual attributes, where the number represents different viewing angles;

- means for identifying areas of the point cloud according to their surface light field (SLF) activity and classifying the areas accordingly into one or more different classes;

- means for generating two-dimensional patches from the points of the point cloud and, by using the classification information, assigning each generated patch to a corresponding class;

- means for selecting a set of cameras for patches according to the SLF activity of a patch;

- means for generating a number of attribute video streams into a bitstream, where at least for the first attribute video stream one camera view is packed for all patches, and where at least for the last attribute video stream two or more camera views are packed for patches having a high activity; and

- means for encoding information on the selected cameras into a bitstream.

According to a fourth aspect, there is provided an apparatus for decoding comprising

- means for receiving a bitstream;

- means for decoding from the bitstream a number of attribute video streams and information on selected cameras;

- means for reconstructing cameras for the patches by using the camera views that are available for a patch according to its surface light field (SLF) activity;

- means for determining three-dimensional positions of points from the two-dimensional patches;

- means for reconstructing a three-dimensional point cloud according to the points and their three-dimensional positions; and

- means for rendering the reconstructed point cloud.

According to a fifth aspect, there is provided an apparatus for encoding, the apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive a point cloud with a number of visual attributes, where the number represents different viewing angles;

- identify areas of the point cloud according to their surface light field (SLF) activity and classify the areas accordingly into one or more different classes;

- generate two-dimensional patches from the points of the point cloud and, by using the classification information, assign each generated patch to a corresponding class;

- select a set of cameras for patches according to the SLF activity of a patch;

- generate a number of attribute video streams into a bitstream, where at least for the first attribute video stream one camera view is packed for all patches, and where at least for the last attribute video stream two or more camera views are packed for patches having a high activity; and

- encode information on the selected cameras into a bitstream.

According to a sixth aspect, there is provided an apparatus for decoding, the apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive a bitstream;

- decode from the bitstream a number of attribute video streams and information on selected cameras;

- reconstruct cameras for the patches by using the camera views that are available for a patch according to its surface light field (SLF) activity;

- determine three-dimensional positions of points from the two-dimensional patches;

- reconstruct a three-dimensional point cloud according to the points and their three-dimensional positions; and

- render the reconstructed point cloud.

According to a seventh aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to

- receive a point cloud with a number of visual attributes, where the number represents different viewing angles;

- select a set of cameras for patches according to the SLF activity of a patch;

- encode information on the selected cameras into a bitstream.

According to an eighth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to

- receive a bitstream;

- render the reconstructed point cloud.

According to an embodiment, the different classes may comprise at least a high SLF activity, and a low SLF activity.

According to an embodiment, an SLF activity is identified according to one or more of the following:

- a variance of luma over all cameras;

- a variance of color sub-channels;

- local variance in the neighborhood of a three-dimensional point;

- detection of specular reflection by thresholding the color in hue saturation value.

According to an embodiment, the cameras are selected so that

- patches having a low SLF activity are represented by a single camera view;

- patches having higher SLF activity are represented by at least two camera views.
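The activity measure and the resulting camera-count rule may be illustrated with a minimal sketch. It uses only the first listed criterion (variance of luma over all cameras). The function names, the BT.601 luma weights, and the threshold value are assumptions made for illustration and are not taken from the embodiments.

```python
from statistics import pvariance

def luma(rgb):
    """Approximate luma of an (R, G, B) colour; BT.601 weights are an assumption."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def slf_activity(colours_per_camera):
    """Variance of luma of one point (or patch average) over all cameras."""
    return pvariance([luma(c) for c in colours_per_camera])

def classify_activity(activity, threshold):
    """Two-class split; the threshold is a hypothetical tuning parameter."""
    return "high" if activity > threshold else "low"

def views_for_class(activity_class):
    """Low activity: a single camera view. High activity: at least two views."""
    return 1 if activity_class == "low" else 2
```

A point whose colour is identical in every camera has zero variance and is classed as low activity, so it would be represented by a single camera view.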

According to an embodiment, the cameras to be used are selected according to one or more of the following:

- spatial distribution in 3D space;

- distribution of a color;

- reconstruction error.

According to an embodiment,

- one camera view is packed for all patches in a video frame for the first attribute video stream;

- additional camera views are packed for higher activity patches in at least one additional video stream; and

- additional camera views are continuously packed for even higher activity patches in at least one video stream, until the number of desired total video streams is reached.
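One possible reading of this layered packing is sketched below. It assumes that each patch has already been assigned a desired number of camera views, and that stream k carries the (k+1)-th view of every patch that wants more than k views. The function name and the dictionary layout are hypothetical.

```python
def pack_views_into_streams(views_per_patch, num_streams):
    """Return, for each attribute stream, the patch ids that contribute a view to it.

    views_per_patch maps a patch id to the number of camera views wanted for
    that patch (1 for low activity, more for higher activity).
    """
    streams = []
    for k in range(num_streams):
        # Stream 0 holds one view for every patch; later streams hold
        # additional views only for patches with higher activity.
        streams.append([pid for pid, n in views_per_patch.items() if n > k])
    return streams
```

With three streams and patches wanting 1, 2 and 3 views, the first stream contains all three patches, the second contains two, and the third contains only the most active patch.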

According to an embodiment, one or more of the following are encoded into a bitstream:

- number of encoded camera views;

- list of respective camera indices;

- list of available cameras and their position in 3D space.

According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.

Description of the Drawings

In the following, various embodiments will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an example of a volumetric video compression process;

Fig. 2 shows an example of a volumetric video decompression process;

Fig. 3 shows an example of SLF activity packing with three video streams;

Fig. 4 is a flowchart illustrating a method according to an embodiment;

Fig. 5 is a flowchart illustrating a method according to another embodiment; and

Fig. 6 shows an apparatus according to an embodiment.

Description of Example Embodiments

In the following, several embodiments will be described in the context of volumetric video encoding and decoding. In particular, the several embodiments enable packing and signaling surface light field information for volumetric video coding.

A video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can un-compress the compressed video representation back into a viewable form. An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at lower bitrate).

Volumetric video refers to a visual content that may have been captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.

Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames. Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example.

Volumetric video data represents a three-dimensional scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality) and MR (Mixed Reality) applications. Such data describes geometry (shape, size, position in three-dimensional space) and respective attributes (e.g. color, opacity, reflectance, ...), plus any possible temporal changes of the geometry and attributes at given time instances (like frames in two-dimensional (2D) video). Volumetric video is either generated from three-dimensional (3D) models, i.e. CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Representation formats for such volumetric data comprise triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e. volumetric video frames.

Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF viewing capabilities.

Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, laser, time-of-flight and structured-light devices are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is to code it as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps. In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression becomes essential. Standard volumetric video representation formats, such as point clouds, meshes and voxels, suffer from poor temporal compression performance. Identifying correspondences for motion compensation in 3D space is an ill-defined problem, as both the geometry and respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes may be inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview and depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.

Instead of the above-mentioned approach, a 3D scene, represented as meshes, points, and/or voxels, can be projected onto one or more geometries. These geometries are “unfolded” onto 2D planes (two planes per geometry: one for texture, one for depth), which may then be encoded using standard 2D video compression techniques. Relevant projection geometry information is transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).

Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency may be increased greatly. Using geometry projections instead of prior-art 2D-video based approaches, i.e. multiview and depth, provides a better coverage of the scene (or object). Thus, 6DOF capabilities may be improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time compression/decompression of the projected planes. The projection and reverse projection steps are of low complexity.

Figure 1 illustrates an overview of an example of a compression process of a volumetric video. Such a process may be applied for example in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 301 that is provided for patch generation 302, geometry image generation 304 and texture image generation 305. The patch generation 302 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:

- (1.0, 0.0, 0.0),

- (0.0, 1.0, 0.0),

- (0.0, 0.0, 1.0),

- (-1.0, 0.0, 0.0),

- (0.0, -1.0, 0.0), and

- (0.0, 0.0, -1.0)

More precisely, each point may be associated with the plane that has the closest normal (i.e. maximizes the dot product of the point normal and the plane normal).
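The initial clustering rule can be sketched as follows. This is a minimal illustration that assumes the point normals have already been estimated; the function names are hypothetical, and the later refinement and connected-component steps are not covered.

```python
# The six oriented planes, in the order listed above.
PLANE_NORMALS = [
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
    (-1.0, 0.0, 0.0),
    (0.0, -1.0, 0.0),
    (0.0, 0.0, -1.0),
]

def closest_plane(normal):
    """Index of the plane whose normal maximizes the dot product with `normal`."""
    dots = [sum(a * b for a, b in zip(normal, plane)) for plane in PLANE_NORMALS]
    return max(range(len(dots)), key=dots.__getitem__)

def initial_clusters(point_normals):
    """Cluster index (0..5) for every point normal."""
    return [closest_plane(n) for n in point_normals]
```

For example, a normal pointing mostly along +x is assigned to the first plane, and a normal pointing mostly along -z is assigned to the last plane.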

The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure.

Patch info determined at patch generation 302 for the input point cloud frame 301 is delivered to packing process 303, to geometry image generation 304 and to texture image generation 305. The packing process 303 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g. 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.

The simple packing strategy used iteratively tries to insert patches into a WxH grid. W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
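A minimal sketch of such a first-fit packing is given below. It assumes rectangular patches given as (width, height) in grid cells and ignores the TxT block constraint; the function names are hypothetical.

```python
def fits(grid, x, y, w, h):
    """True if a w x h patch at (x, y) stays inside the grid and overlaps no used cell."""
    if y + h > len(grid) or x + w > len(grid[0]):
        return False
    return all(not grid[r][c] for r in range(y, y + h) for c in range(x, x + w))

def pack_patches(patch_sizes, width, height):
    """Place patches first-fit in raster-scan order; return positions and clipped height."""
    if width < 1 or height < 1:
        raise ValueError("grid must have at least one cell")
    grid = [[False] * width for _ in range(height)]
    positions = []
    for w, h in patch_sizes:
        if w > width:
            raise ValueError("patch is wider than the grid")
        placed = None
        while placed is None:
            for y in range(len(grid)):
                for x in range(width):
                    if fits(grid, x, y, w, h):
                        placed = (x, y)
                        break
                if placed is not None:
                    break
            if placed is None:
                # No free location: double the grid height and search again.
                grid.extend([[False] * width for _ in range(len(grid))])
        x, y = placed
        for r in range(y, y + h):
            for c in range(x, x + w):
                grid[r][c] = True
        positions.append(placed)
    # Clip the height to the last row that contains a used cell.
    used_rows = [r for r, row in enumerate(grid) if any(row)]
    clipped_height = used_rows[-1] + 1 if used_rows else 0
    return positions, clipped_height
```

In this sketch, packing two 2x2 patches and one 3x1 patch into a 4x2 grid forces one doubling of the height, after which the height is clipped to 3 rows.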

The geometry image generation 304 and the texture image generation 305 are configured to generate geometry images and texture images respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called a near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:

• Geometry: WxH YUV420-8bit,

• Texture: WxH YUV420-8bit.

It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
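The selection of the near-layer and far-layer depths for a single pixel can be sketched as follows. The function name is hypothetical, and the input is assumed to be the list of depths of all points of the current patch that project to the same pixel.

```python
def near_far_depths(depths, surface_thickness):
    """Return (near, far) depths for the points projected to one pixel.

    near is the lowest depth; far is the highest depth that still lies
    within surface_thickness of the near depth.
    """
    near = min(depths)
    far = max(d for d in depths if d <= near + surface_thickness)
    return near, far
```

With depths 10, 12, 13 and 20 and a surface thickness of 4, the near layer stores depth 10 and the far layer stores depth 13; the point at depth 20 falls outside the interval.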

The geometry images and the texture images may be provided to image padding 307. The image padding 307 may also receive as an input an occupancy map (OM) 306 to be used with the geometry images and texture images. The occupancy map 306 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 303.

The padding process 307, to which the present embodiments are related, aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g. 16x16) pixels is processed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e. edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
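The edge-block case can be illustrated with a minimal sketch. It assumes a 4-connected neighbourhood and one synchronous update per iteration, neither of which is specified above; the function name is hypothetical.

```python
def pad_edge_block(block, occupied):
    """Fill unoccupied pixels with the mean of their already-filled 4-neighbours.

    block and occupied are equally sized 2D lists; occupied holds booleans.
    The input lists are not modified.
    """
    rows, cols = len(block), len(block[0])
    values = [row[:] for row in block]
    filled = [row[:] for row in occupied]
    while not all(all(row) for row in filled):
        updates = []
        for r in range(rows):
            for c in range(cols):
                if filled[r][c]:
                    continue
                neighbours = [
                    values[nr][nc]
                    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= nr < rows and 0 <= nc < cols and filled[nr][nc]
                ]
                if neighbours:
                    updates.append((r, c, sum(neighbours) / len(neighbours)))
        if not updates:
            break  # fully empty block: there is nothing to propagate from
        for r, c, v in updates:
            values[r][c] = v
            filled[r][c] = True
    return values
```

Starting from a single occupied pixel, the value spreads outward one ring of neighbours per iteration until the block is full.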

The padded geometry images and padded texture images may be provided for video compression 308. The generated images/layers may be stored as video frames and compressed using for example the H.265 video codec according to the video codec configurations provided as parameters. The video compression 308 also generates reconstructed geometry images to be provided for smoothing 309, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 302. The smoothed geometry may be provided to texture image generation 305 to adapt the texture images.

The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise index of the projection plane, 2D bounding box, 3D location of the patch.

For example, the following metadata may be encoded/decoded for every patch:

- index of the projection plane

  o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)

  o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)

  o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0)

- 2D bounding box (u0, v0, u1, v1)

- 3D location (x0, y0, z0) of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection planes, (δ0, s0, r0) may be calculated as follows:

  o Index 0

  o Index 1

  o Index 2

Also, mapping information providing for each TxT block its associated patch index may be encoded as follows:

- For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.

- The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.

- Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly coding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.

An example of such patch auxiliary information is atlas data defined in ISO/IEC 23090-5.

The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.

The occupancy map compression 310 leverages the auxiliary information described in the previous section, in order to detect the empty TxT blocks (i.e. blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of B0xB0 blocks. B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice B0=2 or B0=4 results in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map.

The compression process may comprise one or more of the following example operations:

• Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value of 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1 it is said to be full, otherwise it is an empty sub-block.

• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.

• A binary information may be encoded for each TxT block to indicate whether it is full or not.

• If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows:

  o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.

  o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.

  o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy:

    ■ The binary value of the initial sub-block is encoded.

    ■ Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.

    ■ The number of detected runs is encoded.

    ■ The length of each run, except for the last one, is also encoded.
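The run-length strategy for the sub-block values of one non-full block can be sketched as follows. The sketch assumes the sub-block values have already been flattened into a list along the chosen traversal order, and it returns plain Python values rather than writing an actual bitstream; the function names are hypothetical.

```python
def run_length_encode(bits):
    """Return (initial value, number of runs, lengths of all runs except the last)."""
    if not bits:
        raise ValueError("at least one sub-block is required")
    runs = []
    current, length = bits[0], 0
    for b in bits:
        if b == current:
            length += 1
        else:
            runs.append(length)
            current, length = b, 1
    runs.append(length)
    return bits[0], len(runs), runs[:-1]

def run_length_decode(initial, num_runs, lengths, total):
    """Rebuild the sub-block values; the last run length follows from `total`."""
    if len(lengths) != num_runs - 1:
        raise ValueError("expected one length per run except the last")
    all_lengths = list(lengths) + [total - sum(lengths)]
    out, value = [], initial
    for n in all_lengths:
        out.extend([value] * n)
        value = 1 - value  # runs alternate between 0 and 1
    return out
```

The last run length does not need to be transmitted because the decoder knows the total number of sub-blocks in a TxT block.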

Figure 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 401 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 402. In addition, the de-multiplexer 401 transmits the compressed occupancy map to occupancy map decompression 403. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 404. Decompressed geometry video from the video decompression 402 is delivered to geometry reconstruction 405, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 405 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.

The reconstructed geometry image may be provided for smoothing 406, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 407, which also receives a decompressed texture video from video decompression 402. The texture reconstruction 407 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images. The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:

δ(u, v) = δ0 + g(u, v)

s(u, v) = s0 - u0 + u

r(u, v) = r0 - v0 + v

where g(u, v) is the luma component of the geometry image.
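The three relations can be written as a small helper. This is an illustrative sketch only: the function name and argument layout are hypothetical, and mapping the resulting (depth, tangential, bi-tangential) triple back to (x, y, z) according to the projection plane index is not shown.

```python
def reconstruct_point(u, v, g_uv, patch_location, bounding_box):
    """Return (depth, tangential shift, bi-tangential shift) for pixel (u, v).

    g_uv is the luma value of the geometry image at (u, v),
    patch_location is the patch's (depth, tangential, bi-tangential) offset,
    bounding_box is the patch's 2D bounding box (u0, v0, u1, v1).
    """
    depth0, s0, r0 = patch_location
    u0, v0, _u1, _v1 = bounding_box
    depth = depth0 + g_uv
    s = s0 - u0 + u
    r = r0 - v0 + v
    return depth, s, r
```

For a patch located at (100, 20, 30) with bounding box origin (4, 6), the pixel (5, 7) with luma 12 maps to depth 112, tangential shift 21 and bi-tangential shift 31.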

For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction.

Visual volumetric video-based Coding (3VC) relates to a core part shared between ISO/IEC 23090-5 (formerly V-PCC (Video-based Point Cloud Compression)) and ISO/IEC 23090-12 (formerly MIV (MPEG Immersive Video)). 3VC will not be issued as a separate document, but as part of ISO/IEC 23090-5 (expected to include clauses 1-8 of the current V-PCC text). ISO/IEC 23090-12 will refer to this common part. ISO/IEC 23090-5 is expected to be renamed to 3VC PCC, ISO/IEC 23090-12 renamed to 3VC MIV.

Up to now, it has been assumed in V-PCC and MIV that the apparent brightness of a surface to an observer is the same regardless of the observer’s angle of view. Such a surface is called a Lambertian surface. In this case, a reflection, for example, is calculated by taking the dot product of the surface's normal vector N, and a normalized light-direction vector L, pointing from the surface to a light source. The resulting value is then multiplied by the color of the surface and the intensity of the light hitting the surface. The angle between the directions of the two vectors L and N determines the intensity, which is highest if the normal vector points in the same direction as the light vector.
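The Lambertian calculation described above can be sketched as follows. The function names are hypothetical, and clamping negative dot products to zero (surfaces facing away from the light) is a common convention assumed here rather than something stated in the text.

```python
import math

def normalize(vec):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in vec))
    return tuple(c / length for c in vec)

def lambertian_color(normal, light_dir, surface_color, light_intensity):
    """Diffuse colour: (N . L) times the surface colour times the light intensity."""
    n = normalize(normal)
    l = normalize(light_dir)
    cos_theta = max(0.0, sum(a * b for a, b in zip(n, l)))
    return tuple(c * light_intensity * cos_theta for c in surface_color)
```

The result does not depend on the viewer's position, which is why a single colour attribute per point is sufficient for a Lambertian surface.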

Surface Light Field (SLF) representations aim to provide photo-realistic, free-viewpoint viewing experiences. They produce a high-quality sense of presence by reproducing motion parallax and extremely realistic textures and lighting. For each point of a point cloud, there exist several viewing-direction-dependent attributes, e.g. texture colour from various viewing angles. At the receiver side, all attributes are decoded, and the renderer may choose the appropriate colour reconstruction based on the viewer’s position and orientation. By providing different texture colours, non-Lambertian reflections are reproduced much more realistically than with only a single attribute.

Signaling additional color attributes in V-PCC may be done as additional individual video streams. Thus, every additional color may significantly increase the required bit rate. For example, when content has 13 attribute streams for 13 different cameras, the required bitrate to transmit all these attributes is increased eight-fold compared to transmitting a single colour only. However, not every single point actually requires all 13 camera views for a convincing reconstruction. Lambertian surfaces, such as cloth, have the same colour reproduction independent of the viewpoint. Therefore, it may be beneficial to identify areas that most benefit from SLF representation, e.g. highly reflective surfaces, and transmit only a reduced set of cameras for Lambertian surfaces.

The present embodiments address the above drawback by identifying such “high activity” SLF areas and providing the means to efficiently signal them in the current V-PCC standard.

In general, a method according to an embodiment comprises

- analyzing SLF content to identify the most active regions, i.e. regions with the most changes based on viewport; and

- signaling the areas with the most active regions with more camera views than areas with less active regions.

These steps are discussed in more detail in the following:

SLF optimized patching of V-PCC content

According to an embodiment, an encoder receives an SLF point cloud. The SLF point cloud is a point cloud with n (n > 1) attributes, representing n different viewing angles. The attribute can be any attribute representing the appearance of the model as seen from a specific viewing angle. An example of the attribute is a color attribute. The point cloud may contain arbitrary (non-visual) attributes as well, but what is relevant for SLF coding is the attributes describing light as seen from different angles.

The encoder is configured to analyze the received SLF point cloud in order to identify areas with various activity levels. The activity levels may comprise high SLF activity, medium SLF activity and low SLF activity. It is appreciated that the number of activity levels may vary. Therefore, the areas to be identified may comprise one or more different SLF activity levels. For example, the encoder may compare the luma value Y of a 3D point p(X, Y, Z) seen from camera C0 to all other available cameras C1, ..., Cn. Measures indicating high activity can include one or more of the following:

- variance of luma Y(p) over all cameras C0 to Cn;

- variance of colour sub-channels U(p)/V(p);

- local variance in the neighborhood of the 3D point;

- detection of specular reflection by thresholding the color in HSV color space.

It is appreciated that the measures indicating high activity can contain other measures in addition to, or instead of, the examples listed above.
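As a minimal sketch, some of the measures listed above could be computed for a single 3D point as shown below. The function names, the choice to sum the U and V variances into one value, and the HSV thresholds are all assumptions made for illustration; the text does not prescribe any of them. Colour components are assumed to be normalized to the range 0 to 1 for the HSV test.

```python
import colorsys
from statistics import pvariance


def luma_variance(luma_per_camera):
    """Population variance of a point's luma Y(p) over all cameras."""
    return pvariance(luma_per_camera)


def chroma_variance(u_per_camera, v_per_camera):
    """Variance of the colour sub-channels U(p) and V(p).

    Assumption: the two variances are simply summed into one score.
    """
    return pvariance(u_per_camera) + pvariance(v_per_camera)


def is_specular(rgb, min_value=0.9, max_saturation=0.2):
    """Detect a specular highlight by thresholding in HSV colour space.

    Assumption: a highlight is a bright, weakly saturated colour. Both
    thresholds are hypothetical and would need tuning per content.
    """
    _, saturation, value = colorsys.rgb_to_hsv(*rgb)
    return value >= min_value and saturation <= max_saturation
```

A point whose luma is identical in every camera yields a variance of zero, which corresponds to the Lambertian case.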

Based on the performed analysis, the 3D points are classified into different classes of SLF activity. For example, the different classes may be: high SLF activity, medium SLF activity and low SLF activity. It is appreciated that any number of classes can be used, however for simplicity, in this example the embodiments are discussed in relation to the given three classes.
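The classification step could, for instance, be a simple thresholding of a per-point activity score. The following is only a sketch: the function name and both threshold values are hypothetical, and the score is assumed to be a non-negative number such as a variance over cameras.

```python
def classify_activity(score, low_threshold=25.0, high_threshold=100.0):
    """Map an activity score to one of the three example classes.

    Both thresholds are hypothetical placeholders, not values from the text.
    """
    if score < low_threshold:
        return "low"
    if score < high_threshold:
        return "medium"
    return "high"
```

Supporting a different number of classes would only require a longer list of thresholds.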

The V-PCC encoder is configured to create the 2D patches using the class information. Thus, every patch is either of high SLF activity, medium SLF activity or low SLF activity.

According to an embodiment, in order to avoid very small patches, a cost function can be defined to ensure that the majority of points represented by a certain patch area are of the same activity level.
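The text does not define the cost function, so the following is just one possible reading: a patch takes the class of the majority of its points, and the cost is the fraction of points that disagree with that class. The function names and the acceptance threshold are assumptions.

```python
from collections import Counter


def patch_class_and_cost(point_classes):
    """Return (majority class, fraction of points not in that class)."""
    counts = Counter(point_classes)
    majority, majority_count = counts.most_common(1)[0]
    cost = 1.0 - majority_count / len(point_classes)
    return majority, cost


def accept_patch(point_classes, max_cost=0.5):
    """Accept a patch only if a majority of its points share one class.

    max_cost is a hypothetical tuning parameter.
    """
    _, cost = patch_class_and_cost(point_classes)
    return cost < max_cost
```

Under this reading, a patch that fails the test would be merged with a neighbour or re-segmented rather than split into many very small single-class patches.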

SLF activity and camera view packing

By using the patches that have been classified according to the activity, the encoder is configured to decide to send only a sub-set of the available cameras per patch, based on the patch’s SLF activity. For example:

- All patches having a low SLF activity are represented by only a single camera view.

- All patches having a medium SLF activity are represented by three camera views.

- All patches with a high SLF activity are represented by five camera views.

The decision on which camera set shall be used, is made on a per-patch basis, and can be implemented based on various approaches, for example:

- Spatial distribution in 3D space: Select m cameras covering a certain 3D space around the selected patch. For example, the closest camera for low activity, or the closest, furthest and an in-between camera for medium activity.

- Distribution of the color: Fit a spherical distribution to the color. Estimate the parameters of the distribution. Send these estimated parameters for further reconstruction.

- Reconstruction error: as an example, incrementally add cameras that provide the best reconstruction of the entire set of source cameras until the desired number of cameras (based on patch activity class) or a threshold quality (based on error in the reconstructed colors) is reached.
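The reconstruction-error approach in the list above can be sketched as a greedy selection. Several simplifications are assumed here and are not taken from the text: each camera contributes a single scalar colour value for the patch, omitted cameras are reconstructed by copying the spatially closest selected camera, and the error is a sum of squared differences. All names are hypothetical.

```python
import math


def reconstruction_error(selected, positions, colors):
    """Sum of squared errors when every source camera is reconstructed
    by copying the spatially closest selected camera."""
    total = 0.0
    for cam, pos in positions.items():
        nearest = min(selected, key=lambda s: math.dist(pos, positions[s]))
        total += (colors[cam] - colors[nearest]) ** 2
    return total


def select_cameras(positions, colors, max_cameras, max_error=0.0):
    """Greedily add the camera that most reduces the reconstruction error.

    Stops when max_cameras (set by the patch activity class) is reached
    or when the error drops to max_error (the quality threshold).
    """
    selected = []
    remaining = sorted(positions)
    while remaining and len(selected) < max_cameras:
        best = min(
            remaining,
            key=lambda c: reconstruction_error(selected + [c], positions, colors),
        )
        selected.append(best)
        remaining.remove(best)
        if reconstruction_error(selected, positions, colors) <= max_error:
            break
    return selected
```

For a patch whose colour is the same in every camera, the loop stops after the first camera, which matches the single-view treatment of low-activity patches.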

According to an embodiment, a reduced number of attribute video streams are generated from the SLF activity information per patch. Following the example above, a total of five video streams are created. The following rules are used to generate the attribute videos.

1. For the first attribute video stream: packing one camera view for all patches in a video frame according to the most suitable patch packing strategy. This does not necessarily have to be the same camera view for all patches.

2. For the next two video streams: packing two additional camera views for only the medium activity patches and high activity patches, at exactly the same positions as in the first video stream. These do not necessarily have to be the same camera views for all patches.

3. For the last two video streams: packing two additional camera views only for the high activity patches, at exactly the same positions as in the other three video streams. These do not necessarily have to be the same camera views for all patches.

Figure 5 illustrates three example frames 501, 502, 503 for a video frame packed with patches of various degrees of SLF activity. The first frame 501 comprises low SLF activity patches 551, medium SLF activity patches 552 and high SLF activity patches 553. The second frame 502 comprises only medium SLF activity patches 552 and high SLF activity patches 553. The third frame 503 comprises a high SLF activity patch 553.
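The packing rules above can be illustrated with a short sketch that distributes the selected camera views of each patch over the five attribute video streams. The data layout (a dictionary from patch identifier to its ordered list of selected camera indices) and the function name are assumptions; the actual 2D placement of patches within a frame is not modelled.

```python
def pack_streams(patches, num_streams=5):
    """Assign the i-th selected camera view of every patch to stream i.

    patches maps a patch identifier to its ordered list of selected camera
    indices (one for low, three for medium, five for high activity in the
    running example). Each returned stream maps patch identifier to the
    camera index packed for that patch in that stream.
    """
    streams = [{} for _ in range(num_streams)]
    for patch_id, cameras in patches.items():
        for i, camera in enumerate(cameras[:num_streams]):
            streams[i][patch_id] = camera
    return streams
```

In the resulting structure the first stream contains every patch, while later streams contain progressively fewer patches, mirroring frames 501 to 503.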

SLF activity and camera signaling

The selected cameras may then be signaled in the V-PCC bitstream as shown below (excerpt from ISO/IEC 23090-5, V-PCC) with respect to the patch data unit syntax:

In the above syntax structure, an element pdu_patchAttCount indicates the number of encoded camera views, e.g. three for a patch with medium SLF activity, and an element pdu_AttCameraIdx(i) lists the respective camera indices.

Furthermore, the list of all available cameras and their positions in 3D space is signaled in or along the bitstream. This can be on a per-sequence level if the cameras are static, or on a per-frame level if one or more cameras are moving. It is also possible to signal a sub-set of cameras per frame while the others remain static, or to signal just individual updates on a per-frame level. The syntax element pdu_AttCameraIdx refers to this list of cameras.
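How a receiver might combine the per-patch camera indices with the signaled camera list can be sketched as follows. This is not bitstream syntax: the function names and the in-memory representation (a list of 3D positions indexed by camera index, and a dictionary of per-frame updates) are hypothetical.

```python
def resolve_cameras(camera_indices, camera_positions):
    """Look up the 3D position of every camera index signaled for a patch."""
    for idx in camera_indices:
        if not 0 <= idx < len(camera_positions):
            raise ValueError(f"camera index {idx} is not in the camera list")
    return [camera_positions[idx] for idx in camera_indices]


def apply_frame_updates(camera_positions, updates):
    """Return a new camera list with per-frame updates applied.

    updates maps camera index to new position; cameras that are not
    listed keep their previous (static) position.
    """
    updated = list(camera_positions)
    for idx, position in updates.items():
        updated[idx] = position
    return updated
```

The number of entries in `camera_indices` plays the role of the signaled camera view count for the patch.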

SLF reconstruction

With respect to the above example, a decoder receives a V-PCC bit stream with five attribute video streams, as well as the necessary patch data unit information to identify which cameras are signaled, and the information on how many cameras are to be reconstructed and where in 3D space these cameras are.

The reconstruction may be performed on a per-patch basis:

- For any low SLF activity patch, only one camera view is available, thus this view is used to reconstruct all the remaining n-1 cameras.

- For a medium SLF patch, three camera views are available to be used to reconstruct n-3 cameras.

- For a high SLF patch, five camera views are available to be used to reconstruct n-5 cameras.

As mentioned before, any other number and combination of SLF activity and camera view number is possible.

The reconstruction of the missing camera information can be done for example by:

• Direct copy based on spatial distance: Select the available camera closest to the reconstructed camera in 3D space.

• (Weighted) blending based on spatial distance: Blend the x available cameras closest to the reconstructed camera in 3D space, possibly weighted by distance.

• Spherical distribution: Use the signaled parameters of the spherical distribution to reconstruct the color.
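The first two reconstruction options listed above may be sketched as follows. The representation of an available camera as a (position, colour) pair, the function names, and the use of inverse-distance weights are assumptions; the text only states that the blend is possibly weighted by distance.

```python
import math


def direct_copy(target_pos, available):
    """Copy the colour of the available camera closest to target_pos.

    available is a list of (position, colour) pairs.
    """
    return min(available, key=lambda pc: math.dist(target_pos, pc[0]))[1]


def blend(target_pos, available, x=2, eps=1e-9):
    """Blend the x closest available cameras with inverse-distance weights.

    eps avoids division by zero when the target coincides with a camera.
    """
    nearest = sorted(available, key=lambda pc: math.dist(target_pos, pc[0]))[:x]
    weights = [1.0 / (math.dist(target_pos, pos) + eps) for pos, _ in nearest]
    total = sum(weights)
    channels = len(nearest[0][1])
    return tuple(
        sum(w * colour[i] for w, (_, colour) in zip(weights, nearest)) / total
        for i in range(channels)
    )
```

For a low-activity patch only one camera is available, so both functions reduce to returning that camera's colour.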

According to an embodiment, the reconstruction may happen in the rendering stage, directly from the decoded video frames and without creating temporary in-memory copies of the cameras omitted during encoding.

A method according to an embodiment is shown in Figure 4. The method generally comprises at least receiving 410 a point cloud with a number of visual attributes, where the number represents different viewing angles; identifying 420 areas of the point cloud according to their surface light field (SLF) activity and classifying the areas accordingly into one or more different classes; generating 430 two-dimensional patches from the points of the point cloud and, by using the classification information, assigning the generated patches to a corresponding class; selecting 440 a set of cameras for patches according to the SLF activity of a patch; generating 450 a number of attribute video streams into a bitstream, where at least for the first attribute video stream one camera view is packed for all patches, and where at least for the last attribute video stream two or more camera views are packed for patches having a high activity; and encoding 460 information on the selected cameras into a bitstream. Each of the steps can be implemented by a respective module of a computer system.

An apparatus according to an embodiment comprises at least means for receiving a point cloud with a number of visual attributes, where the number represents different viewing angles; means for identifying areas of the point cloud according to their surface light field (SLF) activity and classifying the areas accordingly into one or more different classes; means for generating two-dimensional patches from the points of the point cloud and, by using the classification information, assigning the generated patches to a corresponding class; means for selecting a set of cameras for patches according to the SLF activity of a patch; means for generating a number of attribute video streams into a bitstream, where at least for the first attribute video stream one camera view is packed for all patches, and where at least for the last attribute video stream two or more camera views are packed for patches having a high activity; and means for encoding information on the selected cameras into a bitstream. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 4 according to various embodiments.

A method according to another embodiment is shown in Figure 5. The method generally comprises at least receiving 510 a bitstream; decoding 520 from the bitstream a number of attribute video streams and information on selected cameras; reconstructing 530 cameras for the patches by using the camera views that are available for a patch according to its surface light field (SLF) activity; determining 540 three-dimensional positions of points from the two-dimensional patches; reconstructing 550 a three-dimensional point cloud according to the points and their three-dimensional positions; and rendering 560 the reconstructed point cloud.

An apparatus according to another embodiment comprises at least means for receiving a bitstream; means for decoding from the bitstream a number of attribute video streams and information on selected cameras; means for reconstructing cameras for the patches by using the camera views that are available for a patch according to its surface light field (SLF) activity; means for determining three-dimensional positions of points from the two-dimensional patches; means for reconstructing a three-dimensional point cloud according to the points and their three-dimensional positions; and means for rendering the reconstructed point cloud. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 5 according to various embodiments.

An example of an apparatus is disclosed with reference to Figure 6. Figure 6 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may also be comprised in a local or a remote server or a graphics processing unit of a computer. The device may also be comprised as part of a head-mounted display device. The apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.

The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.

The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.

The various embodiments may provide advantages. For example, the coding efficiency may be significantly improved for attribute texture data. In addition, fewer decoder instances are needed as fewer attribute values are signaled. Yet further, the present embodiments enable savings in GPU memory and bandwidth in the client device.

The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The computer program code comprises one or more operational characteristics. Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system are configured to implement a method according to various embodiments.

A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.

Claims

1. A method for encoding, comprising:

- encoding information on the selected cameras into a bitstream.

2. A method for decoding, comprising

- receiving a bitstream;

- decoding from the bitstream a number of attribute video streams and information on selected cameras;

- rendering the reconstructed point cloud.

3. An apparatus for encoding, comprising:

- means for identifying areas of the point cloud according to their surface light field (SLF) activity and classifying the areas accordingly into one or more different classes;

- means for generating two-dimensional patches from the points of the point cloud and, by using the classification information, assigning the generated patches to a corresponding class;

- means for encoding information on the selected cameras into a bitstream.

4. The apparatus according to claim 3, wherein the different classes may comprise at least a high SLF activity, and a low SLF activity.

5. The apparatus according to claim 3, wherein an SLF activity is identified according to one or more of the following:

- a variance of luma over all cameras;

- a variance of color sub-channels;

- local variance in the neighborhood of a three-dimensional point;

- detection of specular reflection by thresholding the color in hue saturation value.

6. The apparatus according to any of the claims 3 to 5, further comprising means for selecting the cameras so that

- patches having a low SLF activity are represented by a single camera view; and

7. The apparatus according to claim 6, wherein cameras to be used are selected according to one or more of the following:

- spatial distribution in 3D space;

- distribution of a color;

- reconstruction error.

8. The apparatus according to any of the claims 3 to 7, further comprising means for

- packing one camera view for all patches in a video frame for the first attribute video stream;

- packing additional camera views for higher activity patches in at least one additional video stream; and

- continuously packing additional camera views for even higher activity patches in at least one video stream, until the number of desired total video streams is reached.

9. The apparatus according to any of the claims 3 to 8, further comprising means for encoding into a bitstream one or more of the following:

- number of encoded camera views;

- list of respective camera indices;

- list of available cameras and their position in 3D space;

10. An apparatus for decoding comprising

- means for receiving a bitstream;

- means for reconstructing a three-dimensional point cloud according to the points and their three-dimensional positions; and

- means for rendering the reconstructed point cloud.

11. The apparatus according to claim 10, wherein patches having a low SLF activity are represented by a single camera view; and patches having higher SLF activity are represented by at least two camera views.

12. The apparatus according to claim 10 or 11, wherein one camera view has been packed for all patches in a video frame for the first attribute video stream; additional camera views have been packed for higher activity patches in at least one additional video stream; and additional camera views have been continuously packed for even higher activity patches in at least one video stream, until the number of desired total video streams is reached.

13. The apparatus according to any of the claims 10 to 12, further comprising means for decoding from a bitstream one or more of the following:

- number of encoded camera views;

- list of respective camera indices; and

- list of available cameras and their position in 3D space;

14. An apparatus for encoding, the apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- select a set of cameras for patches according to the SLF activity of a patch;

- encode information on the selected cameras into a bitstream.

15. An apparatus for decoding, the apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive a bitstream;

- render the reconstructed point cloud.