WO2019076503A1 - An apparatus, a method and a computer program for coding volumetric video - Google Patents

An apparatus, a method and a computer program for coding volumetric video Download PDF

Info

Publication number
WO2019076503A1
WO2019076503A1 (PCT/EP2018/070444)
Authority
WO
WIPO (PCT)
Prior art keywords
attribute
video data
volumetric video
data presentation
volumetric
Prior art date
Application number
PCT/EP2018/070444
Other languages
French (fr)
Inventor
Payman Aflaki Beni
Vinod Kumar Malamal Vadakital
Sebastian Schwarz
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2019076503A1 publication Critical patent/WO2019076503A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103Selection of coding mode or of prediction mode
    • H04N19/105Selection of the reference unit for prediction within a chosen coding or prediction mode, e.g. adaptive choice of position and number of pixels used for prediction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding

Definitions

  • the present invention relates to an apparatus, a method and a computer program for encoding and decoding of volumetric video.
  • a video coding system may comprise an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form.
  • the encoder may discard some information in the original video sequence in order to represent the video in a more compact form, for example, to enable the storage/transmission of the video information at a lower bitrate than otherwise might be needed.
  • Volumetric video data represents a three-dimensional scene or object and can be used as input for virtual reality (VR), augmented reality (AR) and mixed reality (MR) applications.
  • Such data describes the geometry attribute, e.g. shape, size, position in three-dimensional (3D) space, and other respective attributes, e.g. colour, opacity, reflectance and any possible temporal changes of the geometry attribute and other attributes at given time instances, comparable to frames in two-dimensional (2D) video.
  • Volumetric video is either generated from 3D models through computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible.
  • CGI computer-generated imagery
  • Typical representation formats for such volumetric data are triangle meshes, point clouds (PCs), or voxel arrays.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e. "frames" in 2D video, or other means, e.g. position of an object as a function of time.
  • Some embodiments provide a method for encoding and decoding video information
  • the one or more volumetric video data representations comprise at least a first attribute and a second attribute
  • An apparatus comprises at least one processor and at least one memory, said at least one memory stored with code thereon, which when executed by said at least one processor, causes the apparatus to perform at least:
  • volumetric video data representations wherein the one or more volumetric video data representations comprise at least a first attribute and a second attribute;
  • a computer readable storage medium comprises code for use by an apparatus, which when executed by a processor, causes the apparatus to perform:
  • volumetric video data representations wherein the one or more volumetric video data representations comprise at least a first attribute and a second attribute;
  • An apparatus according to a fourth aspect comprises:
  • volumetric video data presentation comparing a first attribute of the volumetric video data presentation with a corresponding first attribute of a previously decoded volumetric video data presentation, if the encoded volumetric video data presentation comprises inter prediction data, or
  • volumetric video data presentation comparing a first attribute of two or more parts of the volumetric video data presentation, if the encoded volumetric video data presentation comprises intra prediction data;
  • An apparatus comprises at least one processor and at least one memory, said at least one memory stored with code thereon, which when executed by said at least one processor, causes the apparatus to perform at least:
  • volumetric video data presentation compares a first attribute of the volumetric video data presentation with a corresponding first attribute of a previously decoded volumetric video data presentation, if the encoded volumetric video data presentation comprises inter prediction data, or
  • volumetric video data presentation compares a first attribute of two or more parts of the volumetric video data presentation, if the encoded volumetric video data presentation comprises intra prediction data;
  • a computer readable storage medium comprises code for use by an apparatus, which when executed by a processor, causes the apparatus to perform:
  • volumetric video data presentation compares a first attribute of the volumetric video data presentation with a corresponding first attribute of a previously decoded volumetric video data presentation, if the encoded volumetric video data presentation comprises inter prediction data, or
  • volumetric video data presentation compares a first attribute of two or more parts of the volumetric video data presentation, if the encoded volumetric video data presentation comprises intra prediction data;
  • An apparatus comprises:
  • Figure 1a shows an example of a multi-camera system as a simplified block diagram, in accordance with an embodiment
  • Figure 1b shows a perspective view of a multi-camera system, in accordance with an embodiment
  • Figure 2a illustrates an example of a representation of geometry attributes as a voxel octree, in accordance with an embodiment
  • Figure 2b illustrates an example of a representation of colour attributes as a two-dimensional texture map, in accordance with an embodiment
  • Figures 2c and 2d illustrate an example of a relationship between the geometry attributes and voxel colour attributes, in accordance with an embodiment
  • Figures 3a and 3b illustrate some examples of changes in voxel octree and corresponding changes in two-dimensional texture map;
  • Figure 4a depicts as a simplified block diagram an apparatus for predicting and encoding voxel clouds, in accordance with an embodiment
  • Figure 4b depicts as a simplified block diagram an apparatus for decoding
  • Figure 5a shows a flow chart of an encoding method, in accordance with an embodiment
  • Figure 5b shows a flow chart of an encoding method, in accordance with another embodiment
  • Figure 5c shows a flow chart of a decoding method, in accordance with an embodiment
  • Figure 6 illustrates an example of a volumetric video pipeline
  • Figure 7a shows a schematic diagram of an encoder suitable for implementing embodiments of the invention.
  • Figure 7b shows a schematic diagram of a decoder suitable for implementing embodiments of the invention.
  • Figure 8 shows schematically an electronic device employing embodiments of the invention
  • Figure 9 shows schematically a user equipment suitable for employing embodiments of the invention
  • Figure 10 further shows schematically electronic devices employing embodiments of the invention
  • Voxel of a three-dimensional world corresponds to a pixel of a two- dimensional world. Voxels exist in a three-dimensional grid layout.
  • An octree is a tree data structure used to partition a three-dimensional space. Octrees are the three-dimensional analogue of quadtrees.
  • a sparse voxel octree (SVO) describes a volume of a space containing a set of solid voxels of varying sizes. Empty areas within the volume are absent from the tree, which is why it is called "sparse".
  • a three-dimensional volumetric representation of a scene is determined as a plurality of voxels on the basis of input streams of at least one multicamera device.
  • at least one but preferably a plurality (i.e. 2, 3, 4, 5 or more) of multicamera devices are used to capture 3D video representation of a scene.
  • the multicamera devices are distributed in different locations in respect to the scene, and therefore each multicamera device captures a different 3D video representation of the scene.
  • representations captured by each multicamera device may be used as input streams for creating a 3D volumetric representation of the scene, said 3D volumetric representation comprising a plurality of voxels.
  • Voxels may be formed from the captured 3D points e.g. by merging the 3D points into voxels comprising a plurality of 3D points such that for a selected 3D point, all neighboring 3D points within a predefined threshold from the selected 3D point are merged into a voxel without exceeding a maximum number of 3D points in a voxel.
  • Voxels may also be formed through the construction of the sparse voxel octree. Each leaf of such a tree represents a solid voxel in world space; the root node of the tree represents the bounds of the world.
  • the sparse voxel octree construction may have the following steps: 1) map each input depth map to a world space point cloud, where each pixel of the depth map is mapped to one or more 3D points; 2) determine voxel attributes such as colour and surface normal vector by examining the neighborhood of the source pixel(s) in the camera images and the depth map; 3) determine the size of the voxel based on the depth value from the depth map and the resolution of the depth map; 4) determine the SVO level for the solid voxel as a function of its size relative to the world bounds; 5) determine the voxel coordinates on that level relative to the world bounds; 6) create new and/or traverse existing SVO nodes until arriving at the determined voxel coordinates; 7) insert the solid voxel as a leaf of the tree, possibly replacing or merging attributes from a previously existing voxel at those coordinates. Nevertheless, the size of voxel within the 3D
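The construction outline above can be illustrated with a small sketch. The following Python fragment is not part of the patent text; it assumes a simple pinhole depth camera with hypothetical intrinsics (fx, fy, cx, cy) and only covers steps 1, 3 and 4 (back-projection, voxel size, SVO level).

```python
import numpy as np

def depth_pixel_to_voxel(u, v, depth, fx, fy, cx, cy, world_size, max_level):
    """Map one depth-map pixel to a world-space point and choose an SVO level
    whose voxel size roughly matches the pixel's footprint at that depth
    (steps 1, 3 and 4 of the construction outline above)."""
    # Step 1: back-project the pixel with a simple pinhole camera model.
    point = np.array([(u - cx) * depth / fx,
                      (v - cy) * depth / fy,
                      depth])
    # Step 3: the footprint of one pixel at this depth gives the voxel size.
    voxel_size = depth / fx
    # Step 4: pick the octree level whose cell size is closest to that footprint.
    level = int(round(np.log2(world_size / voxel_size)))
    level = max(0, min(max_level, level))
    return point, voxel_size, level

# Example: a pixel at (320, 240) seen 2 m away by a 640x480 depth camera.
p, s, lvl = depth_pixel_to_voxel(320, 240, 2.0, fx=525.0, fy=525.0,
                                 cx=319.5, cy=239.5, world_size=16.0, max_level=12)
```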
  • a volumetric video frame is a complete sparse voxel octree that models the world at a specific point in time in a video sequence.
  • Voxel attributes contain information like colour, opacity, surface normal vectors, and surface material properties. These are referenced in the sparse voxel octrees (e.g. colour of a solid voxel), but can also be stored separately.
  • Point clouds are commonly used data structures for storing volumetric content.
  • sparse voxel octrees describe a recursive subdivision of a finite volume with solid voxels of varying sizes
  • point clouds describe an unorganized set of separate points limited only by the precision of the used coordinate values.
  • points may be represented with any floating point coordinates.
  • a quantized point cloud may be used to reduce the amount of data, whereby the coordinate values of the point cloud are represented e.g. with 10-bit, 12-bit or 16-bit integers. Integers may be used because hardware accelerators may be able to operate on integers more efficiently.
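As a rough sketch of the quantization mentioned above, the following assumes axis-wise min/max normalisation; the bit depth and rounding rule are illustrative choices rather than requirements of the text.

```python
import numpy as np

def quantize_point_cloud(points, bits=12):
    """Quantize floating-point coordinates to unsigned integers with the given
    bit depth (e.g. 10, 12 or 16 bits, as mentioned above)."""
    points = np.asarray(points, dtype=np.float64)
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    scale = (2 ** bits - 1) / np.maximum(hi - lo, 1e-12)  # avoid division by zero
    q = np.round((points - lo) * scale).astype(np.uint16)
    return q, lo, scale                                   # lo and scale allow dequantization

def dequantize(q, lo, scale):
    """Recover approximate floating-point coordinates from the integers."""
    return q.astype(np.float64) / scale + lo
```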
  • the points in the point cloud may have associated colour, reflectance, opacity and/or other texture values.
  • the points in the point cloud may also have a size, or a size may be the same for all points. The size of the points may be understood as indicating how large an object the point appears to be in the model in the projection.
  • the point cloud may be projected by ray casting from the projection surface to find out the pixel values of the projection surface. In such a manner, the topmost point remains visible in the projection, while points closer to the center of the projection surface may be occluded.
  • Voxel coordinates uniquely identify an individual node or solid voxel within the octree.
  • the coordinates are not stored in the SVO itself but instead describe the location and size of the node/voxel.
  • the coordinates have four integer components: level, X, Y, and Z.
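A minimal sketch of how the four integer components (level, X, Y, Z) can identify a voxel without being stored in the SVO itself, assuming an axis-aligned cubic world with a known origin and size (the function names are hypothetical):

```python
def voxel_coordinates(point, world_origin, world_size, level):
    """Integer (level, X, Y, Z) coordinates of the voxel containing `point`.
    At a given level the world is a (2**level)^3 grid over the world bounds."""
    cells = 1 << level                                    # cells per axis at this level
    cell_size = world_size / cells
    coords = [int((point[i] - world_origin[i]) // cell_size) for i in range(3)]
    coords = [min(max(c, 0), cells - 1) for c in coords]  # clamp to the grid
    return (level, *coords)

def voxel_bounds(level, x, y, z, world_origin, world_size):
    """The voxel's location and size are fully recoverable from the coordinates."""
    cell_size = world_size / (1 << level)
    origin = [world_origin[i] + c * cell_size for i, c in enumerate((x, y, z))]
    return origin, cell_size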
  • each frame may produce several hundred megabytes or several gigabytes of voxel data which needs to be converted to a format that can be streamed to the viewer, and rendered in real-time.
  • the amount of data depends on the world complexity and the number of cameras. The larger impact comes in a multi-device recording setup with a number of separate locations where the cameras are recording. Such a setup produces more information than a camera at a single location.
  • 3D data acquisition devices have enabled the reconstruction of highly detailed volumetric video representations of natural scenes.
  • Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data.
  • Representation of the 3D data depends on how the 3D data is used.
  • Dense voxel arrays have been used to represent volumetric medical data.
  • In 3D graphics, polygonal meshes are extensively used.
  • Point clouds on the other hand are well suited for applications such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold.
  • Another way to represent 3D data is to code it as a set of texture and depth maps, as is the case in the multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps, and multi-level surface maps.
  • the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between endpoints, then efficient compression may become important.
  • the Octree data-structure is used extensively to encode geometry attributes induced by the point cloud.
  • Each node in the octree is a point/voxel.
  • the root voxel is the point cloud aligned bounding box.
  • Each voxel is recursively subdivided into eight child voxels. Only non-empty voxels continue to be subdivided.
  • the position of each voxel is represented by its centre point.
  • Each level in the octree is called a Level of Detail (LOD).
  • LOD Level of Detail
  • the voxel's attributes are set to the average of the respective attributes of all the enclosed points.
  • the octree structure is typically serialised using occupancy coding.
  • each node starting from the root node is coded as an eight bit mask of occupied children, where a one in the mask means that the child spatially contains points of the point cloud and a zero indicates that there are no points in that spatial region of the octree subdivided space.
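A simplified occupancy-coding sketch (not the patent's normative serialisation): nodes are visited breadth-first starting from the root and each emits one eight-bit mask, with bit i set when child i contains points. The dict-based node layout is an assumption made for the example.

```python
def serialise_occupancy(root):
    """Breadth-first occupancy coding of a simple dict-based octree, where each
    node is {"children": [child-or-None] * 8}.  One byte is written per visited
    node, bit i set when child i spatially contains points (simplified sketch)."""
    out = bytearray()
    queue = [root]
    while queue:
        node = queue.pop(0)
        mask = 0
        for i, child in enumerate(node["children"]):
            if child is not None:
                mask |= 1 << i
                queue.append(child)
        out.append(mask)
    return bytes(out)

# A tiny two-level example: root with children 0 and 5 occupied, both leaves.
leaf = {"children": [None] * 8}
root = {"children": [leaf if i in (0, 5) else None for i in range(8)]}
print(serialise_occupancy(root).hex())   # '21' for the root, then '00' per leaf
```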
  • Figure 2a illustrates an example of a representation of geometry attributes as a voxel octree 200. Every node 202 has eight children, and a dark node in this figure represents a leaf node (i.e. a node which does not have any children).
  • leaf nodes which are dark have a colour representation and other leaf nodes or branch nodes (the nodes which have children) do not have a colour representation.
  • the leaf nodes that do not have any children may or may not have a colour representation, depending on the volumetric content being represented by the voxel structure.
  • Each leaf node can be at a different level. The level corresponds to the distance from the top node, i.e. the closer a leaf node is to the top node, the higher its level. The higher the level of a leaf node, the larger the spatial area it represents in the volumetric representation. The lowest-level leaf nodes represent the smallest spatial regions in the current voxel representation.
  • Figure 2b illustrates an example of a representation of voxel colours as a two- dimensional texture map 204.
  • Each cell in Figure 2b represents the colour information of a respective voxel.
  • the respective voxels are as shown in Figure 2a with dark leaf nodes (some of them are labelled with the reference numeral 202).
  • the letters R, G and B in the cells 206 illustrate colour parameters of the voxels.
  • the location of the cell 206 in the two-dimensional texture map 204 may indicate the voxel whose parameters are stored in the cell, or there may be another way to assign the parameter cells to the voxels of the octree.
  • One conventional method is to arrange the values of the two-dimensional texture map 204 in accordance with the voxel octree structure, meaning that starting from the top level and going to the next level, and within each level starting from the left, whenever a leaf node has colour values, they are presented in the two-dimensional texture map 204.
  • In this method, a correspondence between the two-dimensional texture map 204 and the voxel structure is created. All cells in the two-dimensional texture map 204 have a respective leaf node in the voxel presentation of Figure 2a.
  • Depending on the level of the leaf node, the number of cells assigned to that leaf node may vary: the higher the level, the more cells should be assigned to present the colour values of that voxel, and the lower the leaf node level, the fewer cells should be assigned to present the colour values of that voxel.
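The correspondence described above can be sketched as a level-order, left-to-right traversal that hands out texture-map cells to coloured leaf nodes. The 4**(max_level - depth) allocation rule below is an assumption; the text only requires that higher-level leaves receive more cells and lower-level leaves fewer.

```python
from collections import deque

def assign_texture_cells(root, max_level):
    """Assign 2D texture-map cells to coloured leaf nodes of a dict-based octree
    ({"children": [...], "colour": value-or-None}), reading the tree level by
    level from the top and left to right within each level."""
    mapping = {}                                          # id(leaf) -> (first_cell, cell_count)
    next_cell = 0
    queue = deque([(root, 0)])
    while queue:                                          # breadth-first = top level first
        node, depth = queue.popleft()
        children = node["children"]
        if all(c is None for c in children):              # leaf node
            if node.get("colour") is not None:            # only coloured leaves get cells
                count = 4 ** (max_level - depth)          # assumed allocation rule
                mapping[id(node)] = (next_cell, count)
                next_cell += count
        else:
            for child in children:                        # left-to-right child order
                if child is not None:
                    queue.append((child, depth + 1))
    return mapping, next_cell                             # next_cell == total cells used
```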
  • RGB is one of the presentations of the colour information and this information can be characterized with any other type of colour information representation e.g. YUV.
  • Each point in a 3D point cloud is characterized at least by two attributes:
  • the geometry attribute is presented by the voxel octree 200 and the colour (texture) attribute is presented by the two-dimensional texture map 204.
  • each attribute is coded independently.
  • Referencing mechanisms may then be used to relate the different attributes within a static point cloud. There may be a significant cross correlation between attributes within a point cloud. By considering these kinds of inter-attribute correlations increased compression efficiency may be achieved when coding static and dynamic point clouds.
  • the existing correlation between two point clouds captured at two adjacent time instances is used when compressing dynamic point clouds.
  • These correlations may exist between two available attributes, i.e. the geometry attribute and the colour attribute (the respective colour information of each voxel). Since both sets of attributes represent the same scene at the same time stamp, the changes which happen between a first time instance T0 and a second time instance T1 in the first attribute have similarities to the changes that happen to the second attribute between the first time instance T0 and the second time instance T1.
  • This change may be categorized into four different types: 1) some part of the scene disappears, 2) some part of the scene moves, 3) some part appears in the scene, 4) some part of the scene changes formation.
  • the sub-parts represent neighboring parts of a scene at the same time instance.
  • Since 3D scenes typically consist of several larger, consecutive objects, the changes that happen from sub-part N0 to the neighboring sub-part N1 in the first attribute have similarities to the changes that happen to the second attribute between the sub-part N0 and the neighboring sub-part N1.
  • Figure 6 illustrates an example of a volumetric video pipeline.
  • multiple cameras 715 capture video data of the world, which video data is input 720 to the pipeline.
  • the video data comprises image frames, positions and depth maps 730 which are transmitted to the Voxel Encoding 740.
  • the input video material has been divided into shorter sequences of volumetric frames.
  • the encoder is configured to produce a voxel octree for the sequence's volumetric frames at different time instances, and the volumetric frame currently being encoded.
  • the outcome of the Voxel Encoding 740 is a SVOX (Sparse VOXel) file 750, which is transmitted for playback 760.
  • the SVOX file 750 is streamed 770, which creates stream packets 780.
  • a voxel rendering 790 is applied which provides viewer state (e.g. current time, view frustum) 795 to the streaming 770.
  • viewer state e.g. current time, view frustum
  • the first approach utilizes inter-prediction for example as follows.
  • the following description uses the terms a first point cloud and a second point cloud, but generally they represent volumetric presentations.
  • the first point cloud can also be regarded as a first volumetric presentation and the second point cloud can also be regarded as a second volumetric presentation.
  • Figure 4a depicts as a simplified block diagram an apparatus for predicting and encoding voxel clouds, in accordance with an embodiment
  • Figure 5a depicts as a flow diagram a method for predicting and encoding voxel clouds, in accordance with an embodiment
  • Figure 5b depicts as a flow diagram a method for predicting and encoding voxel clouds, in accordance with another embodiment.
  • changes in point clouds are examined whereas in the embodiment of Figure 5b changes in two-dimensional texture maps are examined.
  • a prediction element 220 receives a first voxel cloud at a first time instant and stores it to a memory 222.
  • the prediction element 220 also receives a second voxel cloud at a second time instant and stores it to the memory 222.
  • changes in geometry attributes between voxel clouds of adjacent time-steps can be characterised as follows. If there are no changes in geometry attribute between two adjacent time steps, the voxel clouds can be considered identical. If, however, there are one or more new voxels apart from those that were present in the previous time-step, the change in geometry attribute can be regarded as addition.
  • the change in geometry attribute can be regarded as subtraction.
  • the change in geometry attribute can be regarded as movement.
  • a comparison element 224 may obtain the first voxel cloud and the second voxel cloud from the memory 222 (block 601 in Figure 5a) and compares 602 them. If the comparison element 224 detects 603 differences between the first voxel cloud and the second voxel cloud, the comparison element 224 determines 604 the type of the change(s), i.e. addition, subtraction and/or movement. The comparison element 224 may then store information of the change(s) in the memory 222. This information identifies the location of the change(s) so that an attributes encoding element 226 may use that information to determine 605 where there are corresponding changes in the two-dimensional texture map. Hence, only information of the changed part of the two-dimensional texture map 204 need be encoded 606.
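A hedged sketch of the comparison and encoding flow 601-606 described above, assuming voxel clouds are given as sets of (level, x, y, z) coordinates and that a hypothetical `voxel_to_cells` mapping links each voxel to its cells in the two-dimensional texture map; deformation is folded into the movement case here for brevity.

```python
def encode_inter(prev_voxels, curr_voxels, texture_map, voxel_to_cells):
    """Compare two voxel clouds, classify the change, and keep only the
    texture-map cells of the changed voxels (sketch of blocks 601-606)."""
    prev, curr = set(prev_voxels), set(curr_voxels)
    added = curr - prev                       # addition
    removed = prev - curr                     # subtraction
    if not added and not removed:
        return {"type": "identical", "cells": {}}
    change_type = ("addition" if added and not removed else
                   "subtraction" if removed and not added else
                   "movement")                # both added and removed voxels
    changed_cells = {}
    for voxel in added:
        for cell in voxel_to_cells(voxel):
            changed_cells[cell] = texture_map[cell]   # only changed cells are encoded
    return {"type": change_type,
            "changed_voxels": sorted(added | removed),
            "cells": changed_cells}
```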
  • the voxelized colour attributes of a point cloud are mapped to an N-dimensional array. There is typically some form of referencing between the serialised geometry attribute data structure and the voxelized colour attribute data structure. If the difference in geometry attribute from one time step to another time step is encoded in some way (e.g. XOR coding), then identifying those regions where the voxelized colour attribute data structures may change can be done in a fairly straightforward manner.
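As a sketch of the XOR idea mentioned above, two occupancy-byte serialisations can be XORed to flag nodes whose child occupancy changed; this assumes both serialisations share the same length and node order, which holds only when the tree topology above the changed leaves is unchanged.

```python
def xor_geometry_diff(prev_occupancy: bytes, curr_occupancy: bytes):
    """XOR the serialised occupancy bytes of two time steps; any non-zero byte
    marks an octree node whose children changed, which in turn points to the
    texture-map regions that may need re-encoding."""
    diff = bytes(a ^ b for a, b in zip(prev_occupancy, curr_occupancy))
    changed_nodes = [i for i, b in enumerate(diff) if b]
    return diff, changed_nodes
```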
  • the voxel cloud 200 on the left illustrates a part of a voxel cloud at a first (previous) time instance and the voxel cloud 200 on the right illustrates a corresponding part of the voxel cloud at a second (successive) time instance.
  • the two-dimensional texture map 204 on the left illustrates a part of the two-dimensional texture map of the voxel cloud at the first time instant
  • the two-dimensional texture map 204 on the right illustrates a part of the two-dimensional texture map of the voxel cloud at the second time instant.
  • the second voxel cloud has new voxels which did not exist in the previous voxel cloud. These voxels are surrounded by a dashed triangle 208 in Figure 3a.
  • Letters A, B, ... I in connection with the voxel clouds and the two-dimensional texture maps of Figures 2d, 3a and 3b are only shown for clarifying the relationship between elements in the voxel clouds and the two-dimensional texture maps.
  • Figure 3b illustrates movement of voxels.
  • the voxel cloud 200 on the left illustrates a voxel cloud at a first (previous) time instant and the voxel cloud 200 on the right illustrates a voxel cloud at a second (a successive) time instant;
  • the two-dimensional texture map 204 on the left illustrates the two-dimensional texture map of the voxel cloud at the first time instant and the two-dimensional texture map 204 on the right illustrates the two-dimensional texture map of the voxel cloud at the second time instant.
  • some voxels of the first voxel cloud have moved to a new location (depicted with the arrow).
  • the above procedure may also be performed so that changes in another attribute data are examined, for example changes in the two-dimensional texture map of the first time instant and the second time instant.
  • In Figure 5b, a first two-dimensional texture map and a second two-dimensional texture map are obtained 611 and compared 612 with each other.
  • If changes are detected 613, it is determined 614 whether the change is addition, subtraction, deformation and/or movement.
  • the location of the change(s) in the two-dimensional texture map is determined and this location information is used to determine 615 the corresponding changes in the voxel cloud.
  • This information of the location of the change(s) in the voxel cloud can be encoded 616 so that no information of the unchanged parts of the voxel cloud need be encoded.
  • the change that has happened in one attribute is taken into account and, considering the structural nature of that attribute, it is converted to a recognized/suitable prediction for the other attribute.
  • the structural representation of each attribute should be known and taken into account.
  • the conversion happens based on the inevitable relation between geometry attributes and colour attributes. Such a relation is known to the content provider, and during compression it is taken into account how each change in one attribute is reflected in the other attribute.
  • the structure of the geometry attribute is known.
  • the location of leaf nodes which do have a colour representation is known.
  • a known number of RGB cells will be assigned to them.
  • reading these leaf nodes from the highest level to the lowest level, and in each level from left to right, defines a reading algorithm, and the respective RGB cells will be filled according to this reading algorithm.
  • any change in one attribute (geometry or colour) will have a respective trackable modification in the other attribute.
  • Such a relation can be taken into account for the inter- attribute prediction.
  • Geometry attribute changes between voxels of adjacent sub-parts can be characterised as either one or a combination of the following. If there are no geometry attribute changes between a current sub-part and a previous sub-part, which are adjacent to each other (i.e. adjacent sub-parts), they can be concluded to be identical.
  • the apparatus implementing the second approach may operate as follows.
  • the prediction element 220 receives a voxel cloud and stores it to the memory 222.
  • the comparison element 224 may examine the voxel cloud in sub-sections so that the comparison element 224 obtains one sub-section and another sub-section from the memory 222 and compares them. If the comparison element 224 detects differences between the two sub-sections, the comparison element 224 determines the type of the change(s), i.e. addition, subtraction and/or movement. The comparison element 224 may then store information of the change(s) in the memory 222.
  • This information identifies the location of the change(s) so that the attributes encoding element 226 may use that information to determine where there are corresponding changes in the two-dimensional texture map. Hence, only information of the changed part of the two-dimensional texture map 204 need be encoded. The operation described above may be repeated until all sub-sections of the voxel cloud have been examined before continuing the examination and prediction process for a next voxel cloud of the volumetric video.
  • the second approach may be implemented so that, instead of examining first the voxel clouds, changes in other attribute data is examined. For example, changes in a sub-section of the two-dimensional texture map and another sub-section of the two- dimensional texture map may be examined, and when changes are detected, it is determined whether the change is addition, subtraction and/or movement.
  • the location of the change(s) in the two-dimensional texture map is determined and this location information is used to determine the corresponding change(s) in the voxel cloud.
  • This information of the location of the change(s) in the voxel cloud can be encoded so that no information of the unchanged parts of the voxel cloud need be encoded.
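For the second (intra) approach, a rough sketch of examining sub-sections of a two-dimensional texture map: the map is split into tiles and each tile is compared with its left-hand neighbour, so that only flagged tiles would need full encoding. The tile size and difference threshold are illustrative assumptions, not values from the text.

```python
import numpy as np

def intra_subpart_changes(texture_map, tile=8, threshold=10.0):
    """Flag tiles of an H x W x 3 texture map that differ from their left-hand
    neighbour by more than a mean-absolute-difference threshold."""
    tex = np.asarray(texture_map, dtype=np.float32)
    h, w = tex.shape[:2]
    changed = []
    for ty in range(0, h, tile):
        prev_tile = None
        for tx in range(0, w, tile):
            cur = tex[ty:ty + tile, tx:tx + tile]
            if (prev_tile is not None and cur.shape == prev_tile.shape
                    and np.mean(np.abs(cur - prev_tile)) > threshold):
                changed.append((ty, tx))              # this sub-part needs encoding
            prev_tile = cur
        # (the same comparison could also be run against the tile above)
    return changed
```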
  • the apparatus receives 620 encoded voxel data and stores 622 it into a buffer 232.
  • a decoding element 230 retrieves the stored information and decodes 624 the voxel clouds and two-dimensional texture maps.
  • If the encoded information comprises inter prediction data, the inter prediction element 234 may use that data together with already decoded voxel clouds and/or two-dimensional texture maps to determine 626 changes in adjacent volumetric frames.
  • Correspondingly, if the encoded information comprises intra prediction data, the intra prediction element 236 may use that data together with already decoded sub-parts of voxel clouds and/or two-dimensional texture maps to determine 628 changes between different sub-parts of voxel clouds and/or two-dimensional texture maps, in a similar fashion to the prediction element 220 at the encoder side.
  • Information of the changes together with already decoded parts which have not changed between frames/sub-parts can be used to decode 630 the volumetric frames, which may then, for example, be stored and/or output 632 to e.g. a display.
  • Figure 7a shows a block diagram of a video encoder suitable for employing embodiments of the invention.
  • Figure 7a presents an encoder for two layers, but it would be appreciated that the presented encoder could be similarly simplified to encode only one layer or extended to encode more than two layers.
  • Figure 7a illustrates an embodiment of a video encoder comprising a first encoder section 500 for a base layer and a second encoder section 502 for an enhancement layer.
  • Each of the first encoder section 500 and the second encoder section 502 may comprise similar elements for encoding incoming pictures.
  • the encoder sections 500, 502 may comprise a pixel predictor 302, 402, prediction error encoder 303, 403 and prediction error decoder 304, 404.
  • Figure 7a also shows an embodiment of the pixel predictor 302, 402 as comprising an inter-predictor 306, 406, an intra-predictor 308, 408, a mode selector 310, 410, a filter 316, 416, and a reference frame memory 318, 418.
  • the pixel predictor 302 of the first encoder section 500 receives 300 base layer images of a video stream to be encoded at both the inter- predictor 306 (which determines the difference between the image and a motion compensated reference frame 318) and the intra-predictor 308 (which determines a prediction for an image block based only on the already processed parts of current frame or picture).
  • the output of both the inter-predictor and the intra-predictor are passed to the mode selector 310.
  • the intra-predictor 308 may have more than one intra-prediction modes. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 310.
  • the mode selector 310 also receives a copy of the base layer picture 300.
  • the pixel predictor 402 of the second encoder section 502 receives 400 enhancement layer images of a video stream to be encoded at both the inter-predictor 406 (which determines the difference between the image and a motion compensated reference frame 418) and the intra-predictor 408 (which determines a prediction for an image block based only on the already processed parts of current frame or picture).
  • the output of both the inter-predictor and the intra-predictor are passed to the mode selector 410.
  • the intra-predictor 408 may have more than one intra-prediction modes. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 410.
  • the mode selector 410 also receives a copy of the enhancement layer picture
  • the output of the inter-predictor 306, 406 or the output of one of the optional intra-predictor modes or the output of a surface encoder within the mode selector is passed to the output of the mode selector 310, 410.
  • the output of the mode selector is passed to a first summing device 321, 421.
  • the first summing device may subtract the output of the pixel predictor 302, 402 from the base layer picture 300/enhancement layer picture 400 to produce a first prediction error signal 320, 420 which is input to the prediction error encoder 303, 403.
  • the pixel predictor 302, 402 further receives from a preliminary reconstructor 339, 439 the combination of the prediction representation of the image block 312, 412 and the output 338, 438 of the prediction error decoder 304, 404.
  • the preliminary reconstructed image 314, 414 may be passed to the intra-predictor 308, 408 and to a filter 316, 416.
  • the filter 316, 416 receiving the preliminary representation may filter the preliminary representation and output a final reconstructed image 340, 440 which may be saved in a reference frame memory 318, 418.
  • the reference frame memory 318 may be connected to the inter-predictor 306 to be used as the reference image against which a future base layer picture 300 is compared in inter-prediction operations.
  • the reference frame memory 318 may also be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer picture 400 is compared in inter-prediction operations. Moreover, the reference frame memory 418 may be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer picture 400 is compared in inter-prediction operations.
  • Filtering parameters from the filter 316 of the first encoder section 500 may be provided to the second encoder section 502 subject to the base layer being selected and indicated to be source for predicting the filtering parameters of the enhancement layer according to some embodiments.
  • the prediction error encoder 303, 403 comprises a transform unit 342, 442 and a quantizer 344, 444.
  • the transform unit 342, 442 transforms the first prediction error signal 320, 420 to a transform domain.
  • the transform is, for example, the DCT transform.
  • the quantizer 344, 444 quantizes the transform domain signal, e.g. the DCT coefficients, to form quantized coefficients.
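A minimal sketch of the transform and quantization steps named above, using an explicit orthonormal 8x8 DCT-II matrix and a single scalar quantization step; a real encoder would use standard-specific transforms and quantization matrices.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (the DCT named in the text)."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def transform_and_quantize(block, qp_step=10.0):
    """Forward 2D DCT of a prediction-error block followed by uniform
    quantization; the scalar step size stands in for a quantization matrix."""
    d = dct_matrix(block.shape[0])
    coeffs = d @ block @ d.T                 # separable 2D DCT
    return np.round(coeffs / qp_step).astype(np.int32)

def dequantize_and_inverse(levels, qp_step=10.0):
    """The matching decoder-side steps: rescale and inverse transform."""
    d = dct_matrix(levels.shape[0])
    coeffs = levels.astype(np.float64) * qp_step
    return d.T @ coeffs @ d                  # inverse of the orthonormal DCT
```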
  • the prediction error decoder 304, 404 receives the output from the prediction error encoder 303, 403 and performs the opposite processes of the prediction error encoder 303, 403 to produce a decoded prediction error signal 338, 438 which, when combined with the prediction representation of the image block 312, 412 at the second summing device 339, 439, produces the preliminary reconstructed image 314, 414.
  • the prediction error decoder may be considered to comprise a dequantizer 361, 461, which dequantizes the quantized coefficient values, e.g. the DCT coefficients, to reconstruct the transform signal, and an inverse transformation unit, which performs the inverse transformation to the reconstructed transform signal to produce the reconstructed block(s).
  • the prediction error decoder may also comprise a block filter which may filter the reconstructed block(s) according to further decoded information and filter parameters.
  • the entropy encoder 330, 430 receives the output of the prediction error encoder 303, 403 and may perform a suitable entropy encoding/variable length encoding on the signal to provide error detection and correction capability.
  • the outputs of the entropy encoders 330, 430 may be inserted into a bitstream e.g. by a multiplexer 508.
  • Figure 7b shows a block diagram of a video decoder suitable for employing embodiments of the invention.
  • Figure 7b depicts a structure of a two-layer decoder, but it would be appreciated that the decoding operations may similarly be employed in a single- layer decoder.
  • the video decoder 550 comprises a first decoder section 552 for base layer pictures and a second decoder section 554 for enhancement layer pictures.
  • Block 556 illustrates a demultiplexer for delivering information regarding base layer pictures to the first decoder section 552 and for delivering information regarding enhancement layer pictures to the second decoder section 554.
  • Reference P'n stands for a predicted representation of an image block.
  • Reference D'n stands for a reconstructed prediction error signal.
  • Blocks 704, 804 illustrate preliminary reconstructed images (I'n).
  • Reference R'n stands for a final reconstructed image.
  • Blocks 703, 803 illustrate inverse transform (T-1).
  • Blocks 702, 802 illustrate inverse quantization (Q-1).
  • Blocks 700, 800 illustrate entropy decoding (E-1).
  • Blocks 706, 806 illustrate a reference frame memory (RFM).
  • Blocks 707, 807 illustrate prediction (P) (either inter prediction or intra prediction).
  • Blocks 708, 808 illustrate filtering (F).
  • Blocks 709, 809 may be used to combine decoded prediction error information with predicted base or enhancement layer pictures to obtain the preliminary reconstructed images (I'n).
  • Preliminary reconstructed and filtered base layer pictures may be output 710 from the first decoder section 552 and preliminary reconstructed and filtered enhancement layer pictures may be output 810 from the second decoder section 554.
  • the decoder could be interpreted to cover any operational unit capable to carry out the decoding operations, such as a player, a receiver, a gateway, a demultiplexer and/or a decoder.
  • the decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (inverse operation of the prediction error coding recovering the quantized prediction error signal in spatial pixel domain). After applying prediction and prediction error decoding means the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame.
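A compact sketch of the summing step described above, with an assumed 8-bit sample range and clipping; loop filtering would follow this stage.

```python
import numpy as np

def reconstruct_block(prediction, decoded_error, bit_depth=8):
    """Decoder-side reconstruction: the predicted block and the decoded
    prediction-error block are summed sample-wise and clipped to the valid
    sample range before optional filtering."""
    max_val = (1 << bit_depth) - 1
    return np.clip(prediction.astype(np.int32) + decoded_error.astype(np.int32),
                   0, max_val).astype(np.uint8)
```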
  • the decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
  • Figures 1a and 1b illustrate an example of a camera having multiple lenses and imaging sensors, but also other types of cameras may be used to capture wide view images and/or wide view video.
  • the camera 100 of Figure 1a comprises two or more camera units 102 and is capable of capturing wide view images and/or wide view video.
  • the number of camera units 102 is eight, but may also be less than eight or more than eight.
  • Each camera unit 102 is located at a different location in the multi-camera system and may have a different orientation with respect to other camera units 102.
  • the camera units 102 may have an omnidirectional constellation so that the camera has a 360 degree viewing angle in 3D space. In other words, such a camera 100 may be able to see each direction of a scene so that each spot of the scene around the camera 100 can be viewed by at least one camera unit 102.
  • the camera 100 of Figure 1a may also comprise a processor 104 for controlling the operations of the camera 100.
  • a memory 106 for storing data and computer code to be executed by the processor 104, and a transceiver 108 for communicating with, for example, a communication network and/or other devices in a wireless and/or wired manner.
  • the camera 100 may further comprise a user interface (UI) 110 for displaying information to the user, for generating audible signals and/or for receiving user input.
  • the camera 100 need not comprise each feature mentioned above, or may comprise other features as well.
  • Figure 1a also illustrates some operational elements which may be implemented, for example, as computer code in the software of the processor 104, in hardware, or both.
  • a focus control element 114 may perform operations related to adjustment of the optical system of a camera unit or units to obtain focus meeting target specifications or some other predetermined criteria.
  • An optics adjustment element 116 may perform movements of the optical system or one or more parts of it according to instructions provided by the focus control element 114. It should be noted here that the actual adjustment of the optical system need not be performed by the apparatus, but it may be performed manually, wherein the focus control element 114 may provide information for the user interface 110 to indicate to a user of the device how to adjust the optical system.
  • Figure 1b shows a perspective view of the camera 100 of Figure 1a.
  • In Figure 1b, seven camera units 102a-102g can be seen, but the camera 100 may comprise even more camera units which are not visible from this perspective.
  • Figure 1b also shows two microphones 112a, 112b, but the apparatus may also comprise one or more than two microphones.
  • the camera 100 may be controlled by another device (not shown), wherein the camera 100 and the other device may communicate with each other and a user may use a user interface of the other device for entering commands, parameters, etc. and the user may be provided information from the camera 100 via the user interface of the other device.
  • Figure 8 shows a schematic block diagram of an exemplary apparatus or electronic device 50 depicted in Figure 9, which may incorporate a transmitter according to an embodiment of the invention.
  • the electronic device 50 may for example be a mobile terminal or user equipment of a wireless communication system. However, it would be appreciated that embodiments of the invention may be implemented within any electronic device or apparatus which may require transmission of radio frequency signals.
  • the apparatus 50 may comprise a housing 30 for incorporating and protecting the device.
  • the apparatus 50 further may comprise a display 32 in the form of a liquid crystal display.
  • the display may be any suitable display technology suitable to display an image or video.
  • the apparatus 50 may further comprise a keypad 34.
  • any suitable data or user interface mechanism may be employed.
  • the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
  • the term battery discussed in connection with the embodiments may also be one of these mobile energy devices.
  • the apparatus 50 may comprise a combination of different kinds of energy devices, for example a rechargeable battery and a solar cell.
  • the apparatus may further comprise an infrared port
  • the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/FireWire wired connection.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50.
  • the controller 56 may be connected to memory 58 which in embodiments of the invention may store both data and/or may also store instructions for implementation on the controller 56.
  • the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of audio and/or video data or assisting in coding and decoding carried out by the controller 56.
  • the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a universal integrated circuit card (UICC) reader and a universal integrated circuit card for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
  • UICC universal integrated circuit card
  • the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network.
  • the apparatus 50 may further comprise an antenna 60 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
  • the apparatus 50 comprises a camera
  • the system 10 comprises multiple communication devices which can communicate through one or more networks.
  • the system 10 may comprise any combination of wired and/or wireless networks including, but not limited to a wireless cellular telephone network (such as a global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), long term evolution (LTE) based network, code division multiple access (CDMA) network etc.), a wireless local area network (WLAN) such as defined by any of the IEEE 802.x standards, a Bluetooth personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and the Internet.
  • GSM global systems for mobile communications
  • UMTS universal mobile telecommunications system
  • LTE long term evolution
  • CDMA code division multiple access
  • the system shown in Figure 10 shows a mobile telephone network 11 and a representation of the internet 28.
  • Connectivity to the internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.
  • the example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22, a tablet computer.
  • the apparatus 50 may be stationary or mobile when carried by an individual who is moving.
  • the apparatus 50 may also be located in a mode of transport including, but not limited to, a car, a truck, a taxi, a bus, a train, a boat, an airplane, a bicycle, a motorcycle or any similar suitable mode of transport.
  • Some or further apparatus may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24.
  • the base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the internet 28.
  • the system may include additional communication devices and communication devices of various types.
  • the communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), time divisional multiple access (TDMA), frequency division multiple access
  • CDMA code division multiple access
  • GSM global systems for mobile communications
  • UMTS universal mobile telecommunications system
  • TDMA time divisional multiple access
  • frequency division multiple access
  • a communications device involved in implementing various embodiments of the present invention may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and any suitable connection.
  • embodiments of the invention operating within a wireless communication device
  • the invention as described above may be implemented as a part of any apparatus comprising a circuitry in which radio frequency signals are transmitted and received.
  • embodiments of the invention may be implemented in a mobile phone, in a base station, in a computer such as a desktop computer or a tablet computer comprising radio frequency communication means (e.g. wireless local area network, cellular radio, etc.).
  • radio frequency communication means e.g. wireless local area network, cellular radio, etc.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

There are disclosed various methods, apparatuses and computer program products for volumetric video encoding and decoding. In some embodiments, one or more point clouds representing volumetric video data are obtained, wherein the one or more point clouds comprise at least a first attribute and a second attribute. The first attribute of the one or more point clouds is examined to detect changes in the point clouds. If a change is detected, data which corresponds to the changed part of the first attribute is located in the second attribute. That part of the second attribute data which corresponds to the changed part of the first attribute is encoded.

Description

AN APPARATUS, A METHOD AND A COMPUTER PROGRAM FOR CODING
VOLUMETRIC VIDEO
TECHNICAL FIELD
[0001] The present invention relates to an apparatus, a method and a computer program for encoding and decoding of volumetric video.
BACKGROUND
[0002] This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
[0003] A video coding system may comprise an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. The encoder may discard some information in the original video sequence in order to represent the video in a more compact form, for example, to enable the storage/transmission of the video information at a lower bitrate than otherwise might be needed.
[0004] Volumetric video data represents a three-dimensional scene or object and can be used as input for virtual reality (VR), augmented reality (AR) and mixed reality (MR) applications. Such data describes the geometry attribute, e.g. shape, size, position in three-dimensional (3D) space, and other respective attributes, e.g. colour, opacity, reflectance and any possible temporal changes of the geometry attribute and other attributes at given time instances, comparable to frames in two-dimensional (2D) video. Volumetric video is either generated from 3D models through computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible.
[0005] Typical representation formats for such volumetric data are triangle meshes, point clouds (PCs), or voxel arrays. Temporal information about the scene can be included in the form of individual capture instances, i.e. "frames" in 2D video, or other means, e.g. position of an object as a function of time.
SUMMARY
[0006] Some embodiments provide a method for encoding and decoding video
information. In some embodiments of the present invention there is provided a method, apparatus and computer program product for volumetric video coding as well as decoding.
[0007] Various aspects of examples of the invention are provided in the detailed
description.
[0008] According to a first aspect, there is provided a method comprising:
obtaining one or more volumetric video data representations, wherein the one or more volumetric video data representations comprise at least a first attribute and a second attribute;
examining the first attribute of the one or more volumetric video data representations to detect changes in the one or more volumetric video data
representations;
if a change is detected, locating in the second attribute data which corresponds to the changed part of the first attribute; and
encoding that part of the second attribute data which corresponds to the changed part of the first attribute.
[0009] An apparatus according to a second aspect comprises at least one processor and at least one memory, said at least one memory stored with code thereon, which when executed by said at least one processor, causes the apparatus to perform at least:
obtain one or more volumetric video data representations, wherein the one or more volumetric video data representations comprise at least a first attribute and a second attribute;
examine the first attribute of the one or more volumetric video data
representations to detect changes in the one or more volumetric video data
representations;
locate in the second attribute data which corresponds to the changed part of the first attribute, if a change is detected; and
encode that part of the second attribute data which corresponds to the changed part of the first attribute.
[0010] A computer readable storage medium according to a third aspect comprises code for use by an apparatus, which when executed by a processor, causes the apparatus to perform:
obtain one or more volumetric video data representations, wherein the one or more volumetric video data representations comprise at least a first attribute and a second attribute;
examine the first attribute of the one or more volumetric video data
representations to detect changes in the one or more volumetric video data
representations;
locate in the second attribute data which corresponds to the changed part of the first attribute, if a change is detected; and
encode that part of the second attribute data which corresponds to the changed part of the first attribute.
[0011] An apparatus according to a fourth aspect comprises:
means for obtaining one or more volumetric video data representations, wherein the one or more volumetric video data representations comprise at least a first attribute and a second attribute;
means for examining the first attribute of the one or more volumetric video data representations to detect changes in the one or more volumetric video data
representations;
means for locating in the second attribute data which corresponds to the changed part of the first attribute, if a change is detected; and
means for encoding that part of the second attribute data which corresponds to the changed part of the first attribute.
[0012] According to a fifth aspect, there is provided a method comprising:
receiving an encoded volumetric video data presentation comprising at least a first attribute and a second attribute;
decoding the volumetric video data presentation;
comparing a first attribute of the volumetric video data presentation with a corresponding first attribute of a previously decoded volumetric video data presentation, if the encoded volumetric video data presentation comprises inter prediction data, or
comparing a first attribute of two or more parts of the volumetric video data presentation, if the encoded volumetric video data presentation comprises intra prediction data;
using the comparison result to determine whether there are changes between the first attribute compared volumetric video data presentations; and reconstructing a second attribute of the volumetric video data presentation by using information of the changes.
[0013] An apparatus according to a sixth aspect comprises at least one processor and at least one memory, said at least one memory stored with code thereon, which when executed by said at least one processor, causes the apparatus to perform at least:
receive an encoded volumetric video data presentation comprising at least a first attribute and a second attribute;
decode the volumetric video data presentation;
compare a first attribute of the volumetric video data presentation with a corresponding first attribute of a previously decoded volumetric video data presentation, if the encoded volumetric video data presentation comprises inter prediction data, or
compare a first attribute of two or more parts of the volumetric video data presentation, if the encoded volumetric video data presentation comprises intra prediction data;
use the comparison result to determine whether there are changes between the first attribute compared volumetric video data presentations; and
reconstruct a second attribute of the volumetric video data presentation by using information of the changes.
[0014] A computer readable storage medium according to a seventh aspect comprises code for use by an apparatus, which when executed by a processor, causes the apparatus to perform:
receive an encoded volumetric video data presentation comprising at least a first attribute and a second attribute;
decode the volumetric video data presentation;
compare a first attribute of the volumetric video data presentation with a corresponding first attribute of a previously decoded volumetric video data presentation, if the encoded volumetric video data presentation comprises inter prediction data, or
compare a first attribute of two or more parts of the volumetric video data presentation, if the encoded volumetric video data presentation comprises intra prediction data;
use the comparison result to determine whether there are changes between the first attribute compared volumetric video data presentations; and
reconstruct a second attribute of the volumetric video data presentation by using information of the changes.
[0015] An apparatus according to an eighth aspect comprises:
means for receiving an encoded volumetric video data presentation comprising at least a first attribute and a second attribute;
means for decoding the volumetric video data presentation;
means for comparing a first attribute of the volumetric video data presentation with a corresponding first attribute of a previously decoded volumetric video data presentation, if the encoded volumetric video data presentation comprises inter prediction data, or
means for comparing a first attribute of two or more parts of the volumetric video data presentation, if the encoded volumetric video data presentation comprises intra prediction data;
means for using the comparison result to determine whether there are changes between the first attribute compared volumetric video data presentations; and
means for reconstructing a second attribute of the volumetric video data presentation by using information of the changes.
[0016] Further aspects include at least apparatuses and computer program
products/code stored on a non-transitory memory medium arranged to carry out the above methods.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] For a more complete understanding of example embodiments of the present invention, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
[0018] Figure 1a shows an example of a multi-camera system as a simplified block diagram, in accordance with an embodiment;
[0019] Figure 1b shows a perspective view of a multi-camera system, in accordance with an embodiment;
[0020] Figure 2a illustrates an example of a representation of geometry attributes as a voxel octree, in accordance with an embodiment;
[0021] Figure 2b illustrates an example of a representation of colour attributes as a two-dimensional texture map, in accordance with an embodiment;
[0022] Figures 2c and 2d illustrate an example of a relationship between the geometry attributes and voxel colour attributes, in accordance with an embodiment;
[0023] Figures 3a and 3b illustrate some examples of changes in voxel octree and corresponding changes in two-dimensional texture map;
[0024] Figure 4a depicts as a simplified block diagram an apparatus for predicting and encoding voxel clouds, in accordance with an embodiment;
[0025] Figure 4b depicts as a simplified block diagram an apparatus for decoding
encoded voxel cloud information, in accordance with an embodiment;
[0026] Figure 5a shows a flow chart of an encoding method, in accordance with an embodiment;
[0027] Figure 5b shows a flow chart of an encoding method, in accordance with
another embodiment;
[0028] Figure 5c shows a flow chart of a decoding method, in accordance with an
embodiment;
[0029] Figure 6 illustrates an example of a volumetric video pipeline;
[0030] Figure 7a shows a schematic diagram of an encoder suitable for implementing embodiments of the invention;
[0031] Figure 7b shows a schematic diagram of a decoder suitable for implementing embodiments of the invention;
[0032] Figure 8 shows schematically an electronic device employing embodiments of the invention;
[0033] Figure 9 shows schematically a user equipment suitable for employing
embodiments of the invention; and
[0034] Figure 10 further shows schematically electronic devices employing
embodiments of the invention connected using wireless and wired network
connections.
DETAILED DESCRIPTION OF SOME EXAMPLE EMBODIMENTS
[0035] In the following, several embodiments of the invention will be described in the context of one volumetric video coding arrangement. It is to be noted, however, that the invention is not limited to this particular arrangement. For example, the invention may be applicable to video coding systems like streaming systems, DVD players, digital television receivers, personal video recorders, systems and computer programs on personal computers, handheld computers and communication devices, as well as network elements such as transcoders and cloud computing arrangements where video data is handled.
[0036] "Voxel" of a three-dimensional world corresponds to a pixel of a two- dimensional world. Voxels exist in a three-dimensional grid layout. An octree is a tree data structure used to partition a three-dimensional space. Octrees are the three-dimensional analogue of quadtrees. A sparse voxel octree (SVO) describes a volume of a space containing a set of solid voxels of varying sizes. Empty areas within the volume are absent from the tree, which is why it is called "sparse".
[0037] A three-dimensional volumetric representation of a scene is determined as a plurality of voxels on the basis of input streams of at least one multicamera device. Thus, at least one but preferably a plurality (i.e. 2, 3, 4, 5 or more) of multicamera devices are used to capture 3D video representation of a scene. The multicamera devices are distributed in different locations in respect to the scene, and therefore each multicamera device captures a different 3D video representation of the scene. The 3D video
representations captured by each multicamera device may be used as input streams for creating a 3D volumetric representation of the scene, said 3D volumetric representation comprising a plurality of voxels. Voxels may be formed from the captured 3D points e.g. by merging the 3D points into voxels comprising a plurality of 3D points such that for a selected 3D point, all neighboring 3D points within a predefined threshold from the selected 3D point are merged into a voxel without exceeding a maximum number of 3D points in a voxel.
[0038] Voxels may also be formed through the construction of the sparse voxel octree. Each leaf of such a tree represents a solid voxel in world space; the root node of the tree represents the bounds of the world. The sparse voxel octree construction may have the following steps: 1) map each input depth map to a world space point cloud, where each pixel of the depth map is mapped to one or more 3D points; 2) determine voxel attributes such as colour and surface normal vector by examining the neighborhood of the source pixel(s) in the camera images and the depth map; 3) determine the size of the voxel based on the depth value from the depth map and the resolution of the depth map; 4) determine the SVO level for the solid voxel as a function of its size relative to the world bounds; 5) determine the voxel coordinates on that level relative to the world bounds; 6) create new and/or traverse existing SVO nodes until arriving at the determined voxel coordinates; 7) insert the solid voxel as a leaf of the tree, possibly replacing or merging attributes from a previously existing voxel at those coordinates. Nevertheless, the sizes of voxels within the 3D volumetric representation of the scene may differ from each other. The voxels of the 3D volumetric representation thus represent the spatial locations within the scene.
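A simplified sketch of steps 4-7 of the above construction is given below, reusing the SvoNode class from the sketch above; the helper names, the default world bounds and the clamping to a maximum level are assumptions made for illustration only.
```python
import math

def level_for_voxel_size(voxel_size: float, world_size: float, max_level: int) -> int:
    # Step 4: every subdivision halves the node size, so the level is roughly
    # log2(world_size / voxel_size), clamped to the available tree depth.
    level = int(round(math.log2(world_size / voxel_size)))
    return max(0, min(level, max_level))

def insert_voxel(root: "SvoNode", point, colour, voxel_size: float,
                 world_min=(0.0, 0.0, 0.0), world_size: float = 1.0,
                 max_level: int = 10) -> "SvoNode":
    """Steps 5-7: descend from the root towards the voxel coordinates,
    creating missing nodes on the way, and store the attributes in the leaf."""
    level = level_for_voxel_size(voxel_size, world_size, max_level)
    node, size, origin = root, world_size, list(world_min)
    for _ in range(level):
        size /= 2.0
        index = 0
        for axis in range(3):                       # octant of the point inside the node
            if point[axis] >= origin[axis] + size:
                index |= 1 << axis
                origin[axis] += size
        if node.children[index] is None:
            node.children[index] = SvoNode()
        node = node.children[index]
    node.colour = colour                            # replace/merge attributes of an existing voxel
    return node
```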
[0039] A volumetric video frame is a complete sparse voxel octree that models the world at a specific point in time in a video sequence. Voxel attributes contain information like colour, opacity, surface normal vectors, and surface material properties. These are referenced in the sparse voxel octrees (e.g. colour of a solid voxel), but can also be stored separately.
[0040] Point clouds are commonly used data structures for storing volumetric content. Compared to point clouds, sparse voxel octrees describe a recursive subdivision of a finite volume with solid voxels of varying sizes, while point clouds describe an unorganized set of separate points limited only by the precision of the used coordinate values.
[0041] In a point cloud based scene model or object model, points may be represented with any floating point coordinates. A quantized point cloud may be used to reduce the amount of data, whereby the coordinate values of the point cloud are represented e.g. with 10-bit, 12-bit or 16-bit integers. Integers may be used because hardware accelerators may be able to operate on integers more efficiently. The points in the point cloud may have associated colour, reflectance, opacity and/or other texture values. The points in the point cloud may also have a size, or a size may be the same for all points. The size of the points may be understood as indicating how large an object the point appears to be in the model in the projection. The point cloud may be projected by ray casting from the projection surface to find out the pixel values of the projection surface. In such a manner, the topmost point remains visible in the projection, while points closer to the center of the projection surface may be occluded.
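The following sketch illustrates such coordinate quantization with a selectable bit depth; the function names and the use of NumPy are illustrative assumptions, not part of the described embodiments.
```python
import numpy as np

def quantize_points(points: np.ndarray, bits: int = 10):
    """Map floating point coordinates to unsigned integers of the given
    bit depth (e.g. 10, 12 or 16 bits per axis)."""
    p_min = points.min(axis=0)
    extent = np.maximum(points.max(axis=0) - p_min, 1e-12)
    scale = (2 ** bits - 1) / extent
    quantized = np.round((points - p_min) * scale).astype(np.uint32)
    return quantized, p_min, scale

def dequantize_points(quantized: np.ndarray, p_min, scale) -> np.ndarray:
    """Approximate inverse of quantize_points; precision is limited by the bit depth."""
    return quantized.astype(np.float64) / scale + p_min
```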
[0042] Voxel coordinates uniquely identify an individual node or solid voxel within the octree. The coordinates are not stored in the SVO itself but instead describe the location and size of the node/voxel. The coordinates have four integer components: level, X, Y, and Z. The level component specifies the subdivision level (level zero being the root node), with each subsequent level subdividing the node in eight equal-sized segments along the X, Y, and Z axes. For example, level 1 comprises eight nodes and level 2 has 64 (= 8 x 8) nodes.
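A small sketch of this coordinate convention is given below; it computes the number of nodes on a level and the location and size implied by a (level, X, Y, Z) tuple, assuming a unit world bounding box by default.
```python
def nodes_at_level(level: int) -> int:
    # Level 0 is the root; every level subdivides each node into eight
    # children, so level 1 has 8 nodes and level 2 has 8 x 8 = 64 nodes.
    return 8 ** level

def voxel_bounds(level: int, x: int, y: int, z: int,
                 world_min=(0.0, 0.0, 0.0), world_size: float = 1.0):
    """Derive the location and size of the node identified by the
    coordinates (level, X, Y, Z) relative to the world bounds."""
    size = world_size / (2 ** level)
    origin = tuple(world_min[i] + (x, y, z)[i] * size for i in range(3))
    return origin, size
```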
[0043] When encoding a volumetric video, each frame may produce several hundred megabytes or several gigabytes of voxel data which needs to be converted to a format that can be streamed to the viewer, and rendered in real-time. The amount of data depends on the world complexity and the number of cameras. The larger impact comes in a multi-device recording setup with a number of separate locations where the cameras are recording. Such a setup produces more information than a camera at a single location.
[0044] Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is to code it as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.
[0045] In the cases of dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between endpoints, then efficient compression may become important.
[0046] The octree data structure is used extensively to encode the geometry attributes induced by the point cloud. Each node in the octree is a point/voxel. The root voxel is the point-cloud-aligned bounding box. Each voxel is recursively subdivided into eight child voxels. Only non-empty voxels continue to be subdivided. The position of each voxel is represented by its centre point. Each level in the octree is called a Level of Detail (LOD). At each LOD, the voxel's attributes (e.g. colour, normal, reflectance) are set to the average of the respective attributes of all the enclosed points. The octree structure is typically serialised using occupancy coding. In this serialisation method each node, starting from the root node, is coded as an eight-bit mask of occupied children, where a one in the mask means that the child spatially contains points of the point cloud and a zero indicates that there are no points in that spatial region of the octree-subdivided space.
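The following sketch illustrates such occupancy-code serialisation in breadth-first order, reusing the SvoNode class introduced above; it is a simplified illustration rather than the exact serialisation of any particular codec.
```python
from collections import deque

def serialise_occupancy(root: "SvoNode") -> bytes:
    """Breadth-first occupancy coding: one eight-bit mask per traversed node,
    where bit i is one when child i contains points of the point cloud and
    zero when that spatial region is empty."""
    masks = bytearray()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        mask = 0
        for i, child in enumerate(node.children):
            if child is not None:
                mask |= 1 << i
                if not child.is_leaf():
                    queue.append(child)     # leaves terminate the recursion
        masks.append(mask)
    return bytes(masks)
```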
[0047] In many cases the coding of voxels and their attributes is separated. This is because, while octree based voxel coding is efficient in the compression of geometry attributes, when it is used for coding geometry attributes combined with other attributes, the opportunities to use the specific statistics of each attribute for compression diminish. For example, the geometry attribute is coded using occupancy codes based on an octree data structure, while the colour attributes are mapped onto a 2D sampling grid and then coded using, for example, a standard image or video encoding method. Figure 2a illustrates an example of a representation of geometry attributes as a voxel octree 200. Every node 202 has eight children and a dark node in this figure represents a leaf node (i.e. a child node that has no children) with a respective colour presentation. In Figure 2a only the leaf nodes which are dark have a colour representation; other leaf nodes and branch nodes (the nodes which have children) do not have a colour presentation. This means that leaf nodes, which do not have any children, may or may not have a colour representation, depending on the volumetric content being presented by the voxel structure. Each leaf can be on a different level. The levels are determined by the distance from the top node, i.e. the closer the leaf node is to the top node, the higher its level. The higher the level of the leaf node, the larger the spatial area it represents in the volumetric representation. The lowest level of leaf nodes represents the smallest spatial presentation in the current voxel presentation.
[0048] Figure 2b illustrates an example of a representation of voxel colours as a two-dimensional texture map 204. Each cell in Figure 2b represents the colour information of a respective voxel. The respective voxels are as shown in Figure 2a with dark leaf nodes (some of them are labelled with the reference numeral 202). In the example of Figure 2b the letters R, G and B in the cells 206 illustrate colour parameters of the voxels. The location of the cell 206 in the two-dimensional texture map 204 may indicate the voxel whose parameters are stored in the cell, or there may be another way to assign the parameter cells to the voxels of the octree. One conventional method is to order the two-dimensional texture map 204 values in accordance with the voxel octree structure: starting from the top level and going to the next level, and within each level starting from the left, whenever a leaf node has colour values, they are presented in the two-dimensional texture map 204. In this method, a correspondence between the two-dimensional texture map 204 and the voxel structure is created. All cells in the two-dimensional texture map 204 have a respective leaf node in the voxel presentation of Figure 2a. Depending on the level of the leaf node in Figure 2a, the number of cells assigned to that leaf node may vary, meaning that the higher the level, the greater the number of cells that should be assigned to present the colour values of that voxel, and the lower the leaf node level, the lower the number of cells that should be assigned to present the colour values of that voxel.
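One possible reading order and cell-allocation rule of this kind is sketched below, reusing the SvoNode class from above; the specific allocation of 4 ** (max_depth - depth) cells per leaf is an assumption made for illustration and is only one possible packing choice.
```python
from collections import deque

def coloured_leaves_in_reading_order(root: "SvoNode"):
    """Enumerate coloured leaf nodes level by level, starting from the
    nodes closest to the root, and from left to right within each level."""
    order = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if node.is_leaf():
            if node.colour is not None:
                order.append((depth, node))
        else:
            for child in node.children:             # left-to-right child order
                if child is not None:
                    queue.append((child, depth + 1))
    return order

def build_texture_cells(root: "SvoNode", max_depth: int):
    """Assign colour cells to the coloured leaves in reading order; leaves
    closer to the root (larger voxels) receive more cells."""
    cells = []
    for depth, node in coloured_leaves_in_reading_order(root):
        cells.extend([node.colour] * (4 ** (max_depth - depth)))
    return cells
```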
[0049] It should be noted that RGB is only one representation of the colour information; this information can be characterized with any other type of colour representation, e.g. YUV.
[0050] Each point in a 3D point cloud is characterized at least by two attributes:
geometry and colour (texture). There may also be other attributes such as normal, reflectance and material.
[0051 ] In this application, the geometry attribute is presented by the voxel octree 200 and the colour (texture) attribute is presented by the two-dimensional texture map 204.
[0052] In accordance with an embodiment, each attribute is coded independently.
Referencing mechanisms may then be used to relate the different attributes within a static point cloud. There may be a significant cross correlation between attributes within a point cloud. By considering these kinds of inter-attribute correlations increased compression efficiency may be achieved when coding static and dynamic point clouds.
[0053] In accordance with an approach, the existing correlation between two point clouds captured at two adjacent time instances is used when compressing dynamic point clouds. These correlations may exist between two available attributes, i.e. the geometry attribute and the colour attribute (respective colour information of each voxel). Since both sets of attributes represent the same scene at the same time stamp, the changes which happen between a first time instance T0 and a second time instance T1 in the first attribute have similarities to the changes that happen to the second attribute between the first time instance T0 and the second time instance T1. This change may be categorized into four different types: 1) some part of the scene disappears, 2) some part of the scene moves, 3) some part appears in the scene, 4) some part of the scene changes formation.
[0054] In a similar fashion, in accordance with another approach, existing correlation between neighboring sub-parts of the same point cloud is used similar to intra prediction in video coding, when compressing dynamic point clouds. Again, these correlations may exist between two available attributes i.e. geometry attribute and colour attribute
(respective colour information of each voxel). However, instead of representing the same part of a scene at different time instances, as in the previous case, the sub-parts represent neighboring parts of a scene at the same time instance. As 3D scenes typically consist of several larger, consecutive objects, the changes that happen from sub-part N0 to neighboring sub-part N1 in the first attribute have similarities to the changes that happen to the second attribute between the sub-part N0 and the neighboring sub-part N1.
[0055] In other words, changes in one attribute are detected and used as a prediction base for the other attribute. Since both attributes are not presented in a similar manner, a mapping for the prediction change from one attribute to another attribute is taken into account.
[0056] Figure 6 illustrates an example of a volumetric video pipeline. In the process, multiple cameras 715 capture video data of the world, which video data is input 720 to the pipeline. The video data comprises image frames, positions and depth maps 730 which are transmitted to the Voxel Encoding 740.
[0057] During the "Video Sequencing" stage of the Voxel Encoding 740, the input video material has been divided into shorter sequences of volumetric frames. The encoder is configured to produce a voxel octree for the sequence's volumetric frames at different time instances, and the volumetric frame currently being encoded.
[0058] The outcome of the Voxel Encoding 740 is a SVOX (Sparse VOXel) file 750, which is transmitted for playback 760. The SVOX file 750 is streamed 770, which creates stream packets 780. For these stream packets 780 a voxel rendering 790 is applied which provides viewer state (e.g. current time, view frustum) 795 to the streaming 770.
[0059] In the following, some details of the change detection approaches will be described in more detail. These approaches may be implemented, for example, at the "Change Detection" stage of Figure 6.
[0060] The first approach utilizes inter-prediction for example as follows. The following description uses the terms a first point cloud and a second point cloud, but generally they represent volumetric presentations. Hence, the first point cloud can also be regarded as a first volumetric presentation and the second point cloud can also be regarded as a second volumetric presentation.
[0061] When considering dynamic point clouds, there are significant correlations between two point clouds captured at two adjacent time instances. These correlations occur both within attributes, where the term attributes here refers to either geometry attribute or other attributes of voxel octrees, and across attributes such as a change in geometry attribute inducing a change in the colour attribute representation across time steps. This correlation across attributes between two adjacent time-steps of a voxel octree can be used to predict and encode the changes between attributes in that time step.
[0062] Figure 4a depicts as a simplified block diagram an apparatus for predicting and encoding voxel clouds, in accordance with an embodiment, and Figure 5a depicts as a flow diagram a method for predicting and encoding voxel clouds, in accordance with an embodiment. Figure 5b depicts as a flow diagram a method for predicting and encoding voxel clouds, in accordance with another embodiment. In the embodiment of Figure 5a changes in point clouds are examined whereas in the embodiment of Figure 5b changes in two-dimensional texture maps are examined.
[0063] As an example, a voxel cloud coded using two attributes, the geometry and colour, is examined. A prediction element 220 receives a first voxel cloud at a first time instant and stores it to a memory 222. The prediction element 220 also receives a second voxel cloud at a second time instant and stores it to the memory 222. Then, changes in geometry attributes between voxel clouds of adjacent time-steps can be characterised as follows. If there are no changes in geometry attribute between two adjacent time steps, the voxel clouds can be considered identical. If, however, there are one or more new voxels apart from those that were present in the previous time-step, the change in geometry attribute can be regarded as addition. Correspondingly, if some voxels that were there in the voxel cloud of the previous time step are no longer there in the current time-step, the change in geometry attribute can be regarded as subtraction. In a situation where some voxels have moved from a location in the previous time step to a neighboring location in the current time-step, the change in geometry attribute can be regarded as movement. It should be noted here that changes in the geometry attribute between two adjacent time steps may include one or more of the above mentioned options.
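A minimal sketch of this classification, operating on sets of occupied voxel coordinates, is given below; the neighbourhood test used to label movement is a simplifying assumption of the sketch, not a rule taken from the described embodiments.
```python
def classify_geometry_changes(previous: set, current: set) -> dict:
    """Compare the occupied voxel coordinates of two adjacent time steps
    (sets of (x, y, z) tuples) and classify the difference as identical,
    addition, subtraction and/or movement."""
    added = current - previous
    removed = previous - current
    if not added and not removed:
        return {"identical": True}
    changes = {}
    if added:
        changes["addition"] = added
    if removed:
        changes["subtraction"] = removed
    # Treating a removed voxel that has an added voxel in its 26-neighbourhood
    # as having moved is a simplifying assumption.
    moved = {v for v in removed
             if any((v[0] + dx, v[1] + dy, v[2] + dz) in added
                    for dx in (-1, 0, 1)
                    for dy in (-1, 0, 1)
                    for dz in (-1, 0, 1)
                    if (dx, dy, dz) != (0, 0, 0))}
    if moved:
        changes["movement"] = moved
    return changes
```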
[0064] When detecting changes in the voxel clouds, a comparison element 224 may obtain the first voxel cloud and the second voxel cloud from the memory 222 (block 601 in Figure 5a) and compares 602 them. If the comparison element 224 detects 603 differences between the first voxel cloud and the second voxel cloud, the comparison element 224 determines 604 the type of the change(s), i.e. addition, subtraction and/or movement. The comparison element 224 may then store information of the change(s) in the memory 222. This information identifies the location of the change(s) so that an attributes encoding element 226 may use that information to determine 605 where there are corresponding changes in the two-dimensional texture map. Hence, only information of the changed part of the two-dimensional texture map 204 need be encoded 606.
[0065] Given a Level of Detail, the voxelized colour attributes of a point cloud are mapped to an N-dimensional array. There is typically some form of referencing between the serialised geometry attribute data structure and the voxelized colour attribute data structure. If the difference in the geometry attribute from one time step to another time step is encoded in some way (e.g. XOR coding), then identifying those regions where the voxelized colour attribute data structures may change can be done in a fairly straightforward manner.
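The following sketch illustrates such XOR coding of two serialised occupancy mask streams, assuming for simplicity that the two streams have equal length and an aligned node order; those assumptions, and the function name, are illustrative only.
```python
def xor_occupancy_masks(previous: bytes, current: bytes):
    """XOR two occupancy mask streams; a non-zero byte marks a node whose
    set of occupied children changed, i.e. a region whose mapped colour
    cells may need to be re-encoded."""
    diff = bytes(a ^ b for a, b in zip(previous, current))
    changed_nodes = [i for i, d in enumerate(diff) if d]
    return diff, changed_nodes
```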
[0066] In the example of Figure 3a the voxel cloud 200 on the left illustrates a part of a voxel cloud at a first (previous) time instance and the voxel cloud 200 on the right illustrates a corresponding part of the voxel cloud at a second (successive) time instance. Correspondingly, the two-dimensional texture map 204 on the left illustrates a part of the two-dimensional texture map of the voxel cloud at the first time instant and the two-dimensional texture map 204 on the right illustrates a part of the two-dimensional texture map of the voxel cloud at the second time instant. In this example the second voxel cloud has new voxels which did not exist in the previous voxel cloud. These voxels are surrounded by a dashed triangle 208 in Figure 3a. The texture parameters surrounded by a dashed rectangle 210 in the two-dimensional texture map on the right depict the texture parameters of the added voxels. Letters A, B, ... I in connection with the voxel clouds and the two-dimensional texture maps of Figures 2d, 3a and 3b are only shown for clarifying the relationship between elements in the voxel clouds and the two-dimensional texture maps.
[0067] The example of Figure 3b illustrates movement of voxels. Also in this Figure, the voxel cloud 200 on the left illustrates a voxel cloud at a first (previous) time instant and the voxel cloud 200 on the right illustrates a voxel cloud at a second (a successive) time instant; the two-dimensional texture map 204 on the left illustrates the two-dimensional texture map of the voxel cloud at the first time instant and the two-dimensional texture map 204 on the right illustrates the two-dimensional texture map of the voxel cloud at the second time instant. In this example, some voxels of the first voxel cloud (depicted with the dashed triangle 208) have moved to a new location (depicted with the arrow). There is also a corresponding change in the two-dimensional texture map 204 in which texture parameters surrounded by a dashed rectangle 212 in the two-dimensional texture map on the left are moved to a new location (depicted with the dashed rectangle 214 and the arrow in the two-dimensional texture map 204 on the right).
[0068] The above procedure may also be performed so that changes in the other attribute data are examined, for example changes between the two-dimensional texture maps of the first time instant and the second time instant. This is illustrated in Figure 5b as a flow diagram. A first two-dimensional texture map and a second two-dimensional texture map are obtained 611 and compared 612 with each other. When changes are detected 613, it is determined 614 whether the change is addition, subtraction, deformation and/or movement. The location of the change(s) in the two-dimensional texture map is determined and this location information is used to determine 615 the corresponding changes in the voxel cloud. This information of the location of the change(s) in the voxel cloud can be encoded 616 so that no information of the unchanged parts of the voxel cloud need be encoded.
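A corresponding sketch for this texture-map-first variant is given below; the cell-to-leaf lookup table is a hypothetical structure assumed to have been produced when the texture map was built from the octree, and the use of NumPy is an assumption of the sketch.
```python
import numpy as np

def changed_texture_cells(previous: np.ndarray, current: np.ndarray, threshold: int = 0):
    """Return the (row, column) indices of cells whose colour differs
    between two texture maps of identical shape (H x W x 3)."""
    difference = np.abs(current.astype(np.int32) - previous.astype(np.int32)).sum(axis=-1)
    return np.argwhere(difference > threshold)

def changed_leaves_from_cells(changed_cells, cell_to_leaf: dict):
    """Map changed cells back to leaf-node identifiers through a
    cell -> leaf lookup table built when the texture map was created."""
    return {cell_to_leaf[tuple(cell)] for cell in changed_cells}
```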
[0069] As a conclusion of the above, in the prediction step the change that has happened in one attribute is taken into account and, considering the structural nature of that attribute, it is converted into a suitable prediction for the other attribute. To do so, the structural presentation of each attribute should be known and taken into account. The conversion is based on the inherent relation between geometry attributes and colour attributes. Such a relation is known to the content provider, and during compression it is taken into account how each change in one attribute is converted/mapped to a respective change in the other attribute; hence, it is used as the basis for the prediction of the other attribute. For example, if an object enters the scene, the respective geometry attribute changes too. Such a change can be mapped to the colour attribute and, while keeping the rest of the colour table intact, only the colour values for the voxels where the geometry attribute has changed are updated. Such knowledge may come from communication (inter-attribute prediction) between the different attribute presentations of the scene and point cloud.
[0070] In one embodiment the structure of the geometry attribute is known. Moreover, the location of the leaf nodes which do have a colour representation is known. Depending on the layer in which the leaf nodes with a colour representation are located, a known number of RGB cells will be assigned to them. Hence, reading these leaf nodes from the highest level to the lowest level, and on each level from left to right, defines a reading algorithm, and the respective RGB cells will be filled according to this reading algorithm. Hence, any change in one attribute (geometry or colour) will have a respective trackable modification in the other attribute. Such a relation can be taken into account for the inter-attribute prediction.
[0071] In the following, the second aspect, which is the intra-prediction case, will be described in more detail. When considering a single point cloud, there may be significant correlations between neighboring parts of the 3D scene, i.e. consecutive 3D objects.
These correlations occur both within attributes and across attributes, such as a change in geometry attribute inducing a change in the colour attribute representation. This correlation across attributes between two adjacent sub-parts of a point cloud can be used to predict and encode the changes between attributes between sub-parts. Consider for example a voxel cloud coded using two attributes, the geometry and colour. Geometry attribute changes between voxels of adjacent sub-parts can be characterised as either one or a combination of the following. If there are no geometry attribute changes between a current sub-part and a previous sub-part, which are adjacent to each other (i.e. adjacent sub-parts), they can be concluded to be identical. If some new voxels apart from those that were present in the previous sub-part exist in the current sub-part, it can be classified as addition. Correspondingly, if some voxels that were there in the previous sub-part are no more there in the current sub-part, it can be classified as subtraction. It may also happen that some voxels have moved from a location in the previous sub-part to a neighboring location in the current sub-part, wherein this may be classified as movement.
[0072] The apparatus implementing the second approach may operate as follows. The prediction element 220 receives a voxel cloud and stores it to the memory 222. The comparison element 224 may examine the voxel cloud in sub-sections so that the comparison element 224 obtains one sub-section and another sub-section from the memory 222 and compares them. If the comparison element 224 detects differences between the two sub-sections, the comparison element 224 determines the type of the change(s), i.e. addition, subtraction and/or movement. The comparison element 224 may then store information of the change(s) in the memory 222. This information identifies the location of the change(s) so that the attributes encoding element 226 may use that information to determine where there are corresponding changes in the two-dimensional texture map. Hence, only information of the changed part of the two-dimensional texture map 204 need be encoded.
[0073] The operation described above may be repeated until all sub-sections of the voxel cloud have been examined before continuing the examination and prediction process for a next voxel cloud of the volumetric video.
[0074] The remaining steps may be carried out in a similar fashion as for the above described inter-prediction case.
[0075] Also the second approach may be implemented so that, instead of examining first the voxel clouds, changes in the other attribute data are examined. For example, changes between a sub-section of the two-dimensional texture map and another sub-section of the two-dimensional texture map may be examined, and when changes are detected, it is determined whether the change is addition, subtraction and/or movement. The location of the change(s) in the two-dimensional texture map is determined and this location information is used to determine the corresponding change(s) in the voxel cloud. This information of the location of the change(s) in the voxel cloud can be encoded so that no information of the unchanged parts of the voxel cloud need be encoded.
[0076] In the following, corresponding operations at a decoder side will be described in more detail with reference to Figure 4b, which depicts as a simplified block diagram an apparatus for decoding encoded voxel cloud information, in accordance with an embodiment, and to Figure 5c, which shows a flow chart of a decoding method, in accordance with an embodiment.
[0077] The apparatus receives 620 encoded voxel data and stores 622 it into a buffer 232. A decoding element 230 retrieves the stored information and decodes 624 the voxel clouds and two-dimensional texture maps. When the encoded information comprises inter prediction data the inter prediction element 234 may use that data together with already decoded voxel clouds and/or two-dimensional texture maps to determine 626 changes in adjacent volumetric frames. When the encoded information comprises intra prediction data the intra prediction element 236 may use that data together with already decoded sub-parts of voxel clouds and/or two-dimensional texture maps to determine 628 changes between different sub-parts of voxel clouds and/or two-dimensional texture maps in a similar fashion to the prediction element 220 at the encoder side. Information of the changes together with already decoded parts which have not changed between frames/sub-parts can be used to decode 630 the volumetric frames, which may then, for example, be stored and/or output 632 to e.g. a display.
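As an illustration of this decoder-side reconstruction of the second attribute, the following sketch starts from the previously decoded texture map and overwrites only the cells that the signalled changes point to; the container format of the decoded changes is an assumption of the sketch.
```python
import numpy as np

def reconstruct_second_attribute(previous_texture: np.ndarray, decoded_changes: dict) -> np.ndarray:
    """Rebuild the colour (second) attribute of the current frame by reusing
    the unchanged parts of the previously decoded texture map. decoded_changes
    maps a (row, column) cell index to its newly decoded colour value."""
    texture = previous_texture.copy()
    for (row, column), colour in decoded_changes.items():
        texture[row, column] = colour
    return texture
```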
[0078] Figure 7a shows a block diagram of a video encoder suitable for employing embodiments of the invention. Figure 7a presents an encoder for two layers, but it would be appreciated that the presented encoder could be similarly simplified to encode only one layer or extended to encode more than two layers. Figure 7a illustrates an embodiment of a video encoder comprising a first encoder section 500 for a base layer and a second encoder section 502 for an enhancement layer. Each of the first encoder section 500 and the second encoder section 502 may comprise similar elements for encoding incoming pictures. The encoder sections 500, 502 may comprise a pixel predictor 302, 402, prediction error encoder 303, 403 and prediction error decoder 304, 404. Figure 7a also shows an embodiment of the pixel predictor 302, 402 as comprising an inter-predictor 306, 406, an intra-predictor 308, 408, a mode selector 310, 410, a filter 316, 416, and a reference frame memory 318, 418. The pixel predictor 302 of the first encoder section 500 receives 300 base layer images of a video stream to be encoded at both the inter-predictor 306 (which determines the difference between the image and a motion compensated reference frame 318) and the intra-predictor 308 (which determines a prediction for an image block based only on the already processed parts of the current frame or picture). The outputs of both the inter-predictor and the intra-predictor are passed to the mode selector 310. The intra-predictor 308 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 310. The mode selector 310 also receives a copy of the base layer picture 300. Correspondingly, the pixel predictor 402 of the second encoder section 502 receives 400 enhancement layer images of a video stream to be encoded at both the inter-predictor 406 (which determines the difference between the image and a motion compensated reference frame 418) and the intra-predictor 408 (which determines a prediction for an image block based only on the already processed parts of the current frame or picture). The outputs of both the inter-predictor and the intra-predictor are passed to the mode selector 410. The intra-predictor 408 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 410. The mode selector 410 also receives a copy of the enhancement layer picture 400.
[0079] Depending on which encoding mode is selected to encode the current block, the output of the inter-predictor 306, 406 or the output of one of the optional intra-predictor modes or the output of a surface encoder within the mode selector is passed to the output of the mode selector 310, 410. The output of the mode selector is passed to a first summing device 321, 421. The first summing device may subtract the output of the pixel predictor 302, 402 from the base layer picture 300/enhancement layer picture 400 to produce a first prediction error signal 320, 420 which is input to the prediction error encoder 303, 403.
[0080] The pixel predictor 302, 402 further receives from a preliminary reconstructor 339, 439 the combination of the prediction representation of the image block 312, 412 and the output 338, 438 of the prediction error decoder 304, 404. The preliminary
reconstructed image 314, 414 may be passed to the intra-predictor 308, 408 and to a filter 316, 416. The filter 316, 416 receiving the preliminary representation may filter the preliminary representation and output a final reconstructed image 340, 440 which may be saved in a reference frame memory 318, 418. The reference frame memory 318 may be connected to the inter-predictor 306 to be used as the reference image against which a future base layer picture 300 is compared in inter-prediction operations. Subject to the base layer being selected and indicated to be the source for inter-layer sample prediction and/or inter-layer motion information prediction of the enhancement layer according to some embodiments, the reference frame memory 318 may also be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer picture 400 is compared in inter-prediction operations. Moreover, the reference frame memory 418 may be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer picture 400 is compared in inter-prediction operations.
[0081 ] Filtering parameters from the filter 316 of the first encoder section 500 may be provided to the second encoder section 502 subject to the base layer being selected and indicated to be source for predicting the filtering parameters of the enhancement layer according to some embodiments.
[0082] The prediction error encoder 303, 403 comprises a transform unit 342, 442 and a quantizer 344, 444. The transform unit 342, 442 transforms the first prediction error signal 320, 420 to a transform domain. The transform is, for example, the DCT transform. The quantizer 344, 444 quantizes the transform domain signal, e.g. the DCT coefficients, to form quantized coefficients.
[0083] The prediction error decoder 304, 404 receives the output from the prediction error encoder 303, 403 and performs the opposite processes of the prediction error encoder 303, 403 to produce a decoded prediction error signal 338, 438 which, when combined with the prediction representation of the image block 312, 412 at the second summing device 339, 439, produces the preliminary reconstructed image 314, 414. The prediction error decoder may be considered to comprise a dequantizer 361, 461, which dequantizes the quantized coefficient values, e.g. DCT coefficients, to reconstruct the transform signal and an inverse transformation unit 363, 463, which performs the inverse transformation to the reconstructed transform signal wherein the output of the inverse transformation unit 363, 463 contains reconstructed block(s). The prediction error decoder may also comprise a block filter which may filter the reconstructed block(s) according to further decoded information and filter parameters.
[0084] The entropy encoder 330, 430 receives the output of the prediction error encoder 303, 403 and may perform a suitable entropy encoding/variable length encoding on the signal to provide error detection and correction capability. The outputs of the entropy encoders 330, 430 may be inserted into a bitstream e.g. by a multiplexer 508.
[0085] Figure 7b shows a block diagram of a video decoder suitable for employing embodiments of the invention. Figure 7b depicts a structure of a two-layer decoder, but it would be appreciated that the decoding operations may similarly be employed in a single- layer decoder.
[0086] The video decoder 550 comprises a first decoder section 552 for base layer pictures and a second decoder section 554 for enhancement layer pictures. Block 556 illustrates a demultiplexer for delivering information regarding base layer pictures to the first decoder section 552 and for delivering information regarding enhancement layer pictures to the second decoder section 554. Reference P'n stands for a predicted representation of an image block. Reference D'n stands for a reconstructed prediction error signal. Blocks 704, 804 illustrate preliminary reconstructed images (I'n). Reference R'n stands for a final reconstructed image. Blocks 703, 803 illustrate inverse transform (T-1). Blocks 702, 802 illustrate inverse quantization (Q-1). Blocks 700, 800 illustrate entropy decoding (E-1). Blocks 706, 806 illustrate a reference frame memory (RFM). Blocks 707, 807 illustrate prediction (P) (either inter prediction or intra prediction). Blocks 708, 808 illustrate filtering (F). Blocks 709, 809 may be used to combine decoded prediction error information with predicted base or enhancement layer pictures to obtain the preliminary reconstructed images (I'n). Preliminary reconstructed and filtered base layer pictures may be output 710 from the first decoder section 552 and preliminary reconstructed and filtered enhancement layer pictures may be output 810 from the second decoder section 554.
[0087] Herein, the decoder could be interpreted to cover any operational unit capable to carry out the decoding operations, such as a player, a receiver, a gateway, a demultiplexer and/or a decoder.
[0088] The decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (inverse operation of the prediction error coding recovering the quantized prediction error signal in spatial pixel domain). After applying prediction and prediction error decoding means the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame. The decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
[0089] Figures 1a and 1b illustrate an example of a camera having multiple lenses and imaging sensors but also other types of cameras may be used to capture wide view images and/or wide view video.
The camera 100 of Figure 1a comprises two or more camera units 102 and is capable of capturing wide view images and/or wide view video. In this example the number of camera units 102 is eight, but may also be less than eight or more than eight. Each camera unit 102 is located at a different location in the multi-camera system and may have a different orientation with respect to other camera units 102. As an example, the camera units 102 may have an omnidirectional constellation so that it has a 360-degree viewing angle in 3D space. In other words, such a camera 100 may be able to see each direction of a scene so that each spot of the scene around the camera 100 can be viewed by at least one camera unit 102.
The camera 100 of Figure 1a may also comprise a processor 104 for controlling the operations of the camera 100. There may also be a memory 106 for storing data and computer code to be executed by the processor 104, and a transceiver 108 for communicating with, for example, a communication network and/or other devices in a wireless and/or wired manner. The camera 100 may further comprise a user interface (UI) 110 for displaying information to the user, for generating audible signals and/or for receiving user input. However, the camera 100 need not comprise each feature mentioned above, or may comprise other features as well. For example, there may be electric and/or mechanical elements for adjusting and/or controlling optics of the camera units 102 (not shown).
[0092] Figure 1a also illustrates some operational elements which may be implemented, for example, as computer code in the software of the processor, in hardware, or both. A focus control element 114 may perform operations related to adjustment of the optical system of a camera unit or units to obtain focus meeting target specifications or some other predetermined criteria. An optics adjustment element 116 may perform movements of the optical system or one or more parts of it according to instructions provided by the focus control element 114. It should be noted here that the actual adjustment of the optical system need not be performed by the apparatus but it may be performed manually, wherein the focus control element 114 may provide information for the user interface 110 to indicate to a user of the device how to adjust the optical system.
[0093] Figure 1b shows as a perspective view the camera 100 of Figure 1a. In Figure 1b seven camera units 102a-102g can be seen, but the camera 100 may comprise even more camera units which are not visible from this perspective. Figure 1b also shows two microphones 112a, 112b, but the apparatus may also comprise one or more than two microphones.
[0095] In accordance with an embodiment, the camera 100 may be controlled by another device (not shown), wherein the camera 100 and the other device may communicate with each other and a user may use a user interface of the other device for entering commands, parameters, etc. and the user may be provided information from the camera 100 via the user interface of the other device.
[0096] The following describes in further detail suitable apparatus and possible mechanisms for implementing the embodiments of the invention. In this regard reference is first made to Figure 8 which shows a schematic block diagram of an exemplary apparatus or electronic device 50 depicted in Figure 9, which may incorporate a transmitter according to an embodiment of the invention.
[0097] The electronic device 50 may for example be a mobile terminal or user equipment of a wireless communication system. However, it would be appreciated that embodiments of the invention may be implemented within any electronic device or apparatus which may require transmission of radio frequency signals.
[0098] The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 further may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The term battery discussed in connection with the embodiments may also be one of these mobile energy devices. Further, the apparatus 50 may comprise a combination of different kinds of energy devices, for example a rechargeable battery and a solar cell. The apparatus may further comprise an infrared port
41 for short range line of sight communication to other devices. In other embodiments the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/FireWire wired connection.
[0099] The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The controller 56 may be connected to memory 58 which in embodiments of the invention may store both data and/or may also store instructions for implementation on the controller 56. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of audio and/or video data or assisting in coding and decoding carried out by the controller 56.
[0100] The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a universal integrated circuit card (UICC) reader and a universal integrated circuit card for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
[0101] The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 60 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
[0102] In some embodiments of the invention, the apparatus 50 comprises a camera
42 capable of recording or detecting images.
[0103] With respect to Figure 10, an example of a system within which embodiments of the present invention can be utilized is shown. The system 10 comprises multiple communication devices which can communicate through one or more networks. The system 10 may comprise any combination of wired and/or wireless networks including, but not limited to a wireless cellular telephone network (such as a global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), long term evolution (LTE) based network, code division multiple access (CDMA) network etc.), a wireless local area network (WLAN) such as defined by any of the IEEE 802.x standards, a Bluetooth personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and the Internet.
[0104] For example, the system shown in Figure 10 comprises a mobile telephone network 11 and a representation of the internet 28. Connectivity to the internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.
[0105] The example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22, and a tablet computer. The apparatus 50 may be stationary or mobile when carried by an individual who is moving. The apparatus 50 may also be located in a mode of transport including, but not limited to, a car, a truck, a taxi, a bus, a train, a boat, an airplane, a bicycle, a motorcycle or any similar suitable mode of transport.
[0106] Some or further apparatus may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24. The base station 24 may be connected to a network server 26 that allows
communication between the mobile telephone network 11 and the internet 28. The system may include additional communication devices and communication devices of various types.
[0107] The communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global system for mobile communications (GSM), universal mobile telecommunications system (UMTS), time division multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth, IEEE 802.11, Long Term Evolution wireless communication technique (LTE) and any similar wireless communication technology. A communications device involved in implementing various embodiments of the present invention may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and any suitable connection.
[0108] Although the above examples describe embodiments of the invention operating within a wireless communication device, it will be appreciated that the invention as described above may be implemented as part of any apparatus comprising circuitry in which radio frequency signals are transmitted and received. Thus, for example, embodiments of the invention may be implemented in a mobile phone, in a base station, or in a computer such as a desktop computer or a tablet computer comprising radio frequency communication means (e.g. wireless local area network, cellular radio, etc.).
[0109] In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[0110] Embodiments of the invention may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
[0111] Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
[0112] The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. Various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention.
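To make the change-driven attribute coding of the foregoing embodiments concrete, the following is a minimal, non-normative encoder-side sketch in Python. It assumes, purely for illustration, that the first attribute (geometry) is an (N, 3) NumPy array of point positions, that the second attribute (texture) is an (N, 3) array of per-point colours, and that consecutive point cloud frames share a per-point correspondence; all function and variable names (detect_geometry_changes, encode_changed_texture, threshold) are hypothetical and are not part of the embodiments or claims.

```python
# Minimal, non-normative encoder-side sketch of change-driven attribute coding.
# Assumptions (illustrative only): geometry is an (N, 3) array of positions,
# texture is an (N, 3) array of colours, and consecutive frames share a
# per-point correspondence.
import numpy as np

def detect_geometry_changes(geometry_prev, geometry_curr, threshold=1e-3):
    """Examine the first attribute at two time instances (inter prediction)
    and return a boolean mask of points whose position changed."""
    displacement = np.linalg.norm(geometry_curr - geometry_prev, axis=1)
    return displacement > threshold

def encode_changed_texture(geometry_prev, geometry_curr, texture_curr):
    """Locate and encode only the second-attribute (texture) data that
    corresponds to the changed part of the first attribute (geometry)."""
    changed = detect_geometry_changes(geometry_prev, geometry_curr)
    changed_indices = np.flatnonzero(changed)
    # Placeholder "encoding": a real encoder would pass this subset to a
    # 2D video or point cloud attribute codec instead of raw arrays.
    return {"indices": changed_indices, "texture": texture_curr[changed_indices]}
```

In a practical encoder the returned subset would be compressed by an attribute codec; the sketch only illustrates how the changed part of the second attribute is located from changes detected in the first attribute.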

Claims

1. A method comprising:
obtaining one or more volumetric video data representations, wherein the volumetric video data representations comprise at least a first attribute and a second attribute;
examining the first attribute of the one or more volumetric video data representations to detect changes in the first attribute;
if a change is detected, locating in the second attribute data which corresponds to the changed part of the first attribute; and
encoding that part of the second attribute data which corresponds to the changed part of the first attribute.
2. The method according to claim 1, said locating comprising:
taking into account the structure of the first attribute and the second attribute.
3. The method according to claim 1 or 2, wherein the examining the first attribute comprises:
examining a first volumetric video data representation representing a first time instance; and
examining a second volumetric video data representation representing a second time instance succeeding the first time instance.
4. The method according to claim 1, 2 or 3, wherein the examining the first attribute comprises:
examining a first sub-part of one volumetric video data representation; and
examining a second sub-part of the same volumetric video data representation.
5. The method according to any of the claims 1 to 4, wherein the first attribute is geometry and the second attribute is texture.
6. The method according to any of the claims 1 to 4, wherein the first attribute is texture and the second attribute is geometry.
7. The method according to any of the claims 1 to 6, wherein the volumetric video data representations comprise point cloud data.
8. The method according to any of the claims 1 to 7, wherein the change is one of the following:
addition;
subtraction;
movement;
formation change.
9. An apparatus comprising at least one processor and at least one memory, said at least one memory having code stored thereon, which when executed by said at least one processor, causes the apparatus to perform at least:
obtain one or more volumetric video data representations, wherein the one or more volumetric video data representations comprise at least a first attribute and a second attribute;
examine the first attribute of the one or more volumetric video data representations to detect changes in the volumetric video data representations;
locate in the second attribute data which corresponds to the changed part of the first attribute, if a change is detected; and
encode that part of the second attribute data which corresponds to the changed part of the first attribute.
10. The apparatus according to claim 9, said at least one memory having code stored thereon, which when executed by said at least one processor, causes the apparatus to perform the locating by:
taking into account the structure of the first attribute and the second attribute.
11. The apparatus according to claim 9 or 10, said at least one memory having code stored thereon, which when executed by said at least one processor, causes the apparatus to perform the examining the first attribute by:
examining a first volumetric video data representation representing a first time instance; and
examining a second volumetric video data representation representing a second time instance succeeding the first time instance.
12. The apparatus according to claim 9, 10 or 11, said at least one memory having code stored thereon, which when executed by said at least one processor, causes the apparatus to perform the examining the first attribute by:
examining a first sub-part of one volumetric video data representation; and
examining a second sub-part of the same volumetric video data representation.
13. The apparatus according to any of the claims 9 to 12, wherein the first attribute is geometry and the second attribute is texture.
14. The apparatus according to any of the claims 9 to 12, wherein the first attribute is texture and the second attribute is geometry.
15. The apparatus according to any of the claims 9 to 14, wherein the volumetric video data representations comprise point cloud data.
16. The apparatus according to any of the claims 9 to 15, wherein the change is one of the following:
addition;
subtraction;
movement;
formation change.
17. A computer readable storage medium comprising code for use by an apparatus, which when executed by a processor, causes the apparatus to perform at least:
obtain one or more volumetric video data representations, wherein the one or more volumetric video data representations comprise at least a first attribute and a second attribute;
examine the first attribute of the one or more volumetric video data representations to detect changes in the volumetric video data representations;
locate in the second attribute data which corresponds to the changed part of the first attribute, if a change is detected; and
encode that part of the second attribute data which corresponds to the changed part of the first attribute.
18. An apparatus comprising:
means for obtaining one or more volumetric video data representations, wherein the one or more volumetric video data representations comprise at least a first attribute and a second attribute;
means for examining the first attribute of the one or more volumetric video data representations to detect changes in the volumetric video data representations;
means for locating in the second attribute data which corresponds to the changed part of the first attribute, if a change is detected; and
means for encoding that part of the second attribute data which corresponds to the changed part of the first attribute.
19. The apparatus according to claim 18 comprising means for performing the method of any of the claims 2 to 8.
20. A method comprising:
receiving an encoded volumetric video data presentation comprising at least a first attribute and a second attribute;
decoding the volumetric video data presentation;
comparing a first attribute of the volumetric video data presentation with a corresponding first attribute of a previously decoded volumetric video data presentation, if the encoded volumetric video data presentation comprises inter prediction data, or
comparing a first attribute of two or more parts of the volumetric video data presentation, if the encoded volumetric video data presentation comprises intra prediction data;
using the comparison result to determine whether there are changes between the compared first attributes of the volumetric video data presentations; and
reconstructing a second attribute of the volumetric video data presentation by using information of the changes.
21. The method according to claim 20, wherein the first attribute is geometry and the second attribute is texture.
22. The method according to claim 20, wherein the first attribute is texture and the second attribute is geometry.
23. The method according to claim 20, 21 or 22, wherein the volumetric video data presentation comprises voxel clouds and two-dimensional texture maps.
24. An apparatus comprising at least one processor and at least one memory, said at least one memory having code stored thereon, which when executed by said at least one processor, causes the apparatus to perform at least:
receive an encoded volumetric video data presentation comprising at least a first attribute and a second attribute;
decode the volumetric video data presentation;
compare a first attribute of the volumetric video data presentation with a corresponding first attribute of a previously decoded volumetric video data presentation, if the encoded volumetric video data presentation comprises inter prediction data, or
compare a first attribute of two or more parts of the volumetric video data presentation, if the encoded volumetric video data presentation comprises intra prediction data;
use the comparison result to determine whether there are changes between the compared first attributes of the volumetric video data presentations; and
reconstruct a second attribute of the volumetric video data presentation by using information of the changes.
25. The apparatus according to claim 24, wherein the first attribute is geometry and the second attribute is texture.
26. The apparatus according to claim 24, wherein the first attribute is texture and the second attribute is geometry.
27. The apparatus according to claim 24, 25 or 26, wherein the volumetric video data presentation comprises voxel clouds and two-dimensional texture maps.
28. A computer readable storage medium comprising code for use by an apparatus, which when executed by a processor, causes the apparatus to perform at least:
receive an encoded volumetric video data presentation comprising at least a first attribute and a second attribute;
decode the volumetric video data presentation;
compare a first attribute of the volumetric video data presentation with a corresponding first attribute of a previously decoded volumetric video data presentation, if the encoded volumetric video data presentation comprises inter prediction data, or
compare a first attribute of two or more parts of the volumetric video data presentation, if the encoded volumetric video data presentation comprises intra prediction data;
use the comparison result to determine whether there are changes between the compared first attributes of the volumetric video data presentations; and
reconstruct a second attribute of the volumetric video data presentation by using information of the changes.
29. An apparatus comprising:
means for receiving an encoded volumetric video data presentation comprising at least a first attribute and a second attribute;
means for decoding the volumetric video data presentation;
means for comparing a first attribute of the volumetric video data presentation with a corresponding first attribute of a previously decoded volumetric video data presentation, if the encoded volumetric video data presentation comprises inter prediction data, or
means for comparing a first attribute of two or more parts of the volumetric video data presentation, if the encoded volumetric video data presentation comprises intra prediction data;
means for using the comparison result to determine whether there are changes between the compared first attributes of the volumetric video data presentations; and
means for reconstructing a second attribute of the volumetric video data presentation by using information of the changes.
30. The apparatus according to claim 29 comprising means for performing the method of any of the claims 21 to 23.
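For completeness, a matching decoder-side sketch is given below, corresponding to the comparison and reconstruction steps of claims 20 and 24 under the same simplifying assumptions as the encoder-side sketch before the claims (per-point correspondence between consecutive presentations, geometry and texture stored as (N, 3) arrays). All names are illustrative and not part of the claims.

```python
# Minimal, non-normative decoder-side sketch. The decoded first attribute
# (geometry) of the current and previously decoded presentations is compared
# to locate the changed points, and the second attribute (texture) is
# reconstructed by copying the unchanged samples from the previous
# presentation and applying the received changed texture data.
import numpy as np

def reconstruct_texture(geometry_prev, geometry_curr, texture_prev,
                        changed_texture, threshold=1e-3):
    """Reconstruct the second attribute using information about changes
    detected in the first attribute (inter prediction case)."""
    displacement = np.linalg.norm(geometry_curr - geometry_prev, axis=1)
    changed_indices = np.flatnonzero(displacement > threshold)
    texture_curr = texture_prev.copy()
    # Only the changed part of the texture was transmitted; the rest is reused.
    texture_curr[changed_indices] = changed_texture
    return texture_curr

# Tiny self-contained usage example with synthetic data.
rng = np.random.default_rng(0)
geom_prev = rng.random((100, 3))
geom_curr = geom_prev.copy()
geom_curr[:10] += 0.1                      # only ten points move
tex_prev = rng.random((100, 3))
new_colours = rng.random((10, 3))          # received changed texture data
tex_curr = reconstruct_texture(geom_prev, geom_curr, tex_prev, new_colours)
assert tex_curr.shape == (100, 3)
```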
PCT/EP2018/070444 2017-10-17 2018-07-27 An apparatus, a method and a computer program for coding volumetric video WO2019076503A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1717012.7 2017-10-17
GBGB1717012.7A GB201717012D0 (en) 2017-10-17 2017-10-17 An apparatus a method and a computer program for coding volumetric video

Publications (1)

Publication Number Publication Date
WO2019076503A1 true WO2019076503A1 (en) 2019-04-25

Family

ID=60419336

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/070444 WO2019076503A1 (en) 2017-10-17 2018-07-27 An apparatus, a method and a computer program for coding volumetric video

Country Status (2)

Country Link
GB (1) GB201717012D0 (en)
WO (1) WO2019076503A1 (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170214943A1 (en) * 2016-01-22 2017-07-27 Mitsubishi Electric Research Laboratories, Inc. Point Cloud Compression using Prediction and Shape-Adaptive Transforms

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DE QUEIROZ RICARDO L ET AL: "Motion-Compensated Compression of Dynamic Voxelized Point Clouds", IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 26, no. 8, 1 August 2017 (2017-08-01), pages 3886 - 3895, XP011653025, ISSN: 1057-7149, [retrieved on 20170613], DOI: 10.1109/TIP.2017.2707807 *
RUWEN SCHNABEL ET AL: "Octree-Based Point Cloud Compression", 29 July 2006 (2006-07-29), pages 1 - 11, XP008150338, ISBN: 1-56881-352-X, Retrieved from the Internet <URL:http://cg.cs.uni-bonn.de/aigaion2root/attachments/schnabel-2006-octree.pdf> *
THANOU DORINA ET AL: "Graph-Based Compression of Dynamic 3D Point Cloud Sequences", IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 25, no. 4, 1 April 2016 (2016-04-01), pages 1765 - 1778, XP011602605, ISSN: 1057-7149, [retrieved on 20160307], DOI: 10.1109/TIP.2016.2529506 *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11935272B2 (en) 2017-09-14 2024-03-19 Apple Inc. Point cloud compression
US11818401B2 (en) 2017-09-14 2023-11-14 Apple Inc. Point cloud geometry compression using octrees and binary arithmetic encoding with adaptive look-up tables
US11552651B2 (en) 2017-09-14 2023-01-10 Apple Inc. Hierarchical point cloud compression
US11676309B2 (en) 2017-09-18 2023-06-13 Apple Inc Point cloud compression using masks
US11922665B2 (en) 2017-09-18 2024-03-05 Apple Inc. Point cloud compression
US11527018B2 (en) 2017-09-18 2022-12-13 Apple Inc. Point cloud compression
US11514611B2 (en) 2017-11-22 2022-11-29 Apple Inc. Point cloud compression with closed-loop color conversion
US11361471B2 (en) 2017-11-22 2022-06-14 Apple Inc. Point cloud occupancy map compression
US11508095B2 (en) 2018-04-10 2022-11-22 Apple Inc. Hierarchical point cloud compression with smoothing
US11508094B2 (en) 2018-04-10 2022-11-22 Apple Inc. Point cloud compression
US11533494B2 (en) 2018-04-10 2022-12-20 Apple Inc. Point cloud compression
US11727603B2 (en) 2018-04-10 2023-08-15 Apple Inc. Adaptive distance based point cloud compression
US10805646B2 (en) 2018-06-22 2020-10-13 Apple Inc. Point cloud geometry compression using octrees and binary arithmetic encoding with adaptive look-up tables
US11363309B2 (en) 2018-06-22 2022-06-14 Apple Inc. Point cloud geometry compression using octrees and binary arithmetic encoding with adaptive look-up tables
US11663744B2 (en) 2018-07-02 2023-05-30 Apple Inc. Point cloud compression with adaptive filtering
US11683525B2 (en) 2018-07-05 2023-06-20 Apple Inc. Point cloud compression with multi-resolution video encoding
US11647226B2 (en) 2018-07-12 2023-05-09 Apple Inc. Bit stream structure for compressed point cloud data
US11012713B2 (en) 2018-07-12 2021-05-18 Apple Inc. Bit stream structure for compressed point cloud data
US11386524B2 (en) 2018-09-28 2022-07-12 Apple Inc. Point cloud compression image padding
US11748916B2 (en) 2018-10-02 2023-09-05 Apple Inc. Occupancy map block-to-patch information compression
US11367224B2 (en) 2018-10-02 2022-06-21 Apple Inc. Occupancy map block-to-patch information compression
US11430155B2 (en) 2018-10-05 2022-08-30 Apple Inc. Quantized depths for projection point cloud compression
US11516394B2 (en) 2019-03-28 2022-11-29 Apple Inc. Multiple layer flexure for supporting a moving image sensor
US11711544B2 (en) 2019-07-02 2023-07-25 Apple Inc. Point cloud compression with supplemental information messages
US11627314B2 (en) 2019-09-27 2023-04-11 Apple Inc. Video-based point cloud compression with non-normative smoothing
US11562507B2 (en) 2019-09-27 2023-01-24 Apple Inc. Point cloud compression using video encoding with time consistent patches
WO2021062772A1 (en) * 2019-09-30 2021-04-08 Oppo广东移动通信有限公司 Prediction method, encoder, decoder, and computer storage medium
US11936909B2 (en) 2019-09-30 2024-03-19 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Prediction method, encoder, decoder, and computer storage medium
US11538196B2 (en) 2019-10-02 2022-12-27 Apple Inc. Predictive coding for point cloud compression
US11895307B2 (en) 2019-10-04 2024-02-06 Apple Inc. Block-based predictive coding for point cloud compression
US11798196B2 (en) 2020-01-08 2023-10-24 Apple Inc. Video-based point cloud compression with predicted patches
US11625866B2 (en) 2020-01-09 2023-04-11 Apple Inc. Geometry encoding using octrees and predictive trees
CN113473153A (en) * 2020-03-30 2021-10-01 鹏城实验室 Point cloud attribute prediction method, encoding method, decoding method and equipment thereof
CN113473153B (en) * 2020-03-30 2023-04-25 鹏城实验室 Point cloud attribute prediction method, encoding method, decoding method and equipment thereof
US11615557B2 (en) 2020-06-24 2023-03-28 Apple Inc. Point cloud compression using octrees with slicing
US11620768B2 (en) 2020-06-24 2023-04-04 Apple Inc. Point cloud geometry compression using octrees with multiple scan orders
WO2022042538A1 (en) * 2020-08-24 2022-03-03 北京大学深圳研究生院 Block-based point cloud geometric inter-frame prediction method and decoding method
US11948338B1 (en) 2021-03-29 2024-04-02 Apple Inc. 3D volumetric content encoding using 2D videos and simplified 3D meshes

Also Published As

Publication number Publication date
GB201717012D0 (en) 2017-11-29

Similar Documents

Publication Publication Date Title
WO2019076503A1 (en) An apparatus, a method and a computer program for coding volumetric video
US11430156B2 (en) Apparatus, a method and a computer program for volumetric video
US10499033B2 (en) Apparatus, a method and a computer program for coding and rendering volumetric video
CN113615204A (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device, and point cloud data receiving method
EP3975124A1 (en) Method and device for processing point cloud data
JP7451576B2 (en) Point cloud data processing method and device
US20220130075A1 (en) Device and method for processing point cloud data
WO2019166688A1 (en) An apparatus, a method and a computer program for volumetric video
US20220256190A1 (en) Point cloud data processing apparatus and method
US11902348B2 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
US20230290006A1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
US20230059625A1 (en) Transform-based image coding method and apparatus therefor
WO2023132919A1 (en) Scalable framework for point cloud compression
US20220337872A1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
JP7440546B2 (en) Point cloud data processing device and method
Chan et al. Overview of current development in depth map coding of 3D video and its future
WO2019008222A1 (en) A method and apparatus for encoding media content
EP4340363A1 (en) Point cloud data transmission method, point cloud data transmission device, point cloud data reception method, and point cloud data reception device
WO2019185983A1 (en) A method, an apparatus and a computer program product for encoding and decoding digital volumetric video
WO2019034803A1 (en) Method and apparatus for processing video information
WO2019008233A1 (en) A method and apparatus for encoding media content
WO2018211171A1 (en) An apparatus, a method and a computer program for video coding and decoding
US20240179347A1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
US20240020885A1 (en) Point cloud data transmission method, point cloud data transmission device, point cloud data reception method, and point cloud data reception device
EP4369716A1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18750371

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18750371

Country of ref document: EP

Kind code of ref document: A1