WO2021170906A1 - An apparatus, a method and a computer program for volumetric video - Google Patents


Info

Publication number
WO2021170906A1
Authority
WO
WIPO (PCT)
Prior art keywords
patch
component picture
patches
value
geometry
Prior art date
Application number
PCT/FI2021/050096
Other languages
French (fr)
Inventor
Payman Aflaki Beni
Vinod Kumar MALAMAL VADAKITAL
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2021170906A1 publication Critical patent/WO2021170906A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being an image region, e.g. an object
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 Processing image signals
    • H04N 13/161 Encoding, multiplexing or demultiplexing different image signal components
    • H04N 13/172 Processing image signals comprising non-image signal components, e.g. headers or format information
    • H04N 13/178 Metadata, e.g. disparity information
    • H04N 19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/103 Selection of coding mode or of prediction mode
    • H04N 19/105 Selection of the reference unit for prediction within a chosen coding or prediction mode, e.g. adaptive choice of position and number of pixels used for prediction
    • H04N 19/109 Selection of coding mode or of prediction mode among a plurality of temporal predictive coding modes
    • H04N 19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/136 Incoming video signal characteristics or properties
    • H04N 19/137 Motion inside a coding unit, e.g. average field, frame or block difference
    • H04N 19/46 Embedding additional information in the video signal during the compression process
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/507 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction using conditional replenishment
    • H04N 19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • H04N 19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H04N 13/111 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N 13/117 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation, the virtual viewpoint locations being selected by the viewers or determined by viewer tracking

Definitions

  • the present invention relates to an apparatus, a method and a computer program for volumetric video coding.
  • Volumetric video data represents a three-dimensional scene or object and can be used as input for virtual reality (VR), augmented reality (AR) and mixed reality (MR) applications.
  • Such data describes the geometry, e.g. shape, size, position in three-dimensional (3D) space, and respective attributes, e.g. colour, opacity, reflectance and any possible temporal changes of the geometry and attributes at given time instances.
  • Volumetric video is either generated from 3D models through computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible.
  • CGI computer-generated imagery
  • Typical representation formats for such volumetric data are triangle meshes, point clouds (PCs), or voxel arrays.
  • the reconstructed 3D scene may contain tens or even hundreds of millions of points.
  • One way to compress a time-varying volumetric scene/object is to project 3D surfaces to some number of pre-defined 2D planes.
  • Regular 2D video compression algorithms can then be used to compress various aspects of the projected surfaces.
  • MPEG Video-Based Point Cloud Coding (V-PCC) provides a procedure for compressing a time-varying volumetric scene/object by projecting 3D surfaces onto a number of pre-defined 2D planes, which may then be compressed using regular 2D video compression algorithms.
  • the projection is presented using different patches, where each set of patches may represent a specific object or specific parts of a scene.
  • the patches may be aligned differently, i.e. different vertical and horizontal positions, sizes, rotations and/or mirroring.
  • the plurality of parameters are defined separately for each patch. Transmitting such information per patch takes a relatively high number of bits, and therefore increases the size of the encoded content bitstream that needs to be transmitted for the presentation of the content to the decoder.
  • a method comprising projecting a 3D representation of a scene onto a plurality of 2D patches; generating at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; aggregating said plurality of 2D patches with the corresponding texture component picture into an atlas; determining a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and signaling, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
  • An apparatus comprises at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: project a 3D representation of a scene onto a plurality of 2D patches; generate at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; aggregate said plurality of 2D patches with the corresponding texture component picture into an atlas; determine a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and signal, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
  • An apparatus comprises: means for projecting a 3D representation of a scene onto a plurality of 2D patches; means for generating at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; means for aggregating said plurality of 2D patches with the corresponding texture component picture into an atlas; means for determining a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and means for signaling, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
  • said difference value is determined as a mean square error between pixel values of the texture component pictures or the geometry component pictures of said first and second patches.
  • said predefined similarity criteria comprise one or more of the following criteria:
  • the second patch comprising at least one image block being used as a prediction reference for at least one image block of the first patch.
  • a skip mode of encoding is indicated to be used for said first patch.
  • said indication is configured to be encoded as a flag at least for the first patch, wherein said flag indicates whether at least the second patch is to be used as a prediction reference for said first patch.
  • a signalling of at least the second patch to be used as a prediction reference for said first patch is configured to be carried out by at least one syntax element included in an atlas parameter syntax structure.
  • in response to said difference value being zero, the apparatus is configured to include a second syntax element, indicative of a prediction residual, in the atlas parameter syntax structure and set a value of the second syntax element for said first patch as zero.
  • in response to said difference value being greater than zero but smaller than said predetermined threshold value, the apparatus is configured to include the second syntax element in the atlas parameter syntax structure and set the value of the second syntax element for said first patch as one.
  • the apparatus is configured to include a third syntax element in the atlas parameter syntax structure, wherein said third syntax element indicates a prediction residual to be added to the prediction based on the at least the second patch.
  • a method comprises: receiving a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; receiving, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; decoding, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; and decoding at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch.
  • An apparatus comprises at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; receive, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; decode, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; and decode at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch.
  • An apparatus comprises: means for receiving a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; means for receiving, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; means for decoding, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; and means for decoding at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch.
  • the apparatus is configured to decode the indication about a skip mode of decoding to be used for said first patch, wherein said indication includes a reference to one or more reference image blocks of at least the second patch to be used instead of the first patch.
  • said indication at least for the first patch is configured to be decoded from a flag, wherein said flag indicates whether at least the second patch is to be used as a prediction reference for said first patch.
  • the apparatus is configured to decode the indication about at least the second patch to be used as a prediction reference for said first patch from at least one syntax element included in an atlas parameter syntax structure.
  • the apparatus is configured to decode a second syntax element, indicative of a prediction residual, from the atlas parameter syntax structure, and in response to the value of the second syntax element for said first patch being one, decode a third syntax element from the atlas parameter syntax structure, wherein said third syntax element indicates a prediction residual to be added to the prediction based on at least the second patch.
  • Computer readable storage media comprise code for use by an apparatus, which when executed by a processor, causes the apparatus to perform the above methods.
  • Figs. 1a and 1b show an encoder and decoder for encoding and decoding 2D pictures
  • FIGs. 2a and 2b show a compression and a decompression process for 3D volumetric video
  • FIGs. 3a and 3b show an example of a point cloud frame and a projection of points to a corresponding plane of a point cloud bounding box
  • Fig. 4 shows an illustrative example of the relationship between atlases, patches and view representations
  • Fig. 5 shows a decoder reference architecture for immersive video
  • Fig. 6 shows a flow chart for encoding patch information according to an embodiment
  • Fig. 7 shows a flow chart for decoding patch information according to an embodiment
  • Figs. 8a and 8b show some embodiments relating to the encoding and decoding of the patch information.
  • a video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can uncompress the compressed video representation back into a viewable form.
  • An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at lower bitrate).
  • Volumetric video may be captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.
  • Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane.
  • Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane.
  • 2D two-dimensional
  • Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example.
  • Volumetric video data represents a three-dimensional scene or object, and thus such data can be viewed from any viewpoint.
  • Volumetric video data can be used as an input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications.
  • AR augmented reality
  • VR virtual reality
  • MR mixed reality
  • Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g. color, opacity, reflectance, ...), together with any possible temporal changes of the geometry and attributes at given time instances (e.g. frames in 2D video).
  • Volumetric video is either generated from 3D models, i.e. computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera, a laser scan, a combination of video and dedicated depth sensors, etc. Also, a combination of CGI and real-world data is possible. Examples of representation formats for such volumetric data are triangle meshes, point clouds, or voxels.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time.
  • In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold.
  • Another way to represent 3D data is to code this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.
  • each point of each 3D surface is described as a 3D point with color and/or other attribute information such as surface normal or material reflectance.
  • Point cloud is a set of data points in a coordinate system, for example in a three- dimensional coordinate system being defined by X, Y, and Z coordinates.
  • the points may represent an external surface of an object in the screen space, e.g. in a three-dimensional space.
  • the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression of the presentations becomes fundamental.
  • Standard volumetric video representation formats such as point clouds, meshes, voxel, suffer from poor temporal compression performance. Identifying correspondences for motion-compensation in 3D-space is an ill-defined problem, as both, geometry and respective attributes may change. For example, temporal successive “frames” do not necessarily have the same number of meshes, points or voxel. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview with depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.
  • a 3D scene represented as meshes, points, and/or voxel
  • a 3D scene can be projected onto one, or more, geometries. These geometries may be “unfolded” or packed onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information may be transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).
  • Figs. 1a and 1b show an encoder and decoder for encoding and decoding the 2D texture pictures, geometry pictures and/or auxiliary pictures.
  • a video codec consists of an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form.
  • the encoder discards and/or loses some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate).
  • An example of an encoding process is illustrated in Figure 1a.
  • Figure 1a illustrates an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS) and filtering (F).
  • Figure 1b illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).
  • P'n a predicted representation of an image block
  • D'n a reconstructed prediction error signal
  • I'n a preliminary reconstructed image
  • R'n a final reconstructed image
  • T⁻¹ inverse transform
  • Q⁻¹ inverse quantization
  • E⁻¹ entropy decoding
  • RFM reference frame memory
  • F filtering
  • pixel values in a certain picture area are predicted for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner).
  • the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g. Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients.
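  • As an illustration of the transform, quantize and entropy-code steps described above, the following sketch (not part of the patent; the 8x8 block size and the uniform quantization step are assumptions) codes the prediction error of a single block with a DCT and a scalar quantizer, and reconstructs the block the way a decoder would:

```python
import numpy as np
from scipy.fft import dctn, idctn

def code_prediction_error(original_block, predicted_block, qstep=16.0):
    """Transform-code the prediction error of one block (illustrative sketch)."""
    residual = original_block.astype(np.float64) - predicted_block
    coeffs = dctn(residual, norm="ortho")        # specified transform (here a DCT)
    levels = np.round(coeffs / qstep)            # quantize the coefficients
    # 'levels' would then be entropy-coded (e.g. CABAC); omitted here.
    recon_residual = idctn(levels * qstep, norm="ortho")
    reconstructed = np.clip(predicted_block + recon_residual, 0, 255)
    return levels, reconstructed

# Example: flat prediction against a random 8x8 original block.
rng = np.random.default_rng(0)
orig = rng.integers(0, 256, size=(8, 8)).astype(np.float64)
pred = np.full((8, 8), 128.0)
levels, recon = code_prediction_error(orig, pred)
```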
  • DCT Discrete Cosine Transform
  • Video codecs may also provide a transform skip mode, which the encoders may choose to use.
  • the prediction error is coded in a sample domain, for example by deriving a sample-wise difference value relative to certain adjacent samples and coding the sample-wise difference value with an entropy coder.
  • a coding block may be defined as an NxN block of samples for some value of N such that the division of a coding tree block into coding blocks is a partitioning.
  • a coding tree block may be defined as an NxN block of samples for some value of N such that the division of a component into coding tree blocks is a partitioning.
  • a coding tree unit may be defined as a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples.
  • a coding unit may be defined as a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples.
  • a CU with the maximum allowed size may be named as LCU (largest coding unit) or coding tree unit (CTU) and the video picture is divided into non-overlapping LCUs.
  • a picture can be partitioned in tiles, which are rectangular and contain an integer number of LCUs.
  • the partitioning to tiles forms a regular grid, where heights and widths of tiles differ from each other by one LCU at the maximum.
  • a slice is defined to be an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit.
  • a slice segment is defined to be an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL unit. The division of each picture into slice segments is a partitioning.
  • an independent slice segment is defined to be a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment
  • a dependent slice segment is defined to be a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order.
  • a slice header is defined to be the slice segment header of the independent slice segment that is a current slice segment or is the independent slice segment that precedes a current dependent slice segment
  • a slice segment header is defined to be a part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment.
  • the CUs are scanned in the raster scan order of LCUs within tiles or within a picture, if tiles are not in use. Within an LCU, the CUs have a specific scan order.
  • Entropy coding/decoding may be performed in many ways. For example, context-based coding/decoding may be applied, wherein both the encoder and the decoder modify the context state of a coding parameter based on previously coded/decoded coding parameters.
  • Context-based coding may for example be context adaptive binary arithmetic coding (CABAC) or context-adaptive variable length coding (CAVLC) or any similar entropy coding.
  • Entropy coding/decoding may alternatively or additionally be performed using a variable length coding scheme, such as Huffman coding/decoding or Exp-Golomb coding/decoding. Decoding of coding parameters from an entropy-coded bitstream or codewords may be referred to as parsing.
  • the phrase along the bitstream may be defined to refer to out-of-band transmission, signalling, or storage in a manner that the out-of-band data is associated with the bitstream.
  • the phrase decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signalling, or storage) that is associated with the bitstream.
  • an indication along the bitstream may refer to metadata in a container file that encapsulates the bitstream.
  • a first texture picture may be encoded into a bitstream, and the first texture picture may comprise a first projection of texture data of a first source volume of a scene model onto a first projection surface.
  • the scene model may comprise a number of further source volumes.
  • data on the position of the originating geometry primitive may also be determined, and based on this determination, a geometry picture may be formed. This may happen for example so that depth data is determined for each or some of the texture pixels of the texture picture. Depth data is formed such that the distance from the originating geometry primitive such as a point to the projection surface is determined for the pixels. Such depth data may be represented as a depth picture, and similarly to the texture picture, such geometry picture (such as a depth picture) may be encoded and decoded with a video codec. This first geometry picture may be seen to represent a mapping of the first projection surface to the first source volume, and the decoder may use this information to determine the location of geometry primitives in the model to be reconstructed.
  • first geometry information encoded into or along the bitstream.
  • encoding a geometry (or depth) picture into or along the bitstream with the texture picture is only optional and arbitrary for example in the cases where the distance of all texture pixels to the projection surface is the same or there is no change in said distance between a plurality of texture pictures.
  • a geometry (or depth) picture may be encoded into or along the bitstream with the texture picture, for example, only when there is a change in the distance of texture pixels to the projection surface.
  • An attribute picture may be defined as a picture that comprises additional information related to an associated texture picture.
  • An attribute picture may for example comprise surface normal, opacity, or reflectance information for a texture picture.
  • a geometry picture may be regarded as one type of an attribute picture, although a geometry picture may be treated as its own picture type, separate from an attribute picture.
  • Texture picture(s) and the respective geometry picture(s), if any, and the respective attribute picture(s) may have the same or different chroma format.
  • Terms texture (component) image and texture (component) picture may be used interchangeably.
  • Terms geometry (component) image and geometry (component) picture may be used interchangeably.
  • a specific type of a geometry image is a depth image.
  • Embodiments described in relation to a geometry (component) image equally apply to a depth (component) image, and embodiments described in relation to a depth (component) image equally apply to a geometry (component) image.
  • Terms attribute image and attribute picture may be used interchangeably.
  • a geometry picture and/or an attribute picture may be treated as an auxiliary picture in video/image encoding and/or decoding.
  • the processes may be applied, for example, in Point Cloud Coding (PCC) according to MPEG standard.
  • MPEG Video-Based Point Cloud Coding (V-PCC), Test Model a.k.a. TMC2v0 (MPEG N18017) discloses a projection-based approach for dynamic point cloud compression.
  • V-PCC video-based point cloud compression
  • Each point cloud frame represents a dataset of points within a 3D volumetric space that has unique coordinates and attributes.
  • An example of a point cloud frame is shown in Figure 3a.
  • the patch generation process decomposes the point cloud frame by converting 3D samples to 2D samples on a given projection plane using a strategy that provides the best compression.
  • the patch generation process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error.
  • In TMC2v0, the following approach is implemented.
  • the normal per each point is estimated and the tangent plane and its corresponding normal are defined per each point, based on the point's m nearest neighbours within a predefined search distance.
  • the barycenter c of the m nearest neighbours p_i is computed as c = (1/m) Σ_{i=1..m} p_i.
  • the normal is then estimated from the eigen decomposition of the neighbourhood covariance matrix (1/m) Σ_{i=1..m} (p_i − c)(p_i − c)^T, taking the eigenvector associated with the smallest eigenvalue. Based on this information, each point is associated with a corresponding plane of a point cloud bounding box. Each plane is defined by a corresponding normal n_pidx with values (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0).
  • each point is associated with the plane that has the closest normal (i.e., maximizes the dot product of the point normal n_p and the plane normal n_pidx).
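  • A minimal sketch of this initial clustering (an assumption-laden illustration, not the TMC2v0 implementation): per-point normals are estimated from the covariance of the m nearest neighbours, and each point is assigned to the bounding-box plane whose normal maximizes the dot product.

```python
import numpy as np
from scipy.spatial import cKDTree

# The six bounding-box plane normals listed above.
PLANE_NORMALS = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [-1, 0, 0], [0, -1, 0], [0, 0, -1],
], dtype=np.float64)

def initial_plane_clustering(points, m=16):
    """Estimate a normal per point from its m nearest neighbours and pick the
    plane with the most aligned normal (illustrative sketch)."""
    tree = cKDTree(points)
    _, neighbour_idx = tree.query(points, k=m)
    cluster_index = np.empty(len(points), dtype=np.int64)
    for i, idx in enumerate(neighbour_idx):
        nn = points[idx]
        c = nn.mean(axis=0)                         # barycenter of the neighbourhood
        cov = (nn - c).T @ (nn - c) / len(nn)
        eigval, eigvec = np.linalg.eigh(cov)
        normal = eigvec[:, 0]                       # eigenvector of the smallest eigenvalue
        cluster_index[i] = int(np.argmax(PLANE_NORMALS @ normal))
    return cluster_index
```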
  • the sign of the normal is defined depending on the point’s position in relationship to the “center”.
  • the projection estimation description is shown in Figure 3b.
  • the initial clustering is then refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors.
  • the next step consists of extracting patches by applying a connected component extraction procedure.
  • the packing process aims at mapping the extracted patches onto a 2D grid while trying to minimize the unused space, and guaranteeing that every TxT (e.g., 16x16) block of the grid is associated with a unique patch.
  • T is a user-defined parameter that is encoded in the bitstream and sent to the decoder.
  • TMC2vO uses a simple packing strategy that iteratively tries to insert patches into a WxH grid.
  • W and H are user defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded.
  • the patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid is temporarily doubled and search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
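  • A sketch of that raster-scan packing under simplifying assumptions (axis-aligned rectangular patches measured in TxT blocks, no rotation or mirroring; function and variable names are illustrative):

```python
import numpy as np

def pack_patches(patch_sizes, W, H, T=16):
    """Place each (w, h) patch, given in TxT block units, at the first free
    raster-scan position of a W x H grid; double H whenever nothing fits."""
    grid = np.zeros((H // T, W // T), dtype=bool)    # True = block already used
    positions = []
    for w, h in patch_sizes:
        placed = False
        while not placed:
            rows, cols = grid.shape
            for v in range(rows - h + 1):            # exhaustive raster-scan search
                for u in range(cols - w + 1):
                    if not grid[v:v + h, u:u + w].any():
                        grid[v:v + h, u:u + w] = True
                        positions.append((u * T, v * T))
                        placed = True
                        break
                if placed:
                    break
            if not placed:                           # no space left: double H temporarily
                grid = np.vstack([grid, np.zeros_like(grid)])
    used_rows = np.flatnonzero(grid.any(axis=1))
    H_clipped = int(used_rows.max() + 1) * T if used_rows.size else H
    return positions, H_clipped                      # H clipped to the used grid cells
```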
  • the image generation process exploits the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images.
  • each patch is projected onto two images, referred to as layers. More precisely, let H(u,v) be the set of points of the current patch that get projected to the same pixel (u, v).
  • the first layer, also called the near layer, stores the point of H(u,v) with the lowest depth D0.
  • the second layer, referred to as the far layer, captures the point of H(u,v) with the highest depth within the interval [D0, D0+D], where D is a user-defined parameter that describes the surface thickness.
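  • A small sketch of how the near and far layers could be derived per projected pixel, assuming H(u, v) is available as a mapping from pixels to candidate depths and D is the surface-thickness parameter:

```python
def near_far_layers(candidate_depths, D=4):
    """candidate_depths: dict mapping (u, v) to the depths of the points of H(u, v).
    Returns per-pixel (D0, D1) values for the near and far layers (illustrative)."""
    layers = {}
    for (u, v), depths in candidate_depths.items():
        d0 = min(depths)                                     # near layer: lowest depth D0
        d1 = max(d for d in depths if d0 <= d <= d0 + D)     # far layer: highest depth in [D0, D0+D]
        layers[(u, v)] = (d0, d1)
    return layers

# Example: three points project to pixel (10, 12).
layers = near_far_layers({(10, 12): [37, 39, 52]}, D=4)      # -> {(10, 12): (37, 39)}
```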
  • the generated videos have the following characteristics: geometry: WxH YUV420-8bit, where the geometry video is monochromatic, and texture: WxH YUV420-8bit, where the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
  • the padding process aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression.
  • TMC2v0 uses a simple padding strategy, which proceeds as follows:
  • Each block of TxT (e.g., 16x16) pixels is processed independently.
  • If the block is empty (i.e., all its pixels belong to the empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order.
  • If the block has both empty and filled pixels (i.e. a so-called edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
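  • A sketch of that padding, operating on a single image component plane, assuming the picture dimensions are multiples of T and that an occupancy mask marks the filled pixels (names are illustrative):

```python
import numpy as np

def pad_image(image, occupied, T=16):
    """Fill unoccupied pixels block by block: empty blocks copy the last row of the
    previous TxT block in raster order, edge blocks iteratively average filled neighbours."""
    img = image.astype(np.float64).copy()
    occ = occupied.copy()
    H, W = img.shape
    prev = None
    for by in range(0, H, T):
        for bx in range(0, W, T):
            blk = (slice(by, by + T), slice(bx, bx + T))
            if not occ[blk].any():                     # fully empty block
                if prev is not None:
                    img[blk] = img[prev][-1:, :]       # copy last row of the previous block
            elif not occ[blk].all():                   # edge block: both empty and filled pixels
                while not occ[blk].all():
                    for y, x in zip(*np.where(~occ[blk])):
                        y0, x0 = by + y, bx + x
                        ny = slice(max(y0 - 1, 0), min(y0 + 2, H))
                        nx = slice(max(x0 - 1, 0), min(x0 + 2, W))
                        if occ[ny, nx].any():          # average of the non-empty neighbours
                            img[y0, x0] = img[ny, nx][occ[ny, nx]].mean()
                            occ[y0, x0] = True
            prev = blk
    return img
```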
  • the generated images/layers are stored as video frames and compressed using a video codec.
  • mapping information providing for each TxT block its associated patch index is encoded as follows:
  • Let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block.
  • the order in the list is the same as the order used to encode the 2D bounding boxes.
  • L is called the list of candidate patches.
  • the empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.
  • the occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • one cell of the 2D grid produces a pixel during the image generation process.
  • an occupancy map may be considered to comprise occupancy patches.
  • Occupancy patches may be considered to have block-aligned edges according to the auxiliary information described in the previous section.
  • An occupancy patch hence comprises occupancy information for the corresponding texture and geometry patches.
  • the occupancy map compression leverages the auxiliary information described in previous section, in order to detect the empty TxT blocks (i.e., blocks with patch index 0).
  • the remaining blocks are encoded as follows.
  • the occupancy map could be encoded with a precision of B0xB0 blocks.
  • the generated binary image covers only a single colour plane. However, given the prevalence of 4:2:0 codecs, it may be desirable to extend the image with “neutral” or fixed value chroma planes (e.g. adding chroma planes with all sample values equal to 0 or 128, assuming the use of an 8-bit codec).
  • the obtained video frame is compressed by using a video codec with lossless coding tool support (e.g., AVC, HEVC RExt, HEVC-SCC).
  • a video codec with lossless coding tool support e.g., AVC, HEVC RExt, HEVC-SCC.
  • The occupancy map is simplified by detecting empty and non-empty blocks of resolution TxT in the occupancy map, and only for the non-empty blocks the patch index is encoded as follows:
  • a list of candidate patches is created for each TxT block by considering all the patches that contain that block.
  • the list of candidates is sorted in the reverse order of the patches.
  • the point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers.
  • the 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P could be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows: δ(u, v) = δ0 + g(u, v), s(u, v) = s0 − u0 + u, r(u, v) = r0 − v0 + v, where g(u, v) is the luma component of the geometry image.
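  • A sketch of that per-pixel reconstruction, with g(u, v) taken as the decoded geometry (luma) value at (u, v) and the patch parameters named as in the text (the mapping of the (d, s, r) triple back to XYZ axes is omitted):

```python
def reconstruct_point(u, v, g, delta0, s0, r0, u0, v0):
    """Recover depth, tangential and bi-tangential coordinates of the point
    projected to pixel (u, v) of a patch (illustrative sketch)."""
    d = delta0 + g[v][u]        # depth:          d(u, v) = delta0 + g(u, v)
    s = s0 - u0 + u             # tangential:     s(u, v) = s0 - u0 + u
    r = r0 - v0 + v             # bi-tangential:  r(u, v) = r0 - v0 + v
    return d, s, r
```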
  • the smoothing procedure aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts.
  • the implemented approach moves boundary points to the centroid of their nearest neighbors.
  • the texture values are directly read from the texture images.
  • V-PCC provides a procedure for compressing a time-varying volumetric scene/object by projecting 3D surfaces onto a number of pre-defined 2D planes, which may then be compressed using regular 2D video compression algorithms.
  • the projection is presented using different patches, where each set of patches may represent a specific object or specific parts of a scene.
  • the bitstream contains one or more layer pairs, each layer pair having a texture layer and a depth layer.
  • Each layer contains one or more consecutive CVSes in a unique single independent video coding layer, such as an HEVC independent layer, with each CVS containing a sequence of coded pictures.
  • Each layer pair represents a sequence of atlases.
  • An atlas is represented by a decoded picture pair in each access unit, with a texture component picture and a depth component picture.
  • the size of an atlas is equal to the size of the decoded picture of the texture layer representing the atlas.
  • the depth decoded picture size may be equal to the decoded picture size of the corresponding texture layer of the same layer pair. Decoded picture sizes may vary for different layer pairs in the same bitstream.
  • a patch may have an arbitrary shape, but in many embodiments it may be preferable to consider the patch as a rectangular region that is represented in both an atlas and a view representation. The size of a particular patch may be the same in both the atlas representation and the view representation.
  • An atlas contains an aggregation of one or more patches from one or more view representations, with a corresponding texture component and depth component.
  • the atlas patch occupancy map generator process outputs an atlas patch occupancy map.
  • the atlas patch occupancy map is a 2D array of the same size as the atlas, with each value indicating the index of the patch to which the co-located sample in the atlas corresponds, if any, or otherwise indicates that the sample location has an invalid value.
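  • A sketch of generating such a map from decoded patch parameters, assuming rectangular patches placed in the atlas and an invalid value of -1 for samples not covered by any patch (the reference architecture additionally uses the depth picture to mask invalid samples, which is omitted here):

```python
import numpy as np

INVALID = -1

def atlas_patch_occupancy_map(atlas_width, atlas_height, patches):
    """patches: list of dicts with keys pos_x, pos_y, width, height (in samples).
    Later patches overwrite earlier ones where they overlap."""
    occupancy = np.full((atlas_height, atlas_width), INVALID, dtype=np.int32)
    for index, p in enumerate(patches):
        y, x = p["pos_y"], p["pos_x"]
        occupancy[y:y + p["height"], x:x + p["width"]] = index
    return occupancy
```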
  • a view representation represents a field of view of a 3D scene for particular camera parameters, for the texture and depth component.
  • View representations may be omnidirectional or perspective, and may use different projection formats, such as equirectangular projection or cube map projection.
  • the texture and depth components of a view representation may use the same projection format and have the same size.
  • the decoding process may be illustrated by Figure 5, which shows a decoder reference architecture for immersive video as defined in N18576.
  • the bitstream comprises a CVS for each texture and depth layer of a layer pair, which is input to a 2D video decoder, such as an HEVC decoder, which outputs a sequence of decoded picture pairs of synchronized decoded texture pictures (A) and decoded depth pictures (B).
  • Each decoded picture pair represents an atlas (C).
  • the metadata is input to a metadata parser which outputs an atlas parameters list (D), and camera parameters list (E).
  • the atlas patch occupancy map generator takes as inputs the depth decoded picture (B) and the atlas parameters list (D) and outputs an atlas patch occupancy map (F).
  • a hypothetical reference renderer takes as inputs one or more decoded atlases (C), the atlas parameters list (D), the camera parameters list (E), the atlas patch occupancy map sequence (F), and the viewer position and orientation, and outputs a viewport.
  • the atlas parameters are defined in atlas_parameters syntax structure, where a plurality of parameters are defined for each patch, such as disclosed in Table 1 below.
  • Such patch-specific auxiliary information indicates e.g. the size of the patch, its location within the atlas and in the view, and the orientation (i.e. rotation and/or mirroring) of the patch.
  • a starting point for the method may be considered, for example, that a 3D representation of at least one object, such as a point cloud frame or a 3D mesh, is input in an encoder.
  • the method, which is disclosed in Figure 6, comprises projecting (600) a 3D representation of a scene onto a plurality of 2D patches; generating (602) at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; aggregating (604) said plurality of 2D patches with the corresponding texture component picture into an atlas; determining (606) a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and signaling, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
  • the method targets finding patches which do not change, or whose change is smaller than a specific threshold, between atlas frames at a first time instance and a second time instance.
  • the similarity may be defined based on comparison of the content of texture pixels between the first and at least the second temporally previous patch, or the content of the geometry information, if present, between the first and at least the second patch.
  • Patches for which at least one temporally previous patch with sufficient similarity is found will be signalled to be subject to a faster encoding/decoding/rendering process by using said at least one temporally previous patch as a prediction reference for said patch. Thereby, the bitrate required for encoding such patches is reduced, and the execution time of the encoder/decoder/rendering processes is reduced as well.
  • this aspect relates to the signaling of only the auxiliary patch information, which may be encoded into a separate bitstream, which may be stored or transmitted to a decoder as such.
  • the geometry image, the texture image and the occupancy map may each be encoded into separate bitstreams, as well.
  • encoding the geometry (or depth) image into or along the bitstream with the texture image is only optional.
  • the auxiliary patch information may be encoded into a common bitstream with one or more of the geometry image, the texture image or the occupancy map.
  • FIG. 7 shows an example of a decoding method comprising receiving (700) a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; receiving (702), either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; decoding, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; and decoding at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch.
  • the decoder receives and decodes texture image, occupancy map, auxiliary patch information of the plurality of 2D patches and optionally the geometry image, as well as the atlas aggregation of said plurality of 2D patches with the corresponding texture component picture and the corresponding geometry component picture, received either in a common bitstream or in two or more separate bitstreams.
  • the decoder decodes, among other auxiliary patch information, also a difference value for at least a first patch belonging to a first part of the scene, wherein the difference value is based on difference between content of the texture component picture or the geometry component picture of the first patch at a first time instance and content of a texture component picture or a geometry component picture of at least a second patch at a second time instance prior to the first time instance, wherein the second patch is sufficiently similar to the first patch, i.e. the second patch matches to the first patch according to at least one predefined similarity criterium. If the difference value is smaller than a predetermined threshold value, an indication about at least one temporally previous patch, i.e. at least the second patch, to be used as a prediction reference for said first patch is decoded from the auxiliary patch information, and at least the first patch of the texture component picture is decoded by using said at least one temporally previous patch, i.e. at least the second patch, as a prediction reference for said first patch.
  • the 3D representation of the scene is then reconstructed.
  • said predefined similarity criteria comprise one or more of the following criteria:
  • the first and the second patches having at least one similar auxiliary parameter; the second patch comprising at least one image block being used as a prediction reference for at least one image block of the first patch.
  • said first and second patch may represent the same object at different time instances (i.e. having different temporal time stamps), or said first and second patch may have similar auxiliary parameters, e.g. size or rotation, in order to be selected.
  • If the image blocks of the second patch are defined to be used as reference blocks for motion vector prediction for the image blocks of the first patch, it may provide an indication that there may be sufficient similarity between the first and the second patch.
  • the methods and embodiments as disclosed herein may be generally applicable to scenes presented by a plurality of patches, wherein each patch presents a part of the scene. Moreover, any object belonging to the scene may be presented by a set of patches and any specific part of the scene may be presented by a group of patches as well.
  • the current patch is processed by comparing it with at least one patch that belongs to the same part of the scene (e.g. same object) and is preferably located close to the world coordinates of the current patch.
  • the comparison may include one or more further patches at one or at several time instances before the current time instance of the first patch.
  • the decision of which patches are to be compared to the current patch can be dynamic and a radius of temporal frame number for searching similar patches may be introduced in the process. Moreover, the number of patches to be included in the comparison and their time instances may vary and they may be defined as initializing parameters for the embodiments.
  • said difference value is determined as a mean square error between pixel values of the texture component pictures or the geometry component pictures of said first and second patches.
  • MSE mean square error
  • said predetermined threshold value is zero. In such case, only the patches which have not changed at all may be allowed to use the at least one temporally previous patch as a prediction reference for said patch.
  • said indication is configured to be encoded as a flag at least for the first patch, wherein said flag indicates whether at least one temporally previous patch is to be used as a prediction reference for said first patch.
  • a flag, which may be referred to herein as patch_changed_flag, is used for indicating whether at least one temporally previous patch is to be used as a prediction reference for said patch. For those patches that are determined to use one or more temporally previous patches as a prediction reference, significant bit savings in the signalling are achieved by indicating this with a single flag.
  • a threshold may be defined for the difference value reflecting the change between the current patch and the previously encoded patch, and if the difference value, such as the MSE, is smaller than the specified threshold, then patch_changed_flag is set equal to 0. If the threshold value is set as zero, only the patches which have not changed at all qualify for having the patch_changed_flag set equal to 0. For those patches where the difference value between the current patch and the previously encoded patch is higher than the specified threshold, the patch_changed_flag is set equal to 1.
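  • A sketch of this encoder-side decision, using the MSE as the difference value; the candidate search over temporally previous matching patches and the equal-size similarity check are illustrative assumptions:

```python
import numpy as np

def mse(a, b):
    """Mean square error between two equally sized patches (texture or geometry)."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def decide_patch_changed_flag(current_patch, previous_patches, threshold=0.0):
    """Return (patch_changed_flag, reference_index); flag 0 means a temporally
    previous patch is reused as a prediction reference. With threshold 0 only
    patches that have not changed at all qualify."""
    best_index, best_diff = None, None
    for index, prev in enumerate(previous_patches):
        if prev.shape != current_patch.shape:        # e.g. a size-based similarity criterion
            continue
        diff = mse(current_patch, prev)
        if best_diff is None or diff < best_diff:
            best_index, best_diff = index, diff
    if best_diff is not None and best_diff <= threshold:
        return 0, best_index                          # patch_changed_flag = 0
    return 1, None                                    # patch_changed_flag = 1
```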
  • a signalling of the at least one temporally previous patch to be used as a prediction reference for said first patch is configured to be carried out by at least one syntax element included in an atlas parameter syntax structure or any other suitable syntax structure for ISO/IEC 23090-5 (or similar volumetric video coding technology).
  • Table 2 shows an example of including said at least one syntax element in the atlas parameter syntax structure.
  • for each patch[i] there may be a further syntax element, a patch-specific flag patch_changed_flag, indicating whether at least one temporally previous patch is to be used as a prediction reference for said patch.
  • in response to the difference value between the current patch and the previously encoded patch being higher than the specified threshold value, the apparatus is configured to set the value of the flag for said particular patch as one.
  • the patch_changed_flag = 1
  • conventional encoding and/or decoding and/or rendering processes may be applied.
  • in the atlas parameter syntax structure according to Table 2, this is indicated by the previously used syntax elements patch_pos_in_atlas_x[a][i], patch_pos_in_atlas_y[a][i] and patch_rotation[a][i], which are only triggered to be used by the else condition, i.e. when the value of the patch_changed_flag is not zero.
  • a skip mode of encoding is indicated to be used for said first patch.
  • all image blocks which are fully or partially covered by said patch are to be coded with skip mode indicating the reference block from where the patch is to be predicted.
  • the image blocks which are partially covered by said patch should not include content from any other patch.
  • the skip mode is to refer to the frame at temporal time T1 and to the specific location where the respective patch is located. This prevents the encoder from searching for an appropriate mode for encoding the current block and hence decreases the execution time for encoding the atlas image.
  • the metadata indicating the location and rotation of the patch is preferably excluded from the transmission of the metadata information.
  • such metadata includes, but is not limited to, the syntax element patch_rotation.
  • a new parameter syntax element may be introduced as: patch_affine_change.
  • Such a parameter indicates the affine change between the current patch and the reference patch to be signalled to the decoder.
  • this may be indicated by two syntax parameters: patch_ref_pos_in_atlas_x[ a ][ p ] and patch_ref_pos_in_atlas_y[ a ][ p ], which specify the horizontal and vertical coordinates in luma samples, respectively, of the top-left corner of the p-th patch of the a-th atlas.
  • a second syntax element, indicative of a prediction residual, may be included in the atlas parameter syntax structure, wherein the value of the second syntax element for said first patch is set as zero.
  • in response to said difference value being greater than zero but smaller than said predetermined threshold value, the apparatus is configured to include the second syntax element in the atlas parameter syntax structure and set the value of the second syntax element for said first patch as one.
  • a third syntax element may be included in the atlas parameter syntax structure, wherein said third syntax element indicates a prediction residual to be added to the prediction based on the at least one temporally previous patch.
  • a syntax element for the residual information such as patch_pred_res, may be created and transmitted along the bitstream.
  • the differences between the current patch and reference patch are encoded in the bitstream and signalled as respective residual metadata.
  • the image blocks which are partially covered by said patch should not include content from any other patch.
  • the blocks are signalled with skip mode equal to 1 and hence the blocks are decoded as predicted from the respective reference blocks.
  • the parameter syntax element patch_affine_change may be decoded and utilized to determine the accurate orientation of the current patch as compared to the orientation of the reference patch.
  • the spatial and temporal location of the reference patch may be calculated based on the indicative parameters, such as the parameters patch_ref_pos_in_atlas_x[ a ][ p ] and patch_ref_pos_in_atlas_y[ a ][ p ] described above.
  • the decision to set the patch_changed_flag equal to 0 has been based on having allowable differences (i.e. a difference value greater than zero but smaller than the predetermined threshold value).
  • FIG. 8a The operation of the encoder is shown in Figure 8a, where a plurality of patches are input (800) in the encoder. A difference value of the pixels of the texture or geometry component images between time instances Tk and Ti is calculated (802) for one or more patches. Then for each patch, it is determined (804) whether the difference value is smaller than a predetermined threshold value. If not, the patch_changed_flag is set to 1 (806) and the conventional encoding process is applied for said patch, e.g. by defining the patch-specific syntax elements patch_pos_in_atlas_x[a][i], patch_pos_in_atlas_y[a][i] and patch_rotation[a][i] in the atlas parameter syntax structure.
  • otherwise, the patch_changed_flag is set to 0 (808). If the threshold value is greater than 0, it may be further examined (810) whether the difference value is exactly zero. If yes, the content of the texture or geometry component images between time instances Tk and Ti is the same for said patch, and no residual exists; therefore, the patch_pred_res_flag is set as zero (812). On the other hand, if there is a residual, the patch_pred_res_flag is set as one and the value of the residual is determined (814) to be further included in the signalling. A minimal sketch of this decision logic is given after this list.
  • FIG. 8b The operation of the decoder is shown in Figure 8b, where encoded patches and auxiliary patch information are input (820) in the decoder.
  • the patch is decoded by a conventional decoding process, e.g. by obtaining the patch-specific syntax elements patch_pos_in_atlas_x[a][i], patch_pos_in_atlas_y[a][i] and patch_rotation[a][i] from the atlas parameter syntax structure.
  • the decoder obtains (830) the residual value and decodes (832) the patch based on the reference to at least one temporally previous patch and applies the residual value thereto.
  • the patches which are indicated with patch_changed_flag equal to 0 are to be rendered first as they do not need any information to be decoded other than the signal indicating that they are equal to another patch which has already been decoded.
  • the process analysing the similarity between the patches may be limited to a specific number of frames prior to the current frame.
  • Such limitation may be dictated based on the codec requirements e.g. GOP (group of pictures) length, where the GOP size is indicated in the codec which encodes the atlas image.
  • such limitation may be defined based on a specific and fixed number of frames e.g. 32 regardless of GOP size.
  • such limitation may be defined based on the amount of motion in the scene, e.g. if the scene has high motion, the frame number should be N frames and if the scene has low motion, the frame number should be M frames, where N < M.
  • Two or more of the embodiments as described above may be combined, and they may be introduced as one or more indicators in any suitable syntax structure for ISO/IEC 23090-5 (or similar volumetric video coding technology).
  • the embodiments as described herein enable reducing the required bitrate for encoding the same content. Moreover, the embodiments enable reducing the complexity of patch presentation. Furthermore, the embodiments enable reducing the execution time both at the encoder side and at the rendering side.
  • the embodiments relating to the encoding aspects may be implemented in an apparatus comprising: means for projecting a 3D representation of a scene onto a plurality of 2D patches; means for generating at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; means for aggregating said plurality of 2D patches with the corresponding texture component picture into an atlas; means for determining a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and means for signaling, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
  • the embodiments relating to the encoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: project a 3D representation of a scene onto a plurality of 2D patches; generate at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; aggregate said plurality of 2D patches with the corresponding texture component picture into an atlas; determine a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and signal, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
  • the embodiments relating to the decoding aspects may be implemented in an apparatus comprising means for receiving a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; means for receiving, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; means for decoding, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; means for decoding at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch; and means for decoding the geometry component picture, the occupancy map and the auxiliary patch information for reconstructing a 3D representation of said scene.
  • the embodiments relating to the decoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform receive a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; receive, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; decode, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; decode at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch; and decode the geometry component picture, the occupancy map and the auxiliary patch information for reconstructing a 3D representation of said scene.
  • Such apparatuses may comprise e.g. the functional units disclosed in any of the Figures 1a, 1b, 2a and 2b for implementing the embodiments.
  • said encoding may comprise one or more of the following: encoding source image data into a bitstream, encapsulating the encoded bitstream in a container file and/or in packet(s) or stream(s) of a communication protocol, and announcing or describing the bitstream in a content description, such as the Media Presentation Description (MPD) of ISO/IEC 23009-1 (known as MPEG-DASH) or the IETF Session Description Protocol (SDP).
  • MPD Media Presentation Description
  • SDP IETF Session Description Protocol
  • said decoding may comprise one or more of the following: decoding image data from a bitstream, decapsulating the bitstream from a container file and/or from packet(s) or stream(s) of a communication protocol, and parsing a content description of the bitstream.
  • embodiments can be similarly realized when encoding or decoding texture pictures, geometry pictures, (optionally) attribute pictures and auxiliary patch information into or from several bitstreams that are associated with each other, e.g. by metadata in a container file or media presentation description for streaming.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
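
To make the patch-level decision flow of the preceding list concrete, the following Python sketch models the encoder-side logic of Figure 8a under simplifying assumptions: patches are plain numpy arrays, the difference value is the MSE of co-located samples, and the returned dictionary merely stands in for the patch_changed_flag / patch_pred_res_flag / patch_pred_res syntax elements. It is an illustrative sketch only, not the normative ISO/IEC 23090-5 syntax or any particular test-model implementation, and the helper names are hypothetical.

```python
import numpy as np

def patch_mse(cur: np.ndarray, ref: np.ndarray) -> float:
    """Difference value: mean square error between co-located pixel values of
    the texture (or geometry) component pictures of two patches."""
    return float(np.mean((cur.astype(np.float64) - ref.astype(np.float64)) ** 2))

def decide_patch_flags(cur_patch: np.ndarray, ref_patch: np.ndarray,
                       threshold: float) -> dict:
    """Minimal model of the encoder-side decision of Figure 8a.

    patch_changed_flag == 0 means the temporally previous patch is reused as a
    prediction reference; patch_pred_res_flag == 1 additionally signals a
    residual. With threshold == 0 only unchanged patches qualify.
    """
    diff = patch_mse(cur_patch, ref_patch)
    if diff > threshold:
        # Patch has changed too much: conventional per-patch signalling
        # (patch_pos_in_atlas_x/y, patch_rotation, ...) applies instead.
        return {"patch_changed_flag": 1}

    flags = {"patch_changed_flag": 0}
    if diff == 0.0:
        flags["patch_pred_res_flag"] = 0      # identical content, no residual
    else:
        flags["patch_pred_res_flag"] = 1      # small change: signal a residual
        flags["patch_pred_res"] = cur_patch.astype(np.int32) - ref_patch.astype(np.int32)
    return flags

if __name__ == "__main__":
    ref = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
    cur = ref.copy()
    cur[0, 0] ^= 1                            # flip one bit: a tiny change
    print(decide_patch_flags(cur, ref, threshold=4.0))
```

In an actual encoder the same decision would also trigger skip-mode coding of the image blocks covered by the patch and suppression of the per-patch position and rotation metadata, as described in the list above.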

Abstract

A method according to an embodiment comprises: projecting a 3D representation of a scene onto a plurality of 2D patches (600); generating at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches (602); aggregating said plurality of 2D patches with the corresponding texture component picture into an atlas (604); determining a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterion (606); and signaling, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch (608).

Description

AN APPARATUS, A METHOD AND A COMPUTER PROGRAM FOR
VOLUMETRIC VIDEO
TECHNICAL FIELD
[0001 ] The present invention relates to an apparatus, a method and a computer program for volumetric video coding.
BACKGROUND
[0002] Volumetric video data represents a three-dimensional scene or object and can be used as input for virtual reality (VR), augmented reality (AR) and mixed reality (MR) applications. Such data describes the geometry, e.g. shape, size, position in three- dimensional (3D) space, and respective attributes, e.g. colour, opacity, reflectance and any possible temporal changes of the geometry and attributes at given time instances. Volumetric video is either generated from 3D models through computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. multi camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible.
[0003] Typical representation formats for such volumetric data are triangle meshes, point clouds (PCs), or voxel arrays. In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. One way to compress a time-varying volumetric scene/object is to project 3D surfaces to some number of pre defined 2D planes. Regular 2D video compression algorithms can then be used to compress various aspects of the projected surfaces. For example, MPEG Video-Based Point Cloud Coding (V-PCC) provides a procedure for compressing a time-varying volumetric scene/object by projecting 3D surfaces onto a number of pre-defined 2D planes, which may then be compressed using regular 2D video compression algorithms. The projection is presented using different patches, where each set of patches may represent a specific object or specific parts of a scene.
[0004] Depending on the viewport to be outputted, the patches may be aligned differently, i.e. different vertical and horizontal positions, sizes, rotations and/or mirroring. The plurality of parameters are defined separately for each patch. Transmitting such information per patch takes a relatively high amount of bits, and therefore increases the size of the encoded content bitstream that needs to be transmitted for the presentation of the content to the decoder.
SUMMARY
[0005] Now, an improved method and technical equipment implementing the method has been invented, by which the above problems are alleviated. Various aspects include a method, an apparatus and a computer readable medium comprising a computer program, or a signal stored therein, which are characterized by what is stated in the independent claims. Various details of the embodiments are disclosed in the dependent claims and in the corresponding images and description.
[0006] The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
[0007] According to a first aspect, there is provided a method comprising projecting a 3D representation of a scene onto a plurality of 2D patches; generating at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; aggregating said plurality of 2D patches with the corresponding texture component picture into an atlas; determining a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and signaling, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch. [0008] An apparatus according to a second aspect comprises at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: project a 3D representation of a scene onto a plurality of 2D patches; generate at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; aggregate said plurality of 2D patches with the corresponding texture component picture into an atlas; determine a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and signal, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
[0009] An apparatus according to a third aspect comprises: means for projecting a 3D representation of a scene onto a plurality of 2D patches; means for generating at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; means for aggregating said plurality of 2D patches with the corresponding texture component picture into an atlas; means for determining a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and means for signaling, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
[0010] According to an embodiment, said difference value is determined as a mean square error between pixel values of the texture component pictures or the geometry component pictures of said first and second patches. [0011] According to an embodiment, said predefined similarity criteria comprise one or more of the following criteria:
- the first and the second patches belonging to the same part of the scene;
- the first and the second patches having at least one similar auxiliary parameter;
- the second patch comprising at least one image block being used as a prediction reference for at least one image block of the first patch.
[0012] According to an embodiment, a skip mode of encoding is indicated to be used for said first patch.
[0013] According to an embodiment, said indication is configured to be encoded as a flag at least for the first patch, wherein said flag indicates whether at least the second patch is to be used as a prediction reference for said first patch.
[0014] According to an embodiment, a signalling of at least the second patch to be used as a prediction reference for said first patch is configured to be carried out by at least one syntax element included in an atlas parameter syntax structure.
[0015] According to an embodiment, in response to said difference value being zero, the apparatus is configured to include a second syntax element, indicative of a prediction residual, in the atlas parameter syntax structure and set a value of the second syntax element for said first patch as zero.
[0016] According to an embodiment, in response to said difference value being greater than zero but smaller than said predetermined threshold value, the apparatus is configured to include the second syntax element in the atlas parameter syntax structure and set the value of the second syntax element for said first patch as one.
[0017] According to an embodiment, the apparatus is configured to include a third syntax element in the atlas parameter syntax structure, wherein said third syntax element indicates a prediction residual to be added to the prediction based on the at least the second patch.
[0018] A method according to a fourth aspect comprises: receiving a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; receiving, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; decoding, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; decoding at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch; and decoding the geometry component picture, the occupancy map and the auxiliary patch information for reconstructing a 3D representation of said scene.
[0019] An apparatus according to a fifth aspect comprises at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; receive, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; decode, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; decode at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch; and decode the geometry component picture, the occupancy map and the auxiliary patch information for reconstructing a 3D representation of said scene.
[0020] An apparatus according to a sixth aspect comprises: means for receiving a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; means for receiving, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; means for decoding, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; means for decoding at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch; and means for decoding the geometry component picture, the occupancy map and the auxiliary patch information for reconstructing a 3D representation of said scene.
[0021] According to an embodiment, the apparatus is configured to decode the indication about a skip mode of decoding to be used for said first patch, wherein said indication includes a reference to one or more reference image blocks of at least the second patch to be used instead of the first patch. [0022] According to an embodiment, said indication at least for the first patch is configured to be decoded from a flag, wherein said flag indicates whether at least the second patch is to be used as a prediction reference for said first patch.
[0023] According to an embodiment, the apparatus is configured to decode the indication about at least the second patch to be used as a prediction reference for said first patch from at least one syntax element included in an atlas parameter syntax structure. [0024] According to an embodiment, the apparatus is configured to decode a second syntax element, indicative of a prediction residual, from the atlas parameter syntax structure, and in response to the value of the second syntax element for said first patch being one, decode a third syntax element from the atlas parameter syntax structure, wherein said third syntax element indicates a prediction residual to be added to the prediction based on at least the second patch.
[0025] Computer readable storage media according to further aspects comprise code for use by an apparatus, which when executed by a processor, causes the apparatus to perform the above methods.
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] For a more complete understanding of the example embodiments, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
[0027] Figs. 1a and 1b show an encoder and decoder for encoding and decoding 2D pictures;
[0028] Figs. 2a and 2b show a compression and a decompression process for 3D volumetric video;
[0029] Figs. 3a and 3b show an example of a point cloud frame and a projection of points to a corresponding plane of a point cloud bounding box;
[0030] Fig. 4 shows an illustrative example of the relationship between atlases, patches and view representations;
[0031] Fig. 5 shows a decoder reference architecture for immersive video;
[0032] Fig. 6 shows a flow chart for encoding patch information according to an embodiment; [0033] Fig. 7 shows a flow chart for decoding patch information according to an embodiment; and
[0034] Figs. 8a and 8b show some embodiments relating to the encoding and decoding of the patch information.
DETAILED DESCRIPTION OF SOME EXAMPLE EMBODIMENTS
[0035] In the following, several embodiments of the invention will be described in the context of point cloud models for volumetric video coding. It is to be noted, however, that the invention is not limited to specific scene models or specific coding technologies. In fact, the different embodiments have applications in any environment where coding of volumetric scene data is required.
[0036] A video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can uncompress the compressed video representation back into a viewable form. An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at lower bitrate).
[0037] Volumetric video may be captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.
[0038] Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames. Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example. [0039] Volumetric video data represents a three-dimensional scene or object, and thus such data can be viewed from any viewpoint. Volumetric video data can be used as an input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications. Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g. color, opacity, reflectance, ...), together with any possible temporal changes of the geometry and attributes at given time instances (e.g. frames in 2D video). Volumetric video is either generated from 3D models, i.e. computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera, a laser scan, a combination of video and dedicated depths sensors, etc. Also, a combination of CGI and real-world data is possible. Examples of representation formats for such volumetric data are triangle meshes, point clouds, or voxel. Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time.
[0040] Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data.
In 3D graphics, polygonal meshes are extensively used. Point clouds on the other hand are well suited for applications, such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding this 3D data as a set of texture and depth map as is the case in the multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps, and multi-level surface maps.
[0041 ] In 3D point clouds, each point of each 3D surface is described as a 3D point with color and/or other attribute information such as surface normal or material reflectance. Point cloud is a set of data points in a coordinate system, for example in a three- dimensional coordinate system being defined by X, Y, and Z coordinates. The points may represent an external surface of an object in the screen space, e.g. in a three-dimensional space.
[0042] In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression of the presentations becomes fundamental. Standard volumetric video representation formats, such as point clouds, meshes, voxel, suffer from poor temporal compression performance. Identifying correspondences for motion-compensation in 3D-space is an ill-defined problem, as both, geometry and respective attributes may change. For example, temporal successive “frames” do not necessarily have the same number of meshes, points or voxel. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview with depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.
[0043] Instead of the above-mentioned approach, a 3D scene, represented as meshes, points, and/or voxel, can be projected onto one, or more, geometries. These geometries may be “unfolded” or packed onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information may be transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).
[0044] Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency can be increased greatly. Using geometry-projections instead of 2D-video based approaches based on multiview and depth, provides a better coverage of the scene (or object). Thus, 6DOF capabilities are improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time compression/decompression of the projected planes. The projection and the reverse projection steps are of low complexity. [0045] Figs. 1a and 1b show an encoder and decoder for encoding and decoding the 2D texture pictures, geometry pictures and/or auxiliary pictures. A video codec consists of an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. Typically, the encoder discards and/or loses some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate). An example of an encoding process is illustrated in Figure 1a. Figure 1a illustrates an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T-1); a quantization (Q) and inverse quantization (Q-1); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS) and filtering (F).
[0046] An example of a decoding process is illustrated in Figure 1b. Figure 1b illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T-1); an inverse quantization (Q-1); an entropy decoding (E-1); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F). [0047] Many hybrid video encoders encode the video information in two phases. Firstly, pixel values in a certain picture area (or “block”) are predicted for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly, the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g. Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and size of the resulting coded video representation (file size or transmission bitrate). Video codecs may also provide a transform skip mode, which the encoders may choose to use. In the transform skip mode, the prediction error is coded in a sample domain, for example by deriving a sample-wise difference value relative to certain adjacent samples and coding the sample-wise difference value with an entropy coder.
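As an illustration of the two-phase coding described in paragraph [0047], the following Python sketch forms a prediction error for one block, transforms and quantises it, and reconstructs the block as a decoder would. It is a generic, simplified example (an orthonormal DCT with a single uniform quantisation step), not the transform or quantiser of any particular codec.

```python
import numpy as np
from scipy.fft import dctn, idctn

def code_block(original: np.ndarray, predicted: np.ndarray, qstep: float):
    """Encode one block in two phases: form the prediction error, then
    transform and quantise it. Returns the quantised coefficients and the
    reconstructed block the decoder would obtain."""
    residual = original.astype(np.float64) - predicted.astype(np.float64)
    coeffs = dctn(residual, norm="ortho")       # 2-D DCT of the prediction error
    q_coeffs = np.round(coeffs / qstep)         # uniform quantisation (the lossy step)
    rec_residual = idctn(q_coeffs * qstep, norm="ortho")
    reconstructed = np.clip(predicted + rec_residual, 0, 255)
    return q_coeffs, reconstructed

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    orig = rng.integers(0, 256, (8, 8)).astype(np.float64)
    pred = orig + rng.normal(0, 2, (8, 8))      # imperfect prediction
    q, rec = code_block(orig, pred, qstep=8.0)
    print("non-zero coefficients:", int(np.count_nonzero(q)))
    print("max reconstruction error:", float(np.abs(rec - orig).max()))
```

Raising the quantisation step reduces the number of non-zero coefficients (and hence the bitrate) at the expense of reconstruction accuracy, which is the fidelity trade-off described above.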
[0048] Many video encoders partition a picture into blocks along a block grid. For example, in the High Efficiency Video Coding (HEVC) standard, the following partitioning and definitions are used. A coding block may be defined as an NxN block of samples for some value of N such that the division of a coding tree block into coding blocks is a partitioning. A coding tree block (CTB) may be defined as an NxN block of samples for some value of N such that the division of a component into coding tree blocks is a partitioning. A coding tree unit (CTU) may be defined as a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples. A coding unit (CU) may be defined as a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples. A CU with the maximum allowed size may be named as LCU (largest coding unit) or coding tree unit (CTU) and the video picture is divided into non-overlapping LCUs.
[0049] In HEVC, a picture can be partitioned in tiles, which are rectangular and contain an integer number of LCUs. In HEVC, the partitioning to tiles forms a regular grid, where heights and widths of tiles differ from each other by one LCU at the maximum. In HEVC, a slice is defined to be an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit. In HEVC, a slice segment is defined to be an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL unit. The division of each picture into slice segments is a partitioning. In HEVC, an independent slice segment is defined to be a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment, and a dependent slice segment is defined to be a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order. In HEVC, a slice header is defined to be the slice segment header of the independent slice segment that is a current slice segment or is the independent slice segment that precedes a current dependent slice segment, and a slice segment header is defined to be a part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment. The CUs are scanned in the raster scan order of LCUs within tiles or within a picture, if tiles are not in use. Within an LCU, the CUs have a specific scan order.
[0050] Entropy coding/decoding may be performed in many ways. For example, context-based coding/decoding may be applied, where in both the encoder and the decoder modify the context state of a coding parameter based on previously coded/decoded coding parameters. Context-based coding may for example be context adaptive binary arithmetic coding (CABAC) or context-adaptive variable length coding (CAVLC) or any similar entropy coding. Entropy coding/decoding may alternatively or additionally be performed using a variable length coding scheme, such as Huffman coding/decoding or Exp-Golomb coding/decoding. Decoding of coding parameters from an entropy-coded bitstream or codewords may be referred to as parsing.
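Of the entropy coding schemes mentioned in paragraph [0050], Exp-Golomb coding is simple enough to sketch compactly. The following illustrative Python functions implement the unsigned Exp-Golomb code (N leading zeros, a one, then N further bits of the value plus one); they operate on bit strings purely for readability and are not tied to any particular bitstream writer.

```python
def ue_encode(code_num: int) -> str:
    """Unsigned Exp-Golomb: N leading zeros, then the binary representation
    of (code_num + 1), which has N + 1 bits."""
    value = code_num + 1
    n = value.bit_length() - 1
    return "0" * n + format(value, "b")

def ue_decode(bits):
    """Parse one ue(v) codeword from the front of a bit string; return the
    decoded value and the remaining bits."""
    n = 0
    while bits[n] == "0":
        n += 1
    value = int(bits[n:2 * n + 1], 2)
    return value - 1, bits[2 * n + 1:]

if __name__ == "__main__":
    stream = "".join(ue_encode(v) for v in (0, 3, 7))   # "1" + "00100" + "0001000"
    decoded = []
    while stream:
        v, stream = ue_decode(stream)
        decoded.append(v)
    print(decoded)   # [0, 3, 7]
```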
[0051 ] The phrase along the bitstream (e.g. indicating along the bitstream) may be defined to refer to out-of-band transmission, signalling, or storage in a manner that the out- of-band data is associated with the bitstream. The phrase decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signalling, or storage) that is associated with the bitstream. For example, an indication along the bitstream may refer to metadata in a container file that encapsulates the bitstream.
[0052] A first texture picture may be encoded into a bitstream, and the first texture picture may comprise a first projection of texture data of a first source volume of a scene model onto a first projection surface. The scene model may comprise a number of further source volumes.
[0053] In the projection, data on the position of the originating geometry primitive may also be determined, and based on this determination, a geometry picture may be formed. This may happen for example so that depth data is determined for each or some of the texture pixels of the texture picture. Depth data is formed such that the distance from the originating geometry primitive such as a point to the projection surface is determined for the pixels. Such depth data may be represented as a depth picture, and similarly to the texture picture, such geometry picture (such as a depth picture) may be encoded and decoded with a video codec. This first geometry picture may be seen to represent a mapping of the first projection surface to the first source volume, and the decoder may use this information to determine the location of geometry primitives in the model to be reconstructed. In order to determine the position of the first source volume and/or the first projection surface and/or the first projection in the scene model, there may be first geometry information encoded into or along the bitstream. It is noted that encoding a geometry (or depth) picture into or along the bitstream with the texture picture is only optional and arbitrary for example in the cases where the distance of all texture pixels to the projection surface is the same or there is no change in said distance between a plurality of texture pictures. Thus, a geometry (or depth) picture may be encoded into or along the bitstream with the texture picture, for example, only when there is a change in the distance of texture pixels to the projection surface.
[0054] An attribute picture may be defined as a picture that comprises additional information related to an associated texture picture. An attribute picture may for example comprise surface normal, opacity, or reflectance information for a texture picture. A geometry picture may be regarded as one type of an attribute picture, although a geometry picture may be treated as its own picture type, separate from an attribute picture.
[0055] Texture picture(s) and the respective geometry picture(s), if any, and the respective attribute picture(s) may have the same or different chroma format.
[0056] Terms texture (component) image and texture (component) picture may be used interchangeably. Terms geometry (component) image and geometry (component) picture may be used interchangeably. A specific type of a geometry image is a depth image. Embodiments described in relation to a geometry (component) image equally apply to a depth (component) image, and embodiments described in relation to a depth (component) image equally apply to a geometry (component) image. Terms attribute image and attribute picture may be used interchangeably. A geometry picture and/or an attribute picture may be treated as an auxiliary picture in video/image encoding and/or decoding. [0057] Figures 2a and 2b illustrate an overview of exemplified compression/decompression processes. The processes may be applied, for example, in Point Cloud Coding (PCC) according to MPEG standard. MPEG Video-Based Point Cloud Coding (V-PCC), Test Model a.k.a. TMC2v0 (MPEG N18017) discloses a projection-based approach for dynamic point cloud compression. For the sake of illustration, some of the processes related to video-based point cloud compression (V-PCC) compression/decompression are described briefly herein. For a comprehensive description of the model, a reference is made to MPEG N18017.
[0058] Each point cloud frame represents a dataset of points within a 3D volumetric space that have unique coordinates and attributes. An example of a point cloud frame is shown in Figure 3a.
[0059] The patch generation process decomposes the point cloud frame by converting 3D samples to 2D samples on a given projection plane using a strategy that provides the best compression. The patch generation process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. In the V-PCC test model TMC2v0, the following approach is implemented.
[0060] First, the normal per each point is estimated and the tangent plane and its corresponding normal are defined per each point, based on the point’s m nearest neighbours within a predefined search distance. A K-D tree is used to separate the data and find neighbours in the vicinity of a point p_i, and the barycenter c of that set of points is used to define the normal. The barycenter c is computed as follows:
c = (1/m) * Σ_{i=1..m} p_i
[0061] The normal is estimated from the eigen decomposition of the covariance matrix of the defined point set:
C = (1/m) * Σ_{i=1..m} (p_i - c)(p_i - c)^T
where the normal of the point is taken as the eigenvector of C associated with the smallest eigenvalue.
[0062] Based on this information each point is associated with a corresponding plane of a point cloud bounding box. Each plane is defined by a corresponding normal n_pidx with values:
- (1.0, 0.0, 0.0), - (0.0, 1.0, 0.0),
- (0.0, 0.0, 1.0),
- (-1.0, 0.0, 0.0),
- (0.0, -1.0, 0.0),
- (0.0, 0.0, -1.0). [0063] More precisely, each point is associated with the plane that has the closest normal, i.e. the plane whose normal n_pidx maximizes the dot product with the point normal n_p:
max_pidx ( n_p · n_pidx )
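The following Python sketch illustrates this segmentation step under simplifying assumptions: it uses a brute-force nearest-neighbour search instead of a K-D tree and ignores the subsequent clustering refinement, so it is only a didactic approximation of the test-model procedure.

```python
import numpy as np

AXIS_NORMALS = np.array([
    [ 1.0, 0.0, 0.0], [0.0,  1.0, 0.0], [0.0, 0.0,  1.0],
    [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0],
])

def estimate_normal(points: np.ndarray, index: int, m: int = 16) -> np.ndarray:
    """Normal of one point from the eigen decomposition of the covariance of
    its m nearest neighbours (brute force instead of a K-D tree)."""
    d = np.linalg.norm(points - points[index], axis=1)
    nearest = points[np.argsort(d)[:m]]
    c = nearest.mean(axis=0)                       # barycenter of the neighbourhood
    cov = (nearest - c).T @ (nearest - c) / len(nearest)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, 0]                           # eigenvector of the smallest eigenvalue

def plane_index(normal: np.ndarray) -> int:
    """Associate a point with the bounding-box plane whose normal maximizes
    the dot product with the point normal."""
    return int(np.argmax(AXIS_NORMALS @ normal))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pts = rng.random((200, 3))
    pts[:, 2] *= 0.01                              # near-planar cloud, normals close to +/-Z
    n = estimate_normal(pts, 0)
    print("normal:", np.round(n, 3), "-> plane index", plane_index(n))
```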
[0064] The sign of the normal is defined depending on the point’s position in relationship to the “center”. The projection estimation description is shown in Figure 3b. [0065] The initial clustering is then refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The next step consists of extracting patches by applying a connected component extraction procedure. [0066] The packing process aims at mapping the extracted patches onto a 2D grid while trying to minimize the unused space, and guaranteeing that every TxT (e.g., 16x16) block of the grid is associated with a unique patch. Herein, T is a user-defined parameter that is encoded in the bitstream and sent to the decoder.
[0067] TMC2v0 uses a simple packing strategy that iteratively tries to insert patches into a WxH grid. W and H are user defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid is temporarily doubled and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
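A compact model of this packing strategy is sketched below. It treats each patch as a solid rectangle of TxT blocks and ignores per-patch rotation, so it only illustrates the raster-scan search and the temporary doubling of H; it is not the TMC2 implementation.

```python
import numpy as np

def pack_patches(sizes, w_blocks, h_blocks):
    """Place rectangular patches (height, width in TxT blocks) into a grid by
    raster-scan search; double the grid height whenever a patch does not fit.
    Returns the patch positions (u, v) and the final, clipped grid height."""
    grid = np.zeros((h_blocks, w_blocks), dtype=bool)
    positions = []
    for ph, pw in sizes:
        placed = False
        while not placed:
            for v in range(grid.shape[0] - ph + 1):
                for u in range(grid.shape[1] - pw + 1):
                    if not grid[v:v + ph, u:u + pw].any():
                        grid[v:v + ph, u:u + pw] = True   # mark cells as used
                        positions.append((u, v))
                        placed = True
                        break
                if placed:
                    break
            if not placed:                                # no free space: double H
                grid = np.vstack([grid, np.zeros_like(grid)])
    used_rows = np.flatnonzero(grid.any(axis=1))
    final_h = int(used_rows[-1]) + 1 if used_rows.size else 0
    return positions, final_h

if __name__ == "__main__":
    print(pack_patches([(4, 6), (3, 3), (5, 2)], w_blocks=8, h_blocks=4))
```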
[0068] The image generation process exploits the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch is projected onto two images, referred to as layers. More precisely, let H(u,v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called the near layer, stores the point of H(u,v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u,v) with the highest depth within the interval [D0, D0+D], where D is a user-defined parameter that describes the surface thickness.
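For illustration, the near/far layer selection can be sketched as follows; the data layout (a list of pixel/depth pairs per patch) is hypothetical and chosen only for readability.

```python
from collections import defaultdict

def build_layers(projected_points, surface_thickness):
    """projected_points: iterable of ((u, v), depth) pairs for one patch.
    Returns two dicts mapping (u, v) -> depth for the near and far layers."""
    per_pixel = defaultdict(list)
    for (u, v), depth in projected_points:
        per_pixel[(u, v)].append(depth)

    near, far = {}, {}
    for pixel, depths in per_pixel.items():
        d0 = min(depths)                               # near layer: lowest depth D0
        near[pixel] = d0
        in_range = [d for d in depths if d0 <= d <= d0 + surface_thickness]
        far[pixel] = max(in_range)                     # far layer: highest depth in [D0, D0+D]
    return near, far

if __name__ == "__main__":
    pts = [((0, 0), 10), ((0, 0), 12), ((0, 0), 30), ((1, 0), 7)]
    print(build_layers(pts, surface_thickness=4))
    # near: {(0, 0): 10, (1, 0): 7}; far: {(0, 0): 12, (1, 0): 7}
```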
[0069] The generated videos have the following characteristics: geometry: WxH YUV420-8bit, where the geometry video is monochromatic, and texture: WxH YUV420-8bit, where the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
[0070] The padding process aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. TMC2v0 uses a simple padding strategy, which proceeds as follows:
Each block of TxT (e.g., 16x16) pixels is processed independently.
If the block is empty (i.e., all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order.
If the block is full (i.e., no empty pixels), nothing is done.
If the block has both empty and filled pixels (i.e. a so-called edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
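A simplified Python sketch of this padding strategy is given below. It assumes a single-channel image and a boolean occupancy mask of the same size; the boundary handling and iteration limits are chosen for brevity, so it approximates rather than reproduces the TMC2 behaviour.

```python
import numpy as np

def pad_image(image: np.ndarray, occupancy: np.ndarray, t: int = 16) -> np.ndarray:
    """Fill empty samples between patches, block by block, following the
    three cases listed above (empty, full and edge blocks)."""
    img = image.astype(np.float64).copy()
    occ = occupancy.astype(bool).copy()
    h, w = img.shape
    for by in range(0, h, t):
        for bx in range(0, w, t):
            blk = (slice(by, by + t), slice(bx, bx + t))
            if occ[blk].all():
                continue                                   # full block: nothing to do
            if not occ[blk].any():
                # Empty block: copy the last column (or row) of a previous block.
                if bx >= t:
                    img[blk] = img[by:by + t, bx - 1:bx]   # broadcast last column
                elif by >= t:
                    img[blk] = img[by - 1:by, bx:bx + t]   # broadcast last row
                occ[blk] = True
                continue
            # Edge block: iteratively fill empty pixels with the average of
            # their already-filled 4-neighbours.
            bi, bo = img[blk], occ[blk]
            for _ in range(2 * t):
                for y, x in zip(*np.nonzero(~bo)):
                    vals = [bi[ny, nx] for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1))
                            if 0 <= ny < bi.shape[0] and 0 <= nx < bi.shape[1] and bo[ny, nx]]
                    if vals:
                        bi[y, x] = np.mean(vals)
                        bo[y, x] = True
                if bo.all():
                    break
    return img
```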
[0071] The generated images/layers are stored as video frames and compressed using a video codec.
[0072] In the auxiliary patch information compression, the following metadata is encoded/decoded for every patch:
Index of the projection plane
o Index 0 for the normal planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)
o Index 1 for the normal planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)
o Index 2 for the normal planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0).
2D bounding box (u0, v0, u1, v1)
3D location (x0, y0, z0) of the patch represented in terms of depth δ0, tangential shift s0 and bi-tangential shift r0. According to the chosen projection planes, (δ0, s0, r0) are computed as follows:
o Index 0: δ0 = x0, s0 = z0 and r0 = y0
o Index 1: δ0 = y0, s0 = z0 and r0 = x0
o Index 2: δ0 = z0, s0 = x0 and r0 = y0
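This per-index mapping can be written out directly; the following small Python function is only a restatement of the table above, and its name is arbitrary.

```python
def patch_dsr(projection_index: int, x0: float, y0: float, z0: float):
    """Map the 3D patch location (x0, y0, z0) to (delta0, s0, r0) according
    to the projection plane index, as in the table above."""
    if projection_index == 0:      # normals (1,0,0) / (-1,0,0)
        return x0, z0, y0
    if projection_index == 1:      # normals (0,1,0) / (0,-1,0)
        return y0, z0, x0
    if projection_index == 2:      # normals (0,0,1) / (0,0,-1)
        return z0, x0, y0
    raise ValueError("projection_index must be 0, 1 or 2")

print(patch_dsr(2, x0=10, y0=20, z0=5))   # (5, 10, 20)
```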
[0073] Also, mapping information providing for each TxT block its associated patch index is encoded as follows:
For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.
The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.
Let I be the index of the patch to which the current TxT block belongs and let J be the position of I in L. Instead of explicitly encoding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.

[0074] The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. Herein, one cell of the 2D grid produces a pixel during the image generation process. When considering an occupancy map as an image, it may be considered to comprise occupancy patches. Occupancy patches may be considered to have block-aligned edges according to the auxiliary information described in the previous section. An occupancy patch hence comprises occupancy information for the corresponding texture and geometry patches.

[0075] The occupancy map compression leverages the auxiliary information described in the previous section in order to detect the empty TxT blocks (i.e., blocks with patch index 0). The remaining blocks are encoded as follows.

[0076] The occupancy map could be encoded with a precision of B0xB0 blocks, where B0 is a user-defined parameter. In order to achieve lossless encoding, B0 should be set to 1. In practice, B0=2 or B0=4 result in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map. The generated binary image covers only a single colour plane. However, given the prevalence of 4:2:0 codecs, it may be desirable to extend the image with “neutral” or fixed-value chroma planes (e.g. adding chroma planes with all sample values equal to 0 or 128, assuming the use of an 8-bit codec).
[0077] The obtained video frame is compressed by using a video codec with lossless coding tool support (e.g., AVC, HEVC RExt, HEVC-SCC).
[0078] The occupancy map is simplified by detecting empty and non-empty blocks of resolution TxT in the occupancy map, and the patch index is encoded only for the non-empty blocks, as follows:
A list of candidate patches is created for each TxT block by considering all the patches that contain that block.
The list of candidates is sorted in the reverse order of the patches.
For each block:
- If the list of candidates has one index, then nothing is encoded.
- Otherwise, the index of the patch in this list is arithmetically encoded.
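The construction of the per-block candidate list and the decision of what to encode may be sketched as follows; the bounding boxes are assumed to be given in encoding order, (u1, v1) is assumed to be the exclusive bottom-right corner, and the arithmetic coding and candidate sorting details are omitted:

```python
def block_candidates(block_origin, bounding_boxes, block_patch_index, T=16):
    """Build the candidate list L for one TxT block and decide what to encode.

    block_origin: (bx, by) top-left pixel of the block.
    bounding_boxes: mapping patch index -> (u0, v0, u1, v1) in encoding order.
    Returns (L, value_to_encode); value_to_encode is None for a single candidate.
    """
    bx, by = block_origin
    L = [0]                       # empty space acts as a patch with the special index 0
    for idx, (u0, v0, u1, v1) in bounding_boxes.items():
        if u0 <= bx and bx + T <= u1 and v0 <= by and by + T <= v1:
            L.append(idx)
    if len(L) == 1:
        return L, None            # single candidate: nothing is encoded
    return L, L.index(block_patch_index)   # the position in L is arithmetically coded
```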
[0079] The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v), let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P could be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:
δ(u, v) = δ0 + g(u, v)
s(u, v) = s0 - u0 + u
r(u, v) = r0 - v0 + v
where g(u, v) is the luma component of the geometry image.

[0080] The smoothing procedure aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors.
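The reconstruction equations above may be illustrated by the following sketch, where the patch dictionary is an assumed container for the decoded auxiliary patch information:

```python
def reconstruct_point(u, v, g_uv, patch):
    """Depth and shifts of the point projected at pixel (u, v).

    g_uv: luma sample of the geometry image at (u, v); patch: decoded auxiliary
    information with keys delta0, s0, r0, u0, v0 (an assumed container).
    """
    delta = patch["delta0"] + g_uv         # delta(u, v) = delta0 + g(u, v)
    s = patch["s0"] - patch["u0"] + u      # s(u, v) = s0 - u0 + u
    r = patch["r0"] - patch["v0"] + v      # r(u, v) = r0 - v0 + v
    return delta, s, r
```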
[0081 ] In the texture reconstruction process, the texture values are directly read from the texture images.
[0082] Consequently, V-PCC provides a procedure for compressing a time-varying volumetric scene/object by projecting 3D surfaces onto a number of pre-defined 2D planes, which may then be compressed using regular 2D video compression algorithms. The projection is presented using different patches, where each set of patches may represent a specific object or specific parts of a scene.
[0083] For explaining the relationship between patches, view representations, layer pairs, layers, coded video sequences (CVS), decoded picture pairs, and atlases, a reference is made to the document N18576 “Working Draft 2 of Metadata for Immersive Video (MIV)”, ISO/IEC JTC 1/SC 29/WG 11, 6 Aug 2019, and to the attached Figure 4 included therein, which shows an illustrative example, in which two atlases contain five patches (patches 2, 3, 5, 7 and 8), which are mapped to three view representations (ViewO, Viewl, View2).
[0084] The bitstream contains one or more layer pairs, each layer pair having a texture layer and a depth layer. Each layer contains one or more consecutive CVSes in a unique single independent video coding layer, such as an HEVC independent layer, with each CVS containing a sequence of coded pictures.
[0085] Each layer pair represents a sequence of atlases. An atlas is represented by a decoded picture pair in each access unit, with a texture component picture and a depth component picture. The size of an atlas is equal to the size of the decoded picture of the texture layer representing the atlas. The depth decoded picture size may be equal to the decoded picture size of the corresponding texture layer of the same layer pair. Decoded picture sizes may vary for different layer pairs in the same bitstream.
[0086] A patch may have an arbitrary shape, but in many embodiments it may be preferable to consider the patch as a rectangular region that is represented in both an atlas and a view representation. The size of a particular patch may be the same in both the atlas representation and the view representation. [0087] An atlas contains an aggregation of one or more patches from one or more view representations, with a corresponding texture component and depth component. The atlas patch occupancy map generator process outputs an atlas patch occupancy map. The atlas patch occupancy map is a 2D array of the same size as the atlas, with each value indicating the index of the patch to which the co-located sample in the atlas corresponds, if any, or otherwise indicates that the sample location has an invalid value.
[0088] A view representation represents a field of view of a 3D scene for particular camera parameters, for the texture and depth component. View representations may be omnidirectional or perspective, and may use different projection formats, such as equirectangular projection or cube map projection. The texture and depth components of a view representation may use the same projection format and have the same size.
[0089] The decoding process may be illustrated by Figure 5, which shows a decoder reference architecture for immersive video as defined in N18576. The bitstream comprises a CVS for each texture and depth layer of a layer pair, which is input to a 2D video decoder, such as an HEVC decoder, which outputs a sequence of decoded picture pairs of synchronized decoded texture pictures (A) and decoded depth pictures (B). Each decoded picture pair represents an atlas (C).
[0090] The metadata is input to a metadata parser which outputs an atlas parameters list (D) and a camera parameters list (E). The atlas patch occupancy map generator takes as inputs the depth decoded picture (B) and the atlas parameters list (D) and outputs an atlas patch occupancy map (F). In the reference architecture, a hypothetical reference renderer takes as inputs one or more decoded atlases (C), the atlas parameters list (D), the camera parameters list (E), the atlas patch occupancy map sequence (F), and the viewer position and orientation, and outputs a viewport.

[0091] Currently, the atlas parameters are defined in the atlas_parameters syntax structure, where a plurality of parameters are defined for each patch, such as disclosed in Table 1 below. Such patch-specific auxiliary information indicates e.g. the size of the patch, its location within the atlas and in the view, and the orientation (i.e. rotation and/or mirroring) of the patch.
Table 1 (atlas_parameters syntax structure; reproduced as an image in the original publication).
[0092] Transmitting this information per patch takes a relatively high number of bits and therefore increases the size of the encoded content bitstream that needs to be transmitted for the presentation of the content to the decoder.

[0093] In the following, an enhanced method for reducing the size of the auxiliary patch information to be transmitted will be described in more detail, in accordance with various embodiments.
[0094] A starting point for the method may be considered, for example, that a 3D representation of at least one object, such as a point cloud frame or a 3D mesh, is input in an encoder. The method, which is disclosed in Figure 6, comprises projecting (600) a 3D representation of a scene onto a plurality of 2D patches; generating (602) at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; aggregating (604) said plurality of 2D patches with the corresponding texture component picture into an atlas; determining (606) a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and signalling (608), in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
[0095] Thus, the method targets finding patches which do not change, or whose change is smaller than a specific threshold, between atlas frames at a first time instance and a second time instance. The similarity may be defined based on a comparison of the content of texture pixels between the first and at least the second temporally previous patch, or of the content of the geometry information, if present, between the first and at least the second patch. Such patches, for which at least one temporally previous patch with sufficient similarity is found, will be signalled to be subject to a faster encoding/decoding/rendering process by using said at least one temporally previous patch as a prediction reference for said patch. Thereby, the bitrate required for encoding such patches is reduced, and the execution time of the encoder/decoder/rendering processes is reduced as well. Moreover, less metadata needs to be transmitted for such patches which are sufficiently similar to a previously encoded patch and hence, the total bitrate is reduced.

[0096] It is further noted that this aspect relates to the signaling of only the auxiliary patch information, which may be encoded into a separate bitstream, which may be stored or transmitted to a decoder as such. The geometry image, the texture image and the occupancy map may each be encoded into separate bitstreams, as well. As mentioned above, encoding the geometry (or depth) image into or along the bitstream with the texture image is only optional. Alternatively, the auxiliary patch information may be encoded into a common bitstream with one or more of the geometry image, the texture image or the occupancy map.
[0097] Another aspect relates to the operation of a decoder. Figure 7 shows an example of a decoding method comprising receiving (700) a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; receiving (702), either in said bitstream or in a further bitstream, encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; decoding (704), in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; decoding (706) at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch; and decoding (708) the geometry component picture, the occupancy map and the auxiliary patch information for reconstructing a 3D representation of said scene.
[0098] Thus, the decoder receives and decodes the texture image, the occupancy map and the auxiliary patch information of the plurality of 2D patches, and optionally the geometry image, as well as the atlas aggregation of said plurality of 2D patches with the corresponding texture component picture and the corresponding geometry component picture, received either in a common bitstream or in two or more separate bitstreams. From the auxiliary patch information, the decoder decodes, among other auxiliary patch information, also a difference value for at least a first patch belonging to a first part of the scene, wherein the difference value is based on the difference between the content of the texture component picture or the geometry component picture of the first patch at a first time instance and the content of a texture component picture or a geometry component picture of at least a second patch at a second time instance prior to the first time instance, wherein the second patch is sufficiently similar to the first patch, i.e. the second patch matches the first patch according to at least one predefined similarity criterium. If the difference value is smaller than a predetermined threshold value, an indication about at least one temporally previous patch, i.e. at least the second patch, to be used as a prediction reference for said first patch is decoded from the auxiliary patch information, and at least the first patch of the texture component picture is decoded by using said at least one temporally previous patch, i.e. at least the second patch, as a prediction reference for said first patch. After decoding the geometry component picture, the occupancy map and possibly the remainder of the auxiliary patch information, the 3D representation of the scene is then reconstructed.
[0099] According to an embodiment, said predefined similarity criteria comprise one or more of the following criteria:
- the first and the second patches belonging to the same part of the scene;
- the first and the second patches having at least one similar auxiliary parameter;
- the second patch comprising at least one image block being used as a prediction reference for at least one image block of the first patch.
[0100] Thus, said first and second patch may represent the same object at different time instances (i.e. having different temporal time stamps), or said first and second patch may have similar auxiliary parameters, e.g. size or rotation. Moreover, if the image blocks of the second patch are defined to be used as reference blocks for motion vector prediction for the image blocks of the first patch, it may provide an indication that there may be sufficient similarity between the first and the second patch.
[0101] The methods and embodiments as disclosed herein may be generally applicable to scenes presented by a plurality of patches, wherein each patch presents a part of the scene. Moreover, any object belonging to the scene may be presented by a set of patches and any specific part of the scene may be presented by a group of patches as well. Basically, the current patch is processed by comparing it with at least one patch that belongs to the same part of the scene (e.g. same object) and is preferably located close to the world coordinates of the current patch.
[0102] It is noted that for determining the difference value for the first patch, the comparison may include one or more further patches at one or at several time instances before the current time instance of the first patch.
[0103] The decision of which patches are to be compared to the current patch can be dynamic, and a temporal search radius, in number of frames, for searching similar patches may be introduced in the process. Moreover, the number of patches to be included in the comparison and their time instances may vary, and they may be defined as initializing parameters for the embodiments.
[0104] According to an embodiment, said difference value is determined as a mean square error between pixel values of the texture component pictures or the geometry component pictures of said first and second patches. Thus, either all or a subgroup of the pixels of the texture component picture or the geometry component picture are compared between said first and second patch, and an average of the differences may be calculated as a mean square error (MSE).

[0105] According to an embodiment, said predetermined threshold value is zero. In such a case, only the patches which have not changed at all may be allowed to use the at least one temporally previous patch as a prediction reference for said patch.
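A minimal sketch of such an MSE-based difference value, assuming the compared patches are co-located arrays of equal size, could be:

```python
import numpy as np

def patch_difference(curr, ref):
    """Mean square error between co-located samples of two equally sized patches
    (texture or geometry component), used as the difference value."""
    curr = np.asarray(curr, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    return float(np.mean((curr - ref) ** 2))
```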
[0106] According to an embodiment, said indication is configured to be encoded as a flag at least for the first patch, wherein said flag indicates whether at least one temporally previous patch is to be used as a prediction reference for said first patch.
[0107] Thus, a flag, which may be referred to herein as patch_changed_flag, is used for indicating whether at least one temporally previous patch is to be used as a prediction reference for said patch. For those patches that are determined to use one or more temporally previous patches as a prediction reference, significant bit savings in the signalling are achieved by indicating this with a single flag. Thus, a threshold may be defined for the difference value reflecting the change between the current patch and the previously encoded patch, and if the difference value, such as the MSE, is smaller than the specified threshold, then patch_changed_flag is set equal to 0. If the threshold value is set as zero, only the patches which have not changed at all qualify for having patch_changed_flag set equal to 0. For those patches where the difference value between the current patch and the previously encoded patch is higher than the specified threshold, patch_changed_flag is set equal to 1.
[0108] According to an embodiment, a signalling of the at least one temporally previous patch to be used as a prediction reference for said first patch is configured to be carried out by at least one syntax element included in an atlas parameter syntax structure or any other suitable syntax structure for ISO/IEC 23090-5 (or similar volumetric video coding technology). Table 2 shows an example of including said at least one syntax element in the atlas parameter syntax structure.
Table 2 (ISO/IEC 23090-5 example; reproduced as an image in the original publication).
[0109] According to an embodiment, for each patch[i], there may be a further syntax element, a patch-specific flag patch_changed_flag, indicating whether at least one temporally previous patch is to be used as a prediction reference for said patch.
[0110] According to an embodiment, in response to the difference value between the current patch and the previously encoded patch being higher than the specified threshold value, the apparatus is configured to set the value of the flag for said particular patch to one. For the patches for which patch_changed_flag = 1, conventional encoding and/or decoding and/or rendering processes may be applied. In the atlas parameter syntax structure according to Table 2, this is indicated by the previously used syntax elements patch_pos_in_atlas_x[a][i], patch_pos_in_atlas_y[a][i] and patch_rotation[a][i], which are only triggered to be used by the else condition, i.e. when the value of patch_changed_flag is not zero.
[0111] According to an embodiment, a skip mode of encoding is indicated to be used for said first patch. Thus, in response to a patch being indicated with patch_changed_flag = 0, all image blocks which are fully or partially covered by said patch are to be coded with a skip mode indicating the reference block from where the patch is to be predicted. Herein, the image blocks which are partially covered by said patch should not include content from any other patch. Hence, if e.g. a patch in the current atlas frame at temporal time Tk is to be predicted from a respective patch in a reference atlas frame at temporal time T1, then the skip mode is to refer to the frame at temporal time T1 and to the specific location where the respective patch is located. This prevents the encoder from searching for an appropriate mode for the encoding of the current block and hence decreases the execution time for encoding the atlas image.
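A sketch of how the blocks eligible for skip mode could be collected is given below; pixel_patch_index is an assumed per-pixel map of patch indices (0 denoting empty space), and the function name is illustrative:

```python
import numpy as np

def blocks_to_skip(patch_rect, pixel_patch_index, patch_index, T=16):
    """Collect the TxT blocks fully or partially covered by the patch that are
    eligible for skip mode, i.e. blocks holding no samples of any other patch.

    patch_rect: (u0, v0, u1, v1) bounding box of the patch in the atlas.
    """
    u0, v0, u1, v1 = patch_rect
    skip_blocks = []
    for by in range((v0 // T) * T, v1, T):
        for bx in range((u0 // T) * T, u1, T):
            blk = pixel_patch_index[by:by + T, bx:bx + T]
            covers_patch = (blk == patch_index).any()
            no_other_patch = np.isin(blk, [0, patch_index]).all()
            if covers_patch and no_other_patch:
                skip_blocks.append((bx, by))
    return skip_blocks
```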
[0112] The patches which have patch_changed_flag = 0 may be generated due to at least one object being static in the 3D scene during a time period, whereby the patches representing the object are configured to remain untouched. In this scenario, the metadata indicating the location and rotation of the patch is preferably excluded from the transmission of the metadata information. Such metadata includes, but is not limited to, the syntax element patch_rotation.
[0113] Alternatively, an object may be dynamic (moving) in the 3D scene while at least one patch presenting said object remains with patch_changed_flag = 0 for different times. This may happen, for example, when a person enters a room and the patches representing his/her face remain under the patch_changed_flag = 0 criteria. In such cases, a new parameter syntax element may be introduced: patch_affine_change.
[0114] Such a parameter indicates the affine change between the current patch and the reference patch, which is to be signalled to the decoder.
[0115] According to another embodiment, the patches which are indicated with patch_changed_flag = 0 are not included in the atlas image, but only the metadata indicating their presentation will be added to the bitstream, signalling that they exist and indicating from which reference patch they are to be predicted/copied. As shown in the atlas parameter syntax structure according to Table 2, this may be indicated by two syntax parameters: patch_ref_pos_in_atlas_x[ a ][ p ] and patch_ref_pos_in_atlas_y[ a ][ p ] specify the horizontal and vertical coordinates in luma samples, respectively, of the top-left corner of the p-th patch of the a-th atlas. The number of bits used for the representation of patch_pos_in_atlas_x[ a ][ p ] and patch_pos_in_atlas_y[ a ][ p ] is Ceil( Log2( atlas_width_minus1[ a ] + 1 ) ) and Ceil( Log2( atlas_height_minus1[ a ] + 1 ) ) bits, respectively. Such patches will be used as reference for compression of the current patch.

[0116] According to an embodiment, in response to said difference value being zero, a second syntax element, indicative of a prediction residual, may be included in the atlas parameter syntax structure, wherein the value of the second syntax element for said first patch is set as zero. Presence of residual information may be signalled with a parameter, e.g. patch_pred_res_flag, wherein if no residual exists (i.e. the difference value = 0), the value of patch_pred_res_flag is set as zero.
[0117] According to another embodiment, in response to said difference value being greater than zero but smaller than said predetermined threshold value, the apparatus is configured to include the second syntax element in the atlas parameter syntax structure and set the value of the second syntax element for said first patch as one. Thus, if the decision to set patch_changed_flag = 0 is based on having allowable differences (i.e. under the predetermined threshold) between said patch and the reference patch(es), the value of patch_pred_res_flag is set as one.
[0118] According to an embodiment, a third syntax element may be included in the atlas parameter syntax structure, wherein said third syntax element indicates a prediction residual to be added to the prediction based on the at least one temporally previous patch. Thus, a syntax element for the residual information, such as patch_pred_res, may be created and transmitted along the bitstream. In such embodiment, the differences between the current patch and reference patch are encoded in the bitstream and signalled as respective residual metadata.
[0119] The above embodiments, when implemented at the encoder side, are reflected in the operation of a decoder.

[0120] According to an embodiment, the patches which are indicated with patch_changed_flag = 0 are to be copied from the reference patch and hence not to be decoded.
[0121] According to an embodiment, the blocks which are fully or partially covered by a patch with patch_changed_flag = 0 are to be decoded in skip mode. Herein, the image blocks which are partially covered by said patch should not include content from any other patch.
[0122] According to an embodiment, if a patch is indicated with patch_changed_flag = 0, then the blocks are presented with skip mode equal to 1 and hence, the blocks are decoded as predicted from the respective reference block.
[0123] For the patches which have patch_changed_flag = 0 generated due to one object being static in the 3D scene during a time period, the patches representing the object thus remaining untouched, the metadata indicating the location and rotation of the patch is not to be fetched from the reference patch either.
[0124] Alternatively, if the object is dynamic (moving) in the 3D scene while at least one patch presenting said object remains with patch_changed_flag = 0 for different times, the parameter syntax element patch_affine_change may be decoded and utilized to locate the accurate orientation of the current patch as compared to the orientation of the reference patch.
[0125] According to another embodiment, the patches which are indicated with patch_changed_flag = 0 and are not put into the atlas image, only the metadata indicating their presentation being added to the bitstream, are fetched from the location of the reference patch directly. The spatial and temporal location of the reference patch may be calculated based on the indicative parameters, such as the parameters patch_ref_pos_in_atlas_x[ a ][ p ] and patch_ref_pos_in_atlas_y[ a ][ p ] described above.

[0126] According to another embodiment, if the decision to set patch_changed_flag equal to 0 has been based on having allowable differences (i.e. under the predetermined threshold) between said patch and the reference patch(es), and residual information has also been signalled using the syntax elements patch_pred_res_flag and patch_pred_res as described above, the residual information indicating the differences between the current patch and the reference patch is decoded and added to the reference patch to present the current patch with the required accuracy.
[0127] The encoding and decoding aspects including at least some of the above embodiments may be illustrated by the flow charts of Figures 8a and 8b. The operation of the encoder is shown in Figure 8a, where a plurality of patches are input (800) in the encoder. A difference value of the pixels of the texture or geometry component images between time instances Tk and Ti is calculated (802) for one or more patches. Then, for each patch, it is determined (804) whether the difference value is smaller than a predetermined threshold value. If not, patch_changed_flag is set to 1 (806) and the conventional encoding process is applied for said patch, e.g. by defining the patch-specific syntax elements patch_pos_in_atlas_x[a][i], patch_pos_in_atlas_y[a][i] and patch_rotation[a][i] in the atlas parameter syntax structure.
[0128] If the difference value is smaller than the predetermined threshold value, patch_changed_flag is set to 0 (808). If the threshold value > 0, then it may be further examined (810) whether the difference value is exactly zero. If yes, the content of the texture or geometry component images between time instances Tk and Ti is the same for said patch, and no residual exists; therefore, patch_pred_res_flag is set as zero (812). On the other hand, if there is a residual, patch_pred_res_flag is set as one and the value of the residual is determined (814) to be further included in the signalling.
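The encoder-side decision of Figure 8a may be sketched as follows; the dictionary returned here merely illustrates the signalling decisions of steps 806-814 and is not actual bitstream syntax:

```python
import numpy as np

def encode_patch_decision(curr_patch, ref_patch, threshold):
    """Per-patch decision of Figure 8a for co-located texture or geometry arrays."""
    diff = float(np.mean((np.asarray(curr_patch, dtype=float)
                          - np.asarray(ref_patch, dtype=float)) ** 2))
    if diff >= threshold:
        return {"patch_changed_flag": 1}              # conventional coding path (806)
    signalling = {"patch_changed_flag": 0}            # reference patch is reused (808)
    if diff == 0:
        signalling["patch_pred_res_flag"] = 0         # no residual exists (812)
    else:
        signalling["patch_pred_res_flag"] = 1         # residual is signalled (814)
        signalling["patch_pred_res"] = np.asarray(curr_patch) - np.asarray(ref_patch)
    return signalling
```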
[0129] The operation of the decoder is shown in Figure 8b, where encoded patches and auxiliary patch information are input (820) in the decoder. An indication for each patch of whether at least one temporally previous patch is to be used as a prediction reference for said patch, i.e. patch_changed_flag, is obtained from a syntax structure of the auxiliary patch information and its value is examined (822). In response to the value of patch_changed_flag being one, the patch is decoded by a conventional decoding process, e.g. by obtaining the patch-specific syntax elements patch_pos_in_atlas_x[a][i], patch_pos_in_atlas_y[a][i] and patch_rotation[a][i] from the atlas parameter syntax structure.
[0130] In response to the value of patch_changed_flag being zero (822), thereby indicating that the patch is to be decoded by using at least one temporally previous patch as a prediction reference for said patch, it is examined (826) whether patch_pred_res_flag = 1. If patch_pred_res_flag = 0, thus indicating that no residual value is determined for the prediction, the patch can be decoded (828) based on the reference to at least one temporally previous patch. If patch_pred_res_flag = 1, thus indicating that a residual value is signalled for the prediction, the decoder obtains (830) the residual value and decodes (832) the patch based on the reference to at least one temporally previous patch and applies the residual value thereto.
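Correspondingly, the decoder-side handling of Figure 8b may be sketched as follows, with decode_conventionally standing for the ordinary decoding path (an assumption for illustration):

```python
def decode_patch(aux, ref_patch, decode_conventionally):
    """Per-patch handling of Figure 8b; aux is the parsed auxiliary patch information."""
    if aux["patch_changed_flag"] == 1:
        return decode_conventionally(aux)      # conventional decoding process
    if aux.get("patch_pred_res_flag", 0) == 0:
        return ref_patch                       # copy the reference patch as-is (828)
    return ref_patch + aux["patch_pred_res"]   # add the signalled residual (830, 832)
```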
[0131] At the rendering end, the patches which are indicated with patch_changed_flag equal to 0 are to be rendered first, as they do not need any information to be decoded other than the signal indicating that they are equal to another patch which has already been decoded.
[0132] For example, in an atlas which has N patches, of which M patches are indicated with patch_changed_flag equal to 0, said M patches are rendered first while the rest are pending the decoding process and will be rendered thereafter. This shortens the rendering process considerably, and the distribution of the rendering resources becomes more flexible as a part of the rendering happens earlier than the rest of the content.
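A simple sketch of this rendering order, assuming each patch carries its parsed patch_changed_flag, could be:

```python
def rendering_order(patches):
    """Render patches with patch_changed_flag == 0 first; the remaining patches
    are rendered after the ordinary decoding process has completed."""
    unchanged = [p for p in patches if p["patch_changed_flag"] == 0]
    changed = [p for p in patches if p["patch_changed_flag"] != 0]
    return unchanged + changed
```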
[0133] The process analysing the similarity between the patches may be limited to a specific number of frames prior to the current frame. Such a limitation may be dictated by the codec requirements, e.g. the GOP (group of pictures) length, where the GOP size is indicated in the codec which encodes the atlas image. Alternatively, such a limitation may be defined based on a specific and fixed number of frames, e.g. 32, regardless of the GOP size. Alternatively, such a limitation may be defined based on the amount of motion in the scene, e.g. if the scene has high motion, the frame number should be N frames and if the scene has low motion, the frame number should be M frames, where N < M.
[0134] Two or more of the embodiments as described above may be combined, and they may be introduced as one or more indicators in any suitable syntax structure for ISO/IEC 23090-5 (or similar volumetric video coding technology).
[0135] Consequently, the embodiments as described herein make it possible to reduce the bitrate required to encode the same content. Moreover, the embodiments make it possible to reduce the complexity of patch presentation. Furthermore, the embodiments make it possible to reduce the execution time both at the encoder side and at the rendering side.

[0136] The embodiments relating to the encoding aspects may be implemented in an apparatus comprising: means for projecting a 3D representation of a scene onto a plurality of 2D patches; means for generating at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; means for aggregating said plurality of 2D patches with the corresponding texture component picture into an atlas; means for determining a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and means for signaling, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
[0137] The embodiments relating to the encoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: project a 3D representation of a scene onto a plurality of 2D patches; generate at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; aggregate said plurality of 2D patches with the corresponding texture component picture into an atlas; determine a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and signal, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch. [0138] The embodiments relating to the decoding aspects may be implemented in an apparatus comprising means for receiving a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; means for receiving, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; means for decoding, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; means for decoding at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch; and means for decoding the geometry component picture, the occupancy map and the auxiliary patch information for reconstructing a 3D representation of said scene.
[0139] The embodiments relating to the decoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform receive a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; receive, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; decode, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; decode at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch; and decode the geometry component picture, the occupancy map and the auxiliary patch information for reconstructing a 3D representation of said scene.
[0140] Such apparatuses may comprise e.g. the functional units disclosed in any of Figures 1a, 1b, 2a and 2b for implementing the embodiments.
[0141] In the above, some embodiments have been described with reference to encoding. It needs to be understood that said encoding may comprise one or more of the following: encoding source image data into a bitstream, encapsulating the encoded bitstream in a container file and/or in packet(s) or stream(s) of a communication protocol, and announcing or describing the bitstream in a content description, such as the Media Presentation Description (MPD) of ISO/IEC 23009-1 (known as MPEG-DASH) or the IETF Session Description Protocol (SDP). Similarly, some embodiments have been described with reference to decoding. It needs to be understood that said decoding may comprise one or more of the following: decoding image data from a bitstream, decapsulating the bitstream from a container file and/or from packet(s) or stream(s) of a communication protocol, and parsing a content description of the bitstream.
[0142] In the above, where the example embodiments have been described with reference to an encoder or an encoding method, it needs to be understood that the resulting bitstream and the decoder or the decoding method may have corresponding elements in them. Likewise, where the example embodiments have been described with reference to a decoder, it needs to be understood that the encoder may have structure and/or computer program for generating the bitstream to be decoded by the decoder. [0143] In the above, some embodiments have been described with reference to encoding or decoding texture pictures, geometry pictures, (optionally) attribute pictures and auxiliary patch information into or from a single bitstream. It needs to be understood that embodiments can be similarly realized when encoding or decoding texture pictures, geometry pictures, (optionally) attribute pictures and auxiliary patch information into or from several bitstreams that are associated with each other, e.g. by metadata in a container file or media presentation description for streaming.
[0144] In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[0145] Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
[0146] Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well-established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
[0147] The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended examples. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention.

Claims

1. A method comprising: projecting a 3D representation of a scene onto a plurality of 2D patches; generating at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; aggregating said plurality of 2D patches with the corresponding texture component picture into an atlas; determining a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and signaling, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
2. An apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: project a 3D representation of a scene onto a plurality of 2D patches; generate at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; aggregate said plurality of 2D patches with the corresponding texture component picture into an atlas; determine a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and signal, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
3. An apparatus comprising: means for projecting a 3D representation of a scene onto a plurality of 2D patches; means for generating at least a texture component picture, an optional geometry component picture and auxiliary patch information from the 2D patches; means for aggregating said plurality of 2D patches with the corresponding texture component picture into an atlas; means for determining a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; and means for signaling, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch.
4. The apparatus according to claim 2 or 3, wherein said difference value is determined as a mean square error of between pixel values of the texture component pictures or the geometry component pictures of said first and second patches.
5. The apparatus according to any of claims 2 - 4, wherein said predefined similarity criteria comprises one or more of the following criterium:
- the first and the second patches belonging to the same part of the scene;
- the first and the second patches having at least one similar auxiliary parameter;
- the second patch comprising at least one image block being used as a prediction reference for at least one image block of the first patch.
6. The apparatus according to any of claims 2 - 5, further comprising means for indicating a skip mode of encoding to be used for said first patch.
7. The apparatus according to any of claims 2 - 6, wherein said indication is configured to be encoded as a flag at least for the first patch, wherein said flag indicates whether at least the second patch is to be used as a prediction reference for said first patch.
8. The apparatus according to claim 7, wherein a signalling of at least the second patch to be used as a prediction reference for said first patch is configured to be carried out by at least one syntax element included in an atlas parameter syntax structure.
9. The apparatus according to claim 8, wherein, in response to said difference value being zero, the apparatus is configured to include a second syntax element, indicative of a prediction residual, in the atlas parameter syntax structure and set a value of the second syntax element for said first patch as zero.
10. The apparatus according to claim 8 or 9, wherein, in response to said difference value being greater than zero but smaller than said predetermined threshold value, the apparatus is configured to include the second syntax element in the atlas parameter syntax structure and set the value of the second syntax element for said first patch as one.
11. The apparatus according to claim 10, wherein the apparatus is configured to include a third syntax element in the atlas parameter syntax structure, wherein said third syntax element indicates a prediction residual to be added to the prediction based on the at least the second patch.
12. A method comprising: receiving a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; receiving, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; decoding, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; decoding at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch; and decoding the geometry component picture, the occupancy map and the auxiliary patch information for reconstructing a 3D representation of said scene.
13. An apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; receive, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; decode, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; decode at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch; and decode the geometry component picture, the occupancy map and the auxiliary patch information for reconstructing a 3D representation of said scene.
14. An apparatus comprising: means for receiving a bitstream in a decoder, said bitstream comprising an encoded texture component picture, an optional encoded geometry component picture and an encoded occupancy map indicative of a plurality of 2D patches from a 3D representation of a scene, wherein said plurality of 2D patches have been aggregated with the corresponding texture component picture into an atlas; means for receiving, either in said bitstream or in a further bitstream, an encoded auxiliary patch information from said plurality of 2D patches, wherein said auxiliary patch information comprises a difference value for at least a first patch belonging to a first part of the scene, said difference value being based on difference between content of the texture component picture or the geometry component picture of the first patch at a first temporal value and content of a texture component picture or a geometry component picture of at least a second patch at a second temporal value prior to the first temporal value, said second patch matching to the first patch according to at least one predefined similarity criterium; means for decoding, in response to said difference value being smaller than a predetermined threshold value, an indication about at least the second patch to be used as a prediction reference for said first patch; means for decoding at least the first patch of the texture component picture by using at least the second patch as a prediction reference for said first patch; and means for decoding the geometry component picture, the occupancy map and the auxiliary patch information for reconstructing a 3D representation of said scene.
15. The apparatus according to claim 13 or 14, wherein the apparatus is configured to decode the indication about a skip mode of decoding to be used for said first patch, wherein said indicating includes reference to one or more reference image blocks of at least the second patch to be used instead of the first patch.
16. The apparatus according to any of claims 13 - 15, wherein said indication at least for the first patch is configured to be decoded from a flag, wherein said flag indicates whether at least the second patch is to be used as a prediction reference for said first patch.
17. The apparatus according to claim 16, wherein the apparatus is configured to decode the indication about at least the second patch to be used as a prediction reference for said first patch from at least one syntax element included in an atlas parameter syntax structure.
18. The apparatus according to claim 17, wherein, the apparatus is configured to decode a second syntax element, indicative of a prediction residual, from the atlas parameter syntax structure, and in response to the value of the second syntax element for said first patch being one, decode a third syntax element from the atlas parameter syntax structure, wherein said third syntax element indicates a prediction residual to be added to the prediction based on at least the second patch.
PCT/FI2021/050096 2020-02-28 2021-02-12 An apparatus, a method and a computer program for volumetric video WO2021170906A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20205208 2020-02-28
FI20205208 2020-02-28

Publications (1)

Publication Number Publication Date
WO2021170906A1 true WO2021170906A1 (en) 2021-09-02

Family

ID=77489884

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2021/050096 WO2021170906A1 (en) 2020-02-28 2021-02-12 An apparatus, a method and a computer program for volumetric video

Country Status (1)

Country Link
WO (1) WO2021170906A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023050432A1 (en) * 2021-09-30 2023-04-06 浙江大学 Encoding and decoding methods, encoder, decoder and storage medium
WO2023091814A1 (en) * 2021-11-22 2023-05-25 Tencent America LLC Encoding of patch temporal alignment for mesh compression

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019199415A1 (en) * 2018-04-13 2019-10-17 Futurewei Technologies, Inc. Differential coding method for patch side information
WO2019243663A1 (en) * 2018-06-21 2019-12-26 Nokia Technologies Oy An apparatus, a method and a computer program for volumetric video
EP3591975A1 (en) * 2018-07-03 2020-01-08 Industrial Technology Research Institute Method and apparatus for processing patches of a point cloud

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 21761087
Country of ref document: EP
Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 Ep: pct application non-entry in european phase
Ref document number: 21761087
Country of ref document: EP
Kind code of ref document: A1