WO2019185983A1 - Method, apparatus and computer program product for encoding and decoding of digital volumetric video - Google Patents

Method, apparatus and computer program product for encoding and decoding of digital volumetric video

Info

Publication number
WO2019185983A1
WO2019185983A1 · PCT/FI2019/050219
Authority
WO
WIPO (PCT)
Prior art keywords
occupancy map
sampled
pixels
occupied
representative value
Prior art date
Application number
PCT/FI2019/050219
Other languages
English (en)
Inventor
Payman AFLAKI
Sebastian Schwarz
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2019185983A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 Processing image signals
    • H04N 13/161 Encoding, multiplexing or demultiplexing different image signal components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/59 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/593 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding

Definitions

  • The present solution generally relates to virtual reality.
  • More specifically, the solution relates to encoding and decoding of digital volumetric video.
  • New image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as a 360-degree field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions).
  • New types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being "immersed" into the scene captured by the 360-degree camera.
  • The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.
  • For volumetric video, a scene may be captured using one or more 3D (three-dimensional) cameras. The cameras are in different positions and orientations within a scene.
  • One issue to take into account is that, compared to 2D (two-dimensional) video content, volumetric 3D video content has substantially more data, which makes efficient compression essential.
  • a method comprising determining an occupancy map having a first resolution; grouping pixels of the occupancy map into non-overlapping blocks; and generating a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.
  • a method comprising receiving a down-sampled occupancy map, wherein at least one representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and up-sampling representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value.
  • an apparatus comprising means for determining an occupancy map having a first resolution; means for grouping pixels of the occupancy map into non-overlapping blocks; and means for generating a down-sampled occupancy map for the occupancy map by generating one representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.
  • the down-sampled occupancy map has a second resolution, which is lower than the first resolution.
  • the representative value comprises a binary value indicating whether a pixel in the down-sampled occupancy map is occupied or unoccupied.
  • examining pixel occupancy comprises determining a number of occupied pixels and/or locations of occupied pixels in the current block and the at least one adjacent block.
  • if the number of occupied pixels in the current block is greater than the number of unoccupied pixels, the representative down-sampled pixel is determined to be occupied; if the number of occupied pixels in the current block is smaller than the number of unoccupied pixels, the representative down-sampled pixel is determined to be unoccupied; and/or if the number of occupied pixels in the current block equals the number of unoccupied pixels, the representative down-sampled pixel is determined according to one or more adjacent pixels of the at least one adjacent block in the occupancy map.
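  • As an illustration only, the following Python sketch applies this rule to one t x t block; the NumPy array representation and the tie-break by majority vote over the surrounding ring of pixels are simplifying assumptions, not the patent's reference implementation:

```python
import numpy as np

def downsample_block(oom: np.ndarray, y: int, x: int, t: int = 2) -> int:
    """Representative value for the t x t block of the OOM at (y, x).

    Majority vote over the block; ties are resolved by the occupancy of the
    ring of pixels adjacent to the block, a simplification of the
    mode-specific criteria (modes 13-16) described further below.
    """
    block = oom[y:y + t, x:x + t]
    occ = int(block.sum())
    if occ * 2 > t * t:
        return 1                      # more occupied than unoccupied pixels
    if occ * 2 < t * t:
        return 0                      # more unoccupied than occupied pixels
    # Tie: inspect the adjacent pixels of the neighbouring blocks.
    y0, y1 = max(y - 1, 0), min(y + t + 1, oom.shape[0])
    x0, x1 = max(x - 1, 0), min(x + t + 1, oom.shape[1])
    ring = int(oom[y0:y1, x0:x1].sum()) - occ
    ring_size = (y1 - y0) * (x1 - x0) - t * t
    return 1 if 2 * ring >= ring_size else 0
```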
  • the apparatus further comprises means for up-sampling representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining pixel occupancy associated with the current representative value and pixel occupancy associated with at least one representative value in the down-sampled occupancy map adjacent to the current representative value.
  • the apparatus further comprises means to generate up-sampling guidance information to be signaled in a chroma channel, the guidance information being based on an analysis of the occupancy map and the up-sampled occupancy map.
  • an apparatus comprising means for receiving a down-sampled occupancy map, wherein at least one representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and means for up-sampling representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value.
  • the up-sampling comprises means for determining two occupied and diagonally adjacent blocks in the down-sampled occupancy map; means for determining if the two other blocks adjacent to the occupied and diagonally adjacent blocks are unoccupied, and in response to determining that the two other blocks adjacent to the occupied and diagonally adjacent blocks are unoccupied; means for up-sampling the current representative values of each occupied and diagonally adjacent block in the down-sampled occupancy map to respective up-sampled blocks of pixels in the up-sampled occupancy map; means for determining two pixels adjacent to both up-sampled blocks of pixels in the up-sampled occupancy map; and means for assigning the two determined pixels to be occupied in the up-sampled occupancy map.
  • one or more pixels adjacent to either the two determined pixels or up-sampled blocks of pixels are assigned to be occupied in the up-sampled occupancy map.
  • the up-sampling comprises means for determining two occupied and diagonally aligned blocks in the down-sampled occupancy map; means for determining if the diagonal separation between said two occupied and diagonally aligned blocks is one unoccupied block in the down-sampled occupancy map, and in response to determining that the diagonal separation between said two occupied and diagonally aligned blocks is one unoccupied block in the down-sampled occupancy map; means for up-sampling the current representative values of each occupied and diagonally aligned block in the down-sampled occupancy map to respective up-sampled blocks of pixels in the up-sampled occupancy map; means for determining the least number of pixels which diagonally connect the two up-sampled blocks of pixels in the up-sampled occupancy map; and means for assigning the determined pixels to be occupied in the up-sampled occupancy map.
  • the means comprises at least one processor; and at least one memory including computer program code, the at least one memory and computer program code configured to, with the at least one processor, cause the performance of the apparatus.
  • a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to determine an occupancy map having a first resolution; group pixels of the occupancy map into non-overlapping blocks; and generate a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.
  • a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive a down-sampled occupancy map, wherein a representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and up-sample representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining pixel occupancy of at least one representative value adjacent to the current representative value.
  • Fig. 1 shows an example of a compression process
  • Fig. 2 shows an example of a decompression process
  • Fig. 3 shows an example of 3D to 2D projection patches
  • Fig. 4 shows a table illustrating possible modes for down-sampling a 2x2 block of pixels
  • Fig. 5 shows an example of pixel locations around a respective 2x2 block
  • Fig. 6 shows an example of up-sampling representative down-sampled pixels from a down-sampled occupancy map to create the up-sampled occupancy map
  • Fig. 7 shows an example of adjacent pixels to the recently created dashed pixels
  • Fig. 8 shows another example of adjacent pixels to the recently created dashed pixels
  • Fig. 9 shows another example of adjacent pixels to the recently created dashed pixels
  • Fig. 10 is a flowchart illustrating a method according to an embodiment
  • Fig. 11 is a flowchart illustrating a method according to another embodiment
  • Fig. 12 shows an example of an apparatus according to an embodiment
  • Fig. 13 shows an example of a layout of an apparatus according to an embodiment.
  • Volumetric video data represents a three-dimensional scene or object, and can be used as an input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications.
  • Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g. colour, opacity, reflectance, ...), plus any possible temporal changes of the geometry and attributes at given time instances (e.g. frames in 2D video).
  • Volumetric video is either generated from 3D models, i.e. CGI, or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera rigs, laser scans, a combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible.
  • Typical representation formats for such volumetric data are triangle meshes, point clouds, or voxels.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time.
  • Since volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR, or MR application, especially for providing 6DOF viewing capabilities.
  • Advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes.
  • Infrared, lasers, time-of-flight and structured light are all examples of technologies that can be used to construct 3D video data.
  • Representation of the 3D data depends on how the 3D data is used.
  • Dense voxel arrays have been used to represent volumetric medical data.
  • In 3D graphics, polygonal meshes are extensively used.
  • Point clouds on the other hand are well suited for applications, such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold.
  • Another way to represent 3D data is coding this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.
  • the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression becomes essential.
  • Standard volumetric video representation formats, such as point clouds, meshes, and voxels, suffer from poor temporal compression performance. Identifying correspondences for motion compensation in 3D space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporally successive "frames" do not necessarily have the same number of meshes, points, or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video-based approaches for compressing volumetric data, i.e. multiview with depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.
  • a 3D scene represented as meshes, points, and/or voxels can be projected onto one or more geometries. These geometries are "unfolded" onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information is transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).
  • Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression.
  • coding efficiency is increased greatly.
  • 6DOF capabilities are improved.
  • Using several geometries for individual objects improves the coverage of the scene further.
  • standard video encoding hardware can be utilized for real-time compression/decompression of the projected planes. The projection and reverse projection steps are of low complexity.
  • Figure 1 illustrates an overview of an example of a compression process. Such process may be applied for example in MPEG Point Cloud Coding (PCC).
  • the process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.
  • the patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error.
  • the normal at every point can be estimated.
  • An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals: (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), and (0.0, 0.0, -1.0).
  • each point may be associated with the plane that has the closest normal (i.e. maximizes the dot product of the point normal and the plane normal).
  • the initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors.
  • the final step may comprise extracting patches by applying a connected component extraction procedure.
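  • For illustration, a minimal sketch of the initial clustering step; the NumPy representation of the point normals and the axis-aligned set of plane normals are assumptions made for this example:

```python
import numpy as np

# The six oriented planes of the initial clustering, defined by their normals
# (the axis-aligned convention assumed here for illustration).
PLANE_NORMALS = np.array([
    [ 1.0, 0.0, 0.0], [0.0,  1.0, 0.0], [0.0, 0.0,  1.0],
    [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0],
])

def initial_clustering(point_normals: np.ndarray) -> np.ndarray:
    """Associate each point with the plane having the closest normal, i.e.
    the plane normal that maximizes the dot product with the point normal.

    point_normals: (N, 3) array of unit point normals.
    Returns an (N,) array of cluster indices in [0, 5].
    """
    scores = point_normals @ PLANE_NORMALS.T   # (N, 6) dot products
    return np.argmax(scores, axis=1)
```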
  • Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105.
  • the packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g. 16x16) block of the grid is associated with a unique patch.
  • T may be a user-defined parameter.
  • Parameter T may be encoded in the bitstream and sent to the decoder.
  • PCC may use a simple packing strategy that iteratively tries to insert patches into a WxH grid.
  • W and H are user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded.
  • the patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid is temporarily doubled, and search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
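  • A sketch of this insertion strategy, under the assumption of a boolean grid representation; it is illustrative rather than the reference packing code:

```python
import numpy as np

def insert_patch(used: np.ndarray, patch_h: int, patch_w: int):
    """Insert one patch into the WxH grid by exhaustive raster-scan search,
    temporarily doubling the grid height when no empty space fits the patch.

    used: boolean (H, W) grid of cells already covered by patches.
    Returns the (possibly grown) grid and the chosen (row, col).
    """
    assert patch_w <= used.shape[1], "patch wider than the grid"
    while True:
        h, w = used.shape
        for r in range(h - patch_h + 1):       # raster scan order
            for c in range(w - patch_w + 1):
                if not used[r:r + patch_h, c:c + patch_w].any():
                    used[r:r + patch_h, c:c + patch_w] = True
                    return used, (r, c)        # first overlap-free location
        # Nothing fits in the current resolution: double H and retry.
        used = np.vstack([used, np.zeros_like(used)])
```

  At the end of the packing process, H would be clipped to the used grid cells, as described above.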
  • the geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images.
  • each patch may be projected onto two images, referred to as layers.
  • Let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v).
  • The first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0.
  • The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+D], where D is a user-defined parameter that describes the surface thickness.
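  • A small sketch of how the two layers could be derived from the points H(u, v) falling on the same pixel; the (u, v, depth) tuple input format is an assumption for illustration:

```python
from collections import defaultdict

def build_depth_layers(projected_points, surface_thickness):
    """Derive the near and far depth layers of a patch.

    projected_points: iterable of (u, v, depth) tuples, i.e. all points of the
    patch together with the pixel they project to (H(u, v) flattened).
    Returns two dicts mapping (u, v) -> depth.
    """
    buckets = defaultdict(list)
    for u, v, d in projected_points:
        buckets[(u, v)].append(d)

    near, far = {}, {}
    for uv, depths in buckets.items():
        d0 = min(depths)                      # near layer: lowest depth D0
        near[uv] = d0
        far[uv] = max(d for d in depths
                      if d <= d0 + surface_thickness)  # highest depth in [D0, D0+D]
    return near, far
```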
  • the generated videos may have the following characteristics:
  • the geometry video is monochromatic.
  • the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
  • the geometry images and the texture images may be provided to image padding 107.
  • the image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images.
  • the occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively.
  • the occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process.
  • the padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g. 16x16) pixels is compressed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels, then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
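  • For the mixed case, a sketch of the iterative fill with the average of non-empty neighbours; the 4-neighbourhood and the array layout are illustrative assumptions:

```python
import numpy as np

def pad_mixed_block(values: np.ndarray, occupied: np.ndarray) -> np.ndarray:
    """Fill the empty pixels of a TxT block that has both empty and filled
    pixels, iteratively replacing each empty pixel by the average of its
    already-filled 4-neighbours.

    values:   (T, T) array of pixel values (converted to float for averaging).
    occupied: (T, T) boolean array, True where the pixel is non-empty.
    """
    assert occupied.any(), "intended for blocks with at least one filled pixel"
    values = values.astype(float)
    filled = occupied.copy()
    t = values.shape[0]
    while not filled.all():
        for yy in range(t):
            for xx in range(t):
                if filled[yy, xx]:
                    continue
                neigh = [(yy - 1, xx), (yy + 1, xx), (yy, xx - 1), (yy, xx + 1)]
                vals = [values[j, i] for j, i in neigh
                        if 0 <= j < t and 0 <= i < t and filled[j, i]]
                if vals:                       # fill once a neighbour is known
                    values[yy, xx] = sum(vals) / len(vals)
                    filled[yy, xx] = True
    return values
```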
  • the padded geometry images and padded texture images may be provided for video compression 108.
  • the generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters.
  • the video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102.
  • the smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
  • the patch may be associated with auxiliary information being encoded/decoded for each patch as metadata.
  • the auxiliary information may comprise the index of the projection plane, the 2D bounding box, and the 3D location of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection plane, (δ0, s0, r0) may be computed accordingly.
  • the mapping information providing for each TxT block its associated patch index may be encoded for example as follows:
  • Let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block.
  • the order in the list is the same as the order used to encode the 2D bounding boxes.
  • L is called the list of candidate patches.
  • the occupancy map compression 110 leverages the auxiliary information described in the previous section, in order to detect the empty TxT blocks (i.e. blocks with patch index 0).
  • the remaining blocks may be encoded as follows:
  • the occupancy map can be encoded with a precision of B0xB0 blocks.
  • the compression process may comprise one or more of the following example operations:
  • Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block.
  • A value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise it is an empty sub-block.
  • binary information may be encoded for each TxT block to indicate whether it is full or not.
  • extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
  • Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
  • the encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
  • the binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.
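  • A minimal sketch of such a run-length encoding of the sub-block values; the actual bitstream syntax is more elaborate than these (value, run) pairs:

```python
def run_length_encode(bits):
    """Run-length encode the binary sub-block values visited in the chosen
    traversal order.

    bits: sequence of 0/1 values (full/empty sub-blocks).
    Returns a list of (value, run_length) pairs.
    """
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([b, 1])       # start a new run
    return [tuple(r) for r in runs]

# Example: full/empty pattern of sixteen 4x4 sub-blocks in one 16x16 block.
print(run_length_encode([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]))
# -> [(1, 3), (0, 2), (1, 4), (0, 5), (1, 2)]
```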
  • a multiplexer 112 may receive a compressed geometry video and a compressed texture video from the video compression 108, a compressed occupancy map from occupancy map compression 110 and optionally a compressed auxiliary patch information from auxiliary patch-info compression 111. The multiplexer 112 uses the received data to produce a compressed bitstream.
  • FIG. 2 illustrates an overview of a decompression process for MPEG Point Cloud Coding (PCC).
  • a de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202.
  • the de-multiplexer 201 transmits the compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 204.
  • Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information.
  • the point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers.
  • the 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images. For example, let P be the point associated with the pixel (u, v), let (δ0, s0, r0) be the 3D location of the patch to which it belongs, and let (u0, v0, u1, v1) be its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:
  • δ(u, v) = δ0 + g(u, v)
  • s(u, v) = s0 - u0 + u
  • r(u, v) = r0 - v0 + v
  • where g(u, v) is the luma component of the geometry image.
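  • These equations translate directly into code; in the sketch below the auxiliary patch information is assumed to be available as a plain dictionary, an illustrative choice only:

```python
def reconstruct_point(u, v, g, patch):
    """3D position of the point at pixel (u, v) of a patch.

    g:     luma value g(u, v) of the geometry image.
    patch: dict with the auxiliary information: 'delta0', 's0', 'r0'
           (3D patch location) and 'u0', 'v0' (2D bounding box corner).
    Returns (depth, tangential shift, bitangential shift).
    """
    depth = patch['delta0'] + g           # delta(u, v) = delta0 + g(u, v)
    s = patch['s0'] - patch['u0'] + u     # s(u, v) = s0 - u0 + u
    r = patch['r0'] - patch['v0'] + v     # r(u, v) = r0 - v0 + v
    return depth, s, r
```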
  • the reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts.
  • the implemented approach moves boundary points to the centroid of their nearest neighbors.
  • the smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202.
  • the texture reconstruction 207 outputs a reconstructed point cloud.
  • the texture values for the texture reconstruction are directly read from the texture images.
  • Coding of occupancy information can be performed with the geometry image.
  • a specific depth value, e.g. 0, or a specific depth value range may be reserved to indicate that a pixel is inpainted and not present in the source material.
  • the specific depth value or the specific depth value range may be pre-defined, for example in a standard, or the specific depth value or the specific depth value range may be encoded into or along the bitstream and/or may be decoded from or along the bitstream. This way of multiplexing the occupancy information in the depth sample array creates sharp edges into the images, which may be subject to additional bitrate as well as compression artefacts around the sharp edges.
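  • A sketch of how a decoder could recover occupancy under this convention, assuming a single reserved value rather than a value range:

```python
import numpy as np

def occupancy_from_geometry(geometry: np.ndarray, reserved: int = 0) -> np.ndarray:
    """Recover occupancy from a geometry image in which the reserved depth
    value marks inpainted pixels that are absent from the source material.
    A reserved value range would need an interval test instead.
    """
    return geometry != reserved   # True for pixels present in the source
```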
  • One way to compress a time-varying volumetric scene/object is to project 3D surfaces onto a number of pre-defined 2D planes.
  • Regular 2D video compression algorithms can then be used to compress various aspects of the projected surfaces.
  • a time-varying 3D point cloud with spatial and texture coordinates can be mapped into a sequence of at least two sets of planes, where one of the two sets carry the texture data and the other carries the distance of the mapped 3D surface points from the projection planes.
  • 2D projections of 3D data will have an arbitrary shape, i.e. no block-alignment.
  • Figure 3 illustrates the outcome of such projection step, from 3D projection to 2D projection.
  • the boundaries of the projections are blurred/padded to avoid high frequency content.
  • For accurate 2D-to-3D reconstruction at the receiving side, the decoder must be aware which 2D points are "valid" and which points stem from interpolation/padding. This requires the transmission of additional data.
  • the additional data may be encapsulated in the geometry image as a pre-defined depth value (e.g. 0) or a pre-defined range of depth values. This will increase the coding efficiency only on the texture image, since the geometry image is not blurred/padded.
  • encoding artefacts at the object boundaries of the geometry image may create severe artefacts, which require post-processing and may not be concealable.
  • the additional data may also be sent separately as “occupancy map”.
  • Such an occupancy map is costly to transmit. To minimize the cost, it may only be transmitted every 4 frames (every I-frame in an IPPP coding structure). Still, it may require 8-18% of the overall bit rate budget.
  • 3D resampling and additional motion information are required to align the I-frame occupancy map to the three P-frames without a transmitted occupancy map.
  • the coding and decoding of the occupancy map information and the 3D motion information also require significant computational, memory, and memory access resources.
  • the occupancy map information uses a codec different from the video codec used for texture and geometry images. Consequently, it is unlikely that such a dedicated occupancy map codec would be hardware-accelerated.
  • a projection-based compression of a volumetric video data can comprise presenting and encoding different parts of the same object as 2D projections.
  • a new algorithm to down-sample and up-sample the occupancy map (OM) is disclosed.
  • This algorithm is adaptive and non-linear and takes into account the structure of the OM and tries to preserve the OM shape while reducing the required bit-rate to encode it.
  • the bitrate reduction is due to the applied down-sampling.
  • the down-sampling step targets removing pixels and the up-sampling step targets re-introducing them.
  • the whole down/up-sampling process also reduces the amount of bitrate that needs to be transmitted, due to the reduced number of pixels to be encoded in the OM.
  • the occupancy maps are referred to as follows:
  • OOM - original occupancy map
  • DOM - down-sampled occupancy map with ratio ½, ¼, etc., i.e. occupancy information is only available for a group of n pixels, where n is defined based on the down-sampling factor. For example, if the down-sampling ratio is ½ × ½, then n is equal to 4.
  • UOM - up-sampled occupancy map
  • the down-sampling takes into account the pixel values of the adjacent blocks and is not necessarily limited to the pixel values of each block separately. It means that pixel values from different blocks may be used simultaneously in the down-sampling process.
  • the blocks may have a rectangular shape, where a block size of T1xT2 is considered.
  • the size and/or shape of the block may vary in the process.
  • in one part of the OOM the blocks may be square with size TxT, in another part of the OOM the blocks may be rectangular with size T1xT2, and in another part of the OOM the blocks may be rectangular with size T3xT4.
  • the present solution provides steps of down-sampling and up-sampling which are described in the following:
  • an adaptive non-linear down-sampling is proposed.
  • the left column 401 presents the assigned mode number.
  • the middle column 402 shows which pixels are occupied and which pixels are unoccupied (i.e. empty) in the OOM.
  • the right column 403 shows how the 2x2 block is represented in the DOM, i.e. what the representative down-sampled pixel will be in the DOM.
  • the solid black squares 413 refer to occupied pixels while the white squares 412 refer to unoccupied (i.e. empty) pixels.
  • if the majority of pixels in the block are unoccupied, the representative down-sampled pixel (RDP) in the DOM will be an unoccupied pixel.
  • if the majority of pixels in the block are occupied, the RDP will be an occupied pixel.
  • in modes where the occupied pixels lie on a diagonal, the RDP will be unoccupied unless both of the RDPs along the diagonal direction of the occupied pixels in the TxT block (in the OOM) are occupied in the DOM. If the adjacent RDPs are not known yet, this TxT block in the OOM will be marked to be assessed later. This is aligned with the general approach, i.e.
  • the RDP may be occupied or unoccupied according to some criteria. For example, determining pixel occupancy for a pixel of a DOM may comprise determining a number of occupied pixels and/or locations of occupied pixels in a first block of the OOM and one or more pixels of at least one adjacent block of the first block. Embodiments of the decision-making process for mode 13 are described below. The rest of the modes (14-16) will be processed similarly.
  • RDP may be determined to be unoccupied if both the left and right RDPs are unoccupied (the RDPs along the direction of the two occupied pixels, i.e. the RDPs representing the blocks containing pixels e and f in Figure 5).
  • RDP may be determined to be unoccupied if at least two pixels adjacent (in the OOM) to either of the occupied pixels in mode 13 are unoccupied (two of a, b, and e, or two of c, d, and f in Figure 5). In another embodiment, and alternatively, this can be two of the adjacent pixels (two of a, b, c, d, e, f in Figure 5) in the OOM being unoccupied.
  • RDP may be determined to be occupied if all top, left and right RDPs are occupied.
  • RDP may be determined to be occupied if e, b, c, and f are occupied and at least one of a or d is also occupied.
  • RDP may be determined to be unoccupied otherwise.
  • the order of steps (1 to 6, as mentioned above) introduced for down-sampling process of mode 13 may change.
  • different similar criteria for steps 2, 3, 4, and/or 5 may be defined.
  • the criteria may target less shrinking or more shrinking of the OM.
  • when the down-sampling is applied on the OM, the DOM may be up-sampled on the encoder side according to the algorithm which is to be used on the decoder side to create the UOM.
  • the texture and geometry images may then be aligned with the UOM to make sure that no extra information is to be encoded and transmitted to the decoder side.
  • the number of pixels in the texture and geometry images should also decrease and hence, fewer occupied pixels need to be encoded.
  • the down-sampled OM may be encoded and the respective bitstream may be transmitted to the decoder side.
  • on the decoder side, the DOM will be decoded and up-sampled to a UOM.
  • for up-sampling, an adaptive non-linear up-sampling method may be applied.
  • since the down-sampling removes occupied pixels, the up-sampling process tries to compensate for this.
  • the process may comprise one or more of the following operations: if any two diagonally adjacent and occupied RDPs are to be up-sampled (the current RDPs), and the other two RDPs which are adjacent to both of the current RDPs are unoccupied, then one pixel on each side adjacent to both up-sampled RDPs will be considered occupied in the UOM. This is shown in Figure 6, where a 3x3 block in the DOM is up-sampled to a 6x6 block in the UOM.
  • the blocks of dark black squares 611 are direct up-samplings of RDPs 601 from the DOM, while the dashed pixels 613 in the UOM are added targeting a better and smoother visual representation.
  • unoccupied RDPs 602 in the DOM are up-sampled to blocks of white squares 612.
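  • For 2x2 blocks, this up-sampling rule can be sketched as follows, assuming a NumPy 0/1 DOM; the helper below is illustrative, not the normative procedure:

```python
import numpy as np

def upsample_with_diagonal_fill(dom: np.ndarray, t: int = 2) -> np.ndarray:
    """Up-sample a 0/1 DOM by nearest-neighbour expansion, then apply the
    Figure 6 rule: for each pair of diagonally adjacent occupied RDPs whose
    two common neighbours are unoccupied, mark the two UOM pixels that touch
    both up-sampled blocks at the shared corner as occupied.
    """
    h, w = dom.shape
    uom = np.kron(dom, np.ones((t, t), dtype=dom.dtype))  # direct up-sampling
    for y in range(h - 1):
        for x in range(w - 1):
            a, b = dom[y, x], dom[y + 1, x + 1]        # main-diagonal pair
            c, d = dom[y, x + 1], dom[y + 1, x]        # anti-diagonal pair
            cy, cx = (y + 1) * t, (x + 1) * t          # shared corner in UOM
            if a and b and not (c or d):
                uom[cy - 1, cx] = 1                    # bridge the diagonal
                uom[cy, cx - 1] = 1
            elif c and d and not (a or b):
                uom[cy - 1, cx - 1] = 1
                uom[cy, cx] = 1
    return uom
```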
  • one or more pixels adjacent to either the two determined pixels or up-sampled blocks of pixels may be assigned to be occupied in the up- sampled occupancy map.
  • the respective geometry pixel values may be set to the average of the geometry values of the adjacent pixels.
  • the adjacent pixels to any added dashed pixel are five (m, n, o, p, q) and any combination of those may be used for the estimation of the value of the dashed pixel, e.g. only the geometry value of pixel o may be considered, or n, o, and p, or all of the m, n, o, p, q pixel geometry values may be considered.
  • a similar process may be applied to the texture pixel values.
  • a weighted average based on the geometry values may be taken into account. In this case, pixels that are closer in 3D space may have a heavier weight in the weighted average calculation compared to pixels that are farther in 3D space. Such weights will be used in the value calculation of the dashed pixel in the UOM.
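  • A sketch of such a distance-based weighting; the inverse-distance weight function and the input layout are illustrative design choices:

```python
import numpy as np

def weighted_geometry_value(neigh_xyz, neigh_values, target_xyz):
    """Weighted average of the adjacent geometry values, with weights that
    decrease with 3D distance so closer points contribute more. The
    inverse-distance weight is one possible choice, not mandated here.

    neigh_xyz:    (K, 3) 3D positions of the adjacent pixels (e.g. m, n, o, p, q).
    neigh_values: (K,) geometry values of those pixels.
    target_xyz:   (3,) estimated 3D position of the added (dashed) pixel.
    """
    d = np.linalg.norm(np.asarray(neigh_xyz, dtype=float)
                       - np.asarray(target_xyz, dtype=float), axis=1)
    w = 1.0 / (d + 1e-6)                    # avoid division by zero
    return float(np.dot(w, neigh_values) / w.sum())
```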
  • the dashed pixels may comprise more than one (e.g. three) pixels adjacent to the occupied pixels in the UOM. This is shown in Figure 8, where pixels 812, 814 adjacent to the dashed pixel 813 are also considered occupied in the up-sampling process.
  • the diagonally aligned single pixels 921 between the up-sampled pixels may be determined to be occupied too. This is shown in Figure 9.
  • if the occupancy map resolution is decreased, the respective resolution of the geometry and texture images will be decreased too. If the occupancy map resolution is increased, the respective geometry and texture image resolution will be interpolated based on the adjacent pixels.
  • the proposed non-linear up-sampling does not have to be used in conjunction with the proposed down-sampling approach. Any other linear or non-linear down-sampling approach may also benefit from this solution.
  • an encoder performs the OM down-sampling and the selected OM up-sampling, thus creating the UOM as it would be recreated at the decoder. Any differences between the OOM and the UOM can now be signaled to the decoder for a perfect OM reconstruction. Such information can be signaled using run-length coding. With the expected large number of zeros for this approach, the bitrate overhead should be rather small. Other ways of signaling this information can be envisioned, e.g. signaling it in an unused chroma channel of the geometry picture. Rate-distortion optimization (RDO) at the encoder can be utilized to determine whether this additional information shall be signaled or not.
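  • For illustration, a sketch of run-length coding the OOM/UOM difference mask on the encoder side; the alternating-run convention is an assumed signalling format, not the codec's defined syntax:

```python
import numpy as np

def difference_runs(oom: np.ndarray, uom: np.ndarray):
    """Run-length code the OOM/UOM difference mask for lossless OM recovery.

    The mask is 1 where the decoder must flip its up-sampled value. With few
    differences, long runs of zeros dominate and the overhead stays small.
    Runs alternate 0-run / 1-run, starting with a 0-run (possibly of length 0).
    """
    diff = (oom != uom).astype(np.uint8).ravel()
    runs, current, count = [], 0, 0
    for bit in diff:
        if bit == current:
            count += 1
        else:
            runs.append(count)        # close the current run
            current, count = int(bit), 1
    runs.append(count)
    return runs
```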
  • one of the unused chroma channels of the geometry picture is used to signal up-sampling guidance information to the decoder. This information is based on the analysis of the OOM and reconstructed UOM during the encoding.
  • each chroma channel pixel can give simple up-sampling instructions, i.e. whether modes 11-16 shall be used for up-sampling from an unoccupied or occupied DOM pixel. This approach may allow for more accurate up-sampling at the cost of a slight bitrate increase. The requirements on buffer size may remain the same. Again, RDO may be used to determine the optimal signaling structure.
  • FIG. 10 is a flowchart illustrating a method according to an embodiment.
  • a method comprises determining 1011 an occupancy map having a first resolution; grouping 1012 pixels of the occupancy map into non-overlapping blocks; and generating 1013 a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.
  • An apparatus comprises means for determining an occupancy map having a first resolution; means for grouping pixels of the occupancy map into non-overlapping blocks; and means for generating a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.
  • the means comprises a processor, a memory, and computer program code residing in the memory.
  • the processor may further comprise a processing circuit.
  • Figure 11 is a flowchart illustrating a method according to another embodiment.
  • a method comprises receiving 1111 a down-sampled occupancy map, wherein at least one representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and up-sampling 1112 representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value.
  • An apparatus comprises means for receiving a down-sampled occupancy map, wherein at least one representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and means for up-sampling representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value.
  • the means comprises a processor, a memory, and computer program code residing in the memory.
  • the processor may further comprise a processing circuit.
  • Figure 12 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec.
  • the electronic device may comprise an encoder or a decoder.
  • Figure 13 shows a layout of an apparatus according to an embodiment.
  • the electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device.
  • the electronic device 50 may also be comprised in a local or a remote server or in a graphics processing unit of a computer.
  • the device may also be comprised as part of a head-mounted display device.
  • the apparatus 50 may comprise a housing 30 for incorporating and protecting the device.
  • the apparatus 50 may further comprise a display 32 in the form of a liquid crystal display.
  • the display may be any suitable display technology suitable to display an image or video.
  • the apparatus 50 may further comprise a keypad 34.
  • any suitable data or user interface mechanism may be employed.
  • the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
  • the apparatus may further comprise a camera 42 capable of recording or capturing images and/or video.
  • the camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing.
  • the apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
  • the apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices.
  • the apparatus may further comprise any suitable short-range communication solution such as for example a Bluetooth wireless connection or a USB (Universal Serial Bus)/firewire wired connection.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50.
  • the apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry.
  • the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
  • the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
  • the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network.
  • the apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
  • the apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection. Such wired interface may be configured to operate according to one or more digital display interface standards, such as for example High-Definition Multimedia Interface (HDMI), Mobile High-definition Link (MHL), or Digital Visual Interface (DVI).
  • the whole down-/up-sampling process reduces the amount of bitrate that needs to be transmitted, due to the reduced number of pixels to be encoded in the OM.
  • the present embodiments may decrease the compression complexity, memory usage and buffer allocation.
  • the up-sampling process of the OM is a normative process in the (de)coding process, as any inter-prediction between the geometry and texture images requires clarifying which pixels in the geometry map and the 2D grid are valid. Such information is obtained from the up-sampled OM and hence the up-sampling process on the OM is considered normative.
  • a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to a solution that comprises determining an occupancy map having a first resolution; grouping pixels of the occupancy map into non-overlapping blocks; and generating a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block. When the down-sampled occupancy map (601, 602) is received, the representative values in the down-sampled occupancy map are up-sampled (611, 612), where up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value.
PCT/FI2019/050219 2018-03-28 2019-03-14 Method, apparatus and computer program product for encoding and decoding of digital volumetric video WO2019185983A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20185286 2018-03-28
FI20185286 2018-03-28

Publications (1)

Publication Number Publication Date
WO2019185983A1 true WO2019185983A1 (fr) 2019-10-03

Family

ID=68060979

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2019/050219 WO2019185983A1 (fr) Method, apparatus and computer program product for encoding and decoding of digital volumetric video

Country Status (1)

Country Link
WO (1) WO2019185983A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021173262A1 (fr) * 2020-02-24 2021-09-02 Microsoft Technology Licensing, Llc Depth buffer dilation for remote rendering

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0890921A2 (fr) * 1997-07-10 1999-01-13 Samsung Electronics Co., Ltd. Procédé d'interpolation d'images binaires
US6510246B1 (en) * 1997-09-29 2003-01-21 Ricoh Company, Ltd Downsampling and upsampling of binary images
US20050084016A1 (en) * 1996-10-31 2005-04-21 Noboru Yamaguchi Video encoding apparatus and video decoding apparatus
US20190087979A1 (en) * 2017-09-18 2019-03-21 Apple Inc. Point cloud compression

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050084016A1 (en) * 1996-10-31 2005-04-21 Noboru Yamaguchi Video encoding apparatus and video decoding apparatus
EP0890921A2 (fr) * 1997-07-10 1999-01-13 Samsung Electronics Co., Ltd. Procédé d'interpolation d'images binaires
US6510246B1 (en) * 1997-09-29 2003-01-21 Ricoh Company, Ltd Downsampling and upsampling of binary images
US20190087979A1 (en) * 2017-09-18 2019-03-21 Apple Inc. Point cloud compression

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
3DG: "PCC Test Model Category 2 v0", 120. MPEG MEETING; (MOTION PICTURE EXPERT GROUP OR ISO/IEC JTC1/SC29/WG11) N17248, 15 December 2017 (2017-12-15), Macau, XP030023909, Retrieved from the Internet <URL:http://phenix.int-evry.fr/mpeg> [retrieved on 20181016] *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021173262A1 (fr) * 2020-02-24 2021-09-02 Microsoft Technology Licensing, Llc Depth buffer dilation for remote rendering
US11430179B2 (en) 2020-02-24 2022-08-30 Microsoft Technology Licensing, Llc Depth buffer dilation for remote rendering

Similar Documents

Publication Publication Date Title
US11509933B2 (en) Method, an apparatus and a computer program product for volumetric video
EP3751857A1 Method, apparatus and computer program product for encoding and decoding volumetric videos
US20230068178A1 (en) A method, an apparatus and a computer program product for volumetric video encoding and decoding
US20190222821A1 (en) Methods for Full Parallax Compressed Light Field 3D Imaging Systems
CN112219398B (zh) Method and apparatus for depth encoding and decoding
US11051039B2 (en) Methods for full parallax light field compression
EP3759925A1 Apparatus, method and computer program for volumetric video
US20120014590A1 (en) Multi-resolution, multi-window disparity estimation in 3d video processing
EP4085633A1 Apparatus, method and computer program for volumetric video
US11711535B2 (en) Video-based point cloud compression model to world signaling information
US12096027B2 (en) Method, an apparatus and a computer program product for volumetric video encoding and decoding
WO2021260266A1 (fr) Method, apparatus and computer program product for volumetric video coding
WO2021191495A1 (fr) Method, apparatus and computer program product for video encoding and video decoding
EP4133719A1 Method, apparatus and computer program product for volumetric video coding
WO2021170906A1 (fr) Apparatus, method and computer program for volumetric video
WO2021191500A1 (fr) Apparatus, method and computer program for volumetric video
WO2021186103A1 (fr) Method, apparatus and computer program product for encoding and decoding volumetric video
WO2019185983A1 (fr) Method, apparatus and computer program product for encoding and decoding of digital volumetric video
EP3699867A1 Apparatus, method and computer program for volumetric video
WO2019211519A1 (fr) Method and apparatus for encoding and decoding volumetric video
WO2021053261A1 (fr) Method, apparatus and computer program product for video encoding and video decoding
WO2021165566A1 (fr) Apparatus, method and computer program for volumetric video
WO2020254719A1 (fr) Apparatus, method and computer program for volumetric video
WO2022074286A1 (fr) Method, apparatus and computer program product for video encoding and decoding
WO2022219230A1 (fr) Method, apparatus and computer program product for video encoding and video decoding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19775679

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19775679

Country of ref document: EP

Kind code of ref document: A1