
A method, an apparatus and a computer program product for volumetric video coding

Info

Publication number
EP3821602A1
Authority
EP
European Patent Office
Prior art keywords
patch
bitstream
activity
interaction
patches
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP19834742.9A
Other languages
German (de)
French (fr)
Other versions
EP3821602A4 (en)
Inventor
Sebastian Schwarz
Mika Pesonen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of EP3821602A1 publication Critical patent/EP3821602A1/en
Publication of EP3821602A4 publication Critical patent/EP3821602A4/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/033Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
    • G06F3/0346Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor with detection of the device orientation or free movement in a 3D space, e.g. 3D mice, 6-DOF [six degrees of freedom] pointers using gyroscopes, accelerometers or tilt-sensors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/001Model-based coding, e.g. wire frame
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/111Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N13/117Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/194Transmission of image signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46Embedding additional information in the video signal during the compression process
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23412Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs for generating or manipulating the scene composition of objects, e.g. MPEG-4 objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/434Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
    • H04N21/4348Demultiplexing of additional data and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g. 3D video
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors

Definitions

  • the present solution generally relates to volumetric video coding.
  • the solution relates to a solution for projection-based point cloud compression.
  • new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions).
  • new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being "immersed" into the scene captured by the 360 degrees camera.
  • the new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.
  • VR virtual reality
  • For volumetric video, a scene may be captured using one or more 3D (three-dimensional) cameras. The cameras are in different positions and orientations within a scene.
  • 3D three-dimensional
  • One issue to consider is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (with or without transferring it from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel.
  • a method comprising receiving a volumetric video comprising a three-dimensional object; segmenting the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object: inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction; transmitting the bitstream to a decoder.
  • an apparatus comprising at least one processor, memory including computer program code, wherein memory and the computer program code are configured to, with the at least one processor, cause the apparatus to receive a volumetric video comprising a three- dimensional object; segment the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object: to insert into a bitstream a signal indicating a relation of three- dimensional location of the patch to at least one predefined item; and/or to insert into the bitstream a signal indicating an activity of the patch as a response to an interaction; transmit the bitstream to a decoder.
  • an apparatus comprising means for receiving a volumetric video comprising a three-dimensional object; means for segmenting the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object; means for inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or means for inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction; means for transmitting the bitstream to a decoder.
  • a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive a volumetric video comprising a three-dimensional object; segment the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object: to insert into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or to insert into the bitstream a signal indicating an activity of the patch as a response to an interaction; transmit the bitstream to a decoder.
  • a method comprising receiving a bitstream; decoding from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or a signal indicating an activity of the patch as a response to an interaction; reconstructing a volumetric video according to the decoded information; passing the reconstructed volumetric video to a renderer; and passing any decoded activity information to the renderer.
  • an apparatus comprising at least one processor, memory including computer program code, wherein memory and the computer program code are configured to, with the at least one processor, cause the apparatus to receive a bitstream; decode from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or a signal indicating an activity of the patch as a response to an interaction; reconstruct a volumetric video according to the decoded information; pass the reconstructed volumetric video to a renderer; and pass any decoded activity information to the renderer.
  • the item is one or more of the following: one or more of other patches of the same 3D object; a current position of a user with or without a viewing orientation; another object; a predefined path.
  • the interaction is an interaction of a user to the 3D object, or an interaction of another 3D object to the patch.
  • the activity of the patch is associated with an activity to be performed.
  • the relation or activity signaling is performed in an auxiliary patch information.
  • the relation or activity signaling is performed by one or more bits.
  • the available activities are signaled in or along the bitstream as a look up table.
  • the computer program product is embodied on a non-transitory computer readable medium.
  • Fig. 1 shows an example of a volumetric video compression process
  • Fig. 2 shows an example of a volumetric video decompression process
  • Fig. 3 shows an example of a 3D point cloud being segmented into patches
  • Fig. 4a is a flowchart illustrating a method according to an embodiment
  • Fig. 4b is a flowchart illustrating a method according to another embodiment
  • Fig. 5 shows an apparatus according to an embodiment in a simplified manner
  • Fig. 6 shows a layout of an apparatus according to an embodiment.
  • the present embodiments relate to patch functionality signaling for projection-based point cloud compression.
  • Volumetric video may be captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.
  • Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane.
  • Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane.
  • 2D two-dimensional
  • Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example.
  • Volumetric video data represents a three-dimensional scene or object, and thus such data can be viewed from any viewpoint.
  • Volumetric video data can be used as an input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications.
  • AR augmented reality
  • VR virtual reality
  • MR mixed reality
  • Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g. color, opacity, reflectance, ...), together with any possible temporal changes of the geometry and attributes at given time instances (e.g. frames in 2D video).
  • Volumetric video is either generated from 3D models, i.e. computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera, a laser scan, a combination of video and dedicated depth sensors, etc.
  • CGI computer-generated imagery
  • Also a combination of CGI and real-world data is possible.
  • representation formats for such volumetric data are triangle meshes, point clouds, or voxels.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e. "frames" in 2D video, or other means, e.g. position of an object as a function of time.
  • Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used.
  • Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is to code this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps. In 3D point clouds, each point of each 3D surface is described as a 3D point with color and/or other attribute information such as surface normal or material reflectance.
  • A point cloud is a set of data points in a coordinate system, for example in a three-dimensional coordinate system defined by X, Y, and Z coordinates.
  • the points may represent an external surface of an object in the screen space, e.g. in a three-dimensional space.
  • In dense point clouds or voxel arrays the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression of the representations becomes fundamental.
  • Standard volumetric video representation formats, such as point clouds, meshes, and voxels, suffer from poor temporal compression performance. Identifying correspondences for motion compensation in 3D space is an ill-defined problem, as both geometry and respective attributes may change.
  • temporally successive "frames" do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient.
  • 2D-video based approaches for compressing volumetric data, i.e. multiview with depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.
  • a 3D scene, represented as meshes, points, and/or voxels, can be projected onto one, or more, geometries. These geometries may be "unfolded" or packed onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies.
  • Relevant projection geometry information may be transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).
  • Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression.
  • coding efficiency can be increased greatly.
  • Using geometry-projections instead of 2D-video based approaches based on multiview and depth, provides a better coverage of the scene (or object).
  • 6DOF capabilities are improved.
  • Using several geometries for individual objects improves the coverage of the scene further.
  • standard video encoding hardware can be utilized for real-time compression/decompression of the projected planes. The projection and the reverse projection steps are of low complexity.
  • Figure 1 illustrates an overview of an example of a compression process. Such process may be applied for example in MPEG Point Cloud Coding (PCC).
  • PCC MPEG Point Cloud Coding
  • the process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.
  • the patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error.
  • the normal at every point can be estimated.
  • An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:
  • each point may be associated with the plane that has the closest normal (i.e. maximizes the dot product of the point normal and the plane normal).
  • the initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors.
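  • As a concrete illustration of the initial clustering step, the following Python sketch (hypothetical helper names, not part of the described codec) assigns each point to the oriented plane whose normal maximizes the dot product with the estimated point normal; the six axis-aligned plane normals used here are an assumption based on common MPEG PCC practice:

```python
import numpy as np

# Six oriented planes assumed here (axis-aligned unit normals, as commonly
# used in MPEG PCC test models); the patent text lists them separately.
PLANE_NORMALS = np.array([
    [ 1, 0, 0], [ 0, 1, 0], [ 0, 0, 1],
    [-1, 0, 0], [ 0, -1, 0], [ 0, 0, -1],
], dtype=float)

def initial_clustering(point_normals: np.ndarray) -> np.ndarray:
    """Associate each point with the plane whose normal maximizes the dot
    product with the (unit) point normal.

    point_normals: (N, 3) array of estimated per-point normals.
    Returns an (N,) array of cluster indices in [0, 5].
    """
    # (N, 6) matrix of dot products between point normals and plane normals.
    scores = point_normals @ PLANE_NORMALS.T
    return np.argmax(scores, axis=1)

# Example: two points whose normals point roughly along +x and -z.
normals = np.array([[0.9, 0.1, 0.0], [0.1, 0.0, -0.95]])
print(initial_clustering(normals))  # -> [0 5]
```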
  • the final step may comprise extracting patches by applying a connected component extraction procedure. Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105.
  • the packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g. 16x16) block of the grid is associated with a unique patch.
  • T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.
  • W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded.
  • the patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid is temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
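  • The raster-scan packing search described above can be sketched as follows (a simplified, illustrative Python implementation; the helper names and the block-grid representation are assumptions, and the final clipping of H is omitted):

```python
import numpy as np

def pack_patch(used: np.ndarray, patch_h: int, patch_w: int):
    """Place one patch on the block grid by exhaustive raster-scan search:
    the first location giving an overlap-free insertion is selected; if
    nothing fits, the grid height is doubled and the search is repeated.
    `used` is a boolean (H, W) grid of TxT blocks already taken.
    Returns (row, col, possibly grown grid)."""
    while True:
        grid_h, grid_w = used.shape
        for r in range(grid_h - patch_h + 1):
            for c in range(grid_w - patch_w + 1):
                if not used[r:r + patch_h, c:c + patch_w].any():
                    used[r:r + patch_h, c:c + patch_w] = True  # mark as used
                    return r, c, used
        # No empty space fits the patch: temporarily double the height H.
        used = np.vstack([used, np.zeros_like(used)])

# Example: 4x4 block grid, place a 2x3-block patch twice.
grid = np.zeros((4, 4), dtype=bool)
print(pack_patch(grid, 2, 3)[:2])   # -> (0, 0)
print(pack_patch(grid, 2, 3)[:2])   # -> (2, 0)
```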
  • the geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images respectively.
  • the image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images.
  • each patch may be projected onto two images, referred to as layers.
  • Let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v).
  • the first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0.
  • the second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+D], where D is a user-defined parameter that describes the surface thickness.
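  • A minimal sketch of the two-layer projection, assuming the patch points are already projected to pixel coordinates (the data layout and helper names are illustrative only):

```python
from collections import defaultdict

def project_to_layers(points, surface_thickness):
    """Build the two layers described above. `points` is a list of
    (u, v, depth) samples of one patch after projection. For each pixel
    (u, v), the near layer stores the lowest depth D0 and the far layer the
    highest depth inside [D0, D0 + surface_thickness]."""
    per_pixel = defaultdict(list)
    for u, v, d in points:
        per_pixel[(u, v)].append(d)            # H(u, v): points hitting (u, v)

    near, far = {}, {}
    for (u, v), depths in per_pixel.items():
        d0 = min(depths)                        # near layer value
        in_range = [d for d in depths if d0 <= d <= d0 + surface_thickness]
        near[(u, v)] = d0
        far[(u, v)] = max(in_range)             # far layer value
    return near, far

# Example: three points land on pixel (3, 7) with depths 10, 12 and 30.
near, far = project_to_layers([(3, 7, 10), (3, 7, 12), (3, 7, 30)],
                              surface_thickness=4)
print(near[(3, 7)], far[(3, 7)])  # -> 10 12  (30 exceeds D0 + thickness)
```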
  • the generated videos may have the following characteristics:
  • the geometry video is monochromatic.
  • the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
  • the geometry images and the texture images may be provided to image padding 107.
  • the image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images.
  • the occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively.
  • the occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 103.
  • the padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression.
  • each block of TxT (e.g. 16x16) pixels is compressed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e. no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e. an edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
  • TxT e.g. 16x16
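  • The padding rules above can be illustrated with the following simplified Python sketch (illustrative only; the treatment of the "previous block in raster order" is simplified to the left or upper neighbour):

```python
import numpy as np

def pad_block(image, occupied, r0, c0, T=16):
    """Pad one TxT block in place: a fully empty block copies the last row or
    column of a neighbouring block, a full block is left untouched, and an
    edge block fills each empty pixel iteratively with the average of its
    non-empty 4-neighbours."""
    blk = occupied[r0:r0 + T, c0:c0 + T]
    if blk.all():
        return                                   # fully occupied block
    if not blk.any():                            # fully empty block
        if c0 >= T:                              # copy last column of left neighbour
            image[r0:r0 + T, c0:c0 + T] = image[r0:r0 + T, c0 - 1:c0]
        elif r0 >= T:                            # copy last row of block above
            image[r0:r0 + T, c0:c0 + T] = image[r0 - 1:r0, c0:c0 + T]
        return
    # Edge block: iteratively average the already-filled 4-neighbours.
    filled = blk.copy()
    while not filled.all():
        for r in range(T):
            for c in range(T):
                if filled[r, c]:
                    continue
                vals = [image[r0 + rr, c0 + cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < T and 0 <= cc < T and filled[rr, cc]]
                if vals:
                    image[r0 + r, c0 + c] = sum(vals) / len(vals)
                    filled[r, c] = True

# Example: pad a 2x2 edge block where only the top-left pixel is occupied.
img = np.array([[8., 0.], [0., 0.]])
occ = np.array([[True, False], [False, False]])
pad_block(img, occ, 0, 0, T=2)
print(img)   # all four pixels become 8.0
```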
  • the padded geometry images and padded texture images may be provided for video compression 108.
  • the generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters.
  • the video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102.
  • the smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
  • the patch may be associated with auxiliary information being encoded/decoded for each patch as metadata.
  • the auxiliary information may comprise the index of the projection plane, the 2D bounding box, and the 3D location of the patch.
  • Metadata may be encoded/decoded for every patch:
  • mapping information providing for each TxT block its associated patch index may be encoded as follows: - For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.
  • the empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.
  • the occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • One cell of the 2D grid produces a pixel during the image generation process.
  • the occupancy map compression 110 leverages the auxiliary information described in previous section, in order to detect the empty TxT blocks (i.e. blocks with patch index 0).
  • the remaining blocks may be encoded as follows:
  • the occupancy map can be encoded with a precision of B0xB0 blocks.
  • the compression process may comprise one or more of the following example operations:
  • Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block.
  • a value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise it is an empty sub-block.
  • binary information may be encoded for each TxT block to indicate whether it is full or not.
  • extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
  • Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
  • the encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
  • the binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.
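  • A simplified sketch of the occupancy-map block encoding described above (illustrative only; a real encoder would also signal the chosen traversal-order index and entropy-code the run lengths):

```python
import numpy as np

def encode_occupancy_block(block, B0=4):
    """Encode one TxT occupancy-map block: split it into B0xB0 sub-blocks,
    give each sub-block a binary value (1 if it contains at least one
    occupied pixel), signal a single bit when the block is full, and
    otherwise run-length encode the sub-block values along a traversal
    order (horizontal raster order here)."""
    T = block.shape[0]
    n = T // B0
    sub = np.array([[block[i*B0:(i+1)*B0, j*B0:(j+1)*B0].any()
                     for j in range(n)] for i in range(n)], dtype=int)
    if sub.all():
        return {"full": 1}                     # single bit: block is full
    values = sub.flatten()                     # horizontal traversal order
    runs, current, length = [], values[0], 0
    for v in values:
        if v == current:
            length += 1
        else:
            runs.append((int(current), length))
            current, length = v, 1
    runs.append((int(current), length))
    return {"full": 0, "traversal": "horizontal", "runs": runs}

# Example: a 16x16 block whose left half is occupied.
blk = np.zeros((16, 16), dtype=bool)
blk[:, :8] = True
print(encode_occupancy_block(blk))
# -> {'full': 0, 'traversal': 'horizontal', 'runs': [(1, 2), (0, 2), (1, 2), ...]}
```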
  • FIG. 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC).
  • a de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202.
  • the de-multiplexer 201 transmits the compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch information decompression 204.
  • Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information.
  • the point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.
  • the reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts.
  • the implemented approach moves boundary points to the centroid of their nearest neighbors.
  • the smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202.
  • the texture reconstruction 207 outputs a reconstructed point cloud.
  • the texture values for the texture reconstruction are directly read from the texture images.
  • the point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers.
  • the 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v), let (δ0, s0, r0) be the 3D location of the patch to which it belongs, and let (u0, v0, u1, v1) be its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:
  • δ(u, v) = δ0 + g(u, v)
  • s(u, v) = s0 - u0 + u
  • r(u, v) = r0 - v0 + v
  • where g(u, v) is the luma component of the geometry image.
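  • A minimal sketch of this reconstruction step, assuming the auxiliary patch information is available as a simple dictionary (the field names are illustrative, and the final mapping of (depth, tangential, bi-tangential) to XYZ via the projection plane index is omitted):

```python
def reconstruct_point(u, v, g, patch):
    """Reconstruct the position of the point associated with pixel (u, v)
    using the formulas above. `g` is the luma value of the geometry image
    at (u, v); `patch` carries the auxiliary patch information: the 3D
    location (d0, s0, r0) and the 2D bounding-box origin (u0, v0).
    The final XYZ position additionally depends on the patch projection
    plane index, which is not modelled here."""
    d0, s0, r0 = patch["location_3d"]
    u0, v0 = patch["bbox_origin_2d"]
    depth = d0 + g                  # delta(u, v) = delta0 + g(u, v)
    tangential = s0 - u0 + u        # s(u, v) = s0 - u0 + u
    bitangential = r0 - v0 + v      # r(u, v) = r0 - v0 + v
    return depth, tangential, bitangential

# Example with a hypothetical patch located at (5, 10, 20) whose 2D
# bounding box starts at (u0, v0) = (2, 3).
patch = {"location_3d": (5, 10, 20), "bbox_origin_2d": (2, 3)}
print(reconstruct_point(u=4, v=7, g=12, patch=patch))  # -> (17, 12, 24)
```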
  • the texture values can be directly read from the texture images.
  • the result of the decoding process is a 3D point cloud reconstruction.
  • the various embodiments introduce a signal per patch to define the relation of the patch 3D location (part of the auxiliary patch information) to one or more of the following:
  • the user's current position, e.g. a virtual assistant or a pet following the user,
  • another object in the 3D scene, e.g. an advertisement or information screen closely following the object moving through 3D space.
  • Another signal is introduced per patch to define the behavior of the reprojected 3D data with regards to user interactions, for example one or more of the following:
  • HMD FoV (Head-Mounted Display Field-of-View)
  • eye-tracker or similar
  • simulated interaction/collision, e.g. the user avatar in 3D space has some sort of marker (laser pointer, gun, etc.) to interact with 3D objects
  • a simulated 3D projectile e.g. a ball thrown by the user (or another 3D object)
  • Opening a webpage in a browser, e.g. if a user clicks on the shirt of a 3D model, a webpage with the store to purchase this shirt is opened.
  • Playback of a media file, e.g. if the user touches a 3D bird, a sound file is played back.
  • the projected patch is projected from different parts of the 3D scene/model to the projection planes.
  • Fig. 3 illustrates an image 301 showing a 3D object (i.e. 3D volumetric video object) being segmented into patches 305 (only three patches have been indicated with numeral 305 for the sake of simplicity). Each grayscale color represents a different patch.
  • 3D object i.e. 3D volumetric video object
  • the original point cloud content may be labelled so that, for example, each point includes an index number that, for example, refers to different body parts.
  • A simple example can be index values of 0, 1, 2 and 3, where points belonging to the head will have an index value of 0, points belonging to the arms will have an index value of 1, points belonging to the torso will have an index value of 2, and points belonging to the legs will have an index value of 3.
  • Body part index numbers can be created using computer vision or machine learning algorithms, where each body part is identified and indexed, or they can be assigned manually by user input. Indexing is easy with CGI-generated content, where the body hierarchy is known and the character model is converted from triangles to points with index numbers.
  • a point cloud may be segmented into patches (Fig. 3: 305).
  • the segmentation may segment points with similar normals into different patches. However, with the additional index data, the segmentation has to take into account that each final segmented patch must contain only one index number. For example, all the points that refer to the head object (index 0), for example the mouth, ears and eyes, can be in different patches that all have the same index of 0. However, there cannot be a patch that would include points from both the torso and the head object, as those have different indices.
  • Each patch will therefore have an additional "index" number stored in the patch metadata.
  • These indices can refer to several look-up tables that have different actions and additional properties for the patches.
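  • A small sketch of how the index constraint could be enforced during patch segmentation (the helper names and data structures are illustrative only):

```python
from collections import defaultdict

# Hypothetical body-part index values matching the example above.
HEAD, ARMS, TORSO, LEGS = 0, 1, 2, 3

def split_patches_by_index(patches, point_body_index):
    """Enforce the constraint described above: a final patch must not mix
    points with different body-part indices. Each candidate patch (a list of
    point ids) is split into one patch per index, and the index is stored in
    the patch metadata so it can later refer to a look-up table."""
    result = []
    for point_ids in patches:
        groups = defaultdict(list)
        for pid in point_ids:
            groups[point_body_index[pid]].append(pid)
        for body_index, ids in groups.items():
            result.append({"points": ids, "index": body_index})
    return result

# Example: one candidate patch accidentally spans head and torso points.
body_index = {0: HEAD, 1: HEAD, 2: TORSO}
print(split_patches_by_index([[0, 1, 2]], body_index))
# -> [{'points': [0, 1], 'index': 0}, {'points': [2], 'index': 2}]
```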
  • a 3D location relationship is signaled in a bitstream.
  • the 3D location of each patch is signaled as auxiliary patch information as absolute values for (x0, y0, z0).
  • the reconstructed 3D object (or parts thereof) is always in relation to an overall 3D bounding box of the 3D volume.
  • an additional signal is introduced to the auxiliary patch information, to represent the relation of the 3D patch location.
  • a single bit signal is introduced, wherein the single bit indicates whether the 3D location of the patch is absolute, i.e. there is no change and no relation to a previous patch, or whether it is signaled in relation to a previous patch.
  • the latter is similar to residual coding in video compression, i.e. the values for the relative location (xr, yr, zr) may be smaller than the absolute (x0, y0, z0) values.
  • coding efficiency for the auxiliary patch data is improved.
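  • A sketch of the absolute/relative location signalling, assuming a generic bitstream writer with write_bit()/write_int() methods; the bit semantics (0 = absolute, 1 = relative) and the selection heuristic are illustrative assumptions, not a normative syntax:

```python
class BitLog:
    """Minimal stand-in for a bitstream writer (illustration only)."""
    def __init__(self):
        self.items = []
    def write_bit(self, bit):
        self.items.append(("bit", bit))
    def write_int(self, value):
        self.items.append(("int", value))

def encode_patch_location(writer, patch, previous_patch=None):
    """If a previous patch exists and the residual is smaller in magnitude,
    write the relative location (xr, yr, zr) instead of the absolute
    (x0, y0, z0) values, preceded by the assumed 1-bit flag."""
    x0, y0, z0 = patch["location_3d"]
    if previous_patch is not None:
        px, py, pz = previous_patch["location_3d"]
        xr, yr, zr = x0 - px, y0 - py, z0 - pz
        # Simple heuristic: relative coding when the residuals are smaller.
        if max(abs(xr), abs(yr), abs(zr)) < max(abs(x0), abs(y0), abs(z0)):
            writer.write_bit(1)                 # 1 = relative to previous patch
            for v in (xr, yr, zr):
                writer.write_int(v)
            return
    writer.write_bit(0)                         # 0 = absolute 3D location
    for v in (x0, y0, z0):
        writer.write_int(v)

w = BitLog()
encode_patch_location(w, {"location_3d": (101, 52, 33)},
                      previous_patch={"location_3d": (100, 50, 30)})
print(w.items)  # -> [('bit', 1), ('int', 1), ('int', 2), ('int', 3)]
```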
  • more advanced relationships are enabled by additional signaling. Instead of the single 1-bit signal, more bits, e.g. 3 bits, can be used to signal different relationships. For example:
  • Entropy coding can be applied to minimize the bit costs for less probable relationships.
  • the previous examples a - f may relate to the following use cases and have the example signaling, respectively:
  • c) Can be used for e.g. a virtual assistant or a pet following the user. At first a relationship is signaled, then a 3D translation vector with regard to the user position is signaled. d) Can be used for e.g. a virtual advertisement or information data always in view. At first a relationship is signaled, then a 3D translation vector with regard to the user position is signaled, and the initial 3D rotation offset is kept constant if the user rotates the head.
  • e) Can be used for e.g. a virtual advertisement or information data around a 3D object.
  • a relationship is signaled first, then the target patch ID and the respective 3D translation vector to this patch are signaled.
  • f) Can be used for e.g. a virtual advertisement or information data moving independently.
  • the target path ID and the respective 3D translation vector can be zero.
  • One or more possible paths are signaled as look-up-tables in or along the bitstream.
  • patch interaction characteristics are signaled.
  • an additional signal in the auxiliary patch information is introduced, to provide the user with interaction possibilities when consuming projection-based coded/transmitted volumetric video content.
  • a single bit signal is introduced, wherein the bit indicates if a patch is active or not. If an interaction with the patch is detected, a predefined action is performed. Such an action can be hardcoded in the decoder/receiver/renderer, or transmitted along the bitstream, e.g. as XML data.
  • the single bit for indicating if a patch is active or not is followed by a signal to determine what action should be performed, e.g. a fixed-length coded index number pointing at an action look-up-table signaled in or along the bitstream, e.g. as XML data.
  • Table 1 shows an example of such a look-up-table and possible triggered actions.
  • a multiple bit signal is introduced indicating if a patch is active or not, and to which interaction it is active. For example, a user can click or hover with an input cursor (e.g. a mouse interface), touch with a 3D input device, look directly at the patch, approach the patch in 3D space, collide with/approach the patch with an interactive marker (e.g. a virtual reality (VR) laser pointer), or use other interaction methods.
  • an input cursor e.g. a mouse interface
  • VR virtual reality
  • Table 2 shows an example on how different interactions can trigger different effects. Such a table can be signaled in or along the bitstream, e.g. as XML data. Different combinations of active interactions are possible, i.e. a patch can have different reactions to different combinations of interactions.
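  • A renderer-side sketch of the look-up-table based interaction handling described above; the table contents and field names are hypothetical stand-ins for Tables 1 and 2, which would be signaled in or along the bitstream:

```python
# Hypothetical look-up tables in the spirit of Tables 1 and 2 above; the
# actual contents would be signalled in or along the bitstream, e.g. as XML.
ACTION_LUT = {
    0: "none",
    1: "open_webpage",
    2: "play_media_file",
}
INTERACTION_ACTION_LUT = {
    ("click", 1): "open_webpage",
    ("touch", 2): "play_media_file",
    ("gaze",  2): "none",
}

def handle_interaction(patch, interaction):
    """If the patch is signalled as active, look up the action triggered by
    the detected interaction; fall back to the plain action table when no
    interaction-specific entry exists."""
    if not patch.get("active", False):
        return "none"
    key = (interaction, patch["action_index"])
    return INTERACTION_ACTION_LUT.get(key,
                                      ACTION_LUT.get(patch["action_index"], "none"))

# Example: a patch carrying action index 1 reacts to a click.
patch = {"active": True, "action_index": 1}
print(handle_interaction(patch, "click"))  # -> open_webpage
print(handle_interaction(patch, "gaze"))   # -> open_webpage (fallback)
```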
  • patch material characteristics are signaled.
  • Such material characteristics define how a patch will react to environment interactions, e.g. if a patch is hit by a virtual 3D object, should the object bounce back (and if yes, how), or if the patch is hit by the virtual environment effect "rain", should it change its color to reflect a new status "wet".
  • Such material characteristics can again be signaled as look-up-table.
  • Other physics related material properties can include bounciness, friction, restitution, density and mass.
  • n Bits signalling if patch is active or not and what activity shall be performed. Look up action in LUT.
  • n Bits signalling if patch is active or not and what activity shall be performed for what interaction. Look up interaction and respective action in LUT.
  • n Bits signalling what activity shall be performed for what interaction. Look up interaction and respective action in LUT.
  • n Bits signalling if patch is active or not and what activity shall be performed for what interaction. Look up interaction and respective action and material in LUT. Alternatively:
  • n Bits signalling material and what activity shall be performed for what interaction. Look up material, interaction, and respective action in LUT.
  • n Bits signalling what activity shall be performed for what interaction, followed by n Bits signalling what material the patch consists of (a different index than above).
  • n Bits signalling what material the patch consists of. Look up material in LUT.
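  • The signalling variants above can be illustrated with a simple bit-field layout; the field widths and ordering below are assumptions for illustration, not a normative syntax:

```python
def pack_patch_flags(active, action_index, material_index,
                     action_bits=3, material_bits=3):
    """Illustrative n-bit field: 1 bit for active/inactive, `action_bits`
    for the action LUT index, `material_bits` for the material LUT index."""
    assert action_index < (1 << action_bits)
    assert material_index < (1 << material_bits)
    return (int(active) << (action_bits + material_bits)) \
           | (action_index << material_bits) | material_index

def unpack_patch_flags(value, action_bits=3, material_bits=3):
    material_index = value & ((1 << material_bits) - 1)
    action_index = (value >> material_bits) & ((1 << action_bits) - 1)
    active = bool(value >> (action_bits + material_bits))
    return active, action_index, material_index

flags = pack_patch_flags(active=True, action_index=2, material_index=5)
print(bin(flags), unpack_patch_flags(flags))  # -> 0b1010101 (True, 2, 5)
```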
  • Fig. 4a is a flowchart illustrating a method according to an embodiment.
  • a method comprises receiving 1001 a volumetric video comprising a three-dimensional object; segmenting 1002 the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object 1003: inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction; transmitting 1004 the bitstream to a decoder.
  • the decoder receives the bitstream from the encoder, and performs the inverse operation of reconstruction as described in Figs. 2 and 4a, i.e. interpreting any patch relationship information according to the previous embodiments and passing on any interactivity information to a renderer to analyze interactions and perform specified interactions.
  • Fig. 4b is a flowchart illustrating a method according to another embodiment.
  • a method comprises receiving a bitstream 1011; decoding from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item and/or a signal indicating an activity of the patch as a response to an interaction 1012; reconstructing a volumetric video according to the decoded information 1013; passing the reconstructed volumetric video to a renderer 1014; and passing any decoded activity information to the renderer 1015.
  • An apparatus comprises means for receiving a volumetric video comprising a three-dimensional object; means for segmenting the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object: means for inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or means for inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction; means for transmitting the bitstream to a decoder.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method according to various embodiments.
  • An apparatus comprises means for receiving a bitstream; means for decoding from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or a signal indicating an activity of the patch as a response to an interaction; means for reconstructing a volumetric video according to the decoded information; means for passing the reconstructed volumetric video to a renderer; and means for passing any decoded activity information to the renderer.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method according to various embodiments.
  • Fig. 5 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder.
  • Fig. 6 shows a layout of an apparatus according to an embodiment.
  • the electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device.
  • the electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer.
  • the device may be also comprised as part of a head-mounted display device.
  • the apparatus 50 may comprise a housing 30 for incorporating and protecting the device.
  • the apparatus 50 may further comprise a display 32 in the form of a liquid crystal display.
  • the display may be any suitable display technology suitable to display an image or video.
  • the apparatus 50 may further comprise a keypad 34.
  • any suitable data or user interface mechanism may be employed.
  • the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
  • the apparatus may further comprise a camera 42 capable of recording or capturing images and/or video.
  • the camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing.
  • the apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
  • the apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices.
  • the apparatus may further comprise any suitable short- range communication solution such as for example a Bluetooth wireless connection or a USB (Universal Serial Bus)/firewire wired connection.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50.
  • the apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry.
  • the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
  • the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
  • UICC Universal Integrated Circuit Card
  • the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network.
  • the apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
  • the apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection. Such a wired interface may be configured to operate according to one or more digital display interface standards, such as for example High-Definition Multimedia Interface (HDMI), Mobile High-definition Link (MHL), or Digital Visual Interface (DVI).
  • HDMI High-Definition Multimedia Interface
  • MHL Mobile High-definition Link
  • DVI Digital Visual Interface
  • the various embodiments may provide advantages. For example, signaling the relationships improves the coding efficiency for auxiliary patch data. It also enables use cases currently not supported, e.g. virtual assistants, advertisements, etc. Further, signaling patch interaction characteristics provides improved interactivity to support new use cases, e.g. gaming, e-learning, advertisements, etc. In addition, the various embodiments provide an improved immersion due to the 3D model reacting to the virtual environment, and an improved flexibility for content creators to support their specific use cases/target applications.
  • a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
  • the computer program code comprises one or more operational characteristics.
  • said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises receiving a volumetric video comprising a three-dimensional object; segmenting the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object: inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction; transmitting the bitstream to a decoder.
  • said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises receiving a bitstream; decoding from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or a signal indicating an activity of the patch as a response to an interaction; reconstructing a volumetric video according to the decoded information; passing the reconstructed volumetric video to a Tenderer together with any decoded activity information.
  • the computer program code can be a part of a computer program product that may be embodied on a non-transitory computer readable medium. Alternatively, the computer program product may be downloadable via communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The embodiments relate to a method comprising receiving a volumetric video comprising a three-dimensional object; segmenting the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object: inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction; and transmitting the bitstream to a decoder. The embodiments relate to a method for decoding the bitstream, as well as to technical equipment for implementing any of the methods.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR
VOLUMETRIC VIDEO CODING
Technical Field
The present solution generally relates to volumetric video coding. In particular, the solution relates to a solution for projection-based point cloud compression.
Background
Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with relatively narrow field of view and displayed as a rectangular scene on flat displays. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).
More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being "immersed" into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.
For volumetric video, a scene may be captured using one or more 3D (three-dimensional) cameras. The cameras are in different positions and orientations within a scene. One issue to consider is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (with or without transferring it from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel.
Summary
Now there has been invented a method and technical equipment implementing the method, for providing an improvement for volumetric video coding. Various aspects of the invention include a method, an apparatus, and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.
According to a first aspect, there is provided a method comprising receiving a volumetric video comprising a three-dimensional object; segmenting the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object: inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction; transmitting the bitstream to a decoder.
According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, wherein memory and the computer program code are configured to, with the at least one processor, cause the apparatus to receive a volumetric video comprising a three- dimensional object; segment the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object: to insert into a bitstream a signal indicating a relation of three- dimensional location of the patch to at least one predefined item; and/or to insert into the bitstream a signal indicating an activity of the patch as a response to an interaction; transmit the bitstream to a decoder.
According to a third aspect, there is provided an apparatus comprising means for receiving a volumetric video comprising a three-dimensional object; means for segmenting the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object; means for inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or means for inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction; means for transmitting the bitstream to a decoder.
According to a fourth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive a volumetric video comprising a three-dimensional object; segment the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object: to insert into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or to insert into the bitstream a signal indicating an activity of the patch as a response to an interaction; transmit the bitstream to a decoder.
According to a fifth aspect, there is provided a method comprising receiving a bitstream; decoding from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or a signal indicating an activity of the patch as a response to an interaction; reconstructing a volumetric video according to the decoded information; passing the reconstructed volumetric video to a renderer; and passing any decoded activity information to the renderer.
According to a sixth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, wherein memory and the computer program code are configured to, with the at least one processor, cause the apparatus to receive a bitstream; decode from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or a signal indicating an activity of the patch as a response to an interaction; reconstruct a volumetric video according to the decoded information; pass the reconstructed volumetric video to a renderer; and pass any decoded activity information to the renderer.
According to an embodiment, the item is one or more of the following: one or more of other patches of the same 3D object; a current position of a user with or without a viewing orientation; another object; a predefined path.
According to an embodiment, the interaction is an interaction of a user to the 3D object, or an interaction of another 3D object to the patch.
According to an embodiment, the activity of the patch is associated with an activity to be performed.
According to an embodiment, the relation or activity signaling is performed in an auxiliary patch information.
According to an embodiment, the relation or activity signaling is performed by one or more bits.
According to an embodiment, the available activities are signaled in or along the bitstream as a look up table.
According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.
Description of the Drawings
In the following, various embodiments will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows an example of a volumetric video compression process;
Fig. 2 shows an example of a volumetric video decompression process;
Fig. 3 shows an example of a 3D point cloud being segmented into patches;
Fig. 4a is a flowchart illustrating a method according to an embodiment;
Fig. 4b is a flowchart illustrating a method according to another embodiment;
Fig. 5 shows an apparatus according to an embodiment in a simplified manner; and
Fig. 6 shows a layout of an apparatus according to an embodiment.
Description of Example Embodiments
In the following, several embodiments will be described in the context of volumetric video. In particular, the present embodiments relate to patch functionality signaling for projection-based point cloud compression.
Volumetric video may be captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.
Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames. Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example.
Volumetric video data represents a three-dimensional scene or object, and thus such data can be viewed from any viewpoint. Volumetric video data can be used as an input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications. Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g. color, opacity, reflectance, ...), together with any possible temporal changes of the geometry and attributes at given time instances (e.g. frames in 2D video). Volumetric video is either generated from 3D models, i.e. computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera, a laser scan, a combination of video and dedicated depth sensors, etc. Also, a combination of CGI and real-world data is possible. Examples of representation formats for such volumetric data are triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e. "frames" in 2D video, or other means, e.g. the position of an object as a function of time.
Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.
In 3D point clouds, each point of each 3D surface is described as a 3D point with color and/or other attribute information such as surface normal or material reflectance. A point cloud is a set of data points in a coordinate system, for example in a three-dimensional coordinate system defined by X, Y, and Z coordinates. The points may represent an external surface of an object in the screen space, e.g. in a three-dimensional space.
In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression of the representations becomes fundamental. Standard volumetric video representation formats, such as point clouds, meshes and voxels, suffer from poor temporal compression performance. Identifying correspondences for motion compensation in 3D space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporally successive "frames" do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview with depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.
Instead of the above-mentioned approach, a 3D scene, represented as meshes, points, and/or voxel, can be projected onto one, or more, geometries. These geometries may be "unfolded" or packed onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information may be transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).
Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency can be increased greatly. Using geometry-projections instead of 2D-video based approaches based on multiview and depth, provides a better coverage of the scene (or object). Thus, 6DOF capabilities are improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time compression/decompression of the projected planes. The projection and the reverse projection steps are of low complexity.
Figure 1 illustrates an overview of an example of a compression process. Such process may be applied for example in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.
The patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:
- (1.0, 0.0, 0.0),
- (0.0, 1.0, 0.0),
- (0.0, 0.0, 1.0),
- (-1.0, 0.0, 0.0),
- (0.0, -1.0, 0.0), and
- (0.0, 0.0, -1.0)
More precisely, each point may be associated with the plane that has the closest normal (i.e. maximizes the dot product of the point normal and the plane normal).
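By way of illustration only, and not as a definition of the embodiments, the initial clustering described above can be sketched in Python as follows; the dot-product criterion and the six plane normals follow the description, while the function and array names are hypothetical:

import numpy as np

# The six oriented projection plane normals listed above.
PLANE_NORMALS = np.array([
    ( 1.0,  0.0,  0.0),
    ( 0.0,  1.0,  0.0),
    ( 0.0,  0.0,  1.0),
    (-1.0,  0.0,  0.0),
    ( 0.0, -1.0,  0.0),
    ( 0.0,  0.0, -1.0),
])

def initial_clustering(point_normals):
    """Associate each point with the plane whose normal maximizes the dot
    product with the estimated point normal.
    point_normals: (N, 3) array of unit normals, one per point.
    Returns an (N,) array of cluster indices in the range [0, 5]."""
    scores = point_normals @ PLANE_NORMALS.T  # (N, 6) dot products
    return np.argmax(scores, axis=1)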
The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure. Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105. The packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g. 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.
The simple packing strategy used iteratively tries to insert patches into a WxH grid. W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid is temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
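A purely illustrative sketch of this packing strategy is given below in Python; the parameter names are hypothetical and the final clipping of H to the used grid cells is omitted:

def pack_patches(patch_sizes, width, height, block=16):
    """Greedy raster-scan packing of patch bounding boxes.
    patch_sizes: list of (width, height) per patch, measured in TxT blocks.
    Returns one (u0, v0) block position per patch; the grid height is doubled
    whenever a patch does not fit at the current resolution."""
    grid_w, grid_h = width // block, height // block
    used = [[False] * grid_w for _ in range(grid_h)]
    positions = []
    for pw, ph in patch_sizes:
        placed = False
        while not placed:
            for v0 in range(grid_h - ph + 1):
                for u0 in range(grid_w - pw + 1):
                    free = all(not used[v0 + dv][u0 + du]
                               for dv in range(ph) for du in range(pw))
                    if free:
                        for dv in range(ph):
                            for du in range(pw):
                                used[v0 + dv][u0 + du] = True
                        positions.append((u0, v0))
                        placed = True
                        break
                if placed:
                    break
            if not placed:
                # No empty space at the current resolution: double the height.
                used.extend([[False] * grid_w for _ in range(grid_h)])
                grid_h *= 2
    return positions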
The geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called a near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+D], where D is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:
• Geometry: WxH YUV420-8bit,
• Texture: WxH YUV420-8bit,
It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
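Purely as an illustration of the two-layer projection described above, the following Python sketch splits the points that project onto one pixel into a near and a far layer; the data structures are hypothetical:

def generate_geometry_layers(projected_depths, surface_thickness):
    """projected_depths: dict mapping a pixel (u, v) to the list of depths of
    the points of the current patch projected onto that pixel (the set H(u, v)).
    surface_thickness: the user-defined parameter D.
    Returns two dicts: the near layer (lowest depth D0 per pixel) and the far
    layer (highest depth within [D0, D0 + D] per pixel)."""
    near, far = {}, {}
    for pixel, depths in projected_depths.items():
        d0 = min(depths)
        near[pixel] = d0
        far[pixel] = max(d for d in depths if d <= d0 + surface_thickness)
    return near, far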
The geometry images and the texture images may be provided to image padding 107. The image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images. The occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 103.
The padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g. 16x16) pixels is compressed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e. edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
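An illustrative sketch of such a padding strategy is given below (Python with NumPy); the array names are hypothetical and the loops are kept deliberately simple rather than efficient:

import numpy as np

def pad_image(image, occupancy, block=16):
    """Fill the empty space between patches as described above.
    image: 2D array, one channel of a geometry or texture image.
    occupancy: 2D boolean array, True for occupied (non-padded) pixels."""
    img = image.astype(float).copy()
    h, w = img.shape
    for by in range(0, h, block):
        for bx in range(0, w, block):
            occ = occupancy[by:by + block, bx:bx + block].copy()
            blk = img[by:by + block, bx:bx + block]
            if not occ.any() and bx > 0:
                # Empty block: copy the last column of the previous block in
                # raster order (the last row could be used equivalently).
                blk[:] = img[by:by + block, bx - 1:bx]
            elif occ.any() and not occ.all():
                # Edge block: iteratively fill empty pixels with the average
                # value of their non-empty neighbours.
                while not occ.all():
                    new_occ = occ.copy()
                    for y in range(blk.shape[0]):
                        for x in range(blk.shape[1]):
                            if occ[y, x]:
                                continue
                            nbrs = [blk[j, i] for j, i in
                                    ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                                    if 0 <= j < blk.shape[0]
                                    and 0 <= i < blk.shape[1] and occ[j, i]]
                            if nbrs:
                                blk[y, x] = sum(nbrs) / len(nbrs)
                                new_occ[y, x] = True
                    occ = new_occ
            # Full blocks are left untouched.
    return img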
The padded geometry images and padded texture images may be provided for video compression 108. The generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters. The video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102. The smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise the index of the projection plane, the 2D bounding box, and the 3D location of the patch.
For example, the following metadata may be encoded/decoded for every patch:
- index of the projection plane
o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)
o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)
o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0)
- 2D bounding box (u0, v0, u1, v1)
- 3D location (x0, y0, z0) of the patch represented in terms of depth S0, tangential shift s0 and bitangential shift r0. According to the chosen projection planes, (S0, s0, r0) may be calculated as follows:
o Index 0: S0 = x0, s0 = z0, r0 = y0
o Index 1: S0 = y0, s0 = z0, r0 = x0
o Index 2: S0 = z0, s0 = x0, r0 = y0
Also, mapping information providing for each TxT block its associated patch index may be encoded as follows:
- For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.
- The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.
- Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly coding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.
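For illustration, building the candidate list L for one block and deriving the position J that is arithmetically coded can be sketched in Python as follows; the data layout is hypothetical:

def candidate_list_and_position(block_uv, actual_patch_index, patches):
    """patches: ordered list of dicts with 'index' (patch index; 0 is reserved
    for empty space) and 'bbox' = (u0, v0, u1, v1) in block units, in the same
    order used to encode the 2D bounding boxes.
    Returns the candidate list L and the position J of the block's actual patch
    index within L."""
    u, v = block_uv
    L = [p['index'] for p in patches
         if p['bbox'][0] <= u <= p['bbox'][2]
         and p['bbox'][1] <= v <= p['bbox'][3]]
    L.append(0)  # the empty-space patch is a candidate for every block
    return L, L.index(actual_patch_index)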
The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.
The occupancy map compression 110 leverages the auxiliary information described in the previous section, in order to detect the empty TxT blocks (i.e. blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of B0xB0 blocks, where B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 results in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map.
The compression process may comprise one or more of the following example operations:
• Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise it is an empty sub-block.
• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.
• A binary information may be encoded for each TxT block to indicate whether it is full or not.
• If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.
The binary value of the initial sub-block is encoded.
Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder. The number of detected runs is encoded.
The length of each run, except the last one, is also encoded.
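The sub-block coding steps listed above may be illustrated with the following Python sketch; the traversal orders are passed in as index permutations, the actual bitstream writing is abstracted away, and the returned dictionary merely lists the values that would be coded:

def encode_block_occupancy(subblock_values, traversal_orders):
    """subblock_values: flat list of 0/1 values, one per B0xB0 sub-block of a
    TxT block. traversal_orders: list of index permutations the encoder may
    choose from (e.g. horizontal, vertical, diagonal)."""
    if all(v == 1 for v in subblock_values):
        return {'full': True}
    best = None
    for order_idx, order in enumerate(traversal_orders):
        seq = [subblock_values[i] for i in order]
        runs = []
        for v in seq:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        if best is None or len(runs) < len(best[1]):
            best = (order_idx, runs)  # the encoder picks the cheapest order
    order_idx, runs = best
    return {
        'full': False,
        'traversal_order': order_idx,              # explicitly signalled index
        'initial_value': runs[0][0],               # value of the first sub-block
        'num_runs': len(runs),
        'run_lengths': [r[1] for r in runs[:-1]],  # last run length is not coded
    }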
Figure 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202. In addition, the de-multiplexer 201 transmits the compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 204. Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.
The reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202. The texture reconstruction 207 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.
The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (S0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth S(u, v), tangential shift s(u, v) and bitangential shift r(u, v) as follows:
S(u, v) = S0 + g(u, v)
s(u, v) = s0 - u0 + u
r(u, v) = r0 - v0 + v, where g(u, v) is the luma component of the geometry image.
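For illustration, the per-pixel reconstruction above may be expressed as the following Python sketch; the field names are hypothetical, and the mapping from (depth, tangential, bitangential) back to (x, y, z), which depends on the projection plane index, is not shown:

def reconstruct_point(u, v, g, patch):
    """u, v: pixel coordinates; g: luma value g(u, v) of the geometry image;
    patch: decoded auxiliary information with the 3D location 'S0', 's0', 'r0'
    and the 2D bounding box origin 'u0', 'v0'."""
    depth = patch['S0'] + g                       # S(u, v) = S0 + g(u, v)
    tangential = patch['s0'] - patch['u0'] + u    # s(u, v) = s0 - u0 + u
    bitangential = patch['r0'] - patch['v0'] + v  # r(u, v) = r0 - v0 + v
    return depth, tangential, bitangential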
For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction. In the context described above, the various embodiments introduce a signal per patch to define the relation of the patch 3D location (part of the auxiliary patch information) to one or more of the following:
- other patches of the same projected 3D object (which improves coding efficiency of 111 in Fig. 1),
- the user’s current position, e.g. a virtual assistant or a pet following the user,
- the user’s current position and viewing orientation (viewport), e.g. heads-up-display-type screens,
- another object in the 3D scene, e.g. a piece of advertisement or information screen closely following the object moving through 3D space,
- a predefined path signaled in or along the bitstream, similar to above but independent from other objects in the 3D scene.
(the latter four relations provide improved use case and functionality support)
Furthermore, another signal is introduced per patch to define the behavior of the reprojected 3D data with regards to user interactions, for example one or more of the following:
- If the user interacts with the part of a 3D object represented by a certain patch, a certain action is performed. Interactions can include (but are not limited to):
• mouse clicks or hovers
• touch interaction with 3D input device
• looking at the object for a certain amount of time (determined by HMD FoV (Head-mounted Display Field-of-View), eye-tracker, or similar)
• approach, or even collision, of user avatar (or parts of) in simulated 3D space
• simulated interaction/collision, e.g. the user avatar in 3D space has some sort of marker (laser pointer, gun, etc.) to interact with 3D objects
- If another 3D object interacts with the part of a 3D object represented by a certain patch, a certain action is performed. Interactions are the same as above; possible 3D objects are (but are not limited to):
• another object in the scenery, i.e. two individually represented 3D objects collide with/approach each other
• a simulated 3D projectile, e.g. a ball thrown by the user (or another 3D object)
• simulated environmental effects, e.g. simulated rain hitting the clothes of a 3D object
- Possible actions triggered by the above-mentioned interactions are (but are not limited to):
• Opening a webpage in a browser, e.g. if a user clicks on the shirt of a 3D model, a webpage with the store to purchase this shirt is opened.
• Playback of a media file, e.g. if the user touches a 3D bird, a sound file is played back.
• Execute a computer program.
• The texture color of the patch is changed, e.g. if a collision with the object "water" is detected, the patch texture is darkened.
• Run a script, e.g. combining several of the above actions.
Both signals significantly extend the use cases for projection-based volumetric video compression. Such signaling can be implemented according to the details described next.
In the various embodiments, the projected patch is projected from different parts of the 3D scene/model to the projection planes.
Fig. 3 illustrates an image 301 showing a 3D object (i.e. 3D volumetric video object) being segmented into patches 305 (only three patches have been indicated with numeral 305 for the sake of simplicity). Each grayscale color represents a different patch.
The segmenting can be implemented as follows: According to an embodiment, the original point cloud content may be labelled so that, for example, each point includes an index number that, for example, refers to different body parts. A simple example can be index values of 0, 1, 2 and 3, where points belonging to the head will have an index value of 0, points belonging to the arms will have an index value of 1, points belonging to the torso will have an index value of 2 and points belonging to the legs will have an index value of 3. The body part index number can be created using computer vision or machine learning algorithms, where each body part is identified and indexed, or assigned manually by user input. Indexing is easy with CGI-generated content, where the body hierarchy is known and the character model is converted from triangles to points with index numbers.
Once the content is indexed, a point cloud may be segmented into patches (Fig. 3: 305). The segmentation may segment points with similar normals into different patches. However, with the additional index data, the segmentation has to take into account that each final segmented patch must contain points with the same index number. For example, all the points that refer to the head object (index 0), for example mouth, ears and eyes, can form different patches that all have the same index of 0. However, there cannot be a patch that would include points from both the torso and the head object, as those have different indices.
Each patch will therefore have an additional "index" number stored in the patch metadata. These indices can refer to several look-up tables that have different actions and additional properties for the patches.
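The constraint that a patch may only contain points with the same index can be illustrated with the following Python sketch; the connected-component extraction itself is omitted and all names are hypothetical:

from collections import defaultdict

def group_points_by_cluster_and_index(points, normal_cluster, body_index):
    """points: iterable of point identifiers. normal_cluster / body_index: dicts
    mapping a point identifier to its normal-based cluster and to its body part
    index (e.g. 0 = head, 1 = arms, 2 = torso, 3 = legs).
    Returns groups from which patches can be extracted; a group never mixes
    body part indices."""
    groups = defaultdict(list)
    for p in points:
        groups[(normal_cluster[p], body_index[p])].append(p)
    return groups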
According to a first embodiment, a 3D location relationship is signaled in a bitstream. In the current approach for projection-based volumetric video compression, as described with reference to Fig. 1, the 3D location of each patch is signaled as auxiliary patch information as absolute values for (x0, y0, z0). Thus, the reconstructed 3D object (or parts thereof) is always in relation to an overall 3D bounding box of the 3D volume. In this embodiment, an additional signal is introduced to the auxiliary patch information, to represent the relation of the 3D patch location. In a first example, a single bit signal is introduced, wherein the single bit indicates if the 3D location of the patch is absolute, i.e. there is no change, or if it is given in relation to a previous patch. The latter case is similar to residual coding in video compression, i.e. the values for the relative location (xr, yr, zr) may be smaller than the absolute (x0, y0, z0) values. Thus, coding efficiency for the auxiliary patch data is improved.
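As a purely illustrative sketch of this single-bit relation signal, with hypothetical field names, the encoder-side choice could look as follows in Python:

def write_patch_location(patch, prev_patch, use_residual):
    """Return the fields that would be written to the auxiliary patch
    information for the 3D location of one patch.
    patch / prev_patch: dicts with absolute coordinates 'x0', 'y0', 'z0'."""
    if not use_residual or prev_patch is None:
        return [('relation_flag', 0),  # absolute 3D location
                ('x0', patch['x0']), ('y0', patch['y0']), ('z0', patch['z0'])]
    # Residual to the previous patch: (xr, yr, zr) are typically smaller than
    # the absolute values, which improves coding efficiency.
    return [('relation_flag', 1),
            ('xr', patch['x0'] - prev_patch['x0']),
            ('yr', patch['y0'] - prev_patch['y0']),
            ('zr', patch['z0'] - prev_patch['z0'])]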
According to another embodiment, more advanced relationships are enabled by additional signaling. Instead of the single 1-bit signal, more bits, e.g. 3 bits, can be used to signal different relationships. For example:
a) 000: absolute values, no change,
b) 001: residual to previous patch values, improved coding efficiency,
c) 010: absolute to user position, fixed orientation viewport,
d) 011: absolute to user position, rotation with viewport,
e) 100: residual to predefined other patch,
f) 101: residual to a predefined path (the residual can be 0 to follow the path exactly),
g) 111: other option.
Other options are feasible, and not all options have to be implemented. Entropy coding can be applied to minimize the bit costs for less probable relationships.
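For illustration, the example 3-bit codes above can be represented on the decoder side as a simple mapping; the Python names and the handling of unused codes are hypothetical:

# Hypothetical mapping of the example 3-bit relation codes listed above.
RELATION_CODES = {
    0b000: 'absolute',
    0b001: 'residual_to_previous_patch',
    0b010: 'absolute_to_user_position_fixed_viewport',
    0b011: 'absolute_to_user_position_rotating_with_viewport',
    0b100: 'residual_to_predefined_patch',
    0b101: 'residual_to_predefined_path',
    0b111: 'other',
}

def parse_relation(code):
    """Map a decoded 3-bit relation code to its meaning."""
    return RELATION_CODES.get(code, 'reserved')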
The previous examples a - f may relate to the following use cases and have the example signaling, respectively:
a) At first a relationship is signaled, then the 3D location is signaled.
b) Can be used for improved coding efficiency. At first a relationship is signaled, then 3D location residual with regards to previous patch is signaled.
c) Can be used for e.g. a virtual assistant or pet following the user. At first a relationship is signaled, then a 3D translation vector with regards to the user position is signaled.
d) Can be used for e.g. a virtual advertisement or information data always in view. At first a relationship is signaled, then a 3D translation vector with regards to the user position is signaled, and the initial 3D rotation offset is kept constant if the user rotates the head.
e) Can be used for e.g. a virtual advertisement or information data around a 3D object. At first a relationship is signaled, then the target patch ID and the respective 3D translation vector to this patch are signaled.
f) Can be used for e.g. a virtual advertisement or information data moving independently. At first a relationship is signaled, then the target path ID and the respective 3D translation vector (can be zero) are signaled. One or more possible paths are signaled as look-up-tables in or along the bitstream.
According to a second embodiment, patch interaction characteristics are signaled. In this embodiment, an additional signal in the auxiliary patch information is introduced, to provide the user with interaction possibilities when consuming projection-based coded/transmitted volumetric video content.
According to a first example, a single bit signal is introduced, wherein the bit indicates if a patch is active or not. If an interaction with the patch is detected, a predefined action is performed. Such an action can be hardcoded in the decoder/receiver/renderer, or transmitted along the bitstream, e.g. as XML data.
In a second example, the single bit for indicating if a patch is active or not is followed by a signal to determine what action should be performed, e.g. a fixed-length coded index number, pointing at an action look-up-table signaled in or along the bitstream, e.g. as XML data. Table 1 shows an example of such a look-up-table and possible triggered actions.
Table 1: Index and respective example actions
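Since the table itself is not reproduced here, the following Python sketch gives a hypothetical example of such an action look-up-table and its use; the indices, actions and targets are illustrative only and do not form part of any signaled data:

# Hypothetical action look-up-table of the kind that could be signaled in or
# along the bitstream (e.g. as XML).
ACTION_LUT = {
    0: {'action': 'none'},
    1: {'action': 'open_webpage', 'target': 'https://example.com/shirt-store'},
    2: {'action': 'play_media', 'target': 'bird_song.wav'},
    3: {'action': 'execute_program', 'target': 'viewer_plugin'},
    4: {'action': 'change_texture_color', 'target': 'darken'},
    5: {'action': 'run_script', 'target': 'combined_actions'},
}

def resolve_action(patch_metadata):
    """Return the action triggered by an interaction with the patch, or None
    if the patch is not active (field names are hypothetical)."""
    if not patch_metadata.get('active', False):
        return None
    return ACTION_LUT.get(patch_metadata.get('action_index', 0))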
In a third example, a multiple bit signal is introduced indicating if a patch is active or not, and to which interaction it is active. For example, a user can click or hover with an input cursor (e.g. a mouse interface), touch with a 3D input device, look directly at the patch, approach the patch in 3D space, collide with/approach the patch with an interactive marker (e.g. a virtual reality (VR) laser pointer), or use other interaction methods. It is appreciated that this list of possible interaction methods is an example, and thus not exhaustive. Table 2 shows an example of how different interactions can trigger different effects. Such a table can be signaled in or along the bitstream, e.g. as XML data. Different combinations of active interactions are possible, i.e. a patch can have different reactions to different combinations of interactions.
Table 2: Index with corresponding interactions and resulted actions
In a fourth example, patch material characteristics are signaled. Such material characteristics define how a patch will react to environment interactions, e.g. if a patch is hit by a virtual 3D object, should the object bounce back (and if yes, how), or if the patch is hit by the virtual environment effect "rain", should it change its color to reflect a new status "wet". Such material characteristics can again be signaled as a look-up-table. Other physics-related material properties can include bounciness, friction, restitution, density and mass. This fourth example can stand alone or be combined with any of the previously mentioned examples. Table 3 shows signaling examples combining the previous example for interactive patches with material characteristics signaling. In this case, an additional look-up-table is signaled in or along the bitstream.
Table 3: Signaling examples with material characteristics
It should be noticed that patch indices for the two different tables do not necessarily have to be the same; in this case the per-patch metadata signaling has to be extended by an additional signal for the material characteristics. An overview of different per-patch metadata signaling examples is given below.
A. Simple interactivity
1 Bit: signalling if patch is active or not. Look up action in look-up-table (LUT).
B. Extended interactivity
n Bits: signalling if patch is active or not and what activity shall be performed. Look up action in LUT.
alternatively
1 Bit: signalling if patch is active or not, followed by n Bits: signalling activity shall be performed. Look up action in LUT.
C. Extended interactivity with multiple interactions
n Bits: signalling if patch is active or not and what activity shall be performed for what interaction. Look up interaction and respective action in LUT.
alternatively
1 Bit: signalling if patch is active or not, followed by
n Bits: signalling what activity shall be performed for what interaction. Look up interaction and respective action in LUT.
D. Extended interactivity with multiple interactions and material characteristics (material indices match activity indices)
n Bits: signalling if patch is active or not and what activity shall be performed for what interaction. Look up interaction and respective action and material in LUT.
alternatively
1 Bit: signalling if patch is active or not, followed by
n Bits: signalling material and what activity shall be performed for what interaction. Look up material, interaction, and respective action in LUT.
E. Extended interactivity with multiple interactions and material characteristics (material indices do not match activity indices)
1 Bit: signalling if patch is active or not, followed by
n Bits: signalling what activity shall be performed for what interaction, followed by
n Bits: signalling what material the patch consists of (different index than above)
F. Material signalling only
n Bits: signalling what material the patch consists of. Look up material in LUT.
It shall be noted that this list is not exhaustive. Many different combinations of interactivity and material signalling can be envisioned. The main idea remains the same: provide additional signalling to allow for more immersive interactions with a 3D object.
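As a final illustration, the alternative form of variant B above (one bit followed by n bits) could be parsed on the decoder side as sketched below; the reader interface, the 4-bit field width and the field names are all hypothetical:

def parse_patch_interactivity(reader):
    """reader is assumed to expose read_bit() and read_bits(n)."""
    meta = {'active': bool(reader.read_bit())}
    if meta['active']:
        # Index into the activity look-up-table signalled in or along the bitstream.
        meta['activity_index'] = reader.read_bits(4)
    return meta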
Fig. 4a is a flowchart illustrating a method according to an embodiment. A method comprises receiving 1001 a volumetric video comprising a three-dimensional object; segmenting 1002 the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object 1003: inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction; transmitting 1004 the bitstream to a decoder.
The decoder receives the bitstream from the encoder, and performs the inverse operation of reconstruction as described in Figs. 2 and 4a, i.e. interpreting any patch relationship information according to the previous embodiments and passing on any interactivity information to a renderer to analyze interactions and perform specified interactions. Fig. 4b is a flowchart illustrating a method according to another embodiment. A method comprises receiving a bitstream 1011; decoding from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item, and/or a signal indicating an activity of the patch as a response to an interaction 1012; reconstructing a volumetric video according to the decoded information 1013; passing the reconstructed volumetric video to a renderer 1014; and passing any decoded activity information to the renderer 1015.
An apparatus according to an embodiment comprises means for receiving a volumetric video comprising a three-dimensional object; means for segmenting the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object: means for inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or means for inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction; means for transmitting the bitstream to a decoder. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method according to various embodiments.
An apparatus according to another embodiment comprises means for receiving a bitstream; means for decoding from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or a signal indicating an activity of the patch as a response to an interaction; means for reconstructing a volumetric video according to the decoded information; means for passing the reconstructed volumetric video to a renderer, and means for passing any decoded activity information to the renderer. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method according to various embodiments.
An example of an apparatus is disclosed with reference to Figures 5 and 6. Fig. 5 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. Fig. 6 shows a layout of an apparatus according to an embodiment. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer. The device may be also comprised as part of a head-mounted display device. The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 may further comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
The apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. According to an embodiment, the apparatus may further comprise any suitable short- range communication solution such as for example a Bluetooth wireless connection or a USB (Universal Serial Bus)/firewire wired connection. The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection. Such wired interface may be configured to operate according to one or more digital display interface standards, such as for example High-Definition Multimedia Interface (HDMI), Mobile High-Definition Link (MHL), or Digital Visual Interface (DVI).
The various embodiments may provide advantages. For example, signaling the relationships improves the coding efficiency for auxiliary patch data. It also enables use cases currently not supported, e.g. virtual assistants, advertisements, etc. Further, signaling patch interaction characteristics provides improved interactivity to support new use cases, e.g. gaming, e-learning, advertisements, etc. In addition, the various embodiments provide an improved immersion due to the 3D model reacting to the virtual environment, and an improved flexibility for content creators to support their specific use cases/target applications.
The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The computer program code comprises one or more operational characteristics. According to an embodiment, said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises receiving a volumetric video comprising a three-dimensional object; segmenting the three-dimensional object into a plurality of patches; for one or more patches of a three-dimensional object: inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction; transmitting the bitstream to a decoder.
According to another embodiment, said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises receiving a bitstream; decoding from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or a signal indicating an activity of the patch as a response to an interaction; reconstructing a volumetric video according to the decoded information; passing the reconstructed volumetric video to a renderer together with any decoded activity information. The computer program code can be a part of a computer program product that may be embodied on a non-transitory computer readable medium. Alternatively, the computer program product may be downloadable via a communication network.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as, defined in the appended claims.

Claims
1. A method, comprising:
- receiving a volumetric video comprising a three-dimensional object;
- segmenting the three-dimensional object into a plurality of patches;
- for one or more patches of a three-dimensional object:
• inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or
• inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction;
- transmitting the bitstream to a decoder.
2. The method according to claim 1, wherein the item is one or more of the following: one or more of other patches of the same 3D object; a current position of a user with or without a viewing orientation; another object; a predefined path.
3. The method according to claim 1, wherein the interaction is an interaction of a user to the 3D object, or an interaction of another 3D object to the patch.
4. The method according to claim 1, wherein activity of the patch is associated with an activity to be performed.
5. The method according to claim 1, wherein the relation or activity signaling is performed in an auxiliary patch information.
6. The method according to claim 1, wherein the relation or activity signaling is performed by one or more bits.
7. The method according to claim 1, wherein the available activities are signaled in or along the bitstream as a look-up-table.
8. An apparatus comprising at least one processor, memory including computer program code, wherein memory and the computer program code are configured to, with the at least one processor, cause the apparatus to
- receive a volumetric video comprising a three-dimensional object;
- segment the three-dimensional object into a plurality of patches;
- for one or more patches of a three-dimensional object:
• to insert into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or
• to insert into the bitstream a signal indicating an activity of the patch as a response to an interaction;
- transmit the bitstream to a decoder.
9. The apparatus according to claim 8 wherein the item is one or more of the following: one or more of other patches of the same 3D object; a current position of a user with or without a viewing orientation; another object; a predefined path.
10. The apparatus according to claim 8, wherein the interaction is an interaction of a user to the 3D object, or an interaction of another 3D object to the patch.
11. The apparatus according to claim 8, wherein activity of the patch is associated with an activity to be performed.
12. The apparatus according to claim 8, wherein the relation or activity signaling is performed in an auxiliary patch information.
13. The apparatus according to claim 8, wherein the relation or activity signaling is performed by one or more bits.
14. The apparatus according to claim 8, wherein the available activities are signaled in or along the bitstream as a look-up table.
15. An apparatus comprising:
- means for receiving a volumetric video comprising a three-dimensional object;
- means for segmenting the three-dimensional object into a plurality of patches;
- for one or more patches of a three-dimensional object;
• means for inserting into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or
• means for inserting into the bitstream a signal indicating an activity of the patch as a response to an interaction;
- means for transmitting the bitstream to a decoder.
16. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:
- receive a volumetric video comprising a three-dimensional object;
- segment the three-dimensional object into a plurality of patches;
- for one or more patches of a three-dimensional object:
• to insert into a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or
• to insert into the bitstream a signal indicating an activity of the patch as a response to an interaction;
- transmit the bitstream to a decoder.
17. A computer program product according to claim 16, wherein the computer program product is embodied on a non-transitory computer readable medium.
18. A method comprising
- receiving a bitstream;
- decoding from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or a signal indicating an activity of the patch as a response to an interaction;
- reconstructing a volumetric video according to the decoded information;
- passing the reconstructed volumetric video to a renderer;
- passing any decoded activity information to the renderer.
19. An apparatus comprising at least one processor, memory including computer program code, wherein memory and the computer program code are configured to, with the at least one processor, cause the apparatus to
- receive a bitstream;
- decode from a bitstream a signal indicating a relation of three-dimensional location of the patch to at least one predefined item; and/or a signal indicating an activity of the patch as a response to an interaction;
- reconstruct a volumetric video according to the decoded information;
- pass the reconstructed volumetric video to a renderer;
- pass any decoded activity information to the renderer.
EP19834742.9A 2018-07-10 2019-07-09 A method, an apparatus and a computer program product for volumetric video coding Pending EP3821602A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862696026P 2018-07-10 2018-07-10
PCT/FI2019/050537 WO2020012071A1 (en) 2018-07-10 2019-07-09 A method, an apparatus and a computer program product for volumetric video coding

Publications (2)

Publication Number Publication Date
EP3821602A1 true EP3821602A1 (en) 2021-05-19
EP3821602A4 EP3821602A4 (en) 2022-05-04

Family

ID=69143222

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19834742.9A Pending EP3821602A4 (en) 2018-07-10 2019-07-09 A method, an apparatus and a computer program product for volumetric video coding

Country Status (2)

Country Link
EP (1) EP3821602A4 (en)
WO (1) WO2020012071A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3942821A4 (en) 2019-03-19 2023-01-18 Nokia Technologies Oy An apparatus, a method and a computer program for volumetric video
US20220191544A1 (en) * 2020-12-14 2022-06-16 Nokia Technologies Oy Radiative Transfer Signalling For Immersive Video

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8964008B2 (en) * 2011-06-17 2015-02-24 Microsoft Technology Licensing, Llc Volumetric video presentation
US9928661B1 (en) * 2016-03-02 2018-03-27 Meta Company System and method for simulating user interaction with virtual objects in an interactive space
EP3823276B1 (en) * 2016-11-17 2024-08-14 INTEL Corporation Indication of suggested regions of interest in the metadata of an omnidirectional video
WO2018109265A1 (en) * 2016-12-15 2018-06-21 Nokia Technologies Oy A method and technical equipment for encoding media content

Also Published As

Publication number Publication date
WO2020012071A1 (en) 2020-01-16
EP3821602A4 (en) 2022-05-04

Similar Documents

Publication Publication Date Title
US11509933B2 (en) Method, an apparatus and a computer program product for volumetric video
EP3751857A1 (en) A method, an apparatus and a computer program product for volumetric video encoding and decoding
US11202086B2 (en) Apparatus, a method and a computer program for volumetric video
JP6939883B2 (en) UV codec centered on decoders for free-viewpoint video streaming
JP7499182B2 (en) Method, apparatus and stream for volumetric video format - Patents.com
WO2021260266A1 (en) A method, an apparatus and a computer program product for volumetric video coding
US20230283759A1 (en) System and method for presenting three-dimensional content
KR20220063254A (en) Video-based point cloud compression model for global signaling information
JP7344988B2 (en) Methods, apparatus, and computer program products for volumetric video encoding and decoding
WO2021191495A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
EP3729805B1 (en) Method and apparatus for encoding and decoding volumetric video data
WO2021205068A1 (en) A method, an apparatus and a computer program product for volumetric video coding
WO2020012071A1 (en) A method, an apparatus and a computer program product for volumetric video coding
EP4162691A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
KR20210027483A (en) Methods and Devices for Encoding and Decoding 3 Degrees of Freedom and Volumetric Compatible Video Stream
US12120347B2 (en) Method, an apparatus and a computer program product for video encoding and video decoding
EP3698332A1 (en) An apparatus, a method and a computer program for volumetric video
EP3873095A1 (en) An apparatus, a method and a computer program for omnidirectional video
WO2019211519A1 (en) A method and an apparatus for volumetric video encoding and decoding
WO2018069215A1 (en) Method, apparatus and stream for coding transparency and shadow information of immersive video format
WO2019185983A1 (en) A method, an apparatus and a computer program product for encoding and decoding digital volumetric video
EP3310057A1 (en) Method, apparatus and stream for coding transparency and shadow information of immersive video format
EP3310053A1 (en) Method and apparatus for coding transparency information of immersive video format
WO2022219230A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2023175243A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20210210

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20220404

RIC1 Information provided on ipc code assigned before grant

Ipc: H04N 21/84 20110101ALI20220328BHEP

Ipc: H04N 21/234 20110101ALI20220328BHEP

Ipc: H04N 21/44 20110101ALI20220328BHEP

Ipc: H04N 21/434 20110101ALI20220328BHEP

Ipc: H04N 13/268 20180101ALI20220328BHEP

Ipc: H04N 19/597 20140101ALI20220328BHEP

Ipc: G06F 3/01 20060101ALI20220328BHEP

Ipc: G06T 17/00 20060101ALI20220328BHEP

Ipc: H04N 21/472 20110101ALI20220328BHEP

Ipc: H04N 21/4725 20110101ALI20220328BHEP

Ipc: G06T 7/10 20170101ALI20220328BHEP

Ipc: G06F 3/0481 20130101ALI20220328BHEP

Ipc: G06T 19/20 20110101ALI20220328BHEP

Ipc: H04N 19/17 20140101ALI20220328BHEP

Ipc: H04N 19/33 20140101AFI20220328BHEP