WO2024069045A1 - An apparatus and a method for processing volumetric video content - Google Patents

An apparatus and a method for processing volumetric video content

Info

Publication number
WO2024069045A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameters
movements
capturing system
parameter
camera capturing
Prior art date
Application number
PCT/FI2023/050534
Other languages
French (fr)
Other versions
WO2024069045A9 (en)
Inventor
Lukasz Kondrad
Lauri Aleksi ILOLA
Sebastian Schwarz
Saba AHSAN
Sujeet Shyamsundar Mate
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2024069045A1 publication Critical patent/WO2024069045A1/en
Publication of WO2024069045A9 publication Critical patent/WO2024069045A9/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g. 3D video
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/85406Content authoring involving a specific file format, e.g. MP4 format
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/63Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/643Communication protocols
    • H04N21/6437Real-time Transport Protocol [RTP]

Definitions

  • the present invention relates to an apparatus, a method and a computer program for volumetric video coding.
  • BACKGROUND
  • Visual volumetric video-based coding (V3C; defined in ISO/IEC DIS 23090-5) provides a generic syntax and mechanism for volumetric video coding.
  • the generic syntax can be used by applications targeting volumetric content, such as point clouds, immersive video with depth, and mesh representations of volumetric frames.
  • volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane.
  • 6DOF six degrees of freedom
  • a volumetric video represents a 3D volume of space rather than a flat image plane.
  • Volumetric video may be captured using various configurations of 3D camera capturing systems. The relative position of the cameras with respect to other cameras in a capturing system may be fixed. On the other hand, the position of the cameras in the global coordinate system may change during the capturing process.
  • the apparatus capturing and creating the volumetric video content has no means to indicate to an apparatus receiving and rendering the volumetric video content, how the captured scene is intended to be rendered in terms of the changes of the position of the cameras during the capturing process; i.e. whether said changes should be reflected upon rendering and if they are intentional.
  • the rendered volumetric video content may appear significantly different from what the content creator intended.
  • an apparatus comprising means for capturing volumetric video content by a camera capturing system; means for determining one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and means for transmitting the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.
  • said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein a value of said first parameter is configured to control whether the rendering process takes into account said movements of the camera capturing system.
  • said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional.
  • the apparatus comprises means for carrying out a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
  • SEI Supplemental Enhancement Information
  • MIV MPEG Immersive Video
  • said SEI message comprises a rendering mode parameter capable of having a plurality of values, wherein a value of at least one of said one or more parameters is indicated by a rendering mode parameter value.
  • the apparatus comprises means for carrying out a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
  • the apparatus comprises means for setting a value of at least one parameter to indicate that the one or more parameters describing the movements of the camera capturing system have been adjusted according to a predetermined value of the rendering mode parameter.
  • the apparatus comprises means for carrying out a signalling of said one or more parameters by a uniquely identified Real-Time Transport Control Protocol (RTCP) application packet.
  • RTCP Real-Time Transport Control Protocol
  • a method comprises: capturing volumetric video content by a camera capturing system; determining one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and transmitting the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.
  • An apparatus comprises: means for receiving volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and means for controlling a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing.
  • said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein the apparatus comprises means for controlling, based on a value of said first parameter, whether the rendering process takes into account said movements of the camera capturing system.
  • said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional, wherein the apparatus comprises means for controlling the rendering process to ignore the unintentional movements of the camera capturing system based on a value of said second parameter.
  • the apparatus comprises means for receiving a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
  • SEI Supplemental Enhancement Information
  • MIV MPEG Immersive Video
  • the apparatus comprises means for receiving a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
  • a method according to a fourth aspect comprises: receiving volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and controlling a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing.
  • Computer readable storage media according to further aspects comprise code for use by an apparatus, which when executed by a processor, causes the apparatus to perform the above methods.
  • Figs.1a and 1b show an encoder and decoder for encoding and decoding 2D pictures
  • Figs.2a and 2b show a compression and a decompression process for 3D volumetric video
  • Fig.3 shows an example of a V3C bitstream
  • Fig.4 shows a structure of extension header data of an RTP packet header
  • Figs.5a – 5c show an example of rendering options for a scene captured with camera system movements involved
  • Figs.6a – 6c show an example of rendering options for a scene captured with intentional/unintentional camera system movements involved
  • Fig.7 shows a flow chart for an encoding or a content creating method according to an embodiment
  • Fig.8 shows a flow chart for a decoding or a rendering method according to an embodiment.
  • a video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can uncompress the compressed video representation back into a viewable form.
  • An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at lower bitrate).
  • Volumetric video may be captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.
  • Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane.
  • Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane.
  • 2D two-dimensional
  • Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example.
  • Volumetric video data represents a three-dimensional scene or object, and thus such data can be viewed from any viewpoint.
  • Volumetric video data can be used as an input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications.
  • AR augmented reality
  • VR virtual reality
  • MR mixed reality
  • Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g. color, opacity, reflectance, ...), together with any possible temporal changes of the geometry and attributes at given time instances (e.g. frames in 2D video).
  • Volumetric video is either generated from 3D models, i.e. computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera, a laser scan, a combination of video and dedicated depth sensors, etc. Also, a combination of CGI and real-world data is possible. Examples of representation formats for such volumetric data are triangle meshes, point clouds, or voxels.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time.
  • Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes.
  • Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data.
  • Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds on the other hand are well suited for applications, such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold.
  • Another way to represent 3D data is to code this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.
  • each point of each 3D surface is described as a 3D point with color and/or other attribute information such as surface normal or material reflectance.
  • A point cloud is a set of data points in a coordinate system, for example in a three-dimensional coordinate system defined by X, Y, and Z coordinates. The points may represent an external surface of an object in the screen space, e.g. in a three-dimensional space.
  • In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression of the representations becomes fundamental. Standard volumetric video representation formats, such as point clouds, meshes, and voxels, suffer from poor temporal compression performance.
  • Identifying correspondences for motion compensation in 3D space is an ill-defined problem, as both the geometry and the respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points, or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video-based approaches for compressing volumetric data, i.e. multiview with depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities. Instead of the above-mentioned approach, a 3D scene, represented as meshes, points, and/or voxels, can be projected onto one, or more, geometries.
  • these geometries may be “unfolded” or packed onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information may be transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format). Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency can be increased greatly. Using geometry projections instead of 2D-video-based approaches based on multiview and depth provides a better coverage of the scene (or object). Thus, 6DOF capabilities are improved.
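  • As a minimal illustration of the projection idea above (not the normative V3C process), the following Python sketch projects colored 3D points orthographically onto a single axis-aligned plane, producing a texture image and a depth (geometry) image that a standard 2D codec could then compress; the array shapes, the [0, 1) coordinate range and the single projection plane are assumptions made for the example.

        import numpy as np

        def project_to_plane(points, colors, res=64):
            # points: (N, 3) float coordinates in [0, 1); colors: (N, 3) uint8 values.
            texture = np.zeros((res, res, 3), dtype=np.uint8)   # texture plane
            depth = np.full((res, res), np.inf)                 # geometry (depth) plane
            for (x, y, z), c in zip(points, colors):
                u, v = int(x * res), int(y * res)
                if z < depth[v, u]:                             # keep the point nearest to the plane
                    depth[v, u], texture[v, u] = z, c
            return texture, depth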
  • Figs.1a and 1b show an encoder and decoder for encoding and decoding the 2D texture pictures, geometry pictures and/or auxiliary pictures.
  • a video codec consists of an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form.
  • the encoder discards and/or loses some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate).
  • Figure 1a illustrates an image to be encoded (I_n); a predicted representation of an image block (P'_n); a prediction error signal (D_n); a reconstructed prediction error signal (D'_n); a preliminary reconstructed image (I'_n); a final reconstructed image (R'_n); a transform (T) and inverse transform (T^-1); a quantization (Q) and inverse quantization (Q^-1); entropy encoding (E); a reference frame memory (RFM); inter prediction (P_inter); intra prediction (P_intra); mode selection (MS) and filtering (F).
  • Figure 1b illustrates a predicted representation of an image block (P'_n); a reconstructed prediction error signal (D'_n); a preliminary reconstructed image (I'_n); a final reconstructed image (R'_n); an inverse transform (T^-1); an inverse quantization (Q^-1); an entropy decoding (E^-1); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).
  • Many hybrid video encoders encode the video information in two phases.
  • pixel values in a certain picture area are predicted for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner).
  • the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g. the Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients.
  • DCT Discrete Cosine Transform
  • Video codecs may also provide a transform skip mode, which the encoders may choose to use.
  • the prediction error is coded in a sample domain, for example by deriving a sample-wise difference value relative to certain adjacent samples and coding the sample-wise difference value with an entropy coder.
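  • The two-phase scheme above can be illustrated with a small Python sketch that predicts a block, transforms the prediction error with an orthonormal DCT-II, quantizes the coefficients (which would then go to the entropy coder) and reconstructs the block; real codecs use integer transforms and rate-distortion optimized quantization, so this is only a toy model.

        import numpy as np

        def dct_matrix(n):
            # Orthonormal DCT-II basis matrix.
            k = np.arange(n)
            c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
            c[0, :] = np.sqrt(1.0 / n)
            return c

        def code_residual(original, predicted, qstep):
            d = original.astype(float) - predicted              # prediction error D_n
            c = dct_matrix(d.shape[0])
            coeffs = c @ d @ c.T                                # transform T
            levels = np.round(coeffs / qstep)                   # quantization Q (levels go to entropy coding)
            recon_error = c.T @ (levels * qstep) @ c            # inverse quantization and transform
            return levels, predicted + recon_error              # reconstructed block

        block = np.random.randint(0, 256, (8, 8))
        prediction = np.full((8, 8), block.mean())              # trivial "spatial" prediction
        levels, reconstructed = code_residual(block, prediction, qstep=16.0)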
  • Many video encoders partition a picture into blocks along a block grid. For example, in the High Efficiency Video Coding (HEVC) standard, the following partitioning and definitions are used.
  • HEVC High Efficiency Video Coding
  • a coding block may be defined as an NxN block of samples for some value of N such that the division of a coding tree block into coding blocks is a partitioning.
  • a coding tree block may be defined as an NxN block of samples for some value of N such that the division of a component into coding tree blocks is a partitioning.
  • a coding tree unit may be defined as a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples.
  • a coding unit may be defined as a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples.
  • a CU with the maximum allowed size may be named as LCU (largest coding unit) or coding tree unit (CTU) and the video picture is divided into non-overlapping LCUs.
  • LCU largest coding unit
  • CTU coding tree unit
  • a picture can be partitioned into tiles, which are rectangular and contain an integer number of LCUs.
  • a slice is defined to be an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit.
  • a slice segment is defined to be an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL unit. The division of each picture into slice segments is a partitioning.
  • an independent slice segment is defined to be a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment
  • a dependent slice segment is defined to be a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order.
  • a slice header is defined to be the slice segment header of the independent slice segment that is a current slice segment or is the independent slice segment that precedes a current dependent slice segment
  • a slice segment header is defined to be a part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment.
  • Entropy coding/decoding may be performed in many ways. For example, context-based coding/decoding may be applied, wherein both the encoder and the decoder modify the context state of a coding parameter based on previously coded/decoded coding parameters. Context-based coding may for example be context adaptive binary arithmetic coding (CABAC) or context-adaptive variable length coding (CAVLC) or any similar entropy coding.
  • CABAC context adaptive binary arithmetic coding
  • CAVLC context-adaptive variable length coding
  • Entropy coding/decoding may alternatively or additionally be performed using a variable length coding scheme, such as Huffman coding/decoding or Exp-Golomb coding/decoding.
  • Decoding of coding parameters from an entropy-coded bitstream or codewords may be referred to as parsing.
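  • As a concrete example of the variable length coding mentioned above, the following Python sketch encodes and parses unsigned Exp-Golomb (ue(v)) codewords; bit strings of '0'/'1' characters are used for readability instead of a real bit buffer.

        def ue_encode(value):
            # Unsigned Exp-Golomb: prefix of zeros followed by the binary code of value + 1.
            code = value + 1
            prefix_len = code.bit_length() - 1
            return "0" * prefix_len + format(code, "b")

        def ue_decode(bits):
            # Parse one codeword and return (value, remaining bits).
            zeros = len(bits) - len(bits.lstrip("0"))
            code = int(bits[zeros:2 * zeros + 1], 2)
            return code - 1, bits[2 * zeros + 1:]

        assert ue_encode(7) == "0001000"
        assert ue_decode("00010001")[0] == 7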
  • the phrase along the bitstream (e.g. indicating along the bitstream) may be defined to refer to out-of-band transmission, signaling, or storage in a manner that the out-of-band data is associated with the bitstream.
  • the phrase decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream.
  • an indication along the bitstream may refer to metadata in a container file that encapsulates the bitstream.
  • a first texture picture may be encoded into a bitstream, and the first texture picture may comprise a first projection of texture data of a first source volume of a scene model onto a first projection surface.
  • the scene model may comprise a number of further source volumes.
  • data on the position of the originating geometry primitive may also be determined, and based on this determination, a geometry picture may be formed. This may happen for example so that depth data is determined for each or some of the texture pixels of the texture picture. Depth data is formed such that the distance from the originating geometry primitive such as a point to the projection surface is determined for the pixels.
  • Such depth data may be represented as a depth picture, and similarly to the texture picture, such geometry picture (such as a depth picture) may be encoded and decoded with a video codec.
  • This first geometry picture may be seen to represent a mapping of the first projection surface to the first source volume, and the decoder may use this information to determine the location of geometry primitives in the model to be reconstructed.
  • encoding a geometry (or depth) picture into or along the bitstream with the texture picture is optional, for example in cases where the distance of all texture pixels to the projection surface is the same or there is no change in said distance between a plurality of texture pictures.
  • a geometry (or depth) picture may be encoded into or along the bitstream with the texture picture, for example, only when there is a change in the distance of texture pixels to the projection surface.
  • An attribute picture may be defined as a picture that comprises additional information related to an associated texture picture.
  • An attribute picture may for example comprise surface normal, opacity, or reflectance information for a texture picture.
  • a geometry picture may be regarded as one type of an attribute picture, although a geometry picture may be treated as its own picture type, separate from an attribute picture. Texture picture(s) and the respective geometry picture(s), if any, and the respective attribute picture(s) may have the same or different chroma format. Terms texture (component) image and texture (component) picture may be used interchangeably. Terms geometry (component) image and geometry (component) picture may be used interchangeably.
  • a specific type of a geometry image is a depth image. Embodiments described in relation to a geometry (component) image equally apply to a depth (component) image, and embodiments described in relation to a depth (component) image equally apply to a geometry (component) image.
  • Terms attribute image and attribute picture may be used interchangeably.
  • a geometry picture and/or an attribute picture may be treated as an auxiliary picture in video/image encoding and/or decoding.
  • Figures 2a and 2b illustrate an overview of example compression/decompression processes. The processes may be applied, for example, in MPEG visual volumetric video-based coding (V3C), defined currently in ISO/IEC DIS 23090-5: “Visual Volumetric Video-based Coding and Video-based Point Cloud Compression”, 2nd Edition.
  • V3C specification enables the encoding and decoding processes of a variety of volumetric media by using video and image coding technologies.
  • Such representations may include occupancy, geometry, and attribute components.
  • the occupancy component can inform a V3C decoding and/or rendering system of which samples in the 2D components are associated with data in the final 3D representation.
  • the geometry component contains information about the precise location of 3D data in space, while attribute components can provide additional properties, e.g. texture or material information, of such 3D data.
  • An example of volumetric media conversion at an encoder is shown in Figure 2a and an example of a 3D reconstruction at a decoder is shown in Figure 2b. There are alternatives to capture and represent a volumetric frame.
  • a volumetric frame can be represented as a point cloud.
  • a point cloud is a set of unstructured points in 3D space, where each point is characterized by its position in a 3D coordinate system (e.g., Euclidean), and some corresponding attributes (e.g., color information provided as RGBA value, or normal vectors).
  • a volumetric frame can be represented as images, with or without depth, captured from multiple viewpoints in 3D space.
  • the volumetric video can be represented by one or more view frames (where a view is a projection of a volumetric scene on to a plane (the camera plane) using a real or virtual camera with known/computed extrinsic and intrinsic parameters).
  • Each view may be represented by a number of components (e.g., geometry, color, transparency, and occupancy picture), which may be part of the geometry picture or represented separately.
  • a volumetric frame can be represented as a mesh.
  • A mesh is a collection of points, called vertices, and connectivity information between vertices, called edges. Vertices along with edges form faces. The combination of vertices, edges and faces can uniquely approximate shapes of objects.
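  • The representations discussed above can be captured in very small data structures; the Python sketch below shows an illustrative point cloud (positions plus per-point attributes) and an illustrative mesh (vertices and faces, with edges derived from the faces). The field names are assumptions, not part of any specification.

        from dataclasses import dataclass, field

        @dataclass
        class PointCloud:
            positions: list                               # (x, y, z) tuple per point
            colors: list = field(default_factory=list)    # e.g. RGBA tuple per point

        @dataclass
        class Mesh:
            vertices: list                                # (x, y, z) tuple per vertex
            faces: list                                   # (i, j, k) vertex-index tuple per triangle

            def edges(self):
                # Connectivity information implied by the faces.
                e = set()
                for a, b, c in self.faces:
                    e |= {tuple(sorted(p)) for p in ((a, b), (b, c), (c, a))}
                return e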
  • a volumetric frame can provide viewers the ability to navigate a scene with six degrees of freedom, i.e., both translational and rotational movement of their viewing pose (which includes yaw, pitch, and roll).
  • the data to be coded for a volumetric frame can also be significant, as a volumetric frame can contain a large number of objects, and the positioning and movement of these objects in the scene can result in many dis-occluded regions.
  • the interaction of the light and materials in objects and surfaces in a volumetric frame can generate complex light fields that can produce texture variations for even a slight change of pose.
  • a sequence of volumetric frames is a volumetric video. Due to the large amount of information, storage and transmission of a volumetric video requires compression.
  • a way to compress a volumetric frame can be to project the 3D geometry and related attributes into a collection of 2D images along with additional associated metadata.
  • the projected 2D images can then be coded using 2D video and image coding technologies, for example ISO/IEC 14496-10 (H.264/AVC) and ISO/IEC 23008-2 (H.265/HEVC).
  • the metadata can be coded with technologies specified in specifications such as ISO/IEC 23090-5.
  • the coded images and the associated metadata can be stored or transmitted to a client that can decode and render the 3D volumetric frame.
  • V3C Visual Volumetric Video-based Coding
  • V-PCC Video-based Point Cloud Compression
  • ISO/IEC 23090-5 specifies the syntax, semantics, and process for coding volumetric video.
  • the specified syntax is designed to be generic, so that it can be reused for a variety of applications.
  • Point clouds, immersive video with depth, and mesh representations can all use the ISO/IEC 23090-5 standard with extensions that deal with the specific nature of the final representation.
  • the purpose of the specification is to define how to decode and interpret the associated data (for example atlas data in ISO/IEC 23090-5) which tells a renderer how to interpret 2D frames to reconstruct a volumetric frame.
  • V-PCC ISO/IEC 23090-5
  • MIV MPEG Immersive Video
  • applications such as MIV and V-PCC use a number of V3C syntax elements with slightly modified semantics.
  • An example of how the generic syntax element can be differently interpreted by the application is pdu_projection_id.
  • the syntax element pdu_projection_id specifies the index of the projection plane for the patch. There can be 6 or 18 projection planes in V-PCC, and they are implicit, i.e., pre-determined.
  • in MIV, pdu_projection_id corresponds to a view ID, i.e., identifies which view the patch originated from. View IDs and their related information are explicitly provided in the MIV view parameters list and may be tailored for each content.
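  • A hedged sketch of this application-dependent interpretation: the same pdu_projection_id value is read as an implicit projection-plane index in V-PCC and as an index into the explicitly signalled view parameters list in MIV. The function and argument names are illustrative only.

        def interpret_pdu_projection_id(application, pdu_projection_id, miv_view_params=None):
            if application == "V-PCC":
                # One of the pre-determined (implicit) 6 or 18 projection planes.
                return {"projection_plane": pdu_projection_id}
            if application == "MIV":
                # A view ID resolved through the explicit view parameters list.
                return {"view": miv_view_params[pdu_projection_id]}
            raise ValueError("unknown application: " + application)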
  • the mesh coding currently developed in MPEG 3DG (ISO SC29 WG7) will re-use V3C syntax as much as possible and can also slightly modify the semantics.
  • to differentiate between these applications, V3C uses the ptl_profile_toolset_idc parameter.
  • a V3C bitstream is a sequence of bits that forms the representation of coded volumetric frames and the associated data, making up one or more coded V3C sequences (CVS).
  • a V3C bitstream is composed of V3C units that contain V3C video sub-bitstreams, V3C atlas sub-bitstreams, or a V3C parameter set (VPS).
  • Fig. 3 illustrates an example of a V3C bitstream.
  • Video sub-bitstream and atlas sub-bitstreams can be referred to as V3C sub-bitstreams.
  • Each V3C unit has a V3C unit header and a V3C unit payload.
  • a V3C unit header in conjunction with VPS information identifies which V3C sub-bitstream a V3C unit contains and how to interpret it.
  • V3C bitstream can be stored according to Annex C of ISO/IEC 23090-5, which specifies syntax and semantics of a sample stream format to be used by applications that deliver some or all of the V3C unit stream as an ordered stream of bytes or bits within which the locations of V3C unit boundaries need to be identifiable from patterns in the data.
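  • The following Python sketch splits such a sample stream into V3C units, assuming the Annex C layout in which a one-byte header carries the unit-size field precision and each unit is preceded by its size; the field positions follow one reading of ISO/IEC 23090-5, Annex C and should be verified against the specification.

        def split_sample_stream(data):
            # ssvh_unit_size_precision_bytes_minus1 is assumed to occupy the three most
            # significant bits of the first byte of the sample stream header.
            precision = (data[0] >> 5) + 1
            pos, units = 1, []
            while pos < len(data):
                size = int.from_bytes(data[pos:pos + precision], "big")  # ssvu_v3c_unit_size
                pos += precision
                units.append(data[pos:pos + size])                       # one V3C unit (header + payload)
                pos += size
            return units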
  • the Real-time Transport Protocol (RTP) is intended for the end-to-end, real-time transfer of streaming media and provides facilities for jitter compensation and detection of packet loss and out-of-order delivery.
  • RTP allows data transfer to multiple destinations through IP multicast or to a specific destination through IP unicast. The majority of RTP implementations are built on the User Datagram Protocol (UDP). Other transport protocols may also be utilized.
  • UDP User Datagram Protocol
  • RTP can be used together with other protocols such as H.323 and the Real Time Streaming Protocol (RTSP).
  • the RTP specification describes two protocols: RTP and RTCP.
  • RTP is used for the transfer of multimedia data, and RTCP is used to periodically send control information and QoS parameters.
  • RTP sessions may be initiated between client and server using a signalling protocol, such as H.323, the Session Initiation Protocol (SIP), or RTSP.
  • SIP Session Initiation Protocol
  • RTSP Real Time Streaming Protocol
  • These protocols may use the Session Description Protocol (RFC 8866) to specify the parameters for the sessions.
  • RTP is designed to carry a multitude of multimedia formats, which permits the development of new formats without revising the RTP standard. To this end, the information required by a specific application of the protocol is not included in the generic RTP header.
  • for a class of applications (e.g., audio, video), an RTP profile may be defined, and for a media format (e.g., a specific video coding format), an associated RTP payload format may be defined. Every instantiation of RTP in a particular application may require a profile and payload format specification.
  • the profile defines the codecs used to encode the payload data and their mapping to payload format codes in the protocol field Payload Type (PT) of the RTP header.
  • PT Payload Type
  • RTP profile for audio and video conferences with minimal control is defined in RFC 3551.
  • the profile defines a set of static payload type assignments, and a dynamic mechanism for mapping between a payload format and a PT value using the Session Description Protocol (SDP).
  • SDP Session Description Protocol
  • examples of such payload formats are the RTP payload format for H.264 video defined in RFC 6184 and the RTP payload format for High Efficiency Video Coding (HEVC) defined in RFC 7798.
  • An RTP session is established for each multimedia stream. Audio and video streams may use separate RTP sessions, enabling a receiver to selectively receive components of a particular stream.
  • the RTP specification recommends an even port number for RTP and the use of the next odd port number for the associated RTCP session. A single port can be used for RTP and RTCP in applications that multiplex the protocols.
  • Each RTP stream consists of RTP packets, which in turn consist of RTP header and payload parts. RTP packets are created at the application layer and handed to the transport layer for delivery.
  • RTP header: Each unit of RTP media data created by an application begins with the RTP packet header.
  • the RTP header has a minimum size of 12 bytes.
  • optional header extensions may be present. This is followed by the RTP payload, the format of which is determined by the particular class of application.
  • the fields in the header are as follows: Version: (2 bits) Indicates the version of the protocol.
  • X (Extension): (1 bit) Indicates the presence of an extension header between the header and the payload data; the extension header is application or profile specific.
  • CC (CSRC count): (4 bits) Contains the number of CSRC identifiers that follow the SSRC. M (Marker): (1 bit) Signalling used at the application level in a profile-specific manner. If it is set, it means that the current data has some special relevance for the application.
  • PT (Payload type): (7 bits) Indicates the format of the payload and thus determines its interpretation by the application. Sequence number: (16 bits) The sequence number is incremented for each RTP data packet sent and is to be used by the receiver to detect packet loss and to accommodate out-of-order delivery. Timestamp: (32 bits) Used by the receiver to play back the received samples at the appropriate time and interval. When several media streams are present, the timestamps may be independent in each stream.
  • the granularity of the timing is application specific. For example, a video stream may use a 90 kHz clock. The clock granularity is one of the details that is specified in the RTP profile for an application.
  • SSRC (32 bits) Synchronization source identifier uniquely identifies the source of the stream. The synchronization sources within the same RTP session will be unique.
  • CSRC (32 bits each) Contributing source IDs enumerate contributing sources to a stream which has been generated from multiple sources.
  • Header extension: (optional, presence indicated by the Extension field) The first 32-bit word contains a profile-specific identifier (16 bits) and a length specifier (16 bits) that indicates the length of the extension in 32-bit units, excluding the 32 bits of the extension header.
  • the extension header data is shown in Figure 4.
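  • A minimal Python sketch of packing the 12-byte fixed RTP header described above (version, padding, extension, CSRC count, marker, payload type, sequence number, timestamp, SSRC) per RFC 3550; CSRC entries and header extensions are omitted for brevity.

        import struct

        def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=False, extension=False):
            byte0 = (2 << 6) | (int(extension) << 4)            # V=2, P=0, X, CC=0
            byte1 = (int(marker) << 7) | (payload_type & 0x7F)  # M, PT
            return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                               timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

        header = pack_rtp_header(payload_type=99, seq=1, timestamp=0, ssrc=0x12345678)
        assert len(header) == 12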
  • the first two bytes of the payload of an RTP packet are commonly referred to as the payload header.
  • the payload header consists of the same fields (F, NUT, NLI, and TID) as the NAL unit header as defined in ISO/IEC 23090-5, irrespective of the type of the payload structure.
  • RTP payload header: For convenience, the structure of the RTP payload header is described below; it is two bytes carrying the fields F, NUT, NLI, and TID.
  • NLI: nal_layer_id, as specified in ISO/IEC 23090-5, defines the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies.
  • TID: nal_temporal_id_plus1 minus 1, as specified in ISO/IEC 23090-5, defines a temporal identifier for the NAL unit. The value of nal_temporal_id_plus1 MUST NOT be equal to 0. The remainder of the data in the RTP payload contains the information as defined by the RTP header.
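  • A hedged sketch of reading the two-byte RTP payload header above, assuming it follows the same field layout as the ISO/IEC 23090-5 NAL unit header, i.e. F (1 bit), NUT (6 bits), NLI (6 bits) and TID (3 bits); verify the exact bit widths against the payload format specification.

        def parse_payload_header(b0, b1):
            f = b0 >> 7                               # F: forbidden zero bit
            nut = (b0 >> 1) & 0x3F                    # NUT: nal_unit_type
            nli = ((b0 & 0x01) << 5) | (b1 >> 3)      # NLI: nal_layer_id
            tid = b1 & 0x07                           # TID: nal_temporal_id_plus1 (must not be 0)
            return f, nut, nli, tid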
  • An Application-Defined RTCP Packet (APP), a.k.a. an RTCP APP packet, is intended for experimental use when developing new applications and features.
  • if a new APP RTCP packet type is tested and found useful, it may be registered with the Internet Assigned Numbers Authority (IANA) as a new packet type.
  • the structure of the RTCP APP packet is a 32-bit aligned layout comprising the version, padding and subtype bits, the packet type, a length field, the SSRC/CSRC, a four-character name, and application-dependent data.
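  • A minimal Python sketch of building an RTCP APP packet as laid out above (RFC 3550, packet type 204): version/padding/subtype, packet type, length in 32-bit words minus one, SSRC, a four-character name and 32-bit aligned application-dependent data.

        import struct

        def pack_rtcp_app(ssrc, name, data, subtype=0):
            assert len(name) == 4 and len(data) % 4 == 0
            length = (8 + len(name) + len(data)) // 4 - 1       # in 32-bit words minus one
            byte0 = (2 << 6) | (subtype & 0x1F)                 # V=2, P=0, subtype
            return struct.pack("!BBHI", byte0, 204, length, ssrc) + name + data

        packet = pack_rtcp_app(0x12345678, b"TEST", b"\x01\x00\x00\x00")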
  • SDP Session Description Protocol
  • SDP is a format for describing multimedia communication sessions for the purposes of announcement and invitation. Its predominant use is in support of conversational and streaming media applications. SDP does not deliver any media streams itself, but it is used between endpoints for negotiation of network metrics, media types, and other associated properties. The set of properties and parameters is called a session profile. SDP is extensible for the support of new media types and formats.
  • the Session Description Protocol describes a session as a group of fields in a text-based format, one field per line.
  • Session descriptions consist of three sections: session, timing, and media descriptions. Each description may contain multiple timing and media descriptions. Names are only unique within the associated syntactic construct.
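  • The walk-through below refers to the well-known example session description of the SDP specification (the RFC 4566 example, reproduced here from that specification so the description has something concrete to point at):

        v=0
        o=jdoe 2890844526 2890842807 IN IP4 10.47.16.5
        s=SDP Seminar
        i=A Seminar on the session description protocol
        u=http://www.example.com/seminars/sdp.pdf
        e=j.doe@example.com (Jane Doe)
        c=IN IP4 224.2.17.12/127
        t=2873397496 2873404696
        a=recvonly
        m=audio 49170 RTP/AVP 0
        m=video 51372 RTP/AVP 99
        a=rtpmap:99 h263-1998/90000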
  • This session is originated by the user “jdoe” at IPv4 address 10.47.16.5. Its name is “SDP Seminar” and extended session information (“A Seminar on the session description protocol”) is included along with a link for additional information and an email address to contact the responsible party, Jane Doe.
  • This session is specified to last two hours using NTP timestamps, with a connection address (which indicates the address clients must connect to or – when a multicast address is provided, as it is here – subscribe to) specified as IPv4 224.2.17.12 with a TTL of 127. Recipients of this session description are instructed to only receive media. Two media descriptions are provided, both using the RTP Audio Video Profile.
  • the first is an audio stream on port 49170 using RTP/AVP payload type 0 (defined by RFC 3551 as PCMU), and the second is a video stream on port 51372 using RTP/AVP payload type 99 (defined as “dynamic”). Finally, an attribute is included which maps RTP/AVP payload type 99 to format h263-1998 with a 90 kHz clock rate. RTCP ports for the audio and video streams of 49171 and 51373, respectively, are implied.
  • Codec-specific parameters may be added in other attributes, for example, "fmtp".
  • "fmtp" attribute allows parameters that are specific to a particular format to be conveyed in a way that SDP does not have to understand them.
  • the format can be one of the formats specified for the media.
  • Format-specific parameters, separated by semicolons, may be any set of parameters required to be conveyed by SDP and given unchanged to the media tool that will use this format. At most one instance of this attribute is allowed for each format.
  • RFC 7798 defines the following parameters: sprop-vps, sprop-sps, sprop-pps, profile-space, profile-id, tier-flag, level-id, interop-constraints, profile-compatibility-indicator, sprop-sub-layer-id, recv-sub-layer-id, max-recv-level-id, tx-mode, max-lsr, max-lps, max-cpb, max-dpb, max-br, max-tr, max-tc, max-fps, sprop-max-don-diff, sprop-depack-buf-nalus, sprop-depack-buf-bytes, depack-buf-cap, sprop-segmentation-id, and so on.
  • Volumetric video content may be captured by various configurations of camera capturing systems.
  • the relative position of the cameras with respect to other cameras in a capturing system may be fixed.
  • the cameras may be mounted on a rigid stand, such as on a stand-alone multiview camera, e.g. the OZO camera, or the cameras may be mounted on a mobile device.
  • the position of the cameras in the global coordinate system may change during the capturing process.
  • the rigid stand of e.g. a multiview camera can be moved, or the individual cameras of the capturing system may move.
  • a user may move the device, such as a mobile device, comprising the camera capturing system while capturing a video. Parameters describing the movement of the camera system, i.e.
  • volumetric video content e.g. MIV
  • a rigid system i.e. the relative camera positions with respect to each other do not change
  • the movement of the system can be configured by the creator to achieve “paint the space” type of experiences, where the viewer sees the rendered scene in global coordinate space with the camera rig movement visible. The movement of the camera system and viewer position are controlled separately.
  • a second scenario such as in a real-time augmented reality (AR) conversation
  • one or both of the end-points may be moving, i.e. the rigid camera system moves upon capturing, but also the viewer may move within the rendered scene.
  • if the camera system movement from one end-point were to be translated to the other end-point, the other end-point would have to move to keep up with the moving first end-point.
  • An example of this scenario is depicted in Figures 5a – 5c.
  • the user U1 captures, while moving forward, an extended reality (XR) scene comprising two visual objects A and B. Due to the movement, the user U1 moves closer to the object A and farther away from the object B during the capture.
  • XR extended reality
  • Figures 5b and 5c illustrate two options for the user U2 to render the scene captured by the user U1.
  • in Fig. 5b, the scene is rendered according to the movement of the user U1, whereupon the object A moves closer to the user U2 and the object B moves farther away from the user U2 during the rendering.
  • Fig.5c depicts such a situation, where the rendering is locked to the original position of the user U1, and the rendered scene, and thus the locked positions of the objects A and B, are kept stationary with respect to the user U2.
  • the user U2 may lock the scene captured by the user U1 for rendering in different ways.
  • the user U2 may always view the space of the user U1 with respect to the initial position of the user U1.
  • the user U2 may always view the space of the user U1 with respect to current position of U1.
  • the user U2 may view the space of the user U1 such that the initial position of the user U1 is locked to the initial position of the user U2, i.e., the user U2 may roam around in the space.
  • thus, the content creator, such as the apparatus of the user U1, has no means to indicate to the renderer, such as the apparatus of the user U2, how the captured scene is intended to be rendered.
  • the rendering apparatus, which may be referred to as a renderer/client/viewer/receiver, does not know if it should keep the capture camera extrinsics fixed (as in the second scenario), or if it should support free translation of the capture system and allow the viewer to freely explore the content (as in the first scenario).
  • the renderer is missing information on how to provide the final rendering to the end user, such as the user U2.
  • there may also be movements of the capturing apparatus, such as a mobile device, wherein some of the movements of the capturing apparatus are unintentional while others are intentional. For example, a translational motion of the handheld device upon walking is typically not considered intentional, but the rotation around the user's face to change the angle of recording typically is intentional, as illustrated by the sketch after the Figure 6 discussion below.
  • in Fig. 6a, the user U1 captures, using a handheld device, an extended reality (XR) scene comprising two visual objects A and B and the user U1's face, by a user movement, which is depicted by the rounded arrow.
  • the captured scene is delivered to the user U2.
  • the capturing device such as the apparatus of the user U1
  • the rendering device such as the apparatus of the user U2
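  • As foreshadowed above, a hedged Python sketch of the idea: when the translation of a handheld capture device is flagged as unintentional but its rotation as intentional, a renderer can keep the rotational part of a camera extrinsic matrix and discard the translational part. The 4x4 matrix convention and function names are assumptions for illustration.

        import numpy as np

        def filter_extrinsic(extrinsic, keep_translation, keep_rotation):
            # extrinsic: 4x4 homogeneous camera pose matrix.
            filtered = np.eye(4)
            if keep_rotation:
                filtered[:3, :3] = extrinsic[:3, :3]     # 3x3 rotation block
            if keep_translation:
                filtered[:3, 3] = extrinsic[:3, 3]       # translation column
            return filtered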
  • the method comprises capturing (700) volumetric video content by a camera capturing system; determining (702) one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and transmitting (704) the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.
  • the method enables the content creator, such as the capturing device or its user, to determine whether at least some of the movements of the camera capturing system upon capturing said volumetric video content should be taken into account in the rendering process, thereby providing signaling, which allows the renderer to properly interpret the provided camera extrinsic information and render the content as intended.
  • the capturing device may comprise means, such as an application, for determining said one or more parameters describing the movements of the camera capturing system.
  • the application may comprise default settings regarding the one or more parameters describing the movements of the camera capturing system, i.e. the camera extrinsic information, but the settings may be user-adjustable.
  • the sender, as used herein, may refer to an apparatus comprising a camera capturing system, such as a mobile device or any other apparatus provided with a multi-view camera and transmission means.
  • the sender may also refer to an apparatus configured to obtain the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system from an external device, such as from a stand-alone multiview camera or a data storage, and configured to transmit said data to a receiver.
  • the sender may be a server in client-server architecture, for example.
  • said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein a value of said first parameter is configured to control whether the rendering process takes into account said movements of the camera capturing system.
  • the application may comprise default settings or it may provide the user with the possibility to define, by the value of the first parameter, whether the changes of the mobile movement (i.e.
  • the volumetric video encoder e.g. MIV encoder
  • the value of the first parameter may be dependent on the capture type; i.e. whether to allow the viewer to freely explore the content or to lock the captured scene at some temporal position.
  • said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional.
  • the application may comprise default settings or it may provide the user with the possibility to define, by the value of the second parameter, which movements are considered intentional, and thus to be recorded in the camera extrinsic information. Consequently, unintentional movements shall be ignored.
  • Figure 8 shows an example of a method comprising receiving (800) volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and controlling (802) a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing.
  • the receiver determines, based on the received one or more parameters describing the movements of the camera capturing system during capturing, how the content creator has intended the rendering to be carried out, and then performs the rendering of the volumetric video content accordingly.
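  • An illustrative Python sketch of this receiver-side decision, using hypothetical parameter names for the first and second parameter described earlier; the actual syntax is defined by the signalling embodiments below.

        def control_rendering(params):
            # First parameter: should the camera-system movement be reflected at all?
            use_movement = params.get("apply_camera_movement", True)
            # Second parameter: ignore movement that the sender flagged as unintentional.
            if use_movement and not params.get("movement_intentional", True):
                use_movement = False
            return {"apply_capture_extrinsics_motion": use_movement}

        assert control_rendering({"apply_camera_movement": False}) == {"apply_capture_extrinsics_motion": False}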
  • the receiver may refer to an apparatus comprising a rendering functionality, a.k.a. a renderer, which may also be referred to as a media player or viewer.
  • the receiver may also be a client in client-server architecture.
  • the method as performed by the sender comprises carrying out a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
  • SEI Supplemental Enhancement Information
  • MIV MPEG Immersive Video
  • the application may provide a rendering mode through an SEI message as described in Table 1 below.
  • Table 1. A dedicated value is preferably reserved for the rendering mode SEI message payload type that does not conflict with the existing payload types in ISO/IEC 23090-5, Annex F. In the example of Table 2, the reserved value of 19 is used.
  • Table 2. Said SEI message comprises a rendering mode parameter capable of having a plurality of values, wherein a value of at least one of said one or more parameters is indicated by a rendering mode parameter value.
  • rendering_mode 1 may indicate to a renderer that a sender-locked rendering is recommended. This mode would be appropriate in the case of a real-time conversational scenario or an AR telephony call. Setting the rendering_mode to the sender-locked value would allow the sender to indicate to the receiver that the translational motion of the camera extrinsics should be ignored by the renderer, thus locking the receiver to the sender’s position.
  • rendering_mode 2 may indicate to a renderer that a viewer-locked rendering is recommended. This mode could be appropriate in the case of free-view consumption of content. Setting the rendering_mode to viewer-locked would allow the sender to signal to the receiver that it is acceptable to allow separate transformation of the receiver camera properties.
  • said SEI message may comprise further parameters that indicate the camera extrinsic information in a more detailed manner.
  • the sender may also provide additional parameter(s) that inform which changes in the camera extrinsics could be intentional and which not; for example, a rotation along one axis could be intentional while along the other axis is not.
  • said SEI message may comprise parameters that correspond to sender- or viewer-locked camera intrinsics. Storing the rendering modes in an SEI message makes it possible to include the SEI message in a bitstream after the encoding of the volumetric video content has been performed. Moreover, including the rendering modes in an SEI message does not cause misinterpretations in legacy decoders, as an unknown SEI message is ignored by such decoders.
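  • A hedged Python sketch of such a rendering mode SEI payload, using the mode values described above (1 = sender-locked, 2 = viewer-locked) and the payload type value 19 reserved in the example; the byte layout is an illustration, not the normative SEI syntax of ISO/IEC 23090-5.

        from enum import IntEnum

        class RenderingMode(IntEnum):
            UNSPECIFIED = 0
            SENDER_LOCKED = 1      # ignore translational motion of the capture extrinsics
            VIEWER_LOCKED = 2      # allow separate transformation of the receiver camera

        RENDERING_MODE_SEI_PAYLOAD_TYPE = 19   # reserved value used in the example above

        def build_rendering_mode_sei(mode):
            # payload_type, payload_size, payload: all one-byte fields in this toy layout.
            return bytes([RENDERING_MODE_SEI_PAYLOAD_TYPE, 1, int(mode)])

        sei = build_rendering_mode_sei(RenderingMode.SENDER_LOCKED)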
  • the method comprises carrying out a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
  • SDP Session Description Protocol
  • the application or a server may provide the recommended rendering mode to a receiver/client as an out-of-band parameter.
  • a new generic parameter, which may be referred to as, for example, 3d-rendering-mode, may be defined that can be used by any volumetric video.
  • the syntax and semantics of the payload format parameter is defined based on the media type the parameter is used in combination with.
  • the encoded value representation can be base64 encoded data, as defined in RFC 4648, that contains the SEI message described above.
  • the value of 3d-rendering-mode parameter may indicate an application type, possibly as an integer value or a textual tag.
  • the application type indicates to the sender and/or the receiver the intended use of the rendered video.
  • a peer-to-peer AR telephony call can be indicated as ‘p2p telephony’ and would require a sender-locked mode of rendering.
  • the application type can be set based on the current mode of use of the AR device. For example, if it is noticed that the caller is multitasking, the device may be configured to take the call in a limited fashion (for the sake of user experience or device capabilities) instead of rendering a viewer-locked scene.
  • the parameter 3d-rendering-mode may also be used in session negotiation, where one entity informs the other entity which rendering mode it supports and consequently expects to receive. If 3d-rendering-mode is included in an offer but not supported by the receiver, the attribute could be dropped.
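  • A hedged sketch of conveying the proposed 3d-rendering-mode parameter on the media level of an SDP description, either as a textual application tag or as base64-coded SEI data (RFC 4648); the payload type number, the media line and the use of an "a=fmtp" attribute are assumptions for illustration.

        import base64

        def rendering_mode_fmtp(pt, tag=None, sei_payload=None):
            # Either a textual application-type tag or base64-coded SEI bytes.
            value = tag if tag is not None else base64.b64encode(sei_payload).decode("ascii")
            return "a=fmtp:{} 3d-rendering-mode={}".format(pt, value)

        print("m=video 51372 RTP/AVP 100")                    # hypothetical media description
        print(rendering_mode_fmtp(100, tag="p2p telephony"))  # sender-locked application type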
  • the sender may deliver the camera intrinsics and extrinsics as usual and the receiver will apply those parameters for rendering.
  • the sender may perform additional processing to prepare the content bitstream such that the intended experience is achieved upon rendering. The latter option may be carried out as a sender-side implementation without any signalling. This may require application-level logic.
  • the method comprises setting a value of at least one parameter to indicate that the one or more parameters describing the movements of the camera capturing system have been adjusted according to a predetermined value of the rendering mode parameter.
  • the sender and receiver may agree on what may be referred to as a receiver-transparent rendering mode.
  • the sender transforms the capture apparatus position and rotation information parameters according to the sender-locked, viewer-locked or any other mode.
  • the receiver can perform rendering without the knowledge about the rendering mode.
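  • A hedged Python sketch of this receiver-transparent option: the sender rewrites the capture position parameters before transmission (here, pinning the translation to the first pose for a sender-locked experience) so that the receiver can render as usual without knowing the rendering mode. The 4x4 extrinsic convention is an assumption.

        import numpy as np

        def make_receiver_transparent(extrinsics):
            # extrinsics: list of 4x4 camera pose matrices over time.
            locked = []
            for e in extrinsics:
                m = e.copy()
                m[:3, 3] = extrinsics[0][:3, 3]   # freeze translation at the initial pose
                locked.append(m)
            return locked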
  • the parameter 3d-rendering-mode may include a V-PCC specific tag, which informs a client that it is allowed to ignore the AAPS camera extrinsic information, or parts of it.
  • the parameter 3d-rendering-mode may contain more than one tag separated by a semicolon. Each tag may be predefined to indicate to a client how to optimally render the content.
  • the parameter 3d-rendering-mode comprises one tag ‘p2p telephony’ and the following tags indicate unintentional (‘translation-unint’; ‘roll-unint’) and intentional (‘yaw-int’; ‘pitch-int’) behaviour (see the parsing sketch after this list).
  • a server may signal the rendering mode using a “v3c-sei” parameter as follows (a sketch of assembling this parameter value is given after this list): Said parameter may be used to convey SEI NAL units of the V3C common atlas sub-bitstream for out-of-band transmission. The value of the parameter is a comma-separated list of encoded representations of the SEI NAL units (i.e.
  • the method comprises carrying out a signalling of said one or more parameters by a uniquely identified Real-Time Transport Control Protocol (RTCP) application packet.
  • an application or a server may provide a recommended rendering mode and related parameters to a client through an RTCP APP packet (a sketch of such a packet is given after this list), which can be identified by a unique name value, e.g.
  • the embodiments relating to the encoding aspects may be implemented in an apparatus comprising: means for capturing volumetric video content by a camera capturing system; means for determining one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and means for transmitting the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.
  • said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein a value of said first parameter is configured to control the rendering process whether to take into account said movements of the camera capturing system.
  • said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional.
  • the apparatus comprises means for carrying out a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
  • SEI message comprises a rendering mode parameter capable of having a plurality of values, wherein a value of at least one of said one or more parameters are indicated by a rendering mode parameter value.
  • the apparatus comprises means for carrying out a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
  • the apparatus comprises means for setting a value of at least one parameter to indicate that the one or more parameters describing the movements of the camera capturing system have been adjusted according to a predetermined value of the rendering mode parameter.
  • the apparatus comprises means for carrying out a signalling of said one or more parameters by a uniquely identified Real-Time Transport Control Protocol (RTCP) application packet.
  • the embodiments relating to the encoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory having computer program code stored thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: capture volumetric video content by a camera capturing system; determine one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and transmit the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.
  • said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein a value of said first parameter is configured to control the rendering process whether to take into account said movements of the camera capturing system.
  • said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional.
  • the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to carry out a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
  • said SEI message comprises a rendering mode parameter capable of having a plurality of values, wherein a value of at least one of said one or more parameters are indicated by a rendering mode parameter value.
  • the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to carry out a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
  • the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to set a value of at least one parameter to indicate that the one or more parameters describing the movements of the camera capturing system have been adjusted according to a predetermined value of the rendering mode parameter.
  • the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to carry out a signalling of said one or more parameters by a uniquely identified Real-Time Transport Control Protocol (RTCP) application packet.
  • the embodiments relating to the decoding aspects may be implemented in an apparatus comprising means for receiving volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and means for controlling a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing.
  • said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein the apparatus comprises means for controlling the rendering process whether to take into account said movements of the camera capturing system based on a value of said first parameter.
  • said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional, wherein the apparatus comprises means for controlling the rendering process to ignore the unintentional movements of the camera capturing system based on a value of said second parameter.
  • the apparatus comprises means for receiving a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
  • the apparatus comprises means for receiving a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
  • the embodiments relating to the decoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory having computer program code stored thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and control a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing.
  • said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to control the rendering process whether to take into account said movements of the camera capturing system based on a value of said first parameter.
  • said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional, wherein the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to control the rendering process to ignore the unintentional movements of the camera capturing system based on a value of said second parameter.
  • the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to receive a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
  • the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to receive a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
  • Such apparatuses may comprise e.g. the functional units disclosed in any of the Figures 1a, 1b, 2a and 2b for implementing the embodiments. In the above, some embodiments have been described with reference to encoding.
  • said encoding may comprise one or more of the following: encoding source image data into a bitstream, encapsulating the encoded bitstream in a container file and/or in packet(s) or stream(s) of a communication protocol, and announcing or describing the bitstream in a content description, such as the Media Presentation Description (MPD) of ISO/IEC 23009-1 (known as MPEG-DASH) or the IETF Session Description Protocol (SDP).
  • said decoding may comprise one or more of the following: decoding image data from a bitstream, decapsulating the bitstream from a container file and/or from packet(s) or stream(s) of a communication protocol, and parsing a content description of the bitstream.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. Embodiments of the invention may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate. Programs, such as those provided by Synopsys, Inc.
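As an illustration of the SEI-based signalling of the rendering mode discussed in the list above, the following minimal Python sketch packs a hypothetical rendering-mode payload. The field layout, the two-bit rendering mode value and the per-axis intentional-movement flags are assumptions made for illustration only; they are not the syntax defined in the MIV specification.

    # Minimal sketch of a hypothetical rendering-mode SEI payload; the field
    # layout below is an assumption for illustration, not the MIV-specified syntax.
    SENDER_LOCKED, VIEWER_LOCKED = 0, 1          # hypothetical rendering_mode values

    def pack_rendering_mode_payload(rendering_mode,
                                    yaw_intentional, pitch_intentional,
                                    roll_intentional, translation_intentional):
        # 2 bits of rendering_mode followed by four one-bit intentional-movement flags
        byte = (rendering_mode & 0x3) << 6
        byte |= int(yaw_intentional) << 5
        byte |= int(pitch_intentional) << 4
        byte |= int(roll_intentional) << 3
        byte |= int(translation_intentional) << 2
        return bytes([byte])

    payload = pack_rendering_mode_payload(SENDER_LOCKED, True, True, False, False)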
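The session negotiation with the 3d-rendering-mode parameter, referred to in the list above, may be sketched as follows. The fmtp line formatting, the payload type number 100 and the placeholder parameter name are assumptions; only the behaviour of dropping an unsupported attribute follows the description above.

    # Sketch of offer/answer handling for a hypothetical media-level
    # "3d-rendering-mode" format parameter.
    def answer_fmtp(offered_params, supported_modes):
        kept = []
        for param in offered_params.split(";"):
            name, _, value = param.strip().partition("=")
            if name == "3d-rendering-mode" and value not in supported_modes:
                continue                          # unsupported: the attribute is dropped
            kept.append(param.strip())
        return ";".join(kept)

    offer_line = "a=fmtp:100 3d-rendering-mode=p2p telephony;other-param=1"  # other-param is a placeholder
    prefix, params = offer_line.split(" ", 1)
    answer_line = prefix + " " + answer_fmtp(params, {"p2p telephony"})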
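The receiver-transparent mode mentioned in the list above can be sketched as a sender-side rewrite of the capture extrinsics: for a sender-locked experience the sender replaces the per-frame capture pose with the pose of the first frame before transmission, so the receiver renders normally without knowing the rendering mode. The (position, rotation) representation used below is an assumption for illustration.

    # Sketch of the receiver-transparent, sender-locked adjustment of extrinsics.
    def lock_extrinsics_to_first_frame(extrinsics_per_frame):
        if not extrinsics_per_frame:
            return []
        locked_pose = extrinsics_per_frame[0]
        return [locked_pose for _ in extrinsics_per_frame]

    frames = [((0.0, 0.0, 0.0), (0.0, 0.0, 0.0)),      # (position xyz, rotation yaw/pitch/roll)
              ((0.5, 0.0, 0.1), (2.0, 0.0, 0.0))]      # the capture rig drifts between frames
    locked = lock_extrinsics_to_first_frame(frames)    # every frame now carries the initial pose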
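The semicolon-separated tag list of the 3d-rendering-mode example above could be interpreted by a receiver roughly as sketched below. The tag names come from the example in the list; the mapping from tag suffixes to renderer behaviour is an assumption for illustration.

    # Sketch of parsing the semicolon-separated tag list.
    def parse_rendering_tags(value):
        tags = [t.strip() for t in value.split(";") if t.strip()]
        return {
            "application": tags[0] if tags else None,
            "ignore": {t[:-len("-unint")] for t in tags if t.endswith("-unint")},
            "apply": {t[:-len("-int")] for t in tags if t.endswith("-int")},
        }

    info = parse_rendering_tags("p2p telephony;translation-unint;roll-unint;yaw-int;pitch-int")
    # info["ignore"] == {"translation", "roll"}; info["apply"] == {"yaw", "pitch"}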
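The "v3c-sei" parameter value described above may be assembled as sketched below: each SEI NAL unit of the V3C common atlas sub-bitstream is base64 encoded as defined in RFC 4648 and the encoded units are joined into a comma-separated list. The byte strings are placeholders rather than real SEI NAL units, and the surrounding fmtp layout is an assumption.

    import base64

    # Sketch of assembling the "v3c-sei" parameter value.
    def build_v3c_sei_value(sei_nal_units):
        return ",".join(base64.b64encode(nal).decode("ascii") for nal in sei_nal_units)

    value = build_v3c_sei_value([b"\x00\x01placeholder-sei-1", b"\x00\x01placeholder-sei-2"])
    sdp_line = "a=fmtp:100 v3c-sei=" + value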
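Finally, the RTCP APP packet carrying a recommended rendering mode, referred to in the list above, could look as sketched below. The four-character name "V3CR", the subtype value and the one-word payload are hypothetical; only the APP packet layout itself follows the structure described later in this document (and RFC 3550).

    import struct

    # Sketch of an RTCP APP packet carrying a recommended rendering mode.
    def build_rtcp_app(ssrc, subtype, name, app_data):
        assert len(name) == 4 and len(app_data) % 4 == 0
        length_words = 2 + len(app_data) // 4         # packet length in 32-bit words minus one
        first_byte = (2 << 6) | (subtype & 0x1F)      # version=2, padding=0, 5-bit subtype
        header = struct.pack("!BBHI4s", first_byte, 204, length_words, ssrc, name)
        return header + app_data                      # 204 is the APP packet type

    packet = build_rtcp_app(ssrc=0x1234ABCD, subtype=0, name=b"V3CR",
                            app_data=struct.pack("!I", 1))   # 1 = sender-locked (assumed value)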

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method comprising: capturing volumetric video content by a camera capturing system; determining one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and transmitting the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.

Description

AN APPARATUS, A METHOD AND A COMPUTER PROGRAM FOR VOLUMETRIC VIDEO TECHNICAL FIELD The present invention relates to an apparatus, a method and a computer program for volumetric video coding. BACKGROUND Visual volumetric video-based coding (V3C; defined in ISO/IEC DIS 23090-5) provides a generic syntax and mechanism for volumetric video coding. The generic syntax can be used by applications targeting volumetric content, such as point clouds, immersive video with depth, and mesh representations of volumetric frames. The purpose of the specification is to define how to decode and interpret the associated data (atlas data in ISO/IEC 23090-5) which tells a renderer how to interpret two-dimensional (2D) frames for reconstructing volumetric frames. Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video may be captured using various configurations of 3D camera capturing systems. The relative position of the cameras with respect to other cameras in a capturing system may be fixed. On the other hand, the position of the cameras in the global coordinate system may change during the capturing process. However, the apparatus capturing and creating the volumetric video content has no means to indicate to an apparatus receiving and rendering the volumetric video content, how the captured scene is intended to be rendered in terms of the changes of the position of the cameras during the capturing process; i.e. whether said changes should be reflected upon rendering and if they are intentional. As a result, the rendered volumetric video content may appear significantly different as intended by the content creator. SUMMARY Now, an improved method and technical equipment implementing the method has been invented, by which the above problems are alleviated. Various aspects include a method, an apparatus and a computer readable medium comprising a computer program, or a signal stored therein, which are characterized by what is stated in the independent claims. Various details of the embodiments are disclosed in the dependent claims and in the corresponding images and description. The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention. According to a first aspect, there is provided an apparatus comprising means for capturing volumetric video content by a camera capturing system; means for determining one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and means for transmitting the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver. According to an embodiment, said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein a value of said first parameter is configured to control the rendering process whether to take into account said movements of the camera capturing system. 
According to an embodiment, said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional. According to an embodiment, the apparatus comprises means for carrying out a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification. According to an embodiment, said SEI message comprises a rendering mode parameter capable of having a plurality of values, wherein a value of at least one of said one or more parameters are indicated by a rendering mode parameter value. According to an embodiment, the apparatus comprises means for carrying out a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file. According to an embodiment, the apparatus comprises means for setting a value of at least one parameter to indicate that the one or more parameters describing the movements of the camera capturing system have been adjusted according to a predetermined value of the rendering mode parameter. According to an embodiment, the apparatus comprises means for carrying out a signalling of said one or more parameters by a uniquely identified Real-Time Transport Control Protocol (RTCP) application packet. A method according to a second aspect comprises: capturing volumetric video content by a camera capturing system; determining one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and transmitting the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver. An apparatus according to a third aspect comprises: means for receiving volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and means for controlling a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing. According to an embodiment, said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein the apparatus comprises means for controlling the rendering process whether to take into account said movements of the camera capturing system based on a value of said first parameter. According to an embodiment, said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional, wherein the apparatus comprises means for controlling the rendering process to ignore the unintentional movements of the camera capturing system based on a value of said second parameter. According to an embodiment, the apparatus comprises means for receiving a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification. According to an embodiment, the apparatus comprises means for receiving a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file. 
A method according to a fourth aspect comprises: receiving volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and controlling a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing. Computer readable storage media according to further aspects comprise code for use by an apparatus, which when executed by a processor, causes the apparatus to perform the above methods. BRIEF DESCRIPTION OF THE DRAWINGS For a more complete understanding of the example embodiments, reference is now made to the following descriptions taken in connection with the accompanying drawings in which: Figs.1a and 1b show an encoder and decoder for encoding and decoding 2D pictures; Figs.2a and 2b show a compression and a decompression process for 3D volumetric video; Fig.3 shows an example of a V3C bitstream; Fig.4 shows a structure of extension header data of an RTP packet header; Figs.5a – 5c shows an example of rendering options for a scene captured with camera system movements involved; Figs.6a – 6c shows an example of rendering options for a scene captured with intentional/unintentional camera system movements involved; Fig.7 shows a flow chart for an encoding or a content creating method according to an embodiment; and Fig.8 shows a flow chart for a decoding or a rendering method according to an embodiment. DETAILED DESCRIPTON OF SOME EXAMPLE EMBODIMENTS A video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can un- compress the compressed video representation back into a viewable form. An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at lower bitrate). Volumetric video may be captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world. Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames. Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example. Volumetric video data represents a three-dimensional scene or object, and thus such data can be viewed from any viewpoint. Volumetric video data can be used as an input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications. 
Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g. color, opacity, reflectance, …), together with any possible temporal changes of the geometry and attributes at given time instances (e.g. frames in 2D video). Volumetric video is either generated from 3D models, i.e. computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera, a laser scan, a combination of video and dedicated depths sensors, etc. Also, a combination of CGI and real-world data is possible. Examples of representation formats for such volumetric data are triangle meshes, point clouds, or voxel. Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time. Increasing computational resources and advances in 3D data acquisition devices has enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds on the other hand are well suited for applications, such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding this 3D data as a set of texture and depth map as is the case in the multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps, and multi-level surface maps. In 3D point clouds, each point of each 3D surface is described as a 3D point with color and/or other attribute information such as surface normal or material reflectance. Point cloud is a set of data points in a coordinate system, for example in a three- dimensional coordinate system being defined by X, Y, and Z coordinates. The points may represent an external surface of an object in the screen space, e.g. in a three-dimensional space. In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression of the presentations becomes fundamental. Standard volumetric video representation formats, such as point clouds, meshes, voxel, suffer from poor temporal compression performance. Identifying correspondences for motion-compensation in 3D-space is an ill-defined problem, as both, geometry and respective attributes may change. For example, temporal successive “frames” do not necessarily have the same number of meshes, points or voxel. Therefore, compression of dynamic 3D scenes is inefficient.2D-video based approaches for compressing volumetric data, i.e. multiview with depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities. Instead of the above-mentioned approach, a 3D scene, represented as meshes, points, and/or voxel, can be projected onto one, or more, geometries. These geometries may be “unfolded” or packed onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. 
Relevant projection geometry information may be transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format). Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency can be increased greatly. Using geometry-projections instead of 2D-video based approaches based on multiview and depth, provides a better coverage of the scene (or object). Thus, 6DOF capabilities are improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time compression/decompression of the projected planes. The projection and the reverse projection steps are of low complexity. Figs.1a and 1b show an encoder and decoder for encoding and decoding the 2D texture pictures, geometry pictures and/or auxiliary pictures. A video codec consists of an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. Typically, the encoder discards and/or loses some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate). An example of an encoding process is illustrated in Figure 1a. Figure 1a illustrates an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T-1); a quantization (Q) and inverse quantization (Q-1); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS) and filtering (F). An example of a decoding process is illustrated in Figure 1b. Figure 1b illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T-1); an inverse quantization (Q-1); an entropy decoding (E-1); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F). Many hybrid video encoders encode the video information in two phases. Firstly pixel values in a certain picture area (or “block”) are predicted for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g. Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, encoder can control the balance between the accuracy of the pixel representation (picture quality) and size of the resulting coded video representation (file size or transmission bitrate). 
Video codecs may also provide a transform skip mode, which the encoders may choose to use. In the transform skip mode, the prediction error is coded in a sample domain, for example by deriving a sample-wise difference value relative to certain adjacent samples and coding the sample-wise difference value with an entropy coder. Many video encoders partition a picture into blocks along a block grid. For example, in the High Efficiency Video Coding (HEVC) standard, the following partitioning and definitions are used. A coding block may be defined as an NxN block of samples for some value of N such that the division of a coding tree block into coding blocks is a partitioning. A coding tree block (CTB) may be defined as an NxN block of samples for some value of N such that the division of a component into coding tree blocks is a partitioning. A coding tree unit (CTU) may be defined as a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples. A coding unit (CU) may be defined as a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples. A CU with the maximum allowed size may be named as LCU (largest coding unit) or coding tree unit (CTU) and the video picture is divided into non-overlapping LCUs. In HEVC, a picture can be partitioned in tiles, which are rectangular and contain an integer number of LCUs. In HEVC, the partitioning to tiles forms a regular grid, where heights and widths of tiles differ from each other by one LCU at the maximum. In HEVC, a slice is defined to be an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit. In HEVC, a slice segment is defined to be an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL unit. The division of each picture into slice segments is a partitioning. In HEVC, an independent slice segment is defined to be a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment, and a dependent slice segment is defined to be a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order. In HEVC, a slice header is defined to be the slice segment header of the independent slice segment that is a current slice segment or is the independent slice segment that precedes a current dependent slice segment, and a slice segment header is defined to be a part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment. The CUs are scanned in the raster scan order of LCUs within tiles or within a picture, if tiles are not in use. Within an LCU, the CUs have a specific scan order. Entropy coding/decoding may be performed in many ways. 
For example, context-based coding/decoding may be applied, where in both the encoder and the decoder modify the context state of a coding parameter based on previously coded/decoded coding parameters. Context-based coding may for example be context adaptive binary arithmetic coding (CABAC) or context-adaptive variable length coding (CAVLC) or any similar entropy coding. Entropy coding/decoding may alternatively or additionally be performed using a variable length coding scheme, such as Huffman coding/decoding or Exp-Golomb coding/decoding. Decoding of coding parameters from an entropy-coded bitstream or codewords may be referred to as parsing. The phrase along the bitstream (e.g. indicating along the bitstream) may be defined to refer to out-of-band transmission, signaling, or storage in a manner that the out- of-band data is associated with the bitstream. The phrase decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream. For example, an indication along the bitstream may refer to metadata in a container file that encapsulates the bitstream. A first texture picture may be encoded into a bitstream, and the first texture picture may comprise a first projection of texture data of a first source volume of a scene model onto a first projection surface. The scene model may comprise a number of further source volumes. In the projection, data on the position of the originating geometry primitive may also be determined, and based on this determination, a geometry picture may be formed. This may happen for example so that depth data is determined for each or some of the texture pixels of the texture picture. Depth data is formed such that the distance from the originating geometry primitive such as a point to the projection surface is determined for the pixels. Such depth data may be represented as a depth picture, and similarly to the texture picture, such geometry picture (such as a depth picture) may be encoded and decoded with a video codec. This first geometry picture may be seen to represent a mapping of the first projection surface to the first source volume, and the decoder may use this information to determine the location of geometry primitives in the model to be reconstructed. In order to determine the position of the first source volume and/or the first projection surface and/or the first projection in the scene model, there may be first geometry information encoded into or along the bitstream. It is noted that encoding a geometry (or depth) picture into or along the bitstream with the texture picture is only optional and arbitrary for example in the cases where the distance of all texture pixels to the projection surface is the same or there is no change in said distance between a plurality of texture pictures. Thus, a geometry (or depth) picture may be encoded into or along the bitstream with the texture picture, for example, only when there is a change in the distance of texture pixels to the projection surface. An attribute picture may be defined as a picture that comprises additional information related to an associated texture picture. An attribute picture may for example comprise surface normal, opacity, or reflectance information for a texture picture. A geometry picture may be regarded as one type of an attribute picture, although a geometry picture may be treated as its own picture type, separate from an attribute picture. 
Texture picture(s) and the respective geometry picture(s), if any, and the respective attribute picture(s) may have the same or different chroma format. Terms texture (component) image and texture (component) picture may be used interchangeably. Terms geometry (component) image and geometry (component) picture may be used interchangeably. A specific type of a geometry image is a depth image. Embodiments described in relation to a geometry (component) image equally apply to a depth (component) image, and embodiments described in relation to a depth (component) image equally apply to a geometry (component) image. Terms attribute image and attribute picture may be used interchangeably. A geometry picture and/or an attribute picture may be treated as an auxiliary picture in video/image encoding and/or decoding. Figures 2a and 2b illustrate an overview of exemplified compression/ decompression processes. The processes may be applied, for example, in MPEG visual volumetric video-based coding (V3C), defined currently in ISO/IEC DIS 23090-5: “Visual Volumetric Video-based Coding and Video-based Point Cloud Compression”, 2nd Edition. V3C specification enables the encoding and decoding processes of a variety of volumetric media by using video and image coding technologies. This is achieved through first a conversion of such media from their corresponding 3D representation to multiple 2D representations, also referred to as V3C components, before coding such information. Such representations may include occupancy, geometry, and attribute components. The occupancy component can inform a V3C decoding and/or rendering system of which samples in the 2D components are associated with data in the final 3D representation. The geometry component contains information about the precise location of 3D data in space, while attribute components can provide additional properties, e.g. texture or material information, of such 3D data. An example of volumetric media conversion at an encoder is shown in Figure 2a and an example of a 3D reconstruction at a decoder is shown in Figure 2b. There are alternatives to capture and represent a volumetric frame. The format used to capture and represent the volumetric frame depends on the process to be performed on it, and the target application using the volumetric frame. As a first example a volumetric frame can be represented as a point cloud. A point cloud is a set of unstructured points in 3D space, where each point is characterized by its position in a 3D coordinate system (e.g., Euclidean), and some corresponding attributes (e.g., color information provided as RGBA value, or normal vectors). As a second example, a volumetric frame can be represented as images, with or without depth, captured from multiple viewpoints in 3D space. In other words, the volumetric video can be represented by one or more view frames (where a view is a projection of a volumetric scene on to a plane (the camera plane) using a real or virtual camera with known/computed extrinsic and intrinsic). Each view may be represented by a number of components (e.g., geometry, color, transparency, and occupancy picture), which may be part of the geometry picture or represented separately. As a third example, a volumetric frame can be represented as a mesh. Mesh is a collection of points, called vertices, and connectivity information between vertices, called edges. Vertices along with edges form faces. The combination of vertices, edges and faces can uniquely approximate shapes of objects. 
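As a minimal illustration of the three representation options described above (point cloud, multi-view images with depth, and mesh), the following Python sketch models them with simple containers; the particular fields, such as RGBA colour values or a six-component camera pose, are assumptions for illustration rather than definitions from any specification.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    # Minimal sketches of the three volumetric-frame representations described above.
    @dataclass
    class PointCloud:                                   # unstructured 3D points with attributes
        positions: List[Tuple[float, float, float]] = field(default_factory=list)
        colors: List[Tuple[int, int, int, int]] = field(default_factory=list)   # e.g. RGBA

    @dataclass
    class View:                                         # projection of the scene onto a camera plane
        texture: bytes = b""
        depth: bytes = b""
        extrinsics: Tuple[float, ...] = (0.0,) * 6      # camera position and rotation
        intrinsics: Tuple[float, ...] = (0.0,) * 4      # e.g. focal lengths and principal point

    @dataclass
    class Mesh:                                         # vertices plus connectivity (faces)
        vertices: List[Tuple[float, float, float]] = field(default_factory=list)
        faces: List[Tuple[int, int, int]] = field(default_factory=list)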
Depending on the capture, a volumetric frame can provide viewers the ability to navigate a scene with six degrees of freedom, i.e., both translational and rotational movement of their viewing pose (which includes yaw, pitch, and roll). The data to be coded for a volumetric frame can also be significant, as a volumetric frame can contain many numbers of objects, and the positioning and movement of these objects in the scene can result in many dis-occluded regions. Furthermore, the interaction of the light and materials in objects and surfaces in a volumetric frame can generate complex light fields that can produce texture variations for even a slight change of pose. A sequence of volumetric frames is a volumetric video. Due to large amount of information, storage and transmission of a volumetric video requires compression. A way to compress a volumetric frame can be to project the 3D geometry and related attributes into a collection of 2D images along with additional associated metadata. The projected 2D images can then be coded using 2D video and image coding technologies, for example ISO/IEC 14496-10 (H.264/AVC) and ISO/IEC 23008-2 (H.265/HEVC). The metadata can be coded with technologies specified in specification such as ISO/IEC 23090-5. The coded images and the associated metadata can be stored or transmitted to a client that can decode and render the 3D volumetric frame. In the following, a short reference of ISO/IEC 23090-5 Visual Volumetric Video-based Coding (V3C) and Video-based Point Cloud Compression (V-PCC) 2nd Edition is given. ISO/IEC 23090-5 specifies the syntax, semantics, and process for coding volumetric video. The specified syntax is designed to be generic, so that it can be reused for a variety of applications. Point clouds, immersive video with depth, and mesh representations can all use ISO/IEC 23090-5 standard with extensions that deal with the specific nature of the final representation. The purpose of the specification is to define how to decode and interpret the associated data (for example atlas data in ISO/IEC 23090-5) which tells a renderer how to interpret 2D frames to reconstruct a volumetric frame. Two applications of V3C (ISO/IEC 23090-5) have been defined, V-PCC (ISO/IEC 23090-5) and MPEG Immersive Video (MIV) (ISO/IEC 23090-12). MIV and V- PCC use a number of V3C syntax elements with a slightly modified semantics. An example on how the generic syntax element can be differently interpreted by the application is pdu_projection_id. In case of V-PCC, the syntax element pdu_projection_id specifies the index of the projection plane for the patch. There can be 6 or 18 projection planes in V-PCC, and they are implicit, i.e., pre-determined. In case of MIV, pdu_projection_id corresponds to a view ID, i.e., identifies which view the patch originated from. View IDs and their related information is explicitly provided in MIV view parameters list and may be tailored for each content. MPEG 3DG (ISO SC29 WG7) group has started a work on a third application of V3C – the mesh compression. It is also envisaged that mesh coding will re-use V3C syntax as much as possible and can also slightly modify the semantics. To differentiate between applications of V3C bitstream that allow a client to properly interpret the decoded data, V3C uses the ptl_profile_toolset_idc parameter. V3C bitstream is a sequence of bits that forms the representation of coded volumetric frames and the associated data making one or more coded V3C sequences (CVS). 
V3C bitstream is composed of V3C units that contain V3C video sub-bitstreams, V3C atlas sub-bitstreams, or V3C Parameter Set (VPS). Fig. 3 illustrates an example of VC3 bitstream. Video sub-bitstream and atlas sub-bitstreams can be referred to as V3C sub-bitstreams. Each V3C unit has a V3C unit header and a V3C unit payload. A V3C unit header in conjunction with VPS information identify which V3C sub-bitstream a V3C unit contains and how to interpret it. V3C bitstream can be stored according to Annex C of ISO/IEC 23090-5, which specifies syntax and semantics of a sample stream format to be used by applications that deliver some or all of the V3C unit stream as an ordered stream of bytes or bits within which the locations of V3C unit boundaries need to be identifiable from patterns in the data. A Real Time Transfer Protocol (RTP) is intended for an end-to-end, real-time transfer or streaming media and provides facilities for jitter compensation and detection of packet loss and out-of-order delivery. RTP allows data transfer to multiple destinations through IP multicast or to a specific destination through IP unicast. The majority of the RTP implementations are built on the User Datagram Protocol (UDP). Other transport protocols may also be utilized. RTP can be used together with other protocols such as H.323 and Real Time Streaming Protocol RTSP. The RTP specification describes two protocols: RTP and RTCP. RTP is used for the transfer of multimedia data, and the RTCP is used to periodically send control information and QoS parameters. RTP sessions may be initiated between client and server using a signalling protocol, such as H.323, the Session Initiation Protocol (SIP), or RTSP. These protocols may use the Session Description Protocol (RFC 8866) to specify the parameters for the sessions. RTP is designed to carry a multitude of multimedia formats, which permits the development of new formats without revising the RTP standard. To this end, the information required by a specific application of the protocol is not included in the generic RTP header. For a class of applications (e.g., audio, video), an RTP profile may be defined. For a media format (e.g., a specific video coding format), an associated RTP payload format may be defined. Every instantiation of RTP in a particular application may require a profile and payload format specifications. The profile defines the codecs used to encode the payload data and their mapping to payload format codecs in the protocol field Payload Type (PT) of the RTP header. For example, RTP profile for audio and video conferences with minimal control is defined in RFC 3551. The profile defines a set of static payload type assignments, and a dynamic mechanism for mapping between a payload format, and a PT value using Session Description Protocol (SDP). The latter mechanism is used for newer video codec such as RTP payload format for H.264 Video defined in RFC 6184 or RTP Payload Format for High Efficiency Video Coding (HEVC) defined in RFC 7798. An RTP session is established for each multimedia stream. Audio and video streams may use separate RTP sessions, enabling a receiver to selectively receive components of a particular stream. The RTP specification recommends even port number for RTP, and the use of the next odd port number of the associated RTCP session. A single port can be used for RTP and RTCP in applications that multiplex the protocols. Each RTP stream consists of RTP packets, which in turn consist of RTP header and payload parts. 
RTP packets are created at the application layer and handed to the transport layer for delivery. Each unit of RTP media data created by an application begins with the RTP packet header. The RTP header has a minimum size of 12 bytes. After the header, optional header extensions may be present. This is followed by the RTP payload, the format of which is determined by the particular class of application. The fields in the header are as follows: Version: (2 bits) Indicates the version of the protocol. P (Padding): (1 bit) Used to indicate if there are extra padding bytes at the end of the RTP packet. X (Extension): (1 bit) Indicates the presence of an extension header between the header and payload data. The extension header is application or profile specific. CC (CSRC count): (4 bits) Contains the number of CSRC identifiers that follow the SSRC M (Marker): (1 bit) Signalling used at the application level in a profile-specific manner. If it is set, it means that the current data has some special relevance for the application. PT (Payload type): (7 bits) Indicates the format of the payload and thus determines its interpretation by the application. Sequence number: (16 bits) The sequence number is incremented for each RTP data packet sent and is to be used by the receiver to detect packet loss and to accommodate out-of-order delivery. Timestamp: (32 bits) Used by the receiver to play back the received samples at appropriate time and interval. When several media streams are present, the timestamps may be independent in each stream. The granularity of the timing is application specific. For example, video stream may use a 90 kHz clock. The clock granularity is one of the details that is specified in the RTP profile for an application. SSRC: (32 bits) Synchronization source identifier uniquely identifies the source of the stream. The synchronization sources within the same RTP session will be unique. CSRC: (32 bits each) Contributing source IDs enumerate contributing sources to a stream which has been generated from multiple sources. Header extension: (optional, presence indicated by Extension field) The first 32- bit word contains a profile-specific identifier (16 bits) and a length specifier (16 bits) that indicates the length of the extension in 32-bit units, excluding the 32 bits of the extension header. The extension header data is shown in Figure 4. The first two bytes of the payload of an RTP packet are commonly referred to as the payload header. The payload header consists of the same fields (F, NUT, NLI, and TID) as the NAL unit header as defined in 23090-5, irrespective of the type of the payload structure. For convenience, the structure of RTP payload header is described below: 0 1 0 1 23 4 5 67 8 90 1 2 34 5 6 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
[bit-layout diagram of the RTP payload header, showing the F, NUT, NLI and TID fields]
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ wherein: F: nal_forbidden_zero_bit as specified in ISO/IEC 23090-5 MUST be equal to 0. NUT: nal_unit_type as specified in ISO/IEC 23090-5 defines the type of the RBSP data structure contained in the NAL unit. NUT value can carry other meaning depending on the RTP packet type. NLI: nal_layer_id as specified in ISO/IEC 23090-5 defines the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies. TID: nal_temporal_id_plus1 minus 1 as specified in ISO/IEC 23090-5 defines a temporal identifier for the NAL unit. The value of nal_temporal_id_plus1 MUST NOT be equal to 0. The remainder of the data in RTP payload contains the information as defined by the RTP header. An Application-Defined RTCP Packet (APP), a.k.a. a RTCP APP packet is intended for experimental use at developing new applications and features. Once a new APP RTCP packet type is tested and found useful, it is registered with the Internet Assigned Numbers Authority (IANA) as an original packet type. The structure of the RTCP APP packet is as follows: 0 1 2 3 0 12 3 4 56 7 89 0 1 23 4 5 67 8 90 1 2 34 5 67 8 9 01 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+ | +-+-+-+-+ | +-+-+-+-+
[bit-layout diagram of the RTCP APP packet, showing the subtype (ST), name (N) and application-dependent data (A) fields]
… | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ wherein: - ST: The subtype field allows a set of APP packets to be defined under one unique name, or provides any application-dependent data. - N: A name chosen by the person defining the set of APP packets to be unique with respect to other APP packets this application might receive. - A: Application-dependent data may or may not appear in an APP packet. In this disclosure, the Session Description Protocol (SDP) is used as an example of a session specific file format. SDP is a format for describing multimedia communication sessions for the purposes of announcement and invitation. Its predominant use is in support of conversational and streaming media applications. SDP does not deliver any media streams itself, but it is used between endpoints for negotiation of network metrics, media types, and other associated properties. The set of properties and parameters is called a session profile. SDP is extensible for the support of new media types and formats. The Session Description Protocol describes a session as a group of fields in a text-based format, one field per line. The form of each field is as follows: <character>=<value><CR><LF> where <character> is a single case-sensitive character and <value> is structured text in a format that depends on the character. Values may be UTF-8 encoded. Whitespace is not allowed immediately to either side of the equal sign. Session descriptions consist of three sections: session, timing, and media descriptions. Each description may contain multiple timing and media descriptions. Names are only unique within the associated syntactic construct. Fields appear in the order, shown below; optional fields are marked with an asterisk: v= (protocol version number, currently only 0) o= (originator and session identifier: username, id, version number, network address) s= (session name: mandatory with at least one UTF-8- encoded character) i=* (session title or short information) u=* (URI of description) e=* (zero or more email address with optional name of contacts) p=* (zero or more phone number with optional name of contacts) c=* (connection information—not required if included in all media) b=* (zero or more bandwidth information lines) One or more time descriptions ("t=" and "r=" lines; see below) z=* (time zone adjustments) k=* (encryption key) a=* (zero or more session attribute lines) Zero or more Media descriptions (each one starting by an "m=" line; see below) Time description (mandatory): t= (time the session is active) r=* (zero or more repeat times) Media description (optional): m= (media name and transport address) i=* (media title or information field) c=* (connection information — optional if included at session level) b=* (zero or more bandwidth information lines) k=* (encryption key) a=* (zero or more media attribute lines — overriding the Session attribute lines) Below is a sample session description from RFC 4566. This session is originated by the user “jdoe” at IPv4 address 10.47.16.5. Its name is “SDP Seminar” and extended session information (“A Seminar on the session description protocol”) is included along with a link for additional information and an email address to contact the responsible party, Jane Doe. This session is specified to last two hours using NTP timestamps, with a connection address (which indicates the address clients must connect to or – when a multicast address is provided, as it is here – subscribe to) specified as IPv4224.2.17.12 with a TTL of 127. 
Recipients of this session description are instructed to only receive media. Two media descriptions are provided, both using RTP Audio Video Profile. The first is an audio stream on port 49170 using RTP/AVP payload type 0 (defined by RFC 3551 as PCMU), and the second is a video stream on port 51372 using RTP/AVP payload type 99 (defined as “dynamic”). Finally, an attribute is included which maps RTP/AVP payload type 99 to format h263-1998 with a 90 kHz clock rate. RTCP ports for the audio and video streams of 49171 and 51373, respectively, are implied. v=0 o=jdoe 28908445262890842807 IN IP410.47.16.5 s=SDP Seminar i=A Seminar on the session description protocol u=http://www.example.com/seminars/sdp.pdf e=j.doe@example.com (Jane Doe) c=IN IP4224.2.17.12/127 t=2873397496 2873404696 a=recvonly m=audio 49170 RTP/AVP 0 m=video 51372 RTP/AVP 99 a=rtpmap:99 h263-1998/90000 SDP uses attributes to extend the core protocol. Attributes can appear within the Session or Media sections and are scoped accordingly as session-level or media-level. New attributes can be added to the standard through registration with IANA. A media description may contain any number of “a=” lines (attribute-fields) that are media description specific. Session-level attributes convey additional information that applies to the session as a whole rather than to individual media descriptions. Attributes are either properties or values: a=<attribute-name> a=<attribute-name>:<attribute-value> Examples of attributes defined in RFC8866 are “rtpmap” and “fmpt”. “rtpmap” attribute maps from an RTP payload type number (as used in an "m=" line) to an encoding name denoting the payload format to be used. It also provides information on the clock rate and encoding parameters. Up to one "a=rtpmap:" attribute can be defined for each media format specified. This can be the following: m=audio 49230 RTP/AVP 969798 a=rtpmap:96 L8/8000 a=rtpmap:97 L16/8000 a=rtpmap:98 L16/11025/2 In the example above, the media types are “audio/L8” and “audio/L16”. Parameters added to an "a=rtpmap:" attribute may only be those required for a session directory to make the choice of appropriate media to participate in a session. Codec-specific parameters may be added in other attributes, for example, "fmtp". "fmtp" attribute allows parameters that are specific to a particular format to be conveyed in a way that SDP does not have to understand them. The format can be one of the formats specified for the media. Format-specific parameters, semicolon separated, may be any set of parameters required to be conveyed by SDP and given unchanged to the media tool that will use this format. At most one instance of this attribute is allowed for each format. An example is: a=fmtp:96 profile-level-id=42e016;max-mbps=108000;max- fs=3600 For example, RFC7798 defines the following sprop-vps, sprop-sps, sprop-pps, profile-space, profile-id, tier-flag, level-id, interop-constraints, profile-compatibility- indicator, sprop-sub-layer-id, recv-sub-layer-id, max-recv-level-id, tx-mode, max-lsr, max- lps, max-cpb, max-dpb, max-br, max-tr, max-tc, max-fps, sprop-max-don-diff, sprop- depack-buf-nalus, sprop-depack-buf-bytes, depack-buf-cap, sprop-segmentation-id, sprop- spatial-segmentation-idc, dec-parallel-cap, and include-dph. Volumetric video content may be captured by various configurations of camera capturing systems. The relative position of the cameras with respect to other cameras in a capturing system may be fixed. 
For example, the cameras may be mounted on a rigid stand, such as on a stand-alone multiview camera, e.g. the OZO camera, or the cameras may be mounted on a mobile device. On the other hand, the position of the cameras in the global coordinate system may change during the capturing process. The rigid stand of e.g. a multiview camera can be moved, or the individual cameras of the capturing system may move. Moreover, a user may move the device, such as a mobile device, comprising the camera capturing system while capturing a video. Parameters describing the movement of the camera system, i.e. movement of at least one camera of the system, may be referred to as camera extrinsic information, or in brief, camera extrinsics.
Let us suppose that volumetric video content, e.g. MIV, is recorded by a rigid system (i.e. the relative camera positions with respect to each other do not change) that is moving during the capture session. Therein, according to a first scenario, the movement of the system can be configured by the creator to achieve “paint the space” type of experiences, where the viewer sees the rendered scene in global coordinate space with the camera rig movement visible. The movement of the camera system and viewer position are controlled separately. In a second scenario, such as in a real-time augmented reality (AR) conversation, one or both sides of the end-points may be moving, i.e. the rigid camera system moves upon capturing, but also the viewer may move within the rendered scene. Therein, if the camera system movement from one end-point were to be translated to the other end-point, the other end-point would have to move to keep up with the moving first end-point.
An example of this scenario is depicted in Figures 5a – 5c. In Fig.5a, the user U1 captures, while moving forward, an extended reality (XR) scene comprising two visual objects A and B. Due to the movement, the user U1 moves closer to the object A and farther away from the object B during the capture. The captured scene is delivered to the user U2. Figures 5b and 5c illustrate two options for the user U2 to render the scene captured by the user U1. In Fig.5b, the scene is rendered according to the movement of the user U1, whereupon the object A moves closer to the user U2 and the object B moves farther away from the user U2 during the rendering. However, in some cases there may be a need to lock the captured scene at some temporal position, wherein the rendered visual scene is not intended to be changed. Fig.5c depicts such a situation, where the rendering is locked to the original position of the user U1, and the rendered scene, and thus the locked positions of the objects A and B, are kept stationary with respect to the user U2.
Thus, there are various options for how the user U2 may lock the scene captured by the user U1 to be rendered. For example, the user U2 may always view the space of the user U1 with respect to the initial position of the user U1. Alternatively, the user U2 may always view the space of the user U1 with respect to the current position of the user U1. Further, the user U2 may view the space of the user U1 such that the initial position of the user U1 is locked to the initial position of the user U2, i.e., the user U2 may roam around in the space. However, the content creator, such as the apparatus of the user U1, has no means to indicate to the renderer, such as the apparatus of the user U2, how the captured scene is intended to be rendered.
The rendering apparatus, which may be referred to as a renderer/client/viewer/receiver, does not know if it should keep the capture camera extrinsics fixed (as in the second scenario), or if it should support free translation of the capture system and allow the viewer to freely explore the content (as in the first scenario). In other words, the renderer is missing information on how to provide the final rendering to the end user, such as the user U2.
A similar problem is encountered with movements of the capturing apparatus, such as a mobile device, wherein some of the movements of the capturing apparatus may be unintentional while others are intentional. For example, a translational motion of the handheld device upon walking is typically not considered intentional, but the rotation around the client's face to change the angle of recording typically is intentional. The problem is illustrated in an example shown in Figures 6a – 6c. In Fig.6a, the user U1 captures, using a handheld device, an extended reality (XR) scene comprising two visual objects A and B and the user's U1 face, by a user movement, which is depicted by the rounded arrow. The captured scene is delivered to the user U2. Ideally, if the user movement capturing the user's U1 face is unintentional, e.g. caused by an unstable device, the capturing device, such as the apparatus of the user U1, could compensate for the unintended movement so that the movement is not reflected upon rendering by the rendering device, such as the apparatus of the user U2, as shown in Fig. 6b. On the other hand, if the user movement capturing the user's U1 face is intentional, the captured scene, in accordance with the capturing device movement, is delivered to the user U2 and the movement is reflected upon rendering, as shown in Fig.6c. Again, the user U1 has no means to determine, upon capturing, which movements are unintentional and should be neglected, and which movements are intentional and should be delivered to the user U2 to be reflected upon rendering. Accordingly, there is a need to be able to differentiate between intended and unintended movements.
In the following, an enhanced method for providing the renderer with sufficient information for carrying out the rendering as intended will be described in more detail, in accordance with various embodiments. The method, which is disclosed in Figure 7 and illustrates the operation of a sender, comprises capturing (700) volumetric video content by a camera capturing system; determining (702) one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and transmitting (704) the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.
Thus, the method enables the content creator, such as the capturing device or its user, to determine whether at least some of the movements of the camera capturing system upon capturing said volumetric video content should be taken into account in the rendering process, thereby providing signaling, which allows the renderer to properly interpret the provided camera extrinsic information and render the content as intended. The capturing device may comprise means, such as an application, for determining said one or more parameters describing the movements of the camera capturing system. The application may comprise default settings regarding the one or more parameters describing the movements of the camera capturing system, i.e.
the camera extrinsic information, but the settings may be user-adjustable. It is noted that the sender, as used herein, may refer to an apparatus comprising a camera capturing system, such as a mobile device or any other apparatus provided with a multi-view camera and transmission means. The sender may also refer to an apparatus configured to obtain the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system from an external device, such as from a stand-alone multiview camera or a data storage, and configured to transmit said data to a receiver. Thus, the sender may be a server in client-server architecture, for example.
According to an embodiment, said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein a value of said first parameter is configured to control the rendering process whether to take into account said movements of the camera capturing system. Hence, the application may comprise default settings or it may provide the user with the possibility to define, by the value of the first parameter, whether the changes of the mobile device movement (i.e. camera extrinsic) information are provided to the volumetric video encoder (e.g. MIV encoder) during the capture. The value of the first parameter may be dependent on the capture type; i.e. whether to allow the viewer to freely explore the content or to lock the captured scene at some temporal position.
According to an embodiment, said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional. Herein, the application may comprise default settings or it may provide the user with the possibility to define, by the value of the second parameter, which movements are considered intentional, and thus to be recorded in the camera extrinsic information. Consequently, unintentional movements shall be ignored. For example, if the user is walking during the capturing, the extrinsic information may be configured to control the renderer to ignore a translational motion of the handheld device.
Another aspect relates to the operation of a receiver. Figure 8 shows an example of a method comprising receiving (800) volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and controlling (802) a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing. Thus, the receiver determines, based on the received one or more parameters describing the movements of the camera capturing system during capturing, how the content creator has intended the rendering to be carried out, and then performs the rendering of the volumetric video content accordingly. It is noted that the receiver, as used herein, may refer to an apparatus comprising a rendering functionality, a.k.a. a renderer, which may also be referred to as a media player or viewer. The receiver may also be a client in client-server architecture.
According to an embodiment, the method as performed by the sender comprises carrying out a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
Consequently, the application may provide a rendering mode through an SEI message as described in Table 1 below.
Table 1. A dedicated value is preferably reserved for the rendering mode SEI message payload type that does not conflict with the existing payload types in ISO/IEC 23090-5: Annex F. In the example of Table 2 below, the reserved value of 19 is used.
Table 2. According to an embodiment, said SEI message comprises a rendering mode parameter capable of having a plurality of values, wherein a value of at least one of said one or more parameters is indicated by a rendering mode parameter value. Thus, several rendering modes can be defined. For example, rendering_mode equal to 1 may indicate to a renderer that a sender-locked rendering is recommended. This mode would be appropriate in the case of a real-time conversational scenario or an AR telephony call. Setting the rendering_mode to the sender-locked value would allow the sender to indicate to the receiver that the translational motion of the camera extrinsics should be ignored by the renderer, thus locking the receiver to the sender's position. For example, rendering_mode equal to 2 may indicate to a renderer that a viewer-locked rendering is recommended. This mode could be appropriate in the case of free-view consumption of content. Setting the rendering_mode to viewer-locked would allow the sender to signal to the receiver that it is acceptable to allow separate transformation of the receiver camera properties.
Additionally, said SEI message may comprise further parameters that indicate the camera extrinsic information in a more detailed manner. For example, the sender may also provide additional parameter(s) that inform which changes in the camera extrinsics could be intentional and which not; for example, a rotation along one axis could be intentional while a rotation along another axis is not. Furthermore, said SEI message may comprise parameters that correspond to sender- or viewer-locked camera intrinsics. Storing the rendering modes in an SEI message enables the SEI message to be included in a bitstream after the encoding of the volumetric video content is performed. Moreover, including the rendering modes in an SEI message does not cause misinterpretations in legacy decoders as the unknown SEI message is ignored by such a decoder.
According to an embodiment, the method comprises carrying out a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file. Herein, the application or a server may provide the recommended rendering mode to a receiver/client as an out-of-band parameter. For example, a new generic parameter, which may be referred to as, for example, 3d-rendering-mode, may be defined that can be used by any volumetric video. The syntax and semantics of the payload format parameter are defined based on the media type the parameter is used in combination with.
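By way of a purely illustrative sketch, and assuming that the parameter is conveyed as a format-specific parameter of a V3C media description (the port, the payload type 100, the encoding name v3c and the numeric value encoding are assumptions made for this illustration only, not normative syntax), the parameter could appear as follows:
m=video 40000 RTP/AVP 100
a=rtpmap:100 v3c/90000
a=fmtp:100 3d-rendering-mode=1
In this sketch, the value 1 would correspond to a sender-locked rendering recommendation and the value 2 to a viewer-locked rendering recommendation, in line with the rendering mode values described above.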
According to an embodiment, in case of V3C, the encoded value representation can be base64 encoded data, as defined in RFC4648, that contains the SEI message described above. According to an embodiment, the value of 3d-rendering-mode parameter may indicate an application type, possibly as an integer value or a textual tag. The application type indicates to the sender and/or the receiver the intended use of the rendered video. For example, a peer-to-peer AR telephony call can be indicated as ‘p2p telephony’ and would require a sender-locked mode of rendering. The application type can be set based on the current mode of use of the AR device. For example, if it is noticed that the caller is multitasking, the device may be configured to take the call in a limited fashion (for the sake of user experience or device capabilities) instead of rendering a viewer-locked scene. More advanced application types may also be defined, for example, taking into account rotation of the camera rig but not the translation, etc. According to an embodiment, the parameter 3d-rendering-mode may also be used in session negotiation, where one entity informs the other entity which rendering mode it supports and consequently expects to receive. If 3d-rendering-mode is included in an offer but not supported by the receiver, the attribute could be dropped. In such a case, the sender may deliver the camera intrinsics and extrinsics as usual and the receiver will apply those parameters for rendering. Alternatively, the sender may perform additional processing to prepare the content bitstream such that the intended experience is achieved upon rendering. The latter option may be carried out as a sender-side implementation without any signalling. This may require application-level logic. This may also be a useful feature, for example in a Multipoint Control Unit (MCU) or a Media Resource Function (MRF) or a Selective Forwarding Unit (SFU), to deliver appropriate bitstream for receivers which do not understand this attribute. According to an embodiment, the method comprises setting a value of at least one parameter to indicate that the one or more parameters describing the movements of the camera capturing system have been adjusted according to a predetermined value of the rendering mode parameter. Thus, the sender and receiver may agree on what may be referred to as a receiver-transparent rendering mode. In such a case, the sender transforms the capture apparatus position and rotation information parameters according to the sender-locked, viewer-locked or any other mode. In such a scenario, the receiver can perform rendering without the knowledge about the rendering mode. This can be a useful scenario to perform session negotiation, for example with the parameter 3d-rendering-mode, between the capturing apparatus (content contribution apparatus) and a middlebox, which generates the content according to the indicated receiver-transparent mode for delivering the content to a large number of V3C receivers that are not aware of 3d-rendering mode. According to an embodiment, the parameter 3d-rendering-mode may include a V-PCC specific tag, which informs a client that it is allowed to ignore the AAPS camera extrinsic information, or parts of it. According to an embodiment, the parameter 3d-rendering-mode may contain more than one tag separated by a semicolon. Each tag may be predefined to indicate to a client how to optimally render the content. 
In the example sketched below, the parameter 3d-rendering-mode comprises one tag 'p2p telephony', and the following tags indicate unintentional ('translation-unint', 'roll-unint') and intentional ('yaw-int', 'pitch-int') behaviour.
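As a purely illustrative sketch (the attribute-style placement, port, payload type and encoding name are assumptions for this illustration, not normative syntax), such a tag list could be carried as follows:
m=video 40000 RTP/AVP 100
a=rtpmap:100 v3c/90000
a=3d-rendering-mode:p2p telephony;translation-unint;roll-unint;yaw-int;pitch-int
Here the first tag identifies the application type, while the remaining tags mark the translational motion and the roll as unintentional and the yaw and pitch rotations as intentional, as described above.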
According to an embodiment, a server may signal the rendering mode using a “v3c-sei” parameter, as described below.
Said parameter may be used to convey SEI NAL units of the V3C common atlas sub-bitstream for out-of-band transmission. The value of the parameter is a comma-separated list of encoded representations of the SEI NAL units (i.e. NAL_PREFIX_NSEI, NAL_SUFFIX_NSEI, NAL_PREFIX_ESEI and NAL_SUFFIX_ESEI) as specified in Section 8.4.5.2 of ISO/IEC 23090-5. The encoded representation can for example be base64 encoded data, as defined in RFC4648. An example is given below.
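As a purely illustrative sketch, and assuming that the v3c-sei parameter is carried as a format-specific parameter of a V3C media description (the port, payload type and encoding name are assumptions for this illustration, and the angle-bracketed placeholders stand in for actual base64-encoded SEI NAL units), its use could look as follows:
m=video 40000 RTP/AVP 100
a=rtpmap:100 v3c/90000
a=fmtp:100 v3c-sei=<base64 NAL_PREFIX_ESEI>,<base64 NAL_SUFFIX_ESEI>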
According to an embodiment, the method comprises carrying out a signalling of said one or more parameters by a uniquely identified Real-Time Transport Control Protocol (RTCP) application packet. Thus, an application or a server may provide a recommended rendering mode and related parameters to a client through an RTCP APP packet, which can be identified by a unique name value, e.g. '3drm'.
The embodiments relating to the encoding aspects may be implemented in an apparatus comprising: means for capturing volumetric video content by a camera capturing system; means for determining one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and means for transmitting the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.
According to an embodiment, said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein a value of said first parameter is configured to control the rendering process whether to take into account said movements of the camera capturing system.
According to an embodiment, said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional.
According to an embodiment, the apparatus comprises means for carrying out a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
According to an embodiment, said SEI message comprises a rendering mode parameter capable of having a plurality of values, wherein a value of at least one of said one or more parameters is indicated by a rendering mode parameter value.
According to an embodiment, the apparatus comprises means for carrying out a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
According to an embodiment, the apparatus comprises means for setting a value of at least one parameter to indicate that the one or more parameters describing the movements of the camera capturing system have been adjusted according to a predetermined value of the rendering mode parameter.
According to an embodiment, the apparatus comprises means for carrying out a signalling of said one or more parameters by a uniquely identified Real-Time Transport Control Protocol (RTCP) application packet.
The embodiments relating to the encoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: capture volumetric video content by a camera capturing system; determine one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and transmit the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.
According to an embodiment, said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein a value of said first parameter is configured to control the rendering process whether to take into account said movements of the camera capturing system. According to an embodiment, said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional. According to an embodiment, the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to carry out a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification. According to an embodiment, said SEI message comprises a rendering mode parameter capable of having a plurality of values, wherein a value of at least one of said one or more parameters are indicated by a rendering mode parameter value. According to an embodiment, the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to carry out a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file. According to an embodiment, the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to set a value of at least one parameter to indicate that the one or more parameters describing the movements of the camera capturing system have been adjusted according to a predetermined value of the rendering mode parameter. According to an embodiment, the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to carry out a signalling of said one or more parameters by a uniquely identified Real-Time Transport Control Protocol (RTCP) application packet. The embodiments relating to the decoding aspects may be implemented in an apparatus comprising means for receiving volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and means for controlling a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing. According to an embodiment, said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein the apparatus comprises means for controlling the rendering process whether to take into account said movements of the camera capturing system based on a value of said first parameter. According to an embodiment, said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional, wherein the apparatus comprises means for controlling the rendering process to ignore the unintentional movements of the camera capturing system based on a value of said second parameter. According to an embodiment, the apparatus comprises means for receiving a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification. 
According to an embodiment, the apparatus comprises means for receiving a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file. The embodiments relating to the decoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and control a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing. According to an embodiment, said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to control the rendering process whether to take into account said movements of the camera capturing system based on a value of said first parameter. According to an embodiment, said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional, wherein the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to control the rendering process to ignore the unintentional movements of the camera capturing system based on a value of said second parameter. According to an embodiment, the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to receive a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification. According to an embodiment, the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to receive a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file. Such apparatuses may comprise e.g. the functional units disclosed in any of the Figures 1a, 1b, 2a and 2b for implementing the embodiments. In the above, some embodiments have been described with reference to encoding. It needs to be understood that said encoding may comprise one or more of the following: encoding source image data into a bitstream, encapsulating the encoded bitstream in a container file and/or in packet(s) or stream(s) of a communication protocol, and announcing or describing the bitstream in a content description, such as the Media Presentation Description (MPD) of ISO/IEC 23009-1 (known as MPEG-DASH) or the IETF Session Description Protocol (SDP). Similarly, some embodiments have been described with reference to decoding. It needs to be understood that said decoding may comprise one or more of the following: decoding image data from a bitstream, decapsulating the bitstream from a container file and/or from packet(s) or stream(s) of a communication protocol, and parsing a content description of the bitstream. 
In the above, where the example embodiments have been described with reference to an encoder or an encoding method, it needs to be understood that the resulting bitstream and the decoder or the decoding method may have corresponding elements in them. Likewise, where the example embodiments have been described with reference to a decoder, it needs to be understood that the encoder may have structure and/or computer program for generating the bitstream to be decoded by the decoder.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate. Programs, such as those provided by Synopsys, Inc. of Mountain View, California, and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well-established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended examples. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention.

Claims

CLAIMS 1. An apparatus comprising: means for capturing volumetric video content by a camera capturing system; means for determining one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and means for transmitting the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.
2. The apparatus according to claim 1, wherein said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein a value of said first parameter is configured to control the rendering process whether to take into account said movements of the camera capturing system.
3. The apparatus according to claim 1 or 2, wherein said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional.
4. The apparatus according to any preceding claim, comprising means for carrying out a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
5. The apparatus according to claim 4, wherein said SEI message comprises a rendering mode parameter capable of having a plurality of values, wherein a value of at least one of said one or more parameters is indicated by a rendering mode parameter value.
6. The apparatus according to any preceding claim, comprising means for carrying out a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
7. The apparatus according to claim 6, comprising means for setting a value of at least one parameter to indicate that the one or more parameters describing the movements of the camera capturing system have been adjusted according to a predetermined value of the rendering mode parameter.
8. The apparatus according to any of claims 1 - 5, comprising means for carrying out a signalling of said one or more parameters by a uniquely identified Real-Time Transport Control Protocol (RTCP) application packet.
9. A method comprising: capturing volumetric video content by a camera capturing system; determining one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and transmitting the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.
10. An apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: capture volumetric video content by a camera capturing system; determine one or more parameters describing movements of the camera capturing system to be stored upon capturing said volumetric video content; and transmit the captured volumetric video content and said one or more parameters describing the movements of the camera capturing system to a receiver.
11. The apparatus according to claim 10, wherein said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein a value of said first parameter is configured to control the rendering process whether to take into account said movements of the camera capturing system.
12. The apparatus according to claim 10 or 11, wherein said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional.
13. The apparatus according to any of claims 10 - 12, comprising computer program code configured to, with the at least one processor, cause the apparatus at least to carry out a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
14. The apparatus according to claim 13, wherein said SEI message comprises a rendering mode parameter capable of having a plurality of values, wherein a value of at least one of said one or more parameters is indicated by a rendering mode parameter value.
15. The apparatus according to any of claims 10 - 14, comprising computer program code configured to, with the at least one processor, cause the apparatus at least to carry out a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
16. The apparatus according to claim 15, comprising computer program code configured to, with the at least one processor, cause the apparatus at least to set a value of at least one parameter to indicate that the one or more parameters describing the movements of the camera capturing system have been adjusted according to a predetermined value of the rendering mode parameter.
17. The apparatus according to any of claims 10 - 14, comprising computer program code configured to, with the at least one processor, cause the apparatus at least to carry out a signalling of said one or more parameters by a uniquely identified Real-Time Transport Control Protocol (RTCP) application packet.
18. An apparatus comprising: means for receiving volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and means for controlling a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing.
19. The apparatus according to claim 18, wherein said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein the apparatus comprises means for controlling the rendering process whether to take into account said movements of the camera capturing system based on a value of said first parameter.
20. The apparatus according to claim 18 or 19, wherein said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional, wherein the apparatus comprises means for controlling the rendering process to ignore the unintentional movements of the camera capturing system based on a value of said second parameter.
21. The apparatus according to any of claims 18 - 20, comprising means for receiving a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
22. The apparatus according to any of claims 18 - 21, comprising means for receiving a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
23. A method comprising: receiving volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and controlling a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing.
24. An apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive volumetric video content captured by a camera capturing system and one or more parameters describing the movements of the camera capturing system during capturing; and control a rendering process of the volumetric video content according to the one or more parameters describing the movements of the camera capturing system during the capturing.
25. The apparatus according to claim 24, wherein said one or more parameters comprise at least a first parameter for controlling a rendering process of the volumetric video content, wherein the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to control the rendering process whether to take into account said movements of the camera capturing system based on a value of said first parameter.
26. The apparatus according to claim 24 or 25, wherein said one or more parameters comprise at least a second parameter describing whether said movements of the camera capturing system are intentional or unintentional, wherein the apparatus comprises computer program code configured to, with the at least one processor, cause the apparatus at least to control the rendering process to ignore the unintentional movements of the camera capturing system based on a value of said second parameter.
27. The apparatus according to any of claims 24 - 26, comprising computer program code configured to, with the at least one processor, cause the apparatus at least to receive a signalling of said one or more parameters in a supplemental enhancement information (SEI) message according to MPEG Immersive Video (MIV) specification.
28. The apparatus according to any of claims 24 - 27, comprising computer program code configured to, with the at least one processor, cause the apparatus at least to receive a signalling of said one or more parameters by at least one syntax element included in media level of a Session Description Protocol (SDP) file.
PCT/FI2023/050534 2022-09-27 2023-09-20 An apparatus and a method for processing volumetric video content WO2024069045A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20225844 2022-09-27
FI20225844 2022-09-27

Publications (2)

Publication Number Publication Date
WO2024069045A1 true WO2024069045A1 (en) 2024-04-04
WO2024069045A9 WO2024069045A9 (en) 2024-08-29

Family

ID=88197318

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2023/050534 WO2024069045A1 (en) 2022-09-27 2023-09-20 An apparatus and a method for processing volumetric video content

Country Status (1)

Country Link
WO (1) WO2024069045A1 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LUKASZ KONDRAD ET AL: "[MIV] Comments on WD4 23090-12", no. m53124, 15 April 2020 (2020-04-15), XP030286058, Retrieved from the Internet <URL:http://phenix.int-evry.fr/mpeg/doc_end_user/documents/130_Alpbach/wg11/m53124-v1-m53124.zip m53124 - comments on 23090-12.docx> [retrieved on 20200415] *
OZGUR OYMAN (INTEL) ET AL: "Signaling of Camera Information in V-PCC Carriage Format", no. m53044, 22 April 2020 (2020-04-22), XP030285839, Retrieved from the Internet <URL:http://phenix.int-evry.fr/mpeg/doc_end_user/documents/130_Alpbach/wg11/m53044-v4-m53044_v4_53044.zip m53044_Camera_Info_VPCC_Format.doc> [retrieved on 20200422] *

Also Published As

Publication number Publication date
WO2024069045A9 (en) 2024-08-29

Similar Documents

Publication Publication Date Title
AU2018301314B2 (en) Region-wise packing, content coverage, and signaling frame packing for media content
US20190104326A1 (en) Content source description for immersive media data
EP4084490A1 (en) An apparatus, a method and a computer program for video coding and decoding
CN114600468B (en) Combiner system, receiver device, computer-implemented method and computer-readable medium for combining video streams in a composite video stream with metadata
CN113287323A (en) Multi-decoder interface for streaming media data
US20220335978A1 (en) An apparatus, a method and a computer program for video coding and decoding
US11388427B2 (en) Multiple decoder interface for streamed media data
US10200414B2 (en) Signalling assymetric media capabilities
WO2023073283A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
US20220407899A1 (en) Real-time augmented reality communication session
WO2023062271A1 (en) A method, an apparatus and a computer program product for video coding
EP4284000A1 (en) An apparatus, a method and a computer program for volumetric video
WO2024069045A1 (en) An apparatus and a method for processing volumetric video content
US20240080477A1 (en) Method, An Apparatus and A Computer Program Product For Streaming Volumetric Video Content
US20230239453A1 (en) Method, an apparatus and a computer program product for spatial computing service session description for volumetric extended reality conversation
US20240275826A1 (en) Network rendering and transcoding of augmented reality data
EP4145832A1 (en) An apparatus, a method and a computer program for volumetric video
EP4300984A1 (en) A method, an apparatus and a computer program product for mapping media bitstream partitions in real-time streaming
WO2023099809A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2024194520A1 (en) A method, an apparatus and a computer program product for video coding
TWI846795B (en) Multiple decoder interface for streamed media data
US20240273833A1 (en) Signaling pose information to a split rendering server for augmented reality communication sessions
EP4383735A1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
WO2023175234A1 (en) A method, an apparatus and a computer program product for streaming volumetric video
US20230146498A1 (en) A Method, An Apparatus and a Computer Program Product for Video Encoding and Video Decoding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23777001

Country of ref document: EP

Kind code of ref document: A1