WO2021053261A1 - A method, an apparatus and a computer program product for video encoding and video decoding - Google Patents

A method, an apparatus and a computer program product for video encoding and video decoding Download PDF

Info

Publication number
WO2021053261A1
Authority
WO
WIPO (PCT)
Prior art keywords
patch
patches
video
encryption method
content
Prior art date
Application number
PCT/FI2020/050552
Other languages
French (fr)
Inventor
Payman Aflaki Beni
Sebastian Schwarz
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to EP20865714.8A priority Critical patent/EP4032314A4/en
Publication of WO2021053261A1 publication Critical patent/WO2021053261A1/en


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/76Television signal recording
    • H04N5/91Television signal processing therefor
    • H04N5/913Television signal processing therefor for scrambling ; for copy protection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/06Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
    • H04L9/065Encryption by serially and continuously modifying data stream elements, e.g. stream cipher systems, RC4, SEAL or A5/3
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L9/088Usage controlling of secret information, e.g. techniques for restricting cryptographic keys to pre-authorized uses, different access levels, validity of crypto-period, different key- or password length, or different strong and weak cryptographic algorithms
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167Position within a video image, e.g. region of interest [ROI]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46Embedding additional information in the video signal during the compression process
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/238Interfacing the downstream path of the transmission network, e.g. adapting the transmission rate of a video stream to network bandwidth; Processing of multiplex streams
    • H04N21/2389Multiplex stream processing, e.g. multiplex stream encrypting
    • H04N21/23895Multiplex stream processing, e.g. multiplex stream encrypting involving multiplex stream encryption
    • H04N21/23897Multiplex stream processing, e.g. multiplex stream encrypting involving multiplex stream encryption by partially encrypting, e.g. encrypting only the ending portion of a movie
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4405Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving video stream decryption
    • H04N21/44055Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving video stream decryption by partially decrypting, e.g. decrypting a video stream that has been partially encrypted
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/161Encoding, multiplexing or demultiplexing different image signal components
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/76Television signal recording
    • H04N5/91Television signal processing therefor
    • H04N5/913Television signal processing therefor for scrambling ; for copy protection
    • H04N2005/91357Television signal processing therefor for scrambling ; for copy protection by modifying the video signal
    • H04N2005/91364Television signal processing therefor for scrambling ; for copy protection by modifying the video signal the video signal being scrambled

Definitions

  • the present solution generally relates to video encoding and video decoding.
  • the solution relates to encoding and decoding of digital volumetric video.
  • new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions).
  • new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera.
  • the new capture and display paradigm, where the field of view is spherical is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.
  • For volumetric video, a scene may be captured using one or more 3D (three-dimensional) cameras. The cameras are in different positions and orientations within a scene.
  • One issue to consider is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (with or without transferring it from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel.
  • a method comprising receiving a video presentation frame, where the video presentation represents three-dimensional data; generating one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; determining which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch; encrypting the determined patches with an encryption method that corresponds to a pre-defined authorization level; and encoding, into a bitstream of a corresponding patch, information on the used encryption method.
  • an apparatus comprising means for receiving a video presentation frame, where the video presentation represents three-dimensional data; means for generating one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; means for determining which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch; means for encrypting the determined patches with an encryption method that corresponds to a pre-defined authorization level; and means for encoding, into a bitstream of a corresponding patch, information on the used encryption method.
  • a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive a video presentation frame, where the video presentation represents three-dimensional data; generate one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; determine which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch; encrypt the determined patches with an encryption method that corresponds to a pre-defined authorization level; and encode, into a bitstream of a corresponding patch, information on the used encryption method.
  • the information on the encryption method comprises the pre-defined authorization level and a type of the corresponding encryption method.
  • the encryption method comprises one or more different types of encryption methods.
  • the authorization level is defined based on one or more of the following criteria: privacy, age limit, content’s discretion, subject to a charge.
  • the information on the encryption method is signaled as a supplemental enhancement information (SEI) message.
  • a method for decoding comprising receiving an encoded bitstream; for each patch of the encoded bitstream, decoding a patch from the encoded bitstream; for every patch that is determined to be an encrypted patch, decoding from the bitstream information on a used encryption method for the patch; decrypting the patch with a decryption method that corresponds to the used encryption method; and generating a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
  • an apparatus comprising means for receiving an encoded bitstream; for each patch of the encoded bitstream, means for decoding a patch from the encoded bitstream; for every patch that is determined to be an encrypted patch, means for decoding from the bitstream information on a used encryption method for the patch; means for decrypting the patch with a decryption method that corresponds to the used encryption method; and means for generating a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
  • a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive an encoded bitstream; for each patch of the encoded bitstream, decode a patch from the encoded bitstream; for every patch that is determined to be an encrypted patch, decode from the bitstream information on a used encryption method for the patch; decrypt the patch with a decryption method that corresponds to the used encryption method; and generate a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
  • the used encryption method comprises information on an authorization level.
  • the authorization level defines a type of an encryption method.
  • the authorization level is defined based on one or more of the following criteria: privacy, age limit, content’s discretion, subject to a charge.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a video presentation frame, where the video presentation represents three-dimensional data; generate one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; determine which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch; encrypt the determined patches with an encryption method that corresponds to a pre-defined authorization level; and encode, into a bitstream of a corresponding patch, information on the used encryption method.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive an encoded bitstream; for each patch of the encoded bitstream, decode a patch from the encoded bitstream; for every patch that is determined to be an encrypted patch, decode from the bitstream information on a used encryption method for the patch; decrypt the patch with a decryption method that corresponds to the used encryption method; and generate a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
  • Fig. 1 shows an example of a compression process
  • Fig. 2 shows an example of a layer projection structure
  • Fig. 3 shows an example of a decompression process
  • Fig. 4 shows a high-level flowchart of an encoding process
  • Fig. 5 shows a high-level flowchart of a decoding process
  • Fig. 6 is a flowchart illustrating a method according to an embodiment
  • Fig. 7 is a flowchart illustrating a method according to another embodiment
  • Fig. 8 shows a system according to an embodiment
  • Fig. 9 shows an encoding process according to an embodiment
  • Fig. 10 shows a decoding process according to an embodiment.
  • V-PCC: MPEG Video-Based Point Cloud Compression
  • Volumetric video may be captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional two-dimensional/three-dimensional (2D/3D) video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.
  • Volumetric video enables the viewer to move in six degrees of freedom (DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane.
  • Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a 2D plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames.
  • Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstructing techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR, for example.
  • Volumetric video data represents a three-dimensional scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality) and MR (Mixed Reality) applications.
  • Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g. colour, opacity, reflectance, ...) plus any possible temporal changes of the geometry and attributes at given time instances (like frames in 2D video).
  • Volumetric video is either generated from 3D models, i.e. CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera, laser scan, a combination of video and dedicated depth sensors, etc.
  • Typical representation formats for such volumetric data are triangle meshes, point clouds, or voxels.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. a position of an object as a function of time.
  • Since volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF (Six Degrees of Freedom) viewing capabilities.
  • 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes.
  • Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data.
  • Representation of the 3D data depends on how the 3D data is used.
  • Dense Voxel arrays have been used to represent volumetric medical data.
  • polygonal meshes are extensively used.
  • Point clouds on the other hand are well-suited for applications, such as capturing real-world 3D scenes, where the topology is not necessarily a 2D manifold.
  • Another way to represent 3D data is to code this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth.
  • Point cloud is a set of data points (i.e. locations) in a coordinate system, for example in a three-dimensional coordinate system being defined by X, Y, and Z coordinates.
  • the points may represent an external surface of an object in the screen space, e.g. in a 3D space.
  • a point may be associated with a vector of attributes.
  • a point cloud can be used to reconstruct an object or a scene as a composition of the points.
  • Point clouds can be captured by using multiple cameras and depth sensors.
  • a dynamic point cloud is a sequence of static point clouds, wherein each static point cloud is in its own “point cloud frame”.
  • the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression is needed.
  • Standard volumetric video representation formats such as point clouds, meshes and voxels do not have sufficient temporal compression performance. Identifying correspondences for motion compensation in 3D space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview and depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.
  • a 3D scene represented as meshes, points and/or voxel
  • a 3D scene can be projected onto one or more geometries. These geometries are “unfolded” onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information is transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).
  • Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression.
  • coding efficiency may be increased remarkably.
  • 6DOF capabilities are improved
  • Using several geometries for individual objects improves the coverage of the scene further.
  • existing video encoding hardware can be utilized for real-time compression/decompression of the projected planes.
  • the projection and reverse projection steps are of low complexity.
  • the input point cloud frame is processed in the following manner: first, the volumetric 3D data may be represented as a set of 3D projections in different components. At the separation stage, the image is decomposed into far and near components for geometry and corresponding attributes components; in addition, an occupancy map 2D image may be created to indicate parts of an image that shall be used.
  • the 2D projection is composed of independent patches based on geometry characteristics of the input point cloud frame. After the patches have been generated and 2D frames for video encoding have been created, the occupancy map, geometry information and the auxiliary information may be compressed. At the end of the process, the separate bitstreams are multiplexed into the output compressed binary file.
  • Figure 1 shows the encoding process in more detailed manner.
  • the process starts with an input frame representing a point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.
  • Each point cloud frame 101 represents a dataset of points within a 3D volumetric space that has unique coordinates and attributes.
  • the patch generation 102 process decomposes the point cloud frame 101 by converting 3D samples to 2D samples on a given projection plane using a strategy which provides the best compression.
  • patch generation 102 process aims at decomposing the point cloud frame 101 into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error.
  • a normal per each point is estimated.
  • the tangent plane and its corresponding normal are defined per each point, based on the point’s m nearest neighbors within a predefined search distance.
  • the barycenter c̄ of the m nearest neighbours p_i may be computed as c̄ = (1/m) Σ_{i=1..m} p_i.
  • the normal is estimated from the eigen decomposition of the neighbourhood covariance matrix Σ_{i=1..m} (p_i − c̄)(p_i − c̄)^T: the eigenvector associated with the smallest eigenvalue is taken as the point normal.
  • each point is associated with a corresponding plane of a point cloud bounding box.
  • Each plane is defined by a corresponding normal n_p with values (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (−1.0, 0.0, 0.0), (0.0, −1.0, 0.0) and (0.0, 0.0, −1.0).
  • each point may be associated with the plane that has the closest normal (i.e. the plane whose normal n_p maximizes the dot product with the point normal).
  • the sign of the normal is defined depending on the point’s position in relationship to the “center”.
  • the initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors.
  • the final step of the patch generation 102 may comprise extracting patches by applying a connected component extraction procedure.
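  • The normal estimation and initial clustering steps above can be illustrated with a short sketch. This is a minimal illustration rather than the patent’s reference implementation: the helper names, the brute-force nearest-neighbour search and the use of NumPy are assumptions made for brevity.

```python
import numpy as np

# The six axis-aligned plane normals of the point cloud bounding box.
PLANE_NORMALS = np.array([
    [ 1, 0, 0], [ 0, 1, 0], [ 0, 0, 1],
    [-1, 0, 0], [ 0, -1, 0], [ 0, 0, -1],
], dtype=float)

def estimate_normal(points, idx, m=16):
    """Estimate the normal of points[idx] from its m nearest neighbours."""
    d = np.linalg.norm(points - points[idx], axis=1)
    nbrs = points[np.argsort(d)[:m]]      # m nearest neighbours (brute force)
    c = nbrs.mean(axis=0)                 # barycenter c = (1/m) * sum(p_i)
    cov = (nbrs - c).T @ (nbrs - c)       # 3x3 covariance of the neighbourhood
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, 0]                  # eigenvector of the smallest eigenvalue

def initial_clustering(points):
    """Associate every point with the bounding-box plane whose normal is closest
    (i.e. maximizes the dot product with the estimated point normal)."""
    return [int(np.argmax(PLANE_NORMALS @ estimate_normal(points, i)))
            for i in range(len(points))]
```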
  • Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to patch packing 103, to geometry image generation 104, to texture image generation 105, to attribute smoothing (3D) 109 and to auxiliary patch info compression 113.
  • the patch packing 103 aims at generating the geometry and texture maps, by appropriately considering the generated patches and by trying to efficiently place the geometry and texture data that correspond to each patch onto a 2D grid of size WxH. Such placement also accounts for a user-defined minimum size block TxT (e.g. 16x16), which specifies the minimum distance between distinct patches as placed on this 2D grid. Parameter T may be encoded in the bitstream and sent to the decoder.
  • the packing process 103 may iteratively try to insert patches into a WxH grid.
  • W and H are user defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded.
  • the patch location may be determined through an exhaustive search that may be performed in raster scan order. Initially, patches are placed on a 2D grid in a manner that would guarantee non-overlapping insertion. Samples belonging to a patch (rounded to a value that is a multiple of T) are considered as occupied blocks. In addition, a safeguard distance of at least one block (a multiple of T) is enforced between adjacent patches. Patches are processed in an orderly manner, based on the patch index list. Each patch from the list is iteratively placed on the grid.
  • the grid resolution depends on the original point cloud size and its width (W) and height (H) are transmitted to the decoder. In the case that there is no empty space available for the next patch the height value of the grid is initially doubled, and the insertion of this patch is evaluated again. If insertion of all patches is successful, then the height is trimmed to the minimum needed value. However, this value is not allowed to be set lower than the originally specified value in the encoder.
  • the final values for W and H correspond to the frame resolution that is used to encode the texture and geometry video signals using the appropriate video codec.
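  • The packing loop described above may be sketched as follows. This is only an illustrative outline under simplifying assumptions: the function and parameter names are hypothetical, the one-block safeguard between patches and the final trimming of the grid height are omitted, and a real packer would also consider e.g. patch ordering and rotation.

```python
import numpy as np

def pack_patches(patch_sizes, W, H, T=16):
    """Place patches, given as (width, height) in pixels, on a WxH pixel grid.

    Returns a list of (x, y) pixel positions; the grid height is doubled
    whenever the next patch does not fit, as described above.
    """
    def blocks(v):                        # round a pixel size up to whole TxT blocks
        return (v + T - 1) // T

    occupied = np.zeros((blocks(H), blocks(W)), dtype=bool)
    positions = []
    for pw, ph in patch_sizes:            # patches processed in patch-index order
        bw, bh = blocks(pw), blocks(ph)
        assert bw <= occupied.shape[1], "patch wider than the packing grid"
        while True:
            placed = False
            # exhaustive search in raster-scan order for a free, non-overlapping area
            for by in range(max(occupied.shape[0] - bh + 1, 0)):
                for bx in range(occupied.shape[1] - bw + 1):
                    if not occupied[by:by + bh, bx:bx + bw].any():
                        occupied[by:by + bh, bx:bx + bw] = True
                        positions.append((bx * T, by * T))
                        placed = True
                        break
                if placed:
                    break
            if placed:
                break
            # no empty space for this patch: double the grid height and retry
            occupied = np.vstack([occupied, np.zeros_like(occupied)])
    return positions
```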
  • the geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images.
  • each patch may be projected onto two images, referred to as layers.
  • let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v).
  • Figure 2 illustrates an example of layer projection structure.
  • the first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0.
  • the second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness.
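  • The two-layer projection can be summarised in a few lines, assuming the points of a patch have already been grouped per projected pixel (u, v); the function name and the surface_thickness parameter (standing in for Δ) are illustrative.

```python
def project_two_layers(points_per_pixel, surface_thickness):
    """points_per_pixel maps (u, v) -> list of depths H(u, v) for one patch.

    Returns (near_layer, far_layer) dictionaries of depth values per pixel.
    """
    near, far = {}, {}
    for (u, v), depths in points_per_pixel.items():
        d0 = min(depths)                       # near layer: lowest depth D0
        near[(u, v)] = d0
        # far layer: highest depth within the interval [D0, D0 + surface thickness]
        far[(u, v)] = max(d for d in depths if d <= d0 + surface_thickness)
    return near, far
```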
  • the generated videos may have the following characteristics:
  • the geometry video is monochromatic.
  • the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
  • the surface separation method is applied to prevent the mixing of different surfaces in the connected components when there is a stack of multiple different surfaces in that connected component.
  • One of the methods to separate surfaces is to use the difference of the MSE values of the points in the RGB color domain: two surfaces are treated as distinct if MSE((R1, G1, B1), (R2, G2, B2)) > Threshold (e.g. Threshold = 20), where R1, G1, B1 are the attribute values belonging to T0 and R2, G2, B2 are the attribute values belonging to T1.
  • the geometry images (which are monochromatic) and the texture images may be provided to image padding 107.
  • the image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images.
  • the occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively.
  • the occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the occupancy map may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process.
  • the padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g. 16x16) pixels is compressed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e. edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
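  • The simple padding strategy described above might look roughly as follows for a single-channel float image and its occupancy mask; the function name and the choice of neighbouring block to copy from are assumptions.

```python
import numpy as np

def pad_block(img, occ, x, y, T=16):
    """Pad one TxT block of a single-channel float image using occupancy mask occ."""
    blk = img[y:y + T, x:x + T]                # view into the image
    o = occ[y:y + T, x:x + T].astype(bool)
    if not o.any():                            # empty block: copy from the previous block
        if x >= T:
            blk[:, :] = img[y:y + T, x - 1:x]  # last column of the block to the left
        elif y >= T:
            blk[:, :] = img[y - 1:y, x:x + T]  # last row of the block above
        return
    if o.all():                                # full block: nothing to do
        return
    # edge block: iteratively fill empty pixels with the mean of filled neighbours
    filled = o.copy()
    while not filled.all():
        for j in range(T):
            for i in range(T):
                if filled[j, i]:
                    continue
                nb = [(j + dj, i + di)
                      for dj, di in ((-1, 0), (1, 0), (0, -1), (0, 1))
                      if 0 <= j + dj < T and 0 <= i + di < T and filled[j + dj, i + di]]
                if nb:
                    blk[j, i] = np.mean([blk[a, b] for a, b in nb])
                    filled[j, i] = True
```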
  • the padded geometry images and padded texture images may be provided for video compression 108.
  • the generated images/layers may be stored as video frames and compressed using for example High Efficiency Video Coding (HEVC) Test Model 16 (HM) video codec according to the HM configurations provided as parameters.
  • the video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102.
  • the smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
  • the patch may be associated with auxiliary information being encoded/decoded for each patch as metadata.
  • the auxiliary information may comprise the index of the projection plane, the 2D bounding box, and the 3D location of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0.
  • the following metadata may be encoded/decoded for every patch:
  • mapping information providing for each TxT block its associated patch index may be encoded as follows:
  • let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block.
  • the order in the list is the same as the order used to encode the 2D bounding boxes.
  • L is called the list of candidate patches.
  • the compression process may comprise one or more of the following example operations:
  • Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block.
  • a value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1 it is said to be full, otherwise it is an empty sub-block.
  • a binary information may be encoded for each TxT block to indicate whether it is full or not.
  • If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
    o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
    o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
    o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.
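  • The sub-block signalling just described can be sketched as follows: one flag per TxT block, and for non-full blocks a traversal-order index followed by run-length coded sub-block occupancy. The token format and the representation of traversal orders as functions are illustrative assumptions; entropy coding of the resulting symbols is not shown.

```python
def count_runs(values):
    """Number of runs of equal values along a traversal."""
    return 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)

def encode_block_occupancy(subblocks, traversals):
    """subblocks: 2D list of 0/1 flags (1 = full B0xB0 sub-block) of one TxT block.
    traversals: list of functions mapping the 2D flags to a 1D visiting order.
    Returns a list of symbols that would subsequently be entropy coded."""
    flat = [v for row in subblocks for v in row]
    if all(flat):
        return [1]                        # full block: a single flag is enough
    symbols = [0]                         # non-full block
    # choose the traversal producing the fewest runs and signal its index
    best = min(range(len(traversals)),
               key=lambda i: count_runs(traversals[i](subblocks)))
    symbols.append(best)
    # run-length encode the 0/1 sub-block values along the chosen traversal
    order = traversals[best](subblocks)
    run_val, run_len = order[0], 0
    for v in order:
        if v == run_val:
            run_len += 1
        else:
            symbols.append((run_val, run_len))
            run_val, run_len = v, 1
    symbols.append((run_val, run_len))
    return symbols
```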
  • In occupancy map coding (lossy condition), a two-dimensional binary image of resolution (Width/B0)x(Height/B1) is generated, where Width and Height are the width and height of the geometry and texture images that are intended to be compressed.
  • a sample equal to 1 means that the corresponding/co-located sample or samples in the geometry and texture image should be considered as point cloud points when decoding, while a sample equal to 0 should be ignored (commonly includes padding information).
  • the resolution of the occupancy map does not have to be the same as those of the geometry and texture images; instead, the occupancy map could be encoded with a precision of B0xB1 blocks.
  • B0 and B1 are selected to be equal to 1.
  • the generated binary image covers only a single colour plane.
  • it may be desirable to extend the image with “neutral” or fixed-value chroma planes (e.g. add chroma planes with all sample values equal to 0 or 128, assuming the use of an 8-bit codec).
  • the obtained video frame may be compressed by using a video codec with lossless coding tool support (e.g., AVC, HEVC RExt, HEVC-SCC).
  • The occupancy map may be simplified by detecting empty and non-empty blocks of resolution TxT in the occupancy map, and encoding the patch index only for the non-empty blocks as follows:
    o A list of candidate patches is created for each TxT block by considering all the patches that contain that block.
    o The list of candidates is sorted in the reverse order of the patches.
    o For each block,
  • a multiplexer 112 may receive a compressed geometry video and a compressed texture video from the video compression 108, and optionally a compressed auxiliary patch information from auxiliary patch-info compression 111. The multiplexer 112 uses the received data to produce a compressed bitstream.
  • the encoded binary file is demultiplexed into geometry, attribute, occupancy map and auxiliary information streams.
  • Auxiliary information stream is entropy coded.
  • Occupancy map may be compressed using entropy coding method, or video compression method depending on selected level.
  • Geometry stream is decoded and, in combination with the occupancy map and auxiliary information, smoothing is applied to reconstruct the point cloud geometry information. Based on the decoded attribute video stream and the reconstructed information for the smoothed geometry, occupancy map and auxiliary information, the attributes of the point cloud can be reconstructed. After the attribute reconstruction stage, an additional attribute smoothing method is used for point cloud refinement.
  • FIG. 3 illustrates an overview of a decoding process for MPEG Point Cloud Compression (PCC).
  • a de-multiplexer 301 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 302.
  • the de-multiplexer 301 transmits the compressed occupancy map to occupancy map decompression 303. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 304.
  • Decompressed geometry video from the video decompression 302 is delivered to geometry reconstruction 305, as are the decompressed occupancy map and decompressed auxiliary patch information.
  • the point cloud geometry reconstruction 305 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers.
  • the 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v), let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows: δ(u, v) = δ0 + g(u, v); s(u, v) = s0 − u0 + u; r(u, v) = r0 − v0 + v, where g(u, v) is the luma component of the geometry image.
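  • The relation above can be written directly as a small helper; the argument and field names (with d0 standing for δ0) are hypothetical.

```python
def reconstruct_point(u, v, g, patch):
    """Recover the 3D location of the point projected to pixel (u, v).

    g      -- luma sample of the geometry image at (u, v)
    patch  -- auxiliary patch info with fields d0, s0, r0 (3D patch location)
              and u0, v0 (top-left corner of its 2D bounding box)
    """
    depth = patch.d0 + g                       # delta(u, v) = delta0 + g(u, v)
    tangential = patch.s0 - patch.u0 + u       # s(u, v) = s0 - u0 + u
    bitangential = patch.r0 - patch.v0 + v     # r(u, v) = r0 - v0 + v
    return depth, tangential, bitangential
```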
  • the reconstructed geometry image may be provided for smoothing 306, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts.
  • the implemented approach moves boundary points to the centroid of their nearest neighbors.
  • the smoothed geometry may be transmitted to texture reconstruction 307, which also receives a decompressed texture video from video decompression 302.
  • the texture values for the texture reconstruction are directly read from the texture images.
  • the texture reconstruction 307 outputs a reconstructed point cloud for color smoothing 308, which further provides the reconstructed point cloud.
  • the auxiliary information can be signaled in a bitstream according to the following syntax:
  • One way to compress a time-varying volumetric scene/object is to project 3D surfaces on to some number of pre-defined 2D planes. Regular 2D video compression algorithms can then be used to compress various aspects of the projected surfaces. Such projection may be presented using different patches. Each set of patches may represent a specific object or specific parts of a scene.
  • the present embodiments are targeted at encryption of patches based on the content (such as a scene or an object) presented by each patch.
  • each patch may present a part of the scene, or a set of patches may present one object, or any specific part of the image may be presented by a group of patches.
  • the patches include a signaling method enabling a player or a reconstruction process to apply encryption restrictions to the patches.
  • the content may define several different encryption combinations between patches based on authorization levels defined for end users.
  • content-based encryption of different patches enables limiting the end user’s access to different parts of the content.
  • the use cases include applications that use privacy, parental control, paid content services, etc.
  • the present embodiments enable encryption of the patches based on the content presented by each patch. This means that if a patch includes specific information of the scene that requires encryption for any end user, those patches should also include a signaling method enabling the player or reconstruction process to apply encryption restrictions to said patches.
  • Encryption refers to a method by which any type of data, such as images, is converted from a readable form to an encoded (i.e. encrypted) version that can only be decoded (i.e. decrypted) by specific target users. Such capability is given to end-users based on their authorization level.
  • the encryption may benefit from different encryption algorithms as well as different authorization levels defined for different users. This means, the same content may define several different encryption combinations between patches based on the authorization levels defined for the end users.
  • the high-level block diagrams for an encoder and a decoder according to embodiments are illustrated in Figures 4 and 5.
  • a process starts by obtaining a patch 410 and determining 420 whether an encryption is needed. If not, the patch goes to an encoding process 430. If the patch is to be encrypted, the patch is encrypted 440 in accordance with needed authorization level. After that, the encrypted patch goes to the encoding process 450.
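  • The encoder-side flow of Figure 4 can be summarised as in the sketch below; the callbacks needs_encryption, method_for_level and encode_patch are hypothetical placeholders, and a real system would plug in one of the encryption methods and the patch-level signalling discussed later.

```python
def encode_patches(patches, needs_encryption, method_for_level, encode_patch):
    """Encrypt the patches that need it, then encode each patch together with
    information on the used encryption method.

    needs_encryption(patch) -> None, or a pre-defined authorization level
    method_for_level(level) -> (method_id, encrypt_fn)
    encode_patch(payload, metadata) -> encoded bytes for the bitstream
    """
    bitstream = []
    for patch in patches:
        level = needs_encryption(patch)        # decision is based on the patch content
        if level is None:
            bitstream.append(encode_patch(patch, metadata={}))
            continue
        method_id, encrypt = method_for_level(level)
        protected = encrypt(patch)
        # the used encryption method and authorization level are signalled per patch
        bitstream.append(encode_patch(protected,
                                      metadata={"authorization_level": level,
                                                "encryption_method": method_id}))
    return bitstream
```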
  • An encoder may be situated in a processor of an imaging device, further comprising also a memory and a computer program stored therein. An encoder may alternatively be situated in a processor of a processing device that is configured to create an encoded version of image material obtained from the imaging device. The encoder may also comprise means for encrypting. According to another embodiment, the encoder may request encryption of certain patches from an external device.
  • the process starts by obtaining a patch 510 and determining 520 whether the patch is an encrypted patch. If not, the patch goes to a decoding process 530. If the patch is encrypted, the patch is decrypted 540 based on an authorization level 550 with a corresponding encryption method. After decryption, the patch goes to the decoding process 560.
  • a decoder may be situated on a processor of a client device. The client device may be a viewing device, or may be connected to one.
  • the authorization level may be defined for an end user, e.g. at the beginning of the transmission of a content. For example, there may be parameters, e.g. a frame rate, which remain constant during the decoding and presentation of a video, and such parameters may also be included in such overhead signals at the beginning of the transmission indicating the authorization level of each end user.
  • a user who makes a request for a content is identified at the server based on the request, and thus the server is able to send the content with the correct authorization level to the end user.
  • the patches (content) are encrypted with different authorization levels.
  • the authorization level of the latter embodiment may be determined from login information of the user (e.g. account, secure TV login system, authorization level purchase history of the user, etc.), from the viewing device itself, from a profile of the user (e.g. age, jurisdictions, etc.), or from connection properties (e.g. IP addresses). Therefore, the authorization level is part of the user’s profile, or part of a device or connection profile, indicating that this user, device or connection has the respective level of authorization.
  • the scene is considered to be presented by several patches. Each patch presents a part of the scene. Moreover, one object may be presented by a set of patches and any specific part of the image may be presented by a group of patches as well.
  • the present embodiments introduce a signaling method to be added in a patch level informing the decoder what type/level of encryption is applied, and then based on the authorization level defined at the end user side, such as in the client device or a display device, the decryption and decoding of the patches is enabled.
  • Such concept of encryption can be signaled within the V-PCC sequence parameter set, to ensure it is not truncated or lost, e.g. by adding an element sps_encryption_enabled_flag.
  • the authorization level for end users may be defined based on one or more of the following criteria, but is not limited to only these criteria:
  • - Paid content, where parts of the content are only available to users who have already paid for displaying it. In such cases, there may be several authorization levels defined considering the different levels of payment purchased by different end users.
  • the patches which represent a specific object, or a specific part of the scene are selected based on several different patch criteria which include but are not limited to the following:
  • - Regions of interest (ROI)
  • the ROI also includes the parts of the scene which are not appropriate for an audience of a certain age.
  • the encryption process enabled for each patch may include one or more different types of encryption depending on the application requirement.
  • the encryption may be based on Affine Transform and XOR Operation; three one-dimensional chaotic maps; One-dimensional Random Scrambling; Bit-level permutation, XOR and rotation; Explosive n*n Block Displacement; Neural Network; Blowfish algorithm; RC3 Stream Cipher and Chaotic Logistic Map.
  • any type of encryption or a combination of them may be used in the present embodiments depending on the requirement of the application as well as the sophistication level of encryption required for each application.
  • the encryption may be done using different algorithms to meet the requirements of the application, e.g. having the following encryption approaches when there are 4 levels of authorization for the end users:
    a. Authorization level 1 uses encryption method 1
    b. Authorization level 2 uses encryption method 2
    c. Authorization level 3 uses encryption method 3
    d. Authorization level 4 uses encryption method 4
  • the authorization level can be defined for each end user with a simple code, e.g.:
    a. Authorization level 1 -> code 00
    b. Authorization level 2 -> code 01
    c. Authorization level 3 -> code 10
    d. Authorization level 4 -> code 11
  • each patch may be assigned with a code representing the authorization level assigned to that patch.
  • a look-up table (LUT) may be communicated between an encoder and decoder where the code not only represents an authorization level, but is also mapped to a specific encryption method, and when the code is matched with the end user’s authorized level, then the encryption method is fetched from the LUT.
  • The LUT can be communicated, e.g. as an SEI message.
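  • The look-up table idea might be illustrated as follows: a per-patch code identifies both an authorization level and an encryption method, and the table itself could travel e.g. in an SEI message. The code values and level-to-method pairing follow the example above; everything else is a hypothetical sketch.

```python
# Example LUT as it might be communicated from encoder to decoder (e.g. in an
# SEI message): patch code -> (authorization level, encryption method identifier).
ENCRYPTION_LUT = {
    0b00: (1, "method_1"),
    0b01: (2, "method_2"),
    0b10: (3, "method_3"),
    0b11: (4, "method_4"),
}

def decrypt_if_authorized(patch_code, user_level, decrypt_fns, payload):
    """Decrypt a patch payload only if the end user's authorization level covers
    the level mapped to the patch's code; otherwise the patch stays encrypted."""
    level, method = ENCRYPTION_LUT[patch_code]
    if user_level < level:
        return None                            # user is not authorized for this patch
    return decrypt_fns[method](payload)        # fetch the decryption routine via the LUT
```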
  • each content may have several different authorization levels, targeting different end users. This means that the same content can be used without any modification for a variety of end users.
  • no patch-based code is signaled to a decoder, but the decoder will assign the code for each patch based on the content, e.g. having faces assigned one code, parts of the scene with nudity being assigned another code, potential documents being assigned another code, etc.
  • the patches are given a code based on the content of each patch and then, considering the same concept of LUT, the authorization level is fetched and compared with the authorization level of end user.
  • An example of possible syntax is as below.
  • a parameter patch_encryption_code has been added based on the present embodiments:
  • when the patch has been received, the patch_encryption_code is read, and based on the authorization level assigned to said patch_encryption_code and the authorization level of the end user, the content may be decrypted.
  • a video encoder is restricted to use only references of equal encryption level or references without encryption. This means that inter and intra prediction tools, such as motion compensation or intra-block copy, are not allowed to point at parts within a patch which is encoded with a different encryption level.
  • encryption levels are sorted hierarchically, e.g. encryption level 4 covers also levels 3, 2, and 1, but not encryption level 5.
  • a video encoder may be restricted to use only references of equal or lower encryption level or reference without encryption. This means that inter and intra prediction tools such as motion compensation or intra-block copy are not allowed to point at parts within a patch which is encoded with a higher encryption level.
  • some encryption levels are sorted hierarchically while others are not, e.g. encryption level 4 covers also levels 3, 2, and 1, but not encryption level 5.
  • the separated encryption level 6 does not cover any other encryption levels. Separated encryption levels can be signaled with a certain flag in or along the bitstream.
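  • The reference restrictions of the preceding embodiments can be expressed as a small predicate, assuming integer encryption levels, level 0 meaning unencrypted content, and separated levels listed explicitly; this is a sketch, not normative decoder behaviour.

```python
def reference_allowed(current_level, reference_level, separated_levels=frozenset()):
    """Return True if a block at current_level may predict from reference_level.

    Level 0 stands for unencrypted content. Hierarchically sorted levels may
    reference equal or lower levels; a separated level only references itself
    or unencrypted content.
    """
    if reference_level == 0:
        return True                              # unencrypted references are always allowed
    if current_level in separated_levels or reference_level in separated_levels:
        return current_level == reference_level  # separated levels cover no other level
    return reference_level <= current_level      # e.g. level 4 covers 3, 2 and 1, but not 5
```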
  • a video encoder may not be restricted on the use of references.
  • the respective video decoder is allowed to decrypt all video information internally but is only allowed to output according to its decryption authorization.
  • This embodiment has a better coding performance, but is slightly less secure, as decoded information would be available within the video decoder internal memory.
  • no encryption is applied on video level, and the received authorization level only restricts the 3D point cloud reconstruction.
  • This embodiment is more vulnerable as the 2D projections of the 3D content may be accessible in the 2D video streams.
  • standalone V-PCC playback applications can still utilize this approach, as it provides the least complexity and the 2D video can be kept in an internal memory buffer.
  • the encryption is only applied on a selection of one or more 2D video stream representing the dynamic point cloud data.
  • texture information is encrypted but geometry is still available.
  • the respective signalling can be carried, e.g. in the geometry information or attribute_information.
  • attribute_information comprises ai_encryption_available_flag.
  • encryption of geometry and/or attributes of the patch can be signaled in the patch_data_unit as indicated below:
  • Various encryption levels and approaches can be signaled in or along the bitstream, e.g. as SEI message, as shown in relation to sps_encryption_enabled_flag.
  • encryption_id contains supplemental encryption information interpreted according to the information carried in the encryption_id_info SEI message.
  • Fig. 6 is a flowchart illustrating a method according to an embodiment.
  • a method comprises receiving 610 a video presentation frame, where the video presentation represents three-dimensional data; generating 620 one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; determining 630 which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch; encrypting 640 the determined patches with an encryption method that corresponds to a pre-defined authorization level; and encoding 650, into a bitstream of a corresponding patch, information on the used encryption method.
  • An apparatus comprises means for receiving a video presentation frame, where the video presentation represents three-dimensional data; means for generating one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; means for determining which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch; means for encrypting the determined patches with an encryption method that corresponds to a pre-defined authorization level; and means for encoding, into a bitstream of a corresponding patch, information on the used encryption method.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 6 according to various embodiments.
  • FIG. 7 is a flowchart illustrating a method according to an embodiment.
  • a method comprises receiving 710 an encoded bitstream and, for each patch of the encoded bitstream 720: decoding a patch from the encoded bitstream; for every patch that is determined to be an encrypted patch, decoding from the bitstream information on a used encryption method for the patch; decrypting the patch with a decryption method that corresponds to the used encryption method; and generating a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
  • An apparatus comprises means for receiving an encoded bitstream; means for decoding each patch of the encoded bitstream; means for decoding, from the bitstream, information on a used encryption method for every patch that is determined to be an encrypted patch; means for decrypting the patch with a decryption method that corresponds to the used encryption method; and means for generating a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 7 according to various embodiments.
  • Figure 8 shows a system and apparatuses for viewing volumetric video according to present embodiments.
  • the task of the system is that of capturing sufficient visual and auditory information from a specific location such that a convincing reproduction of the experience, or presence, of being in that location can be achieved by one or more viewers physically located in different locations and optionally at a time later in the future.
  • Such reproduction requires more information than can be captured by a single camera or microphone, in order that a viewer can determine the distance and location of objects within the scene using their eyes and their ears.
  • two camera sources are used.
  • For the human auditory system to be able to sense the direction of sound, at least two microphones are used (the commonly known stereo sound is created by recording two audio channels).
  • the human auditory system can detect cues, e.g. a timing difference between the audio signals, to detect the direction of sound.
  • the system of Figure 8 may comprise three parts: image sources 801, 803, a server 805 and a rendering device 807.
  • An image source can be a video capture device 801 comprising two or more cameras with overlapping fields of view so that regions of the view around the video capture device are captured from at least two cameras.
  • the video capture device 801 may comprise multiple microphones (not shown in the figure) to capture the timing and phase differences of audio originating from different directions.
  • the video capture device 801 may comprise a high-resolution orientation sensor so that the orientation (direction of view) of the plurality of cameras can be detected and recorded.
  • the video capture device 801 comprises or is functionally connected to a computer processor PROCESSOR1 and memory MEMORY1, the memory comprising computer program code for controlling the video capture device 801.
  • the image stream captured by the video capture device 801 may be stored on a memory MEMORY1 and/or removable memory MEMORY9 for use in another device, e.g. in a viewer, and/or transmitted to a server 805 using a communication interface COMMUNICATION1.
  • one or more image source devices 803 of synthetic images may be present in the system.
  • Such image source devices 803 of synthetic images may use a computer model of a virtual world to compute the various image streams it transmits.
  • the image source 803 may compute N video streams corresponding to N virtual cameras located at a virtual viewing position.
  • the viewer may see a three-dimensional virtual world.
  • the image source device 803 comprises or is functionally connected to a computer processor PROCESSOR3 and memory MEMORY3, the memory comprising computer program code for controlling the image source device 803.
  • a server 805 or a plurality of servers storing the output from the video capture device 801 or image source device 803.
  • the server 805 comprises or is functionally connected to a computer processor PROCESSOR5 and memory MEMORY5, the memory comprising computer program code for controlling the server 805.
  • the server 805 may be connected by a wired or wireless network connection, or both, to sources 801 and/or 803, as well as to the viewer devices 809, over the communication interface COMMUNICATION5.
  • For viewing the captured or created video content, there may be one or more viewer devices 809 (a.k.a. playback devices). These viewer devices 809 may have one or more displays, and may comprise or be functionally connected to a rendering module 807.
  • the rendering module 807 comprises a computer processor PROCESSOR7 and memory MEMORY7, the memory comprising computer program code for controlling the viewer devices 809.
  • the viewer devices 809 may comprise a data stream receiver for receiving a video data stream from a server and for decoding the video data stream. The data stream may be received over a network connection through communications interface, or from a memory device 811 like a memory card.
  • the viewer devices 809 may have a graphics processing unit for processing of the data to a suitable format for viewing.
  • the viewer device 809 can be a high-resolution stereo-image head-mounted display for viewing the rendered stereo video sequence.
  • the head-mounted display may have an orientation detector 813 and stereo audio headphones.
  • the viewer device 809 is a display enabled with 3D technology (for displaying stereo video), and the rendering device 807 may have a head-orientation detector 815 connected to it.
  • the viewer device 809 may comprise a 2D display, since the volumetric video rendering can be done in 2D by rendering the viewpoint from a single eye instead of a stereo eye pair.
  • Any of the devices 801, 803, 805, 807, 809 may be a computer or a portable computing device, or be connected to such. Such devices may have computer program code for carrying out methods according to various examples described in this text.
  • the viewer device can be a head-mounted display (HMD).
  • the head-mounted display comprises two screen sections or two screens for displaying the left and right eye images.
  • the displays are close to the eyes, and therefore lenses are used to make the images easily viewable and for spreading the images to cover as much as possible of the eyes’ field of view.
  • the device is attached to the head of the user so that it stays in place even when the user turns his head.
  • the device may have an orientation detecting module for determining the head movements and direction of the head.
  • the head-mounted display gives a three-dimensional (3D) perception of the recorded/streamed content to a user.
  • the video material captured or generated by any of the image sources can be provided for an encoder that transforms an input video into a compressed representation suited for storage/transmission.
  • the compressed video is provided for a decoder that can uncompress the compressed video representation back into a viewable form.
  • the encoder may be located in the image sources or in the server.
  • the decoder may be located in the server or in the viewer, such as a HMD.
  • the encoder may discard some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate).
  • An example of an encoding process is illustrated in Figure 9.
  • Figure 9 illustrates an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS) and filtering (F).
  • An example of a decoding process is illustrated in Figure 10.
  • Figure 10 illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).
  • the various embodiments may provide advantages. For example, the present embodiments enable applying different encryption to each patch separately. In addition, the present embodiments enable content to be broadcast to a variety of end users while limiting their access to different parts of the scene based on their authorization level.
  • a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
  • the computer program code comprises one or more operational characteristics.
  • Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system according to an embodiment comprises receiving a video presentation frame, where the video presentation represents a three-dimensional data; generating one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; determining which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch; encrypting the determined patches with an encryption method that corresponds to a pre-defined authorization level; and encoding into a bitstream of a corresponding patch, information on the used encryption method.
  • a programmable operational characteristic of the system comprises receiving an encoded bitstream; for each patch of the encoded bitstream: decoding a patch from the encoded bitstream; for every patch that is determined to be an encrypted patch, decoding from the bitstream information on a used encryption method for the patch and decrypting the patch with a decryption method that corresponds to the used encryption method; and generating a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
  • a computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Technology Law (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiments relate to methods and apparatuses for coding volumetric video. A method for encoding comprises receiving (610) a video presentation frame, where the video presentation represents a three-dimensional data; generating (620) one or more patches (410) from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; determining (420, 630) which patch(es) of the generated one or more patches are to be encrypted (440), wherein the determination is based on a content of the patch; encrypting (440, 640) the determined patches with an encryption method that corresponds to a pre-defined authorization level; and encoding (450, 650) into a bitstream of a corresponding patch, information on the used encryption method.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING
Technical Field
The present solution generally relates to video encoding and video decoding. In particular, the solution relates to encoding and decoding of digital volumetric video.
Background
Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with relatively narrow field of view, and displayed as a rectangular scene on flat displays. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).
More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.
For volumetric video, a scene may be captured using one or more 3D (three- dimensional) cameras. The cameras are in different positions and orientations within a scene. One issue to consider is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (with or without transferring it from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel.
Summary
The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
Now there has been invented an improved method and technical equipment implementing the method. Various aspects include methods, apparatuses, and computer readable media comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.
According to a first aspect, there is provided a method comprising receiving a video presentation frame, where the video presentation represents a three-dimensional data; generating one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; determining which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch; encrypting the determined patches with an encryption method that corresponds to a pre-defined authorization level; and encoding into a bitstream of a corresponding patch, information on the used encryption method.
According to a second aspect, there is provided an apparatus comprising means for receiving a video presentation frame, where the video presentation represents a three-dimensional data; means for generating one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; means for determining which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch; means for encrypting the determined patches with an encryption method that corresponds to a pre-defined authorization level; and means for encoding into a bitstream of a corresponding patch, information on the used encryption method.
According to a third aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive a video presentation frame, where the video presentation represents a three-dimensional data; generate one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; determine which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch; encrypt the determined patches with an encryption method that corresponds to a pre-defined authorization level; and encode into a bitstream of a corresponding patch, information on the used encryption method.
According to an embodiment, the information on the encryption method comprises the pre-defined authorization level and a type of the corresponding encryption method.
According to an embodiment, the encryption method comprises one or more different types of encryption methods.
According to an embodiment, the authorization level is defined based on one or more of the following criteria: privacy, age limit, content’s discretion, subject to a charge.
According to an embodiment, the information on the encryption method is signaled as a supplemental enhancement information (SEI) message.
According to a fourth aspect, there is provided a method for decoding, comprising receiving an encoded bitstream; for each patch of the encoded bitstream, decoding a patch from the encoded bitstream; for every patch that is determined to be an encrypted patch, decoding from the bitstream information on a used encryption method for the patch; decrypting the patch with a decryption method that corresponds to the used encryption method; and generating a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
According to a fifth aspect, there is provided an apparatus comprising means for receiving an encoded bitstream; for each patch of the encoded bitstream means for decoding a patch from the encoded bitstream; for every patch that is determined to be an encrypted patch means for decoding from the bitstream information on a used encryption method for the patch; means for decrypting the patch with a decryption method that corresponds to the used encryption method; and means for generating a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
According to a sixth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive an encoded bitstream; for each patch of the encoded bitstream decode a patch from the encoded bitstream; for every patch that is determined to be an encrypted patch, decode from the bitstream information on a used encryption method for the patch; decrypt the patch with a decryption method that corresponds to the used encryption method; and generate video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
According to an embodiment, the used encryption method comprises information on an authorization level.
According to an embodiment, the authorization level defines a type of an encryption method.
According to an embodiment, the authorization level is defined based on one or more of the following criteria: privacy, age limit, content’s discretion, subject to a charge.
According to a seventh aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a video presentation frame, where the video presentation represents a three-dimensional data; generate one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; determine which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch; encrypt the determined patches with an encryption method that corresponds to a pre-defined authorization level; and encode into a bitstream of a corresponding patch, information on the used encryption method.
According to an eighth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive an encoded bitstream; for each patch of the encoded bitstream decode a patch from the encoded bitstream; for every patch that is determined to be an encrypted patch decode from the bitstream information on a used encryption method for the patch; decrypt the patch with a decryption method that corresponds to the used encryption method; and generate video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
Description of the Drawings
In the following, various embodiments will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows an example of a compression process;
Fig. 2 shows an example of a layer projection structure;
Fig. 3 shows an example of a decompression process;
Fig. 4 shows a high-level flowchart of an encoding process;
Fig. 5 shows a high-level flowchart of a decoding process; Fig. 6 is a flowchart illustrating a method according to an embodiment;
Fig. 7 is a flowchart illustrating a method according to another embodiment;
Fig. 8 shows a system according to an embodiment;
Fig. 9 shows an encoding process according to an embodiment; and Fig. 10 shows a decoding process according to an embodiment.
Description of Example Embodiments
In the following, several embodiments will be described in the context of digital volumetric video. In particular, the several embodiments enable encoding and decoding of digital volumetric video material. The present embodiments are applicable e.g. in the MPEG Video-Based Point Cloud Compression (V-PCC).
Volumetric video may be captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional two-dimensional/three-dimensional (2D/3D) video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.
Volumetric video enables the viewer to move in six degrees of freedom (DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a 2D plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames. Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstructing techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR, for example.
Volumetric video data represents a three-dimensional scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality) and MR (Mixed Reality) applications. Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g. colour, opacity, reflectance, ...) plus any possible temporal changes of the geometry and attributes at given time instances (like frames in 2D video). Volumetric video is either generated from 3D models, i.e. CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera, laser scan, a combination of video and dedicated depth sensors, etc. Also, a combination of CGI and real-world data is possible. Typical representation formats for such volumetric data are triangle meshes, point clouds, or voxel. Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. a position of an object as a function of time.
Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF (Six Degrees of Freedom) viewing capabilities.
Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well-suited for applications such as capturing real-world 3D scenes, where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps. In 3D point clouds, each point of each 3D surface is described as a 3D point with color and/or other attribute information such as surface normal or material reflectance. A point cloud is a set of data points (i.e. locations) in a coordinate system, for example a three-dimensional coordinate system defined by X, Y, and Z coordinates. The points may represent an external surface of an object in the screen space, e.g. in a 3D space. A point may be associated with a vector of attributes. A point cloud can be used to reconstruct an object or a scene as a composition of the points. Point clouds can be captured by using multiple cameras and depth sensors. A dynamic point cloud is a sequence of static point clouds, wherein each static point cloud is in its own “point cloud frame”.
In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression is needed. Standard volumetric video representation formats, such as point clouds, meshes and voxels, do not have sufficient temporal compression performance. Identifying correspondences for motion-compensation in 3D space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview and depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.
Instead of the above-mentioned approach, a 3D scene, represented as meshes, points and/or voxel, can be projected onto one or more geometries. These geometries are “unfolded” onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information is transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).
Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency may be increased remarkably. Using geometry-projections instead of 2D-video based approaches from the related technology, i.e. multiview and depth, provides better coverage of the scene (or object). Thus, 6DOF capabilities are improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, existing video encoding hardware can be utilized for real-time compression/decompression of the projected planes. The projection and reverse projection steps are of low complexity.
An overview of a compression process is shortly discussed next. Such a process may be applied for example in V-PCC. At the encoding stage, the input point cloud frame is processed in the following manner: First, the volumetric 3D data may be represented as a set of 3D projections in different components. At the separation stage, the image is decomposed into far and near components for geometry and corresponding attributes components; in addition, an occupancy map 2D image may be created to indicate parts of an image that shall be used. The 2D projection is composed of independent patches based on geometry characteristics of the input point cloud frame. After the patches have been generated and 2D frames for video encoding have been created, the occupancy map, geometry information and the auxiliary information may be compressed. At the end of the process, the separate bitstreams are multiplexed into the output compressed binary file.
Figure 1 shows the encoding process in a more detailed manner.
The process starts with an input frame representing a point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105. Each point cloud frame 101 represents a dataset of points within a 3D volumetric space that has unique coordinates and attributes.
The patch generation 102 process decomposes the point cloud frame 101 by converting 3D samples to 2D samples on a given projection plane using a strategy which provides the best compression. According to an example, the patch generation 102 process aims at decomposing the point cloud frame 101 into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. At the initial stage of the patch generation 102, a normal per each point is estimated. The tangent plane and its corresponding normal are defined per each point, based on the point’s m nearest neighbors within a predefined search distance. A k-dimensional tree may be used to separate the data and find neighbors in the vicinity of a point p_i, and the barycenter c of that set of points is used to define the normal.
The barycenter c may be computed as follows:
c = (1/m) * sum_{i=1..m} p_i
The normal is estimated from eigen decomposition for the defined point cloud as:
sum_{i=1..m} (p_i - c)(p_i - c)^T
Based on this information, each point is associated with a corresponding plane of a point cloud bounding box. Each plane is defined by a corresponding normal n_p with values:
- (1.0, 0.0, 0.0),
- (0.0, 1.0, 0.0),
- (0.0, 0.0, 1.0),
- (-1.0, 0.0, 0.0),
- (0.0, -1.0, 0.0),
- (0.0, 0.0, -1.0)
More precisely, each point may be associated with the plane that has the closest normal, i.e. the plane whose normal maximizes the dot product of the point normal n_p and the plane normal.
The sign of the normal is defined depending on the point’s position in relationship to the “center”. The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step of the patch generation 102 may comprise extracting patches by applying a connected component extraction procedure.
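For illustration only (not part of the original specification), the following Python sketch shows one way the plane association step described above could be realized, assuming the estimated point normals are available as a numpy array; the function name assign_points_to_planes is hypothetical.

import numpy as np

# The six axis-aligned plane normals listed above, indexed 0..5.
PLANE_NORMALS = np.array([
    ( 1.0,  0.0,  0.0),
    ( 0.0,  1.0,  0.0),
    ( 0.0,  0.0,  1.0),
    (-1.0,  0.0,  0.0),
    ( 0.0, -1.0,  0.0),
    ( 0.0,  0.0, -1.0),
])

def assign_points_to_planes(point_normals):
    # For each estimated point normal n_p, pick the bounding-box plane whose
    # normal maximizes the dot product n_p . n_plane.
    dots = point_normals @ PLANE_NORMALS.T      # (N, 3) x (3, 6) -> (N, 6)
    return np.argmax(dots, axis=1)

# Example: a normal pointing mostly along -y is assigned to plane index 4.
print(assign_points_to_planes(np.array([[0.1, -0.95, 0.2]])))  # -> [4]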
Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to patch packing 103, to geometry image generation 104, to texture image generation 105, to attribute smoothing (3D) 109 and to auxiliary patch info compression 113. The patch packing 103 aims at generating the geometry and texture maps, by appropriately considering the generated patches and by trying to efficiently place the geometry and texture data that correspond to each patch onto a 2D grid of size WxH. Such placement also accounts for a user-defined minimum size block TxT (e.g. 16x16), which specifies the minimum distance between distinct patches as placed on this 2D grid. Parameter T may be encoded in the bitstream and sent to the decoder.
The packing process 103 may iteratively try to insert patches into a WxH grid. W and H are user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location may be determined through an exhaustive search that may be performed in raster scan order. Initially, patches are placed on a 2D grid in a manner that would guarantee non-overlapping insertion. Samples belonging to a patch (rounded to a value that is a multiple of T) are considered as occupied blocks. In addition, a safeguard between adjacent patches is forced to a distance of at least one block being a multiple of T. Patches are processed in an orderly manner, based on the patch index list. Each patch from the list is iteratively placed on the grid. The grid resolution depends on the original point cloud size, and its width (W) and height (H) are transmitted to the decoder. In the case that there is no empty space available for the next patch, the height value of the grid is initially doubled, and the insertion of this patch is evaluated again. If insertion of all patches is successful, then the height is trimmed to the minimum needed value. However, this value is not allowed to be set lower than the originally specified value in the encoder. The final values for W and H correspond to the frame resolution that is used to encode the texture and geometry video signals using the appropriate video codec.
The geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). Figure 2 illustrates an example of layer projection structure. The first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:
• Geometry: WxH YUV420-8bit,
  • Texture: WxH YUV420-8bit,
It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
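As a non-normative illustration of the two-layer projection described above, the following Python sketch selects the near-layer and far-layer depths for the points of a patch that project to the same pixel; the function name and example values are hypothetical.

def split_into_layers(depths, delta):
    # Given the depths of all points of a patch that project to the same
    # pixel (u, v), return (d0, d1): the near-layer depth D0 (lowest depth)
    # and the far-layer depth (highest depth within [D0, D0 + delta]).
    d0 = min(depths)
    in_range = [d for d in depths if d0 <= d <= d0 + delta]
    d1 = max(in_range)
    return d0, d1

# Example: three points fall on the same pixel, surface thickness 4.
print(split_into_layers([12, 13, 19], delta=4))  # -> (12, 13); 19 lies beyond D0 + delta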
The surface separation method is applied to prevent the mixing of different surfaces in the connected components when there is a stack of multiple different surfaces in that connected component. One of the methods to separate surfaces is to use differences of MSE values of points in RGB color domain:
A patch is separated if
MSE(R1 - R2, G1 - G2, B1 - B2) > Threshold; Threshold = 20,
where R1, G1, B1 are attribute values belonging to T0 and R2, G2, B2 are the attribute values belonging to T1.
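For illustration only, the following Python sketch applies the surface separation criterion above, under the assumption that the criterion denotes the mean squared difference over the co-located RGB samples of layers T0 and T1; the function name is hypothetical.

import numpy as np

def should_separate_patch(rgb_t0, rgb_t1, threshold=20.0):
    # Compare the attribute values of the two layers (T0, T1) of a connected
    # component and separate the patch if the mean squared difference in the
    # RGB domain exceeds the threshold.
    diff = np.asarray(rgb_t0, dtype=np.float64) - np.asarray(rgb_t1, dtype=np.float64)
    mse = np.mean(diff ** 2)
    return mse > threshold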
The geometry images (which are monochromatic) and the texture images may be provided to image padding 107. The image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images. The occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the occupancy map may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process.
The padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g. 16x16) pixels is compressed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e. edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
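A minimal Python sketch of the block-based padding strategy described above is given below; it assumes an occupancy mask of the same size as the (monochromatic) image and treats the block to the left (or, at the start of a row, the block above) as the previous block in raster order. It is an illustration, not the normative padding process.

import numpy as np

def pad_image(img, occupied, t=16):
    # Fill the empty space between patches block by block (TxT blocks).
    # `occupied` is a boolean map of the same size as `img` telling which
    # pixels belong to a patch.
    out = img.astype(np.float64)
    h, w = out.shape
    for by in range(0, h, t):
        for bx in range(0, w, t):
            occ = occupied[by:by + t, bx:bx + t]
            blk = out[by:by + t, bx:bx + t]
            if occ.all():                      # full block: nothing to do
                continue
            if not occ.any():                  # empty block: copy last column/row
                if bx > 0:                     # of the previous block in raster order
                    blk[:] = out[by:by + t, bx - 1][:, None]
                elif by > 0:
                    blk[:] = out[by - 1, bx:bx + t]
                continue
            # edge block: iteratively fill empty pixels with the average of
            # their already filled neighbours
            filled = occ.copy()
            while not filled.all():
                for y in range(blk.shape[0]):
                    for x in range(blk.shape[1]):
                        if filled[y, x]:
                            continue
                        nbrs = [blk[y + dy, x + dx]
                                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                                if 0 <= y + dy < blk.shape[0]
                                and 0 <= x + dx < blk.shape[1]
                                and filled[y + dy, x + dx]]
                        if nbrs:
                            blk[y, x] = sum(nbrs) / len(nbrs)
                            filled[y, x] = True
    return out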
The padded geometry images and padded texture images may be provided for video compression 108. The generated images/layers may be stored as video frames and compressed using for example High Efficiency Video Coding (HEVC) Test Model 16 (HM) video codec according to the HM configurations provided as parameters. The video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102. The smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise index of the projection plane, 2D bounding box, and 3D location of the patch represented in terms of depth S0, tangential shift s0 and bi-tangential shift r0.
The following metadata may be encoded/decoded for every patch:
• Index of the projection plane o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0) o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0) o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0).
  • 2D bounding box (u0, v0, u1, v1)
  • 3D location (x0, y0, z0) of the patch represented in terms of depth S0, tangential shift s0 and bi-tangential shift r0. According to the chosen projection planes, (S0, s0, r0) are computed as follows: o Index 0, S0 = x0, s0 = z0 and r0 = y0 o Index 1, S0 = y0, s0 = z0 and r0 = x0 o Index 2, S0 = z0, s0 = x0 and r0 = y0
Also, mapping information providing for each TxT block its associated patch index may be encoded as follows:
• For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.
• The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.
  • Let I be the index of the patch to which the current TxT block belongs and let J be the position of I in L. Instead of explicitly encoding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.
The compression process may comprise one or more of the following example operations:
  • Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1 it is said to be full, otherwise it is an empty sub-block.
  • If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.
  • Binary information may be encoded for each TxT block to indicate whether it is full or not. • If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows: o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner. o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream. o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.
The binary value of the initial sub-block is encoded.
Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.
The number of detected runs is encoded.
The length of each run, except for the last one, is also encoded.
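For illustration, the following Python sketch produces the run-length representation described above (initial sub-block value, number of runs, and the lengths of all runs except the last), assuming the sub-block values are already listed in the chosen traversal order; the function name is hypothetical.

def run_length_encode(bits):
    # Encode the binary sub-block values of a non-full TxT block, listed in
    # the traversal order chosen by the encoder.
    runs = []
    count = 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return bits[0], len(runs), runs[:-1]

# Example: 1 1 0 0 0 1 -> initial bit 1, three runs, lengths [2, 3] (last run implicit)
print(run_length_encode([1, 1, 0, 0, 0, 1]))  # -> (1, 3, [2, 3])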
In occupancy map coding (lossy condition), a two-dimensional binary image of resolution (Width/B0)x(Height/B1) is used, where Width and Height are the width and height of the geometry and texture images that are intended to be compressed. A sample equal to 1 means that the corresponding/co-located sample or samples in the geometry and texture image should be considered as point cloud points when decoding, while a sample equal to 0 should be ignored (commonly includes padding information). The resolution of the occupancy map does not have to be the same as those of the geometry and texture images and instead the occupancy map could be encoded with a precision of B0xB1 blocks. In order to achieve lossless encoding, B0 and B1 are selected to be equal to 1. In practice, B0=B1=2 or B0=B1=4 can result in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map. The generated binary image covers only a single colour plane. However, given the prevalence of 4:2:0 codecs, it may be desirable to extend the image with “neutral” or fixed value chroma planes (e.g. add chroma planes with all sample values equal to 0 or 128, assuming the use of an 8-bit codec).
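A small Python sketch of reducing the occupancy map to a precision of B0xB1 blocks is shown below, under the assumption (made for this sketch only) that a block is marked occupied when any of its pixels is occupied; with B0 = B1 = 1 the map is preserved losslessly.

import numpy as np

def downsample_occupancy_map(om, b0, b1):
    # om is the full-resolution binary occupancy map; the result has
    # resolution (Width/B0) x (Height/B1).
    h, w = om.shape
    out = np.zeros((h // b1, w // b0), dtype=np.uint8)
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            block = om[y * b1:(y + 1) * b1, x * b0:(x + 1) * b0]
            out[y, x] = 1 if block.any() else 0
    return out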
The obtained video frame may be compressed by using a video codec with lossless coding tool support (e.g., AVC, HEVC RExt, HEVC-SCC). The occupancy map may be simplified by detecting empty and non-empty blocks of resolution TxT in the occupancy map, and the patch index is encoded only for the non-empty blocks as follows: o A list of candidate patches is created for each TxT block by considering all the patches that contain that block o The list of candidates is sorted in the reverse order of the patches o For each block,
1. If the list of candidates has one index, then nothing is encoded.
2. Otherwise, the index of the patch in this list is arithmetically encoded.
A multiplexer 112 may receive a compressed geometry video and a compressed texture video from the video compression 108, and optionally a compressed auxiliary patch information from auxiliary patch-info compression 111. The multiplexer 112 uses the received data to produce a compressed bitstream.
In the decoding process of the encoded bitstream, the encoded binary file is demultiplexed into geometry, attribute, occupancy map and auxiliary information streams. The auxiliary information stream is entropy coded. The occupancy map may be compressed using an entropy coding method or a video compression method, depending on the selected level. The geometry stream is decoded, and in combination with the occupancy map and auxiliary information, smoothing is applied to reconstruct point cloud geometry information. Based on the decoded attribute video stream and the reconstructed information for smoothed geometry, occupancy map and auxiliary information, the attributes of the point cloud can be reconstructed. After the attribute reconstruction stage, an additional attribute smoothing method is used for point cloud refinement.
Figure 3 illustrates an overview of a decoding process for MPEG Point Cloud Compression (PCC). A de-multiplexer 301 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 302. In addition, the de-multiplexer 301 transmits the compressed occupancy map to occupancy map decompression 303. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 304. Decompressed geometry video from the video decompression 302 is delivered to geometry reconstruction 305, as are the decompressed occupancy map and decompressed auxiliary patch information.
The point cloud geometry reconstruction 305 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (S0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P could be expressed in terms of depth S(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:
S(u, v) = S0 + g(u, v); s(u, v) = s0 - u0 + u; r(u, v) = r0 - v0 + v, where g(u, v) is the luma component of the geometry image.
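The reconstruction equations above can be illustrated with the following Python sketch, which recovers a 3D point from a geometry image sample and the auxiliary patch information; the dictionary keys used for the patch metadata are hypothetical.

def reconstruct_point(u, v, g_uv, patch):
    # Recover the 3D position behind pixel (u, v) from the geometry image
    # sample g(u, v) and the auxiliary patch information (S0, s0, r0, u0, v0
    # and the projection plane index 0, 1 or 2).
    depth = patch["S0"] + g_uv                 # S(u, v)
    tangent = patch["s0"] - patch["u0"] + u    # s(u, v)
    bitangent = patch["r0"] - patch["v0"] + v  # r(u, v)
    # Map (depth, tangent, bitangent) back to (x, y, z), inverting the
    # per-index mapping given for (S0, s0, r0).
    if patch["index"] == 0:                    # S=x, s=z, r=y
        return depth, bitangent, tangent
    if patch["index"] == 1:                    # S=y, s=z, r=x
        return bitangent, depth, tangent
    return tangent, bitangent, depth           # index 2: S=z, s=x, r=y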
The reconstructed geometry image may be provided for smoothing 306, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 307, which also receives a decompressed texture video from video decompression 302. The texture values for the texture reconstruction are directly read from the texture images. The texture reconstruction 307 outputs a reconstructed point cloud for color smoothing 308, which further provides the reconstructed point cloud.
According to V-PCC, the auxiliary information can be signaled in a bitstream according to the following syntax:
Sequence parameter set syntax:
[Sequence parameter set syntax table, reproduced as an image in the original publication.]
Geometry sequence params syntax:
[Geometry sequence params syntax table, reproduced as an image in the original publication.]
Geometry frame params syntax:
[Geometry frame params syntax table, reproduced as an image in the original publication.]
Geometry patch params syntax:
[Geometry patch params syntax table, reproduced as an image in the original publication.]
One way to compress a time-varying volumetric scene/object is to project 3D surfaces on to some number of pre-defined 2D planes. Regular 2D video compression algorithms can then be used to compress various aspects of the projected surfaces. Such projection may be presented using different patches. Each set of patches may represent a specific object or specific parts of a scene.
The present embodiments are targeted for encryption of patches based on a content (such as a scene or an object) presented by each patch. For example, each patch may present a part of the scene, or a set of patches may present one object, or any specific part of the image may be presented by a group of patches. According to present embodiments, the patches include a signaling method enabling a player or a reconstruction process to apply encryption restrictions to the patches. Moreover, the content may define several different encryption combinations between patches based on authorization levels defined for end users.
By means of the present embodiments, content-based encryption of different patches enables limiting the end user’s access to different parts of the content. The use cases include applications that involve privacy, parental control, paid content services, etc. Thus, the present embodiments enable encryption of the patches based on the content presented by each patch. This means that, if a patch includes specific information of the scene that requires to be encrypted for any end user, those patches should also include a signaling method enabling the player or reconstruction process to apply encryption restrictions to said patches.
The term “encryption” refers to a method by which any type of data, such as images, is converted from a readable form to an encoded (i.e. encrypted) version that can only be decoded (i.e. decrypted) by specific target users. Such capability is given to end-users based on their authorization level. The encryption may benefit from different encryption algorithms as well as different authorization levels defined for different users. This means that the same content may define several different encryption combinations between patches based on the authorization levels defined for the end users. The high-level block diagrams for an encoder and a decoder according to embodiments are illustrated in Figures 4 and 5.
At the encoder, shown in Figure 4, a process starts by obtaining a patch 410 and determining 420 whether an encryption is needed. If not, the patch goes to an encoding process 430. If the patch is to be encrypted, the patch is encrypted 440 in accordance with the needed authorization level. After that, the encrypted patch goes to the encoding process 450. An encoder may be situated in a processor of an imaging device, further comprising also a memory and a computer program stored therein. An encoder may alternatively be situated in a processor of a processing device that is configured to create an encoded version of image material obtained from the imaging device. The encoder may also comprise means for encrypting. According to another embodiment, the encoder may request encryption of certain patches from an external device.
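For illustration only, the following Python sketch mirrors the encoder-side flow of Figure 4 at a very high level; the callables needs_encryption, encrypt and encode are hypothetical placeholders for the content analysis, encryption and encoding stages.

def encode_patches(patches, needs_encryption, encrypt, encode):
    # For each patch: decide whether its content requires encryption, encrypt
    # it according to the required authorization level if so, and encode the
    # (possibly encrypted) patch together with information on the used
    # encryption method.
    bitstream = []
    for patch in patches:
        level = needs_encryption(patch)   # None if no encryption is needed
        if level is None:
            bitstream.append(encode(patch, encryption_info=None))
        else:
            encrypted = encrypt(patch, level)
            bitstream.append(encode(encrypted, encryption_info=level))
    return bitstream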
At the decoder, shown in Figure 5, the process starts by obtaining a patch 510 and determining 520 whether the patch is an encrypted patch. If not, the patch goes to a decoding process 530. If the patch is encrypted, the patch is decrypted 540 based on an authorization level 550 with a corresponding encryption method. After decryption, the patch goes to the decoding process 560. A decoder may be situated on a processor of a client device. The client device may be a viewing device, or may be connected to one.
The authorization level may be defined for an end user e.g. at the beginning of the transmission of a content. For example, there may be parameters, e.g. a frame rate, which remain constant during the decoding and presentation of a video, and such parameters may also be included in overhead signals at the beginning of the transmission indicating the authorization level of each end user.
According to an embodiment, a user who makes a request for a content is identified at the server based on the request, and thus the server is able to send the content with the correct authorization level to the end user. In another embodiment, the patches (content) are encrypted with different authorization levels. When the content is requested by any end user, the same content is transmitted as a response. Depending on the user’s authorization level, s/he can decode all or part of the content. Therefore, several users may decode different parts of the content, based on their authorization level. The authorization level of the latter embodiment may be determined from login information of the user (e.g. account, secure TV login system, authorization level purchase history of the user, etc.) or from the viewing device itself or from a profile of the user (e.g. age, jurisdictions, etc.) or from connection properties (IP addresses). Therefore, the authorization level will be part of the user’s profile or part of a device or connection profile indicating that this user has a respective level of authorization or this device or connection has a respective level of authorization.
In the present embodiments, the scene is considered to be presented by several patches. Each patch presents a part of the scene. Moreover, one object may be presented by a set of patches and any specific part of the image may be presented by a group of patches as well. The present embodiments introduce a signaling method to be added in a patch level informing the decoder what type/level of encryption is applied, and then based on the authorization level defined at the end user side, such as in the client device or a display device, the decryption and decoding of the patches is enabled. Such concept of encryption can be signaled within the V-PCC sequence parameter set, to ensure it is not truncated or lost, e.g. by adding an element sps_encryption_enabled_flag.
[Sequence parameter set syntax table including sps_encryption_enabled_flag, reproduced as an image in the original publication.]
The authorization level for end users may be defined based on one or more of the following criteria, but is not limited to only these criteria:
- Privacy: where some content is confidential and is not to be perceived by all end users. Therefore, the displaying of such content is limited to a group of end users which have the required authorization level.
- Parental control: where the content is not appropriate for younger audience and hence, a certain authorization level is required to display such content.
- Discretion: where e.g. horror content or graphic content requires a specific level of authorization to be displayed.
- Paid content: where parts of the content is only available to users who have already paid for displaying it. In such cases, there may be several authorization levels defined considering different level of payments purchased by different end users.
In the encoder side, the patches which represent a specific object, or a specific part of the scene, are selected based on several different patch criteria which include but are not limited to the following:
- Regions of interest (ROI), i.e. faces, plate numbers, documents or ... other parts of the scene which may require different treatment from a presentation point of view for the end user. This also includes the cases where some part of the scene is to be presented so that the content is not perceivable to the end user. The ROI also includes the parts of the scene which are not appropriate for an audience of a certain age.
- Confidential parts of the scene selected by the capturing director or content provider. This means that the confidential parts of the scene are to be filtered, preventing the end user from perceiving them while having a general understanding of the content, with most of the scene presented normally. This also refers to content which is to be paid for to be visible to the end user.
The encryption process enabled for each patch may include one or more different types of encryption depending on the application requirement. For example, the encryption may be based on Affine Transform and XOR Operation; Three one-dimensional chaotic maps; One-dimensional Random Scrambling; Bit-level permutation, XOR and rotation; Explosive n*n Block Displacement; Neural Network; Blowfish algorithm; RC4 Stream Cipher and Chaotic Logistic Map.
It is appreciated that any type of encryption or a combination of them may be used in the present embodiments depending on the requirement of the application as well as the sophistication level of encryption required for each application.
The encryption may be done using different algorithms to meet the requirements of the application, e.g. having the following encryption approaches when having 4 levels of authorization for the end users:
a. Authorization level 1: use encryption method 1
b. Authorization level 2: use encryption method 2
c. Authorization level 3: use encryption method 3
d. Authorization level 4: use encryption method 4
The authorization level can be defined for each end user with a simple code, e.g.:
a. Authorization level 1 -> code 00
b. Authorization level 2 -> code 01
c. Authorization level 3 -> code 10
d. Authorization level 4 -> code 11
According to an embodiment, each patch may be assigned with a code representing the authorization level assigned to that patch. Moreover, a look-up table (LUT) may be communicated between an encoder and decoder where the code not only represents an authorization level, but is also mapped to a specific encryption method, and when the code is matched with the end user’s authorized level, then the encryption method is fetched from the LUT. Such LUT can be communicated, e.g. as SEI message.
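A possible (purely illustrative) form of such a look-up table and its use at the decoder is sketched below in Python; the codes, levels and method identifiers are examples and not normative values.

# Hypothetical look-up table of the kind that could be communicated between
# encoder and decoder (e.g. in an SEI message): the 2-bit code carried per
# patch maps to an authorization level and an encryption method identifier.
ENCRYPTION_LUT = {
    0b00: ("level 1", "method 1"),
    0b01: ("level 2", "method 2"),
    0b10: ("level 3", "method 3"),
    0b11: ("level 4", "method 4"),
}

def decryption_method_for_patch(patch_encryption_code, user_level):
    # Return the decryption method to apply, or None if the end user's
    # authorization level does not match the level required by the patch.
    required_level, method = ENCRYPTION_LUT[patch_encryption_code]
    return method if user_level == required_level else None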
In an embodiment, each content may have several different authorization levels, targeting different end users. This means that the same content can be used without any modification for a variety of end users. In an alternative embodiment, no patch-based code is signaled to a decoder, but the decoder will assign the code for each patch based on the content, e.g. faces being assigned one code, parts of the scene with nudity being assigned another code, potential documents being assigned another code, etc. In such an embodiment, the patches are given a code based on the content of each patch and then, considering the same concept of LUT, the authorization level is fetched and compared with the authorization level of the end user. An example of possible syntax is given below. A parameter patch_encryption_code has been added based on the present embodiments:
[Patch data unit syntax table including patch_encryption_code, reproduced as an image in the original publication.]
Similarly, in the decoder side, when the patch has been received, the patch_encryption_code is read, and based on the authorization level assigned to said patch_encryption_code and the authorization level of the end user, the content may be decrypted.
According to an embodiment, a video encoder is restricted to use only references of equal encryption level or without encryption. This means that inter and intra prediction tools such as motion compensation or intra-block copy are not allowed to point at parts within a patch which is encoded with a different encryption level.
According to an embodiment, encryption levels are sorted hierarchically, e.g. encryption level 4 covers also levels 3, 2, and 1, but not encryption level 5. For such a hierarchical encryption level embodiment, a video encoder may be restricted to use only references of equal or lower encryption level or references without encryption. This means that inter and intra prediction tools such as motion compensation or intra-block copy are not allowed to point at parts within a patch which is encoded with a higher encryption level.
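The reference restriction for hierarchical and non-hierarchical encryption levels can be illustrated with the following Python sketch; the convention that level 0 denotes unencrypted content is an assumption made for the example.

def reference_allowed(current_level, reference_level, hierarchical=True):
    # Check whether a prediction tool (e.g. motion compensation or
    # intra-block copy) may reference a sample area, given the encryption
    # levels of the current patch and the referenced patch.
    if reference_level == 0:                  # unencrypted references are always allowed
        return True
    if hierarchical:                          # equal or lower level is allowed
        return reference_level <= current_level
    return reference_level == current_level   # non-hierarchical: equal level only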
According to an embodiment, some encryption levels are sorted hierarchically while others are not, e.g. encryption level 4 covers also levels 3, 2, and 1, but not encryption level 5. In addition, the separated encryption level 6 does not cover any other encryption levels. Separated encryption levels can be signaled with a certain flag in or along the bitstream.
According to yet another embodiment, a video encoder may not be restricted on the use of references. The respective video decoder is allowed to decrypt all video information internally but is only allowed to output according to its decryption authorization. This embodiment has a better coding performance, but is slightly less secure, as decoded information would be available within the video decoder internal memory.
In yet another embodiment, no encryption is applied on video level, and the received authorization level only restricts the 3D point cloud reconstruction. This embodiment is more vulnerable as the 2D projections of the 3D content may be accessible in the 2D video streams. However, standalone V-PCC playback applications can still utilize this approach, as it provides the least complexity and the 2D video can be kept in an internal memory buffer.
In an embodiment, the encryption is only applied on a selection of one or more 2D video streams representing the dynamic point cloud data. For example, only texture information is encrypted but geometry is still available. The respective signalling can be carried, e.g. in the geometry information or attribute_information. According to present embodiments, attribute_information comprises ai_encryption_available_flag.
[Attribute information syntax table including ai_encryption_available_flag, reproduced as an image in the original publication.]
According to embodiments, encryption of the geometry and/or attributes of the patch can be signaled in the patch_data_unit as indicated below:
[Patch data unit syntax table signaling geometry and/or attribute encryption, reproduced as an image in the original publication.]
Various encryption levels and approaches can be signaled in or along the bitstream, e.g. as SEI message, as shown in relation to sps_encryption_enabled_flag.
[Syntax tables not reproduced here (Figures imgf000031_0002 and imgf000032_0001): sequence-level and SEI message syntax including sps_encryption_enabled_flag.]
According to an embodiment, encryption_id contains supplemental encryption information interpreted according to the information carried in the encryption_id_info SEI message.
[Syntax table not reproduced here (Figure imgf000032_0002): encryption_id_info SEI message syntax.]
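How a decoder might combine these signals can be sketched as below; the dictionary layout of the SEI payload and the function name are assumptions, not the actual syntax of the specification:

```python
from typing import Optional

def resolve_encryption(sps_encryption_enabled_flag: bool,
                       encryption_id: int,
                       encryption_id_info: dict) -> Optional[dict]:
    """Return the supplemental encryption parameters for encryption_id, or None if disabled."""
    if not sps_encryption_enabled_flag:
        return None
    # encryption_id_info (carried in an SEI message) maps ids to parameters,
    # e.g. the encryption method and a key identifier.
    return encryption_id_info.get(encryption_id)

# Example with assumed SEI content:
sei = {1: {"method": "AES-128-CTR", "key_id": 7},
       2: {"method": "stream cipher", "key_id": 9}}
print(resolve_encryption(True, 1, sei))   # {'method': 'AES-128-CTR', 'key_id': 7}
```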
Fig. 6 is a flowchart illustrating a method according to an embodiment. The method comprises receiving 610 a video presentation frame, where the video presentation represents three-dimensional data; generating 620 one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; determining 630 which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on the content of the patch; encrypting 640 the determined patches with an encryption method that corresponds to a pre-defined authorization level; and encoding 650, into a bitstream of a corresponding patch, information on the used encryption method.
An apparatus according to an embodiment comprises means for receiving a video presentation frame, where the video presentation represents three-dimensional data; means for generating one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; means for determining which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on the content of the patch; means for encrypting the determined patches with an encryption method that corresponds to a pre-defined authorization level; and means for encoding, into a bitstream of a corresponding patch, information on the used encryption method. The means comprise at least one processor and a memory including computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 6 according to various embodiments.
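The encoder-side flow of Figure 6 can be sketched as below; all helper callables (patch generation, content analysis, encryption and bitstream writing) are hypothetical and merely stand in for the corresponding processing stages:

```python
def encode_frame(frame, generate_patches, detect_sensitive_content,
                 level_for, encrypt, write_patch):
    """Sketch of steps 610-650: patch generation, content-based selection,
    per-patch encryption and signaling of the used encryption method."""
    for patch in generate_patches(frame):              # 620
        label = detect_sensitive_content(patch)        # 630, e.g. 'face', 'document' or None
        if label is not None:
            level = level_for(label)                   # pre-defined authorization level
            patch.data = encrypt(patch.data, level)    # 640
            patch.encryption_info = {"authorization_level": level}
        write_patch(patch)                             # 650, info goes into the patch bitstream
```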
Fig. 7 is a flowchart illustrating a method according to an embodiment. A method comprises receiving 710 an encoded bitstream; for each patch of the encoded bitstream 720:
• decoding a patch from the encoded bitstream;
• for every patch that is determined to be an encrypted patch 730:
o decoding from the bitstream information on a used encryption method for the patch;
o decrypting the patch with a decryption method that corresponds to the used encryption method;
and generating 740 a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
An apparatus according to an embodiment comprises means for receiving an encoded bitstream; means for decoding each patch of the encoded bitstream; means for decoding, for a patch that is determined to be an encrypted patch, from the bitstream information on a used encryption method for the patch; means for decrypting the patch with a decryption method that corresponds to the used encryption method; and means for generating a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame. The means comprise at least one processor and a memory including computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 7 according to various embodiments.
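The corresponding decoder-side flow of Figure 7 can be sketched in the same spirit; again the helper callables are placeholders for the actual parsing, decryption and reconstruction stages:

```python
def decode_bitstream(bitstream, read_patches, read_encryption_info,
                     decrypt, reconstruct_frame):
    """Sketch of steps 710-740: per-patch decoding, conditional decryption
    and reconstruction of the video presentation frame."""
    decoded = []
    for patch in read_patches(bitstream):              # 720
        if patch.is_encrypted:                         # 730
            info = read_encryption_info(bitstream, patch)
            patch.data = decrypt(patch.data, info)     # matching decryption method
        decoded.append(patch)
    return reconstruct_frame(decoded)                  # 740
```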
Figure 8 shows a system and apparatuses for viewing volumetric video according to the present embodiments. The task of the system is to capture sufficient visual and auditory information from a specific location such that a convincing reproduction of the experience, or presence, of being in that location can be achieved by one or more viewers physically located in different locations and optionally at a later time. Such reproduction requires more information than can be captured by a single camera or microphone, in order that a viewer can determine the distance and location of objects within the scene using their eyes and their ears. To create a pair of images with disparity, two camera sources are used. In a similar manner, for the human auditory system to be able to sense the direction of sound, at least two microphones are used (the commonly known stereo sound is created by recording two audio channels). The human auditory system can detect cues, e.g. the timing difference between the audio signals, to detect the direction of sound.
The system of Figure 8 may comprise three parts: image sources 801, 803, a server 805 and a rendering device 807. An image source can be a video capture device 801 comprising two or more cameras with overlapping fields of view, so that regions of the view around the video capture device are captured from at least two cameras. The video capture device 801 may comprise multiple microphones (not shown in the figure) to capture the timing and phase differences of audio originating from different directions. The video capture device 801 may comprise a high-resolution orientation sensor so that the orientation (direction of view) of the plurality of cameras can be detected and recorded. The video capture device 801 comprises or is functionally connected to a computer processor PROCESSOR1 and memory MEMORY1, the memory comprising computer program code for controlling the video capture device 801. The image stream captured by the video capture device 801 may be stored on a memory MEMORY1 and/or removable memory MEMORY9 for use in another device, e.g. in a viewer, and/or transmitted to a server 805 using a communication interface COMMUNICATION1.
Alternatively, or in addition to the video capture device 801 creating an image stream, or a plurality of such, one or more image source devices 803 of synthetic images may be present in the system. Such an image source device 803 of synthetic images may use a computer model of a virtual world to compute the various image streams it transmits. For example, the image source 803 may compute N video streams corresponding to N virtual cameras located at a virtual viewing position. When such a synthetic set of video streams is used for viewing, the viewer may see a three-dimensional virtual world. The image source device 803 comprises or is functionally connected to a computer processor PROCESSOR3 and memory MEMORY3, the memory comprising computer program code for controlling the image source device 803. There may be a storage, processing and data stream serving network in addition to the video capture device 801. For example, there may be a server 805 or a plurality of servers storing the output from the video capture device 801 or image source device 803. The server 805 comprises or is functionally connected to a computer processor PROCESSOR5 and memory MEMORY5, the memory comprising computer program code for controlling the server 805. The server 805 may be connected by a wired or wireless network connection, or both, to sources 801 and/or 803, as well as to the viewer devices 809, over the communication interface COMMUNICATION5.
For viewing the captured or created video content, there may be one or more viewer devices 809 (a.k.a. playback devices). These viewer devices 809 may have one or more displays, and may comprise or be functionally connected to a rendering module 807. The rendering module 807 comprises a computer processor PROCESSOR7 and memory MEMORY7, the memory comprising computer program code for controlling the viewer devices 809. The viewer devices 809 may comprise a data stream receiver for receiving a video data stream from a server and for decoding the video data stream. The data stream may be received over a network connection through a communications interface, or from a memory device 811 such as a memory card. The viewer devices 809 may have a graphics processing unit for processing the data into a format suitable for viewing. The viewer device 809 can be a high-resolution stereo-image head-mounted display for viewing the rendered stereo video sequence. The head-mounted display may have an orientation detector 813 and stereo audio headphones. According to an embodiment, the viewer device 809 is a display enabled with 3D technology (for displaying stereo video), and the rendering device 807 may have a head-orientation detector 815 connected to it. Alternatively, the viewer device 809 may comprise a 2D display, since the volumetric video rendering can be done in 2D by rendering the viewpoint from a single eye instead of a stereo eye pair. Any of the devices 801, 803, 805, 807, 809 may be a computer or a portable computing device, or be connected to such. Such devices may have computer program code for carrying out methods according to various examples described in this text.
As mentioned, the viewer device can be a head-mounted display (HMD). The head-mounted display comprises two screen sections or two screens for displaying the left and right eye images. The displays are close to the eyes, and therefore lenses are used to make the images easily viewable and to spread the images to cover as much of the eyes' field of view as possible. The device is attached to the head of the user so that it stays in place even when the user turns their head. The device may have an orientation detecting module for determining the head movements and the direction of the head. The head-mounted display gives a three-dimensional (3D) perception of the recorded/streamed content to a user.
The video material captured or generated by any of the image sources can be provided to an encoder that transforms an input video into a compressed representation suited for storage/transmission. The compressed video is provided to a decoder that can uncompress the compressed video representation back into a viewable form. The encoder may be located in the image sources or in the server. The decoder may be located in the server or in the viewer, such as an HMD. The encoder may discard some information in the original video sequence in order to represent the video in a more compact form (that is, at a lower bitrate). An example of an encoding process is illustrated in Figure 9. Figure 9 illustrates an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T-1); a quantization (Q) and inverse quantization (Q-1); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS) and filtering (F). An example of a decoding process is illustrated in Figure 10. Figure 10 illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T-1); an inverse quantization (Q-1); entropy decoding (E-1); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).
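The relationships between these signals in such a hybrid coding loop (a summary consistent with the notation of Figures 9 and 10, not a reproduction of the figures themselves) can be written as:

```latex
\begin{align*}
D_n  &= I_n - P'_n && \text{prediction error}\\
D'_n &= T^{-1}\!\left(Q^{-1}\!\left(Q\left(T(D_n)\right)\right)\right) && \text{reconstructed prediction error}\\
I'_n &= P'_n + D'_n && \text{preliminary reconstructed image}\\
R'_n &= F(I'_n) && \text{final reconstructed image after filtering}
\end{align*}
```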
The various embodiments may provide advantages. For example, the present embodiments enable applying different encryption to each patch separately. In addition, the present embodiments enable content to be broadcast to a variety of end users, limiting their access to different parts of the scene based on their authorization level.
The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The computer program code comprises one or more operational characteristics. Said operational characteristics are defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system according to an embodiment comprises receiving a video presentation frame, where the video presentation represents three-dimensional data; generating one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame; determining which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on the content of the patch; encrypting the determined patches with an encryption method that corresponds to a pre-defined authorization level; and encoding, into a bitstream of a corresponding patch, information on the used encryption method. A programmable operational characteristic of the system according to another embodiment comprises receiving an encoded bitstream; for each patch of the encoded bitstream: decoding a patch from the encoded bitstream; for every patch that is determined to be an encrypted patch, decoding from the bitstream information on a used encryption method for the patch and decrypting the patch with a decryption method that corresponds to the used encryption method; and generating a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with one another. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.

Claims:
1. A method, comprising:
- receiving a video presentation frame, where the video presentation represents three-dimensional data;
- generating one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame;
- determining which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch;
- encrypting the determined patches with an encryption method that corresponds to a pre-defined authorization level; and
- encoding into a bitstream of a corresponding patch, information on the used encryption method.
2. The method according to claim 1, wherein the information on the encryption method comprises the pre-defined authorization level and a type of the corresponding encryption method.
3. The method according to claim 1 or 2, wherein the encryption method comprises one or more different types of encryption methods.
4. The method according to any of the claims 1 to 3, wherein the authorization level is defined based on one or more of the following criteria: privacy, age limit, content’s discretion, subject to a charge.
5. The method according to any of the claims 1 to 4, wherein the information on the encryption method is signaled as a supplemental enhancement information (SEI) message.
6. A method for decoding, comprising
- receiving an encoded bitstream;
- for each patch of the encoded bitstream:
• decoding a patch from the encoded bitstream;
• for every patch that is determined to be an encrypted patch:
o decoding from the bitstream information on a used encryption method for the patch;
o decrypting the patch with a decryption method that corresponds to the used encryption method; and
- generating a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
7. The method according to claim 6, wherein the used encryption method comprises information on an authorization level.
8. The method according to claim 7, wherein the authorization level defines a type of an encryption method.
9. The method according to any of the claims 6 to 8, wherein the authorization level is defined based on one or more of the following criteria: privacy, age limit, content’s discretion, subject to a charge.
10. An apparatus comprising,
- means for receiving a video presentation frame, where the video presentation represents three-dimensional data;
- means for generating one or more patches from the video presentation frame, wherein the patches represent different parts of the content of the video presentation frame;
- means for determining which patch(es) of the generated one or more patches are to be encrypted, wherein the determination is based on a content of the patch;
- means for encrypting the determined patches with an encryption method that corresponds to a pre-defined authorization level; and
- means for encoding into a bitstream of a corresponding patch, information on the used encryption method.
11. The apparatus according to claim 10, wherein the information on the encryption method comprises the pre-defined authorization level and a type of the corresponding encryption method.
12. The apparatus according to claim 10 or 11, wherein the encryption method comprises one or more different types of encryption methods.
13. The apparatus according to any of the claims 10 to 12, wherein the authorization level is defined based on one or more of the following criteria: privacy, age limit, content’s discretion, subject to a charge.
14. The apparatus according to any of the claims 10 to 13, wherein the information on the encryption method is signaled as a supplemental enhancement information (SEI) message.
15. An apparatus comprising,
- means for receiving an encoded bitstream;
- for each patch of the encoded bitstream:
• means for decoding a patch from the encoded bitstream;
• for every patch that is determined to be an encrypted patch:
o means for decoding from the bitstream information on a used encryption method for the patch;
o means for decrypting the patch with a decryption method that corresponds to the used encryption method; and
- means for generating a video presentation frame from the decoded patch(es), wherein the patches represent different parts of the content of the video presentation frame.
16. The apparatus according to claim 15, wherein the used encryption method comprises information on an authorization level.
17. The apparatus according to claim 16, wherein the authorization level defines a type of an encryption method.
18. The apparatus according to any of the claims 15 to 17, wherein the authorization level is defined based on one or more of the following criteria: privacy, age limit, content’s discretion, subject to a charge.
19. The apparatus according to any of the claims 10 to 18, comprising at least one processor, and a memory including computer program code.
20. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to implement a method according to any of the claims 1 to 9.
PCT/FI2020/050552 2019-09-20 2020-08-26 A method, an apparatus and a computer program product for video encoding and video decoding WO2021053261A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP20865714.8A EP4032314A4 (en) 2019-09-20 2020-08-26 A method, an apparatus and a computer program product for video encoding and video decoding

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20195788 2019-09-20
FI20195788 2019-09-20

Publications (1)

Publication Number Publication Date
WO2021053261A1 true WO2021053261A1 (en) 2021-03-25

Family

ID=74884125

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2020/050552 WO2021053261A1 (en) 2019-09-20 2020-08-26 A method, an apparatus and a computer program product for video encoding and video decoding

Country Status (2)

Country Link
EP (1) EP4032314A4 (en)
WO (1) WO2021053261A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080117295A1 (en) * 2004-12-27 2008-05-22 Touradj Ebrahimi Efficient Scrambling Of Regions Of Interest In An Image Or Video To Preserve Privacy
KR100781275B1 (en) * 2006-03-02 2007-11-30 엘지전자 주식회사 Method and apparatus for decoding of video image
CN106301760A (en) * 2016-08-04 2017-01-04 北京电子科技学院 A kind of 3D point cloud model encryption method based on chaotic maps
US20180189461A1 (en) 2016-12-31 2018-07-05 Entefy Inc. System and method of applying multiple adaptive privacy control layers to encoded media file types

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"V-PCC Codec Description, release 6.0", MPEG DOCUMENT MANAGEMENT SYSTEM, 126TH MPEG MEETING, GENEVA ISO/IEC JTC1/SC29/WG11, 5 July 2019 (2019-07-05), XP030222355, Retrieved from the Internet <URL:http://wg11.sc29.org> [retrieved on 20201202] *
ABU TAHA, MOHAMMED ET AL.: "End-to-End Real-Time ROI-based Encryption in HEVC Videos", PROCEEDINGS OF THE 26TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2018), 3 December 2018 (2018-12-03), pages 171 - 175, XP033461483, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/document/8553038> [retrieved on 20201203], DOI: 10.23919/EUSIPCO.2018.8553038 *
FARAJALLAH, MOUSA ET AL.: "ROI encryption for the HEVC coded video contents", PROCEEDINGS OF THE 2015 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP 2015), 10 December 2015 (2015-12-10), pages 3096 - 3100, XP032826995, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/abstract/document/7351373> [retrieved on 20201203], DOI: 10.1109/ICIP.2015.7351373 *
See also references of EP4032314A4
ZHANG, XING ET AL.: "A Lightweight Encryption Method for Privacy Protection in Surveillance Videos", IEEE ACCESS, vol. 6, 2 April 2018 (2018-04-02), pages 18074 - 18087, XP011681671, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/document/8329515> [retrieved on 20201203], DOI: 10.1109/ACCESS.2018.2820724 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116436619A (en) * 2023-06-15 2023-07-14 武汉北大高科软件股份有限公司 Method and device for verifying streaming media data signature based on cryptographic algorithm
CN116436619B (en) * 2023-06-15 2023-09-01 武汉北大高科软件股份有限公司 Method and device for verifying streaming media data signature based on cryptographic algorithm

Also Published As

Publication number Publication date
EP4032314A1 (en) 2022-07-27
EP4032314A4 (en) 2023-07-19

Similar Documents

Publication Publication Date Title
EP3751857A1 (en) A method, an apparatus and a computer program product for volumetric video encoding and decoding
US11509933B2 (en) Method, an apparatus and a computer program product for volumetric video
US11599968B2 (en) Apparatus, a method and a computer program for volumetric video
US11202086B2 (en) Apparatus, a method and a computer program for volumetric video
US20230068178A1 (en) A method, an apparatus and a computer program product for volumetric video encoding and decoding
EP3614674A1 (en) An apparatus, a method and a computer program for volumetric video
US11659151B2 (en) Apparatus, a method and a computer program for volumetric video
EP3777185A1 (en) An apparatus, a method and a computer program for volumetric video
US20100309287A1 (en) 3D Data Representation, Conveyance, and Use
Zhu et al. View-dependent dynamic point cloud compression
WO2019158821A1 (en) An apparatus, a method and a computer program for volumetric video
WO2019243663A1 (en) An apparatus, a method and a computer program for volumetric video
JP7344988B2 (en) Methods, apparatus, and computer program products for volumetric video encoding and decoding
WO2021191495A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2019115867A1 (en) An apparatus, a method and a computer program for volumetric video
WO2021053262A1 (en) An apparatus, a method and a computer program for volumetric video
WO2021053261A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
EP4049452B1 (en) Embedding data within transformed coefficients using bit partitioning operations
EP4133719A1 (en) A method, an apparatus and a computer program product for volumetric video coding
WO2021191500A1 (en) An apparatus, a method and a computer program for volumetric video
WO2019185983A1 (en) A method, an apparatus and a computer program product for encoding and decoding digital volumetric video
EP3804334A1 (en) An apparatus, a method and a computer program for volumetric video
WO2019162564A1 (en) An apparatus, a method and a computer program for volumetric video
WO2022074286A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
EP3680859A1 (en) An apparatus, a method and a computer program for volumetric video

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20865714

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020865714

Country of ref document: EP

Effective date: 20220420