WO2018211171A1 - An apparatus, a method and a computer program for video coding and decoding - Google Patents

An apparatus, a method and a computer program for video coding and decoding

Info

Publication number
WO2018211171A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
viewer
probable
interest
basis
Prior art date
Application number
PCT/FI2018/050349
Other languages
English (en)
Inventor
Payman Aflaki Beni
Kimmo Roimela
Jaakko KERÄNEN
Emre Aksu
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy
Publication of WO2018211171A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/40Tree coding, e.g. quadtree, octree
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/08Volume rendering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/001Model-based coding, e.g. wire frame
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00Indexing scheme for image generation or computer graphics
    • G06T2210/36Level of detail

Definitions

  • the present invention relates to a method for a multi-camera unit, an apparatus for a multi-camera unit, and computer program for a multi-camera unit.
  • 360-degree viewing camera devices with multiple lenses per viewing direction are becoming more and more popular and affordable for both consumer and professional usage.
  • multi-camera captured scenes can be reconstructed in three dimensions (3D) if the camera location and pose information is known.
  • reconstruction's quality and coverage may depend on the distribution of the cameras and their capture capabilities.
  • a multi-camera unit comprises two or more cameras capable of capturing images and/or video.
  • the cameras may be positioned in different ways with respect to each other camera.
  • the cameras may be located at a short distance from each other and they may view in the same direction so that the two-camera unit can provide a stereo view of the environment.
  • the multi-camera unit may comprise more than two cameras which are located in an omnidirectional manner. Hence, the viewing angle of such a multi-camera unit may even be 360°. In other words, the multi-camera unit may be able to view practically all around the multi-camera unit.
  • Each camera of the multi-camera unit may produce images and/or video information i.e. visual information.
  • the plurality of visual information captured by different cameras may be combined together to form an output image and/or video.
  • Volumetric video may be captured using one or more multi-camera devices (MCDs). When multiple MCDs are in use, the captured footage may be synchronized so that the MCDs provide different viewpoints in the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and look around to observe different parts of the world. The volumetric presentation of the scene is constructed based on the information captured by said several MCDs.
  • a method comprises determining a three-dimensional (3D) volumetric representation of a scene as a plurality of voxels on the basis of input streams of at least a first multicamera device; determining, on the basis of one or more parameters indicating viewer's probable interest with the scene, at least a first set of voxels as a first volume of interest (VOI); and sub-sampling voxels of the scene residing outside said at least first VOI.
  • 3D three-dimensional
  • the method further comprises determining a parameter indicating the viewer's probable interest with the scene on the basis of probable viewing directions of said at least first multicamera device and a second multicamera device, said determining comprising determining a most probable viewing direction of the first multicamera device and a most probable viewing direction of the second multicamera device; and determining the parameter indicating the viewer's probable interest with the scene as at least one intersection point of the most probable viewing directions of the first and the second multicamera devices.
  • the parameter indicating the viewer's probable interest with the scene indicates a volume around said at least one intersection point.
  • the method further comprises determining a parameter indicating the viewer's probable interest with the scene on the basis of one or more of the following: an amount of high frequency components in regions of the 3D volumetric representation of the scene; motion detected within the scene; and depth information of the scene.
  • the method further comprises defining the volume of interest between at least two intersection points, wherein the at least two intersection points are selected based on their closeness to the location and viewing direction of the viewer out of a plurality of intersection points.
  • the method further comprises determining a parameter indicating the viewer's probable interest with the scene on the basis of voxel distribution within said 3D volumetric representation of the scene, said determining comprising arranging the voxels of the 3D volumetric representation of the scene in a voxel octree representation; and determining the parameter indicating the viewer's probable interest with the scene on the basis of octree nodes having deep subtrees.
  • said sub-sampling comprises selecting a downsampled child node of a voxel to be sub-sampled from the voxel octree.
  • the method further comprises varying the amount of the subsampling based on the distance of the viewer from the volume of interest such that the larger the distance from the VOI, the coarser downsampling is applied.
  • the method further comprises determining a parameter indicating the viewer's probable interest with the scene on the basis of viewer's gaze tracking and view frustum obtained from a viewing apparatus used by the viewer.
  • the method further comprises determining a parameter indicating the viewer's probable interest with the scene on the basis of 2D shapes recognized in the scene.
  • the method further comprises determining a parameter indicating the viewer's probable interest with the scene on the basis of 3D shapes recognized in the scene.
  • the method further comprises obtaining tuning parameters regarding any technical limitation of an involved system; and adjusting encoding parameters according to said limitation.
  • the method further comprises providing a plurality of presentations for at least one VOI; obtaining at least one parameter defining the viewer's viewing perspective relative to the VOI; and selecting one of said plurality of presentations to be presented to the viewer on the basis of the viewer's viewing perspective relative to the VOI.
  • the second and the third aspects relate to an apparatus and a computer readable storage medium stored with code thereon, which are arranged to carry out the above method and one or more of the embodiments related thereto.
  • Figure 1a shows an example of a multi-camera unit as a simplified block diagram, in accordance with an embodiment
  • Figure 1b shows a perspective view of a multi-camera unit, in accordance with an embodiment
  • Figure 2 shows a simplified block diagram of a system comprising a plurality of multi-camera units
  • Figures 3a - 3c show an example illustrating the principle of volumetric video
  • Figure 4 shows a flowchart of an encoding method in accordance with an embodiment
  • Figures 5a, 5b show an example illustrating the principle of Most Probable Viewing Volume in accordance with an embodiment
  • Figure 6 shows a schematic block diagram of an exemplary apparatus or electronic device
  • Figure 7 shows an apparatus according to an example embodiment
  • Figure 8 shows an example of an arrangement for wireless communication comprising a plurality of apparatuses, networks and network elements.
  • Figure 1a illustrates an example of a multi-camera unit 100, which comprises two or more cameras 102.
  • the number of cameras 102 is eight, but may also be less than eight or more than eight.
  • Each camera 102 is located at a different location in the multi-camera unit and may have a different orientation with respect to other cameras 102.
  • the cameras 102 may have an omnidirectional constellation so that the multi-camera unit has a 360° viewing angle in 3D space.
  • such a multi-camera unit 100 may be able to see every direction of a scene so that each spot of the scene around the multi-camera unit 100 can be viewed by at least one camera 102.
  • any two cameras 102 of the multi-camera unit 100 may be regarded as a pair of cameras 102.
  • a multi-camera unit of two cameras has only one pair of cameras
  • a multi-camera unit of three cameras has three pairs of cameras
  • a multi-camera unit of four cameras has six pairs of cameras, etc.
  • a multi-camera unit 100 comprising N cameras 102, where N is an integer greater than one, has N(N-1)/2 pairs of cameras 102. For example, the eight-camera unit described above has 8·7/2 = 28 camera pairs. Accordingly, images captured by the cameras 102 at a certain time may be considered as N(N-1)/2 pairs of captured images.
  • the multi-camera unit 100 of Figure 1a may also comprise a processor 104 for controlling the operations of the multi-camera unit 100, a memory 106 for storing data and computer code to be executed by the processor 104, and a transceiver 108 for communicating with, for example, a communication network and/or other devices in a wireless and/or wired manner.
  • the multi-camera unit 100 may further comprise a user interface (UI) 110 for displaying information to the user, for generating audible signals and/or for receiving user input.
  • UI user interface
  • the multi-camera unit 100 need not comprise each feature mentioned above, or may comprise other features as well.
  • the multi-camera unit 100 of Figure 1a may also comprise devices 128 to calculate the ranging information, i.e. the depth of the scene. Such sensors enable the device to calculate the respective depth information of the scene content from the multi-camera unit. Such information results in creating a depth map and may be used in the subsequent processes of this application.
  • a depth map image may be considered to represent the values related to the distance of the surfaces of the scene objects from a reference location, for example a view point of an observer.
  • a depth map image is an image that may include per-pixel depth information or any similar information.
  • each sample in a depth map image represents the distance of the respective texture sample or samples from the plane on which the camera lies. In other words, if the z axis is along the shooting axis of the cameras (and hence orthogonal to the plane on which the cameras lie), a sample in a depth map image represents the value on the z axis.
  • since depth map images are generated containing a depth value for each pixel in the image, they can be depicted as gray-level images or as images containing only the luma component.
  • chroma components of the depth map images may be set to a pre-defined value, such as a value indicating no chromaticity, e.g. 128 in typical 8-bit chroma sample arrays, where a zero chromaticity level is arranged into the middle of the value range.
  • chroma components of depth map images may be used to contain other picture data, such as any type of monochrome auxiliary pictures, such as alpha planes.
  • in the quantization of depth values, N denotes the number of bits representing the depth map values, and Znear and Zfar are the distances of the closest and farthest objects in the scene to the camera (mostly available from the content provider), respectively.
  • Each luma sample value in a coded depth view component represents an inverse of real-world distance (Z) value, i.e. 1/Z, normalized in the dynamic range of the luma samples, such as to the range of 0 to 255, inclusive, for 8-bit luma representation.
  • the normalization may be done in a manner where the quantization of 1/Z is uniform in terms of disparity.
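  • One common way to express this normalization (a sketch of the mapping described above, using the N, Znear and Zfar introduced earlier; the exact formula is not reproduced in this document) maps a real-world depth Z to an N-bit depth sample d so that d = 2^N - 1 at Znear, d = 0 at Zfar, and equal steps in d correspond to equal steps in 1/Z, i.e. in disparity:

```latex
d = \operatorname{round}\!\left( (2^{N}-1) \cdot
    \frac{\dfrac{1}{Z}-\dfrac{1}{Z_{\mathrm{far}}}}
         {\dfrac{1}{Z_{\mathrm{near}}}-\dfrac{1}{Z_{\mathrm{far}}}} \right)
```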
  • Each luma sample value in a coded depth view component represents a real-world distance (Z) value normalized in the dynamic range of the luma samples, such as to the range of 0 to 255, inclusive, for 8-bit luma representation.
  • Each luma sample value in a coded depth view component represents a disparity or parallax value from the present depth view to another indicated or derived depth view or view position.
  • Figure 1a also illustrates some operational elements which may be implemented, for example, as computer code in the software of the processor, in hardware, or in both.
  • An occlusion determination element 114 may determine which areas of a panorama image are blocked (occluded) by other multi-camera unit(s); a 2D to 3D converting element 116 may convert 2D images to 3D images and vice versa; and an image reconstruction element 118 may reconstruct images so that occluded areas are reconstructed using image information of the blocking multi-camera unit 100.
  • the multi-camera units 100 comprise a location determination unit 124 and an orientation determination unit 126, wherein these units may provide the location and orientation information to the system.
  • the location determination unit 124 and the orientation determination unit 126 may also be implemented as one unit. The operation of the elements will be described later in more detail. It should be noted that there may also be other operational elements in the multi-camera unit 100 than those depicted in Figure 1a and/or some of the above mentioned elements may be implemented in some other part of a system than the multi-camera unit 100.
  • Figure 1b shows a perspective view of an example of an apparatus comprising the multi-camera unit 100.
  • seven cameras 102a–102g can be seen, but the multi-camera unit 100 may comprise even more cameras which are not visible from this perspective.
  • Figure 1b also shows two microphones 112a, 112b, but the apparatus may also comprise one microphone or more than two microphones.
  • the multi-camera unit 100 may be controlled by another device (not shown), wherein the multi-camera unit 100 and the other device may communicate with each other and a user may use a user interface of the other device for entering commands, parameters, etc. and the user may be provided information from the multi-camera unit 100 via the user interface of the other device.
  • a camera space, or camera coordinates, stands for the coordinate system of an individual camera 102, whereas a world space, or world coordinates, stands for the coordinate system of the multi-camera unit 100 as a whole.
  • An optical flow may be used to describe how objects, surfaces, and edges in a visual scene move or transform when an observing point moves from the location of one camera to the location of another camera. In fact, there need not be any actual movement; it may virtually be determined how the view of the scene might change when a viewing point is moved from one camera to another.
  • a parallax can be regarded as a displacement or difference in the apparent position of an object when it is viewed along two different lines of sight. The parallax may be measured by the angle or semi-angle of inclination between those two lines.
  • Intrinsic parameters 120 may comprise, for example, focal length, image sensor format, and principal point.
  • Extrinsic parameters 122 denote the coordinate system transformations from 3D world space to 3D camera space. Equivalently, the extrinsic parameters may be used to define the position of a camera center and camera's heading in world space.
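  • As an illustration of this transformation (generic notation, not taken from this document): with R the world-to-camera rotation and t the translation, a world-space point x_world maps to camera space as below, and the camera center C in world space follows from the same extrinsic parameters, while the focal length and principal point enter only the subsequent projection onto the image plane:

```latex
\mathbf{x}_{\mathrm{cam}} = R\,\mathbf{x}_{\mathrm{world}} + \mathbf{t},
\qquad
\mathbf{C} = -R^{\top}\mathbf{t}
```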
  • Figure 2 is a simplified block diagram of a system 200 comprising a plurality of multi-camera units 130, 140, 150. It should be noted here that different multi-camera units are referred to with different reference numbers for clarity, although each multi-camera unit 130, 140, 150 may have similar elements to those of the multi-camera unit 100 of Figure 1a. Furthermore, the individual cameras of each multi-camera unit 130, 140, 150 will be referred to by different reference numerals 132, 132a–132g, 142, 142a–142g, 152, 152a–152g, although each camera may be similar to the cameras 102a–102g of the multi-camera unit 100 of Figure 1a.
  • reference numerals 132, 142, 152 will be used when any of the cameras of the multi-camera unit 130, the multi-camera unit 140, and the multi- camera unit 150 will be referred to, respectively.
  • reference numerals 132a— 132g, 142a— 142g, 152a— 152g will be used when a particular camera of the multi-camera unit 130, the multi-camera unit 140, and the multi-camera unit 150 will be referred to, respectively.
  • although Figure 2 only depicts three multi-camera units 130, 140, 150, the system may have two multi-camera units 130, 140 or more than three multi-camera units.
  • the system 200 has information about the location and orientation of each of the multi-camera units 130, 140, 150 of the system.
  • the location and orientation information may have been stored into a camera database 210. This information may have been entered manually or the system 200 may comprise elements which can determine the location and orientation of each of the multi-camera units 130, 140, 150 of the system. If the location and/or the orientation of any of the multi-camera units 130, 140, 150 changes, the changed location and/or orientation information may be updated in the camera database 210.
  • the system 200 may be controlled by a controller 202, which may be a server or another appropriate element capable of communicating with the multi-camera units 130, 140, 150 and the camera database 210.
  • the location and/or the orientation of the multi-camera units 130, 140, 150 may not be stored into the database 210 but only to each individual multi-camera unit 130, 140, 150. Hence, the location and/or the orientation of the multi-camera units 130, 140, 150 may be requested from the multi-camera units 130, 140, 150 when needed. As an example, if the first multi-camera unit 130 needs to know the location and orientation of the second multi-camera unit 140, the first multi-camera unit 130 may request that information from the second multi-camera unit 140. If some information regarding the second multi-camera unit 140 is still needed, the first multi-camera unit 130 may request the missing information from the controller 202, for example.
  • the multi-camera system may be used to reconstruct multi-camera captured scenes in 3D if the camera locations and pose information are accurately known. Such a reconstruction's quality and coverage depend on the distribution of the cameras and their capture capabilities.
  • Volumetric video may be captured using one or more multi-camera devices (MCDs). When multiple MCDs are in use, the captured footage may be synchronized in the controller 202 so that the MCDs provide different viewpoints in the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and look around to observe different parts of the world.
  • MCDs multi-camera devices
  • the image sequence of Figures 3a - 3c demonstrates the basic idea underlying volumetric video.
  • the controller obtains a plurality of camera frame images (shown in Fig. 3a), depth maps, and camera 3D positions from the plurality of MCDs.
  • the controller constructs an animated 3D model of the world (shown in Fig. 3b) out of this recorded data.
  • HMD head-mounted display
  • the viewer is then able to move within the constructed 3D model, and choose their position and orientation within the model (shown in Fig. 3c).
  • the constructed 3D model does not necessarily have to contain video information from the MCDs only, but the constructed 3D model may in addition or alternatively contain objects of augmented reality (AR) or virtual reality (VR).
  • AR augmented reality
  • VR virtual reality
  • the scene is captured using several MCDs, each preferably covering 360° and a volumetric presentation of the scene is constructed based on the information captured by said several MCDs.
  • a three-dimensional (3D) volumetric representation of a scene is determined (400) as a plurality of voxels on the basis of input streams of at least a first multicamera device; on the basis of one or more parameters indicating viewer's probable interest with the scene, at least a first set of voxels as a first volume of interest (VOI) is determined (402); and voxels of the scene residing outside said at least first VOI are sub-sampled (404).
  • 3D three-dimensional
  • At least one but preferably a plurality (i.e. 2, 3, 4, 5 or more) of multicamera devices (MCD) are used to capture a 3D video representation of a scene.
  • the multicamera devices are distributed in different locations with respect to the scene, and therefore each multicamera device captures a different 3D video representation of the scene.
  • the 3D video representations captured by each MCD are used as input streams for creating a 3D volumetric representation of the scene, said 3D volumetric representation comprising a plurality of voxels. Voxels may be formed directly from the captured 3D points.
  • Voxels may also be formed through the construction of a sparse voxel octree (SVO). Each leaf of such a tree represents a solid voxel in world space; the root node of the tree represents the bounds of the world.
  • SVO sparse voxel octree
  • SVO construction has the following steps: 1) map each input depth map to a world space point cloud, where each pixel of the depth map is mapped to one or more 3D points; 2) determine voxel attributes such as color and surface normal vector by examining the neighborhood of the source pixel(s) in the camera images and the depth map; 3) determine the size of the voxel based on the depth value from the depth map and the resolution of the depth map; 4) determine the SVO level for the solid voxel as a function of its size relative to the world bounds; 5) determine the voxel coordinates on that level relative to the world bounds; 6) create new and/or traverse existing SVO nodes until arriving at the determined voxel coordinates; 7) insert the solid voxel as a leaf of the tree, possibly replacing or merging attributes from a previously existing voxel at those coordinates. Nevertheless, the sizes of the voxels within the 3D volumetric representation of the scene may differ from each other.
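  • A minimal Python sketch of steps 1-7 above, assuming a pinhole camera model, a cubic world volume and per-pixel colour attributes; the class and function names (SVONode, SparseVoxelOctree, voxelize_depth_map) are illustrative and not taken from this document.

```python
import numpy as np

class SVONode:
    __slots__ = ("children", "attrs")
    def __init__(self):
        self.children = [None] * 8   # eight octants (step 6 traverses/creates these)
        self.attrs = None            # e.g. {"color": ...} for solid voxels (step 2)

class SparseVoxelOctree:
    def __init__(self, world_min, world_size, max_levels=16):
        self.root = SVONode()                            # root node represents the world bounds
        self.world_min = np.asarray(world_min, dtype=float)
        self.world_size = float(world_size)              # cubic world bounds
        self.max_levels = max_levels

    def level_for_voxel_size(self, voxel_size):
        # Step 4: SVO level as a function of the voxel size relative to the world bounds.
        level = int(np.ceil(np.log2(self.world_size / max(voxel_size, 1e-9))))
        return int(np.clip(level, 1, self.max_levels))

    def insert(self, point, voxel_size, attrs):
        # Steps 5-7: voxel coordinates on the level, create/traverse nodes, insert the leaf.
        level = self.level_for_voxel_size(voxel_size)
        rel = (np.asarray(point, dtype=float) - self.world_min) / self.world_size
        coords = np.clip((rel * (1 << level)).astype(int), 0, (1 << level) - 1)
        node = self.root
        for bit in reversed(range(level)):
            octant = (((coords[0] >> bit) & 1) << 2 |
                      ((coords[1] >> bit) & 1) << 1 |
                      ((coords[2] >> bit) & 1))
            if node.children[octant] is None:
                node.children[octant] = SVONode()
            node = node.children[octant]
        node.attrs = attrs          # replace a previously existing voxel at these coordinates

def voxelize_depth_map(svo, depth, color, fx, fy, cx, cy, cam_to_world):
    # Step 1: map each depth pixel to a world-space point (pinhole back-projection).
    # Step 2: take the colour at the source pixel as the voxel attribute.
    # Step 3: voxel size from the depth value and the depth-map resolution
    #         (approximate footprint of one pixel at distance z).
    h, w = depth.shape
    for v in range(h):
        for u in range(w):
            z = float(depth[v, u])
            if z <= 0.0:
                continue
            cam_pt = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
            world_pt = (cam_to_world @ cam_pt)[:3]
            svo.insert(world_pt, voxel_size=z / fx, attrs={"color": color[v, u]})
```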
  • the 3D video representations captured by each MCD comprise video information which is more relevant to a viewer and video information which is less relevant to the viewer. For example, on the basis of the locations of the MCDs in respect to the scene, the spatial content of the video information, reactions of the viewer, etc., various parameters indicating one or more areas of the viewer's probable interest with the scene may be determined. These one or more areas may be linked to one or more sets of voxels representing one or more volumes of interest (VOI). The voxels residing outside said one or more VOIs therefore represent video information which is less relevant to the viewer or of less interest for the viewer. For enhancing the encoding efficiency, the amount of video information to be encoded is decreased by sub-sampling (e.g. spatially downsampling) the voxels residing outside said one or more VOIs.
  • sub-sampling e.g. spatially downsampling
  • the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of probable viewing directions of said at least first and second multicamera devices, wherein said determining comprises determining a most probable viewing direction of the first multicamera device and a most probable viewing direction of the second multicamera device, and determining the parameter indicating the viewer's probable interest with the scene as at least one intersection point of the most probable viewing directions of the first and the second multicamera devices.
  • a most probable viewing direction may be defined at each moment of operation, for example as the direction where the most number of camera units of the MCD are focused.
  • MPVD most probable viewing direction
  • one or more intersection points of the most probable viewing directions may be found. Such intersection points are expected to be the one or more areas which users are most probably interested in watching.
  • Around an intersection point, a VOI may be defined. For example, if the intersection point refers to a location of a display, then the whole display may be considered as a VOI. Similarly, if the intersection point refers to a location of a person or a car, the whole person or car may be considered as the VOI.
  • It may also happen that the MPVDs do not cross at any point, for example when the MPVDs of different MCDs are referring to different parts of the same object. In such a case, the intersection point may be selected based on the location where the MPVDs pass by each other with the least distance.
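  • A minimal sketch of finding such an intersection (or closest-approach) point, assuming each MPVD is modelled as a ray with an origin and a direction; the function name mpvd_intersection is illustrative and not taken from this document.

```python
import numpy as np

def mpvd_intersection(o1, d1, o2, d2, eps=1e-9):
    # Returns the midpoint of the segment of closest approach between the two
    # MPVDs, which equals the intersection point whenever the directions cross.
    o1, d1, o2, d2 = (np.asarray(v, dtype=float) for v in (o1, d1, o2, d2))
    w0 = o1 - o2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if abs(denom) < eps:                  # parallel MPVDs: no meaningful crossing point
        return None
    t1 = (b * e - c * d) / denom          # parameter along the first MPVD
    t2 = (a * e - b * d) / denom          # parameter along the second MPVD
    p1, p2 = o1 + t1 * d1, o2 + t2 * d2   # closest points on the two directions
    return 0.5 * (p1 + p2)                # candidate area of probable viewer interest

# Example: two MPVDs crossing at (5, 0, 0)
# mpvd_intersection([0, 0, 0], [1, 0, 0], [5, -3, 0], [0, 1, 0]) -> array([5., 0., 0.])
```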
  • the parameter indicating the viewer's probable interest with the scene indicates a volume around said at least one intersection point.
  • the volume of interest is defined between at least two intersection points.
  • the volume of interest is defined so that the at least two intersection points are positioned inside the volume of interest.
  • a pre-defined shape, e.g. a sphere or a cube, is considered to cover the at least two intersection points.
  • the at least two intersection points are selected based on their closeness to the location and viewing direction of the viewer out of a plurality of intersection points.
  • the at least two intersection points may be found to belong to a 3D object and hence the whole object will be considered as the VOI. This happens for example if two intersection points belong to different parts of a racing car; the racing car will then be recognized as the object of interest and its whole volume will be selected as the VOI.
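  • A minimal sketch of forming such a VOI, assuming the intersection points and the viewer pose are given as 3D vectors; it keeps the two intersection points that are closest to the viewer and in front of the viewing direction, and returns a padded axis-aligned box covering them. The names and the padding heuristic are illustrative assumptions, not taken from this document.

```python
import numpy as np

def select_voi(intersection_points, viewer_pos, viewer_dir, padding=0.5):
    pts = np.asarray(intersection_points, dtype=float)
    viewer_pos = np.asarray(viewer_pos, dtype=float)
    viewer_dir = np.asarray(viewer_dir, dtype=float)
    viewer_dir = viewer_dir / np.linalg.norm(viewer_dir)

    # Keep only points in front of the viewer (positive projection on the viewing direction).
    offsets = pts - viewer_pos
    in_front = pts[offsets @ viewer_dir > 0.0]
    if len(in_front) < 2:
        return None

    # Pick the two candidate intersection points closest to the viewer location.
    dists = np.linalg.norm(in_front - viewer_pos, axis=1)
    chosen = in_front[np.argsort(dists)[:2]]

    # Cover both points with a padded, axis-aligned volume of interest.
    voi_min = chosen.min(axis=0) - padding
    voi_max = chosen.max(axis=0) + padding
    return voi_min, voi_max
```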
  • In some cases the MPVD is a volume rather than a single direction. This is illustrated in Figure 5a, where a Most Probable Viewing Volume (MPVV) of a single front-weighted omnidirectional camera 500 is determined on a density map of the image quality, where the darker the color of the density map, the better the image quality.
  • MPVV Most Probable Viewing Volume
  • Figure 5b illustrates the case for multiple cameras 500, 502, 504, where the aforementioned imaging quality factors overlap, producing a volume of sufficiently favorable imaging conditions in the middle of the cameras, marked as the MPVV.
  • Some of the following embodiments do not necessarily rely on knowing anything about the actual viewers of the video, relying only on what is most probable given any potential viewing pose at any given time. These embodiments are beneficial in offline/non-realtime processing scenarios and in reducing the total size of the stored data.
  • Both of these embodiment types, i.e. the ones not relying on knowledge of the current poses of the viewers and the ones relying on such knowledge, may be used together in an embodiment to obtain the benefits of both.
  • the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of an amount of high frequency components in regions of the 3D volumetric representation of the scene.
  • a greater amount of high frequency components (HFCs) in the scene indicates the parts of the scene with more detail.
  • HFCs high frequency components
  • the number of HFCs in an area may also be used as secondary indicia for determining VOIs in an algorithm merging this embodiment with other embodiments disclosed herein.
  • the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of voxel distribution within said 3D volumetric representation of the scene, said determining comprising arranging the voxels of the 3D volumetric representation of the scene in a voxel octree representation, and determining the parameter indicating the viewer's probable interest with the scene on the basis of octree nodes having deep subtrees.
  • the voxelization process may involve arranging the voxels of the 3D volumetric representation of the scene in a voxel octree representation.
  • the octree is a tree of nodes represented in a 3D space, where each node is divided into eight child nodes. Octrees may be used to partition a three-dimensional space by recursively subdividing it into eight octants.
  • a sparse voxel octree may be used, which describes a volume of space containing a set of solid voxels of varying sizes. Empty areas within the volume are absent from the tree, which is why it is called "sparse".
  • a volumetric video frame may be considered a complete SVO that models the world at a specific point in time in a video sequence.
  • Voxel attributes contain information like color, opacity, surface normal vectors, and surface material properties. These are referenced in the SVOs (e.g., color of a solid voxel), but can also be stored separately.
  • An SVO can also be mipmapped. This means that each level of the SVO is considered an averaged representation of the level below it. Ultimately the root node of the SVO is a representation of the entire world. In practice, this can be implemented by having each SVO node own a set of attributes that averages the corresponding attributes of all of the node's children. Mipmapped SVOs have the advantage that any given branch can be cut off at an arbitrary depth without losing attribute information; the mipmapped attributes sufficiently summarize the data that was cut off. Therefore, sub-sampling an SVO is a trivial operation.
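  • A minimal sketch of such mipmapping and of the resulting trivial sub-sampling, assuming the SVONode structure from the earlier sketch (eight-entry children list, attrs dictionary with a numeric "color" entry); the function names are illustrative and not taken from this document.

```python
import numpy as np

def mipmap(node):
    # Bottom-up pass: give every node an averaged "color" of its children,
    # so any branch can later be cut off without losing attribute information.
    child_colors = []
    for child in node.children:
        if child is not None:
            mipmap(child)
            if child.attrs is not None:
                child_colors.append(np.asarray(child.attrs["color"], dtype=float))
    if child_colors:
        node.attrs = {"color": np.mean(child_colors, axis=0)}

def cut_at_depth(node, max_depth):
    # Sub-sampling: discard every subtree below max_depth; the mipmapped
    # attributes of the surviving nodes summarize the data that was cut off.
    if max_depth <= 0:
        node.children = [None] * 8
        return
    for child in node.children:
        if child is not None:
            cut_at_depth(child, max_depth - 1)
```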
  • the structure of the overall voxel octree can be observed for certain characteristics, such as nodes that have deep subtrees. This information is available as a byproduct of the voxelization process. Such nodes can be considered as candidates for VOIs or as secondary indicia for determining VOIs in an algorithm merging this embodiment with other embodiments disclosed herein.
  • the above-described most probable viewing volume may be directly mapped to the voxel octree hierarchy: for example, any nodes fully enclosed within, or sufficiently overlapping with, the MPVV may be directly regarded as being part of the VOI.
  • the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of motion detected within the scene. Hence, the motion in the scene may be recognized and the areas having the highest amount of motion are considered to create the VOIs. It is preferable to consider possible limitations on the size/number of VOIs, whereupon prioritization should be taken into account to better adjust the VOIs based on the detected motions in the scene. Basically, the higher the scene motion in any particular area, the more likely that area belongs to a VOI. In this embodiment, the movement of the viewer may also be taken into account and the relative motion between the scene objects and the viewer may be considered as motion in the scene.
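  • A minimal sketch of this idea on 2D camera frames, assuming two consecutive grayscale frames as NumPy arrays; it scores coarse blocks by the mean absolute frame difference and marks the highest-scoring blocks as VOI candidates. The block size and candidate count are illustrative assumptions, not taken from this document.

```python
import numpy as np

def motion_voi_candidates(prev_frame, curr_frame, block=32, top_k=4):
    # Mean absolute difference per block as a simple motion measure.
    diff = np.abs(curr_frame.astype(float) - prev_frame.astype(float))
    h, w = diff.shape
    scores = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            scores.append((diff[y:y + block, x:x + block].mean(), (x, y, block, block)))
    # The higher the motion in a block, the more likely it belongs to a VOI.
    scores.sort(key=lambda s: s[0], reverse=True)
    return [region for _, region in scores[:top_k]]
```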
  • the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of the depth information of the scene.
  • the current location/viewing direction of the user may be taken into account.
  • different representations of the scene may be available in order to be able to switch to different presentations based on the movement of the user through the scene.
  • the selection of VOI may then be performed adaptively based on the relative distance of each object to the current location of the viewer.
  • the viewer is wearing an HMD or an equivalent viewing apparatus that tracks the viewer's eye movements and gaze direction.
  • This information is transmitted back to the volumetric encoder and/or streaming source.
  • the view frustum comprises the viewer's 3D position, viewing direction, field of view angles, and near/far view planes, thus also describing if the viewer has zoomed in on a particular detail in the scene.
  • the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of viewer's gaze tracking and view frustum obtained from a viewing apparatus used by the viewer. Gaze tracking may be applied in real time to cast rays (from one or both eyes), determining a specific voxel node that the viewer is currently looking at. This may be carried out using the viewer's voxel representation of the scene for minimizing latency.
  • the node can be selected so that it fits inside the view frustum, covering as much of the view as possible.
  • the coordinates of the node may be transmitted to the encoder / streaming source to be used as the VOI. If the gaze tracking is applied to both eyes separately, it is possible to detect whether the user is looking at a small nearby object or at a large far-away object in the same general direction, both fitted to the same view frustum.
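  • A minimal sketch of casting the gaze ray into the viewer-side voxel octree, assuming the SVONode / cubic-world structure from the earlier sketch; it descends into the child octants hit by the ray (nearest entry first) and returns the first solid leaf, whose coordinates could then be sent to the encoder / streaming source as the VOI. The ray-box test and all names are illustrative and not taken from this document.

```python
import numpy as np

def ray_hits_box(origin, direction, box_min, box_max, eps=1e-12):
    # Slab test: entry distance along the ray, or None if the ray misses the box.
    tmin, tmax = 0.0, np.inf
    for i in range(3):
        if abs(direction[i]) < eps:
            if origin[i] < box_min[i] or origin[i] > box_max[i]:
                return None
        else:
            t1 = (box_min[i] - origin[i]) / direction[i]
            t2 = (box_max[i] - origin[i]) / direction[i]
            lo, hi = min(t1, t2), max(t1, t2)
            tmin, tmax = max(tmin, lo), min(tmax, hi)
            if tmin > tmax:
                return None
    return tmin

def gaze_voxel(node, box_min, box_size, origin, direction):
    # Depth-first descent into the octants hit by the gaze ray, nearest entry first.
    box_min = np.asarray(box_min, dtype=float)
    if ray_hits_box(origin, direction, box_min, box_min + box_size) is None:
        return None
    if all(child is None for child in node.children):        # leaf of the SVO
        return (node, box_min, box_size) if node.attrs is not None else None
    half = box_size / 2.0
    hits = []
    for octant, child in enumerate(node.children):
        if child is None:
            continue
        child_min = box_min + half * np.array([(octant >> 2) & 1, (octant >> 1) & 1, octant & 1])
        t = ray_hits_box(origin, direction, child_min, child_min + half)
        if t is not None:
            hits.append((t, child, child_min))
    for _, child, child_min in sorted(hits, key=lambda h: h[0]):
        found = gaze_voxel(child, child_min, half, origin, direction)
        if found is not None:
            return found
    return None
```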
  • the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of spatial audio information obtained from the scene.
  • an array of microphones (or any alternative device) may be used to obtain the direction and/or location of recorded audio. Any potential object residing in said direction and/or location may be considered and VOI may be defined so that the said object belongs to it. Audio information could be further processed through a recognition system, and the processed audio information may then be used as audio-based semantic information to define saliency regions or coordinates of objects of interest. Such information may further be filtered or selected for encoding the VOI with better quality.
  • the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of the proximity of cameras to any given scene object. An object that has all cameras relatively close to it is more likely to be an object of interest than an object that is close to one camera only.
  • the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of 2D shapes recognized in the scene.
  • individual camera images may be analyzed with 2D image recognition algorithms, such as detection of human faces. Viewers are likely to focus on faces and facial expressions and thus they are good candidates for VOIs.
  • Another example is recognizing a ball or a hockey puck in a sports game.
  • the recognized 2D shapes can be mapped onto the voxel representation for determining the corresponding VOI(s).
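  • A minimal sketch of mapping a recognized 2D shape onto the 3D scene, assuming OpenCV's bundled Haar face detector, a per-camera depth map and pinhole intrinsics; OpenCV is used here only as an illustrative stand-in for the "2D image recognition algorithms" mentioned above and is not named in this document.

```python
import cv2
import numpy as np

def face_vois(image_bgr, depth, fx, fy, cx, cy, cam_to_world):
    # Detect faces in the 2D camera image.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    vois = []
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        # Median depth over the detected region gives a robust distance estimate.
        z = float(np.median(depth[y:y + h, x:x + w]))
        if z <= 0:
            continue
        # Back-project two opposite corners of the 2D box to world space.
        corners = []
        for (u, v) in ((x, y), (x + w, y + h)):
            cam_pt = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
            corners.append((cam_to_world @ cam_pt)[:3])
        corners = np.array(corners)
        vois.append((corners.min(axis=0), corners.max(axis=0)))   # axis-aligned VOI box
    return vois
```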
  • the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of 3D shapes recognized in the scene.
  • 3D image recognition algorithms such as various methods based on convolutional neural networks (CNN) may be used for analyzing the shapes appearing on the scene. Shapes typically familiar to humans, such as clear geometrical shapes, are good candidates for VOIs. Also the recognized 3D shapes can be mapped onto the voxel representation for determining the corresponding VOI(s).
  • the method may further comprise determining a parameter indicating the viewer's probable interest with the scene on the basis of semantic information obtained from scene segmentation.
  • the 3D volumetric representation of the scene may be segmented using scene segmentation and semantic parsing. Then, the content creator or the viewer provides keywords relating to objects which they desire to see or preserve in higher quality (i.e. saliency definition). The system then matches these keywords to the semantic information of the 3D scene and identifies the VOIs which should be encoded and delivered with a higher quality.
  • the method may further comprise obtaining tuning parameters regarding any technical limitation of an involved system and adjusting encoding parameters according to said limitation.
  • the technical components of the system, such as the encoder, the broadcasting system and/or the playback device, may involve some technical limitations which may affect the encoding.
  • the tuning parameters relating to technical limitations may be provided by a person, such as the viewer or the content capturing director, wishing to adjust the encoding.
  • the technical limitations may include any information regarding the required bandwidth to which the final encoded content should be limited. Such tuning information may affect the encoding parameters and may result in sacrificing video quality in accordance with the reduced bitrate.
  • the method may further comprise providing a plurality of presentations for at least one VOI, obtaining at least one parameter defining the viewer's viewing perspective relative to the VOI, and selecting one of said plurality of presentations to be presented to the viewer on the basis of the viewer's viewing perspective relative to the VOI.
  • the spatial location of the user, the direction from which s/he is watching the scene, other view frustum parameters, and the number of frames that the viewer is buffering in memory may be obtained. This enables better defining the VOI based on the specific perspective of each user. Different presentations for different VOIs may be available and, considering their location relative to the current location of the user, the presentation to be shown to the user in the playback is adjusted accordingly.
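  • A minimal sketch of this selection, assuming each VOI offers a small set of pre-encoded presentations, each tagged with the viewing direction (a unit vector from the VOI towards the viewpoint it was prepared for); the presentation whose direction best matches the viewer's current perspective is chosen. The names and the matching criterion are illustrative assumptions, not taken from this document.

```python
import numpy as np

def select_presentation(presentations, voi_center, viewer_pos):
    # presentations: list of (direction_unit_vector, payload) pairs prepared in advance.
    view_dir = np.asarray(viewer_pos, dtype=float) - np.asarray(voi_center, dtype=float)
    view_dir /= np.linalg.norm(view_dir)
    # Pick the presentation whose prepared direction is most aligned with the
    # viewer's current perspective relative to the VOI.
    best = max(presentations, key=lambda p: float(np.dot(p[0], view_dir)))
    return best[1]
```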
  • the quality of all areas of the scene will be degraded.
  • the quality of all areas of the scene is degraded, but the quality of at least one VOI is degraded less compared to the rest of the scene.
  • the degradation may be based on sub-sampling (spatially downsampling) the voxels to reduce the amount of information to be encoded.
  • the ratio by which the downsampling is applied may depend on many factors, e.g. the amount of bitrate reduction required for compressing the content or the amount of allowed quality degradation.
  • the spatial downsampling may include the application of linear or non-linear resampling filters on the voxels.
  • said sub-sampling comprises selecting a downsampled child node of a voxel to be sub-sampled from the voxel octree.
  • each parent node already contains a downsampled version of its subtrees, so the downsampling of any particular region is efficiently implemented by looking up mipmapped attributes within the SVO branch covering the selected region.
  • the VOIs may be constrained to align with suitable octree node boundaries so that downsampling can be effected simply by culling the higher-resolution nodes for regions to be downsampled.
  • the amount of such subsampling may vary based on the distance from the VOIs. In other words, the larger the distance from the VOI, the less the users are expected to pay attention to that area and hence the coarser the downsampling that is applied.
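  • A minimal sketch of tying the subsampling amount to the distance from the VOI, reusing the cut_at_depth idea from the earlier sketch: the further a region lies from the VOI, the fewer octree levels (i.e. the coarser the voxels) are kept for it. The distance thresholds are illustrative assumptions, not taken from this document.

```python
def depth_budget(distance_to_voi, full_depth):
    # Inside or near the VOI keep full resolution; further away keep
    # progressively fewer octree levels, i.e. coarser voxels.
    if distance_to_voi <= 0.0:
        return full_depth
    if distance_to_voi < 2.0:
        return max(1, full_depth - 1)
    if distance_to_voi < 5.0:
        return max(1, full_depth - 2)
    return max(1, full_depth - 3)

# e.g. cut_at_depth(region_root, depth_budget(distance, svo.max_levels))
```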
  • the method may further comprise sub-sampling voxels of the scene residing in at least one VOI. In other words, not all VOIs need to remain intact. Based on the method by which they have been determined, or based on their closeness to each other, it may be determined that voxels belonging to at least one VOI may also be sub-sampled. In this embodiment, e.g. if two VOIs are close to each other and a third VOI is farther away, the farther one may be subsampled to some extent.
  • the presence of at least two VOIs close to each other may be considered as an indicator to create a larger VOI.
  • Said larger VOI may be created by merging the boundaries of said at least two VOIs so that the larger VOI includes all of the at least two VOIs and also covers some areas in between them which were not originally covered by the separate VOIs.
  • one VOI may include different regions where each region is subsampled by a different factor.
  • the selection of different regions inside a VOI may depend on the criteria which were used to select the VOI in the first place. Since the density of such criteria is not identical in all areas of the VOI, this may be taken into account by applying different subsampling inside the said VOI.
  • voxel data is stored by separating data regarding voxel nodes and voxel attributes.
  • data size reductions may also be achieved by reducing the accuracy of attribute data for areas outside the VOIs; for example, color information and surface normals can be encoded with fewer bits.
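  • A minimal sketch of such an accuracy reduction for colour attributes, assuming 8-bit RGB values stored as NumPy arrays: outside the VOIs the colours are requantized to fewer bits per channel, reducing attribute data size at the cost of colour fidelity. The bit count of 4 is an illustrative assumption, not taken from this document.

```python
import numpy as np

def reduce_color_precision(colors, bits=4):
    # Requantize 8-bit colour channels to the given number of bits and expand back,
    # so the values can still be handled as 8-bit samples downstream.
    colors = np.asarray(colors, dtype=np.uint8)
    shift = 8 - bits
    return ((colors >> shift) << shift).astype(np.uint8)
```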
  • the stream can utilize an attribute palette that is updated only when its contents are sufficiently out-of-date when compared to the original voxel data.
  • the stream only contains voxel node data for areas outside the VOI, reusing attribute data already received by the viewer during past frames.
  • the encoder / streaming source may be provided with information about how many frames the viewer is buffering in memory so that it can be determined how many past frames can be considered to be available.
  • the encoder / streaming source may also reduce voxel data size by replacing specific voxel nodes with references to nodes in the past frames that are already buffered in memory on the viewer's side.
  • the encoder / streaming source must know how many frames the viewer is buffering in memory. Outside the VOI the referenced past nodes do not need to be exact matches to the current actual voxel representation of the area.
  • significant advantages may be obtained through one or more of the disclosed embodiments.
  • the amount of information that should be transmitted may be considerably reduced without sacrificing the subjective quality of experience for the users watching the content.
  • the sub-sampling of SVO data can be particularly efficiently implemented, which facilitates the implementation in real-time transcoding applications.
  • volumetric content adaptation per viewer is allowed from the same 3D scene representation based on the user preferences, viewing directions and rendering capabilities.
  • Figure 6 shows a schematic block diagram of an exemplary apparatus or electronic device 50 depicted in Figure 7, which may incorporate a controller according to an embodiment of the invention.
  • the electronic device 50 may for example be a mobile terminal or user equipment of a wireless communication system. However, it would be appreciated that embodiments of the invention may be implemented within any electronic device or apparatus which may require transmission of radio frequency signals.
  • the apparatus 50 may comprise a housing 30 for incorporating and protecting the device.
  • the apparatus 50 may further comprise a display 32 in the form of a liquid crystal display.
  • the display may be any suitable display technology suitable to display an image or video.
  • the apparatus 50 may further comprise a keypad 34.
  • any suitable data or user interface mechanism may be employed.
  • the user interface may be
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The term battery discussed in connection with the embodiments may also be one of these mobile energy devices.
  • the apparatus 50 may comprise a combination of different kinds of energy devices, for example a rechargeable battery and a solar cell.
  • the apparatus may further comprise an infrared port 41 for short range line of sight communication to other devices.
  • the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/FireWire wired connection.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50.
  • the controller 56 may be connected to memory 58 which in embodiments of the invention may store both data and/or may also store instructions for implementation on the controller 56.
  • the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of audio and/or video data or assisting in coding and decoding carried out by the controller 56.
  • the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a universal integrated circuit card (UICC) reader and a universal integrated circuit card for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
  • UICC universal integrated circuit card
  • the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network.
  • the apparatus 50 may further comprise an antenna 60 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
  • the apparatus 50 comprises a camera 42 capable of recording or detecting imaging.
  • the system 10 comprises multiple communication devices which can communicate through one or more networks.
  • the system 10 may comprise any combination of wired and/or wireless networks including, but not limited to a wireless cellular telephone network (such as a global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), long term evolution (LTE) based network, code division multiple access (CDMA) network etc.), a wireless local area network (WLAN) such as defined by any of the IEEE 802.x standards, a Bluetooth personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and the Internet.
  • GSM global systems for mobile communications
  • UMTS universal mobile telecommunications system
  • LTE long term evolution
  • CDMA code division multiple access
  • the system shown in Figure 8 comprises a mobile telephone network 11 and a representation of the internet 28.
  • Connectivity to the internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.
  • the example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22, a tablet computer.
  • the apparatus 50 may be stationary or mobile when carried by an individual who is moving.
  • Some or further apparatus may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24.
  • the base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the internet 28.
  • the system may include additional communication devices and communication devices of various types.
  • the communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), time divisional multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth, IEEE 802.11, Long Term Evolution wireless communication technique (LTE) and any similar wireless communication technology.
  • CDMA code division multiple access
  • GSM global systems for mobile communications
  • UMTS universal mobile telecommunications system
  • TDMA time divisional multiple access
  • FDMA frequency division multiple access
  • TCP-IP transmission control protocol-internet protocol
  • SMS short messaging service
  • MMS multimedia messaging service
  • IMS instant messaging service
  • LTE Long Term Evolution wireless communication technique
  • although the above examples describe embodiments of the invention operating within a wireless communication device, the invention as described above may be implemented as a part of any apparatus comprising circuitry in which radio frequency signals are transmitted and received.
  • embodiments of the invention may be implemented in a mobile phone, in a base station, in a computer such as a desktop computer or a tablet computer comprising radio frequency communication means (e.g. wireless local area network, cellular radio, etc.).
  • radio frequency communication means e.g. wireless local area network, cellular radio, etc.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The present invention relates to a method comprising: determining a three-dimensional (3D) volumetric representation of a scene as a plurality of voxels on the basis of input streams of at least a first multicamera device; determining, on the basis of one or more parameters indicating a viewer's probable interest with the scene, at least a first set of voxels as a first volume of interest (VOI) (402); and sub-sampling voxels of the scene residing outside said at least first VOI (404).
PCT/FI2018/050349 2017-05-16 2018-05-09 An apparatus, a method and a computer program for video coding and decoding WO2018211171A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1707794.2A GB2562490A (en) 2017-05-16 2017-05-16 An apparatus, a method and a computer program for video coding and decoding
GB1707794.2 2017-05-16

Publications (1)

Publication Number Publication Date
WO2018211171A1 true WO2018211171A1 (fr) 2018-11-22

Family

ID=59201712

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2018/050349 WO2018211171A1 (fr) 2017-05-16 2018-05-09 An apparatus, a method and a computer program for video coding and decoding

Country Status (2)

Country Link
GB (1) GB2562490A (fr)
WO (1) WO2018211171A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991473A (zh) * 2021-03-19 2021-06-18 华南理工大学 一种基于立方体模板的神经网络编码解码方法及系统

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140184475A1 (en) * 2012-12-27 2014-07-03 Andras Tantos Display update time reduction for a near-eye display
WO2017059490A1 (fr) * 2015-10-06 2017-04-13 Blinxel Pty Ltd Enregistrement et lecture vidéo à profondeur volumétrique

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160016468A (ko) * 2014-08-05 2016-02-15 삼성전자주식회사 리얼 3d 영상 생성 방법 및 그 장치

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140184475A1 (en) * 2012-12-27 2014-07-03 Andras Tantos Display update time reduction for a near-eye display
WO2017059490A1 (fr) * 2015-10-06 2017-04-13 Blinxel Pty Ltd Enregistrement et lecture vidéo à profondeur volumétrique

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CAMPOALEGRE, L. ET AL.: "Gradient Octrees: A new Scheme for Remote Interactive Exploration of Volume Models", 13TH INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN AND COMPUTER GRAPHICS, 16 November 2013 (2013-11-16) - 18 November 2013 (2013-11-18), pages 306 - 313, XP032595601, Retrieved from the Internet <URL:DOI:10.1109/CADGraphics.2013.47> *
KAMPE, V. ET AL.: "Exploiting Coherence in Time-Varying Voxel Data", I3D '16, 27 February 2016 (2016-02-27), Redmond, WA, pages 15 - 21, XP058079601, ISBN: 978-1-4503-4043-4, Retrieved from the Internet <URL:DOI:10.1145/2856400.2856413> *
MEKURIA, R. ET AL.: "Design, Implementation, and Evaluation of a Point Cloud Codec for Tele-Immersive Video", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 27, no. 4, April 2017 (2017-04-01), pages 828 - 842, XP055548820 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991473A (zh) * 2021-03-19 2021-06-18 华南理工大学 一种基于立方体模板的神经网络编码解码方法及系统

Also Published As

Publication number Publication date
GB201707794D0 (en) 2017-06-28
GB2562490A (en) 2018-11-21

Similar Documents

Publication Publication Date Title
US11430156B2 (en) Apparatus, a method and a computer program for volumetric video
US11599968B2 (en) Apparatus, a method and a computer program for volumetric video
CN111615715B (zh) 编码/解码体积视频的方法、装置和流
US20220174252A1 (en) Selective culling of multi-dimensional data sets
US11202086B2 (en) Apparatus, a method and a computer program for volumetric video
WO2019076503A1 (fr) Appareil, procédé et programme informatique pour coder une vidéo volumétrique
KR102371099B1 (ko) 광시야 비디오를 인코딩하기 위한 구면 회전 기법
US20160021355A1 (en) Preprocessor for Full Parallax Light Field Compression
WO2017116952A1 (fr) Codage et rendu d'image indépendants de la vision de hublot
JP2015536057A (ja) 画像処理装置、方法及びコンピュータプログラム
CN111557094A (zh) 用于编码/解码体积式视频的方法、设备和流
WO2019162567A1 (fr) Codage et décodage de vidéo volumétrique
US10616548B2 (en) Method and apparatus for processing video information
WO2019008222A1 (fr) Procédé et appareil de codage de contenu multimédia
KR20220035229A (ko) 볼류메트릭 비디오 콘텐츠를 전달하기 위한 방법 및 장치
WO2018211171A1 (fr) Appareil, procédé et programme d'ordinateur pour le codage et le décodage vidéo
US11528469B2 (en) Apparatus, a method and a computer program for viewing volume signalling for volumetric video
WO2018158494A1 (fr) Procédé et appareil pour une unité de caméras multiples
WO2019034803A1 (fr) Procédé et appareil de traitement d'informations vidéo
WO2019008233A1 (fr) Méthode et appareil d'encodage de contenu multimédia
US10783609B2 (en) Method and apparatus for processing video information
WO2021064138A1 (fr) Procédé et appareil pour le codage, la transmission et le décodage de vidéo volumétrique

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18803254

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18803254

Country of ref document: EP

Kind code of ref document: A1