WO2019008233A1 - Method and apparatus for encoding multimedia content - Google Patents

Method and apparatus for encoding multimedia content

Info

Publication number
WO2019008233A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
node
source devices
source device
sparse voxel
Prior art date
Application number
PCT/FI2018/050534
Other languages
English (en)
Inventor
Jaakko KERÄNEN
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2019008233A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/96 Tree coding, e.g. quad-tree coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/111 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N13/117 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation, the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/161 Encoding, multiplexing or demultiplexing different image signal components
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/172 Processing image signals comprising non-image signal components, e.g. headers or format information
    • H04N13/178 Metadata, e.g. disparity information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/30 Image reproducers
    • H04N13/366 Image reproducers using viewer tracking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20 Image signal generators
    • H04N13/204 Image signal generators using stereoscopic image cameras
    • H04N13/239 Image signal generators using stereoscopic image cameras using two 2D image sensors having a relative position equal to or related to the interocular distance
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20 Image signal generators
    • H04N13/204 Image signal generators using stereoscopic image cameras
    • H04N13/243 Image signal generators using stereoscopic image cameras using three or more 2D image sensors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/30 Image reproducers
    • H04N13/332 Displays for viewing with the aid of special glasses or head-mounted displays [HMD]
    • H04N13/344 Displays for viewing with the aid of special glasses or head-mounted displays [HMD] with head-mounted left-right displays
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/30 Image reproducers
    • H04N13/366 Image reproducers using viewer tracking
    • H04N13/383 Image reproducers using viewer tracking for tracking with gaze detection, i.e. detecting the lines of sight of the viewer's eyes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/62 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding by frequency transforming in three dimensions

Definitions

  • the present solution generally relates to video encoding.
  • the solution relates to volumetric encoding and virtual reality (VR).
  • VR virtual reality
  • new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all axes).
  • new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being "immersed" in the scene captured by the 360-degree camera.
  • the new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.
  • a scene is captured by multiple 3D cameras.
  • the cameras are in different positions and orientations within the scene.
  • One issue to take into account is how to combine the synchronized recorded footage of all 3D cameras into a single model of the scene, whether to remove all redundant information, and how to augment occluded sections seen by one camera with images from another camera.
  • Depth maps may be generated from camera images. However, the depth maps may have noise, distortions, discontinuities, banding due to quantization, and other errors. Also, depth sensors like LiDAR (Light Detection and Ranging) may suffer from reflections and noise. It should also be noted that the same objects may be seen from multiple view directions (e.g., both front and back).
  • Some embodiments provide a method, and technical equipment implementing the method, for real-time computer graphics and virtual reality.
  • Various aspects of the invention include a method, an apparatus, and a computer readable medium comprising a computer program stored therein.
  • a method comprising:
  • An apparatus comprises at least one processor and at least one memory, said at least one memory stored with code thereon, which when executed by said at least one processor, causes the apparatus to perform at least:
  • a computer readable storage medium comprises code for use by an apparatus, which when executed by a processor, causes the apparatus to perform:
  • An apparatus comprises:
  • Figure 1 a shows an example of a multi-camera system as a simplified block diagram, in accordance with an embodiment
  • Figure 1b shows a perspective view of a multi-camera system, in accordance with an embodiment
  • Figure 2 shows a system and apparatuses for stereo viewing
  • Figure 3 shows a camera device for stereo viewing
  • Figure 4 shows a head-mounted display for stereo viewing
  • Figures 5a and 5b show an encoder and a decoder according to an embodiment
  • Figure 6 illustrates an example of processing steps of manipulating volumetric video data
  • Figure 7 shows an example of a volumetric video pipeline
  • Figure 8 shows an example of a valid volume of a camera where some objects cause occlusions
  • Figure 9 shows an example of an output data of an encoder
  • Figure 10 is a flowchart of a method according to an embodiment.
  • Figures 11a and 11b show an apparatus according to an embodiment.
  • the invention is not limited to this particular arrangement.
  • the different embodiments have wide applications in any environment where improvement of coding when switching between coded fields and frames is desired.
  • the invention may be applicable to video coding systems like streaming systems, DVD players, digital television receivers, personal video recorders, systems and computer programs on personal computers, handheld computers and communication devices, as well as network elements such as transcoders and cloud computing arrangements where video data is handled.
  • volumetric video may be captured using one or more 3D cameras. Volumetric video is to virtual reality what traditional video is to 2D/3D displays. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.
  • a multicamera device comprises two or more cameras, wherein the two or more cameras may be arranged in pairs in said multicamera device. Each said camera has a respective field of view, and each said field of view covers the view direction of the multicamera device.
  • the multicamera device may comprise cameras at locations corresponding to at least some of the eye positions of a human head at normal anatomical posture, eye positions of the human head at maximum flexion anatomical posture, eye positions of the human head at maximum extension anatomical postures, and/or eye positions of the human head at maximum left and right rotation anatomical postures.
  • the multicamera device may comprise at least three cameras, the cameras being disposed such that their optical axes in the direction of the respective camera's field of view fall within a hemispheric field of view, the multicamera device comprising no cameras having their optical axes outside the hemispheric field of view, and the multicamera device having a total field of view covering a full sphere.
  • the multicamera device described here may have cameras with wide-angle lenses.
  • the multicamera device may be suitable for creating stereo viewing image data and/or multiview video, comprising a plurality of video sequences for the plurality of cameras.
  • the multicamera device may be such that any pair of cameras of the at least two cameras has a parallax corresponding to the parallax (disparity) of human eyes for creating a stereo image.
  • At least two cameras may have overlapping fields of view such that an overlap region for which every part is captured by said at least two cameras is defined, and such overlap area can be used in forming the image for stereo viewing.
  • Figures 1a and 1b illustrate an example of a camera having multiple lenses and imaging sensors, but other types of cameras may also be used to capture wide view images and/or wide view video.
  • wide view image and wide view video mean an image and a video, respectively, which comprise visual information having a relatively large viewing angle, larger than 100 degrees.
  • a so-called 360 panorama image/video, as well as images/videos captured by using a fish eye lens, may also be called a wide view image/video in this specification.
  • the wide view image/video may mean an image/video in which some kind of projection distortion may occur when the direction of view changes between successive images or frames of the video, so that a transform may be needed to find out co-located pixels from a reference image or a reference frame. This will be described in more detail later in this specification.
  • the camera 100 of Figure 1a comprises two or more camera units 102 and is capable of capturing wide view images and/or wide view video.
  • the number of camera units 102 is eight, but may also be less than eight or more than eight.
  • Each camera unit 102 is located at a different location in the multi-camera system and may have a different orientation with respect to other camera units 102.
  • the camera units 102 may have an omnidirectional constellation so that the camera has a 360-degree viewing angle in 3D space. In other words, such a camera 100 may be able to see each direction of a scene so that each spot of the scene around the camera 100 can be viewed by at least one camera unit 102.
  • the camera 100 of Figure 1a may also comprise a processor 104 for controlling the operations of the camera 100.
  • a memory 106 for storing data and computer code to be executed by the processor 104, and a transceiver 108 for communicating with, for example, a communication network and/or other devices in a wireless and/or wired manner.
  • the camera 100 may further comprise a user interface (UI) 110 for displaying information to the user, for generating audible signals and/or for receiving user input.
  • UI user interface
  • the camera 100 need not comprise each feature mentioned above, or may comprise other features as well.
  • processor 104 may also comprise two or more separate processing units such as a controller and a graphics processing unit.
  • processor may comprise one or more processing cores.
  • Figure 1a also illustrates some operational elements which may be implemented, for example, as computer code in the software of the processor, in hardware, or both.
  • a focus control element 114 may perform operations related to adjustment of the optical system of camera unit or units to obtain focus meeting target specifications or some other predetermined criteria.
  • An optics adjustment element 116 may perform movements of the optical system or one or more parts of it according to instructions provided by the focus control element 114. It should be noted here that the actual adjustment of the optical system need not be performed by the apparatus; it may be performed manually, wherein the focus control element 114 may provide information via the user interface 110 to indicate to a user of the device how to adjust the optical system.
  • Figure 1b shows a perspective view of the camera 100 of Figure 1a.
  • seven camera units 102a-102g can be seen, but the camera 100 may comprise even more camera units which are not visible from this perspective.
  • Figure 1b also shows two microphones 112a, 112b, but the apparatus may also comprise only one microphone or more than two microphones.
  • the camera 100 may be controlled by another device (not shown), wherein the camera 100 and the other device may communicate with each other and a user may use a user interface of the other device for entering commands, parameters, etc. and the user may be provided information from the camera 100 via the user interface of the other device.
  • Figure 2 shows a system and apparatuses for stereo viewing, that is, for 3D video and 3D audio digital capture and playback.
  • the task of the system is that of capturing sufficient visual and auditory information from a specific location such that a convincing reproduction of the experience, or presence, of being in that location can be achieved by one or more viewers physically located in different locations and optionally at a time later in the future.
  • Such reproduction requires more information than can be captured by a single camera or microphone, in order that a viewer can determine the distance and location of objects within the scene using their eyes and their ears.
  • two camera sources are used.
  • So that the human auditory system can sense the direction of sound, at least two microphones are used (the commonly known stereo sound is created by recording two audio channels).
  • the human auditory system can detect cues, e.g. the timing difference of the audio signals, to detect the direction of sound.
  • the system of Figure 2 may consist of three main parts: image sources, a server and a rendering device.
  • a video capture device SRC1 comprises multiple cameras CAM1, CAM2, ..., CAMN with overlapping fields of view so that regions of the view around the video capture device are captured from at least two cameras.
  • the device SRC1 may comprise multiple microphones to capture the timing and phase differences of audio originating from different directions.
  • the device SRC1 may comprise a high resolution orientation sensor so that the orientation (direction of view) of the plurality of cameras can be detected and recorded.
  • the device SRC1 comprises or is functionally connected to a computer processor PROC1 and memory MEM1, the memory comprising computer program PROGR1 code for controlling the video capture device.
  • the image stream captured by the video capture device may be stored on a memory device MEM2 for use in another device, e.g. a viewer, and/or transmitted to a server using a communication interface COMM1.
  • COMM1 a communication interface
  • one or more sources SRC2 of synthetic images may be present in the system.
  • Such sources of synthetic images may use a computer model of a virtual world to compute the various image streams they transmit.
  • the source SRC2 may compute N video streams corresponding to N virtual cameras located at a virtual viewing position.
  • the viewer may see a three-dimensional virtual world.
  • the device SRC2 comprises or is functionally connected to a computer processor PROC2 and memory MEM2, the memory comprising computer program PROGR2 code for controlling the synthetic sources device SRC2.
  • the image stream captured by the device may be stored on a memory device MEM5 (e.g.
  • the device SERVER comprises or is functionally connected to a computer processor PROC3 and memory MEM3, the memory comprising computer program PROGR3 code for controlling the server.
  • the device SERVER may be connected by a wired or wireless network connection, or both, to sources SRC1 and/or SRC2, as well as the viewer devices VIEWERl and VIEWER2 over the communication interface COMM3.
  • the devices may comprise or be functionally connected to a computer processor PROC4 and memory MEM4, the memory comprising computer program PROG4 code for controlling the viewing devices.
  • the viewer (playback) devices may consist of a data stream receiver for receiving a video data stream from a server and for decoding the video data stream. The data stream may be received over a network connection through a communications interface COMM4, or from a memory device MEM6 like a memory card CARD2.
  • the viewer devices may have a graphics processing unit for processing of the data to a suitable format for viewing.
  • the viewer VIEWERl comprises a high-resolution stereo-image head-mounted display for viewing the rendered stereo video sequence.
  • the head-mounted display may have an orientation sensor DET1 and stereo audio headphones.
  • the viewer VIEWER2 comprises a display enabled with 3D technology (for displaying stereo video), and the rendering device may have a head-orientation detector DET2 connected to it.
  • the viewer VIEWER2 may comprise a 2D display, since the volumetric video rendering can be done in 2D by rendering the viewpoint from a single eye instead of a stereo eye pair.
  • Any of the devices (SRC1, SRC2, SERVER, RENDERER, VIEWERl, VIEWER2) may be a computer or a portable computing device, or be connected to such.
  • Such rendering devices may have computer program code for carrying out methods according to various examples described in this text.
  • Figure 3 shows a camera device 200 for stereo viewing.
  • the camera comprises two or more cameras that are configured into camera pairs 201 for creating the left and right eye images, or that can be arranged to such pairs.
  • the distances between cameras may correspond to the usual (or average) distance between the human eyes.
  • the cameras may be arranged so that they have significant overlap in their fields of view. For example, wide-angle lenses of 180 degrees or more may be used, and there may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, or 20 cameras.
  • the cameras may be regularly or irregularly spaced to access the whole sphere of view, or they may cover only part of the whole sphere.
  • For example, 8 cameras having wide-angle lenses may be arranged regularly at the corners of a virtual cube, covering the whole sphere such that the whole or essentially whole sphere is covered in all directions by at least 3 or 4 cameras.
  • In Figure 3, three stereo camera pairs 201 are shown.
  • Multicamera devices with other types of camera layouts may be used.
  • a camera device with all cameras in one hemisphere may be used.
  • the number of cameras may be e.g., 2, 3, 4, 6, 8, 12, or more.
  • the cameras may be placed to create a central field of view where stereo images can be formed from image data of two or more cameras, and a peripheral (extreme) field of view where one camera covers the scene and only a normal non-stereo image can be formed.
  • Figure 4 shows a head-mounted display 202 (HMD) for stereo viewing.
  • the head-mounted display comprises two screen sections or two screens 203 and 204 for displaying the left and right eye images.
  • the displays are close to the eyes when the head mounted display is carried by a person, and therefore lenses are used to make the images easily viewable and for spreading the images to cover as much as possible of the eyes' field of view.
  • the device is attached to the head of the user so that it stays in place even when the user turns his head.
  • the device may have an orientation detecting module 205 for determining the head movements and direction of the head.
  • the head-mounted display gives a three-dimensional (3D) perception of the recorded/streamed content to a user.
  • Time-synchronized video, audio and orientation data is first recorded with the capture device. This can consist of multiple concurrent video and audio streams as described above. These are then transmitted immediately or later to the storage and processing network for processing and conversion into a format suitable for subsequent delivery to playback devices. The conversion can involve post-processing steps to the audio and video data in order to improve the quality and/or reduce the quantity of the data while preserving the quality at a desired level.
  • each playback device receives a stream of the data from the network, and renders it into a stereo viewing reproduction of the original location which can be experienced by a user with the head-mounted display and headphones.
  • the camera device shown in Figure la or in Figure 3 can be used as a source for media content, such as images and/or video.
  • To create a full 360-degree stereo panorama, every direction of view will be photographed from two locations, one for the left eye and one for the right eye.
  • In the case of a video panorama, these images will be captured substantially simultaneously to keep the eyes in sync with each other.
  • As one camera cannot physically cover the whole 360-degree view, at least without being obscured by another camera, multiple cameras are needed to form the whole 360-degree panorama.
  • Additional cameras however increase the cost and size of the system and add more data streams to be processed. This problem becomes even more significant when mounting cameras on a sphere or platonic solid shaped arrangement to get more vertical field of view.
  • the camera pairs will not achieve free angle parallax between the eye views.
  • the parallax between eyes is fixed to the positions of the individual cameras in a pair; that is, in the direction perpendicular to the camera pair, no parallax can be achieved. This is problematic when the stereo content is viewed with a head-mounted display that allows free rotation of the viewing angle around the z-axis as well.
  • a video codec consists of an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at a lower bitrate).
  • An example of an encoding process is illustrated in Figure 5a and an example of a decoding process is illustrated in Figure 5b.
  • Figure 5a shows a block diagram of a video encoder suitable for employing embodiments of the invention.
  • Figure 5a presents an encoder for two layers, but it would be appreciated that the presented encoder could be similarly simplified to encode only one layer or extended to encode more than two layers.
  • Figure 5a illustrates an embodiment of a video encoder comprising a first encoder section 500 for a base layer and a second encoder section 502 for an enhancement layer.
  • the encoder sections 500, 502 may comprise a pixel predictor 302, 402, prediction error encoder 303, 403 and prediction error decoder 304, 404.
  • Figure 5a also shows an embodiment of the pixel predictor 302, 402 as comprising an inter-predictor 306, 406, an intra-predictor 308, 408, a mode selector 310, 410, a filter 316, 416, and a reference frame memory 318, 418.
  • the pixel predictor 302 of the first encoder section 500 receives 300 base layer images of a video stream to be encoded at both the inter-predictor 306 (which determines the difference between the image and a motion compensated reference frame 318) and the intra-predictor 308 (which determines a prediction for an image block based only on the already processed parts of the current frame or picture).
  • the outputs of both the inter-predictor and the intra-predictor are passed to the mode selector 310.
  • the intra-predictor 308 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 310.
  • the mode selector 310 also receives a copy of the base layer picture 300.
  • the pixel predictor 402 of the second encoder section 502 receives 400 enhancement layer images of a video stream to be encoded at both the inter-predictor 406 (which determines the difference between the image and a motion compensated reference frame 418) and the intra-predictor 408 (which determines a prediction for an image block based only on the already processed parts of the current frame or picture).
  • the outputs of both the inter-predictor and the intra-predictor are passed to the mode selector 410.
  • the intra-predictor 408 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 410.
  • the mode selector 410 also receives a copy of the enhancement layer picture 400.
  • the output of the inter-predictor 306, 406, or the output of one of the optional intra-predictor modes, or the output of a surface encoder within the mode selector, is passed to the output of the mode selector 310, 410.
  • the output of the mode selector is passed to a first summing device 321, 421.
  • the first summing device may subtract the output of the pixel predictor 302, 402 from the base layer picture 300/enhancement layer picture 400 to produce a first prediction error signal 320, 420 which is input to the prediction error encoder 303, 403.
  • the pixel predictor 302, 402 further receives from a preliminary reconstructor 339, 439 the combination of the prediction representation of the image block 312, 412 and the output 338, 438 of the prediction error decoder 304, 404.
  • the preliminary reconstructed image 314, 414 may be passed to the intra-predictor 308, 408 and to a filter 316, 416.
  • the filter 316, 416 receiving the preliminary representation may filter the preliminary representation and output a final reconstructed image 340, 440 which may be saved in a reference frame memory 318, 418.
  • the reference frame memory 318 may be connected to the inter-predictor 306 to be used as the reference image against which a future base layer picture 300 is compared in inter-prediction operations.
  • the reference frame memory 318 may also be connected to the inter-predictor 406 to be used as the reference image against which future enhancement layer pictures 400 are compared in inter-prediction operations. Moreover, the reference frame memory 418 may be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer picture 400 is compared in inter-prediction operations.
  • Filtering parameters from the filter 316 of the first encoder section 500 may be provided to the second encoder section 502, subject to the base layer being selected and indicated to be the source for predicting the filtering parameters of the enhancement layer, according to some embodiments.
  • the prediction error encoder 303, 403 comprises a transform unit 342, 442 and a quantizer 344, 444.
  • the transform unit 342, 442 transforms the first prediction error signal 320, 420 to a transform domain.
  • the transform is, for example, the DCT transform.
  • the quantizer 344, 444 quantizes the transform domain signal, e.g. the DCT coefficients, to form quantized coefficients.
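  • As a minimal illustrative sketch of the transform-and-quantize step just described (not the encoder of the embodiments): a prediction-error block is transformed with a 2D DCT and the coefficients are uniformly quantized. The 8x8 block size, the quantization step and the use of SciPy's DCT are assumptions for illustration.

```python
# Illustrative sketch of the transform + quantization step described above.
# The 8x8 block size and the uniform quantization step are assumptions, not
# parameters taken from the embodiments.
import numpy as np
from scipy.fft import dctn, idctn

def transform_and_quantize(prediction_error_block, q_step=16.0):
    """Forward DCT of a prediction-error block followed by uniform quantization."""
    coeffs = dctn(prediction_error_block, norm='ortho')      # transform-domain signal
    return np.round(coeffs / q_step).astype(np.int32)        # quantized coefficients

def dequantize_and_inverse(quantized, q_step=16.0):
    """Approximate inverse, as performed by the prediction error decoder."""
    return idctn(quantized * q_step, norm='ortho')

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.normal(0.0, 10.0, size=(8, 8))                # fake prediction error
    q = transform_and_quantize(block)
    rec = dequantize_and_inverse(q)
    print("max reconstruction error:", np.abs(block - rec).max())
```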
  • the prediction error decoder 304, 404 receives the output from the prediction error encoder 303, 403 and performs the opposite processes of the prediction error encoder 303, 403 to produce a decoded prediction error signal 338, 438 which, when combined with the prediction representation of the image block 312, 412 at the second summing device 339, 439, produces the preliminary reconstructed image 314, 414.
  • the prediction error decoder may be considered to comprise a dequantizer 361, 461, which dequantizes the quantized coefficient values, e.g.
  • the prediction error decoder may also comprise a block filter which may filter the reconstructed block(s) according to further decoded information and filter parameters.
  • the entropy encoder 330, 430 receives the output of the prediction error encoder 303, 403 and may perform a suitable entropy encoding/variable length encoding on the signal to provide error detection and correction capability. The outputs of the entropy encoders 330, 430 may be inserted into a bitstream e.g. by a multiplexer 508.
  • Figure 5b shows a block diagram of a video decoder suitable for employing embodiments of the invention.
  • Figure 5b depicts a structure of a two-layer decoder, but it would be appreciated that the decoding operations may similarly be employed in a single-layer decoder.
  • the video decoder 550 comprises a first decoder section 552 for base layer pictures and a second decoder section 554 for enhancement layer pictures.
  • Block 556 illustrates a demultiplexer for delivering information regarding base layer pictures to the first decoder section 552 and for delivering information regarding enhancement layer pictures to the second decoder section 554.
  • Reference P'n stands for a predicted representation of an image block.
  • Reference D'n stands for a reconstructed prediction error signal.
  • Blocks 704, 804 illustrate preliminary reconstructed images (I'n).
  • Reference R'n stands for a final reconstructed image.
  • Blocks 703, 803 illustrate inverse transform (T⁻¹).
  • Blocks 702, 802 illustrate inverse quantization (Q⁻¹).
  • Blocks 700, 800 illustrate entropy decoding (E⁻¹).
  • Blocks 706, 806 illustrate a reference frame memory (RFM).
  • Blocks 707, 807 illustrate prediction (P) (either inter prediction or intra prediction).
  • Blocks 708, 808 illustrate filtering (F).
  • Blocks 709, 809 may be used to combine decoded prediction error information with predicted base or enhancement layer pictures to obtain the preliminary reconstructed images (I'n).
  • Preliminary reconstructed and filtered base layer pictures may be output 710 from the first decoder section 552 and preliminary reconstructed and filtered enhancement layer pictures may be output 810 from the second decoder section 554.
  • the decoder could be interpreted to cover any operational unit capable to carry out the decoding operations, such as a player, a receiver, a gateway, a demultiplexer and/or a decoder.
  • the decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (the inverse operation of the prediction error coding, recovering the quantized prediction error signal in the spatial pixel domain). After applying prediction and prediction error decoding means, the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame.
  • the decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
  • Figure 6 demonstrates an example of processing steps of manipulating volumetric video data, starting from raw camera frames (from various locations within the world) and ending with a frame rendered at a freely-selected 3D viewpoint.
  • the starting point 610 is media content obtained from one or more camera devices.
  • the media content may comprise raw camera frame images, depth maps, and camera 3D positions.
  • the recorded media content i.e. image data, is used to construct an animated 3D model 620 of the world. The viewer may then be freely able to choose his/her position and orientation within the world when the volumetric video is being played back 630.
  • A voxel of a three-dimensional world corresponds to a pixel of a two-dimensional world. Voxels exist in a 3D grid layout.
  • An octree is a tree data structure used to partition a three-dimensional space. Octrees are the three-dimensional analog of quadtrees.
  • a sparse voxel octree (SVO) describes a volume of a space containing a set of solid voxels of varying sizes. Empty areas within the volume are absent from the tree, which is why it is called "sparse".
  • a method for determining a volume of interest within the scene will now be shortly presented.
  • a three-dimensional (3D) volumetric representation of a scene is determined as a plurality of voxels on the basis of input streams of at least a first multicamera device; on the basis of one or more parameters indicating viewer's probable interest with the scene, at least a first set of voxels as a first volume of interest (VOI) is determined; and voxels of the scene residing outside said at least first VOI are sub-sampled.
  • a three-dimensional (3D) volumetric representation of a scene is determined as a plurality of voxels on the basis of input streams of at least one multicamera device.
  • At least one but preferably a plurality (i.e. 2, 3, 4, 5 or more) of multicamera devices are used to capture 3D video representation of a scene.
  • the multicamera devices are distributed in different locations in respect to the scene, and therefore each multicamera device captures a different 3D video representation of the scene.
  • the 3D video representations captured by each multicamera device are used as input streams for creating a 3D volumetric representation of the scene, said 3D volumetric representation comprising a plurality of voxels. Voxels may be formed from the captured 3D points e.g.
  • Voxels may also be formed through the construction of the sparse voxel octree. Each leaf of such a tree represents a solid voxel in world space; the root node of the tree represents the bounds of the world.
  • the sparse voxel octree construction may have the following steps: 1) map each input depth map to a world space point cloud, where each pixel of the depth map is mapped to one or more 3D points; 2) determine voxel attributes such as color and surface normal vector by examining the neighborhood of the source pixel(s) in the camera images and the depth map; 3) determine the size of the voxel based on the depth value from the depth map and the resolution of the depth map; 4) determine the SVO level for the solid voxel as a function of its size relative to the world bounds; 5) determine the voxel coordinates on that level relative to the world bounds; 6) create new and/or traverse existing SVO nodes until arriving at the determined voxel coordinates; 7) insert the solid voxel as a leaf of the tree, possibly replacing or merging attributes from a previously existing voxel at those coordinates. Nevertheless, the size of a voxel within the 3D volume
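  • The following is a minimal sketch of steps 4) to 7) above, assuming a cubic world volume and a dictionary-backed octree keyed by (level, x, y, z); projecting depth pixels to world space (step 1), attribute sampling (step 2) and voxel sizing (step 3) are taken as given.

```python
# Minimal sketch of steps 4-7 of the SVO construction outlined above, under the
# assumption of a cubic world volume [0, world_size)^3 and a dictionary-backed
# octree keyed by (level, x, y, z). This is not the storage layout of the
# embodiments, only an illustration of the level/coordinate arithmetic.
import math

class SparseVoxelOctree:
    def __init__(self, world_size, max_level):
        self.world_size = float(world_size)
        self.max_level = max_level
        self.leaves = {}                          # (level, x, y, z) -> attributes

    def level_for_voxel_size(self, voxel_size):
        """Step 4: pick the SVO level whose cell size best matches the voxel size."""
        level = int(round(math.log2(self.world_size / voxel_size)))
        return max(0, min(self.max_level, level))

    def coords_at_level(self, point, level):
        """Step 5: integer voxel coordinates of a world point at a given level."""
        cells = 1 << level
        return tuple(
            min(cells - 1, max(0, int(c / self.world_size * cells))) for c in point
        )

    def insert(self, point, voxel_size, attributes):
        """Steps 6-7: create/overwrite the leaf at the computed coordinates."""
        level = self.level_for_voxel_size(voxel_size)
        key = (level, *self.coords_at_level(point, level))
        self.leaves[key] = attributes             # the latest voxel overrides an earlier one
        return key

# Example: a world point inserted into a 16 m world at roughly 1 cm detail.
svo = SparseVoxelOctree(world_size=16.0, max_level=11)
print(svo.insert((3.2, 1.1, 2.5), voxel_size=0.01, attributes={"color": (200, 180, 160)}))
```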
  • a volumetric video frame is a complete sparse voxel octree that models the world at a specific point in time in a video sequence.
  • Voxel attributes contain information like color, opacity, surface normal vectors, and surface material properties. These are referenced in the sparse voxel octrees (e.g. color of a solid voxel), but can also be stored separately.
  • Point clouds are a commonly used data structure for storing volumetric content. Compared to point clouds, sparse voxel octrees describe a recursive subdivision of a finite volume with solid voxels of varying sizes, while point clouds describe an unorganized set of separate points limited only by the precision of the used coordinate values.
  • Voxel coordinates uniquely identify an individual node or solid voxel within the octree.
  • the coordinates are not stored in the SVO itself but instead describe the location and size of the node/voxel.
  • the coordinates have four integer components: level, X, Y, and Z.
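  • As a sketch of how these four integer components relate to world space (under the same illustrative cubic-world assumption as above), a node's origin and size, and its parent's coordinates, can be derived as follows.

```python
# Sketch: mapping (level, X, Y, Z) voxel coordinates, which are not stored in the
# SVO itself, to a node's world-space origin and edge length, plus the coordinates
# of the enclosing parent node. The 16 m cubic world bound is an assumption.
def node_bounds(level, x, y, z, world_size=16.0):
    """Return (origin, size) of the node identified by the four integer components."""
    size = world_size / (1 << level)     # each level halves the edge length
    return (x * size, y * size, z * size), size

def parent_coords(level, x, y, z):
    """Coordinates of the node one level up that contains this node."""
    return level - 1, x >> 1, y >> 1, z >> 1

print(node_bounds(3, 5, 2, 7))           # ((10.0, 4.0, 14.0), 2.0)
print(parent_coords(3, 5, 2, 7))         # (2, 2, 1, 3)
```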
  • the present embodiments relate to real-time computer graphics and virtual reality (VR).
  • Volumetric video is to virtual reality what traditional video is to 2D/3D displays.
  • VR technology has entered an early adoption phase in the consumer market. This means that volumetric video playback is now possible on high-end consumer GPUs (Graphics Processing Units), even in VR, but mass market adoption may still be a few years in the future.
  • Mipmaps are pre-calculated, optimized sequences of images, each of which is a progressively lower resolution representation of the same image.
  • the height and width of each image, or level, in the mipmap is half the size of the previous level. They are intended to increase rendering speed and reduce aliasing artifacts.
  • each level in the octree can be considered a 3D mipmap of the next lower level.
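  • A tiny sketch of the halving rule (shown for the 2D mipmap case; an octree level halves the cell edge in the same way in three dimensions):

```python
# Sketch: each mipmap level halves the width and height of the previous level,
# down to 1x1. The starting resolution is an arbitrary example.
def mip_chain(width, height):
    levels = [(width, height)]
    while levels[-1] != (1, 1):
        w, h = levels[-1]
        levels.append((max(1, w // 2), max(1, h // 2)))
    return levels

print(mip_chain(8, 8))   # [(8, 8), (4, 4), (2, 2), (1, 1)]
```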
  • each frame may produce several hundred megabytes or several gigabytes of voxel data which needs to be converted to a format that can be streamed to the viewer, and rendered in real-time.
  • the amount of data depends on the world complexity and the number of cameras. The larger impact comes in a multi-device recording setup with a number of separate locations where the cameras are recording. Such a setup produces more information than a camera at a single location.
  • the present embodiments are targeted to a problem of errors in generated voxel output.
  • the compared nodes are selected nodes of the frames' sparse voxel octrees. Due to sparseness, nodes may also be absent in either of the compared volumetric frames.
  • the changed, added, or deleted node locations are given unique IDs within the video sequence, and the node subtrees are written as patches to a reference volumetric frame.
  • Figure 7 illustrates an example of a volumetric video pipeline. This implementation is entirely based on SVOs. Other implementations are also possible, for instance the Playback component could be handled via multiple 2D color+depth video streams.
  • the present embodiments are targeted to the "Voxelization" and "Content Analysis" stages of Voxel Encoding 740 in the pipeline.
  • multiple cameras 715 capture video data of the world, which video data is input 720 to the pipeline.
  • the video data comprises image frames, positions and depth maps 730 which are transmitted to the Voxel Encoding 740.
  • A volumetric reference frame may be chosen for each sequence.
  • the reference frame can be the first frame in the sequence, or the reference frame can be any one of the other frames in the sequence. Alternatively, the reference frame can be combined from more than one volumetric frame of the video sequence.
  • the incoming camera images and depth maps go through image and signal processing that aims to identify and track image features and objects. These known features and objects can then be referenced throughout the video sequence for additional processing and as metadata.
  • SVO sparse voxel octree
  • the encoder processes each frame in the sequence separately. Each frame is compared against the one reference frame chosen for the sequence.
  • In a volumetric video pipeline, a source device is understood to mean a 3D camera, multi-lens capture device, LiDAR, or any other sensing device that produces images, depth information, and/or other such data about the surrounding world. It may be assumed that source device positions and poses are calibrated to the same world coordinate space so that all output produced by them uses the same coordinates.
  • Each source device has a valid volume (W) that represents the space that the device can observe. Inside the valid volume, some areas may be more accurate in terms of depth and/or color values. For instance, the outer edges of the view may have more distortion when using fish-eye lenses; or in the case of the camera of Figures 1a and 1b, the backwards-facing direction may not be recorded in three-dimensional form.
  • When the source device is a camera, the depth values determine how far the valid volume extends from the camera origin.
  • Figure 8 shows an example of this: the volume extends to a predefined maximum distance from the camera 715, except where objects 881 cause occlusions.
  • the valid volume is depicted as an area 880 in which the darkness of the filling of the area illustrates the reliability of the recorded data. The darker the filling the less reliable the recorded data may be.
  • Sparse voxel octrees are rigidly organized data structures, which means merging a subtree from one sparse voxel octree into another sparse voxel octree is trivial, so that each SVO contributes details to a combined SVO. This is especially useful when merging the captured contents of multiple 3D cameras. Also, resolution adjustments and spatial subdivision of sparse voxel octrees are trivial: resolution can be changed simply by limiting the depth of the tree, and subdivision can be done by picking specific subtrees. This makes the data structure well-suited for parallelized encoding and adaptive streaming. SVOs also have the advantage of supporting variable resolution within the volume. There are also techniques for reducing the total size of the SVO by sharing subtrees between nodes (DAG).
  • DAG directed acyclic graph
  • the layouts of the octrees are always compatible with each other. However, when a plurality of solid voxels occupy a given voxel coordinate, it may be necessary to either pick one of them or merge the voxels together. In the following, it is assumed that the latest solid voxel to be stored at the given coordinates overrides a previously existing one.
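  • The following sketch illustrates the merge, resolution-limiting and subdivision operations described above on the same illustrative (level, x, y, z) -> attributes leaf map used in the earlier sketches; it is an assumption for illustration, not the storage format of the embodiments.

```python
# Sketch of merging two SVO leaf maps (the latest voxel at a coordinate overrides),
# limiting resolution by tree depth, and picking a spatial subtree.
def merge(target, source):
    """Merge source leaves into target; the latest voxel at a coordinate overrides."""
    target.update(source)
    return target

def limit_resolution(leaves, max_level):
    """Reduce detail by discarding leaves deeper than a maximum level."""
    return {k: v for k, v in leaves.items() if k[0] <= max_level}

def pick_subtree(leaves, level, x, y, z):
    """Spatial subdivision: keep only leaves inside the subtree rooted at a node."""
    def inside(key):
        l, lx, ly, lz = key
        if l < level:
            return False
        shift = l - level
        return (lx >> shift, ly >> shift, lz >> shift) == (x, y, z)
    return {k: v for k, v in leaves.items() if inside(k)}

a = {(3, 5, 2, 7): "cam1", (4, 10, 4, 14): "cam1"}
b = {(3, 5, 2, 7): "cam2"}
print(merge(dict(a), b))                 # cam2 overrides cam1 at (3, 5, 2, 7)
print(limit_resolution(a, max_level=3))  # only the level-3 leaf remains
print(pick_subtree(a, 2, 2, 1, 3))       # both leaves lie under node (2, 2, 1, 3)
```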
  • errors in the generated voxel output may be reduced by, for example, choosing an exclusive source device for a sparse voxel octree node, choosing the processing order of the source devices, checking surface normals when interpolating between depth map points, and/or converging distorted depth map contents with the help of image feature analysis.
  • one way to eliminate or at least reduce these errors is to select only one source device whose data is trusted in each node of the sparse voxel octree.
  • determination of source devices which may be excluded may be, for example, distance-based or node-based.
  • Distance-based camera exclusion may be performed, for example, as follows. During the voxelization of a depth map, a given solid point is ignored if there is another, previously voxelized source device whose position within the scene is closer to the solid point, and the solid point is within the valid volume of that other camera. This can be checked at the granularity of individual points.
  • If the reliability within the valid volumes can be estimated on a more general level, it can be used as a weighting factor when comparing the source distances. As an example, if a point is at a distance A1 from a first camera and the same point is at a distance A2>A1 from a second camera, the farther-away second camera could still be preferable if the second camera's valid volume has higher reliability at the given point.
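  • A sketch of this distance-based exclusion test, with the reliability estimate used as a simple weight on distance; the weighting scheme and the data layout are assumptions for illustration.

```python
# Sketch of the distance-based exclusion test described above. Dividing the
# distance by a per-source reliability estimate is an illustrative assumption.
import math

def should_ignore_point(point, current_source, previous_sources):
    """Ignore the point if an already-voxelized source sees it more reliably up close."""
    d_current = math.dist(point, current_source["position"])
    for src in previous_sources:
        if not src["valid_volume_contains"](point):
            continue                                  # point outside that source's valid volume
        d_other = math.dist(point, src["position"])
        weighted = d_other / max(src.get("reliability", 1.0), 1e-6)
        if weighted < d_current:
            return True                               # a closer / more reliable source wins
    return False

cam_a = {"position": (0.0, 0.0, 0.0), "reliability": 1.0,
         "valid_volume_contains": lambda p: math.dist(p, (0.0, 0.0, 0.0)) < 5.0}
cam_b = {"position": (4.0, 0.0, 0.0)}
print(should_ignore_point((1.0, 0.0, 0.0), cam_b, [cam_a]))   # True: cam_a is closer
```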
  • In node-based exclusion, the views of the source devices may be examined and compared with each other. Then, the selection may be performed so that the source device that has the best view of each given node of the scene will be selected.
  • criteria for defining the best view will be shortly introduced:
  • the first camera to voxelize a given area has authority to decide which parts of it are empty space. Essentially the valid volume of each source is considered empty space according to that device. Subsequent sources can then augment the sparse voxel octree with more information, but they cannot insert solid voxels into space that has already been classified empty by previous, more trusted sources.
  • the processing order of the source devices may be chosen with respect to the important/recognized objects in the scene so that the objects have a higher likelihood of being processed with fewer errors. For example, a source device whose valid volume covers the most important objects with good reliability could be processed first.
  • a most probable viewing direction (MPVD) may be defined at each moment of operation, for example as the direction on which the largest number of camera units of the multicamera device are focused. Based on the MPVDs of two or more multicamera devices, one or more intersection points of the most probable viewing directions may be found. Such intersection points are expected to be the one or more areas which users are most probably interested in watching.
  • a VOI may be defined. For example, if the intersection point refers to the location of a display, then the whole display may be considered as a VOI. As another example, if the intersection point refers to the location of a person or a car, then the whole person or car may be considered as the VOI. It is also possible that the MPVDs do not cross at any point, for example when the MPVDs of different multicamera devices are referring to different parts of the same object. In such a case, the intersection point may be selected based on the location where the MPVDs pass by each other with the least distance.
  • If the MPVDs do not cross but at least two of them hit the same object, that object can be considered as the VOI. If there is more than one object which can be selected by this method, the one which has the most hits from the MPVDs may be selected as the VOI. If there is more than one object with the same number of hits, the one which is closer to the location of the viewer may be selected as the VOI.
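  • When the MPVDs pass by each other without crossing, the point where they come closest can be computed with standard ray geometry; the sketch below returns the midpoint of the segment of closest approach between two MPVDs (purely illustrative, not code from the embodiments).

```python
# Sketch: pick an "intersection" point for two most probable viewing directions
# (MPVDs) that do not exactly cross, as the midpoint of their segment of closest
# approach.
import numpy as np

def mpvd_meeting_point(origin1, dir1, origin2, dir2):
    d1 = np.asarray(dir1, dtype=float); d1 /= np.linalg.norm(d1)
    d2 = np.asarray(dir2, dtype=float); d2 /= np.linalg.norm(d2)
    o1 = np.asarray(origin1, dtype=float); o2 = np.asarray(origin2, dtype=float)
    w = o1 - o2
    b = d1 @ d2
    denom = 1.0 - b * b
    if denom < 1e-9:                       # parallel MPVDs: no meaningful meeting point
        return None
    t1 = (b * (d2 @ w) - (d1 @ w)) / denom
    t2 = ((d2 @ w) - b * (d1 @ w)) / denom
    p1 = o1 + t1 * d1                      # closest point on the first MPVD
    p2 = o2 + t2 * d2                      # closest point on the second MPVD
    return (p1 + p2) / 2.0                 # split the remaining gap evenly

print(mpvd_meeting_point((0, 0, 0), (1, 0, 0), (5, 5, 1), (0, -1, 0)))  # ~[5. 0. 0.5]
```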
  • a parameter indicating the viewer's probable interest with the scene may be determined on the basis of the depth information of the scene.
  • the current location/viewing direction of the user may be taken into account.
  • different representations of the scene may be available in order to be able to switch to different presentations based on the movement of the user through the scene. The selection of VOI may then be performed adaptively based on the relative distance of each object to the current location of the viewer.
  • the valid volume of that source device may be examined.
  • image information, depth map and possibly other information regarding that source device may be used to add nodes to the sparse voxel octree in the voxelization phase.
  • the next source device in the determined processing order may be selected for processing.
  • Source exclusion may be applied here so that data from higher ranked devices may exclude some or all of the current source device's data.
  • new nodes may be added to the sparse voxel octree as subtrees. Voxels that already exist in the sparse voxel octree may only be replaced by subtrees containing more details (smaller voxels) about the node in question.
  • This procedure may be repeated for each source device which has been defined a rank in the determination of source device processing order.
  • One common type of artifact in the produced sparse voxel octree output may be surface discontinuities (holes and banding) that occur when two adjacent points in the depth maps are separated by relatively large depth values.
  • This artifact may be corrected by interpolating between the depth map points.
  • interpolation may also occur along the edges of objects (such as humans) causing the edges to stretch out and blend with the background.
  • This problem may be alleviated by setting a depth delta threshold that prevents interpolation when exceeded.
  • However, one threshold value may not work in all circumstances.
  • Many interpolation artifacts can be avoided as follows: if the interpolation end points have normal vectors facing the same direction (e.g., points on the same flat surface), the threshold can be higher, so gaps in floors, ceilings and walls can be filled with a higher probability.
  • the threshold value may need to be proportional to the distance from the source devices (reflecting local level of detail). It may also be possible to combine or average all sampled normal vectors of a given solid voxel from multiple sources to cancel noise.
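  • A sketch of such an interpolation decision, with a threshold that is relaxed for co-planar end points and scaled with distance; the constants are illustrative assumptions.

```python
# Sketch of the interpolation decision described above: a depth-delta threshold
# that is relaxed when the two end points share a similar surface normal and is
# scaled with distance from the source device. The constants are illustrative.
import numpy as np

def should_interpolate(depth_a, depth_b, normal_a, normal_b,
                       base_threshold=0.05, distance_scale=0.02,
                       normal_agreement=0.9, relax_factor=4.0):
    """Return True if the gap between two adjacent depth samples may be filled."""
    n_a = np.asarray(normal_a, float); n_a /= np.linalg.norm(n_a)
    n_b = np.asarray(normal_b, float); n_b /= np.linalg.norm(n_b)
    # Threshold grows with distance (local level of detail is coarser far away).
    threshold = base_threshold + distance_scale * min(depth_a, depth_b)
    # Co-planar end points (normals facing the same direction) allow a larger gap,
    # so holes in floors, walls and ceilings are filled with higher probability.
    if float(n_a @ n_b) > normal_agreement:
        threshold *= relax_factor
    return abs(depth_a - depth_b) <= threshold

# A 0.3 m gap on a flat floor 3 m away is filled; the same gap at an object edge is not.
print(should_interpolate(3.0, 3.3, (0, 1, 0), (0, 1, 0)))      # True
print(should_interpolate(3.0, 3.3, (0, 1, 0), (1, 0, 0)))      # False
```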
  • Image content analysis may also be used to enhance surface normal estimation.
  • the original camera images may be segmented to planar regions of similar materials. Each contiguous planar region may then be assumed to have the same surface normal vector for all the corresponding voxels. Similarly, other geometric shapes may be detected (cylinders, spheres).
  • Another common artifact in the produced sparse voxel octree output may be that depth maps from different sources place the same captured surfaces in different world coordinates. This may be particularly true if the reliability of the depth values is low, for example when the depth values are estimated from camera frames from a certain type of a multicamera.
  • this artifact may be alleviated with the help of content analysis and image features recognized and matched across multiple depth maps. During the content analysis stage, features that are visible in two or more depth maps are identified. To perform this, the exact XY coordinates in the source camera images may be needed. The corresponding depth map points are then mapped to 3D space, and the distribution inside the group of points is examined.
  • suitable depth convergence factors (DCF) are chosen for each depth map so that the resulting 3D points converge as closely as possible to a single point in world coordinate space.
  • the depth convergence factor values can be weighted by the reliability of the depth information. For instance, if one source device is a LiDAR, its output is likely close to the ground truth so it will receive a DCF very close to 1.0 while the other points are moved to converge toward it.
  • the factors are stored into a depth convergence factor map that in practice can be a monochrome 2D texture map whose resolution may be significantly lower than the depth map's resolution.
  • the depth convergence factor map can use a 1/16 or 1/8 resolution compared to the depth maps.
  • If image features produce overlapping depth convergence factor values, those can be combined as an average value, or alternatively the prominence and reliability of the corresponding image features can be used as criteria for choosing which scaling factors to use and which to omit. It should be noted that each depth map will have a separate depth convergence factor map associated only with it.
  • the depth convergence factor map is then applied to all the depth values read from depth maps.
  • the depth convergence factor values are interpolated on both the X and Y axes so that there are no sharp edges in the factor values.
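  • A sketch of applying a low-resolution depth convergence factor map to a depth map with bilinear upsampling; the 1/8 resolution and the example values are assumptions for illustration.

```python
# Sketch of applying a low-resolution depth convergence factor (DCF) map: the DCF
# map is bilinearly upsampled to the depth map's resolution (so there are no sharp
# edges in the factor values) and multiplied into the depth values.
import numpy as np
from scipy.ndimage import zoom

def apply_dcf(depth_map, dcf_map):
    """Scale every depth value by its (interpolated) depth convergence factor."""
    zy = depth_map.shape[0] / dcf_map.shape[0]
    zx = depth_map.shape[1] / dcf_map.shape[1]
    dcf_full = zoom(dcf_map, (zy, zx), order=1)    # bilinear interpolation on X and Y
    return depth_map * dcf_full

depth = np.full((64, 64), 2.0)                     # flat surface reported at 2.0 m
dcf = np.ones((8, 8)); dcf[:, 4:] = 0.95           # right half converges slightly closer
converged = apply_dcf(depth, dcf)
print(converged[0, 0], converged[0, -1])           # ~2.0 on the left, ~1.9 on the right
```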
  • the generated sparse voxel octree model is of higher quality when there are multiple source devices used for recording the scene.
  • As output for each frame, the encoder produces a frame change set. This is illustrated in Figure 9.
  • the change sets comprise at least a frame number (e.g. within the encoded sequence of frames); and a set of location IDs 901, 902, each associated with a sparse voxel subtree 905.
  • a deleted subtree is encoded as a special value that identifies that no subtree exists for that location ("X" in Figure 9). If no changes were detected in the compared frames, the change set can be omitted from the output entirely.
  • each node may have the addresses of eight child nodes, and the address of one parent node.
  • When subtrees are written as patches, their root node's parent node is the parent node of the corresponding node in the reference octree.
  • the output data for the entire sequence contains the full reference octree (that contains the location IDs) plus all the frame change sets.
  • attributes can be shared between the octrees/subtrees to reduce total data size. This is useful also in the case when nodes have been added and the change set thus also contains unmodified contents from the reference octree.
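  • A sketch of what a frame change set, and its application against the reference octree at playback, could look like; the container types and the DELETED sentinel are illustrative assumptions, not the serialization format of the embodiments.

```python
# Sketch of the frame change set output described above: a frame number plus a
# mapping from location IDs to either a replacement subtree or a deletion marker.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

DELETED = None                      # "no subtree exists for this location" ("X" in Figure 9)

@dataclass
class FrameChangeSet:
    frame_number: int
    changes: Dict[int, Optional[Any]] = field(default_factory=dict)   # location ID -> subtree

def apply_change_set(reference_subtrees: Dict[int, Any], change_set: FrameChangeSet):
    """Rebuild the frame's subtree map from the reference octree plus one change set."""
    frame = dict(reference_subtrees)                # start from the reference frame
    for location_id, subtree in change_set.changes.items():
        if subtree is DELETED:
            frame.pop(location_id, None)            # subtree removed in this frame
        else:
            frame[location_id] = subtree            # subtree added or replaced
    return frame

reference = {901: "subtree-A", 902: "subtree-B"}
cs = FrameChangeSet(frame_number=7, changes={901: "subtree-A'", 902: DELETED})
print(apply_change_set(reference, cs))              # {901: "subtree-A'"}
```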
  • the outcome of the Voxel Encoding 740 is a SVOX (Sparse VOXel) file 750, which is transmitted for playback 760.
  • the SVOX file 750 is streamed 770, which creates stream packets 780.
  • a voxel rendering 790 is applied which provides viewer state (e.g. current time, view frustum) 795 to the streaming 770.
  • FIG. 10 is a flowchart illustrating a method according to an embodiment.
  • video data sequence comprising volumetric frames is received 1010.
  • the volumetric frames may include image data and depth maps and possibly some other information.
  • Information from different source devices may be examined 1020 to determine whether to use information from only one or more than one of the plurality of source devices to construct a node to the sparse voxel octree. If it is determined to use only one source device, the information of a selected source device is used 1030 for the construction of the node to the sparse voxel octree.
  • If it is determined to use more than one source device, a processing order is defined 1040 for the plurality of source devices, and the information of the selected source devices is used 1050 in the defined processing order to construct the node to the sparse voxel octree; a sketch of one possible selection and ordering criterion is given at the end of this description.
  • a reference sparse voxel octree may be selected 1060 from the volumetric frames.
  • a previously generated reference sparse voxel octree may also be retained without further changes.
  • the frame sparse voxel octree and the reference sparse voxel octree are then examined 1070 to determine whether one or more nodes of the frame sparse voxel octree have changed or not.
  • An identification is assigned 1080 for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and a frame change set is produced 1090 for the frame.
  • the method may select 1110 the next frame and proceed from step 1020.
  • the method may proceed to produce 1120 an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets for the frames of the sequence, which output is transmitted for playback.
  • An apparatus comprises means for receiving a video data sequence comprising volumetric frames; means for generating a reference sparse voxel octree based on zero or more volumetric frames within the video sequence; for any other frame of the volumetric frames: means for generating a frame sparse voxel octree for a frame that is currently encoded; means for comparing the frame sparse voxel octree to the reference sparse voxel octree to detect changes between nodes of the frame sparse voxel octree and the reference sparse voxel octree; means for assigning an identification for a node to the reference sparse voxel octree, when a change between the compared nodes is detected; and means for producing a frame change set for the frame; and means for producing an output for the video data sequence, the output comprising the reference sparse voxel octree with identifications and the produced frame change sets for the frames of the sequence, which output is transmitted for playback.
  • a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
  • Figure 11a shows a schematic block diagram of an exemplary apparatus or electronic device 50 depicted in Figure 11b, which may incorporate a transmitter according to an embodiment of the invention.
  • the electronic device 50 may for example be a mobile terminal or user equipment of a wireless communication system. However, it would be appreciated that embodiments of the invention may be implemented within any electronic device or apparatus which may require transmission of radio frequency signals.
  • the apparatus 50 may comprise a housing 30 for incorporating and protecting the device.
  • the apparatus 50 further may comprise a display 32 in the form of a liquid crystal display.
  • the display may be any display technology suitable for displaying an image or video.
  • the apparatus 50 may further comprise a keypad 34.
  • any suitable data or user interface mechanism may be employed.
  • the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
  • the term battery discussed in connection with the embodiments may also be one of these mobile energy devices.
  • the apparatus 50 may comprise a combination of different kinds of energy devices, for example a rechargeable battery and a solar cell.
  • the apparatus may further comprise an infrared port 41 for short range line of sight communication to other devices.
  • the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/FireWire wired connection.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50.
  • the controller 56 may be connected to memory 58, which in embodiments of the invention may store data and/or instructions for implementation on the controller 56.
  • the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of audio and/or video data or assisting in coding and decoding carried out by the controller 56.
  • the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a universal integrated circuit card (UICC) reader and a universal integrated circuit card for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
  • the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network.
  • the apparatus 50 may further comprise an antenna 60 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
  • the apparatus 50 comprises a camera 42 capable of recording or detecting images.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
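
The following non-limiting sketches illustrate in Python some of the processing steps described above; all class, function and parameter names (for example apply_dcf_map, bilinear_sample, SvoNode) are hypothetical and chosen only for illustration, and the sketches are written under stated assumptions rather than as definitive implementations. The first sketch shows how a low-resolution depth convergence factor map could be bilinearly interpolated on the X and Y axes and applied to the depth values read from a full-resolution depth map.

    import numpy as np

    def bilinear_sample(dcf_map, u, v):
        """Bilinearly interpolate the low-resolution depth convergence factor map
        at normalized coordinates (u, v) in [0, 1], so that the factor values
        contain no sharp edges."""
        h, w = dcf_map.shape
        x = u * (w - 1)
        y = v * (h - 1)
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
        fx, fy = x - x0, y - y0
        top = dcf_map[y0, x0] * (1 - fx) + dcf_map[y0, x1] * fx
        bottom = dcf_map[y1, x0] * (1 - fx) + dcf_map[y1, x1] * fx
        return top * (1 - fy) + bottom * fy

    def apply_dcf_map(depth_map, dcf_map):
        """Scale every depth value by the interpolated depth convergence factor;
        the factor map may be e.g. 1/8 or 1/16 of the depth map resolution.
        (Plain Python loops are used for clarity, not for speed.)"""
        h, w = depth_map.shape
        corrected = np.empty_like(depth_map, dtype=float)
        for row in range(h):
            for col in range(w):
                factor = bilinear_sample(dcf_map, col / (w - 1), row / (h - 1))
                corrected[row, col] = depth_map[row, col] * factor
        return corrected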
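
The second sketch outlines one conceivable representation of the octree nodes and frame change sets discussed above: each node holds up to eight child addresses and one parent address, a deleted subtree is marked with a special value, and applying a change set replaces subtrees in the reference octree by location ID. The structure and field names are assumptions for illustration only.

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    DELETED_SUBTREE = None  # special value: no subtree exists for that location ("X" in Figure 9)

    @dataclass
    class SvoNode:
        children: List[Optional["SvoNode"]] = field(default_factory=lambda: [None] * 8)
        parent: Optional["SvoNode"] = None
        attributes: Optional[dict] = None   # may be shared between octrees/subtrees to reduce data size
        location_id: Optional[int] = None   # assigned in the reference octree when a change is detected

    @dataclass
    class FrameChangeSet:
        frame_number: int                   # within the encoded sequence of frames
        # location ID -> replacement subtree root, or DELETED_SUBTREE when the subtree was removed
        changes: Dict[int, Optional[SvoNode]] = field(default_factory=dict)

    def apply_change_set(nodes_by_location: Dict[int, SvoNode],
                         change_set: FrameChangeSet) -> None:
        """Replace, in a reference octree indexed by location ID, every subtree listed
        in the change set; the subtree's root adopts the parent of the corresponding
        node in the reference octree."""
        for location_id, subtree in change_set.changes.items():
            old = nodes_by_location.get(location_id)
            if subtree is DELETED_SUBTREE:
                if old is not None and old.parent is not None:
                    old.parent.children = [None if c is old else c for c in old.parent.children]
                nodes_by_location.pop(location_id, None)
            else:
                if old is not None and old.parent is not None:
                    subtree.parent = old.parent
                    old.parent.children = [subtree if c is old else c for c in old.parent.children]
                nodes_by_location[location_id] = subtree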
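
The third sketch illustrates a possible streaming loop in which the reference octree is sent first and the frame change sets are then packetised, using the viewer state (current time, view frustum) fed back from the voxel rendering; the SvoxFile and ViewerState containers and the skipping criterion are assumptions made for this sketch.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class ViewerState:
        current_time: float            # playback position reported by the voxel rendering
        view_frustum: object = None    # current view frustum (its representation is left abstract here)

    @dataclass
    class SvoxFile:
        reference_octree: object                        # full reference octree with location IDs
        frame_change_sets: List[object] = field(default_factory=list)

    def stream_svox(svox: SvoxFile,
                    get_viewer_state: Callable[[], ViewerState],
                    send_packet: Callable[[object], None]) -> None:
        """Send the reference octree first, then one stream packet per relevant
        frame change set, skipping frames the viewer has already passed."""
        send_packet(svox.reference_octree)
        for change_set in svox.frame_change_sets:
            state = get_viewer_state()                  # viewer state fed back from the rendering
            if getattr(change_set, "frame_number", 0) < state.current_time:
                continue                                # frame is behind the playback position
            send_packet(change_set)                     # stream packet for playback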
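
The final sketch shows one possible criterion for deciding whether a node is constructed from a single source device or from several source devices in a defined processing order; the reliability values and the threshold are assumptions, echoing the example above in which a LiDAR output is treated as close to the ground truth.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SourceInfo:
        device_id: str
        covers_node: bool     # whether the device has data for the node's region
        reliability: float    # e.g. close to 1.0 for a LiDAR, lower for stereo-derived depth

    def select_sources_for_node(sources: List[SourceInfo],
                                reliability_threshold: float = 0.9) -> List[str]:
        """Return a single device ID when one source is reliable enough on its own,
        otherwise return all covering devices in a processing order sorted by
        decreasing reliability."""
        candidates = [s for s in sources if s.covers_node]
        if not candidates:
            return []
        best = max(candidates, key=lambda s: s.reliability)
        if best.reliability >= reliability_threshold or len(candidates) == 1:
            return [best.device_id]                     # use information from only one source device
        ordered = sorted(candidates, key=lambda s: s.reliability, reverse=True)
        return [s.device_id for s in ordered]           # defined processing order for several devices

    # Example: a LiDAR plus two camera-derived depth sources
    if __name__ == "__main__":
        sources = [SourceInfo("lidar", True, 0.98),
                   SourceInfo("cam_left", True, 0.7),
                   SourceInfo("cam_right", False, 0.7)]
        print(select_sources_for_node(sources))         # -> ['lidar']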

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Library & Information Science (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The present invention relates to various methods, apparatuses and computer program products for video coding. In some embodiments, a video data sequence comprising volumetric frames is received from a plurality of source devices for construction of a sparse voxel octree. Information from different source devices is examined to determine whether to use information from only one or from more than one of the plurality of source devices to construct a node to the sparse voxel octree. If it is determined to use only one source device, the information of the selected source device is used for the construction of the node to the sparse voxel octree. If it is determined to use more than one source device, a processing order is defined for the plurality of source devices, and the information of the selected source devices is used in the defined processing order for the construction of the node to the sparse voxel octree.
PCT/FI2018/050534 2017-07-07 2018-07-05 Méthode et appareil d'encodage de contenu multimédia WO2019008233A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20175673 2017-07-07
FI20175673 2017-07-07

Publications (1)

Publication Number Publication Date
WO2019008233A1 true WO2019008233A1 (fr) 2019-01-10

Family

ID=64949762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2018/050534 WO2019008233A1 (fr) 2017-07-07 2018-07-05 Méthode et appareil d'encodage de contenu multimédia

Country Status (1)

Country Link
WO (1) WO2019008233A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078689A (zh) * 2019-11-20 2020-04-28 深圳希施玛数据科技有限公司 一种非连续型预排序遍历树算法的数据处理方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040023612A1 (en) * 2002-08-02 2004-02-05 Kriesel Marshall S. Apparatus and methods for the volumetric and dimensional measurement of livestock
US20040104935A1 (en) * 2001-01-26 2004-06-03 Todd Williamson Virtual reality immersion system
US20140125767A1 (en) * 2012-02-24 2014-05-08 Matterport, Inc. Capturing and aligning three-dimensional scenes
WO2017079657A1 (fr) * 2015-11-04 2017-05-11 Intel Corporation Utilisation de vecteurs de mouvements temporels pour une reconstruction 3d

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040104935A1 (en) * 2001-01-26 2004-06-03 Todd Williamson Virtual reality immersion system
US20040023612A1 (en) * 2002-08-02 2004-02-05 Kriesel Marshall S. Apparatus and methods for the volumetric and dimensional measurement of livestock
US20140125767A1 (en) * 2012-02-24 2014-05-08 Matterport, Inc. Capturing and aligning three-dimensional scenes
WO2017079657A1 (fr) * 2015-11-04 2017-05-11 Intel Corporation Utilisation de vecteurs de mouvements temporels pour une reconstruction 3d

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HEMMAT, J. H. ET AL.: "Exploring Distance-Aware Weighting Strategies for Accurate Reconstruction of Voxel-Based 3D Synthetic Models.", INTERNATIONAL CONFERENCE ON MULTIMEDIA MODELING (MMM' 14), vol. 8325, 6 January 2014 (2014-01-06), pages 412 - 423, XP047103780, ISBN: 978-3-319-04114-8 *
KAMPE, V. ET AL.: "Fast, Memory-Efficient Construction of Voxelized Shadows", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, vol. 22, no. 10, October 2016 (2016-10-01), pages 2239 - 2248, XP011621352, [retrieved on 20181023] *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078689A (zh) * 2019-11-20 2020-04-28 深圳希施玛数据科技有限公司 一种非连续型预排序遍历树算法的数据处理方法及系统
CN111078689B (zh) * 2019-11-20 2023-05-26 深圳希施玛数据科技有限公司 一种非连续型预排序遍历树算法的数据处理方法及系统

Similar Documents

Publication Publication Date Title
US10757423B2 (en) Apparatus and methods for compressing video content using adaptive projection selection
EP3669333B1 (fr) Codage et décodage séquentiels de vidéo volumétrique
EP3695597B1 (fr) Appareil et procédé de codage/décodage d'une vidéo volumétrique
US10600233B2 (en) Parameterizing 3D scenes for volumetric viewing
US10567464B2 (en) Video compression with adaptive view-dependent lighting removal
US11430156B2 (en) Apparatus, a method and a computer program for volumetric video
US10499033B2 (en) Apparatus, a method and a computer program for coding and rendering volumetric video
WO2019034808A1 (fr) Codage et décodage de vidéo volumétrique
EP3396635A2 (fr) Procédé et équipement technique de codage de contenu multimédia
WO2018172614A1 (fr) Procédé, appareil et produit-programme informatique pour la diffusion en continu adaptative
WO2019162567A1 (fr) Codage et décodage de vidéo volumétrique
WO2019008222A1 (fr) Procédé et appareil de codage de contenu multimédia
WO2018091770A1 (fr) Procédé destiné à un dispositif multicaméra
GB2558893A (en) Method for processing media content and technical equipment for the same
WO2018109265A1 (fr) Procédé et équipement technique de codage de contenu de média
CN109479147B (zh) 用于时间视点间预测的方法及技术设备
WO2019008233A1 (fr) Méthode et appareil d'encodage de contenu multimédia
WO2019077199A1 (fr) Appareil, procédé, et programme d'ordinateur pour vidéo volumétrique
WO2018109266A1 (fr) Procédé et équipement technique pour rendre un contenu multimédia
US20200311978A1 (en) Image encoding method and technical equipment for the same
GB2601597A (en) Method and system of image processing of omnidirectional images with a viewpoint shift
WO2017220851A1 (fr) Procédé de compression d'images et équipement technique pour ce procédé

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18828891

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18828891

Country of ref document: EP

Kind code of ref document: A1