WO2019076764A1

WO2019076764A1 - Methods for encoding and decoding a data flow representing an omnidirectional video

Info

Publication number: WO2019076764A1
Application number: PCT/EP2018/077922
Authority: WO
Inventors: Thibaud BIATEK
Original assignee: Tdf
Priority date: 2017-10-19
Filing date: 2018-10-12
Publication date: 2019-04-25
Also published as: CN111357292A; FR3072850B1; US20220046279A1; US20200267411A1; FR3072850A1; US11736725B2; EP3698546A1; US11172223B2

Abstract

The invention relates to a method for encoding and to a device for encoding a data flow representing an omnidirectional video, and, correlatively, to a method for decoding and to a device for decoding a data flow representing an omnidirectional video. According to the invention, the data flow representing an omnidirectional video comprises data encoded with at least one base layer representing a 2D or 3D video representing a view of a scene captured by the omnidirectional video, and data encoded with at least one enhancement layer representing the omnidirectional video, the at least one enhancement layer being predictively encoded in relation to the at least one base layer.

Description

Methods of encoding and decoding a data stream representative of omnidirectional video

1. Field of the invention

The invention lies in the field of video compression, and more particularly of techniques for coding and decoding immersive or omnidirectional video (e.g. 180 °, 360 °, in 2D or 3D).

2. Prior Art

Omnidirectional video content represents a scene from a central point in any direction. It is referred to as 360 ° video content when the entire field is captured. A subset of the field can also be captured, for example 180 ° only. Content can be captured monoscopically (2D) or stereoscopically (3D). This type of content can be generated by assembling sequences of images captured by different cameras, or it can be computer-generated (e.g. VR video games). The images of such video content allow a suitable device to render the video in any direction. A user can control the direction in which the captured scene is displayed and navigate continuously in all possible directions.

Such 360 ° video contents may for example be rendered using a virtual reality headset offering the user an impression of immersion in the scene captured by the 360 ° video content.

Such 360 ° video contents require reception devices adapted to this type of content (a virtual reality headset for example) in order to offer immersion and to let the user control the displayed view.

However, most currently deployed video content receivers are not compatible with this type of 360 ° video content and only allow the rendering of conventional 2D or 3D video content. Indeed, the rendering of a 360 ° video content requires the application of geometric transformations to the images of the video in order to restore the desired viewing direction.

Thus, the broadcasting of 360 ° video contents is not backward compatible with the installed base of video receivers and is limited to receivers adapted to this type of content. However, content captured specifically for a 360 ° video broadcast may already have been captured for a 2D or 3D video broadcast. In this case, it is the whole of the 360 ° content, projected onto a plane, that is broadcast.

In addition, the simultaneous broadcast of the same captured content in different formats (2D or 3D and 360 °) to address the different video receivers is expensive in terms of bandwidth, since as many video streams must be sent as there are formats: 2D, 3D and 360 ° views of the same captured scene.

There is therefore a need to optimize the coding and broadcasting of omnidirectional video contents representing part (180 °) or all (360 °) of a scene, captured monoscopically (2D) or stereoscopically (3D).

There are layered video coding techniques, known as scalable video coding, for encoding a 2D video stream into multiple successive refinement layers providing different levels of reconstruction of the 2D video. For example, spatial scalability makes it possible to encode a video signal in several layers of increasing spatial resolution. PSNR (Peak Signal to Noise Ratio) scalability makes it possible to encode a video signal at a fixed spatial resolution in several layers of increasing quality. Color space scalability makes it possible to encode a video signal in several layers represented in increasingly wide color spaces. However, none of the existing coding techniques makes it possible to generate a video stream representative of a scene that can be decoded both by a conventional 2D or 3D video decoder and by a 360 ° video decoder.
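By way of illustration only, the layered principle recalled above can be sketched on a one-dimensional signal. The sketch assumes a two-layer spatial scalability scheme without quantization, pair averaging for the base layer and sample repetition for the inter-layer prediction; the function names are hypothetical and do not correspond to any standard codec.

```python
from typing import List, Tuple


def downsample(signal: List[float]) -> List[float]:
    """Base layer: average each pair of samples (half resolution).

    Assumes an even number of samples.
    """
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def upsample(base: List[float]) -> List[float]:
    """Inter-layer prediction: repeat each base-layer sample twice."""
    return [sample for sample in base for _ in range(2)]


def encode_two_layers(signal: List[float]) -> Tuple[List[float], List[float]]:
    """Return the base layer and the enhancement-layer residual."""
    base = downsample(signal)
    prediction = upsample(base)
    residual = [s - p for s, p in zip(signal, prediction)]
    return base, residual


def decode_full(base: List[float], residual: List[float]) -> List[float]:
    """Full-resolution reconstruction: prediction from the base plus residual."""
    return [p + r for p, r in zip(upsample(base), residual)]


signal = [10.0, 12.0, 14.0, 16.0]
base, residual = encode_two_layers(signal)
reconstructed = decode_full(base, residual)
```

A decoder limited to the base layer uses `base` alone, whereas a decoder handling both layers recovers the full-resolution signal.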

Document US 2016/156917 describes a method of scalable coding of a video which may be a multi-view video, in which each view of the multi-view video is encoded in one layer of the stream and predicted from another view of the multi-view video.

3. Presentation of the invention

The invention improves the state of the art. To this end, it relates to a method of encoding a data stream representative of an omnidirectional video, comprising:

coding in said stream of at least one base layer representative of a 2D or 3D video, the 2D or 3D video being representative of a view of the same scene captured by the omnidirectional video (360 °, 180 °, etc.),

coding in said stream of at least one enhancement layer representative of the omnidirectional video, the at least one enhancement layer being coded by prediction with respect to the at least one base layer.

The invention thus makes it possible to reduce the transmission cost of the video streams when the video contents have to be transmitted both as 2D and 360 ° views or as 3D and 3D-360 ° views. Thus, a conventional 2D or 3D video decoder will decode only one or more of the base layers to reconstruct a 2D or 3D video of the scene, and a 360 ° compatible decoder will decode the base layer(s) and at least one enhancement layer to reconstruct the 360 ° video. The use of a prediction from the at least one base layer for coding the enhancement layer thus makes it possible to reduce the cost of coding the enhancement layer.

Correlatively, the invention also relates to a method of decoding a data stream representative of an omnidirectional video, comprising:

decoding from said stream of at least one base layer representative of a 2D or 3D video, the 2D or 3D video being representative of a view of the same scene captured by the omnidirectional video,

decoding from said stream of at least one enhancement layer representative of the omnidirectional video, the at least one enhancement layer being decoded by prediction with respect to the at least one base layer.

By omnidirectional video is meant here both a video of a scene whose entire field (360 °) is captured and a video of a scene of which a subfield of the 360 ° field is captured, for example 180 °, 160 °, 255.6 °, or other. The omnidirectional video is therefore representative of a scene captured over at least one continuous part of the 360 ° field.

According to a particular embodiment of the invention, the prediction of the enhancement layer with respect to the at least one base layer comprises, for encoding or reconstructing at least one image of the enhancement layer:

the generation of a reference image obtained by geometric projection, onto the reference image, of an image, called the base image, reconstructed from the at least one base layer,

storing said reference image in a reference image memory of the enhancement layer.

Advantageously, the prediction in the enhancement layer is achieved by adding, during the coding or decoding of an image of the enhancement layer, a reference image onto which the images reconstructed from the base layers are projected. Thus, a new reference image is added to the reference image memory of the enhancement layer. This new reference image is generated by geometric projection of all the base images reconstructed from the base layers at a given time instant.
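By way of illustration only, the addition of the projected reference image to the reference image memory can be sketched as follows. The classes `Picture` and `ReferenceMemory`, the `kind` labels and the selection rule in `candidates` are hypothetical simplifications introduced for this sketch; they are not taken from the description or from any standard decoder.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Picture:
    time: int                       # time instant of the picture
    kind: str                       # "enhancement" or "projected_base"
    samples: Optional[list] = None  # placeholder for the pixel data


@dataclass
class ReferenceMemory:
    """Hypothetical reference image memory of the enhancement layer."""
    pictures: List[Picture] = field(default_factory=list)

    def store(self, picture: Picture) -> None:
        self.pictures.append(picture)

    def candidates(self, time: int) -> List[Picture]:
        """References usable to predict the enhancement-layer image at `time`.

        Assumption of this sketch: enhancement-layer pictures of other time
        instants, plus the projected base image of the same time instant.
        """
        return [
            p for p in self.pictures
            if (p.kind == "enhancement" and p.time != time)
            or (p.kind == "projected_base" and p.time == time)
        ]


memory = ReferenceMemory()
memory.store(Picture(time=0, kind="enhancement"))
memory.store(Picture(time=0, kind="projected_base"))
memory.store(Picture(time=1, kind="projected_base"))  # new reference image
references = memory.candidates(time=1)
```

The image at instant 1 can thus be predicted either temporally or from the projection of the base images of the same instant.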

According to another particular embodiment of the invention, the data stream comprises information representative of a type of geometric projection used to represent the omnidirectional video.

According to another particular embodiment of the invention, the view represented by the 2D or 3D video is a view extracted from the omnidirectional video.

According to another particular embodiment of the invention, the data stream comprises information representative of a type of geometric projection used to extract a view of the omnidirectional video and its location parameters.

According to one variant, such information representative of the projection and location parameters of said base image is encoded in the data stream at each image of the 360 ° video. Advantageously, this variant makes it possible to take into account a displacement in the scene of a view serving as a prediction for the enhancement layer. For example, the images of the video of the base layer can correspond to images captured while moving in the scene, for example to follow a moving object in the scene. For example, the view can be captured by a moving camera, or successively by several cameras located at different points of view in the scene, to follow a ball or a player during a football match for example.

According to another particular embodiment of the invention, the data stream comprises at least two base layers, each base layer being representative of a 2D or 3D video, each base layer being respectively representative of a view of the scene, the at least two base layers being coded independently of one another.

Thus, it is possible to have several independent base layers in the stream, making it possible to reconstruct several 2D or 3D views of the 360 ° video independently.

According to another particular embodiment of the invention, an image of the enhancement layer is coded using a group of tiles, each tile covering a region of the image of the enhancement layer, each region being distinct and disjoint from the other regions of the enhancement layer image, each tile being coded by prediction with respect to at least one base layer. The decoding of the enhancement layer includes the reconstruction of a portion of the image of the enhancement layer, the reconstruction of said portion of the image comprising the decoding of the enhancement layer tiles covering the portion of the image of the enhancement layer to be reconstructed, and the decoding of the at least one base layer comprising the decoding of the base layers used to predict the tiles covering the portion of the image of the enhancement layer to be reconstructed.

Such a particular embodiment of the invention makes it possible to reconstruct only part of the omnidirectional image, and not the entire image. Typically, only the part being viewed by the user is reconstructed. Thus, it is not necessary to decode all the base layers of the video stream, or even to send them to the receiver. Indeed, since a user cannot see the entire image of the omnidirectional video simultaneously, it is possible to encode an omnidirectional image with a tile mechanism that encodes regions of the omnidirectional image independently, so that only those regions of the omnidirectional image visible to the user are subsequently decoded.

Thanks to this particular embodiment of the invention, the independent coding of the base layers makes it possible to reconstruct the tiles of the omnidirectional image separately and to limit decoding complexity by avoiding the decoding of unnecessary base layers. Advantageously, for each tile of the enhancement layer to be decoded, information identifying the at least one base layer used to predict the tile is decoded from the data stream.
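By way of illustration only, the selection of the tiles and base layers to decode for a given viewed portion can be sketched as follows. The 2x2 tile grid, the per-tile base layer identifiers and the function names are assumptions of this sketch, and the wrap-around of the viewed portion at the image borders is not handled.

```python
from typing import Dict, List, Set, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def overlaps(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def select_for_viewport(
    viewport: Rect,
    tiles: Dict[int, Rect],
    tile_base_layers: Dict[int, List[int]],
) -> Tuple[List[int], Set[int]]:
    """Return the tiles covering the viewport and the base layers they need."""
    needed_tiles = sorted(t for t, region in tiles.items() if overlaps(region, viewport))
    needed_layers: Set[int] = set()
    for t in needed_tiles:
        # Per-tile information identifying the base layer(s) used for prediction.
        needed_layers.update(tile_base_layers[t])
    return needed_tiles, needed_layers


# Hypothetical 2x2 tiling of a 3840x1920 enhancement-layer image.
tiles = {
    0: (0, 0, 1920, 960),
    1: (1920, 0, 3840, 960),
    2: (0, 960, 1920, 1920),
    3: (1920, 960, 3840, 1920),
}
tile_base_layers = {0: [1], 1: [2], 2: [1], 3: [2]}
viewport = (200, 300, 1200, 1100)
needed_tiles, needed_layers = select_for_viewport(viewport, tiles, tile_base_layers)
```

In this example only tiles 0 and 2 are decoded, and only the first base layer is needed to predict them.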

The invention also relates to a device for encoding a data stream representative of an omnidirectional video. The coding device comprises means for coding in said stream at least one base layer representative of a 2D or 3D video, the 2D or 3D video being representative of a view of the same scene captured by the omnidirectional video, and means for coding in said stream at least one enhancement layer representative of the omnidirectional video, said enhancement layer coding means comprising means for predicting the enhancement layer with respect to the at least one base layer.

The invention also relates to a device for decoding a data stream representative of an omnidirectional video. The decoding device comprises means for decoding from said stream at least one base layer representative of a 2D or 3D video, the 2D or 3D video being representative of a view of the same scene captured by the omnidirectional video, and means for decoding from said stream at least one enhancement layer representative of the omnidirectional video, said enhancement layer decoding means comprising means for predicting the enhancement layer with respect to the at least one base layer.

The coding or decoding device is particularly suitable for implementing the coding or decoding method described above. The coding or decoding device may, of course, comprise the various characteristics relating to the coding or decoding method according to the invention. Thus, the characteristics and advantages of the coding device and of the decoding device are the same as those of the coding method and of the decoding method respectively, and are not detailed further.

According to a particular embodiment of the invention, the decoding device is included in a terminal.

The invention also relates to a signal representative of an omnidirectional video, comprising coded data of at least one base layer representative of a 2D or 3D video, the 2D or 3D video being representative of a view of the same scene captured by the omnidirectional video, and coded data of at least one enhancement layer representative of the omnidirectional video, the at least one enhancement layer being coded by prediction with respect to the at least one base layer.

According to a particular embodiment of the invention, an image of the enhancement layer is coded using a group of tiles, each tile covering a region of the image of the enhancement layer, each region being distinct and disjoint from the other regions of the enhancement layer image, each tile being coded by prediction with respect to at least one base layer. According to such a particular embodiment of the invention, the signal also comprises, for each tile, information identifying the at least one base layer used to predict the tile. Thus, only the base layers necessary for decoding a given tile are decoded, thus optimizing the use of the resources of the decoder.

The invention also relates to a computer program comprising instructions for implementing the coding method or the decoding method according to any one of the particular embodiments described above, when said program is executed by a processor. It can be downloaded from a communication network and/or saved on a computer-readable medium. This program can use any programming language, and be in the form of source code, object code, or intermediate code between source code and object code, such as in a partially compiled form, or in any other desirable form.

In yet another aspect, the invention relates to a computer-readable recording medium or information medium comprising instructions of a computer program as mentioned above. The recording media mentioned above can be any entity or device capable of storing the program. For example, the medium may comprise storage means, such as a read-only memory (ROM), for example a CD-ROM or a microelectronic circuit ROM, a flash memory mounted on a removable storage medium, such as a USB key, or a mass memory of the Hard-Disk Drive (HDD) or Solid-State Drive (SSD) type, or a combination of memories operating according to one or more data recording technologies. On the other hand, the recording media may correspond to a transmissible medium such as an electrical or optical signal, which can be routed via an electrical or optical cable, by radio or by other means. In particular, the proposed computer program can be downloaded from an Internet-type network.

Alternatively, the recording media may correspond to an integrated circuit in which the program is incorporated, the circuit being adapted to execute or to be used in the execution of the method in question.

The coding or decoding method according to the invention can therefore be implemented in various ways, in particular in hard-wired form or in software form.

4. List of figures

Other features and advantages of the invention will appear more clearly on reading the following description of a particular embodiment, given as a simple illustrative and nonlimiting example, and the appended drawings, among which:

FIG. 1A illustrates steps of the coding method according to a particular embodiment of the invention,

FIG. 1B illustrates an example of a signal generated according to the coding method implemented according to a particular embodiment of the invention,

FIG. 2A illustrates an image of a view of a scene captured by a 360 ° video encoded in a base layer,

FIG. 2B illustrates the image illustrated in FIG. 2A projected in the reference frame of an image of the 360 ° video,

FIG. 2C illustrates an image of the 360 ° video coded in an enhancement layer,

FIGS. 2D and 2E each illustrate an image of two views of a scene captured by a 360 ° video and each coded in a base layer,

FIG. 2F illustrates the images of the two views illustrated in FIGS. 2D and 2E projected in the reference frame of an image of the 360 ° video,

FIG. 2G illustrates an image of the 360 ° video coded in an enhancement layer,

FIG. 3 illustrates steps of the decoding method according to a particular embodiment of the invention,

FIG. 4A illustrates an example of an encoder configured to implement the coding method according to a particular embodiment of the invention,

FIG. 4B illustrates a device adapted to implement the coding method according to another particular embodiment of the invention,

FIG. 5A illustrates an example of a decoder configured to implement the decoding method according to a particular embodiment of the invention,

FIG. 5B illustrates a device adapted to implement the decoding method according to another particular embodiment of the invention,

FIGS. 6A and 6B respectively illustrate an image of the 360 ° omnidirectional video coded by independent tiles and a reference image generated from two views of two base layers and used to code the image of FIG. 6A,

FIGS. 7A-C respectively show a projection in a 2D plane of a 360 ° omnidirectional video with a cubemap projection, a 3D spherical representation of the 360 ° omnidirectional video in an XYZ coordinate system, and a view extracted from the 360 ° immersive content in a 2D plane according to a rectilinear projection,

FIG. 7D illustrates the relationship between different geometric projections,

FIG. 8 illustrates the procedure for constructing the reference image.

The images of FIGS. 2A, 2C-2E and 2G and of FIGS. 7A-B are taken from 360 ° videos made available by LetinVR as part of the JVET exploration group (Joint Video Exploration Team; JVET-D0179: Test Sequences for Virtual Reality Video Coding from LetinVR, October 15-21, 2016).

5. Description of an embodiment of the invention

5.1 General principle

The general principle of the invention is to encode a data stream in a scalable manner, thus making it possible to reconstruct and render a 360 ° video when a receiver is adapted to receive and render such a 360 ° video, and to reconstruct and render a 2D or 3D video when the receiver is only suitable for rendering a 2D or 3D video.

In order to reduce the transmission cost of a stream comprising both 2D or 3D video and 360 ° video, according to the invention, the 2D or 3D video is coded in a base layer and the 360 ° video is coded in an enhancement layer predicted from the base layer.

According to a particular embodiment of the invention, the stream may comprise several base layers each corresponding to a 2D or 3D video corresponding to a view of the scene. The enhancement layer is thus coded by prediction from all or part of the base layers included in the stream.

5.2 Examples of implementation

FIG. 1A illustrates steps of the coding method according to a particular embodiment of the invention. According to this particular embodiment of the invention, a 360 ° video is coded in a scalable manner by extracting views from the 360 ° video and encoding each view in a base layer. By view is meant here a sequence of images acquired from a point of view of the scene captured by the 360 ° video. Such an image sequence can be a monoscopic image sequence in the case of a 2D 360 ° video or a stereoscopic image sequence in the case of a 3D 360 ° video. In the case of a stereoscopic image sequence, each image includes a left view and a right view coded jointly, for example in the form of an image generated from the left and right views placed side by side or one above the other. The encoder encoding such a stereoscopic image sequence in a base layer or enhancement layer will then encode each image including a left view and a right view as a conventional 2D image sequence.
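By way of illustration only, the two arrangements of the left and right views mentioned above can be sketched as follows. Images are represented here as lists of rows, and the function names are hypothetical.

```python
from typing import List

Image = List[List[int]]  # an image as a list of rows of samples


def pack_side_by_side(left: Image, right: Image) -> Image:
    """Place the left and right views next to each other in one image."""
    if len(left) != len(right):
        raise ValueError("both views must have the same number of rows")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def pack_top_bottom(left: Image, right: Image) -> Image:
    """Place the left view above the right view in one image."""
    if left and right and len(left[0]) != len(right[0]):
        raise ValueError("both views must have the same number of columns")
    return [row[:] for row in left] + [row[:] for row in right]


left_view = [[1, 2], [3, 4]]
right_view = [[5, 6], [7, 8]]
side_by_side = pack_side_by_side(left_view, right_view)
top_bottom = pack_top_bottom(left_view, right_view)
```

Either packed image can then be handed to a conventional 2D coder as a single picture.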

An embodiment is described below in which the omnidirectional video is a 2D 360 ° video.

An embodiment is described herein in which two base layers are used to code the enhancement layer. In general, the method described here applies to the case where a number of views N, with N greater than or equal to 1, are used for the coding of the enhancement layer.

The number of base layers is independent of the number of views used to generate 360 ° video. The number of base layers encoded in the scalable data stream is for example determined during the production of the content, or may be determined by the encoder for rate optimization purposes.

In steps 10 and 11, first and second views are extracted from the 360 ° video. The views [1] and [2] are respectively coded during a coding step 12 of a base layer BL [1] and a coding step 13 of a base layer BL [2].

In the particular embodiment described here, the base layers BL [1] and BL [2] are coded independently of each other, i.e. there is no coding dependency (prediction, coding context, etc.) between the coding of the images of the base layer BL [1] and the coding of the images of the base layer BL [2]. Each base layer BL [1] or BL [2] is decodable independently of the others.

According to another particular embodiment, it is possible to code the base layers BL [1] and BL [2] in a dependent manner, for example to gain compression efficiency. However, this particular embodiment of the invention requires the decoder to be able to decode both base layers to render a conventional 2D video.

Each coded and reconstructed image of the base layers BL [1] and BL [2] is then projected geometrically (steps 14 and 15 respectively) onto the same reference image I_ref. This results in a partially filled reference image, which contains samples interpolated from the projected base layer(s). The construction of the reference image is described in more detail in connection with FIG. 8.

FIGS. 2A-2C illustrate an embodiment in which a single base layer is used. According to this embodiment, the images of the 360 ° video have a spatial resolution of 3840x1920 pixels and are generated by an equirectangular projection, and the 360 ° image sequence has a frame rate of 30 frames per second. FIG. 2C illustrates an image of the 360 ° video at a time instant t, coded in the enhancement layer.

An image at the time instant t of the view extracted from the 360 ° video is illustrated in FIG. 2A. Such a view is for example extracted from the 360 ° video using the coordinates Yaw = 20 °, Pitch = 5 °, Horizontal FOV (Field Of View) = 110 ° and Vertical FOV = 80 °; the spatial resolution of the extracted image is 1920x960 pixels and the frame rate is 30 frames per second. The Yaw and Pitch coordinates correspond to the coordinates of the center (P in FIG. 2B) of the geometric projection of an image of the view of the base layer; they correspond respectively to the angle Θ and to the angle φ of the point P illustrated in the pivot format in FIG. 7B. The parameters Horizontal FOV and Vertical FOV correspond respectively to the horizontal and vertical size of an image of the extracted view centered on the point P in the pivot format illustrated in FIG. 7B; this image of the extracted view is represented in FIG. 7C.
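By way of illustration only, the role of the parameters Yaw, Pitch, Horizontal FOV and Vertical FOV can be sketched by mapping a point of the extracted view to a direction on the sphere. The axis convention used here (x forward, y up, z to the right, positive yaw turning to the right) and the names `ViewParams` and `view_to_sphere` are assumptions of this sketch; the conventions of the reference software may differ.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ViewParams:
    yaw_deg: float    # longitude of the view center P
    pitch_deg: float  # latitude of the view center P
    hfov_deg: float   # horizontal field of view
    vfov_deg: float   # vertical field of view


def view_to_sphere(s: float, t: float, p: ViewParams) -> Tuple[float, float]:
    """Map normalized view coordinates to (longitude, latitude) in degrees.

    (s, t) lies in [0, 1] x [0, 1]; (0.5, 0.5) is the view center, s grows
    to the right and t grows downwards.
    """
    yaw, pitch = math.radians(p.yaw_deg), math.radians(p.pitch_deg)
    # Position on the rectilinear image plane placed at unit distance.
    a = (2.0 * s - 1.0) * math.tan(math.radians(p.hfov_deg) / 2.0)
    b = (1.0 - 2.0 * t) * math.tan(math.radians(p.vfov_deg) / 2.0)
    x, y, z = 1.0, b, a
    # Apply the pitch (rotation about the z axis), then the yaw (about y).
    x, y = x * math.cos(pitch) - y * math.sin(pitch), x * math.sin(pitch) + y * math.cos(pitch)
    x, z = x * math.cos(yaw) - z * math.sin(yaw), x * math.sin(yaw) + z * math.cos(yaw)
    norm = math.sqrt(x * x + y * y + z * z)
    return math.degrees(math.atan2(z, x)), math.degrees(math.asin(y / norm))


params = ViewParams(yaw_deg=20.0, pitch_deg=5.0, hfov_deg=110.0, vfov_deg=80.0)
center = view_to_sphere(0.5, 0.5, params)
```

With the example values of the text, the view center maps back to a longitude of 20 ° and a latitude of 5 °, up to rounding.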

FIG. 2B illustrates the reference image I_ref used to predict the image of the 360 ° video at time t, after equirectangular geometric projection of the image of the base layer illustrated in FIG. 2A.

FIGS. 2D-2G illustrate an embodiment in which two base layers are used. According to this embodiment, the images of the 360 ° video have a spatial resolution of 3840x1920 pixels and are generated by an equirectangular projection, and the 360 ° image sequence has a frame rate of 30 frames per second. FIG. 2G illustrates an image of the 360 ° video at a time instant t, coded in the enhancement layer.

An image at the time instant t of a first view extracted from the 360 ° video is illustrated in FIG. 2D. This first view is for example extracted from the 360 ° video using the coordinates Yaw = 20 °, Pitch = 5 °, Horizontal FOV (Field Of View) = 110 ° and Vertical FOV = 80 °; the spatial resolution of the images of the first extracted view is 1920x960 pixels and the frame rate is 30 frames per second.

An image at the time instant t of a second view extracted from the 360 ° video is illustrated in FIG. 2E. This second view is for example extracted from the 360 ° video using the coordinates Yaw = -100 °, Pitch = 5 °, Horizontal FOV (Field Of View) = 110 ° and Vertical FOV = 80 °; the spatial resolution of the images of the second extracted view is 1920x960 pixels and the frame rate is 30 frames per second.

FIG. 2F illustrates the reference image I_ref used to predict the image of the 360 ° video at the instant t, after equirectangular geometric projection of the images of the first view and the second view respectively illustrated in FIGS. 2D and 2E.

In order to project the reconstructed images of the base layers into the reference image, the following geometric transformation steps are applied.

The representation of a 360 ° omnidirectional video in a plane is defined by a geometric transformation characterizing the way in which 360 ° omnidirectional content represented on a sphere is adapted to a representation in a plane. The spherical representation of the data is used as a pivot format: it makes it possible to represent the points captured by an omnidirectional video device. Such a 3D XYZ spherical representation is illustrated in FIG. 7B.

For example, the 360 ° video is represented using an equirectangular geometric transformation, which can be seen as the projection of the points onto a cylinder surrounding the sphere. Other geometric transformations are of course possible, for example the CubeMap projection, corresponding to a projection of the points onto a cube enclosing the sphere, the faces of the cube finally being unfolded onto a plane to form the 2D image. Such a CubeMap projection is for example illustrated in FIG. 7A.
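By way of illustration only, the projection of a point of the sphere onto a cube enclosing it can be sketched as follows: the face is selected by the dominant axis of the direction, and the two remaining coordinates are divided by the dominant magnitude. The face labels and the order of the returned coordinates are assumptions of this sketch; the orientation of each face and the layout of the unfolded cube in the 2D image are not reproduced here.

```python
from typing import Tuple


def direction_to_cube_face(x: float, y: float, z: float) -> Tuple[str, float, float]:
    """Return the cube face hit by direction (x, y, z) and the position on it.

    The two returned coordinates lie in [-1, 1] and are the components along
    the two non-dominant axes, taken in x, y, z order.
    """
    ax, ay, az = abs(x), abs(y), abs(z)
    m = max(ax, ay, az)
    if m == 0.0:
        raise ValueError("the zero vector has no direction")
    if m == ax:
        return ("+X" if x > 0 else "-X", y / m, z / m)
    if m == ay:
        return ("+Y" if y > 0 else "-Y", x / m, z / m)
    return ("+Z" if z > 0 else "-Z", x / m, y / m)


front = direction_to_cube_face(1.0, 0.0, 0.0)
below = direction_to_cube_face(0.5, -1.0, 0.25)
```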

FIG. 7D illustrates in more detail the relationship between the different formats mentioned above. The transition from an equirectangular format A to a cubemap format B goes through a pivot format C, characterized by a representation of the samples in an XYZ spherical system as illustrated in FIG. 7B. In the same way, the extraction of a view D from the format A goes through this pivot format C. The extraction of a view from the immersive content is characterized by a geometric transformation, for example a rectilinear projection of the points of the sphere onto a plane, shown by the plane ABCD in FIG. 7C. This projection is characterized by location parameters such as yaw, pitch, and horizontal and vertical field of view (FOV). The mathematical properties of these different geometric transformations are documented in the document JVET-G1003 ("Algorithm descriptions of projection format conversion and video quality metrics in 360Lib Version 4", Y. Ye, E. Alshina, J. Boyce, JVET of ITU-T SG16 WP3 and ISO/IEC JTC 1/SC 29/WG 11, 7th meeting, Torino, 13-21 July 2017).

FIG. 8 illustrates the different steps for converting between two formats. A look-up table is first constructed in E80 to match the position of each sample in the destination image (I_ref) with its corresponding position in the source format (the reconstructed images of the base layers BL [1] and BL [2] in the example described with FIG. 1A). For each position (u, v) in the destination image, the following steps apply:

• In E81: conversion of the coordinates (u, v) of the destination image into the XYZ pivot system.

• In E82: projection of the XYZ coordinates of the pivot system into the source image, giving (u', v').

• In E83: update of the look-up table linking the positions in the destination format and in the source format.

Once the look-up table is constructed, the value of each pixel (u, v) in the destination image (I_ref) is interpolated, during a step E84, from the value at the corresponding position (u', v') in the source image (the reconstructed images of the base layers BL [1] and BL [2] in the example described with FIG. 1A). An interpolation can be performed at (u', v') before the value is assigned, by applying a Lanczos-type interpolation filter to the decoded base layer image at the matched position.
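By way of illustration only, steps E80 to E84 can be sketched as follows for an equirectangular destination image and a rectilinear source view. Several points are assumptions of this sketch: the axis convention (x forward, y up, z to the right), the function names, the small image sizes, and the use of bilinear interpolation in place of the Lanczos-type filter mentioned above.

```python
import math
from typing import Dict, List, Optional, Tuple

Vec = Tuple[float, float, float]


def erp_to_xyz(u: int, v: int, width: int, height: int) -> Vec:
    """E81: pixel (u, v) of the equirectangular image to a unit vector."""
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return (math.cos(lat) * math.cos(lon), math.sin(lat), math.cos(lat) * math.sin(lon))


def xyz_to_view(p: Vec, yaw: float, pitch: float, hfov: float, vfov: float,
                width: int, height: int) -> Optional[Tuple[float, float]]:
    """E82: unit vector to a position (u', v') in the view, None if outside."""
    x, y, z = p
    # Undo the yaw (rotation about y), then the pitch (rotation about z).
    x, z = x * math.cos(yaw) + z * math.sin(yaw), -x * math.sin(yaw) + z * math.cos(yaw)
    x, y = x * math.cos(pitch) + y * math.sin(pitch), -x * math.sin(pitch) + y * math.cos(pitch)
    if x <= 1e-9:
        return None  # behind the image plane of the view
    a = z / x / math.tan(hfov / 2.0)
    b = y / x / math.tan(vfov / 2.0)
    if abs(a) > 1.0 or abs(b) > 1.0:
        return None  # outside the field of view
    return ((a + 1.0) / 2.0 * width - 0.5, (1.0 - b) / 2.0 * height - 0.5)


def build_lut(dst_w: int, dst_h: int, src_w: int, src_h: int, yaw_deg: float,
              pitch_deg: float, hfov_deg: float, vfov_deg: float
              ) -> Dict[Tuple[int, int], Tuple[float, float]]:
    """E80 and E83: table linking destination positions to source positions."""
    angles = [math.radians(a) for a in (yaw_deg, pitch_deg, hfov_deg, vfov_deg)]
    lut = {}
    for v in range(dst_h):
        for u in range(dst_w):
            src = xyz_to_view(erp_to_xyz(u, v, dst_w, dst_h), *angles, src_w, src_h)
            if src is not None:
                lut[(u, v)] = src
    return lut


def bilinear(img: List[List[float]], up: float, vp: float) -> float:
    """E84: interpolate the source image at (u', v'), clamping at borders."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(up), math.floor(vp)
    fx, fy = up - x0, vp - y0
    xa, xb = min(max(x0, 0), w - 1), min(max(x0 + 1, 0), w - 1)
    ya, yb = min(max(y0, 0), h - 1), min(max(y0 + 1, 0), h - 1)
    top = (1 - fx) * img[ya][xa] + fx * img[ya][xb]
    bottom = (1 - fx) * img[yb][xa] + fx * img[yb][xb]
    return (1 - fy) * top + fy * bottom


def project(img: List[List[float]], lut, dst_w: int, dst_h: int):
    """Fill the reference image; positions not covered by the view stay None."""
    ref = [[None] * dst_w for _ in range(dst_h)]
    for (u, v), (up, vp) in lut.items():
        ref[v][u] = bilinear(img, up, vp)
    return ref


view = [[7.0] * 8 for _ in range(8)]  # constant 8x8 base image
lut = build_lut(36, 18, 8, 8, yaw_deg=0.0, pitch_deg=0.0, hfov_deg=90.0, vfov_deg=90.0)
i_ref = project(view, lut, 36, 18)
```

The resulting image is only partially filled, as stated above; with several base layers, one table per view can be built and the views projected in turn onto the same image.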

In a step 16 of the coding method illustrated in FIG. 1A, the 360 ° video is coded in an enhancement layer EL by prediction with respect to the base layers BL [1] and BL [2], using the reference image I_ref generated from the base layers.

In a step 17, the data coded in steps 12, 13 and 16 are multiplexed to form a bit stream comprising the coded data of the base layers BL [1] and BL [2] and of the enhancement layer EL. The projection data making it possible to construct the reference image I_ref are also coded in the bit stream and transmitted to the decoder.

The coding steps 12, 13 and 16 can advantageously be implemented by standard video coders, for example by an SHVC scalable coder of the HEVC standard.

FIG. 1B illustrates an example of a bit stream generated according to the method described with reference to FIG. 1A. According to this example, the bit stream comprises:

the coded data of the base layers BL [1] and BL [2],

an information item PRJ representative of the type of geometric projection used to represent the omnidirectional content, for example a value indicating an equirectangular projection,

an information item PRJ_B1, respectively PRJ_B2, representative of the projection used to extract the view of the base layer BL [1], respectively BL [2], and of its location parameters in the 360 ° video.

The information representative of the projection and location parameters of a view of the base layer can for example be coded in the form of the coordinates of the view (Yaw, Pitch, HFOV, VFOV) together with the projection type (rectilinear) used to extract the view.

The information representative of the projection and location parameters of a view of a base layer can be coded once in the bit stream. It is thus valid for the entire sequence of images.

The information representative of the projection and location parameters of a view of a base layer can be coded several times in the bit stream, for example at each image, or at each group of images. It is thus valid only for an image or group of images.
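By way of illustration only, the two signalling options above (once for the sequence, or repeated at an image or group of images) can be sketched with a container that returns the parameters in force for a given image. The names `ViewProjection` and `BaseLayerSignalling`, and the rule that parameters remain valid until the next update, are assumptions of this sketch and not a bit stream syntax.

```python
import bisect
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ViewProjection:
    projection: str  # e.g. "rectilinear"
    yaw: float
    pitch: float
    hfov: float
    vfov: float


@dataclass
class BaseLayerSignalling:
    """Hypothetical store of the PRJ_B information of one base layer."""
    indices: List[int] = field(default_factory=list)            # image indices
    params: List[ViewProjection] = field(default_factory=list)  # same order

    def signal(self, image_index: int, view: ViewProjection) -> None:
        """Record parameters coded in the stream at `image_index`."""
        if self.indices and image_index <= self.indices[-1]:
            raise ValueError("updates must arrive in increasing image order")
        self.indices.append(image_index)
        self.params.append(view)

    def params_for(self, image_index: int) -> ViewProjection:
        """Parameters in force for an image: the latest update at or before it."""
        i = bisect.bisect_right(self.indices, image_index) - 1
        if i < 0:
            raise LookupError("no parameters signalled yet for this image")
        return self.params[i]


prj_b1 = BaseLayerSignalling()
prj_b1.signal(0, ViewProjection("rectilinear", 20.0, 5.0, 110.0, 80.0))
prj_b1.signal(30, ViewProjection("rectilinear", 25.0, 5.0, 110.0, 80.0))
```

A single call to `signal` at image 0 corresponds to parameters coded once for the whole sequence; further calls correspond to updates at an image or group of images.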

When the information representative of the projection and location parameters of a view is coded at each image, such a variant provides the advantage that the view extracted at each instant of the sequence can correspond to a view of an object moving in the scene and tracked over time.

When the information representative of the projection and location parameters of a view is coded for a group of images, such a variant provides the advantage that the video sequence coded in a base layer can change its point of view over time, thus making it possible to follow an event via different points of view over time.

FIG. 3 illustrates steps of the decoding method according to a particular embodiment of the invention.

According to this particular embodiment of the invention, the scalable bit stream representative of the 360° video is demultiplexed during a step 30. The coded data of the base layers, BL[1] and BL[2] in the example described here, are sent to a decoder to be decoded (steps 31 and 33 respectively).

Then, the reconstructed images of the base layers are projected (steps 32 and 34 respectively), in a manner similar to the coding method, onto a reference image I_ref to serve as a prediction for the enhancement layer EL. The geometric projection is performed from the projection data provided in the bit stream (projection type, projection information and location of the view).

The coded data of the enhancement layer EL are decoded (step 35) and the images of the 360° video are reconstructed using the reference images I_ref generated from the geometric projections performed on the base layers, as previously specified.

The scalable bitstream representative of the 360 ° video makes it possible to address any type of receiver. Such a scalable flow also allows each receiver to decode and reconstruct a 2D video or a 360 ° video according to its capabilities.

According to the decoding method described above, conventional receivers, such as PCs, televisions, tablets, etc., will decode only a base layer and will render a sequence of 2D images, whereas receivers adapted for 360° video, such as virtual reality headsets, smartphones, etc., will decode the base layers and the enhancement layer and will render the 360° video.

FIG. 4A illustrates in greater detail the steps of coding a base layer and an enhancement layer of the method described above, according to a particular embodiment of the invention. The case of coding an enhancement layer coding a 360° omnidirectional video by prediction from a base layer coding a view k is described here.

Each image of the view k to be encoded is divided into blocks of pixels and each block of pixels is then conventionally encoded by spatial or temporal prediction using a previously reconstructed reference image of the image sequence of the view k.

In a conventional manner, a prediction module P determines a prediction for a current block B^k_c. The current block B^k_c is coded by spatial prediction with respect to other blocks of the same image, or by temporal prediction with respect to a block of a previously coded and reconstructed reference image of the view k stored in the memory MEM^b.

A prediction residue is obtained by calculating the difference between the current block B^k_c and the prediction determined by the prediction module P.

This prediction residue is then transformed by a transformation module T implementing for example a transformation of DCT (Discrete Cosine Transform) type. The transformed coefficients of the residue block are then quantized by a quantization module Q, and then coded by the entropy coding module C to form the coded data of the base layer BL [k].

The prediction residue is reconstructed via an inverse quantization performed by the module Q^-1 and an inverse transformation performed by the module T^-1, and is added to the prediction determined by the prediction module P to reconstruct the current block.
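The chain formed by the residue computation, the transformation, the quantization and their inverses can be sketched as follows. This is a minimal illustration assuming square blocks, an orthonormal floating-point DCT-II and a uniform quantization step `qstep`; standard coders use integer transforms and more elaborate quantizers, and the function names are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def encode_block(block, prediction, qstep):
    """Residue -> 2D DCT (module T) -> uniform quantization (module Q)."""
    d = dct_matrix(len(block))
    residue = [[b - p for b, p in zip(brow, prow)]
               for brow, prow in zip(block, prediction)]
    coeffs = matmul(matmul(d, residue), transpose(d))
    return [[round(c / qstep) for c in row] for row in coeffs]


def reconstruct_block(levels, prediction, qstep):
    """Inverse quantization (Q^-1) -> inverse DCT (T^-1) -> add the prediction."""
    d = dct_matrix(len(levels))
    dequantized = [[lvl * qstep for lvl in row] for row in levels]
    residue = matmul(matmul(transpose(d), dequantized), d)
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, residue)]
```

Because the encoder reconstructs the block from the quantized levels, exactly as the decoder does, both sides hold the same reference samples for subsequent predictions.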

The reconstructed current block is then stored in order to reconstruct the current image, and this reconstructed current image can be used as a reference when coding subsequent images of the view k.

When the current image of the view k is reconstructed, a projection module PROJ performs a geometric projection of the reconstructed image into the reference image I_ref of the 360° video, as illustrated in FIG. 2B and according to the geometric transformation described previously.

The reference image I_ref obtained by projection of the reconstructed image of the base layer is stored in the memory MEM^e of the enhancement layer.
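As an illustration of what a projection module such as PROJ may compute, the sketch below maps a rectilinear view, located by (Yaw, Pitch, HFOV, VFOV), onto an equirectangular reference image. It is one common formulation and is not necessarily the geometric transformation used in the description: the axis conventions, the nearest-neighbour sampling, the default value `fill` for uncovered samples and the function name `project_view_to_erp` are all assumptions.

```python
import math


def project_view_to_erp(base, yaw_deg, pitch_deg, hfov_deg, vfov_deg,
                        erp_width, erp_height, fill=0):
    """Project a rectilinear view (list of sample rows) onto an equirectangular image.

    Samples of the reference image not covered by the view keep the value `fill`.
    """
    bh, bw = len(base), len(base[0])
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    tan_h = math.tan(math.radians(hfov_deg) / 2)
    tan_v = math.tan(math.radians(vfov_deg) / 2)
    ref = [[fill] * erp_width for _ in range(erp_height)]
    for v in range(erp_height):
        lat = math.pi / 2 - (v + 0.5) / erp_height * math.pi
        for u in range(erp_width):
            lon = (u + 0.5) / erp_width * 2 * math.pi - math.pi
            # Direction on the unit sphere, expressed with the yaw removed.
            x = math.cos(lat) * math.sin(lon - yaw)
            y = math.sin(lat)
            z = math.cos(lat) * math.cos(lon - yaw)
            # Remove the pitch so that the view axis becomes the +z axis.
            y_c = y * math.cos(pitch) - z * math.sin(pitch)
            z_c = y * math.sin(pitch) + z * math.cos(pitch)
            if z_c <= 0:
                continue  # direction lies behind the view
            px, py = x / z_c, y_c / z_c
            if abs(px) > tan_h or abs(py) > tan_v:
                continue  # direction lies outside the field of view
            # Nearest-neighbour sampling in the base image.
            col = min(bw - 1, int((px / tan_h + 1) / 2 * bw))
            row = min(bh - 1, int((1 - py / tan_v) / 2 * bh))
            ref[v][u] = base[row][col]
    return ref
```

With a single view, most of the resulting image keeps the default value, which is the situation addressed by the reference sub-image variant described later.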

As with the base layer, the 360° omnidirectional video is coded image by image and block by block. Each block of pixels is conventionally coded by spatial or temporal prediction using a reference image previously reconstructed and stored in the memory MEM^e.

In a conventional manner, a prediction module P determines a prediction for a current block B^e_c of a current image of the 360° omnidirectional video. The current block B^e_c is coded by spatial prediction with respect to other blocks of the same image, or by temporal prediction with respect to a block of a previously coded and reconstructed reference image of the 360° video stored in the memory MEM^e.

According to the invention, advantageously, the current block B^e_c can also be coded by inter-layer prediction with respect to a co-located block in the reference image I_ref obtained from the base layer. For example, such a coding mode is indicated in the coded data of the enhancement layer EL by an INTER coding mode signaling a temporal coding of the block, a zero motion vector, and a reference index designating, among the reference images of the memory MEM^e, the image I_ref. This information is coded by the entropy coder C. Such a particular embodiment of the invention makes it possible to reuse the existing syntax of the temporal coding modes of existing standards. Other types of signaling are of course possible.

The prediction mode used to code a current block B^e_c is for example selected from all the possible prediction modes by choosing the one that minimizes a rate/distortion criterion.
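The mode decision and the signaling of the inter-layer mode can be sketched as follows. The Lagrangian cost J = D + lambda * R used here is one widely used form of rate/distortion criterion and is an assumption, since the description does not specify the criterion; the function names and the dictionary layout are likewise hypothetical.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate/distortion cost J = D + lambda * R (assumed criterion)."""
    return distortion + lam * rate_bits


def select_mode(candidates, lam):
    """Return the mode whose (distortion, rate) pair minimises the cost.

    candidates maps a mode name to a (distortion, rate_bits) pair.
    """
    return min(candidates, key=lambda mode: rd_cost(*candidates[mode], lam))


def inter_layer_signaling(iref_index):
    """Signal inter-layer prediction with the existing temporal syntax:
    INTER mode, zero motion vector, reference index designating the projected image."""
    return {"mode": "INTER", "motion_vector": (0, 0), "ref_idx": iref_index}
```

A decoder that knows which reference index designates the projected image then handles the block exactly like a temporally predicted block, which is what allows the syntax of existing standards to be reused.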

Once a prediction mode has been selected for the current block B^e_c, a prediction residue is obtained by calculating the difference between the current block B^e_c and the prediction determined by the prediction module P.

This prediction residue is then transformed by a transformation module T implementing for example a transformation of DCT (Discrete Cosine Transform) type. The transformed coefficients of the residue block are then quantized by a quantization module Q, and then coded by the entropy coding module C to form the coded data of the enhancement layer EL.

The prediction residue is reconstructed via an inverse quantization performed by the module Q^-1 and an inverse transformation performed by the module T^-1, and is added to the prediction determined by the prediction module P to reconstruct the current block.

The reconstructed current block is then stored in order to reconstruct the current image and this reconstructed current image can be used as a reference when encoding subsequent images of the 360 ° omnidirectional video.

The coding has been described here in the case of a single view k coded in a base layer. The method is easily transposable to the case of several views coded in as many base layers. Each image of a base layer reconstructed at a time t is projected onto the same reference image I_ref of the 360° video in order to code an image of the 360° video at time t.

FIG. 4B shows the simplified structure of a coding device COD adapted to implement the coding method according to any one of the particular embodiments of the invention described above.

Such an encoding device comprises a memory MEM4, a processing unit UT4, equipped for example with a processor PROC4.

According to a particular embodiment of the invention, the coding method is implemented by a computer program PG4 stored in memory MEM4 and driving the processing unit UT4. The computer program PG4 includes instructions for implementing the steps of the encoding method as described above, when the program is executed by the processor PROC4.

At initialization, the code instructions of the computer program PG4 are for example loaded into a memory (not shown) before being executed by the processor PROC4. The processor PROC4 of the processing unit UT4 implements in particular the steps of the coding method described in relation with FIGS. 1A or 4A, according to the instructions of the computer program PG4.

According to another particular embodiment of the invention, the coding method is implemented by functional modules (P, T, Q, Q^-1, T^-1, C, PROJ). For this, the processing unit UT4 cooperates with the various functional modules and the memory MEM4 in order to implement the steps of the coding method. The memory MEM4 can in particular comprise the memories MEM^b and MEM^e.

The various functional modules described above can be in hardware and/or software form. In a software form, such a functional module may include a processor, a memory, and program code instructions for implementing the function corresponding to the module when the code instructions are executed by the processor. In a hardware form, such a functional module can be implemented by any type of suitable coding circuit, such as, for example and without limitation, microprocessors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), or hard-wired logic units.

FIG. 5A illustrates in more detail the steps of decoding a base layer and an enhancement layer of the method described above, according to a particular embodiment of the invention. The case of decoding an enhancement layer EL coding a 360° omnidirectional video by prediction from a base layer BL[k] coding a view k is described here.

The view k and the 360° omnidirectional video are decoded image by image and block by block. Conventionally, the data of the base layer BL[k] are decoded by an entropy decoding module D. Then, for a current block of a current image to be reconstructed, a prediction residue is reconstructed via an inverse quantization of the entropy-decoded coefficients by an inverse quantization module Q^-1 and an inverse transformation by an inverse transformation module T^-1. A prediction module P determines a prediction for the current block from the signaling data decoded by the entropy decoding module D. The prediction is added to the reconstructed prediction residue to reconstruct the current block.

The reconstructed current block is then stored in order to reconstruct the current image, and this reconstructed current image is stored in the reference image memory MEM^b of the base layer so that it can serve as a reference when decoding subsequent images of the view k.

When the current image of the view k is reconstructed, a projection module PROJ performs a geometric projection of the reconstructed image into the reference image I_ref of the 360° omnidirectional video, as illustrated in FIG. 2B and according to the geometric transformation described above.

The reference image I_ref obtained by projection of the reconstructed image of the base layer is stored in the reference image memory MEM^e of the enhancement layer.

The data of the enhancement layer EL are decoded by an entropy decoding module D. Then, for a current block of a current image to be reconstructed, a prediction residue is reconstructed via an inverse quantization of the entropy-decoded coefficients, implemented by an inverse quantization module Q^-1, and an inverse transformation, implemented by an inverse transformation module T^-1. A prediction module P determines a prediction for the current block from the signaling data decoded by the entropy decoding module D. For example, the decoded syntax data indicate that the current block B^e_c is coded by inter-layer prediction with respect to a co-located block in the reference image I_ref obtained from the base layer. The prediction module therefore determines that the prediction corresponds to the block of the reference image I_ref co-located with the current block B^e_c. The prediction is added to the reconstructed prediction residue to reconstruct the current block. The reconstructed current block is then stored in order to reconstruct the current image of the enhancement layer.

This reconstructed image is stored in the reference image memory MEM^e of the enhancement layer to serve as a reference when decoding subsequent images of the 360° video.
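On the decoder side, the inter-layer case reduces to fetching the co-located block of the projected reference image and adding the reconstructed residue. The sketch below assumes images stored as lists of sample rows and a residue that has already been inverse-quantized and inverse-transformed; the function names are hypothetical.

```python
def colocated_block(reference, x, y, size):
    """Return the size x size block of the reference image at position (x, y)."""
    return [row[x:x + size] for row in reference[y:y + size]]


def reconstruct_inter_layer(reference, x, y, residue):
    """Reconstruct a block predicted from its co-located block (zero motion vector)."""
    prediction = colocated_block(reference, x, y, len(residue))
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, residue)]
```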

FIG. 5B shows the simplified structure of a decoding device DEC adapted to implement the decoding method according to any one of the particular embodiments of the invention described above.

Such a decoding device comprises a memory MEM5, a processing unit UT5, equipped for example with a processor PROC5.

According to a particular embodiment of the invention, the decoding method is implemented by a computer program PG5 stored in memory MEM5 and driving the processing unit UT5. The computer program PG5 includes instructions for implementing the steps of the decoding method as described above, when the program is executed by the processor PROC5.

At initialization, the code instructions of the computer program PG5 are for example loaded into a memory (not shown) before being executed by the processor PROC5. The processor PROC5 of the processing unit UT5 notably implements the steps of the decoding method described in relation with FIG. 3 or 5A, according to the instructions of the computer program PG5.

According to another particular embodiment of the invention, the decoding method is implemented by functional modules (P, Q^-1, T^-1, D, PROJ). For this, the processing unit UT5 cooperates with the various functional modules and the memory MEM5 in order to implement the steps of the decoding method. The memory MEM5 may in particular comprise the memories MEM^b and MEM^e.

The various functional modules described above can be in hardware and/or software form. In a software form, such a functional module may include a processor, a memory, and program code instructions for implementing the function corresponding to the module when the code instructions are executed by the processor. In a hardware form, such a functional module can be implemented by any type of suitable decoding circuit, such as, for example and without limitation, microprocessors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), or hard-wired logic units.

According to a particular embodiment of the invention, the blocks of an image of the enhancement layer are coded in groups of blocks, such a group of blocks also being called a tile. Each group of blocks, i.e. each tile, is coded independently of the other tiles. Each tile can then be decoded independently of the other tiles. Such tiles (TE0-TE11) are illustrated in FIG. 6A, which shows an image of the 360° omnidirectional video at a given time instant, in which 12 tiles are defined and completely cover the image.

By independent coding of the tiles is meant here a coding of the blocks of a tile that uses neither spatial prediction from a block of another tile of the image, nor temporal prediction from a block of a tile of the reference image that is not co-located with the current tile.

Each tile is coded by temporal prediction or by inter-layer prediction from one or more base layers, as illustrated in FIGS. 6A and 6B. In FIGS. 6A and 6B, the tiles TE4 and TE7 are coded by inter-layer prediction with respect to the image of the view 1 projected into the reference image I_ref, and the tiles TE3 and TE6 are coded by inter-layer prediction with respect to the image of the view 2 projected into the reference image I_ref.

According to this particular embodiment of the invention, a receiver adapted to decode and render a 360° video can decode only the tiles necessary for the current area of the 360° image viewed by a user. Indeed, when a 360° video is rendered, a user cannot view the entire image of the video at a time t, i.e. he cannot look in all directions at once and, at a time t, sees only the area of the image in front of his eyes.

For example, such a viewing area is represented by the area ZV in FIG. 6A. Thus, according to this embodiment, only the base layers that served to predict the area viewed by the user are decoded in step 31. In the example described in FIGS. 6A and 6B, only the base layer corresponding to the view 1 is decoded in step 31, and only the tiles TE4, TE5, TE7 and TE8 of the enhancement layer EL are decoded in step 35 of FIG. 3. In step 35, only the part of the image of the enhancement layer corresponding to the tiles TE4, TE5, TE7 and TE8 is reconstructed.

The particular embodiment described with reference to FIGS. 6A and 6B is described here in the case where the tiles of the enhancement layer EL to be decoded depend on a single base layer only (that of the view 1). According to other variants, a tile of the enhancement layer EL can be coded by prediction from several base layers, depending for example on the rate/distortion optimization choices made when coding the blocks of the enhancement layer: one block of a tile can be coded by prediction with respect to a first base layer, and another block of the same tile can be coded by prediction with respect to another base layer distinct from the first. In this case, all the base layers used to predict the blocks of a tile of the enhancement layer must be decoded.

For this, the coded data stream includes for each tile of the enhancement layer information identifying the base layers used to predict the tile. For example, for each tile, syntax elements indicating the number of base layers used and an identifier of each base layer used are encoded in the data stream. Such syntax elements are decoded for each tile of the enhancement layer to be decoded during step 35 of decoding the enhancement layer.
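The selection of the tiles and base layers to decode for a given viewing area can be sketched as follows. The grid of 3 columns by 4 rows, the raster numbering of the tiles, the rectangular viewing area expressed in pixels and the dependency table are assumptions made for illustration (in particular, TE5 and TE8 are assumed here to use no base layer); wrap-around of the viewing area at the image borders is not handled.

```python
def tiles_in_area(x0, y0, x1, y1, width, height, cols=3, rows=4):
    """Indices of the tiles intersecting the area [x0, x1) x [y0, y1)."""
    tile_w, tile_h = width / cols, height / rows
    c0, c1 = int(x0 // tile_w), int((x1 - 1) // tile_w)
    r0, r1 = int(y0 // tile_h), int((y1 - 1) // tile_h)
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def base_layers_to_decode(tile_dependencies, tiles):
    """Union of the base-layer identifiers signalled for the given tiles."""
    needed = set()
    for tile in tiles:
        needed.update(tile_dependencies.get(tile, ()))
    return needed


# Per-tile identifiers as they might be parsed from the stream (illustrative values).
TILE_DEPENDENCIES = {3: [2], 4: [1], 5: [], 6: [2], 7: [1], 8: []}
```

For a 300 x 400 image and a viewing area spanning x from 120 to 280 and y from 130 to 270, this sketch selects the tiles 4, 5, 7 and 8 and the single base layer 1, which matches the example of FIGS. 6A and 6B.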

The particular embodiment described above makes it possible to limit the use of the resources of the decoder and to avoid decoding data that is unnecessary because it is not viewed by the user. Such an embodiment may be implemented by any of the coding devices and any of the decoding devices described above.

The coding and decoding methods described above have been described in the case where the reconstructed images of the base layers are projected, during steps 14 and 15 of FIG. 1A and steps 32 and 34 of FIG. 3, onto the same reference image inserted into the reference image memory of the enhancement layer.

When the number of base layers is limited, for example to 1 or 2, such a reference image has large undefined areas, for example set to 0 by default, which then use memory resources unnecessarily.

According to other variants, the reconstructed images of the base layers projected onto the enhancement layer can be stored in reference sub-images. For example, one sub-image can be used for each base layer. Each sub-image is stored in association with offset information enabling the encoder and/or the decoder to determine the location of the sub-image in the image of the enhancement layer. Such a variant provides the advantage of saving memory space by avoiding storing, at the level of the enhancement layer, a reference image in which the majority of the samples are zero.

Such a variant can be implemented independently at the encoder and/or at the decoder.
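A reference sub-image stored with its offset can be sketched as follows. The class name `ReferenceSubImage`, its fields and the choice of returning `None` when the requested block is not fully covered are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReferenceSubImage:
    """Projected base-layer samples stored with their offset in the enhancement image."""
    x_offset: int
    y_offset: int
    samples: List[List[int]]  # rows of samples covering only the projected region

    def block_at(self, x: int, y: int, size: int) -> Optional[List[List[int]]]:
        """Co-located block at enhancement-image position (x, y), or None if the
        sub-image does not fully cover it."""
        local_x, local_y = x - self.x_offset, y - self.y_offset
        height = len(self.samples)
        width = len(self.samples[0]) if height else 0
        if (local_x < 0 or local_y < 0
                or local_x + size > width or local_y + size > height):
            return None
        return [row[local_x:local_x + size]
                for row in self.samples[local_y:local_y + size]]
```

Only the samples actually covered by the projected view are stored, the offset being enough to translate enhancement-image coordinates into sub-image coordinates.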

Claims

1. A method of encoding a data stream representative of an omnidirectional video, characterized in that it comprises:

a coding step in said stream of at least one base layer representative of a 2D or 3D video, the 2D or 3D video being representative of a view of the same scene captured by the omnidirectional video,

a coding step in said stream of at least one enhancement layer representative of the omnidirectional video, the at least one enhancement layer being coded by prediction with respect to the at least one base layer.

2. A method of decoding a data stream representative of an omnidirectional video, characterized in that it comprises:

a step of decoding from said stream of at least one base layer representative of a 2D or 3D video, the 2D or 3D video being representative of a view of the same scene captured by the omnidirectional video,

a step of decoding from said stream of at least one enhancement layer representative of the omnidirectional video, the at least one enhancement layer being decoded by prediction with respect to the at least one base layer.

3. The method of claim 1 or 2, wherein the prediction of the enhancement layer with respect to the at least one base layer comprises, for encoding or reconstructing at least one image of the enhancement layer:

the generation of a reference image obtained by geometric projection on said reference image of an image, called a basic image, reconstructed from the at least one base layer,

4. The method of claim 3, wherein the data stream comprises information representative of a type of geometric projection used to represent the omnidirectional video.

5. The method of any one of claims 1 to 4, wherein the view represented by the 2D or 3D video is a view extracted from the omnidirectional video.

6. The method of claim 5, wherein the data stream comprises information representative of the projection and location parameters of said base image in an image of the omnidirectional video, said information being used to project the base image onto the reference image.

7. The method of claim 6, wherein said information representative of the projection and location parameters of said base image is coded in the data stream at each image of the omnidirectional video.

8. The method according to any one of claims 1 to 7, wherein the data stream comprises at least two base layers, each base layer being representative of a 2D or 3D video respectively representative of a view of the scene, the at least two base layers being coded independently of one another.

9. The decoding method according to claim 8, wherein an image of the enhancement layer is coded using a group of tiles, each tile covering a region of the image of the enhancement layer, each region being distinct and disjoint from the other regions of the image of the enhancement layer, each tile being coded by prediction with respect to at least one base layer, and wherein the decoding of the enhancement layer comprises:

the reconstruction of a part of the image of the enhancement layer comprising the decoding of the tiles of the enhancement layer covering the portion of the image of the enhancement layer to be reconstructed, and

decoding the at least one base layer comprising the decoding of the base layers used to predict the tiles covering the portion of the image of the enhancement layer to be reconstructed.

10. The decoding method of claim 9, further comprising, for each tile of the enhancement layer to be decoded, decoding information identifying the at least one base layer used to predict the tile.

11. Device for encoding a data stream representative of an omnidirectional video, characterized in that it comprises:

encoding means in said stream of at least one base layer representative of a 2D or 3D video, the 2D or 3D video being representative of a view of the same scene captured by the omnidirectional video,

encoding means in said stream of at least one enhancement layer representative of the omnidirectional video, said enhancement layer encoding means comprising means for predicting the enhancement layer with respect to the at least one base layer.

12. Device for decoding a data stream representative of an omnidirectional video, characterized in that it comprises:

decoding means in said stream of at least one base layer representative of a 2D or 3D video, the 2D or 3D video being representative of a view of the same scene captured by the omnidirectional video,

decoding means in said stream of at least one enhancement layer representative of the omnidirectional video, said enhancement layer decoding means comprising means for predicting the enhancement layer with respect to the at least one base layer.

13. Signal representative of an omnidirectional video, characterized in that it comprises coded data of at least one base layer representative of a 2D or 3D video, the 2D or 3D video being representative of a view of the same scene captured by the omnidirectional video, and coded data of at least one enhancement layer representative of the omnidirectional video, the at least one enhancement layer being coded by prediction with respect to the at least one base layer.

14. The signal of claim 13, wherein an image of the enhancement layer is coded using a group of tiles, each tile covering a region of the image of the enhancement layer, each region being distinct and disjoint from the other regions of the image of the enhancement layer, each tile being coded by prediction with respect to at least one base layer, and wherein the signal comprises:

for each tile, information identifying the at least one base layer used to predict the tile.

15. Computer program comprising instructions for implementing the coding method according to any one of claims 1 to 8, or instructions for implementing the decoding method according to any one of claims 2 to 10, when said program is executed by a processor.