EP4104446A1 - Verfahren und vorrichtung zur verarbeitung von daten von mehrfachansichtsvideo - Google Patents

Verfahren und vorrichtung zur verarbeitung von daten von mehrfachansichtsvideo

Info

Publication number
EP4104446A1
Authority
EP
European Patent Office
Prior art keywords
obtaining
data
mode
item
view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21707767.6A
Other languages
English (en)
French (fr)
Inventor
Joël JUNG
Pavel Nikitin
Patrick GARUS
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orange SA
Original Assignee
Orange SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Orange SA filed Critical Orange SA
Publication of EP4104446A1 publication Critical patent/EP4104446A1/de
Pending legal-status Critical Current

Links

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/161 Encoding, multiplexing or demultiplexing different image signal components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/001 Model-based coding, e.g. wire frame
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/39 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability involving multiple description coding [MDC], i.e. with separate layers being structured as independently decodable descriptions of input picture data

Definitions

  • the invention relates to so-called immersive videos, representative of a scene captured by one or more cameras, including videos for virtual reality and free navigation. More particularly, the invention relates to the processing (coding, decoding, synthesis of intermediate views) of data from such videos.
  • Immersive video allows a viewer to watch a scene from any point of view, even from a point of view that was not captured by a camera.
  • a typical acquisition system is a set of cameras capturing the scene, either with multiple cameras located outside the scene or with divergent cameras mounted on a spherical platform.
  • the videos are usually displayed through virtual reality headsets (also known as HMD, for Head Mounted Device), but can also be displayed on 2D screens with an additional system for interacting with the user.
  • Free navigation in a scene requires proper management of every movement of the user in order to avoid motion sickness.
  • the movement is generally correctly captured by the display device (an HMD headset for example).
  • delivering the correct pixels to the display regardless of the user's movement (rotational or translational) is currently a problem.
  • This requires multiple captured views and the ability to generate additional virtual (synthesized) views, calculated from the decoded captured views and associated depths.
  • the number of views to be transmitted varies depending on the use case. However, the number of views to be transmitted and the amount of associated data is often large. Therefore, view transmission is an essential aspect of immersive video applications. It is therefore necessary to reduce as much as possible the bit rate of the information to be transmitted without compromising the quality of the synthesis of the intermediate views.
  • the views are either physically captured or computer generated.
  • depths are also captured with dedicated sensors.
  • the quality of this depth information is generally poor and prevents an effective synthesis of intermediate points of view.
  • Depth maps can also be calculated from the texture images of the captured videos. Many depth estimation algorithms exist and are used in the state of the art.
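  • By way of illustration only, the following minimal Python sketch shows one such texture-based depth estimation (a coarse block-matching approach, not the DERS software itself); the rectification of the two views and the focal length and baseline values are assumptions:

        import numpy as np

        def estimate_depth_block_matching(left, right, block=8, max_disp=64,
                                          focal=1000.0, baseline=0.1):
            """Very coarse depth estimation between two rectified texture views.

            left, right: grayscale images as 2D float arrays of the same size.
            Returns a per-block depth map (depth = focal * baseline / disparity).
            """
            h, w = left.shape
            depth = np.zeros((h // block, w // block))
            for by in range(h // block):
                for bx in range(w // block):
                    y, x = by * block, bx * block
                    ref = left[y:y + block, x:x + block]
                    best_d, best_cost = 1, np.inf
                    # search the best horizontal disparity by SAD matching
                    for d in range(1, min(max_disp, x + 1)):
                        cand = right[y:y + block, x - d:x - d + block]
                        cost = np.abs(ref - cand).sum()
                        if cost < best_cost:
                            best_cost, best_d = cost, d
                    depth[by, bx] = focal * baseline / best_d
            return depth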
  • FIG. 1 shows an immersive video processing diagram comprising for example two captured views having respectively the texture information Tx0y0 and Tx1y0.
  • Depth information Dx0y0 and Dx1y0 associated with each view Tx0y0 and Tx1y0 are estimated by an estimation module FE.
  • the depth information Dx0y0 and Dx1y0 are obtained by depth estimation software (DERS, Depth Estimation Reference Software); the views Tx0y0 and Tx1y0 and the obtained depth information Dx0y0 and Dx1y0 are then encoded (CODEC), for example using an MV-HEVC encoder.
  • the views T*x0y0 and T*x1y0 and the associated depths of each view D*x0y0 and D*x1y0 are decoded and used by a synthesis algorithm (SYNTHESIS), for example the VSRS software (View Synthesis Reference Software), to calculate intermediate views, for example here the intermediate views Sx0y0 and Sx1y0.
  • full depth maps are generated and sent, while on the client side, not all parts of all depth maps are useful. This is because the views can have redundant information, which makes certain parts of the depth maps unnecessary. Additionally, in some cases, viewers may request only specific points of view. Without a return channel between the client and the server providing the encoded immersive video, the depth estimator on the server side has no knowledge of these specific viewpoints.
  • the calculation of the depth information on the server side prevents any interaction between the depth estimator and the synthesis algorithm. For example, if a depth estimator wishes to inform the synthesis algorithm that it cannot correctly find the depth of a specific area, it must pass that information in the bitstream, most likely in the form of a binary map.
  • the configuration of the encoder to encode the depth maps in order to obtain the best compromise between the quality of the synthesis and the coding cost for the transmission of the depth maps is not obvious.
  • the number of pixels to be processed by a decoder is high when the textures and the depth maps are encoded, transmitted and decoded. This can, for example, slow down the deployment of immersive video processing schemes on smartphone-type terminals.
  • the invention improves the state of the art. To this end, it relates to a method for processing data from a multi-view video, comprising:
  • the invention makes it possible to take advantage of different modes of obtaining synthesis data in a flexible manner, by allowing the selection, for each synthesis data item, of an obtaining mode that is optimal, for example in terms of coding cost / quality of the synthesis data, or depending on the tools available on the decoder side and/or on the encoder side. This selection is flexible since it can advantageously be carried out at block, image, view or video level. The level of granularity of the mode of obtaining the synthesis data can therefore be adapted according to the content of the multi-view video, for example, or to the tools available on the client/decoder side.
  • according to the first obtaining mode, the synthesis data item is determined on the encoder side, encoded and transmitted to a decoder in a data stream.
  • the quality of the synthesis data can thus be favoured, since it is determined from original, for example uncoded, images.
  • the synthesis data therefore does not suffer, during its estimation, from the coding artifacts of the decoded textures.
  • according to the second obtaining mode, the synthesis data item is determined on the decoder side.
  • the data necessary for the synthesis of intermediate views are obtained from the decoded and reconstructed views which have been transmitted to the decoder.
  • Such synthesis data can be obtained at the level of the decoder, or else by a module independent of the decoder taking as input the views decoded and reconstructed by the decoder.
  • This second obtaining mode makes it possible to reduce the cost of encoding the data of the multi-view video, and the decoding of the multi-view video is simplified, since the decoder no longer has to decode the data used for the synthesis of intermediate views.
  • the invention also improves the quality of the synthesis of intermediate views.
  • synthesis data estimated at the decoder may be more suitable for the synthesis of views than coded synthesis data, for example when different estimators are available on the client side and on the server side.
  • the determination of the synthesis data at the encoder may be more suitable, for example when the decoded textures show compression artefacts or when the textures do not include enough redundant information to estimate the synthesis data on the client side.
  • said at least one synthesis data item corresponds to at least part of a depth map.
  • said at least one item of information indicating a mode of obtaining the synthesis data item is obtained by decoding a syntax element.
  • the information is encoded in the data stream.
  • said at least one item of information indicating a method of obtaining the synthesis data item is obtained from at least one data item coded for the reconstructed coded image.
  • the information is not directly encoded in the data stream; it is derived from the data encoded for an image in the data stream.
  • the derivation process is here identical at the encoder and at the decoder.
  • the obtaining mode is selected from among the first obtaining mode and the second obtaining mode as a function of a value of a quantization parameter used to encode at least said block.
  • the method further comprises, when said at least one item of information indicates that the synthesis data item is obtained according to the second mode of obtaining:
  • This particular embodiment of the invention makes it possible to control the method for obtaining the synthesis data item; for example, it may involve controlling the functionalities of a depth estimator, such as the size of the search window or the precision.
  • the control parameter can also indicate which depth estimator to use, and/or the parameters of this estimator, or a depth map to initialize the estimator.
  • the invention also relates to a device for processing multi-view video data, comprising a processor configured for:
  • the multi-view video data processing device is included in a terminal.
  • the invention also relates to a method for encoding multi-view video data, comprising:
  • the determination, for at least one block of an image of a view in an encoded data stream representative of the multi-view video, of at least one item of information indicating a mode for obtaining at least one synthesis data item, from among a first obtaining mode and a second obtaining mode, said at least one synthesis data item being used to synthesize at least one image of an intermediate view of the multi-view video, said intermediate view not being encoded in said encoded data stream, said first obtaining mode corresponding to a decoding of at least one item of information representative of the at least one synthesis data item from the encoded data stream, said second obtaining mode corresponding to obtaining the at least one synthesis data item from at least said reconstructed coded image,
  • the encoding method comprises the encoding in the data stream of a syntax element associated with said information indicating a mode of obtaining the synthesis data.
  • the coding method further comprises, when the information indicates that the synthesis data item is obtained according to the second mode of obtaining:
  • the invention also relates to a device for encoding multi-view video data, comprising a processor and a memory configured for:
  • the multi-view video data processing method according to the invention can be implemented in various ways, in particular in hardwired form or in software form.
  • the multi-view video data processing method is implemented by a computer program.
  • the invention also relates to a computer program comprising instructions for implementing the multi-view video data processing method according to any one of the particular embodiments described above, when said program is executed by a processor.
  • Such a program can use any programming language. It can be downloaded from a communications network and / or recorded on a computer readable medium.
  • This program can use any programming language, and be in the form of source code, object code, or intermediate code between source code and object code, such as in a partially compiled form, or in any other desirable form.
  • the invention also relates to a recording medium or information medium readable by a computer, and comprising instructions of a computer program as mentioned above.
  • the aforementioned recording medium can be any entity or device capable of storing the program.
  • the medium may comprise a storage means, such as a ROM, for example a CD ROM or a microelectronic circuit ROM, a USB key, or else a magnetic recording means, for example a hard disk.
  • the recording medium can correspond to a transmissible medium such as an electrical or optical signal, which can be conveyed via an electrical or optical cable, by radio or by other means.
  • the program according to the invention can in particular be downloaded from an Internet type network.
  • the recording medium can correspond to an integrated circuit in which the program is incorporated, the circuit being adapted to execute or to be used in the execution of the method in question.
  • FIG. 1 illustrates a diagram for processing multi-view video data according to the prior art.
  • FIG. 2 illustrates a multi-view video data processing diagram according to a particular embodiment of the invention.
  • FIG. 3A illustrates steps of a multi-view video data processing method according to a particular embodiment of the invention.
  • FIG. 3B illustrates steps of a multi-view video data processing method according to another particular embodiment of the invention.
  • FIG. 4A illustrates steps of a multi-view video coding method according to particular embodiments of the invention.
  • FIG. 4B illustrates steps of a method for encoding multi-view video according to particular embodiments of the invention.
  • FIG. 5 illustrates an example of a multi-view video data processing diagram according to a particular embodiment of the invention.
  • FIG. 6A illustrates a texture matrix of a multi-view video according to a particular embodiment of the invention.
  • FIG. 6B illustrates steps of the depth encoding method for a current block according to a particular embodiment of the invention.
  • FIG. 7A illustrates an example of a data flow according to a particular embodiment of the invention.
  • FIG. 7B illustrates an example of a data flow according to another particular embodiment of the invention.
  • FIG. 8 illustrates a multi-view video coding device according to a particular embodiment of the invention.
  • FIG. 9 illustrates a device for processing multi-view video data according to a particular embodiment of the invention.
  • FIG. 1, described above, illustrates a diagram of processing of multi-view video data according to the prior art.
  • the depth information is determined, encoded and transmitted in a data stream to the decoder which decodes it.
  • FIG. 2 illustrates a multi-view video data processing diagram according to a particular embodiment of the invention.
  • the depth information is not encoded in the data stream, but determined on the client side, from the reconstructed images of the multi-view video.
  • the texture images Tx0y0 and Tx1y0 from the captured views are encoded (CODEC), for example using an MV-HEVC encoder, and sent to a display device of a user, for example.
  • the textures T*x0y0 and T*x1y0 of the views are decoded and used to estimate the depth information D'x0y0 and D'x1y0 associated with each view Tx0y0 and Tx1y0, by an estimation module FE.
  • the depth information D'x0y0 and D'x1y0 are obtained by depth estimation software (DERS).
  • the decoded views T*x0y0 and T*x1y0 and the associated depths of each view D'x0y0 and D'x1y0 are used by a synthesis algorithm (SYNTHESIS) to calculate intermediate views, for example here the intermediate views S'x0y0 and S'x1y0.
  • the aforementioned VSRS software can be used as a view synthesis algorithm.
  • the complexity of the client terminal is greater than when the depth information is transmitted to the decoder. This may require using depth estimation algorithms simpler than those used on the encoder side, which may then fail on complex scenes.
  • the texture information may not include enough redundancy to perform the estimation of the depth or of the data useful for the synthesis, for example because of the encoding of the texture information on the server side, during which some texture information may not be encoded.
  • the invention proposes a method making it possible to select a mode of obtaining synthesis data from among a first obtaining mode (M1) according to which the synthesis data are encoded and transmitted to the decoder, and a second obtaining mode (M2) according to which the synthesis data are estimated on the client side.
  • the best mode of obtaining one or more synthesis data items is selected for each image, or each block, or for any other granularity.
  • FIG. 3A illustrates steps of a method for processing multi-view video data according to a particular embodiment of the invention.
  • the selected mode of obtaining is encoded and transmitted to the decoder.
  • a data stream BS comprising in particular texture information from one or more views of a multi-view video is transmitted to the decoder. It is considered, for example, that two views have been coded in the data stream BS.
  • the data stream BS also includes at least one syntax element representative of an item of information indicating a mode of obtaining at least one synthesis data item, from among a first obtaining mode M1 and a second obtaining mode M2.
  • the decoder decodes the texture information of the data stream to obtain the textures T*0 and T*1.
  • the syntax element representative of the information indicating an obtaining mode is decoded from the data stream.
  • This syntax element is encoded in the data stream for at least one block of the texture image of a view. Its value can therefore change for each texture block in a view.
  • the syntax element is coded once for all the blocks of the texture image of a view T0 or T1. The information indicating a mode of obtaining synthesis data is therefore the same for all the blocks of the texture image T0 or T1.
  • the syntax element is coded once for all the texture images of the same view, or the syntax element is coded once for all the views.
  • in a step 31, an item of information d0 on the obtaining mode, associated with the decoded texture image T*0, and an item of information d1 on the obtaining mode, associated with the decoded texture image T*1, are then obtained.
  • in a step 32, it is checked, for each item of information d0 and d1 indicating a mode of obtaining the synthesis data associated respectively with the decoded texture images T*0 and T*1, whether the obtaining mode corresponds to the first obtaining mode M1 or to the second obtaining mode M2.
  • if the information d0, respectively d1, indicates the first obtaining mode M1, the synthesis data D*0, respectively D*1, associated with the decoded texture image T*0, respectively T*1, are decoded from the data stream BS.
  • if the information d0, respectively d1, indicates the second obtaining mode M2, the synthesis data D+0, respectively D+1, associated with the decoded texture image T*0, respectively T*1, are estimated from the reconstructed texture images of the multi-view video.
  • the estimation can use the decoded texture T*0, respectively T*1, and possibly other previously reconstructed texture images.
  • the decoded textures T*0 and T*1 and the decoded (D*0, D*1) or estimated (D+0, D+1) synthesis information are used to synthesize an image of an intermediate view S0.5.
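  • By way of illustration, the decoder-side flow described above can be sketched as follows in Python; the helper functions (decode_texture, decode_mode_flag, decode_depth, estimate_depth, synthesize) are hypothetical placeholders for the operations of the steps described above, not an actual decoder API:

        # Illustrative decoder-side flow (hypothetical helper functions).
        M1, M2 = 0, 1  # first / second obtaining mode

        def process_views(stream, decode_texture, decode_mode_flag,
                          decode_depth, estimate_depth, synthesize):
            textures = [decode_texture(stream, v) for v in (0, 1)]   # T*0, T*1
            depths = []
            for v in (0, 1):
                mode = decode_mode_flag(stream, v)                   # d0, d1
                if mode == M1:
                    depths.append(decode_depth(stream, v))           # D*v from the stream
                else:
                    depths.append(estimate_depth(textures, v))       # D+v from T*0, T*1
            # synthesize an intermediate view, e.g. at position 0.5
            return synthesize(textures, depths, position=0.5)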
  • FIG. 3B illustrates steps of a multi-view video data processing method according to another particular embodiment of the invention.
  • the selected obtaining mode is not transmitted to the decoder. The decoder derives the obtaining mode from the previously decoded texture data.
  • a data stream BS comprising in particular texture information from one or more views of a multi-view video is transmitted to the decoder. It is considered, for example, that two views have been coded in the data stream BS.
  • the decoder decodes the texture information of the data stream to obtain the textures T*0 and T*1.
  • the decoder obtains information indicating an obtaining mode, from among a first obtaining mode M1 and a second obtaining mode M2, for at least one synthesis data item to be used to synthesize an image of an intermediate view.
  • this information can be obtained for each block of the texture image of a view.
  • the obtaining mode can therefore change for each texture block in a view.
  • this information is obtained once for all the blocks of the texture image of a view T*0 or T*1.
  • the information indicating a mode of obtaining synthesis data is therefore the same for all the blocks of the texture image T*0 or T*1.
  • the information is obtained once for all the texture images of the same view or the information is obtained once for all the views.
  • in a step 32′, the information is obtained for each texture image of a view.
  • the obtaining mode information is here obtained by applying the same determination process that was applied at the encoder. An example of a determination process is described below in relation to FIG. 4.
  • if the information d0, respectively d1, indicates the first obtaining mode M1, during a step 34′, the synthesis data D*0, respectively D*1, associated with the decoded texture image T*0, respectively T*1, are decoded from the data stream BS.
  • if the information d0, respectively d1, indicates the second obtaining mode M2, the synthesis data D+0, respectively D+1, associated with the decoded texture image T*0, respectively T*1, are estimated from the reconstructed texture images of the multi-view video.
  • the estimation can use the decoded texture T*0, respectively T*1, and possibly other previously reconstructed texture images.
  • the decoded textures T*0 and T*1 and the decoded (D*0, D*1) or estimated (D+0, D+1) synthesis information are used to synthesize an image of an intermediate view S0.5.
  • the multi-view video data processing method described here according to particular embodiments of the invention is particularly applicable in the case where the synthesis data correspond to depth information.
  • the data processing method applies to all types of synthesis data, such as an object segmentation map.
  • for a given view at a given instant of the video, the method described above can be applied to several types of synthesis data.
  • these two types of synthesis data can be partially transmitted to the decoder, the other part being derived by the decoder or by the synthesis module.
  • part of the texture can be estimated, for example by interpolation. The view corresponding to such a texture estimated at the decoder is considered in this case as synthesis data.
  • the examples described here include two texture views, respectively producing two depth maps, but other combinations are of course possible, including the processing of a depth map at a given time, associated with one or more texture views.
  • FIG. 4A illustrates steps of a multi-view video coding method according to a particular embodiment of the invention.
  • the coding method is described here in the case of two views comprising the textures T0 and T1 respectively.
  • each texture T0 and T1 is coded and decoded to provide the decoded textures T*0 and T*1.
  • the textures can here correspond to an image of a view, to a block of an image of a view or to any other type of granularity relating to the texture information of a multi-view video.
  • synthesis data, for example depth maps D+0 and D+1, are estimated from the decoded textures T*0 and T*1, using a depth estimator.
  • the synthesis data D0 and D1 are estimated from the uncoded textures T0 and T1, for example using a depth estimator.
  • the obtained synthesis data D0 and D1 are then encoded, then decoded to provide the reconstructed synthesis data D*0 and D*1. This corresponds to the first mode M1 of obtaining the synthesis data.
  • in a step 44, an obtaining mode to be used at the decoder to obtain the synthesis data is determined from among the first obtaining mode M1 and the second obtaining mode M2.
  • a syntax element is encoded in the data stream to indicate the selected mode of obtaining.
  • the selection can for example be based on a rate/distortion cost J = D + λR, where R corresponds to the bit rate, D corresponds to the distortion and λ is the Lagrange multiplier used for the optimization.
  • a first variant is based on synthesizing an intermediate view, or a block of an intermediate view in the case where the obtaining mode is coded for each block, and evaluating the quality of the synthesized view, considering the two modes of obtaining the synthesis data.
  • a first version of the intermediate view is therefore synthesized for the obtaining mode M2, from the decoded textures T*0 and T*1 and from the synthesis data D+0 and D+1 estimated from the decoded textures T*0 and T*1.
  • the bit rate then corresponds to the cost of encoding the textures T*0 and T*1 plus the cost of encoding the syntax element indicating the selected obtaining mode.
  • This bit rate can be calculated precisely by using, for example, an entropy coder (for example binary arithmetic coding or variable-length coding, with or without context adaptation).
  • a second version of the intermediate view is also synthesized for the obtaining mode M1, from the decoded textures T*0 and T*1 and from the decoded synthesis data D*0 and D*1.
  • the bit rate then corresponds to the cost of coding the textures T*0 and T*1 and the synthesis data D*0 and D*1, to which is added the cost of coding the syntax element indicating the selected obtaining mode. This bit rate can be calculated as indicated above.
  • the distortion can be calculated by a metric comparing the image or the block of the synthesized view with the image or the block of the view synthesized from the uncoded textures T0 and T1 and the uncoded synthesis data D0 and D1.
  • the obtaining mode providing the lowest bit rate / distortion cost J is selected.
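  • By way of illustration, this rate/distortion selection can be sketched as follows in Python, with J = D + λR evaluated for each obtaining mode; the rate and distortion values are assumed to have been measured as described above, and the value of the Lagrange multiplier is an assumption:

        # Illustrative rate/distortion selection of the obtaining mode.
        LAMBDA = 50.0  # assumed Lagrange multiplier

        def select_obtaining_mode(dist_m1, dist_m2,
                                  rate_texture, rate_depth, rate_flag,
                                  lam=LAMBDA):
            # M1: the depth is coded and transmitted -> its rate is paid in addition
            j_m1 = dist_m1 + lam * (rate_texture + rate_depth + rate_flag)
            # M2: the depth is estimated at the decoder -> no depth rate
            j_m2 = dist_m2 + lam * (rate_texture + rate_flag)
            return "M1" if j_m1 <= j_m2 else "M2"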
  • the distortion can also be calculated by applying a no-reference metric to the image or the synthesized block, to avoid using the original uncompressed texture.
  • a no-reference metric can for example measure, in the image or the synthesized block, the amount of noise, blur or blocking artifacts, the sharpness of the contours, etc.
  • the selection of the mode can also be obtained, for example, by comparing the synthesis data D0 and D1 estimated from the uncompressed textures with the synthesis data D+0 and D+1 estimated from the coded-decoded textures. If the synthesis data are close enough, according to a determined criterion, the estimation of the synthesis data on the client side will be more efficient than the encoding and transmission of the synthesis data. According to this variant, the synthesis of an image or of a block of an intermediate view is avoided.
  • the selection of an obtaining mode may depend on the characteristics of the depth information. For example, computer-generated depth information or high-quality captured depth is more likely to be suitable for M1 obtaining mode.
  • the depth maps can also be estimated from the textures decoded as described above and put into competition with the depth maps generated by computer or captured in high quality. The depth maps generated by computer or captured in high quality then replace the depth maps estimated from the uncompressed textures in the method described above.
  • the quality of the depth can be used to determine a mode of obtaining the synthesis data.
  • the quality of the depth, which can be measured by an appropriate objective metric, can provide relevant information. For example, when the quality of the depth is low, or when the temporal coherence of the depth is low, it is probable that the obtaining mode M2 is the most suitable for obtaining the depth information.
  • a syntax element d representative of the selected obtaining mode is encoded in the data stream.
  • when the selected and coded mode corresponds to the first obtaining mode M1, the synthesis data D0 and D1 are also coded in the data stream, for the block or the image considered.
  • additional information can also be coded in the data stream.
  • additional information may correspond to one or more control parameters to be applied by the decoder or by a synthesis module when obtaining said synthesis data according to the second obtaining mode. These can be parameters for controlling a synthesis data estimator, or a depth estimator for example.
  • control parameters can control the functionality of a depth estimator, such as increasing or decreasing the search interval, or increasing or decreasing precision.
  • the control parameters can indicate how a synthesis data item is to be estimated on the decoder side.
  • the control parameters indicate which depth estimator to use.
  • the encoder can test several depth estimators and select the estimator providing the best bitrate / distortion compromise among: a pixel-based depth estimator, an estimator of depth based on triangle-warping, a fast depth estimator, a monocular neural network depth estimator, a neural network depth estimator using multiple references.
  • the encoder informs the decoder or the synthesis module to use a similar synthesis data estimator.
  • control parameters can comprise parameters of a depth estimator such as the disparity interval, the precision, the neural network model, the optimization or aggregation method, energy function smoothing factors, cost functions (color-based, correlation-based, frequency-based), a simple depth map that can be used as an initialization for the client-side depth estimator, etc.
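  • By way of illustration, such control parameters could be grouped as follows in Python; the field names and default values are hypothetical and do not correspond to a normative syntax:

        from dataclasses import dataclass

        # Illustrative container for the control parameters mentioned above;
        # every field name and default value is an assumption.
        @dataclass
        class DepthEstimatorParams:
            estimator: str = "block_matching"   # e.g. "triangle_warping", "monocular_nn"
            disparity_range: tuple = (1, 64)    # search interval
            precision: str = "quarter_pel"
            smoothing_factor: float = 0.1       # energy-function smoothing
            cost_function: str = "color"        # "color", "correlation" or "frequency"
            init_depth_map: object = None       # optional coarse map used as initialization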
  • FIG. 4B illustrates steps of a multi-view video coding method according to another particular embodiment of the invention.
  • the mode of obtaining the synthesis data is not coded in the data stream, but deduced from the coded information that will be available to the decoder.
  • the coding method is described here in the case of two views comprising the textures T0 and T1 respectively.
  • each texture T0 and T1 is coded and decoded to provide the decoded textures T*0 and T*1.
  • the textures can here correspond to an image of a view, to a block of an image of a view or to any other type of granularity relating to the texture information of a multi-view video.
  • an obtaining mode is determined to be used at the decoder to obtain the synthesis data from among the first obtaining mode M1 and the second obtaining mode M2.
  • the encoder can use any information that will be available to the decoder, to decide on the obtaining mode which must be applied to the block or to the image considered.
  • the selection of an obtaining mode can be based on a quantization parameter, for example, a QP (for Quantization Parameter) used to encode an image or a texture block. For example, when the quantization parameter is greater than a determined threshold, the second obtaining mode is selected, otherwise the first obtaining mode is selected.
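  • By way of illustration, this implicit derivation can be sketched as follows in Python, applied identically at the encoder and at the decoder; the threshold value is an assumption (it can be fixed by convention or transmitted):

        # Illustrative implicit derivation of the obtaining mode from the block QP.
        QP_THRESHOLD = 32  # assumed threshold

        def derive_obtaining_mode(block_qp, threshold=QP_THRESHOLD):
            # high QP -> second obtaining mode (depth estimated on the client side)
            return "M2" if block_qp > threshold else "M1"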
  • the synthesis data D0 and D1 can be generated by computer or captured in high quality.
  • This type of synthesis data is more suited to the obtaining mode M1.
  • the obtaining mode selected for such synthesis data will then be the obtaining mode M1.
  • metadata must be transmitted to the decoder to indicate the origin of the depth (computer generated, captured in high quality). This information can be transmitted at the view sequence level.
  • the synthesis data D0 and D1 are estimated from the uncoded textures T0 and T1, for example using a depth estimator. This estimation is of course not carried out in the case where the synthesis data come from computer generation or from a high-quality capture.
  • the obtained synthesis data D0 and D1 are then encoded in the data stream.
  • additional information can also be encoded in the data stream, during a step 46′.
  • such information can correspond to one or more control parameters to be applied by the decoder or by a synthesis module when obtaining said synthesis data according to the second obtaining mode.
  • these control parameters are similar to those described in relation to FIG. 4A.
  • FIG. 5 illustrates an example of a multi-view video data processing diagram according to a particular embodiment of the invention.
  • a scene is captured by a CAPT video capture system.
  • the view capture system comprises one or more cameras capturing the scene.
  • the scene is captured by two converging cameras, located outside the scene and looking towards the scene from two separate locations.
  • the cameras are therefore at different distances from the scene and have different angles / orientations.
  • Each camera provides a sequence of uncompressed images.
  • the image sequences respectively comprise a sequence of texture images T0 and T1.
  • the texture images T0 and T1 resulting from the image sequences respectively supplied by the two cameras are encoded by a COD encoder, for example an MV-HEVC encoder, which is a multi-view video encoder.
  • the coder COD supplies a data stream BS which is transmitted to a decoder DEC, for example via a data network.
  • the depth maps D0 and D1 are estimated from the uncompressed textures T0 and T1, and the depth maps D+0 and D+1 are estimated from the decoded textures T*0 and T*1, using a depth estimator, for example the DERS estimator.
  • a first view T'0 located at a position captured by one of the cameras, for example here position 0, is synthesized using the depth map D0, and a second view T"0 located at the same position is synthesized using the depth map D+0.
  • the quality of the two synthesized views is compared, for example by calculating the PSNR (Peak Signal to Noise Ratio) between each of the synthesized views T'0, T"0 and the captured view T0 located at the same position.
  • the comparison makes it possible to select an obtaining mode for the depth map D0, from among a first obtaining mode according to which the depth map D0 is coded and transmitted to the decoder, and a second obtaining mode according to which the depth map D+0 is estimated at the decoder.
  • the same process is iterated for the depth map D1 associated with the captured texture T1.
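  • By way of illustration, this PSNR-based selection can be sketched as follows in Python; the array arguments are assumed to be the synthesized views T'0 and T"0 and the captured view T0 at the same position, and the peak value is an assumption:

        import numpy as np

        # Illustrative PSNR comparison used to choose the obtaining mode for D0.
        def psnr(a, b, peak=255.0):
            mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
            return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

        def choose_depth_mode(view_from_d0, view_from_d_plus0, captured_view):
            psnr_m1 = psnr(view_from_d0, captured_view)        # synthesis using D0
            psnr_m2 = psnr(view_from_d_plus0, captured_view)   # synthesis using D+0
            # keep the transmitted depth only if it yields the better synthesis
            return "M1" if psnr_m1 > psnr_m2 else "M2"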
  • FIG. 7A illustrates an example of part of a data stream BS according to this particular embodiment of the invention.
  • the data stream BS comprises the coded textures T0 and T1 and the syntax elements d0 and d1 indicating, respectively for each of the textures T0 and T1, the mode of obtaining the depth maps D0 and D1. If it is decided to code and transmit the depth map D0, respectively D1, the value of the syntax element d0, respectively d1, is for example 0, and the data stream BS then includes the encoded depth map D0, respectively D1.
  • otherwise, the data stream BS does not include the depth map D0, respectively D1. It may optionally comprise, according to variant embodiments, one or more control parameters PAR to be applied when obtaining the depth map D+0, respectively D+1, by the decoder or by the synthesis module.
  • the encoded data stream BS is then decoded by the decoder DEC.
  • the decoder DEC is included in a smartphone equipped with decoding functionalities for free navigation.
  • a user looks at the scene from the point of view provided by the first camera. The user then slides their point of view slowly to the left to the other camera. During this process, the smartphone displays intermediate views of the scene that were not captured by the cameras.
  • the data stream BS is parsed and decoded, by an MV-HEVC decoder for example, to provide two decoded textures T*0 and T*1.
  • the depth map D+k is estimated, at the decoder or by a synthesis module, from the decoded textures T*0 and T*1.
  • a synthesis module SYNTH, for example based on a VVS synthesis algorithm (Versatile View Synthesizer), uses the decoded textures T*0 and T*1 and the decoded depth maps D*0 and D*1, or the estimated depth maps D+0 and D+1 as the case may be, to synthesize intermediate views between the views corresponding to the textures T0 and T1.
  • the multi-view video data processing scheme described in Fig. 5 is not limited to the embodiment described above.
  • the scene is captured by six omnidirectional cameras located in the scene, from six different locations.
  • Each camera provides a sequence of 2D images in an equi-rectangular projection format (ERP for Equi-Rectangular Projection).
  • the six textures coming from the cameras are encoded using a 3D-HEVC encoder which is a multi-view encoder, providing a data stream BS which is for example transmitted via a data network.
  • a 2x3 matrix of source textures T (textures originating from the cameras) is supplied at the input of the encoder.
  • a source depth map matrix D is estimated from the uncompressed textures using a depth estimator based on a neural network.
  • the texture matrix T is encoded and decoded using the 3D-HEVC encoder, providing the decoded texture matrix T*.
  • the decoded texture matrix T* is used to estimate the depth map matrix D+ using the neural-network-based depth estimator.
  • the selection of an obtaining mode for the depth map associated with a texture is carried out for each block or coding unit (also known as CTU, for Coding Tree Unit, in the HEVC encoder).
  • FIG. 6B illustrates the steps of the depth encoding method for a current block Dx0y0(x, y, t) to be encoded, where (x, y) corresponds to the position of the upper left corner of the block in the image and t to the temporal instant of the image.
  • the depth encoding for the current block Dx0y0(0,0,0) is first evaluated by determining an optimal encoding mode among the different tools available to the encoder for encoding the depth of a block.
  • Such coding tools can include any type of depth coding tools available in a multi-view coder.
  • the depth Dx0y0(0,0,0) of the current block is encoded using a first encoding tool, providing a coded-decoded depth D*x0y0(0,0,0) for the current block.
  • a view at a position of one of the cameras is synthesized, using the VVS synthesis software for example. For example, a view at position x1y0 is synthesized using the views decoded at positions x0y0, x2y0 and x1y1 of the texture matrix T.
  • the depth for all blocks of the multi-view video that have not yet been processed comes from the estimated source depth D.
  • the depth for all blocks of the multi-view video for which the depth has already been encoded comes from the coded-decoded depth or from the depth estimated from the decoded textures, according to the obtaining mode that was selected for each block.
  • the depth of the current block used for the synthesis of the view at position x1y0 is the coded-decoded depth D*x0y0(0,0,0) according to the coding tool being evaluated.
  • the quality of the synthesized view is evaluated using an error metric, for example a quadratic error, between the synthesized view at position x1y0 and the source view Tx1y0, and the cost of coding the depth of the current block according to the tool under test is calculated.
  • in a step 63, it is checked whether all the depth encoding tools have been tested for the current block; if this is not the case, steps 60 to 62 are iterated for the next encoding tool, otherwise the method goes to step 64.
  • in a step 65, another view at the same position as in step 61 is synthesized using the textures decoded at positions x0y0, x2y0 and x1y1 with the VVS software and the estimated depth D+x0y0(0,0,0) of the current block.
  • in a step 66, the distortion between the view synthesized at position x1y0 and the source view Tx1y0 is calculated, and the cost of coding the depth is set to 0, since according to this obtaining mode the depth is not encoded but estimated at the decoder.
  • in a step 67, the optimal obtaining mode is decided according to the bit rate/distortion cost of each mode of obtaining the depth.
  • the mode of obtaining the depth minimizing the bit rate/distortion criterion is selected from among the encoding of the depth with the optimal encoding tool selected in step 64 and the estimation of the depth at the decoder.
  • a syntax element is encoded in the data stream indicating the obtaining mode selected for the current block. If the selected obtaining mode corresponds to the encoding of the depth, the depth is encoded in the data stream according to the optimal encoding tool selected previously.
  • Steps 60 to 68 are iterated for the next block to be processed, Dx0y0(64,0,0) for example if the first block has a size of 64x64. All the blocks of the depth map associated with the texture of the view at position x0y0 are processed in the same way, taking into account, during view synthesis, the coded-decoded or estimated depths of the previously processed blocks.
  • the depth maps of the other views are also treated in a similar way.
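  • By way of illustration, the per-block decision loop of steps 60 to 68 can be sketched as follows in Python; the helper functions and the Lagrange multiplier value are hypothetical placeholders for the operations described above:

        # Illustrative per-block decision loop (steps 60 to 68, hypothetical helpers).
        LAMBDA = 50.0  # assumed Lagrange multiplier

        def decide_block_depth(block, coding_tools, encode_decode, estimate_depth,
                               synthesize_view, distortion, rate_of, lam=LAMBDA):
            best = None
            # steps 60-63: try every available depth coding tool
            for tool in coding_tools:
                rec_depth = encode_decode(block, tool)            # D* for the block
                synth = synthesize_view(depth_block=rec_depth)    # view at x1y0
                cost = distortion(synth) + lam * rate_of(block, tool)
                if best is None or cost < best[0]:
                    best = (cost, tool)
            # steps 65-66: depth estimated at the decoder, depth rate counted as 0
            est_depth = estimate_depth(block)                     # D+ for the block
            synth = synthesize_view(depth_block=est_depth)
            cost_m2 = distortion(synth)
            # step 67: keep the obtaining mode with the lowest rate/distortion cost
            if cost_m2 < best[0]:
                return ("M2", None)
            return ("M1", best[1])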
  • the coded data stream comprises different information for each block. If it has been decided to encode and transmit the depth for a given block, the data stream includes, for that block, the encoded texture of the block, a block of encoded depth data, and the syntax element indicating the mode of obtaining the depth for the block. If it has been decided not to encode the depth for the block, the data stream includes, for the block, the encoded texture of the block, a depth information block containing a single uniform gray level value, and the syntax element indicating the mode of obtaining the depth for the block.
  • the data stream may include the textures encoded consecutively for all the blocks, then the depth data and the syntax elements of the blocks.
  • the decoding can for example be carried out via a virtual reality headset equipped with free navigation functionalities, and worn by a user.
  • the user looks at the scene from a point of view provided by one of the six cameras.
  • the user looks around and slowly begins to move around the scene.
  • the headset tracks the user's movement and displays corresponding views of the scene that were not captured by the cameras.
  • the decoder DEC decodes the texture matrix T* from the coded data stream.
  • the syntax elements for each block are also decoded from the encoded data stream.
  • the depth of each block is obtained by decoding the block of depth data encoded for the block or by estimating the depth data from the decoded textures according to the value of the decoded syntax element for the block.
  • An intermediate view is synthesized using the decoded texture matrix T * and the reconstructed depth matrix comprising for each block the depth data obtained as a function of the obtaining mode indicated by the decoded syntax element for the block.
  • the multi-view video data processing diagram described in FIG. 5 also applies in the case where the syntax element is not coded, whether at the block level or at the image level.
  • the COD encoder can apply an image level decision mechanism to decide whether the depth should be transmitted to the decoder or estimated after decoding.
  • the encoder which operates in variable bit rate mode, allocates, in a known manner, quantization parameters (QPs) to the blocks of the texture images so as to reach a target overall bit rate.
  • An average of the QPs allocated to the blocks of a texture image is calculated, optionally using a weighting between blocks. This provides an average QP for the texture image, representative of an importance level of the image.
  • if the average QP is below a determined threshold, the encoder decides to calculate the depth map for this texture image from the uncompressed textures of the multi-view video, to encode the calculated depth map and to transmit it in the data stream.
  • this is for example the case when the target bit rate is a high bit rate.
  • otherwise, the encoder does not calculate the depth for this texture image and proceeds to the next texture image. No depth is encoded for this image, nor is any indicator transmitted to the decoder.
  • FIG. 7B illustrates an example of a part of a data stream encoded according to this particular embodiment of the invention.
  • the encoded data stream includes in particular the encoded textures for each image, here T0 and T1.
  • the encoded data stream also includes information making it possible to obtain the average QP of each image. For example, this can be coded at the image level, or else conventionally obtained from the QPs coded for each block in the data stream.
  • the coded data stream also comprises the calculated and coded depth data D0 and/or D1, according to the decision taken by the coder. It is noted here that the syntax elements d0 and d1 are not encoded in the data stream.
  • the data stream may include PAR parameters to be applied when estimating the depth. These parameters have already been described above.
  • the decoder DEC runs through the encoded data stream and decodes the texture images T*0 and T*1.
  • the decoder applies the same decision mechanism as the encoder, by calculating the average QP of each texture image.
  • the decoder then deduces therefrom, using the determined threshold, which can be transmitted in the data stream or else known to the decoder, whether the depth for a given texture image must be decoded or estimated.
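  • By way of illustration, this image-level decision, shared by the encoder and the decoder, can be sketched as follows in Python; the threshold value and the optional block weighting are assumptions:

        import numpy as np

        # Illustrative image-level decision: the depth of a texture image is
        # decoded only when its (optionally weighted) average QP is below a
        # determined threshold; otherwise it is estimated from the decoded textures.
        def depth_must_be_decoded(block_qps, block_weights=None, threshold=32):
            avg_qp = np.average(np.asarray(block_qps, dtype=float),
                                weights=block_weights)
            return avg_qp < threshold   # True -> decode depth (M1); False -> estimate (M2)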
  • the decoder then operates in a manner similar to what has been described in relation to the first embodiment of FIG. 5.
  • FIG. 8 presents the simplified structure of a COD coding device suitable for implementing the coding method according to any one of the particular embodiments of the invention described above, in particular in relation to FIGS. 2, 4A and 4B.
  • the COD encoder may for example correspond to the COD encoder described in relation to FIG. 5.
  • the steps of the coding method are implemented by computer program instructions.
  • the coding device COD has the conventional architecture of a computer and comprises in particular a memory MEM, a processing unit UT, equipped for example with a processor PROC, and controlled by the computer program PG stored in memory MEM.
  • the computer program PG comprises instructions for implementing the steps of the coding method as described above, when the program is executed by the processor PROC.
  • the code instructions of the computer program PG are for example loaded into a memory before being executed by the processor PROC.
  • the processor PROC of the processing unit UT notably implements the steps of the coding method described above, according to the instructions of the computer program PG.
  • FIG. 9 shows the simplified structure of a DTV multi-view video data processing device suitable for implementing the multi-view data processing method according to any one of the particular embodiments of the invention described previously, in particular in relation to FIGS. 2, 3A and 3B.
  • the DTV multi-view video data processing device may for example correspond to the SYNTH synthesis module described in relation to FIG. 5 or to a device comprising the SYNTH synthesis module and the decoder DEC of FIG. 5.
  • the DTV multi-view video data processing device has the conventional architecture of a computer and in particular comprises a memory MEM0, a processing unit UT0, equipped for example with a processor PROC0, and controlled by the computer program PG0 stored in memory MEM0.
  • the computer program PG0 includes instructions for implementing the steps of the multi-view video data processing method as described above, when the program is executed by the processor PROC0.
  • the code instructions of the computer program PG0 are for example loaded into a memory before being executed by the processor PROC0.
  • the processor PROC0 of the processing unit UT0 notably implements the steps of the multi-view video data processing method described above, according to the instructions of the computer program PG0.
  • the DTV multi-view video data processing device comprises a decoder DEC suitable for decoding one or more coded data streams representative of a multi-view video.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
EP21707767.6A 2020-02-14 2021-02-04 Verfahren und vorrichtung zur verarbeitung von daten von mehrfachansichtsvideo Pending EP4104446A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR2001464A FR3107383A1 (fr) 2020-02-14 2020-02-14 Procédé et dispositif de traitement de données de vidéo multi-vues
PCT/FR2021/050207 WO2021160955A1 (fr) 2020-02-14 2021-02-04 Procédé et dispositif de traitement de données de vidéo multi-vues

Publications (1)

Publication Number Publication Date
EP4104446A1 true EP4104446A1 (de) 2022-12-21

Family

ID=70804716

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21707767.6A Pending EP4104446A1 (de) 2020-02-14 2021-02-04 Verfahren und vorrichtung zur verarbeitung von daten von mehrfachansichtsvideo

Country Status (5)

Country Link
US (1) US20230065861A1 (de)
EP (1) EP4104446A1 (de)
CN (1) CN115104312A (de)
FR (1) FR3107383A1 (de)
WO (1) WO2021160955A1 (de)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013159330A1 (en) * 2012-04-27 2013-10-31 Nokia Corporation An apparatus, a method and a computer program for video coding and decoding
BR112015006178B1 (pt) * 2012-09-21 2022-11-16 Nokia Technologies Oy Métodos, aparelhos e meio não transitório legível por computador para codificação e decodificação de vídeo
WO2014056150A1 (en) * 2012-10-09 2014-04-17 Nokia Corporation Method and apparatus for video coding
US9930363B2 (en) * 2013-04-12 2018-03-27 Nokia Technologies Oy Harmonized inter-view and view synthesis prediction for 3D video coding
KR20160001647A (ko) * 2014-06-26 2016-01-06 주식회사 케이티 다시점 비디오 신호 처리 방법 및 장치

Also Published As

Publication number Publication date
FR3107383A1 (fr) 2021-08-20
US20230065861A1 (en) 2023-03-02
CN115104312A (zh) 2022-09-23
WO2021160955A1 (fr) 2021-08-19

Similar Documents

Publication Publication Date Title
EP3878170B1 (de) Ansichtssynthese
EP3788789A2 (de) Verfahren und vorrichtung zur bildverarbeitung und geeignete verfahren und vorrichtung zur decodierung eines mehransichtsvideos
FR2959636A1 (fr) Procede d'acces a une partie spatio-temporelle d'une sequence video d'images
EP3198876B1 (de) Erzeugung und codierung von integralen restbildern
FR2768003A1 (fr) Procede de codage d'un signal de forme binaire
WO2021160955A1 (fr) Procédé et dispositif de traitement de données de vidéo multi-vues
EP2227908B1 (de) Verfahren zur bildsignaldekodierung mit unterschiedlicher komplexität und entsprechendes dekodierendgerät, kodierverfahren, kodiervorrichtung, signal und computersoftwareprodukte
EP3158749B1 (de) Verfahren zur codierung und decodierung von bildern, vorrichtung zur codierung und decodierung von bildern und entsprechende computerprogramme
WO2021214395A1 (fr) Procédés et dispositifs de codage et de décodage d'une séquence vidéo multi-vues
WO2020188172A1 (fr) Procédés et dispositifs de codage et de décodage d'une séquence vidéo multi-vues
FR2934453A1 (fr) Procede et dispositif de masquage d'erreurs
WO2020070409A1 (fr) Codage et décodage d'une vidéo omnidirectionnelle
EP1596607A1 (de) Verfahren und Anordnung zur Erzeugung von Kandidatenvektoren für Bildinterpolierungssysteme, die Bewegungsabschätzung und -kompensation verwenden
WO2019115899A1 (fr) Procédés et dispositifs de codage et de décodage d'une séquence vidéo multi-vues représentative d'une vidéo omnidirectionnelle
WO2020260034A1 (fr) Procede et dispositif de traitement de donnees de video multi-vues
WO2022269163A1 (fr) Procédé de construction d'une image de profondeur d'une vidéo multi-vues, procédé de décodage d'un flux de données représentatif d'une vidéo multi-vues, procédé de codage, dispositifs, système, équipement terminal, signal et programmes d'ordinateur correspondants
EP4222950A1 (de) Verfahren zur codierung und decodierung eines mehrfachansichtsvideos
WO2021136895A1 (fr) Synthese iterative de vues a partir de donnees d'une video multi-vues
Rudolph et al. Learned Compression of Point Cloud Geometry and Attributes in a Single Model through Multimodal Rate-Control
CN117043820A (zh) 沉浸式视频上下文中的深度估计方法
CN117426098A (zh) 用于传输和渲染包括非漫射对象的多个视图的方法、服务器和设备
EP2962459A2 (de) Ableitung eines disparitätsbewegungsvektors sowie 3d-videocodierung und -decodierung mit solch einer ableitung
FR2938146A1 (fr) Procede et dispositif d'optimisation du debit d'encodage d'une image video en mode entrelace.

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220705

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
RAP3 Party data changed (applicant data changed or rights of an application transferred)

Owner name: ORANGE