JP5013993B2 - Method and system for processing multiple multiview videos of a scene - Google Patents


Info

Publication number
JP5013993B2
Authority
JP
Japan
Prior art keywords
view
multi
reference picture
video
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2007173941A
Other languages
Japanese (ja)
Other versions
JP2008022549A (en)
Inventor
アンソニー・ヴェトロ
エミン・マーティニアン
ジョン・デー・オー
セフーン・イェー
セルダール・インセ
Original Assignee
Mitsubishi Electric Research Laboratories, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US11/485,092 (US7728878B2)
Application filed by Mitsubishi Electric Research Laboratories, Inc.
Publication of JP2008022549A
Application granted
Publication of JP5013993B2
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

PROBLEM TO BE SOLVED: To provide a method and system for decomposing multiview videos acquired of a scene by a plurality of cameras.

SOLUTION: Side information for synthesizing a particular view of the multiview video is obtained in either an encoder or a decoder. A synthesized multiview video is synthesized from the multiview videos and the side information. A reference picture list is maintained for each current frame of each of the multiview videos; the list indexes the temporal reference pictures and spatial reference pictures of the acquired multiview videos and the synthesized reference pictures of the synthesized multiview video. Each current frame of the multiview videos is predicted according to the reference pictures indexed by its associated reference picture list.

COPYRIGHT: (C)2008, JPO&INPIT

Description

  The present invention relates generally to multi-view video encoding and decoding, and more particularly to multi-view video synthesis.

  Multiview video encoding and decoding is essential for applications such as 3D television (3DTV), free-viewpoint television (FTV), and multi-camera surveillance. Multi-view video encoding and decoding is also known as dynamic light field compression.

  FIG. 1 shows a prior art "simulcast" system 100 for encoding multi-view videos. Cameras 1-4 acquire frame sequences of a scene 5, i.e., videos 101-104. Each camera has a different view of the scene. Each video is encoded individually (111-114) into corresponding encoded videos 121-124 using conventional 2D video coding techniques. Therefore, the system does not exploit the correlation between the different videos acquired from different viewpoints by the multiple cameras when predicting frames of the encoded videos. Individual encoding reduces the compression efficiency, and thus increases the required network bandwidth and storage.

  FIG. 2 shows a prior art disparity compensated prediction system 200 that does use the correlation between views. Videos 201-204 are encoded (211-214) into encoded videos 231-234. Videos 201 and 204 are encoded individually using a standard video encoder such as MPEG-2 or H.264/AVC (also known as MPEG-4 Part 10). These individually encoded videos serve as "reference" videos. The remaining videos 202 and 203 are encoded using temporal prediction and inter-view prediction based on the reconstructed reference videos 251 and 252 obtained from decoders 221 and 222. Typically, the prediction is determined adaptively for each block (S. C. Chan et al., "The data compression of simplified dynamic light fields," Proc. IEEE Int. Acoustics, Speech, and Signal Processing Conf., April 2003).

  FIG. 3 shows a prior art "lifting-based" wavelet decomposition (see W. Sweldens, "The lifting scheme: A custom-design construction of biorthogonal wavelets," J. Appl. Comp. Harm. Anal., vol. 3, no. 2, pp. 186-200, 1996). Wavelet decomposition is an effective technique for compressing static light fields. The input samples 301 are split (310) into odd samples 302 and even samples 303. The odd samples are predicted (320) from the even samples. The prediction errors form the high-pass samples 304. The high-pass samples are used to update (330) the even samples and form the low-pass samples 305. Because the decomposition is invertible, linear or non-linear operations can be incorporated into the prediction and update steps.
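
  As a rough illustration of the split/predict/update structure described above, the following sketch applies a Haar-kernel lifting step to a list of samples. It is purely illustrative plain Python, with no motion or disparity alignment, and is not the codec's actual filter.

# Illustrative lifting decomposition with a Haar kernel (integer, invertible).
def lifting_decompose(samples):
    even = samples[0::2]                               # split (310)
    odd = samples[1::2]
    high = [o - e for o, e in zip(odd, even)]          # predict odd from even (320)
    low = [e + h // 2 for e, h in zip(even, high)]     # update even with high-pass (330)
    return low, high

def lifting_reconstruct(low, high):
    even = [l - h // 2 for l, h in zip(low, high)]     # invert the update step
    odd = [h + e for h, e in zip(high, even)]          # invert the prediction step
    samples = []
    for e, o in zip(even, odd):
        samples += [e, o]
    return samples

# Round-trip check: the decomposition is reversible.
assert lifting_reconstruct(*lifting_decompose([10, 12, 11, 9])) == [10, 12, 11, 9]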

  The lifting scheme enables a motion compensated temporal transform, i.e., motion compensated temporal filtering (MCTF), which for video essentially filters along temporal motion trajectories. For a review of MCTF for video coding, see Ohm et al., "Interframe wavelet coding—motion picture representation for universal scalability," Signal Processing: Image Communication, vol. 19, no. 9, pp. 877-908, October 2004. The lifting scheme can be based on any wavelet kernel, such as Haar or 5/3 Daubechies, and any motion model, such as block-based translation or affine global motion, without affecting the reconstruction.

  For encoding, MCTF decomposes the video into high-frequency frames and low-frequency frames. These frames are then subjected to a spatial transform to reduce any remaining spatial correlation. The transformed low- and high-frequency frames, along with the associated motion information, are entropy encoded to form an encoded bitstream. MCTF can be implemented with the lifting scheme shown in FIG. 3, using temporally adjacent videos as input. MCTF can also be applied iteratively to the output low-frequency frames.

  The compression efficiency of MCTF-based video coding is comparable to that of video compression standards such as H.264/AVC. In addition, the videos have inherent temporal scalability. However, this method cannot be used to directly encode multi-view videos in which there is correlation between the videos acquired from the multiple views, because there is no efficient view prediction method that accounts for the temporal correlation.

  Lifting schemes have also been used to encode static light fields, i.e., single multiview images. Instead of performing motion-compensated temporal filtering, the encoder performs disparity-compensated inter-view filtering (DCVF) across the static views in the spatial domain (see Chang et al., "Inter-view wavelet compression of light fields with disparity compensated lifting," SPIE Conf. on Visual Communications and Image Processing, 2003). For encoding, DCVF decomposes the static light field into high-pass images and low-pass images, which are then spatially transformed to reduce any remaining spatial correlation. The transformed images, along with the associated disparity information, are entropy encoded to form an encoded bitstream. DCVF is typically implemented with the lifting-based wavelet transform shown in FIG. 3, using images acquired from spatially adjacent camera views as input. DCVF can also be applied iteratively to the output low-pass images. DCVF-based static light field compression provides better compression efficiency than encoding the multiple frames individually. However, this method also cannot encode multi-view videos that have both temporal and spatial correlation between views, because there is no efficient view prediction method that accounts for the temporal correlation.

  A method and system for decomposing multi-view video acquired for a scene by multiple cameras is presented.

  Each multi-view video includes a frame sequence, and each camera provides a different view of the scene.

  One prediction mode is selected from the temporal prediction mode, the spatial prediction mode, the view synthesis prediction mode, and the intra prediction mode.

  Next, the multi-view video is decomposed into low-frequency frames, high-frequency frames, and side information according to the selected prediction mode.

  A new video reflecting a composite view of the scene can also be generated from one or more of the multi-view videos.

  In particular, one embodiment of the present invention provides a system and method for encoding and decoding videos. A plurality of multi-view videos are acquired of a scene by a corresponding plurality of cameras arranged such that the views of any pair of cameras overlap. A synthesized multi-view video is generated from the acquired multi-view videos for a virtual camera. A reference picture list is maintained in memory for each current frame of the multi-view videos and the synthesized video. The reference picture list indexes the temporal reference pictures and spatial reference pictures of the acquired multi-view videos and the synthesized reference pictures of the synthesized multi-view video. Each current frame of the multi-view videos is then predicted, during both encoding and decoding, according to the reference pictures indexed by the associated reference picture list.

  One embodiment of the present invention provides a joint temporal/inter-view processing method for encoding and decoding frames of multi-view videos. Multi-view videos are videos acquired of a scene by a plurality of cameras having different poses. In the present invention, a camera pose is defined as both its 3D (x, y, z) position and its 3D (θ, ρ, φ) orientation. Each pose corresponds to a "view" of the scene.

  The method uses the temporal correlation between frames within the same video acquired for a particular camera pose, as well as the spatial correlation between synchronized frames in the different videos acquired from the multiple camera views. In addition, "synthesized" frames can be correlated, as described below.

  In one embodiment, temporal correlation uses motion compensated temporal filtering (MCTF) and spatial correlation uses disparity compensated inter-view filtering (DCVF).

  In another embodiment of the present invention, the spatial correlation uses prediction of one view from synthesized frames that are generated from "neighboring" frames. Neighboring frames are temporally or spatially adjacent frames, for example, frames acquired before or after the current frame in the time domain, or frames acquired at the same time but from cameras having different poses, i.e., different views of the scene.

  Each frame of each video contains macroblocks of pixels. Therefore, the method of encoding and decoding multiview videos according to an embodiment of the invention is macroblock adaptive. The encoding and decoding of a current macroblock in a current frame is performed using several possible prediction modes, including various forms of temporal, spatial, view synthesis, and intra prediction. To determine the best prediction mode for each macroblock, an embodiment of the invention provides a method for selecting the prediction mode. The method can be used for any number of camera arrangements.

  As used herein, a reference picture is defined as any frame used to “predict” the current frame during encoding and decoding. Typically, the reference picture is spatially or temporally adjacent to the current frame, i.e. "near".

  It is important to note that the same operations apply in both the encoder and the decoder, because the same set of reference pictures is used to encode and decode the current frame at any given time.

MCTF/DCVF Decomposition

  FIG. 4 shows an MCTF/DCVF decomposition 400 according to one embodiment of the present invention. Frames of input videos 401-404 are acquired of a scene 5 by cameras 1-4 having different poses. As shown in FIG. 8, some of the cameras, e.g., 1a and 1b, can be at the same position but with different orientations. It is assumed that there is some amount of view overlap between any pair of cameras. The camera poses can change while the multi-view videos are acquired. Typically, the cameras are synchronized with each other. Each input video provides a different "view" of the scene. The input frames 401-404 are sent to the MCTF/DCVF decomposition 400. The decomposition produces encoded low-frequency frames 411, encoded high-frequency frames 412, and the associated side information 413. The high-frequency frames encode prediction errors using the low-frequency frames as reference pictures. The decomposition is performed according to a selected prediction mode 410. The prediction modes include spatial, temporal, view synthesis, and intra prediction modes. The prediction mode can be selected adaptively for each macroblock of each current frame. With intra prediction, the current macroblock is predicted from other macroblocks within the same frame.

  FIG. 5 shows a preferred alternating "lattice pattern" of the low-frequency frames (L) 411 and high-frequency frames (H) 412 for the neighboring frames 510. The frames have a spatial (view) dimension 501 and a temporal dimension 502. In essence, the pattern alternates low- and high-frequency frames across the spatial dimension at each time instant, and also alternates them over time within each video.

  This lattice pattern has several advantages. By distributing the low-frequency frames uniformly in both the spatial and temporal dimensions, the pattern provides spatial and temporal scalability when a decoder reconstructs only the low-frequency frames. The pattern also aligns the high-frequency frames with adjacent low-frequency frames in both the spatial and temporal dimensions. This maximizes the correlation between the reference pictures from which the prediction errors of the current frame are made, as shown in FIG. 6.

  According to the lifting-based wavelet transform, a high frequency frame 412 is generated by predicting one sample set from the other sample set. This prediction can be achieved using several modes including various forms of temporal prediction, various forms of spatial prediction, and view synthesis prediction according to embodiments of the invention described below.

  The means by which the high-frequency frames 412 are predicted, and the information required to perform the prediction, are referred to as the side information 413. When temporal prediction is performed, the temporal mode is signaled as part of the side information along with the corresponding motion information. When spatial prediction is performed, the spatial mode is signaled as part of the side information along with the corresponding disparity information. When view synthesis prediction is performed, the view synthesis mode is signaled as part of the side information along with the corresponding disparity, motion, and depth information.
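
  For illustration only, the per-macroblock side information described above could be pictured as a record like the following; the field names are assumptions, not syntax from any standard or from the patent.

from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical per-macroblock side information record (field names are illustrative).
@dataclass
class MacroblockSideInfo:
    mode: str                                    # "temporal", "spatial", "view_synthesis" or "intra"
    motion: Optional[Tuple[int, int]] = None     # motion information (temporal and view synthesis modes)
    disparity: Optional[Tuple[int, int]] = None  # disparity information (spatial and view synthesis modes)
    depth: Optional[int] = None                  # depth information (view synthesis mode)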

  As shown in FIG. 6, the prediction of each current frame 600 uses both spatial and temporal neighborhood frames 510. A frame used to predict the current frame is called a reference picture. Reference pictures are held in a reference list that is part of the encoded bitstream. The reference picture is stored in the decoded picture buffer.

  In one embodiment of the present invention, the MCTF and DCVF are applied adaptively to each current macroblock of each frame of the input videos to produce decomposed low-frequency frames, high-frequency frames, and the associated side information. Thus, each macroblock is processed adaptively according to a "best" prediction mode. An optimal method for selecting the prediction mode is described below.

  In one embodiment of the invention, the MCTF is first applied individually to the frames of each video. The resulting frames are then further decomposed with the DCVF. In addition to the final decomposed frames, the corresponding side information is also generated. When performed for each macroblock, the prediction mode selections for the MCTF and the DCVF are considered separately. As an advantage, this prediction mode selection inherently supports temporal scalability: lower temporal rates of the videos are easily accessible in the compressed bitstream.

  In another embodiment, the DCVF is first applied to the frames of the input videos. The resulting frames are then temporally decomposed with the MCTF. In addition to the final decomposed frames, side information is also generated. When performed for each macroblock, the prediction mode selections for the MCTF and the DCVF are considered separately. As an advantage, this selection inherently supports spatial scalability: a reduced number of the views is easily accessible in the compressed bitstream.

  The decomposition described above can be applied iteratively to the resulting set of low-pass frames from the previous decomposition stage. As an advantage, the MCTF / DCVF decomposition 400 of the present invention can effectively remove both temporal and spatial (inter-view) correlations and achieve very high compression efficiency. The compression efficiency of the multi-view video encoder of the present invention is superior to conventional simulcast encoding that encodes each video of each view individually.

Encoding of the MCTF/DCVF Decomposition

  As shown in FIG. 7, the outputs 411 and 412 of the decomposition 400 are supplied to a signal encoder 710, and the output 413 is supplied to a side information encoder 720. The signal encoder 710 performs a transform, quantization, and entropy coding to remove any correlation remaining in the decomposed low-frequency frames 411 and high-frequency frames 412. Such operations are well known in the art (Netravali and Haskell, "Digital Pictures: Representation, Compression and Standards," Second Edition, Plenum Press, 1995).

  The side information encoder 720 encodes the side information 413 generated by the decomposition 400. The side information 413 includes motion information corresponding to temporal prediction, disparity information corresponding to spatial prediction, and view synthesis information and depth information corresponding to view synthesis prediction, in addition to the prediction mode and the reference picture list.

  The encoding of the side information can be achieved by known and established techniques, for example, according to the MPEG-4 Visual standard ISO/IEC 14496-2, "Information technology—Coding of audio-visual objects—Part 2: Visual" (2nd edition, 2001), or the more recent H.264/AVC standard, ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services" (2004).

  For example, the motion vector of a macroblock is typically encoded using a prediction method that obtains a prediction vector from the vectors of macroblocks in a reference picture. The difference between the prediction vector and the current vector is then subjected to an entropy coding process, which typically uses the statistics of the prediction errors. A similar procedure can be used to encode disparity vectors.
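
  A minimal sketch of this style of predictive vector coding follows. The median predictor over causal neighbors used here is a common choice and an assumption, not necessarily the exact predictor referred to above; the returned difference is what would then be entropy coded.

def predict_vector(neighbors):
    """Component-wise median of previously coded neighboring vectors (assumed predictor)."""
    xs = sorted(v[0] for v in neighbors)
    ys = sorted(v[1] for v in neighbors)
    return xs[len(xs) // 2], ys[len(ys) // 2]

def vector_difference(current, neighbors):
    """Difference between the current vector and its prediction; this residual is entropy coded."""
    px, py = predict_vector(neighbors)
    return current[0] - px, current[1] - py

# Example: three neighboring motion vectors and the current motion vector.
diff = vector_difference((5, -2), [(4, -1), (6, -2), (3, 0)])   # -> (1, -1)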

  In addition, the depth information for each macroblock can be encoded using a predictive coding method in which a prediction value is obtained from macroblocks in the reference picture, or simply by representing the depth value directly using a fixed-length code. If pixel-level depth accuracy is extracted and compressed, texture coding techniques that apply transform, quantization, and entropy coding can be applied.

  The encoded signals 711-713 from the signal encoder 710 and side information encoder 720 can be multiplexed (730) to generate an encoded output bitstream 731.

Decoding of the MCTF/DCVF Decomposition

  The bitstream 731 can be decoded (740) to produce output multiview videos 741 corresponding to the input multiview videos 401-404. Optionally, a synthesized video can also be generated. Generally, the decoder performs the inverse operations of the encoder to reconstruct the multiview videos. Once all the low-frequency and high-frequency frames are decoded, a complete set of frames at the full coding quality is reconstructed and available in both the spatial (view) and temporal dimensions.

  Depending on the number of iteration levels of decomposition applied at the encoder and what type of decomposition is applied, a smaller number of videos and / or lower time rates can be decoded as shown in FIG.

View Synthesis

  As shown in FIG. 8, view synthesis is a process by which a frame 801 of a synthesized video is generated from frames 803 of one or more actual multi-view videos. In other words, view synthesis provides a means to synthesize a frame 801 corresponding to a selected novel view 802 of the scene 5. The novel view 802 may correspond to a "virtual" camera 800 that is not present at the time the input multi-view videos 401-404 are acquired, or it may correspond to a camera view that is acquired, in which case the synthesized view is used for prediction and encoding/decoding as described below.

  When one video is used, the synthesis is based on extrapolation or warping; when multiple videos are used, the synthesis is based on interpolation.

  Given the pixel values of the frames 803 of one or more multi-view videos and the depth values of points in the scene, the pixels in the frame 801 of the synthesized view 802 can be synthesized from the corresponding pixel values in the frames 803.

  View synthesis is commonly used in computer graphics for rendering still images for multiple views (see Buehler et al., "Unstructured Lumigraph Rendering," Proc. ACM SIGGRAPH, 2001, incorporated herein by reference). That method requires extrinsic and intrinsic parameters for the cameras.

  View synthesis for compressing multi-view videos is novel. In one embodiment of the invention, synthesized frames are generated to be used for predicting the current frame. In one embodiment of the invention, synthesized frames are generated for designated high-frequency frames. In another embodiment of the invention, synthesized frames are generated for specific views. The synthesized frames serve as reference pictures from which the current frame can be predicted.

  One problem with this approach is that the depth values of the scene 5 are unknown. Therefore, in the present invention, the depth values are estimated using known techniques, for example, based on correspondences of features in the multi-view videos.

  Alternatively, for each synthesized video, the present invention generates multiple synthesized frames, each corresponding to a candidate depth value. For each macroblock in the current frame, the best matching macroblock in the set of synthesized frames is determined. The synthesized frame in which the best match is found indicates the depth value of that macroblock in the current frame. This process is repeated for each macroblock in the current frame.
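
  A sketch of this candidate-depth search, assuming the candidate synthesized frames have already been generated and are available as numpy arrays keyed by depth; the function and frame layout are illustrative assumptions.

import numpy as np

def best_depth_for_block(current_blk, synthesized_frames, y, x):
    """Pick the candidate depth whose synthesized frame best matches the current macroblock.

    synthesized_frames: dict mapping a candidate depth value to a synthesized frame (2D array).
    (y, x): top-left corner of the macroblock; current_blk: the block to match (e.g., 16x16)."""
    best_depth, best_sad = None, float("inf")
    h, w = current_blk.shape
    for depth, frame in synthesized_frames.items():
        candidate = frame[y:y + h, x:x + w]
        sad = np.abs(current_blk.astype(int) - candidate.astype(int)).sum()
        if sad < best_sad:
            best_depth, best_sad = depth, sad
    return best_depth, best_sad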

  The difference between the current macroblock and the synthesized block is encoded and compressed by the signal encoder 710. The side information for this multi-view mode is encoded by the side information encoder 720. The side information includes a signal indicating the view synthesis prediction mode, the depth value of the macroblock, and an optional displacement vector that compensates for the positional shift between the macroblock in the current frame and the best matching macroblock in the synthesized frame.

Prediction Mode Selection

  In the macroblock-adaptive MCTF/DCVF decomposition, the prediction mode m of each macroblock can be selected by minimizing a cost function adaptively for each macroblock:

m* = arg min_m J(m), where J(m) = D(m) + λR(m),

and where D is the distortion, λ is a weighting parameter, R is the rate, m indicates the set of candidate prediction modes, and m* indicates the optimal prediction mode selected based on the minimum cost criterion.

  Candidate mode m includes various temporal prediction modes, spatial prediction modes, view synthesis prediction modes, and intra prediction modes. The cost function J (m) depends on the rate and distortion that result from encoding a macroblock with a specific prediction mode m.

  Distortion D measures the difference between the reconstructed macroblock and the original macroblock. A reconstructed macroblock is obtained by encoding and decoding the macroblock with a given prediction mode m. A common distortion measure is the sum of squared differences. The rate R corresponds to the number of bits required to encode the macroblock, including prediction error and side information. The weight parameter λ controls the macroblock coding rate-distortion tradeoff and can be derived from the quantization step size.
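
  The selection therefore amounts to evaluating J(m) = D(m) + λR(m) for each candidate mode and keeping the minimum; a minimal sketch with made-up distortion and rate numbers:

def rd_cost(distortion, rate_bits, lam):
    # J(m) = D(m) + lambda * R(m)
    return distortion + lam * rate_bits

def select_prediction_mode(candidates, lam):
    """candidates: iterable of (mode_name, distortion, rate_bits) triples, e.g. the
    temporal, spatial, view-synthesis and intra hypotheses for one macroblock."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))[0]

# Example with illustrative numbers: distortion as sum of squared differences, rate in bits.
best = select_prediction_mode(
    [("temporal", 1200, 48), ("spatial", 1500, 40),
     ("view_synthesis", 1100, 60), ("intra", 2500, 30)],
    lam=10.0)
# best == "temporal": 1200 + 10.0 * 48 = 1680 is the smallest cost.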

  Detailed aspects of the encoding and decoding processes are described in further detail below. In particular, the various data structures used by the encoding and decoding processes are described. It should be understood that the data structure used in the encoder, as described herein, is the same as the corresponding data structure used in the decoder. It should also be understood that the processing steps of the decoder follow essentially the same processing steps as the encoder, but in the reverse order.

Reference Picture Management

  FIG. 9 shows reference picture management for a prior art single-view encoding and decoding system. Temporal reference pictures 901 are managed by a single-view reference picture list (RPL) manager 910, which determines the insertion (920) and deletion (930) of the temporal reference pictures 901 in a decoded picture buffer (DPB) 940. A reference picture list 950 is also maintained to indicate the frames that are stored in the DPB 940. The RPL is used for reference picture management operations, such as insertion (920) and deletion (930), as well as for temporal prediction 960, in both the encoder and the decoder.

  In a single-view encoder, the temporal reference picture 901 is generated as a result of applying a set of typical encoding operations, including prediction, transform, and quantization, followed by their inverse operations, including inverse quantization, inverse transform, and motion compensation. Furthermore, the temporal reference picture 901 is inserted into the DPB 940 and added to the RPL 950 only when the temporal picture is needed for the prediction of the current frame at the encoder.

  In a single view decoder, the same temporal reference picture 901 is generated by applying a set of normal decoding operations, including inverse quantization, inverse transform and motion compensation, to the bitstream. Similar to the encoder, the temporal reference picture 901 is inserted into the DPB 940 and added to the RPL 950 only when necessary for prediction of the current frame at the decoder (920).

  FIG. 10 shows reference picture management for multiview encoding and decoding. In addition to the temporal reference picture 1003, the multi-view system also includes a spatial reference picture 1001 and a synthesized reference picture 1002. These reference pictures are collectively referred to as a multi-view reference picture 1005. These multiview reference pictures 1005 are managed by a multiview RPL manager 1010 that determines insertion (1020) and deletion (1030) of the multiview reference picture 1005 in the multiview DPB 1040. For each video, a multi-view reference picture list (RPL) 1050 is also maintained to indicate the frames stored in the DPB. That is, RPL is an index of DPB. The multi-view RPL is used for reference picture management operations such as insertion (1020) and deletion (1030), and prediction 1060 of the current frame.

  Note that multi-view system prediction 1060 is different from single-view system prediction 960 because it allows prediction from different types of multi-view reference pictures 1005. Further details regarding the multi-view reference picture management 1010 will be described later.

Multiview Reference Picture List Manager

  A set of multiview reference pictures 1005 can be indicated in the multiview RPL 1050 before the current frame is encoded at the encoder or before the current frame is decoded at the decoder. As defined conventionally and herein, a set may have no elements (the empty set) or may have one or more elements. An identical copy of the RPL is maintained by both the encoder and the decoder for each current frame.

  All frames inserted into the multi-view RPL 1050 are initialized and marked as usable for prediction using the appropriate syntax; according to the H.264/AVC standard and reference software, the "used_for_reference" flag is set to "1". Generally, reference pictures are initialized so that the frames can be used for prediction in a video coding system. To maintain compatibility with conventional single-view video compression standards such as H.264/AVC, a picture order count (POC) is assigned to each reference picture. Typically, for single-view encoding and decoding systems, the POC corresponds to the temporal ordering of the pictures, e.g., the frame number. For multiview encoding and decoding systems, temporal order alone is not sufficient to assign a POC to each reference picture. Therefore, in the present invention, a unique POC is determined for every multi-view reference picture according to a certain rule. One rule is to assign POCs to the temporal reference pictures in temporal order, and then to reserve a sequence of very high POC numbers, e.g., 10000-10100, for the spatial and synthesized reference pictures. Other POC assignment rules, or simply "ordering" rules, are described in further detail below.
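
  One way to picture such an ordering rule in code follows; the reserved range 10000-10100 comes from the example above, while the dictionary layout and the function itself are illustrative assumptions.

def assign_pocs(temporal_refs, spatial_refs, synthesized_refs):
    """Assign a unique POC to every multi-view reference picture.

    Temporal references get POCs in temporal order; spatial and synthesized
    references are pushed into a reserved high range so they never collide."""
    poc = {}
    for i, ref in enumerate(sorted(temporal_refs, key=lambda r: r["time"])):
        poc[ref["id"]] = i
    reserved = 10000
    for ref in spatial_refs + synthesized_refs:
        poc[ref["id"]] = reserved
        reserved += 1
    assert reserved <= 10100, "reserved POC range exhausted"
    return poc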

  All frames used as multi-view reference frames are held in the RPL and stored in the DPB so that these frames are treated as conventional reference pictures by the encoder 700 or the decoder 740. Thereby, the encoding process and the decoding process can be made conventional. Further details regarding the storage of multi-view reference pictures will be described later. For each current frame to be predicted, the RPL and DPB are updated correspondingly.

Multiview Rule Definition and Signaling

  The process of maintaining the RPL is coordinated between the encoder 700 and the decoder 740. In particular, the encoder and the decoder maintain identical copies of the multiview reference picture list when predicting a particular current frame.

  Several rules for maintaining the multiview reference picture list are possible. Therefore, the specific rule that is used is either inserted into the bitstream 731 or provided as sequence-level side information, e.g., configuration information communicated to the decoder. Furthermore, these rules accommodate different prediction structures, such as 1D arrays, 2D arrays, arcs, and crosses, as well as sequences synthesized using view interpolation or warping techniques.

  For example, the composite frame is generated by warping a corresponding frame of one of the multi-view videos acquired by the camera. Alternatively, a conventional model of the scene can be used during synthesis. In other embodiments of the invention, several multi-view reference picture retention rules are defined that depend on the view type, insertion order, and camera characteristics.

  The view type indicates whether the reference picture is a frame from a video other than the video of the current frame, whether the reference picture is synthesized from other frames, or whether the reference picture depends on other reference pictures. For example, synthesized reference pictures can be maintained differently from reference pictures from the same video as the current frame, or from reference pictures from spatially adjacent videos.

  The insertion order indicates how the reference pictures are ordered in the RPL. For example, a reference picture in the same video as the current frame can be given a lower order value than a reference picture in a video acquired from an adjacent view. In that case, the former reference picture is placed earlier in the multi-view RPL.

  The camera characteristics indicate the characteristics of the camera used to acquire a reference picture, or of the virtual camera used to generate a synthesized reference picture. These characteristics include the translation and rotation relative to a fixed coordinate system, i.e., the camera "pose", intrinsic parameters describing how 3D points are projected onto the 2D image, lens distortion, color calibration information, illumination levels, etc. For example, based on the camera characteristics, the proximity of a particular camera to adjacent cameras can be determined automatically, and only the videos acquired by the adjacent cameras are considered as part of a particular RPL.

  As shown in FIG. 11, according to an embodiment of the present invention, one part 1101 of each reference picture list is reserved for the temporal reference pictures 1003, another part 1102 is reserved for the synthesized reference pictures 1002, and a third part is reserved for the spatial reference pictures 1001. This is an example of a rule that depends only on the view type. The number of frames contained in each part can vary based on the prediction dependencies of the current frame being encoded or decoded.
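
  A sketch of a view-type rule of this kind, building the list as three concatenated parts; the part sizes are parameters here because, as noted above, they depend on the prediction dependencies of the current frame.

def build_reference_picture_list(temporal_refs, synthesized_refs, spatial_refs,
                                 max_temporal, max_synth, max_spatial):
    """Order the RPL by view type: a temporal part, then a synthesized part, then a spatial part."""
    rpl = []
    rpl += temporal_refs[:max_temporal]      # part reserved for temporal reference pictures
    rpl += synthesized_refs[:max_synth]      # part reserved for synthesized reference pictures
    rpl += spatial_refs[:max_spatial]        # part reserved for spatial reference pictures
    return rpl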

  Specific retention rules can be defined by standards, explicit rules or implicit rules, or can be defined as side information in the encoded bitstream.

Storing Pictures in the DPB

  The multi-view RPL manager 1010 maintains the RPL so that the order in which the multi-view reference pictures are stored in the DPB corresponds to the "usefulness" of those pictures for increasing the efficiency of the encoding and decoding. Specifically, reference pictures at the beginning of the RPL can be predictively encoded with fewer bits than reference pictures at the end of the RPL.

  As shown in FIG. 12, the optimization of the order in which the multi-view reference pictures are held in the RPL can have a great influence on the coding efficiency. For example, following the POC assignment described above for initialization, a multi-view reference picture may be assigned a very large POC value. This is because multi-view reference pictures do not occur with normal temporal ordering of video sequences. Thus, the default ordering process for most video codecs can place such multi-view reference pictures earlier in the reference picture list.

  The default ordering is undesirable because temporal reference pictures from the same sequence usually exhibit stronger correlation than spatial reference pictures from other sequences. Therefore, the multi-view reference pictures are either explicitly reordered by the encoder, which then signals the reordering to the decoder, or the encoder and decoder implicitly reorder the multi-view reference pictures according to a predetermined rule.

  As shown in FIG. 13, the order of reference pictures is facilitated by the view mode 1300 for each reference picture. Note that view mode 1300 also affects multi-view prediction process 1060. One embodiment of the present invention uses three different types of view modes, I view, P view, and B view, described in more detail below.

  Before describing the detailed operation of the multiview reference picture management, prior art reference picture management for a single-view encoding and decoding system is shown in FIG. 14. Only temporal reference pictures 901 are used for the temporal prediction 960. The temporal prediction dependencies between the temporal reference pictures of the video are shown in acquisition or display order 1401. The reference pictures are rearranged (1410) into a coding order 1402, in which each reference picture is encoded or decoded at a time t0 to t6. Block 1420 shows the ordering of the reference pictures for each time. At time t0, when the intra frame I0 is encoded or decoded, there are no temporal reference pictures used for temporal prediction, so the DPB/RPL is empty. At time t1, when the uni-directional inter frame P1 is encoded or decoded, the frame I0 is available as a temporal reference picture. At times t2 and t3, both frames I0 and P1 are available as reference frames for the bi-directional temporal prediction of the inter frames B1 and B2. The temporal reference pictures and the DPB/RPL are managed similarly for future pictures.

  To illustrate the multi-view case according to one embodiment of the present invention, consider the three different view types described above and shown in FIG. 15, namely the I view, the P view, and the B view. The multi-view prediction dependencies between the reference pictures of the videos are shown in display order 1501. As shown in FIG. 15, the reference pictures of the videos are rearranged (1510) into a coding order 1502 according to the view mode, and each reference picture is encoded or decoded at a given time, indicated by t0 to t2, in this coding order 1502. The order of the multi-view reference pictures is shown in block 1520 for each time instant.

  The I view is the simplest mode and enables the more complex modes. I views use conventional encoding and prediction modes, without any spatial or synthesized prediction. For example, an I view can be encoded using conventional H.264/AVC techniques without any multi-view extensions. When spatial reference pictures from an I-view sequence are placed in the reference lists of other views, these spatial reference pictures are usually placed after the temporal reference pictures.

  As shown in FIG. 15 for the I view, when the frame I0 is encoded or decoded at t0, there are no multi-view reference pictures used for prediction, so the DPB/RPL is empty. At time t1, when the frame P0 is encoded or decoded, I0 is available as a temporal reference picture. At time t2, when the frame B0 is encoded or decoded, both frames I0 and P0 are available as temporal reference pictures.

  P views are more complex than I views in that they allow prediction from another view in order to exploit the spatial correlation between views. Specifically, sequences encoded using the P-view mode use multi-view reference pictures from another I view or P view. Synthesized reference pictures can also be used in P views. When multi-view reference pictures from a P view are placed in the reference lists of other views, they are placed after both the temporal reference pictures and the multi-view reference pictures derived from I views.

  As shown in FIG. 15 for the P view, when the frame I2 is encoded or decoded at t0, the synthesized reference picture S20 and the spatial reference picture I0 are available for prediction. Further details on the generation of synthesized pictures are described below. At time t1, when P2 is encoded or decoded, I2 is available as a temporal reference picture, along with the synthesized reference picture S21 and the spatial reference picture P0 from the I view. At time t2, there are two temporal reference pictures I2 and P2, as well as the synthesized reference picture S22 and the spatial reference picture B0, from which the prediction can be made.

  B views are similar to P views in that they use multi-view reference pictures. One important difference between P views and B views is that P views use reference pictures from the view itself and from one other view, whereas B views can reference pictures from multiple views. When synthesized reference pictures are used, they are placed before the spatial reference pictures, because the synthesized views usually have stronger correlation than the spatial references.

  As shown in FIG. 15 for the B view, when I1 is encoded or decoded at t0, the synthesized reference picture S10 and the spatial reference pictures I0 and I2 are available for prediction. At time t1, when P1 is encoded or decoded, the temporal reference picture I1, the synthesized reference picture S11, and the spatial reference pictures P0 and P2 from the I view and the P view, respectively, are available. At time t2, there are two temporal reference pictures I1 and P1, as well as the synthesized reference picture S12 and the spatial reference pictures B0 and B2, from which the prediction can be made.

  It is emphasized that the example shown in FIG. 15 is only related to one embodiment of the present invention. Many different types of predictive dependencies are supported. As an example, the spatial reference picture is not limited to pictures in different views at the same time. Spatial reference pictures can also include reference pictures for different views at different times. In addition, the number of bidirectional prediction pictures and unidirectional prediction inter pictures between intra pictures may vary. Similarly, the configuration of I view, P view, and B view may also change. In addition, several synthetic reference pictures, each generated using different picture sets or different depth maps or processes, may be available.

Compatibility

  One important advantage of the multiview picture management according to embodiments of the present invention is its compatibility with existing single-view video coding systems and designs. Not only does the multiview picture management require minimal changes to existing single-view video coding standards, it also allows software and hardware from existing single-view video coding systems to be reused, as described herein.

  The reason is that most conventional video encoding systems communicate the encoding parameters to the decoder in the compressed bitstream. The syntax for conveying such parameters is therefore defined by existing video coding standards, such as the H.264/AVC standard. For example, the video coding standard specifies the prediction modes for a given macroblock in the current frame from other temporally related reference pictures. The standard also specifies the methods used to encode and decode the resulting prediction errors. Other parameters specify the type or size of the transform, the quantization method, and the entropy coding method.

  Thus, the multi-view reference pictures of the present invention can be implemented with only a limited number of modifications to standard encoding and decoding components, such as the reference picture lists, decoded picture buffers, and prediction structures of existing systems. Note that the macroblock structure, transforms, quantization, and entropy coding remain unchanged.

View Synthesis

  As described above with respect to FIG. 8, view synthesis is the process of generating frames 801 corresponding to a synthesized view 802 of a virtual camera 800 from frames 803 acquired of existing videos. In other words, view synthesis provides a means to synthesize frames corresponding to a selected novel view of the scene by a virtual camera not present at the time the input videos were acquired. Given the pixel values of the frames of one or more actual videos and the depth values of points in the scene, the pixels in the frames of the synthesized video view can be generated by extrapolation and/or interpolation.

Prediction from Synthesized Views

  FIG. 16 shows the process of generating a reconstructed macroblock using the view synthesis mode when the depth information 1901 is included in the encoded multi-view bitstream 731. The depth for a given macroblock is decoded by a side information decoder 1910. View synthesis 1920 is performed using the depth 1901 and the spatial reference pictures 1902 to generate a synthesized macroblock 1904. The reconstructed macroblock 1903 is then formed by adding (1930) the synthesized macroblock 1904 and the decoded residual macroblock 1905.

Details of Multiview Mode Selection at the Encoder

  FIG. 17 shows the process of selecting the prediction mode while encoding or decoding the current frame. Motion estimation 2010 is performed for the current macroblock 2011 using the temporal reference pictures 2020. The resulting motion vectors 2021 are used to determine a first coding cost, cost1 2031, using temporal prediction (2030). The prediction mode associated with this process is m1.

  Disparity estimation 2040 is performed for the current macroblock using the spatial reference pictures 2041. The resulting disparity vectors 2042 are used to determine a second coding cost, cost2 2051, using spatial prediction (2050). The prediction mode associated with this process is m2.

  Depth estimation 2060 is performed for the current macroblock based on the spatial reference pictures 2041. View synthesis is performed based on the estimated depth. Using the depth information 2061 and the synthesized view 2062, a third coding cost, cost3 2071, using view synthesis prediction is determined (2070). The prediction mode associated with this process is m3.

  A fourth coding cost, cost4 2081, using intra prediction is determined using the adjacent pixels 2082 of the current macroblock (2080). The prediction mode associated with this process is m4.

  The minimum cost among cost1, cost2, cost3, and cost4 is determined (2090), and the mode having the minimum cost among m1, m2, m3, and m4 is selected as the best prediction mode 2091 for the current macroblock 2011.

View Synthesis Using Depth Estimation

  In the view synthesis mode 2091, the depth information and displacement vectors for the synthesized views can be estimated from decoded frames of one or more of the multiview videos. The depth information may be per-pixel depth estimated from stereoscopic cameras or per-macroblock depth estimated from macroblock matching, depending on the process applied.

  The advantage of this approach is that as long as the encoder has access to the same depth and displacement information as the decoder, the depth value and displacement vector are not needed in the bitstream, thus reducing the bandwidth. The encoder can accomplish this as long as the decoder uses exactly the same depth and displacement estimation process as the encoder. Thus, in this embodiment of the invention, the difference between the current macroblock and the synthesized macroblock is encoded by the encoder.

  Side information in this mode is encoded by a side information encoder 720. The side information includes a signal indicating the view synthesis mode and reference view (s). The side information can also include depth and displacement correction information, which is the difference between the depth and displacement used for view synthesis by the encoder and the value estimated by the decoder.

  FIG. 18 shows the decoding process of a macroblock using the view synthesis mode when depth information is estimated or inferred at the decoder and is not conveyed in the encoded multi-view bitstream. The depth 2101 is estimated from the spatial reference picture 2102 (2110). Next, view synthesis 2120 is performed using the estimated depth and spatial reference pictures, and a synthesis macroblock 2121 is generated. A reconstructed macroblock 2103 is formed by addition 2130 of the synthesized macroblock and the decoded residual macroblock 2104.

Spatial Random Access

  To provide random access to frames in conventional video, intra frames, also known as I-frames, are usually inserted at intervals throughout the video. This allows the decoder to access any frame in the decoded sequence, although at a reduced compression efficiency.

  For the multiview encoding and decoding system of the present invention, a new type of frame, referred to herein as a "V-frame", is provided to enable random access and to increase the compression efficiency. A V-frame is similar to an I-frame in the sense that it is encoded without any temporal prediction. However, a V-frame also allows prediction from other cameras or prediction from synthesized videos. Specifically, V-frames are frames in the compressed bitstream that are predicted from spatial reference pictures or synthesized reference pictures. By periodically inserting V-frames, instead of I-frames, into the bitstream, the present invention provides temporal random access as is possible with I-frames, but with a higher coding efficiency. Therefore, V-frames do not use temporal reference frames. FIG. 19 shows the use of an I-frame for the initial view and the use of V-frames for the subsequent views at the same time instant 1900. Note that, for the lattice configuration shown in FIG. 5, V-frames would not occur at the same time for all views. Any of the low-frequency frames can be assigned a V-frame. In that case, the V-frame is predicted from low-frequency frames of neighboring views.

  In the H.264/AVC video coding standard, an IDR frame, which is similar to an MPEG-2 I-frame with a closed GOP, implies that all reference pictures are deleted from the decoded picture buffer. Thereby, frames before the IDR frame cannot be used to predict frames after the IDR frame.

  In the multi-view decoder described herein, V-frames similarly suggest that all temporal reference pictures can be deleted from the decoded picture buffer. However, the spatial reference picture can remain in the decoded picture buffer. Thus, the frame before the V frame in a given view cannot be used for temporal prediction of the frame after the V frame in the same view.
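
  A sketch of this V-frame buffer behavior; the reference-type tags used here are illustrative assumptions.

def drop_temporal_references(dpb):
    """On decoding a V-frame, temporal references may be dropped while spatial and
    synthesized references remain in the decoded picture buffer."""
    return [ref for ref in dpb if ref["kind"] in ("spatial", "synthesized")]

# Example: dpb entries could be dicts such as {"id": "I0", "kind": "temporal"}.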

  In order to access one particular frame of the multi-view video, the V-frame for that view must first be decoded. As described above, this can be achieved by prediction from spatial reference pictures or synthesized reference pictures without using temporal reference pictures.

  After decoding the V-frame of the selected view, the subsequent frames of that view are decoded. Since these subsequent frames are likely to have prediction dependence on reference pictures from neighboring views, the reference pictures in these neighboring views are also decoded.

Multiview Coding and Decoding

  The sections above describe view synthesis and depth estimation for improving the prediction in multiview coding. The following describes the implementation of variable-block-size depth and motion search, rate-distortion (RD) mode decision, sub-pel reference depth search, and context-adaptive binary arithmetic coding (CABAC) of the depth information. Coding can include encoding at the encoder and decoding at the decoder. CABAC is defined by the H.264/MPEG-4 AVC standard, Part 10, incorporated herein by reference.

View Synthesis Prediction

  Two block prediction methods were implemented to exploit the correlations that exist between cameras and over time:
1) disparity compensated view prediction (DCVP), and
2) view synthesis prediction (VSP).

DCVP

  The first method, DCVP, corresponds to predicting the current frame using a frame from a different camera (view) at the same time, rather than a frame from the same camera (view) at a different time. DCVP provides gains when the temporal correlation is lower than the spatial correlation, for example, because of occlusions, objects entering or leaving the scene, or fast motion.

VSP

  The second method, VSP, synthesizes frames for a virtual camera to predict the sequence of frames. VSP is complementary to DCVP and provides gains when there is non-translational motion between the camera views and when the camera parameters are sufficiently accurate to provide high quality virtual views, which is often the case in practical applications.

  As shown in FIG. 20, the present invention takes advantage of these features of multi-view video by synthesizing a virtual view from already coded views and then performing predictive coding using the synthesized view. FIG. 20 shows time on the horizontal axis and view on the vertical axis, with view synthesis and warping 2001, and view synthesis and interpolation 2002.

  Specifically, for each camera c, a virtual frame I′[c, t, x, y] is first synthesized based on the unstructured Lumigraph rendering technique of Buehler et al. (see above), and the current sequence is then predictively encoded using the synthesized view.

  To synthesize the frame I′[c, t, x, y], we require a depth map D[c, t, x, y] indicating how far from camera c the object corresponding to pixel (x, y) is at time t, as well as an intrinsic matrix A(c), a rotation matrix R(c), and a translation vector T(c) describing the location of camera c relative to some global coordinate system.

  Using these quantities, the well-known pinhole camera model can be applied to project the pixel position (x, y) onto the world coordinates [u, v, w]:

[u, v, w] = R(c) · A^-1(c) · [x, y, 1] · D[c, t, x, y] + T(c)    (1)
  Next, the world coordinates are mapped to the target coordinates [x′, y′, z′] of the frame of the camera c′ used as the reference for the prediction:

[x′, y′, z′] = A(c′) · R^-1(c′) · ([u, v, w] − T(c′))    (2)
  Finally, to obtain the pixel position, the target coordinates are converted into the homogeneous form [x′/z′, y′/z′, 1], and the intensity at pixel position (x, y) in the synthesized frame is I′[c, t, x, y] = I[c′, t, x′/z′, y′/z′].
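
  A sketch of this per-pixel warping chain, assuming numpy, 3x3 matrices A and R, and 3-vectors T for each camera; it transcribes the three steps above directly and omits sub-pel interpolation.

import numpy as np

def warp_pixel(x, y, depth, A_c, R_c, T_c, A_ref, R_ref, T_ref):
    """Map pixel (x, y) with the given depth from camera c into reference camera c'."""
    # Project (x, y) to world coordinates [u, v, w] with the pinhole model (equation (1)).
    world = R_c @ np.linalg.inv(A_c) @ np.array([x, y, 1.0]) * depth + T_c
    # Map the world point into the target coordinates of the reference camera (equation (2)).
    target = A_ref @ np.linalg.inv(R_ref) @ (world - T_ref)
    # Convert to homogeneous form to obtain the reference pixel position (x'/z', y'/z').
    return target[0] / target[2], target[1] / target[2]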

Depth/Motion Estimation with Variable Block Sizes

  A picture buffer management method that enables the use of DCVP without changing the syntax was described above. The disparity vectors between camera views are obtained using a motion estimation step and can be used as a simple extension of the reference types. To use VSP as another type of reference, the normal motion estimation process is extended as follows.

  Given a candidate macroblock type mb_type and N possible reference frames, possibly including a synthesized multiview reference frame (i.e., VSP), a reference frame and either a motion vector (→)m or a depth/correction-vector pair (d, (→)m_c) are selected for each sub-macroblock using the following Lagrangian costs J, with the Lagrange multipliers λ_motion and λ_depth, respectively:

J_motion((→)m, ref) = Σ ‖X − Xp_motion‖ + λ_motion · R((→)m, ref)

and

J_depth(d, (→)m_c, ref) = Σ ‖X − Xp_synth‖ + λ_depth · R(d, (→)m_c, ref).

Here, the sum is taken over all the pixels in the sub-macroblock (sub-MB) under consideration, X is the intensity of a pixel in the current sub-macroblock, and Xp_synth or Xp_motion indicates the intensity of the corresponding pixel in the reference sub-macroblock.

  It should be noted here that “motion” refers not only to temporal motion, but also to motion between views resulting from parallax between views.

Depth Search

  In the present invention, an optimal depth is determined for each variable-size sub-macroblock using a block-based depth search process. Specifically, a minimum depth value Dmin, a maximum depth value Dmax, and a depth increment Dstep are defined. Then, for each variable-size sub-macroblock in the frame to be predicted, the depth that minimizes the error of the synthesized block is selected:

D[c, t, x, y] = arg min_{Dmin ≤ d ≤ Dmax} ‖I[c, t, x, y] − I[c′, t, x′, y′]‖,

with d swept from Dmin to Dmax in increments of Dstep. Here, the reference coordinates (x′, y′) are obtained from the candidate depth d using equations (1) and (2), and ‖I[c, t, x, y] − I[c′, t, x′, y′]‖ denotes the average error between the sub-macroblock centered at (x, y) of camera c at time t and the corresponding block in the reference used for the prediction.

  As a further refinement to enhance the performance of the basic VSP process, we found that adding a synthesis correction vector, which compensates for slight inaccuracies in the camera parameters and other non-idealities not captured by the pinhole camera model, greatly improves the performance.

  Specifically, as shown in FIG. 21, for each macroblock 2100, the target frame 2101 is mapped to the reference frame 2102 and then to the synthesized frame 2103. However, instead of calculating the reference coordinates for the interpolation directly with equation (1), the present invention adds a synthesis correction vector (Cx, Cy) 2110 to each set of original pixel coordinates, and [u, v, w] is calculated by:

[u, v, w] = R(c) · A^-1(c) · [x + Cx, y + Cy, 1] · D[c, t, x, y] + T(c)
  It has been discovered that a small correction vector search range of +/− 2 often greatly improves the quality of the resulting composite reference frame.

Sub-pixel Reference Matching

  Since the disparity between two corresponding pixels of different cameras is generally not an integer number of pixels, the target coordinates [x′, y′, z′] of the frame of camera c′ used as the reference for the prediction, given by equation (2), do not always correspond to a point on the integer pixel grid. Therefore, in the present invention, pixel values at sub-pel positions in the reference frame are generated using interpolation. As a result, the nearest sub-pel reference point can be selected instead of an integer pel, and the disparity between the pixels is approximated more accurately.

  FIG. 22 illustrates this process, where "o" and "x" indicate pixel positions. The same interpolation filter employed for sub-pel motion estimation in the H.264 standard is used in the implementation of the present invention.

Sub-pixel Accuracy Correction Vectors

  In the present invention, the synthesis quality can be further improved by allowing the use of sub-pel accuracy correction vectors, especially when combined with the sub-pel reference matching described above. There is a slight difference between the sub-pel motion vector search and the sub-pel correction vector search.

  For a motion vector, sub-pel positions in the reference picture are usually searched, and the sub-pel motion vector pointing to the position that minimizes the RD cost is selected. For a correction vector, by contrast, after the optimal depth value has been obtained, sub-pel positions in the current picture are searched and the correction vector that minimizes the RD cost is selected.

  A sub-pel correction-vector shift in the current picture does not always correspond to the same amount of shift in the reference picture. Rather, the corresponding match in the reference picture is always found by rounding to the nearest sub-pel position after the geometric transformation of equations (1) and (2).

  Although the coding of correction vectors with sub-pel accuracy is relatively complex, we have observed that this coding significantly improves the synthesis quality and in many cases improves the RD performance.

YUV-Depth Search
In depth estimation, a smoother depth map can be achieved by regularization. Regularization improves the visual quality of the composite prediction, but slightly reduces its prediction quality when measured by the sum of absolute differences (SAD).

  The conventional depth search process estimates the depth map using only the Y (luminance) component of the input image. This minimizes the Y-component prediction error, but the composite prediction often exhibits visual artifacts, for example in the form of color mismatch. As a result, both the objective quality of the final reconstruction (i.e., the U and V PSNR) and the subjective quality are likely to be degraded.

  To address this problem, the present invention extends the depth search process to use the Y luminance component together with the U and V chrominance components. If only the Y component is used, a block may find a good match in the reference frame that minimizes the luminance prediction error, yet the two matched regions may have completely different colors, producing visual artifacts. Incorporating the U and V components into the depth search therefore improves the quality of the U and V prediction and reconstruction.
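One way this extension can be realized, as a sketch only (the component weights and the handling of chroma subsampling are assumptions, not taken from the original), is to accumulate the absolute error over all three components during the depth search:

def yuv_sad(cur, cand, w_y=1.0, w_u=1.0, w_v=1.0):
    """Weighted sum of absolute differences over the Y, U and V planes of two blocks.

    cur and cand are dicts mapping "y", "u", "v" to flat lists of sample values;
    the chroma planes may be subsampled relative to luma.
    """
    sad = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
    return (w_y * sad(cur["y"], cand["y"])
            + w_u * sad(cur["u"], cand["u"])
            + w_v * sad(cur["v"], cand["v"]))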

RD Mode Determination
Mode determination can be performed by selecting the mb_type that minimizes the Lagrangian cost function J_mode defined as follows.

Here, X_p refers to the value of a pixel in the reference MB, i.e., a synthesized multi-view, pure multi-view, or temporal reference MB, and R_side-info is the bit rate required to encode the reference index together with, depending on the type of the reference frame, either the motion vector or the depth/correction values.
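A plausible reconstruction of J_mode, in the standard form used for H.264-style mode decision and consistent with the description above, is

J_mode = Σ_{p ∈ MB} |X(p) − X_p(p)| + λ_mode · (R_side-info + R_residual),

where X(p) is a pixel of the current MB; the actual expression may use a squared-error distortion term and may otherwise differ in detail from the original.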

CABAC Coding of Side Information
In the present invention, when a synthesized MB is selected as the best reference by the RD mode decision, the depth value and correction vector of that MB must be coded. Both the depth values and the correction vector are binarized using concatenated unary / 3rd-order Exp-Golomb (UEG3) binarization with signedValFlag = 1 and cutoff parameter uCoff = 9, and are coded just like motion vectors.
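The UEG3 binarization named here is the one defined for motion vector differences in H.264/AVC: a truncated-unary prefix with cutoff uCoff = 9, a 3rd-order Exp-Golomb suffix for the remainder, and a trailing sign bit when signedValFlag = 1. The sketch below mirrors that standard procedure; it is an illustration, not code from the original.

def ueg3_binarize(value, u_coff=9, k=3, signed=True):
    """Concatenated truncated-unary / k-th order Exp-Golomb (UEGk) binarization."""
    bins = []
    abs_val = abs(value)
    # Truncated-unary prefix with cutoff u_coff.
    prefix = min(abs_val, u_coff)
    bins += [1] * prefix
    if prefix < u_coff:
        bins.append(0)
    else:
        # k-th order Exp-Golomb suffix for the remainder above the cutoff.
        rem = abs_val - u_coff
        while rem >= (1 << k):
            bins.append(1)
            rem -= 1 << k
            k += 1
        bins.append(0)
        for i in reversed(range(k)):
            bins.append((rem >> i) & 1)
    # Sign bit (0 = positive, 1 = negative) when signedValFlag = 1 and the value is nonzero.
    if signed and value != 0:
        bins.append(0 if value > 0 else 1)
    return bins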

  Next, different context models are assigned to the bins of the resulting binary representation. The assignment of ctxIdxInc to the depth and correction-vector components is essentially the same as for motion vectors, as defined in Table 9-30 of ITU-T Recommendation H.264 and ISO/IEC 14496-10 (MPEG-4) AVC, “Advanced Video Coding for Generic Audiovisual Services” (3rd edition, 2005), incorporated herein by reference. However, the present invention does not apply subclause 9.3.3.1.1.7 to the first bin.

  In the present embodiment, the depth value and the correction vector are predictively coded using the same prediction method as for motion vectors. Since an MB, or a sub-MB as small as 8x8, can have its own reference picture chosen from temporal frames, multi-view frames, or synthesized multi-view frames, the type of side information can vary from MB to MB. This means that fewer neighboring MBs may share the same reference picture, which can reduce the prediction efficiency of the side information (motion vectors or depth/correction vectors).

  If a sub-MB is selected to use a synthesized reference but no surrounding MB uses the same reference, its depth/correction vector is coded directly, without prediction. In practice, it has been found that binarizing the correction-vector components with a fixed-length representation and CABAC-coding the resulting bins is often sufficient. This is because MBs that select synthesized references tend to be isolated, i.e., they have no neighboring MBs with the same reference picture, and because the correction vectors are usually less correlated with their neighborhood than motion vectors are.

Syntax and Semantics
As noted above, the present invention incorporates synthesized reference pictures in addition to temporal references and pure multi-view references. A multi-view reference picture list management method compatible with the existing reference picture list management of the H.264/AVC standard has been described above.

  Since the composite reference in this embodiment is regarded as a special case of multi-view reference, it is processed in exactly the same way.

  This document defines a new high-level syntax element called view_parameter_set to describe the multi-view identification and prediction structure. By slightly modifying its parameters, it is possible to identify whether the current reference picture is of the synthesized type, so that either the depth/correction vector or the motion vector of a given (sub-)MB is decoded, depending on the type of reference. The use of this new type of prediction is integrated by extending the macroblock-level syntax as specified in Appendix A.
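As a schematic illustration only (the function and field names below are hypothetical, not the normative syntax of this document), the decoder-side branching on the reference type could look like the following:

def decode_sub_mb_side_info(bitstream, ref_idx, ref_pic_list):
    """Decode either a motion vector or a depth/correction vector for one (sub-)MB,
    depending on whether its reference picture is a synthesized multi-view reference."""
    ref_pic = ref_pic_list[ref_idx]
    if ref_pic.is_synthesized:                           # signaled via the view_parameter_set
        depth = bitstream.decode_depth_delta()           # cf. depthd_l0 / depthd_l1
        corr_vec = bitstream.decode_correction_vector()  # cf. corr_vd_l0 / corr_vd_l1
        return ("synthesized", depth, corr_vec)
    motion_vec = bitstream.decode_motion_vector()        # cf. mvd_l0 / mvd_l1
    return ("temporal_or_multiview", motion_vec)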

  Although the invention has been described by way of examples of preferred embodiments, it is understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Accordingly, the purpose of the appended claims is to cover all such variations and modifications as fall within the true spirit and scope of the present invention.

  Appendix A

Section 7.4.5.1: Additions to the Semantics of Macroblock Prediction
depthd_l0[mbPartIdx][0] specifies the difference between the depth value to be used and its prediction. The index mbPartIdx specifies to which macroblock boundary depthd_l0 is assigned. The partitioning of the macroblock is specified by mb_type.

  depthd_l1 [mbPartIdx] [0] has the same meaning as depthd_l0, and l0 is replaced with l1.

  corr_vd_l0[mbPartIdx][0][compIdx] specifies the difference between the correction-vector component to be used and its prediction. The index mbPartIdx specifies to which macroblock boundary corr_vd_l0 is assigned. The partitioning of the macroblock is specified by mb_type. The horizontal correction-vector component difference is decoded first in decoding order and is assigned compIdx = 0; the vertical correction-vector component difference is decoded next in decoding order and is assigned compIdx = 1.

  corr_vd_l1 [mbPartIdx] [0] [compIdx] has the same meaning as corr_vd_l0, and l0 is replaced with l1.

Section 7.4.5.2: Additions to the Semantics of Sub-Macroblock Prediction
depthd_l0[mbPartIdx][subMbPartIdx] has the same semantics as depthd_l0 above, but applies to the sub-macroblock boundary indexed by subMbPartIdx. The indices mbPartIdx and subMbPartIdx specify to which macroblock boundary and sub-macroblock boundary depthd_l0 is assigned.

  depthd_l1 [mbPartIdx] [subMbPartIdx] has the same meaning as depthd_l0, and l0 is replaced with l1.

  corr_vd_l0[mbPartIdx][subMbPartIdx][compIdx] has the same semantics as corr_vd_l0 above, but applies to the sub-macroblock boundary indexed by subMbPartIdx. The indices mbPartIdx and subMbPartIdx specify to which macroblock boundary and sub-macroblock boundary corr_vd_l0 is assigned.

  corr_vd_l1 [mbPartIdx] [subMbPartIdx] [compIdx] has the same meaning as corr_vd_l0, and l0 is replaced with l1.

Additions to the Semantics of the View Parameter Set
multiview_type equal to 1 specifies that the current view is synthesized from other views. multiview_type equal to 0 specifies that the current view is not synthesized.

  multiview_synth_ref0 specifies the index of the first view used for composition.

  multiview_synth_ref1 specifies the index of the second view used for composition.
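A toy container for these three fields, purely illustrative (the field names follow the text above, but the class itself and its use are assumptions):

from dataclasses import dataclass

@dataclass
class ViewParameterSetSynthesisFields:
    """Synthesis-related fields carried in the view_parameter_set."""
    multiview_type: int        # 1: current view is synthesized from other views; 0: not synthesized
    multiview_synth_ref0: int  # index of the first view used for synthesis
    multiview_synth_ref1: int  # index of the second view used for synthesis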

FIG. 1 is a block diagram of a prior art system for encoding multi-view video.
FIG. 2 is a block diagram of a prior art disparity compensated prediction system for encoding multi-view video.
FIG. 3 is a flow diagram of a prior art wavelet decomposition process.
FIG. 4 is a block diagram of MCTF/DCVF decomposition according to one embodiment of the present invention.
FIG. 5 is a block diagram of low and high frequency frames as a function of time and space after MCTF/DCVF decomposition, according to one embodiment of the present invention.
FIG. 6 is a block diagram of prediction of a high frequency frame from adjacent low frequency frames according to an embodiment of the present invention.
FIG. 7 is a block diagram of a multi-view coding system using macroblock-adaptive MCTF/DCVF decomposition according to an embodiment of the present invention.
FIG. 8 is a schematic diagram of video synthesis according to an embodiment of the present invention.
FIG. 9 is a block diagram of prior art reference picture management.
FIG. 10 is a block diagram of multi-view reference picture management according to an embodiment of the present invention.
FIG. 11 is a block diagram of multi-view reference pictures in a decoded picture buffer according to an embodiment of the present invention.
FIG. 12 is a graph comparing the coding efficiency of different orderings of multi-view reference pictures.
FIG. 13 is a block diagram of view mode dependencies on the multi-view reference picture list manager, according to one embodiment of the present invention.
FIG. 14 is a prior art reference picture management diagram for a single-view coding system using prediction from temporal reference pictures.
FIG. 15 is a diagram of reference picture management for a multi-view encoding and decoding system using prediction from multi-view reference pictures according to an embodiment of the present invention.
FIG. 16 is a block diagram of view synthesis in a decoder using depth information encoded and received as side information according to an embodiment of the present invention.
FIG. 17 is a block diagram of cost calculation for selecting a prediction mode according to an embodiment of the present invention.
FIG. 18 is a block diagram of view synthesis in a decoder using depth information estimated by the decoder according to an embodiment of the present invention.
FIG. 19 is a block diagram of multi-view video that achieves spatial random access using V-frames at a decoder according to an embodiment of the present invention.
FIG. 20 is a block diagram of view synthesis using warping and interpolation according to one embodiment of the present invention.
FIG. 21 is a block diagram of depth search according to an embodiment of the present invention.
FIG. 22 is a block diagram of sub-pel reference matching according to an embodiment of the present invention.

Claims (20)

  1. A method of processing a plurality of multi-view videos of a scene, each video being acquired by a corresponding camera arranged at a particular pose, the view of each camera overlapping the view of at least one other camera, the method comprising:
    obtaining side information, including depth values, for synthesizing a particular view of the multi-view videos;
    receiving as input a video comprising at least one of the plurality of multi-view videos, and generating from the input video, using the side information, a synthesized multi-view video corresponding to a pose different from that of the input video;
    maintaining, for each picture associated with a particular time of each particular multi-view video of the plurality of multi-view videos, a reference picture list that indexes temporal reference pictures and spatial reference pictures of the acquired plurality of multi-view videos and synthesized reference pictures of the synthesized multi-view video, wherein each of the temporal reference pictures, the spatial reference pictures, and the synthesized reference pictures is assigned a picture order count; and
    predicting each current frame of the plurality of multi-view videos according to the reference pictures indexed by the associated reference picture list.
  2. The method of claim 1, wherein the side information further comprises a correction vector.
  3. The method of claim 1, further comprising warping one of the plurality of acquired multi-view videos to synthesize the synthesized multi-view video.
  4. The method of claim 1, further comprising interpolating two or more of the obtained multi-view videos to synthesize the synthesized multi-view video.
  5. The method of claim 1, wherein the side information is obtained at an encoder.
  6. The method of claim 1, in which the side information is obtained at a decoder.
  7. The method of claim 2, wherein the side information is associated with each block boundary of each macroblock in each frame.
  8. The method of claim 2, wherein the correction vector has sub-pixel accuracy for the current frame.
  9. The method of claim 1, wherein the synthesizing is performed by applying a geometric transformation to pixels within a macroblock boundary according to camera parameters and associated side information, and by rounding to the nearest sub-pel position in the spatial reference picture.
  10. The method of claim 1, wherein each frame of the synthesized multi-view video includes a plurality of synthesized macroblocks generated based on the spatial reference picture and the side information associated with each macroblock boundary.
  11. The method of claim 10, wherein a reconstructed macroblock is generated by adding the synthesized macroblock and the decoded residual macroblock.
  12. The method of claim 2, wherein the luminance and chrominance samples of the spatial reference picture are used to determine the associated depth value for each block boundary in each frame.
  13. The method of claim 10, wherein the composite macroblock is associated with a displacement vector.
  14. The method of claim 13, wherein the displacement vector has sub-pixel accuracy.
  15. The method of claim 6, wherein the side information is obtained at the decoder by a context-based binary arithmetic decoder.
  16. The method of claim 1, wherein the predicting using the synthesized multi-view video as a reference is performed on a V picture.
  17. The method of claim 1, wherein a synthesized frame of the synthesized multiview video is generated by warping a corresponding frame of one of the acquired multiview videos.
  18. The method of claim 1, wherein the synthesizing step uses a model of the scene.
  19. The method of claim 1, wherein the synthesizing step uses a depth map, an intrinsic matrix, a rotation matrix, and a translation vector.
  20. A system for processing multiple multi-view videos of a scene,
    A plurality of cameras, each camera configured to acquire a multi-view video of a scene, each camera being positioned in a particular orientation, and each camera view overlapping at least one other camera view;
    Means for obtaining side information including depth values for compositing a particular view of the multi-view video;
    Means for receiving as input a video comprising at least one of the plurality of multi-view videos and for generating from the input video, using the side information, a synthesized multi-view video corresponding to a pose different from that of the input video;
    A memory buffer configured to hold, for each picture associated with a particular time of each particular multi-view video of the plurality of multi-view videos, a reference picture list indexing temporal reference pictures and spatial reference pictures of the acquired plurality of multi-view videos and synthesized reference pictures of the synthesized multi-view video, wherein each of the temporal reference pictures, the spatial reference pictures, and the synthesized reference pictures is assigned a picture order count; and
    Means for predicting each current frame of the plurality of multi-view videos according to a reference picture indexed by the associated reference picture list.
JP2007173941A 2004-12-17 2007-07-02 Method and system for processing multiple multiview videos of a scene Expired - Fee Related JP5013993B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US11/485,092 US7728878B2 (en) 2004-12-17 2006-07-12 Method and system for processing multiview videos for view synthesis using side information
US11/485,092 2006-07-12

Publications (2)

Publication Number Publication Date
JP2008022549A JP2008022549A (en) 2008-01-31
JP5013993B2 true JP5013993B2 (en) 2012-08-29

Family

ID=39078131

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2007173941A Expired - Fee Related JP5013993B2 (en) 2004-12-17 2007-07-02 Method and system for processing multiple multiview videos of a scene

Country Status (1)

Country Link
JP (1) JP5013993B2 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100934675B1 (en) 2006-03-30 2009-12-31 엘지전자 주식회사 Method and apparatus for decoding / encoding video signal
WO2008023968A1 (en) 2006-08-25 2008-02-28 Lg Electronics Inc A method and apparatus for decoding/encoding a video signal
CN103297770B (en) * 2008-04-25 2017-04-26 汤姆森许可贸易公司 Multi-view video encoding based on disparity estimation of depth information
EP2308232B1 (en) 2008-07-20 2017-03-29 Dolby Laboratories Licensing Corp. Encoder optimization of stereoscopic video delivery systems
JP2010157824A (en) * 2008-12-26 2010-07-15 Victor Co Of Japan Ltd Image encoder, image encoding method, and program of the same
JP2010157823A (en) * 2008-12-26 2010-07-15 Victor Co Of Japan Ltd Image encoder, image encoding method, and program of the same
WO2010073513A1 (en) 2008-12-26 2010-07-01 日本ビクター株式会社 Image encoding device, image encoding method, program thereof, image decoding device, image decoding method, and program thereof
JP2010157821A (en) * 2008-12-26 2010-07-15 Victor Co Of Japan Ltd Image encoder, image encoding method, and program of the same
JP4821846B2 (en) * 2008-12-26 2011-11-24 日本ビクター株式会社 Image encoding apparatus, image encoding method and program thereof
JP2010157826A (en) * 2008-12-26 2010-07-15 Victor Co Of Japan Ltd Image decoder, image encoding/decoding method, and program of the same
JP2010157822A (en) * 2008-12-26 2010-07-15 Victor Co Of Japan Ltd Image decoder, image encoding/decoding method, and program of the same
SG166796A1 (en) * 2009-01-19 2010-12-29 Panasonic Corp Coding method, decoding method, coding apparatus, decoding apparatus, program, and integrated circuit
JP5614901B2 (en) 2009-05-01 2014-10-29 トムソン ライセンシングThomson Licensing 3DV reference picture list
KR20130139242A (en) 2010-09-14 2013-12-20 톰슨 라이센싱 Compression methods and apparatus for occlusion data
JP2016127372A (en) * 2014-12-26 2016-07-11 Kddi株式会社 Video encoder, video decoder, video processing system, video encoding method, video decoding method, and program

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004208259A (en) * 2002-04-19 2004-07-22 Matsushita Electric Ind Co Ltd Motion vector calculating method
JP2005005844A (en) * 2003-06-10 2005-01-06 Hitachi Ltd Computation apparatus and coding processing program
US7468745B2 (en) * 2004-12-17 2008-12-23 Mitsubishi Electric Research Laboratories, Inc. Multiview video decomposition and encoding

Also Published As

Publication number Publication date
JP2008022549A (en) 2008-01-31

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20100614

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20110726

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20110915

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20120508

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20120605

R150 Certificate of patent or registration of utility model

Ref document number: 5013993

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20150615

Year of fee payment: 3

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

LAPS Cancellation because of no payment of annual fees