WO2008051041A1 - Multi-view video scalable coding and decoding - Google Patents
- Publication number
- WO2008051041A1 (PCT/KR2007/005294)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- temporal
- scalable
- spatial
- adjacent
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/30—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/60—Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
- H04N21/63—Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
- H04N21/647—Control signaling between network components and server or clients; Network processes for video distribution between server and clients, e.g. controlling the quality of the video stream, by dropping packets, protecting content from unauthorised alteration within the network, monitoring of network load, bridging between two different networks, e.g. between IP and wireless
- H04N21/64784—Data processing by the network
- H04N21/64792—Controlling the complexity of the content stream, e.g. by dropping packets
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/103—Selection of coding mode or of prediction mode
- H04N19/105—Selection of the reference unit for prediction within a chosen coding or prediction mode, e.g. adaptive choice of position and number of pixels used for prediction
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/136—Incoming video signal characteristics or properties
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/187—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a scalable video layer
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/30—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
- H04N19/31—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the temporal domain
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/30—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
- H04N19/33—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the spatial domain
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/46—Embedding additional information in the video signal during the compression process
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/597—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/61—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/61—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
- H04N19/615—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding using motion compensated temporal filtering [MCTF]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/63—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using sub-band based transform, e.g. wavelets
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
- H04N19/87—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving scene cut or scene change detection in combination with video compression
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/13—Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
Definitions
- the present invention relates to a multiview video scalable coding and decoding technology; and, more particularly, to a multiview video scalable coding and decoding apparatus and method for compressing and transmitting multiview video using a multilayer spatial and temporal scalable coding technology and for providing two-dimensional or three-dimensional video services to various types of video terminals.
- data is compressed by removing temporal and spatial redundancy of data.
- the spatial redundancy denotes the identical color or the same objects in video.
- the temporal redundancy denotes adjacent pictures with almost no change in moving pictures, and repeated sounds in audio.
- the temporal redundancy is removed by temporal filtering based on motion compensation and the spatial redundancy is removed by spatial transformation.
- the transmission performance varies according to the type of the transmission medium.
- a scalable video coding technology was introduced to support the various speeds of transmission mediums and to transmit multimedia data at a transfer rate proper to a transmission environment.
- the scalable video coding technology is one of coding technologies for controlling a resolution, a frame rate, and a signal-to-noise ratio (SNR) of video by cutting down a predetermined part of a compressed bit stream according to conditions, such as a transport bit rate, a transport error rate, and a system resource.
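The truncation described above can be sketched as follows. This is a hypothetical illustration, not the syntax of any actual standard: coded units are assumed to carry temporal, spatial, and quality layer tags (`t`, `s`, `q`), and an extractor simply drops the units above the requested layers.

```python
def extract_substream(packets, max_temporal, max_spatial, max_quality):
    """Keep only the coded units needed for the requested scalability point."""
    return [p for p in packets
            if p["t"] <= max_temporal
            and p["s"] <= max_spatial
            and p["q"] <= max_quality]

# Illustrative stream: one base unit plus one refinement per scalability axis.
stream = [
    {"t": 0, "s": 0, "q": 0, "data": b"base"},
    {"t": 1, "s": 0, "q": 0, "data": b"frame-rate refinement"},
    {"t": 0, "s": 1, "q": 0, "data": b"resolution refinement"},
    {"t": 0, "s": 0, "q": 1, "data": b"SNR refinement"},
]

# A low-resolution, low-frame-rate client receives only the base unit.
low_end = extract_substream(stream, max_temporal=0, max_spatial=0, max_quality=0)
```

A server or network node can thus adapt one compressed bitstream to many terminals without re-encoding.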
- Fig. 1 is a diagram describing a scalable coding technology according to the related art.
- the scalable video coding technology performs temporal transform for realizing temporal scalability and performs two-dimensional spatial transform for realizing spatial scalability. Also, the scalable video coding technology realizes quality scalability using texture coding.
- the motion coding scalably encodes motion information when spatial scalability is realized. As described above, one bit stream is generated through such coding algorithms.
- for the temporal transform, motion compensated temporal filtering (MCTF) or hierarchical B-pictures were used.
- the MCTF performs wavelet transform using motion information in the temporal-axis direction in a video sequence.
- the wavelet transform is performed using a lifting scheme.
- the lifting scheme includes three processes, polyphase decomposition, prediction, and update.
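The three lifting steps can be illustrated with the simplest case, a one-level Haar lifting of a 1-D signal. This is an illustrative sketch only: real MCTF applies the predict and update steps along the temporal axis after motion-compensated alignment, which is omitted here.

```python
def haar_lift(signal):
    """One level of Haar wavelet lifting: split, predict, update."""
    # 1. Polyphase decomposition: split into even and odd samples.
    even = signal[0::2]
    odd = signal[1::2]
    # 2. Prediction: high-pass band = odd sample minus its even neighbour.
    high = [o - e for o, e in zip(odd, even)]
    # 3. Update: low-pass band = even sample plus half the prediction residual.
    low = [e + h / 2 for e, h in zip(even, high)]
    return low, high

# low carries the (temporally) filtered averages, high the residual detail.
low, high = haar_lift([2, 4, 6, 8])
```

Because each step is trivially invertible (update, then predict, then merge, with signs flipped), the decoder can reconstruct the original samples exactly.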
- the hierarchical B-pictures may be realized in various ways using a memory management control operation that manages a decoded picture buffer (DPB) for storing 16 pictures and the syntax of reference picture list reordering (RPLR).
- the multiview video compression technology is a technology for simultaneously coding videos from a plurality of cameras that provide multiview video and compressing, storing, and transmitting the coded video. If the multiview video is stored and transmitted without being compressed, a large transmission bandwidth is required to transmit the multiview video to a user through a broadcasting network or a wired/wireless Internet in real-time.
- each of the video sequences is independently coded and transmitted, and the transmitted coded video sequences are decoded. It is easily realized based on MPEG-1/2/4 or H.261/263/264. However, it is impossible to remove the redundancy between videos, which is generated as the same object is photographed by a plurality of cameras.
- a scalable video coding technology was introduced.
- a single-viewpoint video is divided into video frames with multilayer resolutions on a spatial axis using a spatial filter, and temporal and spatial scalability is realized for the divided video frames on a temporal axis through hierarchical bi-directional motion estimation.
- quality scalability may be provided through entropy coding by hierarchical expression in transform coding.
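The division of one video into frames with multilayer resolutions can be sketched as a dyadic pyramid. Simple 2x2 averaging stands in here for whatever spatial filter (e.g. a wavelet analysis filter) an actual codec would specify; the function names are illustrative.

```python
def downsample2(frame):
    """Halve resolution by 2x2 averaging (a stand-in for the spatial filter)."""
    h, w = len(frame), len(frame[0])
    return [[(frame[y][x] + frame[y][x + 1]
              + frame[y + 1][x] + frame[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def spatial_layers(frame, levels):
    """Return [full-res, half-res, ...] down to the base layer."""
    layers = [frame]
    for _ in range(levels - 1):
        layers.append(downsample2(layers[-1]))
    return layers

frame = [[10, 10, 20, 20],
         [10, 10, 20, 20],
         [30, 30, 40, 40],
         [30, 30, 40, 40]]
layers = spatial_layers(frame, 3)
```

The lowest-resolution layer is coded as the base layer; each higher layer is coded as an enhancement over it.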
- An embodiment of the present invention is directed to providing a multiview video scalable coding method and apparatus for effectively compressing videos and providing various video services to terminals in diverse environments through motion estimation with reference to adjacent images on a temporal and spatial axis for compressing multiview video, and through motions, differential images, and intra prediction in different resolutions of adjacent videos for providing scalability on a temporal and spatial axis in a multiview video.
- Another embodiment of the present invention is directed to providing a scalable video decoding method and apparatus for receiving a scalable coded signal and decoding the received signal for multiview video.
- a scalable video coding apparatus for a multiview video including: a basic scalability video encoder for separating one basic video into video frames with multilayer resolutions and performing scalable video coding through temporal and spatial prediction on the separated low resolution video frame and at least one of the separated high resolution video frames; and a plurality of extended scalability video encoders for receiving an own video and at least one of adjacent videos as reference videos, which are captured at the same time, separating the received videos into video frames with multilayer resolutions through spatial filtering, performing scalable video coding for the separated low resolution video frame through temporal and spatial prediction with reference to the adjacent video frames on the same temporal axis as well as an own adjacent frame, and performing scalable video coding for the separated high resolution video frame through temporal and spatial prediction with reference to lower layers of the adjacent video frames on the same temporal axis as well as an own lower layer.
- a scalable video coding method for multiview video including the steps of: (a) separating one basic video into video frames with multilayer resolutions and performing scalable video coding through temporal and spatial prediction; and (b) receiving an own video and at least one of adjacent videos, which are captured at the same time, and performing scalable video coding through temporal and spatial prediction by separating the received videos into video frames with multilayer resolutions, wherein the step (b) includes the steps of: (c) performing scalable video coding for low resolution video frames through temporal and spatial prediction with reference to the adjacent video frames on the same temporal axis as well as own adjacent frames; and (d) performing scalable video coding for at least one of high resolution video frames through temporal and spatial prediction with reference to lower layers of the adjacent video frames as well as an own lower layer.
- a scalable video decoding apparatus for multiview video including: a basic scalability video decoder for receiving a bitstream generated by scalably coding one basic video and restoring the basic video through inverse temporal and inverse spatial transform; and a plurality of extended scalability video decoders for receiving a bitstream scalable-coded through temporal and spatial prediction for an own video and reference videos, which are captured at the same time, restoring at least one of high resolution image frames through inverse temporal and spatial prediction according to whether lower layers of adjacent video frames, which are reference videos, are referred to as well as an own lower layer, restoring a low resolution image frame through inverse temporal and spatial prediction according to whether the adjacent video frames on the same temporal axis are referred to as well as an own adjacent frame, and restoring an image through inverse spatial filtering for the restored high resolution image frames and the restored low resolution image frame.
- the extended scalability video decoder may include: a demultiplexing unit for demultiplexing a received bitstream; at least one enhancement decoding unit for performing scalable decoding for a high resolution image signal outputted from the demultiplexing unit through inverse temporal and spatial motion estimation according to whether lower layers of adjacent videos, which are reference videos, are referred to as well as a lower layer of an own video; a basic layer decoding unit for performing scalable decoding for a low resolution image signal outputted from the demultiplexing unit through inverse motion estimation for reference video frames on a temporal axis as well as inverse temporal and spatial motion estimation for an own video frame; and an inverse spatial video filtering unit for restoring an image through inverse spatial filtering for the restored high resolution images from the enhancement decoding unit and the restored low resolution image from the basic layer decoding unit.
- the enhancement decoding unit may perform scalable decoding with reference to a flag indicating whether motion vector information of a lower layer is used or not, a flag indicating whether a reference index of a lower layer for adjacent video is used as prediction information or not, a flag indicating whether a type of an intra block of a lower layer is used as prediction information or not, a flag indicating whether a differential image value of a lower layer is used or not, and an index of a reference view used for prediction.
- a scalable video decoding method for multiview video including the steps of: (a) performing scalable video decoding for one basic video through inverse temporal and spatial prediction; and (b) receiving a bitstream scalable-coded with reference to an own video and at least one of adjacent videos as reference videos, which are captured at the same time, and performing scalable video decoding through inverse temporal prediction and inverse spatial prediction, wherein the step (b) includes the steps of: (c) performing scalable video decoding for a demultiplexed high resolution image signal through inverse temporal and spatial prediction according to whether lower layers of the adjacent video frames are referred to as well as an own lower layer; and (d) performing scalable video decoding for a demultiplexed low resolution image signal through inverse temporal and spatial prediction according to whether the adjacent video frames on the same temporal axis are referred to as well as an own adjacent frame.
- scalable decoding may be performed with reference to a flag indicating whether motion vector information of a lower layer is used or not, a flag indicating whether a reference index of a lower layer for adjacent video is used as prediction information or not, a flag indicating whether a type of an intra block of a lower layer is used as prediction information or not, a flag indicating whether a differential image value of a lower layer is used or not, and an index of a reference view used for prediction.
- a multiview video can be effectively compressed by expanding the temporal and spatial hierarchical structure of a typical scalable coding technology to multiview videos.
- a video service can be scalably provided to various 2-D or 3-D terminals by forming a hierarchical structure in a temporal and spatial axis for multiview video.
- Fig. 1 is a diagram describing a scalable coding technology according to the related art.
- Fig. 2 illustrates a scalable coding apparatus for multiview video in accordance with an embodiment of the present invention.
- Fig. 3 is a block diagram illustrating an extended scalability video encoder in accordance with an embodiment of the present invention.
- Fig. 4 describes a reference structure for predicting and referencing adjacent frames in scalable video coding according to an embodiment of the present invention.
- Fig. 5 illustrates the reference structure of Fig. 4 with circle symbols and cross symbols in a spatial (view) layer axis with a time fixed.
- Fig. 6 illustrates a reference structure for a B- frame structure.
- Fig. 7 illustrates a reference structure of Fig. 6 with circle symbols and cross symbols in a spatial (view) layer axis with a time fixed.
- Fig. 8 is a block diagram illustrating an apparatus for scalable coding multiview video in accordance with another embodiment of the present invention.
- Fig. 9 describes a reference structure for a basic video 0 (91) and adjacent videos 1 (92) and 2 (93) in Fig. 8.
- Fig. 2 illustrates a scalable coding apparatus for multiview video in accordance with an embodiment of the present invention.
- In Fig. 2, five videos 0 to 4 are input from five cameras, and each of the input videos 0 to 4 is compressed by a scalable video encoder.
- the scalable coding apparatus includes a basic scalability video encoder 21 and extended scalability video encoders 22 to 25.
- the basic scalability video encoder 21 performs 2-D spatial transformation and temporal transformation on the video 0, which is the basic video.
- the video encoder 21 also performs scalable coding through motion coding and texture coding.
- Each of the extended scalability video encoders 22 to 25 receives not only the own video which is assigned to itself but also at least one of adjacent videos as reference videos, and separates the received videos into frames with multilayer resolutions through spatial filtering and temporal filtering.
- each of the extended scalability video encoders 22 to 25 performs scalable coding on the separated frames with reference to temporal and spatial hierarchical image information and compression parameters of the adjacent videos as well as the own video.
- the video 0 is defined as a basic video
- the basic scalability video encoder 21 performs scalable coding on the video 0.
- the basic scalability video encoder 21 has the same structure as a single-viewpoint video scalable coding apparatus according to the related art. That is, the scalable coding apparatus according to the present embodiment has a structure compatible with typical scalable coding apparatuses for the basic video.
- each of the extended scalability video encoders 22 to 25 receives not only the own video which is assigned to itself but also adjacent videos as reference videos, separates the received videos into frames with multilayer resolutions through spatial filtering, and performs scalable coding on the separated frames with reference to lower layers of the other videos assigned to neighbor encoders as well as that of the video assigned to itself. Also, the extended scalability video encoder 23 that compresses the video 2 performs compression through bi-directional prediction using multilayer temporal and spatial resolution video information of the basic video 0 and the video 4.
- the scalable coding apparatus can provide a typical 2-D video service using only the basic scalability video encoder 21. Also, the scalable coding apparatus according to the present embodiment can provide a stereo video service using the basic scalability video encoder 21 for the basic video 0 and the extended scalability video encoder 25 for the video 4. Furthermore, the scalable coding apparatus according to the present embodiment can provide a three-view video service or a five-view video service by selectively combining the basic scalability video encoder 21 with the extended scalability video encoders 22 to 25.
- Fig. 3 is a block diagram illustrating an extended scalability video encoder in accordance with an embodiment of the present invention.
- the extended scalability video encoder includes a spatial video filtering unit 31, temporal video filtering units 330 and 340, a basic layer encoder 33, at least one enhancement layer encoder 34, and a multiplexer 35.
- the spatial video filtering unit 31 separates an own video and reference videos into frames with multilayer resolutions through spatial filtering.
- the temporal video filtering units 330 and 340 separate the output videos from the spatial video filtering unit 31 through temporal filtering.
- the basic layer encoder 33 performs scalable coding not only through temporal and spatial motion estimation for the own video frames of temporal low frequency images outputted from the temporal video filtering unit 330 but also through motion estimation for the reference video frames on a temporal axis.
- Each of the enhancement layer encoders 34 performs scalable coding with reference to the lower layers of the reference videos as well as the lower layer of the own video for temporal high frequency images outputted from the temporal video filtering unit 340.
- the multiplexer 35 outputs one bitstream by multiplexing outputs from the basic layer encoder 33 and the enhancement layer encoders 34.
- the spatial video filtering unit 31 receives an own video assigned to itself, which is captured by an own camera, and the other videos captured by the other cameras as reference videos at a predetermined time interval and separates the received videos into frames with multilayer resolutions through spatial filtering based on MCTF or hierarchical B structure.
- the basic layer encoder 33 and the enhancement layer encoder 34 may include temporal video filtering units 330 and 340, motion encoders 331 and 341, subtractors 332 and 342, spatial transformers 333 and 343, quantizers 334 and 344, and entropy encoders 335 and 345. As described above, the basic layer encoder 33 and the enhancement layer encoder 34 have a structure similar to a typical scalable video encoder.
- the temporal video filtering unit 330 of the basic layer encoder 33 separates the low frequency images, which are separated through spatial filtering, in a temporal axis through filtering based on MCTF or hierarchical B-structure.
- Also, the temporal video filtering unit 340 of the enhancement layer encoder 34 separates the high frequency images, which are separated through the spatial filtering, in a temporal axis through filtering based on MCTF or hierarchical B-structure.
- the motion encoders 331 and 341 include a motion estimation block or a motion compensation block.
- the motion estimation block performs motion estimation of a current frame using a reference frame as a basis and calculates a motion vector for forward motion estimation or bi-directional estimation.
- the motion encoders 331 and 341 may use not only their own frames but also peripheral frames as reference frames for motion estimation.
- the motion encoders 331 and 341 use a block matching algorithm that is generally used for motion estimation. That is, the motion encoders 331 and 341 calculate the displacement at which an error becomes minimum while moving a given motion block within a predetermined search area of a reference frame and estimate the calculated displacement as the motion vector.
- the motion encoders 331 and 341 provide motion data, such as motion vectors obtained as the result of motion estimation, a size of a motion block, and a reference frame number, to the entropy encoders 335 and 345. Also, the motion compensation block generates a temporal estimated frame for a current frame by performing the motion compensation for a forward reference frame, a backward reference frame, or a bi-directional reference frame using the calculated motion vector.
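The block matching search described above can be sketched as a full search over a square window, minimizing the sum of absolute differences (SAD). This is a minimal illustration; practical encoders use fast search patterns and sub-pixel refinement rather than an exhaustive search.

```python
def block_match(cur, ref, bx, by, bsize, search):
    """Full-search block matching: find the motion vector (dx, dy) within
    +/-search pixels that minimizes the SAD between the current block at
    (bx, by) and the displaced reference block."""
    h, w = len(ref), len(ref[0])
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + bsize > w or ry + bsize > h:
                continue  # displaced block falls outside the reference frame
            sad = sum(abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
                      for j in range(bsize) for i in range(bsize))
            if sad < best_sad:
                best_mv, best_sad = (dx, dy), sad
    return best_mv, best_sad

# A reference frame with a horizontal gradient, and a current frame that is
# the same content shifted one pixel to the left: the true motion is (1, 0).
ref = [[x + 8 * y for x in range(8)] for y in range(8)]
cur = [[x + 1 + 8 * y for x in range(8)] for y in range(8)]
mv, sad = block_match(cur, ref, bx=2, by=2, bsize=2, search=2)
```

The same search applies whether the reference frame comes from the own view (temporal prediction) or from an adjacent view (inter-view prediction).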
- the subtractors 332 and 342 remove the temporal redundancy of a video by subtracting the temporal estimated frame from the current frame.
- the spatial transformers 333 and 343 remove spatial redundancy from the temporal redundancy removed frame using a predetermined spatial transformation method that supports spatial scalability.
- Discrete Cosine Transform (DCT) and wavelet transform are widely used.
- the quantizers 334 and 344 quantize transform coefficients from the spatial transformers 333 and 343.
- the quantization is a process of transforming a transform coefficient, which is expressed as a predetermined real number, to a discrete value by dividing the transform coefficient by a predetermined step size and matching the result to a predetermined index.
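That mapping can be sketched in a few lines. A uniform step size is assumed here for illustration, whereas a real codec derives the step from a quantization parameter:

```python
def quantize(coeffs, step):
    """Map each real-valued transform coefficient to a discrete index."""
    return [round(c / step) for c in coeffs]

def dequantize(indices, step):
    """Approximate reconstruction: index times the quantization step."""
    return [i * step for i in indices]

coeffs = [12.6, -3.1, 0.4]
idx = quantize(coeffs, step=4)    # [3, -1, 0]
rec = dequantize(idx, step=4)     # [12, -4, 0]
```

The reconstruction error (here 0.6, -0.9, 0.4) is the lossy part of the codec; a larger step gives a smaller bitstream and lower quality.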
- the entropy encoders 335 and 345 lossless-encode the quantized transform coefficients from the quantizers 334 and 344 and the motion data provided from the motion estimation block and generate an output bitstream.
- arithmetic coding or variable length coding may be used.
- intra prediction may be performed for an intra block before spatial transform.
- the enhancement layer encoder may include a 2-D spatial interpolation block for receiving a restored reference frame from the lower layer encoder and performing two-dimensional (2-D) spatial interpolation and an intra prediction block for performing the intra prediction.
- inter prediction searches for the block most similar to a predetermined block of a current frame, obtains a predicted block that can express the current block best, and quantizes the difference between the current block and the predicted block.
- the inter prediction includes bi-directional prediction using two reference frames, forward prediction using a past reference frame, and backward prediction using a future reference frame.
- the intra prediction predicts a current block using frames adjacent to the current block.
- the intra prediction is different from the others because the intra prediction uses information in a current frame only and does not use the other frames in the same layer or frames of other layers.
- Intra base prediction may be used when a lower layer includes a frame at the same temporal location as the current frame.
- a macro block of a current frame can be effectively predicted from the macro blocks of the corresponding basic frame. That is, the difference between a macro block of the current frame and a macro block of the corresponding basic frame is quantized.
- the macro block of the basic frame is up-sampled to the resolution of the current layer before calculating the difference.
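The up-sample-then-subtract structure of intra base prediction can be sketched as follows; nearest-neighbour up-sampling is used only for illustration, since the specification does not name the interpolation filter:

```python
import numpy as np

def upsample2x(block):
    """Nearest-neighbour 2x up-sampling of a base-layer macro block
    (a real codec would use a proper interpolation filter)."""
    return block.repeat(2, axis=0).repeat(2, axis=1)

def intra_base_residual(cur_mb, base_mb):
    """Intra base prediction: up-sample the co-located base-layer
    macro block to current-layer resolution, then keep only the
    difference, which is what gets quantized."""
    return cur_mb - upsample2x(base_mb)

base = np.array([[10, 20], [30, 40]])     # 2x2 base-layer macro block
cur = upsample2x(base) + 1                # enhancement block close to the prediction
res = intra_base_residual(cur, base)      # small residual, cheap to code
```

Because the enhancement block closely matches the up-sampled base block, the residual collapses to a constant of 1, far cheaper to quantize than the block itself.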
- Residual prediction is the extension of inter prediction from a single layer to multiple layers.
- the residual prediction calculates the difference between the residual obtained from the inter prediction of a current layer and the residual obtained from the inter prediction of a lower layer, and quantizes the calculated difference.
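Residual prediction, i.e. coding the difference between the two layers' inter-prediction residuals, can be sketched as below (again with illustrative nearest-neighbour up-sampling, an assumption on my part):

```python
import numpy as np

def residual_prediction(cur_residual, base_residual):
    """Inter-layer residual prediction: code only the difference between
    the current layer's inter-prediction residual and the up-sampled
    base layer's residual."""
    up = base_residual.repeat(2, axis=0).repeat(2, axis=1)  # 2x up-sampling
    return cur_residual - up

base_res = np.array([[2, 2], [2, 2]])          # base-layer residual
cur_res = np.full((4, 4), 3)                   # correlated enhancement residual
second_order = residual_prediction(cur_res, base_res)
```

Since the two layers' residuals are correlated, the second-order difference is near zero and costs fewer bits than coding `cur_res` directly.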
- when encoding high-resolution image frames, the enhancement encoder performs motion estimation using motion vectors multiplied by two, taken from the basic layer image (the low resolution image) of its own video and from the basic layer images (the low resolution images) of the other videos used as reference videos.
- when encoding high resolution image frames, the enhancement encoder performs differential image prediction by interpolating the basic layer image (low resolution image) of its own video and the basic layer images (low resolution images) of the other videos used as reference videos.
- when encoding high resolution image frames in an intra prediction mode, the enhancement encoder performs intra prediction using the basic layer image that is the low resolution image of its own video and the basic layer images that are the low resolution images of the other videos used as reference videos.
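The motion-vector handling described above, where a base-layer vector is multiplied by two before being reused at the doubled enhancement-layer resolution, reduces to a simple scaling; `scale_base_mv` is a hypothetical helper name:

```python
def scale_base_mv(mv, ratio=2):
    """Scale a base-layer motion vector to enhancement-layer resolution.
    With a dyadic (2:1) resolution ratio, each component is multiplied
    by two before being used as a predictor at the higher resolution."""
    return (mv[0] * ratio, mv[1] * ratio)

# a base-layer vector of (-1, 3) pixels predicts (-2, 6) at double resolution
predictor = scale_base_mv((-1, 3))
```

The enhancement layer then only codes the (usually small) difference between this predictor and its actually estimated high-resolution vector.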
- Fig. 4 is a diagram illustrating a reference structure for predicting and referencing adjacent frames in scalable video coding according to an embodiment of the present invention.
- a P macro block denotes single-directional prediction and a B macro block denotes bi-directional prediction.
- the reference structure according to the present embodiment allows single directional prediction and bi-directional prediction to be performed in a plurality of resolution layers as well as in a temporal axis and a spatial axis.
- the reference structure according to the present embodiment includes a two-layer structure formed of one basic layer and an enhancement layer.
- the reference structure may further include more enhancement layers.
- a reference numeral 41 denotes a predicting and referencing operation for predicting and referencing adjacent frames, which is performed in a basic layer encoder and an enhancement layer encoder in a basic scalability video encoder 21 of Fig. 2.
- a reference numeral 42 denotes a predicting and referencing operation which is performed in a basic layer encoder and an enhancement layer encoder in an extended scalability video encoder 22 for a video 1 of Fig. 2.
- a reference numeral 43 denotes a predicting and referencing operation, which is performed in a basic layer encoder and an enhancement layer encoder in an extended scalability video encoder 23 for a video 2 of Fig. 2.
- the basic layer 0 denotes a reference structure performed in each of the basic layer encoder of the basic scalability video encoder 21, the basic layer encoder of the extended scalability video encoder 22 for the video 1, and the basic layer encoder of the extended scalability video encoder 23 for the video 2.
- Like the basic layer 0 (L0), the enhancement layer 1 (L1) denotes a reference structure performed in each of the enhancement layer encoder of the basic scalability video encoder 21, the enhancement layer encoder of the extended scalability video encoder 22 for the video 1, and the enhancement layer encoder of the extended scalability video encoder 23 for the video 2.
- the basic layer encoder of the basic scalability video encoder 21 performs a scalable video coding operation by predicting and referencing adjacent frames for own low resolution image frames in a temporal axis like the video encoder according to the related art.
- the basic layer encoder of the extended scalability video encoder 22 for the video 1 performs bi-directional prediction for its own frame using the frames of a video 0 and the frames of a video 2, which are reference video frames located at the same temporal axis.
- the basic layer encoder of the extended scalability video encoder 23 for the video 2 performs single-directional prediction with reference to the basic video 0 and, at the same time, performs bi-directional prediction using its own frames.
- each of the macro blocks includes three circle or cross symbols indicating whether a lower layer is referenced or not.
- the circle or cross symbol in the middle row among the three symbols indicates whether the lower layer of the own video frame is referenced or not.
- the circle or cross symbols in the top row and in the bottom row indicate whether the lower layers of adjacent video frames are referenced or not.
- the enhancement layer encoder of the basic scalability video encoder 21 performs scalable video coding with reference to own frames of a lower layer like the encoder according to the related art.
- the enhancement layer encoder of the extended scalability video encoder 22 for the video 1 performs bi-directional prediction with reference to the lower layer frames of a video 0 and the lower layer frames of a video 2, which are adjacent frames, as well as its own lower layer frames.
- the enhancement layer encoder of the extended scalability video encoder 23 for the video 2 performs prediction with reference to the lower layer frames of the basic video 0 as well as its own lower layer frames.
- Fig. 5 illustrates the reference structure of Fig. 4 with circle symbols and cross symbols in a spatial (view) layer axis with a time fixed.
- the reference structure includes three layers.
- a reference numeral 51 denotes a reference structure for a video 0
- a reference numeral 52 denotes a reference structure for a video 1
- a reference numeral 53 denotes a reference structure for a video 2.
- a coding operation is performed based on the motion, differential images, and intra prediction used in scalable video coding (SVC) according to the related art, with reference to a lower layer of the own video only.
- since the macro block of an enhancement layer 2 for a video 1 (52) includes all circle symbols, a coding operation is performed with reference to the lower layers of adjacent videos as well as the lower layer of the own video.
- since the macro block of an enhancement layer 2 for a video 2 (53) includes circle symbols in the middle and left columns, a scalable video coding operation is performed with reference to the lower layer of the own video and the lower layer of a video 0.
- a scalable video includes predetermined syntax elements that describe information about videos in the reference layer.
- a syntax element ref_view_Idx denotes the view number of the reference video in a lower layer.
- a flag base_mode_flag indicates whether or not the motion vector information of a lower layer is used for estimating a motion in a current block. If the flag base_mode_flag is 1, the variable ref_view_Idx must hold the view number of the lower layer indicating which motion vector information is used.
- a flag base_mode_refinement_flag indicates whether or not the motion vector information of a lower layer is used for predicting a motion vector of a current block. Unlike the flag base_mode_flag, the reference index of a lower layer is also used as prediction information.
- a variable ref_view_Idx must have a view number of a lower layer for indicating which motion vector and reference index information are used.
- a flag intra_base_flag indicates whether the intra block type of a lower layer is used as prediction information of a current block. If the flag intra_base_flag is 1, information about an intra prediction mode of the lower layer is used for the current block. Therefore, the variable ref_view_Idx must hold the view number of the lower layer indicating which intra block type information is used.
- a flag residual_prediction_flag indicates whether or not a differential image value of a lower layer is used for predicting a differential image of a current block. If the flag residual_prediction_flag is 1, the differential image information of the lower layer is up-sampled. Also, the variable ref_view_Idx must hold the view number of the lower layer indicating which differential image information is used. Table 1 shows the above described syntax elements in the scalable video.
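The syntax elements above, and the constraint that ref_view_Idx must accompany any set inter-layer flag, can be sketched as a small container with a validity check; the class and method names are hypothetical illustrations, not part of the bitstream syntax:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InterLayerSyntax:
    """Hypothetical container for the inter-layer prediction syntax
    described above; field names follow the description, not a
    normative bitstream specification."""
    base_mode_flag: bool = False
    base_mode_refinement_flag: bool = False
    intra_base_flag: bool = False
    residual_prediction_flag: bool = False
    ref_view_idx: Optional[int] = None  # view number of the lower-layer reference

    def validate(self):
        # Whenever any inter-layer flag is 1, ref_view_idx must identify
        # which lower-layer view supplies the prediction information.
        any_flag = (self.base_mode_flag or self.base_mode_refinement_flag
                    or self.intra_base_flag or self.residual_prediction_flag)
        if any_flag and self.ref_view_idx is None:
            raise ValueError("ref_view_idx is required when an inter-layer flag is set")
        return True
```

A decoder-side check of this kind would reject a block that requests, say, residual prediction without naming the reference view.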
- Fig. 6 illustrates a reference structure for a B-frame structure.
- a reference numeral 61 denotes a reference structure for a video 0 that is a basic video
- a reference numeral 62 denotes a reference structure for a video 1
- a reference numeral 63 denotes a reference structure for a video 2.
- Fig. 7 is a diagram illustrating the reference structure of Fig. 6 with circle symbols and cross symbols in a spatial (view) layer axis with a time fixed.
- the reference structure includes three layers.
- a reference numeral 71 denotes a reference structure for a video 0 that is a basic video
- a reference numeral 72 denotes a reference structure for a video 1
- a reference numeral 73 denotes a reference structure for a video 2.
- the video 0, which is the basic video (61 and 71), performs scalable video coding with reference to its own lower layer frames only.
- the video 1 and the video 2 perform scalable video coding with reference to the lower layer frames of adjacent videos as well as their own lower layer frames.
- Fig. 8 is a block diagram illustrating an apparatus for scalable coding multiview video in accordance with another embodiment of the present invention.
- Fig. 9 is a diagram illustrating a reference structure for a basic video 0 (91) and adjacent videos 1 (92) and 2 (93) in Fig. 8.
- a basic scalability video encoder 81 performs scalable coding for the basic video 0.
- the basic scalability video encoder 81 references its own lower layer frames like a single viewpoint video scalable coding apparatus. Therefore, the scalable coding apparatus according to another embodiment can be compatible with an existing scalable coding apparatus.
- Extended scalability video encoders 82 to 85 perform scalable video coding for the videos 1 to 4. Each of the extended scalability video encoders 82 to 85 separates video into frames with multilayer resolution, performs temporal and spatial prediction for the separated video frames, and performs compression with reference to temporal and spatial layer image information of adjacent videos and compression parameters.
- the enhancement layer 1 (92) of the video 1 performs scalable video coding with reference to a lower layer of the video 0, and the enhancement layer 2 (93) of the video 2 performs scalable video coding with reference to a lower layer of the video 1.
- in the scalable coding apparatus according to another embodiment shown in Fig. 8, each extended scalability video encoder performs scalable video coding with reference to one neighboring video only.
- the scalable coding apparatus can provide a 2-D video service using the basic scalability video encoder 81. Also, the scalable coding apparatus according to another embodiment can provide a stereo video service using the basic scalability video encoder 81 for the basic video 0 and the extended scalability video encoder 82 for the video 1. Furthermore, the scalable coding apparatus according to another embodiment can provide a three-view service using the basic scalability video encoder 81 for the basic video 0 and the extended scalability video encoders 82 and 84 for the videos 1 and 3. The scalable coding apparatus according to another embodiment can provide a five-view service using the basic scalability video encoder 81 for the basic video 0 and the extended scalability video encoders 82 to 85 for the videos 1 to 4.
- the scalable coding technology for multiview video according to the present invention will be briefly described again.
- one basic video is separated into frames with multilayer resolutions using a spatial filter in a spatial axis.
- a spatial and temporal scalable video coding operation is performed on the separated low resolution image frames through motion estimation in a temporal axis.
- the spatial and temporal scalable video coding operation is performed on the separated high resolution video frames through hierarchical motion estimation in a temporal axis with reference to a lower layer.
- a bitstream is generated by multiplexing the coded low resolution image frame and at least one of the coded high resolution video frames.
- the scalable video coding for the basic video according to the present embodiment is identical to that according to the related art.
- own video is received with at least one of adjacent videos as reference videos.
- the received own video and adjacent videos are separated into video frames with multilayer resolutions using a spatial filter in a spatial axis.
- the temporal and spatial scalable video coding is performed on the separated low resolution image frame through hierarchical motion estimation with reference to adjacent frames as reference frames as well as own frame in a temporal axis.
- the temporal and spatial scalable video coding is performed on the separated high resolution image frame through hierarchical motion estimation with reference to lower layers of adjacent video frames as well as a lower layer of the own video frame in a temporal axis.
- a bitstream is generated by multiplexing the coded low resolution image frame and at least one of the coded high resolution video frames.
- the extended scalable video coding uses not only the own lower layer frames but also adjacent lower layer frames as reference frames, unlike the scalable video coding for a single viewpoint video, which uses only its own adjacent frames and the lower layer frames thereof.
- a scalable video decoding apparatus for multiview video performs the operations of the scalable video encoding apparatus according to the present embodiment in a reverse order.
- the scalable video decoding apparatus includes a basic scalability video decoder and a plurality of extended scalability video decoders.
- the basic scalability video decoder receives a bitstream generated by scalable-coding one basic video and restores the basic video through inverse temporal transformation and inverse spatial transformation.
- Each of the extended scalability video decoders receives a bitstream generated by scalable-coding the own video and reference videos, which are captured at the same time, through the temporal and spatial prediction.
- each of the extended scalability video decoders restores at least one high resolution video frame through inverse temporal and spatial prediction according to whether a lower layer of an adjacent video frame is referenced as well as its own lower layer, and restores one low resolution image frame through inverse temporal and spatial prediction according to whether the adjacent video frames at the same temporal axis are referenced as well as its own adjacent frame. Then, the extended scalability video decoder restores a video by performing inverse spatial filtering on the restored high resolution video frames and the restored low resolution image frame.
- the basic scalability video decoder has a structure identical to that of a typical scalable video decoder. Therefore, the detailed description thereof is omitted.
- the extended scalability video decoder includes a demultiplexer, at least one enhancement layer decoder, a basic layer decoder, and an inverse spatial video filtering unit.
- the demultiplexer demultiplexes the received bitstream.
- Each of the enhancement layer decoders performs scalable decoding on the high resolution video signal outputted from the demultiplexer through inverse temporal and spatial motion estimation according to whether the lower layers of adjacent videos are referenced as well as the lower layer of the own video.
- the basic layer decoder performs scalable decoding on the low resolution image signal outputted from the demultiplexer not only through inverse temporal and spatial motion estimation for the own video frame but also through inverse motion estimation for reference video frames on a temporal axis.
- the inverse spatial video filtering unit restores a video by performing inverse spatial filtering on the restored high resolution video frame from the enhancement layer decoder and on the restored low resolution image frame from the basic layer decoder.
- the basic layer decoder and the enhancement layer decoder perform the operations of the basic layer encoder and the enhancement layer encoder in an inverse order. Therefore, the detailed descriptions of the basic layer decoder and the enhancement layer decoder are omitted.
- When the enhancement layer decoder performs a decoding operation for a high resolution video signal, it references a flag indicating whether the motion vector information of a lower layer is used or not, a flag indicating whether a reference index of a lower layer for adjacent video is used as prediction information or not, a flag indicating whether a type of an intra block of a lower layer is used as prediction information or not, a flag indicating whether a differential video value of a lower layer is used or not, and an index of a reference view used for prediction.
- the technology of the present invention can be realized as a program and stored in a computer-readable recording medium, such as CD-ROM, RAM, ROM, floppy disk, hard disk and magneto-optical disk. Since the process can be easily implemented by those skilled in the art of the present invention, further description will not be provided herein.
- multiview video can be effectively compressed by expanding a temporal and spatial hierarchical structure of a typical scalable coding technology to the multiview video.
- a video service can be scalably provided to various types of 2-D or 3-D terminals by forming a hierarchical structure on a temporal and spatial axis for the multiview video according to the present invention.
Abstract
Provided are a scalable video coding and decoding apparatus and method for multiview video. The apparatus includes a basic scalability video encoder for separating one basic video into video frames and performing scalable video coding through temporal and spatial prediction, and multiple extended scalability video encoders for receiving an own video and one or more adjacent videos as reference videos captured simultaneously, separating the received videos into video frames, performing scalable video coding for the separated low resolution image frame through temporal and spatial prediction with reference to the adjacent video frames at the same temporal axis as well as the own adjacent frame, and performing scalable video coding for the separated high resolution video frame through temporal and spatial prediction referring to lower layers of the adjacent video frames at the same temporal axis as well as an own lower layer.
Description
DESCRIPTION
MULTI-VIEW VIDEO SCALABLE CODING AND DECODING
TECHNICAL FIELD The present invention relates to a multiview video scalable coding and decoding technology; and, more particularly, to a multiview video scalable coding and decoding apparatus and method for compressing and transmitting multiview video using a multilayer spatial and temporal scalable coding technology and for providing two-dimensional or three-dimensional video services to various types of video terminals.
This work was supported by the IT R&D program of MIC/IITA [2005-S-403-02, "Development of Super- intelligent Multimedia Anytime-anywhere Realistic TV (SmarTV) Technology"] .
BACKGROUND ART
In general, data is compressed by removing temporal and spatial redundancy. The spatial redundancy denotes identical colors or the same objects within a picture. The temporal redundancy means adjacent pictures with almost no changes in moving pictures and repeated sounds in audio. In a typical video coding method, the temporal redundancy is removed by temporal filtering based on motion compensation, and the spatial redundancy is removed by spatial transformation.
In order to transmit multimedia data generated after removing data redundancy, various transmission media were introduced. The transmission performance varies according to the type of the transmission medium. Also, a scalable video coding technology was introduced to support the various speeds of transmission media and to transmit multimedia data at a transfer rate suited to a transmission environment.
The scalable video coding technology is one of coding technologies for controlling a resolution, a frame rate, and a signal-to-noise ratio (SNR) of video by cutting down a predetermined part of a compressed bit stream according to conditions, such as a transport bit rate, a transport error rate, and a system resource.
Fig. 1 is a diagram describing a scalable coding technology according to a related art.
Referring to Fig. 1, the scalable video coding technology according to the related art performs temporal transform for realizing temporal scalability and performs two-dimensional spatial transform for realizing spatial scalability. Also, the scalable video coding technology realizes quality scalability using texture coding. The motion coding scalably encodes motion information when spatial scalability is realized. As described above, one bit stream is generated through such coding algorithms.
In order to provide the temporal scalability and improve a compression rate in the scalable video coding, motion compensated temporal filtering (MCTF) and hierarchical B-pictures were used.
The MCTF performs wavelet transform using motion information along the temporal direction of a video sequence. The wavelet transform is performed using a lifting scheme. The lifting scheme includes three processes: polyphase decomposition, prediction, and update.
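The predict/update structure of the lifting scheme can be sketched without motion compensation as a Haar-style temporal lift; this illustrates only the lifting steps (polyphase split, prediction, update), not the motion-compensated filter the MCTF actually applies:

```python
import numpy as np

def mctf_lift(frames):
    """One level of Haar-style temporal lifting:
    1) polyphase decomposition: split frames into even/odd phases,
    2) prediction: predict odd frames from even ones (high-pass),
    3) update: adjust even frames with the high-pass result (low-pass)."""
    even, odd = frames[0::2], frames[1::2]
    high = [o - e for o, e in zip(odd, even)]        # prediction step
    low = [e + h / 2 for e, h in zip(even, high)]    # update step
    return low, high

# four toy frames whose pixel values drift slowly over time
f = [np.full((2, 2), v, dtype=float) for v in (10, 12, 20, 22)]
low, high = mctf_lift(f)
```

The high-pass frames are small (the temporal detail), the low-pass frames form a half-rate approximation of the sequence, and the step is perfectly invertible (`even = low - high/2`, `odd = high + even`), which is what makes lifting attractive for temporal scalability.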
The hierarchical B-pictures may be realized in various ways using a memory management control operation that manages a decoded picture buffer (DPB) storing 16 pictures and the syntaxes of reference picture list reordering (RPLR).
Recently, due to advances in technologies and demands of users, researchers are studying to develop a service for providing video information for scenes at diverse viewpoints, and a service allowing viewers to edit video information transmitted from a broadcasting station and watch desired video among the video information. In order to provide the services, a technology for compressing multiview video is required. The multiview video compression technology is a technology for simultaneously coding videos from a plurality of cameras that provide multiview video and compressing, storing, and transmitting the coded video. If the multiview video is stored and transmitted without being compressed, a large transmission bandwidth is required to transmit the multiview video to a user through a broadcasting network or a wired/wireless Internet in real time.
In the multiview video coding and decoding technology, each of the video sequences is independently coded and transmitted, and the transmitted coded video sequences are decoded. This is easily realized based on MPEG-1/2/4 or H.261/263/264. However, it is impossible to remove the redundancy between videos, which is generated as the same object is photographed by a plurality of cameras.
In order to remove the redundancy between videos, a scalable video coding technology was introduced. In the scalable video coding technology for a single viewpoint video, the single viewpoint video is divided into video frames with multilayer resolutions in a spatial axis using a spatial filter, and temporal and spatial scalable coding is performed on the divided video frames in a temporal axis through hierarchical bi-directional motion estimation. Also, quality scalability may be provided through entropy coding by hierarchical expression in transform coding.
However, since the scalable video coding technology was designed for a single viewpoint video, a large overhead may be generated in a video decoder because of a high transport rate when a terminal reproduces three-dimensional videos with selective two-dimensional videos.
DISCLOSURE TECHNICAL PROBLEM
An embodiment of the present invention is directed to providing a multiview video scalable coding method and apparatus for effectively compressing videos and providing various video services to terminals in diverse environments through motion estimation with reference to adjacent images at a temporal and spatial axis for compressing multiview video and through motions, differential images, and intra prediction in different resolutions of adjacent videos for providing scalability on a temporal and spatial axis in a multiview video.
Another embodiment of the present invention is directed to providing a scalable video decoding method and apparatus for receiving a scalable coded signal and decoding the received signal for multiview video. Other objects and advantages of the present invention can be understood by the following description, and become apparent with reference to the embodiments of the present invention. Also, it is obvious to those skilled in the art of the present invention that the objects and advantages of the present invention can be realized by the means as claimed and combinations thereof.
TECHNICAL SOLUTION In accordance with an aspect of the present invention, there is provided a scalable video coding apparatus for a multiview video including: a basic scalability video encoder for separating one basic video into video frames with multilayer resolutions and performing scalable video coding through performing
temporal and spatial prediction on the separated low resolution image frame and at least one of the separated high resolution video frames; and a plurality of extended scalability video encoders for receiving an own video and at least one of adjacent videos as reference videos which are captured at the same time, separating the received videos into video frames with multilayer resolutions through spatial filtering, performing scalable video coding for the separated low resolution image frame through temporal and spatial prediction with reference to the adjacent video frames at the same temporal axis as well as own adjacent frame, and performing scalable video coding for the separated high resolution video frame through temporal and spatial prediction with reference to lower layers of the adjacent video frames at the same temporal axis as well as an own lower layer.
In accordance with another aspect of the present invention, there is provided a scalable video coding method for multiview video, including the steps of: (a) separating one basic video into video frames with multilayer resolutions and performing scalable video coding through temporal and spatial prediction; and (b) receiving an own video and at least one of adjacent videos, which are captured at the same time, and performing scalable video coding through temporal and spatial prediction by separating the received videos into video frames with multilayer resolutions, wherein the step (b) of receiving an own video and at least one of adjacent videos includes the steps of: (c) performing scalable video coding for low resolution video frames through temporal and spatial prediction with reference to the adjacent video frames at the same temporal axis as well as own adjacent frames; and (d) performing scalable video coding for at least one of high resolution video frames through temporal and spatial prediction with
reference to lower layers of the adjacent video frames as well as own lower layer.
In accordance with another aspect of the present invention, there is provided a scalable video decoding apparatus for multiview video including: a basic scalability video decoder for receiving a bitstream generated by scalably coding one basic video and restoring a basic video through inverse temporal and inverse spatial transform; and a plurality of extended scalability video decoders for receiving a bitstream scalable-coded through temporal and spatial prediction for the own video and reference videos, which are captured at the same time, restoring at least one of high resolution image frames through inverse temporal and spatial prediction according to whether lower layers of adjacent video frames that are reference videos are referenced as well as an own lower layer or not, restoring a low resolution image frame through inverse temporal and spatial prediction according to whether the adjacent video frames are referenced or not at the same temporal axis as well as an own adjacent frame, and restoring an image through inverse spatial filtering for the restored high resolution image frames and the restored low resolution image frame. The extended scalability video decoder may include: a demultiplexing unit for demultiplexing a received bitstream; at least one enhancement decoding unit for performing scalable decoding for a high resolution image signal outputted from the demultiplexing unit through inverse temporal and spatial motion estimation according to whether lower layers of adjacent videos that are reference videos are referenced as well as a lower layer of the own video or not; a basic layer decoding unit for performing scalable decoding for a low resolution image signal outputted from the demultiplexing unit through
inverse motion estimation for reference video frames at a temporal axis as well as inverse temporal and spatial motion estimation for the own video frame; and an inverse spatial video filtering unit for restoring an image through inverse spatial filtering for the restored high resolution images from the enhancement decoding unit and the restored low resolution image from the basic layer decoding unit.
The enhancement decoding unit may perform scalable decoding with reference to a flag indicating whether motion vector information of a lower layer is used or not, a flag indicating whether a reference index of a lower layer for adjacent video is used as prediction information or not, a flag indicating whether a type of an intra block of a lower layer is used as prediction information or not, a flag indicating whether a differential image value of a lower layer is used or not, and an index of a reference view used for prediction.
In accordance with another aspect of the present invention, there is provided a scalable video decoding method for multiview video, including the steps of: (a) performing scalable video decoding for one basic video through inverse temporal and spatial prediction; and (b) receiving a bitstream scalable-coded with reference to an own video and at least one of adjacent videos as reference videos, which are captured at the same time, and performing scalable video decoding through inverse temporal prediction and inverse spatial prediction, wherein the step (b) of receiving a bitstream scalable-coded with reference to an own video and at least one of adjacent videos as reference videos includes the steps of: (c) performing scalable video decoding for demultiplexed high resolution image signal through inverse temporal and spatial prediction according to whether lower layers of the adjacent video frames are
referred as well as an own lower layer; and (d) performing scalable video decoding for demultiplexed low resolution image signal through inverse temporal and spatial prediction according to whether the adjacent video frames are referred at the same temporal axis as well as an own adjacent frame.
In the performing scalable video decoding for the demultiplexed high resolution image signal, scalable decoding may be performed with reference to a flag indicating whether motion vector information of a lower layer is used or not, a flag indicating whether a reference index of a lower layer for adjacent video is used as prediction information or not, a flag indicating whether a type of an intra block of a lower layer is used as prediction information or not, a flag indicating whether a differential image value of a lower layer is used or not, and an index of a reference view used for prediction.
ADVANTAGEOUS EFFECTS
According to the present invention, a multiview video can be effectively compressed by expanding the temporal and spatial hierarchical structure of a typical scalable coding technology to multiview videos. Also, a video service can be scalably provided to various 2-D or 3-D terminals by forming a hierarchical structure in a temporal and spatial axis for multiview video.
BRIEF DESCRIPTION OF THE DRAWINGS Fig. 1 is a diagram describing a scalable coding technology according to the related art.
Fig. 2 illustrates a scalable coding apparatus for multiview video in accordance with an embodiment of the present invention. Fig. 3 is a block diagram illustrating an extended
scalability video encoder in accordance with an embodiment of the present invention.
Fig. 4 describes a reference structure for predicting and referencing adjacent frames in scalable video coding according to an embodiment of the present invention.
Fig. 5 illustrates the reference structure of Fig. 4 with circle symbols and cross symbols in a spatial (view) layer axis with a time fixed. Fig. 6 illustrates a reference structure for a B- frame structure.
Fig. 7 illustrates a reference structure of Fig. 6 with circle symbols and cross symbols in a spatial (view) layer axis with a time fixed. Fig. 8 is a block diagram illustrating an apparatus for scalable coding multiview video in accordance with another embodiment of the present invention.
Fig. 9 describes a reference structure for a basic video 0 (91) and adjacent videos 1 (92) and 2 (93) in Fig. 8.
BEST MODE FOR THE INVENTION
The advantages, features and aspects of the invention will become apparent from the following description of the embodiments with reference to the accompanying drawings, which is set forth hereinafter. Therefore, those skilled in the art to which the present invention pertains can easily embody the technical concept and scope of the invention. In addition, if it is considered that a detailed description of a related art may obscure the points of the present invention, the detailed description will not be provided herein. The specific embodiments of the present invention will be described in detail hereinafter with reference to the attached drawings.
Fig. 2 illustrates a scalable coding apparatus for multiview video in accordance with an embodiment of the present invention.
In Fig. 2, five videos 0 to 4 are input from five cameras, and each of the input videos 0 to 4 is compressed by a scalable video encoder.
Referring to Fig. 2, the scalable coding apparatus according to the present embodiment includes a basic scalability video encoder 21 and extended scalability video encoders 22 to 25. The video encoder 21 performs 2-D spatial transformation and temporal transformation on the video 0, which is a basic video. The video encoder 21 also performs scalable coding through motion coding and texture coding. Each of the extended scalability video encoders 22 to 25 receives not only the own video assigned to itself but also at least one of adjacent videos as reference video and separates the received videos into frames with multilayer resolutions through spatial filtering and temporal filtering. Also, each of the extended scalability video encoders 22 to 25 performs scalable coding on the separated frames with reference to temporal and spatial hierarchical image information and compression parameters of the adjacent videos as well as the own video. In Fig. 2, the video 0 is defined as a basic video, and the basic scalability video encoder 21 performs scalable coding on the video 0. The basic scalability video encoder 21 has the same structure as a single viewpoint video scalable coding apparatus according to the related art. That is, the scalable coding apparatus according to the present embodiment has a structure compatible with typical scalable coding apparatuses for basic video.
In order to scalably compress input video with reference to the adjacent videos as well as the own video,
the videos 1 to 4 are compressed through the extended scalability video encoders 22 to 25. Like a single viewpoint video scalable coding apparatus, each of the extended scalability video encoders 22 to 25 receives not only own video which is assigned to itself but also adjacent videos as a reference video, separates the received videos into frames with multilayer resolutions through spatial filtering, and performs scalable coding on the separated frames with reference to lower layers of the other videos assigned to neighbor encoders as well as that of the video assigned to itself. Also, the extended scalability encoder 23 that compresses the video 2 performs compression through bidirectional prediction using multilayer temporal and spatial resolution video information of the basic video 0 and the video 4.
Accordingly, the scalable coding apparatus according to the present embodiment can provide a typical 2-D video service using only the basic scalability video encoder 21. Also, the scalable coding apparatus according to the present embodiment can provide a stereo video service using the basic scalability video encoder 21 for the basic video 0 and the extended scalability video encoder 25 for the video 4. Furthermore, the scalable coding apparatus according to the present embodiment can provide a three-view video service or a five-view video service by selectively combining the basic scalability video encoder 21 with the extended scalability video encoders 22 to 25.
Hereinafter, the structure and the function of the extended scalability video encoder will be described with reference to Fig. 3.
Fig. 3 is a block diagram illustrating an extended scalability video encoder in accordance with an embodiment of the present invention. Referring to Fig. 3, the extended scalability video
encoder according to the present embodiment includes a spatial video filtering unit 31, temporal video filtering units 330 and 340, a basic layer encoder 33, at least one enhancement layer encoder 34, and a multiplexer 35. The spatial video filtering unit 31 separates an own video and reference videos into frames with multilayer resolutions through spatial filtering. The temporal video filtering units 330 and 340 separate the output videos from the spatial video filtering unit 31 through temporal filtering. The basic layer encoder 33 performs scalable coding not only through temporal and spatial motion estimation for the own video frames of the temporal low frequency images outputted from the temporal video filtering unit 330 but also through motion estimation for the reference video frames on a temporal axis. Each of the enhancement layer encoders 34 performs scalable coding with reference to the lower layers of the reference videos as well as the lower layer of the own video for the temporal high frequency images outputted from the temporal video filtering unit 340. The multiplexer 35 outputs one bitstream by multiplexing outputs from the basic layer encoder 33 and the enhancement layer encoders 34.
The spatial video filtering unit 31 receives an own video assigned to itself, which is captured by an own camera, and the other videos captured by the other cameras as reference videos at a predetermined time interval and separates the received videos into frames with multilayer resolutions through spatial filtering based on MCTF or a hierarchical B-structure. The basic layer encoder 33 and the enhancement layer encoder 34 may include temporal video filtering units 330 and 340, motion encoders 331 and 341, subtractors 332 and 342, spatial transformers 333 and 343, quantizers 334 and 344, and entropy encoders 335 and 345. As described above, the basic layer encoder 33 and the enhancement layer encoder 34 have a structure similar to a typical scalable video encoder. Hereinafter, the functions of the constituent elements in the encoders 33 and 34 will be described. The temporal video filtering unit 330 of the basic layer encoder 33 separates the low frequency images, which are separated through spatial filtering, in a temporal axis through filtering based on MCTF or a hierarchical B-structure. Also, the temporal video filtering unit 340 of the enhancement layer encoder 34 separates the high frequency images, which are separated through the spatial filtering, in a temporal axis through filtering based on MCTF or a hierarchical B-structure.
The motion encoders 331 and 341 include a motion estimation block and a motion compensation block. The motion estimation block performs motion estimation of a current frame using a reference frame as a basis and calculates a motion vector for forward motion estimation or bi-directional estimation. Here, the motion encoders 331 and 341 may use not only own frames but also peripheral frames as reference frames for motion estimation. The motion encoders 331 and 341 use a block matching algorithm that is generally used for motion estimation. That is, the motion encoders 331 and 341 calculate the displacement at which an error becomes minimum while moving a given motion block in a predetermined search area of a reference frame and estimate the calculated displacement as the motion vector. The motion encoders 331 and 341 provide motion data, such as motion vectors obtained as the result of motion estimation, a size of a motion block, and a reference frame number, to the entropy encoders 335 and 345. Also, the motion compensation block generates a temporal estimated frame for a current frame by performing motion compensation for a forward reference frame, a backward reference frame, or a bi-directional reference frame using the calculated motion vector.
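The block matching search described in the paragraph above can be sketched as follows. This is a minimal full-search with a sum-of-absolute-differences (SAD) error measure; the frame size, block size, and search range are illustrative assumptions, not values taken from the embodiment:

```python
def sad(cur, ref, bx, by, dx, dy, bsize):
    """Sum of absolute differences between a block of the current
    frame at (bx, by) and the block displaced by (dx, dy) in the
    reference frame."""
    total = 0
    for y in range(bsize):
        for x in range(bsize):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total

def full_search(cur, ref, bx, by, bsize=4, search=2):
    """Exhaustive block matching: move the block over a +/-search
    window of the reference frame and keep the displacement whose
    error is minimum; that displacement is the motion vector."""
    h, w = len(cur), len(cur[0])
    best, best_cost = (0, 0), sad(cur, ref, bx, by, 0, 0, bsize)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # skip displacements that fall outside the reference frame
            if not (0 <= by + dy and by + dy + bsize <= h
                    and 0 <= bx + dx and bx + dx + bsize <= w):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, bsize)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost
```

For a current frame that is a one-pixel right shift of the reference frame, the search finds the displacement (-1, 0) with zero error.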
The subtractors 332 and 342 remove the temporal redundancy of a video by subtracting a temporal estimated frame from a current frame. The spatial transformers 333 and 343 remove spatial redundancy from the temporal-redundancy-removed frame using a predetermined spatial transformation method that supports spatial scalability. As the spatial transformation method, Discrete Cosine Transform (DCT) and wavelet transform are widely used.
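As a sketch of the spatial transformation step, a 1-D orthonormal DCT-II is shown below. A real encoder applies a 2-D DCT or wavelet transform to residual blocks, so this only illustrates the principle that the transform concentrates the energy of a smooth signal into a few low-frequency coefficients:

```python
import math

def dct2(x):
    """Orthonormal 1-D DCT-II. For a smooth input most of the output
    energy lands in the first few coefficients, which is what makes
    the subsequent quantization effective."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                for i, v in enumerate(x))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out
```

A constant signal, for example, transforms to a single DC coefficient with all higher-frequency coefficients at zero.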
The quantizers 334 and 344 quantize the transform coefficients from the spatial transformers 333 and 343. The quantization is a process of transforming a transform coefficient, which is expressed as a real number, into a discrete value by dividing the transform coefficient into predetermined intervals and matching the discrete value to a predetermined index.
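The quantization process described above can be illustrated with a uniform scalar quantizer. The fixed step size and round-to-nearest index mapping are simplifying assumptions; practical codecs use dead-zone quantizers with per-band step sizes:

```python
def quantize(coeff, step):
    """Map a real-valued transform coefficient to a discrete index
    by dividing the real line into intervals of width `step` and
    rounding to the nearest interval centre."""
    return round(coeff / step)

def dequantize(index, step):
    """Reconstruct an approximate coefficient from its index; the
    reconstruction error is bounded by half the step size."""
    return index * step
```

Larger step sizes give coarser indices (and smaller bitstreams) at the cost of larger reconstruction error.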
The entropy encoders 335 and 345 lossless-encode the quantized transform coefficients from the quantizers 334 and 344 and the motion data provided from the motion estimation block and generate an output bitstream. As the lossless encoding method, arithmetic coding or variable length coding may be used. Meanwhile, intra prediction may be performed for an intra block before the spatial transform. In order to perform the intra prediction, the enhancement layer encoder may include a 2-D spatial interpolation block for receiving a restored reference frame from the lower layer encoder and performing two-dimensional (2-D) spatial interpolation and an intra prediction block for performing the intra prediction.
In general, inter prediction searches a block most similar to a predetermined block of a current frame, obtains a predicted block that can express the current
block best, and quantizes differences between the current block and the predicted block. The inter prediction includes bi-directional prediction using two reference frames, forward prediction using a past reference frame, and backward prediction using a future reference frame.
Meanwhile, the intra prediction predicts a current block using blocks adjacent to the current block. The intra prediction is different from the other prediction methods because it uses information in a current frame only and does not use the other frames in the same layer or frames of the other layers.
Intra base prediction may be used when a lower layer includes a frame at the same temporal location as a current frame. A macro block of the current frame can be effectively predicted from the macro blocks of the corresponding basic frame. That is, the difference between a macro block of the current frame and a macro block of the corresponding basic frame is quantized. When the resolution of the lower layer is different from that of the current layer, the macro block of the basic frame is up-sampled to the resolution of the current layer before calculating the difference.
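A rough sketch of intra base prediction across a resolution gap follows, assuming nearest-neighbour 2x up-sampling and integer samples; the actual up-sampling filter is not specified in the text:

```python
def upsample_2x(block):
    """Nearest-neighbour 2x up-sampling of a lower-layer block to
    the resolution of the current layer."""
    out = []
    for row in block:
        wide = [v for v in row for _ in (0, 1)]  # duplicate columns
        out.append(wide)
        out.append(list(wide))                   # duplicate rows
    return out

def intra_base_residual(cur_mb, base_mb):
    """Difference between a current-layer macro block and the
    up-sampled co-located base-layer macro block; this residual is
    what gets transformed and quantized."""
    pred = upsample_2x(base_mb)
    return [[c - p for c, p in zip(crow, prow)]
            for crow, prow in zip(cur_mb, pred)]
```

When the current block closely follows the up-sampled base block, the residual is mostly zeros and compresses well.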
Residual prediction is the extension of the inter prediction from a single layer to multiple layers. The residual prediction calculates a difference between the difference obtained from the inter prediction of a current layer and the other difference obtained from the inter prediction of a lower layer and quantizes the calculated difference. In the present embodiment, when encoding high resolution image frames, the enhancement encoder performs motion estimation using a value obtained by multiplying by two a motion vector of a basic layer image, that is, a low resolution image, for an own video and of basic layer images, that is, low resolution images, for the other videos as reference videos.
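The multiply-by-two rule above — reusing a base-layer motion vector at twice the spatial resolution — can be written directly; representing a motion vector as a (dx, dy) tuple is just an illustrative convention:

```python
def scale_base_mv(mv):
    """When the enhancement layer has twice the spatial resolution
    of the base layer, a base-layer motion vector is scaled by two
    before being used as the predictor (or starting point) for the
    enhancement-layer motion search."""
    dx, dy = mv
    return (2 * dx, 2 * dy)
```

A base-layer displacement of (3, -1), for instance, corresponds to (6, -2) in the high resolution frame.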
In the present embodiment, when encoding high resolution image frames, the enhancement encoder performs differential image estimation by interpolating the remaining images after predicting a basic layer image (low resolution image) of an own video and basic layer images (low resolution images) of the other videos as reference videos.
Also, in an intra prediction mode when encoding high resolution image frames, the enhancement encoder performs intra prediction using a basic layer image that is a low resolution image of an own video and basic layer images that are low resolution images of the other videos as reference videos. Fig. 4 is a diagram illustrating a reference structure for predicting and referencing adjacent frames in scalable video coding according to an embodiment of the present invention.
In Fig. 4, a P macro block denotes single-directional prediction and a B macro block denotes bi-directional prediction. The reference structure according to the present embodiment allows single-directional prediction and bi-directional prediction to be performed in a plurality of resolution layers as well as in a temporal axis and a spatial axis.
As shown in Fig. 4, the reference structure according to the present embodiment includes a two-layer structure formed of one basic layer and an enhancement layer. However, the reference structure may further include more enhancement layers.
In Fig. 4, a reference numeral 41 denotes a predicting and referencing operation for predicting and referencing adjacent frames, which is performed in a basic layer encoder and an enhancement layer encoder in a basic scalability video encoder 21 of Fig. 2. A
reference numeral 42 denotes a predicting and referencing operation which is performed in a basic layer encoder and an enhancement layer encoder in an extended scalability video encoder 22 for a video 1 of Fig. 2. A reference numeral 43 denotes a predicting and referencing operation, which is performed in a basic layer encoder and an enhancement layer encoder in an extended scalability video encoder 23 for a video 2 of Fig. 2.
In other words, the basic layer 0 (LO) denotes a reference structure performed in each of the basic layer encoder of the basic scalability video encoder 21, the basic layer encoder of the extended scalability video encoder 22 for the video 1, and the basic layer encoder of the extended scalability video encoder 23 for the video 2.
Like the basic layer 0 (L0), the enhancement layer 1 (L1) denotes a reference structure performed in each of the enhancement layer encoder of the basic scalability video encoder 21, the enhancement layer encoder of the extended scalability video encoder 22 for the video 1, and the enhancement layer encoder of the extended scalability video encoder 23 for the video 2.
Referring to Fig. 4, the basic layer encoder of the basic scalability video encoder 21 performs a scalable video coding operation by predicting and referencing adjacent frames for own low resolution image frames in a temporal axis like the video encoder according to the related art. The basic layer encoder of the extended scalability video encoder 22 for the video 1 performs bi-directional prediction for an own frame using the frames of a video 0 and the frames of a video 2, which are reference video frames located at the same temporal axis. Also, the basic layer encoder of the extended scalability video encoder 23 for the video 2 performs single-directional prediction with reference to the basic video 0 and performs bi-directional prediction using the own frame at the same time.
Meanwhile, the enhancement layer 1 (L1), the upper layer of the basic layer, performs spatial and temporal prediction for an own video frame and performs prediction with reference to own frames of a basic layer and adjacent frames of the basic layer. In Fig. 4, each of the macro blocks includes three circle or cross symbols for indicating whether a lower layer is referred to or not. Here, a circle or a cross symbol in the middle row among the three symbols indicates whether a lower layer of an own video frame is referred to or not, and circle symbols or cross symbols in the top row or in the bottom row indicate whether lower layers of adjacent video frames are referred to or not.
Referring to Fig. 4, the enhancement layer encoder of the basic scalability video encoder 21 performs scalable video coding with reference to own frames of a lower layer like the encoder according to the related art. The enhancement layer encoder of the extended scalability video encoder 22 for the video 1 performs bi-directional prediction with reference to lower layer frames of a video 0 and lower layer frames of a video 2, which are adjacent frames, as well as own lower layer frames. The enhancement layer encoder of the extended scalability video encoder 23 for the video 2 performs prediction with reference to the lower layer frames of the basic video 0 as well as the own lower layer frame.
Fig. 5 illustrates the reference structure of Fig. 4 with circle symbols and cross symbols in a spatial (view) layer axis with a time fixed.
In Fig. 5, the reference structure includes three layers. A reference numeral 51 denotes a reference structure for a video 0, a reference numeral 52 denotes a reference structure for a video 1, and a reference
numeral 53 denotes a reference structure for a video 2.
In Fig. 5, since the macro block of an enhancement layer 2 (50) for the video 0 includes two cross symbols at the right and left columns and a circle symbol at the middle column, a coding operation is performed based on motion, differential images, and intra prediction used in the scalable video coding (SVC) according to the related art with reference to a lower layer of an own video only. On the contrary, since the macro block of an enhancement layer 2 for a video 1 (52) includes all circle symbols, a coding operation is performed with reference to lower layers of adjacent videos as well as a lower layer of an own video. Also, since the macro block of an enhancement layer 2 for a video 2 (53) includes circle symbols at the middle and left columns, a scalable video coding operation is performed with reference to the lower layer of an own video and the lower layer of a video 0.
In order to indicate whether a lower layer is referenced or not as described above, a scalable video bitstream includes predetermined syntax elements that describe information about videos in the reference layer.
A syntax element ref_view_Idx denotes a view number of a reference video in a lower layer. Here, a flag base_mode_flag indicates whether the motion vector information of a lower layer is used for estimating a motion in a current block or not. If the flag base_mode_flag is 1, a variable ref_view_Idx must have a view number of a lower layer for indicating which motion vector information is used. A flag base_mode_refinement_flag indicates whether or not the motion vector information of a lower layer is used for predicting a motion vector of a current block. Unlike the flag base_mode_flag, the reference index of a lower layer is also used as prediction information. Therefore, if the flag base_mode_refinement_flag is 1, a variable ref_view_Idx must have a view number of a lower layer for indicating which motion vector and reference index information are used. A flag intra_base_flag indicates whether a type of an intra block in a lower layer is used as prediction information of a current block. If the flag intra_base_flag is 1, information about an intra prediction mode of a lower layer is used for a current block. Therefore, a variable ref_view_Idx must have a view number of a lower layer for indicating which intra block type information is used.
A flag residual_prediction_flag indicates whether a differential image value of a lower layer is used for predicting a differential image of a current block or not. If the flag residual_prediction_flag is 1, the differential image information of a lower layer is up-sampled. Also, the variable ref_view_Idx must have a view number of a lower layer for indicating which differential image information is used. Table 1 shows the above described syntax elements in the scalable video.
Table 1
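The flag semantics listed above can be summarized in a small dispatch sketch; the dict keys mirror the syntax-element names in the text, while the function name and the returned tuples are illustrative assumptions:

```python
def select_inter_layer_tools(flags):
    """Interpret the inter-layer prediction flags and return which
    pieces of lower-layer information the current block reuses,
    paired with the view number (ref_view_Idx) they come from."""
    tools = []
    view = flags.get("ref_view_Idx")
    if flags.get("base_mode_flag"):
        # lower-layer motion vectors drive the current-block motion
        tools.append(("motion_vector", view))
    if flags.get("base_mode_refinement_flag"):
        # the lower-layer reference index is reused as well
        tools.append(("motion_vector_and_ref_index", view))
    if flags.get("intra_base_flag"):
        # lower-layer intra prediction mode predicts the current block
        tools.append(("intra_mode", view))
    if flags.get("residual_prediction_flag"):
        # lower-layer differential image is up-sampled before use
        tools.append(("residual", view))
    return tools
```

A block with base_mode_flag and residual_prediction_flag set, for example, reuses the motion vectors and the up-sampled residual of the indicated lower-layer view.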
Fig. 6 illustrates a reference structure for a B- frame structure.
In Fig. 6, a reference numeral 61 denotes a reference structure for a video 0 that is a basic video,
a reference numeral 62 denotes a reference structure for a video 1, and a reference numeral 63 denotes a reference structure for a video 2.
Fig. 7 is a diagram illustrating a reference structure of Fig. 6 with circle symbols and cross symbols in a spatial (view) layer axis with a time fixed.
In Fig. 7, the reference structure includes three layers. A reference numeral 71 denotes a reference structure for a video 0 that is a basic video, a reference numeral 72 denotes a reference structure for a video 1, and a reference numeral 73 denotes a reference structure for a video 2.
Referring to Figs. 6 and 7, the video 0, which is the basic video (61 and 71), performs scalable video coding with reference to own lower layer frames only. However, the video 1 and the video 2 perform the scalable video coding with reference to the lower layer frames of adjacent videos as well as own lower layer frames.
The above described reference structures can be identically applied to a P-frame structure.
Fig. 8 is a block diagram illustrating an apparatus for scalable coding multiview video in accordance with another embodiment of the present invention. Fig. 9 is a diagram illustrating a reference structure for a basic video 0 (91) and adjacent videos 1 (92) and 2 (93) in Fig. 8.
Referring to Figs. 8 and 9, a basic scalability video encoder 81 performs scalable coding for the basic video 0. The basic scalability video encoder 81 refers to own lower layer frames like a single viewpoint video scalable coding apparatus. Therefore, the scalable coding apparatus according to another embodiment can be compatible with existing scalable coding apparatuses.
Extended scalability video encoders 82 to 85 perform scalable video coding for the videos 1 to 4. Each of the
extended scalability video encoders 82 to 85 separates video into frames with multilayer resolution, performs temporal and spatial prediction for the separated video frames, and performs compression with reference to temporal and spatial layer image information of adjacent videos and compression parameters.
As shown in Fig. 9, the enhancement layer 1 (92) of the video 1 performs scalable video coding with reference to a lower layer of the video 0, and the enhancement layer 2 (93) of the video 2 performs scalable video coding with reference to a lower layer of the video 1. In other words, the extended scalability video encoders perform scalable video coding with reference to only one adjacent video in the scalable coding apparatus according to another embodiment as shown in Fig. 8.
Accordingly, the scalable coding apparatus according to another embodiment can provide a 2-D video service using the basic scalability video encoder 81. Also, the scalable coding apparatus according to another embodiment can provide a stereo video service using the basic scalability video encoder 81 for the basic video 0 and the extended scalability video encoder 82 for the video 1. Furthermore, the scalable coding apparatus according to another embodiment can provide a three-view service using the basic scalability video encoder 81 for the basic video 0 and the extended scalability video encoders 82 and 84 for the videos 1 and 3. The scalable coding apparatus according to another embodiment can provide a five-view service using the basic scalability video encoder 81 for the basic video 0 and the extended scalability video encoders 82 and 85 for the videos 1 and 4.
The scalable coding technology for multiview video according to the present invention will be briefly described again.
At first, one basic video is separated into frames with multilayer resolutions using a spatial filter in a spatial axis. Then, a spatial and temporal scalable video coding operation is performed on the separated low resolution image frames through motion estimation in a temporal axis. Also, the spatial and temporal scalable video coding operation is performed on the separated high resolution video frames through hierarchical motion estimation in a temporal axis with reference to a lower layer. Then, a bitstream is generated by multiplexing the coded low resolution image frame and at least one of the coded high resolution video frames. As described above, the scalable video coding for basic video according to the present embodiment is identical to that according to the related art.
Hereinafter, the scalable video coding for multiview video according to the present embodiment will be described.
At first, an own video is received with at least one of adjacent videos as reference videos. The received own video and adjacent videos are separated into video frames with multilayer resolutions using a spatial filter in a spatial axis. Then, the temporal and spatial scalable video coding is performed on the separated low resolution image frame through hierarchical motion estimation with reference to adjacent frames as reference frames as well as an own frame in a temporal axis. Also, the temporal and spatial scalable video coding is performed on the separated high resolution image frame through hierarchical motion estimation with reference to lower layers of adjacent video frames as well as a lower layer of the own video frame in a temporal axis. Then, a bitstream is generated by multiplexing the coded low resolution image frame and at least one of the coded high resolution video frames.
As described above, the extended scalable video coding uses not only the own lower layer frames but also adjacent lower layer frames as reference frames, unlike the scalable video coding for single viewpoint video, which uses an adjacent frame and a lower layer frame thereof.
Meanwhile, a scalable video decoding apparatus for multiview video according to an embodiment of the present invention performs the operations of the scalable video encoding apparatus according to the present embodiment in a reverse order.
The scalable video decoding apparatus according to the present embodiment includes a basic scalability video decoder and a plurality of extended scalability video decoders. The basic scalability video decoder receives a bitstream generated by scalable-coding one basic video and restores the basic video through inverse temporal transformation and inverse spatial transformation. Each of the extended scalability video decoders receives a bitstream generated by scalable-coding an own video and reference videos, which are captured at the same time, through the temporal and spatial prediction. Then, each of the extended scalability video decoders restores at least one high resolution video frame through inverse temporal and spatial prediction according to whether a lower layer of an adjacent video frame is referred to as well as an own lower layer, and restores one low resolution image frame through inverse temporal and spatial prediction according to whether the adjacent video frames at the same temporal axis are referenced as well as the own adjacent frame. Then, each of the extended scalability video decoders restores video by performing inverse spatial filtering on the restored high resolution video frames and the restored low resolution image frame.
In the scalable decoding apparatus according to the present embodiment, the basic scalability video decoder has a structure identical to that of a typical scalable video decoder. Therefore, the detailed description thereof is omitted.
In the present embodiment, the extended scalability video decoder includes a demultiplexer, at least one enhancement layer decoder, a basic layer decoder, and an inverse spatial video filtering unit. The demultiplexer demultiplexes the received bitstream. Each of the enhancement layer decoders performs scalable decoding on a high resolution video signal outputted from the demultiplexer through inverse temporal and spatial motion estimation according to whether adjacent videos are referred to as well as a lower layer of an own video. The basic layer decoder performs scalable decoding on a low resolution image signal outputted from the demultiplexer not only through inverse temporal and spatial motion estimation for an own video frame but also through inverse motion estimation for reference video frames on a temporal axis. The inverse spatial video filtering unit restores a video by performing inverse spatial filtering on the restored high resolution video frame from the enhancement layer decoder and on the restored low resolution image frame from the basic layer decoder.
Here, the basic layer decoder and the enhancement layer decoder perform the operations of the basic layer encoder and the enhancement layer encoder in an inverse order. Therefore, the detailed descriptions of the basic layer decoder and the enhancement layer decoder are omitted.
When the enhancement layer decoder performs a decoding operation for a high resolution video signal, the enhancement layer decoder refers to a flag indicating whether the motion vector information of a lower layer is used or not, a flag indicating whether a reference index of a lower layer for adjacent video is used as prediction information or not, a flag indicating whether a type of an intra block of a lower layer is used as prediction information or not, a flag indicating whether a differential video value of a lower layer is used or not, and an index of a reference view used for prediction.
As described above, the technology of the present invention can be realized as a program and stored in a computer-readable recording medium, such as CD-ROM, RAM, ROM, floppy disk, hard disk and magneto-optical disk. Since the process can be easily implemented by those skilled in the art of the present invention, further description will not be provided herein.
While the present invention has been described with respect to certain preferred embodiments, it will be apparent to those skilled in the art that various changes and modifications may be made without departing from the scope of the invention as defined in the following claims.
INDUSTRIAL APPLICABILITY
According to the present invention, multiview video can be effectively compressed by extending the temporal and spatial hierarchical structure of a typical scalable coding technology to multiview video. Also, a video service can be scalably provided to various types of 2-D or 3-D terminals by forming a hierarchical structure on the temporal and spatial axes for the multiview video according to the present invention.
Claims
1. A scalable video coding apparatus for a multiview video, comprising: a basic scalability video encoder for separating one basic video into video frames with multilayer resolutions and performing scalable video coding through temporal and spatial prediction on the separated low resolution image frame and at least one of the separated high resolution video frames; and a plurality of extended scalability video encoders for receiving an own video and at least one of adjacent videos as reference videos, which are captured at the same time, separating the received videos into video frames with multilayer resolutions through spatial filtering, performing scalable video coding for the separated low resolution image frame through temporal and spatial prediction with reference to the adjacent video frames at the same temporal axis as well as an own adjacent frame, and performing scalable video coding for the separated high resolution video frame through temporal and spatial prediction with reference to lower layers of the adjacent video frames at the same temporal axis as well as an own lower layer.
2. The scalable video coding apparatus of claim 1, wherein the extended scalability video encoder includes: a spatial video filtering means for separating an own video and adjacent videos as reference videos into video frames with multilayer resolutions through spatial filtering; a basic layer encoding means for separating the own video and adjacent videos into low resolution image frames through temporal filtering and performing scalable coding through motion estimation for a reference video frame in a temporal axis as well as temporal and spatial motion estimation for an own video frame; at least one of enhancement layer encoding means for separating the own video and adjacent videos into high resolution video frames through temporal filtering and performing scalable coding through spatial and temporal motion estimation with reference to lower layers for the adjacent videos as well as a lower layer of an own video; and a multiplexing means for outputting one bitstream by multiplexing the output of the basic layer encoding means and the output of the enhancement layer encoding means.
3. The scalable video coding apparatus of claim 2, wherein the enhancement layer encoding means sets a flag indicating whether motion vector information of a lower layer is used or not, a flag indicating whether a reference index of a lower layer for an adjacent video is used as prediction information or not, a flag indicating whether a type of an intra block of a lower layer is used as prediction information or not, and a flag indicating whether a differential video value of a lower layer is used or not, and marks an index of a used reference view as a coding result.
4. The scalable video coding apparatus of claim 2, wherein the enhancement layer encoding means further includes a two-dimensional (2-D) spatial interpolation means for performing 2-D spatial interpolation on a video frame restored for intra prediction for an intra block.
5. The scalable video coding apparatus of claim 2, wherein the enhancement layer encoding means performs coding through motions between frames, differential video and intra prediction on a temporal and spatial axis.
6. The scalable video coding apparatus of claim 5, wherein the enhancement layer encoding means performs motion estimation using a value obtained by multiplying by two a motion vector of a basic layer image that is a low resolution image for an own video and of a basic layer image that is a low resolution image for an adjacent video.
7. The scalable video coding apparatus of claim 5, wherein the enhancement layer encoding means performs differential video prediction by interpolating remaining images after predicting a basic layer image that is a low resolution image for own video and a basic layer image that is a low resolution image for an adjacent video.
8. A scalable video coding method for multiview video, comprising the steps of:
(a) separating one basic video into video frames with multilayer resolutions and performing scalable video coding through temporal and spatial prediction; and
(b) receiving an own video and at least one of adjacent videos, which are captured at the same time, and performing scalable video coding through temporal and spatial prediction by separating the received videos into video frames with multilayer resolutions, wherein the step of (b) receiving an own video and at least one of adjacent videos includes the steps of:
(c) performing scalable video coding for low resolution video frames through temporal and spatial prediction with reference to the adjacent video frames at the same temporal axis as well as own adjacent frames; and
(d) performing scalable video coding for at least one of high resolution video frames through temporal and spatial prediction with reference to lower layers of the adjacent video frames as well as own lower layer.
9. The scalable video coding method of claim 8, wherein as a result of performing the step of (d) performing scalable video coding for at least one of high resolution video frames, a flag indicating whether motion vector information of a lower layer is used or not, a flag indicating whether a reference index of a lower layer for an adjacent video is used as prediction information or not, a flag indicating whether a type of an intra block of a lower layer is used as prediction information or not, and a flag indicating whether a differential video value of a lower layer is used or not are set, and an index of a used reference view is marked.
10. The scalable video coding method of claim 8, wherein in the step of (d) performing scalable video coding for at least one of high resolution video frames, a video is coded through motions between frames, differential images, and intra prediction on a temporal and spatial axis.
11. The scalable video coding method of claim 8, wherein in the step of (d) performing scalable video coding for at least one of high resolution video frames, two-dimensional spatial interpolation is performed for a video frame restored for intra prediction for an intra block.
12. The scalable video coding method of claim 8, wherein in the step of (d) performing scalable video coding for at least one of high resolution video frames, motion estimation is performed using a value obtained by multiplying by two a motion vector of a basic layer image that is a low resolution image for an own video and of a basic layer image that is a low resolution image for an adjacent video.
13. The scalable video coding method of claim 8, wherein in the step of (d) performing scalable video coding for at least one of high resolution video frames, differential video prediction is performed by interpolating remaining images after prediction of a basic layer image that is a low resolution image for own video and a basic layer image that is a low resolution image for an adjacent video.
14. A scalable video decoding apparatus for multiview video, comprising: a basic scalability video decoder for receiving a bitstream generated by scalably coding one basic video and restoring a basic video through inverse temporal and inverse spatial transform; and a plurality of extended scalability video decoders for receiving a bitstream scalable-coded through temporal and spatial prediction for an own video and reference videos, which are captured at the same time, restoring at least one of high resolution image frames through inverse temporal and spatial prediction according to whether lower layers of adjacent video frames, which are reference videos, are referred to as well as an own lower layer, restoring a low resolution image frame through inverse temporal and spatial prediction according to whether the adjacent video frames are referred to at the same temporal axis as well as an own adjacent frame, and restoring an image through inverse spatial filtering for the restored high resolution image frames and the restored low resolution image frame.
15. The scalable video decoding apparatus of claim 14, wherein the extended scalability video decoder includes: a demultiplexing means for demultiplexing a received bitstream; at least one of enhancement decoding means for performing scalable decoding for a high resolution image signal outputted from the demultiplexing means through inverse temporal and spatial motion estimation according to whether lower layers of adjacent videos, which are reference videos, are referred to as well as a lower layer of an own video; a basic layer decoding means for performing scalable decoding for a low resolution image signal outputted from the demultiplexing means through inverse motion estimation for reference video frames at a temporal axis as well as inverse temporal and spatial motion estimation for an own video frame; and an inverse spatial video filtering means for restoring an image through inverse spatial filtering for the restored high resolution images from the enhancement decoding means and the restored low resolution image from the basic layer decoding means.
16. The scalable video decoding apparatus of claim 15, wherein the enhancement decoding means performs scalable decoding with reference to a flag indicating whether motion vector information of a lower layer is used or not, a flag indicating whether a reference index of a lower layer for adjacent video is used as prediction information or not, a flag indicating whether a type of an intra block of a lower layer is used as prediction information or not, a flag indicating whether a differential image value of a lower layer is used or not, and an index of a reference view used for prediction.
17. A scalable video decoding method for multiview video, comprising the steps of:
(a) performing scalable video decoding for one basic video through inverse temporal and spatial prediction; and
(b) receiving a bitstream scalable-coded with reference to an own video and at least one of adjacent videos as reference videos, which are captured at the same time, and performing scalable video decoding through inverse temporal prediction and inverse spatial prediction, wherein the step of (b) receiving a bitstream scalable-coded with reference to an own video and at least one of adjacent videos as reference videos includes the steps of:
(c) performing scalable video decoding for a demultiplexed high resolution image signal through inverse temporal and spatial prediction according to whether lower layers of the adjacent video frames are referred to as well as an own lower layer; and
(d) performing scalable video decoding for a demultiplexed low resolution image signal through inverse temporal and spatial prediction according to whether the adjacent video frames are referred to at the same temporal axis as well as an own adjacent frame.
18. The scalable video decoding method of claim 17, wherein in the step of (c) performing scalable video decoding for demultiplexed high resolution image signal, scalable decoding is performed with reference to a flag indicating whether motion vector information of a lower layer is used or not, a flag indicating whether a reference index of a lower layer for adjacent video is used as prediction information or not, a flag indicating whether a type of an intra block of a lower layer is used as prediction information or not, a flag indicating whether a differential image value of a lower layer is used or not, and an index of a reference view used for prediction .
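Claims 6 and 12 above describe reusing a base-layer motion vector, multiplied by two, as a predictor when the enhancement layer doubles the spatial resolution. A minimal sketch of that scaling follows; the function names are hypothetical, and a dyadic (exactly 2x) resolution ratio between the basic and enhancement layers is assumed.

```python
def upscale_motion_vector(mv, factor=2):
    """Scale a base-layer motion vector (dx, dy) for use as a predictor
    in an enhancement layer whose resolution is `factor` times larger."""
    dx, dy = mv
    return (dx * factor, dy * factor)

def enhancement_mv_predictors(own_base_mv, adjacent_base_mv):
    """Candidate predictors derived from the own-view and adjacent-view
    basic layer motion vectors, as in claims 6 and 12 (sketch only)."""
    return [upscale_motion_vector(own_base_mv),
            upscale_motion_vector(adjacent_base_mv)]
```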
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009534496A JP5170786B2 (en) | 2006-10-25 | 2007-10-25 | Multi-view video scalable coding and decoding method, and coding and decoding apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2006-0103923 | 2006-10-25 | ||
KR20060103923 | 2006-10-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008051041A1 true WO2008051041A1 (en) | 2008-05-02 |
Family
ID=39324782
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2007/005294 WO2008051041A1 (en) | 2006-10-25 | 2007-10-25 | Multi-view video scalable coding and decoding |
Country Status (3)
Country | Link |
---|---|
JP (1) | JP5170786B2 (en) |
KR (1) | KR100919885B1 (en) |
WO (1) | WO2008051041A1 (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009140913A1 (en) * | 2008-05-23 | 2009-11-26 | 华为技术有限公司 | Controlling method and device of multi-point meeting |
WO2010120804A1 (en) | 2009-04-13 | 2010-10-21 | Reald Inc. | Encoding, decoding, and distributing enhanced resolution stereoscopic video |
WO2010147289A1 (en) * | 2009-06-16 | 2010-12-23 | Lg Electronics Inc. | Broadcast transmitter, broadcast receiver and 3d video processing method thereof |
US20110012994A1 (en) * | 2009-07-17 | 2011-01-20 | Samsung Electronics Co., Ltd. | Method and apparatus for multi-view video coding and decoding |
WO2011042440A1 (en) * | 2009-10-08 | 2011-04-14 | Thomson Licensing | Method for multi-view coding and corresponding decoding method |
CN102036065A (en) * | 2009-10-05 | 2011-04-27 | 美国博通公司 | Method and system for video coding |
WO2012006299A1 (en) * | 2010-07-08 | 2012-01-12 | Dolby Laboratories Licensing Corporation | Systems and methods for multi-layered image and video delivery using reference processing signals |
CN102957910A (en) * | 2011-08-09 | 2013-03-06 | 索尼公司 | Image encoding apparatus, image encoding method and program |
CN103026706A (en) * | 2010-07-21 | 2013-04-03 | 杜比实验室特许公司 | Systems and methods for multi-layered frame-compatible video delivery |
EP2587804A1 (en) * | 2011-10-28 | 2013-05-01 | Samsung Electronics Co., Ltd | Method and apparatus for hierarchically encoding and decoding of a two-dimensional image, of a stereo image, and of a three-dimensional image |
EP2700233A2 (en) * | 2011-04-19 | 2014-02-26 | Samsung Electronics Co., Ltd. | Method and apparatus for unified scalable video encoding for multi-view video and method and apparatus for unified scalable video decoding for multi-view video |
CN103733620A (en) * | 2011-08-11 | 2014-04-16 | 高通股份有限公司 | Three-dimensional video with asymmetric spatial resolution |
US20140133567A1 (en) * | 2012-04-16 | 2014-05-15 | Nokia Corporation | Apparatus, a method and a computer program for video coding and decoding |
CN103828371A (en) * | 2011-09-22 | 2014-05-28 | 松下电器产业株式会社 | Moving-image encoding method, moving-image encoding device, moving image decoding method, and moving image decoding device |
US8855199B2 (en) * | 2008-04-21 | 2014-10-07 | Nokia Corporation | Method and device for video coding and decoding |
CN105025312A (en) * | 2008-12-30 | 2015-11-04 | Lg电子株式会社 | Digital broadcast receiving method providing two-dimensional image and 3d image integration service, and digital broadcast receiving device using the same |
TWI552575B (en) * | 2011-08-09 | 2016-10-01 | 三星電子股份有限公司 | Multi-view video prediction method and apparatus therefore and multi-view video prediction restoring method and apparatus therefore |
US9485503B2 (en) | 2011-11-18 | 2016-11-01 | Qualcomm Incorporated | Inside view motion prediction among texture and depth view components |
US9521418B2 (en) | 2011-07-22 | 2016-12-13 | Qualcomm Incorporated | Slice header three-dimensional video extension for slice header prediction |
US9648346B2 (en) | 2009-06-25 | 2017-05-09 | Microsoft Technology Licensing, Llc | Multi-view video compression and streaming based on viewpoints of remote viewer |
US10027943B2 (en) | 2012-04-03 | 2018-07-17 | Sun Patent Trust | Image encoding method, image decoding method, image encoding device, and image decoding device |
US11496760B2 (en) | 2011-07-22 | 2022-11-08 | Qualcomm Incorporated | Slice header prediction for depth maps in three-dimensional video codecs |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101012760B1 (en) * | 2008-09-05 | 2011-02-08 | 에스케이 텔레콤주식회사 | System and Method for transmitting and receiving of Multi-view video |
KR101146138B1 (en) * | 2008-12-10 | 2012-05-16 | 한국전자통신연구원 | Temporal scalabel video encoder |
KR101144752B1 (en) * | 2009-08-05 | 2012-05-09 | 경희대학교 산학협력단 | video encoding/decoding method and apparatus thereof |
WO2011016701A2 (en) * | 2009-08-07 | 2011-02-10 | 한국전자통신연구원 | Motion picture encoding apparatus and method thereof |
KR20110015356A (en) | 2009-08-07 | 2011-02-15 | 한국전자통신연구원 | Video encoding and decoding apparatus and method using adaptive transform and quantization domain that based on a differential image signal characteristic |
EP2591602A1 (en) * | 2010-07-06 | 2013-05-15 | Koninklijke Philips Electronics N.V. | Generation of high dynamic range images from low dynamic range images |
JP5663093B2 (en) * | 2010-10-01 | 2015-02-04 | ドルビー ラボラトリーズ ライセンシング コーポレイション | Optimized filter selection for reference picture processing |
WO2013051896A1 (en) * | 2011-10-05 | 2013-04-11 | 한국전자통신연구원 | Video encoding/decoding method and apparatus for same |
WO2013076991A1 (en) * | 2011-11-25 | 2013-05-30 | パナソニック株式会社 | Image coding method, image coding device, image decoding method and image decoding device |
KR101346349B1 (en) * | 2012-01-30 | 2013-12-31 | 광운대학교 산학협력단 | Apparatus and Method for scalable multi-view video decoding |
WO2013115609A1 (en) * | 2012-02-02 | 2013-08-08 | 한국전자통신연구원 | Interlayer prediction method and device for image signal |
JP6050488B2 (en) * | 2012-07-06 | 2016-12-21 | サムスン エレクトロニクス カンパニー リミテッド | Multi-layer video encoding method and apparatus for random access, and multi-layer video decoding method and apparatus for random access |
WO2014088316A2 (en) * | 2012-12-04 | 2014-06-12 | 인텔렉추얼 디스커버리 주식회사 | Video encoding and decoding method, and apparatus using same |
EP2961166B1 (en) * | 2013-02-25 | 2020-04-01 | LG Electronics Inc. | Method for encoding video of multi-layer structure supporting scalability and method for decoding same and apparatus therefor |
US10616607B2 (en) | 2013-02-25 | 2020-04-07 | Lg Electronics Inc. | Method for encoding video of multi-layer structure supporting scalability and method for decoding same and apparatus therefor |
KR101595397B1 (en) * | 2013-07-26 | 2016-02-29 | 경희대학교 산학협력단 | Method and apparatus for integrated encoding/decoding of different multilayer video codec |
WO2015016535A1 (en) * | 2013-07-30 | 2015-02-05 | 주식회사 케이티 | Image encoding and decoding method supporting plurality of layers and apparatus using same |
US9894369B2 (en) | 2013-07-30 | 2018-02-13 | Kt Corporation | Image encoding and decoding method supporting plurality of layers and apparatus using same |
US9762909B2 (en) | 2013-07-30 | 2017-09-12 | Kt Corporation | Image encoding and decoding method supporting plurality of layers and apparatus using same |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030202592A1 (en) * | 2002-04-20 | 2003-10-30 | Sohn Kwang Hoon | Apparatus for encoding a multi-view moving picture |
WO2006062377A1 (en) * | 2004-12-10 | 2006-06-15 | Electronics And Telecommunications Research Institute | Apparatus for universal coding for multi-view video |
WO2006104326A1 (en) * | 2005-04-01 | 2006-10-05 | Industry Academic Cooperation Foundation Kyunghee University | Scalable multi-view image encoding and decoding apparatuses and methods |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7468745B2 (en) * | 2004-12-17 | 2008-12-23 | Mitsubishi Electric Research Laboratories, Inc. | Multiview video decomposition and encoding |
KR20060101847A (en) * | 2005-03-21 | 2006-09-26 | 엘지전자 주식회사 | Method for scalably encoding and decoding video signal |
2007
- 2007-10-25 JP JP2009534496A patent/JP5170786B2/en not_active Expired - Fee Related
- 2007-10-25 WO PCT/KR2007/005294 patent/WO2008051041A1/en active Application Filing
- 2007-10-25 KR KR1020070108021A patent/KR100919885B1/en not_active IP Right Cessation
Cited By (57)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8855199B2 (en) * | 2008-04-21 | 2014-10-07 | Nokia Corporation | Method and device for video coding and decoding |
KR101224097B1 (en) | 2008-05-23 | 2013-01-21 | 후아웨이 테크놀러지 컴퍼니 리미티드 | Controlling method and device of multi-point meeting |
WO2009140913A1 (en) * | 2008-05-23 | 2009-11-26 | 华为技术有限公司 | Controlling method and device of multi-point meeting |
US8339440B2 (en) | 2008-05-23 | 2012-12-25 | Huawei Technologies Co., Ltd. | Method and apparatus for controlling multipoint conference |
CN105025312A (en) * | 2008-12-30 | 2015-11-04 | Lg电子株式会社 | Digital broadcast receiving method providing two-dimensional image and 3d image integration service, and digital broadcast receiving device using the same |
EP2420068A1 (en) * | 2009-04-13 | 2012-02-22 | RealD Inc. | Encoding, decoding, and distributing enhanced resolution stereoscopic video |
WO2010120804A1 (en) | 2009-04-13 | 2010-10-21 | Reald Inc. | Encoding, decoding, and distributing enhanced resolution stereoscopic video |
CN102804785A (en) * | 2009-04-13 | 2012-11-28 | 瑞尔D股份有限公司 | Encoding, decoding, and distributing enhanced resolution stereoscopic video |
EP2420068A4 (en) * | 2009-04-13 | 2012-08-08 | Reald Inc | Encoding, decoding, and distributing enhanced resolution stereoscopic video |
US20120092453A1 (en) * | 2009-06-16 | 2012-04-19 | Jong Yeul Suh | Broadcast transmitter, broadcast receiver and 3d video processing method thereof |
CN105025309A (en) * | 2009-06-16 | 2015-11-04 | Lg电子株式会社 | Broadcast transmitter and 3D video data processing method thereof |
CN102461183A (en) * | 2009-06-16 | 2012-05-16 | Lg电子株式会社 | Broadcast transmitter, broadcast receiver and 3d video processing method thereof |
US9578302B2 (en) | 2009-06-16 | 2017-02-21 | Lg Electronics Inc. | Broadcast transmitter, broadcast receiver and 3D video data processing method thereof |
US20150350625A1 (en) * | 2009-06-16 | 2015-12-03 | Lg Electronics Inc. | Broadcast transmitter, broadcast receiver and 3d video data processing method thereof |
WO2010147289A1 (en) * | 2009-06-16 | 2010-12-23 | Lg Electronics Inc. | Broadcast transmitter, broadcast receiver and 3d video processing method thereof |
US9088817B2 (en) | 2009-06-16 | 2015-07-21 | Lg Electronics Inc. | Broadcast transmitter, broadcast receiver and 3D video processing method thereof |
US9648346B2 (en) | 2009-06-25 | 2017-05-09 | Microsoft Technology Licensing, Llc | Multi-view video compression and streaming based on viewpoints of remote viewer |
CN102577376A (en) * | 2009-07-17 | 2012-07-11 | 三星电子株式会社 | Method and apparatus for multi-view video coding and decoding |
CN102577376B (en) * | 2009-07-17 | 2015-05-27 | 三星电子株式会社 | Method, apparatus and system for multi-view video coding and decoding |
JP2012533925A (en) * | 2009-07-17 | 2012-12-27 | サムスン エレクトロニクス カンパニー リミテッド | Method and apparatus for multi-view video encoding and decoding |
US20110012994A1 (en) * | 2009-07-17 | 2011-01-20 | Samsung Electronics Co., Ltd. | Method and apparatus for multi-view video coding and decoding |
EP2306730A3 (en) * | 2009-10-05 | 2011-07-06 | Broadcom Corporation | Method and system for 3D video decoding using a tier system framework |
CN102036065A (en) * | 2009-10-05 | 2011-04-27 | 美国博通公司 | Method and system for video coding |
FR2951346A1 (en) * | 2009-10-08 | 2011-04-15 | Thomson Licensing | MULTIVATED CODING METHOD AND CORRESPONDING DECODING METHOD |
WO2011042440A1 (en) * | 2009-10-08 | 2011-04-14 | Thomson Licensing | Method for multi-view coding and corresponding decoding method |
CN103155568A (en) * | 2010-07-08 | 2013-06-12 | 杜比实验室特许公司 | Systems and methods for multi-layered image and video delivery using reference processing signals |
US10531120B2 (en) | 2010-07-08 | 2020-01-07 | Dolby Laboratories Licensing Corporation | Systems and methods for multi-layered image and video delivery using reference processing signals |
WO2012006299A1 (en) * | 2010-07-08 | 2012-01-12 | Dolby Laboratories Licensing Corporation | Systems and methods for multi-layered image and video delivery using reference processing signals |
US9467689B2 (en) | 2010-07-08 | 2016-10-11 | Dolby Laboratories Licensing Corporation | Systems and methods for multi-layered image and video delivery using reference processing signals |
US11044454B2 (en) | 2010-07-21 | 2021-06-22 | Dolby Laboratories Licensing Corporation | Systems and methods for multi-layered frame compatible video delivery |
CN105847780A (en) * | 2010-07-21 | 2016-08-10 | 杜比实验室特许公司 | Decoding method for multi-layered frame-compatible video delivery |
US10142611B2 (en) | 2010-07-21 | 2018-11-27 | Dolby Laboratories Licensing Corporation | Systems and methods for multi-layered frame-compatible video delivery |
JP2013538487A (en) * | 2010-07-21 | 2013-10-10 | ドルビー ラボラトリーズ ライセンシング コーポレイション | System and method for multi-layer frame compliant video delivery |
CN103026706A (en) * | 2010-07-21 | 2013-04-03 | 杜比实验室特许公司 | Systems and methods for multi-layered frame-compatible video delivery |
CN105847781A (en) * | 2010-07-21 | 2016-08-10 | 杜比实验室特许公司 | Decoding method for multi-layered frame-compatible video delivery |
CN105812828A (en) * | 2010-07-21 | 2016-07-27 | 杜比实验室特许公司 | Decoding method for multilayer frame compatible video transmission |
EP2700233A4 (en) * | 2011-04-19 | 2014-09-17 | Samsung Electronics Co Ltd | Method and apparatus for unified scalable video encoding for multi-view video and method and apparatus for unified scalable video decoding for multi-view video |
EP2700233A2 (en) * | 2011-04-19 | 2014-02-26 | Samsung Electronics Co., Ltd. | Method and apparatus for unified scalable video encoding for multi-view video and method and apparatus for unified scalable video decoding for multi-view video |
US11496760B2 (en) | 2011-07-22 | 2022-11-08 | Qualcomm Incorporated | Slice header prediction for depth maps in three-dimensional video codecs |
US9521418B2 (en) | 2011-07-22 | 2016-12-13 | Qualcomm Incorporated | Slice header three-dimensional video extension for slice header prediction |
CN102957910A (en) * | 2011-08-09 | 2013-03-06 | 索尼公司 | Image encoding apparatus, image encoding method and program |
TWI552575B (en) * | 2011-08-09 | 2016-10-01 | 三星電子股份有限公司 | Multi-view video prediction method and apparatus therefore and multi-view video prediction restoring method and apparatus therefore |
US9973778B2 (en) | 2011-08-09 | 2018-05-15 | Samsung Electronics Co., Ltd. | Method for multiview video prediction encoding and device for same, and method for multiview video prediction decoding and device for same |
CN103733620A (en) * | 2011-08-11 | 2014-04-16 | 高通股份有限公司 | Three-dimensional video with asymmetric spatial resolution |
US9288505B2 (en) | 2011-08-11 | 2016-03-15 | Qualcomm Incorporated | Three-dimensional video with asymmetric spatial resolution |
US10764604B2 (en) | 2011-09-22 | 2020-09-01 | Sun Patent Trust | Moving picture encoding method, moving picture encoding apparatus, moving picture decoding method, and moving picture decoding apparatus |
CN103828371B (en) * | 2011-09-22 | 2017-08-22 | 太阳专利托管公司 | Dynamic image encoding method, dynamic image encoding device and dynamic image decoding method and moving image decoding apparatus |
CN103828371A (en) * | 2011-09-22 | 2014-05-28 | 松下电器产业株式会社 | Moving-image encoding method, moving-image encoding device, moving image decoding method, and moving image decoding device |
US20140219338A1 (en) * | 2011-09-22 | 2014-08-07 | Panasonic Corporation | Moving picture encoding method, moving picture encoding apparatus, moving picture decoding method, and moving picture decoding apparatus |
EP2587804A1 (en) * | 2011-10-28 | 2013-05-01 | Samsung Electronics Co., Ltd | Method and apparatus for hierarchically encoding and decoding of a two-dimensional image, of a stereo image, and of a three-dimensional image |
US9191677B2 (en) | 2011-10-28 | 2015-11-17 | Samsung Electronics Co., Ltd. | Method and apparatus for encoding image and method and appartus for decoding image |
US9485503B2 (en) | 2011-11-18 | 2016-11-01 | Qualcomm Incorporated | Inside view motion prediction among texture and depth view components |
US10027943B2 (en) | 2012-04-03 | 2018-07-17 | Sun Patent Trust | Image encoding method, image decoding method, image encoding device, and image decoding device |
US10582183B2 (en) | 2012-04-03 | 2020-03-03 | Sun Patent Trust | Image encoding method, image decoding method, image encoding device, and image decoding device |
US20140133567A1 (en) * | 2012-04-16 | 2014-05-15 | Nokia Corporation | Apparatus, a method and a computer program for video coding and decoding |
EP2839660B1 (en) * | 2012-04-16 | 2020-10-07 | Nokia Technologies Oy | An apparatus, a method and a computer program for video coding and decoding |
US10863170B2 (en) | 2012-04-16 | 2020-12-08 | Nokia Technologies Oy | Apparatus, a method and a computer program for video coding and decoding on the basis of a motion vector |
Also Published As
Publication number | Publication date |
---|---|
KR20080037593A (en) | 2008-04-30 |
JP5170786B2 (en) | 2013-03-27 |
KR100919885B1 (en) | 2009-09-30 |
JP2010507961A (en) | 2010-03-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2008051041A1 (en) | Multi-view video scalable coding and decoding | |
KR100760258B1 (en) | Apparatus for Universal Coding for Multi-View Video | |
US7817181B2 (en) | Method, medium, and apparatus for 3-dimensional encoding and/or decoding of video | |
KR100763179B1 (en) | Method for compressing/Reconstructing motion vector of unsynchronized picture and apparatus thereof | |
US8644386B2 (en) | Method of estimating disparity vector, and method and apparatus for encoding and decoding multi-view moving picture using the disparity vector estimation method | |
KR100789753B1 (en) | Apparatus of predictive coding/decoding using view-temporal reference picture buffers and method using the same | |
EP1927250A1 (en) | Method of estimating disparity vector, and method and apparatus for encoding and decoding multi-view moving picture using the disparity vector estimation method | |
JP2007180981A (en) | Device, method, and program for encoding image | |
WO2007052969A1 (en) | Method and apparatus for encoding multiview video | |
WO2004059980A1 (en) | Method and apparatus for encoding and decoding stereoscopic video | |
KR100703746B1 (en) | Video coding method and apparatus for predicting effectively unsynchronized frame | |
MX2008002391A (en) | Method and apparatus for encoding multiview video. | |
EP1642463A1 (en) | Video coding in an overcomplete wavelet domain | |
JP2007180982A (en) | Device, method, and program for decoding image | |
KR20040065014A (en) | Apparatus and method for compressing/decompressing multi-viewpoint image | |
WO2006118384A1 (en) | Method and apparatus for encoding/decoding multi-layer video using weighted prediction | |
WO2006110007A1 (en) | Method for coding in multiview video coding/decoding system | |
WO2013039348A1 (en) | Method for signaling image information and video decoding method using same | |
KR100791453B1 (en) | Multi-view Video Encoding and Decoding Method and apparatus Using Motion Compensated Temporal Filtering | |
KR20110118744A (en) | 3d tv video encoding method, decoding method | |
JP2011091498A (en) | Moving image coder, moving image decoder, moving image coding method, and moving image decoding method | |
WO2006104357A1 (en) | Method for compressing/decompressing motion vectors of unsynchronized picture and apparatus using the same | |
Liu et al. | Fully scalable multiview wavelet video coding | |
Lim et al. | Motion/disparity compensated multiview sequence coding |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 07833603; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 2009534496; Country of ref document: JP; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 07833603; Country of ref document: EP; Kind code of ref document: A1 |