MXPA98004502A - Scalable temporal and spatial coding for video object planes - Google Patents

Scalable temporal and spatial coding for video object planes

Info

Publication number
MXPA98004502A
MXPA98004502A MXPA/A/1998/004502A MX9804502A
Authority
MX
Mexico
Prior art keywords
video object
video
base layer
planes
plane
Prior art date
Application number
MXPA/A/1998/004502A
Other languages
Spanish (es)
Other versions
MX9804502A (en)
Inventor
Chen Xuemin
Luthra Ajay
Rajan Ganesh
Narasimhan Mandayam
Original Assignee
General Instrument Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US08/869,493 (US6057884A)
Application filed by General Instrument Corporation
Publication of MX9804502A
Publication of MXPA98004502A

Abstract

The present invention relates to a method for scaling an input video sequence comprising video object planes for communication in a corresponding base layer and enhancement layer, the video object planes in the input video sequence having an associated spatial resolution and temporal resolution, the method comprising the steps of: down-sampling pixel data of a particular first plane of the video object planes of the input video sequence to provide a first video object plane of the base layer having a reduced spatial resolution; up-sampling pixel data of at least a portion of the first video object plane of the base layer to provide a first up-sampled video object plane in the enhancement layer; differentially encoding the first up-sampled video object plane using the particular first plane of the video object planes of the input video sequence for communication in the enhancement layer at a temporal position corresponding to the first video object plane of the base layer; down-sampling pixel data of a particular second plane of the video object planes of the input video sequence to provide a second video object plane of the base layer having a reduced spatial resolution; up-sampling pixel data of at least a portion of the second video object plane of the base layer to provide a second up-sampled video object plane in the enhancement layer corresponding to the first up-sampled video object plane; using at least one of the first and second video object planes of the base layer to predict an intermediate video object plane corresponding to the first and second up-sampled video object planes; and encoding the intermediate video object plane for communication in the enhancement layer at a temporal position which is intermediate to that of the first and second up-sampled video object planes.

Description

TEMPORAL AND SPATIAL SCALABLE CODING FOR VIDEO OBJECT PLANES

FIELD OF THE INVENTION

The present invention relates to a method and apparatus for providing temporal and spatial scaling of video images which include video object planes in a digital video sequence. In particular, a motion compensation scheme is presented which is suitable for use with scaled frame mode and field mode video. A scheme is also presented for adaptively compressing field mode video using a spatial transformation such as the discrete cosine transform (DCT).
BACKGROUND OF THE INVENTION

The invention is particularly suitable for use with various multimedia applications, and is compatible with the MPEG-4 Verification Model (VM) 7.0 standard described in document ISO/IEC JTC1/SC29/WG11 N1642, entitled "MPEG-4 Video Verification Model Version 7.0", April 1997, incorporated herein by reference. The invention may further provide stereoscopic video coding, picture-in-picture, preview access channels, and asynchronous transfer mode (ATM) communications.

MPEG-4 is a new coding standard which provides a flexible framework and an open set of coding tools for the communication, access, and manipulation of digital audiovisual data. These tools support a wide range of features. The flexible framework of MPEG-4 supports various combinations of coding tools and their corresponding functionalities for applications required by the computer, telecommunications, and entertainment (i.e., TV and film) industries, such as database browsing, information retrieval, and interactive communications.

MPEG-4 provides standardized core technologies allowing efficient storage, transmission and manipulation of video data in multimedia environments. MPEG-4 achieves efficient compression, object scalability, spatial and temporal scalability, and error resilience.

The MPEG-4 video encoder/decoder (codec) is a block- and object-based hybrid coder with motion compensation. Texture is encoded with an 8x8 discrete cosine transform (DCT) utilizing overlapped block motion compensation. Object shapes are represented as alpha maps and are encoded using a content-based arithmetic encoding (CAE) algorithm or a modified DCT coder, both using temporal prediction. The coder can handle sprites as they are known from computer graphics. Other coding methods, such as wavelet and sprite coding, may also be used for special applications.

Motion-compensated texture coding is a well-known approach to video coding. Such an approach can be modeled as a three-stage process. The first stage is signal processing which includes motion estimation and compensation (ME/MC) and a two-dimensional (2-D) spatial transformation. The objective of ME/MC and the spatial transformation is to take advantage of the temporal and spatial correlations in a video sequence to optimize the rate-distortion performance of quantization and entropy coding under a complexity constraint. The most common technique for ME/MC has been block matching, and the most common spatial transformation has been the DCT. However, special concerns arise for ME/MC and DCT coding of the boundary blocks of an arbitrarily shaped video object plane.

The MPEG-2 Main Profile is a precursor to the MPEG-4 standard, and is described in document ISO/IEC JTC1/SC29/WG11 N0702, entitled "Information Technology - Generic Coding of Moving Pictures and Associated Audio, Recommendation H.262," March 25, 1994, incorporated herein by reference. Scalability extensions to the MPEG-2 Main Profile have been defined which provide two or more separate bitstreams, or layers. The layers can be combined to form a single high-quality signal.
For example, the base layer may provide a lower quality video signal, while the enhancement layer provides additional information which can enhance the base layer image. In particular, spatial and temporal scalability can provide compatibility between different video standards or decoder capabilities. With spatial scalability, the base layer video may have a lower spatial resolution than the input video sequence, in which case the enhancement layer carries information which can restore the resolution of the base layer to the input sequence level. For example, an input video sequence corresponding to the International Telecommunication Union - Radiocommunication Sector (ITU-R) 601 standard (with a resolution of 720x576 pixels) may be carried in a base layer corresponding to the Common Intermediate Format (CIF) standard (with a resolution of 360x288 pixels). Similarly, with temporal scalability, the base layer may have a lower frame rate than the input video sequence, and the enhancement layer carries the missing frames. When the layers are combined at a decoder, the original frame rate is restored.

Accordingly, it would be desirable to provide temporal and spatial scalability capabilities for coding video signals including video object planes (VOPs) such as are used in the MPEG-4 standard. It would be desirable to have the capability to code stereoscopic video, picture-in-picture, preview access channels, and asynchronous transfer mode (ATM) communications. It would also be desirable to have a coder/decoder design of relatively low complexity and cost wherein the size of the search range is reduced for motion estimation in the predictive coding of bidirectionally predicted video object planes (B-VOPs) of the enhancement layer. It would further be desirable to efficiently code an interlaced input video signal which is scaled to base and enhancement layers by adaptively reordering pixel lines of an enhancement layer video object plane prior to determining a residue and spatially transforming the data. The present invention provides a system having the above and other advantages.
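By way of illustration only, the following minimal sketch (with illustrative function names, and a simple averaging filter standing in for any actual down-sampling filter) shows how a lower spatial resolution, lower frame rate base layer sequence might be derived from a full resolution input:

import numpy as np

def spatial_downsample_2to1(frame: np.ndarray) -> np.ndarray:
    # 2:1 spatial down-sampling by averaging 2x2 pixel blocks,
    # e.g. ITU-R 601 (720x576) -> CIF (360x288).
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def temporal_decimate(frames, keep_every: int = 2):
    # Frame selection (frame drop) processing: halving 30 frames/second
    # to 15 frames/second keeps every other frame.
    return frames[::keep_every]

# Example: derive a low-resolution, low-frame-rate base layer sequence
# from stand-in full-resolution luminance frames.
sequence = [np.random.rand(576, 720) for _ in range(30)]
base_layer = [spatial_downsample_2to1(f) for f in temporal_decimate(sequence)]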
SUMMARY OF THE INVENTION

In accordance with the present invention, a method and apparatus are presented for providing temporal and spatial scaling of video images such as video object planes in a digital video sequence. The video object planes may comprise a full frame and/or a subset of the frame, and may have an arbitrary shape. Additionally, a plurality of video object planes may be provided in a frame or otherwise be temporally coincident.

A method is presented for scaling an input video sequence comprising video object planes for communication in a corresponding base layer and enhancement layer, wherein down-sampled data is carried in the base layer. The video object planes in the input video sequence have an associated spatial resolution and temporal resolution (e.g., frame rate). Pixel data of a particular one of the video object planes of the input video sequence is down-sampled to provide a first video object plane of the base layer having a reduced spatial resolution. Pixel data of at least a portion of the first video object plane of the base layer is up-sampled to provide a first up-sampled video object plane in the enhancement layer. The first up-sampled video object plane is differentially encoded using the particular first plane of the video object planes of the input video sequence, and is provided in the enhancement layer at a temporal position corresponding to the first video object plane of the base layer.

The differential encoding includes the step of determining a residue according to a difference between the pixel data of the first up-sampled video object plane and the pixel data of the particular first plane of the video object planes of the input video sequence. The residue is spatially transformed to provide transform coefficients, for example, using the discrete cosine transform.

When the video object planes in the input video sequence are field mode video object planes, the differential encoding involves reordering the pixel data lines of the first up-sampled video object plane in a field mode prior to determining the residue, if the pixel data lines satisfy a reordering criterion. The criterion is whether a sum of differences of luminance values of lines of opposite fields (i.e., odd to even, and even to odd) is greater than a sum of differences of luminance values of lines of the same field (i.e., odd to odd, and even to even) plus a bias term.

The up-sampled pixel data may correspond to only a subset of the first video object plane of the base layer, such that a remaining portion of the first video object plane of the base layer which is not up-sampled has a lower spatial resolution than the up-sampled pixel data. A second video object plane of the base layer, and a corresponding second up-sampled video object plane in the enhancement layer, may be provided in a similar manner. One or both of the first and second video object planes of the base layer may be used to predict an intermediate video object plane which corresponds to the first and second up-sampled video object planes. The intermediate video object plane is encoded for communication in the enhancement layer at a temporal position between the first and second up-sampled video object planes.
Further, the enhancement layer may have a higher temporal resolution than the base layer when there is no intermediate base layer video object plane between the first and second video object planes of the base layer.

In a specific application, the base and enhancement layers provide a picture-in-picture (PIP) capability, wherein the PIP image is carried in the base layer, or a preview access channel capability, wherein a preview access image is carried in the base layer. In these applications, it is acceptable for the PIP image or the free preview image to have a reduced spatial and/or temporal resolution.

In an asynchronous transfer mode (ATM) application, lower bit rate, higher priority data may be provided in the base layer. In this case, the base layer is assigned a guaranteed bandwidth, but the enhancement layer data may occasionally be lost.

A method is also presented for scaling an input video sequence comprising video object planes in which down-sampled data is carried in the enhancement layer rather than the base layer. With this method, a particular first plane of the video object planes of the input video sequence is provided in the base layer as a first video object plane of the base layer, e.g., without changing the spatial resolution. Pixel data of at least a portion of the first video object plane of the base layer is down-sampled to provide a down-sampled video object plane in the enhancement layer at a temporal position corresponding to the first video object plane of the base layer. Corresponding pixel data of the particular first plane of the video object planes is down-sampled to provide a comparison video object plane, and the down-sampled first video object plane is differentially encoded using the comparison video object plane. The base and enhancement layers may provide a stereoscopic video capability wherein the image data in the enhancement layer has a lower spatial resolution than the image data in the base layer.

A method for coding a bidirectionally predicted video object plane (B-VOP) is also presented. First and second video object planes are provided in the base layer corresponding to video object planes of the input video sequence. The second video object plane of the base layer is a predicted video object plane (P-VOP) which is predicted from the first video object plane of the base layer according to a motion vector MVP. The B-VOP in the enhancement layer is provided temporally between the first and second video object planes of the base layer. The B-VOP is encoded using at least one of a forward motion vector MVf and a backward motion vector MVb which are obtained by scaling the motion vector MVP. This efficient coding technique avoids the need to perform an independent exhaustive search over the reference video object planes.

A temporal distance TRP separates the first and second video object planes of the base layer, while a temporal distance TRB separates the first video object plane of the base layer and the B-VOP. A ratio m/n is defined as the ratio of the spatial resolution of the first and second video object planes of the base layer to the spatial resolution of the B-VOP.
That is, either the base layer video object planes or the B-VOP in the enhancement layer may be down-sampled relative to the video object planes of the input video sequence by a ratio m/n. It is assumed that either the base layer video object plane or that of the enhancement layer has the same spatial resolution as the input video sequence. The forward motion vector MVf is determined according to the relation MVf = (m/n) · TRB · MVP / TRP, while the backward motion vector MVb is determined according to the relation MVb = (m/n) · (TRB − TRP) · MVP / TRP. m/n may be any number, including fractional values. The B-VOP is encoded using a search region of the first video object plane of the base layer whose center is determined in accordance with the forward motion vector MVf, and a search region of the second video object plane of the base layer whose center is determined in accordance with the backward motion vector MVb. Corresponding decoder methods and apparatus are also presented.
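A minimal sketch of the motion vector scaling just described follows; the function name and tuple representation of the vectors are illustrative assumptions:

def scale_enhancement_mv(mv_p, tr_b, tr_p, m_over_n):
    # Derive the B-VOP search centers in the enhancement layer from the
    # base layer P-VOP motion vector, per the relations given above.
    #   mv_p     : (x, y) motion vector MVP of the base layer P-VOP
    #   tr_b     : temporal distance TRB (past reference to the B-VOP)
    #   tr_p     : temporal distance TRP (between the two base layer references)
    #   m_over_n : the spatial resolution ratio m/n as defined above
    mvx, mvy = mv_p
    mv_f = (m_over_n * tr_b * mvx / tr_p,
            m_over_n * tr_b * mvy / tr_p)
    mv_b = (m_over_n * (tr_b - tr_p) * mvx / tr_p,
            m_over_n * (tr_b - tr_p) * mvy / tr_p)
    return mv_f, mv_b

# Worked example matching Figure 5, discussed later (TRB = 1, TRP = 3,
# m/n = 2): MVf = (2/3)·MVP and MVb = (-4/3)·MVP.
mv_f, mv_b = scale_enhancement_mv((6.0, 3.0), tr_b=1, tr_p=3, m_over_n=2)
# mv_f == (4.0, 2.0); mv_b == (-8.0, -4.0)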
BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is an illustration of a video object plane coding and decoding process in accordance with the present invention.
Figure 2 is a block diagram of a video object plane coder and decoder in accordance with the present invention.
Figure 3 is an illustration of pixel up-sampling in accordance with the present invention.
Figure 4 is an illustration of an example prediction process between video object planes in a base layer and an enhancement layer.
Figure 5 is an illustration of temporal and spatial scaling of a video object plane in accordance with the present invention.
Figure 6 illustrates the reordering of pixel lines from frame mode to field mode in accordance with the present invention.
Figure 7 is an illustration of a picture-in-picture or preview access channel application with temporal and spatial scaling in accordance with the present invention.
Figure 8 is an illustration of a stereoscopic video application in accordance with the present invention.
DETAILED DESCRIPTION OF THE INVENTION

A method and apparatus are presented for providing temporal and spatial scaling of video images which include video object planes in a digital video sequence.

Figure 1 is an illustration of a video object coding and decoding process in accordance with the present invention. Frame 105 includes three image elements, including a square foreground element 107, a rectangular foreground element 108, and a landscape background element 109. In frame 115, the elements are designated as video object planes using a segmentation mask, such that the video object plane 117 represents the square foreground element 107, the video object plane 118 represents the rectangular foreground element 108, and the video object plane 119 represents the landscape background element 109. A video object plane may have an arbitrary shape, and a succession of video object planes is known as a video object. A full rectangular video frame may also be considered to be a video object plane. Thus, the term "video object plane" will be used herein to indicate both arbitrary and non-arbitrary image area shapes. A segmentation mask is obtained using known techniques, and has a format similar to that of ITU-R 601 luminance data. Each pixel is identified as belonging to a certain region in the video frame.

The frame 105 and the video object plane data of frame 115 are supplied to separate encoding functions. In particular, the video object planes 117, 118 and 119 undergo shape, motion and texture encoding at encoders 137, 138 and 139, respectively. With shape coding, binary and gray scale shape information is encoded. With motion coding, the shape information is encoded using motion estimation within a frame. With texture coding, a spatial transformation such as the discrete cosine transform is performed to obtain transform coefficients which can be variable-length coded for compression.

The coded video object plane data is combined at a multiplexer (MUX) 140 for transmission over a channel 145. Alternatively, the data may be stored on a recording medium. The received coded video object plane data is separated by a demultiplexer (DEMUX) 150 so that the separate video object planes 117-119 may be decoded and recovered. Frames 155, 165 and 175 show that the video object planes 117, 118 and 119, respectively, have been decoded and recovered and can therefore be individually manipulated using a compositor 160 which interfaces with a video library 170, for example.

The compositor may be a device such as a personal computer which is located in a user's home to allow the user to edit the received data to provide a customized image. For example, the user's personal video library 170 may include a previously stored video object plane 178 (e.g., a circle) which is different from the received video object planes. The user may compose a frame 185 where the circular video object plane 178 replaces the square video object plane 117. The frame 185 thus includes the received video object planes 118 and 119, and the locally stored video object plane 178. In another example, the background video object plane 119 may be replaced by a background of the user's choosing. For example, when a news broadcast is viewed, the announcer may be coded as a video object plane which is separate from the background, such as a news studio set. The user may select a background from the library 170 or from another television program, such as a channel with stock price or weather information.
The user can therefore act as a video editor. The video library 170 may also store video object planes which are received via the channel 145, and may access video object planes and other image elements via a network such as the Internet. It should be appreciated that the frame 105 may include regions which are not video object planes and therefore cannot be individually manipulated. Furthermore, the frame 105 need not have any video object planes. Generally, a video session comprises a single video object plane, or a sequence of video object planes.

The video object coding and decoding process of Figure 1 enables many entertainment, business and educational applications, including personal computer games, virtual environments, graphical user interfaces, video conferencing, Internet applications and the like. In particular, the capability for temporal and spatial scaling of video object planes in accordance with the present invention provides even greater capabilities.

Figure 2 is a block diagram of a video object coder and decoder in accordance with the present invention. The encoder 201, which corresponds to the elements 137-139 shown schematically in Figure 1, includes a scalability preprocessor 205 which receives the input video data sequence "in". To achieve spatial scalability with the base layer having a lower spatial resolution than the enhancement layer, "in" is down-sampled to obtain the signal "in_0", which, in turn, is provided to a base layer encoder 220 via a path 217. "in_0" is encoded at the base layer encoder 220, and the encoded data is provided to a multiplexer (MUX) 230. An MPEG-4 System and Description Language (MSDL) multiplexer may be used. Note that when the input video sequence "in" is in a field (interlaced) mode, the down-sampled signal "in_0" will be in a frame (progressive) mode, since down-sampling does not preserve the pixel data in separate even and odd fields. Of course, "in_0" will also be in frame mode when "in" is in frame mode.

Reconstructed image data is provided from the base layer encoder 220 to a midprocessor 215 via a path 218, which may perform up-sampling, as discussed in greater detail below in connection with Figure 3. The up-sampled image data, which is in frame mode, is then provided to an enhancement layer encoder 210 via a path 212, where it is differentially encoded using the input image data "in_1" provided from the preprocessor 205 to the encoder 210 via a path 207. In particular, the up-sampled pixel data (e.g., luminance data) is subtracted from the input image data to obtain a residue, which is encoded using the discrete cosine transform or another spatial transformation; a sketch of this differential encoding path is given below.

In accordance with the present invention, when the input video sequence is in field mode, the coding efficiency can be improved by regrouping the pixel lines of the up-sampled enhancement layer image which correspond to the original even (top) and odd (bottom) fields of the input video sequence. This can decrease the magnitude of the residue in some cases, since pixel data within a field will often have a higher correlation with other pixel data in the same field than with data in the opposite field. Thus, by reducing the magnitude of the residue, fewer bits are required to code the image data. Refer to Figure 6 and the associated discussion, below, for further details.
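A minimal sketch of this differential encoding path follows, assuming numpy-style luminance arrays; pixel replication stands in for the interpolating up-sampling filter of Figure 3, and the helper names are illustrative:

import numpy as np
from scipy.fftpack import dct

def upsample_2to1(plane: np.ndarray) -> np.ndarray:
    # 2:1 up-sampling by pixel replication -- a simple stand-in for the
    # interpolating up-sampling filter discussed with Figure 3.
    return np.repeat(np.repeat(plane, 2, axis=0), 2, axis=1)

def enhancement_residue(in_1: np.ndarray, reconstructed_base: np.ndarray) -> np.ndarray:
    # Subtract the up-sampled reconstructed base layer data (path 212)
    # from the input image data "in_1" (path 207) to obtain the residue.
    return in_1 - upsample_2to1(reconstructed_base)

def dct2(block: np.ndarray) -> np.ndarray:
    # Forward 2-D DCT of an 8x8 residue block, applied prior to
    # quantization and variable-length coding.
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')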
The coded residue of the up-sampled image in the enhancement layer is provided to the MUX 230 for transmission with the base layer data over a communication channel 245. Alternatively, the data may be stored locally. Note that the MUX 230, the channel 245 and the DEMUX 250 correspond, respectively, to the elements 140, 145 and 150 in Figure 1. Note also that the image data which is provided to the midprocessor 215 from the base layer encoder 220 may be an entire video image, such as a full-frame video object plane, or a video object plane which is a subset of the full image. Moreover, a plurality of video object planes may be provided to the midprocessor 215. The MPEG-4 standard currently supports up to 256 video object planes.

At a decoder 299, the coded data is received at a demultiplexer (DEMUX) 250, such as an MPEG-4 MSDL DEMUX. The enhancement layer data, which has a higher spatial resolution than the base layer data in the present example, is provided to an enhancement layer decoder 260. The base layer data is provided to a base layer decoder 270, where the signal "out_0" is recovered and provided to a midprocessor 265 via a path 267, and to a scalability postprocessor 280 via a path 277. The midprocessor 265 operates in a manner similar to the midprocessor 215 at the encoder side by up-sampling the base layer data to recover a full-resolution image. This image is provided to the enhancement layer decoder 260 via a path 262 for use in recovering the enhancement layer data signal "out_1", which is then provided to the scalability postprocessor 280 via a path 272. The scalability postprocessor 280 performs operations such as up-sampling of the decoded base layer data for display as the signal "outp_0", while the enhancement layer data is output for display as the signal "outp_1".

When the encoder 201 is used for temporal scalability, the preprocessor 205 performs temporal demultiplexing (e.g., frame selection or frame dropping) to reduce the frame rate of, for example, the base layer. For example, to reduce the frame rate from 30 frames/second to 15 frames/second, every other frame is dropped. Table 1, below, shows twenty-four possible configurations of the preprocessor 205, the midprocessors 215 and 265, and the scalability postprocessor 280.
Table 1

In Table 1, the first column indicates the configuration number, the second column indicates the layer, and the third column indicates the temporal resolution of the layer (e.g., either high or low). Where "Low (High)" is listed, the temporal resolutions of the base and enhancement layers are either both high or both low. The fourth column indicates the spatial resolution. The fifth, sixth and seventh columns indicate the corresponding action of the scalability preprocessor 205, the midprocessors 215 and 265, and the postprocessor 280. "N/C" denotes no change in the temporal or spatial resolution, i.e., normal processing is performed. "N/A" means not applicable. The actions of the midprocessors 215, 265 do not affect the enhancement layer.

Spatially scalable coding is illustrated using configuration 1 as an example. As mentioned, when the scalable encoder 201 is used to code a video object plane, the preprocessor 205 produces two video object plane substreams with different spatial resolutions. As shown in Table 1, the base layer has a low spatial resolution, and the enhancement layer has a high spatial resolution which corresponds to the resolution of the input sequence. Accordingly, the base layer sequence "in_0" is generated by a down-sampling process on the input video sequence "in" at the scalability preprocessor 205. The enhancement layer sequence is generated by up-sampling filtering of the down-sampled base layer at the midprocessors 215, 265 to achieve the same spatial resolution as "in". The postprocessor 280 performs normal processing, i.e., it does not change the temporal or spatial resolution of "out_1" or "out_0". For example, a base layer sequence of Common Intermediate Format resolution (360x288 pixels) can be generated by 2:1 down-sampling filtering of an ITU-R 601 resolution input sequence (720x576 pixels). Down-sampling by any integer or non-integer ratio may be used.

Temporally and spatially scalable coding is illustrated using configuration 2 as an example. Here, the input video sequence "in", which has a high spatial and temporal resolution, is converted to a base layer sequence having a low spatial and temporal resolution, and an enhancement layer sequence having a high spatial and temporal resolution. This is accomplished, as indicated by Table 1, by performing down-sampling filtering and frame selection at the preprocessor 205 to provide the signal "in_0", with up-sampling filtering at the midprocessors 215, 265, and normal processing at the postprocessor 280.

With configuration 3, the input video sequence "in", which has a low or high temporal resolution and a high spatial resolution, is converted to a base layer sequence having a corresponding low or high temporal resolution and a high spatial resolution, and an enhancement layer sequence having a corresponding low or high temporal resolution and a low spatial resolution. This is accomplished by performing down-sampling filtering for the enhancement layer sequence "in_1" at the preprocessor 205, with down-sampling filtering at the midprocessors 215, 265, and up-sampling filtering for the enhancement layer sequence "out_1" at the postprocessor 280. The remaining configurations can be understood in view of the foregoing examples; the sketch below restates the three configurations just described.
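Since the body of Table 1 is not reproduced here, the following illustrative restatement covers only the three configurations described in the text; the data structure and wording are assumptions (the temporal entries for configuration 1 are inferred to be the same for both layers), and the remaining twenty-one rows follow the same pattern:

# Partial, inferred restatement of Table 1 (configurations 1-3 only).
# "N/C" = no change, i.e., normal processing.
TABLE_1_PARTIAL = {
    1: dict(base=("low (high) temporal", "low spatial"),
            enh=("low (high) temporal", "high spatial"),
            preprocessor="down-sampling filter (base layer)",
            midprocessor="up-sampling filter",
            postprocessor="N/C"),
    2: dict(base=("low temporal", "low spatial"),
            enh=("high temporal", "high spatial"),
            preprocessor="down-sampling filter + frame selection (base layer)",
            midprocessor="up-sampling filter",
            postprocessor="N/C"),
    3: dict(base=("low or high temporal", "high spatial"),
            enh=("low or high temporal", "low spatial"),
            preprocessor="down-sampling filter (enhancement layer)",
            midprocessor="down-sampling filter",
            postprocessor="up-sampling filter (enhancement layer output)"),
}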
Figure 3 is an illustration of pixel up-sampling in accordance with the present invention. Up-sampling filtering may be carried out by the midprocessors 215, 265 with configuration 1 of Table 1. For example, a video object plane having a Common Intermediate Format resolution (360x288 pixels) may be up-sampled 2:1. Pixels 310, 320, 330 and 340 of the Common Intermediate Format image are sampled to produce pixels 355, 360, 365, 370, 375, 380, 385 and 390 of the ITU-R 601 image. In particular, ITU-R 601 pixel 360 is produced by sampling the Common Intermediate Format pixels 310 and 320, as shown by arrows 312 and 322, respectively. Similarly, ITU-R 601 pixel 365 is also produced by sampling the pixels 310 and 320, as shown by arrows 314 and 324, respectively.

Figure 4 is an illustration of an example prediction process between video object planes in the base layer and the enhancement layer. At the enhancement layer encoder 210 of Figure 2, a video object plane of the enhancement layer is encoded as either a P-VOP or a B-VOP. In this example, the video object planes in the enhancement layer have a higher spatial resolution than the video object planes in the base layer and are therefore drawn with a larger area. The temporal resolution (e.g., frame rate) is the same for both layers. The video object planes are shown in presentation order from left to right.

The base layer includes an intra-coded video object plane (I-VOP) 405, bidirectionally predicted video object planes (B-VOPs) 415 and 420, and a predicted video object plane (P-VOP) 430. The enhancement layer includes the P-VOPs 450 and 490, and the B-VOPs 460 and 480. The B-VOP 415 is predicted from other base layer video object planes as shown by arrows 410 and 440, while the B-VOP 420 is also predicted from other base layer video object planes as shown by arrows 425 and 435. The P-VOP 430 is predicted from the I-VOP 405 as shown by arrow 445. The P-VOP 450 is derived by up-sampling a base layer video object plane, as indicated by arrow 495. The B-VOP 460 is predicted from base layer video object planes as shown by arrows 465 and 475, and the B-VOP 480 is predicted from base layer video object planes as shown by arrows 470 and 485.

Generally, an enhancement layer video object plane which is temporally coincident (e.g., in display or presentation order) with an I-VOP in the base layer is encoded as a P-VOP. For example, the video object plane 450 is temporally coincident with the I-VOP 405 and is therefore encoded as a P-VOP. An enhancement layer video object plane which is temporally coincident with a P-VOP in the base layer is encoded as either a P-VOP or a B-VOP. For example, the video object plane 490 is temporally coincident with the P-VOP 430 and is encoded as a P-VOP. An enhancement layer video object plane which is temporally coincident with a B-VOP in the base layer is encoded as a B-VOP.
For example, see the B-VOPs 460 and 480. The I-VOP 405 and the P-VOP 430 are known as anchor video object planes, since they are used as prediction references for the enhancement layer video object planes. The I-VOP 405 and the P-VOP 430 are therefore encoded before the corresponding predicted video object planes in the enhancement layer. The prediction reference is specified by the forward temporal reference indicator forward_temporal_ref in an MPEG-4 compatible syntax. This indicator is a non-negative integer which points to the temporally coincident I-VOP in the base layer. The prediction references of the B-VOPs in the enhancement layer are specified by ref_select_code, forward_temporal_ref and backward_temporal_ref. See Table 2, below. Note that the table differs for the MPEG-2 and MPEG-4 VM 3.0 scalability schemes.

Table 2

Table 2 shows the prediction reference choices for B-VOPs in the enhancement layer. For example, suppose the temporal reference temporal_ref for the I-VOP 405 and the P-VOP 430 in the base layer is 0 and 3, respectively. Also, let temporal_ref for the P-VOP 450 in the enhancement layer be 0. Then, in Figure 4, forward_temporal_ref = 0 for the P-VOP 450. The prediction references of the B-VOPs 460 and 480, given by arrows 465 and 475, and 470 and 485, respectively, are specified by ref_select_code = 00, forward_temporal_ref = 0, and backward_temporal_ref = 3. The prediction references of the P-VOP 490 are specified by ref_select_code = 10, forward_temporal_ref = 0 and backward_temporal_ref = 3. See Table 3, below.

Table 3

In Table 3, a prediction type "I" indicates an intra-coded video object plane, a prediction type "P" indicates a predicted video object plane, and a prediction type "B" indicates a bidirectionally predicted video object plane. The coding process for the base layer sequence "in_0" is the same as a non-scalable coding process, for example, in accordance with the MPEG-2 Main Profile or the H.263 standard.

Figure 6 illustrates the reordering, or permutation, of pixel lines from frame mode to field mode in accordance with the present invention. As mentioned, when an input video object plane is in field mode and is down-sampled, the resulting video object plane will be in frame mode. Accordingly, when the down-sampled image is subsequently spatially up-sampled, and the resulting video object plane is differentially encoded by subtracting the input video object plane from the up-sampled video object plane, the resulting residue may not yield optimal coding efficiency when a spatial transformation such as the discrete cosine transform is subsequently performed on the residue. That is, in many cases, the magnitude of the residue values can be reduced by permuting (i.e., reordering) the lines of the up-sampled image to group the even and odd lines, since there may be a higher correlation between pixels of the same field than between pixels of opposite fields.
An image which may represent up-sampled pixel data (e.g., luminance data) in an enhancement layer is shown generally at 600. For example, assume the image 600 is a 16x16 macroblock which was derived by up-sampling an 8x8 block 2:1. The macroblock includes even-numbered lines 602, 604, 606, 608, 610, 612, 614 and 616, and odd-numbered lines 603, 605, 607, 609, 611, 613, 615 and 617. The even and odd lines form the top and bottom fields, respectively. The macroblock 600 includes four 8x8 luminance blocks, including a first block defined by the intersection of region 620 and lines 602-609, a second block defined by the intersection of region 625 and lines 602-609, a third block defined by the intersection of region 620 and lines 610-617, and a fourth block defined by the intersection of region 625 and lines 610-617.

When the pixel lines in the image 600 are permuted to form luminance blocks of same-field lines in accordance with the present invention, prior to determining the residue and performing the discrete cosine transform, the macroblock shown generally at 650 is formed. The arrows, shown generally at 645, indicate the reordering of the lines 602-617. For example, even line 602, which is the first line of macroblock 600, is also the first line of macroblock 650. Even line 604 becomes the second line of macroblock 650. Similarly, the even lines 606, 608, 610, 612, 614 and 616 become the third through eighth lines, respectively, of macroblock 650. A 16x8 luminance region 680 is thus formed from the even-numbered lines. A first 8x8 block is defined by the intersection of regions 680 and 670, while a second block is defined by the intersection of regions 680 and 675. Similarly, the odd-numbered lines are moved to a 16x8 region 685. Region 685 comprises a first 8x8 block defined by the intersection of regions 685 and 670, and a second 8x8 block defined by the intersection of regions 685 and 675. Region 685 thus includes the odd lines 603, 605, 607, 609, 611, 613, 615 and 617.

The discrete cosine transform which is performed on the residue is referred to herein as either a "field DCT" or a "frame DCT", or the like, according to whether or not the macroblock is reordered as shown at macroblock 650. It should be appreciated, however, that the invention may be adapted for use with other spatial transformations.

When the field DCT is used, the luminance lines (or luminance error lines) of the macroblock in the spatial domain are permuted from the frame DCT orientation to the field DCT configuration of top (even) and bottom (odd) field lines. The resulting macroblocks are transformed, quantized and variable-length coded as usual. When a field DCT macroblock is decoded, the inverse permutation is performed after all the luminance blocks have been obtained by the inverse DCT (IDCT). 4:2:0 chrominance data is not affected by this mode.

The criterion for selecting the field DCT in accordance with the present invention is as follows. The field DCT is selected when:

Σ_{i=0..6} Σ_{j=0..15} ( |p(2i,j) − p(2i+1,j)| + |p(2i+1,j) − p(2i+2,j)| ) >
Σ_{i=0..6} Σ_{j=0..15} ( |p(2i,j) − p(2i+2,j)| + |p(2i+1,j) − p(2i+3,j)| ) + bias
where p(i,j) is the spatial luminance difference (e.g., residue) data just before the DCT is performed on each of the 8x8 luminance blocks. Advantageously, the equation uses only first-order differences and therefore allows a simpler, less expensive implementation. The term "bias" is a factor which accounts for nonlinear effects that are not otherwise considered. For example, bias = 64 may be used. If the above relation is not satisfied, the frame DCT is used.

Note that, on the left-hand side of the above equation, the error terms refer to differences between pixels of opposite fields (e.g., even to odd, and odd to even). Thus, the left-hand side is a sum of differences of luminance values of lines of opposite fields. On the right-hand side, the error terms refer to differences between pixels of the same field (e.g., even to even, and odd to odd). Thus, the right-hand side is a sum of differences of luminance values of lines of the same field, plus the bias term. Alternatively, a second-order equation may be used to determine whether a frame or field DCT should be used by modifying the above equation to take the square of each error term rather than the absolute value. In this case, the bias term is not required. A sketch of the permutation and the decision criterion is given below.
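A minimal sketch of the Figure 6 permutation and the first-order selection criterion above, assuming a 16x16 numpy array of luminance difference data and illustrative function names, follows:

import numpy as np

def permute_frame_to_field(mb: np.ndarray) -> np.ndarray:
    # Reorder the 16 luminance lines of a macroblock so the even (top
    # field) lines form the upper 16x8 region and the odd (bottom field)
    # lines form the lower 16x8 region, as in Figure 6.
    return np.vstack((mb[0::2, :], mb[1::2, :]))

def use_field_dct(p: np.ndarray, bias: int = 64) -> bool:
    # First-order frame/field DCT decision over the 16x16 luminance
    # difference (residue) data, per the criterion above.
    opposite = same = 0
    for i in range(7):            # i = 0..6
        for j in range(16):       # j = 0..15
            opposite += abs(p[2*i, j] - p[2*i+1, j]) + abs(p[2*i+1, j] - p[2*i+2, j])
            same     += abs(p[2*i, j] - p[2*i+2, j]) + abs(p[2*i+1, j] - p[2*i+3, j])
    return opposite > same + bias

# Example: permute the macroblock only when the criterion selects the
# field DCT; otherwise leave it in frame DCT orientation.
residue = np.random.randint(-32, 32, (16, 16))
mb = permute_frame_to_field(residue) if use_field_dct(residue) else residue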
Figure 5 is an illustration of temporal and spatial scaling of a video object plane in accordance with the present invention. With object-based scalability, the frame rate and/or spatial resolution of a selected video object plane can be enhanced so that the selected object has a higher quality than the remaining area. For example, a video object plane of a news announcer may be provided with a higher resolution than the studio background.

The axes 505 and 506 indicate frame numbers. In the base layer, the frame 510, which includes the video object plane 520, is provided at frame position 0, while the frame 530, with the video object plane 532 (corresponding to the video object plane 520), is provided at frame position 3. Frame 530 is predicted from frame 510, as shown by arrow 512. The enhancement layer includes the video object planes 522, 524, 526 and 542. These video object planes have an increased spatial resolution relative to the video object planes 520 and 532 and are therefore drawn with a larger area. The P-VOP 522 is derived by up-sampling the video object plane 520, as shown by arrow 570. The B-VOPs 524 and 526 are predicted from the base layer video object planes 520 and 532, as shown by arrows 572 and 576, and 574 and 578, respectively.

The input video sequence used to create the base and enhancement layer sequences has a full spatial resolution (e.g., 720x480 for ITU-R 601 corresponding to the National Television Standards Committee (NTSC) standard, or 720x576 for ITU-R 601 corresponding to Phase Alternation Line (PAL)) and a full frame rate (e.g., 30 frames/60 fields per second for NTSC, or 25 frames/50 fields per second for PAL). Scalable coding is carried out in such a way that the resolution and frame rate of the selected object are preserved by the enhancement layer coding. The video object in the base layer, comprising the video object planes 520 and 532, has a lower resolution (e.g., one quarter the size of the full resolution video object plane) and a lower frame rate (e.g., one third of the original frame rate). Moreover, in the enhancement layer, only the video object plane 520 is enhanced; the remainder of frame 510 is not. Although only one video object plane is shown, virtually any number of video object planes may be provided. Moreover, when two or more video object planes are provided, all of them, or only selected ones, may be enhanced.

The base layer sequence is generated by down-sampling and frame selection from the original sequence. The base layer video object planes are then encoded as I-VOPs or P-VOPs using progressive coding tools. When the input video sequence is interlaced, interlaced coding tools such as field/frame motion estimation and compensation and the field/frame DCT are not used, since down-sampling the interlaced input video sequence produces a progressive video sequence. The enhancement layer video object planes are encoded using temporal and spatial scalability tools. For example, in the enhancement layer, the video object planes 522 and 542 are encoded as P-VOPs using spatial scalability. The video object planes 524 and 526 are encoded as B-VOPs predicted from the up-sampled base layer reference video object planes, i.e., the video object planes 520 and 532, respectively, using temporal scalability tools.

In another aspect of the present invention, a technique is described for reducing the coding complexity of motion estimation for B-VOPs by reducing the motion vector search range. The technique is applicable to both frame mode and field mode input video sequences. In particular, the search center in the reference video object plane is determined by scaling the motion vector of the corresponding base layer video object plane, rather than performing an independent exhaustive search in the reference video object plane. Such an exhaustive search would typically cover a range of, e.g., +/-64 pixels horizontally and +/-48 pixels vertically, and would therefore be less efficient than the described technique.
The search centers for the motion vectors of the video object planes 524 and 526 in the enhancement layer are determined by:

MVf = (m/n) · TRB · MVP / TRP
MVb = (m/n) · (TRB − TRP) · MVP / TRP

where MVf is the forward motion vector, MVb is the backward motion vector, MVP is the motion vector of the P-VOP (e.g., the video object plane 532) in the base layer, TRB is the temporal distance between the past reference video object plane (e.g., the video object plane 520) and the B-VOP in the enhancement layer, and TRP is the temporal distance between the past reference video object plane and the future reference P-VOP (e.g., the video object plane 532) in the base layer. m/n is the ratio of the spatial resolution of the base layer video object planes to the spatial resolution of the enhancement layer video object planes. That is, either the base layer video object planes or the B-VOP in the enhancement layer may be down-sampled relative to the input video sequence by a ratio m/n. In the example of Figure 5, m/n is the down-sampling ratio of the base layer video object plane which is subsequently up-sampled to provide the enhancement layer video object plane. m/n may be less than, equal to, or greater than one. For example, for the B-VOP 524, with TRB = 1, TRP = 3, and 2:1 down-sampling (i.e., m/n = 2), we obtain MVf = (2/3)·MVP and MVb = (-4/3)·MVP. Note that all motion vectors are two-dimensional. The motion vector search range is a rectangular region, e.g., 16x16, whose center is determined by MVf or MVb. The motion vectors are communicated with the enhancement and base layer video data in a transport data stream, and are recovered by a decoder for use in decoding the video data.

Generally, for coding interlaced video in accordance with the present invention, interlaced coding tools are used to achieve better results. These tools include the field/frame DCT for intra macroblocks and inter difference macroblocks, and field prediction, i.e., top field to bottom field, top field to top field, bottom field to top field, and bottom field to bottom field. For the configurations described in Table 1, above, these interlaced coding tools are combined as follows.

(1) For configurations with a low spatial resolution in both layers, only progressive (frame mode) coding tools are used. In this case, the two layers may encode different view sequences, e.g., of a stereoscopic video signal. To encode stereoscopic video, the motion estimation search range for the right view sequence (the enhancement layer) is 8x8 pixels. This 8x8 (full pixel) search area is centered around the motion vectors of the same type of the corresponding macroblock in the corresponding base layer video object plane.

(2) For configurations with a low spatial resolution in the base layer and a high spatial resolution in the enhancement layer, the interlaced coding tools are used only for the enhancement layer sequences. The motion estimation search range for coding the enhancement layer sequence is 8x8 (full pixel). This 8x8 search area is centered around the re-scaled (i.e., by a factor of two) motion vectors of the same type of the corresponding video object plane.
Field-based estimation and prediction are used only in the enhancement layer search and compensation.

(3) For configurations with a high spatial resolution in the base layer and a low spatial resolution in the enhancement layer, the interlaced coding tools are used only for the base layer sequences, as with the MPEG-2 Main Profile at Main Level. The motion estimation search range for coding the enhancement layer sequence is 4x4 (full pixel). This 4x4 search area is centered around the re-scaled (i.e., by a factor of 1/2) motion vectors of the same type of the corresponding macroblock in the corresponding base layer video object plane.

For configuration 2 in Table 1, above, for example, the two layer sequences are coded at different temporal unit rates.

Figure 7 is an illustration of a picture-in-picture (PIP) or preview access channel application with temporal and spatial scaling in accordance with the present invention. With picture-in-picture, a secondary program is provided as a subset of a main program which is viewed on a television. Since the secondary program has a smaller area, the viewer is less discerning of a reduced resolution image, so the temporal and/or spatial resolution of the PIP image can be reduced to conserve bandwidth.

Similarly, a preview access channel can provide a free, low-resolution sample of a program which can be purchased for a fee. This application provides a few minutes of free access to an authorized (e.g., pay-per-view) channel for a preview. The video coded in the preview access channel will have a lower resolution and a lower frame rate, and the decoder will control the access time for this preview channel.

Configuration 2 of the temporal-spatial scalable coding in Table 1, above, can be used to provide an output from decoding the base layer alone which has a lower spatial resolution than the output from decoding both the base layer and the enhancement layer. The video sequence in the base layer can be coded with a low frame rate, while the enhancement layer is coded with a higher frame rate. For example, a video sequence in the base layer can have a Common Intermediate Format resolution and a frame rate of 15 frames/second, while the corresponding video sequence in the enhancement layer has an ITU-R 601 resolution and a frame rate of 30 frames/second. In this case, the enhancement layer can conform to the NTSC video standard, while the picture-in-picture or preview access functionality is provided by the base layer, which can conform to the Common Intermediate Format standard. Accordingly, picture-in-picture functionality can be provided by scalable coding with a complexity and coding efficiency similar to that of the MPEG-2 Main Profile at Main Level standard.

The base layer includes the low spatial resolution video object planes 705 and 730. Moreover, the temporal resolution of the base layer is 1/3 that of the enhancement layer. The enhancement layer includes the high spatial resolution video object planes 750, 760, 780 and 790. The P-VOP 750 is derived by up-sampling the I-VOP 705, as shown by arrow 755. The B-VOP 760 is predicted from the base layer video object planes as shown by arrows 765 and 775.
The B-VOP 780 is predicted from the base layer video object planes as shown by arrows 770 and 785. The P-VOP 790 is derived by up-sampling the P-VOP 730, as shown by arrow 795.

Figure 8 is an illustration of a stereoscopic video application in accordance with the present invention. Stereoscopic video functionality is provided by the MPEG-2 Multi-view Profile (MVP) system, described in document ISO/IEC JTC1/SC29/WG11 N1196. The base layer is assigned to the left view and the enhancement layer is assigned to the right view. To improve coding efficiency, the enhancement layer images may be coded with a lower resolution than the base layer images. For example, configuration 4 in Table 1, above, may be used, where the base layer has an ITU-R 601 spatial resolution while the enhancement layer has a Common Intermediate Format spatial resolution. The base layer reference images for prediction of the enhancement layer images are down-sampled accordingly, and the decoder for the enhancement layer images correspondingly includes an up-sampling process. Additionally, adaptive frame/field DCT coding is used in the base layer but not in the enhancement layer.

The base layer includes the video object planes 805, 815, 820 and 830, while the enhancement layer includes the video object planes 850, 860, 880 and 890. The B-VOPs 815 and 820 are predicted using other video object planes as shown by arrows 810 and 840, and 835 and 825, respectively. The P-VOP 830 is predicted from the I-VOP 805 as shown by arrow 845. The P-VOP 850 is derived by down-sampling the I-VOP 805, as shown by arrow 855. The B-VOP 860 is predicted from the base layer video object planes as shown by arrows 865 and 875. The B-VOP 880 is predicted from the base layer video object planes as shown by arrows 870 and 885. The P-VOP 890 is derived by down-sampling the P-VOP 830, as shown by arrow 895.

Alternatively, for the base and enhancement layers to have the same spatial resolution and the same frame rate, configuration 7 in Table 1, above, may be used. In this case, the base layer coding process may be the same as a non-scalable coding process, e.g., as described by MPEG-4 VM non-scalable coding or the MPEG-2 Main Profile at Main Level standard, while adaptive frame/field DCT coding is used in the enhancement layer.

In another application of the present invention, an asynchronous transfer mode (ATM) communication technique is presented. Generally, the trend toward transmission of video signals over ATM networks is growing rapidly. This is due to the variable bit rate (VBR) nature of these networks, which provides several advantages over constant bit rate (CBR) transmission. For example, in variable bit rate channels, an approximately constant image quality can be achieved.
In another application of the present invention, an asynchronous transfer mode (ATM) communication technique is presented. Generally, the trend toward transmission of video signals over ATM networks is growing rapidly. This is due to the variable bit rate (VBR) nature of these networks, which provides many advantages over constant bit rate (CBR) transmission. For example, in variable bit rate channels, an approximately constant image quality can be achieved. Moreover, video sources in ATM networks can be statistically multiplexed, requiring a lower transmission bit rate than if they were transmitted over constant bit rate channels, since the long-term average data rate of a video signal is less than the short-term average that constant bit rate systems must accommodate through buffering. However, despite the advantages of ATM networks, they suffer from a significant drawback: congestion. In congested networks, video packets are queued while awaiting an output route. Packets with long delays can arrive too late to be useful at the receiver, and consequently the decoder discards them. Video encoding and decoding must therefore be designed to withstand packet loss. To make the video coder nearly immune to packet loss, the temporal-spatial scalable coding techniques of the present invention can be used. In particular, the base layer video data can be transmitted with a high priority and fitted to a guaranteed bit rate of an ATM network. Enhancement layer video data packets can be lost if congestion arises, because their channel is not guaranteed; if the enhancement layer packets are received, the image quality improves. A coding system using configuration 1 in Table 1, above, can be used to achieve this result. The scheme can be realized as shown in Figure 4, discussed previously in relation to prediction modes, where the base layer is the high priority layer. Thus, higher priority, lower bit rate information is communicated in the base layer, and lower priority, higher bit rate data is communicated in the enhancement layer. Similarly, this scalable coding can also be used for video coding and transmission over the Internet, intranets and other communication networks. Accordingly, it can be seen that the present invention provides a method and apparatus for temporal and spatial scaling of video images, including video object planes, in a digital video sequence. In one aspect of the invention, coding efficiency is improved by adaptively compressing a scaled field mode input video sequence. The up-sampled video object planes in the enhancement layer are reordered to provide a greater correlation with the original video sequence based on a linear criterion, and the resulting residue is encoded using a spatial transformation such as the discrete cosine transformation. In another aspect of the invention, a motion compensation scheme for coding the enhancement layer video object planes is presented in which the motion vectors that have already been determined for the base layer video object planes are scaled, and a reduced search area is defined whose center is given by the scaled motion vectors. The technique is suitable for use with frame mode or field mode input video. Additionally, various encoder and decoder processor configurations are presented to achieve particular scalable coding results. Scalable coding applications, including stereoscopic video, picture-in-picture, preview access channels, and ATM communications, were also discussed.
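The linear reordering criterion can be sketched as follows; the block granularity and the bias value are illustrative assumptions, with only the comparison itself taken from the criterion stated in the claims below.

    def reorder_to_field_mode(lines, bias=64):
        # lines is a list of luminance rows of a block in frame
        # (interleaved) order.  Adjacent lines belong to opposite
        # fields; lines two apart belong to the same field.
        def sad(a, b):
            return sum(abs(p - q) for p, q in zip(a, b))

        opposite = sum(sad(lines[i], lines[i + 1])
                       for i in range(len(lines) - 1))
        same = sum(sad(lines[i], lines[i + 2])
                   for i in range(len(lines) - 2))
        # Regroup the lines into fields when the opposite-field
        # differences exceed the same-field differences plus a bias.
        return opposite > same + bias

When the function returns True, the vertical correlation within each field exceeds that of the interleaved frame, so regrouping the lines into fields before the spatial transformation yields a more compact residue.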
Although the invention has been described in connection with various specific embodiments, those skilled in the art will appreciate that numerous adaptations and modifications may be made thereto without departing from the spirit and scope of the invention as set forth in the claims. For example, although two layers of scalability were discussed, more than two layers can be provided. Moreover, although rectangular or square video object planes are shown in some of the figures for simplicity, the invention is equally suitable for use with arbitrarily shaped video object planes.

Claims (45)

NOVELTY OF THE INVENTION Having described the foregoing invention, it is considered as a novelty and, therefore, the content of the following claims is claimed as property. CLAIMS 1. A method for scaling an input video sequence comprising video object planes for communication in a corresponding base layer and enhancement layer, the video object planes in the input video sequence having an associated spatial resolution and temporal resolution, the method comprising the steps of: down-sampling pixel data of a particular first plane of the video object planes of the input video sequence to provide a first video object plane of the base layer having a reduced spatial resolution; up-sampling pixel data of at least a portion of the first video object plane of the base layer to provide a first up-sampled video object plane in the enhancement layer; and differentially encoding the first up-sampled video object plane using the particular first plane of the video object planes of the input video sequence, for communication in the enhancement layer at a temporal position corresponding to the first video object plane of the base layer.
2. The method according to claim 1, wherein the input video object planes are field mode video object planes, and the differential encoding step comprises the further steps of: reordering lines of the pixel data of the first up-sampled video object plane in a field mode if the pixel data lines satisfy a reordering criterion; then determining a residue according to a difference between pixel data of the first up-sampled video object plane and pixel data of the particular first plane of the video object planes of the input video sequence; and spatially transforming the residue to provide transformation coefficients.
3. The method according to claim 2, wherein: the pixel data lines of the first up-sampled video object plane satisfy the reordering criterion when a sum of differences of luminance values of lines of opposite fields is greater than a sum of differences of luminance values of lines of the same field plus a bias term.
4. The method as claimed in any of the preceding claims, comprising the further steps of: down-sampling pixel data of a particular second plane of the video object planes of the input video sequence to provide a second video object plane of the base layer having a reduced spatial resolution; up-sampling pixel data of at least a portion of the second video object plane of the base layer to provide a second up-sampled video object plane in the enhancement layer corresponding to the first up-sampled video object plane; using at least one of the first and second video object planes of the base layer to predict an intermediate video object plane corresponding to the first and second up-sampled video object planes; and encoding the intermediate video object plane for communication in the enhancement layer at a temporal position that is intermediate to that of the first and second up-sampled video object planes.
5. The method according to claim 4, wherein: the enhancement layer has a higher temporal resolution than the base layer; and the base and enhancement layers are adapted to provide at least one of: (a) a picture-in-picture capability wherein a picture-in-picture image is carried in the base layer, and (b) a preview access channel capability wherein a preview access image is carried in the base layer.
6. The method as claimed in any of the preceding claims, wherein: the base layer is adapted to carry higher priority, lower bit rate data, and the enhancement layer is adapted to carry lower priority, higher bit rate data.
7. A method for scaling an input video sequence comprising video object planes for communication in a corresponding base layer and enhancement layer, the video object planes in the input video sequence having an associated spatial resolution and temporal resolution, the method comprising the steps of: providing a particular first plane of the video object planes of the input video sequence for communication in the base layer as a first video object plane of the base layer; down-sampling pixel data of at least a portion of the first video object plane of the base layer for communication in the enhancement layer as a first down-sampled video object plane at a temporal position corresponding to the first video object plane of the base layer; down-sampling corresponding pixel data of the particular first plane of the video object planes to provide a comparison video object plane; and differentially encoding the first down-sampled video object plane using the comparison video object plane.
8. The method according to claim 7, comprising the further steps of: differentially encoding the first video object plane of the base layer using the particular first plane of the video object planes by: determining a residue according to a difference between pixel data of the first video object plane of the base layer and pixel data of the particular first plane of the video object planes; and spatially transforming the residue to provide transformation coefficients.
9. The method according to claim 8, wherein the video object planes in the input video sequence are field mode video object planes, and the first video object plane of the base layer is differentially encoded by the further step of: reordering lines of the pixel data of the first video object plane of the base layer in a field mode before the determining step if the pixel data lines satisfy a reordering criterion.
10. The method according to claim 9, wherein: the pixel data lines of the first video object plane of the base layer satisfy the reordering criterion when a sum of differences of luminance values of lines of opposite fields is greater than a sum of differences of luminance values of lines of the same field plus a bias term.
11. The method as claimed in any of claims 7 to 10, comprising the further steps of: providing a particular second plane of the video object planes of the input video sequence for communication in the base layer as a second video object plane of the base layer; down-sampling pixel data of at least a portion of the second video object plane of the base layer for communication in the enhancement layer as a second down-sampled video object plane at a temporal position corresponding to the second video object plane of the base layer; down-sampling corresponding pixel data of the particular second plane of the video object planes to provide a comparison video object plane; differentially encoding the second down-sampled video object plane using the comparison video object plane; using at least one of the first and second video object planes of the base layer to predict an intermediate video object plane corresponding to the first and second down-sampled video object planes; and encoding the intermediate video object plane for communication in the enhancement layer at a temporal position which is intermediate to that of the first and second down-sampled video object planes.
12. The method according to claim 7, wherein: the base and enhancement layers are adapted to provide a stereoscopic video capability in which the image data in the enhancement layer has a lower spatial resolution than the image data in the base layer.
13. A method for encoding a bidirectionally predicted video object plane, comprising the steps of: scaling an input video sequence comprising video object planes for communication in a corresponding base layer and enhancement layer; providing first and second video object planes of the base layer in the base layer corresponding to video object planes of the input video sequence, the second video object plane of the base layer being predicted from the first video object plane of the base layer according to a motion vector MVP; providing the bidirectionally predicted video object plane in the enhancement layer at a temporal position which is intermediate to that of the first and second video object planes of the base layer; and encoding the bidirectionally predicted video object plane using at least one of: (a) a forward motion vector MVf, and (b) a backward motion vector MVb, obtained by scaling the motion vector MVP.
14. The method according to claim 13, wherein: a temporal distance TRP separates the first and second video object planes of the base layer; a temporal distance TRB separates the first video object plane of the base layer and the bidirectionally predicted video object plane; m/n is a ratio of the spatial resolution of the first and second video object planes of the base layer to the spatial resolution of the bidirectionally predicted video object plane; and at least one of: (a) the forward motion vector MVf is determined according to the relation MVf = (m/n)·TRB·MVP/TRP; and (b) the backward motion vector MVb is determined according to the relation MVb = (m/n)·(TRB − TRP)·MVP/TRP.
15. The method according to claim 13 or 14, comprising the further step of: encoding the bidirectionally predicted video object plane using at least one of: (a) a search region of the first video object plane of the base layer whose center is determined according to the forward motion vector MVf; and (b) a search region of the second video object plane of the base layer whose center is determined according to the backward motion vector MVb.
16. A method for recovering an input video sequence comprising video object planes that was scaled and communicated in a corresponding base layer and enhancement layer, the video object planes in the input video sequence having an associated spatial resolution and temporal resolution, wherein: pixel data of a particular first plane of the video object planes of the input video sequence is down-sampled and carried as a first video object plane of the base layer having a reduced spatial resolution; pixel data of at least a portion of the first video object plane of the base layer is up-sampled and carried as an up-sampled video object plane in the enhancement layer at a temporal position corresponding to the first video object plane of the base layer; and the up-sampled video object plane is differentially encoded using the particular first plane of the video object planes of the input video sequence; the method comprising the steps of: up-sampling the pixel data of the first video object plane of the base layer to restore the associated spatial resolution; and processing the up-sampled video object plane and the first video object plane of the base layer with the associated spatial resolution restored to provide an output video signal with the associated spatial resolution.
17. The method according to claim 16, wherein: the video object planes in the input video sequence are field mode video object planes; and the up-sampled video object plane is differentially encoded by reordering lines of the pixel data of the up-sampled video object plane in a field mode if the pixel data lines satisfy a reordering criterion, then determining a residue according to a difference between pixel data of the up-sampled video object plane and pixel data of the particular first plane of the video object planes of the input video sequence, and spatially transforming the residue to provide transformation coefficients.
18. The method according to claim 17, wherein: the pixel data lines of the up-sampled video object plane satisfy the reordering criterion when a sum of differences of luminance values of lines of opposite fields is greater than a sum of differences of luminance values of lines of the same field plus a bias term.
19. The method as claimed in any of claims 16 to 18, wherein: a particular second plane of the video object planes of the input video sequence is down-sampled to provide a second video object plane of the base layer having a reduced spatial resolution; pixel data of at least a portion of the second video object plane of the base layer is up-sampled to provide a second up-sampled video object plane in the enhancement layer corresponding to the first up-sampled video object plane; at least one of the first and second video object planes of the base layer is used to predict an intermediate video object plane corresponding to the first and second up-sampled video object planes; and the intermediate video object plane is encoded for communication in the enhancement layer at a temporal position which is intermediate to that of the first and second up-sampled video object planes.
20. The method according to claim 19, wherein: the enhancement layer has a higher temporal resolution than the base layer; and the base and enhancement layers are adapted to provide at least one of: (a) a picture-in-picture capability wherein a picture-in-picture image is carried in the base layer, and (b) a preview access channel capability wherein a preview access image is carried in the base layer.
21. The method as claimed in any of claims 16 to 20, wherein: the base layer is adapted to carry higher priority, lower bit rate data, and the enhancement layer is adapted to carry lower priority, higher bit rate data.
22. A method for recovering an input video sequence comprising video object planes that was scaled and communicated in a corresponding base layer and enhancement layer, the video object planes in the input video sequence having an associated spatial resolution and temporal resolution, wherein: a particular first plane of the video object planes of the input video sequence is provided in the base layer as a first video object plane of the base layer; pixel data of at least a portion of the first video object plane of the base layer is down-sampled and carried in the enhancement layer as a first down-sampled video object plane at a temporal position corresponding to the first video object plane of the base layer; corresponding pixel data of the particular first plane of the video object planes is down-sampled to provide a comparison video object plane; and the first down-sampled video object plane is differentially encoded using the comparison video object plane; the method comprising the steps of: up-sampling the pixel data of the first down-sampled video object plane to restore the associated spatial resolution; and processing the first video object plane of the enhancement layer with the associated spatial resolution restored and the first video object plane of the base layer to provide an output video signal with the associated spatial resolution.
23. The method according to claim 22, wherein: the first video object plane of the base layer is differentially encoded using the particular first plane of the video object planes by determining a residue according to a difference between pixel data of the first video object plane of the base layer and pixel data of the particular first plane of the video object planes, and spatially transforming the residue to provide transformation coefficients.
24. The method according to claim 23, wherein: the video object planes in the input video sequence are field mode video object planes, and the first video object plane of the base layer is differentially encoded by reordering lines of the pixel data of the first video object plane of the base layer in a field mode before determining the residue if the pixel data lines satisfy a reordering criterion.
25. The method according to claim 24, wherein: the pixel data lines of the first video object plane of the base layer satisfy the reordering criterion when a sum of differences of luminance values of lines of opposite fields is greater than a sum of differences of luminance values of lines of the same field plus a bias term.
26. The method as claimed in any of claims 22 to 25, wherein: a particular second plane of the video object planes of the input video sequence is provided in the base layer as a second video object plane of the base layer; pixel data of at least a portion of the second video object plane of the base layer is down-sampled and carried in the enhancement layer as a second down-sampled video object plane at a temporal position corresponding to the second video object plane of the base layer; corresponding pixel data of the particular second plane of the video object planes is down-sampled to provide a comparison video object plane; the second down-sampled video object plane is differentially encoded using the comparison video object plane; at least one of the first and second video object planes of the base layer is used to predict an intermediate video object plane corresponding to the first and second down-sampled video object planes; and the intermediate video object plane is encoded for communication in the enhancement layer at a temporal position which is intermediate to that of the first and second down-sampled video object planes.
27. The method as claimed in any of claims 22 to 26, wherein: the base and enhancement layers are adapted to provide a stereoscopic video capability in which the image data in the enhancement layer has a lower spatial resolution than the image data in the base layer.
28. A method for recovering an input video sequence comprising video object planes that was scaled and communicated in a corresponding base layer and enhancement layer in a data stream, the video object planes in the input video sequence having an associated spatial resolution and temporal resolution, wherein: first and second video object planes of the base layer corresponding to video object planes of the input video sequence are provided in the base layer; the second video object plane of the base layer is predicted from the first video object plane of the base layer according to a motion vector MVP; a bidirectionally predicted video object plane is provided in the enhancement layer at a temporal position which is intermediate to that of the first and second video object planes of the base layer; and the bidirectionally predicted video object plane is encoded using a forward motion vector MVf and a backward motion vector MVb, which are obtained by scaling the motion vector MVP; the method comprising the steps of: recovering the forward motion vector MVf and the backward motion vector MVb from the data stream; and decoding the bidirectionally predicted video object plane using the forward motion vector MVf and the backward motion vector MVb.
29. The method according to claim 28, wherein: a temporal distance TRP separates the first and second video object planes of the base layer; a temporal distance TRB separates the first video object plane of the base layer and the bidirectionally predicted video object plane; m/n is a ratio of the spatial resolution of the first and second video object planes of the base layer to the spatial resolution of the bidirectionally predicted video object plane; and at least one of: (a) the forward motion vector MVf is determined according to the relation MVf = (m/n)·TRB·MVP/TRP; and (b) the backward motion vector MVb is determined according to the relation MVb = (m/n)·(TRB − TRP)·MVP/TRP.
30. The method according to claim 28, wherein the bidirectionally predicted video object plane is encoded using at least one of: (a) a search region of the first video object plane of the base layer whose center is determined according to the forward motion vector MVf; and (b) a search region of the second video object plane of the base layer whose center is determined according to the backward motion vector MVb.
31. A decoding apparatus for recovering an input video sequence comprising video object planes that was scaled and communicated in a corresponding base layer and enhancement layer, the video object planes in the input video sequence having an associated spatial resolution and temporal resolution, wherein: pixel data of a particular first plane of the video object planes of the input video sequence is down-sampled and carried as a first video object plane of the base layer having a reduced spatial resolution; pixel data of at least a portion of the first video object plane of the base layer is up-sampled and carried as an up-sampled video object plane in the enhancement layer at a temporal position corresponding to the first video object plane of the base layer; and the up-sampled video object plane is differentially encoded using the particular first plane of the video object planes of the input video sequence; the apparatus comprising: means for up-sampling the pixel data of the first video object plane of the base layer to restore the associated spatial resolution; and means for processing the up-sampled video object plane and the first video object plane of the base layer with the associated spatial resolution restored to provide an output video signal with the associated spatial resolution.
32. The apparatus according to claim 31, wherein: the video object planes in the input video sequence are field mode video object planes; and the up-sampled video object plane is differentially encoded by reordering lines of the pixel data of the up-sampled video object plane in a field mode if the pixel data lines satisfy a reordering criterion, then determining a residue according to a difference between pixel data of the up-sampled video object plane and pixel data of the particular first plane of the video object planes of the input video sequence, and spatially transforming the residue to provide transformation coefficients.
33. The apparatus according to claim 31 or 32, wherein: the pixel data lines of the up-sampled video object plane satisfy the reordering criterion when a sum of differences of luminance values of lines of opposite fields is greater than a sum of differences of luminance values of lines of the same field plus a bias term.
34. The apparatus as claimed in any of claims 31 to 33, wherein: a particular second plane of the video object planes of the input video sequence is down-sampled to provide a second video object plane of the base layer having a reduced spatial resolution; pixel data of at least a portion of the second video object plane of the base layer is up-sampled to provide a second up-sampled video object plane in the enhancement layer corresponding to the first up-sampled video object plane; at least one of the first and second video object planes of the base layer is used to predict an intermediate video object plane corresponding to the first and second up-sampled video object planes; and the intermediate video object plane is encoded for communication in the enhancement layer at a temporal position which is intermediate to that of the first and second up-sampled video object planes.
35. The apparatus according to claim 34, wherein: the enhancement layer has a higher temporal resolution than the base layer; and the base and enhancement layers are adapted to provide at least one of: (a) a picture-in-picture capability wherein a picture-in-picture image is carried in the base layer, and (b) a preview access channel capability wherein a preview access image is carried in the base layer.
36. The apparatus as claimed in any of claims 31 to 35, wherein: the base layer is adapted to carry higher priority, lower bit rate data, and the enhancement layer is adapted to carry lower priority, higher bit rate data.
37. A decoder apparatus for recovering an input video sequence comprising video object planes that was scaled and communicated in a corresponding base layer and enhancement layer, the video object planes in the input video sequence having an associated spatial resolution and temporal resolution, wherein: a particular first plane of the video object planes of the input video sequence is provided in the base layer as a first video object plane of the base layer; pixel data of at least a portion of the first video object plane of the base layer is down-sampled and carried in the enhancement layer as a first down-sampled video object plane at a temporal position corresponding to the first video object plane of the base layer; corresponding pixel data of the particular first plane of the video object planes is down-sampled to provide a comparison video object plane; and the first down-sampled video object plane is differentially encoded using the comparison video object plane; the apparatus comprising: means for up-sampling the pixel data of the first down-sampled video object plane to restore the associated spatial resolution; and means for processing the first video object plane of the enhancement layer with the restored spatial resolution and the first video object plane of the base layer to provide an output video signal with the associated spatial resolution.
38. The apparatus according to claim 37, wherein: the first down-sampled video object plane is differentially encoded by determining a residue according to a difference between pixel data of the first down-sampled video object plane and pixel data of the particular first plane of the video object planes of the input video sequence, and spatially transforming the residue to provide transformation coefficients.
39. The apparatus according to claim 38, wherein: the video object planes in the input video sequence are field mode video object planes, and the first video object plane of the base layer is differentially encoded by reordering lines of the pixel data of the first video object plane of the base layer in a field mode before determining the residue if the pixel data lines satisfy a reordering criterion.
40. The apparatus according to claim 39, wherein: the pixel data lines of the first video object plane of the base layer satisfy the reordering criterion when a sum of differences of luminance values of lines of opposite fields is greater than a sum of differences of luminance values of lines of the same field plus a bias term.
41. The apparatus as claimed in any of claims 37 to 40, wherein: a particular second plane of the video object planes of the input video sequence is provided for communication in the base layer as a second video object plane of the base layer; pixel data of at least a portion of the second video object plane of the base layer is down-sampled to provide a second down-sampled video object plane in the enhancement layer corresponding to the first down-sampled video object plane; at least one of the first and second video object planes of the base layer is used to predict an intermediate video object plane corresponding to the first and second down-sampled video object planes; and the intermediate video object plane is encoded for communication in the enhancement layer at a temporal position which is intermediate to that of the first and second down-sampled video object planes.
42. The apparatus as claimed in any of claims 37 to 41, wherein: the base and enhancement layers are adapted to provide a stereoscopic video capability in which the image data in the enhancement layer has a lower spatial resolution than the image data in the base layer.
43. A decoding apparatus for recovering an input video sequence comprising video object planes that was scaled and communicated in a corresponding base layer and enhancement layer in a data stream, the video object planes in the input video sequence having an associated spatial resolution and temporal resolution, wherein: first and second video object planes of the base layer corresponding to video object planes of the input video sequence are provided in the base layer; the second video object plane of the base layer is predicted from the first video object plane of the base layer according to a motion vector MVP; a bidirectionally predicted video object plane is provided in the enhancement layer at a temporal position which is intermediate to that of the first and second video object planes of the base layer; and the bidirectionally predicted video object plane is encoded using a forward motion vector MVf and a backward motion vector MVb, which are obtained by scaling the motion vector MVP; the apparatus comprising: means for recovering the forward motion vector MVf and the backward motion vector MVb from the data stream; and means for decoding the bidirectionally predicted video object plane using the forward motion vector MVf and the backward motion vector MVb.
44. The apparatus according to claim 43, wherein: a temporal distance TRP separates the first and second video object planes of the base layer; a temporal distance TRB separates the first video object plane of the base layer and the bidirectionally predicted video object plane; m/n is a ratio of the spatial resolution of the first and second video object planes of the base layer to the spatial resolution of the bidirectionally predicted video object plane; and at least one of: (a) the forward motion vector MVf is determined according to the relation MVf = (m/n)·TRB·MVP/TRP; and (b) the backward motion vector MVb is determined according to the relation MVb = (m/n)·(TRB − TRP)·MVP/TRP.
45. The apparatus as claimed in claim 43 or 44, wherein: the bidirectionally predicted video object plane is encoded using at least one of: (a) a search region of the first video object plane of the base layer whose center is determined according to the forward motion vector MVf; and (b) a search region of the second video object plane of the base layer whose center is determined according to the backward motion vector MVb. SUMMARY OF THE INVENTION Temporal and spatial scaling of video images including video object planes (117, 118, 119, 405, 415, 420, 430, 520, 522, 524, 526, 532, 542, 705, 730, 750, 760, 780, 790, 805, 815, 820, 830, 850, 860, 880, 890) in a digital input video sequence is provided. Coding efficiency is improved by adaptive compression of scaled field mode video. The up-sampled video object planes (450, 490, 522, 542, 750, 790) in the enhancement layer are reordered to provide a greater correlation with the input video sequence based on a linear criterion. The resulting residue is encoded using a spatial transformation such as the discrete cosine transformation. A motion compensation scheme is used to encode the enhancement layer video object planes (450, 460, 480, 490, 522, 524, 526, 542, 750, 760, 780, 790, 850, 860, 880, 890) by scaling motion vectors which have already been determined for the base layer video object planes (405, 415, 420, 430, 520, 532, 705, 730, 805, 815, 820, 830). A reduced search area is provided whose center is defined by the scaled motion vectors; the scaling relation is illustrated in the sketch following this summary. The motion compensation scheme is suitable for use with frame mode or field mode video. Various processor configurations achieve particular scalable coding results. Scalable coding applications include stereoscopic video, picture-in-picture, preview access channels, and ATM communications.
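The motion vector scaling relations recited in claims 14, 29 and 44 can be expressed directly in code. The sketch below is illustrative only; it assumes scalar temporal distances and returns unrounded values, whereas a real codec would round to its motion vector precision.

    def scale_bidirectional_mvs(mv_p, tr_b, tr_p, m, n):
        # mv_p : base layer motion vector (dx, dy) between the first
        #        and second base layer video object planes.
        # tr_p : temporal distance between the two base layer planes.
        # tr_b : temporal distance from the first base layer plane to
        #        the bidirectionally predicted plane (0 < tr_b < tr_p).
        # m, n : ratio m/n of base layer to enhancement layer
        #        spatial resolution.
        dx, dy = mv_p
        mv_f = ((m / n) * tr_b * dx / tr_p,
                (m / n) * tr_b * dy / tr_p)           # forward vector
        mv_b = ((m / n) * (tr_b - tr_p) * dx / tr_p,
                (m / n) * (tr_b - tr_p) * dy / tr_p)  # backward vector
        return mv_f, mv_b

Since the bidirectionally predicted plane lies between the two base layer planes, tr_b - tr_p is negative, so the backward vector points in the opposite direction from the forward vector, as expected.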
MXPA/A/1998/004502A 1997-06-05 1998-06-05 Temporal and spatial scalable coding for video object planes MXPA98004502A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08869493 1997-06-05
US08/869,493 US6057884A (en) 1997-06-05 1997-06-05 Temporal and spatial scaleable coding for video object planes

Publications (2)

Publication Number Publication Date
MX9804502A MX9804502A (en) 1998-12-31
MXPA98004502A 1999-02-01


Similar Documents

Publication Publication Date Title
US6057884A (en) Temporal and spatial scaleable coding for video object planes
Sikora MPEG digital video-coding standards
US5991447A (en) Prediction and coding of bi-directionally predicted video object planes for interlaced digital video
Sánchez et al. Compressed domain video processing for tile based panoramic streaming using HEVC
US5612735A (en) Digital 3D/stereoscopic video compression technique utilizing two disparity estimates
US5619256A (en) Digital 3D/stereoscopic video compression technique utilizing disparity and motion compensated predictions
KR100888963B1 (en) Method for scalably encoding and decoding video signal
US5623308A (en) Multiple resolution, multi-stream video system using a single standard coder
EP2538675A1 (en) Apparatus for universal coding for multi-view video
US8249159B2 (en) Scalable video coding with grid motion estimation and compensation
US20020136461A1 (en) Method for computational graceful degradation
JP2009505604A (en) Method and apparatus for encoding multi-view video
Sikora MPEG digital video coding standards
Haskell et al. Mpeg video compression basics
Tseng et al. Compatible video coding of stereoscopic sequences using MPEG-2's scalability and interlaced structure
US20040141654A1 (en) Texture encoding procedure
MXPA98004502A (en) Temporal and spatial scalable coding for video object planes
Liu et al. 3D video transcoding for virtual views
KR20110118744A (en) 3d tv video encoding method, decoding method
Bhaskaran et al. The MPEG Video Standards
Lim et al. Motion/disparity compensated multiview sequence coding
Saxena et al. Analysis of implementation strategies for video communication on some parameters
Favalli et al. ILPS: a scalable multiple description coding scheme for H. 264
Kou Digital Video Compression Standards
Tosun Video compression: MPEG-4 and Beyond