GB2511288A - Method, device, and computer program for motion vector prediction in scalable video encoder and decoder - Google Patents


Info

Publication number
GB2511288A
Authority
GB
United Kingdom
Prior art keywords
motion vector
predictors
list
image
predictor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB201300382A
Other versions
GB201300382D0 (en)
Inventor
Guillaume Laroche
Christophe Gisquet
Edouard Francois
Patrice Onno
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB201300382A priority Critical patent/GB2511288A/en
Publication of GB201300382D0 publication Critical patent/GB201300382D0/en
Publication of GB2511288A publication Critical patent/GB2511288A/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/56Motion estimation with initialisation of the vector search, e.g. estimating a good candidate to initiate a search
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/513Processing of motion vectors
    • H04N19/517Processing of motion vectors by encoding
    • H04N19/52Processing of motion vectors by encoding by predictive encoding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

An image of an enhancement layer of a scalable video sequence comprising a base layer and at least one enhancement layer is encoded and subsequently decoded by deriving motion information predictors (motion vector predictors, MVPs). An enhancement layer image portion is encoded / decoded by motion compensation with respect to reference image portions. For the enhancement layer image portion to be encoded or decoded, a first list of motion vector predictors (MVPs) is generated, the generation comprising testing the availability of (1) at least one temporal MVP (1516) provided by a previously encoded image, (2) at least one spatial MVP (1500-1508) provided by image portions neighbouring the image portion to be encoded or decoded and (3) at least one base layer MVP (1524) provided by the base layer image corresponding to the enhancement layer image. Each available predictor is inserted in the list, which is then complemented with offset predictors (1530). The offset predictors are obtained by adding at least one offset to at least one component of the first MVP in the generated list.

Description

METHOD, DEVICE, AND COMPUTER PROGRAM FOR MOTION VECTOR
PREDICTION IN SCALABLE VIDEO ENCODER AND DECODER
FIELD OF THE INVENTION
The invention generally relates to the field of scalable video coding and decoding, in particular to scalable video coding and decoding that would extend the High Efficiency Video Coding (HEVC) standard. More particularly, the invention concerns a method, device, and computer program for motion vector prediction in scalable video encoder and decoder.
BACKGROUND OF THE INVENTION
Video coding is a way of transforming a series of video images into a compact digitized bit-stream so that the video images can be transmitted or stored. An encoding device is used to code the video images, with an associated decoding device being available to reconstruct the bit-stream for display and viewing. A general aim is to form the bit-stream so as to be of smaller size than the original video information.
This advantageously reduces the capacity required of a transfer network, or storage device, to transmit or store the bit-stream code.
Common standardized approaches have been adopted for the format and method of the coding process, especially with respect to the decoding part. One of the more recent agreements is Scalable Video Coding (SVC) wherein the video image is split into smaller sections (called macroblocks or blocks) and treated as being comprised of hierarchical layers. The hierarchical layers include a base layer, equivalent to a collection of images (or frames) of the original video image sequence, and one or more enhancement layers (also known as refinement layers). SVC is the scalable extension of the H.264/AVC video compression standard.
A further video standard being standardized is High Efficiency Video Coding (HEVC), wherein the macroblocks are replaced by so-called Coding Units (CU) and are partitioned and adjusted according to the characteristics of the original image segment under consideration. This allows more detailed coding of areas of the video image which contain relatively more information and less coding effort for those areas with fewer features.
The video images were originally processed by coding each macroblock individually, in a manner resembling the digital coding of still images or pictures. Later coding models allow for prediction of the features in one frame, either from neighboring macroblocks (spatial prediction), or by association with a similar macroblock in a neighboring frame (temporal prediction). This allows use of already available coded information, thereby reducing the coding bit-rate needed overall.
Differences between the source area and the area used for prediction are captured in a residual set of values which themselves are encoded in association with the code for the source area. Many different types of predictions are possible. Effective coding chooses the best model to provide image quality upon decoding, while taking account of the bit-stream size each model requires to represent an image in the bit-stream. A trade-off between the decoded picture quality and reduction in required bitrate, also known as compression of the data, is the overall goal.
Figure 1 illustrates a standard video encoding device, of a generic type, conforming to the HEVC or H.264/AVC video compression system. A block diagram of a standard HEVC or H.264/AVC encoder is shown. The input to this non-scalable encoder consists of the original sequence of frame images 102 to compress.
The encoder successively performs the following steps to encode a standard video bit-stream. A first picture or frame to be encoded (compressed) is divided into pixel blocks, called coding units in the HEVC standard. The first picture is thus split into blocks or macroblocks in functional block 104. Each block first undergoes a motion estimation operation in functional block 106, which comprises a search, among the reference pictures stored in a dedicated memory buffer 108, for reference blocks that would provide a good prediction of the block. This motion estimation step provides one or more reference picture indexes which contain the found reference blocks, as well as the corresponding motion vectors. A motion compensation step then applies the estimated motion vectors on the found reference blocks and copies the so-obtained blocks into a temporal prediction picture in functional block 110. Moreover, an Intra prediction step determines the spatial prediction mode that would provide the best performance to predict the current block and encode it in INTRA mode in functional block 112.
Afterwards, a coding mode selection mechanism chooses the coding mode, among the spatial and temporal predictions, in functional block 114, which provides the best rate distortion trade-off in the coding of the current block. The difference between the current block (in its original version) and the so-chosen prediction block is calculated. This provides the (temporal or spatial) residual to compress. The residual block then undergoes a transform (typically a Discrete Cosine Transform, DCT) and a quantization in functional block 116. Entropy coding of the so-quantized coefficients QTC (and associated motion data MD) is performed in functional block 118. The compressed texture data associated with the coded current block is sent for output (coded bit-stream).
Finally, the current block is reconstructed by inverse quantization and inverse transform in functional block 120. This step is followed by a sum between the inverse transformed residual and the prediction block of the current block. Once the current picture is reconstructed and post-filtered (deblocking filter and SAO) in functional block 122, it is stored in the memory buffer 108 (Decoded Picture Buffer, DPB) so that it is available for use as a reference picture to predict any subsequent pictures to be encoded.
Finally, a last entropy coding step is given the coding mode and, in case of an inter block, the motion data, as well as the quantized DCT coefficients previously calculated. This entropy coder encodes each of these data into their binary form and encapsulates the so-encoded block into a container called a NAL unit (Network Abstraction Layer). A NAL unit contains all encoded coding units from a given slice. A coded HEVC bit-stream consists of a series of NAL units.
Figure 2 provides a block diagram of a standard HEVC or H.264/AVC decoding system 200. The decoding process of an H.264 bit-stream 202 starts by the entropy decoding of each block (array of pixels) of each coded picture in the bit-stream in functional block 204. This entropy decoding provides the coding mode, the motion data (reference picture indexes, motion vectors of Inter coded blocks, intra prediction direction of intra coded blocks) and residual data. This residual data consists of quantized and transformed DCT coefficients. Next, these quantized DCT coefficients undergo inverse quantization (scaling) and inverse transform operations in functional block 206.
The decoded residual is then added to the temporal or Intra prediction block of current block, obtained in functional blocks 208 and 210, respectively, to provide the reconstructed block. The choice between Intra or Inter prediction, as determined in functional block 212, depends on the prediction mode information which is provided by the entropy decoding step.
Intra mode exploits spatial correlation of pixels in a frame while Inter modes exploit temporal correlation between pixels of a frame and previous and/or following encoded/decoded frames.
The reconstructed block finally undergoes one or more in-loop post-filtering processes in functional block 214, e.g. deblocking and SAO (Sample Adaptive Offset) filtering, which aim at reducing the blocking artifacts inherent to any block-based video codec and improving the quality of the decoded picture.
The full post-filtered picture is then stored in the Decoded Picture Buffer (DPB), represented by the frame memory 216, which stores pictures that will serve as references to predict future pictures to decode. The decoded pictures 218 are also ready to be displayed on screen.
In order to further reduce the cost of encoding motion information, a motion vector may be encoded in terms of a difference between the motion vector and a motion vector predictor, typically computed from one or more motion vectors of the blocks surrounding the block to encode.
In H.264, motion vectors are encoded with respect to a median predictor computed from the motion vectors situated in a causal neighborhood of the block to encode, for example from the blocks situated above and to the left of the block to encode. The difference, also referred to as a residual motion vector, between the median predictor and the current block motion vector is encoded to reduce the encoding cost.
Encoding using residual motion vectors saves some bitrate, but necessitates that the decoder performs the same computation of the motion vector predictor in order to decode the value of the motion vector of a block to decode.
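By way of illustration, the following minimal sketch computes such a component-wise median predictor from three causal neighbours and the residual motion vector that would then be encoded. The type names and neighbour values are illustrative assumptions, not taken from the H.264 specification text.

```cpp
#include <algorithm>
#include <cstdio>

// A motion vector with horizontal (x) and vertical (y) components.
struct MV { int x; int y; };

// Median of three values, used component-wise by H.264/AVC to form the
// motion vector predictor from the causal neighbourhood (e.g. the left,
// top and top-right blocks of the block to encode).
static int median3(int a, int b, int c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

MV medianPredictor(MV left, MV top, MV topRight) {
    return { median3(left.x, top.x, topRight.x),
             median3(left.y, top.y, topRight.y) };
}

int main() {
    MV pred = medianPredictor({4, -2}, {6, 0}, {5, 3});
    MV current = {7, 1};
    // Only the residual (difference) is written to the bit-stream; the
    // decoder recomputes the same median predictor to recover `current`.
    MV residual = { current.x - pred.x, current.y - pred.y };
    std::printf("predictor (%d,%d), residual (%d,%d)\n",
                pred.x, pred.y, residual.x, residual.y);
    return 0;
}
```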
Further improvements in coding efficiency have been proposed, such as using a plurality of possible motion vector predictors. This method, often referred to as motion vector competition (MVCOMP), consists in determining from among several motion vector predictors or candidates (a candidate being a particular type of predictor for a particular prediction mode) which motion vector predictor or candidate minimizes the encoding cost, typically a rate-distortion cost, of the residual motion information.
The residual motion information comprises the residual motion vector, i.e. the difference between the actual motion vector of the block to encode and the selected motion vector predictor, and an item of information indicating the selected motion vector predictor, such as for example an encoded value of the index of the selected motion vector predictor. The index of the selected motion vector predictor is coded in the bit-stream with a unary max code based on a fixed list size.
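To illustrate the index signaling, here is a hedged sketch of a truncated unary ("unary max") binarization for a fixed list size; the function name and the bit container are assumptions for illustration only.

```cpp
#include <cstdio>
#include <vector>

// Truncated unary ("unary max") binarization of a predictor index:
// `index` one-bits followed by a terminating zero, except that the zero
// is omitted when index == listSize - 1, since the decoder already
// knows the maximum possible value from the fixed list size.
std::vector<bool> unaryMaxEncode(unsigned index, unsigned listSize) {
    std::vector<bool> bits;
    for (unsigned i = 0; i < index; ++i) bits.push_back(true);
    if (index < listSize - 1) bits.push_back(false);
    return bits;
}

int main() {
    for (unsigned idx = 0; idx < 2; ++idx) {          // AMVP list size is 2
        std::vector<bool> bits = unaryMaxEncode(idx, 2);
        std::printf("index %u -> ", idx);
        for (bool b : bits) std::printf("%d", b ? 1 : 0);
        std::printf("\n");                            // prints 0, then 1
    }
    return 0;
}
```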
In High Efficiency Video Coding (HEVC), an implementation of the same concept for enabling the selection of the best predictor, from a given set of predictors composed of spatial motion vectors and temporal motion vectors, has been proposed.
This technique is referred to as Advanced Motion Vector Prediction (AMVP). If some predictors from among these predictors of the given set are duplicates of other predictors in the set, some duplicates can be removed and further predictors can be added to the set to create a new second set of predictors. The added predictors can be a combination of the spatial and temporal predictors already in the set, other predictors derived from these spatial and temporal predictors, or predictors with fixed values.
According to the current HEVC design, three modes can be used for temporal prediction (Inter prediction): AMVP mode, Merge mode, and Merge Skip mode. A set of motion vector predictors containing at most two predictors is used for the AMVP mode and at most five predictors is used for the Merge Skip mode and the Merge mode.
In the current HEVC design, Inter prediction can be unidirectional or bi-directional. Unidirectional refers to one predictor block being used to predict the current block. The one predictor block is defined by a list index, a reference frame index and a motion vector. The list index corresponds to a list of reference frames. It may be considered, for example, that two lists are used: L0 and L1. One list contains at least one reference frame and a reference frame can be included in both lists. A motion vector has two components: horizontal and vertical. The motion vector corresponds to the spatial displacement in terms of pixels between the current block and the temporal predictor block in the reference frame. Thus, the block predictor for the uni-directional prediction is the block from the reference frame (ref index) of the list, pointed to by the motion vector.
For bi-directional Inter prediction, two block predictors are considered, one for each list (L0 and L1). Consequently, two reference frame indexes are considered as well as two motion vectors. The Inter block predictor for bi-prediction is the average, pixel by pixel, of the two blocks pointed to by these two motion vectors.
The motion information dedicated to the Inter block predictor can be defined by the following parameters:
-an Inter prediction type: unidirectional or bidirectional prediction type;
-one or two list indexes:
  o unidirectional prediction: L0 or L1;
  o bidirectional prediction: L0 and L1;
-one or two reference frame indexes:
  o unidirectional prediction: RefL0 or RefL1;
  o bidirectional prediction: RefL0 and RefL1; and
-one or two motion vectors:
  o unidirectional prediction: one motion vector having two components mvx (horizontal component) and mvy (vertical component);
  o bidirectional prediction: two motion vectors, each having two components mvx (horizontal component) and mvy (vertical component).
It may be noted that the bi-directional Inter predictor may only be used for a B frame type. Inter prediction in B frames can be uni- or bi-directional. In P frames, the Inter prediction is only unidirectional.
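The parameters above map naturally onto a small data structure. The following grouping is an illustrative sketch, not an HEVC syntax structure; all names are assumptions.

```cpp
#include <cstdint>

// A motion vector with horizontal (mvx) and vertical (mvy) components.
struct MV { int16_t mvx; int16_t mvy; };

// Inter prediction type.
enum class InterDir { UniL0, UniL1, Bi };

// Motion information of an Inter block predictor, grouping the
// parameters listed above; refIdx[i] is -1 when list Li is unused.
struct MotionInfo {
    InterDir dir;
    int      refIdx[2];  // RefL0, RefL1
    MV       mv[2];      // motion vector for L0 and L1
};

int main() {
    // Bi-directional example: one reference frame index and one motion
    // vector per list; the block predictor would be the pixel-by-pixel
    // average of the two blocks pointed to by mv[0] and mv[1].
    MotionInfo info{InterDir::Bi, {0, 1}, {{3, -1}, {-2, 0}}};
    (void)info;
    return 0;
}
```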
As mentioned above, the current design of HEVC uses three different modes for temporal prediction (the AMVP mode, Merge mode and Merge Skip mode), the main difference between these modes being the data signaled in the bit-stream.
In the AMVP mode all data are explicitly signaled. This means that the texture residual is coded and inserted into the bit-stream (the texture residual is the difference between the current block and the Inter prediction block). For the motion information, all data are coded. Thus, the direction type is coded (uni or bi-directional).
The list index, if needed, is also coded and inserted into the bit-stream. The related reference frame indexes are explicitly coded and inserted into the bit-stream. The motion vector value is predicted by the selected motion vector predictor. The motion vector residual for each component is then coded and inserted into the bit-stream.
In the Merge mode, the texture residual and the predictor index are coded and inserted into the bit-stream. The motion vector residual, direction type, list or reference frame index are not coded. These motion parameters are derived from the predictor index. Thus, the predictor, referred to as a candidate, is the predictor of all the data of the motion information.
In the Merge Skip mode no information is transmitted to the decoder side except for the "mode" itself and the predictor index. In this mode the processing is similar to the Merge mode except that no texture residual is coded or transmitted. The pixel values of a Merge Skip block are the pixel values of the block predictor.
The design of the derivation of predictors and candidates is very important to achieve coding efficiency without a large impact on complexity. According to the HEVC standard, two motion vector derivations are used: one for the Inter mode (AMVP), described by reference to Figures 3 and 4, and one for the Merge modes (Merge derivation process), described by reference to Figures 3 and 5.
AMVP exploits spatial and temporal correlation of motion vectors from neighboring blocks to derive the predictor for the current motion vector. AMVP first scans the motion vectors from spatial blocks located on the left side and top side of the current block, then from temporal neighboring block positions at some specified locations (typically the bottom right and center of the collocated block, i.e. the block at the same position in the temporal frame as the current block in the current frame), and orders them to construct a motion vector predictor list. Then, the encoder selects the best predictor from the list for the current motion vector and codes the corresponding index indicating the chosen predictor, as well as the motion vector difference, in the bit-stream.
Figure 3 illustrates spatial and temporal blocks that can be used to generate motion vector predictors in AMVP and Merge modes of HEVC coding and decoding systems and Figure 4 shows simplified steps of the process of the AMVP predictor set derivation.
Two predictors, i.e. the two spatial motion vectors of the AMVP mode, are chosen among the top blocks and the left blocks including the top corner blocks and left corner block and one predictor is chosen among the bottom right block and center block of the collocated block as represented in Figure 3.
Turning to Figure 4, a first step aims at selecting a first spatial predictor (Pred_1, 406) among the bottom left blocks A0 and A1, whose spatial positions are illustrated in Figure 3. To that end, these blocks are selected (400, 402) one after another, in the given order, and, for each selected block, the following conditions are evaluated (404) in the given order, the first block for which a condition is fulfilled being set as a predictor:
-the motion vector from the same reference list and the same reference image;
-the motion vector from the other reference list and the same reference image;
-the scaled motion vector from the same reference list and a different reference image; or
-the scaled motion vector from the other reference list and a different reference image.
If no value is found, the left predictor is considered as being unavailable. In this case, it indicates that the related blocks were Intra coded or those blocks do not exist.
A following step aims at selecting a second spatial predictor (Pred_2, 416) among the top right block B0, top block B1, and top left block B2, whose spatial positions are illustrated in Figure 3. To that end, these blocks are selected (408, 410, 412) one after another, in the given order, and, for each selected block, the above mentioned conditions are evaluated (414) in the given order, the first block for which the above mentioned conditions are fulfilled being set as a predictor.
Again, if no value is found, the top predictor is considered as being unavailable. In this case, it indicates that the related blocks were Intra coded or those blocks do not exist.
In a next step (418), the two predictors, if both are available, are compared one to the other to remove one of them if they are equal (i.e. same motion vector values, same reference list, same reference index and the same direction type).
If only one spatial predictor is available, the algorithm looks for a temporal predictor in a following step.
The temporal motion predictor (Pred_3, 426) is derived as follows: the bottom right (H, 420) position of the collocated block in a previous frame is first considered in the availability check module 422. If it does not exist or if the motion vector predictor is not available, the center of the collocated block (Center, 424) is selected to be checked. These temporal positions (Center and H) are depicted in Figure 3.
The motion predictor value is then added to the set of predictors.
Next, the number of predictors (Nb_Pred) is compared (428) to the maximum number of predictors (Max_Pred). As mentioned above, the maximum number (Max_Pred) of motion vector predictors that the derivation process of AMVP needs to generate is two in the current version of the HEVC standard.
If this maximum number is reached, the final list or set of AMVP predictors (432) is built. Otherwise, a zero predictor is added (430) to the list. The zero predictor is a motion vector equal to (0,0).
As illustrated in Figure 4, the final list or set of AMVP predictors (432) is built from a subset of spatial motion predictors (400 to 412) and from a subset of temporal motion predictors (420, 424).
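A condensed sketch of this Figure 4 flow is given below. The availability checks of blocks A0/A1 (left), B0/B1/B2 (top) and H/Center (temporal) are abstracted into optional inputs, so the four scaling conditions listed earlier are not reproduced; all names are illustrative assumptions.

```cpp
#include <cstdio>
#include <optional>
#include <vector>

struct MV {
    int x, y;
    bool operator==(const MV& o) const { return x == o.x && y == o.y; }
};

// Simplified AMVP list construction: up to two spatial predictors with
// duplicate removal (418), a temporal predictor (420-426) when fewer
// than Max_Pred predictors are available, then zero-MV padding (430) so
// that the final list (432) always contains Max_Pred (2 in HEVC) entries.
std::vector<MV> deriveAmvpList(std::optional<MV> left,
                               std::optional<MV> top,
                               std::optional<MV> temporal,
                               std::size_t maxPred = 2) {
    std::vector<MV> list;
    if (left) list.push_back(*left);
    if (top && !(left && *top == *left)) list.push_back(*top);  // prune duplicate
    if (list.size() < maxPred && temporal) list.push_back(*temporal);
    while (list.size() < maxPred) list.push_back({0, 0});       // zero predictor
    return list;
}

int main() {
    // Left predictor unavailable (e.g. Intra-coded or non-existent blocks).
    std::vector<MV> list = deriveAmvpList(std::nullopt, MV{2, 1}, MV{2, 0});
    for (const MV& p : list) std::printf("(%d,%d) ", p.x, p.y);
    std::printf("\n");  // prints: (2,1) (2,0)
    return 0;
}
```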
As mentioned above, a motion predictor candidate of Merge mode or of Merge Skip mode represents all the required motion information: direction, list, reference frame index, and motion vectors. An indexed list of several candidates is generated by a merge derivation process. In the current HEVC design the maximum number of candidates for both Merge modes is equal to five (4 spatial candidates and 1 temporal candidate).
Figure 5 is a schematic of a motion vector derivation process of the Merge modes. In a first step of the derivation process, five block positions are considered (500 to 508). These positions are the spatial positions depicted in Figure 3 with references A1, B1, B0, A0, and B2. In a following step, the availability of the spatial motion vectors is checked and at most five motion vectors are selected (510). A predictor is considered as available if it exists and if the block is not Intra coded. Therefore, selecting the motion vectors corresponding to the five blocks as candidates is done according to the following conditions:
-if the "left" A1 motion vector (500) is available (510), i.e. if it exists and if this block is not Intra coded, the motion vector of the "left" block is selected and used as a first candidate in the list of candidates (514);
-if the "top" B1 motion vector (502) is available (510), the candidate "top" block motion vector is compared to the A1 motion vector (512), if it exists. If the B1 motion vector is equal to the A1 motion vector, B1 is not added to the list of spatial candidates (514). On the contrary, if the B1 motion vector is not equal to the A1 motion vector, B1 is added to the list of spatial candidates (514);
-if the "top right" B0 motion vector (504) is available (510), the motion vector of the "top right" is compared to the B1 motion vector (512). If the B0 motion vector is equal to the B1 motion vector, the B0 motion vector is not added to the list of spatial candidates (514). On the contrary, if the B0 motion vector is not equal to the B1 motion vector, the B0 motion vector is added to the list of spatial candidates (514);
-if the "bottom left" A0 motion vector (506) is available (510), the motion vector of the "bottom left" is compared to the A1 motion vector (512). If the A0 motion vector is equal to the A1 motion vector, the A0 motion vector is not added to the list of spatial candidates (514). On the contrary, if the A0 motion vector is not equal to the A1 motion vector, the A0 motion vector is added to the list of spatial candidates (514); and
-if the list of spatial candidates does not contain four candidates, the availability of the "top left" B2 motion vector (508) is checked (510). If it is available, it is compared to the A1 motion vector and to the B1 motion vector. If the B2 motion vector is equal to the A1 motion vector or to the B1 motion vector, the B2 motion vector is not added to the list of spatial candidates (514). On the contrary, if the B2 motion vector is not equal to the A1 motion vector or to the B1 motion vector, the B2 motion vector is added to the list of spatial candidates (514).
At the end of this stage, the list of spatial candidates comprises up to four candidates.
For the temporal candidate, two positions can be used: the bottom right position of the collocated block (516, denoted H in Figure 3) and the center of the collocated block (518). These positions are depicted in Figure 3.
As for the AMVP motion vector derivation process, a first step aims at checking (520) the availability of the block at the H position. Next, if it is not available, the availability of the block at the center position is checked (520). If at least one motion vector of these positions is available, the temporal motion vector can be scaled (522), if needed, to the reference frame having index 0, for both lists L0 and L1, in order to create a temporal candidate (524) which is added to the list of Merge motion vector predictor candidates. It is positioned after the spatial candidates in the list.
If the number (Nb_Cand) of candidates is strictly less (526) than the maximum number of candidates (Max_Cand, whose value is signaled in the bit-stream slice header and is equal to five in the current HEVC design) and if the current frame is of the B type, combined candidates are generated (528). Combined candidates are generated based on available candidates of the list of Merge motion vector predictor candidates. It mainly consists of combining the motion vector of one candidate of the list L0 with the motion vector of one candidate of the list L1.
If the number (Nb_Cand) of candidates remains strictly less (530) than the maximum number of candidates (Max_Cand), zero motion candidates are generated (532) until the number of candidates of the list of Merge motion vector predictor candidates reaches the maximum number of candidates.
At the end of this process, the list or set of Merge motion vector predictor candidates is built (534).
As illustrated in Figure 5, the list or set of Merge motion vector predictor candidates is built (534) from a subset of spatial candidates (500 to 508) and from a subset of temporal candidates (516, 518).
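The following sketch condenses the spatial pruning, temporal candidate and zero-candidate padding of Figure 5 under the simplifying assumption that a candidate is a bare motion vector; real Merge candidates also carry direction, list and reference frame indexes, and the combined bi-directional candidates of step 528 are omitted for brevity. All names are illustrative.

```cpp
#include <cstdio>
#include <optional>
#include <vector>

struct MV {
    int x, y;
    bool operator==(const MV& o) const { return x == o.x && y == o.y; }
};

using Cand = std::optional<MV>;

// Simplified Merge list construction following Figure 5.
std::vector<MV> deriveMergeList(Cand a1, Cand b1, Cand b0, Cand a0, Cand b2,
                                Cand temporal, std::size_t maxCand = 5) {
    std::vector<MV> list;
    if (a1) list.push_back(*a1);
    if (b1 && !(a1 && *b1 == *a1)) list.push_back(*b1);   // prune against A1
    if (b0 && !(b1 && *b0 == *b1)) list.push_back(*b0);   // prune against B1
    if (a0 && !(a1 && *a0 == *a1)) list.push_back(*a0);   // prune against A1
    if (list.size() < 4 && b2 &&
        !(a1 && *b2 == *a1) && !(b1 && *b2 == *b1))       // B2 only if < 4 kept
        list.push_back(*b2);
    if (temporal) list.push_back(*temporal);              // scaled if needed (522)
    while (list.size() < maxCand) list.push_back({0, 0}); // zero candidates (532)
    return list;                                          // final list (534)
}

int main() {
    std::vector<MV> list = deriveMergeList(MV{1, 1}, MV{1, 1}, MV{0, 2},
                                           std::nullopt, MV{3, 3}, MV{1, 0});
    for (const MV& c : list) std::printf("(%d,%d) ", c.x, c.y);
    std::printf("\n");  // prints: (1,1) (0,2) (3,3) (1,0) (0,0)
    return 0;
}
```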
In general, the more information that can be coded and subsequently extracted, the better the result achieved on playback (decoding) of the video stream.
By coding sympathetically to the characteristics of the different blocks, quality can be maintained while bit-stream size is managed.
A problem of the coding process is to manage the bit-stream data produced. The more concise the coding, the more efficient the process is deemed to be.
SUMMARY OF THE INVENTION
Faced with these constraints, the inventors provide a method and a device for motion vector prediction for scalable video encoding and decoding.
It is a broad object of the invention to remedy the shortcomings of the prior art as described above.
According to a first aspect of the invention there is provided a method of encoding or decoding an image of an enhancement layer of a scalable video sequence comprising a base layer and at least one enhancement layer, by deriving motion information predictors for encoding or decoding an image portion of the image of the enhancement layer by motion compensation with respect to reference image portions, wherein for the image portion of the image of the enhancement layer to be encoded or decoded, the method comprises: generating a first list of motion vector predictors, said generation comprising testing the availability of at least one temporal motion vector predictor provided by a previously encoded image, at least one spatial motion vector predictor provided by an image portion neighboring the image portion to be encoded or decoded and at least one base layer motion vector predictor provided by the base layer image corresponding to the image of the enhancement layer, each available predictor being inserted in the list; and complementing the list with at least one offset predictor, wherein an offset predictor is obtained by adding at least one offset to at least one component of a motion vector predictor in the generated list.
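As a minimal sketch of the complementing step of this first aspect, assuming quarter-pel motion vectors and a single offset magnitude applied to each component in turn; the offset value (4, i.e. one pixel in quarter-pel units) and the choice of offset components are illustrative assumptions, not values fixed by this summary.

```cpp
#include <cstdio>
#include <vector>

struct MV { int x, y; };

// After the available temporal, spatial and base layer predictors have
// been inserted, fill the list with offset predictors derived from the
// first motion vector predictor in the list.
void complementWithOffsets(std::vector<MV>& list, std::size_t maxPred,
                           int offset = 4) {
    if (list.empty()) list.push_back({0, 0});
    const MV f = list.front();
    const MV offsets[] = {{f.x + offset, f.y}, {f.x - offset, f.y},
                          {f.x, f.y + offset}, {f.x, f.y - offset}};
    for (const MV& c : offsets) {
        if (list.size() >= maxPred) break;
        list.push_back(c);
    }
}

int main() {
    std::vector<MV> list = {{2, -1}};  // only one predictor found available
    complementWithOffsets(list, 4);
    for (const MV& p : list) std::printf("(%d,%d) ", p.x, p.y);
    std::printf("\n");  // prints: (2,-1) (6,-1) (-2,-1) (2,3)
    return 0;
}
```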
According to a second aspect of the invention there is provided a device for encoding or decoding an image of an enhancement layer of a scalable video sequence comprising a base layer and at least one enhancement layer, by deriving motion information predictors for encoding or decoding an image portion of the image of the enhancement layer by motion compensation with respect to reference image portions, wherein for the image portion of the image of the enhancement layer to be encoded or decoded, the device comprises: means for generating a first list of motion vector predictors, said generation comprising testing the availability of at least one temporal motion vector predictor provided by a previously encoded image, at least one spatial motion vector predictor provided by an image portion neighboring the image portion to be encoded or decoded and at least one base layer motion vector predictor provided by the base layer image corresponding to the image of the enhancement layer, each available predictor being inserted in the list; and means for complementing the list with offset predictors, wherein offset predictors are obtained by adding at least one offset to at least one component of the first motion vector predictor in the generated list.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
BRIEF DESCRIPTION OF THE DRAWINGS
Further advantages of the present invention will become apparent to those skilled in the art upon examination of the drawings and detailed description. It is intended that any additional advantages be incorporated herein.
Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which:
Figure 1 illustrates a standard video encoding device, of a generic type, conforming to the HEVC or H.264/AVC video compression system;
Figure 2 provides a block diagram of a standard HEVC or H.264/AVC decoding system;
Figure 3 illustrates spatial and temporal blocks that can be used to generate motion vector predictors in AMVP and Merge modes of HEVC coding and decoding systems;
Figure 4 shows simplified steps of the process of the AMVP predictor set derivation;
Figure 5 is a schematic of a motion vector derivation process of the Merge modes;
Figure 6 is a block diagram illustrating components of a processing device in which embodiments of the invention may be implemented;
Figure 7 illustrates a block diagram of a scalable video encoder based on the standard video coder illustrated in Figure 1;
Figure 8 presents a block diagram of a scalable decoder which would apply on a scalable bit-stream made of two scalability layers, e.g. comprising a base layer and an enhancement layer;
Figure 9 summarizes prediction modes that can be used in a scalable codec architecture, according to an embodiment of the invention, to predict a current enhancement picture;
Figure 10 illustrates the construction of a Base mode prediction picture;
Figure 11 depicts a prediction information up-sampling process, executed both by an encoder and a decoder in order to construct a Base Mode prediction picture;
Figure 12 summarizes possible coding modes which should be integrated for the enhancement layer of an HEVC scalable extension according to the prediction of block texture, residual texture and syntax information;
Figure 13 shows a schematic of the AMVP predictor set derivation for an enhancement layer of a scalable codec of the HEVC type according to a particular embodiment;
Figure 14 illustrates spatial and temporal blocks, in particular the bottom right block of the base layer, that can be used to generate motion vector predictors in AMVP and Merge modes of scalable HEVC coding and decoding systems according to a particular embodiment;
Figure 15 shows a schematic of the derivation process of motion vectors for an enhancement layer of a scalable codec of the HEVC type, according to a particular embodiment, for the Merge modes;
Figure 16 shows an example of spatial positions of the neighboring blocks of the current block in the enhancement layer and their collocated blocks in the base layer;
Figure 17 shows a schematic of the AMVP predictor set derivation for an enhancement layer of a scalable codec of the HEVC type according to a further particular embodiment; and
Figure 18 shows a schematic of the derivation process of motion vectors for an enhancement layer of a scalable codec of the HEVC type, according to a further particular embodiment, for the Merge modes.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Figure 6 schematically illustrates a processing device 600 configured to implement at least one embodiment of the present invention. The processing device 600 may be a device such as a micro-computer, a workstation or a light portable device. The device 600 comprises a communication bus 613 to which there are preferably connected:
-a central processing unit 611, such as a microprocessor, denoted CPU;
-a read only memory 607, denoted ROM, for storing computer programs for implementing the invention;
-a random access memory 612, denoted RAM, for storing the executable code of the method of embodiments of the invention as well as the registers adapted to record variables and parameters necessary for implementing the method of encoding a sequence of digital images and/or the method of decoding a bit-stream according to embodiments of the invention; and
-a communication interface 602 connected to a communication network 603 over which digital data to be processed are transmitted.
Optionally, the apparatus 600 may also include the following components:
-a data storage means 604 such as a hard disk, for storing computer programs for implementing methods of one or more embodiments of the invention and data used or produced during the implementation of one or more embodiments of the invention;
-a disk drive 605 for a disk 606, the disk drive being adapted to read data from the disk 606 or to write data onto said disk; and
-a screen 609 for displaying data and/or serving as a graphical interface with the user, by means of a keyboard 610 or any other pointing means.
The apparatus 600 can be connected to various peripherals, such as for example a digital camera 600 or a microphone 608, each being connected to an input/output card (not shown) so as to supply multimedia data to the apparatus 600.
The communication bus provides communication and interoperability between the various elements included in the apparatus 600 or connected to it. The representation of the bus is not limiting and in particular the central processing unit is operable to communicate instructions to any element of the apparatus 600 directly or by means of another element of the apparatus 600.
The disk 606 can be replaced by any information medium such as for example a compact disk (CD-ROM), rewritable or not, a ZIP disk or a memory card and, in general terms, by an information storage means that can be read by a microcomputer or by a microprocessor, integrated or not into the apparatus, possibly removable and adapted to store one or more programs whose execution enables the method of encoding a sequence of digital images and/or the method of decoding a bit-stream according to the invention to be implemented.
The executable code may be stored either in read only memory 607, on the hard disk 604 or on a removable digital medium such as for example a disk 606 as described previously. According to a variant, the executable code of the programs can be received by means of the communication network 603, via the interface 602, in order to be stored in one of the storage means of the apparatus 600 before being executed, such as the hard disk 604.
The central processing unit 611 is adapted to control and direct the execution of the instructions or portions of software code of the program or programs according to the invention, instructions that are stored in one of the aforementioned storage means. On powering up, the program or programs that are stored in a non-volatile memory, for example on the hard disk 604 or in the read only memory 607, are transferred into the random access memory 612, which then contains the executable code of the program or programs, as well as registers for storing the variables and parameters necessary for implementing the invention.
In this embodiment, the apparatus is a programmable apparatus which uses software to implement the invention. However, alternatively, the present invention may be implemented in hardware (for example, in the form of an Application Specific Integrated Circuit or ASIC).
A context of embodiments of the invention is the design of the scalable extension of HEVC. The HEVC scalable extension will allow coding/decoding a video made of multiple scalability layers. These layers comprise a base layer that is compliant with standards such as HEVC, H.264/AVC or MPEG-2, and one or more enhancement layers, coded according to the future scalable extension. Embodiments of the invention can be used for P and B slices and for both uni- and bi-directional predictions.
To ensure good scalable compression efficiency, one has to exploit redundancy that lies between the base layer and the enhancement layer, through so-called inter-layer prediction techniques (according to which an enhancement layer uses data from the base layer as classical Intra and temporal prediction coding). In case of Inter pictures, one has to selectively predict successive picture blocks through intra-layer temporal prediction, intra-layer spatial Intra prediction, inter-layer Inter prediction and inter-layer Intra prediction. In classical scalable video codecs (encoder-decoder pairs), this takes the form of block prediction choice, one block after another, among the above mentioned available prediction modes, according to a rate distortion criterion.
Each reconstructed block serves as a reference to predict subsequent blocks.
Differences are noted and encoded as residuals. Competition between the various possible encoding mechanisms takes account of both the type of encoding used and the size of the bit-stream resulting from each type. A trade-off is achieved between the two considerations.
The inventors observed that the competitive based scheme of motion vector prediction used in HEVC was defined in order to reach a good compromise between coding efficiency and complexity. However, this scheme needs adaptation to be efficiently used for the scalable extension since the latter should use new modes to avoid inter layer redundancies, impacting the efficiency of the HEVC motion vector prediction derivation process.
Figure 7 illustrates a block diagram of a scalable video encoder based on the standard video coder illustrated in Figure 1. While a video encoder of that type may comprise any number of subparts or stages, the video encoder illustrated in Figure 7 comprises two subparts or stages referred to as A7 and B7 producing data corresponding to a base layer (bit-stream 700) and data corresponding to an enhancement layer (bit-stream 702), respectively. Each of the subparts A7 and B7 follows the principles of the standard video encoder 100, with the steps of transformation, quantization, and entropy coding being applied in two separate paths, one corresponding to each layer.
The first stage B7 aims at encoding the H.264/AVC or HEVC compliant base layer of the output scalable stream, and hence is identical to the encoder of Figure 1.
The second stage A7 illustrates the coding of an enhancement layer on top of the base layer. This enhancement layer brings a refinement of the spatial resolution to the (down-sampled in functional block 704) base layer. As illustrated in Figure 7, the coding scheme of this enhancement layer is similar to that of the base layer, except that for each coding unit of a current picture 102 being compressed or coded, an additional prediction mode can be chosen by the coding mode selection module 706.
This new coding mode corresponds to the inter-layer prediction block 708. Inter-layer prediction block 708 aims at re-using the data coded in a layer lower than the current refinement or enhancement layer, as prediction data of the current coding unit. The lower layer used is called the reference layer for the inter-layer prediction of the current enhancement layer. In case the reference layer contains a picture that temporally coincides with the current picture, then it is called the base picture of the current picture. The co-located block (at the same spatial position) of the current coding unit that has been coded in the reference layer can be used as a reference to predict the current coding unit. More precisely, the prediction data that can be used in the co-located block corresponds to the coding mode, the block partition, the motion data (if present) and the texture data (temporal residual or reconstructed block). In case of a spatial enhancement layer, some up-sampling 710 operations of the texture and prediction data are performed.
Figure 8 presents a block diagram of a scalable decoder 800 which would apply on a scalable bit-stream made of two scalability layers, e.g. comprising a base layer and an enhancement layer. This decoding process is thus the reciprocal processing of the scalable coding process of Figure 7. The scalable stream 802 being decoded, as shown in Figure 8, is made of one base layer and one spatial enhancement layer on top of the base layer, which are demultiplexed in functional block 804 into their respective layers.
The first stage B8 of the scalable decoder 800 represented in Figure 8 concerns the base layer decoding process. As previously explained for the non-scalable case, this decoding process starts by entropy decoding each coding unit or block of each coded picture in the base layer in functional block 204. This entropy decoding provides the coding mode, the motion data (reference picture indexes, motion vectors of temporally predicted and coded macro-blocks), the intra prediction data, and residual data. This residual data consists of quantized and transformed DCT coefficients. Next, these quantized DCT coefficients undergo inverse quantization and inverse transform operations in functional block 206. Data obtained from motion compensation in functional block 208 or from Intra prediction in functional block 210 can be added to the decoded coding units in functional block 806.
Post-filtering is effected in functional block 214. The so-reconstructed image is then stored in the frame buffer 216.
Next, the decoded motion and temporal residual for temporal predicted blocks, and the reconstructed blocks, are stored into a frame buffer in the first stage of the scalable decoder of Figure 8. Such frames contain the data that can be used as reference data to predict an upper scalability layer.
Next, the second stage A8 of the scalable decoder represented in Figure 8 performs the decoding of a spatial enhancement layer on top of the base layer decoded by the first stage B8. This spatial enhancement layer decoding involves the entropy decoding of the second layer in functional block 808, which provides the coding modes, motion information as well as the transformed and quantized residual information of blocks of the second layer.
A following step consists in predicting blocks in the enhancement picture.
The choice between different types of block prediction (Intra, temporal prediction or inter-layer), used in functional block 810, depends on the prediction mode obtained from the entropy decoding in functional block 808.
Concerning Intra blocks, their treatment depends on the type of Intra coding unit. In case of an inter-layer predicted Intra block (referred to as Intra-BL coding mode), the result of the entropy decoding in functional block 808 undergoes inverse quantization and inverse transform in functional block 812, and then is added in functional block 814 to the co-located block of the current block in the base picture, in its decoded, post-filtered and up-sampled (in case of spatial scalability) version. In case of a non-Intra-BL Intra block, such a block is fully reconstructed, through inverse quantization and inverse transform to obtain the residual data in the spatial domain, and then Intra prediction is determined in functional block 816 to obtain the fully reconstructed block 818.
Concerning temporal predicted blocks, their reconstruction involves their motion compensated temporal prediction as determined in functional block 820, the residual data decoding and then the addition of their decoded residual information to their temporal predictor. In this temporal predicted block decoding process, inter-layer prediction can be used in two ways. First, the motion vectors associated with the considered block can be decoded in a predictive way, as a refinement of the motion vector of the co-located block in the base picture. Second, the temporal residual can also be inter-layer predicted from the temporal residual of the collocated block in the base layer.
It is to be noted that in a particular scalable coding mode of the block all the prediction information of the block (e.g. coding mode, motion vector) may be fully inferred from the co-located block in the base picture. Such block coding mode is known as "base mode".
As described above, the enhancement layer in scalable video coding can use data from the base layer as classical Intra and Inter coding. The modes which use data from the base layer are known as Inter layer prediction modes. In the state of the art, several Inter layer modes or Hybrid Inter layer and Intra or temporal prediction coding modes are defined.
Figure 9 summarizes prediction modes that can be used in a scalable codec architecture, according to an embodiment of the invention, to predict a current enhancement picture.
Reference 900 represents the current enhancement picture to predict. The base picture 902 corresponds to the base layer decoded picture that temporally coincides with current enhancement picture. Reference 904 corresponds to an example of reference picture in the enhancement layer used for the temporal prediction of the current enhancement picture 900. Finally, reference 906 corresponds to a Base mode prediction picture.
A Base mode prediction picture is constructed with the help of inter-layer prediction tools. The construction of such a Base mode prediction picture is explained in detail below, with reference to Figure 10. Briefly, it is constructed by predicting current enhancement picture by means of the up-sampled prediction information and optionally temporal residual data that has previously been extracted from the base layer and re-sampled to the enhancement spatial resolution.
As illustrated by Figure 9, and as explained above, the prediction of current enhancement picture 900 consists in determining, for each block 908 in current enhancement picture 900, the best available prediction mode for block 908, considering temporal prediction, Intra prediction, Intra BL prediction and Base mode prediction.
Figure 9 also illustrates the fact that the prediction information contained in the base layer is extracted, and then is used in two different ways.
First, the prediction information of the base layer is used to construct 910 the "Base Mode" prediction picture 906. This construction is discussed herein below with reference to Figure 10.
Second, the base layer prediction information is used in the predictive coding 912 of motion vectors in the enhancement layer. Therefore, the temporal prediction mode illustrated in Figure 9 makes use of the prediction information contained in the base picture 902. This allows inter-layer prediction of the motion vectors of the enhancement layer, hence increases the coding efficiency of the scalable video coding system (since it is not necessary to transmit motion vectors for the enhancement layer).
Figure 10 illustrates the construction of a Base mode prediction picture 1000. This picture is referred to as a Base mode picture because it is predicted by means of the prediction information issued from the base layer 1008. The figure also indicates the magnification 1004 of the base layer 1002 to the dimensions of an associated enhancement layer. The inputs to this process are the following:
-the lists of reference pictures, e.g. 1006, useful in the temporal prediction of the current enhancement picture, i.e. the Base mode prediction picture 1000;
-the prediction information, e.g. temporal prediction 10A, extracted from the base layer 1008 and re-sampled, e.g. temporal prediction 10B, to the enhancement layer 1004 resolution. This corresponds to the prediction information resulting from the process described in association with Figure 11;
-the temporal residual data issued from the base layer decoding, and re-sampled to the enhancement layer resolution, e.g. inter-layer temporal residual prediction 10C; and
-the base layer reconstructed picture 1008.
The Base mode picture construction process consists in predicting each coding unit e.g. largest coding unit (LCU) 1010 of the enhancement picture, conforming to the prediction modes and parameters inherited from the base layer.
It proceeds as follows: for each largest coding unit (LCU) in the current enhancement picture 1010:
-obtain the up-sampled Coding Unit (CU) representation issued from the base layer (algorithm of Figure 11);
-for each CU contained in the current LCU:
  o for each Prediction Unit (PU) in the current CU, predict the current PU with its prediction information inherited from the base layer.
The prediction unit prediction step proceeds as follows. In case the corresponding base prediction unit was Intra-coded, e.g. base layer intra coded block 1012, then the current prediction unit is predicted by the reconstructed base prediction unit, re-sampled to the enhancement layer resolution 1014. This prediction is associated with an inter-layer spatial prediction 1016. In case of a temporally coded base prediction unit 1018, the corresponding prediction unit in the enhancement layer 1020 is also temporally predicted, by using the motion information 10B inherited from the base layer 10A. This means the reference picture(s) in the enhancement layer that correspond to the same temporal position as the reference picture(s) of the base prediction unit are used. A motion compensation step 10B is applied by applying the motion vector inherited 1022 from the base onto these reference pictures. Finally, the up-sampled temporal residual data 10C of the co-located base prediction unit is applied onto the motion compensated enhancement prediction unit, which provides the predicted prediction unit in its final state.
Once this process has been applied on each prediction unit in the enhancement picture, a full "Base Mode" prediction picture is available.
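The loop structure of this construction can be sketched as follows; the types and prediction helpers below are hypothetical stand-ins for codec internals, with the actual prediction work stubbed out.

```cpp
#include <vector>

// Hypothetical stand-ins for codec data structures.
struct PU  { bool baseWasIntra = false; };   // prediction unit
struct CU  { std::vector<PU> pus; };         // coding unit
struct LCU { std::vector<CU> cus; };         // largest coding unit

// Inter-layer spatial prediction: copy the reconstructed, up-sampled
// co-located base block (1014/1016). Stubbed out here.
static void predictFromUpsampledBase(PU&) {}

// Temporal prediction with the inherited motion (10B) plus the
// up-sampled base residual (10C). Stubbed out here.
static void predictWithInheritedMotion(PU&) {}

// Base Mode picture construction: every prediction unit of the
// enhancement picture is predicted with the mode and parameters
// inherited (and up-sampled) from the base layer.
void buildBaseModePicture(std::vector<LCU>& enhancementPicture) {
    for (LCU& lcu : enhancementPicture)      // each LCU of picture 1010
        for (CU& cu : lcu.cus)               // up-sampled CU quad-tree
            for (PU& pu : cu.pus)
                if (pu.baseWasIntra) predictFromUpsampledBase(pu);
                else                 predictWithInheritedMotion(pu);
}

int main() {
    std::vector<LCU> picture(4, LCU{{CU{{PU{true}, PU{false}}}}});
    buildBaseModePicture(picture);
    return 0;
}
```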
Figure 11 depicts a prediction information up-sampling process, executed both by an encoder and a decoder in order to construct a Base Mode prediction picture, e.g. 906 (Figure 9). The prediction information up-sampling process is a useful means to perform inter-layer prediction.
The left side of Figure 11, referred to as 1100, illustrates a part of the base layer picture. In particular, the Coding Unit representation that has been used to encode the base picture is illustrated, for the first two LCUs of the picture, 1102 and 1104. The LCUs have a height and width, represented by arrows 1106 and 1108, respectively, and an identification number 1110, here shown running from zero to two.
The Coding Unit quad-tree representation of the second LCU 1104 is illustrated, as well as prediction unit (PU) partitions, e.g. partition 1112. Moreover, the motion vector associated with each prediction unit, e.g. vector 1114 associated with prediction unit 1112, is shown.
On the right side of Figure 11, which shows the enhancement layer sizing 1116 of the base layer 1100, the result of the prediction information up-sampling process can be seen. In this figure, the LCU size (height and width indicated by arrows 1118 and 1120, respectively) is the same in the enhancement picture and in the base picture, i.e. the base picture LCU has been magnified. As can be seen, the up-sampled version of base LCU 1104 results in the enhancement LCUs 2, 3, 6 and 7 (references 1122, 1124, 1126, and 1128, respectively). The individual prediction units exist in a scaling relationship known as a quad-tree. It is to be noted that the coding unit quad-tree structure of coding unit 1104 has been re-sampled in 1116 as a function of the scaling ratio that exists between the enhancement picture and the base picture. The prediction unit partitioning is of the same type (i.e. the corresponding prediction units have the same shape) in the enhancement layer and in the base layer. Finally, motion vector coordinates, e.g. 1130, have been re-scaled as a function of the spatial ratio between the two layers.
In other words, three main steps are involved in the prediction information up-sampling process:
- the coding unit quad-tree representation is first up-sampled; to do so, a depth parameter of the base coding unit is decreased by one in the enhancement layer;
- the coding unit partitioning mode is kept the same in the enhancement layer as in the base layer; this leads to prediction units with an up-scaled size in the enhancement layer, which have the same shape as their corresponding prediction units in the base layer; and
- the motion vector is re-sampled to the enhancement layer resolution, simply by multiplying its x and y coordinates by the appropriate scaling ratio.
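For illustration purposes, the following Python sketch applies these three steps to a single prediction unit. The dataclass fields, the rounding choices, and the function names are illustrative assumptions, not taken from any reference software.

from dataclasses import dataclass

@dataclass
class PredictionUnit:
    x: int            # position within the picture
    y: int
    width: int        # prediction unit size
    height: int
    depth: int        # coding-unit quad-tree depth
    mv: tuple         # motion vector (mvx, mvy)

def upsample_prediction_info(pu, ratio):
    """Up-sample one base-layer prediction unit: decrease the quad-tree depth
    by one, keep the partitioning shape, and multiply position, size and
    motion vector coordinates by the spatial scaling ratio."""
    return PredictionUnit(
        x=int(pu.x * ratio),
        y=int(pu.y * ratio),
        width=int(pu.width * ratio),
        height=int(pu.height * ratio),
        depth=max(pu.depth - 1, 0),
        mv=(round(pu.mv[0] * ratio), round(pu.mv[1] * ratio)),
    )

print(upsample_prediction_info(PredictionUnit(16, 8, 16, 8, 2, (5, -3)), 2.0))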
As a result of the prediction information up-sampling process, some prediction information is available on the encoder and on the decoder side, and can be used in various inter-layer prediction mechanisms in the enhancement layer.
In the current scalable encoder and decoder architectures, this up-scaled prediction information can be used in two ways:
- it can be used in the construction of the "Base Mode" prediction picture of the current enhancement picture; and
- it can also be used for the inter-layer prediction of motion vectors in the coding of the enhancement picture. Therefore, one additional predictor is used, compared to HEVC, in the predictive coding of motion vectors.
Figure 12 summarizes the possible coding modes to be integrated in the enhancement layer of an HEVC scalable extension, according to the prediction of block texture, residual texture, and syntax information.
It is to be recalled here that, unlike its predecessors, the current HEVC standard includes a competition-based scheme for motion vector prediction.
This means that several predictors or candidates compete according to a rate-distortion criterion at the encoder side in order to find the best motion vector predictor or the best motion information for the AMVP or the Merge mode, respectively. An index corresponding to the best predictor or candidate of the motion information is inserted in the bit-stream. The decoder can derive the same set of predictors or candidates and uses the one identified by the decoded index.
The coding efficiency of a competition-based scheme for motion vector prediction depends on several parameters: the way the motion vector predictors are generated, the number of motion vector predictors that can be used, the order of the motion vector predictors in the list of motion vector predictors, and the spatial positions of the motion vector predictors (i.e. where the predictors come from), whether spatial, temporal, or inter-layer.
To reach a high level of coding efficiency, these parameters need to be selected in order to obtain a tradeoff between the usefulness of predictors and their redundancies.
Figure 13 shows a schematic of the AMVP predictor set derivation for an enhancement layer of a scalable codec of the HEVC type according to a particular embodiment.
According to this particular embodiment, the standard process of AMVP predictor set derivation, as described by reference to Figure 4, is applied to the base layer.
It is to be noted that determination of the motion estimation predictors that are to be used for encoding or decoding an enhancement layer is based on the temporal and spatial motion information predictors that can be used with regard, in particular, to the determination of motion estimation predictors for the base layer (e.g. the order of the temporal and spatial predictors of the base layer and the available predictors of the base layer), so as to improve coding efficiency.
As depicted in Figure 13, the same spatial positions A0, A1, B0, B1, and B2 (1300 to 1308) as the ones used in the standard derivation process of motion vector predictors in AMVP, as described by reference to Figure 4 and shown in Figure 3, are used to derive two spatial predictors. However, while the positions of the spatial predictors are the same, their order in the list of motion vector predictors is different.
As illustrated in Figure 13, temporal predictor 1310 is defined as the first predictor of the list of motion vector predictors. Only the center position of the collocated block (i.e. the block at the same position as the current block, in an encoded reference frame of the enhancement layer) is considered as a possible motion vector predictor (while in the standard derivation process of motion vector predictors in AMVP, applied here to the base layer, both the bottom right position and the center position are used, as shown in Figure 4).
The availability of a motion vector corresponding to the center position of the collocated block is checked (1312), as done in the standard derivation process of motion vector predictors in AMVP (422), and this motion vector predictor is scaled (1314), if required, as a function of the temporal distance between the current frame and the selected reference frame. If the motion vector corresponding to the center position is available, it is considered as a first predictor (Pred_1, 1316).
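Temporal scaling of this kind is typically based on the ratio of picture distances. A minimal sketch, assuming integer motion vectors and plain rounding rather than the fixed-point arithmetic and clipping a real codec would use:

def scale_mv_temporal(mv, tb, td):
    """Scale a motion vector by the ratio of temporal distances:
    tb is the distance between the current frame and its reference,
    td the distance between the collocated frame and the reference
    of the collocated motion vector."""
    if td == 0:
        return mv
    return (round(mv[0] * tb / td), round(mv[1] * tb / td))

print(scale_mv_temporal((8, -4), tb=1, td=2))   # -> (4, -2)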
Next, the left blocks A0 and A1 (1300, 1302) are selected to derive, if possible, a first spatial predictor. After having checked the availability of the motion vectors, the following conditions, similar to the ones described by reference to Figure 4 (reference 404), are evaluated (1318) in the specific order of the selected blocks and then of the conditions, the first block whose conditions are fulfilled being used as a predictor:
- the motion vector from the same reference list and the same reference image;
- the motion vector from the other reference list and the same reference image;
- the scaled motion vector from the same reference list and a different reference image; or
- the scaled motion vector from the other reference list and a different reference image.
If no value is found, the left predictor is considered as being unavailable, which indicates that the related blocks were Intra coded or that those blocks do not exist. On the contrary, if a predictor is identified, it is considered as a second predictor (Pred_2, 1320).
Next, the top blocks B0, B1, and B2 (1304, 1306, and 1308) are selected to derive, if possible, a second spatial predictor. Again, after having checked the availability of the motion vectors, the above conditions are evaluated (1322) in the specific order of the selected blocks and then of the conditions, the first block whose conditions are fulfilled being used as a predictor.
Again, if no value is found, the top predictor is considered as being unavailable. In this case, this indicates that the related blocks were Intra coded or that those blocks do not exist. On the contrary, if a predictor is identified, it is considered as a third predictor (Pred_3, 1324).
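A minimal sketch of this block-then-condition scanning order follows; the dict-based block representation, the scale callback, and the helper name are assumptions made for illustration.

def derive_spatial_predictor(blocks, cur_list, cur_ref, scale):
    """Scan candidate blocks in order (e.g. [A0, A1] or [B0, B1, B2]); for
    each block, evaluate the four conditions in order and return the first
    matching motion vector. A block is a dict mapping a reference list index
    to (motion vector, reference image); None marks a missing or Intra block."""
    conditions = [
        (cur_list, True, False),       # same list, same reference image
        (1 - cur_list, True, False),   # other list, same reference image
        (cur_list, False, True),       # same list, different image (scaled)
        (1 - cur_list, False, True),   # other list, different image (scaled)
    ]
    for block in blocks:
        if block is None:
            continue
        for lst, same_image, needs_scaling in conditions:
            mv, mv_ref = block.get(lst, (None, None))
            if mv is None or (mv_ref == cur_ref) != same_image:
                continue
            return scale(mv, mv_ref) if needs_scaling else mv
    return None                        # predictor unavailable

# Example: A0 has a list-0 vector pointing to reference image 2.
A0, A1 = {0: ((3, 1), 2)}, None
print(derive_spatial_predictor([A0, A1], cur_list=0, cur_ref=2,
                               scale=lambda mv, ref: mv))   # -> (3, 1)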
Next, a fourth predictor (Pred_4, 1330), referred to as a base layer (BL) predictor, is determined (if possible). To that end, the bottom right (BR) position of the collocated block in the base layer is selected (1326) and the availability of the corresponding motion vector is checked (1327). As it belongs to the base layer, this motion vector predictor (BL) is firstly scaled as a function of the spatial ratio between the base layer and the enhancement layer.
In addition and if needed, this motion vector predictor is scaled (1328) as a function of the temporal distance between the current frame and the selected reference frame.
As illustrated in Figure 17, base layer predictor 1710 could also be defined as the first predictor of the list of motion vector predictors and the temporal predictor could be located after the spatial predictors in the list.
Figure 14 illustrates spatial and temporal blocks, in particular the bottom right block of the base layer, that can be used to generate motion vector predictors in AMVP and Merge modes of scalable HEVC coding and decoding systems according to a particular embodiment.
Returning to Figure 13, a test is performed, in a following step (1332), to remove duplicate predictors amongst the four possible predictors (Pred_1 to Pred_4).
To that end, the available motion vectors are compared with each other.
Next, if the number of remaining predictors (Nb_Pred) is greater than or equal (1334) to the maximum number of predictors (Max_Pred), e.g. three in this particular embodiment, the resulting predictors form the ordered list or set of motion vector predictors (1338).
On the contrary, if the number of remaining predictors (Nb_Pred) is smaller than the maximum number of predictors (Max_Pred), zero predictors are added (1336) to the resulting predictors to form the ordered list or set of motion vector predictors (1338). As set forth above, a zero predictor is a motion vector equal to (0,0).
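A compact sketch of steps 1332 to 1338 (duplicate removal, then zero padding); the truncation to Max_Pred entries is an assumption, as the text does not state it explicitly:

def finalize_predictor_list(predictors, max_pred=3):
    """Remove unavailable and duplicate predictors, then pad with the zero
    predictor (0, 0) until the list holds max_pred entries."""
    unique = []
    for mv in predictors:
        if mv is not None and mv not in unique:   # pairwise comparison
            unique.append(mv)
    while len(unique) < max_pred:
        unique.append((0, 0))
    return unique[:max_pred]

print(finalize_predictor_list([(4, 2), (4, 2), None, (0, -1)]))
# -> [(4, 2), (0, -1), (0, 0)]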
As illustrated in Figure 13, the ordered list or set of motion vector predictors (1338) is built, in particular, from a subset of temporal predictors (1310), from a subset of spatial predictors (1300 to 1308), and from a predictor coming from the base layer (1326). The subset of spatial predictors and the predictor coming from the base layer are preferably considered as being part of a single subset.
Figure 15 shows a schematic of the derivation process of motion vectors for an enhancement layer of a scalable codec of the HEVC type, according to a particular embodiment, for the Merge modes. According to this particular embodiment, the standard process of deriving motion vectors for Merge modes, as described by reference to Figure 5, is applied to the base layer.
Again, it is to be noted that determination of the motion estimation predictors that are to be used for encoding or decoding an enhancement layer is based on the temporal and spatial motion information predictors that can be used with regard, in particular, to the determination of motion estimation predictors for the base layer (e.g. the order of the temporal and spatial predictors of the base layer and the available predictors of the base layer), so as to improve coding efficiency.
It is also to be noted that the position and the number of spatial candidates for enhancement layers are similar to those, described by reference to Figure 5, that are used for the base layer. Moreover, the derivation process regarding these candidates is similar to the one described by reference to Figure 5, which is also used for the base layer. Accordingly, references 1500 to 1508 designating blocks A1, B1, B0, A0, and B2 correspond to references 500 to 508, and references 1510 to 1514 correspond to references 510 to 514, respectively.
In a first step, a temporal predictor (Cand_1, 1522) is obtained, if possible, and set as a first candidate in the list of motion vector candidates. Compared to the HEVC base layer (processed according to the derivation process as described by reference to Figure 5), only the center position (1516) of the collocated block in the corresponding temporal enhancement layer of an encoded frame is processed. The availability checking and scaling steps 1518 and 1520 are similar to steps 520 and 522 of Figure 5, respectively.
In the following steps of the derivation process, the five spatial block positions A1, B1, B0, A0, and B2 (1500 to 1508) are considered. The availability of the spatial motion vectors is checked and at most five motion vectors are selected (1510). As described above, a predictor is considered as available if it exists and if its block is not Intra coded. Selecting the motion vectors corresponding to the five blocks as candidates is therefore done according to the following conditions:
- if the "left" A1 motion vector (1500) is available (1510), i.e. if it exists and if this block is not Intra coded, the motion vector of the "left" block is selected and used as the first candidate in the list of spatial candidates (1514);
- if the "top" B1 motion vector (1502) is available (1510), the candidate "top" block motion vector is compared to the A1 motion vector (1512), if it exists. If the B1 motion vector is equal to the A1 motion vector, B1 is not added to the list of spatial candidates (1514). On the contrary, if the B1 motion vector is not equal to the A1 motion vector, B1 is added to the list of spatial candidates (1514);
- if the "top right" B0 motion vector (1504) is available (1510), the motion vector of the "top right" block is compared to the B1 motion vector (1512). If the B0 motion vector is equal to the B1 motion vector, it is not added to the list of spatial candidates (1514). On the contrary, if the B0 motion vector is not equal to the B1 motion vector, it is added to the list of spatial candidates (1514);
- if the "bottom left" A0 motion vector (1506) is available (1510), the motion vector of the "bottom left" block is compared to the A1 motion vector (1512). If the A0 motion vector is equal to the A1 motion vector, it is not added to the list of spatial candidates (1514). On the contrary, if the A0 motion vector is not equal to the A1 motion vector, it is added to the list of spatial candidates (1514); and
- if the list of spatial candidates does not contain four candidates, the availability of the "top left" B2 motion vector (1508) is checked (1510). If it is available, it is compared to the A1 motion vector and to the B1 motion vector. If the B2 motion vector is equal to the A1 motion vector or to the B1 motion vector, it is not added to the list of spatial candidates (1514). On the contrary, if the B2 motion vector is equal to neither the A1 motion vector nor the B1 motion vector, it is added to the list of spatial candidates (1514).
At the end of this stage, the list of spatial candidates comprises up to four spatial candidates (Cand_2 to Cand_5).
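The partial-comparison pruning just described can be sketched as follows; modelling a motion vector as a plain tuple and marking a missing or Intra-coded block with None are illustrative simplifications.

def merge_spatial_candidates(a1, b1, b0, a0, b2):
    """Select up to four spatial Merge candidates (Cand_2 to Cand_5): each
    position is compared only with the specific earlier position(s) named in
    the derivation, not with every candidate already in the list."""
    cands = []
    if a1 is not None:
        cands.append(a1)
    if b1 is not None and b1 != a1:
        cands.append(b1)
    if b0 is not None and b0 != b1:
        cands.append(b0)
    if a0 is not None and a0 != a1:
        cands.append(a0)
    if len(cands) < 4 and b2 is not None and b2 != a1 and b2 != b1:
        cands.append(b2)
    return cands

print(merge_spatial_candidates((1, 0), (1, 0), (2, 2), None, (0, 3)))
# -> [(1, 0), (2, 2), (0, 3)]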
In one embodiment, each of these spatial predictors can be compared to the first candidate Cand_1, in order not to add a spatial predictor that is a duplicate of this first predictor. This is represented by the arrow between Cand_1 (1522) and the pruning process with partial comparisons (1512).
When the temporal candidate (1522) and the up to four spatial candidates (1514) have been generated, a base layer Merge motion candidate (Cand_6, 1528) is generated.
To generate the base layer Merge motion candidate, the base layer motion vector at the bottom right position (1524) of the collocated block in the base layer, as illustrated in Figure 16, is selected and the availability of a corresponding motion vector is checked (1526). If it is available, a corresponding candidate is derived. As this base layer motion vector belongs to the base layer, the motion vector predictor (BL) used for the enhancement layer is firstly scaled as a function of the spatial ratio between the base layer and the enhancement layer.
Next, if the number (Nb_Cand) of candidates is strictly less (1529) than the maximum number of candidates (Max_Cand, e.g. 6 in this embodiment), offset predictors are generated based on the first candidate in the list. The maximum number of possible offset predictors should be limited; in one embodiment, four offset predictors are added to the list of Merge candidates. It can be noted that in the embodiment of Figure 15 the temporal candidate is represented as the first candidate in the list.
Next, if the number (Nb_Cand) of candidates is strictly less (1542) than the maximum number of candidates (Max_Cand, e.g. 6 in this embodiment) and if the current frame is of the B type, combined candidates are generated (1544). Combined candidates are generated based on available candidates of the list of Merge motion vector predictor candidates (e.g. combined candidates can be obtained by linear combination of available candidates of the list of Merge motion vector predictor candidates). This mainly consists in combining the motion vector of one candidate of list L0 with the motion vector of one candidate of list L1.
If the number (Nb_Cand) of candidates remains strictly less (1546) than the maximum number of candidates (Max_Cand), zero motion candidates are generated (1548) until the number of candidates of the list of Merge motion vector predictor candidates reaches the maximum number of candidates.
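A sketch of these final padding stages (steps 1544 and 1548); modelling a candidate as a {'L0': mv, 'L1': mv} dict and the pairing order are assumptions made for illustration.

def pad_merge_list(cands, is_b_frame, max_cand=6):
    """For B frames, build combined candidates by pairing the L0 motion of
    one candidate with the L1 motion of another; then fill the remaining
    slots with zero motion candidates."""
    if is_b_frame:
        snapshot = list(cands)
        for c0 in snapshot:
            for c1 in snapshot:
                if len(cands) >= max_cand:
                    return cands
                if c0 is c1 or c0['L0'] is None or c1['L1'] is None:
                    continue
                combined = {'L0': c0['L0'], 'L1': c1['L1']}
                if combined not in cands:
                    cands.append(combined)
    while len(cands) < max_cand:
        cands.append({'L0': (0, 0), 'L1': (0, 0) if is_b_frame else None})
    return cands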
At the end of this process, the list or set of Merge motion vector predictor candidates for an enhancement layer is built (1550).
As illustrated in Figure 15, the list or set of Merge motion vector predictor candidates for an enhancement layer is built (1550), in particular, from a subset of temporal candidates (1516), from a subset of spatial candidates (1500 to 1508), and from a subset of base layer candidates (1524). The subset of spatial candidates and the subset of base layer candidates are preferably considered as being part of a single subset.
As described above, offset predictors are included in the list of Merge candidates.
Figure 18 shows a schematic of the derivation process of motion vectors for an enhancement layer of a scalable codec of the HEVC type, according to a particular embodiment, for the Merge modes. This figure is based on Figure 15. The difference is that the collocated base layer motion vector is used as the first candidate of the list, Cand_1 (1828), and the temporal predictor is located after the spatial candidates, if any (1822). For the temporal predictor, the same positions are used as in the Merge derivation of classical HEVC (cf. Figure 5): when the temporal predictor is at this position in the list, it comes from the H position (1831) or from the center position (1816) of the temporal collocated block, as depicted in Figure 3. For the base layer candidate, the center position should be considered.
In one embodiment, when the Base mode is in the list, the bottom right position should be considered, at least when the up-sampled partition of the base layer has the same size as the current Merge block.
In one embodiment, the derivation process of the Merge mode is changed at slice or frame level by checking whether the temporal frame used as reference for the collocated temporal motion vector comes from the past or the future. If this frame comes from the past, the derivation process of Figure 18 is used; otherwise, the derivation process of Figure 15 is used.
According to a particular embodiment, the offset predictors are generated by adding an offset value to one or more components of a reference motion vector, such as a motion vector candidate or a motion vector associated with a neighboring block of a block corresponding to a motion vector candidate. Therefore, an offset predictor results from a combination of a reference motion vector and one or more offsets.
For the sake of illustration, the reference motion vector MV(mvx, mvy) combined with a single offset value o can lead to several offset predictors, for example the following ones:
MVo1(mvx + o, mvy)
MVo2(mvx, mvy + o)
MVo3(mvx + o, mvy + o)
MVo4(mvx - o, mvy)
To obtain good coding efficiency with offset predictors, the inventors observed that the following parameters have to be carefully adapted (in particular for scalable enhancement layers):
- the reference motion vector (MV(mvx, mvy));
- the offset values; and
- the number of offset predictors.
In a particular embodiment, offset predictors are generated from a base motion vector used as reference, i.e., the reference motion vector is chosen from amongst the motion vectors of the base layer. For example, such a reference motion vector can be the one associated with the bottom right block of the collocated block in the base layer.
In a particular embodiment, four offset predictors are derived with the following rule:
MVo1(mvx + o, mvy)
MVo2(mvx - o, mvy)
MVo3(mvx, mvy + o)
MVo4(mvx, mvy - o)
where the offset o is set equal to 4.
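This rule is straightforward to express in code; a minimal sketch, with the tuple representation of motion vectors being an illustrative choice:

def offset_predictors(mv, o=4):
    """The four offset predictors of this embodiment: the offset is applied
    with both signs to the horizontal component, then to the vertical one."""
    mvx, mvy = mv
    return [(mvx + o, mvy), (mvx - o, mvy), (mvx, mvy + o), (mvx, mvy - o)]

print(offset_predictors((10, -2)))
# -> [(14, -2), (6, -2), (10, 2), (10, -6)]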
Still in a particular embodiment, the offset predictors are generated from the collocated base motion vector as reference, i.e., the reference motion vector is the motion vector of the collocated block in the base layer.
Still in a particular embodiment, the offset value is added alternatively to the horizontal and vertical components of the motion vectors of list L0, and its inverse value is added to the corresponding motion vectors of list L1 if the motion vectors do not have the same temporal direction. For the sake of illustration, one can consider motion information referring to two motion vectors MVL0(mvL0x, mvL0y) and MVL1(mvL1x, mvL1y), MVL0 being associated with list L0 and MVL1 being associated with list L1, wherein motion vector MVL0 refers to a backward reference frame and vector MVL1 refers to a forward reference frame. According to the embodiment, if only one offset value o is to be used, the generated offset predictors can be the following:
MVo1(MVL0(mvL0x + o, mvL0y); MVL1(mvL1x - o, mvL1y))
MVo2(MVL0(mvL0x - o, mvL0y); MVL1(mvL1x + o, mvL1y))
MVo3(MVL0(mvL0x, mvL0y + o); MVL1(mvL1x, mvL1y - o))
MVo4(MVL0(mvL0x, mvL0y - o); MVL1(mvL1x, mvL1y + o))
MVo5(MVL0(mvL0x + o, mvL0y + o); MVL1(mvL1x - o, mvL1y - o))
MVo6(MVL0(mvL0x - o, mvL0y - o); MVL1(mvL1x + o, mvL1y + o))
MVo7(MVL0(mvL0x - o, mvL0y + o); MVL1(mvL1x + o, mvL1y - o))
MVo8(MVL0(mvL0x + o, mvL0y - o); MVL1(mvL1x - o, mvL1y + o))
In a particular embodiment, the absolute offset value o added to each component is always the same and is equal to 4, whatever the scalability ratio.
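The sign mirroring between the two lists can be captured compactly; a sketch under the assumption that the L0 and L1 vectors point in opposite temporal directions, as in the example above:

def bidirectional_offset_predictors(mv_l0, mv_l1, o=4):
    """Generate the eight offset predictors MVo1 to MVo8: each offset applied
    to an L0 component is applied with the inverse sign to the same L1
    component."""
    (x0, y0), (x1, y1) = mv_l0, mv_l1
    deltas = [(o, 0), (-o, 0), (0, o), (0, -o),
              (o, o), (-o, -o), (-o, o), (o, -o)]
    return [((x0 + dx, y0 + dy), (x1 - dx, y1 - dy)) for dx, dy in deltas]

for pred in bidirectional_offset_predictors((2, 1), (-3, 0)):
    print(pred)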
Still in a particular embodiment, the absolute offset value o is equal to two and is multiplied by the scalability ratio. For example, for an enhancement layer whose size is twice that of the base layer, the offset value o to be used is equal to four (4 = 2 x 2). Similarly, if the ratio is equal to 1.5, the offset value is 3. Still for the sake of illustration, regarding SNR scalability (according to which the base layer and an enhancement layer are of the same size, i.e. the spatial ratio is one), the offset value is set to two.
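This scaling rule reduces to a one-liner; the rounding behaviour for non-integer products is an assumption, as the text only gives integer examples:

def offset_for_ratio(base_offset=2, spatial_ratio=1.0):
    """Offset value scaled by the scalability ratio: 4 for a 2x layer,
    3 for a 1.5x layer, 2 for SNR scalability (ratio 1)."""
    return round(base_offset * spatial_ratio)

assert offset_for_ratio(spatial_ratio=2.0) == 4
assert offset_for_ratio(spatial_ratio=1.5) == 3
assert offset_for_ratio(spatial_ratio=1.0) == 2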
In a particular embodiment, the offset value depends on the value of the base motion vector. For example, the sum of the absolute values of the horizontal and vertical components is multiplied by a fixed value to obtain the offset value to be used. As described above, this offset value o can be added alternatively to the horizontal and vertical components. Likewise, this particular embodiment can be combined with one or several other embodiments described above.
Still in a particular embodiment, two offset values ox and oy are computed, one being associated with each component. Again, the value of each offset can be computed as a function of the value of the component. For example, offset values ox and oy can be computed according to the following formula: ox = mvx / c + 1 and oy = mvy / c + 1, where c is a constant value.
Offset values ox and oy can also be computed according to the following formula: ox = mvx / mvy and oy = mvy / mvx. These offsets can alternatively be added to their respective components.
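Both formulas can be sketched as below; the integer division, the absolute values, the value of c, and the zero-division guard are assumptions, since the text leaves these details open:

def component_offsets(mvx, mvy, c=4):
    """Per-component offsets following ox = mvx / c + 1 and oy = mvy / c + 1."""
    return abs(mvx) // c + 1, abs(mvy) // c + 1

def ratio_offsets(mvx, mvy):
    """The variant ox = mvx / mvy and oy = mvy / mvx, guarded against
    division by zero."""
    ox = mvx // mvy if mvy else 0
    oy = mvy // mvx if mvx else 0
    return ox, oy

print(component_offsets(9, -5))   # -> (3, 2)
print(ratio_offsets(8, 2))        # -> (4, 0)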
When two lists are considered, four offsets can be computed, one for each list and each component. These offsets can be added alternatively by taking into account the direction (forward and backward) to determine the sign.
As described above by reference to Figures 13 and 15, the order of the predictors and of the candidates in the lists of predictors and of candidates is not the same for the base layer and the enhancement layers.
In particular, temporal predictors and candidates (references 1316 and 1522 of Figures 13 and 15, respectively) are considered as the first predictors and candidates for processing enhancement layers, since the inventors observed that such an order leads to improved coding efficiency (compared to using spatial predictors or candidates at the first rank in the list).
Regarding the enhancement layer, several temporal prediction and inter-layer modes can be used, in particular the Base mode, which competes with the AMVP and Merge modes to exploit the temporal correlation of pixels.
The Base mode mainly consists in deriving the encoding syntax of enhancement layers from the encoding syntax of the base layer. Therefore, its coding efficiency is close to the one provided by the use of spatial predictors. Indeed, the base layer prediction can be considered as a spatial prediction based on a collocated block of the base layer (and not on neighboring predictors).
Motion vector selection according to the Base mode leads to selections similar to those of spatial motion vector predictors. Indeed, this mode is mostly selected when the motion activity is low and when the temporal distance between frames is high. As a consequence, the spatial predictors of the AMVP and Merge modes are redundant with the Base mode. Moreover, the first predictor of the set is the most selected. For example, with HEVC common test conditions, the selection of the first predictor represents sixty percent of the selections. Accordingly, it seems important to select another predictor as the first predictor for the AMVP and Merge modes to provide diversity. The temporal predictor is, in that case, a good choice.
For the same reason that setting the temporal predictor at the first position is a good choice, the use of the base motion vector (references 1326 and 1524 of Figures 13 and 15, respectively) leads to better coding efficiency when it is set at the end of the list of predictors. Indeed, the base motion vector leads to a block predictor similar to the Base mode, and to exactly the same block predictor for the same block size. For the same block size, if the Base mode has only one motion vector, the Merge modes can produce exactly the same block prediction. Accordingly, it is preferable to set it at the end of the list of predictors. It is also possible to select a different spatial position for this base motion vector, as explained herein below.
According to a particular embodiment, offset predictors MVoI to MVo4 (references 1532 to 1538 in Figure 15) are added to the list of candidates before the base motion vector (reference 1528 in Figure 15).
As described above, the motion information coding is different between AMVP and Merge modes. For AMVP mode, the motion vector predictor only predicts a value of a motion vector and a residual between this motion vector predictor and the real motion vector is inserted in the bit-stream. The Merge mode is different in that complete motion information is predicted and no motion vector residual is to be transmitted in the bit-stream. Consequently, the list order can depend, for AMVP mode, on several other parameters and so the embodiment that is directed to the Merge mode is simpler than those that are directed to AMVP mode.
In a particular embodiment, if the reference frame index and the list index used for the AMVP mode are different from those used for the Base mode, the base motion vector is positioned in one of the first positions (e.g. position 1, 2 or 3).
Still in a particular embodiment, the base motion vector is ordered at the end of the list of predictors if the residual of the motion vector is equal to zero, otherwise it is positioned in one of the first positions (e.g. position 1, 2 or 3).
Still in a particular embodiment, the value of a variable is set to the sum of the absolute values of the components of the residual motion vector. If the value of this variable is less than a predetermined threshold, the base motion vector is added at the end of the list of predictors. Otherwise, it is positioned in one of the first positions (e.g. position 1, 2 or 3). In another embodiment, the base motion vector predictor is positioned in one of the possible positions according to the value of this variable. In this embodiment, the further the variable strays from zero, the weaker the rank of the base layer predictor in the list of predictors.
In still another embodiment, all criteria proposed below can be used to determine whether or not the base motion vector is to be added to the list of predictors.
In that case, instead of adding the base motion vector at the end of the list of predictors, it is removed. For example, when the residual of the motion vector is equal to zero, the base motion vector should be removed from the Inter predictors list.
It is to be noted that the spatial positions of predictors can be taken into consideration even if these predictors come from a previously encoded or decoded frame or from the base layer. This means that it is possible to consider that temporal block positions or inter-layer block positions are projected in the current frame.
As described above, Figure 14 illustrates the spatial positions of the motion vector predictors (references 1310, 1300 to 1308, and 1326 in Figure 13 and references 1516, 1500 to 1508, and 1524 in Figure 15) that are used in the predictor/candidate derivation processes described by reference to Figures 13 and 15.
Compared to the standard HEVC Merge and Inter predictor/candidate derivation processes, according to which the temporal predictor/candidate is based on two block positions, the temporal predictor is based on only one block position (the center position). Such a choice comes from the fact that, the temporal predictor/candidate being the first predictor/candidate of the list, it has to represent the motion information of the current block as closely as possible; on average, the center block is the best spatial position for representing the motion information of the current block.
For the predictor/candidate obtained from the base layer, only one position is considered (the bottom right position of the collocated block). Firstly, the predictor/candidate obtained from the base layer should create diversity in the spatial positions. Accordingly, the bottom right position can be considered a good choice in that it is the farthest position from the average spatial positions of the predictors already in the list of predictors/candidates (e.g. references 1516 and 1500 to 1508 in Figure 15).
Secondly, the predictor/candidate obtained from the base layer should not be redundant in comparison to the predictor/candidate selection of the Base mode (according to which the base motion vector can be added, for example, at the end of the list of predictors/candidates). Indeed, for the Merge mode, if the center of the base layer is used, exactly the same block predictor is derived as in the Base mode. Consequently, the Merge modes can give exactly the same decoded block as the Base mode. Therefore, to avoid such redundancies, the base motion vector should be selected at a position other than the collocated block.
Nevertheless, even if the use of the bottom right position avoids producing the same block predictor as the Base mode, it is possible that the bottom right position does not change the motion information.
To handle such a case, all motion predictors/candidates can be compared to the motion information of the Base mode in order to remove those equal to it from the list.
According to another embodiment, all neighboring blocks of the base layer are checked in order to find one that is different from the base motion vector at the center position.
When the up-sampled syntax of the base layer (Base mode) gives a smaller block size than the current block size, the predictor generated from the Base mode could be different because several motion vectors have been used to encode the base layer. Consequently, in a particular embodiment, the derivation process considers the block size in order to switch from the center position of the base layer to another neighboring position.
As mentioned above, whether or not certain predictors are used has to be carefully adapted for the derivation process of motion predictors in the enhancement layer.
Figure 16 shows an example of spatial positions of the neighboring blocks of the current block in the enhancement layer (A0, A1, B0, B1, B2) and their collocated blocks in the base layer (AL0, AL1, BL0, BL1, BL2). For the sake of clarity, it is considered that all blocks in the enhancement layer are of the same size and that this size is exactly the same as that of the up-sampled collocated blocks of the base layer.
According to a particular embodiment, the step of checking motion vector availability (references 1318 and 1322 in Figure 13 and reference 1510 in Figure 15) checks whether the neighboring blocks (A0, A1, B0, B1, B2) are encoded with the Base mode, one after another. In that case, the corresponding motion predictor/candidate is not added to the list of predictors/candidates.
For example, if A0 is encoded with the Base mode, it means that it is encoded with the motion information of AL0. If the base motion vector in the collocated block is different from AL0, it means that AL0 is not the best motion vector in the base layer. Moreover, the base motion vector of the collocated block is more correlated with the current block than those of its neighboring blocks (the motion vector selected for the current block is theoretically a better representation of the motion of the current block than the motion vector of a neighboring block). Therefore, in such a case, AL0 being different from the collocated block's motion vector, A0 does not need to be added to the list of predictors/candidates. Where AL0 is the best predictor in the base layer, the base motion vector is equal to AL0. In view of the previous remark regarding the use of the collocated base motion vector, which is redundant with the Base mode, A0 motion information does not need to be added either. Consequently, the motion vector of a neighboring block (A0, A1, B0, B1, B2) does not need to be added if that block is encoded with the Base mode. It is to be noted that this embodiment only needs to check the encoding mode of neighboring blocks and so can be easily implemented.
According to a particular embodiment, if the motion information of a neighboring block is equal to the base motion information, it is not added to the list of predictors (the collocated block and not the bottom right block being considered for the base motion information).
In another particular embodiment, if the motion information of one predictor is equal to the base motion information, it is not added to the list of predictors.
Still in another embodiment, if the motion information of one predictor is equal to the base motion information or to the base motion information of a neighboring block, it is not added to the list of predictors.
These embodiments can be extended to temporal predictors. Since the temporal predictor corresponds to the center of the collocated block and not to the bottom right block (as it is for the base layer), the temporal predictor should be different from the temporal motion vector used in the derivation process of the base layer.
The predictors that are strictly equal to their own base motion vector can be removed from the list of predictors.
According to a particular embodiment directed to spatial scalability, the motion vector of a predictor block is not added into the list of predictors if it does not contain a motion vector refinement in the spatial increase.
Naturally, in order to satisfy local and specific requirements, a person skilled in the art may apply to the solution described above many modifications and alterations, all of which, however, are included within the scope of protection of the invention as defined by the following claims.

Claims (6)

1. A method of encoding or decoding an image of an enhancement layer of a scalable video sequence comprising a base layer and at least one enhancement layer, by deriving motion information predictors for encoding or decoding an image portion of the image of the enhancement layer by motion compensation with respect to reference image portions, wherein for the image portion of the image of the enhancement layer to be encoded or decoded, the method comprises: generating a first list of motion vector predictors, said generation comprising testing the availability of at least one temporal motion vector predictor provided by a previously encoded image, at least one spatial motion vector predictor provided by an image portion neighboring the image portion to be encoded or decoded, and at least one base layer predictor provided by the base layer image corresponding to the image of the enhancement layer, each available predictor being inserted in the list; and complementing the list with offset predictors, wherein offset predictors are obtained by adding at least one offset to at least one component of the first motion vector predictor in the generated list.
2. A device for encoding or decoding an image of an enhancement layer of a scalable video sequence comprising a base layer and at least one enhancement layer, by deriving motion information predictors for encoding or decoding an image portion of the image of the enhancement layer by motion compensation with respect to reference image portions, wherein for the image portion of the image of the enhancement layer to be encoded or decoded, the device comprises: means for generating a first list of motion vector predictors, said generation comprising testing the availability of at least one temporal motion vector predictor provided by a previously encoded image, at least one spatial motion vector predictor provided by an image portion neighboring the image portion to be encoded or decoded, and at least one base layer predictor provided by the base layer image corresponding to the image of the enhancement layer, each available predictor being inserted in the list; and means for complementing the list with offset predictors, wherein offset predictors are obtained by adding at least one offset to at least one component of the first motion vector predictor in the generated list.
3. A video encoder comprising the device according to claim 2.
4. A video decoder comprising the device according to claim 2.
5. A method of encoding or decoding an image of an enhancement layer of a scalable video sequence comprising a base layer and at least one enhancement layer, by deriving motion information predictors for encoding or decoding an image portion of the image of the enhancement layer by motion compensation with respect to reference image portions, substantially as hereinbefore described with reference to, and as shown in, Figure 13 or 18.
6. A device for encoding or decoding an image of an enhancement layer of a scalable video sequence comprising a base layer and at least one enhancement layer, by deriving motion information predictors for encoding or decoding an image portion of the image of the enhancement layer by motion compensation with respect to reference image portions, substantially as hereinbefore described with reference to, and as shown in, Figure 6, 7, or 8.
GB201300382A 2013-01-09 2013-01-09 Method, device, and computer program for motion vector prediction in scalable video encoder and decoder Withdrawn GB2511288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB201300382A GB2511288A (en) 2013-01-09 2013-01-09 Method, device, and computer program for motion vector prediction in scalable video encoder and decoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB201300382A GB2511288A (en) 2013-01-09 2013-01-09 Method, device, and computer program for motion vector prediction in scalable video encoder and decoder

Publications (2)

Publication Number Publication Date
GB201300382D0 GB201300382D0 (en) 2013-02-20
GB2511288A true GB2511288A (en) 2014-09-03

Family

ID=47748181

Family Applications (1)

Application Number Title Priority Date Filing Date
GB201300382A Withdrawn GB2511288A (en) 2013-01-09 2013-01-09 Method, device, and computer program for motion vector prediction in scalable video encoder and decoder

Country Status (1)

Country Link
GB (1) GB2511288A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10812791B2 (en) 2016-09-16 2020-10-20 Qualcomm Incorporated Offset vector identification of temporal motion vector predictor

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2487261A (en) * 2011-01-12 2012-07-18 Canon Kk Motion compensated image coding using diverse set of motion predictors
WO2013070757A1 (en) * 2011-11-07 2013-05-16 Qualcomm Incorporated Generating additional merge candidates
WO2014072571A1 (en) * 2012-10-01 2014-05-15 Nokia Corporation Method and apparatus for scalable video coding

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2487261A (en) * 2011-01-12 2012-07-18 Canon Kk Motion compensated image coding using diverse set of motion predictors
WO2013070757A1 (en) * 2011-11-07 2013-05-16 Qualcomm Incorporated Generating additional merge candidates
WO2014072571A1 (en) * 2012-10-01 2014-05-15 Nokia Corporation Method and apparatus for scalable video coding

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10812791B2 (en) 2016-09-16 2020-10-20 Qualcomm Incorporated Offset vector identification of temporal motion vector predictor

Also Published As

Publication number Publication date
GB201300382D0 (en) 2013-02-20

Similar Documents

Publication Publication Date Title
US11943465B2 (en) Video encoding and decoding
US10812821B2 (en) Video encoding and decoding
CN107277546B (en) Encoding device and method, decoding device and method, and storage medium
EP3700213B1 (en) Method and apparatus for encoding or decoding an image with inter layer motion information prediction according to motion information compression scheme
US20140064373A1 (en) Method and device for processing prediction information for encoding or decoding at least part of an image
US20140192884A1 (en) Method and device for processing prediction information for encoding or decoding at least part of an image
GB2506592A (en) Motion Vector Prediction in Scalable Video Encoder and Decoder
GB2509702A (en) Scalable Image Encoding Including Inter-Layer Prediction
GB2511288A (en) Method, device, and computer program for motion vector prediction in scalable video encoder and decoder
Lin et al. Optimal Frame Level Bit Allocation Based on Game Theory for HEVC Rate Control
GB2512828A (en) Method and apparatus for encoding or decoding an image with inter layer motion information prediction

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)