EP1721471A1

EP1721471A1 - Method for coding and decoding an image sequence encoded with spatial and temporal scalability

Info

Publication number: EP1721471A1
Application number: EP05728100A
Authority: EP
Inventors: Edouard Francois; Guillaume Boisson; Jérôme Vieron; Gwenaëlle Marquant; Philippe Robert
Original assignee: Thomson Licensing SAS
Current assignee: InterDigital VC Holdings Inc
Priority date: 2004-03-02
Filing date: 2005-02-21
Publication date: 2006-11-15
Also published as: JP4986842B2; CN1926876B; FR2867328A1; CN1926876A; JP2007535834A; US20070171971A1; WO2005086488A1

Abstract

The invention relates to a method characterised in that the resolution selected for the motion information and the complexity of the interpolation filters used during a motion-compensated temporal filtering operation depend on a decoding scenario, namely the bit rate and the temporal and spatial resolutions selected for decoding, or on the corresponding temporal decomposition level, or on a combination of these parameters. The method applies to so-called scalable video encoders/decoders, for example in the field of videotelephony or video transmission over the Internet.

Description

METHOD FOR CODING AND DECODING A SEQUENCE OF CODED IMAGES WITH SPATIAL AND TEMPORAL SCALABILITY

The invention relates to a method for the video coding and decoding of an image sequence coded with spatial and temporal scalability, by hierarchical temporal analysis exploiting motion-compensated temporal filtering.

The field is that of video compression based on spatial and/or temporal scalability schemes, also known as "scalable" schemes. An example is 2D+t wavelet coding comprising motion-compensated temporal filtering.

A scalable coding-extraction-decoding chain is illustrated in FIG. 1. The source images are transmitted to a scalable video coding circuit 1. The original bit stream obtained is processed by an extractor 2 to give an extracted bit stream. The latter is decoded by the decoding circuit 3, which outputs the decoded video.

Scalability makes it possible to generate an original bit stream from which sub-streams can be extracted that are adapted to given sets of parameters such as bit rate, spatial resolution and temporal frequency. For example, if the original scalable bit stream has been generated from a video sequence at 25 Hz, with a resolution of 720 * 480 pixels and without bit rate constraint, it is possible, after extracting the appropriate data from this bit stream, to obtain a bit stream with, for example, the parameters 1 Mb/s, 12.5 Hz, 360 * 240 pixels, which is itself scalable. Decoding this extracted sub-stream generates a video at 12.5 Hz, of size 360 * 240 pixels.

In existing scalable video compression approaches, coding and decoding proceed in an identical manner, without taking into account operating conditions such as the level of temporal decomposition, the bit rate or the spatial resolution of the decoded video. In particular, if the decoding involves motion compensation between images, this compensation is applied identically, regardless of the size of the images or the bit rate of the video to be decoded. This results in degraded image quality, especially when the image resolution becomes small relative to the size of the interpolation filters used for motion compensation.

The invention aims to overcome the aforementioned drawbacks. One of the objects of the invention is a method of decoding an image sequence coded with spatial and temporal scalability, the coded data comprising motion information, comprising a step of hierarchical temporal synthesis carrying out motion-compensated temporal filtering, or MCTF, of images of a frequency decomposition level from said motion information, to provide images of a lower decomposition level, characterized in that, during a motion-compensated temporal filtering operation, the resolution chosen for the exploitation of the motion information and the complexity of the interpolation filters used are a function of a decoding scenario, namely the spatial and temporal resolutions and the bit rate selected for decoding, or else of the corresponding temporal decomposition level, or of a combination of these parameters.

According to a particular implementation, the number of coefficients of the interpolation filter used for motion compensation depends on the decoding scenario or on the level of temporal decomposition.

According to a particular implementation, the step of hierarchical temporal synthesis is a decoding of wavelet coefficients with motion-compensated filtering.

The invention also relates to a method of coding a sequence of images of given spatial resolution, with spatial and temporal scalability, comprising a step of hierarchical temporal analysis performing motion-compensated temporal filtering, or MCTF, of images of a frequency decomposition level, from motion information between these images, to give images of a higher decomposition level, characterized in that, during a motion-compensated temporal filtering operation, the resolution chosen for the exploitation of said motion information and the complexity of the interpolation filters used are a function of said spatial resolution of the source images or of the corresponding temporal decomposition level.

The method, according to a particular implementation, comprises a step of motion estimation calculated between two images of a given level of decomposition in order to perform the motion compensation, and the precision of calculation of the motion estimation depends on the level of temporal decomposition or on said spatial resolution of the source images.

The step of hierarchical temporal analysis is for example a wavelet coding with motion-compensated filtering.

The invention also relates to a decoder for implementing the previously described decoding method, characterized in that it includes a motion configuration selection circuit for determining the resolution of the motion and the interpolation filter to be used in the motion compensation for motion-compensated filtering, depending on the decoding scenario, i.e. the spatial and temporal resolutions and the bit rate selected for decoding, or on the corresponding temporal decomposition level, or on a combination of these parameters.

The invention also relates to an encoder for implementing the previously described coding method, characterized in that it includes a motion configuration selection circuit for determining the interpolation filter to be used by the temporal analysis circuit for motion compensation as a function of said spatial resolution of the source images or of the corresponding temporal decomposition level.
According to a particular embodiment, the coder is characterized in that it includes a motion configuration selection circuit for determining the precision of the motion calculated by the motion estimation circuit, as a function of said spatial resolution of the source images or of the corresponding temporal decomposition level.

The precision of the motion and the interpolation filters used for motion compensation in the coding and decoding process are adapted according to various parameters, such as the level of temporal decomposition being processed. For decoding, these filters are adapted to the bit rate of the decoded stream and to the spatial or temporal resolution of the decoded video.

Thanks to this adaptive motion compensation, the quality of the images is improved and the complexity of the processing is reduced.

Other particularities and advantages of the invention will emerge more clearly from the following description, given by way of nonlimiting example and made with reference to the appended figures, which represent:
- FIG. 1, a coding chain according to the prior art,
- FIG. 2, a simplified coding scheme,
- FIG. 3, a temporal filtering of a GOP,
- FIG. 4, a temporal filtering on two images,
- FIG. 5, a decoding circuit,
- FIG. 6, a flowchart for choosing the motion configuration,
- FIG. 7, a second flowchart for choosing the motion configuration.

We consider a 2D+t wavelet-based coding/decoding scheme, performing wavelet analysis/synthesis along the motion trajectories. The system operates on groups of images, or GOPs.

The overall architecture of the encoder is described in FIG. 2. The source images are transmitted to a temporal analysis circuit 4, which performs a motion-compensated temporal analysis, or MCTF (motion compensated temporal filtering), to obtain the different temporal frequency bands. The images are transmitted to a motion estimation circuit 7, which calculates the motion fields. These fields are transmitted to a "pruning" circuit 10, which simplifies the motion information calculated by the motion estimation circuit in order to control the cost of the motion. The motion fields thus simplified are transmitted to the temporal analysis circuit in order to define the analysis filters. They are also transmitted to a coding circuit 11, which codes the simplified motion fields.

The images resulting from the temporal analysis are transmitted to a spatial analysis circuit 5, which performs sub-band coding of the low band image and of the high band images obtained by the temporal analysis. The spatio-temporal wavelet coefficients thus obtained are finally coded by an entropy coder 6. This coder outputs a set of binary packets corresponding to the superimposed scalability layers, in quality as well as in spatial and temporal resolution. A packetizer 12 merges these binary packets with the motion data coming from the coding circuit 11 to provide the final scalable bit stream.

The images at the different levels of temporal decomposition are transmitted by the temporal analysis circuit 4 to the motion estimation circuit 7, which comprises a first motion configuration selection circuit. This circuit, not shown in the figure, defines the operating conditions of the motion estimation circuit according to the different levels of decomposition of the images.
Optionally, the motion information, once simplified by the pruning circuit 10, is transmitted to the temporal analysis circuit through a mode switching circuit 9. This circuit makes it possible to test the quality of the motion estimation, for example by testing the number of connected pixels between the current image and the previous image at a given level of decomposition, and can impose on the temporal analysis circuit a coding in intra mode or a coding in predictive mode, that is to say a filtering of the current image with the following image rather than the preceding one, when this quality of motion is insufficient. The choice between intra and predictive mode depends, for example, on the quality of the motion estimation between the current image and the next image. The temporal analysis circuit includes a second motion configuration selection circuit, also not shown in the figure, which determines, depending on the levels of decomposition of the images and/or the spatial resolution of the source images, the configuration to be adopted for the motion compensation used in this temporal analysis.

FIG. 3 summarises the motion-compensated temporal filtering operations carried out by the temporal analysis circuit 4, with a 4-level decomposition for GOPs comprising, in this example, 16 images represented in bold lines. The filtering method used is called "lifting". Instead of using a complex filtering for the wavelet coding, exploiting a linear filter of considerable length (in our example the filtering would be carried out on a group of 16 images), this method consists, in a known manner, of "factorizing" the filter by using filters of limited length, for example of length two if it is chosen to filter the samples two by two, this filtering being repeated for each level of decomposition. We therefore consider the case where the filtering in the direction of motion is carried out on pairs of images.

The low frequency and high frequency filtering on each of the pairs of the GOP produces, at the first level of temporal decomposition, 8 images of low temporal frequency (t-L) and 8 images of high temporal frequency (t-H) respectively. The images of low temporal frequency are then decomposed again according to the same method. The low-pass filtering of these images provides 4 new images of low temporal frequency, t-LL, and the high-pass filtering of these same images provides 4 images of high temporal frequency, t-LH. The third level of decomposition provides 2 images of low temporal frequency, t-LLL, and 2 images of high temporal frequency, t-LLH. The fourth and last level provides one low temporal frequency image, t-LLLL, and one high temporal frequency image, t-LLLH.

This temporal decomposition is a decomposition into 5 temporal bands, which therefore generates 1 t-LLLL image, 1 t-LLLH image, 2 t-LLH images, 4 t-LH images and 8 t-H images per GOP of 16 images. The t-L, t-LL and t-LLL images, and of course the original images, are ignored for the downstream coding, since they are the starting point of the decomposition into sub-bands that provides the decorrelated images at each level.
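The band counts quoted for a 16-image GOP can be checked with a few lines of Python. This is only an illustrative sketch: the function name `temporal_bands` and the band labels used as dictionary keys are choices made here, not part of the patent.

```python
def temporal_bands(gop_size: int, levels: int) -> dict:
    """Count the images kept per temporal band for a pairwise (dyadic) decomposition."""
    bands = {}
    low_count = gop_size  # number of low-frequency images entering the current level
    prefix = ""           # one 'L' is appended per completed level
    for _ in range(levels):
        if low_count % 2:
            raise ValueError("each level needs an even number of images")
        low_count //= 2   # filtering pairs halves the number of images
        bands["t-" + prefix + "H"] = low_count  # high band produced at this level
        prefix += "L"
    bands["t-" + prefix] = low_count  # only the last low band is kept for coding
    return bands


bands = temporal_bands(16, 4)
print(bands)
print(sum(bands.values()))  # total number of images sent to the spatial analysis
```

The total number of retained images equals the GOP size, which reflects the fact that the intermediate low-frequency images are not coded.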
This decomposition therefore allows a new distribution of the energy by generating one useful image of low temporal frequency, t-LLLL, which represents an average of the whole GOP and in which the energy is concentrated, and four levels of low-energy images of high temporal frequency, i.e. 5 frequency bands. These are the images that are transmitted to the spatial analysis circuit for spatial decomposition into sub-bands.

To perform the filtering, a motion field is estimated between each pair of images to be filtered, and this for each level. This is the function of the motion estimator 7.

The filtering of a pair of source images A and B consists by default in generating a low temporal frequency image L and a high temporal frequency image H from a sum and a difference in which MC(I) denotes the motion-compensated image I. The sum performs the low-pass filtering and the difference the high-pass filtering.

FIG. 4 is a simplified illustration of the temporal filtering of two successive images A and B, image A being the first image along the time axis and in display order, giving a low frequency image L and a high frequency image H.

The motion estimation is made with respect to a reference image, from the current image to the reference image. For each pixel of the current image we search for its correspondent, if it exists, in the reference image, and we assign to it the corresponding motion vector. The pixel of the reference image is then said to be connected.

Obtaining image L requires motion compensation of image A. This compensation is done by estimating the motion from image B to image A, taking A as the reference image, a motion and therefore a vector then being assigned to each pixel of image B.
The value of a pixel of L is equal, up to a scale factor, to the sum of the luminance of the corresponding pixel of image B and the luminance of the pixel or sub-pixel of A pointed to by the motion vector assigned to the corresponding pixel of image B. An interpolation is necessary when this vector does not point to a pixel of image A. This is a forward prediction from a past reference image, with a calculation of forward motion vectors, or "forward vectors", with reference to the MPEG standard.

Obtaining image H requires motion compensation of image B. This compensation is done by estimating the motion from image A to image B, taking B as the reference image, a motion and therefore a vector then being assigned to each pixel of image A. The value of a pixel of H is equal, up to a scale factor, to the difference between the luminance of the corresponding pixel of image A and the luminance of the pixel or sub-pixel of B pointed to by the motion vector assigned to the corresponding pixel of image A. An interpolation is necessary when this vector does not point to a pixel of image B. This is a backward prediction from a later reference image, with a calculation of backward motion vectors, or "backward vectors", with reference to the MPEG standard.

In practice, only one field of motion vectors is calculated, from A to B or from B to A. The other field of motion vectors is deduced from the first, which generates pixels that are not connected, that is to say pixels that are not assigned a motion vector and that correspond to holes in the reverse motion vector field.

In practice, the low and high frequency images are calculated by a filtering, equivalent to that previously described, which consists in first calculating the image H. This image is obtained by the point-to-point difference between image B and the motion-compensated image A.
Thus, we subtract from a pixel of B a certain value, interpolated if necessary, pointed to in A by the displacement vector, that is the motion vector calculated during the motion estimation from image B to image A. The image L is then deduced from image H, and no longer from image B, by adding image A to the image H compensated with the inverse motion. $MC^{-1}_{A \leftarrow B}(H)$ corresponds to a motion "decompensation" of the image H. Thus, a certain value, interpolated where appropriate, located in image H at the base of a displacement vector from B towards A and pointing to the pixel of A, is added to a pixel of A, or more exactly to a normalized value of the luminance of that pixel. The same reasoning can be applied at the level of an image block instead of a pixel.
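The two-step calculation described above (H first, then L deduced from H) can be sketched as follows. This is a minimal illustration under two assumptions that are not taken from the text: the motion is zero, so that motion compensation and its inverse reduce to the identity, and the scale factor is the square root of 2, a common normalisation for this kind of pairwise filtering. The function names are hypothetical.

```python
import math

SCALE = math.sqrt(2.0)  # assumed normalisation; the text only speaks of a scale factor


def analyse_pair(a, b):
    """Lifting analysis of two images given as flat lists of luminance values."""
    # Zero-motion assumption: MC(A) is A itself, and decompensating H gives H itself.
    h = [(pb - pa) / SCALE for pa, pb in zip(a, b)]  # H from B minus compensated A
    l = [SCALE * pa + ph for pa, ph in zip(a, h)]    # L from A plus decompensated H
    return l, h


def synthesise_pair(l, h):
    """Inverse lifting: undo the two steps in reverse order."""
    a = [(pl - ph) / SCALE for pl, ph in zip(l, h)]
    b = [SCALE * ph + pa for pa, ph in zip(a, h)]
    return a, b


a = [10.0, 20.0, 30.0]
b = [12.0, 18.0, 33.0]
low, high = analyse_pair(a, b)
print(low, high)
print(synthesise_pair(low, high))  # recovers a and b up to rounding
```

Under these assumptions L equals the scaled sum of A and B, which matches the statement that the lifting form is equivalent to the default sum and difference filtering. Because each step is undone exactly at synthesis, the reconstruction does not depend on the choice of interpolation filter, which is what leaves room for adapting it.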

The motion estimation circuit 7 uses, for example, a motion estimation algorithm known as "block matching". A current image block is correlated with the blocks of a search window in the reference image to determine the motion vector corresponding to the best correlation. This search is carried out not only on the blocks of the search window obtained by successive displacements of one pixel horizontally and vertically, but also on interpolated blocks if the precision sought is finer than one pixel. This interpolation consists in calculating the luminance values of the sub-pixels, for the generation of image blocks obtained by successive displacements of a value smaller than the distance between two pixels. For example, for quarter-pixel accuracy, a correlation test is performed every quarter of a pixel, horizontally and vertically. This interpolation uses filters called the interpolation filters of the motion estimation.

The images for which motion-compensated temporal filtering is to be carried out are transmitted to the motion estimator 7 so that it estimates the motion between two images. This circuit includes a first motion configuration selection circuit, which receives, in addition to the image decomposition level information, other information such as the spatial resolution of the source images. This circuit decides the motion configuration according to this level and/or the spatial resolution. Thus, for example, the precision in the calculation of the motion vectors depends on the level of temporal decomposition of the processed images: the higher the level of decomposition, the lower this precision. The interpolation filters of the motion estimator are configured to suit this motion precision. An example of configuration is given below.

The temporal analysis circuit 4, as indicated above, performs motion compensation for the temporal filtering of the images.
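The block matching search can be sketched as follows, at integer-pixel precision only. The text speaks of correlation; this sketch substitutes the sum of absolute differences (SAD) as the matching criterion, and the function names and the exhaustive search are choices made here for illustration.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block of cur and a block of ref."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))


def block_match(cur, ref, cx, cy, size, search):
    """Return the vector (dx, dy) of the best match in ref for the block of cur at (cx, cy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidate blocks that fall outside the reference image.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


# A 2 x 2 pattern placed at column 4, row 3 in the reference image
# and at column 2, row 2 in the current image.
ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
ref[3][4], ref[3][5], ref[4][4], ref[4][5] = 10, 20, 30, 40
cur[2][2], cur[2][3], cur[3][2], cur[3][3] = 10, 20, 30, 40
print(block_match(cur, ref, cx=2, cy=2, size=2, search=3))
```

A sub-pixel search would run the same comparison on interpolated reference blocks, with a step of one quarter or one eighth of a pixel depending on the configuration chosen.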
These motion compensation operations require interpolation operations using interpolation filters, and this for each level of decomposition. The second motion configuration selection circuit, in this temporal analysis circuit, which may be different from the first, implements a processing algorithm that adapts the precision of the motion and the complexity of the interpolation filter for motion compensation as a function of the level of temporal decomposition of the images to be motion compensated. As with the first motion configuration selection circuit, these different adaptations or configurations can also depend on the spatial resolution of the source images processed. Obviously, an encoder comprising only one of these configuration selection circuits is within the scope of the invention.

A decoder according to the invention is described in FIG. 5. The bit stream received by the decoder is transmitted to the input of an entropy decoding circuit 13, which performs the reverse operations of the entropy coding circuit of the coder. Among other things, it decodes the spatio-temporal wavelet coefficients and, where appropriate, the coding modes. This bit stream is transmitted in parallel to the input of a motion decoding circuit 14, which decodes the motion fields received in the bit stream in order to transmit them to the temporal synthesis circuit. The entropy decoding circuit 13 is connected to a spatial synthesis circuit 15, which reconstructs the images corresponding to the different temporal sub-bands. The temporal wavelet coefficients coming from the spatial synthesis circuit are transmitted to a temporal synthesis circuit 16, which reconstructs the output images using temporal synthesis filters. The temporal synthesis circuit includes a motion configuration selection circuit, not shown in the figure, which determines, as a function of the decoding conditions and/or the levels of decomposition of the images, the configuration to be adopted for the motion compensation used in this temporal synthesis. The temporal synthesis circuit is connected to a post-processing circuit 17, the output of which is the output of the decoder. This is for example a post-filtering to reduce artifacts such as block effects.

In the case where the coder uses coding modes other than the MCTF mode, for example the intra mode and the predictive mode, a temporal filter switching circuit is used to receive this coding mode information coming from the entropy decoding circuit 13 and to transmit it to the temporal synthesis circuit 16, which switches filters accordingly.

The motion configuration selection circuit receives the bit rate, the spatial and temporal resolution information and the temporal decomposition levels.
From this information, or from one of these pieces of information, it chooses a motion compensation configuration for the temporal synthesis. The temporal synthesis circuit adapts the interpolation filter according to this chosen configuration. The bit rate of the bit stream received by the decoder corresponds to the extracted bit stream. The scalable coder generally transmits the highest bit rate, which is the original bit stream, as we saw previously, and the extractor, which can be controlled by the decoder, extracts the bit stream corresponding to the desired resolutions. The bit rate information is available at the decoder.

The spatial resolution, temporal resolution and bit rate information defines a decoding scenario. This scenario depends, for example, on the display used by the decoder and on the bit rate available for receiving data. It is on the basis of this information and/or the level of temporal decomposition that the temporal synthesis circuit is configured as regards the interpolation filters.

An example of adaptation of the motion precision, and of the interpolation filter that depends on this precision, is given below for the motion estimation operations at the encoder and for the motion compensation operations at the encoder or decoder:

Configuration | Motion precision | Interpolation filters
1 | 1/4 pixel | Bilinear
2 | 1/8 pixel | 1/4 pixel by 8-coefficient FIR interpolation, then 1/8 pixel by bilinear interpolation

The filter of configuration 2 is very similar to that used in the MPEG-4 Part 10 standard (reference ITU-T Rec. H.264, ISO/IEC 14496-10 AVC).

FIG. 6 represents a decision flowchart implemented by the motion configuration selection circuit belonging to the temporal analysis circuit. Step 20 determines whether the resolution of the source image supplied to the coder is lower than that of the QCIF (Quarter Common Intermediate Format) format, corresponding to 176 columns by 120 lines. If so, the next step is step 23, which selects configuration 1. If not, the next step is step 21, which checks the level of temporal decomposition. If this level is strictly greater than 2, the next step is step 23 and configuration 1 is chosen. Otherwise, the next step is step 22, which selects configuration 2.

FIG. 7 represents a decision flowchart for the decoder. Step 24 determines whether the resolution of the image supplied by the decoder and corresponding to the extracted bit stream is lower than that of the QCIF format, 176 columns by 120 lines. If so, the next step is step 26, which chooses configuration 1. If not, the next step is step 25, which checks the level of temporal decomposition. If this level is strictly greater than 2, the next step is step 26 and configuration 1 is used. Otherwise, the next step is step 27.
This step 27 determines whether the resolution of the image to be decoded is equal to that of the SD (Standard Definition) format, 720 columns by 480 lines, and whether the bit rate of the bit stream received is less than 1.5 Mb/s. If so, the next step is step 26, which selects configuration 1. If not, the next step is step 28. This step 28 determines whether the resolution of the image to be decoded is equal to that of the CIF format, 352 columns by 240 lines, and whether the bit rate is less than 700 kbit/s. If so, the next step is step 26, which imposes configuration 1. If not, configuration 2 is imposed on the temporal filtering circuits.
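The two decision flowcharts can be written as a pair of small functions. This is a sketch rather than the patented circuit: the function names are hypothetical, a resolution is represented as a (columns, lines) tuple, and "lower than QCIF" is interpreted here as a comparison of pixel counts, which is an assumption.

```python
QCIF = (176, 120)  # formats with the dimensions quoted in the text
CIF = (352, 240)
SD = (720, 480)


def _pixels(resolution):
    columns, lines = resolution
    return columns * lines


def encoder_configuration(source_resolution, level):
    """FIG. 6: configuration chosen for the temporal analysis (steps 20 to 23)."""
    if _pixels(source_resolution) < _pixels(QCIF):  # step 20
        return 1                                    # step 23
    if level > 2:                                   # step 21
        return 1                                    # step 23
    return 2                                        # step 22


def decoder_configuration(decoded_resolution, level, bit_rate_kbps):
    """FIG. 7: configuration chosen for the temporal synthesis (steps 24 to 28)."""
    if _pixels(decoded_resolution) < _pixels(QCIF):          # step 24
        return 1                                             # step 26
    if level > 2:                                            # step 25
        return 1
    if decoded_resolution == SD and bit_rate_kbps < 1500:    # step 27
        return 1
    if decoded_resolution == CIF and bit_rate_kbps < 700:    # step 28
        return 1
    return 2


print(encoder_configuration(SD, level=1))
print(decoder_configuration(SD, level=1, bit_rate_kbps=1000))
print(decoder_configuration(CIF, level=2, bit_rate_kbps=800))
```

Configuration 1 stands for quarter-pixel motion with bilinear interpolation and configuration 2 for eighth-pixel motion with the 8-coefficient FIR filter, following the configuration example given earlier.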

The interpolation filter is for example of the 8-coefficient FIR type, FIR standing for Finite Impulse Response. The filtering is carried out by convolution, thus taking into account the luminances of the 4 pixels preceding and the 4 pixels following the sub-pixel to be calculated. For the different sub-pixel positions s at 1/4, 1/2 and 3/4, three different interpolation filters of the preceding type can be used. The value of a coefficient n is given by a formula in which s is the sub-pixel position (s = 1/4, 1/2 or 3/4), n is the number of the coefficient and h(m) is the attenuation filter, or Hamming window. The FIR filters are deduced by weighting by a Hamming window and truncating these weighted filters.

For s = 1/4, the coefficients are:
[-0.0110 0.0452 -0.1437 0.8950 0.2777 -0.0812 0.0233 -0.0053]
For s = 1/2, the coefficients are:
[-0.0053 0.0233 -0.0812 0.2777 0.8950 -0.1437 0.0452 -0.0110]
For s = 3/4, the coefficients are:
[-0.0105 0.0465 -0.1525 0.6165 0.6165 -0.1525 0.0465 -0.0105]

With these filters, we can interpolate at 1/4, 1/2 and 3/4 of a pixel.
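Applying one of these 8-coefficient filters along a row of pixels can be sketched as follows, using the first coefficient set listed above, read here as the quarter-pixel filter. Two points are assumptions made for the sketch: the alignment of the taps (the fourth coefficient weights the pixel just before the sub-pixel position) and the replication of edge pixels at the image borders. The names `QUARTER_TAPS` and `interpolate` are hypothetical.

```python
QUARTER_TAPS = [-0.0110, 0.0452, -0.1437, 0.8950, 0.2777, -0.0812, 0.0233, -0.0053]


def interpolate(row, x, taps=QUARTER_TAPS):
    """Luminance at a sub-pixel position lying between row[x] and row[x + 1]."""
    last = len(row) - 1
    total = 0.0
    for k, coeff in enumerate(taps):
        # Taps cover pixels x - 3 to x + 4: four before and four after the sub-pixel.
        index = min(max(x - 3 + k, 0), last)  # edge replication (assumption)
        total += coeff * row[index]
    return total


flat = [100] * 10
edge = [0] * 5 + [100] * 5
print(interpolate(flat, 4))  # a uniform area stays uniform, since the taps sum to 1
print(interpolate(edge, 4))  # value a quarter of a pixel before a sharp edge
```

The bilinear filter of configuration 1 would need only the two nearest pixels for the same position, which illustrates the difference in complexity between the two configurations.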

The interpolation is done first along the horizontal dimension, then along the vertical dimension. The 1/8 pixel interpolation is then performed by a bilinear interpolation from the 1/4 pixel positions.
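The last stage, going from quarter-pixel to eighth-pixel positions by bilinear interpolation, can be sketched as follows. The input is assumed to be a grid of luminance values already interpolated at quarter-pixel spacing; the function names are hypothetical, and each new sample is taken as the mean of its two quarter-pixel neighbours.

```python
def eighth_pel_row(quarter_row):
    """Insert the midpoint between each pair of neighbouring quarter-pixel samples."""
    out = []
    for left, right in zip(quarter_row, quarter_row[1:]):
        out.append(left)
        out.append((left + right) / 2.0)  # bilinear value halfway between the two
    out.append(quarter_row[-1])
    return out


def eighth_pel_grid(quarter_grid):
    """Refine a quarter-pixel grid to eighth-pixel spacing, horizontally then vertically."""
    rows = [eighth_pel_row(row) for row in quarter_grid]        # horizontal pass
    cols = [eighth_pel_row(list(col)) for col in zip(*rows)]    # vertical pass
    return [list(row) for row in zip(*cols)]                    # back to row order


print(eighth_pel_row([0, 4, 8]))
print(eighth_pel_grid([[0, 4], [8, 12]]))
```

Because the costly FIR filtering stops at quarter-pixel precision, the extra precision of configuration 2 adds only one averaging per new sample.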

The adaptation example given above for the coder can be applied in the same way to the decoder. In general, the principle is to use limited motion precision and simple interpolation filters when operating at limited image qualities, that is to say at low bit rates, on images of small size and at high levels of temporal decomposition. Conversely, when processing images of good quality, with high spatial resolution, high bit rates and low temporal decomposition levels, high motion precision and sophisticated interpolation filters are used. The justification for this principle is that when the images to be filtered are poor in frequency content or of limited resolution, it is not useful to use very advanced interpolation filters or very high motion precision.

The applications of the invention relate to so-called “scalable” video coders / decoders used for data compression / decompression, for example in the field of video telephony or video transmission over the Internet.

Claims

1 Method for decoding a coded image sequence with spatial and temporal scalability, the coded data comprising movement information, comprising a step of hierarchical temporal synthesis (16) carrying out temporal filtering compensated in movement, or MCTF, of images of a frequency decomposition level from said movement information, to provide images of a lower decomposition level, characterized in that, during a motion-compensated temporal filtering operation, the resolution chosen for the exploitation of the movement information and the complexity of the interpolation filters used are a function of a decoding scenario, namely spatial, temporal and bit rate resolutions selected for decoding or else the corresponding temporal decomposition level or a combination of these parameters.

2 Method according to claim 1, characterized in that the number of coefficients of the interpolation filter (16) used for motion compensation depends on the decoding scenario or the level of temporal decomposition.

3 Method according to claim 1, characterized in that the step of hierarchical temporal synthesis (16) is a decoding of wavelet coefficients with filtering compensated in movement.

4 Method for coding a sequence of images of given spatial resolution, with spatial and temporal scalability, comprising a step of hierarchical temporal analysis (4) performing temporal filtering compensated in motion, or MCTF, of images of a frequency decomposition level, from movement information between these images (7), to give images of a higher decomposition level, characterized in that, during a motion-compensated temporal filtering operation (4), the resolution chosen for processing said motion information and the complexity of the interpolation filters used (9) are a function of said spatial resolution of the source images or of the corresponding temporal decomposition level.

5 Method according to claim 4, characterized in that it comprises a motion estimation step (7) calculated between two images of a given level of decomposition to achieve motion compensation (4) and in that the precision for calculating the motion estimate (7) is a function of the level of temporal decomposition or of the said spatial resolution of the source images.

6 Method according to claim 4, characterized in that the hierarchical temporal analysis step (4) is a wavelet coding with compensated filtering in motion.

7 Decoder for implementing the method according to claim 1, characterized in that it comprises a movement configuration selection circuit (16) for determining the resolution of the movement and the interpolation filter to be used in the compensation in motion (16) for motion compensated filtering, depending on the decoding scenario, i.e. spatial and temporal resolutions and bit rate selected for decoding or the corresponding temporal decomposition level or a combination of these parameters.

8 Encoder for implementing the method according to claim 4, characterized in that it includes a movement configuration selection circuit (4) for determining the interpolation filter to be used by the time analysis circuit for the motion compensation (4) as a function of said spatial resolution of the source images or of the corresponding time decomposition level.

9 Encoder for implementing the method according to claim 4, characterized in that it comprises a movement configuration selection circuit (7) for determining the precision of the movement calculated by the movement estimation circuit (7), as a function of said spatial resolution of the source images or of the corresponding time decomposition level.