CN1461563A - Method of encoding sequence of frames - Google Patents

Method of encoding sequence of frames

Info

Publication number
CN1461563A
Authority
CN
China
Prior art keywords
temporal
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN02801121A
Other languages
Chinese (zh)
Other versions
CN1251509C (en)
Inventor
B. Pesquet-Popescu
V. Bottreau
Current Assignee
Koninklijke Philips NV
Original Assignee
Koninklijke Philips Electronics NV
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips Electronics NV
Publication of CN1461563A
Application granted
Publication of CN1251509C
Anticipated expiration
Status: Expired - Fee Related


Classifications

    • H ELECTRICITY > H04 ELECTRIC COMMUNICATION TECHNIQUE > H04N PICTORIAL COMMUNICATION, e.g. TELEVISION > H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/61 Transform coding in combination with predictive coding
    • H04N19/615 Transform coding in combination with predictive coding using motion compensated temporal filtering [MCTF]
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/553 Motion estimation dealing with occlusions
    • H04N19/583 Motion compensation with overlapping blocks
    • H04N19/63 Transform coding using sub-band based transform, e.g. wavelets
    • H04N19/13 Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to a method for encoding a sequence of frames by means of a three-dimensional (3D) subband decomposition involving a filtering step applied to the frames considered as a 3D volume. The decomposition is applied to successive groups of frames after motion estimation and compensation which lead to double-connected pixels corresponding to uncovered regions. A minimization criterion is provided for choosing in the current frame, for the temporal filtering of a double-connected pixel of the reference frame, the pixel that leads to a minimum energy of the detail subband. The temporal part of the 3D decomposition may be a temporal wavelet transform that includes a temporal filtering based on the lifting scheme, including an overlapped motion compensation using, in the prediction step of said scheme, an average of pixels from adjacent windows in the reference frame.

Description

Method for encoding a sequence of frames
The present invention relates generally to video compression and, more particularly, to a method for encoding a sequence of frames using a three-dimensional (3D) subband decomposition, said encoding involving spatial and temporal filtering steps applied to the 3D spatio-temporal data corresponding to said sequence of frames considered as a 3D volume, said decomposition being applied to successive groups of frames after block-based motion estimation and compensation operations have been carried out, at each level of the temporal decomposition, on the frames of the low-frequency temporal subband of each group, said compensation operations generating among the filtered pixels so-called double-connected pixels of the reference frame, which correspond to regions that become uncovered in the current frame during motion.
Recent developments in multimedia applications and the need to deliver compressed bit streams over heterogeneous networks have led to scalability being viewed as a very important feature of video encoders. Three-dimensional (3D) wavelet decomposition provides natural spatial-resolution and frame-rate scalability, exploited by embedded coding algorithms such as the 3D SPIHT algorithm described in "An embedded wavelet video coder using set partitioning in three-dimensional hierarchical trees" (Proceedings of the Data Compression Conference, Snowbird, USA, March 25-27, 1997, pp. 251-260). The global coding/decoding scheme applied to a sequence of frames may comprise a temporal multiresolution analysis along the direction of motion between successive frames, in order to take large displacements into account (more precisely, a Haar filter is applied to the motion-compensated frames at each resolution level), followed by a spatial multiresolution analysis of the resulting temporal subbands. The coding method is a modified 3D SPIHT followed by a context-based arithmetic encoder. With this approach, scalability is achieved in temporal and spatial resolution as well as in quality.
However, despite some work on scalable coding algorithms and possible improvements of the spatial wavelet transform, little effort has been devoted to exploiting the particularities of the temporal wavelet transform used in the 3D decomposition, which is itself nonlinear because of the motion estimation step. The so-called lifting scheme, described for example by I. Daubechies and W. Sweldens in "Factoring wavelet transforms into lifting steps" (Journal of Fourier Analysis and Applications, vol. 4, no. 3, 1998, pp. 247-269), provides a flexible framework for constructing wavelet transforms (whose importance in processing one-dimensional signals and in providing appropriate decompositions for image coding or feature extraction has been demonstrated). The advantage of this approach lies both in complexity ("in-place" computation) and in additional functionality: not only can every linear wavelet decomposition be expressed in this form, but the scheme also allows the introduction of nonlinear operators in the prediction and update steps.
An important issue related to temporal multiresolution analysis is the choice of the temporal filter length: longer filters better exploit the temporal correlation existing between successive frames. However, in "Three-dimensional subband coding with motion compensation" by J.-R. Ohm (IEEE Transactions on Image Processing, vol. 3, no. 5, 1994, pp. 559-571) it is shown that long filters blur the motion and increase both the buffer memory requirements (which correspond to the filter length) and the reconstruction delay at the decoder side, which may be unacceptable in real-time applications. Furthermore, performing the temporal analysis with longer filters does not significantly improve the coding efficiency. Haar filtering is therefore better suited for the temporal filtering. In addition, when Haar filtering is used for the temporal decomposition, motion estimation and motion compensation (ME/MC) are performed every two frames of the input sequence, since the temporal down-sampling is by a factor of 2. As this process is repeated on the temporal approximation subband over several decomposition levels, the total number of ME/MC operations is approximately the same as in a predictive scheme.
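The two-frame Haar ME/MC structure described above can be sketched in Python. In this minimal illustration, motion compensation is replaced by the identity (zero motion), so only the pairing of frames, the √2-normalized Haar filtering and the recursion on the approximation subband are shown; the function names are illustrative, not the patent's.

```python
import numpy as np

def haar_temporal_analysis(frames):
    """One level of temporal Haar analysis on a list of frames.

    Motion compensation is omitted (identity motion) to keep the sketch
    minimal; in the scheme of the patent each pair (A, B) would first be
    motion compensated. A is the reference frame, B the current frame.
    """
    lows, highs = [], []
    for a, b in zip(frames[0::2], frames[1::2]):
        highs.append((b - a) / np.sqrt(2.0))  # detail (high-pass) subband
        lows.append((a + b) / np.sqrt(2.0))   # approximation (low-pass) subband
    return lows, highs

def mctf_decompose(frames, levels):
    """Recursively re-filter the low-frequency temporal subband."""
    subbands = []
    lows = list(frames)
    for _ in range(levels):
        lows, highs = haar_temporal_analysis(lows)
        subbands.append(highs)
    subbands.append(lows)  # final approximation subband
    return subbands

gop = [np.full((4, 4), float(i)) for i in range(8)]  # 8-frame group of frames
bands = mctf_decompose(gop, levels=3)
```

For an 8-frame group and 3 levels this yields 4 + 2 + 1 detail frames plus 1 approximation frame, i.e. 7 ME/MC pairings, matching the count of a frame-by-frame predictive scheme.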
However, motion-compensated temporal filtering raises the problem of double-connected and unconnected pixels, i.e. pixels that are filtered twice or not filtered at all. A solution is proposed in "Motion-compensated 3-D subband coding of video" by S.-J. Choi and J.W. Woods (IEEE Transactions on Image Processing, vol. 8, no. 2, pp. 155-164, Feb. 1999), which computes the high-pass coefficients at the same positions as the pixels of the current frame and interprets the connected pixels of the motion-compensated reference frame as low-pass coefficients. In the case of a double-connected pixel, the current frame is scanned from top to bottom and from left to right, and the first pixel of the current frame pointing to it is retained for the computation of the low-pass coefficient. This, however, does not appear to be the optimal strategy.
At a given resolution level, let $H[m,n]$ denote the pixels of the temporal high-frequency subband, $L[m,n]$ the pixels of the low-frequency subband, and $(u_{m,n}, v_{m,n})$ the two components of the motion vector associated with pixel $(m,n)$. If fractional-pixel motion estimation is allowed, the integer parts of the motion vector components are denoted by $(\bar{u}_{m,n}, \bar{v}_{m,n})$. In the Choi method, the equations used to compute the high-pass and low-pass subbands for connected pixels are then:

(1)  $H[m,n] = \big(B[m,n] - \tilde{A}[m-u_{m,n},\, n-v_{m,n}]\big) / \sqrt{2}$

(2)  $L[m-\bar{u}_{m,n},\, n-\bar{v}_{m,n}] = \big(\tilde{B}[m-\bar{u}_{m,n}+u_{m,n},\, n-\bar{v}_{m,n}+v_{m,n}] + A[m-\bar{u}_{m,n},\, n-\bar{v}_{m,n}]\big) / \sqrt{2}$

where $\tilde{X}$ denotes the interpolated value of the field $X$. For unconnected pixels, the high-frequency component is obtained as before, while the low-frequency value is simply a scaled value of the reference pixel:

(3)  $L[m,n] = \sqrt{2}\, A[m,n]$
As a special case of the lifting scheme for the Haar transform, the temporal low-pass filtering along the motion direction can be written as:

(4)  $L[m-\bar{u}_{m,n},\, n-\bar{v}_{m,n}] = \tilde{H}[m-\bar{u}_{m,n}+u_{m,n},\, n-\bar{v}_{m,n}+v_{m,n}] + \sqrt{2}\, A[m-\bar{u}_{m,n},\, n-\bar{v}_{m,n}]$
taken together (1), this equation can derive the form of the nonlinear operators P (prediction) and U (update) used in temporal haar lifting. It can be seen that P is the motion compensation operator (C), which in the case of fractional-pixel motion estimation is followed by interpolation (I). At the same time, U can be determined as the motion compensation operator, which uses the same motion vector as in P, but of opposite sign, followed by an interpolation. As a result, these operations will be expressed as:
P{.}=I{C{.}}
and U { } ═ I { C { } { (C { } { } { (C { }
And position (m-u) of (p, q)m,n,n- vm,n)。
With these notations, the temporal analysis of connected pixels can be written as:

(5)  $H[m,n] = \frac{1}{\sqrt{2}}\big(B[m,n] - I\{C\{A[m,n]\}\}\big)$

(6)  $L[p,q] = I\{\bar{C}\{H[p,q]\}\} + \sqrt{2}\, A[p,q]$
for the synthesis part, there are: <math> <mrow> <mrow> <mo>(</mo> <mn>7</mn> <mo>)</mo> </mrow> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mi>A</mi> <mo>[</mo> <mi>p</mi> <mo>,</mo> <mi>q</mi> <mo>]</mo> <mo>=</mo> <mfrac> <mn>1</mn> <msqrt> <mn>2</mn> </msqrt> </mfrac> <mrow> <mo>(</mo> <mi>L</mi> <mo>[</mo> <mi>p</mi> <mo>,</mo> <mi>q</mi> <mo>]</mo> <mo>-</mo> <mi>I</mi> <mo>{</mo> <mover> <mi>C</mi> <mo>&OverBar;</mo> </mover> <mo>{</mo> <mi>H</mi> <mo>[</mo> <mi>p</mi> <mo>,</mo> <mi>q</mi> <mo>]</mo> <mo>}</mo> <mo>}</mo> <mo>)</mo> </mrow> </mrow> </math>
for connected pixels (unconnected pixels in the reference frame are derived directly from equation (3) above), and: ( 8 ) - - - B [ m , n ] = 2 H [ m , n ] + I { C { A [ m , n ] } } }
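The lifting analysis/synthesis pair (5)-(8) can be checked numerically. The Python sketch below models the operators I{C{.}} and I{C̄{.}} as identity mappings (zero motion, no interpolation); the point it illustrates is structural: the synthesis inverts the analysis exactly whatever the prediction and update operators do, which is the perfect-reconstruction property the lifting scheme guarantees.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def predict(a):
    # P{.} = I{C{.}}: motion compensation followed by interpolation.
    # Modeled as the identity (zero motion) in this sketch.
    return a

def update(h):
    # U{.} = I{C-bar{.}}: compensation with opposite motion vectors.
    return h

def analysis(a, b):
    """Equations (5)-(6): temporal Haar lifting of a frame pair (A, B)."""
    h = (b - predict(a)) / SQRT2   # (5) prediction step
    l = update(h) + a * SQRT2      # (6) update step
    return l, h

def synthesis(l, h):
    """Equations (7)-(8): inverse lifting, exact by construction."""
    a = (l - update(h)) / SQRT2    # (7) undo the update step
    b = SQRT2 * h + predict(a)     # (8) undo the prediction step
    return a, b

rng = np.random.default_rng(0)
A = rng.random((8, 8))
B = rng.random((8, 8))
L, H = analysis(A, B)
A2, B2 = synthesis(L, H)
# A2 and B2 reproduce A and B exactly, whatever predict/update compute
```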
it is therefore a first object of the present invention to adapt the update operator using such lifting formulas in order to improve the processing of uncovered areas.
To this end, the invention relates to a coding method as defined in the introductory part of the description, further characterized in that it comprises, after the identification of the double-connected pixels, a decision step based on a minimization criterion, said criterion selecting in the current frame, for the temporal filtering of a double-connected pixel of the reference frame, the pixel that leads to the minimum energy of the detail subband, the energy considered being the average energy of said double-connected pixel in the high-frequency temporal subband, computed by means of a low-pass filtering operation.
The proposed solution thus associates a pixel at a given position in the reference frame A, obtained by minimizing the energy of the detail subband, with its closest value in the current frame B.
In an advantageous embodiment, said minimization criterion combines this energy criterion with a second criterion performing a minimization of the norm of the motion vectors associated with the candidate connected pixels.
In this case, the minimization of the motion vector norm may employ a threshold on the value of this norm, so that motion vectors that are too large are not taken into account in the optimization.
It is another object of the invention to use the lifting formulation to modify the prediction step, in order to reduce the blocking artifacts related to block-based motion compensation.
To this end, the invention relates to a coding method as defined above, in which the temporal part of the 3D decomposition is a temporal wavelet transform comprising a temporal filtering based on the so-called lifting scheme, which comprises, at each decomposition level, a split step, a prediction step and an update step in succession; the method is further characterized in that the temporal filtering includes an overlapped motion compensation in the prediction step, using an average of pixels from neighbouring windows in the reference frame, in order to reduce the blocking artifacts caused by block-based motion estimation.
The problem of unconnected and double-connected pixels is strongly related to the areas covered and uncovered by moving objects. Consider, in practice, two objects that share a common region in the frame at time T and become separated at time T + ΔT. In this case, two regions of the current frame will correspond, through motion compensation, to the same region of the reference frame. For one of the two objects, this region is an uncovered area, and it appears as a double connection in the reference frame. Whereas the method described in the cited document "Motion-compensated 3-D subband coding of video" associates with these double-connected pixels the first block encountered during motion compensation, the present method proposes to optimize this choice by applying criteria based on the lifting scheme. The main structural property exploited here is that all the information derived from the prediction step (the high-frequency subbands), together with the original frame information, can be used in the update step (the computation of the temporal low-frequency subband).
The first proposed criterion is based on the energy of the detail subbands associated with the two moving objects. If at time T the first object is in the foreground, the uncovered area belonging to the second object will produce a higher detail-coefficient energy. The second criterion is a motion-field regularization condition: if several pixels are connected to the same pixel of the reference frame, the one with the smallest displacement is selected for the filtering.
These ideas can be formalized within the nonlinear lifting framework described above. Consider a double-connected pixel of the reference frame at position $(p,q)$, and let $(m_1,n_1)$ and $(m_2,n_2)$ be two pixels of the current frame found by the motion estimation algorithm, with corresponding motion vectors $(u_{m_1,n_1}, v_{m_1,n_1})$ and $(u_{m_2,n_2}, v_{m_2,n_2})$. Then:

(9)  $m_1 - \bar{u}_{m_1,n_1} = m_2 - \bar{u}_{m_2,n_2} = p, \qquad n_1 - \bar{v}_{m_1,n_1} = n_2 - \bar{v}_{m_2,n_2} = q$
with this result, two pixels (m) can be addressed1,n1) And (m)2,n2) Are written as equations (5) and (6), two different values H m are obtained in the fine sub-band1,n1]And H [ m ]2,n2]. Thus, the values in the approximation subband can be computed with either of these two values (it can be noted that both values allow a full reconstruction). In fact, P(p,q)Representing the set of all pixels (m, n) in the current frame connected to the pixel (p, q) in the reference frame, it can be noted that the operator f is guaranteed to be completeFull reconstruction property, such that: <math> <mrow> <mrow> <mo>(</mo> <mn>10</mn> <mo>)</mo> </mrow> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mi>L</mi> <mo>[</mo> <mi>p</mi> <mo>,</mo> <mi>q</mi> <mo>]</mo> <mo>=</mo> <mi>f</mi> <mrow> <mo>(</mo> <mover> <mi>H</mi> <mo>~</mo> </mover> <mo>[</mo> <mi>m</mi> <mo>-</mo> <msub> <mover> <mi>u</mi> <mo>&OverBar;</mo> </mover> <mrow> <mi>m</mi> <mo>,</mo> <mi>n</mi> </mrow> </msub> <mo>+</mo> <msub> <mi>u</mi> <mrow> <mi>m</mi> <mo>,</mo> <mi>n</mi> </mrow> </msub> <mo>,</mo> <mi>n</mi> <mo>-</mo> <msub> <mover> <mi>v</mi> <mo>&OverBar;</mo> </mover> <mrow> <mi>m</mi> <mo>,</mo> <mi>n</mi> </mrow> </msub> <mo>+</mo> <msub> <mi>v</mi> <mrow> <mi>m</mi> <mo>,</mo> <mi>n</mi> </mrow> </msub> <mo>]</mo> <mo>,</mo> <mrow> <mo>(</mo> <mi>m</mi> <mo>,</mo> <mi>n</mi> <mo>)</mo> </mrow> <mo>&Element;</mo> <msub> <mi>P</mi> <mrow> <mo>(</mo> <mi>p</mi> <mo>,</mo> <mi>q</mi> <mo>)</mo> </mrow> </msub> <mo>)</mo> </mrow> <mo>+</mo> <mi>A</mi> <mo>[</mo> <mi>p</mi> <mo>,</mo> <mi>q</mi> <mo>]</mo> <mo>&CenterDot;</mo> <msqrt> <mn>2</mn> </msqrt> </mrow> </math>
one criterion for selecting operator f is to minimize the energy of the fine sub-bands, thereby minimizing A [ p, q ]]Associated with its "closest" value in frame B. This means that the pixel (m) is a pixel0,n0) Low-pass filtering is performed so that: <math> <mrow> <mrow> <mo>(</mo> <mn>11</mn> <mo>)</mo> </mrow> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mo>|</mo> <mi>I</mi> <mo>{</mo> <mover> <mi>C</mi> <mo>&OverBar;</mo> </mover> <mo>{</mo> <mi>H</mi> <mo>[</mo> <msub> <mi>m</mi> <mn>0</mn> </msub> <mo>-</mo> <msub> <mover> <mi>u</mi> <mo>&OverBar;</mo> </mover> <mrow> <mi>m</mi> <mn>0</mn> <mo>,</mo> <mi>n</mi> <mn>0</mn> </mrow> </msub> <mo>,</mo> <msub> <mi>n</mi> <mn>0</mn> </msub> <mo>-</mo> <msub> <mover> <mi>v</mi> <mo>&OverBar;</mo> </mover> <mrow> <mi>m</mi> <mn>0</mn> <mo>,</mo> <mi>n</mi> <mn>0</mn> </mrow> </msub> <mo>]</mo> <mo>}</mo> <mo>}</mo> <mo>|</mo> <mo>=</mo> <mi>mi</mi> <msub> <mi>n</mi> <mrow> <mrow> <mo>(</mo> <mi>m</mi> <mo>,</mo> <mi>n</mi> <mo>)</mo> </mrow> <mo>&Element;</mo> <msub> <mi>P</mi> <mrow> <mo>(</mo> <mi>p</mi> <mo>,</mo> <mi>q</mi> <mo>)</mo> </mrow> </msub> </mrow> </msub> <mo>|</mo> <mi>I</mi> <mo>{</mo> <mover> <mi>C</mi> <mo>&OverBar;</mo> </mover> <mo>{</mo> <mi>H</mi> <mo>[</mo> <mi>m</mi> <mo>-</mo> <msub> <mover> <mi>u</mi> <mo>&OverBar;</mo> </mover> <mrow> <mi>m</mi> <mo>,</mo> <mi>n</mi> </mrow> </msub> <mi>n</mi> <mo>-</mo> <msub> <mover> <mi>v</mi> <mo>&OverBar;</mo> </mover> <mrow> <mi>m</mi> <mo>,</mo> <mi>n</mi> </mrow> </msub> <mo>]</mo> <mo>}</mo> <mo>}</mo> <mo>|</mo> </mrow> </math>
for Choi's algorithm, in the proposed algorithm, it is not necessary to send the classification map to the decoder (e.g., which pixels are connected and which pixels are not). Since the decoder follows a symmetrical process with the encoder, the same classification map will be used theoretically to derive the decision on the energy of the high frequency coefficients.
However, a decision based on the value of a single pixel is not robust enough. In particular, in the previous example, the two values in the high-frequency subband may not be quantized with the same quantization step, owing to the successive-approximation quantization strategy used in the SPIHT algorithm, and an erroneous decision would then be made. The decision can be made more robust by comparing the average energies of the displaced frame difference (DFD) around the considered pixels:

(12)  $e(p,q)^2 = \sum_{(k,l)\in S_{(p,q)}} \big(H(p-k,\, q-l)\, u(k,l)\big)^2$
where $S_{(p,q)}$ is a neighbourhood of the pixel $(p,q)$, and $u(k,l)$ is a weighting factor for each pixel of the neighbourhood, depending on its distance to the centre. For example, one may choose $u(k,l) = \alpha^{(|k|+|l|)}$, where $\alpha > 0$ is a forgetting factor.
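A possible Python rendering of the weighted energy (12), with the exponential weights u(k,l) = α^(|k|+|l|) suggested above (the neighbourhood radius and the value of α are illustrative choices):

```python
import numpy as np

def local_energy(h, p, q, radius=1, alpha=0.5):
    """Average detail energy around (p, q), as in equation (12),
    with weights u(k, l) = alpha**(|k| + |l|).  alpha is a forgetting
    factor; radius defines the neighbourhood S_(p,q)."""
    e2 = 0.0
    for k in range(-radius, radius + 1):
        for l in range(-radius, radius + 1):
            m, n = p - k, q - l
            if 0 <= m < h.shape[0] and 0 <= n < h.shape[1]:
                w = alpha ** (abs(k) + abs(l))  # distance-based weight
                e2 += (h[m, n] * w) ** 2
    return e2

h = np.zeros((5, 5))
h[2, 2] = 2.0   # strong detail coefficient at the centre
h[2, 3] = 1.0   # weaker neighbour
e2 = local_energy(h, 2, 2)
# centre contributes (2*1)^2 = 4, neighbour (1*0.5)^2 = 0.25
```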
The second term of the minimization criterion is the motion vector norm, $\|d_{m,n}\| = (u_{m,n}^2 + v_{m,n}^2)^{1/2}$. The regularized criterion can be expressed as $J(p,q) = e(p,q)^2 + \lambda\,\|d_{m,n}\|$, where λ is a regularization parameter. If a motion vector is too large, its value is not very reliable, and one may therefore choose not to take it into account in the optimization. This leads to the following criterion:

(13)  $J(p,q) = e(p,q)^2 + \lambda\,\|d_{m,n}\|$  if  $\|d_{m,n}\| \le s$

(14)  $J(p,q) = e(p,q)^2 + \lambda\, s$  if  $\|d_{m,n}\| > s$

where s is an empirically determined threshold. The penalty therefore saturates for motion vectors larger than this threshold.
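Criteria (13)-(14) amount to an energy term plus a motion-vector penalty saturated at the threshold; a minimal Python sketch (the values of λ and s are illustrative, the text leaves both to be tuned empirically):

```python
import math

def criterion_J(e2, motion_vector, lam=0.1, s=8.0):
    """Equations (13)-(14): DFD energy plus a motion-vector norm
    penalty, clipped at threshold s for unreliable large vectors."""
    u, v = motion_vector
    norm = math.hypot(u, v)            # ||d|| = (u^2 + v^2)^(1/2)
    return e2 + lam * min(norm, s)     # penalty saturates at lam * s

small = criterion_J(1.0, (3.0, 4.0))   # ||d|| = 5 <= s: full penalty
large = criterion_J(1.0, (6.0, 8.0))   # ||d|| = 10 > s: clipped to s
```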
Block-based motion estimation algorithms, however, suffer from blocking artifacts. If the spatial transform applied to the resulting motion-compensated frames is a wavelet analysis, these artifacts produce undesirably large wavelet coefficients, which in turn reduce the coding efficiency. A further improvement derivable from the above nonlinear lifting formulation is therefore the possibility of introducing overlapped motion compensation into the temporal filtering algorithm, thereby reducing the blocking artifacts. This operation consists in using, in the prediction step, an average of pixels from neighbouring windows in the reference frame.
For example, if a one-pixel overlap is considered, the high-pass filtering of the pixels belonging to the first row (respectively the last row) of a block becomes:

(9)  $H[m,n] = \frac{1}{\sqrt{2}}\Big[B[m,n] - \big((1-\beta)\,\tilde{A}[m-u_{m,n},\, n-v_{m,n}] + \beta\,\tilde{A}[m-1-u_{m-1,n},\, n-v_{m-1,n}]\big)\Big]$

respectively:

(10)  $H[m,n] = \frac{1}{\sqrt{2}}\Big[B[m,n] - \big((1-\beta)\,\tilde{A}[m-u_{m,n},\, n-v_{m,n}] + \beta\,\tilde{A}[m+1-u_{m+1,n},\, n-v_{m+1,n}]\big)\Big]$

where β is a constant, 0 < β < 1. A similar process is applied to the first (respectively last) column of each block, according to:

(11)  $H[m,n] = \frac{1}{\sqrt{2}}\Big[B[m,n] - \big((1-\beta)\,\tilde{A}[m-u_{m,n},\, n-v_{m,n}] + \beta\,\tilde{A}[m-u_{m,n-1},\, n-1-v_{m,n-1}]\big)\Big]$

respectively:

(12)  $H[m,n] = \frac{1}{\sqrt{2}}\Big[B[m,n] - \big((1-\beta)\,\tilde{A}[m-u_{m,n},\, n-v_{m,n}] + \beta\,\tilde{A}[m-u_{m,n+1},\, n+1-v_{m,n+1}]\big)\Big]$
<mi>m</mi> <mo>,</mo> <mi>n</mi> <mo>+</mo> <mi>l</mi> </mrow> </msub> <mo>,</mo> <mi>n</mi> <mo>+</mo> <mi>l</mi> <mo>-</mo> <msub> <mi>v</mi> <mrow> <mi>m</mi> <mo>,</mo> <mi>n</mi> <mo>+</mo> <mi>l</mi> </mrow> </msub> <mo>]</mo> <mo>)</mo> </mrow> <mo>]</mo> </mrow> </math>
The constant β is determined experimentally; in the case of the present invention, a value of β = 0.2 yields the best results.
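The boundary blending of equations (9) and (10) can be sketched in code as follows. This is a minimal illustration, not the patent's implementation: it assumes integer motion vectors, an already-interpolated reference frame Ã, and frames stored as nested lists; the function name, the `block` parameter, and the interior-pixel fallback are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)

def obmc_highpass_row(B, A_interp, u, v, m, n, beta=0.2, block=8):
    """High-pass sample H[m, n] with a one-pixel overlapped motion
    compensation across horizontal block boundaries (cf. eqs. 9-10).

    B        -- current frame (list of rows)
    A_interp -- interpolated reference frame A~ (same layout)
    u, v     -- per-pixel motion-vector fields (row / column components)
    beta     -- overlap weight, 0 < beta < 1 (0.2 in this document)
    """
    # Prediction using this block's own motion vector.
    own = A_interp[m - u[m][n]][n - v[m][n]]

    if m % block == 0 and m > 0:
        # First line of a block: blend with the block above (eq. 9).
        nb = A_interp[m - 1 - u[m - 1][n]][n - v[m - 1][n]]
    elif m % block == block - 1 and m < len(B) - 1:
        # Last line of a block: blend with the block below (eq. 10).
        nb = A_interp[m + 1 - u[m + 1][n]][n - v[m + 1][n]]
    else:
        # Interior line: no overlap, plain high-pass filtering.
        return (B[m][n] - own) / SQRT2

    return (B[m][n] - ((1.0 - beta) * own + beta * nb)) / SQRT2
```

The first and last columns of each block would be handled symmetrically, per equations (11) and (12), with the roles of the row and column indices exchanged.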

Claims (4)

1. A method of encoding a sequence of frames by means of a three-dimensional (3D) subband decomposition, said encoding involving spatial and temporal filtering steps applied to the 3D spatio-temporal data corresponding to said sequence of frames considered as a 3D volume, said decomposition being applied to successive groups of frames only after a block-based motion estimation and compensation operation has been carried out, at each level of temporal decomposition, on the frames of each group in the low-frequency temporal subbands, said compensation operation producing double-connected pixels among the filtered pixels, said double-connected pixels corresponding to regions that become covered or uncovered between the reference frame and the current frame as a result of motion, said method being further characterized in that the step of identifying said double-connected pixels is followed by a decision step based on a minimization criterion provided for picking, in the current frame, the pixel to be temporally filtered with said double-connected pixel of the reference frame as the one producing the lowest energy in the high-frequency temporal subband, the energy considered being the average energy of said double-connected pixels in said high-frequency temporal subband, said average energy being computed by means of a low-pass filtering operation.
2. A coding method according to claim 1, characterized in that said minimization criterion is provided for combining said energy criterion with a second criterion performing a minimization of the norm of the motion vectors associated with the candidate connected pixels.
3. A coding method according to claim 2, characterized in that the minimization according to the motion vector norm uses a threshold adapted to the value of the norm, so that motion vectors that are too large are not taken into account in the optimization.
4. A coding method according to claim 1, characterized in that the temporal part of said 3D decomposition is a temporal wavelet transform comprising a temporal filtering based on a so-called lifting scheme, which comprises a splitting step, a prediction step and an update step carried out in sequence at each decomposition level, and in that said temporal filtering includes an overlapped motion compensation that uses, in the prediction step, an average of neighboring windowed pixels in the reference frame in order to reduce the blocking artifacts caused by the block-based motion estimation.
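The split/predict/update sequence recited in claim 4 can be sketched with the simplest (Haar) lifting steps. This is only an illustration under stated assumptions, not the patented filter: motion compensation is omitted, and the unnormalized Haar steps are used instead of the 1/√2-normalized temporal filtering of the description; the function name is illustrative.

```python
def haar_lifting_temporal(frames):
    """One level of temporal wavelet decomposition via lifting:
    split the group of frames into even/odd subsets, predict the
    odd frames from the even ones, then update the even ones.

    frames -- list of frames, each a flat list of pixel values;
              an even number of frames is assumed.
    Returns (low, high): low- and high-frequency temporal subbands.
    """
    # Split step: separate even- and odd-indexed frames.
    even = frames[0::2]
    odd = frames[1::2]

    low, high = [], []
    for A, B in zip(even, odd):
        # Predict step: the odd frame is predicted from the even one;
        # the residual forms the high-frequency subband (motion
        # compensation is omitted in this sketch).
        h = [b - a for a, b in zip(A, B)]
        # Update step: the even frame is updated with half the detail,
        # which makes the low band the pairwise average of the frames.
        l = [a + d / 2.0 for a, d in zip(A, h)]
        high.append(h)
        low.append(l)
    return low, high
```

Applying the same decomposition recursively to the returned low band yields the successive temporal decomposition levels mentioned in the claim.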
CNB02801121XA 2001-04-10 2002-04-02 Method of encoding sequence of frames Expired - Fee Related CN1251509C (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
EP01400919.5 2001-04-10
EP01400919 2001-04-10
EP01400941.9 2001-04-12
EP01400941 2001-04-12

Publications (2)

Publication Number Publication Date
CN1461563A true CN1461563A (en) 2003-12-10
CN1251509C CN1251509C (en) 2006-04-12

Family

ID=26077231

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB02801121XA Expired - Fee Related CN1251509C (en) 2001-04-10 2002-04-02 Method of encoding sequence of frames

Country Status (6)

Country Link
US (1) US6983021B2 (en)
EP (1) EP1380174A1 (en)
JP (1) JP2004523994A (en)
KR (1) KR20030014705A (en)
CN (1) CN1251509C (en)
WO (1) WO2002085026A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008089705A1 (en) * 2007-01-23 2008-07-31 Huawei Technologies Co., Ltd. Encoding and decoding method and apparatus
CN101552919B (en) * 2008-04-02 2011-03-02 联咏科技股份有限公司 Operational method with functions of executing overlapped filtration and center conversion and device thereof

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7133846B1 (en) * 1995-02-13 2006-11-07 Intertrust Technologies Corp. Digital certificate support system, methods and techniques for secure electronic commerce transaction and rights management
AUPR222500A0 (en) 2000-12-21 2001-01-25 Unisearch Limited Method for efficient scalable compression of video
JP3863775B2 (en) * 2001-12-25 2006-12-27 株式会社九州エレクトロニクスシステム Image information compression method, image information compression apparatus, and image information compression program
AUPS291002A0 (en) * 2002-06-12 2002-07-04 Unisearch Limited Method and apparatus for scalable compression of video
DE60317670T2 (en) * 2003-09-09 2008-10-30 Mitsubishi Denki K.K. Method and apparatus for 3D subband video coding
KR20060076309A (en) * 2003-09-23 2006-07-04 코닌클리케 필립스 일렉트로닉스 엔.브이. Video de-noising algorithm using inband motion-compensated temporal filtering
JP2007509516A (en) * 2003-10-10 2007-04-12 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ 3D video scalable video encoding method
WO2005081531A1 (en) * 2004-01-20 2005-09-01 Koninklijke Philips Electronics N.V. Three-dimensional video scalable video encoding method
US8442108B2 (en) * 2004-07-12 2013-05-14 Microsoft Corporation Adaptive updates in motion-compensated temporal filtering
US8340177B2 (en) * 2004-07-12 2012-12-25 Microsoft Corporation Embedded base layer codec for 3D sub-band coding
CN101019436B (en) * 2004-07-13 2013-05-08 法国电讯公司 Method and device for encoding a video a video image sequence
US8374238B2 (en) 2004-07-13 2013-02-12 Microsoft Corporation Spatial scalability in 3D sub-band decoding of SDMCTF-encoded video
US8279918B2 (en) 2005-07-15 2012-10-02 Utc Fire & Security Americas Corporation, Inc. Method and apparatus for motion compensated temporal filtering using residual signal clipping
US8483277B2 (en) 2005-07-15 2013-07-09 Utc Fire & Security Americas Corporation, Inc. Method and apparatus for motion compensated temporal filtering using split update process
US7956930B2 (en) 2006-01-06 2011-06-07 Microsoft Corporation Resampling and picture resizing operations for multi-resolution video coding and decoding
WO2008091206A1 (en) 2007-01-26 2008-07-31 Telefonaktiebolaget Lm Ericsson (Publ) Motion estimation for uncovered frame regions
US8038442B2 (en) 2007-04-23 2011-10-18 Nobel Biocare Services Ag Dental implant and dental component connection
US8953673B2 (en) 2008-02-29 2015-02-10 Microsoft Corporation Scalable video coding and decoding with sample bit depth and chroma high-pass residual layers
US8711948B2 (en) 2008-03-21 2014-04-29 Microsoft Corporation Motion-compensated prediction of inter-layer residuals
WO2010017166A2 (en) 2008-08-04 2010-02-11 Dolby Laboratories Licensing Corporation Overlapped block disparity estimation and compensation architecture
US9571856B2 (en) 2008-08-25 2017-02-14 Microsoft Technology Licensing, Llc Conversion operations in scalable video encoding and decoding
US8213503B2 (en) 2008-09-05 2012-07-03 Microsoft Corporation Skip modes for inter-layer residual video coding and decoding
CN102946539B (en) * 2012-11-21 2015-07-15 西安电子科技大学 Method for estimating motion among video image frames based on compressive sensing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5301020A (en) * 1991-11-01 1994-04-05 Aware, Inc. Method and apparatus for coding motion pictures utilizing motion compensation
WO2001078402A1 (en) * 2000-04-11 2001-10-18 Koninklijke Philips Electronics N.V. Video encoding and decoding method

Also Published As

Publication number Publication date
WO2002085026A1 (en) 2002-10-24
US6983021B2 (en) 2006-01-03
US20030026339A1 (en) 2003-02-06
JP2004523994A (en) 2004-08-05
CN1251509C (en) 2006-04-12
EP1380174A1 (en) 2004-01-14
KR20030014705A (en) 2003-02-19

Similar Documents

Publication Publication Date Title
CN1251509C (en) Method of encoding sequence of frames
Pesquet-Popescu et al. Three-dimensional lifting schemes for motion compensated video compression
Andreopoulos et al. In-band motion compensated temporal filtering
KR101203338B1 (en) Adaptive updates in motion-compensated temporal filtering
KR100308627B1 (en) Low bit rate encoder using overlapping block motion compensation and zerotree wavelet coding
CN1205818C (en) Video encoding and decoding method
US6519284B1 (en) Encoding method for the compression of a video sequence
US6954500B2 (en) Video coding method using a block matching process
Heising et al. Wavelet-based very low bit-rate video coding using image warping and overlapped block motion compensation
US20080095235A1 (en) Method and apparatus for intra-frame spatial scalable video coding
US20070217513A1 (en) Method for coding video data of a sequence of pictures
JPH09121345A (en) Image encoding device and image decoding device
Andreopoulos et al. Complete-to-overcomplete discrete wavelet transforms for scalable video coding with MCTF
KR20000059799A (en) Device and method for motion compensation coding using wavelet coding
US7242717B2 (en) Wavelet domain motion compensation system
US6944225B2 (en) Resolution-scalable video compression
KR100901178B1 (en) Moving picture encoding method and device, and moving picture decoding method and device
JP4438797B2 (en) Method for video image coding taking into account the parts related to the components of the motion vector
KR20050085385A (en) Video coding method and device
JP4690371B2 (en) Scalable image encoding method, decoding method, encoding device, decoding device, program thereof and recording medium thereof
US20060012680A1 (en) Drift-free video encoding and decoding method, and corresponding devices
Mehrseresht et al. Spatially continuous orientation adaptive discrete packet wavelet decomposition for image compression
Marpe et al. Video coding using a bilinear image warping motion model and wavelet-based residual coding
WO2008051755A2 (en) Method and apparatus for intra-frame spatial scalable video coding
Padmanabhan et al. An Efficient Video Compression Encoder Based on Wavelet Lifting Scheme in LSK

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C19 Lapse of patent right due to non-payment of the annual fee
CF01 Termination of patent right due to non-payment of annual fee