Three-dimensional scalable video encoding method
FIELD OF THE INVENTION The present invention relates to a method of and a device for three-dimensional wavelet encoding of a sequence of frames. This invention may be used, for example, in video compression systems adapted to generate progressively scalable (in signal-to-noise ratio SNR, spatially or temporally) compressed video signals.
BACKGROUND OF THE INVENTION A conventional method for three-dimensional scalable video encoding of a sequence of frames is described, for example, in "Lifting schemes in scalable video coding", B. Pesquet-Popescu, V. Bottreau, SCI 2001, Orlando, USA. Said method comprises the following steps illustrated in Figure 1. In a first step, a sequence of frames is divided into groups GOF of 2^N frames, said group having, in the example of Figure 1, 8 frames F1 to F8. Then, the encoding method comprises a step of motion estimation ME based on pairs of odd Fo and even Fe input frames within the group of frames, as illustrated by the dotted arrows. Said motion estimation step results in a set MV1 of motion vector fields of a first decomposition level comprising 4 fields in the example of Figure 1. The motion estimation step is followed by a step of motion-compensated temporal wavelet-based filtering MCTF, for example Haar filtering, based on the set MV1 of motion vector fields and on a lifting scheme according to which the high-frequency wavelet coefficients H[n] and the low-frequency wavelet coefficients L[n] are computed sequentially as follows: H[n] = Fe[n] - P(Fo[n]), L[n] = Fo[n] + U(H[n]), where P is a prediction function, U is an update function, Fo[n] and Fe[n] are values of pixels of odd and even input frames, the pixels corresponding to Fo[n] and Fe[n] being taken along a corresponding motion vector of the set MV1 of motion vector fields.
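The classic lifting scheme above can be sketched as follows, ignoring motion compensation for clarity and taking the Haar choice P(x) = x and U(h) = h/2 (an illustrative simplification; the filtering described in the text operates along motion vectors):

```python
def haar_lifting_step(frames):
    """One level of temporal lifting on a group of frames, simplified:
    P is the identity and U halves its argument (Haar choice), and
    motion compensation is omitted.
    frames: Fo[0], Fe[0], Fo[1], Fe[1], ...
    Returns (low, high) lists with half as many frames each."""
    low, high = [], []
    for fo, fe in zip(frames[0::2], frames[1::2]):
        h = fe - fo        # H[n] = Fe[n] - P(Fo[n]), with P = identity
        l = fo + h / 2     # L[n] = Fo[n] + U(H[n]),  with U(h) = h/2
        high.append(h)
        low.append(l)
    return low, high
```

With these choices L[n] reduces to the pairwise average (Fo[n] + Fe[n]) / 2, which is why the low-frequency frames of Figure 1 behave as temporally averaged versions of the input.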
The temporal filtering MCTF step delivers a temporal sub-band T1 of a first decomposition level comprising filtered frames, which are 4 low-frequency frames L and 4 high-frequency frames H in our example. The motion estimation and filtering steps are repeated on the low-frequency frames L of the temporal sub-band T1, that is: motion estimation is done on pairs of odd and even low-frequency frames within the temporal sub-band T1, resulting in a set MV2 of motion vector fields of a second decomposition level comprising 2 fields in our example; motion-compensated temporal wavelet-based filtering MCTF is based on the set MV2 of motion vector fields and on the lifting equations, and results in a temporal sub-band T2 of a second decomposition level comprising filtered frames, which are 2 low-frequency frames LL and 2 high-frequency frames LH in the example of Figure 1. Motion estimation and motion-compensated temporal filtering are repeated once more on the pair of odd and even low-frequency frames of the temporal sub-band T2, resulting in a temporal sub-band T3 of a third and last decomposition level in the case of a group GOF of 8 frames. Said temporal sub-band T3 comprises 1 low-frequency frame LLL and 1 high-frequency frame LLH. A four-stage wavelet spatial filtering step is then applied to the frames LLL and LLH of the temporal sub-band T3 and to the high-frequency frames of the temporal sub-bands T1 and T2, i.e. the 2 filtered frames LH and the 4 filtered frames H, respectively. This results in spatio-temporal sub-bands comprising 4 spatially filtered frames sub-sampled in a horizontal and in a vertical direction. Next, a spatial encoding of the coefficients of the frames of the spatio-temporal sub-bands is performed, each spatio-temporal sub-band being encoded separately, beginning from the low-frequency frame of the spatio-temporal sub-band of the last decomposition level. The motion vector fields are also encoded.
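The spatial filtering stage described above can be sketched with a single-level 2-D Haar decomposition; the text does not mandate a particular spatial wavelet, so Haar is used here only as the simplest illustrative example:

```python
import numpy as np

def spatial_haar_level(frame):
    """One level of 2-D wavelet spatial filtering, sketched with the
    Haar transform (illustrative choice, not mandated by the text).
    Returns the four sub-bands LL, LH, HL, HH, each sub-sampled by a
    factor of 2 both horizontally and vertically."""
    f = np.asarray(frame, dtype=float)
    # Filter and sub-sample along rows (horizontal direction).
    lo_r = (f[:, 0::2] + f[:, 1::2]) / 2
    hi_r = (f[:, 0::2] - f[:, 1::2]) / 2
    # Filter and sub-sample along columns (vertical direction).
    ll = (lo_r[0::2, :] + lo_r[1::2, :]) / 2
    lh = (lo_r[0::2, :] - lo_r[1::2, :]) / 2
    hl = (hi_r[0::2, :] + hi_r[1::2, :]) / 2
    hh = (hi_r[0::2, :] - hi_r[1::2, :]) / 2
    return ll, lh, hl, hh
```

Applying this function again to the LL sub-band yields the pyramidal multi-level spatial decomposition mentioned in the text.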
Finally, an output bit-stream is formed on the basis of the encoded coefficients of the spatio-temporal sub-bands and of the encoded motion vector fields, the bits of said motion vector fields being sent as overhead. However, the encoding method according to the prior art has a number of disadvantages. First of all, the motion estimation and the motion-compensated temporal wavelet-based filtering steps are implemented on full-size frames. Therefore, these steps are computationally expensive and may cause a delay during encoding. Besides, motion vectors of the highest spatial resolution are encoded at each temporal level, which results in quite a high overhead. The encoding method also has low computational scalability.
SUMMARY OF THE INVENTION It is an object of the invention to propose a three-dimensional wavelet encoding method which has better computational scalability than that of the prior art. To this end, the encoding method in accordance with the invention is characterized in that it comprises the steps of: dividing the sequence of frames into groups of 2^N consecutive input frames, where N is an integer; doing motion estimation on pairs of even and odd input frames of the group of frames, resulting in a set of motion vector fields; motion-compensated temporal wavelet-based filtering using a first equation L[n] = Fo[n] + U(Fe[n]) of a reverse lifting scheme, where U is an update function, Fo[n] and Fe[n] are values of pixels of odd and even input frames of the group of frames, respectively, pixels corresponding to Fo[n] and Fe[n] being taken along an appropriate motion vector of the set of motion vector fields, said filtering step resulting in low-frequency filtered frames of a first-level temporal sub-band. As a consequence, only the low-frequency filtered frames of successive temporal decomposition levels are computed, without the need to compute the high-frequency filtered frames. This greatly simplifies the three-dimensional wavelet encoding method with motion compensation in accordance with the invention and allows different levels of scalability, as will be explained in more detail hereinafter. The present invention also relates to an encoding device implementing such an encoding method. It finally relates to a computer program product comprising program instructions for implementing said encoding method. These and other aspects of the invention will be apparent from and will be elucidated with reference to the embodiments described hereinafter.
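The characterizing step can be sketched minimally as follows: only the first equation of the reverse lifting scheme is evaluated, so no high-frequency frame is computed at this stage. Motion compensation is omitted, and the default update U(Fe) = Fe/2 is an illustrative choice consistent with the (4,4) filter quoted later in the text, whose update taps sum to 1/2:

```python
def low_frequency_only(frames, update=lambda fe: fe / 2):
    """Apply only L[n] = Fo[n] + U(Fe[n]) of the reverse lifting
    scheme to a group of frames, producing the low-frequency frames
    without ever forming the high-frequency ones.
    frames: Fo[0], Fe[0], Fo[1], Fe[1], ...  (motion compensation
    omitted; the default update is an illustrative choice)."""
    return [fo + update(fe) for fo, fe in zip(frames[0::2], frames[1::2])]
```

Re-applying the function to its own output yields the low-frequency frames of the next temporal decomposition level.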
BRIEF DESCRIPTION OF THE DRAWINGS The present invention will now be described in more detail, by way of example, with reference to the accompanying drawings, wherein: Figure 1 is a block diagram showing an encoding method in accordance with the prior art,
Figure 2 is a block diagram illustrating the first three steps of an encoding method in accordance with the invention, Figure 3 is a block diagram illustrating the following steps of a first embodiment of the encoding method in accordance with the invention, and Figure 4 is a block diagram illustrating the following steps of a second embodiment of the encoding method in accordance with the invention.
DETAILED DESCRIPTION OF THE INVENTION The present invention relates to a three-dimensional 3D wavelet encoding method with motion compensation. Such an encoding method has been demonstrated to be an efficient technique for scalable video encoding applications. Said 3D encoding method uses a wavelet transform in both the spatial and temporal domains. It is known that the wavelet temporal filtering is computationally expensive and causes a delay during encoding. The present disclosure proposes an improvement of the conventional 3D scalable wavelet video encoder. The process of temporal wavelet filtering is modified in such a way that the delay caused by the motion-compensated temporal wavelet-based filtering of video frames is reduced. The modification also allows the implementation of a computationally scalable encoder with parallel processing. Figure 2 is a block diagram showing an encoding method in accordance with the invention. In a first step, the sequence of frames is divided into groups of 2^N consecutive frames, a group of frames GOF having, for example, 8 frames. In a second step, a first sub-group of frames GOF1 within the GOF, namely the first 4 frames F1e, F1o, F2e and F2o in our example, is processed. Said first processing step comprises the sub-steps of: doing motion estimation ME on pairs of consecutive frames (Fe,Fo) within the group GOF1, resulting in a set of 2 motion vector fields MV11 of a first decomposition level; motion-compensated temporal wavelet-based filtering MCTF, for example Haar filtering, based on the motion vector fields MV11 of the first decomposition level. Said temporal filtering sub-step is based on the use of a reverse lifting scheme adapted to deliver sequentially low-frequency wavelet coefficients L[n] and high-frequency wavelet coefficients H[n]. The reverse lifting scheme comprises the following 2 equations: L[n] = Fo[n] + U(Fe[n])
H[n] = Fe[n] - P(L[n]) where Fo[n] and Fe[n] are values of pixels of odd and even input frames, respectively, the pixels corresponding to Fo[n] and Fe[n] being taken along a corresponding motion vector of the motion vector fields MV11. For example, the prediction and update functions of the reverse lifting scheme are based on the (4,4) Deslauriers-Dubuc wavelet transform such as: L[n] = Fo[n] + (-Fe[n-2] + 9Fe[n-1] + 9Fe[n] - Fe[n+1])/32, H[n] = Fe[n] - (-L[n-1] + 9L[n] + 9L[n+1] - L[n+2])/16. It will be apparent to a person skilled in the art that other prediction and update functions can be used without departing from the scope of the invention. In a first stage, only the first equation L[n] = Fo[n] + U(Fe[n]) of the reverse lifting scheme is applied, resulting in 2 low-frequency filtered frames L1e and L1o of a temporal sub-band T1 of a first decomposition level. Said first processing step further comprises the sub-steps of: doing motion estimation ME on pairs of consecutive low-frequency filtered frames within the temporal sub-band T1, resulting in a motion vector field MV21 of a second decomposition level; motion-compensated temporal wavelet-based filtering MCTF based on the motion vector field MV21 and on the first equation of the reverse lifting scheme: LL[n] = Lo[n] + U(Le[n]), where Lo[n] and Le[n] are values of pixels of odd and even frames of the temporal sub-band T1, respectively, the pixels corresponding to Lo[n] and Le[n] being taken along a corresponding motion vector of the motion vector field MV21. This results in a low-frequency filtered frame LL of a temporal sub-band T2 of a second decomposition level. In a third step, a second sub-group of frames GOF2, namely the last 4 frames F3e, F3o, F4e and F4o in our example, within the GOF is processed as in the first processing step.
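The (4,4) Deslauriers-Dubuc equations quoted above can be sketched as follows, applied at one pixel position along the temporal index n. Motion compensation is omitted, and the edge extension at the GOF borders is an illustrative boundary choice not specified by the text:

```python
import numpy as np

def reverse_lifting_dd44(fo, fe):
    """Reverse lifting with the (4,4) Deslauriers-Dubuc filters quoted
    in the text, for one pixel position over the temporal index n.
    fo, fe: 1-D arrays of odd/even frame values at that position.
    Motion compensation omitted; edge replication at the borders is an
    illustrative choice.  Returns (L, H)."""
    fo = np.asarray(fo, dtype=float)
    fe = np.asarray(fe, dtype=float)
    n = len(fo)
    fe_x = np.pad(fe, 2, mode='edge')   # boundary extension of Fe
    # L[n] = Fo[n] + (-Fe[n-2] + 9 Fe[n-1] + 9 Fe[n] - Fe[n+1]) / 32
    L = fo + (-fe_x[0:n] + 9 * fe_x[1:n + 1]
              + 9 * fe_x[2:n + 2] - fe_x[3:n + 3]) / 32
    L_x = np.pad(L, 2, mode='edge')     # boundary extension of L
    # H[n] = Fe[n] - (-L[n-1] + 9 L[n] + 9 L[n+1] - L[n+2]) / 16
    H = fe - (-L_x[1:n + 1] + 9 * L_x[2:n + 2]
              + 9 * L_x[3:n + 3] - L_x[4:n + 4]) / 16
    return L, H
```

Note that H depends on the already-computed L sequence, which is exactly why the first stage of the invention can stop after L without forming H.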
Said second processing step results in: a set of 2 motion vector fields MV12 of a first decomposition level; 2 low-frequency frames L2e and L2o in the temporal sub-band T1 based on the motion vector fields MV12; a motion vector field MV22 of a second decomposition level; and 1 low-frequency frame LL in the temporal sub-band T2 based on the motion vector field MV22.
The second and third steps (i.e. the first and second processing steps) can be processed sequentially or in parallel.
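The remark above that the two processing steps may run in parallel can be sketched with a thread pool; `process_subgroup` is a hypothetical stand-in for the motion estimation and low-frequency-only filtering of one 4-frame sub-group (here simplified to pairwise averaging):

```python
from concurrent.futures import ThreadPoolExecutor

def process_subgroup(frames):
    """Placeholder for one processing step on a 4-frame sub-group:
    two levels of low-frequency-only filtering, simplified here to
    pairwise averaging (motion estimation omitted)."""
    l1 = [(a + b) / 2 for a, b in zip(frames[0::2], frames[1::2])]
    return (l1[0] + l1[1]) / 2   # the single LL frame of this sub-group

def process_gof(gof):
    """Process the sub-groups GOF1 and GOF2 of an 8-frame GOF in
    parallel; they share no data until the T2 level, so the two tasks
    are independent."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ll_1, ll_2 = pool.map(process_subgroup, [gof[:4], gof[4:]])
    return ll_1, ll_2
```

The two returned LL frames are exactly the pair on which the third processing step of Figure 3 operates.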
Figure 3 is a block diagram illustrating a first embodiment of the encoding method in accordance with the invention. According to this embodiment, the encoding method comprises a third processing step including the sub-steps of: doing motion estimation ME on the pair of consecutive low-frequency filtered frames LLe and LLo within the temporal sub-band T2, which have been computed during the first and second processing steps, resulting in a motion vector field MV3 of a third decomposition level; motion-compensated temporal wavelet-based filtering MCTF based on the motion vector field MV3 and on the first equation of the reverse lifting scheme: LLL[n] = LLo[n] + U(LLe[n]), where LLo[n] and LLe[n] are values of pixels of odd and even frames of the temporal sub-band T2, respectively, the pixels corresponding to LLo[n] and LLe[n] being taken along a corresponding motion vector of the motion vector field MV3. This results in a low-frequency filtered frame LLL of a temporal sub-band T3 of a third and last decomposition level in the case of a group GOF of 8 frames. The second equation of the reverse lifting scheme is then applied: LLH[n] = LLe[n] - P(LLL[n]), resulting in a high-frequency filtered frame LLH of the temporal sub-band T3. This additional level of temporal decomposition, i.e. the third level in our example, will not necessarily contribute to the visual quality if the low-frequency filtered frames LL from the current (i.e. second) decomposition level have a low temporal correlation. The criterion used to determine whether an additional level is needed is based on the efficiency of the motion estimation step at the current decomposition level. For example, said criterion is based on a comparison of the number of unconnected pixels defined during motion estimation with a predetermined threshold.
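The criterion above can be sketched as a simple ratio test; the 25% default is an assumed illustrative value, since the text only speaks of a predetermined threshold:

```python
def needs_additional_level(unconnected_count, total_pixels, threshold=0.25):
    """Decide whether a further temporal decomposition level is
    worthwhile: motion estimation is deemed efficient when the fraction
    of unconnected pixels stays below a predetermined threshold (the
    0.25 default is an illustrative assumption)."""
    return unconnected_count / total_pixels < threshold
```

When the test fails, the encoder proceeds directly with the second embodiment of Figure 4 instead of computing the sub-band T3.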
In case of high-intensity motion within a video sequence, inhomogeneous motion vector fields occur, such that certain pixels or entire areas may not be covered by any motion vector. These positions are related to newly uncovered areas and are referred to as unconnected. The notion of "unconnected pixels" also applies to the areas where motion trajectories converge or merge, which for example happens when areas are being covered. In
other words, unconnected pixels are pixels for which the motion estimation step does not find unique motion vectors. If a video sequence comprises high-intensity motion, then the low-frequency filtered frames LL of the temporal sub-band T2 derived from the first GOF1 and second GOF2 groups of four frames (these low-frequency filtered frames consist in fact of temporally averaged information) are different. Thus, motion estimation will not be efficient, and the temporal filtering step will not lead to a packing of temporal information into the resulting low-frequency filtered frame (a lot of information will be left in the high-frequency filtered frame). In contrast, if the motion in the video sequence is slow and regular (i.e. more pixels are uniquely connected by motion vectors and the motion estimation step is thus more efficient), then the low-frequency filtered frames of the temporal sub-band T2 are very similar. Thus, the additional level of temporal decomposition may be efficiently implemented. The encoding method then comprises a one-level four-stage wavelet spatial filtering step of the low-frequency filtered frame LLL and of the high-frequency filtered frame LLH of the temporal sub-band T3. Said filtering step is based on a wavelet transform such as, for example, the one described in "Image coding using wavelet transform", by M. Antonini, M. Barlaud, P. Mathieu and I. Daubechies, IEEE Trans. Image Processing, vol. 1, pp. 205-220, Apr. 1992. Said spatial filtering step is adapted to generate 4 spatial sub-bands of a first decomposition level, corresponding to a spatially filtered low-low frame, a spatially filtered low-high frame, a spatially filtered high-low frame, and a spatially filtered high-high frame. Each spatially filtered frame is sub-sampled by a factor of 2 both in a horizontal and in a vertical direction. Said spatial filtering is applied several times in a pyramidal manner up to the coarsest spatial decomposition level, i.e.
the smallest spatial resolution needed. Once the spatial filtering step has been performed, the encoding method in accordance with the invention comprises a step of quantizing and entropy coding the wavelet coefficients of the filtered frames of the temporal sub-band T3. This coding step is based on, for example, embedded zero-block coding EZBC according to a principle known to a person skilled in the art. The encoding method in accordance with the invention also comprises a step of encoding the motion vector fields based on, for example, lossless differential pulse code modulation DPCM or adaptive arithmetic coding. It finally comprises a step of forming the final bit-stream on the basis of the encoded coefficients of the spatio-temporal sub-bands and of the encoded motion vector fields, the bits of said motion vector fields being sent as overhead.
If not all temporal levels have been encoded yet, and if there are enough computational resources, the second equation of the reverse lifting scheme is applied to obtain the high-frequency filtered frames of the lower-level temporal sub-band, that is, in our example: LH[n] = Le[n] - P(LL[n]), to obtain the high-frequency filtered frames LH of the temporal sub-band T2 of the second decomposition level. This temporal filtering step is followed by a four-stage wavelet spatial filtering, as described before, of the high-frequency filtered frames LH of the temporal sub-band T2. The filtering steps are followed by a step of quantizing and entropy coding the wavelet coefficients of said filtered frames. The bit-stream is finally completed by the encoded data. If computational resources are still available, the second equation of the reverse lifting scheme is applied to obtain the high-frequency filtered frames H of the sub-band T1 of the first decomposition level as follows: H[n] = Fe[n] - P(L[n]). This temporal filtering step is followed by a four-stage wavelet spatial filtering, as described before, of the high-frequency filtered frames H, and by a step of quantizing and entropy coding the wavelet coefficients of said filtered frames, said encoded data being added to the bit-stream. The available computational resources are determined, for example, based on the number of CPU cycles required to execute a temporal decomposition level in real-time, on the available amount of memory, or on the cache size. For example, if the encoding of three temporal decomposition levels in real-time requires a 400 MHz CPU, and if the encoder only has a 300 MHz CPU, then only two temporal decomposition levels will be implemented in order to encode the video sequence in real-time. Conversely, if only a 200 MHz CPU is required for the encoding of two temporal decomposition levels whereas the encoder has a 300 MHz CPU, a third temporal decomposition level will be encoded.
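The resource-driven choice of the number of temporal decomposition levels can be sketched as follows; the mapping from level count to required CPU speed is assumed to come from profiling, and the MHz figures in the test values are the document's own examples:

```python
def temporal_levels_for_cpu(cpu_mhz, mhz_per_levels):
    """Pick the largest number of temporal decomposition levels whose
    real-time cost fits the available CPU speed.
    mhz_per_levels: dict mapping a level count to the CPU speed (MHz)
    required to encode that many levels in real-time (assumed to be
    obtained by profiling)."""
    best = 0
    for levels, required_mhz in sorted(mhz_per_levels.items()):
        if required_mhz <= cpu_mhz:
            best = levels
    return best
```

With the document's figures, a 300 MHz encoder facing requirements of 200 MHz for two levels and 400 MHz for three levels implements two levels.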
Figure 4 is a block diagram illustrating the following steps of a second embodiment of the encoding method in accordance with the invention. Said embodiment corresponds to the case where an additional level of temporal decomposition, i.e. the third decomposition level in our example, would not contribute to the visual quality, or where there are not enough computational resources.
According to this embodiment, the temporal sub-band T3 of the third decomposition level is not computed. Instead, the second equation of the reverse lifting scheme is applied to obtain the high-frequency filtered frames LH of the temporal sub-band T2. This temporal filtering step is followed by a four-stage wavelet spatial filtering, as described before, of the low-frequency filtered frames LL and of the high-frequency filtered frames LH of the temporal sub-band T2. The filtering steps are followed by a step of quantizing and entropy coding the wavelet coefficients of said filtered frames. The motion vector fields are also encoded, and the bit-stream is finally formed on the basis of the encoded coefficients and vectors. If computational resources are still available, the second equation of the reverse lifting scheme is applied to obtain the high-frequency filtered frames H of the sub-band T1 of the first decomposition level. This temporal filtering step is followed by a four-stage wavelet spatial filtering, as described before, of the high-frequency filtered frames H, and by a step of quantizing and entropy coding the wavelet coefficients of said filtered frames, said encoded data being added to the bit-stream.
Thus, different encoding configurations are possible depending on the computational resources and/or the required visual quality. Based on the example of a group of frames GOF comprising 8 frames, these different configurations are the following:
- only the third temporal sub-band T3 (i.e. the frames LLL and LLH) is generated, spatially filtered and encoded; in this case, only the video sequence with the lowest temporal resolution will be reconstructed;
- the third temporal sub-band T3 and the high-frequency frames of the second temporal sub-band T2 (i.e. the frames LLL-LLH-LH-LH) are generated, spatially filtered and encoded;
- the third temporal sub-band T3 and the high-frequency frames of the second T2 and first T1 temporal sub-bands (i.e. the frames LLL-LLH-LH-LH-H-H-H-H) are generated, spatially filtered and encoded, allowing 3 levels of temporal scalability;
- only the second temporal sub-band T2 (i.e. 2 frames LL and 2 frames LH) is generated, spatially filtered and encoded;
- the second temporal sub-band T2 and the high-frequency frames of the first temporal sub-band T1 (i.e. the frames LL-LL-LH-LH-H-H-H-H) are generated, spatially filtered and encoded, allowing 2 levels of temporal scalability.
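The five configurations above can be summarized as a small table, here as a mapping from configuration to the temporal sub-band frames that are generated, spatially filtered and encoded (the configuration labels are descriptive shorthands, not taken from the text):

```python
# Encoding configurations for an 8-frame GOF: which temporal sub-band
# frames are generated, spatially filtered and encoded in each case.
CONFIGURATIONS = {
    "T3 only":               ["LLL", "LLH"],
    "T3 + high of T2":       ["LLL", "LLH", "LH", "LH"],
    "T3 + high of T2 & T1":  ["LLL", "LLH", "LH", "LH", "H", "H", "H", "H"],
    "T2 only":               ["LL", "LL", "LH", "LH"],
    "T2 + high of T1":       ["LL", "LL", "LH", "LH", "H", "H", "H", "H"],
}
```

Each configuration trades computational cost for the number of temporal scalability levels available to the decoder.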
The encoding method in accordance with the invention can be implemented by means of items of hardware or software, or both. Said hardware or software items can be implemented in several manners, such as by means of wired electronic circuits or by means of an integrated circuit that is suitably programmed, respectively. The integrated circuit can be contained in a video encoder. The integrated circuit comprises a set of instructions. Thus, said set of instructions contained, for example, in an encoder memory may cause the integrated circuit to carry out the different steps of the encoding method. The set of instructions may be loaded into the programming memory by reading a data carrier such as, for example, a disk. A service provider can also make the set of instructions available via a communication network such as, for example, the Internet.
It will be apparent to a person skilled in the art that the proposed encoding method is not restricted to the size of the group of frames GOF or to the size of the sub-groups of frames. Moreover, it allows an easy and natural adaptation of the GOF size, depending on the efficiency of motion estimation between each pair of consecutive frames. For example, if during motion estimation the number of unconnected pixels exceeds a predetermined threshold, the temporal filtering MCTF is not implemented. According to the invention, the low-frequency frames of the temporal sub-band of the last decomposition level are spatially filtered and quantized before the generation of the high-frequency frames of the temporal sub-bands of lower decomposition levels. This means that the frame information of a lower decomposition level takes into account the quantization errors of the previously encoded frames of a higher decomposition level. Thus, the same quantized low-frequency temporal information is used at the encoder and the decoder, providing the bit-rate synchronization of encoder and decoder. The proposed invention also allows a low encoding delay, because the sub-band frames which have to be encoded and decoded first are generated at the beginning of the encoding process. This simplifies the bit-budget allocation. Finally, the corresponding encoder and decoder can be easily synchronized in time. The order of generation of the high-frequency frames of a temporal sub-band replicates the order of video frame reconstruction at the decoding side. This feature allows an optimal on-the-fly implementation of temporal scalability: the encoder encodes exactly the same number of high-frequency frames of the temporal sub-bands (i.e. provides the same frame rate) as the decoder is capable of decoding. This feature is very useful for low-delay one-to-one video communication (e.g. video phone).
Any reference sign in the following claims should not be construed as limiting the claim. It will be obvious that the use of the verb "to comprise" and its conjugations do not exclude the presence of any other steps or elements besides those defined in any claim. The word "a" or "an" preceding an element or step does not exclude the presence of a plurality of such elements or steps.