WO2007011851A2 - Filtered and warped motion compensation - Google Patents

Filtered and warped motion compensation

Info

Publication number
WO2007011851A2
WO2007011851A2 (PCT/US2006/027632)
Authority
WO
WIPO (PCT)
Prior art keywords
version
prior picture
prior
picture
plus
Prior art date
Application number
PCT/US2006/027632
Other languages
French (fr)
Other versions
WO2007011851A3 (en)
Inventor
Madhukar Budagavi
Minhua Zhou
Aziz Umit Batur
Original Assignee
Texas Instruments Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US11/327,904 (US20060171569A1)
Application filed by Texas Instruments Incorporated
Publication of WO2007011851A2
Publication of WO2007011851A3

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/80Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103Selection of coding mode or of prediction mode
    • H04N19/105Selection of the reference unit for prediction within a chosen coding or prediction mode, e.g. adaptive choice of position and number of pixels used for prediction
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117Filters, e.g. for pre-processing or post-processing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136Incoming video signal characteristics or properties
    • H04N19/137Motion inside a coding unit, e.g. average field, frame or block difference
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136Incoming video signal characteristics or properties
    • H04N19/137Motion inside a coding unit, e.g. average field, frame or block difference
    • H04N19/139Analysis of motion vectors, e.g. their magnitude, direction, variance or reliability
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/142Detection of scene cut or scene change
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding

Definitions

  • the encoder has to signal to the decoder which type of filtering/warping was carried out on the reference frames.
  • the filtering/warping can be at the frame level or localized only at the macroblock level.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Color Television Systems (AREA)

Abstract

Video compression utilizes filterings and/or warpings of the reference frames for motion estimation and motion compensation. The presence of affine motion, fade, or blur is determined; if present, filterings and/or warpings are applied to reference frames, and motion is estimated using the reference frames plus any filtered/warped reference frames.

Description

FILTERED AND WARPED MOTION COMPENSATION

The present invention relates to digital video signal processing, and more particularly to devices and methods for video compression.

BACKGROUND

Various applications for digital video communication and storage exist, and corresponding international standards have been and continue to be developed. Low bit rate communications, such as video telephony and conferencing, led to the H.261 standard with bit rates as multiples of 64 kbps. Demand for even lower bit rates resulted in the H.263 standard. H.264/AVC is a recent video coding standard that makes use of several advanced video coding tools to provide better compression performance than existing video coding standards such as MPEG-2, MPEG-4, and H.263. At the core of the H.264/AVC standard is the hybrid video coding technique of block motion compensation (BMC) and transform coding. BMC is used to remove temporal redundancy, whereas transform coding is used to remove spatial redundancy in the video sequence. Traditional block motion compensation schemes basically assume that objects in a scene undergo a displacement in the x- and y-directions. This simple assumption works satisfactorily in most practical cases, and thus BMC has become the most widely used technique for temporal redundancy removal in video coding standards. FIGS. 2a-2b illustrate the encoding and decoding with BMC in H.264/AVC, and FIG. 2c shows multiple reference frames.
The traditional BMC model, however, fails to capture temporal redundancy in several scenarios in video sequences as listed below:
Affine motion: When objects in the scene undergo affine motion such as zoom and rotation. Several techniques in the literature modify the motion compensation scheme to handle affine motion; see Y. T. Tse and R. L. Baker, "Global zoom/pan estimation and compensation for video compression", Proc. IEEE ICASSP'91 (Toronto, Ont., Canada), May 1991, pp. 2725-2728; Y. Nakaya and H. Harashima, "Motion compensation based on spatial transformations", IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 3, pp. 339-356, 366-367, June 1994; and T. Wiegand et al., "Affine multipicture motion-compensated prediction", IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 2, pp. 197-209, February 2005.

Lighting variations, fading, and blending: Another scenario where the traditional BMC model fails to capture temporal redundancy is when there is brightness variation in the scene (see K. Kamikura et al., "Global brightness variation compensation for video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 8, pp. 988-1000, Dec. 1998) or when there are scenes of fade in the video sequence (see J. M. Boyce, "Weighted prediction in the H.264/MPEG AVC video coding standard," Proc. IEEE ISCAS, pp. 789-792, 2004). Fades (e.g., fade-to-black, fade-to-white) are sometimes used to transition between scenes in a video sequence. The H.264/AVC standard introduces a new video coding tool, weighted prediction, to efficiently encode such scenes of fades. The traditional BMC approach also fails when blending of frames is used to transition between scenes.
Blurring: One more scenario where the traditional BMC model fails to capture temporal redundancy is when there is blurring in the video sequence. Blurring typically occurs in video sequences when the relative motion between the camera and the scene being captured is faster than the camera exposure time. Blurring also occurs when objects at different depths in a scene are focused and defocused as is done in movies to focus on different actors in a scene.
Consequently, traditional block-based motion compensation techniques such as those used in H.264/AVC become ineffective when affine motion, lighting variations, or blurring occur in the video sequence.

SUMMARY
The present invention provides a video codec architecture and methods with filtered/warped versions of reference frames as additional reference frames for motion compensation.
This allows for efficient compression in scenarios of affine motion, lighting variation, and blurring.
BRIEF DESCRIPTION OF THE DRAWINGS
FIGS. 1a-1d illustrate a preferred embodiment including encoding and decoding.
FIGS. 2a-2c illustrate motion compensation in H.264/AVC.
FIGS. 3a-3b show a processor and network communication.

DETAILED DESCRIPTION OF THE EMBODIMENTS
1. Overview
The preferred embodiment video compression methods include filtering and/or warping one or more reference frames (or just parts thereof) to provide additional reference frames for motion compensation. This allows accurate prediction in scenarios with affine motion, lighting and fading variations, and/or blurring. FIG. 1a is a flow diagram of a preferred embodiment with detection, FIG. 1b shows a set of reference frames generated from reconstructed frames plus filtered and/or warped versions of these frames, and FIGS. 1c-1d illustrate an encoder and decoder which implement a preferred embodiment method. The set of reference frames (or portions of frames) generated may be in response to detection of blurring, fading, and/or affine motion in a current frame, or may result from applying a pre-defined set of filterings and warpings with possible adaptation.
Preferred embodiment systems (e.g., cellphones, PDAs, digital cameras, notebook computers, etc.) perform preferred embodiment methods with any of several types of hardware, such as digital signal processors (DSPs), general purpose programmable processors, application specific circuits, or systems on a chip (SoC) such as multicore processor arrays or combinations such as a DSP and a RISC processor together with various specialized programmable accelerators (e.g., FIG. 3a). A stored program in an onboard or external (flash EEP)ROM or FRAM could implement the signal processing methods. Analog-to-digital and digital-to-analog converters can provide coupling to the analog world; modulators and demodulators (plus antennas for air interfaces such as for video on cellphones) can provide coupling for transmission waveforms; and packetizers can provide formats for transmission over networks such as the Internet as illustrated in FIG. 3b.
2. Video compression with filtered/warped reference frames

First consider FIG. 2a, which shows the traditional multiframe block motion compensation (BMC) approach of H.264/AVC. The current video frame f(n) is predicted from the set of M frames f(n-1), f(n-2), ..., f(n-M) (some of which could be frames in the future) by using BMC.
In contrast, FIG. 1c shows the first preferred embodiment generalized filtered and warped multiframe BMC, and FIG. 1b heuristically shows an array of reference frames. Given a reference frame f(n-i), the preferred embodiment methods generate a set of k_i other reference frames, f_{n-i}(1), f_{n-i}(2), ..., f_{n-i}(k_i), each of which is either a filtered version of frame f(n-i), a warped version of frame f(n-i), or both. This extended set of filtered and warped reference frames is then used in BMC for predicting the current frame f(n). More explicitly, let H_{n-i,k}{·} denote the operator for obtaining f_{n-i}(k) from f(n-i); that is, f_{n-i}(k) = H_{n-i,k}{f(n-i)}.
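For concreteness, the following is a minimal Python sketch (not part of the original disclosure) of assembling such an extended reference set; the callable-operator representation and the example fade are illustrative assumptions.

```python
import numpy as np

def extended_reference_set(frame, operators):
    # Each operator H_{n-i,k} is modeled as a callable frame -> frame;
    # the plain reconstructed frame is always kept as a reference too.
    refs = [frame]
    for H in operators:
        refs.append(H(frame))
    return refs

# Example: one fade operator (alpha * f + beta), clipped to 8-bit range.
fade = lambda f: np.clip(0.8 * f + 10.0, 0, 255)
refs = extended_reference_set(np.zeros((16, 16)), [fade])  # two references
```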
The first preferred embodiment filterings and warpings are as follows.

(a) Filtered reference frames
The filtered reference frames are obtained by linear or non-linear filtering. The filtering operation covers several scenarios in video sequences that are not efficiently captured by BMC as follows.
(i) Blurring: H_{n-i,k}{·} will be a linear translation-invariant filter with impulse response given by h_{n-i,k}(p,q). For example, a motion blur which is four pixels long in the horizontal direction is captured by the impulse response h_{n-i,k}(p,q) = 1/4 at the points (p,q) = (0,0), (0,1), (0,2), (0,3) and h_{n-i,k}(p,q) = 0 at all other points. Section 4 has more blurring details.
(ii) Fading: H_{n-i,k}{·} will be an operator that acts on one pixel at a time; for example, it can be defined by the operation f_{n-i}(k) = α f(n-i) + β with parameters α and β.

(iii) Global motion: Global motion can also be captured by a linear translation-invariant filter. Let (mvx, mvy) be the global motion vector. Then H_{n-i,k}{·} is defined by the linear translation-invariant filter h_{n-i,k}(p,q) which is zero everywhere except at the location (mvx, mvy), where h_{n-i,k}(mvx, mvy) = 1; that is, the impulse response is a delta function.
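A hedged sketch of these three filtered-reference cases; the 2-D convolution uses scipy, the filter length and fade parameters are illustrative, and global motion is treated as an integer shift for simplicity.

```python
import numpy as np
from scipy.ndimage import convolve

def blur_ref(frame):
    # 4-pixel horizontal motion blur: h = 1/4 at (0,0)..(0,3).
    h = np.ones((1, 4)) / 4.0
    return convolve(frame, h, mode='nearest')

def fade_ref(frame, alpha, beta):
    # Per-pixel fade: f_{n-i}(k) = alpha * f(n-i) + beta.
    return alpha * frame + beta

def global_motion_ref(frame, mvx, mvy):
    # Delta-function impulse response at (mvx, mvy): an integer shift.
    return np.roll(frame, shift=(mvy, mvx), axis=(0, 1))
```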
(b) Warped reference frames
The warped reference frames are derived as follows. Let (x, y) denote the coordinate system of the input reference image f(n-i) and let (x', y') denote the coordinate system of f_{n-i}(k). Then H_{n-i,k}{·} is defined by the geometric transformation [x', y', 1] = [x, y, 1] T, where the 3x3 matrix T is given by:

    T = | t11  t12  0 |
        | t21  t22  0 |
        | t31  t32  1 |
The above transformation covers several scenarios in video sequences that are not efficiently captured by BMC. Some specific examples of transformations are given below:
(i) Zoom:

    T = | sx  0   0 |
        | 0   sy  0 |
        | 0   0   1 |

(ii) Rotation:

    T = |  cos θ  sin θ  0 |
        | -sin θ  cos θ  0 |
        |  0      0      1 |
For a localized transformation, the coordinates would be relative to the center of the region being transformed. Note that transforming the image involves image interpolation; we can use nearest-neighbor, bilinear, or other image interpolation techniques for this purpose. Also, in the case of zoom the resulting transformed image is larger than the input image, so we allow the transformed reference image size to be greater than the input image size. When the transformed image ends up smaller than the input image, we pad the transformed image by extending edge pixels; we also pad when the image is rotated.
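One way such a warp could be realized is sketched below; scipy's affine_transform resamples by inverse mapping, order=1 gives bilinear interpolation, and mode='nearest' extends edge pixels for padding. Keeping the output the same size as the input is a simplification of the variable-size handling described above, and the (row, col) coordinate bookkeeping is an assumption.

```python
import numpy as np
from scipy.ndimage import affine_transform

def warp_ref(frame, T):
    # Inverse mapping: [x, y, 1] = [x', y', 1] T^{-1}; affine_transform
    # maps each output (row, col) coordinate back into the input frame.
    Tinv = np.linalg.inv(T)
    M = Tinv[:2, :2].T[::-1, ::-1]   # reorder (x, y) axes to (row, col)
    offset = Tinv[2, :2][::-1]
    return affine_transform(frame, M, offset=offset,
                            order=1,         # bilinear interpolation
                            mode='nearest')  # pad by extending edge pixels

def zoom_T(sx, sy):
    return np.array([[sx, 0, 0], [0, sy, 0], [0, 0, 1.0]])

def rot_T(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s, 0], [-s, c, 0], [0, 0, 1.0]])
```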
Of course, any two or all three of the operations blurring, fading, and warping can be applied to a frame. Also, note that translations (global motion) can be expressed with a T matrix by non-zero t31, t32.

3. Encoder options
The encoder has to signal to the decoder which type of filtering/warping was carried out on the reference frames. The filtering/warping can be at the frame level or localized at the macroblock level.

(a) Frame level
We use two parameters, op_type and op_parameters, to encode this information. For example, in the case of motion blur, op_type would be motion blur and op_parameters would be the filter response coefficients. In the case of warping, op_type would be affine transform and op_parameters would be the transform matrix T. And so forth for fading and other combinations.
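As a simple illustration, the pair could be carried in a container like the following; the container and the parameter encodings are hypothetical, not the patent's bitstream syntax.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class RefOp:
    op_type: str        # e.g. 'motion_blur', 'affine', 'fade'
    op_parameters: Any  # filter taps, a 3x3 matrix T, or (alpha, beta)

ops_for_frame = [RefOp('motion_blur', [0.25, 0.25, 0.25, 0.25]),
                 RefOp('fade', (0.8, 10.0))]
```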
(b) Macroblock level

The filtering and warping can be carried out at the macroblock level too. This is especially useful when portions of the input frame undergo different kinds of transformation. For example, one person in a scene might be walking towards the camera, resulting in a zoom of that person, while another person might be running away horizontally, creating a motion blur in that portion of the video frame. Since the whole frame is not undergoing the same kind of filtering/warping, it is not computationally efficient to generate a whole filtered/warped reference frame. For macroblock-level encoding, we expand the macroblock type parameter mb_type that is used for signaling the mode information to include the case of filtering and warping. To reduce the signaling overhead on mb_type, we can a priori signal a table of possible filter and warp parameters for the whole video sequence; the mb_type parameter then indexes into this table, as sketched below.
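A rough sketch of the table-indexing idea (table entries, parameter formats, and the dispatch helper are invented for illustration):

```python
# Table of filter/warp parameter sets signaled once for the sequence.
param_table = [
    ('none',        None),
    ('motion_blur', [0.25] * 4),
    ('fade',        (0.9, 5.0)),
]

def apply_op(op, params, block):
    # Stand-in dispatch; real code would call the blur/fade/warp
    # routines sketched earlier.
    if op == 'fade':
        alpha, beta = params
        return alpha * block + beta
    return block  # other ops elided in this sketch

def predict_mb(mb_type, ref_block):
    op, params = param_table[mb_type]   # mb_type indexes the table
    return apply_op(op, params, ref_block)
```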
4. Blurring

The blurring filter used to generate a blurred frame is signaled from the encoder to the decoder as side information, e.g., as part of the Supplemental Enhancement Information (SEI) in H.264/AVC. The decoder uses this information to generate the blurred frame in its reference frame buffer from its prior reconstructed frame.
The encoder iterates over a set of predefined blur filters to find the best blur filter in terms of rate reduction. For example, blur filters of two types could be considered:
(a) averaging filter: ba_K(·,·), which averages over a block of size K x K
(b) motion blur filter: bm_r_θ(·,·), where r denotes the motion magnitude and θ denotes the direction of motion. In particular, let ones(m,n) denote an m x n matrix with all entries equal to 1.
The following set of seven simple predefined blur filters could be used in the coder. The first three are averaging filters and the last four are motion blur filters:

    ba4     = ones(4,4)   / 16
    ba8     = ones(8,8)   / 64
    ba16    = ones(16,16) / 256
    bm_4_0  = ones(1,4)   / 4
    bm_4_90 = ones(4,1)   / 4
    bm_6_0  = ones(1,6)   / 6
    bm_6_90 = ones(6,1)   / 6
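Written out as a Python filter bank (a sketch; the names mirror the notation above):

```python
import numpy as np

blur_filters = {
    'ba4':     np.ones((4, 4))   / 16,   # averaging filters
    'ba8':     np.ones((8, 8))   / 64,
    'ba16':    np.ones((16, 16)) / 256,
    'bm_4_0':  np.ones((1, 4))   / 4,    # motion blur, 0 degrees
    'bm_4_90': np.ones((4, 1))   / 4,    # motion blur, 90 degrees
    'bm_6_0':  np.ones((1, 6))   / 6,
    'bm_6_90': np.ones((6, 1))   / 6,
}
```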
The blur compensation technique is useful only in the region of blurs. The use of this video coding tool is similar to that of the weighted prediction tool of H.264/AVC which is mainly useful only in the region of fades.
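A hedged sketch of the encoder-side iteration described at the start of this section, using SAD as a crude stand-in for the rate-reduction criterion and omitting the motion search itself:

```python
import numpy as np
from scipy.ndimage import convolve

def sad(a, b):
    # Sum of absolute differences between two equal-size arrays.
    return int(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

def best_blur_filter(current, reference, blur_filters):
    best_name, best_cost = None, sad(current, reference)  # unblurred baseline
    for name, h in blur_filters.items():
        cost = sad(current, convolve(reference, h, mode='nearest'))
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, best_cost   # None means the plain reference wins
```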
Blur compensation can also be done at the block level by using additional modes in motion estimation/compensation. Current video encoders search over a set of several modes (INTRA, INTER, INTER+4MV, etc.) to find the best encoding option; blur mode would be one more mode in this set over which the encoder searches. Blur compensation at the block level reduces computational complexity in the scenario where only a portion of the video frame is blurred. It is also useful when different objects undergo blur in different directions, e.g., the camera could be moving to the left while the main object of interest moves to the right.

The complexity of the foregoing preferred embodiment brute-force blur compensation encoder is high. Hence, to reduce computational complexity, preferred embodiment methods may run the blur compensation method only in the regions where there is blurring. Such regions can be detected using techniques from video camera auto-focusing (for example, see K.-S. Choi et al., "New autofocusing technique using frequency selective weighted median filter for video cameras," 45 IEEE Trans. Cons. Elec. 820 (1999) and references therein). In the encoder, iterate over a set of pre-defined blur filters to find the best blur filter in terms of rate reduction. Compression performance can be improved and complexity reduced by estimating the blur, such as by using transform domain processing.

5. Reference frame buffer

The frames in the multiframe buffer (which consists of reference frames and their filtered/warped counterparts) can be long-term or short-term reference frames. Hence, the short-term and long-term reference frame management of H.264/AVC extends to the preferred embodiment multiframe buffer.
6. Parameter determination

Parameter values for the filterings/warpings applied to the reconstructed reference frame(s) to generate the full set of reference frames as in FIG. 1b are needed; they can be either a selected pre-defined set of parameter values (with possible refinements) or directly estimated in various ways. Parameter values can also be predicted from one frame to another. For example, zooming in a video sequence typically occurs over several frames, so zoom values calculated for one frame can be used to predict the initial zoom parameters for the next frame (see the sketch below). Note that parameters may be roughly constant only within regions, in which case estimation and filtering/warping would be done only on a regional basis. For blurring, horizontal object motion and vertical falling-object blurs are common, so a pre-defined set of blur filters could be as in section 4 with only 0- and 90-degree directional blurs. Parameter value adaptation (e.g., the vertical blur for a falling object should increase as the object accelerates) could be used to lower complexity. Translations of objects (regions) or of the background (camera panning) would be estimated in any case by the usual motion vectors derived from the reference frame(s) without filtering/warping.
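For instance, a zoom predicted from the previous frame could seed a small local refinement; warp_ref, zoom_T, and sad are assumed from the earlier sketches, and the step sizes are arbitrary.

```python
def refine_zoom(current, reference, s0, steps=(-0.05, 0.0, 0.05)):
    # s0 is the zoom carried over from the previous frame; test small
    # deviations around it and keep the best by SAD.
    best_s, best_cost = s0, float('inf')
    for ds in steps:
        s = s0 + ds
        cost = sad(current, warp_ref(reference, zoom_T(s, s)))
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s
```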
Similarly, uniform zooms (sx ≈ sy) and rotations are typical affine motions and only require two parameters, so a pre-defined set of transformations as in section 2(b), with coordinates relative to the center of an object or region, plus adaptive parameters could likewise lower complexity. Various methods are available for direct parameter estimation (instead of a search over a pre-defined initial set of parameter values plus refinements and adaptations): for blurring, autofocus information and local frequency domain analysis; for fading and lighting variations, local illuminance; and for affine motion, methods such as refining the initial motion vector(s) through an error minimization as in Wiegand et al. cited in the background.
7. Modifications

The preferred embodiments can be modified in various ways while retaining the feature of including filtered and/or warped (portions of) reference frames for motion estimation.
For example, fields (top and bottom) could be used instead of frames; that is, the methods apply to pictures generally, with corresponding adjustments. Detection of possible blur, fade, and/or affine motion could be invoked when the distortion (e.g., SAD) of the usual BMC motion vector prediction exceeds a threshold. In this case, apply filtering/warping locally about the reference block, compare the corresponding distortions (SADs) to the original SAD to find a possible blur, fade, and/or affine motion, and then refine the parameter values; the encoding decision then trades off distortion against the rate increase from the extra information encoded.
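A minimal sketch of this trigger, with an illustrative threshold and an assumed map of block-level filter/warp operators:

```python
def predict_block(cur_block, ref_block, ops, threshold=1000):
    # Plain BMC prediction first; only when its SAD is poor are the
    # filtered/warped candidates tried. sad() is from the earlier sketch;
    # ops maps a name to a block operator (blur, fade, warp, ...).
    base = sad(cur_block, ref_block)
    if base <= threshold:
        return 'plain', base
    best = ('plain', base)
    for name, H in ops.items():
        cost = sad(cur_block, H(ref_block))
        if cost < best[1]:
            best = (name, cost)
    return best
```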

Claims

CLAIMS

What is claimed is:
1. A method of video compression of the type with block motion compensation which finds motion vectors for blocks of pixels in a current picture using one or more reference pictures, the improvement comprising the steps of:
(a) including in the one or more reference pictures a set of filterings/warpings of a prior picture, wherein said filterings/warpings of said prior picture are selected from the group consisting of (i) a blurred version of said prior picture plus a warped version of said prior picture, (ii) a blurred version of said prior picture plus a faded version of said prior picture, (iii) a faded version of said prior picture plus a warped version of said prior picture, and (iv) a blurred version of said prior picture plus a faded version of said prior picture plus a warped version of said prior picture.
2. The method of Claim 1, wherein said filterings/warpings are selected from a finite set.
3. The method of Claim 1, wherein said filterings/warpings are determined from estimation of blur parameters, fade parameters, and/or warp parameters.
4. A video encoder of the type with a block motion compensator which finds motion vectors for blocks of pixels in a current picture using one or more reference pictures in memory, the improvement comprising:
(a) a filter/warper coupled to said memory and operable to generate and store in said memory for use as reference pictures a set of filterings/warpings of a prior picture, wherein said filterings/warpings of said prior picture are selected from the group consisting of (i) a blurred version of said prior picture plus a warped version of said prior picture, (ii) a blurred version of said prior picture plus a faded version of said prior picture, (iii) a faded version of said prior picture plus a warped version of said prior picture, and (iv) a blurred version of said prior picture plus a faded version of said prior picture plus a warped version of said prior picture.
5. The encoder of Claim 4, wherein said filter/warper includes a blur detector.
6. The encoder of Claim 4, wherein information about a filtering/warping of a prior picture used to determine a motion vector is associated with the motion vector.
7. A video decoder of the type with a block motion compensator which uses motion vectors to predict blocks of pixels in a current picture from one or more reference pictures in memory, the improvement comprising:
(a) a filter/warper coupled to said memory and operable to generate and store in said memory filterings/warpings of a prior reconstructed picture, wherein said filterings/warpings of said prior picture are selected from the group consisting of (i) a blurred version of said prior picture plus a warped version of said prior picture, (ii) a blurred version of said prior picture plus a faded version of said prior picture, (iii) a faded version of said prior picture plus a warped version of said prior picture, and (iv) a blurred version of said prior picture plus a faded version of said prior picture plus a warped version of said prior picture.
8. The decoder of Claim 7, wherein information about a filtering/warping of a prior picture associated with a motion vector is used to determine which of the filtering/warping to use for block prediction.
PCT/US2006/027632 2005-07-15 2006-07-17 Filtered and warped motion compensation WO2007011851A2 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US70000305P 2005-07-15 2005-07-15
US60/700,003 2005-07-15
US11/327,904 2006-01-09
US11/327,904 US20060171569A1 (en) 2005-01-10 2006-01-09 Video compression with blur compensation

Publications (2)

Publication Number Publication Date
WO2007011851A2 (en) 2007-01-25
WO2007011851A3 WO2007011851A3 (en) 2007-10-04

Family

ID=37669457

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/027632 WO2007011851A2 (en) 2005-07-15 2006-07-17 Filtered and warped motion compensation

Country Status (1)

Country Link
WO (1) WO2007011851A2 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011101442A3 (en) * 2010-02-19 2012-03-08 Skype Limited Data compression for video
US8681873B2 (en) 2010-02-19 2014-03-25 Skype Data compression for video
FR2996093A1 (en) * 2012-09-27 2014-03-28 France Telecom METHOD FOR ENCODING AND DECODING IMAGES, CORRESPONDING ENCODING AND DECODING DEVICES AND COMPUTER PROGRAMS
US9313526B2 (en) 2010-02-19 2016-04-12 Skype Data compression for video
EP2920968A4 (en) * 2012-11-13 2016-07-20 Intel Ip Corp Content adaptive, characteristics compensated prediction for next generation video
EP2951998A4 (en) * 2013-01-30 2016-09-28 Intel Corp Content adaptive predictive and functionally predictive pictures with modified references for next generation video coding
US9609342B2 (en) 2010-02-19 2017-03-28 Skype Compression for frames of a video signal using selected candidate blocks
US9819358B2 (en) 2010-02-19 2017-11-14 Skype Entropy encoding based on observed frequency

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030169931A1 (en) * 2002-01-14 2003-09-11 Nokia Corporation Coding dynamic filters

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030169931A1 (en) * 2002-01-14 2003-09-11 Nokia Corporation Coding dynamic filters

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9609342B2 (en) 2010-02-19 2017-03-28 Skype Compression for frames of a video signal using selected candidate blocks
CN102792688A (en) * 2010-02-19 2012-11-21 斯凯普公司 Data compression for video
US8681873B2 (en) 2010-02-19 2014-03-25 Skype Data compression for video
US9819358B2 (en) 2010-02-19 2017-11-14 Skype Entropy encoding based on observed frequency
WO2011101442A3 (en) * 2010-02-19 2012-03-08 Skype Limited Data compression for video
US8913661B2 (en) 2010-02-19 2014-12-16 Skype Motion estimation using block matching indexing
US9078009B2 (en) 2010-02-19 2015-07-07 Skype Data compression for video utilizing non-translational motion information
US9313526B2 (en) 2010-02-19 2016-04-12 Skype Data compression for video
WO2014049224A1 (en) * 2012-09-27 2014-04-03 Orange Method of coding and decoding images, coding and decoding device and computer programs corresponding thereto
FR2996093A1 (en) * 2012-09-27 2014-03-28 France Telecom METHOD FOR ENCODING AND DECODING IMAGES, CORRESPONDING ENCODING AND DECODING DEVICES AND COMPUTER PROGRAMS
US10869030B2 (en) 2012-09-27 2020-12-15 Orange Method of coding and decoding images, a coding and decoding device, and corresponding computer programs
EP2920968A4 (en) * 2012-11-13 2016-07-20 Intel Ip Corp Content adaptive, characteristics compensated prediction for next generation video
US9762929B2 (en) 2012-11-13 2017-09-12 Intel Corporation Content adaptive, characteristics compensated prediction for next generation video
EP2951998A4 (en) * 2013-01-30 2016-09-28 Intel Corp Content adaptive predictive and functionally predictive pictures with modified references for next generation video coding
EP2951994A4 (en) * 2013-01-30 2016-10-12 Intel Corp Content adaptive bitrate and quality control by using frame hierarchy sensitive quantization for high efficiency next generation video coding
EP3008900A4 (en) * 2013-01-30 2017-03-15 Intel Corporation Content adaptive bi-directional or functionally predictive multi-pass pictures for high efficiency next generation video coding
US9973757B2 (en) 2013-01-30 2018-05-15 Intel Corporation Content adaptive predictive and functionally predictive pictures with modified references for next generation video coding
US10021392B2 (en) 2013-01-30 2018-07-10 Intel Corporation Content adaptive bi-directional or functionally predictive multi-pass pictures for high efficiency next generation video coding

Also Published As

Publication number Publication date
WO2007011851A3 (en) 2007-10-04

Similar Documents

Publication Publication Date Title
Jozawa et al. Two-stage motion compensation using adaptive global MC and local affine MC
JP2920210B2 (en) Motion vector prediction method for moving images
US20060171569A1 (en) Video compression with blur compensation
IL257496A (en) Method and apparatus of motion compensation for video coding based on bi prediction optical flow techniques
WO2007011851A2 (en) Filtered and warped motion compensation
US20200244965A1 (en) Interpolation filter for an inter prediction apparatus and method for video coding
CA2808160C (en) Optimized deblocking filters
US20060222074A1 (en) Method and system for motion estimation in a video encoder
US20100215104A1 (en) Method and System for Motion Estimation
Parker et al. Global and locally adaptive warped motion compensation in video compression
WO2012012582A1 (en) Reference processing using advanced motion models for video coding
Kim et al. Zoom motion estimation using block-based fast local area scaling
WO2020207475A1 (en) Method and apparatus of simplified affine subblock process for video coding system
Chen et al. Integration of digital stabilizer with video codec for digital video cameras
Sun et al. Predictive motion estimation with global motion predictor
JP7269371B2 (en) Method and Apparatus for Prediction Improvement Using Optical Flow
CN101263513A (en) Filtered and warpped motion compensation
Zhang et al. Fast motion estimation in HEVC inter coding: An overview of recent advances
CN115280779A (en) Method and apparatus for affine motion compensated prediction refinement
WO2023200642A1 (en) Applications of template matching in video coding
Peng et al. Integration of image stabilizer with video codec for digital video cameras
WO2022206928A1 (en) Method, device, and medium for video processing
WO2023020590A1 (en) Method and apparatus for hardware-friendly template matching in video coding system
Chan et al. A novel predictive global motion estimation for video coding
Alparone et al. An improved H. 263 video coder relying on weighted median filtering of motion vectors

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200680033830.8

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06787526

Country of ref document: EP

Kind code of ref document: A2