WO2010070128A1

WO2010070128A1 - Method for multi-resolution motion estimation

Info

Publication number: WO2010070128A1
Application number: PCT/EP2009/067589
Authority: WO
Inventors: Fabrice Urban; Olivier Le Meur; Edouard Francois
Original assignee: Thomson Licensing
Priority date: 2008-12-19
Filing date: 2009-12-18
Publication date: 2010-06-24
Also published as: FR2940492A1

Abstract

The purpose of the invention is a motion estimation method for a video sequence in which the images are divided into blocks of pixels, the motion estimation being carried out by the analysis of N versions of a same image corresponding to different resolution levels, said analysis starting with the lowest level resolution and ending with the highest level resolution of the current image. A motion field estimation (203, 204, 205, 206, 208) is carried out for the different resolution levels and the dominant motion parameters are estimated (207) for at least one low or medium resolution level, said parameters being used as predictions for the estimation of the motion field of a higher resolution level.

Description

METHOD FOR MULTI-RESOLUTION MOTION ESTIMATION

The invention relates to a multi-resolution motion estimation method. It applies notably to the domains of video analysis, coding and transcoding.

A video sequence comprises by its nature a high statistical redundancy in both the temporal and spatial domains. This redundancy can be used on the one hand to compress said sequence and on the other hand to analyse and characterize its content by identifying, for example, the areas in motion of images of said sequence. Thus, motion estimation algorithms search for the block or the area in the reference images that best corresponds to a given block or area of the image being processed, said image being referred to as the current image in the remainder of the description. A motion estimation vector is obtained, said vector corresponding to the displacement of the block or the area between two images.

Today numerous applications require the implementation of algorithms enabling analysis in real time of the physical motion within a video sequence. To do this, "block matching" type algorithms, usually designated by the abbreviation BMA, can be used. In this case, the current image is divided into blocks of MxN pixels. For a given block of the current image, the BMA algorithm then searches for a corresponding block in a reference image. To do this, a measurement distance D is calculated between the current image block and each candidate. An example of measurement distance D using a Lagrangian is described in the article by G. Sullivan and T. Wiegand entitled 'Rate-Distortion Optimization for Video Compression', IEEE Signal Processing Magazine, pp. 74-90, November 1998. Lagrangian optimization enables the homogeneity of the motion field obtained by BMA to be improved.

The simplest version of the BMA algorithm carries out a complete search in a given window with a width of p pixels, that is to say that each reference image block present inside said window is a candidate to be considered. This technique requires significant computing power. Thus, faster algorithms have been proposed, such as for example the hierarchical HME (Hierarchical Motion Estimator) model, or the improved HDS (Hierarchical Diamond Search) model. The BMA type algorithms thus enable a motion field to be generated composed of motion vectors, a vector being associated with each of the blocks analysed.
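As an illustration only, the complete search described above can be sketched as follows. The function names `sad` and `full_search`, the square block of side `size`, and the reading of the window as displacements of up to p pixels in each direction are assumptions made for this sketch and are not taken from the description.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between the block of `cur` whose top-left
    corner is (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, size, p):
    """Test every displacement in [-p, p] x [-p, p] and return
    ((dx, dy), sad) for the candidate with the smallest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Skip candidates that fall outside the reference image.
            if not (0 <= bx + dx and bx + dx + size <= width):
                continue
            if not (0 <= by + dy and by + dy + size <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best
```

The number of candidates grows with the square of p, which is the computing cost that the hierarchical approaches mentioned above aim to reduce.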

The purpose of the DME (Dominant Motion Estimator) type algorithms is to estimate the motion relative to the background of images of the video sequence. This motion is due, for example, to camera movements, zoom effects or a panoramic shot. The algorithm uses as inputs motion vectors resulting from, for example, a BMA estimation, and then proceeds to the estimation of parameters of a motion model, a two-dimensional affine model, for example.

For the homogenous areas of an image as well as for the areas with unidirectional texture, the reliability of motion vectors estimated via a BMA type algorithm is usually poor. In these areas, the vectors do not necessarily correspond to a real motion. Within the context of an image segmentation application applied to the video sequence to be analysed, incoherent results can thus be obtained, because the homogenous areas corresponding to the dominant motion are not detected. Moreover, if the vectors thus obtained are used by a DME type algorithm, the overall motion estimation relies on only a reduced number of correct motion vectors. As a consequence, the precision of the results is poor.

One purpose of the invention is notably to overcome the aforementioned disadvantages.

For this purpose the object of the invention is a motion estimation method for a video sequence in which the images are divided into blocks of pixels, the motion estimation being carried out by the analysis of N versions of a same image corresponding to different resolution levels, said analysis starting with the lowest level resolution and ending with the highest level resolution of the current image. A motion field estimation is carried out for the different resolution levels and the dominant motion parameters are estimated for at least one low or medium resolution level, said parameters being used as predictions for the estimation of the motion field of a higher resolution level.

According to an aspect of the invention, the dominant motion parameters estimated for a given level are memorized in order to be used as predictions during the motion field estimation of the image or images corresponding to the current image for the same resolution level.

The motion field vectors of a given resolution level can be used, for example, as predictions for the motion field estimation of a higher resolution level.

The dominant motion parameters estimated for a given resolution level are, for example, memorized in order to be used to initialise the step of estimation of dominant motion parameters of the image or images corresponding to the current image for the same resolution level.

In one embodiment, the dominant motion parameters verify a two-dimensional affine model.

In another embodiment, for the estimation of dominant motion parameters of low and medium resolution levels, a translation parameter is estimated and for the highest resolution levels, 6 parameters verifying a two-dimensional affine model are determined.

For a block of pixels of a given resolution level of the current image, the best prediction available for the estimation of vectors of the motion field can be selected such that the measurement distance D is minimized, said distance being expressed by an equation of type D = SAD + λ × C in which:

SAD is the sum of absolute differences between the current block and the reference block,

C is the motion vectors coding cost, that is to say the distance measured between the motion vector and a cost indicator,

λ is a real constant.

According to an aspect of the invention, the cost indicator corresponds to the median of motion vectors of neighbouring blocks.

According to another aspect of the invention, the cost indicator corresponds to a prediction corresponding to the dominant motion estimation parameters.

The choice between a cost indicator corresponding to the median of motion vectors of neighbouring blocks and a cost indicator corresponding to dominant motion estimation parameters is made per block according to, for example, the best motion vector prediction.

In one implementation, the algorithm carrying out the dominant motion estimation at a given resolution level is initialised by the dominant motion parameters estimated for the current image at a lower resolution level.

A confidence level of the motion estimation carried out on the current image is determined, for example, by calculating the proportion of vectors that follow the dominant motion at the highest resolution level.

Other characteristics and advantages of the invention will emerge with the help of the description that follows provided as a non-restrictive example, made with regard to the annexed drawings wherein:

figure 1 illustrates the principle of multi-resolution motion estimation,

figure 2 provides an example diagram implementing the method according to the invention,

figure 3 presents a way of carrying out the dominant motion estimation in the context of the invention.

Figure 1 illustrates the principle of multi-resolution motion estimation. The BMA type algorithms as described previously involve a high level of computing complexity. So as to produce a motion estimation on a video sequence, it is therefore recommended to use this algorithm type intelligently.

The content of the video sequence is taken into account by the motion prediction techniques. Motion fields usually present spatial and temporal continuity properties. Thus, it is possible to predict the motion of a given block from the motion of its neighbouring blocks and from preceding images. A set of predictions is then available. Hereafter in the description, a prediction corresponds to a candidate vector representing the motion of a block between two images and having to be tested in order to verify that it indeed corresponds to the real motion of said block. Each prediction is evaluated by calculating, for example, a measurement distance D. For example, this measurement distance could be the sum of absolute differences, designated by the abbreviation SAD (Sum of Absolute Differences). This SAD represents the distortion between the current block and the reference block. The motion vectors coding cost C can be taken into account through the introduction of a Lagrange coefficient in order to minimize the distortions introduced by the estimation.

The distance D can be described by the following expression:

D = SAD + λ × C    (1)
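The evaluation of candidates with this distance can be sketched as follows. The names are hypothetical, and the use of the sum of absolute component differences as the distance between a vector and the cost indicator is an assumption of this sketch; the description only states that C is a distance between the motion vector and a cost indicator.

```python
def vector_cost(mv, indicator):
    """Coding cost C: distance between the motion vector and the cost
    indicator (here an L1 distance, which is an assumption)."""
    return abs(mv[0] - indicator[0]) + abs(mv[1] - indicator[1])


def distance(sad_value, mv, indicator, lam):
    """Measurement distance D = SAD + lambda * C."""
    return sad_value + lam * vector_cost(mv, indicator)


def best_candidate(candidates, indicator, lam):
    """candidates is a list of (motion_vector, sad) pairs; return the pair
    that minimizes D."""
    return min(candidates, key=lambda c: distance(c[1], c[0], indicator, lam))
```

With λ = 0 the choice depends on the SAD alone. A larger λ favours vectors close to the cost indicator, which is what makes the resulting field more homogenous.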

A search for the best motion vector is then carried out in the neighbouring area of the best prediction using, for example, a local search schema. An example of an algorithm enabling this search type is described in the article by Alexis Michael Tourapis "Enhanced Predictive Zonal Search for Single and Multiple Frame Motion Estimation", Proceedings of Visual Communications and Image Processing, pages 1069-1079, 2002. Numerous other BMA type algorithms exist and are distinguished by the manner in which the set of predictions is determined for a block as well as by the local search schema selected.
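A local search around the best prediction can take many forms. The sketch below uses a small four-point diamond pattern that is re-centred as long as the distance decreases. It is a generic illustration and not the algorithm of the cited article; the function name `refine`, the callable `cost` (assumed to return the distance D of a candidate vector) and the limit `max_steps` are hypothetical.

```python
def refine(cost, start, max_steps=32):
    """Iteratively move to the best of the four diamond neighbours until no
    neighbour improves on the current centre. Returns (vector, cost)."""
    best, best_cost = start, cost(start)
    for _ in range(max_steps):
        centre = best
        for ox, oy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            cand = (centre[0] + ox, centre[1] + oy)
            c = cost(cand)
            if c < best_cost:
                best, best_cost = cand, c
        if best == centre:
            # No neighbour is better: local minimum reached.
            break
    return best, best_cost
```

Such a search only finds a local minimum of the distance, which is why the quality of the starting prediction matters.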

A way of enabling a reduction in the calculation complexity is to use a multi-resolution approach. The HME (Hierarchical Motion Estimator) algorithm is an example. A pyramid of images is deduced from the current image. This pyramid of images is composed of several images deduced from the current image, each of said images representing a search level. The level 0 corresponds to the current image at full resolution. A low or medium resolution level is a level other than level 0, this latter corresponding to the highest resolution level of the pyramid of images.

The n+1 level corresponds to the image obtained by low-pass filtering and under-sampling of the n level image. The n+1 level image has therefore a lower resolution than the n level image.
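The construction of such a pyramid can be sketched as follows. The description does not specify the low-pass filter or the sampling ratio; the 2x2 averaging filter and the factor of 2 used here are assumptions, as are the function names.

```python
def downsample(image):
    """Produce the level n+1 image from the level n image: average each 2x2
    neighbourhood (a crude low-pass filter) and keep one sample per group."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * y][2 * x] + image[2 * y][2 * x + 1]
             + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]


def build_pyramid(image, levels):
    """Return [level 0, level 1, ...], level 0 being the full resolution."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid
```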

Initially, a motion field is estimated on the highest level, that is to say on the lowest resolution image. Next, said motion field is improved using the motion field vectors obtained at the higher level as prediction, and so on while descending the levels of the image pyramid until level 0 is reached. For a given block, the motion vectors of neighbouring blocks that have already been calculated are also used as predictions. The estimation is then refined by searching for the best motion vector around the best prediction.

The example of figure 1 illustrates the principle of multi-resolution motion estimation. Three levels are considered. Level 0 corresponds to the image to be analysed and for which the resolution is not reduced. Levels 1 and 2 correspond to the image to be analysed after reduction of the resolution, the resolution of level 2 being lower than that of level 1. The estimation process starts at the highest level, that is to say at level 2 for the example of figure 1. The image is analysed block by block. For a given block 100, one or more predictions are available. It is possible to have several predictions for each block to be analysed, taking account, for example, of the motion of neighbouring blocks or of preceding images, but also of the result of the motion estimation at the higher level. For each prediction, a refinement can be carried out so as to find the candidate 101 best corresponding to the real motion of the block analysed.

A prediction 102 for the block being analysed 106 at level 1 can be the result of the motion estimation carried out for the same block but at the higher level 101. The refinement of the search then leads to a more refined estimation 103. The same principle is then reproduced at the level 0, with one of the predictions 104 corresponding to the result of the estimation at the higher level and a refinement enabling the final result 105 to be obtained. The selection of the best prediction and of the final vector resulting from the refinement mentioned previously is carried out, for example, by calculating and comparing the distance D for each candidate vector.
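The passage of predictions from one level to the next can be sketched as follows. The sketch assumes a resolution ratio of 2 between consecutive levels and the same block size in pixels at every level, so that one block of level n+1 covers 2x2 blocks of level n; neither assumption is stated in the description, and the function names are hypothetical.

```python
def project_field(coarse_field, factor=2):
    """Bring a block-wise vector field from level n+1 down to level n.

    Each coarse block covers factor x factor finer blocks and its vector is
    scaled by the resolution ratio."""
    fine_field = []
    for coarse_row in coarse_field:
        fine_row = []
        for dx, dy in coarse_row:
            fine_row.extend([(dx * factor, dy * factor)] * factor)
        for _ in range(factor):
            fine_field.append(list(fine_row))
    return fine_field


def predictions_for_block(projected, estimated, bx, by):
    """Candidate vectors for block (bx, by): the projection from the higher
    level plus the left and top neighbours already estimated at this level."""
    candidates = [projected[by][bx]]
    if bx > 0:
        candidates.append(estimated[by][bx - 1])
    if by > 0:
        candidates.append(estimated[by - 1][bx])
    return candidates
```

Each candidate returned by `predictions_for_block` would then be evaluated with the distance D, and the best one refined by a local search.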

The result of these calculations per level is a motion field composed of a set of vectors, a vector of said field being associated with a current image block. Even if the HME type multi-resolution approach enables the complexity to be reduced, it remains significant. To further accelerate the calculations, it is possible, in order to improve the local search around a prediction, to implement an algorithm referred to as HDS (Hierarchical Diamond Search). This algorithm carries out a multi-resolution motion estimation while using a refinement step based on a recursive diamond search. The best prediction is refined by local search using a small pattern of several blocks in the form of a diamond or square.

Figure 2 provides an example of implementation of the method according to the invention. The images of the video sequence to be analysed are processed one after another. An image memory 200 contains the pyramid of multi-resolution images associated with the current image as well as the reference image or images to be used for the motion estimation. The current image pyramid 201 as well as the reference image pyramid or pyramids 202 are used to carry out the different estimations described hereafter. In this example, a multi-resolution approach at 5 levels, indexed 0 to 4, is used. A BMA estimation of the motion field is carried out for the low resolution images starting with level 4 (203), then processing level 3 (204), level 2 (205), level 1 (206) and level 0 (208).

The motion field vectors resulting from the estimation at level 1 are used as input in order to estimate the dominant motion parameters 207. In other words, the dominant motion estimation is first calculated for a low resolution motion field, namely at level 1. The dominant motion estimation can be carried out, for example, according to a two-dimensional affine model. In this case, this estimation means estimating for each image block to be analysed the dominant motion parameters a₀, a₁, a₂ and b₀, b₁, b₂ verifying the equation:

v_x = a₀ + a₁·X + a₂·Y
v_y = b₀ + b₁·X + b₂·Y

in which v_x and v_y are the coordinates of a vector V of the motion field and X and Y are the coordinates enabling the block being processed to be located for which a dominant motion estimation is carried out.
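For illustration, a six-parameter two-dimensional model of this kind (commonly called an affine model) can be evaluated as in the sketch below. The form v_x = a0 + a1·X + a2·Y, v_y = b0 + b1·X + b2·Y and the parameter order in the tuple are assumptions of this sketch, and the function name is hypothetical.

```python
def affine_vector(params, x, y):
    """Vector (v_x, v_y) predicted by the dominant motion model at (x, y).

    params is (a0, a1, a2, b0, b1, b2); a0 and b0 are the translation
    terms, the other four describe zoom, rotation and shear."""
    a0, a1, a2, b0, b1, b2 = params
    return (a0 + a1 * x + a2 * y, b0 + b1 * x + b2 * y)
```

A pure translation corresponds to a1 = a2 = b1 = b2 = 0, in which case the same vector (a0, b0) is predicted everywhere. A zoom centred on the origin corresponds to a1 = b2 with all other parameters null.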

The dominant motion parameters are then used to add a new prediction during the motion field estimation for the next resolution level. This prediction is evaluated in the same way as the other predictions available for each block by calculating, for example, the measurement distance D previously explained using the expression (1). The reliability of the motion field estimation is thus improved. The term C of the expression (1) represents the motion vector coding cost, that is to say the distance measured between the motion vector and a cost indicator. The median of motion vectors of neighbouring blocks is usually selected as the cost indicator. The taking into account of the coding cost enables a more homogenous motion field to be obtained.
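Adding the dominant motion prediction to the candidates of a block can be sketched as follows. The sketch assumes a six-parameter model (a0, a1, a2, b0, b1, b2) with v_x = a0 + a1·X + a2·Y and v_y = b0 + b1·X + b2·Y, evaluates it at the block centre and rounds the result to integer pixel units. These choices and the function names are assumptions, not details taken from the description.

```python
def dominant_prediction(params, bx, by, block_size):
    """Vector given by the dominant motion model at the centre of block
    (bx, by), rounded to whole pixels."""
    a0, a1, a2, b0, b1, b2 = params
    x = bx * block_size + block_size / 2.0
    y = by * block_size + block_size / 2.0
    return (int(round(a0 + a1 * x + a2 * y)),
            int(round(b0 + b1 * x + b2 * y)))


def with_dominant_prediction(candidates, params, bx, by, block_size):
    """Return the candidate list extended with the dominant motion
    prediction, unless that vector is already present."""
    pred = dominant_prediction(params, bx, by, block_size)
    if pred in candidates:
        return list(candidates)
    return list(candidates) + [pred]
```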

In the context of the invention, it is possible to use two different cost indicators, said indicator being selected according to the best motion vector prediction: either the prediction from the dominant motion estimation or the median previously described.
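One possible reading of this per-block selection is sketched below: when the best prediction for the block is the one derived from the dominant motion, that vector serves as the cost indicator, and otherwise the median of the neighbouring vectors is used. The component-wise median and the function names are assumptions of this sketch.

```python
def median_vector(neighbours):
    """Component-wise median of the neighbouring motion vectors (the upper
    median is taken when their number is even)."""
    xs = sorted(v[0] for v in neighbours)
    ys = sorted(v[1] for v in neighbours)
    mid = len(neighbours) // 2
    return (xs[mid], ys[mid])


def choose_cost_indicator(best_prediction, dominant_pred, neighbours):
    """Select the cost indicator for one block according to which
    prediction was found to be the best."""
    if best_prediction == dominant_pred:
        return dominant_pred
    return median_vector(neighbours)
```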

The areas that follow the dominant motion are then identified directly, even in the case of homogenous areas.

As an example, the sky is usually a homogenous area. Using a motion estimation algorithm belonging to the prior art, a null motion is in general associated with this area, even in the presence of camera movements. Using the dominant motion estimation, the camera movement is identified and the sky area is constrained to follow this dominant motion, which corresponds best to the real motion. The motion field estimation followed by the calculation of dominant motion parameters at each level leads to a recursive approach with a reasonable calculation complexity.

The dominant motion parameters of level 1 are stored in the memory 211 to be used for the analysis of the next image as prediction 214 for the estimation 206 of the motion field of level 1. The use of dominant motion is rejected 213 for the entire image if the parameters are not reliable according to a reliability criterion estimated with said parameters.

The dominant motion parameters estimated 207 at level 1 are moreover used as prediction for the estimation 208 of the motion field of level 0. The best set of dominant motion parameters is selected for the entire image 212, that is to say the result of the dominant motion estimation carried out at the higher level 207, the memorized dominant motion parameters 210 or no parameter.
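The description does not state how parameters estimated at level 1 are expressed in the coordinates of level 0. Assuming a resolution ratio of 2 between the two levels and a six-parameter model (a0, a1, a2, b0, b1, b2) with v_x = a0 + a1·X + a2·Y and v_y = b0 + b1·X + b2·Y, both the positions and the vectors double from level 1 to level 0, so only the translation terms change. The following sketch, with a hypothetical function name, applies that conversion and ignores any sub-pixel offset between the sampling grids.

```python
def scale_to_finer_level(params, factor=2):
    """Express model parameters estimated at level n+1 in the units of
    level n.

    With X' = factor * X and v' = factor * v:
    v'_x = factor * a0 + a1 * X' + a2 * Y', and likewise for v'_y."""
    a0, a1, a2, b0, b1, b2 = params
    return (factor * a0, a1, a2, factor * b0, b1, b2)
```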

The motion field of level 1 is also used as prediction for the motion field estimation of level 0. An overall estimation 209 of motion parameters is also carried out following the motion field estimation of level 0. To do this, the inputs used are, firstly, the vector field of level 0, secondly, the overall motion parameters estimated 215 based on the motion field of level 1, and finally the overall motion parameters of level 0 estimated during the analysis of the preceding image and stored in the memory 214.

Several results 217 are available following the analysis of an image belonging to a video sequence. It may be decided to have as output the motion field CM resulting from the field estimation carried out on the high resolution image. Moreover, the confidence level TC, as well as the dominant motion parameters MD estimated at level 0, can be presented at output and used for later processing operations. The confidence level TC can be defined, for example, as the proportion of vectors that follow the dominant motion at level 0.
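A confidence level of this kind can be sketched as the fraction of vectors of the field that agree with the dominant motion model. The tolerance of one pixel per component, the six-parameter model (a0, a1, a2, b0, b1, b2) with v_x = a0 + a1·X + a2·Y and v_y = b0 + b1·X + b2·Y, and the function name are assumptions of this sketch.

```python
def confidence_level(field, params, tolerance=1.0):
    """field is a list of (x, y, vx, vy) samples, one per block.

    Returns the fraction of vectors whose two components differ from the
    model prediction by at most `tolerance`."""
    a0, a1, a2, b0, b1, b2 = params
    if not field:
        return 0.0
    inliers = 0
    for x, y, vx, vy in field:
        ex = vx - (a0 + a1 * x + a2 * y)
        ey = vy - (b0 + b1 * x + b2 * y)
        if abs(ex) <= tolerance and abs(ey) <= tolerance:
            inliers += 1
    return inliers / float(len(field))
```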

Figure 3 presents a way of carrying out the dominant motion estimation in the context of the invention. The dominant motion parameters are estimated using a recursive weighted least squares algorithm. The dominant motion estimation model can be adapted according to the level of resolution. Thus for medium and low resolutions, only one translation parameter can be estimated, while for the highest resolutions, an affine model with 6 parameters such as that previously described can be used.
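A weighted least squares estimation of six model parameters, repeated with weights recomputed from the residuals, can be sketched as follows. The weighting function, the number of iterations and all names are assumptions of this sketch, since the description does not specify them. The model assumed is v_x = a0 + a1·X + a2·Y, v_y = b0 + b1·X + b2·Y, and the samples are assumed not to be all aligned, so that the system can be solved.

```python
def solve3(m, v):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial
    pivoting. The inputs are not modified."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [ar - f * ac for ar, ac in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def weighted_fit(samples, weights):
    """samples: list of (x, y, vx, vy). Returns (a0, a1, a2, b0, b1, b2)
    minimizing the weighted sum of squared residuals."""
    n = [[0.0] * 3 for _ in range(3)]
    rx, ry = [0.0] * 3, [0.0] * 3
    for (x, y, vx, vy), w in zip(samples, weights):
        basis = (1.0, x, y)
        for i in range(3):
            rx[i] += w * basis[i] * vx
            ry[i] += w * basis[i] * vy
            for j in range(3):
                n[i][j] += w * basis[i] * basis[j]
    return tuple(solve3(n, rx) + solve3(n, ry))


def robust_fit(samples, iterations=5, scale=1.0):
    """Start from an unweighted fit, then refit with weights that shrink as
    the residual of each vector grows."""
    params = weighted_fit(samples, [1.0] * len(samples))
    for _ in range(iterations):
        a0, a1, a2, b0, b1, b2 = params
        weights = []
        for x, y, vx, vy in samples:
            r = abs(vx - (a0 + a1 * x + a2 * y)) + abs(vy - (b0 + b1 * x + b2 * y))
            weights.append(1.0 / (1.0 + (r / scale) ** 2))
        params = weighted_fit(samples, weights)
    return params
```

Vectors that do not follow the dominant motion, for example those of a moving object, receive small weights after the first iterations and therefore have little influence on the final parameters.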

The purpose of the estimation algorithm of dominant motion parameters is to estimate the values of said parameters by use of the weighted least squares algorithm. Three types of initial parameters can be used to initialise the algorithm. These three types of initialisation are called temporal initialisation, hierarchical initialisation and simple initialisation.

The main input of the dominant motion estimation algorithm is a motion field CM.

The parameters used for the temporal initialisation, noted as initialisation 1, are the dominant motion parameters 214, 216 calculated for an image previously processed and stored 210, 211 in the memory. The parameters used for the hierarchical initialisation, noted as initialisation 2, are the dominant motion parameters calculated for the current image at a lower resolution level 215.

If neither of the initialisations 1 and 2 is reliable, an initialisation is calculated from all the vectors of the vector field CM using a non-weighted simple least squares algorithm 302. If the temporal initialisation parameters are available, an evaluation 300 is made. It is then verified 303 that the result 307 is reliable, that is to say that the number of "inliers" 309, namely vectors that follow the dominant motion, is not less than a threshold value. If the number of inliers is below the threshold, the result is not considered reliable. If the result is reliable, an iteration of the weighted least squares algorithm is calculated 311.

In the case where temporal initialisation leads to a non-reliable result 305, the hierarchical parameters, when these are available from a higher level, are used for the initialisation. An evaluation 301 of parameters is carried out. As described previously, the reliability of the result is verified 304, 308, 310. If the result is reliable, an iteration 311 of the weighted least squares algorithm is then calculated. If the result is not reliable 306, a step 302 using a simple least squares algorithm is carried out using the motion field calculated for the current level. The least squares algorithm is then executed recursively 311, with the initialisation previously described. A coherence indicator TC and the dominant motion parameters MD are presented as results. The coherence of dominant motion parameters is ensured by temporal initialisation. The initialisation by the recursive approach enables cases where the motion is not temporally constant to be overcome, and the number of iterations to be reduced without affecting the final result. The processing is consequently accelerated.
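The order in which the three initialisations are tried can be sketched as follows. The inlier test with a tolerance of one pixel per component, the six-parameter model (a0, a1, a2, b0, b1, b2) with v_x = a0 + a1·X + a2·Y and v_y = b0 + b1·X + b2·Y, and all names are assumptions of this sketch. The callable `simple_fit` stands for the non-weighted least squares step and is hypothetical, and the hierarchical parameters are assumed to be already expressed in the units of the current level.

```python
def count_inliers(field, params, tolerance=1.0):
    """Number of (x, y, vx, vy) samples whose vector agrees with the model
    within `tolerance` on both components."""
    a0, a1, a2, b0, b1, b2 = params
    count = 0
    for x, y, vx, vy in field:
        ex = vx - (a0 + a1 * x + a2 * y)
        ey = vy - (b0 + b1 * x + b2 * y)
        if abs(ex) <= tolerance and abs(ey) <= tolerance:
            count += 1
    return count


def choose_initialisation(field, temporal, hierarchical, min_inliers, simple_fit):
    """Try the temporal parameters first, then the hierarchical ones; fall
    back on a simple fit of the whole field if neither is reliable.
    Unavailable parameters are passed as None."""
    for candidate in (temporal, hierarchical):
        if candidate is not None and count_inliers(field, candidate) >= min_inliers:
            return candidate
    return simple_fit(field)
```

The parameters returned would then be used to start the recursive weighted least squares iterations.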

Claims

1. Method for motion estimation of a video sequence for which the images are divided into blocks of pixels, the motion estimation being carried out by analysis of N versions of a same image corresponding to different resolution levels of the image, said analysis starting at the lowest resolution level and ending at the highest resolution level of the current image, said method being characterized in that a motion field estimation (203, 204, 205, 206, 208) is carried out for the different resolution levels and that dominant motion parameters defining the dominant motion within the image are estimated (207) on at least one low and medium level resolution, said parameters being used as predictions for the motion field estimation of a higher resolution level.

2. Method according to claim 1, characterized in that the dominant motion parameters estimated for a given level are memorized (211) in order to be used as predictions (214) during the motion field estimation of the image or images corresponding to the current image for the same resolution level.

3. Method according to any one of the preceding claims, characterized in that the vectors of a motion field of a given resolution level are used as predictions for the estimation of the motion field of the higher resolution level.

4. Method according to any one of the preceding claims, characterized in that the dominant motion parameters estimated for a given resolution level are memorized (210, 211) in order to be used to initialise (214, 216) the step of estimation of the dominant motion parameters of the image or images corresponding to the current image for the same resolution level.

5. Method according to any one of the preceding claims, characterized in that the dominant motion parameters verify a two-dimensional affine model.

6. Method according to any one of the preceding claims, characterized in that for a block of pixels of a given resolution level of the current image, the best prediction available for the estimation of vectors of the motion field can be selected such that the measurement distance D is minimized, said distance being expressed by an equation of type

D = SAD + λ × C in which:

SAD is the sum of absolute differences between the current block and the reference block,

C is the motion vectors coding cost, that is to say the distance measured between the motion vector and a cost indicator,

λ is a real constant.

7. Method according to claim 6, characterized in that the cost indicator corresponds to the median of motion vectors of neighbouring blocks.

8. Method according to any one of claims 6 or 7, characterized in that the cost indicator corresponds to a prediction corresponding to the dominant motion estimation parameters.

9. Method according to claims 7 and 8, characterized in that the choice between a cost indicator corresponding to the median of motion vectors of neighbouring blocks and a cost indicator corresponding to dominant motion estimation parameters is selected per block according to the best motion vector prediction.

10. Method according to any one of the preceding claims, characterized in that the algorithm carrying out the dominant motion estimation at a given resolution level is initialised (215) by the dominant motion parameters estimated for the current image at a lower resolution level.

11. Method according to any one of the preceding claims, characterized in that a confidence level TC of the motion estimation carried out on the current image is determined by calculating the vector level corresponding to the dominant motion at the highest resolution level.