WO2001008402A2

WO2001008402A2 - Method of block-matching motion estimation with full search in a video sequence and corresponding architecture

Info

Publication number: WO2001008402A2
Application number: PCT/EP2000/003546
Authority: WO
Inventors: Luca Fanucci; Lorenzo Bertini; Pierpaolo Moio; Sergio Saponara
Original assignee: Cnr Consiglio Nazionale Delle Ricerche
Priority date: 1999-04-19
Filing date: 2000-04-19
Publication date: 2001-02-01
Also published as: AU1513601A; IT1309846B1; ITPI990025A1; WO2001008402A3

Abstract

Method of motion estimation in a video sequence by means of block-matching with full search. First, the current video frame (1) of the sequence is divided into a plurality of reference macro-blocks (MB), and each macro-block (MB) is divided into a plurality of sub-blocks (SB). Then a search window (3) is chosen in a video frame (2) previous to the current frame (1), and a SAD (Sum of Absolute Differences) is calculated between the pixels of a first reference sub-block (SB) of the current frame (a) and those of all the sub-blocks (SB) of equal size (b) in the search window (3). Then the minimum value SADmin among all the calculated SADs is determined. By repeating the calculation for each further sub-block (SB), the motion vector (MV) of the macro-block (MB) is computed. An architecture for carrying out this search has: two data loading lines (9x, 9y), for the reference block (a) and the candidate block (b) respectively; a matrix (10a) of Processing Elements (PE) (11) for loading the data of the reference block (a) and comparing them with the data of the candidate block (b); a buffering resource (10b) for adapting the serial input (9y) of the data to their parallel processing (9x, 9y) carried out by the matrix (10a) of PE (11); an accumulator (20) of the partial sums computed by the matrix (10a) of PE (11); a Motion Vector processor (30) for computing the MV of the reference block (a) with respect to the candidate blocks (b).

Description

TITLE

METHOD OF BLOCK-MATCHING MOTION ESTIMATION WITH FULL SEARCH IN A VIDEO SEQUENCE AND LOW COMPLEXITY / HIGH THROUGHPUT ARCHITECTURE

Field of the invention

The present invention relates to the field of video communication, and more precisely to a method for motion estimation in a video sequence by means of a Block-Matching algorithm with Full Search. Furthermore, the invention relates to a low complexity / high throughput programmable architecture that carries out this method.

Description of the prior art

Video communication has many applications, among which are videotelephony and videoconferencing over ISDN, high definition digital TV (HDTV), video systems for remote surveillance and for telemedicine, distance learning and teleworking.

Multimedia systems that use video communication find a basic limit in the high number of bits necessary for representing the video signals, which translates into an excessive load on transmission and storage resources. To overcome this limit, which at present does not allow a satisfactory development of these systems in a consumer market, it is necessary to resort to video signal compression techniques (see Table 1 for the characteristics of the main image formats).

In this context, international committees of the ISO and of the ITU-T have developed different video coding/decoding standards, or codecs:

ISO has developed JPEG for applications with still images, and MPEG in releases 1, 2 and recently 4 for interactive video playback, for entertainment such as video distribution, and for HDTV.

The ITU-T has proposed H.261 and its evolutions H.263, H.263+ and H.26L for videotelephony and videoconferencing applications. These codecs require highly complex hardware, which conflicts with the need to develop low-cost systems, so that it becomes indispensable to resort to dedicated VLSI architectures.

The essential technique developed in the ISO and ITU-T codecs for the compression of video signals is Motion Estimation (ME), which reduces the redundancy of temporal information present in a video sequence, i.e. between two frames of the same sequence.

The basic idea of ME through the block-matching technique (BMA) is to divide the current frame of the video sequence into blocks and, for each block, to search in a suitable search window of the previous frame for the block that is most similar according to a suitable cost function. At present, different block-matching algorithms (BMA) have been developed, based on different search strategies: Full Search, Three Step Search, 2D Logarithmic Search, Conjugate Direction Search, Cross Search, Hierarchical Search and, recently, algorithms of predictive type.

Among these, the Full Search (FS) algorithm, which carries out an exhaustive search in the search window, is the best for obtaining a high quality of the coded image. With reference to figures 1A and 1B, a square block of N x N pixels of the current frame 1, called reference block and indicated as block a, is compared with all the blocks of equal dimension of the previous frame 2, called candidate blocks and indicated as block b, in a search window 3. The search window 3 has dimensions p_h x p_v, where p_h is the number of pixels of its horizontal edge and p_v is the number of pixels of its vertical edge; if p_h = p_v = p and N is the number of pixels of the edge of the square block N x N, the possible positions of the block b in the search window 3 are 4p².

In figure 1B the candidate block N x N is shown in the central position, wherein its upper left point has coordinates (p, p).

According to the full-search (FS) technique, the matching algorithm consists in computing the SAD (Sum of Absolute Differences) between the blocks a and b, defined as follows. If

- a(i,j) is a pixel of reference block a,

- b(i + n, j + m) is a pixel of candidate block b, where the indexes m and n indicate the differential position of candidate block b in the search window 3, i.e. the coordinates of a motion vector MV,

then

$$\mathrm{SAD}(n,m) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| a(i,j) - b(i+n,\, j+m) \right| \qquad (1)$$

wherein -p_h ≤ m ≤ p_h - 1 and -p_v ≤ n ≤ p_v - 1. Usually p_h = p_v = p.

The calculus is repeated for all the 4p² possible positions of candidate block b in search window 3. The coordinates of block b corresponding to the minimum value of the cost function are used for the prediction:

$$MV = (m, n)\big|_{\mathrm{SAD}_{min}}, \quad \text{where} \quad \mathrm{SAD}_{min} = \min_{(n,m)} \left[ \mathrm{SAD}(n,m) \right] \qquad (2)$$
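Formulas (1) and (2) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: frames are plain lists of rows, `i` and `n` index rows while `j` and `m` index columns, every candidate block is assumed to lie inside the previous frame, and the names `sad` and `full_search` are hypothetical.

```python
def sad(ref, prev, top, left, n, m):
    """SAD(n, m) of formula (1).

    ref is the N x N reference block a, prev is the previous frame and
    (top, left) is the position of the reference block in the frame
    (an assumption of this sketch).
    """
    size = len(ref)
    return sum(abs(ref[i][j] - prev[top + i + n][left + j + m])
               for i in range(size) for j in range(size))


def full_search(ref, prev, top, left, p):
    """Scan all 4p^2 displacements and return (SAD_min, (m, n)) as in (2)."""
    best = None
    for n in range(-p, p):            # -p <= n <= p - 1
        for m in range(-p, p):        # -p <= m <= p - 1
            cost = sad(ref, prev, top, left, n, m)
            if best is None or cost < best[0]:   # first minimum wins on ties
                best = (cost, (m, n))
    return best
```

On ties this sketch simply keeps the first minimum it meets; the preference for the minimum-norm vector described later in the text is not modelled here.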

This exhaustive approach is characterised by a high computational complexity. For example, for the 4CIF video format (at 30 frame/s with N = 16 and p = 16, cases of practical interest in accordance with the ITU-T standards), a computational power of more than 12x10⁹ absolute-difference operations per second is necessary.
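The order of magnitude quoted above can be checked with a short calculation. The sketch below assumes a 4CIF frame of 704 x 576 pixels, N = 16, p = 16 and 30 frame/s, and counts one absolute difference per pixel of every candidate block; the function name is hypothetical.

```python
def ad_operations_per_second(width, height, fps, N, p):
    """Absolute-difference operations per second for a full search."""
    macroblocks = (width // N) * (height // N)   # reference blocks per frame
    per_macroblock = (2 * p) ** 2 * N * N        # 4p^2 candidates, N^2 pixels each
    return macroblocks * per_macroblock * fps


ops = ad_operations_per_second(704, 576, 30, 16, 16)
print(ops)  # 12457082880, i.e. about 12.5 x 10^9
```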

In addition to the full search (FS), the other cited algorithms have also been studied in order to reduce this computational complexity, at the price, however, of a lower quality of the coded image with respect to the case of the FS.

By virtue of the regularity of the FS algorithm and of the high data flux required, the architecture that is suitable for a VLSI implementation is systolic, with a pipelined organisation of the data flux. According to this architecture, the data of search window 3 and of the reference block are loaded into a modular computational structure and pass through delay lines made of registers, whose main purpose is to ensure the correct timing of the data.

For a plausible application of the FS to standards such as QCIF, CIF and 4CIF, architectures are known that are, however, still very complex with reference to the dimension of the search window, such as:

a) Hya Nam and Moon Key Lee, "High-Throughput B-M VLSI Architecture with Low Memory Bandwidth", IEEE Trans. on Circuits and Systems, vol. 45, n. 4, pp. 508-512, Apr. 1998.

b) Luc De Vos and Michael Stegherr, "Parametrizable VLSI Architectures for Full-Search Block-Matching Algorithm", IEEE Trans. on Circuits and Systems, vol. 36, n. 10, pp. 1309-1316, Oct. 1989.

c) Chaur-Heh Hsieh and Ting-Pang Lin, "VLSI Architecture for Block-Matching Motion Estimation Algorithm", IEEE Trans. on Circuits and Systems for Video Technology, vol. 2, n. 2, pp. 169-175, June 1992.

In a) and b) the following aspects are present:

- the high recurrence of the generic register for propagation of the data, which more than anything else contributes to the complexity of the structure;

- the high number of lines for handling the data, which determines a further increase of the architecture complexity;

- the complex organisation of the data flux, which has repercussions on the costs of the hardware resources necessary for its handling.

In the architecture according to a), in particular, three data loading lines are used, one for reference block a and two for the data of search window 3, for a total of 4·[(2p - 1)(N - 1) + N] + 4N² registers, where N is the characteristic dimension of the block and p is the maximum displacement in the search window. The use of the search window registers then requires 2·[(2p - 1)(N - 1) + N] + N² multiplexers (MUX) and the relative control logic.

In the architecture described in b), according to a quadratic array solution, 2N·(2p - 1) + 7N² registers, 2N·(2p - 1) + N² 3-way MUX and N² computing elements are used, combined in a network of very complex connections. With a linear array solution, instead, the complexity of the structure is reduced through a phase of hardware multiplexing, which however can handle the typical video streams (30 frame/s in CIF standard with p = 16 and N = 16) only at very high working frequencies, with all the related power consumption drawbacks. In particular, the architecture according to b) is strongly limited in handling different standards and frame formats. Part of the problems of a) and b) are overcome in the architecture described in c), wherein however, since the parallel processing of NxN macro-blocks is carried out by means of a matrix of NxN Elementary Processors, the circuit complexity in the cases of practical interest (N=16, p=16) is still very high.

For the above reasons, the architectures according to the state of the art are not very effective for a consumer market .

Summary of the invention

It is an object of the present invention to provide, in a video communication system, a method of Block-Matching motion estimation in a video sequence with Full Search (hereinafter FS-BMA), according to video coding standards such as, for example, H.263 and MPEG-4, wherein the data flux is new and effective, with a substantial reduction of the complexity of the architecture that implements this method and a high throughput/area efficiency.

It is another object of the present invention to provide an architecture for carrying out this method that allows a not complex data flux and memory layout in the source coder or codec made according to the international standards .

It is a particular object of the present invention to provide such an architecture that allows the implementation of additional features such as:

- the Advanced Prediction mode (AP) provided in the international standards (H.263, MPEG-4),

- the choice of the MV (Motion Vector) with minimum norm,

- the preference for the block in central position, to which corresponds a MV of null coordinates,

- the dynamical programming of the search window, and hence the possibility of implementing also the half-pixel search,

- the dynamical programming of the search window as a technique for reducing the power consumed,

- the hardware parametrisation with respect to the parameters N and p introduced above.

The above objects are achieved by the method according to the present invention, whose characteristic is that the FS-BMA on a macro-block is carried out starting from the FS-BMA relative to the respective sub-blocks. Preferably, the following steps are provided:

- in a video sequence, division of the current video frame into a plurality of macro-blocks;

- partition of every macro-block NxN into a plurality of HxH sub-blocks of dimensions N/HxN/H, H being a parameter;

- for each macro-block, choosing a search window in a video frame previous to the current frame;

- calculus of a SAD between the pixels of a first reference sub-block of the current frame and all the sub-blocks of equal dimension present in the search window;

- computing the SADmin among all the calculated SADs and calculus of the MV of the first sub-block on the basis of the SADmin;

- repeating the calculus of the SADmin and of the MV for each sub-block into which said macro-block is divided;

- computing the MV of the macro-block starting from the computations carried out on the respective sub-blocks;

- repeating the calculus of the MV for other macro-blocks.

Advantageously, every macro-block has square dimension NxN and its sub-blocks are four and have square dimension N/2xN/2 (H=2) as well.

Advantageously, each macro-block has square dimension NxN and its sub-blocks are 16 and have square dimension N/4xN/4 (H=4).

The above objects are achieved also by an architecture for carrying out a block-matching with full search, wherein it is necessary to determine the motion vector of a reference block present in the current frame of a video sequence with respect to a block present in a search window of the frame previous to the current frame, whose characteristic is that it comprises:

- two data loading lines, respectively for the reference block and for the candidate block;

- a matrix of Processing Elements (PE) for loading the data of the reference block and comparing them with the data of the candidate block;

- a buffering resource for adapting the serial input of the data to their parallel processing carried out by the matrix of Processing Elements;

- an accumulator of the partial sums computed by the matrix of PE;

- a Motion Vector processor for calculus of the Motion Vector of said reference blocks with respect to said candidate block.

Advantageously, in the case of H=2, the reference block has dimension N/2xN/2 and the Motion Vector processor comprises two Minimum Distortion Detection modules with a storage resource, such that one computes the Motion Vector of the N/2xN/2 blocks and the other computes, for every 4 blocks N/2xN/2, also the MV of the NxN block formed by them.

Advantageously also, in the case of H=4, the reference block has dimension N/4xN/4 and the Motion Vector processor comprises two Minimum Distortion Detection modules with a storage resource, such that one computes, for every 4 blocks N/4xN/4, the Motion Vector of the N/2xN/2 blocks formed by them, and the other computes, for every 4 blocks N/2xN/2, i.e. for every 16 blocks N/4xN/4, also the MV of the NxN block formed by them.

The architecture according to the invention, having only two data loading lines, together with the organisation of the loaded data, reduces the number of registers needed to

(N/H + 2p - 2)(N/H - 1) + N/H + (N/H)².

Since blocks NxN can be handled at the same time as blocks N/HxN/H, working on the latter, the need of computing elements is reduced by a factor H² with respect to the state of the art; these elements are fewer than the registers but individually more complex. Moreover, there is no need for the circuitry used in the prior art according to a) and b) for handling the data flux, i.e. multiplexers (MUX) and the relative control logic.
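As a quick numerical illustration of the register count given above, the following sketch evaluates (N/H + 2p - 2)(N/H - 1) + N/H + (N/H)² and the (N/H)² processing elements of the array. The function names are hypothetical and N is assumed to be a multiple of H.

```python
def registers_needed(N, p, H):
    """Register count (N/H + 2p - 2)(N/H - 1) + N/H + (N/H)^2."""
    n = N // H                       # edge of a sub-block
    return (n + 2 * p - 2) * (n - 1) + n + n * n


def processing_elements(N, H):
    """One processing element per pixel of an N/H x N/H sub-block."""
    return (N // H) ** 2


print(registers_needed(8, 4, 2), processing_elements(8, 2))     # 50 16
print(registers_needed(16, 16, 2), processing_elements(16, 2))  # 338 64
```

For N = 16 and H = 2 the array has 64 elements instead of the 256 of a full NxN matrix, which is the factor H² mentioned in the text.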

Brief description of the drawings

Further characteristics and advantages of the method and of the architecture according to the present invention will be made clearer with the following description of a particular embodiment thereof (H=2), exemplifying but not limitative, with reference to the attached drawings wherein:

- figures 1A and 1B show the general principle of calculus of the MV of a reference block N x N in a search window of dimensions (2p + N - 1)²;

- figure 1C shows the division according to the invention of a Macro-Block (MB) NxN into four Sub-Blocks (SB) N/2xN/2;

- figure 2 shows the block diagram of the operation of the method according to the invention;

- figure 3 details the global structure of the architecture according to the invention and the connections between the different modules for the case of N=8 (then N/H=4), p = 4;

- figure 4 shows a diagrammatical functional view of the architecture of figure 3;

- figure 5 shows the organisation of the snake of figure 4, consisting of the matrix of PE and SR, for the case of N=8 (then N/H=4), p = 4 and M = 9;

- figure 6 shows the inner structure of a Processing Element (PE) of figures 4 and 5;

- figure 7 details the structure of the AD processor module included in the PE of figure 6;

- figure 8 shows the diagrammatical circuit view of the Adder Tree module of figure 4 in the case of N=16 (then N/H=8), M = 9;

- figure 9 details the structure of the double adder module in the Adder Tree of figure 8;

- figure 10 shows the general structure of the Motion Vector processor (MVP) of figure 4;

- figure 11 details the structure of the mdd_spo module in the MVP of figure 10;

- figure 12 details the structure of the mdd module in the mdd_spo module of figure 11;

- figure 13 details the structure of the modmin module in the mdd module of figure 12;

- figure 14 shows the circuit solution which, applied to the buffering resource, i.e. the matrix of Shift Registers SR of figures 3, 4 and 5, allows the dynamical programming of parameter p;

- figure 15 shows a diagrammatical view of a source coder or codec H.263/MPEG that uses the architecture of figures 3 and 4 as a module of Motion Estimation or ME;

- figures 16 and 17 show the search area pixel mapping for the example case N/2=3, p=3;

- figures 18 and 19 show a flux graph for the architecture according to the invention for N=4.

Description of the preferred embodiments

Method of motion estimation

As indicated in figure 1C, in the method according to the invention, for carrying out a FS-BMA on a macro-block MB NxN, its four sub-blocks SB N/2xN/2 are considered (by way of example and not of limitation, with H=2). Starting from this partition, the control of a MB, i.e. the calculus of the minimum SAD and of the corresponding MV, is carried out starting from the results of the FS-BMA obtained for the four SBs.

This way, if at the same time both the MV relative to blocks SBs N/2xN/2 and the MV relative to blocks MB NxN, are computed, the computing resources necessary for the video standards are reduced four times.

More precisely, the four SBs of every MB are processed in turn by the control structure described hereinafter, and all the relative 4p² Sums of Absolute Differences (SAD) are suitably stored in a SAD memory.

With reference to figure 2, which represents a diagrammatical view of the operation of the method, the SAD memory is indicated with the numeral 4 and has a dimension of 4p² words, p being the dimension of the search window (figure 1B).

Memory 4, which is a Dual Port RAM, progressively loads the SAD_k(i,j) (i = 0...2p - 1, j = 0...2p - 1) relative to the k-th sub-block SB (k = 1...4) coming from line 5. The global structure of the flux diagram of the architecture is given in Figures 18 and 19 for N=4.

More precisely, as indicated in Figure 2, the SAD_k(i,j) on line 5 are summed by adder 6 to the values

$$\sum_{r=1}^{k-1} \mathrm{SAD}_r(i,j)$$

coming as output from memory 4 on line 6a, and the result of the sum is loaded in memory 4 through line 6b.

This process implements the following formula, which defines the SAD(i,j) relative to the block NxN as a function of the SAD_k(i,j) relative to the single N/2xN/2 SBs:

$$\mathrm{SAD}(i,j) = \sum_{k=1}^{4} \mathrm{SAD}_k(i,j)$$

The SADs relative to the SBs N/2xN/2 are provided through line 7 for evaluating the SADmin and the relative MV, according to formula (2) indicated above.

After having processed the first three SBs and with the output of the SAD relative to the fourth SB, line 8 provides the SAD relative to the NxN MB that allow computing its MV, always according to formula (2) .
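The accumulation of sub-block SADs into the macro-block SAD can be sketched in software as follows, for H=2. This is only a behavioural illustration, not the hardware data flux: a dictionary keyed by the displacement (n, m) stands in for SAD memory 4, the `+=` stands in for adder 6, frames are plain lists of rows, all candidate blocks are assumed to lie inside the previous frame, and the function names are hypothetical.

```python
def sad_map(ref, prev, top, left, p):
    """All 4p^2 SAD values of one block, keyed by the displacement (n, m)."""
    size = len(ref)
    return {(n, m): sum(abs(ref[i][j] - prev[top + i + n][left + j + m])
                        for i in range(size) for j in range(size))
            for n in range(-p, p) for m in range(-p, p)}


def argmin_mv(sads):
    """Return ((m, n), SAD_min); the first minimum wins on ties."""
    n, m = min(sads, key=sads.get)
    return (m, n), sads[(n, m)]


def macroblock_from_subblocks(cur, prev, top, left, N, p):
    """MVs of the four N/2 x N/2 SBs and of the N x N MB they form."""
    h = N // 2
    memory = {}                      # stands in for SAD memory 4
    sb_results = []
    for dy in (0, h):
        for dx in (0, h):
            sb = [row[left + dx:left + dx + h]
                  for row in cur[top + dy:top + dy + h]]
            sads = sad_map(sb, prev, top + dy, left + dx, p)
            sb_results.append(argmin_mv(sads))
            for key, value in sads.items():
                memory[key] = memory.get(key, 0) + value   # adder 6
    return sb_results, argmin_mv(memory), memory
```

After the fourth sub-block, `memory` holds for every displacement the same value that a direct NxN SAD would give, so the MB vector comes at the cost of one extra accumulation per candidate position.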

Structure of the architecture

The novel layout of the data flux above described is mapped in the architecture of figure 3, indicated as a ME module 100, which comprises:

- an array 10, called snake, shown in detail in figures 4-7, in which the partial sums are computed on the data input through lines 9x and 9y, respectively of reference block a and of candidate block b;

- an Adder Tree 20, detailed in figures 8-9, which receives the partial sums from snake 10 through lines 13;

- a MV processor 30, detailed in figures 10-13, which receives through line 5 the SAD calculated by Adder Tree 20;

- a control unit 40 and counters 50.

Figures 4 and 5 show the systolic architecture of the system of figure 3, whose core is, in snake 10, a two-dimensional array 10a of Processing Elements or PE 11 linked to one another by means of the first input line 9x.

PE 11 are arranged in four columns 11a, 11b, 11c, 11d and in four rows 11' (for N/H=4). A second input line, indicated with the numeral 9y, crosses the PE 11 of each column 11a, 11b, 11c, 11d and, between each column, crosses respectively columns 14a, 14b and 14c of an array 10b of Shift Registers (SR) 14, which represents the buffering resource of the system. A clock line 12 provides the clock signal to elements 11 and 14 of matrixes 10a and 10b.

Arrays 10a and 10b, being pipeline connected by lines 9x and 9y, form a snake structure indicated with 10 in figure 3 and in figures 4 and 5.

The single element PE 11 of figures 4 and 5 has the general structure shown in figure 6 and substantially consists of a unit 110 for computing an absolute difference with carry, and of registers 111, 112 and 113 necessary respectively for the propagation of the data of the search window, of the SBs and of the partial SADs. In PE 11 a maximum threshold value of the SAD is set, beyond which, during the partial calculus, there is no need to make further increments. This is obtained by limiting to a reasonable value the number M of bits of the AD processor 110 that carries the SAD. Reaching this threshold indicates to the codec that comprises ME module 100 the opportunity to carry out an intraframe coding of the corresponding MB (instead of a coding by means of the MV). This parameter M, the maximum number of bits of the SAD, is one of the hardware configuration parameters of the architecture.

Unit 110 of PE 11 of figure 6 is an Absolute Difference (AD) processor and is shown in figure 7. AD processor module 110 signals, through the preset_out output 115, whether the maximum value representable on M bits has been reached, propagating this condition through the preset of the downstream register.
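The effect of limiting the SAD to M bits can be pictured with a small behavioural sketch. It assumes that the threshold is the largest value representable on M bits, 2^M - 1, and that once this value is reached the partial sum stays clamped while a flag (playing the role of preset_out) is raised; the function names are hypothetical and the actual circuit of figure 7 may differ.

```python
def saturating_add(partial, increment, M):
    """Add with clamping at the largest M-bit value; also report saturation."""
    limit = (1 << M) - 1
    total = partial + increment
    saturated = total >= limit
    return (limit if saturated else total), saturated


def row_partial_sum(ref_row, cand_row, M):
    """Accumulate |a - b| along one row of PEs, stopping growth at the limit."""
    psum, flag = 0, False
    for a, b in zip(ref_row, cand_row):
        psum, hit = saturating_add(psum, abs(a - b), M)
        flag = flag or hit           # once raised, the flag propagates downstream
    return psum, flag


print(row_partial_sum([255, 255, 255], [0, 0, 0], 9))  # (511, True)
print(row_partial_sum([10, 20], [12, 15], 9))          # (7, False)
```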

The same procedure has been used in the module 20 of Adder Tree of figures 3 and 4, shown in more detail in figure 8. Adder Tree 20 stores the partial sums coming from rows 11' of matrix 10a of PE 11. In figure 9 is shown one of the modules double adder 21 of figure 8, which comprises two adders 201 and 202.

The Adder Tree 20 output value of the SAD(n, m) under formula (1) described above is on line 5. This way a parallel processing is obtained by means of matrix 10a of PE but with a serial sequence of the data flux.

In order that the candidate blocks are loaded correctly in PE matrix 10a, which feeds Adder Tree 20, a buffering resource is necessary; in the architecture according to the invention it is embodied by matrix 10b of Shift Registers (SR) 14 of figures 4 and 5. Such SR matrix requires only an elementary functionality of flip-flop (D-FF) type.

Again with reference to figure 4, for every possible pair of coordinates (m,n) in the search window, the Motion Vector processor (MVP) 30 checks whether the SAD(n,m) provided on line 5 by Adder Tree 20 is less than the minimum of the previous ones, which is stored in a corresponding register in module 60 of Minimum Distortion Detection (MDD). In the affirmative, MVP 30 updates this register with the new value. At the end of the control step the registers in MVP 30 contain the minimum SAD and the coordinates (m,n) of the respective MV. Figure 10 shows the structure of the MVP 30 of figure 4. Counter cnt_in 301, suitably synchronised, scans, column after column, all the possible comparison positions contained in the search window of the SBs (0 ≤ cnt_in ≤ 4p² - 1). Similarly, counters cnt_in_r 302 and cnt_in_c 303 indicate, respectively, the row number and the column number of the candidate position, under the condition 0 ≤ cnt_in_r, cnt_in_c ≤ 2p - 1.

A first module mdd_spo 304 (detailed in figure 11) receives, from the sad_in 5 input module, the organized succession of the SAD of the SBs along with the above position values, providing the SAD minimum value and the relative MV.

Preferably, the static position, i.e. that with MV of null coordinates, in accordance with what is provided for by the standard, is favoured through the possibility of decreasing the relative SAD by a fixed value (an input parameter of the architecture) that can be assigned through input sad_sb_in 305. The values provided by counters cnt_in_r 302 and cnt_in_c 303 are used as MV coordinates, and are useful to discriminate, among all the positions for which the minimum SAD is obtained, the one nearest to the static position, given by cnt_in_r = cnt_in_c = p. This functionality is obtained (see figures 12 and 13) through the modmin module 61 that is present in the Minimum Distortion Detection (MDD) module 60.
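The selection rule described above can be sketched as follows. The sketch makes several assumptions that the text does not fix: the distance from the static position is measured as |r - p| + |c - p|, the biased SAD of the static position is clamped at zero, and the biased value is the one returned. The name `minimum_distortion` and the parameter `static_bias` are hypothetical.

```python
def minimum_distortion(sads, p, static_bias=0):
    """Pick the minimum SAD over a 2p x 2p grid of candidate positions.

    sads[r][c] is the SAD at row r and column c, 0 <= r, c <= 2p - 1.
    The static position r = c = p is favoured by subtracting static_bias,
    and ties are resolved in favour of the position nearest to it.
    Returns (SAD_min, (m, n)) with m = c - p and n = r - p.
    """
    best = None
    for c in range(2 * p):           # scanned column after column
        for r in range(2 * p):
            value = sads[r][c]
            if r == p and c == p:
                value = max(0, value - static_bias)
            dist = abs(r - p) + abs(c - p)      # assumed norm of the MV
            if best is None or (value, dist) < (best[0], best[1]):
                best = (value, dist, r, c)
    value, _, r, c = best
    return value, (c - p, r - p)
```

For instance, on a grid where every position has the same SAD the function returns the null vector, and with `static_bias=3` a static position whose SAD is 12 is preferred over another position whose SAD is 10.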

The generic SAD for the MB is obtained by the sum of the relative four SADs of the SBs. For achieving this object, in MVP 30 (figure 10) a Dual Port RAM memory 4 has been provided, capable of storing the partial SAD calculus for each of the 4p² search window positions.

As also indicated in figure 10, memory 4 is scanned sequentially through port b 307 by counter cnt_in 301 for picking up the partial SADs of MB (sad_stored) and for summing them by means of adder 6 to the SAD of the current SBs providing for a maximum calculus threshold. In the following cycle the result of the sum is then loaded in memory 4 at the same location, piloting the write addresses, port b 307, through the value of cnt_in 301 suitably delayed.

Since, for the first of the four SB, memory 4 does not contain significant data, mask 308 ( and_m) has been provided capable of zeroing, at adder 6, the value relative to the partial SAD.

At the end of the control step of the third SB, the data written in memory 4 are no longer significant, since the output of adder 6 directly provides the SAD of the MB. This output is therefore sent to a second module mdd_spo 309 which, like the former 304, supplies the minimum SAD and the relative MV of the MB.

Organisation of the data flux

In this section a more detailed description of the ALPHA-B input data flow and relevant SAD processing for a N/2 x N/2 reference SB and its corresponding search area is provided (an exemplifying, non-limiting case with H=2). We refer to Fig. 16, where the search area and candidate block loading (via the y line) are shown for the example case N/2 = 3 and p = 3. In particular, Fig. 17 shows the status of the SR and PE internal registers during that operation. A shaded PE or SR means that the search area and reference block pixels are correctly aligned, thus providing useful results to the AD, while the others are not.

The array operation is divided into a preload phase (which is necessary to properly align the reference block data with the relevant search area data) and a continuous processing phase. During the preload phase the PE array is loaded via the x line of Fig. 4 with the N²/4 pixels of the reference block, while the PE and SR matrixes are loaded via the y line of Fig. 4 with the first N²/4 + (N/2 - 1)(2p - 2) pixels of the relevant search area. Both the reference block and the search area are scanned in the typical row-column way. The duration of this preload phase is N²/4 + (N/2 - 1)(2p - 2) clock cycles, after which the array is ready for BM operation (with reference to the given example, the architecture status relevant to candidate blocks is shown in Fig. 17).

At the end of the preload phase the generic PE(i,0) elements (1st column of PE) compute the AD |a(i,0) - b(i - p, 0 - p)| (with i = 0, 1, ..., N/2 - 1) related to the evaluation of SAD(-p,-p), while all the other columns are idle (see Fig. 17). At the next clock cycle the PE(i,1) elements (2nd column) compute the value psum(i,1) = psum(i,0) + |a(i,1) - b(i - p, 1 - p)|, where psum(i,0) is the AD of the previous column related to SAD(-p,-p), while the following columns (j = 2, ..., N/2 - 1) are idle. Note that the presence of the shift registers in the y line has allowed the proper values of the b pixels to be present at that clock cycle in the 2nd column of the array. It is also important to underline that during this clock cycle the PE(i,0) elements (1st column) are not idle, but are calculating the AD |a(i,0) - b(i - p + 1, 0 - p)| related to the evaluation of SAD(-p + 1,-p) (see Fig. 18). So, after N/2 clock cycles from the end of the preload phase, the PE(i,N/2-1) elements (last column) provide to the Adder Tree the N/2 partial sums

$$\sum_{j=0}^{N/2-1} \left| a(i,j) - b(i-p,\, j-p) \right|$$

(with i = 0, 1, ..., N/2 - 1) related to SAD(-p,-p) (see Fig. 17-d). The Adder Tree performs the addition of these partial sums, yielding

$$\mathrm{SAD}(-p,-p) = \sum_{i=0}^{N/2-1} \sum_{j=0}^{N/2-1} \left| a(i,j) - b(i-p,\, j-p) \right|$$

Then, after 2p cycles, all the SAD(n,-p) (with -p ≤ n ≤ p - 1) are ready (see Fig. 17-i). It has to be considered that, before starting the processing of the next column of SAD(n,-p + 1) (see Fig. 17-l), N/2 - 1 idle clock cycles are necessary to skip the partial sums relative to non-valid candidate blocks (see Figs. 17-h, 17-i). However, it is worth noting that the array is continuously filled with new data of the search area, independently of the inner array operation, thus simplifying the ALPHA-B interface with the coder frame memory, as will be detailed later. All the aforesaid processing steps have to be performed 2p times to cover the whole search area, before starting the BM computation for the following reference block. In particular, the first pixel of the i-th SB search area (i.e. the start of the preload phase for the i-th SB) is input to the PE matrix just one clock cycle after the last pixel of the (i-1)-th SB search area.
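The cycle counts of this data flux can be tabulated with a few lines of Python. The sketch assumes H=2, one search-area pixel entering the array per clock cycle, and a sub-block search area of (2p + N/2 - 1)² pixels, so that four consecutive sub-blocks take four times that number of cycles; the function names are hypothetical.

```python
def preload_cycles(N, p):
    """Preload duration N^2/4 + (N/2 - 1)(2p - 2), in clock cycles."""
    h = N // 2
    return h * h + (h - 1) * (2 * p - 2)


def search_area_pixels(N, p):
    """Pixels of the search area of one N/2 x N/2 sub-block."""
    h = N // 2
    return (2 * p + h - 1) ** 2


def cycles_per_macroblock(N, p):
    """Four sub-blocks fed back to back, one pixel per clock cycle."""
    return 4 * search_area_pixels(N, p)


# Example of the text: N/2 = 3, p = 3
print(preload_cycles(6, 3), search_area_pixels(6, 3), cycles_per_macroblock(6, 3))
# 17 64 256
```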

Obviously, according to (3) and (4), the hardware structure sketched in Fig. 2 and the relevant flux graph (FG) of Figs. 18 and 19 allow for the concurrent processing of the NxN MBs from their corresponding, aforesaid, N/2xN/2 SBs.

Summarizing, the proposed architecture is characterized by a continuous input data flow with an overall throughput of 1/T_a, where T_a is the time required to process the candidate and reference pixels relevant to one N x N MB. T_a amounts to 4(2p + N/2 - 1)² T_clock, (2p + N/2 - 1)² being the number of pixels relevant to a search area for a

NOT FURNISHED UPON FILING

NOT FURNISHED UPON FILING

Claims

1. Method of block-matching motion estimation with full search in a video sequence, characterised in that said block-matching with full search on a macro-block (MB) is carried out starting from the block-matching with full search relative to a plurality of sub-blocks (SB) of the former.

2. Method of motion estimation according to claim 1, comprising the steps of:

- in a video sequence, dividing the current video frame (1), which forms part of said sequence, into a plurality of reference macro-blocks (MB);

- partitioning each macro-block (MB) into a plurality of sub-blocks (SB);

- for each macro-block, choosing a search window (3) in a video frame (2) computed previously with respect to the current frame (1);

- calculating a Sum of Absolute Differences (SAD) between the pixels of a first reference sub-block (SB) of the current frame (a) and those of all the sub-blocks (SB) of equal dimension (b) present in the search window (3);

- computing the SADmin among all the calculated SADs and calculating the motion vector (MV) of the first sub-block (SB) on the basis of said SADmin;

- repeating the calculation of the SADmin and of the motion vector (MV) for each further sub-block (SB) into which said macro-block (MB) is divided;

- computing the MV of the macro-block (MB) starting from the calculations carried out for the respective sub-blocks (SB);

- repeating the calculation of the MV for the other macro-blocks and sub-blocks.
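
The steps of claim 2 can be sketched in software as follows. This is a minimal illustration, not the claimed hardware: it assumes four N/2xN/2 sub-blocks per macro-block, displacements in the range [-p, p-1], frames stored as lists of rows, and a macro-block SAD obtained by adding the four sub-block SADs computed at the same displacement. The function names `sad` and `full_search` are hypothetical.

```python
def sad(cur, ref, cy, cx, ry, rx, size):
    """Sum of Absolute Differences between two size x size blocks."""
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(size) for j in range(size))


def full_search(cur, ref, y0, x0, n, p):
    """Full search for the N x N MB whose top-left corner is (y0, x0).

    Returns (mb_mv, sb_mvs): the MV (dy, dx) of the macro-block and the
    list of the MVs of its four N/2 x N/2 sub-blocks.
    """
    h = n // 2
    sb_origins = [(y0, x0), (y0, x0 + h), (y0 + h, x0), (y0 + h, x0 + h)]
    best_sb = [(None, None)] * 4   # (SADmin, MV) for each sub-block
    best_mb = (None, None)         # (SADmin, MV) for the macro-block
    rows, cols = len(ref), len(ref[0])
    for dy in range(-p, p):
        for dx in range(-p, p):
            # Skip candidates that fall outside the previous frame.
            if not (0 <= y0 + dy and y0 + dy + n <= rows
                    and 0 <= x0 + dx and x0 + dx + n <= cols):
                continue
            sads = [sad(cur, ref, sy, sx, sy + dy, sx + dx, h)
                    for sy, sx in sb_origins]
            for k, s in enumerate(sads):
                if best_sb[k][0] is None or s < best_sb[k][0]:
                    best_sb[k] = (s, (dy, dx))
            # Assumption: the MB SAD is the sum of its four SB SADs.
            total = sum(sads)
            if best_mb[0] is None or total < best_mb[0]:
                best_mb = (total, (dy, dx))
    return best_mb[1], [mv for _, mv in best_sb]
```

In this sketch the sub-block SADs are computed once per displacement and reused for the macro-block, so no pixel difference is evaluated twice.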

3. Method of motion estimation according to claims 1 or 2, wherein said macro-block has square dimension NxN and its sub-blocks (SB) are HxH in number, each having square dimension N/HxN/H.

4. Method according to the previous claims, wherein the central position of said search window (3) corresponds to the MV of null coordinates.

5. Architecture for carrying out a block-matching with full search, wherein the motion vector (MV) of a reference block (a) present in the current frame (1) of a video sequence is determined with respect to a block (b) present in a search window (3) of the frame (2) computed previously to the current frame (1), characterised in that it comprises:

- two respective loading lines (9x, 9y) for the data of the reference block (a) and of the candidate block (b);

- a matrix (10a) of Processor Elements (11) for loading the data of said reference block (a) and comparing them with the data of said candidate block (b);

- a buffering resource (10b) for adapting the serial input (9y) of the data to their parallel processing (9x, 9y) carried out by the matrix (10a) of the Processor Elements (11);

- an accumulator (20) of the partial sums computed by the matrix (10a) of the PE (11);

- a Motion Vector processor (30) for computing the Motion Vector (MV) of said reference block (a) with respect to said candidate blocks (b) .

6. Architecture according to claim 5, wherein said reference block (a) has dimension N/HxN/H (with H > 1) and said Motion Vector processor (30) comprises two Minimum Distortion Detection modules (60) with storage resources, one of which calculates the Motion Vector (MV) of the N/2xN/2 blocks while the other, for every 4 N/2xN/2 blocks, also calculates the MV of the NxN block that they constitute.

7. Architecture according to claim 5 or 6, wherein said search window has dimension p and said buffering resource (10b) comprises means for dynamically programming the value of said dimension p within a range [1, pmax].

8. Architecture according to claim 7, wherein said buffering resource (10b) is implemented by means of a chain of Shift Registers (14).

9. Architecture according to claim 8, wherein said means for dynamically programming parameter p comprise (N/H-1) Multiplexers (18) and means for controlling said Multiplexers, suitable for modifying the useful length of the (N/H-1) chains of SR (14) of said buffering resource (10b) as seen by said matrix (10a) of PE (11).

10. Architecture according to claim 5 or 6, wherein said buffering resource (10b) has a structure based on suitably controlled RAM memories.

11. Architecture according to claim 7, wherein the dimension of said buffering resource (10b) is (N/H-1)(2p-2).
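
As a worked instance of the sizing expression in claim 11, the sketch below evaluates (N/H-1)(2p-2) and splits it over the (N/H-1) Shift Register chains of claim 9. The function names, the equal split across chains, and the sample values N = 16, H = 2, p = 8 are assumptions made for illustration; the claim does not state the unit of the dimension, which is taken here as one storage position per pixel.

```python
def buffer_size(n: int, h: int, p: int) -> int:
    """Total buffering resource dimension (N/H - 1) * (2p - 2)."""
    if n % h != 0 or p < 1:
        raise ValueError("n must be a multiple of h and p >= 1")
    return (n // h - 1) * (2 * p - 2)


def chain_layout(n: int, h: int, p: int) -> list:
    """Assumed layout: (N/H - 1) chains, each of useful length 2p - 2."""
    return [2 * p - 2] * (n // h - 1)


if __name__ == "__main__":
    # Assumed example values: N = 16, H = 2, p = 8.
    print(buffer_size(16, 2, 8))
    print(chain_layout(16, 2, 8))
```

For the assumed values this gives 7 chains of 14 positions each, i.e. 98 positions in total; lowering p shortens the useful length of every chain, which is the quantity the Multiplexers of claim 9 are said to modify.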

12. Architecture according to claim 5 or 6, wherein a pipeline organization is provided for the data flow coming from said two loading lines (9x, 9y).

13. Architecture according to claim 5, wherein said matrix (10a) of the PE (11) implements a cost function chosen among SAD, MAD and MSE for Block Matching algorithms.

14. Architecture according to claim 6, wherein said modules of Minimum Distortion Detection (60) calculate, for equal minimum values of the cost function, the MV of minimum norm.
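
The tie-breaking rule of claim 14 can be illustrated with the following sketch. The claim does not specify which norm is used; the squared Euclidean norm is assumed here purely for the example, and the function names `mv_norm` and `min_distortion_mv` are hypothetical.

```python
def mv_norm(mv):
    """Assumed norm: squared Euclidean length of the motion vector."""
    dy, dx = mv
    return dy * dy + dx * dx


def min_distortion_mv(candidates):
    """Pick the MV with the lowest cost; ties go to the smallest norm.

    candidates: iterable of ((dy, dx), cost) pairs, e.g. the SAD of
    every displacement in the search window.
    """
    best_mv, best_cost = None, None
    for mv, cost in candidates:
        better = (best_cost is None
                  or cost < best_cost
                  or (cost == best_cost and mv_norm(mv) < mv_norm(best_mv)))
        if better:
            best_mv, best_cost = mv, cost
    return best_mv, best_cost
```

Among candidates sharing the same minimum cost, this selection favours the displacement closest to the null vector, independently of the order in which the candidates are scanned.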

15. Architecture according to claim 6, wherein the inner data flow consists, for every N/HxN/H block, of the alternation of a Preload step lasting (N/H)²+(N/H-1)(2p-2) clock cycles and of a step NOT FURNISHED UPON FILING