CN1283107C

CN1283107C - Quick movement prediction method and structure thereof

Info

Publication number: CN1283107C
Application number: CN 03147938
Authority: CN
Inventors: 陈剑军
Original assignee: GAOTE INFORMATION TECHNOLOGY Co Ltd HANGZHOU CITY
Current assignee: GAOTE INFORMATION TECHNOLOGY Co Ltd HANGZHOU CITY
Priority date: 2003-06-30
Filing date: 2003-06-30
Publication date: 2006-11-01
Anticipated expiration: 2023-06-30
Also published as: CN1568014A

Abstract

The present invention relates to a motion prediction method and structure for use in information compression in image signal processing and similar applications. Building on the three-step search method, the invention reuses the pixel data loaded from external RAM on each access and adopts a structure of nine arithmetic units operating in parallel, achieving real-time image motion prediction and increasing image processing speed. With this method and structure, the amount of computation is greatly reduced. In addition, the structure is regular and well suited to FPGA implementation. The invention can be applied to various image data compression technologies and is particularly suited to motion prediction for low-bit-rate moving images such as those of videophones and video conferences.

Description

Fast motion prediction method and device thereof

Technical field

The present invention relates to a motion prediction method and device used in information compression for image signal processing and similar applications, and in particular to a fast motion prediction method and device that, under low-bit-rate compression standards, obtain motion vectors in order to generate the prediction of the image being compressed.

Technical background

Widely used image compression coding methods and standards rely mainly on three means of compression: the discrete cosine transform (DCT) and vector quantization remove spatial redundancy within a frame, motion prediction removes temporal redundancy between frames, and entropy coding removes coding (symbol) redundancy. The accuracy of motion prediction is therefore critical to the compression efficiency of inter-frame coding and directly affects the efficiency of image encoding and decoding. This is especially true in multimedia video communication equipment such as video conferencing systems and videophones: the moving parts of their images are usually local and relatively slow, so these low-bit-rate moving images have stronger temporal and spatial correlation than general moving images, and motion prediction plays a key role in compressing them. If the prediction is good, subtracting the predicted image from the image being compressed leaves only very small values to be encoded and transmitted, which greatly improves coding and decoding efficiency.

Motion prediction is performed in units of macroblocks: it computes the displacement between a macroblock of the image being compressed and the corresponding position in the reference image. This displacement is described by a motion vector, which gives the displacement in both the horizontal and vertical directions. The motion vector indicates how far an object has actually moved, and motion compensation fills the new frame with the corresponding moving part of the previous frame, so that only the remaining differences need to be corrected; this is how data compression is achieved.
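The idea of transmitting a motion vector plus a small residual can be illustrated with a minimal Python sketch. It assumes frames are lists of rows of integer samples, that a motion vector is written (dy, dx) with the row offset first, and that the displaced block lies inside the reference frame; the function names are hypothetical and not taken from the patent.

```python
def predict_block(ref, top, left, mv, n=16):
    """Return the n x n block of `ref` displaced by mv = (dy, dx) from (top, left)."""
    dy, dx = mv
    return [row[left + dx:left + dx + n] for row in ref[top + dy:top + dy + n]]


def residual(cur, ref, top, left, mv, n=16):
    """Current block minus its motion-compensated prediction (what gets coded)."""
    pred = predict_block(ref, top, left, mv, n)
    return [[cur[top + i][left + j] - pred[i][j] for j in range(n)]
            for i in range(n)]


# Usage: the current frame is the reference frame shifted by (2, -3).
def f(y, x):
    return (7 * y + 13 * x) % 256


ref = [[f(y, x) for x in range(48)] for y in range(48)]
cur = [[f(y + 2, x - 3) for x in range(48)] for y in range(48)]
good = residual(cur, ref, 16, 16, (2, -3))
bad = residual(cur, ref, 16, 16, (0, 0))
print(all(v == 0 for row in good for v in row))  # True: perfect prediction
print(any(v != 0 for row in bad for v in row))   # True: wrong vector leaves a residual
```

With the correct vector the residual is all zeros, which is the "very small values" case described in the background; with a wrong vector the encoder would have to code non-zero differences.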

Motion prediction methods fall into two classes: pattern-matching methods and recursive methods. The first class assumes that all pixels within a block undergo the same motion, so the motion can be estimated by searching the previous frame for the best match. The second class is based on recursion: assuming that changes in pixel data between successive frames are caused by object displacement, the method iterates along the gradient direction until the computation converges to a motion vector. Many practical methods have been derived from these two approaches, chiefly the block matching algorithm (BMA), pixel-recursive methods, phase correlation, and global motion estimation. Among them, BMA is simple, effective and easy to integrate at large scale, and is therefore widely used in video coding.

In the block matching method, each frame is divided into two-dimensional sub-blocks of N × N pixels (usually N = 16), called search blocks, and all pixels in a search block are assumed to undergo the same translational motion. The previous frame is divided into a corresponding number of larger search windows, and each search window is further subdivided into multiple candidate macroblocks of the same size as the search block. Each search block of the current frame is matched against the candidates in the corresponding search window of the previous frame to find the best-matching sub-block; the two-dimensional displacement between the current search block and the matching block is the motion vector produced by motion prediction.

To determine the motion vector of a search block in the current frame, a similarity measure is computed between the search block and each of the candidate blocks in the corresponding search area of the previous frame. Different error functions give different matching criteria, such as the mean square error (MSE), the mean absolute difference (MAD), the matching pixel count (MPC), the cross-correlation function (CCF), and the minimized maximum error (MME). The MME criterion is too simple: it does not make full use of the information contained in the block, which greatly reduces estimation accuracy. The CCF criterion is too complex to compute. Compared with MSE, MAD requires less computation, needs no multiplications, and gives results close to those of MSE, so it is widely used. Suppose the upper-left corner of the block being searched in frame n is at (k, l), its displacement in frame n−1 is (u, v), and the block size is M × N. The MAD criterion is then defined as follows:

\mathrm{MAD}(k, l; u, v) = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left| I_{n}(k+i, l+j) - I_{n-1}(k+i+u, l+j+v) \right|
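A direct Python transcription of the MAD criterion is sketched below, assuming frames are lists of rows so that the first coordinate (k + i) indexes rows; the function name is hypothetical. Because 1/(MN) is a constant, ranking candidates by MAD or by the raw sum of absolute differences (SAD) selects the same block, which is why hardware usually accumulates only the plain sum.

```python
def mad(cur, prev, k, l, u, v, m=16, n=16):
    """MAD(k, l; u, v): mean absolute difference between the m x n block of
    frame n at (k, l) and the block of frame n-1 displaced by (u, v)."""
    total = 0
    for i in range(m):
        for j in range(n):
            total += abs(cur[k + i][l + j] - prev[k + i + u][l + j + v])
    return total / (m * n)


cur = [[1, 2], [3, 4]]
prev = [[1, 1], [1, 1]]
print(mad(cur, prev, 0, 0, 0, 0, m=2, n=2))  # (0 + 1 + 2 + 3) / 4 = 1.5
```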

Many methods exist for searching for the matching block, and full search gives the best prediction. Full search, also called exhaustive or traversal search, takes a block centered on the pixel whose displacement is to be predicted, looks for the best-matching block at every possible position in the previous frame, and takes the displacement between the center of that block and the current pixel as the estimated motion vector. With full search, block matching reaches the global optimum, but the computational load is large, especially for high-resolution images with fast motion. For example, when a 16 × 16 pixel block is searched within a (3 × 16) × (3 × 16) pixel area with full search, there are 16 × 16 possible results, and each result requires 256 subtractions and 255 additions. Thus, although this method is highly accurate, its complexity is also very high. To reduce the search complexity of full search, many improved motion prediction methods have appeared; most fast BMA methods reduce computation by reducing the number of checked points. These include the three-step search, conjugate direction search, two-dimensional logarithmic search, cross search, hierarchical block matching, and search with dynamic search-window adjustment, of which the three-step search is the simplest, most effective and most commonly used. Other fast methods reduce computation by reducing the number of pixels used to compute the block distortion, such as layered partial-distortion search.
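To make the cost of full search concrete, here is a minimal Python sketch of exhaustive block matching. It assumes the displacement range (-8, +7) of the prior-art example, 16 × 16 blocks, SAD as the criterion (same ranking as MAD), and frames stored as lists of rows; the names `sad` and `full_search` are hypothetical. It also counts pixel subtractions, which makes the 256 candidates × 256 pixels workload visible.

```python
import random


def sad(cur, ref, cy, cx, ry, rx, n=16):
    """Sum of absolute differences between cur[cy.., cx..] and ref[ry.., rx..]."""
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(n) for j in range(n))


def full_search(cur, ref, cy, cx, lo=-8, hi=7, n=16):
    """Exhaustive block matching over displacements lo..hi in both directions.

    Returns the best (dy, dx) and the number of pixel subtractions performed.
    """
    best_cost, best_mv = None, None
    subtractions = 0
    for dy in range(lo, hi + 1):
        for dx in range(lo, hi + 1):
            cost = sad(cur, ref, cy, cx, cy + dy, cx + dx, n)
            subtractions += n * n
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    return best_mv, subtractions


# Usage: copy the reference block at displacement (3, -5) into the current frame.
rng = random.Random(1)
ref = [[rng.randrange(256) for _ in range(48)] for _ in range(48)]
cur = [[0] * 48 for _ in range(48)]
for i in range(16):
    for j in range(16):
        cur[16 + i][16 + j] = ref[19 + i][11 + j]
mv, subs = full_search(cur, ref, 16, 16)
print(mv, subs)  # (3, -5) 65536: 256 candidates x 256 pixels
```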

Various motion prediction methods and structures have been built on the techniques above. Fig. 1 shows the circuit structure of one of them, which finds the matching block with an improved full search and thereby performs motion prediction. This method uses

E = \min_{(m_i, m_j)} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left| I_{n}(i, j) - I_{n-1}(i + m_i, j + m_j) \right| < threshold

as the criterion for finding the matching block, where I_n(i, j) is a pixel value in the M × N macroblock of the current frame (denoted c in the figure), I_{n-1}(i, j) is a pixel value in the search area of the current block in the previous frame (denoted p in the figure), (m_i, m_j) is the resulting motion vector, and threshold is a preset threshold. In the embodiment of Fig. 1, the macroblock in the current frame is 16 × 16 pixels, the search area in the previous frame is 32 × 32 pixels, and the search range is (-8, +7). The implementation consists of the following steps: 1. The input pixel data of the previous-frame search area are split into two data streams, input serially on the two p inputs and stored in registers Q and R respectively; every 16 clocks, the data in Q are transferred in parallel into R. 2. On each clock, the data output in parallel from R, together with the serially input current-frame pixel data on c, are fed to the PE arithmetic units, which perform subtraction, absolute value and accumulation. 3. Every 256 clocks, the results of the PE units are transferred in parallel into register S. 4. The data in S are fed serially to the CMP comparison unit and compared with threshold; after 4096 clocks the block best matching the current macroblock is obtained and the motion vector (m_i, m_j) is output. With this structure, finding the best-matching block takes 4096 clocks, 65536 additions, 65536 subtractions and 65536 absolute-value operations, so the computational load is still large and the processing speed is still unsatisfactory.

Summary of the invention

To solve the above problems, the object of the present invention is to provide a fast motion prediction method and device with high computation speed and low complexity, suitable for moving images such as videophone and video conference images, which can perform fast motion prediction on image frames composed of 16 × 16 macroblocks. The fast motion prediction device of the present invention comprises:

---an external RAM interface unit, which outputs the current macroblock data, the search window data and control signals;

---a current macroblock pixel storage unit, which outputs the data received from the external RAM interface unit serially to the arithmetic units PE;

---an I/P macroblock discrimination threshold generation unit, which receives data from the external RAM interface unit and the current macroblock pixel storage unit and outputs the macroblock discrimination threshold computed by formulas (III) and (IV):

MB\_mean = \frac{1}{256} \sum_{i=1}^{16} \sum_{j=1}^{16} original(i, j) \qquad \text{(III)}

tg\_thrsd = \sum_{i=1}^{16} \sum_{j=1}^{16} \left| original(i, j) - MB\_mean \right| \qquad \text{(IV)}

where MB_mean is the mean pixel value of the original macroblock;

original denotes the value of a pixel in the original macroblock;

---a search window data storage unit, which takes the serially received data of the 9 macroblocks of the previous frame centered on the macroblock at the same position as the current macroblock and, with a period of 16 clocks per column of 3 vertically adjacent macroblocks, stores them row by row into ram0 to ram8, 3 RAMs forming one group. Under the control of the address signals generated by the matching macroblock pixel access address generation unit, it converts the addresses of the 9 candidate macroblocks one by one into the read addresses rd_addr1(0..8) of ram0 to ram8 according to formula (I), determines the corresponding RAM by table lookup, switches the RAM-internal address accordingly, and outputs the results in parallel to the arithmetic units PE. The 9 candidate macroblocks are processed simultaneously, and the 9 addresses obtained on each clock correspond to 9 different RAMs. Formula (I) is:

rd\_addr1(i) = \begin{cases} 6 \cdot \mathrm{div}(swy, 3) + \mathrm{div}(swx, 3), & \text{if } \mathrm{rem}(\mathrm{rem}(swx, 16), 3) = 0 \\ 5 \cdot \mathrm{div}(swy, 3) + \mathrm{div}(swx, 3), & \text{if } \mathrm{rem}(\mathrm{rem}(swx, 16), 3) \neq 0 \end{cases}

where swx is the x coordinate of the pixel in the search window;

swy is the y coordinate of the pixel in the search window;

rem denotes the remainder;

div denotes the integer quotient;

rem(rem(swx, 16), 3) ≠ 0 means that the remainder of swx divided by 16, when divided by 3, leaves a nonzero remainder;

---the arithmetic units PE, which subtract the received current macroblock data from the candidate macroblock data, take the absolute value, and output the results to the macroblock matching processing unit;

---a macroblock matching processing unit, which receives data from the arithmetic units PE and the I/P macroblock discrimination threshold generation unit and accumulates the PE outputs, obtaining the mean absolute difference (MAD) values of the 9 candidate macroblocks after 256 clocks. The 9 MAD values are compared with the I/P macroblock threshold: a macroblock whose value is less than tg_thrsd is a P macroblock, otherwise it is an I macroblock, and the P macroblock with the smallest MAD is the matching macroblock. The unit outputs the address of a reference point of this macroblock and its motion vector; the best-matching macroblock is obtained after four matching passes;

---a matching macroblock pixel access address generation unit, which, from the input address in the search window of the reference-point pixel of the matching macroblock, calculates the addresses of the candidate-macroblock data required for the next matching pass and outputs them to the search window data storage unit for the next matching operation.

In addition, the RAMs correspond one-to-one with the arithmetic circuits PE, and the accumulators correspond one-to-one with the candidate macroblocks.

As a further improvement of the invention, the current macroblock pixel storage unit comprises a current macroblock pixel buffer unit and a current macroblock pixel access address generation unit, and the macroblock matching processing unit comprises a data accumulation circuit and a data comparison circuit.

In the fast motion prediction method of the present invention, the current macroblock data are stored one by one into the current macroblock pixel unit, while the I/P macroblock threshold (tg_thrsd) is computed by formulas (III) and (IV):

MB\_mean = \frac{1}{256} \sum_{i=1}^{16} \sum_{j=1}^{16} original(i, j) \qquad \text{(III)}

tg\_thrsd = \sum_{i=1}^{16} \sum_{j=1}^{16} \left| original(i, j) - MB\_mean \right| \qquad \text{(IV)}

where original denotes the value of a pixel in the original macroblock and MB_mean is the mean of these values;

The following steps are carried out at the same time:

1. Store the search window data

The search window consists of the 9 macroblocks of the previous frame centered on the macroblock at the same position as the current macroblock. With a period of 16 clocks per column of 3 vertically adjacent macroblocks, the data are stored row by row into ram0 to ram8, 3 RAMs forming one group, while the current macroblock data are stored one by one into the current macroblock pixel unit. Each RAM is 32 × 16 pixels and is further divided into 4 zones of 8 × 16 pixels each;

2. Obtain the candidate block data from the search window

Select the center macroblock point in the search window as the reference point. Centered on it, take the 8 points at a step size of 8 as the reference points of 8 further candidate macroblocks, which determines the addresses of the 9 candidate macroblocks in the search window. Convert these addresses one by one into the read addresses rd_addr1(0..8) of ram0 to ram8 according to formula (I), then determine the corresponding RAM by table lookup and switch the RAM-internal address accordingly. The 9 candidate macroblocks are processed simultaneously, and the 9 addresses obtained on each clock correspond to 9 different RAMs. Formula (I) is:

rd\_addr1(i) = \begin{cases} 6 \cdot \mathrm{div}(swy, 3) + \mathrm{div}(swx, 3), & \text{if } \mathrm{rem}(\mathrm{rem}(swx, 16), 3) = 0 \\ 5 \cdot \mathrm{div}(swy, 3) + \mathrm{div}(swx, 3), & \text{if } \mathrm{rem}(\mathrm{rem}(swx, 16), 3) \neq 0 \end{cases}

where swx and swy are the x and y coordinates of the pixel in the search window;

rem denotes the remainder;

div denotes the integer quotient;

3. Obtain the matching macroblock

Subtract the current macroblock from each of the 9 candidate macroblocks and take absolute values. The RAMs correspond one-to-one with the arithmetic circuits PE and the accumulators correspond one-to-one with the candidate macroblocks; each accumulator sums the PE outputs belonging to its candidate macroblock, giving the MAD values of the 9 candidate macroblocks after 256 clocks. The 9 MAD values are compared with the I/P macroblock threshold: a macroblock whose value is less than tg_thrsd is a P macroblock, otherwise it is an I macroblock, and the P macroblock with the smallest MAD is the matching macroblock;

4. Set the step size to half of the previous step size and, taking the reference point of the matching macroblock just obtained as the new center, repeat steps 2 and 3 until the step size is less than 1;

5. Output the motion vector of the best-matching macroblock.
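Steps 2 to 5 can be sketched in software as a coarse-to-fine search that evaluates 9 candidates per step. This is a minimal sketch under stated assumptions, not the patented hardware: the names `nine_point_search` and `sad_cost` are hypothetical, the I/P threshold test of step 3 is omitted (the candidate with the smallest error simply wins), ties go to the first candidate in the order of formula (II), and coordinates are displacements relative to the co-located macroblock. Like the 9 parallel PEs, it re-evaluates the center in every step, so 4 steps cost 36 block evaluations, i.e. 36 × 256 = 9216 pixel subtractions.

```python
import random


def nine_point_search(cost, steps=(8, 4, 2, 1)):
    """Coarse-to-fine search: evaluate 9 candidates per step around the centre.

    cost(dy, dx) returns the matching error of the candidate displaced by
    (dy, dx). Candidates are visited x offset outer, y offset inner; on ties
    the first one wins. Returns the final (dy, dx) and the evaluation count.
    """
    cy, cx = 0, 0
    evaluations = 0
    for step in steps:
        best = None
        for ox in (-step, 0, step):
            for oy in (-step, 0, step):
                c = cost(cy + oy, cx + ox)
                evaluations += 1
                if best is None or c < best[0]:
                    best = (c, cy + oy, cx + ox)
        _, cy, cx = best  # the best candidate becomes the next centre
    return (cy, cx), evaluations


def sad_cost(cur, ref, top, left, n=16):
    """Hypothetical helper: SAD of the current block against a displaced ref block."""
    def cost(dy, dx):
        return sum(abs(cur[top + i][left + j] - ref[top + dy + i][left + dx + j])
                   for i in range(n) for j in range(n))
    return cost


# Usage 1: a smooth bowl-shaped error surface with its minimum near (5.3, -4.6).
mv, evals = nine_point_search(lambda dy, dx: (dy - 5.3) ** 2 + (dx + 4.6) ** 2)
print(mv, evals)  # (5, -5) 36

# Usage 2: a 48 x 48 search window whose block at displacement (8, -8) matches.
rng = random.Random(7)
ref = [[rng.randrange(256) for _ in range(48)] for _ in range(48)]
cur = [row[:] for row in ref]
for i in range(16):
    for j in range(16):
        cur[16 + i][16 + j] = ref[24 + i][8 + j]
print(nine_point_search(sad_cost(cur, ref, 16, 16))[0])  # (8, -8)
```

The fractional minimum in the first usage avoids ties, so each step has a unique winner; with real image data the greedy search can settle in a local minimum, which is the usual trade-off of three-step-style searches against full search.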

Building on the three-step search method, the fast motion prediction method and structure of the invention reuse the pixel data loaded from external RAM on each access and adopt a structure of 9 arithmetic units operating in parallel, achieving real-time image motion prediction and higher image processing speed. With the method and structure of the invention, completing the three-step-based search for a 16 × 16 current macroblock and obtaining a valid motion vector over a search range of (-15, +15) takes only 256 × 9 = 2304 clocks, 9216 subtractions, 9216 additions and 9216 absolute-value operations, a substantial reduction in computation.

Description of drawings

Fig. 1 is a diagram of an existing motion prediction structure based on full search.

Fig. 2 is a system circuit diagram of an embodiment of the invention.

Fig. 3 shows the composition of the search window in an embodiment of the invention and the way it is stored in the RAMs.

Fig. 4 shows, for an embodiment of the invention, the address timing with which the data of macroblocks MB0 to MB2 are written into the RAMs, and the correspondence between addresses and pixel positions and between addresses and RAM blocks.

Fig. 5 shows the order in which data are read from ram0, ram1 and ram2 when the step size is 4 in an embodiment of the invention.

Fig. 6 is the table used in an embodiment of the invention to determine the RAM to which an rd_addr1 address belongs.

Embodiment

Building on the three-step search, the motion prediction structure of this embodiment reuses the pixel data loaded from external RAM on each access and adopts a structure of 9 arithmetic units operating in parallel, so that motion prediction of 16 × 16 macroblocks is performed quickly.

The three-step search (3SS) was proposed by Koga et al. in 1981. The algorithm refines its precision from coarse to fine, with the check step size decreasing exponentially. The initial step size is set to half of the maximum possible displacement d. In each step, 9 points are checked, the center and the 8 points at a distance of step from it, and the point with the smallest block distortion measure (BDM) becomes the center of the next step. For d = 7, the number of points checked is 9 + 8 + 8 = 25. For larger search windows (that is, larger d), 3SS extends easily to an n-step search, with a total of 1 + 8⌈log₂(d + 1)⌉ search points.
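The step schedule and point count of a classic 3SS-style search can be computed as below. This sketch assumes power-of-two step sizes 2^(n-1), ..., 2, 1 with n = ⌈log₂(d + 1)⌉ steps, which equals half of d rounded up when d = 2^n − 1 (the d = 7 and d = 15 cases used in this document), and it counts the center only once, as classic 3SS does (9 + 8 + 8 + ...). The function name is hypothetical.

```python
def search_schedule(d):
    """Step sizes and point count of a classic 3SS-style search for max displacement d.

    The number of steps is ceil(log2(d + 1)), which equals d.bit_length() for d >= 1.
    The centre is checked once, and each step adds 8 new points around the centre.
    """
    if d < 1:
        raise ValueError("d must be a positive integer")
    n = d.bit_length()
    steps = [2 ** (n - 1 - k) for k in range(n)]   # e.g. 4, 2, 1 for d = 7
    points = 1 + 8 * n                              # centre once, plus 8 per step
    return steps, points


print(search_schedule(7))   # ([4, 2, 1], 25)
print(search_schedule(15))  # ([8, 4, 2, 1], 33)
```

The hardware in this document evaluates all 9 candidates in every step (36 for d = 15) because the 9 PEs run in parallel anyway, trading a few redundant center evaluations for a fixed, regular schedule.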

The search window is usually made up of the 9 macroblocks of the previous frame centered on the macroblock at the same position as the current macroblock, giving a size of 48 × 48 pixels. The macroblock matching process shows that the search windows of two horizontally adjacent current macroblocks in the same row overlap by 6 macroblocks. To reuse the overlapping data effectively, only the data of the 3 newly entering macroblocks need to be loaded each time the search window is updated (when the current macroblock moves to a new row, the search window data are reloaded completely). The system uses 9 RAMs (ram0 to ram8) to store the search window data alternately. Each RAM is 32 × 16 pixels and, so that previous-frame data can be written into and read from the RAMs at the same time, each RAM is further divided into 4 zones of 8 × 16 pixels, used in rotation to store the data of each column of 3 vertically adjacent macroblocks of the search window. During the full search period of one block match, the search window data occupy 3 of the zones of the 9 RAMs; each RAM corresponds to one PE, and the data of the 9 RAMs are output in parallel to the PEs for processing, while the three new macroblocks of the next search window sent from external RAM are written into the remaining zone of the RAMs. For details of how data are written into the RAMs, see Fig. 3 and Fig. 4.
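The saving from reusing overlapping search-window data can be estimated with simple arithmetic. The sketch below assumes a row of current macroblocks processed left to right, a full 9-macroblock (48 × 48) window loaded for the first macroblock and 3 new macroblocks for each later one, and it ignores frame-border handling; the 11-macroblock row (a 176-pixel-wide QCIF frame) is only an illustrative choice, and the function name is hypothetical.

```python
MB_PIXELS = 16 * 16  # pixels per macroblock


def pixels_loaded(mbs_per_row, reuse=True):
    """Search-window pixels fetched from external RAM for one macroblock row.

    Each 48 x 48 window holds 9 macroblocks. Without reuse every current
    macroblock loads a full window; with reuse only the first window is loaded
    in full and each later one loads just the new column of 3 macroblocks.
    Frame borders are ignored.
    """
    full_window = 9 * MB_PIXELS
    if not reuse:
        return mbs_per_row * full_window
    return full_window + (mbs_per_row - 1) * 3 * MB_PIXELS


# Usage: a row of 11 macroblocks (for example, a 176-pixel-wide QCIF frame).
print(pixels_loaded(11, reuse=False))  # 25344
print(pixels_loaded(11))               # 9984
```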

Once all search window data have been written into the RAMs, data are read from the RAMs and output to the PEs according to a fixed rule. On each clock, 9 read addresses, one for each of ram0 to ram8, must be supplied. Once the position (swx(i), swy(i)), i = 0..8, in the search window of the pixel that each RAM must deliver on each clock is known, the read addresses rd_addr1(0..8) of ram0 to ram8 can be calculated with the following formula, using the correspondence between storage addresses and pixel positions in Fig. 4(b):

rd\_addr1(i) = \begin{cases} 6 \cdot \mathrm{div}(swy, 3) + \mathrm{div}(swx, 3), & \text{if } \mathrm{rem}(\mathrm{rem}(swx, 16), 3) = 0 \\ 5 \cdot \mathrm{div}(swy, 3) + \mathrm{div}(swx, 3), & \text{if } \mathrm{rem}(\mathrm{rem}(swx, 16), 3) \neq 0 \end{cases} \qquad \text{(I)}

where div(x, y) is the integer quotient of x divided by y and rem(x, y) is the remainder of x divided by y. swx(i) and swy(i) both range over 0 to 47, i.e. the width and height of the search window.
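Formula (I) can be transcribed literally into Python as a check on the arithmetic. This is only a transcription of the formula as printed: it has not been verified against Fig. 4(b), and the subsequent table lookup (map_tab0, map_tab1 of Fig. 6, producing rd_addr2) is not reproduced because the figures are not available here. Since swx and swy are non-negative, div and rem map directly to Python's `//` and `%`.

```python
def rd_addr1(swx, swy):
    """Literal transcription of formula (I) for a pixel at (swx, swy) in the
    48 x 48 search window; div and rem become // and % for non-negative inputs."""
    if not (0 <= swx < 48 and 0 <= swy < 48):
        raise ValueError("swx and swy must lie in 0..47")
    if (swx % 16) % 3 == 0:
        return 6 * (swy // 3) + swx // 3
    return 5 * (swy // 3) + swx // 3


# Usage: read addresses for a few sample pixel positions (swx, swy).
positions = [(0, 0), (1, 3), (3, 3), (16, 0), (17, 6)]
print([rd_addr1(x, y) for x, y in positions])  # [0, 5, 7, 5, 15]
```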

The resulting rd_addr1(0..8) are the storage addresses of the pixels of MB0 to MB8 respectively. However, the RAM among ram0 to ram8 in which each pixel is stored differs from clock to clock. A table lookup in map_tab0 and map_tab1 therefore maps rd_addr1(0..8) to rd_addr2(0..8), and rd_addr2(0..8) are bound in order to the address ports of ram0 to ram8. Both rd_addr1(0..8) and rd_addr2(0..8) are 10 bits wide, of which the upper 4 bits select among ram0 to ram8 and thereby switch the RAM-internal address. This function is performed by the matching macroblock pixel access address generation circuit 7, which enables the pixels of the 9 candidate macroblocks to be read in parallel.

The invention is based on the three-step search. Taking a search range of (-15, +15) as an example, 4 search steps with step sizes 8, 4, 2 and 1 are needed to find the best-matching macroblock. Each search step takes 256 clocks. The 256 groups of swx(i), swy(i) values required in a step are obtained from the address (cntr_swy, cntr_swx) in the search window of the starting pixel of the matching macroblock returned by the previous step, combined with the step size step of the current step and the row and column position of the pixel within the macroblock. First, the address (cntr_swy, cntr_swx) returned by the previous step is combined with step to obtain 9 pairs of reference points (swy_base(i), swx_base(i)), which are the starting pixel positions (swy(i), swx(i)) of the 9 macroblocks searched in this step; throughout the invention, the starting pixel of a candidate macroblock is defined as its upper-left pixel. Since the address (cntr_swy, cntr_swx) of the upper-left pixel of the matching macroblock returned by the previous step is exactly the starting pixel position of the center macroblock searched in this step, the mapping is as follows:

\begin{aligned}
swx\_base(0) &= cntr\_swx - step, & swy\_base(0) &= cntr\_swy - step \\
swx\_base(1) &= cntr\_swx - step, & swy\_base(1) &= cntr\_swy \\
swx\_base(2) &= cntr\_swx - step, & swy\_base(2) &= cntr\_swy + step \\
swx\_base(3) &= cntr\_swx, & swy\_base(3) &= cntr\_swy - step \\
swx\_base(4) &= cntr\_swx, & swy\_base(4) &= cntr\_swy \\
swx\_base(5) &= cntr\_swx, & swy\_base(5) &= cntr\_swy + step \\
swx\_base(6) &= cntr\_swx + step, & swy\_base(6) &= cntr\_swy - step \\
swx\_base(7) &= cntr\_swx + step, & swy\_base(7) &= cntr\_swy \\
swx\_base(8) &= cntr\_swx + step, & swy\_base(8) &= cntr\_swy + step
\end{aligned} \qquad \text{(II)}

Taking the starting pixel position (swy_base(i), swx_base(i)) as the reference, the swx(i), swy(i) of the remaining pixels are computed over the following 255 clock cycles from the step value step and the row and column counter values, giving the coordinates (swx, swy) of each pixel in the search window.

Search window with 48 * 48 pixel sizes is an example, the address definition of data be (swy, swx), wherein the address of initial picture element is (0,0), when carrying out piece when coupling first time, step-length is 8, the initial picture element of candidate's macro block L0 is (8,8) in the address of search window, by that analogy, (8,16), (8,24), (16,8), (16,16), (16,24), (24,8), (24,16), (24,24) be the address of the initial picture element of all the other 8 candidate's macro blocks respectively in search window, the scope that is integrated in the search window of 9 candidate's macro blocks is exactly (8,8)～(40,40).The initial picture element of the coupling macro block that searches with the first step is a central point, and the beginning step-length is 4 range searching, is 1 up to step-length, obtains the optimum Match macro block.

In addition, the I/P macroblock discrimination threshold of the invention is computed with the following formulas:

MB\_mean = \frac{1}{256} \sum_{i=1}^{16} \sum_{j=1}^{16} original(i, j) \qquad \text{(III)}

tg\_thrsd = \sum_{i=1}^{16} \sum_{j=1}^{16} \left| original(i, j) - MB\_mean \right| \qquad \text{(IV)}

In the comparison circuit of the macroblock matching processing unit, the MAD value of each macroblock is compared with tg_thrsd: a macroblock whose value is less than tg_thrsd is a P macroblock, otherwise it is an I macroblock, and the P macroblock with the smallest MAD is the matching macroblock.
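The threshold of formulas (III) and (IV) and the P/I comparison can be sketched in Python as follows. Assumptions: the mean is computed in floating point (hardware would likely use an integer divide by 256), the 9 candidate errors are the raw accumulated absolute-difference sums so that they are on the same unnormalised scale as tg_thrsd, and returning `None` stands for "no candidate is a P macroblock". The function names are hypothetical.

```python
def discrimination_threshold(block):
    """tg_thrsd from formulas (III) and (IV) for a 16 x 16 original macroblock."""
    pixels = [p for row in block for p in row]
    mb_mean = sum(pixels) / len(pixels)               # formula (III)
    return sum(abs(p - mb_mean) for p in pixels)      # formula (IV)


def select_match(costs, tg_thrsd):
    """Index of the matching candidate, or None if no candidate is a P macroblock.

    costs are the 9 accumulated absolute-difference sums, assumed to be on the
    same (unnormalised) scale as tg_thrsd.
    """
    p_blocks = [i for i, c in enumerate(costs) if c < tg_thrsd]
    if not p_blocks:
        return None                                    # every candidate is "I"
    return min(p_blocks, key=lambda i: costs[i])       # smallest error among P


block = [[0] * 16 for _ in range(8)] + [[100] * 16 for _ in range(8)]
thr = discrimination_threshold(block)
print(thr)  # 12800.0: mean 50, every pixel 50 away from it
costs = [20000, 5000, 13000, 12799, 4000, 30000, 12800, 9000, 4001]
print(select_match(costs, thr))  # 4
```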

The composition of the computing structure of the invention is described below.

Fig. 2 is a system structure diagram illustrating this embodiment. As shown, the whole system consists of the following parts:

---circuit 1, the external RAM interface, which outputs the current macroblock data, the search window data and control signals; the current macroblock data are sent simultaneously to the current macroblock pixel storage circuit and the discrimination threshold generation circuit, and the search window data are sent to the search window storage circuit;

---current macroblock pixel storage circuit 2, which outputs the data received from circuit 1 serially to the PE arithmetic circuits;

---I/P macroblock discrimination threshold generation circuit 3, which receives data from circuits 1 and 2, computes the macroblock discrimination threshold, and outputs it to the macroblock matching processing circuit;

---search window data storage circuit 4, which stores the serially received data into the 9 RAM blocks according to a fixed storage rule and, under the control of the address signals, reads the search window data and outputs them in parallel to the PE arithmetic circuits;

---PE arithmetic circuits 5, which subtract the received current macroblock data from the candidate macroblock data, take the absolute value, and output the results to the macroblock matching processing circuit;

---macroblock matching processing circuit 6, which receives data from circuits 5 and 3, accumulates the outputs of circuit 5 according to a fixed rule, and compares the sums with the threshold to obtain the matching macroblock; after 4 matching passes it obtains the best-matching macroblock and a valid motion vector;

---matching macroblock pixel access address generation circuit 7, which, from the input address in the search window of the starting pixel of the matching macroblock, calculates the addresses of the candidate-macroblock data needed for the next matching pass and outputs them to the search window data storage circuit for the next matching operation.

In addition, circuit 2 comprises two parts: (1) a current macroblock pixel buffer circuit and (2) a current macroblock pixel access address generation circuit. Circuit 6 also comprises two parts: (1) a data accumulation circuit, in which the accumulators correspond not to the PEs but one-to-one to the 9 candidate macroblocks; the PE outputs are combined in groups of three PEs, in a pattern that repeats every 16 clocks, and fed to the accumulators, which yield the MAD of each candidate macroblock; and (2) a data comparison circuit, which, if the match succeeds, outputs the motion vector and the address of the starting pixel of the matching block, and otherwise proceeds to the next block-matching operation.

Fig. 3 illustrates the composition of the search window in an embodiment of the invention and how it is stored in the RAMs. As shown, in time period T0 the search window consists of MB0 to MB8, stored serially in the direction of the arrows in zones 1, 2 and 3 of the RAMs, while new data are written into zone 4; in time period T1 the search window consists of MB3 to MB11 and new data are written into zone 1, and so on. Because successive search areas in the previous reconstructed frame are contiguous and overlapping, this scheme effectively reuses the overlapping part of the search window.
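The zone rotation of Fig. 3 can be modelled as a ring of 4 zones, each holding one column of 3 macroblocks. This is an inferred sketch, since the figure itself is not available: it assumes macroblock column c holds MB(3c) to MB(3c + 2), which matches the T0 and T1 windows described above, and that column c always lands in zone (c mod 4) + 1. The function names are hypothetical.

```python
def zone_of_column(col):
    """Zone (1-4) holding macroblock column `col`, i.e. MB(3*col) .. MB(3*col + 2)."""
    return col % 4 + 1


def period_layout(t):
    """Macroblocks in the window, zones read, and zone written during period T<t>."""
    window_cols = (t, t + 1, t + 2)
    mbs = list(range(3 * t, 3 * t + 9))
    read_zones = [zone_of_column(c) for c in window_cols]
    write_zone = zone_of_column(t + 3)   # next column is prefetched meanwhile
    return mbs, read_zones, write_zone


for t in range(3):
    mbs, read_zones, write_zone = period_layout(t)
    print(f"T{t}: MB{mbs[0]}-MB{mbs[-1]} read zones {read_zones}, write zone {write_zone}")
```

Because the zone being written is never one of the three being read, the next column can be loaded from external RAM while the current search runs, which is the purpose of splitting each RAM into 4 zones.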

Fig. 4 shows the write-address timing of the data of macroblocks MB0–MB2 into the RAM, the correspondence between address and pixel position, and the correspondence between address and RAM block. In the present invention the RAM uses interleaved storage so that data can be output in parallel. The storage format of the search window must be able to supply the required data to the 9 PEs simultaneously, to meet the parallel-processing requirement of the computing units; the search-window data are therefore stored interleaved across 9 memory blocks, ram0, ram1, ..., ram8. Fig. 4(a) shows the write-address timing of the search-window data as they are written into the RAM, Fig. 4(b) shows the correspondence between write address and pixel position, and Fig. 4(c) shows the storage format of the search-window data in the RAM. In Fig. 4(c), the pixels labeled 0–8 are stored in ram0–ram8 respectively, so the data of ram0–ram8 appear alternately across the search window. The data marked with small black boxes are the 8 reference points needed when the step size is 4, and the data marked with dashed boxes are the pixels of 3 candidate macroblocks when the step size is 4. To the left of the thick black line are all the data of the 3 macroblocks in one vertical column of the search window; to the right are the (incomplete) data of the 3 macroblocks in the adjacent column. Three such columns, 9 macroblocks in all, make up the search window. Because of the nature of the 4-step search itself, the distance between corresponding pixels of two adjacent candidate macroblocks is always 2^k (k = 2, 1, or 0), and 2^k is never divisible by 3. This guarantees that, in every clock cycle of every search step, the 9 data items read by the 9 PEs come from the 9 different memory blocks ram0–ram8. With this storage structure, the search-window data can be output efficiently over 9 parallel channels.
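The conflict-free property stated above can be checked with a short sketch. It assumes, hypothetically, that a search-window pixel at (swx, swy) is stored in bank `3 * (swy % 3) + (swx % 3)`; the actual labeling in Fig. 4(c) may differ, but the argument is the same for any labeling of this residue-based kind. Because a step of 4, 2, or 1 is never a multiple of 3, the offsets -s, 0, +s land on three different residues mod 3 in each direction, so the 9 candidate pixels fall in 9 different banks. The function names are illustrative.

```python
def bank_of(swx: int, swy: int) -> int:
    """Hypothetical bank index (0..8): 3 x 3 residue interleave of the search window."""
    return 3 * (swy % 3) + (swx % 3)


def candidate_banks(cx: int, cy: int, step: int) -> list[int]:
    """Banks holding the corresponding pixels of the 9 candidates around (cx, cy)."""
    offsets = (-step, 0, step)
    return [bank_of(cx + dx, cy + dy) for dy in offsets for dx in offsets]


def conflict_free(step: int, size: int = 48) -> bool:
    """True if, for every centre in a size x size window, the 9 reads hit 9 distinct banks."""
    return all(
        len(set(candidate_banks(cx, cy, step))) == 9
        for cy in range(size)
        for cx in range(size)
    )


if __name__ == "__main__":
    for k in (2, 1, 0):  # step sizes 4, 2, 1
        print(2 ** k, conflict_free(2 ** k))  # True for each of these steps
    print(3, conflict_free(3))  # a step divisible by 3 would cause bank conflicts
```

A step of 3 fails because -3, 0, and +3 share one residue, which is why the text stresses that 2^k is never divisible by 3.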

Fig. 5 shows the order in which data are read from ram0, ram1, and ram2 when the step size is 4. From formula (1), in each clock the matching-macroblock pixel access address generation circuit 7 outputs 9 addresses, one for each of the 9 RAMs; the search-window data storage circuit 4 reads data from the 9 RAMs at these 9 addresses and outputs them to the PEs, which perform the subtraction and absolute-value operations. After 256 clocks, all 9 candidate macroblocks have been formed, as shown in Fig. 4(c). Therefore, in clock cycles 0–15 of the second search step, the candidate macroblocks to which the pixels read from ram0 belong are, in order:

L2→L1→L0→L2→L1→L0→L2→L1→L0→L2→L1→L0→L2→L1→L0→L2;

The candidate macroblocks to which the pixels read from ram1 belong are, in order:

L0→L2→L1→L0→L2→L1→L0→L2→L1→L0→L2→L1→L0→L2→L1→L0;

The candidate macroblocks to which the pixels read from ram2 belong are, in order:

L1→L0→L2→L1→L0→L2→L1→L0→L2→L1→L0→L2→L1→L0→L2→L1.
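The three sequences above follow a simple rotation: reading them off, ram r (r = 0, 1, 2) at clock t (t = 0..15) serves candidate L((2 + r - t) mod 3). This closed form is our reading of the listed sequences, not a formula stated in the patent. The sketch regenerates the sequences and checks that at every clock the three RAMs serve three different candidates, which is what lets each candidate's accumulator receive exactly one value per clock.

```python
def label(ram: int, clock: int) -> int:
    """Candidate index (0, 1, 2 for L0, L1, L2) served by ram0..ram2 at a given clock."""
    return (2 + ram - clock) % 3


def sequence(ram: int, clocks: int = 16) -> str:
    """Order of candidate labels read from one RAM over `clocks` cycles."""
    return "→".join(f"L{label(ram, t)}" for t in range(clocks))


if __name__ == "__main__":
    for r in range(3):
        print(f"ram{r}:", sequence(r))
    # At every clock the three RAMs feed three distinct candidates.
    assert all({label(r, t) for r in range(3)} == {0, 1, 2} for t in range(16))
```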

At the same time, as shown in Fig. 5, the MAD of each candidate macroblock is accumulated in exactly the order shown in the figure. Note only that the first 16 clock cycles accumulate data taken from ram0–ram2, the second period takes data from ram3–ram5, and the third period from ram6–ram8; this cycle repeats until 256 clocks have elapsed, yielding the MADs of the 9 candidate macroblocks in parallel.
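The parallel accumulation described above can be sketched as nine accumulators that each receive one absolute difference per clock, so all 9 sums are ready after 256 clocks. This sketch abstracts away the RAM interleave (it indexes the reference data directly) and, like the text, treats the accumulated sum of absolute differences as the MAD value without dividing by 256. The function name and the (cx, cy) top-left convention for the centre candidate are illustrative assumptions.

```python
Block = list[list[int]]


def parallel_mad(cur: Block, window: Block, cx: int, cy: int, step: int) -> list[int]:
    """Accumulate |cur - ref| for 9 candidates, one pixel per candidate per clock."""
    offsets = [(dx, dy) for dy in (-step, 0, step) for dx in (-step, 0, step)]
    acc = [0] * 9  # one accumulator per candidate macroblock
    for clock in range(256):  # 16 x 16 pixels, one per clock
        i, j = divmod(clock, 16)  # row i, column j of the current macroblock
        for pe, (dx, dy) in enumerate(offsets):  # the 9 PEs work in the same clock
            ref = window[cy + dy + i][cx + dx + j]
            acc[pe] += abs(cur[i][j] - ref)
    return acc
```

Index 4 of the result is the centre candidate; the hardware gets the same sums, but draws each clock's 9 reference pixels from the 9 interleaved RAMs instead of from one array.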

Fig. 6 is the table used to determine which RAM the address rd_addr1 belongs to. The lookup result is combined with the address group into a new address, which is used to read the data in the RAM; the data are then output to the PE computing circuits.

In summary, the fast motion prediction method and structure of the present invention build on three-step search, adopt a structure of 9 computing units working in parallel, and make full use of data already loaded to complete the block-matching operation. Obtaining the best-matching macroblock takes only 2304 clocks and 9216 subtraction/addition/absolute-value operations. Compared with existing motion estimation techniques, the amount of computation is greatly reduced and image processing speed is increased; in addition, the structure is regular and well suited to FPGA implementation. The invention can be applied in various image data compression techniques, and in particular is widely applicable to motion prediction for low-bit-rate moving images such as those of videophones and video conferencing.

Claims

1. A fast motion prediction method, characterized in that the current macroblock data are stored pixel by pixel in the current macroblock pixel unit; at the same time, the I/P macroblock threshold tg_thrsd is obtained from formulas (III) and (IV), where formulas (III) and (IV) are:

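\mathrm{MB\_mean} = \frac{1}{256} \sum_{i=1}^{16} \sum_{j=1}^{16} \mathrm{original}(i, j) \qquad \text{(III)}

\mathrm{tg\_thrsd} = \sum_{i=1}^{16} \sum_{j=1}^{16} \left| \mathrm{original}(i, j) - \mathrm{MB\_mean} \right| \qquad \text{(IV)}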

where original(i, j) denotes the value of the pixel at row i, column j of the original macroblock;
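Formulas (III) and (IV) amount to the macroblock's mean and its total absolute deviation from that mean. A minimal sketch, assuming the macroblock is given as a 16 × 16 nested list of pixel values (the function name is illustrative); the mean is kept as a float here, since the patent does not say how a hardware unit rounds it.

```python
def ip_threshold(block: list[list[int]]) -> float:
    """I/P decision threshold tg_thrsd of a 16 x 16 macroblock, per formulas (III) and (IV)."""
    pixels = [p for row in block for p in row]
    assert len(pixels) == 256, "expects a 16 x 16 macroblock"
    mb_mean = sum(pixels) / 256  # formula (III)
    return sum(abs(p - mb_mean) for p in pixels)  # formula (IV)
```

A flat block gives a threshold of 0, so any candidate with nonzero error is classified as I; a textured block tolerates a proportionally larger matching error before being coded as I.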

and the following steps are carried out at the same time:

1. Store the search-window data

The search window consists of the 9 macroblocks centered on the macroblock at the same position as the current macroblock in the previous frame. The data of the 3 macroblocks in each vertical column are stored line by line into ram0–ram8, with a period of 16 clocks and 3 RAMs forming a group. Each RAM is 32 × 16 pixels and is further divided into 4 zones of 8 × 16 pixels each;

2. Obtain the candidate-block data from the search window

The point of the center macroblock in the search window is selected as the reference point. Centered on this reference point, the 8 reference points at a step size of 8 are taken as 8 candidate macroblocks, and the addresses of the 9 candidate macroblocks in the search window are determined. According to formula (I), the addresses of the candidate macroblocks are converted one by one into the read addresses rd_addr1(0–8) of ram0–ram8; a table-lookup operation then determines the corresponding RAM, which implements the switching of addresses inside the RAMs. The 9 candidate macroblocks are processed simultaneously, and the 9 addresses obtained in each clock correspond to the 9 RAMs respectively. Formula (I) is:

\mathrm{rd\_addr1}(i) = \begin{cases} 6 \cdot \mathrm{div}(swy, 3) + \mathrm{div}(swx, 3), & \text{if } \mathrm{rem}(\mathrm{rem}(swx, 16), 3) = 0 \\ 5 \cdot \mathrm{div}(swy, 3) + \mathrm{div}(swx, 3), & \text{if } \mathrm{rem}(\mathrm{rem}(swx, 16), 3) \neq 0 \end{cases} \qquad \text{(I)}

where swx and swy denote the x and y coordinates of the pixel in the search window;

rem denotes the remainder;

div denotes the integer quotient;
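Formula (I) can be transcribed directly, assuming non-negative integer coordinates so that div and rem map to Python's `//` and `%`. The sketch covers only the address computation; the Fig. 6 lookup that selects the RAM and forms the final address is not reproduced, because the table is not given in the text. The function name is illustrative.

```python
def rd_addr1(swx: int, swy: int) -> int:
    """Read address per formula (I) for a search-window pixel at (swx, swy)."""
    if (swx % 16) % 3 == 0:  # rem(rem(swx, 16), 3) = 0
        return 6 * (swy // 3) + swx // 3
    return 5 * (swy // 3) + swx // 3  # rem(rem(swx, 16), 3) != 0
```

For example, (swx, swy) = (3, 3) takes the first branch and gives 6 · 1 + 1 = 7, while (4, 3) takes the second and gives 5 · 1 + 1 = 6.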

3. Obtain the matching macroblock

The data of the 9 candidate macroblocks are subtracted from the current macroblock data and the absolute values are taken. The RAMs correspond one-to-one with the computing circuits PE, and the accumulators correspond one-to-one with the candidate macroblocks; each accumulator sums the outputs of the computing circuits (PE) belonging to its candidate macroblock, and after 256 clocks the mean absolute error (MAD) values of the 9 candidate macroblocks are obtained. The 9 MAD values are compared with the I/P macroblock threshold: a macroblock whose MAD is less than tg_thrsd is a P macroblock, otherwise it is an I macroblock. Among the P macroblocks, the one with the minimum MAD is the matching macroblock;
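The decision rule in step 3 can be sketched as follows: classify each candidate against tg_thrsd, then take the P candidate with the smallest MAD. The claim does not say what happens when no candidate qualifies as P; this sketch simply returns `None` in that case, and breaks ties in favor of the lowest candidate index. Both choices, and the function name, are assumptions.

```python
from typing import Optional


def select_match(mads: list[int], tg_thrsd: float) -> tuple[Optional[int], list[str]]:
    """Label candidates P (MAD < tg_thrsd) or I and return the P candidate with minimum MAD."""
    kinds = ["P" if m < tg_thrsd else "I" for m in mads]
    p_candidates = [k for k, kind in enumerate(kinds) if kind == "P"]
    if not p_candidates:
        return None, kinds  # assumption: no P candidate means no match in this step
    best = min(p_candidates, key=lambda k: mads[k])  # ties resolve to the lowest index
    return best, kinds
```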

4. Set the step size to half of the previous step size and, taking the reference point of the matching macroblock just obtained as the new center, repeat steps 2 and 3 until the step size is less than 1;

5. Output the motion vector of the best-matching macroblock.
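Steps 2 to 5 can be combined into a small software model. This sketch, under stated assumptions, uses the plain sum of absolute differences as the matching cost, leaves out the I/P test and the RAM addressing, and starts from the step size of 8 named in step 2, so the steps are 8, 4, 2, 1. The caller must keep every candidate inside the reference frame. Like any step search, it can settle on a local minimum rather than the full-search optimum. All names are illustrative.

```python
Block = list[list[int]]


def sad(cur: Block, ref: Block, x: int, y: int) -> int:
    """Sum of |cur - ref| over the 16x16 block of ref whose top-left corner is (x, y)."""
    return sum(abs(cur[i][j] - ref[y + i][x + j]) for i in range(16) for j in range(16))


def step_search(cur: Block, ref: Block, x0: int, y0: int, first_step: int = 8) -> tuple[int, int]:
    """Step-halving block search around (x0, y0); returns the motion vector (u, v)."""
    u = v = 0
    step = first_step
    while step >= 1:  # repeat until the step size drops below 1
        # Centre listed first so that, on a tie, min() keeps the current centre.
        offsets = [(0, 0)] + [
            (dx, dy)
            for dy in (-step, 0, step)
            for dx in (-step, 0, step)
            if (dx, dy) != (0, 0)
        ]
        best = min(offsets, key=lambda d: sad(cur, ref, x0 + u + d[0], y0 + v + d[1]))
        u, v = u + best[0], v + best[1]
        step //= 2  # next step is half the previous one
    return u, v


if __name__ == "__main__":
    # Frame with all-distinct pixel values, so the SAD is zero only at the true offset.
    frame = [[x + 64 * y for x in range(64)] for y in range(64)]
    cur = [row[32:48] for row in frame[16:32]]  # block displaced by (8, -8) from (24, 24)
    print(step_search(cur, frame, 24, 24))  # (8, -8)
```

Each call evaluates at most 9 candidates per step over 4 steps, which mirrors the four successive matches performed by the device.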

2. A fast motion prediction device, characterized in that it comprises:

- an external RAM interface unit (1), which outputs the current macroblock data, the search-window data, and control signals;

- a current macroblock pixel storage unit (2), which serially outputs the data received from the external RAM interface unit (1) to the arithmetic units PE;

- an I/P macroblock discrimination threshold generation unit (3), which receives data from the external RAM interface unit (1) and the current macroblock pixel storage unit (2) and outputs the macroblock discrimination threshold computed by formulas (III) and (IV), where formulas (III) and (IV) are:

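\mathrm{MB\_mean} = \frac{1}{256} \sum_{i=1}^{16} \sum_{j=1}^{16} \mathrm{original}(i, j) \qquad \text{(III)}

\mathrm{tg\_thrsd} = \sum_{i=1}^{16} \sum_{j=1}^{16} \left| \mathrm{original}(i, j) - \mathrm{MB\_mean} \right| \qquad \text{(IV)}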

where original(i, j) denotes the value of the pixel at row i, column j of the original macroblock;

- a search-window data storage unit (4), which receives serially the data of the 9 macroblocks centered on the macroblock at the same position as the current macroblock in the previous frame and stores them line by line into ram0–ram8, the data of the 3 macroblocks in each vertical column being stored with a period of 16 clocks and with 3 RAMs forming a group. Under the control of the address signals produced by the matching-macroblock pixel access address generation unit (7), it converts the addresses of the 9 candidate macroblocks one by one into the read addresses rd_addr1(0–8) of ram0–ram8 according to formula (I), then determines the corresponding RAM by a table-lookup operation, which implements the switching of addresses inside the RAMs, and outputs the results in parallel to the arithmetic units PE (5). The 9 candidate macroblocks are processed simultaneously, and the 9 addresses obtained in each clock correspond to the 9 RAMs respectively;

where formula (I) is:

\mathrm{rd\_addr1}(i) = \begin{cases} 6 \cdot \mathrm{div}(swy, 3) + \mathrm{div}(swx, 3), & \text{if } \mathrm{rem}(\mathrm{rem}(swx, 16), 3) = 0 \\ 5 \cdot \mathrm{div}(swy, 3) + \mathrm{div}(swx, 3), & \text{if } \mathrm{rem}(\mathrm{rem}(swx, 16), 3) \neq 0 \end{cases} \qquad \text{(I)}

where swx and swy denote the x and y coordinates of the pixel in the search window;

rem denotes the remainder;

div denotes the integer quotient;

- arithmetic units PE (5), which subtract the received current macroblock data from the candidate macroblock data, take the absolute value, and output the results to the macroblock matching processing unit;

- a macroblock matching processing unit (6), which receives data from the arithmetic units PE (5) and from the I/P macroblock discrimination threshold generation unit (3), accumulates the data from the arithmetic units PE (5), and obtains the mean absolute error (MAD) values of the 9 candidate macroblocks after 256 clocks. The 9 MAD values are compared with the I/P macroblock threshold: a macroblock whose MAD is less than the threshold tg_thrsd is a P macroblock, otherwise it is an I macroblock. Among the P macroblocks, the one with the minimum MAD is the matching macroblock, and the unit outputs the address of the reference point of this macroblock and the motion vector; the best-matching macroblock is obtained after four matches;

- a matching-macroblock pixel access address generation unit (7), which, from the search-window address of the reference-point pixel of the input matching macroblock, computes the addresses of the data of the candidate macroblocks required for the next match and outputs them to the search-window data storage unit (4) so that the next matching operation can be performed.

3. The fast motion prediction device according to claim 2, characterized in that the current macroblock pixel storage unit (2) comprises a current macroblock pixel buffer unit and a current macroblock pixel access address generation unit.

4. The fast motion prediction device according to claim 2, characterized in that the macroblock matching processing unit (6) comprises a data accumulation circuit and a data comparison circuit.