WO2000064182A1 - Motion estimation - Google Patents

Motion estimation

Info

Publication number
WO2000064182A1
WO2000064182A1
Authority
WO
WIPO (PCT)
Prior art keywords
horizontal
sum
pixel
sums
pixel array
Prior art date
Application number
PCT/EP2000/002583
Other languages
French (fr)
Inventor
Michael Bakhmutsky
Viktor Gornstein
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Priority claimed from US09/287,161 (US6480629B1)
Priority claimed from US09/287,165 (US6360015B1)
Application filed by Koninklijke Philips Electronics N.V.
Priority to JP2000613195A (JP2002542737A)
Priority to EP00920556A (EP1086591A1)
Priority to KR1020007013845A (KR20010052624A)
Publication of WO2000064182A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/59: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/43: Hardware specially adapted for motion estimation or compensation
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51: Motion estimation or motion compensation

Definitions

  • in the exemplary horizontal search of Fig. 5, the reference pixel array is displaced by one pixel to the right at each step: pixel number 1 is read out of the local memory 22 and replaced by pixel number 9, then pixel number 2 is read out and replaced by pixel number 10, pixel number 3 by pixel number 11, pixel number 4 by pixel number 12, and so on for each subsequent one-pixel displacement.
  • the search engine 40 includes a Field 1 orthogonal-sum generator 20a (like the one depicted in Fig. 4) and a parallel Field 2 orthogonal-sum generator 20b (like the one depicted in Fig. 4).
  • the Field 1 orthogonal-sum generator 20a receives four new pixels over parallel lines 44 from a Field 1 anchor memory 45 upon each one pixel displacement of a Field 1 reference pixel array during a horizontal search operation
  • the Field 2 orthogonal-sum generator 20b receives four new pixels over parallel lines 46 from a Field 2 anchor memory 47 upon each one pixel displacement of a Field 2 reference pixel array during a horizontal search operation.
  • a Field 1 orthogonal-sum generator 50a receives the pixels of a Field 1 macroblock currently being encoded (i.e., coded macroblock) from a Field 1 coded picture memory 52
  • a Field 2 orthogonal-sum generator 50b receives the pixels of a Field 2 coded macroblock from a Field 2 coded picture memory 54.
  • the Field 1 orthogonal-sum generator 50a produces at its outputs the full set of orthogonal sums (both horizontal and vertical) representing the orthogonal-sum signature of the Field 1 coded macroblock
  • the Field 2 orthogonal-sum generator 50b produces at its outputs the full set of orthogonal sums representing the orthogonal-sum signature of the Field 2 coded macroblock.
  • the search engine 40 further includes a Field 1 best match estimator 60 that receives at one set of inputs the orthogonal-sum signature of the current reference pixel array, and at another set of inputs the orthogonal-sum signature of the Field 1 coded macroblock, and then determines, in accordance with a prescribed search metric (e.g., MAE), which of the reference pixel arrays from the specified search region of the Field 1 anchor memory 45 constitutes the best match for the coded macroblock, and outputs the result as the "Field 1 Motion Vector".
  • the search engine 40 further includes a Field 2 best match estimator 62 that receives at one set of inputs the orthogonal-sum signature of the current reference pixel array, and at another set of inputs the orthogonal-sum signature of the Field 2 coded macroblock, and then determines, in accordance with a prescribed search metric (e.g., MAE), which of the reference pixel arrays from the specified search region of the Field 2 anchor memory 47 constitutes the best match for the coded macroblock, and outputs the result as the "Field 2 Motion Vector".
  • the search engine RAMs can be combined to store data for both fields, since these RAMs are controlled in the identical way for both fields.
  • the preferred embodiment of the invention greatly accelerates the motion estimation method using orthogonal-sum block matching described above.
  • the present invention achieves the following three significant advantages over the presently available technology:
  • the present invention enables the usage of RAMs to store the search data rather than the usage of a massive register matrix to store the search data, as is required by the presently available technology, which requires that all of the engine memory's outputs be immediately available for comparison, thereby providing substantial cost savings;
  • a motion estimation search engine can be implemented with logic and memory integrated into a single silicon device using emerging embedded memory technologies in order to thereby enhance system performance due to wider internal bus widths, among other things.
  • a RAM-based search engine for updating a horizontal sum representing the sum of the values of N pixels contained in a horizontal row of a reference pixel array during a motion estimation search during which the reference pixel array is displaced by one pixel in a horizontal search direction during each of a plurality of iterations of the motion estimation search.
  • the RAM-based search engine includes a horizontal sum modifier circuit that accumulates the values of the N pixels contained in the horizontal row of the reference pixel array prior to any displacement of the reference pixel array to produce the horizontal sum, and that updates the horizontal sum by computing the new horizontal sum using the following equation:
  • OS_NEW = OS_OLD - a_OO + a_NO
  • OS_NEW is the new horizontal sum after the last displacement of the reference pixel array by one pixel in the horizontal direction
  • OS_OLD is the horizontal sum prior to the last displacement of the reference pixel array by one pixel in the horizontal direction
  • a_OO is the pixel value of the pixel that was the horizontal origin of the reference pixel array prior to the last displacement of the reference pixel array by one pixel in the horizontal direction
  • a_NO is the pixel value of the pixel that is the horizontal origin of the reference pixel array after the reference pixel array has been displaced by one pixel to the right with respect to the previous position of the reference pixel array as a result of the last displacement of the reference pixel array by one pixel in the horizontal direction.
  • the word “comprising” does not exclude the presence of elements or steps other than those listed in a claim.
  • the word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements.
  • the invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method for determining a best match between a first pixel array in a picture currently being encoded and a plurality of second pixel arrays in a search region of a reference picture, wherein each of the first and second pixel arrays includes a plurality of rows and columns of individual pixel values. The method is designed to be performed in a motion estimation search engine of a digital video encoder, and includes the steps of producing a first orthogonal-sum signature of the first pixel array (M1) comprised of a set of horizontal sums (S1H-S16H) representative of the sums of the individual pixel values of the rows of the first pixel array and a first set of vertical sums (S1V-S16V) representative of the sums of the individual pixel values of the columns of the first pixel array; producing a plurality of second orthogonal-sum signatures for respective ones of at least selected ones of the plurality of second pixel arrays, each of the plurality of second orthogonal-sum signatures being comprised of a set of horizontal sums (S1H-S16H) representative of the sums of the individual pixel values of the rows of a respective one of the second pixel arrays and a set of vertical sums (S1V-S16V) representative of the sums of the individual pixel values of the columns of a respective one of the second pixel arrays; and, comparing the first orthogonal-sum signature with each of the second orthogonal-sum signatures in order to determine the best match between the first and second pixel arrays.

Description

Motion estimation.
In general, the encoding of an MPEG video data stream requires a number of steps. The first of these steps consists of partitioning each picture into macroblocks. Next, in theory, each macroblock of each "non-intra" picture in the MPEG video data stream is compared with all possible 16-by-16 pixel arrays located within specified vertical and horizontal search ranges of the current macroblock's corresponding location in the anchor picture(s). This theoretical "full search algorithm" (i.e., searching through every possible block in the search region for the best match) always produces the best match, but is seldom used in real-world applications because of the tremendous number of calculations it would require: for a block size of NxN and a search region of (N+2w) by (N+2w), the distortion function MAE has to be calculated (2w+1)² times for each block. Rather, it is used only as a reference or benchmark to enable comparison of different, more practical motion estimation algorithms that can be executed far faster and with far fewer computations. These more practical motion estimation algorithms are generally referred to as "fast search algorithms".
The aforementioned search or "motion estimation" procedure, for a given prediction mode, results in a motion vector that corresponds to the position of the closest-matching macroblock (according to a specified matching criterion) in the anchor picture within the specified search range. Once the prediction mode and motion vector(s) have been determined, the pixel values of the closest-matching macroblock are subtracted from the corresponding pixels of the current macroblock, and the resulting 16-by-16 array of differential pixels is then transformed into 8-by-8 "blocks", on each of which is performed a discrete cosine transform (DCT), the resulting coefficients of which are each quantized and Huffman-encoded (as are the prediction type, motion vectors, and other information pertaining to the macroblock) to generate the MPEG bit stream. If no adequate macroblock match is detected in the anchor picture, or if the current picture is an intra, or "I-" picture, the above procedures are performed on the actual pixels of the current macroblock (i.e., no difference is taken with respect to pixels in any other picture), and the macroblock is designated an "intra" macroblock.
For all MPEG-2 prediction modes, the fundamental technique of motion estimation consists of comparing the current macroblock with a given 16-by-16 pixel array in the anchor picture, estimating the quality of the match according to the specified metric, and repeating this procedure for every such 16-by-16 pixel array located within the search range. The hardware or software apparatus that performs this search is usually termed the "search engine", and there exist a number of well-known criteria for determining the quality of the match. Among the best-known criteria are the Minimum Absolute Error (MAE), in which the metric consists of the sum of the absolute values of the differences of each of the 256 pixels in the macroblock with the corresponding pixel in the matching anchor picture macroblock; and the Minimum Square Error (MSE), in which the metric consists of the sum of the squares of the above pixel differences.
In either case, the match having the smallest value of the corresponding sum is selected as the best match within the specified search range, and its horizontal and vertical positions relative to the current macroblock therefore constitute the motion vector. If the resulting minimum sum is nevertheless deemed too large, a suitable match does not exist for the current macroblock, and it is coded as an intra macroblock. For the purposes of the present invention, either of the above two criteria, or any other suitable criterion, may be used.
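As an illustration of these two matching criteria, the following sketch (not taken from the patent; the function and variable names are illustrative) computes the MAE and MSE distortion values for a single candidate position, using NumPy arrays of 8-bit luminance values:
```python
# Illustrative sketch (not from the patent): MAE and MSE block-matching
# metrics for two equally sized luminance arrays.
import numpy as np

def mae(current: np.ndarray, candidate: np.ndarray) -> int:
    """Sum of absolute pixel differences (Minimum Absolute Error metric)."""
    return int(np.abs(current.astype(int) - candidate.astype(int)).sum())

def mse(current: np.ndarray, candidate: np.ndarray) -> int:
    """Sum of squared pixel differences (Minimum Square Error metric)."""
    diff = current.astype(int) - candidate.astype(int)
    return int((diff * diff).sum())

# Example: compare a 16x16 macroblock against one candidate position.
rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(16, 16), dtype=np.uint8)
cand = rng.integers(0, 256, size=(16, 16), dtype=np.uint8)
print(mae(block, cand), mse(block, cand))
```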
The various fast search algorithms evaluate the distortion function (e.g., the MAE function) only at a predetermined subset of the candidate motion vector locations within the search region, thereby reducing the overall computational effort. These algorithms are based on the assumption that the distortion measure is monotonically decreasing in the direction of the best match prediction. Even though this assumption is not always true, such an algorithm can still find a suboptimal motion vector with much less computation.
The most commonly used approach to motion estimation is a hybrid approach generally divided into several processing steps. First, the image can be decimated by pixel averaging. Next, the fast search algorithm operating on a smaller number of pixels is performed, producing a result in the vicinity of the best match. Then, a full search algorithm in a smaller search region around the obtained motion vector is performed. If half-pel vectors are required (as with MPEG-2), a half-pel search is performed as a separate step or is combined with the limited full search.
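The first of these steps, decimation by pixel averaging, can be pictured with the following sketch (my own illustration, not part of the patent; the function name is invented). It reduces a 16-by-16 macroblock to the 8-by-8 block used by the coarse search stage by averaging each 2-by-2 group of pixels:
```python
# Illustrative sketch (not from the patent): 2:1 decimation of a 16x16
# macroblock by averaging each non-overlapping 2x2 group of pixels.
import numpy as np

def decimate_2to1(block: np.ndarray) -> np.ndarray:
    """Average non-overlapping 2x2 pixel groups (2:1 horizontally and vertically)."""
    h, w = block.shape
    groups = block.astype(np.uint16).reshape(h // 2, 2, w // 2, 2)
    return (groups.mean(axis=(1, 3)) + 0.5).astype(np.uint8)  # rounded average

block16 = np.arange(256, dtype=np.uint8).reshape(16, 16)
print(decimate_2to1(block16).shape)  # (8, 8)
```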
Even with the great savings that can be achieved in the hybrid approach to motion estimation, an enormous number of computations still has to be performed for each iteration of computing MAE. Assuming that the distortion function has to be computed every clock cycle for every block offset, which is desirable in demanding applications such as MPEG-2 HDTV where the motion block size is 16-by-16, a distortion function computational unit (DFCU) will consist of a number of simpler circuits of increasing bit width starting from 8 (8-bit luminance data is used for motion estimation) to produce MAE. This number will be equal to the sum of the following: 256 subtraction circuits, 256 absolute value compute circuits, and 255 summation circuits of increasing bit width, for a total of 767 circuits of increasing bit width starting with 8, per DFCU.
Depending on picture resolution, a number of these extremely complex units will be required for a practical system. Using a smaller number of circuits within a DFCU in order to reuse its hardware is possible, but will substantially increase processing time and may not be acceptable in demanding applications such as HDTV. In this case, the number of DFCUs will simply have to be increased to compensate through enhanced parallel processing. The first step in the hybrid approach to motion estimation (rough search) is usually the most demanding step in terms of hardware utilization because it has to cover the largest search region in order to produce a reasonably accurate match.
Based on the foregoing, there presently exists a need in the art for a method for motion estimation that enhances the speed at which motion estimation can be performed, that greatly reduces the amount and complexity of the motion estimation or DFCU hardware required to perform motion estimation, and that provides for significant picture quality improvement at a reasonable cost. The present invention as defined by the independent claims fulfills this need in the art. The dependent claims define advantageous embodiments. In overview, the method of the present invention searches for best matches by comparing unique macroblock signatures rather than by comparing the individual luminance values of the collocated pixels in the current macroblock and the search region. This method is based on the same assumption as all fast search algorithms, i.e., that the distortion measure is monotonically decreasing in the direction of the best match prediction.
The present invention encompasses a method for determining a best match between a first pixel array in a picture currently being encoded and a plurality of second pixel arrays in a search region of a reference picture, wherein each of the first and second pixel arrays includes a plurality of rows and columns of individual pixel values. The method is designed to be performed in a motion estimation search engine of a digital video encoder, and includes the steps of producing a first orthogonal-sum signature of the first pixel array comprised of a set of horizontal sums representative of the sums of the individual pixel values of the rows of the first pixel array and a first set of vertical sums representative of the sums of the individual pixel values of the columns of the first pixel array; producing a plurality of second orthogonal-sum signatures for respective ones of at least selected ones of the plurality of second pixel arrays, each of the plurality of second orthogonal-sum signatures being comprised of a set of horizontal sums representative of the sums of the individual pixel values of the rows of a respective one of the second pixel arrays and a set of vertical sums representative of the sums of the individual pixel values of the columns of a respective one of the second pixel arrays; and, comparing the first orthogonal-sum signature with each of the second orthogonal-sum signatures in order to determine the best match between the first and second pixel arrays. In a disclosed embodiment, the first and second pixel arrays are either decimated or undecimated macroblocks having a structure defined by an MPEG standard, e.g., the MPEG-2 standard.
The present invention also encompasses a device, e.g., a motion estimation search engine of a digital video encoder, that implements the method of the present invention.
In a preferred embodiment, the method and device of the present invention greatly reduce the computational requirements and significantly accelerate the motion estimation search by storing in a local memory and extensively reusing previously computed (available) sums to produce the orthogonal sums, thereby also significantly reducing the motion estimation search engine hardware requirements. Further, the local memory can advantageously be a RAM, e.g., a DRAM or SRAM, as opposed to being implemented as a matrix of shift registers, as is necessary with the presently available technology. However, although this constitutes a novel and presently preferred feature of the present invention, in one of its aspects, this is not in and of itself an essential feature of the present invention, in its broadest sense, as will become fully apparent hereinafter.
These and other objects, features, and advantages of the present invention will be readily understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1A is a diagram that illustrates a 32-orthogonal sum signature for an undecimated 16-by-16 macroblock;
Fig. 1B is a diagram that illustrates a 16-orthogonal sum signature for an 8-by-8 macroblock that represents a 2:1 decimated 16-by-16 macroblock;
Fig. 2 is a combination flow chart and graph that illustrates best match estimation in accordance with a preferred embodiment of the present invention;
Fig. 3 is a diagram that depicts the basic methodology of a preferred implementation of the present invention, in the context of an orthogonal-sum update in a horizontal motion estimation search;
Fig. 4 is a block diagram of an orthogonal-sum generator that constitutes an exemplary embodiment of the present invention;
Fig. 5 is a diagram that illustrates the sequence of RAM operations in an illustrative horizontal motion estimation search using the methodology of the present invention; and
Fig. 6 is a block diagram of a motion estimation search engine that constitutes an exemplary embodiment of the present invention.
In overview, the motion estimation method of the present invention generally consists of the following steps. First, the individual pixel values of each row and column of a current macroblock are summed, to produce a set of orthogonal sums that represent a unique pattern or "signature" of that macroblock's content. Next, the resultant orthogonal-sum signature of that macroblock is compared with the corresponding orthogonal-sum signatures of each macroblock-sized pixel array in a prescribed search region of the reference or anchor picture(s), and a search is made for the best match according to a prescribed matching criterion or search metric, e.g., the Minimum Absolute Error (MAE) distortion function. Because it is statistically improbable that macroblocks having different contents will have the same signature, there is a low probability of a false match. Further, since the orthogonal sums represent an average luminance magnitude per line (row) or column, small-step increments in the macroblock origin within the search region will not be able to produce large jumps in magnitude for bandwidth-limited filtered video. For this reason, it can be concluded that the distortion measure computed based on matching the orthogonal sum sets will be monotonically decreasing in the direction of the best match prediction, just as in the prior art search methods.
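The signature itself is simply one sum per row plus one sum per column. The sketch below (an illustration under that reading, not the patent's own code; the function and variable names are mine) computes such a signature for a macroblock supplied as a NumPy array:
```python
# Illustrative sketch (not from the patent's figures): computing the
# orthogonal-sum "signature" of a macroblock, i.e., one sum per row and
# one sum per column of luminance values. For a 16x16 macroblock this
# yields 32 sums; for an 8x8 decimated macroblock, 16.
import numpy as np

def orthogonal_sum_signature(block: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (horizontal_sums, vertical_sums) for a 2-D luminance array."""
    pixels = block.astype(np.uint32)
    horizontal_sums = pixels.sum(axis=1)  # one sum per row    (S1H..SNH)
    vertical_sums = pixels.sum(axis=0)    # one sum per column (S1V..SNV)
    return horizontal_sums, vertical_sums

mb = np.random.default_rng(1).integers(0, 256, size=(16, 16), dtype=np.uint8)
h_sums, v_sums = orthogonal_sum_signature(mb)
print(len(h_sums) + len(v_sums))  # 32 sums for an undecimated macroblock
```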
With reference now to Figs. 1A and 1B, specific illustrations of the motion estimation method of the present invention will now be described. More particularly, with reference now to Fig. 1A, the individual pixel (luminance) values for each row (1H-16H) and for each column (1V-16V) of an undecimated 16-by-16 macroblock M1 are summed, to thereby produce a set of orthogonal sums S1H to S16H (horizontal sums) and S1V to S16V (vertical sums) that collectively constitute the orthogonal-sum signature of the undecimated 16-by-16 macroblock M1. With reference now to Fig. 1B, the individual pixel (luminance) values for each row (1H-8H) and for each column (1V-8V) of the 8-by-8 macroblock M1' are summed, to thereby produce a set of orthogonal sums S1H to S8H (horizontal sums) and S1V to S8V (vertical sums) that collectively constitute the orthogonal-sum signature of the 8-by-8 macroblock M1'. The 8-by-8 macroblock M1' constitutes the macroblock M1 decimated 2:1 both horizontally and vertically.
With reference now to Fig. 2, the motion estimation method of the present invention is performed as follows. More particularly, a best match estimation procedure is carried out by comparing (match estimation ME) the orthogonal-sum signatures of a current coded macroblock (CM) with the orthogonal-sum signatures of each macroblock in a specified search region of a reference or anchor picture (search area macroblock SAM), and then selecting as the best match (BM) the reference (search area) macroblock that has the highest degree of correlation with the orthogonal sum set of the current macroblock according to a specified matching criterion (search metric), e.g., MAE, MSE, or any other suitable metric. The graph in the lower half of Fig. 2 shows the magnitudes M of the orthogonal sum set members.
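A brute-force software rendering of this matching step is sketched below (an illustration only; the search range, array sizes, and function names are assumptions, not the patent's). It scores every candidate offset by the MAE between its orthogonal-sum signature and that of the coded macroblock, and keeps the best one; each comparison involves 2N numbers rather than NxN pixels. Each candidate signature is recomputed from scratch here for clarity; the incremental update described further below avoids that recomputation.
```python
# Illustrative sketch (not from the patent): signature-based best-match
# search over a small search region, scored with MAE on the signatures.
import numpy as np

def signature(block):
    p = block.astype(np.int64)
    return np.concatenate([p.sum(axis=1), p.sum(axis=0)])  # row sums + column sums

def signature_best_match(coded_mb, reference, search_range=8):
    n = coded_mb.shape[0]
    coded_sig = signature(coded_mb)
    best = None
    for dy in range(search_range + 1):
        for dx in range(search_range + 1):
            cand = reference[dy:dy + n, dx:dx + n]
            score = int(np.abs(signature(cand) - coded_sig).sum())
            if best is None or score < best[0]:
                best = (score, (dx, dy))
    return best  # (distortion, motion vector)

rng = np.random.default_rng(2)
ref = rng.integers(0, 256, size=(24, 24), dtype=np.uint8)
mb = ref[3:19, 5:21].copy()                 # the true match is at (5, 3)
print(signature_best_match(mb, ref, 8))     # expect distortion 0 at (5, 3)
```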
Due to the high complexity of the distortion function computational unit (DFCU), the motion estimation search is normally performed at least initially on decimated video (i.e., decimated macroblocks). For example, in the case of generating orthogonal sum sets for the undecimated macroblock depicted in Fig. 1A, the number of sums representing the 16-by-16 macroblock's orthogonal-sum signature is 32 (2x16), whereas in the case of generating orthogonal sum sets for the 2:1 decimated macroblock depicted in Fig. 1B, the number of sums representing the 8-by-8 macroblock's orthogonal-sum signature is reduced to 16 (2x8). It is quite apparent that evaluating a distortion function for 2N numbers will substantially reduce the DFCU computational requirements relative to the existing technology, which requires that the distortion function be evaluated for N² numbers. For example, in the case of the undecimated 16-by-16 macroblock depicted in Fig. 1A, the distortion function has to be evaluated for 8 times fewer numbers (256/32), and in the case of the decimated 8-by-8 macroblock depicted in Fig. 1B, the distortion function has to be evaluated for 4 times fewer numbers (64/16).
As stated previously, the computational complexity of the DFCU is a major factor in the cost of the motion estimation circuitry (search engine). However, since the motion estimation method of the present invention provides such a dramatic reduction in the cost and complexity of the DFCU, it becomes much more practical to start with undecimated or low-level decimated video for motion estimation searching, thereby improving motion estimation search accuracy, and ultimately, picture quality. In this connection, not only does the motion estimation method of the present invention allow a substantial reduction in the number of stages of motion estimation, but it also allows for the elimination of the special video filtering circuitry required for all eliminated decimation stages. With such hardware savings, the search process can potentially be started with undecimated video, producing a substantial quality improvement at reasonable cost.
Another advantage realized with the motion estimation method of the present invention is greatly enhanced speed of operation. Traditionally, multiple stages of logic are required in order to compare collocated luminance magnitudes, practically excluding the possibility of obtaining results in a single clock cycle. For this reason, either the system clock frequency has to be substantially reduced or the system has to be pipelined utilizing substantial logic resources. The motion estimation method of the present invention allows for concurrent computation of orthogonal sums easily achieved in a single clock cycle, followed by dramatically reduced MAE computation.
In addition to these advantages, the invention greatly reduces the cross-communication between computations performed on the data originating in different memories. This allows for precomputation and storage of intermediate results (orthogonal sums) prior to motion estimation, which can be very beneficial in some hardware architectures.
With reference now to Fig. 3, the fundamental principle of a preferred implementation of the present invention will now be described. More particularly, in order to compute a horizontal (orthogonal) sum (OS_NEW) for an 8-pixel wide macroblock displaced by one pixel to the right with respect to a previous 8-pixel wide macroblock whose horizontal (orthogonal) sum (OS_OLD) was previously computed during a previous iteration of a horizontal search, the following equation (1) is used: (1) OS_NEW = OS_OLD - a_OO + a_NO, where a_OO is the pixel value of the pixel that was the horizontal origin of the previous macroblock, and a_NO is the pixel value of the pixel that is the horizontal origin of the "new" macroblock, i.e., the macroblock displaced by one pixel to the right with respect to the previous macroblock.
For example, assuming the horizontal origin of the previous macroblock was the pixel labeled a_(n-1), so that the horizontal origin of the macroblock displaced by one pixel to the right is the pixel labeled a_n, then, using equation (1), OS_NEW = OS_OLD - a_(n-1) + a_(n+7). In other words, since pixel a_(n+7) is the only pixel that is contained in the new macroblock that was not contained in the previous macroblock, due to the one-pixel displacement to the right, its value must be added to the previously-computed orthogonal sum OS_OLD in computing OS_NEW; and since the pixel a_(n-1) is the only pixel that is not contained in the new macroblock but was contained in the previous macroblock, due to the one-pixel displacement to the right, its value must be subtracted from the previously-computed orthogonal sum OS_OLD in computing OS_NEW.
Similarly, as the horizontal search proceeds with an additional one-pixel displacement to the right, the horizontal origin of the previous macroblock becomes the pixel labeled a_n, and the horizontal origin of the "new" macroblock displaced by one pixel to the right becomes the pixel labeled a_(n+1); then, using equation (1), OS_NEW = OS_OLD - a_n + a_(n+8).
In other words, since pixel a_(n+8) is the only pixel that is contained in the new macroblock that was not contained in the previous macroblock, due to the additional one-pixel displacement to the right, its value must be added to the previously-computed orthogonal sum OS_OLD in computing OS_NEW, and since the pixel a_n is the only pixel that is not contained in the new macroblock but was contained in the previous macroblock, due to the additional one-pixel displacement to the right, its value must be subtracted from the previously-computed orthogonal sum OS_OLD in computing OS_NEW. This procedure for updating the value of the orthogonal sum OS_NEW is repeated for each additional one-pixel displacement during the horizontal search until the limit of the horizontal search range within the search region of the reference picture is reached, at which time the horizontal search for that row of the search region is completed.
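The following minimal sketch (not from the patent; the row and width names are illustrative) applies equation (1) to one horizontal sum for a single one-pixel displacement, subtracting the pixel that leaves at the left edge and adding the pixel that enters at the right edge, and checks the result against a full recomputation:
```python
# Illustrative sketch (not from the patent): incremental update of one
# horizontal sum as the reference array slides one pixel to the right.

def update_horizontal_sum(os_old: int, row, origin: int, width: int) -> int:
    """OS_NEW = OS_OLD - (pixel leaving at old origin) + (pixel entering at right)."""
    return os_old - row[origin] + row[origin + width]

row = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]   # one line of the search region
width = 8
os_old = sum(row[0:width])                         # sum over columns 0..7
os_new = update_horizontal_sum(os_old, row, 0, width)
assert os_new == sum(row[1:width + 1])             # matches a full recomputation
print(os_old, os_new)
```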
With reference now to Fig. 4, there can be seen a block diagram of an orthogonal-sum generator 20 that constitutes an exemplary embodiment of the present invention. At the outset, it should be recognized that although the invention is described using the example of an 8x4 macroblock, the present invention is not limited to macroblocks or pixel arrays of any particular size or structure. The motion estimation method of the present invention will now be described in conjunction with the orthogonal-sum generator 20 depicted in Fig. 4, although it should be appreciated that other hardware implementations of the method of the present invention will become readily apparent to those of ordinary skill in the pertinent art, and thus, are encompassed by the present invention, in its broadest sense.
First, a full orthogonal-sum signature of a macroblock currently being encoded ("coded macroblock") is computed by computing a set of horizontal sums representative of the sums of the individual pixel (luminance) values of the rows of that macroblock and a set of vertical sums representative of the sums of the individual pixel values of the columns of that macroblock, in the manner described above with reference to Figs. 1A, 1B and 2.
Second, an initialization procedure is executed by loading/writing into a local memory 22 (e.g., a DRAM, SRAM, or shift register matrix) the pixel values for a macroblock-sized initial reference pixel array (macroblock) having a specified origin in a specified search region of a reference picture stored in a reference picture (anchor) memory (not shown). The anchor memory is preferably organized in such a manner that its outputs are always adjacent vertically. For example, if the outputs of the anchor memory produce pixels from lines (rows) 1, 2, 3, and 4, then a one-pixel vertical displacement down will cause the anchor memory to produce pixels from lines (rows) 2, 3, 4, and 5. This can be achieved, for example, by appropriate partitioning of the anchor memory without increasing its size, using a method described in the non-prepublished international application No. PCT/IB99/00986 (Attorneys' docket PHA 23.420), the disclosure of which is herein incorporated by reference. During the initialization procedure, the full set of horizontal sums for the initial macroblock-sized reference pixel array is accumulated in a set of parallel horizontal sum modifier circuits 25, each having a subtract (-) input coupled to respective data outputs of the local memory 22; simultaneously (preferably), the vertical sums for each column of the initial reference pixel array are produced by a four-input vertical sum adder circuit 27, and the thusly-computed vertical sums are sequentially loaded into a shift register 29.
After this initialization procedure is completed, the motion estimation search method of the present invention works as follows. More particularly, as the motion estimation search proceeds pixel-by-pixel in the horizontal direction through the specified search region of the reference picture (hereinafter referred to as a "horizontal search"), the resultant reference pixel array will be correspondingly displaced by one pixel to the right with respect to the initial reference pixel array.
After each one pixel displacement within the search region, the pixel values stored in each row of the ith column of the local memory 22 are read out of the local memory 22 and applied to the subtract input of the respective horizontal sum modifier circuit 25, and the pixel values corresponding to the (N + i)th column of the search region of the reference picture are written into the respective rows of the ith column of the local memory 22 to replace the pixel values just read therefrom, where i = 1 through N, and N is the horizontal dimension of the initial reference pixel array (i.e., the horizontal dimension of the coded macroblock). Preferably, after N is reached, for memory addressing purposes, i will wrap back to a count of 1, and will be incremented by 1 until N is reached again, and the cycle repeated until the limit of the horizontal search range (as measured from the horizontal origin of the initial reference pixel array) has been reached and the horizontal search thus concluded. In this connection, a modulo-8 address counter (not shown) or other suitable mechanism can be utilized for performing this function. The pixel values corresponding to the (N + i)th column of the search region of the reference picture (hereinafter referred to simply as the "new pixel values") are also simultaneously applied to an add (+) input of the respective horizontal sum modifier circuits 25, and to respective inputs of the vertical sum adder circuit 27. By way of example, if the local memory 22 is a DRAM, the memory read and write operations described above can be performed during a single memory clock cycle via a read-modify-write operation.
Upon receiving the read-out and new pixel values, each of the horizontal sum modifier circuits 25 adds the new pixel value it received at its add input to the previously-accumulated horizontal sum, subtracts the read-out pixel value it received at its subtract input from that sum, and outputs the resultant sum as a "new" horizontal sum. That is, the set of horizontal sums produced at the outputs of the horizontal sum modifier circuits 25 will constitute the set of horizontal sums for the "new" reference pixel array that is displaced by one pixel from the reference pixel array of the previous iteration. Also, after each one pixel displacement, the shift register 29 is shifted horizontally by one word to the right, so that the vertical sum stored in its last stage is discarded, and the remaining vertical sums are shifted by one stage to the right. Upon receiving the new pixel values, the vertical sum adder circuit 27 produces at its output a "new" vertical sum that is loaded into the first stage of the shift register 29 (which is an N-word shift register) to replace the previous vertical sum that was shifted to the right. The resultant set of vertical sums that appear at the outputs of the shift register 29 constitute the set of vertical sums for the "new" reference pixel array that is displaced by one pixel from the reference pixel array of the previous iteration.
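The corresponding sum update can be sketched as follows, continuing the illustrative model above: each horizontal sum is corrected by the difference between the new and read-out pixel of its row, and the bounded deque standing in for shift register 29 discards the oldest vertical sum as the new one is appended.

```python
# Incremental update of the orthogonal sums for a one pixel displacement to the
# right, given the column read out of the local memory and the new column just
# written in its place.
def update_sums(horizontal_sums, vertical_sums, old_column, new_column):
    for r in range(len(horizontal_sums)):
        # Horizontal sum modifier: add the new pixel, subtract the read-out pixel.
        horizontal_sums[r] += new_column[r] - old_column[r]
    # Vertical sum adder + shift register: the sum of the new column enters, and
    # the vertical sum of the column that left the array is discarded
    # (deque(maxlen=COLS) drops the oldest entry automatically).
    vertical_sums.append(sum(new_column))
    return horizontal_sums, vertical_sums
```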
The above-described procedure is repeated after each one pixel displacement during the horizontal search through the search region of the reference picture until the horizontal search is concluded.
With reference now to Fig. 5, the sequence of memory read/write operations for an exemplary horizontal search in accordance with the exemplary embodiment of the present invention will now be described. The circles indicate active addresses, while the squares indicate inactive addresses. The column headed by (+) indicates the RAM input (pixel number), while the column headed by (-) indicates the RAM output (pixel number). More particularly, the first eight (8) horizontally adjacent pixels 1 through 8 for each row of the reference picture search region being searched are first stored in the four respective rows (sections) of the local memory 22 and, simultaneously, are accumulated in the corresponding horizontal sum modifier circuits 25. At this point, the horizontal sums output by the horizontal sum modifier circuits 25 are the valid horizontal sums for the initial reference pixel array (macroblock). Then, for each single-pixel displacement as the horizontal search through the search region of the reference picture proceeds, the address counter is incremented by one to point to pixel i, where i = 1 through 8, with the count repeating after the terminal count (8) is reached. This causes the old pixel value for each row of the local memory 22 to be read out of the currently addressed location in the local memory 22 and applied to the subtract (-) input of the respective horizontal sum modifier circuit 25, and the new pixel value for each row of the local memory 22 to be written into the currently addressed location in the local memory 22 and simultaneously applied to the add (+) input of the respective horizontal sum modifier circuit 25 and to the respective input of the four-input vertical sum adder circuit 27. Thus, after each single-pixel displacement, the updated full set of horizontal sums will be output by the horizontal sum modifier circuits 25 and the updated full set of vertical sums will be output by the shift register 29. For example, as can be seen diagrammatically in Fig. 5, after the first 8 pixels are written into the appropriate rows of the local memory 22, the reference pixel array is displaced by one pixel to the right, pixel number 1 is read out of the local memory 22, and pixel number 9 is written in its place; after the next single-pixel displacement, pixel number 2 is read out and replaced by pixel number 10; after the next, pixel number 3 is replaced by pixel number 11; and so on, so that after each subsequent single-pixel displacement pixel number i is read out and replaced by pixel number i + 8 (pixel 4 by pixel 12, pixel 5 by 13, 6 by 14, 7 by 15, 8 by 16, 9 by 17, 10 by 18, 11 by 19, etc.).
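Tying the above sketches together, the following purely illustrative driver slides a 4 x 8 window across a synthetic 4 x 20 strip of pixel values, mimicking the Fig. 5 sequence in which pixel i is read out and pixel i + 8 is written in its place, and checks after every displacement that the incrementally updated sums match sums recomputed from scratch.

```python
# End-to-end check of the sliding orthogonal-sum procedure, assuming the
# initialize, replace_column and update_sums sketches given earlier.
import random

ROWS, COLS = 4, 8
strip = [[random.randint(0, 255) for _ in range(20)] for _ in range(ROWS)]

local_mem = [row[:COLS] for row in strip]
h_sums, v_sums = initialize(local_mem)

address = 0
for offset in range(1, 20 - COLS + 1):                       # successive one pixel displacements
    new_column = [strip[r][offset + COLS - 1] for r in range(ROWS)]
    old_column, address = replace_column(local_mem, address, new_column)
    h_sums, v_sums = update_sums(h_sums, v_sums, old_column, new_column)

    window = [row[offset:offset + COLS] for row in strip]    # brute-force reference
    assert h_sums == [sum(row) for row in window]
    assert list(v_sums) == [sum(window[r][c] for r in range(ROWS)) for c in range(COLS)]
```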
With reference now to Fig. 6, there can be seen a block diagram of a field-based motion estimation search engine 40 that constitutes an exemplary implementation of the present invention. As can be seen, the search engine 40 includes a Field 1 orthogonal-sum generator 20a (like the one depicted in Fig. 2) and a parallel Field 2 orthogonal-sum generator 20b (like the one depicted in Fig. 2). The Field 1 orthogonal-sum generator 20a receives four new pixels over parallel lines 44 from a Field 1 anchor memory 45 upon each one pixel displacement of a Field 1 reference pixel array during a horizontal search operation, and the Field 2 orthogonal-sum generator 20b receives four new pixels over parallel lines 46 from a Field 2 anchor memory 47 upon each one pixel displacement of a Field 2 reference pixel array during a horizontal search operation. A Field 1 orthogonal-sum generator 50a receives the pixels of a Field 1 macroblock currently being encoded (i.e., coded macroblock) from a Field 1 coded picture memory 52, and a Field 2 orthogonal-sum generator 50b receives the pixels of a Field 2 coded macroblock from a Field 2 coded picture memory 54. The Field 1 orthogonal-sum generator 50a produces at its outputs the full set of orthogonal sums (both horizontal and vertical) representing the orthogonal-sum signature of the Field 1 coded macroblock, and the Field 2 orthogonal-sum generator 50b produces at its outputs the full set of orthogonal sums representing the orthogonal-sum signature of the Field 2 coded macroblock. With continuing reference to Fig. 6, the search engine 40 further includes a
Field 1 best match estimator 60 that receives at one set of inputs the orthogonal-sum signature of the current reference pixel array, and at another set of inputs the orthogonal-sum signature of the Field 1 coded macroblock, and then determines, in accordance with a prescribed search metric (e.g., MAE), which of the reference pixel arrays from the specified search region of the Field 1 anchor memory 45 constitutes the best match for the coded macroblock, and outputs the result as the "Field 1 Motion Vector". Similarly, the search engine 40 further includes a Field 2 best match estimator 62 that receives at one set of inputs the orthogonal-sum signature of the current reference pixel array, and at another set of inputs the orthogonal-sum signature of the Field 2 coded macroblock, and then determines, in accordance with the prescribed search metric (e.g., MAE), which of the reference pixel arrays from the specified search region of the Field 2 anchor memory 47 constitutes the best match for the coded macroblock, and outputs the result as the "Field 2 Motion Vector". It should be readily appreciated that, for a more efficient design implementation, the search engine RAMs can be combined to store data for both fields, since these RAMs are controlled identically for both fields.

As stated previously, the computational complexity of the DFCU is a major factor in the cost of the motion estimation circuitry (search engine). However, since the motion estimation method of the present invention provides such a dramatic reduction in the cost and complexity of the DFCU, it becomes much more practical to start with undecimated or low-level decimated video for motion estimation searching, thereby dramatically improving motion estimation search accuracy and, ultimately, picture quality. In this connection, not only does the motion estimation method of the present invention allow a substantial reduction in the number of stages of motion estimation, but it also allows for the elimination of the special video filtering circuitry required for all eliminated decimation stages. With such hardware savings, the search process can potentially be started with the undecimated video, producing a substantial quality improvement at reasonable cost.
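As a rough illustration of how the best match estimators 60 and 62 might operate on orthogonal-sum signatures, the sketch below scores each candidate reference pixel array against the coded macroblock with a mean absolute error over the sums. The text identifies the matching criterion only as a prescribed search metric (e.g., MAE), so this concrete form, and all of the names, are assumptions made for illustration.

```python
# Compare orthogonal-sum signatures (horizontal sums + vertical sums) using a
# mean absolute error, and pick the candidate displacement with the lowest score.
def signature_mae(sig_a, sig_b):
    """Each signature is a pair (horizontal_sums, vertical_sums) of equal-length sequences."""
    h_a, v_a = sig_a
    h_b, v_b = sig_b
    diffs = [abs(x - y) for x, y in zip(h_a, h_b)] + [abs(x - y) for x, y in zip(v_a, v_b)]
    return sum(diffs) / len(diffs)

def best_match(coded_signature, candidate_signatures):
    """candidate_signatures: dict mapping a candidate motion vector (dx, dy) to its signature."""
    return min(candidate_signatures,
               key=lambda mv: signature_mae(coded_signature, candidate_signatures[mv]))
```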
Another advantage realized with the motion estimation method of the present invention is greatly enhanced speed of operation. Traditionally, multiple stages of logic are required in order to compare collocated luminance magnitudes, practically excluding the possibility of obtaining results in a single clock cycle. For this reason, either the system clock frequency has to be substantially reduced or the system has to be pipelined utilizing substantial logic resources.
In addition to these advantages, the preferred embodiment of the invention greatly accelerates the motion estimation method using orthogonal-sum block matching described above. Moreover, the present invention achieves the following significant advantages over the presently available technology:
(1) Substantial hardware reduction in orthogonal-sum computations. Since the orthogonal sums are updated with the macroblock displacements in the anchor picture, using the already-available sums to produce the new (updated) orthogonal sums, a much smaller computational effort, requiring significantly less computational hardware, is made possible;
(2) The long chain of adder circuits otherwise needed to produce the orthogonal sums is eliminated, thereby substantially accelerating operation;
(3) The present invention enables the use of RAMs to store the search data, rather than the massive register matrix required by the presently available technology (which requires that all of the engine memory's outputs be immediately available for comparison), thereby providing substantial cost savings; and
(4) Due to its novel architecture, a motion estimation search engine according to the present invention can be implemented with logic and memory integrated into a single silicon device using emerging embedded memory technologies, thereby enhancing system performance through wider internal bus widths, among other things.
The preferred embodiment of the invention can be summarized as follows. A RAM-based search engine updates a horizontal sum representing the sum of the values of the N pixels contained in a horizontal row of a reference pixel array during a motion estimation search in which the reference pixel array is displaced by one pixel in a horizontal search direction during each of a plurality of iterations. The RAM-based search engine includes a horizontal sum modifier circuit that accumulates the values of the N pixels contained in the horizontal row of the reference pixel array prior to any displacement of the reference pixel array to produce the horizontal sum, and that updates the horizontal sum by computing the new horizontal sum using the following equation:
OSNEW = OSOLD - a00 + aN0, where OSNEW is the new horizontal sum after the last displacement of the reference pixel array by one pixel in the horizontal direction, OSOLD is the horizontal sum prior to the last displacement of the reference pixel array by one pixel in the horizontal direction, a00 is the pixel value of the pixel that was the horizontal origin of the reference pixel array prior to the last displacement of the reference pixel array by one pixel in the horizontal direction, and aN0 is the pixel value of the pixel that is the horizontal origin of the reference pixel array after the reference pixel array has been displaced by one pixel to the right with respect to the previous position of the reference pixel array as a result of the last displacement of the reference pixel array by one pixel in the horizontal direction.
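Expressed as code, the update equation amounts to a single subtraction and addition per row; the names below simply mirror the symbols of the equation and are otherwise illustrative.

```python
# The summary equation in code: the new horizontal sum is the old one minus the
# pixel value a00 that left the row plus the pixel value aN0 that entered it.
def new_horizontal_sum(os_old, a00, aN0):
    return os_old - a00 + aN0
```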
Although preferred embodiments of the present invention have been described in detail herein-above, it should be clearly understood that many variations and/or modifications of the basic inventive concepts taught herein that may appear to those skilled in the pertinent art will still fall within the scope of the present invention, as defined in the appended claims. For example, although the present invention is described as being applicable to digital video encoders, it should be clearly understood that the present invention is not limited to any particular application, e.g., it can be used in a decoder portion of a television set or other picture display system when it is necessary to encode the received picture to accommodate the requirements of the television set or other picture display system. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware.

CLAIMS:
1. A method of comparing a first pixel array having a plurality of rows and columns of individual pixel values, and a second pixel array having a plurality of rows and columns of individual pixel values, the method comprising the steps of: summing the individual pixel values of each row of individual pixel values of the first pixel array to produce a first set of horizontal sums; summing the individual pixel values of each column of individual pixel values of the first pixel array to produce a first set of vertical sums; summing the individual pixel values of each row of individual pixel values of the second pixel array to produce a second set of horizontal sums; summing the individual pixel values of each column of individual pixel values of the second pixel array to produce a second set of vertical sums; wherein the first set of horizontal sums and the first set of vertical sums comprise a first set of orthogonal sums, wherein the second set of horizontal sums and the second set of vertical sums comprise a second set of orthogonal sums, and comparing the first and second sets of orthogonal sums.
2. The method as set forth in Claim 1, wherein the first pixel array comprises an undecimated macroblock of a picture currently being encoded, and the second pixel array comprises an undecimated macroblock in a search region of a reference picture.
3. The method as set forth in Claim 1, wherein the first pixel array comprises a decimated macroblock of a picture currently being encoded, and the second pixel array comprises a decimated macroblock in a search region of a reference picture.
4. A method as claimed in Claim 1, wherein the first summing step comprises the step of updating a horizontal sum representing the sum of the values of N pixels contained in a horizontal row of a reference pixel array during a motion estimation search, the updating step including the steps of: computing the horizontal sum; displacing the reference pixel array by one pixel in a horizontal direction; and, updating the horizontal sum to produce a new horizontal sum by adding a new pixel value to the previously-computed horizontal sum, and subtracting an old pixel value no longer contained in the horizontal row of the reference pixel array after the displacing step, from the previously-computed horizontal sum.
5. The method as set forth in Claim 4, further including the step of repeating the displacing and updating steps until a limit of a horizontal search range is reached.
6. The method as set forth in Claim 4, wherein: the step of computing is performed by using a horizontal sum modifier circuit (25) that accumulates the values of the N pixels contained in the horizontal row of the reference pixel array prior to performing the step of displacing; and the step of updating the horizontal sum is performed by using the horizontal sum modifier circuit (25) to compute the new horizontal sum using the following equation:
OSNEW = OSOLD - a00 + aN0, where OSNEW is the new horizontal sum, OSOLD is the horizontal sum prior to the last iteration of the displacing step, a00 is the pixel value of the pixel that was the horizontal origin of the reference pixel array prior to the last iteration of the displacing step, and aN0 is the pixel value of the pixel that is the horizontal origin of the reference pixel array after the reference pixel array has been displaced by one pixel to the right with respect to the previous position of the reference pixel array as a result of the last iteration of the displacing step.
7. A method as claimed in Claim 1, further comprising the steps of generating a horizontal sum for each of N rows of a reference pixel array and simultaneously generating a vertical sum for each of M columns of the reference pixel array for each iteration of a horizontal motion estimation search of a prescribed search region of a reference picture, the method further including the steps of: (a) storing initial pixel values corresponding to an initial position of the reference pixel array by storing M individual pixel values in each of N rows of a memory (22) and storing N individual pixel values in each of M columns of the memory (22) ;
(b) computing the horizontal sum for each of the N rows of the initial position of the reference pixel array and storing each of the computed horizontal sums; (c) computing the vertical sum for each of the M columns of the initial position of the reference pixel array and storing the computed vertical sums in a shift register (29);
(d) displacing the reference pixel array by one pixel in a horizontal direction;
(e) in response to the displacing step: i) providing N new pixel values, one for each of the N rows of the reference pixel array corresponding to a last column of the reference pixel array after being displaced by one pixel in the horizontal direction; ii) summing the N new pixel values to produce a new vertical sum, and applying the new vertical sum to the shift register (29), and shifting the previously-stored vertical sums by one word in the horizontal direction of the motion estimation search, whereby a first-stored vertical sum is discarded and the new vertical sum is stored in the former storage location of a last-stored vertical sum; iii) outputting a set of M new vertical sums from the shift register (29); iv) updating each of the horizontal sums to produce a set of N new horizontal sums by adding the respective one of the N new pixel values to the previously-computed horizontal sum for each of the N rows, and by subtracting respective old pixel values no longer contained in the M columns of the reference pixel array after being displaced by one pixel in the horizontal direction from the previously-computed horizontal sum for each of the N rows; and v) outputting the set of N new horizontal sums.
8. The method as set forth in Claim 7, wherein step (b) is performed by using N horizontal sum modifier circuits (25) corresponding to respective ones of the N rows of the memory (22) , whereby each of the horizontal sum modifier circuits (25) accumulates the values of the M individual pixel values stored in the respective row of the memory (22).
9. A method for determining a best match between a first pixel array in a picture currently being encoded and a plurality of second pixel arrays in a search region of a reference picture, wherein each of the first and second pixel arrays includes a plurality of rows and columns of individual pixel values, the method comprising the steps of: providing a first orthogonal-sum signature of the first pixel array comprised of a set of horizontal sums representative of the sums of the individual pixel values of the rows of the first pixel array and a first set of vertical sums representative of the sums of the individual pixel values of the columns of the first pixel array; providing a plurality of second orthogonal-sum signatures for respective ones of at least selected ones of the plurality of second pixel arrays, each of the plurality of second orthogonal-sum signatures being comprised of a set of horizontal sums representative of the sums of the individual pixel values of the rows of a respective one of the second pixel arrays and a set of vertical sums representative of the sums of the individual pixel values of the columns of a respective one of the second pixel arrays; and comparing the first orthogonal-sum signature with each of the second orthogonal-sum signatures in order to determine the best match between the first and second pixel arrays.
10. A motion estimation device for determining a best match between a first pixel array in a picture currently being encoded and a plurality of second pixel arrays in a search region of a reference picture, wherein each of the first and second pixel arrays includes a plurality of rows and a plurality of columns of individual pixel values, the motion estimation search engine including: means for providing a first orthogonal-sum signature of the first pixel array comprised of a set of horizontal sums representative of the sums of the individual pixel values of the rows of the first pixel array and a first set of vertical sums representative of the sums of the individual pixel values of the columns of the first pixel array, and for providing a plurality of second orthogonal-sum signatures for respective ones of at least selected ones of the plurality of second pixel arrays, each of the plurality of second orthogonal-sum signatures being comprised of a set of horizontal sums representative of the sums of the individual pixel values of the rows of a respective one of the second pixel arrays and a set of vertical sums representative of the sums of the individual pixel values of the columns of a respective one of the second pixel arrays; and, means for comparing the first orthogonal-sum signature with each of the second orthogonal-sum signatures in order to determine the best match between the first and second pixel arrays.
11. The motion estimation device as set forth in Claim 10, wherein the first and second pixel arrays are each macroblocks having a structure defined by an MPEG standard.
12. A device as claimed in Claim 10, further comprising circuitry (20) for updating a horizontal sum representing the sum of the values of N pixels contained in a horizontal row of a reference pixel array during a motion estimation search, the updating circuitry (20) including: means for computing the horizontal sum; means for displacing the reference pixel array by one pixel in a horizontal direction; and means (25) for updating the horizontal sum to produce a new horizontal sum by adding a new pixel value to the previously-computed horizontal sum, and subtracting an old pixel value no longer contained in the horizontal row of the reference pixel array after displacement of the reference pixel array by one pixel in the horizontal direction, from the previously-computed horizontal sum.
13. A device as claimed in Claim 10, further comprising circuitry (20) for generating a horizontal sum for each of N rows of a reference pixel array and for simultaneously generating a vertical sum for each of M columns of the reference pixel array for each iteration of a horizontal motion estimation search of a prescribed search region of a reference picture, the updating circuitry (20) including: a memory (22) for storing initial pixel values corresponding to an initial position of the reference pixel array by storing M individual pixel values in each of N rows of the memory (22) and storing N individual pixel values in each of M columns of the memory (22); means (25) for computing the horizontal sum for each of the N rows of the initial position of the reference pixel array and for storing each of the computed horizontal sums; means (27) for computing the vertical sum for each of the M columns of the initial position of the reference pixel array; a shift register (29) for storing the computed vertical sums; means for displacing the reference pixel array by one pixel in a horizontal direction; means (25,27,29) for, in response to each displacement of the reference pixel array by one pixel in the horizontal direction: i) providing N new pixel values, one for each of the N rows of the reference pixel array corresponding to a last column of the reference pixel array after being displaced by one pixel in the horizontal direction; ii) summing the N new pixel values to produce a new vertical sum, and applying the new vertical sum to the shift register (29), and shifting the previously-stored vertical sums by one word in the horizontal direction of the motion estimation search, whereby a first-stored vertical sum is discarded and the new vertical sum is stored in the former storage location of a last-stored vertical sum; iii) outputting a set of M new vertical sums from the shift register (29); iv) updating each of the horizontal sums to produce a set of N new horizontal sums by adding the respective one of the N new pixel values to the previously-computed horizontal sum for each of the N rows, and by subtracting respective old pixel values no longer contained in the M columns of the reference pixel array after being displaced by one pixel in the horizontal direction from the previously-computed horizontal sum for each of the N rows; and v) outputting the set of N new horizontal sums.
14. A device as claimed in Claim 10, further comprising circuitry (20) for updating a horizontal sum representing the sum of the values of N pixels contained in a horizontal row of a reference pixel array during a motion estimation search during which the reference pixel array is displaced by one pixel in a horizontal search direction during each of a plurality of iterations of the motion estimation search, the updating circuitry (20) including a horizontal sum modifier circuit (25) that accumulates the values of the N pixels contained in the horizontal row of the reference pixel array prior to any displacement of the reference pixel array to produce the horizontal sum, and that updates the horizontal sum by computing the new horizontal sum using the following equation: OSNEW = OSOLD - a00 + aN0, where OSNEW is the new horizontal sum after the last displacement of the reference pixel array by one pixel in the horizontal direction, OSOLD is the horizontal sum prior to the last displacement of the reference pixel array by one pixel in the horizontal direction, a00 is the pixel value of the pixel that was the horizontal origin of the reference pixel array prior to the last displacement of the reference pixel array by one pixel in the horizontal direction, and aN0 is the pixel value of the pixel that is the horizontal origin of the reference pixel array after the reference pixel array has been displaced by one pixel to the right with respect to the previous position of the reference pixel array as a result of the last displacement of the reference pixel array by one pixel in the horizontal direction.
PCT/EP2000/002583 1999-04-06 2000-03-21 Motion estimation WO2000064182A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2000613195A JP2002542737A (en) 1999-04-06 2000-03-21 Motion estimation
EP00920556A EP1086591A1 (en) 1999-04-06 2000-03-21 Motion estimation
KR1020007013845A KR20010052624A (en) 1999-04-06 2000-03-21 Motion estimation

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US09/287,161 1999-04-06
US09/287,165 1999-04-06
US09/287,161 US6480629B1 (en) 1999-04-06 1999-04-06 Motion estimation method using orthogonal-sum block matching
US09/287,165 US6360015B1 (en) 1999-04-06 1999-04-06 RAM-based search engine for orthogonal-sum block match motion estimation system

Publications (1)

Publication Number Publication Date
WO2000064182A1 true WO2000064182A1 (en) 2000-10-26

Family

ID=26964301

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2000/002583 WO2000064182A1 (en) 1999-04-06 2000-03-21 Motion estimation

Country Status (5)

Country Link
EP (1) EP1086591A1 (en)
JP (1) JP2002542737A (en)
KR (1) KR20010052624A (en)
CN (1) CN1201589C (en)
WO (1) WO2000064182A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100393136C (en) * 2005-06-13 2008-06-04 北京北大方正电子有限公司 Method for searching active image series motion vector
FR2972829A1 (en) * 2011-03-18 2012-09-21 Norbert Beyrard METHOD FOR COMPRESSION OF DIGITAL IMAGE BY PROJECTIONS

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030103567A1 (en) * 2001-12-03 2003-06-05 Riemens Abraham Karel Motion compensation and/or estimation
EP2903278B1 (en) * 2002-08-08 2017-06-28 Godo Kaisha IP Bridge 1 Moving picture decoding method
CN100340116C (en) * 2005-01-21 2007-09-26 浙江大学 Motion estimating method with graded complexity
US8829409B2 (en) * 2012-10-10 2014-09-09 Thermo Fisher Scientific Inc. Ultra-high speed imaging array with orthogonal readout architecture

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JOON-SEEK KIM ET AL: "A FAST FEATURE-BASED BLOCK MATCHING ALGORITHM USING INTEGRAL PROJECTIONS", IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS,US,IEEE INC. NEW YORK, vol. 10, no. 5, 1 June 1992 (1992-06-01), pages 968 - 971, XP000276102, ISSN: 0733-8716 *
LU X ET AL: "IMPROVED FAST MOTION ESTIMATION USING INTEGRAL PROJECTION FEATURES FOR HARDWARE IMPLEMENTATION", IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS,US,NEW-YORK, NY: IEEE, 9 June 1997 (1997-06-09), pages 1337 - 1340, XP000832452, ISBN: 0-7803-3584-8 *
OGURA E ET AL: "A COST EFFECTIVE MOTION ESTIMATION PROCESSOR LSI USING A SIMPLE AND EFFICIENT ALGORITHM", IEEE TRANSACTIONS ON CONSUMER ELECTRONICS,US,IEEE INC. NEW YORK, vol. 41, no. 3, 1 August 1995 (1995-08-01), pages 690 - 696, XP000539525, ISSN: 0098-3063 *
PAN S B ET AL: "VLSI ARCHITECTURES FOR BLOCK MATCHING ALGORITHMS USING SYSTOLIC ARRAYS", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY,US,IEEE INC. NEW YORK, vol. 6, no. 1, 1 February 1996 (1996-02-01), pages 67 - 73, XP000625580, ISSN: 1051-8215 *

Also Published As

Publication number Publication date
KR20010052624A (en) 2001-06-25
JP2002542737A (en) 2002-12-10
EP1086591A1 (en) 2001-03-28
CN1201589C (en) 2005-05-11
CN1314052A (en) 2001-09-19

Legal Events

WWE Wipo information: entry into national phase (Ref document number: 00801040.4; Country of ref document: CN)
AK Designated states (Kind code of ref document: A1; Designated state(s): CN JP KR)
AL Designated countries for regional patents (Kind code of ref document: A1; Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE)
WWE Wipo information: entry into national phase (Ref document number: 2000920556; Country of ref document: EP; Ref document number: 1020007013845; Country of ref document: KR)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWP Wipo information: published in national office (Ref document number: 2000920556; Country of ref document: EP)
WWP Wipo information: published in national office (Ref document number: 1020007013845; Country of ref document: KR)
WWW Wipo information: withdrawn in national office (Ref document number: 2000920556; Country of ref document: EP)
WWR Wipo information: refused in national office (Ref document number: 1020007013845; Country of ref document: KR)