GB2415561A

GB2415561A - Comparing data from sets at same positions

Info

Publication number: GB2415561A
Application number: GB0413862A
Authority: GB
Inventors: Ken Cameron
Original assignee: Clearspeed Solutions Ltd; ClearSpeed Technology PLC; ClearSpeed Technology Ltd
Current assignee: ClearSpeed Technology PLC
Priority date: 2004-06-21
Filing date: 2004-06-21
Publication date: 2005-12-28
Also published as: GB0413862D0

Abstract

A function between a first set of data (a to h etc) and a succession of second sets of data (A to H etc) from an image frame is evaluated. Evaluating means (75, 76) for evaluating data items in the first set with data items in corresponding positions in a succession of second data sets; characterised by a plurality of evaluating means for simultaneously evaluating a data item in the first data set with a corresponding data item in each of n second data sets in the succession, and local memory means (71 to 73) for retaining at least one of the corresponding data items in a second data set, whereby the retained data item is reused in an evaluation of the next data item in the first data set with n data items in corresponding positions in at least the next second data set in the succession. The method avoids having to re-fetch the same data for successive operations. The results are accumulated (77, 78). The technique may perform a convolution between the data sets to emulate a filter or perform an effect (eg a blur or sharpen) or may be used as a video compression process to search for a block of image data in an image frame so as to encode a subsequent frame by reference to the closeness of the match and the offset of the matching block.

Description


DATA PROCESSING

Introduction

The present invention relates to a method, apparatus and system for processing image data, especially but not solely video data forming part of a video transmission or broadcast. Such processing may be to conserve bandwidth in a video transmission or to evaluate a function, such as a convolution.

In the discussion that follows, the invention is mostly described in the context of image (video) compression but the principles involved are equally applicable to more general data processing, where two sets of data items are processed to perform or evaluate a particular function. For the purposes of this specification, the major embodiment is a mechanism for searching for the appearance in an image frame of a block of data items (eg pixels) from another image frame as a precursor to real-time video compression by any one of a number of techniques known per se. Another embodiment utilises the same hardware to evaluate a function between a first set of data and a succession of sets of data from an image frame. In this embodiment, a convolution can be performed between the sets of data so as to emulate a filter or other image processing function such as a blur, sharpen, fade etc. First, the current state of the art as regards video compression will be outlined.

The Current State of the Art in Video Compression

A known technique for compressing video data in order to conserve bandwidth is to compare successive frames of a video stream in order to identify similarities between them. Where similarities exist, the amount of data that needs to be transmitted can be reduced. In the following discussion, two successive frames in a video sequence will be considered but, in practice, it is common to apply the technique across multiple frames and to work backwards as well as forwards in the video stream.

Because each of the frames is usually similar, ie there is a high level of coherence, a compression technique can be used where only differences are sent.

In this discussion, all of the known techniques separate colour images into their components. Although this could theoretically apply to RGB components, the artefacts are unlikely to be acceptable. Therefore, it is preferable to divide a video signal into YUV components, using one luminance and two chrominance (chroma) components. The compression technique is then applied independently to each of these components or channels. When applied to YUV components, artefacts are less likely to be objectionable since they will typically be restricted to luminance artefacts or colour artefacts but not a colour channel artefact.

One known compression technique is JPEG. It has several variations. A still image can be divided into blocks and each block compressed using JPEG. Similarly, a sequence of still images can be compressed using JPEG for each frame, in which case the overall compression process is known as Motion JPEG (MJPEG). For real motion picture compression, the MPEG technique is used, in which compression is applied to every nth frame without reference to any of the other frames in the stream. With advances in technology, better versions of such independent compression become available. This technique gives images that can be decompressed completely independently at intervals. It is then necessary to generate all the intermediate frames, for example by reference to the independent frames. A simple way is to compare one frame with the next. There is less information because generally there will have been fewer changes between successive images. Utilisation of this reduced amount of information, eg by transmitting only the changes, will clearly achieve data compression. What is actually done, instead of a one-to-one comparison (for example by comparing the top left pixel in one frame with the top left pixel in the next and so on), is to take a region of the image that is to be compressed, for example a block of 8 x 8 or 16 x 16 or 8 x 16 or 16 x 8 pixels and, for each of these blocks, to search the previous image to find somewhere in that image that contains that region. If that region can actually be found, a greatly improved level of compression can be achieved since that whole block can be represented by a pair of co-ordinates defining its location or offset rather than anything about the block per se. That block can then be reproduced on decompression by copying it to the particular offset relative to the previous image. The process is repeated for the next block which, in all probability, is likely to be fairly close. The process effectively takes a window from the new frame and attempts to find it somewhere in the previous frame.

The reason why this works is because normally, in a series of video images, objects are moving relatively slowly. In all probability the whole image has simply shifted slightly, so the block may simply have moved a small number of pixels in one direction. In the case of a background it is unlikely to have changed from frame to frame.

This will be true of all the pixels making up the background. On the other hand, in the case of a moving object, such as a vehicle, the block will be offset and there will not always be an exact match. If the match is so far out that no comparison can be found in the original frame, for example when a scene changes completely or a new object appears in a scene, it is then possible to consider the block as an independent block to compress on its own without reference to any preceding block. As a half way house, a search can be made to find the best match and, if it is within an acceptable tolerance, that match can be used as a basis for noting the differences and those differences will then use a much smaller amount of data. This is all part of the known MPEG technique.

One particular technique involves the use of a motion estimation algorithm to compare neighbouring "macroblocks" of 16x16 pixel elements using sum of absolute differences calculations. The motion estimation algorithm can be accelerated by creating acceleration logic to calculate the sum of absolute differences across many macroblocks simultaneously. A 128-bit data path for a sum of absolute differences accelerator could include streams of 16 pixels from first and second frames entering a difference block to calculate the differences, a block for calculating the absolute value of the differences, and a summing block that adds the absolute values of these differences to the previously calculated total of absolute values held in an accumulator.
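By way of illustration only, the data path just described may be modelled in software. The following Python sketch is not taken from any known accelerator; the function name `sad_stream` and the `lanes` parameter are hypothetical, and it assumes 8-bit pixel values supplied as plain lists, 16 pixels (128 bits) being consumed per step.

```python
def sad_stream(first, second, lanes=16):
    """Sum of absolute differences of two equal-length pixel streams,
    consumed `lanes` pixels per step (16 x 8-bit pixels = 128 bits)."""
    if len(first) != len(second):
        raise ValueError("streams must have equal length")
    accumulator = 0  # running total held between steps
    for start in range(0, len(first), lanes):
        chunk_a = first[start:start + lanes]
        chunk_b = second[start:start + lanes]
        differences = [x - y for x, y in zip(chunk_a, chunk_b)]  # difference block
        magnitudes = [abs(d) for d in differences]               # absolute value block
        accumulator += sum(magnitudes)                           # summing block
    return accumulator
```

A 16x16 macroblock supplied as 256 values is processed in 16 steps under these assumptions.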

However, performance is likely to be hindered by the limitations of the relatively narrow interfaces found on fixed processors. In the absence of additional logic to maximise data throughput across the processor interface, bus bandwidth may easily become the performance bottleneck for the entire operation. In that context, the main challenge is the design of hardware to keep the accelerator working at proper capacity.

Hardware accelerators with built-in DMA controllers and RAMs are often used to mitigate this problem.

Nevertheless, a mechanism is still needed to allocate a "score" to a match, representing the closeness or quality of match. That needs to be done for every block in the new image. There are at least two recognised techniques for doing this: one involves the sum of absolute differences as above, the other uses the sum of the squares of the differences. The latter is generally accepted as giving better compression but requires a much larger amount of computation. The differences are those between corresponding pixels in the respective blocks of the previous and the current image. In a 16 x 16 block, with 256 pixels, it is necessary to subtract corresponding pixels one from the other, square the difference then keep a running total. This involves 256 subtractions, 256 squarings and a final addition. For a 16 x 16 block this means 511 additions and 256 multiplications per block. Where this is to be done in real time, it clearly requires a lot of computing power so it is usual to use the less effective sum of absolute differences. The scoring function is not as accurate with this approach but it can be close enough. It has the added advantage of not requiring multipliers. Of course all this happens in real time because the technique is applied to real time video. Whereas broadcast video compression requires vast racks of expensive equipment, lower resolution compression can be achieved with a reasonably fast PC fitted with an appropriate card.
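The two scoring functions, and the operation count quoted above, may be sketched as follows. This is a minimal Python illustration under the assumption that each block is supplied as a flat list of pixel values; the names `sad_score`, `ssd_score` and `ssd_operation_counts` are hypothetical. The count treats each of the 256 subtractions as an addition and adds the 255 additions needed to total 256 squared terms.

```python
def sad_score(block_a, block_b):
    """Sum of absolute differences: no multipliers required."""
    return sum(abs(x - y) for x, y in zip(block_a, block_b))


def ssd_score(block_a, block_b):
    """Sum of squared differences: one multiplication per pixel."""
    return sum((x - y) ** 2 for x, y in zip(block_a, block_b))


def ssd_operation_counts(pixels):
    """Return (additions, multiplications) for one block of `pixels` pixels.

    Subtractions are counted as additions, and totalling `pixels` squared
    terms needs `pixels - 1` further additions.
    """
    subtractions = pixels
    accumulations = pixels - 1
    multiplications = pixels
    return subtractions + accumulations, multiplications
```

For a 16 x 16 block, `ssd_operation_counts(256)` gives 511 additions and 256 multiplications, in agreement with the figures stated in the text.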

The highest quality is achieved when each frame is dealt with independently without having to interpolate or invent any data in a frame. Quality can be selected at the cost of compression ratio. This technique does not take advantage of frame to frame coherence so there is a fairly high bandwidth for the quality, for example about 1 Mb/sec for a straightforward Motion JPEG of ordinary TV quality. The same perceived quality can be achieved for 1/10th of the bandwidth when interpolation between frames is used.

It is therefore desirable to use better compression methods.

In a known interpolation technique, some frames (I-frames) are compressed individually, others (P-frames) can be interpolated between I-frames, and further frames (B-frames) can be interpolated between the P-frames and the I-frames but they do rely on having information from both forward and backward directions. A search of only the previous frame for regions that look similar would not be undertaken. Rather, the next frame would be searched as well. The larger the number of I-frames transmitted, the higher the data rate. Conversely, the more P-frames and B-frames transmitted, the lower the data rate. The more 16 x 16 blocks for which a perfect (or a very good) match can be found, the less additional data will need to be sent to describe that 16 x 16 block.

For the sake of simplicity in explaining the operation of our technique and those represented in the prior art, we will assume a frame size of 64 x 64 pixels. This is of course unrealistic in practice but makes the following discussion easier to envisage. The frame can be notionally divided into sixty four regions of 8 x 8 pixels, each requiring 64 bytes to represent it. When this block is overlaid on another 64 x 64 pixel image, there are 3,249 possible positions at which a match may be found. If a perfect match were actually found, only the co-ordinates of the block need be sent, plus an indication that the block is identical. The further away from a perfect match, the greater the amount of digital data required to describe the difference between them. This will be discussed further with reference to Figure 1.
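The figure of 3,249 follows from the number of offsets at which an 8 x 8 window fits wholly inside a 64 x 64 frame, namely (64 - 8 + 1) squared. A short Python check is given below; the function name `candidate_positions` is hypothetical and the sketch assumes that the window must lie entirely within the frame.

```python
def candidate_positions(frame_w, frame_h, block_w, block_h):
    """Number of offsets at which the block lies wholly inside the frame."""
    return (frame_w - block_w + 1) * (frame_h - block_h + 1)


def block_count(frame_w, frame_h, block_w, block_h):
    """Number of non-overlapping blocks into which the frame divides."""
    return (frame_w // block_w) * (frame_h // block_h)
```

With these definitions, `candidate_positions(64, 64, 8, 8)` evaluates to 57 x 57 = 3,249 and `block_count(64, 64, 8, 8)` to 64.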

Currently, others skilled in the art tend to limit their search to a 64 x 64 (or thereabouts) sized block around the location where the previous block was situated.

Motion vectors can then be used to advantage. For example, if there is a block that appears to be moving across a frame in a certain direction and if allowance is made for that movement rather than concentrating on the centre, there is more chance of finding that block in the next frame.

The larger the search area, the more chance there is of finding a better scoring match. The ultimate would be to search the entire frame. That implies three additions for every 16 x 16 block of one frame at every possible pixel offset of the other frame.

Bearing in mind that in High Definition TV there are 2 million pixels, the number of calculations needed per frame at 60Hz (allowing for searching forwards and backwards) becomes unmanageable - needing something like 80,000 adders running at 250MHz.

Nevertheless, some systems have been built that incorporate large numbers of adders. These typically use a mixture of FPGAs and ASICs. A single chip may carry out each of the 256 absolute difference calculations. A tree is used to summate all the calculations. An entire 16 x 16 block can be searched per cycle but this still requires tens or hundreds of chips to achieve full frame rate. Part of the problem is the amount of data needed to feed these devices. Each absolute difference typically requires 2 bytes of information. This approach implies billions of bytes per frame.

There is a trade off with this technique in that the deeper a search can be carried out, the more chance there is of getting a better match and therefore a lower bandwidth with either no loss of quality or potentially better quality. On the other hand, the larger the search area, the longer the compute time but there is only a finite amount of compute capability in a given machine. A balance therefore needs to be struck between acceptable rendition and compute speed and/or capability.

Prior attempts at dealing with this problem have simply involved a rather "brute force" solution of bolting on hardware as an accelerator to a processor and hoping for the best. More refined solutions involve making some of the operations, such as calculating the sum of absolute differences, available as a special SIMD instruction in otherwise "standard" processors. Some such special processors presented with 8- or 16-byte data can calculate the absolute differences and accumulate the calculations. At present, there are no processors that have the capability to perform the whole operation, leaving the complete functionality to be built up in software.

Our approach as far as video data processing is concerned is to partition the problem differently and solve some of the bandwidth issues, as will be described shortly.

The Current State of the Art in Function Evaluation

Functions between variables occur in many and varied applications. Often, it is necessary to evaluate a function between two sets of data. These sets may involve blocks of data. If one of these sets of data represents an image, data processing may involve taking a succession of blocks of data from the image set and performing a particular process by convolution with another data set. This other data set may be a block from another image or image frame or may be a set of predetermined data designed to achieve an intended effect when convolved with the blocks of data in the image set.

For example, the predetermined data set may be a pre-loaded set of values which, when convolved with the blocks of data from another data set, perform a filter function.

Likewise, the pre-loaded data set may be convolved with a data set from an image frame to blur or sharpen the image. It is also possible that the first data set is derived from a first image and the other data set is a succession of blocks from another image, the effect of convolution between the two sets emulating a fade or "special effects" function, marking a transition between successive images in a succession of images.
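As an illustrative sketch only, the evaluation of a pre-loaded data set against successive blocks of an image may be written as a sliding sum of products. The Python below is an assumption-laden model, not the disclosed hardware: the name `filter_valid` is hypothetical, the image and kernel are plain lists of rows, and the kernel is applied without flipping, which coincides with convolution proper only for symmetric kernels such as a simple box blur.

```python
def filter_valid(image, kernel):
    """Slide `kernel` over `image` and return the sum of products at every
    offset where the kernel lies wholly inside the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0  # one accumulator per output position
            for j in range(kh):
                for i in range(kw):
                    acc += kernel[j][i] * image[y + j][x + i]
            row.append(acc)
        out.append(row)
    return out
```

A kernel of equal weights yields an unnormalised blur; dividing each output by the sum of the weights would restore the original brightness range.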

The current state of the art, like the video compression scenario discussed above, is heavy on compute hardware and/or slow on speed. On the other hand, the present invention can outperform prior art processing several times over.

The present invention achieves significant advantages over the prior art by adopting a radical approach that permits multiple strands of data to be processed concurrently without the need to fetch each item of data as it is required. Rather, once fetched, each data item is held in local memory and re-used several times over. The overall result is greatly increased processing speed.

The essential features of the invention in both of the aspects outlined above will become apparent from the following description.

Summary of the Invention

In a first aspect, the invention provides a method for evaluating a function between a first set of data and a succession of second sets of data from an image frame, wherein each data item in the first data set is evaluated with data items in corresponding positions in the second sets of data, characterised in that a data item in the first block is evaluated simultaneously with a corresponding data item in each of n second data sets in said succession and wherein at least one of said corresponding data items in a said second data set is/are locally stored and reused in an evaluation of the next data item in said first data set with data items in corresponding positions in at least the next second data set in the succession.

In a second aspect, the invention provides a method for searching in a second frame of image data for a first block of image data appearing in a first frame, wherein data items in the first block are compared with data items in corresponding positions in a succession of second blocks in the second frame, characterised in that a data item in the first block is compared simultaneously with a corresponding data item in each of n second blocks in said succession and wherein at least one of said corresponding data items in a said second block is/are locally stored and reused in a comparison of the next data item in said first block with data items in corresponding positions in at least the next second block in the succession.

In a third aspect, the invention provides apparatus for evaluating a function between a first set of data and a succession of second sets of data from an image frame, comprising evaluating means for evaluating data items in the first set with data items in corresponding positions in a succession of second data sets; characterised by a plurality of evaluating means for simultaneously evaluating a data item in the first data set with a corresponding data item in each of n second data sets in said succession, and local memory means for retaining at least one of said corresponding data items in a said second data set, whereby said retained data item is reused in an evaluation of the next data item in said first data set with n data items in corresponding positions in at least the next second data set in the succession.

In a fourth aspect, the invention provides apparatus for searching in a second frame of image data for a first block of image data appearing in a first frame, comprising comparing means for comparing data items in the first block with data items in corresponding positions in a succession of second blocks in the second frame, characterised by a plurality of comparison means for comparing n data items in the first block simultaneously with n data items in corresponding positions in n second blocks in said succession, local memory means for retaining at least one of said n data items in a second block, whereby said retained data item can be reused in a comparison of the next data item in said first block with n data items in corresponding positions in at least the next second block in the succession.

In a fifth aspect, the invention provides a data processor for processing first and second sets of data, comprising an array of processing elements each in combination with an accelerator, the accelerator comprising apparatus as specified in the preceding paragraphs.

In a sixth aspect, the invention provides a method of compressing image data by encoding a first image frame by reference to a second image frame, the method comprising the steps of searching in a succession of second blocks in said second image frame for a first block of image data appearing in the first image frame; and encoding the first block as a function of the closeness of the match between the first block and a matching second block and the offset of the first block relative to said matching second block in said second frame; wherein said searching step comprises: comparing data items in the first block with data items in corresponding positions in a succession of second blocks in the second frame, characterised in that a data item in the first block is compared simultaneously with a corresponding data item in each of n second blocks in said succession and wherein at least one of said corresponding data items in a said second block is/are locally stored and reused in a comparison of the next data item in said first block with data items in corresponding positions in at least the next second block in the succession.

The invention also encompasses a computer program product comprising a medium on which is stored a computer program adapted to perform the steps of the method as specified in the preceding paragraphs.

The apparatus is preferably connected to a processing element to act as an accelerator. In the case of a processor having an array of processing elements, an accelerator may be connected to each processing element or the accelerators themselves may be configured to act as the processing elements in an array. The array is preferably a Single Instruction Multiple Data (SIMD) array.

The evaluation may consist of a convolution, in which case the method and apparatus may emulate a filter or may perform a particular function, such as a blur or sharpen, on image data. Where the first data set is a predetermined data set and the succession of data sets are image data, the process may generate a third set of data, such as an image frame exhibiting a particular effect.

Brief Description of the Drawings

The invention will now be described with reference to the following drawings, in which:

Figure 1 is a schematic diagram indicating how a "target" image frame can be encoded by reference to a "source" image frame using an 8 x 8 pixel window;

Figure 2 is a simplified illustration of the mechanism involved in the invention;

Figures 3 to 7 represent a sequence of operations carried out in performing a preferred embodiment of the invention in relation to image compression and as an aid to understanding how the invention operates;

Figure 8 is a schematic representation of the processing blocks used in a preferred implementation of the invention;

Figure 9 illustrates how a unit embodying the invention can be connected to a processor array; and

Figure 10 illustrates a basic processor element for the purpose of explanation.

Detailed Description of the Illustrated Embodiments

Video image processing

Consider first a processing element (PE), such as illustrated generically at 300 in Figure 10. The PE is preferably provided with a selected functionality represented by an arithmetic unit (ALU) 301. The PE also includes a local memory 303 and a local register file 302. The connection between the register file and the ALU includes three paths, two to the ALU, entering at ports A and B, and one return.

The PE can be used to implement a particular functionality, such as floating point arithmetic. This function block, represented by 304 in Figure 10, is connected in parallel to the ALU. In the same way, a sum of absolute differences (SAD) unit could be connected in parallel with the ALU to perform a SAD function as a precursor to compression, as will be explained later.

As will become clearer, the present invention achieves a technique for reusing data by carrying out calculations in parallel paths and thereby using multiple adders to compute different values rather than using multiple adders combining to produce one result. Four such parallel paths are used in the preferred embodiment.

In the preferred embodiment of the PE, the paths to and from the register file are 64 bits wide. This provides sufficient bandwidth to the SAD unit, so that the datapaths do not need to transfer data to it on every cycle. This allows the remaining register file bandwidth to be used by other functional units, such as the ALU.

Consider now Figure 1, which illustrates the objective of searching for a block of image data from one frame in a different frame, which may be a previous frame in a succession of frames as in the case of a stream of video image frames. The "target" frame 2 in Figure 1 is the frame of interest, ie the frame that is to be compressed. This frame is notionally divided into 64 blocks (say), each of 8x8 pixels, as previously mentioned.

Compression of the target frame is by reference to a "source" frame 1. This may be the previous frame in a video stream or sequence, for example.

Each of the blocks in the target frame 2 is to be searched for in the source frame 1 by considering it to be a "window" that is moved over the source frame until a "match" is found. The moving and matching operation begins with the first block 4 or window from the top left corner in the target frame being located over the top left block 5 in the source frame 1. Eventually the window 4 visits each offset in the source frame, looking for a perfect match or a best match, if one exists. If a perfect match can be found the target block can simply be encoded by flagging it as identical to the matching source block 3 and providing data as to the offset of that block 3 relative to the starting position 5 of the block being searched for in the source frame. The offset is represented by arrow 6 in Figure 1. The arrow 7 merely indicates that the matching block 3 found in the source frame 1 is used to encode the target block 4 in the target frame 2.

It is unlikely that a perfect match will be found, so the technique looks for a closest match. Therefore, for each position of the window or block over the source frame a rating or "score" is allocated and the "best" overall score is taken to be the location of the block from the target frame in the source frame 1.

The process will now be explained in more detail with reference to Figure 2. This is an even more simplified illustration of the basic principles underlying the invention in order to enhance explanation without complicating the illustration. For the purposes of illustration, the technique is described with reference to a 4x4 block but the principle is applicable to an 8x8 or 16x16 block etc. It is important to note that the use of a 4x4 block for illustrative purposes has no bearing on the fact that in this example we use four parallel processing paths with four sets of adders processing four blocks at a time. In other words, there is no relationship between the size of window and the number of parallel processing paths.
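The scoring and selection of the best position may be sketched as an exhaustive search. The following Python is a minimal illustration which assumes the sum of absolute differences as the score, a lower score being better, and frames supplied as lists of rows; the names `block_sad` and `best_match` are hypothetical. It visits the offsets one at a time and so shows only what is computed, not the parallel arrangement of the invention.

```python
def block_sad(block, frame, ox, oy):
    """Score of `block` placed with its top left pixel at (ox, oy) in `frame`."""
    return sum(abs(block[j][i] - frame[oy + j][ox + i])
               for j in range(len(block))
               for i in range(len(block[0])))


def best_match(block, frame):
    """Return (score, ox, oy) for the lowest-scoring offset in the frame.

    Ties are resolved in favour of the first offset visited, scanning
    along each line and then down to the next.
    """
    bh, bw = len(block), len(block[0])
    best = None
    for oy in range(len(frame) - bh + 1):
        for ox in range(len(frame[0]) - bw + 1):
            score = block_sad(block, frame, ox, oy)
            if best is None or score < best[0]:
                best = (score, ox, oy)
    return best
```

A returned score of zero corresponds to the perfect match discussed above, in which case only the offset need be encoded.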

The objective of the technique is to attempt to find a block 100 from the target frame (in this case a 4x4 block, purely for simplicity in explanation) in a source frame 101. The top row of pixels in the target block 100 is identified as a, b, c and d. Pixels in the top row of the source are identified as A, B, C, D, E, F, G, H etc. Where the target block is 4x4, the pixels in the second row could be labelled e, f, g, h and so on for the third and fourth rows. However, where a block larger than 4x4 is used, the top row of pixels in the target block will continue beyond d into e, f, g, h and so on.

Regarding the target block 100 as a sliding window, it is first positioned in the top left hand corner of the source frame in the position identified as 110. In this position, pixel a overlies pixel A, pixel b overlies pixel B, pixel c overlies pixel C, and pixel d overlies pixel D. In this position, the absolute differences between the corresponding pixels in the target block and each of the successive (in this case four) source blocks are calculated and accumulated in respective accumulators (indicated generally at 150 in Figure 2) for that position. Thus, in one clock cycle, the value |a-A| is calculated for the position where pixel a overlies pixel A in the first of a succession of blocks in the source frame, labelled "source block 1" in Figure 2. In the same clock cycle, since the value B is already available (ie it has been fetched) the value |a-B| can be calculated. Similarly, the values |a-C| and |a-D| can be calculated in the same clock cycle.

Inspection of these values shows that the top left pixel a in the target block has been processed with the top left pixel from each of the first four source blocks 110, 120, 130 and 140. The values thus calculated are put into accumulators (not shown in Figure 2 but represented at 150 by Result 0 to Result 3). In the next clock cycle, the second pixel b in the target block can be processed with the second pixels in each of the four source blocks in the source frame. This time, the values of those calculations are |b-B|, |b-C|, |b-D| and |b-E|, where E is the next pixel value from the source frame. When the first four pixels in the target block have been processed with the corresponding pixels in each of the four source blocks 110 to 140, the partial sum |a-A| + |b-B| + |c-C| + |d-D| will have been calculated and stored as "Result 0" in an accumulator. Similarly, the partial sum |a-B| + |b-C| + |c-D| + |d-E| will have been calculated and accumulated as "Result 1" in Figure 2. The process similarly leads to the accumulation of the partial sums illustrated as "Result 2" and "Result 3" in Figure 2.
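The accumulation of Result 0 to Result 3 for one row may be modelled as follows. This Python sketch is illustrative only and rests on stated assumptions: the function name `parallel_row_sad` is hypothetical, the inner loop stands in for n units that would operate in the same clock cycle, and the source row is taken to hold at least len(target_row) + n - 1 pixels.

```python
def parallel_row_sad(target_row, source_row, n=4):
    """Accumulate partial sums for n source blocks at adjacent pixel offsets.

    results[p] ends up holding the sum over k of
    |target_row[k] - source_row[k + p]|, ie "Result p".
    """
    results = [0] * n
    for k, t in enumerate(target_row):   # one clock cycle per target pixel
        for p in range(n):               # n evaluations in the same cycle
            results[p] += abs(t - source_row[k + p])
    return results
```

With target values a to d and source values A to G, `results[0]` corresponds to |a-A| + |b-B| + |c-C| + |d-D| and `results[1]` to |a-B| + |b-C| + |c-D| + |d-E|, matching the partial sums named above.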

In other words, for each position of the window over the source block, in one clock cycle one of the pixels in the target block is compared simultaneously with the corresponding pixel in a succession of source blocks and the results put into the respective accumulators. In the next clock cycle, the next pixel in the target block is compared with the next corresponding pixel in each of the succession of source blocks in parallel and the partial results accumulated once again. The sequence continues line by line until the whole of the target block has been processed. The target block is then moved four (in this case) pixels to the right, so that the top left pixel, a, is now positioned over the source frame block where the top left pixel value is E. The process continues as before, stepping along the lines of the source block, four pixels at a time (in this example), and then down to the next line, until the whole of the source frame has been searched.

As a variation of the above process, it is not essential that the source blocks are located at adjacent pixels. If they are instead offset by four pixels and only every fourth pixel is searched, the same level of reuse of data from the second blocks is attained but the data can be sub-sampled in order to obtain a quick and approximate match. This may then be followed by homing in for a more precise match position.

What is important to note about each comparison cycle (ie one pixel in the target frame being compared simultaneously with a corresponding pixel in each of the next set of source blocks in succession) is that the value A is used once, value B is used twice, value C is used three times and value D is used four times in this set of calculations. This reuse is represented by the numbers in parentheses under the relevant pixel in Figure 2, for example (x2) under pixel B. The opportunity is thus presented of using the same value more than once without having to fetch it more than once.
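The reuse figures can be checked by counting, for one row, how often each source pixel takes part in a calculation. The Python sketch below is an illustration under the same assumptions as the four-path example (a four pixel target row and four adjacent source blocks); the name `reuse_counts` is hypothetical.

```python
from collections import Counter


def reuse_counts(target_len, n, source_labels):
    """Count how many calculations each source pixel takes part in when
    target_len target pixels are each evaluated against n adjacent blocks."""
    uses = Counter()
    for k in range(target_len):      # target pixel index (one cycle each)
        for p in range(n):           # parallel path / source block index
            uses[source_labels[k + p]] += 1
    return dict(uses)
```

For labels A to G this gives A once, B twice, C three times and D four times, with E, F and G falling away again as three, two and one. Seven values are fetched for sixteen calculations.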

Considering the position where the window has been moved from the start position 110 to the end position 140 four pixels to the right, it can be seen that the end pixel values are used once, the next pixels in from the ends are used twice and the next pixels three times. If this is extrapolated to the situation where a whole row of the source frame is scanned, the remaining pixels, which compose the majority of the pixels in the frame, are each used four times. There is thus a huge potential for re-using data that is already present, ie has already been fetched. It is therefore desirable for the pixel values that are reused to be held in local memory within the processing unit performing the technique until they are finished with, rather than having to fetch each value each time it is to be used.

Figures 3 to 7 illustrate in more detail how this saving can be achieved. In Fig 3, the pixel values a-d of the target block are taken in sequence via registers 201-204 of a shift register to respective inputs of sum of absolute differences (SAD) units 220.

Similarly, via registers 210-213, the values A-D representing the pixel values of the first source block are presented to the other inputs respectively of the SAD units 220. These units therefore calculate the sum of absolute differences between a and A, a and B, a and C, a and D and accumulate the results R0 to R3 in accumulators 230. For the moment, it is sufficient to note that the next set of values E-H of the next block in the source frame, spaced four pixels along from the previous block, are introduced at 214-217 awaiting their turn in the next calculation.

Referring now to Figure 4, both sets of pixel values have been stepped to the next value in the sequence. So, value b has moved from 202 in Fig 3 to 201 in Fig 4 to replace value a, value c has moved from 203 to 202 to replace value b, and value d has moved from 204 to 203 to replace value c. Similarly for the pixel values A-D, value B has moved from 211 to 210 to replace value A, value C has moved from 212 to 211 to replace value B, and value D has moved from 213 to 212 to replace value C. New value E has entered at 213 in readiness for the next calculation and the remaining values F-H likewise have moved into 214-216.

Figures 5, 6 and 7 show the continuation of the above process, as the values a, b, c etc and A, B, C etc progress through the SAD units and the calculated values accumulate at R0 to R3. Figure 7 shows the position that has been reached when all four values a to d and A to D have passed through the process and have been replaced by the next set e to h and E to H. In this manner, a complete line of pixels can be scanned and the results of the SAD calculations stored in accumulators R0 to R3. The whole process is then repeated for the next row of pixels in both the target and source frames.
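The stepping of Figures 3 to 7 can be modelled, purely as a sketch, with a sliding window standing in for registers 210-213. The function name shift_register_sad and the returned fetch count are assumptions introduced for illustration; the reference numerals in the comments are taken from the description above, but the code is not a description of the actual circuit.

```python
from collections import deque


def shift_register_sad(target, source, n=4):
    """Model one scan line: one target value and one new source value per step.

    Returns (accumulators, number_of_source_values_fetched).
    """
    pending = deque(source)
    # Registers 210-213: the n source values currently visible to the SAD units.
    window = deque((pending.popleft() for _ in range(n)), maxlen=n)
    fetched = n
    acc = [0] * n                        # R0 .. R(n-1)
    for t in target:                     # value presented from register 201
        for k in range(n):               # the n SAD units work side by side
            acc[k] += abs(t - window[k])
        if pending:                      # shift: oldest value drops out, one enters
            window.append(pending.popleft())
            fetched += 1
    return acc, fetched
```

For a four-pixel target line this fetches seven source values in total, whereas evaluating the four offsets independently would fetch sixteen.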

The accumulated values represent an indication or "score" of the closeness of the match between the target block and the source blocks. The algorithm/program operating the compression per se can use these scores to judge whether a perfect match has been found or, if not, where the closest match lies. That block can then be used to encode the target block by reference to the matching source block.

The pixels a, b, c, etc can be 32-bit inputs via an input port. The other input port for the A, B, C etc pixels can be 64-bit to capture two sets of 32-bit data. However, the ability exists to choose, as well as shifting across, to shift 32 out of 64 bits into the other four registers. This is shown in Figure 3 for the 64-bit width input and in Figure 7 for the 32-bit input. If there are more than four samples, eg in an 8x8 or 16x16 block version of the invention, at every fourth step when the a, b, c, etc values have been shifted through 201-204, a whole new set of four values out of the block being searched for can be shifted into the register. In the 16x16 case, this has to be done four times, starting with a to d, then e to h and so on. The same thing will of course have to be done for the A to D etc values, as shown at 214-217 in Figures 3 and 7, for example.

After all of the lines of the four source blocks have been searched or scanned in this way, the result registers R0-R3 will contain accumulated sums of absolute differences of the entire target block for each of the four possible offsets 110-140 shown in Figure 2, the accumulated results representing the scores for each of those offsets, as previously mentioned. This process is repeated until the whole of the image frame has been scanned and "scored". The scores in the accumulators are utilised by the compression algorithm and then discarded once the entire block has been scanned and the block or window has been moved four pixels along the scan line. For a given 16x16 block at four offsets, the computations are effectively made in parallel using parallel processing paths, one for each SAD unit.
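A whole-frame search built on the same idea might look like the following sketch. The function name search_frame, the tuple it returns and the rule of keeping the first lowest score are all assumptions made for illustration; the specification leaves the use of the scores to the compression algorithm.

```python
def search_frame(target, frame, n=4):
    """Score every block position in the frame, n horizontal offsets at a time.

    Returns (best_score, x, y) for the lowest sum of absolute differences.
    """
    bh, bw = len(target), len(target[0])
    fh, fw = len(frame), len(frame[0])
    last_x = fw - bw                              # right-most valid block position
    best = None
    for y in range(fh - bh + 1):
        for x0 in range(0, last_x + 1, n):        # window moves n pixels along
            offsets = range(x0, min(x0 + n, last_x + 1))
            acc = [0] * len(offsets)              # fresh accumulators per group
            for r in range(bh):
                for c in range(bw):
                    t = target[r][c]
                    for k, x in enumerate(offsets):
                        acc[k] += abs(t - frame[y + r][x + c])
            for k, x in enumerate(offsets):       # scores used, then discarded
                if best is None or acc[k] < best[0]:
                    best = (acc[k], x, y)
    return best
```

A score of zero indicates a perfect match at the returned position; any other value is the closest match found.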

It is useful to pause here and compare how the present invention differs from the current state of the art. It is worth remembering that the block being searched in the source frame is actually part of the target frame that is currently being compressed, eg for transmission by reference to the source frame. The current state of the art is not to search the entire image for that block but instead to search only the region around where the block in question belongs in the frame, eg plus or minus 32 pixels. This is on the basis that, where there is a high degree of coherence between successive frames, the block is likely to be near its previous position. Potentially, better compression is possible by making the search area bigger and bigger, ultimately being the size of the frame itself, but at the cost of time and hardware. However, the prior art has not yet managed to search the entire image in real time.

If a perfect match is found for the target block in the source frame, the target frame can simply copy the matching block from the source frame. If a perfect match cannot be found, one block is effectively subtracted from the other to create a "difference" image that can then be used in the succeeding compression stage, bearing in mind that the differences are likely to be small. The compression itself, eg JPEG, can then be applied to the differences. Most of the difference values will (hopefully) be zero, so encoding needs only to be performed on the blocks that are different. There is a range of encoding schemes that can be used to represent the differences. If there really is no match at all, eg because of a change of scene, there is very little option but to encode the whole of the new frame as an independent frame without reference to any preceding frame. There are a number of algorithms that can be used to determine how far a search should be carried out in any given frame. For example, a perfect match may lead to further searching being prematurely stopped. Other approaches may be used.
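The subtraction of one block from the other can be sketched as below. The function names difference_block and reconstruct are hypothetical, and the sketch says nothing about how the differences are subsequently encoded.

```python
def difference_block(target, match):
    """Subtract the matching source block from the target block, pixel by pixel."""
    return [[t - m for t, m in zip(t_row, m_row)]
            for t_row, m_row in zip(target, match)]


def reconstruct(match, diff):
    """Recover the target block from the matching block and the differences."""
    return [[m + d for m, d in zip(m_row, d_row)]
            for m_row, d_row in zip(match, diff)]
```

When the match is close, most entries of the difference block are zero, and adding the differences back to the matching block returns the original target block.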

Whereas typical Motion JPEG may take 1 Mb/s bandwidth, compression can achieve the same perceived quality in one tenth of that bandwidth if interpolation between frames is employed using better and better compression techniques.

The above description of the operation of the technique used shift registers as a means of explaining the process. However, in the preferred implementation shown in Figure 8, latches and multiplexers are used instead. In the left hand side of Figure 8, values a-h of the target window are presented to one of the inputs of a set of multiplexers (Mux) indicated generally at 72. The values can be momentarily held as one or the other by means of the latches 73 connected to the Mux outputs. Similarly, the first four source block pixel values A-D are available at a next group of four Mux circuits 72 and likewise the values A-H (shown at 70 in Figure 8) are available at further Mux circuits 71 feeding Mux circuits 72 and latch 73. The combination of Mux and latch circuits acts in the same way as the shift registers used in the functional description above.

Initially, the very first 64 bits of data brought into the processing unit, ie the values A-H at 70, are latched into the registers. They are then shifted down as the calculation progresses. After four such steps, it is necessary to bring in another four bytes at 70. However, there is a 64-bit path, so one or the other half of the input path has to be selected to bring in the next half. In the next step, the other half is brought in. Hence, after four cycles, new values of A-D would be brought in via the Mux array 71. This is probably better illustrated by considering the input values a, b, c etc and A, B, C etc as coming in via ports A and as shown in Figure 9. The entire 64 bits can be latched in one step. Thereafter, only four bytes at a time will be latched in.
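The selection of one half of the 64-bit input path can be pictured with the following sketch. The function names select_half and unpack_bytes are hypothetical, and the byte ordering (least significant byte first) is an assumption; the specification does not state which half or which byte is taken first.

```python
def select_half(word64, upper):
    """Pick the upper or lower 32 bits of a 64-bit input word (a 2:1 mux)."""
    if upper:
        return (word64 >> 32) & 0xFFFFFFFF
    return word64 & 0xFFFFFFFF


def unpack_bytes(word32):
    """Split a 32-bit word into four 8-bit pixel values.

    Assumption: least significant byte first.
    """
    return [(word32 >> (8 * i)) & 0xFF for i in range(4)]
```

Calling select_half twice on the same word, once for each half, yields the two groups of four bytes that are latched in on successive refills.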

The requisite combination of source and target pixel values are thereby fed to the inputs of subtraction and absolute value circuits 75, 76 and thence to addition and latch circuits 78, 79 performing the necessary accumulation function. The results are accumulated in results registers R0 to R3.

If the input latches were considered as being equivalent to the shift registers described above (so that only one new byte is input at the end) only two new bytes would be needed every cycle instead of eight, so the bandwidth will have been reduced by a factor of 4. The PE array permits 32 bits per cycle to be brought from each PE memory into the units described above but only 16 bits are actually needed because data has been re-used. Although four operations are ongoing at any one time, they are not independent.
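The arithmetic behind the factor of 4 can be set out as a small sketch. The function name bits_fetched_per_cycle and its parameters are assumptions for illustration, with each pixel value taken to be one byte as in the description above.

```python
def bits_fetched_per_cycle(n_units=4, bits_per_value=8, reuse=True):
    """Bits that must be fetched per cycle to keep n_units evaluations fed."""
    if reuse:
        values = 2              # one new target value plus one new source value
    else:
        values = 2 * n_units    # every unit fetches both of its operands afresh
    return values * bits_per_value
```

With four units this gives 64 bits per cycle without reuse and 16 bits per cycle with reuse, which is the factor of 4 referred to above.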

The current state of the art tends to compute an entire 16 x 16 or 8 x 8 block in one go at one offset in its entirety before trying another offset, so there is no reuse of data, thereby leading to a bandwidth problem. The justification for this approach is that it is straightforward to bring in two groups of 16 bytes into 16 byte line registers. However, this approach fails to recognise that, by using multiple adders to compute different values rather than using multiple adders combining to produce one result, data can thereby be reused.

The above technique works equally for any width that is a multiple of four. It will therefore work with 4 x 4; 4 x 8; 8 x 8; 8 x 16 and so on, up to 32 x 32, even though no known technique ventures beyond 16 x 16. The preferred choice of four is arrived at by considering that greater widths do not use any more values (the maximum remains at one byte per cycle) but at initialisation, the data has to be input in parallel. The only shift is along a scan line and when the scan moves down to the next line, the register has to be reset with the line data. Unless very wide searches are being conducted, eg by using many 16 x 16 or 32 x 32 blocks, the cost of "pre-loading" this data in the first place increases and thereby becomes a bigger portion of the overall performance. It is also beneficial to support down to 4 x 4 block sizes. Four gives the level of desired performance with the minimum and maximum sizes of interest but it could be any number. Taking it down to one means that there is only one unit and data is being fed at one byte per cycle. Some bandwidth is saved using two.

The wider the window, the greater the bandwidth advantage but only when the search is conducted on something that is reasonably wide. There is a trade-off between enhanced performance and silicon area cost on a chip implementing the technique. The best known units can at present only theoretically achieve at best half of the performance our system can achieve but in practice the known unit achieves only half of that, so our system exceeds the best of the prior art by a factor of 4.

The main advantage of the technique is scaled up performance without scaling up bandwidth. The unit in accordance with the invention can be "bolted" onto the side of a processor, a SIMD processor preferably, as an accelerator. Moreover, each of the PEs can be used within the software with a consequential improvement in flexibility regarding the choice of algorithm to direct the search rather than arriving at a best match by a series of approximations or interpolations. There is thus the dual advantage of hardware acceleration and programmability.

The present invention has a wide spectrum of application, from live broadcast TV encoding and streaming video data across the internet for example and especially commercial broadcasting where the reductions in bandwidth that the invention achieves are directly reflected in a corresponding reduction in transmission cost per channel but with no loss of perceived quality. It makes more commercial sense for a broadcaster to make a capital investment in a machine that does a better job of compression without sacrificing image quality than to pay increased costs per unit bandwidth on a continuing basis.

The unit described in the preferred embodiments of the invention exhibits advantages when it is attached to a processor in that it accelerates the functions performed by the processor. The unit could also be attached to each of the PEs of an array processor (such as a traditional multiprocessing environment or preferably a SIMD processor), where each PE has local memory, local ALU and local register file. Alternatively, the processors described in the present specification can themselves be arranged into an array with some local storage/memory for each one, plus any necessary control logic. This would then constitute a stand-alone device that performs only the search operation described without the need for a separate processor per se. If there are enough PEs, the searching operation can be performed in real time.

Considering a SIMD array of 64 or 96 PEs, the present invention enables 64 or 96 blocks to be searched in parallel whereby to gain an even more enhanced performance. Alternatively, the same block can be searched in different places in each PE or the search could be carried out in the same part of the frame (or in different parts of the frame) but with each PE looking for a different block. It may be possible to search for a target block by using only the corner pixels of the target block and of a succession of source blocks. Other variations are also possible. However, the technique of the present invention has advantages in each application.

Although the aspect of the present invention concerned with image frame processing has been described in terms of video compression, in which images are compared in order to reduce bandwidth, the ability to compare images can be used with advantage in other technological areas, such as straightforward image comparison. This may be useful in say Optical Character Recognition (OCR) or for Identify Friend or Foe (IFF) systems, where rapid comparison of an image with a "target" is either advantageous or essential. It may then be more advantageous, instead of using sum of absolute differences, to conduct a higher quality sum of squares of differences.
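The two comparison measures mentioned here differ only in the per-pixel operation, as the following sketch shows. The function names sad and ssd are illustrative shorthand and are not taken from the specification.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length pixel sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def ssd(a, b):
    """Sum of squares of differences between two equal-length pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

Two candidate blocks with the same sum of absolute differences can have very different sums of squares: an error of 2 on each of four pixels and an error of 8 on a single pixel both give a SAD of 8, but the sums of squares are 16 and 64 respectively, so the squared measure penalises the single large error more heavily.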

Other data processing

So far, the invention has been described in terms of processing one set of data from a block in a first image frame with successive blocks of data from a second image frame. The apparatus described succeeds over the prior art by performing several "comparisons" concurrently and reusing data that has already been fetched without having to re-fetch the same set of data each time it is required for a next comparison. However, the same apparatus can be used in a different way to perform other mathematical functions.

For example, if the target block was not a block from an image frame but was instead a block of predetermined data forming a set of values of interest, that data set could still be brought into the ports of the accelerator described. The data set could then be used in the Mux and latch circuits of Figure 8 to evaluate a function relating that data set with a succession of data sets introduced into the other port. The function itself would be established through the selection of processing units connected to the outputs of the Mux and latch circuits in place of the SAD units 75, 76 in Figure 8. Thus, the two data sets could be multiplied together, with or without a scaling factor, to emulate a filter function operable on the data set in the succession of second blocks. The individual data items would be stepped through the accelerator as before, using parallel processing paths and re-use of the data in the second blocks, also as before.

A useful function would be a convolution function. This would be achieved by replacing the add functionality by a multiplication functionality and could have a range of applications such as filtering. The results of the evaluations of the data items in the first set with corresponding data items in the successive second sets of data may be used to create a third set of data composed of data items representing a convolution function of the data items in the first and second sets of data.
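The generalisation can be sketched by making the per-item operation a parameter. The function name evaluate_n_offsets and the op parameter are assumptions for illustration. Note also that, as written, the sketch computes a sliding sum of products without reversing the first data set, which is strictly a correlation; it coincides with a convolution when the first data set is symmetric or is supplied already reversed.

```python
def evaluate_n_offsets(first, second, n=4, op=lambda a, b: a * b):
    """Evaluate op between the first data set and n successive second data sets.

    first:  the predetermined data set (for example filter coefficients).
    second: a run of data holding n overlapping second data sets.
    op:     per-item function; multiplication by default, or an absolute
            difference to recover the block-matching behaviour.
    """
    acc = [0] * n
    for i, a in enumerate(first):        # one item of the first set per step
        for k in range(n):               # n evaluations side by side
            acc[k] += op(a, second[i + k])
    return acc
```

The accumulated outputs form the third set of data. Passing op=lambda a, b: abs(a - b) gives the sums of absolute differences used for block matching.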

The above uses of the apparatus could have particular application to image data where it could perform a variety of picture effects. The functionality of the accelerator would be designed to provide specific functions, for example a "blur" or "sharpen" operation. Consider a 16 x 16 grid with values adding up to one, for example a central value of 0.5 reducing rapidly to zeroes at the edges. If that value is multiplied by all the corresponding values of a data set relating to an image, a value between zero and one results. That constitutes a blur and the blur function can be chosen to be a simple linear or other function. That blur could then be stepped by the unit in accordance with the invention, four pixels at a time instead of pixel by pixel.
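A small-scale version of the blur described above is sketched here. A 3 x 3 grid is used instead of 16 x 16 purely to keep the example short; the kernel values and the function name blur_at are assumptions chosen for illustration. The weights add up to one, so an image whose values lie between zero and one produces an output in the same range.

```python
# Hypothetical 3 x 3 kernel whose weights sum to one (largest at the centre).
KERNEL = [[1 / 16, 2 / 16, 1 / 16],
          [2 / 16, 4 / 16, 2 / 16],
          [1 / 16, 2 / 16, 1 / 16]]


def blur_at(image, kernel, x, y):
    """Weighted sum of the image region whose top-left corner is (x, y)."""
    return sum(kernel[r][c] * image[y + r][x + c]
               for r in range(len(kernel))
               for c in range(len(kernel[0])))
```

A uniform region is left unchanged by the blur, while a single bright pixel surrounded by zeroes is reduced to the value of the central weight.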

It is also conceivable that the processing can achieve a desired transition between one data set and another, such as a simple fade or other special transition between frames or images in a sequence.

Claims

1. A method for evaluating a function between a first set of data and a succession of second sets of data from an image frame, wherein each data item in the first data set is evaluated with data items in corresponding positions in the second sets of data, characterised in that a data item in the first block is evaluated simultaneously with a corresponding data item in each of n second data sets in said succession and wherein at least one of said corresponding data items in a said second data set is/are locally stored and reused in an evaluation of the next data item in said first data set with data items in corresponding positions in at least the next second data set in the succession.

2. A method for searching in a second frame of image data for a first block of image data appearing in a first frame, wherein data items in the first block are compared with data items in corresponding positions in a succession of second blocks in the second frame, characterised in that a data item in the first block is compared simultaneously with a corresponding data item in each of n second blocks in said succession and wherein at least one of said corresponding data items in a said second block is/are locally stored and reused in a comparison of the next data item in said first block with data items in corresponding positions in at least the next second block in the succession.

3. A method as claimed in claim 1 or claim 2, wherein said data items in said first block or in said first data set are adjacent and said data items in a said second block or in a said second data set are adjacent.

4. A method as claimed in claim 1 or claim 2, further comprising making said evaluations or said comparisons along a plurality of scan lines one after the other until the data items in the first block or in the first data set have been compared with n successive second blocks or data sets.

5. A method as claimed in claim 2, wherein said comparison is carried out for each of a plurality of first blocks in said first frame with each of said succession of second blocks in said second frame.

6. A method as claimed in any of claims 2 to 5, wherein the partial results of simultaneously comparing n said data items in said first block with n corresponding data items in each said second block in said succession are temporarily accumulated in a respective one of n accumulators.

7. A method as claimed in claim 6, wherein after said n data items in said first block have been compared with said n corresponding data items in n successive second blocks, the first block is then compared with a further n successive second blocks, each of which is spaced n data items from the previous second block in the succession.

8. A method as claimed in claim 6, further comprising assessing from the partial results in the accumulators, the closeness of the match between the first block and a matching second block in the second frame.

9. A method as claimed in claim 2, wherein said comparison step is based on a sum of absolute differences algorithm.

10. A method as claimed in claim 2, wherein said comparison step is based on a sum of the squares of differences algorithm.

11. A method as claimed in claim 1, wherein said data items in said first set of data comprise a predetermined set of data items and wherein the results of said evaluations with corresponding data items in said successive second sets of data are used to create a third set of data composed of data items representing a convolution function of the data items in the first and second sets of data.

12. A method as claimed in claim 11, wherein said convolution operation performs a blur function.

13. A method as claimed in claim 11, wherein said convolution operation performs a sharpen function.

14. A method as claimed in claim 11, wherein said convolution operation performs a filter function.

15. Apparatus for evaluating a function between a first set of data and a succession of second sets of data from an image frame, comprising evaluating means for evaluating data items in the first set with data items in corresponding positions in a succession of second data sets; characterised by a plurality of evaluating means for simultaneously evaluating a data item in the first data set with a corresponding data item in each of n second data sets in said succession, and local memory means for retaining at least one of said corresponding data items in a said second data set, whereby said retained data item is reused in an evaluation of the next data item in said first data set with n data items in corresponding positions in at least the next second data set in the succession.

16. Apparatus for searching in a second frame of image data for a first block of image data appearing in a first frame, comprising comparing means for comparing data items in the first block with data items in corresponding positions in a succession of second blocks in the second frame; characterized by a plurality of comparison means for comparing n data items in the first block simultaneously with n data items in corresponding positions in n second blocks in said succession, local memory means for retaining at least one of said n data items in a second block, whereby said retained data item can be reused in a comparison of the next data item in said first block with n data items in corresponding positions in at least the next second block in the succession.

17. Apparatus as claimed in claim 15 or 16, further comprising a first input port for receiving said data items in said first data set or block, a second input port for receiving said data items in said second data set or block, and an output port providing n results of simultaneously comparing or evaluating a data item in said first data set or block with n data items in corresponding positions in n second data sets or blocks in said succession.

18. Apparatus as claimed in claim 17, further comprising n accumulators for temporarily storing the n respective results.

19. Apparatus as claimed in claim 17, wherein said first input port comprises n multiplexers and latches in n parallel paths connected to a single multiplexer, whereby successive data items in said first data set or block can be sequenced through to one input of n said comparison means.

20. Apparatus as claimed in claim 19, wherein said second input port comprises a plurality of multiplexers and latches in parallel paths, whereby successive data items in said succession of second data sets or blocks can be sequenced through to the other input of said n comparison means.

21. Apparatus as claimed in claim 16, wherein said comparison means comprise means for performing the sum of absolute differences between the data items in said first and second blocks.

22. Apparatus as claimed in claim 16, wherein said comparison means comprise means for performing the sum of the squares of the differences between the data items in said first and second blocks.

23. Apparatus as claimed in claim 21 or claim 22, further comprising n accumulators, each connected to the output of a respective comparison means, whereby to accumulate the partial results of the comparisons between the data items in the first block and the data items in the succession of second blocks.

24. Apparatus as claimed in claim 23, wherein said accumulators each comprise an adder and a latch in series, with a feedback path from the output of the latch to the input of the adder.

25. A data processor for processing first and second sets of data, comprising a processing element in combination with an accelerator, the accelerator comprising apparatus as claimed in any of claims 15 to 24.

26. A data processor for processing first and second sets of data, comprising an array of processing elements each in combination with an accelerator, the accelerator comprising apparatus as claimed in any of claims 15 to 24.

27. A data processor as claimed in claim 26, wherein said array of processor elements is a Single Instruction Multiple Data (SIMD) array.

28. A data processor for processing first and second sets of data, comprising an array of processing elements, each processing element comprising apparatus as claimed in any of claims 15 to 24.

29. A method of compressing image data by encoding a first image frame by reference to a second image frame, the method comprising the steps of searching in a succession of second blocks in said second image frame for a first block of image data appearing in the first image frame; and encoding the first block as a function of the closeness of the match between the first block and a matching second block and the offset of the first block relative to said matching second block in said second frame; wherein said searching step comprises: comparing data items in the first block with data items in corresponding positions in a succession of second blocks in the second frame, characterised in that a data item in the first block is compared simultaneously with a corresponding data item in each of n second blocks in said succession and wherein at least one of said corresponding data items in a said second block is/are locally stored and reused in a comparison of the next data item in said first block with data items in corresponding positions in at least the next second block in the succession.

30. A computer program product comprising a medium on which is stored a computer program adapted to perform the steps of the method of claim 1.

31. A computer program product comprising a medium on which is stored a computer program adapted to perform the steps of the method of claim 2.

32. A computer program product comprising a medium on which is stored a computer program adapted to provide a first set of data and a succession of second sets of data, to perform the steps of the method of claim 1 or claim 2 on said first and second sets of data, and to use the resulting data produced by the method.

33. A method of processing first and second sets of data, substantially as herein described with reference to the accompanying drawings.

34. A method of compressing a series of image frames, substantially as herein described with reference to the accompanying drawings.

35. A data processor, substantially as herein described with reference to the accompanying drawings.