COMBINED LOAD AND COMPUTATION EXECUTION UNIT
This invention relates to the field of computer systems, and in particular to a processor that includes a load/store unit that is configured to provide computation capabilities, and in a particular embodiment, to provide filtering capabilities for video data during the load phase of an instruction.
FIG. 1 illustrates an example architecture of a conventional processing device, as might be embodied in an integrated circuit. An instruction component 110 receives program instructions 101, and decodes each program instruction into one or more 'machine' instructions that are communicated to the other components in the processing device via an instruction bus 112. These other components are commonly termed "execution units", and include, but are not limited to, a processing unit 120 and a load/store unit 140. The processing unit 120 includes, for example, arithmetic-logic units (ALUs), multiplication units, floating point units, and so on.
The processing unit 120 is typically configured to perform operations based on the contents of registers 130, such as adding the contents of one register to the contents of another register, applying a mask to bits of one register and storing the result in another register, and so on.
The load/store unit 140 is configured to transfer data between the registers 130 and an external memory 190. To minimize delays associated with the data transfer, a cache memory 160 is typically used to temporarily store the data. The cache is located closer to the registers 130 than the memory 190, typically within the same integrated circuit as the registers 130, and often within the load/store unit 140. The memory 190 is typically external to the integrated circuit, and access to it typically requires the use of an external bus interface 180. Thus, the time required to effect a transfer between the register banks 130 and the cache 160 is substantially less than the time required to effect a transfer between the register banks 130 and the memory 190.
When data is requested from memory 190, the cache 160 is accessed to determine whether the requested data is present in the cache 160. If the data is present in the cache, commonly termed a cache "hit", the data is transferred to the intended register 130 from the cache 160; otherwise, termed a cache "miss", the data is transferred from the memory 190 to the cache 160 and on to the register 130. An overall time savings is achieved by transferring 'blocks' of data between the cache 160 and the memory 190, when a cache-miss occurs.
Assuming that the program exhibits some spatial or temporal locality in the processing of data, the likelihood that a subsequent instruction will utilize data that is included in the block of data transferred by a prior instruction is high. In like manner, data that is to be transferred from a register 130 to memory 190 is stored in the cache 160, and block-writes from the cache 160 to memory 190 provide time savings when multiple transfers from the registers 130 address spatially local memory spaces.
Techniques are commonly available to optimize the likelihood that the cache will contain subsequently requested data, including techniques for providing spatial locality for two-dimensional data, such as the "tiling" of video data into n x m blocks that extend in the x and y dimensions. U.S. Patent 6,353,438, "CACHE ORGANIZATION - DIRECT MAPPED CACHE", issued 5 March 2002, and U.S. Patent 6,115,793, "MAPPING LOGICAL CACHE INDEXES TO PHYSICAL CACHE INDEXES TO REDUCE THRASHING AND INCREASE CACHE SIZE", issued 5 September 2000, each teach techniques for optimizing cache addressing for two-dimensional data, and are each incorporated by reference herein.
Although a cache substantially decreases the delay associated with providing the appropriate data in the registers 130 for the processing unit 120 (or other execution unit) to operate on, and transferring data from the registers 130 after these operations, the processing of large amounts of data, such as multi-color data associated with each pixel of a high-resolution video image, still consumes a substantial amount of data-transfer time. Consider, for example, the interpolation of pixel values, as is typically performed when a video image is scaled or translated. For example, if an image is "zoomed" by a factor of two in each dimension, new pixel values are created, corresponding to a 'sub-pixel' value between the original pixel values. To determine each sub-pixel value in each direction, a 'next' adjacent pixel value must be loaded into the register, then averaged with the 'prior' pixel value(s), and the average stored. To determine the sub-pixel value at each diagonal between the original pixels, either the four surrounding pixels are averaged, or two of the opposing new sub-pixel values are averaged, each averaging and storing operation typically requiring at least two transfers of data between the cache 160 and the register banks 130. Therefore, to zoom a 1K x 1K image into a 2K x 2K image typically requires over five million transfers between the registers 130 and the cache 160.
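By way of illustration only, the following C sketch shows the conventional approach for one row of such a zoom, in which every sub-pixel value requires its neighbors to be loaded, averaged, and stored; the function name zoom_row_2x and the eight-bit pixel representation are assumptions used for this sketch and do not appear in the drawings.

    #include <stdint.h>

    /* Conventional 2x horizontal zoom of one row of pixels: each new sub-pixel
     * is the average of two adjacent source pixels.  Each averaging step
     * implies loading the 'next' pixel, an add-and-shift, and a store. */
    void zoom_row_2x(const uint8_t *src, uint8_t *dst, int width)
    {
        if (width < 1)
            return;
        for (int x = 0; x < width - 1; x++) {
            dst[2 * x]     = src[x];                                    /* original pixel */
            dst[2 * x + 1] = (uint8_t)((src[x] + src[x + 1] + 1) >> 1); /* sub-pixel */
        }
        dst[2 * (width - 1)] = src[width - 1];                          /* last column */
    }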
It is an object of this invention to reduce the number of data transfer operations between a load/store unit and a register bank in a processing system. It is a further object of
this invention to reduce the computations required for routine operations in a processing system.
These objects, and others, are achieved by a load/store unit in a processing device that includes one or more computation elements. These computation elements are configured to pre-process, or post-process, data as it is transferred between memory and the processing registers. Preferably, the computation elements are located between a cache memory and the register banks. The load/store unit is controlled by an expanded set of instructions that includes combined transfer and process instructions. The instruction set may include, for example, an instruction that loads the average of two data elements into a register. The invention is explained in further detail, and by way of example, with reference to the accompanying drawings wherein:
FIG. 1 illustrates an example architecture of a conventional processing system.
FIG. 2 illustrates an example architecture of a processing system in accordance with this invention.
FIG. 3 illustrates an example block diagram of a computation execution unit in accordance with this invention.
FIG. 4 illustrates an example flow diagram of a combined transfer and processing instruction for the filtering of video data.
Throughout the drawings, the same reference numeral refers to the same element, or an element that performs substantially the same function. The drawings are included for illustrative purposes and are not intended to limit the scope of the invention.
FIG. 2 illustrates an example architecture of a processing system in accordance with this invention. The structure and function of this processing system is similar to that of the system of FIG. 1, and only the differences will be detailed hereinafter. In accordance with this invention, the load/store unit 240 is configured to include a computation execution unit (CEU) 250, typically in the form of a filter or other function that provides an output that is based on multiple data values from a cache 260. By providing a single output from multiple data values, the number of transfers of data values between the cache 260 and the registers 130 is reduced. In a simple example, the CEU 250 may include an adder that is configured to add two adjacent data values from the cache 260, shift the result by one bit position, and provide the result as the value to be transferred to a register 130, thereby loading the register 130 with an average of the two data values from the cache 260. In this example, the instruction
fetch unit 210 would be configured to communicate a "load_average" command to the load/store unit 240, specifying the memory address of the first data value, and identifying the register 130 for receiving the result. Upon receipt of this command, the load/store unit 240 accesses the cache 260 to obtain the two adjacent data values, enables the CEU 250 to perform the example addition-shift operation, and transfers the result to the identified register 130. Other functions that the CEU 250 may perform to reduce the number of data transfers to the register 130 include, for example, a "load_minimum" or "load_maximum" command, wherein the CEU 250 returns the minimum value, or maximum value, of a set of data values in the cache. Hereinafter, the term "filter" is used in the general sense to include any operation that produces a resultant value based on the processing of multiple data values.
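The behavior of such combined commands can be sketched in C as follows. This is a behavioral model only; the helper cache_read(), like the function names themselves, is an assumption used for illustration rather than an actual interface of the load/store unit 240.

    #include <stdint.h>

    extern uint8_t cache_read(uint32_t addr);      /* stands in for a cache access */

    /* "load_average": returns the rounded average of two adjacent data values,
     * so that only a single value is transferred to the destination register. */
    uint8_t load_average(uint32_t addr)
    {
        uint8_t a = cache_read(addr);
        uint8_t b = cache_read(addr + 1);
        return (uint8_t)((a + b + 1) >> 1);        /* add, round, shift right by one */
    }

    /* "load_minimum": returns the minimum of n adjacent data values. */
    uint8_t load_minimum(uint32_t addr, int n)
    {
        uint8_t min = cache_read(addr);
        for (int i = 1; i < n; i++) {
            uint8_t v = cache_read(addr + i);
            if (v < min)
                min = v;
        }
        return min;
    }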
As is known in the art, a conventional cache memory is structured as a plurality of blocks. When a block of data is identified as containing a particular element, based on the memory address of the element, all of the elements of the block are accessible, and a subset of bits of the memory address is used to identify the particular element within the block. By locating the CEU 250 at the cache 260, the aforementioned two adjacent data values are likely to be contained in the same block, and thus a single cache access will most often be sufficient to obtain these adjacent data values for use by the CEU 250.
In a preferred embodiment of this invention, the CEU 250 is interfaced to the cache 260 at the block level, so that multiple data values from a common block in the cache 260 can be provided simultaneously to the CEU 250.
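For instance, assuming a 64-byte cache block (the block size here is an assumption for illustration), the block identifier and the offset within the block are derived from the memory address as sketched below, so that two adjacent byte values will usually share a block and can be taken from a single block access.

    #include <stdint.h>

    #define BLOCK_SIZE 64   /* assumed block size in bytes, for illustration only */

    /* Adjacent elements usually share a block identifier, so both operands of a
     * combined load-computation instruction can be taken from one block access. */
    static inline uint32_t block_id(uint32_t addr)     { return addr / BLOCK_SIZE; }
    static inline uint32_t block_offset(uint32_t addr) { return addr % BLOCK_SIZE; }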
Substantial register-transfer delay savings can be achieved using the principles of this invention by the use of combined load-computation instructions that are designed for a particular application. That is, if a particular operation that is commonly performed in an application is found to use a substantial number of register-cache-transfers to produce a result, the performance of that operation in the CEU 250 will generally provide substantial savings in register-transfer delays.
Similarly, by reducing the amount of data that needs to be loaded into registers, the task of a compiler/scheduler is substantially reduced. In a conventional compiler/scheduler, when the number of registers required to perform a task exceeds the number of available register resources, the compiler/scheduler must use a FIFO or other form of memory to temporarily store and recall intermediate values, which can substantially degrade the processing system's performance.
Of particular note, the processing of video data includes interpolation and down-sampling operations that use a substantial number of register-cache-transfers to provide "fractional" sub-pixel values. Most video codecs rely heavily on the computation of sub-pixel elements that are not actually present in the image, but are derived from their spatially surrounding pixel elements, typically by using a weighted average function. That is, for example, a fractional sub-pixel that is located a quarter of the horizontal distance between coordinates (7,5) and (8,5) is referred to as a sub-pixel at (7.25,5), and its value is a weighted average of the values of the pixels at (7,5) and (8,5) (and, optionally, other surrounding pixel values). In like manner, a sub-pixel at (7.25,5.50) is between the four coordinates (7,5), (7,6), (8,5), (8,6), and its value is a weighted average of the values at these four coordinates (and, optionally, other surrounding coordinates).
Two-point interpolation is commonly used to determine the aforementioned weighted averages, wherein:
value(x+a, y) = value(x, y)*(1-a) + value(x+1, y)*(a), where 0 < a < 1.    (1)
value(x, y+b) = value(x, y)*(1-b) + value(x, y+1)*(b), where 0 < b < 1.    (2)
value(x+a, y+b) = value(x+a, y)*(1-b) + value(x+a, y+1)*(b), or            (3a)
value(x+a, y+b) = value(x, y+b)*(1-a) + value(x+1, y+b)*(a).               (3b)
The value terms on the right of equations (3a) and (3b) are obtained from equations (1) and (2), respectively. Both equations provide the same result, and equation (3a) will be used hereinafter.
Because integer arithmetic is substantially faster than floating-point arithmetic in conventional systems, the above fractional terms are scaled up to integers, and the final values are rounded and scaled down. Equation (1) can thus be rewritten as follows:
value(x+a, y) = (value(x, y)*(2^N - A) + value(x+1, y)*(A) + 2^(N-1)) / 2^N,    (4)
where A is an integer representing a*2^N, and 2^N is the scale factor. The division by 2^N is effected by a shift of the result to the right by N bits. The choice of N is dependent on the degree of resolution required of the sub-pixels. That is, for example, if N is four, the fractional sub-pixels correspond to a sixteenth of a pixel size; if N is one, the fractional sub-pixels correspond to half a pixel size, and equation (4) corresponds to the simple addition-shift example CEU 250 discussed above, with rounding ((... + 2^(N-1)) / 2^N). Equations (2), (3a), and (3b) can be similarly rewritten in integer form:
value(x, y+b) = (value(x, y)*(2^N - B) + value(x, y+1)*(B) + 2^(N-1)) / 2^N.    (5)
value(x+a, y+b) = (value(x+a, y)*(2^N - B) + value(x+a, y+1)*(B) + 2^(N-1)) / 2^N.    (6)
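For clarity, equation (4) may be expressed in C as the following fixed-point sketch, assuming eight-bit pixel values; the function name interp_frac is illustrative only.

    #include <stdint.h>

    /* Equation (4) in fixed-point form: interpolate between two adjacent 8-bit
     * values v0 and v1 with an N-bit integer weight A (0 <= A <= 2^N). */
    static uint8_t interp_frac(uint8_t v0, uint8_t v1, unsigned A, unsigned N)
    {
        unsigned K = 1u << N;                                       /* scale factor 2^N */
        return (uint8_t)((v0 * (K - A) + v1 * A + (K >> 1)) >> N);  /* round, scale down */
    }

With N = 1 and A = 1, this reduces to the add-and-shift average of the simple CEU example discussed above.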
In a preferred embodiment of this invention for a video processor, the CEU 250 includes one or more multipliers, adders, and shifters to implement the above equations (4)-(6).
FIG. 3 illustrates an example embodiment of a CEU that provides an interpolation function. The example embodiment illustrates the use of four interpolation blocks 310 that each contain two multipliers 312, an adder 314, and a rounder/shifter 316; however, one of ordinary skill in the art will recognize that one interpolation block 310 could be used in a time-multiplexed fashion to effect the four interpolations illustrated.
Four data values v(x,y) 301, v(x+1, y) 302, v(x, y+1) 303, and v(x+1, y+1) 304, which, in an example video application, would represent four adjacent pixel values, are input to the CEU of FIG. 3 from the cache 260 of FIG. 2. The above-discussed interpolation factors, A 306, K-A 307, B 308, and K-B 309, are also provided to the CEU, where K = 2^N, and N is the number of bits used for these factors. In a preferred embodiment, these interpolation factors are pre-loaded into registers within the CEU, before each image-interpolation process, for repeated use as each of the pixels of the image is processed.
Each multiplier 312 multiplies an interpolation factor 306-309 by a data value 301-304, or, in the case of the lower interpolation block 310, by an intermediate value 321, 323, and the products are added together by the adder 314. Each rounder 316 is provided the values of K and K/2 (not illustrated), and provides a rounded and scaled result by adding K/2 to the sum from the adder 314, and dividing by K (shifting by N). The rounder 316 is illustrated for ease of understanding; one of ordinary skill in the art will recognize that the functions of the rounder will typically be included in the adder 314.
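A behavioral C model of this arrangement might be written as follows. The assignment of results 321, 323, and 324 to particular blocks is an assumption based on the description above (the fourth result, 322, is omitted for brevity), and the function names are illustrative only.

    #include <stdint.h>

    /* One interpolation block 310: two multipliers, an adder, and a
     * rounder/shifter, with K = 2^N. */
    static unsigned interp_block(unsigned in0, unsigned in1,
                                 unsigned f0, unsigned f1, unsigned K, unsigned N)
    {
        return (in0 * f0 + in1 * f1 + (K >> 1)) >> N;   /* multiply, add, round, shift */
    }

    /* Model of the CEU of FIG. 3: the four data values 301-304 and the pre-loaded
     * factors produce two horizontal results (321, 323), which the lower block
     * combines into the two-dimensional result (324). */
    static void ceu_interpolate(uint8_t v00, uint8_t v10, uint8_t v01, uint8_t v11,
                                unsigned A, unsigned B, unsigned N,
                                unsigned *r321, unsigned *r323, unsigned *r324)
    {
        unsigned K = 1u << N;
        *r321 = interp_block(v00, v10, K - A, A, K, N);       /* row y   */
        *r323 = interp_block(v01, v11, K - A, A, K, N);       /* row y+1 */
        *r324 = interp_block(*r321, *r323, K - B, B, K, N);   /* vertical combination */
    }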
In this example embodiment, one or more of the interpolation results 321, 322, 323, 324 are provided to the register bank 130 by the load/store unit 240 of FIG. 2, depending upon the defined load-computation instruction, and the width of the data path to the register bank 130.
One of ordinary skill in the art will recognize that the example CEU of FIG. 3 is but one example of a variety of functions that can be performed during a load process to reduce the number of data transfers between a register bank and a load/store unit in a processing system. Typically, the function will be defined in terms of a set of instructions, rather than by a schematic diagram as illustrated in FIG. 3.
The following set of instructions defines another form of an interpolation function that reduces the number of data transfers to a register bank.
load_frac4 reg1 reg2 -> reg3
    data1 = mem(reg1)
    data2 = mem(reg1+1)
    data3 = mem(reg1+2)
    data4 = mem(reg1+3)
    data5 = mem(reg1+4)
    A = reg2[3:0]
    KmA = 16 - A
    temp1 = (data1*KmA + data2*A + 8) >> 4
    temp2 = (data2*KmA + data3*A + 8) >> 4
    temp3 = (data3*KmA + data4*A + 8) >> 4
    temp4 = (data4*KmA + data5*A + 8) >> 4
    word[31:24] = temp1
    word[23:16] = temp2
    word[15:8] = temp3
    word[7:0] = temp4
    reg3 = word
end
In the above example, the instruction "load_frac4" loads register 3 (reg3) with a set of four single-interpolation results, corresponding to four applications of equation (4) above. In this example, the four interpolations are interpolations between each pair of five sequential data points, such as would be used to determine fractional pixel values along a line of five pixel values. Register 1 (reg1) contains the address of the first sequential data point, and register 2 (reg2) contains the integer interpolation factor (A). In this example, the interpolation factor comprises four bits, so that a fractional resolution of 1/16 of a pixel width can be obtained. The "KmA" term above corresponds to the "2^N - A" term in equation (4), "8" corresponds to the "2^(N-1)" rounding term, and ">> 4", which is read as "right shift by four bits", corresponds to the "/2^N" term.
Temp1 in the above example represents the execution of equation (4) to determine the interpolation value between the first and second data elements; temp2 represents the interpolation value between the second and third elements; and so on.
In this example, the width of the registers is thirty-two bits and the data values are eight-bit values; thus the interpolated values temp1-temp4 are each also eight-bit values. To further minimize register transfers, the interpolated values are packed into a 32-bit word, and this word is transferred to register 3 (reg3).
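A behavioral C model of the load_frac4 instruction might be written as follows; the helper mem() stands in for a cache read of an eight-bit value and is an assumption for illustration.

    #include <stdint.h>

    extern uint8_t mem(uint32_t addr);     /* stands in for an 8-bit cache read */

    /* Behavioral model of load_frac4: four applications of equation (4) with
     * N = 4 (K = 16), packed into a single 32-bit result word. */
    uint32_t load_frac4(uint32_t reg1, uint32_t reg2)
    {
        unsigned A    = reg2 & 0xF;        /* reg2[3:0] */
        unsigned KmA  = 16 - A;
        uint32_t word = 0;
        for (int i = 0; i < 4; i++) {
            unsigned t = (mem(reg1 + i) * KmA + mem(reg1 + i + 1) * A + 8) >> 4;
            word = (word << 8) | (t & 0xFF);   /* temp1 ends up in bits [31:24] */
        }
        return word;                           /* value written to reg3 */
    }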
The above example instruction is particularly well suited for creating additional pixel values between existing pixel values, commonly termed "upsampling". A similar interpolation command could be provided to "downsample" data, wherein a single pixel value is determined for each of four pairs of pixels (i.e. four pixel values determined from eight pixel values), as might be used when "zooming-out" on an image.
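Such a downsampling command might behave as in the following sketch; the name load_down4 and the simple pair-averaging weights are assumptions, as the text does not define the command in detail.

    #include <stdint.h>

    extern uint8_t mem(uint32_t addr);     /* stands in for an 8-bit cache read */

    /* Hypothetical "load_down4": eight input pixels are reduced to four output
     * pixels by averaging each adjacent pair, packed into one 32-bit word. */
    uint32_t load_down4(uint32_t reg1)
    {
        uint32_t word = 0;
        for (int i = 0; i < 4; i++) {
            unsigned avg = (mem(reg1 + 2 * i) + mem(reg1 + 2 * i + 1) + 1) >> 1;
            word = (word << 8) | (avg & 0xFF);
        }
        return word;
    }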
Note that in these examples, the same interpolation unit 310 of FIG. 3 could be used to provide each of the interpolation results. Thus, the amount of circuitry required to provide a variety of load-computation instructions need not be excessive.
FIG. 4 illustrates an example flow diagram of another function that can be provided in a CEU of this invention to minimize register transfers. This example represents the determination of a filtered data value based on surrounding data values. As is known in the art, polynomial-based filtering can provide a more accurate or realistic determination of a pixel value than the simple two-point or four-point interpolations discussed above. For example, a fractional pixel value for a point may be determined based on a weighted average of the values of the four pixels before and the four pixels after the point, rather than merely the one pixel before and one pixel after, as presented above.
At 410, one or more data blocks corresponding to data in the vicinity of the point are loaded into a cache. At 420, the particular data that is to be used for determining the value at the point is located. At 430, the filter coefficients are obtained, typically from one or more predefined sets of coefficients corresponding to a given filter parameter. As in the example interpolator of FIG. 3, a set of filter coefficients may be loaded at the beginning of the processing of an image, and the same coefficients used for each subsequent filter operation as the pixels of the image are processed. These coefficients may be loaded via a separate load command to the CEU, or via a reference to a register that is included in the instruction that initiates each load and filter operation. The coefficients may be communicated to the CEU via the registers, or the CEU may be configured with a ROM that contains a plurality of coefficient sets, and an index to this ROM can be communicated to the CEU. These and other techniques for configuring the CEU with parameters required for a given function will be evident to one of ordinary skill in the art in view of this disclosure.
Assuming two-dimensional filtering, at 440 the filter coefficients for a first dimension are applied, and at 450, the filter coefficients for the other dimension are applied.
The one or more results from the filter operation are then loaded to the register or registers identified in the load-filter instruction.
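As an illustration of the filter step, the following C sketch computes a one-dimensional weighted sum over the four pixels before and the four pixels after the point of interest. The eight-tap length follows the example above, while the function name filter8, the coefficient format, and the clipping to an eight-bit range are assumptions for this sketch.

    #include <stdint.h>

    /* One-dimensional eight-tap filter: a weighted sum of the four pixels before
     * and the four pixels after the point of interest, scaled down by 2^N.
     * p points at the first pixel after the point; coeff holds the pre-loaded
     * coefficient set. */
    static uint8_t filter8(const uint8_t *p, const int16_t coeff[8], unsigned N)
    {
        int sum = 0;
        for (int i = 0; i < 8; i++)
            sum += p[i - 4] * coeff[i];           /* pixels p[-4] .. p[3] */
        sum = (sum + (1 << (N - 1))) >> N;        /* round and scale down */
        if (sum < 0)   sum = 0;                   /* clip to the 8-bit pixel range */
        if (sum > 255) sum = 255;
        return (uint8_t)sum;
    }

For the two-dimensional case of steps 440 and 450, such a filter would first be applied along one dimension, and the intermediate results then filtered along the other dimension.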
The foregoing merely illustrates the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are thus within its spirit and scope. For example, although the invention is presented in the context of loading processed data into registers 130 for use by a processing unit 120, one of ordinary skill in the art will recognize that an instruction could be defined wherein the "destination" register identifies a memory location for storing the result. In the execution of such an instruction, the processed data can be transferred directly to the cache 260, at the location corresponding to the contents of the destination register, and need not be transferred to the register bank 130. In like manner, although the invention is presented in the context of pre-processing data from the cache to the registers, one of ordinary skill in the art will recognize that the principles of this invention can also be applied to post-process data from the registers to the cache. For example, an instruction can be provided that instructs the load/store unit 240 to store the average of the contents of two registers into a given memory location. These and other system configuration and optimization features will be evident to one of ordinary skill in the art in view of this disclosure, and are included within the scope of the following claims.
In interpreting these claims, it should be understood that:
a) the word "comprising" does not exclude the presence of other elements or acts than those listed in a given claim;
b) the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements;
c) any reference signs in the claims do not limit their scope;
d) several "means" may be represented by the same item or hardware or software implemented structure or function;
e) each of the disclosed elements may be comprised of hardware portions (e.g., including discrete and integrated electronic circuitry), software portions (e.g., computer programming), and any combination thereof;
f) hardware portions may be comprised of one or both of analog and digital portions;
g) any of the disclosed devices or portions thereof may be combined together or separated into further portions unless specifically stated otherwise;
h) no specific sequence of acts is intended to be required unless specifically indicated; and
i) the term "plurality" of an element means at least two of the claimed element, and does not imply any particular range of the number of elements.