GB2393279A - Manipulating data in a plurality of processing elements - Google Patents
- Publication number
- GB2393279A
- Authority
- GB
- United Kingdom
- Prior art keywords
- count
- data
- processing elements
- processing element
- shifting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8007—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
- G06F15/8023—Two dimensional arrays, e.g. mesh, torus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/02—Comparing digital values
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7821—Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/02—Comparing digital values
- G06F7/026—Magnitude comparison, i.e. determining the relative order of operands based on their numerical value, e.g. window comparator
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Multi Processors (AREA)
- Advance Control (AREA)
- Image Processing (AREA)
Abstract
The present invention is capable of placing or loading input data into a 2D or 3D array of processing elements interconnected in a variety of ways, and moving the data around by using a combination of shifts, e.g. north, south, east, west, which can be combined in any desired manner. The exact type and combination of shifts depends upon the particular data manipulation desired. As the shifting proceeds, each processing element is presented with a plurality of different array values. Each processing element can conditionally load any of the values it sees into the output result. The timing of the loading is achieved by monitoring a local counter. In a preferred embodiment, when the value in the local counter is non-positive, the current array value is selected as the final output for the output result. In general, each local counter is initialized to a different positive value and, at certain points in the shifting process, the counter is decremented. The initial value of the counter depends upon its location, and is given by the general function f(x_Index, y_Index, z_Index), with the exact form of f( ) depending upon the particular data manipulation desired.
Description
METHOD FOR MANIPULATING DATA
IN A GROUP OF PROCESSING ELEMENTS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application is related to the following applications: Method for Manipulating Data in a Group of Processing Elements to Perform a Reflection of the Data (docket no. DB001071-000); Method for Manipulating Data in a Group of Processing Elements to Transpose the Data (docket no. DB001070-000); Method for Manipulating the Data in a Group of Processing Elements to Transpose the Data Using a Memory Stack (docket no. DB001069-000); and Method of Rotating Data in a Plurality of Processing Elements (docket no. DB001063-000), all filed concurrently herewith.
BACKGROUND OF INVENTION
[0002] The present invention relates generally to parallel processing and, more specifically, to parallel processing in an active memory device or single instruction, multiple data (SIMD) computer.
[0003] A single, synchronous dynamic random access memory (SDRAM) chip has an internal data bandwidth of greater than 200 G bits/s and a very wide data bus (thousands of bits). That vast data bandwidth provides an opportunity for high performance. Active memories represent one effort to use that vast data bandwidth to improve performance.
[0004] An active memory is a memory device which has a built-in processing resource. One of the principal advantages of active memory is that data is processed close to where it is stored. Usually the processing resource is a highly parallel computer system which has processing power to exploit the very high data bandwidths available inside a memory system. An example of an active memory system is illustrated in FIG. 1.
[0005] In FIG. 1, a main memory 10 appears as a traditional memory to a CPU 12 except that the main memory 10, by virtue of memory processors 14, can be instructed to perform tasks on its data without the data being transferred to the CPU 12 or to any other part of the system over a system bus 16. The memory processors 14 are a processing resource distributed throughout the main memory 10. The processing resource is most often partitioned into many similar processing elements (PEs). The PEs are usually simple and operate in parallel. In such a system, the work of the CPU 12 is reduced to various operating system tasks such as scheduling. A substantial portion of the data processing is performed within the main memory 10 by virtue of the memory processors 14.
[0006] Active memory systems have a long history. The earliest systems were built in the 1960's. However, until the advent of integrated logic and current DRAM technologies, active memory computers were always expensive, special machines, excluded from mass market applications. For active memory to be effective, the organization of data in the PE array is an important consideration. Hence, the provision of an efficient mechanism for moving data from one PE to another is an important consideration in the design of the PE array.
[0007] In the past, several different methods of connecting PEs have been used in a variety of geometric arrangements including hypercubes, butterfly networks, one-dimensional strings/rings and two-dimensional meshes. In a two-dimensional mesh or array, the PEs are arranged in rows and columns, with each PE being connected to its four neighboring PEs in the rows above and below and columns to either side, which are sometimes referred to as north, south, east and west connections.
[0008] Disclosed in G.B. Patent Application Serial No. GB02215 630, entitled Control of Processing Elements in Parallel Processors, filed September 17, 2002, is an arrangement in which a column select line and a row select line can be used to identify processing elements which are active, e.g. capable of transmitting or receiving data. The ability to use a row select signal and a column select signal to identify active PEs provides a substantial advantage over the art in that it enables data to be moved through the array of PEs in a nonuniform manner. However, the need still exists for enabling PEs within the array to work independently of their neighboring PEs even though each PE within the array has received the same instruction.
SUMMARY OF THE INVENTION
[0009] The present invention is directed to a method of controlling a plurality of processing elements. The method is comprised of maintaining a count in at least certain of the processing elements. Each count maintained within a processing element is responsive to that processing element's location. For each processing element which is maintaining a count, that processing element stores data in response to its count.
[0010] According to another aspect of the present invention, a method of controlling the data selected as output data by a plurality of processing elements is comprised of issuing an instruction set to the plurality of processing elements. The instruction set is performed through a series of data shifts. A count of the data shifts is maintained within at least certain of the processing elements, with data being selected based on each processing element's count.
[0011] According to another aspect of the present invention, a method of controlling the position of data in a matrix or array of processing elements is comprised of shifting data within the matrix of processing elements along one of a row, column, diagonal or combination thereof. For each active processing element, data is selected as a final output in response to that processing element's location within the matrix.
[0012] According to another aspect of the present invention, a method for controlling the position of data in a matrix of processing elements is comprised of shifting data within the matrix of processing elements. A current count is maintained in each active processing element with the current count being responsive to the number of data shifts which have been performed. Output data is selected as a function of the current count.
[0013] The present invention contemplates hardware, e.g. memory containing an ordered set of instructions, for carrying out the disclosed methods. The present invention is capable of placing or loading input matrix data into a two dimensional mesh or array of processing elements, and moving the data around the array by using a combination of north, south, east and west shifts, which can be combined to provide northeast to southwest, southwest to northeast, northwest to southeast and southeast to northwest shifts. The exact type and combination of shifts depends upon the particular data manipulation desired. As the shifting proceeds, each processing element is presented with a plurality of different array values. Each processing element can conditionally load any of the values it sees into the output result matrix. However, only one value (the desired result) is loaded into the output matrix.
[0014] The timing of the above loading is achieved by monitoring a local counter. In a preferred embodiment, when the value in the local counter is non-positive, the current array value is selected as the final output for the output result matrix. In general, each local counter is initialized to a different positive value and, at certain points in the shifting process, the counter is decremented. The initial value of the counter depends upon its location in the array, and is given by the general function f(Row_Index, Col_Index), with the exact form of f() depending upon the particular array manipulation desired. The present invention enables each processing element within the array to operate independently of all other processing elements even though each of the processing elements is responding to the same command, e.g. an edge shift, planar shift, wrap shift, vector shift or some combination thereof. Other advantages and benefits will become apparent from the description of the invention appearing below.
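The counter mechanism described above can be illustrated with a minimal Python sketch. It is not taken from the specification: it assumes a single row of PEs, a west-to-east wrap shift, and a counter that is decremented after every shift. The function `reverse_count` is a hypothetical choice of f() picked only because it makes the row come out reversed; the specification leaves the form of f() open.

```python
def wrap_shift_east(row):
    """One wrap shift: each PE takes its west neighbour's value, the last wraps to the first."""
    return [row[-1]] + row[:-1]


def select_by_counter(row, initial_count):
    """Shift a row repeatedly; each PE latches the value it sees when its counter reaches zero."""
    n = len(row)
    counters = [initial_count(col, n) for col in range(n)]  # f(col), one counter per PE
    loaded = [False] * n
    output = [None] * n
    x = list(row)                      # the values currently travelling on the shift network
    for _ in range(max(counters)):
        x = wrap_shift_east(x)         # the same command is issued to every PE
        for col in range(n):
            if loaded[col]:
                continue               # only one value is ever loaded per PE
            counters[col] -= 1
            if counters[col] <= 0:     # non-positive counter: select the current value
                output[col] = x[col]
                loaded[col] = True
    return output


def reverse_count(col, n):
    """Hypothetical f(): the number of east wrap shifts after which PE `col` sees element n-1-col."""
    k = (2 * col + 1) % n
    return k if k > 0 else n


result = select_by_counter([10, 20, 30, 40], reverse_count)
print(result)  # [40, 30, 20, 10]
```

Every PE executes the same shift command, yet each one captures a different value because its counter was initialized from its own position.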
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] For the present invention to be easily understood and readily practiced, the present invention will be described in conjunction with an exemplary embodiment, for purposes of illustration and not limitation, in conjunction with the following figures wherein:
[0016] FIG. 1 is a block diagram illustrating the concept of active memory;
[0017] FIG. 2 is a high level block diagram of one example of an active memory on which the methods of the present invention may be practiced;
[0018] FIG. 3 is a high level block diagram of one example of a PE;
[0019] FIG. 4 is a diagram illustrating one type of logic circuit that may be used to interconnect the PE illustrated in FIG. 3 to other PEs;
[0020] FIG. 5 illustrates one method of interconnecting PEs to form an array of PEs;
[0021] FIGs. 6A and 6B illustrate one example of an edge shift;
[0022] FIGs. 7A and 7B illustrate one example of a planar shift;
[0023] FIGs. 8A and 8B illustrate one example of a wrap shift;
[0024] FIGs. 9A and 9B illustrate one example of a vector shift;
[0025] FIGs. 10A and 10B illustrate another example of a vector shift;
[0026] FIGs. 11A and 11B illustrate one example of a data broadcast from the edge registers in which a row and column select function is enabled;
[0027] FIGs. 12A and 12B illustrate one example of a broadcatch in which only one column is selected;
[0028] FIGs. 13A and 13B illustrate one example of selected edge registers being loaded with the AND of selected columns; and
[0029] FIGs. 14A and 14B illustrate another example of a data broadcast.
DESCRIPTION OF THE INVENTION
[0030] Illustrated in FIG. 2 is a high level block diagram of one example of an active memory device 18 on which the methods of the present invention may be practiced. The reader should understand that the methods of the present invention are generally applicable to any group of processing elements having the necessary physical connections between PEs to enable the manipulation of data as required by the methods. The hardware illustrated in FIG. 2 is disclosed for purposes of illustration and not limitation. Furthermore, those of ordinary skill in the art will recognize that the block diagram of FIG. 2 is an overview of an active memory device with a number of components known in the art being omitted for purposes of clarity.
[0031] The active memory device 18 of FIG. 2 is intended to be deployed in a computer system as a slave device, where a host processor (e.g. CPU 12 in FIG. 1) sends commands to the active memory device 18 to initiate processing within the active memory device 18. A complete processing operation, i.e., data movement and processing, in the active memory device 18 will usually consist of a sequence of many commands from the host to the active memory device 18.
[0032] The active memory device 18 may have two interfaces, a bus interface 20 and a host memory interface 22, for interfacing with the host or other external logic for the purposes of data input, data output and for control. The host memory interface 22 (data input/output ports) of the active memory device 18 is similar in its operation to the interface of a synchronous DRAM. To access data within a DRAM array 24, the host must first activate a page of data. Each page may contain 1024 bytes of data and there may be 16384 pages in all. Once a page has been activated, it can be written and read through, for example, the 32 bit data input/output ports. The data in the DRAM array 24 is updated when the page is deactivated.
[0033] In the active memory device 18 the input and output ports are separate, or they may be combined into a single bi-directional input/output port. A control output may be provided to control a bi-directional buffer servicing the single bi-directional input/output port.
[0034] The host memory interface 22 may operate at twice the frequency of the master input clock. A copy of the 2x clock may be driven off-chip as a timing reference. Unlike a traditional DRAM, the access time for the host memory interface 22 port takes a variable number of cycles to complete an internal operation, such as an activate or deactivate. A ready signal (rdy) is provided to allow the host to detect when the command has been completed.
[0035] The control or command port (cmd) may be a straightforward 32 bit synchronous write/read interface. Writes place both data and the corresponding address into a FIFO 26 of a task dispatch unit 28, which holds the commands until they are executed in the order they were issued. This arrangement allows a burst of commands to be written to the active memory device 18, suiting the burst operation of many peripheral buses. Reads may operate directly.
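The queueing behaviour of the command port can be pictured with a short Python sketch. It only illustrates first-in, first-out ordering of (address, data) writes; the class name `CommandFifo`, its methods and the example addresses are hypothetical and say nothing about the depth or timing of the actual FIFO 26.

```python
from collections import deque


class CommandFifo:
    """Hypothetical model of a command FIFO: writes are queued, then executed in issue order."""

    def __init__(self):
        self._queue = deque()

    def write(self, address, data):
        # A write places both the data and its address into the FIFO.
        self._queue.append((address, data))

    def execute_next(self):
        # Commands leave the FIFO in the order they were issued; None when empty.
        return self._queue.popleft() if self._queue else None


fifo = CommandFifo()
for address, data in [(0x10, 0xA), (0x14, 0xB), (0x18, 0xC)]:  # a burst of three writes
    fifo.write(address, data)

executed = [fifo.execute_next() for _ in range(3)]
```

The host can issue the whole burst without waiting for each command to complete, which is the property the paragraph attributes to the arrangement.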
[0036] The command port is also synchronous, running at the same frequency as the master input clock. Similarly to the host memory interface 22 port, the clock may be driven out as a timing reference.
[0037] In addition to the two address-mapped ports, the active memory device 18 has an interrupt output (intr) which is used to alert its host to various different conditions.
[0038] Internal control in the active memory device 18 is handled by three processors. The task dispatch unit 28 (mentioned above) receives commands from the command port, interprets them and passes them on to the other two processors, a DRAM control unit 30 and an array sequence controller 32. The task dispatch unit 28 also maintains the addresses for operand placement in an array processor register file (RF) and enables access to on-chip resources such as a program memory 34.
[0039] The DRAM control unit 30 controls the DRAM array 24. The DRAM control unit 30 arbitrates between requests for DRAM array 24 access from the host through host memory access registers (H) and through the host memory interface 22. The DRAM control unit 30 also schedules DRAM array 24 refreshes.
[0040] The array sequence controller 32 controls an array or two dimensional mesh of PEs 36. The sequence controller 32 also executes a program from the program memory 34 and broadcasts control signals into the array of PEs 36. The DRAM control unit 30 and array sequence controller 32 may have a synchronization mechanism, whereby they can link the execution of tasks in either processor.
[0041] The active memory device 18 may contain, according to one embodiment, sixteen 64k x 128 eDRAM cores. Each eDRAM core is closely connected to an array of sixteen PEs, making 256 (16 x 16) PEs in all.
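As a quick arithmetic check, the page organization given earlier (1024-byte pages, 16384 pages) and the eDRAM organization given here describe the same capacity. The short Python sketch below works this out; it assumes that "64k" means 65,536 words and that each word is 128 bits wide, which is the usual reading but is not stated explicitly in the text.

```python
# Capacity as seen through the host memory interface (figures from the description).
page_bytes = 1024
page_count = 16384
host_view_bytes = page_bytes * page_count

# Capacity of the eDRAM cores (assumption: "64k x 128" = 65,536 words of 128 bits).
core_count = 16
words_per_core = 64 * 1024
bits_per_word = 128
edram_bytes = core_count * words_per_core * bits_per_word // 8

# Sixteen PEs per core gives the 256 PEs mentioned in the text.
pe_count = core_count * 16
bytes_per_pe = edram_bytes // pe_count

print(host_view_bytes, edram_bytes, pe_count, bytes_per_pe)
# 16777216 16777216 256 65536
```

Under these assumptions both views come to 16 MiB, or 64 KiB of DRAM per PE.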
[0042] FIG. 3 is a high level block diagram of one example of a PE 37. The PE 37 is comprised of a set of Q registers and a shift network 38 which interact with a set of M registers and another shift network 40. One of the sets of registers and shift network 38, 40 receives inputs from various registers, such as registers R0, R1, R2 and 0. The output of the registers and shift networks 38, 40 is input to an arithmetic logic unit (ALU) 42. The ALU 42 is capable of performing various arithmetic functions on its input such as addition, subtraction, etc. as is known. The ALU 42 is in communication with condition logic 44 and a result pipe 46.
[0043] The result pipe 46 is a series of interconnected registers R0, R1, R2 and a neighborhood connection register X, which may be used to output a final value. The result pipe 46 also receives through a multiplexer 47 data in the form of an output signal X from its four neighbors, one to the north (XN), one to the east (XE), one to the south (XS) and one to the west (XW). If the PE 37 happens to be located on an edge of an array, then it may be receiving data from an edge register or a PE in the same row or column, but on an opposite edge, as will be described in greater detail below.
[0044] The result pipe 46 is in communication with a register file (RF) 48 which in turn is in communication with an interface 50. The interface 50 may include a DRAM interface 52 as well as access to the host memory access registers (H).
[0045] The reader should recognize that the PE 37 illustrated in FIG. 3 is exemplary only and is not intended to limit the present invention. For example, the number and location of registers and shift networks may vary, the complexity of the ALU 42 and condition logic 44 may vary, the number of registers and interconnection of registers in the result pipe 46, the size and number of register files, and connection to neighboring PEs as well as other logic may be varied while remaining within the scope of the present invention. The particular architecture illustrated in FIG. 3 was selected to provide a rich register set to enable fairly complex multibyte operations to be kept within the PE as much as possible.
[0046] For example, the Q registers and shift network 38 allow for data shifting within the 32 bits of the Q register to the left (most significant direction) one, two, four or eight places and eight places to the right as well as for merging data back into a floating point format. The M registers and shift network 40 allow for data shifting within the 32 bits of the M register to the right (least significant direction) one, two, four or eight places and for demerging data from floating point into a signed magnitude plus exponent format. The result from the ALU 42 can be loaded into register R0 while the input from the register file 48 can be loaded into either register R1 or register R2. The neighborhood connection register X can be used as a flexible member of the result pipeline allowing a pipeline length of up to four to be programmed within the PE 37. The X register can be loaded from the R0, R1 or R2 registers, or from the neighborhood interconnection input (the X register of a neighboring PE). The output of the X register can be fed back into the result pipeline at R1 or R2. The register file 48 may be implemented as a 128 entry by 8 bit register file implemented as a synchronous static RAM.
[0047] The DRAM interface 52 may contain two registers, a RAM IN register and a RAM OUT register. Input from the DRAM 24 of FIG. 2 may be held in the RAM IN register while output to the DRAM 24 is held in the RAM OUT register. The RAM IN and RAM OUT registers may reside in the clock domain of the DRAM 24, which typically uses a slower or divided clock derived from the same source as the clock used for the PE array 36. The RAM IN and RAM OUT registers may be controlled directly from the DRAM control unit 30 and are not visible to the programmer. Data can be transferred into and out of the register file 48 using stolen cycles. Data can also be transferred to/from the host memory access registers (H) without stealing cycles from processing in the PE 37.
[0048] Eight host memory access registers (H) may be provided which allow for a short burst of four or eight bytes to be transferred into or out of the DRAM 24 for host access. Those registers may be multiplexed and be visible from the host memory interface 22 (see FIG. 1) as a page of data. More details about the PEs may be found in G.B. Patent Application No. O 1 5 6 -ill entitled Host Memory Interface for a Parallel Processor and filed September 17, 2002, which is hereby incorporated by reference.
[0049] FIG. 4 is a diagram illustrating one type of logic circuit that may be used to interconnect PEs of the type illustrated in FIG. 3. The reader will understand that many types of logic circuits may be used to interconnect PEs depending upon the functions to be performed. Using the logic circuit of FIG. 4 to interconnect PEs may result in an array of PEs 36 of the type illustrated in FIG. 5.
[0050] Turning now to FIG. 5, the X register within the result pipe 46 of each PE is driven out as, for example, an eight bit wide X output. Eight bits has been chosen in connection with this architecture as the data width for the PE-PE interconnect to keep a balance between the data movement performance of the array and the improved computational performance. Other sizes of interconnects may be used. The X output is connected to the neighboring inputs of each PE's closest neighbors in the north and west directions. To the south and east, the X output is combined with the input from the opposite direction and driven out to the neighboring PE.
[0051] At the edges of the array 36, the out-of-array connection is selected through a multiplexer to be either the output from the opposite side of the array or an edge/row register 54 or an edge/col register 56. The edge registers 54, 56 can be loaded from the array output or from the controller data bus. A data shift in the array can be performed by loading the X register from one of the four neighboring directions. The contents of the X register can be conditionally loaded on the AND gate of the row select and column select signals which intersect at each PE. When the contents of the X register are conditionally loaded, the edge registers 54, 56 are also loaded conditionally depending on the value of the select line which runs in the same direction. Hence, an edge/row register 54 is loaded if the column select for that column is set to 1 and an edge/col register 56 is set if the row select is set to 1. The reader desiring more information about the hardware configuration illustrated in FIG. 5 is directed to G.B. Patent Application GB02215 61,0, entitled Control of Processing Elements in Parallel Processors filed September 17, 2002, which is hereby incorporated by reference.
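The conditional load on the AND of the row select and column select signals can be sketched in Python as follows. This is a simplified model only: it covers the X registers, leaves out the edge registers, and assumes that the out-of-array connection is set to the opposite side of the array (a wrap). The function name `conditional_shift_east` and the example values are hypothetical.

```python
def conditional_shift_east(x, row_select, col_select):
    """Load X from the west neighbour only in PEs where row select AND column select is 1."""
    rows, cols = len(x), len(x[0])
    new_x = [row[:] for row in x]          # inactive PEs keep their current X value
    for r in range(rows):
        for c in range(cols):
            active = row_select[r] & col_select[c]
            if active:
                # Assumption: the west edge is fed from the opposite (east) side of the row.
                new_x[r][c] = x[r][(c - 1) % cols]
    return new_x


x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
shifted = conditional_shift_east(x, row_select=[1, 0, 1], col_select=[0, 1, 1])
print(shifted)  # [[1, 1, 2], [4, 5, 6], [7, 7, 8]]
```

Row 1 and column 0 are deselected, so those PEs ignore the shift even though every PE received the same command.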
[0052] With the hardware previously described, a number of shifting operations may be performed as illustrated in FIGs. 6A, 6B through 10A, 10B. In FIGs. 6A and 6B, an edge shift is illustrated. In the edge shift, the edge/col registers 56 are active as the data is shifted left to right (west to east) as shown in FIGs. 6A, 6B. The reader will recognize that an edge shift may be performed in the other direction, right to left (east to west). Alternatively, edge shifts may be performed by using the edge/row registers 54 in a north to south or south to north direction.
[0053] Illustrated in FIGs. 7A, 7B is a planar shift. In the planar shift there is no wrap around from the edge of the array. The reader will recognize that in addition to the planar shift illustrated in FIGs. 7A, 7B, planar shifts from east to west, north to south, and south to north may also be performed.
[0054] Illustrated in FIGs. 8A, 8B is a wrap shift. In the wrap shift, the edge/col registers 56 do not participate. Additionally, wrap shifts from east to west, north to south and south to north may be performed.
[0055] Illustrated in FIGs. 9A, 9B is a vector shift. Again, the edge/col registers 56 do not participate. Furthermore, the output of the PE in the bottom right corner of the array wraps to the input of the PE in the upper left corner of the array. In FIGs. 10A and 10B, a vector shift in the direction opposite to the direction of FIGs. 9A, 9B is illustrated. The reader will recognize that vector shifts from north to south and south to north may also be performed.
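The four shift types can be contrasted with a small Python sketch of one west-to-east step on a list-of-rows array. The figures are not reproduced here, so several details are assumptions rather than facts from the specification: that the edge register feeds the west column and captures the east column in an edge shift, that the vacated column in a planar shift receives a fill value, and that in a vector shift the end of each row feeds the start of the next row. All function names are hypothetical.

```python
def edge_shift_east(array, edge_col):
    """Assumed model: the edge register feeds the west column and captures the east column."""
    new_array = [[edge_col[r]] + row[:-1] for r, row in enumerate(array)]
    new_edge = [row[-1] for row in array]
    return new_array, new_edge


def planar_shift_east(array, fill=0):
    """No wrap around: the east column is lost; the fill value is an assumption."""
    return [[fill] + row[:-1] for row in array]


def wrap_shift_east(array):
    """Each row rotates on itself; the edge registers do not participate."""
    return [[row[-1]] + row[:-1] for row in array]


def vector_shift_east(array):
    """The array is treated as one long vector; bottom right wraps to upper left."""
    cols = len(array[0])
    flat = [value for row in array for value in row]
    flat = [flat[-1]] + flat[:-1]
    return [flat[i:i + cols] for i in range(0, len(flat), cols)]


a = [[1, 2, 3],
     [4, 5, 6]]
edge_result, edge_after = edge_shift_east(a, edge_col=[7, 8])
planar_result = planar_shift_east(a)
wrap_result = wrap_shift_east(a)
vector_result = vector_shift_east(a)
```

For this input the sketch gives [[7, 1, 2], [8, 4, 5]] with edge values [3, 6] for the edge shift, [[0, 1, 2], [0, 4, 5]] for the planar shift, [[3, 1, 2], [6, 4, 5]] for the wrap shift and [[6, 1, 2], [3, 4, 5]] for the vector shift.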
[0056] Returning to FIG. 5, the PE-PE interconnect may also provide a broadcast and broadcatch network. Connections or buses 58 extend north to south from a column select register 59 and connections or buses 60 extend west to east from a row select register 61. Also provided are a row broadcast/broadcatch AND chain 62 and a column broadcast/broadcatch AND chain. When used for data broadcast or broadcatch, these connections (column buses 58 and row buses 60) act as if driven by open drain drivers; the value on any bit is the wire-AND of all the drivers' outputs. Three control signals (broadcatch, broadcast and intercast) determine the direction of the buses as follows:
If broadcatch is set to 1, any PE for which the corresponding bits of the row select register 61 and column select register 59 are both set will drive both the row buses 60 and the column buses 58. Note that if no PEs in a row or column drive the bus, the edge register at the end of that row or column will be loaded with 0xFF.
If broadcast is set to 1, the row bus 60 is driven from the row select register 61 and the column bus 58 is driven from the column select register 59, and any PE for which the corresponding bits of the row select register 61 and column select register 59 are both set will be loaded from one of the row or column inputs, according to which is selected.
If intercast is set to 1, any PE in which its A register is 1 will drive its output onto its row bus 60 and column bus 58, and any PE for which the corresponding bits of the row select register 61 and column select register 59 are both set will be loaded from one of the row buses 60 or column buses 58, according to which is selected.
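The wire-AND behaviour of the buses during a broadcatch can be modelled with a few lines of Python. The sketch computes only the value that appears on each row bus, assuming 8-bit X values; which edge registers actually latch that value is not modelled. The function name `broadcatch_row_buses` and the sample data are hypothetical.

```python
from functools import reduce


def broadcatch_row_buses(x, row_select, col_select):
    """Return the value on each row bus: the AND of every driving PE, or 0xFF with no drivers."""
    bus_values = []
    for r, row in enumerate(x):
        # A PE drives the bus only when both its row select and column select bits are set.
        drivers = [value for c, value in enumerate(row)
                   if row_select[r] and col_select[c]]
        # Starting from 0xFF models the undriven open-drain bus (all bits read as 1).
        bus_values.append(reduce(lambda a, b: a & b, drivers, 0xFF))
    return bus_values


x = [[0xF0, 0x3C],
     [0xAA, 0x55]]
buses = broadcatch_row_buses(x, row_select=[1, 0], col_select=[1, 1])
print([hex(v) for v in buses])  # ['0x30', '0xff']
```

Row 0 has two drivers, so its bus carries 0xF0 AND 0x3C = 0x30; row 1 has none, so its bus reads 0xFF, matching the note in the broadcatch case.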
[0057] With the aforementioned connections, a number of operations are possible, some of which are illustrated in FIGs. 11A, 11B through 14A, 14B.
[0058] In FIGs. 11A, 11B, data is broadcast from edge/col registers 56 with the column select register 59 and row select register 61 set as illustrated in FIG. 11A. As a result, data is latched in the PEs as shown in FIG. 11B in which four PEs are active, and the remainder of the PEs in the array are inactive.
[0059] In FIGs. 12A, 12B, a broadcatch instruction is illustrated in which one column is selected by setting the value for that column's bus 58 to 1. In this broadcatch-column operation, only those edge/col registers 56 for which the row select register 61 bits are set will be loaded. Similarly, in a broadcatch-row operation (not shown), only those edge/row registers 54 for which the corresponding column select register 59 bits are set will be loaded.
[00601 FIGs. 13A, 13B illustrate a broad catch instruction. In the illustrated example, the column select register 59 and row select register 61 are used to select the PEs whose values will be AND' ed together and loaded in the corresponding edge/co! registers 56. In FIGs. 13A, 13B, the column edge registers 56 are loaded with the AND of selected columns, except where the row select is 0.
[0061] In FIGs. 14A, 14B, an example of an intercast operation is illustrated. In an intercast operation, PEs which drive onto the row buses 60 and column buses 58 are determined by each PE's A register value. The PEs which are loaded are determined by the row and column selects, just like for a broadcast. In FIG. 14A, data is broadcast from the X registers of those PEs where A equals 1 while in FIG. 14B, the column select register 59 and row select register 61 together activate those PEs into which data will be written.
[0062] Using the aforementioned instructions or operations, a group of instructions may be combined into an instruction set for manipulating data within the array 36 of PEs. The instruction set may include a single instruction or operation or a combination of instructions. Each individual instruction is carried out through a series of shifts.
[0063] In operation, an input matrix of data is placed on the shift network, and moved around by using a combination of north, south, east and west shifts. In addition, the column select register 59 and row select register 61 may be used to determine which of the PEs is active. The exact combination of active PEs, instructions, and direction in which the instruction (shift) is performed will depend upon the particular array manipulation required. As the instructions are executed and the shifting proceeds, each PE will be presented with different array values. For example, if a wrap shift is performed a number of times equal to the number of PEs in a row, each PE in the row will see every value held by all of the other PEs in the row.
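The wrap-shift observation can be checked with a small simulation. The sketch below assumes a single row modelled as a Python list; the helper names `wrap_shift_west` and `values_seen` are hypothetical and stand in for the hardware shift network.

```python
def wrap_shift_west(row):
    """One east-to-west wrap shift: every PE takes its east neighbour's
    value, and the easternmost PE takes the value wrapped from the west end."""
    return row[1:] + row[:1]


def values_seen(row):
    """For each PE, list the values presented to it over len(row) wrap shifts."""
    seen = [[] for _ in row]
    current = list(row)
    for _ in range(len(row)):
        current = wrap_shift_west(current)
        for pe, value in enumerate(current):
            seen[pe].append(value)
    return seen


row = [10, 20, 30, 40]
for pe, history in enumerate(values_seen(row)):
    print(pe, history)
```

After as many shifts as there are PEs in the row, each history contains every value of the row exactly once, which is the property the selection mechanism relies on.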
[0064] A PE can conditionally select any of the values it sees as its final output value by conditionally loading that value, which is representative of an output result matrix. However, only one value, the desired result, is loaded.
[0065] All X values are passed through the PE; the required output value is conditionally loaded once it has arrived in the PE. The conditional loading can be done in various ways, e.g., by using any PE registers except X, R1, or R2. An example is shown below.
Clock cycle | PE C+0 | PE C+1 | PE C+2 | PE C+3 |
---|---|---|---|---|
T+0 | X <= X(East) | X <= X(East) | X <= X(East) | X <= X(East) |
T+1 | R1 <= X | R1 <= X | R1 <= X | R1 <= X |
T+2 | <cond>? R0 <= R1 | <cond>? R0 <= R1 | <cond>? R0 <= R1 | <cond>? R0 <= R1 |
- At time T+0: The X register reads data from the X register on the PE to the East. This shifts data to the left (or West).
- At time T+1: The R1 register unconditionally reads the data off the shift network (X register).
- At time T+2: The R0 register conditionally loads the data from R1 (i.e., if <cond> = 1).
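The three clock cycles can be mirrored in software as follows. This is a minimal sketch, not the device's microarchitecture: the `PE` dataclass, the `step` function and the wrap-around at the east end of the row are assumptions made so the example is self-contained.

```python
from dataclasses import dataclass


@dataclass
class PE:
    x: int              # X register, sits on the shift network
    r1: int = 0         # R1 register, loaded unconditionally
    r0: int = 0         # R0 register, loaded only when cond is set
    cond: bool = False  # per-PE condition for this step


def step(pes):
    """Run the T+0, T+1 and T+2 cycles once for a row of PEs."""
    n = len(pes)
    # T+0: X <= X of the PE to the East (data moves West; wrap is assumed).
    east_values = [pes[(i + 1) % n].x for i in range(n)]
    for pe, value in zip(pes, east_values):
        pe.x = value
    # T+1: R1 unconditionally reads the shift network.
    for pe in pes:
        pe.r1 = pe.x
    # T+2: R0 loads from R1 only where the condition holds.
    for pe in pes:
        if pe.cond:
            pe.r0 = pe.r1


pes = [PE(x=1), PE(x=2, cond=True), PE(x=3), PE(x=4)]
step(pes)
print([(pe.x, pe.r1, pe.r0) for pe in pes])
```

Reading all east neighbours before writing any X register keeps the shift simultaneous across the row, as it would be in lock-step hardware.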
[0066] The timing of the loading is achieved by maintaining a current count in a local counter, which is typically implemented in software. In one embodiment, the local counter is set to an initial value. The local counter can be set in a variety of ways, including loading the counter with the initial value or calculating the initial value locally based on the processing element's location in the matrix (or array) and the function being performed on the data. Thereafter, at certain points in the shifting process, the counter is decremented. For example, the counter may be decremented once for each shift that occurs, or may be decremented once per n clock cycles where n clock cycles equals one shift. As stated, the initial value of the counter depends on the PE's position in the matrix or array and is given by the general function f(Row_Index, Col_Index), where the exact form of f() will depend on the particular array manipulation required. When the counter reaches a non-positive value (i.e., zero or negative) the PE selects the data to be loaded into the output matrix.
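A countdown of this kind can be sketched for one concrete manipulation, reflecting a single row. In the sketch below, the function name `reflect_row`, the specific f (here `(n - 1 - 2 * col) % n`) and the `loaded` flag that makes each PE store only once are assumptions chosen for illustration; the text above leaves the exact form of f() to the particular manipulation.

```python
def reflect_row(row):
    """Reverse a row using east-to-west wrap shifts and per-PE countdowns."""
    n = len(row)
    # Assumed f(col): number of shifts after which PE `col` holds the value
    # that started at position n - 1 - col.
    counts = [(n - 1 - 2 * col) % n for col in range(n)]
    loaded = [False] * n
    output = [None] * n
    x = list(row)  # X registers on the shift network
    for _ in range(n):
        # Each PE stores once, when its counter is non-positive.
        for col in range(n):
            if not loaded[col] and counts[col] <= 0:
                output[col] = x[col]
                loaded[col] = True
        # Every PE obeys the same shift command, then decrements its counter.
        x = x[1:] + x[:1]
        counts = [c - 1 for c in counts]
    return output


print(reflect_row([1, 2, 3, 4]))
```

All PEs execute the identical shift sequence; only the locally held count differs, which is what lets each PE pick a different value out of the same stream.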
[0067] Other ways of achieving the same result include resetting the counter to zero and loading each PE with a target value. Thereafter, the counter is incremented producing a current count. When the current count equals the target value, the data value is selected as the final output value to be loaded into the output matrix. Generally, a counter is set to a first known value. Then, at certain programmable points in the algorithm, the value of the counter may be altered, up or down, by a programmable amount. Storing occurs when a current count in the counter hits a pre-defined target value.
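The count-up variant can be sketched in the same style. The function `select_by_target` and its parameters `start`, `step` and `shifts` are hypothetical names introduced here; the sketch assumes a single row, east-to-west wrap shifts, and one counter update per shift.

```python
def select_by_target(row, targets, start=0, step=1, shifts=None):
    """Each PE stores the value on the shift network when its counter,
    altered by `step` after every shift, equals its own target value."""
    n = len(row)
    shifts = n if shifts is None else shifts
    counts = [start] * n       # counter set to a first known value
    stored = [False] * n
    output = [None] * n
    x = list(row)
    for _ in range(shifts + 1):
        for pe in range(n):
            if not stored[pe] and counts[pe] == targets[pe]:
                output[pe] = x[pe]
                stored[pe] = True
        x = x[1:] + x[:1]                     # east-to-west wrap shift
        counts = [c + step for c in counts]   # programmable amount
    return output


# Same target everywhere: a rotation by one position.
print(select_by_target([1, 2, 3, 4], targets=[1, 1, 1, 1]))
# Location-dependent targets: a reflection of the row.
print(select_by_target([1, 2, 3, 4], targets=[3, 1, 3, 1]))
```

A PE whose target is never hit keeps `None` in this model, so targets must be reachable from `start` in multiples of `step`.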
[0068] By using the method of the present invention, PEs within a group of PEs can be individually controlled as to the output value which the PE selects for output into the final matrix. Thus, although all of the PEs are responding to the same command, e.g., an east to west wrap shift, each of the PEs is capable of selecting different data at different points during the execution of the instruction thereby enabling various types of data manipulations, e.g., transpose, reflection. Furthermore, by determining which PEs are active, additional flexibility is provided so that subsets of data can be manipulated.
[0069] Although the figures illustrate a two-dimensional (2D) array connected as a mesh, the present invention is applicable to other configurations. Accordingly, the phrase "plurality of processing elements" is to be broadly construed to include 2D and 3D collections of PEs connected in any known manner. For example, the PEs could be connected in shapes other than as illustrated in the figures, e.g., a cube. A cube would have a function f(x_Index, y_Index, z_Index). An n-dimensional hypercube would have n dimensions and a function f(d(0), d(1), d(2), ..., d(n-1)).
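The step from two indices to n indices can be expressed directly in code. The sketch below is an assumption-laden illustration: `initial_counts` is a hypothetical helper, and the example f (the sum of the coordinates) is arbitrary and is not one of the functions used for any particular manipulation.

```python
from itertools import product


def initial_counts(shape, f):
    """Evaluate f(d(0), ..., d(n-1)) at every PE location of an
    n-dimensional collection whose extent per dimension is given by `shape`."""
    return {index: f(*index) for index in product(*(range(d) for d in shape))}


# Arbitrary illustrative f for a 2 x 2 x 2 cube of PEs.
counts = initial_counts((2, 2, 2), lambda x, y, z: x + y + z)
print(counts[(1, 0, 1)])
```

Because f receives the coordinates as separate arguments, the same helper covers a row, a 2D mesh, a cube or a higher-dimensional arrangement without change.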
[0070] Additionally, the network need not be connected as a mesh. For example, a simple extension may be implemented by providing two extra connections, one to the PE halfway across the row and the other to the PE halfway down the column. For that example there would be two more shift connections. In addition to the North, East, South, and West shifts, there could also be Half_Row and Half_Col shifts. Both of the above changes could be used at the same time. For example, a four-dimensional hypercube with half-way connections would have twelve shift options.
[0071] While the present invention has been described in connection with a preferred embodiment thereof, those of ordinary skill in the art will recognize that many modifications and variations such as those previously discussed are possible. The present invention is not to be limited by the foregoing description but only by the following claims.
Claims (26)
1. A method of controlling a plurality of processing elements, comprising: at least certain of said processing elements maintaining a count, each count being responsive to a processing element's location; and for each processing element maintaining a count, storing data in response to its count.
2. The method of claim 1 wherein said maintaining a count includes setting a counter to a first known value and altering the count at programmable intervals by a programmable amount, said storing occurring when a current count equals a target value.
3. The method of claim 1 wherein said maintaining a count includes setting a counter to an initial value, and counting down from said initial value, said storing occurring when a current count is non-positive.
4. The method of claim 1 wherein said maintaining a count includes setting a counter to a first known value, and counting up from said first known value, said storing occurring when a current count equals a target count.
5. A method of controlling the data selected as output data by a plurality of processing elements, comprising: issuing an instruction set to said plurality of processing elements, said instruction set being performed through a series of data shifts; maintaining a count responsive to said data shifts within at least certain of said processing elements; and selecting data based on said counts.
6. The method of claim 5 wherein said instruction set includes one of an edge shift, planar shift, wrap shift and vector shift or a combination thereof.
7. The method of claim 5 wherein said data shifts include shifting data in one of a north, south, east and west, plus z and minus z directions.
8. A method of controlling the position of data in a plurality of processing elements, comprising: shifting data within the plurality of processing elements along one of a row, column or diagonal; and each active processing element selecting data as a final output in response to that processing element's location within the plurality of processing elements.
9. The method of claim 8 additionally comprising loading an initial count into at least certain of said plurality of processing elements and calculating an initial count locally based on the processing element's location in the plurality and the function being performed on the data.
10. The method of claim 9 additionally comprising maintaining a current count in at least certain of said plurality of processing elements, said current count being responsive to said initial count and the number of data shifts performed, said selecting being responsive to said current count.
11. The method of claim 10 wherein said initial count is modified by a programmable amount at programmable intervals to produce said current count.
12. The method of claim 11 wherein said modification includes one of incrementing and decrementing said initial count.
13. The method of claim 12 wherein said selecting occurs when said current count is non-positive.
14. The method of claim 12 wherein said selecting occurs when said current count equals a target value.
15. The method of claim 8 wherein said shifting includes shifting data north to south, south to north, east to west, west to east, northeast to southwest, southwest to northeast, northwest to southeast and southeast to northwest.
16. A method for controlling the position of data in a matrix of processing elements, comprising: shifting data within the matrix of processing elements; maintaining a current count in each active processing element responsive to the number of data shifts; and selecting output data as a function of said current count.
17. The method of claim 16 wherein said current count is incremented in response to said data shifts and said selecting occurs when a target value is reached.
18. The method of claim 16 wherein said current count is decremented from an initial count and said selecting occurs when said current count reaches a non-positive value.
19. The method of claim 16 wherein said shifting includes the north to south and south to north shifting of columns, the east to west and west to east shifting of rows, and the northeast to southwest, southwest to northeast, northwest to southeast and southeast to northwest shifting of diagonals.
20. A method, comprising: shifting data within a plurality of processing elements; and each active processing element selecting data as a final output in accordance with the formula f(x_Index, y_Index, z_Index) where f is dependent upon the desired output.
21. The method of claim 20 additionally comprising one of loading an initial count into each processing element and calculating an initial count locally based on the processing element's location and the function f.
22. The method of claim 21 additionally comprising maintaining a current count in each processing element, said current count being responsive to said initial count and the number of data shifts performed, said selecting being responsive to said current count.
23. A method, comprising: shifting data within a plurality of processing elements; and each active processing element selecting data as a final output in accordance with the formula f(d(0), d(1), d(2), ..., d(n-1)) where f is dependent upon the desired output.
24. The method of claim 23 additionally comprising one of loading an initial count into each processing element and calculating an initial count locally based on the processing element's location and the function f.
25. The method of claim 24 additionally comprising maintaining a current count in each processing element, said current count being responsive to said initial count and the number of data shifts performed, said selecting being responsive to said current count.
26. A memory device carrying a set of instructions which, when executed, perform a method comprising: maintaining a count in at least certain of said processing elements, each count being responsive to a processing element's location; and for each processing element maintaining a count, storing data in response to its count.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/689,380 US7581080B2 (en) | 2003-04-23 | 2003-10-20 | Method for manipulating data in a group of processing elements according to locally maintained counts |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0221563A GB2395299B (en) | 2002-09-17 | 2002-09-17 | Control of processing elements in parallel processors |
GB0221562A GB0221562D0 (en) | 2002-09-17 | 2002-09-17 | Host memory interface for a parallel processor |
Publications (3)
Publication Number | Publication Date |
---|---|
GB0309198D0 (en) | 2003-05-28 |
GB2393279A (en) | 2004-03-24 |
GB2393279B (en) | 2006-08-09 |
Family
ID=26247117
Family Applications (12)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB0309204A Expired - Fee Related GB2393283B (en) | 2002-09-17 | 2003-04-23 | Method for load balancing an N-dimensional array of parallel processing elements |
GB0309206A Expired - Fee Related GB2393285B (en) | 2002-09-17 | 2003-04-23 | Method for finding global extrema of a set of bytes distributed across an array of parallel processing elements |
GB0309214A Expired - Fee Related GB2393290B (en) | 2002-09-17 | 2003-04-23 | Method for load balancing a loop of parallel processing elements |
GB0309212A Expired - Fee Related GB2393289C (en) | 2002-09-17 | 2003-04-23 | Method for load balancing a line of parallel processing elements |
GB0309209A Expired - Fee Related GB2393287B (en) | 2002-09-17 | 2003-04-23 | Method for using extrema to load balance a loop of parallel processing elements |
GB0309205A Expired - Fee Related GB2393284B (en) | 2002-09-17 | 2003-04-23 | Method for finding global extrema of a set of shorts distributed across an array of parallel processing elements |
GB0309198A Expired - Fee Related GB2393279B (en) | 2002-09-17 | 2003-04-23 | Method for manipulating data in a group of processing elements |
GB0309211A Expired - Fee Related GB2393288B (en) | 2002-09-17 | 2003-04-23 | Method of obtaining interleave interval for two data values |
GB0309207A Expired - Fee Related GB2393286B (en) | 2002-09-17 | 2003-04-23 | Method for finding local extrema of a set of values for a parallel processing element |
GB0309200A Expired - Fee Related GB2393281B (en) | 2002-09-17 | 2003-04-23 | Method for rounding values for a plurality of parallel processing elements |
GB0309202A Expired - Fee Related GB2393282B (en) | 2002-09-17 | 2003-04-23 | Method for using filtering to load balance a loop of parallel processing elements |
GB0309199A Expired - Fee Related GB2393280B (en) | 2002-09-17 | 2003-04-23 | Method for manipulating data in a group of processing elements to transpose the data using a memory stack |
Country Status (1)
Country | Link |
---|---|
GB (12) | GB2393283B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6028345A (en) * | 1983-07-26 | 1985-02-13 | Fujitsu Ltd | Communication system in parallel computer |
US4816993A (en) * | 1984-12-24 | 1989-03-28 | Hitachi, Ltd. | Parallel processing computer including interconnected operation units |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4215401A (en) * | 1978-09-28 | 1980-07-29 | Environmental Research Institute Of Michigan | Cellular digital array processor |
SU1546960A1 (en) * | 1988-06-14 | 1990-02-28 | Aleksandr V Vasilkevich | Device for determining extreme values |
JPH0833810B2 (en) * | 1989-06-19 | 1996-03-29 | 甲府日本電気株式会社 | Vector data retrieval device |
JPH05501460A (en) * | 1990-05-30 | 1993-03-18 | アダプティブ・ソリューションズ・インコーポレーテッド | Distributed digital maximization function architecture and method |
JP2637862B2 (en) * | 1991-05-29 | 1997-08-06 | 甲府日本電気株式会社 | Element number calculation device |
CA2148719A1 (en) * | 1992-11-05 | 1994-05-11 | Warren Marwood | Scalable dimensionless array |
JPH0764766A (en) * | 1993-08-24 | 1995-03-10 | Fujitsu Ltd | Maximum and minimum value calculating method for parallel computer |
US5546336A (en) * | 1995-01-19 | 1996-08-13 | International Business Machine Corporation | Processor using folded array structures for transposition memory and fast cosine transform computation |
US6078945A (en) * | 1995-06-21 | 2000-06-20 | Tao Group Limited | Operating system for use with computer networks incorporating two or more data processors linked together for parallel processing and incorporating improved dynamic load-sharing techniques |
US6029244A (en) * | 1997-10-10 | 2000-02-22 | Advanced Micro Devices, Inc. | Microprocessor including an efficient implementation of extreme value instructions |
EP1021759B1 (en) * | 1997-10-10 | 2006-07-05 | Advanced Micro Devices, Inc. | MICROPROCESSOR COMPRISING INSTRUCTIONS TO DETERMINE EXTREME VALUES and to execute a comparison |
US5991785A (en) * | 1997-11-13 | 1999-11-23 | Lucent Technologies Inc. | Determining an extremum value and its index in an array using a dual-accumulation processor |
DE60115609T2 (en) * | 2000-03-08 | 2006-08-17 | Sun Microsystems, Inc., Palo Alto | DATA PROCESSING ARCHITECTURE WITH FIELD TESTING FOR MATRIX |
GB0011974D0 (en) * | 2000-05-19 | 2000-07-05 | Smith Neale B | rocessor with load balancing |
Also Published As
Publication number | Publication date |
---|---|
GB0309205D0 (en) | 2003-05-28 |
GB0309209D0 (en) | 2003-05-28 |
GB2393282B (en) | 2005-09-14 |
GB2393284A (en) | 2004-03-24 |
GB2393285B (en) | 2007-01-03 |
GB2393282A (en) | 2004-03-24 |
GB2393279B (en) | 2006-08-09 |
GB2393280A (en) | 2004-03-24 |
GB2393283A (en) | 2004-03-24 |
GB0309198D0 (en) | 2003-05-28 |
GB2393280B (en) | 2006-01-18 |
GB2393290B (en) | 2005-09-14 |
GB0309212D0 (en) | 2003-05-28 |
GB2393288A (en) | 2004-03-24 |
GB2393288B (en) | 2005-11-09 |
GB2393284B (en) | 2007-01-03 |
GB0309202D0 (en) | 2003-05-28 |
GB0309207D0 (en) | 2003-05-28 |
GB2393289C (en) | 2008-02-28 |
GB2393286B (en) | 2006-10-04 |
GB0309200D0 (en) | 2003-05-28 |
GB2393285A (en) | 2004-03-24 |
GB0309199D0 (en) | 2003-05-28 |
GB2393290A (en) | 2004-03-24 |
GB2393289B (en) | 2005-11-30 |
GB2393281A (en) | 2004-03-24 |
GB2393287B (en) | 2005-09-14 |
GB2393283B (en) | 2005-09-14 |
GB0309206D0 (en) | 2003-05-28 |
GB2393281B (en) | 2005-09-14 |
GB0309204D0 (en) | 2003-05-28 |
GB2393289A (en) | 2004-03-24 |
GB0309211D0 (en) | 2003-05-28 |
GB2393286A (en) | 2004-03-24 |
GB2393287A (en) | 2004-03-24 |
Similar Documents
Publication | Title |
---|---|
USRE36954E (en) | SIMD system having logic units arranged in stages of tree structure and operation of stages controlled through respective control registers |
EP0726532B1 (en) | Array processor communication architecture with broadcast instructions |
US7581080B2 (en) | Method for manipulating data in a group of processing elements according to locally maintained counts |
US7584343B2 (en) | Data reordering processor and method for use in an active memory device |
US9032185B2 (en) | Active memory command engine and method |
US20040215677A1 (en) | Method for finding global extrema of a set of bytes distributed across an array of parallel processing elements |
KR20010031192A (en) | Data processing system for logically adjacent data samples such as image data in a machine vision system |
EP0223690B1 (en) | Processor array with means to control cell processing state |
US7263543B2 (en) | Method for manipulating data in a group of processing elements to transpose the data using a memory stack |
US7596678B2 (en) | Method of shifting data along diagonals in a group of processing elements to transpose the data |
US8856493B2 (en) | System of rotating data in a plurality of processing elements |
GB2393279A (en) | Manipulating data in a plurality of processing elements |
GB2393277A (en) | Generating the reflection of data in a plurality of processing elements |
US7930518B2 (en) | Method for manipulating data in a group of processing elements to perform a reflection of the data |
GB2393278A (en) | Transposing data in an array of processing elements by shifting data diagonally |
GB2393276A (en) | Method of rotating data in a plurality of processing elements |
US7503046B2 (en) | Method of obtaining interleave interval for two data values |
JP2006515446A (en) | Data processing system with Cartesian controller that cross-references related applications |
WO2004053709A1 (en) | Device for transferring data arrays between buses and system for mac layer processing comprising said device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PCNP | Patent ceased through non-payment of renewal fee | Effective date: 20140423 |