WO1997042580A1 - Parallel-to-serial input/output module for mesh multiprocessor system

Info

Publication number: WO1997042580A1
Application number: PCT/US1997/007599
Authority: WO
Inventors: Ira H. Gilbert
Original assignee: Integrated Computing Engines, Inc.
Priority date: 1996-05-08
Filing date: 1997-05-06
Publication date: 1997-11-13
Also published as: AU3059297A

Abstract

An input and output module for an array of processors is described. The module includes a first set of shift registers which transfers data words to and from a host for the processor array. A second set of shift registers is connected to the registers of the first set so that data can be shifted from the first set to the second set. The second set interfaces with the processor array. As a result, data may be loaded into the array by first loading the first set with data from the host, then shifting the data to the second set, and loading the data into the array. In the opposite operation, data is loaded from the array to the second set of shift registers and then transferred to the first set. Since the arrays may be clocked independently of each other, full duplex operation can be achieved.

Description

PARALLEL-TO-SERIAL INPUT/OUTPUT MODULE FOR MESH MULTIPROCESSOR SYSTEM

BACKGROUND OF THE INVENTION

Mesh multiprocessor configurations are well-suited for operations on large multidimensional arrays of data. Example applications include two-dimensional fast Fourier transforms (FFTs), graphics, image filtering, matrix decomposition, and neural network emulation. Typically, mesh multiprocessing systems have an array of processors of two or more dimensions. The processors in the array connect to their two nearest neighbors using link ports. At the edges of the array, these link ports provide communications outside the mesh.

Mesh processing systems typically operate in one of two configurations: single instruction multiple data (SIMD) and multiple instruction multiple data (MIMD). In SIMD operation, a master processor of the mesh processing system contains a single copy of the application program. The individual slave processing elements synchronously execute the instructions that are broadcast by the master processor. Each processor receives and executes the same instruction stream. Data-dependent operations that change the instruction flow are not permitted. In MIMD operation, each processor executes instructions stored in its internal memory and operates on data also stored in memory. The processors operate independently. Data-dependent instructions that change the instruction flow are permitted.

Since data is downloaded into the slave processors of the mesh by a single host processor, a large instruction bandwidth between the host processor and each slave processor is not required. Data can only be loaded into the mesh as fast as the host processor can supply it. The hardware to enable the data loading can actually be quite simple in an effort to minimize the inter-slave and the master processor-slave array wiring. It is common to provide each slave processor with transmit and receive serial ports connected in a daisy chain so that each processor in a column receives data from the adjacent slave in the preceding row and transmits data to the slave in the following row. Slave processors at the edge of the array send data to and receive data from a system input/output module (SIOM) that is accessible by the host processor.

The SIOM is typically constructed from shift registers. Each daisy-chained set of slave processors sends and receives data to and from one of the shift registers, which is addressable by the host processor as memory. An entire data word is loaded into each of the shift registers, and the registers are simultaneously clocked so that each register's word is loaded into the first slave processor of each daisy-chain. The process is reversed when uploading from the array. The array is clocked so that the last processors in each daisy-chain transfer their data words into the associated shift register, where the word may be read out by the host processor. In some implementations, two shift registers may be assigned to each daisy-chain, one for downloading into the daisy-chain and one for uploading from the daisy-chain. This allows full duplex operation.

SUMMARY OF THE INVENTION

The prior art SIOM has a number of drawbacks that can have the effect of limiting its speed of operation. The shift registers have some physical extent, which typically places the last flip-flop in the register farther from the host bus than the first flip-flop. As a result, the last data bit of a word must propagate a longer distance from the host bus than the first data bit. And, this time difference limits the speed at which successive words may be loaded into the SIOM.

Moreover, the SIOM is typically implemented in a field programmable gate array (FPGA). The SIOMs comprise thousands of flip-flops, making a SIOM composed of discrete flip-flops impracticable. FPGAs usually require that a signal pass through a transistor when turning a corner within the device. Corner-turning is common in shift registers implemented in FPGAs. The incoming signals initially propagate on a bus that is parallel to the shift registers and must turn a corner in order to reach the flip-flop that is assigned to the particular signal. Passage of the signal through the transistor located at the corner further lengthens the time required to load a complete word and slows the overall speed of the device.

The present invention concerns an input and output module for an array of processors connected in a daisy chain or similar configuration in which bits of data are passed to the array at its edges and then distributed within the array. The module includes a first set of shift registers that receives successive data words from a host for the processor array. A second set of shift registers, however, is connected in association with the registers of the first set so that data can be shifted from the first set to the second set, which then passes the data to the processor array. Therefore, the first set of shift registers may be oriented so that each flip-flop receives data from the bus simultaneously, while the second set may be oriented to directly exchange its data with the processor array. Further, corner turning is not necessary, either between the host bus and the first set of registers or between the second set of shift registers and the processor array.

In specific embodiments, the second set of shift registers is also connected to receive data from the processor array and then pass the data to the first set of shift registers. Consequently, in some implementations, data can be shifted into the first set from the host while data is shifted from the second set into the processor array. Moreover, data may also be shifted out of the processor array into the second set of shift registers while data is shifted into the first set from the host. The data can then be exchanged between the sets of shift registers by simultaneously clocking both sets.

In other embodiments, the shift registers may be constructed from flip-flops and multiplexors. The flip-flops of the first set of shift registers are in a one-to-one association with flip-flops of the second set of shift registers. The multiplexors select the input to the flip-flops from either an adjacent flip-flop in the same shift register or an associated flip-flop in the other set of shift registers.

According to another aspect, the invention features a method for passing data between a host and an array of processors. This method includes first shifting the data into a first set of shift registers from the host. The data is then transferred into a second set of shift registers, which comprise flip-flops in one-to-one association with the flip-flops of the first set. The data is then shifted from the second set into the processor array.

In a related operation, data is shifted into a second set of shift registers from the processor array and then transferred into the first set of shift registers. The data can then be shifted from the first set into the host.

The above and other features of the invention including various novel details of construction and combinations of parts, and other advantages, will now be more particularly described with reference to the accompanying drawings and pointed out in the claims. It will be understood that the particular method and device embodying the invention are shown by way of illustration and not as a limitation of the invention. The principles and features of this invention may be employed in various and numerous embodiments without departing from the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

In the accompanying drawings, reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale; emphasis has instead been placed upon illustrating the principles of the invention. Of the drawings:

Fig. 1 is a schematic block diagram of the mesh multiprocessor system relying on a SIOM of the present invention;

Fig. 2 is a schematic diagram illustrating the daisy chaining of the slave processors and the interconnection with the SIOM module;

Fig. 3 is a schematic block diagram showing the internal architecture of the SIOM of the present invention;

Fig. 4 is a circuit diagram showing one flip-flop of a horizontal shift register and its associated flip-flop in the vertical shift register;

Figs. 5A-5D illustrate the loading of data into the SIOM shift registers from the host;

Figs. 6A and 6B show the movement of data through the SIOM when transferring data from the array;

Fig. 7 is a timing diagram illustrating host reads and writes and shift activity of the SIOM for arbitrary time periods;

Fig. 8 shows another embodiment of the inventive SIOM for a slave array having two daisy chains per slave column; and

Figs. 9A and 9B show still another embodiment of the inventive SIOM that avoids timing problems in data exchanges between flip-flops.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Fig. 1 is a schematic block diagram illustrating a mesh multi-processor system 100 incorporating the principles of the present invention. A master processor 110 under control of a host 116 has direct access to a large program memory 112. The stored program is executed by the master 110, and the instructions are broadcast to the slave processor array 114 for parallel execution through a buffer 118. Data may be passed between the slave processor array 114 and the host 116 in the background. Typically, the master processor 110 is identical to any one of the slave processors P_{x,y}.

The master processor 110 does exert some control over the SIOM 200. A gated clock controls the downloading into the array 114. And, the SIOM 200 informs the master of its ready state via a SIOM ready signal.

Usually, the array 114 is configured as a rectangular array of Nx rows and Ny columns. In the example illustrated, Nx = Ny = 8. Larger or smaller arrays are of course possible.

The host 116 downloads data to the slave processor array 114 from the outside world. Parallel data words are received via the data bus at the SIOM 200 and passed to the array 114.

Fig. 2 shows the functional connectivity between the SIOM 200 and the slave processor array 114. The SIOM 200 receives data words n bits wide in parallel from the host data bus. In the following description n = 8 for simplicity of description, although 32-bit-wide datapaths from the host are anticipated in other embodiments.

The parallel data words are received in a SIOM bus input port 210. Each data word is then serially passed from the SIOM 200 to a respective processor P_{1,1} - P_{1,8} in the first row of the slave processor array 114 via SIOM array output ports 211. On a set of clock signals, the data word can then serially pass to the second row P_{2,1} - P_{2,8}. Data leaves the array from the bottom row P_{8,1} - P_{8,8}, where it passes to SIOM array input ports 212 in the SIOM module 200 and then out to the data bus via SIOM bus output port 213.

Each slave processor usually has two bidirectional serial ports. These are each four-pin ports with data-out, clock data-out, data-in, and clock data-in pins.

In the embodiment of Fig. 2, only one of these ports is used. The port's data-in pin gets bits of data from the slave processor of the same column but in the preceding row under the control of the signal received on the clock data-in pin; and the port's data-out pin sends data to the processor in the same column but in the next row on clock signals from the clock data-out pin. Processors P_{1,1} - P_{1,8} of the first row have data-in pins connected to the SIOM array output ports 211 of the SIOM module 200, and processors P_{8,1} - P_{8,8} of the last row have data-out pins connected to the SIOM array input ports 212 of the SIOM 200. Thus, only four wires between the SIOM and the array are required for each column/daisy chain: data up, data down, clock up, and clock down.

Fig. 3 shows the SIOM 200, which has been constructed according to the principles of the present invention. The SIOM 200 comprises two overlaid sets of shift registers 310, 320 constructed from two matrices of flip-flops H_{a,b}, V_{a,b}. Horizontal shift registers 310 are connected to the host bus through the bus input and output ports 210, 213. The number of horizontal shift registers Na equals the number of bits in the data words from the host. Vertical shift registers 320 connect to the slave processor array 114 via SIOM array input and output ports 212, 211. The number of vertical shift registers Nb equals the number of daisy chains. In the example shown, every column of the array 114 is daisy chained, resulting in eight chains and eight vertical shift registers, Na = Nb = 8.

The horizontal shift registers 310 are constructed from a horizontally-connected matrix of flip-flops H_{a,b}. The vertical shift registers 320 are constructed from a vertically-connected matrix of flip-flops V_{a,b}. The matrices are of identical dimensions, Na x Nb.

Each of the horizontal shift registers 310 operates synchronously with the others, and each of the vertical shift registers 320 operates synchronously with the others. In more detail, all of the horizontally-connected flip-flops H_{a,b} operate on a common horizontal shift clock Hclk, and all of the vertically-connected flip-flops V_{a,b} operate in response to a vertical shift clock Vclk. Each of the horizontal flip-flops H_{a,b} is connected to receive data from the horizontal flip-flop to its left H_{a,b-1}, and each vertical flip-flop V_{a,b} receives data from the vertical flip-flop above V_{a-1,b}. Each horizontal flip-flop H_{a,b} is also capable of receiving data from the associated vertical flip-flop V_{a,b}. In a similar vein, each vertical flip-flop may receive data from the associated horizontal flip-flop.

Fig. 4 shows the circuit diagram for a single horizontal and vertical flip-flop pair. The input terminal of a horizontal flip-flop 405 receives input data from the flip-flop immediately to its left (or the host if the flip-flop is on the left edge of the SIOM 200). Alternatively, the horizontal flip-flop 405 can receive data from its associated vertical flip-flop 415 by appropriately setting a horizontal multiplexor 410. The horizontal flip-flop output Q is transmitted to the horizontal flip-flop immediately to its right, or the host 116 if the flip-flop 405 is on the right edge of the SIOM 200. The horizontal flip-flop 405 is clocked by the horizontal clock Hclk.

The vertical flip-flop 415 receives the output from the vertical flip-flop immediately above it or the array if the flip-flop is on the top edge of the SIOM 200. By setting a vertical multiplexor 420, the vertical flip-flop 415 can alternatively receive the output of its associated horizontal flip-flop 405. The vertical flip-flop output Q is transmitted to the vertical flip-flop immediately below it, or the array if the flip-flop 415 is on the bottom edge of the SIOM 200. The vertical flip-flop 415 clocks in its data in response to the vertical clock Vclk. (It should be appreciated that the terms vertical, horizontal, up, down, right, and left are simply convenient mechanisms for describing the embodiment with reference to the drawing and should not be construed as limiting or suggestive of the device's ultimate orientation.)
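
The behavior of the flip-flop pair of Fig. 4 may be summarized in a short behavioral model. The following Python sketch is illustrative only and is not part of the disclosed circuit; the function name clock_pair, its argument names, and the convention that a true select value routes the partner flip-flop's output are assumptions made for the example.

```python
def clock_pair(h_q, v_q, left_in, above_in,
               h_sel_partner, v_sel_partner, hclk_edge, vclk_edge):
    """Return the new (H, V) outputs of one flip-flop pair after a clock event.

    h_q, v_q      : current outputs of flip-flops 405 and 415
    left_in       : output of the horizontal neighbor to the left (or a host bit)
    above_in      : output of the vertical neighbor above (or an array bit)
    h_sel_partner : multiplexor 410 selects flip-flop 415 when True
    v_sel_partner : multiplexor 420 selects flip-flop 405 when True
    hclk_edge, vclk_edge : whether Hclk / Vclk presents an active edge
    """
    # Both D inputs are formed from the outputs present before the edge.
    h_d = v_q if h_sel_partner else left_in
    v_d = h_q if v_sel_partner else above_in
    # A flip-flop without a clock edge simply holds its value.
    new_h = h_d if hclk_edge else h_q
    new_v = v_d if vclk_edge else v_q
    return new_h, new_v
```

Under these assumptions, an Hclk edge alone with multiplexor 410 selecting the left neighbor is an ordinary horizontal shift, a Vclk edge alone with multiplexor 420 selecting flip-flop 405 copies the horizontal bit into the vertical register, and coincident edges with both multiplexors selecting the partner exchange the two bits.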

Figs. 5A-5D illustrate the movement of data loaded from the host 116 into the SIOM 200. The host 116 writes Nb successive Na-bit words D into the SIOM 200. Fig. 5A shows the first data word loaded into the horizontal shift registers 310. Each write involves presenting a new word to the flip-flops H_{1,1} - H_{8,1} in the first column of the SIOM 200. When the horizontal shift registers 310 are full, as shown in Fig. 5B, each horizontal flip-flop transfers its data to the associated vertical flip-flop by appropriately configuring the vertical multiplexors 420. A subsequent single vertical clock moves the data into the vertical shift registers 320, as shown in Fig. 5C. Data are then moved vertically in the SIOM 200 by first appropriately reconfiguring the select bit of the vertical multiplexors 420. With each transition of the vertical clock Vclk, the data propagates downward through the SIOM array output port 211 and into the slave processor array 114. Fig. 5D shows the data in the SIOM 200 after two vertical clock Vclk cycles.
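
The download sequence described above can be traced with a small simulation. This is a minimal sketch under stated assumptions, not the disclosed hardware: the function names load_horizontal and download are hypothetical, zeros are shifted in behind the data, and the array is assumed to sample the bottom row before each vertical shift, so the bit order on each serial line is an artifact of the model.

```python
def load_horizontal(words, na):
    """Shift Nb host words (lists of Na bits) into the horizontal registers.

    Returns h[a][b]: row a is horizontal shift register a, and column 0 is
    the edge at which the host presents each new word.
    """
    nb = len(words)
    h = [[0] * nb for _ in range(na)]
    for word in words:
        # One Hclk edge: every row shifts one place to the right and bit a
        # of the new word enters the first column of row a.
        h = [[word[a]] + h[a][:-1] for a in range(na)]
    return h


def download(words, na):
    """Return the serial bit stream delivered to each daisy chain."""
    h = load_horizontal(words, na)
    # One Vclk edge with multiplexors 420 selecting the horizontal flip-flops.
    v = [row[:] for row in h]
    streams = [[] for _ in words]
    for _ in range(na):
        # The bottom row drives the array output ports; record its bits,
        # then apply one Vclk edge that moves every row down by one.
        for b, bit in enumerate(v[-1]):
            streams[b].append(bit)
        v = [[0] * len(words)] + v[:-1]
    return streams
```

In this model the first word written by the host ends up in the column farthest from the bus input, so with two 3-bit words [1, 0, 0] and [0, 1, 1] the second chain receives the bits of the first word and the first chain receives the bits of the second word.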

The SIOM 200 provides for double buffering. Immediately after data are transferred from the horizontal shift registers 310 to the vertical shift registers 320, the horizontal shift registers 310 are once again available to receive data from the host 116. Thus, the next set of words can begin to fill the horizontal shift registers 310 even while the vertical shift registers 320 are being emptied into the array 114.

Figs. 6A-6B show the propagation of data through the SIOM 200 when data is being shifted out of the array 114. As shown in Fig. 6A, the array 114 progressively loads data words into the SIOM 200 via array input port 212. When the vertical shift registers 320 are full, the data are transferred to the horizontal shift registers 310 to be shifted out to the host bus as shown in Fig. 6B.
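
The upload direction can be sketched in the same style. The function name upload is hypothetical, and the model assumes that bits from the array enter the top row of the vertical registers, that the host reads the right-edge column before each horizontal shift, and that zeros are shifted in behind the data.

```python
def upload(streams, na):
    """Collect serial bits from the daisy chains and return the host words.

    streams[b] is the list of Na bits arriving from daisy chain b, in the
    order in which they arrive at array input port 212.
    """
    nb = len(streams)
    v = [[0] * nb for _ in range(na)]
    for t in range(na):
        # One Vclk edge: rows move down and new array bits enter the top row.
        v = [[streams[b][t] for b in range(nb)]] + v[:-1]
    # One Hclk edge with multiplexors 410 selecting the vertical flip-flops.
    h = [row[:] for row in v]
    words = []
    for _ in range(nb):
        # The host reads the column at the right edge of the SIOM ...
        words.append([h[a][-1] for a in range(na)])
        # ... and one Hclk edge then shifts every row to the right.
        h = [[0] + row[:-1] for row in h]
    return words
```

With the two 3-bit streams [1, 1, 0] and [0, 0, 1], the host reads back the words [1, 0, 0] and [0, 1, 1] in that order.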

The SIOM 200 can be adapted for full duplex operation. The process begins with the host writing to the horizontal shift registers 310 while the slave processor array 114 writes data to the vertical shift registers 320. When both sets of shift registers are full, the data are exchanged. The exchange can be accomplished by clocking both sets of shift registers simultaneously. The output data, now in the horizontal shift registers 310, are read out to the host bus. Simultaneously, new input data are written from the host bus into the horizontal shift registers. Simultaneously, input data just received from the horizontal shift registers 310 are written into the slave processor array 114 as they are vertically shifted down through the SIOM 200 in the vertical shift registers 320. Simultaneously, the next set of output data are shifted up from the array 114. This is useful when the host bus can read or write data faster than the data can be supplied or absorbed by the SIOM 200. Ideally, the host 116 should be able to read and write a complete set of data at least as fast as the array 114 can transmit and receive that data. Such operation is full duplex from the perspective of the serial lines, which carry the data in and out of the array at the same time.

From the perspective of the host, however, the operation is half duplex, as transmission and reception are interleaved as shown in Fig. 7. The duplex transmission begins at time T0 when the master properly configures the slave processor array 114 and SIOM 200 for the output transmission, and then signals the SIOM 200, causing output data to be shifted from the slave array 114 to the SIOM 200. At the same time, the host 116 begins to write the first input data set WR1 to the SIOM 200. Each data set consists of Nb words, which may be transmitted in a single burst. After the WR1 data is in the SIOM 200, it is shifted to the vertical shift registers 320, allowing data set WR2 to be loaded. It is assumed that the host 116 is sufficiently fast to write two data sets WR1, WR2 to the SIOM 200 in the interval from T0 to T1, in which the serial system has had sufficient time to transmit just a single output data set. During the next two intervals, T1 to T2, the host reads one output data set RD1 and writes one data set WR3, while the SIOM 200 inputs one data set and outputs one data set. This process is repeated until the final two intervals, where the host performs read operations RD3, RD4 only. Note that while full duplex operation is maintained for most intervals, the first and last intervals are used for output and input only, respectively.
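
The speed assumption behind Fig. 7 can be expressed as a simple timing budget. The sketch below is a rough check under stated assumptions rather than a timing specification: the function name host_keeps_up and the two period parameters are hypothetical, each host access is assumed to move one Na-bit word, and each data set is assumed to take Na vertical clock periods on the serial lines.

```python
def host_keeps_up(na, nb, host_access_ns, serial_bit_ns):
    """Check whether the host can service one read set and one write set
    in the time the serial lines need to move one data set.

    na, nb          : bits per word and words per data set
    host_access_ns  : assumed duration of one host bus read or write
    serial_bit_ns   : assumed period of the vertical (serial) clock
    """
    host_ns = 2 * nb * host_access_ns   # Nb reads plus Nb writes per interval
    serial_ns = na * serial_bit_ns      # Na Vclk periods per data set
    return host_ns <= serial_ns
```

For example, with Na = Nb = 8, a 10 ns host access and a 25 ns serial bit period satisfy the budget (160 ns against 200 ns), whereas a 20 ns host access does not (320 ns against 200 ns).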

In other embodiments, it is possible to trade off input/output bandwidth against hardware resources, namely wires, by varying the degree of daisy-chaining. For example, bandwidth may be increased by splitting every column into two daisy-chains as shown in Fig. 8. This change doubles the number of vertical shift registers Nb in the SIOM 200 and the number of wires into the array. This doubling can be continued until the daisy-chaining is eliminated, when Nb equals the number of slave processors in the array. In this extreme case, bandwidth is maximized at the expense of external wires. In the opposite case, to minimize the number of external wires, the daisy chaining can extend between columns of the slave processor array. In the extreme case, only a single pair of data wires leaves the array, the received data from the first row of processors in the slave processor array and the transmit data from the last row of slave processors.
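
The trade-off between daisy-chain length and wire count can be tabulated with a few lines of arithmetic. The helper below is illustrative; the name chain_config and the returned field names are assumptions, and it counts one data-in and one data-out wire per daisy chain while ignoring clock wires.

```python
def chain_config(nx, ny, chains_per_column):
    """Summarize one way of splitting an Nx-row by Ny-column slave array.

    Returns the number of vertical shift registers Nb, the number of data
    wires between the SIOM and the array, and the slaves in each chain.
    """
    if chains_per_column < 1 or nx % chains_per_column != 0:
        raise ValueError("chains_per_column must evenly divide the row count")
    nb = ny * chains_per_column
    return {
        "nb": nb,                                    # one register per chain
        "data_wires": 2 * nb,                        # data down and data up
        "slaves_per_chain": nx // chains_per_column,
    }
```

For the 8 x 8 array of the example, one chain per column gives Nb = 8 and 16 data wires, two chains per column gives Nb = 16 and 32 data wires, and eight chains per column eliminates the daisy-chaining with Nb = 64.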

Half duplex operation can also be supported by the present invention. Only a single data wire leaves the array for each column. This bidirectional wire is connected both to the receive side (data-in pin) of the first slave in the daisy chain and the transmit side (data-out pin) of the last slave. Such an arrangement yields the minimum input/output bandwidth, equal to the bandwidth of a single serial port.

In other embodiments, one may separately bring out the receive and transmit data wires from each slave port. This configuration maximizes the input/output bandwidth at a cost of greater hardware resources. In the extreme case, both ports of each slave may be independently accessed by two independent SIOMs. This will require 64 data wires leaving the array in a 4x4 matrix, for example. Two wires are required for each of the ports of each of the slaves.

In still other embodiments, through careful resynchronization it is necessary to supply only two clock signals to the entire slave processor array to control data movement. Recall that typically each column or daisy chain of slave processors receives two wires for clock signals. One wire carries the clock signal received at the clock data-in pin of each slave in the daisy-chain, and the other clock wire carries the clock signal received at the clock data-out pin of each slave of the daisy chain. Thus, four wires are generally required for each daisy-chain when the two data wires, data-in and data-out, are also included.

Separate clocks are required for each daisy chain because of signal propagation delays associated with the physical extent of the array. This effect, however, may be mitigated by providing resynchronization circuitry in the array. The total number of clock wires to the array may then be reduced to two, one clock for data-in and one clock for data-out. The data movement between the array and the SIOM is maintained in synchronism by separate circuitry in the array that resynchronizes the clock signals to every slave so that data transfer between the SIOM and array is compatible. In effect, the resynchronization circuit retimes the clock signals to each slave based upon the signal propagation delays associated with communications to that slave. Thus, the wires to each daisy-chain may be limited to data up and data down, with the addition of two global clock wires for the entire array.
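
The wire savings from resynchronization follow directly from the counts given above. The helper below is a hypothetical illustration (the name wires_to_array is an assumption) that counts two data wires per daisy chain plus either two clock wires per chain or two global clock wires.

```python
def wires_to_array(nb, resynchronized):
    """Count the wires between the SIOM and an array with Nb daisy chains."""
    data_wires = 2 * nb                            # data up and data down per chain
    clock_wires = 2 if resynchronized else 2 * nb  # global pair, or a pair per chain
    return data_wires + clock_wires
```

For the eight daisy chains of the example, the count falls from 32 wires with per-chain clocks to 18 wires with two global clocks.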

Problems may also arise during the simultaneous clocking of both matrices of flip-flops H_{a,b}, V_{a,b} in which data are exchanged between the matrices during full duplex operation. Unless both flip-flops 405 and 415 (shown in Fig. 4) receive the clock edges at the same time, data are lost, and such simultaneity is difficult to ensure in a large array of flip-flops due to signal propagation delays.
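
The data loss caused by clock skew during an exchange can be illustrated with an idealized model of one flip-flop pair. This is a sketch only: the function name exchange_pair and the signed skew parameter are assumptions, and setup and hold times and metastability are ignored, so any nonzero skew is treated as one flip-flop sampling strictly before the other.

```python
def exchange_pair(h_q, v_q, skew):
    """Exchange the bits of flip-flops 405 (H) and 415 (V).

    skew == 0 : Hclk and Vclk edges coincide
    skew > 0  : the Hclk edge arrives first
    skew < 0  : the Vclk edge arrives first
    Returns the (H, V) outputs after both edges have occurred.
    """
    if skew == 0:
        # Each flip-flop samples the other's old output: a true exchange.
        return v_q, h_q
    if skew > 0:
        h_q = v_q   # H samples V first ...
        v_q = h_q   # ... so V then samples the already overwritten H
        return h_q, v_q
    v_q = h_q       # V samples H first ...
    h_q = v_q       # ... so H then samples the already overwritten V
    return h_q, v_q
```

In this model, coincident edges swap the two bits, while an early edge on either clock leaves both flip-flops holding the same bit, so one of the two original values is lost.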

Figs. 9A and 9B show another embodiment, which solves the problem by doubling the number of flip-flops. The horizontal and vertical registers 310A, 320A of Fig. 9A are dedicated to only pass data from the host bus to the array 114. Data enter through the bus input port 210 and exit through the array output port 211 as described earlier. A horizontally-connected flip-flop only receives data from the host and passes the data to a vertical flip-flop that only transfers the data into the array 114.

Fig. 9B shows the horizontal and vertical shift registers 310B, 320B that are dedicated to passing data from the array 114 to the host bus. Data enters through array input port 212 and exits to the bus via bus output port 213. Here, vertically connected flip-flops only receive data from the array and pass it to horizontally connected flip-flops that only pass data to the host. Thus, it never occurs that two flip-flops must exchange data.

While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

What is claimed is:

1. An input and output module for an array of processors (114), comprising: a first set of shift registers (310) which receives data from a host (116) of the processor array and passes the data in a first direction; and a second set of shift registers (320) which receives data from the first set of shift registers and passes the data to the processor array in a second direction.

2. An input and output module as described in Claim 1, wherein the second set of shift registers is connected to receive data from the processor array and the first set of shift registers is connected to receive the data from the second set and pass the data to the host.

3. An input and output module as described in either of Claims 1 or 2, further comprising: a third set of shift registers which receives data from the processor array; and a fourth set of shift registers which receives data from the third set of shift registers and passes the data to the host.

4. An input and output module as described in any of the preceding claims, wherein each of the shift registers comprises flip-flops and multiplexors for selectively providing data to the flip-flops either from an adjacent flip-flop in the same shift register or an associated flip-flop in the other set of shift registers.

5. An input and output module as described in any of the preceding claims, wherein the first set and the second set of shift registers shift data in response to different clock signals.

6. An input and output module as described in any of Claims 1-4, wherein the first set and the second set of shift registers shift data in response to the same clock signals.

7. An input and output module as described in any of the preceding claims, wherein the processor array comprises daisy-chained rows of processors which pass data from the second set of shift registers to successive processors in the daisy-chain.

8. An input and output module as described in any of the preceding claims, wherein the processor array comprises daisy-chained rows of processors which pass data to successive processors and to the second set of shift registers for transfer to the host.

9. A method for passing data between a host and an array of processors, the method comprising: shifting the data into a first set of shift registers from the host; transferring the data into a second set of shift registers; and shifting the data from the second set into the processor array.

10. A method as described in Claim 9, further comprising distributing the data within the processor array by passing the data between adjacent processors of the array.

11. A method as described in either of Claims 9 or 10, further comprising: shifting data into the second set of shift registers from the processor array; transferring the data into the first set of shift registers from the second set of shift registers; and shifting the data from the first set of shift registers to the host.

12. A method as described in any of Claims 9-11, further comprising: shifting data into a third set of shift registers from the processor array; transferring the data into a fourth set of shift registers from the third set of shift registers; and shifting the data from the fourth set of shift registers to the host.

13. A method for passing data from an array of processors to a host, the method comprising: shifting the data into a set of shift registers from the processor array; transferring the data into another set of shift registers; and shifting the data from the other set of shift registers to the host.

14. An input and output module transmitting data between a host and an array of slave processors receiving instructions from a master processor, the module comprising: a first matrix of flip-flops which are connected as a first set of shift registers that receive data with a host of the processor array; a second matrix of flip-flops which are in a one-to-one association with the flip-flops of the first matrix and are connected as a second set of shift registers that receive data from the first matrix of flip-flops and pass the data to the processor array; and multiplexors associated with the flip-flops of the second matrix which determine whether the flip-flops receive data from an adjacent flip-flop within their shift register or the associated flip-flops in the first matrix.

15. An input and output module as described in Claim 14, further comprising multiplexors associated with the flip-flops of the first matrix which determine whether the flip-flops receive data from an adjacent flip-flop within their shift register or the associated flip-flops in the second matrix.

16. An input and output module as described in either of Claims 14 or 15, further comprising: a third matrix of flip-flops which are connected as a third set of shift registers that receive data from the processor array; a fourth matrix of flip-flops which are in a one-to-one association with the flip-flops of the third matrix and are connected as a fourth set of shift registers that receive data from the third matrix of flip-flops and pass the data to the host.