GB2203574A - Parallel processor arrays - Google Patents


Info

Publication number
GB2203574A
Authority
GB
United Kingdom
Prior art keywords
message
communications
array
buffer
activity
Prior art date
Legal status
Pending
Application number
GB08708025A
Other versions
GB8708025D0 (en)
Inventor
Christopher Roger Jesshope
Jimmy Melvin Stewart
Russell John O'Gorman
Current Assignee
University of Southampton
Original Assignee
University of Southampton
Priority date
Filing date
Publication date
Application filed by University of Southampton filed Critical University of Southampton
Priority to GB08708025A
Publication of GB8708025D0
Publication of GB2203574A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • G06F15/8023Two dimensional arrays, e.g. mesh, torus

Abstract

In a parallel processor array (Fig. 1) the array of processing elements (PE) is provided with an array of communications processors (CP) which handle all internode communications and preferably themselves form an SIMD machine. Each communications processor has bidirectional bit-serial connections to its four nearest neighbours and comprises four shift registers S1 to S4 which enable it to receive message bytes from all four inputs at once. A parallel bus 34 connects the registers to a buffer (32, 38 Fig. 2) which can be accessed by the associated processing element to write source data and read destination data. In a packet switching mode a message commences with address bytes which are compared with the communications processor's own address. A message can be passed from node to node (Fig. 4) to its destination address. A received message enters the buffer and address comparisons are effected to determine, by means of multiplexers 64, in which output direction the message will be transmitted from the buffer to the next node. The shift registers are preferably duplicated and connected to the input and output multiplexers by switching means 60, 62 controlled by a "page" bit 42 which determines which set of registers is available for input/output. The other set can be loaded and tested using the parallel bus, during transmission via the first set. In this way, the physical connections between communications processors may be used continuously.

Description

PARALLEL PROCESSOR ARRAYS

The present invention relates to parallel processor arrays comprising a plurality of processing elements (PE's), such as SIMD and MIMD arrays well known in themselves. The invention is concerned more specifically with the problem of achieving efficient inter-element communication for the passing of data.
Known systems are summarized in PCT/GB86/00514 and comprise: common bus systems, which can limit processing speeds even when the bus has a very large bandwidth; fixed network systems, which require very complex networks to handle a sufficiently extensive range of connections; and systems with a switching centre. These latter systems also run into bandwidth limitations and are best suited to connecting small numbers of powerful processing elements.
Considerable interest attaches to processor arrays comprising a large number (e.g. 1024) of simple PE's, e.g. 8-bit PE's of no greater complexity than such well-known processors as the 6502 and Z80. It is known (e.g. GB 2178572) to effect communication between nearest-neighbour PE's. Communication between distant PE's can only be effected by way of a chain of PE's linked by nearest-neighbour connections. Control of the communications by the PE's represents a significant and undesirable computing overhead.
Many applications to which parallel processor arrays are suited exhibit irregular data structures which require powerful long-range and irregular communication schemes.
The object of the present invention is to provide a solution to these problems. The invention is defined with particularity in the appended claims but the following features are noted as representing various aspects of the invention.
In the first place, the invention contemplates a parallel processor array in which each node of the array comprises both a data processing element and a communications processor, with interconnections between the communications processors. These interconnections are preferably nearest-neighbour only and are preferably bit-serial in order to minimise the pin-out complexity of each node. Each data processing element can command its communications processor (CP) to effect communication with another element. The CP sets up the communication, which is effected through a chain of CP's, without any processing burden on the PE's associated with the intervening CP's.
Each CP can comprise a set of communications registers, a simple ALU, a set of data registers and associated memory shared with the PE, all connected by a common bus. The communications registers can be parallel load/unload shift registers in the case of bit-serial interconnections, with parallel connections to the common bus. The registers are preferably duplicated to allow one set to be active in communication while the other is free to be loaded or unloaded by the associated ALU. However, the output connections of the registers are preferably switchably assignable to the various nearest-neighbour connections of the CP. This may be to allow different semi-permanent network configurations to be set up, as in GB 2 178 572. However, an important feature of the invention resides in the use of the switchable assignments in conjunction with address fields in the data being passed, automatically to route a message to a specified destination CP - and hence to the corresponding PE. The messages can be of variable length but this length must be a global variable, set to the same value over the entire array. The data registers in the CP may contain a field to specify the length. The invention thus leads to a kind of packet-switching communications network by means of which a message may be sent efficiently from any one PE to any other PE. On the other hand, each CP can be of simple architecture and represents a very acceptable addition to the chip area required for the associated PE.
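By way of illustration only, the packet format implied by this description - destination address fields followed by a body, with the total length a global constant over the array - might be modelled as follows. `LENGTH` and `make_packet` are invented names, and the exact byte layout is an assumption based on the description, not taken from the patent:

```python
# Illustrative model of a packet: one X-address byte and one Y-address byte,
# followed by a body. LENGTH is the global packet length shared by the whole
# array; the layout here is an assumption for illustration.
LENGTH = 6  # total packet length in bytes (2 address bytes + 4 body bytes)

def make_packet(dest_x, dest_y, body):
    """Build a packet: two destination-address bytes, then the body."""
    assert len(body) == LENGTH - 2, "every packet in the array has the same length"
    return bytes([dest_x & 0xFF, dest_y & 0xFF]) + bytes(body)

pkt = make_packet(3, 5, [10, 20, 30, 40])
```

Because the length is global, every CP in the array knows where each message ends without any per-message length field being transmitted.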
Communications can be synchronous, locked in step over the whole array so that no handshaking is required. For bit-serial communication, the internodal connections are restricted to two lines per connection, one for each direction of data flow. In theory, but with less convenience, only a single (half-duplex) line would suffice.
Each CP may have its own control means but the CP's preferably constitute, with a common controller, an SIMD processor array which is mapped on to the PE array. Another possibility is to use only a single processor at each node, the PE processing facilities being assimilated into the CP's. This will still allow use of the advanced communication facilities of the SIMD CP array but at the expense of these facilities becoming a significant overhead on the data processing effected by the same processors. When there are both PE and CP arrays, efficiency may be increased by running the CP at a higher clock rate than the PE.
It is preferred that. all CPs on a single chip share a single controller, with controllers on separate chips being synchronized so as to provide a system-wide synchronization of the CPs. This reduces the need to provide input pins for a common control word.
The program requirements for communication are simple and repetitive and use little on-chip storage.
An embodiment of the invention will now be described in more detail, by way of example, with reference to the accompanying drawings. Other advantageous features of the preferred embodiment will be explained. In the drawings:
Fig. 1 is a block diagram of a complete processor array, but with a very small number of nodes for simplicity of drawing,
Fig. 2 is a block diagram of a single node of the array,
Fig. 3 is a block diagram of the communications registers and associated circuits of a CP,
Fig. 4 is an explanatory diagram illustrating packet switching routes,
Figs. 5 and 6 are diagrams of two different packet switching output direction selection circuits,
Fig. 7 is a buffer memory map,
Fig. 8 is a diagram of a circuit switching control register, and
Figs. 9 and 10 are diagrams of two circuits controlling two activity lines.
Fig. 1 shows a parallel processor array comprising an array of nodes 10, each of which comprises a processor element 12 and a communications processor 14. The array is illustrated as rectangular and the connections are restricted to N, S, E and W nearest neighbour connections, each provided by two bit-serial lines 16 for the two directions of data flow respectively. The invention is in no-wise restricted to this particular scheme. As symbolised by just two connections 18, the array may be wrapped round in both dimensions, so that the nodes 10 lie on a toroidal surface. The implications of this will be considered later in the description.
As illustrated the array is a 3 by 3 array but, in a practical embodiment, it might be a 32 by 32 array with 1024 nodes.
Each PE is drawn inscribed within its associated CP. This is appropriate symbolism as the PE's communicate with each other only via their CP's and the interconnections 16 exist only between the CP's.
The PE's are controlled by a PE controller 20 and a bus 22, shown in fragmentary form only, whereby the controller broadcasts instructions, to all the PE's. Similarly, the CP's are controlled by a CP controller 24 and a bus 26, whereby this controller broadcasts instructions to all the CP's. The invention is not concerned with the PE's and their controller; these may be as in known machines. No detailed description will be given of the CP controller 24 because this differs from known devices only in the microcode which determines the instruction sequences sent to the CP's. Examples of algorithms on which such microcode is based are given below.
A single node 10 of the array is illustrated in Fig. 2 and comprises the PE 12, the PE data bus 30, a RAM 32, and the CP 14 with its data bus 34. The RAM 32 comprises 512 bytes forming a working memory 36 private to the PE and two 256 byte buffers 38 of which, at any one time, one serves the CP while the other is free for reading and/or writing by the PE. To this end, the two data buses 30 and 34 are connected to the two buffers 38 by an exchange switch 40 controlled by a 1-bit page register 42. Whichever buffer 38 is connected to the CP bus 34 is addressable by an 8-bit address register 44 on the bus 34. The way in which the PE addresses its 768 bytes of RAM is of no significance to the present invention.
One possibility is an 8-bit address register in conjunction with a 2-bit page register.
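The double-buffer arrangement just described may be sketched as follows; this is an illustrative model only, and the class and method names (`NodeRAM`, `toggle_page` and so on) are invented:

```python
class NodeRAM:
    """Sketch of the RAM partitioning of Fig. 2: 512 bytes of working
    memory private to the PE, plus two 256-byte buffers. The 1-bit page
    register decides which buffer the CP sees; the PE sees the other,
    so the two can work concurrently without contention."""
    def __init__(self):
        self.working = bytearray(512)                  # private to the PE
        self.buffers = [bytearray(256), bytearray(256)]
        self.page = 0                                  # 1-bit page register 42

    def cp_buffer(self):
        """Buffer currently serving the CP."""
        return self.buffers[self.page]

    def pe_buffer(self):
        """Buffer currently free for the PE to read/write."""
        return self.buffers[1 - self.page]

    def toggle_page(self):
        """Model of the exchange switch 40: swap the buffer roles."""
        self.page ^= 1
```

After a toggle, data the PE wrote into its buffer becomes visible to the CP, and vice versa, which is how a source PE hands a message to its CP for transmission.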
The other main elements of the CP 14 are a set 46 of communications registers on the bus 34, an ALU 48 with its output connected to the bus 34 and a set of registers 50 connected to receive a result from the ALU by way of the bus 34 and to furnish two operands OPI and OP2 to the ALU by way of buses 52 and 54. This three-bus set-up provides for efficient, fast register-register operations in response to commands specifying the two source registers and the destination register for an operation. The detailed circuitry for selecting the registers and the control circuitry in general are not shown since this circuitry will be in conformity with well-known micro-processor architectures.
The ALU has three status outputs E, G and L for signalling the outcome of a comparison between two operands - E when they are equal, G when the first operand OP1 is greater than the second OP2 and L when the first operand OP1 is less than the second OP2.
A bus switch 56 controlled by the PE 12 can connect the two buses 30 and 34 together to enable the PE to access the CP register set 50. There are eight 8-bit registers which are utilized as follows to control the routing of a packet to its destination node.
(1) X address of this node
(2) Y address of this node
(3) Packet length (bytes)
(4) 2 times packet length
(5) 3 times packet length
(6) Temporary results
(7) AC1, a send message pointer for the buffer 38 serving the CP
(8) AC2, an arrived message pointer for the buffer 38.
The significance of these parameters will become clear below, after the structure of the communications registers set 46 has been described with reference to Fig. 3. Briefly, however, when a message arrives at a CP it is written into the buffer 38. The X and Y addresses of the message are compared with the CP's own X and Y addresses. If both comparisons lead to E true, the message has arrived and is sent no further; it is in the buffer 38 ready to be read by the PE. If either or both comparisons lead to G true or L true, the message is re-transmitted from the buffer in the appropriate direction.
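The per-node decision just described may be sketched in outline as follows. This is an illustrative model only: the function name is invented, and the orientation convention (E meaning increasing X, N meaning increasing Y) is an assumption, not fixed by the patent text:

```python
def route_step(own_x, own_y, dest_x, dest_y):
    """One routing decision: compare this node's address with the message's
    destination address. If both comparisons are Equal the message has
    arrived; otherwise an output direction is chosen (X shown resolved
    first here, purely for illustration). Assumes E = increasing X and
    N = increasing Y."""
    if own_x == dest_x and own_y == dest_y:
        return "ARRIVED"          # both E true: leave message in buffer 38
    if own_x != dest_x:
        return "E" if own_x < dest_x else "W"
    return "N" if own_y < dest_y else "S"
```

Each CP applies this rule independently, so a message steps from node to node until both address comparisons yield E.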
Turning then to Fig. 3, there are eight 9-bit shift registers S1 to S4 and S1' to S4' in four pairs S1, S1'; S2, S2' and so on.
All registers are connected to the CP bus 34 for parallel loading and unloading of eight bits. The ninth bit is an "activity" bit whose usage is explained below. The inputs of the pair of registers S1, S1' are connected to the North input line 16 (see Fig. 1) by way of a 1-to-2 de-multiplexer 60. The other pairs of registers are similarly connected, S2, S2' to the South input, S3, S3' to the East input and S4, S4' to the West input. The demultiplexers 60 are controlled by the page register 42, which selects which of the two sets S1-S4 and S1'-S4' is connected to receive incoming serial data and which is available for parallel loading or unloading via the bus 34.
The outputs of the registers are similarly connected via 2-to-1 multiplexers 62, also controlled by the page register, to four output lines OP1 to OP4. These four lines are connected in parallel to four 4-to-1 multiplexers 64 whose outputs are connected to the North, South, East and West output lines 16. Each multiplexer 64 is controlled by a 2-bit selection input sN, sS, sE, sW whereby it can be set to pass out the message from any one of the output lines OP1 to OP4. Each multiplexer will only pass out a message when enabled by a corresponding enabling signal EnN, EnS, EnE, EnW.
The main implications of this scheme are as follows. The four register pairs are in fixed one-to-one correspondence with the four input directions, from N, from S, from E and from W and since there are four registers always active on the four inputs, data collisions cannot occur. There is no fixed correspondence with output directions, these being selectable by sN, sS, sE and sW. The selection may be set up semi-permanently using a direction register 66 connected to the bus 34 but may also be effected dynamically in packet switching mode, as will be fully described. When the direction register 66 is employed, the register pair S1, S1' may be assigned to the output direction N, S2, S2' to S and so on, so that the nodes are set up for strict, bidirectional, nearest-neighbour communication, in what will be called the circuit switching mode, in contrast to the packet switching mode. However other assignments may be made for particular purposes. Fig. 3 also shows a circuit switching register 68 connected to the bus. This is used, as will be described, to select the mode of operation of each of the communications registers in the circuit switching mode.
For simplicity, reference will now be made only to the registers S1 to S4, even though these registers are actually used alternately with the registers S1' to S4', the CP toggling the PAGE bit between messages.
The hardware described so far can be controlled in a wide variety of ways to route a message from one node to another, i.e. by establishing a fixed circuit using a setting of the direction register 66, or by routing addressed packets through the array and setting each direction register dynamically on the basis of address comparisons. Various routing strategies for packet routing may be employed, as can be seen from Fig. 4, in which each small square represents a node 10. One way to route a message is along a row (or column) until the destination column (or row) is reached, then along that column (or row) until the destination node is reached. This is illustrated by the route 70 from node A to node B. Another possibility, which is employed in the preferred embodiment, is to give priority alternately to the X and Y directions so that the packet follows a staircase path 72. Once the packet gets on to the right row or column it steps along that row or column to the destination node B without regard for the priority, as illustrated by the terminal part 74 of the route 72.
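The staircase strategy may be sketched as a simple path simulation; this is an illustrative model with invented names, using the same assumed grid orientation as before:

```python
def staircase_route(src, dst):
    """Sketch of the alternating-priority ('staircase') routing of Fig. 4:
    priority toggles between the X and Y directions on each hop; once the
    packet reaches the destination row or column it runs straight in,
    regardless of priority (terminal part 74 of route 72)."""
    x, y = src
    path = [(x, y)]
    priority_x = True
    while (x, y) != dst:
        step_x = (1 if dst[0] > x else -1) if x != dst[0] else 0
        step_y = (1 if dst[1] > y else -1) if y != dst[1] else 0
        # Step in X if X needs a step and either has priority or Y is done.
        if step_x and (priority_x or not step_y):
            x += step_x
        else:
            y += step_y
        priority_x = not priority_x
        path.append((x, y))
    return path
```

For a source at (0, 0) and a destination at (3, 3) this produces the alternating path (1, 0), (1, 1), (2, 1), (2, 2), (3, 2), (3, 3).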
If the registers S1 to S4 were large enough to hold a complete destination address, the routing could be determined from the address in the communications register which the message had entered, and the message could be routed directly out of that register, with no further buffering. It is preferred, however, to write the whole message into the buffer 38 and then send it on from a specific one of the registers, namely the register S1. Indeed this is necessary when, as in the present embodiment, the address (2 bytes) is larger than the register (1 byte plus the activity bit). Moreover, routing control is simplified if transmission is always from a specific register, and another advantage is that every transmission takes place in exactly the same way, whether it is the initiating transmission from node A or a subsequent transmission en route to node B. The preferred embodiment is made more powerful by enabling two messages to be sent simultaneously, from registers S1 and S2.
Fig. 5 shows the output direction selection circuit used for the single-message approach. The direction register 66 (Fig. 3) is set so that sN = S1, sS = S1, sE = S1 and sW = S1, whereby all the multiplexers 64 select the register S1. The selection circuit of Fig. 5 merely has to provide the correct enabling signal to route the message out in the correct direction. The circuit comprises a 3-bit X direction request register 80 which is strobed by a microcode signal CMP X provided at the time at which the ALU (Fig. 2) is comparing the X destination address as OP2 with the X this-node address as OP1. The resultant L, G, E bits (of which one will be true) are latched in the register 80. Similarly the L, G, E bits for the Y comparison are latched in a 3-bit Y direction request register 82 when this is strobed by CMP Y.
Direction priority is determined by a 1-bit register 84. When PRIORITY is true, it enables two AND gates 86 (via an OR gate 90) to pass L(X) as EnE, or G(X) as EnW; EnW is provided because the X this-node address is greater than the X destination address. When PRIORITY is false, its inverse enables two AND gates 88 via an OR gate 92 to provide EnN or EnS from L(Y) or G(Y).
If E(X) is true, the Y AND gates 88 are enabled via the OR gate regardless of the value of PRIORITY, and if E(Y) is true, the X AND gates 86 are enabled via the OR gate regardless of the value of PRIORITY. This allows a message always to pass in the Y direction when it has reached the destination column, or always in the X direction when it has reached the destination row.
If both E(X) and E(Y) are true, an AND gate 94 sets an ARRIVED latch 96.
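The gating just described may be modelled in outline as follows; the function is an illustrative sketch with invented names, again assuming E = increasing X and N = increasing Y:

```python
def output_enables(own_x, own_y, dest_x, dest_y, priority_x):
    """Sketch of the Fig. 5 selection logic: latch the L/G/E results of
    the X and Y address comparisons, then gate them with PRIORITY.
    E(X) forces the Y gates open and E(Y) forces the X gates open
    (straight terminal run); E(X) and E(Y) together set ARRIVED."""
    Ex, Gx, Lx = own_x == dest_x, own_x > dest_x, own_x < dest_x
    Ey, Gy, Ly = own_y == dest_y, own_y > dest_y, own_y < dest_y
    x_enable = priority_x or Ey            # OR gate 90
    y_enable = (not priority_x) or Ex      # OR gate 92
    return {
        "EnE": x_enable and Lx,            # AND gates 86
        "EnW": x_enable and Gx,
        "EnN": y_enable and Ly,            # AND gates 88
        "EnS": y_enable and Gy,
        "ARRIVED": Ex and Ey,              # AND gate 94 -> latch 96
    }
```

Note that at most one enable fires per decision: when neither address matches, PRIORITY selects which axis's gates are open.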
The circuit of Fig. 6 used to handle two messages is essentially the same as the circuit of Fig. 5 but with additional logic to determine the least significant bit of sN, sS etc. The register 66 of Fig. 3 is used to set the most significant bit to zero, which effects a partial selection of S1 and S2, rather than S3 and S4. In Fig. 6, each of the four AND gates 86, 88 has associated therewith a pair of 1-bit latches, namely a latch 100 which is set by a true output from its AND gate 86 or 88, and a latch 102 which can be set by a signal MESSAGE SOURCE, which is false for the first message but true for the second message. However, each latch 102 can only be set while the corresponding latch 100 remains un-set, because the latch 102 is enabled by the inverted output of the latch 100.
The latches 102 provide the least significant bits of sN, sS etc. to complete the selection between S1 and S2. The way in which the circuit operates is as follows.
When the X and Y address comparisons for the first message are performed, one of the latches 100 will be set, while the corresponding latch 102 remains unset (assuming that the message has not yet arrived at its destination). Accordingly, one of the multiplexers 64 of Fig. 3 is enabled and selects the register S1.
The MESSAGE SOURCE signal is then set true to set the remaining three latches 102 - only the remaining three, because the latch 102 associated with the set latch 100 is disabled. The X and Y address comparisons for the second message are then performed and, unless both messages are competing for the same route, another one of the latches 100 is set. Accordingly, another one of the multiplexers 64 is enabled and selects the register S2, because the associated latch 102 is set. If both messages are competing for the same route, the second message has to wait for the next transmission.
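The net effect of the latch pairs 100/102 may be summarised in a short sketch; this is an illustrative abstraction (the names are invented) of the outcome, not of the gate-level circuit:

```python
def select_outputs(first_dir, second_dir):
    """Outcome of the Fig. 6 two-message selection: the first message
    claims its output direction (latch 100) and is sourced from S1;
    MESSAGE SOURCE then arms the latches 102 of the remaining three
    directions, so the second message is sourced from S2 only if it
    wants a different direction. Otherwise it must wait."""
    selected = {first_dir: "S1"}
    if second_dir != first_dir:
        selected[second_dir] = "S2"   # latch 102 for this direction was armed
        second_sent = True
    else:
        second_sent = False           # competition: second message waits
    return selected, second_sent
```
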
This leads to consideration of the function of the buffer 38.
Adequate buffering is essential because each CP can receive up to four messages simultaneously, one in each of the registers S1 to S4, yet can only transmit two messages at a time, and only one if there is competition for a route. Each buffer 38 is actually used to provide two buffers, namely an arrived buffer for messages which have arrived at their destination node and a message buffer for messages which are to be transmitted, whether in initiating a transmission (in which case the source node PE will have put the message in the message buffer) or as messages in transit. The CP puts a message into the message buffer if it has not arrived at its destination node but puts it in the arrived buffer if it has arrived.
In order to make most efficient use of the 256 bytes available, the arrived and message buffers 110 and 112 (Fig. 7) grow from opposite ends of the buffer 38, each accommodating a plurality of messages 114. As indicated by the arrow 116, the arrived buffer 110 is treated by the CP as a growing stack. Messages are only removed by the PE when PAGE is switched to give the PE access to the buffer 38 in question. As indicated by the double arrow 118, the message buffer 112 is treated by the CP as a dynamic LIFO stack. The tops of the stacks are indicated by pointers AC1 (message buffer) and AC2 (arrived buffer). The chosen organisation allows both stacks to grow as required while needing only one pointer each, and is more efficient in its use of memory than a scheme which gives a fixed memory allocation to each stack.
The obvious constraint is that the stacks must not be allowed to meet. The software detects if this is about to occur and signals an interrupt to the controller 24, which must then take appropriate action.
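The opposed-stack layout of Fig. 7 and its meeting check may be sketched as follows; this is an illustrative model with invented names, and raising a Python exception merely stands in for the interrupt signalled to the controller 24:

```python
class DualStackBuffer:
    """Sketch of the Fig. 7 buffer layout: the arrived stack grows up
    from one end (pointer AC2) and the message stack grows down from
    the other (pointer AC1). If a push would make the stacks meet, an
    exception is raised, standing in for the interrupt to controller 24."""
    def __init__(self, size=256):
        self.mem = bytearray(size)
        self.ac2 = 0         # arrived-buffer pointer, grows upward
        self.ac1 = size      # message-buffer pointer, grows downward

    def push_message(self, packet):
        if self.ac1 - len(packet) < self.ac2:
            raise MemoryError("stacks about to meet: interrupt controller")
        self.ac1 -= len(packet)
        self.mem[self.ac1:self.ac1 + len(packet)] = packet

    def push_arrived(self, packet):
        if self.ac2 + len(packet) > self.ac1:
            raise MemoryError("stacks about to meet: interrupt controller")
        self.mem[self.ac2:self.ac2 + len(packet)] = packet
        self.ac2 += len(packet)
```

Because the free space is shared, either stack can use whatever the other does not, which is the efficiency advantage over fixed per-stack allocations.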
As mentioned above, the array may be wrapped round in two dimensions. There are then two alternative routes from a source node to a destination node and in order to be able to choose between routes, the X and Y addresses must include a sign bit (most significant bit). The destination X and Y addresses are set up by the initiating PE, when it writes a message into the quiescent buffer 38, and the PE can determine the sign bit to choose the shortest route or the route which will minimise the load on certain parts of the array, for example.
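As an illustration of this choice, the signed offset along one wrapped axis can be computed as follows; this sketch is not from the patent, `size` is the array width (e.g. 32), and a negative result stands for travelling the opposite way round the ring:

```python
def signed_offset(own, dest, size):
    """Shortest-route selection on one wrapped (toroidal) axis: the two
    candidate offsets go opposite ways round the ring; the initiating PE
    can encode the chosen direction in the address sign bit. Returns the
    shorter offset, negative meaning the reverse direction."""
    forward = (dest - own) % size    # hops going one way round
    backward = forward - size        # same destination, the other way
    return forward if forward <= -backward else backward
```

A PE could equally well return the longer offset deliberately, to steer traffic away from heavily loaded parts of the array, as the description notes.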
The microcode program for each CP is preferably held in ROM on the same chip as a plurality of CP/PE nodes, the one ROM controlling all CP's on the chip as a SIMD array. The microcode will be given below only for the two-message embodiment; the microcode for the one-message embodiment is a simplification of that given. The first byte of a message contains the X address of the destination, with its sign bit; the second byte is of exactly the same format but contains the Y address of the destination, with its sign bit. The body of the message is of arbitrary length (in bytes) and follows immediately after the X,Y destination address.
The structure of the CP control word is as follows:

Field               Bits  Function
OP1                 3     AC1, AC2 or MEM, Temp, LENGTH, 2LEN, 3LEN
OP2                 3     AC1, AC2 or MEM, Temp, LENGTH, 2LEN, 3LEN
OP                  3     ADD, SUB, NOP, INC, DEC
DEST                3     AC1, AC2 or MEM, Temp, LENGTH, 2LEN, 3LEN
AREN                1     Address register enable
COMP                3     COMP T and AC1; X address of M1 and M2; Y address of M1 and M2; start of BUF (M1); start of BUF (M2)
TX                  1     Enable shift registers
P=NOT P             1     Switch page
NN                  3     NN register select
R/W MEM             1     Memory read/write line
PS                  1     Packet switch enable
LATCH SOURCE FIELD  2
CLEAR LATCHES       1     Clear latches
LATCH M1            1
LATCH M2            1
MS                  1     Message source (S1 or S2)
NORTH ACT           1     ACTIVITY control from north shift register ninth bit
SOUTH ACT           1
EAST ACT            1
WEST ACT            1
ARR ACT             1     ACTIVITY control from arrived latch
M1 ACT              1     ACTIVITY control from first message latch
M2 ACT              1     ACTIVITY control from second message latch

It will be appreciated that OP1, OP2 and DEST each select one of the eight registers 50 of Fig. 2. NN is an abbreviation for nearest neighbour.
In the microcode listings N (North) is to be equated with the register S1 (i.e. the register which receives input from the North), S (South) is to be equated with S2, and so on.
T (temp) - Temporary storage register
AC1 - Pointer to message buffer
AC2 - Pointer to arrived buffer
MEM - Memory byte addressed by the address register
AREN - This bit latches the O/P of the ALU into the memory address register
LENGTH - Length of packet (bytes)
2LEN - 2 x length of packet (bytes)
3LEN - 3 x length of packet (bytes)
OP - ALU instruction (operation)
COMP - The COMP instruction compares OP1 to the register specified by this field of the control word. The O/P's of the comparator are also controlled by this field
TX - This bit enables the NN shift registers
P=NOT P - This bit toggles the NN shift register page
NN - These 3 bits select the NN shift register accessed
R/W MEM - This bit refers only to move instructions between NN shift registers and memory; selecting no register with the NN field disables a move
PS - Enables packet switching

The final lines in the CP control word specify the activity source for the current instruction. The activity control may be taken from any one of the four NN shift registers (i.e. the ACTIVITY bit, that is the ninth bit, in the currently addressed page of the NN shift register specified). The fifth activity source line uses activity from the arrived latch. Another source of activity is the M2 activity latch: this activity line controls address offset instructions if a second message is to be sent in each major cycle, whereas M1 is an activity control for the first message. These activity lines may be coded into 3 bits to save control word space.
The ninth bit of the shift register acts as an activity bit or start bit. Initially all the ninth bits are set to zero. When a value is written to a shift register (S1..4 or S1'..4' depending on the page setting) then the ninth bit is automatically set to one.
The ninth bit is transmitted along with the 8 bits of data in the shift register. When the data has been shifted through the NN network to the next CP, this bit can be used as an activity bit indicating whether the particular shift register contains valid data. The CP, being activity controlled from the ninth bit, can then read the valid data from the shift register. The ninth bit is automatically reset when the read takes place.
If no valid data is in the shift register, the ninth bit will be zero and, since the CP is activity controlled from this bit, the data will not be read. On write operations the CP is activity controlled from the M1 and M2 activity lines. If a message is present, the CP can write data to the respective shift register. This write action sets the ninth bit to one.
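The life cycle of the ninth bit - set on write, carried with the byte, gating the read and cleared by it - may be modelled as follows; an illustrative sketch with invented names:

```python
class NNShiftRegister:
    """Sketch of the 9-bit shift-register protocol described above: the
    ninth ('activity') bit starts at zero, is set automatically by a
    write, indicates valid data to the receiving CP, and is reset
    automatically when the activity-controlled read takes place."""
    def __init__(self):
        self.data = 0
        self.activity = 0            # the ninth bit, initially zero

    def write(self, byte):
        """CP write: load the byte and set the activity bit."""
        self.data, self.activity = byte & 0xFF, 1

    def read(self):
        """Activity-controlled read: suppressed if no valid data,
        otherwise returns the byte and clears the activity bit."""
        if not self.activity:
            return None              # invalid data: the read does not occur
        self.activity = 0
        return self.data
```
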
The MESSAGE SOURCE line comes from a field of the control word.
Other control lines are CLEAR LATCHES, LATCH M1, LATCH M2 and LATCH SOURCE FIELD. The first of these is used to clear the message source latches shown in Figure 6. LATCH M1 is used to enable loading of the M1 latch and is set when there are any messages to send. LATCH M2 is used to enable loading of the M2 latch.
LATCH M2 is used as an activity source and is set if there is more than one message in the message buffer (line 4 in the microcode); it is subsequently set to zero if there is another message to send but the direction it wants to go in is already being used by the first message (line 13). LATCH SOURCE FIELD copies the respective source field bits into the ninth bit of the corresponding shift registers, which can be activity controlled.
The copy operation is itself controlled by the Equal (E) line from the comparator. This field is used in the circuit switching routine given (line 1).
Every cycle of the packet switching microcode is now listed and explained; this microcode supports sending in two directions simultaneously.
1) COMP AC1 and START OF BUFF: LATCH M1 - This comparison checks whether any messages are to be sent.
- If the result of the comparison is Equal then M1=0.
2) SUB LEN FROM AC1 and STORE in AC1 Activity: ACT M1 - Updates message buffer to point at beginning of the last - message received.
~: 3) LOAD S1 with M1, i.e. Sl=(AR); Activity: ACT M1 - Load sending registers with X address of first message.
4) COMP AC1 and START OF BUF; LATCH M2 - This comparison checks whether there is another message to be - sent. M2 is set if there is another message.
5) COMP X of M1; MS=O Activity: ACT M1 - Compare CP X coordinate with the destination X coordinate of - the message. Set message source to be S1 6) INC AC1,STORE IN AC1 and AR Activity: ACT M1 - Inc message buffer pointer and make a temporary store as well - as placing in the address register and original position.
7) COMP Y OF M1; MS =0 Activity: ACT M1 - This instruction will complete the output switch settings for - Ml.
- We are now ready for sending the first message.
8) AC1 - LEN, STORE IN AC1 Activity: ACT M2
- Point to second byte of message 2 (M2).
9) DEC AC1, STORE IN AC1 Activity: ACT M2
- Point to first byte of message 2 (M2).
10) LOAD S2 with FIRST BYTE OF M2 Activity: ACT M2
- Load S2 with X address of M2.
11) COMP X OF M2; MS=1 Activity: ACT M2
- Compare CP X coordinate with the destination X coordinate of
- the message. Set message source to be S2.
12) INC AC1, STORE IN AC1 and AR Activity: ACT M2
- Inc message buffer pointer and make a temporary store
- as well as placing in the address register and original
- position.
13) COMP Y OF M2; MS=1; LATCH M2 Activity: ACT M2
- This instruction will complete the output switch settings
- for M2.
- If another message can be sent then M2 will be set.
- We are now ready for sending the second message.
14) AC1 - LEN, SAVE AT AR, AC1 AND T Activity: ACT M1
- The address pointer is now pointing to the second byte
- of the first message.
15) STORE AC2 in AR Activity: ARR LATCH
- Update arrived pointer if message arrived.
16) MOVE S TO ABUF, INC AC2, STORE IN AC2, CHANGE PAGE Activity: ARR LATCH
- Move message to arrived buffer if arrived.
- NOTE - Only M1 can arrive; this avoids the necessity to
- provide an indirection register.
- The change page register is not activity controlled.
17) TX, LOAD S1 Activity: ACT M1
- Load sending reg with second byte of message and start
- transmission.
18) TX, AC1 - LEN, STORE AT AR and AC1 Activity: ACT M2
- If a second message is being sent then point to second
- byte of message.
19) TX, LOAD S2 Activity: ACT M2
- Load next byte of second message.
20) TX, DEC AC1 and SAVE AT AC1 and AR Activity: ACT M1
- The buffer pointer is now pointing at the first byte of M1
- if the second message can't be sent, else the pointer is at
- the first byte of M2. This allows the first message received
- to overwrite the message being sent out.
21) TX, MOVE LEN TO 2LEN.
22) TX, 2LEN + LEN, STORE AT 2LEN.
23) TX, 2LEN + LEN, STORE AT 3LEN.
- Set up constants for the packet switching routine.
24) TX, STORE AC2 in AR Activity: ARR LATCH
- Save a byte that may have arrived.
25) TX, MOVE S1 TO ABUF, INC AC2, STORE IN AC2, CHANGE PAGE Activity: ACT ARR LATCH
- Arrive buffer is updated; ABUF is addressed via the
- address register.
- Nine TX's later the shift registers have emptied, hence
- the page is changed again.
26) MOVE AC1 to AR
- Set address register to start of messages.
Looping starts here
27) TX, MOV N TO BUF, ADD LEN TO AC1, SAVE AT AC1 and AR Activity: N
- Move received message to buffer and update pointer.
- Note BUF is addressed via the address register.
28) TX, MOV S TO BUF, ADD LEN TO AC1, SAVE AT AC1 and AR Activity: S
- Move received message to buffer and update pointer.
29) TX, MOV E TO BUF, ADD LEN TO AC1, SAVE AT AC1 and AR Activity: E
- Move received message to buffer and update pointer.
30) TX, MOV W TO BUF Activity: W
- Move received message to buffer and DON'T update pointer.
31) TX, INC T, SAVE AT T, AC1 and AR
- Point to next byte of message and load address register.
32) TX, LOAD S1 with M1 Activity: ACT M1
- Load next byte of message 1 into sending register.
33) TX, AC1 - LEN, STORE AT AC1 and AR Activity: ACT M2
- If the second message is to be sent, then point at it.
34) TX, LOAD S2 with M2 Activity: ACT M2
35) TX, STORE AC2 in AR Activity: ARR LATCH
- Update arrived pointer if message arrived.
36) MOVE S1 TO ABUF, INC AC2, STORE IN AC2, CHANGE PAGE Activity: ARR LATCH
- Move message to arrived buffer if arrived.
37) MOVE T to AR.
DO INSTRUCTIONS 27 to 37 (LENGTH - 1) TIMES
38) MOV N TO BUF, ADD LEN TO AC1, SAVE AT AC1 and AR Activity: ACT N
- Move received byte to buffer and update pointer.
39) MOV S TO BUF, ADD LEN TO AC1, SAVE AT AC1 and AR Activity: ACT S
- Move received byte to buffer and update pointer.
40) MOV E TO BUF, ADD LEN TO AC1, SAVE AT AC1 and AR Activity: ACT E
- Move received byte to buffer and update pointer.
41) MOV W TO BUF Activity: ACT W
- Move received byte to buffer and DON'T update pointer.
42) INC AC1, SAVE AC1
- Pointer is now at first byte of next message to be received.
43) AC2 + 2LEN, STORE IN AC2.
- Has an overflow occurred?
44) AC1 + 3LEN, STORE IN T.
45) COMP T AND AC2.
- If T = AC2 then generate an interrupt.
- This instruction generates an interrupt to the controller
- if the buffers are potentially about to collide.
This completes one major cycle of the packet switching routine.
The next major cycle starts back at line one.
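The pointer-overrun test of cycles 43-45 can be modelled directly. The following sketch is an illustrative Python fragment with invented names (the constants corresponding to 2LEN and 3LEN are the multiples of the message length set up in cycles 21-23); it flags the interrupt condition when the arrive pointer is about to run into the send buffer:

```python
def buffers_about_to_collide(ac1, ac2, msg_len):
    """Model of cycles 43-45: advance AC2 by 2*LEN, form T = AC1 + 3*LEN,
    and signal an interrupt to the controller when the two pointers meet."""
    ac2 += 2 * msg_len          # cycle 43: AC2 + 2LEN -> AC2
    t = ac1 + 3 * msg_len       # cycle 44: AC1 + 3LEN -> T
    return t == ac2             # cycle 45: interrupt if T equals AC2
```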
Operation in the circuit switching mode involves not only use of the direction register 66 of Fig. 3 to assign a register to each output multiplexer 64 but also the circuit switching register 68 which accepts two bytes from the PE via the switch 56. The bytes are arranged as shown in Fig.8 in four groups of four. The four bits of the first group 120, if set, cause the registers S1 to S4 respectively to act in bus mode, in which data is passed through.
The four bits of the second group 122, if set, cause the registers S1 to S4 respectively to act in distribute mode, in which data is both passed through and read into the buffer 38. The four bits of the third group 124, if set, cause the registers S1 to S4 respectively to act in source mode, in which data is loaded from the buffer 38 into the shift register. The four bits of the fourth group 126, if set, cause the registers S1 to S4 respectively to act in destination mode, in which data is not passed through but is loaded into the buffer 38. It will be appreciated that, for any given register, there must be a set bit in only one of the four groups. The registers may be treated alike (their set bits are all in the same group) or differently (their set bits are distributed over two or more groups). A register performs no operation if it has no set bit in any group.
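The four-groups-of-four layout of Fig. 8 can be illustrated as a simple decode. The sketch below is a plain Python rendering (the bit ordering within each group, and the function name, are our assumptions; the patent fixes only the grouping itself and the one-group-per-register rule):

```python
# Groups 120, 122, 124, 126 of Fig. 8, in order.
BUS, DISTRIBUTE, SOURCE, DEST = range(4)

def decode_cs_register(word16):
    """Decode the 16-bit circuit-switching register into a mode for each
    of the shift registers S1..S4 (None = no operation).  Exactly one
    group may name any given register, as the text requires."""
    modes = {}
    for reg in range(4):                      # registers S1..S4
        set_groups = [g for g in range(4)
                      if word16 >> (g * 4 + reg) & 1]
        if len(set_groups) > 1:
            raise ValueError(f"S{reg + 1} set in more than one group")
        modes[f"S{reg + 1}"] = set_groups[0] if set_groups else None
    return modes
```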
The microcode for the circuit switching algorithm is as follows. Notice that initially all the shift registers will have been cleared, so for the first pass no valid data will be sent. A CP assigned the condition SOURCE has four messages, which are loaded to the shift registers S1 to S4 respectively. At the start of the routine AC1 points to the end of the message-to-send buffer.
1) TX, COMP AC1 TO START OF BUF, STORE AC1 IN AR and T; LATCH SOURCE FIELD
- If AC1 equals start of buffer, S1..4 in Source field are
- disabled.
2) TX, LOAD N FROM BUF, AC1 + LEN. Store in AC1 and AR Activity: ACT S1 (SOURCE)
- Send message in North direction.
3) TX, LOAD S FROM BUF, AC1 + LEN. Store in AC1 and AR Activity: ACT S2 (SOURCE)
- Send message in South direction.
4) TX, LOAD E FROM BUF, AC1 + LEN. Store in AC1 and AR Activity: ACT S3 (SOURCE)
- Send message in East direction.
5) TX, LOAD W FROM BUF, AC1 + LEN. Store in AC1 and AR Activity: ACT S4 (SOURCE)
- Send message in West direction.
6) TX, AC1 = T + 1
7) TX, STORE AC2 IN AR and T
The following four cycles unload the registers IF the CP has been assigned the condition DEST.
8) TX, UNLOAD N TO BUF, AC2 + LEN. Store in AC2 and AR.
Activity: ACT N AND DEST.
- Message arrived from N.
9) TX, UNLOAD S TO BUF, AC2 + LEN. Store in AC2 and AR.
Activity: ACT S AND DEST.
- Message arrived from S.
10) UNLOAD E TO BUF, AC2 + LEN. Store in AC2 and AR.
Activity: ACT E AND DEST.
- Message arrived from E.
11) UNLOAD W TO BUF, AC2 + LEN. Store in AC2 and AR, Change PAGE.
Activity: ACT W AND DEST.
- Message arrived from W.
12) AC2 = T + 1: LOOP BACK TO 1.
The looping is user controlled and continues until a message has had time to propagate from the source CP to the destination CP; the user will be aware of the maximum distance that a message must travel. Also note that, in the above implementation of the microcode, if a CP is expecting to receive messages from more than one direction (several of the S1..S4 bits in its destination field are set) then the messages must arrive at the same time, so as to ensure correct storage in the arrive buffer.
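One plausible way to bound the user-controlled loop count is by the longest path a message may take across the array. The sketch below is only an illustration of that reasoning, assuming a rectangular mesh of width x height CPs, one link traversed per pass, and one initial pass in which the cleared shift registers carry no valid data; none of these figures are fixed by the patent text:

```python
def circuit_switch_passes(width, height):
    """Assumed upper bound on the circuit-switching loop count: a message
    crossing the whole mesh traverses at most (width-1) + (height-1)
    links, plus the first pass in which no valid data is sent."""
    return (width - 1) + (height - 1) + 1
```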
Fig. 9 shows the hardware used to implement the M1 activity line. It consists of a latch which is enabled when the M1 latch line from the micro-control word is selected. The value latched is the inverted Equal (E) line from the ALU comparator. The latch is set in line 1 of the microcode, where a comparison is performed: if the quantity AC1 does not equal the start of the buffer then there must be at least one message to send. Fig. 10 shows the circuitry controlling the M2 activity line. The M2 latch is enabled in line 4 if there is another message to send. However, in line 13 it is set again from a logical combination of the message source latches.
This is done to ensure that, if there is a second message, it is not sent in a direction already selected by the first message.
Notice how the comparison line is disabled by the CMPY line.
The activity control is obtained by effectively ORing together all the activity sources, namely the ones listed under the structure of the CP control word. This activity line then disables the storage elements in the CP.
The X and Y coordinate registers are set up to match the actual position of the CP in the array. The length of a message is a fixed quantity that, like the X and Y coordinates, has to be fixed over the entire array.
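The array-wide set-up just described can be summarised in a tiny sketch. The helper below is hypothetical (its name and the dictionary representation are ours): it merely expresses that each CP's X and Y registers hold its grid position and that the message length LEN is identical across the whole array.

```python
def init_coordinates(width, height, msg_len):
    """Hypothetical set-up helper: every CP's X and Y registers match its
    position in the array, and LEN (the fixed message length) is the
    same for every CP, as the text requires."""
    return {(x, y): {"X": x, "Y": y, "LEN": msg_len}
            for x in range(width) for y in range(height)}
```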

Claims (10)

CLAIMS:
1. A parallel processor array comprising an array of processing elements for effecting parallel data processing, wherein communications between the processing elements are effected by a further array of communications processors controlled by a common controller.
2. A parallel processor array according to Claim 1, in which the array of communications processors constitutes a SIMD machine.
3. A parallel processor array according to claim 1 or 2, in which each processing element is connected to one communications processor.
4. A parallel processor array according to any of the preceding claims in which each processing element is connected to a communications processor by means of a shared memory.
5. A parallel processor array according to any of the preceding claims, in which each communications processor is coupled for bidirectional communications to a restricted set of other ones of the communications processors, each communications processor comprising storage means enabling it to receive messages simultaneously from any number of the associated set of coupled communications processors, and means responsive to address or routing information to accept a message addressed thereto and to send a message to different ones of the associated set of coupled communications processors from that from which the message was received.
6. A parallel processor array according to Claim 5, in which the address information is contained within the individual messages being received and each communications processor directs the message towards the processor to which it is addressed.
7. A parallel processor array according to Claim 5 or 6, in which additional storage means provide information for directing messages, so that a message may be originated, redirected, accepted, or redirected and accepted at a communications processor.
8. A parallel processor array according to Claim 5, 6 or 7, in which the storage means for receiving messages are duplicated to facilitate the concurrent unloading and reception of messages, and comprising control means responsive to the first bit of the message, which is set only if a valid message is received, and which can be used to select between the duplicated storage means.
9. A system comprising a plurality of arrays according to Claim 2 or any of Claims 3 to 8 insofar as dependent on Claim 2, in which each SIMD array of communications processors is disposed on one integrated chip which also carries the associated common controller, the controllers on the different chips being synchronized to operate in lock step.
10. An array of processing elements according to Claim 1 and substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.
GB08708025A 1987-04-03 1987-04-03 Parallel processor arrays Pending GB2203574A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB08708025A GB2203574A (en) 1987-04-03 1987-04-03 Parallel processor arrays


Publications (2)

Publication Number Publication Date
GB8708025D0 GB8708025D0 (en) 1987-05-07
GB2203574A true GB2203574A (en) 1988-10-19

Family

ID=10615196

Family Applications (1)

Application Number Title Priority Date Filing Date
GB08708025A Pending GB2203574A (en) 1987-04-03 1987-04-03 Parallel processor arrays

Country Status (1)

Country Link
GB (1) GB2203574A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0132926A2 (en) * 1983-05-31 1985-02-13 W. Daniel Hillis Parallel processor
EP0147857A2 (en) * 1983-12-28 1985-07-10 Hitachi, Ltd. Parallel data processing system
EP0183431A2 (en) * 1984-11-20 1986-06-04 Unisys Corporation System control network for multiple processor modules
GB2174519A (en) * 1984-12-26 1986-11-05 Vmei Lenin Nis Inter-connection network for matrix of microcomputers


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7149876B2 (en) 2000-08-22 2006-12-12 Micron Technology, Inc. Method and apparatus for a shift register based interconnection for a massively parallel processor array
US7409529B2 (en) 2000-08-22 2008-08-05 Micron Technology, Inc. Method and apparatus for a shift register based interconnection for a massively parallel processor array
US6754801B1 (en) 2000-08-22 2004-06-22 Micron Technology, Inc. Method and apparatus for a shift register based interconnection for a massively parallel processor array
GB2370139B (en) * 2000-08-22 2005-02-23 Micron Technology Inc A shift register based inter connection method for a massively parallel processor array
GB2370139A (en) * 2000-08-22 2002-06-19 Micron Technology Inc Parallel loaded shift register in parallel processor element
GB2395298B (en) * 2002-09-17 2007-02-14 Micron Technology Inc Flexible results pipeline for processing element
GB2395298A (en) * 2002-09-17 2004-05-19 Micron Technology Inc Reconfigurable multi processor array in which the result registers are selectively connected to the processing elements
US7627737B2 (en) 2002-09-17 2009-12-01 Micron Technology, Inc. Processing element and method connecting registers to processing logic in a plurality of configurations
US8006067B2 (en) 2002-09-17 2011-08-23 Micron Technology, Inc. Flexible results pipeline for processing element
GB2417105A (en) * 2004-08-13 2006-02-15 Clearspeed Technology Plc Processor system with global PE memory
GB2417105B (en) * 2004-08-13 2008-04-09 Clearspeed Technology Plc Processor memory system
US9037836B2 (en) 2004-08-13 2015-05-19 Rambus Inc. Shared load-store unit to monitor network activity and external memory transaction status for thread switching
US9836412B2 (en) 2004-08-13 2017-12-05 Rambus Inc. Processor memory system

