WO1988004077A1 - Pipelining technique and pipelined processes - Google Patents

Pipelining technique and pipelined processes

Info

Publication number
WO1988004077A1
WO1988004077A1 (PCT/US1987/003072)
Authority
WO
WIPO (PCT)
Prior art keywords
processor
array
node
cell
bit
Application number
PCT/US1987/003072
Other languages
French (fr)
Original Assignee
Thinking Machines Corporation
Blelloch, Guy
Ranade, Abhiram
Application filed by Thinking Machines Corporation, Blelloch, Guy, and Ranade, Abhiram
Publication of WO1988004077A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F 15/8007 Single instruction multiple data [SIMD] multiprocessors
    • G06F 15/803 Three-dimensional arrays or hypercubes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 Interprocessor communication
    • G06F 15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F 15/17337 Direct connection machines, e.g. completely connected computers, point to point communication networks
    • G06F 15/17343 Direct connection machines wherein the interconnection is dynamically configurable, e.g. having loosely coupled nearest neighbor architecture

Abstract

A method of (and related apparatus for) pipelining the execution of selected operations in an n-dimensional array of processor cells having at least 2^n nodes with at least one processor per node. Each processor cell includes memory means and a processor element for producing an output depending at least in part on data read from said memory means and instruction information supplied to the cell. Each processor cell is identified by an address in the array, which specifies the location of the processor cell within the dimensionality of the array. The array is operated so as to provide processing time slots during which the processor cells execute said operations and communications time slots during which the processors transmit information to each other. During each communications time slot, each processor can receive one bit of data from only one other processor (i.e., the preceding stage) along an edge dimension "d" of the n-cube; and each processor can transmit only one bit of data to only one other processor, along edge dimension d + 1. A data value for an element of an input data array is supplied to the memory of each node. Then, for each of a series of successive time slots, each of a set of first processor cells executes said operation on a selected bit of the argument in the memory of its node, in accordance with a bit received from a first other processor cell, and transmits the result of said operation to a second other processor cell, until the final result appears at a predetermined node. The computation (i.e., selected operation) performed by the processors is identical for all processors, but may be conditional. Algorithms which can be converted into an appropriate form for pipelining in this fashion include those which (1) can be implemented by sending information along only one dimension in the array at a time and (2) send information along successive dimensions whose dimension numbers form an arithmetic sequence. Further, for an algorithm to be appropriate for (i.e., efficiently suited for) such pipelining, it must be possible to start performing the underlying computation without having all "M" bits of the data words available. A number of exemplary pipeline algorithms are disclosed, including addition of several terms in an array (i.e., sum reduction) and partial sum generation of the terms in an array (i.e., parallel prefix-sum).

Description

PIPELINING TECHNIQUE AND PIPELINED PROCESSES
Cross Reference to Related Applications
Related applications are "Parallel Processor," Serial No. 499,474 and "Parallel Processor/Memory Circuit," Serial No. 499,471, both filed May 31, 1983; "Method and Apparatus for Routing Message Packets," Serial No. 671,835, filed November 15, 1984 and now U.S. Patent No. 4,598,400, issued July 1, 1986; "Method and Apparatus for Interconnecting Processors in a Hyper-Dimensional Array," Serial No. 740,943, filed May 31, 1985; "Method and Apparatus for Simulating Systems Described by Partial Differential Equations," filed December 27, 1985; and "Method of Simulating Additional Processors in a SIMD Parallel Processor Array," Serial No. 832,913, filed February 24, 1986; all of which are hereby incorporated by reference.
Field of the Invention
This invention relates to the field of parallel processing or multi-processing in digital computer systems. More particularly, it relates to a technique for implementing pipelining of operations in n-dimensional parallel processing arrays.
Background of the Invention
Until recently, the architecture of digital computer systems has been dominated by sequential processing using single processors. This has been changing, though, with a move toward parallel processing architectures. The goal of parallel processing is to achieve higher computer speeds and computational power. Simply put, parallel processing involves the use of several processors operating concurrently. The processors may be operating independently, on different, isolated tasks; or they may be operating on different parts of a larger problem. Among the most complex of these parallel processing environments are parallel processor arrays. These include arrays of microprocessors and arrays of other processor/memory elements. One exemplary parallel processing architecture is an array formed as an n-dimensional pattern having at least 2^n nodes through which data may be routed from any processor/memory element in the array to any other processor/memory element. Such an arrangement is shown, for example, in U.S. Patent No. 4,598,400 and is discussed in W. D. Hillis, The Connection Machine, MIT Press, 1985, both of which are hereby incorporated by reference. As exemplified therein, the n-dimensional pattern is a Boolean cube, or hyper-cube, of anywhere from 12-16 dimensions; and each processor element is essentially only an arithmetic/logic unit (ALU) rather than an entire microprocessor.
The efficient use of the processors in an n-dimensional cube is important if the benefits of the increased processing power of this architecture are to be realized. In general, this means that, as a goal, each processor should be maximally active and minimally inactive. Consequently, problems and tasks must be re-cast from their sequential, single processor form, to take advantage of opportunities for concurrent processing and to exploit the properties of the interprocessor communications network.
Pipelining is a design philosophy which complements parallelism; that is, it is a way to exploit parallelism. Parallelism achieves high speeds by replicating (and executing) some basic function many times, with one piece of the input data provided for each replication. Pipelining, by contrast, takes the same function and partitions it into many autonomous but interconnected subfunctions. The concept of pipelining, in general, is virtually as old as electronic computers. A useful treatise on the general subject of pipelining is Peter M. Kogge, The Architecture of Pipelined Computers, Hemisphere Publishing Corporation and McGraw-Hill Book Company (New York), 1981, which is incorporated by reference herein for general background information. As explained in that reference, the implementation of pipelining generally takes the approach of breaking the function to be performed into smaller pieces, and allocating separate hardware to each piece, termed a "stage." Much as water flows through a physical pipeline, instructions, or data, flow through the stages of a digital computer pipeline; the rate of flow-through is independent of the length of the pipeline (i.e., number of stages) and depends only on the rate at which new entries may be fed to the input of the pipeline. A computer pipeline, like its physical counterpart, may do more than simply move its contents unchanged from one location to the next. For example, a physical pipeline in a chemical plant may have several stages dedicated to filtering its contents, adding chemicals, and boiling them. Comparably, a computer pipeline may have stages devoted to instruction fetching, decoding, and execution.
As a particular item flows through the pipeline, it occupies only one stage at a time. Simultaneously, an item which entered the pipeline ahead of the referenced item occupies a stage further down the pipeline, and an item which enters the pipeline after the referenced item will occupy a stage closer to the input end of the pipeline. As time goes on, the stage vacated by one item is occupied by the item immediately following it. This concurrent use of many different stages by different items is often called "overlap." The net result is that the maximum rate at which new items may enter the pipeline depends strictly on the longest time required to traverse any single stage, and not on the number of stages.
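The overlap property can be checked with one line of arithmetic. The small sketch below is purely illustrative (it is not from the patent): filling and draining a pipeline of S stages with N items costs N + S - 1 stage-times, so the completion rate approaches one item per stage-time regardless of pipeline length.

    # Total stage-times to push N items through an S-stage pipeline.
    def pipeline_slots(items: int, stages: int) -> int:
        return items + stages - 1

    print(pipeline_slots(items=1000, stages=4))   # 1003: about 1 item per slot
    print(pipeline_slots(items=1000, stages=40))  # 1039: longer pipe, same rate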
The goal of designing a computer using pipelining is enhanced performance. The key to how much performance gain is possible depends on the operations which are pipelined and the quality of the partitioning of those functions into individual subfunctions to which stages can be assigned. Such systems are often hierarchically designed, with each stage of one level of pipelining itself actually constituting a pipeline.
It is not immediately apparent, however, how to efficiently apply pipelining techniques to complicated architectures such as n-dimensional cubes. Architectures of such complexity have heretofore typically employed microprocessors as the processing elements of each node. In such an arrangement, pipelining may be employed internal to each processor; but in a global sense, pipelining has not been readily applied.
Accordingly, it is an object of the present invention to provide a technique for pipelining the processing of operations in a computer system configured as an n-dimensional Boolean cube array of processor elements.
Another object of the invention is to provide a technique for providing pipelined processes corresponding to non-pipelined processes, applicable to such computers, for the efficient execution of broad classes of algorithms.
Summary of the Invention
According to the present invention, a high degree of processor utilization is obtained in an n-dimensional cube array of processors, for processing algorithms of certain types, by a specialized pipelining technique. Algorithms meeting certain criteria may readily be cast in a form appropriate for execution in this pipelined arrangement.
The foregoing and other objects and advantages are obtained in the present invention as follows: (a) for each processing cycle, each node in the pipeline receives data from another processor (i.e., the preceding stage), along an edge dimension "d" of the n-cube; (b) each such processor executes a selected operation, using the data it received in step (a); and (c) each such processor transmits its results along edge dimension d + 1 to the next node. The computation (i.e., selected operation) performed by the processors in the aforesaid step "b" is identical for all processors. This provides orderly communications and a pipeline that can be filled at the lowest dimension, dimension zero.
Algorithms which can be converted into an appropriate form for pipelining in this fashion include those which (1) can be implemented by sending information along only one dimension in the array at a time and (2) send information along successive dimensions whose dimension numbers form an arithmetic sequence. Further, for an algorithm to be appropriate for (i.e., efficiently suited for) such pipelining, it must be possible to start performing the underlying computation without having all "M" bits of the data words available. A number of exemplary pipeline algorithms are disclosed, including addition of several terms in an array (i.e., sum reduction) and partial sum generation of the terms in an array (i.e., parallel prefix-sum).
The invention is pointed out with particularity in the appended claims. The above and further objects, features and advantages of the invention may be better understood by referring to the following detailed description, which should be read in conjunction with the accompanying drawing.
Brief Description of the Drawing
In the drawing,
Fig. 1 is a schematic illustration of a Boolean n-cube of three dimensions;
Fig. 2 is a schematic illustration of a Boolean n-cube of four dimensions;
Fig. 3 is a block diagram of an exemplary processor cell such as may be used in the processor array of the present invention;
Fig. 4 is a schematic illustration of a Boolean n-cube with eight nodes and three processor cells per node, in accordance with the present invention;
Figs. 5A and 5B are, collectively, a listing of detailed procedure for performing a pipelined prefix-sum operation according to the present invention;
Figs. 6A - 6H comprise a diagrammatic illustration of the results of performing the successive steps of Figs. 5A and 5B to calculate the prefix-sum of the data array 2,3,1,2,1,3,2,3 using an array of eight nodes with three processors per node, with the results appearing in Fig. 6H; and
Figs. 7A and 7B are, collectively, a listing of detailed procedure for performing a pipelined prefix-max operation according to the present invention, for finding the maximum value in a data array.
Detailed Description of an Illustrative Embodiment
The Hardware Environment
The processing environment in which the present invention operates is an array of a large number of processor cells (e.g., 64K = 2^16), each having several thousand bits of memory (e.g., 2^12 = 4K bits) and a simple serial arithmetic logic unit (ALU). The processor cells (also referred to below as processor elements) are connected by a communications network configured as a modified Boolean n-cube topology.
All processors execute instructions from a single stream generated by a microcontroller under the direction of a conventional host computer.
In the array, processor cells are packaged together in groups of sixteen, in an integrated circuit "chip." A single chip is placed at each node in the n-cube; the cube itself is of dimension 12 (i.e., has 2^12 nodes) in the example discussed herein.
When the processor chips are connected into a Boolean 12-cube (i.e., n = 12), each processor cell is connected to its sixteen nearest neighbors; some are on the same chip but most are on other chips, at different vertices. Each chip communicates with the remainder of the array through twelve (12) "hypercube wires" or dimension wires, one for each possible dimension of the hypercube.
A second communications system is provided internal to each chip for strictly local communications. Through this local communications system, each processor may communicate directly with two of its on-chip neighbors. Specifically, processor P(m,j) at node m can receive data from processor P(m,j-1), except when j = 0, and can send data to processor P(m,j+1), except when j = J - 1, where the processors at each node are indexed from j = 0 to j = J - 1 and J is the number of processors in each node. These on-chip communications take place over one-way connections called "node wires."
This combination of node wires and hypercube wires produces a topology which is similar, but not identical, to a so-called "cube-connected cycle" (CCC) arrangement; it is a subset of the cube-connected cycle, omitting certain communications paths which would be present in the full CCC topology. The advantage of such a topology is that it reduces addressing hardware. Specifically, it reduces the number of wires over which each processor cell must communicate directly. A processor cell is required to be able to communicate over only three wires--one hypercube wire for "off-chip" communications (over an appropriate dimension) and two node wires for on-chip communications with its neighboring on-chip processors. That is, the sixteen on-chip processor cells share external (i.e., hypercube) communications wires. By contrast, if only one processor cell were provided per node, and the same number of processors were employed in a conventional 16-dimensional cube, each processor would have to connect directly to sixteen communications wires. This, of course, would require considerably more communications-related circuitry on each chip, leaving less room (on chips of the same size) for processor cells.
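As a concrete illustration, the three links available to a given cell can be written down directly. The sketch below is illustrative only; the function name is an assumption, and the hypercube pairing rule m XOR 2^j anticipates the pairing scheme described in detail later.

    # The three communication links of cell P(m, j) in the modified
    # cube-connected-cycle topology: two one-way node wires and one
    # hypercube wire for dimension j. J is the number of cells per node.
    def links(m: int, j: int, J: int):
        return {
            "node_wire_in":  (m, j - 1) if j > 0 else None,      # from P(m, j-1)
            "node_wire_out": (m, j + 1) if j < J - 1 else None,  # to P(m, j+1)
            "hypercube":     (m ^ (1 << j), j),                  # dimension-j partner
        }

    print(links(m=0b0101, j=1, J=16))
    # {'node_wire_in': (5, 0), 'node_wire_out': (5, 2), 'hypercube': (7, 1)}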
In a comparable 16-dimensional hypercube, the execution of a pipelined operation such as adding 2^n numbers of m bits each would require m + n = m + 16 steps. In the modified cube-connected cycle array described above, the number of required steps is slightly larger: m + 2n - 1 = m + 2(12) - 1 = m + 23 steps, plus some additional overhead for the four "extra" processors per chip. The latter topology therefore increases the number of steps only slightly over the larger-dimensioned hypercube, but saves on hardware. By contrast, without pipelining, the execution of the same operation would require m x n steps. Pipelining therefore saves a considerable number of processing steps for large values of m and n.
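These step counts can be reproduced directly; the small calculation below is illustrative only, assuming m = 32-bit operands:

    # Step counts for adding 2^n m-bit numbers, per the comparison above.
    m = 32
    print(m + 16)          # 48:  pipelined, 16-dimensional hypercube (m + n)
    print(m + 2 * 12 - 1)  # 55:  pipelined, modified CCC on a 12-cube (m + 2n - 1)
    print(m * 16)          # 512: without pipelining (m x n)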
While it is not possible to provide an intelligible illustration of the large number of interconnections in a "cube" of 12 or 16 dimensions, an appreciation of the complexity of the interconnection pattern of a Boolean n-cube can be obtained from a consideration of the interconnections that would be used for an array of such processor chips in Boolean n-cubes of three and four dimensions. Fig. 1 depicts schematically a Boolean n-cube 10 of three dimensions, with one processor chip, 12, at each node. This will be recognized as a conventional cube having eight nodes (i.e., vertices) and twelve edges. Three dimensions of this cube are identified by the Roman numerals I, II and III. At each node is a chip 12 containing one or more processors; from each chip, there are three output lines that extend along the three dimensions of the cube to the chip's nearest neighbors. The bottom left hand node is assumed to be the "origin" of this system; accordingly, the processor chip at that node has the 0 position, or address, in the first, second and third dimensions of the cube. The address is written (000), where the parentheses are used to denote that the quantity is an address. Since each chip can be at one of only two positions in each dimension (i.e., each dimension place in the address can be only 0 or 1), the other chips have addresses that are other three digit combinations of 0 and 1, as shown in Fig. 1.
Fig. 2 illustrates a Boolean n-cube of four dimensions. In such a cube there are sixteen nodes and thirty-two edges. Again, a processor chip (of one or more processors) is located at each node and is connected to its nearest neighbors by input lines and output lines. In this case, however, each chip has four nearest neighbors (instead of three) and, therefore, four input lines and four output lines extending along the four dimensions of the 4-cube. The position of each chip in the Boolean 4-cube is identified by a four-digit binary address as shown in Fig. 2, and the four dimensions of this 4-cube are identified by Roman numerals I, II, III and IV.
The extrapolation of this pattern to cubes of a larger number of dimensions will be apparent. In each case, adding a dimension will produce a cube with twice as many vertices and with each processor chip having one additional nearest neighbor. Accordingly, a Boolean 12-cube will have 4,096 nodes, with a chip at each node; and each chip will have twelve nearest neighbors. Each individual processor cell can be extremely simple. For example, its data paths can be only one bit wide and it may have only eight bits of internal state information (i.e., flags). A block diagram of such an exemplary processor cell 14 is shown in Fig. 3. There, the cell memory is shown at 16, the ALU at 18 and the state information (flag) register at 22. The basic operation of the processor cell is to read two bits from an external memory and one flag, and to combine them according to a specified logical operation; this produces two result bits, which are written into the memory and an internal flag register, respectively. Three clock cycles are needed for this sequence, one for each reference to the memory. Other, different or more complicated processor cells may be used, of course, to allow for logical operations other than those available with an ALU.
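The read-combine-write sequence of such a cell can be sketched as follows. This is a hedged illustration, not the patent's circuit: the names (cell_cycle, alu_op) and the choice to write the memory result back to the first address are assumptions made here for concreteness.

    # One three-clock-cycle operation of a bit-serial processor cell:
    # read two memory bits and one flag, combine them with a selected
    # logical operation, and write one result bit to memory and one to
    # the flag register.
    def cell_cycle(memory, addr_a, addr_b, flags, flag_idx, alu_op):
        a = memory[addr_a]                    # clock cycle 1: first memory reference
        b = memory[addr_b]                    # clock cycle 2: second memory reference
        f = flags[flag_idx]
        mem_out, flag_out = alu_op(a, b, f)   # alu_op: (a, b, f) -> (mem_bit, flag_bit)
        memory[addr_a] = mem_out              # clock cycle 3: write-back reference
        flags[flag_idx] = flag_out

    # Example ALU operation: a full-adder step with the flag as carry.
    full_add = lambda a, b, c: (a ^ b ^ c, (a & b) | (b & c) | (a & c))
    mem, flg = [1, 1, 0, 0], [1] * 8
    cell_cycle(mem, 0, 1, flg, 0, full_add)
    print(mem[0], flg[0])                     # 1 1  (1 + 1 + 1 = binary 11)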
As stated above, the chip at each node normally will contain several processor cells, but that is not meant to preclude the use of a single processor cell per node. Certain advantages may be gained (principally in terms of a reduction in hardware needed for interprocessor communications) by using a cluster of processor cells contained on a single chip. A simple example of this arrangement appears in Fig. 4, which shows how the three-dimensional cube 10 of Fig. 1 may be modified by the use at each node of a cluster of three processors instead of just one. Each dot 24 (just a few of which are labelled) represents a processor cell; each box 26 surrounding a group of processor cells 24 indicates the group is clustered on a chip. The communications arrangement within a chip is represented by the unidirectional arcuate links 28 (illustrated at just two nodes but present at all nodes). Of course, in the exemplary twelve-dimensional cube of interest herein, each processor chip contains sixteen processors, instead of just the three processors per node shown in the simplified three-dimensional cube illustration of Fig. 4.
The wiring between the chips themselves establishes the pattern of the Boolean n-cube. The address of each processor within the array depends on its relative position with respect to a predetermined origin. Geometrically, the Boolean n-cube can be interpreted as a generalization of a cube to an n-dimensional Euclidean space. Each dimension of the space corresponds to one bit position in the node address. An edge of the cube running along the k-th dimension connects two vertices whose node addresses differ by 2^k; that is, they differ in the k-th bit of their addresses.
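In code, the neighbor relation is a single exclusive-OR; the helper below is an illustrative sketch, not part of the patent:

    # The dimension-k neighbor of a node differs from it only in bit k.
    def neighbor(address: int, k: int) -> int:
        return address ^ (1 << k)

    print(format(neighbor(0b000, 2), "03b"))  # 100: across dimension III of Fig. 1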
To permit communication through the interconnection pattern of the Boolean 12-cube, the computer system is operated so that it has both processing cycles (or "slots") and communications cycles (or "slots"). Computations are performed during the processing cycles. During the communications cycles, the results of the computations are routed from one chip to the next, in accordance with a processor pairing scheme described below.
PIPELINING
Among those operations which lend themselves to pipelined execution in such arrays are certain "reduction" operations and certain "prefix" operations. A reduction operation takes as input an array x of n numbers and produces as output a single number. A prefix operation also takes as input an array x of n numbers, but its output, rather than being a single number, is another array z, also of n numbers. Each element of the output array z is a reduction of all elements of x either (a) up to but not including the corresponding element of z or (b) up to and including the corresponding element of z. If the prefix operation does not include the corresponding element of z, it is said to be "exclusive"; if it does include such element, it is said to be "inclusive."
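For addition, these definitions work out as follows; the snippet is an illustrative sketch using the same eight-element array as Figs. 6A-6H:

    # Reduction versus inclusive and exclusive prefix, for addition.
    from itertools import accumulate

    x = [2, 3, 1, 2, 1, 3, 2, 3]
    reduction = sum(x)                  # a single number: 17
    inclusive = list(accumulate(x))     # [2, 5, 6, 8, 9, 12, 14, 17]
    exclusive = [0] + inclusive[:-1]    # [0, 2, 5, 6, 8, 9, 12, 14]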
Example -- Addition (i.e., Sum-Reduction)
A typical operation which can be pipelined for execution is the operation of sum-reduction, or addition, of an array of n numbers, each m bits in length. Assume that at the start of the operation there is an m-bit number at each node of the array. A microcontroller (not shown) generates a sequence of detailed instructions to control communications between nodes. During each communications time slot (which for convenience are numbered, starting from zero), each of the processors sends partial sum and carry or argument information to a second processor (i.e., the next stage) and receives information from still a third processor (i.e., the previous stage), until the full sum appears at a predetermined node.
To impose order on this processing, sum-reduction is performed as a series of partial sums, formed from pairs of bit-wise additions. The processors of the array are paired for this purpose in a defined pattern, as follows. Each processor is identified by a pair of indices (m,j); the first index, m, identifies the specific node in the hypercube (i.e., 0 ≤ m < 2^N) and the second index, j, identifies a specified processor within the node (i.e., 0 ≤ j < J, where J represents the number of processors per node). For communications in the pipeline, processor P(m,j) is paired with processor P(m⊕2^j,j), where "j" refers to the dimension number and "⊕" is the symbol for a bit-wise exclusive-OR operation. That is, a hypercube wire for dimension j is considered to connect processors P(m,j) and P(m⊕2^j,j); for this operation, data flows across the hypercube dimension wires in one direction only, from processor P(m,j) to processor P(m⊕2^j,j). By convention, m is less than m⊕2^j, which means that the node address m has a zero bit in place j. Nodes are numbered from 0 to 2^N - 1.
The same instruction is executed by all of the processors during each processing time slot, although each processor can perform operations conditional on its indices (m,j). All processors access the same location, each in its own memory, at the same time.
For the discussion below, let y name a field of bits (one field per processor); the notation y(m,j)[k] refers to bit "k" of the field y within processor P(m,j). The bits of a field are numbered starting from 0; the field may be regarded as an unsigned integer in binary notation, with bit 0 being the least significant bit (LSB).
To simplify the detailed explanation of sum-reduction, the following example assumes there is only one processor per node; the extension to multiple-processor nodes is straightforward. Fig. 2, of course, shows a 4-cube with one processor per node; the following discussion therefore will refer to the array shown in Fig. 2.
The desired sum is formed by successive development of partial sums. Each bit "b" of such a partial sum is transmitted along a hypercube wire belonging to dimension k at time slot b+k, where both the bits b and the dimensions k are numbered from zero, and the first time slot (i.e., the time slot when node P0000 transmits to processor P0001) is designated slot zero.
At time slot "t", each node adds into bit t-k of its partial accumulation the bit coming in from dimension k, provided that t-k is both (1) greater than or equal to zero and (2) less than m (the length of the operands, in bits).
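This timing rule can be simulated end to end before walking through the four-cube example that follows. The sketch below is an illustrative reading of the rule, not the patent's register-level listing: node m sends along the dimension of its lowest 0 bit (matching the walkthrough below), Python integers stand in for the bit-field memories so carry propagation is implicit, and the transmitted field is widened to m + n bits so that carries out of the top operand bit are not lost (a detail the text does not spell out).

    def trailing_ones(m: int) -> int:
        """Number of consecutive 1 bits at the low end of m."""
        k = 0
        while m & (1 << k):
            k += 1
        return k

    def pipelined_sum_reduction(addends, n):
        assert len(addends) == 1 << n
        width = max(a.bit_length() for a in addends) + n   # room for the full sum
        acc = list(addends)                                # one accumulator per node
        for t in range(width + n - 1):                     # successive time slots
            # Communications cycle: collect the bits in flight at slot t.
            in_flight = []
            for m in range(1 << n):
                k = trailing_ones(m)                       # this node's send dimension
                b = t - k                                  # bit crossing dimension k now
                if k < n and 0 <= b < width:
                    in_flight.append((m | (1 << k), ((acc[m] >> b) & 1) << b))
            # Processing cycle: receivers fold the arriving bits into bit t - k.
            for receiver, weighted_bit in in_flight:
                acc[receiver] += weighted_bit
        return acc[(1 << n) - 1]                           # full sum at node 111...1

    values = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
    print(pipelined_sum_reduction(values, 4))              # 80 == sum(values)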
Assume that the array has been initialized with one addend at each node. During communications step (or slot) 0, each of processors P0000, P0010, P0100, P0110, P1110, P1100, P1010, and P1000 sends a bit along respective hypercube wires 52-58, 62-68, in dimension I to its paired processor (respectively, processors P0001, P0011, P0101, P0111, P1111, P1101, P1011 and P1001). During the processing step which follows, each of the receiving processors P0001, P0011, P0101, P0111, P1111, P1101, P1011 and P1001 forms the partial sum of the least significant bits of the addend it originally held and the addend received over the dimension I wires. In communications step 1, processor P0101 sends its partial sum and carry bits over wire 72 in dimension II to processor P0111 and processor P0001 sends its partial sum over wire 74 to processor P0011; similarly, processor P1001 sends a partial sum over wire 76 in dimension II to processor P1011 and processor P1101 sends its partial sum over wire 78 to processor P1111. The four receiving nodes then form new partial sums.
In communications slot 2, processor P0011 sends a partial sum to processor P0111 and processor P1011 sends a partial sum to processor P1111. The two receiving processors compute new partial sums in the computation cycle which follows. Then, in communications slot 3, processor P0111 sends a partial sum bit to processor P1111. Upon the completion of the computation activity at node P1111 during the next computation slot, the full sum will be available at node P1111.
The foregoing procedure, which has been described in the context of an exemplary four-dimensional cube, can, of course, be extended to a generalized approach to be used in the n-dimensional cube.
Example -- Sum-Parallel-Prefix
Another operation which is readily adaptable to pipelining is referred to as the "prefix-sum" or "sum-parallel-prefix" operation, which consists of forming the intermediate partial sums of a series of numbers to be added together (as well as their full sum, in the case of an inclusive prefix-sum). This operation is best described in terms of a mathematical definition of steps to be performed.
Start from the assumption that an array of 2^N integers is stored in field x of length w in (the memory of) the "base" processor of each node -- i.e., in processors P(m,0) for 0 ≤ m < 2^N. The object is then to compute the exclusive prefix-sum of this array of data. That is, (the memory at) a predetermined processor, following these computations, is to contain the value

    SUM[k=0..m-1] x(k,0)

The procedure given below accomplishes this objective, placing the result in processor P(m,N-1). The value in that processor, however, is too large by a factor of 2^(N-1) -- that is, it is displaced upward in memory by N-1 bit positions.
Consequently, at the conclusion of the computation, a field z of length w+2N-1 bits will then have been computed, such that for each m

    z(m,N-1) = 2^(N-1) * SUM[k=0..m-1] x(k,0)
This calculation requires w + 2N-1 computations, where at each computation one bit is transferred across each hypercube wire (two bits are transferred between adjacent processors internal to a node), and a constant number of single-bit operations is performed by each processor.
As an extra benefit, the operation also computes a second field of length w + 2N-1, such that for each m,

    y(m,N-1) = 2^(N-1) * SUM[k=0..2^N-1] x(k,0)
That is, every processor whose second (intranode) index is N-1 will have the same value -- namely, the sum of all the original x values, displaced in memory by N-1 bit positions.
Implementation
To implement the foregoing process, certain memory fields are allocated for use by each processor: (1) an input field, x, of w bits in length; (2) a field, y, of w + 2N-1 bits in length; (3) a field, z, of w + 2N-1 bits in length; (4) a single-bit field called "b"; (5) a single-bit field called "c"; and (6) a single-bit field called "d". The field "c" contains a carry bit for addition operations involving the field y; the field "d" contains a carry bit for addition operations involving field z. Next, let the operation
p, q <---- adder (t, u, v)
take three input bits t, u and v and store their two-bit sum into the bits p and q; that is, p contains the logical sum and q contains the carry bit. This is equivalent to the simultaneous execution of the following two operations:
p <---- t ⊕ u ⊕ v and q <---- (t ∧ u) ∨ (u ∧ v) ∨ (t ∧ v).
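In code, the adder operation is just a one-bit full adder; this transcription of the two formulas above is illustrative:

    # p, q <---- adder(t, u, v): two-bit sum of three input bits.
    def adder(t: int, u: int, v: int):
        p = t ^ u ^ v                    # logical (mod-2) sum
        q = (t & u) | (u & v) | (t & v)  # carry bit
        return p, q

    print(adder(1, 1, 0))                # (0, 1): 1 + 1 + 0 = binary 10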
Figs. 5A-5B contain a detailed listing for the pipelined prefix-sum process 100. The method starts with a step 102 to set to zero bits c and d of all processors. Next, the procedure 103, 103' (comprising the remainder of Figs. 5A and 5B) is performed. These Figures are self-explanatory but a brief explanation will nevertheless be given to facilitate their comprehension.
Initially, an input value (or argument) is supplied to the memory associated with the base processor at each node -- i.e., the processor for which the index j is 0. For the first time slot, designated T=0, a 0 bit is written to the output wire for each node. The 0 bit is chosen so as not to contribute to the result; for operations other than sum-reduction, another value may be necessary in this initial step. For each successive time slot, each processor writes bit [i-1] of its y(m,j) field to its output node wire, where i is an index corresponding to the number of the time slot, so that successive bits of the y field are sent in successive time slots. In the same time sequence, each processor, in each time slot, reads a bit from its input node wire and stores that bit in y(m,j)[i] (i.e., the i-th bit of the y field).
Those processors other than processors whose intranode index j=N-1 then write a 0 bit to their output node wires during the first time slot and write bit z(m,j)[i-1] to their output node wires for all other time slots up to slot w+N-2. Further, each processor reads a bit from its input node wire and stores that bit in z(m,j)[i]. The "base" processor in each node then performs one of two sequences, depending on the time slot. If the time slot is less than the w-th slot, where w represents the length of the input values, in bits, the processor places a 0 in bit z(m,j)[i] and replaces the contents of bit y(m,j)[i] with x(m,j)[i].
However, if the time slot is not less than w (but is less than w+N), then a 0 is placed in each of bits z(m,j)[i] and y(m,j)[i]. Each processor then writes bit y(m,j)[i] to its hypercube output wire and reads into bit "b" a bit from its hypercube input wire.
A summation step is next executed by each processor, such that y(m,j)[i] contains the sum of bits b, y(m,j)[i] and c, while the carry from that operation is placed in bit c. Finally, if bit j of the index m is 1, a summation step is executed such that z(m,j)[i] contains the sum of bits b, z(m,j)[i] and d, while the carry from that operation is placed in bit d.
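Repeated over successive time slots, these two steps amount to bit-serial addition: the bit arriving on a wire is added LSB-first into the y (or z) field, with the carry riding between slots in bit c (or d). A standalone C sketch of that accumulation, reusing the hypothetical adder above (the actual process of Figs. 5A-5B interleaves this with the wire transfers already described):

/* Add an incoming bit stream in[0..len-1] serially into field[],
 * one bit per time slot; *carry must be cleared (step 102) beforehand. */
static void serial_add(uint8_t field[], uint8_t *carry,
                       const uint8_t in[], int len)
{
    for (int i = 0; i < len; i++)            /* one iteration per time slot */
        adder(in[i], field[i], *carry, &field[i], carry);
}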
Figs. 6A-6G collectively illustrate the successive steps in calculating the prefix-sum of the array 2,3,1,2,1,3,2,3 (after time slots 0 through 6, respectively) on an array of 8 nodes, with three processors per node. The results appear in Fig. 6H. Bit positions not yet calculated, or which contain meaningless information, are indicated by the letter "x." Note that the members of the array to be used are injected at the first processor (i.e., j = 0) in each node; the exclusive partial-sum results, however, appear in the calculated bits of the z fields at the third processor (i.e., j = 2) of each node. The y fields at such processors all contain the sum-reduction of the same input array.
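For reference, the values the pipeline should deliver can be checked against an ordinary sequential computation. The following self-contained C sketch prints the exclusive prefix sums destined for the z fields (0, 2, 5, 6, 8, 9, 12, 14) and the sum-reduction destined for the y fields (17), ignoring the N-1 bit-position displacement noted earlier:

#include <stdio.h>

int main(void)
{
    int x[8] = {2, 3, 1, 2, 1, 3, 2, 3};   /* the input array of Figs. 6A-6H */
    int sum = 0;
    for (int m = 0; m < 8; m++) {
        printf("z at node %d = %d\n", m, sum);   /* exclusive prefix sum */
        sum += x[m];
    }
    printf("y (sum-reduction) = %d\n", sum);     /* prints 17 */
    return 0;
}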
Example: Prefix-max
Another operation adaptable to this type of pipelining is that of determining the maximum value in an array. More particularly, an array of 2^N integers is stored with one integer in the x field (of length w) in each of the processors P(m,0); for this purpose, the only processors of interest are those for which 0 ≤ m < 2^N. The result (i.e., the maximum value in the input array) is generated in processor P(m,N-1), displaced downward in memory by N-1 bit positions; that is, the result value is too small by the factor 2^(N-1). This may be contrasted with the prefix-sum operation, which displaces its result upward in memory; the difference arises because the prefix-sum operation processes fields LSB first, while the prefix-max operation processes fields MSB first.
The pipelined prefix-max operation is performed in w+N-1 iterations. This is N iterations fewer than the prefix-sum calculation requires, because there is no need to deal with carries. At each iteration, one bit is transferred across each hypercube wire, two bits are transferred between adjacent processors within a node, and a constant number of single-bit operations are performed by each processor.
The operation also computes a second field y of length w+N-1 bits, such that for each m
y(m,N-1) = MAX(x(k,0)), k = 0, ..., 2^N - 1
where only the w low-order bits of y are considered. That is, every processor whose intranode index is N-1 will have the same value. That value is the maximum of all the original x values, displaced in memory by N-1 bit positions.
Implementation
Implementation begins by assigning to each processor a w-bit input field x, fields y and z of length w+N-1 bits each, and five single-bit fields called b, c, d, e, and f. Bits c and e serve as holders of state information for maximum operations involving y, while bits d and f serve as holders of state information for maximum operations involving z. Next, let the operation

p, q, r <---- maxer(s, t, u, v)
take four input bits s, t, u and v, where s and t are operand bits and u and v are state information bits, and store three result bits into p, q, and r, where p is the result bit and q and r are the new state information bits. If the first state bit (q or u) is 1, it means the first operand has been determined to be the larger; if the second state bit (r or v) is 1, it means the second operand has been determined to be the larger; and if both state bits are 0, it means that the two operands have been the same so far.
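A hypothetical C rendering of this maxer primitive follows (the encoding matches the state-bit convention just described; the function name and types are illustrative only):

#include <stdint.h>

/* maxer: p, q, r <---- maxer(s, t, u, v).  s and t are the current
 * (MSB-first) operand bits, u and v the incoming state bits; p is the
 * next bit of the running maximum, q and r the updated state bits. */
static void maxer(uint8_t s, uint8_t t, uint8_t u, uint8_t v,
                  uint8_t *p, uint8_t *q, uint8_t *r)
{
    if (u) {              /* first operand already determined larger    */
        *p = s; *q = 1; *r = 0;
    } else if (v) {       /* second operand already determined larger   */
        *p = t; *q = 0; *r = 1;
    } else if (s == t) {  /* operands equal so far                      */
        *p = s; *q = 0; *r = 0;
    } else {              /* first differing bit: MSB-first, the 1 wins */
        *p = 1; *q = s; *r = t;
    }
}

Because the fields are scanned MSB first, the state bits settle at the first position where the operands differ, which is why the prefix-max process needs no carry fields.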
A more complete detailing of the prefix-max process 200 is given in Figs. 7A-7B, which are generally self-explanatory. The sequence of steps 112' listed in Fig. 7B directly follows the sequence of steps 112 listed in Fig. 7A.
Having thus explained the general method of pipelining operations in a Boolean n-cube or hypercube, and having provided examples of three operations which may be so pipelined, it will be apparent that other operations may also be pipelined and that the above-described detailed procedures for pipelining operations may be modified, altered or amended without departing from the spirit and scope of the invention. For example, instead of transmitting only one bit per communications slot and processing only one bit per computation slot, multiple bits could be transmitted and processed. Such modifications, alterations, amendments and improvements as will be obvious to those skilled in the art are intended to be suggested by this disclosure. Accordingly, the foregoing discussion is intended to be exemplary only, and not limiting. The invention is limited only by the claims which follow, and equivalents thereto.
Claims

What is claimed is:
1. A method of pipelining the execution of selected operations in an array of processor cells, each processor cell in the array including memory means and a processor element for producing an output depending at least in part on data read from said memory means and instruction information supplied to the cell; the array also including means for interconnecting the processor cells in an n-dimensional pattern having at least 2^n nodes with at least one processor per node; each processor cell being identified by an address in the array, which specifies the location of the processor cell within the dimensionality of the array; such method comprising the steps of:
a. operating the array so as to provide processing time slots during which the processor cells execute said operations and communications time slots during which the processors transmit information to each other;
b. further operating the array such that during each communications time slot each processor can receive one bit of data from only one other processor and such that each processor can transmit only one bit of data to only one other processor; c. supplying to the memory of each node in the array the data value for an element of an input array to be processed in accordance with the selected operation; and
d. for each successive time slot at time = i, where i ranges from 0 to some predetermined upper limit, each of a set of first processor cells executing said operation on a selected set of bits of the argument in the memory of its node, in accordance with a corresponding set of bits received from a first other processor cell, and transmitting the result of said operation to a second other processor cell.
2. The method of claim 1 wherein said selected set of bits comprises the ith bit of the word in memory.
3. The method of claim 1 wherein the operation executed by each processor is an addition.
4. The method of claim 1 wherein the selected set of bits includes at least two bits.
5. The method of claim 1 wherein the operation executed by each processor comprises a logical comparison of at least one bit of a data value for an element of the input array with a data bit received from said first other processor cell.
6. A method of pipelining the execution of selected operations in an array of processor cells, each processor cell in the array including memory means and a processor element for producing an output depending at least in part on data read from said memory means and instruction information supplied to the cell; the array also including means for interconnecting the processor cells in an n-dimensional pattern having at least 2^n nodes with at least one processor per node; each processor cell being identified by an address in the array, which specifies the location of the processor cell within the dimensionality of the array; such method comprising the steps of:
a. operating the array so as to provide a series of processing time slots during which the processor cells execute said operations and communications time slots during which the processors transmit information to each other;
b. further operating the array such that each processor can receive data from only one other processor and such that each processor can transmit data to only one other processor;
c. supplying to the memory of each node in the array an element of an input array to be processed in accordance with the selected operation; and d. for each successive time slot at time i, where i ranges from 0 to some predetermined upper limit, each of a set of first processor cells executing said operation on a selected bit of the argument in the memory of its node, in accordance with a bit received from another processor cell, and transmitting the result of said operation to another processor cell.
7. A method of pipelining the execution of selected operations in an array of processors, each processor cell in the array including memory means and a processor element for producing an output depending at least in part on data read from said memory means and instruction information supplied to the cell; said method comprising the steps of:
a. connecting the processors as an n-dimensional Boolean cube with 2^n nodes and J processors per node of the cube, such that
(1) the nodes of the array are connected through a pattern of hypercube dimension wires arranged to form said n-dimensional Boolean cube,
(2) the J processors at each node of the array are connected in a local communications system via node wires connecting pairs of such processors, (3) each processor cell is identified by a pair of address coordinates (m,j), the first coordinate specifying the nodal position of the processor in the cube and the second coordinate specifying the position of the processor within the node,
(4) every processor has an input node wire and an output node wire, except that those processors whose intranode index j is J-1 have no output node wire and those processors whose intranode index j is 0 have no input node wire,
(5) each processor P(m,j) is permitted to exchange data with only one other processor, P(m ⊕ 2^j, j) (where ⊕ denotes bitwise exclusive-OR), such that processor P(m,j) (for j>0) is permitted to receive data only from processor P(m,j-1), and processor P(m,j) (for j<J-1) is permitted to send data only to processor P(m,j+1);
b. operating the array so as to provide a series of processing time slots during which the processor cells execute said operations and communications time slots during which the processors transmit information to each other; c. supplying to the memory of each processor cell in the array the value of an element of an input array to be processed in accordance with the selected operation; and
d. each processor P(m,j) receiving from processor P(m ⊕ 2^j, j) a set of bits during each communications time slot, then executing the selected operation on the data in its memory and said received set of bits during the next processing time slot, and during the subsequent communications time slot transmitting the results of such operation to processor P(m,j+1) when j ≠ J-1 and to processor P(m+1,0) when j = J-1, unless m = 2^n - 1.
8. The method of claim 7 wherein the set of bits comprises a single bit.
9. The method of claim 7 wherein the set of bits comprises a plurality of bits.
10. A pipelined parallel processing system, comprising:
a. an array of processor cells, each processor cell including memory means and a processor element for producing an output depending at least in part on data read from said memory means and instruction information supplied to the cell; the array also including means for interconnecting the processor cells in an n-dimensional pattern having at least 2^n nodes with at least one processor per node; each processor cell being identified by an address in the array, which specifies the location of the processor cell within the dimensionality of the array; such system comprising:
a. the memory of each node in the array being supplied with the data value for an element of an input array to be processed in accordance with the selected operation;
b. sequencer means for operating the array so as to provide processing time slots during which the processor cells execute said operations and communications time slots during which the processors transmit information to each other;
c. an inter-processor communications network which during each communications time slot allows each processor to receive one bit of data from only one other processor and which allows each processor to transmit only one bit of data to only one other processor; and
d. a set of first processor cells for executing during each processing time slot t, where t ranges from 0 to a preselected upper limit, the selected operation on a selected bit of the argument in the memory of its node, in accordance with a bit received from another processor cell, and for transmitting the results of said operation to another processor cell.
11. The array of claim 10 wherein each processor is identified by a first index, m, indicating the nodal location of the processor in the array, and a second index, j, indicating the position of the processor within its node, and wherein said communications network allows processor P(m,j) to exchange data with only one other processor, P(m ⊕ 2^j, j), such that processor P(m,j) (for j>0) is permitted to receive data only from processor P(m,j-1), and processor P(m,j) (for j<J-1) is permitted to send data only to processor P(m,j+1).
PCT/US1987/003072 1986-11-24 1987-11-24 Pipelining technique and pipelined processes WO1988004077A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US93381486A 1986-11-24 1986-11-24
US933,814 1986-11-24

Publications (1)

Publication Number Publication Date
WO1988004077A1 (en)

Family

ID=25464543

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1987/003072 WO1988004077A1 (en) 1986-11-24 1987-11-24 Pipelining technique and pipelined processes

Country Status (1)

Country Link
WO (1) WO1988004077A1 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0132926A2 (en) * 1983-05-31 1985-02-13 W. Daniel Hillis Parallel processor
US4621339A (en) * 1983-06-13 1986-11-04 Duke University SIMD machine using cube connected cycles network architecture for vector processing
EP0208457A2 (en) * 1985-07-09 1987-01-14 National Research Development Corporation A processor array

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5170482A (en) * 1987-08-14 1992-12-08 Regents Of The University Of Minnesota Improved hypercube topology for multiprocessor computer systems
WO1994017488A1 (en) * 1993-01-22 1994-08-04 University Corporation For Atmospheric Research Multipipeline multiprocessor system
US5689722A (en) * 1993-01-22 1997-11-18 University Corporation For Atmospheric Research Multipipeline multiprocessor system
KR100997024B1 (en) * 2007-09-27 2010-11-25 엔비디아 코포레이션 System, method and computer-readable recording medium for performing a scan operation

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): DE GB JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE FR GB IT LU NL SE

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642