WO1987006034A1 - Data-flow multiprocessor architecture for efficient signal and data processing - Google Patents

Data-flow multiprocessor architecture for efficient signal and data processing

Info

Publication number
WO1987006034A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
processor
micromachine
bus
flow
Prior art date
Application number
PCT/US1987/000410
Other languages
English (en)
French (fr)
Inventor
Michael L. Campbell
Dennis J. Finn
George K. Tucker
Michael D. Vahey
Rex W. Vedder
Original Assignee
Hughes Aircraft Company
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hughes Aircraft Company
Publication of WO1987006034A1 publication Critical patent/WO1987006034A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/448Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4494Execution paradigms, e.g. implementations of programming paradigms data driven

Definitions

  • the present invention relates to methods and apparatus for performing high-speed digital computations of programmed large-scale numerical and logical problems, in particular to such methods and apparatuses making use of data-flow principles that allow for highly parallel execution of computer instructions and calculations.
  • Multiprocessor architectures are widely accepted as the class of architectures that will enable this goal to be met for applications that have sufficient inherent parallelism.
  • Systolic arrays, tightly coupled networks of von Neumann processors, and data flow architectures are three such classes.
  • Systolic arrays are regular structures of identical processing elements (PEs) with interconnection between PEs. High performance is achieved through the use of parallel PEs and highly pipelined algorithms. Systolic arrays are limited in the applications for which they may be used. They are most useful for algorithms which may be highly pipelined to use many PEs whose intercommunications may be restricted to adjacent PEs (for example, array operations). In addition, systolic arrays have limited programmability. They are "hardwired" designs in that they are extremely fast, but inflexible. Another drawback is that they are limited to using local data for processing. Algorithms that would require access to external memories between computations would not be suitable for systolic array implementation.
  • Tightly coupled networks of von Neumann processors typically have the PEs interconnected using a communication network, with each PE being a microprocessor having local memory.
  • some architectures provide global memory between PEs for interprocessor communication. These systems are best suited for applications in which each parallel task consists of code that can be executed efficiently on a von Neumann processor (i.e., sequential code). They are not well suited for taking full advantage of low-level (micro) parallelism that may exist within tasks. When used for problems with low-level parallelism they typically give rise to large ALU (arithmetic and logical unit) idle times.
  • Data flow multiprocessor architectures based on the data flow graph execution model implicitly provide for asynchronous control of parallel process execution and inter-process communication, and when coupled with a functional high-level language can be programmed as a single PE, without the user having to explicitly identify parallel processes. They are better suited to taking advantage of low-level parallelism than von Neumann multiprocessor architectures.
  • a data flow graph represents this information using nodes (actors) for the operations and directed arcs for the data dependencies between actors.
  • the output result from an actor is passed to other actors by means of data items called tokens which travel along the arcs.
  • the actor execution, or firing, occurs when all of the actor's input tokens are present on its input arcs.
  • When actors are implemented in an architecture they are called templates. Each template consists of slots for an opcode, operands, and destination pointers, which indicate the actors to which the results of the operation are to be sent.
  • the data flow graph representation of an algorithm is the data dependency graph of the algorithm.
  • the nodes in the graph represent the operators (actors) and the directed arcs connecting the nodes represent the data paths along which operands (tokens) travel between actors.
  • the actor may "fire" by consuming its input tokens, performing its operation on them, and producing some output tokens.
  • a restriction is placed on the arcs and actors so that an arc may have at most one input token on it at a time. This implies that an actor may not fire unless all of its output arcs are empty.
  • a more general definition allows for each arc to be an infinite queue into which tokens may be placed.
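As an illustration of the firing rule described above, the following Python sketch models actors and single-token arcs. The class and function names here are our own illustrative inventions, not structures taken from the patent.

```python
class Arc:
    def __init__(self):
        self.token = None          # at most one token per arc

class Actor:
    def __init__(self, opcode, input_arcs, output_arcs):
        self.opcode = opcode
        self.input_arcs = input_arcs
        self.output_arcs = output_arcs

    def can_fire(self):
        # Fire only when every input token is present and, under the
        # single-token restriction, every output arc is empty.
        return (all(arc.token is not None for arc in self.input_arcs)
                and all(arc.token is None for arc in self.output_arcs))

    def fire(self, ops):
        operands = [arc.token for arc in self.input_arcs]
        for arc in self.input_arcs:
            arc.token = None       # consume the input tokens
        result = ops[self.opcode](*operands)
        for arc in self.output_arcs:
            arc.token = result     # produce the output tokens
        return result

# Example: a two-input ADD actor.
a, b, out = Arc(), Arc(), Arc()
add = Actor("ADD", [a, b], [out])
a.token, b.token = 2, 3
if add.can_fire():
    add.fire({"ADD": lambda x, y: x + y})
assert out.token == 5
```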
  • All data flow architectures consist of multiple processing elements that execute the actors in the data flow graph.
  • Data flow architectures take advantage of the inherent parallelism in the data flow graph by executing in separate PEs those actors that may fire in parallel.
  • Data flow control is particularly attractive because it can express the full parallelism of a problem and reduce explicit programmer concern with interprocessor communication and synchronization.
  • In U.S. Patent Number 3,962,706 (Dennis et al.), a data processing apparatus for the highly parallel execution of stored programs is disclosed. Unlike the present invention, the apparatus disclosed makes use of a central controller and global memory and therefore suffers from the limitations imposed by such an architecture.
  • U.S. Patent Number 4,153,932 discloses another version of the apparatus disclosed in the previous two patents, distinguished by the addition of a new network apparently intended to facilitate expandability, but not related to the present invention.
  • In U.S. Patent Number 4,418,383 (Doyle et al.), a large-scale integration (LSI) data flow component for processor and microprocessor systems is described. It bears no substantive relation to the processing element of the present invention, nor does it teach anything related to the data flow architecture of the present invention.
  • None of the inventions disclosed in the patents referred to above provides a processor designed to perform image and signal processing algorithms and related tasks that is also programmable in a high-level language which allows exploiting a maximum of low-level parallelism from the algorithms for high throughput.
  • the present invention is designed for efficient realization with advanced VLSI circuitry using a smaller number of distinct chips than other data flow machines. It is readily expandable and uses short communication paths that can be quickly traversed for high performance. Previous machines lack the full capability of the present invention for large-throughput realtime applications in data and signal processing in combination with easy programmability in a high-level language.
  • the present invention aims specifically at providing for the performance of signal processing problems and the related data processing functions, including tracking, control, and display processing, on the same processor.
  • An instruction-level data flow (micro data flow) approach and compile-time (static) assignment of tasks to processing elements are used to get efficient runtime performance.
  • the present invention is a data flow architecture and software environment for high performance signal and data processing.
  • the programming environment allows applications coding in a functional high-level language, the Hughes Data Flow Language, which is compiled to a data flow graph form which is then automatically partitioned and distributed to multiple processing elements.
  • a data flow graph language assembler and local allocator allow programming directly in data flow graph form.
  • the data flow architecture consists of many processing elements connected by a three-dimensional bussed packet routing network.
  • the processing elements are designed for implementation in VLSI (very large scale integration) to provide realtime processing with very large throughput.
  • the modular nature of the data-flow processor allows adding more processing elements to meet a range of throughput and reliability requirements. Simulation results have demonstrated high-performance operation. Accordingly, it is one object of the present invention to provide a data-flow multiprocessor that is a high-performance, fault-tolerant processor which can be programmed in a high-level language for large-throughput signal and data processing applications.
  • FIG. 1 is a schematic block diagram of the present invention, with illustrative information about some of its parts to the right of the block diagram.
  • FIG. 2 is a schematic representation of how the processing elements are connected together in a three- dimensional bussed packet routing network.
  • FIG. 3 shows how a data packet consists of a packet type, a PE and template address, and one or more data words.
  • FIG. 4 illustrates the organization of a processing element in schematic block diagram form.
  • FIG. 5 shows how templates and arrays are mapped into physical memories.
  • FIG. 6 gives examples of some primitive actors which are implemented directly in hardware.
  • FIG. 8 is a simulation results graph of throughput for the program radar3na (in millions of instructions per second) versus the number of processing elements.
  • the curve marked "A” is for a random allocation algorithm
  • the curve marked "B” for an allocation algorithm using transitive closure
  • the curve marked "C” for an allocation algorithm using nontransitive closure.
  • FIG. 9 is a plot of simulation results for the program radarb. The ordinate represents throughput in MIPS and the abscissa represents number of processing elements.
  • the lower curve is for a random allocation algorithm and the upper curve is for a nontransitive closure allocation algorithm.
  • FIG. 10 is a simulation results graph of percentage of time the ALU is busy versus the number of processing elements for the program radar3na.
  • the solid curves marked “D” and “G” are average ALU busy time and maximum ALU busy time, respectively, for a transitive closure allocation algorithm.
  • the curves marked “E” and “F” are average ALU busy time and maximum ALU busy time, respectively, for a nontransitive closure allocation algorithm.
  • FIG. 11 shows percentage of time the ALU is busy in a simulation of the program radarb versus number of processing elements, using a nontransitive closure allocation algorithm.
  • the lower curve is for average ALU busy time and the upper curve is for maximum ALU busy time.
  • FIG. 12 is a graph of percentage of maximum achieved throughput versus percentage of time the average ALU is busy.
  • the solid circles are from simulation results for the program radarb using a nontransitive closure allocation algorithm.
  • the x-symbols and open circles are for the program radar3na using transitive closure and nontransitive closure allocation algorithms, respectively.
  • FIG. 13 is a plot of the percentage of packet communication that is local (intra-PE as opposed to inter-PE) versus number of processing elements for the program radar3na.
  • the lower curve is for a transitive closure allocation algorithm and the upper curve is for a nontransitive closure allocation algorithm.
  • FIG. 14 is a plot of the percentage of packet communication that is local (intra-PE as opposed to inter-PE) versus number of processing elements for the program radarb.
  • FIG. 15 is a graph of the length (in packets) of the result queue versus number of processing elements for the nontransitive closure allocation of the program radarb.
  • the lower curve is average queue length and the upper curve is maximum queue length.
  • FIG. 16 is a plot of average communication packet latency (in clock cycles) versus number of processing elements for nontransitive closure allocation of the program radarb.
  • FIG. 1 is a schematic block diagram of the present invention, a data flow architecture and software environment for signal and data processing.
  • the programming environment allows applications coding in a functional high-level language, which results in a program file 20 that is input into a compiler 30; the compiler converts it to a data flow graph form 40, which a global allocator 50 then automatically partitions and distributes to multiple processing elements 80.
  • programming can alternatively be done in data flow graph form, assembled by an assembler 15 that operates directly on an input data flow graph file 13, whose output is then allocated to the processing elements by a local allocator 17.
  • the data flow processor 70 consists of many processing elements 80 connected in a three-dimensional bussed packet routing network. Data enters or leaves the processor 70 by means of input/output devices 90 connected to the processor.
  • the data flow processor 70 comprises 1 to 512 identical processing elements connected by a global inter-PE communication network.
  • This network is a three-dimensional bussed network in which the hardware implements a fault-tolerant store- and-forward packet-switching protocol which allows any PE to transfer data to any other PE.
  • Each processing element contains queues for storing packets in the communication network, and the appropriate control for monitoring the health of the processing elements and performing the packet routing.
  • the communication chip 81 connects a processing element 80 to the row, column, and plane buses 82, 84, 86.
  • the communication chip 81 acts like a crossbar in that it takes packets from its four input ports and routes them to the appropriate output ports. In addition, it provides buffering with a number of first-in-first-out queues, including the processor input queue 112 and processor output queue 114.
  • the three-dimensional bussed network is optimized for transmission of very short packets consisting of a single token. As shown in FIG. 3, each packet consists of a packet type, an address, and a piece of data.
  • packets include normal token packets, initialization packets, and special control packets for machine reconfiguration control.
  • the address of each packet consists of a processing element address and a template address which points to one particular actor instruction within a processing element.
  • the data can be any of the allowed data types of the high-level data flow language or control information if the packet is a control packet.
  • a configuration of up to 8x8x8 or 512 processing elements can be physically accommodated by the communication network. Many signal processing problems could potentially use this many processing elements without overloading the bus capacity because of the ease of partitioning these algorithms. However, for general data processing the bus bandwidth will start to saturate above four processing elements per bus. More processing elements can be added and performance will increase, but at lower efficiency per processing element.
  • a single-path routing scheme is used in transferring packets between PEs. In other words, the same path is used every time packets are sent from a given source PE to a given sink PE. This guarantees that packets sent from an actor in the given source PE to an actor in the given sink PE arrive in the same order in which they were sent, which is necessary when actors are executed more than once (as, for example, when the graph is pipelined).
  • Each PE continually monitors its plane, column, and row buses looking for packets it should accept. PEs accept packets addressed directly to them, and packets that need to be rerouted to other PEs through them. For example, if a packet is put on a plane bus, all PEs on that bus examine the packet address and the PE whose plane address matches the packet's plane address accepts the packet.
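The text spells out only the plane-bus case of this acceptance rule. The sketch below generalizes it to a plane-then-column-then-row order, which is our assumed reading for illustration rather than a statement of the patented protocol.

```python
def accepts(pe, dest, bus):
    """pe, dest: (plane, column, row) addresses; bus: the bus on which
    the packet was observed. Returns True if this PE should take it."""
    if pe == dest:
        return True                    # packet addressed to this PE
    if bus == "plane":
        return pe[0] == dest[0]        # matching plane coordinate
    if bus == "column":
        return pe[:2] == dest[:2]      # plane and column already routed
    return False                       # row bus delivers to the exact PE

# PE (3, 0, 0) picks up a plane-bus packet bound for (3, 5, 7) and
# would forward it on its column bus.
assert accepts((3, 0, 0), (3, 5, 7), "plane")
assert not accepts((2, 0, 0), (3, 5, 7), "plane")
```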
  • the communication network is designed to be reliable, with automatic retry on garbled messages, distributed bus arbitration, alternate-path packet routing, and failed processing element translation tables to allow rapid switch-in and use of spare processing elements. Static fault tolerance is fully supported.
  • a spare PE can be loaded with the templates from the failed PE and operation continued. This creates two problems, however: 1) the spare PE has a different address than the PE it replaced, and 2) messages that were to be routed through the failed PE must instead be routed around it.
  • the first problem is solved by two methods.
  • First, the applications program can be reallocated using the allocator software during a scheduled maintenance period.
  • Second, for immediate recovery, a small number of failed-PE address translation registers, called the error memory 110, are provided in each PE.
  • When a PE fails, its address is entered in the error memory 110, followed by the address of its replacement PE.
  • Each packet generated is checked against the error memory and, if a match is made, the replacement address is substituted for the address of the failed PE. Routing of packets around failed PEs is accomplished by each PE keeping track of which directly connected PEs are operative and which have failed. When a PE has failed, the sending PE routes packets to alternate buses.
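A minimal sketch of this translation step, with a Python dictionary standing in for the small set of translation registers; all names here are illustrative.

```python
def translate_address(dest_pe, error_memory):
    """Substitute the replacement address if dest_pe has failed."""
    return error_memory.get(dest_pe, dest_pe)

error_memory = {(2, 1, 3): (7, 7, 7)}   # failed PE -> spare PE
assert translate_address((2, 1, 3), error_memory) == (7, 7, 7)  # redirected
assert translate_address((0, 0, 1), error_memory) == (0, 0, 1)  # unchanged
```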
  • Dynamic fault tolerance can be provided by running two or more copies of critical code sections in parallel in different PEs and voting on the results. Unlike difficulties encountered in other types of parallel processors, the data flow concept avoids synchronization problems by its construction, and interprocess communication overhead is minimized because it is supported in hardware. This software approach to dynamic fault tolerance minimizes the extra hardware required for this feature.
  • the packets that are transferred contain either 16-bit or 24-bit token values (see FIG. 3).
  • the data paths are 17 bits wide: 16 data bits plus 1 tag bit.
  • Each packet contains six type bits, a PE address, an actor address, and the data being transmitted from one actor to another.
  • the PE address identifies the destination PE and the actor address identifies the actor within that PE to which data is being sent.
  • the PE address is 9 bits (3 bits each for the plane, column, and row addresses) and can be used to address up to 512 distinct PEs (such as there would be in an 8x8x8 cubic arrangement of PEs).
  • Variable-length packets are supported by the network protocol, with the last word of a packet transmission indicated by an end-of-packet bit.
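The following sketch packs these fields into words. The patent fixes the field widths (six type bits, a 9-bit PE address, an end-of-packet tag bit) but not the exact layout, so the bit ordering below is an assumption for illustration.

```python
def pe_address(plane, column, row):
    assert all(0 <= c < 8 for c in (plane, column, row))
    return (plane << 6) | (column << 3) | row        # 9-bit PE address

def make_packet(ptype, plane, column, row, template_addr, data_words):
    assert 0 <= ptype < 64                           # six type bits
    header = (ptype << 9) | pe_address(plane, column, row)
    # Variable length: on the wire, the last word of the packet would
    # carry the end-of-packet bit in the 17th (tag) bit position.
    return [header, template_addr] + list(data_words)

pkt = make_packet(ptype=1, plane=2, column=5, row=7,
                  template_addr=0x0A3, data_words=[0x1234])
```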
  • Each processing element 80 consists of a communications chip 81, a processing chip 120, and memory, as shown in FIG. 4.
  • the communication network is distributed over all the PEs for improved fault tolerance.
  • the part of the communications network associated with a single PE is represented in FIG. 4 by external plane, column, and row buses 82, 84, 86.
  • the external buses 82,84,86 use parity, a store-and-forward protocol, and a two-cycle timeout that indicates a bus or PE failure if the packet reception signal is not received within two cycles.
  • the parity and timeout features are used for error detection.
  • the store-and-forward protocol is necessary because the input queue at the receiving communication chip may be full, in which case the sending communication chip needs to retransmit the packet later.
  • the arbitration control of the external buses 82,84,86 is decentralized for high reliability. Pairs of input/output queues 88,100; 102,104; and 106,108 are used to buffer the data entering or leaving via the external plane, column, and row buses 82,84,86. Two internal buses 89 and 107 are used for sending packets from the plane, column, and row input queues 88,102,106 to the plane, column, and row output queues 100,104,108. All of the buses use round-robin arbitration.
  • the communication chip 81 accepts tokens addressed to actors stored in its associated processing chip 120 and passes them to it.
  • An error memory 110 in the communications chip 81 contains a mapping of logical PE addresses to physical PE addresses. Normally the two are the same, but if a PE fails, its logical address is mapped to the physical address of one of the spare PEs. Static fault tolerance is used. When a PE fails, self-test routines are used to determine whether the failure is temporary or permanent. If it is permanent, the code that was allocated to the failed PE must be reloaded into the spare PE that will have the address of the failed PE. The program must then be restarted from the last breakpoint.
  • the communication chip is highly pipelined so that it can transmit a word of a packet almost every cycle.
  • Each individual PE is a complete computer with its own local memory for program and data storage.
  • Associated with each processing chip are two random access memories (RAMs) 146 and 156 that store the actors allocated to the PE. These two memories, the destination memory 146 and template memory 156, are attached to processing chip 120. Each is composed of multiple RAM chips and has an access time of less than 80 nanoseconds, with two cycles required per memory access.
  • a single bidirectional bus 158 is used to communicate between the communication chip 81 and the processing chip 120.
  • the processing chip contains four special-purpose microprocessors which we choose to call "micromachines".
  • the processing chip 120 accepts tokens from the communications chip 81 and determines whether each token allows an actor to fire. If not, the token is stored until the matching token or tokens arrive. If a token does enable an actor, then the actor is fetched from memory and executed by an ALU micromachine 144 in the processing chip 120. The resulting value is formed into one or more tokens and they are transmitted to the other actors that are expecting them.
  • a template consists of a slot for an opcode, a destination list of addresses where results should be sent, and a space for storing the first token that arrives until the one that matches it is received.
  • the memory is also used to store arrays, which can be sent to the memory of a single processing element or distributed over many processing elements. With distributed arrays, it is possible for an actor executing in one processing element to need access to an array value stored in the memory of another processing element. Special actors are provided in the architecture for these nonlocal accesses. Given the array index or indices, the address of the processing element containing the value is calculated based on the way the array is distributed, and a request for the value is sent via the communication network. The other processing element then responds by sending back the requested value as a normal token. Nonlocal array updates are handled similarly.
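A hedged sketch of the owner computation for a distributed array. The patent says the owning PE is computed from the index and the way the array is distributed; the simple block distribution used here is one plausible scheme chosen for illustration, not the patent's own.

```python
def owner_pe(index, array_len, pes):
    """Return the PE holding element `index` of an array
    block-distributed over the list `pes`."""
    block = -(-array_len // len(pes))        # ceiling division
    return pes[index // block]

pes = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
assert owner_pe(5, array_len=16, pes=pes) == (0, 0, 1)
# A read of element 5 would send a request packet to PE (0, 0, 1),
# which replies with the value as a normal token.
```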
  • the processing chip is a pipelined processor with the following three operations overlapped: 1) instruction/operand fetch and data flow firing rule check, 2) instruction execution, and 3) matching results with destination addresses to form packets. There is some variance in the service times of each of these units for different instructions, so queueing is provided between the units as shown in FIG. 4.
  • the instruction fetch and data flow firing rule check is performed by two parallel micromachine units, the template memory controller 130 and the destination memory controller 122.
  • the templates are spread across three independent memories: the fire detect memory 132, the template memory 156, and the destination memory 146.
  • the first 4K locations of each of these memories contain addresses of actors.
  • the fire detect memory 132 only has 4K locations.
  • the template memory 156 and destination memory 146 have additional memory that is used to store variable-length data associated with each actor, array data, and queue overflow data.
  • the templates are split between the three memories so that the template memory controller 130 and destination memory controller 122 can operate in parallel and thus prepare actors for firing more quickly than if one memory and one controller were used.
  • the status of the template to which the packet is addressed is accessed from the fire detect memory 132 and a decision is made on whether the template is ready to fire.
  • the status bits are stored in the on-chip fire detect memory 132 to allow fast access and update of template status. If the template is not ready to fire, the arriving token (operand) is stored in the template memory 156.
  • the template memory controller 130 fetches the template opcode and operand stored in the template memory 156, combines them with the incoming operand, which enabled the actor to fire, and sends them to the firing queue 138, from which the arithmetic and logic unit (ALU) micromachine 144 will fetch them. Simultaneously, the destination memory controller 122 begins fetching the destination addresses to which the template results should be sent and stores these addresses in the destination queue 134. Since each result of each template (actor) may need to be sent to multiple destinations, the destination memory 146 includes an overflow storage area to accommodate lists of destinations for each result of each actor. FIG. 5 shows how templates and arrays are mapped into physical memories.
  • the results of the actor execution performed in the ALU micromachine 144 are put into the result queue 142.
  • the results in the result queue 142 and the destinations in the destination queue 134 are combined together into packets by the destination tagger micromachine 136 and sent back to the template memory controller 130 (via the feedback queue 138) or to other PEs (via the to-communication queue 124).
  • the four main functions of a processing element are communication network processing, actor fire detection, actor execution, and result token formation. All four of these functions are performed concurrently in a pipelined fashion.
  • a stand-alone processing element is capable of performing 2 to 4 million operations per second (MOPS), depending on the instruction mix used.
  • an MOP is defined as a primitive actor instruction; these vary in complexity from a simple 16-bit add which is completed in one micro-instruction to some array addressing instructions which take over ten cycles, or a 16-bit divide which takes approximately 25 cycles.
  • Two separate memory interfaces 148,150 and 152,154 allow a large memory-processor bandwidth which is necessary to sustain high performance.
  • the design goal of minimizing chip types and power consumption resulted in a simple design for the ALU: there is no hardware multiply; multiplication is performed by a modified Booth's algorithm technique.
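A radix-4 ("modified") Booth multiplication sketch in Python, illustrating the technique named in the text; the actual Hughes ALU microcode is of course not reproduced here.

```python
def booth_multiply(a: int, b: int, bits: int = 16) -> int:
    """Multiply a by the signed `bits`-bit value b using radix-4
    (modified) Booth recoding -- a sketch of the technique, not the
    actual ALU microcode."""
    b &= (1 << bits) - 1                   # two's-complement image of b

    def bit(i):                            # b[i], with b[-1] = 0
        return (b >> i) & 1 if i >= 0 else 0

    product = 0
    for i in range(0, bits, 2):
        # Booth digit in {-2,-1,0,+1,+2} from the overlapping triple
        # (b[i+1], b[i], b[i-1]); each digit carries weight 2**i.
        digit = bit(i - 1) + bit(i) - 2 * bit(i + 1)
        product += (digit * a) << i        # shift-and-add step
    return product

assert booth_multiply(1234, -567) == 1234 * -567
```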
  • Each of the chips has less than 150 pins, consists of approximately 15K gates, and operates at a clock rate of 20 MHz.
  • the preferred embodiment of the present invention is programmed in the Hughes Data Flow Language (HDFL), which is a high-level functional language.
  • HDFL Hughes Data Flow Language
  • a record of the HDFL program 20 is read into the compiler 30, which translates it into a parallel data flow graph form 40 which, along with a description of the processor configuration 45, is fed into the global allocator 50 for distribution to the multiple processing elements 80.
  • the allocator employs static graph analysis to produce a compile-time assignment of program graph to hardware that attempts to maximize the number of operations which can proceed in parallel while minimizing the inter-PE communication.
  • the HDFL is designed to allow full expression of parallelism. It is an applicative language but includes the use of familiar algebraic notation and programming language conventions.
  • the Hughes Data Flow Language is value oriented, allowing only single-assignment variables. Its features include strong typing, data structures including records and arrays, conditionals (IF THEN ELSE), iteration (FOR), parallel iteration (FORALL), and streams.
  • An HDFL program consists of a program definition and zero or more function definitions. There are no global variables or side effects; values are passed via parameter passing.
  • the example consists of a function "foo" which takes four parameters (one record and three integers), and returns one record and one integer.
  • "Result" is a keyword beginning the body of a function and "endfun" terminates it.
  • the function body consists of a list of arbitrarily complex expressions separated by commas, with one expression per return value.
  • the first expression in the function body is a "record type constructor" which assigns values to the fields of the record result. The conditional below it evaluates to an integer value. Constants and types may be declared before the function header or before the body. Functions may be nested.

HDFL Compiler
  • the compiler translates HDFL into a data flow graph intermediate form composed of primitive data flow actors. Operation proceeds in three phases: 1) syntax checking and parse tree construction, 2) semantics checking and augmentation, and 3) code generation. Each phase is table driven. Following table-driven code generation is a final post-processing stage to eliminate unnecessary code, evaluate constant subgraphs, and perform some optimizations.
  • the graph intermediate form generated by the compiler includes syntactic information and other information which is used by the allocator.
  • the primitive actors are those supported directly by the hardware. Some of the actors are in 16-bit form and others are in 32-bit form. Many are simple arithmetic and Boolean actors such as ADD, others are control actors such as ENABLE and SWITCH, or hybrids like LE5, some are used in function invocation such as FORWARD, and others are used for array and stream handling.
  • FIG. 6 shows some of the primitive actors implemented directly in hardware.
  • the assignment of actors to processing elements can have a large impact on the performance of the multiprocessor. For example, since each PE is a sequential computer, actors that potentially can fire in parallel cannot do so if they are assigned to the same PE. Performance can also be affected by data communication delays in the inter-PE communication network. It takes many more clock cycles to transmit a token from one PE to another than it does to transmit a token from one actor to another in the same PE, which bypasses the communication network completely.
  • the input to the local allocator 17 is a file 13 containing a data flow graph in the form of a sequence of templates. Each template lists the opcode of the operator it represents and the data dependency arcs emanating from it. This file also lists arrays, each of which will be assigned to a single processing element or distributed over many processing elements.
  • a file 14 describing the configuration of the data flow multiprocessor 70 to be allocated onto is also read into the local allocator 17, specifying how many processing elements 80 there are in each dimension of the three-dimensional packet routing network connecting the PEs. For simulation purposes the output of the local allocator 17 consists of two files.
  • the first file specifies the mapping of each actor of the graph to a memory location in one of the processing elements
  • the second file specifies how arrays have been assigned to specific blocks of memory in one or more processing elements.
  • the local allocator 17 begins by topologically sorting the actors of the graph using a variation of breadth-first search (for a description see The Design and Analysis of Computer Algorithms, by Aho et al., published by Addison-Wesley, 1974). In topologically sorted order, the actors that receive the inputs of the graph are first, followed by the actors that receive arcs from the first actors, and so on. (For this purpose we can ignore cycles in the graph by disregarding back arcs to previously seen actors.) The next step is to compute the transitive closure of the data flow graph, which is defined in the discussion of heuristics below.
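A sketch of this breadth-first (Kahn-style) topological sort, with back arcs assumed already removed; function and variable names are ours.

```python
from collections import deque

def topo_sort(actors, succ):
    """actors: iterable of actor ids; succ[a]: actors fed by actor a."""
    indeg = {a: 0 for a in actors}
    for a in actors:
        for b in succ.get(a, ()):
            indeg[b] += 1
    queue = deque(a for a in actors if indeg[a] == 0)  # graph inputs first
    order = []
    while queue:
        a = queue.popleft()
        order.append(a)
        for b in succ.get(a, ()):      # release actors fed only by a
            indeg[b] -= 1
            if indeg[b] == 0:
                queue.append(b)
    return order
```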
  • the local allocator then sequentially processes the sorted list of actors in the graph and assigns each actor to one of the processing elements.
  • the algorithm applies several heuristic cost functions to each of the PEs, takes the weighted sum of the results, and uses the PE with the lowest cost.
  • the communication cost corresponds to the goal of minimizing inter-PE communication traffic
  • the parallel processing cost function corresponds to the goal of maximizing parallelism.
  • the communication cost function takes an actor and a PE and returns an approximate measure of the traffic through the network which would result from assigning the given actor to the given PE. In general, when two actors are connected, the further apart they are allocated, the higher the communication cost.
  • the heuristic function uses a distance function to determine how far apart PEs are in the three-dimensional bussed communication network. For example, if two PEs are on a common bus, then the distance between them is one hop.
  • the distance between a PE and itself is zero hops, because the communications network can be bypassed in transmitting a token. Because the actors are assigned in topologically sorted order, when an actor is about to be allocated, most of the actors from which it receives input tokens have already been allocated. Using the distance function between PEs, the communication cost function determines how far through the communication network each input token would have to travel if the actor were assigned to the given PE. The value of the communication cost function is just the sum of these distances.
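A sketch of the distance and communication-cost heuristics. The hop counts follow the text (zero within a PE, one on a shared bus); counting one hop per differing coordinate beyond that is our assumption for illustration.

```python
def distance(pe1, pe2):
    """Hops between PEs addressed as (plane, column, row) triples:
    zero for the same PE, otherwise one per bus to be traversed."""
    if pe1 == pe2:
        return 0                      # communication network bypassed
    return sum(c1 != c2 for c1, c2 in zip(pe1, pe2))

def communication_cost(actor, pe, placement, preds):
    """Sum of distances from the already-placed predecessors of
    `actor` (preds[actor]) to the candidate PE `pe`."""
    return sum(distance(placement[p], pe)
               for p in preds.get(actor, ()) if p in placement)
```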
  • the processing cost heuristic uses the transitive closure of the data flow graph to detect parallelism.
  • the transitive closure of a directed graph is defined to be the graph with the same set of nodes and an arc from one node to another if and only if there is a directed path from one node to another in the original graph. In the worst case this computation requires time proportional to the cube of the number of nodes (actors).
  • Transitive closure is closely related to parallelism in data flow graphs, because two actors can fire in parallel unless there is a directed path from one to the other in the graph, which would force them to be executed sequentially. Thus, two actors can fire in parallel unless they are directly connected in the transitive closure of the graph.
  • This fact is used in the parallel processing cost heuristic to determine which actors should be assigned to separate PEs in order to maximize the parallelism of the allocated graph. It simply assigns a higher cost when potentially parallel actors (according to the transitive closure) are assigned to the same PE.
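The closure and the parallel-processing cost can be sketched as follows; the Warshall-style pass below exhibits the cubic worst case mentioned above, and the unit cost weighting is an assumed placeholder.

```python
def transitive_closure(actors, succ):
    """Warshall-style closure: reach[a] is the set of actors reachable
    from a by a directed path. Worst case O(n^3), as noted in the text."""
    reach = {a: set(succ.get(a, ())) for a in actors}
    for k in actors:
        for a in actors:
            if k in reach[a]:
                reach[a] |= reach[k]
    return reach

def potentially_parallel(a, b, reach):
    # Two actors can fire in parallel unless a directed path connects
    # them, in either direction, in the original graph.
    return b not in reach[a] and a not in reach[b]

def parallel_cost(actor, pe, placement, reach, weight=1.0):
    """Assumed weighting: one unit of cost per potentially parallel
    actor already placed on the candidate PE."""
    return weight * sum(1 for other, p in placement.items()
                        if p == pe and potentially_parallel(actor, other, reach))
```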
  • the local allocator attempts to allocate actors that access an array close to the array, guided by the array-access cost function.
  • This heuristic function is a generalization of the communication cost. It measures the traffic through the network which would result from assigning a given actor that accesses an array to a given processing element, depending on how far away the array is assigned.
  • the local allocator allocates each array to one or more PEs using similar heuristics. For a small array with a small number of actors that access it, the local allocator will choose to confine the array to a single PE in order to speed up access time. If an array is large and has a large number of actors that can access it in parallel according to the transitive closure, the program will attempt to distribute the array over many PEs. The actors that access the array will also be distributed over these PEs to decrease contention for access to the arrays.
  • the global allocator combines the heuristic approach from the local allocator with a divide-and-conquer strategy, enabling it to operate on large graphs. Like the local allocator, it accepts a data flow graph and information about the dimensions of the processor. It also accepts a hierarchical representation of the syntactic parse tree from the first pass of the compiler 30 to guide the allocator as it partitions the graph into parallel modules. By integrating the compiler and the allocator in this way, the allocator is able to take advantage of the way the high-level programmer chose to partition the writing of the program into functions, subfunctions, and expressions.
  • the divide-and-conquer strategy reduces the problem to two related subproblems: partitioning the input graph into a set of smaller, more tractable modules, and heuristically assigning each module to a set of processing elements.
  • the algorithm proceeds from the top down by partitioning the graph into several modules and assigning each module to some set of the processing elements of the data flow processor. Then, recursively, it further partitions each module into submodules and assigns each of them to a subset of the PEs to which that module was previously assigned. This partition-and-assign process is repeated hierarchically until the individual submodules are small enough to be allocated efficiently, one actor at a time, to individual PEs.
  • the nodes of the parse tree from the compiler correspond to the syntactic elements of the program such as functions, subfunctions, loop-bodies, and so forth.
  • the tree is connected by pointers to the data flow graph so that the actors of the graph become the leaves of the tree.
  • the set of actors below a given node of the tree form the module of the data flow graph that computes the value of the expression corresponding to that node.
  • the root of the tree corresponds to the module consisting of the entire data flow graph program.
  • the children of the node of the tree correspond to the subfunctions and subexpressions of the parent node.
  • the task of partitioning the data flow graph into a set of modules is guided by this syntactic parse tree.
  • the global allocator partitions a module corresponding to an expression into a set of submodules corresponding to the subexpressions of the expression.
  • In terms of the syntactic parse tree, it splits up a node into the children of the node.
  • expressions and functions can generally be computed in parallel because there are no side effects. Therefore these syntactic elements are usually ideal choices in the partitioning of the corresponding data flow graph.
  • modules are usually not completely parallel; there can be some data dependencies between them. For example, if there is an expression in the data flow language program that is assigned to a value name, then there will be a data dependency from the module computing that expression to any other modules that refer to that value name.
  • the global allocator finds such data dependencies between modules by looking for data dependency arcs between individual actors in different modules. These dependencies are then used to construct a graph called the "module graph," the nodes of which correspond to modules of the partitioned data flow graph, and the arcs of which indicate the data dependencies between submodules. It is essentially another data flow graph.
  • the task of assigning the nodes (submodules) of the module graph to sets of PEs is similar to the assignment performed by the local allocator program. A variant of that algorithm is used. First the nodes of the module graph are topologically sorted, then its transitive closure is computed. In this way it is never required to compute the transitive closure of the entire graph at one time, so the inefficiency of the local allocator for large graphs is avoided. In the global allocator the assignment of modules (and individual actors) to PEs is guided by two of the heuristic cost functions defined previously in the section dealing with the local allocator. They have been generalized to apply to modules consisting of many individual actors being assigned to sets of PEs.
  • the distance function between PEs is generalized to an average distance between sets of PEs by averaging the distances between the individual PEs.
  • In the generalized parallel processing cost function, a higher cost is assigned whenever parallel modules (according to the transitive closure of the module graph) are assigned to intersecting sets of PEs, weighted by the number of PEs in the intersection.
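A sketch of these generalized heuristics; the averaging and intersection weighting follow the text, while the exact weights are assumptions for illustration.

```python
from itertools import product

def distance(pe1, pe2):                  # as in the earlier sketch
    return 0 if pe1 == pe2 else sum(c1 != c2 for c1, c2 in zip(pe1, pe2))

def avg_set_distance(pes_a, pes_b):
    """Average pairwise distance between two sets of PEs."""
    pairs = list(product(pes_a, pes_b))
    return sum(distance(p, q) for p, q in pairs) / len(pairs)

def parallel_module_cost(pes_a, pes_b, weight=1.0):
    """Cost for two parallel modules assigned to intersecting PE sets,
    weighted by the number of PEs in the intersection."""
    return weight * len(set(pes_a) & set(pes_b))
```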
  • the two programs which have been simulated most extensively are related to realtime radar signal processing applications. Both programs have been simulated using a variety of allocation algorithms and processing element configurations.
  • the radar3na program has 96 actors, 152 arcs, 17 constants, an average ALU execution time of 7.19 cycles (50 ns cycle time), an average actor fanout (the number of output arcs for an actor) of 1.58 arcs, and a degree of parallelism of 21.14 actor firings per cycle (the average number of actors which can fire in parallel on the instruction-level simulator).
  • the radarb program uses a 16-point fast Fourier transform (FFT) with complex arithmetic. It has 415 actors, 615 arcs, 71 constants, an average ALU execution time of 4.92 cycles, an average actor fanout of 1.56 arcs, and a degree of parallelism of 80.63 actor firings per cycle. Both programs were simulated on 1x1x1, 2x1x1, and larger processing element configurations.
  • Radarb was also simulated on an 8x4x4 configuration. Both of these programs were designed to be continuously processing incoming data. In the simulations eight sets of data were used for each program execution. Each input actor grabbed successive data as soon as it could fire; thus the programs were processing several sets of input data simultaneously. No explicit pipeline stages existed, nor were any acknowledgement tokens used to prevent sets of input data from interfering with each other. Instead, operand queues were used to guarantee safety.
  • FIGS. 8 and 9 illustrate that both radar3na and radarb have significantly better throughput using the nonrandom allocations.
  • the transitive closure algorithm yields about the same maximum throughput as the nontransitive closure algorithm, but uses fewer PEs, because it is more likely than the nontransitive closure algorithm to place two actors into the same PE when they fire sequentially.
  • FIG. 10 shows that the transitive closure and nontransi ⁇ tive closure graphs have similar performance. The portion of the nontransitive closure graph beyond 20 PEs is not of interest because the throughput does not increase when more than 20 PEs are used.
  • FIG. 12 shows how FIGS. 8 through 11 imply that there is a tradeoff between maximizing throughput and efficiently using PEs.
  • with few PEs, the average ALU is very busy, but the program throughput is significantly less than the maximum that may be obtained because not all of the parallelism of the program is being exploited. As more PEs are used, the program throughput increases but the percentage of time that the average ALU is busy decreases.
  • FIGS. 13 and 14 show how the percentage of packet communication that is local (within a PE rather than between PEs) varies with the number of PEs for radar3na and radarb. They show that as the number of PEs increases, less of the packet communication traffic is local. As one might expect, the transitive closure allocation algorithm has more local packet communication than the nontransitive closure algorithm. What is surprising is that for radarb, which has more than four times as many actors as radar3na, the percentage of local packet communication does not decrease very rapidly, and in fact sometimes increases, as more PEs are used.
  • FIG. 15 illustrates the average and maximum length of the result queue for the nontransitive closure allocation of radarb. Not shown in FIG. 15 because of the scale chosen are results of 103 and 158 for the average and maximum queue lengths for one PE, and 42 and 74 for the average and maximum queue lengths for two PEs. Note that the average queue length decreases rapidly beyond a few PEs, and that for eight or more PEs the average queue length is less than one packet. This is characteristic of the other queues in the communications and processor chips, and indicates that the queue lengths may be limited to a few words so long as a queue overflow area is provided or other methods are used to prevent deadlock.
  • FIG. 16 shows how the average communication packet latency varies with the number of PEs. This measure of latency includes the packet delays encountered in the communication chips and in accessing the communication chips. It does not take into account the delays encountered in the template memory controller, firing queue, ALU, result queue, or destination tagger. It measures the latency from the output of the destination tagger (DT) to the input of the template memory controller. It is a good measure of the efficiency of the communication system. Note that for few PEs there is very little communication chip activity, hence the packet latency contributed by the communication chip is low. As shown in FIG. 16 the average communication packet latency peaks at four PEs and decreases rapidly for more PEs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
PCT/US1987/000410 1986-03-31 1987-03-02 Data-flow multiprocessor architecture for efficient signal and data processing WO1987006034A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US84708786A 1986-03-31 1986-03-31
US847,087 1986-03-31

Publications (1)

Publication Number Publication Date
WO1987006034A1 true WO1987006034A1 (en) 1987-10-08

Family

ID=25299732

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1987/000410 WO1987006034A1 (en) 1986-03-31 1987-03-02 Data-flow multiprocessor architecture for efficient signal and data processing

Country Status (4)

Country Link
EP (1) EP0261173A1 (ja)
JP (1) JPS63503099A (ja)
IL (1) IL81756A0 (ja)
WO (1) WO1987006034A1 (ja)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1058578C (zh) * 1992-06-24 2000-11-15 Kabushiki Kaisha Toshiba Visual simulation apparatus
US7325232B2 (en) * 2001-01-25 2008-01-29 Improv Systems, Inc. Compiler for multiple processor and distributed memory architectures

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
IEEE Transactions on Computers, Vol. C-34, No. 12, December 1985 (New York, USA), J.L. GAUDIOT et al., "A Distributed VLSI Architecture for Efficient Signal and Data Processing", pages 1072-1087, see the whole document *
Proceedings of the 1985 International Conference on Parallel Processing, 20-23 August 1985, Washington, USA, (IEEE Computer Society Press, USA), M.L. CAMPBELL, "Static Allocation for a Data Flow Multiprocessor", pages 511-517, see the whole document *
The 12th Annual International Symposium on Computer Architecture, 17-19 June 1985, Boston, Massachusetts, USA, (IEEE Computer Society Press, USA), R. VEDDER et al., "The Hughes Data Flow Multiprocessor: Architecture for Efficient Signal and Data Processing", pages 324-332, see the whole document *
The 5th International Conference on Distributed Computing Systems, 13-17 May 1985, Denver, Colorado, USA, (IEEE Computer Society Press, USA), R. VEDDER et al., "The Hughes Data Flow Multiprocessor", pages 2-9, see the whole document *


Also Published As

Publication number Publication date
EP0261173A1 (en) 1988-03-30
JPS63503099A (ja) 1988-11-10
IL81756A0 (en) 1987-10-20

Similar Documents

Publication Publication Date Title
US5021947A (en) Data-flow multiprocessor architecture with three dimensional multistage interconnection network for efficient signal and data processing
Dally et al. The message-driven processor: A multicomputer processing node with efficient mechanisms
Dongarra et al. Solving linear systems on vector and shared memory computers
Srini An architectural comparison of dataflow systems
Roosta Parallel processing and parallel algorithms: theory and computation
Culler et al. The explicit token store
Lee et al. Issues in dataflow computing
Sterling et al. Gilgamesh: A multithreaded processor-in-memory architecture for petaflops computing
DeMara et al. The SNAP-1 parallel AI prototype
Gaudiot et al. A distributed VLSI architecture for efficient signal and data processing
US5765012A (en) Controller for a SIMD/MIMD array having an instruction sequencer utilizing a canned routine library
Gaudiot et al. The TX 16: A highly programmable multi-microprocessor architecture.
WO1987006034A1 (en) Data-flow multiprocessor architecture for efficient signal and data processing
Topham et al. Context flow: An alternative to conventional pipelined architectures
Yousif Parallel algorithms for asynchronous multiprocessors
Giloi The SUPRENUM supercomputer: Goals, achievements, and lessons learned
Bronnenberg POOL and DOOM a survey of esprit 415 subproject A, Philips research laboratories
Foley A hardware simulator for a multi-ring dataflow machine
Greenberg An investigation into architectures for a parallel packet reduction machine
Abu-Ghazaleh Shared control multiprocessors: a paradigm for supporting control parallelism on SIMD-like architectures
Fresno Bausela et al. HitFlow: A Dataflow Programming Model for Hybrid Distributed-and Shared-Memory Systems
Fischler The ACPMAPS System
Amano et al. (SM)²-II: a large-scale multiprocessor for sparse matrix calculations
Hall et al. Hardware for fast global operations on multicomputers

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): DE FR GB IT

WWE Wipo information: entry into national phase

Ref document number: 1987901955

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1987901955

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1987901955

Country of ref document: EP