WO1991020043A1 - Global registers for a multiprocessor system - Google Patents

Global registers for a multiprocessor system

Info

Publication number
WO1991020043A1
WO1991020043A1 PCT/US1991/004058 US9104058W WO9120043A1 WO 1991020043 A1 WO1991020043 A1 WO 1991020043A1 US 9104058 W US9104058 W US 9104058W WO 9120043 A1 WO9120043 A1 WO 9120043A1
Authority
WO
WIPO (PCT)
Prior art keywords
global
register
registers
data
multiprocessor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US1991/004058
Other languages
English (en)
French (fr)
Inventor
Douglas R. Beard
George A. Spix
Edward C. Miller
Robert E. Strout, III
Anthony R. Schooler
Alexander A. Silbey
Brandon D. Vanderwarn
Jimmie R. Wilson
Richard E. Hessel
Andrew E. Phelps
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Supercomputer Systems LP
Original Assignee
Supercomputer Systems LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Supercomputer Systems LP

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356Indirect interconnection networks
    • G06F15/17368Indirect interconnection networks non hierarchical topologies
    • G06F15/17375One dimensional, e.g. linear array, ring
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/17Interprocessor communication using an input/output type connection, e.g. channel, I/O port
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356Indirect interconnection networks
    • G06F15/17368Indirect interconnection networks non hierarchical topologies
    • G06F15/17381Two dimensional, e.g. mesh, torus
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • G06F15/8076Details on data register access
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • G06F15/8092Array of vector units
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/43Checking; Contextual analysis
    • G06F8/433Dependency analysis; Data or control flow analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/45Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30021Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/30087Synchronisation or serialisation instructions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30094Condition code generation, e.g. Carry, Zero flag
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30101Special purpose registers
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/30149Instruction analysis, e.g. decoding, instruction word fields of variable length instructions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3804Instruction prefetching for branches, e.g. hedging, branch folding
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824Operand accessing
    • G06F9/3834Maintaining memory consistency
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838Dependency mechanisms, e.g. register scoreboarding

Definitions

  • This invention relates generally to the field of registers and interconnection techniques for multiprocessor computer and electronic logic systems. More particularly, the present invention relates to a system of global registers for a multiprocessor system that provides for an efficient and distributed mechanism that is capable of providing an atomic resource allocation mechanism for interconnecting and coordinating the multiprocessors in such a system.
  • Global registers are registers that are generally accessible to all requestors in a multiprocessor system.
  • Dijkstra describes the use of global registers for a semaphore operation to control the operational flow of a multiprocessor system.
  • the use of global registers as part of a semaphore operation is typically limited to minimally parallel supercomputers and hierarchical memory supercomputers.
  • Massively parallel supercomputers, by their very architecture, do not have a use for a set of global registers, as control and coordination of the processors is accomplished via a message passing scheme.
  • Most prior art global register systems utilize some form of hardware dependent interlock mechanism to accomplish the semaphore function.
  • a deadlock interrupt means is used to coordinate requests to the global registers by two high-speed processors. While this type of tightly-coupled, direct-connection method is an efficient means for coordinating two high-speed processors, the hardware deadlock interrupt mechanism described in that patent is most effective when both the number of processors being coupled together and the number of global registers involved are relatively small.
  • the global registers must be capable of supporting many multiple requests to the same global register.
  • the global registers must operate in a distributed environment where there is no central scheduler and where portions of the distributed input/output are also allowed direct access to the global registers without processor intervention.
  • the global registers must be capable of atomic arithmetic operations and atomic resource allocation operations in order to support the software routines for a multithreaded operating system that use shared- variable synchronization and anarchy-based scheduling to allocate work and coordinate access to common data structures used by the operating system.
  • the present invention provides for global registers for a multiprocessor system that will support multiple parallel access paths for simultaneous operations on separate sets of global registers, each set of global registers being referred to as a global register file.
  • An arbitration mechanism associated with the global registers is used for resolving multiple, simultaneous requests to a single global register file.
  • An arithmetic and logical unit (ALU) is also associated with each global register file for allowing atomic arithmetic operations to be performed on the entire register value for any of the global registers in that global register file.
  • the global registers of the present invention are a globally accessible resource that may be accessed from any processor or peripheral controller through an external interface port in the multiprocessor system.
  • the global registers support a variety of synchronization primitives to allow the most efficient choice for synchronization primitive, depending upon the particular synchronization task at hand.
  • One of the more notable synchronization primitives of the present invention is the Fetch and Conditional Add (FCA) instruction.
  • the global registers are implemented as one part of an entire set of common shared hardware resources that are all available to each requestor in a distributed, democratic multiprocessor environment.
  • the global registers are organized as eight global register files within each cluster of the preferred embodiment of the multiprocessor system.
  • the organization of the global registers of the present invention into global register files allows simultaneous access to multiple global register files.
  • Another objective of the present invention is to provide a set of global registers that allow atomic arithmetic operation to be performed on the entire register value for any of the global registers.
  • a further objective of the present invention is to provide a set of global registers that are capable of supporting a Fetch and Conditional Add (FCA) instruction.
  • Fig. 1 is a block diagram of the various interconnections among processors, external interface ports and the global registers in a single cluster of a multiprocessor system in the preferred embodiment of the present invention.
  • Figs. 2a and 2b are a block diagram of a four cluster implementation of the preferred embodiment of a multiprocessor system.
  • Fig. 3 is a block diagram showing the implementation of the global registers as part of the NRCA means of the preferred embodiment of the multiprocessor system.
  • Fig. 4 is a block diagram showing the arbitration logic and cross bar switch mechanisms for the various global register files of the present invention.
  • Fig. 5 is a more detailed block diagram of Fig. 4 showing the data and address pipelines for the global registers.
  • Fig. 6 is a schematic representation of the logical and physical address maps for the global registers.
  • Fig. 7 is a more detailed block diagram of Fig. 4 showing the address and data lines for a single global register file and the arithmetic logical unit associated with that global register file.
  • Fig. 8 is a schematic representation showing the global register addressing.
  • the preferred cluster architecture for a highly parallel scalar/vector multiprocessor system is capable of supporting a plurality of high-speed processors 10 sharing a large set of shared resources 12 (e.g., main memory 14, global registers 16, and interrupt mechanisms 18).
  • the processors 10 are capable of both vector and scalar parallel processing and are connected to the shared resources 12 through an arbitration node means 20.
  • Also connected through the arbitration node means 20 are a plurality of external interface ports 22 and input/output concentrators (IOC) 24 which are further connected to a variety of external data sources 26.
  • the external data sources 26 may include a secondary memory system (SMS) 28 linked to the input/output concentrator 24 via a high speed channel 30.
  • the external data sources 26 may also include a variety of other peripheral devices and interfaces 32 linked to the input/output concentrator 24 via one or more standard channels 34.
  • the peripheral devices and interfaces 32 may include disk storage systems, tape storage systems, printers, external processors, and communication networks.
  • the processors 10, shared resources 12, arbitration node means 20 and external interface ports 22 comprise a single multiprocessor cluster 40 for a highly parallel multiprocessor system in accordance with the preferred embodiment of the present invention.
  • the preferred embodiment of the multiprocessor clusters 40 overcomes the direct-connection interface problems of present shared-memory supercomputers by physically organizing the processors 10, shared resources 12, arbitration node means 20 and external interface ports 22 into one or more clusters 40.
  • Each of the clusters 40a, 40b, 40c and 40d physically has its own set of processors 10a, 10b, 10c and 10d, shared resources 12a, 12b, 12c and 12d, and external interface ports 22a, 22b, 22c and 22d that are associated with that cluster.
  • the clusters 40a, 40b, 40c and 40d are interconnected through a remote cluster adapter 42 that is a logical part of each arbitration node means 20a, 20b, 20c and 20d. Although the clusters 40a, 40b, 40c and 40d are physically separated, the logical organization of the clusters and the physical interconnection through the remote cluster adapter 42 enables the desired symmetrical access to all of the shared resources 12a, 12b, 12c and 12d across all of the clusters 40a, 40b, 40c and 40d.
  • any and all processors 10 and external interface ports 22 may simultaneously access the same or different global registers 16 in any given clock cycle.
  • the global registers 16 are physically and logically organized into global register files. References to global registers within a given global register file are serialized over a number of clock cycles and take place at the rate of one operation every clock cycle. Simultaneous references to registers in separate global register files take place in the same clock cycle.
  • Global register logic resolves any access contention within a global register file by serially granting access to each requestor so that only one operation is performed at a time. References to a single global register within a global register file are processed in the order in which they arrive.
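The serialization rule above can be illustrated with a minimal sketch. It assumes one operation per global register file per clock cycle and ignores pipeline latency; the function name `cycles_to_drain` is hypothetical and not part of the described hardware.

```python
from collections import Counter


def cycles_to_drain(file_ids):
    """Clock cycles needed to service a batch of simultaneous requests.

    file_ids lists the global register file targeted by each request.
    Requests to different files proceed in the same cycle; requests to
    the same file are serialized at one operation per cycle.
    (Assumption: pipeline fill latency is ignored.)
    """
    if not file_ids:
        return 0
    per_file = Counter(file_ids)
    # The busiest file determines how long the whole batch takes.
    return max(per_file.values())
```

Four requests aimed at four different files drain in one cycle under this model, while three requests aimed at the same file take three cycles.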
  • the preferred embodiment provides addressing for a contiguous block of 32,768 global registers located among the four clusters 40. There are 8192 global registers per cluster 40. The global registers are organized within each cluster 40 as eight global register files so that accesses to different global register files can occur simultaneously.
  • the global registers 16 are associated with the logic for the NRCA means 46 in the remote cluster adapter 42. While the physical location of the global registers 16 is preferably in the remote cluster adapter 42 for the preferred multiprocessor system, it will be understood that the location and global registers 16 can be accomplished by a variety of designs, depending upon the architecture and layout of the multiprocessor system that is using them. There are sixteen NRCA ports 47 in the arbitration node means 20 (one per arbitration node 44) that provide an access path to the global registers 16 from the thirty-two processors 10 and thirty-two external interface ports 22 in a cluster 40.
  • Each NRCA port 47 is shared by two processors 10 and two external interface ports 22 and is accessed over the path 52.
  • a similar port 49 services inter-cluster requests for the global registers 16 in the cluster 40 as received by the MRCA means 48 and accessed over the path 56. It will be recognized that access time to global registers 16 will, in general, be slightly faster than to main memory 14 when requests remain within the same cluster 40. Also, there is no interference between in-cluster memory traffic and global register traffic because requests are communicated over different paths.
  • a cross bar/arbitration means 51 and a remote cluster crossbar 53 receive requests from the sixteen arbitration nodes 44 and the MRCA means 48. Access to the NRCA means 46 via paths 52 and 56 is routed through the cross bar/arbitration means 51 to direct the access to and from the appropriate logic in the NRCA means 46 for the global register 16 and the interrupt mechanism 18 comprised of signal logic 31 and fast interrupt logic 33.
  • an arbitration decision requires address information to select the target register and control information to determine the operation to be performed as described in greater detail hereinafter. This information is transmitted to the NRCA means 46 along with the data. The address and control can be for data to be sent to global registers 16 or to signal logic 31 or fast interrupt logic 33.
  • An important feature of the global registers 16 of the present invention is their ability to perform a read-modify-write operation in a single uninterruptable operation. This feature is used to provide atomic resource allocation mechanisms that are used by the operating system and input/output system for creating a multiprocessor system that has integrated support for distributed and multithreaded operations throughout the multiprocessor system. Several versions of such an atomic resource allocation mechanism are supported.
  • the atomic global register operations are as follows:
  • Test And Set (TAS) - Data supplied by the originator of the request is logically ORed with data in the register, and the result is placed in the selected register. Contents of the register prior to modification are returned to the originator of the request.
  • Set (SET) - Data supplied by the originator of the request is logically ORed with data in the register, and the result is placed in the register.
  • Clear (CLR)
  • Fetch And Add (FAA) - Data supplied by the originator of the request is arithmetically added to the value in the register, and the result is placed in the register. Register contents prior to the addition are returned to the originator of the request.
  • Fetch and Conditional Add (FCA)
  • Swap - Data supplied by the originator of the request is written into the selected register. Contents of the register prior to modification are returned to the originator of the request.
  • Read (READ) - Contents of the register are returned to the originator of the request.
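The operations whose behavior is spelled out in the list above can be modeled in software as a rough sketch. The class `GlobalRegisterFile` and its method names are hypothetical; the `threading.Lock` merely stands in for the hardware's uninterruptible read-modify-write, and the 64-bit wraparound on addition is an assumption. Clear and Fetch and Conditional Add are omitted because this excerpt does not describe their behavior.

```python
import threading

MASK64 = (1 << 64) - 1  # assumption: results wrap to the 64-bit register width


class GlobalRegisterFile:
    """Toy software model of one global register file (not the hardware)."""

    def __init__(self, size=1024):
        self._regs = [0] * size
        self._lock = threading.Lock()  # stands in for hardware serialization

    def _rmw(self, index, update):
        # Read-modify-write performed as one uninterruptible step.
        with self._lock:
            old = self._regs[index]
            self._regs[index] = update(old) & MASK64
            return old

    def test_and_set(self, index, data):
        # OR the data in and return the prior contents.
        return self._rmw(index, lambda old: old | data)

    def set_bits(self, index, data):
        # Same OR as test_and_set, but nothing is returned.
        self._rmw(index, lambda old: old | data)

    def fetch_and_add(self, index, data):
        # Add the data and return the prior contents.
        return self._rmw(index, lambda old: old + data)

    def swap(self, index, data):
        # Overwrite the register and return the prior contents.
        return self._rmw(index, lambda old: data)

    def read(self, index):
        return self._rmw(index, lambda old: old)
```

Routing every operation through one `_rmw` helper mirrors the idea that each file has a single ALU through which all modifications pass.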
  • the TAS instruction causes a number of bits to be set in a global register 16. However, before the data is modified, the contents of the global register 16 are sent back to the issuing processor 10. The processor 10 then checks to see if these bits are different from the bits originally sent. If they are different, the processor 10 has acquired the semaphore because only one requestor at a time can change any data in a global register 16. If the bits are the same, the software may loop back to retry the TAS operation.
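The acquire-and-retry loop described above might look like the following sketch, which uses a single lock bit. The `Register` class, the `clear` release operation, and the helper names are hypothetical; a Python lock stands in for the atomicity the hardware provides.

```python
import threading
import time

LOCK_BIT = 0b1  # hypothetical choice of semaphore bit


class Register:
    """Toy single global register; the inner lock models hardware atomicity."""

    def __init__(self):
        self.value = 0
        self._hw = threading.Lock()

    def test_and_set(self, bits):
        with self._hw:
            old = self.value
            self.value = old | bits
            return old  # contents prior to modification

    def clear(self, bits):
        # Hypothetical release operation; its semantics are assumed here.
        with self._hw:
            self.value &= ~bits


def acquire(reg):
    # If the returned bit was already set, another requestor holds the
    # semaphore, so loop back and retry the TAS.
    while reg.test_and_set(LOCK_BIT) & LOCK_BIT:
        time.sleep(0)  # yield so the holder can make progress


def release(reg):
    reg.clear(LOCK_BIT)


def run_demo(workers=4, iterations=200):
    """Several threads increment a shared counter under the TAS semaphore."""
    reg = Register()
    counter = [0]

    def work():
        for _ in range(iterations):
            acquire(reg)
            counter[0] += 1
            release(reg)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

With the semaphore in place, every increment is applied exactly once, so the final count equals `workers * iterations`.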
  • each process determines how many processors 10 it can use for various portions of the code. This value can be placed in its active global register set. Any free processor is, by definition, in the operating system and can search for potential work simply by changing the GMASK and GOFFSET control registers as described in further detail in connection with Fig. 8 and scanning an active process's processor request number.
  • processors when added to a process, decrement the processor request number.
  • the operating system can easily add processors to a process, or pull processors from a process, based on need and usage.
  • the fetch and conditionally add (FCA) instruction ensures that no more processors than necessary are added to a process. This instruction also facilitates the parallel loop handling capabilities of multiple processors.
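A sketch of how a conditional add could cap the number of processors joining a process follows. This excerpt does not define the FCA condition, so the code assumes the addition is applied only when the result would be non-negative and that the prior value is always returned. `RequestCounter`, `try_join`, and `run_demo` are hypothetical names.

```python
import threading


class RequestCounter:
    """Toy model of a global register holding a processor request number."""

    def __init__(self, wanted):
        self.value = wanted
        self._hw = threading.Lock()  # stands in for hardware atomicity

    def fetch_and_conditional_add(self, delta):
        # Assumed semantics: store old + delta only if it is >= 0;
        # the prior contents are returned either way.
        with self._hw:
            old = self.value
            if old + delta >= 0:
                self.value = old + delta
            return old


def try_join(counter):
    """A free processor attempts to attach; True if it was still needed."""
    return counter.fetch_and_conditional_add(-1) > 0


def run_demo(wanted=3, free_processors=8):
    counter = RequestCounter(wanted)
    joined = []

    def processor(pid):
        if try_join(counter):
            joined.append(pid)

    threads = [threading.Thread(target=processor, args=(p,))
               for p in range(free_processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(joined), counter.value
```

Under these assumptions the request number never goes below zero, so exactly `wanted` processors join no matter how many free processors race for the work.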
  • Referring to Fig. 4, the cross bar/arbitration means 51 is described in greater detail. The flow begins with data from one of the arbitration nodes 44 which has been buffered by the NRCA means 46. As each request is received at the NRCA input registers 510 (Fig. 5), decode logic 406 decodes the request to be presented to a global register arbitration network 410.
  • If simultaneous requests come in for multiple global registers 16 in the same global register file 400, these requests are handled in a pipelined manner by the FIFOs 412, pipelines 414 and the global register arbitration network 410.
  • Priority is assigned by a FIFO (first in, first out) scheme supplemented with a multiple request toggling priority scheme.
  • the global register arbitration network 410 uses this type of arbitration logic, or its equivalent, to prioritize simultaneous requests to the same global register file 400.
  • a 17x10 crossbar switch means 430 matches the request in the FIFO 412 with the appropriate global register file 400.
  • a plurality of NRCA input registers 510 (Fig. 5) provide seventeen paths into the global registers input crossbar 430.
  • each global register file 400 has 1024 general purpose, 64-bit registers.
  • Each global register file 400 also contains a separate Arithmetic and Logical Unit (ALU) operation unit 460, permitting eight separate global register operations in a single clock cycle per cluster.
  • the global register files 400 are interleaved eight ways such that referencing consecutive locations accesses a different file with each reference. In this embodiment, the global registers are implemented using a very fast 1024x64-bit RAM.
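The eight-way interleave can be sketched as a simple address split. The description states elsewhere that the three least significant address bits select the global register file; treating the next ten bits as the register index within the 1024-entry file is an assumption here, as is the function name `split_address`.

```python
FILE_BITS = 3          # three low address bits select one of eight files
REGS_PER_FILE = 1024   # each file holds 1024 registers
REGS_PER_CLUSTER = REGS_PER_FILE << FILE_BITS  # 8192 registers per cluster


def split_address(addr):
    """Split a within-cluster register address into (file, index)."""
    if not 0 <= addr < REGS_PER_CLUSTER:
        raise ValueError("address outside the cluster's register range")
    file_id = addr & ((1 << FILE_BITS) - 1)
    index = addr >> FILE_BITS
    return file_id, index
```

Addresses 0 through 7 land in files 0 through 7 at index 0, and address 8 wraps back to file 0 at index 1, so a sequential sweep spreads its references across all eight files.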
  • address and command information travel through a pipeline 520 that is separate from the data pipeline 530.
  • the address and command information is decoded and used to direct data and certain of the address bits to their destination. Because the results of the arbitration decisions are used to direct data to this destination, the data and arbitration results must arrive at the input crossbar 430 in the same clock cycle.
  • Staging registers 560 are added to the data pipeline 530 to adjust the data delay to match the control delay through the address pipeline 520.
  • the arbitration is based on a decode of address bit 13 (the SETN select bit), the three least significant address bits (the global register file select bits), and a four-bit operation code (not shown). If the operation code specifies a signal operation, the address and data information are always sent to the signal logic output port 442. If address bit 13 is set to one, the address, data, and command information are sent to the fast interrupt logic output port 444. Otherwise, the address, control, and data are sent to the global register file output port selected by the three address LSBs using one of the paths 440. The other ten address bits of the logical address (bits 12-3) shown at path 540 in Fig. 5 are not used in the arbitration process.
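The routing decode just described can be expressed as a short sketch. The text gives no opcode values, so `OP_SIGNAL` is a hypothetical encoding, and the destination labels returned by `route` are illustrative names for ports 442, 444, and the eight paths 440.

```python
OP_SIGNAL = 0xF  # hypothetical 4-bit opcode; the text does not give encodings


def route(address, opcode):
    """Pick one of the ten output ports for a request."""
    if opcode == OP_SIGNAL:
        return "signal"                   # signal logic output port 442
    if (address >> 13) & 1:
        return "fast_interrupt"           # fast interrupt output port 444
    return f"file{address & 0b111}"       # one of the eight paths 440
```

The order of the checks matters: a signal operation is routed to the signal logic regardless of address bit 13, matching the word "always" in the description.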
  • the command bits on path 540 are duplicated and carried through the data pipeline as well for use at the destination. Simultaneous requests from different sources for the same global register file 400 (or for the signal logic 31 or the fast interrupt logic 33) are resolved by the arbitration logic 410 by granting one of the requestors access and delaying any other requests to later cycles.
  • the arbitration address pipeline registers 520 hold any requests that cannot be immediately serviced in the Address Pipeline FIFO 570. In any single Data Pipeline FIFO 580, the data are submitted serially. Similarly, requests in the Address Pipeline FIFO 570 are handled serially. For example, data B entered later cannot pass data A entered before it.
  • data A may be waiting for a busy global register
  • data B may be waiting for an available global register
  • data B can not be processed until data A is finished.
  • Data stays in order within a single queue; no data under Address Control can slip ahead of the data order in Data Address Control.
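The data A / data B example above is a case of head-of-line blocking, which the following sketch simulates for one FIFO. The function `drain` and the idea of a per-register "cycles remaining busy" table are hypothetical modeling devices, not structures named in the description.

```python
from collections import deque


def drain(requests, busy, max_cycles=20):
    """Simulate one FIFO; return [(cycle, name)] in grant order.

    requests: iterable of (name, register) in arrival order.
    busy: dict mapping register -> cycles it remains busy (assumed model).
    Only the head of the queue is ever considered, so a later request
    cannot pass an earlier one even when its own target is free.
    """
    queue = deque(requests)
    busy = dict(busy)  # work on a copy so the caller's table is untouched
    grants = []
    for cycle in range(max_cycles):
        if not queue:
            break
        name, reg = queue[0]
        if busy.get(reg, 0) == 0:
            queue.popleft()
            grants.append((cycle, name))
        # Advance time: every busy register gets one cycle closer to free.
        for r in busy:
            if busy[r] > 0:
                busy[r] -= 1
    return grants
```

With A aimed at a register that is busy for two more cycles and B aimed at a free register, B is still granted only after A, one cycle later.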
  • Ten arbitrations can be handled simultaneously by the arbitration logic 410. If data cannot go, signals 512 and 514 are sent to FIFOs 570 and 580, respectively, instructing them to hold the request at their respective outputs. The FIFOs 570 and 580 then wait for their arbitration decision. Signals (not shown) are sent back to each requestor from the arbitration logic 410 indicating that a request has been removed from the FIFOs 570 and 580. The source uses this signal to determine when the FIFOs 570 and 580 are full. The source stops sending requests when the FIFOs 570 and 580 are full so that no requests are lost. Once an arbitration decision is made, a multiplex select signal 590 is generated that steers the input cross bar 460. This automatically unloads the FIFOs 570 and 580 and sends data to the global register files 400 or the signal logic 31 or the fast interrupt logic 33.
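The way a source can infer FIFO fullness from the "request removed" signal resembles credit-based flow control, sketched below. The FIFO depth is not given in this excerpt, so `fifo_depth` is a free parameter, and the class and method names are hypothetical.

```python
from collections import deque


class RequestSource:
    """Toy model of a requestor that tracks free slots in its FIFO."""

    def __init__(self, fifo_depth):
        self.credits = fifo_depth  # free FIFO slots known to the source
        self.fifo = deque()        # models the address/data FIFO pair

    def try_send(self, request):
        # The source stops sending when it believes the FIFO is full,
        # so no request is ever dropped.
        if self.credits == 0:
            return False
        self.credits -= 1
        self.fifo.append(request)
        return True

    def on_removed(self):
        # Arbitration granted the head request and signaled the source:
        # one slot is free again.
        self.credits += 1
        return self.fifo.popleft()
```

The source never needs to inspect the FIFO itself; counting sends against removal signals is enough to know when it must stall.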
  • the input crossbar 460 is implemented as ten 17:1 multiplexors. There is one multiplexor for each of the eight output paths 440, and one each for output paths 442 and 444. The multiplexors are controlled by multiplex select signals 590 from the arbitration logic.
  • the arbitration logic 410 also sends a signal to alert the NRCA means 46 (Fig. 3) that data will be returning to the source via the functional unit output path 450 (Fig. 4). Once the request is granted access, data will return to the NRCA means 46 in a fixed number of cycles.
  • the NRCA logic relies on this fixed interval to determine when to receive the data from the global registers 16 and return it to the processor 10. Data is returned through a 9:17 Global Registers Output Crossbar 422 (signal logic does not return data).
  • the register address information associated with a requested operation enters through the address pipe 609. Addresses pass through the pipe in two steps. Each step requires a single clock cycle.
  • the two steps in the address pipe 609 are as follows: 1. Load data from the arbitration input crossbar 460 into the address pipe input register 630.
  • the address information is used to fetch data from the register file 623.
  • the fetched data is modified by combining it with data from the global register pipe 610 in the ALU 460.
  • the modified data is then written back into the register file 623. If the specified operation requires that data be returned to the requestor, the data first fetched from the file 623 is delivered to the NRCA logic via the functional unit output register 631.
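The fetch, modify, and write-back sequence above behaves like a fetch-and-op primitive: the register is updated, and the value returned to the requestor is the one read before the update. A minimal sketch follows, assuming a 64-bit register width and a Python list standing in for the register file 623; neither assumption comes from the text.

```python
import operator


def read_modify_write(register_file, address, operand, op=operator.add, width=64):
    """Update one register and return the value it held before the update."""
    old = register_file[address]                 # fetch from the register file
    new = op(old, operand) & ((1 << width) - 1)  # modify, wrapping to the assumed width
    register_file[address] = new                 # write the result back
    return old                                   # the originally fetched data is returned
```

For example, with `regs[2] == 10`, calling `read_modify_write(regs, 2, 5)` returns 10 and leaves 15 in the register.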
  • the ALU 460 consists of a primary adder 602, a wrap adder 603, and a logical unit 604. These three elements can take two operands from three sources.
  • the primary adder takes one operand from either the file output latch 619 or the ALU output latch 621 via latch 620 and the second operand from the data pipeline output register 626.
  • the wrap adder takes one operand from the ALU output latch 621 via latch 620 and the other from the data pipeline output register 626.
  • the logical unit takes one operand from either the file output latch 619 or the ALU output latch 621 via latch 620 and the second operand from the data pipeline output register 626.
  • the address delay unit 632 delays the read address used in step 1 by four cycles so that it will be available to use when the modified data is written back to the file 623 in step 5.
  • the address delay unit 632 is loaded from the register file read address register 624 at the end of step 1.
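The role of the address delay unit can be pictured as a fixed-length delay line that is advanced once per clock. The sketch below is an assumption-laden illustration, not the circuit itself: the `DelayLine` class and its `tick` method are hypothetical names, and `None` stands for "no valid address".

```python
from collections import deque


class DelayLine:
    """Returns each value a fixed number of ticks after it was pushed."""

    def __init__(self, depth: int = 4):
        # Start with empty slots so nothing valid emerges for the first `depth` ticks.
        self.slots = deque([None] * depth)

    def tick(self, value):
        """Advance one clock: push the new value and pop the oldest one."""
        oldest = self.slots.popleft()
        self.slots.append(value)
        return oldest
```

With the default depth of four, an address pushed on the first tick emerges on the fifth, which lines up with a read address captured in step 1 being reused for the write-back in step 5.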
  • step 4 data is taken from the ALU input latch 620 in step 3 but is taken from the ALU output latch 621 in step 4. Selection is made by the logical unit input mux 625. If an adder is required in the second operation, the wrap adder 603 is used in step 4 because it takes data from the ALU output latch 621. This method ensures that data resulting from the first operation is used in the second operation.
  • Each step requires a single clock cycle: 1. Read the register file 623, and load data into the file output latch 619. 2. Move data into the primary exit register 633 via path 613. 3. Compute and append ECC syndrome bits, and load data and syndrome bits into the primary ECC output register 634.
  • the normal fetch sequence is used.
  • the embodiment ensures that any global register operation is completed before another request can be initiated on the same register, giving the appearance that the operation has completed in a single cycle even though multiple cycles are actually required.
  • the pipelined organization allows a new operation to be initiated in the functional unit every cycle, regardless of prior activity. This pipelining, in combination with multiple, parallel paths to multiple functional units, results in the best possible throughput, and hence, the most efficient means for supporting synchronization variables among multiple parallel processes.
  • the logical address map 710 is used by the processor 10.
  • the physical address map 720 is used by the IOC 24.
  • Fig. 8 illustrates the global register calculation in the processor 10.
  • the present invention uses a relative addressing scheme for the global registers 16 to eliminate the need for explicit coding of global register addresses in the user's program.
  • Global register address calculations are based on the contents of three processor control registers: GOFFSET 810, GMASK 820 and GBASE 830. Setting GMASK 820 to all ones permits the user to access all of the available global registers 16.
  • GOFFSET 810 and GMASK 820 are protected registers that can be written only by the operating system. Together they define a segment of the collection of global registers 16 that the processor 10 or IOC 24 can address.
  • The three least-significant bits of GOFFSET 810 are assumed to be zero when the address calculation is performed, and the three least-significant bits of GMASK 820 are assumed to be ones.
  • GBASE 830 is a user-accessible 15-bit register. The value contained in the instruction j field 850 is added to GBASE 830 to form the user address. The j field 850 is considered to be unsigned, and any carry out is ignored. The sum of GBASE 830 and the instruction j field 850 is logically ANDed with the contents of GMASK 820, placing a limit on the maximum displacement into the register set that the user can address. The result of the mask operation is added to the contents of GOFFSET 810. Any carry out is ignored.
  • GOFFSET 810 is a 16-bit register. The 16th bit is used to select the SETN registers associated with the fast interrupt logic 33 and must be zero when accessing the global registers 16.
  • the address generated by this method allows access to the set of global registers 16 that the operating system assigns any particular processor. All processors 10 could be assigned to one particular set or to different sets of global registers 16, depending on the application and availability of processors. It will be understood that logic in the processor means 10 rearranges the logical address 710 into the physical address 720 used at the NRCA means 46, as shown in the mapping in Fig. 7. It should be noted that address values which specify a binary one in bit position 13 of 720 will address the SETN registers, rather than the global registers 16.
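The relative address calculation, GOFFSET + (GMASK & (GBASE + j)), can be worked through in code. This is a sketch under stated assumptions: the function name is hypothetical, the sum of GBASE and j is wrapped to 15 bits because GBASE is a 15-bit register, and the final sum is wrapped to 16 bits because GOFFSET is a 16-bit register. The text says only that carries are ignored, so those wrap widths are an interpretation. The rearrangement of the logical address into the physical address of Fig. 7 is not modeled.

```python
def global_register_address(goffset: int, gmask: int, gbase: int, j: int) -> int:
    """Compute the logical address GOFFSET + (GMASK & (GBASE + j))."""
    # User address: GBASE plus the unsigned j field, carry out ignored (15 bits assumed).
    user_address = (gbase + j) & 0x7FFF
    # The three least-significant bits of GMASK are treated as ones.
    effective_mask = gmask | 0b111
    # The three least-significant bits of GOFFSET are treated as zeros.
    effective_offset = goffset & 0xFFFF & ~0b111
    # Mask limits the displacement; the offset relocates it. Carry out ignored (16 bits assumed).
    return (effective_offset + (user_address & effective_mask)) & 0xFFFF
```

For instance, with GOFFSET = 0x100, GMASK = 0x3F, GBASE = 0x10, and j = 5, the user address is 0x15, the mask leaves it unchanged, and the result is 0x115. With GMASK = 0 the effective mask is 0b111, so the same inputs give 0x105: the user is confined to a block of eight registers.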
  • the IOC 24 can also perform global register operations.
  • the operating system reserves for itself any number of global register sets that will be used for parameter passing, interrupt handling, synchronization and input/output control.
  • the peripherals 32 attached to the various IOCs 24 contain part of the operating system software and are able to access all of the global registers 16 in all clusters 40.
  • n is an unsigned 8-bit number.
  • q is a signed 8-bit literal and n is an unsigned 8-bit number.
  • n is an unsigned 8-bit number.
  • q is a signed 8-bit literal and n is an unsigned 8-bit number.
  • n is an unsigned 8-bit number.
  • n is an unsigned 8-bit number.
  • Time to completion: Another instruction may issue which reads or modifies the same global register: One cycle.
  • q is a signed 8-bit literal and where n is an unsigned 8-bit number.
  • Time to completion: Another instruction may issue which reads or modifies the same global register: One cycle. Exceptions: None.
  • n is an unsigned 8-bit number.
  • q is a signed 8-bit literal and where n is an unsigned 8-bit number. Hold issue conditions: Si reserved. Scalar memory read or write port unavailable.
  • n is an unsigned 8-bit number.
  • q is a signed 8-bit literal and where n is an unsigned 8-bit number.
  • the global register addressed is register GOFFSET+(GMASK & (GBASE+j)). Move contents of sk into the addressed global register.
  • n is an unsigned 8-bit number.

PCT/US1991/004058 1990-06-11 1991-06-10 Global registers for a multiprocessor system Ceased WO1991020043A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US07/536,198 US5165038A (en) 1989-12-29 1990-06-11 Global registers for a multiprocessor system
US536,198 1990-06-11

Publications (1)

Publication Number Publication Date
WO1991020043A1 (en) 1991-12-26

Family

ID=24137555

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1991/004058 Ceased WO1991020043A1 (en) 1990-06-11 1991-06-10 Global registers for a multiprocessor system

Country Status (5)

Country Link
US (1) US5165038A (en)
JP (1) JPH05508495A (en)
AU (1) AU8424491A (en)
TW (1) TW197502B (en)
WO (1) WO1991020043A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5434970A (en) * 1991-02-14 1995-07-18 Cray Research, Inc. System for distributed multiprocessor communication
US5287461A (en) * 1991-10-31 1994-02-15 Sun Microsystems, Inc. Method and apparatus for remotely accessing a plurality of server consoles
JPH05282143A (ja) * 1992-03-30 1993-10-29 Nec Ibaraki Ltd 主記憶アクセス制御回路
US5428803A (en) * 1992-07-10 1995-06-27 Cray Research, Inc. Method and apparatus for a unified parallel processing architecture
US5435001A (en) * 1993-07-06 1995-07-18 Tandem Computers Incorporated Method of state determination in lock-stepped processors
US6073231A (en) * 1993-10-18 2000-06-06 Via-Cyrix, Inc. Pipelined processor with microcontrol of register translation hardware
SG75756A1 (en) * 1994-02-28 2000-10-24 Intel Corp Method and apparatus for avoiding writeback conflicts between execution units sharing a common writeback path
JP3169779B2 (ja) * 1994-12-19 2001-05-28 日本電気株式会社 マルチスレッドプロセッサ
US6298479B1 (en) * 1998-05-29 2001-10-02 Sun Microsystems, Inc. Method and system for compiling and linking source files
US6351848B1 (en) * 1998-05-29 2002-02-26 Sun Microsystems, Inc. Unitary data structure systems, methods, and computer program products, for global conflict determination
US20040030873A1 (en) * 1998-10-22 2004-02-12 Kyoung Park Single chip multiprocessing microprocessor having synchronization register file
JP3721283B2 (ja) * 1999-06-03 2005-11-30 株式会社日立製作所 主記憶共有型マルチプロセッサシステム
US20040128475A1 (en) * 2002-12-31 2004-07-01 Gad Sheaffer Widely accessible processor register file and method for use
US20040268143A1 (en) * 2003-06-30 2004-12-30 Poisner David I. Trusted input for mobile platform transactions
US7246218B2 (en) * 2004-11-01 2007-07-17 Via Technologies, Inc. Systems for increasing register addressing space in instruction-width limited processors
KR100806274B1 (ko) * 2005-12-06 2008-02-22 한국전자통신연구원 멀티 쓰레디드 프로세서 기반의 병렬 시스템을 위한 적응형실행 방법
KR100663709B1 (ko) * 2005-12-28 2007-01-03 삼성전자주식회사 재구성 아키텍처에서의 예외 처리 방법 및 장치
US10423415B2 (en) 2017-04-01 2019-09-24 Intel Corporation Hierarchical general register file (GRF) for execution block
US20240370242A1 (en) * 2023-05-01 2024-11-07 Mellanox Technologies, Ltd. Register allocation optimization using per-register bin packing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4240143A (en) * 1978-12-22 1980-12-16 Burroughs Corporation Hierarchical multi-processor network for memory sharing
US4523273A (en) * 1982-12-23 1985-06-11 Purdue Research Foundation Extra stage cube
US4814980A (en) * 1986-04-01 1989-03-21 California Institute Of Technology Concurrent hypercube system with improved message passing
US4924380A (en) * 1988-06-20 1990-05-08 Modular Computer Systems, Inc. (Florida Corporation) Dual rotating priority arbitration method for a multiprocessor memory bus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3308436A (en) * 1963-08-05 1967-03-07 Westinghouse Electric Corp Parallel computer system control
US3970993A (en) * 1974-01-02 1976-07-20 Hughes Aircraft Company Cooperative-word linear array parallel processor
US4015243A (en) * 1975-06-02 1977-03-29 Kurpanek Horst G Multi-processing computer system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0665503A3 (en) * 1994-01-28 1996-01-17 Nec Corp High speed synchronization communication control mechanism for a multiprocessor system.
EP0721164A3 (en) * 1995-01-03 1998-07-29 International Business Machines Corporation Crossbar switch apparatus and protocol
EP0742520A3 (en) * 1995-05-08 1999-09-01 Nec Corporation Information processing system for performing mutual control of input/output devices among a plurality of clusters
EP0762293A3 (en) * 1995-08-29 1997-07-02 Nec Corp Control device for controlling a connection between an arithmetic processor and a central memory
US5761730A (en) * 1995-08-29 1998-06-02 Nec Corporation Control device for controlling a connection between an arithmetic processor and a main memory unit

Also Published As

Publication number Publication date
JPH05508495A (ja) 1993-11-25
US5165038A (en) 1992-11-17
AU8424491A (en) 1992-01-07
TW197502B (en) 1993-01-01

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU CA JP KR

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE

NENP Non-entry into the national phase

Ref country code: CA

122 Ep: pct application non-entry in european phase