CA2050828A1 - Register-cache architecture and super-actor machine - Google Patents

Register-cache architecture and super-actor machine

Info

Publication number
CA2050828A1
Authority
CA
Canada
Prior art keywords
super
actor
cache
actors
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA 2050828
Other languages
French (fr)
Inventor
Herbert H.J. Hum
Guang R. Gao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
McGill University
Centre de Recherche Informatique de Montreal CRIM
Original Assignee
Herbert H.J. Hum
Guang R. Gao
McGill University
Centre de Recherche Informatique de Montreal / Computer Research Institute of Montreal
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Herbert H.J. Hum, Guang R. Gao, McGill University, and Centre de Recherche Informatique de Montreal / Computer Research Institute of Montreal
Publication of CA2050828A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements

Abstract

ABSTRACT

A multi-threaded architecture having high speed memories used with a multiprocessor. The memories utilized in the processor are organized as both a register file and a cache.
A program is compiled into a number of instruction threads called super-actors. Each super-actor has a dormant, an enabled, a ready and an active state. A super-actor becomes ready for execution only when its input data physically reside in the register-cache and space is reserved there to store its results. The invention operates utilizing either sequential super-actors, which require that their instructions be executed sequentially, or parallel super-actors, whose instructions are data-independent, none of them depending on any result produced by any other instruction within the super-actor.
The register file and cache is addressable based upon the content of its memory locations when viewed from a main memory, as well as being addressable by its address tags when viewed by a super-actor execution unit.

Description


BACKGROUND OF THE INVENTION
Currently, microelectronic technology has advanced to the stage in which more than one million transistors can be placed on a single microprocessor chip. With this large number of transistors available in a relatively small area, computer architects face the growing challenge of ultra-large scale integration (ULSI), which may one day boast fifty to one hundred million transistors on a chip by the beginning of the next century.
These computer architects may utilize this enormous hardware parallelism to significantly increase the architectural support for fine-grain parallelism. For example, super-scalar machines which can issue multiple instructions per cycle, super-pipelined machines which utilize deep instruction pipelining, such as the CDC-7600, or both may be employed. Another trend is the integration of floating point units on chip in the new generation of reduced instruction set computer (RISC) microprocessors.
Conventional processor architectures have inherent limitations in fully exploiting instruction level concurrency.
This is primarily due to the fact that a processor, equipped only with the mechanism of executing a totally ordered instruction stream, lacks the capacity to tolerate long and unpredictable memory and communication latencies, which are unavoidable in a von Neumann style multiprocessing system. An alternate approach is to utilize a multi-threaded architecture in which multiple instruction threads are provided at the processor architecture level. These multi-threaded architectures have the potential to keep the processor pipelines busy by tolerating memory latencies.
Conventional RISC architectures endeavor to reduce memory latencies by providing explicit programmable registers and implicit high speed caches. However, a small number of programmable registers alone can only provide a partial solution, since register allocation for subscript variables of array data is difficult, and increasing the number of programmable registers would increase the context, i.e. the actual number of registers utilized by a particular thread. An increased context has a severe negative effect on fine-grain processing.

Similarly, conventional caches also have various limitations. For example, the effectiveness of caches for scientific applications, where large arrays or vectors of data are accessed in the computation, is not satisfactory. Additionally, when a cache miss occurs, the instruction pipeline usually stalls or freezes, thereby causing considerable performance degradation. Also, the fact that a conventional cache is transparent to the programmers or compilers makes performance improvements by optimizing compilers quite difficult. Finally, the conventional cache memory is not designed to accommodate multi-threaded architectures.
Various prior art computer architectures have been developed and patented relating to systems which could be characterized as employing a dynamic resource allocation algorithm. These prior art architectures are described in U.S. Patents 4,384,324, issued to Kim et al.; 4,459,659, 4,467,410 and 4,649,472, all issued to Kim; 4,733,347, issued to Fukuoka; and 4,922,413, issued to Stoughton et al.
The Kim and Kim et al. patents are directed to a data processing system which can execute a high level instruction with microinstructions, which, in turn, are executed in a multi-processing manner. A program controller is responsible for fetching high-level instructions, decoding these instructions and passing one or more microinstruction tasks representing a high-level instruction to a task controller. It is this task controller which is responsible for register allocation, task initiation and termination, task selection, etc. The task controller allows the overlapping of microinstruction tasks in a pipeline execution unit. A complex scheme of managing a register file utilizing a register allocation list is outlined, wherein one or more registers are assigned to an initiated task for its execution, and are de-allocated when the task terminates.
Tasks are therefore only initiated when there are enough registers. Registers are allocated as temporary registers, input registers and output registers. Both temporary registers and input registers are de-allocated near or at the termination of a task. It should be noted that the Kim patents deal with issues arising at the microinstruction level and not the instruction level, and that these patents are not aimed at the local memory latency problem of multi-threaded architectures, but at the dynamic management of limited resources. Additionally, a simple and small register file with a complex register allocation mechanism is employed.
The patent to Stoughton et al. illustrates a scheme which is based upon a data-driven architectural model, wherein the execution of primitive operations is controlled by the availability of data. This patent, as well as the patent to Fukuoka, is also directed to an architecture utilizing the data-driven principle and does not address the local memory latency problem of multi-threaded architectures.

BRIEF SUMMARY OF THE INVENTION
These and other problems in the prior art are overcome utilizing the present invention, which is directed to providing a super-actor machine (SAM) which utilizes a multi-threaded architecture based upon a hybrid data flow and von Neumann evaluation model. A number of instructions can be issued simultaneously in the SAM so that effective overlapping of floating point ALU operations with other operations can result in higher floating point performance. The problem of variable and sometimes high memory latency is overcome using a plurality of high-speed memories known as register-caches. The register-cache is organized both as a register file and a cache. Viewed from an execution unit, the contents of the register-cache are addressable similarly to ordinary CPU registers, employing relatively short addresses. From a main memory perspective, the register-cache is content addressable, since its contents are tagged just as in conventional caches. Additionally, register allocation for the register-cache is performed adaptively at runtime.
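By way of illustration, this dual addressing scheme can be modeled in a few lines of C. The sketch below is not the patented hardware; the structure, sizes and function names are illustrative assumptions. Each line carries a memory-block tag for the content-addressable (cache) view used by main memory, while the execution unit indexes lines directly by short line numbers, as with a register file.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_LINES      16  /* hypothetical register-cache capacity */
#define WORDS_PER_LINE  4  /* hypothetical line width              */

typedef struct {
    bool     valid;
    uint32_t tag;                   /* memory-block address: cache view */
    uint32_t data[WORDS_PER_LINE];
} RCLine;

typedef struct { RCLine line[NUM_LINES]; } RegisterCache;

/* Main-memory view: fully associative lookup by tag. */
int rc_lookup_by_tag(const RegisterCache *rc, uint32_t tag)
{
    for (int i = 0; i < NUM_LINES; i++)
        if (rc->line[i].valid && rc->line[i].tag == tag)
            return i;               /* the line number becomes a "register" */
    return -1;                      /* miss: loader must read the block in  */
}

/* Execution-unit view: direct access by short line number, like a register. */
uint32_t rc_read(const RegisterCache *rc, int line_no, int word)
{
    return rc->line[line_no].data[word];
}

int main(void)
{
    RegisterCache rc = {0};
    rc.line[3].valid   = true;
    rc.line[3].tag     = 0x1000;    /* block at main-memory address 0x1000 */
    rc.line[3].data[0] = 42;

    int n = rc_lookup_by_tag(&rc, 0x1000);            /* cache-style probe */
    printf("block 0x1000 registered at line %d\n", n);
    printf("SEU reads line %d, word 0: %u\n", n, (unsigned)rc_read(&rc, n, 0));
    return 0;
}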
This architecture is utilized with a program compiled into a number of instruction threads designated as super-actors.
A super-actor becomes ready for execution only when both data dependence and space locality are satisfied. Therefore, the super-actor becomes ready only if all its input data have been generated, its result data from the previous activation, if any, have been used, the input data physically reside in the register-cache, and space is reserved there to store any result.
While this first condition of data dependence is similar to the so-called firing rule in traditional data flow machines, each scheduling quantum in the SAM is an instruction thread. The space locality requirement ensures that an enabled super-actor can be scheduled for execution only when all memory accesses of its instructions are guaranteed to be in the high-speed buffer memory, i.e. the register-cache. Therefore, the execution unit will experience a 100% hit ratio when accessing the register-cache, thereby eliminating one main source of pipeline performance degradation due to cache misses. As long as there are enough enabled instruction threads, the execution unit can be kept usefully busy. Finally, architectural support for overlapping the execution of super-actors with main memory operations is provided so that the available concurrency in the underlying machine can be better utilized.
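As a minimal sketch of the two-part readiness test (the field names are hypothetical assumptions, not taken from the patent):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical scheduling state of one super-actor. */
typedef struct {
    int  signals_pending;   /* enable signals not yet received from neighbors */
    bool result_consumed;   /* results of the previous activation used up     */
    bool inputs_in_rcache;  /* all operand lines resident in register-cache   */
    bool result_reserved;   /* result lines reserved in the register-cache    */
} SuperActor;

/* Condition 1: data dependence, the dataflow "firing rule". */
static bool is_enabled(const SuperActor *sa)
{
    return sa->signals_pending == 0 && sa->result_consumed;
}

/* Condition 2: space locality; with it, every access is a guaranteed hit. */
static bool is_ready(const SuperActor *sa)
{
    return is_enabled(sa) && sa->inputs_in_rcache && sa->result_reserved;
}

int main(void)
{
    SuperActor sa = { .signals_pending = 0, .result_consumed = true };
    printf("enabled=%d ready=%d\n", is_enabled(&sa), is_ready(&sa));
    sa.inputs_in_rcache = sa.result_reserved = true;   /* check-in completed */
    printf("enabled=%d ready=%d\n", is_enabled(&sa), is_ready(&sa));
    return 0;
}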

BRIEF DESCRIPTION OF THE DRAWINGS
This and other features of the present invention will be explained in more detail with respect to the accompanying drawings, in which:
Fig. 1 is a block diagram showing the various states of a super-actor;
Fig. 2 is a block diagram illustrating the processing element of a super-actor machine;
Fig. 3 is a block diagram illustrating the super-actor execution unit;
Fig. 4 is a block diagram illustrating the architecture of the register-cache;
Fig. 5 is a diagram illustrating the registering process;
Fig. 6 is a block diagram describing the actor preparation unit (APU) of the present invention;

Fig. 7 is a flow diagram illustrating the check-in process; and
Figs. 8a, 8b and 8c represent a flow diagram illustrating the operation of the data register-cache.

DETAILED DESCRIPTION OF THE PRESENT INVENTION
Multi-threaded architectures consist of individual instructions, called actors in data flow terminology, which are logically grouped into threads so that the cost of synchronization can be reduced by performing synchronizations only among the threads. The actors within a thread can be scheduled via the conventional technique of sequencing with a program counter, similar to von Neumann computing. Grouping instructions into threads and sequentially executing the instructions within the threads, while performing data flow-like fine-grain synchronizations at the thread level, is referred to as a hybrid data flow/von Neumann model of computation. The aggregation of one or more of these actors is designated as a super-actor.
As depicted in Fig. 1, a super-actor is provided with four states and goes through the transitions shown therein. This figure also lists the various attributes of an actor and a super-actor. A super-actor is in its dormant state while it is waiting for its neighboring actors to signal that it has been enabled.
For an enabled super-actor to make a transition into the ready state, all of the memory blocks containing its input values must be in a fast memory, designated as the register-cache, and a block in this memory must be reserved for its resultant values.
A ready super-actor enters the active state when it is assigned an available physical domain in which to be executed. Instructions in an active super-actor can be scheduled for execution as will be explained further. After execution, the active super-actor will signal its completion to all of the actors requiring notification that it has been executed, and re-enter its dormant state. It is important to note that once a super-actor enters the active state, all of its instructions will be executed atomically until its completion, without the possibility of suspension. All of the instructions in the super-actor perform operations entirely local to its execution unit, thereby causing no external transactions or synchronization requirements with other super-actors during its execution.
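The four-state lifecycle of Fig. 1 amounts to a small finite state machine. The following C sketch models it under assumed event names; the triggers are paraphrased from the description above, not defined by the patent:

#include <stdio.h>

/* The four states of Fig. 1. */
typedef enum { DORMANT, ENABLED, READY, ACTIVE } SAState;

/* Assumed event names paraphrasing the transition triggers. */
typedef enum {
    EV_ALL_SIGNALS_ARRIVED,  /* neighboring actors signal enablement     */
    EV_CHECKED_IN,           /* operands resident, result space reserved */
    EV_CONTEXT_ASSIGNED,     /* a free physical domain is granted        */
    EV_EXECUTION_DONE        /* atomic execution ran to completion       */
} SAEvent;

SAState sa_step(SAState s, SAEvent e)
{
    switch (s) {
    case DORMANT: return e == EV_ALL_SIGNALS_ARRIVED ? ENABLED : s;
    case ENABLED: return e == EV_CHECKED_IN          ? READY   : s;
    case READY:   return e == EV_CONTEXT_ASSIGNED    ? ACTIVE  : s;
    case ACTIVE:  return e == EV_EXECUTION_DONE      ? DORMANT : s;
    }
    return s;
}

int main(void)
{
    const char *name[] = { "dormant", "enabled", "ready", "active" };
    SAEvent trace[] = { EV_ALL_SIGNALS_ARRIVED, EV_CHECKED_IN,
                        EV_CONTEXT_ASSIGNED, EV_EXECUTION_DONE };
    SAState s = DORMANT;
    for (int i = 0; i < 4; i++) {
        s = sa_step(s, trace[i]);
        printf("after event %d: %s\n", i, name[s]);
    }
    return 0;   /* a full activation ends back in the dormant state */
}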
Since super-actors are processed atomically, scheduling them based on the data-driven principle will ensure that the data dependencies among the super-actors are satisfied. Thus, the determinacy of the data flow computation model is retained, where a node in the super-actor machine model is an instruction thread. Additionally, an enabled super-actor is scheduled for execution only when all memory accesses of its instructions are guaranteed to be in the high-speed buffer memory designated as the register-cache. Thus, the super-actors decrease the synchronization cost and at the same time offer the opportunity to exploit locality of reference, so as to minimize the latencies of memory accesses in the execution system.
The present invention can operate utilizing two types of super-actors, designated as sequential super-actors and parallel super-actors. The sequential super-actor requires that its instructions be executed sequentially because of the data dependencies between them. A sequential super-actor may contain conditional branch instructions which jump to another instruction within the same super-actor, called short branches. Conditional instructions which fork multiple super-actors or alter the stream of evaluation of super-actors are restricted to being tail-instructions, since the scheduling of super-actors is performed in a unit separate from the execution unit. The parallel super-actor operates in a situation in which the instructions are data-independent, whereby they do not depend on any results produced by any other instruction within the super-actor. The instructions in a parallel super-actor can therefore be executed in parallel (e.g., issued every pipe beat).
Due to their very nature, instructions with long and unpredictable latencies should be excluded from ordinary super-actors. These instructions include non-local memory access operations, explicit "send" and "receive" instructions which perform inter-PE communication, etc. A long latency instruction is grouped by itself, and the actor containing it is called a long latency actor (L-actor). These L-actors are handled by a dedicated unit.
Finally, instructions which modify the memory addresses of the lines of a super-actor should be grouped separately into aggregates denoted as support-actors.
The super-actor machine is to be used in a multi-processor system consisting of multiple processing elements (PE) linked together by an interconnection network. Memories are distributed to each processor in the machine, and the aggregation of these memories presents a global address space which is shared by all the processors. Therefore, no centralized global memory system is utilized.
Fig. 2 illustrates a processing element used in the super-actor machine. This element consists of a super-actor execution unit (SEU), an actor preparation unit (APU) having an adjoining support-actor execution pipe, an actor scheduling unit (ASU), an L-actor execution unit (LEU), and a local main memory. As shown therein, the main memory is in communication with each of the four other components, i.e. the SEU, the LEU, the ASU and the support-actor execution pipe of the APU.
The structure of the super-actor execution unit is shown in Fig. 3. The SEU utilizes a smooth execution pipeline and a collection of physical contexts realized by multiple sets of registers. Both instruction register-caches and data register-caches are included. The architecture of the smooth execution pipeline is like any standard instruction processing pipeline, except that the ALU (arithmetic logic unit) stage is made up of sub-pipes which can handle integer and floating point operations. The other stages are standard, including instruction fetch, operand fetch, etc. The aim of the smooth pipeline is to initiate an independent instruction at a sustained rate of one instruction per cycle. Thus, the pipeline is to be free of structural hazards, and all stages in the pipeline have a uniform and fixed processing time for all types of instructions.
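The behavior the smooth pipeline aims for, one new instruction per cycle through uniform single-cycle stages with no stalls, can be mimicked by a toy simulation. This is purely illustrative; the stage count and contents are assumptions:

#include <stdio.h>

#define STAGES 5   /* assumed stage count: fetch, operand fetch, ALU, ... */

/* Toy model of a "smooth" pipeline: every stage takes exactly one cycle
 * for every instruction, so a new instruction enters each cycle and no
 * stage ever stalls. Stage contents are instruction ids; -1 is a bubble. */
int main(void)
{
    int stage[STAGES] = { -1, -1, -1, -1, -1 };
    int next_id = 0;

    for (int cycle = 0; cycle < 8; cycle++) {
        for (int s = STAGES - 1; s > 0; s--)   /* advance the pipe */
            stage[s] = stage[s - 1];
        stage[0] = next_id++;                  /* sustained one-per-cycle issue */

        printf("cycle %d:", cycle);
        for (int s = 0; s < STAGES; s++)
            printf(" %2d", stage[s]);
        printf("\n");
    }
    return 0;
}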
A physical context is realized by a set of registers which is assigned to each super-actor when it becomes active, and returned to the pool of free physical contexts when the super-actor leaves the execution unit. The purposes of this set of registers are to store information about a super-actor and to serve as temporary scratch-pad registers for an active sequential super-actor. Therefore, the values in the registers are not retained after the activation of a super-actor and cannot be used by other super-actors.
All of the contexts share an instruction issuer. This issuer chooses a ready context, increments its counter value and sends the instructions into the execution pipe. An activation identification is associated with each context and is sent along with each instruction when it enters the execution pipe so that the proper set of registers is used. The issuer is also responsible for sending decrement-reserve-counter signals to the register-caches when a super-actor exits the SEU. If a context is assigned to a parallel super-actor, it can be ready every machine cycle. Otherwise, the instructions in the super-actor must be executed sequentially, and the context must wait for a signal from the execution pipe before it can progress.
Attached to the APU is a simple RISC pipeline which is responsible for processing the instructions within a support-actor. The only instructions which this pipeline can process are loads and stores, and integer add and multiply, since the sole purpose of the support-actors is to perform address calculations such as array indexing. The reasons for processing address calculations in the APU are that the register-cache loader must access these calculated addresses and that doing so increases the instruction level parallelism in the machine.
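Returning to the instruction issuer, its scheduling discipline can be sketched as follows. This is a hypothetical model: the round-robin choice, the field names and the omitted pipe acknowledgement path are assumptions, not details taken from the patent.

#include <stdbool.h>
#include <stdio.h>

#define NUM_CONTEXTS 4

/* Hypothetical per-context state. */
typedef struct {
    bool in_use;         /* context assigned to an active super-actor      */
    bool is_parallel;    /* parallel super-actor: eligible every cycle     */
    bool pipe_ack;       /* sequential only: previous instruction finished */
    int  pc;             /* counter into the super-actor's instructions    */
    int  length;         /* number of instructions in the super-actor      */
    int  activation_id;  /* tagged onto every issued instruction           */
} Context;

static bool ready(const Context *c)
{
    if (!c->in_use || c->pc >= c->length)
        return false;
    return c->is_parallel || c->pipe_ack;
}

/* One issue cycle: pick a ready context round-robin, bump its counter,
 * and send the instruction into the pipe tagged with its activation id. */
static void issue_cycle(Context ctx[], int *rr)
{
    for (int i = 0; i < NUM_CONTEXTS; i++) {
        int k = (*rr + i) % NUM_CONTEXTS;
        if (ready(&ctx[k])) {
            printf("issue: activation %d, instruction %d\n",
                   ctx[k].activation_id, ctx[k].pc);
            ctx[k].pc++;
            if (!ctx[k].is_parallel)
                ctx[k].pipe_ack = false;   /* must wait for the pipe's signal */
            *rr = (k + 1) % NUM_CONTEXTS;
            return;
        }
    }
    printf("issue: bubble (no ready context)\n");
}

int main(void)
{
    Context ctx[NUM_CONTEXTS] = {
        { .in_use = true, .is_parallel = true,  .length = 2, .activation_id = 7 },
        { .in_use = true, .is_parallel = false, .pipe_ack = true,
          .length = 2, .activation_id = 9 },
    };
    int rr = 0;
    for (int cycle = 0; cycle < 4; cycle++)
        issue_cycle(ctx, &rr);   /* the pipe's ack path is omitted here */
    return 0;
}

In this sketch a sequential context becomes ineligible after each issue until the execution pipe would signal completion, while a parallel context may issue every machine cycle.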
The LEU is responsible for fetching long latency instructions and their necessary operands from the main memory, and for processing them.
Upon completion of an active super-actor, the SEU sends an appropriate signal to the ASU indicating that the super-actor has been executed. The ASU, in turn, processes the signals and sends enabled actors, along with their attributes, to the APU. It is at this location that the enabled actors are queued for entry to either the SEU, the LEU or the support-actor execution pipe.
The structure of the ASU and the handling of signals therein are quite similar to the instruction scheduling unit described in the paper authored by G. R. Gao, R. Tio and H.H.J. Hum, entitled "Design of an Efficient Data Flow Architecture Without Data Flow," published in the Proc. of the International Conference on Fifth-Generation Computers, pp. 861-868, Tokyo, Japan, December 1988, which is incorporated by reference.
The register-caches are organized both as a register file and a cache, as shown in Fig. 4. This register-cache invention is also described in a paper authored by H.H.J. Hum and G. R. Gao, entitled "A Novel High-Speed Memory Organization for Fine-Grain Multi-Thread Computing," published in the Proceedings of the Parallel Architectures and Languages, Europe '91 Conference, Eindhoven, The Netherlands, June 1991. Viewed from the SEU, its contents are directly accessible using relatively short addresses, a process similar to the addressing of general registers in conventional CPUs. Moreover, from the perspective of the APU, it is content addressable in that its contents are tagged just as in conventional caches. Although it is not absolutely required, the use of a fully associative cache would make effective use of all of the register-cache lines. Full associativity is important in determining the minimum cache size for proper functionality. A loader/storer is employed for ensuring that all of the necessary data for the operation of the super-actor is in the data register-cache and that space is provided therein for its results.
The register-cache retains the transparency feature of conventional caches in the sense that it is not visible to the programmers or compilers. Thus, no register allocation by the compiler is required for the register-cache. The allocation of a register-cache line for a block of memory is done entirely at runtime and is performed using cache update and replacement algorithms. Once this is accomplished, the register-cache locations within a line can be accessed by the SEU directly using short addresses, just as if they were general registers. This binding process is called registering and is illustrated in Fig. 5.
Fig. 6 describes the actor preparation unit (APU), which includes queues for enabled parallel and sequential super-actors, support-actors and long-latency actors, as well as for ready super-actors.
The APU also contains a register-cache loader which is responsible for "checking in" enabled super-actors, ensuring that all of the necessary data for the operation of the super-actor is in the register-cache and that space is reserved in the register-cache for its results.
As indicated in Fig. 7, once the PSA/SSA ready queue is provided with a signal indicating a free context in the physical contexts of the SEU, the next available super-actor is taken from the PSA/SSA enabled queue. The addresses for the operand and result lines must be calculated if they are offset values from the base address. If they are pointer values, the memory location must be calculated and the address fetched from the data cache which is shared with the support-actor execution pipe. If neither is the case, the values are absolute addresses, and they may be sent to the data register-cache without modification. Once the addresses are sent to the data register-cache, it will return a line number for each address.
It will also send head-instruction addresses to the instruction-register-cache and receive register-cache line numbers. When the register-cache loader has received all the line numbers from these two caches, it will send the super-actor, consisting of its identification, base address, length, instruction and data register-cache line numbers to the ready PSA/SSA queue, where it will wait until a context is free in the SEU.
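The three addressing cases of the check-in step can be sketched in C as follows; the mode names and the data-cache stub are illustrative assumptions:

#include <stdint.h>
#include <stdio.h>

typedef enum { ADDR_OFFSET, ADDR_POINTER, ADDR_ABSOLUTE } AddrMode;

/* Stand-in for the data cache shared with the support-actor execution
 * pipe; a real implementation would probe that cache.                  */
static uint32_t data_cache_fetch(uint32_t location)
{
    return 0x2000 + location;   /* placeholder contents */
}

/* Resolve one operand or result line address per the check-in rules. */
static uint32_t resolve_line_address(AddrMode mode, uint32_t base,
                                     uint32_t value)
{
    switch (mode) {
    case ADDR_OFFSET:   return base + value;                 /* offset   */
    case ADDR_POINTER:  return data_cache_fetch(base + value); /* indirect */
    case ADDR_ABSOLUTE: return value;                        /* unchanged */
    }
    return value;
}

int main(void)
{
    uint32_t base = 0x1000;
    printf("offset:   0x%x\n", (unsigned)resolve_line_address(ADDR_OFFSET, base, 0x10));
    printf("pointer:  0x%x\n", (unsigned)resolve_line_address(ADDR_POINTER, base, 0x10));
    printf("absolute: 0x%x\n", (unsigned)resolve_line_address(ADDR_ABSOLUTE, base, 0x3000));
    return 0;
}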
The operation of the data register-cache is illustrated in Fig. 8. The registration process begins when a memory address is sent to the data register-cache from the APU. Read-in requests are issued for operand lines and reserve requests are issued for result lines. After the registering process, the instruction is checked in. If the data register-cache is full, the least recently used (LRU) cache replacement policy is used on lines which are no longer needed to find a replacement line.
The LRU algorithm uses age counters to decide which line to replace. It is noted that the age counters are updated only by request from the APU and not by accesses from the SEU.
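A minimal sketch of this replacement policy, assuming a per-line age counter and a reserve counter that pins lines still needed by checked-in super-actors (the field names are illustrative):

#include <stdint.h>
#include <stdio.h>

#define NUM_LINES 16

/* Hypothetical per-line replacement state. */
typedef struct {
    uint32_t age;       /* bumped on APU requests only, never by SEU access  */
    int      reserved;  /* > 0 while a checked-in super-actor still needs it */
} LineState;

/* On an APU request that touches line_no, age all lines and make the
 * touched line the youngest. */
void apu_touch(LineState ls[], int line_no)
{
    for (int i = 0; i < NUM_LINES; i++)
        ls[i].age++;
    ls[line_no].age = 0;
}

/* Victim selection: the oldest line whose reserve count is zero, i.e. a
 * line no longer needed by any active or ready super-actor. */
int lru_victim(const LineState ls[])
{
    int victim = -1;
    uint32_t oldest = 0;
    for (int i = 0; i < NUM_LINES; i++)
        if (ls[i].reserved == 0 && ls[i].age >= oldest) {
            oldest = ls[i].age;
            victim = i;
        }
    return victim;   /* -1 if every line is still reserved */
}

int main(void)
{
    LineState ls[NUM_LINES] = {{0}};
    ls[2].reserved = 1;            /* pinned by an active super-actor */
    apu_touch(ls, 5);              /* line 5 becomes the youngest     */
    printf("replacement victim: line %d\n", lru_victim(ls));
    return 0;
}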

A decrement-reserve-counter signal, along with the line numbers, is sent from the instruction issuer in the SEU when a super-actor exits the SEU. Forced write-backs are used to handle super-actors passing their results to long-latency actors, because the LEU does not access the data register-cache.
Mandatory read-ins are necessary in this cache because operand lines of a super-actor which were written by a long-latency actor must be brought in, since the LEU can only write into main memory.
The instruction register-cache check-in algorithm is similar to the data register-cache algorithm except that no write steps are employed.
For the tandem of the APU and instruction register-cache or data register-cache to function correctly, the SEU must be guaranteed that its operations will always find their values in the register-cache. Therefore, a minimum number of register-cache lines must be present. This minimum number of required register-cache lines is determined utilizing the formula (J + K) x L, in order that a maximum of (J + K) super-actors can be active, wherein J is the number of slots in the PSA/SSA ready queue, K is the maximum number of allowable super-actors in the SEU, and L is the maximum number of register-cache lines allocated to a super-actor. Similarly, if there are J + K active super-actors, then their reserved or read-in register-cache lines will not be replaced until the super-actor that requested them exits the SEU.
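For illustration only (these numbers are hypothetical, not taken from the patent): with J = 4 ready-queue slots, K = 2 super-actors allowed in the SEU, and L = 4 lines per super-actor, at least (4 + 2) x 4 = 24 register-cache lines are required so that all six potentially active super-actors can keep their lines resident simultaneously.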
Fig. 6 includes a bypass fast path to avoid unnecessary probing of the register-caches. This is important since, for super-actors in loop constructs which are enabled every time the loop iterates, the lines that they use might already be in the register-caches when they are enabled. These super-actors can be tagged by the compiler as possible fast-path candidates so that, when they are enabled, a small fully-associative conventional cache memory queue containing the recently fired super-actors can be checked associatively for their presence. If the super-actor is present, then the cache line numbers which it used previously are retrieved, the lines reserved, and the super-actor enters its ready state immediately. Entries for this small cache are inputted by the register-cache loader when it recognizes that a super-actor it is processing has been tagged as a fast-path candidate and it has received from the register-cache the register-cache line numbers the super-actor is to use. Using the equation for determining the size of the register-cache, the number of lines in the small cache should be no more than J + K to ensure that the recently fired fast-path super-actors in the queue still have their lines in the register-caches. If a super-actor is not present in the queue, it is sent back to the regular path, where its other attributes can be fetched and the register-caches probed.
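A sketch of the fast-path probe, assuming a hypothetical entry layout (the sizing comment follows the J + K bound given above):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define FP_ENTRIES 8   /* should be no more than J + K, per the text above */
#define MAX_LINES  4

/* Hypothetical fast-path entry: a recently fired super-actor and the
 * register-cache line numbers it used on its last activation. */
typedef struct {
    bool valid;
    int  sa_id;
    int  lines[MAX_LINES];
    int  nlines;
} FPEntry;

/* Associative probe: on a hit the old line numbers are reused and the
 * super-actor may enter its ready state at once; on a miss it is sent
 * back through the regular check-in path. */
bool fastpath_probe(const FPEntry q[], int sa_id, int out[], int *n)
{
    for (int i = 0; i < FP_ENTRIES; i++)
        if (q[i].valid && q[i].sa_id == sa_id) {
            memcpy(out, q[i].lines, (size_t)q[i].nlines * sizeof(int));
            *n = q[i].nlines;
            return true;
        }
    return false;
}

int main(void)
{
    FPEntry q[FP_ENTRIES] = {{0}};
    q[0] = (FPEntry){ .valid = true, .sa_id = 42,
                      .lines = { 3, 7 }, .nlines = 2 };
    int lines[MAX_LINES], n;
    if (fastpath_probe(q, 42, lines, &n))
        printf("fast-path hit: %d lines, first is line %d\n", n, lines[0]);
    return 0;
}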
Although the description provided herein has been directed to a particular preferred embodiment, it must be understood that many modifications and variations in structure, arrangement, operation and use are possible without departing from the inventive contributions disclosed herein. Accordingly, the present invention is to be considered as embracing all possible modifications and variations coming within the scope of the appended claims.

Claims (8)

1. A super-actor multiprocessor architecture for processing instruction threads denoted as super-actors, each of the super-actors having a dormant, enabled, ready and active state, comprising:
a super-actor execution unit provided with an instruction register file cache, a data register file cache, and a set of physical context registers assigned to each of the super-actors when they become active, said super-actor execution unit additionally provided with an instruction issuer, said super-actor execution unit used to execute active super-actors, the active super-actors exiting said super-actor execution unit after completion of the active super-actor;
an actor preparation unit in communication with said super-actor execution unit for performing address calculations used in said instruction register file cache and said data register file cache of said super-actor execution unit, said actor preparation unit provided with a queue for ready super-actors, indicating that all of the data needed for the operation of the super-actor is provided in said data register file cache and space is reserved in said data register file cache for the results of said super-actor operation, said ready super-actors sent to said super-actor execution unit when space becomes available;
an actor scheduling unit in connection with said super-actor execution unit and said actor preparation unit for processing signals generated by said super-actor execution unit indicating that a super-actor has been executed, said actor scheduling unit sending enabled super-actors to said actor preparation unit; and a main memory unit provided in connection with said super-actor execution unit, said actor preparation unit and said actor scheduling unit.
2. The super-actor multiprocessor architecture in accordance with claim 1, wherein said actor preparation unit is provided with a means for modifying the memory address lines of the super-actors.
3. The super-actor multiprocessor architecture in accordance with claim 1, further including a long-latency execution unit in communication with said super-actor execution unit and said actor scheduling unit.
4. The super actor multiprocessor architecture in accordance with claim 2, further including a long-latency execution unit in communication with said super-actor execution unit and said actor scheduling unit.
5. The super-actor multiprocessor in accordance with claim 1, wherein said actor preparation unit is further provided with a bypass fast-path mechanism provided with a memory device containing a list of recently fired super-actors, said bypass fast-path mechanism containing a means for checking said memory device in said fast-path mechanism to determine whether a particular super-actor is provided therein, and a means for allowing the particular super-actor to directly enter its ready state if it is present in said memory device.
6. A register-cache used with a super-actor multiprocessor provided with a main memory, an actor preparation unit, and a super-actor execution unit, the register-cache organized as both a register file and a cache, said register-cache provided with a plurality of memory locations addressable based on the content of said plurality of memory locations by the main memory, and addressable by address tags by the super-actor execution unit.
7. The register-cache in accordance with claim 6, wherein the availability of at least a portion of said memory locations is assigned at runtime.
8. The register-cache in accordance with claim 6, wherein the actor preparation unit is provided with a queue for enabled parallel and sequential super-actors and further wherein said register-cache must contain at least (J + K) x L lines, wherein J represents the number of slots in said queue for enabled parallel and sequential super-actors, K represents the maximum number of allowable super-actors in said super-actor execution unit and L represents the maximum number of register-cache lines allocated to a super-actor.
CA 2050828 1991-05-28 1991-09-06 Register-cache architecture and super-actor machine Abandoned CA2050828A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US70658991A 1991-05-28 1991-05-28
US07/706,589 1991-05-28

Publications (1)

Publication Number Publication Date
CA2050828A1 (en) 1992-11-29

Family

ID=24838253

Family Applications (1)

Application Number Title Priority Date Filing Date
CA 2050828 Abandoned CA2050828A1 (en) 1991-05-28 1991-09-06 Register-cache architecture and super-actor machine

Country Status (1)

Country Link
CA (1) CA2050828A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5860138A (en) * 1995-10-02 1999-01-12 International Business Machines Corporation Processor with compiler-allocated, variable length intermediate storage
US6088788A (en) * 1996-12-27 2000-07-11 International Business Machines Corporation Background completion of instruction and associated fetch request in a multithread processor
US6205519B1 (en) 1998-05-27 2001-03-20 Hewlett Packard Company Cache management for a multi-threaded processor


Similar Documents

Publication Publication Date Title
CN108027766B (en) Prefetch instruction block
US11954036B2 (en) Prefetch kernels on data-parallel processors
CN108027771B (en) Block-based processor core composition register
KR101638225B1 (en) Executing instruction sequence code blocks by using virtual cores instantiated by partitionable engines
KR101966712B1 (en) Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
KR101620676B1 (en) Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines
US6928645B2 (en) Software-based speculative pre-computation and multithreading
US20170083338A1 (en) Prefetching associated with predicated load instructions
KR101355496B1 (en) Scheduling mechanism of a hierarchical processor including multiple parallel clusters
US20170371660A1 (en) Load-store queue for multiple processor cores
US20170083339A1 (en) Prefetching associated with predicated store instructions
AU2016281603A1 (en) Block-based architecture with parallel execution of successive blocks
EP2542973A1 (en) Gpu support for garbage collection
Chen et al. Guided region-based GPU scheduling: utilizing multi-thread parallelism to hide memory latency
KR20230116063A (en) Processor-guided execution of offloaded instructions using fixed function operations
US20030088636A1 (en) Multiprocessor system having distributed shared memory and instruction scheduling method used in the same system
Hum et al. A novel high-speed memory organization for fine-grain multi-thread computing
CA2050828A1 (en) Register-cache architecture and super-actor machine
Torrellas Thread-Level Speculation
Nuth The named-state register file
Hum et al. Efficient support of concurrent threads in a hybrid dataflow/von Neumann architecture
Hum et al. Concurrent Execution of Heterogeneous Threads in the Super-Actor Machine
Weinzierl GPGPUs with OpenMP
Needham et al. Gpus: Hardware to software
Kim et al. GPU Design, Programming, and Trends

Legal Events

Date Code Title Description
EEER Examination request
FZDE Dead