NL9100598A

NL9100598A - Microprocessor circuit with extended and flexible architecture - provides separation between data transfer and data processing operations

Info

Publication number: NL9100598A
Application number: NL9100598A
Authority: NL
Original assignee: Henk Corporaal
Priority date: 1991-04-05
Filing date: 1991-04-05
Publication date: 1992-11-02

Abstract

Data transport is completely separated from operations on the data. This separation enables the available transfer capacity to be used more effectively. It also gives great freedom of choice in the implementation of function units, allowing pipelining schemes to be used throughout. Cycle times can also be cut to a minimum, i.e. the time needed to transfer the data to be operated on.

Description

MOVE: A Flexible and Extensible Architecture for Designing Processors

1 Introduction

One of the most important developments in the design of computer systems has been the use of RISC instead of CISC design principles [1]. The main lesson was that extra hardware (e.g. for implementing complex instructions) does not always result in faster execution; on the contrary, it can increase the total execution time. Possible causes are:

1. Extra hardware functionality can lengthen the critical timing path, which leads to an increased cycle time.

2. Complex instructions are rarely used in general-purpose computations. The hardware spent on them may be used more effectively for other improvements.

3. Complex instructions are difficult to pipeline.

4. Software solutions for more complex instructions are sometimes even faster than the corresponding hardware solution, in particular when compiler optimizations eliminate part of the software overhead.

5. Complex hardware increases development time and thereby makes it impossible to use the latest technological improvements.

Apart from these causes, implementing complex functions reduces design flexibility. Such functions can constrain the design to the point that future improvements become difficult.

Currently, most CPU manufacturers supply computer systems based on RISC principles with comparable performance. Performance is dominated by the technology used. Since the basic complexity of a RISC processor is fairly limited, it can easily be integrated into a VLSI design environment for the creation of application-specific processors ([2]). In theory, a RISC system achieves a performance of 1 CPI (cycle per instruction); in practice the CPI is somewhat higher because of penalties caused by load and branch instructions. Efficient use of RISC systems requires compilers that can minimize these penalties by placing instructions in delay slots; this can be regarded as a limited form of scheduling parallel instructions.

To obtain an even greater performance improvement than faster technology alone allows, we distinguish two important architectural techniques: 1) the use of deep pipelines, and 2) the use of multiple independent function units (FUs). The first technique reduces the cycle time, the second the CPI. Both techniques require extensive parallelization of code at the instruction level. The next section discusses these developments. It also explains why the resulting architectures have a number of drawbacks that severely limit their applicability.
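The effect of the two techniques can be made concrete with the usual relation: execution time = instruction count × CPI × cycle time. The Python sketch below is illustrative only. The function name and all numbers are assumptions chosen for the example; they are not measurements from this document.

```python
def execution_time_ns(instructions, cpi, cycle_time_ns):
    """Total run time in nanoseconds: instructions x CPI x cycle time."""
    return instructions * cpi * cycle_time_ns


# Hypothetical figures, for illustration only.
N = 1_000_000
baseline = execution_time_ns(N, cpi=1.2, cycle_time_ns=20.0)
# Deep pipeline: shorter cycle time, slightly higher CPI (more delay slots).
deep_pipeline = execution_time_ns(N, cpi=1.4, cycle_time_ns=10.0)
# Multiple FUs: same cycle time, CPI below one.
multiple_fus = execution_time_ns(N, cpi=0.6, cycle_time_ns=20.0)

print(baseline, deep_pipeline, multiple_fus)
```

With these assumed numbers both techniques shorten the run time, each through a different factor of the product.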

The invention described here concerns a very different architecture, called the MOVE architecture, which completely separates data transport from data operations. Section 3 describes this invention, while section 2 describes recent architectural developments and the competitors of MOVE. Section 4 compares the MOVE architecture with these competing architecture implementations and shows the superiority of the described invention. Section 5 contains the conclusions, section 6 the figures, and the last section contains a glossary and a bibliography.

2 Recent Architectural Developments

To increase the speed of single processors, several architectural techniques are applied [3], namely:

Superpipelining: Dividing existing pipeline stages into sub-stages, which makes a shorter cycle time possible. This is an extension of the RISC pipeline principle; there the execution stage still took 1 cycle, whereas it is now divided into several stages. Superpipelining will slightly increase the effective CPI because of the increased number of delay slots, but this effect is compensated by the reduced cycle time. Examples can be found in [4,5].

Functional parallelism: Adding independent function units to exploit operation-level parallelism makes the effective CPI drop below one. Superscalars [6,7] and VLIWs [8,9,10,11,12] are representatives of this technique. Superscalars detect operation parallelism dynamically, whereas VLIWs leave this entirely to the compiler (statically). Although superscalars also require intensive compiler support to achieve reasonably efficient execution, they can also execute unscheduled code; they can therefore be object-code compatible with architectures that do not implement functional parallelism.

Superscalars are strongly limited in the amount of parallelism they can exploit. The hardware needed for run-time parallelism detection limits the size of the decode window to a few instructions. Further hardware reduction is possible by restricting the instructions that can be combined (e.g. at most 1 integer, 1 floating-point, 1 load/store and 1 control instruction). The compiler is far better able to discover potentially parallel instructions than hardware ever can (a possible exception is the detection of address dependencies in memory operations). A favorable consequence of dynamic detection is that code size does not increase through the insertion of empty instructions (NOPs). In short, superscalars are good for limited performance improvements of existing systems, but they do require compiler support for this.

VLIW and superpipelined architectures have a higher potential for parallel execution than superscalars. The benefit will be limited for scalar applications (because of the sequential nature of the computations); for application-specific and vector applications, however, a much greater speed-up can be realized.

Ideally, we would like to combine VLIW and superpipelining techniques. This would yield an architecture that delivers high vector performance, can be tailored to specific applications, and still shows good scalar performance. However, as we will show in section 4, this architecture has a number of drawbacks, which result in a complex organization, inefficient hardware use, functionality that is difficult to change, and limited extensibility. It is therefore not ideal for use in a framework for application-specific processor design. The invention described in the next section does not have these drawbacks.

3 The MOVE architecture

The required performance and the permitted design and manufacturing costs determine, for VLIW, the number and type and, for superpipelined designs, the depth and type of the function units to be used. The communication bandwidth between function units is determined solely by the number and type of function units; it has to be dimensioned for the worst-case situation. Likewise, the instruction bandwidth is dimensioned for the worst-case situation, namely that all function units are in use at the same time. Even when that is the case, the required communication bandwidth will rarely be worst case.

In this section we describe the invention of an architecture (the MOVE architecture) that strongly reduces the above-mentioned problems of overdesign. In the next subsection we develop this architecture from the viewpoint of VLIW and superpipelining architectures. We then discuss issues such as pipelining, exception handling, and the support mechanisms needed for an efficient mapping of high-level programming languages and operating systems.

3.1 From VLIW and Superpipelining to MOVE

If we take VLIW and superpipelining as the starting point, the development of the MOVE architecture can be seen as a 6-step process: 1) reducing the register-file communication bandwidth, 2) programming the bypass, 3) reducing the bypass connectivity, 4) separating transport and operation, 5) reducing the transport capacity, and 6) programming the transport.

Reduction of the register-file communication bandwidth. Figure 1a shows the internal communication structure of a (super-)pipelined organization. A set of function units (FUs) is connected to a bypass unit that contains operand and bypass registers. The bypass unit is connected to a register file. Similarly, figure 1b shows the communication structure of a VLIW. This VLIW organization is used as the starting point in what follows.

Besides operand registers, the bypass contains a number of bypass registers for temporary results. The latter are organized as FIFOs. The operand registers read their data from the register file, the FIFO, or the outputs of the FUs. Results can be written into any of the FIFO registers, depending on how long an FU operation takes. Each such write is accompanied by a write of the register identifier. This identifier is used for writing into the register file and for identifying the temporary results in the bypass. A result that leaves the FIFO is written into the register file. The bypass is not visible at the architecture level.

With respect to the required bandwidth between the register file and the bypass, we can make two observations:

1. More than 50% of the data in the bypass registers is never used again after it has been written to the register file. Deeper FIFOs can raise this percentage further.

2. Many instructions do not use all possible register operands. For example, a branch instruction produces no result; monadic instructions use only 1 source operand. These instructions therefore do not use the full register-file bandwidth.

The compiler can detect all these cases of incomplete use of the register bandwidth. We can therefore construct an organization in which register transport is fully controlled by the compiler. As figure 2a shows, the register file now becomes a function unit with far fewer read and write ports.

Programming the bypass. In the situation now created, it is more efficient to make the bypass visible at the architecture level. The bypass now becomes an explicitly addressed top level of the memory hierarchy. Writing values into the register file is no longer influenced by the FIFOs; it is entirely under control of the compiler. All bypass registers (from now on including the operand registers) can be written and read by all FUs (see figure 2b). The advantages are:

• Simpler design of the register file.

• Simpler design of the bypass (no FIFOs, no identifier matching).

• Smaller operand-identifier fields in the instructions.

Reduction of the bypass connectivity. Since reading and writing the bypass registers is part of FU execution, it should be done as efficiently as possible. One way to achieve this is to reduce the number of connections and thereby the read and write load. Applying this principle to the visible bypass for the number of read and write connections results in the organizations drawn in figures 2c and 2d, respectively.

The simplest method of reducing the number of read connections is to use private register sets for all FU inputs. These sets can be written by all FUs (see figure 2c). The LIFE processor [13] uses this organization. To avoid multiple write ports per register set, LIFE restricts writing to each register set to one FU at a time.

Another method is to reduce the number of write connections. Each FU then writes into its own bypass register. The Cydra-5 [12] uses this organization. Figure 2d shows an organization with 2 private result registers per FU.

Separation of transport and operation. Reducing both the number of read connections and the number of write connections is illustrated in figure 3a. All bypass registers are now arranged as private FU result and FU operand registers. The entire operation now takes place inside the FU, and the transport outside it. Transport and operation are thereby separated.

Reduction of the transport capacity. The transport network of figure 3a is still not fully utilized; after all, not every FU will deliver a result every cycle. The transport capacity of this network should be matched better to average use; that is, the number of buses must be reduced (or, equivalently, more FUs can be served with the same number of buses). Figure 3b shows 3 FUs that have to share 2 buses.

The main implication of this organization is that both operations and transport have to be scheduled. Seen from the VLIW viewpoint, the operations are scheduled by the compiler (statically) and the transport network by the hardware (dynamically).

Programming the transport. The next logical step is to let the compiler schedule the transport as well. The compiler can potentially do a much better job here than the hardware; unlike the hardware, the compiler can inspect the entire program. The transport now becomes visible at the architecture level. This means that operations need not be specified separately; they occur as a side effect of the transport. As a result, the programming model is reversed:

Traditional: Programmed operations; operations trigger transport.

MOVE: Programmed transport; transport triggers operations.

The MOVE architecture is based on this concept; it can be seen as a non-uniform register set connected by a point-to-point, point-to-multipoint, or other type of network.¹ The register set contains FU input and FU output registers, and general-purpose registers. The input registers are divided into 2 types: operand and trigger registers. Writing data into the trigger register starts an FU operation, in which a number (0 or more) of FU-specific operand registers contain the remaining input operands. As a result, the MOVE architecture has only 1 type of instruction: the move. A move contains at least a source and a destination identifier. It specifies a transport from one register (or FU) to another register (or FU), or to several in the case of multicast transport, possibly triggering 1 or more operations.

¹ The use of move instructions only is not new (see, among others, [14]). Move architectures in which all functionality is memory-mapped have long been known. What is new is the use of fully register-mapped functionality, in which multiple guarded transport operations can be executed in parallel. This method leads to all kinds of surprising advantages, as described in section 4.

For a traditional add operation, the MOVE architecture needs 3 moves: the first to load the operand register, the second to trigger the operation (transporting the second source operand), and the third to read the result. In practice, this last move will often also trigger the next operation.
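The three moves of an addition can be sketched as a small Python model. This is a minimal illustration under stated assumptions, not the prototype's actual design: the class names `AddUnit` and `Machine`, the port names `operand`, `trigger` and `result`, and the register names `r1` to `r3` are all hypothetical.

```python
class AddUnit:
    """Hypothetical adder FU: one operand, one trigger and one result register."""

    def __init__(self):
        self.operand = 0
        self.result = 0

    def write(self, port, value):
        if port == "operand":
            self.operand = value                 # plain storage, nothing starts
        elif port == "trigger":
            self.result = self.operand + value   # writing the trigger starts the add
        else:
            raise KeyError(f"no writable port {port!r}")

    def read(self, port):
        if port != "result":
            raise KeyError(f"no readable port {port!r}")
        return self.result


class Machine:
    """General-purpose registers plus function units, connected only by moves."""

    def __init__(self):
        self.regs = {}
        self.units = {"add": AddUnit()}

    def move(self, src, dst):
        # The only instruction: transport a value from src to dst.
        if "." in src:
            unit, port = src.split(".")
            value = self.units[unit].read(port)
        else:
            value = self.regs[src]
        if "." in dst:
            unit, port = dst.split(".")
            self.units[unit].write(port, value)
        else:
            self.regs[dst] = value


def add_with_moves(a, b):
    m = Machine()
    m.regs.update({"r1": a, "r2": b})
    m.move("r1", "add.operand")    # 1) load the operand register
    m.move("r2", "add.trigger")    # 2) second operand triggers the operation
    m.move("add.result", "r3")     # 3) read the result
    return m.regs["r3"]
```

In this model the addition itself is never named in the program; it happens as a side effect of the second move, which is the reversal of the programming model described above.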

The transport network can have an arbitrary topology, as long as transport operations can be executed. The network does not need full connectivity. It is the task of the compiler to optimize the transport for a given topology. Limited connectivity can lower the capacitance of the network and thereby the cycle time. The MOVE prototype, which implements the MOVE architecture on a single chip, contains a regular, fully connected bus structure in which 4 moves are executed in parallel.
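How a limited number of buses constrains the transport can be sketched with a simple greedy slot allocator in Python. This is an assumption-laden illustration, not the scheduling method of the MOVE compiler: the function name `schedule_moves` is hypothetical, each move is reduced to the earliest cycle in which its data is ready, every bus is assumed to reach every register, and data dependencies between moves are ignored.

```python
def schedule_moves(ready_cycles, buses):
    """Place each move in the earliest cycle >= its ready cycle with a free bus.

    ready_cycles: earliest possible cycle per move, in priority order.
    buses:        number of moves that can be transported per cycle.
    Returns the cycle assigned to each move, in the same order.
    """
    used = {}        # cycle -> number of buses already taken
    schedule = []
    for ready in ready_cycles:
        cycle = ready
        while used.get(cycle, 0) >= buses:
            cycle += 1                      # all buses busy, try the next cycle
        used[cycle] = used.get(cycle, 0) + 1
        schedule.append(cycle)
    return schedule


# Three results ready in the same cycle, but only 2 buses to carry them.
print(schedule_moves([0, 0, 0], buses=2))
# The same moves on a 4-bus network.
print(schedule_moves([0, 0, 0], buses=4))
```

With 2 buses one of the three transports slips to the next cycle; with 4 buses all three fit in the cycle in which they become ready. A real scheduler would also weigh which move to delay, since that choice affects the operations triggered later.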

3.2 Pipeline alternatives

De MOVE-architectuur vereist niet dat de FUs gepipelined moeten worden uitgevoerd, echter optimaal FU-gebruik vereist maximale FU-pipelining. Indien we pipeline-latches met logica combineren kunnen we het aantal poorten per pipeline-stage reduceren tot 2. In de praktijk (zie [15]) kunnen clock-skew en data-skew een hogere ondergrens opleveren. Bij het implementeren van pipelines hebben we een aantal alternatieven: • Continue always In dit geval wordt de data in de pipeline op reguliere intervallen (meestal iedere klokcyclus) naar de volgende stage verschoven. Dit is geïmplementeerd in de Cydra-5 VLIW processor [12] en in de meeste vector-machines. Dit is eenvoudig te implementeren, echter het vereist, zoals verderop zal worden aangetoond, veel extra registers.The MOVE architecture does not require the FUs to be pipelined, however optimal FU utilization requires maximum FU pipelining. If we combine pipeline latches with logic, we can reduce the number of ports per pipeline stage to 2. In practice (see [15]), clock skew and data skew can provide a higher lower limit. When implementing pipelines, we have a number of alternatives: • Continue always In this case, the data in the pipeline is shifted to the next stage at regular intervals (usually every clock cycle). This is implemented in the Cydra-5 VLIW processor [12] and in most vector machines. This is easy to implement, however, as will be shown later, it requires many additional registers.

• Push/Pull The data in the pipeline is advanced only when a subsequent instruction uses the same pipeline [5]. Note that the distinction between push and pull is meaningful only if traditional operations are split into separate moves.

• Hybrid Data in the pipeline normally advances, except when data from earlier operations is in danger of being overwritten, for example because results were not read in time. Pipeline stages therefore block as late as possible.

In the MOVE architecture, most FUs have a hybrid pipeline implementation. The first alternative requires results to be read at exactly the right time. This not only reduces the scheduling options of the MOVE machine, but also causes loss of data when the processor blocks (for example on a cache miss or an exception). This does not apply if the pipeline is only one stage deep.

The second alternative leads to large code size for scalar applications: extra dummy instructions are needed solely to get results out of the pipeline.

The hybrid alternative is somewhat harder to implement because stages advance conditionally, but it combines simple exception handling, small code size and additional scheduling freedom (see section 4).
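The hybrid behaviour can be illustrated with a small software model. The sketch below is not taken from the prototype; the class name `HybridPipeline` and its methods are hypothetical, and the model assumes one value per stage and one advance attempt per clock cycle. A stage advances only into an empty slot, so an unread result is never overwritten, and a new operation is refused (the trigger move would be locked) once the pipeline is full.

```python
from typing import List, Optional


class HybridPipeline:
    """Toy model of an FU pipeline with 'hybrid' stage control (illustrative only)."""

    def __init__(self, depth: int) -> None:
        # stages[0] is the input stage, stages[-1] plays the FU result register.
        self.stages: List[Optional[int]] = [None] * depth

    def tick(self, new_input: Optional[int] = None) -> bool:
        """Advance one cycle; return False if new_input could not be accepted."""
        # Walk from the output end backwards: a value moves only into an empty stage,
        # so stages block as late as possible and nothing is overwritten.
        for i in range(len(self.stages) - 1, 0, -1):
            if self.stages[i] is None and self.stages[i - 1] is not None:
                self.stages[i] = self.stages[i - 1]
                self.stages[i - 1] = None
        if new_input is None:
            return True
        if self.stages[0] is None:
            self.stages[0] = new_input
            return True
        return False  # pipeline full: the producer has to wait

    def read_result(self) -> Optional[int]:
        """Result move: take the value out of the result register, freeing it."""
        value, self.stages[-1] = self.stages[-1], None
        return value
```

In a "continue always" pipeline the third value in the test below would have pushed the first one out of the result register; here it is simply refused until the result has been read.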

3.3 Exceptions

Exceptions can be divided into three categories:

1. Precise. Handling these exceptions requires that the source operands of the operation that caused the exception are still available. Examples are loads, stores and floating-point operations (according to the IEEE 754 floating-point standard). After the exception has been handled, it must be possible to continue execution.

2. Imprecise. These exceptions do not require the source operands to be available when the exception is handled. They do require that execution can continue afterwards. Examples are exceptions caused by integer arithmetic (see footnote 2).

3. Fatal. Unlike the previous two categories, these exceptions do not require that execution can be continued.

Precise exceptions are implemented in part by means of the so-called inexact exception condition. FUs try to determine as early as possible whether an exception may occur. Until this condition has been verified, and also if it turns out to be true, the source operands must not be overwritten. This can be achieved in four ways: 1) by compiler conventions, 2) by initiating locking immediately, until it is known either that the inexact condition does not hold or that the exception has actually occurred, 3) by initiating locking only when the source operands are in danger of being overwritten, or 4) by saving the source operands in the FU itself. Clearly, solution 3, combined with fast generation of the inexact condition, is preferable. This combination has therefore been implemented for most FUs in the MOVE prototype.

To recover from an exception, the processor must be brought into a state that guarantees that recovery is possible. There are roughly three methods for this: 1) restart, 2) completion, 3) blocking.

Restart For multi-cycle operations with out-of-order completion, restarting unfinished instructions is practically impossible. The PC and the rest of the processor state would have to be preserved for many cycles, which requires an enormous hardware investment.

Completion Many superpipelined and VLIW processors are designed to finish, after an exception, the work the various FUs are engaged in (completion). These processors cannot block the FUs during exceptions (see footnote 3). As a consequence, all unfinished instructions are still completed after the exception has been detected. This by itself forces the compiler not to use the result registers of a multi-cycle operation for other operations while that operation is being computed. If exceptions must be detected precisely, this inefficient register usage becomes worse: the source operands must now also be protected against being overwritten during the operation. In the Cydra-5 [12] the compiler guarantees this by declaring a register in use from the start of the operation until all operations that use the written value have finished. For software-pipelined loops this causes a rather disastrous increase in register usage. Note that all register files in the Cydra-5 must be designed for worst-case situations.

Blocking In the MOVE architecture the FUs block automatically in case of an exception, because this mechanism is already built in to obtain an extra degree of scheduling flexibility. The MOVE pipelines block automatically when the result is not read. Register destinations are therefore never overwritten. In fact, a destination cannot be overwritten, because the FUs do not know what the final destination (beyond the FU result register) is. Saving and restoring the FU state can be very complex in current VLIWs, for three reasons: 1) stages can hold complex data formats, 2) the destination identification must be saved and restored, and 3) the data and destination identification must be restored into the correct pipeline stage. For the MOVE architecture, blocking is much easier to implement: 1) intermediate results need not be saved, only final results; 2) because instructions are split into components for loading operands and reading the result, the only destination is the FU result register, so the destination identification need not be saved; 3) since the number of cycles between starting an operation and reading its result is not determined by the FU latency, only the correct order of operations within an FU has to be restored; the exact position in the FU pipeline is irrelevant. The FU therefore needs no complex organization to make this recovery possible; a single identity operation per FU (e.g. +0, ×1, etc.) suffices to effect recovery.

2 Some LISP systems can make good use of precise exceptions for integer arithmetic. After an overflow in a fixnum operation, the representation can then be changed to bignum.

3 An exception is the Intel i860 [5]. The pipelines of the i860 must be emptied by means of push and pull instructions.

3.4 Support mechanisms

The MOVE architecture separates transport from operation and is programmed solely by specifying the transport. A processor based on this architecture needs a number of support mechanisms for high-level programming languages and operating systems. Figure 4 shows a possible realization of both the move mechanism and the support mechanisms, as implemented in the prototype. The move mechanism consists of the transport network and the move-identifier bus. The support mechanisms are the guard, locking and exception mechanisms.

Conditional execution Supporting conditional execution solely through a compare-and-branch mechanism is less acceptable for VLIW and superpipelined machines. A better approach is the use of guards, which allow conditional execution of some or all transport operations. Some (or all) moves contain a guard selector, which selects the value of one guard as the condition or specifies a combination (boolean expression) of guards. The semantics of a move are such that, if the guard is false, the move has no effect (see footnote 4).

The guard mechanism is integrated in an FU. This FU processes the guard selectors specified in the move instructions and activates the corresponding guard line for each specified guard selector. It also implements the operations that set the guards. Combined with the separation of trigger and result moves, the guard mechanism makes it easy to start an FU operation speculatively. In the prototype, unwanted results (possibly including an unwanted exception condition) are discarded by a move to a dummy register.
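The guard semantics described above can be sketched in a few lines. This is an illustration only: the names `Move` and `execute_instruction`, the representation of a guard selector as a Python callable over a dictionary of guard values, and the assumption that all moves of one instruction read their sources before any destination is written are not taken from the prototype.

```python
from typing import Callable, Dict, List, NamedTuple, Optional

Guards = Dict[str, bool]


class Move(NamedTuple):
    src: str
    dst: str
    # Guard selector: a boolean expression over the guards; None means unconditional.
    guard: Optional[Callable[[Guards], bool]] = None


def execute_instruction(moves: List[Move], regs: Dict[str, int], guards: Guards) -> None:
    """Execute the parallel moves of one instruction; a move whose guard is false has no effect."""
    enabled = [m for m in moves if m.guard is None or m.guard(guards)]
    # Assumption: every source is read before any destination is written.
    values = [(m.dst, regs[m.src]) for m in enabled]
    for dst, value in values:
        regs[dst] = value


# "if lt then x = y else x = z" without a branch: two guarded moves in one instruction.
regs = {"x": 0, "y": 1, "z": 2}
select = [
    Move("y", "x", lambda g: g["lt"]),
    Move("z", "x", lambda g: not g["lt"]),
]
execute_instruction(select, regs, {"lt": False})
```

After the call, only the second move has taken effect, so `regs["x"]` holds the value of `z`.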

Locking Compile-time synchronization of all move operations is impossible when FU latencies cannot be determined by the compiler (e.g. on a cache miss), and inefficient when too few operations can be found to bridge the FU latency in the case of a RaW (read-after-write) hazard. Both problems are solved by hardware synchronization. In general, synchronization implies that producer and consumer can lock each other if either of them is not yet ready for the transaction. Locking can also take place implicitly if the transport network is implemented with self-timed logic.

In the MOVE architecture, locking leads to 2 types of transport locks: 1) a read lock, as long as a result (in an FU) is still under way, and 2) a write lock, as long as the FU is not yet available. The read lock makes the scheduling of result moves independent of the FU latency; the write lock allows an efficient implementation of precise exceptions. Locking extends the duration of the move operation until it can be completed.

4 A possible exception, which we have not implemented in the prototype, is that such a move still reads a value from an FU pipeline but does nothing with it, and also suppresses any exceptions; this supports speculative execution.
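The effect of the read lock on a result move can be expressed as simple cycle arithmetic. The following is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, cycles are counted as integers, and the lock is modelled as extending the result move by whole cycles until the FU result is available.

```python
from typing import Tuple


def result_move_completion(trigger_cycle: int, fu_latency: int,
                           scheduled_cycle: int) -> Tuple[int, int]:
    """Return (cycle in which the result move completes, number of read-lock cycles)."""
    ready_cycle = trigger_cycle + fu_latency          # result available in the FU
    lock_cycles = max(0, ready_cycle - scheduled_cycle)
    return scheduled_cycle + lock_cycles, lock_cycles


# The compiler schedules the result move 3 cycles after the trigger.
on_time = result_move_completion(trigger_cycle=0, fu_latency=2, scheduled_cycle=3)
# A cache miss stretches the load latency to 10 cycles; the move is locked, not lost.
cache_miss = result_move_completion(trigger_cycle=0, fu_latency=10, scheduled_cycle=3)
```

The same code is correct for both latencies, which is the sense in which the schedule of result moves becomes independent of the FU latency.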

Exceptions The exception support mechanism allows an FU to signal the 3 types of exceptions (see section 3.3). A recoverable exception means that the current transport operation has not been completed, but can be completed after the exception has been handled. The trap and return mechanism depends on the implementation of the exception FU. The prototype uses the identifiers on the move-identifier bus to form the address of the appropriate exception routine within one cycle. This allows very fast emulation of unimplemented functionality. For speculative execution it is possible to handle imprecise exceptions at a later time, by adding an exception bit to the data. The exception condition is then tested in software.
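Deferring an exception by means of an exception bit can be sketched as follows. The names `Tagged`, `speculative_div` and `commit` are hypothetical, and the rule that the bit propagates from operands to results is an assumption made for this illustration; the text above only states that a bit is added to the data and tested in software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    value: int
    exc: bool = False  # exception bit travelling with the data


def speculative_div(a: Tagged, b: Tagged) -> Tagged:
    """Divide without trapping; a failure is recorded in the exception bit instead."""
    if a.exc or b.exc or b.value == 0:
        return Tagged(0, exc=True)  # assumed: the bit propagates through later operations
    return Tagged(a.value // b.value)


def commit(x: Tagged) -> int:
    """Software test of the exception bit, performed only when the value is really needed."""
    if x.exc:
        raise ArithmeticError("deferred exception from a speculative operation")
    return x.value
```

A speculatively started division whose result turns out not to be needed is simply never committed, so its exception is never raised.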

4 Comparison of MOVE with architectural trends

The previous two sections describe recent developments in architecture and our alternative, the MOVE architecture. In this section we compare these developments with the MOVE architecture from several points of view. The main purpose of this comparison is to justify the characteristics of the MOVE architecture presented in the previous section. The comparison covers the following areas: 1) hardware utilization; 2) suitability for adapting MOVE to the requirements of specific applications; 3) pipelining of function units; 4) implementation consequences of exceptions; and finally 5) performance aspects such as speed and code size.

4.1 Utilization of the processor hardware

To assess the MOVE architecture with respect to hardware usage, we must look at the different components of a MOVE processor. As shown in figure 3a, a MOVE processor consists of a transport network and function units. Registers can also be described as function units, but we consider them separately.

Transport network Section 3.1 explained that a MOVE processor uses its internal transport capacity more efficiently than a comparable VLIW. This means that, with the same amount of interconnect metal, more function units can be addressed and kept busy. Integration on a single chip is thereby simplified, because there is less interconnect overhead.

Function units Function units (FUs) in the MOVE architecture are almost entirely separated from the data transport network. The FUs only have to comply with an FU-transport interface description. This means, for example, that the pipelining of each function unit can be chosen according to its function and use. The only restriction imposed by the network concerns the time between triggering operations: this time is a multiple of the data transport time (see footnote 5).

5 This even makes wave pipelining possible. Wave pipelines do require that data is read from the pipeline on time. Supporting exceptions therefore means that the pipeline must end in a number of so-called run-out registers.

Register use The number of registers needed in a MOVE architecture can be greatly reduced compared with other architectures. There are three reasons for this:

1. Fewer temporary values need to be stored. Many of these temporary values go directly from FU to FU without being written to the general-purpose registers.

2. In effect, pipeline stages and the FU operand and FU result registers are used in place of general-purpose registers. In the MOVE architecture an ordinary RISC instruction is split into its three fundamental transport components, and these components are scheduled separately. A result that has been generated by an FU but cannot be used immediately (by the same or another FU) can remain in the generating FU for a while, unless of course it blocks results that are needed earlier. In that case a register is needed to hold the result temporarily. In the same way, an FU source register can be used to hold a value temporarily before the operation on this source operand is performed.

3. Blocking exceptions. We have chosen to block the pipeline stages inside an FU automatically when the results have not yet been removed. As a result, no compilation strategy is needed in which registers are reserved for long periods.
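The first two reasons can be made concrete with a small sketch that splits RISC-style operations into their transport components and then forwards a temporary value directly from FU to FU. The port names `.operand`, `.trigger` and `.result`, the function names, and the example registers are hypothetical; the forwarding step assumes that the temporary register has exactly one reader and is not needed afterwards.

```python
from typing import List, Tuple

Move = Tuple[str, str]  # (source, destination)


def split_into_moves(fu: str, dst: str, src1: str, src2: str) -> List[Move]:
    """Split 'dst = fu(src1, src2)' into its three transport components."""
    return [
        (src1, f"{fu}.operand"),  # operand move
        (src2, f"{fu}.trigger"),  # trigger move: delivers the last operand and starts the FU
        (f"{fu}.result", dst),    # result move
    ]


def bypass(moves: List[Move], temp: str) -> List[Move]:
    """Forward the producer of 'temp' straight to its consumer, so 'temp' is never written.

    Assumes 'temp' has exactly one reader and is dead afterwards.
    """
    producer = next(src for src, dst in moves if dst == temp)
    forwarded: List[Move] = []
    for src, dst in moves:
        if dst == temp:
            continue  # the write to the general-purpose register disappears
        forwarded.append((producer, dst) if src == temp else (src, dst))
    return forwarded


# r3 = r1 + r2; r5 = r3 * r4, where r3 is only a temporary.
program = split_into_moves("add", "r3", "r1", "r2") + split_into_moves("mul", "r5", "r3", "r4")
optimized = bypass(program, "r3")
```

The six moves shrink to five, and register r3 is no longer used: the sum travels from the adder's result register directly to the multiplier.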

4.2 Performance

For the performance comparison we look at cycle-time constraints, scheduling freedom, code size, and suitability for scalar and vector applications.

Cycle time Most processors based on the RISC design philosophy use separate pipeline stages for instruction fetch (IF), execution of the operation (EX or ALU) and memory access (MEM). The cycle time of these processors is limited by the time needed for cache access (IF and/or MEM) or for the operation (ALU). Both cache access and the operation can be split into multiple pipeline stages (superpipelining). Cache access can be split into a decode stage, a lookup stage and a tag-compare stage. It is also possible to interleave the cache if the lookup stage forms the critical path in the cycle time (the lookup stage can become critical if too many cache lines have to be read).

The operation stage consists of reading the source operand latches, passing through the combinational logic, passing through the bypass multiplexers, and then writing a bypass register and possibly a source operand latch. In principle, a pipeline latch can also be placed directly after the logic and before the multiplexers. As described earlier ([15]), the minimum amount of logic between staging registers depends on the amount of data and clock skew. The relationship between the write time through the multiplexer and the write load is linear (see [16]). This load is proportional to the number of multiplexer inputs, so the write time is Θ(n), where n is the number of multiplexer inputs (see footnote 6). Writing from n possible sources to 1 destination may well become the critical path in heavily pipelined processors. Consequently, reading from a register file (which is also an n-to-1 transport) or passing through a complex bypass circuit may come to determine the cycle time. Besides designing for minimal data and clock skew, both the size of the register file and the size of the bypass must therefore be limited. As already indicated, the MOVE architecture is superior in limiting the number of registers required. The bypass circuit is replaced by the transport network, and the number of connections within the network is much smaller than in a VLIW architecture. It is our conviction that when transport becomes the critical path in the cycle time, the MOVE architecture has no real competition.

Footnote 6: Using larger drivers does not help, because the capacitive load increases in proportion to the driving capacity. Using a driver tree limits the time to O(log n). A tree, however, is much harder to route in VLSI.
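
The two growth rates mentioned above can be compared with a toy delay model. This is only a sketch under stated assumptions: the function names, the unit delay and the fanout parameter are hypothetical and do not come from the original text, and real delays depend on the technology and the wiring.

```python
def flat_mux_write_time(n_inputs: int, unit_delay: float = 1.0) -> float:
    """Flat n-to-1 multiplexer: the load, and so the write time, grows linearly in n."""
    if n_inputs < 1:
        raise ValueError("n_inputs must be at least 1")
    return unit_delay * n_inputs


def driver_tree_write_time(n_inputs: int, fanout: int = 2,
                           unit_delay: float = 1.0) -> float:
    """Driver tree: each level drives `fanout` loads, so time grows with log(n).

    Assumption: every level costs unit_delay * fanout; wiring cost is ignored.
    """
    if n_inputs < 1 or fanout < 2:
        raise ValueError("need n_inputs >= 1 and fanout >= 2")
    levels, reach = 1, fanout
    while reach < n_inputs:  # count the levels needed to reach all inputs
        reach *= fanout
        levels += 1
    return unit_delay * fanout * levels


if __name__ == "__main__":
    for n in (4, 8, 16, 64):
        print(n, flat_mux_write_time(n), driver_tree_write_time(n))
```

In this model the flat multiplexer is eight times slower at 64 inputs than at 8, while the tree is only twice as slow. The model leaves out the routing difficulty that the footnote gives as the drawback of a tree.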

Scheduling freedom. The performance of a VLIW depends crucially on the quality of code scheduling. In the MOVE architecture, a typical VLIW/RISC operation is split into one or more data transport operations (called moves). A three-operand RISC instruction, for example, corresponds to three moves: one to transport the first operand (operand move), one to transport the second (and last) operand (trigger move), and one to transport the result of the operation to its final destination (result move). The instruction that transports the last operand also starts the operation (trigger). Combined with hybrid pipelines in the FUs (see 3.2), this split into transport components creates degrees of freedom for scheduling that have not been available before. The transport of operands and the transport of results are both decoupled from the transport of trigger operands. Splitting a traditional instruction into its transport components has several advantages:

1. Common subexpression elimination (CSE) can be applied to the transport of operands. If an operand does not change between two operations that use the same FU, it needs to be transported to that FU only once.

2. Unnecessary transport to general-purpose registers can be removed, because a result does not have to be consumed immediately. For example, if the second operand of an operation is not yet available, the first operand would normally have to be kept in a general-purpose register. In the MOVE architecture, provided the FU that produces this operand is not needed immediately, the value can instead stay in that FU by postponing the operand move.

3. Better initiation intervals can be obtained in software-pipelined loops. As noted in [17], it is often desirable to add delay stages to pipeline reservation tables in order to obtain optimal initiation intervals. For MOVE this means that the transport out of a result register has to be delayed. Again, this is only possible for FUs that are not used at full speed.
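
The split into operand, trigger and result moves, and the first advantage in the list above, can be illustrated with a minimal sketch. The `RiscOp` type, the `fu.operand` / `fu.trigger` / `fu.result` naming and the `to_moves` function are hypothetical and chosen only for illustration; this is not the prototype compiler described in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiscOp:
    fu: str    # function unit that executes the operation, e.g. "add"
    src1: str  # first operand (goes to the operand latch)
    src2: str  # second operand (its transport triggers the operation)
    dst: str   # destination register for the result


def to_moves(ops):
    """Translate three-operand operations into (source, destination) moves.

    An operand move is skipped when the FU's operand latch is assumed to
    still hold the same, unmodified register value.
    """
    operand_latch = {}  # fu name -> register currently held in its operand latch
    moves = []
    for op in ops:
        if operand_latch.get(op.fu) != op.src1:
            moves.append((op.src1, f"{op.fu}.operand"))   # operand move
            operand_latch[op.fu] = op.src1
        moves.append((op.src2, f"{op.fu}.trigger"))       # trigger move
        moves.append((f"{op.fu}.result", op.dst))         # result move
        # Writing dst changes that register, so latches holding it are stale.
        for fu, held in list(operand_latch.items()):
            if held == op.dst:
                del operand_latch[fu]
    return moves


if __name__ == "__main__":
    ops = [RiscOp("add", "r1", "r2", "r3"), RiscOp("add", "r1", "r4", "r5")]
    for move in to_moves(ops):
        print(move)
```

For the two additions in the usage example, the second operand move of `r1` is dropped, so five moves are emitted instead of six.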

Code size. The shift from CISC to RISC increased object code size by about 50%. This increase is acceptable in view of 1) the growth in RAM capacity and 2) the use of large instruction caches to compensate for the increased instruction traffic. VLIW architectures, however, can easily lead to a code explosion because of 1) the (on average) large number of unused operation slots and 2) the code duplication required by advanced compiler techniques. In principle, MOVE instructions are even less compact than comparable RISC instructions, since a three-operand RISC instruction translates into three moves. In our MOVE prototype a MOVE instruction costs 16 bits, which amounts to 48 bits for three moves, or 50% extra code. Fortunately, there are several reasons why this 50% is an upper bound and the average is much lower:

1. Many moves transport data from FU to FU. In the ideal case this gives 1.5 moves per three-operand RISC instruction, a code reduction of 25%.

2. Some instructions have fewer than three operands (branch, jump, monadic).

3. Common subexpression elimination (CSE) also removes unnecessary operand moves.

Measurements with our prototype compiler, which only scheduled at basic-block level and did not apply CSE, show that for a number of large benchmarks the average number of moves per RISC instruction is about 2.2. The next version of the compiler will most likely match the RISC instruction density. We expect the code density of the MOVE architecture to be comparable to, or even better than, that of a RISC.
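
The percentages quoted in this discussion can be reproduced with a few lines of arithmetic. The sketch below assumes a 32-bit RISC instruction; the text does not state this width, but it is the value for which 48 bits is "50% extra" and 1.5 moves is "a reduction of 25%". The constant and function names are hypothetical.

```python
MOVE_BITS = 16  # from the text: one move costs 16 bits in the prototype
RISC_BITS = 32  # assumption: implied by the quoted percentages, not stated


def relative_code_size(moves_per_risc_instruction: float) -> float:
    """Code size of the moves relative to the RISC instruction they replace."""
    return moves_per_risc_instruction * MOVE_BITS / RISC_BITS


if __name__ == "__main__":
    for moves in (3.0, 2.2, 1.5):
        print(f"{moves} moves -> {relative_code_size(moves):.2f} x RISC size")
```

Under this assumption the measured average of 2.2 moves corresponds to code about 10% larger than the RISC code.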

Compared with a VLIW, the MOVE architecture is superior in code density. First, the number of unused instruction slots is reduced by the shared transport network (see section 3). Second, completely empty instructions can be avoided by means of pipeline blocking (which is essential for scalar code).

Scalar and vector applicability. The MOVE architecture is designed to be extensible in both the number and the type of FUs. Combined with the almost unlimited possibilities for implementing FUs and the efficient use of the data transport network, this makes the MOVE architecture very well suited to vector and other special-purpose applications. One problem of many vector machines is the imbalance between vector and scalar performance. Amdahl's law therefore limits the usefulness of these machines for general vector applications. VLIW architectures combined with smart compilers can improve the performance of both vector and scalar code, although the improvement for scalar code is limited to a factor of 2 to 4.

The MOVE architecture improves on a standard VLIW in two respects: 1) as already mentioned, the code expansion is less dramatic, and 2) extra FUs can easily be added without changes to the transport network and without changing the instruction format. In standard VLIWs, less frequently used functions are combined in a single FU (e.g. integer, logic, shift and branch). In MOVE all these functions can be separated at little cost (only extra operand latches). As a result, these functions can also be used in parallel, which improves scalar performance.

Compared with a RISC processor, the MOVE architecture appears to have an inherent disadvantage for scalar code. As shown in figure 5, an extra cycle (lost cycle) arises between two dependent operations. In both cases the amount of work per instruction is comparable: propagation through combinational logic and latching of the result (see section 4.2). The difference is that the MOVE architecture latches twice instead of once (immediately after the logic, and immediately after the transport). The FU result latch can, however, be combined with the combinational logic (by using Earle or polarity-hold latches).

It is also possible to reduce the number of cycles in the same way as in a RISC by removing the trigger latch (see figure 6). This last solution to the lost cycle, however, conflicts with the aim of minimizing the cycle time and placing useful moves in the lost cycles. In this case it is therefore preferable to implement less transport capacity and so obtain a better fill rate for these lost cycles. All in all, we expect a MOVE processor with two moves per instruction to already outperform a standard RISC.

4.3 Applicability to the design of application-specific processors

Given state-of-the-art silicon compilers with facilities for using parameterized VLSI cells, rapidly modifying a processor becomes a realistic option. A processor is suitable for adaptation to specific applications if its architecture is both flexible and extensible and designing with it is simple.

Flexibility. The retargeting process requires that the functionality can be changed flexibly. VLIWs have a clear disadvantage here; adding FUs involves 1) changing the instruction format and 2) adding extra buses (which changes the entire register file layout). The MOVE architecture does not have this disadvantage. Nor does it place restrictions on the number of operands and results of specific FUs.

Extensibility in performance. Besides being able to change and adapt the functionality easily, it is extremely important that the performance can be matched to the requirements of the application. The MOVE architecture has an extra degree of freedom here: besides adding FUs, the transport capacity can be increased independently. The transport network can, for example, be implemented with a number of parallel buses generated from parameterizable VLSI library cells. Adding a bus is easy, although it changes the instruction format. This change is predictable, however, and therefore easy to carry through in the compiler (for example, with the number of move buses as a compiler parameter). Changing the degree of pipelining is another way to adjust performance.
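
How the number of move buses might enter a compiler as a parameter can be sketched as follows. Everything here is an illustrative assumption rather than the prototype's method: the instruction width is taken to be the number of buses times the 16-bit move cost quoted earlier for the prototype, locations are named `unit.port`, and the packer is a simple in-order greedy one that starts a new instruction when all buses are used, when a move reads from a unit written earlier in the same instruction, or when it writes a location already written there.

```python
MOVE_BITS = 16  # per-move cost quoted for the prototype


def instruction_width(num_buses: int) -> int:
    """Assumed instruction width: one move slot per bus."""
    return num_buses * MOVE_BITS


def unit_of(location: str) -> str:
    """'add.trigger' -> 'add'; a plain register such as 'r3' is its own unit."""
    return location.split(".")[0]


def pack_moves(moves, num_buses: int):
    """Pack (source, destination) moves, in order, into instructions."""
    if num_buses < 1:
        raise ValueError("num_buses must be at least 1")
    instructions = []
    current, written_units, written_locs = [], set(), set()
    for src, dst in moves:
        conflict = unit_of(src) in written_units or dst in written_locs
        if current and (len(current) == num_buses or conflict):
            instructions.append(current)  # close the current instruction
            current, written_units, written_locs = [], set(), set()
        current.append((src, dst))
        written_units.add(unit_of(dst))
        written_locs.add(dst)
    if current:
        instructions.append(current)
    return instructions


if __name__ == "__main__":
    moves = [("r1", "add.operand"), ("r2", "add.trigger"), ("add.result", "r3"),
             ("r4", "mul.operand"), ("r5", "mul.trigger")]
    for buses in (1, 2, 4):
        packed = pack_moves(moves, buses)
        print(buses, "buses:", len(packed), "instructions of",
              instruction_width(buses), "bits")
```

Changing `num_buses` is the only adjustment needed to retarget this sketch. The sketch does not model FU latency.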

Design time. Two issues play an important role in the design of ASPs: the performance/cost ratio and the development time (time to market). Design time is crucial to both. A long design time causes high design costs and lower performance, because the design is based on outdated technology. The number of design constraints inherently imposed by the MOVE architecture, however, is very low: changing FUs, adding FUs and changing the transport network can all be done independently of one another. We believe the MOVE architecture is ideal for rapidly generating ASPs using existing VLSI design environments.

Claims

1. An architecture for a central processing unit (CPU) in which the data transport of operands is decoupled from all, or the most important, operations on those operands. The operations on operands are performed by so-called function units (FUs); the data transport takes place over a data transport network. The architecture is programmed by specifying data transport operations. One or more of these transport operations are specified per instruction. Each transport operation specifies one source and one or more destinations.

2. An architecture as specified in claim 1, wherein a hierarchical bus structure is used for the data transport.

3. An architecture as specified in claim 1, wherein the data transport operations are performed conditionally.

4. An architecture as specified in claim 3, wherein the conditions are specified by one or more guards.

5. An architecture as specified in claim 4, wherein the specification of the guards is part of the specification of the data transport operations.


6. An architecture as specified in claim 5, wherein the use of the guards is specified by Boolean expressions, each containing one or more guards.

7. An architecture as specified in claim 1, wherein the data transport is implemented by single-cycle data transport instructions.

8. An architecture as specified in claim 1, wherein the data transport is pipelined. The data transport is separated into a read cycle and a write cycle.

9. An architecture as specified in claim 1, wherein the data transport is implemented through a combination of data transport operations of one, two or more cycles. A data transport operation of three or more cycles consists of a read cycle, one or more transport cycles, and one write cycle. Multi-cycle data transport operations are used to allow clusters of function units to communicate with each other.

10. An architecture as specified in claim 1, wherein the transport network and/or one or more FUs are implemented with self-timed logic.

11. An architecture as specified in claim 1, wherein a so-called hybrid pipeline scheme is used for one or more function units.

12. An architecture as specified in claim 1, wherein one or more function units are implemented using wave pipelining.

13. An architecture as specified in claim 11 or claim 12, wherein, for the purpose of exceptions, one or more function units are equipped with run-out registers. These function units can be used directly by the exception code during an exception. After returning from an exception, the run-out registers are read automatically when results are read from these function units.

14. An architecture as specified in claim 1, wherein precise exceptions are supported by means of the so-called inexact exception mechanism.

15. An architecture as specified in claim 1, wherein, when a result carrying an inexact exception is read from a function unit, an exception is not generated immediately; instead, the exception propagates with the data over the transport network as an extra condition. This exception can be handled at a later time determined by the compiler.

16. An architecture as specified in claim 1, wherein the correct exception vector is determined within one cycle. The transport identification number of the transport operation that causes the exception is used for this determination.

17. An architecture as specified in claim 16, wherein functionality that is not implemented is emulated by means of an emulation mechanism.

18. An architecture as specified in claim 1, wherein the transport network blocks automatically as long as one of the data values to be transported cannot yet be supplied by the function unit that computes it, because the computation of this value is not yet complete. As soon as the computation has completed, the blocking is lifted.

19. An architecture as specified in claim 1, wherein a function unit to which a data value is to be transported blocks the data transport network as long as it cannot accept this data value.

20. An architecture as specified in claim 1, wherein the instruction unit behaves like a standard function unit, i.e. similar to most other function units. The operands of this function unit, the instruction register and the program counter are accessible via the data transport network. Immediate data values can be read directly from the instruction register. The transport network is managed by a so-called bus controller, which has access to the instruction register and the program counter via the data transport network or via a separate transport network.

21. An architecture as specified in claim 1 and claim 20, wherein multiple instruction units are present and connected to the data transport network. The bus controller can select instructions from multiple instruction registers to control the data transport network.