WO1995016955A1 - Zero load latency for floating point load instructions using a load data queue - Google Patents

Zero load latency for floating point load instructions using a load data queue

Info

Publication number
WO1995016955A1
WO1995016955A1 (PCT/US1994/014304)
Authority
WO
WIPO (PCT)
Prior art keywords
floating point
integer
load
instructions
store data
Prior art date
Application number
PCT/US1994/014304
Other languages
English (en)
Inventor
John M. Brennan
Peter Yan-Tek Hsu
Monica R. Nofal
Original Assignee
Silicon Graphics, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Silicon Graphics, Inc. filed Critical Silicon Graphics, Inc.
Publication of WO1995016955A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824: Operand accessing
    • G06F9/3834: Maintaining memory consistency
    • G06F9/3836: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838: Dependency mechanisms, e.g. register scoreboarding
    • G06F9/3854: Instruction completion, e.g. retiring, committing or graduating
    • G06F9/3856: Reordering of instructions, e.g. using queues or age tags
    • G06F9/3858: Result writeback, i.e. updating the architectural state or memory
    • G06F9/3867: Concurrent instruction execution using instruction pipelines
    • G06F9/3885: Concurrent instruction execution using a plurality of independent parallel functional units

Definitions

  • the present invention relates generally to the field of computers, and more particularly, to a system and method for achieving load latencies of zero and cache coherency of store operations resulting from out-of-order floating point execution in a superscalar reduced instruction set computer (RISC) processor.
  • RISC reduced instruction set computer
  • RISC Superscalar reduced instruction set computers
  • the MIPS RISC architecture is a specific architecture exemplified by the R2000, R3000, R4000 and R6000 (collectively known as the R series) processors.
  • the MIPS RISC architecture uses a number of internal techniques that enable the execution of all instructions in a single cycle.
  • two categories of instructions have special requirements that could disturb the smooth flow of instructions through the instruction pipeline.
  • the datapath(s) are divided into stages called “pipeline” stages, and thus the datapath is also referred to as the pipeline.
  • the first category of instructions comprises "load" instructions, and the second category comprises "jump" and "branch" instructions.
  • One embodiment of the present invention addresses the delay, or latency, specifically caused by load instructions.
  • Load instructions read operands from memory into processor registers for subsequent operation by other instructions. Because memory typically operates at much lower speeds than processor clock rates, the loaded operand (i.e., data) is not immediately available to subsequent instructions in the instruction pipeline. This data dependency is illustrated in Table 1 below.
  • Table 1 shows four instructions 1-4 traversing a simple 4 stage instruction pipeline comprising the following stages: an instruction fetch (F) operation, an ALU (Arithmetic/Logic Unit) operation (A), a memory access operation (M) and a write results operation (W).
  • the operand loaded by instruction 1 is not available for use in the A stage of instruction 2, but is available for use in the A stage of instruction 3.
  • One way to handle this data dependency so that instruction 2 can use the data loaded by instruction 1 is to delay the pipeline by inserting additional clock cycles into the execution of instruction 2 until the loaded data becomes available. This approach obviously introduces delays that would increase the cycles-per-instruction factor.
  • Table 1 illustrates a load delay of one instruction. The instruction that immediately follows the load is in the "load delay slot." If the instruction in the load delay slot does not require the data from the load (from instruction 1, for example), then no pipeline delay is required. If this load latency is made visible to software, a compiler can arrange instructions to ensure that there is no data dependency between a load instruction and the instruction in the load delay slot, as the sketch below illustrates.
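The timing can be made concrete with a short simulation. The following Python sketch is illustrative only and is not part of the patent: it models the four-stage F/A/M/W pipeline described above, assumes a load's result can first feed an A stage three cycles after fetch, and stalls a dependent instruction accordingly.

```python
# Minimal sketch of the 4-stage pipeline described above: Fetch (F),
# ALU (A), Memory (M), Writeback (W). A load's result leaves the M stage
# at the end of cycle start+2, so it can first be consumed by an A stage
# in cycle start+3. Instruction names and the stall rule are assumptions.

def schedule(instructions):
    """instructions: list of (name, producer_it_depends_on) in program order."""
    data_ready = {}   # name -> first cycle its result may feed an A stage
    cycle = 0         # cycle in which the next instruction enters F
    for name, dep in instructions:
        # Stall (insert bubbles) until the consumer's A stage (cycle + 1)
        # is no earlier than the producer's data-ready cycle.
        while dep is not None and cycle + 1 < data_ready[dep]:
            cycle += 1
        data_ready[name] = cycle + 3
        print(f"{name}: F={cycle} A={cycle+1} M={cycle+2} W={cycle+3}")
        cycle += 1

# I2 uses the datum loaded by I1; I3 is independent, mirroring Table 1.
schedule([("I1", None), ("I2", "I1"), ("I3", None)])
# I1: F=0 A=1 M=2 W=3
# I2: F=2 A=3 M=4 W=5   <- one bubble: the load delay slot
# I3: F=3 A=4 M=5 W=6
```

Running it shows instruction 2 incurring exactly one bubble (the load delay slot) while the independent instruction 3 proceeds unstalled.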
  • NOP: No Operation instruction
  • a more efficient solution to handling the data dependency is to fill the load delay slot with a useful instruction.
  • Good optimizing compilers can usually accomplish this, especially if the load delay is only one instruction.
  • tasks performed by software are eventually implemented by hardware when the cost and feasibility of design permit.
  • the present invention is directed to a system and method that detects when a floating point load instruction is followed by a compute instruction having a data dependency with the floating point load instruction.
  • the instructions are then dispatched to a floating point queue where they wait to be sent to a floating point unit.
  • a memory load request is sent to a cache memory to retrieve data corresponding to the floating point load instruction.
  • when the data corresponding to the floating point load instruction is available from the cache memory, it is loaded into a load data queue, and both instructions are transferred to the floating point unit.
  • the compute instruction is then executed using the data from the load data queue with zero load latency.
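A minimal sketch of this queue arrangement follows; the names (`fpq`, `ldq`, `dispatch`, `transfer_to_fpu`) are mine rather than the patent's. The FPQ holds the load and its dependent compute instruction while the cache request proceeds in parallel, and both instructions move to the FPU together once the requested data sits in the LDQ.

```python
from collections import deque

# Illustrative model of the zero-load-latency mechanism described above.
# The FPQ holds dispatched FP instructions; the LDQ holds cache data that
# arrives while those instructions wait. All names are assumptions.

fpq = deque()   # floating point instruction queue (program order)
ldq = deque()   # load data queue (data arriving from the cache)

def dispatch(load, dependent_compute):
    """IU side: queue the load and its dependent compute instruction,
    and issue the cache request for the load's data in parallel."""
    fpq.append(load)
    fpq.append(dependent_compute)
    request_from_cache(load)            # overlaps with the FPQ waiting time

def request_from_cache(load):
    ldq.append(("data-for", load))      # models the cache fill into the LDQ

def transfer_to_fpu():
    """Once the load's data sits in the LDQ, the load and the compute
    instruction behind it move to the FPU in the same cycle, so the
    compute observes zero load latency."""
    if fpq and ldq:
        load, compute = fpq.popleft(), fpq.popleft()
        data = ldq.popleft()
        print(f"FPU executes {load} and {compute} using {data} in one transfer")

dispatch("I1: load fp6", "I2: add fp3, fp6, fp4")
transfer_to_fpu()
```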
  • the invention is further directed to a system and method for storing integer and floating point store data to maintain cache coherency in a cache memory.
  • This method comprises the following steps. Integer instructions are dispatched to a multi-stage integer pipeline in the integer unit. The integer instructions are then executed to generate integer store data, while floating point instructions are dispatched to a floating point queue. The integer store data is also transferred to the floating point queue. The floating point queue stores the floating point instructions and the integer store data in program order. The floating point instructions and integer store data are later transferred to the floating point unit, where the floating point instructions are executed out of program order with respect to the integer instructions to generate floating point store data. The floating point store data and the integer store data are then sent by the floating point unit to a store data queue. From there, all the store data is eventually stored in a cache memory according to the program order. In this manner, cache coherency of the floating point store data and the integer store data is maintained in the cache memory, as sketched below.
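The ordering property of this method can be sketched as follows; the queue names and payload strings are illustrative assumptions, not the patent's signals. The point is that store data reaches the cache strictly in the order in which it entered the FPQ, even though FP results are produced later than the integer store data.

```python
from collections import deque

# Illustrative sketch of the store-ordering path described above: both FP
# instructions and integer store data enter the FPQ in program order, and
# store data leaves for the cache through the SDQ in that same order.

fpq = deque()   # program-ordered: FP instructions + integer store data
sdq = deque()   # store data queue feeding the second-level cache

def dispatch_in_program_order(stream):
    for kind, payload in stream:          # kind: "fp" or "int-store"
        fpq.append((kind, payload))

def drain_to_cache():
    while fpq:
        kind, payload = fpq.popleft()
        if kind == "fp":
            payload = f"result-of({payload})"   # FP unit produces store data
        sdq.append(payload)
    while sdq:
        print("cache <-", sdq.popleft())        # writes reach cache in order

dispatch_in_program_order([
    ("int-store", "r6 -> 0(r10)"),
    ("fp",        "SWC1 fp6 -> 0(r10)"),
    ("int-store", "r7 -> 4(r10)"),
])
drain_to_cache()
```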
  • Figure 1 shows a high level representative block diagram of a processor system 100 according to the present invention.
  • Figure 2 shows an integer pipeline 200 for the integer unit 102 of Figure 1 according to the present invention.
  • Figure 3 shows the integer pipeline 200 of Figure 2 and a floating point pipeline 300 according to the present invention.
  • Figure 4 shows a representative flow diagram of the pipelines 200 and 300 and a GCACHE pipeline 400.
  • Detailed Description of the Preferred Embodiment
  • Register pressure can best be characterized by considering the register file as a hardware resource, and the demand for that resource as "register pressure."
  • Register pressure gets worse as both the cycle latency of individual instructions goes up (pipelining), and as the number of instructions dispatched per cycle goes up (superscalarity).
  • One way to mitigate this problem is to reduce the apparent latency of some instructions. It is much easier to reduce the apparent latency of loads than it is to reduce the apparent latency of the arithmetic operations themselves.
  • the present invention is thus directed to a computer system and method of operation that eliminates the presence of load delay slots for both integer and floating point loads.
  • the organization of the integer pipeline permits integer load operations to be directly followed by operations requiring the data of that load, thus yielding an integer load latency of zero.
  • decoupling of the computer system's floating point unit from its integer unit, and the specific structure and functionality of the floating point unit, permit the computer system to make data available for floating point loads in advance of the actual execution of those loads in the floating point unit to achieve a floating point load latency of zero.
  • FIG. 1 shows a high level representative block diagram of a processor system 100.
  • the following description of processor system 100 is provided by way of example.
  • the system and method of the present invention can be used in a processor system having various architecture schemes, as would be apparent to a person skilled in the relevant art.
  • the processor system 100 can be integrated on a single chip or on two or more chips, as would also be apparent to a person skilled in the relevant art.
  • the processor system 100 can be part of a multiprocessor computer system. Representative parameters and values of the preferred embodiment for the functional elements and blocks of Figure 1 are indicated in the following discussion and in Figure 1. While these are preferred in the disclosed implementation of the processor system 100, the present invention is not limited to these parameters and values, and instead extends to any parameters and values that produce the intended functionality, and equivalents thereof.
  • the processor system 100 comprises two main chips, an integer unit chip (IU) 102 and a floating point unit chip (FPU) 104. Additionally, the system comprises two global tag random access memory (RAM) chips (GTAG) 106 and two banks of static RAM chips making up an external global cache (GCACHE) 108.
  • IU: integer unit chip
  • FPU: floating point unit chip
  • GTAG: global tag random access memory
  • GCACHE: external global cache (banks of static RAM chips)
  • Instructions are fetched from an on-chip 16KB (kilobyte) instruction cache (ICACHE) 110.
  • This ICACHE 110 is direct mapped with 32B (Byte) lines. Four instructions (128 total bits) are fetched per cycle, as shown generally at buses 111.
  • the ICACHE 110 is virtually indexed and virtually tagged. Software is responsible for maintaining coherence.
  • the ICACHE 110 is refilled from the GCACHE 108 in 11 cycles via a pair of buses 109. The contents of the ICACHE 110 need not be a subset of the GCACHE 108.
  • the BCACHE 112 (branch prediction cache) is also direct mapped and contains 1K entries.
  • Instructions from the ICACHE 110 are buffered in an instruction buffer (IBUF) 114 and realigned before going to dispatch logic (means) 116 via buses 115. Up to four instructions chosen from, for example, two integer, two memory, and four floating point instruction types may be dispatched per cycle. Floating point instructions are dispatched via buses 117 into a floating point queue (FPQ) 118 where they can wait for resource contention and data dependencies to clear without holding up dispatching of integer instructions. In particular, the FPQ 118 decouples the FPU 104 from the IU 102 to hide the latency of the GCACHE 108, as is described in further detail below.
  • IBUF: instruction buffer
  • FPQ: floating point queue
  • Integer and memory instructions get their operands from a 13-port integer register (IREG) file 120.
  • a plurality of integer function units 122 comprise two integer ALUs, one shifter, and one multiply-divide unit, and are coupled to the IREG file 120 via buses 121.
  • the ALUs and shifter operate in one cycle.
  • the multiply-divide unit is iterative: 4/6 cycles for 32b/64b (bit) integer multiply, and 21 to 73 cycles for integer division depending on the value of the result. All iterative operations are fully interlocked for easy programming. Up to two integer operations may be initiated every cycle.
  • Memory instructions go through an address generation unit (AGEN), after which addresses are translated by a translation lookaside buffer (TLB) 126.
  • TLB: translation lookaside buffer
  • the TLB 126 is a three way set associative cache containing 384 entries.
  • the TLB 126 is dual-ported so that two independent memory instructions can be supported per cycle. TLB misses are serviced by a software handler (not shown).
  • Integer loads and stores go to an on-chip data cache (DCACHE) 128 via buses 129. It, too, is dual-ported to support two loads, or one load and one store per cycle.
  • the DCACHE 128 is 16KB direct-mapped with 32B lines and is refilled from the external cache in 7 cycles. It is virtually addressed and physically tagged.
  • the DCACHE 128 is a proper subset of the GCACHE 108 and hardware (not shown) maintains coherence.
  • the DCACHE 128 and GCACHE 108 comprise a split-level cache, where the DCACHE 128 serves as the first level cache and the GCACHE 108 serves as the second level cache.
  • the GCACHE 108 is two-way interleaved to support two 64b loads or stores per cycle.
  • the GCACHE 108 is configurable from one to 16MB in size.
  • the GCACHE 108 is four-way set associative, each cache line containing four sectors or sub-blocks each with its own state bits.
  • the line size is configurable as 128B, 256B, or 512B, which corresponds to sector sizes of 32B, 64B or 128B, respectively.
  • the external cache is implemented using a pair of custom tag RAMs (GTAG 106) and from 8 to 36 commodity synchronous static RAMs (GCACHE 108). Specific timing for the components, such as cache refill time, depends on the system implementation (i.e., it is application specific).
  • the GCACHE 108 and GTAG 106 are addressed by the TLB 126 via buses 119.
  • a further description of the split-level cache is found in the above cross-referenced patent application Ser. No. , titled "Split-Level Cache," which is incorporated herein by reference.
  • Floating point loads are done via a load data queue (LDQ) 130 coupled to buses 109.
  • Floating point stores go off the FPU chip 104 via buses 131 to a store data queue (SDQ) 132 and then to the GCACHE 108 via buses 133, after translation to physical addresses by the TLB 126; they bypass the on-chip DCACHE 128.
  • LDQ load data queue
  • SDQ store data queue
  • the FPU 104 is coupled to two execution datapaths (buses 109), each capable of double precision fused multiply-adds, simple multiplies, adds, divides, square-roots, and conversions.
  • a twelve-port floating point register (FREG) file 134 feeds the execution datapaths, which are themselves fully bypassed, via buses 135.
  • a plurality of floating point functional units are shown generally at 136.
  • the floating point functional units comprise various types. Short operations, comprising compares, moves, and the like, take one cycle. Medium operations, comprising adds, multiplies, fused multiply-adds, and the like, take four cycles and are fully pipelined. Long operations, comprising divides, square-roots and the like, are iterative. Divides take 14 and 20 cycles for single and double precision, respectively. The two datapaths are completely symmetric and indistinguishable to software. The compiler simply knows that it can schedule two floating point operations per cycle.
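For reference, the latencies stated above can be collected into a small lookup table. The grouping and function below are my own illustration; the operation names are assumptions rather than the patent's mnemonics.

```python
# Latency table built from the figures stated above (in cycles).
FP_LATENCY = {
    "compare": 1, "move": 1,             # short operations, one cycle
    "add": 4, "multiply": 4, "madd": 4,  # medium operations, fully pipelined
    "div.single": 14, "div.double": 20,  # long operations, iterative
}

def result_ready(op, issue_cycle):
    """Cycle at which the result of `op`, issued at `issue_cycle`, is ready."""
    return issue_cycle + FP_LATENCY[op]

# With two symmetric datapaths, two operations may issue in the same cycle;
# pipelined medium ops allow two more to issue the very next cycle.
print(result_ready("madd", 0), result_ready("div.double", 0))  # -> 4 20
```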
  • An integer pipeline 200 for the IU 102 is shown in Figure 2.
  • a Fetch (F) stage 202 accesses the ICACHE 110 and the BCACHE 112.
  • a Decode (D) stage 204 makes dispatch decisions based on register scoreboarding and resource reservations, and also reads the IREG register file 120.
  • An Address (A) stage 206 computes the effective addresses of loads and stores.
  • An Execute (E) stage 208 evaluates the ALUs 122, accesses the DCACHE 128 and TLB 126, resolves branches and handles all exceptions.
  • a Writeback (W) stage 210 updates the IREG register file 120.
  • This pipeline 200 differs from a traditional RISC pipeline in two ways: there are actually four pipelines — two integer ALU pipelines and two load/store pipelines; and ALU operations occur in parallel with data cache accesses.
  • the traditional RISC pipeline has a load shadow: the instruction cycle immediately after a load cannot use the result of the load. This was found acceptable for scalar pipelines because the compiler can frequently put some independent instruction after a load.
  • the compiler would have to find four independent instructions to cover the same shadow in this superscalar pipeline, a rather unlikely scenario.
  • the load shadow is removed for load-to-ALU operations, but this creates an ALU shadow for load addresses.
  • the inventors found this trade-off to be advantageous because load-use dependencies occur more frequently than compute-load/store dependencies, particularly in branchy integer code which is not amenable to superscalar speedup.
  • a new register-register addressing mode for floating point loads and stores helps reduce the need for precalculated addresses.
  • a disadvantage of putting the ALU further down the pipeline is that it slows down branch resolution. This was mitigated by having branch prediction, and also by allowing the delay slot of branches to execute in parallel with the branch itself.
  • Figure 3 shows integer pipeline 200, and a floating point pipeline 300 comprising stages S-N.
  • the D stage 204 determines which of those instructions can actually be dispatched into the next stage in the next cycle. For instance, the system cannot do four loads in one cycle, so if four loads are present, only two could be dispatched.
  • in the A stage 206, the addresses of any memory operations (loads and stores) are generated. Then, the ALU operations are performed during the E stage 208 and integer store state changes are made in the DCACHE 128.
  • Instructions are committed at stage W 210, and the determination of which instructions to commit is made in stage E 208.
  • Floating point (FP) instructions are committed by writing them into the FPQ 118 for later execution.
  • Integer loads and ALU operations are committed by writing their results into the IREG file 120.
  • Integer stores write their data into the DCACHE 128 in the E stage 208, but are committed in the W stage 210 both by writing the store data into the FPQ 118 and by marking the DCACHE locations already written as valid. Integer stores which are not committed (perhaps because an earlier instruction has taken an exception) have the DCACHE locations they wrote left marked invalid.
  • the FPQ 118 is guaranteed to only have instructions in it that are committed.
  • the processor knows that when an instruction is read from the bottom of the FPQ 118, it can be executed by an FP functional unit 136. Its results can be written back to the FREG file 134 or stored into the GCACHE 108, or the like, depending on whatever the instruction calls for.
  • integer instructions can execute out of order with respect to FP instructions. However, the results of integer instructions actually change the state of the processor in the W stage 210.
  • the FP instructions will go into the FPQ at that point. Later on, they'll be pulled out of the FPQ in an S stage and sent to the FPU to "drain" through the rest of the floating point pipeline 300: the T stage, R stage, X, Y, Z, U stage and finally, V stage.
  • the word "drain" means that committed instructions are issued in turn to the FPU and will pass completely through the FP pipeline. At that time, in the V stage, they will finally commit their results, changing the state of the processor.
  • in stage T, the FP instructions are transferred from the IU chip 102 to the FPU chip 104.
  • data for the FP load instruction(s) being transferred is written into the LDQ 130.
  • in stage R, the operands for the FP instructions are read from the FREG file 134 and transferred into the FP units 136.
  • FP load data is written into the FREG file during this stage as well.
  • FP CC: floating point condition code
  • the bypass bubbles shown in stages R, X and U represent the ability of the FP units 136 to send results at the end of stages X and U back to the inputs of the same or other FP units at the end of stage R.
  • the bypassing is represented by dashed lines 302 and 304. In this manner, execution results are immediately available for subsequent instructions without having to wait the additional time to read new results out of the FREG file.
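A toy model of this operand bypass (register names and values assumed): the R-stage input multiplexer prefers a result arriving on the forwarding path over the copy in the FREG file, which may not yet reflect that result.

```python
# Illustrative operand-select logic for the bypass described above: results
# appearing at the end of stages X and U are fed back to the R-stage inputs,
# so a consumer need not wait for the FREG file write to complete.

freg = {"fp4": 2.0, "fp6": 0.0}      # architectural FP register file (stale fp6)
bypass_bus = {"fp6": 3.5}            # result leaving stage X/U this cycle

def read_operand(reg):
    # Multiplexer: prefer the forwarded value over the FREG copy.
    return bypass_bus[reg] if reg in bypass_bus else freg[reg]

print(read_operand("fp6") + read_operand("fp4"))   # 5.5, using forwarded fp6
```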
  • More than one floating point instruction can be sent from the IU to the FPU during stage T as discussed above.
  • the first instruction (I1) is a floating point load that loads data into a sixth register of the FREG (floating point register) file.
  • the second instruction (I2) is a floating point add instruction that adds the operand data stored at floating point registers 6 and 4, and stores the result in floating point register 3.
  • a data dependency exists between I1 and I2. Because I1 precedes I2 in the instruction stream, in prior MIPS RISC architectures for example, a load delay slot would be necessary between I1 and I2, since I2 must wait for fp6 to be loaded by I1 before the addition can be done.
  • both I1 and I2 can be dispatched from the FPQ to the FPU in the same cycle.
  • the data for both I1 and I2 must be available to the FPU before I1 and I2 can be executed by the FPU in order to achieve a load latency of zero.
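The dispatch condition just stated can be written as a small predicate. This is an illustrative restatement with hypothetical names: the dependent pair leaves the FPQ in one cycle only when the compute instruction reads the load's destination register and that datum is already sitting in the LDQ.

```python
# Worked version of the I1/I2 example above (register names illustrative):
#   I1: load fp6 from memory        (floating point load)
#   I2: fp3 = fp6 + fp4             (compute, depends on I1)

def pair_ready(load_dest, compute_sources, ldq):
    """True when the dependent pair may leave the FPQ in the same cycle:
    the compute reads the load's destination, and that datum is in the LDQ."""
    return load_dest in compute_sources and load_dest in ldq

ldq = {"fp6": 1.25}                                       # GCACHE fill arrived
print(pair_ready("fp6", {"fp6", "fp4"}, ldq))             # True: dispatch both
print(pair_ready("fp6", {"fp6", "fp4"}, {}))              # False: pair waits
```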
  • Figure 4 shows a representative flow diagram of the pipelines 200 and 300 and a GCACHE pipeline 400 to further illustrate the impact of the present invention's zero load latency for floating point instructions.
  • Figure 4 also illustrates how cache coherency of integer stores is maintained in the GCACHE as a result of the FPU being decoupled from the IU.
  • the GCACHE pipeline 400 comprises six stages: G, H, I, J, K and L.
  • in order for the FPU to have the load data destined for fp6 available in advance of receiving the instructions corresponding to that data, the IU must send a request to the global cache when the floating point load instruction I1 is first detected in the instruction stream by the IU.
  • after detecting the floating point load instruction I1, the IU performs two operations in parallel: (1) the IU sends the floating point load instruction I1 to the floating point dispatch logic, as shown generally at 402, and (2) the IU makes a request to the GCACHE to locate the data corresponding to the floating point load instruction I1 prior to the floating point dispatch logic's dispatching of that floating point instruction into the FPQ, as shown generally at 404. In this manner, the data corresponding to the floating point load instruction I1 can be located and sent to the FPU while the instruction itself waits in the FPQ.
  • by delaying floating point instructions in the FPQ at 406 before they are sent to the FPU, the FPU is said to be "decoupled" from the IU pipeline.
  • the decoupling of the FPU from the IU pipeline hides the GCACHE latency of stages G-K, which is the time required for the GCACHE to access the GCACHE tag RAM at 410 and locate the data corresponding to an FP load instruction in the GCACHE, as shown at 412.
  • the "set match" information indicates in which set of the GCACHE the physical data resides, and whether the data is actually in the GCACHE (i.e., "hit" or "miss"). This information is also sent to the FP dispatcher on the IU chip, indicating that the associated load instruction I1 in the FPQ can be transferred to the FPU chip for execution, as shown at 413.
  • the load instruction I1 is "cleared" for transfer to the FPU in this manner, and because the FP dispatcher knows the data dependency between I1 and the following compute instruction I2 waiting behind I1 in the FPQ, I2 can be transferred to the FPU in the same cycle that I1 is transferred.
  • the FPQ allows both instructions to be dispatched together as if the load had no latency at all (i.e., zero latency).
  • the load data is written into the LDQ at 416 in stage L.
  • the stages below bold line 434 can be delayed in time.
  • blocks 418 and 416 can be lengthened to further decouple the execution of FP instructions from the IU pipeline.
  • a delay such as a GCACHE miss is an example of why the processor would delay the operations below line 434.
  • the FPQ is read during stage S after the processor determines that the next FP instruction can be executed (i.e., when the load data is available or when the necessary functional unit resources become available), as shown at 420.
  • the instructions (I1 and I2 in this example) are then transferred to the FPU.
  • both the FREG file and LDQ are read, as shown at 424 and 426, respectively.
  • the FPU execution begins, as shown at 428.
  • the ADD instruction I2 can use the operand fp4 from the FREG file and the data from the LDQ, which is the other operand, fp6, loaded there by I1. Multiplexing circuitry is used to route the appropriate data, as represented generally at 430. Of course, the load instruction also writes the data into fp6 of the FREG file, as shown at 432.
  • the present invention achieves out-of-order execution of FP instructions with respect to integer instructions, thereby hiding the latency of the GCACHE and yielding zero load latency without the use of a load delay slot.
  • This section describes a further aspect of the invention, which is the ability of the processor system to maintain cache coherency for data stores in the GCACHE 108.
  • integer instructions are first dispatched to the IU 102.
  • the integer instructions are then executed to generate integer store data, while floating point instructions are dispatched to a FPQ 118.
  • the integer store data is also transferred to the FPQ 118.
  • the FPQ 118 stores the floating point instructions and the integer store data in program order.
  • the floating point instructions and integer store data are transferred to the FPU 104, where the floating point instructions are executed out of program order with respect to the integer instructions to generate floating point store data.
  • the floating point store data and the integer store data are then sent by the FPU 104 to the SDQ 132. From there, all the store data, both integer and floating point, is eventually stored in the GCACHE 108 according to the program order. In this manner, cache coherency of the floating point store data and the integer store data is maintained in the GCACHE 108.
  • in Example 1, an integer store word (SW) instruction wants to store the contents of integer register file location 6 (r6) at the address in general purpose register 10, offset by zero (0(r10)). That SW instruction is directly followed by an integer load word (LW) instruction that wants to load the data pointed to by general purpose register 10 into integer register 7 (r7).
  • LW: integer load word
  • the two instructions copy the contents of r6 into r7.
  • in Example 2, a floating point store word (SWC1) instruction wants to store the contents of FP register file location 6 (fp6) at the address in general purpose register 10, offset by zero (0(r10)). That SWC1 instruction is directly followed by an FP load word (LWC1) instruction that wants to load the data pointed to by general purpose register 10 into FP register 7 (fp7).
  • LWC1: FP load word
  • these two instructions copy the contents of fp6 into fp7.
  • the FP load is stalled until a store address queue (SAQ) match logic block (not shown in Figure 1) signals that the FP store has been written to the GCACHE.
  • SAQ: store address queue
  • in Example 3, an FP store instruction is followed by an integer load instruction. First, all FP stores invalidate the word(s) of the DCACHE that they map to in 32b blocks. Second, subsequent integer loads from these invalidated words cause DCACHE misses. The DCACHE blocks are then refilled from the GCACHE, and the SAQ match logic keeps this in order.
  • in Example 4, an integer store instruction is followed by an FP load instruction.
  • all integer stores are made to both the DCACHE and the GCACHE. That is, all integer store writes to the first level cache (i.e., the DCACHE) are also written to the second level (i.e., the GCACHE).
  • the SAQ match logic that interlocks the above-described FP store → FP load sequence also interlocks the integer store → FP load sequence. The placing of integer stores in the FPQ in order with FP stores is described further below.
  • a further aspect of the present invention is that integer store data is sent through the FPU and into the SDQ, rather than going directly into the GCACHE. Because the IU executes integer instructions out of order, and the FPU only loads data from the GCACHE and not the DCACHE, those instructions must be stored in the correct order back into the GCACHE. Thus, by sending those results as store data through the FPU, IU store data can be reordered with FP store data. IU stores are written to the FPQ at 406 in order with respect to FP instructions from the instruction stream, and are stored in the FPQ as pseudo-FP instructions.
  • IU stores are read from the FPQ at 420, and rather than being read into the FREG file at 422, they are sent to the multiplexer circuitry 430, as shown at 436. At stage X, the IU stores are written into the SDQ, as shown at 438.
  • the SDQ is read as early as the end of stage X, as shown at 440; the data is sent off the FPU chip and arrives at the GCACHE at stage Z, as shown at 442. As noted above, however, the data read from the SDQ is coordinated with the SAQ. Thus, an IU store will wait in the SDQ until the corresponding store address, indicating the location where that data is to be stored in the GCACHE, reaches the bottom of the SAQ.
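A sketch of this SDQ/SAQ coordination, using an assumed tagging scheme (the patent does not spell one out): a datum leaves the SDQ for the GCACHE only when the matching address entry has reached the bottom of the SAQ.

```python
from collections import deque

# Illustrative interlock between the store data queue (SDQ) and the store
# address queue (SAQ) described above. Tags pair each datum with the
# address of the same store; both queues drain in program order.

sdq = deque([("s1", "r6-data"), ("s2", "fp6-data")])   # data, in order
saq = deque([("s1", 0x1000), ("s2", 0x1008)])          # addresses, in order

def drain_one_store():
    # Commit only when the tags at the bottoms of both queues match,
    # i.e. the store's address has reached the bottom of the SAQ.
    if sdq and saq and sdq[0][0] == saq[0][0]:
        tag, data = sdq.popleft()
        _, addr = saq.popleft()
        print(f"GCACHE[{addr:#x}] <- {data}  (store {tag})")
    else:
        print("store waits: its address is not yet at the bottom of the SAQ")

drain_one_store()   # s1 commits
drain_one_store()   # s2 commits
```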
  • the transfer of FP stores from the SDQ to the GCACHE functions similarly to that of IU stores. The difference is where the FP store data comes from.
  • the FP store data originates from the FREG file or directly from the FPU pipe via the bypass logic discussed above in connection with Figure 3.
  • Although integer stores can complete far in advance of any floating point stores resulting from neighboring instructions in the instruction stream, they are sent to the second level cache in order, relative to those floating point stores.
  • the present invention thus maintains cache coherency for all data stores to the second level cache.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

A system and method are disclosed for performing floating point loads with zero load latency. The processor detects when a floating point load instruction is followed by a compute instruction having a data dependency on the floating point load instruction. The instructions are then dispatched to a floating point queue where they wait to be sent to a floating point unit. Meanwhile, a memory load request is sent to a cache memory to retrieve the data corresponding to the floating point load instruction. When the data corresponding to the floating point load instruction is available from the cache memory, it is loaded into a load data queue, and both instructions are transferred to the floating point unit. The compute instruction is then executed using the data from the load data queue with zero load latency. In addition, the processor system can store integer and floating point data in a reordered manner so as to maintain coherency in the cache memory.
PCT/US1994/014304 1993-12-15 1994-12-15 Zero load latency for floating point load instructions using a load data queue WO1995016955A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16700493A 1993-12-15 1993-12-15
US08/167,004 1993-12-15

Publications (1)

Publication Number Publication Date
WO1995016955A1 (fr) 1995-06-22

Family

ID=22605554

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1994/014304 WO1995016955A1 (fr) 1993-12-15 1994-12-15 Zero load latency for floating point load instructions using a load data queue

Country Status (1)

Country Link
WO (1) WO1995016955A1 (fr)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0101596A2 * 1982-08-16 1984-02-29 Hitachi, Ltd. Data processor with operation units operating in parallel
EP0331372A2 * 1988-02-29 1989-09-06 Silicon Graphics, Inc. Method and apparatus using exception prediction in floating point arithmetic to control multiple processors
EP0437044A2 * 1989-12-20 1991-07-17 International Business Machines Corporation Data processing system with instruction tagging apparatus
EP0436092A2 * 1989-12-26 1991-07-10 International Business Machines Corporation Out-of-sequence instruction fetch controls for a data processing system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Handling of Fetches Subsequent to Unexecuted Stores", IBM TECHNICAL DISCLOSURE BULLETIN., vol. 28, no. 7, December 1985 (1985-12-01), NEW YORK US, pages 3173 - 3174 *
V. G. OKLOBDZIJA: "Issues in CPU-Coprocessor Communication and Synchronization", MICROPROCESSING AND MICROPROGRAMMING., vol. 24, no. 1/5, August 1988 (1988-08-01), AMSTERDAM NL, pages 695 - 700 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008118949A1 (fr) * 2007-03-28 2008-10-02 Qualcomm Incorporated System and method for executing instructions prior to an execution stage in a processor
US8127114B2 (en) 2007-03-28 2012-02-28 Qualcomm Incorporated System and method for executing instructions prior to an execution stage in a processor
KR101119612B1 (ko) 2012-03-22 Qualcomm Incorporated System and method for executing instructions prior to an execution stage in a processor

Similar Documents

Publication Publication Date Title
US5537538A (en) Debug mode for a superscalar RISC processor
EP0547240B1 (fr) RISC microprocessor architecture with fast interrupt and exception mode
EP1385085B1 (fr) High performance RISC microprocessor architecture
US5557763A (en) System for handling load and/or store operations in a superscalar microprocessor
US8019975B2 (en) System and method for handling load and/or store operations in a superscalar microprocessor
US5751983A (en) Out-of-order processor with a memory subsystem which handles speculatively dispatched load operations
US5974523A (en) Mechanism for efficiently overlapping multiple operand types in a microprocessor
US5961629A (en) High performance, superscalar-based computer system with out-of-order instruction execution
US5832292A (en) High-performance superscalar-based computer system with out-of-order instruction execution and concurrent results distribution
US5778210A (en) Method and apparatus for recovering the state of a speculatively scheduled operation in a processor which cannot be executed at the speculated time
EP0762270B1 (fr) Microprocessor having load/store operations to/from a plurality of registers
JP2694124B2 (ja) Processing system and method of operation
US5812812A (en) Method and system of implementing an early data dependency resolution mechanism in a high-performance data processing system utilizing out-of-order instruction issue
EP0690372B1 (fr) Superscalar microprocessor instruction pipeline with instruction dispatch and distribution control
US5805916A (en) Method and apparatus for dynamic allocation of registers for intermediate floating-point results
US5850563A (en) Processor and method for out-of-order completion of floating-point operations during load/store multiple operations
US7127591B2 (en) Instruction control device and method therefor
WO1995016955A1 (fr) Zero load latency for floating point load instructions using a load data queue

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE

WR Later publication of a revised version of an international search report
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase