US20120042151A1 - Processor having execution core sections operating at different clock rates - Google Patents

Info

Publication number
US20120042151A1
Authority
US
United States
Prior art keywords
frequency
integrated circuit
instruction
data
clock
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/879,872
Inventor
David J. Sager
Thomas D. Fletcher
Glenn J. Hinton
Michael D. Upton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority claimed from US09/527,065 (US6256745B1)
Application filed by Individual
Priority to US12/879,872 (US20120042151A1)
Publication of US20120042151A1
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3842 Speculative instruction execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/04 Generating or distributing clock signals or signals derived directly therefrom
    • G06F1/08 Clock generators with changeable or programmable clock frequency
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7828 Architectures of general purpose stored program computers comprising a single central processing unit without memory
    • G06F15/7832 Architectures of general purpose stored program computers comprising a single central processing unit without memory on one IC chip (single chip microprocessors)
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 Operand accessing
    • G06F9/383 Operand prefetching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838 Dependency mechanisms, e.g. register scoreboarding
    • G06F9/384 Register renaming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3861 Recovery, e.g. branch miss-prediction, exception handling
    • G06F9/3863 Recovery, e.g. branch miss-prediction, exception handling using multiple copies of the architectural state, e.g. shadow registers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G06F9/3869 Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking

Definitions

  • the present invention relates generally to the field of high speed processors, and more specifically to a processor including a sub-core operating at a higher frequency than the rest of the execution core, and also to a replay architecture for facilitating data-speculating operation of the sub-core.
  • FIG. 1 illustrates a microprocessor 100 according to the prior art.
  • the microprocessor includes an I/O ring which operates at a first clock frequency, and an execution core which operates at a second clock frequency.
  • the Intel486 DX2 may run its I/O ring at 33 MHz and its execution core at 66 MHz for a 2:1 ratio (1/2 bus).
  • the IntelDX4 may run its I/O ring at 25 MHz and its execution core at 75 MHz for a 3:1 ratio (1/3 bus).
  • the Intel Pentium® OverDrive® processor may operate its I/O ring at 33 MHz and its execution core at 82.5 MHz for a 2.5:1 ratio (5/2 bus).
  • a distinction may be made between “I/O operations” and “execution operations”.
  • the I/O ring performs I/O operations such as buffering, bus driving, receiving, parity checking, and other operations associated with communicating with the off-chip world
  • the execution core performs execution operations such as addition, multiplication, address generation, comparisons, rotation and shifting, and other “processing” manipulations.
  • the processor 100 may optionally include a clock multiplier. In this mode, the processor can automatically set the speed of its execution core according to an external, slower clock provided to its I/O ring. This may reduce the number of pins needed.
  • the processor may include a clock divider, in which case the processor sets the I/O ring speed responsive to an external clock provided to the execution core.
  • clock mult/div will be used herein to denote either a multiplier or divider as suitable.
  • the skilled reader will comprehend how external clocks may be selected and provided, and from there multiplied or divided. Therefore, specific clock distribution networks, and the details of clock multiplication and division, will not be expressly illustrated.
  • the clock mult/div units need not necessarily be limited to integer multiple clocks, but can perform e.g. 2:5 clocking
  • the clock mult/div units need not necessarily even be limited to fractional bus clocking, but can, in some embodiments, be flexible, asynchronous, and/or programmable, such as in providing a P/Q clocking scheme.
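
The ratios quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the disclosure: the helper name `derived_clock_mhz` and the choice of exact rational arithmetic are assumptions made for illustration, and the frequencies are the nominal values quoted in the text.

```python
from fractions import Fraction

def derived_clock_mhz(base_mhz, p, q=1):
    """Hypothetical helper: scale a base clock by the rational ratio p/q."""
    return Fraction(base_mhz) * Fraction(p, q)

# I/O-ring clock and core:bus ratio for each example quoted in the text.
examples = {
    "2:1 ratio": derived_clock_mhz(33, 2),       # 66 MHz core
    "3:1 ratio": derived_clock_mhz(25, 3),       # 75 MHz core
    "2.5:1 ratio": derived_clock_mhz(33, 5, 2),  # 82.5 MHz core
}

for label, mhz in examples.items():
    print(label, float(mhz), "MHz")
```

Using a P/Q pair rather than a floating-point multiplier keeps non-integer ratios such as 5/2 exact, which is one plausible reason a P/Q scheme is convenient to reason about.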
  • the execution latency of an instruction may be defined as the time from when its input operands must be ready for it to execute until its result is ready to be used by another instruction.
  • a part of a program contains a sequence of N instructions, $I_1, I_2, I_3, \ldots, I_N$.
  • $I_{n+1}$ requires, as part of its inputs, the result of $I_n$, for all $n$ from 1 to $N-1$.
  • This part of the program may also contain any other instructions.
  • T remains a lower bound for the time to execute this part of this program.
  • let $f_j$ be the number of instructions that are in flight during cycle $j$.
  • define the parallelism $P$ as the average number of instructions in flight for the program, or $P = \frac{1}{M}(f_1 + f_2 + f_3 + \cdots + f_M)$, where $M$ is the number of cycles over which the average is taken.
  • L is the average latency for a program
  • B is its average bandwidth
  • P is its average Parallelism.
  • B tells how fast we execute the program. It is instructions per second. If the program has N instructions, it takes N/B seconds to execute it. The goal of a faster processor is exactly the goal of getting B higher.
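
The quantities L, B, and P can be tied together with a small numeric sketch. Two things here are assumptions rather than statements from the text above: the relation P = L × B (Little's law) is used to link the three averages, and bandwidth is measured per cycle rather than per second to keep the numbers small. The in-flight counts are invented for illustration.

```python
def average_parallelism(in_flight_per_cycle):
    """P: average number of instructions in flight over the observed cycles."""
    return sum(in_flight_per_cycle) / len(in_flight_per_cycle)

def bandwidth(parallelism, avg_latency_cycles):
    """B in instructions per cycle, assuming Little's law P = L * B."""
    return parallelism / avg_latency_cycles

def execution_time(n_instructions, bandwidth_value):
    """Time to execute N instructions at bandwidth B, i.e. N / B."""
    return n_instructions / bandwidth_value

f = [2, 4, 4, 2]                      # hypothetical in-flight counts, one per cycle
P = average_parallelism(f)            # 3.0 instructions in flight on average
B = bandwidth(P, avg_latency_cycles=2.0)  # 1.5 instructions per cycle
print(P, B, execution_time(6, B))     # 6 instructions take 4.0 cycles
```

Under this relation, B can be raised either by increasing P (more instructions in flight) or by lowering L (shorter latency), which is the trade-off the surrounding discussion turns on.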
  • Control speculating processors include mechanisms for recovering from mispredicted branches, to maintain program and data integrity as though no speculation were taking place.
  • FIG. 2 illustrates a conventional data hierarchy.
  • a mass storage device, such as a hard drive, stores the programs and data (collectively “data”) which the computer system (not shown) has at its disposal.
  • a subset of that data is loaded into memory such as DRAM for faster access.
  • a subset of the DRAM contents may be held in a cache memory.
  • the cache memory may itself be hierarchical, and may include a level two (L2) cache, and then a level one (L1) cache which holds a subset of the data from the L2.
  • the physical registers of the processor contain a smallest subset of the data.
  • various algorithms may be used to determine what data is stored in what levels of this overall hierarchy. In general, it may be said that the more recently a datum has been used, or the more likely it is to be needed soon, the closer it will be held to the processor.
  • data speculating processors must have some mechanism for recovering from having incorrectly assumed that data values are correct, to maintain program and data integrity as though no data speculation were taking place. Data speculation is made more difficult by the hierarchical storage system, especially when it is coupled with a microarchitecture which uses different clock frequencies for various portions of the execution environment.
  • every processor is adapted to execute instructions of its particular “architecture”. In other words, every processor executes a particular instruction set, which is encoded in a particular machine language.
  • Some processors, such as the Pentium Pro processor, decode those “macro-instructions” down into “micro-instructions” or “uops”, which may be thought of as the machine language of the micro-architecture and which are directly executed by the processor's execution units.
  • other processors, such as those of the RISC variety, may directly execute their macro-instructions without breaking them down into micro-instructions.
  • the term “instruction” should be considered to cover any or all of these cases.
  • the invention provides a microprocessor having two or more levels of execution sub-core each clocked at different frequencies.
  • the processor may also have an I/O ring, which may be clocked at yet another frequency.
  • Clock division or multiplication may be used between the various levels, to derive the various clocks from a common clock, such as the I/O clock, which may be provided from off-chip.
  • Having the different clock domains enables the designer to make trade-offs in the design of various components of the chip, such as individual execution units, instruction fetch and decode units, register files, caches, and the like.
  • selected components can be designed to operate at a very high frequency, without requiring the entire chip to be designed to operate at this frequency.
  • Less latency-critical units, or those whose required throughput can be obtained by twice as many units running at half the clock speed, can be relegated to the slower sections of the chip, easing their design considerably.
  • FIG. 1 is a block diagram illustrating a prior art processor having an I/O ring and an execution core operating at different clock speeds.
  • FIG. 2 demonstrates a hierarchical memory structure such as is well known in the art.
  • FIG. 3 is a block diagram illustrating the processor of the present invention, and showing a plurality of execution core sections each having its own clock frequency.
  • FIG. 4 is a block diagram illustrating a mode in which the processor of FIG. 3 includes yet another sub-core with its own clock frequency.
  • FIG. 5 is a block diagram illustrating a different mode in which the sub-core is not nested as shown in FIG. 4 .
  • FIG. 6 is a block diagram illustrating a partitioning of the execution core.
  • FIG. 7 is a block diagram illustrating one embodiment of the replay architecture of the present invention, which permits data speculation.
  • FIG. 8 illustrates one embodiment of the checker unit of the replay architecture.
  • FIG. 3 illustrates the high-speed sub-core 205 of the processor 200 of the present invention.
  • the high-speed sub-core includes the most latency-intolerant portions of the particular architecture and/or microarchitecture employed by the processor. For example, in an Intel Architecture processor, certain arithmetic and logic functions, as well as data cache access, may be the most unforgiving of execution latency.
  • the processor 200 communicates with the rest of the system (not shown) via the I/O ring 215 . If the I/O ring operates at a different clock frequency than the latency-tolerant execution core, the processor may include a clock mult/div unit 220 which provides clock division or multiplication according to any suitable manner and conventional means. Because the latency-intolerant execution sub-core 205 operates at a higher frequency than the rest of the latency-tolerant execution core 210 , there may be a mechanism 225 for providing a different clock frequency to the latency-intolerant execution sub-core 205 . In one mode, this is a clock mult/div unit 225 .
  • FIG. 4 illustrates a refinement of the invention shown in FIG. 3 .
  • the processor 250 of FIG. 4 includes the I/O ring 215 , clock mult/div unit 220 , and latency-tolerant execution core 210 .
  • this improved processor 250 includes a latency-intolerant execution sub-core 255 and an even more latency-critical execution sub-core 260 , with their clock mult/div units 265 and 270 , respectively.
  • an Intel Architecture processor may advantageously include only its most common integer ALU functions and data storage portion of its data cache in the innermost sub-core.
  • the innermost sub-core may also include the register file; although, for reasons including those stated above concerning FIG.
  • the register file might not technically need to operate at the highest clock frequency, its design may be simplified by including it in a more inner sub-core than is strictly necessary. For example, it may be more efficient to make twice as fast a register file with half as many ports, than vice versa.
  • the processor performs an I/O operation at the I/O ring and at the I/O clock frequency, such as to bring in a data item not presently available within the processor.
  • the latency-tolerant execution core may perform an execution operation on the data item to produce a first result.
  • the latency-intolerant execution sub-core may perform an execution operation on the first result to produce a second result.
  • the latency-critical execution sub-core may perform a third execution operation upon the second result to produce a third result.
  • the flow of execution need not necessarily proceed in the strict order of the hierarchy of execution sub-cores. For example, the newly read in data item could go immediately to the innermost core, and the result could go from there to any of the core sections or even back to the I/O ring for writeback.
  • FIG. 5 shows an embodiment which is slightly different than that of FIG. 4 .
  • the processor 280 includes the I/O ring 215 , the execution cores 210 , 255 , 260 , and the clock mult/div units 220 , 265 , 270 .
  • the latency-critical execution sub-core 260 is not nested within the latency-intolerant execution core 255 .
  • the clock mult/div units 265 and 270 perform different ratios of multiplication to enable their respective cores to run at different speeds.
  • either of these cores might be clock-interfaced directly to the I/O ring or to the external world.
  • clock mult/div units may not be required, if separate clock signals are provided from outside the processor.
  • the different speeds at which the various layers of sub-core operate may be in-use, operational speeds. It is known, for example in the Pentium processor, that certain units may be powered down when not in use, by reducing or halting their clock; in this case, the processor may have the bulk of its core running at 66 MHz while a sub-core such as the FPU is at substantially 0 MHz. While the present invention may be used in combination with such power-down or clock throttling techniques, it is not limited to such cases.
  • non-integer ratios may be applied at any of the boundaries, the combinations of clock ratios between the various rings are almost limitless, and different baseline frequencies could be used at the I/O ring. It is also possible that the clock multiplication factors might not remain constant over time. For example, in some modes, the clock multiplication applied to the innermost sub-core could be adjusted up and down, for example between 3× and 1×, or between 2× and 0×, or the like, when the higher frequency (and therefore higher power consumption and heat generation) is not needed. Also, the processor may be subjected to clock throttling or clock stop, in whole or in part. Or, the I/O clock might not be a constant frequency, in which case the other clocks may either scale accordingly, or they may implement some form of adaptive P/Q clocking scheme to maintain their desired performance level.
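
For the nested arrangement, each domain's clock can be derived from the one outside it, and a multiplier can be changed at run time. The sketch below is illustrative only: the function name `domain_clocks`, the domain names, the 50 MHz base clock, and the particular multipliers are all assumptions, not values from the disclosure.

```python
from fractions import Fraction

def domain_clocks(io_mhz, ratios):
    """Derive each nested domain's clock from the domain outside it.

    ratios is an ordered list of (domain_name, multiplier) pairs, outermost
    first; each multiplier is relative to the previous domain's clock.
    """
    clocks = {"io_ring": Fraction(io_mhz)}
    current = Fraction(io_mhz)
    for name, ratio in ratios:
        current *= Fraction(ratio)
        clocks[name] = current
    return clocks

# Hypothetical configuration: I/O ring -> core -> sub-core -> innermost sub-core.
full_speed = domain_clocks(50, [("core", 2), ("sub_core", 2),
                                ("inner_sub_core", Fraction(3, 2))])
# Same chain with the innermost multiplier lowered to 1x when speed is not needed.
throttled = domain_clocks(50, [("core", 2), ("sub_core", 2),
                               ("inner_sub_core", 1)])

print(float(full_speed["inner_sub_core"]), float(throttled["inner_sub_core"]))
```

In the non-nested arrangement the two sub-cores would each take their multiplier relative to a shared parent clock instead of to one another; that variant would need two independent calls rather than one chain.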
  • FIG. 6 illustrates somewhat more detail about one embodiment of the contents of the latency-critical execution sub-core 260 of FIG. 4 .
  • the latency-tolerant execution core 210 includes components which are not latency-sensitive, but which are dependent only upon some level of throughput. In this sense, the latency-tolerant components may be thought of as the “plumbing” whose job is simply to provide a particular “gallons per minute” throughput, in which a “big pipe” is as good as a “fast flow”.
  • the fetch and decode units may not be notably demanding on execution latency, and may thus be put in the latency-tolerant core 210 rather than the latency-intolerant sub-core 205 , 255 , 260 .
  • the microcode and register file may not need to be in the sub-core.
  • the most latency-sensitive pieces are the arithmetic/logic functions and the cache. In the mode shown in FIG. 6 , only a subset of the arithmetic/logic functions are deemed to be sufficiently latency-sensitive that it is warranted to put them into the sub-core, as illustrated by critical ALU 300 .
  • the critical ALU functions include adders, subtractors, and logic units for performing AND, OR, and the like.
  • the critical ALU functions may also include a small, special-purpose shifter for doing address generation by scaling the index register.
  • the register file may reside in the latency-critical execution core, for design convenience; the faster the core section the register file is in, the fewer ports the register file needs.
  • the functions which are generally more latency-sensitive than the plumbing are those portions which are of a recursive nature, or those which include a dependency chain. Execution is a prime example of this concept; execution tends to be recursive or looping, and includes both false and true data dependencies both between and within iterations and loops.
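
The difference between throughput-bound “plumbing” and latency-bound dependency chains can be made concrete with a simplified model. This is a sketch under stated assumptions: each unit completes one operation per cycle, a dependent operation cannot start until its predecessor finishes, and the function names and frequencies are hypothetical.

```python
def throughput_mops(units, freq_mhz, ops_per_unit_per_cycle=1):
    """Peak throughput in millions of independent operations per second."""
    return units * freq_mhz * ops_per_unit_per_cycle

def chain_time_us(n_dependent_ops, freq_mhz, cycles_per_op=1):
    """Time in microseconds for a chain in which each op needs the previous result.

    Extra parallel units do not help here; only clock rate and latency matter.
    """
    return n_dependent_ops * cycles_per_op / freq_mhz

# Two units at half speed match one unit at full speed for independent work.
print(throughput_mops(2, 100), throughput_mops(1, 200))

# A 1000-instruction dependency chain still takes twice as long at half speed.
print(chain_time_us(1000, 100), chain_time_us(1000, 200))
```

In this model a “big pipe” and a “fast flow” are interchangeable for throughput, but only the faster clock shortens the dependency chain, which is the argument for placing such logic in the high-frequency sub-core.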
  • the other ALU functions 305 can be relegated to the less speedy core 210 .
  • only a subset of the cache needs to be inside the sub-core.
  • the data storage portion 310 of the cache is inside the sub-core, while the hit/miss logic and tags are in the slower core 210 .
  • the hit/miss signal is needed at the same time as the data.
  • a recent paper implied that the hit/miss signal is the limiting factor on cache speed (Todd M. Austin, Dionisios N. Pnevmatikatos, Gurindar S. Sohi, “Streamlining Data Cache Access with Fast Address Calculation”, Proceedings of the 22nd Annual International Symposium on Computer Architecture, Jun. 18-24, 1995, Session 8, No. 1, page 5).
  • hit/miss determination is more difficult and more time-consuming than the simple matter of reading data contents from cache locations.
  • the instruction cache (not shown) may be entirely in the core 210 , such that the cache 310 stores only data.
  • the instruction cache (Icache) is accessed speculatively. It is the business of branch prediction to predict where the flow of the program will go, and the Icache is accessed on the basis of that prediction. Branch prediction methods commonly used today can predict program flow without ever seeing the instructions in the Icache. If such a method is used, then the Icache is not latency-sensitive, and becomes more bandwidth-constrained than latency-constrained, and can be relegated to a lower clock frequency portion of the execution core.
  • the branch prediction itself could be latency-sensitive, so it would be a good candidate for a fast cycle time in one of the inner sub-core sections.
  • the innermost sub-core 205 , 255 , or 260 of FIG. 6 would therefore hold the data which is stored at the top of the memory hierarchy of FIG. 2 , that is, the data which is stored in the registers.
  • the register file need not be contained within the sub-core, but may, instead, be held in the less speedy portion of the core 210 .
  • the register file may be stored in any of the core sections 205 , 210 , 255 , 260 , as suits the particular embodiment chosen. As shown in FIG.
  • the reason that the register file is not required to be within the innermost core is that the data which result from operations performed in the critical ALU 300 are available on a bypass bus 315 as soon as they are calculated.
  • these data can be made available to the critical ALU 300 in the next clock cycle of the sub-core, far sooner than they could be written to and then read from the register file.
  • the hit/miss logic or the tag logic in the outer core may signal that the speculated data is, in fact, invalid. In this case, there must be a means provided to recover from the speculative operations which have been performed. This includes not only the specific operations which used the incorrect, speculated data as input operands, but also any subsequent operations which used the outputs of those specific operations as inputs. Also, the erroneously generated outputs may have subsequently been used to determine branching operations, such as if the erroneously generated output is used as a branch address or as a branch condition. If the processor performs control speculation, there may have also been errors in that operation as well.
  • the present invention provides a replay mechanism for recovering from data speculation upon data which ultimately prove to have been incorrect.
  • the replay mechanism may reside outside the innermost core, because it is not notably latency-critical. While the replay architecture is described in conjunction with a multiple-clock-speed execution engine which performs data speculation, it will be appreciated that the replay architecture may be used with a wide variety of architectures and micro-architectures, including those which perform data speculation and those which do not, those which perform control speculation and those which do not, those which perform in-order execution and those which perform out-of-order execution, and so forth.
  • FIG. 7 illustrates one implementation of such a replay architecture, generally showing the data flow of the architecture.
  • an instruction first proceeds to a renamer, such as a register alias table.
  • In sophisticated microarchitectures which permit data speculation and/or control speculation, it is highly desirable to decouple the actual machine from the specific registers indicated by the instruction. This is especially true in an architecture which is register-poor, such as the Intel Architecture. Renamers are well known, and the details of the renamer are not particularly germane to an understanding of the present invention. Any conventional renamer will suffice. It is desirable that it be a single-valued and single-assignment renamer, such that each instance of a given instruction will write to a different register, even though the instruction specifies the same register.
  • the renamer provides a separate storage location for each different value that each logical register assumes, so that no such value of any logical register is prematurely lost (i.e. before the program is through with that value), over a well-defined period of time.
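
A single-assignment renamer of the kind described can be sketched in a few lines. This is a toy model, not the renamer of any particular processor: the class name `Renamer`, the unbounded supply of physical registers, and the absence of any reclamation logic are simplifying assumptions.

```python
class Renamer:
    """Toy register alias table: every write gets a fresh physical register."""

    def __init__(self):
        self.alias = {}      # logical register name -> current physical register
        self.next_phys = 0   # next unused physical register number

    def _allocate(self):
        phys = self.next_phys
        self.next_phys += 1
        return phys

    def _lookup(self, logical):
        # A register read before any write gets an initial mapping on demand.
        if logical not in self.alias:
            self.alias[logical] = self._allocate()
        return self.alias[logical]

    def rename(self, dest, sources):
        """Map sources to their current physical registers, then give the
        destination a brand-new physical register (single assignment)."""
        phys_sources = [self._lookup(s) for s in sources]
        phys_dest = self._allocate()
        self.alias[dest] = phys_dest
        return phys_dest, phys_sources

r = Renamer()
print(r.rename("eax", ["eax", "ebx"]))  # first instance of eax = eax + ebx
print(r.rename("eax", ["eax", "ebx"]))  # second instance writes a different register
```

Because the second instance writes a different physical register than the first, the earlier value of `eax` is still available if the first instance has to be executed again.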
  • the instruction proceeds to an optional scheduler such as a reservation station, where instructions are reordered to improve execution efficiency.
  • the scheduler is able to detect when it is not allowed to issue further instructions. For example, there may not be any available execution slots into which a next instruction could be issued. Or, another unit may for some reason temporarily disable the scheduler.
  • the scheduler may reside in the latency-critical execution core, if the particular scheduling algorithm can schedule only single latency generation per cycle, and is therefore tied to the latency of the critical ALU functions.
  • the instruction proceeds to the execution core 205 , 210 , 255 , 260 (indirectly through a multiplexor to be described below), where it is executed.
  • an address associated with the instruction is sent to the translation lookaside buffer (TLB) and cache tag lookup logic (TAG).
  • This address may be, for example, the address (physical or logical) of a data operand which the instruction requires.
  • From the TLB and TAG logic, the physical address referenced and the physical address represented in the cache location accessed are passed to the hit/miss logic, which determines whether the cache location accessed in fact contained the desired data.
  • the execution logic gives the highest priority to generating perhaps only a portion of the address, but enough that data may be looked up in the high speed data cache.
  • this partial address is used with the highest priority to retrieve data from the data cache, and only as a secondary priority is a complete virtual address, or in the case of the Intel Architecture, a complete linear address, generated and sent to the TLB and cache TAG lookup logic.
  • because the critical ALU functions and the data cache are in the innermost sub-core (or are at least in a portion of the processor which runs at a higher clock rate than the TLB and TAG logic and the hit/miss logic), some data will already have been obtained from the data cache, and the processor will already have speculatively executed the instruction which needed that data, having assumed the obtained data to be correct. The processor will likely also have executed additional instructions using that data or the results of the first speculatively executed instruction.
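
The split between a fast data read and a slower hit/miss determination can be modeled with a toy direct-mapped cache. Everything in this sketch is an assumption made for illustration: the class name `SpeculativeCache`, the four-line geometry, and the use of plain integer addresses with no virtual-to-physical translation.

```python
class SpeculativeCache:
    """Toy direct-mapped cache with a fast data path and a late tag check."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.tags = [None] * num_lines
        self.data = [None] * num_lines

    def fill(self, address, value):
        index = address % self.num_lines
        self.tags[index] = address // self.num_lines
        self.data[index] = value

    def fast_read(self, partial_address):
        """Fast path: only the low-order index bits are needed to fetch data.
        The value returned is speculative until late_check confirms it."""
        return self.data[partial_address % self.num_lines]

    def late_check(self, full_address):
        """Slow path: compare the stored tag against the full address."""
        index = full_address % self.num_lines
        return self.tags[index] == full_address // self.num_lines

cache = SpeculativeCache()
cache.fill(5, "A")
print(cache.fast_read(5 % 4), cache.late_check(5))  # data is used, later confirmed
print(cache.fast_read(9 % 4), cache.late_check(9))  # same line, wrong tag: a miss
```

In the second access the fast path returns a value that belongs to a different address, so any instruction that consumed it would have to be executed again once the late check reports the miss.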
  • the replay architecture includes a checker unit which receives the output of the hit/miss logic. If a miss is indicated, the checker causes a “replay” of the offending instruction and any which depended on it or which were otherwise incorrect as a result of the erroneous data speculation.
  • when the instruction was sent for execution, a copy of it was forwarded to a delay unit, which provides a delay matching the time the instruction will take to get through the execution core, TLB/TAG, and hit/miss units, so that the copy arrives at the checker at about the same time that the hit/miss logic tells the checker that the data speculation was incorrect. In one mode, this is roughly 10-12 clocks of the inner core. In FIG.
  • the delay unit is shown as being outside the checker. In other embodiments, the delay unit may be incorporated as a part of the checker. In some embodiments, the checker may reside within the latency-critical execution core, if the checking algorithm is tied to the critical ALU speed.
  • When the checker determines that data speculation was incorrect, the checker sends the copy of the instruction back around for a “replay”. The checker forwards the copy of the instruction to a buffer unit. It may happen as an unrelated event that the TLB/TAG unit informs the buffer that the TLB/TAG is inserting a manufactured instruction in the current cycle. This information is needed by the buffer so the buffer knows not to reinsert another instruction in the same cycle. Both the TLB/TAG and the buffer also inform the scheduler when they are inserting instructions, so the scheduler knows not to dispatch an instruction in that same cycle. These control signals are not shown but will be understood by those skilled in the art.
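
The interaction of the delay unit, checker, and buffer can be sketched as a cycle-by-cycle simulation. This is a heavily simplified model under stated assumptions: one instruction enters execution per cycle, a replayed instruction takes priority over a new one, an instruction is treated as incorrect if its own data missed or if any producer it depends on was incorrect when it executed, and the delay is 3 cycles for brevity rather than the roughly 10-12 clocks mentioned above. The function name and its parameters are hypothetical.

```python
from collections import deque

def run_with_replay(instructions, cache_ready_at, deps=None, delay=3, max_cycles=100):
    """Return the checker's log as (cycle, instruction, 'ok' or 'replay') tuples.

    cache_ready_at maps an instruction to the first cycle at which its data
    hits in the cache (default 0). deps maps an instruction to its producers.
    """
    deps = deps or {}
    pending = deque(instructions)   # original instructions from the scheduler
    replay_buffer = deque()         # copies waiting to be re-executed
    delay_line = deque()            # (arrival cycle at checker, name, correct?)
    valid = {}                      # was the latest execution of each name correct?
    log = []
    cycle = 0
    while (pending or replay_buffer or delay_line) and cycle < max_cycles:
        # Checker: copies whose delay has elapsed meet their hit/miss result.
        while delay_line and delay_line[0][0] == cycle:
            _, name, ok = delay_line.popleft()
            log.append((cycle, name, "ok" if ok else "replay"))
            if not ok:
                replay_buffer.append(name)
        # Multiplexor: a replayed instruction wins over an original one.
        if replay_buffer:
            name = replay_buffer.popleft()
        elif pending:
            name = pending.popleft()
        else:
            name = None
        # Execute speculatively and send a copy down the delay line.
        if name is not None:
            hit = cycle >= cache_ready_at.get(name, 0)
            ok = hit and all(valid.get(d, False) for d in deps.get(name, []))
            valid[name] = ok
            delay_line.append((cycle + delay, name, ok))
        cycle += 1
    return log

for event in run_with_replay(["load", "add"], {"load": 2}, {"add": ["load"]}):
    print(event)
```

In this run the load misses on its first pass, the dependent add is replayed as well, and both succeed on their second pass, mirroring the statement that the replay covers the offending instruction and any which depended on it.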
  • the buffer unit provides latching of the copied instruction, to prevent it from getting lost if it cannot immediately be handled.
  • One such condition may be that there may be some higher priority function that could claim execution, such as when the TLB/TAG unit needs to insert a manufactured instruction, as mentioned above.
  • The buffer may not be necessary.
  • The scheduler's output was provided to the execution core indirectly, through a multiplexor.
  • The function of this multiplexor is to select among several possible sources of instructions being sent for execution.
  • The first source is, of course, the scheduler, in the case when it is an original instruction which is being sent for execution.
  • The second source is the buffer unit, in the case when it is a copy of an instruction which is being sent for replay execution.
  • A third source is illustrated as being from the TLB/TAG unit; this permits the architecture to manufacture “fake instructions” and inject them into the instruction stream.
  • The TLB logic or TAG logic may need to get another unit to do some work for them, such as to read some data from the data cache as might be needed to evict that data, or for refilling the TLB, or other purposes, and they can do this by generating instructions which did not come from the real instruction stream, and then inserting those instructions back at the multiplexor input to the execution core.
  • The mux control scheme may, in one mode, include a priority scheme wherein a replay instruction has higher priority than an original instruction. This is advantageous because a replay instruction is probably older than the original instruction in the original macroinstruction flow, and may be a “blocking” instruction such as if there is a true data dependency.
  • A manufactured instruction may have higher priority than a replay instruction. This is advantageous because these manufactured instructions may be used for critically important and time-sensitive operations.
  • One such sensitive operation is an eviction. After a cache miss, new data will be coming from the L1 cache. When that data arrives, it must be put in the data cache (L0) as quickly as possible. If that is done, the replayed load will just meet the new data and will now be successful. If the data arrives there even one cycle late, the replayed load will pass again too soon and must again be replayed. Unfortunately, the data cache location where the processor is going to put the data is now holding the one and only copy of some data that was written some time ago. In other words, the location is “dirty”.
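  • The multiplexor priority described above (manufactured over replay over original) can be sketched in a few lines. This is only an illustration of the selection rule, not the circuit; the function name `select_instruction` and the source labels are hypothetical, and the sketch assumes each source offers at most one candidate instruction per cycle.

```python
from typing import Optional, Tuple

# Hypothetical labels for the three multiplexor inputs named in the text.
MANUFACTURED = "manufactured"   # from the TLB/TAG unit
REPLAY = "replay"               # from the buffer unit
ORIGINAL = "original"           # from the scheduler


def select_instruction(manufactured: Optional[str],
                       replay: Optional[str],
                       original: Optional[str]) -> Optional[Tuple[str, str]]:
    """Pick the instruction sent to the execution core this cycle.

    Priority, per the mode described in the text:
    manufactured > replay > original. Returns (source, instruction),
    or None when no source has an instruction ready.
    """
    candidates = ((MANUFACTURED, manufactured),
                  (REPLAY, replay),
                  (ORIGINAL, original))
    for source, instruction in candidates:
        if instruction is not None:
            return source, instruction
    return None
```

  • In this sketch a losing source simply keeps its instruction for a later cycle, which corresponds to the latching role the text assigns to the buffer unit.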
  • The replay architecture may also be used to enable the processor to in effect “stall” without actually slowing down the execution core or performing clock throttling or the like.
  • There are some circumstances where it would be necessary to stall the frontend and/or execution core, to avoid losing the results of instructions or to avoid other such problems.
  • One example is where the processor's backend temporarily runs out of resources such as available registers into which to write execution results.
  • Other examples include where the external bus is blocked, an upper level of cache is busy being snooped by another processor, a load or store crosses a page boundary, an exception occurs, or the like.
  • In such cases, the replay architecture may very simply be used to send back around for replay all instructions whose results would be otherwise lost.
  • The execution core remains functioning at full speed, and there are no additional signal paths required for stalling the frontend, beyond those otherwise existing to permit the multiplexor to give priority to replay instructions over original instructions.
  • Other stall-like uses can be made of the replay architecture. For example, assume that a store address instruction misses in the TLB. Rather than saving the linear address to process after getting the proper entry in the TLB, the processor can just drop it on the floor and request the store address instruction to be replayed. As another example, the Page Miss Handler (not shown) may be busy. In this case the processor does not even remember that it needs to do a page walk, but finds that out over again when the store address comes back.
  • The memory subsystem, by itself, will never do anything for this instance of the instruction. When the instruction executes again, then it is considered all over again.
  • The instruction has delivered its linear address to the memory subsystem and it doesn't want anything back.
  • A more conventional approach might be to say that this instruction is done, and any problems from here on out are memory subsystem problems, in which case the memory subsystem must then store information about this store address until it can get resources to take care of it.
  • The present approach is that the store address replays, and the memory subsystem does not have to remember it at all. Here it is a little more clear that the processor is replaying the store address specifically because of inability to handle it in the memory subsystem.
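  • The stall-by-replay idea can be illustrated with a toy timing model. This is a sketch under stated assumptions, not the patent's mechanism: the function name `executions_until_hit` is hypothetical, the needed TLB entry is assumed to arrive a fixed number of cycles after the first attempt, and each trip around the replay loop is assumed to cost a fixed number of cycles. The point is that no per-instruction state is kept between attempts; the instruction is simply tried again each time it comes back around.

```python
def executions_until_hit(fill_latency: int, replay_latency: int) -> int:
    """Count how many times a store-address instruction executes.

    fill_latency:   cycles after the first attempt until the TLB entry
                    is available (assumed fixed for this sketch).
    replay_latency: cycles for one trip around the replay loop
                    (assumed fixed; must be positive).
    """
    if fill_latency < 0 or replay_latency <= 0:
        raise ValueError("latencies must be non-negative / positive")
    cycle = 0
    executions = 0
    while True:
        executions += 1
        if cycle >= fill_latency:
            # The entry is present: this attempt is processed correctly.
            return executions
        # Miss: nothing is remembered; the instruction goes around again.
        cycle += replay_latency
```

  • For instance, with a fill that takes 25 cycles and a 10-cycle loop, attempts occur at cycles 0, 10, 20 and 30, and only the fourth succeeds.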
  • When an instruction gets replayed, all dependent instructions also get replayed. This may include all those which used the replayed instruction's output as input, all those which are down control flow branches picked according to the replayed instruction, and so forth.
  • The processor does not replay instructions merely because they are control flow dependent on an instruction that replayed.
  • The thread of control was predicted.
  • The processor is always following a predicted thread of control and never necessarily knows during execution if it is going the right way or not. If a branch gets bad input, the branch instruction itself is replayed. This is because the processor cannot reliably determine from the branch if the predicted thread of control is right or not, since the input data to the branch was not valid. No other instructions get replayed merely because the branch got bad data.
  • The branch will eventually be correctly executed. At this time, it does what all branches do: it reports if the predicted direction taken for this branch was correct or not. If it was correctly predicted, everything goes on about its business.
  • An instruction is replayed either: 1) because the instruction itself was not correctly processed for any reason, or 2) because the input data that this instruction uses is not known to be correct.
  • Data is known to be correct if it is produced by an instruction that is itself correctly processed and all of its input data is known to be correct.
  • Branches are viewed not as having anything to do with the control flow but as data handling instructions which simply report interesting things to the front end of the machine but do not produce any output data that can be used by any other instruction. Hence, the correctness of any other instruction cannot have anything to do with them.
  • The correctness of the control flow is handled by a higher authority and is not in the purview of mere execution and replay.
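  • The two replay conditions and the definition of “known to be correct” can be traced through a short instruction sequence. The following is a rough sketch, not the hardware: the function name `find_replays`, the tuple layout, and the register names are all hypothetical. It models the mode in which a branch is treated as a data-handling instruction with no output, so instructions that merely follow the branch are not replayed on its account.

```python
from typing import Iterable, List, Optional, Set, Tuple

# (name, source registers, destination register or None for a branch)
Instr = Tuple[str, List[str], Optional[str]]


def find_replays(instructions: Iterable[Instr],
                 goofed: Set[str],
                 initially_correct: Set[str]) -> List[str]:
    """Return the names of instructions that must be replayed.

    An instruction is replayed if it was not processed correctly
    (its name is in `goofed`) or if any input is not known correct.
    """
    known_correct = set(initially_correct)
    replayed = []
    for name, sources, dest in instructions:
        inputs_ok = all(src in known_correct for src in sources)
        if name not in goofed and inputs_ok:
            if dest is not None:
                known_correct.add(dest)      # result is now known correct
        else:
            replayed.append(name)
            if dest is not None:
                known_correct.discard(dest)  # result cannot be trusted
    return replayed


program = [
    ("load",   ["r0"],       "r1"),  # misses in the data cache
    ("add",    ["r1", "r0"], "r2"),  # consumes the bad load result
    ("branch", ["r2"],       None),  # gets bad input, produces no data
    ("sub",    ["r0"],       "r3"),  # follows the branch, data-independent
]
replays = find_replays(program, goofed={"load"}, initially_correct={"r0"})
```

  • Here the load, the add and the branch are replayed, while the sub is not, because its only input was already known to be correct.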
  • FIG. 8 illustrates more about the checker unit. Again, an instruction is replayed if: 1) it was not processed correctly, or 2) if it used input data that is not known to be correct. These two conditions give a good division for discussing the operation of the checker unit. The first condition depends on everything that needs to be done for the instruction. Anything in the machine that needs to do something to correctly execute the instruction is allowed to goof and to signal to the checker that it goofed. The first condition is therefore talking about signals that come into the checker, potentially from many places, that say, “I goofed on this instruction.”
  • The most common goof is the failure of the data cache to supply the correct result for a load. This is signaled by the hit/miss logic. Another common goof is failure to correctly process a store address; this would typically result from a TLB miss on a store address, but there can be other causes, too.
  • The L1 cache may deliver data (which may go into the L0 cache and be used by instructions) that contains an ECC error. This would be signaled quickly, and then corrected as time permits.
  • Another example is where the adder cannot correctly add two numbers. This is signaled by the flag logic which keeps tabs on the adders.
  • Another example is where the logic unit fails to get the correct answer when doing an AND, XOR, or other simple logic operation. These, too, are signaled by the flag logic.
  • The floating point unit may not get the correct answer all of the time, in which case it will signal when it goofs a floating point operation. In principle, you could use this mechanism for many types of goofs. It could be used for algorithmic goofs and it could even be used for hardware errors (circuit goofs). Regardless of the cause, whenever the processor doesn't do exactly what it is supposed to do, and the goof is detected, the processor's various units can request a replay by signaling to the checker.
  • The checker contains the official list of what data is known to be correct. It is what is sometimes called the “scoreboard”. It is the checker's responsibility to look at all of the input data for each instruction execution instance and to determine if all such input data is known to be correct or not. It is also the checker's responsibility to add it all up for each instruction execution instance, to determine if the result produced by that instruction execution instance can therefore be deemed to be “known to be correct”. If the result of an instruction is deemed “known to be correct”, this is noted on the scoreboard so the processor now has new, known-correct data that can be the input for other instructions.
  • FIG. 8 illustrates one exemplary checker which may be employed in practicing the architecture of the present invention. Because the details of the checker are not necessary in order to understand the invention, a simplified checker is illustrated to show the requirements for a checker sufficient to make the replay system work correctly.
  • One instruction is processed per cycle. After an instruction has been executed, it is represented to the checker by signals OP1, OP1V, OP2, OP2V, DST, and a latency vector which was assigned to the uop by the decoder on the basis of the opcode.
  • The signals OP1V and OP2V indicate whether the instruction includes a first operand and a second operand, respectively.
  • The signals OP1 and OP2 identify the physical source registers of the first and second operands, respectively, and are received at read address ports RA1 and RA2 of the scoreboard.
  • The signal DST identifies the physical destination register where the result of the instruction was written.
  • The latency vector has all 0's except a 1 in one position. The position of the 1 denotes the latency of this instruction.
  • An instruction's latency is how many cycles there are after the instruction begins execution before another instruction can use its result.
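  • The one-hot latency vector can be illustrated as follows. This is a small sketch with hypothetical helper names (`encode_latency`, `decode_latency`); it assumes that bit position k means a latency of k cycles, which the text does not spell out.

```python
from typing import List


def encode_latency(latency: int, width: int) -> List[int]:
    """Build a vector of `width` bits with a single 1 at index `latency`."""
    if not 0 <= latency < width:
        raise ValueError("latency does not fit in the vector")
    return [1 if position == latency else 0 for position in range(width)]


def decode_latency(vector: List[int]) -> int:
    """Recover the latency from a one-hot vector, rejecting malformed input."""
    if any(bit not in (0, 1) for bit in vector) or sum(vector) != 1:
        raise ValueError("latency vector must contain exactly one 1")
    return vector.index(1)
```

  • A one-hot form is convenient for a delay line because each bit can directly enable insertion at one tap, with no decoder in between.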
  • The scoreboard has one bit of storage for each physical register in the machine. The bit is 0 if that register is not known to contain correct data and it is 1 if that register is known to contain correct data.
  • The register renamer, described above, allocates these registers. At the time a physical register is allocated to hold the result of some instruction, the renamer sends the register number to the checker as multiple-bit signal CLEAR. The scoreboard sets to 0 the scoreboard bit which is addressed by CLEAR.
  • The one or two register operands for the instruction currently being checked are looked up in the scoreboard to see if they are known to be correct, and the results are output as scoreboard values SV1 and SV2, respectively.
  • An AND gate 350 receives the first scoreboard value SV1 and the first operand valid signal OP1V.
  • Another AND gate 355 similarly receives signals SV2 and OP2V for the second operand.
  • The operand valid signals OP1V and OP2V cause the scoreboard values SV1 and SV2 to be ignored if the instruction does not actually require those respective operands.
  • The outputs of the AND gates are provided to NOR gate 360, along with an external replay request signal.
  • The output of the NOR gate will be false if either operand is required by the instruction and is not known to be correct, or if the external replay request signal is asserted. Otherwise the output will be true.
  • The output of the NOR gate 360 is the checker output INSTRUCTION OK. If it is true, the instruction was completed correctly and is ready to be considered for retirement. If it is false, the instruction must be replayed.
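  • The behavior of the INSTRUCTION OK output, as stated above, can be written as a small truth function. This sketch follows the described behavior (false when a required operand is not known to be correct, or when an external replay is requested) rather than any particular gate-level signal polarity, which is an assumption here; the function name `instruction_ok` is hypothetical.

```python
def instruction_ok(sv1: bool, op1v: bool,
                   sv2: bool, op2v: bool,
                   external_replay_request: bool) -> bool:
    """Return True if the instruction may be considered for retirement.

    sv1/sv2:  scoreboard values for the two source registers
              (True means known to be correct).
    op1v/op2v: whether the instruction actually uses each operand.
    """
    # An operand only matters if the instruction requires it.
    operand1_bad = op1v and not sv1
    operand2_bad = op2v and not sv2
    # NOR of the three "bad" conditions.
    return not (operand1_bad or operand2_bad or external_replay_request)
```

  • Note that an operand the instruction does not use can never force a replay, whatever its scoreboard bit happens to be.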
  • A delay line receives the destination register identifier DST and the checker output INSTRUCTION OK information for the instruction currently being checked.
  • The simple delay line shown is constructed of registers (single cycle delays) and muxes. It will be understood that each register and mux is a multiple-bit device, or represents multiple single-bit devices. Those skilled in the art will understand that various other types of delay lines, and therefore different formats of latency vectors, could be used.
  • The DST and INSTRUCTION OK information is inserted in one location of the delay line, as determined by the value of the latency vector. This information is delayed for the required number of cycles according to the latency vector, and then it is applied to the write port WP of the scoreboard.
  • The scoreboard bit corresponding to the destination register DST for the instruction is then written according to the value of INSTRUCTION OK. A value of 1 indicates that the instruction did not have to be replayed, and a value of 0 indicates that the instruction did have to be replayed, meaning that its result data is not known to be correct.
  • This checker checks one instruction per cycle (but other embodiments are of course feasible).
  • The cycle in which an instruction is checked is a fixed number of cycles after that instruction began execution and captured the data that it used for its operands. This number of cycles later is sufficient to allow the EXTERNAL REPLAY REQUEST signal for the instruction to arrive at the checker to be processed along with the other information about the instruction.
  • The EXTERNAL REPLAY REQUEST signal is the OR of all signals from whatever parts of the machine may produce replay requests that indicate that the instruction was not processed correctly. For example, it may indicate that data returned from the data cache may not have been correct, for any of many reasons, a good example being that there was a cache miss.
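  • Putting the scoreboard, the checking logic and the delay line together, the simplified checker can be modeled roughly as below. This is a sketch under several assumptions, not the circuit of FIG. 8: the class name `Checker` and its method names are hypothetical, the latency is passed as an integer of at least 1 (standing in for the position of the 1 in the latency vector), a result with latency k becomes visible to an instruction checked k cycles later, and two results landing in the same delay slot are treated as an error rather than arbitrated.

```python
from typing import List, Optional, Tuple


class Checker:
    def __init__(self, num_regs: int, max_latency: int) -> None:
        self.scoreboard: List[int] = [0] * num_regs  # 1 = known correct
        # delay[k] is written to the scoreboard after k more ticks.
        self.delay: List[Optional[Tuple[int, bool]]] = [None] * max_latency

    def clear(self, reg: int) -> None:
        """CLEAR from the renamer: a newly allocated register is not yet correct."""
        self.scoreboard[reg] = 0

    def check(self, op1: int, op1v: bool, op2: int, op2v: bool,
              dst: int, latency: int,
              external_replay_request: bool = False) -> bool:
        """Check one instruction; return INSTRUCTION OK."""
        bad1 = op1v and not self.scoreboard[op1]
        bad2 = op2v and not self.scoreboard[op2]
        ok = not (bad1 or bad2 or external_replay_request)
        slot = latency - 1
        if not 0 <= slot < len(self.delay):
            raise ValueError("latency out of range for this delay line")
        if self.delay[slot] is not None:
            raise RuntimeError("delay slot collision (not modeled)")
        self.delay[slot] = (dst, ok)
        return ok

    def tick(self) -> None:
        """End of cycle: apply the oldest entry to the write port, then shift."""
        head = self.delay[0]
        if head is not None:
            dst, ok = head
            self.scoreboard[dst] = 1 if ok else 0
        self.delay = self.delay[1:] + [None]
```

  • In use, `check` is called at most once per cycle and `tick` once per cycle. A load that receives an external replay request leaves its destination bit at 0, so a dependent instruction checked later also fails; once the load is replayed successfully and its delayed write lands, the dependent instruction passes.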

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Advance Control (AREA)
  • Microcomputers (AREA)
  • Executing Machine-Instructions (AREA)
  • Power Sources (AREA)

Abstract

A processor including a first execution core section clocked to perform execution operations at a first clock frequency, and a second execution core section clocked to perform execution operations at a second clock frequency which is different than the first clock frequency. The second execution core section runs faster and includes a data cache and critical ALU functions, while the first execution core section includes latency-tolerant functions such as instruction fetch and decode units and non-critical ALU functions. The processor may further include an I/O ring which may be still slower than the first execution core section. Optionally, the first execution core section may include a third execution core section whose clock rate is between that of the first and second execution core sections. Clock multipliers/dividers may be used between the various sections to derive their clocks from a single source, such as the I/O clock.

Description

  • This application is a divisional of U.S. patent application Ser. No. 10/996,328, filed on Nov. 24, 2004, entitled “PROCESSOR HAVING EXECUTION CORE SECTIONS OPERATING AT DIFFERENT CLOCK RATES”, which is a Re-Issue of U.S. patent application Ser. No. 09/775,383, filed Feb. 02, 2001, now patented U.S. Pat. No. 6,487,675, entitled “PROCESSOR HAVING EXECUTION CORE SECTIONS OPERATING AT DIFFERENT CLOCK RATES”, which is a continuation of U.S. patent application Ser. No. 09/527,065 filed Mar. 16, 2000, now patented U.S. Pat. No. 6,256,745, entitled “PROCESSOR HAVING EXECUTION CORE SECTIONS OPERATING AT DIFFERENT CLOCK RATES”, which is a continuation of U.S. patent application Ser. No. 09/092,353 filed Jun. 5, 1998, now patented U.S. Pat. No. 6,216,234, entitled “PROCESSOR HAVING EXECUTION CORE SECTIONS OPERATING AT DIFFERENT CLOCK RATES”, which is a continuation of U.S. patent application Ser. No. 08/746,606 filed Nov. 13, 1996, now patented U.S. Pat. No. 5,828,868, entitled “PROCESSOR HAVING EXECUTION CORE SECTIONS OPERATING AT DIFFERENT CLOCK RATES”.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • The present invention relates generally to the field of high speed processors, and more specifically to a processor including a sub-core operating at a higher frequency than the rest of the execution core, and also to a replay architecture for facilitating data-speculating operation of the sub-core.
  • 2. Background of the Prior Art
  • FIG. 1 illustrates a microprocessor 100 according to the prior art. The microprocessor includes an I/O ring which operates at a first clock frequency, and an execution core which operates at a second clock frequency. For example, the Intel486 DX2 may run its I/O ring at 33 MHz and its execution core at 66 MHz for a 2:1 ratio (1/2 bus), the IntelDX4 may run its I/O ring at 25 MHz and its execution core at 75 MHz for a 3:1 ratio (1/3 bus), and the Intel Pentium® OverDrive® processor may operate its I/O ring at 33 MHz and its execution core at 82.5 MHz for a 2.5:1 ratio (5/2 bus).
  • A distinction may be made between “I/O operations” and “execution operations”. For example, in the DX2, the I/O ring performs I/O operations such as buffering, bus driving, receiving, parity checking, and other operations associated with communicating with the off-chip world, while the execution core performs execution operations such as addition, multiplication, address generation, comparisons, rotation and shifting, and other “processing” manipulations.
  • The processor 100 may optionally include a clock multiplier. In this mode, the processor can automatically set the speed of its execution core according to an external, slower clock provided to its I/O ring. This may reduce the number of pins needed. Alternatively, the processor may include a clock divider, in which case the processor sets the I/O ring speed responsive to an external clock provided to the execution core.
  • These clock multiply and clock divide functions are logically the same for the purposes of this invention, so the term “clock mult/div” will be used herein to denote either a multiplier or divider as suitable. The skilled reader will comprehend how external clocks may be selected and provided, and from there multiplied or divided. Therefore, specific clock distribution networks, and the details of clock multiplication and division, will not be expressly illustrated. Furthermore, the clock mult/div units need not necessarily be limited to integer multiple clocks, but can perform e.g. 2:5 clocking. Finally, the clock mult/div units need not necessarily even be limited to fractional bus clocking, but can, in some embodiments, be flexible, asynchronous, and/or programmable, such as in providing a P/Q clocking scheme.
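  • The arithmetic of a P/Q clocking scheme can be checked against the prior art ratios cited above. This is a minimal sketch: the helper name `derive_clock` is hypothetical, and exact rational arithmetic is used only so that non-integer ratios such as 5/2 do not pick up rounding error.

```python
from fractions import Fraction


def derive_clock(source_mhz, p: int, q: int = 1) -> Fraction:
    """Return the derived frequency source * P/Q, in MHz."""
    if p <= 0 or q <= 0:
        raise ValueError("P and Q must be positive integers")
    return Fraction(source_mhz) * Fraction(p, q)


# I/O ring frequency and core:I/O ratio, as given in the text.
dx2_core = derive_clock(33, 2)           # 2:1 ratio
dx4_core = derive_clock(25, 3)           # 3:1 ratio
overdrive_core = derive_clock(33, 5, 2)  # 2.5:1 ratio
```

  • The same helper models a divider by swapping the roles of P and Q, which reflects the statement that multiplication and division are logically the same for these purposes.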
  • The basic motivation for increasing clock frequencies in this manner is to reduce instruction latency. The execution latency of an instruction may be defined as the time from when its input operands must be ready for it to execute until its result is ready to be used by another instruction.
  • Suppose that a part of a program contains a sequence of N instructions, I_1, I_2, I_3, . . . , I_N. Suppose that I_{n+1} requires, as part of its inputs, the result of I_n, for all n from 1 to N−1. This part of the program may also contain any other instructions. Then we can see that this program cannot be executed in less time than T = L_1 + L_2 + L_3 + . . . + L_N, where L_n is the latency of instruction I_n, for all n from 1 to N. In fact, even if the processor was capable of executing a very large number of instructions in parallel, T remains a lower bound for the time to execute this part of this program. Hence to execute this program faster, it will ultimately be essential to shorten the latencies of the instructions.
  • We may look at the same thing from a slightly different point of view. Define that an instruction I_n is “in flight” from the time that it requires its input operands to be ready until the time when its result is ready to be used by another instruction. Instruction I_n is therefore “in flight” for a length of time L_n = A_n*C, where A_n is the latency, as defined above, of I_n, but this time expressed in cycles, and C is the cycle time. Let a program execute N instructions as above and take M “cycles” or units of time to do it. Looked at from either point of view, it is critically important to reduce the execution latency as much as possible.
  • The average latency can be conventionally defined as L = 1/N*(L_1 + L_2 + L_3 + . . . + L_N) = C/N*(A_1 + A_2 + A_3 + . . . + A_N). Let f_j be the number of instructions that are in flight during cycle j. We can then define the parallelism P as the average number of instructions in flight for the program, or P = 1/M*(f_1 + f_2 + f_3 + . . . + f_M).
  • Notice that f_1 + f_2 + f_3 + . . . + f_M = A_1 + A_2 + A_3 + . . . + A_N. Both sides of this equation are ways of counting up the number of cycles in which instructions are in flight, wherein if x instructions are in flight in a given cycle, that cycle counts as x cycles.
  • Now define the “average bandwidth” B as the total number of instructions executed, N, divided by the time used, M*C, or in other words, B = N/(M*C).
  • We may then easily see that P = L*B, since L*B = C/N*(A_1 + . . . + A_N) * N/(M*C) = 1/M*(A_1 + . . . + A_N) = 1/M*(f_1 + . . . + f_M) = P. In this formula, L is the average latency for a program, B is its average bandwidth, and P is its average parallelism. Note that B tells how fast we execute the program. It is instructions per second. If the program has N instructions, it takes N/B seconds to execute it. The goal of a faster processor is exactly the goal of getting B higher.
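  • The relationship between parallelism, average latency and average bandwidth can be checked numerically on a small made-up schedule. This is a sketch, assuming that each instruction is in flight for a contiguous run of cycles beginning at a given start cycle, and that M is the number of cycles from cycle 0 through the last cycle in which anything is in flight; the function name `program_metrics` and the sample numbers are hypothetical.

```python
from fractions import Fraction


def program_metrics(latencies, starts, cycle_time):
    """Return (L, B, P) for instructions in flight during [start, start + A_n).

    latencies:  A_n, latency of each instruction in cycles
    starts:     first cycle in which each instruction is in flight
    cycle_time: C, the duration of one cycle
    """
    n = len(latencies)
    m = max(s + a for s, a in zip(starts, latencies))  # total cycles M
    # f_j: how many instructions are in flight during cycle j
    in_flight = [sum(1 for s, a in zip(starts, latencies) if s <= j < s + a)
                 for j in range(m)]
    # Both sums count instruction-cycles, so they must agree.
    assert sum(in_flight) == sum(latencies)
    c = Fraction(cycle_time)
    avg_latency = c * sum(latencies) / n   # L = C/N * sum(A_n)
    bandwidth = Fraction(n) / (m * c)      # B = N / (M*C)
    parallelism = Fraction(sum(in_flight), m)  # P = 1/M * sum(f_j)
    return avg_latency, bandwidth, parallelism


L, B, P = program_metrics(latencies=[1, 3, 2, 2],
                          starts=[0, 0, 1, 3],
                          cycle_time=Fraction(1, 2))
```

  • For this schedule M is 5 cycles and 8 instruction-cycles are in flight in total, so P is 8/5, and the product L*B gives the same value.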
  • We now note that increasing B requires either increasing the parallelism P, or decreasing the average latency L. It is well known that the parallelism, P, that can be readily exploited for a program is limited. While it is true that certain classes of programs have large exploitable parallelism, a large class of important programs has P restricted to quite small numbers.
  • One drawback which the prior art processors have is that their entire execution core is constrained to run at the same clock speed. This limits some components within the core in a “weakest link” or “slowest path” manner.
  • In the 1960s and 1970s, there existed central processing units in which a multiplier or divider co-processor was clocked at a frequency higher than other circuitry in the central processing unit. These central processing units were constructed of discrete components rather than as integrated circuits or monolithic microprocessors. Due to their construction as co-processors, and/or the fact that they were not integrated with the main processor, these units should not be considered as “sub-cores”.
  • Another feature of some prior art processors is the ability to perform “speculative execution”. This is also known as “control speculation”, because the processor guesses which way control (branching) instructions will go. Some processors perform speculative fetch, and others, such as the Intel Pentium Pro processor, also perform speculative execution. Control speculating processors include mechanisms for recovering from mispredicted branches, to maintain program and data integrity as though no speculation were taking place.
  • FIG. 2 illustrates a conventional data hierarchy. A mass storage device, such as a hard drive, stores the programs and data (collectively “data”) which the computer system (not shown) has at its disposal. A subset of that data is loaded into memory such as DRAM for faster access. A subset of the DRAM contents may be held in a cache memory. The cache memory may itself be hierarchical, and may include a level two (L2) cache, and then a level one (L1) cache which holds a subset of the data from the L2. Finally, the physical registers of the processor contain a smallest subset of the data. As is well known, various algorithms may be used to determine what data is stored in what levels of this overall hierarchy. In general, it may be said that the more recently a datum has been used, or the more likely it is to be needed soon, the closer it will be held to the processor.
  • The presence or absence of valid data at various points in the hierarchical storage structure has implications on another drawback of the prior art processors, including control speculating processors. The various components within their execution cores are designed such that they cannot perform “data speculation”, in which a processor guesses what values data will have (or, more precisely, the processor assumes that presently-available data values are correct and identical to the values that will ultimately result, and uses those values as inputs for one or more operations), rather than which way branches will go. Data speculation may involve speculating that data presently available from a cache are identical to the true values that those data should have, or that data presently available at the output of some execution unit are identical to the true values that will result when the execution unit completes its operation, or the like.
  • Like control speculating processors' recovery mechanisms, data speculating processors must have some mechanism for recovering from having incorrectly assumed that data values are correct, to maintain program and data integrity as though no data speculation were taking place. Data speculation is made more difficult by the hierarchical storage system, especially when it is coupled with a microarchitecture which uses different clock frequencies for various portions of the execution environment.
  • It is well-known that every processor is adapted to execute instructions of its particular “architecture”. In other words, every processor executes a particular instruction set, which is encoded in a particular machine language. Some processors, such as the Pentium Pro processor, decode those “macro-instructions” down into “micro-instructions” or “uops”, which may be thought of as the machine language of the micro-architecture and which are directly executed by the processor's execution units. It is also well-known that other processors, such as those of the RISC variety, may directly execute their macro-instructions without breaking them down into micro-instructions. For purposes of the present invention, the term “instruction” should be considered to cover any or all of these cases.
  • SUMMARY OF THE INVENTION
  • The invention provides a microprocessor having two or more levels of execution sub-core each clocked at different frequencies. The processor may also have an I/O ring, which may be clocked at yet another frequency. Clock division or multiplication may be used between the various levels, to derive the various clocks from a common clock, such as the I/O clock, which may be provided from off-chip. Having the different clock domains enables the designer to make trade-offs in the design of various components of the chip, such as individual execution units, instruction fetch and decode units, register files, caches, and the like. Thus, selected components can be designed to operate at a very high frequency, without requiring the entire chip to be designed to operate at this frequency. Less latency-critical units, or those whose required throughput can be obtained by twice as many units running at half the clock speed, can be relegated to the slower sections of the chip, easing their design considerably.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram illustrating a prior art processor having an I/O ring and an execution core operating at different clock speeds.
  • FIG. 2 demonstrates a hierarchical memory structure such as is well known in the art.
  • FIG. 3 is a block diagram illustrating the processor of the present invention, and showing a plurality of execution core sections each having its own clock frequency.
  • FIG. 4 is a block diagram illustrating a mode in which the processor of FIG. 3 includes yet another sub-core with its own clock frequency.
  • FIG. 5 is a block diagram illustrating a different mode in which the sub-core is not nested as shown in FIG. 4.
  • FIG. 6 is a block diagram illustrating a partitioning of the execution core.
  • FIG. 7 is a block diagram illustrating one embodiment of the replay architecture of the present invention, which permits data speculation.
  • FIG. 8 illustrates one embodiment of the checker unit of the replay architecture.
  • DETAILED DESCRIPTION OF THE INVENTION
  • FIG. 3 illustrates the high-speed sub-core 205 of the processor 200 of the present invention. The high-speed sub-core includes the most latency-intolerant portions of the particular architecture and/or microarchitecture employed by the processor. For example, in an Intel Architecture processor, certain arithmetic and logic functions, as well as data cache access, may be the most unforgiving of execution latency.
  • Other functions, which are not so sensitive to execution latency, may be contained within a more latency-tolerant execution core 210. For example, in an Intel Architecture processor, execution of infrequently-executed instructions, such as transcendentals, may be relegated to the slower part of the core.
  • The processor 200 communicates with the rest of the system (not shown) via the I/O ring 215. If the I/O ring operates at a different clock frequency than the latency-tolerant execution core, the processor may include a clock mult/div unit 220 which provides clock division or multiplication according to any suitable manner and conventional means. Because the latency-intolerant execution sub-core 205 operates at a higher frequency than the rest of the latency-tolerant execution core 210, there may be a mechanism 225 for providing a different clock frequency to the latency-intolerant execution sub-core 205. In one mode, this is a clock mult/div unit 225.
  • FIG. 4 illustrates a refinement of the invention shown in FIG. 3. The processor 250 of FIG. 4 includes the I/O ring 215, clock mult/div unit 220, and latency-tolerant execution core 210. However, in place of the unitary sub-core (205) and clock mult/div unit (225) of FIG. 3, this improved processor 250 includes a latency-intolerant execution sub-core 255 and an even more latency-critical execution sub-core 260, with their clock mult/div units 265 and 270, respectively.
  • The skilled reader will appreciate that this is illustrative of a hierarchy of sub-cores, each of which includes those units which must operate at least as fast as the respective sub-core level. The skilled reader will further appreciate that the selection of what units go how deep into the hierarchy will be made according to various design constraints such as die area, clock skew sensitivity, design time remaining before tapeout date, and the like. In one mode, an Intel Architecture processor may advantageously include only its most common integer ALU functions and data storage portion of its data cache in the innermost sub-core. In one mode, the innermost sub-core may also include the register file; although, for reasons including those stated above concerning FIG. 2, the register file might not technically be needed to operate at the highest clock frequency, its design may be simplified by including it in a more inner sub-core than is strictly necessary. For example, it may be more efficient to make twice as fast a register file with half as many ports, than vice versa.
  • In operation, the processor performs an I/O operation at the I/O ring and at the I/O clock frequency, such as to bring in a data item not presently available within the processor. Then, the latency-tolerant execution core may perform an execution operation on the data item to produce a first result. Then, the latency-intolerant execution sub-core may perform an execution operation on the first result to produce a second result. Then, the latency-critical execution sub-core may perform a third execution operation upon the second result to produce a third result. Those skilled in the art will understand that the flow of execution need not necessarily proceed in the strict order of the hierarchy of execution sub-cores. For example, the newly read in data item could go immediately to the innermost core, and the result could go from there to any of the core sections or even back to the I/O ring for writeback.
  • FIG. 5 shows an embodiment which is slightly different than that of FIG. 4. The processor 280 includes the I/O ring 215, the execution cores 210, 255, 260, and the clock mult/div units 220, 265, 270. However, in this embodiment the latency-critical execution sub-core 260 is not nested within the latency-intolerant execution core 255. In this mode, the clock mult/div units 265 and 270 perform different ratios of multiplication to enable their respective cores to run at different speeds.
  • In another slightly different mode (not shown), either of these cores might be clock-interfaced directly to the I/O ring or to the external world. In such a mode, clock mult/div units may not be required, if separate clock signals are provided from outside the processor.
  • It should be noted that the different speeds at which the various layers of sub-core operate may be in-use, operational speeds. It is known, for example in the Pentium processor, that certain units may be powered down when not in use, by reducing or halting their clock; in this case, the processor may have the bulk of its core running at 66 MHz while a sub-core such as the FPU is at substantially 0 MHz. While the present invention may be used in combination with such power-down or clock throttling techniques, it is not limited to such cases.
  • Those skilled in the art will appreciate that non-integer ratios may be applied at any of the boundaries, that the combinations of clock ratios between the various rings are almost limitless, and that different baseline frequencies could be used at the I/O ring. It is also possible that the clock multiplication factors might not remain constant over time. For example, in some modes, the clock multiplication applied to the innermost sub-core could be adjusted up and down, for example between 3× and 1× or between 2× and 0× or the like, when the higher frequency (and therefore higher power consumption and heat generation) is not needed. Also, the processor may be subjected to clock throttling or clock stop, in whole or in part. Or, the I/O clock might not be a constant frequency, in which case the other clocks may either scale accordingly, or they may implement some form of adaptive P/Q clocking scheme to maintain their desired performance level.
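  • The ratio arithmetic described above can be sketched as follows. This is only an illustration: the function name `derived_clock_mhz`, the 100 MHz I/O clock, and the 3× and 5/2 ratios are assumptions chosen for the example and do not come from the disclosure.

```python
from fractions import Fraction


def derived_clock_mhz(base_mhz, ratio):
    """Frequency produced by a clock mult/div unit that applies `ratio` to `base_mhz`."""
    return Fraction(base_mhz) * Fraction(ratio)


# Hypothetical numbers, for illustration only.
io_clock = 100                                         # I/O ring clock, MHz
outer = derived_clock_mhz(io_clock, 3)                 # latency-tolerant core at 3x
inner = derived_clock_mhz(outer, Fraction(5, 2))       # non-integer 5/2 ratio for the sub-core
inner_throttled = derived_clock_mhz(outer, 1)          # ratio lowered to 1x when speed is not needed
```

  • Exact fractions are used so that a non-integer ratio such as 5/2 does not introduce rounding error when ratios are chained across several boundaries.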
  • FIG. 6 illustrates somewhat more detail about one embodiment of the contents of the latency-critical execution sub-core 260 of FIG. 4. (It may also be understood to illustrate the contents of the sub-core 205 of FIG. 3 or the sub-core 255 of FIG. 4.) The latency-tolerant execution core 210 includes components which are not latency-sensitive, but which are dependent only upon some level of throughput. In this sense, the latency-tolerant components may be thought of as the “plumbing” whose job is simply to provide a particular “gallons per minute” throughput, in which a “big pipe” is as good as a “fast flow”.
  • For example, in some architectures the fetch and decode units may not be terribly demanding on execution latency, and may thus be put in the latency-tolerant core 210 rather than the latency-intolerant sub-core 205, 255, 260. Likewise, the microcode and register file may not need to be in the sub-core. In some architectures (or microarchitectures), the most latency-sensitive pieces are the arithmetic/logic functions and the cache. In the mode shown in FIG. 6, only a subset of the arithmetic/logic functions are deemed to be sufficiently latency-sensitive that it is warranted to put them into the sub-core, as illustrated by critical ALU 300.
  • In some embodiments, the critical ALU functions include adders, subtractors, and logic units for performing AND, OR, and the like. In some embodiments which use index register addressing, such as the Intel Architecture, the critical ALU functions may also include a small, special-purpose shifter for doing address generation by scaling the index register. In some embodiments, the register file may reside in the latency-critical execution core, for design convenience; the faster the core section the register file is in, the fewer ports the register file needs.
  • The functions which are generally more latency-sensitive than the plumbing are those portions which are of a recursive nature, or those which include a dependency chain. Execution is a prime example of this concept; execution tends to be recursive or looping, and includes both false and true data dependencies both between and within iterations and loops.
  • Current art in high performance computer design (e.g. the Pentium Pro processor) already exploits most of the readily exploitable parallelism in a large class of important low P programs. It becomes extraordinarily difficult or even practically impossible to greatly increase P for these programs. In this case there is no alternative to reducing the average latency if it is desired to build a processor to run these programs faster.
  • On the other hand, there are certain other functions, such as instruction decode or register renaming. While it is essential that these functions are performed, current art has it arranged that the elapsed time for performing these functions may have an effect on performance only when a branch has been mispredicted. A branch is mispredicted typically once in fifty instructions on average. Hence one nanosecond longer to do decoding or register renaming provides the equivalent of a 1/50 nanosecond increase in average instruction execution latency, while a one nanosecond increase in the time to execute an instruction increases the average instruction latency by one nanosecond. We may conclude that the time it takes to decode instructions or rename registers, for example, is significantly less critical than the time it takes to execute instructions.
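  • The leverage argument above is simple arithmetic: the added average latency is the extra time multiplied by how often that time is exposed per instruction. The sketch below restates it, using the one-in-fifty misprediction rate given in the text; the function name `added_average_latency_ns` is hypothetical.

```python
def added_average_latency_ns(extra_ns, exposures_per_instruction):
    """Average per-instruction latency added by a stage that is `extra_ns` slower.

    `exposures_per_instruction` is the fraction of instructions for which the
    extra time is actually visible (1.0 means every instruction).
    """
    return extra_ns * exposures_per_instruction


# Decode/rename time is exposed only after a mispredicted branch (about 1 in 50).
decode_penalty = added_average_latency_ns(1.0, 1 / 50)
# Execution time is exposed on every instruction.
execute_penalty = added_average_latency_ns(1.0, 1.0)
```

  • Under these figures, one nanosecond added to execution costs about fifty times as much average latency as one nanosecond added to decode or rename.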
  • There are still other functions that must be performed in a processor. Many of these functions are even more highly leveraged than decoding and register renaming. For these functions 1 nsec increase in the time to perform them may add even less than 1/50 nanoseconds to the average execution latency. We may conclude that the time it takes to do these functions is even less critical.
  • As shown, the other ALU functions 305 can be relegated to the less speedy core 210. Further, in the mode shown in FIG. 6, only a subset of the cache needs to be inside the sub-core. As illustrated, only the data storage portion 310 of the cache is inside the sub-core, while the hit/miss logic and tags are in the slower core 210. This is in contrast to the conventional wisdom, which is that the hit/miss signal is needed at the same time as the data. A recent paper implied that the hit/miss signal is the limiting factor on cache speed (Austin, Todd M., Dionisios N. Pnevmatikatos, and Gurindar S. Sohi, “Streamlining Data Cache Access with Fast Address Calculation”, Proceedings of the 22nd Annual International Symposium on Computer Architecture, Jun. 18-24, 1995, Session 8, No. 1, page 5). Unfortunately, hit/miss determination is more difficult and more time-consuming than the simple matter of reading data contents from cache locations.
  • Further, the instruction cache (not shown) may be entirely in the core 210, such that the cache 310 stores only data. The instruction cache (Icache) is accessed speculatively. It is the business of branch prediction to predict where the flow of the program will go, and the Icache is accessed on the basis of that prediction. Branch prediction methods commonly used today can predict program flow without ever seeing the instructions in the Icache. If such a method is used, then the Icache is not latency-sensitive, and becomes more bandwidth-constrained than latency-constrained, and can be relegated to a lower clock frequency portion of the execution core.
  • The branch prediction itself could be latency-sensitive, so it would be a good candidate for a fast cycle time in one of the inner sub-core sections.
  • At first glance, one might think that the innermost sub-core 205, 255, or 260 of FIG. 6 would therefore hold the data which is stored at the top of the memory hierarchy of FIG. 2, that is, the data which is stored in the registers. However, as is illustrated in FIG. 6, the register file need not be contained within the sub-core, but may, instead, be held in the less speedy portion of the core 210. In the mode of FIG. 3 or 4, the register file may be stored in any of the core sections 205, 210, 255, 260, as suits the particular embodiment chosen. As shown in FIG. 6, the reason that the register file is not required to be within the innermost core is that the data which result from operations performed in the critical ALU 300 are available on a bypass bus 315 as soon as they are calculated. By appropriate operation of multiplexors (in any conventional manner), these data can be made available to the critical ALU 300 in the next clock cycle of the sub-core, far sooner than they could be written to and then read from the register file.
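  • The bypass path described above can be pictured as an operand multiplexor that prefers a freshly computed result over the register file. The following is a minimal software sketch under stated assumptions: the function `select_operand`, its parameters, and the dictionary-based register file are hypothetical and stand in for hardware that the text describes only as conventional multiplexors.

```python
def select_operand(src_reg, bypass_dst, bypass_value, register_file):
    """Pick an operand for the critical ALU.

    If the bypass bus currently carries the result destined for `src_reg`,
    use it directly; otherwise fall back to the (slower) register file.
    """
    if bypass_dst is not None and bypass_dst == src_reg:
        return bypass_value          # computed in the previous sub-core cycle
    return register_file[src_reg]    # value already written back earlier


# Register 2 has just been recomputed as 99 but the register file still holds 7.
register_file = {1: 10, 2: 7}
from_bypass = select_operand(2, bypass_dst=2, bypass_value=99, register_file=register_file)
from_file = select_operand(1, bypass_dst=2, bypass_value=99, register_file=register_file)
```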
  • Similarly, if data speculation is permitted, that is, if the critical ALU is allowed to perform calculations upon operands which are not yet known to be valid, portions of the data cache need not reside within the innermost sub-core. In this mode, the data cache 310 holds only the actual data, while the hit/miss logic and cache tags reside in a slower portion 210 of the core. In this mode, data from the data cache 310 are provided over an inner bus 320 and muxed into the critical ALU, and the critical ALU performs operations assuming those data to be valid.
  • Some number of clock cycles later, the hit/miss logic or the tag logic in the outer core may signal that the speculated data is, in fact, invalid. In this case, there must be a means provided to recover from the speculative operations which have been performed. This includes not only the specific operations which used the incorrect, speculated data as input operands, but also any subsequent operations which used the outputs of those specific operations as inputs. Also, the erroneously generated outputs may have subsequently been used to determine branching operations, such as if the erroneously generated output is used as a branch address or as a branch condition. If the processor performs control speculation, there may have also been errors in that operation as well.
  • The present invention provides a replay mechanism for recovering from data speculation upon data which ultimately prove to have been incorrect. In one mode, the replay mechanism may reside outside the innermost core, because it is not terribly latency-critical. While the replay architecture is described in conjunction with a multiple-clock-speed execution engine which performs data speculation, it will be appreciated that the replay architecture may be used with a wide variety of architectures and micro-architectures, including those which perform data speculation and those which do not, those which perform control speculation and those which do not, those which perform in-order execution and those which perform out-of-order execution, and so forth.
  • FIG. 7 illustrates one implementation of such a replay architecture, generally showing the data flow of the architecture. First, an instruction is fetched into the instruction cache.
  • From the instruction cache, the instruction proceeds to a renamer such as a register alias table. In sophisticated microarchitectures which permit data speculation and/or control speculation, it is highly desirable to decouple the actual machine from the specific registers indicated by the instruction. This is especially true in an architecture which is register-poor, such as the Intel Architecture. Renamers are well known, and the details of the renamer are not particularly germane to an understanding of the present invention. Any conventional renamer will suffice. It is desirable that it be a single-valued and single-assignment renamer, such that each instance of a given instruction will write to a different register, although the instruction specifies the same register. The renamer provides a separate storage location for each different value that each logical register assumes, so that no such value of any logical register is prematurely lost (i.e. before the program is through with that value), over a well-defined period of time.
  • From the renamer, the instruction proceeds to an optional scheduler such as a reservation station, where instructions are reordered to improve execution efficiency. The scheduler is able to detect when it is not allowed to issue further instructions. For example, there may not be any available execution slots into which a next instruction could be issued. Or, another unit may for some reason temporarily disable the scheduler. In some embodiments, the scheduler may reside in the latency-critical execution core, if the particular scheduling algorithm can schedule only single latency generation per cycle, and is therefore tied to the latency of the critical ALU functions.
  • From the renamer or the optional scheduler, the instruction proceeds to the execution core 205, 210, 255, 260 (indirectly through a multiplexor to be described below), where it is executed. After or simultaneous with its execution, an address associated with the instruction is sent to the translation lookaside buffer (TLB) and cache tag lookup logic (TAG). This address may be, for example, the address (physical or logical) of a data operand which the instruction requires. From the TLB and TAG logic, the physical address referenced and the physical address represented in the cache location accessed are passed to the hit/miss logic, which determines whether the cache location accessed in fact contained the desired data.
  • In one mode, if the instruction being executed reads memory, the execution logic gives the highest priority to generating perhaps only a portion of the address, but enough that data may be looked up in the high speed data cache. In this mode, this partial address is used with the highest priority to retrieve data from the data cache, and only as a secondary priority is a complete virtual address, or in the case of the Intel Architecture, a complete linear address, generated and sent to the TLB and cache TAG lookup logic.
  • Because the critical ALU functions and the data cache are in the innermost sub-core—or are at least in a portion of the processor which runs at a higher clock rate than the TLB and TAG logic and the hit/miss logic—some data will have already been obtained from the data cache and the processor will have already speculatively executed the instruction which needed that data, the processor having assumed the data that was obtained to have been correct, and the processor likely having also executed additional instructions using that data or the results of the first speculatively executed instruction.
  • Therefore, the replay architecture includes a checker unit which receives the output of the hit/miss logic. If a miss is indicated, the checker causes a “replay” of the offending instruction and any which depended on it or which were otherwise incorrect as a result of the erroneous data speculation. When the instruction was handed from the reservation station to the execution core, a copy of it was forwarded to a delay unit which provides a delay latency which matches the time the instruction will take to get through the execution core, TLB/TAG, and hit/miss units, so that the copy arrives at the checker at about the same time that the hit/miss logic tells the checker that the data speculation was incorrect. In one mode, this is roughly 10-12 clocks of the inner core. In FIG. 7, the delay unit is shown as being outside the checker. In other embodiments, the delay unit may be incorporated as a part of the checker. In some embodiments, the checker may reside within the latency-critical execution core, if the checking algorithm is tied to the critical ALU speed.
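  • The role of the delay unit can be modeled as a fixed-length pipeline: a copy of each dispatched instruction enters it and emerges a fixed number of cycles later, when the hit/miss result for that instruction reaches the checker. The class name `DelayUnit` and the three-cycle latency in the usage lines are assumptions for illustration; the text gives roughly 10-12 inner-core clocks for one mode.

```python
from collections import deque


class DelayUnit:
    """Fixed-latency delay line for instruction copies."""

    def __init__(self, latency_cycles):
        # Pre-fill with empty slots so the first outputs are "no instruction".
        self.pipe = deque([None] * latency_cycles)

    def tick(self, instruction=None):
        """Advance one cycle: accept a copy and emit the one that entered
        `latency_cycles` cycles earlier (or None if that slot was empty)."""
        self.pipe.append(instruction)
        return self.pipe.popleft()


delay = DelayUnit(latency_cycles=3)
outputs = [delay.tick("load A"), delay.tick(), delay.tick(), delay.tick()]
```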
  • When the checker determines that data speculation was incorrect, the checker sends the copy of the instruction back around for a “replay”. The checker forwards the copy of the instruction to a buffer unit. It may happen as an unrelated event that the TLB/TAG unit informs the buffer that the TLB/TAG is inserting a manufactured instruction in the current cycle. This information is needed by the buffer so the buffer knows not to reinsert another instruction in the same cycle. Both the TLB/TAG and the buffer also inform the scheduler when they are inserting instructions, so the scheduler knows not to dispatch an instruction in that same cycle. These control signals are not shown but will be understood by those skilled in the art.
  • The buffer unit provides latching of the copied instruction, to prevent it from getting lost if it cannot immediately be handled. In some embodiments, there may be conditions under which it may not be possible to reinsert replayed instructions immediately. In these conditions, the buffer holds them (perhaps a large number of them) until they can be reinserted. One such condition may be that there may be some higher priority function that could claim execution, such as when the TLB/TAG unit needs to insert a manufactured instruction, as mentioned above. In some other embodiments, the buffer may not be necessary.
  • Earlier, it was mentioned that the scheduler's output was provided to the execution core indirectly, through a multiplexor. The function of this multiplexor is to select among several possible sources of instructions being sent for execution. The first source is, of course, the scheduler, in the case when it is an original instruction which is being sent for execution. The second source is the buffer unit, in the case when it is a copy of an instruction which is being sent for replay execution. A third source is illustrated as being from the TLB/TAG unit; this permits the architecture to manufacture “fake instructions” and inject them into the instruction stream. For example, the TLB logic or TAG logic may need to get another unit to do some work for them, such as to read some data from the data cache as might be needed to evict that data, or for refilling the TLB, or other purposes, and they can do this by generating instructions which did not come from the real instruction stream, and then inserting those instructions back at the multiplexor input to the execution core.
  • The mux control scheme may, in one mode, include a priority scheme wherein a replay instruction has higher priority than an original instruction. This is advantageous because a replay instruction is probably older than the original instruction in the original macroinstruction flow, and may be a “blocking” instruction such as if there is a true data dependency.
  • It is desirable to get replayed instructions finished as quickly as possible. As long as there are unresolved instructions sent to replay, new instructions that are dispatched have a fairly high probability of being dependent on something unresolved and therefore of just getting added to the list of instructions that need to be replayed. As soon as it is necessary to replay one instruction, that one instruction tends to grow a long train of instructions behind it that follows it around. The processor can quickly get in a mode where most instructions are getting executed two or three times, and such a mode may persist for quite a while. Therefore, resolving replayed instructions is very much preferable to introducing new instructions.
  • Each new instruction introduced while there are things to replay is a gamble. There is a certain probability the new instruction will be independent and some work will get done. On the other hand, there is a certain probability that the new instruction will be dependent and will also need to be replayed. Worse, there may be a number of instructions to follow that will be dependent on the new instruction, and all of those will have to be replayed, too, whereas if the machine had waited until the replays were resolved, then all of these instructions would not have to execute twice.
  • In one mode, a manufactured instruction may have higher priority than a replay instruction. This is advantageous because these manufactured instructions may be used for critically important and time-sensitive operations. One such sensitive operation is an eviction. After a cache miss, new data will be coming from the L1 cache. When that data arrives, it must be put in the data cache (L0) as quickly as possible. If that is done, the replayed load will just meet the new data and will now be successful. If the new data arrives even one cycle late, the replayed load will pass again too soon and must again be replayed. Unfortunately, the data cache location where the processor is going to put the data is now holding the one and only copy of some data that was written some time ago. In other words, the location is “dirty”. It is necessary to read the dirty data out, to save it before the new data arrives and is written in its place. This reading of the old data is called “evicting” the data. In some embodiments, there is just exactly enough time to complete the eviction before starting to write the new data in its place. The eviction is done with one or more manufactured instructions. If they are held up for even one cycle, the eviction does not occur in time to avoid the problem described above, and therefore they must be given the highest priority.
  • The replay architecture may also be used to enable the processor to in effect “stall” without actually slowing down the execution core or performing clock throttling or the like. There are some circumstances where it would be necessary to stall the frontend and/or execution core, to avoid losing the results of instructions or to avoid other such problems. One example is where the processor's backend temporarily runs out of resources such as available registers into which to write execution results. Other examples include where the external bus is blocked, an upper level of cache is busy being snooped by another processor, a load or store crosses a page boundary, an exception occurs, or the like.
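  • The priority scheme described for the multiplexor in this mode (manufactured instructions first, then replayed instructions, then original instructions from the scheduler) can be sketched as below. The function name `select_instruction` and the use of `None` for an idle source are assumptions made for the example.

```python
def select_instruction(manufactured, replay, original):
    """Return (source, instruction) for the highest-priority pending source.

    Priority order in this mode: manufactured > replay > original.
    A source with nothing to issue this cycle is passed as None.
    """
    for source, instruction in (("manufactured", manufactured),
                                ("replay", replay),
                                ("original", original)):
        if instruction is not None:
            return source, instruction
    return None  # nothing to issue this cycle


choice_all = select_instruction("evict line", "load r2", "add r4")
choice_no_manufactured = select_instruction(None, "load r2", "add r4")
choice_original_only = select_instruction(None, None, "add r4")
```

  • In hardware, the losing sources would also need to be told to hold their instruction for a later cycle, which corresponds to the control signals to the buffer and scheduler mentioned earlier.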
  • In such circumstances, rather than halt the frontend or throttle the execution core, the replay architecture may very simply be used to send back around for replay all instructions whose results would be otherwise lost. The execution core remains functioning at full speed, and there are no additional signal paths required for stalling the frontend, beyond those otherwise existing to permit the multiplexor to give priority to replay instructions over original instructions.
  • Other stall-like uses can be made of the replay architecture. For example, assume that a store address instruction misses in the TLB. Rather than saving the linear address to process after getting the proper entry in the TLB, the processor can just drop it on the floor and request the store address instruction to be replayed. As another example, the Page Miss Handler (not shown) may be busy. In this case the processor does not even remember that it needs to do a page walk, but finds that out over again when the store address comes back.
  • Most cases of running out of resources occur when there is a cache miss. There could well be no fill buffer left, so the machine can't even request an L1 lookup. Or, the L1 may be busy. When a cache miss happens, the machine MAY ask for the data from a higher level cache or MAY just forget the whole thing and not do anything at all to help the situation. In either case, the load (or store address) instruction is replayed. Unlike a more conventional architecture, the present invention does not NEED to remember this instruction in the memory subsystem and take care of it. The processor will do something to help it if it has the resources to do something. If not, it may do nothing at all, not even remember that such an instruction was seen by the memory subsystem. The memory subsystem, by itself, will never do anything for this instance of the instruction. When the instruction executes again, then it is considered all over again. In the case of a store address instruction, the instruction has delivered its linear address to the memory subsystem and it doesn't want anything back. A more conventional approach might be to say that this instruction is done, and any problems from here on out are memory subsystem problems, in which case the memory subsystem must then store information about this store address until it can get resources to take care of it. The present approach is that the store address replays, and the memory subsystem does not have to remember it at all. Here it is a little more clear that the processor is replaying the store address specifically because of inability to handle it in the memory subsystem.
  • In one mode, when an instruction gets replayed, all dependent instructions also get replayed. This may include all those which used the replayed instruction's output as input, all those which are down control flow branches picked according to the replayed instruction, and so forth.
  • The processor does not replay instructions merely because they are control flow dependent on an instruction that replayed. The thread of control was predicted. The processor is always following a predicted thread of control and never necessarily knows during execution if it is going the right way or not. If a branch gets bad input, the branch instruction itself is replayed. This is because the processor cannot reliably determine from the branch if the predicted thread of control is right or not, since the input data to the branch was not valid. No other instructions get replayed merely because the branch got bad data. Eventually, possibly after many replays, the branch will be correctly executed. At this time, it does what all branches do: it reports if the predicted direction taken for this branch was correct or not. If it was correctly predicted, everything goes on about its business. If it was not correctly predicted, then there is simply a branch misprediction; the fact that this branch was replayed any number of times makes no difference. A mispredicted branch cannot readily be repaired with a replay. A replay can only execute exactly the same instructions over again. If a branch was mispredicted, the processor has likely done many wrong instructions and needs to actually execute some completely different instructions.
  • To summarize: An instruction is replayed either: 1) because the instruction itself was not correctly processed for any reason, or 2) if the input data that this instruction uses is not known to be correct. Data is known to be correct if it is produced by an instruction that is itself correctly processed and all of its input data is known to be correct. In this definition, branches are viewed not as having anything to do with the control flow but as data handling instructions which simply report interesting things to the front end of the machine but do not produce any output data that can be used by any other instruction. Hence, the correctness of any other instruction cannot have anything to do with them. The correctness of the control flow is handled by a higher authority and is not in the purview of mere execution and replay.
  • FIG. 8 illustrates more about the checker unit. Again, an instruction is replayed if: 1) it was not processed correctly, or 2) if it used input data that is not known to be correct. These two conditions give a good division for discussing the operation of the checker unit. The first condition depends on everything that needs to be done for the instruction. Anything in the machine that needs to do something to correctly execute the instruction is allowed to goof and to signal to the checker that it goofed. The first condition is therefore talking about signals that come into the checker, potentially from many places, that say, “I goofed on this instruction.”
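  • The two-part replay rule summarized above can be applied to a short instruction sequence as follows. This is a software sketch only: the function `find_replays`, the tuple layout, and the register names are hypothetical, and a single in-order pass stands in for what the hardware checker does per instruction. A branch is modeled as an instruction with no destination register.

```python
def find_replays(instructions, known_correct):
    """Return the names of instructions that must be replayed.

    instructions: list of (name, sources, dest, processed_ok) in execution order.
    dest is None for instructions, such as branches, that produce no output data.
    known_correct: registers whose contents are already known to be correct.
    """
    known = set(known_correct)
    replays = []
    for name, sources, dest, processed_ok in instructions:
        inputs_ok = all(src in known for src in sources)
        if processed_ok and inputs_ok:
            if dest is not None:
                known.add(dest)      # result is now known-correct data
        else:
            replays.append(name)     # rule 1 or rule 2 applies
    return replays


program = [
    ("load",   ["r1"],       "r2", False),  # cache miss: not processed correctly
    ("add",    ["r2", "r3"], "r4", True),   # uses the load's unverified result
    ("sub",    ["r1", "r3"], "r5", True),   # independent of the load
    ("branch", ["r4"],       None, True),   # bad input: the branch itself replays
    ("mul",    ["r5", "r1"], "r6", True),   # follows the branch, but data is good
]
replayed = find_replays(program, known_correct={"r1", "r3"})
```

  • Note that `mul` is not replayed even though it follows the replayed branch, which matches the statement that control flow dependence alone does not cause a replay.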
  • In some embodiments, the most common goof is the failure of the data cache to supply the correct result for a load. This is signaled by the hit/miss logic. Another common goof is failure to correctly process a store address; this would typically result from a TLB miss on a store address, but there can be other causes, too. In some embodiments, the L1 cache may deliver data (which may go into the L0 cache and be used by instructions) that contains an ECC error. This would be signaled quickly, and then corrected as time permits.
  • In some fairly rare cases, the adder cannot correctly add two numbers. This is signaled by the flag logic which keeps tabs on the adders. In some other rare cases, the logic unit fails to get the correct answer when doing an AND, XOR, or other simple logic operation. These, too, are signaled by the flag logic. In some embodiments, the floating point unit may not get the correct answer all of the time, in which case it will signal when it goofs a floating point operation. In principle, you could use this mechanism for many types of goofs. It could be used for algorithmic goofs and it could even be used for hardware errors (circuit goofs). Regardless of the cause, whenever the processor doesn't do exactly what it is supposed to do, and the goof is detected, the processor's various units can request a replay by signaling to the checker.
  • The second condition which causes replays (whether data is known to be correct) is entirely the responsibility of the checker itself. The checker contains the official list of what data is known to be correct. It is what is sometimes called the “scoreboard”. It is the checker's responsibility to look at all of the input data for each instruction execution instance and to determine if all such input data is known to be correct or not. It is also the checker's responsibility to add it all up for each instruction execution instance, to determine if the result produced by that instruction execution instance can therefore be deemed to be “known to be correct”. If the result of an instruction is deemed “known to be correct”, this is noted on the scoreboard so the processor now has new, known-correct data that can be the input for other instructions.
  • FIG. 8 illustrates one exemplary checker which may be employed in practicing the architecture of the present invention. Because the details of the checker are not necessary in order to understand the invention, a simplified checker is illustrated to show the requirements for a checker sufficient to make the replay system work correctly.
  • In this embodiment, one instruction is processed per cycle. After an instruction has been executed, it is represented to the checker by signals OP1, OP1V, OP2, OP2V, DST, and a latency vector which was assigned to the uop by the decoder on the basis of the opcode. The signals OP1V and OP2V indicate whether the instruction includes a first operand and a second operand, respectively. The signals OP1 and OP2 identify the physical source registers of the first and second operands, respectively, and are received at read address ports RA1 and RA2 of the scoreboard. The signal DST identifies the physical destination register where the result of the instruction was written.
  • The latency vector has all 0's except a 1 in one position. The position of the 1 denotes the latency of this instruction. An instruction's latency is how many cycles there are after the instruction begins execution before another instruction can use its result. The scoreboard has one bit of storage for each physical register in the machine. The bit is 0 if that register is not known to contain correct data and it is 1 if that register is known to contain correct data.
  • The register renamer, described above, allocates these registers. At the time a physical register is allocated to hold the result of some instruction, the renamer sends the register number to the checker as multiple-bit signal CLEAR. The scoreboard sets to 0 the scoreboard bit which is addressed by CLEAR.
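The scoreboard behavior described so far can be illustrated with a minimal software sketch. This is only a model under stated assumptions, not the circuit of FIG. 8: the class name `Scoreboard`, its method names, and the register count of 8 are hypothetical and chosen purely for illustration.

```python
class Scoreboard:
    """One bit per physical register: 1 = known correct, 0 = not known correct."""

    def __init__(self, num_physical_registers):
        # Assumption: every bit starts at 0 (nothing is known correct yet).
        self.bits = [0] * num_physical_registers

    def clear(self, reg):
        """Model the CLEAR signal sent by the renamer when `reg` is allocated."""
        self.bits[reg] = 0

    def write(self, reg, instruction_ok):
        """Model the write port: record whether the result in `reg` is known correct."""
        self.bits[reg] = 1 if instruction_ok else 0

    def read(self, reg):
        """Model a read address port (RA1 or RA2): return the bit for `reg`."""
        return self.bits[reg]


sb = Scoreboard(num_physical_registers=8)
sb.write(5, True)   # an earlier result in register 5 was checked and found OK
sb.clear(5)         # the renamer reallocates register 5 to a new instruction
```

After the `clear` call, register 5 is again treated as not known to be correct until the new instruction that writes it has passed the checker.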
  • The one or two register operands for the instruction currently being checked (as indicated by OP1 and OP2) are looked up in the scoreboard to see if they are known to be correct, and the results are output as scoreboard values SV1 and SV2, respectively. An AND gate 350 receives the first scoreboard value SV1 and the first operand valid signal OP1V. Another AND gate 355 similarly receives signals SV2 and OP2V for the second operand. The operand valid signals OP1V and OP2V cause the scoreboard values SV1 and SV2 to be ignored if the instruction does not actually require those respective operands.
  • The outputs of the AND gates are provided to NOR gate 360, along with an external replay request signal. The output of the NOR gate will be false if either operand is required by the instruction and is not known to be correct, or if the external replay request signal is asserted. Otherwise the output will be true. The output of the NOR gate 360 is the checker output INSTRUCTION OK. If it is true, the instruction was completed correctly and is ready to be considered for retirement. If it is false, the instruction must be replayed.
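The gate logic above reduces to a small boolean function, sketched here for illustration. The function name `instruction_ok` is hypothetical. The sketch also assumes that each scoreboard value enters its AND gate inverted, since that is what makes the NOR output match the behavior the text describes (false when a required operand is not known to be correct).

```python
def instruction_ok(sv1, op1v, sv2, op2v, external_replay_request):
    """Return True if the instruction may be considered for retirement.

    sv1, sv2 : scoreboard values for the two source registers
    op1v, op2v : operand valid signals (does the instruction use that operand?)
    external_replay_request : True if any unit asked for a replay
    """
    # Assumption: the AND gates see the inverted scoreboard value, so each
    # gate output means "this operand is required but not known correct".
    bad_operand_1 = bool(op1v) and not bool(sv1)
    bad_operand_2 = bool(op2v) and not bool(sv2)
    # NOR gate: true only when none of its inputs is asserted.
    return not (bad_operand_1 or bad_operand_2 or bool(external_replay_request))


# Operand 2 is not known correct, but the instruction does not use it.
ok = instruction_ok(sv1=1, op1v=1, sv2=0, op2v=0, external_replay_request=0)
```

In this example `ok` is true, because the operand valid signal masks the unknown second operand.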
  • A delay line receives the destination register identifier DST and the checker output INSTRUCTION OK information for the instruction currently being checked. The simple delay line shown is constructed of registers (single cycle delays) and muxes. It will be understood that each register and mux is a multiple-bit device, or represents multiple single-bit devices. Those skilled in the art will understand that various other types of delay lines, and therefore different formats of latency vectors, could be used.
  • The DST and INSTRUCTION OK information is inserted in one location of the delay line, as determined by the value of the latency vector. This information is delayed for the required number of cycles according to the latency vector, and then it is applied to the write port WP of the scoreboard. The scoreboard bit corresponding to the destination register DST for the instruction is then written according to the value of INSTRUCTION OK. A value of 1 indicates that the instruction did not have to be replayed, and a value of 0 indicates that the instruction did have to be replayed, meaning that its result data is not known to be correct.
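A rough software model of the delay line may help make the timing concrete. It is a sketch under several assumptions that are not taken from the text: the class name `CheckerDelayLine` is hypothetical, position 0 of the latency vector is taken to be the stage nearest the write port, an all-zero vector is simplified to "insert nothing", and a newly inserted entry simply overwrites whatever occupies that stage.

```python
class CheckerDelayLine:
    """Delays (DST, INSTRUCTION OK) pairs before they reach the scoreboard."""

    def __init__(self, num_stages, scoreboard_bits):
        self.stages = [None] * num_stages       # stages[0] feeds the write port
        self.scoreboard_bits = scoreboard_bits  # list with one bit per register

    def step(self, dst, instruction_ok, latency_vector):
        """Advance one cycle and insert the instruction currently being checked."""
        if len(latency_vector) != len(self.stages) or sum(latency_vector) > 1:
            raise ValueError("latency vector must be one-hot (or all zeros)")
        # 1. The entry leaving the line is written into the scoreboard.
        leaving = self.stages[0]
        if leaving is not None:
            reg, ok = leaving
            self.scoreboard_bits[reg] = 1 if ok else 0
        # 2. Every remaining entry moves one stage closer to the write port.
        self.stages = self.stages[1:] + [None]
        # 3. The new entry enters at the stage selected by the latency vector.
        if any(latency_vector):
            self.stages[latency_vector.index(1)] = (dst, instruction_ok)


bits = [0] * 8
line = CheckerDelayLine(num_stages=4, scoreboard_bits=bits)
line.step(dst=3, instruction_ok=True, latency_vector=[0, 0, 1, 0])
for _ in range(3):
    line.step(dst=0, instruction_ok=False, latency_vector=[0, 0, 0, 0])
```

In this model, an entry inserted at position 2 reaches the scoreboard on the third following cycle, so the bit for register 3 is set to 1 by the end of the loop.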
  • In this design, it is assumed that no instruction has physical register zero as a real destination or as a real source. If there is no valid instruction in some cycle, the latency vector for that cycle will be all zeros. This will effectively enter physical register zero with the longest possible latency into the delay line, which is harmless. Similarly, an instruction that does not have a real destination register will specify a latency vector of all zeros. It is further assumed that at startup, this unit runs for several cycles with no valid instructions arriving, so as to fill the delay line with zeros before the first real instruction has been allocated a destination register, and hence before the corresponding bit in the scoreboard has been cleared. The scoreboard needs no additional initialization.
  • Potentially, this checker checks one instruction per cycle (but other embodiments are of course feasible). The cycle in which an instruction is checked is a fixed number of cycles after that instruction began execution and captured the data that it used for its operands. This number of cycles is sufficient to allow the EXTERNAL REPLAY REQUEST signal for the instruction to arrive at the checker, to be processed along with the other information about the instruction. The EXTERNAL REPLAY REQUEST signal is the OR of all signals from whatever parts of the machine may produce replay requests that indicate that the instruction was not processed correctly. For example, it may indicate that data returned from the data cache may not have been correct, for any of a number of reasons, a good example being a cache miss.
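As a small illustration, the EXTERNAL REPLAY REQUEST signal can be modeled as the OR of per-unit request flags. The function name and the dictionary keys below are hypothetical labels for units mentioned in the description, not signal names from the figures.

```python
def external_replay_request(unit_requests):
    """OR together the replay requests raised by the individual units."""
    return any(unit_requests.values())


# Hypothetical per-unit flags for one instruction at its check cycle.
requests = {
    "adder_flag_logic": False,
    "floating_point_unit": False,
    "data_cache_miss": True,
}
replay = external_replay_request(requests)
```

Here a single asserted request, the cache miss, is enough to force a replay of the instruction.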
  • It should be appreciated by the skilled reader that the particular partitionings described above are illustrative only. For example, although it has been suggested that certain features may be relegated to the outermost core 210, it may be desirable that certain of these reside in a mid-level portion of the core, such as in the latency-intolerant core 255 of FIG. 4, between the outermost core 210 and the innermost core 260. It should also be appreciated that although the invention has been described with reference to the Intel Architecture processors, it is useful in any number of alternative architectures, and with a wide variety of microarchitectures within each.
  • While the invention has been described with reference to specific modes and embodiments, for ease of explanation and understanding, those skilled in the art will appreciate that the invention is not necessarily limited to the particular features shown herein, and that the invention may be practiced in a variety of ways which fall under the scope and spirit of this disclosure. The invention is, therefore, to be afforded the fullest allowable scope of the claims which follow.

Claims (24)

We claim:
1-64. (canceled)
65. An integrated circuit comprising:
logic to perform input/output (I/O) operations at a first frequency;
an arithmetic logic unit (ALU) to operate at a second frequency; and
a floating-point unit (FPU) to operate at a third frequency, the third frequency being different than the second frequency.
66. The integrated circuit of claim 65, wherein the third frequency is half of the second frequency.
67. The integrated circuit of claim 66, further comprising an integer register file coupled to the ALU to operate at the second frequency and a floating point register file coupled to the FPU to operate at the third frequency.
68. The integrated circuit of claim 67, further comprising:
an instruction cache to cache fetched instructions;
a renamer unit to rename specific registers indicated by instructions;
a scheduler unit to reorder instructions; and
a look-aside buffer to provide physical addresses of data operands;
the instruction cache, renamer unit, scheduler unit, and look-aside buffer to operate at a fourth frequency.
69. The integrated circuit of claim 68, wherein the fourth frequency is the same as the second frequency.
70. The integrated circuit of claim 68, wherein the fourth frequency is slower than the third frequency.
71. The integrated circuit of claim 65, wherein I/O operations are selected from a group consisting of buffering data, buffering instructions, receiving data, receiving instructions, parity checking, and communicating with external devices.
72. The integrated circuit of claim 65, wherein the third frequency is substantially 0 MHz when the FPU is powered down.
73. The integrated circuit of claim 65, wherein the second frequency is substantially 0 MHz when the ALU is powered down.
74. An integrated circuit comprising:
logic to perform input/output (I/O) operations at a first frequency;
a first arithmetic logic unit (ALU), a first data cache, and a first register file to operate at a second clock frequency; and
a second ALU, a second register file, and a second data cache to operate at a third clock frequency, the third clock frequency being different than the second clock frequency.
75. The integrated circuit of claim 74, further comprising a floating-point unit (FPU) to operate at the third clock frequency.
76. The integrated circuit of claim 75, wherein the second ALU, second data cache, second register file, and the FPU are not nested within the first ALU, first data cache, and first register file.
77. The integrated circuit of claim 75, wherein the third clock frequency is faster than the second clock frequency.
78. The integrated circuit of claim 74, wherein the second clock frequency is a multiple of N of the third clock frequency.
79. The integrated circuit of claim 74, wherein the second clock frequency is substantially 0 when the first ALU, first data cache, and first register file are powered down, the third clock frequency being an integer multiple of the first clock frequency.
80. The integrated circuit of claim 74, further comprising:
a look-aside buffer operating at a fourth frequency, the look-aside buffer having a first partition dedicated to the first ALU, first data cache, and first register file and a second partition dedicated to the second ALU, second data cache, and second register file.
81. The integrated circuit of claim 74, further comprising:
a first look-aside buffer, a first renamer unit, a first scheduler unit, and a first hit/miss unit operating at the second frequency; and
a second look-aside buffer, a second renamer unit, a second scheduler unit, and a second hit/miss unit operating at the third frequency.
82. A microprocessor comprising:
a fetch unit and a decoder to operate at a first frequency;
a multiplier and a first shifter to operate at a second frequency; and
an adder and logic to perform AND and OR operations to operate at a third frequency, the third frequency being different from the second frequency.
83. The microprocessor of claim 82, wherein the first frequency is lower than the second frequency, and wherein the third frequency is an integer multiple of the second frequency.
84. The microprocessor of claim 83, wherein the third frequency is higher than the second frequency by a factor of 2.
85. The microprocessor of claim 84, wherein the second and third frequencies are not integer multiples of the first frequency.
86. The microprocessor of claim 84, further comprising:
a register file, the register file coupled to the adder and to the logic; and
a second shifter;
the register file and the second shifter to operate at the third frequency.
87. The microprocessor of claim 84, further comprising an instruction cache and a register file to operate at the first frequency.
US12/879,872 1996-11-13 2010-09-10 Processor having execution core sections operating at different clock rates Abandoned US20120042151A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/879,872 US20120042151A1 (en) 1996-11-13 2010-09-10 Processor having execution core sections operating at different clock rates

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US08/746,606 US5828868A (en) 1996-11-13 1996-11-13 Processor having execution core sections operating at different clock rates
US09/092,353 US6216234B1 (en) 1996-11-13 1998-06-05 Processor having execution core sections operating at different clock rates
US09/527,065 US6256745B1 (en) 1998-06-05 2000-03-16 Processor having execution core sections operating at different clock rates
US09/775,383 US6487675B2 (en) 1996-11-13 2001-02-02 Processor having execution core sections operating at different clock rates
US10/996,328 USRE44494E1 (en) 1996-11-13 2004-11-24 Processor having execution core sections operating at different clock rates
US12/879,872 US20120042151A1 (en) 1996-11-13 2010-09-10 Processor having execution core sections operating at different clock rates

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/996,328 Division USRE44494E1 (en) 1996-11-13 2004-11-24 Processor having execution core sections operating at different clock rates

Publications (1)

Publication Number Publication Date
US20120042151A1 true US20120042151A1 (en) 2012-02-16

Family

ID=25001559

Family Applications (3)

Application Number Title Priority Date Filing Date
US08/746,606 Expired - Lifetime US5828868A (en) 1996-11-13 1996-11-13 Processor having execution core sections operating at different clock rates
US09/092,353 Expired - Lifetime US6216234B1 (en) 1996-11-13 1998-06-05 Processor having execution core sections operating at different clock rates
US12/879,872 Abandoned US20120042151A1 (en) 1996-11-13 2010-09-10 Processor having execution core sections operating at different clock rates

Country Status (6)

Country Link
US (3) US5828868A (en)
AR (1) AR008322A1 (en)
AU (1) AU4669297A (en)
TW (1) TW351791B (en)
WO (1) WO1998021641A1 (en)
ZA (1) ZA979600B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140380081A1 (en) * 2013-06-25 2014-12-25 Alexander Gendler Restricting Clock Signal Delivery In A Processor
US9377836B2 (en) 2013-07-26 2016-06-28 Intel Corporation Restricting clock signal delivery based on activity in a processor
WO2023235004A1 (en) * 2022-06-02 2023-12-07 Micron Technology, Inc. Time-division multiplexed simd function unit

Families Citing this family (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6631454B1 (en) 1996-11-13 2003-10-07 Intel Corporation Processor and data cache with data storage unit and tag hit/miss logic operated at a first and second clock frequencies
US6256745B1 (en) * 1998-06-05 2001-07-03 Intel Corporation Processor having execution core sections operating at different clock rates
US6385715B1 (en) * 1996-11-13 2002-05-07 Intel Corporation Multi-threading for a processor utilizing a replay queue
US6163838A (en) * 1996-11-13 2000-12-19 Intel Corporation Computer processor with a replay system
US5828868A (en) * 1996-11-13 1998-10-27 Intel Corporation Processor having execution core sections operating at different clock rates
US6735688B1 (en) * 1996-11-13 2004-05-11 Intel Corporation Processor having replay architecture with fast and slow replay paths
US6161189A (en) * 1997-03-31 2000-12-12 International Business Machines Corporation Latch-and-hold circuit that permits subcircuits of an integrated circuit to operate at different frequencies
US6026497A (en) * 1997-12-23 2000-02-15 Sun Microsystems, Inc. System and method for determining the resolution of a granular clock provided by a digital computer and for using it to accurately time execution of computer program fragment by the digital computer
US6035389A (en) * 1998-08-11 2000-03-07 Intel Corporation Scheduling instructions with different latencies
US6535798B1 (en) * 1998-12-03 2003-03-18 Intel Corporation Thermal management in a system
US6519682B2 (en) * 1998-12-04 2003-02-11 Stmicroelectronics, Inc. Pipelined non-blocking level two cache system with inherent transaction collision-avoidance
US6304955B1 (en) * 1998-12-30 2001-10-16 Intel Corporation Method and apparatus for performing latency based hazard detection
US6629271B1 (en) * 1999-12-28 2003-09-30 Intel Corporation Technique for synchronizing faults in a processor having a replay system
US7100061B2 (en) 2000-01-18 2006-08-29 Transmeta Corporation Adaptive power control
US6880069B1 (en) * 2000-06-30 2005-04-12 Intel Corporation Replay instruction morphing
JP3450814B2 (en) * 2000-09-26 2003-09-29 松下電器産業株式会社 Information processing device
US6993669B2 (en) * 2001-04-18 2006-01-31 Gallitzin Allegheny Llc Low power clocking systems and methods
US6990598B2 (en) * 2001-03-21 2006-01-24 Gallitzin Allegheny Llc Low power reconfigurable systems and methods
TW548534B (en) * 2001-04-12 2003-08-21 Via Tech Inc Control method for frequency raising and reducing of central processing unit using neural network
US6990594B2 (en) * 2001-05-02 2006-01-24 Portalplayer, Inc. Dynamic power management of devices in computer system by selecting clock generator output based on a current state and programmable policies
US20040064678A1 (en) * 2002-09-30 2004-04-01 Black Bryan P. Hierarchical scheduling windows
US20040064679A1 (en) * 2002-09-30 2004-04-01 Black Bryan P. Hierarchical scheduling windows
US7886164B1 (en) 2002-11-14 2011-02-08 Nvidia Corporation Processor temperature adjustment system and method
US7882369B1 (en) 2002-11-14 2011-02-01 Nvidia Corporation Processor performance adjustment system and method
US7849332B1 (en) 2002-11-14 2010-12-07 Nvidia Corporation Processor voltage adjustment system and method
US7165167B2 (en) * 2003-06-10 2007-01-16 Advanced Micro Devices, Inc. Load store unit with replay mechanism
DE10349580A1 (en) * 2003-10-24 2005-05-25 Robert Bosch Gmbh Method and device for operand processing in a processor unit
US7574611B2 (en) * 2005-11-28 2009-08-11 Atmel Corporation Command decoder for microcontroller based flash memory digital controller system
US7702885B2 (en) * 2006-03-02 2010-04-20 Atmel Corporation Firmware extendable commands including a test mode command for a microcontroller-based flash memory controller
US7414550B1 (en) 2006-06-30 2008-08-19 Nvidia Corporation Methods and systems for sample rate conversion and sample clock synchronization
US20080059753A1 (en) * 2006-08-30 2008-03-06 Sebastien Hily Scheduling operations corresponding to store instructions
US7603527B2 (en) * 2006-09-29 2009-10-13 Intel Corporation Resolving false dependencies of speculative load instructions
US9134782B2 (en) 2007-05-07 2015-09-15 Nvidia Corporation Maintaining optimum voltage supply to match performance of an integrated circuit
US9209792B1 (en) 2007-08-15 2015-12-08 Nvidia Corporation Clock selection system and method
US8327173B2 (en) * 2007-12-17 2012-12-04 Nvidia Corporation Integrated circuit device core power down independent of peripheral device operation
US9088176B2 (en) * 2007-12-17 2015-07-21 Nvidia Corporation Power management efficiency using DC-DC and linear regulators in conjunction
US8370663B2 (en) 2008-02-11 2013-02-05 Nvidia Corporation Power management with dynamic frequency adjustments
US9411390B2 (en) 2008-02-11 2016-08-09 Nvidia Corporation Integrated circuit device having power domains and partitions based on use case power optimization
US8762759B2 (en) * 2008-04-10 2014-06-24 Nvidia Corporation Responding to interrupts while in a reduced power state
US9423846B2 (en) 2008-04-10 2016-08-23 Nvidia Corporation Powered ring to maintain IO state independent of the core of an integrated circuit device
US8977837B2 (en) * 2009-05-27 2015-03-10 Arm Limited Apparatus and method for early issue and recovery for a conditional load instruction having multiple outcomes
US9256265B2 (en) 2009-12-30 2016-02-09 Nvidia Corporation Method and system for artificially and dynamically limiting the framerate of a graphics processing unit
US9830889B2 (en) 2009-12-31 2017-11-28 Nvidia Corporation Methods and system for artifically and dynamically limiting the display resolution of an application
US8433944B2 (en) 2010-04-12 2013-04-30 Qualcomm Incorporated Clock divider system and method with incremental adjustment steps while controlling tolerance in clock duty cycle
US8839006B2 (en) 2010-05-28 2014-09-16 Nvidia Corporation Power consumption reduction systems and methods
CN103003769B (en) 2010-07-20 2016-02-24 飞思卡尔半导体公司 Clock circuit, electronic equipment and the method for clock signal is provided
US9471395B2 (en) 2012-08-23 2016-10-18 Nvidia Corporation Processor cluster migration techniques
US8947137B2 (en) 2012-09-05 2015-02-03 Nvidia Corporation Core voltage reset systems and methods with wide noise margin
CN106164810B (en) * 2014-04-04 2019-09-03 英派尔科技开发有限公司 Use the optimization of the performance change of the function based on voltage

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5481573A (en) * 1992-06-26 1996-01-02 International Business Machines Corporation Synchronous clock distribution system
US5537581A (en) * 1991-10-17 1996-07-16 Intel Corporation Microprocessor with a core that operates at multiple frequencies
US5560032A (en) * 1991-07-08 1996-09-24 Seiko Epson Corporation High-performance, superscalar-based computer system with out-of-order instruction execution and concurrent results distribution
US5812860A (en) * 1996-02-12 1998-09-22 Intel Corporation Method and apparatus providing multiple voltages and frequencies selectable based on real time criteria to control power consumption
US5828868A (en) * 1996-11-13 1998-10-27 Intel Corporation Processor having execution core sections operating at different clock rates
US6256745B1 (en) * 1998-06-05 2001-07-03 Intel Corporation Processor having execution core sections operating at different clock rates
US6631454B1 (en) * 1996-11-13 2003-10-07 Intel Corporation Processor and data cache with data storage unit and tag hit/miss logic operated at a first and second clock frequencies
US6754837B1 (en) * 2000-07-17 2004-06-22 Advanced Micro Devices, Inc. Programmable stabilization interval for internal stop grant state during which core logic is supplied with clocks and power to minimize stabilization delay
US7917799B2 (en) * 2007-04-12 2011-03-29 International Business Machines Corporation Method and system for digital frequency clocking in processor cores
US7945804B2 (en) * 2007-10-17 2011-05-17 International Business Machines Corporation Methods and systems for digitally controlled multi-frequency clocking of multi-core processors

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5142684A (en) * 1989-06-23 1992-08-25 Hand Held Products, Inc. Power conservation in microprocessor controlled devices
US5222239A (en) * 1989-07-28 1993-06-22 Prof. Michael H. Davis Process and apparatus for reducing power usage microprocessor devices operating from stored energy sources
US5309561A (en) * 1990-09-28 1994-05-03 Tandem Computers Incorporated Synchronous processor unit with interconnected, separately clocked processor sections which are automatically synchronized for data transfer operations
US5630107A (en) * 1992-09-30 1997-05-13 Intel Corporation System for loading PLL from bus fraction register when bus fraction register is in either first or second state and bus unit not busy
US5644760A (en) * 1995-05-01 1997-07-01 Apple Computer, Inc. Printed circuit board processor card for upgrading a processor-based system
US5680543A (en) * 1995-10-20 1997-10-21 Lucent Technologies Inc. Method and apparatus for built-in self-test with multiple clock circuits

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5560032A (en) * 1991-07-08 1996-09-24 Seiko Epson Corporation High-performance, superscalar-based computer system with out-of-order instruction execution and concurrent results distribution
US5537581A (en) * 1991-10-17 1996-07-16 Intel Corporation Microprocessor with a core that operates at multiple frequencies
US5481573A (en) * 1992-06-26 1996-01-02 International Business Machines Corporation Synchronous clock distribution system
US5812860A (en) * 1996-02-12 1998-09-22 Intel Corporation Method and apparatus providing multiple voltages and frequencies selectable based on real time criteria to control power consumption
US7100012B2 (en) * 1996-11-13 2006-08-29 Intel Corporation Processor and data cache with data storage unit and tag hit/miss logic operated at a first and second clock frequencies
US5828868A (en) * 1996-11-13 1998-10-27 Intel Corporation Processor having execution core sections operating at different clock rates
US6216234B1 (en) * 1996-11-13 2001-04-10 Intel Corporation Processor having execution core sections operating at different clock rates
US6487675B2 (en) * 1996-11-13 2002-11-26 Intel Corporation Processor having execution core sections operating at different clock rates
US6631454B1 (en) * 1996-11-13 2003-10-07 Intel Corporation Processor and data cache with data storage unit and tag hit/miss logic operated at a first and second clock frequencies
US6256745B1 (en) * 1998-06-05 2001-07-03 Intel Corporation Processor having execution core sections operating at different clock rates
US6754837B1 (en) * 2000-07-17 2004-06-22 Advanced Micro Devices, Inc. Programmable stabilization interval for internal stop grant state during which core logic is supplied with clocks and power to minimize stabilization delay
US7917799B2 (en) * 2007-04-12 2011-03-29 International Business Machines Corporation Method and system for digital frequency clocking in processor cores
US7945804B2 (en) * 2007-10-17 2011-05-17 International Business Machines Corporation Methods and systems for digitally controlled multi-frequency clocking of multi-core processors

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140380081A1 (en) * 2013-06-25 2014-12-25 Alexander Gendler Restricting Clock Signal Delivery In A Processor
US9471088B2 (en) * 2013-06-25 2016-10-18 Intel Corporation Restricting clock signal delivery in a processor
US9377836B2 (en) 2013-07-26 2016-06-28 Intel Corporation Restricting clock signal delivery based on activity in a processor
WO2023235004A1 (en) * 2022-06-02 2023-12-07 Micron Technology, Inc. Time-division multiplexed simd function unit

Also Published As

Publication number Publication date
US6216234B1 (en) 2001-04-10
TW351791B (en) 1999-02-01
WO1998021641A1 (en) 1998-05-22
AU4669297A (en) 1998-06-03
AR008322A1 (en) 1999-12-29
US5828868A (en) 1998-10-27
ZA979600B (en) 1999-04-28

Similar Documents

Publication Publication Date Title
US6216234B1 (en) Processor having execution core sections operating at different clock rates
USRE45487E1 (en) Processor having execution core sections operating at different clock rates
US5966544A (en) Data speculatable processor having reply architecture
WO1998021684A9 (en) Processor having replay architecture
EP1296230B1 (en) Instruction issuing in the presence of load misses
US6381692B1 (en) Pipelined asynchronous processing
US9262171B2 (en) Dependency matrix for the determination of load dependencies
US9058180B2 (en) Unified high-frequency out-of-order pick queue with support for triggering early issue of speculative instructions
US6138230A (en) Processor with multiple execution pipelines using pipe stage state information to control independent movement of instructions between pipe stages of an execution pipeline
US8255670B2 (en) Replay reduction for power saving
EP0779577B1 (en) Micoprocessor pipe control and register translation
US5931957A (en) Support for out-of-order execution of loads and stores in a processor
EP1296229B1 (en) Scoreboarding mechanism in a pipeline that includes replays and redirects
US20020091915A1 (en) Load prediction and thread identification in a multithreaded microprocessor
US6564315B1 (en) Scheduler which discovers non-speculative nature of an instruction after issuing and reissues the instruction
JP3678443B2 (en) Write buffer for super pipelined superscalar microprocessor
US6622235B1 (en) Scheduler which retries load/store hit situations
US7100012B2 (en) Processor and data cache with data storage unit and tag hit/miss logic operated at a first and second clock frequencies
US6115730A (en) Reloadable floating point unit
EP1296228B1 (en) Instruction Issue and retirement in processor having mismatched pipeline depths
GB2361082A (en) Processor with data depedency checker
Lee et al. An Asynchronous Processor Simulator
Golze et al. Internal Specification of Coarse Structure

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION