GB2361082A

GB2361082A - Processor with data dependency checker

Info

Publication number: GB2361082A
Application number: GB0111976A
Authority: GB
Inventors: David J Sager
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 1996-11-13
Filing date: 1997-10-14
Publication date: 2001-10-10
Anticipated expiration: 2017-10-14
Also published as: GB0111976D0; GB2361082B

Abstract

A microprocessor has a first section operating at a first clock frequency and a second section operating at a second clock frequency which is lower than the first clock frequency. The processor includes a checker for determining whether an instruction was processed correctly and used input data that are known to be correct. The checker comprises a register scoreboard for tracking data dependency and a delay line which carries latency vectors indicating how many clock cycles after a first instruction begins execution before a second instruction can use a result of the first instruction.

Description

PROCESSOR

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates generally to the field of high speed processors. This application is a divisional of GB-A-2333384 and attention is directed thereto.

Background of the Prior Art

Fig. 1 illustrates a microprocessor 100 according to the prior art.

The microprocessor includes an I/O ring which operates at a first clock frequency, and an execution core which operates at a second clock frequency. For example, the Intel486DX2 may run its I/O ring at 33MHz and its execution core at 66MHz for a 2:1 ratio (1/2 bus), the IntelDX4 may run its I/O ring at 25MHz and its execution core at 75MHz for a 3:1 ratio (1/3 bus), and the Intel Pentium OverDrive processor may operate its I/O ring at 33MHz and its execution core at 82.5MHz for a 2.5:1 ratio (5/2 bus).
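The ratios quoted above are simply the core frequency divided by the I/O ring frequency. The following is a minimal sketch of that arithmetic; the function name `core_to_io_ratio` and the dictionary layout are illustrative assumptions, not part of the specification.

```python
from fractions import Fraction

def core_to_io_ratio(io_mhz, core_mhz):
    """Return the core:I/O clock ratio as an exact fraction."""
    # Going through str() keeps values such as 82.5 exact.
    return Fraction(str(core_mhz)) / Fraction(str(io_mhz))

# (I/O ring MHz, execution core MHz) as quoted in the text
examples = {
    "Intel486DX2": (33, 66),
    "IntelDX4": (25, 75),
    "Pentium OverDrive": (33, 82.5),
}
ratios = {name: core_to_io_ratio(io, core) for name, (io, core) in examples.items()}
```

Using exact fractions rather than floats makes the non-integer 5/2 case explicit.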

A distinction may be made between "I/O operations" and "execution operations". For example, in the DX2, the I/O ring performs I/O operations such as buffering, bus driving, receiving, parity checking, and other operations associated with communicating with the off-chip world, while the execution core performs execution operations such as addition, multiplication, address generation, comparisons, rotation and shifting, and other "processing" manipulations.

The processor 100 may optionally include a clock multiplier. In this mode, the processor can automatically set the speed of its execution core according to an external, slower clock provided to its I/O ring. This may reduce the number of pins needed. Alternatively, the processor may include a clock divider, in which case the processor sets the I/O ring speed responsive to an external clock provided to the execution core.

These clock multiply and clock divide functions are logically the same for the purposes of this invention, so the term "clock mult/div" will be used herein to denote either a multiplier or divider as suitable. The skilled reader will comprehend how external clocks may be selected and provided, and from there multiplied or divided. Therefore, specific clock distribution networks, and the details of clock multiplication and division, will not be expressly illustrated. Furthermore, the clock mult/div units need not necessarily be limited to integer multiple clocks, but can perform e.g. 2:5 clocking. Finally, the clock mult/div units need not necessarily even be limited to fractional bus clocking, but can, in some embodiments, be flexible, asynchronous, and/or programmable, such as in providing a P/Q clocking scheme.

The basic motivation for increasing clock frequencies in this manner is to reduce instruction latency. The execution latency of an instruction may be defined as the time from when its input operands must be ready for it to execute, until its result is ready to be used by another instruction. Suppose that a part of a program contains a sequence of N instructions, $I_1, I_2, I_3, \dots, I_N$. Suppose that $I_{n+1}$ requires, as part of its inputs, the result of $I_n$ for all n, from 1 to N-1. This part of the program may also contain any other instructions. Then we can see that this program cannot be executed in less time than $T = L_1 + L_2 + L_3 + \dots + L_N$, where $L_n$ is the latency of instruction $I_n$ for all n from 1 to N. In fact, even if the processor was capable of executing a very large number of instructions in parallel, T remains a lower bound for the time to execute this part of this program. Hence, to execute this program faster, it will ultimately be essential to shorten the latencies of the instructions.
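The lower bound can be illustrated with a small sketch that computes the earliest possible finish time on a machine with unlimited execution width. The function `earliest_finish`, the latency values, and the dependency lists are hypothetical and chosen only for illustration.

```python
def earliest_finish(latency, deps):
    """Earliest completion time with unlimited parallel execution.

    latency[i] is the latency of instruction i; deps[i] lists the indices of
    earlier instructions whose results instruction i needs.
    """
    finish = []
    for i, lat in enumerate(latency):
        # An instruction can start only when all of its inputs are ready.
        start = max((finish[d] for d in deps[i]), default=0)
        finish.append(start + lat)
    return max(finish)

# A chain of four instructions, each needing the result of the previous one.
chain_latency = [1, 3, 1, 2]
chain_deps = [[], [0], [1], [2]]
bound = sum(chain_latency)  # T = L_1 + L_2 + L_3 + L_4

# Adding two independent instructions does not change the finish time.
with_extras = earliest_finish(chain_latency + [1, 1], chain_deps + [[], []])

# Shortening one latency in the chain is what reduces the finish time.
shortened = earliest_finish([1, 1, 1, 2], chain_deps)
```

Under these assumptions the chain, not the available width, determines the finish time.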

We may look at the same thing from a slightly different point of view. Define that an instruction $I_n$ is "in flight" from the time that it requires its input operands to be ready until the time when its result is ready to be used by another instruction. Instruction $I_n$ is therefore "in flight" for a length of time $L_n = A_n C$, where $A_n$ is the latency, as defined above, of $I_n$, but this time expressed in cycles. C is the cycle time. Let a program execute N instructions as above and take M "cycles" or units of time to do it. Looked at from either point of view, it is critically important to reduce the execution latency as much as possible.

The average latency can be conventionally defined as $L = \frac{C}{N}(A_1 + A_2 + A_3 + \dots + A_N)$. Let $f_j$ be the number of instructions that are in flight during cycle j. We can then define the parallelism P as the average number of instructions in flight for the program, or $P = \frac{1}{M}(f_1 + f_2 + f_3 + \dots + f_M)$.

Notice that $f_1 + f_2 + f_3 + \dots + f_M = A_1 + A_2 + A_3 + \dots + A_N$. Both sides of this equation are ways of counting up the number of cycles in which instructions are in flight, wherein if x instructions are in flight in a given cycle, that cycle counts as x cycles.

Now define the "average bandwidth" B as the total number of instructions executed, N, divided by the time used, $MC$, or in other words, $B = N/(MC)$.

We may then easily see that $P = LB$. In this formula, L is the average latency for a program, B is its average bandwidth, and P is its average parallelism. Note that B tells how fast we execute the program. It is instructions per second. If the program has N instructions, it takes N/B seconds to execute it. The goal of a faster processor is exactly the goal of getting B higher.
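The relation between parallelism, latency, and bandwidth can be checked numerically. The sketch below assumes each instruction is described by a hypothetical (start cycle, end cycle) interval during which it is in flight, with the program starting at cycle 0; the helper name `flight_stats` and the sample intervals are not from the specification.

```python
def flight_stats(intervals, cycle_time):
    """Compute L, B, P from per-instruction in-flight intervals.

    intervals: list of (start_cycle, end_cycle) pairs, end exclusive.
    cycle_time: C, in any time unit.
    """
    n = len(intervals)                                 # N instructions
    m = max(end for _, end in intervals)               # M cycles in total
    a = [end - start for start, end in intervals]      # A_n, latency in cycles
    f = [sum(1 for s, e in intervals if s <= j < e)    # f_j, in flight in cycle j
         for j in range(m)]
    avg_latency = cycle_time * sum(a) / n              # L = (C/N) * sum(A_n)
    bandwidth = n / (m * cycle_time)                   # B = N / (M*C)
    parallelism = sum(f) / m                           # P = (1/M) * sum(f_j)
    return avg_latency, bandwidth, parallelism, a, f

L, B, P, A, F = flight_stats([(0, 2), (0, 1), (1, 4), (2, 4)], cycle_time=2.0)
```

Both sums count the same instruction-cycles, which is why the product of L and B reduces to P.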

We now note that increasing B requires either increasing the parallelism P, or decreasing the average latency L. It is well known that the parallelism, P, that can be readily exploited for a program is limited. Whereas certain classes of programs have large exploitable parallelism, a large class of important programs has P restricted to quite small numbers.

One drawback which the prior art processors have is that their entire execution core is constrained to run at the same clock speed. This limits some components within the core in a "weakest link" or "slowest path" manner.

In the 1960s and 1970s, there existed central processing units in which a multiplier or divider co-processor was clocked at a frequency higher than other circuitry in the central processing unit. These central processing units were constructed of discrete components rather than as integrated circuits or monolithic microprocessors. Due to their construction as co-processors, and/or the fact that they were not integrated with the main processor, these units should not be considered as "sub-cores".

Another feature of some prior art processors is the ability to perform "speculative execution". This is also known as "control speculation", because the processor guesses which way control (branching) instructions will go. Some processors perform speculative fetch, and others, such as the Intel Pentium Pro processor, also perform speculative execution. Control speculating processors include mechanisms for recovering from mispredicted branches, to maintain program and data integrity as though no speculation were taking place.

Fig. 2 illustrates a conventional data hierarchy. A mass storage device, such as a hard drive, stores the programs and data (collectively "data") which the computer system (not shown) has at its disposal. A subset of that data is loaded into memory such as DRAM for faster access. A subset of the DRAM contents may be held in a cache memory. The cache memory may itself be hierarchical, and may include a level two (L2) cache, and then a level one (L1) cache which holds a subset of the data from the L2. Finally, the physical registers of the processor contain a smallest subset of the data. As is well known, various algorithms may be used to determine what data is stored in what levels of this overall hierarchy. In general, it may be said that the more recently a datum has been used, or the more likely it is to be needed soon, the closer it will be held to the processor.

The presence or absence of valid data at various points in the hierarchical storage structure has implications on another drawback of the prior art processors, including control speculating processors. The various components within their execution cores are designed such that they cannot perform "data speculation", in which a processor guesses what values data will have (or, more precisely, the processor assumes that presently-available data values are correct and identical to the values that will ultimately result, and uses those values as inputs for one or more operations), rather than which way branches will go. Data speculation may involve speculating that data presently available from a cache are identical to the true values that those data should have, or that data presently available at the output of some execution unit are identical to the true values that will result when the execution unit completes its operation, or the like.

Like control speculating processors' recovery mechanisms, data speculating processors must have some mechanism for recovering from having incorrectly assumed that data values are correct, to maintain program and data integrity as though no data speculation were taking place. Data speculation is made more difficult by the hierarchical storage system, especially when it is coupled with a microarchitecture which uses different clock frequencies for various portions of the execution environment.

It is well-known that every processor is adapted to execute instructions of its particular "architecture". In other words, every processor executes a particular instruction set, which is encoded in a particular machine language. Some processors, such as the Pentium Pro processor, decode those "macro-instructions" down into "micro-instructions" or "uops", which may be thought of as the machine language of the micro-architecture and which are directly executed by the processor's execution units. It is also well-known that other processors, such as those of the RISC variety, may directly execute their macro-instructions without breaking them down into micro-instructions. For purposes of the present invention, the term "instruction" should be considered to cover any or all of these cases.

SUMMARY OF THE INVENTION

According to the invention there is provided a processor comprising: a first section operating at a first clock frequency and including an execution core; a second section operating at a second clock frequency lower than the first clock frequency and including a checker for determining whether an instruction was processed correctly and used input data that are known to be correct.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram illustrating a prior art processor having an I/O ring and an execution core operating at different clock speeds.

Fig. 2 demonstrates a hierarchical memory structure such as is well known in the art.

Fig. 3 is a block diagram illustrating the processor of the present invention, and showing a plurality of execution core sections each having its own clock frequency.

Fig. 4 is a block diagram illustrating a mode in which the processor of Fig. 3 includes yet another sub-core with its own clock frequency.

Fig. 5 is a block diagram illustrating a different mode in which the sub-core is not nested as shown in Fig. 4.

Fig. 6 is a block diagram illustrating a partitioning of the execution core.

Fig. 7 is a block diagram illustrating one embodiment of the replay architecture of the present invention, which permits data speculation.

Fig. 8 illustrates one embodiment of the checker unit of the replay architecture.

DETAILED DESCRIPTION OF THE INVENTION

Fig. 3 illustrates the high-speed sub-core 205 of the processor 200 of the present invention. The high-speed sub-core includes the most latency-intolerant portions of the particular architecture and/or microarchitecture employed by the processor. For example, in an Intel Architecture processor, certain arithmetic and logic functions, as well as data cache access, may be the most unforgiving of execution latency.

Other functions, which are not so sensitive to execution latency, may be contained within a more latency-tolerant execution core 210. For example, in an Intel Architecture processor, execution of infrequently-executed instructions, such as transcendentals, may be relegated to the slower part of the core.

The processor 200 communicates with the rest of the system (not shown) via the I/O ring 215. If the I/O ring operates at a different clock frequency than the latency-tolerant execution core, the processor may include a clock mult/div unit 220 which provides clock division or multiplication according to any suitable manner and conventional means. Because the latency-intolerant execution sub-core 205 operates at a higher frequency than the rest of the latency-tolerant execution core 210, there may be a mechanism 225 for providing a different clock frequency to the latency-intolerant execution sub-core 205. In one mode, this is a clock mult/div unit 225.

Fig. 4 illustrates a refinement of the invention shown in Fig. 3. The processor 250 of Fig. 4 includes the I/O ring 215, clock mult/div unit 220, and latency-tolerant execution core 210. However, in place of the unitary sub-core (205) and clock mult/div unit (225) of Fig. 3, this improved processor 250 includes a latency-intolerant execution sub-core 255 and an even more latency-critical execution sub-core 260, with their clock mult/div units 265 and 270, respectively.

The skilled reader will appreciate that this is illustrative of a hierarchy of sub-cores, each of which includes those units which must operate at least as fast as the respective sub-core level. The skilled reader will further appreciate that the selection of what units go how deep into the hierarchy will be made according to various design constraints such as die area, clock skew sensitivity, design time remaining before tapeout date, and the like. In one mode, an Intel Architecture processor may advantageously include only its most common integer ALU functions and the data storage portion of its data cache in the innermost sub-core. In one mode, the innermost sub-core may also include the register file; although, for reasons including those stated above concerning Fig. 2, the register file might not technically be needed to operate at the highest clock frequency, its design may be simplified by including it in a more inner sub-core than is strictly necessary. For example, it may be more efficient to make twice as fast a register file with half as many ports, than vice versa.

In operation, the processor performs an I/O operation at the I/O ring and at the I/O clock frequency, such as to bring in a data item not presently available within the processor. Then, the latency-tolerant execution core may perform an execution operation on the data item to produce a first result. Then, the latency-intolerant execution sub-core may perform an execution operation on the first result to produce a second result. Then, the latency-critical execution sub-core may perform a third execution operation upon the second result to produce a third result. Those skilled in the art will understand that the flow of execution need not necessarily proceed in the strict order of the hierarchy of execution sub-cores. For example, the newly read in data item could go immediately to the innermost core, and the result could go from there to any of the core sections or even back to the I/O ring for writeback.

Fig. 5 shows an embodiment which is slightly different than that of Fig. 4. The processor 280 includes the I/O ring 215, the execution cores 210, 255, 260, and the clock mult/div units 220, 265, 270. However, in this embodiment the latency-critical execution sub-core 260 is not nested within the latency-intolerant execution core 255. In this mode, the clock mult/div units 265 and 270 perform different ratios of multiplication to enable their respective cores to run at different speeds.
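The difference between the nested arrangement of Fig. 4 and the non-nested arrangement of Fig. 5 can be sketched as two ways of chaining clock multipliers. This is only an illustration: the frequencies and multiplication factors are hypothetical, and the sketch assumes that in the non-nested mode both units 265 and 270 take their input clock from the core 210, which the text does not state explicitly.

```python
from fractions import Fraction

def nested_clocks(io_mhz, mult_220, mult_265, mult_270):
    """Fig. 4 style: each sub-core multiplies the clock of the section around it."""
    core_210 = io_mhz * mult_220
    sub_255 = core_210 * mult_265
    sub_260 = sub_255 * mult_270      # 260 sits inside 255
    return core_210, sub_255, sub_260

def sibling_clocks(io_mhz, mult_220, mult_265, mult_270):
    """Fig. 5 style (assumed): both sub-cores multiply the clock of core 210."""
    core_210 = io_mhz * mult_220
    return core_210, core_210 * mult_265, core_210 * mult_270

# Hypothetical numbers: a 100MHz I/O ring and 2X steps when nested.
nested = nested_clocks(100, 2, 2, 2)
# The same three speeds without nesting need different ratios in 265 and 270.
siblings = sibling_clocks(100, 2, 2, 4)
# Non-integer ratios work the same way when written as exact fractions.
fractional = sibling_clocks(100, Fraction(5, 2), 2, 3)
```

In the non-nested case the two units must apply different ratios to reach different speeds, which matches the statement in the text.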

In another slightly different mode (not shown), either of these cores might be clock-interfaced directly to the I/O ring or to the external world. In such a mode, clock mult/div units may not be required, if separate clock signals are provided from outside the processor.

It should be noted that the different speeds at which the various layers of sub-core operate may be in-use, operational speeds. It is known, for example in the Pentium processor, that certain units may be powered down when not in use, by reducing or halting their clock; in this case, the processor may have the bulk of its core at 66MHz while a sub-core such as the FPU is at substantially 0MHz. While the present invention may be used in combination with such power-down or clock throttling techniques, it is not limited to such cases.

Those skilled in the art will appreciate that non-integer ratios may be applied at any of the boundaries, and that the combinations of clock ratios between the various rings is almost limitless, and that different baseline frequencies could be used at the I/O ring. It is also possible that the clock multiplication factors might not remain constant over time. For example, in some modes, the clock multiplication applied to the innermost sub-core could be adjusted up and down, for example between 3X and 1X or between 2X and 0X or the like, when the higher frequency (and therefore higher power consumption and heat generation) are not needed. Also, the processor may be subjected to clock throttling or clock stop, in whole or in part. Or, the I/O clock might not be a constant frequency, in which case the other clocks may either scale accordingly, or they may implement some form of adaptive P/Q clocking scheme to maintain their desired performance level.

Fig. 6 illustrates somewhat more detail about one embodiment of the contents of the latency-critical execution sub-core 260 of Fig. 4. (It may also be understood to illustrate the contents of the sub-core 205 of Fig. 3 or the sub-core 255 of Fig. 4.) The latency-tolerant execution core 210 includes components which are not latency-sensitive, but which are dependent only upon some level of throughput. In this sense, the latency-tolerant components may be thought of as the "plumbing" whose job is simply to provide a particular "gallons per minute" throughput, in which a "big pipe" is as good as a "fast flow".

For example, in some architectures the fetch and decode units may not be terribly demanding on execution latency, and may thus be put in the latency-tolerant core 210 rather than the latency-intolerant sub-core 205, 255, 260. Likewise, the microcode and register file may not need to be in the sub-core. In some architectures (or microarchitectures), the most latency-sensitive pieces are the arithmetic/logic functions and the cache. In the mode shown in Fig. 6, only a subset of the arithmetic/logic functions are deemed to be sufficiently latency-sensitive that it is warranted to put them into the sub-core, as illustrated by critical ALU 300.

In some embodiments, the critical ALU functions include adders, subtractors, and logic units for performing AND, OR, and the like. In some embodiments which use index register addressing, such as the Intel Architecture, the critical ALU functions may also include a small, special-purpose shifter for doing address generation by scaling the index register. In some embodiments, the register file may reside in the latency-critical execution core, for design convenience; the faster the core section the register file is in, the fewer ports the register file needs.

The functions which are generally more latency-sensitive than the plumbing are those portions which are of a recursive nature, or those which include a dependency chain. Execution is a prime example of this concept; execution tends to be recursive or looping, and includes both false and true data dependencies both between and within iterations and loops.

Current art in high-performance computer design (e.g. the Pentium Pro processor) already exploits most of the readily exploitable parallelism in a large class of important low P programs. It becomes extraordinarily difficult or even practically impossible to greatly increase P for these programs. In this case there is no alternative to reducing the average latency if it is desired to build a processor to run these programs faster.

On the other hand, there are certain other functions such as, for example, instruction decode or register renaming. While it is essential that these functions are performed, current art has it arranged that the elapsed time for performing these functions may have an effect on performance only when a branch has been mispredicted. A branch is mispredicted typically once in fifty instructions on average. Hence one nanosecond longer to do decoding or register renaming provides the equivalent of 1/50 nanoseconds increase in average instruction execution latency, while one nanosecond increase in the time to execute an instruction increases the average instruction latency by one nanosecond. We may conclude that the time it takes to decode instructions or rename registers, for example, is significantly less critical than the time it takes to execute instructions.
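The leverage argument is a one-line amortization: a delay paid once per event is spread over the instructions between events. A minimal sketch follows; the function name is a hypothetical label, and the 50-instruction figure is the one quoted in the text.

```python
def added_average_latency_ns(extra_ns, instructions_per_event):
    """Extra average per-instruction latency when a delay is paid once per event."""
    return extra_ns / instructions_per_event

# Decode or rename delay is exposed only on a mispredicted branch,
# roughly once in fifty instructions.
decode_penalty = added_average_latency_ns(1.0, 50)

# Execution delay is paid by every instruction.
execute_penalty = added_average_latency_ns(1.0, 1)

leverage = execute_penalty / decode_penalty
```

Under these figures, a nanosecond saved in execution is worth about fifty times a nanosecond saved in decode or rename.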

There are still other functions that must be performed in a processor. Many of these functions are even more highly leveraged than decoding and register renaming. For these functions a 1 nsec increase in the time to perform them may add even less than 1/50 nanoseconds to the average execution latency. We may conclude that the time it takes to do these functions is even less critical.

As shown, the other ALU functions 305 can be relegated to the less speedy core 210. Further, in the mode shown in Fig. 6, only a subset of the cache needs to be inside the sub-core. As illustrated, only the data storage portion 310 of the cache is inside the sub-core, while the hit/miss logic and tags are in the slower core 210. This is in contrast to the conventional wisdom, which is that the hit/miss signal is needed at the same time as the data. A recent paper implied that the hit/miss signal is the limiting factor on cache speed (Austin, Todd M., "Streamlining Data Cache Access with Fast Address Calculation", Dionisios N. Pnevmatikatos, Gurindar S. Sohi, Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 18-24, 1995, Session 8, No. 1, page 5). Unfortunately, hit/miss determination is more difficult and more time-consuming than the simple matter of reading data contents from cache locations.

Further, the instruction cache (not shown) may be entirely in the core 210, such that the cache 310 stores only data. The instruction cache (Icache) is accessed speculatively. It is the business of branch prediction to predict where the flow of the program will go, and the Icache is accessed on the basis of that prediction. Branch prediction methods commonly used today can predict program flow without ever seeing the instructions in the Icache. If such a method is used, then the Icache is not latency-sensitive, and becomes more bandwidth-constrained than latency-constrained, and can be relegated to a lower clock frequency portion of the execution core.

The branch prediction itself could be latency-sensitive, so it would be a good candidate for a fast cycle time in one of the inner sub-core sections.

At first glance, one might think that the innermost sub-core 205, 255, or 260 of Fig. 6 would therefore hold the data which is stored at the top of the memory hierarchy of Fig. 2, that is, the data which is stored in the registers. However, as is illustrated in Fig. 6, the register file need not be contained within the sub-core, but may, instead, be held in the less speedy portion of the core 210. In the mode of Figs. 3 or 4, the register file may be stored in any of the core sections 205, 210, 255, 260, as suits the particular embodiment chosen. As shown in Fig. 6, the reason that the register file is not required to be within the innermost core is that the data which result from operations performed in the critical ALU 300 are available on a bypass bus 315 as soon as they are calculated. By appropriate operation of multiplexors (in any conventional manner), these data can be made available to the critical ALU 300 in the next clock cycle of the sub-core, far sooner than they could be written to and then read from the register file.
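The role of the bypass bus can be sketched as an operand-selection step that prefers a just-computed result over the register file. This is a simplified software analogy, not the circuit of Fig. 6; the register names, values, and the `read_operand` helper are hypothetical.

```python
def read_operand(reg, bypass, regfile):
    """Prefer the value on the bypass bus; fall back to the register file."""
    if bypass is not None and bypass[0] == reg:
        return bypass[1]
    return regfile[reg]

regfile = {"r1": 5, "r2": 7, "r3": 0}

# Cycle 1: r3 = r1 + r2. The result is driven onto the bypass bus;
# the (slower) register file has not been updated yet.
bypass = ("r3", read_operand("r1", None, regfile) + read_operand("r2", None, regfile))

# Cycle 2: r4 = r3 + r1 takes r3 from the bypass bus rather than
# the stale register file entry.
r4 = read_operand("r3", bypass, regfile) + read_operand("r1", bypass, regfile)
```

The dependent operation gets the fresh value one cycle later even though the register file still holds the old one.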

Similarly, if data speculation is permitted, that is, if the critical ALU is allowed to perform calculations upon operands which are not yet known to be valid, portions of the data cache need not reside within the innermost sub-core. In this mode, the data cache 310 holds only the actual data, while the hit/miss logic and cache tags reside in a slower portion 210 of the core. In this mode, data from the data cache 310 are provided over an inner bus 320 and muxed into the critical ALU, and the critical ALU performs operations assuming those data to be valid.
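A minimal sketch of this split is given below, assuming a direct-mapped cache indexed by the low address bits; the class `SplitCache`, its size, and its method names are hypothetical and stand in for the data storage and the slower hit/miss logic.

```python
class SplitCache:
    """Direct-mapped cache with a fast data array and a separately checked tag array."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.data = [0] * num_lines      # fast part: data storage only
        self.tags = [None] * num_lines   # slow part: tags for hit/miss

    def fill(self, addr, value):
        idx = addr % self.num_lines
        self.data[idx] = value
        self.tags[idx] = addr // self.num_lines

    def speculative_read(self, addr):
        # Fast path: return whatever the line holds without consulting the tags.
        return self.data[addr % self.num_lines]

    def check_hit(self, addr):
        # Slow path: performed later, in the slower portion of the core.
        return self.tags[addr % self.num_lines] == addr // self.num_lines

cache = SplitCache()
cache.fill(5, 99)
good_value, good_hit = cache.speculative_read(5), cache.check_hit(5)
# Address 9 maps to the same line as address 5, so the fast read
# returns data that the later tag check shows to be wrong.
bad_value, bad_hit = cache.speculative_read(9), cache.check_hit(9)
```

The second access shows why a recovery mechanism is needed: the value is consumed before the tag comparison reports the miss.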

Some number of clock cycles later, the hit/miss logic or the tag logic in the outer core may signal that the speculated data is, in fact, invalid. In this case, there must be a means provided to recover from the speculative operations which have been performed. This includes not only the specific operations which used the incorrect, speculated data as input operands, but also any subsequent operations which used the outputs of those specific operations as inputs. Also, the erroneously generated outputs may have subsequently been used to determine branching operations, such as if the erroneously generated output is used as a branch address or as a branch condition. If the processor performs control speculation, there may have also been errors in that operation as well.
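The set of operations affected by one bad speculation can be found by following the data dependencies forward. The sketch below is a software illustration under simplifying assumptions (a straight-line instruction list, register-only dependencies, no branches); the function name and the example program are hypothetical.

```python
def instructions_to_replay(program, bad_index):
    """Return indices of instructions whose results are incorrect.

    program: list of (dest_register, [source_registers]) in execution order.
    bad_index: the instruction that consumed invalid speculated data.
    """
    tainted = set()   # registers currently holding an incorrect value
    replay = []
    for i, (dest, sources) in enumerate(program):
        if i == bad_index or tainted & set(sources):
            replay.append(i)
            tainted.add(dest)
        else:
            tainted.discard(dest)   # overwritten with a correct value
    return replay

program = [
    ("r1", []),            # 0: load that speculated on bad cache data
    ("r2", ["r1"]),        # 1: uses the bad value directly
    ("r3", ["r4"]),        # 2: independent
    ("r5", ["r2", "r3"]),  # 3: uses the output of instruction 1
    ("r1", ["r4"]),        # 4: overwrites r1 with a good value
    ("r6", ["r1"]),        # 5: reads the good r1
]
affected = instructions_to_replay(program, bad_index=0)
```

Instructions 2, 4, and 5 never touch the bad value, so only the load and its transitive consumers are affected.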

The present invention provides a replay mechanism for recovering from data speculation upon data which ultimately prove to have been incorrect. In one mode, the replay mechanism may reside outside the innermost core, because it is not terribly latency-critical. While the replay architecture is described in conjunction with a multiple-clock-speed execution engine which performs data speculation, it will be appreciated that the replay architecture may be used with a wide variety of architectures and micro-architectures, including those which perform data speculation and those which do not, those which perform control speculation and those which do not, those which perform in-order execution and those which perform out-of-order execution, and so forth.

Fig. 7 illustrates one implementation of such a replay architecture, generally showing the data flow of the architecture. First, an instruction is fetched into the instruction cache.

From the instruction cache, the instruction proceeds to a renamer such as a register alias table. In sophisticated microarchitectures which permit data speculation and/or control speculation, it is highly desirable to decouple the actual machine from the specific registers indicated by the instruction. This is especially true in an architecture which is register-poor, such as the Intel Architecture. Renamers are well known, and the details of the renamer are not particularly germane to an understanding of the present invention. Any conventional renamer will suffice. It is desirable that it be a single-valued and single-assignment renamer, such that each instance of a given instruction will write to a different register, although the instruction specifies the same register. The renamer provides a separate storage location for each different value that each logical register assumes, so that no such value of any logical register is prematurely lost (i.e. before the program is through with that value), over a well-defined period of time.
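The single-assignment property can be sketched with a small alias table. This is an illustrative model only: the class `Renamer` and the register names are assumptions, and the sketch never recycles physical registers, whereas a real renamer would reclaim them after the well-defined period mentioned above.

```python
class Renamer:
    """Maps every write of a logical register to a fresh physical register."""

    def __init__(self):
        self.alias = {}      # logical name -> physical register holding its latest value
        self.next_phys = 0

    def rename(self, dest, sources):
        # Sources read whichever physical register currently holds the logical value.
        phys_sources = [self.alias[s] for s in sources]
        # Single assignment: each destination gets a register never written before.
        phys_dest = self.next_phys
        self.next_phys += 1
        self.alias[dest] = phys_dest
        return phys_dest, phys_sources

r = Renamer()
first = r.rename("eax", [])               # eax written for the first time
second = r.rename("ebx", ["eax"])         # reads the first eax
third = r.rename("eax", ["eax", "ebx"])   # same logical eax, new physical register
```

Because the old value of `eax` stays in its own location, a replayed instruction can still read it.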

From the renamer, the instruction proceeds to an optional scheduler such as a reservation station, where instructions are reordered to improve execution efficiency. The scheduler is able to detect when it is not allowed to issue further instructions. For example, there may not be any available execution slots into which a next instruction could be issued. Or, another unit may for some reason temporarily disable the scheduler. In some embodiments, the scheduler may reside in the latency-critical execution core, if the particular scheduling algorithm can schedule only single latency generation per cycle, and is therefore tied to the latency of the critical ALU functions.

From the renamer or the optional scheduler, the instruction proceeds to the execution core 205, 210, 255, 260 (indirectly through a multiplexor to be described below), where it is executed. After or simultaneous with its execution, an address associated with the instruction is sent to the translation look-aside buffer (TLB) and cache tag lookup logic (TAG). This address may be, for example, the address (physical or logical) of a data operand which the instruction requires. From the TLB and TAG logic, the physical address referenced and the physical address represented in the cache location accessed are passed to the hit/miss logic, which determines whether the cache location accessed in fact contained the desired data.

In one mode, if the instruction being executed reads memory, the execution logic gives the highest priority to generating perhaps only a portion of the address, but enough that data may be looked up in the high speed data cache. In this mode, this partial address is used with the highest priority to retrieve data from the data cache, and only as a secondary priority is a complete virtual address, or in the case of the Intel Architecture, a complete linear address, generated and sent to the TLB and cache TAG lookup logic.

Because the critical ALU functions and the data cache are in the innermost sub-core - or are at least in a portion of the processor which runs at a higher clock rate than the TLB and TAG logic and the hit/miss logic - some data will have already been obtained from the data cache and the processor will have already speculatively executed the instruction which needed that data, the processor having assumed the data that was obtained to have been correct, and the processor likely having also executed additional instructions using that data or the results of the first speculatively executed instruction.

Therefore, the replay architecture includes a checker unit which receives the output of the hit/miss logic. If a miss is indicated, the checker causes a "replay" of the offending instruction and any which depended on it or which were otherwise incorrect as a result of the erroneous data speculation. When the instruction was handed from the reservation station to the execution core, a copy of it was forwarded to a delay unit which provides a delay latency which matches the time the instruction will take to get through the execution core, TLB/TAG, and hit/miss units, so that the copy arrives at the checker at about the same time that the hit/miss logic tells the checker that the data speculation was incorrect. In one mode, this is roughly 10-12 clocks of the inner core. In Fig. 7, the delay unit is shown as being outside the checker. In other embodiments, the delay unit may be incorporated as a part of the checker. In some embodiments, the checker may reside within the latency-critical execution core, if the checking algorithm is tied to the critical ALU speed.
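The interplay of the delay unit and the checker can be sketched as a fixed-depth shift register feeding a comparison step. The sketch is an assumption-laden model: `DelayLine`, `run_checker`, the depth of 3, and the `miss_flags` dictionary (standing in for the hit/miss logic output) are all hypothetical.

```python
from collections import deque

class DelayLine:
    """Fixed-depth shift register carrying instruction copies toward the checker."""

    def __init__(self, depth):
        self.slots = deque([None] * depth, maxlen=depth)

    def tick(self, incoming):
        outgoing = self.slots[0]        # oldest entry reaches the checker
        self.slots.append(incoming)     # maxlen drops that oldest entry
        return outgoing

def run_checker(instructions, miss_flags, depth=3):
    """Return the instruction copies the checker sends back for replay."""
    delay = DelayLine(depth)
    replayed = []
    # Extra empty ticks drain the copies still inside the delay line.
    for instr in list(instructions) + [None] * depth:
        arrived = delay.tick(instr)
        if arrived is not None and miss_flags.get(arrived, False):
            replayed.append(arrived)
    return replayed

replayed = run_checker(["ld1", "add", "ld2"], {"ld2": True})
```

Each copy emerges exactly `depth` ticks after it entered, which models the delay matching the execution, TLB/TAG, and hit/miss path.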

When the checker determines that data speculation was incorrect, the checker sends the copy of the instruction back around for a "replay". The checker forwards the copy of the instruction to a buffer unit. It may happen as an unrelated event that the TLB/TAG unit informs the buffer that the TLB/TAG is inserting a manufactured instruction in the current cycle. This information is needed by the buffer so the buffer knows not to reinsert another instruction in the same cycle. Both the TLB/TAG and the buffer also inform the scheduler when they are inserting instructions, so the scheduler knows not to dispatch an instruction in that same cycle. These control signals are not shown but will be understood by those skilled in the art.

The buffer unit provides latching of the copied instruction, to prevent it from being lost if it cannot immediately be handled. In some embodiments, there may be conditions under which it may not be possible to reinsert replayed instructions immediately. In these conditions, the buffer holds them - perhaps a large number of them - until they can be reinserted. One such condition may be that there may be some higher priority function that could claim execution, such as when the TLB/TAG unit needs to insert a manufactured instruction, as mentioned above. In some other embodiments, the buffer may not be necessary.

Earlier, it was mentioned that the scheduler's output was provided to the execution core indirectly, through a multiplexor. The function of this multiplexor is to select among several possible sources of instructions being sent for execution. The first source is, of course, the scheduler, in the case when it is an original instruction which is being sent for execution. The second source is the buffer unit, in the case when it is a copy of an instruction which is being sent for replay execution. A third source is illustrated as being from the TLB/TAG unit; this permits the architecture to manufacture "fake instructions" and inject them into the instruction stream.

For example, the TLB logic or TAG logic may need to get another unit to do some work for them, such as to read some data from the data cache as might be needed to evict that data, or for refilling the TLB, or other purposes, and they can do this by generating instructions which did not come from the real instruction stream, and then inserting those instructions back at the multiplexor input to the execution core.

The control scheme may, in one mode, include a priority scheme wherein a replay instruction has higher priority than an original instruction. This is advantageous because a replay instruction is probably older than the original instruction in the original macroinstruction flow, and may be a "blocking" instruction such as if there is a true data dependency.

It is desirable to get replayed instructions finished as quickly as possible. As long as there are unresolved instructions sent to replay, new instructions that are dispatched have a fairly high probability of being dependent on something unresolved and therefore of just getting added to the list of instructions that need to be replayed. As soon as it is necessary to replay one instruction, that one instruction tends to grow a long train of instructions behind it that follows it around. The processor can quickly get in a mode where most instructions are getting executed two or three times, and such a mode may persist for quite a while. Therefore, resolving replayed instructions is very much preferable to introducing new instructions.

Each new instruction introduced while there are things to replay is a gamble. There is a certain probability the new instruction will be independent and some work will get done. On the other hand, there is a certain probability that the new instruction will be dependent and will also need to be replayed. Worse, there may be a number of instructions to follow that will be dependent on the new instruction, and all of those will have to be replayed, too, whereas if the machine had waited until the replays were resolved, then all of these instructions would not have to execute twice.

In one mode, a manufactured instruction may have higher priority than a replay instruction. This is advantageous because these manufactured instructions may be used for critically important and time-sensitive operations. One such sensitive operation is an eviction. After a cache miss, new data will be coming from the L1 cache. When that data arrives, it must be put in the data cache (L0) as quickly as possible. If that is done, the replayed load will just meet the new data and will now be successful. If the data is even one cycle late getting there, the replayed load will pass too soon and must again be replayed. Unfortunately, the data cache location where the processor is going to put the data is now holding the one and only copy of some data that was written some time ago. In other words, the location is "dirty". It is necessary to read the dirty data out, to save it before the new data arrives and is written in its place. This reading of the old data is called "evicting" the data. In some embodiments, there is just exactly enough time to complete the eviction before starting to write the new data in its place. The eviction is done with one or more manufactured instructions. If they are held up for even one cycle, the eviction does not occur in time to avoid the problem described above, and therefore they must be given the highest priority.
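The priority ordering among the three multiplexor sources can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text describes each priority rule as holding "in one mode", and combining them into a single fixed ordering (manufactured, then replay, then original) is an assumption of this sketch. The function name `select_source` and the string labels are hypothetical.

```python
from typing import Optional, Tuple

# Assumed fixed ordering, highest priority first. The text gives each rule
# "in one mode"; merging them into one list is an assumption of this sketch.
PRIORITY = ("manufactured", "replay", "original")


def select_source(manufactured: Optional[str],
                  replay: Optional[str],
                  original: Optional[str]) -> Optional[Tuple[str, str]]:
    """Pick the one instruction that enters the execution core this cycle.

    Each argument is the instruction offered by that source, or None if the
    source has nothing to insert. Returns (source_name, instruction) or None.
    """
    candidates = {
        "manufactured": manufactured,  # from the TLB/TAG unit
        "replay": replay,              # from the buffer unit
        "original": original,          # from the scheduler
    }
    for name in PRIORITY:
        if candidates[name] is not None:
            return name, candidates[name]
    return None  # no source offered an instruction this cycle
```

For instance, `select_source("evict line", "load r1", "add r2")` selects the manufactured eviction instruction, and the replayed load would have to wait in the buffer unit for a later cycle.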

The replay architecture may also be used to enable the processor to in effect "stall" without actually slowing down the execution core or performing clock throttling or the like. There are some circumstances where it would be necessary to stall the frontend and/or execution core, to avoid losing the results of instructions or to avoid other such problems. One example is where the processor's backend temporarily runs out of resources such as available registers into which to write execution results. Other examples include where the external bus is blocked, an upper level of cache is busy being snooped by another processor, a load or store crosses a page boundary, an exception occurs, or the like.

In such circumstances, rather than halt the frontend or throttle the execution core, the replay architecture may very simply be used to send back around for replay all instructions whose results would be otherwise lost. The execution core remains functioning at full speed, and there are no additional signal paths required for stalling the frontend, beyond those otherwise existing to permit the multiplexor to give priority to replay instructions over original instructions.

Other stall-like uses can be made of the replay architecture. For example, assume that a store address instruction misses in the TLB. Rather than saving the linear address to process after getting the proper entry in the TLB, the processor can just drop it on the floor and request the store address instruction to be replayed. As another example, the Page Miss Handler (not shown) may be busy. In this case the processor does not even remember that it needs to do a page walk, but finds that out over again when the store address comes back.

Most cases of running out of resources occur when there is a cache miss. There could well be no fill buffer left, so the machine can't even request an L1 lookup. Or, the L1 may be busy. When a cache miss happens, the machine MAY ask for the data from a higher level cache and MAY just forget the whole thing and not do anything at all to help the situation. In either case, the load (or store address) instruction is replayed. Unlike a more conventional architecture, the present invention does not NEED to remember this instruction in the memory subsystem and take care of it. The processor will do something to help it if it has the resources to do something. If not, it may do nothing at all, not even remember that such an instruction was seen by the memory subsystem. The memory subsystem, by itself, will never do anything for this instance of the instruction. When the instruction executes again, then it is considered all over again. In the case of a store address instruction, the instruction has delivered its linear address to the memory subsystem and it doesn't want anything back. A more conventional approach might be to say that this instruction is done, and any problems from here on out are memory subsystem problems, in which case the memory subsystem must then store information about this store address until it can get resources to take care of it. The present approach is that the store address replays, and the memory subsystem does not have to remember it at all. Here it is a little more clear that the processor is replaying the store address specifically because of inability to handle it in the memory subsystem.

In one mode, when an instruction gets replayed, all dependent instructions also get replayed. This may include all those which used the replayed instruction's output as input, all those which are down control flow branches picked according to the replayed instruction, and so forth.

The processor does not replay instructions merely because they are control flow dependent on an instruction that replayed. The thread of control was predicted. The processor is always following a predicted thread of control and never necessarily knows during execution if it is going the right way or not. If a branch gets bad input, the branch instruction itself is replayed. This is because the processor cannot reliably determine from the branch if the predicted thread of control is right or not, since the input data to the branch was not valid. No other instructions get replayed merely because the branch got bad data. Eventually - possibly after many replays - the branch will be correctly executed. At this time, it does what all branches do - it reports if the predicted direction taken for this branch was correct or not. If it was correctly predicted, everything goes on about its business. If it was not correctly predicted, then there is simply a branch misprediction; the fact that this branch was replayed any number of times makes no difference. A mispredicted branch cannot readily be repaired with a replay. A replay can only execute exactly the same instructions over again. If a branch was mispredicted, the processor has likely done many wrong instructions and needs to actually execute some completely different instructions.
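The branch handling described above amounts to a small decision rule, which might be written as follows. This is a minimal sketch rather than a description of the actual circuit; the function name `branch_outcome` and the returned labels are hypothetical.

```python
def branch_outcome(input_known_correct: bool, prediction_correct: bool) -> str:
    """Classify what happens to a branch after it executes.

    input_known_correct: whether the branch's input data is known to be correct.
    prediction_correct:  whether the predicted direction matched the outcome;
                         only meaningful once the input data is valid.
    """
    if not input_known_correct:
        # The branch cannot judge the prediction from bad data, so only the
        # branch itself is replayed; no other instruction replays because of it.
        return "replay_branch"
    if prediction_correct:
        # Correctly executed and correctly predicted: nothing further to do.
        return "continue"
    # A misprediction is not repaired by replay, since a replay can only
    # execute the same instructions again; different instructions are needed.
    return "branch_misprediction"
```

Note that the number of times the branch was replayed does not appear as an input: once the branch executes with valid data, the outcome depends only on whether the prediction was right.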

To summarize, an instruction is replayed either: 1) because the instruction itself was not correctly processed for any reason, or 2) if the input data that this instruction uses is not known to be correct. Data is known to be correct if it is produced by an instruction that is itself correctly processed and all of its input data is known to be correct. In this definition, branches are viewed not as having anything to do with the control flow but as data handling instructions which simply report interesting things to the front end of the machine but do not produce any output data that can be used by any other instruction. Hence, the correctness of any other instruction cannot have anything to do with them. The correctness of the control flow is handled by a higher authority and not in the purview of mere execution and replay.

Fig. 8 illustrates more about the checker unit. Again, an instruction is replayed if: 1) it was not processed correctly, or 2) it used input data that is not known to be correct. These two conditions give a good division for discussing the operation of the checker unit. The first condition depends on everything that needs to be done for the instruction. Anything in the machine that needs to do something to correctly execute the instruction is allowed to goof and to signal to the checker that it goofed. The first condition is therefore talking about signals that come into the checker, potentially from many places, that say, "I goofed on this instruction."

In some embodiments, the most common goof is the failure of the data cache to supply the correct result for a load. This is signaled by the hit/miss logic. Another common goof is failure to correctly process a store address; this would typically result from a TLB miss on a store address, but there can be other causes, too. In some embodiments, the L1 cache may deliver data (which may go into the L0 cache and be used by instructions) that contains an ECC error. This would be signaled quickly, and then corrected as time permits.

In some fairly rare cases, the adder cannot correctly add two numbers. This is signaled by the flag logic which keeps tabs on the adders.

In some other rare cases, the logic unit fails to get the correct answer when doing an AND, XOR, or other simple logic operation. These, too, are signaled by the flag logic. In some embodiments, the floating point unit may not get the correct answer all of the time, in which case it will signal when it goofs a floating point operation. In principle, you could use this for many types of goofs. It could be used for algorithmic goofs and it could even be used for hardware errors (circuit goofs). Regardless of the cause, whenever the processor doesn't do exactly what it is supposed to do, and the goof is detected, the processor's various units can request a replay by signaling to the checker.

The second condition which causes replays - whether data is known to be correct - is entirely the responsibility of the checker itself. The checker contains the official list of what data is known to be correct. It is what is sometimes called the "scoreboard". It is the checker's responsibility to look at all of the input data for each instruction execution instance and to determine if all such input data is known to be correct or not. It is also the checker's responsibility to add it all up for each instruction execution instance, to determine if the result produced by that instruction execution instance can therefore be deemed to be "known to be correct". If the result of an instruction is deemed "known to be correct", this is noted on the scoreboard so the processor now has new, known-correct data that can be the input for other instructions.

The accompanying figure illustrates one exemplary checker which may be employed in practicing the architecture of the present invention. Because the details of the checker are not necessary in order to understand the invention, a simplified checker is illustrated to show the requirements for a checker sufficient to make the replay system work correctly.

In this embodiment, one instruction is processed per cycle. After an instruction has been executed, it is represented to the checker by signals OP1, OP1V, OP2, OP2V, DST, and a latency vector which was assigned to the uop by the decoder on the basis of the opcode. The signals OP1V and OP2V indicate whether the instruction includes a first operand and a second operand, respectively. The signals OP1 and OP2 identify the physical source register of the first and second operands, respectively, and are received at read address ports RA1 and RA2 of the scoreboard. The signal DST identifies the physical destination register where the result of the instruction was written.

The latency vector has all 0's except a 1 in one position. The position of the 1 denotes the latency of this instruction. An instruction's latency is how many cycles there are after the instruction begins execution before another instruction can use its result. The scoreboard has one bit of storage for each physical register in the machine. The bit is 0 if that register is not known to contain correct data and it is 1 if that register is known to contain correct data.
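The one-hot latency vector can be illustrated with a pair of helper functions. This is a sketch under an assumed encoding: the text says only that the position of the 1 denotes the latency, so the mapping used here (list index i means a latency of i + 1 cycles) and the names `encode_latency` and `decode_latency` are hypothetical.

```python
from typing import List, Optional


def encode_latency(latency_cycles: int, width: int) -> List[int]:
    """Build a one-hot latency vector of the given width.

    Assumed mapping (not specified in the text): index i <=> i + 1 cycles.
    """
    if not 1 <= latency_cycles <= width:
        raise ValueError("latency out of range for this vector width")
    return [1 if i == latency_cycles - 1 else 0 for i in range(width)]


def decode_latency(vector: List[int]) -> Optional[int]:
    """Return the latency in cycles, or None if the vector is all zeros."""
    if sum(vector) > 1:
        raise ValueError("a latency vector holds at most a single 1")
    return vector.index(1) + 1 if 1 in vector else None
```

Under this assumed mapping, a three-cycle instruction with a four-position vector is encoded as `[0, 0, 1, 0]`.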

The register renamer, described above, allocates these registers. At the time a physical register is allocated to hold the result of some instruction, the renamer sends the register number to the checker as multiple-bit signal CLEAR. The scoreboard sets to 0 the scoreboard bit which is addressed by CLEAR.

The one or two register operands for the instruction currently being checked (as indicated by OP1 and OP2) are looked up in the scoreboard to see if they are known to be correct, and the results are output as scoreboard values SV1 and SV2, respectively. An AND gate 350 receives the first scoreboard value SV1 and the first operand valid signal OP1V. Another AND gate 355 similarly receives signals SV2 and OP2V for the second operand. The operand valid signals OP1V and OP2V cause the scoreboard values SV1 and SV2 to be ignored if the instruction does not actually require those respective operands.

The outputs of the AND gates are provided to NOR gate 360, along with an external replay request signal. The output of the NOR gate will be false if either operand is required by the instruction and is not known to be correct, or if the external replay request signal is asserted. Otherwise the output will be true. The output of the NOR gate 360 is the checker output INSTRUCTION OK. If it is true, the instruction was completed correctly and is ready to be considered for retirement. If it is false, the instruction must be replayed.
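The truth table of the AND/NOR network can be expressed as a short Boolean function. This is a sketch of the described behaviour, not of the gate-level circuit: to match the stated truth table, it assumes each AND gate sees the complement of its scoreboard value, so that a gate output of 1 means "operand required and not known to be correct". The function and parameter names are hypothetical.

```python
def instruction_ok(sv1: bool, op1v: bool,
                   sv2: bool, op2v: bool,
                   external_replay_request: bool) -> bool:
    """Compute the checker output INSTRUCTION OK for one instruction."""
    # Gate 350 (assumed inverted SV1 input): operand 1 is needed but unverified.
    operand1_bad = op1v and not sv1
    # Gate 355 (assumed inverted SV2 input): operand 2 is needed but unverified.
    operand2_bad = op2v and not sv2
    # Gate 360 is a NOR: true only when none of its inputs is asserted.
    return not (operand1_bad or operand2_bad or external_replay_request)
```

For example, an instruction whose first operand is not known to be correct but is also not required (`op1v` false) still passes, provided its second operand is good and no unit has requested a replay.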

A delay line receives the destination register identifier DST and the checker output INSTRUCTION OK information for the instruction currently being checked. The simple delay line shown is constructed of registers (single cycle delays) and muxes. It will be understood that each register and mux is a multiple-bit device, or represents multiple single-bit devices. Those skilled in the art will understand that various other types of delay lines, and therefore different formats of latency vectors, could be used.

The DST and INSTRUCTION OK information is inserted in one location of the delay line, as determined by the value of the latency vector. This information is delayed for the required number of cycles according to the latency vector, and then it is applied to the write port WP of the scoreboard. The scoreboard bit corresponding to the destination register DST for the instruction is then written according to the value of INSTRUCTION OK. A value of 1 indicates that the instruction did not have to be replayed, and a value of 0 indicates that the instruction did have to be replayed, meaning that its result data is not known to be correct.

In this design, it is assumed that no instruction has physical register zero as a real destination or as a real source. If there is no valid instruction in some cycle, the latency vector for that cycle will be all zeros. This will effectively enter physical register zero with the longest possible latency into the delay line, which is harmless. Similarly, an instruction that does not have a real destination register will specify a latency vector of all zeros. It is further assumed that at startup, this unit runs for several cycles with no valid instructions arriving, so as to fill the delay line with zeros before the first real instruction has been allocated a destination register, and hence before the corresponding bit in the scoreboard has been cleared. The scoreboard needs no additional initialization.
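The interaction of the scoreboard, the CLEAR signal, and the delay line can be illustrated with a small cycle-level model. This is a simplified sketch and not the circuit of the figure: the class name `CheckerModel`, the choice that list index 0 is the stage about to reach the write port, and the handling of an all-zero latency vector as an entry for register zero at the far end are all assumptions made for illustration.

```python
from typing import List, Tuple


class CheckerModel:
    """Toy model of the scoreboard plus delay line (names are hypothetical)."""

    def __init__(self, num_regs: int, depth: int) -> None:
        # One bit per physical register: 1 = known correct, 0 = not known.
        self.scoreboard: List[int] = [0] * num_regs
        # Delay line of (DST, INSTRUCTION_OK) pairs; index 0 leaves next.
        # Filled with zeros, matching the start-up assumption in the text.
        self.delay: List[Tuple[int, int]] = [(0, 0)] * depth

    def clear(self, reg: int) -> None:
        """CLEAR from the renamer: the register was just allocated."""
        self.scoreboard[reg] = 0

    def cycle(self, dst: int, ok: int, latency_vector: List[int]) -> None:
        """Advance one clock with the currently checked instruction."""
        # 1. The entry leaving the delay line drives the write port WP.
        out_dst, out_ok = self.delay[0]
        self.scoreboard[out_dst] = out_ok
        # 2. Every remaining entry moves one stage closer to the write port;
        #    a harmless register-zero entry enters at the far end.
        self.delay = self.delay[1:] + [(0, 0)]
        # 3. A valid instruction is inserted at the stage picked by the 1.
        #    An all-zero vector inserts nothing beyond the far-end entry.
        if 1 in latency_vector:
            self.delay[latency_vector.index(1)] = (dst, ok)
```

As a usage example, with `depth=4`, clearing register 3 and then calling `cycle(3, 1, [0, 0, 1, 0])` followed by three idle cycles with an all-zero vector leaves `scoreboard[3]` at 1, because the entry needs that many shifts to reach the write port. Writes to register zero are harmless in this model for the reason given in the text: register zero is never a real source or destination.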

Potentially, this checker checks one instruction per cycle (but other embodiments are, of course, feasible). The cycle in which an instruction is checked is a fixed number of cycles after that instruction began execution and captured the data that it used for its operands. This number of cycles later is sufficient to allow the EXTERNAL REPLAY REQUEST signal for the instruction to arrive at the checker to be processed along with the other information about the instruction. The EXTERNAL REPLAY REQUEST signal is the OR of all signals from whatever parts of the machine may produce replay requests that indicate that the instruction was not processed correctly. For example it may indicate that data returned from the data cache may not have been correct, for any of many reasons, a good example being that there was a cache miss.

It should be appreciated by the skilled reader that the particular partitionings described above are illustrative only. For example, although it has been suggested that certain features may be relegated to the outermost core 210, it may be desirable that certain of these reside in a mid-level portion of the core, such as in the latency-intolerant core 255 of Fig. 4, between the outermost core 210 and the innermost core 260. It should also be appreciated that although the invention has been described with reference to the Intel Architecture processors, it is useful in any number of alternative architectures, and with a wide variety of microarchitectures within each.

While the invention has been described with reference to specific modes and embodiments, for ease of explanation and understanding, those skilled in the art will appreciate that the invention is not necessarily limited to the particular features shown herein, and that the invention may be practiced in a variety of ways which fall under the scope of this disclosure. The invention is, therefore, to be afforded the fullest allowable scope of the claims which follow.

Claims

CLAIMS:

1. A processor comprising: a first section operating at a first clock frequency and including, an execution core; and a second section operating at a second clock frequency lower than the first clock frequency and including, a checker for determining whether an instruction was processed correctly and used input data that are known to be correct.

2. The processor of claim 1 wherein the second section further comprises a renamer.

3. The processor of claim 2 wherein the renamer comprises a register allocation table.

4. The processor of claim 1 wherein: the processor further comprises a plurality of registers; and the checker comprises: a scoreboard for keeping track of which of the registers have contents which are known to be correct, a delay line coupled to receive a latency vector indicating how many clock cycles after a first instruction begins execution can a second instruction use a result of the first instruction, and logic circuitry, coupled to the scoreboard and to the delay line, for indicating whether the instruction was correctly completed or the instruction needs to be replayed.
