WO2001027749A1

WO2001027749A1 - Apparatus and method for caching alignment information

Info

Publication number: WO2001027749A1
Application number: PCT/US2000/012617
Authority: WO
Inventors: James B. Keller; Puneet Sharma; Keith R. Schakel; Francis M. Matus
Original assignee: Advanced Micro Devices, Inc.
Priority date: 1999-10-14
Filing date: 2000-05-09
Publication date: 2001-04-19
Also published as: EP1224539A1; JP2003511789A; US20040168043A1; KR20020039689A

Abstract

A line predictor (12) caches alignment information for instructions. In response to each fetch address, the line predictor (12) provides alignment information for the instruction beginning at the fetch address, as well as one or more additional instructions subsequent to that instruction. The alignment information may be, for example, instruction pointers. The line predictor (12) may include a memory having multiple entries (90, 82), each entry storing up to a predefined maximum number of instruction pointers (102, 104, 106, 108) and a fetch address (92) corresponding to the instruction identified by a first one of the instruction pointers. Additionally, each entry (90, 82) may include a link to another entry storing instruction pointers to the next instructions within the predicted instruction stream. Furthermore, the entries (90, 82) may store a next fetch address (112) corresponding to the first instruction within the next entry (90, 82). The next fetch address (112) may be provided to the instruction cache (14) to fetch the corresponding instruction bytes.

Description

APPARATUS AND METHOD FOR CACHI NG ALIGNMENT INFORMATION

BACKGROUND OF THE INVENTION

1 Technical Field

This invention is related to the field of processors and, more particularly, to mstruction fetching mechanisms withm processors

2 Background Art

Superscalar processors achieve high performance by executing multiple instructions per clock cycle and by choosmg the shortest possible clock cycle consistent with the design As used herem, the term "clock cycle" refers to an interval of time accorded to various stages of an mstruction processmg pipeline withm the processor Storage devices (e g registers and arrays) capture their values accordmg to the clock cycle For example, a storage device may capture a value accordmg to a πsmg or fallmg edge of a clock signal defmmg the clock cycle The storage device then stores the value until the subsequent πsmg or fallmg edge of the clock signal respectively The term "mstruction processmg pipeline" is used herem to refer to the logic circuits employed to process instructions m a pipelined fashion Although the pipeline may be divided mto any number of stages at which portions of mstruction processmg are performed, mstruction processmg generally compnses fetchmg the mstruction, decodmg the mstruction, executmg the mstruction, and stormg the execution results m the destination identified by the mstruction

A popular mstruction set architecture is the x86 instruction set architecture Due to the widespread acceptance of the x86 mstruction set architecture m the computer industry, superscalar processors designed m accordance with this architecture are becoming mcreasmgly common The x86 mstruction set architecture specifies a variable byte-length mstruction set m which different instructions may occupy differing numbers of bytes For example, the 80386 and 80486 processors allow a particular mstruction to occupy a number of bvtes between 1 and 15 The number of bytes occupied depends upon the particular instruction as well as various addressing mode options for the mstruction

Because instructions are variable-length, locatmg mstruction boundaπes is complicated The length of a first mstrucnon must be determmed prior to locatmg a second mstruction subsequent to the first mstruction withm an instruction stream However, the ability to locate multiple instructions withm an mstruction stream during a particular clock cycle is crucial to superscalar processor operation As operatmg frequencies mcrease (1 e as clock cycles shorten), it becomes mcreasmgly difficult to locate multiple instructions simultaneously

Various predecode schemes have been proposed m which a predecoder appends information regardmg each mstruction byte to the mstruction byte as the instruction is stored into the cache As used herem, the term "predecoding" is used to refer to generating mstruction decode information pnor to stormg the corresponding mstruction bytes into an mstruction cache of a processor The generated information may be stored with the instruction bytes m the instruction cache For example, an instruction byte may be mdicated to be the beginning or end of an mstruction By scanning the predecode information when the corresponding instruction bytes are fetched, instructions may be located without actually attempting to decode the mstruction bytes The predecode information may be used to decrease the amount of logic needed to locate multiple variable-length instructions simultaneously Unfortunately, these schemes become insufficient at high clock frequencies as well A method for locatmg multiple instructions during a clock cycle at high frequencies is needed

DISCLOSURE OF INVENTION

The problems outlmed above are in large part solved by a lme predictor as descπbed herem The lme predictor caches alignment information for instructions In response to each fetch address the lme predictor provides alignment information for the mstruction beginning at the fetch address, as well as one or more additional lnstrucnons subsequent to that mstruction The alignment information may be, for example, mstruction pomters, each of which directly locates a correspondmg mstruction withm a plurality of mstruction bytes fetched m response to the fetch address Smce instructions are located by the pomters, the alignment of instructions to decode units may be a low latency, high frequency operation Rather than having to scan predecode data stored on a byte by byte basis, the alignment information is stored on an mstruction basis based on fetch address In this manner, instructions may be more easily extracted from the fetched mstruction bytes The lme predictor may include a memory havmg multiple entπes, each entry stormg up to a predefined maximum number of mstruction pomters and a fetch address correspondmg to the mstruction identified by a first one of the mstruction pomters Fetch addresses may be searched agamst the fetch addresses stored m the multiple entπes, and if a match is detected the correspondmg mstruction pomters may be used Additionally, each entry may mclude a link to another entry stormg mstruction pomters to the next instructions withm the predicted instruction stream Furthermore, the entπes may store a next fetch address correspondmg to the first mstruction withm the next entry The next fetch address may be provided to the mstruction cache to fetch the correspondmg instruction bytes Fetchmg instructions by following the links withm the lme predictor may allow skippmg of the search for fetch addresses withm the lme predictor for those subsequent entπes Power dissipation may be reduced due to the fewer searches of the lme predictor memory, and the number of pipeline stages pπor to execution may be reduced for the fetches completed by following the links

Broadly speaking, a processor is contemplated The processor comprises a fetch address generation unit configured to generate a fetch address and a line predictor coupled to the fetch address generation unit The lme predictor mcludes a first memory compnsmg a plurality of entπes, each entry stormg a plurality of mstruction pomters The lme predictor is configured to select a first entry (of the plurality of entπes) correspondmg to the fetch address Each of a first plurality of instruction pomters withm the first entry, if valid, directly locates an mstruction withm a plurality of instruction bytes fetched m response to the fetch address Additionally, a computer system is contemplated including the processor and an input/output (I/O) device configured to communicate between the computer system and another computer system to which the I/O device is couplable

Moreover, a method is contemplated A fetch address is generated A first plurality of mstruction pomters are selected from a lme predictor, the first plurality of instruction pomters correspondmg to the fetch address Each of the first plurality of mstruction pomters, if valid, directlv locates an mstruction withm a plurality of mstruction bytes fetched m response to the fetch address BRIEF DESCRIPTION OF DRAWINGS

Other objects and advantages of the mvention will become apparent upon readmg the following detailed descπption and upon reference to the accompanymg drawings m which:

Fig. 1 is a block diagram of one embodiment of a processor Fig. 2 is a pipeline diagram which may be employed by one embodiment of the processor shown m Fig. 1

Fig. 3 is a block diagram illustrating one embodiment of a branch prediction apparams, a fetch PC generation unit, a lme predictor, an mstruction TLB, an I-cache, and a predictor miss decode unit.

Fig. 4 is a block diagram of one embodiment of a lme predictor.

Fig. 5 is a diagram illustratmg one embodiment of an entry m a PC CAM shown m Fig. 4. Fig. 6 is a diagram illustratmg one embodiment of an entry m an Index Table shown m Fig. 4.

Fig. 7 is a diagram illustratmg one embodiment of a next entry field shown m Fig. 6.

Fig. 8 is a diagram illustrating one embodiment of a control information field shown in Fig. 6.

Fig. 9 is a table illustrating one embodiment of termination conditions for creating an entry withm the lme predictor. Fig. 10 is a timing diagram illustratmg operation of one embodiment of the lme predictor for a branch prediction which matches the prediction made by the lme predictor

Fig. 11 is a timing diagram illustratmg operation of one embodiment of the lme predictor for a branch prediction which does not match the prediction made by the lme predictor.

Fig. 12 is a timing diagram illustratmg operation of one embodiment of the lme predictor for an indirect target branch prediction which does not match the prediction made by the lme predictor.

Fig. 13 is a timing diagram illustratmg operation of one embodiment of the lme predictor for a return address prediction which matches the prediction made by the lme predictor.

Fig. 14 is a timing diagram illustratmg operation of one embodiment of the lme predictor for a return address prediction which does not match the prediction made by the lme predictor. Fig. 15 is a timing diagram illustratmg operation of one embodiment of the lme predictor for a fetch which crosses a page boundary.

Fig. 16 is a timing diagram illustratmg operation of one embodiment of the lme predictor and the predictor miss decode unit for a lme predictor miss.

Fig. 17 is a timing diagram illustratmg operation of one embodiment of the lme predictor and the predictor miss decode unit for a null next mdex m the lme predictor.

Fig. 18 is a timing diagram illustratmg operation of one embodiment of the lme predictor and the predictor miss decode unit for a lme predictor entry having incorrect alignment information

Fig. 19 is a timing diagram illustratmg operation of one embodiment of the line predictor and the predictor miss decode unit for generatmg an entry termmated by an MROM mstruction or a non-branch mstruction Fig. 20 is a timing diagram illustratmg operation of one embodiment of the lme predictor and the predictor miss decode unit for generatmg an entry termmated by a branch mstruction

Fig. 21 is a timing diagram illustrating operation of one embodiment of the lme predictor and the predictor miss decode unit for trainmg a lme predictor entry termmated by a branch mstruction for both next fetch PCs and mdexes. Fig. 22 is a block diagram illustratmg one embodiment of a predictor miss decode unit shown m Figs 1 and 3

Fig. 23 is a block diagram of a first exemplary computer system mcludmg the processor shown m Fig 1

Fig 24 is a block diagram of a second exemplary computer system mcludmg the processor shown m Fig 1

While the mvention is susceptible to vaπous modifications and alternative forms, specific embodiments thereof are shown by wav of example m the drawmgs and will herem be descπbed in detail It should be understood, however, that the drawmgs and detailed descπption thereto are not mtended to limit the mvention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives fallmg withm the spiπt and scope of the present mvention as defined by the appended claims

MODE(S) FOR CARRYING OUT THE INVENTION

Processor Overview Turning now to Fig 1, a block diagram of one embodiment of a processor 10 is shown Other embodiments are possible and contemplated In the embodiment of Fig 1, processor 10 mcludes a lme predictor 12, an mstruction cache (I-cache) 14, an alignment unit 16, a branch prediction fetch PC generation unit 18, a plurality of decode units 24A-24D, a predictor miss decode unit 26, a microcode unit 28, a map unit 30, a retire queue 32. an architectural renames file 34, a future file 20. a scheduler 36, an mteger register file 38A, a floatmg pomt register file 38B, an mteger execution core 40A. a floatmg pomt execution core 40B, a load/store unit 42, a data cache (D-cache) 44, an external interface unit 46, and a PC silo 48 Lme predictor 12 is coupled to predictor miss decode umt 26, branch prediction/fetch PC generation unit 18, PC silo 48, and alignment unit 16 Lme predictor 12 may also be coupled to I-cache 14 I-cache 14 is coupled to alignment unit 16 and branch prediction/fetch PC generation unit 18, which is further coupled to PC silo 48 Alignment unit 16 is further coupled to predictor miss decode unit 26 and decode units 24A-24D Decode units 24A-24D are further coupled to map unit 30, and decode unit 24D is coupled to microcode unit 28 Map unit 30 is coupled to retire queue 32 (which is coupled to architectural renames file 34), future file 20, scheduler 36, and PC silo 48 Architectural renames file 34 is coupled to future file 20 Scheduler 36 is coupled to register files 38A-38B, which are further coupled to each other and respecπve execution cores 40A-40B Execution cores 40A-40B are further coupled to load/store unit 42 and scheduler 36 Execution core 40A is further coupled to D-cache 44 Load store unit 42 is coupled to scheduler 36, D-cache 44, and external mterface unit 46 D-cache 44 is coupled to register files 38 External mterface unit 46 is coupled to an external mterface 52 and to I-cache 14 Elements referred to herem by a reference numeral followed by a letter will be collectively referred to by the reference numeral alone For example, decode units 24A- 24D will be collectively refeπed to as decode units 24 In the embodiment of Fig 1, processor 10 employs a variable byte length, complex mstruction set computmg (CISC) mstruction set architecture For example, processor 10 may employ the x86 mstruction set architecture (also referred to as IA-32) Other embodiments may employ other mstruction set architectures mcludmg fixed length mstruction set architectures and reduced mstruction set computmg (RISC) mstruction set architectures Certain features shown in Fig 1 may be omitted m such architectures Branch prediction/fetch PC generation unit 18 is configured to provide a fetch address (fetch PC) to I- cache 14, lme predictor 12, and PC silo 48. Branch prediction/fetch PC generation unit 18 may mclude a suitable branch prediction mechanism used to aid m the generation of fetch addresses. In response to the fetch address, lme predictor 12 provides alignment mformaUon correspondmg to a plurality of instructions to alignment unit 16, and may provide a next fetch address for fetchmg instructions subsequent to the instructions identified by the provided instruction mformation. The next fetch address may be provided to branch prediction/fetch PC generation unit 18 or may be directly provided to I-cache 14, as desired. Branch prediction/fetch PC generation unit 18 may receive a trap address from PC silo 48 (if a trap is detected) and the trap address may compπse the fetch PC generated by branch prediction/fetch PC generation unit 18. Otherwise, the fetch PC may be generated usmg the branch prediction mformation and mformation from lme predictor 12 Generally, lme predictor 12 stores mformation correspondmg to instructions previously speculatively fetched by processor 10. In one embodiment, lme predictor 12 mcludes 2K entπes, each entry locatmg a group of one or more instructions referred to herem as a "lme" of instructions The lme of instructions may be concurrently processed by the mstruction processmg pipeline of processor 10 through bemg placed mto scheduler 36 I-cache 14 is a high speed cache memory for stormg mstruction bytes. Accordmg to one embodiment I- cache 14 may compπse, for example, a 128 Kbyte, four way set associative organization employmg 64 byte cache lmes. However, any I-cache structure may be suitable (mcludmg direct-mapped structures)

Alignment unit 16 receives the mstruction alignment mformation from lme predictor 12 and mstruction bytes correspondmg to the fetch address from I-cache 14 Alignment unit 16 selects mstruction bytes mto each of decode units 24A-24D accordmg to the provided instruction alignment mformation. More particularly, lme predictor 12 provides an mstruction pomter correspondmg to each decode unit 24A-24D The mstruction pomter locates an mstruction withm the fetched mstruction bytes for conveyance to the correspondmg decode unit 24A- 24D. In one embodiment, certain instructions may be conveyed to more than one decode unit 24A-24D Accordmgly, m the embodiment shown, a lme of instructions from lme predictor 12 may mclude up to 4 instructions, although other embodiments may include more or fewer decode units 24 to provide for more or fewer instructions withm a lme

Decode units 24A-24D decode the instructions provided thereto, and each decode unit 24A-24D generates information identifying one or more mstruction operations (or ROPs) correspondmg to the instructions. In one embodiment, each decode unit 24A-24B may generate up to two mstruction operations per mstruction. As used herem, an mstruction operation (or ROP) is an operation which an execution unit withm execution cores 40A-40B is configured to execute as a smgle entity Simple instructions may correspond to a smgle mstruction operation, while more complex instructions may correspond to multiple instruction operations. Certain of the more complex instructions may be implemented withm microcode unit 28 as microcode routmes (fetched from a read-only memory therein via decode unit 24D in the present embodiment). Furthermore, embodiments employmg non-CISC instruction sets may employ a smgle instruction operation for each mstruction (i.e. mstruction and mstruction operation may be synonymous m such embodiments)

PC silo 48 stores the fetch address and instruction mformation for each mstruction fetch, and is responsible for redirecting mstruction fetchmg upon exceptions (such as mstruction traps defined by the mstruction set architecture employed by processor 10, branch mispredictions, and other microarchitecturally defined traps). PC silo 48 may mclude a circular buffer for stormg fetch address and mstruction mformation coπespondmg to multiple lines of mstructions which may be outstanding withm processor 10 In response to retirement of a lme of instructions, PC silo 48 may discard the correspondmg entry. In response to an exception, PC silo 48 may provide a trap address to branch prediction fetch PC generation unit 18. Retirement and exception mformation may be provided by scheduler 36 In one embodiment, PC silo 48 assigns a sequence number (R#) to each mstruction to identify the order of mstructions outstanding withm processor 10. Scheduler 36 may return R#s to PC silo 48 to identify mstruction operations expeπencmg exceptions or retiring mstruction operations.

Upon detecting a miss m lme predictor 12, alignment unit 16 routes the correspondmg mstruction bytes from I-cache 14 to predictor miss decode unit 26. Predictor miss decode unit 26 decodes the mstruction, enforcing any limits on a lme of mstructions as processor 10 is designed for (e.g. maximum number of mstruction operations, maximum number of mstructions, terminate on branch mstructions, etc.). Upon terminating a lme, predictor miss decode unit 26 provides the mformation to lme predictor 12 for storage. It is noted that predictor miss decode unit 26 may be configured to dispatch mstructions as they are decoded Alternatively, predictor miss decode unit 26 may decode the lme of mstruction mformation and provide it to lme predictor 12 for storage Subsequently, the missing fetch address may be reattempted in lme predictor 12 and a hit may be detected

In addition to decodmg mstructions upon a miss m lme predictor 12, predictor miss decode unit 26 may be configured to decode mstructions if the mstruction mformation provided by lme predictor 12 is invalid. In one embodiment, processor 10 does not attempt to keep mformation in line predictor 12 coherent with the mstructions withm I-cache 14 (e.g when mstructions are replaced or invalidate in I-cache 14, the coπespondmg mstruction mformation may not actively be mvalidated). Decode units 24A-24D may veπfy the mstruction mformation provided, and may signal predictor miss decode unit 26 when invalid instruction mformation is detected. Accordmg to one particular embodiment, the following mstruction operations are supported by processor 10 mteger (mcludmg anthmetic, logic, shift/rotate, and branch operations), floatmg pomt (mcludmg multimedia operations), and load/store. The decoded mstruction operations and source and destination register numbers are provided to map unit

30. Map unit 30 is configured to perform register renaming by assignmg physical register numbers (PR#s) to each destination register operand and source register operand of each mstruction operation. The physical register numbers identify registers within register files 38A-38B. Map unit 30 additionally provides an indication of the dependencies for each mstruction operation by providmg R#s of the instruction operations which update each physical register number assigned to a source operand of the mstruction operation. Map unit 30 updates future file 20 with the physical register numbers assigned to each destmation register (and the R# of the correspondmg mstruction operation) based on the corresponding logical register number. Additionally, map unit 30 stores the logical register numbers of the destmation registers, assigned physical register numbers, and the previously assigned physical register numbers m retire queue 32 As mstructions are retired (indicated to map unit 30 by scheduler 36), retire queue 32 updates architectural renames file 34 and frees any registers which are no longer m use. Accordmgly, the physical register numbers m architectural register file 34 identify the physical registers stormg the committed architectural state of processor 10, while future file 20 represents the speculative state of processor 10 In other words, architectural renames file 34 stores a physical register number correspondmg to each logical register, representing the committed register state for each logical register. Future file 20 stores a physical register number coπespondmg to each logical register, representing the speculative register state for each logical register

The lme of mstruction operations, source physical register numbers, and destmation physical register numbers are stored mto scheduler 36 accordmg to the R#s assigned by PC silo 48 Furthermore, dependencies for a particular mstruction operation may be noted as dependencies on other mstruction operations which are stored m the scheduler In one embodiment, mstruction operations remam m scheduler 36 until retired

Scheduler 36 stores each mstruction operation until the dependencies noted for that mstruction operation have been satisfied In response to schedulmg a particular mstruction operation for execution, scheduler 36 may determine at which clock cycle that particular mstruction operation will update register files 38A-38B Different execution units withm execution cores 40A-40B may employ different numbers of pipeline stages (and hence different latencies) Furthermore, certain mstructions may expeπence more latency withm a pipeline than others Accordmgly, a countdown is generated which measures the latency for the particular mstruction operation (m numbers of clock cycles) Scheduler 36 awaits the specified number of clock cycles (until the update will occur pnor to or comcident with the dependent mstruction operations readmg the register file), and then mdicates that mstruction operations dependent upon that particular mstruction operation may be scheduled It is noted that scheduler 36 may schedule an mstruction once its dependencies have been satisfied (l e out of order with respect to its order withm the scheduler queue)

Integer and load store mstruction operations read source operands accordmg to the source physical register numbers from register file 38A and are conveyed to execution core 40A for execution Execution core 40A executes the mstruction operation and updates the physical register assigned to the destmation withm register file 38A Additionally, execution core 40A reports the R# of the mstruction operation and exception mformation regardmg the mstruction operation (if any) to scheduler 36 Register file 38B and execution core 40B may operate m a similar fashion with respect to floatmg point instruction operations (and may provide store data for floatmg pomt stores to load/store unit 42) In one embodiment, execution core 40A may mclude. for example, two mteger units, a branch unit, and two address generation units (with coπespondmg translation lookaside buffers, or TLBs) Execution core 40B may mclude a floatmg point/multimedia multiplier, a floating point/multimedia adder, and a store data unit for delivering store data to load/store unit 42 Other configurations of execution units are possible

Load/store unit 42 provides an mterface to D-cache 44 for performing memory operations and for schedulmg fill operations for memory operations which miss D-cache 44 Load memory operations may be completed by execution core 40A performing an address generation and forwarding data to register files 38A-38B (from D-cache 44 or a store queue withm load store unit 42) Store addresses may be presented to D-cache 44 upon generation thereof by execution core 40A (directly via connections between execution core 40A and D-Cache 44) The store addresses are allocated a store queue entry The store data may be provided concurrently, or may be provided subsequently, accordmg to design choice Upon retirement of the store mstruction, the data is stored mto D-cache 44 (although there may be some delay between retirement and update of D-cache 44) Additionally, load/store unit 42 may mclude a load store buffer for storing load/store addresses which miss D-cache 44 for subsequent cache fills (via external mterface unit 46) and re-attemptrng the missing load store operations Load/store unit 42 is further configured to handle load/store memory dependencies D-cache 44 is a high speed cache memory for stormg data accessed by processor 10 While D-cache 44 may compnse any suitable structure (mcludmg direct mapped and set-associative structures), one embodiment of D-cache 44 may compπse a 128 Kbyte, 2 way set associative cache havmg 64 byte lmes

External mterface unit 46 is configured to communicate to other devices via external mterface 52 Any suitable external mterface 52 may be used, mcludmg interfaces to L2 caches and an external bus or buses for connecting processor 10 to other devices External mterface unit 46 fetches fills for I-cache 16 and D-cache 44, as well as writing discarded updated cache lmes from D-cache 44 to the external mterface Furthermore external mterface unit 46 may perform non-cacheable reads and wπtes generated by processor 10 as well

Turning next to Fig 2, an exemplary pipeline diagram illustratmg an exemplary set of pipeline stages which may be employed by one embodiment of processor 10 is shown Other embodiments may employ different pipelines, pipelines mcludmg more or fewer pipeline stages than the pipeline shown m Fig 2 The stages shown m Fig 2 are delimited by vertical dashed lines Each stage is one clock cycle of a clock signal used to clock storage elements (e g registers, latches, flops, and the like) withm processor 10

As illustrated m Fig 2, the exemplary pipeline mcludes a CAM0 stage, a CAM1 stage, a lme predictor (LP) stage, an mstruction cache (IC) stage, an alignment (AL) stage, a decode (DEC) stage, a mapl (Ml) stage, a map2 (M2) stage, a wπte scheduler (WR SC) stage, a read scheduler (RD SC) stage, a register file read (RF RD) stage, an execute (EX) stage, a register file wπte (RF WR) stage, and a retire (RET) stage Some mstructions utilize multiple clock cycles m the execute state For example, memory operations, floatmg pomt operations, and mteger multiply operations are illustrated m exploded form m Fig 2 Memory operations mclude an address generation (AGU) stage, a translation (TLB) stage, a data cache 1 (DCl) stage, and a data cache 2 (DC2) stage Similarly, floatmg pomt operations mclude up to four floatmg pomt execute (FEX1-FEX4) stages, and mteger multiplies mclude up to four (IM1-IM4) stages

Duπng the CAM0 and CAM1 stages, lme predictor 12 compares the fetch address provided by branch prediction fetch PC generation unit 18 to the addresses of lmes stored therem Additionally, the fetch address is translated from a virtual address (e g a lmear address in the x86 architecmre) to a physical address duπng the CAM0 and CAM1 stages (e g in ITLB 60 shown in Fig 3) In response to detectmg a hit duπng the CAM0 and CAM1 stages, the correspondmg line mformation is read from the line predictor during the lme predictor stage Also, I-cache 14 initiates a read (usmg the physical address) during the line predictor stage The read completes duπng the mstruction cache stage It is noted that, while the pipeline illustrated m Fig 2 employs two clock cycles to detect a hit m lme predictor 12 for a fetch address, other embodiments may employ a smgle clock cycle (and stage) to perform this operation Moreover, in one embodiment, lme predictor 12 provides a next fetch address for I-cache 14 and a next entry in lme predictor 12 for a hit, and therefore the CAM0 and CAM1 stages mav be skipped for fetches resulting from a previous hit m lme predictor 12 Instruction bvtes provided by I-cache 14 are aligned to decode units 24A-24D by alignment unit 16 during the alignment stage m response to the coπespondmg line mformation from lme predictor 12 Decode units 24 A- 24D decode the provided mstructions, identifying ROPs coπesponding to the mstructions as well as operand mformation during the decode stage Map unit 30 generates ROPs from the provided mformation duπng the mapl stage, and performs register renaming (updatmg future file 20) Duπng the map2 stage, the ROPs and assigned renames are recorded m retire queue 32 Furthermore, the ROPs upon which each ROP is dependent are determmed. Each ROP may be register dependent upon earlier ROPs as recorded m the future file, and may also exhibit other types of dependencies (e.g. dependencies on a previous seπalizmg mstruction, etc.)

The generated ROPs are wπtten mto scheduler 36 duπng the wπte scheduler stage. Up until this stage, the ROPs located by a particular lme of mformation flow through the pipelme as a unit. However, subsequent to be wπtten mto scheduler 36. the ROPs may flow independently through the remammg stages, at different times Generally, a particular ROP remams at this stage until selected for execution by scheduler 36 (e g. after the ROPs upon which the particular ROP is dependent have been selected for execution, as descπbed above) Accordmgly, a particular ROP may expeπence one or more clock cycles of delay between the wπte scheduler wπte stage and the read scheduler stage. Duπng the read scheduler stage, the particular ROP participates m the selecπon logic withm scheduler 36, is selected for execution, and is read from scheduler 36 The particular ROP then proceeds to read register file operations from one of register files 38A-38B (dependmg upon the type of ROP) m the register file read stage.

The particular ROP and operands are provided to the coπespondmg execution core 40A or 40B. and the mstruction operation is performed on the operands durmg the execution stage As mentioned above, some ROPs have several pipelme stages of execution. For example, memory instruction operanons (e.g. loads and stores) are executed through an address generation stage (m which the data address of the memory location accessed by the memory mstruction operation is generated), a translation stage (m which the virtual data address provided by the address generation stage is translated) and a pair of data cache stages m which D-cache 44 is accessed. Floating pomt operations may employ up to 4 clock cycles of execution, and mteger multiplies may similarly employ up to 4 clock cycles of execution.

Upon completmg the execution stage or stages, the particular ROP updates its assigned physical register duπng the register file wπte stage. Finally, the particular ROP is retired after each previous ROP is retired (m the retire stage). Agam, one or more clock cycles may elapse for a particular ROP between the register file wπte stage and the retire stage. Furthermore, a particular ROP may be stalled at any stage due to pipelme stall conditions, as is well known m the art.

Lme Predictor

Turning now to Fig 3, a block diagram illustratmg one embodiment of branch prediction fetch PC generation unit 18, lme predictor 12, I-cache 14, predictor miss decode unit 26, an mstruction TLB (ITLB) 60, an adder 62, and a fetch address mux 64 is shown Other embodiments are possible and contemplated. In the embodiment of Fig. 3, branch predic on/fetch PC generation unit 18 mcludes a branch predictor 18A, an indirect branch target cache 18B, a return stack 18C, and fetch PC generation unit 18D Branch predictor 18A and indirect branch target cache 18B are coupled to receive the output of adder 62, and are coupled to fetch PC generation unit 18D, lme predictor 12, and predictor miss decode unit 26 Fetch PC generation unit 18D is coupled to receive a trap PC from PC silo 48, and is further coupled to ITLB 60, lme predictor 12, adder 62, and fetch address mux 64 ITLB 60 is further coupled to fetch address mux 64, which is coupled to I-cache 14 Lme predictor 12 is coupled to I-cache 14. predictor miss decode unit 26, adder 62, and fetch address mux 64

Generally, fetch PC generation unit 18D generates a fetch address (fetch PC) for mstructions to be fetched The fetch address is provided to lme predictor 12, TLB 60. and adder 62 (as well as PC silo 48, as shown m Fig 1) Lme predictor 12 compares the fetch address to fetch addresses stored therein to determine if a lme predictor entry correspondmg to the fetch address exists withm lme predictor 12 If a correspondmg lme predictor entry is found, the instruction pomters stored m the lme predictor entry are provided to alignment unit 16 In parallel with lme predictor 12 searching the lme predictor entries, ITLB 60 translates the fetch address (which is a virtual address in the present embodiment) to a physical address (physical PC) for access to I-cache 14 ITLB 60 provides the physical address to fetch address mux 64, and fetch PC generation unit 18D controls mux 64 to select the physical address I-cache 14 reads mstruction bytes coπespondmg to the physical address and provides the mstruction bytes to alignment unit 16 In the present embodiment, each lme predictor entry also provides a next fetch address (next fetch PC)

The next fetch address is provided to mux 64, and fetch PC generation unit 18D selects the address through mux 64 to access I-cache 14 m response to lme predictor 12 detectmg a hit In this manner, the next fetch address may be more rapidly provided to I-cache 14 as long as the fetch addresses continue to hit m the lme predictor The lme predictor entry may also mclude an indication of the next lme predictor entry withm lme predictor 12 (correspondmg to the next fetch address) to allow line predictor 12 to fetch mstruction pomters coπespondmg to the next fetch address Accordmgly, as long as fetch addresses continue to hit m lme predictor 12. fetchmg of lmes of instructions may be initiated from the lme predictor stage of the pipelme shown m Fig 2 Traps initiated by PC silo 48 (m response to scheduler 36), a disagreement between the prediction made by lme predictor 12 for the next fetch address and the next fetch address generated by fetch PC generation unit 18D (descπbed below) and page crossmgs (descπbed below) may cause line predictor 12 to search for the fetch address provided by fetch PC generation unit 18D, and may also cause fetch PC generation unit 18D to select the coπespondmg physical address provided by ITLB 60

Even while next fetch addresses are bemg generated by lme predictor 12 and are hitting m lme predictor 12, fetch PC generation unit 18D contmues to generate fetch addresses for loggmg by PC silo 48 Furthermore, fetch PC generation unit 18D may veπfy the next fetch addresses provided by lme predictor 12 via the branch predictors 18A-18C The lme predictor entπes withm lme predictor 12 identify the terminating mstruction withm the lme of instructions by type, and lme predictor 12 transmits the type mformation to fetch PC generation umt 18D as well as the predicted direction of the terminating mstruction (branch mfo m Fig 3) Furthermore, for branches forming a target address via a branch displacement mcluded withm the branch mstruction, lme predictor 12 may provide an indication of the branch displacement For purposes of veπfymg the predicted next fetch address, the terminating mstruction may be a conditional branch mstruction, an indirect branch instruction, or a return mstruction

If the terminating instruction is a conditional branch mstruction or an indirect branch mstruction, lme predictor 12 generates a branch offset from the cuπent fetch address to the branch mstruction by examining the mstruction pomters m the lme predictor entry The branch offset is added to the current fetch address by adder 62, and the address is provided to branch predictor 18A and indirect branch target cache 18B Branch predictor 18A is used for conditional branches, and indirect branch target cache 18B is used for indirect branches

Generally, branch predictor 18A is a mechanism for predicting conditional branches based on the past behavior of conditional branches More particularly, the address of the branch mstruction is used to mdex mto a table of branch predictions (e g , two bit saturating counters which are incremented for taken branches and decremented for not-taken branches, and the most significant bit is used as a taken/not-taken prediction) The table is updated based on past executions of conditional branch instructions, as those branch mstructions are retired or become non-speculative In one particular embodiment, two tables are used (each havmg 16K entπes of two bit saturating counters) The tables are mdexed by an exclusive OR of recent branch prediction history and the least significant bits of the branch address, and each table provides a prediction A third table (compnsmg 4K entries of two bit saturating selector counters) stores a selector between the two tables, and is mdexed by the branch address directly The selector picks one of the predictions provided by the two tables as the prediction for the conditional branch mstruction Other embodiments may employ different configurations and different numbers of entπes Usmg the three table structure, aliasmg of branches havmg the same branch history and least significant address bits (but different most significant address bits) may be alleviated

In response to the address provided by adder 62, branch predictor 18A provides a branch prediction Fetch PC generation umt 18D compares the prediction to the prediction recorded m the lme predictor entry If the predictions do not match, fetch PC generation unit 18D signals (via status lmes shown m Fig 3) lme predictor 12 Additionally, fetch PC generation unit 18D generates a fetch address based on the prediction from branch predictor 18A (either the branch target address generated m response to the branch displacement, or the sequential address) More particularly, the branch target address m the x86 mstruction set architecture may be generated by addmg the sequential address and the branch displacement Other mstruction set architectures may add the address of the branch mstruction to the branch displacement In one embodiment, lme predictor 12 stores a next alternate fetch address (and alternate indication of the next lme predictor entry) in each lme predictor entry If fetch PC generation unit 18D signals a mismatch between the prediction recorded m a particular lme predictor entry and the prediction from branch predictor 18 A, lme predictor 12 may swap the next fetch address and next alternate fetch address In this manner, the lme predictor entry may be updated to reflect the actual execution of branch mstructions (recorded in branch predictor 18 A) The lme predictor is thereby tramed to match recent branch behavior, without requirmg that the lme predictor entπes be directly updated m response to branch mstruction execution

Indirect branch target cache 18B is used for indirect branch mstructions While branch mstructions which form a target address from the branch displacement have static branch target addresses (at least at the virtual stage, although page mappmgs to physical addresses may be changed), indirect branch mstructions have vanable target addresses based on register and/or memory operands Indirect branch target cache 18B caches previously generated indirect branch target addresses m a table mdexed by branch mstruction address Similar to branch predictor 18 A, indirect branch target cache 18B is updated with actually generated indirect branch target addresses upon the retirement of indirect branch target mstructions In one particular embodiment, indirect branch target cache 18B may compπse a branch target buffer havmg 128 entries, mdexed by the least significant bits of the indirect branch instruction address, a second table havmg 512 entπes mdexed by the exclusive-OR of the least significant bits of the indirect branch instruction address (bits inverted) and least significant bits of the four indirect branch target addresses most recently predicted usmg the second table The branch target buffer output is used until it mispredicts, then the second table is used until it mispredicts, etc This structure may predict indirect branch target addresses which do not change duπng execution usmg the branch target buffer, while usmg the second table to predict addresses which do change during execution.

Fetch PC generation unit 18D receives the predicted indirect branch target address from indirect branch target cache 18B, and compares the indirect branch target address to the next fetch address generated by line predictor 12. If the addresses do not match (and the coπesponding line predictor entry is terminated by an indirect branch instruction), fetch PC generation unit 18D signals line predictor 12 (via the stams lines) that a mismatched indirect branch target has been detected. Additionally, the predicted indirect target address from indirect branch target cache 18B is generated as the fetch address by fetch PC generation unit 18D. Line predictor 12 compares the fetch address to detect a hit and select a line predictor entry. I-cache 14 (through ITLB 60) fetches the instruction bytes coπesponding to the fetch address. It is noted that, in one embodiment, indirect branch target cache 18B stores linear addresses and the next fetch address generated by line predictor 12 is a physical address. However, indirect branch instructions may be unconditional in such an embodiment, and the next alternate fetch address field (which is not needed to store an alternate fetch address since the branch is unconditional) may be used to store the linear address coπesponding to the next fetch address for comparison purposes.

Return stack 18C is used to predict target addresses for return instructions. As call instructions are fetched, the sequential address to the call instruction is pushed onto the return stack as a return address. As return instructions are fetched, the most recent return address is popped from the return stack and is used as the return address for that return mstruction. Accordingly, if a line predictor entry is terminated by a return instruction, fetch PC generation unit 18D compares the next fetch address from the line predictor entry to the return address provided by return address stack 18C. Similar to the indirect target cache discussion above, if the return address and the next fetch address mismatch, fetch PC generation unit 18D signals line predictor 12 (via the stams lines) and generates the return address as the fetch address. The fetch address is searched in line predictor 12 (and translated by ITLB 60 for fetching in I-cache 14).

The above described mechanism may allow for rapid generation of fetch addresses using line predictor 12, with parallel verification of the predicted instruction stream using the branch predictors 18A-18C. If the branch predictors 18A-18C and line predictor 12 agree, then rapid instruction fetching continues. If disagreement is detected, fetch PC generation unit 18D and line predictor 12 may update the affected line predictor entries locally.

On the other hand, certain conditions may not be detected and or coπected by fetch PC generation unit 18D. Predictor miss decode unit 26 may detect and handle these cases. More particularly, Predictor miss decode unit 26 may decode instruction bytes when a miss is detected in line predictor 12 for a fetch address generated by fetch PC generation unit 18D, when the next line predictor entry indication within a line predictor is invalid, or when the instruction pointers within the line predictor entry are not valid. For the next line predictor indication being invalid, predictor miss decode unit 26 may provide the next fetch address as a search address to line predictor 12. If the next fetch address hits, an indication of the coπesponding line predictor entry may be recorded as the next line predictor entry indication. Otherwise, predictor miss decode unit 26 decodes the coπesponding instruction bytes (received from alignment unit 12) and generates a line predictor entry for the instructions. Predictor miss decode unit 26 communicates with fetch PC generation unit 18D (via the line predictor update bus shown in Fig. 3) during the generation of line predictor entries.

More particularly, predictor miss decode unit 26 may be configured to access the branch predictors 18A- 18C when terminating a line predictor entry with a branch instruction. In the present embodiment, predictor miss decode unit 26 may provide the address of the branch mstruction to fetch PC generation unit 18D, which may provide the address as the fetch PC but cancel access to lme predictor 12 and ITLB 60 In this manner, the address of the branch instruction may be provided through adder 62 (with a branch offset of zero) to branch predictor 18 A and mdirect branch target cache 18B) Alternatively, predictor miss decode unit 26 may directly access branch predictors 18A-18D rather than providmg the branch mstruction address to fetch PC generation unit 18D The coπespondmg prediction mformation may be received by predictor miss decode unit 26 to generate next fetch address mformation for the generated lme predrctor entry For example, rf the lme predictor entrv is termmated by a conditional branch mstruction, predictor miss decode unit 26 may use the branch prediction provided by branch predictor 18A to determine whether to use the branch target address or the sequential address as the next fetch address The next fetch address may be received from mdirect branch target cache 18B and may be used as the next fetch address if the lme is termmated by an mdirect branch mstruction The return address may be used (and popped from return stack 18C) if the lme is termmated by a return mstruction

Once the next fetch address is determmed for a lme predictor entry, predrctor miss decode unit 26 may search lme predictor 12 for the next fetch address If a hit is detected, the hitting lme predictor entry is recorded for the newly created lme predrctor entry and predrctor miss decode unrt 26 may update lme predrctor 12 wrfh the new entry If a miss is detected, the next entry to be replaced m lme predrctor 12 may be recorded m the new entry and predictor miss decode unit 26 may update lme predictor 12 In the case of a miss, predictor miss decode unit 26 may continue to decode mstructions and generate lme predrctor entπes untrl a hrt in lme predictor 12 is detected In one embodiment, lme predrctor 12 may employ a first- rn, first-out replacement policy for lme predrctor entπes, although any suitable replacement scheme may be used

It is noted that, m one embodrment, I-cache 14 may provrde a fixed number of mstruction bytes per instruction fetch, beginning with the mstruction byte located by the fetch address Smce a fetch address may locate a byte anywhere wtthrn a cache lme, I-cache 14 may access two cache lmes m response to the fetch address (the cache lme mdexed by the fetch address, and a cache lme at the next mdex m the cache) Other embodiments may limit the number of mstruction bytes provided to up to a fixed number or the end of the cache lme, whrchever comes first In one embodrment, the fixed number rs 16 although other embodrments may use a fixed number greater or less than 16 Furthermore, m one embodrment, I-cache 14 is set-associative Set-assocrative caches provide a number of possrble storage locations for a cache lme identified by a particular address Each possrble storage location rs a "way" of the set-associative cache For example, rn one embodrment, I-cache 14 may be 4 way set-assocrative and hence a particular cache lme may be stored m one of 4 possrble storage locations Set- assocrative caches thus use two mput values (an mdex deπved from the fetch address and a wav determined by comparing tags m the cache to the remaining portion of the fetch address) to provrde output bvtes Rather than awart the completion of tag compansons to determine the way, lme predictor 12 may store a way prediction (provided to I-cache 14 as the way prediction shown in Fig 3) The predicted way may be selected as the output, and the predicted way may be subsequently venfied via the tag compaπsons If the predicted way is incorrect, I- cache 14 may search the other ways for a hit The hitting way may then be recorded m lme predictor 12 Way prediction may also allow for power savmgs by only activatmg the portion of the I-cache memory compnsmg the predicted way (and leavmg the remammg memory coπesponding to the unpredrcted ways idle) For embodiments m which two cache lmes are accessed to provide the fixed number of bytes, two way predictions may be provided by lme predictor 12 for each fetch address.

It is further noted that processor 10 may support a mode m which lme predictor 12 and the branch predictors are disabled. In such a mode, predictor miss decode unit 26 may provide mstructions to map umt 30. Such a mode may be used for debuggmg, for example. As used herem, a branch mstruction is an mstruction which may cause the next mstruction to be fetched to be one of two addresses: the branch target address (specified via operands of the mstruction) or the sequential address (which is the address of the mstruction rmmedrately subsequent to the branch mstruction m memory). It is noted that the term "control transfer mstruction" may also be used m this manner Conditional branch mstructions select one of the branch target address or sequential address by testing an operand of the branch mstruction (e.g. condition flags). An unconditional branch mstruction, by contrast, always causes mstruction fetchmg to continue at the branch target address. Indrrect branch mstructions, which may generally be conditional or unconditional, generate their branch target address usmg at least one non-immediate operand (regrster or memory operands). As opposed to direct branch mstructions (which generate their targets from rmmediate data such as a branch drsplacement mcluded wrthm the branch mstruction), indrrect branch mstructions have a branch target address whrch rs not completely determrnable until the operands are fetched (from regrsters or memory) Finally, return mstructions are mstructions which have a branch target address coπespondmg to the most recently executed call instruction. Call mstructions and return mstructions may be used to branch to and from subroutines, for example.

As used herem, an "address" is a value which identifies a byte withm a memory system to which processor 10 is couplable. A "fetch address" is an address used to fetch mstruction bytes to be executed as mstructions withm processor 10. As mentioned above, processor 10 may employ an address translation mechanism m which virtual addresses (generated m response to the operands of mstructions) are translated to physical addresses (which physically identify locations m the memory system). In the x86 mstruction set architecture, virtual addresses may be lmear addresses generated accordmg to a segmentation mechamsm operating upon logical addresses generated from operands of the mstructions. Other mstruction set architectures may define the virtual address differently. Turning next to Fig 4, a block diagram of one embodrment of lme predrctor 12 is shown. Other embodiments are possible and contemplated. In the embodiment of Fig. 4, lme predictor 12 mcludes a PC CAM 70, an mdex table 72, control circurt 74, an mdex mux 76, a way predrctron mux 78, and a next fetch PC mux 80. Control circuit 74 is coupled to PC CAM 70, mdex table 72, muxes 76, 78, and 80, fetch PC generation unit 18D, predictor miss decode unit 26, and adder 62. PC CAM 70 is further coupled to predictor miss decode unit 26, fetch PC generation unit 18D, and muxes 76 and 78. Index table 72 is further coupled to muxes 76, 78, and 80, alignment unit 16, fetch PC generation unit 18D, and predictor miss decode unit 26.

Generally, the embodiment of lme predrctor 12 illustrated m Fig 4 mcludes two memoπes for stormg lme predictor entries. The first memory is PC CAM 70, whrch is used to search for fetch addresses generated by fetch PC generation unit 18D. If a hit is detected for a fetch address, PC CAM 70 provrdes an mdex (LP mdex m Frg. 4) mto mdex table 72 (the second memory). Index table 72 stores the lme predictor information for the lme predrctor entry, mcludmg mstruction alrgnment mformation (e.g. mstruction pomters) and next entry mformation. In response to the mdex from PC CAM 70, mdex table 72 provrdes an output lme predrctor entry 82 and a next mdex for mdex table 72. The next mdex selects a second entry withm mdex table 72, which provides: (l) mstruction alignment mformation for the mstructions fetched by the next fetch address; and (rr) yet another next fetch address. Lme predictor 12 may then continue to generate next fetch addresses, alrgnment information, and a next mdex from mdex table 72 until (I) a next mdex is selected which is mvalid (l e does not pomt to a next entry m mdex table 72), (n) stams srgnals from fetch PC generation unit 18D mdicate a redirection (due to trap, or a predrction by the branch predictors which disagrees with the prediction recorded m the mdex table, etc ), or (in) decode units 24A- 24D detect mcoπect alignment mformation provided by lme predictor 12

Viewed m another way, the next mdex stored m each lme predictor entry is a link to the next lme predictor entry to be fetched As long as the next link is valid, a check that the fetch address hits m PC CAM 70 (identifying a coπespondmg entry withm mdex table 72) may be skipped Power savmgs may be achieved by keepmg PC CAM 70 idle durmg clock cycles that the next mdex is bemg selected and fetched More particularly, control circuit 74 may keep PC CAM 70 m an idle state unless fetch PC generation unit 18D mdicates a redirection to the fetch PC generated by fetch PC generation unit 18D, a search of PC CAM 70 is bemg initiated by predictor miss decode unit 26 to determrne a next mdex, or control crrcurt 74 rs updating PC CAM 70

Control crrcurt 74 controls mdex mux 76 to select an mdex for mdex table 72 If PC CAM 70 rs bemg searched and a hrt rs detected for the fetch address provided by fetch PC generation unit 18D. control circuit 74 selects the mdex provrded by PC CAM 70 through mdex mux 76 On the other hand, rf a lme predictor entry has been fetched and the next mdex is valid m the lme predictor entry, control circuit 74 selects the next mdex provided by mdex table 72 Still further, rf the branch predrctron stored m a particular lme predictor entry disagrees with the branch prediction from the branch predictors or an update of mdex table 72 is to be performed, control circuit 74 provides an update mdex to index mux 76 and selects that mdex through mdex mux 76 In embodiments employmg way prediction, a way misprediction (detected by I-cache 14 by comparing the tag of the predrcted way to the coπespondmg fetch address) may result m an update to coπect the way predrctrons

If a mrss occurs m either PC CAM 70 or mdex table 72, lme predictor miss decode unrt 26 may decode the mstruction bytes fetched m response to the missing fetch address and provrde lme predrctor entires vra the lme predrctor update lmes shown m Frgs 3 and 4 Control crrcurt 74 recerves signals from the lme predictor update lmes tndicatrng the type of update bemg provrded (PC CAM, mdex table, or both) and selects an entry m the coπespondmg memones to store the updated entries In one embodiment, control crrcurt 74 employs a FIFO replacement scheme within PC CAM 70 and index table 72 Other embodiments may employ different replacement schemes, as desired If mdex table 72 is bemg updated, control crrcurt 74 provrdes the update mdex to mdex mux 76 and selects the update mdex Control circuit 74 also provides an indication of the entry bemg updated to PC CAM 70 if PC CAM 70 is bemg updated

Additionally, control crrcurt 74 may provrde an update mdex to update a lme predrctor entrv m mdex table 72 rf the branch predrctron for the lme predrctor entry disagrees with the branch predictors 18A-18C Fetch PC generation unit 18D mdicates, via the stams lines, that a prediction disagreement has occuπed Control crrcurt 74 captures the lme predrctor entries read from mdex table 72, and may modrfy predrction mformation n response to the stams srgnals and may update mdex table 72 with the mformation These updates are illustrated m the timmg diagrams below and will be discussed m more detail then

Predictor miss decode unit 26 may be configured to search PC CAM 70 for the next fetch address bemg assrgned to a lme predrctor entry bemg generated therein, m order to provide the next mdex (wtth n mdex table 72) for that lme predictor entry Predictor miss decode unit 26 may provide the next fetch address usmg the lme predictor update lmes, and may receive an indication of the hit miss for the search (hit/mrss lmes) and the LP mdex from the hrtting entry (provided by control circuit 74 on the lme predrctor update lmes) Alternatively, control crrcurt 74 may retatn the LP mdex from the hitting entry and use the mdex as the next mdex when updating the entry m mdex table 72 Generally, PC CAM 70 compπses a plurality of entπes to be searched by a fetch address (from fetch PC generation unit 18D, or from predictor miss decode unit 26 for trainmg lme predictor entπes) An exemplary PC CAM entry is shown below m Fig 5 Similarly, mdex table 72 compπses a plurality of entπes (refeπed to herem as lme predictor entπes) which store alignment mformation (e g mstruction pomters), next fetch information, and control mformation regardmg the termination of the entry An exemplary lme predrctor entry rs shown m Frgs 6, 7, and 8 below Index table 72 provrdes the next mdex from the lme predrctor entry to mdex mux 76 (as descπbed above) and further provides the entry (mcludmg the next mdex) as output lme predrctor entry 82 The output lme predrctor entry 82 rs provrded to control crrcurt 74, and portions of the output lme predrctor entry 82 are shown separated m Frg 4 to be provrded to vaπous other portions of processor 10

More particularly, the mstruction pomters stored m the entry are provided to alignment umt 16, which associates the mstruction pomters with the coπesponding mstruction bytes and aligns the mstruction bytes response thereto Additionally, mformation regardmg the terminating mstruction identified by the lme predictor entry (e g whether or not it is a branch, the type of branch if it is a branch, etc ) is transmitted to fetch PC generation unit 18D (branch mfo m Figs 3 and 4) The mformation may be used to determine which of the branch predictors is to veπfy the branch prediction m the lme predrctor Addrtronally, the branch mformation may mclude an indication of the branch displacement and the taken/not taken prediction from the entry, as descπbed above

The next fetch address from the entry is provided to next fetch PC mux 80, and may be selected by control circuit 74 through next fetch PC mux 80 to be provided to I-cache 14 Additionally, control crrcurt 74 provides an mput to next fetch PC mux 80 Control crrcurt 74 may provide the next fetch address m cases m whrch the branch prediction stored m a lme predictor entry disagrees with branch predictors 18A-18C The next fetch address provided by control circuit 74 may be the next alternate fetch address from the affected entry (and control crrcurt 74 may also update the affected entry)

Lme predrctor entry 82 also mcludes way predictions coπespondmg to the next fetch address (as descπbed above, although other embodrments may not employ way predrctions, as desrred) The way predrctrons are provrded to way predrctron mux 78 Addrtionally, way predictions for a fetch address searched m PC CAM 70 are provided by PC CAM 70 as the other mput to way prediction mux 78 Control crrcurt 74 selects the way predictions from PC CAM 70 if a fetch address is searched m PC CAM 70 and hits Otherwise, the way predictions from lme predictor entry 82 are selected The selected way predictions are provided to I-cache 14 It is noted that I-cache 14 may venfy the way predictions by performing a tag compaπson of the fetch address to the predrcted way If a way predrction rs found to be rncoπect, I-cache 14 is reaccessed with the fetch address to determine the coπect way and fetch the coπect mstruction bytes Additionally, lme predictor 12 is updated to coπect the way prediction

Control crrcurt 74 rs further configured to generate the branch offset for adder 62 from the mformation m the lme predictor entry More particularly, control circuit 74 determines whrch of the mstruction pomters identifies the last valid mstruction withm the lme predictor entry, and generates the branch offset from that mstruction pomter For example, the mstruction pomter may be an offset, and hence control circuit 74 may select the rnstniction pomter coπespondmg to the terminating mstruction as the branch offset Alternatively, the mstruction pomters may be lengths of the rnstructions The mstruction pomters of each instruction pnor to the terminating mstruction may be added to produce the branch offset In one particular embodrment, PC CAM 70 may compπse a content addressable memory (CAM) and mdex table 72 may compπse a random access memory (RAM) In a CAM, at least a portion of each entry m the memory rs coupled to a comparator wtthrn the CAM whrch compares the portion to an mput value, and rf a match is detected a hit signal is asserted by the CAM Additionally, if only a portion of the entry is compared, the remamder of the hitting entry may be provided as an output In the embodrment shown, the portion of the entry compared may be the stored fetch addresses and the remamder may be the way predrctions and LP mdex In one particular embodiment, only a portion of the fetch address may be compared m the CAM For example, a plurality of least srgnrficant bits of the fetch address may be compared Such an embodrment allows alrasmg of certain fetch addresses which have the same least significant bits but differ in the most significant bits Accordmgly, the number of bits compared may be selected as a trade-off between the amount of allowable aliasmg and the amount of power expended in performing the comparisons (since each entry is compared to the mput value concuπently) The process of accessmg a CAM wrth a value and performing the compaπsons to the stored values is refeπed to herem is "camming" On the other hand, a RAM selects an entry by decodmg an mput value (e g an mdex) and provrdes the selected entry as an output

As used herem. an entry m a memory is one location provided by the memory for stonng a type of rnformation A memory compπses a plurality of the entπes, each of which may be used to store mformation of the designated type Furtheπnore, the term control crrcuit is used herem to refer to any combmation of crrcurtry (e g combmatoπal logic gates, data flow elements such as muxes, regrsters, latches, flops, adders, shrfters, rotators, etc , and/or circuits implementing state machines) whrch operates on mputs and generates outputs m response thereto as descπbed It rs noted that, while the embodrment of Fig 4 shows two memoπes, other embodiments may implement a smgle memory wrthtn lme predrctor 12 The memory may mclude a CAM portion to be searched m response to the fetch address, and a RAM portion whrch stores the coπespondmg lme predrctor entry The lme predrctor entnes may provrde a next fetch address whrch may be cammed agamst the memory to find the next hrt (or a next mdex tdentrfymg the next entry) It is further noted that one or both of the CAM portion and the RAM portion may be banked to conserve power For example, 8 banks could be used In such an embodrment, the least srgnificant 3 bits of the fetch address may select the bank, and the remamder of the address may be cammed

The discussion herem may occasionally refer to "misses" m lme predictor 12 For the embodiment of Fig 4, a lme predictor miss may be a truss m PC CAM 70, or a hit m PC CAM 70 but the coπespondmg lme predrctor entry mcludes mvalrd alignment mformation Additionally, a next mdex may be mvalid, and the next fetch address may be considered to be a miss in lme predictor 12

Turning now to Fig 5, a diagram illustrating an exemplary entry 90 for PC CAM 70 is shown Other embodiments of PC CAM 70 may employ entries 90 mcludmg more mformation, less mformation, or substitute mformation to the mfoπnation shown m the embodiment of Fig 5 In the embodiment of Fig 5. entry 90 mcludes a fetch address field 92, a lme predictor mdex field 94, a first way predrctron field 96, and a second way predrction field 98.

Fetch address field 92 stores the fetch address locating the first byte for which the information m the coπespondmg lme predictor entry is stored. The fetch address stored m fetch address field 92 may be a vrrtual address for companson to fetch addresses generated by fetch PC generation unrt 18D For example, embodiments of processor 10 employmg the x86 mstruction set architecture, the virtual address may be a lmear address. As mentioned above, a least significant portion of the fetch address may be stored m fetch address field 92 and may be compared to fetch addresses generated by fetch PC generation unit 18D For example, m one particular embodrment, the least stg ficant 18 to 20 brts may be stored and compared.

A coπespondmg lme predrctor entry withm mdex table 72 is identified by the mdex stored m lme predictor mdex field 94 Furthermore, way predictions coπespondmg to the fetch address and the address of the next sequential cache lme are stored m way predrctron fields 96 and 98, respectively

Turning next to Frg. 6, an exemplary lme predrctor entry 82 is shown. Other embodiments of mdex table 72 may employ entπes 82 mcludmg more mformation, less mformation, or substitute mformation to the mformation shown m the embodrment of Frg 6 In the embodrment of Frg 6, lme predrctor entry 82 mcludes a next entry field 100, a pluralrty of mstruction pomter fields 102-108, and a control field 110

Next entry field 100 stores mformation identify mg the next lme predictor entry to be fetched, as well as the next fetch address. One embodiment of next entry field 100 is shown below (Fig. 7). Control field 110 stores control mformation regardmg the lme of mstructions, mcludmg mstruction termination mformation and any other information which may be used with the lme of mstructions One embodiment of control field 110 is illustrated m Fig. 8 below

Each of mstruction pomter fields 102-108 stores an mstruction pomter for a coπespondmg decode unit 24A-24D. Accordmgly, the number of mstruction pomter fields 102-108 may be the same as the number of decode units provided withm vanous embodrments of processor 10 Vrewed m another way, the number of mstruction pomters stored m a lme predictor entry may be the maximum number of mstructions which may be concuπently decoded (and processed to the schedule stage) by processor 10 Each mstruction pomter field 102-108 directly locates an mstruction wrthin the rnstructron bytes (as opposed to predecode data, which is stored on a byte basis and must be scanned as a whole before any mstructions can be located) In one embodiment, the mstruction pomters may be the length of each mstruction (which, when added to the address of the mstruction, locates the next mstruction) A length of zero may mdicate that the next mstruction is mvalid. Alternatively, the mstruction pomters may compπse offsets from the fetch address (and a valrd brt to mdrcate valrdrty of the pomter). In one specrfrc embodrment, mstruction pomter 102 (which locates the first mstruction withm the mstruction bytes) may compπse a length of the mstruction, and the remammg mstruction pomters may compπse offsets and valrd brts.

In one embodrment, microcode unit 28 is coupled only to decode umt 24D (which coπesponds to mstruction pomter field 108). In such an embodiment, rf a lme predrctor entry mcludes an MROM mstruction, the MROM mstruction is located by instruction pointer field 108 If the lme of mstructions mcludes fewer than the maximum number of mstructions, the MROM rnstructron rs located by mstruction pomter field 108 and one or more of the mstruction pomter fields 102-106 are mvalid Alternatively, the MROM mstruction may be located by the appropnate mstruction pomter field 102-108 based on the number of mstructions m the lme, and the type field 120 (shown below) may mdicate that the last mstruction is an MROM mstruction and thus rs to be alrgned to decode unit 24D

Turning now to Fig. 7, an exemplary next entry field 100 is shown. Other embodiments of next entry field 100 may employ more information, less mformation, or substitute rnformation to the rnformation shown m the embodrment of Fig. 7. In the embodiment of Fig 7, next entry field 100 compπses a next fetch address field 112, a next alternate fetch address field 114, a next mdex field 116, and a next alternate mdex field 118.

Next fetch address field 112 stores the next fetch address for the lme predictor entry. The next fetch address is provided to next fetch address mux 80 m Fig 4, and rs the address of the next mstructions to be fetched after the lme of mstructions m the cuπent entry, accordmg to the branch prediction stored in the lme predictor entry. For lmes not termmated with a branch mstruction, the next fetch address may be the sequential address to the termmatmg mstruction. The next mdex field 116 stores the mdex wrthm mdex table 72 of the lme predrctor entry coπespondmg to the next fetch address (r.e. the lme predictor entry stormg mstruction pomters for the mstructions fetched m response to the next fetch address)

Next alternate fetch address field 114 (and the coπespondmg next alternate mdex field 118) are used for lmes which are termmated by branch mstructions (particularly conditional branch mstructions). The fetch address (and coπespondmg lme predrctor entry) of the non-predrcted path for the branch rnstructron are stored m the next alternate fetch address field 114 (and the next alternate index field 118) In this manner, if the branch predictor 18A disagrees with the most recent prediction by line predictor 12 for a conditional branch, the alternate path may be rapidly fetched (e.g. without resorting to predrctor miss decode unit 26). Accordmgly, if the branch is predicted taken, the branch target address is stored m next fetch address field 112 and the sequential address rs stored m next alternate fetch address field 114. On the other hand, if the branch is predicted not taken, the sequential address is stored m next fetch address field 112 and the branch target address rs stored m next alternate fetch address field 114. Coπespondmg next mdexes are stored as well m fields 116 and 118

In one embodiment, next fetch address field 112 and next alternate fetch address field 114 store physrcal addresses for addressrng I-cache 14. In this manner, the time used to perform a virtual to physical address translation may be avoided as lmes of mstructions are fetched from line predictor 12. Other embodrments may employ virtual addresses m these fields and perform the translations (or employ a virtually tagged cache). It is noted that, m embodiments employmg a smgle memory withm lme predictor 12 (mstead of the PC CAM and mdex table), the mdex fields may be eliminated smce the fetch addresses are searched m the lme predictor. It is noted that the next fetch address and the next alternate fetch address may be a portion of the fetch address. For example, the m-page portions of the addresses may be stored (e.g. the least significant 12 bits) and the full address may be formed by concatenatmg the cuπent page to the stored portion

Turning next to Frg. 8, an exemplary control field 110 is shown. Other embodiments of control field 110 may employ more mformation, less mformation, or substitute mformation to the mformation shown m the embodrment of Frg. 8 In the embodrment of Frg 8, control field 110 mcludes a last rnstructron type field 120, a branch prediction field 122, a branch displacement field 124, a continuation field 126, a first way prediction field 128, a second way prediction field 130, and an entry pomt field 132

Last rnstructron type field 120 stores an tndrcatton of the type of the last mstruction (or termmatmg instruction) withm the lme of mstructions. The type of mstruction may be provided to fetch PC generation unit 18D to allow fetch PC generation unit 18D to determine which of branch predictors 18A-18C to use to veπfy the branch prediction withm the lme predictor entrv More particularly, last struction type field 120 may mclude encodmgs indicating sequential fetch (no branch), microcode mstruction, conditional branch instruction, mdirect branch instruction, call mstruction, and return mstruction The conditional branch mstruction encodmg results m branch predictor 18A bemg used to veπfy the direction of the branch prediction The indrrect branch mstruction encodmg results m the next fetch address bemg veπfied agamst indirect branch target cache 18B The return mstruction encodmg results m the next fetch address being veπfied agamst return stack 18C

Branch predrction field 122 stores the branch predrctron recorded by lme predrctor 12 for the branch rnstructron termmatmg the lme (if any) Generally, fetch PC generation unit 18D veπfies that the branch prediction m field 122 matches (m terms of taken not taken) the prediction from branch predictor 18A In one embodiment, branch prediction field 122 may compπse a bit with one binary state of the bit mdicatmg taken (e g bmary one) and the other bmary state rndtcating not taken (e g bmary zero) If the prediction disagrees with branch predictor 122, the prediction may be switched In another embodiment, branch prediction field 122 may compπse a saturating counter with the bmary state of the most significant bit mdicatmg taken/not taken If the taken not taken prediction disagrees with the prediction from branch predictor 18A, the saturating counter is adjusted by one m the drrection of the predrctron from branch predictor 18 A (e mcremented if taken, decremented if not taken) The saturating counter embodrment may more accurately predict loop mstructions, for example, m which each N-l taken iterations (where N is the loop count) is followed by one not taken iteration

Branch displacement field 124 stores an indication of the branch displacement coπespondmg to a direct branch mstruction In one embodiment, branch displacement field 124 may compπse an offset from the fetch address to the first byte of the branch displacement Fetch PC generation unit 18D may use the offset to locate the branch displacement withm the fetched mstruction bytes, and hence may be used to select the displacement from the fetched mstruction bytes In another embodiment, the branch displacement may be stored in branch displacement field 124, which may be directly used to determine the branch target address

In the present embodrment, the mstruction bytes represented by a lme predictor entry may be fetched from two consecutive cache lmes of mstruction bytes Accordmgly, one or more bytes may be m a different page than the other mstruction bytes Contmuation field 126 is used to signal the page crossmg, so that the fetch address coπespondmg to the second cache lme may be generated and translated Once a new page mappmg is avarlable, other fetches withm the page have the coπect physical address as well The mstruction bytes m the second page are then fetched and merged with the rnstructron bytes within the first page Contmuation field 126 may compπse a bit indicative, m one bmary state, that the lme of rnstructtons crosses a page boundary, and indicative, m the other bmary state, that the lme of instructions does not cross a page boundary Contmuation field 126 may also be used to signal a branch target address which is m a different page than the branch mstruction

Srmrlar to way predrctron fields 96 and 98, way predrctron fields 128 and 130 store the way predrctions coπespondmg to the next fetch address (and the sequential address to the next fetch address) Finally, entry pomt field 132 may store an entry pomt for a mtcrocode instruction withm the lme of mstructions (if any) An entry pomt for microcode mstructions is the first address withm the microcode ROM at which the microcode routine coπespondmg to the microcode mstruction is stored If the lme of mstructions mcludes a microcode mstruction, entry pomt field 132 stores the entry pomt for the mstruction Smce the entry pomt is stored, decode unit 24D may omit entry pomt decode hardware and mstead directly use the stored entry pomt The time used to decode the microcode instruction to determine the entry pomt may also be eltminated duπng the fetch and drspatch of the instruction, allowing for the microcode routme to be entered more rapidly. The stored entry pomt may be veπfied against an entry pomt generated m response to the instruction (by decode unit 24D or MROM unit 28).

Turning now to Frg. 9, a table 134 illustrating termination conditions for a lme of mstructions accordmg to one embodiment of processor 10 is shown Other embodiments are possible and contemplated. In creating a lme predictor entry by decodmg mstructions, lme predictor miss decode unit 26 terminates the lme (updating lme predictor 12 with the entry) m response to detectmg any one of the lme termination condrtions lrsted m Frg. 9.

As table 134 illustrates, a lme is termmated m response to decodmg either a microcode mstruction or a branch mstruction. Also, if a predetermined maxrmum number of mstructions have been decoded (e.g. four m the present embodiment, matchmg the four decode units 24A-24D), the lme rs termmated In determining the maxrmum number of mstructions decoded, mstructions which generate more than two mstruction operations (and which are not microcode mstructions, which generate more than four mstruction operations) are counted as two instructions. Furthermore, a lme rs termmated if a predetermined maximum number of mstruction bytes are decoded (e.g. 16 bytes m the present embodiment, matching the number of bytes fetched from I-cache 14 duπng a clock cycle). A lme is also terminated if the number of rnstructron operatrons generated by decodmg mstructions withm the lme reaches a predefined maxrmum number of rnstructron operations (e.g. 6 m the present embodiment) Moreover, a lme is termmated if a page crossmg is detected while decodmg an mstruction wrthm the lme (and the continuation field rs set). Ftnally, the lme rs terminated if the instructions withm the lme update a predefined maximum number of destmation registers. This termination condition is set such that the maxrmum number of register renames that map unit 30 may assign duπng a clock cycle is not exceeded. In the present embodrment, 4 renames may be the maxrmum.

Viewed m another way, the termination conditions for predictor miss decode umt 26 m creating lme predictor entπes are flow control condrtions for lme predrctor 12. In other words, lme predrctor 12 rdentrfres a lme of mstructions m response to each fetch address. The lme of mstructions does not violate the conditions of table 134, and thus is a lme of mstruction that the hardware wrthm the ptpelme stages of processor 10 may be desrgned to handle. Difficult- to-handle combmations, which might otherwrse add significant hardware (to provide concuπent handlrng or to provrde stalltng and separation of the mstructions flowing through the pipelme) may be separated to different lmes in lme predictor 12 and thus, the hardware for controlling the pipelme m these circumstances may be elrminated. A lme of mstructions may flow through the pipeline as a unit. Although pipelme stalls may still occur (e.g. if the scheduler is full, or rf a microcode routme rs bemg drspatched, or rf map unrt 30 does not have rename registers avarlable), the stalls hold the progress of the instructions as a unit Furthermore, stalls are not the result of the combrnation of rnstructtons withm any particular lme. Pipelme control may be srmplrfied In the present embodrment, lme predrctor 12 rs a flow control mechanism for the pipelme stages up to scheduler 36. Accordmgly, one microcode unit is provided (decode unit 24D and MROM unit 28). branch prediction/fetch PC generation umt 18 is configured to perform one branch prediction per clock cycle, a number of decode units 24A-24D is provided to handle the maximum number of mstructions, I-cache 14 delivers the maximum number of mstruction bytes per fetch, scheduler 36 receives up to the maximum number of mstruction operations per clock cycle, and map unit 30 provides up to the maximum number of rename registers per clock cycle. Timmg Diagrams

Turning next to Figs 10-21, a set of trmrng dragrams are shown to illustrate operation of one embodrment of lme predrctor 12 wrthm the mstruction processmg pipelme shown m Fig 2 Other embodiments of lme predictor 12 may operate withm other pipelines, and the number of pipelme stages may vary from embodrment to embodrment If a lower clock frequency is employed, stages may be combmed to form fewer stages

Generally, each timmg diagram illustrates a set of clock cycles delimited by vertical dashed lmes, wrth a label for the clock cycle above and between (horizontally) the vertical dashed lmes for that clock cycle Each clock cycle will be refeπed to with the coπespondmg label The pipelme stage labels shown m Fig 2 are used m the timmg diagrams, with a subscπpt used to designate different lmes fetched from lme predrctor 12 (e g a subscπpt of zero refers to a first lme, a subscπpt of 1 refers to a second lme predrcted by the first lme. etc ) While the subscπpts may be shown m increasing numeπcal order, this order is mtended to mdicate that fetch order and not the particular entπes withm mdex table 72 which store the lme predictor entπes Generally, the lme predictor entπes may be randomly located withm mdex table 72 with respect to their fetch order Instead, the order is determmed by the order m which the entires are created Various operations of interest may be illustrated m the timmg diagrams as well, and these operations are descπbed with respect to the coπespondmg timmg diagram

Fig 10 illustrates the case m which fetches are hitting in lme predictor 12 and branch predictions are agreeing with the branch predictions stored m the line predictor for conditional branches and rndrrect branches Frg 13 rllustrates the case m which a return mstruction predrctron agrees wrth return stack 18C Figs 11, 12, and 14 illustrate conditions m which lme predictor 12 and branch prediction fetch PC generation unit 18 handle the trainmg of lme predictor entrres Fig 15 illustrates the use of the contmuation field for page crossmgs Frgs 16-18 rllustrate vanous conditions which cause predictor miss decode umt 26 to initiate generation of a lme predictor entry Figs 19 and 20 illustrate generation of a line predictor entry termmatmg m a non-branch type mstruction (e g a microcode mstruction or a non-branch mstruction) and a branch mstruction, respectively Fig 21 illustrates the training of both target (or taken) and sequential (or not taken) paths for a branch mstruction It is noted that each timmg diagram illustrates the first lme fetched (subscript 0) beginning with the lme predictor (LP) stage The first lme fetched may be the result of camming a fetch address, a valid next mdex field, or a next alternate fetch mdex field followmg a branch predrctor drsagreement

Each trmrng diagram will next be individually described Fig 10 illustrates fetchmg of several lme predictor entπes withm a predrcted mstruction stream Line 0 is termmated by a conditional branch, and is fetched from lme predictor 12 duπng clock cycle CLK1 The next mdex of lme 0 mdicates lme 1 (arrow 140), and lme 1 is fetched from the lme predictor duπng clock cycle CLK2 Similarly, line 1 further mdicates lme 2 (arrow 142), and lme 2 is fetched from the lme predictor durmg clock cycle CLK3 Lme 2 further mdtcates lme 3 (arrow 144), and lme 3 rs fetched from the lme predrctor duπng clock cycle CLK4 Each lme proceeds through subsequent stages duπng subsequent clock cycles as rllustrated m Fig 10 Aπows srmilar to arrows 140-144 are used throughout the timmg diagrams to mdicate that a lme predictor entry identifies the next lme predictor entry vra the next mdex field

Smce lme 0 rs termmated by a conditional branch, control crrcurt 74 generates the branch offset coπespondmg to the predrcted branch mstruction from the coπesponding mstruction pomter and provrdes the offset to adder 62, whrch adds the offset to the fetch address provrded by fetch PC generation umt 18D (aπow 146) The resultmg branch mstruction address rs provrded to branch predictor 18A, whrch selects a branch predrctron (aπow

72 148) Fetch PC generation unrt 18D compares the branch predrctron from branch predrctor 18A (m response to the branch rnformation received from lme predictor 12 mdicatmg that a conditional branch terminates the lme), and determines that the predictions agree (aπow 150) Fetch PC generation unit 18D provides stams on the status lmes to lme predrctor 12 rndrcating that the predrctron is coπect Accordmgly, fetchmg continues as directed by the next mdex fields It rs noted that, srnce the branch predrction for lme 0 rs not venfied until clock cycle CLK3, the fetches of lmes 1 and 2 are speculative and may be cancelled rf the predrctions are found to disagree (as rllustrated m Frg 11, for example) Venfymg the predrction for a lme terminated m an mdirect branch mstruction may be srmilar to the timmg of Frg 11 , but fetch PC generatron unrt 18D may veπfy the branch target address agamst mdirect branch target cache 18B mstead of the branch prediction agamst branch predictor 18A (agam, m response to the branch information mdicatmg a mdirect branch) In embodiments m which mdirect branch mstructions are condrtronal, both veπficatrons may be performed

By way of contrast, Frg 13 rllustrates a case in which lme 0 is termmated by a return mstruction Smce return mstructions select the return address coπespondmg to the most recent call mstruction and return stack 18C is a stack of return addresses with the most recent return address provided from the top of return stack 18C, fetch PC generatron unrt 18D compares the most recent return address to the next fetch address generated by lme predrctor 12 (aπow 152) In the example of Fig 13, the return address and next fetch address match, and fetch PC generation unrt 18D returns stams to lme predrctor 12 mdicatmg that the predrction rs coπect Accordmgly, only lme 1 is fetched speculatively with respect to the verification of lme 0's branch prediction

Returmng to Frg 11, a case m whrch the condrtronal branch predrction from branch predrctor 18A drsagrees wrth the branch predrction wrthm the line predictor rs shown In this example, lme 0 mdicates a first taken path mdex (subscπpt tl) is the next mdex which further mdicates a second taken path mdex (subscπpt t2) Both taken path fetches are speculative Similar to the example of Fig 10, the branch offset rs added to the fetch address and branch predrctor 18A produces a branch predrctron (aπows 146 and 148) However, m Frg 11, the fetch PC generation unrt 18D determines that the predrctron from branch predrctor 18A disagrees with the prediction from lme 0 (r e branch predictor 18 A predicts not taken and lme 0 predicts taken — aπow 154) Fetch PC generation unit 18D returns a stams of misprediction to lme predrctor 12

Control circuit 74 records the next alternate mdex and next alternate fetch address from lme 0 durmg clock cycle CLK1 In response to the mrspredrctron stams from fetch PC generation unit 18D, control crrcurt 74 provrdes the next alternate mdex from lme 0 durmg clock cycle CLK4 The next alternate mdex is the not taken path m this example subscπpt ntl However, the same timing diagram applies if the branch mstruction rs ongrnally predrcted not taken and subsequently predrcted taken bv branch predrctor 18A Also during clock cycle CLK4, the speculative fetches of lmes tl and t2 are cancelled and the next alternate fetch address rs provrded as the next fetch address to I-cache 14

Duπng clock cycle CLK5, control circuit 74 updates the lme predictor entry for lme 0 to swap the next mdex and next alternate mdex fields, to swap the next fetch address and next alternate fetch address fields, and to change the branch prediction (aπow 156) For example, if a single bit of branch prediction is stored m lme 0 and the prediction was taken (as in the example of Fig 1 1 ), the prediction is updated to not taken Smce control crrcurt 74 is updating mdex table 72 during clock cycle CLK5, the next mdex from lme ntl (mdicatmg lme nt2) rs not fetched from the mdex table until clock cycle CLK6 Control crrcuit 74 may capture the next mdex from lme ntl and provide that index through mdex mux 76 during clock cycle CLK6

It is noted that control crrcuit 74 captures lme information at various points during operation, and uses that information m a subsequent clock cycle Control circuit 74 may employ a queue havmg enough entrres to capture line predictor entries duπng successive clock cycles and retain those entries long enough to perform any potential coπectrve measures For example, m the present embodiment, a queue of two entrres may be used Alternatively, a larger queue may be employed and may store lme predictor entπes which have not yet been verified as coπect (e g decode units 24A-24D have not yet verified the instruction alignment information, etc )

Turning next to Fig 12. a timing diagram illustratmg a misprediction for an indirect branch rnstructron termrnatmg lme 0 is shown Lme 0 is fetched from the lme predictor in clock cycle CLK1. and the next index and next fetch address are based on a previous execution of the mdirect branch rnstructron Accordmgly, lme 1 is fetched, and subsequently lme 2, durmg clock cycles CLK2 and CLK3, respectively Similar to Fig 11, the branch instruction address is generated (aπow 146) However, in this case, the indirect branch target cache 18B is accessed during clock cycles CLK2 and CLK3 (aπow 158) Fetch PC generation unit 18D compares the indirect target address provided by indirect branch target cache 18B to the next fetch address from lme 0, and a mismatch is detected (aπow 160) Fetch PC generation unit 18D indicates, via that stams lmes, that a mispredicted indirect branch target has been detected

During clock cycle CLK4, the speculative fetches of lmes 1 and 2 are cancelled In addition, control crrcuit 74 activates PC CAM 70 to cam the predicted indirect branch target address being provided by fetch PC generation unit 18D as the fetch address during clock cycle CLK4 The cam completes duπng clock cycles CLK4 and CLK5 A hit is detected, and the LP index from the hitting entry (entry I) is provided to index table 72 during clock cycle CLK6 During clock cycle CLK7, control circuit 74 updates the lme 0 entry to set the next fetch address to the newly predicted indirect branch target address provided by indirect branch target cache 18B and the next index field to indicate line l (aπow 162).

Fig 14 illustrates a case in which line 0 is termmated by a remm rnstructron, but the next fetch address does not match the return address at the top of return stack 18C Fetch PC generation unit 18D determines from the branch information for lme 0 that the termination instruction is a return mstruction, and therefore compares the next fetch address to the return address stack during clock cycle CLK2 (arrow 164) Fetch PC generation umt 18D returns a stams of misprediction to line predictor 12, and provides the predicted return address from return address stack 18C as the fetch address (clock cycle CLK3). As with the indirect branch target address misprediction. control circuit 74 activates PC CAM 70 during clock cycle CLK3, and the cam completes with a hit durmg clock cycle CLK4 (with the LP index from the hitting entry indicating entry RAS in index table 72) Lme RAS is fetched during clock cycle CLK4. and control circuit 74 updates the next fetch address field of line 0 to reflect the newly predicted return address and the next mdex field of line 0 to reflect line RAS (aπow 166)

Turning next to Fig 15, an example of lme 0 bemg terminated by a continuation over a page crossing is shown During clock cycle CLKO, line 0 is fetched from the lme predictor Control crrcuit 74 detects the continuation indication in lme 0, and mdicates that the next fetch address is to be translated The virtual next fetch address in this case is provided by fetch PC generation unit 18D to ITLB 60 for translation The result of the translation is compared to the next fetch address provided by line predictor 12 to ensure that the coπect physical address is provided If the next fetch address is incoπect, line predictor 12 is updated and the coπespondmg linear address may be cammed to detect the next entry Fig 15 illustrates the case m which the next fetch address is coπect (1 e the physical mapping has not been changed) Accordingly, the next index from line 0 is fetched from index table 72 durmg clock cvcle CLK2. and the instructions from the new page are read m clock cycle CLK3 (IC stage for lme 1) Line 1 further indicates that line 2 is the next index to be fetched from the line predictor, and fetching continues via the indexes from cycle CLK3 forward in Fig 15

Additionally, lme 0 is stalled in the decode stage until the instruction bytes for line 1 aπive m the decode stage The rnstructron bytes may then be merged by the decode unit (clock cycle CLK5) and the coπesponding lme of instructions may continue to propagate through the pipeline (illustrated by line 0 and lme 1 propagating to the Ml stage rn clock cycle CLK6 and to the M2 stage in clock cycle CLK7) It is noted that, while the merge is performed in decode units 24A-24D in the present embodiment, other embodiments may effect the merge in other stages (e g the alignment stage)

It is noted that the teπns mrsprediction and coπect prediction have been used with respect to Figs 10-15 to refer to the prediction in the lme predictor agreeing with the prediction from branch predictors 18A-18C However, a "coπect prediction" in this sense mav still lead to a misprediction during execution of the coπespondmg branch instruction, and a "misprediction" in this sense may alter what would

e been a coπect prediction according to execution of the coπesponding branch instruction

Turning next to Fig 16, a timing diagram illustrates initiation of decode by predictor miss decode unit 26 due to a fetch miss in PC CAM 70 During clock cycle CLK1, the cam of the fetch address completes and a miss is detected (aπow 168) In response to the miss, control circuit 74 assigns an entry m PC CAM 70 and index table 72 for the missing line predictor entry The fetch address and coπesponding rnstructron bytes flow through the lme predrctor. instruction cache, and alignment stages Since there is no valid alignment information, alignment unit 16 provides the fetched instruction bytes to predictor miss decode unit 26 at the decode stage (illustrated as SDEC0) in Fig 16

Fig 17 illustrates another case in which decode is initiated by predictor miss decode unit 26 In the case of Fig 17, lme 0 stores a null or invalid next index (aπow 170) In response to the invalid next index, control circuit 4 initiates a cam of PC CAM 70 of the fetch address provided by fetch PC generation unit 18D (clock cycle CLK2) As described above, fetch PC generation unit 18D continues to generate virtual fetch addresses coπesponding to the next fetch addresses provided by line predictor 12 (using the branch information provided by lme predictor 12) It is noted that one or more clock cycles may occur between clock cycles CLK1 and CLK2, depending upon the number of clock cycles which may occur before the coπesponding virmal address is generated by fetch PC generation unit 18D

The cam completes in clock cycle CLK3, and one of two actions are taken depending upon whether the cam is a hit (aπow 172) or a miss (aπow 174) If the cam is a hit, the LP index from the hitting entry is provided to index table 72 and the coπesponding line predictor entry is read during clock cycle CLK4 During clock cycle CLK5, control circuit 74 updates lme 0, setting the next index field to equal the LP index provided from the hitting entry

On the other hand, if the cam is a miss, the fetch address and the coπesponding instruction bytes flow through the line predictor, instruction cache, and alignment stages (clock cycles CLK4, CLK5, and CLK6), similar to the timing diagram of Fig 16 Control circuit 74 assigns entries in PC CAM 70 and mdex table 72 according to the employed replacement scheme (e g FIFO), and updates line 0 with the assigned next index value (clock cycle CLK5) Subsequently, predictor miss decode unit 26 may update the assigned entrres wrth information generated by decodmg the coπespondmg mstruction bytes It is noted that, m the case that the cam is a miss, the update may be delayed from clock cycle CLK5 since the lme predictor is idle while predictor miss decode unit 26 is decodmg Fig 18 illustrates a case m which a hit in both PC CAM 70 and index table 72 is detected, but the instruction alignment information (e g instruction pointers) are found not to coπespond to the instruction bytes This case may occur due to address aliasing, for example, in embodiments which compare a predetermined range of the fetch address rn PC CAM 70 to the fetch addresses

The mstruction bytes and alignment information flow through the instruction cache and alignment stages Alignment unit 16 uses the provided alignment information to align instructions to decode units 24A-24D The decode units 24A-24D decode the provided mstructions (Decode stage, clock cycle CLK4) Additionally, the decode units 24A-24D signal one of decode units 24A-24D (e g decode umt 24A) with an indication of whether or not that decode unit 24A-24D received a valid instruction If one or more of the instructions is invalid (clock cycle CLK5), the instruction bytes are routed to predictor miss decode unit 26 (clock cycle CLK6) It is noted that predictor miss decode unit 26 may speculatively begin decoding at clock cycle CLK4, if desired

Figs 16-18 illustrate various scenarios in which predictor miss decode unit 26 initiates a decode of instruction bytes in order to generate a line predictor entry for the rnstructron bytes Figs 19-20 illustrate operation of predictor miss decode unit 26 in performing the decode, regardless of the manner in which the decode was initiated Fig 19 illustrates generation of a line predictor entry for a line of instructions terminated by a non-branch instruction During clock cycles CLK1 , CLK2, and up to CLKM, predictor miss decode unit 26 decodes the instructions within the provided instruction bytes The number of clock cycles may vary depending on the rnstructron bytes bemg decoded In clock cycle CLKM, predrctor miss decode unit 26 determrnes that a termrnation condrtron has been reached and that the termmatron condition is a non-branch instruction (aπow 184) In response to terminating the line in a non-branch instruction, predictor miss decode unit 26 provides the sequential address to lme predrctor 12 and lme predrctor 12 cams the sequential address to the terminating instruction to determine if a line predictor entry coπesponding to the next sequential instruction is stored therein (clock cycles CLKN and CLKN+1 ) In the example, a hit is detected and the sequential instructions are read from the instruction cache and the coπesponding line predictor entry is read from line predictor 12 (clock cycle CLKN+2) Predictor miss decode unit 26 transmits the line predictor entry to lme predictor 12, which updates the line predictor entry assigned to the lme (e g line 0 clock cycle CLKN+3) The next index field of the updated entry rs set to the mdex rn whrch the sequential address hits If the sequential address were to miss m lme predictor 12, lme 0 may still be updated at clock cycle CLKN+3 In this case, however, the next index field is set to indicate the entry allocated to the missing sequential address Instruction bytes coπespondmg to the missing sequential address are provided to predictor miss decode unit 26, which generates another lme predictor entry for the instruction bytes

Fig 20 illustrates generation of a lme predictor entry for a lme termmated by a branch instruction Similar to the timing diagram of Fig 19, predictor miss decode unit 26 decodes instructions within the instruction bytes for one or more clock cycles (e g CLK1, CLK2, and up to CLKM in the example of Fig 20) Predictor miss decode unit 26 decodes the branch instruction, and thus determrnes that the line is termmated (aπow 186) If the line is terminated in a conditional branch instruction, the next fetch address is either the branch target address or the sequential address A prediction is used to initialize the line predictor entry to select one of the two addresses On the other hand, if the lme is termmated by an indirect branch instruction, the target address is variable A prediction from indirect branch target cache 18B is used to initialize the next fetch address (and index) Similarly, if the line is terminated by a return instruction, a return address prediction from return stack 18C is used to initialize the next fetch address ( and index)

Predictor miss decode unit 26 may access the branch predictors 18A-18C to aid m rnrttalrzrng the next fetch address (and next mdex) For condrtronal branches, branch predrctor 18A is accessed to provide a branch prediction For indirect branches, branch predictor 18B is accessed to provide a predicted indirect branch target address For remm rnstructtons, the top entry of return stack 18C is used as the prediction for the next fetch address Fig. 20 illustrates the timing for accessing branch predictor 18A The timing for accessing branch predictor 18B may be similar Return stack 18C may be accessed without the address of the instruction, but otherwise may operate similarly The address of the branch instruction is provided to the branch predictor 18A (aπow 176) and the predictor accesses a coπesponding prediction (aπow 178) The taken or not taken prediction is determined (aπow 180) In response to the taken/not taken prediction from branch predictor 18 A, predictor miss decode unit 26 selects a predicted next fetch address (subscript PA) The predicted next fetch address is the branch target address if the branch instruction is predicted taken, or the sequential address if the branch instruction is predicted not taken Predictor miss decode umt 26 provides the predicted address to line predictor 12, which cams the predicted address in PC CAM 70 (clock cycles CLKNτ-2 and CLKN+3) and, similar to the timing diagram of Fig 19, records the coπesponding LP index from the hitting entry as the next index of the newly created line predictor entry If the predrcted address rs a mrss, the index of the assigned entry is stored The next fetch address of the newly created lme predictor entry rs set to the predrcted address, and the next alternate fetch address is set to whichever of the sequential address and branch target address is not predicted The next alternate index is set to null (or invalid) Line 0 (the entry assigned to the lme predictor entry being generated) is subsequently updated (clock cycle CLK N+5)

A similar timing diagram may apply to the indirect branch case, except that instead of accessing branch predictor 18A to get a prediction for the branch instruction, indirect branch target cache 18B is accessed to get the predicted address For return mstructions, a similar timing diagram may apply except that the top of return stack 18C is used as the predicted address

Fig 20 illustrates the training of the line predictor entry for a predicted fetch address However, conditional branches may select the alternate address if the condition upon which the conditional branch depends results in a different outcome for the branch than was predicted However, the next alternate index is null (or invalid), and hence if the branch prediction for the conditional branch changes, then the next index is not known

Fig 21 illustrates the training of a conditional branch instruction which is initialized as taken Imtialrzatton to not taken may be similar, except that the sequential address and next index are selected during clock cycles CLKN-CLKN+1 and the index of the branch target address is found m clock cycles CLKM-CLKM+7 Clock cycles CLK1-CLK3 and CLKN-CLKN+5 are similar to the above description of Fig 20 (with the predicted address being the branch target address, subscript Tgt, m response to the taken prediction from branch predictor 18 A)

Subsequently, during clock cycle CLKM, line 0 (termmated with the conditional branch instruction) is fetched (clock cycle CLKM) As illustrated by aπow 182, the next index of line 0 continues to select the lme coπesponding to the branch target address of the conditional branch instruction In parallel, as illustrated m Fig 11 above, the address of the conditional branch instruction is generated and branch predictor 18A is accessed In this example, the prediction has now changed to not taken (due to executions of the conditional branch instruction) Furthermore, smce the next alternate index is null, line predictor 12 cams the next alternate fetch address against PC CAM 70 (clock cycles CLKM+4 and CLKM+5) In the example, the sequential address is a hit Control circuit 74 swaps the next fetch address and next alternate fetch address fields of line 0, puts the former next index field (identifying the line predictor entry of the branch target address) in the next alternate index field, and sets the next index field to the index coπesponding to the sequential address Control circuit 74 updates line 0 in index table 72 with the updated next entry rnformation in clock cycle CLKM-7 Accordingly, both the sequential and target paths have been trained mto lme 0 Subsequently, the next and next alternate addresses (and rndexes) may be swapped according to branch predictor 18A (e g Fig 1 1 ), but predictor miss decode unit 26 may not be activated

Predictor Miss Decode Unit Block Diagram

Turning now to Fig 22, a block diagram of one embodiment of predictor miss decode umt 26 is shown Other embodiments are possible and contemplated In the embodrment of Fig 22, predictor miss decode unit 26 includes a register 190, a decoder 192, a lme predictor entrv regrster 194, and a termmatron control circuit 196 Register 190 is coupled to receive instruction bytes and a coπesponding fetch address from alignment unit 16, and is coupled to decoder 192 and termination control crrcuit 196 Decoder 192 is coupled to line predictor entry register 194, to termination control crrcurt 192, and to dispatch instructions to map unit 30 Line predictor entry register 194 is coupled to line predictor 12 Termination control circuit 196 is coupled to receive branch prediction information from branch predictors 18A-18C and is coupled to provide a branch address to fetch PC generation unit 18D and a CAM address to line predictor 12 Together, the branch prediction address, the CAM address, and the ltne entry (as well as control signals for each, not shown) may compπse the line predictor update bus shown rn

Generally, decoder 192 decodes the instruction bytes provided from alignment unit 16 m response to one of the cases shown m Figs 16-18 above Decoder 192 may decode several bytes in parallel (e g four bytes per clock cvcle in one embodiment) to detect instructions and generate a line predictor entry The first byte of the rnstructron bytes provrded to predrctor miss decode unit 26 is the first byte of rnstructron (since line predictor entries begin and terminate as full instructions), and thus decoder 192 locates the end of the first instruction as well as determining the instruction poιnter(s) coπesponding to the first mstruction and detecting if the first instruction is a termination condition (e g branch, microcode, etc ) Similarly, the second instruction is identified and processed etc Decoder 192 may, for example, employ a three stage pipeline for decoding each group of four instruction bytes Iφon exiting the pipeline, the group of four bytes is decoded and coπesponding instruction rnformation has been determined

As instructions are identified, pointers to those instructions are stored in the instruction pomter fields 102- 108 of the entrv Decoder 192 accumulates the ltne predrctor entrv in lme predictor entry register 194 Additionally, decoder 192 may dispatch instructions to map unit 30 as they are identified and decoded

In response to detecting a termination condrtion for the line, decoder 192 signals termination control circuit 196 of the type of termination Furthermore, decoder 192 sets the last rnstructron type field 120 to mdicate the terminating instruction type If the instruction is an MROM instruction, decoder 192 generates an entry pomt for the mstruction and updated MROM entry pomt field 132 Branch displacement field 124 and continuation field

126 are also set appropnatel)

In response to the teπnmatron condrtion, termination control circuit 196 generates the address of the branch instruction and accesses the branch predrctors (rf applrcable) In response to the branch predrctron information received in response to the branch address, teimination control circuit 196 provides the CAM address as one of the sequential address or the branch target address For lmes terminated m a non-branch mstruction, termination control crrcurt 196 provrdes the sequentral address as the CAM address Lme predictor 12 searches for the CAM address to generate the next index field Based on the branch predictor access (if applicable, or the sequential address otherwrse), termination control crrcuit 196 initializes next fetch address field 112 and next alternate fetch address field 114 in lme predictor entrv register 194 (as well as branch prediction field 122) The next index may be provided by control circuit 74 as the entry is updated into line predictor 12 or mav be provided to termination control circuit 196 for storage in line predictor entry register 194

Computer Systems Tummg now to Fig 23, a block diagram of one embodiment of a computer system 200 including processor 10 coupled to a variety of system components through a bus bπdge 202 rs shown Other embodtments are possrble and contemplated In the deptcted system, a mam memory 204 rs coupled to bus bridge 202 through a memory bus 206, and a graphrcs controller 208 is coupled to bus bridge 202 through an AGP bus 210 Fmally, a plurahty of PCI devrces 212A-212B are coupled to bus brtdge 202 through a PCI bus 214 A secondary bus brtdge 216 may further be provided to accommodate an electπcal mterface to one or more EISA or ISA devrces 218 through an EISA/ISA bus 220 Processor 10 is coupled to bus bridge 202 through a CPU bus 224 and to an optional L2 cache 228 Together, CPU bus 224 and the interface to L2 cache 228 may comprise external interface 52

Bus bridge 202 provides an interface between processor 10, mam memory 204, graphics controller 208. and devrces attached to PCI bus 214 When an operatron rs received from one of the devrces connected to bus brrdge 202, bus brrdge 202 identifies the target of the operation (e g a particular device or, m the case of PCI bus 214, that the target is on PCI bus 214) Bus bridge 202 routes the operation to the targeted device Bus bridge 202 generally translates an operation from the protocol used by the source device or bus to the protocol used by the target device or bus In addition to providing an interface to an ISA/EISA bus for PCI bus 214, secondary bus bπdge 216 may further incorporate addrtronal functionality, as desired An input output controller (not shown), either external from or integrated with secondary bus bridge 216, may also be included withm computer system 200 to provide operational support for a keyboard and mouse 222 and for various serial and parallel ports, as desired An external cache unit (not shown) may further be coupled to CPU bus 224 between processor 10 and bus bridge 202 in other embodiments Alternatively, the external cache may be coupled to bus bridge 202 and cache control logrc for the external cache may be mtegrated into bus bridge 202 L2 cache 228 is further shown rn a backsrde confrguratron to processor 10 It is noted that L2 cache 228 may be separate from processor 10, integrated mto a cartrrdge (e g slot 1 or slot A) wrth processor 10. or even mtegrated onto a semrconductor substrate with processor 10 Mam memory 204 is a memory rn whrch applrcatron programs are stored and from which processor 10 primarily executes A suitable mam memory 204 comprrses DRAM (Dynamrc Random Access Memory) For example, a plurality of banks of SDRAM (Synchronous DRAM) or Rambus DRAM (RDRAM) mav be suitable

PCI devices 212A-212B are illustrative of a variety of peπpheral devices such as, for example, network mterface cards, vtdeo accelerators, audro cards, hard or floppy drsk drives or drive controllers, SCSI (Small Computer Svstems Interface) adapters and telephony cards Similarly, ISA device 218 is illustrativ e of various types of peripheral devices, such as a modem, a sound card, and a variety of data acquisition cards such as GPIB or field bus interface cards

Graphics controller 208 is provided to control the rendering of text and images on a display 226 Graphics controller 208 may embody a tvprcal graphrcs accelerator generally known in the art to render three-dimensional data structures whrch can be effectively shrfted into and from mam memory 204 Graphics controller 208 mav therefore be a master of AGP bus 210 m that it can request and receive access to a target interface withm bus bridge 202 to thereby obtain access to main memory 204 A dedicated graphics bus accommodates rapid retπeval of data from mam memory 204 For certam operations, graphics controller 208 mav further be configured to generate PCI protocol transactions on AGP bus 210 The AGP interface of bus bridge 202 mav thus include functionality to support both AGP protocol transactions as well as PCI protocol target and initiator transactions Display 226 is any electrontc display upon which an image or text can be presented A suitable display 226 includes a cathode ray tube ("CRT"), a liquid crystal display ("LCD"), etc

It is noted that, while the AGP, PCI, and ISA or EISA buses have been used as examples m the above descrrptron, any bus archrtectures may be substituted as desired It is further noted that computer system 200 may be a multiprocessing computer system including additional processors (e g processor 10a shown as an optional component of computer system 200) Processor 10a may be similar to processor 10 More particularly, processor 10a mav be an identical copy of processor 10 Processor 10a may be connected to bus bridge 202 via an independent bus (as shown in Fig 23) or may share CPU bus 224 with processor 10 Furthermore, processor 10a may be coupled to an optional L2 cache 228a similar to L2 cache 228 Turning now to Fig 24, another embodiment of a computer system 300 is shown Other embodiments are possible and contemplated In the embodiment of Fig 24, computer system 300 includes several processing nodes 312A, 312B, 312C, and 312D Each processing node is coupled to a respective memory 314A-314D vra a memory controller 316A-316D mcluded wrthm each respective processing node 312A-312D Additionally, processing nodes 312A-312D include interface logic used to communicate between the processing nodes 312A-312D For example, processing node 312A includes interface logic 318A for communicating with processing node 312B, interface logic 318B for communicating with processing node 312C, and a third interface logic 318C for communicating with yet another processing node (not shown) Similarly, processing node 312B includes interface logic 318D, 318E, and 318F, processing node 312C includes interface logic 318G, 318H, and 3181, and processing node 312D includes interface logic 318J 318K, and 318L Processing node 312D is coupled to communicate with a plurality of input/output devices (e g devices 320A-320B in a daisy chain configuration) via interface logic 318L Other processing nodes may communicate with other I/O devices in a similar fashion

Processing nodes 312A-312D implement a packet-based link for inter-processing node communication In the present embodiment, the link is implemented as sets of unidirectional lmes (e g lines 324A are used to transmit packets from processing node 312A to processing node 312B and lines 324B are used to transmrt packets from processmg node 312B to processing node 312A) Other sets of lines 324C-324H are used to transmrt packets between other processing nodes as illustrated m Fig 24 Generally, each set of lines 324 may include one or more data lines, one or more clock lines coπesponding to the data lines, and one or more control lines indicating the type of packet bemg conveyed The link may be operated m a cache coherent fashion for communication between processmg nodes or rn a noncoherent fashion for communication between a processmg node and an I/O device (or a bus bridge to an 1 0 bus of conventional construction such as the PCI bus or ISA bus) Furthermore, the link may be operated in a non-coherent fashion using a daisy-chain structure between I/O devices as shown It is noted that a packet to be transmrtted from one processing node to another may pass through one or more intermediate nodes For example, a packet transmuted by processmg node 312A to processing node 312D may pass through either processing node 312B or processing node 312C as shown in Fig 24 Any suitable routing algorithm may be used Other embodiments of computer system 300 may include more or fewer processmg nodes then the embodiment shown in Fig 24

Generally, the packets may be transmuted as one or more bit times on the lines 324 between nodes A bit time may be the rising or falling edge of the clock signal on the coπesponding clock lines The packets may include command packets for initiating transactions, probe packets for maintaining cache coherency, and response packets from responding to probes and commands

Processing nodes 312A-312D, in addition to a memory controller and interface logic, may include one or more processors Broadly speaking, a processing node compπses at least one processor and may optionally include a memory controller for communicating with a memory and other logrc as desrred More particularly, a processmg node 312A-312D may comprrse processor 10 External mterface unit 46 may includes the interface logic 318 withm the node, as well as the memory controller 316

Memoπes 314A-314D may comprise any suitable memory devices For example, a memory 314A-314D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous DRAMs (SDRAMs). static RAM, etc The address space of computer system 300 is divided among memories 314A-314D Each processing node 312A- 312D may include a memory map used to determine which addresses are mapped to which memories 314A-314D, and hence to which processing node 312A-312D a memory request for a particular address should be routed In one embodiment, the coherency point for an address withm computer system 300 is the memory controller 316A- 316D coupled to the memory stormg bytes coπespondmg to the address In other words, the memory controller 316A-316D is responsrble for ensurrng that each memory access to the coπespondtng memory 314A-314D occurs in a cache coherent fashion Memory controllers 316A-316D may comprise control circuitry for interfacing to memories 314A-314D Additionally, memory controllers 316A-316D may mclude request queues for queuing memory requests

Generally, mterface logic 318A-318L may comprise a variety of buffers for receiving packets from the link and for buffering packets to be transmitted upon the link Computer system 300 may employ any suitable flow control mechanism for transmitting packets For example, in one embodiment, each interface logic 318 stores a count of the number of each type of buffer withm the receiver at the other end of the link to which that mterface logic is connected The interface logic does not transmit a packet unless the receivmg mterface logic has a free buffer to store the packet As a receiving buffer is freed by routing a packet onward, the receiving interface logic transmits a message to the sending interface logic to indicate that the buffer has been freed Such a mechanism may be refeπed to as a "coupon-based" system

I O devices 320A-320B may be any suitable I/O devices For example, 1/0 devrces 320A-320B may mclude network interface cards, video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters and telephony cards, modems, sound cards, and a vaπety of data acquisition cards such as GPIB or field bus interface cards

Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated It is intended that the following claims be interpreted to embrace all such variations and modifications

INDUSTRIAL APPLICABILITY

This invention may generally be applicable to processors and computer systems

Claims

WHAT IS CLAIMED IS:

1. A processor (10) compnsmg an mstruction cache (14) coupled to recetve a fetch address, and a lme predrctor ( 12) coupled to receive said fetch address, said line predictor (12) including a first memory

(72) comprrsrng a plurality of entries, each entry storing a plurality of mstruction pointers (102. 104, 106. 108), wherein said lme predictor ( 12) is configured to select a first entry of sard plurality of entries, said first entry coπespondmg to said fetch address, and wherein each of a first plurality of instruction pointers (102, 104, 106, 108) within said first entry, if valrd, drrectly locates an rnstructron withm a plurality of instruction bytes fetched from up to two cache lines of said instruction cache ( 14) in response to said fetch address

2 The processor (10) as recited in claim 1 wherein said first entry is further configured to store a next entry indication ( 100) identifying a second entry of said plurality of entries withm said first memory (72), wherein said line predictor ( 12) is configured to subsequently select said second entry to provide a second plurality of instruction pointers ( 102, 104, 106. 108) stored therein responsive to said next entry indication (100)

3 The processor ( 10) as recited in claim 2 wherein said lme predictor ( 12) further includes a second memory (70) coupled to receive said fetch address and further coupled to said first memory (72), sard second memory (70) comprising a second plurality of entrres configured to store fetch addresses and rndexes mto said first memory (72) wherern sard second memory (70) is configured to compare said fetch address to fetch addresses stored rn said second plurality of entrres and to select a second entry of sard second pluralrty of entrres in response to said fetch address matchmg sard fetch address stored in said second entry, and wherern sard second memory (70) rs configured to provrde said index stored in said second entry to sard first memory (72) to select said first entry.

4 The processor ( 10) as recited in claim 3 wherein said lme predictor ( 12) is configured to inhibit access to said second memory (70) if said next entry indication (100) m said first entry is valid

5. The processor (10) as recited in claim 2 wherein said next entry indication ( 100) further identifies a third entry of sard pluralrty of entrres wrthm sard first memory (72), wherern a last rnstructron identified by said first plurality of instruction pointers (102, 104, 106, 108) is a branch instruction, and wherein said third entry conesponds to instructions m a non-predrcted path of said branch instruction

6 The processor (10) as recited in claim 1 wherein said first entry is further configured to store control information (HO) coπesponding to said mstructions located by said first plurality of instruction pointers ( 102 104, 106, 108), and wherein said control information ( 110) includes an indication (126) that at least one byte of a last instruction located by said first plurality of instruction pointers is stored on a different page than said plurality of instruction bytes

7 The processor ( 10) as recited in claim 1 wherein said first entry is further configured to store control information (1 10) coπesponding to said instructions located by said first plurality of instruction pointers ( 102, 104, 106, 108), and wherein said control information (110) includes a type of a last instruction identified by said first plurality of instruction pointers (102, 104, 106, 108)

8 A method comprising generating a fetch address, and selecting a first plurality of instruction pointers ( 102, 104, 106, 108) from a line predictor (12), said first plurality of instruction pointers (102, 104, 106, 108) coπesponding to said fetch address, each of said first plurality of instruction pointers ( 102, 104, 106, 108), if valid, directly locating an instruction within a plurality of instruction bytes fetched from up to two cache lines of an instruction cache (14) in response to said fetch address

9 The method as recrted in claim 8 wherein said line predictor ( 12) compπses a first memory (72) including a plurality of entries, each of said plurality of entries configured to store a plurality of instruction pointers ( 102, 104,

106, 108), and wherein said first entry is further configured to store a next entry indication ( 100), and wherein said selecting comprises selecting a first entry of said plurality of entries, said first entry storing said first plurality of instruction pointers ( 102, 104, 106, 108), the method further comprising selecting a second entry of said plurality of entries responsive to said next entry indication ( 100)

10. A processor ( 10) comprising a line predictor ( 12) coupled to receive a fetch address and to provide a plurality of rnstructron pointers (102, 104, 106, 108) in response thereto