WO1998002817A1 - A way prediction unit and a method for operating the same - Google Patents

A way prediction unit and a method for operating the same

Info

Publication number
WO1998002817A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
way
fetch
unit
recited
Application number
PCT/US1996/011755
Other languages
French (fr)
Inventor
Thang M. Tran
James K. Pickett
Original Assignee
Advanced Micro Devices, Inc.
Application filed by Advanced Micro Devices, Inc.
Priority to PCT/US1996/011755
Priority to EP96925321A
Publication of WO1998002817A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802: Instruction prefetching
    • G06F9/3804: Instruction prefetching for branches, e.g. hedging, branch folding
    • G06F9/3806: Instruction prefetching for branches, e.g. hedging, branch folding using address prediction, e.g. return stack, branch history buffer
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0864: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using pseudo-associative means, e.g. set-associative or hashing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00: Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60: Details of cache memory
    • G06F2212/608: Details relating to cache mapping
    • G06F2212/6082: Way prediction in set-associative cache

Definitions

  • Instruction alignment unit 206 is provided to channel variable byte length instructions from instruction cache 204 to fixed issue positions formed by decode units 208A-208F. Instruction alignment unit 206 is configured to channel instruction code to designated decode units 208A-208F depending upon the locations of the start bytes of instructions within a line as delineated by instruction cache 204. In one embodiment, the particular decode unit 208A-208F to which a given instruction may be dispatched is dependent upon both the location of the start byte of that instruction as well as the location of the previous instruction's start byte, if any. Instructions starting at certain byte locations may further be restricted for issue to only one predetermined issue position. Specific details follow.
  • each of the decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above.
  • each decode unit 208A-208F routes displacement and immediate data to a corresponding reservation station unit 210A-210F.
  • Output signals from the decode units 208 include bit-encoded execution instructions for the functional units 212 as well as operand address information, immediate data and/or displacement data.
  • the superscalar microprocessor of Figure 1 supports out of order execution, and thus employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions.
  • a temporary storage location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register to thereby store speculative register states.
  • Reorder buffer 216 may be implemented in a first-in- first-out configuration wherein speculative results move to the "bottom" of the buffer as they are validated and written to the register file, thus making room for new entries at the "top" of the buffer.
  • Other specific configurations of reorder buffer 216 are also possible, as will be described further below. If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218.
  • each reservation station unit 210A-210F is capable of holding instruction information (i.e., bit-encoded execution bits as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit.
  • each decode unit 208A-208F is associated with a dedicated reservation station unit 210A-210F.
  • each reservation station unit 210A-210F is similarly associated with a dedicated functional unit 212A-212F.
  • six dedicated "issue positions" are formed by decode units 208, reservation station units 210 and functional units 212. Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 210B and subsequently to functional unit 212B, and so on.
  • reorder buffer 216 contains temporary storage locations for results which change the contents of these registers to thereby allow out of order execution.
  • a temporary storage location of reorder buffer 216 is reserved for each instruction which, upon decode, modifies the contents of one of the real registers. Therefore, at various points during execution of a particular program, reorder buffer 216 may have one or more locations which contain the speculatively executed contents of a given register. If following decode of a given instruction it is determined that reorder buffer 216 has previous location(s) assigned to a register used as an operand in the given instruction, the reorder buffer forwards to the corresponding reservation station either the value in the most recently assigned location, or a tag for the most recently assigned location if the value has not yet been produced.
  • Reservation station units 210A-210F are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212F. As stated previously, each reservation station unit 210A-210F may store instruction information for up to three pending instructions. Each of the six reservation stations 210A-210F contains locations to store bit-encoded execution instructions to be speculatively executed by the corresponding functional unit and the values of operands. If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i.e., by completion of the execution of a previous instruction). It is noted that when an instruction is executed by one of the functional units 212A-212F, the result of that instruction is passed directly to any reservation station units 210A-210F that are waiting for that result at the same time the result is passed to update reorder buffer 216 (this technique is commonly referred to as "result forwarding"). Instructions are issued to functional units for execution after the values of any required operands are made available.
  • each of the functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations.
  • Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 220. If a branch prediction was incorrect, branch prediction unit 220 flushes instructions subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes prefetch/predecode unit 202 to fetch the required instructions from instruction cache 204 or main memory. It is noted that in such situations, results of instructions in the original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.
  • Results produced by functional units 212 are sent to the reorder buffer 216 if a register value is being updated, and to the load/store unit 222 if the contents of a memory location are changed. If the result is to be stored in a register, the reorder buffer 216 stores the result in the location reserved for the value of the register when the instruction was decoded. As stated previously, results are also broadcast to reservation station units 210A-210F where pending instructions may be waiting for the results of previous instruction executions to obtain the required operand values.
  • Data cache 224 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the main memory subsystem.
  • data cache 224 has a capacity of storing up to eight kilobytes of data. It is understood that data cache 224 may be implemented in a variety of specific memory configurations, including a set associative configuration.
  • load/store unit 222 provides an interface between functional units 212A-212F and data cache 224.
  • load/store unit 222 is configured with a load/store buffer with sixteen storage locations for data and address information for pending load or store memory operations.
  • Functional units 212 arbitrate for access to the load/store unit 222.
  • the load/store unit 222 also performs dependency checking for load memory operations against pending store memory operations to ensure that data coherency is maintained.
  • Turning now to Figure 2, a block diagram of a portion of microprocessor 200 is shown. Included in the diagram are instruction cache 204 and branch prediction unit 220. Instruction cache 204 is further coupled to instruction alignment unit 206 and decode units 208 (shown in this diagram as a single block, although decode units 208 are shown as several blocks in Figure 1). Within branch prediction unit 220 is a way prediction unit 250 in accordance with the present invention. Branch prediction unit 220 may also contain other units (not shown). In one embodiment, instruction cache 204 is eight-way associative. Shown within instruction cache 204 are a fetch PC unit 254 and instruction cache storage 255.
  • way prediction unit 250 generates a predicted fetch address for the next cache line to be fetched (i.e., a prediction of the branch prediction address) and a predicted way value for the fetch address accessing the instruction cache in the current clock cycle (the "current fetch address").
  • the predicted fetch address and predicted way are based on the current fetch address.
  • microprocessor 200 is able to read the predicted line from instruction cache 204 while the branch prediction is being generated, advantageously removing a clock cycle from the instruction fetch process in cases where the predicted fetch address matches the branch prediction address.
  • predicting the way for the current fetch address makes the instructions for the current fetch address available by the end of the clock cycle in which the fetch address accesses the cache, as opposed to the next cycle if the tag comparison information is used to select the instructions.
  • the predicted way is validated by the tag comparison information in the following cycle.
  • the predicted fetch address and predicted way are transferred on a prediction bus 251 to fetch PC unit 254.
  • the predicted fetch address is a partial address containing the index bits used to index instruction cache 204.
  • Way prediction unit 250 predicts the next instruction cache line index to be fetched and the way of the instruction cache which contains the current fetch address, using the current fetch address as conveyed from fetch PC unit 254 on a fetch request bus 252.
  • Fetch request bus 252 also conveys way selection information for instruction cache 204, and a next index address which is the current fetch address incremented by the size of one instruction cache line.
  • the instruction cache line size is sixteen bytes and therefore the next index address is the current fetch address incremented by sixteen.
  • the current fetch address conveyed on fetch request bus 252 is the predicted fetch address conveyed by way prediction unit 250 in the previous clock cycle, except for cycles in which fetch PC unit 254 detects a way or predicted fetch address misprediction.
  • the way prediction is validated by comparing the actual way that the fetch address hits in instruction cache 204 (determined via a tag compare to the full fetch address) to the way prediction. If the way prediction matches the actual way, then the way prediction is correct. If the way prediction is wrong, then the correct way is selected from the eight ways of instruction cache 204 (which were latched from the previous clock cycle), the instructions associated with the predicted way are discarded, and the predicted fetch address is discarded as well. (A behavioral sketch of this predict-and-validate flow appears at the end of this section.)
  • the predicted fetch address is validated by comparing the predicted fetch address to the branch prediction address generated by branch prediction unit 220. In embodiments where the predicted fetch address is an index address, only the index bits of the branch prediction address are compared to the predicted fetch address.
  • fetch PC unit 254 sends update information to way prediction unit 250 on fetch request bus 252.
  • the update information for a particular predicted fetch address is sent the clock cycle following the predicted address validation, and includes an update way value and update selection control bits. If the prediction is correct, the update information is the predicted information. If the prediction is incorrect, the update way value is the way of instruction cache 204 which contains the instructions actually fetched, and the update selection control bits indicate whether the branch prediction address is a taken branch, a next sequential line fetch, or a RET instruction.
  • the update information is stored by way prediction unit 250 such that the next prediction using a similar current fetch address will include the update information in the prediction mechanism.
  • the prediction mechanism will be explained in more detail with respect to Figure 3.
  • Figure 2 also depicts a return stack address bus 253 connected to way prediction unit 250.
  • Return stack address bus 253 conveys the address that is currently at the top of the return stack in decode units 208.
  • the return stack is a stack of addresses that refer to instructions following previously executed CALL instructions.
  • a RET instruction would use the address at the top of the return stack to locate the next instruction to be executed.
  • the CALL and RET instructions are defined by the x86 architecture as subroutine entrance and exit instructions, respectively.
  • Way prediction unit 250 uses the next index address provided on fetch request bus 252 and the return stack address as sources for the predicted fetch address. The return stack address is selected during clock cycles in which way prediction unit 250 predicts that a RET instruction is in the cache line currently being fetched. Alternatively, the next index address is selected during clock cycles in which way prediction unit 250 predicts that no branch-type instructions exist in the cache line currently being fetched. Way prediction unit 250 also selects internally stored addresses as the predicted fetch address during clock cycles in which the prediction mechanism predicts that a branch-type instruction exists in the cache line, as will be described in further detail below.
  • way prediction unit 250 is shown with return stack bus 253 and fetch request bus 252 connected to it.
  • the current fetch address conveyed on fetch request bus 252 is decoded by decoder circuit 300. The resulting select lines are stored by a delay latch 301, and also select a storage location from a way prediction array 302.
  • way prediction array 302 is configured as a linear array of storage locations.
  • prediction array 302 is composed of registers. Each storage location is configured to store prediction addresses, a predicted way, and target selection control bits.
  • the prediction addresses are branch prediction addresses previously generated from instructions residing at a fetch address with the same index bits as the current fetch address.
  • the predicted way is the last correctly predicted way for a fetch address with the same index as the current fetch address.
  • the target selection control bits indicate which of the stored prediction addresses should be selected, or if the next index address or the return stack address should be selected. When microprocessor 200 is initialized, the target selection control bits of storage locations within way prediction array 302 are set to select the next index address.
  • Delay latch 301 transfers its value to an update latch 304.
  • Update latch 304 selects a storage location within way prediction array 302 for storing the update information provided by fetch PC unit 254.
  • Delay latch 301 and update latch 304 store the decoded selection lines for way prediction array 302, and thus avoid the need for a second decoder circuit similar to decoder 300.
  • a decoder circuit such as decoder 300 is larger (in terms of silicon area) than delay latch 301 and update latch 304. Therefore, silicon area is saved by implementing this embodiment instead of an embodiment with a second decoder for updates.
  • Way prediction unit 250 is also configured with an address selection device for selecting the address to provide as the predicted fetch address.
  • the address selection device is a multiplexor 305 and an address selection circuit 306.
  • Address selection circuit 306 receives the target selection control bits from way prediction array 302 and produces multiplexor select lines for multiplexor 305.
  • address selection circuit 306 causes multiplexor 305 to select the first address from way prediction array 302 if the target selection control bits contain the binary value "01", the second address from way prediction array 302 if the target selection control bits contain the binary value "10", the next index address from fetch request bus 252 if the target selection bits contain the binary value "00", and the return stack address from return stack address bus 253 if the selection control bits contain the binary value "11". Therefore, address selection circuit 306 is a decode of the target selection control bits. The predicted address is conveyed on prediction bus 251 along with the predicted way selected from way prediction array 302.
  • Turning now to Figure 4A, a diagram of one of the storage locations of way prediction array 302 (shown in Figure 3) is shown.
  • two prediction addresses are stored (shown as fields 400 and 401).
  • Each of the prediction addresses is 12 bits wide. Way prediction information is stored in a field 402, which is 3 bits wide in this embodiment to encode the eight ways of instruction cache 204.
  • In another embodiment, field 402 is 8 bits wide and the way prediction information is not encoded; instead, one of the eight bits is set, indicating the predicted way. Target selection bits are also stored within the storage location in field 403, which is 2 bits wide in this embodiment to encode selection of prediction address field 400, prediction address field 401, the next index address, or the return stack address.
  • In another embodiment, field 403 is four bits wide and the target selection bits are not encoded; instead, one of the four bits is set, indicating one of the four possible prediction addresses.
  • Turning now to Figure 4B, a timing diagram depicting important relationships between way prediction unit 250, instruction cache 204 and fetch PC unit 254 is shown.
  • a current fetch address is sent from fetch PC unit 254 to instruction cache 204 and way prediction unit 250.
  • Instruction cache 204 transfers the associated instructions to its output bus during the time indicated by the horizontal line 420, and latches them.
  • way prediction unit 250 indexes into way prediction array 302 and selects a storage location. From the value of the target selection control bits stored within the selected storage location, a predicted fetch address is generated.
  • the predicted fetch address and the predicted way from the selected storage location are conveyed to fetch PC unit 254 near the end of ICLK1, as indicated by arrow 421.
  • the predicted way is selected from the eight ways indexed by the current fetch address in ICLK1, and the selected instructions are scanned to form a branch prediction address. Also, the selected instructions are forwarded to instruction alignment unit 206 in ICLK2.
  • fetch PC unit 254 determines whether or not the previous fetch address is an instruction cache hit in the predicted way, and branch prediction unit 220 generates a branch prediction address, as indicated by arrow 422. This information is used as described above to validate the predicted fetch address that is currently accessing instruction cache 204 and the predicted way that was provided in ICLK1. If a way misprediction is detected, the correct instructions are selected from the eight ways of instruction cache 204 latched in ICLK1, and the instructions associated with the predicted way are discarded.
  • return stack address predictions require two extra clock cycles to validate (as indicated by Figure 4B), as compared to next index address or branch prediction address predictions.
  • a current fetch address A is conveyed to instruction cache 204 and way prediction unit 250 during ICLK1, as indicated by block 440.
  • a prediction address B is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 441.
  • During ICLK2, address B is conveyed as a current fetch address because it was predicted in ICLK1, as indicated by block 442.
  • prediction address C is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 443.
  • address A is determined to hit in the instruction cache in the predicted way, and at arrow 445 a branch prediction associated with address A is calculated.
  • the branch prediction address matches the predicted address B, and therefore address B is a valid prediction.
  • During ICLK3, at arrow 446, the predicted way for address B is found to be incorrect. Therefore, the correct instructions are selected from the eight ways that were latched in the previous cycle, as indicated by block 447.
  • Turning now to Figure 4D, a timing diagram is shown depicting several consecutive instruction fetches, to further illustrate the interaction between fetch PC unit 254, way prediction unit 250, and instruction cache 204.
  • During ICLK1, a current fetch address A is conveyed to instruction cache 204 and way prediction unit 250, as indicated by block 460.
  • a prediction address B is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 461.
  • During ICLK2, address B is conveyed as a current fetch address because it was predicted in ICLK1, as indicated by block 462.
  • prediction address C is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 463.
  • address A is determined to hit in the instruction cache in the predicted way, and at arrow 465 a branch prediction associated with address A is calculated.
  • the branch prediction address matches the predicted address B, and therefore address B is a valid prediction.
  • During ICLK3, address C is used as the current fetch address, as indicated by block 466.
  • address B is determined to be a hit in the predicted way, as indicated by arrow 467.
  • the branch prediction associated with address B is determined, and the branch prediction address does not match address C. Therefore, the predicted fetch address being conveyed in ICLK3 is ignored, as well as the instructions associated with address C.
  • In ICLK4, the corrected branch prediction address C is used as the current fetch address, as indicated by block 469.
  • a predicted fetch address and way based on corrected address C is made by way prediction unit 250 in ICLK4, and the current fetch address for ICLK5 will reflect that prediction.
  • the number and size of addresses stored within way prediction array 302 may differ for other embodiments. In particular, the number of addresses stored may be more or less than the embodiment of Figure 3. Furthermore, the number of external addresses added to the address prediction selection may vary from embodiment to embodiment, as will the number and encoding of the target selection control bits. It is also noted that the portion of the address stored within way prediction array 302 may vary from embodiment to embodiment, and the entire address may be stored in another embodiment. It is further noted that other embodiments could store multiple way predictions and select among them in a manner similar to the address selection device shown in Figure 3. It is also noted that some embodiments may store other information with each predicted address in way prediction array 302. For example, a way, a byte position within the instruction cache line, and branch prediction counter information may be stored within fields 400 and 401.
  • a superscalar microprocessor employing a way prediction unit is disclosed.
  • the way prediction unit is provided to reduce the number of clock cycles needed to predict the next address that a code stream will fetch from the instruction cache. By reducing the number of clock cycles required to predict the address from two to one, instruction cache utilization is raised. Instructions are provided to the instruction processing pipelines continuously, advantageously reducing the idle clock cycles that the superscalar microprocessor endures. Therefore, overall performance may be increased.
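
The bullets above trace the predict-and-validate fetch flow of Figures 4B through 4D cycle by cycle. As a consolidation, here is a minimal behavioral C sketch of that loop, written under stated assumptions: the helper functions (wp_predict, icache_hit_way, branch_predict, wp_update, and the issue/reissue routines) are illustrative names declared but left undefined, and the hardware pipelining (validation of one fetch overlapping the next) is flattened into sequential steps for readability. It is an illustration of the described flow, not the patented implementation.

```c
#include <stdint.h>

/* Assumed helpers, with the semantics described in the text above. */
extern uint32_t wp_predict(uint32_t fetch_addr, unsigned *pred_way);
extern unsigned icache_hit_way(uint32_t fetch_addr);      /* via tag compare */
extern uint32_t branch_predict(uint32_t fetch_addr, unsigned actual_way);
extern void     wp_update(uint32_t fetch_addr, unsigned way, uint32_t next);
extern void     issue_to_pipeline(uint32_t fetch_addr, unsigned way);
extern void     reissue_from_latched_ways(uint32_t fetch_addr, unsigned way);

void fetch_loop(uint32_t fetch_addr)
{
    for (;;) {
        /* Current cycle: the fetch address indexes the way prediction
         * array, yielding a predicted way for THIS fetch and a predicted
         * NEXT fetch address, while all eight ways of the indexed cache
         * row are read and latched. */
        unsigned pred_way;
        uint32_t pred_next = wp_predict(fetch_addr, &pred_way);
        issue_to_pipeline(fetch_addr, pred_way);

        /* Following cycle: the tag compare yields the actual way, and
         * the branch prediction unit yields the actual next address. */
        unsigned actual_way  = icache_hit_way(fetch_addr);
        uint32_t actual_next = branch_predict(fetch_addr, actual_way);

        if (actual_way != pred_way) {
            /* Way mispredict: reselect instructions from the latched
             * ways, discard the predicted fetch address, and retrain. */
            reissue_from_latched_ways(fetch_addr, actual_way);
            wp_update(fetch_addr, actual_way, actual_next);
            fetch_addr = actual_next;
        } else if (actual_next != pred_next) {
            /* Fetch address mispredict: squash the wrong-path fetch,
             * retrain, and restart at the branch prediction address. */
            wp_update(fetch_addr, actual_way, actual_next);
            fetch_addr = actual_next;
        } else {
            /* Both predictions correct: the predicted address is already
             * fetching, so the cache is utilized every clock cycle. */
            fetch_addr = pred_next;
        }
    }
}
```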

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A way prediction unit for a superscalar microprocessor is provided which predicts the next fetch address as well as the way of the instruction cache that the current fetch address hits in while the instructions associated with the current fetch are being read from the instruction cache. The way prediction unit is intended for high frequency microprocessors in which associative caches tend to be clock cycle limiting, causing the instruction fetch mechanism to require more than one clock cycle between fetch requests. Therefore, an instruction fetch can be made every clock cycle using the predicted fetch address until an incorrect next fetch address or an incorrect way is predicted. The instructions from the predicted way are provided to the instruction processing pipelines of the superscalar microprocessor each clock cycle.

Description

TITLE: A WAY PREDICTION UNIT AND A METHOD FOR OPERATING THE SAME
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to the field of superscalar microprocessors and, more particularly, to branch prediction mechanisms employed within superscalar microprocessors.
2. Description of the Relevant Art
Superscalar microprocessors achieve high performance by simultaneously executing multiple instructions in a clock cycle and by specifying the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time during which the pipeline stages of a microprocessor perform their intended functions. At the end of a clock cycle, the resulting values are moved to the next pipeline stage.
Since superscalar microprocessors execute multiple instructions per clock cycle and the clock cycle is short, a high bandwidth memory system is required to provide instructions to the superscalar microprocessor (i.e. a memory system that can provide a large number of bytes in a short period of time). Without a high bandwidth memory system, the microprocessor would spend a large number of clock cycles waiting for instructions to be provided, then would execute the received instructions in a relatively small number of clock cycles. Overall performance would be degraded by the large number of idle clock cycles. However, superscalar microprocessors are ordinarily configured into computer systems with a large main memory composed of dynamic random access memory (DRAM) cells. DRAM cells are characterized by access times which are significantly longer than the clock cycle of modern superscalar microprocessors. Also, DRAM cells typically provide a relatively narrow output bus to convey the stored bytes to the superscalar microprocessor. Therefore, DRAM cells provide a memory system that provides a relatively small number of bytes in a relatively long period of time, and do not form a high bandwidth memory system.
Because superscalar microprocessors are typically not configured into a computer system with a memory system having sufficient bandwidth to continuously provide instructions for execution, superscalar microprocessors are often configured with an instruction cache. Instruction caches are multiple blocks of storage locations, configured on the same silicon substrate as the microprocessor or coupled nearby. The blocks of storage locations are used to hold previously fetched instruction bytes. The bytes can be transferred from the instruction cache to the instruction processing pipelines quickly; commonly one or two clock cycles are required.
Instruction caches are typically organized into an "associative" structure. In an associative structure, the blocks of storage locations are accessed as a two-dimensional array having rows and columns. When a "fetch control unit" (for example, a fetch PC unit) searches the instruction cache for instructions residing at an address, a number of bits from the address are used as an "index" into the cache. The index selects a particular row within the two-dimensional array, and therefore the number of address bits required for the index is determined by the number of rows configured into the instruction cache. The addresses associated with instruction bytes stored in the multiple blocks of a row are examined to determine if any of the addresses stored in the row match the requested address. If a match is found, the access is said to be a "hit", and the instruction cache provides the associated instruction bytes. If a match is not found, the access is said to be a "miss". When a miss is detected, the fetch control unit causes the instruction bytes to be transferred from the memory system into the instruction cache. The addresses associated with instruction bytes stored in the cache are also stored. These stored addresses are referred to as "tags".
The blocks of memory configured into a row form the columns of the row. Each block of memory is referred to as a "way"; multiple ways comprise a row. The way is selected by providing a way value to the instruction cache. The way value is determined by examining the tags for a row and finding a match between one of the tags and the input address from the fetch control unit. As used herein, the term "fetch control unit" refers to a unit configured to fetch instructions from the instruction cache and cause instructions not residing in the instruction cache to be transferred into the instruction cache.
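To make the row/way terminology concrete, the following C sketch models a conventional set-associative lookup, assuming the eight-way, 16-byte-line, 32-kilobyte geometry of the embodiment described elsewhere in this document (256 rows, so the index is address bits 11:4). All names are illustrative assumptions, and this is a behavioral model rather than the hardware described here.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_WAYS   8            /* blocks (columns) per row            */
#define NUM_ROWS   256          /* 32 KB / (16-byte line * 8 ways)     */
#define LINE_BYTES 16

typedef struct {
    bool     valid;
    uint32_t tag;               /* address bits above the index        */
    uint8_t  bytes[LINE_BYTES]; /* the cached instruction bytes        */
} cache_line_t;

static cache_line_t icache[NUM_ROWS][NUM_WAYS];

/* Conventional lookup: decode the index, then compare every tag in the
 * selected row.  This tag-compare step is exactly what the way
 * prediction unit removes from the critical path. */
static bool icache_lookup(uint32_t addr, unsigned *hit_way)
{
    unsigned row = (addr / LINE_BYTES) % NUM_ROWS; /* index: bits 11:4   */
    uint32_t tag = addr / (LINE_BYTES * NUM_ROWS); /* tag: bits above 11 */

    for (unsigned way = 0; way < NUM_WAYS; way++) {
        if (icache[row][way].valid && icache[row][way].tag == tag) {
            *hit_way = way;   /* hit: report which way matched */
            return true;
        }
    }
    return false;             /* miss: fetch the line from memory */
}

int main(void)
{
    /* 0x01230 and 0x71230 share index 0x23 but differ in tag, so an
     * associative cache can hold both lines simultaneously. */
    unsigned way;
    icache[0x23][0].valid = true;
    icache[0x23][0].tag   = 0x01230 / (LINE_BYTES * NUM_ROWS);
    printf("0x01230: %s\n", icache_lookup(0x01230, &way) ? "hit" : "miss");
    printf("0x71230: %s\n", icache_lookup(0x71230, &way) ? "hit" : "miss");
    return 0;
}
```

The demonstration in main() previews the hit-rate argument made in the next paragraph: both addresses map to row 0x23, so a direct-mapped cache of the same size could hold only one of the two lines at a time.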
It is well known that associative caches provide better "hit rates" (i.e., a higher percentage of accesses to the cache are hits) than caches that are configured as a linear array of storage locations (typically referred to as a direct-mapped configuration). The hit rates are better for an associative cache because instruction bytes stored at multiple addresses having the same index may be stored in the associative cache simultaneously, whereas a direct-mapped cache is capable of storing one set of instruction bytes per index. For example, a program having a loop that extends over two addresses with the same index can store instruction bytes from both addresses in an associative instruction cache, but will have to repeatedly reload the two addresses each time the loop is executed in a microprocessor having a direct-mapped cache. The hit rate in an instruction cache is important to the performance of the superscalar microprocessor, because when a miss is detected the instructions must be fetched from the memory system; the microprocessor will quickly become idle while waiting for the instructions to be provided. Unfortunately, associative caches require more access time than direct-mapped caches, since the tags must be compared to the address being searched for and the resulting hit or miss information must then be used to select which instruction bytes should be conveyed out of the instruction cache to the instruction processing pipelines of the superscalar microprocessor. With the clock cycles of superscalar microprocessors being shortened, this compare and select logic becomes a problem. Often an entire clock cycle or even longer is required to provide instructions from such an instruction cache.
Long instruction cache access times are a particular problem with respect to branch prediction mechanisms. Branch prediction mechanisms predict the next address that a computer program will attempt to fetch from memory, fetch the predicted address, and pass the instructions to the instruction processing pipelines. The predicted address may be the next sequential line in the cache, the target of a branch instruction contained in the current instruction cache line, or some other address. Branch prediction is important to the performance of superscalar microprocessors because they execute multiple instructions per clock cycle. Branches occur often in computer programs, on the average approximately once every four instructions. Therefore, a superscalar microprocessor configured to execute four or more instructions per clock cycle encounters a branch every clock cycle, on the average. Whether or not a particular branch instruction will be taken or not-taken may depend on the result of instructions that are near the branch instruction in the program, and therefore execute in parallel with the branch instruction. Superscalar microprocessors employ branch prediction to allow fetching and speculative execution of instructions while the branch is being executed. Various branch prediction mechanisms are well-known.
Multiple branches are predicted in many superscalar microprocessors so that more instructions may be fetched to prevent the execution units from idling and thereby degrading overall performance. Instructions that are fetched from an address that is mispredicted (i.e., the predicted address is determined to be wrong) may be discarded at later positions in the pipeline. However, for branch prediction to occur, the instructions must be read from the instruction cache to allow scanning of the instructions for branches. Because an associative cache may require the entire clock cycle (or more) to read instructions from the instruction cache, branch prediction would occur in the clock cycle after instructions are read. Branch prediction is a fairly complex function as well, requiring a significant portion of a clock cycle. Therefore, a fetch from a predicted address cannot occur until two cycles after the instruction sequence that contains the branch is fetched from the instruction cache. The instruction cache is accessed once every two cycles, utilizing half of the available instruction cache bandwidth. Better performance would be achieved if the entire available instruction cache bandwidth were used. A solution to the branch prediction and fetch mechanism requiring two cycles in a superscalar microprocessor is therefore desirable.
SUMMARY OF THE INVENTION
The problems outlined above are in large part solved by a superscalar microprocessor employing a way prediction unit in accordance with the present invention. In one embodiment, the way prediction unit predicts the next fetch address as well as the way of the instruction cache that the current fetch address hits while the instructions associated with the current fetch are being read from the instruction cache. Thus the two clock cycle instruction fetch mechanism is advantageously reduced to one clock cycle. Therefore, an instruction fetch can be made every clock cycle until an incorrect next fetch address or an incorrect way is predicted. The instructions from the predicted way are provided to the instruction processing pipelines of the superscalar microprocessor each clock cycle. Relatively higher performance may be achieved than would be possible with a superscalar microprocessor which provides instructions to the instruction processing pipelines every other clock cycle.
The way prediction unit may also enable higher frequencies (shorter clock cycles) while advantageously retaining the performance benefits of an associative instruction cache, because the long access time of the associative cache is no longer a limiting factor. The way selection is predicted instead of determined through tag comparisons, removing the dominant barrier to the use of highly associative instruction caches at high frequencies.
Broadly speaking, the present invention contemplates a way prediction unit comprising a first input port, an array and an output port. The first input port conveys an input address and an update value into the way prediction unit. The array includes a plurality of storage locations, wherein each of the plurality of storage locations is configured to store an address and a way value. Furthermore, the array includes a mechanism to select one of the plurality of storage locations as indexed by the input address. The output port conveys an output address and an output way value, which are the predicted fetch address for the next clock cycle and the predicted way for the fetch occurring in the current clock cycle.
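A C sketch of one such storage location and the output selection it drives is given below, using the field widths and select-bit encodings from the Figure 3 and Figure 4A discussion earlier in this document (two 12-bit prediction addresses, a 3-bit encoded way, and 2-bit target selection control bits). The identifiers are illustrative assumptions, not names from the patent.

```c
#include <stdint.h>

/* One storage location of the way prediction array (cf. Figure 4A):
 * the four fields pack into 29 bits, so an entry fits in a single
 * 32-bit register. */
typedef struct {
    unsigned pred_addr0 : 12; /* field 400: first stored prediction address  */
    unsigned pred_addr1 : 12; /* field 401: second stored prediction address */
    unsigned pred_way   : 3;  /* field 402: encodes one of the eight ways    */
    unsigned target_sel : 2;  /* field 403: selects the output address below */
} wp_entry_t;

/* The address selection device (cf. multiplexor 305 and address
 * selection circuit 306): a decode of the target selection control
 * bits picks the predicted fetch address. */
static uint16_t select_predicted_addr(const wp_entry_t *e,
                                      uint16_t next_index_addr,
                                      uint16_t return_stack_addr)
{
    switch (e->target_sel) {
    case 0x0: return next_index_addr;   /* "00": next sequential line  */
    case 0x1: return e->pred_addr0;     /* "01": first stored address  */
    case 0x2: return e->pred_addr1;     /* "10": second stored address */
    default:  return return_stack_addr; /* "11": RET/return stack      */
    }
}
```

At initialization every entry's target_sel would hold "00", so the unit falls back to predicting the next sequential index until branch behavior trains the stored addresses and way.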
The present invention further contemplates a mechanism in a microprocessor for predicting the index of a next block of instructions required by a program executing on said microprocessor and for predicting the way of a fetch address accessing the instruction cache in the current clock cycle, comprising a fetch control unit, a way prediction unit, and an instruction cache. The fetch control unit is configured to produce the address and update value that are inputs to the way prediction unit, and is further configured to receive the predicted address and predicted way from the way prediction unit as inputs. The way prediction unit comprises components as described above, and the instruction cache is configured to store blocks of contiguous instruction bytes and to receive a fetch address from the fetch control unit. The present invention still further contemplates a superscalar microprocessor comprising a way prediction unit as described herein.
BRIEF DESCRIPTION OF THE DRAWINGS
Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings, in which:
Figure 1 is a block diagram of a superscalar microprocessor employing a branch prediction unit in accordance with the present invention.
Figure 2 is a block diagram of several of the units from Figure 1, showing a way prediction unit within the branch prediction unit of Figure 1.
Figure 3 is a diagram showing the components of the way prediction unit depicted in Figure 2.
Figure 4A is a diagram showing the bit fields within a storage location of the way prediction unit depicted in Figure 3.
Figure 4B is a timing diagram depicting important relationships between the way prediction unit and the other units depicted in Figure 2.
Figure 4C is another timing diagram depicting several instruction fetches and their corresponding way predictions, including a way misprediction cycle.
Figure 4D is yet another timing diagram depicting several instruction fetches and their corresponding way predictions, including a target fetch address misprediction.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
DETAILED DESCRIPTION OF THE INVENTION
Turning now to the drawings, Figure 1 shows a block diagram of a superscalar microprocessor 200 including a branch prediction unit 220 employing a way prediction unit in accordance with the present invention. As illustrated in the embodiment of Figure 1, superscalar microprocessor 200 includes a prefetch/predecode unit 202 and a branch prediction unit 220 coupled to an instruction cache 204. Instruction alignment unit 206 is coupled between instruction cache 204 and a plurality of decode units 208A-208F (referred to collectively as decode units 208). Each decode unit 208A-208F is coupled to a respective reservation station unit 210A-210F (referred to collectively as reservation stations 210), and each reservation station 210A-210F is coupled to a respective functional unit 212A-212F (referred to collectively as functional units 212). Decode units 208, reservation stations 210, and functional units 212 are further coupled to a reorder buffer 216, a register file 218 and a load/store unit 222. A data cache 224 is finally shown coupled to load/store unit 222, and an MROM unit 209 is shown coupled to instruction alignment unit 206.
Generally speaking, instruction cache 204 is a high speed cache memory provided to temporarily store instructions prior to their dispatch to decode units 208. In one embodiment, instruction cache 204 is configured to cache up to 32 kilobytes of instruction code organized in lines of 16 bytes each (where each byte consists of 8 bits). During operation, instruction code is provided to instruction cache 204 by prefetching code from a main memory (not shown) through prefetch/predecode unit 202.
Prefetch/predecode unit 202 is provided to prefetch instruction code from the main memory for storage within instruction cache 204. In one embodiment, prefetch/predecode unit 202 is configured to burst 64-bit wide code from the main memory into instruction cache 204. It is understood that a variety of specific code prefetching techniques and algorithms may be employed by prefetch/predecode unit 202. As prefetch/predecode unit 202 fetches instructions from the main memory, it generates three predecode bits associated with each byte of instruction code: a start bit, an end bit, and a "functional" bit. The predecode bits form tags indicative of the boundaries of each instruction. The predecode tags may also convey additional information, such as whether a given instruction can be decoded directly by decode units 208 or whether the instruction must be executed by invoking a microcode procedure controlled by MROM unit 209, as will be described in greater detail below.
Table 1 indicates one encoding of the predecode tags. As indicated within the table, if a given byte is the first byte of an instruction, the start bit for that byte is set. If the byte is the last byte of an instruction, the end bit for that byte is set. If a particular instruction cannot be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is set. On the other hand, if the instruction can be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is cleared. The functional bit for the second byte of a particular instruction is cleared if the opcode is the first byte, and is set if the opcode is the second byte. It is noted that in situations where the opcode is the second byte, the first byte is a prefix byte. The functional bit values for instruction byte numbers 3-8 indicate whether the byte is a MOD R/M or an SIB byte, as well as whether the byte contains displacement or immediate data.
Table 1: Encoding of Start, End and Functional Bits

Instr Byte  Start Bit  End Bit  Functional
Number      Value      Value    Bit Value   Meaning
1           1          X        0           Fast decode
1           1          X        1           MROM instr
2           0          X        0           Opcode is first byte
2           0          X        1           Opcode is this byte; first byte is prefix
3-8         0          X        0           Mod R/M or SIB byte
3-8         0          X        1           Displacement or immediate data (the second
                                            functional bit set in bytes 3-8 indicates
                                            immediate data)
1-8         X          0        X           Not last byte of instruction
1-8         X          1        X           Last byte of instruction
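As a rough illustration of the Table 1 encoding, the following Python sketch classifies a byte's role from its predecode bits. This is a minimal model written for this description, not circuitry or code from the patent; the function and its names are hypothetical.

    def classify_predecode(byte_number, start, end, functional):
        """Interpret one byte's predecode bits per Table 1 (illustrative only)."""
        meaning = []
        if byte_number == 1 and start == 1:
            # Functional bit on the first byte distinguishes MROM from fast path.
            meaning.append("MROM instruction" if functional else "fast path decode")
        elif byte_number == 2 and start == 0:
            # Functional bit on byte 2 flags a prefix byte ahead of the opcode.
            meaning.append("opcode is this byte; first byte is a prefix"
                           if functional else "opcode is the first byte")
        elif 3 <= byte_number <= 8 and start == 0:
            meaning.append("displacement or immediate data"
                           if functional else "MOD R/M or SIB byte")
        meaning.append("last byte of instruction" if end else "not last byte")
        return meaning

For example, classify_predecode(1, 1, 0, 1) reports an MROM instruction whose first byte is not its last byte.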
As stated previously, in one embodiment certain instructions within the x86 instruction set may be directly decoded by decode units 208. These instructions are referred to as "fast path" instructions. The remaining instructions of the x86 instruction set are referred to as "MROM instructions". MROM instructions are executed by invoking MROM unit 209. When an MROM instruction is encountered, MROM unit 209 parses and serializes the instruction into a subset of defined fast path instructions to effectuate a desired operation. A listing of exemplary x86 instructions categorized as fast path instructions as well as a description of the manner of handling both fast path and MROM instructions will be provided further below.
Instruction alignment unit 206 is provided to channel variable byte length instructions from instruction cache 204 to fixed issue positions formed by decode units 208A-208F. Instruction alignment unit 206 is configured to channel instruction code to designated decode units 208A-208F depending upon the locations of the start bytes of instructions within a line as delineated by instruction cache 204. In one embodiment, the particular decode unit 208A-208F to which a given instruction may be dispatched is dependent upon both the location of the start byte of that instruction as well as the location of the previous instruction's start byte, if any. Instructions starting at certain byte locations may further be restricted for issue to only one predetermined issue position. Specific details follow.
Before proceeding with a detailed description of the way prediction unit within branch prediction unit 220, general aspects regarding other subsystems employed within the exemplary superscalar microprocessor 200 of Figure 1 will be described. For the embodiment of Figure 1, each of the decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above. In addition, each decode unit 208A-208F routes displacement and immediate data to a corresponding reservation station unit 210A-210F. Output signals from the decode units 208 include bit-encoded execution instructions for the functional units 212 as well as operand address information, immediate data and/or displacement data.
The superscalar microprocessor of Figure 1 supports out of order execution, and thus employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions. As will be appreciated by those of skill in the art, a temporary storage location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register to thereby store speculative register states. Reorder buffer 216 may be implemented in a first-in- first-out configuration wherein speculative results move to the "bottom" of the buffer as they are validated and written to the register file, thus making room for new entries at the "top" of the buffer. Other specific configurations of reorder buffer 216 are also possible, as will be described further below. If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218.
The bit-encoded execution instructions and immediate data provided at the outputs of decode units 208A-208F are routed directly to respective reservation station units 210A-210F. In one embodiment, each reservation station unit 210A-210F is capable of holding instruction information (i.e., bit-encoded execution bits as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit. It is noted that for the embodiment of Figure 1, each decode unit 208A-208F is associated with a dedicated reservation station unit 210A-210F, and that each reservation station unit 210A-210F is similarly associated with a dedicated functional unit 212A-212F. Accordingly, six dedicated "issue positions" are formed by decode units 208, reservation station units 210 and functional units 212. Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 210B and into functional unit 212B, and so on.
Upon decode of a particular instruction, if a required operand is a register location, register address information is routed to reorder buffer 216 and register file 218 simultaneously. Those of skill in the art will appreciate that the x86 register file includes eight 32-bit real registers (i.e., typically referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP), as will be described further below. Reorder buffer 216 contains temporary storage locations for results which change the contents of these registers to thereby allow out of order execution. A temporary storage location of reorder buffer 216 is reserved for each instruction which, upon decode, modifies the contents of one of the real registers. Therefore, at various points during execution of a particular program, reorder buffer 216 may have one or more locations which contain the speculatively executed contents of a given register. If following decode of a given instruction it is determined that reorder buffer 216 has previous location(s) assigned to a register used as an operand in the given instruction, the reorder buffer 216 forwards to the corresponding reservation station either 1) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous instruction. If the reorder buffer has a location reserved for a given register, the operand value (or tag) is provided from reorder buffer 216 rather than from register file 218. If there is no location reserved for a required register in reorder buffer 216, the value is taken directly from register file 218. If the operand corresponds to a memory location, the operand value is provided to the reservation station unit through load/store unit 222.
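The operand-routing decision just described can be summarized in a short sketch. This is a minimal Python model under simplifying assumptions (a dictionary-based reorder buffer with per-register entry lists); none of the names below come from the patent.

    def fetch_operand(reg, rob_entries, register_file):
        """Return ("value", v) or ("tag", t) for a register operand (sketch).

        rob_entries maps a register name to the list of reorder buffer entries
        allocated to it in program order; each entry has .done and .value.
        """
        entries = rob_entries.get(reg)
        if entries:
            latest = entries[-1]                # most recently assigned location wins
            if latest.done:
                return ("value", latest.value)  # speculative result already produced
            return ("tag", latest)              # wait for result forwarding
        return ("value", register_file[reg])    # no ROB location: use register file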
Details regarding suitable reorder buffer implementations may be found within the publication "Superscalar Microprocessor Design" by Mike Johnson, Prentice-Hall, Englewood Cliffs, New Jersey, 1991, and within the co-pending, commonly assigned patent application entitled "High Performance Superscalar Microprocessor", Serial No. 08/146,382, filed October 29, 1993 by Witt, et al. These documents are incorporated herein by reference in their entirety.
Reservation station units 210A-210F are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212F. As stated previously, each reservation station unit 210A-210F may store instruction information for up to three pending instructions. Each of the six reservation stations 210A-210F contains locations to store bit-encoded execution instructions to be speculatively executed by the corresponding functional unit and the values of operands. If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i.e., by completion of the execution of a previous instruction). It is noted that when an instruction is executed by one of the functional units 212A-212F, the result of that instruction is passed directly to any reservation station units 210A-210F that are waiting for that result at the same time the result is passed to update reorder buffer 216 (this technique is commonly referred to as "result forwarding"). Instructions are issued to functional units for execution after the values of any required operand(s) are made available. That is, if an operand associated with a pending instruction within one of the reservation station units 210A-210F has been tagged with a location of a previous result value within reorder buffer 216 which corresponds to an instruction which modifies the required operand, the instruction is not issued to the corresponding functional unit 212 until the operand result for the previous instruction has been obtained. Accordingly, the order in which instructions are executed may not be the same as the order of the original program instruction sequence. Reorder buffer 216 ensures that data coherency is maintained in situations where read-after-write dependencies occur.
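Result forwarding and the issue-readiness test lend themselves to a compact sketch. Again, this is an illustrative Python model rather than the patent's circuitry; the class and method names are invented for this description.

    class RSEntry:
        """One reservation station slot (simplified illustrative model)."""

        def __init__(self, op, operands):
            self.op = op
            self.operands = operands       # list of ("value", v) or ("tag", t) pairs

        def forward(self, tag, value):
            """Capture a broadcast result if any operand is waiting on this tag."""
            for i, (kind, payload) in enumerate(self.operands):
                if kind == "tag" and payload is tag:
                    self.operands[i] = ("value", value)

        def ready(self):
            """An instruction issues only once every operand holds a value."""
            return all(kind == "value" for kind, _ in self.operands)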
In one embodiment, each of the functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations.
Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 220. If a branch prediction was incorrect, branch prediction unit 220 flushes instructions subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes prefetch/predecode unit 202 to fetch the required instructions from instruction cache 204 or main memory. It is noted that in such situations, results of instructions in the original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.
Results produced by functional units 212 are sent to the reorder buffer 216 if a register value is being updated, and to the load/store unit 222 if the contents of a memory location are changed. If the result is to be stored in a register, the reorder buffer 216 stores the result in the location reserved for the value of the register when the instruction was decoded. As stated previously, results are also broadcast to reservation station units 210A-210F where pending instructions may be waiting for the results of previous instruction executions to obtain the required operand values.
Data cache 224 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the main memory subsystem. In one embodiment, data cache 224 has a capacity for storing up to eight kilobytes of data. It is understood that data cache 224 may be implemented in a variety of specific memory configurations, including a set associative configuration.
Generally speaking, load/store unit 222 provides an interface between functional units 212A-212F and data cache 224. In one embodiment, load/store unit 222 is configured with a load/store buffer with sixteen storage locations for data and address information for pending load or store memory operations. Functional units 212 arbitrate for access to the load/store unit 222. The load/store unit 222 also performs dependency checking for load memory operations against pending store memory operations to ensure that data coherency is maintained.
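The dependency check can be pictured with a short sketch: a load waits behind any older buffered store whose address matches, or is not yet known. This conservative policy is an assumption made for illustration; the patent does not specify the exact matching rules.

    def load_must_wait(load_addr, store_buffer):
        """True if an older pending store may overlap this load (sketch).

        store_buffer holds (addr, data) pairs for stores not yet written to
        data cache 224; addr is None while the store address is unresolved.
        """
        return any(addr is None or addr == load_addr
                   for addr, _ in store_buffer)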
Turning now to Figure 2, a block diagram of a portion of microprocessor 200 is shown. Included in the diagram are instruction cache 204 and branch prediction unit 220. Instruction cache 204 is further coupled to instruction alignment unit 206 and decode units 208 (shown in this diagram as a single block, although decode units 208 are shown as several blocks in Figure 1). Within branch prediction unit 220 is a way prediction unit 250 in accordance with the present invention. Branch prediction unit 220 may also contain other units (not shown). In one embodiment, instruction cache 204 is eight-way associative. Shown within instruction cache 204 are a fetch PC unit 254 and instruction cache storage 255.
Generally speaking, way prediction unit 250 generates a predicted fetch address for the next cache line to be fetched (i.e., a prediction of the branch prediction address) and a predicted way value for the fetch address accessing the instruction cache in the current clock cycle (the "current fetch address"). The predicted fetch address and predicted way are based on the current fetch address. By predicting the branch prediction address, microprocessor 200 is able to read the predicted line from instruction cache 204 while the branch prediction is being generated, advantageously removing a clock cycle from the instruction fetch process in cases where the predicted fetch address matches the branch prediction address. Furthermore, predicting the way for the current fetch address makes the instructions for the current fetch address available by the end of the clock cycle in which the fetch address accesses the cache, as opposed to the next cycle if the tag comparison information is used to select the instructions. The predicted way is validated by the tag comparison information in the following cycle.
The predicted fetch address and predicted way are transferred on a prediction bus 251 to fetch PC unit 254. In one embodiment, the predicted fetch address is a partial address containing the index bits used to index instruction cache 204. Way prediction unit 250 predicts the next instruction cache line index to be fetched and the way of the instruction cache which contains the current fetch address using the current fetch address, as conveyed from fetch PC unit 254 on a fetch request bus 252. Fetch request bus 252 also conveys way selection information for instruction cache 204, and a next index address which is the current fetch address incremented by the size of one instruction cache line. In one embodiment, the instruction cache line size is sixteen bytes and therefore the next index address is the current fetch address incremented by sixteen.
The current fetch address conveyed on fetch request bus 252 is the predicted fetch address conveyed by way prediction unit 250 in the previous clock cycle, except for cycles in which fetch PC unit 254 detects a way or predicted fetch address misprediction. During the clock cycle following a fetch address access to the cache, the way prediction and the predicted fetch address generated for that fetch address are validated. The way prediction is validated by comparing the actual way that the fetch address hits in instruction cache 204 (determined via a tag compare to the full fetch address) to the way prediction. If the way prediction matches the actual way, then the way prediction is correct. If the way prediction is wrong, then the correct way is selected from the eight ways of instruction cache 204 (which were latched from the previous clock cycle), the instructions associated with the predicted way are discarded, and the predicted fetch address is discarded as well. If the way prediction is correct, then the predicted fetch address is validated by comparing the predicted fetch address to the branch prediction address generated by branch prediction unit 220. In embodiments where the predicted fetch address is an index address, only the index bits of the branch prediction address are compared to the predicted fetch address. If the index bits match, the two addresses are defined to match and the branch prediction address is then used in the tag compares for the instruction cache lines associated with the predicted fetch address. If the predicted fetch address is incorrect, the associated instructions are discarded and the branch prediction address is fetched.
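The two-step validation can be sketched as follows, assuming the 32-kilobyte, eight-way, sixteen-byte-line geometry described above (which implies 256 sets). The control flow and names are a hedged illustration, not the patent's logic equations.

    LINE_SIZE = 16    # bytes per instruction cache line (one embodiment)
    NUM_SETS = 256    # 32 KB / (8 ways * 16-byte lines), assumed geometry

    def index_bits(addr):
        """Extract the set index field of an address for this geometry."""
        return (addr // LINE_SIZE) % NUM_SETS

    def validate(predicted_way, actual_way, predicted_index, branch_pred_addr):
        """Validate last cycle's way and fetch address predictions (sketch)."""
        if predicted_way != actual_way:
            # Wrong way: select the correct way from the latched ways and
            # discard both the predicted-way instructions and the address.
            return "way_mispredict"
        if predicted_index != index_bits(branch_pred_addr):
            # Wrong target: discard the associated instructions and fetch
            # the branch prediction address instead.
            return "address_mispredict"
        return "ok"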
As mentioned above, fetch PC unit 254 sends update information to way prediction unit 250 on fetch request bus 252. The update information for a particular predicted fetch address is sent the clock cycle following the predicted address validation, and includes an update way value and update selection control bits. If the prediction is correct, the update information is the predicted information. If the prediction is incorrect, the update way value is the way of instruction cache 204 which contains the instructions actually fetched, and the update selection control bits indicate whether the branch prediction address is a taken branch, a next sequential line fetch, or a RET instruction. The update information is stored by way prediction unit 250 such that the next prediction using a similar current fetch address will include the update information in the prediction mechanism. The prediction mechanism will be explained in more detail with respect to Figure 3.
Figure 2 also depicts a return stack address bus 253 connected to way prediction unit 250. Return stack address bus 253 conveys the address that is currently at the top of the return stack in decode units 208. The return stack is a stack of addresses that refer to instructions following previously executed CALL instructions. A RET instruction would use the address at the top of the return stack to locate the next instruction to be executed. As will be appreciated by one skilled in the art, the CALL and RET instructions are defined by the x86 architecture as subroutine entrance and exit instructions, respectively.
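A minimal model of such a return stack is sketched below. The class is invented for illustration; in particular, the assumption that the return target is the address of the byte following the CALL reflects standard x86 behavior rather than anything this description spells out.

    class ReturnStack:
        """Tracks predicted return addresses for CALL/RET (illustrative)."""

        def __init__(self):
            self._stack = []

        def on_call(self, call_addr, call_length):
            # The return target is the instruction following the CALL.
            self._stack.append(call_addr + call_length)

        def top(self):
            # Value conveyed on return stack address bus 253 each cycle.
            return self._stack[-1] if self._stack else None

        def on_ret(self):
            return self._stack.pop() if self._stack else None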
Way prediction unit 250 uses the next index address provided on fetch request bus 252 and the return stack address as sources for the predicted fetch address. The return stack address is selected during clock cycles in which way prediction unit 250 predicts that a RET instruction is in the cache line currently being fetched. Alternatively, the next index address is selected during clock cycles in which way prediction unit 250 predicts that no branch-type instructions exist in the cache line currently being fetched. Way prediction unit 250 also selects internally stored addresses as the predicted fetch address during clock cycles in which the prediction mechanism predicts that a branch-type instruction exists in the cache line, as will be described in further detail below.
Turning now to Figure 3, an embodiment of way prediction unit 250 is shown with return stack bus 253 and fetch request bus 252 connected to it. The current fetch address conveyed on fetch request bus 252 is decoded by a decoder circuit 300. The resulting select lines are stored by a delay latch 301, and also select a storage location from a way prediction array 302. In this embodiment, way prediction array 302 is configured as a linear array of storage locations. In another embodiment, way prediction array 302 is composed of registers. Each storage location is configured to store prediction addresses, a predicted way, and target selection control bits. The prediction addresses are branch prediction addresses previously generated from instructions residing at a fetch address with the same index bits as the current fetch address. The predicted way is the last correctly predicted way for a fetch address with the same index as the current fetch address. The target selection control bits indicate which of the stored prediction addresses should be selected, or whether the next index address or the return stack address should be selected. When microprocessor 200 is initialized, the target selection control bits of storage locations within way prediction array 302 are set to select the next index address.
Delay latch 301 transfers its value to an update latch 304. Update latch 304 selects a storage location within way prediction array 302 for storing the update information provided by fetch PC unit 254. Delay latch 301 and update latch 304 store the decoded selection lines for way prediction array 302, and thus avoid the need for a second decoder circuit similar to decoder 300. As will be appreciated by one skilled in the art, a decoder circuit such as decoder 300 is larger (in terms of silicon area) than delay latch 301 and update latch 304. Therefore, silicon area is saved by implementing this embodiment instead of an embodiment with a second decoder for updates.
Way prediction unit 250 is also configured with an address selection device for selecting the address to provide as the predicted fetch address. In this embodiment, the address selection device is a multiplexor 305 and an address selection circuit 306. Address selection circuit 306 receives the target selection control bits from way prediction array 302 and produces multiplexor select lines for multiplexor 305. In one embodiment, address selection circuit 306 causes multiplexor 305 to select the first address from way prediction array 302 if the target selection control bits contain the binary value "01", the second address from way prediction array 302 if the target selection control bits contain the binary value "10", the next index address from fetch request bus 252 if the target selection bits contain the binary value "00", and the return stack address from return stack address bus 253 if the selection control bits contain the binary value "11". Therefore, address selection circuit 306 is a decode of the target selection control bits. The predicted address is conveyed on prediction bus 251 along with the predicted way selected from way prediction array 302.
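The decode performed by address selection circuit 306 maps directly onto a lookup, as in the sketch below. The entry field names (address0, address1) are hypothetical; the two-bit encoding follows the values given above, and the next index address is the current fetch address plus the sixteen-byte line size in the described embodiment.

    def select_predicted_address(ctl_bits, entry, next_index_addr, return_stack_addr):
        """Mirror the target selection decode of circuit 306 (illustrative)."""
        return {
            0b00: next_index_addr,      # sequential fetch: current address + 16
            0b01: entry.address0,       # first stored branch prediction address
            0b10: entry.address1,       # second stored branch prediction address
            0b11: return_stack_addr,    # predicted RET in the current line
        }[ctl_bits]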
Turning now to Figure 4A, a diagram of one of the storage locations of way prediction array 302 (shown in Figure 3) is shown. In this embodiment, two prediction addresses are stored (shown as fields 400 and 401). Each of the prediction addresses is 12 bits wide. Way prediction information is stored in a field 402, which is 3 bits wide in this embodiment to encode the eight ways of instruction cache 204. In another embodiment, field 402 is 8 bits wide and the way prediction information is not encoded; instead, one of the eight bits is set to indicate the predicted way. Target selection bits are also stored within the storage location in field 403, which is 2 bits wide in this embodiment to encode selection of prediction address field 400, prediction address field 401, the next index address, or the return stack address. In another embodiment, field 403 is four bits wide and the target selection bits are not encoded; instead, one of the four bits is set to indicate one of the four possible address sources.
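Under the encoded embodiment, a storage location therefore occupies 12 + 12 + 3 + 2 = 29 bits. One way to picture it is as packed fields; the particular bit positions below are an assumption for illustration, since Figure 4A does not fix an ordering.

    def pack_entry(addr0, addr1, way, sel):
        """Pack the Figure 4A fields into one 29-bit word (layout assumed)."""
        assert addr0 < (1 << 12) and addr1 < (1 << 12)   # fields 400 and 401
        assert way < (1 << 3) and sel < (1 << 2)         # fields 402 and 403
        return (addr0 << 17) | (addr1 << 5) | (way << 2) | sel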
Turning now to Figure 4B, a timing diagram depicting important relationships between way prediction unit 250, instruction cache 204 and fetch PC unit 254 is shown. At the beginning of ICLK1, a current fetch address is sent from fetch PC unit 254 to instruction cache 204 and way prediction unit 250. Instruction cache 204 transfers the associated instructions to its output bus during the time indicated by the horizontal line 420 and latches them. In parallel, way prediction unit 250 indexes into way prediction array 302 and selects a storage location. From the value of the target selection control bits stored within the selected storage location, a predicted fetch address is generated. The predicted fetch address and the predicted way from the selected storage location are conveyed to fetch PC unit 254 near the end of ICLK1, as indicated by arrow 421. The predicted way is selected from the eight ways indexed by the current fetch address in ICLK1, and the selected instructions are scanned to form a branch prediction address. Also, the selected instructions are forwarded to instruction alignment unit 206 in ICLK2.
In ICLK2, the predicted fetch address is conveyed as the current fetch address. In parallel, fetch PC unit 254 determines whether or not the previous fetch address is an instruction cache hit in the predicted way, and branch prediction unit 220 generates a branch prediction address, as indicated by arrow 422. This information is used as described above to validate the predicted fetch address that is currently accessing instruction cache 204 and the predicted way that was provided in ICLK1. If a way misprediction is detected, the correct instructions are selected from the eight ways of instruction cache 204 latched in ICLK1, and the instructions read in ICLK1 are discarded. If a predicted address misprediction is detected, then the branch prediction address is fetched in ICLK3 at arrow 423 and the instructions read in ICLK2 are ignored. Otherwise, the predicted fetch address received during ICLK2 from way prediction unit 250 is used. Also at arrow 423, the way prediction and control selection bits are updated in way prediction unit 250 for the address fetched in ICLK1. ICLK4 is the clock cycle in which the RET instruction is detected and the return stack address prediction is validated, as shown at arrow 424. If the return stack address prediction is incorrect, then the corrected address is fetched and the way prediction information is updated at arrow 425. In one embodiment, return stack address predictions require two extra clock cycles to validate (as indicated by Figure 4B) as compared to next index address or branch prediction address predictions.
Turning now to Figure 4C, a timing diagram is shown to illustrate way misprediction. In ICLK1, a current fetch address A is conveyed to instruction cache 204 and way prediction unit 250, as indicated by block 440. A prediction address B is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 441. In ICLK2, address B is conveyed as a current fetch address because it was predicted in ICLK1, as indicated by block 442. In response to current fetch address B, prediction address C is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 443. At arrow 444, address A is determined to hit in the instruction cache in the predicted way, and at arrow 445 a branch prediction associated with address A is calculated. The branch prediction address matches the predicted address B, and therefore address B is a valid prediction. In ICLK3 at arrow 446, the predicted way for address B is found to be incorrect. Therefore, the correct instructions are selected from the eight ways that were latched in the previous cycle, as indicated by block 447. The instructions read in ICLK2 are discarded. Also, a new predicted address C is conveyed as indicated by block 448. In ICLK4, address C is fetched as indicated by block 449.
Turning now to Figure 4D, a timing diagram is shown depicting several consecutive instruction fetches, to further illustrate the interaction between fetch PC unit 254, way prediction unit 250, and instruction cache 204. In ICLK1, a current fetch address A is conveyed to instruction cache 204 and way prediction unit 250, as indicated by block 460. A prediction address B is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 461. In ICLK2, address B is conveyed as a current fetch address because it was predicted in ICLK1, as indicated by block 462. In response to current fetch address B, prediction address C is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 463. At arrow 464, address A is determined to hit in the instruction cache in the predicted way, and at arrow 465 a branch prediction associated with address A is calculated. The branch prediction address matches the predicted address B, and therefore address B is a valid prediction.
In ICLK3, address C is used as the current fetch address, as indicated by block 466. As for address A in ICLK2, address B is determined to be a hit in the predicted way, as indicated by arrow 467. However, at arrow 468 the branch prediction associated with address B is determined, and the branch prediction address does not match address C. Therefore, the predicted fetch address being conveyed in ICLK3 is ignored, as are the instructions associated with address C. In ICLK4, the corrected branch prediction address C is used as the current fetch address, as indicated by block 469. A predicted fetch address and way based on corrected address C are made by way prediction unit 250 in ICLK4, and the current fetch address for ICLK5 will reflect that prediction.
It is noted that the number and size of the addresses stored within way prediction array 302 may differ in other embodiments. In particular, the number of addresses stored may be more or less than in the embodiment of Figure 3. Furthermore, the number of external addresses added to the address prediction selection may vary from embodiment to embodiment, as will the number and encoding of the target selection control bits. It is also noted that the portion of the address stored within way prediction array 302 may vary from embodiment to embodiment, and the entire address may be stored in another embodiment. It is further noted that other embodiments could store multiple way predictions and select among them in a manner similar to the address selection device shown in Figure 3. It is also noted that some embodiments may store other information with each predicted address in way prediction array 302. For example, a way, a byte position within the instruction cache line, and branch prediction counter information may be stored within fields 400 and 401.
In accordance with the foregoing description, a superscalar microprocessor employing a way prediction unit is disclosed. The way prediction unit is provided to reduce the number of clock cycles needed to predict the next address that a code stream will fetch from the instruction cache. By reducing the number of clock cycles required to predict the address from two to one, instruction cache utilization is raised. Instructions are provided to the instruction processing pipelines continuously, advantageously reducing the idle clock cycles that the superscalar microprocessor endures. Therefore, overall performance may be increased.
Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

Claims

WHAT IS CLAIMED IS:
1. A way prediction unit associated with an instruction cache, comprising:
a first input port, wherein said first input port conveys an input address and an update value, and wherein said update value includes an update way value;
a way prediction storage coupled to said input port including a plurality of storage locations, wherein each of said plurality of storage locations is configured to store an address and a way value, and wherein said way prediction storage further includes a mechanism to select one of said plurality of storage locations indexed by said input address; and
an output port, wherein said output port conveys an output address and an output way value.
2. The way prediction unit as recited in claim 1 wherein each of said plurality of storage locations is further configured to store at least one additional address and a plurality of selection control bits, and wherein said selection control bits are indicative of the selection of one said address stored within one of said plurality of storage locations, and wherein said selection control bits are further indicative of the selection of said additional address stored within one of said plurality of storage locations, and wherein said address so selected is conveyed on said output port, and wherein said address and said additional address form a group of addresses stored within one of said storage locations.
3. The way prediction unit as recited in claim 2 further comprising an address selection device coupled to said array configured to select an address from said group of addresses stored within said one of said storage locations, wherein said selection device receives said selection control bits, and wherein said selection control bits control said selection by said address selection device.
4. The way prediction unit as recited in claim 3 further comprising a multiplexor and a multiplexor select circuit.
5. The way prediction unit as recited in claim 3 further comprising a second input port conveying a second input address, wherein said second input port is coupled to said address selection device, and wherein said selection control bits are further indicative of the selection of said second input address on said second input port to be conveyed on said output port.
6. The way prediction unit as recited in claim 5 wherein said second input address conveyed on said second input port is a return stack address, and wherein said return stack address is provided by a decode unit.
7. The way prediction unit as recited in claim 5 wherein said second input address conveyed on said second input port is a next index address, and wherein said next index address is said input address incremented by the number of bytes stored in an instruction cache line, and wherein said next index address is provided by a fetch PC unit.
8. The way prediction unit as recited in claim 1 further comprising an update selection mechanism coupled to said array wherein one of said plurality of said storage locations is selected to receive said update way value.
9. The way prediction unit as recited in claim 8 wherein said update selection mechanism is a separate storage location independent of said plurality of storage locations, and wherein said separate storage location stores an input address from a previous clock cycle.
10. The way prediction unit as recited in claim 1 wherein said plurality of storage locations within said way prediction storage comprise an array.
11. The way prediction unit as recited in claim 1 wherein said plurality of storage locations within said way prediction storage comprise a plurality of registers.
12. A mechanism for predicting the way of an instruction cache containing a block of instructions currently being fetched and for predicting the index of a next block of instructions required by a program, comprising:
a fetch control unit configured to produce a fetch address and an update value, wherein said update value includes an update way value, and wherein said fetch control unit is further configured with an input port, and wherein said input port is configured to receive a next fetch address and a way value;
a way prediction unit coupled to said fetch control unit, wherein said way prediction unit includes:
a first input port, wherein said first input port conveys an input address and an update value, and wherein said input address is said fetch address from said fetch control unit, and wherein said update value is said update value from said fetch control unit;
an array coupled to said input port including a plurality of storage locations, wherein each of said plurality of storage locations is configured to store an address and a way value, and wherein said array further includes a mechanism to select one of said plurality of storage locations indexed by said input address; and
an output port, wherein said output port conveys an output address and an output way value, and wherein said output address and said output way value are conveyed to said input port of said fetch control unit, and wherein said output address is said next fetch address on said input port of said fetch control unit, and wherein said output way value is said way value on said input port of said fetch control unit; and
an instruction cache configured to store blocks of contiguous instruction bytes, wherein said instruction cache receives said fetch address from said fetch control unit.
13. The mechanism as recited in claim 12 further comprising an instruction alignment unit coupled to said instruction cache wherein said instruction cache is further configured to receive way selection information from said fetch control unit, and wherein said way selection information selects a way within said instruction cache, and wherein said instruction cache is configured to transfer instruction bytes from the indexed line and way to said instruction alignment unit.
14. The mechanism as recited in claim 12 wherein said fetch control unit is further configured to convey said next fetch address as said fetch address in a following clock cycle.
15. The mechanism as recited in claim 12 wherein said way prediction unit is configured to select said output address from said addresses stored within one of said plurality of storage locations indexed by said input address.
16. The mechanism as recited in claim 15 wherein said way prediction unit is further configured to select said output address from a next index address provided by said fetch control unit.
17. The mechanism as recited in claim 15 further comprising a decode unit coupled to said way prediction unit wherein said way prediction unit is further configured to select said output address from a return stack address provided by said decode unit.
18. The mechanism as recited in claim 12 wherein said fetch control unit is configured to validate said next fetch address and said way prediction provided in a clock cycle, wherein said validation occurs in a first subsequent clock cycle following said clock cycle, and wherein said next fetch address and said way prediction are validated if said previous fetch address is a hit in said instruction cache in a way equal to said way prediction and if index bits of a branch prediction generated from a plurality of instructions stored at said previous fetch address match index bits of said next fetch address, and wherein said fetch control unit generates a corrected fetch address if said next fetch address and said way prediction are not validated.
19. The mechanism as recited in claim 18 wherein said fetch control unit is configured to convey said corrected fetch address as said fetch address in a second subsequent clock cycle following a first subsequent clock cycle in which said next fetch address and said way prediction are not validated.
20. The mechanism as recited in claim 19 wherein said fetch control unit is further configured to generate said update value in said second subsequent clock cycle.
21. A superscalar microprocessor comprising:
a branch prediction unit including a way prediction unit including:
a first input port wherein said first input port conveys an input address and an update value, and wherein said update value includes an update way value;
a way prediction storage coupled to said input port including a plurality of storage locations wherein each of said plurality of storage locations is configured to store an address and a way value, and wherein said way prediction storage further includes a mechanism to select one of said plurality of storage locations indexed by said input address; and
an output port wherein said output port conveys an output address and an output way value; and
an instruction cache for storing previously fetched instruction blocks coupled to said branch prediction unit wherein said instruction cache comprises a plurality of blocks of memory.
22. The superscalar microprocessor as recited in claim 21 wherein each of said plurality of storage locations within said way prediction unit is further configured to store at least one additional address and a plurality of selection control bits, and wherein said selection control bits are indicative of the selection of one said address stored within one of said plurality of storage locations, and wherein said selection control bits are further indicative of the selection of said additional address stored within one of said plurality of storage locations, and wherein said address so selected is conveyed on said output port, and wherein said address and said additional address form a group of addresses stored within one of said storage locations.
23. The superscalar microprocessor as recited in claim 22 wherein said way prediction unit further includes an address selection device coupled to said array configured to select an address from said group of addresses stored within said one of said storage locations wherein said selection device receives said selection control bits, and wherein said selection control bits control said selection by said address selection device.
24. The superscalar microprocessor as recited in claim 23 wherein said address selection device comprises a multiplexor and a multiplexor select circuit.
25. The superscalar microprocessor as recited in claim 23 wherein said way prediction unit further includes a second input port conveying a second input address wherein said second input port is coupled to said address selection device, and wherein said selection control bits are further indicative of the selection of said second input address on said second input port to be conveyed on said output port.
26. The superscalar microprocessor as recited in claim 25 wherein said second input address conveyed on said second input port of said way prediction unit is a return stack address, and wherein said return stack address is provided by a decode unit.
27. The superscalar microprocessor as recited in claim 25 wherein said second input address conveyed on said second input port of said way prediction unit is a next index address, and wherein said next index address is said input address incremented by the number of bytes stored in an instruction cache line, and wherein said next index address is provided by a fetch PC unit.
28. The superscalar microprocessor as recited in claim 21 wherein said way prediction unit further includes an update selection mechanism coupled to said array wherein one of said plurality of said storage locations is selected to receive said update way value.
29. The superscalar microprocessor as recited in claim 28 wherein said update selection mechanism of said way prediction unit is a separate storage location independent of said plurality of storage locations, and wherein said separate storage location stores an input address to said way prediction unit from a previous clock cycle.
30. The superscalar microprocessor as recited in claim 21 wherein said plurality of storage locations within said way prediction storage within said way prediction unit comprise an array.
31. The superscalar microprocessor as recited in claim 21 wherein said plurality of storage locations within said way prediction storage within said way prediction unit comprise a plurality of registers.
32. The superscalar microprocessor as recited in claim 21 further comprising:
an instruction alignment unit coupled to said instruction cache for aligning instructions to a plurality of decode units;
said plurality of decode units for decoding said plurality of instruction bytes transferred from said instruction alignment unit, coupled to said instruction alignment unit;
a prefetch/predecode unit coupled to said instruction cache for prefetching and predecoding instructions from a main memory;
an MROM unit coupled to said instruction alignment unit for microcoding difficult instructions;
a plurality of reservation stations wherein each one of said plurality of reservation stations is coupled to a respective one of said plurality of decode units for storing decoded instructions until one of a plurality of functional units is available to execute said decoded instructions and said decoded instructions have been provided with their operands;
said plurality of functional units wherein each one of said plurality of functional units is coupled to a respective one of said plurality of reservation stations for executing said decoded instructions stored in said respective one of said plurality of reservation stations;
a load/store unit coupled to said plurality of functional units and said plurality of decode units for executing load/store instructions;
a data cache coupled to said load/store unit for storing previously fetched data memory locations;
a reorder buffer coupled to said plurality of functional units, said load/store unit, and said plurality of decode units wherein said reorder buffer stores speculatively executed results until said results are no longer speculative; and
a register file coupled to said plurality of decode units and said reorder buffer for storing the non- speculative state of the register set.