WO1998002806A1

WO1998002806A1 - A data address prediction structure utilizing a stride prediction method

Info

Publication number: WO1998002806A1
Application number: PCT/US1996/011847
Authority: WO
Inventors: James K. Pickett
Original assignee: Advanced Micro Devices, Inc.
Priority date: 1996-07-16
Filing date: 1996-07-16
Publication date: 1998-01-22
Also published as: EP0912928A1

Abstract

A data prediction structure is provided. The data prediction structure stores base addresses and stride values in a prediction array. The base address and the stride value are added to form a data prediction address which is then used to fetch data bytes into a relatively small, relatively fast buffer which may be accessed by the decode stage(s) of the instruction processing pipeline. If the data associated with an operand address calculated by a decode stage resides in the buffer, the clock cycles used to perform the load operation occur before the instruction reaches the execution stage of the instruction processing pipeline. The execution stage clock cycles that are saved may be used to execute other instructions. Additionally, the base address is updated to the address generated by a decode unit each time a basic block is executed, and the stride value is updated when the data prediction address is found to be incorrect. In this way, the data prediction address may be more accurate than a static data prediction address.

Description

TITLE: A DATA ADDRESS PREDICTION STRUCTURE UTILIZING A STRIDE PREDICTION METHOD

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to the field of superscalar microprocessors and, more particularly, to data prediction mechanisms in superscalar microprocessors.

2. Description of the Relevant Art

Superscalar microprocessors achieve high performance by simultaneously executing multiple instructions in a clock cycle and by specifying the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time during which the pipeline stages of a microprocessor perform their intended functions. For example, superscalar microprocessors are typically configured with instruction processing pipelines which process instructions. The processing of instructions includes the actions of fetching, dispatching, decoding, executing, and writing back results. Each action may be implemented in one or more pipeline stages, and an instruction flows through each of the pipeline stages where an action or portion of an action is performed. At the end of a clock cycle, the instruction and the values resulting from performing the action of the current pipeline stage are moved to the next pipeline stage. When an instruction reaches the end of an instruction processing pipeline, it is processed and the results of executing the instruction have been recorded.

Since superscalar microprocessors execute multiple instructions per clock cycle and the clock cycle is short, a high bandwidth memory system is required to provide instructions and data to the superscalar microprocessor (i.e. a memory system that can provide a large number of bytes in a short period of time). Without a high bandwidth memory system, the microprocessor would spend a large number of clock cycles waiting for instructions or data to be provided, then would execute the received instructions and/or instructions dependent upon the received data in a relatively small number of clock cycles. Overall performance would be degraded by the large number of idle clock cycles. Superscalar microprocessors are, however, ordinarily configured into computer systems with a relatively large main memory composed of dynamic random access memory (DRAM) cells. DRAM cells are characterized by access times which are significantly longer than the clock cycle of modern superscalar microprocessors. Also, DRAM cells typically provide a relatively narrow output bus to convey the stored bytes to the superscalar microprocessor. Therefore, DRAM cells provide a memory system that provides a relatively small number of bytes in a relatively long period of time, and do not form a high bandwidth memory system.

Because superscalar microprocessors are typically not configured into a computer system with a memory system having sufficient bandwidth to continuously provide instructions and data, superscalar microprocessors are often configured with caches. Caches include multiple blocks of storage locations configured on the same silicon substrate as the microprocessor or coupled nearby. The blocks of storage locations are used to hold previously fetched instruction or data bytes The bytes can be transferred from the cache to the destination (a register or an instruction processing pipeline) quickly; commonly one or two clock cycles are required as opposed to a large number of clock cycles to transfer bytes from a DRAM main memory

Unfortunately, the latency of one or two clock cycles within a data cache is becoming a performance problem for superscalar microprocessors as they attempt to execute more instructions per clock cycle The problem is of particular importance to superscalar microprocessors configured to execute instructions from a complex instruction set architecture such as the x86 architecture As will be appreciated by one skilled in the art, x86 instructions allow for one of their "operands" (the values that the instruction operates on) to be stored in a memory location. Instructions which have an operand stored in memory are said to have an implicit (memory read) operation and an explicit operation which is defined by the particular instruction being executed (i.e an add instruction has an explicit operation of addition) Such an instruction therefore requires an implicit address calculation for the memory read to retrieve the operand, the implicit memory read, and the execution of the explicit operation of the instruction (for example, an addition). Typical superscalar microprocessors have required that these operations be performed by the execute stage of the instruction processing pipeline The execute stage of the instruction processing pipeline is therefore occupied for several clock cycles when executing such an instruction For example, a superscalar microprocessor might require one clock cycle for the address calculation, one to two clock cycles for the data cache access, and one clock cycle for the execution of the explicit operation of the instruction. Conversely, instructions with operands stored in registers configured within the microprocessor retrieve the operands before they enter the execute stage of the instruction processing pipeline, since no address calculation is needed to locate the operands Instructions with operands stored in registers would therefore only require the one clock cycle for execution of the explicit operation A structure allowing the retrieval of memory operands for instructions before they enter the execute stage is therefore desired

SUMMARY OF THE INVENTION

The problems outlined above are in large part solved by a data prediction structure for a superscalar microprocessor in accordance with the present invention. The data prediction structure stores base addresses and stride values in a prediction array. A particular base address and associated stride value are added to form a data prediction address. The data prediction address is then used to fetch data bytes into a relatively small, relatively fast buffer which may be accessed by the decode stage(s) of the instruction processing pipeline If the data associated with an operand address calculated by a decode stage resides in the buffer, the data is routed to the corresponding reservation station Advantageously, the clock cycles used to perform the load operation occur before the instruction reaches the execution stage of the instruction processing pipeline. The execution stage clock cycles that are saved may be used to execute other instructions Additionally, the base address is updated to the address generated by a decode unit each time a basic block is executed, and the stride value is updated when the data prediction address is found to be incorrect In this way, the data prediction address is in many cases more accurate then a static data prediction address that changes only when it is found to be incorrect Advantageously, the implicit load may be performed more often prior to the instruction reaching the execute stage of the instruction processing pipeline Furthermore, if several correct predictions are made in consecutive executions of a basic block, the prediction must be incorrect in several consecutive executions of the basic block before the stride is changed according to one embodiment of the present invention Advantageously, a single execution of another basic block whose instruction addresses index the same storage location as the correct prediction information does not destroy the prediction information

The present invention contemplates a method for predicting a data address which will be referenced by a plurality of instructions residing in a basic block when the basic block is fetched, comprising several steps First, a data prediction address is generated from a base address and a stride value during a clock cycle in which a data prediction counter indicates that the base address and the stride value are valid

Second, the data prediction address is fetched from a data cache into a data buffer Third, the data buffer is accessed for load data

The present invention further contemplates a data address prediction structure comprising an array, an adder circuit and a data buffer The array includes a plurality of storage locations for storing a plurality of base addresses and a plurality of stride values A particular one of the plurality of storage locations in which a particular one of the plurality of base addresses and a particular one of the plurality of stride values is stored is selected by the instruction address of a branch instruction which begins a basic block containing an instruction which generates one of the plurality of addresses The adder circuit is coupled to the array for adding a base address and a stride value conveyed from the array to produce a data prediction address The data buffer is included for storing a plurality of bytes associated with the data prediction address

BRIEF DESCRIPTION OF THE DRAWINGS

Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which

Figure 1 is a block diagram of a superscalar microprocessor

Figure 2 is a diagram depicting a branch prediction unit, a load/store unit, and a data cache of the superscalar microprocessor of Figure 1 , along with several elements of one embodiment of the present invention Figure 3 A is a diagram of the branch prediction unit shown in Figure 2 depicting several elements of one embodiment of the present invention

Figure 3B is a diagram of a decode unit of the superscalar microprocessor shown in Figure 1 showing several elements of one embodiment of the present invention

Figure 4 is a diagram of the information stored in a storage location of a branch prediction array in accordance with the present invention

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims

DETAILED DESCRIPTION OF THE INVENTION

Turning now to Figure 1, a block diagram of a superscalar microprocessor 200 including decode units 208 and a branch prediction unit 220 in accordance with the present invention is shown As illustrated in the embodiment of Figure 1 , superscalar microprocessor 200 includes a prefetch/predecode unit 202 and a branch prediction unit 220 coupled to an instruction cache 204 Instruction alignment unit 206 is coupled between instruction cache 204 and a plurality of decode units 208A-208F (referred to collectively as decode units 208) Each decode unit 208A-208F is coupled to respective reservation station units 210A-210F (referred to collectively as reservation stations 210), and each reservation station 210A-21 OF is coupled to a respective functional unit 212A-212F (referred to collectively as functional units 212) Decode units 208, reservation stations 210, and functional units 212 are further coupled to a reorder buffer 216. a register file 218 and a load/store unit 222 A data cache 224 is finally shown coupled to load/store unit 222, and an MROM unit 209 is shown coupled to instruction alignment unit 206

Generally speaking, instruction cache 204 is a high speed cache memory provided to temporarily store instructions prior to their dispatch to decode units 208 In one embodiment, instruction cache 204 is configured to cache up to 32 kilobytes of instruction code organized in lines of 16 bytes each (where each byte consists of 8 bits) During operation, instruction code is provided to instruction cache 204 by prefetching code from a main memory (not shown) through prefetch/predecode unit 202 It is noted that instruction cache 204 could be implemented in a set-associative, a fully-associative, or a direct-mapped configuration

Prefetch/predecode unit 202 is provided to prefetch instruction code from the main memory for storage within instruction cache 204 In one embodiment, prefetch/predecode unit 202 is configured to burst 64-bit wide code from the main memory into instruction cache 204 It is understood that a variety of specific code prefetching techniques and algorithms may be employed by prefetch/predecode unit 202

As prefetch/predecode unit 202 fetches instructions from the main memory, it generates three predecode bits associated with each byte of instruction code a start bit, an end bit, and a "functional" bit The predecode bits form tags indicative of the boundaries of each instruction The predecode tags may also convey additional information such as whether a given instruction can be decoded directly by decode units 208 or whether the instruction must be executed by invoking a microcode procedure controlled by MROM unit 209, as will be described in greater detail below

Table 1 indicates one encoding of the predecode tags As indicated within the table, if a given byte is the first byte of an instruction, the start bit for that byte is set If the byte is the last byte of an instruction, the end bit for that byte is set If a particular instruction cannot be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is set On the other hand, if the instruction can be directly decoded by the decode units 208. the functional bit associated with the first byte of the instruction is cleared The functional bit for the second byte of a particular instruction is cleared if the opcode is the first byte, and is set if the opcode is the second byte It is noted that in situations where the opcode is the second byte, the first byte is a prefix byte The functional bit values for instruction byte numbers 3-8 indicate whether the byte is a MODRM or an SIB byte or whether the byte contains displacement or immediate data Table 1 Encoding of Start, End and Functional Bits

Instr Start End Functional

Byte Bit Bit Bit

Number Value Value Value Meaninε

1 1 X 0 Fast decode 1 1 X 1 MROM instr

2 0 X 0 Opcode is first byte

2 0 X 1 Opcode is this byte, first byte is prefix

3-8 0 X 0 Mod R M or SIB byte

3-8 0 X 1 Displacement or immediate data, the second functional bit set in bytes 3-8 indicates immediate data

1 -8 X 0 X Not last byte of instruction

1 -8 X 1 X Last byte of instruction

As stated previously, in one embodiment certain instructions within the x86 instruction set may be directly decoded by decode units 208 These instructions are referred to as "fast path" instructions The remaining instructions of the x86 instruction set are referred to as "MROM instructions" MROM instructions are executed by invoking MROM unit 209 More specifically, when an MROM instruction is encountered, MROM unit 209 parses and serializes the instruction into a subset of defined fast path instructions to effectuate a desired operation A listing of exemplary x86 instructions categorized as fast path instructions as well as a description of the manner of handling both fast path and MROM instructions will be provided further below

Instruction alignment unit 206 is provided to channel variable byte length instructions from instruction cache 204 to fixed issue positions formed by decode units 208A-208F Instruction alignment unit 206 independently and in parallel selects instructions from three groups of instruction bytes provided by instruction cache 204 and arranges these bytes into three groups of preliminary issue positions Each group of issue positions is associated with one of the three groups of instruction bytes The preliminary issue positions are then merged together to form the final issue positions, each of which is coupled to one of decode units 208

Before proceeding with a detailed description of the data address prediction structure embodied in decode units 208 and branch prediction unit 220, general aspects regarding other subsystems employed within the exemplary superscalar microprocessor 200 of Figure 1 will be described For the embodiment of Figure 1 , each of the decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above In addition, each decode unit 208A-208F routes displacement and immediate data to a corresponding reservation station unit 210A-210F Output signals from the decode units 208 include bit-encoded execution instructions for the functional units 212 as well as operand address information, immediate data and/or displacement data

The superscalar microprocessor of Figure 1 supports out of order execution, and thus employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions As will be appreciated by those of skill in the art, a temporary storage location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register to thereby store speculative register states Reorder buffer 216 may be implemented in a first-m- first-out configuration wherein speculative results move to the "bottom" of the buffer as they are validated and written to the register file, thus making room for new entries at the "top" of the buffer Other specific configurations of reorder buffer 216 are also possible, as will be described further below If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218

The bit-encoded execution instructions and immediate data provided at the outputs of decode units 208A-208F are routed directly to respective reservation station units 21 OA-21 OF In one embodiment, each reservation station unit 210A-210F is capable of holding instruction information (l e , bit encoded execution bits as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit It is noted that for the embodiment of Figure 1, each decode unit 208A-208F is associated with a dedicated reservation station unit 210A-210F, and that each reservation station unit 210A-210F is similarly associated with a dedicated functional unit 212A-2 I2F Accordingly, six dedicated "issue positions" are formed by decode units 208, reservation station units 210 and functional units 212 Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212 A for execution Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 21 OB and into functional unit 212B, and so on

Upon decode of a particular instruction, if a required operand is a register location, register address information is routed to reorder buffer 216 and register file 218 simultaneously Those of skill in the art will appreciate that the x86 register file includes eight 32 bit real registers (1 e , typically referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP) Reorder buffer 216 contains temporary storage locations for results which change the contents of these registers to thereby allow out of order execution A temporary storage location of reorder buffer 216 is reserved for each instruction which, upon decode, is determined to modify the contents of one of the real registers Therefore, at various points during execution of a particular program, reorder buffer 216 may have one or more locations which contain the speculatively executed contents of a given register If following decode of a given instruction it is determined that reorder buffer 216 has a previous location or locations assigned to a register used as an operand in the given instruction, the reorder buffer 216 forwards to the corresponding reservation station either I ) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous instruction If the reorder buffer has a location reserved for a given register, the operand value (or tag) is provided from reorder buffer 216 rather than from register file 218 If there is no location reserved for a required register in reorder buffer 216, the value is taken directly from register file 218 If the operand corresponds to a memory location, the operand value is provided to the reservation station unit through load/store unit 222

Details regarding suitable reorder buffer implementations may be found within the publication "Superscalar Microprocessor Design" by Mike Johnson, Prentice-Hall, Englewood Cliffs, New Jersey, 1991 , and within the co-pending, commonly assigned patent application entitled "High Performance Superscalar Microprocessor", Serial No 08/146,382, filed October 29, 1993 by Witt, et al These documents are incorporated herein by reference in their entirety

Reservation station units 210A-210F are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212F As stated previously, each reservation station unit 210A-210F may store instruction information for up to three pending instructions Each of the six reservation stations 21 OA-21 OF contain locations to store bit-encoded execution instructions to be speculatively executed by the corresponding functional unit and the values of operands If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i e , by completion of the execution of a previous instruction) It is noted that when an instruction is executed by one of the functional units 212A- 212F, the result of that instruction is passed directly to any reservation station units 210A -21 OF that are waiting for that result at the same time the result is passed to update reorder buffer 216 (this technique is commonly referred to as "result forwarding") Instructions are issued to functional units for execution after the values of any required operand(s) are made available That is, if an operand associated with a pending instruction within one of the reservation station units 210A-210F has been tagged with a location of a previous result value within reorder buffer 216 which corresponds to an instruction which modifies the required operand, the instruction is not issued to the corresponding functional unit 212 until the operand result for the previous instruction has been obtained Accordingly, the order in which instructions are executed may not be the same as the order of the original program instruction sequence Reorder buffer 216 ensures that data coherency is maintained in situations where read-after-wπte dependencies occur

In one embodiment, each of the functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations

Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 220 If a branch prediction was incorrect, branch prediction unit 220 flushes instructions subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes prefetch/predecode unit 202 to fetch the required instructions from instruction cache 204 or main memory It is noted that in such situations, results of instructions in the original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216 Exemplary configurations of suitable branch prediction mechanisms are well known

Results produced by functional units 212 are sent to the reorder buffer 216 if a register value is being updated, and to the load/store unit 222 if the contents of a memory location is changed If the result is to be stored in a register, the reorder buffer 216 stores the result in the location reserved for the value of the register when the instruction was decoded As stated previously, results are also broadcast to reservation station units 210A-210F where pending instructions may be waiting for the results of previous instruction executions to obtain the required operand values

Generally speaking, load store unit 222 provides an interface between functional units 212A-212F and data cache 224 In one embodiment, load/store unit 222 is configured with a load/store buffer with eight storage locations for data and address information for pending loads or stores Decode units 208 arbitrate for access to the load/store unit 222 When the buffer is full, a decode unit must wait until the load/store unit 222 has room for the pending load or store request information The load/store unit 222 also performs dependency checking for load instructions against pending store instructions to ensure that data coherency is maintained Data cache 224 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the mam memory subsystem In one embodiment, data cache 224 has a capacity of storing up to eight kilobytes of data It is understood that data cache 224 may be implemented in a variety of specific memory configurations, including a set associative configuration

Turning now to Figure 2, several elements of one embodiment of a data address prediction structure are shown Branch prediction unit 220 is included, along with load/store unit 222 and data cache 224 Branch prediction unit 220 is connected to a data prediction bus 253 which is coupled to an arbitration multiplexor 254 Also coupled to arbitration multiplexor 254 is a second request bus 255 and an arbitration select line 256 In this embodiment, signals on both second request bus 255 and arbitration select line 256 originate in load/store unit 222 The output bus of arbitration multiplexor 254 is coupled to an input port of data cache 224 A first request bus 257 is coupled between load/store unit 222 and an input port of data cache 224 Data cache 224 is configured with two output ports which are coupled to a first reply bus 258 and a second reply bus 259 Both first reply bus 258 and second reply bus 259 are coupled to load/store unit 222. and second reply bus 259 is coupled to a data buffer 260 Associated with data buffer 260 is a data buffer control unit 263 Data buffer 260 is also coupled to a plurality of decode address request buses 261 provided by decode units 208 Associated with the plurality of decode address request buses 261 is a plurality of data buses 262

Generally speaking, data buffer 260 is a relatively small, high speed cache provided to store data bytes associated with data prediction addresses provided by branch prediction unit 220 During a clock cycle in which branch prediction unit 220 predicts a taken branch instruction, it also produces a data prediction address on data prediction bus 253 The data prediction address is a prediction of the data addresses that are used by instructions residing at the target address of the predicted branch instruction, and is generated from a stored set of addresses and associated stride values as will be detailed below

During clock cycles in which the data prediction address is selected by arbitration multiplexor 254, the data prediction address accesses data cache 224 The data cache line of data bytes associated with the data prediction address is transferred on second reply bus 259 to a write port on data buffer 260 Data buffer 260 stores the data bytes along with the data prediction address under the direction of control unit 263

Decode units 208, in a later clock cycle, may request data bytes associated with an implicit memory read operation of an instruction using the plurality of decode address request buses 261 The requested data bytes are provided to the respective reservation station 210 associated with a decode unit 208 which requested the data bytes Because data buffer 260 is relatively small, it is capable of providing data within the same clock cycle that an address request is made (as opposed to requiring the entire clock cycle or more, as a typical data cache does) Therefore, the implicit memory read associated with an instruction may be performed before the instruction enters a functional unit 212, and only the explicit operation of the instruction is executed in functional unit 212 Performance may be advantageously increased by allowing other instructions to execute during clock cycles that such an implicit memory read operation would have occupied functional unit 212 As stated previously, branch unit 220 produces a data prediction address during clock cycles in which a taken branch is predicted Instructions residing at the target address of the taken branch instruction are, therefore in a new "basic block" Basic blocks are blocks of instructions having the property that if the first instruction in the block is executed, then each subsequent instruction in the block is executed Basic blocks end with a branch instruction One property often associated with the instructions in a basic block is "data locality" The property of data locality exists when the instructions of a basic block access data memory locations which are physically near each other (perhaps in the same data cache line) Therefore, if a data cache line that a basic block will access is fetched simultaneously with the instructions that reside in the basic block, the data cache latency can be endured before the instructions are prepared to access the data bytes Then, when the instructions access the data buffer the latency is shorter Data may be accessed in an amount of time similar to a register data access from register file 218 of Figure 1

When a data prediction address misprediction is detected (as described below with respect to Figures 3A and 3B), the branch instruction address associated with the data prediction address and the correct data prediction address are conveyed to branch prediction unit 220 Branch prediction unit 220 stores the correct prediction address for use in a subsequent prediction for the basic block

The data prediction address is conveyed on data prediction bus 253 to arbitration multiplexor 254, which arbitrates between the data prediction address and a second request from load/store unit 222 The term arbitration, as used herein, means the selection of one request over another according to an arbitration scheme In one embodiment, the arbitration scheme is a priority scheme in which a load/store request conveyed on second request bus 255 is a higher priority that the data prediction address request Therefore, the arbitration select signal conveyed on arbitration select line 256 is an indication that a load/store request is being conveyed on second request bus 255 If a valid load/store request is being made during a clock cycle, arbitration multiplexor 254 selects the load/store request to be conveyed to data cache 224 If no valid load/store request is being made during a clock cycle arbitration multiplexor 254 selects the data prediction address to be conveyed to data cache 224

In one embodiment, load/store unit 222 is configured to make up to two requests to data cache 224 in a clock cycle In clock cycles where load/store unit 222 is making one request, first request bus 257 is used to allow the data prediction address the maximum opportunity to access data cache 224 If a data prediction address request is valid during a clock cycle in which load/store unit 222 is making no requests or one request to data cache 224, then the data prediction address request will be given access to data cache 224 Load/store unit 222 receives the data bytes associated with requests on first request bus 257 and second request bus 255 on first reply bus 258 and second reply bus 259, respectively It is noted that other embodiments of load/store unit 222 may have different numbers of request buses for data cache 224

Second reply bus 259 is also coupled to a write port on data buffer 260 Data buffer control unit 263 detects cycles when the data prediction address is selected by arbitration multiplexor 254 and hits in data cache 224, and causes data buffer 260 to store the data conveyed on second reply bus 259 In one embodiment, the entire cache line associated with the data prediction address is conveyed on second reply bus 259 to data buffer 260 In this embodiment, data buffer 260 is configured with eight storage locations each of which store a data prediction address and the associated data bytes Data buffer control unit 263 operates the storage locations of this embodiment as a fϊrst-in, first-out stack It is noted that other replacement policies could be alternatively used

Data buffer 260 is also configured with a number of read ports, one for each decode unit 208 Each of the plurality of decode address request buses 261 is coupled to a respective read port Decode units 208 each provide a request address on one of the plurality of decode address request buses 261 Each request address is compared to the addresses stored in data buffer 260 If a match is detected, data bytes associated with the address are conveyed on the associated one of the plurality of data buses 262 to the reservation station unit 21 OA-21 OF associated with the requesting decode unit 208A-208F The associated reservation station 210A-210F stores the data bytes in the same manner that it stores data bytes from a register If the data bytes for the memory operand of an instruction are stored with the instruction in a reservation station 210A-210F, then the associated functional unit 212A-212F does not perform the implicit memory read for that instruction Instead, it executes the explicit operation of the instruction If the data bytes of the memory operand are not stored with the instruction in a reservation station 210A-210F then the implicit address calculation and memory read operation are performed by the associated functional unit 212A-212F prior to performing the explicit operation of the instruction

Data buffer control unit 263 is further configured to monitor first request bus 257 and second request bus 255 for store memory accesses When data buffer control unit 263 detects a store memory access, it invalidates the associated cache line in data buffer 260 (if the cache line is currently stored in data buffer 260) Similarly, if data cache 224 replaces a line with data fetched from main memory, the replaced line is invalidated in data buffer 260 Reorder buffer 216 implements dependency checking for reads that follow writes in cases where the reads receive their data from data buffer 260 before the write occurs to data cache 224 In this manner, read-after-wπte hazards and data coherency are correctly handled with respect to data buffer 260

Turning now to Figure 3A, an embodiment of branch prediction unit 220 is shown The data prediction information is stored within a prediction array 270 In this embodiment, prediction array 270 is a linear array of storage locations The current fetch address (conveyed from instruction cache 204) is used to index into prediction array 270 and select a storage location Stored within the selected storage location is a base address, a stride value, and a data prediction counter The base address is an address generated by an instruction within a basic block, wherein the address of the instructions within the basic block index to the selected storage location The stride value is the difference between the two addresses generated by an instruction within a basic block (or blocks) on two consecutive executions of the instruction The stride value is a signed value allowing both positive and negative strides to be stored The data prediction counter is configured to indicate the validity of the base address and the stride value Furthermore, the data prediction counter may store values indicating that a number of previous data prediction addresses were correct predictions A comparable number of consecutive mispredictions must then be detected before the stride value is changed Generally speaking, the data prediction counter value may be incremented and decremented but does not decrement below zero nor increment above the largest value that may be represented by the data prediction counter Instead, if the data prediction counter contains zero and is decremented, it remains zero If the data prediction counter contains the largest value that it may represent and is incremented, it remains at the largest value that it may represent The data prediction counter is therefore a saturating counter

An adder circuit 271 is coupled to the output port of prediction array 270 Adder circuit 271 is configured to add the stride value stored within the selected storage location to the base address stored within the selected storage location, creating a data prediction address The data prediction address is conveyed on data prediction bus 253, along with the associated data prediction counter value In one embodiment, if the stride value is invalid (as indicated by the data prediction counter value), then the base address is added to zero

In one embodiment, branch prediction unit 220 receives four sets of inputs from decode units 208 First, branch prediction unit 220 receives the plurality of decode address request buses 261 which convey the actual data addresses generated by decode units 208 as well as a valid bit for each address indicating the validity of the address If a decode unit 208A-208F detects a mispredicted data prediction address, the associated one of decode address request buses 261 will convey the corrected addresses (as will be explained below with respect to Figure 3B) Each decode unit 208A-208F also provides one of a plurality of mispredicted lines 273 indicating whether or not the actual data address conveyed on the associated one of decode address request buses 261 matches the data prediction address provided with the instruction currently being decoded In one embodiment, the actual data address and the data prediction address are determined to match if they he within the same cache line Other granularities may be used for the comparison in other embodiments Additionally, branch prediction unit 220 receives a plurality of branch address buses 274 from decode units 208 conveying branch instruction addresses associated with the data prediction addresses received by decode units 208 The branch address is used to index into prediction array 270 in order to store the updated base address, stride value, and data prediction counter values Finally, prediction address buses 276 are received by branch prediction unit 220 Prediction address buses 276 convey the data prediction address received by each decode unit 208A-208F along with the associated data prediction counter value Decode address request buses 261, mispredicted lines 273, branch address buses 274 and prediction address buses 276 are received by a prediction validation and correction block 275 within branch prediction unit 220

If one of mispredicted lines 273 signals an incorrect data address prediction and the corresponding address valid bit from the associated one of decode address request buses 261 indicates the actual data address conveyed on that associated decode address request bus is valid, then a data address misprediction has occurred Prediction validation and correction block 275 causes the corresponding actual data address to be stored as the base address in prediction array 270 m the storage location indexed by the corresponding branch address The associated data prediction counter value is decremented for this case and stored as the data prediction counter in the indexed storage location. In this manner, a data prediction address is corrected if predicted incorrectly. Additionally, a base address is created if a valid base address was not previously stored in prediction array 270 as indicated by the associated data prediction counter. A particular one of mispredicted lines 273 indicates a misprediction if the data prediction address associated with the instruction currently being decoded is invalid. Therefore, the address provided on the associated one of decode address request buses 261 is stored as the base address.

If a particular one of mispredicted lines 273 signals a correct prediction and the corresponding address valid bit conveyed on the corresponding one of decode address request buses 261 is set, then the data prediction address is correctly predicted. Prediction validation and correction block 275 causes the corresponding actual data address to be stored as the base address within prediction array 270 in the storage location indexed by the corresponding branch instruction address. The corresponding data prediction counter value is incremented and stored as the data prediction counter in the indexed storage location.

If the address valid bit of a particular one of decode address request buses 261 is not set, then no prediction validation information is conveyed with respect to the instruction decoded by the corresponding decode unit.

In either the misprediction case or correct prediction case detailed above, another check is performed. If the current data prediction counter value (i.e. prior to incrementing or decrementing) indicates that the stride value is invalid, a new stride value is calculated by prediction validation and correction block 275. The new stride value is the difference between the prediction value conveyed on the associated one of prediction address buses 276 and the actual data address conveyed on the associated one of decode address request buses 261. The new stride value is stored as the stride value in the indexed storage location within prediction array 270. If the stride value is invalid (as indicated by the data prediction counter) then the data prediction counter is incremented regardless of the misprediction/prediction status of the current data address prediction.

For mispredicted lines and address valid bits in a given clock cycle, decode unit 208A's outputs are given highest priority, then decode unit 208B's outputs, etc. In this manner, the first data address calculated by the instructions in a basic block is used to validate and create the data prediction information (including the base address, the stride value, and the data prediction counter). Once data prediction information is validated or changed for a particular basic block, the associated branch instruction address is saved by prediction validation and correction block 275. Subsequent signals are ignored by prediction validation and correction unit 275 until a branch address different from the saved branch instruction address is conveyed from a decode unit 208 to branch prediction unit 220. Prediction validation and correction block 275 has a dedicated write port into prediction array 270 to perform its updates, in one embodiment.

Because the prediction is formed by adding a formerly generated data address and a stride value generated from the difference between two consecutive generations of the data address, the prediction address advantageously predicts the next address correctly for at least two types of data access patterns First, if an instruction accesses the same address each time it executes, then the stride value will be zero and the data prediction address associated with that instruction will be the same each time the instruction executes Therefore, the current data prediction structure will correctly predict static data addresses Additionally, instructions which access a regular pattern of data addresses such that the data address accessed on consecutive executions of the instruction differ by a fixed amount may receive correct address predictions from the present data prediction structure Static data prediction structures (which store the previously generated data address without a stride value) do not predict these types of instruction sequences correctly Therefore, the present prediction structure may correctly predict a larger percentage of data addresses than static data prediction structures

Turning now to Figure 3B, an embodiment of decode unit 208A is depicted Decode units 208B- 208F are configured similar to decode unit 208A The logic circuits shown in Figure 3B are exemplary circuits for implementing the data prediction structure Other circuits (not shown) decode x86 instructions into a format suitable for functional units 212 Therefore, decode units 208 are each a decode stage of an instruction processing pipeline

In order to implement one embodiment of the data prediction structure, several logic elements are added to decode unit 208A an adder circuit 300, a prediction address register 301. a comparator circuit 302, an address valid logic circuit 303, and a branch address register 304 Adder circuit 300 is used to add the contents of a base register and a displacement value to produce an actual data address if the instruction being decoded by decode unit 208A has a memory operand In one embodiment the base register is the EBP or ESP register, and the displacement value is conveyed with the instruction bytes as part of the instruction This type of addressing is a common addressing mode used for x86 instructions If the base register is not available, or the instruction does not have an operand residing in a memory location, or the addressing mode is different then a base register and displacement value, then validation logic circuit 303 indicates that the actual data address formed by adder circuit 300 is invalid Reservation station unit 210A ignores any data bytes that may be transferred from data buffer 260 in clock cycles where the address formed by adder circuit

300 is invalid The output of address valid circuit 303 is conveyed with the address, as indicated in Figure 3B If the address formed by adder circuit 300 is invalid and the instruction requires a memory operand, the implicit load operation associated with the instruction will be executed by functional unit 212A

The actual data address formed by adder circuit 300 is conveyed on decode address request bus 261 A to data buffer 260 and branch prediction unit 220 Decode address request bus 261 A is one of the plurality of decode address request buses 261 If the actual data address is a hit in data buffer 260, the associated data bytes are returned to reservation station 210A on one of the plurality of data buses 262 The actual data address is also conveyed to comparator circuit 302 which is coupled to prediction address register

301 Prediction address register 301 stores the data prediction address for the instruction currently being decoded along with the .associated data prediction counter value The data prediction address and data prediction counter value are transferred to decode unit 208A with the instruction The data prediction counter value is indicative of whether or not a data address prediction was made for the block associated with the instructions (stored in prediction address register 301) In one embodiment, the data prediction counter value indicates a valid data prediction address if it conveys a value greater than zero If both the data prediction address and the actual data address are valid and compare equal within a specified granularity (such as a cache line), then comparator circuit 302 conveys a value indicating that the data address prediction is correct to branch prediction unit 220 If either address is invalid or the addresses do not compare equal, then the output of comparator circuit 302 conveys a value indicating that the data prediction address is incorrect to branch prediction unit 220 The value is conveyed on a mispredicted line 273A Mispredicted line 273 A is one of the plurality of mispredicted lines 273 The branch instruction address stored in branch address register 304 is conveyed on a branch address bus 274A Branch address bus 274A is one of the plurality of branch address buses 274 The data prediction address and data prediction counter value stored within prediction address register 301 are conveyed on prediction address bus 276A Prediction address bus 276A is one of the plurality of prediction address buses 276

Turning now to Figure 4, a diagram of the fields within a storage location 400 of prediction array

270 is shown A field of information stored in storage location 400 of prediction array 270 is the base address field 401 The base address is stored in this field In one embodiment, base address field 401 is 32 bits wide Associated with the base address is a stride value stored in stride value field 402 In one embodiment stride value field 401 is 10 bits wide The base address and stride value are added to form the data prediction address

Finally, data prediction counter field 403 is included in storage location 400 The data prediction counter is stored in this field In one embodiment, the data prediction counter is two bits A binary value of 00 for the data prediction counter indicates that neither the base address field 401 nor the stride value field 402 contain valid values A binary value of 01 indicates that base address field 401 is valid but stride value field 402 does not contain a valid value The base address field is added to zero for this encoding, and the stride value is updated when the actual data address is generated by one of decode units 208 A binary value of 10 or 1 1 indicates that both base address field 401 and stride value field 402 are valid Having two values with the same indication allows for a stride value to be maintained until two consecutive mispredictions are detected for basic blocks which index storage location 400 In this way prediction accuracy is maintained for the case where a basic block is intermittently executed between multiple executions of a second basic block and the two basic blocks index the same storage location 400 Other embodiments may implement more or less bits for the data prediction counter value

In another embodiment of the data address prediction structure data prediction address bus 253 is connected to an additional read port on data cache 224 In this embodiment, arbitration multiplexor 254 may be eliminated It is noted that, although the preceding discussion described an embodiment of the data prediction structure within a processor implementing the x86 architecture, any microprocessor architecture could benefit from the data prediction structure In accordance with the above disclosure, a data address prediction structure is described which allows an implicit memory operation of an instruction to be performed before the instruction reaches a functional unit. Therefore, the number of cycles required in a functional unit to execute the instruction is advantageously reduced. Performance may be increased by implementing such a data prediction structure.

Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

Claims

WHAT IS CLAIMED IS:

1 A method for predicting a data address which will be referenced by a plurality of instructions residing in a basic block when said basic block is fetched, comprising

generating a data prediction address from a base address and a stride value during a clock cycle in which a data prediction counter indicates that said base address and said stride value are valid,

fetching data associated with said data prediction address from a data cache into a data buffer, and

accessing said data buffer for load data

2 The method as recited in claim 1 wherein said base address, said stride value, and said data prediction counter are stored in a prediction array

3 The method as recited in claim 2 further comprising updating said prediction array with a new base address during a second clock cycle in which a second data address is generated by executing said plurality of instructions residing in said basic block

4 The method as recited in claim 3 further comprising incrementing said data prediction counter during said second clock cycle if said data prediction address is correctly predicted

5 The method as recited in claim 4 further comprising updating said prediction array with a new stride value during a second clock cycle if said data prediction counter indicates that said stride value is invalid

6 The method as recited in claim 5 wherein said new stride value is a difference between said data prediction address and said second data address generated by executing said plurality of instructions within said basic block

7 The method as recited in claim 5 further comprising decrementing said data prediction counter during said second clock cycle if which said data prediction address is found to be mispredicted and said data prediction counter indicates that said stride value is valid

8 The method as recited in claim 1 further comprising generating an actual data address from a base register and a displacement data wherein said base register and said displacement data are specified by one of said plurality of instructions

9 The method as recited in claim 8 further comprising comparing said actual data address to said data prediction address 10 The method as recited in claim 9 further comprising transferring said actual data address and a result of said comparing to a branch prediction unit

1 1 The method as recited in claim 8 wherein said accessing said data buffer further comprises determining if a plurality of bytes associated with said actual data address are stored within said data buffer

12 The method as recited in claim 1 1 further comprising transferring said plurality of bytes to a decode stage of an instruction processing pipeline if said plurality of bytes are stored within said data buffer

13 The method as recited in claim 1 wherein said generating a data prediction address is performed during a third clock cycle in which said basic block is being fetched from an instruction cache

14 The method as recited in claim 1 wherein said accessing is performed from a decode stage of an instruction processing pipeline

15 A data address prediction structure comprising

an array including a plurality of storage locations for storing a plurality of base addresses and a plurality of stride values wherein a particular one of said plurality of storage locations in which a particular one of said plurality of base addresses and a particular one of said plurality of stride values is stored is selected by the instruction address of a branch instruction which begins a basic block containing an instruction which generates said one of said plurality of base addresses,

an adder circuit coupled to said array for adding a base address and a stride value conveyed from said array to produce a data prediction address, and

a data buffer for storing a plurality of bytes associated with said data prediction address

16 The data address prediction structure as recited in claim 15 wherein said array is a linear array of said storage locations

17 The data address prediction structure as recited in claim 15 wherein said data buffer is further configured to simultaneously store a plurality of pluralities of data bytes wherein each one of said plurality of pluralities of data bytes is associated with a different data prediction address

18 The data address prediction structure as recited in claim 17 further comprising a data cache coupled to said data buffer and further coupled to said adder circuit, wherein said data prediction address is used to fetch data from said data cache into said data buffer 19 The data address prediction structure as recited in claim 17 further comprising a decode stage coupled to said data buffer wherein said decode stage includes a second adder circuit for producing a data address from a base register and a displacement value

20 The data address prediction structure as recited in claim 19 wherein said decode stage further includes a comparator circuit and a register wherein said comparator circuit is coupled to said second adder circuit and said register

21 The data address prediction structure as recited in claim 20 wherein said register is configured to store said data prediction address

22 The data address prediction structure as recited in claim 21 wherein said comparator circuit is configured to produce a value indicative of the correctness of said data prediction address

23 The data address prediction structure as recited in claim 22 wherein said comparator circuit is configured to convey said value to said array