WO1998020416A1 - Stride-based data address prediction structure - Google Patents

Stride-based data address prediction structure

Info

Publication number
WO1998020416A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
data
prediction
instruction
recited
Prior art date
Application number
PCT/US1996/017516
Other languages
English (en)
Inventor
James K. Pickett
Original Assignee
Advanced Micro Devices, Inc.
Priority date
Filing date
Publication date
Application filed by Advanced Micro Devices, Inc. filed Critical Advanced Micro Devices, Inc.
Priority to AU76667/96A priority Critical patent/AU7666796A/en
Priority to PCT/US1996/017516 priority patent/WO1998020416A1/fr
Publication of WO1998020416A1 publication Critical patent/WO1998020416A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes
    • G06F9/345 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes of multiple operands or results
    • G06F9/3455 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes of multiple operands or results using stride
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 Operand accessing
    • G06F9/383 Operand prefetching
    • G06F9/3832 Value prediction for operands; operand history buffers

Definitions

  • TITLE: A STRIDE-BASED DATA ADDRESS PREDICTION STRUCTURE
  • This invention relates to the field of superscalar microprocessors and, more particularly, to data prediction mechanisms in superscalar microprocessors.
  • The term "clock cycle" refers to an interval of time during which the pipeline stages of a microprocessor perform their intended functions.
  • Superscalar microprocessors are typically configured with instruction processing pipelines which process instructions. The processing of instructions includes the actions of fetching, dispatching, decoding, executing, and writing back results. Each action may be implemented in one or more pipeline stages, and an instruction flows through each of the pipeline stages where an action or portion of an action is performed. At the end of a clock cycle, the instruction and the values resulting from performing the action of the current pipeline stage are moved to the next pipeline stage. When an instruction reaches the end of an instruction processing pipeline, it has been fully processed and the results of executing the instruction have been recorded.
  • Because superscalar microprocessors execute multiple instructions per clock cycle and the clock cycle is short, a high bandwidth memory system is required to provide instructions and data to the superscalar microprocessor (i.e. a memory system that can provide a large number of bytes in a short period of time). Without a high bandwidth memory system, the microprocessor would spend a large number of clock cycles waiting for instructions or data to be provided, then would execute the received instructions and/or instructions dependent upon the received data in a relatively small number of clock cycles. Overall performance would be degraded by the large number of idle clock cycles.
  • Superscalar microprocessors are, however, ordinarily configured into computer systems with a relatively large main memory composed of dynamic random access memory (DRAM) cells.
  • DRAM cells are characterized by access times which are significantly longer than the clock cycle of modern superscalar microprocessors. Also, DRAM cells typically provide a relatively narrow output bus to convey the stored bytes to the superscalar microprocessor. Therefore, DRAM cells provide a memory system that provides a relatively small number of bytes in a relatively long period of time, and do not form a high bandwidth memory system.
  • Because superscalar microprocessors are typically not configured into a computer system with a memory system having sufficient bandwidth to continuously provide instructions and data, superscalar microprocessors are often configured with caches.
  • Caches include multiple blocks of storage locations configured on the same silicon substrate as the microprocessor or coupled nearby. The blocks of storage locations are used to hold previously fetched instruction or data bytes. The bytes can be transferred from the cache to the destination (a register or an instruction processing pipeline) quickly; commonly one or two clock cycles are required as opposed to a large number of clock cycles to transfer bytes from a DRAM main memory.
  • x86 instructions allow for one of their "operands" (the values that the instruction operates on) to be stored in a memory location. Instructions which have an operand stored in memory are said to have an implicit (memory read) operation and an explicit operation which is defined by the particular instruction being executed (i.e. an add instruction has an explicit operation of addition).
  • Such an instruction therefore requires an implicit address calculation for the memory read to retrieve the operand, the implicit memory read, and the execution of the explicit operation of the instruction (for example, an addition).
  • Typical superscalar microprocessors have required that these operations be performed by the execute stage of the instruction processing pipeline.
  • The execute stage of the instruction processing pipeline is therefore occupied for several clock cycles when executing such an instruction. For example, a superscalar microprocessor might require one clock cycle for the address calculation, one to two clock cycles for the data cache access, and one clock cycle for the execution of the explicit operation of the instruction.
  • In contrast, instructions with operands stored in registers configured within the microprocessor retrieve the operands before they enter the execute stage of the instruction processing pipeline, since no address calculation is needed to locate the operands. Instructions with operands stored in registers would therefore only require the one clock cycle for execution of the explicit operation.
  • A structure allowing the retrieval of memory operands for instructions before they enter the execute stage is therefore desired.
  • The problems outlined above are in large part solved by a data prediction structure for a superscalar microprocessor in accordance with the present invention.
  • The data prediction structure stores base addresses and stride values in a prediction array. A particular base address and associated stride value are added to form a data prediction address.
  • The data prediction address is then used to fetch data bytes, which are conveyed to the reservation stations of the superscalar microprocessor. If the data associated with an operand address calculated by a functional unit resides in the reservation station, the data is used as the operand.
  • The clock cycles used to perform the load operation occur before the instruction reaches the reservation station. The instruction therefore occupies the reservation station for fewer clock cycles than would be necessary utilizing the conventional method of executing these instructions. The clock cycles that are saved may be profitably used to store other instructions.
  • The base address is updated to the address generated by a functional unit each time an associated instruction is executed, and the stride value is updated when the data prediction address is found to be incorrect. The data prediction address is in many cases more accurate than a static data prediction address that changes only when it is found to be incorrect. The implicit load may therefore be performed more often prior to the instruction reaching the reservation station.
  • The prediction must be incorrect in several consecutive executions before the stride is changed (according to one embodiment of the present invention). A single execution of another instruction whose instruction address indexes the same storage location as the correct prediction information therefore does not destroy the prediction information.
  • The present invention contemplates a method for predicting a data address which will be referenced by a plurality of instructions when said plurality of instructions are fetched, comprising several steps.
  • A data prediction address is generated from a base address and a stride value during a clock cycle in which a data prediction counter indicates that the base address and the stride value are valid. Data associated with the data prediction address is fetched from a data cache. Then, the data is stored within a plurality of reservation stations.
  • The present invention further contemplates a data address prediction structure comprising an array, an adder circuit, and a reservation station.
  • The array includes a plurality of storage locations for storing a plurality of base addresses and a plurality of stride values. A particular storage location is selected by an instruction address which identifies an instruction; the instruction generates one of the plurality of base addresses.
  • The adder circuit is coupled to the array for adding a base address and a stride value conveyed from the array to produce a data prediction address. The reservation station stores the instruction and data associated with the data prediction address.
  • Figure 1 is a block diagram of a superscalar microprocessor.
  • Figure 2 is a diagram depicting a branch prediction unit, a load/store unit, and a data cache of the superscalar microprocessor of Figure 1, along with several elements of one embodiment of the present invention.
  • Figure 3A is a diagram of the branch prediction unit shown in Figure 2 depicting several elements of one embodiment of the present invention.
  • Figure 3B is a diagram of a reservation station of the superscalar microprocessor shown in Figure 1 showing several elements of one embodiment of the present invention.
  • Figure 4 is a diagram of the information stored in a storage location of a branch prediction array in accordance with the present invention.
  • Superscalar microprocessor 200 includes a prefetch predecode unit 202 and a branch prediction unit 220 coupled to an instruction cache 204.
  • Instruction alignment unit 206 is coupled between instruction cache 204 and a plurality of decode units 208A-208F (referred to collectively as decode units 208).
  • Each decode unit 208A-208F is coupled to a respective reservation station unit 210A-210F (referred to collectively as reservation stations 210), and each reservation station 210A-210F is coupled to a respective functional unit 212A-212F (referred to collectively as functional units 212).
  • Decode units 208, reservation stations 210, and functional units 212 are further coupled to a reorder buffer 216, a register file 218 and a load/store unit 222.
  • Finally, a data cache 224 is shown coupled to load/store unit 222, and an MROM unit 209 is shown coupled to instruction alignment unit 206.
  • Instruction cache 204 is a high speed cache memory provided to temporarily store instructions prior to their dispatch to decode units 208.
  • Instruction cache 204 is configured to cache up to 32 kilobytes of instruction code organized in lines of 16 bytes each (where each byte consists of 8 bits).
  • Instruction code is provided to instruction cache 204 by prefetching code from a main memory (not shown) through prefetch/predecode unit 202. It is noted that instruction cache 204 could be implemented in a set-associative, a fully-associative, or a direct-mapped configuration.
  • Prefetch/predecode unit 202 is provided to prefetch instruction code from the main memory for storage within instruction cache 204.
  • Prefetch/predecode unit 202 is configured to burst 64-bit wide code from the main memory into instruction cache 204. It is understood that a variety of specific code prefetching techniques and algorithms may be employed by prefetch/predecode unit 202.
  • As prefetch/predecode unit 202 fetches instructions from the main memory, it generates three predecode bits associated with each byte of instruction code: a start bit, an end bit, and a "functional" bit.
  • The predecode bits form tags indicative of the boundaries of each instruction. The predecode tags may also convey additional information, such as whether a given instruction can be decoded directly by decode units 208 or whether the instruction must be executed by invoking a microcode procedure controlled by MROM unit 209, as will be described in greater detail below.
  • Table 1 indicates one encoding of the predecode tags. As indicated within the table, if a given byte is the first byte of an instruction, the start bit for that byte is set. If the byte is the last byte of an instruction, the end bit for that byte is set. If a particular instruction cannot be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is set. On the other hand, if the instruction can be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is cleared. The functional bit for the second byte of a particular instruction is cleared if the opcode is the first byte, and is set if the opcode is the second byte.
  • In situations where the opcode is the second byte, the first byte is a prefix byte.
  • The functional bit values for instruction byte numbers 3-8 indicate whether the byte is a MODRM or an SIB byte, or whether the byte contains displacement or immediate data.
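
The start, end, and functional bits described above can be sketched in Python. This is illustrative only: `inst_len`, `opcode_index`, and `is_mrom` are hypothetical inputs standing in for information a real predecoder extracts from the byte stream itself, and the MODRM/SIB/displacement encoding for bytes 3-8 is omitted.

```python
def predecode(inst_len, opcode_index, is_mrom):
    # Produce a (start, end, functional) bit tuple for each byte of one
    # instruction, following the encoding described in the text.
    bits = []
    for i in range(inst_len):
        start = 1 if i == 0 else 0                # first byte of instruction
        end = 1 if i == inst_len - 1 else 0       # last byte of instruction
        if i == 0:
            func = 1 if is_mrom else 0            # set: MROM, clear: fast path
        elif i == 1:
            func = 1 if opcode_index == 1 else 0  # set when byte 0 is a prefix
        else:
            func = 0  # bytes 3-8 would encode MODRM/SIB/displacement/immediate
                      # information; omitted from this sketch
        bits.append((start, end, func))
    return bits
```

For example, a three-byte fast-path instruction with its opcode in byte 0 yields start/end bits marking only its boundary bytes.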
  • Certain instructions within the x86 instruction set may be directly decoded by decode units 208. These instructions are referred to as "fast path" instructions. The remaining instructions of the x86 instruction set are referred to as "MROM instructions".
  • MROM instructions are executed by invoking MROM unit 209. More specifically, when an MROM instruction is encountered, MROM unit 209 parses and serializes the instruction into a subset of defined fast path instructions to effectuate the desired operation.
  • A listing of exemplary x86 instructions categorized as fast path instructions, as well as a description of the manner of handling both fast path and MROM instructions, will be provided further below.
  • Instruction alignment unit 206 is provided to channel variable byte length instructions from instruction cache 204 to fixed issue positions formed by decode units 208A-208F. Instruction alignment unit 206 independently and in parallel selects instructions from three groups of instruction bytes provided by instruction cache 204 and arranges these bytes into three groups of preliminary issue positions. Each group of issue positions is associated with one of the three groups of instruction bytes. The preliminary issue positions are then merged together to form the final issue positions, each of which is coupled to one of decode units 208.
  • Each of the decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above.
  • Each decode unit 208A-208F routes displacement and immediate data to a corresponding reservation station unit 210A-210F.
  • Output signals from the decode units 208 include bit-encoded execution instructions for the functional units 212 as well as operand address information, immediate data and/or displacement data.
  • The superscalar microprocessor of Figure 1 supports out of order execution, and thus employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions.
  • A temporary storage location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register, to thereby store speculative register states.
  • Reorder buffer 216 may be implemented in a first-in-first-out configuration wherein speculative results move to the "bottom" of the buffer as they are validated and written to the register file, thus making room for new entries at the "top" of the buffer.
  • Other specific configurations of reorder buffer 216 are also possible, as will be described further below. If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218.
  • Each reservation station unit 210A-210F is capable of holding instruction information (i.e., bit encoded execution bits as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit. It is noted that for the embodiment of Figure 1, each decode unit 208A-208F is associated with a dedicated reservation station unit 210A-210F, and that each reservation station unit 210A-210F is similarly associated with a dedicated functional unit 212A-212F.
  • Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 210B and into functional unit 212B, and so on.
  • Register address information is routed to reorder buffer 216 and register file 218 simultaneously. The x86 register file includes eight 32 bit real registers (i.e., typically referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP).
  • Reorder buffer 216 contains temporary storage locations for results which change the contents of these registers to thereby allow out of order execution.
  • A temporary storage location of reorder buffer 216 is reserved for each instruction which, upon decode, is determined to modify the contents of one of the real registers.
  • Reorder buffer 216 may therefore have one or more locations which contain the speculatively executed contents of a given register. If, following decode of a given instruction, it is determined that reorder buffer 216 has a previous location or locations assigned to a register used as an operand in the given instruction, reorder buffer 216 forwards to the corresponding reservation station either: 1) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous instruction. If the reorder buffer has a location reserved for a given register, the operand value (or tag) is provided from reorder buffer 216 rather than from register file 218. If there is no location reserved for a required register in reorder buffer 216, the value is taken directly from register file 218. If the operand corresponds to a memory location, the operand value is provided to the reservation station unit through load/store unit 222.
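
The forwarding decision described above can be sketched as follows. The dictionary-based reorder buffer and register file are illustrative stand-ins for the hardware structures, not their actual layout.

```python
def fetch_operand(reg, reorder_buffer, register_file):
    # reorder_buffer maps a register name to a list of (value, tag) pairs for
    # the buffer locations assigned to that register, most recent last; value
    # is None while the producing instruction is still in flight.
    locations = reorder_buffer.get(reg)
    if locations:
        value, tag = locations[-1]          # most recently assigned location
        if value is not None:
            return ("value", value)         # result already produced
        return ("tag", tag)                 # still in flight: forward the tag
    return ("value", register_file[reg])    # no speculative copy in the buffer
```

A reservation station holding a tag simply waits until a result with that tag is broadcast, as described below.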
  • Reservation station units 210A-210F are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212F. As stated previously, each reservation station unit 210A-210F may store instruction information for up to three pending instructions. Each of the six reservation stations 210A-210F contain locations to store bit-encoded execution instructions to be speculatively executed by the corresponding functional unit and the values of operands. If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i.e., by completion of the execution of a previous instruction).
  • Reservation stations 210 additionally store information related to the data prediction structure disclosed herein, as will be described below.
  • Each of the functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations.
  • Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 220. If a branch prediction was incorrect, branch prediction unit 220 flushes instructions subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes prefetch/predecode unit 202 to fetch the required instructions from instruction cache 204 or main memory. It is noted that in such situations, results of instructions in the original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.
  • Results produced by functional units 212 are sent to the reorder buffer 216 if a register value is being updated, and to the load/store unit 222 if the contents of a memory location are changed. If the result is to be stored in a register, the reorder buffer 216 stores the result in the location reserved for the value of the register when the instruction was decoded. As stated previously, results are also broadcast to reservation station units 210A-210F where pending instructions may be waiting for the results of previous instruction executions to obtain the required operand values.
  • Load/store unit 222 provides an interface between functional units 212A-212F and data cache 224.
  • Load/store unit 222 is configured with a load/store buffer with eight storage locations for data and address information for pending loads or stores.
  • Decode units 208 arbitrate for access to load/store unit 222. When the buffer is full, a decode unit must wait until load/store unit 222 has room for the pending load or store request information.
  • Load/store unit 222 also performs dependency checking for load instructions against pending store instructions to ensure that data coherency is maintained.
  • Data cache 224 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the main memory subsystem.
  • Data cache 224 has a capacity of up to eight kilobytes of data. It is understood that data cache 224 may be implemented in a variety of specific memory configurations, including a set associative configuration.
  • Branch prediction unit 220 is included, along with load/store unit 222 and data cache 224.
  • Branch prediction unit 220 is connected to a data prediction bus 253 which is coupled to an arbitration multiplexor 254.
  • Also coupled to arbitration multiplexor 254 is a second request bus 255 and an arbitration select line 256.
  • Signals on both second request bus 255 and arbitration select line 256 originate in load/store unit 222.
  • The output bus of arbitration multiplexor 254 is coupled to an input port of data cache 224.
  • A first request bus 257 is coupled between load/store unit 222 and an input port of data cache 224.
  • Data cache 224 is configured with two output ports which are coupled to a first reply bus 258 and a second reply bus 259. Both first reply bus 258 and second reply bus 259 are coupled to load/store unit 222, and second reply bus 259 is coupled to reservation stations 210 (shown in Figure 1).
  • Branch prediction unit 220 produces a data prediction address on data prediction bus 253 during a clock cycle.
  • The data prediction address is a prediction of the data addresses that are used by instructions being fetched during the clock cycle, and is generated from a stored set of addresses and associated stride values, as will be detailed below.
  • The data prediction address accesses data cache 224. The data bytes associated with the data prediction address are transferred on second reply bus 259 to reservation stations 210 if the data prediction address is a hit in data cache 224.
  • Reservation stations 210 store the data bytes along with the data prediction address and the associated instructions provided by decode units 208. Reservation stations 210, in a later clock cycle, direct the corresponding functional units 212 to generate a linear address if one or more of the associated instructions have an implicit memory read operation. If the linear address matches the data prediction address, then the data stored in the respective reservation station 210 is used as the operand for the instruction.
  • Load/store unit 222 is informed by the corresponding reservation station 210A-210F that the associated implicit memory read operation need not be performed. Load/store unit 222 thereby discards the implicit memory read operation transferred by decode units 208 when the associated instruction was decoded. Therefore, the implicit memory read associated with an instruction may be performed before the instruction enters a functional unit 212.
  • The latency of data cache 224 is thus endured prior to the instruction reaching the reservation station. Performance may be advantageously increased by allowing subsequent instructions to more quickly enter reservation stations 210 during clock cycles in which the associated instruction would otherwise have occupied reservation stations 210 due to the latency of data cache 224.
  • Load/store unit 222 also stores data associated with data prediction addresses, and conveys the stored data to the respective reservation station 210 without accessing data cache 224. In this manner, data is provided for a load memory access without enduring the latency associated with data cache 224.
  • Branch prediction unit 220 stores the correct prediction address for use in a subsequent prediction.
  • The data prediction address is conveyed on data prediction bus 253 to arbitration multiplexor 254, which arbitrates between the data prediction address and a second request from load/store unit 222. Arbitration refers to the selection of one request over another according to an arbitration scheme.
  • The arbitration scheme is a priority scheme in which a load/store request conveyed on second request bus 255 is a higher priority than the data prediction address request. The arbitration select signal conveyed on arbitration select line 256 therefore indicates whether a load/store request is being conveyed on second request bus 255. If a valid load/store request is being made during a clock cycle, arbitration multiplexor 254 selects the load/store request to be conveyed to data cache 224. If no valid load/store request is being made during a clock cycle, arbitration multiplexor 254 selects the data prediction address to be conveyed to data cache 224.
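
Modeling a valid load/store request as a non-None address, the priority scheme of arbitration multiplexor 254 reduces to a simple selection; the representation is illustrative, not the hardware's signaling.

```python
def arbitrate(loadstore_request, prediction_request):
    # A load/store request on second request bus 255 always wins; the data
    # prediction address only reaches data cache 224 when the port is idle.
    if loadstore_request is not None:   # arbitration select asserted
        return loadstore_request
    return prediction_request           # cache port free for the prediction
```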
  • Load/store unit 222 is configured to make up to two requests to data cache 224 during a clock cycle. During clock cycles in which load/store unit 222 is making one request, first request bus 257 is used, allowing the data prediction address the maximum opportunity to access data cache 224. If a data prediction address request is valid during a clock cycle in which load/store unit 222 is making no requests or one request to data cache 224, then the data prediction address request is given access to data cache 224.
  • Load/store unit 222 receives the data bytes associated with requests on first request bus 257 and second request bus 255 on first reply bus 258 and second reply bus 259, respectively. It is noted that other embodiments of load/store unit 222 may have different numbers of request buses for data cache 224.
  • Prediction array 270 is a linear array of storage locations. The current fetch address (conveyed from instruction cache 204) is used to index into prediction array 270 and select a storage location. Stored within the selected storage location is a base address, a stride value, and a data prediction counter.
  • The base address is a data address generated by an instruction whose instruction address (i.e. the address of the memory location storing the instruction) indexes the selected storage location. The stride value is the difference between two addresses generated by the instruction on two consecutive executions of the instruction. The stride value is a signed value, allowing both positive and negative strides to be stored.
  • The data prediction counter indicates the validity of the base address and the stride value. Furthermore, the data prediction counter may store values indicating that a number of previous data prediction addresses were correct predictions. A comparable number of consecutive mispredictions must then be detected before the stride value is changed. Generally speaking, the data prediction counter value may be incremented and decremented, but does not decrement below zero nor increment above the largest value that may be represented by the data prediction counter. Instead, if the data prediction counter contains zero and is decremented, it remains zero; if it contains the largest value it may represent and is incremented, it remains at that value. The data prediction counter is therefore a saturating counter.
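
The saturating behavior can be sketched directly. The maximum value of 3 (a 2-bit counter) is an assumption for illustration; the text does not fix the counter's width.

```python
class SaturatingCounter:
    # Counts up on correct predictions and down on mispredictions, sticking
    # at zero and at its maximum rather than wrapping around.
    def __init__(self, maximum=3):
        self.value = 0
        self.maximum = maximum

    def increment(self):
        self.value = min(self.value + 1, self.maximum)

    def decrement(self):
        self.value = max(self.value - 1, 0)
```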
  • An adder circuit 271 is coupled to the output port of prediction array 270.
  • Adder circuit 271 is configured to add the stride value stored within the selected storage location to the base address stored within the selected storage location, creating a data prediction address.
  • The data prediction address is conveyed on data prediction bus 253, along with the associated data prediction counter value. In one embodiment, if the stride value is invalid (as indicated by the data prediction counter value), then the base address is added to zero.
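
A sketch of the address generation performed by adder circuit 271. Treating counter values below 2 as "stride invalid" is an assumed encoding, not one fixed by the text.

```python
def data_prediction_address(base, stride, counter, stride_valid_at=2):
    # base + stride when the data prediction counter marks the stride valid;
    # base + 0 otherwise, per the embodiment described above.
    if counter >= stride_valid_at:
        return base + stride
    return base + 0
```

Because the stride is signed, negative strides simply subtract from the base.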
  • branch prediction unit 220 receives four sets of inputs from reservation stations 210.
  • branch prediction unit 220 receives a plurality of linear address buses 261 which convey the actual data addresses (referred to below as linear addresses) generated by functional units 212 as well as a valid indicator for each address indicating the validity of the address. If a reservation station 210A-210F detects a mispredicted data prediction address, the associated one of linear address buses 261 will convey the corrected addresses (as will be explained below with respect to Figure 3B).
  • Each reservation station 210A-210F also provides one of a plurality of mispredicted lines 273 indicating whether or not the linear address conveyed on the associated one of linear address buses 261 matches the data prediction address provided with the associated instruction.
  • branch prediction unit 220 receives a plurality of instruction address buses 274 from reservation stations 210 conveying instruction addresses associated with the linear addresses received on linear address buses 261. The instruction address is used to index into prediction array 270 in order to store the updated base address, stride value, and data prediction counter value. Finally, prediction address buses 276 are received by branch prediction unit 220. Prediction address buses 276 convey the data prediction address received by each reservation station 210A-210F along with the associated data prediction counter value. Linear address buses 261, mispredicted lines 273, instruction address buses 274 and prediction address buses 276 are received by a prediction validation and correction block 275 within branch prediction unit 220.
  • Prediction validation and correction block 275 causes the corresponding linear address to be stored as the base address in prediction array 270 in the storage location indexed by the corresponding instruction address.
  • the associated data prediction counter value is decremented for this case and stored as the data prediction counter in the indexed storage location. In this manner, a data prediction address is corrected if predicted incorrectly.
  • a base address is created if a valid base address was not previously stored in prediction array 270 as indicated by the associated data prediction counter.
  • a particular one of mispredicted lines 273 indicates a misprediction if the data prediction address associated with the instruction currently being validated is invalid. Therefore, the address provided on the associated one of linear address buses 261 is stored as the base address.
  • Prediction validation and correction block 275 causes the corresponding linear address to be stored as the base address within prediction array 270 in the storage location indexed by the corresponding instruction address.
  • the corresponding data prediction counter value is incremented and stored as the data prediction counter in the indexed storage location.
  • a new stride value is calculated by prediction validation and correction block 275.
  • the new stride value is the difference between the prediction value conveyed on the associated one of prediction address buses 276 and the linear address conveyed on the associated one of linear address buses 261.
  • the new stride value is stored as the stride value in the indexed storage location within prediction array 270. If the stride value is invalid (as indicated by the data prediction counter) then the data prediction counter is incremented regardless of the misprediction/prediction status of the current data address prediction.
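The update rules above can be collected into a single sketch (a hypothetical software model; the entry layout and the direction of the stride subtraction are assumptions drawn from the surrounding description):

```python
def predict(entry):
    """Data prediction address for an entry: base plus stride, with the
    stride treated as zero while the counter marks it invalid."""
    stride = entry['stride'] if entry['counter'] >= 0b10 else 0
    return entry['base'] + stride

def validate_and_correct(entry, predicted_address, linear_address):
    """Hypothetical model of prediction validation and correction block
    275: update a prediction array entry once the actual (linear)
    address for the instruction is known."""
    stride_was_valid = entry['counter'] >= 0b10
    # The actual address always becomes the new base address.
    entry['base'] = linear_address
    if not stride_was_valid:
        # Learn a stride from the difference between the actual address
        # and the prediction, and increment the counter regardless of
        # the misprediction/prediction status of the current prediction.
        entry['stride'] = linear_address - predicted_address
        entry['counter'] = min(entry['counter'] + 1, 0b11)
    elif linear_address != predicted_address:
        entry['counter'] = max(entry['counter'] - 1, 0b00)  # misprediction
    else:
        entry['counter'] = min(entry['counter'] + 1, 0b11)  # correct
```

Training on a strided pattern (say 0x100, 0x108, 0x110) drives the counter from 00 through 01 to 10 and beyond, after which `predict` returns the next address in the pattern.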
  • reservation station 210A's outputs are given highest priority, then reservation station 210B's outputs, etc.
  • If reservation station 210A is conveying prediction validation information during a clock cycle (as indicated by the associated address valid indicator), reservation station 210A's information is processed by prediction validation and correction block 275.
  • Prediction validation and correction block 275 is coupled to a dedicated write port into prediction array 270 to perform its updates, in one embodiment.
  • the prediction address advantageously predicts the next address correctly for at least two types of data access patterns.
  • if an instruction accesses the same data address each time it executes, the stride value will be zero and the data prediction address associated with that instruction will be the same each time the instruction executes. Therefore, the current data prediction structure will correctly predict static data addresses.
  • instructions which access a regular pattern of data addresses such that the data address accessed on consecutive executions of the instruction differ by a fixed amount may receive correct address predictions from the present data prediction structure.
  • Static data prediction structures (which store the previously generated data address without a stride value) do not predict these types of instruction sequences correctly. Therefore, the present prediction structure may correctly predict a larger percentage of data addresses than static data prediction structures.
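The two pattern types can be illustrated with hypothetical address sequences (the addresses are invented for illustration):

```python
# Static pattern: an instruction that reads the same variable each time.
static_accesses = [0x2000, 0x2000, 0x2000]

# Strided pattern: consecutive executions step through 8-byte elements.
strided_accesses = [0x3000, 0x3008, 0x3010]

def learned_stride(accesses):
    """The stride implied by two consecutive executions: the difference
    between successive data addresses."""
    return accesses[1] - accesses[0]

def next_prediction(accesses):
    """Predicted address for the next execution: last address plus stride."""
    return accesses[-1] + learned_stride(accesses)
```

A structure that stores only the previously generated address effectively assumes a stride of zero, so it predicts the static pattern correctly but not the strided one.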
  • reservation station 210A is depicted.
  • One storage entry is shown in Figure 3B, but reservation station 210A is configured with several storage entries for storing decoded instructions and related data.
  • Other storage entries within reservation station 210A are configured similarly to the storage entry shown.
  • reservation stations 210B-210F are configured similarly to reservation station 210A.
  • the logic circuits shown in Figure 3B are exemplary circuits for implementing the data prediction structure. Other circuits (not shown) select instructions for execution by functional units 212.
  • the reservation station storage entry shown in Figure 3B includes storage for the decoded instruction and related information.
  • a decoded instruction register 300 stores a decoded instruction conveyed from decode unit 208A on an input instruction bus 303.
  • the input operands are stored in a pair of registers: AOP register 301 and BOP register 302.
  • AOP register 301 and BOP register 302 receive their respective values either from a pair of data forwarding buses 304 and 305 from functional units 212 through operand steering logic (not shown), or from a predicted data register 306.
  • the selection of data to store within AOP register 301 and BOP register 302 is performed by multiplexors 307 and 308, respectively, under the control of a reservation station control unit 309.
  • AOP register 301 and BOP register 302 are each configured with a valid bit indicative of whether or not a value has been provided for the corresponding operand. When both operands are valid, the explicit operation of the associated instruction may be executed by functional unit 212A.
  • Predicted data register 306 is coupled to second reply bus 259.
  • data from second reply bus 259 is stored into predicted data register 306, along with an indication of whether or not the address associated with the data is the data prediction address and whether or not the address was a hit in data cache 224.
  • the address used to fetch the instruction from instruction cache 204 is stored into instruction address register 310.
  • the associated data prediction address and data prediction counter value generated by branch prediction unit 220 are stored into a predicted address register 311.
  • Predicted address register 311 is configured with a validation done bit (represented in Figure 3B by the D field of predicted address register 311).
  • the validation done bit is used to ensure that a particular predicted address is validated only once by reservation station 210A and conveyed to branch prediction unit 220.
  • the validation done bit is initially cleared by reservation station control unit 309 and subsequently set as described below.
  • the storage entry includes a linear address register 312 for storing a linear address generated for the decoded instruction by functional unit 212A. Additionally, a valid bit is stored in linear address register 312 to indicate the validity of the linear address.
  • Reservation station control unit 309 is configured to control the storing of data into the various registers within each reservation station storage location.
  • a decoded instruction is conveyed to the storage location and stored in decoded instruction register 300.
  • predicted data register 306 receives data from data cache 224 and instruction address register 310 and predicted address register 311 receive values routed through decode unit 208A from branch prediction unit 220.
  • AOP register 301 and BOP register 302 may receive operand values from data forwarding buses 304 and 305.
  • the decoded instruction occupies the storage entry within reservation station 210A until the explicit operation of the instruction is executed by functional unit 212A.
  • reservation station control unit 309 causes functional unit 212A to generate a linear address for the implicit memory read from the appropriate address values (not shown).
  • the linear address is conveyed to load/store unit 222 for use in performing the implicit memory read if the data prediction address is incorrect, and is additionally stored into linear address register 312.
  • Reservation station control unit 309 causes the valid bit in linear address register 312 to be set.
  • a comparator circuit 313 compares the generated linear address to the data prediction address stored in predicted address register 311. Comparator circuit 313 is configured to indicate a mismatch if the data prediction counter value stored in predicted address register 311 indicates that the data prediction address is invalid (i.e. the associated storage location within prediction array 270 is not currently storing a valid base address).
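Comparator circuit 313's behavior can be sketched as (a hypothetical model; the convention that a counter of binary 00 marks the base address, and hence the prediction itself, as invalid follows the two-bit encoding described later):

```python
def addresses_match(linear_address, predicted_address, counter):
    """Model of comparator circuit 313: a mismatch is indicated either
    when the addresses differ or when the counter marks the data
    prediction address itself invalid (no valid base address stored)."""
    prediction_valid = counter >= 0b01  # binary 00: no valid base address
    return prediction_valid and linear_address == predicted_address
```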
  • Comparator circuit 313 conveys its output signal indicative of a match or mismatch between the data prediction address and the linear address to a multiplexor 314. Additionally, the value stored in instruction address register 310 is conveyed to a multiplexor 315; the value stored in predicted address register 311 is conveyed to a multiplexor 316; and the value stored in linear address register 312 is conveyed to a multiplexor 317. Multiplexors 314, 315, 316, and 317 are controlled by reservation station control unit 309 and receive similar values from other reservation station storage entries (not shown).
  • reservation station control unit 309 selects the values stored in the associated storage entry through multiplexors 314, 315, 316, and 317 and sets the validation done bit in predicted address register 311.
  • Multiplexor 314 is configured to convey a value on mispredicted line 273A, which is one of the plurality of mispredicted lines 273 shown in Figure 3B.
  • multiplexor 315 is configured to convey a value on instruction address bus 274A, which is one of the plurality of instruction address buses 274 shown in Figure 3B.
  • Multiplexor 316 is configured to convey the value stored in predicted address register 311 on prediction address bus 276A, which is one of the plurality of prediction address buses 276.
  • multiplexor 317 is configured to convey the value stored in linear address register 312 (including the valid bit, which forms the address valid indicator described above) on linear address bus 261A, which is one of the plurality of linear address buses 261.
  • In this manner, prediction validation and correction block 275 of branch prediction unit 220 receives the appropriate values to validate and correct the data address prediction generated for the instruction stored in decoded instruction register 300.
  • reservation station control unit 309 sets the validation done bit in predicted address register 311. Since the validation done bit must be clear for validation values to be conveyed by the reservation station storage entry, the validation in branch prediction unit 220 is performed once for a particular instruction. If the linear address and data prediction address are found to match and the data stored in predicted data register 306 is associated with the data prediction address, reservation station control unit 309 causes the data stored in predicted data register 306 to be stored into either AOP register 301 or BOP register 302. AOP register 301 or BOP register 302 is chosen according to which operand is intended to receive the result of the implicit memory read operation.
  • the implicit memory read operation is performed by load/store unit 222 in a subsequent clock cycle. If the predicted data is transferred to AOP register 301 or BOP register 302, then reservation station control unit 309 conveys an indication to load/store unit 222 that the associated implicit memory read operation need not be performed. Load/store unit 222 thereby discards the implicit memory read operation (as noted above).
  • reservation station control unit 309 causes the address valid indicator on linear address bus 261A to convey an invalid address indication. Therefore, branch prediction unit 220 will ignore the values conveyed on linear address bus 261A, prediction address bus 276A, instruction address bus 274A, and mispredicted line 273A during this clock cycle. Reservation stations 210B-210F thereby may update data prediction values during this clock cycle.
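The once-only gating performed with the validation done bit can be sketched as follows (a hypothetical model; the field names are illustrative, not from the patent):

```python
def convey_validation(entry):
    """Convey validation values for a reservation station storage entry
    at most once: only while the linear address is valid and the
    validation done bit is still clear."""
    if entry['linear_valid'] and not entry['validation_done']:
        entry['validation_done'] = True
        return True   # values selected through multiplexors 314-317
    return False      # address valid indicator conveys "invalid"
```

A second attempt for the same entry returns False, so the prediction array is updated only once per instruction.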
  • a field of information stored in storage location 400 of prediction array 270 is the base address field 401.
  • the base address is stored in this field.
  • base address field 401 is 32 bits wide.
  • stride value field 402 is 10 bits wide. The base address and stride value are added to form the data prediction address.
  • data prediction counter field 403 is included in storage location 400.
  • the data prediction counter is stored in this field.
  • the data prediction counter is two bits.
  • a binary value of 00 for the data prediction counter indicates that neither the base address field 401 nor the stride value field 402 contain valid values.
  • a binary value of 01 indicates that base address field 401 is valid but stride value field 402 does not contain a valid value.
  • the base address field is added to zero for this encoding, and the stride value is updated when the linear address is generated by one of functional units 212.
  • a binary value of 10 or 11 indicates that both base address field 401 and stride value field 402 are valid.
  • Having two values with the same indication allows for a stride value to be maintained until two consecutive mispredictions are detected for basic blocks which index storage location 400. In this way, prediction accuracy is maintained for the case where a basic block is intermittently executed between multiple executions of a second basic block and the two basic blocks index the same storage location 400.
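The two-bit encoding enumerated above can be summarized in a short sketch (a hypothetical decode function; `0b` prefixes denote binary values):

```python
def decode_counter(counter):
    """Return (base_valid, stride_valid) for a two-bit data prediction
    counter value, per the encoding described in the text."""
    base_valid = counter >= 0b01     # 01, 10, and 11: base address valid
    stride_valid = counter >= 0b10   # 10 and 11: stride value also valid
    return base_valid, stride_valid
```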
  • Other embodiments may implement more or fewer bits for the data prediction counter value.
  • data prediction address bus 253 is connected to an additional read port on data cache 224.
  • arbitration multiplexor 254 may be eliminated. It is noted that, although the preceding discussion described an embodiment of the data prediction structure within a processor implementing the x86 architecture, any microprocessor architecture could benefit from the data prediction structure.
  • a data address prediction structure which allows an implicit memory operation of an instruction to be performed before the instruction reaches a reservation station. Therefore, the number of cycles that the instruction is stored in the reservation station is advantageously reduced. Performance may be increased by implementing such a data prediction structure.


Abstract

A data prediction structure for a superscalar processor. The data prediction structure stores base addresses and stride values in a prediction array. The base address and stride value from a location in the data prediction structure indexed by an instruction address are added to form a data prediction address, which is then used to fetch data bytes into a reservation station storing an associated instruction. If the data associated with an operand address computed by a functional unit resides in the reservation station, the clock cycle used to perform the load operation occurred before the instruction arrived in the reservation station. Additionally, the base address is updated to the address generated by executing the instruction each time the instruction is executed, and the stride value is updated when the data prediction address is determined to be incorrect.
PCT/US1996/017516 1996-11-04 1996-11-04 Structure de prediction d'adresse de donnees fondee sur le rythme WO1998020416A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU76667/96A AU7666796A (en) 1996-11-04 1996-11-04 A stride-based data address prediction structure
PCT/US1996/017516 WO1998020416A1 (fr) 1996-11-04 1996-11-04 Structure de prediction d'adresse de donnees fondee sur le rythme

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US1996/017516 WO1998020416A1 (fr) 1996-11-04 1996-11-04 Structure de prediction d'adresse de donnees fondee sur le rythme

Publications (1)

Publication Number Publication Date
WO1998020416A1 true WO1998020416A1 (fr) 1998-05-14

Family

ID=22256050

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1996/017516 WO1998020416A1 (fr) 1996-11-04 1996-11-04 Structure de prediction d'adresse de donnees fondee sur le rythme

Country Status (2)

Country Link
AU (1) AU7666796A (fr)
WO (1) WO1998020416A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004053686A1 (fr) * 2002-12-12 2004-06-24 Koninklijke Philips Electronics N.V. Prediction de la cadence basee sur un compteur pour une lecture anticipee

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5442767A (en) * 1992-10-23 1995-08-15 International Business Machines Corporation Address prediction to avoid address generation interlocks in computer systems

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BAER J L ET AL: "AN EFFECTIVE ON-CHIP PRELOADING SCHEME TO REDUCE DATA ACCESS PENALTY", PROCEEDINGS OF THE SUPERCOMPUTING CONFERENCE, ALBUQUERQUE, NOV. 18 - 22, 1991, no. CONF. 4, 18 November 1991 (1991-11-18), INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, pages 176 - 186, XP000337480 *
EICKEMEYER R J ET AL: "A LOAD-INSTRUCTION UNIT FOR PIPELINED PROCESSORS", IBM JOURNAL OF RESEARCH AND DEVELOPMENT, vol. 37, no. 4, July 1993 (1993-07-01), pages 547 - 564, XP000647102 *
PO-YUNG CHANG ET AL: "ALTERNATIVE IMPLEMENTATIONS OF HYBRD BRANCH PREDICTORS", PROCEEDINGS OF THE 28TH. ANNUAL INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, ANN ARBOR, NOV. 29 - DEC. 1, 1995, no. SYMP. 28, 29 November 1995 (1995-11-29), INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, pages 252 - 257, XP000585367 *

Also Published As

Publication number Publication date
AU7666796A (en) 1998-05-29

Similar Documents

Publication Publication Date Title
US6253316B1 (en) Three state branch history using one bit in a branch prediction mechanism
US7024545B1 (en) Hybrid branch prediction device with two levels of branch prediction cache
US5761712A (en) Data memory unit and method for storing data into a lockable cache in one clock cycle by previewing the tag array
US5845323A (en) Way prediction structure for predicting the way of a cache in which an access hits, thereby speeding cache access time
US6012125A (en) Superscalar microprocessor including a decoded instruction cache configured to receive partially decoded instructions
US5822575A (en) Branch prediction storage for storing branch prediction information such that a corresponding tag may be routed with the branch instruction
US6502188B1 (en) Dynamic classification of conditional branches in global history branch prediction
JP2001521241A (ja) 分岐予測を迅速に特定するための命令キャッシュ内のバイト範囲に関連する分岐セレクタ
EP0651331B1 (fr) Tampon d'écriture pour un microprocesseur superscalaire à pipeline
US20070033385A1 (en) Call return stack way prediction repair
US6397326B1 (en) Method and circuit for preloading prediction circuits in microprocessors
US5893146A (en) Cache structure having a reduced tag comparison to enable data transfer from said cache
JP3794918B2 (ja) 復帰選択ビットを用いて分岐予測のタイプを分類する分岐予測
EP0912927B1 (fr) Unite de chargement/stockage a indicateurs multiples pour achever des instructions de stockage et de chargement ayant manque la memoire cache
US6175909B1 (en) Forwarding instruction byte blocks to parallel scanning units using instruction cache associated table storing scan block boundary information for faster alignment
WO1998002806A1 (fr) Structure de prediction d'adresses de donnees faisant appel a un procede de prediction par enjambee
WO1998020421A1 (fr) Structure de prevision de voie
WO1998020416A1 (fr) Structure de prediction d'adresse de donnees fondee sur le rythme
EP0912929B1 (fr) Structure de prediction d'adresses de donnees et procede permettant de la faire fonctionner
EP0912930B1 (fr) Unite fonctionnelle avec un pointeur pour la resolution de branchements avec erreur de prediction, et microprocesseur superscalaire comprenant une telle unite
EP0919027B1 (fr) Registre d'actualisation retardee pour matrice
EP1005675B1 (fr) Unite memoire de donnees concue pour le stockage de donnees en un seul site d'horloge et procede de fonctionnement de cette unite
EP0912925B1 (fr) Structure de pile d'adresses de retour et microprocesseur superscalaire comportant cette structure
EP1015980B1 (fr) Antememoire de donnees capable d'effectuer des acces memoire dans un seul cycle d'horloge
WO1998002817A1 (fr) Unite de prediction de blocs de memoire et son procede de fonctionnement

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE HU IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG US UZ VN AM AZ BY KG KZ MD RU TJ TM

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WD Withdrawal of designations after international publication

Free format text: US

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA