WO1998002806A1 - A data address prediction structure utilizing a stride prediction method
- Publication number
- WO1998002806A1 (application PCT/US1996/011847)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- address
- prediction
- instruction
- recited
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/34—Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
- G06F9/345—Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results
- G06F9/3455—Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results using stride
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
- G06F9/3832—Value prediction for operands; operand history buffers
Definitions
- Title: A Data Address Prediction Structure Utilizing a Stride Prediction Method
- This invention relates to the field of superscalar microprocessors and, more particularly, to data prediction mechanisms in superscalar microprocessors.
- clock cycle refers to an interval of time during which the pipeline stages of a microprocessor perform their intended functions.
- superscalar microprocessors are typically configured with instruction processing pipelines which process instructions. The processing of instructions includes the actions of fetching, dispatching, decoding, executing, and writing back results. Each action may be implemented in one or more pipeline stages, and an instruction flows through each of the pipeline stages where an action or portion of an action is performed. At the end of a clock cycle, the instruction and the values resulting from performing the action of the current pipeline stage are moved to the next pipeline stage. When an instruction reaches the end of an instruction processing pipeline, it is processed and the results of executing the instruction have been recorded.
- Because superscalar microprocessors execute multiple instructions per clock cycle and the clock cycle is short, a high bandwidth memory system is required to provide instructions and data to the superscalar microprocessor (i.e., a memory system that can provide a large number of bytes in a short period of time). Without a high bandwidth memory system, the microprocessor would spend a large number of clock cycles waiting for instructions or data to be provided, then would execute the received instructions and/or instructions dependent upon the received data in a relatively small number of clock cycles. Overall performance would be degraded by the large number of idle clock cycles.
- Superscalar microprocessors are, however, ordinarily configured into computer systems with a relatively large main memory composed of dynamic random access memory (DRAM) cells.
- DRAM cells are characterized by access times which are significantly longer than the clock cycle of modern superscalar microprocessors. Also, DRAM cells typically provide a relatively narrow output bus to convey the stored bytes to the superscalar microprocessor. Therefore, DRAM cells provide a memory system that provides a relatively small number of bytes in a relatively long period of time, and do not form a high bandwidth memory system.
- Caches include multiple blocks of storage locations configured on the same silicon substrate as the microprocessor or coupled nearby. The blocks of storage locations are used to hold previously fetched instruction or data bytes. The bytes can be transferred from the cache to the destination (a register or an instruction processing pipeline) quickly; commonly one or two clock cycles are required, as opposed to a large number of clock cycles to transfer bytes from a DRAM main memory.
- Instructions which have an operand stored in memory are said to have an implicit (memory read) operation and an explicit operation which is defined by the particular instruction being executed (i.e., an add instruction has an explicit operation of addition).
- Such an instruction therefore requires an implicit address calculation for the memory read to retrieve the operand, the implicit memory read, and the execution of the explicit operation of the instruction (for example, an addition).
- Typical superscalar microprocessors have required that these operations be performed by the execute stage of the instruction processing pipeline. The execute stage of the instruction processing pipeline is therefore occupied for several clock cycles when executing such an instruction.
- a superscalar microprocessor might require one clock cycle for the address calculation, one to two clock cycles for the data cache access, and one clock cycle for the execution of the explicit operation of the instruction.
- the problems outlined above are in large part solved by a data prediction structure for a superscalar microprocessor in accordance with the present invention.
- the data prediction structure stores base addresses and stride values in a prediction array. A particular base address and associated stride value are added to form a data prediction address.
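The addition of a base address and a stride value can be sketched as follows. This is a minimal illustration, not the patented circuit: the function name `predict_data_address` and the 32-bit address width are assumptions made for the example.

```python
def predict_data_address(base_address: int, stride: int, address_bits: int = 32) -> int:
    """Add a signed stride to a base address, wrapping to the address width.

    address_bits=32 is an assumption; the source does not state the width.
    """
    mask = (1 << address_bits) - 1
    return (base_address + stride) & mask


# A loop reading consecutive 4-byte elements: last address 0x1000, stride 4.
print(hex(predict_data_address(0x1000, 4)))    # 0x1004
# A negative stride models a loop that walks backward through memory.
print(hex(predict_data_address(0x1000, -16)))  # 0xff0
```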
- the data prediction address is then used to fetch data bytes into a relatively small, relatively fast buffer which may be accessed by the decode stage(s) of the instruction processing pipeline. If the data associated with an operand address calculated by a decode stage resides in the buffer, the data is routed to the corresponding reservation station.
- the clock cycles used to perform the load operation occur before the instruction reaches the execution stage of the instruction processing pipeline.
- the execution stage clock cycles that are saved may be used to execute other instructions
- the base address is updated to the address generated by a decode unit each time a basic block is executed, and the stride value is updated when the data prediction address is found to be incorrect
- the data prediction address is in many cases more accurate than a static data prediction address that changes only when it is found to be incorrect
- the implicit load may be performed more often prior to the instruction reaching the execute stage of the instruction processing pipeline
- the prediction must be incorrect in several consecutive executions of the basic block before the stride is changed according to one embodiment of the present invention
- a single execution of another basic block whose instruction addresses index the same storage location as the correct prediction information does not destroy the prediction information
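The update policy described above (base address refreshed on every execution, stride changed only after several consecutive mispredictions) might be modeled as in the sketch below. The class `StrideEntry`, the threshold `MISPREDICT_LIMIT = 3`, and the exact-match comparison are all assumptions chosen for illustration; the source says only that "several" consecutive mispredictions are required.

```python
from dataclasses import dataclass

MISPREDICT_LIMIT = 3  # assumed threshold; the source only says "several"


@dataclass
class StrideEntry:
    base: int = 0
    stride: int = 0
    misses: int = 0  # consecutive mispredictions observed so far

    def predict(self) -> int:
        return self.base + self.stride

    def update(self, actual: int) -> bool:
        """Record the address actually generated by a decode unit.

        Returns True if the prediction made for this execution was correct.
        """
        correct = self.predict() == actual
        if correct:
            self.misses = 0
        else:
            self.misses += 1
            if self.misses >= MISPREDICT_LIMIT:
                # Retrain the stride from the last two observed addresses.
                self.stride = actual - self.base
                self.misses = 0
        self.base = actual  # the base address is updated on every execution
        return correct
```

Under these assumptions, one stray address from an aliasing basic block causes two mispredictions (the stray access and the one after it, because the base was overwritten), which is still below the threshold, so the learned stride survives.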
- the present invention contemplates a method for predicting a data address which will be referenced by a plurality of instructions residing in a basic block when the basic block is fetched, comprising several steps. First, a data prediction address is generated from a base address and a stride value during a clock cycle in which a data prediction counter indicates that the base address and the stride value are valid.
- the data prediction address is fetched from a data cache into a data buffer
- the data buffer is accessed for load data
- the present invention further contemplates a data address prediction structure comprising an array, an adder circuit and a data buffer
- the array includes a plurality of storage locations for storing a plurality of base addresses and a plurality of stride values
- a particular one of the plurality of storage locations in which a particular one of the plurality of base addresses and a particular one of the plurality of stride values is stored is selected by the instruction address of a branch instruction which begins a basic block containing an instruction which generates one of the plurality of addresses
- the adder circuit is coupled to the array for adding a base address and a stride value conveyed from the array to produce a data prediction address
- the data buffer is included for storing a plurality of bytes associated with the data prediction address
- FIG. 1 is a block diagram of a superscalar microprocessor
- Figure 2 is a diagram depicting a branch prediction unit, a load/store unit, and a data cache of the superscalar microprocessor of Figure 1, along with several elements of one embodiment of the present invention
- Figure 3A is a diagram of the branch prediction unit shown in Figure 2 depicting several elements of one embodiment of the present invention
- Figure 3B is a diagram of a decode unit of the superscalar microprocessor shown in Figure 1 showing several elements of one embodiment of the present invention
- Figure 4 is a diagram of the information stored in a storage location of a branch prediction array in accordance with the present invention
- superscalar microprocessor 200 includes a prefetch/predecode unit 202 and a branch prediction unit 220 coupled to an instruction cache 204
- Instruction alignment unit 206 is coupled between instruction cache 204 and a plurality of decode units 208A-208F (referred to collectively as decode units 208)
- Each decode unit 208A-208F is coupled to respective reservation station units 210A-210F (referred to collectively as reservation stations 210), and each reservation station 210A-210F is coupled to a respective functional unit 212A-212F (referred to collectively as functional units 212)
- Decode units 208, reservation stations 210, and functional units 212 are further coupled to a reorder buffer 216.
- a data cache 224 is finally shown coupled to load/store unit 222
- instruction cache 204 is a high speed cache memory provided to temporarily store instructions prior to their dispatch to decode units 208
- instruction cache 204 is configured to cache up to 32 kilobytes of instruction code organized in lines of 16 bytes each (where each byte consists of 8 bits)
- instruction code is provided to instruction cache 204 by prefetching code from a main memory (not shown) through prefetch/predecode unit 202
- instruction cache 204 could be implemented in a set-associative, a fully-associative, or a direct-mapped configuration
- Prefetch/predecode unit 202 is provided to prefetch instruction code from the main memory for storage within instruction cache 204. In one embodiment, prefetch/predecode unit 202 is configured to burst 64-bit wide code from the main memory into instruction cache 204. It is understood that a variety of specific code prefetching techniques and algorithms may be employed by prefetch/predecode unit 202.
- As prefetch/predecode unit 202 fetches instructions from the main memory, it generates three predecode bits associated with each byte of instruction code: a start bit, an end bit, and a "functional" bit.
- the predecode bits form tags indicative of the boundaries of each instruction
- the predecode tags may also convey additional information such as whether a given instruction can be decoded directly by decode units 208 or whether the instruction must be executed by invoking a microcode procedure controlled by MROM unit 209, as will be described in greater detail below
- Table 1 indicates one encoding of the predecode tags. As indicated within the table, if a given byte is the first byte of an instruction, the start bit for that byte is set. If the byte is the last byte of an instruction, the end bit for that byte is set. If a particular instruction cannot be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is set. On the other hand, if the instruction can be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is cleared.
- the functional bit for the second byte of a particular instruction is cleared if the opcode is the first byte, and is set if the opcode is the second byte. It is noted that in situations where the opcode is the second byte, the first byte is a prefix byte.
- the functional bit values for instruction byte numbers 3-8 indicate whether the byte is a MODRM or an SIB byte or whether the byte contains displacement or immediate data.
- Table 1. Encoding of Start, End and Functional Bits
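The start, end and first-byte functional bits described above can be illustrated with a short sketch. The function `predecode` and its input format are hypothetical; only the first-byte meaning of the functional bit is modeled, and the functional bits of later bytes (which depend on opcode position and byte type) are left as `None` rather than guessed.

```python
from typing import List, Optional, Tuple

Tag = Tuple[int, int, Optional[int]]  # (start bit, end bit, functional bit)


def predecode(instructions: List[Tuple[int, bool]]) -> List[Tag]:
    """Produce one tag per instruction byte.

    instructions: (length_in_bytes, is_directly_decodable) pairs, a
    hypothetical input format used only for this example.
    """
    tags: List[Tag] = []
    for length, directly_decodable in instructions:
        for i in range(length):
            start = 1 if i == 0 else 0
            end = 1 if i == length - 1 else 0
            if i == 0:
                # Set when decode units cannot decode the instruction directly.
                functional: Optional[int] = 0 if directly_decodable else 1
            else:
                functional = None  # not modeled in this sketch
            tags.append((start, end, functional))
    return tags


# A 1-byte directly decodable instruction followed by a 3-byte one that is not.
for tag in predecode([(1, True), (3, False)]):
    print(tag)
```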
- MROM instructions are executed by invoking MROM unit 209. More specifically, when an MROM instruction is encountered, MROM unit 209 parses and serializes the instruction into a subset of defined fast path instructions to effectuate a desired operation.
- Instruction alignment unit 206 is provided to channel variable byte length instructions from instruction cache 204 to fixed issue positions formed by decode units 208A-208F. Instruction alignment unit 206 independently and in parallel selects instructions from three groups of instruction bytes provided by instruction cache 204 and arranges these bytes into three groups of preliminary issue positions. Each group of issue positions is associated with one of the three groups of instruction bytes. The preliminary issue positions are then merged together to form the final issue positions, each of which is coupled to one of decode units 208.
- each of the decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above
- each decode unit 208A-208F routes displacement and immediate data to a corresponding reservation station unit 210A-210F
- Output signals from the decode units 208 include bit-encoded execution instructions for the functional units 212 as well as operand address information, immediate data and/or displacement data
- the superscalar microprocessor of Figure 1 supports out of order execution, and thus employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions
- a temporary storage location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register to thereby store speculative register states
- Reorder buffer 216 may be implemented in a first-in-first-out configuration wherein speculative results move to the "bottom" of the buffer as they are validated and written to the register file, thus making room for new entries at the "top" of the buffer.
- Other specific configurations of reorder buffer 216 are also possible, as will be described further below. If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218.
- each reservation station unit 210A-210F is capable of holding instruction information (i.e., bit encoded execution bits as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit
- each decode unit 208A-208F is associated with a dedicated reservation station unit 210A-210F
- each reservation station unit 210A-210F is similarly associated with a dedicated functional unit 212A-212F
- six dedicated "issue positions" are formed by decode units 208, reservation station units 210 and functional units 212. Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are
- Reorder buffer 216 contains temporary storage locations for results which change the contents of these registers to thereby allow out of order execution
- a temporary storage location of reorder buffer 216 is reserved for each instruction which, upon decode, is determined to modify the contents of one of the real registers. Therefore, at various points during execution of a particular program, reorder buffer 216 may have one or more locations which contain the speculatively executed contents of a given register. If following decode of a given instruction it is determined that reorder buffer 216 has a previous location or locations assigned to a register used as an operand in the given instruction, the reorder buffer 216
- Reservation station units 210A-210F are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212F. As stated previously, each reservation station unit 210A-210F may store instruction information for up to three pending instructions. Each of the six reservation stations 210A-210F contains locations to store bit-encoded execution instructions to be speculatively executed by the corresponding functional unit and the values of operands. If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i.e., by completion of the execution of a previous instruction). It is noted that when an instruction is executed by one of the functional units 212A-212F, the result of that instruction is passed directly to any reservation station units 210A-210F that are waiting for that result at the same time the result is passed to update reorder buffer 216 (this technique is commonly referred to as "result forwarding"). Instructions are issued to functional units for execution after the
- each of the functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations.
- Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 220. If a branch prediction was incorrect, branch prediction unit 220 flushes instructions subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes prefetch/predecode unit 202 to fetch the required instructions from instruction cache 204 or main memory. It is noted that in such situations, results of instructions in the original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.
- Results produced by functional units 212 are sent to the reorder buffer 216 if a register value is being updated, and to the load/store unit 222 if the contents of a memory location are changed. If the result is to be stored in a register, the reorder buffer 216 stores the result in the location reserved for the value of the register when the instruction was decoded. As stated previously, results are also broadcast to reservation station units 210A-210F where pending instructions may be waiting for the results of previous instruction executions to obtain the required operand values.
- load/store unit 222 provides an interface between functional units 212A-212F and data cache 224
- load/store unit 222 is configured with a load/store buffer with eight storage locations for data and address information for pending loads or stores
- Decode units 208 arbitrate for access to the load/store unit 222. When the buffer is full, a decode unit must wait until the load/store unit 222 has room for the pending load or store request information.
- the load/store unit 222 also performs dependency checking for load instructions against pending store instructions to ensure that data coherency is maintained
- Data cache 224 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the main memory subsystem. In one embodiment, data cache 224 has a capacity of storing up to eight kilobytes of data. It is understood that data cache 224 may be implemented in a variety of specific memory configurations, including a set associative configuration.
- Branch prediction unit 220 is included, along with load/store unit 222 and data cache 224
- Branch prediction unit 220 is connected to a data prediction bus 253 which is coupled to an arbitration multiplexor 254
- Also coupled to arbitration multiplexor 254 are a second request bus 255 and an arbitration select line 256
- signals on both second request bus 255 and arbitration select line 256 originate in load/store unit 222
- the output bus of arbitration multiplexor 254 is coupled to an input port of data cache 224
- a first request bus 257 is coupled between load/store unit 222 and an input port of data cache 224
- Data cache 224 is configured with two output ports which are coupled to a first reply bus 258 and a second reply bus 259. Both first reply bus 258 and second reply bus 259 are coupled to load/store unit 222.
- Second reply bus 259 is coupled to a data buffer 260. Associated with data buffer 260 is a data buffer control unit 263. Data buffer 260 is also coupled to a plurality of decode address request buses 261 provided by decode units 208. Associated with the plurality of decode address request buses 261 is a plurality of data buses 262.
- data buffer 260 is a relatively small, high speed cache provided to store data bytes associated with data prediction addresses provided by branch prediction unit 220. During a clock cycle in which branch prediction unit 220 predicts a taken branch instruction, it also produces a data prediction address on data prediction bus 253.
- the data prediction address is a prediction of the data addresses that are used by instructions residing at the target address of the predicted branch instruction, and is generated from a stored set of addresses and associated stride values as will be detailed below
- the data prediction address accesses data cache 224
- the data cache line of data bytes associated with the data prediction address is transferred on second reply bus 259 to a write port on data buffer 260
- Data buffer 260 stores the data bytes along with the data prediction address under the direction of control unit 263
- Decode units 208, in a later clock cycle, may request data bytes associated with an implicit memory read operation of an instruction using the plurality of decode address request buses 261. The requested data bytes are provided to the respective reservation station 210 associated with a decode unit 208 which requested the data bytes. Because data buffer 260 is relatively small, it is capable of providing data within the same clock cycle that an address request is made (as opposed to requiring the entire clock cycle or more, as a typical data cache does). Therefore, the implicit memory read associated with an instruction may be performed before the instruction enters a functional unit 212, and only the explicit operation of the instruction is executed in functional unit 212. Performance may be advantageously increased by allowing other instructions to execute during clock cycles in which such an implicit memory read operation would have occupied functional unit 212. As stated previously, branch prediction unit 220 produces a data prediction address during clock cycles in which a taken branch is predicted. Instructions residing at the target address of the taken branch instruction are, therefore, in a new "basic block". Basic blocks are blocks of instructions having the property that if the first instruction in the
- Branch prediction unit 220 stores the correct prediction address for use in a subsequent prediction for the basic block
- the data prediction address is conveyed on data prediction bus 253 to arbitration multiplexor 254, which arbitrates between the data prediction address and a second request from load/store unit 222
- arbitration means the selection of one request over another according to an arbitration scheme
- the arbitration scheme is a priority scheme in which a load/store request conveyed on second request bus 255 is a higher priority than the data prediction address request. Therefore, the arbitration select signal conveyed on arbitration select line 256 is an indication that a load/store request is being conveyed on second request bus 255. If a valid load/store request is being made during a clock cycle, arbitration multiplexor 254 selects the load/store request to be conveyed to data cache 224. If no valid load/store request is being made during a clock cycle, arbitration multiplexor 254 selects the data prediction address to be conveyed to data cache 224.
- load/store unit 222 is configured to make up to two requests to data cache 224 in a clock cycle. In clock cycles where load/store unit 222 is making one request, first request bus 257 is used, to allow the data prediction address the maximum opportunity to access data cache 224. If a data prediction address request is valid during a clock cycle in which load/store unit 222 is making no requests or one request to data cache 224, then the data prediction address request will be given access to data cache 224.
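The priority scheme just described can be sketched in a few lines. The function names `place_load_store_requests` and `arbitrate` are hypothetical, and an invalid request is represented here as `None`, which is a modeling assumption rather than anything stated in the source.

```python
from typing import List, Optional, Tuple


def place_load_store_requests(requests: List[int]) -> Tuple[Optional[int], Optional[int]]:
    """Assign up to two load/store requests to (first bus, second bus).

    A single request goes on the first request bus, leaving the second
    (shared) port free for a data prediction address.
    """
    first = requests[0] if len(requests) > 0 else None
    second = requests[1] if len(requests) > 1 else None
    return first, second


def arbitrate(second_request: Optional[int], prediction_address: Optional[int]) -> Optional[int]:
    """Select what drives the shared data cache port this clock cycle.

    A valid load/store request on the second request bus always wins.
    """
    if second_request is not None:
        return second_request
    return prediction_address


# One load/store request: the prediction address reaches the data cache.
_, second = place_load_store_requests([0x2000])
print(arbitrate(second, 0x3000) == 0x3000)  # True
# Two load/store requests: the prediction address loses arbitration.
_, second = place_load_store_requests([0x2000, 0x2040])
print(arbitrate(second, 0x3000) == 0x2040)  # True
```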
- Load/store unit 222 receives the data bytes associated with requests on first request bus 257 and second request bus 255 on first reply bus 258 and second reply bus 259, respectively. It is noted that other embodiments of load/store unit 222 may have different numbers of request buses for data cache 224.
- Second reply bus 259 is also coupled to a write port on data buffer 260
- Data buffer control unit 263 detects cycles when the data prediction address is selected by arbitration multiplexor 254 and hits in data cache 224, and causes data buffer 260 to store the data conveyed on second reply bus 259. In one embodiment, the entire cache line associated with the data prediction address is conveyed on second reply bus 259 to data buffer 260.
- data buffer 260 is configured with eight storage locations, each of which stores a data prediction address and the associated data bytes
- Data buffer control unit 263 operates the storage locations of this embodiment as a first-in, first-out stack. It is noted that other replacement policies could alternatively be used.
- Data buffer 260 is also configured with a number of read ports, one for each decode unit 208. Each of the plurality of decode address request buses 261 is coupled to a respective read port. Decode units 208 each provide a request address on one of the plurality of decode address request buses 261. Each request address is compared to the addresses stored in data buffer 260. If a match is detected, data bytes associated with the address are conveyed on the associated one of the plurality of data buses 262 to the reservation station unit 210A-210F associated with the requesting decode unit 208A-208F.
- the associated reservation station 210A-210F stores the data bytes in the same manner that it stores data bytes from a register. If the data bytes for the memory operand of an instruction are stored with the instruction in a reservation station 210A-210F, then the associated functional unit 212A-212F does not perform the implicit memory read for that instruction. Instead, it executes the explicit operation of the instruction. If the data bytes of the memory operand are not stored with the instruction in a reservation station
- Data buffer control unit 263 is further configured to monitor first request bus 257 and second request bus 255 for store memory accesses. When data buffer control unit 263 detects a store memory access, it invalidates the associated cache line in data buffer 260 (if the cache line is currently stored in data buffer 260). Similarly, if data cache 224 replaces a line with data fetched from main memory, the replaced line is invalidated in data buffer 260. Reorder buffer 216 implements dependency checking for reads that follow writes in cases where the reads receive their data from data buffer 260 before the write occurs to data cache 224. In this manner, read-after-write hazards and data coherency are correctly handled with respect to data buffer 260.
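The behavior of data buffer 260 and its control unit (eight locations, first-in first-out replacement, address match on lookup, invalidation on stores) can be approximated in software as below. The class `DataBuffer`, the 16-byte line size, and matching at cache-line granularity are assumptions made for this sketch; the source does not give the data cache line size.

```python
from collections import OrderedDict
from typing import Optional

LINE_BYTES = 16      # assumed data cache line size
BUFFER_ENTRIES = 8   # eight storage locations, per the described embodiment


class DataBuffer:
    def __init__(self) -> None:
        # Maps line address -> line data, ordered oldest first.
        self.entries: "OrderedDict[int, bytes]" = OrderedDict()

    @staticmethod
    def line_of(address: int) -> int:
        return address & ~(LINE_BYTES - 1)

    def fill(self, prediction_address: int, line_data: bytes) -> None:
        """Store a line returned by the data cache for a prediction address."""
        line = self.line_of(prediction_address)
        if line not in self.entries and len(self.entries) >= BUFFER_ENTRIES:
            self.entries.popitem(last=False)  # evict the oldest entry (FIFO)
        self.entries[line] = line_data

    def lookup(self, request_address: int) -> Optional[bytes]:
        """Return the buffered line for a decode unit's request, if present."""
        return self.entries.get(self.line_of(request_address))

    def invalidate(self, store_address: int) -> None:
        """Drop the line touched by a store or replaced in the data cache."""
        self.entries.pop(self.line_of(store_address), None)


buf = DataBuffer()
for i in range(9):                      # the ninth fill evicts the first line
    buf.fill(i * LINE_BYTES, bytes([i]) * LINE_BYTES)
print(buf.lookup(0x00) is None)         # True
print(buf.lookup(0x13) is not None)     # True
buf.invalidate(0x15)                    # a store to the same line
print(buf.lookup(0x10) is None)         # True
```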
- prediction array 270 is a linear array of storage locations
- the current fetch address (conveyed from instruction cache 204) is used to index into prediction array 270 and select a storage location
- Stored within the selected storage location are a base address, a stride value, and a data prediction counter
- the base address is an address generated by an instruction within a basic block, wherein the address of the instructions within the basic block index to the selected storage location
- the stride value is the difference between the two addresses generated by an instruction within a basic block (or blocks) on two consecutive executions of the instruction
- the stride value is a signed value allowing both positive and negative strides to be stored
- the data prediction counter is configured to indicate the validity of the base address and the stride value
- the data prediction counter may store values indicating that a number of previous data prediction addresses were correct predictions. A comparable number of consecutive mispredictions must then be detected before the stride value is changed. Generally speaking, the data prediction counter
- An adder circuit 271 is coupled to the output port of prediction array 270. Adder circuit 271 is configured to add the stride value stored within the selected storage location to the base address stored within the selected storage location, creating a data prediction address.
- the data prediction address is conveyed on data prediction bus 253, along with the associated data prediction counter value
- the stride value is invalid (as indicated by the data prediction counter value)
- the base address is added to zero
- branch prediction unit 220 receives four sets of inputs from decode units 208. First, branch prediction unit 220 receives the plurality of decode address request buses 261 which convey the actual data addresses generated by decode units 208 as well as a valid bit for each address indicating the validity of the address. If a decode unit 208A-208F detects a mispredicted data prediction address, the associated one of decode address request buses 261 will convey the corrected addresses (as will be explained below with respect to Figure 3B). Each decode unit 208A-208F also provides one of a plurality of mispredicted lines 273 indicating whether or not the actual data address conveyed on the associated one of decode address request buses 261 matches the data prediction address provided with the instruction currently being decoded. In one embodiment, the actual data address and the data prediction address are determined to match if they lie within the same cache line. Other granularities may be used for the comparison in other embodiments. Additionally, branch prediction unit 220 receives a plurality of branch address buses 274 from decode units
- Prediction validation and correction block 275 causes the corresponding actual data address to be stored as the base address in prediction array 270 in the storage location indexed by the corresponding branch address. The associated data prediction counter value is decremented for this case and stored as the data prediction counter in the indexed storage location. In this manner, a data prediction address is corrected if predicted incorrectly. Additionally, a base address is created if a valid base address was not previously stored in prediction array 270 as indicated by the associated data prediction counter. A particular one of mispredicted lines 273 indicates a misprediction if the data prediction address associated with the instruction currently being decoded is invalid. Therefore, the address provided on the associated one of decode address request buses 261 is stored as the base address.
- Prediction validation and correction block 275 causes the corresponding actual data address to be stored as the base address within prediction array 270 in the storage location indexed by the corresponding branch instruction address.
- the corresponding data prediction counter value is incremented and stored as the data prediction counter in the indexed storage location.
- a new stride value is calculated by prediction validation and correction block 275.
- the new stride value is the difference between the prediction value conveyed on the associated one of prediction address buses 276 and the actual data address conveyed on the associated one of decode address request buses 261.
- the new stride value is stored as the stride value in the indexed storage location within prediction array 270. If the stride value is invalid (as indicated by the data prediction counter) then the data prediction counter is incremented regardless of the misprediction/prediction status of the current data address prediction.
- decode unit 208A's outputs are given highest priority, then decode unit 208B's outputs, etc.
- the first data address calculated by the instructions in a basic block is used to validate and create the data prediction information (including the base address, the stride value, and the data prediction counter).
- the associated branch instruction address is saved by prediction validation and correction block 275. Subsequent signals are ignored by prediction validation and correction block 275 until a branch address different from the saved branch instruction address is conveyed from a decode unit 208 to branch prediction unit 220.
- Prediction validation and correction block 275 has a dedicated write port into prediction array 270 to perform its updates, in one embodiment.
- the prediction address advantageously predicts the next address correctly for at least two types of data access patterns
- the stride value will be zero and the data prediction address associated with that instruction will be the same each time the instruction executes. Therefore, the current data prediction structure will correctly predict static data addresses.
- instructions which access a regular pattern of data addresses such that the data address accessed on consecutive executions of the instruction differ by a fixed amount may receive correct address predictions from the present data prediction structure
- Static data prediction structures which store the previously generated data address without a stride value
- decode units 208B-208F are configured similarly to decode unit 208A
- the logic circuits shown in Figure 3B are exemplary circuits for implementing the data prediction structure
- Adder circuit 300 is used to add the contents of a base register and a displacement value to produce an actual data address if the instruction being decoded by decode unit 208A has a memory operand
- the base register is the EBP or ESP register
- the displacement value is conveyed with the instruction bytes as part of the instruction
- This type of addressing is a common addressing mode used for x86 instructions. If the base register is not available, or the instruction does not have an operand residing in a memory location, or the addressing mode is different than a base register and displacement value, then validation logic circuit 303 indicates that the actual data address formed by adder circuit 300 is invalid.
- Reservation station unit 210A ignores any data bytes that may be transferred from data buffer 260 in clock cycles where the address formed by adder circuit
- address valid circuit 303 is conveyed with the address, as indicated in Figure 3B. If the address formed by adder circuit 300 is invalid and the instruction requires a memory operand, the implicit load operation associated with the instruction will be executed by functional unit 212A.
- the actual data address formed by adder circuit 300 is conveyed on decode address request bus 261A to data buffer 260 and branch prediction unit 220. Decode address request bus 261A is one of the plurality of decode address request buses 261. If the actual data address is a hit in data buffer 260, the associated data bytes are returned to reservation station 210A on one of the plurality of data buses 262. The actual data address is also conveyed to comparator circuit 302 which is coupled to prediction address register
- Prediction address register 301 stores the data prediction address for the instruction currently being decoded along with the associated data prediction counter value. The data prediction address and data prediction counter value are transferred to decode unit 208A with the instruction.
- the data prediction counter value is indicative of whether or not a data address prediction was made for the block associated with the instructions (stored in prediction address register 301)
- the data prediction counter value indicates a valid data prediction address if it conveys a value greater than zero. If both the data prediction address and the actual data address are valid and compare equal within a specified granularity (such as a cache line), then comparator circuit 302 conveys a value indicating that the data address prediction is correct to branch prediction unit 220. If either address is invalid or the addresses do not compare equal, then the output of comparator circuit 302 conveys a value indicating that the data prediction address is incorrect to branch prediction unit 220. The value is conveyed on a mispredicted line 273A. Mispredicted line 273A is one of the plurality of mispredicted lines 273.
- a field of information stored in storage location 400 of prediction array 270 is the base address field 401
- the base address is stored in this field
- base address field 401 is 32 bits wide
- stride value field 402 is 10 bits wide
- data prediction counter field 403 is included in storage location 400
- the data prediction counter is stored in this field
- the data prediction counter is two bits
- a binary value of 00 for the data prediction counter indicates that neither the base address field 401 nor the stride value field 402 contain valid values
- a binary value of 01 indicates that base address field 401 is valid but stride value field 402 does not contain a valid value
- the base address field is added to zero for this encoding, and the stride value is updated when the actual data address is generated by one of decode units 208
- a binary value of 10 or 11 indicates that both base address field 401 and stride value field 402 are valid. Having two values with the same indication allows for a stride value to be maintained until two consecutive mispredictions are detected for basic blocks which index storage location 400. In this way prediction accuracy is maintained for the case where a basic block is intermittently executed between multiple executions of a second basic block and the two basic blocks index the same storage location 400.
- Other embodiments may implement more or fewer bits for the data prediction counter value
- data prediction address bus 253 is connected to an additional read port on data cache 224
- arbitration multiplexor 254 may be eliminated
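The prediction and update behavior described in the list above can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the circuit of the application: the names `Entry`, `predict`, `same_line`, `update`, and the 32-byte `LINE_BYTES` are hypothetical; learning a new stride only while the counter reads 01 is one reading of the text; and field widths other than the 32-bit address wrap are not modeled.

```python
from dataclasses import dataclass

LINE_BYTES = 32  # assumed cache-line size; the text does not state one here


@dataclass
class Entry:
    """One storage location 400 of prediction array 270 (illustrative model)."""
    base: int = 0     # base address field 401
    stride: int = 0   # stride value field 402 (signed)
    counter: int = 0  # data prediction counter field 403 (two bits, 0..3)


def predict(e):
    """Return the data prediction address, or None if the counter is 00."""
    if e.counter == 0:
        return None                            # neither base nor stride valid
    stride = e.stride if e.counter >= 2 else 0  # 01: base is added to zero
    return (e.base + stride) & 0xFFFFFFFF


def same_line(a, b):
    """Compare two addresses at cache-line granularity."""
    return a // LINE_BYTES == b // LINE_BYTES


def update(e, actual):
    """Validate the prediction against the actual address and correct the entry.

    Returns True when the prediction was valid and matched the actual address.
    """
    predicted = predict(e)
    correct = predicted is not None and same_line(predicted, actual)
    if e.counter == 0:
        e.counter = 1                      # create a base address
    elif e.counter == 1:
        e.stride = actual - predicted      # learn a stride (assumed timing)
        e.counter = 2                      # incremented regardless of outcome
    elif correct:
        e.counter = min(3, e.counter + 1)  # saturate at binary 11
    else:
        e.counter -= 1                     # stride kept while counter >= 2
    e.base = actual                        # actual address becomes the new base
    return correct
```

Feeding the model the addresses 0x1000, 0x1040, 0x1080 in order yields two mispredictions while the base and stride are learned, then a correct prediction. A single later misprediction lowers the counter from 11 to 10 without discarding the stride, which matches the stated purpose of having two "valid" encodings.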
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Advance Control (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US1996/011847 WO1998002806A1 (en) | 1996-07-16 | 1996-07-16 | A data address prediction structure utilizing a stride prediction method |
EP96925349A EP0912928A1 (en) | 1996-07-16 | 1996-07-16 | A data address prediction structure utilizing a stride prediction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US1996/011847 WO1998002806A1 (en) | 1996-07-16 | 1996-07-16 | A data address prediction structure utilizing a stride prediction method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1998002806A1 (en) | 1998-01-22 |
Family
ID=22255476
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1996/011847 WO1998002806A1 (en) | 1996-07-16 | 1996-07-16 | A data address prediction structure utilizing a stride prediction method |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP0912928A1 (en) |
WO (1) | WO1998002806A1 (en) |
1996
- 1996-07-16 WO PCT/US1996/011847 patent/WO1998002806A1/en not_active Application Discontinuation
- 1996-07-16 EP EP96925349A patent/EP0912928A1/en not_active Withdrawn
Non-Patent Citations (3)
Title |
---|
BAER J L ET AL: "AN EFFECTIVE ON-CHIP PRELOADING SCHEME TO REDUCE DATA ACCESS PENALTY", PROCEEDINGS OF THE SUPERCOMPUTING CONFERENCE, ALBUQUERQUE, NOV. 18 - 22, 1991, no. CONF. 4, 18 November 1991 (1991-11-18), INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, pages 176 - 186, XP000337480 * |
HUA K A ET AL: "DESIGNING HIGH-PERFORMANCE PROCESSORS USING REAL ADDRESS PREDICTION", IEEE TRANSACTIONS ON COMPUTERS, vol. 42, no. 9, 1 September 1993 (1993-09-01), pages 1146 - 1151, XP000411690 * |
PLESZKUN AND DAVIDSON: "Structured Memory Access architecture", IEEE INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, 23 August 1983 (1983-08-23) - 26 August 1983 (1983-08-26), pages 461 - 471, XP000212140 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2003291835B2 (en) * | 2002-12-05 | 2010-06-17 | Evoqua Water Technologies Llc | Mixing chamber |
WO2014000641A1 (en) | 2012-06-27 | 2014-01-03 | Shanghai Xinhao Microelectronics Co. Ltd. | High-performance cache system and method |
EP2867778A4 (en) * | 2012-06-27 | 2016-12-28 | Shanghai Xin Hao Micro Electronics Co Ltd | High-performance cache system and method |
Also Published As
Publication number | Publication date |
---|---|
EP0912928A1 (en) | 1999-05-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5761712A (en) | Data memory unit and method for storing data into a lockable cache in one clock cycle by previewing the tag array | |
US6339822B1 (en) | Using padded instructions in a block-oriented cache | |
US7024545B1 (en) | Hybrid branch prediction device with two levels of branch prediction cache | |
US5845323A (en) | Way prediction structure for predicting the way of a cache in which an access hits, thereby speeding cache access time | |
JP3871883B2 (en) | Method for calculating indirect branch targets | |
US5802588A (en) | Load/store unit implementing non-blocking loads for a superscalar microprocessor and method of selecting loads in a non-blocking fashion from a load/store buffer | |
US6012125A (en) | Superscalar microprocessor including a decoded instruction cache configured to receive partially decoded instructions | |
US6427192B1 (en) | Method and apparatus for caching victimized branch predictions | |
US20070033385A1 (en) | Call return stack way prediction repair | |
US5893146A (en) | Cache structure having a reduced tag comparison to enable data transfer from said cache | |
US6212621B1 (en) | Method and system using tagged instructions to allow out-of-program-order instruction decoding | |
JP3794918B2 (en) | Branch prediction that classifies branch prediction types using return selection bits | |
US5951671A (en) | Sharing instruction predecode information in a multiprocessor system | |
US6175909B1 (en) | Forwarding instruction byte blocks to parallel scanning units using instruction cache associated table storing scan block boundary information for faster alignment | |
EP0912927B1 (en) | A load/store unit with multiple pointers for completing store and load-miss instructions | |
WO1998002806A1 (en) | A data address prediction structure utilizing a stride prediction method | |
EP0912929B1 (en) | A data address prediction structure and a method for operating the same | |
WO1998020421A1 (en) | A way prediction structure | |
WO1998020416A1 (en) | A stride-based data address prediction structure | |
EP1005675B1 (en) | A data memory unit configured to store data in one clock cycle and method for operating same | |
EP1015980B1 (en) | A data cache capable of performing store accesses in a single clock cycle | |
EP0912930B1 (en) | A functional unit with a pointer for mispredicted branch resolution, and a superscalar microprocessor employing the same | |
EP0919027B1 (en) | A delayed update register for an array | |
EP0912925B1 (en) | A return stack structure and a superscalar microprocessor employing same | |
KR20220154821A (en) | Handling the fetch stage of an indirect jump in the processor pipeline |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A1; Designated state(s): CN JP KR |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
WWE | Wipo information: entry into national phase | Ref document number: 1996925349; Country of ref document: EP |
WWP | Wipo information: published in national office | Ref document number: 1996925349; Country of ref document: EP |
NENP | Non-entry into the national phase | Ref country code: JP; Ref document number: 1998505961; Format of ref document f/p: F |
WWW | Wipo information: withdrawn in national office | Ref document number: 1996925349; Country of ref document: EP |