EP1019831A1 - Way prediction unit and method of operating the same - Google Patents

Way prediction unit and method of operating the same

Info

Publication number
EP1019831A1
Authority
EP
European Patent Office
Prior art keywords
address
way
fetch
unit
recited
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP96925321A
Other languages
German (de)
English (en)
Inventor
Thang M. Tran
James K. Pickett
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced Micro Devices Inc
Original Assignee
Advanced Micro Devices Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Micro Devices Inc filed Critical Advanced Micro Devices Inc
Publication of EP1019831A1 (fr)
Legal status: Withdrawn

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802 - Instruction prefetching
    • G06F9/3804 - Instruction prefetching for branches, e.g. hedging, branch folding
    • G06F9/3806 - Instruction prefetching for branches, e.g. hedging, branch folding using address prediction, e.g. return stack, branch history buffer
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0864 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using pseudo-associative means, e.g. set-associative or hashing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60 - Details of cache memory
    • G06F2212/608 - Details relating to cache mapping
    • G06F2212/6082 - Way prediction in set-associative cache

Definitions

  • TITLE: A WAY PREDICTION UNIT AND A METHOD FOR OPERATING THE SAME
  • This invention relates to the field of superscalar microprocessors and, more particularly, to branch prediction mechanisms employed within superscalar microprocessors.
  • The term "clock cycle" refers to an interval of time during which the pipeline stages of a microprocessor perform their intended functions. At the end of a clock cycle, the resulting values are moved to the next pipeline stage.
  • Because superscalar microprocessors execute multiple instructions per clock cycle and the clock cycle is short, a high bandwidth memory system is required to provide instructions to the superscalar microprocessor (i.e. a memory system that can provide a large number of bytes in a short period of time). Without a high bandwidth memory system, the microprocessor would spend a large number of clock cycles waiting for instructions to be provided, then would execute the received instructions in a relatively small number of clock cycles. Overall performance would be degraded by the large number of idle clock cycles.
  • Superscalar microprocessors are ordinarily configured into computer systems with a large main memory composed of dynamic random access memory (DRAM) cells.
  • DRAM cells are characterized by access times which are significantly longer than the clock cycle of modern superscalar microprocessors. Also, DRAM cells typically provide a relatively narrow output bus to convey the stored bytes to the superscalar microprocessor. DRAM cells therefore form a memory system that delivers a relatively small number of bytes in a relatively long period of time, and do not form a high bandwidth memory system.
  • Because superscalar microprocessors are typically not configured into a computer system with a memory system having sufficient bandwidth to continuously provide instructions for execution, superscalar microprocessors are often configured with an instruction cache.
  • Instruction caches are multiple blocks of storage locations, configured on the same silicon substrate as the microprocessor or coupled nearby. The blocks of storage locations are used to hold previously fetched instruction bytes. The bytes can be transferred from the instruction cache to the instruction processing pipelines quickly; commonly one or two clock cycles are required.
  • Instruction caches are typically organized into an "associative" structure.
  • In an associative structure, the blocks of storage locations are accessed as a two-dimensional array having rows and columns.
  • a "fetch control unit” for example, a fetch PC unit
  • searches the instruction cache for instructions residing at an address a number of bits from the address are used as an "index" into the cache
  • The index selects a particular row within the two-dimensional array; therefore, the number of address bits required for the index is determined by the number of rows configured into the instruction cache.
  • The addresses associated with instruction bytes stored in the multiple blocks of a row are examined to determine if any of the addresses stored in the row match the requested address. If a match is found, the access is said to be a "hit", and the instruction cache provides the associated instruction bytes. If a match is not found, the access is said to be a "miss".
  • When a miss is detected, the fetch control unit causes the instruction bytes to be transferred from the memory system into the instruction cache.
  • The addresses associated with instruction bytes stored in the cache are also stored. These stored addresses are referred to as "tags".
  • The blocks of memory configured into a row form the columns of the row. Each block of memory is referred to as a "way"; multiple ways comprise a row. The way is selected by providing a way value to the instruction cache, and the way value is determined by examining the tags for a row and finding a match between one of the tags and the input address from the fetch control unit.
  • The term "fetch control unit" refers to a unit configured to fetch instructions from the instruction cache and to cause instructions not residing in the instruction cache to be transferred into the instruction cache.
  • Associative caches provide better "hit rates" (i.e. a higher percentage of accesses to the cache are hits) than caches that are configured as a linear array of storage locations (typically referred to as a direct-mapped configuration). The hit rates are better for an associative cache because instruction bytes stored at multiple addresses having the same index may be stored in the associative cache simultaneously, whereas a direct-mapped cache is capable of storing only one set of instruction bytes per index.
  • For example, a program having a loop that extends over two addresses with the same index can store instruction bytes from both addresses in an associative instruction cache, but will have to repeatedly reload the two addresses each time the loop is executed in a microprocessor having a direct-mapped cache.
  • The hit rate in an instruction cache is important to the performance of the superscalar microprocessor, because when a miss is detected the instructions must be fetched from the memory system. The microprocessor will quickly become idle while waiting for the instructions to be provided.
  • Unfortunately, associative caches require more access time than direct-mapped caches, because the tags of a row must be examined and compared to the input address in order to select the correct way.
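  • To make the index, tag, and way terminology concrete, the following C sketch models a set-associative instruction cache lookup. The 32 KB, sixteen-byte-line, eight-way geometry matches the embodiment described later in this document; the names and the sequential loop are illustrative only, since real hardware examines all ways of a row in parallel.

```c
/* Minimal sketch of a set-associative instruction cache lookup.
 * Geometry (32 KB, 16-byte lines, 8 ways) follows the embodiment in
 * this document; all names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

#define WAYS        8
#define LINE_BYTES  16
#define CACHE_BYTES (32 * 1024)
#define ROWS        (CACHE_BYTES / (LINE_BYTES * WAYS))  /* 256 rows */

typedef struct {
    bool     valid;
    uint32_t tag;                 /* address bits above the index bits */
    uint8_t  bytes[LINE_BYTES];   /* the cached instruction bytes      */
} CacheLine;

static CacheLine cache[ROWS][WAYS];

/* Returns the way that hits, or -1 on a miss. Hardware performs the
 * tag comparisons for every way of the indexed row in parallel. */
int icache_lookup(uint32_t addr) {
    uint32_t index = (addr / LINE_BYTES) % ROWS;   /* row selection  */
    uint32_t tag   = addr / (LINE_BYTES * ROWS);   /* remaining bits */
    for (int way = 0; way < WAYS; way++)
        if (cache[index][way].valid && cache[index][way].tag == tag)
            return way;                            /* hit */
    return -1;                                     /* miss */
}
```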
  • Branch prediction mechanisms predict the next address that a computer program will attempt to fetch from memory, fetch the predicted address, and pass the instructions to the instruction processing pipelines.
  • The predicted address may be the next sequential line in the cache, the target of a branch instruction contained in the current instruction cache line, or some other address.
  • Branch prediction is important to the performance of superscalar microprocessors because they execute multiple instructions per clock cycle. Branches occur often in computer programs, on the average approximately once every four instructions. Therefore, a superscalar microprocessor configured to execute four or more instructions per clock cycle encounters a branch every clock cycle, on the average. Whether a particular branch instruction will be taken or not-taken may depend on the results of instructions that are near the branch instruction in the program, and that therefore execute in parallel with the branch instruction.
  • Superscalar microprocessors employ branch prediction to allow fetching and speculative execution of instructions while the branch is being executed. Various branch prediction mechanisms are well-known.
  • The way prediction unit described herein predicts the next fetch address as well as the way of the instruction cache that the current fetch address hits, while the instructions associated with the current fetch are being read from the instruction cache.
  • The two clock cycle instruction fetch mechanism is thereby advantageously reduced to one clock cycle. Therefore, an instruction fetch can be made every clock cycle until an incorrect next fetch address or an incorrect way is predicted.
  • The instructions from the predicted way are provided to the instruction processing pipelines of the superscalar microprocessor each clock cycle. Relatively higher performance may be achieved than would be possible with a superscalar microprocessor which provides instructions to the instruction processing pipelines every other clock cycle.
  • The way prediction unit may also enable higher frequencies (shorter clock cycles) while advantageously retaining the performance benefits of an associative instruction cache, because the long access time of the associative cache is no longer a limiting factor.
  • The way selection is predicted instead of determined through tag comparisons, removing the dominant barrier to the use of highly associative instruction caches at high frequencies.
  • The present invention contemplates a way prediction unit comprising a first input port, an array, and an output port.
  • The first input port conveys an input address and an update value into the way prediction unit.
  • The array includes a plurality of storage locations, wherein each of the plurality of storage locations is configured to store an address and a way value. Furthermore, the array includes a mechanism to select one of the plurality of storage locations as indexed by the input address.
  • The output port conveys an output address and an output way value, which are the predicted fetch address for the next clock cycle and the predicted way for the fetch occurring in the current clock cycle.
  • The present invention further contemplates a mechanism in a microprocessor for predicting the index of the next block of instructions required by a program executing on said microprocessor and for predicting the way of a fetch address accessing the instruction cache in the current clock cycle, comprising a fetch control unit, a way prediction unit, and an instruction cache.
  • The fetch control unit is configured to produce the address and update value that are inputs to the way prediction unit, and is further configured to receive the predicted address and predicted way from the way prediction unit as inputs.
  • The way prediction unit comprises components as described above, and the instruction cache is configured to store blocks of contiguous instruction bytes and to receive a fetch address from the fetch control unit. The present invention still further contemplates a superscalar microprocessor comprising a way prediction unit as described herein.
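  • As a minimal sketch of the ports just described (all names are hypothetical, not taken from the patent), the unit can be viewed as a function from the current fetch index to a predicted next index and a predicted way, plus an update path used after validation:

```c
/* Sketch of the way prediction unit's ports; names are hypothetical.
 * The input port carries the current fetch index (and, on updates, an
 * update value); the output port carries the predicted next fetch
 * index together with the predicted way for the fetch in flight. */
#include <stdint.h>

typedef struct {
    uint16_t next_index;   /* predicted fetch address for next cycle */
    uint8_t  way;          /* predicted way for this cycle's fetch   */
} WayPrediction;

/* Select the array entry indexed by the current fetch address. */
WayPrediction way_predict(uint16_t current_fetch_index);

/* Store update information the cycle after validation. */
void way_update(uint16_t fetch_index, uint8_t update_way,
                uint8_t update_select);
```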
  • Figure 1 is a block diagram of a superscalar microprocessor employing a branch prediction unit in accordance with the present invention.
  • Figure 2 is a block diagram of several of the units from Figure 1, showing a way prediction unit within the branch prediction unit of Figure 1.
  • Figure 3 is a diagram showing the components of the way prediction unit depicted in Figure 2.
  • Figure 4A is a diagram showing the bit fields within a storage location of the way prediction unit depicted in Figure 3.
  • Figure 4B is a timing diagram depicting important relationships between the way prediction unit and the other units depicted in Figure 2.
  • Figure 4C is another timing diagram depicting several instruction fetches and their corresponding way predictions, including a way misprediction cycle.
  • Figure 4D is yet another timing diagram depicting several instruction fetches and their corresponding way predictions, including a target fetch address misprediction.
  • Figure 1 shows a block diagram of a superscalar microprocessor 200 including a branch prediction unit 220 employing a way prediction unit in accordance with the present invention.
  • superscalar microprocessor 200 includes a prefetch/predecode unit 202 and a branch prediction unit 220 coupled to an instruction cache 204.
  • Instruction alignment unit 206 is coupled between instruction cache 204 and a plurality of decode units 208A-208F (referred to collectively as decode units 208).
  • Each decode unit 208A-208F is coupled to a respective reservation station unit 210A-210F (referred to collectively as reservation stations 210), and each reservation station 210A-210F is coupled to a respective functional unit 212A-212F (referred to collectively as functional units 212).
  • Decode units 208, reservation stations 210, and functional units 212 are further coupled to a reorder buffer 216, a register file 218 and a load/store unit 222.
  • Finally, a data cache 224 is shown coupled to load/store unit 222, and an MROM unit 209 is shown coupled to instruction alignment unit 206.
  • Instruction cache 204 is a high speed cache memory provided to temporarily store instructions prior to their dispatch to decode units 208.
  • Instruction cache 204 is configured to cache up to 32 kilobytes of instruction code organized in lines of 16 bytes each (where each byte consists of 8 bits).
  • Instruction code is provided to instruction cache 204 by prefetching code from a main memory (not shown) through prefetch/predecode unit 202.
  • Prefetch/predecode unit 202 is provided to prefetch instruction code from the main memory for storage within instruction cache 204.
  • Prefetch/predecode unit 202 is configured to burst 64-bit wide code from the main memory into instruction cache 204. It is understood that a variety of specific code prefetching techniques and algorithms may be employed by prefetch/predecode unit 202.
  • As prefetch/predecode unit 202 fetches instructions from the main memory, it generates three predecode bits associated with each byte of instruction code: a start bit, an end bit, and a "functional" bit. The predecode bits form tags indicative of the boundaries of each instruction.
  • The predecode tags may also convey additional information, such as whether a given instruction can be decoded directly by decode units 208 or whether the instruction must be executed by invoking a microcode procedure controlled by MROM unit 209, as will be described in greater detail below.
  • Table 1 indicates one encoding of the predecode tags. As indicated within the table, if a given byte is the first byte of an instruction, the start bit for that byte is set. If the byte is the last byte of an instruction, the end bit for that byte is set. If a particular instruction cannot be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is set. On the other hand, if the instruction can be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is cleared. The functional bit for the second byte of a particular instruction is cleared if the opcode is the first byte, and is set if the opcode is the second byte. It is noted that in situations where the opcode is the second byte, the first byte is a prefix byte. The functional bit values for instruction byte numbers 3-8 indicate whether the byte is a MODRM or an SIB byte, as well as whether the byte contains displacement or immediate data.
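  • The following C sketch applies the encoding described above to the bytes of a single instruction. Since Table 1 itself is not reproduced here, the handling of bytes 3-8 is elided, and the function signature and flags are assumptions made for illustration:

```c
/* Sketch of predecode tag generation per the encoding above; the
 * signature and flags are assumptions, and bytes 3-8 are elided
 * because their handling depends on full x86 decoding (Table 1). */
#include <stdbool.h>

typedef struct {
    bool start;       /* set on the first byte of an instruction */
    bool end;         /* set on the last byte of an instruction  */
    bool functional;  /* meaning varies by byte position         */
} Predecode;

void predecode(Predecode *tags, int len, bool is_mrom, bool has_prefix) {
    for (int i = 0; i < len; i++) {
        tags[i].start = (i == 0);
        tags[i].end   = (i == len - 1);
        tags[i].functional = false;
    }
    /* First byte: set only for instructions that must invoke MROM. */
    tags[0].functional = is_mrom;
    /* Second byte: set only when the opcode is the second byte,
     * i.e. when the first byte is a prefix byte. */
    if (len > 1)
        tags[1].functional = has_prefix;
    /* Bytes 3-8 would flag MODRM/SIB bytes and displacement or
     * immediate data; omitted here. */
}
```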
  • Certain instructions within the x86 instruction set may be directly decoded by decode units 208. These instructions are referred to as "fast path" instructions.
  • The remaining instructions of the x86 instruction set are referred to as "MROM instructions". MROM instructions are executed by invoking MROM unit 209.
  • MROM unit 209 parses and serializes the instruction into a subset of defined fast path instructions to effectuate a desired operation.
  • a listing of exemplary x86 instructions categorized as fast path instructions as well as a description of the manner of handling both fast path and MROM instructions will be provided further below.
  • Instruction alignment unit 206 is provided to channel variable byte length instructions from instruction cache 204 to fixed issue positions formed by decode units 208A-208F. Instruction alignment unit 206 is configured to channel instruction code to designated decode units 208A-208F depending upon the locations of the start bytes of instructions within a line as delineated by instruction cache 204. In one embodiment, the particular decode unit 208A-208F to which a given instruction may be dispatched is dependent upon both the location of the start byte of that instruction as well as the location of the previous instruction's start byte, if any. Instructions starting at certain byte locations may further be restricted for issue to only one predetermined issue position. Specific details follow.
  • Each of the decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above.
  • Each decode unit 208A-208F routes displacement and immediate data to a corresponding reservation station unit 210A-210F.
  • Output signals from the decode units 208 include bit-encoded execution instructions for the functional units 212 as well as operand address information, immediate data and/or displacement data.
  • The superscalar microprocessor of Figure 1 supports out of order execution, and thus employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions.
  • A temporary storage location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register, to thereby store speculative register states.
  • Reorder buffer 216 may be implemented in a first-in- first-out configuration wherein speculative results move to the "bottom" of the buffer as they are validated and written to the register file, thus making room for new entries at the "top" of the buffer.
  • Other specific configurations of reorder buffer 216 are also possible, as will be described further below. If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218.
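  • As an illustration of this first-in-first-out arrangement, here is a hedged C sketch of such a reorder buffer; the entry count, field names, and flag-based discarding of mispredicted-path results are assumptions, not the patented design:

```c
/* Hedged sketch of a FIFO reorder buffer: entries are allocated at the
 * top on decode and retired from the bottom once their results are
 * valid; results on a mispredicted path are discarded rather than
 * written to the register file. Capacity and fields are assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define ROB_ENTRIES 32

typedef struct {
    uint8_t  dest_reg;     /* architectural register being updated */
    uint32_t value;        /* speculative result                   */
    bool     done;         /* result has been produced             */
    bool     mispredicted; /* lies after a mispredicted branch     */
} RobEntry;

static RobEntry rob[ROB_ENTRIES];
static int head, tail, count;       /* bottom, top, occupancy */

int rob_allocate(uint8_t dest_reg) {    /* called at decode */
    int slot = tail;
    rob[slot] = (RobEntry){ .dest_reg = dest_reg };
    tail = (tail + 1) % ROB_ENTRIES;
    count++;
    return slot;                        /* tag used for forwarding */
}

void rob_retire(uint32_t regfile[]) {   /* drains validated results */
    while (count > 0 && rob[head].done) {
        if (!rob[head].mispredicted)
            regfile[rob[head].dest_reg] = rob[head].value;
        head = (head + 1) % ROB_ENTRIES;
        count--;
    }
}
```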
  • Each reservation station unit 210A-210F is capable of holding instruction information (i.e., bit encoded execution bits as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit.
  • Each decode unit 208A-208F is associated with a dedicated reservation station unit 210A-210F, and each reservation station unit 210A-210F is similarly associated with a dedicated functional unit 212A-212F.
  • Accordingly, six dedicated "issue positions" are formed by decode units 208, reservation station units 210 and functional units 212. Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 210B and subsequently to functional unit 212B, and so on.
  • Reorder buffer 216 contains temporary storage locations for results which change the contents of these registers, to thereby allow out of order execution.
  • A temporary storage location of reorder buffer 216 is reserved for each instruction which, upon decode, modifies the contents of one of the real registers. Therefore, at various points during execution of a particular program, reorder buffer 216 may have one or more locations which contain the speculatively executed contents of a given register. If following decode of a given instruction it is determined that reorder buffer 216 has previous location(s) assigned to a register used as an operand in the given instruction, the reorder buffer forwards the most recent speculative value, or a tag for that value if it has not yet been produced, to the corresponding reservation station.
  • Reservation station units 210A-210F are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212F. As stated previously, each reservation station unit 210A-210F may store instruction information for up to three pending instructions. Each of the six reservation stations 210A-210F contains locations to store bit-encoded execution instructions to be speculatively executed by the corresponding functional unit, as well as the values of operands. If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i.e., by completion of the execution of a previous instruction). It is noted that when an instruction is executed by one of the functional units 212A-212F, the result of that instruction is passed directly to any reservation station units 210A-210F that are waiting for that result at the same time the result is passed to update reorder buffer 216 (this technique is commonly referred to as "result forwarding"). Instructions are issued to functional units for execution once their operand values are available.
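  • The result forwarding technique described above can be sketched as a broadcast of a completed result, tagged with its reorder buffer tag, to every waiting reservation station entry; the structure below is an illustrative assumption using the six-unit, three-entry configuration mentioned in the text:

```c
/* Sketch of result forwarding: a completing result is broadcast with
 * its reorder buffer tag, and any reservation station entry waiting on
 * that tag captures the value. Six units with three entries each,
 * matching the text; field names are assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_UNITS   6
#define RS_PER_UNIT 3

typedef struct {
    bool     busy;
    bool     ready[2];    /* operand value present?              */
    int      tag[2];      /* reorder buffer tag if not yet ready */
    uint32_t value[2];    /* operand value once available        */
} RsEntry;

static RsEntry rs[NUM_UNITS][RS_PER_UNIT];

void forward_result(int rob_tag, uint32_t result) {
    for (int u = 0; u < NUM_UNITS; u++)
        for (int e = 0; e < RS_PER_UNIT; e++)
            for (int op = 0; op < 2; op++)
                if (rs[u][e].busy && !rs[u][e].ready[op]
                        && rs[u][e].tag[op] == rob_tag) {
                    rs[u][e].value[op] = result;  /* capture operand */
                    rs[u][e].ready[op] = true;
                }
}
```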
  • Each of the functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations.
  • Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 220. If a branch prediction was incorrect, branch prediction unit 220 flushes instructions subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes prefetch/predecode unit 202 to fetch the required instructions from instruction cache 204 or main memory. It is noted that in such situations, results of instructions in the original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.
  • Results produced by functional units 212 are sent to the reorder buffer 216 if a register value is being updated, and to the load/store unit 222 if the contents of a memory location are changed. If the result is to be stored in a register, the reorder buffer 216 stores the result in the location reserved for the value of the register when the instruction was decoded. As stated previously, results are also broadcast to reservation station units 210A-210F, where pending instructions may be waiting for the results of previous instruction executions to obtain the required operand values.
  • Data cache 224 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the main memory subsystem.
  • Data cache 224 has a capacity of storing up to eight kilobytes of data. It is understood that data cache 224 may be implemented in a variety of specific memory configurations, including a set associative configuration.
  • Load/store unit 222 provides an interface between functional units 212A-212F and data cache 224.
  • Load/store unit 222 is configured with a load/store buffer with sixteen storage locations for data and address information for pending load or store memory operations.
  • Functional units 212 arbitrate for access to the load/store unit 222.
  • The load/store unit 222 also performs dependency checking for load memory operations against pending store memory operations to ensure that data coherency is maintained.
  • Turning now to Figure 2, a block diagram of a portion of microprocessor 200 is shown. Included in the diagram are instruction cache 204 and branch prediction unit 220. Instruction cache 204 is further coupled to instruction alignment unit 206 and decode units 208 (shown in this diagram as a single block, although decode units 208 are shown as several blocks in Figure 1). Within branch prediction unit 220 is a way prediction unit 250 in accordance with the present invention. Branch prediction unit 220 may also contain other units (not shown). In one embodiment, instruction cache 204 is eight-way associative. Shown within instruction cache 204 are a fetch PC unit 254 and instruction cache storage 255.
  • Way prediction unit 250 generates a predicted fetch address for the next cache line to be fetched (i.e. a prediction of the branch prediction address) and a predicted way value for the fetch address accessing the instruction cache in the current clock cycle (the "current fetch address").
  • The predicted fetch address and predicted way are based on the current fetch address.
  • Microprocessor 200 is able to read the predicted line from instruction cache 204 while the branch prediction is being generated, advantageously removing a clock cycle from the instruction fetch process in cases where the predicted fetch address matches the branch prediction address.
  • Predicting the way for the current fetch address makes the instructions for the current fetch address available by the end of the clock cycle in which the fetch address accesses the cache, as opposed to the next cycle if the tag comparison information were used to select the instructions.
  • The predicted way is validated by the tag comparison information in the following cycle.
  • The predicted fetch address and predicted way are transferred on a prediction bus 251 to fetch PC unit 254.
  • The predicted fetch address is a partial address containing the index bits used to index instruction cache 204.
  • Way prediction unit 250 predicts the next instruction cache line index to be fetched and the way of the instruction cache which contains the current fetch address, using the current fetch address as conveyed from fetch PC unit 254 on a fetch request bus 252.
  • Fetch request bus 252 also conveys way selection information for instruction cache 204, and a next index address, which is the current fetch address incremented by the size of one instruction cache line.
  • The instruction cache line size is sixteen bytes, and therefore the next index address is the current fetch address incremented by sixteen.
  • The current fetch address conveyed on fetch request bus 252 is the predicted fetch address conveyed by way prediction unit 250 in the previous clock cycle, except for cycles in which fetch PC unit 254 detects a way or predicted fetch address misprediction.
  • The way prediction is validated by comparing the actual way that the fetch address hits in instruction cache 204 (determined via a tag compare against the full fetch address) to the way prediction. If the way prediction matches the actual way, then the way prediction is correct. If the way prediction is wrong, then the correct way is selected from the eight ways of instruction cache 204 (which were latched from the previous clock cycle), the instructions associated with the predicted way are discarded, and the predicted fetch address is discarded as well.
  • The predicted fetch address is validated by comparing the predicted fetch address to the branch prediction address generated by branch prediction unit 220. In embodiments where the predicted fetch address is an index address, only the index bits of the branch prediction address are compared to the predicted fetch address.
  • Fetch PC unit 254 sends update information to way prediction unit 250 on fetch request bus 252.
  • The update information for a particular predicted fetch address is sent the clock cycle following the predicted address validation, and includes an update way value and update selection control bits. If the prediction is correct, the update information is the predicted information. If the prediction is incorrect, the update way value is the way of instruction cache 204 which contains the instructions actually fetched, and the update selection control bits indicate whether the branch prediction address is a taken branch, a next sequential line fetch, or a RET instruction.
  • The update information is stored by way prediction unit 250 such that the next prediction using a similar current fetch address will include the update information in the prediction mechanism.
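  • A hedged C sketch of this validation step follows; the function and argument names are assumptions. The predicted way is checked against the way that actually hit (from the full tag compare), and the predicted fetch index against the index bits of the branch prediction address:

```c
/* Hedged sketch of the validation step; names are assumptions. The
 * predicted way is compared to the way that actually hit (from the
 * full tag compare), and the predicted fetch index to the index bits
 * of the branch prediction address. */
#include <stdbool.h>
#include <stdint.h>

bool validate_prediction(uint8_t predicted_way, uint8_t actual_way,
                         uint16_t predicted_index, uint16_t branch_index) {
    if (predicted_way != actual_way) {
        /* Way mispredict: select the correct line from the eight ways
         * latched last cycle; discard the predicted way's instructions
         * and the predicted fetch address. */
        return false;
    }
    if (predicted_index != branch_index) {
        /* Address mispredict: discard the fetch in flight and restart
         * from the branch prediction address. */
        return false;
    }
    /* Correct prediction: the update sent next cycle simply restores
     * the predicted information. */
    return true;
}
```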
  • The prediction mechanism will be explained in more detail with respect to Figure 3.
  • Figure 2 also depicts a return stack address bus 253 connected to way prediction unit 250.
  • Return stack address bus 253 conveys the address that is currently at the top of the return stack in decode units 208.
  • The return stack is a stack of addresses that refer to instructions following previously executed CALL instructions.
  • A RET instruction would use the address at the top of the return stack to locate the next instruction to be executed.
  • The CALL and RET instructions are defined by the x86 architecture as subroutine entrance and exit instructions, respectively.
  • Way prediction unit 250 uses the next index address provided on fetch request bus 252 and the return stack address as sources for the predicted fetch address. The return stack address is selected during clock cycles in which way prediction unit 250 predicts that a RET instruction is in the cache line currently being fetched. Alternatively, the next index address is selected during clock cycles in which way prediction unit 250 predicts that no branch-type instructions exist in the cache line currently being fetched. Way prediction unit 250 also selects internally stored addresses as the predicted fetch address during clock cycles in which the prediction mechanism predicts that a branch-type instruction exists in the cache line, as will be described in further detail below.
  • Turning now to Figure 3, way prediction unit 250 is shown with return stack bus 253 and fetch request bus 252 connected to it.
  • The current fetch address conveyed on fetch request bus 252 is decoded by a decoder circuit 300. The resulting select lines are stored by a delay latch 301, and also select a storage location from a way prediction array 302.
  • Way prediction array 302 is configured as a linear array of storage locations.
  • Way prediction array 302 is composed of registers. Each storage location is configured to store prediction addresses, a predicted way, and target selection control bits.
  • The prediction addresses are branch prediction addresses previously generated from instructions residing at a fetch address with the same index bits as the current fetch address.
  • The predicted way is the last correctly predicted way for a fetch address with the same index as the current fetch address.
  • The target selection control bits indicate which of the stored prediction addresses should be selected, or whether the next index address or the return stack address should be selected. When microprocessor 200 is initialized, the target selection control bits of the storage locations within way prediction array 302 are set to select the next index address.
  • Delay latch 301 transfers its value to an update latch 304.
  • Update latch 304 selects a storage location within way prediction array 302 for storing the update information provided by fetch PC unit 254.
  • Delay latch 301 and update latch 304 store the decoded selection lines for way prediction array 302, and thus avoid the need for a second decoder circuit similar to decoder 300.
  • A decoder circuit such as decoder 300 is larger (in terms of silicon area) than delay latch 301 and update latch 304. Therefore, silicon area is saved by implementing this embodiment instead of an embodiment with a second decoder for updates.
  • Way prediction unit 250 is also configured with an address selection device for selecting the address to provide as the predicted fetch address.
  • The address selection device is a multiplexor 305 and an address selection circuit 306.
  • Address selection circuit 306 receives the target selection control bits from way prediction array 302 and produces multiplexor select lines for multiplexor 305.
  • Address selection circuit 306 causes multiplexor 305 to select the first address from way prediction array 302 if the target selection control bits contain the binary value "01"; the second address from way prediction array 302 if the target selection control bits contain the binary value "10"; the next index address from fetch request bus 252 if the target selection bits contain the binary value "00"; and the return stack address from return stack address bus 253 if the selection control bits contain the binary value "11". Therefore, address selection circuit 306 is a decode of the target selection control bits. The predicted address is conveyed on prediction bus 251 along with the predicted way selected from way prediction array 302.
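  • The decode performed by address selection circuit 306 can be written out directly; the following C sketch mirrors the four encodings listed above (the two stored prediction addresses are the fields described with Figure 4A below), with illustrative names:

```c
/* Sketch of the target selection decode performed by address selection
 * circuit 306, mirroring the four encodings above; names illustrative. */
#include <stdint.h>

uint16_t select_predicted_address(uint8_t  target_select, /* 2 bits     */
                                  uint16_t stored_addr0,  /* field 400  */
                                  uint16_t stored_addr1,  /* field 401  */
                                  uint16_t next_index,    /* fetch + 16 */
                                  uint16_t return_stack) {
    switch (target_select & 0x3) {
    case 0x0: return next_index;    /* "00": sequential line fetch     */
    case 0x1: return stored_addr0;  /* "01": first stored prediction   */
    case 0x2: return stored_addr1;  /* "10": second stored prediction  */
    default:  return return_stack;  /* "11": RET instruction predicted */
    }
}
```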
  • Turning now to Figure 4A, a diagram of one of the storage locations of way prediction array 302 (shown in Figure 3) is shown.
  • Two prediction addresses are stored (shown as fields 400 and 401).
  • Each of the prediction addresses is 12 bits wide. Way prediction information is stored in a field 402, which is 3 bits wide in this embodiment to encode the eight ways of instruction cache 204.
  • In another embodiment, field 402 is 8 bits wide and the way prediction information is not encoded; instead, one of the eight bits is set, indicating the predicted way. Target selection bits are also stored within the storage location in field 403, which is 2 bits wide in this embodiment to encode selection of prediction address field 400, prediction address field 401, the next index address, or the return stack address.
  • In another embodiment, field 403 is four bits wide and the target selection bits are not encoded; instead, one of the four bits is set, indicating one of the four possible prediction addresses.
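  • Using the encoded variant of the field widths given above, one storage location can be sketched as a 29-bit record; C bitfields are used here purely for clarity, as an assumption about layout rather than the actual circuit:

```c
/* Sketch of one way prediction array entry using the encoded field
 * widths above (29 bits total). C bitfields are for clarity only;
 * the layout is an assumption, not the actual circuit. */
#include <stdint.h>

typedef struct {
    uint32_t pred_addr0    : 12;  /* field 400: first prediction address  */
    uint32_t pred_addr1    : 12;  /* field 401: second prediction address */
    uint32_t predicted_way : 3;   /* field 402: encodes one of 8 ways     */
    uint32_t target_select : 2;   /* field 403: see decode sketch above   */
} WayPredEntry;
```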
  • Turning now to Figure 4B, a timing diagram depicting important relationships between way prediction unit 250, instruction cache 204 and fetch PC unit 254 is shown.
  • During the first clock cycle (ICLK1), a current fetch address is sent from fetch PC unit 254 to instruction cache 204 and way prediction unit 250.
  • Instruction cache 204 transfers the associated instructions to its output bus during the time indicated by the horizontal line 420, and latches them.
  • In parallel, way prediction unit 250 indexes into way prediction array 302 and selects a storage location. From the value of the target selection control bits stored within the selected storage location, a predicted fetch address is generated.
  • The predicted fetch address and the predicted way from the selected storage location are conveyed to fetch PC unit 254 near the end of ICLK1, as indicated by arrow 421.
  • The predicted way is selected from the eight ways indexed by the current fetch address in ICLK1, and the selected instructions are scanned to form a branch prediction address. Also, the selected instructions are forwarded to instruction alignment unit 206 in ICLK2.
  • In ICLK2, fetch PC unit 254 determines whether or not the previous fetch address is an instruction cache hit in the predicted way, and branch prediction unit 220 generates a branch prediction address, as indicated by arrow 422. This information is used as described above to validate the predicted fetch address that is currently accessing instruction cache 204 and the predicted way that was provided in ICLK1. If a way misprediction is detected, the correct instructions are selected from the eight ways of instruction cache 204 latched in ICLK1, and the instructions read in ICLK2 are discarded.
  • It is noted that return stack address predictions require two extra clock cycles to validate (as indicated by Figure 4B), as compared to next index address or branch prediction address predictions.
  • Turning now to Figure 4C, during ICLK1 a current fetch address A is conveyed to instruction cache 204 and way prediction unit 250, as indicated by block 440.
  • A prediction address B is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 441.
  • During ICLK2, address B is conveyed as the current fetch address because it was predicted in ICLK1, as indicated by block 442.
  • A prediction address C is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 443.
  • Also in ICLK2, address A is determined to hit in the instruction cache in the predicted way, and at arrow 445 a branch prediction associated with address A is calculated.
  • The branch prediction address matches the predicted address B, and therefore address B is a valid prediction.
  • During ICLK3, at arrow 446, the predicted way for address B is found to be incorrect. Therefore, the correct instructions are selected from the eight ways that were latched in the previous cycle, as indicated by block 447.
  • Turning now to Figure 4D, a timing diagram is shown depicting several consecutive instruction fetches, to further illustrate the interaction between fetch PC unit 254, way prediction unit 250, and instruction cache 204.
  • During ICLK1, a current fetch address A is conveyed to instruction cache 204 and way prediction unit 250, as indicated by block 460.
  • A prediction address B is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 461.
  • During ICLK2, address B is conveyed as the current fetch address because it was predicted in ICLK1, as indicated by block 462.
  • A prediction address C is determined by way prediction unit 250 and conveyed to fetch PC unit 254, as indicated by block 463.
  • Also in ICLK2, address A is determined to hit in the instruction cache in the predicted way, and at arrow 465 a branch prediction associated with address A is calculated.
  • The branch prediction address matches the predicted address B, and therefore address B is a valid prediction.
  • During ICLK3, address C is used as the current fetch address, as indicated by block 466.
  • Address B is determined to be a hit in the predicted way, as indicated by arrow 467.
  • The branch prediction associated with address B is then determined, and the branch prediction address does not match address C. Therefore, the predicted fetch address being conveyed in ICLK3 is ignored, as well as the instructions associated with address C.
  • During ICLK4, the corrected branch prediction address C is used as the current fetch address, as indicated by block 469.
  • A predicted fetch address and way based on the corrected address C are generated by way prediction unit 250 in ICLK4, and the current fetch address for ICLK5 will reflect that prediction.
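  • Tying the timing diagrams together, the following deliberately simplified C sketch models the fetch loop: each cycle the current address accesses the cache while the way prediction unit produces the next address, and the previous cycle's fetch is validated one cycle behind, overriding the prediction on a mismatch. The stub functions stand in for the cache, predictor, and branch prediction logic and are assumptions:

```c
/* Deliberately simplified fetch loop modeling Figures 4B-4D; the stub
 * functions stand in for the predictor, tag compare, and branch
 * prediction logic and are assumptions for illustration only. */
#include <stdbool.h>
#include <stdint.h>

static uint16_t predict_next(uint16_t index) {              /* stub */
    return index + 1;                 /* e.g. next sequential line */
}
static bool validate(uint16_t index, uint16_t *corrected) { /* stub */
    (void)index; *corrected = 0; return true;
}

void fetch_loop(uint16_t start, int cycles) {
    uint16_t current = start, previous = start;
    for (int clk = 0; clk < cycles; clk++) {
        /* The current address reads the cache and the way prediction
         * array in parallel (its ICLK1). */
        uint16_t predicted = predict_next(current);

        /* Meanwhile the previous fetch completes its tag compare and
         * branch prediction (its ICLK2); a mismatch discards the fetch
         * in flight and overrides the prediction. */
        uint16_t corrected;
        bool ok = validate(previous, &corrected);

        previous = current;
        current  = ok ? predicted : corrected;
    }
}
```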
  • It is noted that the number and size of addresses stored within way prediction array 302 may differ for other embodiments. In particular, the number of addresses stored may be more or less than in the embodiment of Figure 3. Furthermore, the number of external addresses added to the address prediction selection may vary from embodiment to embodiment, as will the number and encoding of the target selection control bits. It is also noted that the portion of the address stored within way prediction array 302 may vary from embodiment to embodiment, and the entire address may be stored in another embodiment. It is further noted that other embodiments could store multiple way predictions and select among them in a manner similar to the address selection device shown in Figure 3. It is also noted that some embodiments may store other information with each predicted address in way prediction array 302. For example, a way, a byte position within the instruction cache line, and branch prediction counter information may be stored within fields 400 and 401.
  • In summary, a superscalar microprocessor employing a way prediction unit is disclosed.
  • The way prediction unit is provided to reduce the number of clock cycles needed to predict the next address that a code stream will fetch from the instruction cache. By reducing the number of clock cycles required to predict the address from two to one, instruction cache utilization is raised. Instructions are provided to the instruction processing pipelines continuously, advantageously reducing the idle clock cycles that the superscalar microprocessor endures. Therefore, overall performance may be increased.

Abstract

A way prediction unit for a superscalar microprocessor is described which predicts the next fetch address as well as the way of the instruction cache accessed by the current fetch address, while the instructions associated with the current fetch are being read from the instruction cache. The way prediction unit is intended for high frequency microprocessors in which associative caches tend to limit the clock cycle, such that the instruction fetch mechanism would otherwise require more than one clock cycle between fetch requests. Consequently, an instruction fetch can be performed every clock cycle using the predicted fetch address until an incorrect next fetch address or an incorrect way is predicted. The instructions from the predicted way are provided to the instruction processing pipelines of the superscalar microprocessor each clock cycle.
EP96925321A 1996-07-16 1996-07-16 Way prediction unit and method of operating the same Withdrawn EP1019831A1 (fr)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US1996/011755 WO1998002817A1 (fr) 1996-07-16 1996-07-16 Way prediction unit and method of operating the same

Publications (1)

Publication Number Publication Date
EP1019831A1 true EP1019831A1 (fr) 2000-07-19

Family

ID=22255458

Family Applications (1)

Application Number Title Priority Date Filing Date
EP96925321A Withdrawn EP1019831A1 (fr) 1996-07-16 1996-07-16 Unite de prediction de blocs de memoire et son procede de fonctionnement

Country Status (2)

Country Link
EP (1) EP1019831A1 (fr)
WO (1) WO1998002817A1 (fr)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5283873A (en) * 1990-06-29 1994-02-01 Digital Equipment Corporation Next line prediction apparatus for a pipelined computed system
US5235697A (en) * 1990-06-29 1993-08-10 Digital Equipment Set prediction cache memory system using bits of the main memory address
JP2636088B2 (ja) * 1991-03-15 1997-07-30 NEC Kofu Ltd Information processing device
US5418922A (en) * 1992-04-30 1995-05-23 International Business Machines Corporation History table for set prediction for accessing a set associative cache

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO9802817A1 *

Also Published As

Publication number Publication date
WO1998002817A1 (fr) 1998-01-22

Similar Documents

Publication Publication Date Title
US5761712A (en) Data memory unit and method for storing data into a lockable cache in one clock cycle by previewing the tag array
US5802588A (en) Load/store unit implementing non-blocking loads for a superscalar microprocessor and method of selecting loads in a non-blocking fashion from a load/store buffer
US5845323A (en) Way prediction structure for predicting the way of a cache in which an access hits, thereby speeding cache access time
US6073230A (en) Instruction fetch unit configured to provide sequential way prediction for sequential instruction fetches
US6167510A (en) Instruction cache configured to provide instructions to a microprocessor having a clock cycle time less than a cache access time of said instruction cache
US6101577A (en) Pipelined instruction cache and branch prediction mechanism therefor
US5765035A (en) Recorder buffer capable of detecting dependencies between accesses to a pair of caches
US5893146A (en) Cache structure having a reduced tag comparison to enable data transfer from said cache
US5903910A (en) Method for transferring data between a pair of caches configured to be accessed from different stages of an instruction processing pipeline
US5848287A (en) Superscalar microprocessor including a reorder buffer which detects dependencies between accesses to a pair of caches
JP3794918B2 (ja) Branch prediction classifying the type of branch prediction using a return selection bit
US5787474A (en) Dependency checking structure for a pair of caches which are accessed from different pipeline stages of an instruction processing pipeline
EP1005672B1 Load/store unit for completing loads in a non-blocking manner in a superscalar microprocessor
EP0912927B1 Load/store unit with multiple pointers for completing store and cache-miss load instructions
EP0919027B1 Delayed update register for an array
WO1998020421A1 A way prediction structure
EP1019831A1 Way prediction unit and method of operating the same
EP0912925B1 Return address stack structure and superscalar microprocessor comprising the same
EP1005675B1 Data memory unit configured to store data in a single clock cycle and method for operating the same
EP0912930B1 Functional unit with a pointer for mispredicted branch resolution, and superscalar microprocessor comprising such a unit
EP0912929B1 Data address prediction structure and a method for operating the same
EP1015980B1 Data cache capable of performing memory accesses in a single clock cycle
WO1998002806A1 A data address prediction structure utilizing a stride prediction method
KR100417459B1 Load/store unit having multiple pointers for completing store and load-miss instructions
WO1998020416A1 A cadence-based data address prediction structure

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19981217

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE ES FR GB NL

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20030515