US10909034B2 - Issue queue snooping for asynchronous flush and restore of distributed history buffer - Google Patents
- Publication number
- US10909034B2 (application US15/845,757, filed as US201715845757A)
- Authority
- US
- United States
- Prior art keywords
- restore
- register file
- itag
- bus
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0804—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with main memory updating
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1405—Saving, restoring, recovering or retrying at machine instruction level
- G06F11/1407—Checkpointing the instruction stream
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1008—Correctness of operation, e.g. memory ordering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
Definitions
- the present disclosure generally relates to data processing systems, and more specifically, to techniques for performing a flush and restore of a distributed history buffer in a processing unit.
- High performance processors used in data processing systems today may be capable of “superscalar” operation and may have “pipelined” elements.
- Such processors may include multiple execution/processing slices that are able to operate in parallel to process multiple instructions in a single processing cycle.
- Each execution slice may include a register file and history buffer that includes the youngest and oldest copies, respectively, of architected register data.
- Each instruction that is fetched may be tagged by a multi-bit instruction tag. Once the instructions are fetched and tagged, the instructions may be executed (e.g., by an execution unit) to generate results, which are also tagged.
- a Results (or Writeback) Bus, one per execution slice, feeds all slices with the resulting instruction finish data.
- any individual history buffer generally includes one write port per Results/Writeback bus.
- the history buffer is typically a centralized component of the processing unit, such that it can back up the data when a new instruction is dispatched and the target register has to be saved into the back up register file.
- centralized components may not be feasible for processors that include multiple execution/processing slices. For example, in processors with a large number of processing slices, the number of ports needed for such a centralized history buffer can be extensive, leading to an extensive amount of wires between the distributed execution units.
- history buffer can be expensive to implement in the circuit. For example, as the number of ports associated with the history buffer increases, the circuit area of the history buffer in the processing unit can grow rapidly. This, in turn, creates a compromise on the number of history buffer entries that can be supported in a given circuit area. For example, smaller history buffers generally fill up faster and can impact performance, stalling the dispatch of new instructions until older instructions are retired and free up history buffer entries. On the other hand, larger history buffers are generally expensive to implement and lead to larger circuit size.
- some processing units may use a distributed history buffer design.
- the history buffer may include multiple distributed levels to provide support for the main line execution of instructions in the processing unit.
- the use of distributed history buffers has prompted new issues to emerge as areas of concern.
- One such issue relates to recovery operations for restoring the registers in the register file to the proper states.
- One embodiment presented herein includes a method for performing a flush and restore of a history buffer (HB) in a processing unit.
- the method generally includes identifying an entry of the HB to restore to a register file in the processing unit, sending a restore ITAG of the HB entry to the register file via a first restore bus, and sending restore data of the HB entry and the restore ITAG to the register file via a second restore bus.
- the method also includes, after sending the restore ITAG and the restore data, dispatching an instruction prior to the register file obtaining the restore data.
- the method further includes upon determining that the restore data is still available via the second restore bus, performing a snooping operation to obtain the restore data from the second restore bus for the dispatched instruction.
- FIG. 1 illustrates an example of a data processing system configured to perform a flush and restore of a distributed history buffer, according to one embodiment.
- FIG. 2 illustrates a block diagram of a processor that includes one or more history buffer restoration components, according to one embodiment.
- FIG. 3 illustrates a block diagram of a multi-slice processor 300 configured to perform a flush and restore of a distributed history buffer, according to one embodiment.
- FIG. 4 illustrates an example of a restore function at a register file, according to one embodiment.
- FIG. 5 illustrates an example of a restore operation at a distributed history buffer, according to one embodiment.
- FIG. 6 further illustrates an example of the restore operation at a distributed history buffer, according to one embodiment.
- FIG. 7 is a flow chart illustrating a method for performing issue queue snooping for an asynchronous restore of a distributed history buffer, according to one embodiment.
- a processing unit may use a distributed (e.g., multi-level) history buffer (HB) design to overcome the limitations associated with a single level HB.
- a split-level (two-level) HB may be used in the processing unit.
- a smaller first level (L1) HB may include multiple write ports for sinking the multiple write back busses (e.g., one write port per results/write back bus).
- the L1 HB can move an entry to a larger second level (L2) HB after the valid data for the L1 HB entry has been written back by the write back bus.
- the write back ITAG compares occur on the smaller number of entries in the L1 HB.
- the L2 HB may have a greater number of entries than the L1 HB. However, the L2 HB may include fewer write ports (compared to the L1 HB), reducing the circuit size of the L2 HB. In general, however, a processing unit may include a distributed HB with any number of levels (e.g., three levels or more).
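The split-level flow above can be sketched as a small behavioral model. This is an illustrative sketch only, not the patent's hardware: class and field names are assumptions, and the 16/80 entry sizes echo the example sizes given later in the text.

```python
# Behavioral sketch of a split-level (two-level) history buffer:
# entries enter the small, many-ported L1 HB at eviction time, and
# migrate to the larger, fewer-ported L2 HB once their write back
# data has arrived. All names here are illustrative assumptions.

class HBEntry:
    def __init__(self, itag, lreg, evictor_itag):
        self.itag = itag                  # ITAG of the evicted instruction
        self.lreg = lreg                  # logical register it targeted
        self.evictor_itag = evictor_itag  # ITAG of the evicting instruction
        self.data = None                  # filled in by the write back bus
        self.written_back = False

class SplitLevelHB:
    def __init__(self, l1_size=16, l2_size=80):
        self.l1, self.l1_size = [], l1_size
        self.l2, self.l2_size = [], l2_size

    def evict_into_l1(self, entry):
        if len(self.l1) >= self.l1_size:
            raise RuntimeError("L1 HB full: stall dispatch")
        self.l1.append(entry)

    def write_back(self, wb_itag, wb_data):
        # Write back ITAG compares happen only against the few L1 entries.
        for e in self.l1:
            if e.itag == wb_itag:
                e.data, e.written_back = wb_data, True

    def drain_to_l2(self):
        # Move written-back entries to the larger L2 HB.
        done = [e for e in self.l1 if e.written_back]
        self.l1 = [e for e in self.l1 if not e.written_back]
        for e in done:
            if len(self.l2) < self.l2_size:
                self.l2.append(e)

hb = SplitLevelHB()
hb.evict_into_l1(HBEntry(itag=5, lreg=3, evictor_itag=9))
hb.write_back(5, 0xDEAD)
hb.drain_to_l2()
```

The design point the sketch illustrates: only the small L1 participates in per-cycle write back compares, so the expensive port count stays on the small structure while capacity lives in the L2.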
- data that is saved in the HB may have to be restored to the registers (e.g., general purpose registers (GPRs)) in the future.
- Data may be flushed from the GPRs and need to be restored from the HB for a variety of reasons.
- One reason is branch mis-prediction, where a processing unit mis-predicts the next instruction to process in branch prediction.
- Other reasons for a flush include interrupts, load data misses, data errors, etc.
- the conventional recovery process typically involves marking HB entries (e.g., having data to be recovered) and reading the entries out of the HB.
- the data is then sent through the issue queue, which issues an instruction (e.g., an error correcting code (ECC) correction instruction) to the execution unit (e.g., fixed/floating point unit, such as a vector scalar unit (VSU)).
- the execution unit may perform an error recovery process, and place the recovery data on its own result bus (e.g., write back bus).
- the data can then travel from the result bus to all of the GPR copies to write in the recovery data.
- Each distributed HB can be recovered simultaneously this way through their respective execution unit. Performing this process, however, for each HB entry in the distributed HB can take a significant amount of time.
- the amount of time that it takes to perform a flush/restore process can be reduced by enabling the processing unit to perform an asynchronous flush/restore of the distributed HB.
- the processing unit can restore an ITAG and data of an HB entry asynchronously in order to speed up dispatch after flush/restore handling. That is, the restore ITAG and restore data can be sent asynchronously on different restore busses.
- the processing unit may include two restore busses: an “ITAG only” restore bus, and an “ITAG+Write Back (WB) data” restore bus.
- the “ITAG+WB data” restore bus may be a bus going from the distributed HB to the issue queue to be issued out to the execution unit.
- the data from the execution unit may be written back into the register file via a write back mechanism.
- the “ITAG only” restore bus may be a direct bus going from the distributed HB to the register file (e.g., bypassing the issue queue and execution unit), and may be used for restoring HB entries without write back data.
- the restore ITAG is sent out first on the dedicated restore ITAG bus from the distributed HB to write in to the register file as fast as possible.
- the restore data is sent out second (e.g., after a predetermined number of cycles) on the write back bus.
- the register file control logic can sync the restore data with the previously sent restore ITAG before writing the ITAG and data into the register file.
- the dispatch can resume before the restore data shows up in the register file. This, in turn, can speed up the flush/restore process, relative to traditional flush recovery techniques.
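The asynchronous two-bus handoff described above can be modeled behaviorally. All names here are illustrative assumptions; the point is that the restore ITAG lands in the register file control first, and the control logic later pairs it with the restore data arriving on the write back bus:

```python
# Sketch of the register file control logic syncing an early restore
# ITAG (dedicated "ITAG only" bus) with restore data that arrives a
# few cycles later (the "ITAG + WB data" path through the execution
# unit). Names and structure are assumptions for illustration.

class RFControl:
    def __init__(self):
        self.pending_itags = {}   # lreg -> restore ITAG awaiting data
        self.register_file = {}   # lreg -> (itag, data)

    def receive_restore_itag(self, lreg, itag):
        # Fast path: dedicated bus bypassing the ISQ and execution unit.
        self.pending_itags[lreg] = itag

    def receive_restore_data(self, lreg, itag, data):
        # Slow path: sync against the previously received ITAG before
        # writing ITAG and data into the register file.
        if self.pending_itags.get(lreg) == itag:
            self.register_file[lreg] = (itag, data)
            del self.pending_itags[lreg]

rf = RFControl()
rf.receive_restore_itag(lreg=7, itag=42)               # cycle N
# ... dispatch may already resume here, before the data shows up ...
rf.receive_restore_data(lreg=7, itag=42, data=0x1234)  # cycle N+k
```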
- the flush recovery may not be finished until the register file has received the last ITAG with (full or partial) restore data. Waiting for the last ITAG with restore data to show up in the register file may slow down the dispatch of instructions, and reduce the speed up/efficiency of the asynchronous flush/restore process.
- Embodiments presented herein provide improved techniques for performing an asynchronous flush and restore of a distributed HB in a processing unit, relative to conventional techniques for performing a flush/restore of a HB. More specifically, as described below, embodiments provide techniques for significantly speeding up the flush/restore process for HB entries that do not include write back data (e.g., write back data has not been written to the entries).
- the dispatch can resume when the ITAG of the restoring HB entry (without write back data) is written into the register file, and the issue queue can start snooping the writeback busses for its operand data.
- the restoring HB ITAG can be written first into the register file to enable early dispatch to look up the dependent ITAG.
- This same ITAG and its write back data can also be sent through the issue queue (ISQ) and subsequent execution unit to be placed on the write back bus.
- the dispatch can resume quickly since the ISQ can snoop the writeback bus for operands needed by the new incoming dispatch instructions.
- the ISQ can snoop the ITAG or the LREG to capture the restore data.
- FIG. 1 illustrates an example of a data processing system 100 that may include a HB restoration component for performing a flush/restore of one or more distributed HBs, according to one embodiment.
- the system has a central processing unit (CPU) 110 such as a PowerPC microprocessor (“PowerPC” is a trademark of IBM Corporation).
- the CPU 110 is coupled to various other components by system bus 112 .
- Read only memory (“ROM”) 116 is coupled to the system bus 112 and includes a basic input/output system (“BIOS”) that controls certain basic functions of the data processing system 100 .
- random access memory (“RAM”) 114 , I/O adapter 118 , and communications adapter 134 are also coupled to the system bus 112 .
- I/O adapter 118 may be a small computer system interface (“SCSI”) adapter that communicates with a disk storage device 120 .
- Communications adapter 134 interconnects bus 112 with an outside network enabling the data processing system to communicate with other such systems.
- Input/Output devices are also connected to system bus 112 via user interface adapter 122 and display adapter 136 .
- Keyboard 124 , track ball 132 , mouse 126 and speaker 128 are all interconnected to bus 112 via user interface adapter 122 .
- Display monitor 138 is connected to system bus 112 by display adapter 136 . In this manner, a user is capable of inputting to the system through the keyboard 124 , trackball 132 or mouse 126 and receiving output from the system via speaker 128 and display 138 .
- an operating system such as AIX (“AIX” is a trademark of the IBM Corporation) is used to coordinate the functions of the various components shown in FIG. 1 .
- the CPU 110 includes various registers, buffers, memories, and other units formed by integrated circuitry, and operates according to reduced instruction set computing (“RISC”) techniques.
- the CPU 110 processes according to processor cycles, synchronized, in some aspects, to an internal clock (not shown).
- FIG. 2 illustrates a block diagram of a processor 110 that may be configured to perform ISQ snooping for an asynchronous flush/restore of a distributed HB, according to one embodiment.
- Processor 110 may include one or more HB restoration components and one or more distributed HBs.
- Processor 110 has a bus interface unit 202 coupled to the bus 112 for controlling transfers of data and instructions between memory, such as random access memory 114 , and caches, e.g. instruction cache (I-Cache) 204 and data cache (D-Cache) 206 .
- Instructions may be processed in the processor 110 in a sequence of logical, pipelined stages. However, it should be understood that the functions of these stages may be merged together, so that this particular division of stages should not be taken as a limitation, unless such a limitation is indicated in the claims herein. Indeed, some of the previously described stages are indicated as a single logic unit 208 in FIG. 2 for the sake of simplicity of understanding and because each distinction between stages is not necessarily central to the present invention.
- Logic unit 208 in FIG. 2 includes fetch, branch processing, instruction buffer, decode and dispatch units.
- the logic unit 208 fetches instructions from instruction cache 204 into the instruction buffer, either based on a normal sequence of the instructions or, in the case of a sequence having a conditional branch instruction, a predicted sequence, the predicted sequence being in accordance with addresses selected by the branch processing unit.
- the logic unit 208 also decodes the instructions and dispatches them to an appropriate functional unit (e.g., execution unit) 212.0, 212.1, . . . , 212.n−1 via reservation station 210 .
- logic unit 208 may include an instruction sequencing unit (ISU) (not shown) for dispatching the instructions to the appropriate functional units.
- In executing the instructions, the units 212 input and output information to registers (shown collectively as register file (RF) 216 ).
- the functional units 212 signal the completion unit 218 upon execution of instructions and the completion unit 218 retires the instructions, which includes notifying history buffer (HB) unit 214 .
- the HB unit 214 may save a processor state before, for example, an interruptible instruction, so that if an interrupt occurs, HB control logic may recover the processor state to the interrupt point by restoring the content of registers.
- RF 216 may include an array of processor registers (e.g., GPRs, VSRs, etc.).
- RF 216 can include a number of RF entries or storage locations, each RF entry storing a 64 bit double word and control bits.
- an RF entry may store 128 bit data.
- RF 216 is accessed and indexed by logical register (LREG) identifiers, e.g., r0, r1, . . . , rn.
- Each RF entry holds the most recent (or youngest) fetched instruction and its ITAG.
- each RF entry may also hold the most recent (or youngest) target result data corresponding to a LREG for providing the result data to a next operation.
- a new dispatch target replaces (or evicts) a current RF entry. In such cases, the current RF entry can be moved to the HB unit 214 .
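The eviction step above can be sketched as follows (a hypothetical behavioral model, not the circuit itself; all names are illustrative):

```python
# On dispatch of a new target, the current RF entry for the same LREG
# is evicted and moved to the history buffer, tagged with both its own
# ITAG and the evictor's ITAG. Structure is an assumption for illustration.

def dispatch_target(register_file, hb, lreg, new_itag):
    old = register_file.get(lreg)
    if old is not None:
        # The evicted copy remembers its own ITAG and its evictor's ITAG,
        # which the flush compares later use to decide on restoration.
        hb.append({"lreg": lreg, "itag": old["itag"],
                   "evictor_itag": new_itag, "data": old["data"]})
    register_file[lreg] = {"itag": new_itag, "data": None}

rf, hb = {3: {"itag": 10, "data": 0xAB}}, []
dispatch_target(rf, hb, lreg=3, new_itag=20)
```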
- HB logic 214 may use a multi-level or distributed HB in processor 110 .
- the functional units 212 also assert results on one or more result buses (e.g., write back buses) 230 so that the results may be written by one or more write ports 220 to the registers in the RF 216 .
- the completion unit 218 or logic unit 208 may also notify the HB unit 214 about exception conditions and mis-predicted branches for which instructions should be discarded prior to completion and for which the HB unit 214 should recover a state of the processor 110 as will be further described below.
- the HB unit 214 may also receive other information about dispatched instructions from the logic unit 208 , the RF 216 , and one or more functional units 212 .
- a CPU 110 may have multiple execution/processing slices with each slice having one or more of the units shown in FIG. 2 .
- each processing slice may have its own logic unit 208 , RF 216 , HB unit 214 , reservation station 210 and functional/execution units 212 .
- a CPU 110 having the multiple processing slices may be capable of executing multiple instructions simultaneously, for example, one instruction in each processing slice simultaneously in one processing cycle.
- Such a CPU having multiple processing slices may be referred to as a multi-slice processor or a parallel-slice processor.
- Each processing slice may be an independent processor (e.g., processor 110 ) and may execute instructions independently of other processing slices in the multi-slice processor.
- HB unit 214 may include a HB restoration component (or logic) for performing an asynchronous flush/restore of a distributed HB (e.g., used by the HB unit 214 ).
- the HB restoration component may read out a single HB entry to be restored at a time.
- the HB restoration component may send the restore ITAG first on the “ITAG only” restore bus from the distributed HB to write into the register file 216 control logic as fast as possible.
- the “ITAG only” restore bus may be a dedicated ITAG restore bus to the register file that bypasses the issue queue and the execution unit.
- the HB restoration component may then send the restore data and ITAG second on the “ITAG+WB data” restore bus.
- the RF control logic can sync the restore data with the previously sent restore ITAG before writing the ITAG and data into the RF 216 .
- FIG. 3 illustrates a multi-slice processor 300 configured to perform ISQ snooping for an asynchronous flush/restore of a distributed HB, according to one embodiment. It may be noted that FIG. 3 only shows portions/components of the multi-slice processor 300 that are relevant for this discussion.
- the multi-slice processor 300 includes two processing slices, Slice 0 and Slice 1 .
- Each of the Slices 0 and 1 may include a distributed HB.
- each Slice 0 and 1 includes a two level HB: a L1 HB ( 302 a and 302 b ) and a L2 HB ( 304 a and 304 b ).
- Each level of HB may be implemented as a separate circuit in the processor.
- the L2 HB 304 may include a greater number of entries than the L1 HB 302 .
- the L1 HB 302 may include 16 HB entries and the L2 HB 304 may include 80 HB entries. Note, however, that the L1 HB 302 and L2 HB 304 may include any number of entries.
- Each Slice 0 and 1 also includes an issue queue (ISQ) ( 306 a and 306 b ), and execution unit(s) ( 308 a and 308 b ).
- the execution unit(s) 308 may include a load store unit (LSU), vector scalar unit (VSU), etc.
- a logic unit e.g., logic unit 208 ) may perform instruction fetch and dispatch for the multi-slice processor.
- Slices 0 and 1 may share one or more register file(s) 310 , which may be configured as a register bank, and register file control logic 312 .
- Slices 0 and 1 may each include a register file.
- Slices 0 and 1 may use register file 310 , register file control logic 312 and other components therein for register renaming.
- the ISQ 306 can hold a set of instructions and the reservation station (not shown in FIG. 3 ) can accumulate data for the instruction inputs.
- the reservation station may be a part of the ISQ 306 .
- the ISQ 306 may allocate an RF entry for the instruction.
- the source RF entries required as input for the instruction are looked up and passed on to the reservation station.
- the reservation station passes it on to one or more execution units designated for execution of the instruction.
- Each of the execution units 308 may make result data available on the write back buses (e.g., WB bus 230 ) for writing into a RF entry or HB entry.
- multi-slice processor 300 may include more than two slices with each slice having all the components discussed above for each of the slices 0 and 1 .
- the processing slices may be grouped into super slices (SS), with each super slice including a pair of processing slices.
- a multi-slice processor may include two super slices SS 0 and SS 1 , with SS 0 including slices 0 and 1 , and SS 1 including slices 2 and 3 .
- one register file 216 may be allocated per super slice and shared by the processing slices of the super slice.
- the slices 0 and 1 of the multi-slice processor 300 may be configured to simultaneously execute independent threads (e.g., one thread per slice) in a simultaneous multi-threading mode (SMT).
- multiple threads may be simultaneously executed by the multi-slice processor 300 .
- a super slice may act as a thread boundary.
- threads T 0 and T 1 may execute in SS 0 and threads T 2 and T 3 may execute in SS 1 .
- instructions associated with a single thread may be executed simultaneously by the multiple processing slices of at least one super slice, for example, one instruction per slice simultaneously in one processing cycle.
- the simultaneous processing in the multiple slices may considerably increase processing speed of the multi-slice processor 300 .
- the new instruction may evict the current RF entry associated with the previous instruction (e.g., first instruction), and the current RF entry may be moved to the L1 HB 302 .
- Each entry in the L1 HB 302 may include an ITAG of the previous instruction, the previous instruction, the evictor ITAG of the new instruction and/or one or more control bits.
- the L1 HB entry may also include result data for the first instruction (e.g., from the write back bus 230 ).
- the L1 HB entry can be moved to the L2 HB 304 .
- each Slice 0 and 1 of the multi-slice processor 300 includes two restore buses: a “direct ITAG only” restore bus 330 (e.g., restore bus 330 A in Slice 0 and restore bus 330 B in Slice 1 ); and an “ITAG+WB data” restore bus 340 (e.g., restore bus 340 A in Slice 0 and restore bus 340 B in Slice 1 ).
- the “direct ITAG only” bus 330 is a direct restore bus from the distributed HB (e.g., L1 HB 302 and L2 HB 304 ) to the register file control logic 312 .
- the “direct ITAG only” bus 330 bypasses the ISQ 306 , execution unit(s) 308 and write back bus to register file 310 .
- the “ITAG+WB data” restore bus 340 is a restore bus from the distributed HB to the ISQ 306 to be issued out to the execution unit 308 .
- in some embodiments, the “ITAG+WB data” restore bus 340 may bypass the ISQ 306 .
- the HB unit 214 (via the HB restoration component) may be configured to perform an asynchronous flush/restore of a distributed HB using the restore busses 330 , 340 .
- logic unit 208 may determine to restore one or more entries of the register file 310 with entries of the L1 HB 302 and/or L2 HB 304 , and signal the HB restoration component to perform a flush and restore operation.
- the logic unit 208 may send a flush ITAG to the HB restoration component and the HB restoration component may independently perform two different ITAG compares on L1 HB 302 and/or L2 HB 304 based on the flush ITAG.
- the HB restoration component may perform the flush compare for the distributed HB only (e.g., the HB restoration component may not have to perform flush compares for the GPR/VRF entries in the register file 310 ).
- in a first ITAG compare, the flush ITAG, evictor ITAG, and entry ITAG are compared. If the entry ITAG is older than the flush ITAG and the flush ITAG is older than or equal to the evictor ITAG, then the entry may be marked for restoration (e.g., a restore pending (RP) bit may be set to 1 for the entry).
- in a second ITAG compare, the flush ITAG and entry ITAG are compared. If the flush ITAG is older than or equal to the entry ITAG, then the entry can be invalidated.
- the HB restoration component may generate a vector of HB entries to be restored based on the flush compares with the evictor and entry ITAGs.
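The two independent flush compares can be sketched as below, assuming smaller ITAG values denote older instructions (an assumption for illustration; the hardware uses magnitude compares with wrap handling that this sketch omits):

```python
# Sketch of the flush compares that build the restore-pending vector.
# First compare: mark restore pending (RP) where
#   entry ITAG < flush ITAG <= evictor ITAG
# (the entry itself survives the flush, but its evictor is flushed).
# Second compare: invalidate entries where flush ITAG <= entry ITAG
# (the entry itself is discarded by the flush).

def flush_compare(entries, flush_itag):
    restore_vector = []
    for e in entries:
        if not e["valid"]:
            restore_vector.append(False)
            continue
        rp = e["itag"] < flush_itag <= e["evictor_itag"]
        e["rp"] = rp
        if flush_itag <= e["itag"]:
            e["valid"] = False    # clear the flushed entry
            e["rp"] = False
            rp = False
        restore_vector.append(rp)
    return restore_vector

entries = [
    {"itag": 4, "evictor_itag": 12, "valid": True, "rp": False},
    {"itag": 15, "evictor_itag": 20, "valid": True, "rp": False},
]
vec = flush_compare(entries, flush_itag=10)
```

With a flush at ITAG 10, the first entry (ITAG 4, evicted by ITAG 12) is marked restore pending, while the second entry (ITAG 15) is itself flushed and invalidated.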
- FIG. 4 illustrates an example of a restore function at a register file (e.g., register file 310 ), according to one embodiment.
- FIG. 4 depicts a view of the restore function at the register file 310 from the perspective of the processing slices of the multi-slice processor 300 .
- the register file 310 is addressed by LREG(s).
- the processing slices may decode the LREG(s) from the distributed HB (in the respective processing slices) into a 1-hot write enable vector.
- the processing slices may indicate, for a given LREG vector, that the restore and the particular RF are on the same thread.
- Each LREG vector goes to the GPR register file.
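The LREG-to-write-enable decode described above can be sketched as a 1-hot vector (the register file width is an assumed parameter):

```python
# Decode a logical register identifier into a 1-hot write enable
# vector: exactly one bit set, selecting the register file row to write.

def decode_lreg_1hot(lreg, num_regs=32):
    assert 0 <= lreg < num_regs
    return [1 if i == lreg else 0 for i in range(num_regs)]

we = decode_lreg_1hot(5)
```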
- Each register file entry may include a writeback bit (W), a history bit (H), and ITAG_valid (ITAG+V) (e.g., from the distributed HB) and data (e.g., from the distributed HB).
- setting the writeback bit (W) to “1” may indicate that all writebacks for a thread are finished.
- setting the history bit (H) to “1” may indicate that the data was saved previously and may have to be saved at a future time.
- the ITAG associated with the ITAG+V may no longer be valid (e.g., in the case of completed data).
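The control bits of a register file entry described above can be modeled as a small record (field names are illustrative assumptions):

```python
# A register file entry's state, per the description above: write back
# bit (W), history bit (H), an ITAG with a validity flag, and data.
from dataclasses import dataclass

@dataclass
class RFEntry:
    data: int = 0
    itag: int = 0
    itag_valid: bool = False  # may be cleared once the data has completed
    w: bool = False           # all writebacks for the thread finished
    h: bool = False           # data was saved previously, may need saving again

e = RFEntry()
e.w = True   # writebacks finished
e.h = True   # history: data saved previously
```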
- FIG. 5 illustrates an example of a restore function performed by the HB restoration component at the distributed HB, according to one embodiment.
- the HB restoration component may independently perform two ITAG compares on each level of the distributed HB, e.g., to determine which entries need to be written back to the register file.
- while FIG. 5 shows the restore function for one level (e.g., L1 HB 502 ) of the distributed HB, those of ordinary skill in the art will recognize that the HB restoration component may perform the restore function for each level of the distributed HB.
- L1 HB 502 includes 48 HB entries.
- the restore function may begin with the L1 HB 502 receiving a flush ITAG (e.g., from the logic unit 208 ). Once the flush ITAG is received, the L1 HB 502 (using the HB restoration component) may perform first magnitude compares of the flush ITAG against the ITAG and evictor ITAG in each occupied HB entry. The L1 HB 502 may set a restore pending (RP) flag (or bit) in every entry where the condition “ITAG < Flush ITAG ≤ Evictor ITAG” is met.
- the L1 HB 502 may also perform second magnitude compares of the Flush ITAG and ITAG to determine which HB entries of the L1 HB 502 to invalidate/clear. For example, as shown, the L1 HB 502 may clear every entry where the condition “Flush ITAG ≤ ITAG” is met. In one embodiment, the L1 HB 502 may clear an entry by setting one or more of the writeback bit (W), RP, transactional memory bit (TM), ITAG_V, and Evictor_ITAG_V to “0”.
- FIG. 6 illustrates an example of a HB restoration component restoring entries from a distributed HB, according to one embodiment. Note that while FIG. 6 shows the restore function for one level (e.g., L1 HB) of the distributed HB, those of ordinary skill in the art will recognize that the HB restoration component may perform the restore function for each level of the distributed HB.
- the HB restoration component reads out the LREG, ITAG and any available data associated with the entry.
- the HB restoration component then clears the HB entry.
- the HB restoration component can clear bits RP, W, ITAG_Valid, Evictor_ITAG Valid, etc.
- embodiments presented herein provide techniques for significantly speeding up the flush/restore process for HB entries that do not contain write back data.
- embodiments enable dispatch to resume when the ITAG of the restoring HB entry (without write back data) is written into the register file, allowing the ISQ to snoop the write back busses for its operand data.
- FIG. 7 is a flow chart illustrating a method 700 for performing ISQ snooping for an asynchronous restore of a distributed HB in a processing unit, according to one embodiment.
- the method 700 can be performed by a processing unit (e.g., CPU 110 ) or one or more components (e.g., ISQ 306 , HB restoration component, etc.) of the processing unit.
- the method 700 begins at block 702 , where the processing unit identifies an entry of a HB (e.g., one or more levels of a distributed HB) to restore to a register file.
- the processing unit can perform magnitude flush compares of the ITAG for each entry of the HB against a flush ITAG and evictor ITAG, and mark the entries as restore pending if the comparison satisfies a predetermined condition.
- the processing unit may have a vector of HB entries to be restored.
- the processing unit can read out a single HB entry to be restored at a time.
- the processing unit can be configured to send the restore ITAG of the HB entry and the restore data of the HB entry to the register file in an asynchronous manner. For example, the processing unit can write the restoring HB ITAG into the register file first to enable an early dispatch to look up the dependent ITAG.
- the same ITAG for the HB and its WB data may also be sent through the ISQ and subsequent execution unit to be placed on the WB bus.
- the processing unit sends the restore ITAG of the HB entry to the register file via a dedicated ITAG bus and, at block 706 , the processing unit sends restore data of the HB entry and the ITAG of the HB entry to the register file via a WB bus (e.g., ITAG+WB bus).
- Block 706 may occur once a predetermined amount of time has elapsed after block 704 .
- the processing unit may resume dispatching instructions before the restore data shows up in the register file. For example, once the processing unit determines that the last restore ITAG is visible in the register file control, the processing unit may resume dispatch.
- the processing unit may use the restore ITAG to perform a snooping operation on the WB bus to capture the restore data (block 710 ).
- the ISQ can snoop the ITAG or the LREG to capture the restore data. For example, if the operand data coming from the WB bus is still in the reservation station, the ISQ can compare the restoring ITAG/LREG with the source ITAG/LREG coming from the register file control. If the ISQ determines there is a match, the ERS can capture the data from the WB bus. Enabling the ISQ to snoop the WB bus for operands needed by new incoming dispatch instructions can enable dispatch to resume even more quickly, e.g., compared to waiting for the operand data to show up in the register file.
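The snoop compare above can be sketched as follows. This is a behavioral model under assumed names (snoop_wb_bus, the per-slot dictionary fields); the actual comparison is done by ISQ hardware against the WB bus each cycle.

```python
def snoop_wb_bus(reservation_station, wb_itag, wb_data):
    """Compare each waiting reservation-station slot's source tag against
    the tag currently on the WB bus; on a match, capture the operand
    directly off the bus instead of waiting for a later RF read.

    reservation_station: list of dicts with 'src_itag', 'data', 'ready'.
    """
    for slot in reservation_station:
        if not slot['ready'] and slot['src_itag'] == wb_itag:
            slot['data'] = wb_data   # operand captured off the WB bus
            slot['ready'] = True     # the instruction may now issue
```

A slot whose source tag does not match simply keeps waiting for a later WB beat or a register file read.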
- the processing unit can obtain the operand data by reading the register file (e.g., via a normal RF read).
- the register file control logic can perform a compare between the dispatching source LREG and the source of the restoring LREG. If there is a match, then the register file control logic can bypass the restoring data to the dispatching instruction.
- the processing unit may therefore send the restore ITAG and restore data asynchronously to the register file via different busses.
- the processing unit can send the restore ITAG out first from the HB to the register file via the dedicated ITAG bus (e.g., to write into the register file control as fast as possible). After a predetermined amount of time, the processing unit may then send the restore data and ITAG on the WB bus.
- the register file control may sync the restore data with the previously sent restore ITAG before writing the ITAG and restore data into the register file.
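The two-bus handoff can be sketched as below: the restore ITAG arrives first on the dedicated ITAG bus, the data follows on the WB bus, and the register file control syncs the two before the array write. The class and method names are assumptions for illustration; the patent describes this as control logic, not software.

```python
class RegisterFileControl:
    """Behavioral sketch of syncing an early restore ITAG with the
    restore data that arrives later on the WB bus."""

    def __init__(self):
        self.pending_itags = {}   # lreg -> restore ITAG seen on the ITAG bus
        self.register_file = {}   # lreg -> (itag, data) after the synced write

    def on_itag_bus(self, lreg, itag):
        # Early ITAG write: makes the dependency visible so dispatch can
        # resume before the restore data arrives.
        self.pending_itags[lreg] = itag

    def on_wb_bus(self, lreg, itag, data):
        # Sync step: write the data only if it matches the previously
        # sent restore ITAG for this logical register.
        if self.pending_itags.get(lreg) == itag:
            self.register_file[lreg] = (itag, data)
            del self.pending_itags[lreg]
```

The early ITAG write is what lets dispatch look up the dependent ITAG before the data is present, matching the asynchronous scheme described above.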
- the processing unit may obtain the data for the instruction by at least one of reading the data from the register file, snooping the WB bus, or restoring the data from the RF bypass multiplexer.
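The three operand sources named above can be expressed as a simple priority selection. The helper name and dictionary-based sources are assumptions for illustration; in hardware this is a multiplexer choice, not a lookup.

```python
def fetch_operand(lreg, rf, wb_snoop_hits, bypass_mux):
    """Select the operand for logical register `lreg` from one of the
    three sources described in the patent text, in order of preference."""
    if lreg in rf:               # data already written back into the RF
        return rf[lreg]
    if lreg in wb_snoop_hits:    # still staged on the WB bus, captured by snoop
        return wb_snoop_hits[lreg]
    return bypass_mux[lreg]      # otherwise taken from the RF bypass multiplexer
```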
- if the processing unit determines that the HB data was already completed (e.g., the instruction producing the data in the HB entry was completed), then the ITAG for that particular entry is no longer valid.
- the ISQ has to obtain the LREG from the restoring control logic to enable it to snoop the restoring data coming from the WB bus.
- the processing unit (e.g., via the HB restoration component) can insert the producing LREG into the ITAG field of that HB entry.
- the processing unit may then send the restore ITAG/LREG and restore data asynchronously on different busses. That is, the restore ITAG/LREG can be sent via the dedicated ITAG bus and the restore data and ITAG/LREG can be sent (after a predetermined amount of time) via the WB bus.
- the register file control may sync the restore data with the previously sent restore ITAG/LREG before writing the ITAG and restore data into the register file.
- When dispatch resumes, if the processing unit determines that an instruction needs operand data that is being restored, the processing unit can obtain the data via a normal RF read, assuming the operand data is already in the register file. In cases where the ITAG/LREG is available in the RF control, the ITAG/LREG field for the operand can be read out and sent to the ISQ. In some cases, if the operand data coming from the WB bus is still in the reservation station, the processing unit (via the ISQ) may obtain the data by snooping the WB bus.
- the ISQ can compare the restoring ITAG/LREG with the source ITAG/LREG coming from the register file control, and if there is a match, the reservation station can capture the data from the WB bus. In some cases, as described above, if the operand data has not been written into the register file and is no longer in the reservation station WB staging latch, then the data can be obtained from the register file bypass multiplexer.
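The completed-entry case above can be sketched as follows: since the ITAG is stale, the producing LREG is carried in the ITAG field so the snoop compare can key on the LREG instead. The function and field names are illustrative assumptions.

```python
def tag_for_snoop(hb_entry):
    """Return the (kind, value) tag the ISQ should snoop on the WB bus
    for this restoring HB entry.

    hb_entry: dict with 'completed', 'itag_field', 'lreg'.
    """
    if hb_entry['completed']:
        # The producing instruction completed, so its ITAG is no longer
        # valid; reuse the ITAG field to carry the producing LREG.
        hb_entry['itag_field'] = hb_entry['lreg']
        return ('lreg', hb_entry['lreg'])
    # Normal case: snoop on the still-valid ITAG.
    return ('itag', hb_entry['itag_field'])
```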
- aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.”
- the present invention may be a system, a method, and/or a computer program product.
- the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
- a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
- the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
- a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Retry When Errors Occur (AREA)
Abstract
Description
Claims (20)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US15/845,757 US10909034B2 (en) | 2017-12-18 | 2017-12-18 | Issue queue snooping for asynchronous flush and restore of distributed history buffer |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20190188133A1 US20190188133A1 (en) | 2019-06-20 |
| US10909034B2 true US10909034B2 (en) | 2021-02-02 |
Family
ID=66815152
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/845,757 Expired - Fee Related US10909034B2 (en) | 2017-12-18 | 2017-12-18 | Issue queue snooping for asynchronous flush and restore of distributed history buffer |
Country Status (1)
| Country | Link |
|---|---|
| US (1) | US10909034B2 (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10740140B2 (en) * | 2018-11-16 | 2020-08-11 | International Business Machines Corporation | Flush-recovery bandwidth in a processor |
| US11720354B2 (en) * | 2020-01-07 | 2023-08-08 | SK Hynix Inc. | Processing-in-memory (PIM) system and operating methods of the PIM system |
Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8161235B2 (en) | 2007-10-16 | 2012-04-17 | Hitachi, Ltd. | Storage system and data erasing method |
| US8661227B2 (en) | 2010-09-17 | 2014-02-25 | International Business Machines Corporation | Multi-level register file supporting multiple threads |
| US8823719B2 (en) | 2010-05-13 | 2014-09-02 | Mediatek Inc. | Graphics processing method applied to a plurality of buffers and graphics processing apparatus thereof |
| US20150220439A1 (en) | 2014-02-04 | 2015-08-06 | Microsoft Corporation | Block storage by decoupling ordering from durability |
| US20160253180A1 (en) * | 2015-02-26 | 2016-09-01 | International Business Machines Corporation | History Buffer with Hybrid Entry Support for Multiple-Field Registers |
| US20160283236A1 (en) | 2015-03-25 | 2016-09-29 | International Business Machines Corporation | History Buffer with Single Snoop Tag for Multiple-Field Registers |
| US20160328330A1 (en) * | 2015-05-07 | 2016-11-10 | International Business Machines Corporation | Distributed History Buffer Flush and Restore Handling in a Parallel Slice Design |
| US20160371087A1 (en) * | 2015-06-16 | 2016-12-22 | International Business Machines Corporation | Split-level history buffer in a computer processing unit |
| US20170109167A1 (en) * | 2015-10-14 | 2017-04-20 | International Business Machines Corporation | Method and apparatus for restoring data to a register file of a processing unit |
| US10379867B2 (en) * | 2017-12-18 | 2019-08-13 | International Business Machines Corporation | Asynchronous flush and restore of distributed history buffer |
- 2017-12-18: US application US15/845,757 filed; issued as patent US10909034B2; status: not active, expired, fee related
Patent Citations (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8161235B2 (en) | 2007-10-16 | 2012-04-17 | Hitachi, Ltd. | Storage system and data erasing method |
| US8823719B2 (en) | 2010-05-13 | 2014-09-02 | Mediatek Inc. | Graphics processing method applied to a plurality of buffers and graphics processing apparatus thereof |
| US8661227B2 (en) | 2010-09-17 | 2014-02-25 | International Business Machines Corporation | Multi-level register file supporting multiple threads |
| US20150220439A1 (en) | 2014-02-04 | 2015-08-06 | Microsoft Corporation | Block storage by decoupling ordering from durability |
| US20160253180A1 (en) * | 2015-02-26 | 2016-09-01 | International Business Machines Corporation | History Buffer with Hybrid Entry Support for Multiple-Field Registers |
| US20160283236A1 (en) | 2015-03-25 | 2016-09-29 | International Business Machines Corporation | History Buffer with Single Snoop Tag for Multiple-Field Registers |
| US20160328330A1 (en) * | 2015-05-07 | 2016-11-10 | International Business Machines Corporation | Distributed History Buffer Flush and Restore Handling in a Parallel Slice Design |
| US20160371087A1 (en) * | 2015-06-16 | 2016-12-22 | International Business Machines Corporation | Split-level history buffer in a computer processing unit |
| US20160378501A1 (en) * | 2015-06-16 | 2016-12-29 | International Business Machines Corporation | Split-level history buffer in a computer processing unit |
| US20170109167A1 (en) * | 2015-10-14 | 2017-04-20 | International Business Machines Corporation | Method and apparatus for restoring data to a register file of a processing unit |
| US10379867B2 (en) * | 2017-12-18 | 2019-08-13 | International Business Machines Corporation | Asynchronous flush and restore of distributed history buffer |
Non-Patent Citations (1)
| Title |
|---|
| Tomari, H., Inaba, M. & Hiraki, K. (2010). Compressing Floating-Point Number Stream for Numerical Applications. First International Conference on Networking and Computing. |
Also Published As
| Publication number | Publication date |
|---|---|
| US20190188133A1 (en) | 2019-06-20 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10289415B2 (en) | Method and apparatus for execution of threads on processing slices using a history buffer for recording architected register data | |
| US11093248B2 (en) | Prefetch queue allocation protection bubble in a processor | |
| US20170109093A1 (en) | Method and apparatus for writing a portion of a register in a microprocessor | |
| US10282205B2 (en) | Method and apparatus for execution of threads on processing slices using a history buffer for restoring architected register data via issued instructions | |
| US10073699B2 (en) | Processing instructions in parallel with waw hazards and via a distributed history buffer in a microprocessor having a multi-execution slice architecture | |
| US20220050679A1 (en) | Handling and fusing load instructions in a processor | |
| US11392386B2 (en) | Program counter (PC)-relative load and store addressing for fused instructions | |
| US10956158B2 (en) | System and handling of register data in processors | |
| US11868773B2 (en) | Inferring future value for speculative branch resolution in a microprocessor | |
| US20230367597A1 (en) | Instruction handling for accumulation of register results in a microprocessor | |
| US10545765B2 (en) | Multi-level history buffer for transaction memory in a microprocessor | |
| US20200356369A1 (en) | System and handling of register data in processors | |
| US10909034B2 (en) | Issue queue snooping for asynchronous flush and restore of distributed history buffer | |
| US10379867B2 (en) | Asynchronous flush and restore of distributed history buffer | |
| US10977040B2 (en) | Heuristic invalidation of non-useful entries in an array | |
| US11403109B2 (en) | Steering a history buffer entry to a specific recovery port during speculative flush recovery lookup in a processor | |
| US10929144B2 (en) | Speculatively releasing store data before store instruction completion in a processor | |
| US10802830B2 (en) | Imprecise register dependency tracking | |
| US10996995B2 (en) | Saving and restoring a transaction memory state | |
| US11119774B2 (en) | Slice-target register file for microprocessor | |
| US11520591B2 (en) | Flushing of instructions based upon a finish ratio and/or moving a flush point in a processor | |
| US10740140B2 (en) | Flush-recovery bandwidth in a processor | |
| US11379241B2 (en) | Handling oversize store to load forwarding in a processor | |
| US11301254B2 (en) | Instruction streaming using state migration | |
| US11194578B2 (en) | Fused overloaded register file read to enable 2-cycle move from condition register instruction in a microprocessor |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TERRY, DAVID R.;NGUYEN, DUNG Q.;THOMPTO, BRIAN W.;AND OTHERS;SIGNING DATES FROM 20171211 TO 20171214;REEL/FRAME:044425/0249 |
|
| FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
| STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
| FEPP | Fee payment procedure |
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
| LAPS | Lapse for failure to pay maintenance fees |
Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
| STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
| FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20250202 |