US20080229068A1 - Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches - Google Patents
- Publication number
- US20080229068A1 (application US12/106,360)
- Authority
- US
- United States
- Prior art keywords
- instruction
- thread
- cache
- fetch
- flags
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3802—Instruction prefetching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
Definitions
- the present invention is a divisional of U.S. patent application Ser. No. 11/928,686, (Attorney docket No. YOR920040167US3) entitled “ADAPTIVE FETCH GATING IN MULTITHREADED PROCESSORS, FETCH CONTROL AND METHOD OF CONTROLLING FETCHES” to Pradip Bose et al., filed Oct. 30, 2007, and a continuation of allowed U.S. patent application Ser. No. 11/228,781, (Attorney docket No.
- the present invention generally relates to multi-threaded processors and, more particularly, to reducing power consumption in a Simultaneous MultiThreaded (SMT) processor or microprocessor.
- SMT Simultaneous MultiThreaded
- a scalar processor fetches and issues/executes one instruction at a time. Each such instruction operates on scalar data operands. Each such operand is a single or atomic data value or number.
- Pipelining within a scalar processor introduces what is known as concurrency, i.e., processing multiple instructions at different pipeline stages in a given clock cycle, while preserving the single-issue paradigm.
- a superscalar processor can fetch, issue and execute multiple instructions in a given machine cycle, each in a different execution path or thread. Each instruction fetch, issue and execute path is usually pipelined for further, parallel concurrency.
- Examples of superscalar processors include the Power/PowerPC processors from IBM Corporation, the Pentium processor family from Intel Corporation, the Ultrasparc processors from Sun Microsystems and the Alpha processor and PA-RISC processors from Hewlett Packard Company (HP). Front-end instruction delivery (fetch and dispatch/issue) accounts for a significant fraction of the energy consumed in a typical state of the art dynamic superscalar processor.
- the processor consumes a significant portion of chip power in the instruction cache (ICACHE) during normal access and fetch processes.
- ICACHE instruction cache
- the fetch process stalls, temporarily (e.g., due to instruction buffer fill-up, or cache misses), that portion of chip power falls off dramatically, provided the fetch process is stalled also.
- Buyuktosunoglu I and II focus on reconfiguring the size of issue queues, in conjunction (optionally) with an adjustable instruction fetch rate.
- Manne et al., “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Int'l. Symp. on Computer Architecture (ISCA), 1998, teaches using the processor branch mis-prediction rate in the instruction fetch to effectively control the fetch rate for power and efficiency.
- monitoring the branch prediction accuracy requires additional, significant and complex on-chip hardware that consumes both valuable chip area and power.
- SMT Simultaneous MultiThreading
- In each processor cycle, an SMT processor simultaneously fetches and/or dispatches instructions for different threads that populate the back-end execution resources. Fetch gating in an SMT processor refers to conditionally blocking the instruction fetch process. Thread prioritization involves assigning priorities in the order of fetching instructions from a mix of different workloads in a multi-threaded processor.
- “Boosting SMT Performance by Speculation Control,” Proc. Int'l. Parallel and Distributed Processing Symposium (IPDPS), 2001, teaches improving performance in energy-aware SMT processor design.
- Moursy et al. “Front-End Policies for Improved Issue Efficiency in SMT Processors,” Proc. HPCA 2003, focuses on reducing the average power consumption in SMT processors by sacrificing some performance.
- Knijnenburg et al. “Branch Classification for SMT Fetch Gating,” Proc. MTEAC 2002 focuses on increasing performance without regard to complexity.
- the additional logic hardware dynamically calculates complex utilization, prediction rates and/or flow rate metrics within the processor or system.
- the verification logic of such control algorithms adds overhead in complexity, area and power, that is not amenable to a low cost, easy implementation for high performance chip designs. This overhead just adds to both escalating development costs and spiraling power dissipation costs.
- the present invention relates to a multithreaded processor, a fetch control for a multithreaded processor and a method of fetching in the multithreaded processor.
- Processor event and use (EU) signals are monitored for downstream pipeline conditions indicating pipeline execution thread states. Instruction cache fetches are skipped for any thread that is incapable of receiving fetched cache contents, e.g., because the thread is full or stalled. Also, consecutive fetches may be selected for the same thread, e.g., on a branch mis-predict. Thus, the processor avoids wasting power on unnecessary or place keeper fetches.
- EU Processor event and use
- FIG. 1 shows a general example of Simultaneous MultiThreaded (SMT) architecture wherein the front end of a state of the art SMT processor is optimized for minimum power consumption without impacting performance or area according to a preferred embodiment of the present invention;
- FIG. 2 shows a block diagram of a more specific example of a preferred embodiment SMT processor that supports two threads in this example;
- FIGS. 3A-B show an example of the preferred fetch control, which determines, on each cycle, whether a fetch from the ICACHE occurs, based on the current state of thread monitor and control flags;
- FIGS. 4A-B show examples of state diagrams for the preferred embodiment fetch control from thread monitor and control flags.
- FIG. 1 shows a general example of Simultaneous MultiThreaded (SMT) architecture wherein the front end of a state of the art SMT processor 100 is optimized for minimum power consumption without impacting performance or area, according to a preferred embodiment of the present invention.
- the SMT processor 100, which may be a single chip or multi-chip microprocessor, includes an instruction cache (ICACHE) 102 with a number of tasks or applications in cache contents from which to select/fetch.
- the ICACHE 102 provides cached instructions for R threads that originate from one of R ports 104-1, 104-2, ..., 104-R.
- Preferred embodiment priority thread selection logic 106 selectively fetches and passes the contents of each of ports 104-1, 104-2, ..., 104-R to an Instruction Fetch Unit (IFU) pipeline 108.
- IFU Instruction Fetch Unit
- Each of the R ports 104-1, 104-2, ..., 104-R has a fixed maximum fetch bandwidth to the IFU pipeline 108 of a number of instructions per cycle.
- the preferred embodiment priority thread selection logic 106 may pass the contents from each port 104-1, 104-2, ..., 104-R at a rate up to that maximum, with the overall bandwidth being R times that maximum.
- the IFU 108 passes instructions into T front-end Instruction BUFfers (IBUF), 110-1, 110-2, ..., 110-T, one for each supported machine execution thread.
- the preferred embodiment priority thread selection logic 106 also receives Event and Use (EU) signals or flags to control fetch and thread selection for the fetch process, determine target instruction buffer threads in instruction buffers 110-1, 110-2, ..., 110-T, as well as order within the threads and the number of instructions fetched, if any, for a given thread.
- EU Event and Use
- instructions from each instruction buffer 110-1, 110-2, ..., 110-T pass through a corresponding decode and dispatch unit, 112-1, 112-2, ..., 112-T and, subsequently, emerge under control of dispatch-thread priority logic 114.
- the dispatch-thread priority logic 114 selects instructions from various different threads and multiplexes the selected instructions as an input to a common dispatch buffer 116.
- This dispatch buffer 116 issues instructions into the back-end execution pipes (not shown in this example).
- the front-end fetch engine of this SMT processor 100 example accesses the ICACHE 102 much more frequently than necessary and uses the instruction buffers, 110-1, 110-2, ..., 110-T, much more than necessary.
- the preferred embodiment fetch control balances the power-performance of the front-end fetch engine of this SMT processor 100 for dramatically improved efficiency.
- FIG. 2 shows a block diagram of a more specific example of a preferred embodiment SMT processor 120 in more detail, supporting two threads in this example.
- the ICACHE 122 has a single read port 124 to preferred fetch control 126 .
- the preferred fetch control 126 selectively fetches instructions and forwards fetched instructions to front end pipeline stages 128. So, instructions exiting the front end pipeline stages 128 pass through multiplexor/demultiplexor (mux/demux) 132 and enter an Instruction BUFfer (IBUF) in one of two threads, 134-0, 134-1 of this example.
- IBUF Instruction BUFfer
- Each thread passes through a number of buffer pipeline stages 136-0, 136-1, eventually emerging from an Instruction Register (IR) 138-0, 138-1.
- a multiplexer 140 selects a mix of instructions from the contents of the instruction registers 138-0, 138-1 to back end processor logic (not shown), e.g., to a dispatch group for back end execution.
- An Instruction Fetch Address Register (IFAR) 142-0, 142-1 addresses each fetched instruction.
- IFAR Instruction Fetch Address Register
- Thread monitor and control flags 144, 146, 148, 150 determine in each clock cycle whether the preferred fetch control 126 forwards an instruction from the ICACHE 122 that is identified by one of the instruction fetch address registers 142-0, 142-1.
- the thread monitor and control flags include stall event flags (e.g., branch mis-predicts, cache misses, etc.) 144, flow rate mismatch flags 146, utilization flags 148 and, optionally, thread priority flags 150.
- the utilization flags 148 may include individual instruction buffer high water mark controls 148-0, 148-1 that also operate to stall corresponding instruction buffers 134-0, 134-1, whenever a respective thread pipeline is full to its respective high water mark.
- although the utilization flags 148-0 and 148-1 are indicated herein as two flags, each having to do with the instruction buffers 134-0, 134-1, this is for example only.
- Multiple utilization flags may be included as downstream utilization markers. For example, a high watermark may be provided for various other downstream queues, e.g., in the execution back-end of the machine, that may provide additional or alternate inputs to the preferred fetch control 126.
- the address in the instruction fetch address register 142-0, 142-1 may simply be incremented from the previous cycle, e.g., by an incrementer 152-0, 152-1.
- the address may be loaded from next fetch address logic 154-0, 154-1, e.g., in response to a branch. So, for example, the next address may depend upon an interrupt, a branch instruction or Branch History Table/Branch Target Buffer (BHT/BTB) contents.
- BHT/BTB Branch History Table/Branch Target Buffer
- the next fetch address logic 154-0, 154-1 may be implemented using any suitable such fetch address logic to generate the next cache address as may be appropriate for the particular application.
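The choice between the incrementer and the next fetch address logic described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `next_ifar`, the `redirect` parameter and the 16-byte increment are assumptions made only for illustration.

```python
from typing import Optional

# Assumed fetch granularity in bytes; the text does not give a value.
FETCH_BLOCK_BYTES = 16


def next_ifar(current: int, redirect: Optional[int] = None) -> int:
    """Pick the next value for one thread's fetch address register.

    redirect is a hypothetical stand-in for the output of the next fetch
    address logic (branch, interrupt or BHT/BTB target); None means the
    address is simply incremented from the previous cycle.
    """
    if redirect is not None:
        return redirect
    return current + FETCH_BLOCK_BYTES


print(hex(next_ifar(0x1000)))                   # sequential fetch
print(hex(next_ifar(0x1000, redirect=0x2040)))  # redirected fetch
```

Each thread would hold its own register value, so a two-thread model would call this once per thread per cycle.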
- the preferred fetch control 126 infers thread stall states, cycle-by-cycle, from the stall flags 144 indicating selected stall events, e.g., branch mis-prediction, cache miss, and dispatch stall.
- These stall event flags 144 are often routinely tracked on-chip in state of the art processors, e.g., using performance counters, or as part of other book-keeping and stall management.
- the stall flags 144 are invoked as override conditions to prevent/enable fetch-gating for a stalled thread, or to redirect fetches for another thread.
- the thread contents are invalid.
- the preferred fetch control 126 gives that thread priority and allows uninhibited fetches at full bandwidth to fill up pipeline slots in the thread that are vacated by flushed instructions.
- Downstream utilization state flags 148 provide a set of high watermark indicators that the preferred fetch control 126 monitors for developing path criticalities. Thus, each high watermark flag 148 , when asserted, indicates that a particular queue or buffer resource is almost full. Depending on whether a thread-specific resource or a shared resource is filling, a thread selection and prioritization policy may be defined in the preferred fetch control 126 and dynamically adjusted to indicate when any particular resources are at or near capacity. Upon such an occurrence, the preferred fetch control 126 may invoke fetch-gating based on the falloff of downstream demand to save energy whenever possible.
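The high water mark behavior described above may be easier to follow in a small Python sketch. The class name `InstructionBuffer`, its fields and the numeric capacities are hypothetical; the text only states that a flag asserts when a queue or buffer is almost full and that fetch gating may follow.

```python
from dataclasses import dataclass


@dataclass
class InstructionBuffer:
    """Hypothetical model of one per-thread instruction buffer (IBUF)."""
    capacity: int          # total entries (assumed value)
    high_water_mark: int   # occupancy at which the flag asserts (assumed)
    occupancy: int = 0     # entries currently held

    def utilization_flag(self) -> bool:
        # Asserted when occupancy is at or above the high water mark,
        # i.e. the buffer is "almost full".
        return self.occupancy >= self.high_water_mark


def gate_for_thread(ibuf: InstructionBuffer) -> bool:
    """Skip the ICACHE fetch for a thread whose buffer cannot take more."""
    return ibuf.utilization_flag()


ibufs = [
    InstructionBuffer(capacity=16, high_water_mark=12, occupancy=13),
    InstructionBuffer(capacity=16, high_water_mark=12, occupancy=4),
]
gated = [gate_for_thread(b) for b in ibufs]
print(gated)  # [True, False]
```

In this sketch only thread 0 is gated, so a fetch for thread 1 could still proceed in the same cycle.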
- FIGS. 3A-B show examples of inputs and output control to the preferred fetch control 126 for determining, on each cycle, whether a fetch from the ICACHE 122 occurs based on the current state of thread monitor and control flags 144, 146, 148, 150 (collectively, 160 in this example).
- the fetch control logic 126 is a simple finite state machine, that monitors a small subset of processor utilization indicators, e.g., stall state and last thread identifier.
- thread monitor and control flags 160 may include, for example, a branch mis-prediction indicator, a cache miss indicator, an execution pipeline stall indicator, a dependence-related dispatch stall indicator, a resource-conflict stall indicator, and a pipeline flush-and-replay stall indicator.
- the fetch control logic 126 may include a finite state controller with two outputs, a fetch_gate 162 and a next_thread_id indicator 164 .
- the fetch_gate 162 is a Boolean flag that is asserted whenever gating the instruction fetch is deemed to be desirable.
- the next_thread_id indicator 164 points to the thread for fetching in the next cycle.
- a miss/stall latch 166 holds the last fetch identification and latches the current thread fetch identification to facilitate determining, in each fetch cycle, the next thread fetch identification.
- a fetch gate output enables gating the contents of the ICACHE (122 in FIG. 2) as selected by the corresponding fetch address register (142-0, 142-1).
- the inverse of the fetch gate 162, inverted by inverter 168 in this example, combines with a dispatch stall signal 170 in an AND gate 172 to provide a flow rate indicator as a flow mismatch flag 146 in FIG. 2.
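The inverter and AND gate combination described above reduces to a single Boolean expression. The following Python sketch shows it; the function name `flow_mismatch_flag` is chosen here for illustration and does not appear in the source.

```python
def flow_mismatch_flag(fetch_gate: bool, dispatch_stall: bool) -> bool:
    """Model of inverter 168 feeding AND gate 172.

    The flag is raised when fetching is NOT gated while dispatch is
    stalled, i.e. the front end is supplying instructions that the
    back end cannot currently accept.
    """
    return (not fetch_gate) and dispatch_stall


# Print the full truth table for the two inputs.
for gate in (False, True):
    for stall in (False, True):
        print(gate, stall, flow_mismatch_flag(gate, stall))
```

Only one of the four input combinations (fetch not gated, dispatch stalled) raises the flag.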
- FIGS. 4A-B show examples of state diagrams for the preferred embodiment fetch control 126 of FIGS. 2 and 3A from thread monitor and control flags 160.
- the flags 160 are checked for an indication of a flow rate mismatch. If a flow rate mismatch is not indicated, then in 1462, the flags 160 are checked for an indication that a branch mis-prediction has occurred. If the flags 160 do not indicate a branch mis-prediction either, then in 1464 the next ICACHE fetch is for a thread that is different than the last. However, if it is determined in 1460 that a flow rate mismatch has occurred, then in 1466 the flags 160 are checked for a Data/Instruction (D/I) cache miss.
- D/I Data/Instruction
- the flags 160 are checked for an indication that a branch mis-prediction has occurred. If the flags 160 indicate that a branch mis-prediction has occurred in either 1462 or 1468, then in 1470, a determination is made of which thread, e.g., thread 0, thread 1, or both in this example. If in 1470 the mis-prediction indication is: thread 0, then in 1472, the next thread ID is set to indicate thread 0; thread 1, then in 1474, the next thread ID is set to indicate thread 1; otherwise, both threads are indicated and, in 1476, the next thread ID is set to indicate that it is undefined.
- Since the next thread ID is undefined in 1476, the fetch gate should be enabled, and nothing should be fetched from either thread in the next cycle. If it is determined that a D/I cache miss has occurred in 1466, then in 1478, a determination is made of which thread, e.g., thread 0, thread 1, or both in this example. A determination of either thread 0, or thread 1, results in an opposite indication of determination 1470.
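The decision flow of steps 1460 through 1478 can be sketched as a small Python function for a two-thread machine. This is an interpretation, not the patented logic: the name `next_fetch_4a` and the use of sets to carry per-thread event flags are assumptions, and two cases the text leaves open (a cache miss on both threads, and a flow rate mismatch with neither a miss nor a mis-prediction) are resolved here by gating the fetch, which is a guess marked in the comments.

```python
from typing import Optional, Set, Tuple


def other(thread: int) -> int:
    """Return the other thread ID in a two-thread machine."""
    return 1 - thread


def next_fetch_4a(flow_mismatch: bool,
                  cache_miss_threads: Set[int],
                  mispredict_threads: Set[int],
                  last_thread: int) -> Tuple[bool, Optional[int]]:
    """Return (fetch_gate, next_thread_id); None means 'undefined'."""
    # 1460 -> 1466 -> 1478: on a flow rate mismatch with a D/I cache miss,
    # read "opposite indication of 1470" as: fetch for the thread that did
    # NOT miss.
    if flow_mismatch and cache_miss_threads:
        if len(cache_miss_threads) == 1:
            (missing,) = cache_miss_threads
            return False, other(missing)
        return True, None  # ASSUMPTION: both threads missed, so gate

    # 1462 / 1468 -> 1470: branch mis-prediction selects the thread.
    if mispredict_threads:
        if len(mispredict_threads) == 1:
            (thread,) = mispredict_threads
            return False, thread          # 1472 or 1474
        return True, None                 # 1476: undefined, gate the fetch

    if not flow_mismatch:
        return False, other(last_thread)  # 1464: alternate threads

    # ASSUMPTION: mismatch with no miss and no mis-prediction is not
    # described in the text; this sketch gates the fetch.
    return True, None


print(next_fetch_4a(False, set(), set(), last_thread=0))
print(next_fetch_4a(True, {0}, set(), last_thread=0))
print(next_fetch_4a(True, set(), {0, 1}, last_thread=1))
```

Returning the gate and the thread ID together mirrors the two outputs of the fetch control, the fetch_gate 162 and the next_thread_id indicator 164.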
- the flags 160 are checked for an indication that the high water mark for one of the instruction buffers is above a selected threshold. So, for the example of FIG. 2, in 1480, the high water mark is checked for instruction buffer 0. Depending on the results of that check, the high water mark is checked for instruction buffer 1 in 1482 if the high water mark for instruction buffer 0 is at or above that threshold, or in 1484 if the high water mark for instruction buffer 0 is below the threshold. If in 1482, the high water mark for instruction buffer 1 is below the threshold; then, in 1486 the flags 160 are checked for an indication that a branch mis-prediction has occurred.
- next thread ID is set to indicate that it is undefined; and, simultaneously, the previous thread ID is held (e.g., in the miss/stall latch 166 of FIG. 3A) and the fetch gate is asserted.
- the flags 160 are checked for an indication that a branch mis-prediction has occurred. If in either 1486 or 1490, a branch mis-prediction is found to have occurred; then in 1492, a determination is made of which thread, again, thread 0, thread 1, or both in this example.
- the mis-prediction indication is: thread 0, then in 1494, the next thread ID is set to indicate thread 0; thread 1, then in 1496, the next thread ID is set to indicate thread 1; otherwise, both threads are indicated and, in 1498, the next ICACHE fetch is for a thread that is different than the last. If in 1482, the high water mark for instruction buffer 1 was found at or above the threshold, the next thread ID is set to indicate thread 1 in 1496. If in 1484, the high water mark for instruction buffer 1 was found below the threshold, the next thread ID is set to indicate thread 0 in 1494.
- fetch control provides simple, effective adaptive fetch-gating for front-end thread selection and priority logic for significant performance gain, and with simultaneous front-end power reduction.
- the thread monitor and control flags 144, 146, 148, 150 of FIG. 2 provide a simple indication of processor state from which cache gating controls are derived to prevent unnecessary or superfluous instruction cache fetches or accesses.
- the preferred embodiment adaptive fetch-gating infers gating control from a typical set of (normally found in state of the art processor architectures) queue markers and event flags, and/or flags that are added or supplemented with insignificant area and timing overhead.
- the present invention has application to SMT processors, generally, where adaptive fetch gating may be combined naturally with an implicit set of power-aware thread prioritization heuristics.
- application of the invention naturally reduces to simple, adaptive fetch gating.
- the preferred fetch gating has application on a cycle-by-cycle basis to determining whether each fetch should proceed, and if so, from which of a number of available threads.
- application of the invention to typical state of the art processors significantly improves processor throughput performance, while reducing the number of actual cache accesses and, therefore, dramatically reducing energy consumption.
- the energy consumption reduction from application of the present invention may far exceed the reduction in execution time, thereby providing an overall average power dissipation reduction as well.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Advance Control (AREA)
Abstract
A multithreaded processor, fetch control for a multithreaded processor and a method of fetching in the multithreaded processor. Processor event and use (EU) signals are monitored for downstream pipeline conditions indicating pipeline execution thread states. Instruction cache fetches are skipped for any thread that is incapable of receiving fetched cache contents, e.g., because the thread is full or stalled. Also, consecutive fetches may be selected for the same thread, e.g., on a branch mis-predict. Thus, the processor avoids wasting power on unnecessary or place keeper fetches.
Description
- The present invention is a divisional of U.S. patent application Ser. No. 11/928,686, (Attorney docket No. YOR920040167US3) entitled “ADAPTIVE FETCH GATING IN MULTITHREADED PROCESSORS, FETCH CONTROL AND METHOD OF CONTROLLING FETCHES” to Pradip Bose et al., filed Oct. 30, 2007, and a continuation of allowed U.S. patent application Ser. No. 11/228,781, (Attorney docket No. YOR920040167US2) entitled “ADAPTIVE FETCH GATING IN MULTITHREADED PROCESSORS, FETCH CONTROL AND METHOD OF CONTROLLING FETCHES” to Pradip Bose et al., filed Sep. 16, 2005, which is a continuation of U.S. Provisional Patent Application Ser. No. 60/610,990, entitled “System And Method For Adaptive Fetch Gating” to Pradip Bose et al., filed Sep. 17, 2004, both of which are incorporated herein by reference.
- 1. Field of the Invention
- The present invention generally relates to multi-threaded processors and, more particularly, to reducing power consumption in a Simultaneous MultiThreaded (SMT) processor or microprocessor.
- 2. Background Description
- Semiconductor technology and chip manufacturing advances have resulted in a steady increase of on-chip clock frequencies, the number of transistors on a single chip and the die size itself. Thus, notwithstanding the decrease of chip supply voltage, chip power consumption has increased as well. At both the chip and system levels, cooling and packaging costs have escalated as a natural result of this increase in chip power. At the low end for small systems (e.g., handhelds, portable and mobile systems), where battery life is crucial, it is important to reduce net power consumption, without having performance degrade to unacceptable levels. Thus, the increase in microprocessor power consumption has become a major stumbling block for future performance gains. Pipelining is one approach to maximizing processor performance.
- A scalar processor fetches and issues/executes one instruction at a time. Each such instruction operates on scalar data operands. Each such operand is a single or atomic data value or number. Pipelining within a scalar processor introduces what is known as concurrency, i.e., processing multiple instructions at different pipeline stages in a given clock cycle, while preserving the single-issue paradigm.
- A superscalar processor can fetch, issue and execute multiple instructions in a given machine cycle, each in a different execution path or thread. Each instruction fetch, issue and execute path is usually pipelined for further, parallel concurrency. Examples of superscalar processors include the Power/PowerPC processors from IBM Corporation, the Pentium processor family from Intel Corporation, the Ultrasparc processors from Sun Microsystems and the Alpha processor and PA-RISC processors from Hewlett Packard Company (HP). Front-end instruction delivery (fetch and dispatch/issue) accounts for a significant fraction of the energy consumed in a typical state of the art dynamic superscalar processor. For high-performance processors, such as IBM's POWER4™, the processor consumes a significant portion of chip power in the instruction cache (ICACHE) during normal access and fetch processes. Of course, when the fetch process stalls, temporarily (e.g., due to instruction buffer fill-up, or cache misses), that portion of chip power falls off dramatically, provided the fetch process is stalled also.
- Unfortunately, other factors (e.g., chip testability, real estate, yield) tend to force a trade of power for control simplification. So, in prior generation power-unaware designs, one may commonly find processors architected to routinely access the ICACHE on each cycle, even when the fetched results may be discarded, e.g., due to stall conditions. Buffers and queues in such processor designs have fixed sizes, and depending on the implementation, consume power at a fixed rate, irrespective of actual cache utilization or workload demand. For example, for a typical state of the art instruction fetch unit (IFU) in a typical state of the art eight-issue superscalar processor, executing a class of commercial benchmark applications, only about 27% of the cycles result in useful fetch activity. Similarly, idle and stalled resources of a front-end instruction decode unit (IDU) pipe wastes significant power. Further, this front-end starvation keeps back-end execute pipes even more underutilized, which impacts processor throughput.
- By contrast, in what is known as an energy-aware design, the fetch and/or issue stages are architected to be adaptive, to accommodate workload demand variations. These energy-aware designs adjusts the fetch and/or issue resources to save power without appreciable performance loss. For example, Buyuktosunoglu et al. (Buyuktosunoglu I), “Energy efficient co-adaptive instruction fetch and issue,” Proc. Int'l. Symp. on Computer Architecture (ISCA), June 2003 and Buyuktosunoglu et al. (Buyuktosunoglu II), “Tradeoffs in power-efficient issue queue design,” Proc. ISLPED, August 2002, both discuss such energy aware designs. In particular, Buyuktosunoglu I and II focus on reconfiguring the size of issue queues, in conjunction (optionally) with an adjustable instruction fetch rate. In another example, Manne et al., “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Int'l. Symp. on Computer Architecture (ISCA), 1998, teaches using the processor branch mis-prediction rate in the instruction fetch to effectively control the fetch rate for power and efficiency. Unfortunately, monitoring the branch prediction accuracy requires additional, significant and complex on-chip hardware that consumes both valuable chip area and power.
- This problem is exacerbated in multithreaded machines, where multiple instruction threads may, or may not be in the pipeline at any one time. For example, Karkhanis et. al, “Saving energy with just-in-time instruction delivery,” Proc. Int'l. Symp. on Low Power Electronics and Design (ISLPED), August 2002, teach controlling instruction fetch rate by keeping a count of valid, downstream instructions. Both U.S. Pat. No. 6,212,544 to Borkenhagen et al. (Borkenhagen I), entitled “Altering thread priorities in a multithreaded processors,” and U.S. Pat. No. 6,567,839 to Borkenhagen et al. (Borkenhagen II), “Thread switch control in a multithreaded processor system,” both assigned to the assignee of the present invention and incorporated herein by reference, teach designing efficient thread scheduling control for boosting performance and/or reducing power in multithreaded processors. In yet another example, Seng et al. “Power-Sensitive Multithreaded Architecture,” Proc. Int'l. Conf on Computer Design (ICCD) 2000, teaches an energy-aware multithreading design.
- State of the art commercial microprocessors (e.g. Intel's Netburst™ Pentium™ IV or IBM's POWER5™) use a mode of multithreading that is commonly referred to as Simultaneous MultiThreading (SMT). In each processor cycle, an SMT processor simultaneously fetches and/or dispatches instructions for different threads that populate the back-end execution resources. Fetch gating in an SMT processor refers to conditionally blocking the instruction fetch process. Thread prioritization involves assigning priorities in the order of fetching instructions from a mix of different workloads in a multi-threaded processor. Some of the above energy-aware design approaches have been applied to SMT. For example, Luo et al., “Boosting SMT Performance by Speculation Control,” Proc. Int'l. Parallel and Distributed Processing Symposium (IPDPS), 2001, teaches improving performance in energy-aware SMT processor design. Moursy et al., “Front-End Policies for Improved Issue Efficiency in SMT Processors,” Proc. HPCA 2003, focuses on reducing the average power consumption in SMT processors by sacrificing some performance. By contrast, Knijnenburg et al., “Branch Classification for SMT Fetch Gating,” Proc. MTEAC 2002, focuses on increasing performance without regard to complexity. These energy-aware approaches require complex variable instruction fetch rate mechanisms and control signals necessitating significant additional logic hardware. The additional logic hardware dynamically calculates complex utilization, prediction rates and/or flow rate metrics within the processor or system. However, the verification logic of such control algorithms adds overhead in complexity, area and power that is not amenable to a low cost, easy implementation for high performance chip designs. This overhead adds to both escalating development costs and spiraling power dissipation costs.
- Unfortunately, many of these approaches have achieved improved performance only at the cost of increased processor power consumption. Others have reduced power consumption (or at least net energy usage) by accepting significantly degraded performance. Still others have accepted complex variable instruction fetch rate mechanisms that necessitate significant additional logic hardware.
- Thus, there is a need for a processor architecture that minimizes power consumption without impairing processor performance and without requiring significant control logic overhead or power.
- It is therefore a purpose of the invention to minimize processor power consumption;
- It is another purpose of the invention to minimize Simultaneous MultiThreaded (SMT) processor power consumption;
- It is yet another purpose of the invention to minimize SMT processor power consumption without incurring significant performance or area overhead.
- The present invention is related to a multithreaded processor, fetch control for a multithreaded processor and a method of fetching in the multithreaded processor. Processor event and use (EU) signals are monitored for downstream pipeline conditions indicating pipeline execution thread states. Instruction cache fetches are skipped for any thread that is incapable of receiving fetched cache contents, e.g., because the thread is full or stalled. Also, consecutive fetches may be selected for the same thread, e.g., on a branch mis-predict. Thus, the processor avoids wasting power on unnecessary or place keeper fetches.
- The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:
- FIG. 1 shows a general example of Simultaneous MultiThreaded (SMT) architecture wherein the front end of a state of the art SMT processor is optimized for minimum power consumption without impacting performance or area according to a preferred embodiment of the present invention;
- FIG. 2 shows a block diagram of a more specific example of a preferred embodiment SMT processor that supports two threads in this example;
- FIGS. 3A-B show an example of the preferred fetch control, which determines on each cycle whether a fetch from the ICACHE occurs, based on the current state of thread monitor and control flags;
- FIGS. 4A-B show examples of state diagrams for the preferred embodiment fetch control from thread monitor and control flags.
- Turning now to the drawings, and more particularly, FIG. 1 shows a general example of Simultaneous MultiThreaded (SMT) architecture wherein the front end of a state of the art SMT processor 100 is optimized for minimum power consumption without impacting performance or area, according to a preferred embodiment of the present invention. The SMT processor 100, which may be a single chip or multi-chip microprocessor, includes an instruction cache (ICACHE) 102 with a number of tasks or applications in cache contents from which to select/fetch. The ICACHE 102 provides cached instructions for R threads that originate from one of R ports 104-1, 104-2, ..., 104-R. Preferred embodiment priority thread selection logic 106 selectively fetches and passes the contents of each of ports 104-1, 104-2, ..., 104-R to an Instruction Fetch Unit (IFU) pipeline 108. Each of the R ports 104-1, 104-2, ..., 104-R has a fixed maximum fetch bandwidth to the IFU pipeline 108 of a number of instructions per cycle. Thus, the preferred embodiment priority thread selection logic 106 may pass the contents from each port 104-1, 104-2, ..., 104-R at a rate up to that maximum, with the overall bandwidth being R times that maximum.
- The IFU 108 passes instructions into T front-end Instruction BUFfers (IBUF), 110-1, 110-2, ..., 110-T, one for each supported machine execution thread. The preferred embodiment priority thread selection logic 106 also receives Event and Use (EU) signals or flags to control fetch and thread selection for the fetch process and to determine target instruction buffer threads in instruction buffers 110-1, 110-2, ..., 110-T, as well as order within the threads and the number of instructions fetched, if any, for a given thread. Instructions in each instruction buffer 110-1, 110-2, ..., 110-T pass through a corresponding decode and dispatch unit, 112-1, 112-2, ..., 112-T and, subsequently, emerge under control of dispatch-thread priority logic 114. The dispatch-thread priority logic 114 selects instructions from various different threads and multiplexes the selected instructions as an input to a common dispatch buffer 116. This dispatch buffer 116 issues instructions into the back-end execution pipes (not shown in this example).
- It may be shown that, absent preferred embodiment fetch control, within an average processor cycle window, the front-end fetch engine of this SMT processor 100 example accesses the ICACHE 102 much more frequently than necessary and uses the instruction buffers, 110-1, 110-2, ..., 110-T, much more than necessary. Thus, the preferred embodiment fetch control balances the power-performance of the front-end fetch engine of this SMT processor 100 for dramatically improved efficiency.
- FIG. 2 shows a block diagram of a more specific example of a preferred embodiment SMT processor 120, supporting two threads in this example. The ICACHE 122 has a single read port 124 to preferred fetch control 126. The preferred fetch control 126 selectively fetches instructions and forwards fetched instructions to front end pipeline stages 128. Instructions exiting the front end pipeline stages 128 pass through multiplexor/demultiplexor (mux/demux) 132 and enter an Instruction BUFfer (IBUF) in one of two threads, 134-0, 134-1 of this example. Each thread passes through a number of buffer pipeline stages 136-0, 136-1, eventually emerging from an Instruction Register (IR) 138-0, 138-1. A multiplexer 140 selects a mix of instructions from the contents of the instruction registers 138-0, 138-1 to back end processor logic (not shown), e.g., to a dispatch group for back end execution. An Instruction Fetch Address Register (IFAR) 142-0, 142-1 addresses each fetched instruction.
- Thread monitor and control flags 144, 146, 148, 150 determine whether the preferred fetch control 126 forwards an instruction from the ICACHE 122 that is identified by one of the instruction fetch address registers 142-0, 142-1. In this example, the thread monitor and control flags include stall event flags (e.g., branch mis-predicts, cache misses, etc.) 144, flow rate mismatch flags 146, utilization flags 148 and, optionally, thread priority flags 150. The utilization flags 148 may include individual instruction buffer high water mark controls 148-0, 148-1 that also operate to stall corresponding instruction buffers 134-0, 134-1, whenever a respective thread pipeline is full to its respective high water mark. Although the utilization flags 148-0 and 148-1 are indicated herein as two flags, each having to do with the instruction buffers 134-0, 134-1, this is for example only. Multiple utilization flags may be included as downstream utilization markers. For example, a high watermark may be provided for various other downstream queues, e.g., in the execution back-end of the machine, that may provide additional or alternate inputs to the preferred fetch control 126.
- However, for any particular cycle in the example of FIG. 2, when a fetch is enabled, the address in the instruction fetch address register 142-0, 142-1 may simply be incremented from the previous cycle, e.g., by an incrementer 152-0, 152-1. Alternately, the address may be loaded from next fetch address logic 154-0, 154-1, e.g., in response to a branch. So, for example, the next address may depend upon an interrupt, a branch instruction or Branch History Table/Branch Target Buffer (BHT/BTB) contents. Further, the next fetch address logic 154-0, 154-1 may be implemented using any suitable fetch address logic to generate the next cache address as may be appropriate for the particular application.
- The preferred fetch control 126 infers thread stall states, cycle-by-cycle, from the stall flags 144 indicating selected stall events, e.g., branch mis-prediction, cache miss, and dispatch stall. These stall event flags 144 are often routinely tracked on-chip in state of the art processors, e.g., using performance counters, or as part of other book-keeping and stall management. However, in accordance with a preferred embodiment of the present invention, the stall flags 144 are invoked as override conditions to prevent/enable fetch-gating for a stalled thread, or to redirect fetches for another thread. Also, when a branch mis-prediction occurs in a given thread, the thread contents are invalid. The preferred fetch control 126 gives that thread priority and allows uninhibited fetches at full bandwidth to fill up pipeline slots in the thread that are vacated by flushed instructions.
- Downstream utilization state flags 148 provide a set of high watermark indicators that the preferred fetch control 126 monitors for developing path criticalities. Thus, each high watermark flag 148, when asserted, indicates that a particular queue or buffer resource is almost full. Depending on whether a thread-specific resource or a shared resource is filling, a thread selection and prioritization policy may be defined in the preferred fetch control 126 and dynamically adjusted to indicate when any particular resources are at or near capacity. Upon such an occurrence, the preferred fetch control 126 may invoke fetch-gating based on the falloff of downstream demand to save energy whenever possible.
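As a minimal sketch of how a utilization flag such as 148-0 or 148-1 might be derived, the Python fragment below models a per-thread instruction buffer that asserts a high watermark flag once its occupancy reaches a threshold. The class name, the capacity of 8 and the threshold of 6 are illustrative assumptions; the description does not specify buffer sizes or threshold values.

```python
from collections import deque


class InstructionBuffer:
    """Toy model of one per-thread instruction buffer (e.g., 134-0).

    capacity and high_water_mark are hypothetical parameters chosen for
    illustration only; they are not taken from the description.
    """

    def __init__(self, capacity: int = 8, high_water_mark: int = 6) -> None:
        if not 0 < high_water_mark <= capacity:
            raise ValueError("high_water_mark must be in 1..capacity")
        self.capacity = capacity
        self.high_water_mark = high_water_mark
        self._slots = deque()

    def push(self, instruction: str) -> bool:
        """Insert one fetched instruction; return False if the buffer is full."""
        if len(self._slots) >= self.capacity:
            return False
        self._slots.append(instruction)
        return True

    def pop(self) -> str:
        """Remove the oldest instruction (models dispatch downstream)."""
        return self._slots.popleft()

    @property
    def high_water_flag(self) -> bool:
        """Asserted when occupancy is at or above the high water mark."""
        return len(self._slots) >= self.high_water_mark


ibuf0 = InstructionBuffer()
for i in range(6):
    ibuf0.push(f"insn{i}")
flag_when_nearly_full = ibuf0.high_water_flag  # buffer holds 6 of 8 entries
ibuf0.pop()
flag_after_dispatch = ibuf0.high_water_flag    # buffer holds 5 of 8 entries
```

In this model the flag clears as soon as downstream dispatch drains the buffer below the threshold, which is the condition under which a fetch for that thread would again be useful.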
- FIGS. 3A-B show examples of inputs and output control to the preferred fetch control 126 for determining on each cycle whether a fetch from the ICACHE 122 occurs, based on the current state of thread monitor and control flags 160. The fetch control logic 126 is a simple finite state machine that monitors a small subset of processor utilization indicators, e.g., stall state and last thread identifier. Thus, thread monitor and control flags 160 may include, for example, a branch mis-prediction indicator, a cache miss indicator, an execution pipeline stall indicator, a dependence-related dispatch stall indicator, a resource-conflict stall indicator, and a pipeline flush-and-replay stall indicator. The fetch control logic 126 may include a finite state controller with two outputs, a fetch_gate 162 and a next_thread_id indicator 164. The fetch_gate 162 is a Boolean flag that is asserted whenever gating the instruction fetch is deemed to be desirable. The next_thread_id indicator 164 points to the thread for fetching in the next cycle. A miss/stall latch 166 holds the last fetch identification and latches the current thread fetch identification to facilitate determining, in each fetch cycle, the next thread fetch identification. A fetch gate output enables gating the contents of the ICACHE (122 in FIG. 2) as selected by the corresponding fetch address register (142-0, 142-1). The inverse of the fetch gate 162, inverted by inverter 168 in this example, combines with a dispatch stall signal 170 in an AND gate 172 to provide a flow rate indicator as a flow mismatch flag 146 in FIG. 2.
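The combination of the inverted fetch gate and the dispatch stall signal described above can be written as a single Boolean expression. The following Python sketch is for illustration only; the function and argument names are hypothetical, and only the inverter 168 and AND gate 172 structure comes from the description.

```python
def flow_mismatch_flag(fetch_gate: bool, dispatch_stall: bool) -> bool:
    """Model of AND gate 172: NOT fetch_gate (inverter 168) AND dispatch stall 170.

    The flag is raised only when fetching is still proceeding (the fetch gate
    is not asserted) while dispatch downstream is stalled.
    """
    return (not fetch_gate) and dispatch_stall


# Enumerate the full truth table for the two inputs.
truth_table = {
    (gate, stall): flow_mismatch_flag(gate, stall)
    for gate in (False, True)
    for stall in (False, True)
}
```

Only the entry for an unasserted fetch gate with an active dispatch stall evaluates to true, i.e., the case where instructions are being supplied faster than they are consumed.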
- FIGS. 4A-B show examples of state diagrams for the preferred embodiment fetch control 126 of FIGS. 2 and 3A from thread monitor and control flags 160. In step 1460 of FIG. 4A, the flags 160 are checked for an indication of a flow rate mismatch. If a flow rate mismatch is not indicated, then in 1462, the flags 160 are checked for an indication that a branch mis-prediction has occurred. If the flags 160 do not indicate a branch mis-prediction either, then in 1464 the next ICACHE fetch is for a thread that is different than the last. However, if it is determined in 1460 that a flow rate mismatch has occurred, then in 1466 the flags 160 are checked for a Data/Instruction (D/I) cache miss. If a D/I cache miss has not occurred, then in 1468, the flags 160 are checked for an indication that a branch mis-prediction has occurred. If the flags 160 indicate that a branch mis-prediction has occurred in either 1462 or 1468, then in 1470, a determination is made of which thread, e.g., thread 0, thread 1, or both in this example. If in 1470 the mis-prediction indication is: thread 0, then in 1472, the next thread ID is set to indicate thread 0; thread 1, then in 1474, the next thread ID is set to indicate thread 1; otherwise, both threads are indicated and, in 1476, the next thread ID is set to indicate that it is undefined. Also, if branch mis-prediction is determined not to have occurred in 1468, then the next thread ID is undefined in 1476. Since the next thread ID is undefined in 1476, the fetch gate should be asserted, and nothing should be fetched from either thread in the next cycle. If it is determined that a D/I cache miss has occurred in 1466, then in 1478, a determination is made of which thread, e.g., thread 0, thread 1, or both in this example. A determination of either thread 0 or thread 1 results in an opposite indication of determination 1470.
- Similarly, in FIG. 4B, the flags 160 are checked for an indication that the high water mark for one of the instruction buffers is above a selected threshold. So, for the example of FIG. 2, in 1480, the high water mark is checked for instruction buffer 0. Depending on the results of that check, the high water mark is checked for instruction buffer 1 in 1482 if the high water mark for instruction buffer 0 is at or above that threshold, or in 1484 if the high water mark for instruction buffer 0 is below the threshold. If in 1482, the high water mark for instruction buffer 1 is below the threshold, then in 1486 the flags 160 are checked for an indication that a branch mis-prediction has occurred. If a branch mis-prediction has not occurred, then in 1488 the next thread ID is set to indicate that it is undefined; and, simultaneously, the previous thread ID is held (e.g., in the miss/stall latch 166 of FIG. 3A) and the fetch gate is asserted. Similarly, in 1484 if the high water mark for instruction buffer 1 is at or above the threshold, then in 1490 the flags 160 are checked for an indication that a branch mis-prediction has occurred. If in either 1486 or 1490, a branch mis-prediction is found to have occurred, then in 1492, a determination is made of which branch, again, thread 0, thread 1, or both in this example. If in 1492 the mis-prediction indication is: thread 0, then in 1494, the next thread ID is set to indicate thread 0; thread 1, then in 1496, the next thread ID is set to indicate thread 1; otherwise, both threads are indicated and, in 1498, the next ICACHE fetch is for a thread that is different than the last. If in 1482, the high water mark for instruction buffer 1 was found at or above the threshold, the next thread ID is set to indicate thread 1 in 1496. If in 1484, the high water mark for instruction buffer 1 was found below the threshold, the next thread ID is set to indicate thread 0 in 1494.
Finally, if a branch mis-prediction is found to have occurred in 1490; then, the next ICACHE fetch is for a thread that is different than the last in 1496. Thus, using fetch control according to the present invention provides simple, effective adaptive fetch-gating for front-end thread selection and priority logic for significant performance gain, and with simultaneous front-end power reduction.
- Advantageously, the thread monitor and control flags of FIG. 2 provide a simple indication of a processor state, from which cache gating controls are derived to prevent unnecessary or superfluous instruction cache fetches or accesses. Accordingly, the preferred embodiment adaptive fetch-gating infers gating control from a typical set of queue markers and event flags (normally found in state of the art processor architectures), and/or flags that are added or supplemented with insignificant area and timing overhead. Further, the present invention has application to SMT processors, generally, where adaptive fetch gating may be combined naturally with an implicit set of power-aware thread prioritization heuristics. For single-threaded processing, application of the invention naturally reduces to simple, adaptive fetch gating. Additionally, the preferred fetch gating has application on a cycle-by-cycle basis to determining whether each fetch should proceed, and if so, from which of a number of available threads. In yet another advantage, application of the invention to typical state of the art processors significantly improves processor throughput performance, while reducing the number of actual cache accesses and, therefore, dramatically reducing energy consumption. The energy consumption reduction from application of the present invention may far exceed the reduction in execution time, thereby providing an overall average power dissipation reduction as well.
- While the invention has been described in terms of preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims. It is intended that all such variations and modifications fall within the scope of the appended claims. Examples and drawings are, accordingly, to be regarded as illustrative rather than restrictive.
Claims (26)
1. A multithreaded processor comprising:
an instruction cache with a plurality of cache locations;
a thread selection and priority circuit configured to monitor processor flags and selectively retrieve contents of each of said plurality of cache locations;
an instruction fetch unit pipeline configured to receive the selectively retrieved contents from said thread selection and priority circuit; and
a plurality of instruction buffer threads, wherein each of the selectively retrieved contents are passed to one of the instruction buffer threads through said instruction fetch unit pipeline, the thread selection and priority circuit being further configured to retrieve contents only for threads indicated by the processor flags as being capable of receiving the selectively retrieved contents.
2. A multithreaded processor as in claim 1, wherein said processor flags include pipeline stall flags, flow mismatch flags, and utilization flags.
3. A multithreaded processor as in claim 2, wherein said processor flags further include thread priority flags.
4. A multithreaded processor as in claim 1, wherein said thread selection and priority circuit comprises an instruction cache fetch control circuit receiving said processor flags and determining whether contents are fetched from said instruction cache and further selecting instruction cache contents being fetched.
5. A multithreaded processor as in claim 4, wherein said instruction cache fetch control circuit is a state machine.
6. A multithreaded processor as in claim 5, wherein said state machine comprises:
means for determining flow rate mismatch in said pipeline from said flags;
means for determining Data/Instruction (D/I) cache misses, responsive to a flow rate mismatch;
means for determining a branch mis-prediction responsive to said flags;
means for determining a next thread responsive to said means for flow rate mismatch determination and said means for D/I cache miss determination; and
means for indicating a next thread.
7. A multithreaded processor as in claim 5, wherein said state machine comprises:
means for determining whether each thread is at or above a high water mark;
means for determining a mis-prediction responsive to said flags;
means for determining a next thread responsive to said means for determining a mis-prediction; and
means for indicating a next thread.
8. A multithreaded processor as in claim 4, wherein said instruction cache fetch control circuit provides a fetch gate signal and a thread identification to said cache responsive to said flags.
9. A multithreaded processor as in claim 8, wherein said fetch gate signal is combined with a dispatch stall signal, a flow rate mismatch flag being provided from the combination.
10. A multithreaded processor as in claim 1, wherein said multithreaded processor is a Simultaneous MultiThreaded (SMT) processor.
11. An instruction fetch controller connectable between an instruction cache and a plurality of instruction buffers, said instruction fetch controller comprising:
one or more inputs, each connected to an instruction cache port, each input receiving instructions from one or more threads stored in respective instruction cache banks;
one or more instruction outputs, each selectively providing one or more fetched instructions to a corresponding instruction buffer; and
one or more control inputs receiving event and use (EU) signals, said EU signals selecting in any clock cycle whether an instruction is fetched from the instruction cache.
12. An instruction fetch controller as in claim 11, wherein the EU signals further control fetch selection priority and the instruction fetch controller selects which instructions are fetched in each clock cycle responsive to said EU signals whenever instructions are fetched.
13. An instruction fetch controller as in claim 11, wherein the EU signals are selected from the group comprising: a signal indicating level of a queue, a stall event indicator, a thread priority indicator, an information flow rate indicator, a status flag, a pipeline stall condition indicator, a logical input indicator, a function input indicator, a statistical indication signal, a historical state signal, and a state signal.
14. An instruction fetch controller as in claim 13, wherein the stall event indicator is selected from the group comprising: a branch mis-prediction indicator, a cache miss indicator, an execution pipeline stall indicator, a dependence-related dispatch stall indicator, a resource-conflict stall indicator, and a pipeline flush-and-replay stall indicator.
15. An instruction fetch controller as in claim 13, wherein the signal indicating level of the queue is a high watermark indicator for a buffer selected from the group comprising: an instruction fetch buffer, a load buffer, a store buffer, and an issue buffer.
16. An instruction fetch controller as in claim 11, wherein one or more of the EU signals are dispatch stage thread priority signals.
17. An instruction fetch controller as in claim 16, wherein the dispatch stage thread priority signals are asserted by software.
18. An instruction fetch controller as in claim 17, wherein the dispatch stage thread priority signals are hardware generated signals.
19. An instruction fetch controller as in claim 16, wherein the dispatch stage thread priority signals indicate an encoding order for considering threads in selecting cache contents and dispatching selected contents for execution in a given cycle.
20. A Simultaneous MultiThreaded (SMT) processor comprising:
an instruction cache with a plurality of cache locations;
an instruction fetch unit pipeline configured to receive selectively retrieved cache contents;
a plurality of instruction buffers, each selectively receiving data and instructions from said instruction fetch unit pipeline, wherein said instruction fetch unit pipeline is between said instruction cache and said plurality of instruction buffers, cache contents received by each of said plurality of instruction buffers passing through said instruction fetch unit pipeline to a respective instruction buffer;
a plurality of instruction buffer threads, each of said plurality of instruction buffer threads traversing one of said plurality of instruction buffers; and
a thread selection and priority circuit monitoring event and use (EU) flag signals for an indication of ones of said plurality of instruction buffers being capable of receiving retrieved cache contents, said thread selection and priority circuit selecting cache content locations being fetched and retrieving said cache contents from selected said cache content locations for a thread traversing an indicated one.
21. A SMT processor as in claim 20, wherein said thread selection and priority circuit comprises an instruction cache fetch control circuit receiving said EU flags and determining whether contents are fetched from said instruction cache and further selecting said cache content locations.
22. A SMT processor as in claim 21, wherein said instruction cache fetch control circuit is a state machine comprising:
means for determining flow rate mismatch in said pipeline from said EU flags;
means for determining Data/Instruction (D/I) cache misses, responsive to a flow rate mismatch;
means for determining a branch mis-prediction responsive to said EU flags;
means for determining a next thread responsive to said means for flow rate mismatch determination and said means for D/I cache miss determination; and
means for indicating a next thread.
23. A SMT processor as in claim 21, wherein said instruction cache fetch control circuit is a state machine comprising:
means for determining whether each thread is at or above a high water mark;
means for determining a mis-prediction responsive to said EU flags;
means for determining a next thread responsive to said means for determining a mis-prediction; and
means for indicating a next thread.
24. A SMT processor as in claim 21, wherein said instruction cache fetch control circuit provides a fetch gate signal and a thread identification to said cache responsive to said EU flags.
25. A SMT processor as in claim 24, wherein said fetch gate signal is combined with a dispatch stall signal, a flow rate mismatch flag being provided from the combination.
26. A SMT processor as in claim 21, wherein said EU flags include pipeline stall flags, flow mismatch flags, utilization flags and thread priority flags.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/106,360 US20080229068A1 (en) | 2004-09-17 | 2008-04-21 | Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US61099004P | 2004-09-17 | 2004-09-17 | |
US11/228,781 US7392366B2 (en) | 2004-09-17 | 2005-09-16 | Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches |
US11/928,686 US20080133886A1 (en) | 2004-09-17 | 2007-10-30 | Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches |
US12/106,360 US20080229068A1 (en) | 2004-09-17 | 2008-04-21 | Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/928,686 Division US20080133886A1 (en) | 2004-09-17 | 2007-10-30 | Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches |
Publications (1)
Publication Number | Publication Date |
---|---|
US20080229068A1 (en) | 2008-09-18 |
Family
ID=36317705
Family Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/228,781 Active US7392366B2 (en) | 2004-09-17 | 2005-09-16 | Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches |
US11/928,686 Abandoned US20080133886A1 (en) | 2004-09-17 | 2007-10-30 | Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches |
US12/106,360 Abandoned US20080229068A1 (en) | 2004-09-17 | 2008-04-21 | Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches |
Country Status (1)
Country | Link |
---|---|
US (3) | US7392366B2 (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080040730A1 (en) * | 2006-08-14 | 2008-02-14 | Jack Kang | Event-based bandwidth allocation mode switching method and apparatus |
US20080069130A1 (en) * | 2006-09-16 | 2008-03-20 | Mips Technologies, Inc. | Transaction selector employing transaction queue group priorities in multi-port switch |
US20080069128A1 (en) * | 2006-09-16 | 2008-03-20 | Mips Technologies, Inc. | Transaction selector employing barrel-incrementer-based round-robin apparatus supporting dynamic priorities in multi-port switch |
US7773621B2 (en) | 2006-09-16 | 2010-08-10 | Mips Technologies, Inc. | Transaction selector employing round-robin apparatus supporting dynamic priorities in multi-port switch |
US20110131438A1 (en) * | 2009-12-02 | 2011-06-02 | International Business Machines Corporation | Saving Power by Powering Down an Instruction Fetch Array Based on Capacity History of Instruction Buffer |
US7961745B2 (en) * | 2006-09-16 | 2011-06-14 | Mips Technologies, Inc. | Bifurcated transaction selector supporting dynamic priorities in multi-port switch |
US8261049B1 (en) | 2007-04-10 | 2012-09-04 | Marvell International Ltd. | Determinative branch prediction indexing |
US9519944B2 (en) | 2014-09-02 | 2016-12-13 | Apple Inc. | Pipeline dependency resolution |
US20160371068A1 (en) * | 2015-06-16 | 2016-12-22 | Fujitsu Limited | Computer that performs compiling, compiler program, and link program |
US11182166B2 (en) | 2019-05-23 | 2021-11-23 | Samsung Electronics Co., Ltd. | Branch prediction throughput by skipping over cachelines without branches |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7350060B2 (en) * | 2003-04-24 | 2008-03-25 | International Business Machines Corporation | Method and apparatus for sending thread-execution-state-sensitive supervisory commands to a simultaneous multi-threaded (SMT) processor |
US7937557B2 (en) | 2004-03-16 | 2011-05-03 | Vns Portfolio Llc | System and method for intercommunication between computers in an array |
US7392366B2 (en) * | 2004-09-17 | 2008-06-24 | International Business Machines Corp. | Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches |
US7904695B2 (en) * | 2006-02-16 | 2011-03-08 | Vns Portfolio Llc | Asynchronous power saving computer |
US20070101102A1 (en) * | 2005-10-27 | 2007-05-03 | Dierks Herman D Jr | Selectively pausing a software thread |
US7617383B2 (en) * | 2006-02-16 | 2009-11-10 | Vns Portfolio Llc | Circular register arrays of a computer |
US7913069B2 (en) * | 2006-02-16 | 2011-03-22 | Vns Portfolio Llc | Processor and method for executing a program loop within an instruction word |
US7904615B2 (en) | 2006-02-16 | 2011-03-08 | Vns Portfolio Llc | Asynchronous computer communication |
US7966481B2 (en) | 2006-02-16 | 2011-06-21 | Vns Portfolio Llc | Computer system and method for executing port communications without interrupting the receiving computer |
US7454596B2 (en) * | 2006-06-29 | 2008-11-18 | Intel Corporation | Method and apparatus for partitioned pipelined fetching of multiple execution threads |
US20080172398A1 (en) * | 2007-01-12 | 2008-07-17 | Borkenhagen John M | Selection of Processors for Job Scheduling Using Measured Power Consumption Ratings |
US7555637B2 (en) * | 2007-04-27 | 2009-06-30 | Vns Portfolio Llc | Multi-port read/write operations based on register bits set for indicating select ports and transfer directions |
US8006073B1 (en) * | 2007-09-28 | 2011-08-23 | Oracle America, Inc. | Simultaneous speculative threading light mode |
US8006070B2 (en) * | 2007-12-05 | 2011-08-23 | International Business Machines Corporation | Method and apparatus for inhibiting fetch throttling when a processor encounters a low confidence branch instruction in an information handling system |
US8086825B2 (en) * | 2007-12-31 | 2011-12-27 | Advanced Micro Devices, Inc. | Processing pipeline having stage-specific thread selection and method thereof |
US7925853B2 (en) * | 2008-01-04 | 2011-04-12 | International Business Machines Corporation | Method and apparatus for controlling memory array gating when a processor executes a low confidence branch instruction in an information handling system |
US8255669B2 (en) * | 2008-01-30 | 2012-08-28 | International Business Machines Corporation | Method and apparatus for thread priority control in a multi-threaded processor based upon branch issue information including branch confidence information |
US20090193240A1 (en) * | 2008-01-30 | 2009-07-30 | Ibm Corporation | Method and apparatus for increasing thread priority in response to flush information in a multi-threaded processor of an information handling system |
GB2457265B (en) * | 2008-02-07 | 2010-06-09 | Imagination Tech Ltd | Prioritising of instruction fetching in microprocessor systems |
US20100023730A1 (en) * | 2008-07-24 | 2010-01-28 | Vns Portfolio Llc | Circular Register Arrays of a Computer |
US20110179254A1 (en) * | 2010-01-15 | 2011-07-21 | Sun Microsystems, Inc. | Limiting speculative instruction fetching in a processor |
US20110276784A1 (en) * | 2010-05-10 | 2011-11-10 | Telefonaktiebolaget L M Ericsson (Publ) | Hierarchical multithreaded processing |
US8527994B2 (en) * | 2011-02-10 | 2013-09-03 | International Business Machines Corporation | Guarded, multi-metric resource control for safe and efficient microprocessor management |
US9652243B2 (en) | 2011-06-29 | 2017-05-16 | International Business Machines Corporation | Predicting out-of-order instruction level parallelism of threads in a multi-threaded processor |
US9158698B2 (en) | 2013-03-15 | 2015-10-13 | International Business Machines Corporation | Dynamically removing entries from an executing queue |
US10572261B2 (en) * | 2016-01-06 | 2020-02-25 | Nxp Usa, Inc. | Providing task-triggered deterministic operational mode for simultaneous multi-threaded superscalar processor |
US10528352B2 (en) * | 2016-03-08 | 2020-01-07 | International Business Machines Corporation | Blocking instruction fetching in a computer processor |
GB2563587B (en) | 2017-06-16 | 2021-01-06 | Imagination Tech Ltd | Scheduling tasks |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6212544B1 (en) * | 1997-10-23 | 2001-04-03 | International Business Machines Corporation | Altering thread priorities in a multithreaded processor |
US6567839B1 (en) * | 1997-10-23 | 2003-05-20 | International Business Machines Corporation | Thread switch control in a multithreaded processor system |
US7392366B2 (en) * | 2004-09-17 | 2008-06-24 | International Business Machines Corp. | Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6267839B1 (en) * | 1999-01-12 | 2001-07-31 | Applied Materials, Inc. | Electrostatic chuck with improved RF power distribution |
US6587839B1 (en) * | 2000-02-04 | 2003-07-01 | Eastman Kodak Company | Method and system for notifying a consumer that the photofinishing order is ready and for controlling inventory of photofinishing orders in a business |
Date | Application | Publication | Status |
---|---|---|---|
2005-09-16 | US 11/228,781 | US7392366B2 | Active |
2007-10-30 | US 11/928,686 | US20080133886A1 | Abandoned |
2008-04-21 | US 12/106,360 | US20080229068A1 | Abandoned |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080040730A1 (en) * | 2006-08-14 | 2008-02-14 | Jack Kang | Event-based bandwidth allocation mode switching method and apparatus |
US8799929B2 (en) | 2006-08-14 | 2014-08-05 | Marvell World Trade Ltd. | Method and apparatus for bandwidth allocation mode switching based on relative priorities of the bandwidth allocation modes |
US8424021B2 (en) | 2006-08-14 | 2013-04-16 | Marvell World Trade Ltd. | Event-based bandwidth allocation mode switching method and apparatus |
US8046775B2 (en) * | 2006-08-14 | 2011-10-25 | Marvell World Trade Ltd. | Event-based bandwidth allocation mode switching method and apparatus |
US7773621B2 (en) | 2006-09-16 | 2010-08-10 | Mips Technologies, Inc. | Transaction selector employing round-robin apparatus supporting dynamic priorities in multi-port switch |
US20080069128A1 (en) * | 2006-09-16 | 2008-03-20 | Mips Technologies, Inc. | Transaction selector employing barrel-incrementer-based round-robin apparatus supporting dynamic priorities in multi-port switch |
US7961745B2 (en) * | 2006-09-16 | 2011-06-14 | Mips Technologies, Inc. | Bifurcated transaction selector supporting dynamic priorities in multi-port switch |
US7990989B2 (en) | 2006-09-16 | 2011-08-02 | Mips Technologies, Inc. | Transaction selector employing transaction queue group priorities in multi-port switch |
US7760748B2 (en) | 2006-09-16 | 2010-07-20 | Mips Technologies, Inc. | Transaction selector employing barrel-incrementer-based round-robin apparatus supporting dynamic priorities in multi-port switch |
US20080069130A1 (en) * | 2006-09-16 | 2008-03-20 | Mips Technologies, Inc. | Transaction selector employing transaction queue group priorities in multi-port switch |
US8539212B1 (en) | 2007-04-10 | 2013-09-17 | Marvell International Ltd. | Determinative branch prediction indexing |
US8261049B1 (en) | 2007-04-10 | 2012-09-04 | Marvell International Ltd. | Determinative branch prediction indexing |
US8370671B2 (en) * | 2009-12-02 | 2013-02-05 | International Business Machines Corporation | Saving power by powering down an instruction fetch array based on capacity history of instruction buffer |
US20110131438A1 (en) * | 2009-12-02 | 2011-06-02 | International Business Machines Corporation | Saving Power by Powering Down an Instruction Fetch Array Based on Capacity History of Instruction Buffer |
US9519944B2 (en) | 2014-09-02 | 2016-12-13 | Apple Inc. | Pipeline dependency resolution |
US20160371068A1 (en) * | 2015-06-16 | 2016-12-22 | Fujitsu Limited | Computer that performs compiling, compiler program, and link program |
US10089088B2 (en) * | 2015-06-16 | 2018-10-02 | Fujitsu Limited | Computer that performs compiling, compiler program, and link program |
US11182166B2 (en) | 2019-05-23 | 2021-11-23 | Samsung Electronics Co., Ltd. | Branch prediction throughput by skipping over cachelines without branches |
Also Published As
Publication number | Publication date |
---|---|
US20060101238A1 (en) | 2006-05-11 |
US7392366B2 (en) | 2008-06-24 |
US20080133886A1 (en) | 2008-06-05 |
Similar Documents
Publication | Title |
---|---|
US7392366B2 (en) | Adaptive fetch gating in multithreaded processors, fetch control and method of controlling fetches |
Kalla et al. | IBM Power5 chip: A dual-core multithreaded processor |
EP1238341B1 (en) | Method, apparatus, medium and program for entering and exiting multiple threads within a multithreaded processor |
US8321712B2 (en) | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
EP1236107B1 (en) | Method and apparatus for disabling a clock signal within a multithreaded processor |
US7447923B2 (en) | Systems and methods for mutually exclusive activation of microprocessor resources to control maximum power |
EP1856603B1 (en) | Bifurcated thread scheduler in a multithreading microprocessor |
US6496925B1 (en) | Method and apparatus for processing an event occurrence within a multithreaded processor |
US9389869B2 (en) | Multithreaded processor with plurality of scoreboards each issuing to plurality of pipelines |
US7627770B2 (en) | Apparatus and method for automatic low power mode invocation in a multi-threaded processor |
US7890738B2 (en) | Method and logical apparatus for managing processing system resource use for speculative execution |
Canal et al. | Reducing the complexity of the issue logic |
US5991884A (en) | Method for reducing peak power in dispatching instructions to multiple execution units |
US20040215984A1 (en) | Method and circuitry for managing power in a simultaneous multithread processor |
CN116420140A (en) | Dynamically configurable over-provisioned microprocessor |
Sharkey et al. | Efficient instruction schedulers for SMT processors |
Mehta et al. | Fetch halting on critical load misses |
Sharkey et al. | Exploiting operand availability for efficient simultaneous multithreading |
Sharkey et al. | Balancing ILP and TLP in SMT architectures through out-of-order instruction dispatch |
KR20060107508A (en) | Processor with demand-driven clock throttling for power reduction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: BOSE, PRADIP; BUYUKTOSUNOGLU, ALPER; EICKEMEYER, RICHARD J.; AND OTHERS; REEL/FRAME: 020829/0873; Effective date: 20050915 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |