US20130061236A1 - System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources
- Publication number: US20130061236A1 (application US13/666,097)
- Authority: US (United States)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3287—Power saving characterised by switching off individual functional units in the computer system
- G06F9/3802—Instruction prefetching
- G06F9/3814—Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/384—Register renaming
- G06F9/3869—Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
- G06F9/5027—Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
- G06F9/5094—Allocation of resources where the allocation takes into account power or heat criteria
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- The present disclosure relates to reducing power requirements of microelectronic devices and, more particularly, to an apparatus and method for reducing power dissipation and energy requirements in high-performance microprocessors.
- Modern, high-performance microprocessors use sophisticated instruction scheduling mechanisms and pipelines designed to reorder the startup and completion of instructions in a sequential instruction stream so as to achieve a high-level of processor performance.
- One common form of such mechanisms is a superscalar microprocessor that is capable of fetching, decoding, issuing, executing, completing and retiring more than one instruction within a single cycle of the clock signal used to synchronize activities at the lowest level in the microprocessor.
- The term "instruction" refers to the smallest unit of work that is scheduled independently within a microprocessor.
- Instructions are fetched from an instruction cache (I-cache) in program order along the predicted path of execution.
- The instructions are then decoded to resolve inter-instruction dependencies and are then dispatched into a buffer commonly known as the issue queue (IQ).
- Instructions are issued from the IQ to execution units, also called function units (FUs).
- Instructions that are ready for execution, issued from the IQ to the chosen FU, may therefore start as well as finish execution out of program order.
- The processor state, as defined by the contents of the committed or architectural registers, as well as the state of the memory, must be updated in program order despite the fact that instructions can complete out of program order. This requirement is met by collecting the results produced out of program order into another buffer called the reorder buffer (ROB). Information stored in the ROB is used to update the processor and memory state in the original program order.
- At dispatch, an entry is simultaneously established in program order in the ROB, which behaves as a first-in, first-out (FIFO) queue.
- The entry for the dispatched instruction is made at the tail of the ROB.
- The ROB entry for an instruction can itself serve as the repository of the instruction's results, or it may point to the repository of the results within a separate physical register file.
- The process of retiring or committing an instruction involves updating the processor's state and/or the memory state in program order, typically using the information stored in the ROB. Instructions are retired from the head of the ROB. If the ROB entry at the head of the ROB is awaiting the completion of the corresponding instruction, instruction retirement is blocked (i.e., halted) momentarily until the results are correctly produced.
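The dispatch-at-tail, retire-from-head discipline described above can be sketched as a small software model. This is illustrative only; the class and field names are ours, not the patent's, and a hardware ROB would be a fixed circular buffer rather than a Python deque:

```python
from collections import deque

class ReorderBuffer:
    """Sketch of a ROB as a FIFO: dispatch appends at the tail;
    retirement pops completed entries from the head in program order."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # leftmost element is the head

    def dispatch(self, tag):
        if len(self.entries) >= self.capacity:
            return False  # ROB full: dispatch blocks
        self.entries.append({"tag": tag, "done": False})
        return True

    def complete(self, tag):
        # Instructions may complete out of program order.
        for e in self.entries:
            if e["tag"] == tag:
                e["done"] = True

    def retire(self):
        # Retire from the head only; an incomplete head blocks retirement.
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["tag"])
        return retired  # always in program order

rob = ReorderBuffer(4)
for t in ("i1", "i2", "i3"):
    rob.dispatch(t)
rob.complete("i3")                          # completes out of order
assert rob.retire() == []                   # head (i1) not done: blocked
rob.complete("i1")
rob.complete("i2")
assert rob.retire() == ["i1", "i2", "i3"]   # retired in program order
```

Note how the incomplete head entry momentarily blocks retirement even though a younger instruction has already produced its result, exactly as described above.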
- A load-store queue (LSQ) holds the entries reserved for load and store instructions.
- Canal and Gonzalez (“A low-complexity issue logic”, Proc. ACM Int'l. Conference on Supercomputing (ICS), pp. 327-335, Santa Fe, N. Mex., June, 2000) describe a scheme to reduce the complexity of the issue queue in a microprocessor. Their technique relies on the use of an additional queue, called the “ready queue” to hold instructions whose operands are determined to be available at the time of instruction dispatch. Instructions can be issued from this ready queue without the need to have energy-dissipating logic to check for the availability of operands.
- The present invention, by contrast, does not use any auxiliary structures to hold instructions; it reduces power in the issue queue by controlling the number of resource units allocated to the issue queue.
- The Canal and Gonzalez scheme also makes use of an additional structure called the "first use" table to hold instructions that cannot be put into the "ready queue" at the time of instruction dispatching. With the use of this table and its associated logic, it is not clear that the scheme results in an overall power saving. Unlike the Canal et al. scheme, the present invention also reduces power dissipation within other datapath structures such as the reorder buffer, the load-store queue and the register file.
- Folegnani and Gonzalez (“Energy-Effective Issue Logic”, in Proceedings of the Int'l Symposium on Computer Architecture, June 2001, pp. 230-239) describe a FIFO issue queue that permitted out-of-order issue but avoided the compaction of vacated entries within the valid region of the queue to save power.
- The queue was divided into regions. The number of instructions committed from the most recently allocated issue queue region in FIFO order (called the "youngest region") was used to determine the number of regions within the circular buffer allocated for the actual extent of the issue queue.
- The number of regions allocated was incremented by one periodically; in between, also at periodic intervals, a region was deactivated to save energy/power if the number of commits from the current youngest region was below a threshold.
- The energy overhead of the control logic for this resizing was not made clear. Additional energy savings were documented by not activating forwarding comparators within entries that are ready for issue or entries that are unallocated.
- The scheme of Folegnani et al. is thus limited to a FIFO-style issue queue design and does nothing to reduce power dissipation in other datapath structures such as the reorder buffer, the load-store queue and the register file.
- The present invention is applicable to more general styles of issue queue design, including FIFO issue queues.
- This invention also reduces power dissipation in the reorder buffer, the load-store queue and the register file. Furthermore, unlike the method of the present invention, the scheme of Folegnani et al. relies on continuous measurements of issue queue activity rather than sampled measurements.
- Bahar and Manne (“Power and Energy Reduction Via Pipeline Balancing”, Proceedings of the Int'l Symposium on Computer Architecture, June 2001, pp. 218-229) describe a scheme for shutting off clusters of execution units and some associated register files in the Compaq 21264 microprocessor based on continuous monitoring of the IPC.
- The dispatch rate was varied between 4, 6 and 8 to allow an unused cluster of function units to be shut off completely.
- The dispatch rate changes were triggered by the crossing of thresholds associated with the floating-point and overall IPC (average number of instructions processed in a clock cycle), requiring dispatch monitoring on a cycle-by-cycle basis. Fixed thresholds were chosen from empirical data generated experimentally. Significant power savings within the dynamic scheduling components were achieved with minimal reduction in IPC.
- The scheme of Bahar et al. is limited to a clustered-style microprocessor datapath and relies on continuous monitoring of performance.
- The present invention saves power by controlling power dissipation within components smaller than clusters, including the reorder buffer, and avoids continuous monitoring of performance.
- Buyuktosunoglu et al. (U.S. Patent Application No. 2002/0053038, May 2, 2002) describe a method and structure for reducing power dissipation in a microprocessor that relies on dynamic resizing of at least one storage structure. Unlike the method of Buyuktosunoglu et al., the present invention directly uses the lack of resource units (which indirectly affects performance) to allocate additional resource units to counter any performance drop arising from the lack of resources.
- The present invention controls resource unit allocations for a variety of datapath structures, such as the issue queue, the reorder buffer, the load-store queue and the register file, simultaneously and independently to conserve power with minimal impact on performance. Buyuktosunoglu et al. focus on techniques driven solely by the activity of the issue queue.
- A further distinction is that the present invention uses sampled, non-continuous measurements of the usage of each controlled resource, whereas Buyuktosunoglu et al. rely on continuous measurements of activity and performance, such as IPC.
- The term "datapath resources", or simply "resources", hereinafter refers to the IQ, ROB, LSQ, register files, etc., but excludes the architectural register file (ARF).
- The term "resource unit" hereinafter refers to the basic unit of incremental resource that may be dynamically allocated or deallocated as required for execution of a particular instruction.
- The terms "interval" and "period" are used interchangeably herein.
- Resource usage, as used herein, is defined by the actual number of valid entries, hereafter referred to as "occupancy".
- The present invention is primarily intended for reducing dynamic power dissipation arising from switching activity in the microprocessor and similar devices. Power and energy dissipation arising from leakage in deallocated resource units can also be reduced or avoided by using a variety of techniques known to those of skill in the circuit design arts, including, but not limited to, the use of sleep transistors, circuits using dual-threshold devices and substrate biasing.
- An apparatus and method of dynamically estimating the instantaneous resource needs of a program running on a microprocessor are used to allocate the minimum number of units of these resources to meet the instantaneous performance needs of that particular program.
- This approach requires that all allocatable resources be partitionable into independent allocation units that can be incrementally allocated or deallocated. For each of the datapath resources, unused resource units are shut off and isolated from the active, allocated units so as to reduce power dissipation resulting from leakage as well as from switching activity.
- Resource units may be independently allocated to each resource. Unused resource units may be reclaimed if the running program is not utilizing them. The reclaimed or deallocated resource units are powered down and isolated from the allocated units, keeping the instantaneous allocation at about the level needed to meet the program's performance needs.
- The present invention comprises six key features:
- Estimates for the instantaneous need of a program for a specific resource type are obtained through multiple, periodic sampling within the update periods instead of continuous measurements on a cycle-by-cycle basis.
- The sampling period is predetermined or can be dynamically adjusted.
- The sampling frequency is typically a multiple of the update frequency.
- Unused resource units may be deallocated.
- The deallocation may be gradual, with only one resource unit deallocated at the end of a sampling period, or more aggressive, with multiple unused resource units deallocated at the end of the sampling interval. Deallocations typically coincide with the end of an update interval.
- Resource units are allocated as soon as the true instantaneous demands for the resource exceed the currently allocated units a predetermined number of times within a sampling interval. When this happens, one or more resource units may be immediately allocated, as availability permits, and a new update period may then be started. Resource allocations thus do not necessarily coincide with the end of the periodic update interval.
- Units of certain resources that are organized as FIFO queues may have slightly different allocation and deallocation methods than other types of resources.
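A rough software sketch of how these features could interact follows. Every name, threshold and period here is an illustrative assumption of ours, not a value taken from the patent; the point is only the shape of the policy, with immediate overflow-driven allocation and sampled, period-end deallocation:

```python
class ResourceController:
    """Sketch: sampled (not per-cycle) occupancy estimates drive
    deallocation at update-period boundaries, while a count of
    overflow events triggers immediate allocation."""
    def __init__(self, units, unit_size, max_units,
                 samples_per_update=4, overflow_threshold=2):
        self.units = units                  # currently allocated units
        self.unit_size = unit_size          # entries per unit (power of 2)
        self.max_units = max_units
        self.samples_per_update = samples_per_update
        self.overflow_threshold = overflow_threshold
        self.samples = []
        self.overflows = 0

    def on_overflow(self):
        """Called in a cycle where demand exceeded current allocation."""
        self.overflows += 1
        if self.overflows > self.overflow_threshold and self.units < self.max_units:
            self.units += 1            # allocate immediately...
            self._new_update_period()  # ...and start a new update period

    def on_sample(self, occupancy):
        """Called at each sampling point within the update period."""
        self.samples.append(occupancy)
        if len(self.samples) == self.samples_per_update:
            avg = sum(self.samples) // len(self.samples)
            needed = -(-avg // self.unit_size)  # ceil(avg / unit_size)
            self.units = max(needed, 1)         # deallocate unused units
            self._new_update_period()

    def _new_update_period(self):
        self.samples = []
        self.overflows = 0

c = ResourceController(units=4, unit_size=8, max_units=8)
for occ in (10, 12, 9, 11):      # sampled occupancies over one period
    c.on_sample(occ)
assert c.units == 2              # avg 10 entries fits in 2 units of 8
```

Note that allocation (`on_overflow`) can fire mid-period, matching the statement above that allocations need not coincide with update-interval boundaries, while deallocation only happens once all samples for a period are in.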
- FIG. 1 is a schematic block diagram of an architecture of the datapath portion of a first superscalar processor wherein certain datapath resources may be dynamically resized in accordance with the invention;
- FIG. 2 is a schematic block diagram of an architecture of the datapath portion of a second superscalar processor wherein certain datapath resources may be dynamically resized;
- FIG. 3 is a schematic block diagram of an architecture of the datapath portion of a third superscalar processor wherein certain datapath resources may be dynamically resized in accordance with the invention;
- FIG. 4 is a flow chart of a method of allocating non-queued resources in accordance with the present invention;
- FIG. 5 is a flow chart of a method of deallocating non-queued resources;
- FIG. 6 is a flow chart of a set of steps associated with allocating resources used like a FIFO queue; and
- FIG. 7 is a flow chart of a set of steps associated with deallocating resources used like a FIFO queue.
- The present invention reduces dynamic power dissipation arising from switching activity in microprocessors and similar devices. Power and energy dissipation arising from leakage in the resource units that are deallocated can also be reduced or avoided by using a variety of techniques known to those of skill in the circuit design arts, including, but not limited to, the use of sleep transistors, circuits using dual-threshold devices and substrate biasing.
- A reorder buffer (ROB) 102 contains the ROB entry set up for an instruction at the time of dispatch.
- The ROB 102 entry also includes a field, not shown, to hold the result produced by the instruction.
- The ROB 102 operates analogously to a physical register. If an operand value has been committed, a dispatched instruction attempts to read operand values from the architectural register file (ARF) 104 directly. If, however, the operand value was generated but has not been committed, a dispatched instruction attempts to read the required operand associatively from the ROB 102 from the most recently established entry for an architectural register.
- Source registers that contain valid data are read into the IQ 106 for the associated instruction. If a source operand is not available at the time of dispatch in the ARF 104 or the ROB 102, the address of the physical register (i.e., the ROB slot) is saved in the tag field associated with the source register in the IQ 106 for the instruction.
- When a function unit 108 completes processing an instruction, it outputs the result produced along with the address of the destination register for that result. The result is placed on a forwarding bus 112, which runs across the length of the IQ 106 and the LSQ 110. An associative tag-matching process is then used to steer the result to matching entries within the IQ 106. Since multiple function units 108 may complete processing their respective instructions within a particular cycle, multiple forwarding buses 112 are used. Each input operand field within the IQ 106 thus uses a comparator, not shown, for each forwarding bus 112. Alternative designs use scoreboarding logic to identify the destinations of a forwarded result instead of tag-based result forwarding.
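The associative tag-matching step can be modeled in software as below. The data structures (dicts with `tag`/`valid`/`value` fields) are purely illustrative stand-ins for IQ entry fields and comparators; in hardware every operand/bus pair has its own comparator operating in parallel:

```python
def forward_results(iq_entries, forwarding_buses):
    """Sketch of tag-based result forwarding.

    iq_entries: list of dicts, each with an "operands" list; an operand
        is a dict with keys "valid", "tag", "value".
    forwarding_buses: list of (dest_tag, value) pairs, one per function
        unit that completed this cycle.
    Returns the entries whose operands are now all valid (ready to issue).
    """
    for entry in iq_entries:
        for operand in entry["operands"]:
            if operand["valid"]:
                continue  # already has its value; comparator not activated
            for dest_tag, value in forwarding_buses:
                if operand["tag"] == dest_tag:  # comparator match
                    operand["value"] = value
                    operand["valid"] = True
    return [e for e in iq_entries
            if all(op["valid"] for op in e["operands"])]

entry = {"operands": [{"valid": False, "tag": 7, "value": None},
                      {"valid": True,  "tag": None, "value": 5}]}
ready = forward_results([entry], [(7, 99)])
assert entry["operands"][0]["value"] == 99
assert ready == [entry]   # both operands valid: instruction is ready
```

Skipping comparators for already-valid operands mirrors the energy-saving observation (attributed above to Folegnani et al.) that forwarding comparators need not be activated for entries that are already ready.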
- For load and store instructions, an entry is also reserved in the LSQ 110 at the time the instruction is dispatched. Because the address used by a load or store instruction must be calculated, the instruction is removed from the IQ 106 even if the value to be stored (for store instructions) has not yet been computed. In this case, the value is forwarded to the appropriate LSQ 110 entry as it is generated by one of the function units 108.
- The resources that are allocated and deallocated dynamically following the method of the present invention are the IQ 106, the LSQ 110 and the ROB 102.
- The superscalar processor datapath 200 shown in FIG. 2 is similar to processor 100 (FIG. 1), with the difference that destination registers are allocated within a separate physical register file (PRF) 214.
- A physical register in the PRF 214 is allocated for an instruction if its result is destined for a register.
- An entry is simultaneously established in the ROB 202 for the instruction.
- The ROB 202 entry for the instruction holds a pointer, not shown, to the destination register of the instruction within the PRF 214.
- The PRF 214 may be managed exactly like a FIFO queue, similar to the ROB 202.
- The resources that may be allocated and deallocated dynamically following the method of the invention are the IQ 206, the LSQ 210, the PRF 214 and the ROB 202.
- FIG. 3 depicts the architecture of the datapath portion 300 of yet another superscalar microprocessor that can benefit from the method of the present invention.
- Registers allocated to hold both committed register values and the results of instructions targeting a register are held in a common register file, RF 316.
- A register alias table may be used to point to committed register values within the RF 316.
- A similar table may be used to point to the most recent instances of an architectural register within the RF 316 when register renaming is used to handle data dependencies.
- The resources that may be allocated and deallocated dynamically following the method described in this invention are the IQ 306, the LSQ 310, the RF 316 and the ROB 302.
- The method of the present invention may also be used in variations of these three architectures where all source register operands are read out at the time of issuing instructions to the function units.
- Energy and power requirements are reduced using the inventive method by incrementally allocating and deallocating the resources as has been described.
- The method of the present invention may also be applied to datapath architectures that are clustered or to architectures that use a distributed form of the IQ 106, 206, 306, called reservation stations.
- A predetermined number of units of each type of resource to be dynamically allocated and deallocated is initially allocated.
- A preset counter, not shown, or other suitable device is used to generate signals indicating the end of an update period. The same counter or a different counter may be used to generate signals that determine when resource usage is sampled. When resource units are added, this preset counter may be reset to begin a new update period.
- All resources have a common predetermined update period and a common predetermined sampling period.
- Both the update period and the sampling period are chosen to be powers of 2 in the number of clock cycles. It will be recognized that in alternate embodiments of the invention, variations using update and/or sampling periods specific to a resource type may be implemented. These alternate embodiments use sets of counters for generating signals to mark the end of such periods, typically one counter per resource.
- Each resource type is partitioned into uniform-sized resource units.
- The size of each such resource unit is predetermined and is specific to each type of resource.
- A number of well-known circuit design techniques, such as multiple banking, bitline segmentation or partitioning with shared components, can be used to implement: (i) the resource units themselves; (ii) facilities to add further units to an allocated suite of resource units; and (iii) facilities to deallocate certain already-allocated units.
- Resource units are added (i.e., allocated) as the program requires a higher resource allocation to maintain its performance. If allocated resource units are determined to be unused at the end of an update period, they may be deallocated. The exact nature of the allocation and deallocation steps is described below.
- Referring to FIG. 4, there is shown a flow chart of one possible set of steps for allocating resources that do not behave like FIFO queues.
- Examples of such resources include, but are not limited to, the register file (e.g., RF 316 of the datapath of FIG. 3) and non-collapsing issue queues, where IQ entries can be allocated or freed up at any position within the queue.
- The process of allocating non-queued resources depicted in FIG. 4 begins with the commencement of an update period by initializing an overflow counter to zero, step 400.
- The overflow counter counts the number of times, since the update period started, that resources exceeding current allocations were required.
- When a resource such as the IQ (e.g., IQ 306) is full, instruction dispatch is blocked and performance suffers.
- One clock cycle is allowed to elapse, step 405, and then a check is performed, step 410, to determine if additional resources (beyond the current allocations) were required in the clock cycle that just elapsed. If additional resources were required, step 410, the value of the overflow counter is incremented, step 415, and the process continues at step 420, where the overflow counter is checked to determine if its count has exceeded a predetermined threshold value, variable OTH.
- If this comparison, step 420, indicates that the overflow counter has exceeded OTH, it is then necessary to check whether an additional free resource unit is available, step 425. If no additional free resource units are available, control is transferred to step 440. Otherwise, a resource unit is allocated to increase the current resource allocation, step 430. After housekeeping tasks are performed, such as clearing the variables and counters used for keeping various statistics within an update period and resetting the update period counter to begin a new update period, step 435, the process shown in FIG. 4 terminates.
- If additional resources were not required, step 410, program control is passed to step 440.
- If the overflow counter has not exceeded OTH, step 420, program control is likewise passed to step 440.
- At step 440, a check is performed to determine if the current update period has finished. If so, the process of FIG. 4 is terminated. If, however, the current update period has not yet completed, control is returned to step 405.
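The FIG. 4 flow can be summarized in code as follows. The resource interface (`needed_more_this_cycle`, `free_unit_available`, `allocate_unit`, `reset_statistics`) is an assumption of ours for illustration; the step numbers in the comments map back to the flow chart:

```python
def allocate_non_queued(resource, update_period, overflow_threshold):
    """Sketch of the FIG. 4 allocation flow for non-queued resources."""
    overflow_count = 0                                # step 400
    for _cycle in range(update_period):               # step 405: a cycle elapses
        if resource.needed_more_this_cycle():         # step 410
            overflow_count += 1                       # step 415
            if overflow_count > overflow_threshold:   # step 420: exceeds OTH?
                if resource.free_unit_available():    # step 425
                    resource.allocate_unit()          # step 430
                    resource.reset_statistics()       # step 435: new period
                    return True
        # step 440: the loop condition checks whether the period is over
    return False
```

Note that the function returns (starting a new update period) as soon as a unit is allocated, so allocation can occur mid-period, while the loop bound enforces the periodic update boundary for the no-allocation path.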
- The value of variable OTH may be specific to the type of resource. It is also possible to vary the value of OTH for a single resource over time. Although this does not occur in the embodiment chosen for purposes of disclosure, the present invention encompasses such an additional embodiment.
- The process of FIG. 4 may also be modified to allocate more than a single free allocation unit of a particular resource when the overflow counter exceeds OTH, step 420, early in the update cycle.
- Such a condition indicates a rapidly increasing demand for additional resources which, if not satisfied aggressively, may hurt overall performance.
- The present invention encompasses all such variations of additional free resource unit allocation.
- Referring to FIG. 5, there is shown a flow chart of one possible set of steps required to deallocate a resource of the type allocated according to the process of FIG. 4.
- This deallocation process commences when a new update period starts.
- A variable S maintains a running sum of sampled usage estimates of the currently allocated resources and is initialized, step 500.
- A sampling period is allowed to elapse, step 505.
- The number of occupied entries within the allocated resource units is placed into a variable, N, step 510.
- The number of occupied entries within the allocated resource is also added to S, step 515.
- The term "occupied entries" refers to the number of allocated entries within the currently allocated resource units.
- Bit vectors indicating the occupancy status of the entries within each allocated unit may be created.
- A bit vector contains a bit for every entry within a resource unit, with a 1 indicating an occupied entry and a 0 indicating a free entry.
- The number of 1s in each of these bit vectors may be estimated using known techniques to derive the total number of occupied entries within each allocated resource unit.
- The total number of occupied entries, N, may then be determined by adding up the already-computed sums of the 1s in the bit vectors for the currently allocated resource units. For example, one possible way to perform such an estimate is to use replicated, parallel logic structures to estimate the sum of 1s in the aforesaid bit vectors and add them up using a fast tree adder to determine N.
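The bit-vector occupancy estimate can be mimicked in software. The following is a minimal sketch, with string-based bit counting standing in for the replicated parallel counting logic and the fast tree adder mentioned above:

```python
# Sketch of the occupancy estimate: each allocated resource unit
# contributes a bit vector (1 = occupied entry, 0 = free entry);
# the 1s in each vector are counted and the per-unit counts are
# summed to give N, the total number of occupied entries.

def occupancy(bit_vectors):
    """Total occupied entries, N, across the allocated resource units."""
    return sum(bin(v).count("1") for v in bit_vectors)
```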
- Control is then transferred to step 520.
- If the update period is not over, step 520, control is returned to step 505. If, however, the update period is over, step 520, the average sampled occupancy, A, of the allocated resource units over the update period is estimated, step 525. If the update period and sampling period are both powers of 2 (as in the embodiment chosen for purposes of this disclosure), determining this average occupancy, A, does not require any division; the division process is reduced to a simple operation that ignores some lower-order bits in S.
- The number of resource units, K, required to accommodate the averaged number of occupied entries, A, is determined by dividing A by the number of entries, Q, within each resource unit, and rounding the result up to the nearest higher integer, step 530. Again, a division step may be avoided by choosing Q to be a power of 2. It will be recognized that the value of Q may be specific and different for each resource type.
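The power-of-2 shortcuts for steps 525 and 530 can be made concrete. In this sketch the sample count and the unit size Q are powers of two (assumed, illustrative values), so both divisions reduce to shifts:

```python
# Steps 525-530 without a divider: A = S / 2**samples_log2 drops the
# low-order bits of S, and K = ceil(A / Q) uses a shift when Q = 2**Q_log2.

def units_needed(S, samples_log2, Q_log2):
    A = S >> samples_log2        # step 525: average sampled occupancy
    Q = 1 << Q_log2              # entries per resource unit
    K = (A + Q - 1) >> Q_log2    # step 530: round up to whole units
    return A, K
```

For example, with a running sum S of 40 over 8 samples and 4 entries per unit, the average occupancy is 5 and two resource units are needed.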
- Dynamically allocated datapath resources that are used as queues require special considerations for allocation and deallocation because of the circular nature of FIFO queues. It may be assumed that such queue resources use two pointers, typically head and tail pointers, to identify the two extremes of the circular queue. It may also be assumed that both these pointers are first initialized to zero, and then incremented, typically in a circular fashion to permit wraparound, as the queue grows or shrinks.
- Examples of such FIFO resources include the ROB, the LSQ and collapsing variations of IQs (FIGS. 1, 2, and/or 3).
- The resource units allocated to implement FIFO queues are physically adjacent; the queue structure is confined entirely within the allocated resource units. If a resource unit must be deallocated, the unit that is deallocated is the one that has entries with the highest index values. Likewise, when a new resource unit is added, the free unit added is the one adjacent to the currently allocated partition that has entries with the highest index value.
- The circular nature of these queues, which allows the queue to wrap around within the allocated resource units, adds some complication to the process of allocating and deallocating resource units.
- For such FIFO queue resources, step 430 is replaced with the multiple exemplary steps shown in the flow chart of FIG. 6.
- The tail end of the FIFO queue should be able to extend into the newly allocated unit (i.e., wrap around).
- The allocation process begins by determining if the value of the head pointer is less than or equal to the value of the current tail pointer, step 600. This is a normal comparison that ignores the consequences of circular increments to these pointers; this assumption extends to all pointer comparisons discussed hereafter. If the value of the head pointer is less than or equal to the value of the current tail pointer, step 600, a free resource unit is added adjacent to the currently allocated resource unit having entries with the highest index values, step 605. If, however, the value of the head pointer is greater than the value of the current tail pointer, step 600, a cycle is allowed to elapse and the head and tail pointers are updated to reflect events therewithin, step 610. Control is then returned to step 600.
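The wrap-around constraint of FIG. 6 may be sketched as a single decision function. The pointer values and unit counts are illustrative, and the cycle-by-cycle pointer updates of step 610 are abstracted into the caller:

```python
# Sketch of the FIG. 6 allocation rule for circular FIFO resources:
# a free unit may be appended at the high-index end only while the
# queue is not wrapped (head <= tail as a plain integer comparison).

def fifo_allocate(head, tail, allocated_units, max_units):
    """Return the new unit count, or None to wait a cycle and retry."""
    if head <= tail:                    # step 600: queue not wrapped
        if allocated_units < max_units:
            return allocated_units + 1  # step 605: append a free unit
        return allocated_units          # no free unit available
    return None                         # step 610: elapse a cycle, retry
```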
- Likewise, for FIFO queue resources, step 540 must be replaced with the multiple exemplary steps shown in the flow chart of FIG. 7.
- Deallocation cannot be considered until all entries currently within the unit marked for deallocation are consumed;
- the deallocation should be performed in a manner that allows the queue to wrap around properly following the deallocation.
- The deallocation process of FIG. 7 begins by setting the variable Limit to the index of the highest-numbered slot, step 700. This includes both allocated and unallocated entries within the resource units that are to remain allocated, but excludes resource units marked for deallocation.
- A test is performed to determine if the value of the head pointer is less than or equal to the value of the tail pointer, step 705.
- If the value of the head pointer is less than or equal to the value of the tail pointer, step 705, a test is performed to determine if the value of the tail pointer is less than or equal to that of the variable Limit, step 710. If this is true, the block marked for deallocation is actually deallocated, step 725, and the process of FIG. 7 terminates. If, however, the test, step 710, is false, one clock cycle is allowed to elapse, step 715, and the head and tail pointers are updated as needed, step 720. Control is then transferred to step 705.
- If, however, the value of the head pointer is greater than the value of the tail pointer, step 705, a single clock cycle is allowed to elapse, step 730. A test is then performed, step 735, to determine if any events in the upcoming clock cycle might cause the tail pointer to extend into the unit marked for deallocation. If any such event exists, it or they are momentarily blocked, step 740, and control is transferred to step 720. If no events in the upcoming clock cycle might cause the tail pointer to extend into the unit marked for deallocation, step 735, control is transferred directly to step 720.
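The two conditions guarding deallocation in FIG. 7 reduce to a compact predicate. The following is a sketch under the same plain-comparison assumption used for allocation, with the cycle-elapsing and event-blocking steps left to the caller:

```python
# Sketch of the FIG. 7 deallocation guard: the marked unit may be
# released only when the queue is not wrapped (head <= tail, step 705)
# and the tail no longer reaches past Limit, the highest slot index
# among the units that remain allocated (steps 700 and 710).

def fifo_dealloc_ready(head, tail, limit):
    return head <= tail and tail <= limit
```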
- The size estimates developed using the disclosed inventive methods may also be used to selectively control clock rates of at least one component of a datapath resource.
- Such components include an instruction cache, an execution unit, clusters of registers, and function units. It will be recognized that many other microprocessor components may well benefit from such selective clock rate control, and the invention is not considered limited to these specifically disclosed components.
Abstract
There is provided a system and methods for segmenting datapath resources such as reorder buffers, physical registers, instruction queues and load-store queues, etc. in a microprocessor so that their size may be dynamically expanded and contracted. This is accomplished by allocating and deallocating individual resource units to each resource based on sampled estimates of the instantaneous resource needs of the program running on the microprocessor. By keeping unused datapath resources to a minimum, power and energy savings are achieved by shutting off resource units that are not needed for sustaining the performance requirements of the running program. Leakage energy and switching energy and power are reduced using the described methods.
Description
- This application is a Continuation of U.S. application Ser. No. 12/502,930, filed Jul. 14, 2009, which is a Continuation of U.S. application Ser. No. 11/748,411, filed May 14, 2007 (now U.S. Pat. No. 7,562,243), which is a Continuation of U.S. application Ser. No. 10/727,105, filed Dec. 3, 2003 (now U.S. Pat. No. 7,219,249), which claims priority from Provisional Application 60/431425, filed Dec. 3, 2002, all of which are incorporated herein by reference in their entirety.
- The present disclosure relates to reducing power requirements of microelectronic devices and, more particularly, to an apparatus and method for reducing power dissipation and energy requirements in high-performance microprocessors.
- Modern, high-performance microprocessors use sophisticated instruction scheduling mechanisms and pipelines designed to reorder the startup and completion of instructions in a sequential instruction stream so as to achieve a high-level of processor performance. One common form of such mechanisms is a superscalar microprocessor that is capable of fetching, decoding, issuing, executing, completing and retiring more than one instruction within a single cycle of the clock signal used to synchronize activities at the lowest level in the microprocessor. As used hereinbelow, the term instruction refers to the smallest unit of work that is scheduled independently within a microprocessor.
- In a typical superscalar microprocessor, instructions are fetched from an instruction cache (I-cache) in program order along the predicted path of execution. The instructions are then decoded to resolve inter-instruction dependencies and are then dispatched into a buffer commonly known as the issue queue (IQ). Then, subject to both the availability of execution units (also called function units or FUs), and the input operands of the instruction, each instruction is eventually executed.
- Instructions that are ready for execution, issued from the IQ to the chosen FU, may therefore start as well as finish execution out of program order. To comply with the sequential semantics of the executing program, the processor state as defined by the contents of the committed or architectural registers, as well as the state of the memory, must be updated in program order despite the fact that instructions can complete out of program order. This requirement is met by collecting the results produced out-of-program-order into another buffer called the reorder buffer (ROB). Information stored in the ROB is used to update the processor and memory state into the original program order.
- As instructions are decoded in program order, an entry is simultaneously established in program order in the ROB, which behaves as a first-in, first-out (FIFO) queue. At the time of dispatch, the entry for the dispatched instruction is made at the tail of the ROB. The ROB entry for an instruction can itself serve as the repository of the instruction's results or it may point to the repository of the results within a separate physical register file.
- The process of retiring or committing an instruction involves updating the processor's state and/or the memory state in program order, typically using the information stored in the ROB. Instructions are retired from the head of the ROB. If the ROB entry at the head of the ROB is awaiting the completion of the corresponding instruction, instruction retirement is blocked (i.e., halted) momentarily until the results are correctly produced.
- To process load and store instructions that move data between memory locations and registers, many modern microprocessors also employ a load-store queue (LSQ), which also behaves as a FIFO queue. Entries are established for load and store instructions in program order as the instructions are dispatched, at the tail of the LSQ. Memory operations are started from the LSQ to conform to program ordering.
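The in-order behavior of these FIFO structures can be illustrated with a short sketch. It models only the retirement rule described above (entries established at the tail at dispatch, retirement from the head blocked at the first incomplete entry) and is not part of the disclosed apparatus:

```python
# Illustrative model of in-order retirement from the head of the ROB:
# completed instructions retire in program order, and retirement
# blocks at the first entry still awaiting its result.

from collections import deque

def retire(rob):
    retired = []
    while rob and rob[0]["done"]:         # head entry complete?
        retired.append(rob.popleft()["tag"])
    return retired                        # tags retired this cycle
```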
- In modern microprocessor systems, the overall design strategy has heretofore been a “one-size-fits-all” approach, where datapath resources like the IQ, ROB, registers and LSQ are set at predetermined, fixed sizes irrespective of changes in the instantaneous needs of an executing program for these resources. As a result, these resources frequently remain under-utilized. Unused portions of the resources remain powered up, wasting energy and power.
- Canal and Gonzalez (“A low-complexity issue logic”, Proc. ACM Int'l. Conference on Supercomputing (ICS), pp. 327-335, Santa Fe, N. Mex., June, 2000) describe a scheme to reduce the complexity of the issue queue in a microprocessor. Their technique relies on the use of an additional queue, called the “ready queue” to hold instructions whose operands are determined to be available at the time of instruction dispatch. Instructions can be issued from this ready queue without the need to have energy-dissipating logic to check for the availability of operands. The present invention does not use any auxiliary structures to hold instructions and relies on the reduction of power in the issue queue by controlling the amount of resource units that are allocated for the issue queue. This scheme also makes use of an additional structure called the “first use” table to hold instructions that cannot be put into the “ready queue” at the time of instruction dispatching. With the use of this table and the associated logic, it is not clear that this scheme results in an overall power savings. Unlike the Canal et al. scheme, the present invention also reduces power dissipation within other datapath structures such as the reorder buffer, the load-store queue and the register file.
- Folegnani and Gonzalez (“Energy-Effective Issue Logic”, in Proceedings of the Int'l Symposium on Computer Architecture, June 2001, pp. 230-239) describe a FIFO issue queue that permitted out-of-order issue but avoided the compaction of vacated entries within the valid region of the queue to save power. The queue was divided into regions. The number of instructions committed from the most-recently allocated issue queue region in FIFO order (called the “youngest region”) was used to determine the number of regions within the circular buffer that was allocated for the actual extent of the issue queue. To avoid a performance hit, the number of regions allocated was incremented by one periodically; in-between, also at periodic intervals, a region was deactivated to save energy/power if the number of commits from the current youngest region was below a threshold. The energy overhead of the control logic for doing this resizing was not made clear. Additional energy savings were documented by not activating forwarding comparators within entries that are ready for issue or entries that are unallocated. The scheme of Folegnani et al. is thus limited to a FIFO style issue queue design and does nothing to reduce power dissipation in other datapath structures such as the reorder buffer, the load-store queue and the register file. The present invention is applicable to more general styles of issue queue design, including FIFO issue queues. This invention also reduces power dissipations in the reorder buffer, the load-store queue and the register file. Furthermore, unlike the method of the present invention, the scheme of Folegnani et al. relies on continuous measurements of issue queue activity rather than sampled measurements.
- Bahar and Manne (“Power and Energy Reduction Via Pipeline Balancing”, Proceedings of the Int'l Symposium on Computer Architecture, June 2001, pp. 218-229) describe a scheme for shutting off clusters of execution units and some associated register files in the Compaq 21264 microprocessor based on continuous monitoring of the IPC. The dispatch rate was varied between 4, 6 and 8 to allow an unused cluster of function units to be shut off completely. The dispatch rate changes were triggered by the crossing of thresholds associated with the floating point and overall IPC (average number of instructions processed in a clock cycle), requiring dispatch monitoring on a cycle-by-cycle basis. Fixed thresholds were chosen from the empirical data that was generated experimentally. Significant power savings within the dynamic scheduling components were achieved with a minimum reduction of the IPC. The dynamic allocation of the reorder buffer—a major power sink—was left completely unexplored in this study. The scheme of Bahar et al. is limited to a clustered style microprocessor datapath and relies on continuous monitoring of performance. The present invention, on the other hand, saves power by controlling power dissipations within components smaller than clusters and also includes the reorder buffer, avoiding continuous monitoring of performance.
- A portion of the dynamic resource management described in this invention was first described in the publication of Ponomarev, Kucuk and Ghose (“Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources”, in Proceedings of the 34th International Symposium on Microarchitecture, December 2001, pp. 90-101). Since then, the scheme was extended by S. Dropsho, A. Buyuktosunoglu, R. Balasubramonian, et al., (“Integrating Adaptive On-chip Structures for Reduced Dynamic Power”, in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2002), where limited histogramming was used to control resource allocations instead of average queue sizes. Based on the presented results, it is difficult to see any obvious gains in terms of power/performance trade-offs between the method of the present invention and the scheme of Dropsho et al. It is certain, however, that the use of limited histogramming considerably complicates the control logic.
- Buyuktosunoglu, Alper et al. (U.S. Patent Application No. 2002/0053038, May 2, 2002) describe a method and structure for reducing the power dissipation in a microprocessor that relies on dynamic resizing of at least one storage structure in a microprocessor. Unlike the method of Buyuktosunoglu et al., the present invention directly uses the lack of resource units (which indirectly affects performance) to allocate additional resource units to counter any performance drop arising from the lack of resources. The method of Buyuktosunoglu et al. uses the monitored value of the current IPC (average number of instructions processed in a clock cycle) and the prior measured value of IPC to reallocate additional units of the resized resource when a performance drop exceeds a predetermined threshold. However, the performance drop can be caused by reasons other than resizing, such as branch mispredictions and cache misses. Resource allocation is thus not always necessary when such performance drops are noticed. Furthermore, the present invention controls resource unit allocations for a variety of datapath artifacts such as the issue queue, the reorder buffer, the load-store queue and the register file simultaneously and independently to conserve power with minimal impact on performance. Buyuktosunoglu et al. focus on techniques that are driven solely by the activity of the issue queue. A further distinction is that the present invention uses sampled, non-continuous measurements of the usage of each resource that is controlled, whereas Buyuktosunoglu et al. rely on continuous measurements of activity and performance, such as IPC.
- As used hereinafter, the terms datapath resources or simply resources refer to the IQ, ROB, LSQ and register files, etc., but exclude the architectural register file (ARF). The term resource unit hereinafter refers to the basic unit of incremental resource which may be dynamically allocated or deallocated as required for execution of a particular instruction. The terms interval and period are used interchangeably herein.
- Resource usage as used herein is defined by the actual number of valid entries, hereafter referred to as “occupancy”.
- The present invention is primarily intended for reducing dynamic power dissipation arising from switching activity in the microprocessor and similar devices. Power and energy dissipation arising from leakage in the resource units that are deallocated can also be reduced or avoided by using a variety of techniques known to those of skill in the circuit design arts, including, but not limited to, the use of sleep transistors, circuits using dual-threshold devices and substrate biasing.
- In accordance with the present invention, there is provided an apparatus and method of dynamically estimating the instantaneous resource needs of a program running on a microprocessor. These estimates are used to allocate the minimum number of units of these resources to meet the instantaneous performance needs of that particular program. This approach requires that all allocatable resources be partitionable into independent allocation units that can be incrementally allocated or deallocated. For each of the datapath resources, unused resource units are shut off and isolated from the active, allocated units so as to reduce power dissipations resulting from leakage as well as from switching activities.
- As the program's demands for each resource grow during program execution, further resource units may be independently allocated to each resource. Unused resource units may be reclaimed if the running program is not utilizing them. The reclaimed or deallocated resource units are powered down and isolated from the allocated units to maintain the instantaneous allocation levels at about the right level needed to meet the program's performance needs. The present invention comprises six key features:
- 1) The allocation and deallocation of each type of resource is controlled independently. This is because instantaneous requirements for one type of resource typically vary independently from requirements for a different resource (i.e., there is little, if any correlation between resource requirements). Decisions to deallocate resource units are made periodically, typically at the end of an update period whose duration is predetermined. Resource units may be added within an update period, as described in detail hereinbelow.
- 2) Estimates for the instantaneous need of a program for a specific resource type are obtained through multiple, periodic sampling within the update periods instead of continuous measurements on a cycle-by-cycle basis. The sampling period is predetermined or can be dynamically adjusted. The sampling frequency is typically a multiple of the update frequency.
- 3) At the end of the update period, unused resource units may be deallocated. The deallocation may be gradual, with only one resource unit deallocated at the end of a sampling period, or the deallocation can be more aggressive, with multiple unused resource units being deallocated at the end of the sampling interval. Deallocations typically coincide with the end of an update interval.
- 4) To avoid large penalties on performance, additional resource units are allocated as soon as the true instantaneous demands for the resource exceed the currently allocated units for a predetermined number of times within a sampling interval. When this happens, one or more resource units may be immediately allocated, as availability permits, and a new update period may then be started. Resource allocations thus do not necessarily coincide with the end of the periodic update interval.
- 5) As described in detail hereinbelow, units of certain resources that are organized as FIFO queues may have slightly different allocation and deallocation methods than other types of resources.
- 6) It is possible to use common sampling and update periods for all resources, but these intervals may also be chosen independently for each resource type.
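The interplay of sampling and update periods in features 1) through 4) above may be sketched as follows. The period lengths are assumed, illustrative powers of two, not values taken from the disclosure:

```python
# Sketch of the periodic event generation: occupancy is sampled every
# 2**SAMPLE_LOG2 cycles and deallocation decisions are made every
# 2**UPDATE_LOG2 cycles; both checks use masks and avoid division.

SAMPLE_LOG2 = 5    # sample every 32 cycles (assumed)
UPDATE_LOG2 = 10   # update every 1024 cycles (assumed)

def cycle_events(cycle):
    """Return (sample_now, update_now) for the given cycle number."""
    sample = (cycle & ((1 << SAMPLE_LOG2) - 1)) == 0
    update = (cycle & ((1 << UPDATE_LOG2) - 1)) == 0
    return sample, update
```

Because the update period is a multiple of the sampling period, every update boundary is also a sample boundary, consistent with the sampling frequency being a multiple of the update frequency.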
- Although the methods of the present invention are applicable to superscalar processors that utilize dynamic, hardware-implemented scheduling techniques, they may readily be extended to microprocessors that use a combination of static and dynamic scheduling techniques.
- It is therefore an object of the invention to provide a microprocessor or similar microelectronic apparatus wherein various datapath resources may be dynamically sized.
- It is an additional object of the invention to provide a microprocessor or similar microelectronic apparatus wherein various datapath resources are allocated and deallocated in increments.
- It is a further object of the invention to provide a microprocessor or similar microelectronic apparatus wherein various datapath resources are allocated and deallocated dynamically, responsive to the needs for a particular resource by a particular program being executed.
- It is yet another object of the invention to provide a microprocessor or similar microelectronic apparatus wherein resource units may be allocated one-at-a-time or, if needed, more than one-at-a-time.
- It is an additional object of the invention to provide a microprocessor or similar microelectronic apparatus wherein datapath resources are allocated in accordance with statistics gathered during sampling periods.
- A complete understanding of the present invention may be obtained by reference to the accompanying drawings, when considered in conjunction with the detailed description, in which:
- FIG. 1 is a schematic block diagram of an architecture of the datapath portion of a first superscalar processor wherein certain datapath resources may be dynamically resized in accordance with the invention;
- FIG. 2 is a schematic block diagram of an architecture of a datapath portion of a second superscalar processor wherein certain datapath resources may be dynamically resized;
- FIG. 3 is a schematic block diagram of an architecture of a datapath portion of a third superscalar processor wherein certain datapath resources may be dynamically resized in accordance with the invention;
- FIG. 4 is a flow chart of a method of allocating non-queued resources in accordance with the present invention;
- FIG. 5 is a flow chart of a method of deallocating non-queued resources;
- FIG. 6 is a flow chart of a set of steps associated with allocating resources used like a FIFO queue; and
- FIG. 7 is a flow chart of a set of steps associated with deallocating resources used like a FIFO queue.
- The present invention provides a system that permits the power dissipation and energy requirements of a high-performance microprocessor to be reduced through the dynamic allocation and deallocation of datapath resources, with minimum impact on processor performance.
- The present invention reduces dynamic power dissipation arising from switching activity in microprocessors and similar devices. Power and energy dissipation arising from leakage in the resource units that are deallocated can also be reduced or avoided by using a variety of techniques known to those of skill in the circuit design arts, including, but not limited to, the use of sleep transistors, circuits using dual-threshold devices and substrate biasing.
- Referring first to FIG. 1, there is shown a schematic block diagram of the datapath portion of a first superscalar microprocessor, generally at reference number 100. In the superscalar processor 100, a reorder buffer (ROB) 102 contains the ROB entry set up for an instruction at the time of dispatch. The ROB 102 entry also includes a field, not shown, to hold the result produced by the instruction. ROB 102 operates analogously to a physical register. If an operand value, not shown, has been committed, a dispatched instruction attempts to read operand values from the architectural register file (ARF) 104 directly. If, however, the operand value was generated but has not been committed, a dispatched instruction attempts to read the required operand associatively from the ROB 102 from the most recently established entry for an architectural register.
- Source registers that contain valid data are read into the IQ 106 for the associated instruction. If a source operand is not available at the time of dispatch in the ARF 104 or the ROB 102, the address of the physical register (i.e., the ROB slot) is saved in the tag field associated with the source register in the IQ 106 for the instruction.
- When a function unit 108 completes processing an instruction, it outputs the result produced along with the address of the destination register for this result. This result is placed on a forwarding bus 112 which runs across the length of the IQ 106 and the LSQ 110. An associative tag matching process is then used to steer the result to matching entries within the IQ 106. Since multiple function units 108 may complete processing their respective instructions within a particular cycle, multiple forwarding buses 112 are used. Each input operand field within the IQ 106 thus uses a comparator, not shown, for each forwarding bus 112. Alternative designs use scoreboarding logic to identify the destinations of a forwarded result instead of using tag-based result forwarding.
- For every instruction accessing memory, not shown, an entry is also reserved in the LSQ 110 at the time the instruction is dispatched. Because the address used by a load or a store instruction must be calculated, this instruction is removed from the IQ 106, even if the value to be stored (for store instructions) has not yet been computed. In this case, this value is forwarded to the appropriate LSQ 110 entry as it is generated by one of the function units 108.
- In the datapath architecture of FIG. 1, the resources that are allocated and deallocated dynamically following the method of the present invention are: the IQ 106, the LSQ 110, and the ROB 102.
- The superscalar processor datapath 200 shown in FIG. 2 is similar to processor 100 (FIG. 1), with the difference that the destination registers are allocated within a separate physical register file (PRF) 214. In this case, at the time of dispatching an instruction, a physical register in PRF 214 is allocated for the instruction if its result is destined to a register. In addition, an entry is simultaneously established in ROB 202 for the instruction. The ROB 202 entry for the instruction holds a pointer, not shown, to the destination register of the instruction within the PRF 214. For processor 200, the PRF 214 may be managed exactly like a FIFO queue, similar to the ROB 202.
- An alternative management scheme for the PRF 214 is also possible, requiring a list of allocated registers within the PRF 214 to be maintained.
- In the datapath architecture of FIG. 2, the resources that may be allocated and deallocated dynamically following the method of the invention are: the IQ 206, the LSQ 210, the PRF 214, and the ROB 202.
- FIG. 3 depicts the architecture of the datapath portion 300 of yet another superscalar microprocessor that can benefit from the method of the present invention. In datapath 300, registers allocated to hold both committed register values and the results of instructions targeting a register are held in a common register file, RF 316. A register alias table, not shown, may be used to point to committed register values within the RF 316. A similar table may be used to point to the most recent instances of an architectural register within the RF 316 when register renaming is used to handle data dependencies.
- In the datapath architecture of FIG. 3, the resources that may be allocated and deallocated dynamically following the method described in this invention are: the IQ 306, the LSQ 310, the RF 316, and the ROB 302.
- In addition to application to the three datapath architectures depicted in FIGS. 1, 2 and 3, the method of the present invention may also be used in variations of these three architectures where all source register operands are read out at the time of issuing instructions to the function units. In each of these three datapath architectures and their variations, energy and power requirements are reduced using the inventive method by incrementally allocating and deallocating the resources as has been described.
- The method of the present invention may also be applied to datapath architectures that are clustered or to architectures that use a distributed form of the IQ.
- In the preferred embodiment, all resources have a common predetermined update period and a common predetermined sampling period. Furthermore, both the update period and the sampling period are chosen to be powers of 2 in the number of clock cycles. It will be recognized that in alternate embodiments of the invention, variations using update and/or sampling periods specific to a resource type may be implemented. These alternate embodiments use sets of counters for generating signals to mark the end of such periods, typically one counter per resource.
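By way of illustration only, choosing the update and sampling periods as powers of 2 lets an end-of-period signal be derived from a free-running cycle counter with a simple mask compare rather than a full comparator. The sketch below models this in software; the period lengths and function names are hypothetical examples, not values from the disclosure.

```python
# Illustrative model: with power-of-2 periods, "end of period" reduces to a
# mask compare on a free-running cycle counter. Constants are hypothetical.

UPDATE_PERIOD = 4096    # clock cycles; a power of 2
SAMPLING_PERIOD = 256   # clock cycles; a power of 2

def is_sample_point(cycle):
    # True once every SAMPLING_PERIOD cycles
    return (cycle & (SAMPLING_PERIOD - 1)) == 0

def is_update_point(cycle):
    # True once every UPDATE_PERIOD cycles
    return (cycle & (UPDATE_PERIOD - 1)) == 0
```

A per-resource variation would simply keep one such masked counter per resource type, as noted above.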
- To permit incremental allocation and deallocation of resources, the traditional monolithic forms of these resources are altered to segment each resource type into uniform-sized resource units. The size of each such resource unit is predetermined and is specific to each type of resource. For each resource type, a number of well-known circuit design techniques, such as multiple banking, bitline segmentation, or partitioning with shared components, can be used to implement: (i) the resource units themselves; (ii) facilities to add further units to an allocated suite of resource units; and (iii) facilities to deallocate certain already-allocated units.
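A minimal software sketch of such a segmented resource follows. The class and attribute names are illustrative stand-ins; an actual implementation would use the circuit techniques named above (banking, bitline segmentation), not software.

```python
# Illustrative sketch of a resource segmented into uniform-sized units that
# can be allocated and deallocated incrementally. Names are hypothetical.

class SegmentedResource:
    def __init__(self, unit_size, total_units, initial_units):
        self.unit_size = unit_size            # entries per unit; specific to the resource type
        self.total_units = total_units        # units physically present
        self.allocated_units = initial_units  # units currently allocated

    def capacity(self):
        # entries currently usable by the program
        return self.allocated_units * self.unit_size

    def add_unit(self):
        # allocate one more unit, if any remain free
        if self.allocated_units < self.total_units:
            self.allocated_units += 1
            return True
        return False

    def remove_unit(self):
        # deallocate one unit (the caller ensures its entries are vacated)
        if self.allocated_units > 1:
            self.allocated_units -= 1
            return True
        return False
```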
- As a program is run on the system initialized as described above, resource units are added (i.e., allocated) as the program requires a higher resource allocation to maintain its performance. If allocated resource units are determined to be unused at the end of an update period, they may be deallocated. The exact nature of the allocation and deallocation steps is described below.
- Method for the Allocation of Non-Queue Resources
- Referring now to
FIG. 4, there is shown a flow chart of one possible set of steps for allocating resources that do not behave like FIFO queues. Examples of such resources include, but are not limited to, the register file (e.g., RF 316 of the datapath of FIG. 3) and non-collapsing issue queues, where IQ entries can be allocated or freed up at any position within the queue. - The process of allocating non-queued resources depicted in
FIG. 4 begins with the commencement of an update period by initializing an overflow counter to zero, step 400. The overflow counter counts the number of times, since the update period started, that resources exceeding current allocations were required. For a non-collapsing IQ (e.g., IQ 306), when additional resources beyond the current allocations are needed but not allocated, instruction dispatch is blocked and performance suffers. - Next, one clock cycle is allowed to elapse,
step 405, and then a check is performed, step 410, to determine if additional resources (beyond the current allocations) were required in the clock cycle that just elapsed. If additional resources were required, step 410, the value of the overflow counter is then incremented, step 415, and the process continues at step 420, where the overflow counter is checked to determine if its count has exceeded a predetermined threshold value, variable OTH. - If this comparison,
step 420, indicates that the overflow counter has exceeded OTH, it is then necessary to check whether an additional free resource unit is available, step 425. If no additional free resource units are available, control is transferred to step 440. Otherwise, a resource unit is allocated to increase the current resource allocation, step 430. After housekeeping tasks are performed, such as clearing variables and counters for keeping various statistics within an update period, and resetting the update period counter to begin a new update period, step 435, the process shown in FIG. 4 terminates. - If, however, additional resources are not required,
step 410, program control is passed to step 440. - Likewise, if the overflow counter has not exceeded OTH,
step 420, program control is returned to step 440. - In
step 440, a check is performed to determine if the current update period has finished. If so, the process of FIG. 4 is terminated. If, however, the current update period has not yet completed, control is returned to step 405. - It will be recognized that the value of variable OTH may be specific to the type of resource. It is also possible to vary the value of variable OTH for a single resource over time. Although this does not occur in the embodiment chosen for purposes of disclosure, the present invention encompasses such an additional embodiment.
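By way of illustration, the per-cycle loop of FIG. 4 may be sketched in software as follows. The resource object and the per-cycle demand signal are hypothetical stand-ins for the hardware monitoring described above; this is an illustrative model, not the claimed hardware.

```python
# Sketch of one update period of the FIG. 4-style allocation loop.
# Names (Resource, needed_extra) are hypothetical.

class Resource:
    def __init__(self, allocated_units, total_units):
        self.allocated_units = allocated_units
        self.total_units = total_units

    def add_unit(self):
        # steps 425/430: allocate one more unit if any are free
        if self.allocated_units < self.total_units:
            self.allocated_units += 1
            return True
        return False

def allocation_update_period(resource, needed_extra, OTH, update_period):
    """needed_extra(cycle) -> bool: did the program need entries beyond the
    current allocation in this cycle?"""
    overflow = 0                              # step 400: start of update period
    for cycle in range(update_period):        # step 405: one cycle elapses
        if needed_extra(cycle):               # step 410
            overflow += 1                     # step 415
            if overflow > OTH:                # step 420
                if resource.add_unit():       # steps 425/430
                    return "unit_allocated"   # step 435: a new period begins
                # no free unit: fall through to step 440 and keep cycling
    return "period_expired"                   # step 440
```

With sustained demand and a free unit available, the loop allocates a unit a few cycles into the period; with no demand, the period simply expires.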
- It will also be recognized that the process of
FIG. 4 may be modified to allocate more than a single free allocation unit of a particular resource when the overflow counter exceeds OTH, step 420, early in the update cycle. Such a condition indicates a rapidly increasing demand for additional resources which, if not satisfied aggressively, may hurt overall performance. The present invention encompasses all such variations of additional free resource unit allocation. - Method for the Deallocation of Non-Queue Resources
- Referring now to
FIG. 5, there is shown a flow chart of one possible set of steps required to deallocate a resource of the type allocated according to the process of FIG. 4. This deallocation process commences when a new update period starts. First, a variable S, which maintains a running sum of sampled usage estimates of the currently allocated resources, is initialized, step 500. Once variable S is initialized, a sampling period is allowed to elapse, step 505. At the end of the elapsed sampling period, the number of occupied entries within the allocated resource units is placed into a variable, N, step 510. In addition, the number of occupied entries within the allocated resource is also added to S, step 515. It should be noted that the term occupied entries refers to the number of allocated entries within the currently-allocated resource units. - At the end of a sampling interval, bit vectors indicating the occupancy status of the entries within each allocated unit may be created. Typically, such a bit vector contains a bit for every entry within a resource unit, with a 1 indicating an occupied entry and a 0 indicating a free entry. The number of 1s in each of these bit vectors may be estimated using known techniques to derive the total number of occupied entries within each allocated resource unit. The total number of occupied entries, N, may then be determined by adding up the already computed sums of the 1s in the bit vectors for the currently allocated resource units. For example, one possible way to perform such an estimate is to use replicated, parallel logic structures to estimate the sum of 1s in the aforesaid bit vectors and add them up using a fast tree adder to determine N.
- At the end of a sampling interval, after updating S,
step 515, control is transferred to step 520. - If the update period has not yet expired,
step 520, control is returned to step 505. If, however, the update period is over, step 520, the average sampled occupancy, A, of the allocated resource units over the update period is estimated, step 525. If the update period and sampling period are both powers of 2 (as used in the embodiment chosen for purposes of this disclosure), determining this average occupancy, A, does not require any division; the division reduces to a simple operation that ignores some lower-order bits in S. - Next, the number of resource units, K, required to accommodate the averaged number of occupied entries, A, is determined by dividing A by the number of entries, Q, within each resource unit, and rounding the result up to the nearest higher integer,
step 530. Again, a division step may be avoided by choosing Q to be a power of 2. It will be recognized that the value Q may be specific and different for each resource type. - Next, a check is performed to determine if K is smaller than the number of currently allocated resource units,
step 535. If not, the process of FIG. 5 terminates. If, however, K is smaller than the number of currently allocated resource units, step 535, a single unit of resource is marked for deallocation, step 540, and the process of FIG. 5 terminates. The actual deallocation of this marked resource unit takes place when all occupied entries within this unit are consumed (i.e., vacated). No entries are allocated within the resource unit marked for deallocation. In a more aggressive deallocation scheme that emphasizes power/energy savings over performance, more than one allocated resource unit, up to a maximum of the difference between K and the number of currently allocated units, may be marked for deallocation and may eventually be deallocated. - General Usage of Resources Used Like a FIFO Queue
- The dynamically allocated datapath resources that are used as a queue (such as the ROB, the LSQ and collapsing variations of IQs (
FIGS. 1, 2, and/or 3)) require special considerations for allocations and deallocations because of the circular nature of the FIFO queues. It may be assumed that such queue resources use two pointers, typically head and tail pointers, to identify the two extremes of the circular queue. It may also be assumed that both these pointers are first initialized to zero, and then incremented, typically in a circular fashion to permit wraparound, as the queue grows or shrinks. Hereinafter, in the description of the FIFO resources and in related methods exemplified in the flow charts of FIGS. 6 and 7, all arithmetic operations and comparisons performed on the head and tail pointers of the queue take into account the implications of wraparound. New entries are made at the end identified by the tail pointer, after incrementing the tail pointer to point to the next empty entry. Entries are consumed (i.e., removed) from the head of the queue. More specifically, the entry pointed to by the head pointer is consumed and the value of the head is then incremented circularly to point to the next entry to be removed. For the ROB, establishing an entry at the tail of the queue corresponds to the creation of a ROB entry for an instruction at the time that it is dispatched. The consumption of a ROB entry using the head pointer corresponds to the act of retiring an instruction. - Typically, the resource units allocated to implement FIFO queues are physically adjacent; the queue structure is confined entirely within the allocated resource units. If a resource unit must be deallocated, the unit that is deallocated is the one that has entries with the highest index values. Likewise, when a new resource unit is added, the free unit added is the one adjacent to the currently allocated partition that has entries with the highest index value.
The circular nature of these queues, which allow the queue to wrap around within the allocated resource units, adds some complication to the process of allocating and deallocating resource units.
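The wrap-aware pointer handling assumed here can be sketched as follows. All names are illustrative; the plain head <= tail comparison (with no wraparound correction) signals that the queue occupies one contiguous stretch of entries, which is the condition the allocation and deallocation methods below test before resizing at the high-index end.

```python
# Illustrative sketch of circular head/tail pointer handling.

def advance(ptr, capacity):
    """Circular increment confined to the entries of the currently
    allocated, physically adjacent resource units."""
    return (ptr + 1) % capacity

def occupancy(head, tail, capacity, empty):
    """Entries currently in the queue, accounting for wraparound; an
    explicit empty flag disambiguates head == tail (empty vs. full)."""
    if empty:
        return 0
    n = (tail - head) % capacity
    return n if n else capacity

def is_unwrapped(head, tail):
    """True when the queue has not wrapped past the high-index end, so a
    unit at that end may safely be appended or drained and removed."""
    return head <= tail
```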
- Method for Allocating Resources Used Like a FIFO Queue
- The decisions leading to the conclusion that additional resource units must be allocated to a resource used as a circular FIFO queue structure are substantially identical to those illustrated in the process of
FIG. 4. However, step 430 is replaced with the multiple exemplary steps shown in the flow chart of FIG. 6. One significant difference is that the tail end of the FIFO queue should be able to extend into the newly allocated unit (i.e., wrap around). - The allocation process begins with determining if the value of the head pointer is less than or equal to the value of the current tail pointer,
step 600. This is a normal comparison that ignores the consequences of circular increments to these pointers. This assumption is extended to all pointer comparisons discussed hereafter. If the value of the head pointer is less than or equal to the value of the current tail pointer, step 600, a free resource unit is added adjacent to the currently allocated resource unit having entries with the highest index values, step 605. If, however, the value of the head pointer is greater than the value of the current tail pointer, step 600, a cycle is allowed to elapse and the head and tail pointers are updated to reflect events therewithin, step 610. Control is then returned to step 600. - Method for Deallocating Resources Used Like a FIFO Queue
- The decisions leading to the conclusion that previously-allocated resource units may be deallocated in a resource used as a circular FIFO queue structure are substantially identical to those illustrated in the process of
FIG. 5. However, step 540 must be replaced with the multiple exemplary steps shown in the flow chart of FIG. 7.
- a) As in the case of non-queue resources, deallocation cannot be considered until all entries currently within the unit marked for deallocation are consumed;
- b) While the actual deallocation of the unit identified for deallocation is pending, the queue should not be allowed to grow back into that resource unit, and any event (e.g., instruction dispatching, in the case of the ROB) that causes the queue to grow like this should be suspended until the resource unit is deallocated; and
- c) The deallocation should be performed in a manner that allows the queue to wrap around, properly following the deallocation.
- The deallocation process of
FIG. 7 begins by setting the variable Limit to the index of the highest numbered slot, step 700. This includes both allocated and unallocated entries within the resource units that are to remain allocated, but excludes resource units marked for deallocation. - Next, a test is performed to determine if the value of the head pointer is less than or equal to the value of the tail pointer,
step 705. - If so,
step 705, a test is performed to determine if the value of the tail pointer is less than or equal to that of the variable Limit, step 710. If this is true, the block marked for deallocation is actually deallocated, step 725, and the process of FIG. 7 terminates. If, however, the test, step 710, is false, one clock cycle is allowed to elapse, step 715, and the head and tail pointers are updated as needed, step 720. Control is then transferred to step 705. - If, however, the value of the head pointer is greater than the value of the tail pointer,
step 705, a single clock cycle is allowed to elapse, step 730. A test is then performed, step 735, to determine if any events in the upcoming clock cycle might cause the tail pointer to extend into the unit marked for deallocation. If any such event exists, it or they are momentarily blocked, step 740, and control is transferred to step 720. If no events in the upcoming clock cycle might cause the tail pointer to extend into the unit marked for deallocation, step 735, control is transferred directly to step 720. - It will be recognized by those skilled in the design of processor architecture that the two methods described above for handling the allocation and deallocation of resource units for resources that are used like circular FIFO queues may be modified to permit the allocation and deallocation of more than one resource item at a time. Consequently, the present invention is not considered to be limited by the embodiment chosen for purposes of disclosure.
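One cycle's evaluation of the FIG. 7 deallocation conditions may be sketched as follows. The return values are illustrative labels for the control actions described above, and the pointer comparisons ignore wraparound, as in the text.

```python
# Illustrative sketch of the FIG. 7 decision logic; names are hypothetical.

def fifo_dealloc_step(head, tail, limit):
    """limit: index of the highest-numbered slot among the units that are to
    remain allocated (step 700)."""
    if head <= tail:                     # step 705: queue is not wrapped
        if tail <= limit:                # step 710: queue fits in the remaining units
            return "deallocate_now"      # step 725: release the marked unit
        return "wait_for_drain"          # steps 715/720: let entries be consumed
    # head > tail: queue is wrapped; any event that would grow the tail into
    # the marked unit must be blocked while waiting (steps 730-740)
    return "block_growth_and_wait"
```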
- Some general aspects of these inventive methods should be noted. First, new resource units are typically allocated more rapidly than resource units are deallocated. This avoids noticeable performance degradation. Second, the actual deallocation of resource units for resources that are used like a circular FIFO queue can be delayed substantially until the conditions for deallocation are all valid. During this time, events such as instruction dispatching in the case of a ROB may also be momentarily blocked.
- The size estimates developed using these disclosed inventive methods may also be used to selectively control clock rates to at least one component of a datapath resource. Such components include an instruction cache, an execution unit, clusters of registers, and function units. It will be recognized that many other microprocessor components may well benefit from such selective clock rate control, and the invention is not considered limited to these specifically disclosed components.
- Since other modifications and changes varied to fit particular operating requirements and environments will be apparent to those skilled in the art, the invention is not considered limited to the examples chosen for purposes of disclosure, and covers all changes and modifications which do not constitute departures from the true spirit and scope of this invention.
- Having thus described the invention, what is desired to be protected by Letters Patent is presented in the subsequently appended claims:
Claims (20)
1. A method comprising:
dynamically estimating a resource allocation requirement for a datapath resource of a processor, wherein the dynamically estimating comprises using estimates of a usage of the datapath resource by a computing process; and
dynamically altering the resource allocation of the datapath resource responsive to the resource allocation requirement estimate.
2. The method of claim 1 , wherein the estimates include sampled estimates.
3. The method of claim 1, wherein the estimates comprise periodic measurements within at least one update period.
4. The method of claim 3 , wherein the update period is dynamically determined.
5. The method of claim 1 , wherein the estimates are determined during an estimation interval.
6. The method of claim 1 , wherein dynamically altering comprises allocating an additional discrete resource unit to the datapath resource, and deallocating a discrete resource unit from the datapath resource.
7. The method of claim 6 , wherein the additional resource unit is allocated more rapidly than the resource unit is deallocated.
8. The method of claim 6 , wherein allocating the additional resource unit to the datapath resource comprises starting a new update period.
9. The method of claim 1 , wherein the datapath resource comprises a resource used as a FIFO queue.
10. The method of claim 1 , further comprising using the resource allocation requirement estimate to selectively adjust a clock rate to at least one processor component.
11. The method of claim 1 , further comprising using the resource allocation requirement estimate to dynamically control a rate of instruction dispatch.
12. The method of claim 1, further comprising using the resource allocation requirement estimate to selectively adjust a supply voltage to at least one microprocessor component.
13. An apparatus comprising:
a computing module configured to dynamically estimate a resource allocation requirement for a datapath resource of a processor, wherein the dynamically estimating comprises using estimates of a usage of the datapath resource by a computing process, wherein the computing module is further configured to dynamically alter the resource allocation of the datapath resource responsive to the resource allocation requirement estimate.
14. The apparatus of claim 13, wherein the estimates comprise periodic measurements within at least one update period.
15. The apparatus of claim 14 , wherein the update period is dynamically determined.
16. The apparatus of claim 13 , wherein the estimates include sampled estimates.
17. A system comprising:
means for dynamically estimating a resource allocation requirement for a datapath resource of a processor, wherein the dynamically estimating comprises using sampled estimates of a usage of the datapath resource by a computing process; and
means for dynamically altering the resource allocation of the datapath resource responsive to the resource allocation requirement estimate.
18. The system of claim 17 , further comprising means for selectively adjusting a clock rate to at least one processor component using the resource allocation requirement estimate.
19. The system of claim 17 , further comprising means for dynamically controlling a rate of instruction dispatch using the resource allocation requirement estimate.
20. The system of claim 17, further comprising means for selectively adjusting a supply voltage to at least one microprocessor component using the resource allocation requirement estimate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/666,097 US8700938B2 (en) | 2002-12-03 | 2012-11-01 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US43142502P | 2002-12-03 | 2002-12-03 | |
US10/727,105 US7219249B1 (en) | 2002-12-03 | 2003-12-03 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
US11/748,411 US7562243B1 (en) | 2002-12-03 | 2007-05-14 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
US12/502,930 US8321712B2 (en) | 2002-12-03 | 2009-07-14 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
US13/666,097 US8700938B2 (en) | 2002-12-03 | 2012-11-01 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/502,930 Continuation US8321712B2 (en) | 2002-12-03 | 2009-07-14 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
Publications (2)
Publication Number | Publication Date |
---|---|
US20130061236A1 true US20130061236A1 (en) | 2013-03-07 |
US8700938B2 US8700938B2 (en) | 2014-04-15 |
Family
ID=38015874
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/727,105 Active 2025-06-14 US7219249B1 (en) | 2002-12-03 | 2003-12-03 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
US11/748,411 Expired - Lifetime US7562243B1 (en) | 2002-12-03 | 2007-05-14 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
US12/502,930 Expired - Lifetime US8321712B2 (en) | 2002-12-03 | 2009-07-14 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
US13/666,097 Expired - Lifetime US8700938B2 (en) | 2002-12-03 | 2012-11-01 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
Family Applications Before (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/727,105 Active 2025-06-14 US7219249B1 (en) | 2002-12-03 | 2003-12-03 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
US11/748,411 Expired - Lifetime US7562243B1 (en) | 2002-12-03 | 2007-05-14 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
US12/502,930 Expired - Lifetime US8321712B2 (en) | 2002-12-03 | 2009-07-14 | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
Country Status (1)
Country | Link |
---|---|
US (4) | US7219249B1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9135063B1 (en) | 2009-07-21 | 2015-09-15 | The Research Foundation For The State University Of New York | Apparatus and method for efficient scheduling of tasks |
US20190101973A1 (en) * | 2017-09-29 | 2019-04-04 | Advanced Micro Devices, Inc. | Saving power in the command processor using queue based watermarks |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020133729A1 (en) * | 2001-03-19 | 2002-09-19 | Guy Therien | Method for determining transition points on multiple performance state capable microprocessors |
US6684298B1 (en) * | 2000-11-09 | 2004-01-27 | University Of Rochester | Dynamic reconfigurable memory hierarchy |
US8321712B2 (en) * | 2002-12-03 | 2012-11-27 | The Research Foundation Of State University Of New York | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9135063B1 (en) | 2009-07-21 | 2015-09-15 | The Research Foundation For The State University Of New York | Apparatus and method for efficient scheduling of tasks |
US9377837B2 (en) | 2009-07-21 | 2016-06-28 | The Research Foundation For The State University Of New York | Apparatus and method for efficient scheduling of tasks |
US9715264B2 (en) | 2009-07-21 | 2017-07-25 | The Research Foundation Of The State University Of New York | System and method for activation of a plurality of servers in dependence on workload trend |
US9753465B1 (en) | 2009-07-21 | 2017-09-05 | The Research Foundation For The State University Of New York | Energy aware processing load distribution system and method |
US10289185B2 (en) | 2009-07-21 | 2019-05-14 | The Research Foundation For The State University Of New York | Apparatus and method for efficient estimation of the energy dissipation of processor based systems |
US11194353B1 (en) | 2009-07-21 | 2021-12-07 | The Research Foundation for the State University | Energy aware processing load distribution system and method |
US11429177B2 (en) | 2009-07-21 | 2022-08-30 | The Research Foundation For The State University Of New York | Energy-efficient global scheduler and scheduling method for managing a plurality of racks |
US11886914B1 (en) | 2009-07-21 | 2024-01-30 | The Research Foundation For The State University Of New York | Energy efficient scheduling for computing systems and method therefor |
US20190101973A1 (en) * | 2017-09-29 | 2019-04-04 | Advanced Micro Devices, Inc. | Saving power in the command processor using queue based watermarks |
US10955901B2 (en) * | 2017-09-29 | 2021-03-23 | Advanced Micro Devices, Inc. | Saving power in the command processor using queue based watermarks |
Also Published As
Publication number | Publication date |
---|---|
US8700938B2 (en) | 2014-04-15 |
US20100017638A1 (en) | 2010-01-21 |
US7562243B1 (en) | 2009-07-14 |
US8321712B2 (en) | 2012-11-27 |
US7219249B1 (en) | 2007-05-15 |
Similar Documents
Publication | Title |
---|---|
US8700938B2 (en) | System and method for reducing power requirements of microprocessors through dynamic allocation of datapath resources |
US7698707B2 (en) | Scheduling compatible threads in a simultaneous multi-threading processor using cycle per instruction value occurred during identified time interval |
US7627770B2 (en) | Apparatus and method for automatic low power mode invocation in a multi-threaded processor |
EP1238341B1 (en) | Method, apparatus, medium and program for entering and exiting multiple threads within a multithreaded processor |
US7752627B2 (en) | Leaky-bucket thread scheduler in a multithreading microprocessor |
US7600135B2 (en) | Apparatus and method for software specified power management performance using low power virtual threads |
US7676808B2 (en) | System and method for CPI load balancing in SMT processors |
US7613904B2 (en) | Interfacing external thread prioritizing policy enforcing logic with customer modifiable register to processor internal scheduler |
EP1236107B1 (en) | Method and apparatus for disabling a clock signal within a multithreaded processor |
TWI494850B (en) | Providing an asymmetric multicore processor system transparently to an operating system |
US7353370B2 (en) | Method and apparatus for processing an event occurrence within a multithreaded processor |
US20130179615A1 (en) | Increasing Turbo Mode Residency Of A Processor |
US20110087865A1 (en) | Intermediate Register Mapper |
US20060037021A1 (en) | System, apparatus and method of adaptively queueing processes for execution scheduling |
Wang et al. | Utilization-based resource partitioning for power-performance efficiency in SMT processors |
Marculescu | Application adaptive energy efficient clustered architectures |
WO2021061367A1 (en) | Soft watermarking in thread shared resources implemented through thread mediation |
Zeng et al. | Register Versioning: A Low-Complexity Implementation of Register Renaming in Out-of-Order Microarchitectures |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); Year of fee payment: 4 |
| MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY; Year of fee payment: 8 |