WO2008141873A1 - Method and unit for power management of a microprocessor

Info

Publication number: WO2008141873A1
Application number: PCT/EP2008/054540
Authority: WO
Inventors: Cedric Lichtenau; Thomas Pflueger; Ulrich Weiss; Tobias Gemmeke
Original assignee: International Business Machines Corporation
Priority date: 2007-05-22
Filing date: 2008-04-15
Publication date: 2008-11-27

Abstract

The invention relates to a method for power management of a microprocessor characterized by the use of a workload prediction method for adaptation of the power consumption of said microprocessor to workload during program execution. The method is characterized by measuring at least microprocessor utilization between two branch instructions. This is achieved by using a utilization history table similar to a branch history that profiles the microprocessor utilization during execution and is used for subsequent execution of the same code segment.

Description

Method and unit for power management of a microprocessor

FIELD OF THE INVENTION

The invention relates to a method and a unit for power management of a microprocessor according to the preambles of the independent claims.

DESCRIPTION OF THE RELATED ART

Power and cooling constraints increasingly determine the operating points of system components and, in particular, those of the Central Processing Unit (CPU) and memory interface of computer systems. The operating point in this case is defined as the operating frequency and associated voltage of the supply powering the circuits in the CPU. Any power-efficient design will require the CPU chip to operate at the lowest possible frequency for the current workload and at the lowest safe voltage for this given operating frequency. In general, this 'safe voltage' is a function of frequency, where the general structure of the function is known from device physics but the details are determined by the technology used in the development and manufacture of the part. Hence, one can assume a one-to-one correspondence between voltage and frequency, and one can characterize each operating point as an ordered pair, <frequency, voltage>. For the remainder of this document, we will use the term operating point to denote a specific operating frequency and its associated voltage.

A number of commercial processors now offer dynamic voltage and frequency scaling (DVS) as a mechanism to reduce or limit power consumption. Examples include the so-called Enhanced SpeedStep® technology in Intel processors, PowerNow!™ in AMD processors, and PowerTune® in IBM PowerPC® 970. There is significant prior art that proposes schemes that use CPU utilization to determine when to use DVS without reducing, or without excessively reducing, the computing system's performance. Low CPU utilization is considered indicative of a low performance requirement, which, in turn, lets the system use a lower operating point, thereby saving power. High CPU utilization at a lower operating point would be considered indicative of higher demand for processor cycles and consequently interpreted as a situation in which a higher operating point could improve performance. Other schemes for determining the right operating point to use for particular workloads and workload mixes use off-line characterization of workload behavior or a general expectation from the type of workload. For example, running compute-intensive applications would cause the CPU to use a higher operating point, while running user-interaction dominated applications would cause the CPU to use a lower operating point. In all proposed and implemented approaches, DVS is exploited primarily for power savings, with user-specified, application-specified or system-inferred measures to estimate the CPU requirements.
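The utilization-driven schemes described above can be illustrated with a minimal sketch. It is not taken from any of the named products: the operating points, the two thresholds and the function name next_operating_point are hypothetical values chosen only to show the reactive step-up/step-down behaviour.

```python
# Operating points as <frequency, voltage> pairs, lowest first.
# All numbers are invented for illustration.
OPERATING_POINTS = [(1.0e9, 0.9), (1.5e9, 1.0), (2.0e9, 1.1)]

LOW_UTIL = 0.3   # assumed threshold: below this, step down one operating point
HIGH_UTIL = 0.8  # assumed threshold: above this, step up one operating point


def next_operating_point(index, utilization):
    """Return the index of the operating point to use for the next interval."""
    if utilization > HIGH_UTIL and index < len(OPERATING_POINTS) - 1:
        return index + 1  # high demand at the current point: raise it
    if utilization < LOW_UTIL and index > 0:
        return index - 1  # low demand: a lower point saves power
    return index


# Measured utilization per interval (hypothetical trace).
trace = [0.9, 0.95, 0.5, 0.1, 0.1]
index = 0
history = []
for utilization in trace:
    index = next_operating_point(index, utilization)
    history.append(index)
```

Such a policy can only move after the utilization has already been observed, so each change of operating point trails the change in workload by at least one measurement interval.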

While frequency and voltage of operation impact power consumption, they do not solely determine it. How a workload utilizes the CPU chip also affects power consumption. The active power of a semiconductor circuit is typically taken to be proportional to A*C*V²*f, where A is the switching activity factor, V is the operating voltage, f is the operating frequency, and C is the capacitance. The activity factor depends on the workload. Based on how workloads use the chip, the activity in the circuits is different leading to different power consumption for different workloads at the same operating point.
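As a worked illustration of this proportionality, the following sketch evaluates A*C*V²*f for two workloads at the same operating point and for one workload at a scaled operating point. The proportionality constant is taken as 1, and the capacitance, voltage, frequency and activity values are hypothetical.

```python
def active_power(activity, capacitance, voltage, frequency):
    """Active power under the A*C*V^2*f proportionality (constant assumed to be 1)."""
    return activity * capacitance * voltage ** 2 * frequency


# Hypothetical operating point and effective switched capacitance.
CAPACITANCE = 1.0e-9  # farads
VOLTAGE = 1.0         # volts
FREQUENCY = 2.0e9     # hertz

# Two workloads at the same operating point, differing only in activity factor.
light = active_power(0.1, CAPACITANCE, VOLTAGE, FREQUENCY)
heavy = active_power(0.3, CAPACITANCE, VOLTAGE, FREQUENCY)

# The same heavy workload with voltage and frequency both scaled by a factor s.
s = 0.9
scaled = active_power(0.3, CAPACITANCE, s * VOLTAGE, s * FREQUENCY)
ratio = scaled / heavy  # equals s**3 in this model
```

In this simple model the power scales linearly with the activity factor and with the cube of a common voltage and frequency scaling factor. Leakage power is not modeled.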

When system designers determine the nominal operating point allowed for a CPU and system for a given cooling capacity, power distribution system and power supply, they do so for a worst-case workload, one whose overall activity factor is an upper bound that they expect no realistic workload to reach. This allows them to ship the system safely with a guarantee that it will run correctly at the nominal operating point no matter what workload is running. As semiconductor chips incorporate more circuit-level power savings technologies such as clock gating and power gating, the variation in power consumption for different workloads due to different activity factors increases. So a CPU whose nominal operating point is determined by the worst-case workload behavior is leaving increasingly larger amounts of unused capacity and performance for all other workloads.

A static prediction of power consumption based on code analysis is very difficult to achieve, since modern processors use context switches, have to respond dynamically to interrupts and have, for example, multi-level memory subsystems.

Current state-of-the-art methods used in the industry instead react to ambient conditions such as temperature and instruction issue rate in order to stay within the thermal design point or to adapt the operating point to the current conditions. Considerable effort is spent on responding quickly, since no forecast is used. Low utilization phases are typically recognized late, after the processor has been in this state for an extended period of time. This leaves large power reduction possibilities unexploited. Similarly, when high performance is required, it takes time to detect the condition and bring frequency and voltage back up.

SUMMARY OF THE INVENTION

It is an objective of the invention to provide a method for adaptation of the power consumption of a microprocessor to workload conditions while avoiding performance degradation, where the adaptation can be immediate or preemptive.

Another objective is to provide a power management unit, a data processing program and a computer program product for performing such a method.

The objectives are achieved by the features of the independent claims. The other claims and the description disclose advantageous embodiments of the invention.

A method for power management of a microprocessor is proposed, which employs a workload prediction method for adaptation of the power consumption of said microprocessor to workload during program execution. The method is further characterized by deriving said workload prediction from processor utilization.

The present invention provides a means of fine-grained forecasting for adaptation to changing workload, in order to maximize power savings of the CPU during program execution without reducing the performance of the microprocessor. The invention is based on a specific method to solve this task by a special kind of workload prediction. The method, if realized on microprocessors with pipeline architecture, uses the branch prediction facilities of these microprocessors. The invention also allows thermal design point guidelines to be followed closely, since the power consumption can be predicted accurately.

Additionally, it is possible to run the processor at higher frequency / performance due to better power management.

Favorably, when the proposed power management method is used, the power consumption decreases when a program is rerun on the processor.

In a further embodiment, a power management unit of a microprocessor is proposed, comprising means for the use of a workload prediction method for adaptation of the power consumption of said microprocessor to workload during program execution. The power management unit comprises further a usage prediction logic connected to a usage profiling block and a usage history table for adjustment of power consumption of said microprocessor to changing workload by delivering input to a frequency voltage control block of said microprocessor.

The method can be completely implemented on a microprocessor by making wide use of already existing components. The control of power management takes place on a dedicated unit that may be on or off chip.

In another embodiment of the invention a data processing program for execution in a data processing system is proposed, comprising software code portions for performing said method when said program is run on a computer.

According to another aspect of the invention a computer program product is proposed, which is stored on a computer usable medium and which comprises computer readable program means for causing a computer to perform said method when said program is run on said computer. Particularly the method comprises the steps of: measuring at least processor utilization between two branch instructions with said usage profiling block; updating said usage history table with said processor utilization for the code executed between the last and current branch via said usage prediction logic; synchronizing said usage prediction logic with said branch prediction logic connected to said branch history table as well as with the input from said usage profiling block; tuning the frequency and/or the voltage and/or other means to reduce power to the various parts of said processor and/or other chips in the system by said frequency voltage control block based on the input from said usage prediction logic.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention together with the above-mentioned and other objects and advantages may best be understood from the following detailed description of the embodiments, but not restricted to the embodiments, wherein is shown in:

Fig. 1 a circuit overview of the data flow for the proposed power management method;

Fig. 2 a branch prediction scheme used for the power management method;

Fig. 3 an operation scheme for the power management method using the proposed usage history table;

Fig. 4 a preferred general layout for a frequency generation and transformation circuit of electronic signals;

Fig. 5 input and output signals for exemplifying the functionality of a 5-to-4 frequency reduction according to a preferred SDA method;

Fig. 6 main components of a preferred signal delay element for implementation of a preferred SDA method on a chip; and

Fig. 7 a preferred data processing system for performing a preferred method in accordance with the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

According to the invention, a method for power management of a microprocessor is proposed, characterized by the use of a workload prediction method for adaptation of the power consumption of said microprocessor to workload during program execution, where the adaptation can be immediate or preemptive. The method is further characterized by deriving said workload prediction from processor utilization.

The accurate prediction of the power consumption suggested in the preferred embodiment of this invention is achieved by using a usage history table 120, very similar to a branch history table, that profiles the CPU utilization during execution and is used for subsequent execution of the same code segment. Modern branch prediction methods generally provide better than 90% accurate branch prediction. Since the proposed power management prediction method is based on the same idea, comparable power prediction accuracy can be achieved. As with a branch history table, the usage history prediction is also available 5 to 10 cycles before the branch is effectively taken, allowing the clocking frequency to be adjusted in time rather than later as a response, e.g., to a voltage drop. This vastly increases the effectiveness of the proposed method.

Results have shown that when a code section is executed many times it tends to behave the same way each time. Advantage is taken of this fact to tune voltage, frequency and/or clock gating and/or other methods of power reduction for part or all of the core/chip. Since the condition is predicted ahead of time, the safety margins can be much smaller than with state-of-the-art methods that only react to current conditions. By reducing voltage and frequency by only 15% one can save about 50% of the power.

A general overview of the block scheme of the power management method as proposed for the preferred embodiment of the invention is shown in fig. 1. The utilization of the CPU is measured by a number of sensors 110...112 on the microprocessor, which feed their output to the usage profiling block 114. This input is transferred to a usage prediction logic 118, the central component of the proposed power management system. Said usage prediction logic 118 operates with further input from a branch prediction logic 122 as well as from a usage history table 120. Information about the branching behaviour of the system is stored in a branch history table 124, connected to the branch prediction logic 122. The usage prediction logic 118 further feeds input to the frequency voltage control block 116.

According to a preferred embodiment of the invention, a module is provided that measures at least CPU utilization/condition between two branch instructions (usage profiling block 114). Further, a usage history table 120 is provided that is updated with the CPU utilization/condition for the code executed between the last and current branch by the usage prediction logic 118. This is similar to a branch history table update after a speculative branch. Further, a frequency voltage control block 116 is provided that tunes the frequency and voltage to the various parts of the processor based on said usage history table 120 on every branch. A similar block is normally already present on modern processors.

The branch prediction scheme 102 used is explained in fig. 2. The instruction fetch block 126 is the controlling component for this procedure. Instructions from the cache 172 trigger a branch target request 166 through the instruction fetch block 126 to the branch prediction block 122. The required information is received via a lookup 162 from the branch history table 124 and fed back as the predicted target 168 to the instruction fetch block 126 and further to the cache as fetch address command 174. The instruction fetch block 126 issues instructions to execution units 178 and, in turn, receives branch retired/flushed information 176, which it transfers as prediction outcome 170 to the branch prediction block 122, which updates (signal 164) the branch history table 124.

The operation scheme of the preferred embodiment of the present invention is shown in fig. 3. During normal execution the CPU utilization and conditions are measured. This includes but is not limited to cycles per instruction, pipeline flushes, and cache misses. This creates a profile of the code executed since the last branch. Statistically such a section is about 50 instructions long.

Whenever a branch is predicted by the branch prediction block 122, the usage history table entry for the predicted address is read out via a request 158 from the branch prediction block 122 to the usage prediction logic 118 and further on via a lookup 156 to the usage history table 120. Based on the predicted utilization information 150 provided by said usage history table 120 the frequency and voltage are tuned for the next code section.

The predicted branch is saved so that the usage history table entry can be updated at the next branch. In parallel, the usage history table entry corresponding to the previous branch is updated with the current utilization information 152 and the utilization profiling is reset.

Should a branch be mispredicted, or should no valid utilization information be available, the frequency and voltage control is instructed to fall back to state-of-the-art frequency and voltage tracking based on the currently measured condition. This is done by feeding the bad branch predicted information 160 via the usage prediction logic 118 to the frequency voltage control block 116. This implies larger frequency and voltage safety margins to prevent timing or voltage problems. This case is very unlikely, as modern branch prediction normally gives a good prediction in more than 90% of the cases.

Tuning of the frequency and/or voltage according to the workload prediction described by the present invention can be implemented by a special method of immediate decrease of the frequency at any cycle of the clock rate of the microprocessor. Additionally or alternatively, tuning the power consumption and/or the performance of a processor or system by a control unit based on the input from said usage prediction logic can be implemented.
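The lookup, update and fallback behaviour described for fig. 3 can be sketched in software as follows. This is a simplified model under stated assumptions, not the hardware design: the class UsagePredictor, the function control_input, the single utilization number per code section and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, Optional


class UsagePredictor:
    """Toy model of usage prediction logic 118 with usage history table 120."""

    def __init__(self) -> None:
        self.table: Dict[int, float] = {}   # branch target address -> utilization
        self.pending: Optional[int] = None  # target of the last predicted branch

    def on_branch_predicted(self, target: int, measured: float) -> Optional[float]:
        """Handle a predicted branch.

        `measured` is the utilization profiled since the previous branch.
        Returns the stored utilization for `target`, or None if unknown.
        """
        if self.pending is not None:
            # Update the entry for the code section that has just finished.
            self.table[self.pending] = measured
        self.pending = target  # remember which entry to update next time
        return self.table.get(target)

    def on_mispredict(self) -> None:
        """Discard the pending entry: the profiled code was not the predicted one."""
        self.pending = None


def control_input(predicted: Optional[float]) -> str:
    """Map a prediction to an input for frequency voltage control block 116."""
    if predicted is None:
        return "reactive"  # fall back to tracking of measured conditions
    return "low" if predicted < 0.5 else "high"  # assumed threshold


predictor = UsagePredictor()
first = predictor.on_branch_predicted(0x400, measured=0.0)   # table is empty
second = predictor.on_branch_predicted(0x800, measured=0.2)  # stores 0.2 for 0x400
third = predictor.on_branch_predicted(0x400, measured=0.9)   # 0x400 seen before
decision = control_input(third)
```

On the third call the section at address 0x400 has been profiled once, so its stored utilization is returned before the section executes again. The first two calls return None and therefore select the reactive fallback.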

The method for frequency adjustment comprises the steps of providing an output signal of a frequency generator with a first frequency as input signal for a signal delay element; providing an edge of said input signal of said signal delay element; delaying said input signal by adding a delay to each cycle of said input signal until the delayed output signal of the signal delay element is aligned to an edge of said input signal.

The general layout for a frequency generation and transformation circuit of electronic signals according to the proposed method is shown in fig. 4. An electronic signal generated by a reference clock 10 is fed to a frequency generator 12, which is preferably represented by a phase locked loop circuit (PLL). The output of said frequency generator serves as input signal 20 for a signal delay element 14. This signal delay element operates according to the described method, referred to below as successive delay add (SDA), and modifies said frequency of said input signal 20 downstream of the PLL circuit. The output signal 22 with the second, transformed frequency supplies the latches and arrays of further units of an electronic circuit.

The basic functionality of the frequency transformation method proposed according to the SDA procedure is described in fig. 5. The clock edges of the input signal 20 of said signal delay element 14 are delayed in such a way that, e.g. as shown in fig. 5, the PLL has generated 10 clock pulses of a fixed cycle period 24 while said signal delay element 14 has generated only 9 pulses of a longer cycle period 26. The clock mesh that follows said signal delay element 14 on the electronic circuit sees this as a frequency reduction.

This frequency change can be triggered at any cycle of the clock mesh, and it is also possible to stop at the current frequency or to change the frequency to slower or faster values at once. By stepping through such a frequency scheme it is possible to change the frequency as fast as possible while avoiding a di/dt slew rate problem in case of a large, fast frequency reduction. Also, because of the fast response time, it would be possible to control voltage drops with such or a similar apparatus.
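The 10-to-9 example can be checked with a short timing sketch. It assumes an idealized delay element that adds exactly the same delay to every output cycle, with no quantization of the delay line. The function name sda_output_edges and the unit input period are illustrative choices.

```python
from fractions import Fraction


def sda_output_edges(input_period, n_input_cycles):
    """Return rising-edge times of the delayed output over one alignment window.

    n_input_cycles input cycles are stretched into n_input_cycles - 1 output
    cycles by adding the same delay step to every output cycle.
    """
    n_output_cycles = n_input_cycles - 1
    step = Fraction(input_period) / n_output_cycles  # delay added per output cycle
    return [i * (input_period + step) for i in range(n_output_cycles + 1)]


edges = sda_output_edges(Fraction(1), 10)  # 10 input pulses of period 1
output_period = edges[1] - edges[0]        # longer cycle period of the output
realigned_at = edges[-1]                   # output edge coincides with an input edge again
```

With these assumptions each output cycle is 10/9 of the input period, and after 9 output cycles the accumulated delay equals exactly one input period, so the output edge lines up with the tenth input edge.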

Preferably the rising edge of said input signal 20 is used for edge alignment of the output 22 and input signals 20. Alternatively it is also possible to use the falling edge for edge alignment of the output 22 and input signals 20.

In order to get stable signals it is preferable that equal delays are added to each cycle of the input signal 20. Preferably, in each cycle a delay is added to said rising and falling edge.

As is demonstrated in fig. 5 the last cycle 9 of the output signal 22 is characterized by a longer cycle period 30 in order to catch up with the next input signal cycle 20 fed by the frequency generator 12.

Fig. 6 shows the main components of said signal delay element 14. It comprises a programmable delay line 62 with a signal input 20 for receiving an input signal and a signal output 22 for outputting an output signal. A phase compare and reset logic 68 is connected to the signal input 20 and receives an output signal from said programmable delay line 62 as well as from a delay look-ahead block 66. Said delay look-ahead block 66 is also connected to the signal output of said programmable delay line 62. An adder block 74, which gets an input 78 from a delay step size definition block 76, feeds input to a counter block 72. Said counter block 72 also receives input from said phase compare and reset logic 68. Said counter block 72 is connected to an input 82 of a decoder block 70, which gives the main input to the programmable delay line 62.

The adder block 74 is also connected directly to the input of the decoder block 70, bypassing the counter block 72 and thus providing feedback to the adder block 74.

Preferably the described method is implemented in a signal delay element as described by the block diagram of fig. 6.

With the implementation of fig. 6, said signal delay element 14 receives an output signal of a frequency generator 12 as input signal 20. In this way an edge of said input signal 20 can be identified for delaying said input signal 20 by adding a delay to each cycle of said input signal until the delayed output signal 22 of the signal delay element 14 is aligned to an edge of said input signal 20.

The method which may be implemented in said signal delay element 14 consists, in more detail, of calculating the number of delay steps according to the actual requirements for transformation of the first frequency to the second frequency in a delay step size definition block 76. Then the amount of delay per step is calculated according to the actual requirements for transformation of the first frequency to the second frequency in the delay step size definition block 76. The step size of the delay is added with each half cycle of the input signal 78 in an adder block 74, while the single delays added to the edges are counted in a counter block 72. In this way the i-th delay is added to the rising signal phase of the i-th cycle in the adder block 74, and the (i+1)-th delay is added to the falling signal phase of the (i+1)-th cycle in the same adder block 74, where i is any index of a cycle of the input signal.
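The interplay of step size, adder, counter and reset can be sketched as a loop over half cycles. This is a loose behavioural model, not the circuit of fig. 6. The function name delay_settings, the reset condition (accumulated delay reaching one full input period) and the choice of 18 half-cycle steps are assumptions made for illustration.

```python
from fractions import Fraction


def delay_settings(input_period, steps):
    """Return the delay applied to each successive output edge.

    Edges alternate rising, falling, rising, ... so one entry corresponds
    to one half cycle of the output signal.
    """
    step_size = Fraction(input_period) / steps  # role of step size definition 76
    count = 0                                   # role of counter block 72
    settings = []
    for _ in range(steps):
        count += 1                              # role of adder block 74: one more step
        delay = count * step_size
        if delay >= input_period:
            # Role of phase compare and reset logic 68: a full input period
            # has been accumulated, so the output is aligned with an input
            # edge again and the delay starts over from zero.
            count = 0
            delay = Fraction(0)
        settings.append(delay)
    return settings


# 18 half-cycle steps correspond to 9 full output cycles.
settings = delay_settings(Fraction(1), 18)
```

In this model the delay grows by one step on every edge and drops back to zero on the eighteenth edge, at which point the output has slipped by exactly one input period relative to the input.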

The invention can further take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk - read only memory (CD-ROM), compact disk - read/write (CD-R/W) and DVD.

A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

Fig. 7 depicts schematically a preferred data processing system, consisting of a computer 200 comprising a central processing unit 202 which incorporates a power management control unit 216, a computer usable medium, comprising computer readable program 204 connected by a bus system 206 to the central processing unit 202, an IO system connected to input and output devices 208, 210. The computer is further connected to a network communication unit 214. The said power management control unit 216 comprises software code and/or hardware portions for performing a power management method according to at least one of the preferred embodiments of the invention when said unit is active on said computer 200.

Claims

1. A power management method of a microprocessor using a workload prediction method for adaptation of the power consumption of said microprocessor to workload during program execution, wherein said workload prediction is derived from processor utilization,

the method being characterized by

measuring at least processor utilization between two branch instructions.

2. The method according to claim 1, characterized by using a usage profiling block (114) for measuring said processor utilization.

3. The method according to claim 1 or 2, characterized by updating a usage history table (120) with said processor utilization for the code executed between the last and current branch.

4. The method according to claim 3, characterized by using a usage prediction logic (118) for updating said usage history table (120).

5. The method according to one of the claims 1 to 4, characterized by synchronizing said usage prediction logic (118) with a branch prediction logic connected to a branch history table.

6. The method according to one of the claims 1 to 5, characterized by synchronizing said usage prediction logic (118) with the input from said usage profiling block (114).

7. The method according to one of the claims 1 to 6, characterized by tuning the frequency and/or the voltage to the various parts of said processor or system by a frequency voltage control unit (116) based on the input from said usage prediction logic (118).

8. The method according to one of the claims 1 to 7, characterized by tuning the power consumption and/or the performance of a processor or system by a control unit (116) based on the input from said usage prediction logic (118).

9. A power management unit (100) of a microprocessor comprising means for the use of a workload prediction method for adaptation of the power consumption of said microprocessor to workload during program execution, further comprising a usage prediction logic (118) connected to a usage profiling block (114) and a usage history table (120) for adjustment of power consumption of said microprocessor to changing workload by delivering input to a frequency voltage control block (116) of said microprocessor,

the unit being characterized by

said usage profiling block (114) getting input from a multitude of sensors (110, 112) about processor utilization; said usage history table (120) connected to said usage prediction logic (118); said usage prediction logic (118) getting input from said usage profiling block (114) and said usage history table as well as from a branch prediction logic (122), said branch history table (124) connected to said branch prediction logic (122); said frequency voltage control block (116) controlling frequency and power to various parts of said processor or system.

10. The unit according to claim 9 comprising means for performing the steps of measuring at least processor utilization between two branch instructions with said usage profiling block (114); updating said usage history table (120) with said processor utilization for the code executed between the last and current branch via a usage prediction logic (118); synchronizing said usage prediction logic (118) with said branch prediction logic connected to said branch history table as well as with the input from said usage profiling block (114); tuning the frequency and/or the voltage and/or power consumption to the various parts of said processor or system by said frequency voltage control block (116) based on the input from said usage prediction logic (118).

11. A data processing program for execution in a data processing system comprising software code portions for performing a method according to any one of the preceding claims 1 to 8 when said program is run on a computer (200).

12. A computer program product stored on a computer usable medium, comprising computer readable program means for causing a computer to perform a method according to any one of the preceding claims 1 to 8 when said program is run on a computer (200).

13. A computer program product stored on a computer usable medium, comprising computer readable program means for causing a computer to perform the following steps when said program is run on a computer (200): measuring at least processor utilization between two branch instructions with said usage profiling block (114); updating said usage history table (120) with said processor utilization for the code executed between the last and current branch via said usage prediction logic (118); synchronizing said usage prediction logic (118) with said branch prediction logic connected to said branch history table as well as with the input from said usage profiling block (114); tuning the frequency and/or the voltage and/or power consumption to the various parts of said processor or system by said frequency/voltage/power control block (116) based on the input from said usage prediction logic (118).