US20140173312A1 - Dynamic re-configuration for low power in a data processor - Google Patents
- Publication number: US20140173312A1 (application number US 13/714,011)
- Authority
- US
- United States
- Prior art keywords
- mode
- data processor
- processor
- decode
- core
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3234—Power saving characterised by the action undertaken
- G06F1/3243—Power saving in microcontroller unit
- G06F1/3287—Power saving characterised by the action undertaken by switching off individual functional units in the computer system
- G06F1/3293—Power saving characterised by the action undertaken by switching to a less power-consuming processor, e.g. sub-CPU
- G06F9/5094—Allocation of resources, e.g. of the central processing unit [CPU], where the allocation takes into account power or heat criteria
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- This disclosure relates generally to data processors, and more specifically to configurable data processors.
- Data processors are typically designed to meet specific product needs. For example, desktop microprocessors operate at very high speeds and have long execution pipelines and multiple parallel resources to achieve very high performance. Microprocessors for laptop computers offer reasonably high performance but have low power features to lengthen battery life. Embedded microprocessors operate at relatively slow speeds and have simple architectures in order to reduce product cost.
- Certain products such as multimedia mobile devices can at various times require either high performance or low power consumption. For example, when running certain games, a multimedia mobile device requires the microprocessor to provide high performance. However, when running other tasks such as email, the multimedia mobile device requires much lower performance from the microprocessor. These varying processing environments make it difficult for the microprocessor designer to achieve the right balance between performance and preservation of battery life.
- FIG. 1 illustrates in block diagram form a data processor known in the prior art.
- FIG. 2 illustrates in block diagram form another data processor known in the prior art.
- FIG. 3 illustrates in block diagram form a data processor according to some embodiments.
- FIG. 4 illustrates in block diagram form a central processing unit (CPU) core that may be used in the data processor of FIG. 3 according to some embodiments.
- FIG. 5 illustrates in block diagram form a pipeline of the CPU core of FIG. 4 in a first mode according to some embodiments.
- FIG. 6 illustrates in block diagram form a pipeline of the CPU core of FIG. 4 in a second mode according to some embodiments.
- FIG. 7 illustrates a flow diagram of a method for configuring a processor core according to some embodiments.
- FIG. 8 illustrates a flow diagram of a method for configuring a multi-core data processor according to some embodiments.
- FIG. 1 illustrates in block diagram form a data processor 100 known in the prior art.
- Data processor 100 generally includes a big core 110 and a little core 120 .
- In operation, big core 110 and little core 120 each have the capability to execute the same instruction set.
- However, the micro-architecture of big core 110 is constructed to support high intensity tasks, and the micro-architecture of little core 120 is constructed to support low power and low intensity tasks.
- For one particular example, big core 110 is a core known as the Cortex A15 core available from Advanced RISC Machines, Ltd. of Cambridge, G.B., and has the capability to decode, dispatch, issue, and execute multiple out-of-order instructions. Big core 110 operates multiple pipelines having 15 to 24 stages.
- On the other hand, little core 120 is a core known as the Cortex A7 core, also available from ARM Ltd., which decodes, dispatches, issues, and executes in-order instructions and operates a smaller number of pipelines having 8 to 10 stages.
- Depending on the intensity and target power consumption of a task, data processor 100 migrates instructions to big core 110 or to little core 120.
- However, data processor 100 consumes additional silicon area for little core 120, and in general, data processor 100 requires overhead to migrate instructions between the two cores when the processing task changes.
- FIG. 2 illustrates in block diagram form another data processor 200 known in the prior art.
- Data processor 200 generally includes a companion core 210 labeled “COMPANION CORE1”, a main core 220 labeled “CORE1”, a main core 230 labeled “CORE2”, a main core 240 labeled “CORE3”, a main core 250 labeled “CORE4”, and a clock generator 260 .
- Companion core 210 has a clock input.
- Main cores 220 , 230 , 240 , and 250 each have a clock input.
- Clock generator 260 has a first output connected to the clock input of companion core 210 and a second output connected to each clock input of main cores 220 , 230 , 240 , and 250 .
- In operation, main cores 220, 230, 240, and 250 and companion core 210 each have the capability to execute the same instruction set.
- Although main cores 220, 230, 240, and 250 and companion core 210 execute instructions in a consistent way, data processor 200 enables and disables cores based on the work load. For example, data processor 200 could enable only companion core 210 to execute low intensity tasks such as audio, video, and email; only two main cores to execute higher intensity tasks such as flash enabled browsing and multitasking; and all four main cores to execute high intensity tasks such as console class gaming and media processing.
- Main cores 220 , 230 , 240 , and 250 are each constructed to support high frequency, performance intensive tasks, whereas companion core 210 is constructed to support low frequency, low power, low intensity tasks.
- Clock generator 260 provides a high frequency clock to main cores 220 , 230 , 240 , and 250 , but provides a low frequency clock to companion core 210 .
- Depending on the intensity and target power consumption of a task, CPU power management hardware and the operating system migrate instructions to selected ones of main cores 220, 230, 240, and 250 or to companion core 210.
- Like data processor 100, data processor 200 consumes additional silicon area to operate companion core 210, and in general, data processor 200 requires overhead to migrate instructions between any of cores 220-250 and companion core 210 when the processing task changes.
- FIG. 3 illustrates in block diagram form a data processor 300 according to some embodiments.
- Data processor 300 generally includes a CPU cluster 310 .
- CPU cluster 310 includes a CPU core 312 labeled “CPU0”, a CPU core 314 labeled “CPU1”, a CPU core 316 labeled “CPU2”, a CPU core 318 labeled “CPU3”, and a cache 320 which is a shared L2 cache.
- CPU cores 312 - 318 each include a fetch unit for fetching a stream of instructions, an execution unit connected to the fetch unit that has a multiple number of redundant resources, and a configuration circuit that operates in a first mode and a second mode. In the first mode, the configuration circuit enables the multiple number of redundant resources, and in the second mode, the configuration circuit selectively disables the multiple number of redundant resources.
- Each of CPU cores 312 - 318 has the capability to execute the same instruction set. Also, each CPU core has a substantially identical architecture and executes instructions in a consistent way. Unlike data processors 100 and 200 , however, data processor 300 can configure the micro-architecture of each of CPU cores 312 - 318 to support either high intensity tasks or low intensity tasks, where the associated CPU is configured for desired power management and in some applications, longest potential battery life.
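The mode-dependent enabling of redundant resources can be sketched as a small software model. This is only an illustration under assumptions: the class, field names, and resource counts mirror figures described later (two integer pipelines, two-wide decode, twenty-four-entry queues, halved caches) and are not an API defined by the patent.

```python
# Hypothetical model of a core whose redundant resources are gated by a mode.
class Core:
    BIG, LITTLE = 0, 1  # first mode enables redundant resources; second disables them

    def __init__(self):
        self.mode = Core.BIG

    def active_resources(self):
        # Full complement of resources in the first ("big core") mode.
        resources = {
            "integer_pipes": 2,
            "decode_width": 2,
            "queue_slots": 24,
            "l1_cache_fraction": 1.0,
        }
        if self.mode == Core.LITTLE:
            # Second mode: selectively disable the redundant half of each resource.
            resources = {
                "integer_pipes": 1,
                "decode_width": 1,
                "queue_slots": 12,
                "l1_cache_fraction": 0.5,
            }
        return resources

core = Core()
assert core.active_resources()["integer_pipes"] == 2
core.mode = Core.LITTLE
assert core.active_resources()["queue_slots"] == 12
```

Either mode executes the full instruction set; only throughput-oriented resources differ.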
- For example, data processor 300 could configure CPU cores 312 and 314 to decode, dispatch, issue, and execute multiple out-of-order instructions, and to operate multiple pipelines each having a multiple number of stages.
- Conversely, data processor 300 could re-configure CPU cores 316 and 318 to decode, dispatch, issue, and execute instructions using a smaller number of pipelines.
- To reduce the power of data processor 300 for low intensity tasks, data processor 300 functionally throttles, for example, CPU core 312 and gates off CPU cores 314, 316, and 318. Subsequently, data processor 300 executes instructions using only CPU core 312. By eliminating a dedicated little core, data processor 300 preserves silicon area and saves power.
- On the other hand, to increase performance, data processor 300 reconfigures CPU cores 312-318 to perform high intensity tasks: by modifying at least one pipeline for high intensity operation, by increasing the width of a decode pipeline, by enabling an execution pipeline, and/or by enabling or disabling portions of one or more caches, while managing the high frequency, high intensity operation of each core.
- Like data processor 200, data processor 300 processes instructions using a quad core cluster.
- However, depending on the intensity of the task, the desired performance level, and the desired power consumption target, CPU cores 312, 314, 316, and/or 318 can be dynamically and selectively reconfigured.
- Moreover, CPU power management hardware and the operating system can migrate instructions to any CPU core.
- In other embodiments, a data processor can have other than four cores that can be dynamically and selectively reconfigured.
- FIG. 4 illustrates in block diagram form a central processing unit (CPU) core 400 that may be used in data processor 300 of FIG. 3 according to some embodiments.
- CPU core 400 generally includes a fetch unit 410 , a level one instruction cache (“L1 ICACHE”) 415 , an execution unit 420 , a decode unit 430 , and a configuration circuit 450 .
- Fetch unit 410 has an input/output port and an output for providing instructions fetched from cache 415 or main memory.
- Cache 415 has an input, and an input/output port connected to the input/output port of fetch unit 410 .
- Execution unit 420 includes multiple execution pipelines 440 including exemplary execution pipelines 442 and 444 , and a level one data cache (“L1 DCACHE”) 460 .
- Decode unit 430 has a first input connected to the output of fetch unit 410 , a second input, and an output.
- Execution pipeline 442 has a first input connected to the output of decode unit 430 , a second input, and a data input/output port.
- Execution pipeline 444 has a first input connected to the output of decode unit 430 , a second input, and a data input/output port.
- Cache 460 has an input and data input/output ports connected to the data input/output ports of one or more execution pipelines such as execution pipelines 442 and 444 , depending on their respective function.
- Configuration circuit 450 includes a register 452 and a functional throttling circuit 456 .
- Register 452 is a model specific register (MSR) that includes a field 454 defining a mode of CPU core 400 , and has an output for providing the contents of field 454 .
- Functional throttling circuit 456 has an input connected to the output of register 452 , an output connected to the input of cache 415 , the second input of decode unit 430 , the second input of execution pipelines 442 and 444 , and the input of cache 460 .
- In operation, fetch unit 410 fetches a stream of instructions from cache 415 (or from main memory through cache 415 if the fetch misses in cache 415), and provides the instructions to decode unit 430.
- Decode unit 430 decodes the instructions and dispatches them to selected execution units for execution.
- Execution unit 420 includes redundant resources that are not needed to execute the instruction set of CPU core 400 .
- For example, execution unit 420 may have two identical pipelines that can be used to execute the same type of instruction.
- Also, each execution pipeline may queue a large number of operations to handle high workloads without stalling decode unit 430, but can operate properly with a smaller queue.
- Moreover, decode unit 430 can decode multiple operations in parallel to increase throughput. Each of these features is useful for meeting the performance requirements of high intensity tasks, but consumes unneeded power for low intensity tasks.
- In addition, each of caches 415 and 460 has a configurable size and can operate at full size for high intensity tasks, or at reduced size for low intensity tasks.
- Configuration circuit 450 has at least a first mode and a second mode. In the first mode, configuration circuit 450 causes CPU core 400 to operate as a “big core” by enabling the redundant resources. In the second mode, configuration circuit 450 causes CPU core 400 to operate as a “little core” by disabling the redundant resources. Thus a single, generic core can easily be reconfigured for different processing tasks.
- CPU core 400 provides a protected mechanism to dynamically reconfigure CPU core 312 , CPU core 314 , CPU core 316 , and/or CPU core 318 by writing field 454 of register 452 .
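The write to field 454 of register 452 might look like the following software sketch. The bit position and one-bit width of the field are assumptions for illustration; the patent does not specify the layout of the MSR.

```python
# Illustrative MSR model: a mode field (analogous to field 454) inside a
# model-specific register. Bit position and width are hypothetical.
MODE_SHIFT = 0
MODE_MASK = 0x1  # assumed one-bit field: 0 = big-core mode, 1 = little-core mode

def write_mode(msr_value, mode):
    """Return the MSR value with only the mode field updated."""
    return (msr_value & ~(MODE_MASK << MODE_SHIFT)) | ((mode & MODE_MASK) << MODE_SHIFT)

def read_mode(msr_value):
    """Extract the mode field from the MSR value."""
    return (msr_value >> MODE_SHIFT) & MODE_MASK

msr = 0                      # power-on value, big-core mode
msr = write_mode(msr, 1)     # request little-core mode
assert read_mode(msr) == 1
```

In hardware, the functional throttling circuit would observe this field and drive its disable signals accordingly.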
- FIG. 5 illustrates in block diagram form a pipeline 500 of CPU core 400 of FIG. 4 in a first mode according to some embodiments.
- Pipeline 500 generally includes a fetch stage 510 , a decode/dispatch/rename stage 520 , and an execute stage 530 .
- Fetch stage 510 is a four-deep stage that provides instructions in program order to decode/dispatch/rename stage 520 .
- Decode/dispatch/rename stage 520 includes a seven-deep set of sub-stages 522 and a three-deep set of sub-stages 524 associated with floating point operations that can be recognized near the end of decode/dispatch/rename stage 520 .
- Decode sub-stages 522 provide up to two decoded instructions in parallel to execute stage 530, and decode sub-stages 524 provide up to two decoded floating point instructions in parallel to execute stage 530.
- Execute stage 530 includes a set of execution pipelines 540 each of which has its own corresponding pipeline segment organized into a queue sub-stage 532 , an issue sub-stage 534 , an execute sub-stage 536 , and a writeback (WB) sub-stage 538 .
- Execution pipelines 540 include an integer pipeline 542, an integer pipeline 544, a multiply/divide (“Mult/Div”) pipeline 546, a load/store pipeline 548, a load/store pipeline 550, a floating point (“FP”) pipeline 552, and an FP pipeline 554.
- FIG. 5 shows each queue sub-stage 532 as having three entries, but these are representative of an arbitrary number of multiple entries.
- For example, queue sub-stage 532 in integer pipeline 542 has twenty-four queue stages.
- FIG. 5 shows the remaining sub-stages with their actual depth.
- Some execute pipelines, such as integer pipeline 544 and Mult/Div pipeline 546, share a common queue sub-stage, as illustrated in FIG. 5.
- Pipeline 500 represents the pipeline of CPU core 400 in the first mode, in which configuration circuit 450 enables the multiple redundant resources to support high frequency, high intensity tasks.
- FIG. 6 illustrates in block diagram form a pipeline 600 of CPU core 400 of FIG. 4 in a second mode according to some embodiments.
- Pipeline 600 generally includes a fetch stage 610 , a decode/dispatch/rename stage 620 , and an execute stage 630 corresponding to fetch stage 510 , decode/dispatch/rename stage 520 , and execute stage 530 , respectively of FIG. 5 .
- However, pipeline 600 identifies redundant resources that have now been disabled.
- Pipeline 600 illustrates four types of redundant resources. First, since integer pipelines 642 and 644 both execute the same types of instructions, one of them is redundant, and CPU core 400 disables integer pipeline 642 in the second mode.
- Second, each slot of queue sub-stage 632 beyond the first is redundant, and CPU core 400 reduces the size of each queue sub-stage 632 by half.
- For example, the size of queue sub-stage 632 can be reduced from twenty-four slots to twelve slots.
- Third, the second half of decode/dispatch/rename stage 620 is redundant since decode/dispatch/rename stage 620 decodes two instructions in parallel. CPU core 400 disables the redundant half of decode/dispatch/rename stage 620 such that it can only issue a single instruction per clock cycle.
- Fourth, the effective sizes of caches 415 and 460 can be reduced, such as by half.
- FIG. 6 shows the disabling of these redundant resources in the second sub-stage of fetch stage 610, and in the second sub-stage of execute sub-stage 636 of load/store pipeline 648, in response to receiving a signal from functional throttling circuit 456 labeled “DISABLE”.
- Thus pipeline 600 is able to fully execute the instruction set of CPU core 400 while consuming less power for low intensity tasks.
- Moreover, each pipeline can transition seamlessly between modes. For example, when disabling a redundant half of decode/dispatch/rename stage 620, the hardware may simply disable sub-stages in the unneeded half as the last instruction flows down decode/dispatch/rename stage 620.
- CPU core 400 can allow the size of each queue sub-stage to be reduced by stalling decode/dispatch/rename stage 620 until only half of the slots are used, and then disabling the unused half.
- CPU core 400 can also disable a redundant pipeline by stopping the input of new decoded instructions into the pipeline and waiting until the pipeline naturally drains. Moreover CPU core 400 can reduce the sizes of instruction and data caches. In these ways, CPU core 400 can transition from the first (big core) mode to the second (little core) mode seamlessly and without the need for slow instruction migration.
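The stall-and-shrink transition for a queue sub-stage can be sketched in software. This is a behavioral illustration only; the function name and cycle accounting are hypothetical, and the hardware would drain entries through the issue stage rather than discard them.

```python
from collections import deque

# Behavioral sketch: shrink a queue sub-stage by stalling decode (no new
# entries enter) while the issue stage drains entries, then disable the
# unused half once occupancy reaches half the full size.
def shrink_queue(queue, full_size):
    target = full_size // 2
    stalled_cycles = 0
    while len(queue) > target:
        queue.popleft()        # issue stage consumes one queued operation
        stalled_cycles += 1    # decode stalls for this cycle
    return target, stalled_cycles  # new queue size, cycles spent draining

q = deque(range(24))           # a full twenty-four-slot queue
new_size, stalls = shrink_queue(q, 24)
assert new_size == 12 and len(q) == 12
```

The same drain-then-disable pattern applies to a whole redundant pipeline: stop feeding it decoded instructions and wait until it empties naturally.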
- FIG. 7 illustrates a flow diagram of a method 700 for configuring a processor core according to some embodiments.
- Action box 710 includes fetching and decoding a write MSR instruction (“WMSR”) in a processor core.
- The flow then proceeds to decision box 720, which determines whether CPU core 400 is in a privileged state. If CPU core 400 is not in a privileged state, flow proceeds to action box 730, in which the method ends with some appropriate action, such as taking a privilege mode violation exception. If the processor core is in a privileged state, then flow proceeds to action box 740, which updates a power control field in the MSR.
- Method 700 then proceeds to action box 750, which reconfigures the execution pipeline of CPU core 400 in response to the change in the power control field. Finally, flow proceeds to action box 760, in which CPU core 400 executes instructions using the reconfigured core.
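Method 700 can be summarized as the following sketch. The exception name, dictionary fields, and mode labels are hypothetical stand-ins for the hardware behavior the flow diagram describes.

```python
class PrivilegeViolation(Exception):
    """Hypothetical stand-in for a privilege mode violation exception (box 730)."""

def wmsr(core, new_power_control):
    """Sketch of method 700: a privileged write to the MSR power control field."""
    # Decision box 720: is the core in a privileged state?
    if not core["privileged"]:
        # Action box 730: end the method by taking an exception.
        raise PrivilegeViolation("WMSR attempted outside privileged state")
    # Action box 740: update the power control field in the MSR.
    core["power_control"] = new_power_control
    # Action box 750: reconfigure the execution pipeline to match the field.
    core["mode"] = "little" if new_power_control else "big"
    # Action box 760: execution then continues on the reconfigured core.
    return core

core = {"privileged": True, "power_control": 0, "mode": "big"}
assert wmsr(core, 1)["mode"] == "little"
```

An unprivileged caller never reaches boxes 740-760, which is what makes the reconfiguration mechanism protected.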
- FIG. 8 illustrates a flow diagram of a method 800 for configuring a multi-core data processor according to some embodiments.
- Action box 810 includes functionally throttling a processor core of a CPU cluster.
- Action box 820 includes gating off remaining processor cores of the CPU cluster.
- Action box 830 includes executing instructions using the processor core that was enabled.
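The three action boxes of method 800 can be sketched as follows. The function and field names are illustrative assumptions; the patent describes hardware behavior, not software.

```python
# Sketch of method 800: keep one functionally throttled core running and
# clock-gate the remaining cores of the CPU cluster.
def configure_cluster_low_power(cluster, active_index=0):
    for i, core in enumerate(cluster):
        if i == active_index:
            core["throttled"] = True   # action box 810: functionally throttle this core
            core["gated"] = False      # it remains enabled to execute instructions (box 830)
        else:
            core["gated"] = True       # action box 820: gate off the remaining cores
    return cluster

cluster = [{"gated": False, "throttled": False} for _ in range(4)]
cluster = configure_cluster_low_power(cluster)
assert cluster[0]["throttled"] and not cluster[0]["gated"]
assert all(c["gated"] for c in cluster[1:])
```

The remaining active core behaves as a little core, so no separate dedicated little core is needed.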
- The circuits of FIGS. 3-6 may be implemented with various combinations of hardware and software, and the software component may be stored in a computer readable storage medium for execution by at least one processor. Moreover, the methods illustrated in FIGS. 7 and 8 may also be governed by instructions that are stored in a computer readable storage medium and that are executed by at least one processor. Each of the operations shown in FIGS. 7 and 8 may correspond to instructions stored in a non-transitory computer memory or computer readable storage medium.
- The non-transitory computer readable storage medium includes a magnetic or optical disk storage device, solid-state storage devices such as Flash memory, or other non-volatile memory device or devices.
- The computer readable instructions stored on the non-transitory computer readable storage medium may be in source code, assembly language code, object code, or other instruction format that is interpreted and/or executable by one or more processors.
- Moreover, the circuits of FIGS. 3-6 may be described or represented by a computer accessible data structure in the form of a database or other data structure which can be read by a program and used, directly or indirectly, to fabricate the integrated circuits of FIGS. 3-6.
- For example, this data structure may be a behavioral-level description or register-transfer level (RTL) description of the hardware functionality in a high level design language (HDL) such as Verilog or VHDL.
- The description may be read by a synthesis tool, which may synthesize the description to produce a netlist comprising a list of gates from a synthesis library.
- The netlist comprises a set of gates which also represent the functionality of the hardware comprising the integrated circuits of FIGS. 3-6.
- The netlist may then be placed and routed to produce a data set describing geometric shapes to be applied to masks.
- The masks may then be used in various semiconductor fabrication steps to produce the integrated circuits of FIGS. 3-6.
- The database on the computer accessible storage medium may be the netlist (with or without the synthesis library) or the data set, as desired, or Graphic Data System (GDS) II data.
- In the illustrated embodiment, CPU cluster 310 includes four CPU cores 312, 314, 316, and 318, and a cache 320, which is a shared L2 cache.
- In other embodiments, CPU cluster 310 could include a different number of cores and different cache memory hierarchies, including shared and dedicated cache memories.
- Moreover, CPU cores 312, 314, 316, and 318 could use a common circuit design and process technology, or different circuit designs and process technologies.
- Also, a software write to register 452 could include selectively executing the write based on whether CPU core 400 is in a privileged state.
- Further, configuration circuit 450 could reconfigure different redundant functions of a CPU core of CPU cluster 310, including an arithmetic logic unit (ALU), a schedule queue cluster, an FP unit, a multimedia extension unit (MMX), a cache memory, a cache controller, a translation lookaside buffer (TLB), a branch prediction unit, and the like.
Abstract
Description
- This disclosure relates generally to data processors, and more specifically to configurable data processors.
- Data processors are typically designed to meet specific product needs. For example, desktop microprocessors operate at very high speeds and have long execution pipelines and multiple parallel resources to achieve very high performance. Microprocessors for laptop computers offer reasonably high performance but have low power features to lengthen battery life. Embedded microprocessors operate at relatively slow speeds and have simple architectures in order to reduce product cost.
- Certain products such as multimedia mobile devices can at various times require either high performance or low power consumption. For example when running certain games, a multimedia mobile device requires the microprocessor to provide high performance. However when running other tasks such as EMAIL, the multimedia mobile device requires much lower performance from the microprocessor. These varying processing environments make it difficult for the microprocessor designer to achieve the right balance between performance and preservation of battery life.
-
FIG. 1 illustrates in block diagram form a data processor known in the prior art. -
FIG. 2 illustrates in block diagram form another data processor known in the prior art. -
FIG. 3 illustrates in block diagram form a data processor according to some embodiments. -
FIG. 4 illustrates in block diagram form a central processing unit (CPU) core that may be used in the data processor ofFIG. 3 according to some embodiments. -
FIG. 5 illustrates in block diagram form a pipeline of the CPU core ofFIG. 4 in a first mode according to some embodiments. -
FIG. 6 illustrates in block diagram form a pipeline of the CPU core ofFIG. 4 in a second mode according to some embodiments. -
FIG. 7 illustrates a flow diagram of a method for configuring a processor core according to some embodiments. -
FIG. 8 illustrates a flow diagram of a method for configuring a multi-core data processor according to some embodiments. - In the following description, the use of the same reference numerals in different drawings indicates similar or identical items. Unless otherwise noted, the word “coupled” and its associated verb forms include both direct connection and indirect electrical connection by means known in the art, and unless otherwise noted any description of direct connection implies alternate embodiments using suitable forms of indirect electrical connection as well.
-
FIG. 1 illustrates in block diagram form adata processor 100 known in the prior art.Data processor 100 generally includes abig core 110 and alittle core 120. - In operation,
big core 110 andlittle core 120 each have the capability to execute the same instruction set. However, the micro-architecture ofbig core 110 is constructed to support high intensity tasks and the micro-architecture oflittle core 120 is constructed to support low power and low intensity tasks. - For one particular example,
big core 110 is a core known as the Cortex A15 core available from Advanced RISC Machines, Ltd. of Cambridge, G.B, and has the capability to decode, dispatch, issue and execute multiple, out-of-order instructions. Bigcore 110 operates multiple pipelines having 15 to 24 stages. On the other hand,little core 120 is a core known as the Cortex A7 core also available from ARM Ltd. and decodes, dispatches, issues, and executes in-order instructions and operates a smaller number of pipelines, having 8 to 10 stages. - Depending on the intensity and target power consumption of a task,
data processor 100 migrates instructions tobig core 110 or tolittle core 120. Howeverdata processor 100 consumes additional silicon area forlittle core 120, and in general,data processor 100 requires overhead to migrate instructions between the two cores when the processing task changes. -
FIG. 2 illustrates in block diagram form anotherdata processor 200 known in the prior art.Data processor 200 generally includes acompanion core 210 labeled “COMPANION CORE1”, amain core 220 labeled “CORE1”, amain core 230 labeled “CORE2”, amain core 240 labeled “CORE3”, amain core 250 labeled “CORE4”, and aclock generator 260. -
Companion core 210 has a clock input.Main cores Clock generator 260 has a first output connected to the clock input ofcompanion core 210 and a second output connected to each clock input ofmain cores - In operation,
main cores companion core 210 each have the capability to execute the same instruction set. Althoughmain cores companion core 210 execute instructions in a consistent way,data processor 200 enables and disables cores based on the work load. For example,data processor 200 could enable onlycompanion core 210 to execute low intensity tasks such as audio, video, and email, only two main cores to execute higher intensity tasks such as flash enabled browsing and multitasking, and all four main cores to execute high intensity tasks such as console class gaming and media processing. -
Main cores companion core 210 is constructed to support low frequency, low power, low intensity tasks.Clock generator 260 provides a high frequency clock tomain cores companion core 210. - However, depending on the intensity and target power consumption of a task, CPU power management hardware and the operating system migrate instructions to selected ones of
main cores companion core 210. Likedata processor 100,data processor 200 consumes additional silicon area to operatecompanion core 210, and in general,data processor 200 requires overhead to migrate instructions between any of cores 220-250 andcompanion core 210 when the processing task changes. -
FIG. 3 illustrates in block diagram form adata processor 300 according to some embodiments.Data processor 300 generally includes aCPU cluster 310.CPU cluster 310 includes aCPU core 312 labeled “CPU0”, aCPU core 314 labeled “CPU1”, aCPU core 316 labeled “CPU2”, aCPU core 318 labeled “CPU3”, and acache 320 which is a shared L2 cache. - In operation, CPU cores 312-318 each include a fetch unit for fetching a stream of instructions, an execution unit connected to the fetch unit that has a multiple number of redundant resources, and a configuration circuit that operates in a first mode and a second mode. In the first mode, the configuration circuit enables the multiple number of redundant resources, and in the second mode, the configuration circuit selectively disables the multiple number of redundant resources.
- Each of CPU cores 312-318 has the capability to execute the same instruction set. Also, each CPU core has a substantially identical architecture and executes instructions in a consistent way. Unlike
data processors data processor 300 can configure the micro-architecture of each of CPU cores 312-318 to support either high intensity tasks or low intensity tasks, where the associated CPU is configured for desired power management and in some applications, longest potential battery life. - For example,
data processor 300 could configureCPU cores data processor 300 could re-configureCPU cores - For example, to reduce the power of
data processor 300 for low intensity tasks,data processor 300 functionally throttles, for example,CPU core 312 and gates offCPU core 314,CPU core 316, andCPU core 318. Subsequently,data processor 300 would execute instructions usingCPU core 312. By eliminating a dedicated little core,data processor 300 preserves silicon area and saves power. - On the other hand to increase performance,
data processor 300 reconfigures CPU cores 312-318 to perform high intensity tasks by modifying at least one pipeline for high intensity operation, by increasing a width of a decode pipeline, by enabling an execution pipeline, and/or by enabling or disabling portions of one or more caches, while managing the high frequency, high intensity operation of each core. - Like
data processor 200, data processor 300 processes instructions using a quad core cluster. However, depending on the intensity of the task, the desired performance level, and the desired power consumption target, CPU cores 312-318 can each be configured as either big cores or little cores. -
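The low-power cluster configuration just described (functionally throttle one core, gate off the rest) can be sketched as a small software model. This is only an illustration of the control decision, not a hardware interface; the dictionary-based cores and the `"throttled"`/`"gated"` state names are hypothetical:

```python
# Hypothetical model: configure a quad-core cluster for low intensity
# tasks by functionally throttling one core and gating off the others.

def configure_cluster_low_power(cluster, active=0):
    """Throttle the core at index `active`; gate off the remaining cores.
    Returns the core that continues to execute instructions."""
    for i, core in enumerate(cluster):
        core["state"] = "throttled" if i == active else "gated"
    return cluster[active]

cluster = [{"name": f"CPU{i}"} for i in range(4)]   # CPU0..CPU3
runner = configure_cluster_low_power(cluster)
print(runner["name"], runner["state"])  # CPU0 throttled
```

Because every core implements the full instruction set, any core can serve as the remaining "little" core, so no instruction migration to a dedicated companion core is needed.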
FIG. 4 illustrates in block diagram form a central processing unit (CPU) core 400 that may be used in data processor 300 of FIG. 3 according to some embodiments. CPU core 400 generally includes a fetch unit 410, a level one instruction cache (“L1 ICACHE”) 415, an execution unit 420, a decode unit 430, and a configuration circuit 450. -
Fetch unit 410 has an input/output port and an output for providing instructions fetched from cache 415 or main memory. Cache 415 has an input, and an input/output port connected to the input/output port of fetch unit 410. Execution unit 420 includes multiple execution pipelines 440, including exemplary execution pipelines 442 and 444. Decode unit 430 has a first input connected to the output of fetch unit 410, a second input, and an output. Execution pipeline 442 has a first input connected to the output of decode unit 430, a second input, and a data input/output port. Execution pipeline 444 has a first input connected to the output of decode unit 430, a second input, and a data input/output port. Cache 460 has an input and data input/output ports connected to the data input/output ports of one or more execution pipelines such as execution pipelines 442 and 444. -
Configuration circuit 450 includes a register 452 and a functional throttling circuit 456. Register 452 is a model specific register (MSR) that includes a field 454 defining a mode of CPU core 400, and has an output for providing the contents of field 454. Functional throttling circuit 456 has an input connected to the output of register 452, and an output connected to the input of cache 415, the second input of decode unit 430, the second inputs of execution pipelines 442 and 444, and the input of cache 460. - In operation, fetch
unit 410 fetches a stream of instructions from cache 415 (or from main memory through cache 415 if the fetch misses in cache 415), and provides the instructions to decode unit 430. Decode unit 430 decodes the instructions and dispatches them to selected execution units for execution. Execution unit 420 includes redundant resources that are not needed to execute the instruction set of CPU core 400. For example, execution unit 420 may have two identical pipelines that can be used to execute the same type of instruction. Also, each execution pipeline may queue a large number of operations to handle high workloads without stalling decode unit 430, but can operate properly with a smaller queue. Moreover, decode unit 430 can decode multiple operations in parallel to increase throughput. Each of these features is useful for meeting the performance requirements of high intensity tasks, but consumes unneeded power for low intensity tasks. In addition, each of caches 415 and 460 may be larger than needed for low intensity tasks. -
Configuration circuit 450 has at least a first mode and a second mode. In the first mode, configuration circuit 450 causes CPU core 400 to operate as a “big core” by enabling the redundant resources. In the second mode, configuration circuit 450 causes CPU core 400 to operate as a “little core” by disabling the redundant resources. Thus, a single, generic core can easily be reconfigured for different processing tasks. - Moreover, by using a model specific register that can only be accessed in privileged mode to establish the mode of operation,
CPU core 400 provides a protected mechanism to dynamically reconfigure CPU core 312, CPU core 314, CPU core 316, and/or CPU core 318 by writing field 454 of register 452. -
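As a software analogy for the first-mode/second-mode behavior of configuration circuit 450: the resource names below are hypothetical placeholders for the redundant hardware, and the boolean privilege check is a simplified stand-in for the MSR access protection described above.

```python
# Hypothetical sketch: an MSR-style mode field that enables all redundant
# resources in the first (big-core) mode and disables them in the second
# (little-core) mode. Writing the field requires privileged mode.

FIRST_MODE, SECOND_MODE = "big", "little"

REDUNDANT_RESOURCES = frozenset({
    "second_integer_pipeline",   # duplicate execution pipeline
    "upper_queue_slots",         # extra queue depth
    "second_decode_slot",        # second parallel decode slot
    "upper_cache_half",          # upper half of each cache
})

class ConfigurationCircuit:
    def __init__(self):
        self.mode = FIRST_MODE
        self.enabled = set(REDUNDANT_RESOURCES)

    def write_mode_field(self, mode, privileged):
        if not privileged:
            raise PermissionError("MSR write requires privileged mode")
        self.mode = mode
        # Enable the redundant resources in the first mode; disable them
        # in the second mode.
        self.enabled = set(REDUNDANT_RESOURCES) if mode == FIRST_MODE else set()

cfg = ConfigurationCircuit()
cfg.write_mode_field(SECOND_MODE, privileged=True)
print(cfg.mode, sorted(cfg.enabled))  # little []
```

Gating the write on privilege mirrors why an MSR is a natural home for the mode bit: unprivileged code can never flip the core's configuration underneath the operating system.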
FIG. 5 illustrates in block diagram form a pipeline 500 of CPU core 400 of FIG. 4 in a first mode according to some embodiments. Pipeline 500 generally includes a fetch stage 510, a decode/dispatch/rename stage 520, and an execute stage 530. - Fetch stage 510 is a four-deep stage that provides instructions in program order to decode/dispatch/
rename stage 520. Decode/dispatch/rename stage 520 includes a seven-deep set of sub-stages 522 and a three-deep set of sub-stages 524 associated with floating point operations that can be recognized near the end of decode/dispatch/rename stage 520. Decode sub-stages 522 provide up to two decoded instructions in parallel to execute stage 530, whereas decode sub-stages 524 provide up to two decoded floating point instructions in parallel to execute stage 530. - Execute
stage 530 includes a set of execution pipelines 540, each of which has its own corresponding pipeline segment organized into a queue sub-stage 532, an issue sub-stage 534, an execute sub-stage 536, and a writeback (WB) sub-stage 538. In pipeline 500, execution pipelines 540 include an integer pipeline 542, an integer pipeline 544, a multiply/divide (“Mult/Div”) pipeline 546, a load/store pipeline 548, a load/store pipeline 550, a floating point (“FP”) pipeline 552, and an FP pipeline 554. However, the number and composition of the pipelines will vary in other embodiments. - Note that
FIG. 5 shows each queue sub-stage 532 as having three entries, but these are representative of an arbitrary number of entries. For example, queue sub-stage 532 in integer pipeline 542 has twenty-four entries. FIG. 5 shows the remaining sub-stages with their actual depth. Moreover, some execute pipelines, such as integer pipeline 544 and Mult/Div pipeline 546, share a common queue sub-stage as illustrated in FIG. 5 . - In operation,
pipeline 500 represents the pipeline of CPU core 400 in the first mode, in which configuration circuit 450 enables the multiple redundant resources to support high frequency, high intensity tasks. -
FIG. 6 illustrates in block diagram form a pipeline 600 of CPU core 400 of FIG. 4 in a second mode according to some embodiments. Pipeline 600 generally includes a fetch stage 610, a decode/dispatch/rename stage 620, and an execute stage 630 corresponding to fetch stage 510, decode/dispatch/rename stage 520, and execute stage 530, respectively, of FIG. 5. However, unlike pipeline 500, pipeline 600 identifies redundant resources that have now been disabled. Pipeline 600 illustrates four types of redundant resources. First, since integer pipelines 642 and 644 both execute the same types of instructions, one of them is redundant, and CPU core 400 disables integer pipeline 642 in the second mode. Second, each slot of queue sub-stage 632 beyond the first is redundant, and CPU core 400 reduces the size of each queue sub-stage 632 by half. For example, the size of queue sub-stage 632 can be reduced from twenty-four slots to twelve slots. Third, the second half of decode/dispatch/rename stage 620 is redundant since decode/dispatch/rename stage 620 decodes two instructions in parallel. CPU core 400 disables the redundant half of decode/dispatch/rename stage 620 such that it can only issue a single instruction per clock cycle. Fourth, the effective sizes of caches 415 and 460 can be reduced in half. FIG. 6 shows the disabling of these redundant resources in the second sub-stage of fetch stage 610, and in the second sub-stage of execute sub-stage 636 of load/store pipeline 648, in response to receiving a signal from functional throttling circuit 456 labeled “DISABLE”. By reducing the cache size in half, the power consumed in performing associative lookups and in maintaining valid data is reduced. - In this way,
pipeline 600 is able to fully execute the instruction set of CPU core 400, but consumes less power for low intensity tasks. Moreover, when CPU core 400 transitions from the first mode to the second mode, each pipeline can transition seamlessly. For example, when disabling a redundant half of decode/dispatch/rename stage 620, the hardware may simply disable sub-stages in the unneeded half as the last instruction flows down decode/dispatch/rename stage 620. Moreover, CPU core 400 can allow the size of each queue sub-stage to be reduced by stalling decode/dispatch/rename stage 620 until only half of the slots are used, and then disabling the unused half. CPU core 400 can also disable a redundant pipeline by stopping the input of new decoded instructions into the pipeline and waiting until the pipeline naturally drains. Moreover, CPU core 400 can reduce the sizes of instruction and data caches. In these ways, CPU core 400 can transition from the first (big core) mode to the second (little core) mode seamlessly and without the need for slow instruction migration. -
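The four second-mode actions can be summarized numerically in a short sketch. The structure sizes and field names below are illustrative (only the twenty-four-to-twelve queue reduction comes from the text), and the drain step is a simplified model of waiting for in-flight operations to retire:

```python
# Hypothetical numeric model of the four little-core actions applied to
# pipeline 600, plus the seamless drain of the redundant pipeline.

def drain(pipeline):
    # Stop feeding new decoded instructions; let in-flight ops retire,
    # then disable the now-empty pipeline.
    while pipeline["in_flight"]:
        pipeline["in_flight"].pop(0)   # oldest operation retires first
    pipeline["disabled"] = True

def apply_second_mode(core):
    drain(core["redundant_integer_pipe"])  # 1st: disable redundant pipeline
    core["queue_slots"] //= 2              # 2nd: halve queues, e.g. 24 -> 12
    core["decode_width"] = 1               # 3rd: one instruction per cycle
    core["icache_fraction"] /= 2           # 4th: halve effective cache sizes,
    core["dcache_fraction"] /= 2           #      cutting lookup/retention power
    return core

big = {"redundant_integer_pipe": {"in_flight": ["add", "mul"], "disabled": False},
       "queue_slots": 24, "decode_width": 2,
       "icache_fraction": 1.0, "dcache_fraction": 1.0}
little = apply_second_mode(big)
print(little["queue_slots"], little["decode_width"])  # 12 1
```

The drain-before-disable ordering is what makes the transition seamless: no architectural state is lost, so no instruction migration to another core is required.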
FIG. 7 illustrates a flow diagram of a method 700 for configuring a processor core according to some embodiments. Action box 710 includes fetching and decoding a write MSR instruction (“WMSR”) in a processor core. The flow proceeds to decision box 720, which determines whether CPU core 400 is in a privileged state. If CPU core 400 is not in a privileged state, flow proceeds to action box 730, in which the method ends by some appropriate action, such as taking a privilege mode violation exception. If the processor core is in a privileged state, then flow proceeds to action box 740, which updates a power control field in the MSR. - Continuing on,
method 700 proceeds to action box 750, which reconfigures the execution pipeline of CPU core 400 in response to the change in the power control field. Finally, flow proceeds to action box 760, in which CPU core 400 executes instructions using the reconfigured core. -
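Method 700 can be sketched as the following control flow. The box numbers in the comments match FIG. 7; the dictionary core representation and the string return values are hypothetical illustrations:

```python
# Hypothetical sketch of method 700: a WMSR write gated by privilege state.

def handle_wmsr(core, new_power_mode):
    # Box 710: the WMSR instruction has been fetched and decoded.
    # Box 720: is the core in a privileged state?
    if not core["privileged"]:
        return "privilege_violation"          # box 730: take an exception
    core["msr_power_field"] = new_power_mode  # box 740: update the MSR field
    core["pipeline_mode"] = new_power_mode    # box 750: reconfigure pipeline
    return "executing"                        # box 760: run on the new config

core = {"privileged": False, "msr_power_field": "big", "pipeline_mode": "big"}
print(handle_wmsr(core, "little"))  # privilege_violation
core["privileged"] = True
print(handle_wmsr(core, "little"))  # executing
```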
FIG. 8 illustrates a flow diagram of a method 800 for configuring a multi-core data processor according to some embodiments. Action box 810 includes functionally throttling a processor core of a CPU cluster. Action box 820 includes gating off the remaining processor cores of the CPU cluster. Action box 830 includes executing instructions using the processor core that remains enabled. - The functions of
FIGS. 3-6 may be implemented with various combinations of hardware and software, and the software component may be stored in a computer readable storage medium for execution by at least one processor. Moreover, the methods illustrated in FIGS. 7 and 8 may also be governed by instructions that are stored in a computer readable storage medium and that are executed by at least one processor. Each of the operations shown in FIGS. 7 and 8 may correspond to instructions stored in a non-transitory computer memory or computer readable storage medium. In various embodiments, the non-transitory computer readable storage medium includes a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The computer readable instructions stored on the non-transitory computer readable storage medium may be in source code, assembly language code, object code, or another instruction format that is interpreted and/or executable by one or more processors. - Moreover, the functions of
FIGS. 3-6 may be described or represented by a computer accessible data structure in the form of a database or other data structure which can be read by a program and used, directly or indirectly, to fabricate the integrated circuits of FIGS. 3-6. For example, this data structure may be a behavioral-level description or register-transfer level (RTL) description of the hardware functionality in a hardware description language (HDL) such as Verilog or VHDL. The description may be read by a synthesis tool, which may synthesize the description to produce a netlist comprising a list of gates from a synthesis library. The netlist comprises a set of gates which also represent the functionality of the hardware comprising the integrated circuits of FIGS. 3-6. The netlist may then be placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks may then be used in various semiconductor fabrication steps to produce the integrated circuits of FIGS. 3-6. Alternatively, the database on the computer accessible storage medium may be the netlist (with or without the synthesis library) or the data set, as desired, or Graphic Data System (GDS) II data. - While particular embodiments have been described, various modifications to these embodiments will be apparent to those skilled in the art. For example, in the illustrated embodiment,
CPU cluster 310 includes four CPU cores 312, 314, 316, and 318, and a cache 320, which is a shared L2 cache. In some embodiments, CPU cluster 310 could include a different number of cores, and different cache memory hierarchies, including shared and dedicated cache memories. Moreover, in the illustrated embodiment, CPU cores 312-318 can be reconfigured only when CPU core 400 is in a privileged state. Also, configuration circuit 450 could reconfigure different redundant functions of a CPU core of CPU cluster 310, including an arithmetic logic unit (ALU), a schedule queue cluster, an FP unit, a multimedia extension unit (MMX), a cache memory, a cache controller, a translation lookaside buffer (TLB), a branch prediction unit, and the like.
Claims (25)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/714,011 US9164570B2 (en) | 2012-12-13 | 2012-12-13 | Dynamic re-configuration for low power in a data processor |
Publications (2)
Publication Number | Publication Date |
---|---|
US20140173312A1 true US20140173312A1 (en) | 2014-06-19 |
US9164570B2 US9164570B2 (en) | 2015-10-20 |
Family
ID=50932424
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: ADVANCED MICRO DEVICES, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SHIPPY, DAVID J.;REEL/FRAME:029570/0648 Effective date: 20121226 |