CN106095046A - Power-efficient processor architecture - Google Patents

Power-efficient processor architecture

Info

Publication number
CN106095046A
Authority
CN
China
Prior art keywords
core
state
interruption
logic
processor
Prior art date
Legal status
Pending
Application number
CN201610369305.9A
Other languages
Chinese (zh)
Inventor
A. J. Herdrich (A·J·赫德瑞奇)
R. G. Illikkal (R·G·伊利卡尔)
R. Iyer (R·艾耶)
S. Srinivasan (S·斯里尼瓦桑)
J. Moses (J·摩西)
S. Makineni (S·马基嫩)
Current Assignee
Intel Corp
Original Assignee
Intel Corp
Priority date
Filing date
Publication date
Application filed by Intel Corp
Priority to CN201610369305.9A
Priority claimed from CN201180073263.XA (external priority; CN103765409A)
Priority claimed from CN201610364515.9A (external priority; CN106020424B)
Publication of CN106095046A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234 Power saving characterised by the action undertaken
    • G06F1/3293 Power saving characterised by the action undertaken by switching to a less power-consuming processor, e.g. sub-CPU
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5094 Allocation of resources, e.g. of the central processing unit [CPU] where the allocation takes into account power or heat criteria
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Power Sources (AREA)

Abstract

The present invention relates to a power-efficient processor architecture. In one embodiment, the invention includes a method that receives an interrupt from an accelerator and, in response, sends a resume signal directly to a small core and provides a subset of the execution state of a large core to that (first) small core. The method then determines whether the small core can handle a request associated with the interrupt. If so, the operation corresponding to the request is performed in the small core; otherwise, the large core execution state and the resume signal are provided to the large core. Other embodiments are described and claimed.

Description

Power-efficient processor architecture
This application is a divisional of Chinese patent application No. 201180073263.X, filed on September 6, 2011, and entitled "Power-efficient processor architecture".
Background
Generally, processors use power-saving sleep modes whenever possible, such as those defined by the Advanced Configuration and Power Interface (ACPI) standard (e.g., Rev. 3.0b, published October 10, 2006). In addition to voltage and frequency scaling (DVFS, or ACPI performance states (P-states)), these so-called C-state core low-power states (ACPI C-states) save power when a core is idle or not fully utilized. However, even in a multicore processor, a core is often woken from a deep sleep state only to perform a relatively simple operation and then return to sleep. Such behavior can hurt power efficiency, because exiting and re-entering a low-power state costs both latency and power. During state transitions, some types of processors may consume power without doing any useful work, which further reduces power efficiency.
Examples of operations that must be handled on exit from a low-power state include keyboard input, timer interrupts, network interrupts, and so forth. To handle these operations in a power-sensitive way, current operating systems (OSs) change program behavior, either by processing larger amounts of data at a time or by moving to a tickless OS (no periodic timer interrupts, only sporadic programmed interrupts). Another strategy is timer coalescing, in which multiple interrupts are grouped and processed together. However, besides requiring changes to program behavior, all of these options add complexity and can still result in power-inefficient operation. Further, some types of software (e.g., media playback) can defeat hardware power-efficiency mechanisms by requesting frequent periodic wakeups regardless of how much work actually needs to be done. Thus, while tickless and timer-coalescing strategies can save some power by reducing unnecessary wakeups from deep C-states, they require invasive OS changes and can take a long time to propagate through the computing ecosystem, because such changes are not deployed until a new OS version is released.
Brief description of the drawings
Fig. 1 is a block diagram of a processor according to an embodiment of the present invention.
Fig. 2 is a block diagram of a processor according to another embodiment of the present invention.
Fig. 3 is a flow diagram of resume-flow options between cores according to an embodiment of the present invention.
Fig. 4 is a flow diagram of a method according to an embodiment of the present invention.
Fig. 5 is a flow diagram of a method for transferring execution state according to an embodiment of the present invention.
Fig. 6 is a block diagram of a processor according to yet another embodiment of the present invention.
Fig. 7 is a block diagram of a processor according to a further embodiment of the present invention.
Fig. 8 is a block diagram of a processor according to still another embodiment of the present invention.
Fig. 9 is a timing diagram according to an embodiment of the present invention.
Fig. 10 is a graphical illustration of power savings according to an embodiment of the present invention.
Fig. 11 is a block diagram of a system according to an embodiment of the present invention.
Detailed description
In various embodiments, average power consumption can be reduced in a heterogeneous processor environment. For system and power-efficiency reasons, such an environment may include large, fast cores as well as smaller, more power-efficient cores. Further, embodiments can provide this power control in a manner that is transparent to an operating system (OS) executing on the processor. However, the scope of the present invention is not limited to heterogeneous environments. Embodiments can also be used in homogeneous environments (homogeneous in the sense of being OS-transparent, though not necessarily hardware-heterogeneous) to reduce average power, e.g., by keeping as many cores asleep as possible in a multiprocessor environment. Embodiments may be especially suitable for hardware-accelerated environments, such as tablet-computer and system-on-chip (SoC) architectures in which the cores are often asleep.
In general, embodiments perform power control by directing all wake-up signals to a smaller core rather than to a large core. In this way, average power can be reduced by more than 2x when the system is 95% idle. As described below, in many embodiments this smaller core can be sequestered from the OS. That is, the OS is unaware that the smaller core exists, and the core is invisible to it. Embodiments can thus provide power-efficient processor operation through processor hardware in a manner that is transparent to both the OS and the applications executing on the processor.
Referring now to Fig. 1, shown is a block diagram of a processor according to an embodiment of the present invention. As shown in Fig. 1, processor 100 may be a heterogeneous processor having a number of large cores, small cores, and accelerators. Although described here in the context of a multicore processor, embodiments are not so limited and in various implementations may be realized in an SoC or another semiconductor-based processing device. Note that the accelerators can perform work from a queue of incoming work regardless of whether the processor cores are powered on. In the embodiment of Fig. 1, processor 100 includes a plurality of large cores. In the specific embodiment shown, two such cores 110a and 110b (generally, large cores 110) are shown, although more than two such large cores may be provided. In various implementations, these large cores may be out-of-order processors that have a relatively complex pipeline architecture and operate according to a complex instruction set computing (CISC) architecture.
Processor 100 also includes a plurality of small cores 120a-120n (generally, small cores 120). Although eight such cores are shown in the embodiment of Fig. 1, the scope of the present invention is not limited in this regard. In various embodiments, small cores 120 may be power-efficient in-order processors that execute instructions according to, e.g., a CISC or reduced instruction set computing (RISC) architecture. In some implementations, two or more of these cores may be coupled together to perform related processing. For example, if multiple large cores are in a power-saving state, one or more smaller cores may be active to perform work that would otherwise wake a large core. In many embodiments, small cores 120 may be transparent to the OS, although in other embodiments the small and large cores may be exposed to the OS, with configuration options available for their use. In general, any mix of large and small cores can be used in various embodiments. For example, a single small core may be provided per large core, or, in other embodiments, a single small core may be associated with multiple large cores.
As used herein, the term "large core" refers to a processor core of relatively complex design that may consume a relatively large chip area compared to a "small core", which may be of less complex design and consume a correspondingly smaller chip area. In addition, smaller cores are more power efficient than larger cores, because they may have a lower thermal design power (TDP). It should be understood, however, that smaller cores have limited processing capabilities compared with large cores. For example, they may not be able to handle every operation that a large core can. Smaller cores may also be less efficient at instruction processing; that is, instructions execute more quickly on a large core than on a small core.
As further seen, large cores 110 and small cores 120 may be coupled to an interconnect 130. Different implementations of this interconnect structure can be realized in various embodiments. For example, in some embodiments the interconnect structure may be according to a front side bus (FSB) architecture or the Intel® QuickPath Interconnect (QPI) protocol. In other embodiments, the interconnect structure may be according to a given system fabric.
Still referring to Fig. 1, multiple accelerators 140a-140c may also be coupled to interconnect 130. Although the scope of the present invention is not limited in this regard, the accelerators may include media processors such as audio and/or video processors, cryptographic processors, fixed-function units, and so forth. These accelerators may be designed by the same designers who designed the cores, or they may be independent third-party intellectual property (IP) blocks incorporated into the processor. In general, dedicated processing tasks can be performed more efficiently in these accelerators than in the large or small cores, in terms of both performance and power consumption. Although shown with this particular implementation in the embodiment of Fig. 1, understand that the scope of the present invention is not limited in this regard. For example, instead of only two types of cores (i.e., large and small cores), other embodiments may have a hierarchy of cores including at least a large core, a medium core, and a small core. The medium core has a chip area larger than that of the small core but smaller than that of the large core, and a power consumption between those of the large and small cores. In still other embodiments, a small core may be embedded within a larger core, e.g., as a subset of the logic and structures of the larger core.
Further, although the embodiment of Fig. 1 includes multiple large cores and multiple small cores, some implementations, such as mobile processors or SoCs, may provide only a single large core and a single small core. Specifically, referring now to Fig. 2, shown is a block diagram of a processor according to another embodiment of the present invention, in which processor 100" includes a single large core 110 and a single small core 120, along with interconnect 130 and accelerators 140a-c. As mentioned above, such an implementation may be suitable for mobile applications.
As example power figures, a typical large core may consume approximately 6000 milliwatts (mW), a medium core approximately 500 mW, and a very small core approximately 15 mW. An implementation that avoids waking the large core can therefore realize significant power benefits.
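The power figures above support a simple break-even estimate. Energy is power times active time, so a slower, lower-power core uses no more energy per task as long as its execution time is at most P_fast / P_slow times that of the faster core. The sketch below works this out with the quoted approximate figures. It ignores wake-up and state-transfer costs, and the 0.1 ms and 1 ms task durations are purely hypothetical.

```python
# Approximate power figures quoted in the text, in milliwatts.
LARGE_CORE_MW = 6000
MEDIUM_CORE_MW = 500
SMALL_CORE_MW = 15


def energy_mj(power_mw: float, active_ms: float) -> float:
    """Energy in millijoules for a core that stays active for active_ms milliseconds."""
    return power_mw * active_ms / 1000.0


def break_even_slowdown(p_fast_mw: float, p_slow_mw: float) -> float:
    """Largest execution-time ratio (t_slow / t_fast) at which the lower-power core
    still uses no more energy than the faster core for the same task:

        p_slow * t_slow <= p_fast * t_fast  <=>  t_slow / t_fast <= p_fast / p_slow

    Transition (wake/sleep) energy is deliberately ignored in this sketch.
    """
    return p_fast_mw / p_slow_mw


if __name__ == "__main__":
    print(break_even_slowdown(LARGE_CORE_MW, SMALL_CORE_MW))   # 400.0
    print(break_even_slowdown(LARGE_CORE_MW, MEDIUM_CORE_MW))  # 12.0
    # Hypothetical task: 0.1 ms on the large core vs. 1 ms on the small core.
    print(energy_mj(LARGE_CORE_MW, 0.1), energy_mj(SMALL_CORE_MW, 1.0))
```

Under these figures, a small core could take up to about 400 times longer than the large core on a wake-time task and still come out ahead on energy, before transition costs are counted.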
Embodiments allow larger, less power-efficient cores to stay in low-power sleep states longer than they otherwise could. When interrupts and other core wake events are directed to the smaller cores instead of the larger cores, the smaller cores may run longer and wake more often. This is still more power efficient than waking a large core to perform a trivial task such as data movement. Note that, as described below, some operations may require the large core to be powered on. An example is when the smaller core does not support vector operations (e.g., AVX operations), complex addressing modes, or floating-point (FP) operations. In such cases, the wake signal can be rerouted from the small core to the large core.
For example, during hardware-accelerated 1080p video playback, the cores make more than 1000 transitions into and out of the C6 state, and nearly 1200 interrupts occur, every second. If even a portion of these wake events are redirected to a smaller core using embodiments of the present invention, significant power savings can be achieved.
Fig. 3 outlines resume-flow options between cores according to an embodiment of the present invention. As shown in Fig. 3, there is a software domain 210 and a hardware domain 220. In general, software domain 210 corresponds to OS operations relating to power management, e.g., according to an ACPI implementation. Based on its scheduling mechanism and its knowledge of upcoming tasks, the OS can select one of multiple C-states to request that the processor enter a low-power mode. For example, the OS may issue an MWAIT call that includes the specific low-power state being requested.
In general, C0 corresponds to the normal operating state in which instructions are executed. States C1-C3 are OS low-power states, and each offers a different level of power savings and a correspondingly different latency to return to C0. Depending on the expected workload of the processor, the OS may select a busy state (e.g., OS C0) or one of multiple idle states (e.g., OS C-states C1-C3). Each of these idle states may be mapped to a corresponding hardware low-power state under control of the processor hardware. The processor hardware can therefore map a given OS C-state to a corresponding hardware C-state that may provide greater power savings than the OS specified. In general, shallower C-states (e.g., C1) save less power than deeper C-states (e.g., C3) but have lower exit latencies. In various embodiments, the hardware domain 220 and the mapping of OS C-states to processor C-states may be handled by a power control unit (PCU) of the processor, although the scope of the present invention is not limited in this regard. This mapping may be based on the history of previous OS power management requests. The determination may also be based on the state of the overall system, configuration information, and so forth.
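A history-based mapping from OS C-states to hardware C-states might look like the following minimal sketch. The base mapping table, the set of hardware states (C1, C3, C6), the window size, and the promotion rule are all hypothetical assumptions made for illustration. The text says only that the mapping may be based on the history of earlier OS requests and may yield deeper savings than the OS requested.

```python
from collections import deque

# Hypothetical default mapping from OS (ACPI) idle states to hardware C-states.
BASE_MAP = {"C1": "C1", "C2": "C3", "C3": "C6"}
OS_ORDER = ["C1", "C2", "C3"]   # OS idle states, shallowest to deepest
HW_ORDER = ["C1", "C3", "C6"]   # assumed hardware C-states, shallowest to deepest


class CStateMapper:
    """Sketch of a PCU-side mapper that may promote an OS request to a deeper
    hardware C-state when recent OS requests have consistently been at least as deep."""

    def __init__(self, window: int = 4):
        # Fixed-size history of recent OS requests (oldest entries drop off).
        self.history = deque(maxlen=window)

    def map(self, os_state: str) -> str:
        self.history.append(os_state)
        hw = BASE_MAP[os_state]
        window_full = len(self.history) == self.history.maxlen
        depth = OS_ORDER.index(os_state)
        # Promote by one hardware level (capped at the deepest) if every request
        # in the window was at least as deep as the current one.
        if window_full and all(OS_ORDER.index(s) >= depth for s in self.history):
            hw = HW_ORDER[min(HW_ORDER.index(hw) + 1, len(HW_ORDER) - 1)]
        return hw
```

For example, with `CStateMapper(window=3)`, three consecutive OS C2 requests map to C3, C3, and then C6 once the history window shows a stable pattern.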
In addition, the PCU or other processor logic can be configured to direct all wake events to the smallest available core, which in various embodiments may be an OS-invisible core. As shown in Fig. 3, on exit from a given hardware idle state, control returns directly to the smallest available core, and state is transferred to that core. In contrast, in a conventional hardware/software resume flow, control returns only to the large core. In general, the OS selects a C-state based on the expected idle time and resume-latency requirements, and the architecture maps this C-state to a hardware C-state. Thus, as shown in the embodiment of Fig. 3, all resume signals (such as interrupts) are routed to the smallest available core. That core then determines whether it can handle the resume operation itself or must send a wake signal to a larger core to continue. Note that embodiments do not interfere with existing P-state or C-state auto-demotion, in which hardware automatically selects a shallower hardware C-state with lower resume latency based on measured empirical efficiency. Note also that the PCU or another programmable entity can examine incoming wake events to decide which core (large or small) they should be routed to.
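The routing rule above, "deliver every wake event to the smallest available core", can be sketched as a selection over a core list. The `Core` record and its `size_rank` field are hypothetical modeling choices, not structures named in the text. The chosen core would then handle the event itself or forward a wake signal to a larger core.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    size_rank: int          # hypothetical: smaller number = smaller, more power-efficient core
    available: bool = True  # e.g. not disabled or reserved


def route_wake_event(cores):
    """Return the smallest available core to receive a resume signal."""
    candidates = [c for c in cores if c.available]
    if not candidates:
        raise RuntimeError("no core available to receive the wake event")
    return min(candidates, key=lambda c: c.size_rank)
```

For instance, given a large core, a medium core, and two small cores where one small core is unavailable, the event goes to the remaining small core.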
As described above, in some implementations the small core itself may be hidden from the OS and application software. For example, a small-large core pair may be sequestered so that application software does not see it. In a low-power state, all cores may be asleep while an accelerator (such as a video decode accelerator) performs a given task such as decoding. When the accelerator runs out of data, it directs a wake signal to the small core, possibly to request more data. The small core wakes up and determines that this simple data-move operation can be performed without waking the large core, which saves power. If a timer interrupt arrives and the small core wakes and detects a complex vector operation (such as a 256-bit AVX instruction) in the instruction stream, the large core may be woken to handle that instruction (and the other instructions in the stream) to reduce latency. In an alternative implementation, a global hardware observation mechanism can detect that the small core has encountered an AVX instruction and generate an undefined-instruction fault. This mechanism may be located in the PCU, in another uncore location near the PCU, in a separate section of hardware logic on the global interconnect, or in addition to the small core's internal control logic. The fault may cause the small core to shut down and the instruction stream to be redirected to the larger core once that core has been woken. Note that this behavior is not limited to instructions and can extend to configuration or features. For example, if the small core encounters a write to a configuration space that exists only on the large core, it can request that the large core be woken.
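The escalation check described above can be sketched as follows: the small core (or an observer) scans the pending work and reports why the large core is needed, if it is. Here the instruction stream is modeled as (mnemonic, required feature) pairs. The feature names and the `LARGE_CORE_CFG` register name are hypothetical. Real hardware would detect this in its decoder or through a fault, not through a Python table.

```python
# Hypothetical capability table for the small core; real cores report features via CPUID.
SMALL_CORE_FEATURES = {"int", "load", "store", "branch"}
# Hypothetical configuration space that exists only on the large core.
LARGE_ONLY_CONFIG_REGS = {"LARGE_CORE_CFG"}


def needs_large_core(instr_stream, config_writes=()):
    """Return a reason string if the large core must be woken, else None.

    instr_stream: iterable of (mnemonic, required_feature) pairs.
    config_writes: iterable of configuration-space names the code writes to.
    """
    for mnemonic, feature in instr_stream:
        if feature not in SMALL_CORE_FEATURES:
            return f"unsupported instruction {mnemonic} ({feature})"
    for reg in config_writes:
        if reg in LARGE_ONLY_CONFIG_REGS:
            return f"write to large-core-only configuration space {reg}"
    return None
```

A simple data-move stream returns `None` (the small core keeps the work). A stream containing a 256-bit vector add such as `("vaddps", "avx256")` returns a reason, and that reason would trigger waking the large core.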
Referring now to Fig. 4, shown is a flow diagram of a method according to an embodiment of the present invention. Note that, depending on the implementation, the method of Fig. 4 may be performed by various agents. For example, in some embodiments method 300 may be implemented in part by system agent circuitry such as a power control unit within the processor (which may reside in a system agent or uncore portion of the processor). In other embodiments, method 300 may be implemented in part by interconnect logic, such as power control logic within an interconnect structure. That logic can, for example, receive interrupts from accelerators coupled to the interconnect structure and forward them to a selected location.
As shown in Fig. 4, method 300 may begin by placing the large and small cores into a sleep state (block 310). That is, it is assumed that no active operations remain to be performed in the cores, so they can be placed into selected low-power states to reduce power consumption. Although the cores may be inactive, other agents of the processor or SoC, such as one or more accelerators, may still be performing tasks. At block 320, an interrupt may be received from such an accelerator. The accelerator may send this interrupt when it completes a task, encounters an error, needs more data, or when further processing is to be performed by another component (e.g., a given core). Control passes to block 330, where the logic sends a resume signal directly to the small core. That is, the logic may be programmed so that, whenever both the large and small cores are in low-power states, it always sends the resume signal to the small core (or to a selected one of several small cores, depending on the system implementation). By always sending interrupts directly to the small core, the higher power consumption of the large core is avoided in the many cases in which the small core can handle the operation requested by the interrupt. Note that a filtering or caching mechanism can be added to block 330 so that certain interrupt sources are always routed to one core or the other, as desired, to balance performance and power.
Again referring to Fig. 4, control to turn next to rhombus 340, there, it can be determined that whether small nut can process and interrupt The request being associated.Although the scope of the present invention is unrestricted in this regard, but, in some embodiments it is possible at small nut At small nut, itself carries out this after being waken up to judge.Or, the logic of the method performing Fig. 4 can perform judgement (at this In the case of sample, before sending recovery signal to small nut, can be performed this and analyze).
As an example, the small core may determine whether it can handle the requested operation based on its performance requirements and/or its instruction set architecture (ISA) capabilities. If the small core cannot handle the requested operation because it lacks ISA support, its front-end logic can parse the received instruction stream and determine that at least one instruction in the stream is not supported by the small core. The small core may then issue an undefined-instruction fault. This undefined fault may be sent to the PCU (or another entity), which can analyze the fault and the state of the small core to decide between two cases. In the first case, the fault arose because the small core lacks hardware support for the instruction. In the second, it is a genuine undefined fault, which may be forwarded to the OS for further handling. In the first case, the PCU can transfer the execution state that was provided to the small core over to the corresponding large core, which then handles the requested instruction.
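The PCU's classification of an undefined-instruction fault can be reduced to a small decision function. This is a minimal sketch that assumes the PCU can compare the faulting opcode against simple sets describing each core's ISA. The `FaultAction` names and the set-based ISA model are illustrative assumptions.

```python
from enum import Enum


class FaultAction(Enum):
    MIGRATE_TO_LARGE_CORE = "migrate"      # capability gap on the small core only
    FORWARD_TO_OS = "forward_to_os"        # genuine undefined instruction


def classify_ud_fault(opcode, small_core_isa, large_core_isa):
    """Hypothetical PCU check for an undefined-instruction fault raised by a small core."""
    if opcode in large_core_isa and opcode not in small_core_isa:
        # The large core implements the opcode, so move execution state there.
        return FaultAction.MIGRATE_TO_LARGE_CORE
    # Undefined on the large core as well: report it to the OS as a real fault.
    return FaultAction.FORWARD_TO_OS
```

With `small = {"mov", "add"}` and `large = {"mov", "add", "vaddps"}`, a fault on `vaddps` leads to migration, while a fault on an opcode neither core implements goes to the OS.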
In other embodiments, execution state may be transferred from the small core to the large core when it is determined that the small core has been executing for too long or that its performance level is too low. For example, assume the small core has already executed for thousands or millions of processor cycles on the requested task. Because the large core offers more efficient execution, transferring the state to it lets the task finish more quickly, which can yield greater power savings.
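A minimal sketch of this migration trigger combines a cycle budget with a performance floor. Both thresholds are hypothetical placeholders. The text says only "too long" or "too low", and it uses thousands or millions of cycles as an illustration. Instructions per cycle (IPC) stands in here for "performance level", which is itself an assumption.

```python
def should_migrate(cycles_on_small_core: int, ipc: float,
                   max_cycles: int = 1_000_000, min_ipc: float = 0.3) -> bool:
    """Hypothetical policy: move a task to the large core if it has run too long on
    the small core, or if it is progressing too slowly (low instructions per cycle).
    Threshold defaults are illustrative only."""
    too_long = cycles_on_small_core > max_cycles
    too_slow = ipc < min_ipc
    return too_long or too_slow
```

In practice such thresholds would likely be tuned per platform, for example from the interrupt history described for Fig. 6.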
Still referring to Fig. 4, if it is determined that the requested operation can be handled in the small core, control passes to block 350, where the small core performs the operation. For example, if the requested operation is a data-movement operation, the small core can perform the requested processing. If no other tasks are pending for the small core, it can then be placed back into a low-power state.
If instead it is determined at diamond 340 that the small core cannot handle the requested operation (e.g., because the operation is relatively complex and the small core is not configured to handle it), control passes to block 360. There, a wake signal can be sent, e.g., directly from the small core to the large core, to power on the large core. Control then passes to block 370, where the requested operation is performed in the large core. Although described with this particular set of operations in the embodiment of Fig. 4, understand that the scope of the present invention is not limited in this regard.
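Blocks 310-370 of method 300 can be traced end to end with a toy model. In this sketch, the operation labels (`"data_move"`, `"avx_compute"`), the set of small-core operations, and the `CoreModel` class are hypothetical. The capability check stands in for diamond 340, whether the small core or the PCU performs it.

```python
from dataclasses import dataclass, field


@dataclass
class Interrupt:
    source: str      # e.g. an accelerator
    operation: str   # hypothetical label for the requested work


# Assumed set of operations the small core can handle on its own.
SMALL_CORE_OPS = {"data_move", "timer_tick"}


@dataclass
class CoreModel:
    name: str
    asleep: bool = True                      # block 310: cores start in a sleep state
    log: list = field(default_factory=list)  # operations this core has executed

    def wake(self):
        self.asleep = False

    def sleep(self):
        self.asleep = True

    def run(self, op):
        self.log.append(op)


def handle_interrupt(irq, small, large):
    """Block 320 delivers irq; returns the name of the core that handled it."""
    small.wake()                           # block 330: resume signal goes to the small core
    if irq.operation in SMALL_CORE_OPS:    # diamond 340: can the small core handle it?
        small.run(irq.operation)           # block 350
        small.sleep()                      # nothing else pending: back to low power
        return small.name
    large.wake()                           # block 360: wake signal to the large core
    small.sleep()
    large.run(irq.operation)               # block 370
    return large.name
```

For example, an accelerator interrupt requesting `"data_move"` is handled by the small core while the large core stays asleep. A request for `"avx_compute"` wakes the large core.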
Thus, various embodiments provide a mechanism that allows hardware interrupts and other wake signals to be routed directly to a small core without waking a large core. Note that in different implementations, either the small core itself or a supervising agent can determine whether the wake signal and its processing can be completed without waking the large core. In representative cases, the smaller core may be much more power efficient than the larger core and, as a result, may support only a subset of the instructions the large core supports. Many operations performed on waking from a low-power state can be offloaded to the simpler, more power-efficient core. This avoids waking the larger, more powerful core in a heterogeneous environment, where cores of many sizes are included in the system for performance or power-efficiency reasons.
Referring now to Fig. 5, shown is a flow diagram of a method for transferring execution state according to an embodiment of the present invention. As shown in Fig. 5, in one embodiment method 380 may be performed by logic of the PCU. This logic may be triggered by a request to place the large core into a low-power state. In response to such a request, method 380 may begin at block 382, where the execution state of the large core is stored in a temporary storage area. This temporary storage area may be a dedicated state save area associated with the core, or it may be within a shared cache such as a last-level cache (LLC). Although the scope of the present invention is not limited in this regard, the execution state may include general-purpose registers, status and configuration registers, execution flags, and so forth. In addition, further operations to place the large core into the low-power state can be performed at this point. These operations may include flushing internal caches and other state, and signaling to shut down the given core.
Still referring to Fig. 5, it can be determined whether the small core has resumed (diamond 384). This resumption may occur as the result of a resume signal received in response to an interrupt, e.g., from an accelerator of the processor. As part of the small core's resume, control passes to block 386, where at least part of the large core's state is extracted from the temporary storage area. More specifically, the extracted portion is the part of the large core's execution state that the small core will use. As examples, this portion may include main register contents, various flags such as certain execution flags, machine status registers, and so forth. However, some state may not be extracted, such as state associated with execution units that are present in the large core but have no counterpart in the small core. The extracted portion of the state is then sent to the small core (block 388), enabling the small core to perform whatever operations are appropriate for the given interrupt. Although shown with this particular implementation in the embodiment of Fig. 5, understand that the scope of the present invention is not limited in this regard.
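Blocks 382-388 amount to "save everything, hand out a filtered subset". The sketch below models the temporary storage area as a dictionary keyed by core ID. It assumes a hypothetical set of register names that the small core can use. State for units the small core lacks (here a 256-bit `ymm0` register) is left behind in the scratchpad.

```python
# Hypothetical subset of architectural state the small core can consume.
SMALL_CORE_STATE_KEYS = {"rip", "rsp", "rax", "rflags", "cr3"}

# Models the temporary storage area (a dedicated save area or part of the LLC).
scratchpad = {}


def save_large_core_state(core_id, state):
    """Block 382: save the full execution state before the large core sleeps."""
    scratchpad[core_id] = dict(state)


def extract_for_small_core(core_id):
    """Blocks 386-388: pull only the portion of saved state the small core will use."""
    saved = scratchpad[core_id]
    return {k: v for k, v in saved.items() if k in SMALL_CORE_STATE_KEYS}
```

After saving a state that includes both general registers and `ymm0`, the subset returned for the small core contains the general registers but not `ymm0`. The full copy stays in the scratchpad for a later return to the large core.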
Referring now to Fig. 6, shown is a block diagram of a processor according to an embodiment of the present invention. As shown in Fig. 6, processor 400 may be a multicore processor that includes a first plurality of cores 410i-410n, which may be exposed to the OS, and a second plurality of cores 410a-x, which are transparent to the OS.
As seen, the various cores may be coupled via an interconnect 415 to a system agent or uncore 420 that includes various components. Uncore 420 may include a shared cache 430, which may be a last-level cache. The uncore may also include an integrated memory controller 440, various interfaces 450a-n, a power control unit 455, and an advanced programmable interrupt controller (APIC) 465.
PCU 450 may include various logic that enables power-efficient operation according to an embodiment of the present invention. As seen, PCU 450 may include wakeup logic 452, which performs the wakeups described above. Logic 452 can thus be configured to always wake the small core first. However, this logic can also be configured dynamically so that in some cases it does not wake the small core directly. For example, the system may be dynamically configured for power-saving operation when it is a mobile system running on battery power. In this case, the logic can be configured to always wake the small core. Conversely, if the system is a server, desktop, or laptop connected to wall power, embodiments may offer a user-based selection that favors latency and performance over power savings. In this case, wakeup logic 452 can be configured to wake the large core, rather than the small core, in response to an interrupt. The large core can likewise be woken directly when it is determined that a large number of small-core wakeups end up being redirected to the large core.
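The configurable behavior of wakeup logic 452 can be expressed as a small policy function. The inputs (battery state, a user preference flag, and the fraction of recent small-core wakeups that were redirected to the large core) and the 0.5 redirect limit are hypothetical modeling choices. They follow the three cases described above, with battery operation always waking the small core.

```python
from enum import Enum


class WakeTarget(Enum):
    SMALL_CORE = "small"
    LARGE_CORE = "large"


def choose_wake_target(on_battery: bool, prefer_performance: bool,
                       redirect_ratio: float, redirect_limit: float = 0.5) -> WakeTarget:
    """Hypothetical wake policy.

    redirect_ratio: fraction of recent small-core wakeups that were later
    redirected to the large core (0.0-1.0).
    """
    if on_battery:
        return WakeTarget.SMALL_CORE   # mobile power-saving mode: always small core first
    if prefer_performance:
        return WakeTarget.LARGE_CORE   # user chose latency/performance on wall power
    if redirect_ratio > redirect_limit:
        return WakeTarget.LARGE_CORE   # small-core wakeups mostly end up on the large core
    return WakeTarget.SMALL_CORE
```

Keeping the battery check first mirrors the text's statement that a battery-powered mobile system always wakes the small core.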
To further enable power-efficient operation, PCU 450 may also include a state transfer logic 454 that transfers execution state between the large core and the small core. As discussed above, while the large core is in a low power state, this logic can retrieve the large core's execution state from temporary storage. It can then extract at least a portion of that state and provide it to the small core when the small core is woken.
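The following is a minimal sketch of the save-then-extract step. The execution state is modeled as a dictionary of register values held in a hypothetical `Scratchpad`. Which registers count as the "portion" handed to the small core is not specified in the text, so the names used in practice would be an assumption.

```python
from typing import Dict, Iterable


class Scratchpad:
    """Hypothetical temporary storage holding the large core's saved execution state."""

    def __init__(self) -> None:
        self.saved: Dict[str, int] = {}

    def save(self, state: Dict[str, int]) -> None:
        # Called as the large core enters its low power state.
        self.saved = dict(state)

    def extract_subset(self, needed: Iterable[str]) -> Dict[str, int]:
        # Copy out only the registers the small core needs on wakeup;
        # the remainder stays in the scratchpad for a possible later merge.
        return {name: self.saved[name] for name in needed if name in self.saved}
```

Leaving the full state in the scratchpad means the large core can later be restored without first collecting anything back from the small core, apart from the subset that core may have modified.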
Further, PCU 450 may include an interrupt history storage 456. This storage may include a plurality of entries. Each entry identifies an interrupt that occurred during system operation and records whether the small core handled it successfully. When a given interrupt is received, the corresponding entry can be consulted to determine whether a previous interrupt of the same type was handled successfully by the small core. If it was, the PCU may direct the new incoming interrupt to the same small core. If the history shows that this type of interrupt was not handled successfully by the small core, or was handled with unsatisfactorily low performance, the interrupt may instead be sent to the large core.
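The history-based routing can be illustrated with a small table keyed by interrupt type. This is a hypothetical model of interrupt history storage 456, not its actual layout. Defaulting an unseen interrupt type to the small core is an assumption, consistent with the small-core-first wakeup described above.

```python
from typing import Dict


class InterruptHistory:
    """Hypothetical model of interrupt history storage 456: one entry per interrupt type."""

    def __init__(self) -> None:
        # irq_type -> True if the small core last handled this type successfully
        self._handled_by_small: Dict[int, bool] = {}

    def record(self, irq_type: int, small_core_succeeded: bool) -> None:
        self._handled_by_small[irq_type] = small_core_succeeded

    def route(self, irq_type: int) -> str:
        # No entry yet: try the small core first (assumed default).
        if self._handled_by_small.get(irq_type, True):
            return "small"
        # The small core failed (or performed poorly) on this type before.
        return "large"
```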
Still referring to Fig. 6, PCU 450 may also include an undefined handling logic 458. This logic can receive undefined faults issued by the small core and, on such a fault, access information in the small core. It then determines whether the undefined fault is caused by the small core's lack of support for an instruction or by some other reason. Depending on this determination, the logic does one of two things. It may merge the small core's state with the remainder of the large core's execution state (stored in the scratchpad area) and send the merged state to the large core to handle the interrupt. Otherwise, it may forward the undefined fault to the OS for further handling. When it is determined that the small core cannot handle the interrupt, the portion of the execution state that was given to the small core is retrieved from it and saved back to the temporary storage location, after which the small core can be powered down. The merged state, together with the rest of the large core's execution state, is then provided back to the large core so that it can handle the interrupt the small core could not. Note also that when the small core mishandles an interrupt in this way, an entry can be written to interrupt history storage 456. Although the embodiment of Fig. 6 is shown with this particular logic, the scope of the present invention is not limited in this regard. For example, in other embodiments the various logic of PCU 450 may be implemented in a single logic block.
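The undefined-fault decision and state merge can be sketched as follows. The small core's supported instruction set is modeled as a set of opcode names, and the fault record fields (`opcode`, `irq_type`) are hypothetical. Letting values returned by the small core override the scratchpad copy is an assumption, since the small core may have updated them.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class UndefinedFault:
    opcode: str    # instruction that faulted on the small core (hypothetical field)
    irq_type: int  # type of the interrupt being handled when the fault occurred


def handle_undefined_fault(
    fault: UndefinedFault,
    small_core_isa: Set[str],
    small_core_state: Dict[str, int],
    scratchpad_remainder: Dict[str, int],
    history: Dict[int, bool],
) -> Tuple[str, Optional[Dict[str, int]]]:
    """Return ("large", merged_state) or ("os", None)."""
    if fault.opcode in small_core_isa:
        # The small core supports this instruction, so the fault has another
        # cause: forward it to the OS unchanged.
        return ("os", None)
    # Missing instruction support: take back the (possibly modified) subset the
    # small core was given and merge it over the remainder in the scratchpad.
    merged = {**scratchpad_remainder, **small_core_state}
    # Record the failure so the next interrupt of this type goes to the large core.
    history[fault.irq_type] = False
    return ("large", merged)
```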
APIC 465 may receive various interrupts (for example, interrupts issued by accelerators) and direct each one to one or more given cores. In some embodiments, the small core is kept hidden from the OS. To do this, APIC 465 may dynamically remap an incoming interrupt from an APIC ID associated with the large core to an APIC ID associated with the small core. Each interrupt may carry an APIC identifier associated with it.
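The APIC ID remapping can be pictured as a lookup from each large core's APIC ID to its partner small core's APIC ID. This is a minimal sketch only. The mapping table, and the choice to remap only while the large core is asleep, are assumptions for illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class Interrupt:
    vector: int
    apic_id: int  # destination APIC ID carried with the interrupt


def remap_destination(irq: Interrupt, large_to_small: Dict[int, int],
                      large_core_asleep: bool) -> Interrupt:
    """Retarget an interrupt from a large core's APIC ID to its small core's ID.

    Assumes the OS only knows the large cores' APIC IDs, so the small core
    stays hidden from it.
    """
    if large_core_asleep and irq.apic_id in large_to_small:
        return replace(irq, apic_id=large_to_small[irq.apic_id])
    return irq
```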
With further reference to Fig. 6, processor 400 may communicate with a system memory 460, for example via a memory bus. In addition, interfaces 450 can connect the processor to various off-chip components such as peripheral devices and mass storage. Although the embodiment of Fig. 6 is shown with this particular implementation, the scope of the present invention is not limited in this regard.
Note that various architectures are possible for coupling or integrating the large and small cores in different ways. For example, the degree of coupling between these disparate cores may depend on various design optimization parameters related to die area, power, performance and responsiveness.
Referring now to Fig. 7, shown is a block diagram of a processor in accordance with another embodiment of the present invention. As shown in Fig. 7, processor 500 may be a true heterogeneous processor that includes a large core 510 and a small core 520. As seen, each core may be associated with its own private cache memory hierarchy, namely cache memories 515 and 525, each of which may include level 1 and level 2 caches. The cores in turn may be coupled via a ring interconnect 530. Multiple accelerators 540a and 540b and an LLC (i.e., L3 cache 550, which may be a shared cache) may also be coupled to the ring interconnect. In this implementation, execution state is transferred between the two cores over ring interconnect 530. As described above, the execution state of large core 510 may be stored in cache 550 before the large core enters a given low power state. When small core 520 is woken, at least a subset of this execution state can then be provided to the small core. This readies the small core to perform the operation that triggered its wakeup. In the embodiment of Fig. 7, the cores are therefore loosely coupled through this ring interconnect. Although a single large core and a single small core are shown for ease of illustration, the scope of the present invention is not limited in this regard. With an implementation such as that of Fig. 7, any state or communication to be exchanged can be handled over the ring architecture, which may also be a bus or fabric architecture. Alternatively, in other embodiments, this communication may take place over a dedicated bus between the two cores (not shown in Fig. 7).
Referring now to Fig. 8, shown is a block diagram of a processor in accordance with yet another embodiment of the present invention. As shown in Fig. 8, processor 500" may be a hybrid heterogeneous processor in which the large and small cores are tightly coupled or integrated. Specifically, as shown in Fig. 8, large core 510 and small core 520 may share a cache memory 518, which in various embodiments may include level 1 and level 2 caches. Execution state can therefore be transferred from one core to the other through this cache memory, which avoids the latency of communicating over ring interconnect 530. Note that this arrangement achieves lower power through reduced data-movement overhead and faster communication between the cores, but it may be less flexible.
Note that Figs. 7 and 8 show only two possible implementations, each with a limited number of cores. Many other implementations are possible, including different arrangements of cores, combinations of the two schemes, more than two types of cores, and so forth. In a variant of Fig. 8, the two cores may share some components, such as execution units, an instruction pointer or a register file.
As discussed, embodiments can be completely transparent to, and invisible to, the operating system. They therefore require no software modification and add only a minimal extension to the recovery time from a C-state. In other embodiments, the presence and availability of the small cores may be exposed to the OS, allowing the OS to decide whether to deliver an interrupt to a small core or to a large core. Embodiments may also provide a mechanism in system software, such as a basic input output system (BIOS), either to expose the large and small cores to the OS or to configure whether the small cores are exposed. Embodiments may lengthen the recovery time from a C-state. This is acceptable because current platforms already vary in their recovery latency, and no useful work is done while a core's state is being restored in any case. The degree of difference between the small and large cores can range from insignificant to substantial microarchitectural differences. According to various embodiments, the main differences between the heterogeneous cores may be die area and the power the cores consume.
In some implementations, a control mechanism may be provided for the case where the large core is detected to be woken most of the time on recovery. In that case, the small-core wakeup can be skipped and the large core woken directly, for at least a predetermined length of time, to preserve performance. Note that in some embodiments, a mechanism that universally redirects all interrupts and other wakeup signals to either the small core or the large core may be exposed to software, both system software and user-level software, depending on the application and on the system's power and performance requirements. As one example, a user-level instruction may be provided to direct wakeup operations to a specified core. Such an instruction may be a variant of an MWAIT-like instruction.
In some embodiments, an accelerator can send a hint along with an interrupt to the PCU or another management agent. The hint indicates that the requested operation is relatively simple and can therefore be handled efficiently by the small core. The PCU can use this accelerator-provided hint to automatically direct the incoming interrupt to the small core for handling.
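The bypass mechanism described in the preceding paragraph can be sketched as a sliding window over recent wakeups. This is a minimal sketch under stated assumptions. The window size, the majority fraction and the bypass duration stand in for the unspecified "most of the time" and "predetermined time span". Timestamps are passed in explicitly rather than read from a clock.

```python
from collections import deque


class LargeCoreBypass:
    """Hypothetical controller: if most recent wakeups ended up on the large
    core, wake the large core directly for a fixed period."""

    def __init__(self, window: int, bypass_duration: float, majority: float = 0.5) -> None:
        self.window = window                    # number of recent wakeups examined
        self.bypass_duration = bypass_duration  # the "predetermined time span"
        self.majority = majority                # fraction treated as "most of the time"
        self.recent: deque = deque(maxlen=window)
        self.bypass_until = float("-inf")

    def record_wakeup(self, ended_on_large: bool, now: float) -> None:
        self.recent.append(ended_on_large)
        if len(self.recent) == self.window and sum(self.recent) / self.window > self.majority:
            # Start a bypass period and begin collecting fresh statistics.
            self.bypass_until = now + self.bypass_duration
            self.recent.clear()

    def target(self, now: float) -> str:
        return "large" if now < self.bypass_until else "small"
```

Clearing the window when a bypass starts means the controller returns to small-core-first wakeups once the period expires, and then re-evaluates.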
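The hint path described in the preceding paragraph reduces to a simple override, shown here as a minimal sketch. The `simple_op_hint` flag and the accelerator names are hypothetical. The `fallback` parameter represents whatever target the PCU's other policies would otherwise choose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceleratorInterrupt:
    source: str           # e.g. "crypto" or "video" (illustrative names)
    simple_op_hint: bool  # hypothetical flag: requested operation is simple


def route_with_hint(irq: AcceleratorInterrupt, fallback: str = "large") -> str:
    """Send hinted interrupts to the small core; otherwise defer to the normal policy."""
    if irq.simple_op_hint:
        return "small"
    # No hint: use the target chosen by the PCU's other policies.
    return fallback
```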
Referring now to Fig. 9, shown is a timing diagram of operations occurring in a large core 710 and a small core 720 in accordance with an embodiment of the present invention. As seen, large core 710 can sleep for longer if device interrupts are delivered directly to small core 720 and the small core determines whether it can handle each one. If it can, large core 710 remains in a sleep state and the interrupt is handled in small core 720.
Referring now to Fig. 10, shown is a graphical illustration of power savings in accordance with an embodiment of the present invention. The middle view of Fig. 10 shows a conventional system that transitions from the active C0 state to a deep low power state (e.g., the C6 state). In that system, the large core's power consumption alternates between a relatively high level (e.g., 500 mW during each entry into the C0 state) and zero power consumption in C6. In contrast, in an embodiment of the present invention (bottom view), wakeups into the C0 state can be steered away from the large core to the small core. The small core can then handle the C0 state at a much lower power level than 500 mW, for example 10 mW in the embodiment of Fig. 10.
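As a rough worked example, the 500 mW and 10 mW figures from Fig. 10 can be compared over one wakeup. This assumes the small core spends the same time t in C0 that the large core would have. That is an assumption, since a slower small core could need longer. The value t = 1 ms is purely illustrative.

```latex
\begin{aligned}
E_{\text{large}} &= P_{\text{large}}\,t = 500\ \text{mW}\cdot t \\
E_{\text{small}} &= P_{\text{small}}\,t = 10\ \text{mW}\cdot t \\
\frac{E_{\text{small}}}{E_{\text{large}}} &= \frac{10\ \text{mW}}{500\ \text{mW}} = 0.02
  \;\Rightarrow\; \text{saving} = 1 - 0.02 = 98\% \\
t = 1\ \text{ms}:\quad E_{\text{large}} &= 0.5\ \text{mJ},\quad E_{\text{small}} = 0.01\ \text{mJ}
\end{aligned}
```

If the small core needed k times longer to finish the work, its energy would be 10k mW·t. It would still use less energy than the large core for any k < 50.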
Embodiments may be implemented in many different system types. Referring now to Fig. 11, shown is a block diagram of a system in accordance with an embodiment of the present invention. As shown in Fig. 11, multiprocessor system 600 is a point-to-point interconnect system and includes a first processor 670 and a second processor 680 coupled via a point-to-point interconnect 650. As shown in Fig. 11, each of processors 670 and 680 may be a multicore processor, including first and second processor cores (i.e., processor cores 674a and 674b and processor cores 684a and 684b), although potentially many more cores may be present in the processors. More specifically, each processor may include a mix of large cores, small cores (and possibly medium cores), accelerators and so forth. Each processor may also include logic that, at least while the large core is in a low power state, directs wakeups to the smallest available core, as described herein.
Still referring to Fig. 11, first processor 670 further includes a memory controller hub (MCH) 672 and point-to-point (P-P) interfaces 676 and 678. Similarly, second processor 680 includes an MCH 682 and P-P interfaces 686 and 688. As shown in Fig. 11, MCHs 672 and 682 couple the processors to respective memories, namely a memory 632 and a memory 634. These memories may be portions of system memory (e.g., DRAM) locally attached to the respective processors. First processor 670 and second processor 680 may be coupled to a chipset 690 via P-P interconnects 652 and 654, respectively. As shown in Fig. 11, chipset 690 includes P-P interfaces 694 and 698.
Furthermore, chipset 690 includes an interface 692 that couples chipset 690 with a high performance graphics engine 638 via a P-P interconnect 639. In turn, chipset 690 may be coupled to a first bus 616 via an interface 696. As shown in Fig. 11, various input/output (I/O) devices 614 may be coupled to first bus 616, along with a bus bridge 618 that couples first bus 616 to a second bus 620. Various devices may be coupled to second bus 620, including, for example, a keyboard/mouse 622, communication devices 626 and a data storage unit 628. Data storage unit 628 may be a disk drive or other mass storage device that may include code 630. Further, an audio I/O 624 may be coupled to second bus 620. Embodiments can be incorporated into other types of systems, including mobile devices such as smart cellular telephones, tablet computers and netbooks.
Embodiments may be implemented in code and may be stored on a non-transitory storage medium having instructions stored on it, which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to:
- any type of disk, including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks;
- semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories and electrically erasable programmable read-only memories (EEPROMs);
- magnetic or optical cards; or
- any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.

Claims (20)

1. A mobile device, comprising:
a device;
a dynamic random access memory (DRAM) coupled with the device; and
a data storage, wherein the device comprises:
a cryptography accelerator;
a video accelerator;
a memory controller; and
a processor, the processor comprising:
a first plurality of cores;
a second plurality of cores, the second plurality of cores being homogeneous with the first plurality of cores and having lower power consumption;
an interconnect to couple the first plurality of cores and the second plurality of cores, and a shared cache memory coupled with at least the first plurality of cores; and
logic to cause a core of the second plurality of cores to perform an operation, wherein, based at least in part on a performance level of the core of the second plurality of cores, the logic is to cause an execution state of the core of the second plurality of cores to be transferred to a core of the first plurality of cores, to cause the core of the first plurality of cores to perform the operation.
2. The mobile device of claim 1, wherein, when the core of the first plurality of cores and the core of the second plurality of cores are in a low power state, the logic is to cause the core of the second plurality of cores, rather than the core of the first plurality of cores, to be woken in response to an interrupt.
3. The mobile device of claim 2, wherein, when an entry of a table indicates that the core of the second plurality of cores generated an undefined fault in response to a previous interrupt of the same type as the interrupt, the logic is to cause the core of the first plurality of cores, rather than the core of the second plurality of cores, to be woken in response to the interrupt.
4. The mobile device of claim 2, wherein the logic is to provide, in response to the interrupt, a subset of an execution state of the core of the first plurality of cores to the core of the second plurality of cores.
5. The mobile device of claim 4, wherein, responsive to a determination that the core of the second plurality of cores cannot handle at least one requested operation, the logic is to obtain the subset of the execution state from the core of the second plurality of cores and to merge the subset of the execution state with a remainder of the execution state of the core of the first plurality of cores stored in a temporary storage area.
6. The mobile device of claim 2, wherein the video accelerator is to perform a task and to send the interrupt to the logic upon completion of the task.
7. The mobile device of claim 2, wherein the logic is to analyze a plurality of interrupts and, if a majority of the plurality of interrupts are to be handled by the core of the first plurality of cores, the logic is to wake the core of the first plurality of cores, rather than the core of the second plurality of cores, in response to the interrupt.
8. The mobile device of claim 1, wherein the processor comprises a multicore processor, and the logic comprises:
a wakeup logic;
a state transfer logic;
an undefined handling logic; and
an interrupt history storage.
9. The mobile device of claim 1, further comprising an interrupt controller to receive a plurality of interrupts and to direct at least one of the plurality of interrupts to one or more cores of the first plurality of cores and the second plurality of cores.
10. The mobile device of claim 1, wherein the mobile device comprises a smartphone.
11. The mobile device of claim 1, wherein the mobile device comprises a tablet computer.
12. The mobile device of claim 1, further comprising an audio device.
13. The mobile device of claim 1, wherein the core of the first plurality of cores further includes at least one cache memory.
14. A method, comprising:
causing a core of a second plurality of cores of a processor of a mobile device to perform an operation based at least in part on a performance level of the core of the second plurality of cores, the processor including a first plurality of cores, the second plurality of cores, an interconnect, and a shared cache memory coupled with at least the first plurality of cores, the second plurality of cores being homogeneous with the first plurality of cores and having lower power consumption, and the interconnect to couple the first plurality of cores and the second plurality of cores; and
causing an execution state of the core of the second plurality of cores to be transferred to a core of the first plurality of cores, to cause the core of the first plurality of cores to perform the operation.
15. The method of claim 14, further comprising, when the core of the first plurality of cores and the core of the second plurality of cores are in a low power state, causing the core of the second plurality of cores, rather than the core of the first plurality of cores, to be woken in response to an interrupt.
16. The method of claim 15, further comprising, when an entry of a table indicates that the core of the second plurality of cores generated an undefined fault in response to a previous interrupt of the same type as the interrupt, causing the core of the first plurality of cores, rather than the core of the second plurality of cores, to be woken in response to the interrupt.
17. The method of claim 15, further comprising providing, in response to the interrupt, a subset of an execution state of the core of the first plurality of cores to the core of the second plurality of cores.
18. At least one computer-readable storage medium comprising instructions that, when executed, enable a system to:
cause a core of a second plurality of cores of a processor of a mobile device to perform an operation based at least in part on a performance level of the core of the second plurality of cores, the processor including a first plurality of cores, the second plurality of cores, an interconnect, and a shared cache memory coupled with at least the first plurality of cores, the second plurality of cores being homogeneous with the first plurality of cores and having lower power consumption, and the interconnect to couple the first plurality of cores and the second plurality of cores; and
cause an execution state of the core of the second plurality of cores to be transferred to a core of the first plurality of cores, to cause the core of the first plurality of cores to perform the operation.
19. The at least one computer-readable storage medium of claim 18, further comprising instructions that, when executed, enable the system to wake the core of the second plurality of cores, rather than the core of the first plurality of cores, in response to an interrupt when the core of the first plurality of cores and the core of the second plurality of cores are in a low power state.
20. The at least one computer-readable storage medium of claim 19, further comprising instructions that, when executed, enable the system to cause the core of the first plurality of cores, rather than the core of the second plurality of cores, to be woken in response to the interrupt when an entry of a table indicates that the core of the second plurality of cores generated an undefined fault in response to a previous interrupt of the same type as the interrupt.
CN201610369305.9A 2011-09-06 2011-09-06 The processor architecture of power efficient Pending CN106095046A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610369305.9A CN106095046A (en) 2011-09-06 2011-09-06 The processor architecture of power efficient

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201180073263.XA CN103765409A (en) 2011-09-06 2011-09-06 Power efficient processor architecture
CN201610369305.9A CN106095046A (en) 2011-09-06 2011-09-06 The processor architecture of power efficient
CN201610364515.9A CN106020424B (en) 2011-09-06 2011-09-06 The processor architecture of power efficient

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201180073263.XA Division CN103765409A (en) 2011-09-06 2011-09-06 Power efficient processor architecture

Publications (1)

Publication Number Publication Date
CN106095046A true CN106095046A (en) 2016-11-09

Family

ID=57232809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610369305.9A Pending CN106095046A (en) 2011-09-06 2011-09-06 The processor architecture of power efficient

Country Status (1)

Country Link
CN (1) CN106095046A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1860446A (en) * 2003-12-16 2006-11-08 Apple Computer, Inc. Almost-symmetric multiprocessor that supports high-performance and energy-efficient execution
US20070245164A1 (en) * 2004-08-05 2007-10-18 Shuichi Mitarai Information Processing Device
US20080126747A1 (en) * 2006-11-28 2008-05-29 Griffen Jeffrey L Methods and apparatus to implement high-performance computing
US20080263324A1 (en) * 2006-08-10 2008-10-23 Sehat Sutardja Dynamic core switching
US20090248934A1 (en) * 2008-03-26 2009-10-01 International Business Machines Corporation Interrupt dispatching method in multi-core environment and multi-core processor
US20100030927A1 (en) * 2008-07-29 2010-02-04 Telefonaktiebolaget Lm Ericsson (Publ) General purpose hardware acceleration via deirect memory access
CN101923491A (en) * 2010-08-11 2010-12-22 Shanghai Jiao Tong University Thread group address space scheduling and thread switching method under multi-core environment

Similar Documents

Publication Publication Date Title
CN106155265A (en) The processor architecture of power efficient
CN104169832B (en) Providing energy efficient turbo operation of a processor
US20110238974A1 (en) Methods and apparatus to improve turbo performance for events handling
CN107209548A (en) Power management is performed in polycaryon processor
CN102566739A (en) Multicore processor system and dynamic power management method and control device thereof
CN106020424B (en) The processor architecture of power efficient
CN106095046A (en) The processor architecture of power efficient
CN114787777A (en) Task transfer method between heterogeneous processors
JP6409218B2 (en) Power efficient processor architecture
GB2537300A (en) Power efficient processor architecture
JP2017021811A (en) Power efficient processor architecture
JP2016212907A (en) Excellent power efficient processor architecture

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161109