CN101464813A - Automatic workload distribution system and method for multi-core processor - Google Patents

Automatic workload distribution system and method for multi-core processor

Info

Publication number
CN101464813A
Authority
CN
China
Prior art keywords
processor
application program
performance
instruction
resource load
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008101812684A
Other languages
Chinese (zh)
Inventor
小罗伯特·H·贝尔
小路易斯·B·卡普斯
托马斯·E·库克
托马斯·J·杜克特
内尔什·内亚
罗纳尔德·E·纽哈特
伯纳德特·A·皮尔逊
迈克尔·J·夏皮罗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Publication of CN101464813A

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a multiprocessor system that includes automatic workload distribution. As threads execute in the multiprocessor system, an operating system or hypervisor continuously learns the execution characteristics of each thread and saves the information in a thread-specific control block. The execution characteristics are used to generate thread performance data. As the thread executes, the operating system continuously uses the performance data to steer the thread to the core that will execute the workload most efficiently.

Description

System and method for automatic workload distribution on a multi-core processor
Technical field
The present invention relates to a method for automatic workload distribution on a multi-core processor, and more specifically to automating workload allocation on a multi-core processor.
Background art
In multi-core computer systems, different system resources (for example CPUs, memory, I/O bandwidth, disk storage devices, and so on) are each used to operate on multiple instruction threads. The challenges associated with operating these multi-core computer systems efficiently only increase as the number and complexity of cores in a multiprocessor computer grow.
One problem relating to the use of multi-core integrated circuits is that it is often difficult to write software that takes advantage of the multiple cores. To take advantage of a multi-core processor, tasks often need to be divided into threads, and the threads often need to be distributed onto the available cores. One problem relating to distributing threads is how to steer the threads efficiently. In known systems, workload is sent to cores based on availability and affinity. In other systems, software is written so that particular tasks run on a specific type of core. As the number and types of cores increase, there will be opportunities to distribute workload in a more intelligent manner.
Summary of the invention
In accordance with the present invention, a multi-core system that includes automatic workload distribution is set forth. More specifically, as threads execute in the multi-core system, the operating system/hypervisor continuously learns the execution characteristics of the threads and saves the information in thread-specific control blocks. The execution characteristics are used to generate thread performance data. As a thread executes, the operating system/hypervisor continuously uses the performance data to steer the thread to the core that will execute the workload most efficiently.
More specifically, in one embodiment, the invention relates to a method for automatic workload distribution within a multiprocessor system. The method includes: measuring the performance of an application while it executes on processors of the multiprocessor system; storing data relating to the performance of the application on the processors of the multiprocessor system; and assigning execution of the application to a processor having characteristics corresponding to the processing consumption attributes of the application.
In another embodiment, the invention relates to an apparatus for automatic workload distribution within a multi-core processor. The apparatus includes: means for measuring the performance of an application while it executes on processors of the multiprocessor system; means for storing data relating to the performance of the application on the processors of the multiprocessor system; and means for assigning execution of the application to a processor having characteristics corresponding to the processing consumption attributes of the application.
In another embodiment, the invention relates to a multi-core processor system that includes a plurality of processor cores and a memory. The memory stores an automatic workload distribution system. The automatic workload distribution system includes instructions executable by the multi-core processor for: measuring the performance of an application while it executes on processors of the multiprocessor system; storing data relating to the performance of the application on the processors of the multiprocessor system; and assigning execution of the application to a processor having characteristics corresponding to the processing consumption attributes of the application.
Description of the drawings
The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art, by referencing the accompanying drawings. The use of the same reference numbers throughout the several figures designates like or similar elements.
Fig. 1 illustrates a multiprocessor computer architecture in which selected embodiments of the present invention may be implemented; and
Fig. 2 illustrates a flow chart of the operation of the automatic workload distribution system on a multi-core processor.
Detailed description
Referring now to Fig. 1, a high-level block diagram is shown of a multiprocessor (MP) data processing system 100 that provides improved execution of single-threaded programs in accordance with selected embodiments of the present invention. The data processing system 100 has one or more processing units arranged in one or more processor groups; as depicted, it includes four processing units 111, 121, 131, 141 in processor group 110. In a symmetric multiprocessor (SMP) embodiment, all of the processing units 111, 121, 131, 141 are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. As shown with processing unit 111, each processing unit may include one or more processor cores 116a, 116b, which carry out program instructions in order to operate the computer. An exemplary processing unit is the POWER5™ processor marketed by International Business Machines Corporation, which comprises a single integrated circuit superscalar microprocessor having various execution units, registers, buffers, memories, and other functional units, all formed by integrated circuitry. The processor cores may operate according to reduced instruction set computing (RISC) techniques, and may employ both pipelining and out-of-order execution of instructions to further improve the performance of the superscalar architecture.
As further depicted in Fig. 1, each processor core 116a, 116b includes an on-board (L1) cache memory 119a, 119b (typically separate instruction and data caches) constructed from high-speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory 161. A processing unit can include another cache, such as a second-level (L2) cache 112, which, along with a cache memory controller (not shown), supports both of the L1 caches 119a, 119b that are respectively part of cores 116a and 116b. Additional cache levels may be provided, such as an L3 cache 166 that is accessible via fabric bus 150. Each cache level, from highest (L1) to lowest (L3), can successively store more information, but at the cost of a longer access time. For example, the on-board L1 caches (e.g., 119a) in the processor cores (e.g., 116a) might have a storage capacity of 128 kilobytes of memory, L2 cache 112 might have a storage capacity of 4 megabytes, and L3 cache 166 might have a storage capacity of 132 megabytes. To facilitate repair or replacement of defective processing unit components, each processing unit 111, 121, 131, 141 may be constructed in the form of a replaceable circuit board, pluggable module, or similar field replaceable unit (FRU), which can be easily swapped, installed in, or swapped out of system 100 in a modular fashion.
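The capacity trade-off described above can be illustrated with a small lookup. This is a minimal sketch that uses only the example capacities quoted in the description; the function name `smallest_level_that_fits` is hypothetical, and the model ignores associativity, line size, and sharing between cores, so it should be read as an illustration rather than a description of the POWER5 hardware.

```python
# Capacities are the example figures quoted in the description; real parts differ.
KIB = 1024
MIB = 1024 * KIB

# Ordered from the highest (fastest, smallest) level to the lowest.
CACHE_LEVELS = [
    ("L1", 128 * KIB),  # on-board cache per core (e.g., 119a)
    ("L2", 4 * MIB),    # supports the L1 caches of one processing unit (112)
    ("L3", 132 * MIB),  # reached over the fabric bus (166)
]


def smallest_level_that_fits(working_set_bytes: int) -> str:
    """Return the first (fastest) level whose capacity holds the working set."""
    for name, capacity in CACHE_LEVELS:
        if working_set_bytes <= capacity:
            return name
    # Nothing in the hierarchy is large enough, so accesses go to memory 161.
    return "system memory"
```

A thread whose working set is 1 megabyte would, under these assumptions, be served mostly from L2, which is one of the usage characteristics a scheduler could take into account.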
The processing units communicate with other components of system 100 via a system interconnect or fabric bus 150. Fabric bus 150 is connected to one or more service processors 160, a system memory device 161, a memory controller 162, a shared or L3 system cache 166, and/or various peripheral devices 169. A processor bridge 170 can optionally be used to interconnect additional processor groups. Though not shown, it will be understood that the data processing system 100 may also include firmware which stores the system's basic input/output logic, and which seeks out and loads an operating system from one of the peripherals whenever the computer system is first turned on (booted).
As depicted in Fig. 1, the data processing system 100 includes multiple system resources (e.g., cache memories, memory controllers, interconnects, I/O controllers, etc.) which are shared among multiple threads.
The system memory device 161 (random access memory or RAM) stores program instructions and operand data used by the processing units in a volatile (temporary) state, including the operating system 161A and application programs 161B. An automatic workload distribution module 161C may be stored in the system memory in any desired form, such as an operating system module, a hypervisor component, etc., and is used to optimize the execution of single-threaded programs across the multiple cores of the processing units. Although illustrated as a facility within system memory, those skilled in the art will appreciate that the automatic workload distribution module 161C may alternatively be implemented within another component of data processing system 100, or an automatic workload distribution unit may exist as a stand-alone unit (located either in the processor or outside of it). The automatic workload distribution module 161C is implemented as executable instructions, code, and/or control logic including programmable registers, which is operative to check performance monitor information for code running on the system 100, to assign priority values to the code using predetermined policies, and to tag each instruction with its assigned priority value so that the priority value is distributed across the system 100 with the instruction, as described more fully below.
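The policy-and-tag behaviour attributed to module 161C can be sketched in a few lines. The patent does not specify any policy, so everything here is an assumption made for illustration: the `stall_fraction` metric, the 0 to 3 priority scale, the thresholds, and the names `Instruction`, `stall_based_policy`, and `tag_instructions` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Instruction:
    opcode: str
    priority: Optional[int] = None  # tag that travels with the instruction


# A policy maps performance-monitor readings for a piece of code to a priority.
Policy = Callable[[Dict[str, float]], int]


def stall_based_policy(perf: Dict[str, float]) -> int:
    """Hypothetical policy: code that stalls less gets a higher priority (0-3)."""
    stall_fraction = perf.get("stall_fraction", 0.0)
    if stall_fraction < 0.10:
        return 3
    if stall_fraction < 0.30:
        return 2
    if stall_fraction < 0.60:
        return 1
    return 0


def tag_instructions(code: List[Instruction],
                     perf: Dict[str, float],
                     policy: Policy) -> List[Instruction]:
    """Apply one policy decision to the code and tag every instruction with it."""
    priority = policy(perf)
    return [Instruction(instr.opcode, priority) for instr in code]
```

Keeping the policy as a plain function makes it easy to swap in a different predetermined policy without touching the tagging step.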
The system 100 also includes a performance monitor 180. The performance monitor 180 may provide the performance information used by the automatic workload distribution module 161C when performing the automatic workload distribution function. More specifically, as threads execute in the multi-core system, the operating system/hypervisor continuously learns the execution characteristics of the threads and saves this information in thread-specific control blocks. The execution characteristics are used to generate thread performance data. As a thread executes, the operating system/hypervisor continuously uses this performance data to steer the thread to the core that will execute the workload most efficiently.
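One way to picture a thread-specific control block that accumulates execution characteristics is the sketch below. The class name `ThreadControlBlock`, the event names, and the choice to express performance data as event fractions are assumptions for illustration; the patent does not define the layout of the control block.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ThreadControlBlock:
    thread_id: int
    # Raw event counts accumulated from performance-monitor samples.
    counters: Dict[str, int] = field(default_factory=dict)

    def record_sample(self, sample: Dict[str, int]) -> None:
        """Add one performance-monitor sample to the running totals."""
        for event, count in sample.items():
            self.counters[event] = self.counters.get(event, 0) + count

    def performance_data(self) -> Dict[str, float]:
        """Fraction of all counted events contributed by each event type."""
        total = sum(self.counters.values())
        if total == 0:
            return {}
        return {event: count / total for event, count in self.counters.items()}
```

A scheduler could call `record_sample` whenever it reads the performance monitor and consult `performance_data` when it next decides where to dispatch the thread.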
Referring to Fig. 2, a flow chart of the operation of automatic workload distribution on a multi-core processor is shown. The automatic workload distribution function learns how workloads perform in the multi-core system and steers each workload to the optimal core. More specifically, the automatic workload distribution process begins by testing the performance of the multi-core processor. A multi-core processor may include homogeneous or heterogeneous cores. For processors that include homogeneous cores, each processor will perform differently due to variations in the semiconductor process. For example, at the same voltage level, one instance of a core may run faster than another instance of that core. Different cores may be able to execute at different frequencies. At step 210, these performance differences are measured (e.g., via performance monitor 180) at module or card test. The measurements are performed at step 220 by executing a targeted set of applications. The performance data is stored in ROM on the chip, on the module, or on the card. For processors that include heterogeneous cores, measurements of, for example, floating-point or vector performance may be made. Performance may also be assigned based on knowledge of the processor design.
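Steps 210 and 220 amount to building a per-core characterization table from benchmark results. The sketch below assumes the targeted applications report run times in seconds and normalizes them so that the fastest core for each benchmark scores 1.0; the function name `characterize_cores`, the benchmark labels, and the normalization are illustrative assumptions, and writing the table to ROM is outside the scope of the sketch.

```python
from typing import Dict


def characterize_cores(
    run_times: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Convert measured run times (seconds) into relative speed scores.

    run_times[core][benchmark] is the time a targeted benchmark took on a core.
    The score is fastest_time / core_time, so the fastest core scores 1.0.
    """
    benchmarks = {b for times in run_times.values() for b in times}
    fastest = {
        b: min(times[b] for times in run_times.values() if b in times)
        for b in benchmarks
    }
    return {
        core: {b: fastest[b] / t for b, t in times.items()}
        for core, times in run_times.items()
    }
```

For example, if a floating-point benchmark takes 2.0 s on one core and 4.0 s on another, the cores score 1.0 and 0.5 respectively for that category.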
The first time an application executes in the system, the performance monitor on the core measures characteristics of the system usage. The monitor analyzes, for example, single- or double-precision floating-point operations, memory usage (L1, L2, or main memory accesses), instructions that use single or multiple cycles, and other items. At step 230, the performance monitor learns the resource load that the application places on the system. At step 240, the application, subroutine, or thread is tagged and the data is retained. The performance monitor data is extracted from the performance monitor 180. This hardware performance data is stored in the thread's control data structure for use by the operating system/hypervisor/cluster scheduler. (The hardware performance data may also be used to characterize the performance of the cores of a processor, and the characterization information may be stored on the processor.) At step 250, the scheduler compares the hardware utilization statistics stored in the thread's control data structure against the characteristics of the processors in the system. At step 260, the operating system or hypervisor assigns the thread to the appropriate core, namely the one whose hardware capabilities best match the measured processing consumption attributes of the software workload.
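The comparison in step 250 and the assignment in step 260 can be sketched as a scoring problem. The patent does not say how the match is computed, so the weighted sum used here is an assumed stand-in, and the name `best_core` and the category labels are hypothetical.

```python
from typing import Dict


def best_core(thread_usage: Dict[str, float],
              core_capabilities: Dict[str, Dict[str, float]]) -> str:
    """Pick the core whose capabilities best match the thread's usage.

    thread_usage: fraction of the thread's work in each resource category.
    core_capabilities[core][category]: relative strength of that core (0..1).
    The match score is a weighted sum: usage fraction times capability.
    """
    def score(core: str) -> float:
        caps = core_capabilities[core]
        return sum(frac * caps.get(cat, 0.0) for cat, frac in thread_usage.items())

    # sorted() makes tie-breaking deterministic (the lowest core name wins).
    return max(sorted(core_capabilities), key=score)
```

With one core that is strong at floating point and another that is strong at memory access, a thread that spends 80% of its work in floating point is steered to the first core, and a thread that is 90% memory-bound is steered to the second.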
The scheduler may also use the data to intelligently combine workloads on a processor or core. For example, the automatic workload distribution module 161C may determine that it is more efficient to run a thread that frequently accesses memory on the same core or processor as a thread that accesses data out of the cache. The data may also be used to match cache latency requirements to caches having various latencies and sizes.
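A simple way to combine complementary workloads is to sort threads by how memory-bound they are and pair the extremes. This is a sketch under assumptions: `memory_intensity` is a hypothetical metric (for instance, main-memory accesses per thousand instructions), and the greedy pairing in `pair_threads` is one possible heuristic, not the method the patent prescribes.

```python
from typing import Dict, List, Tuple


def pair_threads(memory_intensity: Dict[str, float]) -> List[Tuple[str, str]]:
    """Pair the most memory-bound thread with the most cache-resident one.

    Each returned pair (memory_heavy, cache_friendly) would share a core or
    processor. With an odd thread count, the middle thread stays unpaired.
    """
    # Sort by intensity, then by name so equal intensities order predictably.
    ordered = sorted(memory_intensity, key=lambda t: (memory_intensity[t], t))
    pairs = []
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        pairs.append((ordered[hi], ordered[lo]))
        lo += 1
        hi -= 1
    return pairs
```

The intent is that the two threads in a pair compete less for the same resource than two memory-bound threads placed together would.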
The combination of processors having different processing characteristics, a low-level, non-intrusive processor or core performance monitoring capability, and a scheduling algorithm that makes dispatching decisions based on measured unit utilization characteristics in order to route work to the appropriate processor or core provides an advantageous automatic workload distribution system, where the processors are all part of a single system, cluster, or hypervisor execution complex. In addition, because the process is continuous and performance utilization data is collected during every timeslice, a thread or workload can autonomically migrate from processor to processor within the complex over time if its workload changes.
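The per-timeslice behaviour can be sketched by re-evaluating placement after every sample. The exponential moving average, the `alpha` smoothing factor, the weighted-sum score, and the function names below are all assumptions chosen for illustration; the patent only states that data is collected each timeslice and that threads can migrate as their workload changes.

```python
from typing import Dict, List


def smooth(previous: Dict[str, float], sample: Dict[str, float],
           alpha: float = 0.5) -> Dict[str, float]:
    """Exponential moving average of per-category usage across timeslices."""
    keys = sorted(set(previous) | set(sample))
    return {k: (1 - alpha) * previous.get(k, 0.0) + alpha * sample.get(k, 0.0)
            for k in keys}


def placement_over_time(samples: List[Dict[str, float]],
                        core_capabilities: Dict[str, Dict[str, float]],
                        alpha: float = 0.5) -> List[str]:
    """Return the core chosen after each timeslice's usage sample."""
    profile: Dict[str, float] = {}
    placements = []
    for sample in samples:
        # The first sample seeds the profile; later samples are blended in.
        profile = smooth(profile, sample, alpha) if profile else dict(sample)

        def score(core: str) -> float:
            caps = core_capabilities[core]
            return sum(v * caps.get(k, 0.0) for k, v in profile.items())

        placements.append(max(sorted(core_capabilities), key=score))
    return placements
```

Because the profile is smoothed, a thread that switches from floating-point work to memory-bound work does not move immediately; in this sketch it migrates once the blended profile favours the other core, which limits migrations caused by brief phases.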
Those skilled in the art will appreciate that data processing system 100 may include many additional or fewer components, such as I/O adapters, interconnect bridges, non-volatile storage, ports for connection to networks or attached devices, and so on. Because such components are not necessary for an understanding of the present invention, they are not illustrated in Fig. 1 or discussed further herein. However, it should also be understood that the enhancements provided by the present invention are applicable to multithreaded data processing systems of any architecture and are in no way limited to the generalized MP architecture illustrated in Fig. 1.
Consequently, the invention is intended to be limited only by the spirit and scope of the appended claims, giving full cognizance to equivalents in all respects.

Claims (19)

1. A method for automatic workload distribution in a multiprocessor system, comprising:
measuring the performance of an application while it executes on processors of the multiprocessor system;
storing data relating to the performance of the application on the processors of the multiprocessor system; and
assigning execution of the application to a processor having characteristics corresponding to the processing consumption attributes of the application.
2. The method of claim 1, wherein:
the data is stored in a control data structure of the corresponding application.
3. The method of claim 2, further comprising:
comparing the hardware utilization statistics stored in the control data structure of the application against the characteristics of the processors in the system.
4. The method of claim 1, further comprising:
learning the resource load that the application places on the multiprocessor system; and
taking the resource load into account when assigning execution of the application based on the resource load.
5. The method of claim 4, wherein:
learning the resource load comprises determining at least one of single- or double-precision floating-point operations, memory usage, and instructions that use single or multiple cycles.
6. The method of claim 1, wherein:
the multiprocessor system comprises a performance monitor; and
the measuring is performed by the performance monitor of the multiprocessor system.
7. The method of claim 1, wherein:
at least one of the processors of the multiprocessor system comprises a plurality of cores; and
the measuring comprises measuring the performance of the application while it executes on the plurality of cores of the at least one processor; and further comprising:
characterizing the performance of the plurality of cores based on the measuring; and
storing characterization information relating to the performance of the plurality of cores of the at least one processor.
8. An apparatus for automatic workload distribution in a multi-core processor, comprising:
means for measuring the performance of an application while it executes on processors of a multiprocessor system;
means for storing data relating to the performance of the application on the processors of the multiprocessor system; and
means for assigning execution of the application to a processor having characteristics corresponding to the processing consumption attributes of the application.
9. The apparatus of claim 8, wherein:
the data is stored in a control data structure of the corresponding application.
10. The apparatus of claim 9, further comprising:
means for comparing the hardware utilization statistics stored in the control data structure of the application against the characteristics of the processors in the system.
11. The apparatus of claim 8, further comprising:
means for learning the resource load that the application places on the multiprocessor system; and
means for taking the resource load into account when assigning execution of the application based on the resource load.
12. The apparatus of claim 11, wherein:
learning the resource load comprises determining at least one of single- or double-precision floating-point operations, memory usage, and instructions that use single or multiple cycles.
13. The apparatus of claim 8, wherein:
the multiprocessor system comprises a performance monitor; and
the measuring is performed by the performance monitor of the multiprocessor system.
14. A multi-core processor system, comprising:
a plurality of processor cores;
a memory storing an automatic workload distribution system, the automatic workload distribution system comprising instructions executable by the multi-core processor for:
measuring the performance of an application while it executes on processors of a multiprocessor system;
storing data relating to the performance of the application on the processors of the multiprocessor system; and
assigning execution of the application to a processor having characteristics corresponding to the processing consumption attributes of the application.
15. The multi-core processor system of claim 14, wherein:
the data is stored in a control data structure of the corresponding application.
16. The multi-core processor system of claim 14, wherein the automatic workload distribution system further comprises instructions for:
comparing the hardware utilization statistics stored in the control data structure of the application against the characteristics of the processors in the system.
17. The multi-core processor system of claim 14, wherein the automatic workload distribution system further comprises instructions for:
learning the resource load that the application places on the multiprocessor system; and
taking the resource load into account when assigning execution of the application based on the resource load.
18. The multi-core processor system of claim 17, wherein the instructions for learning the resource load further comprise instructions for determining at least one of single- or double-precision floating-point operations, memory usage, and instructions that use single or multiple cycles.
19. The multi-core processor system of claim 14, further comprising:
a performance monitor; and wherein
the instructions for measuring cause the performance monitor to measure the performance of the application.
CNA2008101812684A 2007-12-19 2008-11-18 Automatic workload distribution system and method for multi-core processor Pending CN101464813A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US96952007A 2007-12-19 2007-12-19
US11/969,520 2007-12-19

Publications (1)

Publication Number Publication Date
CN101464813A true CN101464813A (en) 2009-06-24

Family

ID=40805406

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008101812684A Pending CN101464813A (en) 2007-12-19 2008-11-18 Automatic workload distribution system and method for multi-core processor

Country Status (1)

Country Link
CN (1) CN101464813A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102648453A (en) * 2009-11-24 2012-08-22 超威半导体公司 Distributed multi-core memory initialization
CN102929725A (en) * 2012-11-12 2013-02-13 中国人民解放军海军工程大学 Dynamic reconfiguration method of signal processing parallel computing software
CN103502906A (en) * 2011-03-30 2014-01-08 讯宝科技公司 Dynamic allocation of processor cores running an operating system
CN103649938A (en) * 2011-07-26 2014-03-19 国际商业机器公司 Managing workloads in a multiprocessing computer system
CN103645954A (en) * 2013-11-21 2014-03-19 华为技术有限公司 CPU scheduling method, device and system based on heterogeneous multi-core system
CN104969182A (en) * 2012-12-28 2015-10-07 英特尔公司 High dynamic range software-transparent heterogeneous computing element processors, methods, and systems
CN105980988A (en) * 2014-02-07 2016-09-28 华为技术有限公司 Methods and systems for dynamically allocating resources and tasks among database work agents in smp environment

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102648453B (en) * 2009-11-24 2015-11-25 超威半导体公司 The method and apparatus of initialize memory
CN102648453A (en) * 2009-11-24 2012-08-22 超威半导体公司 Distributed multi-core memory initialization
CN103502906B (en) * 2011-03-30 2016-02-03 讯宝科技公司 The dynamic assignment of the processor core of operation system
CN103502906A (en) * 2011-03-30 2014-01-08 讯宝科技公司 Dynamic allocation of processor cores running an operating system
CN103649938B (en) * 2011-07-26 2016-01-20 国际商业机器公司 Management work load in multiprocessing computer system
CN103649938A (en) * 2011-07-26 2014-03-19 国际商业机器公司 Managing workloads in a multiprocessing computer system
CN102929725B (en) * 2012-11-12 2015-07-08 中国人民解放军海军工程大学 Dynamic reconfiguration method of signal processing parallel computing software
CN102929725A (en) * 2012-11-12 2013-02-13 中国人民解放军海军工程大学 Dynamic reconfiguration method of signal processing parallel computing software
CN104969182A (en) * 2012-12-28 2015-10-07 英特尔公司 High dynamic range software-transparent heterogeneous computing element processors, methods, and systems
US10162687B2 (en) 2012-12-28 2018-12-25 Intel Corporation Selective migration of workloads between heterogeneous compute elements based on evaluation of migration performance benefit and available energy and thermal budgets
WO2015074393A1 (en) * 2013-11-21 2015-05-28 华为技术有限公司 Cpu scheduling method, apparatus and system based on heterogeneous multi-core system
CN103645954A (en) * 2013-11-21 2014-03-19 华为技术有限公司 CPU scheduling method, device and system based on heterogeneous multi-core system
CN105980988A (en) * 2014-02-07 2016-09-28 华为技术有限公司 Methods and systems for dynamically allocating resources and tasks among database work agents in smp environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20090624