CN101464813A - Automatic workload distribution system and method for multi-core processor

Info

Publication number: CN101464813A
Application number: CNA2008101812684A
Authority: CN
Inventors: Robert H. Bell Jr.; Louis B. Capps Jr.; Thomas E. Cook; Thomas J. Dewkett; Naresh Nayar; Ronald E. Newhart; Bernadette A. Pierson; Michael J. Shapiro
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2007-12-19
Filing date: 2008-11-18
Publication date: 2009-06-24

Abstract

The invention discloses a multi-core system that includes automatic workload distribution. As threads execute in the multi-core system, an operating system or hypervisor continuously learns the execution characteristics of each thread and saves this information in a thread-specific control block. The execution characteristics are used to generate thread performance data. As the thread executes, the operating system continuously uses the performance data to steer the thread to the core that will execute the workload most efficiently.

Description

System and method for automatic workload distribution on a multi-core processor

Technical field

The present invention relates to a method for automatic workload distribution on a multi-core processor, and more particularly to automating workload distribution on a multi-core processor.

Background technology

In multi-core computer systems, different system resources (for example CPUs, memory, I/O bandwidth, disk storage and so on) are each used to operate on multiple instruction threads. The challenges associated with operating these multi-core computer systems efficiently only increase as the number and complexity of the cores in a multiprocessor computer grow.

One problem associated with the use of multi-core integrated circuits is that it is often difficult to write software that takes advantage of the multiple cores. To take advantage of a multi-core processor, tasks usually need to be divided into threads, and the threads usually need to be distributed onto the available cores. One problem associated with distributing threads is how to steer them efficiently. In known systems, workloads are sent to cores based on availability and affinity. In other systems, software is written so that particular tasks run on particular types of cores. As the number and types of cores increase, there will be an opportunity to distribute workloads in a more intelligent way.

Summary of the invention

In accordance with the present invention, a multi-core system that includes automatic workload distribution is set forth. More specifically, as threads execute in the multi-core system, an operating system/hypervisor continuously learns the execution characteristics of the threads and saves the information in thread-specific control blocks. The execution characteristics are used to generate thread performance data. As a thread executes, the operating system/hypervisor continuously uses the performance data to steer the thread to the core that will execute the workload most efficiently.

More specifically, in one embodiment, the invention relates to a method for automatic workload distribution in a multiprocessor system. The method includes: measuring the performance of an application program as it executes on processors of the multiprocessor system; storing data relating to the performance of the application program on the processors of the multiprocessor system; and assigning execution of the application program to a processor having characteristics that correspond to the processing consumption attributes of the application program.

In another embodiment, the invention relates to an apparatus for automatic workload distribution in a multi-core processor. The apparatus includes: means for measuring the performance of an application program as it executes on processors of a multiprocessor system; means for storing data relating to the performance of the application program on the processors of the multiprocessor system; and means for assigning execution of the application program to a processor having characteristics that correspond to the processing consumption attributes of the application program.

In another embodiment, the invention relates to a multi-core processor system that includes a plurality of processor cores and a memory. The memory stores an automatic workload distribution system. The automatic workload distribution system includes instructions executable by the multi-core processor for: measuring the performance of an application program as it executes on processors of the multiprocessor system; storing data relating to the performance of the application program on the processors of the multiprocessor system; and assigning execution of the application program to a processor having characteristics that correspond to the processing consumption attributes of the application program.

Description of drawings

The present invention may be better understood, and its numerous objects, features and advantages made apparent to those skilled in the art, by referencing the accompanying drawings. The use of the same reference number throughout the several figures designates a like or similar element.

Fig. 1 illustrates a multiprocessor computer architecture in which selected embodiments of the present invention may be implemented; and

Fig. 2 illustrates a flowchart of the operation of an automatic workload distribution system on a multi-core processor.

Detailed description

Referring now to Fig. 1, there is illustrated a high-level block diagram of a multiprocessor (MP) data processing system 100 that provides improved execution of single-threaded programs in accordance with selected embodiments of the present invention. The data processing system 100 has one or more processing units arranged in one or more processor groups and, as depicted, includes four processing units 111, 121, 131, 141 in processor group 110. In a symmetric multiprocessor (SMP) embodiment, all of the processing units 111, 121, 131, 141 are generally identical; that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. As shown with processing unit 111, each processing unit may include one or more processor cores 116a, 116b, which carry out program instructions in order to operate the computer. An exemplary processing unit is the POWER5™ processor marketed by International Business Machines Corporation, which comprises a single integrated circuit superscalar microprocessor having various execution units, registers, buffers, memories and other functional units, all formed by integrated circuitry. The processor cores may operate according to reduced instruction set computing (RISC) techniques, and may employ both pipelining and out-of-order execution of instructions to further improve the performance of the superscalar architecture.

As further depicted in Fig. 1, each processor core 116a, 116b includes an on-board (L1) cache memory 119a, 119b (typically separate instruction and data caches) constructed from high-speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory 161. A processing unit can include another cache, such as a second-level (L2) cache 112, which, together with a cache memory controller (not shown), supports both of the L1 caches 119a, 119b that are respectively part of cores 116a and 116b. Additional cache levels may be provided, such as an L3 cache 166 which is accessible via fabric bus 150. Each cache level, from highest (L1) to lowest (L3), can successively store more information, but at the cost of a longer access time. For example, the on-board L1 caches (e.g., 119a) in the processor cores (e.g., 116a) might have a storage capacity of 128 kilobytes, L2 cache 112 might have a storage capacity of 4 megabytes, and L3 cache 166 might have a storage capacity of 132 megabytes. To facilitate repair or replacement of defective processing unit components, each processing unit 111, 121, 131, 141 may be constructed in the form of a replaceable circuit board, pluggable module, or similar field replaceable unit (FRU), which can be easily swapped, installed in, or swapped out of system 100 in a modular fashion.
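The capacity side of this trade-off can be sketched in a few lines of Python. The capacities are the example values given above; the function name and the rule "pick the first level that holds the working set" are illustrative assumptions, and the sketch ignores sharing between cores, associativity and line size.

```python
KIB = 1024
MIB = 1024 * KIB

# Example capacities from the text, ordered from highest (L1) to lowest (L3).
CACHE_LEVELS = [
    ("L1", 128 * KIB),
    ("L2", 4 * MIB),
    ("L3", 132 * MIB),
]


def smallest_level_that_fits(working_set_bytes: int) -> str:
    """Return the first (fastest) level whose capacity holds the working set.

    Hypothetical helper: a working set larger than every cache level falls
    through to system memory, which has the longest access time.
    """
    for name, capacity in CACHE_LEVELS:
        if working_set_bytes <= capacity:
            return name
    return "system memory"
```

For instance, a 1-megabyte working set does not fit in the 128-kilobyte L1 cache but does fit in the 4-megabyte L2 cache, so it is served at L2 latency at best.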

The processing units communicate with other components of system 100 via a system interconnect or fabric bus 150. Fabric bus 150 is connected to one or more service processors 160, a system memory device 161, a memory controller 162, a shared or L3 system cache 166, and/or various peripheral devices 169. A processor bridge 170 can optionally be used to interconnect additional processor groups. Though not shown, it will be understood that the data processing system 100 may also include firmware that stores the system's basic input/output logic and that seeks out and loads an operating system from one of the peripherals whenever the computer system is first turned on (booted).

As depicted in Fig. 1, the data processing system 100 includes multiple system resources (for example cache memories, memory controllers, interconnects, I/O controllers and so on) that are shared among multiple threads.

The system memory device 161 (random access memory, or RAM) stores, in a volatile (temporary) state, the program instructions and operand data used by the processing units, including the operating system 161A and application programs 161B. The automatic workload distribution module 161C may be stored in the system memory in any desired form, such as an operating system module or a hypervisor component, and is used to optimize the execution of single-threaded programs across the multiple cores of the processor units. Although it is illustrated as a facility within system memory, those skilled in the art will appreciate that the automatic workload distribution module 161C may alternatively be implemented within another component of the data processing system 100, or that an automatic workload distribution unit may exist as a stand-alone unit (located within the processor or external to it). The automatic workload distribution module 161C is implemented as executable instructions, code and/or control logic including programmable registers. This control logic operates to check performance monitor information for code running on the system 100, to assign priority values to the code using predetermined policies, and to tag each instruction with its assigned priority value so that the priority value is distributed across the system 100 with the instruction, as described more fully below.

The system 100 also includes a performance monitor 180. The performance monitor 180 provides the performance information used by the automatic workload distribution module 161C when performing the automatic workload distribution function. More specifically, as threads execute in the multi-core system, the operating system/hypervisor continuously learns the execution characteristics of the threads and saves the information in thread-specific control blocks. The execution characteristics are used to generate thread performance data. As a thread executes, the operating system/hypervisor continuously uses the performance data to steer the thread to the core that will execute the workload most efficiently.

Referring to Fig. 2, a flowchart of the operation of automatic workload distribution on a multi-core processor is shown. The automatic workload distribution function learns how workloads perform in a multi-core system and steers each workload to the optimal core. More specifically, the automatic workload distribution process starts by testing the performance of the multi-core processor. The multi-core processor may include either homogeneous or heterogeneous cores. With a processor that includes homogeneous cores, each processor will perform differently owing to variations in the semiconductor process. For example, one core may be able to run faster than another instance of the same core at the same voltage level, and different cores may be able to execute at different frequencies. At step 210, these performance differences are measured at module or card test (for example via the performance monitor 180). At step 220, the measurements are performed by executing targeted application sets. The performance data is stored in an on-chip, on-module or on-card ROM. For a processor that includes heterogeneous cores, measurements of floating-point or vector performance, for example, may be performed. Performance may also be allocated based on knowledge of the processor design.
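The characterization pass of steps 210 and 220 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `CoreProfile` and `characterize_cores`, the benchmark names and all score values are assumptions, and the simulated lookup table stands in for actually running targeted application sets under the performance monitor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CoreProfile:
    """Hypothetical per-core characterization record (steps 210-220)."""
    core_id: int
    scores: Dict[str, float]  # benchmark name -> measured score (higher is better)


def characterize_cores(
    core_ids: List[int],
    benchmarks: Dict[str, Callable[[int], float]],
) -> Dict[int, CoreProfile]:
    """Run every targeted benchmark on every core and keep the results.

    Each benchmark callable takes a core id and returns a score. The
    returned table stands in for the data kept in on-chip, on-module or
    on-card ROM.
    """
    table = {}
    for core_id in core_ids:
        scores = {name: run(core_id) for name, run in benchmarks.items()}
        table[core_id] = CoreProfile(core_id, scores)
    return table


# Simulated measurements (invented values): cores 0 and 1 are nominally
# identical but differ slightly because of process variation; core 2 is a
# heterogeneous, vector-oriented core.
SIMULATED = {
    "integer": {0: 1.00, 1: 1.08, 2: 0.70},
    "floating_point": {0: 1.00, 1: 1.05, 2: 0.90},
    "vector": {0: 0.40, 1: 0.42, 2: 1.60},
}

benchmarks = {
    name: (lambda core_id, per_core=per_core: per_core[core_id])
    for name, per_core in SIMULATED.items()
}

core_table = characterize_cores([0, 1, 2], benchmarks)
```

In this sketch `core_table[1].scores["integer"]` is slightly higher than the same score for core 0, which is the kind of difference between nominally identical cores that the test-time measurement is meant to capture.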

The first time an application program is executed on the system, the performance monitor on the core measures the characteristics of its system usage. The monitor analyzes, for example, single- or double-precision floating-point operations, memory usage (L1, L2 or main memory accesses), instructions that use single or multiple cycles, and other items. At step 230, the performance monitor learns the resource load that the application program places on the system. At step 240, the application program, subroutine or thread is tagged and the data is retained. The performance monitor data is extracted from the performance monitor 180. This hardware performance data is stored in the thread's control data structure for use by the operating system, hypervisor or cluster scheduler. (The hardware performance data may also be used to characterize the performance of the cores of the processor, and the characterization information may be stored on the processor.) At step 250, the scheduler compares the hardware utilization statistics stored in the thread's control data structure with the characteristics of the processors in the system. At step 260, the operating system or hypervisor assigns the thread to the core whose hardware capabilities best match the measured processing consumption attributes of the workload.
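Steps 230 through 260 can be sketched as a small matching routine. The text does not specify how the comparison at step 250 is computed, so the weighted-sum score below is an assumption, as are the names `ThreadControlBlock`, `match_score` and `assign_core` and all numeric values.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ThreadControlBlock:
    """Hypothetical thread control data structure holding measured usage."""
    thread_id: int
    # Fraction of the thread's work in each resource class (steps 230-240).
    usage: Dict[str, float] = field(default_factory=dict)


def match_score(usage: Dict[str, float], capability: Dict[str, float]) -> float:
    """Assumed metric: core capability weighted by what the thread uses."""
    return sum(weight * capability.get(resource, 0.0)
               for resource, weight in usage.items())


def assign_core(tcb: ThreadControlBlock,
                cores: Dict[int, Dict[str, float]]) -> int:
    """Steps 250-260: compare stored usage with each core, pick the best.

    Ties go to the lowest core id so the result is deterministic.
    """
    return max(sorted(cores), key=lambda cid: match_score(tcb.usage, cores[cid]))


# Invented core characteristics: two general-purpose cores and one
# vector-oriented core.
cores = {
    0: {"integer": 1.00, "floating_point": 1.00, "vector": 0.40},
    1: {"integer": 1.08, "floating_point": 1.05, "vector": 0.42},
    2: {"integer": 0.70, "floating_point": 0.90, "vector": 1.60},
}

fp_thread = ThreadControlBlock(1, {"floating_point": 0.8, "integer": 0.2})
vector_thread = ThreadControlBlock(2, {"vector": 0.7, "integer": 0.3})
```

With these values the floating-point-heavy thread is assigned to core 1 and the vector-heavy thread to core 2. A real scheduler would also have to weigh core availability and current load, which this sketch leaves out.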

The scheduler may also use the data to intelligently combine workloads on a processor or core. For example, the automatic workload distribution module 161C may determine that it is more efficient to execute a thread that frequently accesses memory together with a thread that accesses data out of the cache on the same core or processor. The data can also be used to match cache latency performance to caches with various latencies and sizes.
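One way to picture this kind of workload combination is a pairing heuristic that co-schedules the most memory-bound thread with the most cache-resident one. The heuristic, the function name `pair_threads`, the thread names and the rates are all illustrative assumptions; the text does not prescribe a particular pairing rule.

```python
from typing import Dict, List, Tuple


def pair_threads(main_memory_rate: Dict[str, float]) -> List[Tuple[str, ...]]:
    """Pair the most memory-bound thread with the most cache-resident one.

    main_memory_rate maps a thread name to a hypothetical measured rate of
    main-memory accesses (for example, accesses per 1000 instructions).
    With an odd number of threads, the median thread runs alone.
    """
    ordered = sorted(main_memory_rate,
                     key=lambda name: (main_memory_rate[name], name))
    groups: List[Tuple[str, ...]] = []
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        # Highest remaining rate shares a core with the lowest remaining rate.
        groups.append((ordered[hi], ordered[lo]))
        lo += 1
        hi -= 1
    if lo == hi:
        groups.append((ordered[lo],))
    return groups


# Invented measurements for five threads.
rates = {
    "db_scan": 42.0,
    "matrix_tile": 1.5,
    "log_parser": 18.0,
    "crypto": 0.8,
    "ui": 6.0,
}
```

Here `pair_threads(rates)` groups `db_scan` with `crypto` and `log_parser` with `matrix_tile`, and leaves `ui` on its own, so that no core hosts two threads that both contend heavily for main memory.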

An advantageous automatic workload distribution system results from the combination of processors with different processing characteristics, low-level non-intrusive processor or core performance monitoring capabilities, and a scheduling algorithm that makes dispatching decisions based on measured unit utilization characteristics so as to route work to the appropriate processor or core. The processors are all part of a single system, cluster or hypervisor execution complex. In addition, because this process is continuous and performance utilization data is gathered during every timeslice, a thread or workload whose behavior changes can autonomously move from processor to processor within the complex over time.
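The continuous, per-timeslice behavior can be sketched as a loop that refreshes a thread's stored profile after every timeslice and re-evaluates its placement. The exponential moving average, the smoothing factor `alpha`, the two-core table and the sample values are assumptions chosen for illustration; the text only states that data is gathered every timeslice and that threads may move as their workload changes.

```python
from typing import Dict, List

# Invented core characteristics.
CORES: Dict[int, Dict[str, float]] = {
    0: {"integer": 1.1, "vector": 0.4},  # hypothetical general-purpose core
    1: {"integer": 0.7, "vector": 1.6},  # hypothetical vector-oriented core
}


def update_profile(profile: Dict[str, float],
                   sample: Dict[str, float],
                   alpha: float = 0.5) -> Dict[str, float]:
    """Blend the newest timeslice sample into the stored profile (assumed EMA)."""
    keys = set(profile) | set(sample)
    return {k: (1 - alpha) * profile.get(k, 0.0) + alpha * sample.get(k, 0.0)
            for k in keys}


def best_core(profile: Dict[str, float]) -> int:
    """Pick the core whose capabilities best match the current profile."""
    def score(cid: int) -> float:
        return sum(w * CORES[cid].get(r, 0.0) for r, w in profile.items())
    return max(sorted(CORES), key=score)


def run(samples: List[Dict[str, float]]) -> List[int]:
    """Return the core chosen after each timeslice."""
    profile: Dict[str, float] = {}
    placements = []
    for sample in samples:
        profile = update_profile(profile, sample)
        placements.append(best_core(profile))
    return placements


# A thread that is integer-bound for three timeslices, then vector-bound.
samples = [{"integer": 1.0, "vector": 0.0}] * 3 + [{"integer": 0.0, "vector": 1.0}] * 3
placements = run(samples)
```

In this run the thread stays on core 0 while it is integer-bound and moves to core 1 in the first timeslice after its behavior changes. A smaller `alpha` would delay the move, trading responsiveness for fewer migrations.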

Those skilled in the art will appreciate that the data processing system 100 can include many additional or fewer components, such as I/O adapters, interconnect bridges, non-volatile storage devices, ports for connection to networks or attached devices, and so on. Because such components are not necessary for an understanding of the present invention, they are not illustrated in Fig. 1 or discussed further herein. It should also be understood, however, that the enhancements provided by the present invention are applicable to multi-threaded data processing systems of any architecture and are in no way limited to the generalized MP architecture illustrated in Fig. 1.

Consequently, the invention is intended to be limited only by the spirit and scope of the appended claims, giving full cognizance to equivalents in all respects.

Claims

1. A method for automatic workload distribution in a multiprocessor system, comprising:

measuring the performance of an application program as it executes on processors of the multiprocessor system;

storing data relating to the performance of the application program on the processors of the multiprocessor system; and

assigning execution of the application program to a processor having characteristics that correspond to the processing consumption attributes of the application program.

2. The method of claim 1, wherein:

the storing is within a control data structure of the corresponding application program.

3. The method of claim 2, further comprising:

comparing hardware utilization statistics stored in the control data structure of the application program with characteristics of the processors in the system.

4. The method of claim 1, further comprising:

learning the resource load that the application program places on the multiprocessor system; and

taking the resource load into account when assigning execution of the application program based on the resource load.

5. The method of claim 4, wherein:

learning the resource load comprises determining at least one of single- or double-precision floating-point operations, memory usage, and instructions that use single or multiple cycles.

6. The method of claim 1, wherein:

the multiprocessor system comprises a performance monitor; and

the measuring is performed by the performance monitor of the multiprocessor system.

7. The method of claim 1, wherein:

at least one of the processors of the multiprocessor system comprises a plurality of cores; and

the measuring comprises measuring the performance of the application program as it executes on the plurality of cores of the at least one processor; and further comprising:

characterizing the performance of the plurality of cores based on the measuring; and

storing characterization information relating to the performance of the plurality of cores of the at least one processor.

8. An apparatus for automatic workload distribution in a multi-core processor, comprising:

means for measuring the performance of an application program as it executes on processors of a multiprocessor system;

means for storing data relating to the performance of the application program on the processors of the multiprocessor system; and

means for assigning execution of the application program to a processor having characteristics that correspond to the processing consumption attributes of the application program.

9. The apparatus of claim 8, wherein:

10. The apparatus of claim 9, further comprising:

means for comparing hardware utilization statistics stored in the control data structure of the application program with characteristics of the processors in the system.

11. The apparatus of claim 8, further comprising:

means for learning the resource load that the application program places on the multiprocessor system; and

means for taking the resource load into account when assigning execution of the application program based on the resource load.

12. The apparatus of claim 11, wherein:

13. The apparatus of claim 8, wherein:

the multiprocessor system comprises a performance monitor; and

14. A multi-core processor system, comprising:

a plurality of processor cores;

a memory, the memory storing an automatic workload distribution system, the automatic workload distribution system comprising instructions executable by the multi-core processor for:

15. The multi-core processor system of claim 14, wherein:

16. The multi-core processor system of claim 14, wherein the automatic workload distribution system further comprises instructions for:

17. The multi-core processor system of claim 14, wherein the automatic workload distribution system further comprises instructions for:

18. The multi-core processor system of claim 17, wherein the instructions for learning the resource load further comprise instructions for determining at least one of single- or double-precision floating-point operations, memory usage, and instructions that use single or multiple cycles.

19. The multi-core processor system of claim 14, further comprising:

a performance monitor; and wherein

the instructions for measuring cause the performance monitor to measure the performance of the application program.