CN1308839C - Line-stage distribution system and method in multiple processor system - Google Patents


Info

Publication number
CN1308839C
CN1308839C · CNB031563465A · CN03156346A
Authority
CN
China
Prior art keywords
processor
processors
execution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB031563465A
Other languages
Chinese (zh)
Other versions
CN1489062A (en)
Inventor
上田真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Publication of CN1489062A
Application granted
Publication of CN1308839C
Anticipated expiration
Expired - Fee Related

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

To optimize system throughput, the invention provides a system for distributing the threads of a process among the plurality of processors of a multiprocessor system, comprising: a determining device for determining the processing efficiency of the process, and a selecting device for selecting one processor or a plurality of processors to execute the process based on the result of the determining device.

Description

Thread distribution system and distribution method in a multiprocessor system
Technical field
The present invention relates to a system and a distribution method for distributing threads among a plurality of processes (or programs).
Background art
Multiprocessor systems have been used successfully as execution environments for multithreaded programs, typically in Web server applications. In such cases, the processes executed by the system are server programs such as proxy servers and firewalls. When distributed processing is performed, a single server program is sometimes executed by a single system.
A multithreaded Web server program repeatedly creates and cancels tasks in response to asynchronous requests from a number of clients. On the other hand, a server such as a CPU (central processing unit) server or CPU array executes a number of programs of different natures, for example scientific computation and computer-aided design (CAD).
When a single process containing a multithreaded program is executed, its execution speed is greatly improved by assigning all threads of the process to the assignable processors. When a plurality of processes exist, however, the threads must be distributed by a method that optimizes system throughput.
When the 8 processors of a multiprocessor system independently execute 8 processes, throughput is markedly lower than when 8 independent processors each execute one process. The reason for this efficiency degradation is the contention produced by the sharing of user space.
In multithreaded programs, thread switching is faster than program switching. In addition, because different threads share the same user space, data can be passed between threads in the form of a pointer. Data transfer between threads is faster than data transfer between programs, which requires a copy operation. Because pointer-based data transfer between threads executing on different processors requires no copy operation at the hardware level, a multiprocessor system used to execute multithreaded programs preferably has cache-coherent shared memory.
As a result of optimizing the execution of multithreaded programs, hardware-level cache coherence support has been added to multiprocessor systems. Maintaining cache coherence at the hardware level suits the frequent data transfers between threads, and is particularly useful for inter-thread communication.
If data accesses using the cache coherency protocol are always applied even to data exchanged between independent programs that communicate infrequently, unnecessary processing is added from the standpoint of speed. In a multiprocessor system of the CC-NUMA (cache-coherent non-uniform memory access) type, an L2 (level-two) cache miss causes an access to the tag memory of the home node (the memory that manages the data), which automatically identifies the owner node holding the latest data. With a snooping cache, by contrast, addresses must always be broadcast to all caches, so with only one set of address buses, multiple memory accesses cannot be snooped simultaneously. The tag memory is designed to track the state of temporarily stored data; the snooping cache automatically maintains coherence in hardware. An L2 cache miss means an access that misses in both the first and second levels of a two-level cache hierarchy.
Therefore, hardware support for cache coherence is desirable and effective for multithreaded programs, but is unnecessary when a plurality of independent programs are executed.
It is an object of the present invention to provide a system and a distribution method that optimize the throughput of the controlled system.
Summary of the invention
A system according to the present invention for distributing the threads of a process executed by a multiprocessor system comprising a plurality of processors includes: a determining device for determining the processing efficiency of the process when the threads of the process are distributed to each of the plurality of processors; and a device for selecting one processor or a plurality of processors to execute the process according to the result of the determining device.
The determining device comprises: a device for dividing the execution time of the process into processing time executed by the processors and input-output delay; and a device for comparing the processing time with the input-output delay.
Preferably, the system further comprises a device that, when data handled by a process executing on one processor is accessed by another processor, changes the access from the plurality of processors into access by said one processor.
Preferably, the system further comprises a device that allows the one processor executing the process to read and write the data handled by that process, and forbids the other processors from reading or writing those data.
A method according to the present invention for distributing a plurality of threads of a process executed by a multiprocessor system comprising a plurality of processors comprises the steps of:
a determining step of determining the processing efficiency of the process when the plurality of threads of the process are distributed to each of the plurality of processors; and
a selecting step of selecting one processor or a plurality of processors to execute the process according to the result of the determining step,
wherein the determining step comprises: dividing the execution time of the process into processing time executed by the processors and input-output delay, and a comparing step of comparing the processing time with the input-output delay.
According to the present invention, the optimized distribution of processes allows the multiprocessor system to execute all processes without any degradation of throughput.
Description of drawings
Fig. 1 shows the system architecture and distribution states of the present invention.
Fig. 2 shows the execution states of the threads in a process.
Fig. 3 shows an example of a page table used to maintain cache coherence.
Fig. 4 shows an example of a multiprocessor system.
Embodiment
Preferred embodiments of the system and distribution method of the present invention are described below with reference to the accompanying drawings. In Figs. 1, 2 and 4 of this specification, reference numeral 38 collectively refers to 38a, 38b, P1, P2 and P3.
As shown in Fig. 1, the system of the present invention distributes threads. In other words, the present invention concerns the scheduler of the operating system in a multiprocessor system. The method for optimizing system throughput is now described in detail.
Fig. 2 shows an operating system 32 and a process 38. T0 to T4 are threads in the process 38, and the position of each thread T0, T1, T2, T3 and T4 indicates its execution point. In Fig. 2, thread T0 is a thread in user space 36 that can execute in parallel with threads T1, T2, T3 and T4. T1 is executed by a system call, followed by T2, T3 and T4.
In this specification, if only the threads T0, T2 and T3 shown in Fig. 2 are assigned to different processors, then even in the case where process 38 can execute at most three threads in parallel, assignment to a single processor (uniprocessor) amounts to entering the uniprocessor execution state.
Usually, because data is shared, a cache coherency protocol is needed to execute threads T0, T2 and T3 in parallel on different processors. Executing T0 to T4 on a uniprocessor requires no cache coherency protocol. The data in user space 36 is then marked in the corresponding page table as not requiring coherence. The page table manages the memory blocks of each page of memory and stores the correspondence between virtual addresses and physical addresses; through this table, cache control and cache coherence control can be performed. A page is one of the equal-length parts obtained by dividing a program.
For example, in the page table 42 shown in Fig. 3, when cache coherence is not needed, this is indicated in the WIM (memory/cache control bits) bit field. Through the function of the MMU (memory management unit) included in the multiprocessor system 40, the hardware 30 produces no cache coherency protocol traffic for data accesses to user space 36.
However, cache coherence must still be considered when data is passed to program operations executed by other processors. For example, if a thread T5 references the user space 36a of a process that does not need cache coherence (the T5 shown in Fig. 2 executing on processor uP2 shown in Fig. 4), cache coherence cannot be maintained.
In such cases, cache coherence is preserved by changing the processor assignment of T5 so that the permitted uniprocessor execution state is maintained. The MMU is used to generate the event that triggers the change of assignment. In page table 42, only the one processor in the execution state can read/write the data; other processors cannot read/write data in user space 36. An interrupt occurs before the reference that would break cache coherence with respect to T5, and T5 is then assigned to the same processor to which T0, T1, T2, T3 and T4 are assigned.
In the example of the page table 42 of Fig. 3, the read/write attributes of the relevant page, assigned through the PP (page protection bits) bit field, allow the cache coherency protocol to be omitted.
Different processor can be seen the different read/write properties contents of record in the page table 42, and this is a kind of technology that also can use when install software high-speed cache in DSM (distributed shared memory).Though obtain the high-speed cache effect by transmit wherein the page that page fault has taken place between the node of the software caching of DSM in the present invention, yet what transmit is the thread that page fault occurred between node.In addition, page fault is the interruption that takes place during the non-existent page when in the reference-to storage.
If a uniprocessor executed every process 38, the multiprocessor system 40 would lose its reason for existence. The system of the present invention therefore has a structure that selects which processes 38 enter the uniprocessor execution state.
The system 10 of the present invention shown in Fig. 1 comprises: a determining device 12 for determining the processing efficiency of processes 38a and 38b when each process 38a and 38b is assigned to the plurality of processors (multiprocessor) 28 of the multiprocessor system 40; and a selecting device 14 for selecting the uniprocessor 26 or the multiprocessor 28 to execute processes 38a and 38b according to the result of the determining device 12.
The determining device 12 comprises: an observation component 16 that observes the execution state of processes 38a and 38b; and a determination component 18 that determines the processing efficiency of processes 38a and 38b. The observation component 16 comprises: a device (not shown) that divides the execution time of processes 38a and 38b into the processing time executed by processors 26 and 28 and the I/O (input-output) delay; and a device (not shown) that compares the processing time with the I/O delay. The determination component 18 determines the processing efficiency of processes 38a and 38b according to the result of the comparing device.
The selecting device 14 comprises a memory management unit 20 and a task distributor 22. The memory management unit 20 determines whether cache coherence is maintained for processes 38a and 38b in memory 24. The task distributor 22 selects multiprocessor or uniprocessor execution for a process.
In the present invention, when the determining device 12 and selecting device 14 detect that data handled by the process 38a executed on uniprocessor 26 is accessed from another processor, the access is terminated and redirected to uniprocessor 26. More specifically, an interrupt occurs so that the access from the other processor is changed into access by uniprocessor 26.
In the present invention, uniprocessor 26 reads and writes the data operated on by the process 38a that it executes, and the other processors are forbidden to read or write those data.
The distribution method using the system 10 described above is now detailed. The system 10 determines the processing efficiency obtained when the threads of processes 38a and 38b are assigned to each of the multiprocessors 28, and selects uniprocessor execution or multiprocessor execution.
(1) Even if a process is executed in parallel, when the CPU determines its processing efficiency and finds that only a short time is consumed in processing, the benefit remains limited if the processing is input-output bound. The execution time of processes 38a and 38b is therefore divided into the processing time executed by processors 28 and the input-output delay, and the two are compared. According to the comparison, when the input-output delay is longer, a uniprocessor is used; the uniprocessor executes process 38a in Fig. 1. The term "input-output bound" refers to processing that uses the virtual memory region on a hard disk, i.e. when a swap file is produced.
(2) If an application running on the operating system 32 that includes system 10 is a single-threaded program, the program can likewise be executed by a uniprocessor. In this case, when the operating system 32 is a multithreaded program, the application behaves like a multithreaded program. When the threads of the application are not executed in parallel because an input-output bound is expected, the operating system 32 can have it executed by a uniprocessor. More specifically, in the present invention, the selecting device 14 selects uniprocessor execution.
(3) Even if a process 38 suited to multiprocessor execution runs well when executed alone, processing efficiency may still be poor when a plurality of such processes 38 are executed together. A uniprocessor can then be used for the processing. More specifically, in the system 10 of the present invention, the determining device 12 determines the processing efficiency of uniprocessor execution for each process 38a and 38b, and the selecting device 14 selects a uniprocessor to execute a process according to the determination result. For example, even if a multiprocessor is suited to executing process 38a in Fig. 1, the selecting device 14 may still select uniprocessor execution.
In the multiprocessor system 40 that includes the system 10 of the present invention, a process 38 containing a multithreaded program suited to multiprocessor execution is executed in parallel by the multiprocessor 28, which maintains cache coherence at the hardware level.
Process P3 is executed by the two processors uP3 and uP4 in Fig. 4. In processors uP3 and uP4, the cache coherency protocol is produced for data accesses.
A process 38 not suited to multiprocessor execution is preferably executed by a uniprocessor. The software control method, with the support of the MMU, maintains cache coherence only where other processors are involved. When data is accessed in user spaces 36a and 36b (where processors uP1 and uP2 execute processes P1 and P2 in Fig. 4, respectively), no cache coherency protocol traffic is produced. As a result, the bandwidth consumed by the cache coherency protocol is reduced, and the throughput of the multiprocessor system 40 is improved.
To prevent cache coherency protocol traffic from being produced, when process P1 is accessed from processor uP2, the interrupt device described above causes an access interrupt. After the interrupt, the thread is reassigned from the accessing processor uP2 to processor uP1.
In addition, the MMU controls the multiprocessor system 40 so that only the processor executing process 38 can read/write its data, thereby forbidding the other processors from reading and writing those data.
The embodiments of the invention have been described above, but the invention is not limited to them. The description above assumes that the cache is an L1 (level-one) cache and that the uniprocessor execution state means a uniprocessor executing the threads. The same idea can be applied to the L2 cache, in which case a uniprocessor node enters the execution state. The "uniprocessor-node execution state" is a state in which the processors belonging to the same node execute the tasks belonging to one process.
It is assumed above that the space in which the cache coherency protocol is omitted is the user space, which is easy to implement and gives a clear benefit, but the omission can also be applied to part of the system space. That is, accesses in system space are classified, and the cache coherency protocol is omitted for the parts that can do without it.
Any modifications, variations or equivalents that those skilled in the art may conceive should be considered within the scope of the invention.
The foregoing has illustrated and described an innovative thread distribution system and distribution method in a multiprocessor system that satisfies all the objects and advantages sought. Many changes, modifications, variations, combinations and other uses and applications of the invention will, however, become apparent to those skilled in the art from this specification and the accompanying drawings, which disclose preferred embodiments. All such changes, modifications, variations and other uses and applications that do not depart from the spirit and scope of the invention are deemed covered by the invention, which is limited only by the following claims.

Claims (9)

1. A system for distributing a plurality of threads of a process executed by a multiprocessor system comprising a plurality of processors, comprising:
a determining device for determining the processing efficiency of the process when the plurality of threads of the process are distributed to each of the plurality of processors; and
a device for selecting one processor or a plurality of processors to execute the process according to the result of the determining device,
wherein the determining device comprises: a device for dividing the execution time of the process into processing time executed by the processors and input-output delay; and a device for comparing the processing time with the input-output delay.
2. The system as claimed in claim 1, further comprising a device that, when data handled by a process executed by one processor is accessed by another processor, changes the access from the plurality of processors into access by said one processor.
3. The system as claimed in claim 2, further comprising a device that allows the one processor executing the process to read/write the data handled by the process executed by that processor, and forbids the other processors from reading/writing those data.
4. A method for distributing a plurality of threads of a process executed by a multiprocessor system comprising a plurality of processors, comprising the steps of:
a determining step of determining the processing efficiency of the process when the plurality of threads of the process are distributed to each of the plurality of processors; and
a selecting step of selecting one processor or a plurality of processors to execute the process according to the result of the determining step,
wherein the determining step comprises: dividing the execution time of the process into processing time executed by the processors and input-output delay, and a comparing step of comparing the processing time with the input-output delay.
5. The method as claimed in claim 4, wherein the input-output delay is longer than the processing time in the comparing step, and one processor is selected for execution in the selecting step.
6. The method as claimed in claim 4, wherein the operating system of the multiprocessor system comprises a multithreaded program and an application executed by the multiprocessor system comprises a single-threaded program, and wherein the selecting step selects one processor to execute the multithreaded program and the single-threaded program.
7. The method as claimed in claim 4, wherein, if the determining step determines that executing each of a plurality of processes by a plurality of processors is efficient, the selecting step selects execution by a plurality of processors when the plurality of processes are executed.
8. The method as claimed in any one of claims 4 to 7, further comprising the step of, when data handled by the process executed by one processor is accessed by another processor, changing the access into access by said one processor.
9. The method as claimed in claim 8, further comprising the step of allowing the one processor executing the process to read/write the data handled by the process executed by that processor, and forbidding the other processors from reading/writing those data.
CNB031563465A 2002-10-11 2003-09-04 Line-stage distribution system and method in multiple processor system Expired - Fee Related CN1308839C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US86852002A 2002-10-11 2002-10-11
US298685/2002 2002-10-11

Publications (2)

Publication Number Publication Date
CN1489062A CN1489062A (en) 2004-04-14
CN1308839C true CN1308839C (en) 2007-04-04

Family

ID=34195049

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB031563465A Expired - Fee Related CN1308839C (en) 2002-10-11 2003-09-04 Line-stage distribution system and method in multiple processor system

Country Status (1)

Country Link
CN (1) CN1308839C (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105573850B (en) * 2015-11-09 2018-12-28 广州多益网络股份有限公司 Multi-process exchange method, system and server

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5838686A (en) * 1994-04-22 1998-11-17 Thomson Consumer Electronics, Inc. System for dynamically allocating a scarce resource
US6038256A (en) * 1996-12-31 2000-03-14 C-Cube Microsystems Inc. Statistical multiplexed video encoding using pre-encoding a priori statistics and a priori and a posteriori statistics
WO2002025955A1 (en) * 2000-09-20 2002-03-28 General Instrument Corporation Processor allocation for channels in a video multi-processor system


Also Published As

Publication number Publication date
CN1489062A (en) 2004-04-14

Similar Documents

Publication Publication Date Title
US6167490A (en) Using global memory information to manage memory in a computer network
US8112587B2 (en) Shared data prefetching with memory region cache line monitoring
US6574720B1 (en) System for maintaining a buffer pool
US6591355B2 (en) Distributed shared memory system with variable granularity
JPH10503310A (en) Page movement in non-uniform memory access (NUMA) systems
JP2002182976A (en) Dynamic serial conversion for memory access in multi- processor system
US20030163543A1 (en) Method and system for cache coherence in DSM multiprocessor system without growth of the sharing vector
WO2013080434A1 (en) Dynamic process/object scoped memory affinity adjuster
US8874853B2 (en) Local and global memory request predictor
WO2005121966A2 (en) Cache coherency maintenance for dma, task termination and synchronisation operations
US20140229683A1 (en) Self-disabling working set cache
US6457107B1 (en) Method and apparatus for reducing false sharing in a distributed computing environment
Speight et al. Using multicast and multithreading to reduce communication in software DSM systems
Lameter Local and remote memory: Memory in a Linux/NUMA system
Pan et al. Tintmalloc: Reducing memory access divergence via controller-aware coloring
US6668310B2 (en) High speed counters
Bryant et al. Linux scalability for large NUMA systems
Scales et al. Design and performance of the Shasta distributed shared memory protocol
CN111273860B (en) Distributed memory management method based on network and page granularity management
CN1308839C (en) Line-stage distribution system and method in multiple processor system
US6360302B1 (en) Method and system for dynamically changing page types in unified scalable shared-memory architectures
US8201173B2 (en) Intelligent pre-started job affinity for non-uniform memory access computer systems
US9274955B2 (en) Reduced scalable cache directory
CN115407931A (en) Mapping partition identifiers
Kuo et al. MP-LOCKs: replacing H/W synchronization primitives with message passing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070404

Termination date: 20091009