WO2006050348A2 - Methods and apparatus for scheduling and running applications on computer grids - Google Patents

Methods and apparatus for scheduling and running applications on computer grids Download PDF

Info

Publication number
WO2006050348A2
WO2006050348A2 (PCT/US2005/039439)
Authority
WO
WIPO (PCT)
Prior art keywords
tasks
task
group
file
files
Prior art date
Application number
PCT/US2005/039439
Other languages
French (fr)
Other versions
WO2006050348A3 (en)
Inventor
Fabricio Alves Barbosa Da Silva
Silvia Regina De Carvalho
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P. filed Critical Hewlett-Packard Development Company, L.P.
Publication of WO2006050348A2 publication Critical patent/WO2006050348A2/en
Publication of WO2006050348A3 publication Critical patent/WO2006050348A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5072 Grid computing


Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Multi Processors (AREA)

Abstract

A method of running an application on a plurality of computational units (Fig. 1a, 11-14) is described. The application comprises a plurality of tasks, each task having at least one input file associated therewith. In one embodiment, the method includes the steps of aggregating said plurality of tasks into one or more groups of tasks and allocating each group of tasks to a computational unit, wherein the plurality of tasks are aggregated so that tasks which share one or more input files are included in the same group (Fig. 1a, 10).

Description

Methods and Apparatus for Scheduling and Running Applications on Computer Grids
Field of the invention
The invention relates to methods and apparatus for scheduling and executing applications on computer grids. More particularly, although not exclusively, the invention relates to methods and apparatus for scheduling the components, also known as tasks, of grid-based applications on computational units constituting a computational grid or cluster. Even more particularly, although not exclusively, the invention relates to scheduling tasks on heterogeneous distributed computational grids. The invention may be particularly suitable for scheduling sequential independent tasks, otherwise known as Bag-of-Tasks or Parameter Sweep applications, on computational grids.
Background to the Invention
A computational grid, or more simply 'grid', can be thought of as a collection of physically distributed, heterogeneous computational units, or nodes. The physical distribution of the grid nodes may range from immediate proximity to wide geographical distribution. Grid nodes may be either heterogeneous or homogeneous, with homogeneous grids differing primarily in that the nodes constituting such a grid provide an essentially uniform operating environment and computing capacity. Given that grids are often formed across administrative domains and over a wide range of hardware, homogeneous grids are considered a specific case of the general heterogeneous grid concept. The present invention contemplates both types.
The described embodiments of the present invention contemplate distributed networks of heterogeneous nodes which are desired to be treated as a unified computing resource.
Computational grids are usually built on top of specially designed middleware platforms known as grid platforms. Grid platforms enable the sharing, selection and aggregation of the variety of resources constituting the grid. These resources, which constitute the nodes of the grid, can include supercomputers, servers, workstations, storage systems, desktop systems and specialized devices that may be owned and operated by different organizations.
The described embodiment of the present invention is concerned with grid applications known as Bag-of-Tasks (BoT) applications. These types of applications can be decomposed into groups of tasks. Tasks for this type of grid application are characterized as being independent in that no communication is required between them while they are running and that there are no dependencies between tasks. That is, each task constituting an element of the grid application as a whole can be executed independently with its result contributing to the overall result of the grid-based computation. Examples of BoT applications include Monte Carlo simulations, massive searches, key breaking, image manipulation and data mining.
In this specification and the exemplary embodiments described therein, we will refer to a BoT application A as a set of T independent tasks, $A = \{T_1, T_2, \ldots, T_T\}$. The amount of computation involved with each task $T_i$ is generally predefined and may vary among the tasks of A. Note that the input for each task $T_i$ is one or more (input) files and the output one or more (output) files.
The present exemplary embodiment relates to clusters organized as a master-slave platform. According to this model, a master node is responsible for scheduling computation among the slave nodes and collecting the results. Other grid/cluster models are possible within the scope of the present embodiments of the invention, and may be capable of incorporating the execution/scheduling technique described herein with appropriate modification. For example, a further embodiment is described where the slave components are themselves clusters.
Grid platforms typically use a non-dedicated network infrastructure such as the internet for inter-node communication. In such a network environment, machine heterogeneity, long and/or variable network delays and variable processor loads and capacities are common. Since tasks belonging to a BoT application do not need to communicate with each other and can be executed independently, it is considered that BoT applications are particularly suitable for execution on such a grid infrastructure.
While heterogeneous grids have been found to be suitable for executing such applications, a significant problem is scalability. Under certain circumstances, it has been found that simply increasing the number of processors in the grid does not produce any additional decrease in overall application execution time. Specifically, it has been found that the execution of a grid application is guaranteed to be scalable only on a platform with up to $P_{eff}$ processors, where $P_{eff}$ is defined as the maximum number of slave processors needed to run an application with no idle periods on any slave processor.
Beyond this number of processors, the application execution time asymptotes to a fixed level and the application execution no longer scales. This is a significant barrier to the efficient execution of large and complex tasks on computational grids. Existing work in this area has proposed various heuristics for scheduling BoT applications on cluster and grid platforms. However, none of them take into account scalability issues related to the execution of fine-grain applications in master-slave platforms.
Disclosure of the invention
In one aspect, the invention provides for a method of scheduling the execution of an application on a plurality of computational units, said application comprising a plurality of tasks, each task having at least one input file associated therewith, the method including the steps of:
aggregating said plurality of tasks into one or more groups of tasks; and
allocating each group of tasks to a computational unit, wherein the plurality of tasks are aggregated so that tasks which share one or more input file are included in the same group.
This effectively groups associated files with each other, thereby introducing the concept of file affinity, based on the reduction of the amount of data that needs to be transferred to a computational unit when all tasks of a group are sent to that unit. This increases $P_{eff}$, thereby increasing the number of slave processors that can be used effectively and allowing more efficient scaling.
In a preferred embodiment, the number of groups of tasks is equal to the number of computational units.
The tasks are preferably aggregated into the groups so that the time needed for processing each group is substantially the same for each computing unit.
The step of aggregating the plurality of tasks may include the step of determining the file affinity between pairs of tasks in respect of their input files, wherein for a set G of tasks composed of K tasks, $G = \{T_1, T_2, \ldots, T_K\}$, and the set F of Y input files needed by one or more tasks belonging to group G, $F = \{f_1, f_2, f_3, \ldots, f_Y\}$.
The file affinity, or task file affinity, is preferably defined by:

$$I_{aff} = \frac{\sum_{i=1}^{Y} (N_i - 1)\,|f_i|}{\sum_{i=1}^{Y} N_i\,|f_i|}$$

where $|f_i|$ is the size of file $f_i$ and $N_i$ is the number of tasks in group G which have file $f_i$ as an input file.
In a preferred embodiment of the invention, the method of distributing the tasks constituting the application among a plurality of computing units may include the following steps:
- define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
- compute the size of each task;
- rank the task files in a list L in order of increasing size;
- for each group, beginning with the group with the largest number of tasks:
    - assign the smallest unassigned task file to the group;
    - set task file list position = 1;
    - until the group is completely populated by task files, do:
        - if (position + P ≤ size of list L) and (task file affinity(task file[position], task file[position+1]) < a specified value, k) then position = position + P;
        - else position = position + 1;
        - assign to the group the task file located at position in list L;
    - end do
    - remove assigned task files from list L;
    - set P = P - 1;
    - populate the next group.
According to a further embodiment of the invention, there is provided a method of scheduling tasks among a plurality of computing units including the following steps: A) define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
B) compute the size of each task;
C) rank the task files in a list L in order of increasing size,
D) for each group, beginning with the group with the largest number of tasks perform the following steps (a) to (e):
(a) assign the smallest unassigned task file to the group;
(b) set the task file list position index equal to 1;
(c) while the group is not completely populated by task files perform the following steps:
(i) if the position index plus P is less than or equal to the size of the list L, and the task file affinity between the task file at the position index and the task file at the position index +1 is less than a specified value k, then increment the position index by P;

otherwise increment the position index by 1;
(ii) assign to the group, the task file located at position in list L
(d) Remove assigned task files from List L
(e) set P = P - 1
As the rigorous equation for file affinity defined above is dominated by the number of possible ways of clustering n tasks in G groups, this preferred simplified embodiment allows a practical calculation or an approximation of the task file affinity.
The value k is preferably selected to represent the desired level of association between task pairs according to the degree of sharing of input files.
In a preferred embodiment, the number of tasks to be assigned to each group is determined such that the time needed for processing each group on a corresponding computing unit is substantially the same for each computing unit. In a preferred embodiment, the size of each task is calculated on the basis of the byte sum of all of the input files needed to execute each task on a computing unit.
In a preferred embodiment, tasks are aggregated into a specified group for values of k greater than 0.5.
The method may also include the further step of dispatching the groups of tasks to corresponding computing units, preferably in a manner which substantially overlaps computation and communication.
The computing units are preferably grid resources such as processors.
In an alternative embodiment, the computational units are one or more clusters.
Thus, the grid may be formed of one or more clusters, as opposed to being composed of a set of processors.
"According to this embodiment, the number-of-tasks assigned to each-cluster_wilLbe .calculated^ based on: the requirement that the time needed by the cluster for processing each group is the same for each cluster, the file affinity, and pipelining, where applied, proceeding substantially as hereinbefore defined.
Brief Description of the Drawings
The invention will now be described by way of exemplary embodiments only, with reference to the drawings, in which:
Figures Ia & b: illustrates a method of assigning tasks on a dedicated cluster according to a round-robin approach in accordance with an embodiment of the invention;
Figure 2: illustrates an embodiment of the invention having a master-slave node configuration;
Figure 3 : illustrates the results of an execution time simulation for a prior art homogeneous platform;
Figure 4: illustrates the results of an execution time simulation according to an embodiment of the invention; and
Figure 5: illustrates an embodiment of the invention whereby tasks are dispatched to processors in a pipelined manner. For the purposes of explanation, a simple execution model will be described initially in relation to a prior art technique for scheduling an application on a homogeneous grid. This will then be compared with an embodiment of the invention. The specific embodiment described herein relates to fine-grain Bag-of-Tasks applications on dedicated master-slave platform as shown in Figure 1. However, this is not to be construed as limiting and the invention may be applied to other computing contexts with suitable modification. Further classes of applications that may benefit from the invention are those composed of tasks with dependencies where sets of dependent tasks can be grouped and the groups are independent among themselves. The method may be modified slightly to group the tasks according to such dependencies.
Referring to Figure 2, the application A is composed of T homogeneous tasks; that is, $A = \{T_1, T_2, \ldots, T_T\}$. The master node (10) is responsible for organizing, scheduling, transmitting and receiving the tasks corresponding to the grid application. Referring to figure 1, each task goes through three phases during execution:
Initialization phase
This is the process whereby the files constituting the grid application and its data are sent from the master node (10) to the slave nodes (11-14) and the task is started. The duration of this phase is equal to $t_{init}$.
The set of files sent may include a parameter file corresponding to a specified task and an executable file which is charged with performing the computational task on the slave processor. The time in this phase includes the overhead incurred by the master node (10) to initiate a data transfer to a slave (11), for example, to initiate a TCP connection. For example, consider a task $i$ that needs to send two files to a slave node before execution. The time $t_{init}$ can then be computed as follows:
$$t_{init} = Lat_i + \frac{\sum_{j=1}^{2} File_j}{B}$$

where $Lat_i$ is the overhead incurred by the master node to initiate data transfer to the slave node (11-14), $\sum_j File_j$ is the total size in bytes of the input files that have to be transferred to the slave node, and $B$ is the data transfer rate. For simplicity, in this example it is assumed that each task has only one separate parameter file of the same size associated with it.
Computation Phase
In this phase, the task processes the parameter file at the slave node (11-14) and produces an output file. The duration of this phase is equal to $t_{comp}$. Any additional overhead related to the reception of input files by a slave node is also included in this phase.
Completion Phase
During this phase, the output file is sent back to the master node (10) and the task is terminated. The duration of this phase is $t_{end}$. This phase may require some processing at the master, mainly related to writing files to the file repository (not shown in (10)). This writing step may be deferred until disk resources are available and is therefore considered negligible. Thus the initialization phase of one slave can occur concurrently with the completion phase of another slave node.
The total execution time of a task is therefore:

$$t_{total} = t_{init} + t_{comp} + t_{end}$$
The exemplary embodiment described herein corresponds to a dedicated node cluster composed of P+1 homogeneous processors, where $T \gg P$. The additional processor is the master node (10). Communication between the master (10) and the slaves (11-14) is by way of a shared link and, in this embodiment, the master (10) can only send files through the network to a single slave at a given time. The communication link is full duplex. This embodiment corresponds to a one-port model whereby there are at most two communication processes involving a given master, one sending and one receiving. The one-port embodiment discussed herein is particularly suited to LAN network connections.
A slave node (11-14) is considered to be idle when it is not involved with the execution of any of the three phases of a task. Figure 1 shows the execution of a set of tasks in a system composed of three slave nodes where the scheduling algorithm is of a round-robin type.
The effective number of processors $P_{eff}$ is defined as the maximum number of slave processors needed to run an application with no idle periods on any slave processor. Taking into account the task and platform models of the particular embodiment described herein, a processor may have idle periods if: $t_{comp} + t_{end} < (P-1)\,t_{init}$
$P_{eff}$ is then given by the following equation:

$$P_{eff} = \left\lceil \frac{t_{comp} + t_{end}}{t_{init}} \right\rceil + 1$$
The total number of tasks to be executed on a given processor is at most:

$$\left\lceil \frac{T}{P_{eff}} \right\rceil$$

where T is the total number of tasks. For a platform with $P_{eff}$ processors, the total execution time, or makespan, will be:
$$t_{makespan} = \left\lceil \frac{T}{P_{eff}} \right\rceil (t_{init} + t_{comp} + t_{end}) + (P-1)\,t_{init}$$

The second term on the right side of this equation gives the time which is needed to start the first (P-1) tasks on the other P-1 processors. If the platform has more processors than $P_{eff}$, then the overall makespan is dominated by communication times between the master and the slaves. Then:

$$t_{makespan} = T \cdot t_{init} + t_{comp} + t_{end}$$
As there are idle periods on every processor, the following inequality holds:

$$(P-1)\,t_{init} > t_{comp} + t_{end}$$
This inequality applies primarily to two cases: a. for very large platforms (large P); and b. for applications with a large $t_{init}/t_{comp}$ ratio, such as fine-grain applications.
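As a quick numeric check of these relationships, the following sketch plugs in the parameter values used in the simulation of Figure 3 described below; the code itself is illustrative and not part of the patent:

```python
import math

t_init = 1.0        # initialization time per task (s)
t_comp_end = 8.0    # t_comp + t_end per task (s)
T = 800             # total number of tasks

# Effective number of processors: P_eff = ceil((t_comp + t_end) / t_init) + 1
p_eff = math.ceil(t_comp_end / t_init) + 1
print(p_eff)  # 9, matching Figure 3

# For P >= P_eff the makespan is communication-bound:
# t_makespan = T * t_init + t_comp + t_end
print(T * t_init + t_comp_end)  # 808.0 s, the plateau seen in Figure 3
```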
Thus, it can be seen that the execution of an application according to the prior art is guaranteed to be scalable only on a platform with up to $P_{eff}$ slave processors.
Beyond this, the idle time in slave nodes increases proportionally with the number of processors. The result of a simulation for such a scheduling system is shown in Figure 3. Here, $t_{init}=1$, $t_{comp}+t_{end}=8$ and $T=800$. The effective number of processors is 9 in Figure 3, and for $P \ge 9$ the overall makespan asymptotes to a constant level of 808 seconds. In accordance with an embodiment of the present invention, $t_{comp}$ is increased by grouping sets of tasks sharing common input files into a larger task. By doing so, it is possible to increase the effective number of processors, therefore increasing the number of slave processors that can be used effectively. The time corresponding to $t_{init}$ should ideally not increase in the same proportion as $t_{comp}$. Thus, in one embodiment, tasks which share one or more input files are selected and scheduled so as to run on a common slave node or processor.
This is achieved by introducing the concept of the file affinity which indicates the reduction in the amount of data that needs to be transferred to a remote node when all tasks of a group are sent to that node.
In this discussion it is assumed that the number of groups is equal to the number of nodes available. This is not however a limitation, and modifications to the scheduling method are viable to take into account different processor/group assignments. For example, for some specific sets of applications/platforms, the optimal execution in terms of the makespan will use a number of groups smaller than the total number of processors. Given a set G of tasks composed of K tasks, $G = \{T_1, T_2, \ldots, T_K\}$, and the set F of the Y input files needed by one or more tasks belonging to group G, $F = \{f_1, f_2, f_3, \ldots, f_Y\}$, the file affinity $I_{aff}$ is defined in one embodiment as follows:

$$I_{aff} = \frac{\sum_{i=1}^{Y} (N_i - 1)\,|f_i|}{\sum_{i=1}^{Y} N_i\,|f_i|}$$

where $|f_i|$ is the size in bytes of file $f_i$ and $N_i$ is the number of tasks in group G which have file $f_i$ as an input file, with $0 \le I_{aff} < 1$. An input file affinity of zero indicates that there is no sharing of files among tasks of a group. An input file affinity close to one means that all tasks of a group have a high degree of sharing of input files.
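A minimal sketch of this computation, assuming the $I_{aff}$ formula as reconstructed above; the function name and the representation of tasks as sets of input file names are illustrative:

```python
def file_affinity(tasks: list[set[str]], file_sizes: dict[str, int]) -> float:
    """I_aff for a group of tasks, each given as its set of input file names.

    Returns the fraction of bytes saved when every shared file is sent to the
    node once rather than once per task that needs it.
    """
    files = set().union(*tasks)  # F: all input files needed by the group
    total = saved = 0
    for f in files:
        n = sum(f in task for task in tasks)  # N_i: tasks using file f
        total += n * file_sizes[f]
        saved += (n - 1) * file_sizes[f]
    return saved / total if total else 0.0

# No sharing -> 0.0; three tasks sharing one file -> (3-1)/3 ~ 0.67
print(file_affinity([{"a"}, {"b"}, {"c"}], {"a": 10, "b": 10, "c": 10}))  # 0.0
print(file_affinity([{"a"}, {"a"}, {"a"}], {"a": 10}))                    # 0.666...
```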
The potential benefits of clustering tasks into groups for execution are illustrated in the results of the simulation shown in Figure 4. In accordance with one embodiment of the invention, it was assumed that all tasks share the same input files and that the same parameters as in the simulation shown in Figure 3 applied. As can be seen from the example, there is a reduction in total execution time for all values of the number of processors P, with a consistent reduction as the size of the grid platform is increased. Thus, in this embodiment of the invention, effective scaling is achieved. For the example shown in Figure 4, for a platform with 80 processors, the total execution time when tasks are grouped is 160s. Without grouping, the total time is 808s.
The equation above for file affinity is dominated by the combinatorial function whereby all possible pairs of tasks are considered. For large numbers of tasks, this can lead to very large numbers of combinations of task pairs. For example, there are N(25, 5) ways of clustering 25 tasks into 5 groups, which equates roughly to $10^{15}$ possible combinations. It may therefore be impractical to search exhaustively in the solution space for an optimal task grouping. For this reason, according to another preferred embodiment of the invention, there is provided a simplified heuristic for determining the optimal task grouping which is based on the general file affinity equation described above.
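Reading N(25, 5) as the number of ways to partition 25 tasks into 5 non-empty groups (a Stirling number of the second kind, which is an interpretation of the notation), a short sketch confirms the order of magnitude:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Ways to partition n distinct tasks into k non-empty groups."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(f"{stirling2(25, 5):.2e}")  # ~2.4e+15, i.e. roughly 10^15
```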
As a preliminary illustration of this simplified embodiment, consider a group of tasks, each of which requires a different input file. Because there is no input file sharing, there is no file affinity between them. It is desirable to start processing them on slave nodes as soon as possible to minimize $t_{init}$. Therefore, the tasks are transferred to slave nodes in size order from smallest to largest, with no account taken of sharing amongst input files (as this is zero).
If an application where all tasks share the same input file is considered, that input file only needs to be transferred once. This is taken into account by including the effect of file affinity. If the file affinity of two consecutive tasks (in size order) is very high, it is advantageous to assign those two tasks to the same processor instead of transferring the same set of input files twice over the network. In the ideal situation described here, this set of files is transferred only once to each processor or node of the network.
This simplified embodiment reduces the size of the possible solution space and provides a viable method of calculating the file affinities for tasks to within a workable level of accuracy. According to this embodiment, and taking into account file affinity, the simplified embodiment includes the following steps. Initially, for each computing unit or processor, the number of tasks to be aggregated into a group is defined for that computing unit. This is done so that the time needed for processing each group is substantially the same for each computing unit.

Then the total size of each task is calculated. Here, the size of each task corresponds to the sum of the input file sizes for the task concerned. For each group defined in the aggregation step, the required number of tasks is allocated to the group as a function of both the number of tasks determined previously and task affinity. The allocation step in a preferred embodiment is as follows. The reference to 'position' relates to the position of the task input file in the size-ordered list. The smallest task, task(position), is assigned to a first group. Then the file affinity of the pair task(position) and task(position+1) in the size-ordered list is determined. If the file affinity is greater than a specified value k, task(position+1) is assigned to the first group. If the file affinity is less than the specified value, task(position+1) is assigned to a subsequent group. This process is repeated, filling the groups sequentially in order until the group allocations determined in the initial step are populated with the size-ordered, associated tasks. This embodiment can be expressed in pseudocode as follows:
- define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
- compute the size of each task;
- rank the task files in a list L in order of increasing size;
- for each group, beginning with the group with the largest number of tasks:
    - assign the smallest unassigned task file to the group;
    - set task file list position = 1;
    - until the group is completely populated by task files, do:
        - if (position + P ≤ size of list L) and (task file affinity(task file[position], task file[position+1]) < a specified value, k) then position = position + P;
        - else position = position + 1;
        - assign to the group the task file at position in list L;
    - end do
    - remove assigned task files from list L;
    - set P = P - 1.
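The pseudocode above can be turned into a short runnable sketch. The following Python rendering is an interpretation: the function name, the pairwise `affinity` callback, and the even split of group sizes across homogeneous units are assumptions, not part of the patent text:

```python
from typing import Callable, List, Sequence

def group_tasks(
    sizes: Sequence[float],                 # byte sum of each task's input files
    num_units: int,                         # P, the number of computing units
    affinity: Callable[[int, int], float],  # pairwise file affinity of two tasks
    k: float = 0.5,                         # threshold for co-locating tasks
) -> List[List[int]]:
    """Group task indices according to the simplified heuristic (a sketch)."""
    # Rank the tasks in a list L in order of increasing size.
    L = sorted(range(len(sizes)), key=lambda t: sizes[t])
    # Homogeneous units: group sizes as even as possible, largest group first.
    q, r = divmod(len(L), num_units)
    counts = [q + 1] * r + [q] * (num_units - r)
    groups: List[List[int]] = []
    P = num_units
    for count in counts:
        group = [L[0]]  # assign the smallest unassigned task to the group
        pos = 0
        while len(group) < count:
            # Low affinity between neighbours: jump P ahead, so dissimilar
            # tasks are spread round-robin over the remaining units.
            if pos + P < len(L) and affinity(L[pos], L[pos + 1]) < k:
                pos += P
            else:  # high affinity: keep the next task on the same unit
                pos += 1
            group.append(L[pos])
        # Assigned tasks are removed from L only once the group is full,
        # matching the position arithmetic of the pseudocode.
        L = [t for t in L if t not in group]
        P -= 1
        groups.append(group)
    return groups
```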
An example application of this simplified heuristic is as follows. Consider a set of tasks composed of ten task files which are to be distributed on three homogeneous slave processors. The set of input files needed by each task is described as $\{f_1, \ldots, f_{10}\}$, where $f_i$ is a real value that corresponds to the byte sum of the input files needed by task $t_i$. As the tasks are heterogeneous, they will share no input files and the file affinity between any pair of tasks will be zero.
The 10 heterogeneous input file tasks are {20K, 35K, 44K, 80K, 102K, 110K, 200K, 300K, 400K, 450K}. Three groups of tasks are generated, one with 4 tasks and the others with 3 tasks. The simplified embodiment of the heuristic in the case of zero file affinity operates as follows. Each task is considered in size order. Thus, 20K is allocated to the first position of group 1. Then the 35K input task is allocated to the next group, following the principle that each group should minimize initial transmission or initialization time. Task 44K is allocated to the third group. Task 80K is then allocated to position two of the first group, 102K to the second position of group two, and so on. This produces the groups of files as follows: {20K, 80K, 200K, 450K}, {35K, 102K, 300K} and {44K, 110K, 400K}. To a first approximation this keeps the amount of transmitted data similar for each group and allows the task transmission/calculation to be pipelined in a reasonably efficient manner. In a preferred embodiment, the transfer of the files occurs in a pipelined manner, i.e., where computation is overlapped with communication. Figure 5 illustrates the pipelined transfer of input files from a master to three slave processors. As can be seen in this example, the transfers to and from the master/nodes are staggered, with the computation on the slaves being overlapped with the communication phase on one or more of the other processor nodes. This reduces $t_{init}$ when executing a group of tasks on a slave processor.
Introducing the file affinity threshold k allows a balance to be struck between grouping tasks of different sizes into one task group for transmission and grouping tasks with sufficient file affinity or commonality. Thus this degree of balance may be altered by setting the file affinity threshold for associating input files.
Another example is that of 10 homogeneous tasks with ten completely homogeneous sets of input files {30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K}. Again, three groups of tasks are generated, one with four tasks and the others with three. As the tasks are completely homogeneous, each pair will have a file affinity of approximately 1. Thus, following the simplified embodiment of the heuristic, the three groups of input files will be {30K}, {30K}, and {30K}.
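Running the two extreme examples above through the earlier `group_tasks` sketch reproduces the stated groupings (sizes in KB; the constant affinity callbacks are stand-ins for the two extremes):

```python
# Zero affinity: ten heterogeneous tasks sharing no input files.
sizes = [20, 35, 44, 80, 102, 110, 200, 300, 400, 450]  # KB
groups = group_tasks(sizes, num_units=3, affinity=lambda a, b: 0.0)
print([[sizes[t] for t in g] for g in groups])
# -> [[20, 80, 200, 450], [35, 102, 300], [44, 110, 400]]

# Full affinity: ten homogeneous tasks sharing the same 30K input file.
groups = group_tasks([30] * 10, num_units=3, affinity=lambda a, b: 1.0)
print([len(g) for g in groups])  # -> [4, 3, 3]; the 30K file is sent once per group
```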
These two extreme examples serve to illustrate how the task grouping may be performed in the simplified embodiment.
In terms of implementing this method as noted above, the number of tasks to be assigned to each group is determined such that the time needed for processing each group is substantially the same for each computing unit. This will depend on each processor's relative speed, based on the average speed of the processors in the cluster. For example, if the relative speed of a particular node processor is 1.0 compared to the average speed of the cluster nodes, the maximum number of tasks to be assigned to that processor will be $\lceil T/P \rceil$.
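A small sketch of this sizing step, assuming tasks are allotted in proportion to relative speed (the patent states only the equal-speed case explicitly; the proportional rule and function name here are illustrative):

```python
import math

def group_counts(total_tasks: int, speeds: list[float]) -> list[int]:
    """Tasks per computing unit, proportional to relative speed (a sketch).

    With equal speeds this reduces to ceil(T/P) / floor(T/P): for example,
    10 tasks on 3 equal units -> [4, 3, 3].
    """
    weight = sum(speeds)
    raw = [total_tasks * s / weight for s in speeds]
    counts = [math.floor(x) for x in raw]
    # Hand the remaining tasks to the units with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[: total_tasks - sum(counts)]:
        counts[i] += 1
    return counts

print(group_counts(10, [1.0, 1.0, 1.0]))  # [4, 3, 3]
print(group_counts(10, [2.0, 1.0, 1.0]))  # faster unit gets more tasks: [5, 3, 2]
```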
The size of each task is calculated on the basis of the byte sum of all of the input files needed to execute each task on a computing unit. The file affinity threshold may usefully be defined as k, for which an affinity of 0.5 is considered acceptable as a benchmark for grouping tasks into a specified group. Essentially, this equates to setting the minimum degree of 'association' which is necessary to consider two tasks as related or sharing input files. This ensures that the file affinity is maximized within a group, so that sending similar sets of files to multiple processors is avoided. As noted above, if the next set of files is different enough (i.e., has a file affinity with a previously allocated task less than the minimum), that task will be located at the next processor position. Firstly, this is done so that tasks with the smallest byte sum are sent initially. Secondly, this is done to guarantee that the groups are as uniform as possible in respect of the number of bytes that need to be transmitted from the master node. Thus, in a preferred embodiment, at initialization of the algorithm, the number of tasks is allocated to each processor based on the processing power of the processor concerned and the file affinity, and the tasks are dispatched or transferred to the processor in a pipelined way as illustrated in the example shown in Figure 5.
In a further embodiment, it is possible to consider grouping tasks on a grid composed of a set of clusters, as opposed to a grid composed of a set of processors. In this embodiment, the number of tasks assigned to each cluster will be calculated based on the requirement that the time needed by the cluster for processing each group is the same for each cluster. As before, this will depend on the processing speed of the cluster aggregate and on the internal structure of the particular cluster, such as the number of processors, load from other users, etc. Once the number of tasks to be assigned to each cluster is determined, the method proceeds substantially as described above.
Although the invention has been described by way of example and with reference to particular simplified or reduced-scope embodiments it is to be understood that modification and/or improvements may be made without departing from the scope of the appended claims. Embodiments of the invention are further intended to cover the task scheduling/grouping technique in its most general sense as specified in the claims regardless of the possible size of the solution space for the affinity determination. It is also noted that embodiments of the invention may be applied to the distribution of tasks among nodes in a grid system where the computational characteristics of such nodes may take a variety of forms. That is, node processing may take the form of numerical calculation, storage or any other form of processing which might be envisaged as part of distributed application execution. Further, embodiments of the present invention may be included in a broader scheduling system in the context of allocating information to generalized computing resources.
Where in the foregoing description reference has been made to integers or elements having known equivalents, then such equivalents are herein incorporated as if individually set forth.

Claims

Claims
1. A method of scheduling the running of an application on a plurality of computational units, said application comprising a plurality of tasks, each task having at least one input file associated therewith, the method including the steps of:
aggregating said plurality of tasks into one or more groups of tasks; and
allocating each group of tasks to a computational unit, wherein the plurality of tasks are aggregated so that tasks which share one or more input file are included in the same group.
2. A method as claimed in claim 1 wherein the number of groups of tasks is equal to the number of computational units.
3. A method as claimed in any preceding claim wherein the tasks are aggregated into the groups such that the time needed for processing each group is substantially the same for each computing unit.
4. A method as claimed in any preceding claim wherein the step of aggregating the plurality of tasks includes the step of determining the file affinity between pairs of tasks in respect of their input files, wherein for a set G of tasks composed of K tasks, $G = \{T_1, T_2, \ldots, T_K\}$, and the set F of Y input files needed by one or more tasks belonging to group G, $F = \{f_1, f_2, f_3, \ldots, f_Y\}$, the file affinity is defined by:

$$I_{aff} = \frac{\sum_{i=1}^{Y} (N_i - 1)\,|f_i|}{\sum_{i=1}^{Y} N_i\,|f_i|}$$

where $|f_i|$ is the size of file $f_i$ and $N_i$ is the number of tasks in group G which have file $f_i$ as an input file.
5. A method of scheduling tasks among a plurality of computing units including the following steps:
define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
- compute the size of each task;
- rank the task files in a list L in order of increasing size;
- for each group, beginning with the group with the largest number of tasks:
    - assign the smallest unassigned task file to the group;
    - set task file list position = 1;
    - until the group is completely populated by task files, do:
        - if (position + P ≤ size of list L) and (task file affinity(task file[position], task file[position+1]) < a specified value, k) then position = position + P;
        - else position = position + 1;
        - assign to the group the task file located at position in list L;
    - end do
    - remove assigned task files from list L;
    - set P = P - 1;
    - populate the next group.
6. A method of scheduling tasks among a plurality of computing units including the following steps:
A) define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
B) compute the size of each task;
C) rank the task files in a list L in order of increasing size,
D) for each group, beginning with the group with the largest number of tasks perform the following steps (a) to (e):
(a) assign the smallest unassigned task file to the group;

(b) set the task file list position index equal to 1;
(c) while the group is not completely populated by task files perform the following steps:
(i) if the position index plus P is less than or equal to the size of the list L, and the task file affinity between the task file at the position index and the task file at the position index +1 is less than a specified value k, then increment the position index by P;
otherwise increment position index by 1;
(ii) assign to the group, the task file located at position in list L
(d) Remove assigned task files from List L
(e) set P = P - 1
7. A method as claimed in claim 5 or 6 wherein the value k is selected to represent the desired level of association between task pairs according to the degree of sharing of input files.
8. A method as claimed in any of claims 5 to 7 wherein k is greater than or equal to substantially 0.5.
9. A method as claimed in any preceding claim wherein the maximum number of tasks to be assigned to each group is determined such that the time needed for processing each group on a corresponding computing unit is substantially the same for each computing unit.
10. A method as claimed in any preceding claim wherein the size of each task is calculated on the basis of the byte sum of all of the input files needed to execute each task on a computing unit.
11. A method of executing an application on a grid including the scheduling method as claimed in any preceding claim further including the step of dispatching the groups of tasks to corresponding computing units.
12. A method as claimed in claim 11 wherein the groups of tasks are dispatched in a manner which overlaps computation and communication.
13. A method as claimed in any preceding claim wherein the computing units correspond to processors.
14. A method as claimed in any preceding claim, wherein the computing units correspond to clusters.
15. A method as claimed in claim 14 wherein the clusters are composed of a plurality of computing resources.
16. A method as claimed in claim 15 wherein the computing resources include one or more processors.
17. A system configured to operate in accordance with the method as claimed in any of claims 1 to 16.
18. A computing device configured to operate in accordance with the method as claimed in any of claims 1 to 16.
19. A computing network adapted to operate in accordance with the method as claimed in any of claims 1 to 16.
20. A computer program adapted to perform the steps in the method as claimed in any of claims 1 to 16.
21. A data carrier adapted to store a computer program as claimed in claim 20.
22. A node computing device adapted to execute one or more tasks in accordance with the method as claimed in any of claims 1 to 16.
23. A master computing device adapted to schedule an application in accordance with the method as claimed in any of claims 1 to 16.
24. A computing grid adapted to operate in accordance with the method as claimed in any of claims 1 to 16.
25. A scheduling system for an aggregate of computational resources adapted to operate in accordance with the method as claimed in any of claims 1 to 16.
PCT/US2005/039439 2004-10-29 2005-10-28 Methods and apparatus for scheduling and running applications on computer grids WO2006050348A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0423988.5 2004-10-29
GB0423988A GB2419692A (en) 2004-10-29 2004-10-29 Organisation of task groups for a grid application

Publications (2)

Publication Number Publication Date
WO2006050348A2 true WO2006050348A2 (en) 2006-05-11
WO2006050348A3 WO2006050348A3 (en) 2007-05-31

Family

ID=33515733

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2005/039439 WO2006050348A2 (en) 2004-10-29 2005-10-28 Methods and apparatus for scheduling and running applications on computer grids

Country Status (2)

Country Link
GB (1) GB2419692A (en)
WO (1) WO2006050348A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040103413A1 (en) * 2002-11-27 2004-05-27 Sun Microsystems, Inc. Distributed process runner

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040103413A1 (en) * 2002-11-27 2004-05-27 Sun Microsystems, Inc. Distributed process runner

Also Published As

Publication number Publication date
WO2006050348A3 (en) 2007-05-31
GB2419692A (en) 2006-05-03
GB0423988D0 (en) 2004-12-01

Similar Documents

Publication Publication Date Title
CN103347055B (en) Task processing system in cloud computing platform, Apparatus and method for
CN104536937B (en) Big data all-in-one machine realization method based on CPU GPU isomeric groups
CN107291536B (en) Application task flow scheduling method in cloud computing environment
Convolbo et al. GEODIS: towards the optimization of data locality-aware job scheduling in geo-distributed data centers
WO2006050349A2 (en) Methods and apparatus for running applications on computer grids
Kaya et al. Heuristics for scheduling file-sharing tasks on heterogeneous systems with distributed repositories
Shih et al. Performance study of parallel programming on cloud computing environments using mapreduce
Kijsipongse et al. A hybrid GPU cluster and volunteer computing platform for scalable deep learning
Malik et al. Optimistic synchronization of parallel simulations in cloud computing environments
Yu et al. Algorithms for divisible load scheduling of data-intensive applications
Carretero et al. Mapping and scheduling HPC applications for optimizing I/O
Zhang et al. Meteor: Optimizing spark-on-yarn for short applications
Wang et al. MATRIX: MAny-Task computing execution fabRIc at eXascale
In et al. Sphinx: A scheduling middleware for data intensive applications on a grid
Ebrahimi et al. TPS: A task placement strategy for big data workflows
Senger Improving scalability of Bag-of-Tasks applications running on master–slave platforms
Banicescu et al. Addressing the stochastic nature of scientific computations via dynamic loop scheduling
Díaz et al. Derivation of self-scheduling algorithms for heterogeneous distributed computer systems: Application to internet-based grids of computers
Ko et al. New worker-centric scheduling strategies for data-intensive grid applications
Mohamed et al. DDOps: dual-direction operations for load balancing on non-dedicated heterogeneous distributed systems
Zhang et al. A distributed computing framework for All-to-All comparison problems
WO2006050348A2 (en) Methods and apparatus for scheduling and running applications on computer grids
Ghose et al. Computing BLAS level-2 operations on workstation clusters using the divisible load paradigm
Mamat et al. Scheduling real-time divisible loads with advance reservations
Zhu et al. Scheduling divisible loads in the dynamic heterogeneous grid environment

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KN KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

NENP Non-entry into the national phase in:

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 05815778

Country of ref document: EP

Kind code of ref document: A2

122 Ep: pct application non-entry in european phase

Ref document number: 05815778

Country of ref document: EP

Kind code of ref document: A2