WO2006050348A2 - Methods and apparatus for scheduling and running applications on computer grids - Google Patents

Methods and apparatus for scheduling and running applications on computer grids Download PDF

Info

Publication number
WO2006050348A2
WO2006050348A2 (PCT/US2005/039439)
Authority
WO
WIPO (PCT)
Prior art keywords
tasks
task
group
file
files
Prior art date
Application number
PCT/US2005/039439
Other languages
French (fr)
Other versions
WO2006050348A3 (en)
Inventor
Fabricio Alves Barbosa Da Silva
Silvia Regina De Carvalho
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P. filed Critical Hewlett-Packard Development Company, L.P.
Publication of WO2006050348A2 publication Critical patent/WO2006050348A2/en
Publication of WO2006050348A3 publication Critical patent/WO2006050348A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5072 Grid computing


Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Multi Processors (AREA)

Abstract

A method of running an application on a plurality of computational units (Fig. 1a, 11-14) is described. The application comprises a plurality of tasks, each task having at least one input file associated therewith. In one embodiment, the method includes the steps of aggregating said plurality of tasks into one or more groups of tasks and allocating each group of tasks to a computational unit, wherein the plurality of tasks are aggregated so that tasks which share one or more input files are included in the same group (Fig. 1a, 10).

Description

Methods and Apparatus for Scheduling and Running Applications on Computer Grids
Field of the invention
The invention relates to methods and apparatus for scheduling and executing applications on computer grids. More particularly, although not exclusively, the invention relates to methods and apparatus for scheduling the components, also known as tasks, of grid-based applications on computational units constituting a computational grid or cluster. Even more particularly, although not exclusively, the invention relates to scheduling tasks on heterogeneous distributed computational grids. The invention may be particularly suitable for scheduling sequential independent tasks, otherwise known as Bag-of-Tasks or Parameter Sweep applications, on computational grids.
Background to the Invention
A computational grid, or more simply 'grid', can be thought of as a collection of physically distributed, heterogeneous computational units, or nodes. The physical distribution of the grid nodes may range from immediate proximity to wide geographical distribution. Grid nodes may be either heterogeneous or homogeneous, with homogeneous grids differing primarily in that the nodes constituting such a grid provide an essentially uniform operating environment and computing capacity. Given that grids are often formed across administrative domains and over a wide range of hardware, homogeneous grids are considered a specific case of the general heterogeneous grid concept. The present invention contemplates both types.
The described embodiments of the present invention contemplate distributed networks of heterogeneous nodes which are desired to be treated as a unified computing resource.
Computational grids are usually built on top of specially designed middleware platforms known as grid platforms. Grid platforms enable the sharing, selection and aggregation of the variety of resources constituting the grid. These resources, which constitute the nodes of the grid, can include supercomputers, servers, workstations, storage systems, desktop systems and specialized devices that may be owned and operated by different organizations.
The described embodiment of the present invention is concerned with grid applications known as Bag-of-Tasks (BoT) applications. These types of applications can be decomposed into groups of tasks. Tasks for this type of grid application are characterized as being independent in that no communication is required between them while they are running and that there are no dependencies between tasks. That is, each task constituting an element of the grid application as a whole can be executed independently with its result contributing to the overall result of the grid-based computation. Examples of BoT applications include Monte Carlo simulations, massive searches, key breaking, image manipulation and data mining.
In this specification and the exemplary embodiments described therein, we will refer to a BoT application A as a set of T independent tasks, $A = \{T_1, T_2, \ldots, T_T\}$. The amount of computation involved with each task $T_i$ is generally predefined and may vary among the tasks of A. Note that the input for each task $T_i$ is one or more (input) files and the output one or more (output) files.
The present exemplary embodiment relates to clusters organized as a master-slave platform. According to this model, a master node is responsible for scheduling computation among the slave nodes and collecting the results. Other grid/cluster models are possible within the scope of the present embodiments of the invention, and may be capable of incorporating the execution/scheduling technique described herein with appropriate modification. For example, a further embodiment is described where the slave components are themselves clusters.
Grid platforms typically use a non-dedicated network infrastructure such as the internet for inter-node communication. In such a network environment, machine heterogeneity, long and/or variable network delays and variable processor loads and capacities are common. Since tasks belonging to a BoT application do not need to communicate with each other and can be executed independently, it is considered that BoT applications are particularly suitable for execution on such a grid infrastructure.
While heterogeneous grids have been found to be suitable for executing such applications, a significant problem is scalability. Under certain circumstances, it has been found that simply increasing the number of processors in the grid does not produce any additional decrease in overall application execution time. Specifically, it has been found that the execution of a grid application is guaranteed to be scalable only on a platform with up to $P_{eff}$ processors, where $P_{eff}$ is defined as the maximum number of slave processors needed to run an application with no idle periods on any slave processor.
Beyond this number of processors, the application execution time asymptotes to a fixed level and the application execution no longer scales. This is a significant barrier to the efficient execution of large and complex tasks on computational grids. Existing work in this area has proposed various heuristics for scheduling BoT applications on cluster and grid platforms. However, none of them take into account scalability issues related to the execution of fine-grain applications in master-slave platforms.
Disclosure of the invention
In one aspect, the invention provides for a method of scheduling the execution of an application on a plurality of computational units, said application comprising a plurality of tasks, each task having at least one input file associated therewith, the method including the steps of:
aggregating said plurality of tasks into one or more groups of tasks; and
allocating each group of tasks to a computational unit, wherein the plurality of tasks are aggregated so that tasks which share one or more input file are included in the same group.
This effectively groups associated files with each other, thereby introducing the concept of file affinity, based on the reduction of the amount of data that needs to be transferred to a computational unit when all tasks of a group are sent to that unit. This increases $P_{eff}$, thereby increasing the number of slave processors that can be used effectively and allowing more efficient scaling.
In a preferred embodiment, the number of groups of tasks is equal to the number of computational units.
The tasks are preferably aggregated into the groups so that the time needed for processing each group is substantially the same for each computing unit.
The step of aggregating the plurality of tasks may include the step of determining the file affinity between pairs of tasks in respect of their input files, wherein for a set G of tasks composed of K tasks, $G = \{T_1, T_2, \ldots, T_K\}$, and the set F of Y input files needed by one or more tasks belonging to group G, $F = \{f_1, f_2, f_3, \ldots, f_Y\}$.
The file affinity, or task file affinity, is preferably defined by:

$$I_{aff} = \frac{\sum_{i=1}^{Y} (N_i - 1)\,|f_i|}{\sum_{i=1}^{Y} N_i\,|f_i|}$$

where $|f_i|$ is the size of file $f_i$ and $N_i$ is the number of tasks in group G which have file $f_i$ as an input file.
In a preferred embodiment of the invention, the method of distributing the tasks constituting the application among a plurality of computing units may include the following steps:
- define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
- compute the size of each task;
- rank the task files in a list L in order of increasing size;
- for each group, beginning with the group with the largest number of tasks:
    - assign the smallest unassigned task file to the group;
    - set task file list position = 1;
    - until the group is completely populated by task files, do:
        - if (position + P ≤ size of list L) and (task file affinity(task file[position], task file[position+1]) < a specified value, k) then position = position + P;
        - else position = position + 1;
        - assign to the group the task file located at position in list L;
    - end do
    - remove assigned task files from list L;
    - set P = P - 1;
    - populate the next group.
According to a further embodiment of the invention, there is provided a method of scheduling tasks among a plurality of computing units including the following steps: A) define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
B) compute the size of each task;
C) rank the task files in a list L in order of increasing size,
D) for each group, beginning with the group with the largest number of tasks perform the following steps (a) to (e):
(a) assign the smallest unassigned task file to the group;
(b) set the task file list position index equal to 1;
(c) while the group is not completely populated by task files perform the following steps:
(i) if the position index plus P is less than or equal to the size of the list L, and the task file affinity between the task file at the position index and the task file at the position index +1 is less than a specified value k, then increment the position index by P;

otherwise increment the position index by 1;
(ii) assign to the group, the task file located at position in list L
(d) Remove assigned task files from List L
(e) set P = P - 1
As the rigorous equation for file affinity defined above is dominated by the number of possible ways of clustering n tasks in G groups, this preferred simplified embodiment allows a practical calculation or an approximation of the task file affinity.
The value k is preferably selected to represent the desired level of association between task pairs according to the degree of sharing of input files.
In a preferred embodiment, the number of tasks to be assigned to each group is determined such that the time needed for processing each group on a corresponding computing unit is substantially the same for each computing unit. In a preferred embodiment, the size of each task is calculated on the basis of the byte sum of all of the input files needed to execute each task on a computing unit.
In a preferred embodiment, tasks are aggregated into a specified group for values of k greater than 0.5.
The method may also include the further step of dispatching the groups of tasks to corresponding computing units, preferably in a manner which substantially overlaps computation and communication.
The computing units are preferably grid resources such as processors.
In an alternative embodiment, the computational units are one or more clusters.
Thus, the grid may be formed of one or more clusters, as opposed to being composed of a set of processors.
"According to this embodiment, the number-of-tasks assigned to each-cluster_wilLbe .calculated^ based on: the requirement that the time needed by the cluster for processing each group is the same for each cluster, the file affinity, and pipelining, where applied, proceeding substantially as hereinbefore defined.
Brief Description of the Drawings
The invention will now be described by way of exemplary embodiments only, with reference to the drawings, in which:
Figures Ia & b: illustrates a method of assigning tasks on a dedicated cluster according to a round-robin approach in accordance with an embodiment of the invention;
Figure 2: illustrates an embodiment of the invention having a master-slave node configuration;
Figure 3 : illustrates the results of an execution time simulation for a prior art homogeneous platform;
Figure 4: illustrates the results of an execution time simulation according to an embodiment of the invention; and
Figure 5: illustrates an embodiment of the invention whereby tasks are dispatched to processors in a pipelined manner. For the purposes of explanation, a simple execution model will be described initially in relation to a prior art technique for scheduling an application on a homogeneous grid. This will then be compared with an embodiment of the invention. The specific embodiment described herein relates to fine-grain Bag-of-Tasks applications on dedicated master-slave platform as shown in Figure 1. However, this is not to be construed as limiting and the invention may be applied to other computing contexts with suitable modification. Further classes of applications that may benefit from the invention are those composed of tasks with dependencies where sets of dependent tasks can be grouped and the groups are independent among themselves. The method may be modified slightly to group the tasks according to such dependencies.
Referring to Figure 2, the application A is composed of T homogeneous tasks; that is, $A = \{T_1, T_2, \ldots, T_T\}$. The master node (10) is responsible for organizing, scheduling, transmitting and receiving the tasks corresponding to the grid application. Referring to figure 1, each task goes through three phases during execution:
Initialization phase
This is the process whereby the files constituting the grid application and its data are sent from the master node (10) to the slave nodes (11-14) and the task is started. The duration of this phase is equal to $t_{init}$.
The set of files sent may include a parameter file corresponding to a specified task and an executable file which is charged with performing the computational task on the slave processor. The time in this phase includes the overhead incurred by the master node (10) to initiate a data transfer to a slave (11), for example, to initiate a TCP connection. For example, consider a task $i$ that needs to send two files to a slave node before execution. The time $t_{init}$ can then be computed as follows:
$$t_{init} = Lat_i + \frac{\sum_{j=1}^{2} File_j}{B}$$

where $Lat_i$ is the overhead incurred by the master node to initiate data transfer to the slave node (11-14), $\sum_j File_j$ is the total size in bytes of the input files that have to be transferred to the slave node, and $B$ is the data transfer rate. For simplicity, in this example it is assumed that each task has only one separate parameter file of the same size associated with it.
Computation Phase
In this phase, the task processes the parameter file at the slave node (11-14) and produces an output file. The duration of this phase is equal to $t_{comp}$. Any additional overhead related to the reception of input files by a slave node is also included in this phase.
Completion Phase
During this phase, the output file is sent back to the master node (10) and the task is terminated. The duration of this phase is $t_{end}$. This phase may require some processing at the master, mainly related to writing files to the file repository (not shown in (10)). This writing step may be deferred until disk resources are available and is therefore considered negligible. Thus the initialization phase of one slave can occur concurrently with the completion phase of another slave node.
The total execution time of a task is therefore:

$$t_{total} = t_{init} + t_{comp} + t_{end}$$
The exemplary embodiment described herein corresponds to a dedicated node cluster composed of P+1 homogeneous processors, where $T \gg P$. The additional processor is the master node (10). Communication between the master (10) and the slaves (11-14) is by way of a shared link and, in this embodiment, the master (10) can only send files through the network to a single slave at a given time. The communication link is full duplex. This embodiment corresponds to a one-port model whereby there are at most two communication processes involving a given master, one sending and one receiving. The one-port embodiment discussed herein is particularly suited to LAN network connections.
A slave node (11-14) is considered to be idle when it is not involved with the execution of any of the three phases of a task. Figure 1 shows the execution of a set of tasks in a system composed of three slave nodes where the scheduling algorithm is of a round-robin type.
The effective number of processors $P_{eff}$ is defined as the maximum number of slave processors needed to run an application with no idle periods on any slave processor. Taking into account the task and platform models of the particular embodiment described herein, a processor may have idle periods if: $t_{comp} + t_{end} < (P-1)\,t_{init}$
$P_{eff}$ is then given by the following equation:

$$P_{eff} = \left\lceil \frac{t_{comp} + t_{end}}{t_{init}} \right\rceil + 1$$
The total number of tasks to be executed on a given processor is at most:

$$\left\lceil \frac{T}{P_{eff}} \right\rceil$$

where T is the total number of tasks. For a platform with $P_{eff}$ processors, the total execution time, or makespan, will be:
$$t_{makespan} = \left\lceil \frac{T}{P_{eff}} \right\rceil (t_{init} + t_{comp} + t_{end}) + (P-1)\,t_{init}$$

The second term on the right side of this equation gives the time which is needed to start the first (P-1) tasks on the other P-1 processors. If the platform has more processors than $P_{eff}$, then the overall makespan is dominated by communication times between the master and the slaves. Then:

$$t_{makespan} = T \cdot t_{init} + t_{comp} + t_{end}$$
As there are idle periods on every processor, the following inequality holds:

$$(P-1)\,t_{init} > t_{comp} + t_{end}$$
This inequality applies primarily to two cases: a. for very large platforms (large P); and b. for applications with a large $t_{init}/t_{comp}$ ratio, such as fine-grain applications.
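As a quick numeric check of these relationships, the following sketch plugs in the parameter values used in the simulation of Figure 3 described below; the code itself is illustrative and not part of the patent:

```python
import math

t_init = 1.0        # initialization time per task (s)
t_comp_end = 8.0    # t_comp + t_end per task (s)
T = 800             # total number of tasks

# Effective number of processors: P_eff = ceil((t_comp + t_end) / t_init) + 1
p_eff = math.ceil(t_comp_end / t_init) + 1
print(p_eff)  # 9, matching Figure 3

# For P >= P_eff the makespan is communication-bound:
# t_makespan = T * t_init + t_comp + t_end
print(T * t_init + t_comp_end)  # 808.0 s, the plateau seen in Figure 3
```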
Thus, it can be seen that the execution of an application according to the prior art is guaranteed to be scalable only on a platform with up to $P_{eff}$ slave processors.
Beyond this, the idle time in slave nodes increases proportionally with the number of processors. The result of a simulation for such a scheduling system is shown in Figure 3. Here, $t_{init}=1$, $t_{comp}+t_{end}=8$ and $T=800$. The effective number of processors is 9 in Figure 3, and for $P \ge 9$ the overall makespan asymptotes to a constant level of 808 seconds. In accordance with an embodiment of the present invention, $t_{comp}$ is increased by grouping sets of tasks sharing common input files into a larger task. By doing so, it is possible to increase the effective number of processors, therefore increasing the number of slave processors that can be used effectively. The time corresponding to $t_{init}$ should ideally not increase in the same proportion as $t_{comp}$. Thus, in one embodiment, tasks which share one or more input files are selected and scheduled so as to run on a common slave node or processor.
This is achieved by introducing the concept of the file affinity which indicates the reduction in the amount of data that needs to be transferred to a remote node when all tasks of a group are sent to that node.
In this discussion it is assumed that the number of groups is equal to the number of nodes available. This is not however a limitation, and modifications to the scheduling method are viable to take into account different processor/group assignments. For example, for some specific sets of applications/platforms, the optimal execution in terms of the makespan will use a number of groups smaller than the total number of processors. Given a set G of tasks composed of K tasks, $G = \{T_1, T_2, \ldots, T_K\}$, and the set F of the Y input files needed by one or more tasks belonging to group G, $F = \{f_1, f_2, f_3, \ldots, f_Y\}$, the file affinity $I_{aff}$ is defined in one embodiment as follows:

$$I_{aff} = \frac{\sum_{i=1}^{Y} (N_i - 1)\,|f_i|}{\sum_{i=1}^{Y} N_i\,|f_i|}$$

where $|f_i|$ is the size in bytes of file $f_i$ and $N_i$ is the number of tasks in group G which have file $f_i$ as an input file, with $0 \le I_{aff} < 1$. An input file affinity of zero indicates that there is no sharing of files among tasks of a group. An input file affinity close to one means that all tasks of a group have a high degree of sharing of input files.
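A minimal sketch of this computation, assuming the $I_{aff}$ formula as reconstructed above; the function name and the representation of tasks as sets of input file names are illustrative:

```python
def file_affinity(tasks: list[set[str]], file_sizes: dict[str, int]) -> float:
    """I_aff for a group of tasks, each given as its set of input file names.

    Returns the fraction of bytes saved when every shared file is sent to the
    node once rather than once per task that needs it.
    """
    files = set().union(*tasks)  # F: all input files needed by the group
    total = saved = 0
    for f in files:
        n = sum(f in task for task in tasks)  # N_i: tasks using file f
        total += n * file_sizes[f]
        saved += (n - 1) * file_sizes[f]
    return saved / total if total else 0.0

# No sharing -> 0.0; three tasks sharing one file -> (3-1)/3 ~ 0.67
print(file_affinity([{"a"}, {"b"}, {"c"}], {"a": 10, "b": 10, "c": 10}))  # 0.0
print(file_affinity([{"a"}, {"a"}, {"a"}], {"a": 10}))                    # 0.666...
```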
The potential benefits of clustering tasks into groups for execution are illustrated in the results of the simulation shown in Figure 4. In accordance with one embodiment of the invention, it was assumed that all tasks share the same input files and that the same parameters as in the simulation shown in Figure 3 applied. As can be seen from the example, there is a reduction in total execution time for all values of the number of processors P, with a consistent reduction as the size of the grid platform is increased. Thus, in this embodiment of the invention, effective scaling is achieved. For the example shown in Figure 4, for a platform with 80 processors, the total execution time when tasks are grouped is 160s. Without grouping, the total time is 808s.
The equation above for file affinity is dominated by the combinatorial function whereby all possible pairs of tasks are considered. For large numbers of tasks, this can lead to very large numbers of combinations of task pairs. For example, there are N(25, 5) ways of clustering 25 tasks into 5 groups, which equates roughly to $10^{15}$ possible combinations. It may therefore be impractical to search exhaustively in the solution space for an optimal task grouping. For this reason, according to another preferred embodiment of the invention, there is provided a simplified heuristic for determining the optimal task grouping which is based on the general file affinity equation described above.
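Reading N(25, 5) as the number of ways to partition 25 tasks into 5 non-empty groups (a Stirling number of the second kind, which is an interpretation of the notation), a short sketch confirms the order of magnitude:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Ways to partition n distinct tasks into k non-empty groups."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(f"{stirling2(25, 5):.2e}")  # ~2.4e+15, i.e. roughly 10^15
```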
As a preliminary illustration of this simplified embodiment, consider a group of tasks, each of which requires a different input file. Because there is no input file sharing, there is no file affinity between them. It is desirable to start processing them on slave nodes as soon as possible to minimize $t_{init}$. Therefore, the tasks are transferred to slave nodes in size order from smallest to largest, with no account taken of sharing amongst input files (as this is zero).
If an application where all tasks share the same input file is considered, that input file only needs to be transferred once. This is taken into account by including the effect of file affinity. If the file affinity of two consecutive tasks (in size order) is very high, it is advantageous to assign those two tasks to the same processor instead of transferring the same set of input files twice over the network. In the ideal situation described here, this set of files is transferred only once to each processor or node of the network.
This simplified embodiment reduces the size of the possible solution space and provides a viable method of calculating the file affinities for tasks to within a workable level of accuracy. According to this embodiment, and taking into account file affinity, the simplified embodiment includes the following steps. Initially, for each computing unit or processor, the number of tasks to be aggregated into a group is defined for that computing unit. This is done so that the time needed for processing each group is substantially the same for each computing unit.

Then the total size of each task is calculated. Here, the size of each task corresponds to the sum of the input file sizes for the task concerned. For each group defined in the aggregation step, the required number of tasks is allocated to the group as a function of both the number of tasks determined previously and task affinity. The allocation step in a preferred embodiment is as follows. The reference to 'position' relates to the position of the task input file in the size-ordered list. The smallest task, task(position), is assigned to a first group. Then the file affinity of the pair task(position) and task(position+1) in the size-ordered list is determined. If the file affinity is greater than a specified value k, task(position+1) is assigned to the first group. If the file affinity is less than the specified value, task(position+1) is assigned to a subsequent group. This process is repeated, filling the groups sequentially in order until the group allocations determined in the initial step are populated with the size-ordered, associated tasks. This embodiment can be expressed in pseudocode as follows:
- define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
- compute the size of each task;
- rank the task files in a list L in order of increasing size;
- for each group, beginning with the group with the largest number of tasks:
    - assign the smallest unassigned task file to the group;
    - set task file list position = 1;
    - until the group is completely populated by task files, do:
        - if (position + P ≤ size of list L) and (task file affinity(task file[position], task file[position+1]) < a specified value, k) then position = position + P;
        - else position = position + 1;
        - assign to the group the task file at position in list L;
    - end do
    - remove assigned task files from list L;
    - set P = P - 1.
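The pseudocode above can be turned into a short runnable sketch. The following Python rendering is an interpretation: the function name, the pairwise `affinity` callback, and the even split of group sizes across homogeneous units are assumptions, not part of the patent text:

```python
from typing import Callable, List, Sequence

def group_tasks(
    sizes: Sequence[float],                 # byte sum of each task's input files
    num_units: int,                         # P, the number of computing units
    affinity: Callable[[int, int], float],  # pairwise file affinity of two tasks
    k: float = 0.5,                         # threshold for co-locating tasks
) -> List[List[int]]:
    """Group task indices according to the simplified heuristic (a sketch)."""
    # Rank the tasks in a list L in order of increasing size.
    L = sorted(range(len(sizes)), key=lambda t: sizes[t])
    # Homogeneous units: group sizes as even as possible, largest group first.
    q, r = divmod(len(L), num_units)
    counts = [q + 1] * r + [q] * (num_units - r)
    groups: List[List[int]] = []
    P = num_units
    for count in counts:
        group = [L[0]]  # assign the smallest unassigned task to the group
        pos = 0
        while len(group) < count:
            # Low affinity between neighbours: jump P ahead, so dissimilar
            # tasks are spread round-robin over the remaining units.
            if pos + P < len(L) and affinity(L[pos], L[pos + 1]) < k:
                pos += P
            else:  # high affinity: keep the next task on the same unit
                pos += 1
            group.append(L[pos])
        # Assigned tasks are removed from L only once the group is full,
        # matching the position arithmetic of the pseudocode.
        L = [t for t in L if t not in group]
        P -= 1
        groups.append(group)
    return groups
```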
An example application of this simplified heuristic is as follows. Consider a set of tasks composed of ten task files which are to be distributed on three homogeneous slave processors. The set of input files needed by each task is described as $\{f_1, \ldots, f_{10}\}$, where $f_i$ is a real value that corresponds to the byte sum of the input files needed by task $t_i$. As the tasks are heterogeneous, they will share no input files and the file affinity between any pair of tasks will be zero.
The 10 heterogeneous input file tasks are {20K, 35K, 44K, 80K, 102K, 110K, 200K, 300K, 400K, 450K}. Three groups of tasks are generated, one with 4 tasks and the others with 3 tasks. The simplified embodiment of the heuristic in the case of zero file affinity operates as follows. Each task is considered in size order. Thus, 20K is allocated to the first position of group 1. Then the 35K input task is allocated to the next group, following the principle that each group should minimize initial transmission or initialization time. Task 44K is allocated to the third group. Task 80K is then allocated to position two of the first group, 102K to the second position of group two, and so on. This produces the groups of files as follows: {20K, 80K, 200K, 450K}, {35K, 102K, 300K} and {44K, 110K, 400K}. To a first approximation this keeps the amount of transmitted data similar for each group and allows the task transmission/calculation to be pipelined in a reasonably efficient manner. In a preferred embodiment, the transfer of the files occurs in a pipelined manner, i.e., where computation is overlapped with communication. Figure 5 illustrates the pipelined transfer of input files from a master to three slave processors. As can be seen in this example, the transfers to and from the master/nodes are staggered, with the computation on the slaves being overlapped with the communication phase on one or more of the other processor nodes. This reduces $t_{init}$ when executing a group of tasks on a slave processor.
Introducing the file affinity threshold k allows a balance to be struck between grouping tasks of different sizes into one task group for transmission and grouping tasks with sufficient file affinity or commonality. Thus this degree of balance may be altered by setting the file affinity threshold for associating input files.
Another example is that of 10 homogeneous tasks with ten completely homogeneous sets of input files {30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K}. Again, three groups of tasks are generated, one with four tasks and the others with three. As the tasks are completely homogeneous, each pair will have a file affinity of approximately 1. Thus, following the simplified embodiment of the heuristic, the three groups of input files will be {30K}, {30K}, and {30K}.
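Running the two extreme examples above through the earlier `group_tasks` sketch reproduces the stated groupings (sizes in KB; the constant affinity callbacks are stand-ins for the two extremes):

```python
# Zero affinity: ten heterogeneous tasks sharing no input files.
sizes = [20, 35, 44, 80, 102, 110, 200, 300, 400, 450]  # KB
groups = group_tasks(sizes, num_units=3, affinity=lambda a, b: 0.0)
print([[sizes[t] for t in g] for g in groups])
# -> [[20, 80, 200, 450], [35, 102, 300], [44, 110, 400]]

# Full affinity: ten homogeneous tasks sharing the same 30K input file.
groups = group_tasks([30] * 10, num_units=3, affinity=lambda a, b: 1.0)
print([len(g) for g in groups])  # -> [4, 3, 3]; the 30K file is sent once per group
```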
These two extreme examples serve to illustrate how the task grouping may be performed in the simplified embodiment.
In terms of implementing this method as noted above, the number of tasks to be assigned to each group is determined such that the time needed for processing each group is substantially the same for each computing unit. This will depend on each processor's relative speed, based on the average speed of the processors in the cluster. For example, if the relative speed of a particular node processor is 1.0 compared to the average speed of the cluster nodes, the maximum number of tasks to be assigned to that processor will be $\lceil T/P \rceil$.
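A small sketch of this sizing step, assuming tasks are allotted in proportion to relative speed (the patent states only the equal-speed case explicitly; the proportional rule and function name here are illustrative):

```python
import math

def group_counts(total_tasks: int, speeds: list[float]) -> list[int]:
    """Tasks per computing unit, proportional to relative speed (a sketch).

    With equal speeds this reduces to ceil(T/P) / floor(T/P): for example,
    10 tasks on 3 equal units -> [4, 3, 3].
    """
    weight = sum(speeds)
    raw = [total_tasks * s / weight for s in speeds]
    counts = [math.floor(x) for x in raw]
    # Hand the remaining tasks to the units with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[: total_tasks - sum(counts)]:
        counts[i] += 1
    return counts

print(group_counts(10, [1.0, 1.0, 1.0]))  # [4, 3, 3]
print(group_counts(10, [2.0, 1.0, 1.0]))  # faster unit gets more tasks: [5, 3, 2]
```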
The size of each task is calculated on the basis of the byte sum of all of the input files needed to execute each task on a computing unit. The file affinity threshold may usefully be defined as k, for which an affinity of 0.5 is considered acceptable as a benchmark for grouping tasks into a specified group. Essentially, this equates to setting the minimum degree of 'association' which is necessary to consider two tasks as related or sharing input files. This ensures that the file affinity is maximized within a group, so that sending similar sets of files to multiple processors is avoided. As noted above, if the next set of files is different enough (i.e., has a file affinity with a previously allocated task less than the minimum), that task will be located at the next processor position. Firstly, this is done so that tasks with the smallest byte sum are sent initially. Secondly, this is done to guarantee that the groups are as uniform as possible in respect of the number of bytes that need to be transmitted from the master node. Thus, in a preferred embodiment, at initialization of the algorithm, the number of tasks is allocated to each processor based on the processing power of the processor concerned and the file affinity, and the tasks are dispatched or transferred to the processor in a pipelined way as illustrated in the example shown in Figure 5.
In a further embodiment, it is possible to consider grouping tasks on a grid composed of a set of clusters, as opposed to a grid composed of a set of processors. In this embodiment, the number of tasks assigned to each cluster will be calculated based on the requirement that the time needed by the cluster for processing each group is the same for each cluster. As before, this will depend on the processing speed of the cluster aggregate and on the internal structure of the particular cluster, such as the number of processors, load from other users, etc. Once the number of tasks to be assigned to each cluster is determined, the method proceeds substantially as described above.
Although the invention has been described by way of example and with reference to particular simplified or reduced-scope embodiments it is to be understood that modification and/or improvements may be made without departing from the scope of the appended claims. Embodiments of the invention are further intended to cover the task scheduling/grouping technique in its most general sense as specified in the claims regardless of the possible size of the solution space for the affinity determination. It is also noted that embodiments of the invention may be applied to the distribution of tasks among nodes in a grid system where the computational characteristics of such nodes may take a variety of forms. That is, node processing may take the form of numerical calculation, storage or any other form of processing which might be envisaged as part of distributed application execution. Further, embodiments of the present invention may be included in a broader scheduling system in the context of allocating information to generalized computing resources.
Where in the foregoing description reference has been made to integers or elements having known equivalents, then such equivalents are herein incorporated as if individually set forth.

Claims

Claims
1. A method of scheduling the running of an application on a plurality of computational units, said application comprising a plurality of tasks, each task having at least one input file associated therewith, the method including the steps of:
aggregating said plurality of tasks into one or more groups of tasks; and
allocating each group of tasks to a computational unit, wherein the plurality of tasks are aggregated so that tasks which share one or more input file are included in the same group.
2. A method as claimed in claim 1 wherein the number of groups of tasks is equal to the number of computational units.
3. A method as claimed in any preceding claim wherein the tasks are aggregated into the groups such that the time needed for processing each group is substantially the same for each computing unit.
4. A method as claimed in any preceding claim wherein the step of aggregating the plurality of tasks includes the step of determining the file affinity between pairs of tasks in respect of their input files, wherein for a set G of tasks composed of K tasks, $G = \{T_1, T_2, \ldots, T_K\}$, and the set F of Y input files needed by one or more tasks belonging to group G, $F = \{f_1, f_2, f_3, \ldots, f_Y\}$, the file affinity is defined by:

$$I_{aff} = \frac{\sum_{i=1}^{Y} (N_i - 1)\,|f_i|}{\sum_{i=1}^{Y} N_i\,|f_i|}$$

where $|f_i|$ is the size of file $f_i$ and $N_i$ is the number of tasks in group G which have file $f_i$ as an input file.
5. A method of scheduling tasks among a plurality of computing units including the following steps:
define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
- compute the size of each task;
- rank the task files in a list L in order of increasing size;
- for each group, beginning with the group with the largest number of tasks:
    - assign the smallest unassigned task file to the group;
    - set task file list position = 1;
    - until the group is completely populated by task files, do:
        - if (position + P ≤ size of list L) and (task file affinity(task file[position], task file[position+1]) < a specified value, k) then position = position + P;
        - else position = position + 1;
        - assign to the group the task file located at position in list L;
    - end do
    - remove assigned task files from list L;
    - set P = P - 1;
    - populate the next group.
6. A method of scheduling tasks among a plurality of computing units including the following steps:
A) define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
B) compute the size of each task;
C) rank the task files in a list L in order of increasing size,
D) for each group, beginning with the group with the largest number of tasks perform the following steps (a) to (e):
(a) assign the smallest unassigned task file to the group;

(b) set the task file list position index equal to 1;
(c) while the group is not completely populated by task files perform the following steps:
(i) if the position index plus P is less than or equal to the size of the list L, and the task file affinity between the task file at the position index and the task file at the position index +1 is less than a specified value k, then increment the position index by P;
otherwise increment position index by 1;
(ii) assign to the group, the task file located at position in list L
(d) Remove assigned task files from List L
(e) set P = P - 1
7. A method as claimed in claim 5 or 6 wherein the value k is selected to represent the desired level of association between task pairs according to the degree of sharing of input files.
8. A method as claimed in any of claims 5 to 7 wherein k is greater than or equal to substantially 0.5.
9. A method as claimed in any preceding claim wherein the maximum number of tasks to be assigned to each group is determined such that the time needed for processing each group on a corresponding computing unit is substantially the same for each computing unit.
10. A method as claimed in any preceding claim wherein the size of each task is calculated on the basis of the byte sum of all of the input files needed to execute each task on a computing unit.
11. A method of executing an application on a grid including the scheduling method as claimed in any preceding claim further including the step of dispatching the groups of tasks to corresponding computing units.
12. A method as claimed in claim 11 wherein the groups of tasks are dispatched in a manner which overlaps computation and communication.
13. A method as claimed in any preceding claim wherein the computing units correspond to processors.
14. A method as claimed in any preceding claim, wherein the computing units correspond to clusters.
15. A method as claimed in claim 14 wherein the clusters are composed of a plurality of computing resources.
16. A method as claimed in claim 15 wherein the computing resources include one or more processors.
17. A system configured to operate in accordance with the method as claimed in any of claims 1 to 16.
18. A computing device configured to operate in accordance with the method as claimed in any of claims 1 to 16.
19. A computing network adapted to operate in accordance with the method as claimed in any of claims 1 to 16.
20. A computer program adapted to perform the steps in the method as claimed in any of claims 1 to 16.
21. A data carrier adapted to store a computer program as claimed in claim 20.
22. A node computing device adapted to execute one or more tasks in accordance with the method as claimed in any of claims 1 to 16.
23. A master computing device adapted to schedule an application in accordance with the method as claimed in any of claims 1 to 16.
24. A computing grid adapted to operate in accordance with the method as claimed in any of claims 1 to 16.
25. A scheduling system for an aggregate of computational resources adapted to operate in accordance with the method as claimed in any of claims 1 to 16.
PCT/US2005/039439 2004-10-29 2005-10-28 Methods and apparatus for scheduling and running applications on computer grids WO2006050348A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0423988.5 2004-10-29
GB0423988A GB2419692A (en) 2004-10-29 2004-10-29 Organisation of task groups for a grid application

Publications (2)

Publication Number Publication Date
WO2006050348A2 true WO2006050348A2 (en) 2006-05-11
WO2006050348A3 WO2006050348A3 (en) 2007-05-31

Family

ID=33515733

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2005/039439 WO2006050348A2 (en) 2004-10-29 2005-10-28 Methods and apparatus for scheduling and running applications on computer grids

Country Status (2)

Country Link
GB (1) GB2419692A (en)
WO (1) WO2006050348A2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040103413A1 (en) * 2002-11-27 2004-05-27 Sun Microsystems, Inc. Distributed process runner

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040103413A1 (en) * 2002-11-27 2004-05-27 Sun Microsystems, Inc. Distributed process runner

Also Published As

Publication number Publication date
WO2006050348A3 (en) 2007-05-31
GB2419692A (en) 2006-05-03
GB0423988D0 (en) 2004-12-01

Similar Documents

Publication Publication Date Title
CN103347055B (en) Task processing system in cloud computing platform, Apparatus and method for
CN104536937B (en) Big data all-in-one machine realization method based on CPU GPU isomeric groups
CN107291536B (en) Application task flow scheduling method in cloud computing environment
Convolbo et al. GEODIS: towards the optimization of data locality-aware job scheduling in geo-distributed data centers
WO2006050349A2 (en) Methods and apparatus for running applications on computer grids
Kaya et al. Heuristics for scheduling file-sharing tasks on heterogeneous systems with distributed repositories
Shih et al. Performance study of parallel programming on cloud computing environments using mapreduce
Kijsipongse et al. A hybrid GPU cluster and volunteer computing platform for scalable deep learning
Malik et al. Optimistic synchronization of parallel simulations in cloud computing environments
Yu et al. Algorithms for divisible load scheduling of data-intensive applications
Carretero et al. Mapping and scheduling HPC applications for optimizing I/O
Zhang et al. Meteor: Optimizing spark-on-yarn for short applications
Wang et al. MATRIX: MAny-Task computing execution fabRIc at eXascale
In et al. Sphinx: A scheduling middleware for data intensive applications on a grid
Ebrahimi et al. TPS: A task placement strategy for big data workflows
Senger Improving scalability of Bag-of-Tasks applications running on master–slave platforms
Banicescu et al. Addressing the stochastic nature of scientific computations via dynamic loop scheduling
Díaz et al. Derivation of self-scheduling algorithms for heterogeneous distributed computer systems: Application to internet-based grids of computers
Ko et al. New worker-centric scheduling strategies for data-intensive grid applications
Mohamed et al. DDOps: dual-direction operations for load balancing on non-dedicated heterogeneous distributed systems
Zhang et al. A distributed computing framework for All-to-All comparison problems
WO2006050348A2 (en) Methods and apparatus for scheduling and running applications on computer grids
Ghose et al. Computing BLAS level-2 operations on workstation clusters using the divisible load paradigm
Mamat et al. Scheduling real-time divisible loads with advance reservations
Zhu et al. Scheduling divisible loads in the dynamic heterogeneous grid environment

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KN KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

NENP Non-entry into the national phase in:

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 05815778

Country of ref document: EP

Kind code of ref document: A2

122 Ep: pct application non-entry in european phase

Ref document number: 05815778

Country of ref document: EP

Kind code of ref document: A2