GB2419692A - Organisation of task groups for a grid application - Google Patents

Organisation of task groups for a grid application

Info

Publication number
GB2419692A
Authority
GB
United Kingdom
Prior art keywords
tasks
task
group
file
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0423988A
Other versions
GB0423988D0 (en)
Inventor
Fabricio Alves Da Silva
Silvia Regina De Carvalho
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Development Co LP
Original Assignee
Hewlett Packard Development Co LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Development Co LP filed Critical Hewlett Packard Development Co LP
Priority to GB0423988A priority Critical patent/GB2419692A/en
Publication of GB0423988D0 publication Critical patent/GB0423988D0/en
Priority to PCT/US2005/039439 priority patent/WO2006050348A2/en
Publication of GB2419692A publication Critical patent/GB2419692A/en
Withdrawn legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Multi Processors (AREA)

Abstract

A method of organising tasks into groups for dispatch to nodes in a grid computing environment wherein the organisation step comprises aggregating a number of tasks into one or more groups and allocating them to a computational unit or node such that tasks which share a common input file are allocated to the same group. Preferably the number of groups is equal to the number of nodes. The tasks may also be aggregated such that the time taken to complete the tasks in a group is substantially the same. The aggregation step may also include determining the file affinity between pairs of tasks in respect of their input files. The allocation method may involve listing the tasks in increasing size order and allocating the smallest to the group with the largest number of tasks as part of an iterative process until no tasks remain unallocated. The size of a task may be determined by the byte size of the associated input file.

Description

Methods and Apparatus for Scheduling and Running Applications on Computer Grids
Field of the invention
The invention relates to methods and apparatus for scheduling and executing applications on computer grids. More particularly, although not exclusively, the invention relates to methods and apparatus for scheduling the components, also known as tasks, of grid-based applications on the computational units constituting a computational grid or cluster. Even more particularly, although not exclusively, the invention relates to scheduling tasks on heterogeneous distributed computational grids. The invention may be particularly suitable for scheduling sequential independent tasks, otherwise known as Bag-of-Tasks or Parameter Sweep applications, on computational grids.
Background to the Invention
A computational grid, or more simply 'grid', can be thought of as a collection of physically distributed, heterogeneous computational units, or nodes. The physical distribution of the grid nodes may range from immediate proximity to wide geographical distribution. Grid nodes may be either heterogeneous or homogeneous, with homogeneous grids differing primarily in that the nodes constituting such a grid provide an essentially uniform operating environment and computing capacity. Given the operational characteristics of grids as often being formed across administrative domains and over a wide range of hardware, homogeneous grids are considered a specific case of the general heterogeneous grid concept. The present invention contemplates both types.
The described embodiments of the present invention contemplate distributed networks of heterogeneous nodes which are desired to be treated as a unified computing resource.
Computational grids are usually built on top of specially designed middleware platforms known as grid platforms. Grid platforms enable the sharing, selection and aggregation of the variety of resources constituting the grid. These resources, which constitute the nodes of the grid, can include supercomputers, servers, workstations, storage systems, desktop systems and specialized devices that may be owned and operated by different organizations.
The described embodiment of the present invention is concerned with grid applications known as Bag-of-Tasks (BoT) applications. These types of applications can be decomposed into groups of tasks. Tasks for this type of grid application are characterized as being independent, in that no communication is required between them while they are running and there are no dependencies between tasks. That is, each task constituting an element of the grid application as a whole can be executed independently, with its result contributing to the overall result of the grid-based computation. Examples of BoT applications include Monte Carlo simulations, massive searches, key breaking, image manipulation and data mining.
In this specification and the exemplary embodiments described herein, we will refer to a BoT application A as being composed of T tasks, A = {T_i}, i = 1, ..., T. The amount of computation involved with each task T_i is generally predefined and may vary among the tasks of A. Note that the input for each task T_i is one or more (input) files and the output is one or more (output) files.
The present exemplary embodiment relates to clusters organized as a master-slave platform.
According to this model, a master node is responsible for scheduling computation among the slave nodes and collecting the results. Other grid/cluster models are possible within the scope of the present embodiments of the invention, and may be capable of incorporating the execution/scheduling technique described herein with appropriate modification. For example, a further embodiment is described where the slave components are themselves clusters.
Grid platforms typically use a non-dedicated network infrastructure such as the internet for inter- node communication. In such a network environment, machine heterogeneity, long and/or variable network delays and variable processor loads and capacities are common. Since tasks belonging to a BoT application do not need to communicate with each other and can be executed independently, it is considered that BoT applications are particularly suitable for execution on such a grid infrastructure.
While heterogeneous grids have been found to be suitable for executing such applications, a significant problem is scalability. Under certain circumstances, it has been found that simply increasing the number of processors in the grid does not produce any additional decrease in overall application execution time. Specifically, it has been found that the execution of a grid application is guaranteed to be scalable only on a platform with up to P_eff processors, where P_eff is defined as the maximum number of slave processors needed to run an application with no idle periods on any slave processor.
Beyond this number of processors, the application execution time asymptotes to a fixed level and the application execution no longer scales. This is a significant barrier to the efficient execution of large and complex tasks on computational grids.
Existing work in this area has proposed various heuristics for scheduling BoT applications on cluster and grid platforms. However, none of them takes into account scalability issues related to the execution of fine-grain applications on master-slave platforms.
Disclosure of the invention
In one aspect, the invention provides for a method of scheduling the execution of an application on a plurality of computational units, said application comprising a plurality of tasks, each task having at least one input file associated therewith, the method including the steps of: aggregating said plurality of tasks into one or more groups of tasks; and allocating each group of tasks to a computational unit, wherein the plurality of tasks are aggregated so that tasks which share one or more input files are included in the same group.
This effectively groups associated files with each other, introducing the concept of file affinity, which is based on the reduction of the amount of data that needs to be transferred to a computational unit when all tasks of a group are sent to that unit. This increases P_eff, thereby increasing the number of slave processors that can be used effectively and allowing more efficient scaling.
In a preferred embodiment, the number of groups of tasks is equal to the number of computational units.
The tasks are preferably aggregated into the groups so that the time needed for processing each group is substantially the same for each computing unit.
The step of aggregating the plurality of tasks may include the step of determining the file affinity between pairs of tasks in respect of their input files, wherein for a set G of tasks composed of K tasks, G = {T_1, T_2, ..., T_K}, and the set F of Y input files needed by one or more tasks belonging to group G, F = {f_1, f_2, f_3, ..., f_Y}.
The file affinity, or task file affinity, is preferably defined by:

$$I_{aff}(G) = \frac{\sum_{j=1}^{Y} (N_j - 1)\,|f_j|}{\sum_{j=1}^{Y} N_j\,|f_j|}$$

where |f_j| is the size of file f_j and N_j is the number of tasks in group G which have file f_j as an input file.
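By way of illustration only (the function name and data layout below are assumptions, not part of the specification), the file affinity of a group may be computed along the following lines:

    def file_affinity(group, file_sizes):
        # group: list of tasks, each task given as a set of input-file names
        # file_sizes: dict mapping file name -> size in bytes
        # Returns I_aff(G) in [0, 1): 0 means no sharing of input files,
        # values near 1 mean a high degree of sharing within the group.
        counts = {}  # N_j: number of tasks in G using file f_j as input
        for task in group:
            for f in task:
                counts[f] = counts.get(f, 0) + 1
        num = sum((n - 1) * file_sizes[f] for f, n in counts.items())
        den = sum(n * file_sizes[f] for f, n in counts.items())
        return num / den if den else 0.0

For instance, two tasks whose input is the same single file give (2 - 1)|f| / (2|f|) = 0.5, which matches the grouping benchmark of 0.5 used later in the specification.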
In a preferred embodiment of the invention, the method of distributing the tasks constituting the application among a plurality of computing units may include the following steps:
- define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
- compute the size of each task;
- rank the task files in a list L in order of increasing size;
- for each group, beginning with the group with the largest number of tasks:
  - assign the smallest unassigned task file to the group;
  - set task file list position = 1;
  - until the group is completely populated by task files do:
    - if (position + P <= size of list L) and (task file affinity(task file[position], task file[position+1]) < a specified value, k) then position = position + P;
    - else increment position = position + 1;
    - assign to the group the task file located at position in list L;
  - enddo
  - remove assigned task files from list L;
  - decrement P = P - 1;
  - populate the next group.

According to a further embodiment of the invention, there is provided a method of scheduling tasks among a plurality of computing units including the following steps:
A) define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
B) compute the size of each task;
C) rank the task files in a list L in order of increasing size;
D) for each group, beginning with the group with the largest number of tasks, perform the following steps (a) to (e):
(a) assign the smallest unassigned task file to the group;
(b) set the task file list position index equal to 1;
(c) while the group is not completely populated by task files, perform the following steps:
(i) if the position index plus P is less than or equal to the size of the list L, and the task file affinity between the task file at the position index and the task file at the position index + 1 is less than a specified value k, then increment the position index by P; otherwise increment the position index by 1;
(ii) assign to the group the task file located at the position index in list L;
(d) remove assigned task files from list L;
(e) decrement P = P - 1.

As the rigorous equation for file affinity defined above is dominated by the number of possible ways of clustering n tasks into G groups, this preferred simplified embodiment allows a practical calculation or approximation of the task file affinity.
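The following sketch gives one possible reading of these steps in Python (the names, and the exact index handling, are interpretive assumptions rather than a definitive implementation). Tasks are kept in a size-ordered list; when neighbouring task files have affinity below the threshold k, the position index jumps ahead by P so that dissimilar tasks are spread round-robin across the remaining groups:

    def group_tasks(tasks, sizes, affinity, group_sizes, k=0.5):
        # tasks: task identifiers
        # sizes: dict task -> byte sum of the task's input files
        # affinity(a, b): file affinity of a pair of tasks, in [0, 1)
        # group_sizes: tasks per group, largest group first (one entry per
        # computing unit); assumed consistent with the number of tasks
        L = sorted(tasks, key=lambda t: sizes[t])   # increasing size
        P = len(group_sizes)                        # groups left to fill
        groups = []
        for count in group_sizes:
            pos = 0                                 # smallest unassigned task
            picked = [pos]
            while len(picked) < count:
                if pos + P < len(L) and affinity(L[pos], L[pos + 1]) < k:
                    pos += P        # low affinity: skip P entries ahead
                else:
                    pos += 1        # high affinity: keep neighbours together
                picked.append(pos)
            groups.append([L[i] for i in picked])
            L = [t for i, t in enumerate(L) if i not in picked]
            P -= 1
        return groups

With zero affinity everywhere this degenerates to the round-robin, size-ordered allocation described in the examples later in the specification; with affinity at or above k for every pair it fills each group with consecutive tasks from the size-ordered list.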
The value k is preferably selected to represent the desired level of association between task pairs according to the degree of sharing of input files.
In a preferred embodiment, the number of tasks to be assigned to each group is determined such that the time needed for processing each group on a corresponding computing unit is substantially the same for each computing unit.
In a preferred embodiment, the size of each task is calculated on the basis of the byte sum of all of the input files needed to execute each task on a computing unit.
In a preferred embodiment, tasks are aggregated into a specified group for values of k greater than 0.5.
The method may also include the further step of dispatching the groups of tasks to corresponding computing units, preferably in a manner which substantially overlaps computation and communication.
The computing units are preferably grid resources such as processors.
In an alternative embodiment, the computational units are one or more clusters.
Thus, the grid may be formed of one or more clusters, as opposed to being composed of a set of processors.
According to this embodiment, the number of tasks assigned to each cluster will be calculated based on: the requirement that the time needed by the cluster for processing each group is the same for each cluster, the file affinity, and pipelining, where applied, proceeding substantially as hereinbefore defined.
Brief Description of the Drawings
The invention will now be described by way of exemplary embodiments only, with reference to the drawings, in which:
Figure 1: illustrates a method of assigning tasks on a dedicated cluster according to a round-robin approach in accordance with an embodiment of the invention;
Figure 2: illustrates an embodiment of the invention having a master-slave node configuration;
Figure 3: illustrates the results of an execution time simulation for a prior art homogeneous platform;
Figure 4: illustrates the results of an execution time simulation according to an embodiment of the invention; and
Figure 5: illustrates an embodiment of the invention whereby tasks are dispatched to processors in a pipelined manner.
For the purposes of explanation, a simple execution model will be described initially in relation to a prior art technique for scheduling an application on a homogeneous grid. This will then be compared with an embodiment of the invention. The specific embodiment described herein relates to fine-grain Bag-of-Tasks applications on a dedicated master-slave platform as shown in Figure 1. However, this is not to be construed as limiting and the invention may be applied to other computing contexts with suitable modification. Further classes of applications that may benefit from the invention are those composed of tasks with dependencies, where sets of dependent tasks can be grouped and the groups are independent among themselves. The method may be modified slightly to group the tasks according to such dependencies.
Referring to Figure 2, the application A is composed of T homogeneous tasks. That is: A = {T_1, T_2, ..., T_T}. The master node (10) is responsible for organizing, scheduling, transmitting and receiving the tasks corresponding to the grid application. Referring to Figure 1, each task goes through three phases during execution: Initialization phase: This is the process whereby the files constituting the grid application and its data are sent from the master node (10) to the slave nodes (11-14) and the task is started. The duration of this phase is equal to t_init.
The set of files sent may include a parameter file corresponding to a specified task and an executable file which is charged with performing the computational task on the slave processor.
The time in this phase includes the overhead incurred by the master node (10) to initiate a data transfer to a slave (11), for example, to initiate a TCP connection. For example, consider a task i that needs to send two files to a slave node before execution. The time t_init for slave s can then be computed as follows:

$$t_{init}^{s} = Lat_s + \frac{\sum_{j} |file_j|}{B}$$

where Lat_s is the overhead incurred by the master node to initiate data transfer to the slave node (11-14), the sum of |file_j| is the total size in bytes of the input files that have to be transferred to slave node s, and B is the data transfer rate. For simplicity, in this example it is assumed that each task has only one separate parameter file of the same size associated with it.
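As a purely illustrative calculation with assumed numbers (not taken from the simulations below): for Lat_s = 0.1 s, two input files of 600 KB and 400 KB, and B = 1000 KB/s, t_init = 0.1 + (600 + 400)/1000 = 1.1 s.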
Computation Phase: In this phase, the task processes the parameter file at the slave node (11-14) and produces an output file. The duration of this phase is equal to t_comp. Any additional overhead related to the reception of input files by a slave node is also included in this phase.
Completion Phase: During this phase, the output file is sent back to the master node (10) and the task T_i is terminated.
The duration of this phase is t_end. This phase may require some processing at the master, mainly related to writing files at the file repository (not shown). This writing step may be deferred until disk resources are available; it is therefore considered negligible. Thus the initialization phase of one slave can occur concurrently with the completion phase of another slave node.
The total execution time of a task is therefore:

$$t_{task} = t_{init} + t_{comp} + t_{end}$$

The exemplary embodiment described herein corresponds to a dedicated node cluster composed of P+1 homogeneous processors, where T >> P. The additional processor is the master node (10). Communication between the master (10) and slaves (11-14) is by way of a shared link and, in this embodiment, the master (10) can only send files through the network to a single slave at a given time. The communication link is full duplex. This embodiment corresponds to a one-port model whereby there are at most two communication processes involving a given master, one send and one receive. The one-port embodiment discussed herein is particularly suited to LAN network connections.
A slave node (11-14) is considered to be idle when it is not involved with the execution of any of the three phases of a task. Figure 1 shows the execution of a set of tasks in a system composed of three slave nodes where the scheduling algorithm is of a round-robin type.
The effective number of processors P_eff is defined as the maximum number of slave processors needed to run an application with no idle periods on any slave processor. Taking into account the task and platform models of the particular embodiment described herein, a processor may have idle periods if:

$$t_{comp} + t_{end} < (P - 1)\,t_{init}$$

P_eff is then given by the following equation:

$$P_{eff} = \left\lfloor \frac{t_{comp} + t_{end}}{t_{init}} \right\rfloor + 1$$

The total number of tasks to be executed on a given processor is at most:

$$M = \left\lceil \frac{T}{P} \right\rceil$$

where T is the total number of tasks. For a platform with P_eff processors, the total execution time, or makespan, will be:

$$t_{make} = M\,(t_{init} + t_{comp} + t_{end}) + (P - 1)\,t_{init}$$

The second term on the right side of this equation gives the time needed to start the first P-1 tasks on the other P-1 processors. If the platform has more processors than P_eff, then the overall makespan is dominated by communication times between the master and the slaves. Then:

$$t_{make} = T\,t_{init} + t_{comp} + t_{end}$$

As there are idle periods on every processor, the following inequality holds:

$$(P - 1)\,t_{init} > t_{comp} + t_{end}$$

This inequality applies primarily to two cases: a. for very large platforms (P large); and b. for applications with a large t_init/t_comp ratio, such as fine-grain applications.
Thus, it can be seen that the execution of an application according to the prior art is guaranteed to be scalable only on a platform with up to P_eff slave processors.
Beyond this, the idle time in slave nodes increases proportionally with the number of processors.
The result of a simulation for such a scheduling system is shown in Figure 3. Here, t_init = 1, t_comp + t_end = 8, and T = 800. The effective number of processors in Figure 3 is 9, and for P >= 9 the overall makespan asymptotes to a constant level of 808 seconds.
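These figures are consistent with the expressions above: P_eff = floor((t_comp + t_end)/t_init) + 1 = floor(8/1) + 1 = 9, and for P >= P_eff the makespan settles at t_make = T*t_init + t_comp + t_end = 800*1 + 8 = 808 seconds.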
In accordance with an embodiment of the present invention, t_comp is increased by grouping sets of tasks sharing common input files into a larger task. By doing so, it is possible to increase the effective number of processors, therefore increasing the number of slave processors that can be used effectively. The time corresponding to t_init should ideally not increase in the same proportion as t_comp. Thus, in one embodiment, tasks which share one or more input files are selected and scheduled so as to run on a common slave node or processor.
This is achieved by introducing the concept of the file affinity which indicates the reduction in the amount of data that needs to be transferred to a remote node when all tasks of a group are sent to that node.
In this discussion it is assumed that the number of groups is equal to the number of nodes available. This is not, however, a limitation, and modifications to the scheduling method are viable to take into account different numbers of processors and groups. For example, for some specific sets of applications/platforms, the optimal execution in terms of the makespan will use a number of groups smaller than the total number of processors. Given a set G of tasks composed of K tasks, G = {T_1, T_2, ..., T_K}, and the set F of the Y input files needed by one or more tasks belonging to group G, F = {f_1, f_2, f_3, ..., f_Y}, the file affinity I_aff is defined in one embodiment as follows:

$$I_{aff}(G) = \frac{\sum_{j=1}^{Y} (N_j - 1)\,|f_j|}{\sum_{j=1}^{Y} N_j\,|f_j|}$$

where |f_j| is the size in bytes of file f_j, and N_j is the number of tasks in group G which have file f_j as an input file; 0 <= I_aff < 1. An input file affinity of zero indicates that there is no sharing of files among tasks of a group. An input file affinity close to one means that all tasks of a group have a high degree of sharing of input files.
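As a hypothetical numerical illustration: a group of three tasks that all take the same single 100-byte file as their only input gives I_aff = (3 - 1)*100 / (3*100) = 2/3, approximately 0.67, whereas three tasks with three distinct input files give I_aff = 0.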
The potential benefits of clustering tasks into groups for execution are illustrated in the results of the simulation shown in Figure 4. In accordance with one embodiment of the invention, it was assumed that all tasks share the same input files and that the same parameters as in the simulation shown in Figure 3 applied. As can be seen from the example, there is a reduction in total execution time for all values of the number of processors P, with a consistent reduction as the size of the grid platform is increased. Thus, in this embodiment of the invention, effective scaling is achieved.
For the example shown in Figure 4, for a platform with 80 processors, the total execution time when tasks are grouped is 160s. Without grouping, the total time is 808s.
The equation above for file affinity is dominated by the combinatorial function whereby all possible pairs of tasks are considered. For large numbers of tasks, this can lead to very large numbers of combinations of task pairs. For example, there are N(25, 5) ways of clustering 25 tasks into 5 groups, which equates roughly to 10^15 possible combinations. It may therefore be impractical to search the solution space exhaustively for an optimal task grouping. For this reason, according to another preferred embodiment of the invention there is provided a simplified heuristic for determining the optimal task grouping which is based on the general file affinity equation described above.
As a preliminary illustration of this simplified embodiment, consider a group of tasks, each of which requires a different input file. Because there is no input file sharing, there is no file affinity between them. It is desirable to start processing them on slave nodes as soon as possible to minimize t_init. Therefore, the tasks are transferred to slave nodes in size order, from smallest to largest, with no account taken of sharing amongst input files (as this is zero).
If an application where all tasks share the same input file is considered, that input file only needs to be transferred once. This is taken into account by including the effect of file affinity. If the file affinity of two consecutive tasks (in size order) is very high, it is advantageous to assign those two tasks to the same processor instead of transferring the same set of input files twice over the network. In the ideal situation described here, this set of files is transferred only once to each processor or node of the network.
This simplified embodiment reduces the size of the possible solution space and provides a viable method of calculating the file affinities for tasks to within a workable level of accuracy.
According to this embodiment, and taking into account file affinity, the simplified embodiment includes the following steps: Initially, for each computing unit or processor, the number of tasks to be aggregated into a group is defined for that computing unit. This is done so that the time needed for processing each group is substantially the same for each computing unit.
Then the total size of each task is calculated. Here, the size of each task corresponds to the sum of the input file sizes for the task concerned. For each group defined in the aggregation step, the required number of tasks is allocated to the group as a function of both the number of tasks determined previously and the task affinity. The allocation step in a preferred embodiment is as follows. The reference to 'position' relates to the position of the task input file in the size-ordered list. The smallest size task, task(position), is assigned to a first group. Then the file affinity of the pair task(position) and task(position+1) in the size-ordered list is determined. If the file affinity is greater than a specified value k, task(position+1) is assigned to the first group. If the file affinity is less than the specified value, task(position+1) is assigned to a subsequent group. This process is repeated, filling the groups sequentially in order until the group allocations determined in the initial step are populated with the size-ordered, associated tasks. This embodiment can be expressed in pseudocode as follows:
- define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
- compute the size of each task;
- rank the task files in a list L in order of increasing size;
- for each group, beginning with the group with the largest number of tasks:
  - assign the smallest unassigned task file to the group;
  - set task file list position = 1;
  - until the group is completely populated by task files do:
    - if (position + P <= size of list L) and (task file affinity(task file[position], task file[position+1]) < a specified value, k) then position = position + P;
    - else position = position + 1;
    - assign to the group the task file at position in list L;
  - enddo
  - remove assigned task files from list L;
  - P = P - 1.

An example application of this simplified heuristic is as follows. Consider a set of tasks composed of ten task files which are to be distributed on three homogeneous slave processors.
The set of input files needed by each task t_i is described as {f_i}, where f_i is a real value that corresponds to the byte sum of the input files needed by task t_i. As the tasks are heterogeneous, they share no input files and the file affinity between any pair of tasks is zero.
The 10 heterogeneous input file tasks are {20K, 35K, 44K, 80K, 102K, 110K, 200K, 300K, 400K, 450K}. Three groups of tasks are generated, one with 4 tasks and the others with 3 tasks.
The simplified embodiment of the heuristic in the case of zero file affinity operates as follows.
Each task is considered in size order. Thus, 20K is allocated to the first position of group one. Then the 35K input task is allocated to the next group, following the principle that each group should minimize initial transmission or initialization time. Task 44K is allocated to the third group. Task 80K is then allocated to position two of the first group, 102K to the second position of group two, and so on. This produces the groups of files as follows: {20K, 80K, 200K, 450K}, {35K, 102K, 300K} and {44K, 110K, 400K}. To a first approximation this keeps the amount of transmitted data similar for each group and allows the task transmission/calculation to be pipelined in a reasonably efficient manner. In a preferred embodiment, the transfer of the files occurs in a pipelined manner, i.e. where computation is overlapped with communication. Figure 5 illustrates the pipelined transfer of input files from a master to three slave processors. As can be seen in this example, the transfers to and from the master/nodes are staggered, with the computation on the slaves being overlapped with the communication phase on one or more of the other processor nodes. This reduces t_init when executing a group of tasks on a slave processor.
Introducing the file affinity threshold k allows a balance to be struck between grouping tasks of different sizes into one task group for transmission and grouping tasks with sufficient file affinity or commonality. Thus this degree of balance may be altered by setting the file affinity threshold for associating input files.
Another example is that of 10 homogeneous tasks with ten completely homogeneous sets of input files {30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K}. Again, three groups of tasks are generated, one with four tasks and the others with three. As the tasks are completely homogeneous, each pair will have a file affinity of approximately 1. Thus, following the simplified embodiment of the heuristic, the three groups of input files will be {30K}, {30K}, and {30K}.
These two extreme examples serve to illustrate how the task grouping may be performed in the simplified embodiment.
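As a usage illustration, the first example can be reproduced with the group_tasks sketch given earlier (again, the sketch and its calling convention are assumptions for illustration, not part of the specification):

    sizes = {i: s for i, s in enumerate(
        [20, 35, 44, 80, 102, 110, 200, 300, 400, 450])}   # sizes in KB
    groups = group_tasks(list(sizes), sizes,
                         affinity=lambda a, b: 0.0,        # no shared files
                         group_sizes=[4, 3, 3], k=0.5)
    # expected grouping by size: [20, 80, 200, 450], [35, 102, 300], [44, 110, 400]

An affinity function returning values at or above k for every pair would instead fill each group with consecutive tasks from the size-ordered list, as in the homogeneous example.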
In terms of implementing this method, as noted above, the number of tasks to be assigned to each group is determined such that the time needed for processing each group is substantially the same for each computing unit. This will depend on each processor's relative speed, based on the average speed of the processors in the cluster. For example, if the relative speed of a particular node processor is 1.0 compared to the average speed of the cluster nodes, the maximum number of tasks to be assigned to that processor will be the ceiling of T/P. The size of each task is calculated on the basis of the byte sum of all of the input files needed to execute each task on a computing unit. The file affinity threshold may usefully be defined as k, for which an affinity of 0.5 is considered acceptable as a benchmark for grouping tasks into a specified group.
Essentially, this equates to setting the minimum 'degree of association' which is necessary to consider two tasks as related or sharing input files. This ensures that the file affinity is maximized within a group. Thus sending similar sets of files to multiple processors is avoided. As noted above, if the next set of files is different enough (i.e., has a file affinity with a previously allocated task less than the minimum), that task will be located at the next processor position.
Firstly, this is done so that tasks with the smallest byte sum are sent initially. Secondly, this is done to guarantee that the groups are as uniform as possible in respect of the number of bytes that need to be transmitted from the master node. Thus, in a preferred embodiment, at initialization of the algorithm, the number of tasks is allocated to each processor based on the processing power of the processor concerned and the file affinity, and the tasks are dispatched or transferred to the processor in a pipelined way, as illustrated in the example shown in Figure 5.
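As a concrete illustration of the speed-proportional allocation (the figures are hypothetical): with T = 12 tasks and three nodes of relative speeds 1.5, 1.0 and 0.5, the groups would contain roughly 6, 4 and 2 tasks respectively, so that each node finishes at about the same time.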
In a further embodiment, it is possible to consider grouping tasks on a grid composed of a set of clusters as opposed to a grid composed of a set of processors. In this embodiment, the number of tasks assigned to each cluster will be calculated based on the requirement that the time needed by the cluster for processing each group is the same for each cluster. As before, this will depend on the processing speed of the cluster aggregate and will depend on the internal structure of the particular cluster such as the number of processors, load from other users etc. Once the number of tasks to be assigned to each cluster is determined, the method proceeds substantially as described above.
Although the invention has been described by way of example and with reference to particular simplified or reduced-scope embodiments it is to be understood that modification and/or improvements may be made without departing from the scope of the appended claims.
Embodiments of the invention are further intended to cover the task scheduling/grouping technique in its most general sense as specified in the claims regardless of the possible size of the solution space for the affinity determination. It is also noted that embodiments of the invention may be applied to the distribution of tasks among nodes in a grid system where the computational characteristics of such nodes may take a variety of forms. That is, node processing may take the form of numerical calculation, storage or any other form of processing which might be envisaged as part of distributed application execution. Further, embodiments of the present invention may be included in a broader scheduling system in the context of allocating information to generalized computing resources.
Where in the foregoing description reference has been made to integers or elements having known equivalents, then such equivalents are herein incorporated as if individually set forth.

Claims (25)

  1. A method of scheduling the running of an application on a plurality of computational units, said application comprising a plurality of tasks, each task having at least one input file associated therewith, the method including the steps of: aggregating said plurality of tasks into one or more groups of tasks; and allocating each group of tasks to a computational unit, wherein the plurality of tasks are aggregated so that tasks which share one or more input files are included in the same group.
  2. A method as claimed in claim 1 wherein the number of groups of tasks is equal to the number of computational units.
  3. A method as claimed in any preceding claim wherein the tasks are aggregated into the groups such that the time needed for processing each group is substantially the same for each computing unit.
  4. A method as claimed in any preceding claim wherein the step of aggregating the plurality of tasks includes the step of determining the file affinity between pairs of tasks in respect of their input files, wherein for a set G of tasks composed of K tasks, G = {T_1, T_2, ..., T_K}, and the set F of Y input files needed by one or more tasks belonging to group G, F = {f_1, f_2, f_3, ..., f_Y}, the file affinity is defined by:

$$I_{aff}(G) = \frac{\sum_{j=1}^{Y} (N_j - 1)\,|f_j|}{\sum_{j=1}^{Y} N_j\,|f_j|}$$

where |f_j| is the size of file f_j and N_j is the number of tasks in group G which have file f_j as an input file.
  5. A method of scheduling tasks among a plurality of computing units including the following steps:
- define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
- compute the size of each task;
- rank the task files in a list L in order of increasing size;
- for each group, beginning with the group with the largest number of tasks:
  - assign the smallest unassigned task file to the group;
  - set task file list position = 1;
  - until the group is completely populated by task files do:
    - if (position + P <= size of list L) and (task file affinity(task file[position], task file[position+1]) < a specified value, k) then position = position + P; else position = position + 1;
    - assign to the group the task file located at position in list L;
  - enddo
  - remove assigned task files from list L;
  - decrement P = P - 1;
  - populate the next group.
  6. A method of scheduling tasks among a plurality of computing units including the following steps: A) define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units; B) compute the size of each task; C) rank the task files in a list L in order of increasing size; D) for each group, beginning with the group with the largest number of tasks, perform the following steps (a) to (e): (a) assign the smallest unassigned task file to the group; (b) set the task file list position index equal to 1; (c) while the group is not completely populated by task files, perform the following steps: (i) if the position index plus P is less than or equal to the size of the list L, and the task file affinity between the task file at the position index and the task file at the position index + 1 is less than a specified value k, then increment the position index by P; otherwise increment the position index by 1; (ii) assign to the group the task file located at the position index in list L; (d) remove assigned task files from list L; (e) decrement P = P - 1.
  7. A method as claimed in claim 5 or 6 wherein the value k is selected to represent the desired level of association between task pairs according to the degree of sharing of input files.
  8. A method as claimed in any of claims 5 to 7 wherein k is greater than or equal to substantially 0.5.
  9. A method as claimed in any preceding claim wherein the maximum number of tasks to be assigned to each group is determined such that the time needed for processing each group on a corresponding computing unit is substantially the same for each computing unit.
  10. A method as claimed in any preceding claim wherein the size of each task is calculated on the basis of the byte sum of all of the input files needed to execute each task on a computing unit.
  11. A method of executing an application on a grid including the scheduling method as claimed in any preceding claim, further including the step of dispatching the groups of tasks to corresponding computing units.
  12. A method as claimed in claim 11 wherein the groups of tasks are dispatched in a manner which overlaps computation and communication.
  13. A method as claimed in any preceding claim wherein the computing units correspond to processors.
  14. A method as claimed in any preceding claim wherein the computing units correspond to clusters.
  15. A method as claimed in claim 14 wherein the clusters are composed of a plurality of computing resources.
  16. A method as claimed in claim 15 wherein the computing resources include one or more processors.
  17. A system configured to operate in accordance with the method as claimed in any of claims 1 to 16.
  18. A computing device configured to operate in accordance with the method as claimed in any of claims 1 to 16.
  19. A computing network adapted to operate in accordance with the method as claimed in any of claims 1 to 16.
  20. A computer program adapted to perform the steps in the method as claimed in any of claims 1 to 16.
  21. A data carrier adapted to store a computer program as claimed in claim 20.
  22. A node computing device adapted to execute one or more tasks in accordance with the method as claimed in any of claims 1 to 16.
  23. A master computing device adapted to schedule an application in accordance with the method as claimed in any of claims 1 to 16.
  24. A computing grid adapted to operate in accordance with the method as claimed in any of claims 1 to 16.
  25. A scheduling system for an aggregate of computational resources adapted to operate in accordance with the method as claimed in any of claims 1 to 16.
GB0423988A 2004-10-29 2004-10-29 Organisation of task groups for a grid application Withdrawn GB2419692A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB0423988A GB2419692A (en) 2004-10-29 2004-10-29 Organisation of task groups for a grid application
PCT/US2005/039439 WO2006050348A2 (en) 2004-10-29 2005-10-28 Methods and apparatus for scheduling and running applications on computer grids

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0423988A GB2419692A (en) 2004-10-29 2004-10-29 Organisation of task groups for a grid application

Publications (2)

Publication Number Publication Date
GB0423988D0 GB0423988D0 (en) 2004-12-01
GB2419692A true GB2419692A (en) 2006-05-03

Family

ID=33515733

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0423988A Withdrawn GB2419692A (en) 2004-10-29 2004-10-29 Organisation of task groups for a grid application

Country Status (2)

Country Link
GB (1) GB2419692A (en)
WO (1) WO2006050348A2 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7243352B2 (en) * 2002-11-27 2007-07-10 Sun Microsystems, Inc. Distributed process runner

Also Published As

Publication number Publication date
WO2006050348A2 (en) 2006-05-11
WO2006050348A3 (en) 2007-05-31
GB0423988D0 (en) 2004-12-01


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)