WO2006050348A2 - Methods and apparatus for scheduling and running applications on computer grids - Google Patents
- Publication number
- WO2006050348A2 (PCT/US2005/039439)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- tasks
- task
- group
- file
- files
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
Definitions
- the invention relates to methods and apparatus for scheduling and executing applications on computer grids. More particularly, although not exclusively, the invention relates to methods and apparatus for scheduling the components of applications, also known as tasks, of grid- based applications on computational units constituting a computational grid or cluster. Even more particularly, although not exclusively, the invention relates to scheduling tasks on heterogeneous distributed computational grids. The invention may be particularly suitable for scheduling sequential independent tasks, otherwise known as Bag-of-Tasks or Parameter Sweep applications, on computational grids.
- a computational grid can be thought of as a collection of physically distributed, heterogeneous computational units, or nodes.
- the physical distribution of the grid nodes may range from immediate proximity to wide geographical distribution.
- Grid nodes may be either heterogeneous or homogeneous with homogeneous grids differing primarily in that the nodes constituting such a grid provide essentially a uniform operating environment and computing capacity. Given the operational characteristics of grids as often being formed across administrative domains and over a wide range of hardware, homogeneous grids are considered a specific case of the general heterogeneous grid concept.
- the present invention contemplates both types.
- the described embodiments of the present invention contemplate distributed networks of heterogeneous nodes which are desired to be treated as a unified computing resource.
- Computational grids are usually built on top of specially designed middleware platforms known as grid platforms.
- Grid platforms enable the sharing, selection and aggregation of the variety of resources constituting the grid.
- These resources which constitute the nodes of the grid can include supercomputers, servers, workstations, storage systems, desktop systems and specialized devices that may be owned and operated by different organizations.
- BoT: Bag-of-Tasks
- BoT applications can be decomposed into groups of tasks. Tasks for this type of grid application are characterized as being independent in that no communication is required between them while they are running and that there are no dependencies between tasks. That is, each task constituting an element of the grid application as a whole can be executed independently with its result contributing to the overall result of the grid-based computation. Examples of BoT applications include Monte Carlo simulations, massive searches, key breaking, image manipulation and data mining.
- the amount of computation involved with each task T_i is generally predefined and may vary among the tasks.
- the input for each task T_i is one or more (input) files and the output is one or more (output) files.
- the present exemplary embodiment relates to clusters organized as a master-slave platform.
- a master node is responsible for scheduling computation among the slave nodes and collecting the results.
- Other grid/cluster models are possible within the scope of the present embodiments of the invention, and may be capable of incorporating the execution/scheduling technique described herein with appropriate modification.
- the slave components are themselves clusters.
- Grid platforms typically use a non-dedicated network infrastructure such as the internet for inter-node communication.
- machine heterogeneity, long and/or variable network delays and variable processor loads and capacities are common. Since tasks belonging to a BoT application do not need to communicate with each other and can be executed independently, it is considered that BoT applications are particularly suitable for execution on such a grid infrastructure.
- the invention provides for a method of scheduling the execution of an application on a plurality of computational units, said application comprising a plurality of tasks, each task having at least one input file associated therewith, the method including the steps of:
- the number of groups of tasks is equal to the number of computational units.
- the tasks are preferably aggregated into the groups so that the time needed for processing each group is substantially the same for each computing unit.
- the file affinity or task file affinity is preferably defined by: I_aff(G) = Σ_i |f_i|(N_i − 1) / Σ_i N_i |f_i|
- |f_i| is the size of file f_i and N_i is the number of tasks in group G which have file f_i as an input file.
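A minimal Python sketch of the group file-affinity computation, assuming the definition I_aff(G) = Σ|f_i|(N_i − 1) / Σ N_i|f_i| (reconstructed from the symbol definitions in the text; the dict-based task representation is illustrative, not from the source):

```python
from collections import Counter

def file_affinity(group):
    """Compute I_aff for a group of tasks.

    Each task is a dict mapping input-file name -> size in bytes.
    N_i is the number of tasks in the group using file f_i; |f_i| is
    its size.  The result is 0 when no files are shared and approaches
    1 as every task in the group shares every file.
    """
    counts = Counter()  # N_i: how many tasks in the group use each file
    sizes = {}          # |f_i|: size of each distinct file
    for task in group:
        for name, size in task.items():
            counts[name] += 1
            sizes[name] = size
    total = sum(counts[f] * sizes[f] for f in counts)
    if total == 0:
        return 0.0
    shared = sum(sizes[f] * (counts[f] - 1) for f in counts)
    return shared / total

# Two tasks sharing one 100K input file, each with its own 10K parameter file:
g = [{"common.dat": 100_000, "p1.in": 10_000},
     {"common.dat": 100_000, "p2.in": 10_000}]
```

For this pair, I_aff = 100000 / 220000, i.e. roughly 0.45 of the bytes sent to a computing unit would be saved relative to sending each task's files separately.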
- the method of distributing the tasks constituting the application among a plurality of computing units may include the following steps:
- a method of scheduling tasks among a plurality of computing units including the following steps: A) define the number of tasks to be assigned in groups to the computing units, where P is the number of computing units;
- the value k is preferably selected to represent the desired level of association between task pairs according to the degree of sharing of input files.
- the number of tasks to be assigned to each group is determined such that the time needed for processing each group on a corresponding computing unit is substantially the same for each computing unit.
- the size of each task is calculated on the basis of the byte sum of all of the input files needed to execute each task on a computing unit.
- tasks are aggregated into a specified group for values of k greater than 0.5.
- the method may also include the further step of dispatching the groups of tasks to corresponding computing units, preferably in a manner which substantially overlaps computation and communication.
- the computing units are preferably grid resources such as processors.
- the computational units are one or more clusters.
- the grid may be formed of one or more clusters, as opposed to being composed of a set of individual processors.
- the number of tasks assigned to each cluster will be calculated based on: the requirement that the time needed by the cluster for processing each group is the same for each cluster, the file affinity, and pipelining, where applied, proceeding substantially as hereinbefore defined.
- Figures 1a & 1b illustrate a method of assigning tasks on a dedicated cluster according to a round-robin approach in accordance with an embodiment of the invention
- Figure 2 illustrates an embodiment of the invention having a master-slave node configuration
- Figure 3 illustrates the results of an execution time simulation for a prior art homogeneous platform
- Figure 4 illustrates the results of an execution time simulation according to an embodiment of the invention.
- Figure 5 illustrates an embodiment of the invention whereby tasks are dispatched to processors in a pipelined manner.
- a simple execution model will be described initially in relation to a prior art technique for scheduling an application on a homogeneous grid. This will then be compared with an embodiment of the invention.
- the specific embodiment described herein relates to fine-grain Bag-of-Tasks applications on a dedicated master-slave platform as shown in Figure 1. However, this is not to be construed as limiting, and the invention may be applied to other computing contexts with suitable modification. Further classes of applications that may benefit from the invention are those composed of tasks with dependencies, where sets of dependent tasks can be grouped and the groups are independent among themselves. The method may be modified slightly to group the tasks according to such dependencies.
- the master node (10) is responsible for organizing, scheduling, transmitting and receiving the tasks corresponding to the grid application. Referring to figure 1, each task goes through three phases during execution:
- the set of files sent may include a parameter file corresponding to a specified task and an executable file which is charged with performing the computational task on the slave processor.
- the time in this phase includes the overhead incurred by the master node (10) to initiate a data transfer to a slave (11), for example, to initiate a TCP connection. For example, consider a task i that needs to send two files to a slave node before execution. The time t_init can then be computed as: t_init = L_init + (Σ_j |f_j|) / B
- L_init is the overhead incurred by the master node to initiate data transfer to the slave node (11-14).
- Σ_j |f_j| is the total size in bytes of the input files that have to be transferred to slave node s, and B is the data transfer rate. For simplicity, in this example it is assumed that each task has only one separate parameter file of the same size associated with it.
- the task processes the parameter file at the slave node (11-14) and produces an output file.
- the duration of this phase is equal to t comp . Any additional overhead related to the reception of input files by a slave node is also included in this phase.
- in this phase the output file is sent back to the master node (10) and the task is terminated.
- the duration of this phase is t end .
- This phase may require some processing at the master and this is mainly related to writing files at the file repository (not shown in (10)). This writing step may be deferred until disk resources are available. It is therefore considered negligible.
- the initialization phase of one slave can occur concurrently to the completion phase of another slave node.
- the total execution time of a task is therefore: t_task = t_init + t_comp + t_end
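The three-phase timing model described above (initialization, computation, completion) can be sketched as follows; the numeric values in the example are illustrative assumptions, not figures from the source:

```python
def t_init(file_sizes, overhead, bandwidth):
    """Initialization time: connection setup overhead at the master plus
    the transfer of all input files at the given data rate B."""
    return overhead + sum(file_sizes) / bandwidth

def task_time(file_sizes, t_comp, t_end, overhead, bandwidth):
    """Total execution time of one task: t_init + t_comp + t_end."""
    return t_init(file_sizes, overhead, bandwidth) + t_comp + t_end

# e.g. a task with two 1 MB input files over a 10 MB/s link with a 5 ms
# connection overhead, 2.0 s of computation and 0.1 s to return results:
t = task_time([1_000_000, 1_000_000], t_comp=2.0, t_end=0.1,
              overhead=0.005, bandwidth=10_000_000)
```

Here t_init comes to 0.205 s, giving a total task time of 2.305 s.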
- the exemplary embodiment described herein corresponds to a dedicated node cluster composed of P+1 homogeneous processors where T ≫ P.
- the additional processor is the master node (10).
- Communication between the master (10) and slave (11-14) is by way of a shared link and, in this embodiment the master (10) can only send files through the network to a single slave at a given time.
- the communication link is full duplex.
- This embodiment corresponds to a one-port model whereby there are at most two communication processes involving a given master: one send and one receive.
- the one-port embodiment discussed herein is particularly suited to LAN network connections.
- a slave node (11-14) is considered to be idle when it is not involved with the execution of any of the three phases of a task.
- Figure 1 shows the execution of a set of tasks in a system composed of three slave nodes where the scheduling algorithm is of a round-robin type.
- the effective number of processors P_eff is defined as the maximum number of slave processors needed to run an application with no idle periods on any slave processor. Taking into account the task and platform models of the particular embodiment described herein, a processor may have idle periods if: t_comp + t_end < (P − 1)·t_init
- the total number of tasks to be executed on a given processor is at most ⌈T/P⌉.
- T is the total number of tasks.
- with P_eff processors, the total execution time, or makespan, can then be determined.
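Solving the no-idle condition t_comp + t_end ≥ (P − 1)·t_init for P gives one plausible closed form for the effective number of processors. The direction of the inequality is reconstructed from garbled text, so treat this as a sketch rather than the patent's exact formula:

```python
import math

def effective_processors(t_comp, t_end, t_init):
    """Largest slave count P satisfying t_comp + t_end >= (P - 1) * t_init,
    i.e. the largest P the master can keep busy with no idle periods
    under the one-port model."""
    if t_init <= 0:
        raise ValueError("t_init must be positive")
    return math.floor((t_comp + t_end) / t_init) + 1

# With 2.0 s of computation, 0.1 s of result return and a 0.205 s
# per-task initialization, the master can feed up to 11 slaves:
```

Adding a twelfth slave under these assumed timings would only introduce idle periods, since the master cannot complete eleven other initializations within one slave's 2.1 s busy window.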
- the number of groups is equal to the number of nodes available. This is not however a limitation and modifications to the scheduling method are viable to take into account different processor/group. For example, for some specific sets of applications/platforms, the optimal execution in terms of the makespan will use a number of groups smaller than the total number of processors.
- F = {f_1, f_2, f_3, ...} denotes the set of input files.
- the file affinity I_aff is defined in one embodiment as follows: I_aff(G) = Σ_i |f_i|(N_i − 1) / Σ_i N_i |f_i|, where |f_i| is the size of file f_i and N_i is the number of tasks in group G having f_i as an input file.
- This simplified embodiment reduces the size of the possible solution space and provides a viable method of calculating the file affinities for tasks to within a workable level of accuracy.
- the simplified embodiment includes the following steps: Initially, for each computing unit or processor, the number of tasks to be aggregated into a group is defined for that computing unit. This is done so that the time needed for processing each group is substantially the same for each computing unit.
- the size of each task corresponds to the sum of the input file sizes for the task concerned.
- the required number of tasks is allocated to the group as a function of both the number of tasks determined previously and task affinity.
- the allocation step in a preferred embodiment is as follows. The reference to 'position' relates to the position of the task input file in the size-ordered list. The smallest size task, task(position) is assigned to a first group. Then the file affinity of the pair task(position) and task(position+l) in the size-ordered list is determined. If the file affinity k is greater than a specified value, task(position+l) is assigned to the first group.
- task(position+l) is assigned to a subsequent group. This process is repeated, filling sequentially the groups in order until the group allocations determined in the initial step are populated with the size-ordered, associated tasks.
- This embodiment can be expressed in pseudocode as follows:
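The pseudocode itself does not survive in this extract. A Python sketch of the simplified heuristic as described (sort tasks by total input size, fill the groups round-robin, and keep a task in its predecessor's group when their pairwise file affinity exceeds the threshold k) might look like the following; the task representation and the default pairwise affinity measure are assumptions for illustration, not quotations from the patent:

```python
def group_tasks(tasks, group_sizes, k=0.5, affinity=None):
    """Aggregate size-sorted tasks into groups, one group per computing unit.

    tasks       : list of tasks, each a dict mapping input-file name -> size.
    group_sizes : how many tasks each group should receive (pre-computed,
                  e.g. from the relative speeds of the computing units).
    k           : minimum pairwise file affinity for keeping consecutive
                  tasks together in the same group.
    affinity    : pairwise affinity function; the default below (fraction
                  of bytes belonging to shared file names) is one plausible
                  reading of the patent's pairwise measure.
    """
    assert sum(group_sizes) >= len(tasks), "groups cannot hold all tasks"
    if affinity is None:
        def affinity(a, b):
            shared = sum(size for name, size in a.items() if name in b)
            total = sum(a.values()) + sum(b.values())
            return 2.0 * shared / total if total else 0.0

    order = sorted(tasks, key=lambda t: sum(t.values()))  # smallest first
    groups = [[] for _ in group_sizes]
    g = 0
    for i, task in enumerate(order):
        # low affinity with the previous task: move on to the next group,
        # so each group's initial transmission time stays small and balanced
        if i > 0 and affinity(order[i - 1], task) <= k:
            g = (g + 1) % len(groups)
        # skip any group that has already received its quota of tasks
        while len(groups[g]) >= group_sizes[g]:
            g = (g + 1) % len(groups)
        groups[g].append(task)
    return groups
```

With ten tasks of distinct input files (zero affinity) and group sizes (4, 3, 3), this reproduces the round-robin allocation described in the worked example that follows; with ten identical tasks, each group instead receives a run of consecutive high-affinity tasks.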
- the 10 heterogeneous input file tasks are {20K, 35K, 44K, 80K, 102K, 110K, 200K, 300K, 400K, 450K}.
- Three groups of tasks are generated, one with 4 tasks and the others with 3 tasks.
- the simplified embodiment of the heuristic in the case of zero file affinity operates as follows. Each task is considered in size order. Thus, 20K is allocated to the first position of group 1. Then the 35K input task is allocated to the next group, following the principle that each group should minimize initial transmission or initialization time. Task 44K is allocated to the third group. Task 80K is then allocated to position two of the first group, 102K to the second position of group two, and so on.
- the transfer of the files occurs in a pipelined manner, i.e.; where computation is overlapped with communication.
- Figure 5 illustrates the pipelined transfer of input files from a master to three slave processors. As can be seen in this example, the transfers to and from the master/nodes are staggered, with the computation on the slaves being overlapped with the communication phase on one or more of the other processor nodes. This reduces t_init when executing a group of tasks on a slave processor.
- Another example is that of 10 homogeneous tasks with ten completely homogeneous sets of input files {30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K, 30K}.
- three groups of tasks are generated, one with four tasks and the others with three.
- each pair will have a file affinity of approximately 1.
- the three groups of input files will be ⁇ 30K ⁇ , ⁇ 30K ⁇ , and ⁇ 30K ⁇ .
- the number of tasks to be assigned to each group is determined such that the time needed for processing each group is substantially the same for each computing unit. This will depend on each processor's relative speed, based on the average speed of the processors in the cluster. For example, if the relative speed of a particular node processor is 1.0 compared to the average speed of the cluster nodes, that node is assigned a correspondingly average share of the tasks.
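A sketch of computing per-group task counts, assuming (as the truncated sentence above suggests) that tasks are allocated in proportion to each node's speed relative to the cluster average, with any remainder distributed by largest fractional share:

```python
def tasks_per_node(total_tasks, relative_speeds):
    """Split total_tasks among nodes in proportion to relative speed.

    relative_speeds are speeds relative to the cluster average, so a
    node with speed 1.0 receives roughly total_tasks / len(speeds).
    """
    total_speed = sum(relative_speeds)
    raw = [total_tasks * s / total_speed for s in relative_speeds]
    counts = [int(r) for r in raw]  # integer share first
    remainder = total_tasks - sum(counts)
    # hand leftover tasks to the nodes with the largest fractional part
    # (ties broken by node order, since Python's sort is stable)
    by_frac = sorted(range(len(raw)),
                     key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_frac[:remainder]:
        counts[i] += 1
    return counts

# 10 tasks over nodes with relative speeds 1.0, 1.0 and 2.0: the fast
# node receives about twice the tasks of each average node.
```

These counts would then feed the group_sizes of the grouping step, so that each computing unit finishes its group in roughly the same time.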
- the size of each task is calculated on the basis of the byte sum of all of the input files needed to execute each task on a computing unit.
- the file " affmity ⁇ may usefully be defined as kfov which an affinity of 0.5 is considered acceptable as a benchmark for grouping tasks into a specified group. Essentially, this equates to setting the minimum degree of 'association' which is necessary to consider two tasks as related or sharing input files. This ensures that the file affinity is maximized within a group. Thus sending similar sets of files to multiple processors is avoided. As noted above, if the next set of files is different enough (i.e., has a file affinity with a previously allocated task less than the minimum), that task will be located at the next processor position.
- the number of tasks is allocated to each processor based on the processing power of the processor concerned and the file affinity, and the tasks are dispatched or transferred to the processor in a pipelined way as illustrated in the example shown in figure 5.
- In a further embodiment, it is possible to consider grouping tasks on a grid composed of a set of clusters as opposed to a grid composed of a set of processors.
- the number of tasks assigned to each cluster will be calculated based on the requirement that the time needed by the cluster for processing each group is the same for each cluster. As before, this will depend on the processing speed of the cluster aggregate and will depend on the internal structure of the particular cluster such as the number of processors, load from other users etc.
- Embodiments of the invention are further intended to cover the task scheduling/grouping technique in its most general sense as specified in the claims regardless of the possible size of the solution space for the affinity determination. It is also noted that embodiments of the invention may be applied to the distribution of tasks among nodes in a grid system where the computational characteristics of such nodes may take a variety of forms. That is, node processing may take the form of numerical calculation, storage or any other form of processing which might be envisaged as part of distributed application execution. Further, embodiments of the present invention may be included in a broader scheduling system in the context of allocating information to generalized computing resources.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB0423988.5 | 2004-10-29 | ||
GB0423988A GB2419692A (en) | 2004-10-29 | 2004-10-29 | Organisation of task groups for a grid application |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2006050348A2 true WO2006050348A2 (en) | 2006-05-11 |
WO2006050348A3 WO2006050348A3 (en) | 2007-05-31 |
Family
ID=33515733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2005/039439 WO2006050348A2 (en) | 2004-10-29 | 2005-10-28 | Methods and apparatus for scheduling and running applications on computer grids |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB2419692A (en) |
WO (1) | WO2006050348A2 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040103413A1 (en) * | 2002-11-27 | 2004-05-27 | Sun Microsystems, Inc. | Distributed process runner |
- 2004-10-29: GB application GB0423988A filed (published as GB2419692A; not active, withdrawn)
- 2005-10-28: PCT application PCT/US2005/039439 filed (published as WO2006050348A2; active, application filing)
Also Published As
Publication number | Publication date |
---|---|
WO2006050348A3 (en) | 2007-05-31 |
GB2419692A (en) | 2006-05-03 |
GB0423988D0 (en) | 2004-12-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KN KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
NENP | Non-entry into the national phase in: |
Ref country code: DE |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 05815778 Country of ref document: EP Kind code of ref document: A2 |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 05815778 Country of ref document: EP Kind code of ref document: A2 |