WO2017018978A1 - Scheduling jobs in a computing cluster - Google Patents
Scheduling jobs in a computing cluster
- Publication number
- WO2017018978A1 (PCT/US2015/041899)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- job request
- resource information
- schedulers
- computing
- scheduler
- Prior art date: 2015-07-24
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
In example implementations, a method is provided. The method includes receiving a job request at a scheduler of a plurality of schedulers based upon a quality of service (QoS) level associated with the job request and the scheduler. The job request is scheduled to a computing node based upon locally stored resource information of a selected number of computing nodes within a computing cluster. A shared memory is accessed via a memory fabric to obtain updated resource information of the selected number of computing nodes. The job request may then be re-scheduled to a different computing node based upon the updated resource information.
Description
SCHEDULING JOBS IN A COMPUTING CLUSTER
BACKGROUND
[0001] Traditional cluster scheduling relies on centralized architectures. The traditional cluster scheduling also implements a single scheduling algorithm for all workloads and the entire cluster. Such schedulers are inflexible and are difficult to scale when serving a large number of jobs in a large scale cluster. Newer schedulers are being developed; however, these newer schedulers still suffer from bottlenecks due to a centralized key component.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] FIG. 1 is an example system of the present disclosure;
[0003] FIG. 2 is an example table of quality of service for a plurality of different schedulers;
[0004] FIG. 3 is a flowchart of an example method for scheduling a job request from a perspective of a scheduler; and
[0005] FIG. 4 is a flowchart of an example method for scheduling a job request from a perspective of a computing node.
DETAILED DESCRIPTION
[0006] The present disclosure discloses a method, system and apparatus for scheduling a job request. As discussed above, traditional cluster scheduling relies on centralized architectures. The traditional cluster scheduling also implements a single scheduling algorithm for all workloads and the entire cluster. Such schedulers are inflexible and are difficult to scale when serving a large number of jobs in a large scale cluster. Newer schedulers are being developed; however, these newer schedulers still suffer from bottlenecks due to a centralized key component.
[0007] Examples of the present disclosure provide a method for scheduling a job request in a computing cluster comprising a plurality of computing nodes. In one example, the present disclosure provides a hybrid scheduler that is decentralized and scalable. In other words, as the number of job requests and computing nodes grows, the system may be scaled by adding additional schedulers.
[0008] In addition, the hybrid schedulers allow for different schedulers to use different scheduling algorithms. In one example, the schedulers may be based upon different levels of Quality of Service (QoS). In addition, the hybrid schedulers allow for the system to be adaptable based upon an increase in some types of QoS requirements and a decrease in other QoS requirements.
[0009] Lastly, the system is decentralized. As a result, no bottleneck is created by a single scheduler or any component that performs a centralized function. Rather, the system of the present disclosure provides a plurality of different schedulers that may use different scheduling algorithms. Each scheduler may schedule tasks independently of other schedulers. In addition, rather than attempting to resolve scheduling conflicts before job requests are scheduled by communicating between the schedulers, any scheduling conflicts or load balancing may be performed midstream.
[0010] For example, each scheduler may schedule a job request to a computing node initially using local resource information stored on a local memory of the scheduler. Each scheduler may have a global or a partial view of resource information depending on the needs of its scheduling algorithm. Periodically, each scheduler may access updated resource information that is stored and continuously updated for each computing node via a memory fabric. If the initial assumption of how resources are distributed changes based on the updated resource information, the scheduler may reassign the job request to a different computing node at a later time. Similarly, the scheduler may reassign the job request to a new computing node according to a Quality of Service (QoS) change. For example, the scheduler may move a job from a busy computing node to a less loaded computing node as the job's completion time deadline approaches. As a result, the methods and systems for scheduling a job request of the present disclosure provide a more efficient process than current job schedulers of a computing cluster.
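As an illustration only (not part of the original disclosure), the schedule-first, correct-later flow described above may be sketched in Python as follows. All node names, resource fields and numeric values are hypothetical stand-ins for whatever a given scheduler's algorithm actually uses:

```python
# Minimal sketch of schedule-first, correct-later. A real scheduler would
# read its snapshots over the memory fabric rather than use literals.
local_snapshot = {
    "node-1": {"cpu": 4, "mem_gb": 8},     # free resources as last seen locally
    "node-2": {"cpu": 16, "mem_gb": 64},
}

def pick_node(snapshot, cpu, mem_gb):
    """Return the tracked node with the most headroom that fits the job."""
    fits = {n: r for n, r in snapshot.items()
            if r["cpu"] >= cpu and r["mem_gb"] >= mem_gb}
    if not fits:
        return None
    return max(fits, key=lambda n: (fits[n]["cpu"], fits[n]["mem_gb"]))

# 1) Assign immediately from the local (possibly stale) snapshot,
#    without coordinating with any other scheduler.
assigned = pick_node(local_snapshot, cpu=2, mem_gb=4)        # "node-2"

# 2) Later, after refreshing the snapshot from shared memory, reassign
#    midstream if the initial assumption no longer holds.
updated_snapshot = {
    "node-1": {"cpu": 12, "mem_gb": 32},   # node-1 has freed up
    "node-2": {"cpu": 1, "mem_gb": 2},     # node-2 became busy
}
better = pick_node(updated_snapshot, cpu=2, mem_gb=4)        # "node-1"
if better and better != assigned:
    print(f"reassign job from {assigned} to {better}")
```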
[0011] FIG. 1 illustrates an example system 100 of the present disclosure. In one example, the system 100 includes a plurality of different schedulers 102-1 to 102-N (herein also referred to individually as scheduler 102 or collectively as schedulers 102). In one example, each one of the schedulers 102-1 to 102-N may receive a job request 106-1 to 106-N (herein also referred to individually as a job request 106 or collectively as job requests 106). The job requests 106 may include different types of job requests (e.g., a web service, a MapReduce batch job, a database query, and the like). The job requests 106 may each be associated with a particular QoS level.
[0012] In one example, the job requests 106 may arrive from different users for different applications that are located remotely from the system 100. In one example, the job requests 106 may be sent to one of the schedulers 102 based upon a corresponding QoS level. In other words, each one of the schedulers 102-1 to 102-N may have a different QoS level. For example, the scheduler 102-1 may have a highest QoS level, the scheduler 102-2 may have a medium QoS level and the scheduler 102-N may have a lowest QoS level. In one example, multiple schedulers 102 may share a same QoS level, whether highest, medium or lowest. Said another way, each QoS level may be associated with more than one scheduler 102.
[0013] FIG. 2 illustrates an example table 200 of various parameters that can be used to determine a QoS level. The table 200 illustrates an example of four different schedulers 102. In one example, the parameters may include a resource state, an algorithm, constraints and a scheduling throughput. In one example, the resource state parameter is related to whether the scheduler is global or non-global. For example, global schedulers may make decisions based on resource information of the entire computing cluster. A non-global scheduler may make decisions based on partial information or a selected number of computing nodes within the computing cluster.
[0014] In one example, the algorithm parameter is related to what type of scheduling algorithm is employed by a scheduler 102. For example, each scheduler 102 may use any type of scheduling algorithm ranging from complex algorithms to simple load balancing algorithms. Examples of scheduling algorithms may include load balancing, bin packing, random-sampling, load balancing on a sub-cluster, and the like.
[0015] In one example, the constraints parameter is related to a particular policy or constraint that is enforced. Examples of constraints that can be enforced include data-locality constraints, inter-job constraints (e.g., job 1 and job 2 cannot be placed on the same computing node), capacity constraints, fairness, priority, and the like.
[0016] In one example, the scheduling throughput parameter may be related to a particular scheduling performance. For example, different schedulers may provide different scheduling performance related to latency (e.g., delay) and throughput (e.g., scheduling decisions/second) depending on the scheduling algorithm that is used and its complexity. The scheduling throughput parameter may have example values such as high, medium and low. In another example, the scheduling throughput parameter may be a numerical value (e.g., 1-10).
[0017] Thus, based on the particular requirements for a job request 106, the job request 106 may be matched with a scheduler 102 that matches the QoS requirements of the job request 106. In one example, each one of the schedulers 102 may have a local memory 104-1 to 104-N (herein also individually referred to as local memory 104 or collectively referred to as local memories 104). The local memory 104 may include local resource information of a respective scheduler 102.
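For illustration, the QoS profiles of table 200 and the matching of a job request to a scheduler might be represented as in the following minimal sketch. The profile values and the matching rule are hypothetical; the disclosure does not prescribe either:

```python
# Hypothetical scheduler QoS profiles in the spirit of table 200.
SCHEDULER_PROFILES = {
    "scheduler-1": {"resource_state": "global", "algorithm": "bin_packing",
                    "constraints": {"data_locality"}, "throughput": "low"},
    "scheduler-2": {"resource_state": "non-global", "algorithm": "load_balancing",
                    "constraints": set(), "throughput": "high"},
}

THROUGHPUT_RANK = {"low": 0, "medium": 1, "high": 2}

def match_scheduler(job_qos):
    """Route a job request to a scheduler whose profile meets its QoS needs:
    enough scheduling throughput and support for every required constraint."""
    for name, profile in SCHEDULER_PROFILES.items():
        if (THROUGHPUT_RANK[profile["throughput"]]
                >= THROUGHPUT_RANK[job_qos["min_throughput"]]
                and job_qos["constraints"] <= profile["constraints"]):
            return name
    return None

# A latency-sensitive job with no placement constraints lands on scheduler-2.
print(match_scheduler({"min_throughput": "high", "constraints": set()}))
```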
[0018] In one example, the schedulers 102 may be deployed as a computer or server comprising hardware. The computer may include a processor, a non-transitory computer readable medium (e.g., a hard disk drive, random access memory (RAM), and the like) and input/output devices. The schedulers 102-1 to 102-N may be deployed as separate computers, each having its own processor and local memory 104.
[0019] In one example, each scheduler 102 may track resource information for a selected number of computing nodes 110-1 to 110-N (herein also individually referred to as a computing node 110 or collectively referred to as computing nodes 110) within a computing cluster 108. The computing nodes 110 of the computing cluster 108 may be at a same location or may be remotely located from one another. In one example, each one of the computing nodes 110 may include a task queue 112 and a node manager 114 that manages the completion of tasks 116-1 to 116-N (herein referred to individually as a task 116 or collectively as tasks 116). In one example, the computing nodes 110 may also be deployed as separate computers or servers. Each computing node 110 may include its own allocation of a processor or processors and a non-transitory computer readable medium or mediums (e.g., a hard disk drive, random access memory (RAM), and the like).
[0020] Each job request 106 may include a task or tasks that are completed by a computing node 110. When the computing node 110 receives a job request, the tasks associated with the job request may be placed in the task queue 112. The tasks within the task queue 112 may be processed based on any method, including, for example, a first in first out (FIFO) method, based on priority, based on a fair sharing method, and the like.
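As a brief illustration, the FIFO and priority disciplines mentioned above might be realized as follows; a fair-sharing discipline would additionally need per-user accounting and is omitted from this sketch:

```python
import heapq
from collections import deque

# FIFO discipline: tasks leave the task queue in arrival order.
fifo = deque(["task-116-1", "task-116-2"])
assert fifo.popleft() == "task-116-1"

# Priority discipline: the lowest priority number leaves first.
prio = []
heapq.heappush(prio, (2, "task-low"))
heapq.heappush(prio, (1, "task-urgent"))
assert heapq.heappop(prio) == (1, "task-urgent")
```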
[0021] As noted above, some schedulers 102 may be global schedulers that track resources of all computing nodes 110-1 to 110-N. Thus, the selected number of computing nodes 110 tracked by global schedulers may be all of the computing nodes 110.
[0022] Alternatively, some schedulers 102 may be non-global schedulers that track resources of only a subset of the computing nodes 110-1 to 110-N. Thus, the selected number of computing nodes 110 tracked by non-global schedulers may be less than all of the computing nodes 110. For example, the scheduler 102-2 may be a non-global scheduler that tracks resource information of computing nodes 110-2 and 110-N.
[0023] In one example, resource information may be defined as a snapshot at a particular time of each one of the computing nodes 110 that are tracked by a scheduler. The resource information may provide information related to how many tasks are currently assigned to each computing node 110, how much processing power and memory are currently used and available at each computing node 110, what types of jobs are currently assigned to each computing node 110, an estimated time to complete all of the currently assigned job requests or tasks at each computing node 110, and the like.
[0024] In one example, each scheduler 102 may have local resource information (e.g., resource information stored locally) on a respective local memory 104. The scheduler 102 may use the local resource information to initially assign the job request 106 to a computing node 110. For example, a job request 106-1 may arrive at the scheduler 102-1. The scheduler 102-1 may have local resource information stored at local memory 104-1. Based on the local resource information, the scheduler 102-1 may determine that the computing node 110-2 has the most processing power and memory available to complete the job request 106-1. As a result, the scheduler 102-1 may initially assign the job request 106-1 to the computing node 110-2.
[0025] In one example, each one of the schedulers 102-1 to 102-N may schedule its respective job requests 106-1 to 106-N in a similar fashion and independent of one another. In other words, the schedulers 102-1 to 102-N do not need to communicate with one another to avoid scheduling conflicts before scheduling the respective job requests 106-1 to 106-N. Rather, each scheduler 102-1 to 102-N assigns its respective job request 106-1 to 106-N based upon local resource information and independent of how the other schedulers 102 are scheduling the job requests 106.
[0026] In one example, the system 100 may include a memory fabric 118 and shared memory 120-1 to 120-N (also referred to herein collectively as shared memory 120). In one example, the shared memory 120-1 to 120-N may include a plurality of memory unit locations or physical memory units. In one example, the shared memory units 120 may be a dynamic random access memory (DRAM) or a non-volatile memory (NVM). In one example, access to the shared memory 120 (e.g., a read access or a write access) may be translated and routed over the memory fabric 118.
[0027] In one example, each one of the computing nodes 110 may be in communication with each one of the schedulers 102 and the shared memory 120 via the memory fabric 118. As a result, as job requests 106 are assigned to computing nodes 110, as job requests 106 are completed, or as job requests 106 are reassigned or moved to different nodes 110 (as discussed below), the node manager 114 of each computing node 110 may send updates regarding resource information to the shared memory 120. Updated resource information of each one of the computing nodes 110 may be stored in shared memory 120.
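For illustration, the node manager's update path might look like the following sketch, where a plain dictionary stands in for the fabric-attached shared memory 120 and all field names are hypothetical:

```python
# Hypothetical node-side update: after any assignment, completion, or
# reassignment, publish the node's current resource view to shared memory.
fabric_shared_memory = {}   # stands in for shared memory 120 behind fabric 118

def publish_update(node_id, free_cpu, free_mem_gb, queued_tasks):
    fabric_shared_memory[node_id] = {
        "cpu": free_cpu,
        "mem_gb": free_mem_gb,
        "queued": queued_tasks,
    }

publish_update("node-110-2", free_cpu=1, free_mem_gb=2, queued_tasks=7)
```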
[0028] In one example, each one of the schedulers 102 may access the updated resource information for the selected number of computing nodes 110 from the shared memory 120. In one example, the schedulers 102 may access the updated resource information for all computing nodes periodically. In one example, a small amount of randomness may be inserted into the periodic access times of each scheduler 102 such that the schedulers 102 do not access the shared memory 120 at the same time or obtain identical updated resource information.
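One possible realization of the jittered periodic access is sketched below with a timer thread; the period and jitter bounds are hypothetical, since the disclosure fixes neither:

```python
import random
import threading

REFRESH_PERIOD_S = 30.0   # hypothetical base period

def start_refresh_loop(read_shared_memory, apply_update):
    """Periodically pull updated resource information, offset by a small
    random jitter so schedulers do not read identical snapshots in lockstep."""
    def tick():
        apply_update(read_shared_memory())
        jitter = random.uniform(0.0, 0.1 * REFRESH_PERIOD_S)
        timer = threading.Timer(REFRESH_PERIOD_S + jitter, tick)
        timer.daemon = True
        timer.start()
    tick()

# Usage: start_refresh_loop(lambda: dict(fabric_shared_memory), snapshot.update)
```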
[0029] As noted above, each one of the schedulers 102 may assign a respective job request 106 to one of the computing nodes 110 based upon local resource information stored in local memory 104. However, the local resource information may be stale or outdated. After the scheduler 102 assigns the respective job request 106, the scheduler 102 may access the local memory 104 to obtain updated resource information obtained from the shared memory 120. Using the example above, after receiving the updated resource information, the scheduler 102-1 may determine that computing node 110-2 is actually very busy and that the computing node 110-1 has more processing power and memory available to complete the job request 106-1. As a result, the scheduler 102-1 may send an instruction to the computing node 110-2 to reassign the job request 106-1 to the computing node 110-1.
[0030] In one example, if the job request 106-1 has not been initiated at the computing node 110-2, then the computing node 110-2 may reassign the job request 106-1 to the computing node 110-1. For example, the computing node 110-2 may send the job request 106-1 and the associated tasks to the computing node 110-1 directly. In another example, the computing node 110-2 may delete the job request 106-1 from the task queue 112 and the scheduler 102-1 may then reassign the job request 106-1 to the computing node 110-1.
[0031] In one example, if the job request 106-1 has been initiated or completed by the computing node 110-2, then the computing node 110-2 may send a notification to the scheduler 102-1 that the job request 106-1 has been initiated or completed and deny the instruction to reassign the job request 106-1.
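A minimal sketch of this node-side protocol follows; the `NodeManager` class and its fields are hypothetical stand-ins for the node manager 114 and task queue 112:

```python
class NodeManager:
    """Illustrative node-side handling of a reassign instruction: a queued
    job can be moved, but an initiated or completed job causes a denial."""

    def __init__(self, name):
        self.name = name
        self.task_queue = []   # job ids waiting to run
        self.started = set()   # job ids already initiated or completed

    def handle_reassign(self, job_id, target):
        if job_id in self.started:
            # Notify the scheduler and deny the instruction.
            return {"status": "denied", "reason": "job already initiated"}
        self.task_queue.remove(job_id)
        target.task_queue.append(job_id)   # forward the job directly
        return {"status": "reassigned", "to": target.name}

node_2, node_1 = NodeManager("node-110-2"), NodeManager("node-110-1")
node_2.task_queue.append("job-106-1")
print(node_2.handle_reassign("job-106-1", node_1))
```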
[0032] In other words, the system 100 saves time and prevents bottlenecks by allowing each one of the schedulers 102-1 to 102-N to schedule job requests 106-1 to 106-N immediately based upon local resource information. The schedulers 102-1 to 102-N do not require communication or coordination with one another to schedule the job requests 106-1 to 106-N. Rather, any scheduling conflicts are resolved mid-stream at a later time after the initial scheduling of the job request 106 by using updated resource information obtained by accessing the memory fabric 118. In addition, to prevent two schedulers 102 from obtaining the same updated resource information and potentially assigning their respective job requests 106 to the same computing node 110, a small randomness may be introduced to the periodic timing of when the schedulers 102 access the shared memory 120 via the memory fabric 118.
[0033] FIG. 3 illustrates an example flowchart of a method 300 for scheduling a job request. In one example, the method 300 may be performed by a scheduler 102.
[0034] At block 302 the method 300 begins. At block 304, the method 300 receives a job request at a scheduler based on a quality of service (QoS) of the job request and the scheduler. In one example, the scheduler may be one of a plurality of different schedulers. Each one of the plurality of different schedulers may have a different QoS level based upon different parameters (e.g., parameters described in table 200 in FIG. 2).
[0035] At block 306, the method 300 schedules the job request to a computing node based upon locally stored resource information. For example, each scheduler may have a local memory and locally stored resource information. The locally stored resource information may include a snapshot at a particular time of each one of the computing nodes that are tracked by a scheduler. The resource information may relate to how many tasks are currently assigned to each computing node, how much processing power and memory are currently used and available at each computing node, what types of jobs are currently assigned to each computing node, an estimated time to complete all of the currently assigned job requests or tasks at each computing node, and the like.
[0036] In one example, the locally stored resource information may be for a selected number of computing nodes. For example, each scheduler may be either a global scheduler or a non-global scheduler. The global schedulers may track resource information for all of the computing nodes in a computing cluster. The non-global schedulers may track resource information for less than all (e.g., a subset, two or more, and the like) of the computing nodes in the computing cluster.
[0037] At block 308, the method 300 accesses a shared memory via a memory fabric to obtain updated resource information. In one example, the scheduler may periodically access (e.g., every 30 seconds, every minute, every hour, every few hours, and the like) the shared memory to obtain the updated resource information. In one example, a small amount of randomness may be inserted into the timing of the periodic access of each one of the schedulers to prevent two schedulers from accessing the shared memory at the same time or obtaining identical updated resource information. Inserting a small amount of randomness may prevent scheduling conflicts caused by two different schedulers attempting to reassign their respective job requests to the same computing node at the same time based upon the same updated resource information.
[0038] At block 310, the method 300 re-schedules the job request to a different computing node based upon the updated resource information. For example, the locally stored resource information used by the scheduler to initially assign the job request to a computing node may have been stale, not recent or not current. The updated resource information may indicate that the computing node that was initially assigned the job request is very busy or has a large number of tasks to complete, such that the estimated time to complete the job request may be longer than what is acceptable. Based upon the updated resource information, the scheduler may determine that a different computing node has more processing power and memory available to complete the job request and may then reassign the job request to the different computing node.
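For illustration, the decision at block 310 might reduce to a check like the following sketch, using the estimated-completion-time field of the resource information described above; the field names and threshold are hypothetical:

```python
def should_reschedule(updated_info, node, acceptable_s):
    """True when the assigned node's estimated completion time for its
    queued work would exceed what is acceptable for the job."""
    return updated_info[node]["est_completion_s"] > acceptable_s

updated_info = {"node-110-2": {"est_completion_s": 120},
                "node-110-1": {"est_completion_s": 15}}

if should_reschedule(updated_info, "node-110-2", acceptable_s=60):
    target = min(updated_info, key=lambda n: updated_info[n]["est_completion_s"])
    print("re-schedule job to", target)   # node-110-1
```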
[0039] In one example, the blocks 304-310 may be repeated for each job request that is received by a scheduler or for each scheduler in the system. At block 312, the method 300 ends.
[0040] FIG. 4 illustrates an example flowchart of another method 400 for scheduling a job request. In one example, the method 400 may be performed by a computing node 110.
[0041] At block 402 the method 400 begins. At block 404, the method 400 receives a job request from a scheduler based upon local resource information. For example, a computing node may appear to have the processing power and memory available to complete a job request within the QoS requirements of the job request and the scheduler based upon the local resource information. As a result, the scheduler may assign the computing node to complete the job request.
[0042] At block 406, the method 400 places the job request in a task queue. For example, the job request may include a task or tasks that are to be completed. The task or tasks associated with the job request may be placed in a task queue of the assigned computing node.
[0043] At block 408, the method 400 receives an instruction to reassign the job request to a different computing node based upon updated resource information obtained by the scheduler from a shared memory. For example, the scheduler may periodically access a shared memory to obtain updated resource information of a select number of computing nodes including the computing node that is assigned to complete the job request.
[0044] For example, the computing nodes may be in communication with the shared memory via the memory fabric. As job requests are completed or reassigned, the computing nodes may continuously provide updated resource information to the shared memory. As a result, the shared memory may store continuously updated resource information for each computing node within the computing cluster.
[0045] The updated resource information may reveal to the scheduler that the computing node initially assigned to complete the job request actually does not have enough processing power and memory to meet the QoS requirements of the job request and the scheduler. For example, between the time the local resource information was received and the time the updated resource information was received, the computing node may have been assigned many additional job requests from other schedulers that reduced the processing power and memory that are available. As a result, the scheduler may send the instruction to the computing node to reassign the job request to a different computing node.
[0046] In one example, the instruction to reassign the job request to a different computing node may be received before the job request is initiated on the computing node. Said another way, the instruction may be received before the computing node begins processing the job request. As a result, the job request can be successfully reassigned.
[0047] In one example, the instruction to reassign the job request to a different computing node may be received after the job request is initiated on the computing node or has been completed. If the instruction is received after the job request is initiated or completed by the computing node initially assigned to complete the job request, the computing node may send a notification to the scheduler that the job request has been initiated or completed and deny the instruction to reassign the job request.
[0048] At block 410, the method 400 reassigns the job request to the different computing node. For example, the scheduler may select a different computing node that has the processing power and memory available to complete the job request in accordance with the QoS requirements of the job request and the scheduler. At block 412, the method 400 ends.
[0049] It should be noted that although not explicitly specified, any of the blocks, functions, or operations of the example methods 300 and 400 described above may include a storing, displaying, and/or outputting block. In other words, any data, records, fields, and/or intermediate results discussed in the methods can be stored, displayed, and/or output to another device. Furthermore, blocks, functions, or operations in FIGs. 3 and 4 that recite a determining operation, or involve a decision, do not necessarily require that both branches of the determining operation be practiced.
[0050] It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, or variations, therein may be subsequently made which are also intended to be encompassed by the following claims.
Claims
1. A method, comprising:
receiving, by a processor, a job request at a scheduler of a plurality of schedulers based upon a quality of service (QoS) level associated with the job request and the scheduler;
scheduling, by the processor, the job request to a computing node based upon locally stored resource information of a selected number of computing nodes within a computing cluster;
accessing, by the processor, a shared memory via a memory fabric to obtain updated resource information of the selected number of computing nodes; and
re-scheduling, by the processor, the job request to a different computing node based upon the updated resource information.
2. The method of claim 1, wherein each one of the plurality of schedulers has a different QoS level.
3. The method of claim 2, wherein each one of the different QoS levels is based upon a resource state, a scheduling algorithm, a constraint and a scheduling throughput.
4. The method of claim 1, wherein the selected number of computing nodes comprises less than all of the computing nodes in the computing cluster.
5. The method of claim 1, wherein the selected number of computing nodes comprises all of the computing nodes in the computing cluster.
6. The method of claim 1, wherein the accessing is performed periodically to obtain the updated resource information and update the locally stored resource information.
7. The method of claim 1, further comprising:
sending, by the processor, a resource update to the shared memory via a memory fabric after the job request is re-scheduled to the different computing node.
8. A system, comprising:
a plurality of schedulers for receiving a job request, each one of the plurality of schedulers having a different quality of service (QoS) level and having a local memory to store resource information of a selected number of computing nodes within a computing cluster;
a plurality of computing nodes communicatively coupled to each one of the plurality of schedulers for processing the job request scheduled by one of the plurality of schedulers based upon the resource information in the local memory of the one of the plurality of schedulers; and
a shared memory communicatively coupled to each one of the plurality of schedulers via a memory fabric for storing updated resource information of all of the computing nodes within the computing cluster, wherein the updated resource information is used to update the resource information in the local memory and cause the one of the plurality of schedulers to re-schedule the job request based upon the resource information in the local memory of the one of the plurality of schedulers that is updated.
9. The system of claim 8, wherein each one of the computing nodes comprises:
a task queue; and
a node manager in communication with the task queue for managing each job request that is assigned to a respective computing node.
10. The system of claim 8, wherein the shared memory comprises a dynamic random access memory (DRAM) or a non-volatile memory (NVM).
11. The system of claim 8, wherein each one of the plurality of schedulers is deployed as a separate server.
12. The system of claim 8, wherein each one of the different QoS levels is based upon a resource state, a scheduling algorithm, a constraint and a scheduling throughput.
13. A method, comprising:
receiving, by a processor of a computing node, a job request from a scheduler of a plurality of different schedulers based upon local resource information of a selected number of computing nodes within a computing cluster that is stored in a local memory of the scheduler;
placing, by the processor, the job request in a task queue;
receiving, by the processor, an instruction to reassign the job request to a different computing node within the computing cluster based upon updated resource information of the selected number of computing nodes obtained by the scheduler from a shared memory; and
reassigning, by the processor, the job request to the different computing node.
14. The method of claim 13, wherein the instruction is received before the job request is initiated.
15. The method of claim 13, further comprising:
sending, by the processor, a resource update to the shared memory via a memory fabric after the job request is reassigned to the different computing node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2015/041899 | 2015-07-24 | 2015-07-24 | Scheduling jobs in a computing cluster |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2015/041899 | 2015-07-24 | 2015-07-24 | Scheduling jobs in a computing cluster |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017018978A1 | 2017-02-02 |
Family
ID=57884877
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2015/041899 | Scheduling jobs in a computing cluster | 2015-07-24 | 2015-07-24 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2017018978A1 (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070240161A1 (en) * | 2006-04-10 | 2007-10-11 | General Electric Company | System and method for dynamic allocation of resources in a computing grid |
US8205208B2 (en) * | 2007-07-24 | 2012-06-19 | Internaitonal Business Machines Corporation | Scheduling grid jobs using dynamic grid scheduling policy |
US20090288095A1 (en) * | 2008-05-15 | 2009-11-19 | International Business Machines Corporation | Method and System for Optimizing a Job Scheduler in an Operating System |
US20120060171A1 (en) * | 2010-09-02 | 2012-03-08 | International Business Machines Corporation | Scheduling a Parallel Job in a System of Virtual Containers |
WO2013106256A1 (en) * | 2012-01-09 | 2013-07-18 | Microsoft Corporation | Decoupling paas resources, jobs, and scheduling |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110764910A (en) * | 2019-10-23 | 2020-02-07 | 中国银行股份有限公司 | Batch job scheduling processing method and device |
CN111949407A (en) * | 2020-08-13 | 2020-11-17 | 北京字节跳动网络技术有限公司 | Resource allocation method and device |
CN111949407B (en) * | 2020-08-13 | 2024-04-12 | 抖音视界有限公司 | Resource allocation method and device |
WO2024087663A1 (en) * | 2022-10-28 | 2024-05-02 | 华为技术有限公司 | Job scheduling method and apparatus, and chip |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11593404B2 (en) | Multi-cluster warehouse | |
US20200364608A1 (en) | Communicating in a federated learning environment | |
He et al. | Matchmaking: A new mapreduce scheduling technique | |
US8689226B2 (en) | Assigning resources to processing stages of a processing subsystem | |
US9442760B2 (en) | Job scheduling using expected server performance information | |
US20200174844A1 (en) | System and method for resource partitioning in distributed computing | |
CN110383764B (en) | System and method for processing events using historical data in a serverless system | |
CN109564528B (en) | System and method for computing resource allocation in distributed computing | |
JP6519111B2 (en) | Data processing control method, data processing control program and data processing control device | |
US9507633B2 (en) | Scheduling method and system | |
Jonathan et al. | Awan: Locality-aware resource manager for geo-distributed data-intensive applications | |
WO2017018978A1 (en) | Scheduling jobs in a computing cluster | |
CN113760549B (en) | Pod deployment method and device | |
US20150365474A1 (en) | Computer-readable recording medium, task assignment method, and task assignment apparatus | |
Liu et al. | Deadline guaranteed service for multi-tenant cloud storage | |
US9990240B2 (en) | Event handling in a cloud data center | |
US20220124151A1 (en) | Task allocation among devices in a distributed data storage system | |
Zeng et al. | Workload-aware resource reservation for multi-tenant nosql | |
Ghazali et al. | CLQLMRS: improving cache locality in MapReduce job scheduling using Q-learning | |
Nzanywayingoma et al. | Task scheduling and virtual resource optimising in Hadoop YARN-based cloud computing environment | |
Ullah et al. | Task Priority‐Based Cached‐Data Prefetching and Eviction Mechanisms for Performance Optimization of Edge Computing Clusters | |
RU2679546C2 (en) | Device and method for running multiple stream | |
Reda et al. | BRB: betteR batch scheduling to reduce tail latencies in cloud data stores | |
Swanson | Matchmaking: A new mapreduce scheduling | |
CN104063327B (en) | A kind of storage method and read-write storage device applied to wireless telecommunications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15899780; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 15899780; Country of ref document: EP; Kind code of ref document: A1 |