WO2017018978A1 - Scheduling jobs in a computing cluster - Google Patents

Scheduling jobs in a computing cluster

Info

Publication number
WO2017018978A1
Authority
WO
WIPO (PCT)
Prior art keywords
job request
resource information
schedulers
computing
scheduler
Prior art date
Application number
PCT/US2015/041899
Other languages
French (fr)
Inventor
Yuan Chen
Vanish Talwar
Dejan S. Milojicic
Original Assignee
Hewlett Packard Enterprise Development Lp
Priority date
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development Lp filed Critical Hewlett Packard Enterprise Development Lp
Priority to PCT/US2015/041899 priority Critical patent/WO2017018978A1/en
Publication of WO2017018978A1 publication Critical patent/WO2017018978A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In example implementations, a method is provided. The method includes receiving a job request at a scheduler of a plurality of schedulers based upon a quality of service (QoS) level associated with the job request and the scheduler. The job request is scheduled to a computing node based upon locally stored resource information of a selected number of computing nodes within a computing cluster. A shared memory is accessed via a memory fabric to obtain updated resource information of the selected number of computing nodes. The job request may then be re-scheduled to a different computing node based upon the updated resource information.

Description

SCHEDULING JOBS IN A COMPUTING CLUSTER
BACKGROUND
[0001] Traditional cluster scheduling relies on centralized architectures. The traditional cluster scheduling also implements a single scheduling algorithm for all workloads and the entire cluster. Such schedulers are inflexible and difficult to scale when serving a large number of jobs in a large-scale cluster. Newer schedulers are being developed; however, these newer schedulers still suffer from bottlenecks due to a centralized key component.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] FIG. 1 is an example system of the present disclosure;
[0003] FIG. 2 is an example table of quality of service for a plurality of different schedulers;
[0004] FIG. 3 is a flowchart of an example method for scheduling a job request from a perspective of a scheduler; and
[0005] FIG. 4 is a flowchart of an example method for scheduling a job request from a perspective of a computing node.
DETAILED DESCRIPTION
[0006] The present disclosure discloses a method, system and apparatus for scheduling a job request. As discussed above, traditional cluster scheduling relies on centralized architectures. The traditional cluster scheduling also implements a single scheduling algorithm for all workloads and the entire cluster. Such schedulers are inflexible and difficult to scale when serving a large number of jobs in a large-scale cluster. Newer schedulers are being developed; however, these newer schedulers still suffer from bottlenecks due to a centralized key component.
[0007] Examples of the present disclosure provide a method for scheduling a job request in a computing cluster comprising a plurality of computing nodes. In one example, the present disclosure provides a hybrid scheduler that is decentralized and scalable. In other words, as the number of job requests grows and computing nodes are added, the system may be scaled by adding additional schedulers.
[0008] In addition, the hybrid schedulers allow for different schedulers to use different scheduling algorithms. In one example, the schedulers may be based upon different levels of Quality of Service (QoS). In addition, the hybrid schedulers allow for the system to be adaptable based upon an increase in some types of QoS requirements and a decrease in other QoS requirements.
[0009] Lastly, the system is decentralized. As a result, no bottleneck is created by a single scheduler or any component that performs a centralized function. Rather, the system of the present disclosure provides a plurality of different schedulers that may use different scheduling algorithms. Each scheduler may schedule tasks independently of other schedulers. In addition, rather than attempting to resolve scheduling conflicts before job requests are scheduled by communicating between the schedulers, any scheduling conflicts or load balancing may be performed midstream.
[0010] For example, each scheduler may schedule a job request to a computing node initially using local resource information stored in a local memory of the scheduler. Each scheduler may have a global or a partial view of resource information depending on the needs of its scheduling algorithm.
Periodically, each scheduler may access updated resource information that is stored and continuously updated on each computing node via a memory fabric. If the initial assumption of how resources are distributed changes based on the updated resource information, the scheduler may reassign the job request to a different computing node at a later time. Similarly, the scheduler may reassign the job request to a new computing node according to a Quality of Service (QoS) change. For example, the scheduler may move a job from a busy computing node to a less loaded computing node as the job's completion time deadline is approaching. As a result, the methods and systems for scheduling a job request of the present disclosure provide a more efficient process than current job schedulers of a computing cluster.
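The deadline-driven move described in paragraph [0010] can be illustrated with a short sketch. This is a minimal, hypothetical example rather than the claimed implementation: the ResourceSnapshot fields, the queue drain-rate estimate, and the should_move_job heuristic are assumptions made for exposition.
```python
import time
from dataclasses import dataclass


@dataclass
class ResourceSnapshot:
    """Point-in-time view of one computing node (fields are assumed)."""
    node_id: str
    queued_tasks: int
    tasks_per_second: float  # observed drain rate of the node's task queue


def estimate_completion(snapshot, now):
    """Rough estimate of when a newly queued task would finish on this node."""
    return now + snapshot.queued_tasks / max(snapshot.tasks_per_second, 1e-6)


def should_move_job(current, candidates, deadline, now=None):
    """Return a less loaded node id if the current node would miss the deadline."""
    now = time.time() if now is None else now
    if estimate_completion(current, now) <= deadline:
        return None  # the current node still meets the QoS deadline
    best = min(candidates, key=lambda s: estimate_completion(s, now))
    if estimate_completion(best, now) < estimate_completion(current, now):
        return best.node_id
    return None


# Example: the job's deadline is 60 seconds away and node 110-2 is backed up.
busy = ResourceSnapshot("node-110-2", queued_tasks=500, tasks_per_second=2.0)
idle = ResourceSnapshot("node-110-1", queued_tasks=5, tasks_per_second=2.0)
print(should_move_job(busy, [idle], deadline=time.time() + 60))  # -> node-110-1
```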
[0011] FIG. 1 illustrates an example system 100 of the present disclosure. In one example, the system 100 includes a plurality of different schedulers 102-1 to 102-N (herein also referred to individually as scheduler 102 or collectively as schedulers 102). In one example, each one of the schedulers 102-1 to 102-N may receive a job request 106-1 to 106-N (herein also referred to individually as a job request 106 or collectively as job requests 106). The job requests 106 may include different types of job requests (e.g., a web service, a MapReduce batch job, a database query, and the like). The job requests 106 may each be associated with a particular QoS level.
[0012] In one example, the job requests 106 may arrive from different users for different applications that are located remotely from the system 100. In one example, the job requests 106 may be sent to one of the schedulers 102 based upon a corresponding QoS level. In other words, each one of the schedulers 102-1 to 102-N may have a different QoS level. For example, the scheduler 102-1 may have a highest QoS level, the scheduler 102-2 may have a medium QoS level and the scheduler 102-N may have a lowest QoS level. In one example, multiple schedulers 102 may have a highest QoS level, a medium QoS level and a lowest QoS level. Said another way, each QoS level may be associated with more than one scheduler 102.
[0013] FIG. 2 illustrates an example table 200 of various parameters that can be used to determine a QoS level. The table 200 illustrates an example of four different schedulers 102. In one example, the parameters may include a resource state, an algorithm, constraints and a scheduling throughput. In one example, the resource state parameter is related to whether the scheduler is global or non-global. For example, global schedulers may make decisions based on resource information of the entire computing cluster. A non-global scheduler may make decisions based on partial information or a selected number of computing nodes within the computing cluster.
[0014] In one example, the algorithm parameter is related to what type of scheduling algorithm is employed by a scheduler 102. For example, each scheduler 102 may use any type of scheduling algorithm ranging from complex algorithms to simple load balancing algorithms. Examples of scheduling algorithms may include load balancing, bin packing, random-sampling, load balancing on a sub-cluster, and the like.
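As an illustration of how different schedulers 102 might plug in different scheduling algorithms, the following sketch pairs one scheduler with a least-loaded (bin-packing style) policy and another with a two-node random-sampling policy. The class and function names and the resource fields are assumptions for exposition, not the algorithms claimed in the disclosure.
```python
import random


def least_loaded(nodes):
    """Bin-packing-style choice: pick the node with the most free capacity."""
    return max(nodes, key=lambda n: nodes[n]["free_cpu"] + nodes[n]["free_mem_gb"])


def sample_two(nodes):
    """Random-sampling choice: probe two random nodes, keep the less loaded one."""
    a, b = random.sample(list(nodes), 2)
    return a if nodes[a]["free_cpu"] >= nodes[b]["free_cpu"] else b


class Scheduler:
    """Each scheduler instance is configured with its own placement algorithm."""

    def __init__(self, name, algorithm):
        self.name = name
        self.algorithm = algorithm

    def place(self, job_id, local_resource_info):
        node = self.algorithm(local_resource_info)
        return job_id, node


snapshot = {
    "node-1": {"free_cpu": 8, "free_mem_gb": 32},
    "node-2": {"free_cpu": 2, "free_mem_gb": 4},
    "node-3": {"free_cpu": 6, "free_mem_gb": 16},
}
high_qos = Scheduler("scheduler-1", least_loaded)
low_qos = Scheduler("scheduler-N", sample_two)
print(high_qos.place("job-106-1", snapshot))
print(low_qos.place("job-106-N", snapshot))
```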
[0015] In one example, the constraints parameter is related to a particular policy or constraint that is enforced. Examples of constraints that can be enforced include data-locality constraints, inter-job constraints (e.g., job 1 and job 2 cannot be placed on the same computing node), capacity constraints, fairness, priority, and the like.
[0016] In one example, the scheduling throughput parameter may be related to a particular scheduling performance. For example, different schedulers may provide different scheduling performance related to latency (e.g., delay) and throughput (e.g., scheduling decisions/second) depending on the scheduling algorithm that is used and the complexity. The scheduling throughput parameter may have example values such as high, medium and low. In another example, the scheduling throughput parameter may be a numerical value (e.g., 1-10).
[0017] Thus, based on the particular requirements for a job request 106, the job request 106 may be matched with a scheduler 102 that matches the QoS requirements of the job request 106. In one example, each one of the schedulers 102 may have a local memory 104-1 to 104-N (herein also individually referred to as local memory 104 or collectively referred to as local memories 104). The local memory 104 may include local resource information of a respective scheduler 102.
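The matching of a job request 106 to a scheduler 102 described in paragraph [0017] might be modeled along the lines of table 200. The sketch below is hypothetical: the QosProfile fields, the example scheduler profiles, and the matching rule are assumptions rather than the disclosed mechanism.
```python
from dataclasses import dataclass


@dataclass
class QosProfile:
    """QoS parameters in the spirit of table 200 (values here are assumed)."""
    resource_state: str      # "global" or "non-global"
    algorithm: str           # e.g. "bin-packing", "load-balancing", "random-sampling"
    constraints: frozenset   # e.g. {"data-locality", "priority"}
    throughput: int          # 1 (low) .. 10 (high)


SCHEDULERS = {
    "scheduler-1": QosProfile("global", "bin-packing", frozenset({"data-locality"}), 3),
    "scheduler-2": QosProfile("non-global", "load-balancing", frozenset(), 7),
    "scheduler-3": QosProfile("non-global", "random-sampling", frozenset(), 10),
}


def match_scheduler(required):
    """Pick a scheduler whose profile satisfies the job request's QoS requirements.

    This toy rule only checks algorithm, constraints and throughput; a real
    matcher could weigh the resource state and other parameters as well.
    """
    for name, profile in SCHEDULERS.items():
        if (profile.algorithm == required.algorithm
                and required.constraints <= profile.constraints
                and profile.throughput >= required.throughput):
            return name
    raise LookupError("no scheduler satisfies the requested QoS level")


job_qos = QosProfile("global", "bin-packing", frozenset({"data-locality"}), 2)
print(match_scheduler(job_qos))  # -> "scheduler-1"
```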
[0018] In one example, the schedulers 102 may be deployed as a computer or server comprising hardware. The computer may include a processor, a non-transitory computer readable medium (e.g., a hard disk drive, random access memory (RAM), and the like) and input/output devices. The schedulers 102-1 to 102-N may be deployed as separate computers, each having its own processor and local memory 104.
[0019] In one example, each scheduler 102 may track resource information for a selected number of computing nodes 110-1 to 110-N (herein also individually referred to as a computing node 110 or collectively referred to as computing nodes 110) within a computing cluster 108. The computing nodes 110 of the computing cluster 108 may be at a same location or may be remotely located from one another. In one example, each one of the computing nodes 110 may include a task queue 112 and a node manager 114 that manages the completion of tasks 116-1 to 116-N (herein referred to individually as a task 116 or collectively as tasks 116). In one example, the computing nodes 110 may also be deployed as separate computers or servers. Each computing node 110 may include its own allocation of a processor or processors and a non-transitory computer readable medium or mediums (e.g., a hard disk drive, random access memory (RAM), and the like).
[0020] Each job request 106 may include a task or tasks that are completed by a computing node 110. When the computing node 110 receives a job request, the tasks associated with the job request may be placed in the task queue 112. The tasks within the task queue 112 may be processed based on any method, including for example, a first in first out (FIFO) method, based on priority, based on a fair sharing method, and the like.
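A brief sketch of two of the queueing disciplines mentioned above, FIFO and priority ordering, assuming a simple in-memory stand-in for the task queue 112; the job and task identifiers and the priority values are illustrative only.
```python
import heapq
from collections import deque

# FIFO processing: tasks are drained in arrival order.
fifo_queue = deque()
fifo_queue.append(("job-106-1", "task-116-1"))
fifo_queue.append(("job-106-1", "task-116-2"))
while fifo_queue:
    job, task = fifo_queue.popleft()
    print("running", job, task)

# Priority processing: the lowest priority number runs first.
priority_queue = []
heapq.heappush(priority_queue, (2, "job-106-2", "task-116-3"))
heapq.heappush(priority_queue, (1, "job-106-3", "task-116-4"))
while priority_queue:
    priority, job, task = heapq.heappop(priority_queue)
    print("running", job, task, "priority", priority)
```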
[0021] As noted above, some schedulers 102 may be global schedulers that track resources of all computing nodes 110-1 to 110-N. Thus, the selected number of computing nodes 110 tracked by global schedulers may be all of the computing nodes 110.
[0022] Alternatively, some schedulers 102 may be non-global schedulers that track resources of only a subset of the computing nodes 110-1 to 110-N. Thus, the selected number of computing nodes 110 tracked by non-global schedulers may be less than all of the computing nodes 110. For example, the scheduler 102-2 may be a non-global scheduler that tracks resource information of computing nodes 110-2 and 110-N.
[0023] In one example, resource information may be defined as a snapshot at a particular time of each one of the computing nodes 110 that are tracked by a scheduler. The resource information may provide information related to how many tasks are currently assigned to each computing node 110, how much processing power and memory are currently used and available at each computing node 110, what types of jobs are currently assigned to each computing node 110, an estimated time to complete all of the currently assigned job requests or tasks at each computing node 110, and the like.
[0024] In one example, each scheduler 102 may have local resource information (e.g., resource information stored locally) on a respective local memory 104. The scheduler 102 may use the local resource information to initially assign the job request 106 to a computing node 110. For example, a job request 106-1 may arrive at the scheduler 102-1. The scheduler 102-1 may have local resource information stored at local memory 104-1. Based on the local resource information, the scheduler 102-1 may determine that the computing node 110-2 has the most processing power and memory available to complete the job request 106-1. As a result, the scheduler 102-1 may initially assign the job request 106-1 to the computing node 110-2.
[0025] In one example, each one of the schedulers 102-1 to 102-N may schedule its respective job requests 106-1 to 106-N in a similar fashion and independent of one another. In other words, the schedulers 102-1 to 102-N do not need to communicate with one another to avoid scheduling conflicts before scheduling the respective job requests 106-1 to 106-N. Rather, each scheduler 102-1 to 102-N assigns its respective job request 106-1 to 106-N based upon local resource information and independent of how the other schedulers 102 are scheduling the job requests 106.
[0026] In one example, the system 100 may include a memory fabric 118 and shared memory 120-1 to 120-N (also referred to herein collectively as shared memory 120). In one example, the shared memory 120-1 to 120-N may include a plurality of memory unit locations or physical memory units. In one example, the shared memory units 120 may be a dynamic random access memory (DRAM) or a non-volatile memory (NVM). In one example, access to the shared memory 120 (e.g., a read access or a write access) may be translated and routed over the memory fabric 118.
[0027] In one example, each one of the computing nodes 110 may be in communication with each one of the schedulers 102 and the shared memory 120 via the memory fabric 118. As a result, as job requests 106 are assigned to computing nodes 110, as job requests 106 are completed, or as job requests 106 are reassigned or moved to different nodes 110 (as discussed below), the node manager 114 of each computing node 110 may send updates regarding resource information to the shared memory 120. Updated resource information of each one of the computing nodes 110 may be stored in shared memory 120.
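One way the node manager 114 could publish resource updates to the shared memory 120 is sketched below. The shared memory is modeled as a plain dictionary keyed by node id, which is only a stand-in for a fabric-attached memory region, and the field names are assumptions.
```python
import time

# Stand-in for the fabric-attached shared memory 120; in this sketch it is a
# plain dict keyed by node id rather than an actual memory-fabric mapping.
SHARED_MEMORY = {}


def publish_node_update(node_id, queued_tasks, free_cpu, free_mem_gb):
    """Node manager writes its latest resource snapshot after any change."""
    SHARED_MEMORY[node_id] = {
        "queued_tasks": queued_tasks,
        "free_cpu": free_cpu,
        "free_mem_gb": free_mem_gb,
        "updated_at": time.time(),
    }


# A node publishes after a job is assigned, completed, or moved away.
publish_node_update("node-110-2", queued_tasks=12, free_cpu=3, free_mem_gb=8)
publish_node_update("node-110-1", queued_tasks=1, free_cpu=14, free_mem_gb=60)
print(SHARED_MEMORY["node-110-2"]["queued_tasks"])
```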
[0028] In one example, each one of the schedulers 102 may access the updated resource information for the selected number of computing nodes 110 from the shared memory 120. In one example, the schedulers 102 may access the updated resource information for all computing nodes periodically. In one example, a small amount of randomness may be inserted into the periodic access times of each scheduler 102 such that the schedulers 102 do not access the shared memory 120 at the same time or obtain identical updated resource information.
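A minimal sketch of the periodic refresh with a small random offset (jitter) follows. The period and jitter values are illustrative assumptions, not values given in the disclosure.
```python
import random
import time


def refresh_loop(scheduler_name, shared_memory, local_cache, base_period=30.0,
                 jitter=5.0, iterations=3):
    """Periodically copy updated resource information into the scheduler's
    local memory, with a small random offset so that schedulers do not all
    read the shared memory at the same instant."""
    for _ in range(iterations):
        local_cache.clear()
        local_cache.update(shared_memory)  # read over the memory fabric
        print(scheduler_name, "refreshed", len(local_cache), "node entries")
        time.sleep(base_period + random.uniform(-jitter, jitter))


shared = {"node-110-1": {"free_cpu": 14}, "node-110-2": {"free_cpu": 3}}
local = {}
refresh_loop("scheduler-102-1", shared, local, base_period=0.1, jitter=0.05,
             iterations=2)
```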
[0029] As noted above, each one of the schedulers 102 may assign a respective job request 106 to one of the computing nodes 110 based upon local resource information stored in local memory 104. However, the local resource information may be stale or outdated. After the scheduler assigns the respective job request 106, the scheduler 102 may access the local memory 104 to obtain updated resource information obtained from the shared memory 120. Using the example above, after receiving the updated resource information, the scheduler 102-1 may determine that computing node 110-2 is actually very busy and the computing node 110-1 has more processing power and memory available to complete the job request 106-1. As a result, the scheduler 102-1 may send an instruction to the computing node 110-2 to reassign the job request 106-1 to the computing node 110-1.
[0030] In one example, if the job request 106-1 has not been initiated at the computing node 110-2, then the computing node 110-2 may reassign the job request 106-1 to the computing node 110-1. For example, the computing node 110-2 may send the job request 106-1 and the associated tasks to the computing node 110-1 directly. In another example, the computing node 110-2 may delete the job request 106-1 from the task queue 112 and the scheduler 102-1 may then reassign the job request 106-1 to the computing node 110-1.
[0031] In one example, if the job request 106-1 has been initiated or completed by the computing node 110-2, then the computing node 110-2 may send a notification to the scheduler 102-1 that the job request 106-1 has been initiated or completed and deny the instruction to reassign the job request 106-1.
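The two reassignment outcomes described in paragraphs [0030] and [0031] might be handled on the node side roughly as follows. The state names and return values here are assumptions for illustration, not part of the disclosure.
```python
class NodeManager:
    """Sketch of how a node might handle a reassignment instruction."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.jobs = {}  # job_id -> state ("queued", "initiated", "completed")

    def accept(self, job_id):
        self.jobs[job_id] = "queued"

    def start(self, job_id):
        self.jobs[job_id] = "initiated"

    def handle_reassign(self, job_id):
        """Honor the instruction only if the job has not been started yet."""
        state = self.jobs.get(job_id, "unknown")
        if state == "queued":
            del self.jobs[job_id]          # remove from the task queue
            return {"reassigned": True}
        # Already initiated or completed: notify the scheduler and deny.
        return {"reassigned": False, "state": state}


node = NodeManager("node-110-2")
node.accept("job-106-1")
print(node.handle_reassign("job-106-1"))   # {'reassigned': True}
node.accept("job-106-2")
node.start("job-106-2")
print(node.handle_reassign("job-106-2"))   # {'reassigned': False, 'state': 'initiated'}
```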
[0032] In other words, the system 100 saves time and prevents bottlenecks by allowing each one of the schedulers 102-1 to 102-N to schedule job requests 106-1 to 106-N immediately based upon local resource information. The schedulers 102-1 to 102-N do not require communication or coordination with one another to schedule the job requests 106-1 to 106-N. Rather, any scheduling conflicts are resolved mid-stream at a later time after the initial scheduling of the job request 106 by using updated resource information obtained by accessing the memory fabric 118. In addition, to prevent two schedulers 102 from obtaining the same updated resource information and potentially assigning the respective job requests 106 to the same computing node 110, a small randomness may be introduced to the periodic timing of when the schedulers 102 access the shared memory 120 via the memory fabric 118.
[0033] FIG. 3 illustrates an example flowchart of a method 300 for scheduling a job request. In one example, the method 300 may be performed by a scheduler 102.
[0034] At block 302 the method 300 begins. At block 304, the method 300 receives a job request at a scheduler based on a quality of service (QoS) of the job request and the scheduler. In one example, the scheduler may be one of a plurality of different schedulers. Each one of the plurality of different schedulers may have a different QoS level based upon different parameters (e.g., parameters described in table 200 in FIG. 2).
[0035] At block 306, the method 300 schedules the job request to a computing node based upon locally stored resource information. For example, each scheduler may have a local memory and locally stored resource information. The locally stored resource information may include a snapshot at a particular time of each one of the computing nodes that are tracked by a scheduler. The resource information may relate to how many tasks are currently assigned to each computing node, how much processing power and memory are currently used and available at each computing node, what types of jobs are currently assigned to each computing node, an estimated time to complete all of the currently assigned job requests or tasks at each computing node, and the like.
[0036] In one example, the locally stored resource information may be for a selected number of computing nodes. For example, each scheduler may be either a global scheduler or a non-global scheduler. The global schedulers may track resource information for all of the computing nodes in a computing cluster. The non-global schedulers may track resource information for less than all (e.g., a subset, two or more, and the like) of the computing nodes in the computing cluster.
[0037] At block 308, the method 300 accesses a shared memory via a memory fabric to obtain updated resource information. In one example, the scheduler may periodically access (e.g., every 30 seconds, every minute, every hour, every few hours, and the like) the shared memory to obtain the updated resource information. In one example, a small amount of randomness may be inserted into the timing of the periodic access of each one of the schedulers to prevent two schedulers from accessing the shared memory at the same time or obtaining identical updated resource information. Inserting a small amount of randomness may prevent scheduling conflicts caused by two different schedulers attempting to reassign their respective job requests to the same computing node at the same time based upon the same updated resource information.
[0038] At block 310, the method 300 re-schedules the job request to a different computing node based upon the updated resource information. For example, the locally stored resource information used by the scheduler to initially assign the job request to a computing node may have been stale or not current. As a result, the updated resource information may indicate that the computing node initially assigned the job request is very busy or has a large number of tasks to complete, so the estimated time to complete the job request may be longer than what is acceptable. The scheduler may then determine, based upon the updated resource information, that a different computing node has more processing power and memory available to complete the job request and reassign the job request to that computing node.
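One hypothetical way the scheduler could act on the updated resource information at block 310 is to compare the originally chosen node against the best node in the fresh snapshot and move the job only when the improvement is significant, which limits needless churn. The free-CPU metric and the 25% margin below are assumptions, not values from the disclosure.
```python
def reschedule_if_better(job_node, updated_info, margin=0.25):
    """Re-evaluate an earlier placement once fresher resource information is
    available; move the job only if another node is clearly better."""
    current_free = updated_info[job_node]["free_cpu"]
    best_node = max(updated_info, key=lambda n: updated_info[n]["free_cpu"])
    best_free = updated_info[best_node]["free_cpu"]
    if best_node != job_node and best_free > current_free * (1 + margin):
        return best_node  # issue a reassignment instruction to job_node
    return None  # keep the original placement


updated = {"node-110-2": {"free_cpu": 2}, "node-110-1": {"free_cpu": 10}}
print(reschedule_if_better("node-110-2", updated))  # -> "node-110-1"
```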
[0039] In one example, the blocks 304-310 may be repeated for each job request that is received by a scheduler or for each scheduler in the system. At block 312, the method 300 ends.
[0040] FIG. 4 illustrates an example flowchart of another method 400 for scheduling a job request. In one example, the method 400 may be performed by a computing node 110.
[0041] At block 402 the method 400 begins. At block 404, the method 400 receives a job request from a scheduler based upon local resource information. For example, a computing node may appear to have the processing power and memory available to complete a job request within the QoS requirements of the job request and the scheduler based upon the local resource information. As a result, the scheduler may assign the computing node to complete the job request.
[0042] At block 406, the method 400 places the job request in a task queue. For example, the job request may include a task or tasks that are to be completed. The task or tasks associated with the job request may be placed in a task queue of the assigned computing node.
[0043] At block 408, the method 400 receives an instruction to reassign the job request to a different computing node based upon updated resource information obtained by the scheduler from a shared memory. For example, the scheduler may periodically access a shared memory to obtain updated resource information of a select number of computing nodes including the computing node that is assigned to complete the job request.
[0044] For example, the computing nodes may be in communication with the shared memory via the memory fabric. As job requests are completed or reassigned, the computing nodes may continuously provide updated resource information to the shared memory. As a result, the shared memory may store continuously updated resource information for each computing node within the computing cluster.
[0045] The updated resource information may reveal to the scheduler that the computing node initially assigned to complete the job request actually does not have enough processing power and memory to meet the QoS requirements of the job request and the scheduler. For example, between the time the local resource information was received and the time the updated resource information was received, the computing node may have been assigned many additional job requests from other schedulers, reducing the processing power and memory that are available. As a result, the scheduler may send the instruction to the computing node to reassign the job request to a different computing node.
[0046] In one example, the instruction to reassign the job request to a different computing node may be received before the job request is initiated on the computing node. Said another way, the instruction may be received before the computing node begins processing the job request. As a result, the job request can be successfully reassigned.
[0047] In one example, the instruction to reassign the job request to a different computing node may be received after the job request is initiated on the computing node or has been completed. If the instruction is received after the job request is initiated or completed by the computing node initially assigned to complete the job request, the computing node may send a notification to the scheduler that the job request has been initiated or completed and deny the instruction to reassign the job request.
[0048] At block 410, the method 400 reassigns the job request to the different computing node. For example, the scheduler may select a different computing node that has the processing power and memory available to complete the job request in accordance with the QoS requirements of the job request and the scheduler. At block 412, the method 400 ends.
[0049] It should be noted that although not explicitly specified, any of the blocks, functions, or operations of the example methods 300 and 400 described above may include a storing, displaying, and/or outputting block. In other words, any data, records, fields, and/or intermediate results discussed in the methods can be stored, displayed, and/or output to another device.
Furthermore, blocks, functions, or operations in FIGs. 3 and 4 that recite a determining operation, or involve a decision, do not necessarily require that both branches of the determining operation be practiced.
[0050] It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, or variations therein may be subsequently made which are also intended to be encompassed by the following claims.

Claims

1. A method, comprising:
receiving, by a processor, a job request at a scheduler of a plurality of schedulers based upon a quality of service (QoS) level associated with the job request and the scheduler;
scheduling, by the processor, the job request to a computing node based upon locally stored resource information of a selected number of computing nodes within a computing cluster;
accessing, by the processor, a shared memory via a memory fabric to obtain updated resource information of the selected number of computing nodes; and
re-scheduling, by the processor, the job request to a different computing node based upon the updated resource information.
2. The method of claim 1, wherein each one of the plurality of schedulers has a different QoS level.
3. The method of claim 2, wherein each one of the different QoS levels is based upon a resource state, a scheduling algorithm, a constraint and a scheduling throughput.
4. The method of claim 1, wherein the selected number of computing nodes comprises less than all of the computing nodes in the computing cluster.
5. The method of claim 1, wherein the selected number of computing nodes comprises all of the computing nodes in the computing cluster.
6. The method of claim 1, wherein the accessing is performed periodically to obtain the updated resource information and update the locally stored resource information.
7. The method of claim 1, further comprising:
sending, by the processor, a resource update to the shared memory via a memory fabric after the job request is re-scheduled to the different computing node.
8. A system, comprising:
a plurality of schedulers for receiving a job request, each one of the plurality of schedulers having a different quality of service (QoS) level and having a local memory to store resource information of a selected number of computing nodes within a computing cluster;
a plurality of computing nodes communicatively coupled to each one of the plurality of schedulers for processing the job request scheduled by one of the plurality of schedulers based upon the resource information in the local memory of the one of the plurality of schedulers; and
a shared memory communicatively coupled to each one of the plurality of schedulers via a memory fabric for storing updated resource information of all of the computing nodes within the computing cluster, wherein the updated resource information is used to update the resource information in the local memory and cause the one of the plurality of schedulers to re-schedule the job request based upon the resource information in the local memory of the one of the plurality of schedulers that is updated.
9. The system of claim 8, wherein each one of the computing nodes comprises:
a task queue; and
a node manager in communication with the task queue for managing each job request that is assigned to a respective computing node.
10. The system of claim 8, wherein the shared memory comprises a dynamic random access memory (DRAM) or a non-volatile memory (NVM).
11. The system of claim 8, wherein each one of the plurality of schedulers is deployed as a separate server.
12. The system of claim 8, wherein each one of the different QoS levels is based upon a resource state, a scheduling algorithm, a constraint and a scheduling throughput.
13. A method, comprising:
receiving, by a processor of a computing node, a job request from a scheduler of a plurality of different schedulers based upon local resource information of a selected number of computing nodes within a computing cluster that is stored in a local memory of the scheduler;
placing, by the processor, the job request in a task queue;
receiving, by the processor, an instruction to reassign the job request to a different computing node within the computing cluster based upon updated resource information of the selected number of computing nodes obtained by the scheduler from a shared memory; and
reassigning, by the processor, the job request to the different computing node.
14. The method of claim 13, wherein the instruction is received before the job request is initiated.
15. The method of claim 13, further comprising:
sending, by the processor, a resource update to the shared memory via a memory fabric after the job request is reassigned to the different computing node.
PCT/US2015/041899 2015-07-24 2015-07-24 Scheduling jobs in a computing cluster WO2017018978A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2015/041899 WO2017018978A1 (en) 2015-07-24 2015-07-24 Scheduling jobs in a computing cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2015/041899 WO2017018978A1 (en) 2015-07-24 2015-07-24 Scheduling jobs in a computing cluster

Publications (1)

Publication Number Publication Date
WO2017018978A1 true WO2017018978A1 (en) 2017-02-02

Family

ID=57884877

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/041899 WO2017018978A1 (en) 2015-07-24 2015-07-24 Scheduling jobs in a computing cluster

Country Status (1)

Country Link
WO (1) WO2017018978A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110764910A (en) * 2019-10-23 2020-02-07 中国银行股份有限公司 Batch job scheduling processing method and device
CN111949407A (en) * 2020-08-13 2020-11-17 北京字节跳动网络技术有限公司 Resource allocation method and device
WO2024087663A1 (en) * 2022-10-28 2024-05-02 华为技术有限公司 Job scheduling method and apparatus, and chip

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070240161A1 (en) * 2006-04-10 2007-10-11 General Electric Company System and method for dynamic allocation of resources in a computing grid
US20090288095A1 (en) * 2008-05-15 2009-11-19 International Business Machines Corporation Method and System for Optimizing a Job Scheduler in an Operating System
US20120060171A1 (en) * 2010-09-02 2012-03-08 International Business Machines Corporation Scheduling a Parallel Job in a System of Virtual Containers
US8205208B2 (en) * 2007-07-24 2012-06-19 Internaitonal Business Machines Corporation Scheduling grid jobs using dynamic grid scheduling policy
WO2013106256A1 (en) * 2012-01-09 2013-07-18 Microsoft Corporation Decoupling paas resources, jobs, and scheduling

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070240161A1 (en) * 2006-04-10 2007-10-11 General Electric Company System and method for dynamic allocation of resources in a computing grid
US8205208B2 (en) * 2007-07-24 2012-06-19 Internaitonal Business Machines Corporation Scheduling grid jobs using dynamic grid scheduling policy
US20090288095A1 (en) * 2008-05-15 2009-11-19 International Business Machines Corporation Method and System for Optimizing a Job Scheduler in an Operating System
US20120060171A1 (en) * 2010-09-02 2012-03-08 International Business Machines Corporation Scheduling a Parallel Job in a System of Virtual Containers
WO2013106256A1 (en) * 2012-01-09 2013-07-18 Microsoft Corporation Decoupling paas resources, jobs, and scheduling

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110764910A (en) * 2019-10-23 2020-02-07 中国银行股份有限公司 Batch job scheduling processing method and device
CN111949407A (en) * 2020-08-13 2020-11-17 北京字节跳动网络技术有限公司 Resource allocation method and device
CN111949407B (en) * 2020-08-13 2024-04-12 抖音视界有限公司 Resource allocation method and device
WO2024087663A1 (en) * 2022-10-28 2024-05-02 华为技术有限公司 Job scheduling method and apparatus, and chip

Similar Documents

Publication Publication Date Title
US11593404B2 (en) Multi-cluster warehouse
US20200364608A1 (en) Communicating in a federated learning environment
He et al. Matchmaking: A new mapreduce scheduling technique
US8689226B2 (en) Assigning resources to processing stages of a processing subsystem
US9442760B2 (en) Job scheduling using expected server performance information
US20200174844A1 (en) System and method for resource partitioning in distributed computing
CN110383764B (en) System and method for processing events using historical data in a serverless system
CN109564528B (en) System and method for computing resource allocation in distributed computing
JP6519111B2 (en) Data processing control method, data processing control program and data processing control device
US9507633B2 (en) Scheduling method and system
Jonathan et al. Awan: Locality-aware resource manager for geo-distributed data-intensive applications
WO2017018978A1 (en) Scheduling jobs in a computing cluster
CN113760549B (en) Pod deployment method and device
US20150365474A1 (en) Computer-readable recording medium, task assignment method, and task assignment apparatus
Liu et al. Deadline guaranteed service for multi-tenant cloud storage
US9990240B2 (en) Event handling in a cloud data center
US20220124151A1 (en) Task allocation among devices in a distributed data storage system
Zeng et al. Workload-aware resource reservation for multi-tenant nosql
Ghazali et al. CLQLMRS: improving cache locality in MapReduce job scheduling using Q-learning
Nzanywayingoma et al. Task scheduling and virtual resource optimising in Hadoop YARN-based cloud computing environment
Ullah et al. Task Priority‐Based Cached‐Data Prefetching and Eviction Mechanisms for Performance Optimization of Edge Computing Clusters
RU2679546C2 (en) Device and method for running multiple stream
Reda et al. BRB: betteR batch scheduling to reduce tail latencies in cloud data stores
Swanson Matchmaking: A new mapreduce scheduling
CN104063327B (en) A kind of storage method and read-write storage device applied to wireless telecommunications

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15899780

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15899780

Country of ref document: EP

Kind code of ref document: A1