WO2016122596A1 - Checkpoint-based scheduling in cluster - Google Patents

Checkpoint-based scheduling in cluster

Info

Publication number
WO2016122596A1
WO2016122596A1 (PCT/US2015/013768)
Authority
WO
WIPO (PCT)
Prior art keywords
checkpoint
job
candidate victim
jobs
overhead
Prior art date
Application number
PCT/US2015/013768
Other languages
French (fr)
Inventor
Yuan Chen
Jack Yanxin Li
Vanish Talwar
Original Assignee
Hewlett Packard Enterprise Development Lp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development Lp filed Critical Hewlett Packard Enterprise Development Lp
Priority to PCT/US2015/013768 priority Critical patent/WO2016122596A1/en
Publication of WO2016122596A1 publication Critical patent/WO2016122596A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • G06F9/5088Techniques for rebalancing the load in a distributed system involving task migration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/501Performance criteria
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the checkpoint-based scheduling system 150 may select a job from the currently running jobs based on multiple factors, including checkpoint overhead.
  • a selected job is preempted, e.g., suspended, and checkpointed.
  • the selected job is stopped, and a checkpoint is created for the job.
  • the entire name space for the job is collected and dumped to data storage 130, which stores the checkpoints for the computing nodes 110a-n.
  • the checkpoint-based scheduling system may calculate local and remote overhead to determine whether to restore the job to the computing node originally running the job before preemption or to restore the job to a different computing node in the cluster.
  • Restoring a job may include copying a checkpoint for the job to a computing node to run the job.
  • Policies may be used to determine when to preempt and restore jobs.
  • One example of a policy is prioritization of jobs, whereby lower priority jobs may be preempted if resources are needed to run a higher priority job.
  • the system 100 may include a network 120 connecting the components of the system 100, including the checkpoint-based scheduling system 150, computing nodes 110a-n and data storage 130.
  • the data storage 130 stores checkpoints.
  • the data storage 130 may include nonvolatile data storage, such as disks, nonvolatile memory, or other forms of nonvolatile data storage.
  • the data storage 130 may be a distributed data storage including multiple storage devices.
  • the data storage 130 includes a file system, and checkpoints are saved as files in the file system.
  • checkpoint data is copied from memory of the computing node in the cluster to nonvolatile memory of the data storage 130 using a memory operation.
  • the nonvolatile memory of the data storage 130, for example, is byte-addressable and allows for OS paging and processor caching to improve latency. Shadow buffering can be used to copy variables between the computing node memory and the nonvolatile memory of the data storage 130.
  • the checkpoint-based scheduling system may work with a job scheduler, such as shown in figure 2, to generate checkpoints and restore jobs from checkpoints.
  • Figure 2 shows a scheduling system 200 and the computing nodes 110a-n.
  • the computing nodes 110a-n in the cluster may each include hardware, such as CPUs 210 and memory 211 shown in computing node 110a.
  • the computing nodes 110a-n may run an OS to run jobs, including applications, such as shown for the computing node 110a, which may have OS 221 and runs job 220.
  • the scheduling system 200 may include checkpoint-based scheduling system 250 and job scheduler 201.
  • the job scheduler 201 schedules jobs and assigns resources from the computing nodes 110a-n to run the jobs based on policies. For example, a new job arrives.
  • the job scheduler 201 determines there are insufficient resources in the cluster to run the new job and the existing jobs, and applies a policy (e.g., capacity sharing or priority scheduling) to select candidate victim jobs that may be preempted.
  • the job scheduler 201 selects a job to preempt from the candidate victim jobs that are currently running based on an adaptive policy that decides to either kill the job or suspend the job by checkpointing its state on the data storage 130.
  • the job selected to preempt may be added to a preemption queue 215 which stores a list of jobs that are waiting to be preempted.
  • the job scheduler 201 may instruct the checkpoint-based scheduling system 250 to create checkpoints and restore jobs from checkpoints. For example, the job scheduler 201 selects a job to preempt and notifies the checkpoint-based scheduling system 250.
  • the checkpoint-based scheduling system 250 sends a preempt instruction to the computing node running the job to suspend the job and create a checkpoint.
  • job 220 is running on the computing node 110a. The job 220 is selected for preemption.
  • a preemption instruction is sent to the computing node 110a, and the computing node 110a stops the job 220 and creates a checkpoint 230 in the data storage 130 for the job 220. Then, the new job may be executed on the computing node 110a. If the job scheduler 201 decides, based on the adaptive policy, to kill the job 220 without checkpointing, a kill instruction is sent to the computing node 110a to kill the job 220.
  • the OS 221 or an OS library may execute the instructions for checkpoint creation, job suspension, job killing and restoration.
  • the checkpoint-based scheduling system 250 is notified of a job to restore and may instruct a computing node to restore the job from a checkpoint. For example, a restore instruction is sent to the computing node 110a to restore the job 220 when resources become available.
  • a restoration queue 216 may store a list of jobs waiting to be restored and the job 220 may be added to the restoration queue 216. Optimizations for restoring to a local computing node or remote computing node may be performed by the job scheduler 201 and the checkpoint-based scheduling system 250.
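The interplay of the preemption queue 215 and the restoration queue 216 described above might be sketched as follows. This is an illustrative toy model (class and method names are assumptions, not the patent's implementation):

```python
from collections import deque

class SchedulerQueues:
    """Toy model of preemption queue 215 and restoration queue 216:
    victim jobs wait to be preempted; checkpointed jobs wait to be
    restored when resources free up again."""

    def __init__(self):
        self.preemption = deque()
        self.restoration = deque()

    def preempt(self, job, checkpointable):
        # The selected victim job waits in the preemption queue, then the
        # node running it is told to either checkpoint or kill it.
        self.preemption.append(job)
        victim = self.preemption.popleft()
        if checkpointable:
            self.restoration.append(victim)  # suspended; can resume later
            return "checkpointed"
        return "killed"

    def restore_next(self):
        # When resources become available, restore the oldest waiting job.
        return self.restoration.popleft() if self.restoration else None
```

For example, a checkpointed job 220 would land in the restoration queue and be handed back to a computing node when capacity frees up, whereas a killed job would have to be resubmitted from scratch.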
  • Figure 3 shows a computer system 300, which may host a checkpoint-based scheduling system 350 and a job scheduler 351.
  • the checkpoint-based scheduling system 350 may perform the functions of the checkpoint-based scheduling system shown in figure 1 and/or figure 2, and the job scheduler 351 may perform the functions of the job scheduler shown in figure 2.
  • the computer system 300 includes a processor 301, an input/output (I/O) interface 302, and a data storage 306.
  • the processor 301 may include a microprocessor operable to execute machine readable instructions to perform programmed operations.
  • the data storage 306 may include volatile and/or non-volatile data storage, such as random access memory, memristors, flash memory, and the like.
  • the data storage 306 may store any information used by the checkpoint-based scheduling system 350 and the job scheduler 351.
  • Machine readable instructions may be stored in the data storage 306.
  • the checkpoint-based scheduling system 350 and the job scheduler 351 may comprise machine readable instructions stored in the data storage 306 and executed by the processor 301.
  • the input/output interface 302 may include a network interface or another interface to connect the computer system 300 to a network, such as the network 120 shown in figures 1 and 2.
  • Figure 4 illustrates a method 400 for checkpoint generation and job scheduling.
  • the method 400 may be performed by the checkpoint-based scheduling system 150 shown in figure 1 or jointly by the checkpoint-based scheduling system 250 and job scheduler 201 shown in figure 2.
  • candidate victim jobs currently executing in a shared cluster are determined. For example, a list of candidate victim jobs is determined from a policy, such as one based on job priority.
  • checkpoint overhead for each of the candidate victim jobs is determined.
  • Checkpoint overhead may be measured as an amount of time to checkpoint or restore.
  • Checkpoint overhead may be calculated based on an amount of memory used by the candidate victim job at a checkpoint time, an amount of time to store data for the candidate victim job from the memory of the computing node running the job to the data storage 130, and an amount of time to restore data from the data storage 130 to a memory of a computing node in the shared cluster.
  • the checkpoint overhead may also be based on checkpoint wait queue time if the job is queued, for example, in the preemption queue 215 shown in figure 2 and is waiting to be preempted.
  • For example, the checkpoint overhead may be calculated as follows: checkpoint_overhead = memory_size/write_bandwidth + memory_size/read_bandwidth + checkpoint_queue_wait_time,
  • wherein checkpoint_overhead is the checkpoint overhead for the candidate victim job,
  • memory_size is the amount of memory used by the candidate victim job at the checkpoint time,
  • write_bandwidth is bandwidth to write the data for the candidate victim job from the memory of the computing node to the data storage,
  • read_bandwidth is bandwidth to read the data for the candidate victim job from the data storage to restore the data to a memory of a computing node in the shared cluster, and
  • checkpoint_queue_wait_time is an amount of time the candidate victim job has to wait in the queue.
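The overhead estimate built from these variables can be sketched as a short function. The function name and units (bytes, bytes per second, seconds) are illustrative, not taken from the patent:

```python
def checkpoint_overhead(memory_size, write_bandwidth, read_bandwidth,
                        checkpoint_queue_wait_time=0.0):
    """Estimate checkpoint overhead in seconds for a candidate victim job.

    memory_size is in bytes and the bandwidths in bytes/second. The
    overhead is the time to dump the job's memory to the data storage,
    plus the time to read it back on restore, plus any time spent
    waiting in the preemption queue.
    """
    write_time = memory_size / write_bandwidth
    read_time = memory_size / read_bandwidth
    return write_time + read_time + checkpoint_queue_wait_time
```

For instance, an 8 GiB job dumped at 1 GiB/s and restored at 2 GiB/s with no queueing would have an estimated overhead of 8 + 4 = 12 seconds.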
  • An OS, such as OS 221 shown in figure 2, the checkpoint-based scheduling system, and/or the job scheduler may determine the metrics described above for determining checkpoint overhead.
  • a candidate victim job is selected from the candidate victim jobs to checkpoint and suspend based on the determined checkpoint overheads.
  • the candidate victim jobs may be ordered according to increasing checkpoint overheads.
  • the candidate victim job with the lowest overhead may be selected or multiple ones of the candidate victim jobs may be selected if, for example, multiple jobs need to be stopped to clear resources for a higher priority job.
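This selection step might be sketched as follows; the tuple layout, resource units, and function name are assumptions for illustration only:

```python
def select_victims(candidates, resources_needed):
    """Order candidate victim jobs by increasing checkpoint overhead and
    preempt the cheapest ones until enough resources are freed for the
    higher priority job.

    Each candidate is a (job_id, overhead, resources) tuple, where
    resources is the amount the job would release when preempted.
    """
    victims, freed = [], 0
    for job_id, overhead, resources in sorted(candidates, key=lambda c: c[1]):
        if freed >= resources_needed:
            break
        victims.append(job_id)
        freed += resources
    return victims
```

With this sketch, a single cheap victim is picked when it frees enough resources, and additional victims are added in overhead order otherwise.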
  • an adaptive policy, described in further detail below, is applied to determine whether to kill a job without checkpointing or to checkpoint and suspend the job.
  • the checkpoint-based scheduling system facilitates generating a job-agnostic checkpoint for the selected candidate victim job.
  • the checkpoint-based scheduling system 250 sends an instruction to the computing node 110a shown in figure 2 to checkpoint the job 220.
  • the computing node 110a suspends the job 220 and the checkpoint 230, including the entire name space for the job 220, is collected and stored in the data storage 130.
  • Figure 5 shows a method 500 for generating a checkpoint according to an adaptive policy.
  • the method 500 may be performed by the checkpoint-based scheduling system 150 shown in figure 1 or jointly by the checkpoint-based scheduling system 250 and job scheduler 201 shown in figure 2.
  • a candidate victim job is selected to preempt.
  • the checkpoint overhead is calculated for the selected candidate victim job.
  • a determination is made as to whether to kill the selected candidate victim job without checkpointing or to suspend and checkpoint the selected candidate victim job. For example, progress of the selected candidate victim job is determined. Progress, for example, is the amount of time the selected candidate victim job has been running. If the selected candidate victim job was previously checkpointed, then progress may include the amount of time since the last checkpoint. If the checkpoint overhead is less than the progress, then the selected candidate victim job is checkpointed, i.e., a checkpoint is generated for the job at 504. The checkpoint may be an incremental checkpoint if the job was previously checkpointed. If the checkpoint overhead is greater than or equal to the progress, then the selected candidate victim job is killed without checkpointing at 505.
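The checkpoint-or-kill decision of method 500 reduces to a single comparison, sketched below (the function name is illustrative, and both arguments are in the same time unit, e.g. seconds):

```python
def preemption_action(progress, overhead):
    """Adaptive policy from method 500: checkpoint the victim job when
    its progress (time running, or time since its last checkpoint)
    exceeds the estimated checkpoint/restore overhead; otherwise, the
    work lost by killing is cheaper than checkpointing, so kill it."""
    return "checkpoint" if overhead < progress else "kill"
```

The intuition is that checkpointing only pays off when the job has accumulated more work than the checkpoint and restore would cost to preserve it.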
  • the steps of the method 500 may be performed at steps 403 and 404 of the method 400 to determine whether to checkpoint or kill a job.
  • Figure 6 shows a method 600 for restoration of jobs from checkpoints.
  • the method 600 may be performed by the checkpoint-based scheduling system 150 shown in figure 1 or jointly by the checkpoint-based scheduling system 250 and job scheduler 201 shown in figure 2.
  • a job that was previously checkpointed is selected.
  • local overhead is calculated for the selected job. Local overhead is the amount of time to restart the job from a checkpoint on the computing node that was previously running the job. For example, the local overhead may be calculated as follows: checkpoint_size/read_bandwidth + local_resumption_queue_wait_time,
  • wherein checkpoint_size is an amount of data in the checkpoint for the selected job,
  • read_bandwidth is bandwidth to read the data for the checkpoint from a data storage to restore the data to a memory of the local computing node, and
  • local_resumption_queue_wait_time is an amount of time the selected job has to wait in a queue before the selected job is restored on the local computing node.
  • remote overhead is calculated for the job.
  • Remote overhead is the amount of time to restart the job from a checkpoint on a computing node in the shared cluster that is different from the computing node previously running the job when it was stopped.
  • the remote overhead may be calculated as follows:
  • checkpoint_size/network_bandwidth + checkpoint_size/read_bandwidth + remote_node_resumption_queue_wait_time, wherein network_bandwidth is bandwidth to transmit the data for the checkpoint to the remote computing node, and remote_node_resumption_queue_wait_time is an amount of time the selected job has to wait in a queue before the selected job is restored on the remote computing node.
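The local-versus-remote restore comparison can be sketched as below. The network-transfer term for the remote case is reconstructed from the variables the text defines (the formula as printed is garbled), and parameter names are illustrative:

```python
def restore_overheads(checkpoint_size, read_bandwidth, network_bandwidth,
                      local_queue_wait, remote_queue_wait):
    """Compare restore costs in seconds. A local restore only reads the
    checkpoint back from data storage; a remote restore additionally
    transfers the checkpoint over the network to a different computing
    node. Returns the cheaper option and its estimated overhead."""
    local = checkpoint_size / read_bandwidth + local_queue_wait
    remote = (checkpoint_size / network_bandwidth
              + checkpoint_size / read_bandwidth
              + remote_queue_wait)
    return ("local", local) if local <= remote else ("remote", remote)
```

For example, a busy local node with a long resumption queue can make a remote restore cheaper despite the extra network transfer, which is the point of the remote-restore optimization.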

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Retry When Errors Occur (AREA)

Abstract

Jobs are executed in a shared cluster of computing nodes. A checkpoint-based scheduling system determines checkpoint overhead for the jobs. A job is selected based on the checkpoint overheads. Generation of a checkpoint for the selected job is facilitated by the checkpoint-based scheduling system.

Description

CHECKPOINT-BASED SCHEDULING IN CLUSTER
BACKGROUND
[0001] Clusters typically include distributed environments including computing nodes that may be connected via a network. Clusters are often used for parallel processing. Big data applications often use clusters for parallel processing, such as for query execution, streaming, batch processing, etc.
[0002] In the past, computing nodes in a cluster were dedicated for a particular user or job. Clusters are becoming dynamic. A shift is taking place where these applications are now deployed on shared clusters, whereby the resources are shared among multiple users and frameworks. Resources, such as central processing units (CPUs), memory, bandwidth, etc., are assigned as needed to processing tasks. Schedulers are often used in shared clusters to assign resources to users and jobs.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, and in which:
[0004] Figure 1 shows a checkpoint-based scheduling system including a cluster of computing nodes, according to an example of the present disclosure;
[0005] Figure 2 shows additional components that may be in the system of figure 1, according to an example of the present disclosure;
[0006] Figure 3 shows a checkpoint-based scheduling system hosted on a computer system, according to an example of the present disclosure;
[0007] Figures 4 and 5 show methods for checkpoint generation, according to examples of the present disclosure; and
[0008] Figure 6 shows a method of restoring a job, according to an example of the present disclosure.
DETAILED DESCRIPTION
[0009] For simplicity and illustrative purposes, the present disclosure is described by referring mainly to an example thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure. In the present disclosure, the term "includes" means includes but not limited thereto, the term "including" means including but not limited thereto. The term "based on" means based at least in part on. In addition, the terms "a" and "an" are intended to denote at least one of a particular element.
[0010] A cluster is a distributed processing environment that includes hardware computing nodes that can run jobs. The computing nodes may be connected via a network. The cluster may be a shared cluster whereby the resources of the computing nodes are shared among jobs according to policies. For example, a portion or all of the resources of each of the computing nodes are assigned to a job as needed by a scheduler. The scheduler may be connected to the cluster to assign the resources of the computing nodes to jobs. The resources of the computing nodes, for example, may include CPUs, memory, input/output (I/O) bandwidth, etc., and a certain number of CPUs, memory size, bandwidth, etc., may be assigned to a job by the scheduler.
[0011] Some jobs may be higher priority than other jobs. If sufficient resources are unavailable to run multiple jobs, then the higher priority jobs may be given precedence. In these instances, the lower priority jobs may be delayed or killed to make resources from the computing nodes available to execute the higher priority jobs, and the lower priority jobs that were killed are restarted later when sufficient resources are available. According to an example of the present application, job-transparent checkpoints and restore mechanisms are implemented to suspend execution of a currently-running job and store its state for resumption at a later time when resources are available. The checkpoints are used in a checkpoint-based preemption method that minimizes a preemption penalty and provides enhanced quality of service (QoS) and resource efficiency. Based on testing on a HADOOP™ YARN platform, the checkpoint-based preemption method achieved up to 35% improvement in performance, 38% reduction in energy consumption and 75% reduction in resource wastage over a current HADOOP™ YARN scheduler.
[0012] A job for the shared cluster is a process or processes running on a computing node or multiple computing nodes in the shared cluster. The job may be a software application or other types of processing tasks. A computing node in the shared cluster may include hardware, such as CPUs, memory, I/O interface, etc. The computing node may include software, such as an operating system (OS). A job may run under control of the OS in a computing node.
[0013] A checkpoint-based scheduling system facilitates the generation of checkpoints for preempted jobs in the cluster. A checkpoint is a state of a job, which includes information regarding a current execution point of the job, such as process information, CPU state, memory content, etc. The current execution point may be the time the job is preempted, such as when the checkpoint is generated, and execution of the job is stopped. The checkpoints generated by the checkpoint-based scheduling system, for example, are job-transparent. Job-transparent means that details about the job do not need to be known to generate the checkpoint; the checkpointing is application-agnostic and can be applied to any application or program. For example, the details that may not be known may include where the job stores intermediate results, when the job executes certain processes, what data is used to execute the processes, and so on. The job-transparent checkpoint may include a copy of the name space and all data stored in memory that is currently used by the job. The name space may include kernel objects, process tree (e.g., ptrace), system calls, function calls, network connections and socket values, process identifiers, CPU register sets, memory content, etc. The memory content may include all data in memory allocated to the job. To create a checkpoint for a job, the job is stopped, and the entire name space and memory content for the job is collected and dumped to the data storage. The same procedure for generating a checkpoint may be performed for all types of jobs and applications, and thus the checkpoint generation is job-transparent. Incremental checkpoints may be generated if multiple checkpoints are generated for a job, such as if a job is stopped and started multiple times. An incremental checkpoint may include only data and information that are new since the previously-stored checkpoint, which saves processing time and storage space.
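As an illustration of the incremental-checkpoint idea, the sketch below dumps only the memory pages that changed since the previous checkpoint. The page-map data structure and function name are assumptions for illustration, not the patent's mechanism:

```python
def incremental_checkpoint(memory_pages, previous_checkpoint):
    """Return (delta, new_checkpoint) for an incremental checkpoint.

    memory_pages maps a page address to its current contents; the delta
    contains only pages that are new or changed relative to the previous
    checkpoint, which is what gets written to data storage."""
    delta = {addr: data for addr, data in memory_pages.items()
             if previous_checkpoint.get(addr) != data}
    # The logical full checkpoint is the previous one overlaid with the delta.
    new_checkpoint = dict(previous_checkpoint)
    new_checkpoint.update(delta)
    return delta, new_checkpoint
```

Only the delta is persisted, which is why incremental checkpoints save both processing time and storage space relative to dumping the full memory image each time.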
[0014] In addition to generating job-transparent checkpoints in a shared-cluster environment, the checkpoint-based scheduling system uses checkpoints to suspend jobs, to release resources when there is resource contention, and to restore the jobs when sufficient resources are available. The checkpoint-based scheduling system runs functions to choose proper victim jobs and to execute preemption mechanisms (checkpoint or kill) based on resource usage, data locality, and checkpoint/restore overhead. These optimizations, which are further described below, can reduce processing time for checkpoint generation and for restoring jobs.
[0015] For example, a job scheduler implementing the checkpoint-based scheduling system dynamically selects victim jobs and preemption mechanisms (checkpoint or kill) based on the progress of each task and its checkpoint/restore overhead. For example, the time to checkpoint and restore a task is estimated according to the checkpoint size and input/output (I/O) bandwidth, and then compared with the current progress of the task. If the progress exceeds the overhead of checkpoint/restore, the task is checkpointed. Otherwise, the task is killed. The task may be an application. Another optimization is remote restore. For example, when a resource becomes available, a preempted job is restored. A distributed file system may be used to store checkpoints, and hence a preempted job can be scheduled on a local or a remote node. Local or remote resumption is determined according to their respective overheads, which are calculated based on the checkpoint size, available network and I/O bandwidth, etc. For another optimization, instead of dumping the entire memory region, memory usage is tracked and only those memory regions that have changed since the last checkpoint are saved, to reduce checkpoint size and latency.
[0016] Also, nonvolatile memory, such as memristors, spin torque transfer storage, and phase-change memory (PCM), may be used for the data storage to store checkpoints. The nonvolatile memory provides fast, large, byte-addressable memory as a storage system that improves processing time for checkpoint generation and restoring jobs. In one example, the nonvolatile memory is organized in a file system, and the checkpoints are saved as files in the local nonvolatile memory. The file system may be a distributed file system, and checkpoints can be saved on local or remote nonvolatile memory as files. The nonvolatile memory may be used as fast disks, and checkpoint images are saved in the nonvolatile-memory-based file systems. In another example, instead of using files to save checkpoints in a file system, checkpoint data is copied from memory of the computing node in the cluster to the nonvolatile memory using a memory operation. Such a method exploits the nonvolatile memory's byte-addressability to avoid serialization and can use OS paging and processor caching to improve latency. A shadow buffering mechanism can be used to copy variables between the computing node memory and the nonvolatile memory of the checkpoint data storage. For example, updates to the computing node memory can be incrementally written to the nonvolatile memory. [0017] Figure 1 illustrates a system 100 according to an example. The system 100 includes a shared cluster environment including computing nodes 110a-n for the cluster. In the shared cluster, the resources of the computing nodes are shared among jobs according to policies. The computing nodes 110a-n run jobs. A checkpoint-based scheduling system 150 communicates with the computing nodes 110a-n to preempt jobs and generate checkpoints for the jobs, and to restore jobs from the checkpoints.
The checkpoint-based scheduling system 150 may select a job from the currently running jobs based on multiple factors, including checkpoint overhead. A selected job is preempted, e.g., suspended, and checkpointed. For example, the selected job is stopped, and a checkpoint is created for the job. For example, the entire name space for the job is collected and dumped to data storage 130, which stores the checkpoints for the computing nodes 110a-n. When a preempted job is to be restored, the checkpoint-based scheduling system may calculate local and remote overhead to determine whether to restore the job to the computing node originally running the job before preemption or to restore the job to a different computing node in the cluster. Restoring a job may include copying a checkpoint for the job to a computing node to run the job. Policies may be used to determine when to preempt and restore jobs. One example of a policy is prioritization of jobs, whereby lower priority jobs may be preempted if resources are needed to run a higher priority job.
[0018] The system 100 may include a network 120 connecting the components of the system 100, including the checkpoint-based scheduling system 150, computing nodes 110a-n and data storage 130. The data storage 130 stores checkpoints. The data storage 130 may include nonvolatile data storage, such as disks, nonvolatile memory, or other forms of nonvolatile data storage. The data storage 130 may be a distributed data storage including multiple storage devices. In one example, the data storage 130 includes a file system, and checkpoints are saved as files in the file system. In another example, instead of using files to save checkpoints in a file system, checkpoint data is copied from memory of the computing node in the cluster to nonvolatile memory of the data storage 130 using a memory operation. The nonvolatile memory of the data storage 130 for example is byte-addressable and allows for OS paging and processor caching to improve latency. Shadow buffering can be used to copy variables between the computing node memory and the nonvolatile memory of the data storage 130.
[0019] According to an example, the checkpoint-based scheduling system may work with a job scheduler, such as shown in figure 2, to generate checkpoints and restore jobs from checkpoints. Figure 2 shows a scheduling system 200 and the computing nodes 110a-n. For example, the computing nodes 110a-n in the cluster may each include hardware, such as CPUs 210 and memory 211 shown in computing node 110a. The computing nodes 110a-n may run an OS to run jobs, including applications, as shown for the computing node 110a, which runs OS 221 and job 220.
[0020] The scheduling system 200 may include checkpoint-based scheduling system 250 and job scheduler 201. The job scheduler 201, for example, schedules jobs and assigns resources from the computing nodes 110a-n to run the jobs based on policies. For example, a new job arrives. The job scheduler 201 determines there are insufficient resources in the cluster to run the new job and the existing jobs, and applies a policy (e.g., capacity sharing or priority scheduling) to select candidate victim jobs that may be preempted. The job scheduler 201 selects a job to preempt from the candidate victim jobs that are currently running based on an adaptive policy that decides to either kill the job or suspend the job by checkpointing its state on the data storage 130. The job selected to preempt may be added to a preemption queue 215, which stores a list of jobs that are waiting to be preempted. The job scheduler 201 may instruct the checkpoint-based scheduling system 250 to create checkpoints and restore jobs from checkpoints. For example, the job scheduler 201 selects a job to preempt and notifies the checkpoint-based scheduling system 250. The checkpoint-based scheduling system 250 sends a preempt instruction to the computing node running the job to suspend the job and create a checkpoint. For example, job 220 is running on the computing node 110a. The job 220 is selected for preemption. A preemption instruction is sent to the computing node 110a, and the computing node 110a stops the job 220 and creates a checkpoint 230 in the data storage 130 for the job 220. Then, the new job may be executed on the computing node 110a. If the job scheduler 201 instead decides, based on the adaptive policy, to kill the job 220 without checkpointing, a kill instruction is sent to the computing node 110a to kill the job 220. The OS 221 or an OS library may execute the instructions for checkpoint creation, job suspension, job killing, and restoration.
[0021] When resources in the computing nodes 110a-n become available, for example, as determined by the job scheduler 201, the checkpoint-based scheduling system 250 is notified of a job to restore and may instruct a computing node to restore the job from a checkpoint. For example, a restore instruction is sent to the computing node 110a to restore the job 220 when resources become available. A restoration queue 216 may store a list of jobs waiting to be restored, and the job 220 may be added to the restoration queue 216. Optimizations for restoring to a local computing node or a remote computing node may be performed by the job scheduler 201 and the checkpoint-based scheduling system 250.
[0022] Figure 3 shows a computer system 300, which may host a checkpoint-based scheduling system 350 and a job scheduler 351. The checkpoint-based scheduling system 350 may perform the functions of the checkpoint-based scheduling system shown in figure 1 and/or figure 2, and the job scheduler 351 may perform the functions of the job scheduler shown in figure 2. The computer system 300 includes a processor 301, an input/output (I/O) interface 302, and a data storage 306. The processor 301 may include a microprocessor operable to execute machine readable instructions to perform programmed operations. The data storage 306 may include volatile and/or non-volatile data storage, such as random access memory, memristors, flash memory, and the like. The data storage 306 may store any information used by the checkpoint-based scheduling system 350 and the job scheduler 351. Machine readable instructions may be stored in the data storage 306. The checkpoint-based scheduling system 350 and the job scheduler 351 may comprise machine readable instructions stored in the data storage 306 and executed by the processor 301. The I/O interface 302 may include a network interface or another interface to connect the computer system 300 to a network, such as the network 120 shown in figures 1 and 2.
[0023] Figure 4 illustrates a method 400 for checkpoint generation and job scheduling. The method 400 may be performed by the checkpoint-based scheduling system 150 shown in figure 1 or jointly by the checkpoint-based scheduling system 250 and job scheduler 201 shown in figure 2. At 401, candidate victim jobs currently executing in a shared cluster are determined. For example, a list of candidate victim jobs is determined from a policy, such as one based on priority of jobs.
[0024] At 402, checkpoint overhead for each of the candidate victim jobs is determined. Checkpoint overhead may be measured as an amount of time to checkpoint or restore. Checkpoint overhead may be calculated based on an amount of memory used by the candidate victim job at a checkpoint time, an amount of time to store data for the candidate victim job from the memory of the computing node running the job to the data storage 130, and an amount of time to restore data from the data storage 130 to a memory of a computing node in the shared cluster. The checkpoint overhead may also be based on checkpoint wait queue time if the job is queued, for example in the preemption queue 215 shown in figure 2, and is waiting to be preempted. In an example, the checkpoint overhead for each of the candidate victim jobs is calculated as follows:

checkpoint_overhead = memory_size/write_bandwidth + memory_size/read_bandwidth + checkpoint_queue_wait_time,

wherein checkpoint_overhead is the checkpoint overhead for the candidate victim job, memory_size is the amount of memory used by the candidate victim job at the checkpoint time, write_bandwidth is bandwidth to write the data for the candidate victim job from the memory of the computing node to the data storage, read_bandwidth is bandwidth to read the data for the candidate victim job from the data storage to restore the data to a memory of a computing node in the shared cluster, and checkpoint_queue_wait_time is an amount of time the candidate victim job has to wait in the queue. An OS, such as OS 221 shown in figure 2, the checkpoint-based scheduling system, and/or the job scheduler may determine the metrics described above for determining checkpoint overhead.
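As a hypothetical illustration only (the patent does not prescribe an implementation, units, or function names), the overhead formula above can be sketched as:

```python
def checkpoint_overhead(memory_size, write_bandwidth, read_bandwidth,
                        checkpoint_queue_wait_time=0.0):
    """Estimate checkpoint overhead (seconds) for a candidate victim job.

    memory_size is in bytes and the bandwidths are in bytes per second;
    the queue wait term applies only if the job first sits in a
    preemption queue. All names here are illustrative.
    """
    write_time = memory_size / write_bandwidth  # dump state to data storage
    read_time = memory_size / read_bandwidth    # restore state later
    return write_time + read_time + checkpoint_queue_wait_time


# E.g., a 100-byte image over 10 B/s write and 20 B/s read bandwidth:
# 10 s to dump + 5 s to restore = 15 s of overhead.
overhead = checkpoint_overhead(100.0, 10.0, 20.0)
```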
[0025] At 403, a candidate victim job is selected from the candidate victim jobs to checkpoint and suspend based on the determined checkpoint overheads. In an example, the candidate victim jobs may be ordered according to increasing checkpoint overheads. The candidate victim job with the lowest overhead may be selected, or multiple ones of the candidate victim jobs may be selected if, for example, multiple jobs need to be stopped to clear resources for a higher priority job. In one example, an adaptive policy, described in further detail below, is applied to determine whether to kill a job without checkpointing or to checkpoint and suspend a job.
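The ordering-and-selection step can be sketched as follows, under the assumption that a per-job overhead estimate is already available (the function and its names are illustrative, not part of the disclosure):

```python
def select_victims(overheads, num_needed=1):
    """Order candidate victim jobs by increasing checkpoint overhead and
    pick the cheapest one(s) to preempt.

    overheads maps a job identifier to its estimated checkpoint overhead
    in seconds; num_needed is how many jobs must be stopped to clear
    enough resources for the higher priority job.
    """
    ranked = sorted(overheads, key=overheads.get)  # lowest overhead first
    return ranked[:num_needed]


# Job 'b' has the lowest estimated overhead, so it is preempted first.
victims = select_victims({"a": 5.0, "b": 2.0, "c": 9.0})
```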
[0026] At 404, the checkpoint-based scheduling system facilitates generating a job-agnostic checkpoint for the selected candidate victim job. For example, the checkpoint-based scheduling system 250 sends an instruction to the computing node 110a shown in figure 2 to checkpoint the job 220. The computing node 110a suspends the job 220, and the checkpoint 230, including the entire name space for the job 220, is collected and stored in the data storage 130. [0027] Figure 5 shows a method 500 for generating a checkpoint according to an adaptive policy. The method 500 may be performed by the checkpoint-based scheduling system 150 shown in figure 1 or jointly by the checkpoint-based scheduling system 250 and job scheduler 201 shown in figure 2. At 501, a candidate victim job is selected to preempt. At 502, the checkpoint overhead is calculated for the selected candidate victim job. At 503, a determination is made as to whether to kill the selected candidate victim job without checkpointing or to suspend and checkpoint the selected candidate victim job. For example, progress of the selected candidate victim job is determined. Progress, for example, is the amount of time the selected candidate victim job has been running. If the selected candidate victim job was previously checkpointed, then progress may include the amount of time since the last checkpoint. If the checkpoint overhead is less than the progress, then the selected candidate victim job is checkpointed, i.e., a checkpoint is generated for the job at 504. The checkpoint may be an incremental checkpoint if the job was previously checkpointed. If the checkpoint overhead is greater than or equal to the progress, then the selected candidate victim job is killed without checkpointing at 505. The steps of the method 500 may be performed at steps 403 and 404 of the method 400 to determine whether to checkpoint or kill a job.
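The kill-or-checkpoint decision of the adaptive policy can be sketched as below (an illustrative sketch; the names and units are assumptions, and both quantities are taken to be in seconds):

```python
def preemption_action(progress, overhead):
    """Adaptive policy of method 500: checkpoint a victim job only when
    the work already done (progress, seconds since start or since the
    last checkpoint) outweighs the cost of saving and restoring it.
    """
    if overhead < progress:
        return "checkpoint"  # suspend and save the job's state
    return "kill"            # cheaper to rerun the job from scratch later


# A job that has run for 10 minutes with 45 s of estimated overhead is
# worth checkpointing; one that has run for only 30 s is not.
action_long = preemption_action(600.0, 45.0)
action_short = preemption_action(30.0, 45.0)
```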
[0028] Figure 6 shows a method 600 for restoration of jobs from checkpoints. The method 600 may be performed by the checkpoint-based scheduling system 150 shown in figure 1 or jointly by the checkpoint-based scheduling system 250 and job scheduler 201 shown in figure 2. At 601, a job that was previously checkpointed is selected. At 602, local overhead is calculated for the job. Local overhead is the amount of time to restart the job from a checkpoint on the computing node previously running the job when it was stopped. The local overhead may be calculated as follows:

local_overhead = checkpoint_size/read_bandwidth + local_resumption_queue_wait_time,

wherein checkpoint_size is an amount of data in the checkpoint for the selected job, read_bandwidth is bandwidth to read the data for the checkpoint from a data storage to restore the data to a memory of the local computing node, and local_resumption_queue_wait_time is an amount of time the selected job has to wait in a queue before the selected job is restored on the local computing node.
[0029] At 603, remote overhead is calculated for the job. Remote overhead is the amount of time to restart the job from a checkpoint on a computing node in the shared cluster that is different from the computing node previously running the job when it was stopped. The remote overhead may be calculated as follows:

remote_overhead = checkpoint_size/network_bandwidth + checkpoint_size/read_bandwidth + remote_node_resumption_queue_wait_time,

wherein network_bandwidth is bandwidth to transmit the data for the checkpoint to the remote computing node, and remote_node_resumption_queue_wait_time is an amount of time the selected job has to wait in a queue before the selected job is restored on the remote computing node.
[0030] At 604, a determination is made as to whether the local overhead is less than the remote overhead. If yes, then at 605, the job is restored to the local computing node from the checkpoint. If no, then at 606, the job is restored to the remote computing node from the checkpoint.
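The local-versus-remote restore decision of method 600 may be sketched as follows (illustrative only; the patent does not specify an implementation, and the parameter names are assumptions — sizes in bytes, bandwidths in bytes per second, waits in seconds):

```python
def restore_target(checkpoint_size, read_bandwidth, network_bandwidth,
                   local_wait, remote_wait):
    """Decide where to restore a preempted job, per method 600.

    Local restore only pays the storage read plus any local resumption
    queue wait; remote restore additionally pays the network transfer
    of the checkpoint to the remote node.
    """
    local_overhead = checkpoint_size / read_bandwidth + local_wait
    remote_overhead = (checkpoint_size / network_bandwidth
                       + checkpoint_size / read_bandwidth
                       + remote_wait)
    return "local" if local_overhead < remote_overhead else "remote"


# With no queueing, the local node wins (no network transfer needed);
# a 5 s local resumption queue tips the decision to the remote node.
target_idle = restore_target(100.0, 10.0, 50.0, 0.0, 0.0)
target_busy = restore_target(100.0, 10.0, 50.0, 5.0, 0.0)
```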
[0031] What has been described and illustrated herein are examples of the disclosure along with some variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the scope of the disclosure, which is intended to be defined by the following claims, and their equivalents, in which all terms are meant in their broadest reasonable sense unless otherwise indicated.

Claims

What is claimed is:
1. A checkpoint-based scheduling system for a shared cluster of computing nodes, the checkpoint-based scheduling system comprising:
a processor to:
determine candidate victim jobs currently executing in the shared cluster;
determine checkpoint overhead for each of the candidate victim jobs;
select a candidate victim job from the candidate victim jobs to checkpoint and suspend based on the determined checkpoint overheads; and
send an instruction to a computing node in the shared cluster running the selected candidate victim job to generate a job-transparent checkpoint for the selected candidate victim job and suspend the selected candidate victim job.
2. The checkpoint-based scheduling system of claim 1, wherein the checkpoint overhead for each of the candidate victim jobs is calculated based on an amount of memory used by the candidate victim job at a checkpoint time, an amount of time to store data for the candidate victim job from the memory to a data storage, and an amount of time to restore data from the data storage to a memory of a computing node in the shared cluster.
3. The checkpoint-based scheduling system of claim 2, wherein the checkpoint overhead for each of the candidate victim jobs is further calculated based on checkpoint wait queue time, wherein the checkpoint wait queue time is amount of time the candidate victim job has to wait in a queue before a checkpoint is created for the candidate victim job.
4. The checkpoint-based scheduling system of claim 3, wherein the checkpoint overhead for each of the candidate victim jobs is a function of memory_size, write_bandwidth, read_bandwidth, and checkpoint_queue_wait_time,
wherein checkpoint_overhead is the checkpoint overhead for the candidate victim job, memory_size is the amount of memory used by the candidate victim job at the checkpoint time, write_bandwidth is bandwidth to write the data for the candidate victim job from the memory to the data storage, read_bandwidth is bandwidth to read the data for the candidate victim job from the data storage to restore the data to a memory of a computing node in the shared cluster, and checkpoint_queue_wait_time is an amount of time the candidate victim job has to wait in the queue.
5. The checkpoint-based scheduling system of claim 1, wherein the processor is to:
for each candidate victim job:
determine a progress for the candidate victim job;
if the checkpoint overhead is less than the progress for the candidate victim job, then send instructions to the computing node executing the candidate victim job to generate a checkpoint for the candidate victim job; and
if the checkpoint overhead is greater than or equal to the progress for the candidate victim job, then kill the candidate victim job without generating a checkpoint for the candidate victim job.
6. The checkpoint-based scheduling system of claim 5, wherein the generated checkpoint for the candidate victim job is an incremental checkpoint that includes new data since a last checkpoint was generated for the candidate victim job instead of all the data for the candidate victim job since the initial execution of the candidate victim job.
7. The checkpoint-based scheduling system of claim 1 , wherein the processor adds the selected candidate victim job to a checkpoint queue, and the checkpoint queue includes jobs waiting to be checkpointed.
8. The checkpoint-based scheduling system of claim 1, wherein the processor is to:
select a previously checkpointed job to resume;
determine a local overhead to resume the selected job on a local computing node of the shared cluster, wherein the local computing node is where the selected job was running prior to being preempted;
determine a remote overhead to resume the selected job on a remote computing node of the shared cluster different than the local computing node;
if the local overhead is less than the remote overhead, instruct the local computing node to resume the selected job; and
if the local overhead is greater than or equal to the remote overhead, instruct the remote computing node to resume the selected job.
9. The checkpoint-based scheduling system of claim 8, wherein the local overhead is a function of checkpoint_size, read_bandwidth, and
local_resumption_queue_wait_time,
wherein checkpoint_size is an amount of data in the checkpoint for the selected job, read_bandwidth is bandwidth to read the data for the checkpoint from a data storage to restore the data to a memory of the local computing node, and local_resumption_queue_wait_time is an amount of time the selected job has to wait in a queue before the selected job is restored on the local computing node.
10. The checkpoint-based scheduling system of claim 9, wherein the remote overhead is a function of the checkpoint_size, network_bandwidth, the
read_bandwidth, and remote_node_resumption_queue_wait_time,
wherein network_bandwidth is bandwidth to transmit the data for the checkpoint to the remote computing node, and
remote_node_resumption_queue_wait_time is an amount of time the selected job has to wait in a queue before the selected job is restored on the remote computing node.
11. The checkpoint-based scheduling system of claim 1, comprising:
a data storage to store application-transparent checkpoints generated from jobs running on the computing nodes, wherein the data storage comprises nonvolatile memory.
12. The checkpoint-based scheduling system of claim 11, wherein the data storage includes a file system, and the checkpoints are stored as files in the nonvolatile memory.
13. The checkpoint-based scheduling system of claim 11, wherein the data storage is byte addressable, and each checkpoint is stored on the non-volatile memory through a memory operation.
14. A shared cluster comprising:
a scheduling system to schedule jobs in the shared cluster;
computing nodes, including memory and central processing units to execute the jobs; and
non-volatile data storage to store checkpoints for preempted jobs, wherein the scheduling system is to:
determine candidate victim jobs currently executing in the shared cluster;
determine checkpoint overhead for each of the candidate victim jobs;
select a candidate victim job from the candidate victim jobs to checkpoint based on the determined checkpoint overheads; and
send an instruction to a computing node in the cluster running the selected candidate victim job to generate a checkpoint for the selected candidate victim job and suspend the job.
15. A method for checkpoint-based scheduling comprising:
determining candidate victim jobs currently executing in a shared cluster of computing nodes;
determining, by a processor, a checkpoint overhead for each of the candidate victim jobs;
selecting a candidate victim job from the candidate victim jobs to checkpoint and suspend based on the determined checkpoint overheads; and
facilitating generating of a job-agnostic checkpoint for the selected candidate victim job.
PCT/US2015/013768 2015-01-30 2015-01-30 Checkpoint-based scheduling in cluster WO2016122596A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2015/013768 WO2016122596A1 (en) 2015-01-30 2015-01-30 Checkpoint-based scheduling in cluster


Publications (1)

Publication Number Publication Date
WO2016122596A1 true WO2016122596A1 (en) 2016-08-04

Family

ID=56544017


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880103A (en) * 2022-07-11 2022-08-09 中电云数智科技有限公司 System and method for adapting flink task to hadoop ecology
CN116627659A (en) * 2023-07-21 2023-08-22 科大讯飞股份有限公司 Model check point file storage method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5640563A (en) * 1992-01-31 1997-06-17 International Business Machines Corporation Multi-media computer operating system and method
US20060184939A1 (en) * 2005-02-15 2006-08-17 International Business Machines Corporation Method for using a priority queue to perform job scheduling on a cluster based on node rank and performance
US20080307258A1 (en) * 2007-06-11 2008-12-11 International Business Machines Corporation Distributed Job Manager Recovery
US20110173289A1 (en) * 2010-01-08 2011-07-14 International Business Machines Corporation Network support for system initiated checkpoints




Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15880489; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15880489; Country of ref document: EP; Kind code of ref document: A1)