CN117120978A - Transparent preemption and migration of planetary-scale computers - Google Patents


Info

Publication number: CN117120978A
Application number: CN202280021860.6A
Authority: CN (China)
Prior art keywords: gpu, dlt, job, state, node
Legal status: Pending
Other languages: Chinese (zh)
Inventors: M·斯瓦塔努, S·维斯瓦纳萨, D·K·舒克拉, N·夸特拉, R·拉姆基, R·V·内赫姆, P·沙玛, B·E·兰甘, V·沙玛
Current Assignee: Microsoft Technology Licensing LLC
Original Assignee: Microsoft Technology Licensing LLC
Application filed by Microsoft Technology Licensing LLC
Priority claimed from: US 17/359,553 (published as US 2022/0308917 A1); PCT/US2022/018592 (published as WO 2022/203828 A1)


Abstract

The disclosure herein describes platform-level checkpointing for Deep Learning (DL) jobs. Checkpointing is performed by capturing two types of state data: (i) GPU state (device state) and (ii) CPU state (host state). The GPU state includes GPU data (e.g., model parameters, optimizer state, etc.) located in the GPU and the GPU context (e.g., the default stream in the GPU, various handles created by libraries such as DNN and BLAS libraries, etc.). Since checkpointing is done in a domain-aware manner, only portions of the GPU memory are copied. The "active" memory contains useful data such as model parameters. To be able to capture the useful data, memory management is controlled so that the portions of memory that are active can be identified. Furthermore, to restore the destination GPU to the same context/state, a mechanism captures such state-change events on the original GPU and replays them on the destination GPU.

Description

Transparent preemption and migration of planetary-scale computers
Background
Artificial Intelligence (AI) innovation depends on a highly scalable, high-performance, robust, and technically efficient AI infrastructure. Current approaches of incrementally extending existing general-purpose infrastructure as a service (IaaS) and cloud-based environments have significant limitations because AI workloads are fundamentally different and require a purpose-built AI infrastructure. Managing the details of the current infrastructure presents a significant challenge to data scientists attempting to accelerate AI algorithm innovation.
An increasingly popular computing trend in the AI field today is Deep Learning (DL). DL has had a significant impact on widely used personal products for voice and image recognition and has great potential to impact businesses. DL jobs represent an important and growing class of computing workloads, particularly in cloud data centers. However, as with most AI models, DL jobs are computationally intensive and therefore rely heavily on powerful but expensive Graphics Processing Units (GPUs). For example, GPU Virtual Machines (VMs) in the cloud are less efficient than conventional VMs. Cloud operators and large companies that manage clusters of tens of thousands of GPUs rely on cluster schedulers to ensure efficient utilization of the GPUs. While efficient scheduling of Deep Learning Training (DLT) jobs is important, common practice today is to use a traditional cluster scheduler, such as Kubernetes or YARN, designed for handling big-data jobs such as MapReduce, which is a programming model and implementation for processing and generating large data sets.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Aspects described herein relate to a computerized method for checkpointing a machine learning job (such as a DLT job) at one node in a cloud computing environment and restoring the DLT job from the checkpointed state on a different node. To this end, the GPU state of the GPU executing the DLT job is captured. The GPU state includes GPU data, including model parameters and optimizer state, that is located in the GPU at the time of checkpointing. In addition, the CPU state of the CPU executing the DLT job is captured. The GPU and CPU states are stored in a shared memory accessible to the proxy node, and the checkpointed state is defined at least in part by the GPU and CPU states in the shared memory. The DLT job may then be migrated to a destination node and restored there from the checkpointed state.
Drawings
The present description will be better understood from a reading of the following detailed description taken in conjunction with the drawings in which:
FIG. 1 is a block diagram illustrating a system configured to provide infrastructure services for Artificial Intelligence (AI) workloads;
FIG. 2 is a block diagram illustrating a runtime plane of the system of FIG. 1;
FIG. 3 is a block diagram illustrating an infrastructure plane of the system of FIG. 1;
FIG. 4 is a flow chart illustrating a method for managing AI workloads in the cloud infrastructure platform;
FIG. 5 is a block diagram illustrating a hierarchical scheduling subsystem configured for scheduling AI workloads;
FIG. 6 is a block diagram illustrating a proxy-based dual process architecture configured to checkpointing various operating parameters of a DLT job so that the DLT job may be migrated from an original node to a separate destination node;
FIG. 7 illustrates a block diagram of a planetary-scale AI infrastructure network environment implementing a migration service for moving DLT jobs from an origin node to a destination node;
FIG. 8 illustrates a flow chart depicting an operational procedure for checkpointing a DLT job at an original node in a cloud computing environment and restoring the DLT job from the checkpointed state at a different destination node;
FIG. 9 illustrates a flow chart depicting an operational procedure for checkpointing a DLT job across a plurality of first nodes and restoring the DLT job from the checkpointed state across a plurality of second nodes different from the first nodes in a cloud computing environment; and
FIG. 10 illustrates an example computing device.
Corresponding reference characters indicate corresponding parts throughout the drawings.
Detailed Description
Various implementations, examples, and embodiments are described in detail with reference to the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. References made throughout this disclosure to particular examples, implementations, and embodiments are provided for illustrative purposes only, and are not meant to limit all examples unless indicated to the contrary.
The present disclosure describes several implementations and examples for transparent and preemptive migration of Deep Learning Training (DLT) jobs and inference from one group of processing resources to another group in the cloud. The disclosed examples and implementations provide checkpointing for a given DLT job using proxy processes (or services) that store host client configurations and reconfigure the server-based configuration before moving the DLT job to a new resource group, or by implementing barriers across multiple processing resources, such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), Application-Specific Integrated Circuits (ASICs), quantum processors, Virtual Machines (VMs), etc.
The disclosed examples provide platform-level, domain-aware, iteration-level, on-demand checkpointing for DLT jobs to transparently preempt DLT jobs, migrate them from one node to another, and then continue processing the DLT jobs on the new node. These implementations and examples provide transparent checkpointing and migration of DLT jobs across large cloud environments, such as a global-scale infrastructure as a service (IaaS). The ability to checkpoint any DLT job and recover the DLT job from the same point on a different node is a key building block that enables multiple important features in an AI-centric IaaS. Such checkpointing, migration, and restoration provide automatic fault tolerance for user jobs when a machine or job fails. This is particularly important because DLT jobs are long-running (they may last hours, days, or weeks).
The disclosed preemption and migration techniques enable efficient utilization of preemptible resources (such as preemptible VMs or spot VMs). Spot VMs are more technically efficient and enable some cloud operators to sell spare capacity. However, long-running DLT jobs typically cannot make progress under preemption unless they resume from the point where they left off after being restored on another machine. Additionally, some implementations and examples enable a scheduler to transparently preempt and move jobs or tasks across devices/machines, perform defragmentation, dynamic load balancing, and live migration to handle upgrades, or to automatically fit jobs to the correct GPU stock keeping units (SKUs) based on runtime analysis.
DLT jobs are machine-learning, big-data jobs that are assigned a set of GPUs at job start-up and retain exclusive access to their GPUs until completion. While the disclosed examples are discussed with respect to DLT jobs and inference, any kind of AI job may be migrated using the disclosed techniques, such as, for example, but not limited to, Deep Neural Network (DNN) or "deep net" jobs. Such a job may be long-running (e.g., hours, days, weeks, or months of processing). Moving a job after it begins processing may forfeit hours, days, or more of processing time. The disclosed examples and implementations provide a way to move a job to other processing resources in the cloud from the point in time at which the job is being processed, thereby eliminating the loss of any significant work.
Conventionally, there are two ways of making a DLT job movable. The developer writing the script for the DLT job writes custom code for checkpointing, which may take the form of a very restrictive library, or the developer writes custom logic for what to do when the job is preempted and how to bring it back to the same state. Either is quite complex for programmers, which is why most DLT jobs today do not handle checkpointing or preemption, and the scheduler cannot rely on it. Typically, only about 5% of DLT jobs have checkpointing enabled, which is too unreliable a basis for the technical performance guarantees that a scheduler is expected to provide.
Both of the above options impose significant constraints on the user writing the model. Not surprisingly, most models today do not perform any checkpointing. In contrast, the disclosed implementations and examples include platform-level support for checkpointing that handles any unmodified user code and transparently provides checkpointing and restoration functionality without the user having to worry about it.
Platform-level approaches for checkpointing are typically domain-independent and therefore brute-force. For example, to checkpoint a CPU application, the checkpointing library grabs the entire address space of the process, checkpoints the entire memory state, all open file descriptors, and so on, and then restores them at the other end. Current checkpointing libraries do not support checkpointing of device state, and in the case of DLT jobs a similarly domain-independent method is computationally expensive, because most jobs use all of the memory on the GPU (16 GB or 32 GB). However, memory usage in DLT jobs is periodic: GPU memory usage increases dramatically as activations accumulate toward the end of a forward pass and then returns to a very low value (typically 30-70 times lower than the peak) once the activations are released at the end of the backward pass. The disclosed implementations and examples employ a domain-specific approach that leverages this behavior to time checkpointing at low-memory points (e.g., at mini-batch boundaries). Since a mini-batch typically completes in a few seconds, the latency of waiting for a low-memory point is very small.
To provide a technically more efficient framework, examples make preemption and migration the default for every DLT job in a large-scale cloud infrastructure. In some implementations and examples, every DLT job in the large-scale cloud infrastructure is inherently preemptible and migratable without requiring a developer to run or write anything special. Thus, the user does not have to do anything special to preempt or migrate the job. This is done by intercepting at a sufficiently low level and checkpointing the progress state of the DLT job in a way that the user program does not know what is happening. In other words, it is transparent to the software layers above, the user code, and the framework libraries (e.g., PyTorch or TensorFlow). Job migration is possible because the disclosed examples are applicable to DLT jobs written in Python, PyTorch, or TensorFlow.
Unlike conventional programs, DLT jobs make heavy use of GPUs, and GPU state is not easily migrated. There are various libraries for checkpointing programs running on the CPU. Aspects of the present disclosure may operate with any functionality that enables checkpointing of the entire CPU address space. These checkpointing libraries are able to checkpoint a process, move it to a new machine, and restart it. They are not suitable for GPUs, however, because GPUs embed proprietary state that the checkpointing library does not understand. Since the GPU driver is proprietary, it is not possible for the checkpointing library to handle the problems that this causes.
The disclosed examples checkpoint a client process and rebuild it in a manner such that the server process is stateless. The server process can then be stopped, in which case the job is migrated to another server node. The server process is recreated when the job is brought up on the other server node. To speed this up, some implementations and examples log the GPU calls so that the same GPU state can be recreated on the new server node. Furthermore, some examples capture the memory of the initial server before the initial server is torn down, so that the same memory can be recreated at the new server node. For example, the server memory may be copied to disk, and then the same pointers may be assigned on the new server. Thus, the disclosed examples allow useful state to be copied from the client, GPU state to be copied from the server, only the useful client state to be checkpointed, and the server process to be recreated. The sequence may then continue at the new server node.
Having generally and specifically described some implementations and examples, attention is directed to the drawings to provide further clarity.
Fig. 1 is a block diagram illustrating a system 100, the system 100 configured to provide infrastructure services for AI workloads, according to an embodiment. The system 100 includes a control plane 102, a runtime plane 104, and an infrastructure plane 106. In some examples, system 100 is a distributed computing infrastructure system that includes hardware devices distributed across many different locations (e.g., a global or planetary scale distributed system). Further, the system 100 is specifically configured to enable execution of the AI workload such that hardware, firmware, and/or software of the system 100 is configured to enable technically efficient execution of tasks associated with the AI workload. Alternatively or additionally, system 100 may include hardware, firmware, and/or software specifically configured to enable execution of other types of workloads without departing from the scope of the present description.
The control plane 102 includes a manageability subsystem 108, a pluggable data plane 110, and a global scheduling subsystem 112. In some examples, the control plane 102 is configured to receive or accept AI workloads and associated data through a variety of extensible or pluggable data planes 110 that may be defined by the tenant of the system (e.g., an alternate data plane under a plug-in scheduler to support Kubernetes or another similar system running in the tenant's private data center). As described herein, these AI workloads are scheduled to execute on the infrastructure (e.g., infrastructure plane 106) of the system 100.
Manageability subsystem 108 includes hardware, firmware, and/or software configured to provide interactive processing of AI workload requests to tenants. In addition, manageability subsystem 108 is configured to provide all infrastructure resources of system 100 in all areas of system operation. In some examples, manageability subsystem 108 includes manageability copies in various areas of system 100 such that infrastructure resources of system 100 are multi-hosted by the various copies as interfaces between tenants and system 100. Manageability subsystem 108 may be decoupled from global scheduling subsystem 112.
The global scheduling subsystem 112 includes hardware, firmware, and/or software configured to schedule AI workloads/jobs for execution on the infrastructure resources of the system 100 as described herein. In some examples, the global scheduling subsystem 112 includes a hierarchy of schedulers: a global scheduler, regional schedulers, and coordinator services. The global scheduler is responsible for preparing schedules corresponding to AI workloads (e.g., jobs, models, and/or Pods) and handing the workloads over to the regional schedulers based on these prepared schedules. The regional scheduler is responsible for managing and reporting regional capacity to the global scheduler and for executing the schedules received from the global scheduler. The coordinator service is responsible for translating the schedules into physical resource allocations across the infrastructure resource clusters within the region. The coordinator service may also constitute, or otherwise be closely related to, the reliability subsystem 122, as described herein. The global scheduling subsystem 112 is described in more detail below.
As described herein, the runtime plane 104 includes subsystems configured to enable AI workloads to be distributed to the infrastructure plane 106 and executed on the infrastructure plane 106. Such subsystems may include a monitoring subsystem 114, a compiling subsystem 116, a communication subsystem 118, and/or a load balancing subsystem 120. Further, the runtime plane 104 includes a reliability subsystem 122, the reliability subsystem 122 being configured to ensure reliability of AI workload execution while enabling such workload to be checkpointed and/or migrated throughout the infrastructure resources of the system 100. The runtime plane 104 also includes an AI accelerator provider model 124, the AI accelerator provider model 124 configured to enable the use of a wide variety of libraries and/or configurations to manage AI accelerators when executing AI workloads. The runtime plane 104 is described in more detail below.
The infrastructure plane 106 includes hardware, firmware, and/or software for executing AI workloads based on the schedule provided by the control plane 102 and the instructions received from the runtime plane 104. The infrastructure plane 106 includes a hosting and activation subsystem 126, infrastructure resources 128, and a device/AI accelerator 130. The infrastructure plane 106 is described in more detail below.
FIG. 2 is a block diagram 200 illustrating a runtime plane 204 of the system 100 of FIG. 1, according to an embodiment. In some examples, the runtime plane 204 is substantially the same as the runtime plane 104 described above with respect to fig. 1. The runtime plane 204 includes a monitoring subsystem 214, a compiling subsystem 216, a communication subsystem 218, a load balancing subsystem 220, a reliability subsystem 222, and an AI accelerator provider model 224.
The reliability subsystem 222 includes routines for interacting with the AI workload to ensure its reliability. In some examples, the routines include failover 232, suspend 234, resume 236, migrate 238, scale 240, checkpointing 242, and restore 244. The checkpointing 242 and restore 244 routines may be configured as core routines, and the other routines (failover 232, suspend 234, resume 236, migrate 238, and scale 240) may be configured to achieve their desired results using the checkpointing 242 and/or restore 244 routines.
The checkpointing 242 routine is configured to save the state of the AI workload as it is executed so that the saved state can be used to continue executing the AI workload from the saved point in time. Checkpointing 242 may be used to execute a suspend 234 routine to suspend execution of the AI workload for a period of time and/or to execute a migrate 238 routine to save the state of the AI workload so that it may be moved to another set of infrastructure resources to continue execution.
The restore 244 routine is configured to take as input the saved state of the AI workload and resume execution of the AI workload on infrastructure resources starting from the point of the saved state. The restore 244 routine may be used to execute the resume 236 routine and/or to resume executing an AI workload that has been migrated to another set of infrastructure resources based on the migrate 238 routine.
The failover 232 routine is configured to checkpoint the status of the AI workload based on detecting a failure of a current infrastructure resource and to recover the AI workload on a new set of infrastructure resources such that the AI workload recovers from the detected failure.
The scale 240 routine is configured to scale up and/or scale down the number, quality, and/or type of infrastructure resources used to perform the AI workload. For example, if additional infrastructure resources are available, the AI workload may be scaled up to take advantage of those additional infrastructure resources. Alternatively, if a new AI workload needs some of the infrastructure resources used to execute the current AI workload, the current AI workload may be scaled down to release some resources for the new AI workload (e.g., the new AI workload may be associated with a higher priority/tier than the current AI workload).
The reliability subsystem 222 also includes a synchronization protocol 246 configured to synchronize or otherwise coordinate the AI workloads to which the above-described routines are applied. For example, if an AI workload is to be migrated, the synchronization protocol 246 is configured to synchronize system operations such that the resources involved in the migration are not altered during the migration process. Such a synchronization protocol 246 may include the use of locks or the formation of a barrier so that processes not associated with the migration do not inadvertently affect it.
The AI accelerator provider model 224 is configured to enable use of various software stacks, including a third party (3P) library 248 (e.g., a library provided by a tenant of the system 100) and/or a first party (1P) library 250 (e.g., a library provided by an entity managing the system 100). For example, 3P library 248 may include a 3P-specific Management Library (ML) 252, a 3P-specific multi-GPU communication library (MGCL) 254, and a 3P-specific GPU library 256. Additionally or alternatively, the 1P library 250 may include a management library 264, a communication library 266, and/or a compiler toolchain 268. The runtime plane 204 enables tenants to perform AI workloads within the described system 100 using a wide variety of software stacks and associated libraries (including the tenant's own software stack) based on the extensible, flexible configuration of the runtime plane 204.
Fig. 3 is a block diagram 300 illustrating an infrastructure plane 306 of the system 100 of fig. 1, according to an embodiment. In some examples, as described above, the infrastructure plane 306 is substantially the same as the infrastructure plane 106 of fig. 1. The infrastructure plane 306 includes a hosting and activation subsystem 326, infrastructure resources 328, and devices and AI accelerators 330.
Hosting and activation subsystem 326 includes host agent 370 and container 372. Host agent 370 enables and organizes AI workload hosting on infrastructure resources 328. The containers 372 (e.g., copy-on-write containers) keep different AI workloads (e.g., workloads from different tenants) separate from each other and secure even though they execute on the same host. The host controlled by host agent 370 may be a device that includes infrastructure resource set 328 configured to execute an AI workload or at least a portion thereof. Thus, by separating the AI workloads into containers 372, some resources of the host may be used to execute the AI workload from one tenant, while other resources of the host may be used to execute the AI workload of another tenant at the same time. The container 372 is configured such that two separate AI workloads are prevented from interacting in any manner when they are executed.
Infrastructure resources 328 include service fabric 396 interfaces, storage resources 376, networking resources 378, computing resources 380, which may include bare metal blades 382 (e.g., physical processing devices) and virtual machines 384, and other resources 386 (e.g., integrated infrastructure resources). In some examples, the infrastructure resources 328 are provided primarily for use by an entity (e.g., 1P resources) that is providing the services of the system 100, but in other examples, the infrastructure resources 328 may also include resources provided by other entities (e.g., 3P resources), such as resources owned and used by tenants of the system 100. Such integration may be achieved via the 3P library 248 and other configurations described above.
The devices and AI accelerators 330 include GPUs 388, FPGA devices 390, other 3P devices 392, and other 1P devices 394. The described processes may also be implemented by the back-end network 374 and/or associated devices. Execution of AI workloads may uniquely benefit from the use of the GPUs 388, FPGAs 390, and/or other specialized hardware. In such an example, an infrastructure resource 328, such as a computing resource 380, may be linked to a GPU 388 such that the computing resource 380 provides instructions to the GPU 388 on how to perform the steps of the AI workload. Such execution then takes advantage of the specialized architecture of the GPU 388, such as the GPU 388 having many cores that enable data to be processed in a massively parallel fashion beyond the capabilities of the computing resource 380.
The backend network 374 is configured to support a wide variety of non-uniform backend network architectures that may be envisioned by the wide variety of entities that use the system, such as 1P and 3P hardware manufacturers. Such a backend network 374 may be used to provide links between compute nodes (e.g., computing resources 380) and a disaggregated topology of hardware accelerators (e.g., GPUs 388).
Fig. 4 is a flowchart illustrating a method 400 for managing AI workloads in a cloud infrastructure platform, the method 400 in accordance with an embodiment. In some examples, the cloud infrastructure platform of method 400 is a system such as system 100 of fig. 1. At 402, a set of distributed infrastructure resources (e.g., hosting and activation subsystem 126, infrastructure resources 128, and/or device/AI accelerator 130 of infrastructure plane 106) are integrated into the cloud infrastructure platform via the native support interfaces of these resources. In some examples, the native support interfaces may include interfaces and/or libraries of resource providers, such as 3P library 248 and 1P library 250 of the figures. For example, a tenant of a cloud infrastructure platform may provide a subset of infrastructure resources based on the provided libraries for integration into the platform such that the tenant and/or other tenants of the platform may use those resources in executing AI workloads.
At 404, an AI workload is received from a plurality of tenants, wherein the received AI workload includes a training workload and an inference workload. In some examples, a tenant provides AI workloads for execution on a platform via an interface, such as pluggable data plane 110 described herein.
At 406, a subset of resources of the distributed infrastructure resources is assigned to the received AI workload. In some examples, assigning the subset of resources to the AI workload is performed by the global scheduling subsystem 112, as described herein. Assigning resources may include determining resource requirements of AI workloads and then identifying a subset of infrastructure resources that meet those requirements (e.g., AI workloads that require parallel use of four GPUs may be assigned to nodes of a system having at least four GPUs).
Additionally or alternatively, assigning the subset of resources to the AI workload may include rearranging other AI workloads with respect to the subset of resources. For example, assigning a subset of resources to an AI workload may include saving a status checkpoint of the AI workload currently executing on a first subset of resources, migrating the AI workload to a second subset of resources, restoring the saved status checkpoint of the migrated AI workload on the second subset of resources, and then assigning at least a portion of the first subset of resources to another AI workload. In some examples, such a process may be performed using routines of reliability subsystem 222 as described herein.
At 408, the received AI workload is scheduled to execute on the assigned subset of resources. In some examples, the global scheduling subsystem 112 generates a schedule of AI workloads as described herein. Further, scheduling the execution of the AI workload may include scheduling the training workload and the inference workload on the same infrastructure resources, and both types of workload are multiplexed on those infrastructure resources (e.g., the execution of the training workload is interspersed with the execution of the inference workload on infrastructure resources such as GPUs).
Further, in some examples, the AI workloads are associated with priorities or tiers that affect how resources are assigned and how AI workloads are scheduled to execute on those resources. For example, lower-level AI workloads are more likely to be migrated to other resources in order to make room for higher-level AI workloads, or higher-level AI workloads may be scheduled for a greater share of resource usage time than lower-level AI workloads, as described herein.
At 410, the AI workload is performed based on the scheduling of the AI workload on the assigned subset of resources. In some examples, the AI workload is hosted in the hosting and activation subsystem 126, and then the infrastructure resources 128 and/or the device/AI accelerator 130 are used to execute the AI workload. For example, assigning and executing AI workloads on a subset of resources includes isolating AI workloads from each other in a secure container, such that AI workloads associated with different tenants execute securely with each other (e.g., on resources associated with the same server).
Further, in some examples, the execution of the AI workload is monitored based on the performance of the cloud infrastructure platform, and the scheduling of the AI workload is adjusted based on the monitoring. Adjustment of the schedule may include preempting the AI workload, migrating the AI workload, zooming in the AI workload, zooming out the AI workload, and/or load balancing between two or more AI workloads. Such scheduling adjustments may be performed by global scheduling subsystem 112 or other components of system 100.
Fig. 5 is a block diagram illustrating a hierarchical scheduling subsystem 500 configured for scheduling AI workloads 512, according to an embodiment. In some examples, scheduling subsystem 500 is included in a system such as system 100 of FIG. 1. For example, scheduling subsystem 500 may be substantially identical to global scheduling subsystem 112 of FIG. 1. Scheduling subsystem 500 includes a global scheduler 502 and a plurality of regional schedulers 504, coordinator services 506, and associated infrastructure resources 508. The global scheduler 502 is configured to use global capacity data 510 (e.g., data indicating the current state of resource usage across the entire associated global infrastructure system, including resource usage in each region of the system) and the AI workloads 512 to generate a global schedule 514 that schedules the AI workloads 512 for execution on the infrastructure resources 508. The global schedule 514 includes a region schedule 520 for each region of the system, and each region schedule 520 is then provided to the regional scheduler 504 associated with that region (e.g., the region schedule 520 for a region is provided to the regional scheduler 504 associated with that particular region).
The regional scheduler 504 monitors current regional capacity data 516 for the infrastructure resources 508 associated with its region, and the regional capacity data 516 is provided to the global scheduler 502 periodically or based on a pattern or trigger event. In addition, the regional scheduler 504 receives, from the global scheduler 502, the regional AI workloads 518 of the AI workload set 512 that are associated with the region of the regional scheduler 504. The regional scheduler 504 is also configured to instruct the coordinator service 506 to execute the associated region schedule 520 using the data of the regional AI workloads 518 (each region includes a regional scheduler 504 and a coordinator service 506).
The coordinator service 506 is configured to receive the region schedule 522 and the associated regional AI workloads 524 from the associated regional scheduler 504, and is configured to use the reliability routines 526 (e.g., the routines of the reliability subsystem 222 of FIG. 2, as described above) to cause the regional AI workloads 524 to be executed on the infrastructure resources 508 of the region according to the region schedule 522. For example, the coordinator service 506 may be configured to allocate a subset of the infrastructure resources 508 of the region to the regional AI workloads 524 and cause the workloads 524 to be executed on those allocated resources 508. Additionally or alternatively, the coordinator service 506 may be configured to checkpoint, restore, migrate, and/or execute other reliability routines 526 to schedule the use of the infrastructure resources 508 according to the region schedule 522.
DLT jobs cannot be modified. Thus, the disclosed implementations and examples instead resume training of a given DLT job at a different node with the same state (e.g., the same program counter/instruction pointer, the same register state, call stack, etc.) that the given DLT job had at the original node at the time of checkpointing. The disclosed implementations and examples save the program state of a DLT job and restore the DLT job in that program state on another node of the cloud environment, switching execution/control flow to the same instruction.
In some implementations and examples, the checkpointing library performs (e.g., domain-independent) checkpointing of the entire CPU address space and restores the CPU address space on the destination node using, for example, the checkpoint service 600 discussed below and shown in FIG. 6. Because the CPU address space footprint of most DLT and DNN jobs is small (most state resides in the GPU), the computational cost of dumping the entire address space is not too high (e.g., 1-2 GB per task). Given a domain-specific approach that captures only the relevant GPU state, some implementations and examples copy the active GPU state to host (CPU) memory and then checkpoint the entire address space. After recovery, some implementations and examples copy the active GPU buffers back to GPU memory on the target node.
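For illustration only, the following Python sketch outlines this copy-out, checkpoint, and copy-back sequence. All names (ActiveBuffer, copy_d2h, checkpoint_address_space, copy_h2d) are assumptions made for the sketch; the patent does not name its internal APIs.

from dataclasses import dataclass

@dataclass
class ActiveBuffer:
    device_offset: int   # offset relative to the single device allocation
    size: int
    host_copy: bytes = b""

def checkpoint(active_buffers, copy_d2h, checkpoint_address_space):
    # 1. Copy only the "active" GPU regions (model parameters, optimizer state)
    #    into host memory; inactive regions are skipped entirely.
    for buf in active_buffers:
        buf.host_copy = copy_d2h(buf.device_offset, buf.size)
    # 2. The relevant GPU state now lives in the CPU address space, so a
    #    conventional domain-independent CPU checkpoint captures everything.
    checkpoint_address_space()

def restore(active_buffers, copy_h2d):
    # On the destination node, after the CPU address space is restored, copy
    # the active buffers back to the same relative device offsets.
    for buf in active_buffers:
        copy_h2d(buf.device_offset, buf.host_copy)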
FIG. 6 is a block diagram illustrating a checkpointing service 600, the checkpointing service 600 configured to checkpoint various operating parameters of a DLT job 602 so that the DLT job 602 may be migrated to a separate destination node 608. The checkpoint service 600 may also be referred to as a proxy-based dual process (PBDP) architecture. "agent-based dual-process architecture" and "checkpointing service" are used synonymously in some examples herein.
The original node 604 includes one or more CPUs 610, the proxy node 606 includes one or more CPUs 612 and GPUs 614, and the destination node 608 includes one or more CPUs 616 and GPUs 618. The disclosed embodiments reference different nodes: an original node 604, a proxy node 606, and a destination node 608. These nodes 604-608 may be any type of server, computer, VM, etc. An example computing device is discussed below in FIG. 10 as computing device 1000, which may act as each of the original node 604, the proxy node 606, and the destination node 608.
Original node 604 operates the processing layers including DLT job 602, modified open source code 634, interceptor 626, GPU proxy client 633, CPU 610, and GPU 611. GPU proxy client 633 captures and stores CPU state 627, CPU memory 629, GPU state 630, and GPU memory 632 of DLT job 602.
Proxy node 606 operates the processing layer for checkpointing DLT job 602; the proxy node 606 includes proxy process 620, GPU proxy server 621, various GPU libraries 631, a VM or server Operating System (OS) 623, CPU 612, and GPU 614.
Destination node 608 includes CPU 616 and GPU 618 and is the destination of DLT job 602 that migrates after checkpointing.
As described below, the various checkpointed parameters of DLT job 602 are stored in shared memory 622 accessible to proxy node 606. The shared memory 622 may be temporary storage. As shown by the dashed oval 650, shared memory 622 captures the various checkpointed parameters of DLT job 602, and those parameters are restored at proxy node 606. Once restored on proxy node 606, the captured parameters of DLT job 602 may be deleted and removed from original node 604. The DLT job 602 may then be moved to the destination node 608 using the checkpointed and restored parameters.
In some implementations and examples, checkpointing of DLT job 602 is done at one node of the cloud computing environment (original node 604) and DLT job 602 is restored from the checkpointed state on a different node (destination node 608). This may be scheduled by global scheduler 502, regional scheduler 504, coordinator service 506, or some other migration service in the cloud computing environment.
As previously described, the checkpointing library handles CPU state, but fails when the address space is contaminated by GPU-related state. The disclosed checkpoint service 600 resolves this conflict. In some implementations and examples, when DLT job 602 is being processed (e.g., user Python code with a PyTorch/TensorFlow (PT/TF) training loop, etc.), all GPU-related activities of DLT job 602 are isolated in proxy process 620 at a separate proxy node 606 located in a different address space than original node 604.
Proxy process 620 is implemented in executable code, firmware, hardware, or a combination thereof, and is designed to be stateless across checkpoints. As a result, the address space of the proxy process 620 is contaminated by the GPU-related mappings described above, but because the proxy process is stateless, implementations and examples are able to delete (or terminate) the proxy process and restart the proxy process at the destination node 608. The main process address space of DLT job 602 remains (as a useful, stateful part) without any GPU-related state and thus can be checkpointed safely.
Some examples capture the GPU state 630 and GPU memory 632 of GPU 611 during execution of DLT job 602 on original node 604. GPU state 630 may include GPU data including model parameters and optimizer state located in the GPU at the time of checkpointing. In addition, the CPU state 627 and CPU memory 629 of the CPU 610 are captured for DLT job 602 on the original node 604. CPU state 627, CPU memory 629, GPU state 630, and GPU memory 632 may be stored in shared memory 622 and made accessible to proxy node 606. In some examples, checkpointing of DLT job 602 may then be performed using CPU state 627, CPU memory 629, GPU state 630, and GPU memory 632. Other examples use different or additional parameters, as discussed in more detail below. After checkpointing, DLT job 602 may be migrated to destination node 608, and processing of DLT job 602 is resumed from the checkpointed state defined by one or more of CPU state 627, CPU memory 629, GPU state 630, and GPU memory 632.
Additionally or alternatively, some disclosed examples capture a "checkpointed state" that includes model parameters that are written to memory, SSD, hard disk, etc. during checkpointing, and that is read during recovery on destination node 608. In addition, GPU function call parameters are shared between original node 604 and proxy node 606, are read and written in shared memory between the two, and are accessed continuously while DLT job 602 is running.
Some examples sequester GPU-related activity of GPU proxy client 633 or GPU 611 into proxy process 620 of proxy node 606 having a different address space than GPU 611 of original node 604. In some implementations and examples, proxy process 620 is stateless across checkpoints, which results in the address space of proxy process 620 being contaminated by GPU-related mappings. The address space of proxy process 620 may be contaminated by the GPU-related map, but because proxy process 620 is stateless, proxy process 620 can be deleted (terminated) and restarted at destination node 608. However, the main process address space of proxy node 606 may remain clean without any GPU-related state.
Additionally or alternatively, GPU proxy server 621 may be configured to read the model parameters of DLT job 602 from shared memory 622 and execute the corresponding GPU calls in the address space of proxy process 620. In addition, the return values may be transmitted or carried back to the host process (the GPU proxy client) through shared memory 622.
Additional examples isolate GPU-related activities of DLT job 602 into separate proxy processes across a first plurality of original nodes in a cloud computing environment. During isolation, DLT job 602 may be allowed to continue computation in the host process, and the computation may be accomplished by executing Python code and/or PT/TF training loops.
In some examples, only a portion of GPU memory 632 or CPU memory 629 that is active is captured. For example, a portion of GPU memory 632 containing model parameters for DLT job 602 may be captured.
In some examples, program state associated with DLT job 602 is saved in shared memory 622. Program state may be used to resume DLT job 602 on destination node 608 by switching control flow to the saved program state.
To move the GPU-related activity into the shared memory 622 of the proxy node 606, some implementations and examples use dynamic library insertion on all GPU-related calls of the DLT job 602. These GPU-related calls are intercepted in the main process of DLT job 602, their parameters are serialized, and the calls are written into the shared memory 622 between the original node and the proxy node 606. The proxy node 606 then reads these function parameters from shared memory, executes the corresponding GPU calls in its own address space, and forwards the return values back to the host process (the proxy client) through shared memory 622.
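A simplified Python sketch of such a shared-memory call channel follows. It assumes shm is a multiprocessing.shared_memory.SharedMemory segment mapped by both processes, and it uses pickle plus a small header flag as stand-ins for the real serialization and spin-wait protocol, which in practice wraps the intercepted GPU APIs in native code; dispatch_table maps call names to the real GPU library entry points.

import pickle
import struct

HDR = struct.Struct("II")   # (state, payload_len); state: 0 = empty, 1 = request, 2 = reply

def client_call(shm, func_name, *args):
    """Serialize one GPU call, publish it to the proxy, and spin-wait for the reply."""
    payload = pickle.dumps((func_name, args))
    shm.buf[HDR.size:HDR.size + len(payload)] = payload
    HDR.pack_into(shm.buf, 0, 1, len(payload))            # publish the request
    while HDR.unpack_from(shm.buf, 0)[0] != 2:            # low-latency spin, no sleep/wake
        pass
    _, n = HDR.unpack_from(shm.buf, 0)
    result = pickle.loads(bytes(shm.buf[HDR.size:HDR.size + n]))
    HDR.pack_into(shm.buf, 0, 0, 0)                       # mark the slot empty again
    return result

def proxy_loop(shm, dispatch_table):
    """Proxy process: dequeue call packets and execute them in its own address space."""
    while True:
        state, n = HDR.unpack_from(shm.buf, 0)
        if state != 1:
            continue                                      # spin until a request arrives
        func_name, args = pickle.loads(bytes(shm.buf[HDR.size:HDR.size + n]))
        result = dispatch_table[func_name](*args)         # e.g. the real Malloc/LaunchKernel
        reply = pickle.dumps(result)
        shm.buf[HDR.size:HDR.size + len(reply)] = reply
        HDR.pack_into(shm.buf, 0, 2, len(reply))          # publish the reply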
The checkpointing mechanism implemented by checkpointing service 600 is domain-aware in that it identifies the specific portions of GPU memory 632 that hold useful data for checkpointing and migration, and copies only those portions to CPU memory prior to initiating checkpointing. This is critical to keeping the checkpoint size manageable (otherwise, the entire GPU address space (e.g., 16 GB or 32 GB per device) would need to be replicated); however, to have this capability, embodiments need full visibility into the allocation/free events that occur at the framework level (e.g., in PyTorch).
The memory allocators in PyTorch and TensorFlow typically allocate the entire GPU memory (by executing Malloc()) at start-up and then manage this "heap" with their own memory allocators. To gain visibility, the default memory allocator in PyTorch or TensorFlow is overridden by a custom allocator that knows which regions have been allocated and which are free. While this override is easy in TensorFlow (which provides an extensibility point for overriding the memory allocator), PyTorch does not have a clean interface for the override, so the framework is changed to accommodate this.
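The following is a minimal sketch, under assumed names, of what such a tracking allocator could look like: a single large device reservation is sub-allocated by the framework through this wrapper, so the checkpointer can enumerate exactly which regions are live.

class TrackingAllocator:
    def __init__(self, heap_size):
        self.heap_size = heap_size
        self.next_offset = 0          # naive bump pointer, for illustration only
        self.live = {}                # offset -> size of every active allocation

    def malloc(self, size):
        if self.next_offset + size > self.heap_size:
            raise MemoryError("device heap exhausted")
        offset = self.next_offset
        self.next_offset += size
        self.live[offset] = size
        return offset                 # handed to the framework as a device pointer

    def free(self, offset):
        del self.live[offset]         # region is no longer part of the checkpoint

    def active_regions(self):
        """Exactly the regions the checkpointer must copy to host memory."""
        return sorted(self.live.items())

    def active_bytes(self):
        return sum(self.live.values())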
In some implementations and examples, the current memory usage of DLT job 602 is captured on original node 604 by interceptor 626. Memory usage of DLT job 602 increases during the forward pass and decreases at the end of the backward pass. Without semantic visibility into the structure or code of the model of DLT job 602, checkpointing may be timed to occur at points where memory use is near the minimum (e.g., within 10% of the minimum). It should be noted that this is a performance optimization and is not necessary for correctness in all implementations and examples. Thus, the low-memory condition is orthogonal to the correctness requirements of a distributed DLT job 602: after the correctness constraints are satisfied (e.g., pending AllReduce operations have drained), it can be used as an additional constraint for technically efficient checkpointing.
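As a sketch of the timing heuristic only (the 10% margin comes from the example above; everything else, including the per-mini-batch call site and the allocator's active_bytes() accessor from the previous sketch, is assumed), a checkpoint request could be deferred until tracked usage is near the lowest value observed so far:

def should_checkpoint_now(current_bytes, observed_min_bytes, margin=0.10):
    # "Near the minimum" is approximated as within 10% of the smallest usage seen.
    return current_bytes <= observed_min_bytes * (1.0 + margin)

def maybe_checkpoint(allocator, state, checkpoint_requested, do_checkpoint):
    # Intended to be called once per mini-batch by the interception layer.
    usage = allocator.active_bytes()
    state["min"] = min(state.get("min", usage), usage)
    if checkpoint_requested and should_checkpoint_now(usage, state["min"]):
        do_checkpoint()   # correctness constraints (e.g. no pending AllReduce) still apply
        return True
    return False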
This interception of GPU-related calls into a different address space presents some challenges. Because the interaction with the proxy process 620 is in the critical path of GPU kernel dispatch (which is sensitive to delay), the synchronization mechanism between the master process on the original node 604 and the proxy process 620 must be low latency (e.g., no sleep/wake-up). To achieve this, the proxy process 620 waits on the shared memory 622, and when a particular "packet" (a function call) is written, the proxy process 620 dequeues the packet and executes it. For a typical DLT workload, the overhead of going through proxy process 620 has been shown to be approximately 1-2% in some cases, and the overhead may be further reduced using the techniques discussed herein.
In some implementations and examples, interceptor 626 intercepts calls of DLT job 602, stores those calls in shared memory 622, and forwards them to the lower-level GPU libraries 631 running on proxy node 606. Low-level libraries such as the GPU driver Application Programming Interfaces (APIs) and GPU runtime APIs are intercepted (at least partially intercepted), but they have a limited set of APIs. In addition, multiple higher-level libraries, such as, for example, but not limited to, the open source libraries Thrust, Eigen, and Apex, may be accessed (shown as modified open source code 634) and used by DLT job 602. In some implementations and examples, these libraries are captured and added to, or at least referenced from, shared memory 622.
In addition, the DLT job 602 or user model may define its own kernels (custom kernels) that are launched directly on the GPU proxy client 633 of the original node 604 and added to the shared memory 622. Since these libraries launch custom kernels directly on the GPU, a naive approach would require intercepting all of these libraries, which is difficult to manage. Some examples therefore intercept only a small set of low-level calls. Higher-level libraries such as Apex, and model-level libraries defining user-defined kernels, interact with the GPU through a LaunchKernel call or LaunchKernel API, so in some implementations and examples the LaunchKernel calls are intercepted and forwarded to proxy node 606. Proxy node 606 serializes the parameters of LaunchKernel and copies them to shared memory 622.
Checkpoint service 600 isolates the host address space of original node 604 from direct device mapping and GPU-related contamination, but the PyTorch or TensorFlow process of DLT job 602 running in the host address space of original node 604 retains a pointer to GPU state 630 stored in GPU proxy client 633 device state. For example, a tensor object in PyTorch may have a device pointer that points to the GPU memory 632 of the original node 604. Similarly, the CPU variable may hold a "handle identifier" returned by a GPU call running on GPU proxy client 633. Such virtual pointers continue to be valid and have the same meaning when the address space is restored on GPU 618 of destination node 608.
An object (e.g., a tensor object) in the host address space of the original node 604 holds a pointer to device memory. The host process of DLT job 602 does not directly dereference or interpret these pointers. Instead, the pointers are stored in shared memory 622 and carried to proxy node 606 as parameters of a kernel, and it is the kernel code (running in the GPU) that interprets them. However, implementations and examples ensure that these pointers point to the same objects that they pointed to in the old GPU of the original node 604 before checkpointing.
In some implementations and examples, the checkpoint service 600 is controlled to allow only a single allocation of device memory through Malloc. Some implementations and examples intercept the Malloc execution (e.g., via the LD_PRELOAD mechanism) and force the backing mmap to be executed at a stable address (which is the same in GPUs 614 and 618). By default, mmap is passed NULL as the desired address, which means that the OS 623 of proxy node 606 maps it to some arbitrary region of the address space. Performing the mapping at a fixed virtual address instead ensures that the Malloc starting address is the same across all GPUs. The active regions of the GPU state 630 of the original node 604 are captured and copied back to the same relative addresses within GPU memory, ensuring fidelity of all device pointers held in the host address space. This eliminates the need to track and patch such pointers.
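A sketch of the pointer-fidelity idea follows, with an arbitrary illustrative base address: because the single device heap is mapped at the same virtual address before and after migration, device pointers recorded in host state (base plus offset) remain meaningful without patching.

STABLE_DEVICE_BASE = 0x7f0000000000    # illustrative fixed mapping address, not the patent's value

def to_device_pointer(offset):
    # Pointers handed back to the framework are base + offset.
    return STABLE_DEVICE_BASE + offset

def restore_active_regions(active_regions, copy_h2d):
    # On the destination GPU the heap is mapped at the same base, so each saved
    # region is copied back to the same relative (and hence absolute) address.
    for offset, host_bytes in active_regions:
        copy_h2d(STABLE_DEVICE_BASE + offset, host_bytes)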
A similar problem occurs with the handles returned by various GPU calls. For example, StreamCreate returns an opaque handle that is stored in host state and then used as a reference in subsequent kernel launch calls initiated by the host process. However, when restored, the device will return a different handle for the same stream during replay. To maintain the fidelity of these handles across checkpoint-restore, the handles are virtualized. The proxy node 606 that intercepts these calls does not return the actual handle returned by the device; instead, it returns a virtual handle and remembers the mapping as part of the CPU state. Any API with a parameter of a handled type first translates that parameter before it is passed to the proxy server. The virtual handles start at 0x00a0b0c0 and advance in increments of 1. The only requirement is that a live virtual handle is never reused across checkpoint/restore. During resume/replay, the proxy simply updates the mapping table with the new physical handle but maps it to the same virtual handle. Since the remainder of the host process only stores and operates on virtual handles, it remains consistent after restoration.
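The handle-virtualization table might be sketched as follows; the starting value 0x00a0b0c0 and the increment of 1 follow the text, while the method names are assumptions:

class HandleTable:
    def __init__(self):
        self.next_virtual = 0x00a0b0c0
        self.virt_to_phys = {}

    def register(self, physical_handle):
        """Called when a stateful API (e.g. StreamCreate) returns a real handle."""
        virtual = self.next_virtual
        self.next_virtual += 1                 # live virtual handles are never reused
        self.virt_to_phys[virtual] = physical_handle
        return virtual                         # this is what the host process stores

    def translate(self, virtual_handle):
        """Called before forwarding any API parameter of a handled type."""
        return self.virt_to_phys[virtual_handle]

    def rebind_on_replay(self, virtual_handle, new_physical_handle):
        """After restore, the replayed call yields a new physical handle; the
        virtual handle held by the host process stays the same."""
        self.virt_to_phys[virtual_handle] = new_physical_handle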
Stateful API calls create handles for contexts, streams, events, Basic Linear Algebra Subroutine (BLAS) libraries, DNN libraries, and multi-GPU collective communication primitive libraries, or they set associations between handles or change their configurations. Stateful API calls are captured in a log 640, stored together with GPU memory 632, and replayed in the same temporal order upon recovery. The log could grow with each training iteration, but most of it can be compressed in one of the following ways. First, if the configuration of a handle changes or a new association is established between handles, an idempotent change is detected and is not recorded for replay. Furthermore, depending on the type of change, only the most recent call is replayed, even if the most recent call is not idempotent. Since each GPU proxy client 633 uses a single device and a single stream, the latter compression can be done without worrying about the contents of the replay list between two calls of the same type.
In addition, if an earlier-created handle is destroyed (e.g., a multi-GPU communication primitive handle), some examples delete all creation, configuration-change, and association-setting calls in log 640 linked to that particular handle. To achieve this, a "garbage collection key" (gc_key) is associated with each entry recorded in the replay log. When a new item is recorded with the same gc_key, or a handle linked to the gc_key is destroyed, the replay log is compressed and kept brief.
With these techniques, the replay list is reduced to 5 to 100 calls depending on the model and can be executed in a few seconds. In some implementations and examples, this does not depend on the duration of model execution, but only on the point within the iteration at which checkpointing is performed. Performing checkpointing near or at iteration boundaries results in a minimal replay log 640; doing so at an epoch boundary would make the log even smaller. The log is a list of the live handles, and their configurations, that are still in use and thus, in some examples, need to be replayed.
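A sketch of such a replay log with gc_key-based compression is shown below; the record layout and method names are assumptions, and only the compression rules described above are modeled:

class ReplayLog:
    def __init__(self):
        self.entries = {}              # gc_key -> (call_name, args), most recent call only

    def record(self, gc_key, call_name, args):
        if self.entries.get(gc_key) == (call_name, args):
            return                     # idempotent change: nothing new to replay
        self.entries[gc_key] = (call_name, args)   # keep only the latest call per key

    def handle_destroyed(self, handle):
        # Drop creation/configuration/association calls linked to a dead handle.
        self.entries = {k: v for k, v in self.entries.items()
                        if handle not in v[1]}

    def replay(self, dispatch_table):
        # On restore, re-execute the few surviving calls (typically 5-100) in order;
        # creations rebind their virtual handles via the handle table sketched above.
        for call_name, args in self.entries.values():
            dispatch_table[call_name](*args)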
Checkpoint service 600 handles multi-GPU and multi-node DLT jobs 602. In some examples, a distributed DLT job 602 runs in multi-process mode (e.g., one process per GPU). Each process uses its own proxy and starts with the correct environment variables to indicate to the proxy which GPU to use. Each process in the distributed DLT job 602 is checkpointed separately (because each process has its own data loader, rank state, and so on).
The checkpointing framework also coordinates across processes so that they checkpoint at the same point in the workflow and do not start any new AllReduce. For example, if some processes have started an AllReduce while other processes decide to checkpoint before executing the AllReduce, this can result in a deadlock. Implementations and examples ensure that no AllReduce or other collective operation is in flight in any process when checkpointing occurs. Further, after restoration, tasks may be mapped to a different set of network endpoints, meaning that the communication endpoints (e.g., the ProcessGroup concept maintained by PyTorch) are reinitialized to point to the new addresses. Today, the establishment of endpoints is done by the user script at the beginning of a job; however, for the present checkpointing, this must be re-performed after each recovery.
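For illustration, the coordination and endpoint re-initialization could look like the following sketch using torch.distributed; the collective-based agreement on the checkpoint flag and the environment-variable-based re-initialization are assumptions, not the patent's actual mechanism:

import torch
import torch.distributed as dist

def agree_to_checkpoint(locally_requested: bool) -> bool:
    # Every rank contributes its local view of the request; MAX means that if any
    # rank has seen the request, all ranks checkpoint at this iteration boundary,
    # so no rank can be inside a training AllReduce while another checkpoints.
    flag = torch.tensor([1 if locally_requested else 0])
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())

def restore_endpoints(backend: str = "gloo") -> None:
    # After restore, ranks may live on different machines, so the process group
    # (and the DistributedDataParallel structures built on it) is re-initialized
    # to point at the new network endpoints (env:// initialization assumed).
    if dist.is_initialized():
        dist.destroy_process_group()
    dist.init_process_group(backend=backend)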
Nor does the user have to do anything special to preempt or migrate the job. This is accomplished by intercepting and checkpointing the process state of the DLT job 602 at a sufficiently low level, in a manner such that the user program does not know what is happening. Again, it is transparent to the software layers above, the user code, and the framework libraries (e.g., PyTorch or TensorFlow).
In some implementations and examples, checkpointing involves deleting the process group created by the user script and passed to the DistributedDataParallel class, as well as certain other DistributedDataParallel data structures (e.g., the PyTorch Reducer) built on the process group. Recovery involves only re-initializing the process group and the associated DistributedDataParallel data structures. The single-GPU checkpointing/restoration mechanism is invoked via proxy process 620.
In some implementations and examples, the user has the option to bring his or her own container with particular libraries, and so on. At the platform level, only the cmdline/env of such a DLT job 602 is augmented to LD_PRELOAD a version of the proxy library that spawns the proxy process at the time of the first GPU call. The proxy library keeps polling a local file to see whether it needs to initiate checkpointing.
In addition to performing checkpointing on demand, the proxy library also periodically performs continuous checkpointing (at an adjustable frequency, e.g., once every 15 minutes). This handles unplanned interrupts or job failures. The frequency is set so as to amortize checkpointing overhead across longer execution times. Thus, for a planned interrupt or scheduler-driven preemption, the DLT job 602 resumes at the next iteration (e.g., no work is lost), but for an unplanned failure, the DLT job 602 may lose a bounded amount of processing (e.g., 15 minutes).
In the continuous checkpointing mode, the checkpointing service 600 also checkpoints the file-system state, because the job continues to run after a checkpoint and keeps appending to its output files/logs. If DLT job 602 were later restored from a previous snapshot, the file-system state might otherwise be inconsistent with the job state (e.g., the user might see duplicate log messages for the same step count). To address this issue, every time a successive checkpoint is created, the delta of changes made to the file system since the previous checkpoint is also replicated using rsync.
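A minimal sketch of this continuous-checkpointing loop is shown below; the take_checkpoint callable, the directory arguments, and the unbounded loop are assumptions, while the 15-minute interval and the rsync delta copy follow the description above.

import subprocess
import time

CHECKPOINT_INTERVAL_S = 15 * 60  # adjustable frequency, as described above

def continuous_checkpointing(job_dir, snapshot_dir, take_checkpoint):
    # Periodically checkpoint the job and mirror the file-system delta so a
    # later restore sees files consistent with the restored job state.
    while True:
        time.sleep(CHECKPOINT_INTERVAL_S)
        take_checkpoint()  # hypothetical helper that dumps CPU/GPU state
        subprocess.run(
            ["rsync", "-a", "--delete", f"{job_dir}/", f"{snapshot_dir}/"],
            check=True)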
The client agent is a separate process running in the same container as the user's DLT job and therefore has access to the same file-system namespace. In some examples, the client agent is part of the base container image from which the user's image is derived, and thus has access to the local file system. The client agent exposes the following example Remote Procedure Call (RPC) interface:
InitiateCheckpoint(jobID): returns status "SUCCESS" or an error code;
IsCheckpointDone(jobID) (asynchronous) or WaitForCheckpoint(jobID) (blocking): returns "true" or "false"; and
RestoreJob(checkpoint_location, checkpoint_time [defaults to the latest checkpoint]).
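For illustration only, the interface above might be expressed as the following Python stub; the method signatures and return types are assumptions inferred from the list, not the actual service definition.

from abc import ABC, abstractmethod

class ClientAgentAPI(ABC):
    # Assumed shape of the client agent's RPC surface.

    @abstractmethod
    def InitiateCheckpoint(self, job_id: str) -> str:
        """Returns "SUCCESS" or an error code."""

    @abstractmethod
    def IsCheckpointDone(self, job_id: str) -> bool:
        """Asynchronous variant: poll for completion."""

    @abstractmethod
    def WaitForCheckpoint(self, job_id: str) -> bool:
        """Blocking variant: returns once the checkpoint finishes or fails."""

    @abstractmethod
    def RestoreJob(self, checkpoint_location: str,
                   checkpoint_time: int = -1) -> str:
        """checkpoint_time of -1 selects the latest consistent checkpoint."""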
Upon receiving InitiateCheckpoint(), the client agent writes a file in the local file system indicating that a checkpoint request has been received. Proxy process 620, running as part of DLT job 602, watches for this file in order to trigger checkpointing at the earliest "safe" occasion. Once proxy process 620 is "ready" for checkpointing, in terms of both correctness (e.g., done with pending AllReduce operations) and performance (e.g., near a low-memory point), it writes back to the file indicating that it is ready, and proxy process 620 stops (or terminates). The client agent then checkpoints the user process, which dumps the address-space checkpoint into a local file. The local file is compressed and shipped to remote storage.
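The file-based handshake between the client agent and proxy process 620 could look roughly like the following sketch; the file paths, the polling interval, and the safe-point and dump callables are assumptions.

import os
import time

REQUEST_FILE = "/tmp/ckpt_request"  # assumed path watched by the proxy
READY_FILE = "/tmp/ckpt_ready"      # assumed path written back by the proxy

def client_agent_initiate_checkpoint(dump_address_space):
    # Signal the proxy process, wait until it reports a safe point, then
    # dump the user process's CPU address space to a local file.
    open(REQUEST_FILE, "w").close()
    while not os.path.exists(READY_FILE):
        time.sleep(0.1)
    dump_address_space()

def proxy_process_poll(at_safe_point):
    # Runs inside the proxy library: poll for a request and acknowledge it
    # only when no AllReduce is pending and GPU memory is near its low point.
    while True:
        if os.path.exists(REQUEST_FILE) and at_safe_point():
            open(READY_FILE, "w").close()
            break
        time.sleep(0.1)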
In some implementations and examples, the storage location to which the checkpoint is written is specified in the cmdline of the client agent. By default, the checkpoint location is a directory under the job output directory, for example output_dir/<job_id>/<rank>/checkpoints/, which means that credentials for writing to that directory, such as blob-store credentials, are provided to the client agent. In some implementations and examples, the scheduler asks all DLT jobs 602 to write output to a particular storage device and has the user supply credentials for that storage device, which the scheduler can then use to bring up the client agent. To handle failed or incomplete checkpoints, in some examples a DONE file is written in the same directory after the rest of the checkpoint data has been successfully written.
In some implementations and examples, the client agent is also responsible for starting the job on the new machine from a previous consistent checkpoint. The recovery API specifies the checkpoint directory and a coarse timestamp (e.g., -1 means the latest). The semantics of the timestamp are advisory: recovery is done from the closest checkpoint that is (a) consistent and complete and (b) no later than the specified timestamp. The timestamp is advisory because the latest checkpoint may be corrupted or inconsistent. This is especially true in distributed jobs, where each task independently writes its own checkpoint. Some tasks may successfully write the latest checkpoint while others fail, in which case the job should be restored from the previous checkpoint (since all tasks are guaranteed to have written it). Because all tasks of the distributed job write under the same job ID, the client agent of each task looks at its own checkpoint directories and those of all other tasks and independently reaches the same conclusion about which checkpoint the checkpoint service 600 restores from.
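A sketch of how each task could independently arrive at the same restore point is shown below: scan the checkpoint directories of every rank, keep only checkpoints whose DONE marker exists for all ranks, and take the newest one at or before the requested timestamp. The directory layout and helper names are assumptions based on the description above.

import os

def list_checkpoint_times(output_dir, job_id):
    # Assumed layout: checkpoint timestamps are directory names under rank 0.
    path = os.path.join(output_dir, job_id, "0", "checkpoints")
    return sorted(int(d) for d in os.listdir(path) if d.isdigit())

def choose_restore_checkpoint(output_dir, job_id, ranks, max_time=-1):
    # Return the newest checkpoint that every rank finished writing
    # (DONE present) and that is no later than max_time (-1 means latest).
    chosen = None
    for ts in list_checkpoint_times(output_dir, job_id):
        if max_time != -1 and ts > max_time:
            break
        if all(os.path.exists(os.path.join(output_dir, job_id, str(rank),
                                           "checkpoints", str(ts), "DONE"))
               for rank in ranks):
            chosen = ts
    return chosen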
The external service that consumes the API may be one of the schedulers, or some other service that receives preemption signals for the VM, and so on. One example performance expectation is that, within a time buffer after InitiateCheckpoint (e.g., 15 seconds), either a checkpoint is written or a failure is returned so that the job can be terminated (and recovered from an older checkpoint). The reason for the time buffer is to allow checkpointing to be done at a "safe" time aligned with mini-batch boundaries, so that the checkpoint size is low. Since most DLT jobs 602 have a mini-batch time of less than 1 second, and certainly less than 5 seconds, a 15-second time buffer is sufficient to ensure technical efficiency. However, in the pathological case where a job takes longer for a mini-batch, the agent may force a checkpoint after 10 seconds (without waiting for a low-memory threshold).
Unlike conventional programs, DLT jobs 602 frequently use GPUs, and GPU state is not easily migrated. Various libraries exist for checkpointing programs that run on the CPU, and aspects of the present disclosure may operate with any functionality that can checkpoint an entire CPU address space. Such checkpointing libraries are able to checkpoint a process, move it to a new machine, and restart it there. They are not, however, suitable for GPUs, because DLT processes embed a great deal of proprietary state in the GPU that the checkpointing library does not understand, and because the GPU driver is proprietary it is not possible for the checkpointing library to handle that state.
The disclosed examples checkpoint the client process and rebuild it in such a manner that the server process is stateless. The server process can then be stopped, in which case the job is migrated to another server node, and recreated when it is brought up on the other server node. To speed up recreating the server process, some implementations and examples log the GPU calls so that the same GPU state can be recreated on the new server node. Furthermore, some examples capture the memory of the initial server before the initial server is shut down, so that the same memory can be recreated at the new server node. For example, the server's GPU memory may be copied to disk, and the same pointers may then be assigned on the new server. Thus, the disclosed examples copy the useful state from the client and the GPU state from the server, checkpoint only the useful client state, and recreate the server process. The sequence may then continue at the new server node.
An example implementation is described next. However, those skilled in the art will note that this can be implemented in any cluster using any container technology (not just Kubernetes). Furthermore, aspects of the present disclosure may operate with any scheduler (not just the schedulers described below). Some examples use Kubernetes as the cluster manager, where a custom scheduler assigns jobs to nodes. In this example, jobs are submitted as Docker containers. An example scheduler is implemented in Scala using the Akka Actors library for concurrency and performs remote procedure calls with a remote-procedure-call library (gRPC).
In some examples, there are four main modules: the manager, the scheduler, the executor, and the client. The manager exposes REST APIs and gRPC endpoints for clients to connect to the scheduler. The scheduler makes decisions such as placement, migration, ticket distribution, bonus-token management, and transactions. In some examples, there is a global executor that performs gang scheduling of multi-server jobs and a local executor for each server in the cluster; together they are responsible for running jobs on the servers in proportion to the tickets assigned by the scheduler. The client, which runs within the container alongside the job, also exposes a gRPC endpoint and is responsible for receiving commands from the executor to perform operations such as suspend/resume, checkpoint/migrate, reporting job metadata, and reporting the status of the running job.
A mechanism utilized by the disclosed examples is the ability to migrate jobs between nodes. To migrate a job, the DLT job is checkpointed on demand and then restored on a different node. Some DLT jobs are written with checkpointing capability and thus can be restored from the last checkpoint (if any). Typically, DLT jobs that use checkpoints are checkpointed only at each epoch, and one epoch may last for several hours or more. While such checkpoints are useful for surviving occasional server failures, the examples require finer-granularity checkpointing to achieve fairness and efficiency and to avoid losing valuable computation time. Thus, an automatic, on-demand checkpointing mechanism is implemented.
To support job migration, the PyTorch and TensorFlow frameworks were modified. While generic process-migration tools exist, they cannot handle processes with GPU state. In some implementations, a proxy process is forked from the main process. Some or all GPU calls made by the process may be intercepted and directed to the proxy process. In this way, the address space of the main process holds only CPU state and can easily be checkpointed. The example proxy process is responsible for: 1) translating all GPU handles, such as streams, contexts, etc.; 2) maintaining a log of all state-changing GPU calls so that they can be replayed upon recovery; and 3) memory management of GPU memory. The memory manager maps the virtual address space to the physical GPU address space in a manner that is consistent across the migration, so that pointers to GPU memory remain completely transparent to the parent process. Upon checkpointing, the proxy's memory manager copies the GPU state to the parent's CPU memory and stops. The parent process can then be migrated. Upon recovery, the proxy process replays the log of state-changing GPU calls and copies the GPU memory back. All communication between the proxy and parent processes is handled via shared memory with negligible overhead. The proxy implementation remains unchanged between PyTorch and TensorFlow and requires minimal modification to the actual frameworks.
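The split between the host process and the proxy process might be sketched as follows; a multiprocessing pipe stands in for the shared-memory channel, and gpu_execute is a stand-in for the real driver call, so this only illustrates the forwarding and replay-log idea rather than the actual mechanism.

import multiprocessing as mp

def proxy_server(conn, gpu_execute):
    # Runs in the proxy process: owns all GPU state plus the replay log.
    replay_log = []
    while True:
        call = conn.recv()
        if call is None:
            break
        name, args, stateful = call
        if stateful:
            replay_log.append((name, args))  # replayed after migration
        conn.send(gpu_execute(name, args))   # return value goes back

class ProxyClient:
    # Runs in the main (host) process, whose address space stays GPU-free.

    def __init__(self, gpu_execute):
        self.conn, server_conn = mp.Pipe()
        self.proc = mp.Process(target=proxy_server,
                               args=(server_conn, gpu_execute))
        self.proc.start()

    def call(self, name, args, stateful=False):
        self.conn.send((name, args, stateful))
        return self.conn.recv()

    def stop(self):
        self.conn.send(None)
        self.proc.join()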
The example overhead of suspend-resume is also similar, e.g., about 100-250 ms depending on the size of the model. However, some examples reduce migration overhead by implementing a three-phase context switch called suspend-preload-resume. When the framework is notified to suspend, it completes the suspension in about 100 milliseconds by copying the minimal data in the GPU (proxy process) to CPU memory (parent process) at the end of the mini-batch, allowing the scheduler to run another job on the GPU. If the job is to be migrated across servers, the scheduler checkpoints the job container and resumes it on the target server. The framework then waits for a preload notification. When it receives the preload, it sets up state on the new GPU by replaying the log of all stateful operations, but does not yet resume. The preload thus hides the roughly 5-second delay of GPU context initialization. Finally, when the framework is notified to resume, it copies the data back to GPU memory, which (in some examples) requires approximately 100 milliseconds, and quickly resumes GPU computation. Migration therefore occurs mostly in the background while other jobs utilize the GPU.
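A schematic view of the three phases follows, with the proxy-process operations reduced to assumed helper methods; the timings quoted above are not modeled.

class ThreePhaseContextSwitch:
    # Sketch of suspend-preload-resume; `proxy` is an assumed handle to the
    # proxy process with copy and replay helpers.

    def __init__(self, proxy):
        self.proxy = proxy
        self.host_copy = None

    def suspend(self):
        # Copy the minimal live GPU state into host memory at the end of a
        # mini-batch so the scheduler can hand the GPU to another job.
        self.host_copy = self.proxy.copy_gpu_state_to_host()

    def preload(self, new_gpu):
        # On the target node: create the GPU context and replay the log of
        # stateful calls, hiding the multi-second initialization cost.
        self.proxy.create_context(new_gpu)
        self.proxy.replay_stateful_calls()

    def resume(self):
        # Copy the saved state back into GPU memory and continue training.
        self.proxy.copy_host_state_to_gpu(self.host_copy)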
The state tracked inside the GPU is maintained by closed-source proprietary software running on the GPU and CPU. For example, a user may have a PyTorch program that runs partly on the CPU and offloads computation to the GPU; the more expensive parts of the job typically run on the GPU. The state of the DL job therefore spans both the CPU and the GPU, because some computations are done on the CPU and others on the GPU. A checkpointing library does not know what to do with the state tracked in the GPU, which also pollutes the address space in the CPU. To address this technical problem, examples keep the host address space of the CPU clean by implementing a split-process architecture for DLT job execution. When the GPU is called, the GPU call is not executed in the host address space. Instead, GPU calls are executed in a separate process (also referred to as the proxy process) that interacts with the GPU. This ensures that only the address space of the proxy process is polluted, while the host process remains pristine.
The disclosed implementations and examples provide a highly scalable AI infrastructure. The service is designed to scale across hundreds of data centers and tens of thousands of accelerators, with trained models of trillions of parameters. The service may also be configured to cross geographic boundaries. The architecture is also capable of treating training jobs and inference services equally, whether they originate from a data center or from an on-premises source.
While aspects of the disclosure have been described in terms of various examples and their associated operations, those skilled in the art will recognize that combinations of operations from any number of the different examples are also within the scope of aspects of the disclosure.
While the examples provided relate to implementations using GPUs, it should be understood that FPGAs, ASICs, or other specialized hardware may be similarly used to perform the functionality described herein.
Fig. 7 illustrates a block diagram of a planetary-scale AI infrastructure network environment (network environment 700) that implements migration of DLT jobs 602 from an original node 604 to a destination node 608, in accordance with an embodiment. Numerous computing devices communicate with cloud environment 700 over network 730. Cloud environment 700 represents a cloud infrastructure comprising a number of computer servers 701, which may be any type of server or remote computing device, and which may be dedicated, relational, virtual, private, public, hybrid, or other cloud-based resources. As depicted, servers 701 include a mix of physical servers 701a and virtual servers 701n, the latter being established as VMs running inside cloud environment 700. For clarity, these physical servers 701a and virtual servers 701n are collectively referred to as "servers 701" unless otherwise indicated. In some implementations and examples, cloud environment 700 is implemented as a large-scale (e.g., planetary) cloud environment (e.g., running COSMOS operations developed by Microsoft) that processes extremely large amounts of data. Such implementations and examples may operate the various servers 701 partially or fully worldwide.
The server 701 includes or has access to one or more processors 702, I/O ports 704, communication interfaces 706, computer storage memory 708, I/O components 710, and communication paths 715.
Memory 708 represents a number of computer storage memories and memory devices that store executable instructions and data for automatically adjusting the operating parameters of cloud environment 700. Memory 708 stores executable instructions for checkpoint service 600, discussed previously and shown in FIG. 6, which checkpoints DLT job 602 prior to migration from original node 604 to destination node 608 through proxy node 606. In addition, memory 708 stores instructions for migration service 712, which moves DLT job 602 from original node 604 to destination node 608 using the disclosed techniques referenced herein. Further, memory 708 stores executable instructions for memory manager 714, which handles allocation of DLT jobs 602 to different memory locations throughout network environment 700. Checkpoint service 600, migration service 712, and memory manager 714 may be implemented in software, firmware, hardware, or a combination thereof in various implementations and examples.
In some examples, to support job migration, migration service 712 slightly modifies the PyTorch and TensorFlow frameworks. In other examples, other frameworks are used. Some implementations handle unmodified user code while requiring only minor changes to the two frameworks. While generic process-migration tools exist, they cannot handle processes with GPU state. In some implementations, proxy process 620 within checkpoint service 600 is forked from the host process. Some or all GPU calls made by the process are intercepted and directed to proxy process 620. In this way, the address space of the main process holds only CPU state and can easily be checkpointed. Proxy process 620 is responsible for: 1) translating all GPU handles, such as streams, contexts, etc.; 2) maintaining a log of all state-changing GPU calls so that they can be replayed upon recovery; and 3) memory management of GPU memory. Memory manager 714 maps the virtual address space to the physical GPU address space in a manner that is consistent throughout the migration process, so that pointers to GPU memory remain completely transparent to the parent process. Upon checkpointing, the proxy's memory manager copies the GPU state to the parent's CPU memory and stops. The parent process can then be migrated. Upon recovery, proxy process 620 replays the log of state-changing GPU calls and copies the GPU memory back. All communication between the proxy and parent processes is handled via shared memory with negligible overhead. The proxy implementation remains unchanged between PyTorch and TensorFlow and requires minimal modification to the actual frameworks.
Cloud resource overhead (e.g., CPU, GPU, memory, VM, etc.) for suspend and resume (suspend-resume) is similar, e.g., about 100-250 milliseconds (ms) depending on the size of the model. In some implementations, migration service 712 reduces migration overhead by implementing a three-phase context switch known as suspend-preload-resume. In some examples, when migration service 712 is notified to suspend, it completes the suspension within about 100 ms by copying GPU memory 632 (using proxy process 620) to the parent process's CPU memory 629 at the end of the mini-batch. This allows the scheduler (global or regional) to run another DLT job 602 on GPU 611 of original node 604.
Some examples perform checkpointing across numerous GPUs. For example, a DLT job 602 may run on hundreds of GPUs. Because these hundreds of GPUs work together, a consistent checkpoint must be taken. To this end, examples apply or use a "distributed barrier" across the multiple GPUs, discussed in more detail below.
To implement the multi-GPU barrier example, barrier mechanism 713 performs the following functions. A "meta AllReduce" is performed before the actual AllReduce is performed. In some cases, additional interceptors are encoded into the GPU communication library or other similar calls. The meta AllReduce is executed asynchronously in the background to ensure that no latency problems occur. When any of the disclosed schedulers decides to migrate a DLT job 602, the migration is done on demand. When an AllReduce is performed, a sum is calculated across all workers. The disclosed examples use a similar sum to quickly determine how many AllReduces each worker has issued. A maximum AllReduce count is calculated, giving a barrier point at which all workers stop so that migration can take place.
In some examples, the barrier is implemented through multi-GPU communication API interception in the following manner. APIs are intercepted, similar to the interception of other libraries and proxied calls. Stateful APIs (e.g., CommInitRank) may be replayed upon recovery. The Comm_t returned by this operation is virtualized and transparently recreated with a new unique identifier (UniqueId) upon recovery. The UniqueId contains the socket address of the main process of barrier mechanism 713.
The host process creates a UniqueId before executing CommInitRank. The other workers obtain the UniqueId from the host process out of band before they, too, can execute CommInitRank. In one flow, the PyTorch FileStore/TCPStore or similar is used, and this sharing is the responsibility of PyTorch/Horovod/TF. In some examples, this out-of-band (OOB) channel is not used after the initial exchange of the UniqueId. Otherwise, in the resume flow, this may be handled by the AISC framework. Collective APIs (e.g., AllReduce) are also intercepted.
In some implementations and examples, the protocol of the OOB channel is implemented in the following manner. The host process receives the checkpoint signal and coordinates checkpointing between the computing devices of original node 604, as previously discussed. The processes implement a synchronization protocol over the OOB channel using coordinator threads, and each process tracks the following mutex-protected variables: current_collective_count, maybe_stall, max_collective_count, and so on. A collective call proxied in the main thread proceeds if either maybe_stall is false or current_collective_count is less than max_collective_count. If there is no green signal from the distributed-worker coordination thread, the intercepted call is blocked; otherwise, the call is proxied.
In operation, after receiving a checkpoint signal the main process waits until GPU memory drops to or is at a low-memory point, and then broadcasts a quiesce signal to the computing devices (workers) of original node 604. The workers at original node 604 then set maybe_stall to true and respond to the main process with their current collective count (current_collective_count). In turn, the master process takes the maximum of all current_collective_count values and broadcasts this maximum current collective count to all workers of original node 604. Each worker sets its maximum collective count to the broadcast value, issues collectives until its current collective count reaches that maximum, and then performs checkpointing. After checkpointing, each worker resumes by setting maybe_stall to false.
According to these concepts, the example host process operates as follows:
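The listing below is a Python sketch reconstructed from the protocol description above; the oob channel object, its send/recv/broadcast methods, and the polling details are assumptions.

import time

def host_process(oob, workers, checkpoint_requested, at_low_memory_point):
    # Coordinator: wait for a safe moment, quiesce the workers, then agree
    # on the maximum collective count as the barrier point.
    while not (checkpoint_requested() and at_low_memory_point()):
        time.sleep(0.01)
    oob.broadcast("QUIESCE")
    counts = [oob.recv(worker) for worker in workers]  # collective counts
    oob.broadcast(("MAX_COUNT", max(counts)))

def worker_process(oob, state, issue_next_collective, do_checkpoint):
    # Worker: stall new collectives, report progress, catch up to the
    # agreed count, checkpoint, then resume.
    if oob.recv_from_host() == "QUIESCE":
        state.maybe_stall = True
        oob.send_to_host(state.current_collective_count)
        _, max_count = oob.recv_from_host()
        while state.current_collective_count < max_count:
            issue_next_collective()
            state.current_collective_count += 1
        do_checkpoint()
        state.maybe_stall = False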
Additionally or alternatively, some implementations and examples use the multi-GPU communication primitives as the only cross-worker communication, without an OOB channel. These implementations and examples similarly intercept stateful and collective APIs. The host process waits for a checkpoint request. For each AllReduce, each worker also enqueues an asynchronous (async) meta AllReduce on an exclusive stream to compute the following: sum(needs_barrier) and sum(acked_barrier). The parameter needs_barrier is contributed by rank 0 and is set to 1 when checkpointing is initiated; the other ranks contribute 0.
In addition, provision is made for the checkpoint itself. Whenever a worker detects that sum(needs_barrier) is equal to 1, the worker synchronously executes its meta AllReduces and sets acked_barrier to 1. Checkpointing then proceeds once a subsequent meta AllReduce yields sum(acked_barrier) equal to world_size, i.e., once every worker in the job has acknowledged the barrier.
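One way to express the needs_barrier/acked_barrier exchange is the sketch below; comm.allreduce_sum stands in for the asynchronous meta AllReduce issued on a dedicated stream, and the state object is an assumption.

def meta_allreduce_step(comm, rank, world_size, wants_checkpoint,
                        do_checkpoint, state):
    # Piggybacked on each data AllReduce: rank 0 raises needs_barrier when a
    # checkpoint has been requested; every rank checkpoints once it sees the
    # flag, then acknowledges via acked_barrier.
    needs = 1 if (rank == 0 and wants_checkpoint()) else 0
    total_needs, total_acked = comm.allreduce_sum([needs, state.acked])
    if total_needs == 1 and state.acked == 0:
        do_checkpoint()
        state.acked = 1
    # True once every rank has checkpointed and acknowledged the barrier.
    return total_acked == world_size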
Additionally or alternatively, a heartbeat meta AllReduce may be implemented in the following manner. The host process waits for a checkpoint request. Each rank is assigned a meta AllReduce budget as follows: (1) the budget is initially set to 1, (2) the budget is increased by 1 after each data AllReduce call, and (3) the budget is increased by 1 again when all ranks time out. Each rank (with budget available) issues a meta AllReduce if a timeout occurs (e.g., a timer T expires since the last collective operation, in which case the meta AllReduce may be issued synchronously) or when a data AllReduce is issued before the timeout (in which case it may be issued asynchronously). In some implementations and examples, the meta AllReduce computes the sum of: (1) needs_barrier: rank 0 sets it to 1 if it wants to checkpoint, otherwise 0; and (2) issued_on_timeout: set to 1 if the meta AllReduce was issued due to a timeout, otherwise 0.
If sum(issued_on_timeout) equals world_size, all ranks have timed out, all meta AllReduces are synchronous, and all ranks are in lockstep. In addition, all ranks decide to issue one more meta AllReduce before the next data AllReduce, e.g., each rank increases its meta AllReduce budget by 1. This may be repeated while sum(issued_on_timeout) < world_size. Using these operations, a global ordering can be provided for the heartbeats.
Each AllReduce call from the user script is put into a queue for later scheduling by a background thread. The background thread frequently runs a synchronization meta AllReduce to agree on which AllReduce is to be executed next. Each worker maintains its own current_allreduce_count, the number of AllReduce calls made by the user script (but not necessarily scheduled yet).
More specifically, in some implementations and examples, each synchronization meta AllReduce computes max(needs_checkpoint) and max(current_allreduce_count). If max(needs_checkpoint) is equal to 0, the background thread schedules the queued AllReduces until current_allreduce_count reaches max(current_allreduce_count), and then executes the synchronization meta AllReduce again (after a delay such as 5 milliseconds, to avoid spinning and to allow the queue to grow). If max(needs_checkpoint) is equal to 1, the barrier has been reached and checkpointing can be performed immediately.
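The background-thread consensus loop might look like the following sketch; comm.allreduce_max stands in for the synchronous meta AllReduce, and the state fields (needs_checkpoint, current_allreduce_count, scheduled_count, queue) are assumptions drawn from the description above.

import time

def background_scheduler(comm, state, schedule_next_allreduce, do_checkpoint):
    # Agree on how many AllReduces every rank has enqueued, issue queued
    # collectives up to that count, and checkpoint once all ranks want to.
    while True:
        wants, target = comm.allreduce_max(
            [state.needs_checkpoint, state.current_allreduce_count])
        if wants == 1:
            do_checkpoint()            # the barrier has been reached
            state.needs_checkpoint = 0
            continue
        while state.scheduled_count < target and state.queue:
            schedule_next_allreduce()
            state.scheduled_count += 1
        time.sleep(0.005)              # let the queue grow before next round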
To handle GPU calls on a stream, some implementations and examples mark the stream associated with a queued AllReduce call as dirty until the AllReduce is actually scheduled by the background thread. All calls (synchronization, event, AllReduce, etc.) on a dirty stream wait for the background thread to clean that stream before the call can be sent to the GPU of proxy node 606.
In some implementations and examples, the background thread operates asynchronously. All incoming collectives are queued. The background thread is responsible for metadata consensus and issues the queued collectives to the multi-GPU communication primitives. The communication channel may be an existing transport or an OOB transport, such as TCP, FileStore, RedisStore, etc. Working with one of the queues, OOB channels, or stores, at each cycle a synchronous blocking metadata send+recv is performed to determine two things across the workers: lsni = num_allreduces_sent_so_far plus max(allreduce_queue_size_across_workers), and needs_checkpoint. If needs_checkpoint is false, it is safe to issue up to lsni AllReduces, after which a short sleep is performed to allow the queue to grow, and then another blocking synchronous metadata send+recv is performed. In addition, the metadata send+recv may be asynchronous.
Additionally or alternatively, all incoming collectives may be queued. The background thread is responsible for worker consensus and sends the queued collectives to the multi-GPU communication primitives. The communication channels used include either (1) an existing transport or (2) an OOB transport. In some implementations and examples, this is performed using the following initial state:
global_N_0 = 0
NumUnilateral(int N) is a function that returns some number <= N, with a return value > 0 if N > 0
needs_barrier_0 = false
wait_time = 10 ms
Then, if needs_barrier is set to true, the barrier has been successfully acquired. If not, each worker performs the following. The worker ensures that wait_time has elapsed since the last update to global_N_i. local_N_i is set to NumUnilateral(num_pending_collectives_in_queue). An asynchronous metadata send+recv is issued to agree on the next needs_barrier value (taken from rank 0) and on global_N_{i+1} = max_over_all_workers(local_N_i). Meanwhile, local_N_i collectives are issued unilaterally. The worker then waits for the asynchronous metadata send+recv to complete. If local_N_i < global_N_{i+1}, more collectives are issued until local_N_i equals global_N_{i+1}.
If a DLT job 602 is to be migrated across servers 701 (i.e., from original node 604 to destination node 608), the disclosed scheduler checkpoints DLT job 602 using checkpointing service 600, and migration service 712 resumes DLT job 602 on destination node 608. Migration service 712 may be configured to wait for a preload notification. When migration service 712 receives the preload notification, it sets up state on the new GPU 618 of destination node 608 by replaying the log of all stateful operations (e.g., from GPU state 630 and/or CPU state 627), but without yet restoring. The preload thus hides the latency (e.g., 5 seconds) of GPU context initialization.
When migration service 712 is notified to resume, it copies the data back into the GPU memory of GPU 618 on destination node 608, which (in some examples) takes approximately 100 ms. Migration service 712 then quickly resumes GPU computation on destination node 608. Migration therefore occurs mostly in the background while other DLT jobs 602 utilize the GPUs.
In some implementations and examples, GPU state 630 is tracked inside GPU 611 of original node 604 by closed-source proprietary software running on GPU 611 and CPU 610. For example, a user may have a PyTorch program, part of which runs on the CPU and which sends computation to the GPU; the more expensive parts of the job typically run on the GPU. The state of the DLT job spans the CPU and GPU, because some computations are done on the CPU and others on the GPU. A checkpointing library does not know what to do with the state tracked in the GPU, which also contaminates the address space in the CPU. To address this problem, some examples described herein keep the host address space of the CPU clean by implementing a split-process architecture for DLT job execution. When the GPU is called, the GPU call is not executed in the host address space. Instead, GPU calls are executed in a separate process (also referred to as the proxy process) that interacts with the GPU. This ensures that only the proxy process's address space is contaminated, while the host process remains pristine.
The disclosed implementations and examples provide a highly scalable AI infrastructure. Checkpointing service 600, migration service 712, and memory manager 714 are designed to scale across hundreds of data centers and tens of thousands of accelerators, with trained models of tens of trillions of parameters. The service may also be configured to span geographic boundaries. The architecture is also capable of treating training jobs and inference services equally, whether they come from a data center or from an on-premises source.
Other examples perform checkpointing across numerous GPUs. For example, a DLT job 602 may run on hundreds of GPUs. Because these hundreds of GPUs work together, a consistent checkpoint must be taken in a coordinated manner. To this end, checkpoint service 600 applies and uses a "distributed barrier" protocol across the multiple GPUs 611 of the original node. At run time, each worker of original node 604 runs a mini-batch, and at the end of every mini-batch the workers exchange results. At the end of a mini-batch, each worker determines a gradient and then executes one or more AllReduces. For some GPUs 611, the AllReduce library is part of a library that provides inter-GPU communication primitives. Some examples intercept the AllReduce calls as they occur, effectively piggybacking the new protocol on top of the regular AllReduces performed by the user. Other examples introduce a new protocol for AllReduce calls of a similar type.
Alternatively, checkpoint service 600 may direct migration service 712 to implement the multi-GPU barrier via barrier mechanism 713 by executing a meta AllReduce before executing the actual AllReduce. This requires encoding some additional interceptors that interact with the library calls. The meta AllReduce is executed asynchronously in the background to ensure that no latency problems occur. When the disclosed scheduler decides to migrate a job, the migration is done on demand. When an AllReduce is executed, a sum is calculated across all of the workers/GPUs 611. The disclosed examples use a similar sum to quickly calculate how many AllReduces each worker has issued. A maximum AllReduce count is calculated, giving a barrier point at which all workers stop so that migration can take place.
FIG. 8 illustrates a flow chart depicting an operational procedure 800 for checkpointing a DLT job at one (original) node in a cloud computing environment and restoring the DLT job from the checkpointed state on a different (destination) node. Operational flow 800 involves processing a DLT job at the original node, as shown at 802. This continues until the DLT job completes processing, or until a global scheduler, regional scheduler, or coordinator service schedules the DLT job to be migrated to the destination node, as shown at 804. If migration is scheduled, a barrier may be established between the workers of the original node, enabling communication through multi-GPU communication primitives (on background threads) or through OOB channels, as shown at 806.
As shown at 808, the GPU state, GPU memory, CPU state, and CPU memory, or a combination thereof, are captured and, as shown at 810, moved into the shared memory shared between the original node and the proxy node, as previously discussed. The checkpointed state is defined at the proxy node by the GPU state and the CPU state, or by any combination of the CPU state, CPU memory, GPU state, and GPU memory, as shown at 812. The DLT job is migrated to the destination node in the checkpointed state, as shown at 814. And processing of the DLT job is resumed from the checkpointed state at the destination node, as shown at 816.
FIG. 9 illustrates a flow diagram depicting an operational procedure 900 for checkpointing a DLT job across a plurality of first nodes and restoring the DLT job from a checkpointed state across a plurality of second nodes different from the first nodes in a cloud computing environment. Operational flow 900 involves isolating the GPU-related activity of the DLT job across a first set of original nodes that are different from the destination nodes, as shown at 902. This isolation may be accomplished using the proxy process discussed above. During isolation, the DLT job is allowed to continue computation in one or more host processes of the first nodes, as shown at 904. In some implementations and examples, the computation includes Python code with PT/TF training loops being executed, as shown at 906, and the proxy process used for isolating the DLT job is kept stateless across multiple checkpoints, as shown at 908, until migration occurs. These two conditions are implemented and maintained until the DLT job completes processing or is scheduled to be migrated, the latter as shown at 910.
When the global scheduler or the regional scheduler schedules the migration, the DLT job is migrated to the destination node, as shown at 912; the GPU state, GPU memory, CPU state, and CPU memory, or a combination thereof, are captured and, as shown at 914, moved into the shared memory shared between the original node and the proxy node, as previously discussed. The checkpointed state may then be defined on the proxy node by the GPU state and the CPU state, or by any combination of the CPU state, CPU memory, GPU state, and GPU memory, as shown at 916. The DLT job may then be migrated to the destination node in the checkpointed state, as shown at 918. And, as shown at 920, processing of the DLT job may resume from the checkpointed state at the destination node.
Example Operating Environment
Fig. 10 is a block diagram of an example computing device 1000 for implementing aspects disclosed herein, and is designated generally as computing device 1000. Computing device 1000 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the examples disclosed herein. Neither should computing device 1000 be interpreted as having any dependency or requirement relating to any one or combination of the components/modules illustrated. The examples disclosed herein may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program components, including routines, programs, objects, components, data structures, and the like, refer to code that performs particular tasks or implements particular abstract data types. The disclosed examples may be practiced in a wide variety of system configurations, including personal computers, laptop computers, smart phones, mobile tablets, hand-held devices, consumer electronics, special-purpose computing devices, and the like. The disclosed examples may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
Computing device 1000 includes a bus 1010 that directly or indirectly couples the following devices: computer storage memory 1012, one or more processors 1014, one or more presentation components 1016, input/output (I/O) ports 1018, I/O components 1020, a power supply 1022, and a network component 1024. Although computing device 1000 is depicted as a single device, multiple computing devices 1000 may work together and share the depicted device resources. For example, memory 1012 may be distributed across multiple devices, and processors 1014 may be housed with different devices.
Bus 1010 represents what may be one or more buses (such as an address bus, a data bus, or a combination thereof). Although the various blocks of FIG. 10 are shown with lines for the sake of clarity, various components may be realized by alternate representations. For example, a presentation component such as a display device is an I/O component in some examples, and some examples of processors have their own memory. No distinction is made between such categories as "workstation," "server," "laptop," "handheld device," and so on, as all are contemplated within the scope of FIG. 10 and the references herein to a "computing device." Memory 1012 may take the form of the computer storage media referenced below and is operable to provide storage of computer-readable instructions, data structures, program modules, and other data for computing device 1000. In some examples, memory 1012 stores one or more of an operating system, a general application platform, or other program modules and program data. Accordingly, memory 1012 is capable of storing and accessing data 1012a and instructions 1012b that are executable by processor 1014 and configured to perform the various operations disclosed herein.
In some examples, memory 1012 includes computer storage media in the form of volatile and/or nonvolatile memory, removable or non-removable memory, data disks in a virtual environment, or a combination thereof. Memory 1012 may include any number of memories associated with computing device 1000 or accessible by computing device 1000. Memory 1012 may be internal to computing device 1000 (as shown in FIG. 10), external to computing device 1000 (not shown), or both (not shown). Examples of memory 1012 include, but are not limited to, random access memory (RAM); read-only memory (ROM); electrically erasable programmable read-only memory (EEPROM); flash memory or other memory technologies; CD-ROM, digital versatile disks (DVDs), or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices; memory wired into an analog computing device; or any other medium used to encode desired information and be accessed by computing device 1000. Additionally or alternatively, memory 1012 may be distributed across multiple computing devices 1000, for example, in a virtualized environment in which instruction processing is performed on multiple computing devices 1000. For the purposes of this disclosure, "computer storage medium," "computer storage memory," "memory," and "memory device" are synonymous terms for computer storage memory 1012, and none of these terms include carrier waves or propagated signaling.
Processor 1014 may include any number of processing units that read data from various entities such as memory 1012 or I/O components 1020. In particular, processor 1014 is programmed to execute computer-executable instructions for implementing aspects of the disclosure. The instructions may be executed by the processor, by multiple processors within computing device 1000, or by a processor external to client computing device 1000. In some examples, processor 1014 is programmed to execute instructions such as those illustrated in the flowcharts discussed below and depicted in the accompanying figures. Further, in some examples, processor 1014 represents an implementation of analog techniques for performing the operations described herein. For example, the operations may be performed by an analog client computing device 1000 and/or a digital client computing device 1000. Presentation components 1016 present data indications to a user or other device. Exemplary presentation components include display devices, speakers, printing components, vibration components, and the like. Those skilled in the art will understand and appreciate that computer data may be presented in several ways, such as visually in a graphical user interface (GUI), audibly through speakers, wirelessly between computing devices 1000, through a wired connection, or otherwise. I/O ports 1018 allow computing device 1000 to be logically coupled to other devices, including I/O components 1020, some of which may be built in. Example I/O components 1020 include, but are not limited to, a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, or the like.
Computing device 1000 may operate in a networked environment using logical connections to one or more remote computers via network component 1024. In some examples, network component 1024 includes a network interface card and/or computer-executable instructions (e.g., a driver) for operating the network interface card. Communication between computing device 1000 and other devices may occur over any wired or wireless connection using any protocol or mechanism. In some examples, network component 1024 is operable to communicate data between devices wirelessly using short-range communication technologies (e.g., near-field communication (NFC), Bluetooth brand communication, etc.), or a combination thereof, over public, private, or hybrid (public and private) networks using a transfer protocol. Network component 1024 communicates with cloud resources 1028 across network 1030 via wireless communication link 1026 and/or wired communication link 1026a. Various examples of communication links 1026 and 1026a include wireless connections, wired connections, and/or dedicated links, and in some examples at least a portion is routed through the Internet.
Although described in connection with the example computing device 1000, examples of the disclosure are operational with numerous other general purpose or special purpose computing system environments, configurations, or devices. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with aspects of the disclosure include, but are not limited to, smart phones, mobile tablets, mobile computing devices, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, gaming machines, microprocessor-based systems, set top boxes, programmable consumer electronics, mobile phones, mobile computing and/or communication devices in the form of wearable or accessory pieces (e.g., watches, glasses, headphones, or earplugs), network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, virtual Reality (VR) devices, augmented Reality (AR) devices, mixed Reality (MR) devices, holographic devices, and the like. Such a system or device may accept input from a user in any manner, including from an input device such as a keyboard or pointing device, via gesture input, proximity input (such as by hovering), and/or via voice input.
Examples of the present disclosure may be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers or other devices in the form of software, firmware, hardware, or combinations thereof. The computer-executable instructions may be organized into one or more computer-executable components or modules. Generally, program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types. Aspects of the disclosure may be implemented with any number and organization of such components or modules. For example, aspects of the present disclosure are not limited to the specific computer-executable instructions or the specific components or modules illustrated in the figures and described herein. Other examples of the disclosure may include different computer-executable instructions or components having more or less functionality than illustrated and described herein.
In examples involving general-purpose computers, aspects of the present disclosure transform the general-purpose computer into a special-purpose computing device when configured to execute the instructions described herein.
An example method for checkpointing a DLT job at one node in a cloud computing environment and restoring the DLT job from the checkpointed state on a different node includes: capturing a GPU state of a GPU executing the DLT job, wherein the GPU state comprises GPU data, the GPU data comprising model parameters and an optimizer state located in the GPU at the time of checkpointing; capturing a CPU state of a CPU executing the DLT job; storing the CPU state and the GPU state in a shared memory accessible to the proxy node, the checkpointed state being defined at least in part by the GPU state and the CPU state; migrating the DLT job to the different node in the checkpointed state using the GPU state and the CPU state; and initiating restoration of processing of the DLT job from the checkpointed state on the different node.
An example method for checkpointing a Deep Learning Training (DLT) job across a plurality of first nodes and restoring the DLT job from a checkpointed state across a plurality of second nodes different from the first nodes in a cloud computing environment includes: receiving a checkpoint request; and, in the event that a checkpoint request is received, performing operations comprising: establishing a barrier between the workers of the first nodes and enabling the workers to communicate across the barrier using at least one multi-GPU communication primitive or out-of-band (OOB) channel; capturing, for a subset or all of the DLT job, GPU states of Graphics Processing Units (GPUs) executing the DLT job on the workers, wherein the GPU states include GPU data including model parameters and optimizer states; capturing, for the subset or all of the DLT job, Central Processing Unit (CPU) states of CPUs executing the DLT job on the workers; and migrating the subset or all of the DLT job to different nodes in the checkpointed state using the GPU states and the CPU states.
An example method for checkpointing a DLT job across a plurality of first nodes and recovering the DLT job from a checkpointed state across a plurality of second nodes different from the first nodes in a cloud computing environment includes: isolating GPU-related activities of the DLT job in the cloud computing environment into a separate proxy process across the plurality of first nodes; and, during the isolating, allowing the DLT job to continue computation in the host process, wherein the computation includes Python code with PT/TF training loops, and wherein the proxy process is stateless across multiple checkpoints.
An example system for operating a cloud computing environment that facilitates suspending DLT jobs and restoring the DLT jobs from checkpointed states in different areas of the cloud computing environment includes: a first node of a plurality of first nodes providing processing resources for a DLT job; and a second node of a plurality of second nodes providing auxiliary processing resources for the DLT job, wherein the DLT job is suspended on the plurality of first nodes by isolating GPU-related activities of the DLT job across the first plurality of nodes in the cloud computing environment into a separate proxy process and, during said isolating, allowing the DLT job to continue computation in a main process, wherein said computation comprises Python code with PT/TF training loops, and wherein the proxy process is stateless across a plurality of checkpoints; and wherein the DLT job is migrated to the plurality of second nodes using the proxy process and the master process.
Alternatively, or in addition to other examples described herein, examples include any combination of the following operations:
-capturing a portion of GPU memory that is active during processing of a DLT job on an original node, the portion of GPU memory containing model parameters;
-restoring DLT jobs on a second GPU and a second CPU different from the GPU and the CPU, respectively;
-saving program state associated with DLT jobs; and restoring DLT jobs on another node by switching control flow to program state;
-isolating GPU-related activities into a separate proxy process having a different address space than the GPU; and calculating the DLT job in a host process associated with the CPU, wherein the proxy process is stateless across checkpoints, isolating any temporary GPU-related mappings to an address space of the proxy process;
-wherein the main process address space remains free of any GPU-related state;
-the proxy server is directed to read GPU function call parameters from the shared memory and to execute corresponding GPU function calls in the address space of the proxy process; and delivering the return value back to the proxy client through the shared memory;
moving the GPU related activities of the DLT job to a separate address space using dynamic library insertion on the GPU related calls, wherein the GPU related calls are intercepted in the main process by the client of the proxy process which serializes the GPU function call parameters and writes them into the shared memory;
-wherein the barrier is established in part by performing a meta AllReduce operation for the worker before performing the AllReduce operation on the worker alone;
-wherein the AllReduce operation comprises calculating a maximum value across workers and an AllReduce count value;
-wherein at least one multi-GPU communication primitive is arranged to operate in a background thread;
-wherein the barrier is established in part by performing a heartbeat meta AllReduce operation;
-wherein the DLT job is a PyTorch job;
-wherein the DLT job is a TensorFlow job;
the address space of the proxy process is contaminated by the GPU-related map, while the main process address space still has no GPU-related state;
-wherein the main process address space can be checkpointed; and
moving the GPU related activities of the DLT job to a separate address space using dynamic library insertion on the GPU related calls, wherein the GPU related calls are intercepted in the main process by the client of the proxy process which serializes the GPU function call parameters and writes them into the shared memory.
The embodiments illustrated and described herein, as well as embodiments not specifically described herein but within the scope of the claims, constitute exemplary means for checkpointing machine learning jobs (or DLT jobs) by at least one processor of a cloud infrastructure platform using one or more proxy nodes and migrating them from one or more original nodes to one or more target nodes.
By way of example, and not limitation, computer readable media comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable memory implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, etc. Computer storage media are tangible and mutually exclusive to communication media. The computer storage media is implemented in hardware and does not include carrier waves and propagated signals. For the purposes of this disclosure, the computer storage medium itself is not a signal. Exemplary computer storage media include hard disk, flash memory drives, solid state memory, phase change random access memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computing device. Rather, communication media typically embodies computer readable instructions, data structures, program modules or the like in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
The order of execution of the operations in the examples of the present disclosure illustrated and described herein is not essential, and may be performed in a different sequential manner in various examples. For example, it is contemplated that executing a particular operation before, contemporaneously with, or after another operation is within the scope of aspects of the disclosure.
When introducing elements of aspects of the present disclosure or the examples thereof, the articles "a," "an," "the," and "said" are intended to mean that there are one or more of the elements. The terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements other than the listed elements. The term "exemplary" is intended to mean "an example of …". The phrase "one or more of the following: A. b and C "means" at least one of a and/or at least one of B and/or at least one of C ".
Having described aspects of the present disclosure in detail, it will be apparent that modifications and variations are possible without departing from the scope of aspects of the disclosure as defined in the appended claims. As various changes could be made in the above constructions, products, and methods without departing from the scope of aspects of the disclosure, it is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.

Claims (15)

1. A method for providing checkpointing of a Deep Learning Training (DLT) job at one node in a cloud computing environment, and restoring the DLT job from a checkpointed state on a different node, the method comprising:
capturing a Graphics Processing Unit (GPU) state of a GPU executing the DLT job, wherein the GPU state includes GPU data including model parameters and optimizer states located in the GPU at checkpointing;
capturing a Central Processing Unit (CPU) state of a CPU executing the DLT job;
migrating the DLT job to the different node in the checkpointed state using the GPU state and the CPU state; and
initiating restoration of processing of the DLT job from the checkpointed state on the different node.
2. The method of claim 1, further comprising: capturing a portion of GPU memory that is active during processing of the DLT job on an original node, the portion of GPU memory including the model parameters.
3. The method of any of claims 1-2, further comprising:
restoring the DLT job on a second GPU and a second CPU that are different from the GPU and the CPU, respectively.
4. A method according to any one of claims 1-3, further comprising:
saving a program state associated with the DLT job; and
restoring the DLT job on another node by switching control flow to the program state.
5. The method of any of claims 1-4, further comprising:
isolating GPU-related activities into a separate proxy process having a different address space than the GPU; and
computing the DLT job in a main process associated with the CPU,
wherein the proxy process is stateless across checkpoints, isolating temporary GPU-related mappings to the address space of the proxy process.
6. The method of any of claims 1-5, further comprising: establishing a barrier in which the main process address space remains free of any GPU-related state.
7. The method of claim 5, further comprising:
directing the proxy server to read GPU function call parameters from the shared memory and to execute the corresponding GPU function call in the address space of the proxy process; and
conveying a return value back to the proxy client through the shared memory.
8. The method of any of claims 1-7, further comprising:
moving the GPU-related activities of the DLT job to a separate address space using dynamic library interposition on GPU-related calls, wherein the GPU-related calls are intercepted in the host process by a client of a proxy process, which serializes the GPU function call parameters and writes them into a shared memory.
9. The method of any of claims 1-8, wherein the DLT job is a PyTorch job.
10. The method of any of claims 1-8, wherein the DLT job is a TensorFlow job.
11. A method for providing checkpointing of a Deep Learning Training (DLT) job across a plurality of first nodes in a cloud computing environment and restoring the DLT job from a checkpointed state across a plurality of second nodes different from the first nodes, the method comprising:
receiving a checkpoint request;
after the receipt of the checkpoint request, performing operations comprising:
establishing a barrier between workers of the first nodes, and
enabling the workers to communicate across the barrier using at least one multi-GPU communication primitive or out-of-band (OOB) channel;
capturing, for a subset or all of the DLT jobs, GPU states of a Graphics Processing Unit (GPU) executing the DLT jobs on the workers, wherein the GPU states include GPU data including model parameters and optimizer states;
capturing, for the subset or all of the DLT jobs, a Central Processing Unit (CPU) state of a CPU executing the DLT jobs on the workers; and
migrating the subset or all of the DLT jobs to different nodes in the checkpointed state using the GPU state and the CPU state.
12. The method of claim 11, wherein the barrier is established using an AllReduce operation to enable each node to be checkpointed.
13. The method of claim 12, wherein the AllReduce operation includes calculating a maximum value and an AllReduce count value across the workers.
14. The method of any of claims 11-13, wherein the at least one multi-GPU communication primitive is configured to operate in a background thread.
15. A system for operating a cloud computing environment, the system facilitating pausing a Deep Learning Training (DLT) job and restoring the DLT job from a checkpointed state in different areas of the cloud computing environment, the system comprising:
a first node of a plurality of first nodes providing processing resources for the DLT job;
a second node of the plurality of first nodes providing auxiliary processing resources for the DLT job,
wherein the DLT job is suspended on the plurality of first nodes by:
isolating GPU-related activity of the DLT job across the plurality of first nodes in the cloud computing environment into a separate proxy process, and
during the quarantining, allowing the DLT job to continue computation in a main process, wherein the computation includes Python code with PT/TF training loops, and wherein the proxy process is stateless across multiple checkpoints; and
wherein at least a portion of the DLT job is migrated to a plurality of second nodes using the proxy process and the main process.
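As a counterpart to the client-side sketch in the description above, the following illustrative loop shows how the proxy server of claims 7 and 8 might read serialized GPU function call parameters from the shared memory, execute the corresponding call in its own address space, and convey the return value back through the same shared memory. The layout, flag values, and dispatch_table are assumptions carried over from the earlier sketch; a real proxy would dispatch into the CUDA runtime and libraries such as cuDNN or cuBLAS, and, being stateless across checkpoints, could simply be killed at a checkpoint and relaunched on the destination node.

    import pickle
    import time
    from multiprocessing import shared_memory

    REQ, RESP, IDLE = 1, 2, 0

    def serve_gpu_calls(shm_name, dispatch_table):
        """Hypothetical proxy-server loop running in its own address space."""
        shm = shared_memory.SharedMemory(name=shm_name)      # attach, do not create
        try:
            while True:                                      # demo loop; a real server would also handle shutdown
                if shm.buf[0] != REQ:                        # wait for a request from the client
                    time.sleep(0.0001)
                    continue
                n = int.from_bytes(shm.buf[1:5], "little")
                func_name, args, kwargs = pickle.loads(bytes(shm.buf[5:5 + n]))
                result = dispatch_table[func_name](*args, **kwargs)   # the GPU call runs here
                payload = pickle.dumps(result)
                shm.buf[1:5] = len(payload).to_bytes(4, "little")
                shm.buf[5:5 + len(payload)] = payload
                shm.buf[0] = RESP                            # convey the return value back
        finally:
            shm.close()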
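Claims 11-14 coordinate checkpointing across workers by establishing a barrier with a multi-GPU communication primitive. A minimal torch.distributed sketch of that idea follows; it assumes the process group has already been initialized by the training script, and the function name and trigger mechanism are illustrative rather than taken from the disclosure. In practice such collectives could run in a background thread, per claim 14, so the training loop is not blocked while the barrier is being negotiated.

    import torch
    import torch.distributed as dist

    def checkpoint_barrier(checkpoint_requested: bool, device="cuda"):
        """Called by every worker between training steps (illustrative only)."""
        # Maximum across workers: if any worker saw the checkpoint request,
        # all workers agree to checkpoint.
        flag = torch.tensor([1.0 if checkpoint_requested else 0.0], device=device)
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
        if flag.item() < 1.0:
            return False                     # no checkpoint pending, keep training
        # AllReduce count: every rank learns how many workers reached the barrier.
        count = torch.tensor([1.0], device=device)
        dist.all_reduce(count, op=dist.ReduceOp.SUM)
        dist.barrier()                       # all ranks now sit at a consistent point
        return True                          # safe to capture GPU and CPU state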
CN202280021860.6A 2021-03-25 2022-03-03 Transparent preemption and migration of planetary-scale computers Pending CN117120978A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
IN202141013182 2021-03-25
US17/359,553 US20220308917A1 (en) 2021-03-25 2021-06-26 Transparent pre-emption and migration for planet-scale computer
US17/359,553 2021-06-26
PCT/US2022/018592 WO2022203828A1 (en) 2021-03-25 2022-03-03 Transparent pre-emption and migration for planet-scale computer

Publications (1)

Publication Number Publication Date
CN117120978A true CN117120978A (en) 2023-11-24

Family

ID=88811498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280021860.6A Pending CN117120978A (en) 2021-03-25 2022-03-03 Transparent preemption and migration of planetary-scale computers

Country Status (1)

Country Link
CN (1) CN117120978A (en)

Similar Documents

Publication Publication Date Title
US10678550B2 (en) Capturing snapshots of offload applications on many-core coprocessors
US10917457B2 (en) Command processing in distributed computing systems
US10275851B1 (en) Checkpointing for GPU-as-a-service in cloud computing environment
Nukada et al. NVCR: A transparent checkpoint-restart library for NVIDIA CUDA
US7523344B2 (en) Method and apparatus for facilitating process migration
Scales et al. The design of a practical system for fault-tolerant virtual machines
JP5258019B2 (en) A predictive method for managing, logging, or replaying non-deterministic operations within the scope of application process execution
KR102121139B1 (en) Recoverable stream processing
US10824522B2 (en) Method, apparatus, and computer program product for generating consistent snapshots without quiescing applications
Rezaei et al. Snapify: Capturing snapshots of offload applications on xeon phi manycore processors
US10613947B2 (en) Saving and restoring storage devices using application-consistent snapshots
Shukla et al. Singularity: Planet-scale, preemptive and elastic scheduling of AI workloads
Cui et al. {HotSnap}: A Hot Distributed Snapshot System For Virtual Machine Cluster
US7389507B2 (en) Operating-system-independent modular programming method for robust just-in-time response to multiple asynchronous data streams
US11722573B2 (en) Artificial intelligence workload migration for planet-scale artificial intelligence infrastructure service
US7945808B2 (en) Fanout connectivity structure for use in facilitating processing within a parallel computing environment
US20220308917A1 (en) Transparent pre-emption and migration for planet-scale computer
EP1815332A1 (en) Process checkpointing and migration in computing systems
CN117120978A (en) Transparent preemption and migration of planetary-scale computers
US11347543B2 (en) Intelligent coprocessor state virtualization
WO2022203828A1 (en) Transparent pre-emption and migration for planet-scale computer
Reghenzani et al. The MIG framework: Enabling transparent process migration in open MPI
US20230396682A1 (en) Artificial intelligence workload migration for planet-scale artificial intelligence infrastructure service
Toan et al. Mpi-cuda applications checkpointing
CN117063159A (en) Artificial intelligence workload migration for global scale artificial intelligence infrastructure services

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination