CN117099083A - Scheduler for a planetary level computing system - Google Patents

Scheduler for a planetary level computing system

Info

Publication number
CN117099083A
Authority
CN
China
Prior art keywords
workloads
workload
node
subset
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280026592.7A
Other languages
Chinese (zh)
Inventor
M·斯瓦塔努
A·卡蒂亚
D·K·舒克拉
R·V·内赫姆
S·辛格哈尔
P·沙玛
N·夸特拉
R·拉姆基
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/361,224 (published as US 2022/0318052 A1)
Application filed by Microsoft Technology Licensing LLC
Priority claimed from PCT/US2022/019214 (published as WO 2022/211981 A1)
Publication of CN117099083A

Landscapes

  • Hardware Redundancy (AREA)

Abstract

The disclosure herein describes scheduling execution of Artificial Intelligence (AI) workloads in a cloud infrastructure platform. A global scheduler receives AI workloads associated with resource ticket values and distributes them to nodes so that the resource ticket values are balanced across the nodes. A local scheduler of each node schedules its assigned AI workloads on the node's resources based on their resource ticket values, and a coordinator service of the local scheduler executes the assigned AI workloads on the node's infrastructure resources according to that schedule. The present disclosure also describes scheduling AI workloads based on priority levels: a scheduler receives AI workloads, each associated with a priority level that indicates its preemption priority during execution, schedules them to execute on an assigned set of nodes based on those priority levels, and then executes them based on the schedule.

Description

Scheduler for a planetary level computing system
Background
The speed and scale of Artificial Intelligence (AI) innovation requires highly scalable, high-performance, robust, and technically efficient AI infrastructure. Because AI workloads are fundamentally different and require purpose-built AI infrastructure, current approaches of incrementally extending existing general-purpose Infrastructure as a Service (IaaS) and cloud-based environments have significant limitations. Furthermore, managing the scheduling of AI workloads on such infrastructure in a fair and efficient manner presents a significant challenge for data scientists attempting to accelerate algorithmic innovation in AI.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
A computerized method for scheduling execution of AI workloads in a cloud infrastructure platform is described. A global scheduler receives a set of AI workloads to execute, wherein each AI workload in the set is associated with a resource ticket value indicating a share of resources with which the AI workload is to be executed. The global scheduler assigns the set of AI workloads to a set of nodes of the cloud infrastructure platform, wherein each node in the set of nodes includes infrastructure resources for executing AI workloads, and wherein the set of AI workloads is assigned to the set of nodes based on balancing the resource ticket values of the AI workloads across the nodes. A local scheduler of a first node in the set of nodes schedules a subset of the AI workloads that is assigned to the first node to execute on the infrastructure resources of the first node, wherein the scheduling of the subset is based on the resource ticket values associated with the subset. A coordinator service of the local scheduler then executes the subset of AI workloads on the infrastructure resources of the first node based on the scheduling of the subset.
Drawings
The present description will be better understood from a reading of the following detailed description taken in conjunction with the drawings in which:
FIG. 1 is a block diagram illustrating a system configured for providing infrastructure services for Artificial Intelligence (AI) workloads;
FIG. 2 is a block diagram illustrating a runtime plane of the system of FIG. 1;
FIG. 3 is a block diagram illustrating an infrastructure plane of the system of FIG. 1;
FIG. 4 is a flow chart illustrating a method for managing AI workloads in the cloud infrastructure platform;
FIG. 5 is a block diagram illustrating a hierarchical scheduling subsystem configured for scheduling AI workloads;
FIG. 6 is a state diagram illustrating the operation of a hierarchical scheduling subsystem configured for scheduling AI workloads;
FIG. 7 is a block diagram illustrating a split scheduling subsystem configured to schedule AI workloads across a plurality of nodes;
FIG. 8 is a flowchart illustrating a method of scheduled execution of AI workloads in the cloud infrastructure platform using the split scheduling subsystem; and
FIG. 9 illustrates, in functional block diagram form, a computing device in accordance with an embodiment.
Corresponding reference characters indicate corresponding parts throughout the several views of the drawings. In fig. 1 to 9, the system is shown as a schematic diagram. The figures may not be drawn to scale.
Detailed Description
Various embodiments will be described in detail with reference to the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. References made throughout this disclosure to specific examples and implementations are provided for illustrative purposes only, and are not meant to limit all examples unless indicated to the contrary.
Aspects of the present disclosure provide computerized methods and systems for scheduling Artificial Intelligence (AI) workloads, such as training and inference workloads, on different pools of infrastructure resources allocated to respective regions. A global scheduler receives a set of AI workloads to execute, wherein each AI workload in the set is associated with a resource ticket value indicating a share of resources with which the AI workload is to be executed. The global scheduler assigns the set of AI workloads to a set of nodes of the cloud infrastructure platform, wherein each node in the set of nodes includes infrastructure resources for executing AI workloads, and wherein the set of AI workloads is assigned to the set of nodes based on balancing the resource ticket values of the AI workloads across the nodes. A local scheduler of a first node in the set of nodes schedules a subset of the AI workloads that is assigned to the first node for execution on the infrastructure resources of the first node, wherein the scheduling of the subset is based on the resource ticket values associated with the subset. A coordinator service of the local scheduler then executes the subset of AI workloads on the infrastructure resources of the first node based on the scheduling of the subset.
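As an illustration of this two-level flow, the following Python sketch shows one way a global scheduler could assign workloads to nodes by balancing resource ticket values, with each node then ordering its own subset locally. This is a minimal sketch and not the claimed implementation; the names (Workload, Node, assign_workloads, local_schedule) and the greedy balancing and largest-ticket-first heuristics are assumptions introduced only for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Workload:
    name: str
    tickets: int  # resource ticket value: the share of resources requested

@dataclass
class Node:
    name: str
    assigned: list = field(default_factory=list)

    def total_tickets(self) -> int:
        return sum(w.tickets for w in self.assigned)

def assign_workloads(workloads, nodes):
    """Global step: place each workload on the node whose accumulated ticket
    value is currently lowest, keeping ticket totals balanced across nodes."""
    for w in sorted(workloads, key=lambda w: w.tickets, reverse=True):
        target = min(nodes, key=lambda n: n.total_tickets())
        target.assigned.append(w)
    return nodes

def local_schedule(node):
    """Local step: order the node's own subset by ticket value."""
    return sorted(node.assigned, key=lambda w: w.tickets, reverse=True)

if __name__ == "__main__":
    jobs = [Workload("train-a", 8), Workload("infer-b", 2), Workload("train-c", 4)]
    pool = [Node("node-1"), Node("node-2")]
    for node in assign_workloads(jobs, pool):
        print(node.name, [w.name for w in local_schedule(node)])
```

In this toy run, the heaviest workload lands on one node and the two lighter ones on the other, keeping the per-node ticket totals close; a coordinator service as described below would then execute each node's subset.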
The described scheduling systems and methods operate in an unconventional manner by splitting scheduling tasks between two levels, global and regional schedulers, such that the global scheduler enables the system to treat all infrastructure resources in a region as a single large pool, while the use of regional schedulers reduces the chances of tasks having to be migrated across regions. Further, in some examples, the described scheduling systems and methods perform load balancing between regions and nodes, enforce fairness among users of the system, and enable automated trading of heterogeneous resources between workloads to improve the efficiency of resource allocation and use.
The cloud infrastructure includes hardware accelerators, computer networks, and storage, all bundled together in a workload-aware manner. AI workloads, such as Deep Learning Training (DLT) and inference, are unique in the manner in which they are written, structured, and executed. Currently, cloud-based general-purpose IaaS is used for DLT and inference tasks, which requires data scientists to set up their AI DLT problems, execute them, and work around whatever issues arise with today's IaaS.
This has led to several trends. DLT workloads are increasing exponentially (e.g., roughly ten times per year). Thus, the industry copes with the surge in DLT workloads by adding more hardware to the IaaS environment, e.g., purchasing more Graphics Processing Units (GPUs) or other hardware accelerators, adding more nodes, and building more distributed clusters. However, if models continue to grow exponentially, growing IaaS at the same exponential rate becomes unsustainable. From a practical point of view, the scale of the cloud infrastructure is limited. Aspects of the present disclosure address these matters, and others, in an unconventional manner.
The disclosed examples provide "singularity" services that increase the efficiency of today's fixed infrastructure resources (including hardware accelerators, networks, storage, etc.) and drive the greatest technical efficiency as models continue to grow or as the amount of DLT tasks and/or other AI workloads increases. For example, the disclosed service operates in an unconventional manner by allowing IaaS or other infrastructure to grow to accommodate a large number of DLT tasks, or to act as a group of smaller IaaS pools that facilitate the processing of different DLT tasks. Because today's general-purpose IaaS has been developed independently of the workload, conventional general-purpose IaaS cannot cope with the dramatic increase in DLT tasks. In contrast, the disclosed service is purpose-built so that such workloads can be handled efficiently in IaaS. The AI infrastructure services of the disclosure can operate with all AI workloads, including training (e.g., workloads for training new or updated AI models) and inference (e.g., workloads for evaluating data and drawing inferences using trained AI models).
More specifically, examples of the disclosed services are fully managed, globally distributed, multi-tenant AI infrastructure services with native support for a wide range of hardware, including, for example, custom silicon, Application-Specific Integrated Circuits (ASICs), Graphics Processing Units (GPUs), and Central Processing Units (CPUs) for DLT training and inference workloads. With the disclosed services, a planetary-scale AI computer infrastructure is available for training and inference at any scale, with the highest technical efficiency and differentiated capabilities, which significantly improves the productivity of data scientists. For example, the disclosed services manage third-party hardware (e.g., GPUs and Field Programmable Gate Arrays (FPGAs)) and first-party AI hardware capacity, and support advanced services, such as Machine Learning (ML) experience-building and tooling, to serve customers. In some examples, the first party is the company operating the cloud environment and the third party is a company other than the one operating the cloud environment.
While the disclosed examples are discussed with respect to DLT and inference tasks, any kind of AI task can be migrated using the disclosed techniques. Such tasks may run for long periods of time (e.g., for hours, days, weeks, or months).
Some disclosed embodiments and examples can operate with Azure cloud services provided by MICROSOFT corporation. Any large-scale cloud infrastructure may utilize the disclosed services.
The following are example capabilities and corresponding technical design descriptions provided by the present disclosure.
The disclosure provides efficient AI training and inference by driving high utilization of resources. High-density containerized hosting provides secure, fine-grained multi-tenant services; for example, Hyper-V isolated containers on bare metal may be used to provide such services. The disclosed service can safely and densely pack multiple tenants on the same host, thereby enabling efficient utilization of compute and AI hardware capacity across the cloud service. High-density packing of workloads belonging to different tenants is thereby realized; for example, an AI workload may run alongside a search workload.
The present disclosure provides multiplexing or interspersing of inference and training workloads on the same shared resource pool. By sharing the same cloud-wide resource pool for inference and training, workloads can be scheduled and packed more efficiently, thereby maximizing utilization of hardware capacity and absorbing fluctuations in the workload mix and in demand on the shared pool. In contrast, in conventional services the inference workloads and training workloads are located on different resource pools, resulting in capacity fragmentation. The disclosed service instead multiplexes training and inference workloads on the same cloud resource pool (e.g., hardware accelerators, computing resources, network resources, storage resources, etc.). This facilitates further saturation of hardware density and dynamic load balancing of cloud resources to accommodate peaks or lulls in the computing demand of training or inference workloads, thereby driving efficiency toward maximum capacity. DLT workloads and inference workloads require topological collocation of the nodes and hardware associated with a task. In some examples, the disclosed service intersperses inference workloads over or between training workloads, helping to drive efficiency and to complete more tasks on the IaaS.
The disclosed service provides cloud-wide (e.g., global), topology-aware, and workload-aware scheduling of AI workloads. A global scheduler is provided to take advantage of the heterogeneity of workloads (e.g., differing attributes of training tasks, inference tasks, etc.) and to provide dynamic, topology-aware scheduling of resources across the entire AI hardware capacity in the cloud. In particular, with its ability to transparently checkpoint the processor and device states that make up a task or workload (e.g., to save the state of the workload without any user involvement, changes to the framework, or changes to the training script logic), the disclosed scheduler is able to transparently preempt any running task, live-migrate any running task, and/or elastically scale and load balance the workers of a service to achieve maximum utilization without affecting performance or incurring downtime. In addition, the disclosed scheduler is configured to be aware of all tasks across the entire IaaS (e.g., to maintain a global view of the workloads across the entire IaaS). For example, the scheduler used by the disclosed service is configured to identify groups of GPUs/CPUs/hardware accelerators that are not being effectively utilized and to migrate the tasks on such groups to other GPUs/CPUs/hardware accelerators by transparently checkpointing and verifying the processor and device state at the point at which the migration occurs. The scheduler is also configured to monitor and/or track the workloads currently running and the hardware capacity currently available around the world in the cloud of the disclosed service. In addition, the scheduler is configured to decide whether and/or when to preempt a task, migrate a task, expand or contract a task, or load balance between different workers of a task.
The disclosed service is configured to manage AI workloads in a priority-driven and/or level-driven manner. In some examples, the levels are defined by at least one Service Level Agreement (SLA). When the disclosed scheduler makes decisions regarding AI training or inference workloads, the scheduler may consider the assigned level of a given task (or inference model) or of the associated task submitter. Each level may be defined with different specifications. For example, if a task is submitted at the highest level, indicating the best capacity level, the task will run with the least preemption, equivalent to running on dedicated cloud resources. If a task is submitted at an intermediate level, it will experience some preemption or migration, which may "slow" the task to some extent but drives efficiency and increases the overall utilization of the fixed resource pool. If a task is submitted at the lowest level, the task will be preempted frequently, providing an experience similar to that of a live Virtual Machine (VM): the task will not necessarily complete at the fastest rate, but it is ensured that it will complete. There are many examples of different levels, which need not be discussed in detail herein, except to note that DLT training and inference tasks may be scheduled based at least in part on their associated level, and that an associated level may be specific to a task, a customer, and/or a capacity type. Today, no system provides level-based guarantees for DLT training and inference tasks.
In some examples, each tenant or task submitter is assigned a quota of system resources (e.g., GPUs), which places an upper limit on the use and/or pricing of these resources. The tenant may be offered levels of such resource usage (e.g., three levels based on performance, guaranteed access, and/or priority with respect to preemption). When an associated cluster is oversubscribed, the associated level may be used to determine the priority of a task. In some examples of the present disclosure, preemption and elastic rescaling are enabled for all tasks, and the levels may therefore be differentiated based on the associated task slowdown percentages.
The task slowdown percentage value may be defined as a function of the ideal completion time of the task (T_ideal) and the actual completion time of the task (T_real). The ideal completion time may be defined as the time in which a task would complete if it ran on dedicated GPUs without preemption. The actual completion time may differ from the ideal completion time due to oversubscription of the associated cluster, which may require some preemption or scaling of the task. The task slowdown percentage may be calculated as (T_real - T_ideal)/T_ideal. For example, if a task completes in 80 hours on dedicated GPUs but takes 100 hours with preemption, the task slowdown percentage is 25%. The associated throughput score value may be calculated as T_ideal/T_real, which in the previous example is 80%.
Additionally or alternatively, a value G may be defined that indicates the amount of accumulated GPU-seconds consumed by the task. This value depends on T_ideal; in some examples, G = N × T_ideal for a task requiring N GPUs. In some examples of the described system, performance measurements of a tenant's tasks (and the related prices that may be charged to the tenant) may be based on the value of G, such that the tenant pays mainly for the actual processing required by its tasks and does not bear any additional indirect costs of preemption and the like.
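As a quick numeric check of the quantities defined above, the following Python sketch reproduces the 80-hour/100-hour example; the function names are hypothetical and the snippet is illustrative only.

```python
def slowdown_percent(t_ideal: float, t_real: float) -> float:
    """Task slowdown: extra completion time relative to an unpreempted run."""
    return (t_real - t_ideal) / t_ideal * 100.0

def throughput_score(t_ideal: float, t_real: float) -> float:
    """Fraction of the ideal throughput actually delivered."""
    return t_ideal / t_real

def gpu_seconds(n_gpus: int, t_ideal_seconds: float) -> float:
    """Accumulated GPU-seconds G = N x T_ideal consumed by a task on N GPUs."""
    return n_gpus * t_ideal_seconds

# Example from the text: 80 ideal hours vs. 100 actual hours.
assert slowdown_percent(80, 100) == 25.0   # 25% slowdown
assert throughput_score(80, 100) == 0.8    # 80% throughput score
```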
In some examples, the performance or priority levels include three levels: a high priority level, a standard priority level, and a low priority level. In other examples, more, fewer, or different levels may be defined without departing from the present description. Each of the levels may be defined by at least one of: a guaranteed level of the task slowdown percentage and/or throughput score (e.g., a 99% throughput score for the high level, an 80% throughput score for the standard level, and a "best effort" throughput score for the low level); a preemption frequency level (e.g., almost never, infrequently, and frequently for the high, standard, and low levels, respectively); a priority level (e.g., high, medium, and low) used to determine which tasks free capacity is assigned to when scaling up; and/or topology or locality criteria (e.g., always obey locality, mostly obey locality, and "best effort" locality for the high, standard, and low levels, respectively).
Moreover, the level guarantees for larger tasks may also be flexible, since it may be more difficult to guarantee the defined criteria for tasks that need to use a large amount of resources in parallel or otherwise simultaneously (e.g., tasks requiring more than 256 GPUs may need relaxed locality requirements in order to obtain 256 GPUs within a reasonable period of time).
In view of the above details regarding level-based scheduling of tasks from multiple tenants, the scheduler of the described system may prioritize maximizing overall cluster utilization and aggregate task throughput across clusters while minimizing violations of the performance or priority level criteria. In some examples, the preemption and scheduling policies used by such a scheduler derive from these goals.
For example, an internal dynamic scheduling score may be defined for each task such that tasks with lower scores are always preempted before tasks with higher scores. The scheduling score of a task changes dynamically during service runtime and can be calculated as S = S_base + S_dynamic. The base score S_base is fixed based on the "level" of the task (e.g., high, standard, low). The dynamic component S_dynamic is based on the task's proximity to violating the level criteria, requirements, and/or rules. Tasks at risk of violating the level criteria, requirements, and/or rules may be assigned a high dynamic score so that they are not preempted.
A challenge in detecting a task's proximity to violating its level criteria is that the task's T_ideal is unknown. In some examples, the level criteria and/or requirements are therefore tracked and maintained for each hour of task execution. Each hour that passes from the time of task submission constitutes a task-hour, and the scheduler may maintain the level criteria for each task-hour. If the throughput score criterion for a task is 80% (e.g., the standard level), the system may be configured to ensure that, within each task-hour, the task gets 80% of the resources it would have obtained had it not been preempted at all.
Thus, for a task requesting N GPUs, the scheduler needs to ensure that it gets at least N × f GPU-hours per task-hour, where f is the throughput fraction of the level (e.g., 80%). At task submission, this can be used to determine the maximum queuing delay allowed for the task (20% of 1 hour = 12 minutes in the example above), because any longer wait in the queue would violate the level criterion for the first task-hour.
Depending on the load, the resources a task gets within a given task-hour may exceed the minimum required by the level criterion. For example, the task might obtain N GPUs (rather than the minimum N × f) for a given task-hour. In this case, the task accumulates a margin that the scheduler can apply against the level criterion in subsequent task-hours. For example, if a task obtains N GPU-hours (instead of N × f) within the first task-hour, the task's level criterion during the second task-hour becomes (N × f - margin), where the margin is N × (1 - f) (i.e., the excess capacity accumulated so far). In the 80% example, the task then requires only N × 0.6 GPU-hours to meet the level criterion. Thus, the dynamic priority within a task-hour is calculated based on the margin the task can use toward meeting its level criteria requirements.
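One way to picture the margin-based dynamic score described in this and the preceding paragraphs is sketched below in Python. The tier table, the base-score values, and the scaling of the dynamic component are assumptions made for illustration; only the relationships S = S_base + S_dynamic, the per-task-hour minimum N × f, and the 12-minute queuing bound follow from the text.

```python
TIER_BASE_SCORE = {"high": 300, "standard": 200, "low": 100}     # assumed values
TIER_THROUGHPUT = {"high": 0.99, "standard": 0.80, "low": 0.50}  # guaranteed fraction f (low is assumed)

def required_gpu_hours(n_gpus: int, tier: str) -> float:
    """Minimum GPU-hours needed in a task-hour to meet the level criterion: N * f."""
    return n_gpus * TIER_THROUGHPUT[tier]

def margin(n_gpus: int, tier: str, gpu_hours_received: float) -> float:
    """Surplus accumulated beyond the minimum; negative means a shortfall."""
    return gpu_hours_received - required_gpu_hours(n_gpus, tier)

def scheduling_score(tier: str, n_gpus: int, gpu_hours_received: float) -> float:
    """S = S_base + S_dynamic: tasks close to violating their level criterion get
    a larger dynamic component so they are preempted last."""
    shortfall = max(0.0, -margin(n_gpus, tier, gpu_hours_received))
    s_dynamic = 100.0 * shortfall / max(n_gpus, 1)   # assumed scaling
    return TIER_BASE_SCORE[tier] + s_dynamic

def max_queue_delay_minutes(tier: str) -> float:
    """Longest a new task can wait without violating its first task-hour:
    (1 - f) * 60 minutes, i.e., 12 minutes for the 80% standard level."""
    return (1.0 - TIER_THROUGHPUT[tier]) * 60.0
```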
Since a task can be expanded or contracted multiple times, the GPU-hours of a task are calculated in an elasticity-aware manner. Thus, within one task-hour, the number of GPU-hours may be calculated as the actual area under the allocation curve. For example, if within a single task-hour a task has N GPUs for 15 minutes, N/2 GPUs for 30 minutes, and no GPUs for 15 minutes, the number of GPU-hours is (15×N + 30×N/2 + 15×0)/60 = N×0.5.
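The area-under-the-curve calculation can be written directly from the example above. The following sketch is illustrative; the segment representation is an assumption.

```python
def gpu_hours_in_task_hour(segments):
    """GPU-hours accrued in one task-hour, computed as the area under the
    allocation curve. `segments` is a list of (minutes, gpus_allocated) pairs
    that together cover 60 minutes."""
    assert sum(minutes for minutes, _ in segments) == 60, "must cover one task-hour"
    return sum(minutes * gpus for minutes, gpus in segments) / 60.0

# Example from the text: 15 min at N GPUs, 30 min at N/2, 15 min at 0 GPUs.
N = 8
print(gpu_hours_in_task_hour([(15, N), (30, N / 2), (15, 0)]))  # 4.0, i.e., N * 0.5
```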
In some examples, the margin-based scheduling described above for minimizing violations of the level criteria or requirements is configured to operate at multiple granularities. For example, the scheduling may operate at the task level and/or at the account or tenant level. Such a configuration may give a tenant the option of selecting level enforcement at the tenant level, which helps in two ways. First, a tenant may specify relative intra-tenant priorities among its tasks. Thus, while all of its tasks may be of the standard level, a subset of those tasks may be assigned a relatively higher priority than the tenant's other tasks. When deciding to preempt a task of that tenant, the scheduler may select a task with a lower intra-tenant priority rather than one with a higher intra-tenant priority. Second, with respect to scaling, such a configuration gives the scheduler better flexibility to scale up the tasks that benefit most from scaling (e.g., tasks whose performance scales linearly) and to run other tasks in a scaled-down mode while preserving the tenant-level criteria and/or requirements. Additionally or alternatively, such features may be configured as opt-in features, since tenants then bear the additional complexity of managing the relative priorities among their tasks.
Furthermore, in some examples fairness is enforced when there is excess capacity in the system and tasks can do better than the minimum requirements of their level. For example, the scheduler may be able to provide a 95% throughput score for one or more tasks instead of 80%. Elastic scale-up is another situation in which excess capacity can be allocated to a task. Note that fairness may not be enforced across performance or priority levels (e.g., a high-level task always gets excess resources ahead of a lower-level task), but within a single performance or priority level the scheduler may be configured to allocate excess capacity in a fair manner as described herein.
The disclosed system is configured to provide a reliable and high-performance AI infrastructure. Without a reliable infrastructure, utilization will always be less than optimal, because both planned and unplanned failures result in lost GPU time and productivity. For example, if a large task runs on hundreds of nodes and GPUs for months, some of those GPUs will eventually become unhealthy or need to be upgraded during task processing. This can affect the customer's workload. Because of the manner in which AI workloads operate, any lapse in GPU health may cause the overall AI workload task to stall and progress to stop. Worse still, if the task or model has not been checkpointed, previous processing may be lost. To overcome this problem, the disclosed system provides capabilities such as transparent preemption, dynamic load balancing, defragmentation, and resilience, all of which enable a highly reliable infrastructure.
The present disclosure deeply integrates the bare-metal compute, networking, and driver stacks of various accelerators by providing at least the following technical contributions: (i) bandwidth-optimized distributed barrier and convergence protocol implementations directly inside the backend network communication stack, to implement a distributed consistency protocol between accelerator devices and a set of worker processes, and (ii) transparent and consistent checkpointing and restoration of process and device state, to enable transparent preemptive scheduling, failover, live migration, and dynamic elasticity, all without affecting model convergence and without requiring any assistance from users or frameworks. The disclosed service checkpoints AI tasks so that their device state can be captured and later restored on other nodes, at the infrastructure level, without affecting the correctness or convergence of the model.
The disclosed service is configured to provide global distribution of inference endpoints to achieve (a) predictable single-digit-millisecond latency at the 99th percentile (P99) around the world, and (b) high availability in the face of regional disasters. When a user submits an inference workload, the inference model may be deployed across different geographic regions and run in the nearest region.
The disclosed service is configured to provide vertical integration with a wide range of hardware. The example architecture shown below in FIG. 1 is designed for the future, with built-in flexibility to remain agile as new scenarios and technologies emerge. The disclosed design is highly flexible in the following respects: providing first-class support for various AI accelerators; providing disaggregated and aggregated topologies; providing non-uniform backend network configurations; providing a scalable, hierarchical architecture; implementing an extensible scheduling system for tenant customization; implementing extensible heterogeneous accelerators, devices, and/or hardware; and providing a compiler toolchain that is independent of the AI training and inference framework.
The present disclosure provides a unified abstraction over various AI accelerators, and a given training task or inference endpoint can be mapped across a mix of heterogeneous device types to drive the highest efficiency.
In addition to supporting a standard server-type computing topology, the disclosed service is configured to support and drive the disaggregation strategies of the cloud computing environment and/or other similar strategies associated with other cloud platforms. An aggregated topology includes devices that are physically attached to the server, such that no backend network is required. A disaggregated (split) topology includes compute node and hardware accelerator frameworks that may utilize a backend network. The disclosed service abstracts both topologies.
The disclosed service is configured to support a variety of non-uniform backend network architectures contemplated by different first and third party hardware manufacturers.
The disclosed services provide a hierarchical architecture that supports extensibility at each level, including pluggable data planes (e.g., orchestration-level extensibility supports alternative data planes or orchestrators plugged in under the scheduler, for example to support Kubernetes running in a customer's private data center), pluggable scheduling subsystems (e.g., scheduling-level extensibility supports alternative schedulers and custom policies plugged in under the control plane, for example to support gradual migration to the disclosed services), and pluggable heterogeneous device types and accelerators (e.g., the present disclosure is designed to implement a consistent model for configuring and scaling accelerator devices, including quantum computing devices, through pluggable device provider interfaces).
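These extension points can be pictured as small interfaces that alternative schedulers, data planes, or device families implement and register. The Python sketch below is a hypothetical rendering for illustration only; none of these class or function names come from the disclosure.

```python
from abc import ABC, abstractmethod

class DeviceProvider(ABC):
    """Hypothetical pluggable device-provider interface: each accelerator family
    (GPU, FPGA, ASIC, quantum device, ...) supplies a provider so the platform can
    configure and scale it through one consistent model."""

    @abstractmethod
    def discover(self) -> list:
        """Enumerate devices of this type available on the host."""

    @abstractmethod
    def allocate(self, count: int):
        """Reserve `count` devices for a workload."""

class SchedulerPlugin(ABC):
    """Hypothetical scheduling-level extension point: an alternative scheduler or
    custom policy plugged in beneath the control plane."""

    @abstractmethod
    def schedule(self, workloads, capacity):
        """Return a mapping of workloads to resources."""

PROVIDERS: dict = {}

def register_provider(name: str, provider: DeviceProvider) -> None:
    """Registration hook a first- or third-party hardware integration would call."""
    PROVIDERS[name] = provider
```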
The disclosed service is configured to provide a compiler toolchain that is independent of the AI training and inference frameworks. The service does not rely on any assistance from users or frameworks to provide its core functionality. It is designed to be independent of AI training and inference frameworks and tools, and it does not require the user to select any particular framework, compiler toolchain, or library. The service integrates at the level of device drivers and device-to-device communication channels to support various hardware-specific functions.
The disclosed service provides a highly scalable AI infrastructure. The service is designed to scale across hundreds of data centers and tens of thousands of accelerators and to train models with trillions of parameters. The service may also be configured to span geographic boundaries. The architecture can also treat training tasks and inference services from data centers as well as local sources equally.
While aspects of the disclosure have been described in terms of various examples and their associated operations, those skilled in the art will appreciate that combinations of operations from any number of different examples are also within the scope of aspects of the disclosure.
While the examples provided relate to implementations using GPUs, it will be appreciated that FPGAs, ASICs, or other specialized hardware may be similarly used to perform the functions described herein.
FIG. 1 is a block diagram illustrating a system 100 configured to provide infrastructure services for AI workloads, according to an embodiment. The system 100 includes a control plane 102, a runtime plane 104, and an infrastructure plane 106. In some examples, system 100 is a distributed computing infrastructure system that includes hardware devices distributed across many different locations (e.g., a global or planetary-scale distributed system). Further, the system 100 is specifically configured to execute AI workloads, such that the hardware, firmware, and/or software of the system 100 is configured to efficiently perform the tasks associated with those AI workloads. Alternatively or additionally, system 100 may include hardware, firmware, and/or software specifically configured to perform other types of workloads without departing from this description.
The control plane 102 includes a manageability subsystem 108, a pluggable data plane 110, and a global scheduling subsystem 112. In some examples, the control plane 102 is configured to receive or accept AI workloads and associated data through various pluggable data planes 110 that are extensible or defined by the tenant of the system (e.g., alternative data planes inserted below the scheduler to support Kubernetes or another similar system running in the tenant's private data center). As described herein, these AI workloads are scheduled to execute on the infrastructure (e.g., infrastructure plane 106) of the system 100.
Manageability subsystem 108 includes hardware, firmware, and/or software configured to provide tenants with interactive processing of AI workload requests. Further, manageability subsystem 108 is configured to provision all infrastructure resources of system 100 in all regions in which the system operates. In some examples, manageability subsystem 108 includes manageability replicas in the various regions of system 100, such that the infrastructure resources of system 100 are multi-hosted by the various replicas, which act as interfaces between tenants and system 100. Manageability subsystem 108 may be decoupled from the global scheduling subsystem 112.
As described herein, the global scheduling subsystem 112 includes hardware, firmware, and/or software configured to schedule AI workloads/tasks for execution on the infrastructure resources of the system 100. In some examples, the global scheduling subsystem 112 includes a hierarchy of schedulers: a global scheduler, regional schedulers, and coordinator services. The global scheduler is responsible for preparing the schedules corresponding to the AI workloads (e.g., tasks, models, and/or small clusters) and handing them over to the regional schedulers based on these prepared schedules. A regional scheduler is responsible for managing and reporting regional capacity to the global scheduler and then executing the schedules received from the global scheduler. The coordinator service is responsible for converting the schedules into physical resource allocations across the regional infrastructure resource clusters. The coordinator service may also constitute the reliability subsystem 122, or otherwise be closely associated with the reliability subsystem 122, as described herein. The global scheduling subsystem 112 is described in more detail below.
As described herein, the runtime plane 104 includes subsystems configured to distribute AI workloads to the infrastructure plane 106 and to execute them on the infrastructure plane 106. Such subsystems may include a monitoring subsystem 114, a compiling subsystem 116, a communication subsystem 118, and/or a load balancing subsystem 120. Further, the runtime plane 104 includes a reliability subsystem 122 configured to ensure the reliability of AI workload execution while enabling such workloads to be checkpointed and/or migrated throughout the infrastructure resources of the system 100. The runtime plane 104 also includes an AI accelerator provider model 124 configured to manage AI accelerators, using various libraries and/or configurations, when executing AI workloads. The runtime plane 104 is described in more detail below.
The infrastructure plane 106 includes hardware, firmware, and/or software for executing AI workloads based on the schedule provided by the control plane 102 and the instructions received from the runtime plane 104. The infrastructure plane 106 includes a hosting and activation subsystem 126, infrastructure resources 128, and a device/AI accelerator 130. The infrastructure plane 106 is described in more detail below.
FIG. 2 is a block diagram 200 illustrating a runtime plane 204 of the system 100 of FIG. 1. In some examples, the runtime plane 204 is substantially the same as the runtime plane 104 described above with reference to FIG. 1. The runtime plane 204 includes a monitoring subsystem 214, a compiling subsystem 216, a communication subsystem 218, a load balancing subsystem 220, a reliability subsystem 222, and an AI accelerator provider model 224.
The reliability subsystem 222 includes routines for interacting with AI workloads to ensure their reliability. In some examples, the routines include failover 232, suspension 234, recovery 236, migration 238, scaling 240, checkpoint 242, and restoration 244. The checkpoint 242 and restoration 244 routines may be configured as core routines, and the other routines (failover 232, suspension 234, recovery 236, migration 238, and scaling 240) may be configured to use the checkpoint 242 and/or restoration 244 routines to achieve their desired results.
The checkpoint 242 routine is configured to save the state of an AI workload while it is executing, such that the saved state can be used to continue executing the AI workload from the saved point in time. Checkpoint 242 may be used to execute the suspension 234 routine to pause execution of an AI workload for a period of time, and/or to execute the migration 238 routine to save the state of an AI workload so that it can be moved to another set of infrastructure resources for continued execution.
The restoration 244 routine is configured to take the saved state of an AI workload as input and resume execution of the AI workload on infrastructure resources starting from the point of the saved state. The restoration 244 routine may be used to execute the recovery 236 routine and/or to resume execution of an AI workload that has been migrated to another set of infrastructure resources by the migration 238 routine.
The failover 232 routine is configured to checkpoint the state of the AI workload based on the detection of a failure of a current infrastructure resource and to recover the AI workload on the new set of infrastructure resources such that the AI workload recovers from the detected failure.
The scaling 240 routine is configured to expand and/or contract the amount, quality, and/or type of infrastructure resources used to execute an AI workload. For example, if additional infrastructure resources become available, an AI workload may be expanded to take advantage of those additional resources. Alternatively, if a new AI workload requires some of the infrastructure resources being used to execute a current AI workload, the current AI workload may be contracted to release some resources for the new AI workload (e.g., the new AI workload may be associated with a higher priority or level than the current AI workload).
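The relationship among these routines, with checkpoint 242 and restoration 244 as the primitives the others build on, can be sketched as follows. This is a minimal illustration under assumed method names and an assumed in-memory state store, not the disclosed implementation.

```python
class ReliabilityRoutines:
    """Sketch: checkpoint/restore are the core primitives; suspension, migration,
    and failover are composed from them (scaling would similarly checkpoint,
    adjust the resource set, and restore)."""

    def __init__(self):
        self._checkpoints = {}   # workload id -> saved state (assumed store)

    def checkpoint(self, workload_id: str, state: dict) -> None:
        """Save process/device state so execution can continue later."""
        self._checkpoints[workload_id] = dict(state)

    def restore(self, workload_id: str) -> dict:
        """Return the saved state so execution resumes from that point."""
        return self._checkpoints[workload_id]

    def suspend(self, workload_id: str, state: dict) -> None:
        self.checkpoint(workload_id, state)              # pause = checkpoint, then stop

    def migrate(self, workload_id: str, state: dict, target_node: str) -> dict:
        self.checkpoint(workload_id, state)              # save on the source resources
        return {"node": target_node, "state": self.restore(workload_id)}

    def failover(self, workload_id: str, last_state: dict, healthy_node: str) -> dict:
        return self.migrate(workload_id, last_state, healthy_node)   # recover elsewhere
```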
The reliability subsystem 222 also includes a convergence protocol 246 configured to synchronize or otherwise coordinate the AI workloads to which the above-described routines are applied. For example, if an AI workload is to be migrated, the convergence protocol 246 is configured to synchronize the operation of the system such that the resources involved in the migration are not changed during the migration process. Such a convergence protocol 246 may include the use of locks or the formation of barriers so that other processes not otherwise associated with the migration do not inadvertently affect the migration.
The AI accelerator provider model 224 is configured to enable the use of various software stacks, including third-party libraries 248 (e.g., libraries provided by a tenant of the system 100) and/or first-party libraries 250 (e.g., libraries provided by the entity managing the system 100). For example, the third-party libraries 248 may include a third-party-specific Management Library (ML) 252, a third-party-specific multi-GPU communication library (MGCL) 254, and a third-party-specific GPU library (GPUL) 256. Additionally or alternatively, the first-party libraries 250 may include a management library 264, a communication library 266, and/or a compiler toolchain 268. The runtime plane 204 enables tenants to utilize various software stacks and associated libraries, including their own, to execute AI workloads within the described system 100, based on its extensible, flexible configuration.
FIG. 3 is a block diagram 300 illustrating an infrastructure plane 306 of the system 100 of FIG. 1, according to an embodiment. In some examples, as described above, the infrastructure plane 306 is substantially the same as the infrastructure plane 106 of FIG. 1. The infrastructure plane 306 includes a hosting and activation subsystem 326, infrastructure resources 328, and devices and AI accelerators 330.
Hosting and activation subsystem 326 includes host agent 370 and container 372. Host agent 370 enables and organizes the hosting of AI workloads on infrastructure resources 328. The containers 372 (e.g., copy-on-write containers) keep different AI workloads (e.g., workloads from different tenants) separate from each other and secure even though they execute on the same host. The host controlled by host agent 370 may be a device that includes a set of infrastructure resources 328 configured to execute an AI workload or at least a portion thereof. Thus, by separating the AI workloads into containers 372, some resources of the host may be used to execute the AI workload from one tenant, while other resources of the host may be used to execute the AI workload of another tenant at the same time. The container 372 is configured such that two separate AI workloads are prevented from interacting in any manner when executed.
Infrastructure resources 328 include a service fabric 396 interface, storage resources 376, networking resources 378, computing resources 380 (which may include bare-metal blades 382 (e.g., physical processing devices) and virtual machines 384), and other resources 386 (e.g., virtual infrastructure resources). In some examples, the infrastructure resources 328 are provided primarily by the entity providing the services of the system 100 (e.g., first-party resources), but in other examples the infrastructure resources 328 may also include resources provided by other entities (e.g., third-party resources), such as resources owned and used by tenants of the system 100. Such integration may be achieved via the third-party libraries 248 and the other configurations described above.
The devices and AI accelerator 330 includes a GPU 388, an FPGA device 390, other third-party devices 392, and other first-party devices 394. The described processes may also be implemented by the back-end network 374 and/or associated devices. Execution of the AI workload may uniquely benefit from the use of the GPU 388, FPGA 390, and/or other specialized hardware. In such examples, an infrastructure resource 328, such as computing resource 380, may be linked to GPU 388, for example, such that computing resource 380 provides instructions to GPU 388 regarding how to perform the steps of the AI workload. Such execution then utilizes a dedicated architecture of the GPU 388, such as a GPU 388 having many cores, to enable data parallel processing that is largely beyond the capabilities of the computing resources 380.
The backend network 374 is configured to support various non-uniform backend network architectures that may be envisioned by the various entities that use the system, such as first- and third-party hardware manufacturers. Such a backend network 374 may be used to provide links between compute nodes (e.g., computing resources 380) and hardware accelerators (e.g., GPUs 388) in a disaggregated topology.
FIG. 4 is a flowchart illustrating a method 400 for managing AI workloads in a cloud infrastructure platform, according to an embodiment. In some examples, the cloud infrastructure platform of method 400 is a system such as system 100 of FIG. 1. At 402, a set of distributed infrastructure resources (e.g., the hosting and activation subsystem 126, the infrastructure resources 128, and/or the devices/AI accelerators 130 of the infrastructure plane 106) is integrated into the cloud infrastructure platform via native support interfaces of those resources. In some examples, the native support interfaces may include interfaces and/or libraries of the resource providers, such as the third-party libraries 248 and first-party libraries 250 of FIG. 2. For example, a tenant of the cloud infrastructure platform may provide a subset of infrastructure resources, based on the provided libraries, for integration into the platform such that that tenant and/or other tenants of the platform may use these resources in executing AI workloads.
At 404, AI workloads are received from a plurality of tenants, wherein the received AI workloads include training workloads and inference workloads. In some examples, a tenant provides AI workloads for execution on the platform via an interface such as the pluggable data plane 110 described herein.
At 406, subsets of the distributed infrastructure resources are assigned to the received AI workloads. In some examples, the assignment of resource subsets to AI workloads is performed by the global scheduling subsystem 112, as described herein. Assigning resources may include determining the resource requirements of an AI workload and then identifying a subset of infrastructure resources that meets those requirements (e.g., an AI workload that requires the parallel use of four GPUs may be assigned to a node of the system having at least four GPUs).
Additionally or alternatively, assigning a subset of resources to an AI workload may include rearranging other AI workloads relative to that subset of resources. For example, assigning a resource subset to an AI workload may include saving a state checkpoint of an AI workload currently executing on a first resource subset, migrating that AI workload to a second resource subset, restoring the saved state checkpoint of the migrated AI workload on the second resource subset, and then assigning at least a portion of the first resource subset to the other AI workload. In some examples, such a process may be performed using the routines of the reliability subsystem 222 as described herein.
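A condensed sketch of this rearrangement step is shown below. The dictionaries, the priority-based victim choice, and the checkpoint/restore callables are assumptions standing in for the reliability routines of FIG. 2; the snippet only illustrates the checkpoint, migrate, restore, and reassign sequence described above.

```python
def make_room(running, incoming, nodes, checkpoint, restore):
    """Checkpoint a running workload on its first resource subset, restore it on a
    second subset, and hand the freed subset to the other (e.g., higher-level) workload."""
    victim = min(running, key=lambda w: w["priority"])   # lower levels move first
    source = victim["node"]
    target = next(n for n in nodes if n != source)       # a second resource subset
    checkpoint(victim["id"], victim["state"])            # save state on the first subset
    victim["state"] = restore(victim["id"])              # resume from the checkpoint
    victim["node"] = target                              # ...on the second subset
    incoming["node"] = source                            # freed resources reassigned
    return incoming, victim

# Hypothetical usage with a dictionary-backed checkpoint store.
store = {}
incoming, moved = make_room(
    running=[{"id": "w1", "priority": 1, "state": {"step": 42}, "node": "node-A"}],
    incoming={"id": "w2", "priority": 3, "state": {}, "node": None},
    nodes=["node-A", "node-B"],
    checkpoint=lambda wid, state: store.__setitem__(wid, dict(state)),
    restore=lambda wid: store[wid])
```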
At 408, the received AI workloads are scheduled for execution on the assigned subsets of resources. In some examples, the global scheduling subsystem 112 generates a schedule of the AI workloads as described herein. Further, scheduling the execution of the AI workloads may include scheduling training workloads and inference workloads on the same infrastructure resources, with the two types of workloads multiplexed on those infrastructure resources (e.g., execution of training workloads interspersed with execution of inference workloads on infrastructure resources such as GPUs).
Further, in some examples, the AI workloads are associated with priorities or levels that affect how the resources are allocated and how the AI workloads are scheduled to execute on those resources. For example, as described herein, a lower-level AI workload may be more likely to be migrated to other resources to make room for a higher-level AI workload, or a higher-level AI workload may be scheduled for a greater share of resource usage time than a lower-level AI workload.
At 410, the AI workload is executed based on the scheduling of the AI workload on the assigned subset of resources. In some examples, the AI workload is hosted in the hosting and activation subsystem 126, and then the infrastructure resources 128 and/or the device/AI accelerator 130 are used to execute the AI workload. For example, dispatching and executing AI workloads on a subset of resources includes isolating AI workloads from each other in a secure container, such that AI workloads associated with different tenants execute securely with each other (e.g., on resources associated with the same server).
Further, in some examples, the executing AI workloads are monitored based on the performance of the cloud infrastructure platform, and the scheduling of the AI workloads is adjusted based on the monitoring. Adjusting the schedule may include preempting an AI workload, migrating an AI workload, expanding an AI workload, contracting an AI workload, and/or load balancing between two or more AI workloads. Such scheduling adjustments may be performed by the global scheduling subsystem 112 or other components of the system 100.
FIG. 5 is a block diagram illustrating a hierarchical scheduling subsystem 500 configured for scheduling AI workloads 512, according to an embodiment. In some examples, scheduling subsystem 500 is included in a system such as system 100 of FIG. 1. For example, scheduling subsystem 500 may be substantially identical to the global scheduling subsystem 112 of FIG. 1. Scheduling subsystem 500 includes a global scheduler 502 and a plurality of regional schedulers 504, coordinator services 506, and associated infrastructure resources 508. The global scheduler 502 is configured to use global capacity data 510 (e.g., data indicating the current state of resource usage throughout the associated global infrastructure system, including resource usage in each region of the system) and the AI workloads 512 to generate a global schedule 514 that schedules the AI workloads 512 for execution on the infrastructure resources 508. The global schedule 514 includes a regional schedule 520 for each region of the system, which is then provided to the regional scheduler 504 associated with that region (e.g., the regional schedule 520 for a region is provided to the regional scheduler 504 associated with that particular region).
The regional scheduler 504 monitors the current regional capacity data 516 of the infrastructure resources 508 associated with its region, and the regional capacity data 516 is provided to the global scheduler 502 periodically or based on a pattern or trigger event. Further, the regional scheduler 504 receives from the global scheduler 502 the regional AI workloads 518 associated with its region from the set of AI workloads 512. The regional scheduler 504 is also configured to instruct the coordinator service 506 to execute the associated regional schedule 520 (each region including a regional scheduler 504 and a coordinator service 506) using the data of the regional AI workloads 518.
The coordinator service 506 is configured to receive a regional schedule 522 and the associated regional AI workloads 524 from the associated regional scheduler 504 and to use reliability routines 526 (e.g., the routines of the reliability subsystem 222 of FIG. 2, as described above) to cause the regional AI workloads 524 to execute, based on the regional schedule 522, using the infrastructure resources 508 of the region. For example, the coordinator service 506 may be configured to allocate a subset of the region's infrastructure resources 508 to the regional AI workloads 524 and cause the workloads 524 to execute on those allocated resources 508. Additionally or alternatively, the coordinator service 506 may be configured to checkpoint, restore, migrate, and/or execute other reliability routines 526 to schedule the use of the infrastructure resources 508 according to the regional schedule 522.
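The division of labor among the global scheduler 502, the regional schedulers 504, and the coordinator services 506 can be summarized with the following sketch. The round-robin placement and the callback names are placeholders, not the disclosed scheduling policy.

```python
def global_schedule(workloads, regional_capacity):
    """Global scheduler 502: split incoming AI workloads into one regional
    schedule per region (here by simple round-robin over regions ordered by
    reported free capacity)."""
    regions = sorted(regional_capacity, key=regional_capacity.get, reverse=True)
    plan = {region: [] for region in regions}
    for i, workload in enumerate(workloads):
        plan[regions[i % len(regions)]].append(workload)
    return plan

def regional_schedule(region, plan, report_capacity):
    """Regional scheduler 504: report capacity upward (e.g., capacity data 516)
    and hand its portion of the plan to the coordinator service."""
    report_capacity(region)
    return {"region": region, "workloads": plan[region]}

def coordinate(regional_plan, allocate):
    """Coordinator service 506: turn the regional schedule into physical
    resource allocations, e.g., via the reliability routines 526."""
    return [allocate(w) for w in regional_plan["workloads"]]
```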
FIG. 6 is a state diagram 600 illustrating the operation of a hierarchical scheduling subsystem configured for scheduling AI workloads, according to an embodiment. In some examples, state diagram 600 describes the operation of a subsystem such as subsystem 500 of FIG. 5 as described in this disclosure. The state diagram 600 includes an initialization state 602 and three states associated with the execution of an AI workload: a runnable state 604, a running state 606, and a suspended state 608. State diagram 600 also includes three final states: a completed state 610, a failed state 612, and a cancelled state 614.
At 616, when the global scheduler has successfully prepared a schedule so that the resource can be scheduled by a regional scheduler, the state changes from initialization 602 to runnable 604.
At 618, when the user cancels a resource that has not yet been run, state transitions from initialization 602 to cancel 614.
At 620, when the regional dispatcher successfully prepares to dispatch and deploy the associated resource for execution, the state changes from runnable 604 to running 606, as described herein.
At 622, when the user cancels the resource to be run (e.g., for executing the scheduled AI workload), the state changes from runnable 604 to cancelled 614.
At 624, when the resource successfully completes execution of the scheduled AI workload, the state changes from run 606 to completed 610.
At 626, when the resource encounters a failure at runtime, the state changes from run 606 to failure 612.
At 628, when the user cancels the running resource, the state changes from run 606 to cancel 614.
At 630, when the scheduler or user pauses the running resource (e.g., pauses execution of the AI workload), the state changes from run 606 to pause 608.
At 632, when the user cancels the suspended resource, the state changes from suspended 608 to cancelled 614.
At 634, when the global scheduler successfully prepares for scheduling of resources and it is now available for execution by the regional scheduler, the state changes from pending 608 to runnable 604.
At 636, the state changes from runnable 604 to suspended 608 when the scheduler and/or user suspends the resources ready for execution based on the available schedule.
In some examples, the running state 606 includes executing, deleting, and failed sub-states. Upon entering the running state 606, the regional scheduler requests that the associated coordinator service execute the resource, so the resource enters the executing sub-state. Once the requested execution is complete, the execution artifacts are deleted, so the resource enters the deleting sub-state. Once all execution artifacts are deleted, the resource may exit the running state 606.
Alternatively or additionally, if execution of a resource fails in the executing sub-state and action from the regional scheduler is required, the resource may enter the failed sub-state. If the regional scheduler is able to resolve the failure, it may instruct the coordinator service to continue execution, returning the resource to the executing sub-state. If the regional scheduler cannot resolve the failure, it may instruct the coordinator service to delete any execution artifacts associated with the resource, thereby moving it to the deleting sub-state. In other examples, more, fewer, or different sub-states of the running state 606 may be used without departing from the present description.
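The transitions 616 through 636 can be collected into a small table; the sketch below is a hypothetical rendering of the state diagram (the sub-states of the running state are omitted) and is not part of the disclosure.

```python
from enum import Enum, auto

class State(Enum):
    INITIALIZATION = auto()
    RUNNABLE = auto()
    RUNNING = auto()
    SUSPENDED = auto()
    COMPLETED = auto()
    FAILED = auto()
    CANCELLED = auto()

# Allowed transitions from the state diagram of FIG. 6 (616-636).
TRANSITIONS = {
    State.INITIALIZATION: {State.RUNNABLE, State.CANCELLED},
    State.RUNNABLE: {State.RUNNING, State.SUSPENDED, State.CANCELLED},
    State.RUNNING: {State.COMPLETED, State.FAILED, State.SUSPENDED, State.CANCELLED},
    State.SUSPENDED: {State.RUNNABLE, State.CANCELLED},
}

def transition(current: State, new: State) -> State:
    """Reject any state change the diagram does not permit."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```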
FIG. 7 is a block diagram illustrating a split scheduling subsystem 700 configured to schedule AI workloads (e.g., workloads 710, 714, 716, and/or 718) across a plurality of nodes (e.g., nodes 720-724), according to an embodiment. In some examples, the split scheduling subsystem 700 is part of the global scheduling subsystem 112 of system 100 as described with respect to FIG. 1. Further, the split scheduling subsystem 700 may be replicated as different instances of the scheduling subsystem for multiple regions or other partitions of the global scheduling subsystem 112 (e.g., each region or other partition may include a central scheduler 702 and multiple node schedulers 704-708, the node schedulers 704-708 being associated with nodes 720-724 that include sets of infrastructure resources). Alternatively or additionally, the central scheduler 702 may be included in a global scheduler, such as global scheduler 502, while the node schedulers 704-708 may be included at a regional scheduler level, such as regional scheduler 504. Other arrangements of schedulers may also be used without departing from this description.
Further, in some examples, nodes 720-724 referenced with respect to split scheduling subsystem 700 may each be a server device. Alternatively or in addition, a node may comprise a plurality of server devices, or it may be provided on a server device together with one or more other nodes. In other examples, other arrangements of nodes relative to server devices or other hardware devices may be used without departing from the present description.
In some examples, the split scheduling system 700 is configured to schedule workloads on resources associated with nodes of the system. Workloads that can execute on the resources of a single node (e.g., node workloads 714, 716, and 718) may be assigned by the central scheduler 702 to node schedulers 704, 706, and/or 708, with node schedulers 704, 706, and/or 708 scheduling these workloads on the associated nodes 720, 722, and/or 724. Workloads requiring resources from multiple nodes (e.g., multi-node workload 710) are provided from the central scheduler 702 to multiple node schedulers such that the nodes associated with those node schedulers execute the multi-node workload substantially simultaneously.
In some examples, the split scheduling system 700 includes a distributed, fair-share scheduler that balances the conflicting objectives of efficiency and fairness across GPU clusters (and/or other infrastructure resources of nodes 720-724) for Deep Learning Training (DLT) and/or other workloads. System 700 may provide performance isolation between users/tenants such that multiple users can share a single cluster and/or node, thereby maximizing cluster efficiency. The system 700 may also fairly allocate cluster-wide GPU time among active users.
It should be appreciated that while many examples specifically describe partitioning a workload between GPUs, in other examples, other types of infrastructure resources may be used to execute a scheduled workload without departing from the present description.
In some examples, the split scheduling system 700 achieves efficiency and fairness despite cluster and/or node heterogeneity. Because newer, faster GPUs are released frequently, data centers host multiple generations of GPUs. As newer generations see higher demand from users, older generation GPUs become poorly utilized, reducing cluster efficiency. The system 700 may profile the variable marginal utility that different tasks derive from newer GPUs and transparently incentivize users to use older GPUs through a novel resource trading mechanism that maximizes cluster efficiency without affecting any user's fairness guarantees.
A single shared cluster across all users is attractive for overall efficiency, but to be practical such a cluster must ensure that each user achieves at least the same performance as they would with a statically partitioned cluster. In other words, if user A is entitled to a 20% global share of the GPUs, user A's effective performance in the shared cluster must be at least as good as if user A were running on a dedicated cluster with 20% of the GPUs, regardless of the other tasks/users running on the shared cluster. If user A cannot utilize its quota, the unused capacity must be shared among other active users to maximize cluster efficiency.
An additional dimension complicating sharing is hardware heterogeneity, which is a particularly serious problem in GPU clusters. With the rapid release of newer generations of GPUs, large clusters become a heterogeneous mix of GPUs over time. Users prefer newer generation GPUs because of their higher performance, so utilization of older generation GPUs suffers. With a single shared cluster, system 700 is configured to intelligently allocate different generations of GPUs across users to maximize efficiency while ensuring fairness.
One of the goals of system 700 is inter-user fairness. For simplicity, in some examples, all users have the same number of tickets. Scheduling subsystem 700 is inter-user fair if each active user receives a resource allocation of at least the cluster's total GPU resources divided by the number of active users. If an active user does not have enough workloads or tasks to utilize its fair share, the scheduling subsystem 700 may be configured to allocate sufficient resources to satisfy that user and then recursively apply the fairness definition to the remaining resources and active users. In one example, a cluster has 8 GPUs, with 4 GPUs allocated to each of user A and user B. Thus, the allocation is inter-user fair.
In a related example, user C submits a new 2-GPU task. In this example, the fair share of user C is 8 GPUs / 3 active users = 2.66 GPUs (i.e., 2 GPUs plus 2/3 of the time on one GPU), but since user C has only a single 2-GPU task, the fair share of user C is 2 GPUs. After the fair share of 2 GPUs is allocated to user C, the remaining 6 GPUs need to be allocated equally between user A and user B for inter-user fairness, so that each of them has 3 GPUs in total. Consider now the fairness options that current schedulers offer in this case. Some schedulers that do not target inter-user fairness and that optimize scheduling decisions based on minimizing task completion time may leave user C's task in the queue, or may move one of the existing tasks back to the queue and schedule user C's task in its place. Alternatively, any scheduler that does not time-share the GPUs has only these two options. In either case, these schedulers are unfair among users. These options thus indicate that GPU resources need to be time-shared to support fairness among users.
In some examples, it is assumed that each DLT task is assigned a number of tickets representing its share of resources. In other examples, other indicators of resource share (e.g., resource shares, resource percentages) may be used without departing from the present description. The goal of system 700 may be to provide proportional sharing of resources based on task tickets. To achieve this goal, system 700 may be composed of three key components. First, system 700 may be configured to schedule DLT tasks of various sizes using a group-aware, split stride scheduler. Second, the system 700 may be configured to use a ticket-adjusted, height-based load balancer that uses migration to ensure that the load is balanced across the cluster. Balancing the workload in conjunction with the split stride scheduler may provide fair and efficient service across a cluster of servers or nodes. Third, system 700 may be configured to transparently handle GPU heterogeneity and implement automatic GPU trading policies to improve efficiency across heterogeneous clusters while maintaining fairness.
In some examples, system 700 is configured to schedule workloads/tasks using a stride-based process. The stride of a task is inversely proportional to its tickets and represents the interval between successive schedulings of that task. The stride is represented in virtual time units called passes. The pass value of a new task is set to the minimum pass value of all tasks in the node. In each time quantum, the task with the smallest pass value is scheduled. The pass value of the task is then updated based on the stride of the task (e.g., the pass value of the task is increased by the task's stride).
Furthermore, the stride scheduling process may be extended to be group-aware. Pseudocode for this process is outlined below.
Algorithm 1: Group-aware stride
Data: the set of running tasks.
Result: the task to schedule in each time quantum.
A stride scheduling process or algorithm is invoked in each time quantum and returns the task to schedule for the next time quantum from the list of schedulable tasks in the queue. Just as in classical stride scheduling, the task with the minimum pass value is selected for scheduling (line 2), but an additional check is performed to ensure that it fits into the available resources (line 8). If the task fits, it is scheduled and its pass value is updated (lines 9-11). If the task does not fit, it is skipped, but note that it retains its pass value; in the next time quantum it will again have the smallest pass value and therefore the highest scheduling priority. For example, assume a 1-GPU task has the minimum pass value and is placed on a 4-GPU server. Assume a 4-GPU task has the next smallest pass value but cannot be scheduled. In group-aware stride, that task is skipped, but its pass value is preserved. This process continues until all GPUs are assigned or no feasible assignment remains. Since the pass value of the skipped task is not updated, the skipped task (e.g., the 4-GPU task) is guaranteed to have the minimum pass value in the next time quantum and will then be scheduled. Thus, from a fairness perspective, group-aware stride results in a service delay of at most one time quantum compared to the classical stride algorithm. Finally, due to the deterministic nature of stride scheduling, the time interval required to provide fairness guarantees with group-aware stride is significantly shorter than with, for example, a probabilistic group-aware random-selection scheduling algorithm.
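A minimal sketch of this group-aware stride loop, for one time quantum, is shown below. The task fields (tickets, gpus_needed, pass_value) and helper names are assumptions for illustration; the patent's own pseudocode listing for Algorithm 1 is not reproduced here.

```python
# Illustrative sketch of group-aware stride scheduling for one time quantum.
# Task fields and helper names are assumptions, not the actual pseudocode of Algorithm 1.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    tickets: int          # share of resources
    gpus_needed: int
    pass_value: float = 0.0

    @property
    def stride(self) -> float:
        # Stride is inversely proportional to the task's tickets.
        return 1.0 / self.tickets

def schedule_quantum(tasks: list[Task], free_gpus: int) -> list[Task]:
    """Pick tasks for the next time quantum, smallest pass value first (group-aware)."""
    scheduled = []
    for task in sorted(tasks, key=lambda t: t.pass_value):
        if task.gpus_needed <= free_gpus:
            # Task fits: schedule it and advance its pass value by its stride.
            scheduled.append(task)
            free_gpus -= task.gpus_needed
            task.pass_value += task.stride
        # Otherwise skip the task but keep its pass value, so it has the
        # smallest pass value (highest priority) in the next quantum.
        if free_gpus == 0:
            break
    return scheduled
```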
Consider now a large task (e.g., multi-node workload 710) that requires GPUs across multiple servers. One option is to run the group-aware stride algorithm across the entire cluster. Such a scheduling process would be inefficient, however, because it would result in too much migration (a task could be scheduled onto any GPU in the cluster in each time quantum).
A key requirement for scheduling large tasks across multiple servers or nodes is coordinated group awareness, i.e., for a given large task, all GPUs across the multiple servers must be allocated within the same time quantum. On the other hand, for scalability and efficiency reasons, it is preferable that each server independently run a group-aware stride algorithm for small tasks (e.g., node workloads 714, 716, and/or 718). As described with respect to central scheduler 702 and node schedulers 704-708, some examples of system 700 use split stride schedulers in order to balance these conflicting goals. The central scheduler 702 maintains a pass value for each large task (e.g., multi-node workload 710) and one aggregate task/pass value for each server (e.g., aggregate node workload data 712). The aggregate pass value of an aggregate task is inversely proportional to the cumulative tickets of all subtasks on that server or node. When the central scheduler 702 runs group-aware stride, it selects either an aggregate task or a large task based on the minimum pass value; in the former case, it simply instructs the corresponding server to run its own group-aware stride and schedule from its local pool of small tasks (e.g., node workloads 714, 716, and 718), while in the latter case, it instructs the corresponding servers to run the large task. In this manner, system 700 enables the coordination necessary for group-aware scheduling of large tasks while allowing independent servers to schedule small tasks.
The fairness provided by the split stride scheduler depends on the balancing of task tickets across servers. If all servers are load balanced, i.e., have the same aggregate tickets, the fairness provided by the split stride scheduler is provably the same as running a single cluster-level group-aware stride scheduler (but without the inefficiency of continuous migration). Assume that the subtasks in each server have a total of 200 tickets and that an 8-GPU task has 100 tickets. In this case, the small tasks in a server would collectively get 2 out of every 3 time quanta, while the 8-GPU task would get 1 out of every 3. Balancing the load between servers is therefore critical to fairness and efficiency, as discussed in further detail below.
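The two-level selection the central scheduler performs each quantum can be sketched as follows. The aggregate-task representation, function name, and return convention are illustrative assumptions that build on the stride sketch above.

```python
# Illustrative sketch of the split stride scheduler's central decision for one quantum.
# Names and structures are assumptions; task objects need only .tickets and .pass_value.
def central_quantum(large_tasks, servers, aggregate_pass):
    """large_tasks: list of multi-node tasks.
    servers: dict server_id -> list of that server's local small tasks.
    aggregate_pass: dict server_id -> aggregate pass value for that server."""
    candidates = []
    for server_id, small_tasks in servers.items():
        total_tickets = sum(t.tickets for t in small_tasks)
        if total_tickets > 0:
            candidates.append(("server", server_id, aggregate_pass[server_id], total_tickets))
    for task in large_tasks:
        candidates.append(("large", task, task.pass_value, task.tickets))

    kind, obj, _, tickets = min(candidates, key=lambda c: c[2])
    if kind == "server":
        # Tell that server to run its own group-aware stride on its small-task pool.
        aggregate_pass[obj] += 1.0 / tickets
        return ("run_local_group_aware_stride", obj)
    # Tell the servers hosting this large task to run it in the same quantum.
    obj.pass_value += 1.0 / tickets
    return ("run_large_task", obj)
```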
An example technical structure in system 700 for ensuring fair sharing between users and/or tenants is to distribute ticket loads across nodes as evenly as possible, and then schedule the tasks in each node using the split stride scheduler in proportion to each task's tickets. Formally, let t_i be the tickets of the i-th user, let j_i1, j_i2, ..., j_in_i be that user's tasks, and let r_ab be the number of GPUs required by task j_ab. The ticketsPerGPU_i of user i is then defined as:
ticketsPerGPU_i = t_i / (r_i1 + r_i2 + ... + r_in_i), (1)
where n_i is the number of tasks of user i. ticketsPerGPU_i may be considered the number of tickets the user will use for their workload on each GPU.
The ticket load per GPU on the l-th node is defined as:
ticketLoadPerGPU_l = (sum over tasks j_ab in A_l of ticketsPerGPU_a * r_ab) / g_l, (2)
where g_l is the GPU count on node l and A_l is the set of tasks scheduled on node l. ticketLoadPerGPU_l may be considered the number of tickets each GPU must service on a particular node. In some examples, the system 700 is configured to distribute ticketLoadPerGPU fairly across all nodes, ensuring that all nodes service a similar number of tickets. Then, by ensuring that each node performs fair scheduling locally, in proportion to the local tickets, a fair distribution of GPU computation across tasks (in proportion to their tickets) can be achieved.
To fairly distribute the load among all nodes, a new task is dispatched to the node with the smallest value of ticketLoadPerGPU. Note that since the new task changes its user's ticketsPerGPU (equation 1), the ticketLoadPerGPU of each node is recalculated from the updated ticketsPerGPU, and then the node with the smallest value is selected.
To schedule tasks within a node, the tickets to be used for each task are calculated. For task j_ik this is given simply by:
jobTickets_ik = ticketsPerGPU_i * r_ik, (3)
where i is the user of the task and r_ik is the number of GPUs needed by the task. To ensure local fair sharing, the schedulers (e.g., node schedulers 704-708) then use the split group-aware stride scheduling algorithm described in the previous section to assign each task an amount of time proportional to its jobTickets.
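Equations (1)-(3) and the dispatch rule above can be illustrated with a short sketch. The data structures and names below are assumptions for illustration only.

```python
# Illustrative computation of ticketsPerGPU (1), ticketLoadPerGPU (2), and jobTickets (3),
# plus dispatch of a new task to the least-loaded node. Names are assumptions.

def tickets_per_gpu(user_tickets: float, user_task_gpus: list[int]) -> float:
    # Equation (1): the user's tickets divided by the total GPUs requested by the user's tasks.
    return user_tickets / sum(user_task_gpus)

def job_tickets(tickets_per_gpu_i: float, gpus_needed: int) -> float:
    # Equation (3): the task's share of its user's tickets.
    return tickets_per_gpu_i * gpus_needed

def ticket_load_per_gpu(node_tasks, node_gpu_count: int) -> float:
    # Equation (2): total job tickets scheduled on the node, per GPU on the node.
    # node_tasks: iterable of (tickets_per_gpu_of_owner, gpus_needed) pairs.
    return sum(tpg * r for tpg, r in node_tasks) / node_gpu_count

def pick_node(nodes) -> str:
    # Dispatch rule: choose the node with the smallest ticketLoadPerGPU, recomputed
    # after accounting for the new task's effect on its user's ticketsPerGPU.
    # nodes: dict node_id -> (node_tasks, gpu_count)
    return min(nodes, key=lambda n: ticket_load_per_gpu(*nodes[n]))
```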
Note that the load of the servers may become unbalanced due to abrupt task departures. In some examples, system 700 is configured to repair such imbalances using task/workload migration as described herein. Furthermore, for simplicity, the example assumes that each user has enough tasks to fully utilize their GPU share. However, a user may submit significantly fewer tasks than are required to utilize the user's full share. In this case, the system 700 may first meet the needs of these users, remove them from the list, recalculate the weights of the remaining users, and iterate to determine a fair share for each remaining user. In this way, a sufficiently loaded user may obtain a bonus share from users who are not fully utilizing their shares.
To perform this iteration efficiently, a process can be used to calculate the effective tickets for each user and then follow the algorithm described above. Finally, care must be taken during placement and migration to ensure that tasks are "packed" as tightly as possible on the servers to avoid non-work-conserving scenarios, as described in the next section.
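One way to realize this iterative redistribution is a water-filling style loop. The sketch below is an illustration under assumed names and inputs, not the patent's own procedure; the example values mirror the 8-GPU scenario discussed above.

```python
# Illustrative iteration for computing each user's effective share of GPUs when some
# users cannot use their full fair share. All names and structures are assumptions.

def effective_shares(total_gpus: float, demands: dict[str, float]) -> dict[str, float]:
    """demands: user -> number of GPUs the user's submitted tasks can actually use."""
    shares = {}
    remaining_gpus = total_gpus
    active = dict(demands)
    while active:
        fair = remaining_gpus / len(active)
        # Users whose demand is below the current fair share are fully satisfied.
        satisfied = {u: d for u, d in active.items() if d <= fair}
        if not satisfied:
            # Every remaining user can use the fair share; split the remainder equally.
            for user in active:
                shares[user] = fair
            break
        for user, demand in satisfied.items():
            shares[user] = demand
            remaining_gpus -= demand
            del active[user]
    return shares

# Example: 8 GPUs, user C can only use 2 -> C gets 2, A and B get 3 each.
print(effective_shares(8, {"A": 8, "B": 8, "C": 2}))
```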
A key aspect in determining the efficiency of a scheduler is whether the scheduler is work-conserving, i.e., whether it leaves resources idle while there are active tasks that could be scheduled on them. The combined requirement of fairly handling variable-size tasks makes it challenging for the scheduler to remain work-conserving in all cases. For example, consider a 4-GPU server on which only one 1-GPU task and one 4-GPU task are active, each with the same number of tickets. To ensure fair sharing, three GPUs may sit idle every other time quantum while the 1-GPU task executes.
In some examples, system 700 is configured to resolve this apparent conflict between fairness and efficiency by utilizing two domain-specific customizations. First, the system 700 may use a mechanism that allows efficient on-demand migration of tasks across servers. Second, the particular workload mix of DLT tasks in a large cluster lends itself well to a migration policy that intelligently packs executing tasks to avoid such pathological scenarios.
So far, the described example of the scheduler design has assumed that the GPUs/resources for executing the workloads/tasks are homogeneous. However, in some scenarios, the GPUs or other resources may be heterogeneous (e.g., a GPU set may include multiple different types of GPUs with different performance attributes). In some examples, system 700 is configured to handle GPU heterogeneity in a manner that is transparent to users and/or tenants. This has two components: first, transparently assigning tasks to GPUs of a particular model, and second, allowing two users to automatically trade their GPU assignments to benefit each other.
Consider a cluster with a mix of V100 and K80 GPUs. Based on a given user's tickets, in one example, the user's fair share allocation is 4 V100 GPUs and 4 K80 GPUs. The following table includes example detailed information regarding the relative performance of V100 and K80 under different types of workloads:
TABLE 1
If the user wants a particular GPU model, the user can specify that along with the task, and the scheduler simply "pins" the task to that GPU model; the rest of this discussion then does not apply to that task. However, in other examples, most users do not want to pin GPUs, because automatic trading allows them to achieve higher throughput than pinning, as described below. In that case, when a given user's task arrives, where should it be assigned: on a V100 or a K80?
In some examples, system 700 is configured to assume a strict priority order among the various GPU models, with newer GPUs, e.g., V100, having a higher priority than older GPUs, e.g., K80. When a new task arrives, the scheduler automatically selects the newer GPU (V100), dispatches the task there, and profiles its performance. If the user is performing hyper-parameter tuning, the user will submit more, similar tasks. For example, if a user submits 8 tasks, these tasks will all be placed on V100s and time-shared. When tasks are time-shared and the scheduler estimates that a task's memory requirements fit the older K80, the scheduler may transparently migrate that task to a K80 and monitor and/or profile its performance there. The scheduler may then be configured to compare the performance of the tasks on V100 and K80 and decide how best to optimize the performance for the user.
For example, assume the tasks are type A tasks (e.g., variational autoencoder (VAE) training), which achieve a 1.25-fold acceleration on V100 compared to K80 (see the example statistics of table 1). In this case, since these tasks are currently time-sharing the V100s, the scheduler transparently migrates some of the time-shared tasks to K80s so that four tasks run entirely on the V100s and four tasks run on the K80s. In contrast, if the tasks are type B tasks that get a 5-fold acceleration on V100 (e.g., deep convolutional generative adversarial network (DCGAN) training), the example system will continue to time-share the 8 tasks on the V100s so that these tasks run at an average speed of 2.5 times that of a K80. Thus, system 700 may be configured to automatically select the GPU model that maximizes the performance of each task.
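The decision in this example reduces to comparing the speedup on the newer GPU against the time-sharing factor. The sketch below illustrates that rule under assumed names and a 2-way sharing scenario; it is not the patent's exact policy.

```python
# Illustrative rule for the example above: with N similar tasks time-sharing the newer
# GPUs, keep them there only if speedup / sharing_factor beats running alone on the
# older GPU. The function name and the rule itself are assumptions drawn from the example.
def prefer_newer_gpu(speedup_newer_vs_older: float, sharing_factor: float) -> bool:
    return speedup_newer_vs_older / sharing_factor > 1.0

# Type A (VAE, 1.25x) with 8 tasks over 4 V100s (2-way sharing): 1.25 / 2 < 1 -> migrate half to K80.
print(prefer_newer_gpu(1.25, 2.0))  # False
# Type B (DCGAN, 5x): 5 / 2 > 1 -> keep time-sharing on V100 (average 2.5x).
print(prefer_newer_gpu(5.0, 2.0))   # True
```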
In some examples, once tasks are assigned their GPUs, system 700 is configured to support GPU trades between users to further improve efficiency. To support automatic GPU trading, such an example system utilizes a key observation in the example statistics of table 1: different types of DLT tasks derive variable marginal utility from newer GPU generations (e.g., each type of task in table 1, namely type A, type B, and type C, has a different "acceleration" factor, which represents the degree to which a V100 accelerates the task compared to the same task executing on a K80).
One of the unique aspects of DLT tasks is the need to train the model with various hyper-parameters in order to identify the best accuracy. Thus, a user submits a so-called multi-task, typically 10-100 copies of the DLT task, each copy having different hyper-parameters such as learning rate, weight decay, etc. Critically, the performance characteristics are therefore the same across the large number of tasks from a given user. In such a scenario, the trades described herein may be used to increase the application throughput of both users participating in the trade.
Consider a cluster with 60 K80s and 12 V100s and three active users running three types of tasks, respectively: type A, type B, and type C tasks (see table 1). In some examples, the tasks may be a VAE task, a DCGAN task, and a residual neural network (ResNext) task, respectively. Assume each user submits tens of tasks to the cluster as part of their hyper-parameter tuning experiments, so that the cluster is fully utilized. If the scheduler only provided fair shares, each user would be allocated 20 K80s and 4 V100s (see the top section of table 1). In terms of performance, the user running type A tasks would see a 1.25-fold acceleration on V100, for an aggregate performance of 20 + 4 x 1.25 = 25 normalized K80 GPUs. Likewise, the user running type B tasks would see a 5-fold acceleration on V100, for an aggregate performance of 20 + 4 x 5 = 40 normalized K80 GPUs. Finally, the user running type C tasks would see a 6.25-fold acceleration on V100, for an aggregate performance of 20 + 4 x 6.25 = 45 normalized K80 GPUs.
Since different users see different marginal utility from using a V100 rather than a K80, there is room to increase overall efficiency by using trades. User A (e.g., the user running type A tasks) benefits least from V100s (a 1.25-fold acceleration), while user C benefits most (a 6.25-fold acceleration). The efficiency (in terms of task progress) of a V100 assigned to user C (e.g., the user running type C tasks) is 5 times higher than that of a V100 assigned to user A. Thus, it makes sense to trade user A's V100s to user C in exchange for K80s, so that the V100s go where they deliver the most efficiency.
In some examples, a simple solution would be to trade 1 of user A's V100s for 1.25 of user C's K80s. In this case, all of the efficiency gains (the trade surplus per V100) accrue to user C. However, this solution may be susceptible to gaming by a strategic user. Consider user B (e.g., a user running type B tasks) who modifies the task to examine the GPU architecture it executes on and simply slows its own performance when running on K80, such that the overall acceleration of the user's model on V100 appears to be 6.5-fold compared to K80. User B would then win the trade, and the gains from the V100 would more than compensate for the artificial slowdown on K80. Thus, the trade price must be chosen carefully to avoid such gaming. Another solution would be to trade 1 of user A's V100s for 6.25 of user C's K80s. In this case, the trade surplus accrues entirely to user A, and user C has no incentive to trade.
In some examples, the system 700 is configured to solve this pricing dilemma by using a second-price auction mechanism. Specifically, system 700 may perform the trade at the second-highest price, i.e., user C trades 5 K80s for each of user A's V100s, where the rate of 5 is determined by the second price (user B's acceleration; if no such user exists, the surplus is split equally). A second-price auction has properties such as incentive compatibility, which means that each user achieves the best outcome by revealing their true preferences. In this scenario, artificially speeding up or slowing down their jobs does not help a user. In addition, both parties to the trade benefit from some of the remaining efficiency gains.
The Vickrey auction (the second-price auction mechanism described above) also has some weaknesses. In particular, the Vickrey auction is not collusion-proof: if all bidders in the auction disclose their prices to each other, they can collectively lower their valuations. However, in the described systems and methods, collusion is not a major issue, as the auction is primarily used to distribute the performance gains from trades; each user in a trade is still guaranteed to obtain its fair share.
The lower part of table 1 shows the allocation at the end of four such trades. User A has 40 K80s with an aggregate performance of 40 (25 before), and user C has 8 V100s with an aggregate performance of 50 (45 before). Thus, both users obtain better overall performance from the trades, and the overall efficiency of the cluster is maximized while ensuring fairness among users.
In some examples, the system 700 is configured to use profiling to continuously maintain task performance statistics, such as mini-batch progress rate. For tasks that change their performance behavior during training, the system 700 may be configured to detect such changes and undo a trade if necessary. Additionally or alternatively, while the trades in the examples provided are for hyper-parameter tuning tasks in which the tasks of a given user are homogeneous, in other examples, trades may be used for heterogeneous tasks without departing from the present description.
An example process for assigning tasks in system 700 is outlined below in pseudocode. The time quantum may be set to, for example, one or two minutes to ensure that the overhead of GPU context switching stays below 1%.
Algorithm 2: Assigning tasks
Data: the set of users, GPUs, servers, and tasks.
Result: assignment of tasks with their associated tickets.
In some examples, the system 700 is configured to maintain updated values of the three key variables described in equations 1-3 (ticketsPerGPU, ticketLoadPerGPU, and jobTickets) to make its scheduling decisions. Algorithm 2 shows how tasks are first assigned to nodes. In the example using algorithm 2, scheduling subsystem 700 maintains a task queue for each user. The scheduling subsystem 700 first finds the user that is using the least resources so far, according to ticketsPerGPU (line 2), and picks one task from that user to schedule based on the task's priority/arrival time (high-priority and earliest-submitted tasks are preferred). If the task (e.g., part of a multi-task) has been seen previously and the system 700 has profiling information for the task on different GPU models, then the system 700 selects the fastest GPU model (lines 5, 6), as discussed herein. Note that job.perf[g] refers to the mini-batch duration of the task on GPU g. Otherwise, it simply selects the newest GPU model (e.g., V100) in line 8. Once the GPU model is selected, the actual node on which the task is scheduled is the node with the lowest ticketLoadPerGPU (lines 11, 12). The scheduler calculates jobTickets, i.e., the ticket amount for the task (line 9), and then schedules the task on the selected node. The split group-aware stride scheduler uses jobTickets to schedule tasks in each time quantum and ensure fair sharing. It should be appreciated that the processes described in this paragraph are exemplary, and in other examples, other processes or algorithms may be used to distribute tasks in the system 700 without departing from this description.
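Since the pseudocode listing itself is not reproduced above, the following sketch illustrates the assignment flow described in this paragraph. The data structures, field names (e.g., job.perf), and the exact "least resources" metric are assumptions and do not reproduce the patent's listing.

```python
# Illustrative sketch of the task-assignment flow described for Algorithm 2.
# Structures and names are assumptions; they do not reproduce the patent's listing.
def assign_next_task(users, nodes, newest_gpu="V100"):
    # Pick the user that has used the least resources so far relative to its tickets
    # (the precise metric, based on ticketsPerGPU, is an assumption here).
    user = min(users, key=lambda u: u.gpus_in_use / u.tickets)
    # Then that user's highest-priority, earliest-submitted queued task.
    job = min(user.queue, key=lambda j: (-j.priority, j.arrival_time))

    # Choose a GPU model: fastest known from profiling, else the newest model.
    if job.perf:  # job.perf[g] = profiled mini-batch duration on GPU model g
        gpu_model = min(job.perf, key=job.perf.get)
    else:
        gpu_model = newest_gpu

    # jobTickets per equation (3), then place on the node of that GPU model
    # with the lowest ticketLoadPerGPU.
    job_tickets = user.tickets_per_gpu * job.gpus_needed
    node = min((n for n in nodes if n.gpu_model == gpu_model),
               key=lambda n: n.ticket_load_per_gpu)
    return job, job_tickets, node
```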
In some examples, system 700 is configured to use task profiling to determine the acceleration of each task on the various GPU models in the system. The profiling statistics are collected in the course of task scheduling, so no additional overhead is generated. For example, when a user submits their first task, the task may be scheduled on the fastest available GPU (e.g., V100). In each time quantum, the task is profiled to determine the average time taken per mini-batch, and this data is collected by the scheduling subsystem 700. As the user submits more tasks, they are scheduled on V100s until the user depletes their V100 allocation (based on their tickets). The next task submitted by the user will be scheduled on the second-fastest GPU (e.g., P100), where it is then profiled. If the user is performing hyper-parameter tuning, the tasks will be similar, so the scheduling subsystem 700 can determine the acceleration of the task on V100 relative to P100. Thus, as tasks arrive and are scheduled on different GPU models, the profiled values are updated to an average of the latest statistics to reflect the current acceleration. By maintaining an average of recent statistics, system 700 may in such examples detect and adapt to jobs that change their performance in the middle of training.
In some examples, system 700 is configured to evaluate task migration options in each time quantum. In a homogeneous cluster, migration may be used only for load balancing: if the difference in ticketLoadPerGPU between the highest-loaded node or server and the lowest-loaded node or server is above a threshold, tasks may be migrated from the former to the latter. In heterogeneous clusters, migration may also be used to improve performance and efficiency through trades. In such an example, the scheduling subsystem 700 looks at the average progress of each user's tasks on each GPU model and at ticketsPerGPU, and calculates the effective performance (mini-batch rate divided by ticketsPerGPU). Based on this profiling information, if a user would benefit from migrating a task from one GPU model to another, and the resulting improvement is greater than a threshold, the scheduler adds the user to a candidate migration list. Of all such (user, source GPU model, target GPU model) tuples, the scheduler selects the tuple with the largest gain and migrates a task to the node of the target GPU model with the lowest ticketLoadPerGPU.
In some examples, system 700 is configured to evaluate GPU trade options in each time quantum. For each (fast GPU model, slow GPU model) pair (e.g., V100, K80), the scheduling subsystem 700 may find the "seller" (the user with the greatest acceleration on the fast GPU relative to the slow GPU) and the "buyer" (the user with the least acceleration). The trade price is set at the second-highest price, i.e., if the acceleration of the user with the second-greatest acceleration is r, then r slow GPUs are swapped for 1 fast GPU between the buyer and the seller, and the work of the seller and the buyer is then migrated to the traded GPUs. Each trade results in an efficiency increase corresponding to the difference between the accelerations of the buyer and the seller, and the efficiency benefit is distributed to the buyer and the seller based on the second price. Further, upon job arrivals and departures, a trade check is performed to determine whether certain trades must be undone (e.g., a new user arrives and the traded GPUs are needed to maintain fair shares).
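The per-quantum trade evaluation described above might be sketched as follows. The roles, names, and matching rule are assumptions for illustration; the example keeps the buyer/seller terminology exactly as used in this description and the Table 1 speedups referenced earlier.

```python
# Illustrative second-price trade evaluation for one (fast GPU, slow GPU) pair.
# Speedups are each user's measured acceleration on the fast GPU vs. the slow GPU.
def evaluate_trade(speedups: dict[str, float]):
    """speedups: user -> acceleration on the fast model relative to the slow model."""
    if len(speedups) < 2:
        return None  # need at least two users for a trade
    ranked = sorted(speedups.items(), key=lambda kv: kv[1], reverse=True)
    seller, seller_speedup = ranked[0]      # greatest acceleration
    buyer, buyer_speedup = ranked[-1]       # least acceleration
    second_price = ranked[1][1]             # second-greatest acceleration sets the rate
    if seller_speedup <= buyer_speedup:
        return None  # no efficiency gain available
    # Swap `second_price` slow GPUs for 1 fast GPU between buyer and seller.
    return {"seller": seller, "buyer": buyer,
            "slow_gpus_per_fast_gpu": second_price,
            "efficiency_gain": seller_speedup - buyer_speedup}

# With the example speedups (A: 1.25, B: 5, C: 6.25), the rate is 5 slow GPUs per fast GPU.
print(evaluate_trade({"A": 1.25, "B": 5.0, "C": 6.25}))
```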
In some examples, the run time of the auction described for trading GPUs or other resources is O(number of users), and the run time of the allocation is O(number of users + number of tasks). Thus, in such examples of system 700, scaling with the number of users/tasks is not computationally challenging.
In some examples, system 700 is configured to use KUBERNETES as a cluster manager with a custom scheduler that assigns tasks to nodes. Tasks may be submitted as DOCKER containers. Other tools and/or elements of system 700 may include SCALA code, the AKKA Actors library for concurrency, and/or GRPC Remote Procedure Calls (GRPC) for performing remote procedure calls. In other examples, system 700 may be implemented using more, fewer, or different tools and/or elements without departing from this description.
In some examples, the manager of system 700 is configured to expose REST APIs and GRPC endpoints for clients to connect to the scheduling subsystem 700. The system 700 may make decisions such as placement, migration, ticket distribution, management of bonus tickets, trades, etc. There may be one global executor that performs group scheduling of multi-server tasks, while each server in the cluster may have one local executor; together they are responsible for running tasks on the servers in proportion to the tickets assigned by the scheduler. Finally, a client running within the container with the task also exposes a GRPC endpoint and is responsible for receiving commands from the executor to perform actions such as suspend/resume and checkpoint/migrate, to report task metadata, and to report the status of the running task.
In some examples, a key mechanism used by system 700 is the ability to migrate tasks between nodes. To migrate tasks, they may be checkpointed on demand and then restored on different nodes. Some DLT tasks are written with checkpointing so that they can recover from a previous checkpoint (if present), but few tasks implement such checkpointing. Furthermore, even those DLT tasks that use checkpoints are typically checkpointed only at each training epoch, and an epoch may last several hours or more. While such checkpoints are useful for tolerating occasional server failures, in some examples the system 700 requires more fine-grained checkpoints for fairness and efficiency. Thus, the system 700 may be configured to implement an automatic on-demand checkpointing mechanism.
In some examples, to support task migration, system 700 is configured with modified PYTORCH and TENSORFLOW frameworks. Such an implementation can handle unmodified user code and requires only targeted modifications to the two frameworks. While generic process migration tools, such as checkpointing libraries, exist, they cannot handle processes with GPU state. In some examples of system 700, a proxy process is forked from the main process, and all GPU calls made by the process are intercepted and directed via the proxy process. In this way, the address space of the host process retains only CPU state and can be easily checkpointed with a checkpointing library. The proxy process is configured and responsible for 1) translating all GPU handles, such as streams, contexts, etc., 2) keeping a log of all state-changing GPU calls so that they can be replayed upon restore, and 3) memory management of GPU memory. In some examples, the memory manager of system 700 maps virtual address space to physical GPU address space in a consistent manner across migration, such that pointers to GPU memory remain completely transparent to the parent process. Upon checkpointing, the proxy's memory manager copies the GPU state to the parent's CPU memory and terminates. The parent process may then be migrated using the checkpointing library. After the restore, the proxy process replays the log of state-changing GPU calls and copies the GPU memory back. In some examples, all communication between the proxy and parent processes is handled through shared memory with negligible overhead. The proxy implementation remains unchanged between PYTORCH and TENSORFLOW, with minimal modifications to the actual frameworks. In other examples of system 700, other methods of migrating tasks may be used without departing from the description.
In some examples, system 700 is configured to minimize migration performance overhead by implementing a three-phase context switch referred to as suspend-preload-resume. When the framework is notified to suspend, it completes the suspend in about 100 milliseconds by copying the minimal data from the GPU (proxy process) to CPU memory (parent process) at the end of a mini-batch, allowing the scheduler to run other tasks on the GPU. If the task needs to migrate across servers, the scheduler checkpoints the task container and restores it on the target server. The framework then waits for a preload notification. When it receives the preload, it sets up state on the new GPU by replaying the log of all stateful operations, but does not yet resume. The preload thereby hides the roughly 5-second delay of GPU context initialization. Finally, when the framework is notified to resume, it copies the data back to GPU memory, which takes approximately 100 ms, and quickly resumes GPU computation. Thus, migration occurs primarily in the background while other tasks utilize the GPU. In other examples of system 700, other methods of optimizing task migration may be used without departing from the description.
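A highly simplified, simulated sketch of the three phases is shown below. The class, method names, and placeholder "GPU" dictionary are assumptions, not the modified frameworks' actual hooks.

```python
# Minimal, simulated sketch of the suspend-preload-resume phases described above.
# All names are illustrative assumptions; the "GPU" here is just a placeholder dict.
class MigratableJob:
    def __init__(self):
        self.gpu_call_log = ["create_stream", "create_context"]  # state-changing calls
        self.cpu_buffer = None

    def suspend(self, gpu: dict):
        # ~100 ms in the described system: stage minimal GPU state into CPU memory
        # at a mini-batch boundary so another task can use the GPU.
        self.cpu_buffer = dict(gpu)   # placeholder for the GPU-to-CPU copy
        gpu.clear()

    def preload(self, new_gpu: dict):
        # Hide the ~5 s GPU context initialization by replaying logged stateful
        # calls on the new GPU without resuming computation yet.
        new_gpu["replayed_calls"] = list(self.gpu_call_log)

    def resume(self, new_gpu: dict):
        # ~100 ms: copy staged data back to GPU memory and continue the job.
        new_gpu.update(self.cpu_buffer or {})
```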
Fig. 8 is a flowchart illustrating a method for scheduling execution of AI workloads in a cloud infrastructure platform using a split scheduling subsystem, according to an embodiment. In some examples, method 800 is performed by a scheduling subsystem, such as system 700 of fig. 7. At 802, a partitioned scheduling subsystem of a cloud infrastructure platform receives a set of AI workloads to execute. Each AI workload in the set of AI workloads is associated with a resource ticket value that indicates a share of the resource to be used to execute the AI workload. As described herein with reference to at least fig. 5 and 7, the split scheduling subsystem may include a global scheduler and a plurality of regional schedulers.
At 804, the set of AI workloads is assigned to the set of nodes of the cloud infrastructure platform in a balanced manner (e.g., based on balancing the resource ticket values of the AI workloads on the set of nodes). The allocation of AI workloads to node sets may be performed by a global scheduler of the cloud infrastructure platform. In some examples, balancing the resource ticket values for the AI workloads on the set of nodes includes assigning each AI workload to a compatible node having a current lowest resource ticket value (e.g., a total number of tickets associated with AI workloads currently assigned to the node), and then updating the resource ticket values for the node such that the resource ticket values for the set of nodes tend to remain nearly balanced.
At 806, the local scheduler of the node schedules the assigned AI workload on the infrastructure resources based on the resource ticket value of the AI workload. In some examples, scheduling the AI workloads based on the resource ticket values includes multiplexing the plurality of AI workloads on the infrastructure resources proportionally based on the resource ticket values of the plurality of AI workloads. Additionally or alternatively, the multi-node AI workloads may be scheduled on the subset of nodes based on a balanced sum of resource ticket values of the AI workloads allocated to the subset of nodes, wherein the multi-node AI workloads are scheduled to execute on the subset of nodes simultaneously. Further, scheduling the multi-node AI workload by the global scheduler and the AI workload distributed to the nodes by the local scheduler includes scheduling the AI workload during execution of the AI workload using a stride mechanism for each time quantum.
Additionally or alternatively, scheduling the AI workload may include identifying an AI workload that is currently being performed using the infrastructure resources of the first node and migrating the identified AI workload to another node, thereby freeing up the infrastructure resources of the first node for performing the predetermined AI workload. Further, identifying the AI workloads may include identifying the AI workloads based on AI workloads associated with lower priority than at least one AI workload of the subset of AI workloads.
At 808, based on scheduling the AI workload on the node, the coordinator service of the node performs the AI workload on the infrastructure resources of the node as described herein.
In some examples, the method 800 further includes monitoring performance of the AI workloads using the infrastructure resources based on the heterogeneous type of infrastructure resources and based on monitoring of a performance difference between the first AI workload and a second AI workload based on the heterogeneous type of infrastructure resources, the infrastructure resources being traded between the first AI workload and the second AI workload to balance the indicated performance difference. Additionally, trading infrastructure resources between the first and second AI workloads includes determining a swap rate for trading the first type of infrastructure resources for the second type of infrastructure resources, wherein the determined swap rate facilitates the first AI workload and the second AI workload and assigning a first amount of the first type of infrastructure resources associated with the first AI workload to the second AI workload, and assigning a second amount of the second type of infrastructure resources associated with the second AI workload to the first AI workload, wherein the first amount and the second amount are based on the determined swap rate. In some examples, the determined exchange rate is based on a second price auction mechanism as described herein.
Exemplary Operating Environment
Fig. 9 is a block diagram of an example computing device 900 for implementing aspects disclosed herein and designated generally as computing device 900. Computing device 900 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the examples disclosed herein. Neither should the computing device 900 be interpreted as having any dependency or requirement relating to any one or combination of components/modules illustrated. Examples disclosed herein may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions (e.g., program components) being executed by a computer or other machine (e.g., a personal data assistant or other handheld device). Generally, program components, including routines, programs, objects, components, data structures, etc., refer to code that perform particular tasks or implement particular abstract data types. The disclosed examples may be practiced in various system configurations, including personal computers, laptop computers, smart phones, mobile tablets, hand-held devices, consumer electronics, special purpose computing devices, and the like. The disclosed examples may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
Computing device 900 includes a bus 910 that directly or indirectly couples the following devices: computer storage memory 912, one or more processors 914, one or more presentation components 916, I/O ports 918, I/O components 920, a power supply 922, and a network component 924. Although computing device 900 is depicted as appearing to be a single device, multiple computing devices 900 may work together and share the described device resources. For example, memory 912 may be allocated across multiple devices, and processor 914 may be housed in different devices.
Bus 910 represents what may be one or more busses (e.g., an address bus, a data bus, or a combination thereof). Although the various blocks of FIG. 9 are shown with lines for the sake of clarity, alternate representations may be used to depict the various components. For example, presentation components such as display devices are I/O components in some examples, and some examples of processors have their own memory. No distinction is made between categories such as "workstation," "server," "laptop," "handheld," etc., as all are contemplated within the scope of FIG. 9 and the references herein to a "computing device." Memory 912 may take the form of the computer storage media referenced below and is operable to provide storage of computer-readable instructions, data structures, program modules, and other data for computing device 900. In some examples, memory 912 stores one or more of an operating system, a general-purpose application platform, or other program modules, and program data. Thus, the memory 912 is capable of storing and accessing data 912a and instructions 912b, the data 912a and instructions 912b being executable by the processor 914 and configured to perform various operations disclosed herein.
In some examples, memory 912 includes computer storage media in the form of volatile and/or nonvolatile memory, removable or non-removable memory, data disks in a virtual environment, or a combination thereof. Memory 912 may include any amount of memory associated with computing device 900 or accessible by computing device 900. Memory 912 may be internal to computing device 900 (as shown in fig. 9), external to computing device 900 (not shown), or both (not shown). Examples of memory 912 include, but are not limited to, random Access Memory (RAM); read Only Memory (ROM); an Electrically Erasable Programmable Read Only Memory (EEPROM); flash memory or other storage technology; CD-ROM, digital Versatile Disks (DVD) or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices; a memory connected to the analog computing device; or any other medium used to encode desired information and be accessed by computing device 900. Additionally or alternatively, memory 912 may be allocated across multiple computing devices 900, e.g., in a virtualized environment, where instruction processing is performed across multiple computing devices 900. For purposes of this disclosure, "computer storage medium," "computer storage memory," "memory," and "memory device" are synonymous terms of computer storage memory 912, and none of these terms include carrier waves or propagating signals.
Processor 914 may include any number of processing units to read data from various entities such as memory 912 or I/O component 920. In particular, processor 914 is programmed to execute computer-executable instructions to implement the various aspects disclosed. The instructions may be executed by a processor, by multiple processors within computing device 900, or by a processor external to client computing device 900. In some examples, processor 914 is programmed to execute instructions such as those shown in the flowcharts discussed below and depicted in the figures. Further, in some examples, processor 914 represents an implementation of simulation techniques for performing the operations described herein. For example, the operations are performed by analog client computing device 900 and/or digital client computing device 900. The presentation component 916 presents data indications to a user or other device. Exemplary presentation components include a display device, speakers, a printing component, a vibration component, and the like. Those skilled in the art will understand and appreciate that computer data may be presented in a variety of ways, such as visually in a Graphical User Interface (GUI), audibly through speakers, wirelessly between computing devices 900, through a wired connection, or otherwise. The I/O ports 918 allow the computing device 900 to be logically coupled to other devices, some of which may be built-in, including I/O components 920. Example I/O components 920 include, for example, but are not limited to, a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, or the like.
The computing device 900 may operate in a networked environment using logical connections to one or more remote computers via a network component 924. In some examples, network component 924 includes a network interface card and/or computer-executable instructions (e.g., drivers) for operating the network interface card. Communication between computing device 900 and other devices may occur over any wired or wireless connection using any protocol or mechanism. In some examples, network component 924 is operable to wirelessly transfer data between devices using short-range communication technologies (e.g., near Field Communication (NFC), bluetooth brand communication, etc., or a combination thereof) over public, private, or mixed (public and private) using a transport protocol. Network component 924 communicates with cloud resources 928 across network 930 via wireless communication link 926 and/or wired communication link 926 a. Various examples of communication links 926 and 926a include wireless connections, wired connections, and/or dedicated links, and in some examples, at least a portion is routed through the internet.
Although described in connection with the example computing device 900, examples of the disclosure are capable of being implemented with many other general purpose or special purpose computing system environments, configurations, or devices. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with aspects of the disclosure include, but are not limited to, smart phones, mobile tablets, mobile computing devices, personal computers, server computers, hand-held or laptop computer devices, multiprocessor systems, gaming machines, microprocessor-based systems, set top boxes, programmable consumer electronics, mobile phones, mobile computing and/or wearable or accessory-shaped communication devices (e.g., watches, glasses, headphones, or earphones), network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, virtual Reality (VR) devices, augmented Reality (AR) devices, mixed Reality (MR) devices, holographic devices, and the like. Such a system or device may accept input from a user in any manner, including from an input device such as a keyboard or pointing device, via gesture input, proximity input (such as by hovering), and/or via voice input.
Examples of the present disclosure may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices in software, firmware, hardware, or a combination thereof. The computer-executable instructions may be organized into one or more computer-executable components or modules. Generally, program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types. Aspects of the disclosure may be implemented with any number and organization of such components or modules. For example, aspects of the disclosure are not limited to the specific computer-executable instructions or the specific components or modules illustrated in the figures and described herein. Other examples of the disclosure may include different computer-executable instructions or components having more or less functionality than illustrated and described herein.
In examples involving general-purpose computers, aspects of the present disclosure transform the general-purpose computer into a special-purpose computing device when configured to execute the instructions described herein.
An example system for scheduling execution of AI workloads in a cloud infrastructure platform includes: at least one processor of the cloud infrastructure platform, wherein the at least one processor comprises at least one of: a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a hardware accelerator; at least one memory of the cloud infrastructure platform, comprising computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the at least one processor to: receiving, by a scheduler, a set of AI workloads to be executed, wherein each AI workload in the set of AI workloads is associated with a priority level in a set of priority levels, the priority levels indicating preemption priorities associated with the AI workload when executed; scheduling, by the scheduler, the set of AI workloads to a set of nodes of the cloud infrastructure platform, wherein each node in the set of nodes includes infrastructure resources for executing AI workloads, and wherein the set of AI workloads is assigned to the set of nodes based at least on a priority of AI workloads on each node in the set of nodes; the AI workload set assigned to the set of nodes is scheduled to be executed on infrastructure resources of the set of nodes by a coordinator service of the scheduler.
An example computerized method for scheduling execution of AI workloads in a cloud infrastructure platform includes: receiving, by a processor of a global scheduler, a set of AI workloads to be executed, wherein each AI workload of the set of AI workloads is associated with a resource ticket value that indicates a share of resources to be executed with that AI workload; assigning, by a processor of the global scheduler, the set of AI workloads to a set of nodes of the cloud infrastructure platform, wherein each node in the set of nodes includes infrastructure resources for executing AI workloads, and wherein the set of AI workloads is assigned to the set of nodes based on resource ticket values that balance the AI workloads on each node in the set of nodes; scheduling, by a processor of a local scheduler of a first node in the set of nodes, a subset of AI workloads in the set of AI workloads that are assigned to the first node for execution on infrastructure resources of the first node, wherein scheduling the subset of AI workloads is based on resource ticket values associated with the subset of AI workloads; the subset of AI workloads is performed on the infrastructure resources of the first node, based on scheduling the subset of AI workloads assigned to the first node, served by a coordinator of the local scheduler of the first node.
One or more computer storage media having computer-executable instructions for scheduling execution of AI workloads in a cloud infrastructure platform, which when executed by a processor, cause the processor to at least: receiving, by a global scheduler, a set of AI workloads to be executed, wherein each AI workload in the set of AI workloads is associated with a resource ticket value that indicates a share of resources to be executed with that AI workload; assigning, by a global scheduler, a set of AI workloads to a set of nodes of the cloud infrastructure platform, wherein each node in the set of nodes includes infrastructure resources for executing AI workloads, and wherein the set of AI workloads is assigned to the set of nodes based on resource ticket values that balance the AI workloads on each node in the set of nodes; scheduling, by a local scheduler of a first node in the set of nodes, a subset of AI workloads in the set of AI workloads that are assigned to the first node for execution on infrastructure resources of the first node, wherein scheduling the subset of AI workloads is based on resource ticket values associated with the subset of AI workloads; the subset of AI workloads allocated to the first node is executed on infrastructure resources of the first node, based on scheduling the subset of AI workloads served by a coordinator of the local scheduler of the first node.
Alternatively, or in addition to other examples described herein, examples include any combination of the following:
-wherein the set of priority levels comprises at least a first priority level and a second priority level, and the first priority level has a higher priority than the second priority level; wherein the AI workloads of the first priority level have priority over the AI workloads of the second priority level based on at least one of: preemption frequency, extended priority, and local priority.
-wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the at least one processor to: scheduling, by the scheduler, multi-node AI workloads on a subset of nodes of the set of nodes that are associated with the first priority, wherein scheduling the multi-node AI workloads includes preempting at least one AI workload associated with the second priority that is lower than a first priority level on the subset of nodes.
-wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the at least one processor to: monitoring performance of the set of AI workloads being performed using the infrastructure resources;
Determining, for each AI workload, a dynamic preemption score indicating a relative likelihood that the AI workload will be preempted based on monitoring performance of the set of AI workloads, wherein the dynamic preemption score is based on at least one performance threshold requirement of a priority level associated with the AI workload; identifying an AI workload for preemption based on the dynamic preemption score of the AI workload based on determining that an AI workload is to be preempted, wherein the identified AI workload has the dynamic preemption score indicating the highest likelihood of preemption among all dynamic preemption scores of the set of AI workloads; preempting the identified AI workload.
-wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the at least one processor to: identifying a set of idle infrastructure resources to allocate to the set of AI workloads; identifying a first subset of AI workloads of the set of AI workloads that is associated with a high priority level; assigning a first subset of infrastructure resources in the set of idle infrastructure resources to the first subset of AI workloads based on the extended priority requirement of the high priority level; identifying a second subset of AI workloads of the set of AI workloads that is associated with a standard priority level; assigning a second subset of infrastructure resources in the set of idle infrastructure resources to the second subset of AI workloads based on the extended priority requirement of the standard priority level, wherein the second subset of infrastructure resources is smaller than the first subset of infrastructure resources; identifying a third subset of AI workloads of the set of AI workloads that is associated with a low priority level; and assigning a third subset of infrastructure resources in the set of idle infrastructure resources to the third subset of AI workloads based at least on the extended priority requirement of the low priority level, wherein the third subset of infrastructure resources is smaller than the second subset of infrastructure resources; wherein the set of AI workloads comprises a set of suspended AI workloads that are suspended due to a lack of infrastructure resources; and wherein a subset of the set of suspended AI workloads is resumed based on the allocation of the backup infrastructure resources (see the tiered-allocation sketch following this list).
-wherein scheduling the set of AI workloads on the infrastructure resources of the set of nodes comprises: identifying a first AI workload currently executing using the infrastructure resources of a first node, wherein the first AI workload is associated with a lower priority level than a second AI workload of the set of AI workloads; and migrating the first AI workload to a second node such that the infrastructure resources of the first node are released for executing the second AI workload of the set of AI workloads.
-wherein the priority levels associated with the set of AI workloads comprise a performance requirement based on a throughput score value indicative of the ratio of an ideal time to complete an AI workload to the actual time to complete the AI workload; and wherein scheduling, by the scheduler, the set of AI workloads to the set of nodes of the cloud infrastructure platform includes scheduling AI workloads to meet the performance requirement of the priority level of each AI workload (the throughput score also appears in the preemption-score sketch following this list).
-further comprising: scheduling, by the processor of the global scheduler, a multi-node AI workload on a subset of nodes of the set of nodes based on balancing the sum of the resource ticket values of the AI workloads allocated to the subset of nodes, wherein the multi-node AI workload is scheduled for simultaneous execution on the subset of nodes.
-wherein scheduling, by the processor of the global scheduler, the multi-node AI workload and scheduling, by the processor of the local scheduler of the first node, the subset of AI workloads comprise scheduling AI workloads using a stride mechanism for each time quantum during execution of the AI workloads (see the stride-scheduling sketch following this list).
-further comprising: monitoring, by the processor of the local scheduler of the first node, performance of the subset of AI workloads using the infrastructure resources of the first node, based on heterogeneous types of the infrastructure resources; and, based on the monitoring indicating a performance difference between a first AI workload and a second AI workload of the subset of AI workloads, trading, by the processor of the local scheduler of the first node, infrastructure resources of heterogeneous types between the first AI workload and the second AI workload to offset the indicated performance difference.
-wherein trading infrastructure resources between the first AI workload and the second AI workload comprises: determining a swap rate for trading infrastructure resources of a second type for infrastructure resources of a first type, wherein the determined swap rate is favorable to both the first AI workload and the second AI workload; and exchanging a first amount of infrastructure resources of the first type, associated with the first AI workload, to the second AI workload for a second amount of infrastructure resources of the second type, associated with the second AI workload, to the first AI workload, wherein the first amount and the second amount are based on the determined swap rate (see the swap-rate trading sketch following this list).
-wherein scheduling the subset of AI workloads on the infrastructure resources of the first node comprises: identifying an AI workload currently being executed using the infrastructure resources of the first node; and migrating the identified AI workload to another node, thereby freeing infrastructure resources of the first node for executing the subset of AI workloads.
-wherein identifying the AI workload comprises: identifying the AI workload based on the AI workload being associated with a lower priority level than at least one AI workload of the subset of AI workloads.
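The following illustrative sketches expand on several of the examples listed above; they are explanatory only and are not part of the claimed subject matter. The first sketch combines the throughput score (the ratio of ideal completion time to actual completion time) with a dynamic preemption score and selects the workload whose score indicates the highest likelihood of preemption. The scoring formula and all identifiers are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class RunningWorkload:
    name: str
    ideal_time: float   # expected completion time with dedicated resources
    actual_time: float  # observed (projected) completion time
    threshold: float    # minimum throughput score required by its priority level

    @property
    def throughput_score(self) -> float:
        # Ratio of ideal time to actual time; 1.0 means running at ideal speed.
        return self.ideal_time / self.actual_time


def dynamic_preemption_score(w: RunningWorkload) -> float:
    """Illustrative scoring: the further a workload already exceeds its
    priority level's throughput threshold, the more slack it has, so it
    receives a higher preemption score (cheaper to preempt)."""
    return w.throughput_score - w.threshold


def pick_victim(workloads):
    # Preempt the workload with the highest dynamic preemption score.
    return max(workloads, key=dynamic_preemption_score)


if __name__ == "__main__":
    running = [
        RunningWorkload("hi-pri-train", ideal_time=10.0, actual_time=11.0, threshold=0.9),
        RunningWorkload("low-pri-tune", ideal_time=10.0, actual_time=12.0, threshold=0.5),
    ]
    print("preempt:", pick_victim(running).name)  # -> preempt: low-pri-tune
```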
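The tiered allocation of idle infrastructure resources by priority level can be sketched as follows; the per-level weights stand in for the extended priority requirements of the high, standard, and low priority levels and are assumptions of this illustration.

```python
def allocate_idle_resources(idle_units: int, workloads_by_level: dict) -> dict:
    """Split a pool of idle resource units across priority levels so that
    higher levels receive strictly larger subsets (high > standard > low).
    `workloads_by_level` maps a level name to the list of pending workloads."""
    # Illustrative weights standing in for per-level extended priority requirements.
    weights = {"high": 4, "standard": 2, "low": 1}
    active = {lvl: wls for lvl, wls in workloads_by_level.items() if wls}
    total_weight = sum(weights[lvl] for lvl in active) or 1
    allocation = {}
    remaining = idle_units
    for lvl in ("high", "standard", "low"):
        if lvl not in active:
            continue
        share = min((idle_units * weights[lvl]) // total_weight, remaining)
        allocation[lvl] = share
        remaining -= share
    # Any leftover units go to the highest level present; such spare capacity
    # could also be used to resume workloads suspended earlier for lack of resources.
    if remaining and allocation:
        top = next(lvl for lvl in ("high", "standard", "low") if lvl in allocation)
        allocation[top] += remaining
    return allocation


print(allocate_idle_resources(16, {"high": ["a"], "standard": ["b"], "low": ["c"]}))
# -> {'high': 10, 'standard': 4, 'low': 2}
```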
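The stride mechanism referenced above can be illustrated with the classic stride-scheduling loop: each workload's stride is inversely proportional to its resource ticket value, and at every time quantum the workload with the smallest pass value runs and advances its pass by its stride. The constant and identifiers below are assumptions of this sketch.

```python
STRIDE_CONSTANT = 10_000  # arbitrary large constant; an illustrative assumption


class StrideEntry:
    def __init__(self, name: str, tickets: int):
        self.name = name
        self.stride = STRIDE_CONSTANT / tickets  # fewer tickets -> larger stride
        self.pass_value = 0.0


def run_quanta(entries, quanta: int):
    """Make `quanta` scheduling decisions; each quantum the entry with the
    lowest pass value runs, so execution time converges to the ticket ratio."""
    schedule = []
    for _ in range(quanta):
        current = min(entries, key=lambda e: e.pass_value)
        schedule.append(current.name)
        current.pass_value += current.stride
    return schedule


if __name__ == "__main__":
    entries = [StrideEntry("job-a", tickets=300), StrideEntry("job-b", tickets=100)]
    print(run_quanta(entries, 8))  # job-a is scheduled three times as often as job-b
```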
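Finally, the trade of heterogeneous infrastructure resources between two co-located workloads can be sketched as a barter at a swap rate chosen so that both workloads benefit; the swap rate and allocation sizes below are assumptions of this illustration, not values prescribed by the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    gpus: int
    cpus: int


def trade(a: Allocation, b: Allocation, swap_rate: float, gpus_traded: int):
    """Trade `gpus_traded` GPUs from workload A to workload B in exchange for
    `gpus_traded * swap_rate` CPUs from B to A. The caller is expected to have
    chosen `swap_rate` so that the projected throughput of both workloads improves."""
    cpus_traded = int(gpus_traded * swap_rate)
    if a.gpus < gpus_traded or b.cpus < cpus_traded:
        raise ValueError("insufficient resources for the proposed trade")
    a.gpus -= gpus_traded
    b.gpus += gpus_traded
    b.cpus -= cpus_traded
    a.cpus += cpus_traded
    return a, b


if __name__ == "__main__":
    # Workload A is CPU-bound (e.g., data preprocessing); workload B is GPU-bound.
    a, b = Allocation(gpus=4, cpus=8), Allocation(gpus=2, cpus=32)
    trade(a, b, swap_rate=6.0, gpus_traded=1)  # 1 GPU for 6 CPUs
    print(a, b)  # Allocation(gpus=3, cpus=14) Allocation(gpus=3, cpus=26)
```

In practice, the local scheduler would derive the swap rate from the monitored performance difference between the two workloads so that the indicated difference is offset.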
The embodiments illustrated and described herein, as well as embodiments not specifically described herein but within the scope of aspects of the claims, constitute exemplary means for: receiving, by a processor of a global scheduler, a set of AI workloads to be executed, wherein each AI workload of the set of AI workloads is associated with a resource ticket value that indicates a share of resources to be used to execute that AI workload; exemplary means for assigning, by the processor of the global scheduler, the set of AI workloads to a set of nodes of a cloud infrastructure platform, wherein each node in the set of nodes includes infrastructure resources for executing AI workloads, and wherein the set of AI workloads is assigned to the set of nodes based on balancing the resource ticket values of the AI workloads across the nodes of the set of nodes; exemplary means for scheduling, by a processor of a local scheduler of a first node in the set of nodes, a subset of AI workloads in the set of AI workloads that is assigned to the first node for execution on the infrastructure resources of the first node, wherein the scheduling of the subset of AI workloads is based on the resource ticket values associated with the subset of AI workloads; and exemplary means for executing, based on scheduling the subset of AI workloads allocated to the first node, by a coordinator service of the local scheduler of the first node, the subset of AI workloads on the infrastructure resources of the first node.
By way of example, and not limitation, computer readable media comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable memory implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, and the like. Computer storage media are tangible and mutually exclusive to communication media. Computer storage media is implemented in hardware and does not include carrier waves and propagated signals. For purposes of this disclosure, the computer storage medium itself is not a signal. Exemplary computer storage media include hard disks, flash drives, solid-state memory, phase change random access memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. In contrast, communication media typically embodies computer readable instructions, data structures, program modules, or the like in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
The order of execution or performance of the operations in the examples of the present disclosure illustrated and described herein is not essential and, in various examples, the operations may be performed in a different order. For example, it is contemplated that executing or performing a particular operation before, contemporaneously with, or after another operation is within the scope of aspects of the disclosure.
When introducing elements of aspects of the present disclosure or the examples thereof, the singular forms are intended to mean that there are one or more of the elements. The terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements other than the listed elements. The term "exemplary" is intended to mean "an example of." The phrase "one or more of the following: A, B, and C" means "at least one of A and/or at least one of B and/or at least one of C."
Having described aspects of the present disclosure in detail, it will be apparent that modifications and variations are possible without departing from the scope of aspects of the present disclosure as defined in the appended claims. As various changes could be made in the above constructions, products, and methods without departing from the scope of aspects of the disclosure, it is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.

Claims (15)

1. A system for scheduling execution of Artificial Intelligence (AI) workloads in a cloud infrastructure platform, the system comprising:
at least one processor of the cloud infrastructure platform, wherein the at least one processor comprises at least one of: a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a hardware accelerator; and
at least one memory of the cloud infrastructure platform comprising computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the at least one processor to:
receiving, by a scheduler, a set of AI workloads to be executed, wherein each AI workload in the set of AI workloads is associated with a priority level in a set of priority levels, the priority levels indicating preemption priorities associated with the AI workloads when the AI workloads are executed;
scheduling, by the scheduler, the set of AI workloads to a set of nodes of the cloud infrastructure platform, wherein each node in the set of nodes includes infrastructure resources for executing AI workloads, and wherein the set of AI workloads is assigned to the set of nodes based at least on the priority levels of the AI workloads on each node in the set of nodes; and
executing, by a coordinator service of the scheduler, based at least on the scheduling of the set of AI workloads, the set of AI workloads allocated to the set of nodes on the infrastructure resources of the set of nodes.
2. The system of claim 1, wherein the set of priority levels includes at least a first priority level and a second priority level, and the first priority level has a higher priority than the second priority level; and
wherein the AI workloads of the first priority level have priority over the AI workloads of the second priority level based on at least one of: preemption frequency, extended priority, and local priority.
3. The system of claim 2, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the at least one processor to:
scheduling, by the scheduler, a multi-node AI workload associated with the first priority level on a subset of nodes of the set of nodes, wherein scheduling the multi-node AI workload includes preempting, on the subset of nodes, at least one AI workload associated with the second priority level, which has a lower priority than the first priority level.
4. A system according to any one of claims 1 to 3, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the at least one processor to:
monitoring performance of the set of AI workloads that are executing using the infrastructure resources;
determining, for each AI workload, a dynamic preemption score indicating a relative likelihood that the AI workload will be preempted based at least on monitoring performance of the set of AI workloads, wherein the dynamic preemption score is based on at least one performance threshold requirement of a priority level associated with the AI workload;
based at least on determining that an AI workload is to be preempted, identifying an AI workload for preemption based at least on the dynamic preemption scores of the AI workloads, wherein the identified AI workload has the dynamic preemption score indicating the highest likelihood of preemption among all dynamic preemption scores of the set of AI workloads; and
preempting the identified AI workload.
5. A system according to any one of claims 1 to 3, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the at least one processor to:
identifying a set of idle infrastructure resources for allocation to the set of AI workloads;
identifying a first subset of AI workloads of the set of AI workloads that are associated with a high priority level;
assigning a first subset of infrastructure resources in the set of idle infrastructure resources to the first subset of AI workloads based at least on the extended priority requirement of the high priority level;
identifying a second subset of AI workloads of the set of AI workloads associated with a standard priority level;
assigning a second subset of infrastructure resources in the set of idle infrastructure resources to the second subset of AI workloads based at least on the extended priority requirement of the standard priority level, wherein the second subset of infrastructure resources is smaller than the first subset of infrastructure resources;
identifying a third subset of AI workloads of the set of AI workloads associated with a low priority level;
assigning a third subset of infrastructure resources in the set of idle infrastructure resources to the third subset of AI workloads based at least on the extended priority requirement of the low priority level, wherein the third subset of infrastructure resources is smaller than the second subset of infrastructure resources;
wherein the set of AI workloads comprises a set of suspended AI workloads that are suspended due to a lack of infrastructure resources; and
wherein a subset of the set of suspended AI workloads is resumed based on the allocation of the backup infrastructure resources.
6. The system of claim 1, wherein scheduling the set of AI workloads on the infrastructure resources of the set of nodes comprises:
identifying a first AI workload currently executing using the infrastructure resources of a first node, wherein the first AI workload is associated with a lower priority than a second AI workload of the set of AI workloads; and
migrating the first AI workload to a second node such that the infrastructure resources of the first node are released for executing the second AI workload of the set of AI workloads.
7. The system of any of claims 1 to 6, wherein the priority level associated with the set of AI workloads includes a performance requirement based at least on a throughput score value indicative of a ratio of an ideal time to complete an AI workload to an actual time to complete the AI workload; and
Wherein scheduling, by the scheduler, the set of AI workloads to the set of nodes of the cloud infrastructure platform includes scheduling AI workloads to meet performance requirements of the priority level of each AI workload.
8. A computerized method for scheduling execution of AI workloads in a cloud infrastructure platform, the computerized method comprising:
receiving, by a processor of a global scheduler, a set of AI workloads to be executed, wherein each AI workload of the set of AI workloads is associated with a resource ticket value that indicates a share of resources to be utilized to execute the AI workload;
assigning, by the processor of the global scheduler, the set of AI workloads to a set of nodes of the cloud infrastructure platform, wherein each node in the set of nodes includes infrastructure resources for use in executing AI workloads, and wherein the set of AI workloads is assigned to the set of nodes based at least on balancing resource ticket values of the AI workloads on each node in the set of nodes;
scheduling, by a processor of a local scheduler of a first node of the set of nodes, a subset of AI workloads of the set of AI workloads that are assigned to the first node for execution on the infrastructure resources of the first node, wherein the scheduling of the subset of AI workloads is based at least on resource ticket values associated with the subset of AI workloads; and
executing, by a coordinator service of the local scheduler of the first node, based at least on scheduling the subset of AI workloads assigned to the first node, the subset of AI workloads on the infrastructure resources of the first node.
9. The computerized method of claim 8, further comprising:
scheduling, by the processor of the global scheduler, a multi-node AI workload on a subset of nodes of the set of nodes based at least on balancing a sum of resource ticket values of the AI workloads allocated to the subset of nodes, wherein the multi-node AI workload is scheduled to be executed simultaneously on the subset of nodes.
10. The computerized method of claim 9, wherein scheduling, by the processor of the global scheduler, the multi-node AI workload and scheduling, by the processor of the local scheduler of the first node, the subset of AI workloads comprise: scheduling the AI workloads using a stride mechanism for each time quantum during which the AI workloads are executed.
11. The computerized method of any of claims 8-10, further comprising:
monitoring, by the processor of the local scheduler of the first node, performance of the subset of AI workloads using the infrastructure resources of the first node, based at least on heterogeneous types of the infrastructure resources; and
based at least on the monitoring indicating a performance difference between a first AI workload and a second AI workload of the subset of AI workloads, trading, by the processor of the local scheduler of the first node, infrastructure resources of heterogeneous types between the first AI workload and the second AI workload to offset the indicated performance difference.
12. The computerized method of claim 11, wherein trading infrastructure resources between the first AI workload and the second AI workload comprises:
determining a swap rate for trading infrastructure resources of a second type for infrastructure resources of a first type, wherein the determined swap rate is favorable to both the first AI workload and the second AI workload; and
exchanging a first amount of infrastructure resources of the first type, associated with the first AI workload, to the second AI workload for a second amount of infrastructure resources of the second type, associated with the second AI workload, to the first AI workload, wherein the first amount and the second amount are based at least on the determined swap rate.
13. The computerized method of claim 8, wherein scheduling the subset of AI workloads on the infrastructure resources of the first node comprises:
identifying an AI workload currently being executed using the infrastructure resources of the first node; and
migrating the identified AI workload to another node, thereby freeing infrastructure resources of the first node for executing the subset of AI workloads.
14. The computerized method of claim 13, wherein identifying the AI workload comprises: identifying the AI workload based at least on the AI workload being associated with a lower priority level than at least one AI workload of the subset of AI workloads.
15. One or more computer storage media having computer-executable instructions for scheduling execution of AI workloads in a cloud infrastructure platform, which when executed by a processor, cause the processor to at least:
receiving, by a global scheduler, a set of AI workloads to be executed, wherein each AI workload in the set of AI workloads is associated with a resource ticket value that indicates a share of resources to be utilized to execute the AI workload;
assigning, by the global scheduler, the set of AI workloads to a set of nodes of the cloud infrastructure platform, wherein each node in the set of nodes includes infrastructure resources for executing AI workloads, and wherein the set of AI workloads is assigned to the set of nodes based at least on balancing resource ticket values of the AI workloads on each node in the set of nodes;
scheduling, by a local scheduler of a first node of the set of nodes, a subset of AI workloads of the set of AI workloads that are assigned to the first node for execution on infrastructure resources of the first node, wherein scheduling the subset of AI workloads is based at least on resource ticket values associated with the subset of AI workloads; and
executing, by a coordinator service of the local scheduler of the first node, based at least on scheduling the subset of AI workloads assigned to the first node, the subset of AI workloads on the infrastructure resources of the first node.
CN202280026592.7A 2021-03-30 2022-03-08 Scheduler for a planetary level computing system Pending CN117099083A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
IN202141014650 2021-03-30
US17/361,224 US20220318052A1 (en) 2021-03-30 2021-06-28 Scheduler for planet-scale computing system
US17/361,224 2021-06-28
PCT/US2022/019214 WO2022211981A1 (en) 2021-03-30 2022-03-08 Scheduler for planet-scale computing system

Publications (1)

Publication Number Publication Date
CN117099083A (en) 2023-11-21

Family

ID=88780805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280026592.7A Pending CN117099083A (en) 2021-03-30 2022-03-08 Scheduler for a planetary level computing system

Country Status (1)

Country Link
CN (1) CN117099083A (en)

Similar Documents

Publication Publication Date Title
Chaudhary et al. Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning
US10871998B2 (en) Usage instrumented workload scheduling
EP3270289B1 (en) Container-based multi-tenant computing infrastructure
US10819776B2 (en) Automated resource-price calibration and recalibration by an automated resource-exchange system
KR101994506B1 (en) Decoupling paas resources, jobs, and scheduling
US10109030B1 (en) Queue-based GPU virtualization and management system
Sun et al. Towards distributed machine learning in shared clusters: A dynamically-partitioned approach
US10884801B2 (en) Server resource orchestration based on application priority
Teng et al. Simmapreduce: A simulator for modeling mapreduce framework
US10884800B2 (en) Server resource balancing using a suspend-resume strategy
US11126466B2 (en) Server resource balancing using a fixed-sharing strategy
EP3702917B1 (en) Intelligent server task balancing based on server capacity
US11307898B2 (en) Server resource balancing using a dynamic-sharing strategy
US20220318674A1 (en) Planet-scale, fully managed artificial intelligence infrastructure service
CN107251007A (en) PC cluster service ensures apparatus and method
US11443244B2 (en) Parallel ensemble of machine learning algorithms
Rodríguez-Pascual et al. Job migration in hpc clusters by means of checkpoint/restart
US20220318052A1 (en) Scheduler for planet-scale computing system
WO2022211980A1 (en) Planet-scale, fully managed artificial intelligence infrastructure service
Shiekh et al. A load-balanced hybrid heuristic for allocation of batch of tasks in cloud computing environment
CN117099083A (en) Scheduler for a planetary level computing system
Zheng et al. Autoscaling high-throughput workloads on container orchestrators
US11966766B2 (en) Reduction server for fast distributed training
EP4315058A1 (en) Scheduler for planet-scale computing system
Yang et al. Satisfying quality requirements in the design of a partition‐based, distributed stock trading system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination