CN114503077A - Task scheduling for machine learning workloads - Google Patents

Task scheduling for machine learning workloads

Info

Publication number: CN114503077A
Application number: CN202080061569.2A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王珏, 黄晖
Current assignee: Google LLC
Original assignee: Google LLC
Prior art keywords: host, hosts, resources, workload, task

Classifications

    • G06F9/5011: Allocation of resources (e.g. of the central processing unit [CPU]) to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5016: Allocation of resources to service a request, the resource being the memory
    • G06F9/5044: Allocation of resources to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering hardware capabilities
    • G06N20/00: Machine learning
    • G06N3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N3/063: Physical realisation, i.e. hardware implementation, of neural networks using electronic means
    • G06F2209/502: Indexing scheme relating to resource allocation (G06F9/50); proximity
    • G06F2212/2542: Indexing scheme relating to memory architectures; distributed memory; non-uniform memory access [NUMA] architecture
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Combinations of networks
    • G06T1/20: Processor architectures; processor configuration, e.g. pipelining

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for scheduling tasks of an ML workload are described. The system receives a request to execute a workload and determines resource requirements based on the request. The system includes a plurality of hosts, and each host includes a plurality of accelerators. Based on the resource requirements and the accelerators of each host, the system determines a number of hosts allocated to perform the tasks of the workload. For each of these hosts, the system generates a task specification based on the memory access topology of the host. The specification identifies the tasks to be performed at the host using resources of the host, which include the plurality of accelerators. The system provides the task specifications to the hosts and executes the workload when each host executes the tasks specified in its task specification.

Description

Task scheduling for machine learning workloads
Background
This description relates generally to scheduling tasks of a computing workload and allocating resources for performing the tasks of the computing workload.
Distributed computing systems typically include various resources, such as a Central Processing Unit (CPU), storage components, and image/speech processing accelerators, video transcoding accelerators, or neural network processors (e.g., Machine Learning (ML) accelerators). These resources can interact to process the tasks of an example computing workload, such as a workload for training an ML system, or an inference workload for classifying images or generating transcriptions for speech recognition.
Existing approaches to processing workloads routinely require memory accesses and data communications between computing resources, or groups of resources, in a distributed system that are non-local (or remote) with respect to each other. Such non-local memory access operations and data communications are typically bandwidth intensive and can cause performance bottlenecks in a compute cluster when the host bandwidth available for cross-port (e.g., remote) operations is limited.
Disclosure of Invention
This document describes techniques for improved scheduling and resource allocation when processing machine learning (ML) workloads by allocating the tasks of a workload to respective resource groups across multiple hosts of a large distributed system. Using the techniques described herein, a distributed system can be configured to assign each task of a workload to a group of resources that exchange data communications over a shared or common hardware bus of the distributed system. This allocation scheme reduces workload processing time by exploiting resource locality based on the non-uniform memory access (NUMA) topology of the resource groups. In some examples, the described techniques can be used to perform NUMA-aware scheduling for a set of hardware accelerators to accelerate neural network computations performed at discrete tensor processing nodes of the distributed system.
One aspect of the subject matter described in this specification can be embodied in a method for scheduling tasks and allocating resources to execute machine learning workloads using hardware accelerators, each of which is configured to implement a neural network comprising a plurality of neural network layers. The method includes: receiving a request to execute a machine learning (ML) workload; determining, based on the request, resource requirements to execute the ML workload at a distributed processing system comprising a plurality of hosts, each of the plurality of hosts comprising a respective plurality of hardware accelerators; and determining, based on the resource requirements and the respective plurality of hardware accelerators of each host, a number of hosts allocated to perform respective tasks of a set of tasks that form the ML workload.
For each host of the number of hosts, the method includes: generating, based on a memory access topology of the host, a respective task specification that specifies tasks allocated to be performed at the host using resources of the host, the resources including the respective plurality of hardware accelerators; and providing the respective task specification to the host. The method further includes executing the ML workload by executing, at each host of the number of hosts, the tasks specified for the host in the respective task specification.
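As a rough illustration of how these steps could compose, the following Python sketch walks through the determine-hosts and generate-specification steps. All names are hypothetical, and each task is assumed, for simplicity, to fit on a single NUMA node of a host:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Host:
    name: str
    cpus_per_numa_node: Dict[int, int]    # NUMA node id -> free CPU count
    accels_per_numa_node: Dict[int, int]  # NUMA node id -> free accelerator count

@dataclass
class TaskSpec:
    host: str
    numa_node: int
    cpus: int
    accelerators: int
    task_binary: str

def schedule_ml_workload(num_tasks: int, cpus_per_task: int,
                         accels_per_task: int, hosts: List[Host]) -> List[TaskSpec]:
    """Assign each ML task to NUMA-local resources of some host.

    At most one task is placed per NUMA node in this simplified sketch, so a
    task never has to split its CPUs or accelerators across memory domains.
    """
    specs: List[TaskSpec] = []
    for host in hosts:
        for node, free_accels in host.accels_per_numa_node.items():
            if len(specs) == num_tasks:
                return specs
            if (free_accels >= accels_per_task
                    and host.cpus_per_numa_node.get(node, 0) >= cpus_per_task):
                specs.append(TaskSpec(host.name, node, cpus_per_task,
                                      accels_per_task, "ml_task_binary"))
    if len(specs) < num_tasks:
        raise RuntimeError("not enough NUMA-local resources for this workload")
    return specs

# Example: four tasks, each needing 48 CPUs and 4 accelerators.
hosts = [Host(f"host-{i}", {0: 48, 1: 48}, {0: 4, 1: 4}) for i in range(2)]
for spec in schedule_ml_workload(4, 48, 4, hosts):
    print(spec)
```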
These and other embodiments can each optionally include one or more of the following features. For example, in some embodiments, the memory access topology of each host includes a respective non-uniform memory access (NUMA) topology that includes a respective memory local to the host; and the respective memory includes a port interface (socket interface) that couples the respective memory to each of the respective plurality of hardware accelerators and one or more other resources of the host.
In some embodiments, performing the tasks specified in the respective task specification comprises: performing a plurality of neural network computations to generate an output for each of the plurality of neural network layers in response to assigning a respective portion of the plurality of neural network computations to each of the respective plurality of hardware accelerators.
In some embodiments, executing the ML workload comprises: instructions for the respective task specification are processed using each resource of the control group of the host and based on data exchanged between the respective memory, hardware accelerator, and respective processor included among the resources of the host.
In some embodiments, executing the ML workload comprises: the hardware port links each resource of the control group of the host in response to executing a task specified in the respective task specification based on data processing instructions exchanged via the hardware port, wherein the hardware port defines a local communication bus shared among a plurality of resources managed by the host.
In some embodiments, the respective NUMA topology for the first host is based in part on: i) a respective first memory local to the first host in a respective configuration of resources; and ii) a respective second distinct memory local to a second distinct host but remote to the first host in a respective configuration of resources.
In some embodiments, determining the number of hosts comprises: obtaining a system file describing a configuration of a resource managed by each of the plurality of hosts; and determining, for each of the plurality of hosts, a number of hosts based on the configuration of the resources described in the system file. In some embodiments, the method comprises: identifying one or more ports coupling resources of each host of a plurality of hosts based on a system file describing a mapping of NUMA ports of the host; and forming a control group for the host based on the one or more ports coupled to the resources of the host.
In some embodiments, the method comprises: assigning the ML task of the task specification to a control group of the host based on one or more port interfaces for accelerators in the control group, wherein the port interfaces are included in a mapping of NUMA ports described in the system file; and executing the ML task as a process under the control group using the accelerators in the control group.
Other embodiments of this and other aspects include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on non-transitory computer-readable storage devices. A system of one or more computers can be configured by software, firmware, hardware, or a combination thereof installed on the system that in operation causes the system to perform actions. The one or more computer programs can be so configured with instructions that, when executed by the data processing apparatus, cause the apparatus to perform actions.
One aspect of the subject matter described in this specification can be embodied in a system that receives a request to execute a workload and determines resource requirements based on the request. The system includes a plurality of hosts, and each host includes a plurality of accelerators. Based on the resource requirements and the accelerators of each host, the system determines a number of hosts allocated to perform the tasks of the workload. For each of these hosts, the system generates a task specification based on the memory access topology of the host. The specification identifies tasks to be performed at the host using resources of the host, including the plurality of accelerators. The system provides the task specifications to the hosts and executes the workload when each host executes the tasks specified in its task specification.
The subject matter described in this specification can be implemented in particular embodiments to realize one or more of the following advantages. The techniques described in this document can alleviate performance bottlenecks in a system by reducing or preventing non-local memory access operations and data communications while a host of the system performs the tasks of a workload. In contrast to prior approaches, the described techniques can reduce the amount of time required to process a workload by leveraging resource locality based on the non-uniform memory access (NUMA) topology of the resources or resource groups managed by each host of the system.
The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Drawings
FIG. 1 is a block diagram of an example computing system for scheduling tasks that are performed to execute a machine learning workload.
FIG. 2 illustrates a block diagram of example resources managed by a host included in the computing system of FIG. 1.
FIG. 3 illustrates example computing logic that can be executed to generate a task specification for executing a machine learning workload.
FIG. 4 illustrates an example process for scheduling tasks that are performed to execute a machine learning workload.
FIG. 5 illustrates an example process for generating a task specification that is provided to a host of the computing system of FIG. 1.
Like reference symbols in the various drawings indicate like elements.
Detailed Description
A distributed system has a plurality of nodes (e.g., hosts) that include hardware devices for executing computing workloads. The nodes can form individual hardware computing clusters or host devices in a cluster that process data to execute workloads. Each of the nodes can include a plurality of resources. For example, the resources can be processors or CPUs, memory, or Peripheral Component Interconnect (PCI) devices (e.g., hardware accelerators), and each host can include multiple resources forming a resource group.
Each resource may have some hardware or port connection that makes some resources remote or local with respect to other resources in the node. The manner in which workloads are handled at a distributed system typically requires access to memory and the performance of operations that involve moving data between resources that are non-local (or remote) with respect to each other. As described above, such non-local memory access and data transfer operations can cause performance bottlenecks in a cluster or host with limited bandwidth for cross-port (e.g., remote) operations.
In this context, this document describes techniques for improving the scheduling of tasks of computing workloads and the allocation of system resources across computing clusters of a distributed system to perform the tasks that form those workloads. With respect to scheduling, the techniques include improved processes for allocating the tasks of a workload (e.g., an ML workload) to respective groups of resources managed by individual hosts of a large distributed system. For example, the system is configured to assign specific tasks of a workload to a particular group of resources, where the discrete resources in the group exchange data communications via a shared or common hardware bus of the distributed system. The process of assigning certain tasks to a particular resource group of a host exploits resource locality within a compute cluster, where the locality is based on a non-uniform memory access (NUMA) topology of the resource group.
Techniques are also described that provide a way to perform NUMA-aware task scheduling and resource allocation across multiple hosts in different compute clusters of a distributed system. For example, the controller of a compute cluster can pass a set of protocol bits that describe the jobs or tasks of a workload that require NUMA locality. The techniques leverage the NUMA locality of hosts within a cluster by assigning one or more tasks to a particular set of resources or devices, such as a set of resources that includes CPUs, memory, and Peripheral Component Interconnect (PCI) devices (e.g., hardware accelerators).
A host managing a plurality of resource groups is operable to receive and process protocol bits communicated by a master controller of a cluster. For example, the host processes the protocol bits based on the determined port topology of its resource groups and allocates a particular set of resources from the same NUMA port to perform the tasks of a particular job, or portion of the workload, specified by the protocol bits. The host is operable to bind or assign a set of machine learning tasks of a task specification to a resource group (or control group) constructed at the host. For example, based on information conveyed by the protocol bits, the host can bind tasks to a given set of resources to reduce or prevent non-local memory or data access operations that may degrade the performance or execution of computations for a given task.
The described methods for NUMA-aware task scheduling and resource allocation enable a distributed system to exploit NUMA locality to reduce bandwidth requirements and improve computation time when executing certain workloads. For example, by leveraging the NUMA locality of the clusters and assigning certain types of tasks to particular sets of devices or groups of resources that cooperate with one another, cross-port or cross-node communications can be reduced, freeing up bandwidth for other computations.
FIG. 1 is a block diagram of an example distributed computing system 100 for scheduling tasks that are executed to perform a computing workload. The system 100 may be a large distributed hardware computing system comprising a plurality of computing clusters 102, wherein each cluster 102 comprises a plurality of hosts 104 and each host 104 comprises a plurality of computing resources 105.
One or more groups of resources can be managed by a host 104 of the distributed system 100, and each computing cluster of the plurality of computing clusters 102 can include a plurality of hosts 104. More specifically, each host 104 is configured to manage two or more discrete resources 105 that form a resource group. The group of resources managed by the host 104 may alternatively be referred to herein as a resource group. Thus, in some cases, the resources 105 can represent discrete resources, such as a single processor or memory device, while in other cases the resources 105 can represent multiple resources, such as two or more processors, two or more memory banks, two or more hardware accelerators, or a combination of each. The resource groups of the host 104 are described in more detail below with reference to FIG. 2.
In some implementations, the host 104 is a hardware computing device (e.g., a computer or server). In some implementations, the host 104 is a virtual machine of a distributed system (or computing cluster), a software construct for managing a set of computing resources 105, or both. The system 100 can include M number of compute clusters 102, and each compute cluster 102 of the M number of compute clusters can include N number of hosts, where each of M and N is an integer greater than or equal to one.
In some implementations, each of the compute clusters 102 includes a set of machines (e.g., hardware or virtual machines) that form the hosts 104 of the cluster 102. As shown in FIG. 1, a single cluster 102 can include multiple controllers 108, each of which is used to distribute tasks of a workload to one or more of the hosts 104 in the cluster 102.
Each of the compute clusters 102 can include a scheduler 106 in communication with a master controller 108 ("controller 108") of the cluster 102 and a link shard 110 accessible by the master controller 108. The controller 108 is responsible for generating task specifications, preparing instructions and commands for sending to the hosts 104, and updating the current processing state of the hosts 104 based on responses from the hosts 104. In some embodiments, each controller 108 includes state logic 110 that manages communication with a subset of the hosts 104. For example, the state logic 110 is executed by the controller 108 to send commands to a subset of the hosts 104 to obtain information about the processing state of the hosts 104 and to receive responses from the hosts 104. For example, the state logic 110 is used to receive and process host reports indicating whether an assigned task is complete or in progress. The state logic 110 may determine that the tasks assigned to a host 104 have stalled if the host 104 fails to provide a status report after a threshold number of attempts to obtain information about the host's processing state. In some cases, the state logic 110 is operable to aggregate and compress the processing state information reported by the hosts 104, thereby reducing the size of the update load received at the master controller 108.
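One plausible shape for such state logic is sketched below in Python; the report format and the missed-poll threshold are assumptions made for illustration, not details taken from this description:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

MAX_MISSED_POLLS = 3  # assumed threshold; the description does not fix a value

@dataclass
class HostState:
    completed: int = 0
    in_progress: int = 0
    missed_polls: int = 0

def update_host_states(states: Dict[str, HostState],
                       reports: Dict[str, Optional[Tuple[int, int]]]) -> dict:
    """Fold per-host status reports into a compact summary for the master.

    `reports` maps a host name to (completed, in_progress) counts, or to None
    if the host failed to answer this polling round.
    """
    summary = {"completed": 0, "in_progress": 0, "stalled_hosts": []}
    for host, report in reports.items():
        state = states.setdefault(host, HostState())
        if report is None:
            state.missed_polls += 1
            if state.missed_polls >= MAX_MISSED_POLLS:
                summary["stalled_hosts"].append(host)
            continue
        state.missed_polls = 0
        state.completed, state.in_progress = report
    summary["completed"] = sum(s.completed for s in states.values())
    summary["in_progress"] = sum(s.in_progress for s in states.values())
    return summary

# Example polling round: host-1 reports progress, host-2 does not answer.
states: Dict[str, HostState] = {}
print(update_host_states(states, {"host-1": (3, 1), "host-2": None}))
```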
As described in more detail below, the scheduler 106 and the controller 108 interact or communicate to schedule and assign tasks of a workload to particular hosts 104 for execution at the hosts. Although depicted in fig. 1 as being separate from the controller 108, the scheduler 106 can be integrated in the controller 108. In some implementations, the scheduler 106 is an optional processing element of the compute cluster 102, and its functionality can be integrated into the allocation and control functions configured at the controller 108.
The controller 108 is operable to allocate tasks of the workload based at least on the request 112 to perform the one or more workloads and the hardware configuration of the resources managed by the host 104. For example, each of the controllers 108 may be a logically centralized controller that generates instructions based on parameters in the requests 112 received at the cluster 102 and based on the hardware port topology of the resources 105 (or resource groups) included in the hosts 104. In some implementations, each host 104 in the subset of hosts 104 is configured as a "slave" computing asset under a particular master controller 108. In this embodiment, the master controller 108 generates instructions based on the parameters in the request 112 and the hardware port topology at each host 104 in the subset of hosts 104 that are "slaves" under the particular master controller 108.
The host 104 can include a plurality of resources 105, e.g., hundreds or thousands of resources or devices, corresponding to a machine or hardware device. The resources 105 in the host 104 can vary in many ways or be heterogeneous. For example, each resource group 105 managed by the host 104 can vary in processing device (e.g., CPU, RAM, disk, network), processor type, processing speed, performance, and capabilities such as external IP addresses or flash storage. More specifically, for each compute cluster 102, each of the plurality of hosts 104 in the cluster 102 includes one or more dedicated hardware circuits that interact with other resources of the host to perform the tasks of the workload.
For example, the dedicated hardware circuit may be a hardware accelerator, a Graphics Processing Unit (GPU) hardware accelerator, or a neural network processor. In the example of FIG. 1, the system 100 can include a first host 104-1, a second host 104-2, a third host 104-3, and N number of additional hosts 104-N. In some embodiments, the dedicated circuitry and resources 105 of the first host 104-1 may be different (e.g., slightly or significantly) than the dedicated circuitry and resources 105 of the second host 104-2.
For example, the first host 104-1 may include 10 GPU hardware accelerators, each of which is configured to perform location-based and in-memory analysis or GPU-accelerated database queries, while the second host 104-2 may include 20 neural network processors, each of which is configured to implement a Convolutional Neural Network (CNN) model or a Recurrent Neural Network (RNN) model. In some embodiments, the 20 neural network processors may be configured to execute a binary file for a trained inference model and to accelerate running a floating-point-based inference model, an integer-quantized inference model, or both.
The system 100 uses the controller 108 to generate instructions for distributing and controlling the execution of individual tasks, such as tasks capable of running on one or more machines of a particular host 104. The determination of the particular set of resources of a host 104 to which certain tasks are assigned is made with an emphasis on exploiting resource locality within the hosts 104 of the computing cluster 102. The resource locality is based on the hardware topology of the resource groups at the host 104, and more particularly, on the non-uniform memory access (NUMA) topology of the resource groups, as described below.
The system 100 is configured to generate a system topology that includes a hardware topology for each computing cluster 102 and each host 104 of the computing cluster. The hardware topology is configured to identify: i) the connectivity (e.g., port connections and interfaces) of the resources of the multiple devices and hosts, and ii) the local communication bus that enables data transfer between the resources 105 of the host 104.
The system 100 is configured to identify the location of each resource or peripheral device coupled to a connection point or component interface of a hardware port in the host 104. For example, the host 104 can run program code (e.g., firmware) associated with a system BIOS of a hardware computer managed by the host 104 to identify the resource locations and types of resources 105 (e.g., processors and memory) coupled to a motherboard of the computer. In some implementations, the operating system of the host 104 can use the chipset of the hardware computer to obtain a detailed list of information about the data buses and peripheral devices connected to the computer managed by the host 104. For example, the list can be based on a portable PCI access library (e.g., libpci) that provides access to the interconnect (PCI) configuration space of the operating system running on a processor of the computer.
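On a Linux host, one way such a topology listing could be assembled is by reading sysfs attributes. The sketch below is an assumption about one possible implementation using the standard /sys/bus/pci and /sys/devices/system/node paths, rather than anything mandated by this description:

```python
import glob
import os

def numa_node_of_pci_device(pci_addr: str) -> int:
    """Return the NUMA node a PCI device (e.g. an accelerator) is attached to.

    Reads the Linux sysfs attribute /sys/bus/pci/devices/<addr>/numa_node;
    a value of -1 means the platform did not report a node.
    """
    try:
        with open(f"/sys/bus/pci/devices/{pci_addr}/numa_node") as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return -1

def cpus_of_numa_node(node: int) -> str:
    """Return the CPU list (e.g. '0-47') that is local to a NUMA node."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return f.read().strip()

def build_host_topology() -> dict:
    """Group PCI devices by the NUMA node whose local bus they share."""
    topology: dict = {}
    for dev in glob.glob("/sys/bus/pci/devices/*"):
        addr = os.path.basename(dev)
        topology.setdefault(numa_node_of_pci_device(addr), []).append(addr)
    return topology

if __name__ == "__main__":
    print(build_host_topology())
```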
The controller 108 is operable to transmit critical instructions to each host 104 for performing tasks and to transmit commands to each host 104 to obtain information about the current processing state of a particular machine or resource group 105 managed at the host 104. In some embodiments, the controller 108 dynamically transmits commands to obtain information about the status of the process. Alternatively, the controller 108 may transmit a command to obtain information about the status of the process with reference to a predetermined schedule (e.g., every few seconds), which is based on the specific tasks performed at the host 104. In general, each controller 108 is operable to control the respective communication rates between the various resources and different resource groups of the host 104 based on the instructions and commands it transmits to the host.
FIG. 2 illustrates a block diagram of an example resource group 200 managed by a host 104 of an example computing cluster 102. As described above, the host 104 can include hundreds or thousands of resources corresponding to machines or hardware devices. The resources 105 in the resource group of the host 104 can vary in many ways or be heterogeneous. For example, each resource group 200 managed by the host 104 can vary in processing device (e.g., CPU, RAM, disk, network), processor type, processing speed, overall performance, and capabilities such as external IP addresses or flash storage.
As shown in FIG. 2, the memory access topology of each host 104 can include a respective non-uniform memory access (NUMA) topology or ports 202-1, 202-2 of one or more resource groups 200 managed by the host 104. The NUMA topology of resource group 200 can include a plurality of processors (P)204 or a plurality of processor cores (P), memory resources (e.g., Random Access Memory (RAM)), and one or more dedicated circuits (such as hardware accelerator 208). The individual resources of the NUMA topology can form a local NUMA node corresponding to NUMA port 202-1 or 202-2.
For example, a local NUMA node may be formed based on resources in a group that exchange data communications via a shared or common hardware bus 210. When a resource is connected at a node via an interface (or port) connection with a common port, each resource in a local NUMA node may be local to another resource. In some implementations, each hardware accelerator 208 is connected to other resources of the NUMA node via a PCI or PCI-e port connection.
As used in this specification, NUMA refers to a computer memory design for use in a distributed multi-processing system in which memory access times depend on memory locations relative to processor (P)204 or processor cores. Under NUMA, the processor 204 is able to access its own local memory 206-1 faster than non-local memory, such as memory 206-2 local to another processor or memory shared between processors.
The example resource group 200 can include a plurality of interconnect locations 212. For example, each of the interconnect locations 212-1 and 212-2 can correspond to a respective component interface for establishing a data connection between resources 105 of the host 104 (such as between the memory 206-1 of the host 104 and the hardware accelerator 208). In some implementations, the resources 105 of the resource group 200 exchange data communications via a hardware port that links the local resources of the NUMA port 202-1, where the hardware port defines a local communication bus 210 that is shared among multiple resources managed by the host 104.
In some embodiments, the respective NUMA topology for the first NUMA port 202-1 is based in part on: i) a respective first memory 206-1 that is local to the NUMA port 202-1 in a respective configuration of resources; and ii) a respective second different memory 206-2 that is local to the second different NUMA port 202-2 but remote to the first NUMA port 202-1 in the respective configuration of resources.
FIG. 3 illustrates an example task specification 300 based on computing logic 302 executed at the system 100 for executing a computing workload. As shown in fig. 3, the logic 302 can include multiple computing blocks that each include instructions (e.g., programming code/instructions). Instructions can be executed at the system 100 using a processing device of the controller 108, a processing device of the host 104, and other resources 105, or a combination of each.
The computational logic 302 may be a programmed representation of an example task specification 300 for scheduling tasks and allocating resources to execute ML workloads at the system 100. In some implementations, the ML workload is executed using hardware accelerators, each of which is configured to implement a neural network that includes a plurality of neural network layers. The instructions can be stored in one or more non-transitory machine-readable storage media of the system 100 and executable by one or more processors of the system 100 to perform operations and perform tasks of a workload.
For example, operations can be performed to generate instructions and protocol bits (e.g., for a task specification) that are provided to a particular host 104 to perform the tasks of the ML workload. In some embodiments, the protocol bits are represented by an encoded signal, a binary-valued data word, or other relevant parameter or data value. The encoded signals and binary values are received and processed (or interpreted) at the host 104 to determine the assignment of tasks of the task specification. In general, each host 104 is configured to run or execute one or more tasks of a workload, including tasks of multiple different workloads that may be assigned to the host 104. When a request 112 is received by a compute cluster 102, the scheduler 106 and controller 108 of the cluster 102 interact to scan the request 112.
For example, the controller 108 may scan the request 112 to identify parameters (e.g., protocol bits) in the request 112 that specify the CPU, memory, and accelerator requirements of the various tasks in the workload (304). Based on the parameters and values in the request 112, the controller 108 may determine that an example workload includes 16 tasks, where each task requires a total resource allocation of 96 CPUs and 4 dedicated circuits (e.g., hardware accelerators). For example, the request 112 can include scalar resource parameters that specify the number of hardware accelerators (4) to be used to execute each of the 16 tasks. In some implementations, the scalar resource parameter may include a subtype that specifies the type of hardware accelerator that will be used to process the workload. For example, a subtype may specify that each of the 4 hardware accelerators is a neural network processor configured to accelerate running a model trained for feature recognition.
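Purely for illustration, the request parameters described above might be represented as follows; the class and field names are assumptions rather than the protocol format used by the system:

```python
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    """Illustrative shape of a workload request; field names are assumptions."""
    task_count: int
    cpus_per_task: int
    accelerators_per_task: int   # scalar resource parameter
    accelerator_subtype: str     # e.g. a specific neural network processor type

def resource_requirements(req: ResourceRequest) -> dict:
    """Total resources implied by the request (compare step 304 in FIG. 3)."""
    return {
        "total_cpus": req.task_count * req.cpus_per_task,
        "total_accelerators": req.task_count * req.accelerators_per_task,
        "accelerator_subtype": req.accelerator_subtype,
    }

# Example matching the description: 16 tasks, each needing 96 CPUs and
# 4 hardware accelerators of a given subtype.
req = ResourceRequest(task_count=16, cpus_per_task=96,
                      accelerators_per_task=4,
                      accelerator_subtype="nn-processor")
print(resource_requirements(req))
```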
The encapsulation field of the computational logic specifies a task binary (306) for performing each of the 16 tasks. For example, a task binary can be a particular type of neural network or inference model that is to be executed or run at a hardware accelerator to perform the computations of a particular one of the 16 tasks. In some cases, the task binaries are derived from scalar resource subtypes that specify the type of hardware accelerator to be used to process the tasks of the workload.
In response to identifying the parameters in the request 112, the controller 108 is operable to determine an allocation scheme for scheduling and allocating tasks to the hosts 104 in the cluster 102 based on the parameters of the request 112 and on the hardware port topology of the resources 105 (or resource groups 200) in the hosts 104. The controller 108 generates respective task specifications based on the allocation scheme for scheduling and allocating tasks to the hosts 104. For example, each of the parameters in the request 112 and its corresponding parameter value can represent a scheduling constraint for the scheduler 106. In some implementations, the request 112 may assign a priority to each of the parameters to further constrain the scheduler 106 and the controller 108. For example, a priority assigned to an accelerator subtype or CPU core count can constrain the controller 108 to certain hosts 104 that have a particular type of hardware accelerator or a particular number of available CPUs.
The controller 108 determines the allocation of tasks and generates the task specifications at least by analyzing the parameters of the request 112 and the details of the hardware port topology of each of the resource groups managed by one or more of the hosts 104 that are "slaves" under the controller 108. For example, the controller 108 may be operable to scan the hardware port topology of each resource group of a host 104 to determine the locations of the resources 105, determine whether a resource or resource type satisfies the constraints of the request 112, and determine the availability of the resources. In some examples, the controller 108 determines the availability of a resource from among resources that are local to a particular NUMA node and satisfy one or more of the constraints of the request 112. In some embodiments, the hardware port topology of each host 104 is based on the respective memory access topology of each resource group 200 in the host 104.
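A small Python sketch of this host-filtering step is shown below. It is a simplified assumption about how a controller might prefer hosts on which an entire task fits within one NUMA node; none of the names are taken from this description:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class NumaNode:
    free_cpus: int
    free_accelerators: int
    accelerator_subtype: str

@dataclass
class HostTopology:
    name: str
    numa_nodes: Dict[int, NumaNode]

def eligible_hosts(hosts: List[HostTopology], cpus_needed: int,
                   accels_needed: int, subtype: str) -> List[str]:
    """Keep hosts that satisfy the request, preferring hosts where a single
    NUMA node can hold the whole task (best locality, no cross-port traffic)."""
    single_node, multi_node = [], []
    for host in hosts:
        nodes = [n for n in host.numa_nodes.values()
                 if n.accelerator_subtype == subtype]
        if not nodes:
            continue  # constraint on accelerator subtype is not met
        if (sum(n.free_cpus for n in nodes) < cpus_needed
                or sum(n.free_accelerators for n in nodes) < accels_needed):
            continue  # host cannot satisfy the total resource allocation
        if any(n.free_cpus >= cpus_needed and n.free_accelerators >= accels_needed
               for n in nodes):
            single_node.append(host.name)
        else:
            multi_node.append(host.name)
    return single_node + multi_node

hosts = [HostTopology("host-1", {0: NumaNode(48, 4, "nn-processor"),
                                 1: NumaNode(48, 4, "nn-processor")})]
print(eligible_hosts(hosts, 96, 4, "nn-processor"))  # ['host-1'] (multi-node fit)
```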
In a NUMA system, there are multiple NUMA nodes, each made up of a set of processors and memory. As indicated above, access to memory 206-1 by a processor 204 in the same NUMA node 202-1 is local, while access by a processor 204 in NUMA node 202-1 to memory 206-2 in another NUMA node 202-2 is remote. In some embodiments, a remote access can take multiple cycles relative to a local access because it can involve multi-hop operations. Because of this asymmetric memory access latency, keeping memory accesses local, or maximizing memory locality, can improve the performance of distributed processing. In some implementations, CPU load balancing across NUMA nodes, combined with exploiting NUMA locality, can translate into additional performance improvements.
The master controller 108 is configured to encode one or more constraints in the task specification (308). For example, the task specification 300 can include scheduling constraints derived from the parameters and values in the request 112. For example, the controller 108 can translate parameters in the request 112 into task constraints that instruct the host 104 to load data for performing task computations on hosts located within a particular cloud zone. For example, a cloud zone may be a particular physical or geographic location of a data center that includes a certain set of hardware accelerator resources needed to perform data computations for a given task of a workload.
The compute cluster 102 is configured to determine an allocation scheme for scheduling and allocating tasks to particular resource groups 200 of the hosts 104 in a manner that exploits resource locality across the multiple hosts 104 in the compute cluster 102. The example task specification 300 depicted in FIG. 3 is a simplified task specification for a compute cluster 102 that receives a request to perform 16 tasks. Each of the tasks occupies a host 104 that includes two NUMA nodes. In this example, each of the tasks requires a total resource allocation of 96 CPUs and 4 dedicated circuits (e.g., hardware accelerators).
The task specification 300 includes parameters that define the resource allocation of a particular host 104, such as how the host 104 allocates its CPU cores from a particular NUMA node (310). The example of FIG. 3 shows a balanced CPU allocation of 48 processor cores allocated from each NUMA node to satisfy a total resource allocation of 96 CPUs for each task. In other examples, the controller 108 may generate a task specification that specifies an unbalanced distribution, such as 36 CPUs from a first NUMA node and 60 CPUs from a second NUMA node.
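The balanced and unbalanced splits mentioned above can be illustrated with the following sketch, which assumes one possible way of dividing a task's CPU allocation across the free CPUs of each NUMA node; the function name and data shapes are illustrative only:

```python
def split_cpus_across_numa_nodes(total_cpus: int, free_per_node: dict) -> dict:
    """Split a task's CPU allocation across the NUMA nodes of a host.

    Tries a balanced split first (e.g. 96 CPUs -> 48 + 48 across two nodes),
    and spills the remainder onto nodes with spare capacity when one node
    has fewer free CPUs than its even share (e.g. 96 -> 36 + 60).
    """
    nodes = sorted(free_per_node)
    share = total_cpus // len(nodes)
    remaining = total_cpus
    plan = {}
    # First pass: give each node at most its even share of the allocation.
    for node in nodes:
        take = min(share, free_per_node[node])
        plan[node] = take
        remaining -= take
    # Second pass: spill whatever is left onto nodes that still have room.
    for node in nodes:
        if remaining == 0:
            break
        extra = min(remaining, free_per_node[node] - plan[node])
        plan[node] += extra
        remaining -= extra
    if remaining:
        raise RuntimeError("host cannot satisfy the task's CPU requirement")
    return plan

print(split_cpus_across_numa_nodes(96, {0: 48, 1: 48}))  # {0: 48, 1: 48}
print(split_cpus_across_numa_nodes(96, {0: 36, 1: 64}))  # {0: 36, 1: 60}
```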
FIG. 4 illustrates an example process 400 for scheduling tasks that are performed to execute a machine learning workload. The process 400 can be implemented or performed using the system 100 described above. Accordingly, the description of process 400 may refer to the above-described computing resources of system 100 as well as other components described in this specification. In general, the computing steps or process flows in the description of process 400 can be grouped or arranged to occur in different orders and are not limited to the numerical sequences described herein.
Referring now to process 400, the system 100 receives a request to execute a workload using one or more of its computing clusters (402). In some implementations, the process 400 corresponds to a method for scheduling tasks and allocating resources to execute workloads using hardware accelerators and other resources of the hosts 104. In some examples, the workload is a training or inference workload related to a particular machine learning operation, such as video transcoding, image processing, speech processing, autonomous vehicle navigation, or image recognition.
The request 112 may be to perform an ML workload, such as an inference workload to detect objects in an image or to recognize terms in a speech utterance. In this context, one or more of the hardware accelerators may be configured to implement a neural network comprising a plurality of neural network layers, such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN). The received request 112 may include parameters that specify a particular type of neural network configuration (e.g., CNN or RNN) that should be used to perform the tasks of the workload.
The received request 112 may also follow a second request 112 to deploy a particular neural network on the cluster 102, for example, using a set of resources of a host 104. The second request may be followed by an instruction or command that causes the controller 108 (or host 104) to obtain parameters for a set of weights of a particular neural network layer. For example, the set of weights may be obtained from memory locations of a memory managed by the host 104 based on a location address specified in the instruction. In some implementations, the memory that stores the weights obtained by the host 104 is one of a plurality of local resources of a NUMA node that defines a set of resources at the host 104. Similarly, the instructions may cause the controller 108 (or the host 104) to access other memory locations to fetch inputs for processing through the neural network layer, so as to generate outputs for the neural network layer using the local resources 105 of the NUMA nodes in the host 104. In some implementations, processing a particular portion of the input identified in the request 112 through the neural network layer to generate a layer output can represent one or more tasks of a larger workload that can be processed across multiple hosts 104 of a computing cluster 102 or across multiple computing clusters 102.
The system 100 determines resource requirements based on the request (404). The resource requirements can indicate certain details about the resources of the system 100 relative to the workload request 112, such as the type and amount of computing resources needed to execute the set of tasks representing the ML workload. For example, a resource requirement can specify a certain processor or processor type, processing power or speed, an amount of memory or memory size, a number of hardware accelerators, or a measure of resource locality at the distributed system.
The system 100 determines a number of hosts 104 allocated to perform respective tasks of the ML workload based on the resource requirements and the plurality of hardware accelerators per host (406). For each host 104 in the number of hosts 104, the system 100 generates a respective task specification based on the memory access topology of the host (408). The memory access topology of the host can be based on one of a plurality of corresponding NUMA topologies for each resource group 200 of the host 104. In some embodiments, the particular NUMA topology for resource group 200 is specific to a local NUMA node and includes corresponding memory (M) that is local to the other resources of the group. The respective memory (M) can include a port interface that locally couples the memory to the at least one hardware accelerator and one or more other resources of the NUMA node.
The controller 108 generates the task specifications in response to scanning the parameters of the request 112 and cross-referencing the respective hardware topology of each resource group 200 in a set of hosts 104 that are assigned to the controller 108 as slave assets. The system 100 provides the corresponding task specification to each host (410). For example, the controller 108 can provide multiple respective task specifications to different hosts 104, and the system 100 executes the ML workload by executing the tasks specified in the respective task specification of each host (412).
FIG. 5 illustrates an example process for generating a task specification that is provided to a host of the system 100. Similar to process 400, process 500 can be implemented or performed using system 100, and the description of process 500 can refer to resources of system 100, including other components described in this specification. In general, the computing steps or process flows in the description of process 500 can be grouped or arranged to occur in different orders and are not limited to the numerical sequences described herein.
Referring now to process 500, system 100 is configured to identify one or more ports of a host's resources using a system file that describes a mapping of NUMA ports of each host (502). For example, the controller 108 is configured to determine the mapping of the NUMA ports based on a hardware topology of the resource group at the host 104. The mapping of NUMA ports is used to indicate the NUMA topology of each of the resource groups managed by the host 104.
The controller 108 constructs a control group for the host 104 using the ports of the resources described in the mapping of the NUMA ports of the host (504). In some implementations, the controller 108 passes one or more protocol bits to the host 104, and the host's scheduler uses the protocol bits (of the request or task specification) to construct control groups based on the locations of the resources 105 among the various resource groups at the host 104. For example, the scheduler of the controller 108 or the host 104 is operable to construct control groups based on the NUMA topology of each of the resource groups and the type of resources at the host 104 that satisfy some (or all) of the constraints in the received request 112.
The controller 108 cooperates with the host 104 to bind or assign the ML tasks of the task specification to control groups constructed at the host 104. In some implementations, the protocol bits passed to the host 104 by the controller 108 are used by the scheduler of the host 104 to bind or assign the ML tasks of the task specification to control groups constructed at the host 104. Protocol bits may be included in the task specification to indicate one or more constraints or requirements of the task specification. The protocol bits can be represented by one or more encoded signals, binary-valued data words, or other relevant parameters or data values. The encoded signals and binary values can be received and processed, or otherwise interpreted, by a scheduler of the host to determine the assignment of tasks of the task specification. In some implementations, the protocol bits may be associated with the task specification but provided separately from the task specification.
The host 104 is operable to bind the ML tasks of the task specification to the control group based at least on the port interfaces for a particular type of accelerator in the control group, including the availability of resources in the control group (506). The host 104 executes the ML tasks of the task specification as processes under the control group using the memory resources and hardware accelerators in the control group (508). In some implementations, the host 104 can use one or more other resources of the control group, including local memory, to perform ML tasks. In some cases, non-local memory may be utilized, but its use is balanced against the required compute bandwidth and any performance impact that may arise when such resources are utilized.
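On a Linux host, the control group described above can be realized with the kernel's cpuset controller. The sketch below assumes a cgroup-v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset and is an illustrative guess at the binding step rather than the system's actual implementation:

```python
import os
import subprocess
from typing import List

CGROUP_ROOT = "/sys/fs/cgroup/cpuset"  # assumes a cgroup-v1 cpuset hierarchy

def bind_task_to_control_group(name: str, cpus: str, mems: str,
                               argv: List[str]) -> subprocess.Popen:
    """Create a cpuset control group pinned to NUMA-local CPUs and memory,
    then launch the ML task binary and move it under that control group.

    `cpus` is a CPU list such as '0-47'; `mems` is the NUMA node list such as
    '0', so that memory allocations stay local to the accelerators' node.
    Moving the process after launch is a simplification of this sketch.
    """
    cg = os.path.join(CGROUP_ROOT, name)
    os.makedirs(cg, exist_ok=True)
    with open(os.path.join(cg, "cpuset.cpus"), "w") as f:
        f.write(cpus)
    with open(os.path.join(cg, "cpuset.mems"), "w") as f:
        f.write(mems)
    proc = subprocess.Popen(argv)
    # Writing the PID into cgroup.procs places the task under the control group.
    with open(os.path.join(cg, "cgroup.procs"), "w") as f:
        f.write(str(proc.pid))
    return proc

# Hypothetical usage: run a task binary on NUMA node 0's CPUs and memory.
# bind_task_to_control_group("ml-task-0", "0-47", "0", ["/path/to/ml_task_binary"])
```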
Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus.
Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal (e.g., a machine-generated electrical, optical, or electromagnetic signal) that is generated to encode information for transmission to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array), an ASIC (application-specific integrated circuit), or a GPGPU (general purpose graphics processing unit).
Computers suitable for the execution of a computer program include, by way of example, central processing units that can be based on general purpose microprocessors, special purpose microprocessors, or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, a computer need not have such devices.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks. The processor and the memory can be implemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may be advantageous.

Claims (20)

1. A method for scheduling tasks and allocating resources to execute machine learning workloads using hardware accelerators, each of the hardware accelerators configured to implement a neural network comprising a plurality of neural network layers, the method comprising:
receiving a request to execute the Machine Learning (ML) workload;
determining resource requirements based on the request to execute the ML workload at a distributed processing system comprising a plurality of hosts, each of the plurality of hosts comprising a respective plurality of hardware accelerators;
determining a number of hosts that are respectively allocated to perform a respective task of a set of tasks that form the ML workload based on the resource requirements and the respective plurality of hardware accelerators per host;
for each host of the number of hosts:
generating a respective task specification based on a memory access topology of the host, the task specification specifying tasks allocated to be performed at the host using resources of the host, the resources of the host including the respective plurality of hardware accelerators; and
providing the respective task specification to the hosts of the number of hosts; and
executing the ML workload by executing, by each host of the number of hosts, the task specified in the respective task specification for the host.
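The following illustrative, non-claim sketch traces the flow recited in claim 1: deriving resource requirements from the request, selecting enough hosts, and emitting one task specification per host. The names (`Host`, `TaskSpec`, `schedule_ml_workload`) and the greedy host selection are assumptions made only for illustration; the claim does not prescribe them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Host:
    name: str
    accelerators: int      # hardware accelerators attached to this host
    numa_topology: Dict    # memory access topology, e.g. {"node": 0, "ports": [...]}


@dataclass
class TaskSpec:
    host: str
    task_id: int
    accelerator_ids: List[int]  # local accelerators assigned to the task


def schedule_ml_workload(request: Dict, hosts: List[Host]) -> List[TaskSpec]:
    """Sketch of the claimed flow: determine requirements, pick hosts,
    and generate one task specification per selected host."""
    # 1. Determine resource requirements from the request.
    required_accelerators = request["accelerators"]

    # 2. Determine how many hosts are needed, given the accelerators per host.
    selected, remaining = [], required_accelerators
    for host in hosts:
        if remaining <= 0:
            break
        selected.append(host)
        remaining -= host.accelerators
    if remaining > 0:
        raise RuntimeError("not enough accelerators across the available hosts")

    # 3. Generate a task specification per host, based on its resources.
    specs = []
    for task_id, host in enumerate(selected):
        specs.append(TaskSpec(
            host=host.name,
            task_id=task_id,
            accelerator_ids=list(range(host.accelerators)),
        ))
    # 4. The specs would then be provided to the hosts, which execute them.
    return specs
```

In this sketch the task specification is generated from static host metadata only; the claim additionally ties specification generation to the host's memory access topology, which the later sketches touch on.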
2. The method of claim 1, wherein:
the memory access topology of each host comprises a respective non-uniform memory access (NUMA) topology comprising respective memory local to the host; and
the respective memory includes a port interface that couples the respective memory to each of the respective plurality of hardware accelerators and one or more other resources of the host.
3. The method of claim 2, wherein performing the task specified in the respective task specification comprises:
performing a plurality of neural network computations to generate an output for each of the plurality of neural network layers in response to assigning a respective portion of the plurality of neural network computations to each of the respective plurality of hardware accelerators.
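Claim 3 recites assigning a respective portion of each layer's neural network computations to each of the host's accelerators. A rough illustration is to split a layer's batch of computations evenly across accelerators; the claim itself does not fix any particular partitioning scheme, so the batch-wise split below is an assumption.

```python
def split_layer_work(batch_size: int, num_accelerators: int):
    """Divide one layer's batch of computations into per-accelerator slices."""
    base, extra = divmod(batch_size, num_accelerators)
    slices, start = [], 0
    for accel_id in range(num_accelerators):
        size = base + (1 if accel_id < extra else 0)
        slices.append((accel_id, range(start, start + size)))  # (accelerator, sample indices)
        start += size
    return slices


# Example: split_layer_work(10, 4) assigns samples 0-2 and 3-5 to the first two
# accelerators and samples 6-7 and 8-9 to the remaining two.
```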
4. The method of claim 2, wherein executing the ML workload comprises:
processing instructions for the respective task specification using each resource of a control group of the host and based on data exchanged between the respective memory, the hardware accelerator, and a respective processor included among the resources of the host.
5. The method of claim 4, wherein executing the ML workload comprises:
executing a task specified in the respective task specification in response to processing the instruction based on the data exchanged via a hardware port that links each resource of the control group of the host, wherein the hardware port defines a local communication bus that is shared among a plurality of resources managed by the host.
6. The method of claim 4, wherein the respective NUMA topology for the first host is based in part on:
i) a respective first memory local to the first host in a respective configuration of resources; and
ii) a respective second distinct memory local to a second distinct host but remote to the first host in a respective configuration of resources.
7. The method of claim 2, wherein determining the number of hosts comprises:
obtaining a system file describing a configuration of resources managed by each of the plurality of hosts; and
determining, for each host of the plurality of hosts, the number of hosts based on a configuration of the resources described in the system file.
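Claims 7 and 16 recite obtaining a system file that describes the resources managed by each host and determining the number of hosts from that configuration. The patent does not specify a file format, so the JSON layout assumed below is purely illustrative.

```python
import json
import math


def hosts_needed(system_file_path: str, required_accelerators: int) -> int:
    """Read a (hypothetical) JSON system file describing each host's resources
    and return how many hosts are needed to satisfy the requirement."""
    with open(system_file_path) as f:
        config = json.load(f)  # e.g. {"hosts": [{"name": "h0", "accelerators": 8}, ...]}

    accelerators_per_host = [h["accelerators"] for h in config["hosts"]]
    if not accelerators_per_host:
        raise ValueError("system file lists no hosts")

    # Assumes homogeneous hosts for simplicity; a real scheduler could instead
    # walk the per-host entries until the requirement is covered.
    return math.ceil(required_accelerators / accelerators_per_host[0])
```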
8. The method of claim 1, comprising:
identifying one or more ports coupling resources of the host based on a system file that describes a mapping of NUMA ports for each host of the plurality of hosts; and
forming a control group of the host based on the one or more ports coupling the resources of the host.
9. The method of claim 8, comprising:
assigning ML tasks of the task specification to the control groups of the host based on one or more port interfaces for accelerators in the control groups, wherein the port interfaces are included in the mapping of NUMA ports described in the system file; and
executing the ML task as a process under the control group using the accelerators in the control group.
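Claims 8 and 9 recite forming a control group from the NUMA port mapping in the system file and executing the ML task as a process under that control group. One plausible realization on Linux uses the cgroup-v2 cpuset interface; the specific files shown (`cpuset.mems`, `cpuset.cpus`, `cgroup.procs`) are standard cgroup-v2 files, but treating them as the claimed mechanism is an assumption of this sketch rather than something the claims state.

```python
import os


def form_control_group(cgroup_name: str, numa_node: int, cpu_ids: list) -> str:
    """Create a control group pinned to the NUMA node whose ports couple the
    host's memory and accelerators (claim 8). Requires root and cgroup v2."""
    path = f"/sys/fs/cgroup/{cgroup_name}"
    os.makedirs(path, exist_ok=True)
    with open(f"{path}/cpuset.mems", "w") as f:
        f.write(str(numa_node))                      # memory local to the accelerators
    with open(f"{path}/cpuset.cpus", "w") as f:
        f.write(",".join(str(c) for c in cpu_ids))   # CPUs on the same NUMA node
    return path


def run_task_in_control_group(cgroup_path: str, pid: int) -> None:
    """Attach the ML task's process to the control group (claim 9)."""
    with open(f"{cgroup_path}/cgroup.procs", "w") as f:
        f.write(str(pid))
```

Pinning `cpuset.mems` to the node that is local to the accelerators is what keeps the task's memory traffic on the local port interface described in claim 2.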
10. A system configured to schedule tasks and allocate resources for executing machine learning workloads using hardware accelerators, each of the hardware accelerators configured to implement a neural network comprising a plurality of neural network layers, the system comprising:
one or more processing devices; and
one or more non-transitory machine-readable storage devices storing instructions executable by the one or more processing devices to cause performance of operations comprising:
receiving a request to execute the Machine Learning (ML) workload;
determining resource requirements based on the request to execute the ML workload at a distributed processing system comprising a plurality of hosts, each of the plurality of hosts comprising a respective plurality of hardware accelerators;
determining a number of hosts that are respectively allocated to perform a respective task of a set of tasks that form the ML workload based on the resource requirements and the respective plurality of hardware accelerators per host;
for each host of the number of hosts:
generating, based on a memory access topology of the host, a respective task specification specifying tasks allocated to be performed at the host using resources of the host, the resources of the host including the respective plurality of hardware accelerators; and
providing the respective task specification to the hosts of the number of hosts; and
executing the ML workload by executing, by each host of the number of hosts, the task specified in the respective task specification for the host.
11. The system of claim 10, wherein:
the memory access topology of each host comprises a respective non-uniform memory access (NUMA) topology comprising respective memory local to the host; and
the respective memory includes a port interface that couples the respective memory to each of the respective plurality of hardware accelerators and one or more other resources of the host.
12. The system of claim 11, wherein performing the task specified in the respective task specification comprises:
performing a plurality of neural network computations to generate an output for each of the plurality of neural network layers in response to assigning a respective portion of the plurality of neural network computations to each of the respective plurality of hardware accelerators.
13. The system of claim 11, wherein executing the ML workload comprises:
processing instructions for the respective task specification using each resource of the host and based on data exchanged between the respective memory, the hardware accelerator, and a respective processor included among the resources of the host.
14. The system of claim 13, wherein executing the ML workload comprises:
executing a task specified in the respective task specification in response to processing the instruction based on the data exchanged via a hardware port that links each resource of the host, wherein the hardware port defines a local communication bus that is shared among a plurality of resources managed by the host.
15. The system of claim 13, wherein the respective NUMA topology for the first host is based in part on:
i) a respective first memory local to the first host in a respective configuration of resources; and
ii) a respective second different memory local to a second different host but remote to the first host in a respective configuration of resources.
16. The system of claim 11, wherein determining the number of hosts comprises:
obtaining a system file describing a configuration of resources managed by each of the plurality of hosts; and
determining, for each host of the plurality of hosts, the number of hosts based on a configuration of the resources described in the system file.
17. The system of claim 10, wherein the operations comprise:
identifying one or more ports that couple resources of the host based on a system file that describes mappings of NUMA ports for each host of the plurality of hosts; and
forming a control group of the host based on the one or more ports coupling the resources of the host.
18. The system of claim 17, comprising:
assigning ML tasks of the task specification to the control groups of the host based on one or more port interfaces for accelerators in the control groups, wherein the port interfaces are included in the mapping of NUMA ports described by the system file; and
executing the ML task as a process under the control group using the accelerators in the control group.
19. A non-transitory machine-readable storage medium storing instructions for scheduling tasks and allocating resources to execute a machine learning workload using hardware accelerators, each of the hardware accelerators configured to implement a neural network comprising a plurality of neural network layers, the instructions executable by one or more processors to cause performance of operations comprising:
receiving a request to execute the Machine Learning (ML) workload;
determining resource requirements based on the request to execute the ML workload at a distributed processing system comprising a plurality of hosts, each of the plurality of hosts comprising a respective plurality of hardware accelerators;
determining a number of hosts that are respectively allocated to perform a respective task of a set of tasks that form the ML workload based on the resource requirements and the respective plurality of hardware accelerators per host;
for each host of the number of hosts:
generating, based on a memory access topology of the host, a respective task specification specifying tasks allocated to be performed at the host using resources of the host, the resources of the host including the respective plurality of hardware accelerators; and
providing the respective task specification to the hosts of the number of hosts; and
executing the ML workload by executing, by each host of the number of hosts, the task specified in the respective task specification for the host.
20. The machine-readable storage medium of claim 19, wherein:
the memory access topology of each host comprises a respective non-uniform memory access (NUMA) topology comprising respective memory local to the host; and
the respective memory includes a port interface that couples the respective memory to each of the respective plurality of hardware accelerators and one or more other resources of the host.
CN202080061569.2A 2019-11-20 2020-09-08 Task scheduling for machine learning workloads Pending CN114503077A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201962938304P 2019-11-20 2019-11-20
US62/938,304 2019-11-20
US16/720,717 US11544113B2 (en) 2019-11-20 2019-12-19 Task scheduling for machine-learning workloads
US16/720,717 2019-12-19
PCT/US2020/049648 WO2021101617A1 (en) 2019-11-20 2020-09-08 Task scheduling for machine-learning workloads

Publications (1)

Publication Number Publication Date
CN114503077A true CN114503077A (en) 2022-05-13

Family

ID=75910002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080061569.2A Pending CN114503077A (en) 2019-11-20 2020-09-08 Task scheduling for machine learning workloads

Country Status (6)

Country Link
US (2) US11544113B2 (en)
EP (1) EP4062281A1 (en)
JP (2) JP7379668B2 (en)
KR (1) KR20220038497A (en)
CN (1) CN114503077A (en)
WO (1) WO2021101617A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11544113B2 (en) * 2019-11-20 2023-01-03 Google Llc Task scheduling for machine-learning workloads
US11847489B2 (en) * 2021-01-26 2023-12-19 Apple Inc. United states graphics processor techniques with split between workload distribution control data on shared control bus and corresponding graphics data on memory interfaces
US11436054B1 (en) * 2021-04-05 2022-09-06 Hewlett Packard Enterprise Development Lp Directing queries to nodes of a cluster of a container orchestration platform distributed across a host system and a hardware accelerator of the host system

Family Cites Families (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2476314A1 (en) 2002-02-07 2003-08-14 Think-Dynamics Inc. Method and system for managing resources in a data center
US20130185729A1 (en) 2012-01-13 2013-07-18 Rutgers, The State University Of New Jersey Accelerating resource allocation in virtualized environments using workload classes and/or workload signatures
JP2015132887A (en) 2014-01-09 2015-07-23 富士通株式会社 Request distribution program, request distribution method, and information processing device
US9697045B2 (en) 2015-03-24 2017-07-04 International Business Machines Corporation Selecting resource allocation policies and resolving resource conflicts
US10241674B2 (en) 2015-12-11 2019-03-26 Vmware, Inc. Workload aware NUMA scheduling
US10346950B2 (en) 2016-10-05 2019-07-09 Hidden Path Entertainment, Inc. System and method of capturing and rendering a stereoscopic panorama using a depth buffer
US10176550B1 (en) 2017-03-20 2019-01-08 Nutanix, Inc. GPU resource usage display and dynamic GPU resource allocation in a networked virtualization system
US11010205B2 (en) 2017-05-30 2021-05-18 Hewlett Packard Enterprise Development Lp Virtual network function resource allocation
US11222256B2 (en) 2017-10-17 2022-01-11 Xilinx, Inc. Neural network processing system having multiple processors and a neural network accelerator
US10445249B2 (en) * 2017-11-09 2019-10-15 International Business Machines Corporation Facilitating access to memory locality domain information
US10713092B2 (en) 2018-01-02 2020-07-14 Jpmorgan Chase Bank, N.A. Dynamic resource management of a pool of resources for multi-tenant applications based on sample exceution, query type or jobs
US10942767B2 (en) 2018-02-27 2021-03-09 Microsoft Technology Licensing, Llc Deep neural network workload scheduling
US10728091B2 (en) * 2018-04-04 2020-07-28 EMC IP Holding Company LLC Topology-aware provisioning of hardware accelerator resources in a distributed environment
US11315013B2 (en) 2018-04-23 2022-04-26 EMC IP Holding Company LLC Implementing parameter server in networking infrastructure for high-performance computing
US10601903B2 (en) * 2018-05-17 2020-03-24 International Business Machines Corporation Optimizing dynamical resource allocations based on locality of resources in disaggregated data centers
US11030012B2 (en) 2018-09-28 2021-06-08 Intel Corporation Methods and apparatus for allocating a workload to an accelerator using machine learning
US11216314B2 (en) * 2018-11-02 2022-01-04 EMC IP Holding Company LLC Dynamic reallocation of resources in accelerator-as-a-service computing environment
US20200294182A1 (en) * 2019-03-15 2020-09-17 Intel Corporation On chip dense memory for temporal buffering
US11388054B2 (en) * 2019-04-30 2022-07-12 Intel Corporation Modular I/O configurations for edge computing using disaggregated chiplets
US11521042B2 (en) 2019-05-21 2022-12-06 Anil Ravindranath System and method to dynamically and automatically sharing resources of coprocessor AI accelerators
US11301307B2 (en) * 2019-07-24 2022-04-12 Red Hat, Inc. Predictive analysis for migration schedulers
US20210097428A1 (en) * 2019-09-30 2021-04-01 International Business Machines Corporation Scalable and dynamic transfer learning mechanism
US20210149677A1 (en) * 2019-11-15 2021-05-20 Intel Corporation Enhanced processor functions for calculation
US11726793B2 (en) * 2019-11-15 2023-08-15 Intel Corporation Data locality enhancement for graphics processing units
US11544113B2 (en) * 2019-11-20 2023-01-03 Google Llc Task scheduling for machine-learning workloads
US11586932B2 (en) * 2020-03-10 2023-02-21 International Business Machines Corporation Model training with variable batch sizing and gradient checkpoint segments
US11526964B2 (en) * 2020-06-10 2022-12-13 Intel Corporation Deep learning based selection of samples for adaptive supersampling
US20220114401A1 (en) * 2020-10-12 2022-04-14 International Business Machines Corporation Predicting performance of machine learning models
US20220171700A1 (en) * 2020-12-02 2022-06-02 Unifabrix Ltd. System and method for multimodal computer address space provisioning
US20220188691A1 (en) * 2020-12-11 2022-06-16 International Business Machines Corporation Machine Learning Pipeline Generation
US20220124009A1 (en) * 2021-11-16 2022-04-21 Thijs Metsch Intent-based orchestration in heterogenous compute platforms
US20230195519A1 (en) * 2021-12-22 2023-06-22 Intel Corporation Low power inference engine pipeline in a graphics processing unit
US20230137191A1 (en) * 2022-11-12 2023-05-04 Adrian C. Hoban Mechanism to recompose workload packages in a computing environment

Also Published As

Publication number Publication date
US20210149729A1 (en) 2021-05-20
JP2024020271A (en) 2024-02-14
EP4062281A1 (en) 2022-09-28
WO2021101617A1 (en) 2021-05-27
JP7379668B2 (en) 2023-11-14
US11544113B2 (en) 2023-01-03
US20230136661A1 (en) 2023-05-04
KR20220038497A (en) 2022-03-28
JP2023511467A (en) 2023-03-20

Similar Documents

Publication Publication Date Title
CN111247533B (en) Machine learning runtime library for neural network acceleration
US20230136661A1 (en) Task scheduling for machine-learning workloads
CN115269717B (en) Storage device, distributed storage system, and data processing method
US10891156B1 (en) Intelligent data coordination for accelerated computing in cloud environment
US10108458B2 (en) System and method for scheduling jobs in distributed datacenters
US8671418B2 (en) Environment modification in a hybrid node computing environment
TWI547817B (en) Method, system and apparatus of planning resources for cluster computing architecture
CN110619595A (en) Graph calculation optimization method based on interconnection of multiple FPGA accelerators
CN102929725B (en) Dynamic reconfiguration method of signal processing parallel computing software
US20200073830A1 (en) Method, apparatus, and system for an architecture for machine learning acceleration
US10936377B2 (en) Distributed database system and resource management method for distributed database system
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
WO2022165497A1 (en) Method and system for designing a robotic system architecture with optimized system latency
US11954518B2 (en) User-defined metered priority queues
EP3779778A1 (en) Methods and apparatus to enable dynamic processing of a predefined workload
CN106844024B (en) GPU/CPU scheduling method and system of self-learning running time prediction model
US20220083493A1 (en) Asymmetric data communication for host-device interface
US11663465B2 (en) Method of managing task performance in an artificial neural network, and system executing an artificial neural network
US11341025B2 (en) Dynamic tuning of computing devices using application log data
WO2021199396A1 (en) Distributed processing node and distributed processing system
WO2023086204A1 (en) Reducing latency in highly scalable hpc applications via accelerator-resident runtime management
KR20230064963A (en) Method and apparatus for resource allocation in cluster computing system
CN116166396A (en) Training method and device of scheduling model, electronic equipment and readable storage medium
JP2023505460A (en) Variation-aware qubit transfer schemes for NISQ-era computers
CN117950816A (en) Job scheduling method, device and chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40071346
Country of ref document: HK