WO2022236834A1

WO2022236834A1 - Method and system for scheduling tasks

Info

Publication number: WO2022236834A1
Application number: PCT/CN2021/093945
Authority: WO
Inventors: Zhigang Hua; Gan LIU; Feng QI; Shuang Yang; Runzhong WANG
Original assignee: Alipay (Hangzhou) Information Technology Co., Ltd.
Priority date: 2021-05-14
Filing date: 2021-05-14
Publication date: 2022-11-17
Also published as: CN116670684A

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for DAG-based task scheduling. The method may include: obtaining a directed acyclic graph (DAG) representing a plurality of computing tasks to be scheduled and processed by one or more processors. The DAG includes a plurality of nodes representing the plurality of computing tasks. The method further includes generating embeddings for the plurality of nodes in the DAG, and determining one or more edges to be added to the DAG based on the embeddings of the plurality of nodes and a policy network. The policy network is trained based on a plurality of training DAGs and a loss function associated with a heuristic scheduling algorithm. The method further includes adding the one or more edges to the DAG to obtain an updated DAG; and scheduling the plurality of computing tasks based on the updated DAG and the heuristic scheduling algorithm for the one or more processors to process.

Description

METHOD AND SYSTEM FOR SCHEDULING TASKS

TECHNICAL FIELD

The disclosure generally relates to systems and methods for scheduling tasks, specifically, directed acyclic graphs (DAGs) -based task scheduling using reinforcement learning (RL) .

BACKGROUND

Scheduling computational tasks, commonly represented by directed acyclic graphs (DAGs) , is a critical problem in many areas of computer science ranging from programming languages (e.g., compilation) , operating systems (e.g., parallel processing) , data engineering (e.g., distributed batch/streaming computation topology) , to machine learning (e.g., training graphs) . The overall goal of the DAG-based scheduling problem is to find an optimal scheduling solution (execution order) so that the tasks can be executed with a minimal makespan or a shortest average waiting time. These tasks are usually associated with numerous restrictions to which the scheduling solution must comply. For instance, some tasks may depend on other tasks, while some tasks may have resource constraints. Therefore, the DAG-based scheduling problem combines two well-known NP-hard problems: the minimum makespan problem and the bin packing problem. The former handles interdependence among tasks, and the latter handles one-dimensional or multi-dimensional resource constraints.

Existing solutions generally include solving an integer programming problem with branch-and-bound that is usually intractable in practice. Other heuristic approaches, such as Shortest Job First (SJF) , Highest Level First, Longest Job Time, Critical Path (CP) , and Random Priority, assign priorities to tasks and then execute the tasks when their dependent tasks are finished. These heuristics are problem-independent and incapable of utilizing the dependencies as defined by the DAGs or the resource consumption constraints when scheduling a job. These methods often fail to obtain optimal scheduling solutions.

SUMMARY

Various embodiments of the present specification may include systems, methods, and non-transitory computer-readable media for DAG-based task scheduling.

According to one aspect, a method for DAG-based task scheduling comprising: obtaining a directed acyclic graph (DAG) representing a plurality of computing tasks to be scheduled and processed by one or more processors, wherein the DAG includes a plurality of nodes representing the plurality of computing tasks; generating embeddings for the plurality of nodes in the DAG; determining one or more edges to be added to the DAG based on the embeddings of the plurality of nodes and a policy network, wherein the policy network is trained based on a plurality of training DAGs and a loss function associated with a heuristic scheduling algorithm; adding the one or more edges to the DAG to obtain an updated DAG; and scheduling the plurality of computing tasks based on the updated DAG and the heuristic scheduling algorithm for the one or more processors to process.

In some embodiments, the scheduling the plurality of computing tasks based on the updated DAG comprises: applying the heuristic scheduling algorithm to the updated DAG to determine the scheduling of the plurality of computing tasks.

In some embodiments, the obtaining a DAG representing a plurality of computing tasks comprises: receiving a plurality of jobs for parallel processing, each job comprising a plurality of tasks; generating a plurality of DAGs corresponding to the plurality of jobs, each DAG representing the plurality of tasks in a corresponding job; generating a new root node; and generating the DAG by connecting the plurality of DAGs to the new root node.

In some embodiments, the generating embeddings for the plurality of nodes comprises: obtaining feature vectors of the plurality of nodes, a feature vector of each node comprising at least one of the following: a runtime of a computing task corresponding to the node or an amount of resources required for running the computing task; and computing the embeddings of the plurality of nodes based on a graph neural network (GNN) and the feature vectors of the plurality of nodes.

In some embodiments, the computing the embeddings of the plurality of nodes based on the GNN comprises: for each node of the plurality of nodes, inputting the feature vector of the each node through a first neural network to obtain a hidden representation of the each node; and receiving other hidden representations propagated from neighboring nodes to the each node; and updating the hidden representation of the each node based on the received other hidden representations propagated from the neighboring nodes.

In some embodiments, the generating embeddings for the plurality of nodes in the DAG comprises: generating the embeddings for the plurality of nodes in the DAG based on a graph neural network (GNN) , wherein the GNN and the policy network are jointly trained based on a plurality of training DAGs using reinforcement learning (RL) .

In some embodiments, the policy network comprises a first policy network and a second policy network, and the determining one or more edges comprises: identifying one or more starting nodes from the plurality of nodes in the DAG by inputting the embeddings of the plurality of nodes into the first policy network; identifying one or more ending nodes from the plurality of nodes in the DAG by inputting the embeddings of the one or more starting nodes into the second policy network; and determining one or more edges connecting the one or more starting nodes and the one or more ending nodes.

In some embodiments, the identifying the one or more starting nodes from the plurality of nodes in the DAG based on the first policy network comprises: inputting the embeddings of the plurality of nodes into the first policy network to obtain a plurality of probabilities respectively corresponding to the plurality of nodes, wherein each probability represents a recommended chance to select a corresponding node as a starting node.

In some embodiments, identifying the one or more ending nodes from the plurality of nodes in the DAG based on the embeddings of the one or more starting nodes and the second policy network comprises: for each of the one or more starting nodes, inputting the embedding of the starting node and the embeddings of the plurality of nodes in the DAG into the second policy network to obtain a plurality of probabilities respectively corresponding to the plurality of nodes, wherein each probability representing a recommended chance to select a corresponding node as an ending node that matches with the starting node.

In some embodiments, the one or more edges are restrained from forming a close loop in the DAG, and the first and the second policy network each comprises a softmax layer configured to enforce the restrain.

In some embodiments, the method further comprises jointly training the GNN and the policy network, the training comprising: at each time step of the training, identifying a new edge to be added to the DAG based on a current policy network; determining a first performance metric by applying the heuristic scheduling algorithm to the DAG; determining a second performance metric by applying the heuristic scheduling algorithm to the DAG with the new edge; determining a reward based on the first performance metric and the second performance metric; and updating parameters of the GNN and the current policy network based on the loss function comprising a plurality of the rewards across a plurality of time steps.

In some embodiments, the heuristic scheduling algorithm comprises one of the following: Shortest Job First (SFJ) , Critical Path (CP) , and First-In-First-Out (FIFO) .

According to other embodiments, a system for DAG-based task scheduling comprises one or more processors and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform the method of any of the preceding embodiments.

According to yet other embodiments, a non-transitory computer-readable storage medium is configured with instructions executable by one or more processors to cause the one or more processors to perform the method of any of the preceding embodiments.

According to still other embodiments, an apparatus comprises a plurality of modules for performing the method of any of the preceding embodiments.

According to another aspect, a system for DAG-based task scheduling may comprise a computer system comprising a processor and a non-transitory computer-readable storage medium storing instructions executable by the processor to cause the computer system to perform operations comprising obtaining a directed acyclic graph (DAG) representing a plurality of computing tasks for scheduling, wherein the DAG includes a plurality of nodes representing the plurality of computing tasks; generating embeddings for the plurality of nodes in the DAG; determining one or more edges to be added to the DAG based on the embeddings of the plurality of nodes and a policy network, wherein the policy network is trained based on a plurality of training DAGs and a loss function associated with a heuristic scheduling algorithm; updating the DAG by adding the one or more edges to the DAG; and scheduling the plurality of computing tasks based on the updated DAG and the heuristic scheduling algorithm. :

According to yet another aspect, a non-transitory computer-readable storage medium may be configured with instructions executable by one or more processors to cause the one or more processors to perform operations comprising obtaining a directed acyclic graph (DAG) representing a plurality of computing tasks for scheduling, wherein the DAG includes a plurality of nodes representing the plurality of computing tasks; generating embeddings for the plurality of nodes in the DAG; determining one or more edges to be added to the DAG based on the embeddings of the plurality of nodes and a policy network, wherein the policy network is trained based on a plurality of training DAGs and a loss function associated with a heuristic scheduling algorithm; updating the DAG by adding the one or more edges to the DAG; and scheduling the plurality of computing tasks based on the updated DAG and the heuristic scheduling algorithm.

Embodiments disclosed in the specification have one or more technical effects. In some embodiments, for a DAG representing a plurality of tasks to be scheduled, a trained reinforcement learning (RL) agent is employed to iteratively add directed edges to the DAG. The trained RL agent may guarantee that these directed edges comply with scheduling constraints (e.g., priorities of execution and resource allocation) . By doing so, the original DAG scheduling problem is dramatically simplified to a proxy problem with an updated DAG, on which a traditional heuristic scheduling algorithm such as SJF and CP can be directly applied to obtain more efficient scheduling solution with a shorter makespan. In some embodiments, the RL agent may include one or more policy networks trained to select the potential directed edges to add to the DAG. The networks may be trained based on a plurality of training DAGs (e.g., training data comprising a plurality of DAGs) and one specific heuristic scheduling algorithm. That is, for different heuristic scheduling algorithms, corresponding RL agents may be trained. This way, the described embodiments herein can be easily integrated with any existing algorithms without extensive modification, such as SJF, CP, etc. In some embodiments, in order to accurately represent the tasks to be scheduled in the DAG, the nodes corresponding to the tasks in the DAG may be encoded using a Graph Neural Network (GNN) . The encoding process takes into account the topological structure of the DAG and the scheduling constraints such as the run time and resource requirement of each node corresponding to a task. With the GNN, the feature vectors of the nodes in the DAG may be updated to include information propagated from neighboring nodes, which allows the scheduling to be more accurate and optimized at a global scale.

These and other features of the systems, methods, and non-transitory computer-readable media disclosed herein, as well as and the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for purposes of illustration and description only and are not intended as a definition of the limits of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example DAG-based scheduling problem, in accordance with various embodiments.

FIG. 2 illustrates an example environment for scheduling computational tasks, in accordance with various embodiments.

FIG. 3 illustrates an example workflow for DAG-based scheduling with reinforcement learning, in accordance with various embodiments.

FIG. 4 illustrates example policy networks trained for adding directed edges to a DAG, in accordance with various embodiments.

FIG. 5 illustrates a diagram for constructing a DAG in accordance with some embodiments.

FIG. 6 illustrates an example method for DAG-based scheduling with reinforcement learning, in accordance with various embodiments.

FIG. 7 illustrates an example computer system in which any of the embodiments described herein may be implemented.

DETAILED DESCRIPTION

The approaches disclosed herein include a deep reinforcement learning framework for directed acyclic graph (DAG) scheduling. These approaches are motivated by the observation that unsatisfactory schedules are often generated by existing solutions due entirely to the wrong ordering of a few task nodes (e.g., the nodes in a DAG representing tasks) while the rest is close to optimal. If these “tricky” nodes are correctly ordered, the quality of schedules would be dramatically improved. Moreover, if the correct ordering of these nodes can be given a priori by an oracle, the scheduling tasks would become substantially simpler such that conventional heuristics are able to find near-optimal solutions. One way to achieve this is to add directed edges to break the ties among tricky job nodes, i.e., to explicitly require that one job node needs to be given priority over another when it comes to execution as well as resource allocation.

FIG. 1 illustrates an example DAG-based scheduling problem, in accordance with various embodiments, and how adding directed edges can improve the quality of the schedules generated by conventional heuristic algorithms such as shortest job first (SJF) and critical path (CP) . In FIG. 1, a plurality of tasks for scheduling are represented as nodes in a DAG. Each task may be associated with various attributes. For illustrative purposes, the nodes in FIG. 1 each includes two attributes: an amount of resources required for executing a corresponding task, and the run time (e.g., projected execution time) of the corresponding task. Depending on the use case, the tasks may be associated with fewer, more, or different attributes. These tasks may also be interdependent, e.g., some tasks are prerequisites of other tasks. The interdependencies among the task nodes are represented as the directed edges in the DAG. As shown, without adding any directed edge, both scheduling solutions determined by the heuristic scheduling algorithms SJF and CP yield a makespan of 21. The SJF algorithm prioritizes the tasks with the shortest runtimes, and CP algorithm ranks nodes by the maximum sum of task runtimes along the path to any of their leaf nodes in a DAG. Here, the “makespan” refers to the distance in time that elapses from the start of work to the end (i.e., from the point of starting the first task to the point of ending the last task) . However, after adding directed edges that enforcing dependencies among tasks, the new DAGs may allow SFJ and CP to find better scheduling solutions with makespans of 16. The added directed edges simplify DAG scheduling by including additional constraints and convert the original problem to a simpler proxy.

In the embodiments described herein, a reinforcement learning (RL) agent (e.g., a software program) may be configured to determine the addition of the directed edges to the DAG based on one or more trained policy networks. The RL agent may convert the original DAG to an updated DAG that allows conventional heuristic scheduling algorithms to generate optimal scheduling solutions.

FIG. 2 illustrates an example environment for scheduling computational tasks, in accordance with various embodiments. As shown in FIG. 1, the environment may comprise a computing system 102 for scheduling computational tasks 107. The computational tasks 107 may include various tasks that can be executed by computing devices such as 105a and 105b. These tasks 107 may be associated with numerous execution requirement and constraints, such as interdependencies (e.g., one or more tasks may be prerequisites of one or more other tasks) , resource consumptions (e.g., how much computational resources are required for executing one task) , runtime (e.g., execution duration of one task) , other suitable attributes, or any combination thereof. Common examples may include tasks submitted from a plurality of clients to a server in which the tasks need to be scheduled to execute (either on the server or be distributed to other computing devices) . In some embodiments, the

computing devices

105a and 105b may include various devices such as computers, smart devices, edge devices. Even though the

devices

105a and 105b are illustrated as separated from the computing system 102, they may refer to components within the computing system 102.

In some embodiments, the computing system 102 may include a staging area for caching the received tasks 107. The computing system 102 may determine optimal scheduling solutions for the cached tasks 107. The determination is subject to the scheduling constraints such as the interdependencies among the tasks 107 (e.g., certain tasks cannot be executed until their prerequisite tasks are executed) and resource consumptions of the tasks 107. For instance, the total computational resource consumption of the tasks scheduled to be executed concurrently cannot exceed the available computational resources.

The components of the computing system 102 in FIG. 1 are illustrative. Depending on the implementation, the computing system 102 may include additional, fewer, or alternative components. The computing system 102 may be implemented in one or more networks (e.g., enterprise networks) , one or more endpoints, one or more servers, or one or more clouds. The computing system 102 may include hardware or software which manages access to a centralized resource or service in a network. A cloud may include a cluster of servers and other devices distributed across a network.

In some embodiments, the computing system 102 may include a DAG generating component 112, an embedding component 114, a determining component 116, and a scheduling component 118. The computing system 102 may include one or more processors (e.g., a digital processor, an analog processor, a digital circuit designed to process information, a central processing unit, a graphics processing unit, a microcontroller or microprocessor, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information) and one or more memories (e.g., permanent memory, temporary memory, non-transitory computer-readable storage medium) . The one or more memories may be configured with instructions executable by the one or more processors. The processor (s) may be configured to perform various operations by interpreting machine-readable instructions stored in the memory.

In some embodiments, the DAG generating component 112 of the computing system 102 may be configured to obtain a directed acyclic graph (DAG) representing a plurality of received computing tasks for scheduling. The DAG may include a plurality of nodes corresponding to the plurality of computing tasks and a plurality of directed edges representing the interdependencies among the plurality of computing tasks. In some embodiments, each node in the DAG may be associated with a feature vector containing various attributes of the corresponding computational task such as resource consumptions, mandatory starting time and/or ending time, prerequisite (parent) tasks, other suitable attributes, or any combination thereof. The DAG may be constructed by another entity and sent to the computing system 102, or constructed by the DAG generating component 112 of the computing system 102 based on the tasks.

In some embodiments, the embedding component 114 of the computing system 102 may be configured to generate embeddings for the plurality of nodes in the DAG. As mentioned above, the nodes in the DAG may be associated with feature vectors. The embedding component 114 may update the feature vectors of the nodes using a graph neural network (GNN) . The feature vector of each node may embed information from the feature vectors of the node’s neighboring nodes. Here, the neighboring nodes may refer not only to the immediately neighboring nodes but also remotely neighboring nodes. In some embodiments, the embedding process may include: obtaining feature vectors of the plurality of nodes, the feature vector of each node comprising at least one of the following: a runtime of the computing task corresponding to the node or an amount of resources required for running the computing task; and computing the embeddings of the plurality of nodes based on a graph neural network (GNN) and the feature vectors of the plurality of nodes. For each node of the plurality of nodes, the feature vector of the each node may go through a first neural network to obtain a hidden representation of the each node. Here, the embedding of a node refers to an encoded form of a feature vector of the node, and the encoding/embedding is based on the feature vectors of neighboring nodes of the node. Meanwhile, the node may also receive other hidden representations propagated from its neighboring nodes. The hidden representation of this node may then be updated based on the received hidden representations propagated from its neighboring nodes.

In some embodiments, the determining component 116 of the computing system 102 may be configured to determine one or more edges to be added to the DAG and update the DAG by adding the one or more edges to the DAG. The determining component 116 may include a reinforcement learning agent that follows a policy network. The policy network may be trained based on a plurality of training DAGs and a loss function associated with a heuristic scheduling algorithm. In some embodiments, the policy network may include at least two policy networks. The first policy network is trained to identify starting nodes and the second policy network is trained to identify ending nodes based on the starting nodes identified by the first policy network. That is, the one or more edges may be determined by: identifying one or more starting nodes from the plurality of nodes in the DAG based on the first policy network; identifying one or more ending nodes from the plurality of nodes in the DAG based on embeddings of the one or more starting nodes and the second policy network; and determining one or more edges connecting the one or more starting nodes and the one or more ending nodes. In some embodiments, the first and second policy networks share a same network structure but have different input parameter configurations. In some embodiments, the one or more edges are restrained from forming circles in the DAG, and first policy network and the second policy network each comprises a softmax layer configured to enforce the restrain. A detailed description regarding the policy networks may refer to FIG. 4.

In some embodiments, the scheduling component 118 may be configured to determine the scheduling of the plurality of tasks 107 based on the updated DAG with the added edges and the heuristic scheduling algorithm. The determination may include applying the heuristic scheduling algorithm to the updated DAG to determine a schedule. In some embodiments, the heuristic scheduling algorithm may be the same algorithm that is used during the training of the policy networks in determining module 116. It means that the policy networks are customized for the heuristic scheduling algorithm, and different heuristic scheduling algorithms may be used to train different policy networks.

FIG. 3 illustrates an example workflow 300 for DAG-based scheduling with reinforcement learning, in accordance with various embodiments. As shown, the example workflow 300 may include two phases, a training phase 302 and an application phase 303. The training phase 302 and the applications phase 303 may be carried out by the same or different entities. In some embodiments, the training phase 302 may be performed offline. In other embodiments, both the training phase 302 and the application phase 303 may be performed online. That is, the data collected from the application phase 303 such as inputting DAG, actions took (e.g., edges added) , and rewards (e.g., makespan reduction) may be used as part of the training phase 302.

In some embodiments, the training phase 302 may include a plurality of steps illustrated in FIG. 3. Depending on the implementation, the training phase 302 may include fewer, more, or alternative steps, and the steps may be performed in a different order or in parallel. In some embodiments, the training phase 302 may include a reinforcement learning (RL) agent 380 exploring and learning two networks: a graph neural network (GNN) 390A and a policy network 390B. The GNN 390A may be trained to encode the nodes in a DAG so that the feature vectors of the nodes are embedded with configuration information propagated from neighboring nodes and the topological structure of the DAG. The policy network 390B may be trained to make recommended actions to the RL agent 380 in response to a state. Here, the action refers to adding a directed edge between a starting node and an ending node to the DAG that does not conflict with the scheduling restrictions of the DAG. For example, a conflict occurs when there is an existing directed path between the starting node and the ending node. In some embodiments, according the policy network 390B, the selection of the starting node is conditioned on the encoded feature vectors (also called embeddings) of the nodes generated by the GNN 390A, and the selection of the ending node is conditioned on the encoded feature vectors of the nodes generated by the GNN 390A and the selected starting node.

In some embodiments, the training of the GNN 390 and the policy network 390B may be based on a plurality of historical DAGs or synthetic DAGs (also referred to as training DAGs) 360 and a specific scheduling algorithm 370. The scheduling algorithm 370 may be a heuristic scheduling algorithm such as SJF, CP, etc., or another suitable algorithm. The scheduling algorithm 370 may be applied to a DAG to determine a scheduling solution (e.g., an execution plan) for the tasks represented by the DAG.

In some embodiments, the training process 302 may include a reinforcement learning process formulated based on the following configuration.

State: a DAG graph G= {n, N ⁰, E ⁰} , where n is the number of nodes in the DAG, N ⁰ is the set of n nodes, and E ⁰ is the set of directed edges in the DAG.

Action: adding a directed edge a in G by sequentially selecting a starting node a ¹ and an ending node a ² in G. The added edge may not conflict with the graph structure of G, and a conflict occurs when there is an existing directed edge between a ¹ and a ², or the addition of the directed edge will form a close loop within G. The selection of starting node a ¹ is conditioned on G, and the selection of ending node a ² is conditioned on G and a ¹. By sequentially selecting the starting and ending nodes, the time complexity of adding an edge in G is reduced from O (n ²) to O (n) .

Transition: after adding the directed edge, G is transitioned to G′.

Reward: in some embodiments, the reward is defined as the difference between makespans of G′ and G using the scheduling algorithm 370. Depending on the implantation, the reward can be easily switched to other measurements such as average waiting time.

In some embodiments, the GNN 390A may be structured as an L-layer neural network with parameters of

where 1≤l≤L, and L is an integer greater than 1. During the training phase 302 and the application phase 303, the GNN 390A may be used similarly to compute graph embeddings for the nodes in a DAG. For example, given a DAG G, the original feature vectors of each node in G may include attributes such as runtime and required computational resources (e.g., CPU time, storage space, number of cores) . By using GNN 390A, these feature vectors may be transformed into a hidden representation e ⁰ embedded with information propagated from neighboring nodes through a plurality of iterations of message passing.

An example process using the GNN 390A to generate embeddings for the nodes in G may include the following steps. Step 1, the edges in G may be reversed to form a new graph G ^T. Step 2, at each iteration of message passing in G ^T, each node’s embedding is propagated to its neighbors and updated as follows:

where σ () stands for activation function, H is the number of message passing iterations and h is the current iteration,

is the weight matrix for the h _thiteration, N _irepresents the set of neighbors in G ^T (e.g., the children nodes in G) . Hence the parameters of the GNN network may be denoted as:

In some embodiments, the output of the GNN 390A may include a deep embedding (also referred to updated feature vector) for each node. In some embodiments, the entire DAG may also be represented as a global embedding determined based on the embeddings of all the nodes within the DAG. The global embedding of the DAG may be used as one of the inputs to the policy network 390B.

In some embodiments, the GNN 380 and the policy network 390B may be jointly trained based on a policy gradient algorithm. The training process may include a plurality of rollouts, denoted as N rollouts. In each rollout, a plurality of edges, denoted as M edges are added across a plurality of time steps. At j _th time step of i _th rollout, a triplet is recorded for training purposes:

where G _i, jis the updated DAG graph with added edges,

and

refer to the starting node and the ending node of an added edge a _i, j respectively, and r _i, j represents the reward computed as the reduction of makespan (or average waiting time, or another suitable metrics) of G _i, j after an edge a _i, j is added. In some embodiments, to inspire long term reward, the cumulative reward at the j _th time step may be a discounted sum of incremental rewards from the j _th time step until completion of a rollout:

wherein r is the discounting factor that may be set as 1.0 as a default value.

For every N rollouts, a baseline reward at each time step j may be computed by the following equation, which is subtracted from rewards to reduce variance:

The parameters of the GNN 390A and the policy network 390B may be updated after every N rollouts. A joint loss function may be defined as:

where φ represents the parameters of the GNN 390A, and θ represents the parameters of the policy network 390B.

In some embodiments, the policy network 390B may include two policy networks: a first one is trained for determining starting nodes and a second one is trained for determining ending nodes. Denoting the parameters of the first and second policy networks as θ ₁ and θ ₂ respectively, the loss function may be represented as:

After the training phase 302, the GNN 390 and the policy network 390B may be used in the application phase 303 to determine edges to be added to an incoming DAG representing a plurality of to-be-scheduled tasks.

In some embodiments, the application phase 303 may start with obtaining a DAG at step 310 to represent the to-be-scheduled tasks. The DAG may be constructed based on the tasks, or received from another entity or device that performed the construction. In some embodiments, a plurality of DAGs representing a plurality of multi-task jobs may be received, and the tasks within the jobs may be scheduled altogether. In some embodiments, these received DAGs may be first aggregated into one DAG with a new root node. FIG. 5 illustrates a diagram for constructing a DAG in accordance with some embodiments. As shown in FIG. 5, three

jobs

510, 520, and 530 may each include a plurality of interdependent tasks that are represented by three DAGs. By adding a new root node 510 and attaching the three DAGs to the new root node 510, a new DAG may be formed as the output of step 310.

Referring to FIG. 3, the DAG output from step 310 may be fed into the GNN 390A to generate embeddings at step 320. Based on the embeddings of the nodes in the DAG, the RL agent 380 may recommend directed edges based on the trained policy network 390B at step 330. In some embodiments, the trained policy network 390B may include two policy networks for determining the directed edges at step 330, which may include identifying one or more starting nodes from the plurality of nodes in the DAG based on the first policy network; identifying one or more ending nodes from the plurality of nodes in the DAG based on embeddings of the one or more starting nodes and the second policy network; and determining one or more edges connecting the one or more starting nodes and the one or more ending nodes. For example, the embeddings of the plurality of nodes may be fed into the first policy network to obtain a plurality of probabilities respectively corresponding to the plurality of nodes. Each probability represents a recommended chance to select a corresponding node as a starting node. Based on the recommended probabilities, one or more starting nodes may be determined. Next, for each starting node, combinations of the embedding of the starting node and the embeddings of other nodes may be fed into the second policy network to obtain a plurality of probabilities respectively corresponding to the other nodes. Here, the “other nodes” may refer to the plurality of nodes except for the starting node and the nodes that are restrained from having a directed edge from the starting node. For example, each input to the second policy network may include the embedding of the starting node and the embedding of another node, and yield an output of a probability that the another node is recommended as the ending node corresponding to the starting node. Subsequently, for each of the starting nodes, one ending node may be selected based on the recommended probabilities generated by the second policy network.

After the starting nodes and the ending nodes are selected at step 330, the DAG may be updated by adding the directed edges connecting the starting nodes and corresponding ending nodes at step 340. The scheduling algorithm 370 may then be applied to the updated DAG to determine the scheduling solution for the tasks at step 350. The tasks may be executed according to the determined scheduling solution to reach an optimal makespan or average waiting time or another suitable performance measurement.

FIG. 4 illustrates example policy networks trained for adding directed edges to a DAG, in accordance with various embodiments. The components or layers in FIG. 4 are for illustrative purposes only. Depending on the implementation, the policy networks in FIG. 4 may include fewer, more, or alternative components or layers. In some embodiments, the policy network 410 and the policy network 420 may be collectively referred to as the policy network 390B in FIG. 3 and jointly trained. As described in FIG. 3, the policy network 410 may be referred to as a first policy network trained to determine starting nodes in a DAG, and the policy network 420 may be referred to as a second policy network trained to determine ending nodes in the DAG. The starting nodes and the ending nodes then are used to determine the directed edges to be added to the DAG to improve the performance of the task scheduling algorithm.

In some embodiments, the inputs to the

policy networks

410 and 420 may be different. For example, the inputs to the policy network 410 may include a plurality of pairs, each pair including a global embedding of the DAG and one embedding (e.g., feature vector) of a node in the DAG. In some embodiments, the global embedding of the DAG may be the average embedding of all the nodes, and the embedding of a node may be the feature vector of the node after going through the GNN 390A in FIG. 3. The inputs to the policy network 420, on the other hand, may include more information related to the already determined starting nodes. As shown in FIG. 4, the inputs to the policy network 420 may include a plurality of triplets. Each triplet includes the overall embedding of the DAG, the embedding of one starting node, and the embedding of a different node. Here, the different node may refer to a node in the DAG that does not have an existing edge with the starting node and is not restricted from having an edge with the starting node.

In some embodiments, both

policy networks

410 and 420 may include a plurality of neural network layers to extract features from the inputs. At the end of the

policy networks

410 and 420, a softmax layer may be implemented to enforce one or more edge-addition restrictions, such as no close loop is allowed, and the newly added edge cannot conflict with existing edges (e.g., the dependency indicated by the new edge cannot contradict with existing task dependencies) .

In some embodiments, the outputs of the policy network 410 may include a plurality of probabilities respectively corresponding to the plurality of nodes in the DAG. Each probability generated by the policy network 410 may represent a recommended chance of selecting the corresponding node as a starting node for the DAG. The outputs of the policy network 420 may also include a plurality of probabilities. Each probability generated by the policy network 420 may represent a conditional probability of selecting the corresponding node as an ending node for a starting node and the DAG.

FIG. 6 illustrates an example method 600 for DAG-based scheduling with reinforcement learning, in accordance with various embodiments. The method 600 may be implemented by the computing system 102 shown in FIG. 2, and correspond to the flow shown in FIG. 3. Depending on the implementation, the method 600 may have additional, fewer, or alternative steps.

Block 610 includes obtaining a directed acyclic graph (DAG) representing a plurality of computing tasks for scheduling, wherein the DAG includes a plurality of nodes representing the plurality of computing tasks. In some embodiments, the obtaining a DAG representing a plurality of computing tasks comprises: receiving a plurality of jobs for parallel processing, each job comprising a plurality of tasks; generating a plurality of DAGs corresponding to the plurality of jobs, each DAG representing the plurality of tasks in a corresponding job; generating a new root node; and generating the DAG by connecting the plurality of DAGs to the new root node.

Block 620 includes generating embeddings for the plurality of nodes in the DAG. In some embodiments, the generating embeddings for the plurality of nodes comprises: obtaining feature vectors of the plurality of nodes, a feature vector of each node comprising at least one of the following: a runtime of a computing task corresponding to the node or an amount of resources required for running the computing task; and computing the embeddings of the plurality of nodes based on a graph neural network (GNN) and the feature vectors of the plurality of nodes. In some embodiments, the computing the embeddings of the plurality of nodes based on the GNN comprises: for each node of the plurality of nodes, inputting the feature vector of the each node through a first neural network to obtain a hidden representation of the each node; and receiving other hidden representations propagated from neighboring nodes to the each node; and updating the hidden representation of the each node based on the received other hidden representations propagated from the neighboring nodes. In some embodiments, the generating embeddings for the plurality of nodes in the DAG comprises: generating the embeddings for the plurality of nodes in the DAG based on a graph neural network (GNN) , wherein the GNN and the policy network are jointly trained based on a plurality of training DAGs using reinforcement learning (RL) .

Block 630 includes determining one or more edges to be added to the DAG based on the embeddings of the plurality of nodes and a policy network, wherein the policy network is trained based on a plurality of training DAGs and a loss function associated with a heuristic scheduling algorithm. In some embodiments, the policy network comprises a first policy network and a second policy network, and the determining one or more edges comprises: identifying one or more starting nodes from the plurality of nodes in the DAG by inputting the embeddings of the plurality of nodes into the first policy network; identifying one or more ending nodes from the plurality of nodes in the DAG by inputting the embeddings of the one or more starting nodes into the second policy network; and determining one or more edges connecting the one or more starting nodes and the one or more ending nodes. In some embodiments, the identifying the one or more starting nodes from the plurality of nodes in the DAG based on the first policy network comprises: inputting the embeddings of the plurality of nodes into the first policy network to obtain a plurality of probabilities respectively corresponding to the plurality of nodes, wherein each probability represents a recommended chance to select a corresponding node as a starting node. In some embodiments, the identifying the one or more ending nodes from the plurality of nodes in the DAG based on the embeddings of the one or more starting nodes and the second policy network comprises: for each of the one or more starting nodes, inputting the embedding of the starting node and the embeddings of the plurality of nodes in the DAG into the second policy network to obtain a plurality of probabilities respectively corresponding to the plurality of nodes, wherein each probability representing a recommended chance to select a corresponding node as an ending node that matches with the starting node.

Block 640 includes updating the DAG by adding the one or more edges to the DAG. In some embodiments, the one or more edges are restrained from forming a close loop in the DAG, and the first and the second policy network each comprises a softmax layer configured to enforce the restrain.

Block 650 includes scheduling the plurality of computing tasks based on the updated DAG and the heuristic scheduling algorithm. In some embodiments, the scheduling the plurality of computing tasks based on the updated DAG comprises: applying the heuristic scheduling algorithm to the updated DAG to determine the scheduling of the plurality of computing tasks. In some embodiments, the heuristic scheduling algorithm comprises one of the following: shortest job first (SFJ) , critical path (CP) , and first-in-first-out (FIFO) .

In some embodiments, the method 600 further comprises jointly training the GNN and the policy network, the training comprising: at each time step of the training, identifying a new edge to be added to the DAG based on a current policy network; determining a first performance metric by applying the heuristic scheduling algorithm to the DAG; determining a second performance metric by applying the heuristic scheduling algorithm to the DAG with the new edge; determining a reward based on the first performance metric and the second performance metric; and updating parameters of the GNN and the current policy network based on the loss function comprising a plurality of the rewards across a plurality of time steps.

FIG. 7 illustrates an example computer system in which any of the embodiments described herein may be implemented. The electronic device may be used to implement one or more components of the systems and the methods shown in FIGs. 1-6. The electronic device 700 may comprise a bus 702 or other communication mechanism for communicating information and one or more hardware processors 704 coupled with bus 702 for processing information. Hardware processor (s) 704 may be, for example, one or more general purpose microprocessors.

The electronic device 700 may also include a main memory 706, such as a random-access memory (RAM) , cache and/or other dynamic storage devices, coupled to bus 702 for storing information and instructions to be executed by processor (s) 704. Main memory 706 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor (s) 704. Such instructions, when stored in storage media accessible to processor (s) 704, may render electronic device 700 into a special-purpose machine that is customized to perform the operations specified in the instructions. Main memory 706 may include non-volatile media and/or volatile media. Non-volatile media may include, for example, optical or magnetic disks. Volatile media may include dynamic memory. Common forms of media may include, for example, a RAM, a DRAM, a PROM, and EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, or networked versions of the same.

The electronic device 700 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the electronic device may cause or program electronic device 700 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by electronic device 700 in response to processor (s) 704 executing one or more sequences of one or more instructions contained in main memory 706. Such instructions may be read into main memory 706 from another storage medium, such as storage device 707. Execution of the sequences of instructions contained in main memory 706 may cause processor (s) 704 to perform the process steps described herein. For example, the processes/methods disclosed herein may be implemented by computer program instructions stored in main memory 706. When these instructions are executed by processor (s) 704, they may perform the steps as shown in corresponding figures and described above. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

The electronic device 700 also includes a communication interface 710 coupled to bus 702. Communication interface 710 may provide a two-way data communication coupling to one or more network links that are connected to one or more networks. As another example, communication interface 710 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or WAN component to communicate with a WAN) . Wireless links may also be implemented.

The performance of certain of the operations may be distributed among the processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the processors or processor-implemented engines may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm) . In other example embodiments, the processors or processor-implemented engines may be distributed across a number of geographic locations.

Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code modules executed by one or more computer systems or computer processors comprising computer hardware. The processes and algorithms may be implemented partially or wholly in application-specific circuitry.

When the functions disclosed herein are implemented in the form of software functional units and sold or used as independent products, they can be stored in a processor executable non-volatile computer readable storage medium. Particular technical solutions disclosed herein (in whole or in part) or aspects that contribute to current technologies may be embodied in the form of a software product. The software product may be stored in a storage medium, comprising a number of instructions to cause a computing device (which may be a personal computer, a server, a network device, and the like) to execute all or some steps of the methods of the embodiments of the present application. The storage medium may comprise a flash drive, a portable hard drive, ROM, RAM, a magnetic disk, an optical disc, another medium operable to store program code, or any combination thereof.

Particular embodiments further provide a system comprising a processor and a non-transitory computer-readable storage medium storing instructions executable by the processor to cause the system to perform operations corresponding to steps in any method of the embodiments disclosed above. Particular embodiments further provide a non-transitory computer-readable storage medium configured with instructions executable by one or more processors to cause the one or more processors to perform operations corresponding to steps in any method of the embodiments disclosed above.

Embodiments disclosed herein may be implemented through a cloud platform, a server or a server group (hereinafter collectively the “service system” ) that interacts with a client. The client may be a terminal device, or a client registered by a user at a platform, wherein the terminal device may be a mobile terminal, a personal computer (PC) , and any device that may be installed with a platform application program.

The various features and processes described above may be used independently of one another or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of this disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed example embodiments. The exemplary systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed example embodiments.

Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.

As used herein, “or” is inclusive and not exclusive, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, “A, B, or C” means “A, B, A and B, A and C, B and C, or A, B, and C, ” unless expressly indicated otherwise or indicated otherwise by context.

The term “include” or “comprise” is used to indicate the existence of the subsequently declared features, but it does not exclude the addition of other features. Conditional language, such as, among others, “can, ” “could, ” “might, ” or “may, ” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.

Although an overview of the subject matter has been described with reference to specific example embodiments, various modifications and changes may be made to these embodiments without departing from the broader scope of embodiments of the present disclosure. Such embodiments of the subject matter may be referred to herein, individually or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single disclosure or concept if more than one is, in fact, disclosed.

The embodiments illustrated herein are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed. Other embodiments may be used and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. The Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.

Claims

A computer-implemented method, comprising:

obtaining a directed acyclic graph DAG representing a plurality of computing tasks to be scheduled and processed by one or more processors, wherein the DAG includes a plurality of nodes representing the plurality of computing tasks;

generating embeddings for the plurality of nodes in the DAG;

determining one or more edges to be added to the DAG based on the embeddings of the plurality of nodes and a policy network, wherein the policy network is trained based on a plurality of training DAGs and a loss function associated with a heuristic scheduling algorithm;

adding the one or more edges to the DAG to obtain an updated DAG; and

scheduling the plurality of computing tasks based on the updated DAG and the heuristic scheduling algorithm for the one or more processors to process.
The method of claim 1, wherein the scheduling the plurality of computing tasks comprises:

applying the heuristic scheduling algorithm to the updated DAG to determine the scheduling of the plurality of computing tasks.
The method of claim 1, wherein the obtaining the DAG representing the plurality of computing tasks comprises:

receiving a plurality of jobs for parallel processing, each job comprising a plurality of tasks;

generating a plurality of DAGs corresponding to the plurality of jobs, each DAG representing the plurality of tasks in a corresponding job;

generating a new root node; and

generating the DAG by connecting the plurality of DAGs to the new root node.
The method of claim 1, wherein the generating the embeddings for the plurality of nodes comprises:

obtaining feature vectors of the plurality of nodes, a feature vector of each node comprising at least one of the following: a runtime of a computing task corresponding to the node or an amount of resources required for running the computing task; and

computing the embeddings of the plurality of nodes based on a Graph Neural Network GNN and the feature vectors of the plurality of nodes.
The method of claim 4, wherein the computing the embeddings of the plurality of nodes based on the GNN comprises:

for each node of the plurality of nodes, inputting the feature vector of the each node through a first neural network to obtain a hidden representation of the each node; and

receiving one or more other hidden representations propagated from neighboring nodes to the each node; and

updating the hidden representation of the each node based on the one or more other hidden representations propagated from the neighboring nodes.
The method of claim 4, wherein the GNN are jointly trained with the policy network based on a plurality of training DAGs using Reinforcement Learning RL.
The method of claim 1, wherein the policy network comprises a first policy network and a second policy network, and the determining the one or more edges comprises:

identifying one or more starting nodes from the plurality of nodes in the DAG by inputting the embeddings of the plurality of nodes into the first policy network;

identifying one or more ending nodes from the plurality of nodes in the DAG by inputting the embeddings of the one or more starting nodes into the second policy network; and

determining the one or more edges by connecting the one or more starting nodes and the one or more ending nodes.
The method of claim 7, wherein the identifying the one or more starting nodes from the plurality of nodes in the DAG based on the first policy network comprises:

inputting the embeddings of the plurality of nodes into the first policy network to obtain a plurality of probabilities respectively corresponding to the plurality of nodes, wherein each probability represents a recommended chance to select a corresponding node as a starting node.
The method of claim 7, wherein the identifying the one or more ending nodes from the plurality of nodes in the DAG based on the embeddings of the one or more starting nodes and the second policy network comprises:

for each of the one or more starting nodes, inputting the embedding of the starting node and the embeddings of the plurality of nodes in the DAG into the second policy network to obtain a plurality of probabilities respectively corresponding to the plurality of nodes, wherein each probability representing a recommended chance to select a corresponding node as an ending node that matches with the starting node.
The method of claim 7, wherein the one or more edges are restrained from forming a close loop in the DAG, and the first and the second policy network each comprises a softmax layer configured to enforce the restrain.
The method of claim 6, wherein the method further comprises jointly training the GNN and the policy network, the training comprising:

at each time step of the training, identifying a new edge to be added to the DAG based on a current policy network;

determining a first performance metric by applying the heuristic scheduling algorithm to the DAG;

determining a second performance metric by applying the heuristic scheduling algorithm to the DAG with the new edge;

determining a reward based on the first performance metric and the second performance metric; and

updating parameters of the GNN and the current policy network based on the loss function comprising a plurality of the rewards across a plurality of time steps.
The method of claim 1, wherein the heuristic scheduling algorithm comprises one of the following: Shortest Job First SFJ, Critical Path CP, or First-In-First-Out FIFO.
A system comprising one or more processors and one or more non-transitory computer-readable memories storing instructions that, when executed by the one or more processors, cause the system to perform the method of any one of claims 1 to 12.
A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1 to 12.
An apparatus comprising a plurality of modules for performing the method of any one of the claims 1 to 12.