CN116069473A - Deep reinforcement learning-based Yarn cluster workflow scheduling method - Google Patents

Deep reinforcement learning-based Yarn cluster workflow scheduling method

Info

Publication number
CN116069473A
CN116069473A (application CN202310149989.1A)
Authority
CN
China
Prior art keywords
task
workflow
time
state
tasks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310149989.1A
Other languages
Chinese (zh)
Inventor
王廷
薛建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202310149989.1A priority Critical patent/CN116069473A/en
Publication of CN116069473A publication Critical patent/CN116069473A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a Yarn cluster workflow scheduling method based on deep reinforcement learning. The workflow is modeled as a directed acyclic graph, which is encoded into a workflow state vector by a graph convolutional neural network; this vector, together with the Yarn cluster queue resource state vector, is fed into a policy neural network, and the policy network is trained with the proximal policy optimization (PPO) algorithm. Compared with the prior art, the method dynamically adjusts scheduling decisions according to the current environment state, improves cluster resource utilization, and reduces both the overall workflow completion time and the delay of higher-priority tasks. The model is simple, easy to train, can significantly improve user experience, and provides technical support for related fields such as Hadoop big data analysis.

Description

Deep reinforcement learning-based Yarn cluster workflow scheduling method
Technical Field
The invention relates to the technical field of big data and workflow scheduling, and in particular to a deep-reinforcement-learning-based workflow scheduling method for Yarn clusters.
Background
Since Google published MapReduce, big data technology has developed for more than twenty years and is now widely used in fields such as business intelligence, healthcare, transportation, the Internet of Things, and artificial intelligence. Hadoop, the open-source distributed computing platform based on the MapReduce programming model, has become the de facto standard of the big data industry. Hadoop 2.0 introduced Yarn, an independent cluster resource management and task scheduling framework that supports multiple computing engines such as MapReduce, Spark, and Tez.
Yarn (Yet Another Resource Negotiator) is a framework supporting multi-user cluster resource management and task scheduling. To prevent interference between users and to keep a single user from occupying too many cluster resources, Yarn typically organizes cluster resources into multiple queues, and a system administrator can set the maximum resources available to each queue. When a user submits a task, the requested resources (number of CPU cores and memory size) and the queue name must be specified.
From Yarn's perspective, the tasks submitted to a queue are independent, runnable tasks without business semantics. From the user's perspective, however, a big data analysis application often consists of several tasks; each task handles a different business function, has a different priority, and many tasks depend on one another. Only after all the tasks that the current task depends on have run to completion can the current task be submitted to the cluster; otherwise an incorrect computation result may be produced. Clearly, when the number of tasks is large and the dependencies are complex, maintaining the dependencies between tasks manually is cumbersome and error-prone. The big data ecosystem therefore introduces a workflow scheduler (also called a workflow management system) to maintain the dependencies between tasks in a workflow and schedule the workflow accordingly. Unlike task scheduling on Yarn, the workflow scheduler is only responsible for submitting tasks to the Yarn cluster according to the task dependencies set by the user; it is not responsible for concrete physical resource management and allocation.
For big data analysis applications, the running time of the application is critical to the user. Besides the properties of the tasks themselves (for example, computational complexity and the amount of data processed), the running time of a workflow depends on the cluster resources available to it. A system administrator typically allocates Yarn queues of different sizes to different users according to their resource needs and priorities. Setting the maximum available resources of a user's queue is, however, challenging: the number of tasks, their complexity, and the dependency structure differ across users' workflows, and the resource demands of different users change over time. If a queue is configured too small, the user's tasks run longer; if it is configured too large, the resources available to other users are affected. In practice it often happens that during one period the resources used by a queue reach its configured maximum and many tasks in that queue are blocked waiting for resources, while other queues have idle resources; during another period the situation may be reversed. Yarn does not perceive dependencies between tasks and therefore cannot determine whether directly using these free resources would affect the related users. If, however, the workflow scheduler could schedule some tasks of a heavily loaded workflow onto queues with idle resources without affecting other users, resource utilization could be further improved and workflow running time reduced.
The core problem of workflow scheduling is to match tasks in a workflow to appropriate resources while respecting the dependencies within the workflow, so as to meet a given scheduling objective such as minimizing workflow running time or minimizing resource cost. Workflow scheduling in cloud computing environments is currently a research hot spot; besides simple classic algorithms such as priority-based scheduling and the Heterogeneous Earliest Finish Time (HEFT) algorithm, a series of workflow scheduling methods based on heuristics such as genetic algorithms and ant colony algorithms have been proposed. In recent years, because machine learning techniques such as deep learning and reinforcement learning have achieved good results in fields such as autonomous driving and robotics, some researchers have applied these techniques to workflow scheduling. However, different scheduling scenarios have different resource models, scheduling models, and optimization targets, and workflow scheduling algorithms designed for cloud computing environments cannot be applied directly and effectively to big data workflow scheduling on a Yarn cluster.
The workflow schedulers commonly used in the big data ecosystem today, such as Azkaban, Oozie, and Airflow, all adopt very simple scheduling strategies: they schedule only according to the dependencies set by the user and have no learning behavior, which often leads to wasted cluster resources and overly long workflow running times. Yarn is one of the most widely used resource management and task scheduling platforms in the big data field, so a workflow scheduler designed for Yarn's resource model is of great significance for improving cluster resource utilization and reducing the running time of big data analysis applications.
In the prior art, the task scheduler of a Yarn cluster cannot perceive the dependencies between tasks, and the workflow scheduler does not consider the dynamic changes of cluster queue resources, which leads to low cluster resource utilization and long workflow running times.
Disclosure of Invention
Aiming at the shortcomings of the prior art, the invention provides a Yarn cluster workflow scheduler based on deep reinforcement learning. It adopts a deep-reinforcement-learning workflow scheduling method that takes into account both task dependencies and the dynamic state of cluster resources, and dynamically adjusts scheduling decisions according to the cluster queue resource situation, thereby improving cluster resource utilization and reducing workflow running time and the delay of high-priority tasks. This addresses the problems that the Yarn task scheduler cannot perceive dependencies between tasks and that existing workflow schedulers ignore the dynamic changes of cluster queue resources, which result in low cluster resource utilization and long workflow running times.
The specific technical scheme for realizing the aim of the invention is as follows: a Yarn cluster workflow scheduling method based on deep reinforcement learning specifically comprises the following steps:
S1. Modeling the workflow and the Yarn cluster resource model
1) The workflow: a workflow can be modeled as a directed acyclic graph G = (J, D), where J represents the set of tasks in the workflow, J = {j_1, j_2, ..., j_m}, and D represents the set of task dependencies in the workflow, D = {d_st | s ≠ t, and s, t ∈ {1, ..., m}}. A dependency d_st indicates that task j_t depends on task j_s; j_s is called the parent task and j_t the subtask, and only after the parent task j_s has completed can task j_t start running. The user configures workflows through configuration files; the configuration content includes the start time of the workflow, the tasks it contains, the fixed attributes of the tasks (e.g., the number of CPU cores and memory required to run the task, and the priority), and the dependencies between tasks.
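By way of illustration only (not part of the original disclosure), a minimal Python sketch of the workflow and task model described above might look as follows; the field names and the string-based task states are assumptions chosen to mirror the attributes listed in this section:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Task:
    """A task j_i in the workflow, with fixed and runtime attributes."""
    name: str
    vcores: int              # number of CPU cores requested
    memory_mb: int           # memory requested
    priority: int            # e.g. 1 (low) .. 3 (high)
    parents: List[str] = field(default_factory=list)   # names of parent tasks
    state: str = "waiting"   # waiting / schedulable / scheduled / running / finished
    avail_time: Optional[int] = None   # time when all parents have finished
    start_time: Optional[int] = None   # time the task actually starts running

@dataclass
class Workflow:
    """A workflow G = (J, D), given as tasks plus parent lists (the edges D)."""
    name: str
    start_at: int                     # configured start-up time of the workflow
    tasks: Dict[str, Task]

    def entry_tasks(self) -> List[Task]:
        """Tasks without parents; they become schedulable when the workflow starts."""
        return [t for t in self.tasks.values() if not t.parents]
```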
2) Yarn cluster: the system administrator divides the Yarn cluster resources into n queues q_1, q_2, ..., q_n, sets for each queue the ratio of its maximum available resources to the total cluster resources as f_1, f_2, ..., f_n, and then allocates the n queues to users u_1, u_2, ..., u_n. Resources within a queue are used on a first-come-first-served (FIFO) basis, i.e., tasks submitted to the queue earlier obtain resources and run first. When the resources currently used by a queue are smaller than its configured maximum and the remaining quota is sufficient to run a newly submitted task, the task can start running promptly; otherwise it waits for resources until the idle resources of the queue satisfy its running conditions.
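Similarly, a small illustrative sketch of the FIFO queue model described above; the method names and the way pending tasks are held are assumptions, not details fixed by the patent:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class YarnQueue:
    """A Yarn queue q_i with a maximum resource quota and the resources in use."""
    name: str
    vcore_max: int
    vmem_max_mb: int
    vcore_used: int = 0
    vmem_used_mb: int = 0
    pending: List["Task"] = field(default_factory=list)   # FIFO order

    def can_run(self, task: "Task") -> bool:
        """True if the remaining quota is sufficient to start the task now."""
        return (self.vcore_used + task.vcores <= self.vcore_max
                and self.vmem_used_mb + task.memory_mb <= self.vmem_max_mb)

    def submit(self, task: "Task") -> None:
        """Tasks that cannot start immediately wait in the queue in FIFO order."""
        if self.can_run(task):
            self.vcore_used += task.vcores
            self.vmem_used_mb += task.memory_mb
            task.state = "running"
        else:
            self.pending.append(task)
            task.state = "scheduled"   # waiting for resources in the queue
```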
S2. Reinforcement learning modeling
The basic idea of reinforcement learning is that an agent observes the state of the environment, takes actions based on the observed state, interacts with the environment, and continually optimizes its policy according to the rewards given by the environment so as to maximize the long-term return. The three elements of the reinforcement learning model are the state space, the action space, and the reward mechanism.
1) State space
The environment state observed by the agent is S = {S_g, S_c}, where S_g represents the state of the workflows and S_c represents the state of the current cluster queue resources. S_g = {S_g1, S_g2, ..., S_gn}, where S_gi represents the state of the i-th workflow; S_gi consists of the dependencies and the task states of the i-th workflow, i.e., S_gi = {D_g, S_ji, i ∈ J_g}, where a task state S_ji comprises the fixed attributes and the runtime attributes of the task. The fixed attributes include the amount of resources required by the task, the priority of the task, etc.; the runtime attributes include the task state, schedulable time, scheduling time, start running time, etc. S_c = {S_q1, S_q2, ..., S_qn}, where S_qi = {vcore_max, vmem_max, vcore_used, vmem_used}; vcore_max represents the maximum number of CPU cores available to the queue, vmem_max the maximum memory size available to the queue, and correspondingly vcore_used and vmem_used represent the number of CPU cores and the memory currently used by the queue.
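As a simple illustration (assumed, not specified in the original), the cluster-queue part of the state S_c could be flattened into a numeric vector as follows:

```python
import numpy as np

def cluster_state_vector(queues) -> np.ndarray:
    """Concatenate (vcore_max, vmem_max, vcore_used, vmem_used) for every queue."""
    feats = []
    for q in queues:                      # q is a YarnQueue from the sketch above
        feats.extend([q.vcore_max, q.vmem_max_mb, q.vcore_used, q.vmem_used_mb])
    return np.asarray(feats, dtype=np.float32)
```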
2) Action space
According to the modeling of the workflow and the Yarn cluster resource model in S1, the Yarn cluster is divided into n queues {q_1, q_2, ..., q_n}. Let the action space be A = {a_0, a_1, a_2, ..., a_n}; when the agent performs action a_i, the task to be scheduled is scheduled to queue q_i. In particular, the invention provides a virtual queue q_0 to indicate that the current task is not scheduled, i.e., when the agent performs action a_0, the current task is not scheduled to any queue. The benefit of the virtual queue is that it gives the intelligent scheduler the ability to delay scheduling the current task, making it possible to make scheduling decisions that are better from a global perspective.
3) Environment reward mechanism
The environment reward comprises two parts: a single-step reward r_step during scheduling and a final reward r_final at the end of scheduling. The single-step reward is the immediate reward obtained each time a task is scheduled, while the final reward is the delayed reward obtained after a round of scheduling is completed (all tasks have finished running normally or the preset maximum time step has been reached). If task j is to be scheduled at time step t and the agent executes action a_t, the single-step reward obtained is given by formula (a) (rendered as an image in the original publication), in which an indicator term represents that the free resources of queue q_i are sufficient to run task j.
The delay δ_j of task j is defined by formula (b):
δ_j = start_time_j − avail_time_j (b),
where start_time_j is the time task j starts running and avail_time_j is its schedulable time (the time at which all of its parent tasks have finished running). Let N = |J_{p=i}| denote the number of tasks with priority p = i among all tasks; the average delay time of all tasks with priority p = i is then the mean of their delays, formula (c):
w_{p=i} = (1/N) · Σ_{j ∈ J_{p=i}} δ_j (c).
The final reward r_final is given by formula (d) (rendered as an image in the original publication), a weighted combination of the average delays per priority, where w_{p=3} denotes the average delay time of all tasks with priority 3 and carries weight coefficient 0.7, and likewise w_{p=2} and w_{p=1} denote the average delay times of tasks with priorities 2 and 1, with weight coefficients 0.2 and 0.1, respectively. The lower the average delay of the tasks, the greater the reward obtained, and the higher the priority of a task, the greater its weight.
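For illustration, a sketch of the delay and final-reward computation; the text does not reproduce the exact functional form of formula (d), so the negative weighted sum used here is an assumption consistent with "the lower the average delay, the greater the reward":

```python
from collections import defaultdict

# assumed priority -> weight mapping from the description (3: 0.7, 2: 0.2, 1: 0.1)
PRIORITY_WEIGHTS = {3: 0.7, 2: 0.2, 1: 0.1}

def task_delay(task) -> float:
    """Formula (b): delay = start running time - schedulable time."""
    return task.start_time - task.avail_time

def final_reward(tasks) -> float:
    """Assumed form of formula (d): negative priority-weighted average delay."""
    delays = defaultdict(list)
    for t in tasks:
        delays[t.priority].append(task_delay(t))
    reward = 0.0
    for p, w in PRIORITY_WEIGHTS.items():
        if delays[p]:                      # formula (c): mean delay per priority
            reward -= w * sum(delays[p]) / len(delays[p])
    return reward
```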
S3. Workflow scheduling based on the proximal policy optimization (PPO) algorithm
1) Neural network initialization. The proximal policy optimization (PPO) algorithm is based on the actor-critic architecture, so three neural networks must first be randomly initialized: the policy network π_θ, the old policy network π_θ_old, and the value network Q. The policy network outputs the probability of each action the agent can perform in a given state, and the value network Q is used to evaluate the state.
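A minimal PyTorch sketch of the three networks initialized in this step (illustrative only; the layer sizes and hidden dimensions are assumptions):

```python
import copy
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """pi_theta: maps the observed state to a probability over the n+1 actions."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(state), dim=-1)

class ValueNet(nn.Module):
    """Q: evaluates a state with a single scalar value v(s)."""
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)

state_dim, n_queues = 64, 4                  # assumed sizes for illustration
policy = PolicyNet(state_dim, n_queues + 1)  # +1 for the virtual queue q_0
policy_old = copy.deepcopy(policy)           # old policy network
value_net = ValueNet(state_dim)
```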
2) Traverse all workflows in the configuration that have not been started and check whether they need to be started at the current time t; if t is greater than or equal to the start time of workflow g, start workflow g, otherwise proceed to the next time t+1. After workflow g is started, the state of its entry tasks (tasks without parent tasks) is updated to schedulable and they are added to the schedulable task queue.
3) Check whether any task j has finished running at the current time t; if so, traverse all subtasks of task j and check the states of their parent tasks. If all parent tasks of a subtask have completed, update the state of that subtask to schedulable and add it to the schedulable task queue.
4) Randomly take a task out of the schedulable task queue and set its isActive value to 1, indicating that it is the task to be scheduled at the current time t.
5) Embed the state of the workflow into a fixed-size vector using the graph convolutional neural network GCN, and then combine it with the state of the Yarn cluster to form the state s_t observable by the current intelligent scheduler.
6) Input the state s_t into the policy network π_θ; the policy network outputs the probabilities of all actions, and an action a_t is sampled according to these probabilities. Schedule the current task to queue q_i (i = a_t) and restore the isActive value of the current task to 0; then compute the current reward r_t according to the environment reward mechanism in the reinforcement learning modeling of step S2: if scheduling has not finished at time t, r_t = r_step, otherwise r_t = r_final. Store {s_t, a_t, r_t} in the memory pool.
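An illustrative sketch of steps 5) and 6); the two-layer graph convolution, the mean pooling, and the helper names are assumptions rather than details fixed by the patent:

```python
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    """Two-layer graph convolution H' = ReLU(A_hat @ H @ W); mean-pools node
    features into one fixed-size workflow embedding (an assumed design)."""
    def __init__(self, in_dim: int, emb_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, emb_dim)
        self.w2 = nn.Linear(emb_dim, emb_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: normalized adjacency (with self-loops) of the workflow DAG
        h = torch.relu(adj @ self.w1(x))
        h = torch.relu(adj @ self.w2(h))
        return h.mean(dim=0)              # fixed-size workflow state vector

def select_action(policy, workflow_emb, cluster_vec):
    """Step 6: build s_t, sample a_t from pi_theta, return (a_t, log prob, s_t)."""
    s_t = torch.cat([workflow_emb, torch.as_tensor(cluster_vec)], dim=-1)
    probs = policy(s_t)                   # PolicyNet from the earlier sketch
    dist = torch.distributions.Categorical(probs)
    a_t = dist.sample()
    return a_t.item(), dist.log_prob(a_t), s_t
```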
7) Proceed to the next time t+1 and repeat steps 2) to 6) until the preset cut-off time t_max is reached, the maximum number of steps per round T is reached, or all tasks of all workflows have finished running.
8) Take the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool, input each state s_t into the value network Q to estimate v(s_t), and then compute the action advantage A_t using formulas (e) and (f):
μ_t = r_t + γ·v(s_{t+1}) − v(s_t) (e);
A_t = μ_t + (γλ)·μ_{t+1} + ... + (γλ)^{T−t+1}·μ_{T−1} (f).
Here γ and λ are hyperparameters; in the invention γ is 0.99 and λ is 0.95; r_t denotes the reward obtained at time t, and v(s_t) and v(s_{t+1}) denote the state values at times t and t+1, respectively.
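A sketch of the advantage computation of formulas (e)-(f), written in the conventional recursive form of generalized advantage estimation, which is assumed to be equivalent to the truncated sum in formula (f):

```python
import torch

def compute_advantages(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """rewards: r_0..r_T, values: v(s_0)..v(s_{T+1}); returns A_0..A_T."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # formula (e)
        gae = delta + gamma * lam * gae                          # formula (f), recursively
        advantages[t] = gae
    return advantages
```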
9) Take the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool and input each state s_t into the policy network π_θ and the old policy network π_θ_old to obtain the action probability distributions π_θ(a_t|s_t) and π_θ_old(a_t|s_t) in state s_t. Compute the importance weight r_t(θ) by formula (g), then compute L_clip(θ) by formula (h) as the loss function, back-propagate, and update the policy network π_θ:
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) (g);
L_clip(θ) = E[ min( r_t(θ)·A_t , clip(r_t(θ), 1−ε, 1+ε)·A_t ) ] (h).
Here ε is a hyperparameter, taken as ε = 0.2 in the invention; clip(r_t(θ), 1−ε, 1+ε) means that when r_t(θ) is greater than 1+ε its value is taken as 1+ε, and when r_t(θ) is smaller than 1−ε its value is taken as 1−ε; the min function takes the smaller of the two terms, and E denotes the expectation.
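A sketch of the policy update in step 9); expressing the maximization of the clipped objective as minimizing its negative is an implementation assumption:

```python
import torch

def ppo_policy_update(policy, policy_old, optimizer, states, actions, advantages,
                      eps: float = 0.2):
    """One PPO update of pi_theta using the clipped surrogate of formulas (g)-(h)."""
    probs = policy(states)                                   # [T, n_actions]
    with torch.no_grad():
        probs_old = policy_old(states)
    pi_a = probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pi_a_old = probs_old.gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = pi_a / pi_a_old                                  # formula (g)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    objective = torch.min(ratio * advantages, clipped * advantages)  # formula (h)
    loss = -objective.mean()       # minimize the negative of the clipped objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```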
10 Repeating 8-9) several times, then overwriting the old policy network parameters with the policy network parameters, i.e. making
Figure BDA0004090503340000063
11) Compute the discounted return G_t according to formula (i), take the difference between the discounted return G_t and the prediction v(s_t) of the value network Q, use the mean squared error (MSE) as the loss function, back-propagate, and update the value network parameters:
G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... + γ^{T−t}·r_{T+1} + γ^{T−t+1}·v(s_{t+1}) (i).
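A sketch of the value-network update in step 11); the discounted return follows the conventional bootstrapped form, which is assumed to correspond to formula (i), and the optimizer usage is illustrative:

```python
import torch
import torch.nn.functional as F

def discounted_returns(rewards, values, gamma: float = 0.99):
    """Formula (i): bootstrap the tail of the trajectory with the value estimate."""
    T = len(rewards)
    returns = torch.zeros(T)
    running = values[-1]                 # v of the state after the last step
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_update(value_net, optimizer, states, returns):
    """Regress v(s_t) toward G_t with an MSE loss (step 11)."""
    v = value_net(states)
    loss = F.mse_loss(v, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```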
12 And (3) emptying the memory pool, and repeating the steps (2) - (11) for a plurality of rounds until the environmental rewards obtained by the intelligent scheduler are converged to the optimal value or the suboptimal value.
Compared with the prior art, the method achieves higher cluster resource utilization and reduces both the running time of each user's workflow and the average delay time of tasks. Scheduling decisions can be adjusted dynamically according to changes in cluster resources; for example, when resources are scarce, high-priority tasks are considered first, guaranteeing their running time. The model is simple and easy to train, can significantly improve user experience, and provides technical support for the field of big data workflow scheduling.
Drawings
FIG. 1 is a schematic diagram of a deep reinforcement learning model according to the present invention;
FIG. 2 is a workflow runtime diagram of the present invention in comparison with other scheduling algorithms.
Detailed Description
The invention is described and illustrated in further detail below with reference to a specific embodiment:
example 1
Referring to FIG. 1, workflow scheduling on the Yarn cluster is performed as follows:
S1. Modeling the workflow and the Yarn cluster resource model
1) The workflow: a workflow can be modeled as a directed acyclic graph G = (J, D), where J represents the set of tasks in the workflow, J = {j_1, j_2, ..., j_m}, and D represents the set of task dependencies in the workflow, D = {d_st | s ≠ t, and s, t ∈ {1, ..., m}}. A dependency d_st indicates that task j_t depends on task j_s; j_s is the parent task and j_t the subtask, and only after the parent task j_s has completed can task j_t start running. The user configures workflows through configuration files; the configuration content includes the start time of the workflow, the tasks it contains, the fixed attributes of the tasks (e.g., the number of CPU cores and memory required to run the task, and the priority), and the dependencies between tasks.
2) Yarn cluster: the system administrator divides the Yarn cluster resources into n queues q_1, q_2, ..., q_n, sets for each queue the ratio of its maximum available resources to the total cluster resources as f_1, f_2, ..., f_n, and then allocates the n queues to users u_1, u_2, ..., u_n. Resources within a queue are used on a first-come-first-served (FIFO) basis, i.e., tasks submitted to the queue earlier obtain resources and run first. When the resources currently used by a queue are smaller than its configured maximum and the remaining quota is sufficient to run a newly submitted task, the task can start running promptly; otherwise it waits for resources until the idle resources of the queue satisfy its running conditions.
S2. Reinforcement learning modeling
The basic idea of reinforcement learning is that an agent observes the state of the environment, takes actions based on the observed state, interacts with the environment, and continually optimizes its policy according to the rewards given by the environment so as to maximize the long-term return. The three elements of the reinforcement learning model are the state space, the action space, and the reward mechanism.
1) State space
The environment state observed by the agent is S = {S_g, S_c}, where S_g represents the state of the workflows and S_c represents the state of the current cluster queue resources. S_g = {S_g1, S_g2, ..., S_gn}, where S_gi represents the state of the i-th workflow; S_gi consists of the dependencies and the task states of the i-th workflow, i.e., S_gi = {D_g, S_ji, i ∈ J_g}. A task state comprises the fixed attributes and the runtime attributes of the task: the fixed attributes include the amount of resources required by the task, the priority of the task, etc., and the runtime attributes include the task state, schedulable time, scheduling time, start running time, etc. S_c = {S_q1, S_q2, ..., S_qn}, S_qi = {vcore_max, vmem_max, vcore_used, vmem_used}; vcore_max represents the maximum number of CPU cores available to the queue and vmem_max the maximum memory size available to the queue; correspondingly, vcore_used and vmem_used represent the number of CPU cores and the memory currently used by the queue.
2) Action space
According to the modeling of the workflow and the Yarn cluster resource model in S1, the Yarn cluster is divided into n queues {q_1, q_2, ..., q_n}. Let the action space be A = {a_0, a_1, a_2, ..., a_n}; when the agent performs action a_i, the task to be scheduled is scheduled to queue q_i. In particular, the invention provides a virtual queue q_0 to indicate that the current task is not scheduled, i.e., when the agent performs action a_0, the current task is not scheduled to any queue. The benefit of the virtual queue is that it gives the intelligent scheduler the ability to delay scheduling the current task, making it possible to make scheduling decisions that are better from a global perspective.
3) Environment reward mechanism
The environment reward comprises two parts: a single-step reward r_step during scheduling and a final reward r_final at the end of scheduling. The single-step reward is the immediate reward obtained each time a task is scheduled, while the final reward is the delayed reward obtained after a round of scheduling is completed (all tasks have finished running normally or the preset maximum time step has been reached). If task j is to be scheduled at time step t and the agent executes action a_t, the single-step reward obtained is calculated by formula (a) (rendered as an image in the original publication), in which an indicator term represents that the free resources of queue q_i are sufficient to run task j.
The delay δ_j of task j is defined by formula (b):
δ_j = start_time_j − avail_time_j (b),
where start_time_j is the time task j starts running and avail_time_j is its schedulable time (the time at which all of its parent tasks have finished running). Let N = |J_{p=i}| denote the number of tasks with priority p = i among all tasks; the average delay time of all tasks with priority p = i is then the mean of their delays, formula (c):
w_{p=i} = (1/N) · Σ_{j ∈ J_{p=i}} δ_j (c).
The final reward r_final is calculated by formula (d) (rendered as an image in the original publication), a weighted combination of the average delays per priority, where w_{p=3} denotes the average delay time of all tasks with priority 3 and carries weight coefficient 0.7, and likewise w_{p=2} and w_{p=1} denote the average delay times of tasks with priorities 2 and 1, with weight coefficients 0.2 and 0.1, respectively. The lower the average delay of the tasks, the greater the reward obtained, and the higher the priority of a task, the greater its weight.
S3. Workflow scheduling based on the proximal policy optimization (PPO) algorithm
1) Neural network initialization. The proximal policy optimization (PPO) algorithm is based on the actor-critic architecture, so three neural networks must first be randomly initialized: the policy network π_θ, the old policy network π_θ_old, and the value network Q. The policy network outputs the probability of each action the agent can perform in a given state, and the value network is used to evaluate the state.
2) Traverse all workflows in the configuration that have not been started and check whether they need to be started at the current time t; if t is greater than or equal to the start time of workflow g, start workflow g, otherwise proceed to the next time t+1. After workflow g is started, the state of its entry tasks (tasks without parent tasks) is updated to schedulable and they are added to the schedulable task queue.
3) Check whether any task j has finished running at the current time t; if so, traverse all subtasks of task j and check the states of their parent tasks. If all parent tasks of a subtask have completed, update the state of that subtask to schedulable and add it to the schedulable task queue.
4) Randomly take a task out of the schedulable task queue and set its isActive value to 1, indicating that it is the task to be scheduled at the current time t.
5) Embed the state of the workflow into a fixed-size vector using the graph convolutional neural network GCN, and then combine it with the Yarn cluster state to form the state s_t observable by the current intelligent scheduler.
6) Input the state s_t into the policy network π_θ; the policy network outputs the probabilities of all actions, and an action a_t is sampled according to these probabilities. Schedule the current task to queue q_i (i = a_t) and restore the isActive value of the current task to 0; then compute the current reward r_t according to the environment reward mechanism in the reinforcement learning modeling of step S2: if scheduling has not finished at time t, r_t = r_step, otherwise r_t = r_final. Store {s_t, a_t, r_t} in the memory pool.
7) Proceed to the next time t+1 and repeat steps 2) to 6) until the preset cut-off time t_max is reached, the maximum number of steps per round T is reached, or all tasks of all workflows have finished running.
8) Take the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool, input each state s_t into the value network Q to estimate v(s_t), and then compute the action advantage A_t using formulas (e) and (f):
μ_t = r_t + γ·v(s_{t+1}) − v(s_t) (e);
A_t = μ_t + (γλ)·μ_{t+1} + ... + (γλ)^{T−t+1}·μ_{T−1} (f).
Here γ and λ are hyperparameters; in the invention γ is 0.99 and λ is 0.95; r_t denotes the reward obtained at time t, and v(s_t) and v(s_{t+1}) denote the state values at times t and t+1, respectively.
9) Take the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool and input each state s_t into the policy network π_θ and the old policy network π_θ_old to obtain the action probability distributions π_θ(a_t|s_t) and π_θ_old(a_t|s_t) in state s_t. Compute the importance weight r_t(θ) by formula (g), then compute L_clip(θ) by formula (h) as the loss function, back-propagate, and update the policy network π_θ:
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) (g);
L_clip(θ) = E[ min( r_t(θ)·A_t , clip(r_t(θ), 1−ε, 1+ε)·A_t ) ] (h).
Here ε is a hyperparameter, taken as ε = 0.2 in the invention; clip(r_t(θ), 1−ε, 1+ε) means that when r_t(θ) is greater than 1+ε its value is taken as 1+ε, and when r_t(θ) is smaller than 1−ε its value is taken as 1−ε; the min function takes the smaller of the two terms, and E denotes the expectation.
10 8-9) are repeated several times, then the old policy network parameters are covered with the policy network parameters, i.e. the order
Figure BDA0004090503340000095
11 Calculating the discount rewards G according to the following (i) t Then report G using discount t Predicted value v(s) with value network Q t ) Making a difference, using the mean square error MSE as a loss function, back-propagating, and updating the value network parameters:
G t =r t+1 +γr t+22 r t+3 +...+γ T-t r T+1T-t+1 ν(s t+1 ) (i)。
12 The memory pool is emptied, and the steps 2 to 11) are repeated for a plurality of times until the environmental rewards obtained by the intelligent scheduler are converged to the optimal value or the suboptimal value.
Referring to FIG. 2, "ppo" corresponds to the Yarn cluster workflow scheduling method based on deep reinforcement learning in the present invention. Simulation results show that, compared with a static scheduling algorithm (static), a random algorithm (random), a priority-based scheduling algorithm (pbs), and the earliest finish time algorithm (heft), the scheduling algorithm of the invention achieves lower workflow running times.
The foregoing description of the embodiments of the invention is not intended to limit the scope of the invention, but is intended to cover any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention.

Claims (5)

1. The Yarn cluster workflow scheduling method based on deep reinforcement learning is characterized by comprising the following steps of:
1) Collecting workflow information and cluster resource information according to the workflow configuration of the user and the Yarn cluster queue configuration, and constructing a workflow scheduling model according to the collected information;
2) Performing reinforcement learning modeling according to the workflow scheduling model and the Yarn cluster resource model, wherein the reinforcement learning modeling comprises a state space, an action space and a reward mechanism;
3) Training the scheduler by using the proximal policy optimization algorithm among deep reinforcement learning algorithms.
2. The Yarn cluster workflow scheduling method based on deep reinforcement learning according to claim 1, wherein the constructing of the workflow scheduling model in step 1) comprises:
a) Modeling a workflow as a directed acyclic graph G = (J, D), where J represents the set of tasks in the workflow, J = {j_1, j_2, ..., j_m}, and D represents the set of task dependencies in the workflow, D = {d_st | s ≠ t, and s, t ∈ {1, ..., m}}; d_st indicates that subtask j_t depends on parent task j_s, and only after the parent task j_s has completed can subtask j_t start running;
b) The user performs workflow configuration in the form of a configuration file, the configuration content comprising: the start time of the workflow, the tasks contained, the fixed attributes of the tasks, and the dependencies between tasks; the fixed attributes of a task include the number of CPU cores, the memory size, and the priority required to run the task;
c) The workflow scheduler schedules tasks in the user-configured workflow to a Yarn cluster, the Yarn cluster being divided into n queues {q_1, q_2, ..., q_n}, the n queues being allocated to users u_1, u_2, ..., u_n, and the ratios of the maximum available resources of the queues to the total cluster resources being f_1, f_2, ..., f_n.
3. The Yarn cluster workflow scheduling method based on deep reinforcement learning according to claim 1, wherein the reinforcement learning modeling of step 2) comprises:
a) State space
The environment state observed by the agent is S = {S_g, S_c}, where S_g represents the state of the workflows, S_g = {S_g1, S_g2, ..., S_gn}; S_c represents the state of the current cluster queue resources, S_c = {S_q1, S_q2, ..., S_qn}; S_qi = {vcore_max, vmem_max, vcore_used, vmem_used}, where vcore_max represents the maximum number of CPU cores available to the queue, vmem_max represents the maximum memory size available to the queue, and vcore_used and vmem_used respectively represent the number of CPU cores and the memory currently used by the queue; S_gi represents the state of the i-th workflow, S_gi = {D_g, S_ji, i ∈ J_g}; S_ji is a task state comprising the fixed attributes and the runtime attributes of the task; the fixed attributes of a task include the amount of resources required by the task and the priority of the task; the runtime attributes of a task include the task state, schedulable time, scheduling time, and start running time;
b) Action space
The Yarn cluster is divided into n queues {q_1, q_2, ..., q_n}; let the action space be A = {a_0, a_1, a_2, ..., a_n}; when the agent performs action a_i, the task to be scheduled is scheduled to queue q_i;
c) Environment reward mechanism
The environment reward includes a single-step reward r_step during scheduling and a final reward r_final at the end of scheduling; the single-step reward is the immediate reward obtained each time a task is scheduled; if task j is to be scheduled at time step t and the agent executes action a_t, the single-step reward obtained is calculated by formula (a) (rendered as an image in the original publication), in which an indicator term represents that the free resources of queue q_i are sufficient to run task j; the delay time δ_j of task j is calculated by formula (b):
δ_j = start_time_j − avail_time_j (b);
where start_time_j is the start running time of task j and avail_time_j is the schedulable time of the task;
the final reward is the delayed reward obtained after all tasks have run normally or the preset maximum time step has been reached; letting N = |J_{p=i}| denote the number of tasks with priority p = i among all tasks, the average delay time w_{p=i} of all tasks with priority p = i is calculated by formula (c):
w_{p=i} = (1/N) · Σ_{j ∈ J_{p=i}} δ_j (c);
the final reward r_final is calculated by formula (d) (rendered as an image in the original publication), where w_{p=3} represents the average delay time of all tasks with priority 3 with weight coefficient 0.7, and likewise w_{p=2} and w_{p=1} represent the average delay times of tasks with priorities 2 and 1, with weight coefficients 0.2 and 0.1, respectively.
4. The Yarn cluster workflow scheduling method based on deep reinforcement learning according to claim 1, wherein training the scheduler using the proximal policy optimization algorithm in step 3) specifically comprises:
a) Randomly initializing three neural networks: the current policy network π_θ, the old policy network π_θ_old, and the value network Q, wherein the current policy network π_θ and the old policy network π_θ_old output the probability of each action performed by the agent in a given state, and the value network Q is used to evaluate the state;
b) Traversing all workflows g in the configuration that have not been started and checking whether they need to be started at the current time t; if t is greater than or equal to the start time of workflow g, starting workflow g, updating the state of its entry tasks to schedulable, and adding them to the schedulable task queue; otherwise advancing to the next time t+1;
c) Checking whether task j has finished at the current time t; if so, traversing all subtasks j_t of task j and checking the states of their parent tasks j_s; if all parent tasks j_s of a subtask j_t have completed, updating the state of the subtask j_t to schedulable and adding it to the schedulable task queue;
d) Randomly taking a task j out of the schedulable task queue and updating its isActive value to 1, indicating that it is the task j to be scheduled at the current time t;
e) Embedding the state of workflow g into a fixed-size vector with the graph convolutional neural network GCN, and combining it with the state of the Yarn cluster to form the state s_t observable by the current intelligent scheduler;
f) Inputting the state s_t into the policy network π_θ and sampling an action a_t according to the action probabilities output by the policy network π_θ; scheduling the current task j to queue q_i (i = a_t) and restoring the isActive value of the current task to 0; then computing the current reward r_t according to the environment reward mechanism: if scheduling has not ended at time t, r_t = r_step, otherwise r_t = r_final; finally storing {s_t, a_t, r_t} in the memory pool;
g) Entering the next time t+1 and repeating steps b) to f) until the preset cut-off time t_max is reached, the maximum number of steps per round T is reached, or all tasks j of all workflows g have finished running;
h) Taking the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool, inputting each state s_t into the value network Q to obtain the state value v(s_t), and then computing the action advantage A_t by formula (e):
A_t = μ_t + (γλ)·μ_{t+1} + ... + (γλ)^{T−t+1}·μ_{T−1} (e);
where γ and λ are hyperparameters and μ_t is given by formula (f):
μ_t = r_t + γ·v(s_{t+1}) − v(s_t) (f);
where r_t denotes the reward obtained at time t, and v(s_t) and v(s_{t+1}) denote the state values at times t and t+1, respectively;
i) Taking the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool, inputting each state s_t into the policy network π_θ and the old policy network π_θ_old to obtain the action probability distributions π_θ(a_t|s_t) and π_θ_old(a_t|s_t) in state s_t, computing the importance weight r_t(θ), using L_clip(θ) as the loss function, back-propagating, and updating the policy network π_θ; the importance weight r_t(θ) is calculated by formula (g):
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) (g);
and L_clip(θ) is calculated by formula (h):
L_clip(θ) = E[ min( r_t(θ)·A_t , clip(r_t(θ), 1−ε, 1+ε)·A_t ) ] (h);
where ε is a hyperparameter; clip(r_t(θ), 1−ε, 1+ε) means that when r_t(θ) is greater than 1+ε its value is 1+ε, and when r_t(θ) is smaller than 1−ε its value is 1−ε; the min function takes the smaller of the two values; and E denotes the expectation;
j) Repeating steps h) to i) several times, then overwriting the parameters of the old policy network π_θ_old with the parameters of the policy network π_θ, i.e., letting θ_old = θ;
k) Taking the difference between the discounted return G_t and the prediction v(s_t) of the value network Q, using the MSE as the loss function, back-propagating, and updating the parameters of the value network Q, the discounted return G_t being calculated by formula (i):
G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... + γ^{T−t}·r_{T+1} + γ^{T−t+1}·v(s_{t+1}) (i);
l) Emptying the memory pool and repeating steps b) to k) several times until the environment reward obtained by the intelligent scheduler converges to the optimal or a suboptimal value.
5. The Yarn cluster workflow scheduling method based on deep reinforcement learning according to claim 3, wherein the Yarn cluster of step b), divided into n queues {q_1, q_2, ..., q_n}, is provided with a virtual queue q_0 for indicating that the current task j is not scheduled, i.e., {q_0, q_1, q_2, ..., q_n}; when the agent performs action a_0, the current task j is not scheduled to any queue.
CN202310149989.1A 2023-02-22 2023-02-22 Deep reinforcement learning-based Yarn cluster workflow scheduling method Pending CN116069473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310149989.1A CN116069473A (en) 2023-02-22 2023-02-22 Deep reinforcement learning-based Yarn cluster workflow scheduling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310149989.1A CN116069473A (en) 2023-02-22 2023-02-22 Deep reinforcement learning-based Yarn cluster workflow scheduling method

Publications (1)

Publication Number Publication Date
CN116069473A true CN116069473A (en) 2023-05-05

Family

ID=86169799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310149989.1A Pending CN116069473A (en) 2023-02-22 2023-02-22 Deep reinforcement learning-based Yarn cluster workflow scheduling method

Country Status (1)

Country Link
CN (1) CN116069473A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117555306A (en) * 2024-01-11 2024-02-13 天津斯巴克斯机电有限公司 Digital twinning-based multi-production-line task self-adaptive scheduling method and system
CN117555306B (en) * 2024-01-11 2024-04-05 天津斯巴克斯机电有限公司 Digital twinning-based multi-production-line task self-adaptive scheduling method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination