CN116069473A - Deep reinforcement learning-based Yarn cluster workflow scheduling method - Google Patents

Deep reinforcement learning-based Yarn cluster workflow scheduling method

Info

Publication number
CN116069473A
CN116069473A (application CN202310149989.1A)
Authority
CN
China
Prior art keywords
task
workflow
time
state
tasks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310149989.1A
Other languages
Chinese (zh)
Inventor
王廷
薛建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202310149989.1A priority Critical patent/CN116069473A/en
Publication of CN116069473A publication Critical patent/CN116069473A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a Yarn cluster workflow scheduling method based on deep reinforcement learning. The workflow is modeled as a directed acyclic graph, which is encoded into a workflow state vector by a graph convolutional neural network; this vector, together with the Yarn cluster queue resource state vector, is fed into a policy neural network, and the policy network is trained with the proximal policy optimization (PPO) algorithm. Compared with the prior art, the method dynamically adjusts scheduling decisions according to the current environment state, improves cluster resource utilization, and reduces both the overall workflow completion time and the delay of higher-priority tasks. The model is simple, easy to train, can significantly improve user experience, and provides technical support for related fields such as Hadoop big data analysis.

Description

Deep reinforcement learning-based Yarn cluster workflow scheduling method
Technical Field
The invention relates to the technical field of big data and workflow scheduling, and in particular to a deep-reinforcement-learning-based workflow scheduling method for Yarn clusters.
Background
Since Google published MapReduce, big data technology has developed for more than twenty years and is now widely used in fields such as business intelligence, healthcare, transportation, the Internet of Things, and artificial intelligence. Hadoop, the open-source distributed computing platform based on the MapReduce programming model, has become the de facto standard of the big data industry. Hadoop 2.0 introduced Yarn, an independent cluster resource management and task scheduling framework that supports multiple computing engines such as MapReduce, Spark, and Tez.
Yarn (Yet Another Resource Negotiator) is a framework supporting multi-user cluster resource management and task scheduling. To prevent interference between users and to keep a single user from occupying too many cluster resources, Yarn typically organizes cluster resources into multiple queues, and a system administrator can set the maximum resources available to each queue. When a user submits a task, the requested resources (number of CPU cores and memory size) and the queue name must be specified.
From Yarn's perspective, the tasks submitted to a queue are independent, runnable tasks without business semantics. From the user's perspective, however, a big data analysis application often consists of several tasks; each task handles a different business function, has a different priority, and many tasks depend on one another. Only after all the tasks that the current task depends on have run to completion can the current task be submitted to the cluster; otherwise an incorrect computation result may be produced. Clearly, when the number of tasks is large and the dependencies are complex, maintaining the dependencies between tasks manually is cumbersome and error-prone. The big data ecosystem therefore introduces a workflow scheduler (also called a workflow management system) to maintain the dependencies between tasks in a workflow and schedule the workflow accordingly. Unlike task scheduling on Yarn, the workflow scheduler is only responsible for submitting tasks to the Yarn cluster according to the task dependencies set by the user; it is not responsible for concrete physical resource management and allocation.
For big data analysis applications, the running time of the application is critical to the user. Besides the properties of the tasks themselves (for example, computational complexity and the amount of data processed), the running time of a workflow depends on the cluster resources available to it. A system administrator typically allocates Yarn queues of different sizes to different users according to their resource needs and priorities. Setting the maximum available resources of a user's queue is, however, challenging: the number of tasks, their complexity, and the dependency structure differ across users' workflows, and the resource demands of different users change over time. If a queue is configured too small, the user's tasks run longer; if it is configured too large, the resources available to other users are affected. In practice it often happens that during one period the resources used by a queue reach its configured maximum and many tasks in that queue are blocked waiting for resources, while other queues have idle resources; during another period the situation may be reversed. Yarn does not perceive dependencies between tasks and therefore cannot determine whether directly using these free resources would affect the related users. If, however, the workflow scheduler could schedule some tasks of a heavily loaded workflow onto queues with idle resources without affecting other users, resource utilization could be further improved and workflow running time reduced.
The core problem of workflow scheduling is to match tasks in a workflow to appropriate resources while respecting the dependencies within the workflow, so as to meet a given scheduling objective such as minimizing workflow running time or minimizing resource cost. Workflow scheduling in cloud computing environments is currently a research hot spot; besides simple classic algorithms such as priority-based scheduling and the Heterogeneous Earliest Finish Time (HEFT) algorithm, a series of workflow scheduling methods based on heuristics such as genetic algorithms and ant colony algorithms have been proposed. In recent years, because machine learning techniques such as deep learning and reinforcement learning have achieved good results in fields such as autonomous driving and robotics, some researchers have applied these techniques to workflow scheduling. However, different scheduling scenarios have different resource models, scheduling models, and optimization targets, and workflow scheduling algorithms designed for cloud computing environments cannot be applied directly and effectively to big data workflow scheduling on a Yarn cluster.
The workflow schedulers commonly used in the big data ecosystem today, such as Azkaban, Oozie, and Airflow, all adopt very simple scheduling strategies: they schedule only according to the dependencies set by the user and have no learning behavior, which often leads to wasted cluster resources and overly long workflow running times. Yarn is one of the most widely used resource management and task scheduling platforms in the big data field, so a workflow scheduler designed for Yarn's resource model is of great significance for improving cluster resource utilization and reducing the running time of big data analysis applications.
In the prior art, the task scheduler of a Yarn cluster cannot perceive the dependencies between tasks, and the workflow scheduler does not consider the dynamic changes of cluster queue resources, which leads to low cluster resource utilization and long workflow running times.
Disclosure of Invention
Aiming at the shortcomings of the prior art, the invention provides a Yarn cluster workflow scheduler based on deep reinforcement learning. It adopts a deep-reinforcement-learning workflow scheduling method that takes into account both task dependencies and the dynamic state of cluster resources, and dynamically adjusts scheduling decisions according to the cluster queue resource situation, thereby improving cluster resource utilization and reducing workflow running time and the delay of high-priority tasks. This addresses the problems that the Yarn task scheduler cannot perceive dependencies between tasks and that existing workflow schedulers ignore the dynamic changes of cluster queue resources, which result in low cluster resource utilization and long workflow running times.
The specific technical scheme for realizing the aim of the invention is as follows: a Yarn cluster workflow scheduling method based on deep reinforcement learning specifically comprises the following steps:
S1. Modeling the workflow and the Yarn cluster resource model
1) The workflow: a workflow can be modeled as a directed acyclic graph G = (J, D), where J represents the set of tasks in the workflow, J = {j_1, j_2, ..., j_m}, and D represents the set of task dependencies in the workflow, D = {d_st | s ≠ t, and s, t ∈ {1, ..., m}}. A dependency d_st indicates that task j_t depends on task j_s; j_s is called the parent task and j_t the subtask, and only after the parent task j_s has completed can task j_t start running. The user configures workflows through configuration files; the configuration content includes the start time of the workflow, the tasks it contains, the fixed attributes of the tasks (e.g., the number of CPU cores and memory required to run the task, and the priority), and the dependencies between tasks.
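By way of illustration only (not part of the original disclosure), a minimal Python sketch of the workflow and task model described above might look as follows; the field names and the string-based task states are assumptions chosen to mirror the attributes listed in this section:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Task:
    """A task j_i in the workflow, with fixed and runtime attributes."""
    name: str
    vcores: int              # number of CPU cores requested
    memory_mb: int           # memory requested
    priority: int            # e.g. 1 (low) .. 3 (high)
    parents: List[str] = field(default_factory=list)   # names of parent tasks
    state: str = "waiting"   # waiting / schedulable / scheduled / running / finished
    avail_time: Optional[int] = None   # time when all parents have finished
    start_time: Optional[int] = None   # time the task actually starts running

@dataclass
class Workflow:
    """A workflow G = (J, D), given as tasks plus parent lists (the edges D)."""
    name: str
    start_at: int                     # configured start-up time of the workflow
    tasks: Dict[str, Task]

    def entry_tasks(self) -> List[Task]:
        """Tasks without parents; they become schedulable when the workflow starts."""
        return [t for t in self.tasks.values() if not t.parents]
```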
2) Yarn cluster: the system administrator divides the Yarn cluster resources into n queues q_1, q_2, ..., q_n, sets for each queue the ratio of its maximum available resources to the total cluster resources as f_1, f_2, ..., f_n, and then allocates the n queues to users u_1, u_2, ..., u_n. Resources within a queue are used on a first-come-first-served (FIFO) basis, i.e., tasks submitted to the queue earlier obtain resources and run first. When the resources currently used by a queue are smaller than its configured maximum and the remaining quota is sufficient to run a newly submitted task, the task can start running promptly; otherwise it waits for resources until the idle resources of the queue satisfy its running conditions.
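Similarly, a small illustrative sketch of the FIFO queue model described above; the method names and the way pending tasks are held are assumptions, not details fixed by the patent:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class YarnQueue:
    """A Yarn queue q_i with a maximum resource quota and the resources in use."""
    name: str
    vcore_max: int
    vmem_max_mb: int
    vcore_used: int = 0
    vmem_used_mb: int = 0
    pending: List["Task"] = field(default_factory=list)   # FIFO order

    def can_run(self, task: "Task") -> bool:
        """True if the remaining quota is sufficient to start the task now."""
        return (self.vcore_used + task.vcores <= self.vcore_max
                and self.vmem_used_mb + task.memory_mb <= self.vmem_max_mb)

    def submit(self, task: "Task") -> None:
        """Tasks that cannot start immediately wait in the queue in FIFO order."""
        if self.can_run(task):
            self.vcore_used += task.vcores
            self.vmem_used_mb += task.memory_mb
            task.state = "running"
        else:
            self.pending.append(task)
            task.state = "scheduled"   # waiting for resources in the queue
```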
S2. Reinforcement learning modeling
The basic idea of reinforcement learning is that an agent observes the state of the environment, takes actions based on the observed state, interacts with the environment, and continually optimizes its policy according to the rewards given by the environment so as to maximize the long-term return. The three elements of the reinforcement learning model are the state space, the action space, and the reward mechanism.
1) State space
The environment state observed by the agent is S = {S_g, S_c}, where S_g represents the state of the workflows and S_c represents the state of the current cluster queue resources. S_g = {S_g1, S_g2, ..., S_gn}, where S_gi represents the state of the i-th workflow; S_gi consists of the dependencies and the task states of the i-th workflow, i.e., S_gi = {D_g, S_ji, i ∈ J_g}, where a task state S_ji comprises the fixed attributes and the runtime attributes of the task. The fixed attributes include the amount of resources required by the task, the priority of the task, etc.; the runtime attributes include the task state, schedulable time, scheduling time, start running time, etc. S_c = {S_q1, S_q2, ..., S_qn}, where S_qi = {vcore_max, vmem_max, vcore_used, vmem_used}; vcore_max represents the maximum number of CPU cores available to the queue, vmem_max the maximum memory size available to the queue, and correspondingly vcore_used and vmem_used represent the number of CPU cores and the memory currently used by the queue.
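As a simple illustration (assumed, not specified in the original), the cluster-queue part of the state S_c could be flattened into a numeric vector as follows:

```python
import numpy as np

def cluster_state_vector(queues) -> np.ndarray:
    """Concatenate (vcore_max, vmem_max, vcore_used, vmem_used) for every queue."""
    feats = []
    for q in queues:                      # q is a YarnQueue from the sketch above
        feats.extend([q.vcore_max, q.vmem_max_mb, q.vcore_used, q.vmem_used_mb])
    return np.asarray(feats, dtype=np.float32)
```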
2) Action space
According to the modeling of the workflow and the Yarn cluster resource model in S1, the Yarn cluster is divided into n queues {q_1, q_2, ..., q_n}. Let the action space be A = {a_0, a_1, a_2, ..., a_n}; when the agent performs action a_i, the task to be scheduled is scheduled to queue q_i. In particular, the invention provides a virtual queue q_0 to indicate that the current task is not scheduled, i.e., when the agent performs action a_0, the current task is not scheduled to any queue. The benefit of the virtual queue is that it gives the intelligent scheduler the ability to delay scheduling the current task, making it possible to make scheduling decisions that are better from a global perspective.
3) Environment reward mechanism
The environment reward comprises two parts: a single-step reward r_step during scheduling and a final reward r_final at the end of scheduling. The single-step reward is the immediate reward obtained each time a task is scheduled, while the final reward is the delayed reward obtained after a round of scheduling is completed (all tasks have finished running normally or the preset maximum time step has been reached). If task j is to be scheduled at time step t and the agent executes action a_t, the single-step reward obtained is given by formula (a) (rendered as an image in the original publication), in which an indicator term represents that the free resources of queue q_i are sufficient to run task j.
The delay δ_j of task j is defined by formula (b):
δ_j = start_time_j − avail_time_j (b),
where start_time_j is the time task j starts running and avail_time_j is its schedulable time (the time at which all of its parent tasks have finished running). Let N = |J_{p=i}| denote the number of tasks with priority p = i among all tasks; the average delay time of all tasks with priority p = i is then the mean of their delays, formula (c):
w_{p=i} = (1/N) · Σ_{j ∈ J_{p=i}} δ_j (c).
The final reward r_final is given by formula (d) (rendered as an image in the original publication), a weighted combination of the average delays per priority, where w_{p=3} denotes the average delay time of all tasks with priority 3 and carries weight coefficient 0.7, and likewise w_{p=2} and w_{p=1} denote the average delay times of tasks with priorities 2 and 1, with weight coefficients 0.2 and 0.1, respectively. The lower the average delay of the tasks, the greater the reward obtained, and the higher the priority of a task, the greater its weight.
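For illustration, a sketch of the delay and final-reward computation; the text does not reproduce the exact functional form of formula (d), so the negative weighted sum used here is an assumption consistent with "the lower the average delay, the greater the reward":

```python
from collections import defaultdict

# assumed priority -> weight mapping from the description (3: 0.7, 2: 0.2, 1: 0.1)
PRIORITY_WEIGHTS = {3: 0.7, 2: 0.2, 1: 0.1}

def task_delay(task) -> float:
    """Formula (b): delay = start running time - schedulable time."""
    return task.start_time - task.avail_time

def final_reward(tasks) -> float:
    """Assumed form of formula (d): negative priority-weighted average delay."""
    delays = defaultdict(list)
    for t in tasks:
        delays[t.priority].append(task_delay(t))
    reward = 0.0
    for p, w in PRIORITY_WEIGHTS.items():
        if delays[p]:                      # formula (c): mean delay per priority
            reward -= w * sum(delays[p]) / len(delays[p])
    return reward
```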
S3. Workflow scheduling based on the proximal policy optimization (PPO) algorithm
1) Neural network initialization. The proximal policy optimization (PPO) algorithm is based on the actor-critic architecture, so three neural networks must first be randomly initialized: the policy network π_θ, the old policy network π_θ_old, and the value network Q. The policy network outputs the probability of each action the agent can perform in a given state, and the value network Q is used to evaluate the state.
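A minimal PyTorch sketch of the three networks initialized in this step (illustrative only; the layer sizes and hidden dimensions are assumptions):

```python
import copy
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """pi_theta: maps the observed state to a probability over the n+1 actions."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(state), dim=-1)

class ValueNet(nn.Module):
    """Q: evaluates a state with a single scalar value v(s)."""
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)

state_dim, n_queues = 64, 4                  # assumed sizes for illustration
policy = PolicyNet(state_dim, n_queues + 1)  # +1 for the virtual queue q_0
policy_old = copy.deepcopy(policy)           # old policy network
value_net = ValueNet(state_dim)
```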
2) Traverse all workflows in the configuration that have not been started and check whether they need to be started at the current time t; if t is greater than or equal to the start time of workflow g, start workflow g, otherwise proceed to the next time t+1. After workflow g is started, the state of its entry tasks (tasks without parent tasks) is updated to schedulable and they are added to the schedulable task queue.
3) Check whether any task j has finished running at the current time t; if so, traverse all subtasks of task j and check the states of their parent tasks. If all parent tasks of a subtask have completed, update the state of that subtask to schedulable and add it to the schedulable task queue.
4) Randomly take a task out of the schedulable task queue and set its isActive value to 1, indicating that it is the task to be scheduled at the current time t.
5) Embed the state of the workflow into a fixed-size vector using the graph convolutional neural network GCN, and then combine it with the state of the Yarn cluster to form the state s_t observable by the current intelligent scheduler.
6) Input the state s_t into the policy network π_θ; the policy network outputs the probabilities of all actions, and an action a_t is sampled according to these probabilities. Schedule the current task to queue q_i (i = a_t) and restore the isActive value of the current task to 0; then compute the current reward r_t according to the environment reward mechanism in the reinforcement learning modeling of step S2: if scheduling has not finished at time t, r_t = r_step, otherwise r_t = r_final. Store {s_t, a_t, r_t} in the memory pool.
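An illustrative sketch of steps 5) and 6); the two-layer graph convolution, the mean pooling, and the helper names are assumptions rather than details fixed by the patent:

```python
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    """Two-layer graph convolution H' = ReLU(A_hat @ H @ W); mean-pools node
    features into one fixed-size workflow embedding (an assumed design)."""
    def __init__(self, in_dim: int, emb_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, emb_dim)
        self.w2 = nn.Linear(emb_dim, emb_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: normalized adjacency (with self-loops) of the workflow DAG
        h = torch.relu(adj @ self.w1(x))
        h = torch.relu(adj @ self.w2(h))
        return h.mean(dim=0)              # fixed-size workflow state vector

def select_action(policy, workflow_emb, cluster_vec):
    """Step 6: build s_t, sample a_t from pi_theta, return (a_t, log prob, s_t)."""
    s_t = torch.cat([workflow_emb, torch.as_tensor(cluster_vec)], dim=-1)
    probs = policy(s_t)                   # PolicyNet from the earlier sketch
    dist = torch.distributions.Categorical(probs)
    a_t = dist.sample()
    return a_t.item(), dist.log_prob(a_t), s_t
```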
7) Proceed to the next time t+1 and repeat steps 2) to 6) until the preset cut-off time t_max is reached, the maximum number of steps per round T is reached, or all tasks of all workflows have finished running.
8) Take the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool, input each state s_t into the value network Q to estimate v(s_t), and then compute the action advantage A_t using formulas (e) and (f):
μ_t = r_t + γ·v(s_{t+1}) − v(s_t) (e);
A_t = μ_t + (γλ)·μ_{t+1} + ... + (γλ)^{T−t+1}·μ_{T−1} (f).
Here γ and λ are hyperparameters; in the invention γ is 0.99 and λ is 0.95; r_t denotes the reward obtained at time t, and v(s_t) and v(s_{t+1}) denote the state values at times t and t+1, respectively.
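A sketch of the advantage computation of formulas (e)-(f), written in the conventional recursive form of generalized advantage estimation, which is assumed to be equivalent to the truncated sum in formula (f):

```python
import torch

def compute_advantages(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """rewards: r_0..r_T, values: v(s_0)..v(s_{T+1}); returns A_0..A_T."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # formula (e)
        gae = delta + gamma * lam * gae                          # formula (f), recursively
        advantages[t] = gae
    return advantages
```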
9) Take the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool and input each state s_t into the policy network π_θ and the old policy network π_θ_old to obtain the action probability distributions π_θ(a_t|s_t) and π_θ_old(a_t|s_t) in state s_t. Compute the importance weight r_t(θ) by formula (g), then compute L_clip(θ) by formula (h) as the loss function, back-propagate, and update the policy network π_θ:
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) (g);
L_clip(θ) = E[ min( r_t(θ)·A_t , clip(r_t(θ), 1−ε, 1+ε)·A_t ) ] (h).
Here ε is a hyperparameter, taken as ε = 0.2 in the invention; clip(r_t(θ), 1−ε, 1+ε) means that when r_t(θ) is greater than 1+ε its value is taken as 1+ε, and when r_t(θ) is smaller than 1−ε its value is taken as 1−ε; the min function takes the smaller of the two terms, and E denotes the expectation.
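A sketch of the policy update in step 9); expressing the maximization of the clipped objective as minimizing its negative is an implementation assumption:

```python
import torch

def ppo_policy_update(policy, policy_old, optimizer, states, actions, advantages,
                      eps: float = 0.2):
    """One PPO update of pi_theta using the clipped surrogate of formulas (g)-(h)."""
    probs = policy(states)                                   # [T, n_actions]
    with torch.no_grad():
        probs_old = policy_old(states)
    pi_a = probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pi_a_old = probs_old.gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = pi_a / pi_a_old                                  # formula (g)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    objective = torch.min(ratio * advantages, clipped * advantages)  # formula (h)
    loss = -objective.mean()       # minimize the negative of the clipped objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```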
10 Repeating 8-9) several times, then overwriting the old policy network parameters with the policy network parameters, i.e. making
Figure BDA0004090503340000063
11) Compute the discounted return G_t according to formula (i), take the difference between the discounted return G_t and the prediction v(s_t) of the value network Q, use the mean squared error (MSE) as the loss function, back-propagate, and update the value network parameters:
G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... + γ^{T−t}·r_{T+1} + γ^{T−t+1}·v(s_{t+1}) (i).
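A sketch of the value-network update in step 11); the discounted return follows the conventional bootstrapped form, which is assumed to correspond to formula (i), and the optimizer usage is illustrative:

```python
import torch
import torch.nn.functional as F

def discounted_returns(rewards, values, gamma: float = 0.99):
    """Formula (i): bootstrap the tail of the trajectory with the value estimate."""
    T = len(rewards)
    returns = torch.zeros(T)
    running = values[-1]                 # v of the state after the last step
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_update(value_net, optimizer, states, returns):
    """Regress v(s_t) toward G_t with an MSE loss (step 11)."""
    v = value_net(states)
    loss = F.mse_loss(v, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```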
12 And (3) emptying the memory pool, and repeating the steps (2) - (11) for a plurality of rounds until the environmental rewards obtained by the intelligent scheduler are converged to the optimal value or the suboptimal value.
Compared with the prior art, the method achieves higher cluster resource utilization and reduces both the running time of each user's workflow and the average delay time of tasks. Scheduling decisions can be adjusted dynamically according to changes in cluster resources; for example, when resources are scarce, high-priority tasks are considered first, guaranteeing their running time. The model is simple and easy to train, can significantly improve user experience, and provides technical support for the field of big data workflow scheduling.
Drawings
FIG. 1 is a schematic diagram of a deep reinforcement learning model according to the present invention;
FIG. 2 is a workflow runtime diagram of the present invention in comparison with other scheduling algorithms.
Detailed Description
The invention is described and illustrated in further detail below with reference to a specific embodiment:
example 1
Referring to FIG. 1, workflow scheduling on the Yarn cluster is performed as follows:
S1. Modeling the workflow and the Yarn cluster resource model
1) The workflow: a workflow can be modeled as a directed acyclic graph G = (J, D), where J represents the set of tasks in the workflow, J = {j_1, j_2, ..., j_m}, and D represents the set of task dependencies in the workflow, D = {d_st | s ≠ t, and s, t ∈ {1, ..., m}}. A dependency d_st indicates that task j_t depends on task j_s; j_s is the parent task and j_t the subtask, and only after the parent task j_s has completed can task j_t start running. The user configures workflows through configuration files; the configuration content includes the start time of the workflow, the tasks it contains, the fixed attributes of the tasks (e.g., the number of CPU cores and memory required to run the task, and the priority), and the dependencies between tasks.
2) Yarn cluster: the system administrator divides the Yarn cluster resources into n queues q_1, q_2, ..., q_n, sets for each queue the ratio of its maximum available resources to the total cluster resources as f_1, f_2, ..., f_n, and then allocates the n queues to users u_1, u_2, ..., u_n. Resources within a queue are used on a first-come-first-served (FIFO) basis, i.e., tasks submitted to the queue earlier obtain resources and run first. When the resources currently used by a queue are smaller than its configured maximum and the remaining quota is sufficient to run a newly submitted task, the task can start running promptly; otherwise it waits for resources until the idle resources of the queue satisfy its running conditions.
S2. Reinforcement learning modeling
The basic idea of reinforcement learning is that an agent observes the state of the environment, takes actions based on the observed state, interacts with the environment, and continually optimizes its policy according to the rewards given by the environment so as to maximize the long-term return. The three elements of the reinforcement learning model are the state space, the action space, and the reward mechanism.
1) State space
The environment state observed by the agent is S = {S_g, S_c}, where S_g represents the state of the workflows and S_c represents the state of the current cluster queue resources. S_g = {S_g1, S_g2, ..., S_gn}, where S_gi represents the state of the i-th workflow; S_gi consists of the dependencies and the task states of the i-th workflow, i.e., S_gi = {D_g, S_ji, i ∈ J_g}. A task state comprises the fixed attributes and the runtime attributes of the task: the fixed attributes include the amount of resources required by the task, the priority of the task, etc., and the runtime attributes include the task state, schedulable time, scheduling time, start running time, etc. S_c = {S_q1, S_q2, ..., S_qn}, S_qi = {vcore_max, vmem_max, vcore_used, vmem_used}; vcore_max represents the maximum number of CPU cores available to the queue and vmem_max the maximum memory size available to the queue; correspondingly, vcore_used and vmem_used represent the number of CPU cores and the memory currently used by the queue.
2) Action space
According to the modeling of the workflow and the Yarn cluster resource model in S1, the Yarn cluster is divided into n queues {q_1, q_2, ..., q_n}. Let the action space be A = {a_0, a_1, a_2, ..., a_n}; when the agent performs action a_i, the task to be scheduled is scheduled to queue q_i. In particular, the invention provides a virtual queue q_0 to indicate that the current task is not scheduled, i.e., when the agent performs action a_0, the current task is not scheduled to any queue. The benefit of the virtual queue is that it gives the intelligent scheduler the ability to delay scheduling the current task, making it possible to make scheduling decisions that are better from a global perspective.
3) Environment reward mechanism
The environment reward comprises two parts: a single-step reward r_step during scheduling and a final reward r_final at the end of scheduling. The single-step reward is the immediate reward obtained each time a task is scheduled, while the final reward is the delayed reward obtained after a round of scheduling is completed (all tasks have finished running normally or the preset maximum time step has been reached). If task j is to be scheduled at time step t and the agent executes action a_t, the single-step reward obtained is calculated by formula (a) (rendered as an image in the original publication), in which an indicator term represents that the free resources of queue q_i are sufficient to run task j.
The delay δ_j of task j is defined by formula (b):
δ_j = start_time_j − avail_time_j (b),
where start_time_j is the time task j starts running and avail_time_j is its schedulable time (the time at which all of its parent tasks have finished running). Let N = |J_{p=i}| denote the number of tasks with priority p = i among all tasks; the average delay time of all tasks with priority p = i is then the mean of their delays, formula (c):
w_{p=i} = (1/N) · Σ_{j ∈ J_{p=i}} δ_j (c).
The final reward r_final is calculated by formula (d) (rendered as an image in the original publication), a weighted combination of the average delays per priority, where w_{p=3} denotes the average delay time of all tasks with priority 3 and carries weight coefficient 0.7, and likewise w_{p=2} and w_{p=1} denote the average delay times of tasks with priorities 2 and 1, with weight coefficients 0.2 and 0.1, respectively. The lower the average delay of the tasks, the greater the reward obtained, and the higher the priority of a task, the greater its weight.
S3. Workflow scheduling based on the proximal policy optimization (PPO) algorithm
1) Neural network initialization. The proximal policy optimization (PPO) algorithm is based on the actor-critic architecture, so three neural networks must first be randomly initialized: the policy network π_θ, the old policy network π_θ_old, and the value network Q. The policy network outputs the probability of each action the agent can perform in a given state, and the value network is used to evaluate the state.
2) Traverse all workflows in the configuration that have not been started and check whether they need to be started at the current time t; if t is greater than or equal to the start time of workflow g, start workflow g, otherwise proceed to the next time t+1. After workflow g is started, the state of its entry tasks (tasks without parent tasks) is updated to schedulable and they are added to the schedulable task queue.
3) Check whether any task j has finished running at the current time t; if so, traverse all subtasks of task j and check the states of their parent tasks. If all parent tasks of a subtask have completed, update the state of that subtask to schedulable and add it to the schedulable task queue.
4) Randomly take a task out of the schedulable task queue and set its isActive value to 1, indicating that it is the task to be scheduled at the current time t.
5) Embed the state of the workflow into a fixed-size vector using the graph convolutional neural network GCN, and then combine it with the Yarn cluster state to form the state s_t observable by the current intelligent scheduler.
6) Input the state s_t into the policy network π_θ; the policy network outputs the probabilities of all actions, and an action a_t is sampled according to these probabilities. Schedule the current task to queue q_i (i = a_t) and restore the isActive value of the current task to 0; then compute the current reward r_t according to the environment reward mechanism in the reinforcement learning modeling of step S2: if scheduling has not finished at time t, r_t = r_step, otherwise r_t = r_final. Store {s_t, a_t, r_t} in the memory pool.
7) Proceed to the next time t+1 and repeat steps 2) to 6) until the preset cut-off time t_max is reached, the maximum number of steps per round T is reached, or all tasks of all workflows have finished running.
8) Take the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool, input each state s_t into the value network Q to estimate v(s_t), and then compute the action advantage A_t using formulas (e) and (f):
μ_t = r_t + γ·v(s_{t+1}) − v(s_t) (e);
A_t = μ_t + (γλ)·μ_{t+1} + ... + (γλ)^{T−t+1}·μ_{T−1} (f).
Here γ and λ are hyperparameters; in the invention γ is 0.99 and λ is 0.95; r_t denotes the reward obtained at time t, and v(s_t) and v(s_{t+1}) denote the state values at times t and t+1, respectively.
9) Take the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool and input each state s_t into the policy network π_θ and the old policy network π_θ_old to obtain the action probability distributions π_θ(a_t|s_t) and π_θ_old(a_t|s_t) in state s_t. Compute the importance weight r_t(θ) by formula (g), then compute L_clip(θ) by formula (h) as the loss function, back-propagate, and update the policy network π_θ:
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) (g);
L_clip(θ) = E[ min( r_t(θ)·A_t , clip(r_t(θ), 1−ε, 1+ε)·A_t ) ] (h).
Here ε is a hyperparameter, taken as ε = 0.2 in the invention; clip(r_t(θ), 1−ε, 1+ε) means that when r_t(θ) is greater than 1+ε its value is taken as 1+ε, and when r_t(θ) is smaller than 1−ε its value is taken as 1−ε; the min function takes the smaller of the two terms, and E denotes the expectation.
10 8-9) are repeated several times, then the old policy network parameters are covered with the policy network parameters, i.e. the order
Figure BDA0004090503340000095
11 Calculating the discount rewards G according to the following (i) t Then report G using discount t Predicted value v(s) with value network Q t ) Making a difference, using the mean square error MSE as a loss function, back-propagating, and updating the value network parameters:
G t =r t+1 +γr t+22 r t+3 +...+γ T-t r T+1T-t+1 ν(s t+1 ) (i)。
12 The memory pool is emptied, and the steps 2 to 11) are repeated for a plurality of times until the environmental rewards obtained by the intelligent scheduler are converged to the optimal value or the suboptimal value.
Referring to FIG. 2, "ppo" corresponds to the Yarn cluster workflow scheduling method based on deep reinforcement learning in the present invention. Simulation results show that, compared with a static scheduling algorithm (static), a random algorithm (random), a priority-based scheduling algorithm (pbs), and the earliest finish time algorithm (heft), the scheduling algorithm of the invention achieves lower workflow running times.
The foregoing description of the embodiments of the invention is not intended to limit the scope of the invention, but is intended to cover any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention.

Claims (5)

1. The Yarn cluster workflow scheduling method based on deep reinforcement learning is characterized by comprising the following steps of:
1) Collecting workflow information and cluster resource information according to the workflow configuration of the user and the Yarn cluster queue configuration, and constructing a workflow scheduling model according to the collected information;
2) Performing reinforcement learning modeling according to the workflow scheduling model and the Yarn cluster resource model, wherein the reinforcement learning modeling comprises a state space, an action space and a reward mechanism;
3) Training the scheduler by using the proximal policy optimization algorithm among deep reinforcement learning algorithms.
2. The Yarn cluster workflow scheduling method based on deep reinforcement learning according to claim 1, wherein the constructing of the workflow scheduling model in step 1) comprises:
a) Modeling a workflow as a directed acyclic graph G = (J, D), where J represents the set of tasks in the workflow, J = {j_1, j_2, ..., j_m}, and D represents the set of task dependencies in the workflow, D = {d_st | s ≠ t, and s, t ∈ {1, ..., m}}; d_st indicates that subtask j_t depends on parent task j_s, and only after the parent task j_s has completed can subtask j_t start running;
b) The user performs workflow configuration in the form of a configuration file, the configuration content comprising: the start time of the workflow, the tasks contained, the fixed attributes of the tasks, and the dependencies between tasks; the fixed attributes of a task include the number of CPU cores, the memory size, and the priority required to run the task;
c) The workflow scheduler schedules tasks in the user-configured workflow to a Yarn cluster, the Yarn cluster being divided into n queues {q_1, q_2, ..., q_n}, the n queues being allocated to users u_1, u_2, ..., u_n, and the ratios of the maximum available resources of the queues to the total cluster resources being f_1, f_2, ..., f_n.
3. The Yarn cluster workflow scheduling method based on deep reinforcement learning according to claim 1, wherein the reinforcement learning modeling of step 2) comprises:
a) State space
The environment state observed by the agent is S = {S_g, S_c}, where S_g represents the state of the workflows, S_g = {S_g1, S_g2, ..., S_gn}; S_c represents the state of the current cluster queue resources, S_c = {S_q1, S_q2, ..., S_qn}; S_qi = {vcore_max, vmem_max, vcore_used, vmem_used}, where vcore_max represents the maximum number of CPU cores available to the queue, vmem_max represents the maximum memory size available to the queue, and vcore_used and vmem_used respectively represent the number of CPU cores and the memory currently used by the queue; S_gi represents the state of the i-th workflow, S_gi = {D_g, S_ji, i ∈ J_g}; S_ji is a task state comprising the fixed attributes and the runtime attributes of the task; the fixed attributes of a task include the amount of resources required by the task and the priority of the task; the runtime attributes of a task include the task state, schedulable time, scheduling time, and start running time;
b) Action space
The Yarn cluster is divided into n queues {q_1, q_2, ..., q_n}; let the action space be A = {a_0, a_1, a_2, ..., a_n}; when the agent performs action a_i, the task to be scheduled is scheduled to queue q_i;
c) Environment reward mechanism
The environment reward includes a single-step reward r_step during scheduling and a final reward r_final at the end of scheduling; the single-step reward is the immediate reward obtained each time a task is scheduled; if task j is to be scheduled at time step t and the agent executes action a_t, the single-step reward obtained is calculated by formula (a) (rendered as an image in the original publication), in which an indicator term represents that the free resources of queue q_i are sufficient to run task j; the delay time δ_j of task j is calculated by formula (b):
δ_j = start_time_j − avail_time_j (b);
where start_time_j is the start running time of task j and avail_time_j is the schedulable time of the task;
the final reward is the delayed reward obtained after all tasks have run normally or the preset maximum time step has been reached; letting N = |J_{p=i}| denote the number of tasks with priority p = i among all tasks, the average delay time w_{p=i} of all tasks with priority p = i is calculated by formula (c):
w_{p=i} = (1/N) · Σ_{j ∈ J_{p=i}} δ_j (c);
the final reward r_final is calculated by formula (d) (rendered as an image in the original publication), where w_{p=3} represents the average delay time of all tasks with priority 3 with weight coefficient 0.7, and likewise w_{p=2} and w_{p=1} represent the average delay times of tasks with priorities 2 and 1, with weight coefficients 0.2 and 0.1, respectively.
4. The Yarn cluster workflow scheduling method based on deep reinforcement learning according to claim 1, wherein training the scheduler using the proximal policy optimization algorithm in step 3) specifically comprises:
a) Randomly initializing three neural networks: the current policy network π_θ, the old policy network π_θ_old, and the value network Q, wherein the current policy network π_θ and the old policy network π_θ_old output the probability of each action performed by the agent in a given state, and the value network Q is used to evaluate the state;
b) Traversing all workflows g in the configuration that have not been started and checking whether they need to be started at the current time t; if t is greater than or equal to the start time of workflow g, starting workflow g, updating the state of its entry tasks to schedulable, and adding them to the schedulable task queue; otherwise advancing to the next time t+1;
c) Checking whether task j has finished at the current time t; if so, traversing all subtasks j_t of task j and checking the states of their parent tasks j_s; if all parent tasks j_s of a subtask j_t have completed, updating the state of the subtask j_t to schedulable and adding it to the schedulable task queue;
d) Randomly taking a task j out of the schedulable task queue and updating its isActive value to 1, indicating that it is the task j to be scheduled at the current time t;
e) Embedding the state of workflow g into a fixed-size vector with the graph convolutional neural network GCN, and combining it with the state of the Yarn cluster to form the state s_t observable by the current intelligent scheduler;
f) Inputting the state s_t into the policy network π_θ and sampling an action a_t according to the action probabilities output by the policy network π_θ; scheduling the current task j to queue q_i (i = a_t) and restoring the isActive value of the current task to 0; then computing the current reward r_t according to the environment reward mechanism: if scheduling has not ended at time t, r_t = r_step, otherwise r_t = r_final; finally storing {s_t, a_t, r_t} in the memory pool;
g) Entering the next time t+1 and repeating steps b) to f) until the preset cut-off time t_max is reached, the maximum number of steps per round T is reached, or all tasks j of all workflows g have finished running;
h) Taking the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool, inputting each state s_t into the value network Q to obtain the state value v(s_t), and then computing the action advantage A_t by formula (e):
A_t = μ_t + (γλ)·μ_{t+1} + ... + (γλ)^{T−t+1}·μ_{T−1} (e);
where γ and λ are hyperparameters and μ_t is given by formula (f):
μ_t = r_t + γ·v(s_{t+1}) − v(s_t) (f);
where r_t denotes the reward obtained at time t, and v(s_t) and v(s_{t+1}) denote the state values at times t and t+1, respectively;
i) Taking the data {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T} out of the memory pool, inputting each state s_t into the policy network π_θ and the old policy network π_θ_old to obtain the action probability distributions π_θ(a_t|s_t) and π_θ_old(a_t|s_t) in state s_t, computing the importance weight r_t(θ), using L_clip(θ) as the loss function, back-propagating, and updating the policy network π_θ; the importance weight r_t(θ) is calculated by formula (g):
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) (g);
and L_clip(θ) is calculated by formula (h):
L_clip(θ) = E[ min( r_t(θ)·A_t , clip(r_t(θ), 1−ε, 1+ε)·A_t ) ] (h);
where ε is a hyperparameter; clip(r_t(θ), 1−ε, 1+ε) means that when r_t(θ) is greater than 1+ε its value is 1+ε, and when r_t(θ) is smaller than 1−ε its value is 1−ε; the min function takes the smaller of the two values; and E denotes the expectation;
j) Repeating steps h) to i) several times, then overwriting the parameters of the old policy network π_θ_old with the parameters of the policy network π_θ, i.e., letting θ_old = θ;
k) Taking the difference between the discounted return G_t and the prediction v(s_t) of the value network Q, using the MSE as the loss function, back-propagating, and updating the parameters of the value network Q, the discounted return G_t being calculated by formula (i):
G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... + γ^{T−t}·r_{T+1} + γ^{T−t+1}·v(s_{t+1}) (i);
l) Emptying the memory pool and repeating steps b) to k) several times until the environment reward obtained by the intelligent scheduler converges to the optimal or a suboptimal value.
5. The Yarn cluster workflow scheduling method based on deep reinforcement learning according to claim 3, wherein the Yarn cluster of step b), divided into n queues {q_1, q_2, ..., q_n}, is provided with a virtual queue q_0 for indicating that the current task j is not scheduled, i.e., {q_0, q_1, q_2, ..., q_n}; when the agent performs action a_0, the current task j is not scheduled to any queue.
CN202310149989.1A 2023-02-22 2023-02-22 Deep reinforcement learning-based Yarn cluster workflow scheduling method Pending CN116069473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310149989.1A CN116069473A (en) 2023-02-22 2023-02-22 Deep reinforcement learning-based Yarn cluster workflow scheduling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310149989.1A CN116069473A (en) 2023-02-22 2023-02-22 Deep reinforcement learning-based Yarn cluster workflow scheduling method

Publications (1)

Publication Number Publication Date
CN116069473A true CN116069473A (en) 2023-05-05

Family

ID=86169799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310149989.1A Pending CN116069473A (en) 2023-02-22 2023-02-22 Deep reinforcement learning-based Yarn cluster workflow scheduling method

Country Status (1)

Country Link
CN (1) CN116069473A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117555306A (en) * 2024-01-11 2024-02-13 天津斯巴克斯机电有限公司 Digital twinning-based multi-production-line task self-adaptive scheduling method and system
CN117555306B (en) * 2024-01-11 2024-04-05 天津斯巴克斯机电有限公司 Digital twinning-based multi-production-line task self-adaptive scheduling method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination