CN114860385B - Parallel cloud workflow scheduling method based on evolution reinforcement learning strategy - Google Patents

Parallel cloud workflow scheduling method based on evolution reinforcement learning strategy Download PDF

Info

Publication number
CN114860385B
CN114860385B
Authority
CN
China
Prior art keywords
sub
actor
cost
time
population
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210537383.0A
Other languages
Chinese (zh)
Other versions
CN114860385A (en)
Inventor
李慧芳
陈兵
田露之
黄姜杭
姚分喜
崔灵果
柴森春
张百海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210537383.0A priority Critical patent/CN114860385B/en
Publication of CN114860385A publication Critical patent/CN114860385A/en
Application granted granted Critical
Publication of CN114860385B publication Critical patent/CN114860385B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • G06N3/126Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45583Memory management, e.g. access or allocation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45591Monitoring or debugging support
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5011Pool
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Physiology (AREA)
  • Genetics & Genomics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a parallel cloud workflow scheduling method based on an evolutionary reinforcement learning strategy. Two populations are used to optimize workflow execution time and cost respectively, population individuals are designed as reinforcement learning agents, and two-stage optimization of the agent networks is achieved through interactive learning between the agents and the environment together with network parameter updating based on a particle swarm optimization algorithm. During training of the reinforcement learning model, parallel interaction and iterative learning of multiple agents in the populations with the environment generate rich and diverse action-selection experience sequences and improve search diversity. Meanwhile, a complementary heuristic mechanism is designed that uses external objective-advantage information of the scheduling scheme to fine-tune and correct the agents' action-selection probabilities, so that the optimization of workflow execution time and cost is better balanced and the global search capability is improved.

Description

Parallel cloud workflow scheduling method based on evolution reinforcement learning strategy
Technical Field
The invention belongs to the technical field of cloud computing, and particularly relates to a parallel cloud workflow scheduling method based on an evolution reinforcement learning strategy.
Background
Cloud computing's on-demand, elastic and scalable resource provisioning, pay-per-use billing and strong computing power provide a secure, reliable and low-cost execution environment for enterprises to host and deploy Internet-based applications in a distributed manner. It has become an essential means for enterprises to achieve digital and intelligent transformation, and more and more workflow applications are being migrated to the cloud. To guarantee the quality of service (QoS) of cloud consumers (users) while taking the benefits of cloud providers into account, efficient cloud task scheduling and resource management techniques are needed. However, the NP-hard nature of the scheduling problem, the high concurrency of workflow task requests and the dynamics of cloud environments make workflow scheduling challenging.
Current cloud workflow scheduling methods mainly include heuristics, meta-heuristics, combinations of the two, and machine-learning-based algorithms. Heuristic algorithms are rule-based greedy algorithms, usually designed from domain expert knowledge or for specific problems; when faced with complex scheduling problems they find it very difficult to locate the optimal solution, and their generalization ability is insufficient. Meta-heuristic algorithms, or algorithms combining heuristics and meta-heuristics, mainly search for the optimal solution of the scheduling problem through repeated iterative updating. They have a certain generality, but because of the randomness of the search they may return a different solution on every run, their optimization is time-consuming, and they can hardly meet users' real-time scheduling requirements.
Machine-learning-based workflow scheduling methods mainly exploit the outstanding advantages of reinforcement learning in sequential decision problems: based on trial and error, scheduling knowledge is learned through continuous interaction between agents and the environment, seeking a balance between exploration and exploitation. Once trained, a near-optimal solution can be obtained by forward prediction alone, which greatly shortens the solution time; this is currently regarded as a promising research direction. Multi-objective workflow scheduling methods based on Deep Q-Network (DQN) have been proposed, in which the DQN algorithm uses value-function approximation to solve the high-dimensional storage explosion problem of Q-Learning. However, because they train the reinforcement learning model with a fixed-dimension environment state vector and a single type of workflow, their generalization ability is poor. In addition, gradient-based reinforcement learning algorithms suffer from inefficient search caused by sparse or deceptive rewards and from fragile convergence caused by sensitivity to hyper-parameters, which greatly limits their applicability.
Disclosure of Invention
In view of the above, the invention provides a parallel cloud workflow scheduling method based on an evolution reinforcement learning strategy, which realizes simultaneous scheduling of multiple types of applications in a hybrid cloud environment.
The invention provides a parallel cloud workflow scheduling method based on an evolution reinforcement learning strategy, which comprises the following steps:
Step 1, a parallel cloud workflow scheduling model based on the evolutionary reinforcement learning strategy is built using the DDPG algorithm, wherein the parallel cloud workflow scheduling model comprises a time population P1 and a cost population P2, P1 and P2 being used to optimize workflow execution time and cost respectively; individuals in the populations serve as the agents of the DDPG network, the time population comprises several time-optimizing sub-Actors, and the cost population comprises several cost-optimizing sub-Actors;
Step 2, calculating the time optimization target value Makespan and the Cost optimization target value Cost of the tasks to be scheduled when executed by the virtual machines in the resource pool, and taking Makespan and Cost as the input states of the time-optimizing sub-Actors in P1 and the cost-optimizing sub-Actors in P2, respectively;
Step 3, during training of the parallel cloud workflow scheduling model, the time-optimizing sub-Actors and the cost-optimizing sub-Actors respectively take the time-related and cost-related states as input, and the sub-Actor network parameters are updated through interaction between the Agents and the environment;
Step 4, collecting all sub-Actors in the P1 and P2 populations, evaluating them and performing non-dominated sorting according to the two optimization targets Makespan and Cost, and storing all obtained non-dominated sub-Actors into a set to obtain the Pareto front solution set; then all sub-Actors in this set respectively process the newly input parallel workflow applications and output their corresponding scheduling schemes through forward prediction, which yields a set of non-dominated workflow scheduling schemes; this scheduling scheme set is the output scheduling scheme.
Further, the training process of the parallel cloud workflow scheduling model comprises the following steps:
Step 2.1, resetting the environment to the initial state, and clearing the task state list Task_State_List and the virtual machine state list VM_State_List;
Step 2.2, for the time-optimizing sub-Actors and the cost-optimizing sub-Actors in the populations, respectively detecting the Makespan-related environment state and the Cost-related environment state at the current time step q, inputting them into the corresponding sub-Actor networks, and selecting an action a_q in combination with the complementary heuristic strategy to obtain the virtual machine allocated to the ready task; the complementary heuristic strategy is shown in the following formula:

v_sel(t_j^i) = argmax_k { λ(i,j,k)·[η(i,j,k)]^α }, if δ ≤ δ0; otherwise the virtual machine is selected by roulette-wheel selection;

where v_sel(t_j^i) is the virtual machine allocated to the ready task t_j^i, λ(i,j,k) and η(i,j,k) respectively denote the Actor network output probability and the heuristic information for assigning task t_j^i to virtual machine v_k, α is a weight parameter, i is the number of the workflow, j is the number of the task, k is the number of the virtual machine, δ is a random number in the range [0,1], δ0 is a preset hyper-parameter, and argmax returns the index of the virtual machine with the largest selection probability;
Step 2.3, after the sub-Actor executes action a_q of time step q, the environment state is updated to the new Makespan-related or Cost-related state at time step q+1, and the time reward or cost reward of the current time step q is calculated; at the same time, the sub-Actor stores the experience sequence (current state, action a_q, reward, next state) into the playback buffer B;
Step 2.4, after one interaction ends, if the task state list still contains tasks whose state is not-executed, execute step 2.2; if all tasks have been executed, store the experience sequence of the scheduling round into buffer B and execute step 2.5;
Step 2.5, extracting experience sequences from buffer B by uniform random sampling, and learning and training the Actor networks on the extracted data, so as to update the parameters of each sub-Actor network and optimize the action-selection strategies;
Step 2.6, judging whether the accumulated number of complete scheduling rounds reaches the threshold; if so, training of the parallel cloud workflow scheduling model is complete; otherwise, execute step 2.1.
Further, in the step 2.3 or the step 2.4, when the buffer B is full or overflows, the earliest stored experience sequence is replaced by the latest experience sequence in time sequence.
Further, the following steps are performed after step 2.4:
S1, if the number of experience sequences stored in buffer B reaches the preset capacity, execute step 2.5 first and then execute S2; otherwise, execute S2;
S2, dividing the P1 and P2 populations into several groups, with individuals in different groups evolving independently; selecting the globally optimal sub-Actors of the populations and the historical optimal solution of each sub-Actor to form an elite archive set H_A, selecting the worst sub-Actor of each group to form a set to be learned H_W, and using the elite solutions in H_A to guide the update of the worst sub-Actors in H_W; the update process is shown in the following formulas:

V_e(d+1) = ψ0·V_e(d) + ψ1·r1·(P_e(d) - X_e(d)) + ψ2·r2·(G_e(d) - X_e(d)) + ψ3·r2·(E(d) - X_e(d))
X_e(d+1) = X_e(d) + V_e(d+1)

where X_e(d) and V_e(d) denote the position and velocity of the e-th sub-Actor in H_W at the d-th iteration, X_e(d+1) and V_e(d+1) denote the position and velocity of the e-th sub-Actor in H_W at the (d+1)-th iteration, P_e(d) denotes the historical optimal position of the e-th sub-Actor in H_W at the d-th iteration, G_e(d) is the position, after d iterations, of the optimal solution of the population to which the e-th sub-Actor in H_W belongs, E(d) denotes the position of the elite sub-Actor randomly chosen from H_A at the d-th iteration and used to guide the e-th sub-Actor in H_W, ψ0, ψ1, ψ2 and ψ3 are weight parameters, and r1 and r2 are random numbers between 0 and 1.
Further, step 4 is implemented as: inputting the new parallel workflow applications into the model trained in step 3, and outputting the corresponding parallel workflow scheduling schemes through forward prediction.
Further, the manner of dividing the P1 and P2 populations into several groups in S2 is as follows: the group size increases dynamically as the time steps progress.
The beneficial effects are that:
1. The invention organically combines an evolutionary algorithm with a reinforcement learning algorithm and provides a workflow scheduling method based on an evolutionary reinforcement strategy (DG-ERL). Two populations P1 and P2 are used to optimize workflow execution time and cost respectively, population individuals are designed as reinforcement learning agents (Agents), and two-stage optimization of the Agent networks is realized through interactive learning between the Agents and the environment together with network parameter updating based on a particle swarm optimization (PSO) algorithm. During training of the reinforcement learning model, the parallel interaction and iterative learning of multiple Agents in P1 or P2 with the environment generate rich and diverse action-selection experience sequences and improve search diversity. Meanwhile, a complementary heuristic mechanism is designed that uses external objective-advantage information of the scheduling scheme to fine-tune and correct the Agents' action-selection probabilities, so that the optimization of workflow execution time and cost is better balanced and the global search capability is improved.
2. The invention introduces a mixed elite guidance strategy into the PSO-based Agent network parameter optimization: elite individuals are retained competitively and the external archive set is updated through non-dominated sorting of the P1 and P2 populations of each generation; by letting elite individuals guide the update of the P1 and P2 populations, knowledge exchange and co-evolution between P1 and P2 are realized and search diversity is improved.
3. The invention designs a population update mechanism based on dynamic grouping learning: P1 and P2 are divided into several groups, the worst particle in each group is iteratively updated, and the learning intensity is controlled by dynamically changing the group size. This better balances the trade-off between the generalization ability and the convergence of reinforcement learning, which improves the convergence speed of the model and further reduces the possibility of the search getting trapped in a local optimum.
Drawings
Fig. 1 is a flow chart of a parallel cloud workflow scheduling method based on an evolutionary reinforcement learning strategy.
Fig. 2 is a diagram of a convergence experiment result of the parallel cloud workflow scheduling method based on the evolutionary reinforcement learning strategy provided by the invention under a D-5-138 dataset.
Fig. 3 is a diagram of a convergence experiment result of the parallel cloud workflow scheduling method based on the evolutionary reinforcement learning strategy provided by the invention under a D-5-252 dataset.
FIG. 4 is a Pareto front distribution diagram of the parallel cloud workflow scheduling method based on the evolutionary reinforcement learning strategy and four comparison algorithms under a D-5-138 dataset.
FIG. 5 is a Pareto front distribution diagram of the parallel cloud workflow scheduling method based on the evolutionary reinforcement learning strategy and four comparison algorithms under a D-5-252 dataset.
Fig. 6 is a comparison diagram of the running time of the parallel cloud workflow scheduling method and four comparison algorithms based on the evolutionary reinforcement learning strategy.
Fig. 7 is a set of scheduling schemes obtained by the parallel cloud workflow scheduling method based on the evolutionary reinforcement learning strategy and four comparison algorithms.
Detailed Description
The present invention will be described in detail with reference to the following examples.
The basic idea of the parallel cloud workflow scheduling method based on the evolutionary reinforcement learning strategy provided by the invention is as follows: based on the Deep Deterministic Policy Gradient (DDPG) reinforcement learning algorithm, dual-objective workflow scheduling based on evolutionary reinforcement is realized by combining the multi-population evolution idea of swarm intelligence optimization. First, the multiple-populations-for-multiple-objectives (MPMO) concept is introduced into the PSO algorithm, and the population is divided into a time population P1 and a cost population P2, which optimize the two workflow scheduling objectives, execution time (Makespan) and Cost, respectively. Meanwhile, the individuals in the populations are designed as the agents of reinforcement learning; through iterative interaction between the agents and the environment, scheduling knowledge is learned and potential scheduling schemes are explored, while each population focuses on optimizing its own scheduling target.
Second, a PSO-based Actor network parameter update strategy is embedded in the evolution process to perform a second-stage optimization of the Actor network parameters. Through the parallel interaction of multiple Actors with the environment and the co-evolution of the two populations, a better balance between exploration and exploitation is achieved, the diversity of the action-selection strategy search is improved, the search is prevented from getting trapped in a local optimum, and final convergence to the global optimal solution is ensured.
The parallel cloud workflow scheduling method based on the evolutionary reinforcement learning strategy provided by the invention comprises building and training a parallel cloud workflow scheduling model based on the evolutionary reinforcement learning strategy and applying the model to parallel cloud workflow scheduling, specifically the following steps:
Step 1, a parallel cloud workflow scheduling model based on the evolutionary reinforcement learning strategy is built using the DDPG algorithm. The scheduling model comprises a time population P1 and a cost population P2, which are used to optimize workflow execution time and cost respectively; individuals in the populations serve as the agents (Actors) of the DDPG network. The time population contains several time-optimizing sub-Actors and the cost population contains several cost-optimizing sub-Actors, which are collectively called sub-Actors.
The DDPG algorithm mainly consists of an Actor and a Critic. The Actor receives a state from the environment, selects the action to execute, and learns scheduling knowledge from the Critic's feedback. The Critic obtains the states and rewards generated during the interaction of the Actor with the environment, calculates the temporal-difference (TD) error from them, and thereby updates the network parameters of both the Actor and the Critic.
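To make the TD error the Critic relies on concrete, the following minimal sketch computes a one-step TD error from scalar values; the discount factor and the done flag are illustrative assumptions, and in the actual method the Q values would come from the Critic's online and target networks.

```python
def td_error(reward, q_value, next_q_value, gamma=0.99, done=False):
    """One-step temporal-difference (TD) error used by the Critic (sketch).
    gamma=0.99 is an assumed discount factor, not a value from the patent."""
    td_target = reward + (0.0 if done else gamma * next_q_value)
    return td_target - q_value

# Example with toy scalar Q estimates
print(td_error(reward=1.0, q_value=0.5, next_q_value=0.8))
```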
Step 2, calculating the time optimization target value and the cost optimization target value of the tasks to be scheduled when executed by the virtual machines in the resource pool, and taking the time optimization target (Makespan) value and the Cost optimization target (Cost) value as the input states of the time-optimizing sub-Actors in P1 and the cost-optimizing sub-Actors in P2, respectively.
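To illustrate the two optimization targets, the sketch below evaluates Makespan and Cost for a toy task-to-VM assignment. The linear workload/speed execution-time model and per-second pricing are simplifying assumptions rather than the patent's exact cost model, and task dependencies and data-transfer times are omitted.

```python
from collections import defaultdict

def evaluate_schedule(tasks, vms, assignment):
    """tasks: {task_id: workload}; vms: {vm_id: (speed, price_per_second)};
    assignment: {task_id: vm_id}. Returns (makespan, cost) for this sketch model."""
    busy_until = defaultdict(float)      # per-VM finish time of its last task
    cost = 0.0
    for task_id, workload in tasks.items():
        vm_id = assignment[task_id]
        speed, price = vms[vm_id]
        exec_time = workload / speed     # execution time of the task on this VM
        busy_until[vm_id] += exec_time   # tasks on a VM run back to back
        cost += exec_time * price        # pay-per-use cost
    makespan = max(busy_until.values())  # overall completion time
    return makespan, cost

# Usage example with hypothetical task workloads and VM speeds/prices
tasks = {"t1": 100.0, "t2": 50.0, "t3": 80.0}
vms = {"vm1": (10.0, 0.02), "vm2": (5.0, 0.01)}
print(evaluate_schedule(tasks, vms, {"t1": "vm1", "t2": "vm2", "t3": "vm1"}))
```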
Step 3, during training of the parallel cloud workflow scheduling model based on the evolutionary reinforcement learning strategy, the time-optimizing sub-Actors and the cost-optimizing sub-Actors take the time-related and cost-related states as input respectively, and the sub-Actor network parameters are updated through the interaction between the Agents and the environment. The model training process is shown in fig. 1; the specific steps are as follows:
Step 3.1, resetting the environment to the initial state. Since no workflow scheduling has been performed at the beginning, the task state list Task_State_List (recording whether each task has been scheduled) and the virtual machine state list VM_State_List (recording virtual machine running time and usage cost) are both empty.
Step 3.2, each time-optimizing sub-Actor and each cost-optimizing sub-Actor in the population detects the environment state of the current time step q, i.e. the Makespan-related state or the Cost-related state, and inputs it into its corresponding sub-Actor network to select action a_q, i.e. the virtual machine assigned to the ready task. Meanwhile, a complementary heuristic strategy is introduced into the action selection process to prevent any Actor from over-optimizing its own target and to balance the two targets Makespan and Cost.
The complementary heuristic strategy is as follows. During the interaction of the reinforcement learning Agents with the environment, each sub-Actor interacts with the environment in parallel within its corresponding population and explores its own scheduling scheme. In each time step, each sub-Actor selects a suitable resource, i.e. a virtual machine, for the ready task according to its network output probability and the heuristic information. The specific application of the complementary heuristic strategy is as follows: for task t_j^i (i.e. the j-th task of the i-th workflow in the parallel workflow set), a random number δ in the range [0,1] is first generated and compared with a preset hyper-parameter δ0. If δ ≤ δ0, the Actor greedily selects the resource with the highest probability, i.e. the highest Actor network output probability after fine-tuning with the complementary heuristic information, denoted λ(i,j,k)·[η(i,j,k)]^α, where α is a weight parameter, and λ(i,j,k) and η(i,j,k) respectively denote the Actor network output probability and the heuristic information for assigning task t_j^i to virtual machine v_k, i being the workflow number, j the task number and k the virtual machine number. The heuristic information is the complementary information of the time-optimizing sub-Actor about the Cost target, or of the cost-optimizing sub-Actor about the Makespan target (for example, the time-optimizing sub-Actor focuses on Makespan optimization, so to avoid its over-optimization of Makespan, heuristic information about the other optimization target, namely Cost, is added). Otherwise, the Actor selects the most suitable virtual machine for task allocation by roulette-wheel selection. The complementary heuristic strategy is shown in the following formula:

v_sel(t_j^i) = argmax_k { λ(i,j,k)·[η(i,j,k)]^α }, if δ ≤ δ0; otherwise roulette-wheel selection is used.

The argmax function returns the index of the maximum element of the input probability vector, i.e. the virtual machine with the largest selection probability, while roulette-wheel selection chooses a virtual machine according to the accumulated probabilities of the resources.
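A minimal sketch of this complementary heuristic action selection follows. The probability vector lam, the heuristic vector eta and the values of alpha and delta0 are illustrative inputs supplied by the caller; whether the roulette step uses the raw or the heuristic-adjusted probabilities is not specified here, so the sketch uses the adjusted scores.

```python
import random

def select_vm(lam, eta, alpha=1.0, delta0=0.9):
    """Complementary heuristic action selection (sketch).
    lam[k]: sub-Actor output probability of assigning the ready task to VM k.
    eta[k]: complementary heuristic information for VM k (the other objective)."""
    scores = [l * (e ** alpha) for l, e in zip(lam, eta)]   # fine-tuned probabilities
    delta = random.random()
    if delta <= delta0:
        # greedy: index of the VM with the largest adjusted probability
        return max(range(len(scores)), key=scores.__getitem__)
    # otherwise roulette-wheel selection on the accumulated probabilities
    total = sum(scores)
    pick, acc = random.uniform(0.0, total), 0.0
    for k, s in enumerate(scores):
        acc += s
        if pick <= acc:
            return k
    return len(scores) - 1

# Example: three candidate VMs with hypothetical probabilities and heuristics
print(select_vm(lam=[0.2, 0.5, 0.3], eta=[0.8, 0.4, 0.6], alpha=1.0, delta0=0.9))
```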
Step 3.3, after each sub-Actor executes action a_q of time step q in its corresponding environment, the environment state is updated to the new state at time step q+1, namely the Makespan-related state or the Cost-related state, and the time reward or cost reward of the current time step q is calculated; at the same time, each sub-Actor stores the experience sequence (current state, action, reward, next state), for Makespan or Cost respectively, into its own playback buffer B. Once the playback buffer is full or overflows, the earliest saved experience sequence is replaced by the latest experience sequence in chronological order.
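The playback buffer described above can be sketched as a fixed-capacity circular buffer; the capacity value and the experience-tuple layout below are assumptions for illustration only.

```python
import random
from collections import deque

class ReplayBuffer:
    """Sketch of the per-sub-Actor playback buffer B: once full, the oldest
    experience sequence is discarded in favour of the newest one."""
    def __init__(self, capacity=10000):          # capacity is an assumed value
        self.buffer = deque(maxlen=capacity)     # deque drops the oldest item on overflow

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniform random sampling, as used for sub-Actor training."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```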
Each sub-Actor randomly extracts experience sequences from its own playback buffer at a preset frequency and trains its network with them, optimizing its action-selection strategy and improving search and solution quality; meanwhile, this ensures that the strategies of the sub-Actors in the P1 and P2 populations are updated towards optimizing completion time and cost, respectively.
Step 3.4, after one interaction ends, check the task state list to judge whether all tasks are finished. If some tasks have not been executed, go to step 3.2; if all tasks have been executed, the experience sequence of the scheduling round is stored in buffer B (when B is full or overflows, the earliest stored experience sequences are replaced in turn by the latest ones in chronological order), and step 3.5 is performed.
Step 3.5, extract experience sequences from B by uniform random sampling, learn and train the Actor networks on the extracted data so as to update the parameters of each sub-Actor network and optimize the action-selection strategies, and then execute step 3.6.
Step 3.6, judge whether the accumulated number of complete scheduling rounds has reached the preset upper limit. If so, model training is complete; otherwise, go to step 3.1.
Step 4, collecting all sub-Actors in the P1 and P2 populations, evaluating them and performing non-dominated sorting according to the two optimization targets Makespan and Cost, and storing all obtained non-dominated sub-Actors into a set to obtain the Pareto front solution set; then all sub-Actors in this set respectively process the newly input parallel workflow applications and output their corresponding scheduling schemes through forward prediction, which yields a set of non-dominated workflow scheduling schemes; this scheduling scheme set is the output scheduling scheme.
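A minimal sketch of collecting the non-dominated sub-Actors by their (Makespan, Cost) evaluations is given below; representing each sub-Actor as a (name, makespan, cost) tuple is an illustrative assumption, and both objectives are treated as minimized.

```python
def non_dominated(actors):
    """Return the sub-Actors whose (makespan, cost) pair is not dominated by any
    other sub-Actor. Each element is assumed to be a (name, makespan, cost) tuple."""
    front = []
    for i, a in enumerate(actors):
        dominated = any(
            j != i
            and b[1] <= a[1] and b[2] <= a[2]          # no worse on both objectives
            and (b[1] < a[1] or b[2] < a[2])           # strictly better on at least one
            for j, b in enumerate(actors)
        )
        if not dominated:
            front.append(a)
    return front

# Example with hypothetical sub-Actor evaluations: A3 is dominated by A1
actors = [("A1", 120.0, 3.0), ("A2", 100.0, 4.5), ("A3", 130.0, 3.5)]
print(non_dominated(actors))
```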
In the prior art, a DDPG network comprises two online networks and two target networks. The online networks consist of the Actor's online policy network and the Critic's online Q network, whose parameters are denoted θ^μ and θ^Q respectively; the target networks consist of the Actor's target policy network and the Critic's target Q network, whose parameters are denoted θ^μ′ and θ^Q′ respectively.
The two online networks are updated as follows: the Actor's online policy network applies gradient ascent and updates its parameters θ^μ according to the score of the last action returned by the Critic; the Critic is a value-function-based method that evaluates the quality of the action selected by the Actor by computing the Q value of the chosen action strategy. Thus the Critic uses gradient descent to update the online Q network parameters θ^Q by minimizing the loss function of the Q value.
The target policy network and the target Q network adopt exactly the same structure as the online policy network and the online Q network, respectively. To make the Critic's learning process more stable, the parameters of the two target networks, θ^μ′ and θ^Q′, are updated by soft update, as shown in formula (2):

θ^μ′ ← ω·θ^μ + (1 - ω)·θ^μ′,  θ^Q′ ← ω·θ^Q + (1 - ω)·θ^Q′   (2)

where ω is a parameter that controls the update amplitude.
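Formula (2) can be sketched as follows; parameters are represented as plain lists of floats, and ω = 0.005 is an assumed value rather than one taken from the patent.

```python
def soft_update(target_params, online_params, omega=0.005):
    """Soft update of target-network parameters, following formula (2).
    target_params and online_params are flat lists of floats (sketch)."""
    return [omega * o + (1.0 - omega) * t
            for t, o in zip(target_params, online_params)]

# Example: target parameters move slightly toward the online parameters
print(soft_update(target_params=[0.0, 1.0], online_params=[1.0, 0.0], omega=0.1))
```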
In addition, because the neural networks in deep reinforcement learning algorithms are trained with gradient-based methods, their learning ability is strong but their convergence and stability are generally poor. Therefore, to further improve the diversity of action-selection strategies and keep the search from falling into a local optimum, the invention adds an iterative evolution stage on top of the reinforcement-learning-stage model training of step 3, and selects the final Actor network parameters based on the results of both the reinforcement learning stage and the iterative evolution stage to predict the workflow scheduling scheme. Specifically, an evolutionary algorithm is used to realize knowledge exchange and co-evolution of the time population P1 and the cost population P2, namely: first, the populations P1 and P2 are divided into several small groups, and the worst sub-Actors of all groups of P1 and P2 are stored in a set to be learned H_W; next, an external elite archive set H_A is constructed to store the excellent sub-Actors of the populations, and the elite sub-Actors in H_A are used to guide and update the sub-Actors in H_W. Specifically, the following steps are performed between step 3.4 and step 3.6:
S1, judge whether the number of experience sequences stored in buffer B has reached the preset capacity; if not, execute S2; if it has, execute step 3.5 first and then execute S2.
S2, executing a group updating strategy.
Let l_m be the size of population P_m, m ∈ {1,2}, where P_1 and P_2 denote the time population and the cost population, respectively. At the beginning of an iteration, population P_m is randomly divided into N_m groups of size l_m/N_m, where n ∈ {1,2,...,N_m} indexes the groups. Note that if l_m is not exactly divisible by N_m, the last group of P_m will also contain the remaining Actors. After P_m is grouped, the different groups evolve independently. In the whole dynamic-grouping learning process of the population, the group update strategy of P_m is as follows:
In the grouped population P_m, each group can be regarded as a "large Actor"; g_m denotes the optimal sub-Actor of population P_m, and for each n ∈ {1,2,...,N_m} the best sub-Actor and the worst sub-Actor of the n-th group of P_m are identified. At each iteration, the worst sub-Actor of each group evolves under the guidance of the global elite individuals of the entire populations (P_1 and P_2) and of the excellent individuals within its own population (P_1 or P_2), while the other sub-Actors of the group enter the next-generation population directly. That is, one iteration only updates the worst sub-Actor of each group, i.e. population P_m updates only N_m particles per iteration.
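A minimal sketch of this random grouping, with any remainder joining the last group, is shown below; the population is represented simply as a list of sub-Actor identifiers.

```python
import random

def split_into_groups(population, n_groups):
    """Randomly divide a population of sub-Actors into n_groups groups of
    (roughly) equal size; leftover sub-Actors are appended to the last group."""
    shuffled = population[:]
    random.shuffle(shuffled)
    size = len(shuffled) // n_groups
    groups = [shuffled[i * size:(i + 1) * size] for i in range(n_groups)]
    groups[-1].extend(shuffled[n_groups * size:])   # remainder joins the last group
    return groups

# Example: 10 sub-Actors into 3 groups -> group sizes 3, 3, 4
print([len(g) for g in split_into_groups(list(range(10)), 3)])
```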
S3, constructing an external elite archive set and a group update mechanism based on mixed elite guidance. The mixed elite guidance strategy focuses on the elite solutions generated in each generation, including the globally optimal sub-Actor of each population and the historical optimal solution of every sub-Actor. Assume that: the set H_E stores all elite individuals from both populations P_1 and P_2, namely the globally optimal sub-Actors of P_1 and P_2 and the historical optimal solutions of each sub-Actor; a further set stores the non-dominated solutions of H_E, i.e. the non-dominated solution set of H_E; and the set H_A stores the final non-dominated elite solution set, so H_A is also called the external elite archive set. It should be noted that if the size of the non-dominated solution set of H_E is less than κ (κ being the preset capacity of the final non-dominated elite solution set), then all of its non-dominated solutions are saved to H_A; otherwise its solutions are sorted in descending order of crowding distance and the first κ solutions are output into H_A.
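The archive update described above can be sketched as a non-dominated filter followed by crowding-distance truncation. The (makespan, cost) tuple representation, the crowding-distance formula (borrowed from NSGA-II style sorting) and the example values are illustrative assumptions.

```python
def pareto_front(points):
    """Non-dominated (makespan, cost) points, both objectives minimized (sketch)."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

def crowding_distance(front):
    """Crowding distance for a list of 2-objective points (NSGA-II style sketch)."""
    n, dist = len(front), [0.0] * len(front)
    for obj in (0, 1):
        order = sorted(range(n), key=lambda i: front[i][obj])
        dist[order[0]] = dist[order[-1]] = float("inf")    # boundary points kept
        span = (front[order[-1]][obj] - front[order[0]][obj]) or 1.0
        for pos in range(1, n - 1):
            dist[order[pos]] += (front[order[pos + 1]][obj]
                                 - front[order[pos - 1]][obj]) / span
    return dist

def update_elite_archive(candidates, kappa):
    """H_A update sketch: keep all non-dominated elites if at most kappa,
    otherwise keep the kappa with the largest crowding distance."""
    front = pareto_front(candidates)
    if len(front) <= kappa:
        return front
    dist = crowding_distance(front)
    keep = sorted(range(len(front)), key=lambda i: dist[i], reverse=True)[:kappa]
    return [front[i] for i in keep]

# Example with hypothetical (makespan, cost) evaluations of elite sub-Actors
print(update_elite_archive([(120, 3.0), (100, 4.5), (110, 4.0), (130, 2.8)], kappa=3))
```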
Meanwhile, the worst sub-Actor of each group of the P1 and P2 populations is stored in the set to be learned H_W, and the worst sub-Actors of all groups of the P1 and P2 populations are guided and updated by the elite sub-Actors in the external elite archive set H_A. That is, the sub-Actors in the set to be learned H_W are guided to evolve by the sub-Actors in H_A, realizing group collaborative learning based on mixed-elite guidance; the update is similar to the particle update of PSO. The position of a sub-Actor stores its DDPG network parameter information, and the specific update formulas are as follows:

V_e(d+1) = ψ0·V_e(d) + ψ1·r1·(P_e(d) - X_e(d)) + ψ2·r2·(G_e(d) - X_e(d)) + ψ3·r2·(E(d) - X_e(d))
X_e(d+1) = X_e(d) + V_e(d+1)

where X_e(d) and V_e(d) denote the position and velocity of the e-th sub-Actor in H_W at the d-th iteration, X_e(d+1) and V_e(d+1) denote its position and velocity at the (d+1)-th iteration, P_e(d) denotes the historical optimal position of the e-th sub-Actor in H_W at the d-th iteration, G_e(d) is the position, after d iterations, of the optimal solution of the population to which the e-th sub-Actor in H_W belongs, E(d) denotes the position of the elite sub-Actor randomly selected from the elite archive set H_A at the d-th iteration to guide the e-th sub-Actor in H_W, ψ0, ψ1, ψ2 and ψ3 are weight parameters, and r1 and r2 are random numbers between 0 and 1 that enhance the randomness of the particle search, improve population diversity, and help explore a more promising Pareto-optimal solution.
It should be noted that during each population iteration, the worst sub-Actor of each group updates its parameters not only by learning from its own historical best, from the best sub-Actor of its group and from g_m (i.e. the optimal solution of population P_m), but also by moving and evolving toward the elite sub-Actors in the external archive set H_A, which guarantees the diversity of the population search and produces a high-quality Pareto front solution set. It can be seen that the set H_A stores the elite solutions generated throughout the evolution of P_1 and P_2, i.e. at each iteration elite individuals are selected from the mixed set by non-dominated ranking and crowding-distance ranking. In other words, the solutions that H_A finally retains are the result of the evolution and competition of the mixed elite individuals of all populations according to the survival-of-the-fittest rule; this process is also called the external archive update mechanism based on mixed elite retention.
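A minimal sketch of this elite-guided update, applied to flattened network parameters treated as a particle position, is given below. The weight values and the placement of the random factors are illustrative assumptions rather than the exact settings of the patent.

```python
import random

def pso_guided_update(position, velocity, personal_best, group_best, elite,
                      psi=(0.5, 1.0, 1.0, 1.0)):
    """Sketch of the elite-guided update of a worst sub-Actor in H_W.
    position/velocity are flat lists of network parameters; psi values and the
    random-factor placement are assumptions."""
    psi0, psi1, psi2, psi3 = psi
    r1, r2 = random.random(), random.random()
    new_velocity = [
        psi0 * v
        + psi1 * r1 * (pb - x)      # pull toward the sub-Actor's own historical best
        + psi2 * r2 * (gb - x)      # pull toward the best of its own population
        + psi3 * r2 * (el - x)      # pull toward a randomly chosen elite from H_A
        for x, v, pb, gb, el in zip(position, velocity, personal_best, group_best, elite)
    ]
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity

# Example on a toy 3-parameter "network"
print(pso_guided_update([0.1, 0.2, 0.3], [0.0, 0.0, 0.0],
                        [0.2, 0.1, 0.4], [0.3, 0.3, 0.3], [0.25, 0.2, 0.35]))
```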
Based on the results of the reinforcement learning stage and the iterative evolution stage, the original step 4 is implemented as follows: a batch of new parallel workflow applications is input into the model trained in step 3, and the corresponding parallel workflow scheduling schemes are output through forward prediction.
Further, for a given population of size l_m, the larger the groups, the more particles each group contains, which means that selecting and updating only the worst sub-Actor of a larger group is relatively greedy and may impair the exploration ability of the population; conversely, smaller groups each contain fewer particles, and updating the worst sub-Actor selected from a smaller group makes the search process more diversified. Therefore, to control the learning intensity and accelerate algorithm convergence, the invention adds a dynamic group-size adjustment strategy to S2, i.e. the group size increases dynamically as the time steps progress. Specifically, the early evolution stage should emphasize exploration, so the group size can be appropriately reduced to keep the groups from falling into local optima and to improve search diversity; in the late evolution stage the search has approached or found the globally optimal region, so the group size can be appropriately increased to enhance the exploitation of potentially superior individuals and accelerate convergence, thereby striking a balance between population diversity and fast convergence. At the same time, the parallel iterative exploration of multiple small groups and the continuous evolution of the whole population reduce, to a certain extent, the sensitivity of the reinforcement learning model to its parameters.
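One simple way to realize such a schedule is a linear increase of the group size with training progress, as sketched below; the linear form and the size bounds are illustrative assumptions.

```python
def group_size(step, total_steps, min_size=2, max_size=8):
    """Dynamic group-size schedule (sketch): small groups early to favour
    exploration, larger groups later to favour exploitation."""
    progress = min(max(step / float(total_steps), 0.0), 1.0)
    return int(round(min_size + progress * (max_size - min_size)))

# Example: the size grows from 2 to 8 across training
print([group_size(s, 100) for s in (0, 25, 50, 75, 100)])   # [2, 4, 5, 6, 8]
```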
Therefore, the invention realizes two-stage optimization consisting of a reinforcement learning stage and an iterative evolution stage. Specifically, in the reinforcement learning stage, the time-optimizing sub-Actors and the cost-optimizing sub-Actors take the time-related and cost-related states as input respectively, and the first-stage update of the sub-Actor network parameters is realized through the interaction of the Agents with the environment. In the iterative evolution stage, a dynamic grouping learning mechanism and a mixed elite guidance strategy are designed: the learning intensity is controlled by dynamically adjusting the group size, and at each group iteration the mixed elite guidance strategy is used so that the worst sub-Actor of each group is guided by elite individuals to learn and update, realizing the second-stage optimization of the sub-Actor network parameters of the whole population.
To verify the effectiveness of the parallel workflow scheduling method based on the evolutionary reinforcement learning strategy, the parallel cloud workflow scheduling algorithm and the workflow scheduling simulation environment were implemented in Python, and their performance was verified through comparison experiments. To make the experimental results more objective and realistic, five typical scientific workflows of different scales (CyberShake, Epigenomics, Inspiral, Montage and Sipht) were combined to construct two data sets corresponding to two parallel cloud workflow applications. Typical multi-objective cloud workflow scheduling algorithms, namely MOPSO, NSGA-II, MOACS and WDDQN-RL, were selected as baseline comparison algorithms.
First, to verify the convergence of the network model of the dynamic-grouping-based evolutionary reinforcement learning algorithm, the hypervolume (HV) indicator is introduced to evaluate the Pareto front of the scheduling scheme set obtained by each algorithm. To intuitively illustrate the convergence of the model of the invention (i.e. DG-ERL), the trend of the hypervolume of the scheduling solution sets generated by each algorithm during training on the two groups of parallel cloud workflows is recorded, as shown in fig. 2 and 3. As can be seen from fig. 2 and 3, as the number of training rounds increases, the reinforcement learning model for parallel cloud workflow scheduling designed by the invention tends to converge, which illustrates its feasibility.
Second, to further test the scheduling performance of the reinforcement learning model and evaluate the quality and diversity of the generated scheduling solution sets, each group of parallel cloud workflow applications was scheduled by the five algorithms in the same scheduling simulation environment; the generated scheduling schemes were evaluated to obtain their completion time and total cost, and the corresponding hypervolume values were recorded. The experimental results are shown in figs. 4, 5 and 7. As can be seen from figs. 4 and 5, on both parallel cloud workflow data sets the proposed method obtains a better Pareto front solution, and its advantage becomes more obvious as the number of tasks in the data sets grows. As can be seen from fig. 7, the scheduling scheme sets of the model of the invention have higher hypervolume values, which indicates better diversity of the scheduling solution sets and further demonstrates its superiority.
Finally, to test the running efficiency of the invention, the running time of each algorithm was also recorded, as shown in fig. 6. As can be seen from fig. 6, the proposed method obtains a scheduling solution set closer to the optimal Pareto front than MOPSO and NSGA-II in less time. Although its running time is slightly longer than that of WDDQN-RL, the quality of its scheduling solution set is significantly better. In summary, the parallel cloud workflow scheduling method based on the evolutionary reinforcement learning strategy obtains a better Pareto front solution set and also has certain advantages in running efficiency.
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (4)

1. The parallel cloud workflow scheduling method based on the evolution reinforcement learning strategy is characterized by comprising the following steps of:
Step 1, a parallel cloud workflow scheduling model based on the evolutionary reinforcement learning strategy is built using the DDPG algorithm, wherein the parallel cloud workflow scheduling model comprises a time population P1 and a cost population P2, P1 and P2 being used to optimize workflow execution time and cost respectively; individuals in the populations serve as the agents of the DDPG network, the time population comprises several time-optimizing sub-Actors, and the cost population comprises several cost-optimizing sub-Actors;
Step 2, calculating the time optimization target value Makespan and the Cost optimization target value Cost of the tasks to be scheduled when executed by the virtual machines in the resource pool, and taking Makespan and Cost as the input states of the time-optimizing sub-Actors in P1 and the cost-optimizing sub-Actors in P2, respectively;
Step 3, during training of the parallel cloud workflow scheduling model, the time-optimizing sub-Actors and the cost-optimizing sub-Actors respectively take the time-related and cost-related states as input, and the sub-Actor network parameters are updated through interaction between the Agents and the environment;
Step 4, collecting all sub-Actors in the P1 and P2 populations, evaluating them and performing non-dominated sorting according to the two optimization targets Makespan and Cost, and storing all obtained non-dominated sub-Actors into a set to obtain the Pareto front solution set; then all sub-Actors in this set respectively process the newly input parallel workflow applications and output their corresponding scheduling schemes through forward prediction, obtaining a set of non-dominated workflow scheduling schemes, which is the output scheduling scheme;
The training process of the parallel cloud workflow scheduling model comprises the following steps:
Step 2.1, resetting the environment to the initial state, and clearing the task state list Task_State_List and the virtual machine state list VM_State_List;
Step 2.2, for the time-optimizing sub-Actors and the cost-optimizing sub-Actors in the populations, respectively detecting the Makespan-related environment state and the Cost-related environment state at the current time step q, inputting them into the corresponding sub-Actor networks, and selecting an action a_q in combination with the complementary heuristic strategy to obtain the virtual machine allocated to the ready task; the complementary heuristic strategy is shown in the following formula:

v_sel(t_j^i) = argmax_k { λ(i,j,k)·[η(i,j,k)]^α }, if δ ≤ δ0; otherwise the virtual machine is selected by roulette-wheel selection;

where v_sel(t_j^i) is the virtual machine allocated to the ready task t_j^i, λ(i,j,k) and η(i,j,k) respectively denote the Actor network output probability and the heuristic information for assigning task t_j^i to virtual machine v_k, α is a weight parameter, i is the number of the workflow, j is the number of the task, k is the number of the virtual machine, δ is a random number in the range [0,1], δ0 is a preset hyper-parameter, and argmax returns the index of the virtual machine with the largest selection probability;
Step 2.3, after the sub-Actor executes action a_q of time step q, the environment state is updated to the new Makespan-related or Cost-related state at time step q+1, and the time reward or cost reward of the current time step q is calculated; at the same time, the sub-Actor stores the experience sequence (current state, action a_q, reward, next state) into the playback buffer B;
Step 2.4, after one interaction ends, if the task state list still contains tasks whose state is not-executed, execute step 2.2; if all tasks have been executed, store the experience sequence of the scheduling round into buffer B and execute step 2.5;
Step 2.5, extracting experience sequences from buffer B by uniform random sampling, and learning and training the Actor networks on the extracted data, so as to update the parameters of each sub-Actor network and optimize the action-selection strategies;
Step 2.6, judging whether the accumulated number of complete scheduling rounds reaches the threshold; if so, training of the parallel cloud workflow scheduling model is complete; otherwise, execute step 2.1;
the method further comprises the following steps after the step 2.4:
S1, if the number of experience sequences stored in buffer B reaches the preset capacity, execute step 2.5 first and then execute S2; otherwise, execute S2;
S2, dividing the P1 and P2 populations into several groups, with individuals in different groups evolving independently; selecting the globally optimal sub-Actors of the populations and the historical optimal solution of each sub-Actor to form an elite archive set H_A, selecting the worst sub-Actor of each group to form a set to be learned H_W, and using the elite solutions in H_A to guide the update of the worst sub-Actors in H_W; the update process is shown in the following formulas:

V_e(d+1) = ψ0·V_e(d) + ψ1·r1·(P_e(d) - X_e(d)) + ψ2·r2·(G_e(d) - X_e(d)) + ψ3·r2·(E(d) - X_e(d))
X_e(d+1) = X_e(d) + V_e(d+1)

where X_e(d) and V_e(d) denote the position and velocity of the e-th sub-Actor in H_W at the d-th iteration, X_e(d+1) and V_e(d+1) denote the position and velocity of the e-th sub-Actor in H_W at the (d+1)-th iteration, P_e(d) denotes the historical optimal position of the e-th sub-Actor in H_W at the d-th iteration, G_e(d) is the position, after d iterations, of the optimal solution of the population to which the e-th sub-Actor in H_W belongs, E(d) denotes the position of the elite sub-Actor randomly chosen from H_A at the d-th iteration and used to guide the e-th sub-Actor in H_W, ψ0, ψ1, ψ2 and ψ3 are weight parameters, and r1 and r2 are random numbers between 0 and 1.
2. The parallel cloud workflow scheduling method of claim 1, wherein in the step 2.3 or the step 2.4, when the buffer B is full or overflows, the earliest stored experience sequence is replaced with the latest experience sequence in time sequence.
3. The parallel cloud workflow scheduling method of claim 1, wherein step 4 is: inputting the new parallel workflow applications into the model trained in step 3, and outputting the corresponding parallel workflow scheduling schemes through forward prediction.
4. The parallel cloud workflow scheduling method of claim 1, wherein the manner of dividing the P1 and P2 populations into several groups in S2 is: the group size increases dynamically as the time steps progress.
CN202210537383.0A 2022-05-17 2022-05-17 Parallel cloud workflow scheduling method based on evolution reinforcement learning strategy Active CN114860385B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210537383.0A CN114860385B (en) 2022-05-17 2022-05-17 Parallel cloud workflow scheduling method based on evolution reinforcement learning strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210537383.0A CN114860385B (en) 2022-05-17 2022-05-17 Parallel cloud workflow scheduling method based on evolution reinforcement learning strategy

Publications (2)

Publication Number Publication Date
CN114860385A CN114860385A (en) 2022-08-05
CN114860385B true CN114860385B (en) 2024-06-07

Family

ID=82636731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210537383.0A Active CN114860385B (en) 2022-05-17 2022-05-17 Parallel cloud workflow scheduling method based on evolution reinforcement learning strategy

Country Status (1)

Country Link
CN (1) CN114860385B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191934A (en) * 2019-12-31 2020-05-22 北京理工大学 Multi-target cloud workflow scheduling method based on reinforcement learning strategy
CN111813500A (en) * 2020-07-09 2020-10-23 西北工业大学 Multi-target cloud workflow scheduling method and device
CN112231091A (en) * 2020-11-05 2021-01-15 北京理工大学 Parallel cloud workflow scheduling method based on reinforcement learning strategy
CN112685138A (en) * 2021-01-08 2021-04-20 北京理工大学 Multi-workflow scheduling method based on multi-population hybrid intelligent optimization in cloud environment
CN112685165A (en) * 2021-01-08 2021-04-20 北京理工大学 Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy
CN114461368A (en) * 2022-03-16 2022-05-10 南京航空航天大学 Multi-target cloud workflow scheduling method based on cooperative fruit fly algorithm

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150067679A1 (en) * 2013-08-28 2015-03-05 Connectloud, Inc. Method and apparatus for software defined cloud workflow recovery
US11674384B2 (en) * 2019-05-20 2023-06-13 Schlumberger Technology Corporation Controller optimization via reinforcement learning on asset avatar
US12069114B2 (en) * 2020-04-24 2024-08-20 Ringcentral, Inc. Cloud-based communication system for autonomously providing collaborative communication events

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191934A (en) * 2019-12-31 2020-05-22 北京理工大学 Multi-target cloud workflow scheduling method based on reinforcement learning strategy
CN111813500A (en) * 2020-07-09 2020-10-23 西北工业大学 Multi-target cloud workflow scheduling method and device
CN112231091A (en) * 2020-11-05 2021-01-15 北京理工大学 Parallel cloud workflow scheduling method based on reinforcement learning strategy
CN112685138A (en) * 2021-01-08 2021-04-20 北京理工大学 Multi-workflow scheduling method based on multi-population hybrid intelligent optimization in cloud environment
CN112685165A (en) * 2021-01-08 2021-04-20 北京理工大学 Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy
CN114461368A (en) * 2022-03-16 2022-05-10 南京航空航天大学 Multi-target cloud workflow scheduling method based on cooperative fruit fly algorithm

Also Published As

Publication number Publication date
CN114860385A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
Mansouri et al. Hybrid task scheduling strategy for cloud computing by modified particle swarm optimization and fuzzy theory
Dong et al. Task scheduling based on deep reinforcement learning in a cloud manufacturing environment
CN111191934B (en) Multi-target cloud workflow scheduling method based on reinforcement learning strategy
CN112016812B (en) Multi-unmanned aerial vehicle task scheduling method, system and storage medium
CN109753751B (en) MEC random task migration method based on machine learning
Zhou et al. Deep reinforcement learning-based methods for resource scheduling in cloud computing: A review and future directions
CN112685165B (en) Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy
CN112346839A (en) Associated task scheduling method based on evolutionary algorithm
CN112685138B (en) Multi-workflow scheduling method based on multi-population hybrid intelligent optimization in cloud environment
George et al. A survey on optimization algorithms for optimizing the numerical functions
CN114710439A (en) Network energy consumption and throughput joint optimization routing method based on deep reinforcement learning
Neumann et al. A didactic review on genetic algorithms for industrial planning and scheduling problems
CN114154685A (en) Electric energy data scheduling method in smart power grid
CN117519244B (en) Unmanned plane cluster collaborative detection multi-target path planning method and system
CN114860385B (en) Parallel cloud workflow scheduling method based on evolution reinforcement learning strategy
CN110362378A (en) A kind of method for scheduling task and equipment
Montazeri et al. A new approach to the restart genetic algorithm to solve zero-one knapsack problem
Han et al. A deep reinforcement learning based multiple meta-heuristic methods approach for resource constrained multi-project scheduling problem
Sharma et al. Power law-based local search in artificial bee colony
CN113220437B (en) Workflow multi-target scheduling method and device
Cheraghchi et al. Distributed multi-objective cooperative coevolution algorithm for big-data-enabled vessel schedule recovery problem
Kalpana et al. A Deep Reinforcement Learning-Based Task Offloading Framework for Edge-Cloud Computing
Swathi et al. Cloud service selection system approach based on QoS model: A systematic review
Tungom et al. A Performance Class-Based Particle Swarm Optimizer
Singh et al. A GA based job scheduling strategy for computational grid

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant