CN111274009B

CN111274009B - Data intensive workflow scheduling method based on stage division in cloud environment

Info

Publication number: CN111274009B
Application number: CN202010033432.8A
Authority: CN
Inventors: 陈俊宇; 刘茜萍
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2020-01-13
Filing date: 2020-01-13
Publication date: 2022-08-30
Anticipated expiration: 2040-01-13
Also published as: CN111274009A

Abstract

The invention discloses a data-intensive workflow scheduling method based on stage division in a cloud environment. The method comprises: abstracting the workflow structure; defining candidate service providers for the tasks; determining a workflow scheduling framework; dividing the workflow into stages according to data dependence; and, for the current stage, calculating the completion time of each task on each of its candidate service providers, arranging the completion times into a matrix, and allocating the tasks of the current stage, repeating until the tasks of all stages are allocated. The method takes the data transmission time of data-intensive workflows into account and improves workflow execution efficiency.

Description

Data intensive workflow scheduling method based on stage division in cloud environment

Technical Field

The invention belongs to the field of cloud computing, and particularly relates to a data intensive workflow scheduling method based on stage division in a cloud environment.

Background

Cloud computing is a business computing model that provides convenient, low-cost, readily available computing resources as services, with advantages such as low service and maintenance costs and flexible control. A workflow is the computer-supported automation of a business process, in whole or in part. The Workflow Management Coalition defines a workflow as the automation of a business process, in whole or in part, during which documents, information, or tasks are passed along and handled according to a set of procedural rules. Workflows under the cloud computing model can support a variety of complex information applications, such as climate modeling, seismic modeling, and weather forecasting. Particularly in interdisciplinary fields such as bioinformatics and climate simulation, workflows are often data intensive and require large-scale computing resources to process gigabytes or terabytes of input data. The purpose of cloud workflow scheduling is to solve the task scheduling problem of a workflow management system in a cloud computing environment, deploying tasks to different service providers in the cloud by means of a suitable scheduling method. Current research optimizes the execution time and cost of workflows through various scheduling algorithms and provides a theoretical basis for scheduling in practical applications, so as to improve workflow efficiency and save resources.

Most existing workflow scheduling methods do not consider the influence of data transmission time when optimizing execution cost and completion time. However, in many data-intensive workflow applications, the data transmission time of a task is not negligible compared with its execution time. For this class of scheduling problem, this patent proposes a data-intensive workflow scheduling method based on stage division.

Disclosure of Invention

Purpose of the invention: to address the current lack of consideration of the influence of transmission time in data-intensive workflows, the invention provides a data-intensive workflow scheduling method based on stage division in a cloud environment.

The technical scheme is as follows: in order to achieve the purpose, the invention adopts the technical scheme that:

a data intensive workflow scheduling method based on stage division in a cloud environment comprises the following steps:

Step 1, abstracting the workflow structure: acquire the workflow information, build a DAG from it, and represent the workflow by the DAG as $W = \langle T, D \rangle$, where $W$ is the workflow to be scheduled; $T = \{t_i \mid i = 1 \ldots n\}$ is the task set of $W$ and $t_i$ is the $i$-th task of $W$; $D = \{d_{ij} \mid i, j = 1 \ldots n\}$ is the set of transmitted data volumes in $W$ and $d_{ij}$ is the amount of data that $t_i$ needs to transfer to $t_j$. The transfer time is $d_{ij} / bw$, where $bw$ is the transmission bandwidth between $t_i$ and $t_j$.

Step 2, defining a task candidate service provider:

In the cloud environment there are a number of service providers in different regions, and each service provider rents one server whose hardware resources it uses to execute various computing tasks. The description of each computing task in the workflow does not contain specific processing details, so a task can be completed by several candidate service providers running different algorithms; the same task handed to different service providers therefore has different execution times.

Each task may have several candidate service providers. Each service provider corresponds to one server and can handle several tasks, but not necessarily every task. Cloud workflow scheduling decides which service provider each task is handed to for completion, that is, on which server the task is scheduled for execution.

When $d_{ij} > 0$, data must be transferred directly between the two tasks. If the two tasks select the same service provider, the data transmission time is 0; otherwise it cannot be ignored. If a service provider can process only a single task, both the input and the output data of that task must be transferred between this service provider and other service providers.

The service provider set $S$ contains $m$ service providers $s_p$, $p = 1 \ldots m$, that participate in scheduling the workflow. $ST = \{ST_p \mid p = 1 \ldots m\}$ with $ST_p = \{\langle t_i, et_{pi} \rangle \mid t_i \in T,\ i \in \{1 \ldots n\}\}$, where $et_{pi}$ is the execution time of $t_i$ on service provider $s_p$, $ST$ is the set of all $ST_p$, $ST_p$ lists every task $t_i$ that $s_p$ can execute, $n$ is the number of tasks, and $\langle t_i, et_{pi} \rangle$ pairs task $t_i$ with its execution time on $s_p$.

Step 3, the workflow $W$ is divided into several stages for scheduling, and the completion time of all tasks in the current stage is made as early as possible. Stage-by-stage scheduling computes a good scheduling result for each stage in turn, so that the final scheduling result is relatively good; because of this staging, the strategy is best suited to workflows whose inter-task transmission times are fairly uniform. According to the contents of $ST_p$, candidate service providers are determined for each task and the best one is selected, until all tasks are allocated. The scheduling result is the set $R$ of the allocation results of all tasks, where $r_i$ is the allocation result of $t_i$ and must satisfy the condition that $\langle t_i, et_{pi} \rangle$ belongs to $ST_p$; here $s_p$ is the service provider finally selected for $t_i$ and $rft_i$ is the actual completion time of $t_i$.

Step 4, a task in the workflow can execute only after its predecessor tasks have finished and their data has been transferred to it. The potential temporal precedence among tasks is derived from data dependence, so the division of the workflow into allocation stages is based mainly on data dependence. $TS_u = \{t_i \mid t_i \text{ is a task of the } u\text{-th stage}\}$, $u = 1, 2, \ldots, l$, where $TS_u$ is the task set of the $u$-th stage and $|TS_u|$ is the number of tasks in the $u$-th stage.

Step 5, task scheduling:

step 51, candidate completion time calculation

For a task $t_j$ to be allocated, the set of its candidate service providers and corresponding execution times is $CS_j = \{\langle s_p, et_{pj} \rangle \mid s_p \in S,\ t_j \in T\}$, where $s_p$ is a service provider and $S$ is the set of service providers. When several already allocated tasks $t_i$ point to the task $t_j$ to be allocated, the completion time $ft_{pj}$ of $t_j$ on each service provider $s_p$ in $CS_j$ is calculated;

Step 52, the completion times of every task of the current stage on its candidate service providers are arranged in a matrix $A_{u(n \times m)}$ in order to carry out the allocation. $A_{u(n \times m)} = [\langle s_p, ft_{pi} \rangle]_{n \times m}$ is the matrix of completion times of all tasks to be allocated in stage $u$ when executed by their different candidate service providers. Each row holds the completion times of one task on its different candidate service providers, arranged in ascending order; $ft_{pi}$ is the completion time of $t_i$ when executed by $s_p$.

Step 53, based on $A_{u(n \times m)}$, determine the candidate service providers $FS_i$ participating in the allocation and the minimum column $x$

First, $A_{u(n \times m)}$ is used to determine $FS_i$, the set of candidate service providers newly added by the $i$-th column, $FS_i = \{s_p \mid s_p \in CS\}$, where $|FS_i|$ denotes the size of the set. Then the minimum $x$ is found for which the sets $FS_1, \ldots, FS_x$ satisfy the condition ensuring that the number of service providers participating in the allocation exceeds the number of tasks in the stage, thereby achieving physical parallelism.

Step 54, allocate the current stage based on $FS_i$ and $x$

Based on the candidate service providers $FS_i$ participating in the allocation and the minimum column $x$ obtained in step 53, all tasks of the stage are assigned to different service providers. Take the entries in column $x$ of $A_{u(n \times m)}$ whose service providers belong to $FS_x$ and sort them by $ft_{pi}$ in ascending order, giving the sorted result $\{\langle s_p, ft_{pi} \rangle, \ldots\}$ with $s_p \in FS_x$. Select the smallest $ft_{pi}$ in the sorted result and confirm that the task executed by that service provider $s_p$ is $t_i$; the other tasks of the current stage drop $s_p$ from their candidate service providers. Then check whether the screening condition holds: the total number of candidate service providers of the remaining tasks is greater than or equal to the number of unallocated tasks. If it holds, a feasible solution may exist among the remaining candidate service providers.

If the condition is met, i.e. a feasible solution may currently exist, update the matrix: discard the row of $t_i$ and remove the $s_p$ selected for $t_i$ from the other tasks, then recompute $FS_i$ and $x$ for the new matrix from step 53 and judge the screening condition again. If every service provider in the sorted result has been screened without finding a feasible solution, select the $s_p$ with the smallest completion time in the sorted result and confirm the task that selects it; the remaining tasks of the current stage drop this $s_p$, and the matrix is updated.

Preferably, the division of the workflow into allocation stages in step 4 proceeds as follows:

Step 41, the in-degree of the start node is 0 and no node reaches it, so the start node is placed first, i.e. it is assigned to the initial stage.

Step 42, remove the nodes that have already been assigned to a stage, and select the nodes of the next stage from the nodes remaining in the reduced graph; these nodes must have in-degree 0 in the current graph.

Step 43, after the division of the previous stage is finished, repeat step 42 until all nodes have been assigned to a stage.
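As an illustration only, steps 41 to 43 can be sketched in Python as repeated removal of in-degree-0 nodes. The function name `divide_stages` and the five-task example graph are assumptions made for this sketch and do not come from the specification.

```python
def divide_stages(tasks, edges):
    """Split a DAG into stages by repeatedly peeling off in-degree-0 nodes.

    tasks: iterable of task ids; edges: iterable of (i, j) pairs, t_i -> t_j.
    Returns a list of stages TS_1, TS_2, ..., each a sorted list of task ids.
    """
    indegree = {t: 0 for t in tasks}
    successors = {t: [] for t in indegree}
    for i, j in edges:
        indegree[j] += 1
        successors[i].append(j)

    stages = []
    current = sorted(t for t, deg in indegree.items() if deg == 0)  # step 41
    while current:
        stages.append(current)
        nxt = []
        for t in current:                 # step 42: remove the staged nodes
            for s in successors[t]:
                indegree[s] -= 1
                if indegree[s] == 0:      # in-degree 0 in the reduced graph
                    nxt.append(s)
        current = sorted(nxt)             # step 43: repeat until exhausted
    if sum(len(stage) for stage in stages) != len(indegree):
        raise ValueError("graph contains a cycle")
    return stages


# Hypothetical example: t1 -> {t2, t3, t4}, t2 -> t4, t3 -> t4, t4 -> t5
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (1, 4), (4, 5)]
print(divide_stages(range(1, 6), edges))  # [[1], [2, 3], [4], [5]]
```

Task 4 lands in the third stage even though it has a direct edge from task 1, because its in-degree only drops to 0 once tasks 2 and 3 have been removed.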

Preferably, in step 51, the completion time $ft_{pj}$ of $t_j$ on each service provider $s_p$ in $CS_j$ is calculated, where $d_{ij}$ denotes the amount of data transferred from task $t_i$ to task $t_j$, $r_i$ is the allocation result of $t_i$, $r_i.s_q$ denotes the service provider finally selected for $t_i$, and $rft_i$ denotes the actual completion time of $t_i$.

Preferably, the transmission bandwidth $bw$ is a constant.

Compared with the prior art, the invention has the following beneficial effects:

The invention takes into account the influence of transmission time in data-intensive workflows, provides support for scheduling methods in practical applications, and helps improve workflow execution efficiency.

Drawings

FIG. 1 is a workflow definition example diagram.

Fig. 2 is a diagram showing the correspondence between tasks, service providers, and servers.

FIG. 3 is a workflow scheduling framework diagram.

Fig. 4 is a graph of execution time of each task candidate facilitator in the detailed embodiment.

FIG. 5 is a diagram illustrating task allocation at various stages in an exemplary embodiment.

Fig. 6 is a schematic flow chart of the invention.

Detailed Description

The present invention is further illustrated in the accompanying drawings and described in the following detailed description, it is to be understood that such examples are included solely for the purposes of illustration and are not intended as a definition of the limits of the invention, since various equivalent modifications of the invention will become apparent to those skilled in the art after reading the present specification, and it is intended to cover all such modifications as fall within the scope of the invention as defined in the appended claims.

A data intensive workflow scheduling method based on stage division in a cloud environment is shown in FIG. 6, and includes the following steps:

step 1, abstracting a workflow structure.

Workflow information is acquired, a DAG is built from it, and the workflow is represented by the DAG, as shown in FIG. 1: $W = \langle T, D \rangle$, where $W$ is the workflow to be scheduled; $T = \{t_i \mid i = 1 \ldots n\}$ is the task set of $W$ and $t_i$ is the $i$-th task of $W$; $D = \{d_{ij} \mid i, j = 1 \ldots n\}$ is the set of transmitted data volumes in $W$, where $d_{ij}$ is the amount of data that $t_i$ needs to transfer to $t_j$. The transmission time can be calculated as $dt_{ij} = d_{ij} / bw$, where $bw$ is the transmission bandwidth between $t_i$ and $t_j$ and is a constant by default.

The structure of the workflow is abstracted and represented by a DAG, so that the workflow is treated as a directed weighted graph: each predecessor task is connected to its successor task, the two have a data dependency, and each edge carries a weight, namely the transmission time between the tasks. This is a typical data flow structure and provides the theoretical basis for the subsequent workflow scheduling.
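A minimal Python sketch of this representation is given below for illustration. The class name `Workflow`, its method names, and the four-task data are assumptions introduced for the sketch; only the symbols $T$, $D$, $d_{ij}$, and $bw$ come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Workflow:
    """W = <T, D>: tasks plus the data volumes on the dependency edges."""
    tasks: list[int]                                   # T = {t_i | i = 1..n}
    data: dict[tuple[int, int], float] = field(default_factory=dict)  # (i, j) -> d_ij

    def transfer_time(self, i: int, j: int, bw: float = 1.0) -> float:
        """dt_ij = d_ij / bw; 0 when t_i sends nothing to t_j."""
        return self.data.get((i, j), 0.0) / bw

    def predecessors(self, j: int) -> list[int]:
        """Tasks t_i with d_ij > 0, i.e. the tasks that t_j depends on."""
        return sorted(i for (i, k), d in self.data.items() if k == j and d > 0)


# Hypothetical four-task workflow (values are illustrative only).
w = Workflow(tasks=[1, 2, 3, 4],
             data={(1, 2): 40, (1, 3): 20, (2, 4): 10, (3, 4): 30})
print(w.transfer_time(1, 2, bw=2.0))  # 20.0
print(w.predecessors(4))              # [2, 3]
```

Storing only the non-zero $d_{ij}$ keeps the edge set and the weights in one dictionary, which is convenient when most task pairs exchange no data.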

Step 2, defining candidate service providers for the tasks.

In a cloud environment there are a number of service providers in different regions, and each service provider is assumed to rent one server whose hardware resources it uses to execute various computing tasks. The description of each computing task in the workflow does not contain specific processing details, so a task can be completed by several candidate service providers running different algorithms; the same task handed to different service providers has different execution times.

The correspondence among tasks, service providers, and servers in a workflow is shown in Fig. 2. Each task may have several candidate service providers. Each service provider corresponds to one server and can handle several tasks, but not necessarily every task. Cloud workflow scheduling decides which service provider each task is handed to for completion, that is, on which server the task is scheduled for execution. When $d_{ij} > 0$, data must be transferred directly between the two tasks; if the two tasks select the same service provider, the data transmission time is 0, otherwise it cannot be ignored. If a service provider can process only a single task, both the input and the output data of that task must be transferred between this service provider and other service providers. To reduce transmission time further, service providers that can handle more tasks are preferred as candidates when service providers are screened.

Assume that the service provider set $S$ contains $m$ service providers $s_p$ ($p = 1 \ldots m$) that participate in scheduling the workflow. To express the association between each candidate service provider and the tasks to be scheduled precisely, the following definition is given: $ST = \{ST_p \mid p = 1 \ldots m\}$ with $ST_p = \{\langle t_i, et_{pi} \rangle \mid t_i \in T,\ i \in \{1 \ldots n\}\}$, where $et_{pi}$ is the execution time of $t_i$ on service provider $s_p$.

This step introduces the concept of candidate service providers: each task is given its candidate service providers, each of which executes the task with a different execution time, and these execution times are the input data for the subsequent calculation of workflow completion time.
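For illustration, the definition of $ST$ can be held as a nested dictionary and inverted to obtain the candidates of a task. The provider ids, task ids, and execution times below are hypothetical, and the helper names `candidates` and `capability_count` are assumptions of this sketch.

```python
# ST[p] maps each task that provider s_p can execute to its execution time et_pi.
# All ids and times are hypothetical.
ST = {
    1: {1: 12, 2: 30},
    2: {2: 25, 3: 18},
    3: {1: 15, 3: 22, 4: 9},
}


def candidates(ST, task):
    """Return [(p, et_pi)] for every provider s_p able to execute the task."""
    return sorted((p, times[task]) for p, times in ST.items() if task in times)


def capability_count(ST):
    """Number of tasks each provider can execute."""
    return {p: len(times) for p, times in ST.items()}


print(candidates(ST, 3))     # [(2, 18), (3, 22)]
print(capability_count(ST))  # {1: 2, 2: 2, 3: 3}
```

The capability count is one possible way to rank providers when those able to handle more tasks are to be preferred as candidates.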

Step 3, workflow scheduling framework.

Different task scheduling results incur different time costs in both execution time and transmission time. A scheduling method is provided here to reduce the completion time of the entire workflow.

The basic flow of workflow scheduling is to divide the workflow $W$ into several stages for scheduling; the basic allocation strategy is to make the completion time of all tasks in the current stage as early as possible. Scheduling is divided into stages mainly because global scheduling is too complex (it is an NP-hard problem), whereas stage-by-stage scheduling can compute a good scheduling result for each stage in turn, so that the final scheduling result is relatively good. Because scheduling is staged, the strategy is best suited to workflows whose inter-task transmission times are fairly uniform. Following this scheduling idea and the contents of $ST_p$, candidate service providers are determined for each task and the best one is selected, until all tasks are allocated. The scheduling result is the set of allocation results $r_i$ of the tasks, where $r_i$ is the allocation result of $t_i$, $s_p$ is the service provider finally selected for $t_i$, and $rft_i$ is the actual completion time of $t_i$. The workflow scheduling framework is shown in Fig. 3.

This step presents a workflow scheduling framework based on stage division, so that a suitable workflow can be scheduled according to it.

Step 4, dividing the workflow into stages.

A task in the workflow can execute only after its predecessor tasks have finished and their data has been transferred to it. For a workflow given in the form of Definition 1, the potential temporal precedence among tasks can be derived from data dependence, so the division of the workflow into allocation stages is based mainly on data dependence. $TS_u = \{t_i \mid t_i \text{ is a task of the } u\text{-th stage}\}$, $u = 1, 2, \ldots, l$, where $TS_u$ is the task set of the $u$-th stage and $|TS_u|$ is the number of tasks in the $u$-th stage. The specific division idea is as follows:

Step 41, the in-degree of the start node is 0 and no node can reach it, so the start node is placed first, i.e. it is assigned to the initial stage.

In this step the workflow is divided into stages. Because a task can execute only after its predecessors have finished and transferred their data, and because the potential temporal precedence among tasks can be derived from data dependence in the workflow structure abstracted in step 1, the stage division is based mainly on data dependence.

Step 5, task scheduling

Step 51, candidate completion time calculation

For a task $t_j$ to be allocated, the set of its candidate service providers and corresponding execution times is $CS_j = \{\langle s_p, et_{pj} \rangle \mid s_p \in S,\ t_j \in T\}$. When several already allocated tasks $t_i$ point to the task $t_j$ to be allocated, the completion time $ft_{pj}$ of $t_j$ on each service provider $s_p$ in $CS_j$ is calculated. Here $r_i$ is the allocation result of $t_i$, $r_i.s_q$ denotes the service provider finally selected for $t_i$, and $rft_i$ denotes the actual completion time of $t_i$.
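The completion-time formula itself does not appear in this text, so the Python sketch below rests on an assumed form: $t_j$ can start once every predecessor has finished and delivered its data, with zero transfer time when both tasks run at the same service provider, giving $ft_{pj} = \max_i (rft_i + dt_{ij}) + et_{pj}$. The function name, argument layout, and numbers are hypothetical.

```python
def completion_time(p, et_pj, preds, assigned, data, bw=1.0):
    """Assumed form: ft_pj = max_i(rft_i + transfer_ij) + et_pj.

    p        candidate provider for t_j
    et_pj    execution time of t_j on provider p
    preds    ids of the already allocated predecessors t_i of t_j
    assigned maps i -> (provider chosen for t_i, rft_i)
    data     maps i -> d_ij, the data volume sent from t_i to t_j
    """
    ready = 0.0
    for i in preds:
        provider_i, rft_i = assigned[i]
        # Transfer costs nothing when both tasks run at the same provider.
        transfer = 0.0 if provider_i == p else data[i] / bw
        ready = max(ready, rft_i + transfer)
    return ready + et_pj


# Hypothetical values: t_j has two predecessors that are already scheduled.
assigned = {1: ("s1", 50.0), 2: ("s2", 40.0)}
data = {1: 30.0, 2: 20.0}
print(completion_time("s1", 10.0, [1, 2], assigned, data))  # 70.0
print(completion_time("s3", 8.0, [1, 2], assigned, data))   # 88.0
```

In the example, provider s1 has the longer execution time but still finishes earlier, because it already holds the output of the latest predecessor and avoids that transfer.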

Step 52, the completion times of every task of the current stage on its candidate service providers are arranged in a matrix $A_{u(n \times m)}$ in order to carry out the allocation. $A_{u(n \times m)} = [\langle s_p, ft_{pi} \rangle]_{n \times m}$ is the matrix of completion times of all tasks to be allocated in stage $u$ when executed by their different candidate service providers. Each row holds the completion times of one task on its different candidate service providers, arranged in ascending order; $ft_{pi}$ is the completion time of $t_i$ when executed by $s_p$.
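As an illustrative sketch, the rows of such a matrix can be built by sorting each task's completion times in ascending order. The function name `build_matrix`, the dictionary layout, and the completion times are assumptions; rows are allowed to differ in length when tasks have different numbers of candidates.

```python
def build_matrix(finish_times):
    """finish_times maps task -> {provider: ft}; returns task -> sorted row.

    Each row lists (provider, ft) pairs in ascending order of ft, mirroring
    one row of A_u. Ties are broken by provider name for determinism.
    """
    return {
        task: sorted(per_provider.items(), key=lambda item: (item[1], item[0]))
        for task, per_provider in finish_times.items()
    }


# Hypothetical completion times for a stage with three tasks.
finish_times = {
    "t5": {"s1": 120, "s2": 95, "s4": 130},
    "t6": {"s2": 90, "s3": 110},
    "t7": {"s1": 100, "s2": 140, "s3": 105},
}
A_u = build_matrix(finish_times)
for task, row in A_u.items():
    print(task, row)
```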

Step 53, based on $A_{u(n \times m)}$, determine the candidate service providers $FS_i$ participating in the allocation and the minimum column $x$

First, $A_{u(n \times m)}$ is used to determine $FS_i$, the set of candidate service providers newly added by the $i$-th column, $FS_i = \{s_p \mid s_p \in CS\}$, where $|FS_i|$ denotes the size of the set; the specific process is shown in Algorithm 1. Then the minimum $x$ is found for which the sets $FS_1, \ldots, FS_x$ satisfy the condition ensuring that the number of service providers participating in the allocation exceeds the number of tasks in the stage, thereby achieving physical parallelism. The specific process is shown in Algorithm 2.
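Algorithms 1 and 2 and the exact condition on $x$ are not reproduced in this text. The Python sketch below therefore assumes that the condition is $|FS_1| + \ldots + |FS_x| \ge |TS_u|$, i.e. that the first $x$ columns contain at least as many distinct service providers as there are tasks. The function names and the sample rows are hypothetical.

```python
def new_providers_per_column(rows):
    """FS_i: providers that first appear in column i (1-based) of the matrix."""
    width = max(len(row) for row in rows.values())
    seen, fs = set(), []
    for col in range(width):
        column = {row[col][0] for row in rows.values() if col < len(row)}
        fs.append(column - seen)   # only providers not seen in earlier columns
        seen |= column
    return fs


def minimum_x(rows):
    """Smallest x with |FS_1| + ... + |FS_x| >= number of tasks (assumed test)."""
    fs = new_providers_per_column(rows)
    total = 0
    for x, added in enumerate(fs, start=1):
        total += len(added)
        if total >= len(rows):
            return x, fs
    return None, fs  # fewer distinct providers than tasks


# Hypothetical rows, each already sorted by completion time.
rows = {
    "t5": [("s2", 95), ("s1", 120), ("s4", 130)],
    "t6": [("s2", 90), ("s3", 110)],
    "t7": [("s1", 100), ("s3", 105), ("s2", 140)],
}
x, fs = minimum_x(rows)
print(x, [sorted(s) for s in fs])  # 2 [['s1', 's2'], ['s3'], ['s4']]
```

Here the first column offers only two distinct providers for three tasks, so the second column is needed before every task can be given its own provider.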

Step 54, allocate the current stage based on $FS_i$ and $x$

The main purpose of this step is to assign all tasks of the stage to different service providers, based on the $FS_i$ and $x$ obtained in step 53. Take the entries in column $x$ of $A_{u(n \times m)}$ whose service providers belong to $FS_x$ and sort them by $ft_{pi}$ in ascending order, giving the sorted result $\{\langle s_p, ft_{pi} \rangle, \ldots\}$ with $s_p \in FS_x$. Select the smallest $ft_{pi}$ in the sorted result and confirm that the task executed by that service provider $s_p$ is $t_i$. The other tasks of the current stage drop $s_p$ from their candidate service providers. Then check whether the screening condition holds: the total number of candidate service providers of the remaining tasks is greater than or equal to the number of unallocated tasks. If it holds, a feasible solution may exist among the remaining candidate service providers. The specific process is shown in Algorithm 3.

If the condition is met, i.e. a feasible solution may currently exist, update the matrix: discard the row of $t_i$ and remove the $s_p$ selected for $t_i$ from the other tasks, then recompute $FS_i$ and $x$ for the new matrix from step 53 and judge the screening condition again. If every service provider in the sorted result has been screened without finding a feasible solution, select the $s_p$ with the smallest completion time in the sorted result and confirm the task that selects it. The remaining tasks of the current stage drop this $s_p$, the matrix is updated, and the process returns to steps 53 and 54. A detailed description of the entire allocation method is given in Algorithm 4.
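Algorithms 3 and 4 are not reproduced in this text, so the following Python sketch covers only one allocation round under stated assumptions: the screening condition is read as "the number of distinct candidate service providers left is at least the number of unallocated tasks", each choice is treated as tentative until it passes screening, and a task left with no candidate at all is also rejected (an extra check added by this sketch). All names and data are hypothetical.

```python
def drop_provider(rows, provider, assigned_task):
    """Remove the assigned task's row and strike the provider from other rows."""
    return {
        task: [(p, ft) for p, ft in row if p != provider]
        for task, row in rows.items()
        if task != assigned_task
    }


def passes_screening(rows):
    """Assumed screening test: distinct remaining providers >= unallocated tasks."""
    if any(not row for row in rows.values()):
        return False  # extra check: a task with no candidate cannot be served
    providers = {p for row in rows.values() for p, _ in row}
    return len(providers) >= len(rows)


def assign_one(rows, ordered):
    """Try (task, provider, ft) candidates in ascending ft order.

    Returns the accepted candidate and the updated rows. If no candidate
    passes screening, falls back to the first (smallest ft) candidate.
    """
    for task, provider, ft in ordered:
        remaining = drop_provider(rows, provider, task)
        if passes_screening(remaining):
            return (task, provider, ft), remaining
    task, provider, ft = ordered[0]
    return (task, provider, ft), drop_provider(rows, provider, task)


rows = {
    "t5": [("s2", 95), ("s1", 120), ("s4", 130)],
    "t6": [("s2", 90), ("s3", 110)],
    "t7": [("s1", 100), ("s3", 105), ("s2", 140)],
}
# Entries of column x = 2 whose provider is in FS_2 = {s3}, sorted by ft.
ordered = [("t7", "s3", 105), ("t6", "s3", 110)]
chosen, remaining = assign_one(rows, ordered)
print(chosen)     # ('t7', 's3', 105)
print(remaining)  # {'t5': [('s2', 95), ('s1', 120), ('s4', 130)], 't6': [('s2', 90)]}
```

A full stage allocation would repeat this round on the updated rows, recomputing $FS_i$ and $x$ each time, until no unallocated task remains.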

In this step, for each task to be allocated, the completion time of the task on each of its service providers is calculated; the completion times of all tasks of the current stage on their candidate service providers are arranged in a matrix; and the tasks are assigned from back to front according to the scheduling algorithm, each to a different service provider, until the tasks of all stages have been allocated.

Examples of the invention

In order to better understand the technical content of the present invention, a specific scheduling example is given and described with reference to the attached drawings.

The workflow information and the service provider information are as follows. The workflow $W$ contains 20 tasks in total, where $t_1$ is the start task and $t_{20}$ is the end task. The values of $d_{ij}$ are listed in Table 1. For convenience in the subsequent calculations, $bw$ is taken to be 1 in this example, so the transmission time between tasks is $d_{ij} / bw = d_{ij}$.

Table 1. Data volumes $d_{ij}$ transferred between tasks

Edge	Volume	Edge	Volume	Edge	Volume	Edge	Volume
d_1,2	43	d_1,3	37	d_1,4	25	d_1,5	70
d_2,5	46	d_3,6	53	d_3,7	40	d_3,11	76
d_4,7	29	d_4,8	38	d_4,13	88	d_5,9	57
d_6,9	34	d_6,10	29	d_6,11	40	d_6,16	104
d_7,12	42	d_7,13	55	d_8,14	58	d_8,15	57
d_9,16	74	d_10,17	69	d_11,17	49	d_12,18	50
d_13,18	46	d_14,18	31	d_15,19	38	d_15,20	104
d_16,20	43	d_17,20	60	d_18,20	32	d_19,20	35
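As a worked illustration, the edges of Table 1 can be fed to Python's standard `graphlib.TopologicalSorter`, whose ready sets correspond to successive in-degree-0 layers of the DAG. Using this library class is a choice of this sketch, not something prescribed by the method; the variable names are likewise assumptions.

```python
from graphlib import TopologicalSorter

# Data volumes d_ij from Table 1, keyed by (i, j).
D = {
    (1, 2): 43, (1, 3): 37, (1, 4): 25, (1, 5): 70,
    (2, 5): 46, (3, 6): 53, (3, 7): 40, (3, 11): 76,
    (4, 7): 29, (4, 8): 38, (4, 13): 88, (5, 9): 57,
    (6, 9): 34, (6, 10): 29, (6, 11): 40, (6, 16): 104,
    (7, 12): 42, (7, 13): 55, (8, 14): 58, (8, 15): 57,
    (9, 16): 74, (10, 17): 69, (11, 17): 49, (12, 18): 50,
    (13, 18): 46, (14, 18): 31, (15, 19): 38, (15, 20): 104,
    (16, 20): 43, (17, 20): 60, (18, 20): 32, (19, 20): 35,
}

# TopologicalSorter expects a mapping node -> set of predecessors.
preds = {}
for (i, j) in D:
    preds.setdefault(j, set()).add(i)

sorter = TopologicalSorter(preds)
sorter.prepare()
stages = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose predecessors are all done
    stages.append(ready)
    sorter.done(*ready)

for u, stage in enumerate(stages, start=1):
    print(u, stage)
# 1 [1]
# 2 [2, 3, 4]
# 3 [5, 6, 7, 8]
# 4 [9, 10, 11, 12, 13, 14, 15]
# 5 [16, 17, 18, 19]
# 6 [20]
```

Under this layering the example workflow has six stages, and tasks $t_9$ and $t_{10}$ fall in the fourth stage.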

The candidate service provider set of the entire workflow is $S = \{s_1, s_2, s_3, s_4, s_5, s_6, s_7\}$. The candidate service providers of each task and their execution times are shown in Fig. 4.

According to the allocation scheme of the initial stage, $t_1$ is executed by $s_4$. The scheduling of each subsequent stage builds on the result of the previous stage: the completion time of each task of the current stage on its different service providers is calculated first, the completion times of each task are sorted in ascending order, and each task is assigned to a different service provider according to Algorithm 4.

Taking the fourth stage as an example, allocation step 1 is executed first, giving $x = 3$ and $FS_3 = \{s_5, s_6\}$, so a feasible solution may exist within the first three columns. Sorting the completion times in ascending order gives $\langle s_5, 342 \rangle$, $\langle s_6, 350 \rangle$, $\langle s_6, 381 \rangle$, $\langle s_6, 383 \rangle$. Having $t_9$ executed by $s_5$ does not satisfy the screening condition, so, following the order, $t_{10}$ is selected to be executed by $s_6$ and the remaining tasks discard $s_6$. Allocation step 1 is then executed again, giving $x = 3$, and the subsequent allocation results are obtained according to Algorithm 4.

In the task allocation diagram of each stage, each row gives the completion times of one task on its different service providers, arranged in ascending order from left to right, and the scheduling result of the task is shown in bold. As can be seen from Fig. 5, the final completion time of the workflow $W$ is 637. Compared with other workflow scheduling methods, the workflow completion time is improved to a certain extent.

In summary, the invention provides a data-intensive workflow scheduling method based on stage division in a cloud environment, which optimizes the overall completion time of data-intensive workflows in the cloud, provides support for scheduling methods in practical applications, and helps improve workflow execution efficiency. The invention adapts traditional workflow scheduling ideas to data-intensive workflows in the cloud environment.

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims

1. A data intensive workflow scheduling method based on stage division in a cloud environment is characterized by comprising the following steps:

step 1, abstracting the workflow structure: acquiring workflow information, building a DAG from the workflow information, and representing the workflow by the DAG as $W = \langle T, D \rangle$, where $W$ is the workflow to be scheduled and comprises $n$ tasks; $T = \{t_i \mid i = 1 \ldots n\}$ is the task set of $W$ and $t_i$ is the $i$-th task of $W$; $D = \{d_{ij} \mid i, j = 1 \ldots n\}$ is the set of transmitted data volumes in $W$ and $d_{ij}$ is the amount of data that $t_i$ needs to transfer to $t_j$; the transfer time is $d_{ij} / bw$, where $bw$ is the transmission bandwidth between $t_i$ and $t_j$;

step 2, defining a task candidate service provider:

in the cloud environment there are a number of service providers in different regions, and each service provider rents one server whose hardware resources it uses to execute various computing tasks; the description of each computing task in the workflow does not contain specific processing details, so a task is completed by several candidate service providers running different algorithms, and the same task handed to different service providers has different execution times;

each task has several candidate service providers; the cloud workflow scheduling process decides which service provider each task is handed to for completion, that is, on which server the task is scheduled for execution;

when $d_{ij} > 0$, data must be transferred directly between the two tasks; if the two tasks select the same service provider, the data transmission time is 0, otherwise the data transmission time cannot be ignored; if a service provider can process only a single task, both the input and the output data of that task must be transferred between this service provider and other service providers;

the service provider set $S$ contains $m$ service providers $s_p$, $p = 1 \ldots m$, that participate in scheduling the workflow, and the association between each candidate service provider and the tasks to be scheduled is: $ST = \{ST_p \mid p = 1 \ldots m\}$ with $ST_p = \{\langle t_i, et_{pi} \rangle \mid t_i \in T,\ i \in \{1 \ldots n\}\}$, where $et_{pi}$ is the execution time of $t_i$ on service provider $s_p$, $ST$ is the set of all $ST_p$, $ST_p$ lists every task $t_i$ that $s_p$ can execute, $n$ is the number of tasks, and $\langle t_i, et_{pi} \rangle$ is the correspondence between task $t_i$ and its execution time on $s_p$;

step 3, dividing the workflow $W$ into several stages for scheduling, and making the completion time of all tasks in the current stage as early as possible; stage-by-stage scheduling computes a good scheduling result for each stage in turn, so that the final scheduling result is relatively good, and because of this staging the strategy is best suited to workflows whose inter-task transmission times are fairly uniform; according to the contents of $ST_p$, candidate service providers are determined for each task and the best one is selected, until all tasks are allocated; the scheduling result is the set $R$ of the allocation results of all tasks, where $r_i$ is the allocation result of $t_i$ and must satisfy the condition that $\langle t_i, et_{pi} \rangle$ belongs to $ST_p$, $s_p$ is the service provider finally selected for $t_i$, and $rft_i$ is the actual completion time of $t_i$;

step 4, a task in the workflow can execute only after its predecessor tasks have finished and their data has been transferred to it; the potential temporal precedence among tasks is derived from data dependence, so the division of the workflow into allocation stages is based mainly on data dependence; $TS_u = \{t_u \mid t_u \text{ is a task of the } u\text{-th stage}\}$, $u = 1, 2, \ldots, l$, where $TS_u$ is the task set of the $u$-th stage and $|TS_u|$ is the number of tasks in the $u$-th stage;

step 5, task scheduling:

step 51, candidate completion time calculation

For a task t_j to be allocated, the set of its candidate service providers and the corresponding completion times is denoted CS_j = {<s_p, ft_pj> | s_p ∈ S, t_j ∈ T}, where s_p denotes a service provider and S denotes the set of service providers; given that several already allocated tasks t_i point to the task t_j to be allocated, for each service provider s_p in CS_j, the completion time ft_pj of t_j executed on that service provider is calculated;

Step 52, once the completion time of each task of the current stage on each candidate service provider has been calculated, the values are arranged in a matrix A_u(n*m) in order to carry out the allocation: A_u(n*m) = [<s_p, ft_pi>]n*m, where A_u(n*m) is the matrix formed by the completion times of all tasks to be allocated in stage u when executed on the different candidate service providers; each row corresponds to the completion times of one task when executed on the different candidate service providers, with the values arranged in ascending order, and ft_pi denotes the completion time of t_i executed on s_p;

Step 53, first, the set FS_i of newly added candidate service providers of the i-th row is determined from A_u(n*m), FS_i = {s_p | s_p ∈ S}, where |FS_i| denotes the size of the set; then the minimum x must be found such that the number of service providers participating in the allocation exceeds the number of tasks in the stage, so that physical parallelism is achieved;

step 54, carrying out the allocation of the current stage based on FS_i and x

Based on the candidate service provider sets FS_i participating in the allocation and the minimum column x obtained in step 53, all tasks are assigned to different service providers; examine all service providers corresponding to column x of A_u(n*m) and FS_x, and sort them by ft_pi in ascending order, giving the sorted result {<s_p, ft_pi> …}, where s_p ∈ FS_x; select the smallest ft_pi in the sorted result and confirm that the task executed by that service provider is t_i; the candidate service providers of the remaining tasks of the current stage discard this s_p; then check whether the current situation meets the screening condition, which determines whether the remaining candidate service providers admit a feasible solution: the screening condition is met when the total number of candidate service providers of the remaining tasks is greater than or equal to the number of unallocated tasks;

if the condition is met, that is, a feasible solution currently exists, the matrix is updated: the row of t_i is discarded from the matrix and the s_p selected by t_i is removed from the candidates of the other tasks; FS_i and x of the new matrix are then calculated again from step 53, and the screening condition judgment in step 53 is performed again; if all service providers in the sorted result have been screened or none of them yields a feasible solution, the s_p with the minimum completion time in the sorted result is selected and the current task that selects this s_p is confirmed; the remaining tasks of the current stage discard this s_p, and the matrix is updated.
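The matrix arrangement of step 52, the search for x in step 53 and the screening condition of step 54 can be sketched as follows. This is an illustration under assumptions, not the claimed implementation: the function names and the dictionary layout are hypothetical; FS_i is read here as the providers that first appear in column i of the sorted matrix (because step 54 pairs column x with FS_x); x is taken as the first column count at which the providers seen so far number at least the tasks of the stage; and the screening condition is read as a count of distinct remaining providers.

```python
def build_matrix(ft):
    """ft[task][provider] is a completion time; each row is sorted ascending (step 52)."""
    return {task: sorted(times.items(), key=lambda pair: pair[1])
            for task, times in ft.items()}


def new_providers_and_x(matrix):
    """Return (fs, x) for step 53 under the column reading described above.

    fs[c] holds the providers first seen in column c + 1; x is the smallest
    1-based column count whose providers number at least the task count,
    or None if the providers never suffice.
    """
    n_tasks = len(matrix)
    width = max((len(row) for row in matrix.values()), default=0)
    seen, fs = set(), []
    for col in range(width):
        column = {row[col][0] for row in matrix.values() if col < len(row)}
        fs.append(column - seen)   # providers newly added by this column
        seen |= column
        if len(seen) >= n_tasks:   # assumed form of the condition on x
            return fs, col + 1
    return fs, None


def screen(matrix, task, provider):
    """Tentatively give `provider` to `task` and apply the screening condition.

    Returns (ok, reduced): reduced is the matrix without the row of `task`
    and without `provider`; ok is True when the distinct providers left
    number at least the unallocated tasks (one reading of step 54).
    """
    reduced = {t: [(p, ft) for p, ft in row if p != provider]
               for t, row in matrix.items() if t != task}
    providers_left = {p for row in reduced.values() for p, _ in row}
    return len(providers_left) >= len(reduced), reduced
```

A caller would build the matrix for one stage, obtain x, try the candidates of column x in ascending order of completion time, and keep the first choice for which screen reports a feasible remainder, falling back to the smallest completion time otherwise.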

2. The data intensive workflow scheduling method based on stage division in a cloud environment according to claim 1, wherein the method for dividing the workflow into allocation stages in step 4 comprises the following steps:

step 41, the in-degree of the start node is 0, that is, no node points to the start node, so the start node is placed first, namely it is assigned to the initial stage;

step 42, removing the nodes that have already been assigned to stages, and selecting the nodes of the next stage from the nodes remaining in the resulting graph, wherein these nodes must have an in-degree of 0 in the current graph;

step 43, after the division of the previous stage is finished, repeating step 42 until all nodes have been assigned to stages.
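Steps 41 to 43 amount to peeling off the nodes of in-degree 0 layer by layer. A minimal sketch, assuming the workflow is given as a list of task names and a list of (predecessor, successor) data-dependency edges; the function name and this input format are hypothetical:

```python
def divide_stages(nodes, edges):
    """Split a DAG into stages by repeatedly removing the in-degree-0 nodes."""
    indeg = {v: 0 for v in nodes}
    succ = {v: [] for v in nodes}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    remaining = set(nodes)
    stages = []
    while remaining:
        # Step 41 / 42: nodes with no incoming edge in the current graph.
        current = sorted(v for v in remaining if indeg[v] == 0)
        if not current:
            raise ValueError("the graph contains a cycle")
        stages.append(current)
        # Remove the staged nodes and update the in-degrees of their successors.
        for v in current:
            remaining.discard(v)
            for w in succ[v]:
                indeg[w] -= 1
    return stages
```

Each inner list is one stage TS_u; a task lands in the stage after the latest stage of any of its predecessors.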

3. The data intensive workflow scheduling method based on stage division in a cloud environment according to claim 2, wherein in step 51, for each service provider s_p in CS_j, the completion time ft_pj of t_j executed on that service provider is calculated as:

wherein Δd_ij denotes the amount of data transmitted from task t_i to task t_j; r_i is the allocation result of t_i, where r_i.s_q denotes the service provider finally selected for t_i and rft_i denotes the actual completion time of t_i.
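A minimal sketch of how such a completion time could be computed from the quantities named above. The rule used is an assumption, not a restatement of the claimed formula: t_j is taken to start once every predecessor t_i has finished at rft_i and its Δd_ij units of data have arrived over a link of constant bandwidth bw, data already held by the same service provider is assumed to need no transfer, and the execution time et_pj is then added. The function and parameter names are hypothetical, and provider availability is not modelled.

```python
def completion_time(et_pj, provider, preds, bw):
    """Assumed completion time of t_j on `provider`.

    preds: list of (rft_i, s_q, delta_d_ij), one entry per allocated predecessor,
           where s_q is the provider that executed t_i.
    bw:    constant transmission bandwidth (data units per time unit).
    """
    ready = 0.0
    for rft_i, s_q, delta_d_ij in preds:
        # Assumption: no transfer cost when predecessor and successor share a provider.
        transfer = 0.0 if s_q == provider else delta_d_ij / bw
        ready = max(ready, rft_i + transfer)
    return ready + et_pj
```

For example, with predecessors finishing at 10.0 on the same provider and at 8.0 on another provider with 40.0 units of data, bw = 10.0 and et_pj = 5.0, the data from the second predecessor arrives at 12.0 and the task completes at 17.0.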

4. The data intensive workflow scheduling method based on stage division in a cloud environment according to claim 3, wherein the transmission bandwidth bw is a constant.