CN111274009B - Data intensive workflow scheduling method based on stage division in cloud environment - Google Patents

Data intensive workflow scheduling method based on stage division in cloud environment

Info

Publication number
CN111274009B
Authority
CN
China
Prior art keywords
workflow
task
tasks
stage
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010033432.8A
Other languages
Chinese (zh)
Other versions
CN111274009A (en
Inventor
陈俊宇
刘茜萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010033432.8A priority Critical patent/CN111274009B/en
Publication of CN111274009A publication Critical patent/CN111274009A/en
Application granted granted Critical
Publication of CN111274009B publication Critical patent/CN111274009B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a data-intensive workflow scheduling method based on stage division in a cloud environment. The method comprises: abstracting the workflow structure; defining the candidate service providers of each task; determining a workflow scheduling framework; dividing the workflow into stages according to data dependence; and, for the current stage, calculating the completion time of each task on its candidate service providers, arranging the completion times in a matrix, and allocating the tasks of the stage, until the tasks of all stages have been allocated. The method takes into account the influence of transmission time in data-intensive workflows and improves the execution efficiency of the workflow.

Description

Data intensive workflow scheduling method based on stage division in cloud environment
Technical Field
The invention belongs to the field of cloud computing, and particularly relates to a data intensive workflow scheduling method based on stage division in a cloud environment.
Background
Cloud computing is a new commercial computing model that provides convenient, low-cost, readily available computing resources as services, with advantages such as low service and maintenance costs and flexible control. A workflow is the computer-based integration or automation of all or part of a business process. The Workflow Management Coalition defines a workflow as the automation of a business process, in whole or in part, during which documents, information, or tasks are handled at each step according to a set of procedural rules. Workflows under the cloud computing model can support a variety of complex information applications, such as climate modeling, seismic modeling, and weather forecasting. Particularly in interdisciplinary fields such as bioinformatics and climate simulation, workflows are often data intensive and require large-scale computing resources to process gigabytes or terabytes of input data. The purpose of cloud workflow scheduling is to solve the task scheduling problem of a workflow management system in a cloud computing environment, deploying tasks to different service providers in the cloud by formulating a suitable scheduling method. Current research optimizes the execution time and cost of workflows through various scheduling algorithms and provides a theoretical basis for scheduling in practical applications, so as to improve workflow efficiency and save resources.
Most current workflow scheduling methods do not consider the influence of data transmission time when optimizing execution cost and completion time. However, in many data-intensive workflow applications the data transmission time of a task is not negligible compared with its execution time. For this class of workflow scheduling problems, this patent proposes a data-intensive workflow scheduling method based on stage division.
Disclosure of Invention
Purpose of the invention: in view of the current lack of evaluation of the influence of transmission time on data-intensive workflows, the invention provides a data-intensive workflow scheduling method based on stage division in a cloud environment.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme:
A data-intensive workflow scheduling method based on stage division in a cloud environment comprises the following steps:
Step 1, abstracting the workflow structure: acquire the workflow information, build a DAG from it, and represent the workflow by the DAG as $W = \langle T, D \rangle$, where $W$ is the workflow to be scheduled; $T = \{t_i \mid i = 1, \dots, n\}$ is the task set of $W$ and $t_i$ is the $i$-th task of $W$; $D = \{d_{ij} \mid i, j = 1, \dots, n\}$ is the set of transmitted data volumes in $W$ and $d_{ij}$ is the amount of data that $t_i$ must transmit to $t_j$. The transmission time is $d_{ij}/bw$, where $bw$ is the transmission bandwidth between $t_i$ and $t_j$.
Step 2, defining the task candidate service providers:
In the cloud environment there are several service providers in different regions, and each service provider rents one server and uses its hardware resources to execute various computing tasks. The description of each computing task in the workflow contains no specific processing details; a task can be completed by several candidate service providers running different algorithms, so the same task has different execution times when handed to different service providers.
Each task may have several candidate service providers. Each service provider corresponds to one server and can handle several tasks, but not necessarily every task. Cloud workflow scheduling decides which service provider each task is handed to, that is, on which server it is scheduled for execution.
When $d_{ij} > 0$, direct data transmission is required between the two tasks. If the two tasks select the same service provider, the data transmission time is 0; otherwise it cannot be ignored. If a service provider can process only a single task, both the input and the output data of that task must be transmitted between this service provider and other service providers.
There are $m$ service providers $s_p$ ($p = 1, \dots, m$) in the service provider set $S$ participating in the scheduling of the workflow. $ST = \{ST_p = \{\langle t_i, et_{pi} \rangle \mid et_{pi} \text{ is the execution time of } t_i \text{ on } s_p,\ i \in \{1, \dots, n\}\},\ p = 1, \dots, m\}$, where $ST$ is the set of all $ST_p$, $ST_p$ contains every task $t_i$ that service provider $s_p$ can execute, $n$ is the number of tasks, and $\langle t_i, et_{pi} \rangle$ pairs a task $t_i$ with the execution time of $t_i$ on service provider $s_p$.
Step 3, determining the workflow scheduling framework: the workflow $W$ is divided into several stages for scheduling, and the completion time of all tasks in the current stage is made as early as possible. Staged scheduling computes the optimal scheduling result of each stage in turn, so that the final scheduling result is relatively optimal; because scheduling proceeds stage by stage, the strategy is better suited to workflows whose inter-task transmission times are fairly uniform. According to the content of $ST_p$, candidate service providers are determined for the tasks and the best service provider is selected for each task until all tasks are allocated. The scheduling result is
Figure BDA0002365173500000021
$r_i$ is the allocation result of $t_i$ and must satisfy the condition that $\langle t_i, et_{pi} \rangle$ belongs to $ST_p$, where $s_p$ is the service provider finally selected for $t_i$, $rft_i$ is the actual completion time of $t_i$, and $R$ is the set of allocation results of all tasks.
Step 4, dividing the workflow into stages: a task in the workflow can execute only after its predecessor tasks have finished and their data has been transmitted to it. The potential temporal precedence among tasks can be derived from the data dependences of the workflow, so the division into allocation stages is based mainly on data dependence. $TS_u = \{t_i \mid t_i \text{ is a task of the } u\text{-th stage}\}$, $u = 1, 2, \dots$, where $TS_u$ is the task set of the $u$-th stage and $|TS_u|$ is the number of tasks in the $u$-th stage.
Step 5, task scheduling:
Step 51, candidate completion time calculation
For a task $t_j$ to be allocated, the set of candidate service providers and their execution times is $CS_j = \{\langle s_p, et_{pj} \rangle \mid s_p \in S, t_j \in T\}$, where $s_p$ is a service provider and $S$ is the set of service providers. When several allocated tasks $t_i$ point to the task $t_j$ to be allocated, the completion time $ft_{pj}$ of $t_j$ executed on the service provider is calculated for each service provider $s_p$ in $CS_j$.
Step 52, the completion times of the tasks of the current stage on their candidate service providers are arranged in a matrix $A_{u(n \times m)}$ to carry out the allocation: $A_{u(n \times m)} = [\langle s_p, ft_{pi} \rangle]_{n \times m}$, the matrix formed by the completion times of all tasks to be allocated in stage $u$ on their different candidate service providers. Each row holds the completion times of one task on its different candidate service providers, sorted in ascending order; $ft_{pi}$ is the completion time of $t_i$ executed by $s_p$.
Step 53, determining, based on $A_{u(n \times m)}$, the candidate service providers $FS_i$ participating in the allocation and the minimum column $x$
First, the set $FS_i$ of candidate service providers newly added in the $i$-th column is determined from $A_{u(n \times m)}$, $FS_i = \{s_p \mid s_p \in CS\}$, where $|FS_i|$ is the size of the set. Then the minimum $x$ satisfying the following condition must be found:
Figure BDA0002365173500000031
Finding the minimum such $x$ ensures that the number of service providers participating in the allocation exceeds the number of tasks in the stage, so that the tasks can run in physical parallel.
Step 54, carrying out the allocation of the current stage based on $FS_i$ and $x$
Based on the candidate service providers $FS_i$ and the minimum column $x$ obtained in step 53, all tasks are allocated to different service providers. Take all service providers in the $x$-th column of $A_{u(n \times m)}$ that belong to $FS_x$ and sort them by $ft_{pi}$ in ascending order; the sorted result is $\langle s_p, ft_{pi} \rangle, \dots$, where $s_p \in FS_x$. Select the smallest $ft_{pi}$ in the sorted result and confirm that the task executed by this service provider is $t_i$. The remaining tasks of the current stage then drop $s_p$ from their candidate service providers, and the screening condition is checked: the total number of candidate service providers of the remaining tasks is greater than or equal to the number of unallocated tasks. If the screening condition is met, a feasible solution may exist among the remaining candidate service providers.
If the condition is met, that is, a feasible solution may currently exist, update the matrix: discard the row of $t_i$, remove the $s_p$ selected by $t_i$ from the other tasks, recompute $FS_i$ and $x$ for the new matrix from step 53, and then perform the screening condition check in step 53. If all service providers in the sorted result have been screened or no feasible solution exists, select the $s_p$ with the smallest completion time in the sorted result and confirm the task that selects $s_p$. The remaining tasks of the current stage drop $s_p$, and the matrix is updated.
Preferably: the method for dividing the workflow into allocation stages in step 4 comprises the following steps:
Step 41, the in-degree of the starting node is 0, that is, no node reaches it; the starting node is placed first, i.e. it is assigned to the starting stage.
Step 42, remove the nodes that have already been assigned to a stage, and select the nodes of the next stage from the nodes remaining in the graph; these nodes must have in-degree 0 in the current graph.
Step 43, after the division of the previous stage is finished, continue to execute step 42 until all nodes have been assigned to a stage.
Preferably: in step 51, for each service provider $s_p$ in $CS_j$, the completion time $ft_{pj}$ of $t_j$ executed on that service provider is calculated as follows:
Figure BDA0002365173500000041
Figure BDA0002365173500000042
Wherein,
Figure BDA0002365173500000043
represents the amount of data that task $t_i$ transmits to task $t_j$; $r_i$ is the allocation result of $t_i$, where $r_i.s_q$ is the service provider finally selected for $t_i$ and $rft_i$ is the actual completion time of $t_i$.
Preferably: the transmission bandwidth $bw$ is a constant.
Compared with the prior art, the invention has the following beneficial effects:
The invention takes into account the influence of transmission time in data-intensive workflows, provides support for scheduling methods in practical applications, and helps improve the execution efficiency of workflows.
Drawings
FIG. 1 is a workflow definition example diagram.
Fig. 2 is a diagram showing the correspondence between tasks, service providers, and servers.
FIG. 3 is a workflow scheduling framework diagram.
Fig. 4 is a diagram of the execution time of each task on its candidate service providers in the detailed embodiment.
FIG. 5 is a diagram illustrating task allocation at various stages in an exemplary embodiment.
Fig. 6 is a schematic flow chart of the invention.
Detailed Description
The present invention is further illustrated in the accompanying drawings and described in the following detailed description, it is to be understood that such examples are included solely for the purposes of illustration and are not intended as a definition of the limits of the invention, since various equivalent modifications of the invention will become apparent to those skilled in the art after reading the present specification, and it is intended to cover all such modifications as fall within the scope of the invention as defined in the appended claims.
A data-intensive workflow scheduling method based on stage division in a cloud environment, as shown in FIG. 6, includes the following steps:
Step 1, abstracting the workflow structure.
Acquire the workflow information, build a DAG from it, and represent the workflow by the DAG, as shown in FIG. 1: $W = \langle T, D \rangle$, where $W$ is the workflow to be scheduled. $T = \{t_i \mid i = 1, \dots, n\}$ is the task set of $W$, and $t_i$ is the $i$-th task of $W$. $D = \{d_{ij} \mid i, j = 1, \dots, n\}$ is the set of transmitted data volumes in $W$, where $d_{ij}$ is the amount of data that $t_i$ must transmit to $t_j$. The transmission time can be calculated as $dt_{ij} = d_{ij}/bw$ ($bw$ is the transmission bandwidth between $t_i$ and $t_j$ and is a constant by default).
The structure of the workflow is abstracted and represented as a DAG, that is, a directed weighted graph in which each predecessor task is connected to its successor task, connected tasks have a data dependence, and each edge carries a weight, namely the transmission time between the two tasks. This is a typical data-flow structure, and it provides the theoretical basis for the subsequent workflow scheduling.
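As a minimal sketch of this abstraction, the workflow $W = \langle T, D \rangle$ can be held as a task count plus a dictionary of edge data volumes, with $d_{ij}/bw$ as the edge weight. The class and method names below are illustrative assumptions and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """W = <T, D>: tasks 1..n and data volumes d[(i, j)] on the DAG edges."""
    n: int                                  # number of tasks
    d: dict = field(default_factory=dict)   # (i, j) -> data volume d_ij
    bw: float = 1.0                         # bandwidth, a constant by default

    def transfer_time(self, i: int, j: int) -> float:
        """Transmission time d_ij / bw; 0 when there is no edge i -> j."""
        return self.d.get((i, j), 0) / self.bw

    def predecessors(self, j: int) -> list:
        """Tasks t_i with d_ij > 0, i.e. the data predecessors of t_j."""
        return sorted(i for (i, k), vol in self.d.items() if k == j and vol > 0)


# Three edges taken from Table 1 of the example (bw = 1 there).
w = Workflow(n=5, d={(1, 2): 43, (1, 5): 70, (2, 5): 46})
print(w.transfer_time(1, 5))   # 70.0
print(w.predecessors(5))       # [1, 2]
```

Storing only the edges with $d_{ij} > 0$ keeps the structure sparse, which matches the DAG view in which an absent edge means no data dependence.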
Step 2, defining the task candidate service providers.
In a cloud environment there are several service providers in different regions, and each service provider is assumed to rent one server and to use its hardware resources to perform various computing tasks. The description of each computing task in the workflow contains no specific processing details; a task can be completed by several candidate service providers running different algorithms, so the same task has different execution times when handed to different service providers.
The correspondence among tasks, service providers, and servers in a workflow is shown in FIG. 2. Each task may have several candidate service providers. Each service provider corresponds to one server and can handle several tasks, but not necessarily every task. Cloud workflow scheduling decides which service provider each task is handed to, that is, on which server it is scheduled for execution. When $d_{ij} > 0$, direct data transmission is needed between the two tasks; if the two tasks select the same service provider, the data transmission time is 0, otherwise it cannot be ignored. If a service provider can process only a single task, both the input and the output data of that task must be transmitted between this service provider and other service providers. To reduce transmission time further, service providers that can handle more tasks are preferred as candidates when screening service providers.
Assume that there are $m$ service providers $s_p$ ($p = 1, \dots, m$) in the service provider set $S$ participating in the scheduling of the workflow. To express the association between each candidate service provider and the tasks to be scheduled more precisely, the following definition is given: $ST = \{ST_p = \{\langle t_i, et_{pi} \rangle \mid et_{pi} \text{ is the execution time of } t_i \text{ on } s_p,\ i \in \{1, \dots, n\}\},\ p = 1, \dots, m\}$.
This step fully sets out the concept of task candidate service providers: each task is given its candidate service providers, the candidates execute the task with different execution times, and this supplies the data for subsequently calculating the workflow completion time.
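The definition of $ST$ can be illustrated with a nested dictionary, from which the per-task candidate set (called $CS_j$ in step 51) is derived by inverting the mapping. This is only a sketch: the provider names, the execution times, and the function name `candidates` are made up for illustration, and the real values of the embodiment are those of FIG. 4.

```python
# ST[p] maps each task t_i that provider s_p can execute to et_pi.
# All numbers here are hypothetical and chosen only for illustration.
ST = {
    "s1": {1: 20, 2: 35},
    "s2": {2: 30, 3: 25},
    "s3": {1: 28, 3: 40},
}


def candidates(ST: dict, j: int) -> dict:
    """CS_j: the providers able to execute t_j, with their execution times."""
    return {p: tasks[j] for p, tasks in ST.items() if j in tasks}


print(candidates(ST, 2))  # {'s1': 35, 's2': 30}
```

A provider that cannot execute a task simply has no entry for it, which reflects the statement that a service provider can handle several tasks but not necessarily every task.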
Step 3, the workflow scheduling framework.
Different task scheduling results lead to different time costs in both execution time and transmission time. A scheduling method is provided here to reduce the completion time of the entire workflow.
The basic flow of workflow scheduling is to divide the workflow $W$ into several stages for scheduling, and the basic allocation strategy is to make the completion time of all tasks in the current stage as early as possible. Scheduling is divided into stages mainly because global scheduling is too complex and is an NP-hard problem, whereas staged scheduling computes a good scheduling result for each stage in turn, so that the final scheduling result is relatively good. Because scheduling proceeds stage by stage, the strategy is better suited to workflows whose inter-task transmission times are fairly uniform. According to this scheduling idea and the content of $ST_p$, candidate service providers are determined for the tasks and the best service provider is selected for each task until all tasks are allocated. The scheduling result is
Figure BDA0002365173500000051
Figure BDA0002365173500000052
$r_i$ is the allocation result of $t_i$, where $s_p$ is the service provider finally selected for $t_i$ and $rft_i$ is the actual completion time of $t_i$. The workflow scheduling framework is shown in FIG. 3.
This step provides a workflow scheduling framework built on the idea of stage division, ensuring that a suitable workflow can be scheduled according to the framework.
Step 4, dividing the workflow into stages.
A task in the workflow can execute only after its predecessor tasks have finished and their data has been transmitted to it. For a workflow given in the form defined in step 1, the potential temporal precedence among tasks can be derived from data dependence, so the division into allocation stages is based mainly on data dependence. $TS_u = \{t_i \mid t_i \text{ is a task of the } u\text{-th stage}\}$, $u = 1, 2, \dots$, where $TS_u$ is the task set of the $u$-th stage and $|TS_u|$ is the number of tasks in the $u$-th stage. The specific division procedure is as follows:
Step 41, the in-degree of the starting node is 0, that is, no node can reach it; the starting node is placed first, i.e. it is assigned to the starting stage.
Step 42, remove the nodes that have already been assigned to a stage, and select the nodes of the next stage from the nodes remaining in the graph; these nodes must have in-degree 0 in the current graph.
Step 43, after the division of the previous stage is finished, continue to execute step 42 until all nodes have been assigned to a stage.
In this step the workflow is divided into stages. Because a task can execute only after its predecessors have finished and transmitted their data to it, and because the temporal precedence among tasks can be derived from the data dependences of the workflow structure abstracted in step 1, the division into allocation stages is based mainly on data dependence.
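Steps 41 to 43 amount to repeatedly peeling off the nodes whose in-degree is 0. The sketch below is one possible implementation under that reading; the function name `divide_stages` is an assumption, and the edge list is the set of pairs $(i, j)$ with $d_{ij} > 0$ listed in Table 1 of the example.

```python
from collections import defaultdict


def divide_stages(n: int, edges: list) -> list:
    """Split tasks 1..n into stages by repeatedly removing in-degree-0 nodes."""
    indeg = {t: 0 for t in range(1, n + 1)}
    succ = defaultdict(list)
    for i, j in edges:
        succ[i].append(j)
        indeg[j] += 1

    stages = []
    current = [t for t in indeg if indeg[t] == 0]   # step 41: starting stage
    while current:
        stages.append(sorted(current))
        nxt = []
        for t in current:                           # step 42: remove staged nodes
            for s in succ[t]:
                indeg[s] -= 1
                if indeg[s] == 0:                   # in-degree 0 in the current graph
                    nxt.append(s)
        current = nxt                               # step 43: repeat until done
    return stages


# Edges (i, j) with d_ij > 0, taken from Table 1 of the example.
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 5), (3, 6), (3, 7), (3, 11),
         (4, 7), (4, 8), (4, 13), (5, 9), (6, 9), (6, 10), (6, 11), (6, 16),
         (7, 12), (7, 13), (8, 14), (8, 15), (9, 16), (10, 17), (11, 17),
         (12, 18), (13, 18), (14, 18), (15, 19), (15, 20), (16, 20),
         (17, 20), (18, 20), (19, 20)]

stages = divide_stages(20, EDGES)
for u, stage in enumerate(stages, start=1):
    print(u, stage)
```

On the example data this yields six stages, and the fourth stage contains $t_9$ through $t_{15}$, which agrees with the fourth-stage walk-through in the example, where $t_9$ and $t_{10}$ are allocated.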
Step 5, task scheduling.
Step 51, candidate completion time calculation
For a task $t_j$ to be allocated, the set of candidate service providers and their execution times is $CS_j = \{\langle s_p, et_{pj} \rangle \mid s_p \in S, t_j \in T\}$. When several allocated tasks $t_i$ point to the task $t_j$ to be allocated, the following formula is applied to each service provider $s_p$ in $CS_j$ to calculate the completion time $ft_{pj}$ of $t_j$ executed on that service provider. $r_i$ is the allocation result of $t_i$, where $r_i.s_q$ is the service provider finally selected for $t_i$ and $rft_i$ is the actual completion time of $t_i$.
Figure BDA0002365173500000061
Figure BDA0002365173500000062
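The formula itself is only available as an image, so the sketch below rests on an assumed form: $t_j$ becomes ready when the data of its latest predecessor has arrived, with a transfer time of 0 when the predecessor ran on the same service provider and $d_{ij}/bw$ otherwise, and $ft_{pj}$ is that ready time plus $et_{pj}$. The function name, argument layout, and all numbers are hypothetical.

```python
def completion_time(j, p, et_pj, preds, result, d, bw=1.0):
    """Candidate completion time ft_pj of task j on provider p.

    preds  : already-allocated predecessor tasks i of j
    result : i -> (provider chosen for i, actual completion time rft_i)
    d      : (i, j) -> data volume d_ij
    Assumed form: latest data arrival over all predecessors, plus et_pj.
    """
    ready = 0.0
    for i in preds:
        s_q, rft_i = result[i]
        # No transfer cost when both tasks run on the same provider.
        dt = 0.0 if s_q == p else d[(i, j)] / bw
        ready = max(ready, rft_i + dt)
    return ready + et_pj


# Hypothetical state: t_1 finished on s4 at time 30, t_2 on s2 at time 90.
result = {1: ("s4", 30.0), 2: ("s2", 90.0)}
d = {(1, 5): 70, (2, 5): 46}
print(completion_time(5, "s2", 40, [1, 2], result, d))  # 140.0
print(completion_time(5, "s1", 40, [1, 2], result, d))  # 176.0
```

In this toy case placing $t_5$ on $s_2$ avoids the transfer from $t_2$, so the task finishes earlier than on $s_1$ even though the execution time is the same, which is the effect the method aims to exploit.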
Step 52, the completion times of the tasks of the current stage on their candidate service providers are arranged in a matrix $A_{u(n \times m)}$ to carry out the allocation: $A_{u(n \times m)} = [\langle s_p, ft_{pi} \rangle]_{n \times m}$, the matrix formed by the completion times of all tasks to be allocated in stage $u$ on their different candidate service providers. Each row holds the completion times of one task on its different candidate service providers, sorted in ascending order; $ft_{pi}$ is the completion time of $t_i$ executed by $s_p$.
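A small sketch of how such a matrix could be built: each row is the list of $\langle s_p, ft_{pi} \rangle$ pairs of one task, sorted by completion time. The function name `build_matrix` and the completion times are assumptions for illustration; rows may have different lengths because not every service provider can execute every task.

```python
def build_matrix(ft: dict) -> dict:
    """ft: task -> {provider: completion time}.

    Returns task -> row of (provider, completion time) pairs in ascending order.
    """
    return {
        t: sorted(row.items(), key=lambda item: item[1])
        for t, row in ft.items()
    }


# Hypothetical completion times for a three-task stage.
ft = {
    1: {"s1": 12, "s2": 15, "s3": 11},
    2: {"s2": 20, "s3": 14},
    3: {"s1": 18, "s2": 13, "s4": 16},
}
A = build_matrix(ft)
print(A[1])  # [('s3', 11), ('s1', 12), ('s2', 15)]
print(A[2])  # [('s3', 14), ('s2', 20)]
```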
Step 53, determining, based on $A_{u(n \times m)}$, the candidate service providers $FS_i$ participating in the allocation and the minimum column $x$
First, the set $FS_i$ of candidate service providers newly added in the $i$-th column is determined from $A_{u(n \times m)}$, $FS_i = \{s_p \mid s_p \in CS\}$, where $|FS_i|$ is the size of the set; the specific process is shown in Algorithm 1. Then the minimum $x$ satisfying the following condition must be found:
Figure BDA0002365173500000063
Finding the minimum such $x$ ensures that the number of service providers participating in the allocation exceeds the number of tasks in the stage, so that the tasks can run in physical parallel. The specific process is shown in Algorithm 2.
Figure BDA0002365173500000071
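Algorithms 1 and 2 are only available as images, so the following is a sketch under two stated assumptions: $FS_i$ is read as the set of service providers that appear in column $i$ of the sorted matrix but in no earlier column, and the condition on $x$ is read as $\sum_{i=1}^{x} |FS_i| \ge |TS_u|$. The function names and the sample rows are hypothetical.

```python
def new_providers_per_column(rows: list) -> list:
    """rows: per-task provider lists, each sorted by completion time.

    Returns FS, where FS[i] is the set of providers first seen in column i.
    """
    width = max(len(r) for r in rows)
    seen, FS = set(), []
    for col in range(width):
        column = {r[col] for r in rows if col < len(r)}
        FS.append(column - seen)
        seen |= column
    return FS


def min_columns(FS: list, n_tasks: int):
    """Smallest x (1-based) with |FS_1| + ... + |FS_x| >= n_tasks, else None."""
    total = 0
    for x, fs in enumerate(FS, start=1):
        total += len(fs)
        if total >= n_tasks:
            return x
    return None


rows = [["s1", "s2", "s3"],
        ["s1", "s3", "s4"],
        ["s2", "s1"]]
FS = new_providers_per_column(rows)
print([sorted(fs) for fs in FS])   # [['s1', 's2'], ['s3'], ['s4']]
print(min_columns(FS, 3))          # 2
```

The threshold is written as "at least" rather than "strictly more" because the example uses seven service providers in total, so a strict inequality could never hold for a stage with seven tasks.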
Step 54, carrying out the allocation of the current stage based on $FS_i$ and $x$
The main purpose of this step is to allocate all tasks of the stage to different service providers based on the $FS_i$ and $x$ given in step 53. Take all service providers in the $x$-th column of $A_{u(n \times m)}$ that belong to $FS_x$ and sort them by $ft_{pi}$ in ascending order; the sorted result is $\langle s_p, ft_{pi} \rangle, \dots$, where $s_p \in FS_x$. Select the smallest $ft_{pi}$ in the sorted result and confirm that the task executed by this service provider is $t_i$. The remaining tasks of the current stage drop $s_p$ from their candidate service providers. The screening condition is then checked: the total number of candidate service providers of the remaining tasks is greater than or equal to the number of unallocated tasks. If the screening condition is met, a feasible solution may exist among the remaining candidate service providers. The specific process is shown in Algorithm 3.
Figure BDA0002365173500000072
Figure BDA0002365173500000081
If the condition is met, that is, a feasible solution may currently exist, update the matrix: discard the row of $t_i$, remove the $s_p$ selected by $t_i$ from the other tasks, recompute $FS_i$ and $x$ for the new matrix from step 53, and then perform the screening condition check in step 3. If all service providers in the sorted result have been screened or no feasible solution exists, select the $s_p$ with the smallest completion time in the sorted result and confirm the task that selects $s_p$. The remaining tasks of the current stage drop $s_p$, the matrix is updated, and the process returns to steps 53 and 54. A detailed description of the entire allocation method is given in Algorithm 4.
Figure BDA0002365173500000082
Figure BDA0002365173500000091
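Algorithms 3 and 4 are only available as images, so the code below is not the patented procedure. It is a deliberately simplified greedy sketch of the same idea, under loud assumptions: each service provider takes at most one task per stage, the screening condition is interpreted as "every other unallocated task still has a candidate, and the remaining distinct candidates are at least as many as the remaining tasks", and the column-wise restriction to $FS_x$ is omitted. The function name `allocate_stage` and the numbers are hypothetical.

```python
def allocate_stage(ft: dict) -> dict:
    """ft: task -> {provider: completion time}. Returns task -> provider.

    Simplified greedy (an assumption, not Algorithm 4): pick the fastest
    <task, provider> pair that passes screening, else the fastest pair.
    """
    remaining = {t: dict(row) for t, row in ft.items()}
    assignment = {}

    def screened(task, provider):
        """True if the other tasks can still plausibly all be served."""
        others = [set(row) - {provider}
                  for t, row in remaining.items() if t != task]
        pool = set().union(*others)
        return all(others) and len(pool) >= len(others)

    while remaining:
        triples = sorted(
            (time, t, p) for t, row in remaining.items() for p, time in row.items()
        )
        if not triples:
            raise ValueError("no free provider left for tasks %s" % sorted(remaining))
        time, t, p = next(
            (x for x in triples if screened(x[1], x[2])), triples[0]
        )
        assignment[t] = p
        del remaining[t]
        for row in remaining.values():      # the other tasks give up provider p
            row.pop(p, None)
    return assignment


# Hypothetical completion times; task 10 can only run on s5.
ft = {
    9:  {"s5": 12, "s6": 15},
    10: {"s5": 14},
    11: {"s1": 30, "s5": 13},
}
print(sorted(allocate_stage(ft).items()))  # [(9, 's6'), (10, 's5'), (11, 's1')]
```

In this toy stage the globally fastest pair, $t_9$ on $s_5$, is rejected by the screening check because it would leave $t_{10}$ without any candidate, which illustrates why the check precedes each confirmation.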
In this step, for each task to be allocated, the completion time of the task on each service provider is calculated; the completion times of the tasks of the current stage on their candidate service providers are arranged in a matrix; and all tasks are allocated from back to front to different service providers according to the scheduling algorithm, until the tasks of all stages have been allocated.
Example
To better explain the technical content of the invention, a specific scheduling example is given and described with reference to the accompanying drawings.
The workflow information and the service provider information are as follows. The workflow $W$ contains 20 tasks in total, where $t_1$ is the start task and $t_{20}$ is the end task. The values of $d_{ij}$ are listed in Table 1. For convenience of subsequent calculation, $bw$ is taken to be 1 in this example, so the transmission time between tasks is $d_{ij}/bw = d_{ij}$.
TABLE 1 Data transmission volume d_ij between tasks
d_ij      Value    d_ij      Value    d_ij      Value
d_1,2     43       d_1,3     37       d_1,4     25
d_1,5     70       d_2,5     46       d_3,6     53
d_3,7     40       d_3,11    76       d_4,7     29
d_4,8     38       d_4,13    88       d_5,9     57
d_6,9     34       d_6,10    29       d_6,11    40
d_6,16    104      d_7,12    42       d_7,13    55
d_8,14    58       d_8,15    57       d_9,16    74
d_10,17   69       d_11,17   49       d_12,18   50
d_13,18   46       d_14,18   31       d_15,19   38
d_15,20   104      d_16,20   43       d_17,20   60
d_18,20   32       d_19,20   35
The candidate service provider set for the entire workflow is $S = \{s_1, s_2, s_3, s_4, s_5, s_6, s_7\}$. The candidate service providers of each task and their execution times are shown in FIG. 4.
According to the allocation scheme of the initial stage, $t_1$ is executed by $s_4$. The scheduling of each subsequent stage builds on the result of the previous stage: the completion time of each task of the current stage on its different service providers is calculated first, the completion times of each task are sorted in ascending order, and the tasks are allocated to different service providers according to Algorithm 4.
Taking the fourth stage as an example, allocation step 1 is executed first, giving $x = 3$ and $FS_3 = \{s_5, s_6\}$. According to the possible current feasible solutions in the first three columns, the completion times are sorted in ascending order: $\langle s_5, 342 \rangle, \langle s_6, 350 \rangle, \langle s_6, 381 \rangle, \langle s_6, 383 \rangle$. $t_9$ is executed by $s_5$; if the screening condition is not met, $t_{10}$ is selected in order and executed by $s_6$, and the remaining tasks discard $s_6$. Allocation step 1 is then executed again, giving $x = 3$, and the subsequent allocation results are obtained according to Algorithm 4.
In the task allocation diagram of each stage, each row gives the completion times of one task on different service providers, arranged in ascending order from left to right, with the scheduling result of the task shown in bold. As can be seen from FIG. 5, the final completion time of workflow $W$ is 637. Compared with other workflow scheduling methods, the workflow completion time is optimized to a certain extent.
In summary, the invention provides a data-intensive workflow scheduling method based on stage division in a cloud environment. It optimizes the overall completion time of data-intensive workflows in the cloud environment, provides support for scheduling methods in practical applications, and helps improve workflow execution efficiency. The invention carries the traditional workflow scheduling idea over to data-intensive workflows in the cloud environment.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (4)

1. A data intensive workflow scheduling method based on stage division in a cloud environment is characterized by comprising the following steps:
step 1, abstracting a workflow structure: obtaining workflow information, establishing a DAG graph according to the workflow information, and representing the workflow by the DAG graph, W = <T, D>, where W is the workflow to be scheduled and comprises n tasks; T = {t_i | i = 1 … n}, where T denotes the task set of the workflow W and t_i denotes the i-th task of the workflow W; D = {d_ij | i, j = 1 … n}, where D denotes the set of transmitted data volumes in the workflow W and d_ij represents the amount of data that t_i needs to transfer to t_j; the transmission time is d_ij / bw, where bw is the transmission bandwidth between t_i and t_j;
step 2, defining a task candidate service provider:
in the cloud environment there are a plurality of service providers in different regions, and each service provider rents out servers that execute various computing tasks using the service provider's hardware resources; the description of each computing task in the workflow does not contain specific processing details, the task can be completed by a plurality of candidate servers executing different algorithms, and the same task submitted to different servers corresponds to different execution times;
each task has a plurality of candidate service providers; the cloud workflow scheduling process determines which server each task should be handed to for completion, that is, on which server the task is scheduled for execution;
when d_ij > 0, the data transmission time is 0 if the two tasks select the same service provider; otherwise the data transmission time cannot be ignored; if a service provider can process only a single task, the input and output data of that task must be transmitted between this service provider and other service providers;
the service provider set S contains m service providers s_p, p = 1 … m, that participate in the scheduling of the workflow, and the association between each candidate service provider and the tasks to be scheduled is: ST = {ST_p = {<t_i, et_pi> | et_pi represents the execution time of t_i on service provider s_p, t_i ∈ T, i ∈ {1 … n}}, p = 1 … m}, where ST denotes the set of all ST_p, ST_p represents all tasks t_i that service provider s_p can execute, n represents that there are n tasks, and <t_i, et_pi> represents the correspondence between task t_i and the execution time of t_i on service provider s_p;
step 3, dividing the workflow W into a plurality of stages for scheduling, so that all tasks of the current stage complete as early as possible; staged scheduling successively computes a good scheduling result for each stage, so that the final scheduling result is relatively good, and because it schedules stage by stage, the strategy is better suited to workflows whose transmission times among tasks are relatively uniform; according to the specifics of ST_p, candidate service providers are determined for each task and the best service provider is selected for it, until all tasks are allocated; the scheduling result is
Figure FDA0003756623320000011
where r_i is the allocation result of t_i and satisfies the condition that <t_i, et_pi> belongs to ST_p, wherein: s_p represents the service provider finally selected for t_i, rft_i represents the actual completion time of t_i, and R represents the set of allocation results of all tasks;
step 4, dividing the workflow into allocation stages: a task in the workflow can be executed only after all of its predecessor tasks have finished and transmitted their data to it; the potential temporal precedence among tasks is derived from data dependence, so the division of the workflow into allocation stages is mainly based on data dependence; TS_u = {t_u | t_u is a task of stage u}, u = 1, 2 … l, where TS_u denotes the task set of the u-th stage and |TS_u| represents the number of tasks in the u-th stage;
step 5, task scheduling:
step 51, candidate completion time calculation:
for a task t_j to be allocated, its candidate service providers and the corresponding execution time set are denoted as CS_j = {<s_p, ft_pj> | s_p ∈ S, t_j ∈ T}, where s_p represents a service provider and S represents the set of service providers; when several allocated tasks t_i point to the task t_j to be allocated, the completion time ft_pj of t_j executed on each service provider s_p in CS_j is calculated;
step 52, calculating the completion time of each task of the current stage on its candidate service providers and arranging the data in a matrix A_u(n*m) in order to complete the allocation, A_u(n*m) = [<s_p, ft_pi>]_(n*m), where A_u(n*m) is the matrix formed by the completion times of all tasks to be allocated in the u-th stage when executed on the different candidate service providers; each row corresponds to the completion times of one task when executed on the different candidate service providers, with the values arranged in ascending order, and ft_pi denotes the completion time of t_i executed by s_p;
step 53, determining, based on A_u(n*m), the candidate service providers FS_i participating in the allocation and the minimum column x:
first, the set FS_i of candidate service providers newly added in the i-th column is determined from A_u(n*m), FS_i = {s_p | s_p ∈ S}, where |FS_i| represents the size of the set; then it is necessary to find the minimum x satisfying the condition
Figure FDA0003756623320000021
this minimum x ensures that the number of service providers participating in the allocation exceeds the number of tasks in the stage, so that physical parallelism is achieved;
step 54, carrying out the allocation of the current stage based on FS_i and x:
based on the candidate service providers FS_i participating in the allocation and the minimum column x obtained in step 53, all tasks are assigned to different service providers; column x of A_u(n*m) and all service providers corresponding to FS_x are observed and sorted by ft_pi in ascending order, the sorted result being {<s_p, ft_pi>, …}, where s_p ∈ FS_x; the smallest ft_pi in the sorted result is selected, confirming that the task executed by the current service provider is t_i; the candidate service providers of the remaining tasks of the current stage discard this s_p, and it is observed whether the current situation meets the screening condition, which judges whether the remaining candidate service providers admit a feasible solution: the total number of candidate service providers of the remaining tasks is greater than or equal to the number of unallocated tasks;
if the condition is met, namely a feasible solution currently exists, the matrix is updated: the row of t_i is removed from the matrix and the s_p selected by t_i is discarded from the other tasks; FS_i and x of the new matrix are then calculated again from step 53, followed by the screening condition judgment in step 53; if all service providers in the sorted result have been screened or there is no feasible solution, the s_p with the minimum completion time in the sorted result is selected and the current task that selects this s_p is confirmed; the remaining tasks of the current stage discard this s_p, and the matrix is updated.
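Steps 52 and 53 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the patented implementation: the function names and the sample data are hypothetical, and because the condition on x is given only as a formula image, the sketch assumes that x is the smallest column index at which the cumulative number of newly added candidate service providers reaches the number of tasks in the stage.

```python
def build_rows(completion):
    """Step 52 (sketch): one row per task, holding (provider, completion time)
    pairs sorted in ascending order of completion time.

    completion: dict mapping task -> {provider: completion time}.
    """
    return {
        task: sorted(times.items(), key=lambda pair: pair[1])
        for task, times in completion.items()
    }


def new_providers_per_column(rows):
    """Step 53 (sketch): FS_i is taken to be the set of providers that appear
    in column i but in no earlier column."""
    width = max(len(row) for row in rows.values())
    seen = set()
    fs = []
    for col in range(width):
        column = {row[col][0] for row in rows.values() if col < len(row)}
        fs.append(column - seen)
        seen |= column
    return fs


def minimum_x(rows):
    """ASSUMPTION: x is the smallest column count whose cumulative |FS_i|
    is at least the number of tasks in the stage. Returns (x, fs);
    x is None when no column count satisfies the assumed condition."""
    fs = new_providers_per_column(rows)
    total = 0
    for x, new in enumerate(fs, start=1):
        total += len(new)
        if total >= len(rows):
            return x, fs
    return None, fs
```

For a hypothetical stage with three tasks whose cheapest provider is the same in every row, the first column contributes only one provider, so the sketch has to look at a second column before enough distinct providers are available to run the three tasks in parallel.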
2. The data intensive workflow scheduling method based on stage division in the cloud environment according to claim 1, wherein the method for dividing the workflow into allocation stages in step 4 comprises the following steps:
step 41, the in-degree of the starting point is 0, that is, no node points to the starting point; the starting point is placed first, namely it is assigned to the starting stage;
step 42, removing the nodes of the stages already divided, and selecting the nodes of the next stage from the remaining nodes of the resulting graph, these nodes being required to have an in-degree of 0 in the current graph;
step 43, after the division of the previous stage is finished, continuing to execute step 42 until all nodes have been divided.
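The stage division of steps 41 to 43 amounts to repeatedly peeling off the nodes whose in-degree is 0. The following Python sketch shows one way to do this; the function name, the edge-list input format and the cycle check are illustrative choices rather than part of the claim.

```python
from collections import defaultdict


def divide_stages(tasks, edges):
    """Split a DAG into stages by repeatedly removing in-degree-0 nodes.

    tasks: iterable of task identifiers.
    edges: iterable of (predecessor, successor) pairs (data dependencies).
    Returns a list of stages, each a list of tasks.
    """
    tasks = list(tasks)
    indegree = {t: 0 for t in tasks}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    stages = []
    # Step 41: nodes with in-degree 0 form the starting stage.
    current = [t for t in tasks if indegree[t] == 0]
    while current:
        stages.append(current)
        next_stage = []
        # Step 42: remove the nodes just divided; successors whose
        # in-degree drops to 0 belong to the next stage.
        for t in current:
            for succ in successors[t]:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    next_stage.append(succ)
        # Step 43: repeat until every node has been divided.
        current = next_stage

    if sum(len(stage) for stage in stages) != len(tasks):
        raise ValueError("the graph contains a cycle and is not a DAG")
    return stages
```

A task therefore lands in the stage that follows the latest stage of any of its predecessors, which matches the execution condition that all predecessor tasks must have finished.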
3. The data intensive workflow scheduling method based on stage division in the cloud environment according to claim 2, wherein in step 51, for each service provider s_p in CS_j, the completion time ft_pj of t_j executed on that service provider is calculated as:
Figure FDA0003756623320000031
Figure FDA0003756623320000032
wherein Δd_ij represents the amount of data transmitted from task t_i to task t_j, r_i is the allocation result of t_i, r_i.s_q represents the service provider finally selected for t_i, and rft_i represents the actual completion time of t_i.
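The candidate completion time calculation of step 51 can be sketched in Python as follows. The formulas of this claim are available here only as images, so the sketch rests on an assumption drawn from the surrounding text: a task can start once every predecessor has finished and its data has arrived, the transfer takes d_ij / bw between different service providers and 0 on the same one, and the completion time is that start time plus the execution time et_pj. All function and parameter names are hypothetical.

```python
def transfer_time(data_volume, src_provider, dst_provider, bandwidth):
    """Transfer time d_ij / bw; zero when both tasks share a provider
    or when there is no data to send."""
    if data_volume <= 0 or src_provider == dst_provider:
        return 0.0
    return data_volume / bandwidth


def completion_time(task, provider, exec_time, predecessors, data, results, bandwidth):
    """ASSUMED formula: max over predecessors of (rft_i + transfer time),
    plus the execution time of the task on the given provider.

    exec_time:    dict (provider, task) -> et_pj
    predecessors: dict task -> list of predecessor tasks
    data:         dict (t_i, t_j) -> d_ij
    results:      dict t_i -> (selected provider, actual completion time rft_i)
    """
    ready = 0.0
    for pred in predecessors.get(task, []):
        pred_provider, pred_finish = results[pred]
        arrival = pred_finish + transfer_time(
            data.get((pred, task), 0), pred_provider, provider, bandwidth
        )
        ready = max(ready, arrival)
    return ready + exec_time[(provider, task)]


def candidate_set(task, providers, exec_time, predecessors, data, results, bandwidth):
    """CS_j as a list of (provider, ft_pj) pairs in ascending order of ft_pj,
    restricted to providers that can execute the task."""
    candidates = [
        (p, completion_time(task, p, exec_time, predecessors, data, results, bandwidth))
        for p in providers
        if (p, task) in exec_time
    ]
    return sorted(candidates, key=lambda pair: pair[1])
```

Sorting the candidate set in ascending order yields exactly one row of the matrix used in step 52.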
4. The data intensive workflow scheduling method based on stage division in the cloud environment according to claim 3, wherein the transmission bandwidth bw is a constant.
CN202010033432.8A 2020-01-13 2020-01-13 Data intensive workflow scheduling method based on stage division in cloud environment Active CN111274009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010033432.8A CN111274009B (en) 2020-01-13 2020-01-13 Data intensive workflow scheduling method based on stage division in cloud environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010033432.8A CN111274009B (en) 2020-01-13 2020-01-13 Data intensive workflow scheduling method based on stage division in cloud environment

Publications (2)

Publication Number Publication Date
CN111274009A CN111274009A (en) 2020-06-12
CN111274009B true CN111274009B (en) 2022-08-30

Family

ID=71001892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010033432.8A Active CN111274009B (en) 2020-01-13 2020-01-13 Data intensive workflow scheduling method based on stage division in cloud environment

Country Status (1)

Country Link
CN (1) CN111274009B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114629959B (en) * 2022-03-22 2023-11-17 北方工业大学 Cloud environment context-aware internet traffic (IoT) service scheduling policy method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108628665A (en) * 2018-05-16 2018-10-09 天津科技大学 Task scheduling based on data-intensive scientific workflow and virtual machine integration method
CN110489214B (en) * 2019-06-19 2022-09-20 南京邮电大学 Dynamic task allocation for data intensive workflows in a cloud environment

Also Published As

Publication number Publication date
CN111274009A (en) 2020-06-12

Similar Documents

Publication Publication Date Title
Alkhanak et al. A hyper-heuristic cost optimisation approach for scientific workflow scheduling in cloud computing
CN107301500B (en) Workflow scheduling method based on key path task look-ahead
CN107992359B (en) Task scheduling method for cost perception in cloud environment
CN106447173A (en) Cloud workflow scheduling method supporting any flow structure
CN112364590B (en) Construction method of practical logic verification architecture-level FPGA (field programmable Gate array) wiring unit
CN113742089B (en) Method, device and equipment for distributing neural network computing tasks in heterogeneous resources
CN110008013A (en) A kind of Spark method for allocating tasks minimizing operation completion date
CN112148468B (en) Resource scheduling method and device, electronic equipment and storage medium
CN110347489B (en) Multi-center data collaborative computing stream processing method based on Spark
WO2020186872A1 (en) Expense optimization scheduling method for deadline constraint under cloud scientific workflow
Li et al. An effective scheduling strategy based on hypergraph partition in geographically distributed datacenters
CN106934539B (en) Workflow scheduling method with deadline and expense constraints
CN112306642B (en) Workflow scheduling method based on stable matching game theory
CN111274009B (en) Data intensive workflow scheduling method based on stage division in cloud environment
Guan et al. Fleet: Flexible efficient ensemble training for heterogeneous deep neural networks
CN110084507B (en) Scientific workflow scheduling optimization method based on hierarchical perception in cloud computing environment
CN110048966B (en) Coflow scheduling method for minimizing system overhead based on deadline
CN110489214B (en) Dynamic task allocation for data intensive workflows in a cloud environment
CN111026534B (en) Workflow execution optimization method based on multiple group genetic algorithms in cloud computing environment
CN110119268B (en) Workflow optimization method based on artificial intelligence
Nematpour et al. Enhanced genetic algorithm with some heuristic principles for task graph scheduling
CN116681245A (en) Method and device for selecting workers and dispatching tasks in crowdsourcing system
CN116800610A (en) Distributed data plane resource optimization method and system
CN110968428B (en) Cloud workflow virtual machine configuration and task scheduling collaborative optimization method
In et al. Policy-based scheduling and resource allocation for multimedia communication on grid computing environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 66, New Model Road, Gulou District, Nanjing City, Jiangsu Province, 210000

Applicant after: NANJING University OF POSTS AND TELECOMMUNICATIONS

Address before: No.9 Wenyuan Road, Yadong Xincheng District, Qixia District, Nanjing, Jiangsu Province, 210000

Applicant before: NANJING University OF POSTS AND TELECOMMUNICATIONS

GR01 Patent grant