CN112181620B - Big data workflow scheduling method for sensing service capability of virtual machine in cloud environment - Google Patents


Info

Publication number
CN112181620B
CN112181620B (application CN202011031844.4A)
Authority
CN
China
Prior art keywords
task
virtual machine
data
service
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011031844.4A
Other languages
Chinese (zh)
Other versions
CN112181620A (en)
Inventor
曹洁
张志锋
桑永宜
王博
崔霄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University of Light Industry
Original Assignee
Zhengzhou University of Light Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University of Light Industry filed Critical Zhengzhou University of Light Industry
Priority to CN202011031844.4A priority Critical patent/CN112181620B/en
Publication of CN112181620A publication Critical patent/CN112181620A/en
Application granted granted Critical
Publication of CN112181620B publication Critical patent/CN112181620B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5077 Logical partitioning of resources; Management or configuration of virtualized resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/4557 Distribution of virtual machine instances; Migration and load balancing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The invention relates to the technical field of big data, and in particular to a big data workflow scheduling method that senses the service capability of virtual machines in a cloud environment. It provides a measurement method for evaluating both a task's service-capability requirements on resources and a virtual machine's service-capability guarantee for tasks, supplying a necessary reference for the scheduled execution of big data parallel tasks, and proposes a service dynamic level scheduling algorithm that matches task service-capability requirements with virtual-machine service-capability guarantees. The method comprises the following steps: first, modeling and assumptions for the big data workflow and the cloud system; second, computation of service capability; third, dynamic scheduling of the big data workflow service.

Description

Big data workflow scheduling method for sensing service capability of virtual machine in cloud environment
Technical Field
The invention relates to the technical field of big data, in particular to a big data workflow scheduling method for sensing service capability of a virtual machine in a cloud environment.
Background
Massive big data must be processed in commercial and scientific fields such as banking, fraud detection, medical health, demand prediction, and scientific exploration, so big data processing technology has become very important in these industries. Most big data applications are composed of multiple interdependent compute-intensive and data-intensive tasks.
Big data tasks can be divided into CPU-intensive, data-intensive (involving a large amount of data analysis and processing), and I/O-intensive (transferring a large amount of data to subsequent tasks) types.
Huang et al. proposed a dynamic distributed scheduling algorithm, CASA, targeting the characteristics of resource distribution in cloud environments: each scheduling node runs a meta-scheduler that receives tasks submitted by local users and distributes them, while the scheduling nodes also share information to achieve load balancing among nodes. Mezmaz et al. took minimum completion time and minimum energy consumption as scheduling targets for dependent tasks in a cloud environment and proposed a hybrid scheduling algorithm based on a genetic algorithm and energy-consumption awareness, achieving energy optimization through dynamic voltage scaling. Other work centers scheduling on quality of service; in cloud computing, users' QoS requirements include task completion response time, scheduling budget, and system reliability and availability. John et al. proposed a cloud task scheduling and resource allocation method based on particle swarm optimization that fully considers the two QoS constraints of deadline and scheduling budget and has good universality and scalability. Li et al. proposed a task scheduling model based on QoS-level division for different QoS constraint requirements; for application tasks with lower QoS requirements, the strategy schedules tasks with an optimized Chord algorithm, ensuring that any application task with a QoS constraint obtains a satisfactory completion time. There are also scheduling methods that take economics as their principle and model: the ultra-large scale of cloud computing and its commercial operation mode make economic factors a key scheduling index. Wei et al., targeting cloud service providers, offered users various configuration combinations of virtual machines at different prices and proposed a big data processing workflow scheduling algorithm that minimizes operation cost while meeting the completion deadline.
Scholars at home and abroad have also conducted extensive research on the scheduling of big data processing tasks. In a cloud computing environment, Gaith et al. proposed a virtual-machine trust-aware big data task scheduling method comprising three stages: evaluating virtual machine trust levels, determining task priorities, and trust-aware scheduling onto virtual machines. Yanling et al. proposed an efficient job scheduling method for energy-consumption-aware big data applications, modeling the energy-aware fair scheduling problem as a multidimensional knapsack problem. Ruitao et al. proposed a big data transmission scheduling method that maximizes big data computing throughput in a cloud environment while minimizing application data retrieval time. Feilong et al. proposed a big data application scheduling method that maximizes network throughput through dynamic load balancing of a cloud data center; considering the correlation among input data and data locality, it establishes the correspondence among data, tasks, and nodes, takes minimizing data migration cost as its performance-improvement target, designs and optimizes a data placement strategy and scheduling mechanism, and constitutes a correlation-driven big data processing task scheduling scheme.
The documents above study task scheduling in cloud environments from different angles, but they mostly assume that tasks are compute-intensive and that the computational performance of resources is fixed; they consider neither the matching between task types and resource types nor how to measure the degree of that match. Compared with parallel task scheduling in traditional high-performance and grid computing, scheduling big data parallel tasks in the cloud is more complex: traditional parallel tasks are mostly compute-intensive, while big data parallel tasks may be compute-intensive, data-intensive, or I/O-intensive; the performance of computing resources in a traditional environment is relatively fixed, whereas the physical resources behind a virtual machine in a cloud environment change dynamically, so virtual machine performance is dynamic; cloud service providers are numerous, and the actual performance of their virtual machines often deviates from the declared performance; and virtual machines with different configurations differ greatly in their effect on different types of task processing.
Disclosure of Invention
To solve the above technical problems, the invention provides a measurement method for evaluating a task's service-capability requirements on resources and a virtual machine's service-capability guarantee for tasks, supplies a necessary reference for the scheduled execution of big data parallel tasks, and proposes a virtual-machine service-capability-aware big data workflow scheduling method in a cloud environment built on a service dynamic level scheduling algorithm that matches task service-capability requirements with virtual-machine service-capability guarantees.
The invention discloses a big data workflow scheduling method that senses the service capability of virtual machines in a cloud environment, comprising the following steps: first, modeling and assumptions for the big data workflow and the cloud system; second, computation of service capability; third, dynamic scheduling of the big data workflow service;
the first step further comprises the following steps:
Definition 1 (big data workflow): a big data workflow can be abstracted as a DAG graph, i.e. a 6-tuple DAG = (V, R, Q, D, O, C), where the concrete meaning of each element of the tuple is as follows:
(1) V = {v_1, v_2, ..., v_n} denotes the set of subtasks of the big data workflow, n the number of subtasks, and v_i the i-th subtask;
(2) R = {r_ij | v_i, v_j ∈ V} ⊆ V × V denotes the execution-order precedence and data-dependency relations among subtasks;
(3) Q = {q_1, q_2, ..., q_n} is the set of computation amounts of the subtasks, q_i ∈ Q denoting the computation amount of subtask v_i;
(4) D = {d_1, d_2, ..., d_n} is the set of data-processing amounts of the subtasks, d_i ∈ D denoting the data-processing amount of subtask v_i;
(5) O = {o_1, o_2, ..., o_n} denotes the set of I/O data amounts of the subtasks, o_i the I/O data amount of subtask v_i; the data to be communicated between a task and its successor must undergo I/O processing before transmission, such as encoding and encryption;
(6) C = {c_ij | v_i, v_j ∈ V} ⊆ V × V is the set of communication volumes between subtasks, c_ij denoting the volume of data communicated from subtask v_i to subtask v_j;
Definition 2 (cloud platform): a real cloud system can be abstractly described as a cluster of different types of virtual machines virtualized from multiple physical servers through virtualization technology, the virtual machines being connected by a network into a graph cloud computing system; it can be represented as a 6-tuple Cloud = (VM, CS, DS, OS, E, B), where the concrete meaning of each element of the tuple is as follows:
(1) VM = {vm_1, vm_2, ..., vm_m} denotes the set of virtual machines, m the total number of virtual machines, and vm_i the i-th virtual machine;
(2) CS = {cs_1, cs_2, ..., cs_m} denotes the set of virtual machine computing speeds, cs_i being the computing speed of vm_i, i.e. the computation amount processed per unit time;
(3) DS = {ds_1, ds_2, ..., ds_m} denotes the set of virtual machine data-processing speeds, ds_i being the data-processing speed of vm_i, i.e. the amount of data processed per unit time;
(4) OS = {os_1, os_2, ..., os_m} denotes the set of virtual machine I/O processing speeds, os_i being the I/O processing speed of vm_i, i.e. the amount of I/O data processed per unit time;
(5) E = {e_ij | vm_i, vm_j ∈ VM} denotes the set of communication links between virtual machines;
(6) B = {b_ij | vm_i, vm_j ∈ VM, e_ij ∈ E} is the set of communication bandwidths of the edges in E; b_ij ∈ B is the time taken by the inter-virtual-machine communication link e_ij = (vm_i, vm_j) ∈ E to transmit one unit of data.
The service capability of a virtual machine is evaluated through three evaluation indexes: its computing capability, data-processing capability, and I/O processing capability. Let X = {x_ij}_{m×3} be the sample data matrix of these 3 evaluation indexes obtained from m service transactions of cloud users. Because the indexes differ greatly in dimension, order of magnitude, and orientation, the initial data must be made dimensionless; the data are normalized in min-max fashion, with the following calculation formula:
$$y_{ij} = \frac{x_{ij} - \min_{1 \le i \le m}\{x_{ij}\}}{\max_{1 \le i \le m}\{x_{ij}\} - \min_{1 \le i \le m}\{x_{ij}\}}$$
where max{x_ij} and min{x_ij} denote the maximum and minimum values of the j-th index, respectively. Through this conversion the values of all indexes become forward-increasing values in the range [0, 1], so that larger index values are better. With the dimensionless normalized matrix Y = {y_ij}_{m×3}, the information entropy of each index is:
$$e_j = -k \sum_{i=1}^{m} y_{ij} \ln y_{ij}, \qquad j = 1, 2, 3$$
where the constant k is related to the number of samples m of the system. For a system whose information is completely disordered, the degree of order is zero and its entropy is maximal, e = 1; when the m samples are in a completely disordered distribution state, y_ij = 1/m, so:
$$e = -k \sum_{i=1}^{m} \frac{1}{m} \ln \frac{1}{m} = k \ln m = 1$$
thus, we obtain: k ═ 1 (lnm) -1;
since the information entropy ej can be used to measure the utility value of the information of the jth evaluation index, when the information entropy ej is completely unordered, ej is 1, at this time, the utility value of the information of ej to the comprehensive evaluation is zero, and therefore, the information utility value of a certain index depends on the difference hj between the information entropy ej of the index and 1: estimating the weight of each index by using an entropy method, wherein the weight is essentially calculated by using the utility value of index information, and the higher the utility value is, the greater the importance of the evaluation is, so that the weight of the j evaluation index is:
$$w_j = \frac{h_j}{\sum_{j=1}^{3} h_j} = \frac{1 - e_j}{\sum_{j=1}^{3} (1 - e_j)} \qquad (1)$$
where 0 ≤ w_j ≤ 1 and Σ_{j=1}^{3} w_j = 1.
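The entropy-weight derivation above can be sketched in Python; the helper name, the tiny epsilon, and the conversion of each column to proportions (so the entropies are well defined) are illustrative assumptions, not part of the patent:

```python
import numpy as np

def entropy_weights(X):
    """Entropy-weight method over an m x 3 sample matrix X
    (columns: computing, data-processing, I/O capability scores)."""
    X = np.asarray(X, dtype=float)
    m, _ = X.shape
    # Min-max normalization to forward-increasing values in [0, 1].
    Y = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # Treat each column as proportions; a tiny epsilon avoids ln(0).
    P = (Y + 1e-12) / (Y + 1e-12).sum(axis=0)
    K = 1.0 / np.log(m)                      # k = (ln m)^-1
    e = -K * (P * np.log(P)).sum(axis=0)     # information entropy e_j
    h = 1.0 - e                              # information utility h_j
    return h / h.sum()                       # weights w_j, summing to 1
```

The returned vector plays the role of (w_1, w_2, w_3) in formula (1).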
in the second step, based on the weight of each evaluation index obtained by user evaluation and the normalization processing of the calculation speed, data processing speed and I/O speed of each virtual machine, the service guarantee capability of the virtual machine and vm of the virtual machine are calculated in a linear weighting mode i SA (vm) for service capability guarantee i ) The definition is as follows:
Figure GDA0003791584800000055
wherein the content of the first and second substances,
Figure GDA0003791584800000056
are respectively cs i 、ds i 、os i Normalized value of (w) 1 、w 2 、w 3 Weights of evaluation indexes of cs, ds and os calculated by formula (1) respectively;
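The linear weighting of the normalized virtual-machine speeds can be sketched as follows; the helper name and the use of min-max normalization across the virtual machines are assumptions for illustration:

```python
import numpy as np

def service_guarantee(cs, ds, os_, w):
    """SA(vm_i) = w1*cs_i' + w2*ds_i' + w3*os_i', with each speed vector
    min-max normalized over the set of virtual machines."""
    S = np.column_stack([cs, ds, os_]).astype(float)
    S_norm = (S - S.min(axis=0)) / (S.max(axis=0) - S.min(axis=0))
    return S_norm @ np.asarray(w)   # one SA value per virtual machine
```

With equalized speed ratios the fastest machine receives SA = 1 and the slowest SA = 0, matching the forward-increasing [0, 1] convention above.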
in the second step, assuming that a big data parallel computing task includes n subtasks, the task computation amount, the data processing amount, and the I/O data processing amount of 3 dimensions of the n subtasks can be represented by An n row 3 column matrix An × 3 ═ aij) n × 3, z-score normalization processing is performed on each dimension, and the processed data will conform to the standard normal distribution by using the mean value and standard deviation of the original data, as shown in the following formula,
Figure GDA0003791584800000061
where μ_j is the mean of the j-th dimension of service-capability requirement in matrix A_{n×3} and σ_j its standard deviation. The matrix A'_{n×3} = (a_ij')_{n×3} obtained after z-score standardization has mean 0 and standard deviation 1 in each dimension. Through this conversion each dimension of the task's service-capability requirement is rescaled so that a larger value along a dimension indicates a larger demand for the corresponding type of service capability. The service-capability demand SD(v_i) of task v_i is defined as follows:
$$SD(v_i) = w_1\, a_{i1}' + w_2\, a_{i2}' + w_3\, a_{i3}'$$
where w_1, w_2, w_3 are the weights of the evaluation indexes cs, ds, os computed by formula (1);
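The service-capability demand of the subtasks can likewise be sketched; the helper name and the n x 3 matrix layout (computation, data volume, I/O volume per row) are assumptions:

```python
import numpy as np

def service_demand(A, w):
    """SD(v_i) = w1*a_i1' + w2*a_i2' + w3*a_i3' with z-score-standardized
    columns; A is n x 3 (computation, data, I/O amount per subtask)."""
    A = np.asarray(A, dtype=float)
    A_std = (A - A.mean(axis=0)) / A.std(axis=0)   # each column: mean 0, std 1
    return A_std @ np.asarray(w)                   # one SD value per subtask
```

Because every standardized column has zero mean, the SD values are centered around zero, measuring each subtask's demand relative to the average subtask.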
in step three, in order to fully consider the service guarantee capability of the virtual machine, the definition of the service dynamic level is as follows:
Figure GDA0003791584800000065
wherein, SA (v) i ,vm j ) Represent a task v i Scheduling to virtual machines vm j Service guarantee capability of the virtual machine when the virtual machine is executed; SD (v) i ) Representing subtasks v i Requirement for service capability, max { SD (v) } k ) Represents the maximum value of the service capability requirements of all the subtasks of the parallel task, for the task-virtual machine pair (v) i ,vm j ) When SD (v) i ) When increasing, i.e. task v i When the service capacity requirement of the virtual machine is increased, the scheduling level is correspondingly reduced, SL (vi) is a task static level, the maximum value of the average computing time sum of all tasks from the task vi to all reachable paths of the parallel task exit task is represented, and the importance of a task node on the execution priority level is implied;
Figure GDA0003791584800000071
indicating that task vi reflects virtual machine and communication resources at the time virtual machine vmj begins executionAvailability, penalizing pairs of task nodes that incur large communication costs, wherein
Figure GDA0003791584800000072
Indicating the time when the required input data is available when task vi is scheduled onto virtual machine vmj,
Figure GDA0003791584800000073
represents the time at which virtual machine vmj is idle and thus available to perform task vi;
Figure GDA0003791584800000074
reflecting the computing performance of the virtual machines, increasing the priority for the virtual machines with higher processing speed and decreasing the priority for the virtual machines with lower processing speed, wherein
Figure GDA0003791584800000075
Representing the average of the time required for task vi to execute on all machines,
Figure GDA0003791584800000076
indicating the time required for task vi to execute on virtual machine vmj.
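The per-pair computation can be sketched as a small function. The additive combination of the static level, the service-match term SA − SD/max SD, the earliest-start penalty, and the execution-time bonus is an assumption pieced together from the surrounding description, not the patent's verbatim formula:

```python
def service_dynamic_level(sl, sa, sd, sd_max, data_ready, vm_idle,
                          exec_avg, exec_time):
    """One plausible service dynamic level for a task-VM pair:
    static level SL, plus service-match bonus SA - SD/max(SD),
    minus the earliest-start penalty max(data_ready, vm_idle),
    plus the speed bonus E*(v) - E(v, vm)."""
    return (sl
            + (sa - sd / sd_max)
            - max(data_ready, vm_idle)
            + (exec_avg - exec_time))
```

A scheduler built on this would, at each step, pick the ready task and virtual machine maximizing this level, as in classic dynamic level scheduling.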
Drawings
FIG. 1 is an example diagram of a parallel task DAG graph;
FIG. 2 is a graphical block diagram of a cloud platform;
FIG. 3 is an average completion time for different numbers of subtasks;
FIG. 4 is an average completion time for different numbers of virtual machines;
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Example, as shown in fig. 1 to 4:
1. big data workflow and cloud system modeling and assumptions
1.1 DAG graph modeling of big data workflows
A big data workflow can be represented as a directed acyclic graph (DAG), in which nodes represent tasks and the edges between nodes represent data dependencies between tasks; the specific definition is as follows:
Definition 1 (big data workflow): a big data workflow can be abstractly represented as a DAG graph, i.e. a 6-tuple DAG = (V, R, Q, D, O, C), where the concrete meaning of each element of the tuple is as follows:
(1) V = {v_1, v_2, ..., v_n} denotes the set of subtasks of a big data workflow (also referred to herein as parallel tasks), n the number of subtasks, and v_i the i-th subtask.
(2) R = {r_ij | v_i, v_j ∈ V} ⊆ V × V denotes the execution-order precedence and data-dependency relations among subtasks.
(3) Q = {q_1, q_2, ..., q_n} is the set of computation amounts of the subtasks, q_i ∈ Q denoting the computation amount of subtask v_i.
(4) D = {d_1, d_2, ..., d_n} is the set of data-processing amounts of the subtasks, d_i ∈ D denoting the data-processing amount of subtask v_i.
(5) O = {o_1, o_2, ..., o_n} denotes the set of I/O data amounts of the subtasks, o_i the I/O data amount of subtask v_i; the data to be communicated between a task and its successor must undergo I/O processing before transmission, such as encoding and encryption.
(6) C = {c_ij | v_i, v_j ∈ V} ⊆ V × V is the set of communication volumes between subtasks, c_ij denoting the volume of data communicated from subtask v_i to subtask v_j.
FIG. 1 is a diagram of a parallel computing task comprising 8 subtasks: within each circle, v_i is the task number, q_i the computation amount of the task, d_i the data-processing amount of task v_i, and o_i the I/O amount of task v_i; the numbers next to the edges represent the communication volume between nodes. By preprocessing, the parallel task DAG graph can be assumed to have exactly one entry task and one exit task.
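The 6-tuple of Definition 1 maps directly onto a small data structure; the class and field names below are illustrative, not from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class BigDataWorkflow:
    """The 6-tuple DAG = (V, R, Q, D, O, C), tasks indexed 0..n-1."""
    n: int                                    # |V|, number of subtasks
    edges: set = field(default_factory=set)   # R: (i, j) precedence pairs
    q: list = field(default_factory=list)     # computation amount per task
    d: list = field(default_factory=list)     # data-processing amount per task
    o: list = field(default_factory=list)     # I/O data amount per task
    c: dict = field(default_factory=dict)     # C: (i, j) -> communication volume

    def predecessors(self, k):
        """pred(v_k): tasks whose output v_k depends on."""
        return {i for (i, j) in self.edges if j == k}
```

A scheduler would treat a task as ready once every member of `predecessors(k)` has finished.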
1.2 graphical modeling of cloud Environment
Server virtualization provides the virtual server with hardware resource abstractions that support its operation, including a virtual BIOS, virtual processors, virtual memory, and virtual devices and I/O, while also providing the virtual server with good isolation and security.
Definition 2 (cloud platform): a real cloud system can be abstractly described as a cluster of different types of virtual machines virtualized from multiple physical servers through virtualization technology, the virtual machines being connected by a network into a graph cloud computing system; it can be represented as a 6-tuple Cloud = (VM, CS, DS, OS, E, B), where the concrete meaning of each element of the tuple is as follows:
(1) VM = {vm_1, vm_2, ..., vm_m} denotes the set of virtual machines, m the total number of virtual machines, and vm_i the i-th virtual machine.
(2) CS = {cs_1, cs_2, ..., cs_m} denotes the set of virtual machine computing speeds, cs_i being the computing speed of vm_i, i.e. the computation amount processed per unit time.
(3) DS = {ds_1, ds_2, ..., ds_m} denotes the set of virtual machine data-processing speeds, ds_i being the data-processing speed of vm_i, i.e. the amount of data processed per unit time.
(4) OS = {os_1, os_2, ..., os_m} denotes the set of virtual machine I/O processing speeds, os_i being the I/O processing speed of vm_i, i.e. the amount of I/O data processed per unit time.
(5) E = {e_ij | vm_i, vm_j ∈ VM} denotes the set of communication links between virtual machines.
(6) B = {b_ij | vm_i, vm_j ∈ VM, e_ij ∈ E} is the set of communication bandwidths of the edges in E; b_ij ∈ B is the time taken by the inter-virtual-machine communication link e_ij = (vm_i, vm_j) ∈ E to transmit one unit of data.
FIG. 2 is a cloud platform graph topology containing 6 resource nodes: within each circle, vm_i denotes the virtual machine number, cs_i the computing speed of vm_i, ds_i its data-processing speed, and os_i its I/O processing speed; the numbers on the edges indicate the communication bandwidth of each link.
1.3 assumptions for big data workflow task scheduling problem
Parallel task scheduling on a graph cloud platform is the process of distributing each subtask of the parallel task DAG graph to resource nodes for parallel cooperative computing while fully respecting the dependencies among tasks. On the graph cloud platform the following is assumed: tasks execute non-preemptively and are not migrated; tasks are managed uniformly by a central scheduler, which assigns each subtask to a suitable virtual machine according to some strategy; the scheduler and the virtual machines operate independently; communication is controlled and executed by a communication subsystem, communication operations can execute concurrently, and communication conflicts are not considered for now. If two tasks with a dependency r_ij = (v_i, v_j) are assigned to the same virtual machine, the communication overhead between them is ignored; if they are assigned to two different virtual machines vm_s and vm_d, the communication overhead between them is the sum of the communication times of the data on each link.
Assume ct_ij denotes the communication completion time for task v_i to transmit data to task v_j through the graph cloud platform; t_s(v_k, vm_j) denotes the time at which virtual machine vm_j hosting task v_k becomes free and thus available to execute v_k; T_comm(v_k) = max{ct_ik | v_i ∈ pred(v_k)} denotes the time at which the data of all parent tasks of v_k have arrived, with pred(v_k) the set of predecessor tasks of v_k; t_e(v_k, vm_j) denotes the completion time of task v_k on virtual machine vm_j; and o_k denotes the I/O data amount of v_k. Then t_e(v_k, vm_j) is calculated as:

$$t_e(v_k, vm_j) = \max\{t_s(v_k, vm_j),\ T_{comm}(v_k)\} + \frac{q_k}{cs_j} + \frac{d_k}{ds_j} + \frac{o_k}{os_j}$$
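A hedged sketch of this completion-time computation; the decomposition of the running time into separate compute, data-processing, and I/O terms over the virtual machine's three speeds is an assumption consistent with the speed definitions of Definition 2:

```python
def completion_time(q_k, d_k, o_k, cs_j, ds_j, os_j, t_start, t_comm):
    """t_e(v_k, vm_j): the task starts once vm_j is free (t_start) and all
    parent data has arrived (t_comm), then pays its compute, data-processing
    and I/O costs at vm_j's respective speeds."""
    return max(t_start, t_comm) + q_k / cs_j + d_k / ds_j + o_k / os_j
```

For example, a task with q=10, d=20, o=5 on a VM with speeds cs=2, ds=4, os=1 that is free at t=3 but whose input data arrives at t=7 finishes at 7 + 5 + 5 + 5 = 22.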
2. computation of service capabilities
A cloud computing environment composed of different cloud virtual machines is generally called a heterogeneous cloud computing system. Because virtual machines with widely different performance coexist in it, there are many ways to allocate the subtasks of a parallel task to its virtual machines, and different allocations produce different computing effects. Whether a parallel task executes efficiently thus depends not only on the computing speed, data-processing speed, and communication data-processing speed of the virtual machines and the transmission speed of the communication links, but also on the degree of matching between the parallel task and the heterogeneous cloud computing system: a subtask that works well on one kind of virtual machine does not necessarily work well on another, and may instead perform poorly. The service-capability match between tasks and virtual machines must therefore be considered when studying the optimized execution of parallel tasks on a heterogeneous cloud computing system.
2.1 concept of service capabilities
Definition 3 (service capability): service capability refers to the degree to which a service system can provide services, and is generally defined as the maximum throughput rate of the service system.
From this definition, the service capability of a virtual machine is a value between 0 and 1: the larger the value, the higher the virtual machine's service capability. To discuss the concept of service-capability matching further, service capability is divided into service-capability demand and service-capability guarantee. Service-capability demand is relative to a parallel computing task, i.e. the strength of the task's demand for a service function; service-capability guarantee is relative to a computing resource (here, a virtual machine), i.e. the degree to which the computing resource can provide the service function.
2.2 service capability guarantee of cloud platform virtual machine
When virtual machines are allocated to tasks with different service capability demands, the service guarantee capability of the allocated virtual machine must match the task's service capability demand as closely as possible to ensure the task runs in the expected processing mode. The satisfaction of cloud users with previously submitted task executions is the most direct and obvious indicator of how well the service capability of a virtual machine matches, and whether a cloud user is satisfied with a service is obtained by combining the evaluation indexes of that service. In cloud computing, virtual machines differ greatly in performance, for example in computing capability, data processing capability, I/O capability, reliability and availability. Here the service capability of a virtual machine is evaluated through three evaluation indexes: its computing capability, data processing capability and I/O processing capability.
When a cloud user judges satisfaction with a service, the weights of the indexes generally differ, so an objective, entropy-weight-based method is adopted to determine the weight of each evaluation index. In information theory, entropy measures the degree of uncertainty of the state of an information source, so it is natural to use information entropy to evaluate the degree of order of system information and the utility value of that information.
Let the sample data matrix of the 3 evaluation indexes obtained from m service transactions of a cloud user be X = (x_ij)_{m×3}. Because the indexes differ in dimension, order of magnitude and orientation, the initial data must first be made dimensionless. For forward-increasing indexes such as computing capability, data processing capability and I/O capability, users prefer larger values, so the data are normalized by min-max standardization:

$$y_{ij} = \frac{x_{ij} - \min_{1 \le k \le m}\{x_{kj}\}}{\max_{1 \le k \le m}\{x_{kj}\} - \min_{1 \le k \le m}\{x_{kj}\}}$$
where max_k{x_kj} and min_k{x_kj} denote the maximum and minimum values of index j, respectively. This conversion turns the values of all indexes into forward-increasing values in the range [0, 1], so that a larger value of each index is better. Let the dimensionless normalized matrix be Y = (y_ij)_{m×3}; the information entropy of each index is then

$$e_j = -k \sum_{i=1}^{m} y_{ij} \ln y_{ij}, \qquad j = 1, 2, 3$$
where the constant k is related to the number of samples m. For a system whose information is completely disordered, the degree of order is zero and the entropy is maximal, e = 1. When the m samples are in a completely disordered distribution, y_ij = 1/m, so

$$e = -k \sum_{i=1}^{m} \frac{1}{m} \ln \frac{1}{m} = k \ln m = 1$$

and thus k = (ln m)^{-1}.
The information entropy e_j measures the utility value of the information of the j-th evaluation index. In the completely disordered case e_j = 1, and the information of the j-th index (i.e., its data) then contributes nothing to the comprehensive evaluation. The information utility value of an index therefore depends on the difference h_j = 1 − e_j between 1 and its entropy. The entropy method estimates the index weights from these utility values; the higher the utility value, the greater the importance in the evaluation. The weight of the j-th evaluation index is

$$w_j = \frac{h_j}{\sum_{k=1}^{3} h_k} = \frac{1 - e_j}{\sum_{k=1}^{3} (1 - e_k)} \tag{1}$$
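As a concrete illustration, the entropy-weight method above can be sketched in Python. The 5×3 sample matrix is an illustrative assumption, and the entropy step first rescales each normalized column into proportions before taking logarithms, a common practical refinement of the formulas above:

```python
import math

def min_max_normalize(X):
    """Column-wise min-max normalization of an m x n sample matrix to [0, 1]."""
    m, n = len(X), len(X[0])
    Y = [[0.0] * n for _ in range(m)]
    for j in range(n):
        col = [X[i][j] for i in range(m)]
        lo, hi = min(col), max(col)
        for i in range(m):
            Y[i][j] = (X[i][j] - lo) / (hi - lo) if hi > lo else 0.0
    return Y

def entropy_weights(X):
    """Entropy weights w_j with e_j = -k * sum_i p_ij ln p_ij and k = 1/ln m."""
    Y = min_max_normalize(X)
    m, n = len(Y), len(Y[0])
    k = 1.0 / math.log(m)
    h = []
    for j in range(n):
        s = sum(Y[i][j] for i in range(m)) or 1.0
        p = [Y[i][j] / s for i in range(m)]            # column as proportions
        e_j = -k * sum(pi * math.log(pi) for pi in p if pi > 0.0)
        h.append(1.0 - e_j)                            # information utility value
    total = sum(h) or 1.0
    return [h_j / total for h_j in h]

# Illustrative ratings of (cs, ds, os) over m = 5 service transactions
X = [[20, 15, 9], [35, 12, 14], [50, 28, 11], [42, 20, 16], [28, 25, 10]]
w = entropy_weights(X)
```

The resulting weights sum to 1; indexes whose ratings are more differentiated across transactions receive larger weights.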
based on the weight of each evaluation index obtained by user evaluation and the normalization processing of the calculation speed, data processing speed and I/O speed of each virtual machine, the service guarantee capability of the virtual machine and the vm of the virtual machine are calculated in a linear weighting mode i SA (vm) for service capability guarantee i ) The definition is as follows:
$$SA(vm_i) = w_1 \, cs_i' + w_2 \, ds_i' + w_3 \, os_i'$$

where cs_i', ds_i', os_i' are the normalized values of cs_i, ds_i, os_i, and w_1, w_2, w_3 are the weights of the evaluation indexes cs, ds and os computed by formula (1).
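A minimal sketch of this linear weighting follows, assuming three virtual machines with illustrative speeds and index weights already obtained from the entropy method (min-max normalization is applied across the virtual machines before weighting):

```python
def normalize(col):
    """Min-max normalize one speed column across all virtual machines."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in col]

def service_assurance(cs, ds, os_, w):
    """SA(vm_i) = w1*cs_i' + w2*ds_i' + w3*os_i' for every virtual machine i."""
    csn, dsn, osn = normalize(cs), normalize(ds), normalize(os_)
    return [w[0] * c + w[1] * d + w[2] * o
            for c, d, o in zip(csn, dsn, osn)]

# Illustrative speeds of three VMs and assumed index weights (sum to 1)
cs, ds, os_ = [20, 40, 60], [10, 30, 20], [10, 15, 30]
w = [0.4, 0.35, 0.25]
sa = service_assurance(cs, ds, os_, w)
```

Since every normalized speed lies in [0, 1] and the weights sum to 1, SA(vm_i) also lies in [0, 1], matching the 0-to-1 range of service capability used above.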
2.3 service capability requirements for tasks in DAG graphs
A big data parallel computing task can consist of multiple subtasks. Each subtask has a different function, role and task type within the whole task, so the degree of service capability each subtask requires must be treated individually. The demand of a subtask of a big data parallel computing task on the service capability of a virtual machine is quantified as the linear weighted sum of its task computation demand, data processing demand and I/O processing demand.
To eliminate the differences between dimensions and units and make the dimensions comparable, the data of each dimension must be standardized. Assume the big data parallel computing task contains n subtasks; the task computation amounts, data processing amounts and I/O data processing amounts of the n subtasks in these 3 dimensions form an n-row, 3-column matrix A_{n×3} = (a_ij)_{n×3}. Each dimension is z-score standardized using the mean and standard deviation of the raw data, so that the processed data follow the standard normal distribution:

$$a_{ij}' = \frac{a_{ij} - \mu_j}{\sigma_j}$$

where

$$\mu_j = \frac{1}{n} \sum_{i=1}^{n} a_{ij}, \qquad \sigma_j = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (a_{ij} - \mu_j)^2}$$

are the mean and standard deviation of the j-th service-capability-demand dimension of A_{n×3}. The z-score standardization yields the matrix A' = (a_ij')_{n×3}, in which each dimension has mean 0 and standard deviation 1 and is forward-increasing, so the larger a dimension value, the greater the demand for the corresponding type of service capability. The service capability demand SD(v_i) of task v_i is defined as

$$SD(v_i) = w_1 \, a_{i1}' + w_2 \, a_{i2}' + w_3 \, a_{i3}'$$

where w_1, w_2, w_3 are the weights of the evaluation indexes cs, ds and os computed by formula (1).
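The z-score standardization and the weighted sum defining SD(v_i) can be sketched as follows; the 4×3 matrix of subtask (computation, data, I/O) amounts and the index weights are illustrative assumptions:

```python
import math

def zscore_columns(A):
    """Column-wise z-score standardization: (a_ij - mu_j) / sigma_j."""
    n, d = len(A), len(A[0])
    Z = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [A[i][j] for i in range(n)]
        mu = sum(col) / n
        sigma = math.sqrt(sum((v - mu) ** 2 for v in col) / n)
        for i in range(n):
            Z[i][j] = (A[i][j] - mu) / sigma if sigma > 0 else 0.0
    return Z

def service_demand(A, w):
    """SD(v_i) = w1*a'_i1 + w2*a'_i2 + w3*a'_i3 on the z-scored matrix."""
    return [sum(w_j * z for w_j, z in zip(w, row)) for row in zscore_columns(A)]

# Illustrative (computation, data, I/O) amounts for n = 4 subtasks
A = [[120, 40, 50], [260, 180, 100], [180, 90, 75], [150, 60, 60]]
w = [0.4, 0.35, 0.25]
sd = service_demand(A, w)
```

Because each z-scored column has mean 0, the SD values sum to approximately 0; a subtask that is above average in every dimension, like the second row here, receives the largest service capability demand.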
3. Dynamic scheduling algorithm for big data workflow service
Based on the above cloud computing model of virtual machine service guarantee and task service capability demand, the DLS scheduling algorithm is modified so that the heterogeneity of virtual machines is fully exploited to reduce the completion time of parallel tasks while the quality-of-service requirements of the subtasks on the virtual machines' service guarantee capability are met, making DLS parallel task scheduling based on a Directed Acyclic Graph (DAG) more reasonable.
The DLS algorithm is a compile-time, heuristic list-scheduling algorithm for allocating DAG-based applications onto a set of heterogeneous resources so as to reduce application execution time. If, at each scheduling step, the ready subtask v_i and the idle virtual machine vm_j are computed to have the highest matching dynamic level, the DLS algorithm schedules task v_i to run on virtual machine vm_j. The dynamic level DL(v_i, vm_j) of a task-virtual machine pair (v_i, vm_j) is defined as

$$DL(v_i, vm_j) = SL(v_i) - \max\{DA(v_i, vm_j),\; TF(vm_j)\} + \Delta(v_i, vm_j)$$

where SL(v_i) is the static level of the task: the maximum, over all reachable paths from task v_i to the exit task of the parallel task, of the sum of the average computation times of the tasks on the path; it implies the execution priority of the task node. The term max{DA(v_i, vm_j), TF(vm_j)} is the time at which task v_i begins execution on virtual machine vm_j; it reflects the availability of the virtual machine and of communication resources, and penalizes task-machine pairs that incur large communication cost. DA(v_i, vm_j) denotes the time at which the input data required by task v_i is available when it is scheduled onto vm_j, and TF(vm_j) denotes the time at which vm_j is idle and thus available to execute v_i. The term

$$\Delta(v_i, vm_j) = \bar{t}(v_i) - t(v_i, vm_j), \qquad \bar{t}(v_i) = \frac{1}{m} \sum_{j=1}^{m} t(v_i, vm_j)$$

reflects the computing performance of the virtual machines, increasing the priority of virtual machines with higher processing speed and decreasing it for those with lower processing speed, where $\bar{t}(v_i)$ is the average of the time task v_i requires on all machines and t(v_i, vm_j) is the time task v_i requires on virtual machine vm_j.
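The dynamic level of a single task-virtual machine pair can be sketched directly from the definition above; all timing inputs in the example are illustrative assumptions:

```python
def dynamic_level(sl, data_avail, vm_free, exec_times, j):
    """DL(v_i, vm_j) = SL(v_i) - max{DA, TF} + (mean exec time - exec time on vm_j).

    sl         -- static level SL(v_i) of the task
    data_avail -- DA(v_i, vm_j): when the task's input data is ready on vm_j
    vm_free    -- TF(vm_j): when vm_j becomes idle
    exec_times -- t(v_i, vm_k) for every virtual machine k
    j          -- index of the candidate virtual machine
    """
    start = max(data_avail, vm_free)            # earliest possible start time
    mean_t = sum(exec_times) / len(exec_times)  # average over all machines
    delta = mean_t - exec_times[j]              # rewards faster machines
    return sl - start + delta

# Same ready task evaluated on a fast VM (index 0) and a slow VM (index 2)
dl_fast = dynamic_level(sl=50.0, data_avail=4.0, vm_free=6.0,
                        exec_times=[10.0, 20.0, 30.0], j=0)
dl_slow = dynamic_level(sl=50.0, data_avail=4.0, vm_free=6.0,
                        exec_times=[10.0, 20.0, 30.0], j=2)
```

The faster machine receives the higher dynamic level (54 versus 34 here), so DLS prefers it when the two candidates are otherwise equivalent.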
When making scheduling decisions, the DLS algorithm considers machine heterogeneity, which lets it adapt to the heterogeneous computing speeds of resources in a cloud computing environment, but it does not consider the comprehensive service guarantee capability of the computing resources in the cloud computing system. When a task is scheduled onto a target virtual machine, the service guarantee capability reflects the target machine's combined ability to provide computing capability, data processing capability and I/O processing capability. To take the service guarantee capability of virtual machines fully into account, a Service Dynamic Level Scheduling algorithm SDLS is proposed here on the basis of the DLS algorithm; the service dynamic level is defined as follows:
$$SDL(v_i, vm_j) = DL(v_i, vm_j) + SA(v_i, vm_j) - \frac{SD(v_i)}{\max_k\{SD(v_k)\}}$$
where SA(v_i, vm_j) is the service guarantee capability of the virtual machine when task v_i is scheduled onto virtual machine vm_j, SD(v_i) is the service capability demand of subtask v_i, and max{SD(v_k)} is the maximum service capability demand over all subtasks of the parallel task. For a task-virtual machine pair (v_i, vm_j), when SD(v_i) increases, i.e., when the task's demand on the virtual machine's service capability grows, the scheduling level decreases correspondingly. The proposed service dynamic level scheduling algorithm for cloud computing environments is therefore highly flexible: virtual machines are matched according to each subtask's particular demand for service capability, so that the differing demands of the subtasks are all satisfied.
The pseudocode of the parallel task service dynamic level scheduling algorithm SDLS is given below.
Algorithm: SDLS, the dynamic scheduling algorithm for parallel task services
Input: big data workflow DAG = (V, R, Q, D, O, C); cloud platform Cloud = (VM, CS, DS, OS, E, B)
Output: subtask-virtual machine assignment sequence Assign = {(v_i, vm_j)}
(The SDLS pseudocode appears as figures in the original document.)
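Since the pseudocode survives only as figures, the core SDLS selection step can be sketched as follows. It assumes the additive service dynamic level SDL = DL + SA − SD/max SD described above; all DL, SA and SD values in the example are illustrative:

```python
def sdls_assign(ready_tasks, idle_vms, dl, sa, sd):
    """One greedy SDLS step: return the (task, vm) pair with the highest SDL."""
    sd_max = max(sd.values())
    best, best_level = None, float("-inf")
    for t in ready_tasks:
        for v in idle_vms:
            level = dl[(t, v)] + sa[v] - sd[t] / sd_max
            if level > best_level:
                best, best_level = (t, v), level
    return best, best_level

# Two ready subtasks, two idle VMs, illustrative levels and capabilities
ready = ["v1", "v2"]
vms = ["vm1", "vm2"]
dl = {("v1", "vm1"): 10.0, ("v1", "vm2"): 9.0,
      ("v2", "vm1"): 8.0, ("v2", "vm2"): 11.0}
sa = {"vm1": 0.6, "vm2": 0.9}       # service guarantee capability per VM
sd = {"v1": 2.0, "v2": 4.0}         # service capability demand per task
pair, level = sdls_assign(ready, vms, dl, sa, sd)
```

A full scheduler would repeat this step, updating the ready set and the virtual machines' free times after each assignment, exactly as the DLS loop does.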
4. Simulation experiment and result analysis
The effect of the proposed service dynamic level scheduling algorithm SDLS is tested through simulation. The experiments use the cloud simulation toolkit CloudSim 3.0, a Java-based discrete-event simulation toolkit that supports resource management and task scheduling simulation for cloud computing; the simulation is driven by virtual time and tasks and is therefore not affected by the performance of the host machine.
The main flow of the CloudSim simulation is: initialize each discrete object with the set parameters → start the simulation → resource registration → the broker agent queries the information center for resources → compute the service capability demand of each parallel-task subtask → compute the service guarantee capability of each virtual machine → allocate service-matched resources to tasks according to the configured scheduling strategy → cloud resources execute the tasks → task execution completes → return the final result → end the simulation. The simulation program is written in Java; the development environment is Eclipse, an extensible, open-source, Java-based integrated development platform.
To test how well the proposed algorithm solves the parallel task scheduling problem, the following two groups of experiments were conducted.
4.1 average completion time for different task counts
To evaluate the performance of the proposed algorithm, the service dynamic level scheduling algorithm SDLS is compared with the dynamic level scheduling algorithm DLS for different numbers of parallel-task subtasks. In this experiment the number of computing resource nodes is 200 and the number of links is 300; the computing speed, data processing speed and I/O speed of the computing resources are generated randomly in [20,60], [10,30] and [10,30], respectively, and the communication speed between computing resources in [10,30]. Parallel-task DAG graphs with 30 to 130 subtasks are generated randomly; the computation amount and data processing amount of the subtasks and the maximum communication amount between a subtask and its immediate successors are generated randomly in [120,260], [40,180] and [50,100], respectively. For each task scale the scheduling algorithm is executed multiple times and the parallel task completion times are averaged. Figure 3 compares the average completion time for different numbers of subtasks.
As Figure 3 shows, the average scheduling length of both algorithms grows as the number of subtasks increases, and the scheduling length of SDLS is smaller than that of DLS. This is because, when allocating a virtual machine to a task, SDLS considers not only the combined computing, data processing and I/O processing capability of the virtual machine, but also the subtask's demand for service capability against the virtual machine's service capability guarantee. When the dynamic levels of a task on two resources are equal, the virtual machine with the larger service guarantee capability has the larger service dynamic level, and SDLS selects the resource with the larger service dynamic level when allocating virtual machines to subtasks, which raises the overall processing speed of task execution; hence the scheduling length of SDLS is always smaller than that of DLS.
4.2 average completion time for different virtual machine counts
In this experiment a parallel-task DAG graph with 400 subtasks is generated randomly; the computation amount and data processing amount of the subtasks and the maximum communication amount between a subtask and its immediate successors are generated randomly in [120,260], [40,180] and [50,100], respectively. Between 200 and 500 virtual machines are generated randomly with 500 links; the computing speed, data processing speed and I/O speed of the computing resources are generated randomly in [20,60], [10,30] and [10,30], respectively, and the communication speed between virtual machines in [10,30]. For each scale the scheduling algorithm is executed multiple times and the parallel task completion times are averaged. Figure 4 compares the average completion time for different numbers of virtual machines.
As Figure 4 shows, as the number of virtual machines increases the behavior of the average completion time is consistent with experiment 4.1: the scheduling length of SDLS is always smaller than that of DLS.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, it is possible to make various improvements and modifications without departing from the technical principle of the present invention, and those improvements and modifications should be also considered as the protection scope of the present invention.

Claims (1)

1. A big data workflow scheduling method for sensing virtual machine service capability in a cloud environment, characterized by comprising the following steps: first, modeling and assumptions for the big data workflow and the cloud system; second, service capability calculation; third, dynamic scheduling of the big data workflow service;
the first step further comprises the following steps:
definition 1 (big data workflow): a big data workflow can be abstractly represented as a DAG graph, i.e., a 6-tuple DAG = (V, R, Q, D, O, C), where the concrete meaning of each element of the tuple is as follows:
(1) V = {v_1, v_2, ..., v_n} denotes the set of subtasks of the big data workflow, n the number of subtasks, and v_i the i-th subtask;
(2) R = {r_ij | v_i, v_j ∈ V} ⊆ V × V represents the execution-order precedence and data-dependency relations between subtasks;
(3) Q = {q_1, q_2, ..., q_n} is the set of computation amounts of the subtasks, q_i ∈ Q being the computation amount of subtask v_i;
(4) D = {d_1, d_2, ..., d_n} is the set of data processing amounts of the subtasks, d_i ∈ D being the data processing amount of subtask v_i;
(5) O = {o_1, o_2, ..., o_n} denotes the set of I/O data processing amounts of the subtasks; o_i represents the amount of data of subtask v_i that must undergo I/O processing before it can be transmitted to successor tasks, such as coding or encryption before data transmission;
(6) C = {c_ij | v_i, v_j ∈ V} ⊆ V × V is the set of communication amounts between subtasks, c_ij being the amount of communication data from subtask v_i to subtask v_j;
definition 2 (cloud platform): a real cloud system can be abstractly described as a cluster system composed of different types of virtual machines virtualized from multiple physical servers through virtualization technology; the virtual machines are connected through a network to form a cloud computing system, which can be represented as a 6-tuple Cloud = (VM, CS, DS, OS, E, B), where the concrete meaning of each element of the tuple is as follows:
(1) VM = {vm_1, vm_2, ..., vm_m} denotes the set of virtual machines, m the total number of virtual machines, and vm_i the i-th virtual machine;
(2) CS = {cs_1, cs_2, ..., cs_m} denotes the set of virtual machine computing speeds; cs_i, the computing speed of vm_i, represents the computation amount processed per unit time;
(3) DS = {ds_1, ds_2, ..., ds_m} denotes the set of virtual machine data processing speeds; ds_i, the data processing speed of vm_i, represents the amount of data processed per unit time;
(4) OS = {os_1, os_2, ..., os_m} denotes the set of virtual machine I/O processing speeds; os_i, the I/O processing speed of vm_i, represents the amount of I/O data processed per unit time;
(5) E = {e_ij | vm_i, vm_j ∈ VM} represents the set of communication links between the virtual machines;
(6) B = {b_ij | vm_i, vm_j ∈ VM, e_ij ∈ E} is the set of communication bandwidths of the edges in E; b_ij ∈ B is the time to transmit a unit of data over the communication link e_ij = (vm_i, vm_j) ∈ E; the service capability of a virtual machine is evaluated through the three evaluation indexes of its computing capability, data processing capability and I/O processing capability; let the sample data matrix of these 3 evaluation indexes obtained from m service transactions of a cloud user be X = (x_ij)_{m×3}; because the indexes differ greatly in dimension, order of magnitude and orientation, the initial data must be made dimensionless, and the data are normalized by min-max standardization:

$$y_{ij} = \frac{x_{ij} - \min_{1 \le k \le m}\{x_{kj}\}}{\max_{1 \le k \le m}\{x_{kj}\} - \min_{1 \le k \le m}\{x_{kj}\}}$$
where max_k{x_kj} and min_k{x_kj} denote the maximum and minimum values of index j, respectively; the conversion turns the values of all indexes into forward-increasing values in the range [0, 1], so that a larger value of each index is better; with the dimensionless normalized matrix Y = (y_ij)_{m×3}, the information entropy of each index is:

$$e_j = -k \sum_{i=1}^{m} y_{ij} \ln y_{ij}, \qquad j = 1, 2, 3$$
where the constant k is related to the number of samples m; for a system whose information is completely disordered, the degree of order is zero and the entropy is maximal, e = 1; when the m samples are in a completely disordered distribution, y_ij = 1/m, then:

$$e = -k \sum_{i=1}^{m} \frac{1}{m} \ln \frac{1}{m} = k \ln m = 1$$

thus k = (ln m)^{-1};
since the information entropy e_j measures the utility value of the information of the j-th evaluation index, and e_j = 1 when completely disordered, the information of the j-th index then contributes nothing to the comprehensive evaluation; the information utility value of an index therefore depends on the difference h_j = 1 − e_j between 1 and its entropy; the entropy method estimates the weight of each index from these utility values, the weight being essentially computed from the utility value of the index information, and the higher the utility value, the greater the importance in the evaluation, so the weight of the j-th evaluation index is:

$$w_j = \frac{h_j}{\sum_{k=1}^{3} h_k} = \frac{1 - e_j}{\sum_{k=1}^{3} (1 - e_k)} \tag{1}$$

where 0 ≤ w_j ≤ 1 and $\sum_{j=1}^{3} w_j = 1$;
in the second step, based on the weight of each evaluation index obtained from user evaluation, and after normalizing the computing speed, data processing speed and I/O speed of each virtual machine, the service capability guarantee SA(vm_i) of virtual machine vm_i is computed by linear weighting and defined as follows:

$$SA(vm_i) = w_1 \, cs_i' + w_2 \, ds_i' + w_3 \, os_i'$$

where cs_i', ds_i', os_i' are the normalized values of cs_i, ds_i, os_i, and w_1, w_2, w_3 are the weights of the evaluation indexes cs, ds and os computed by formula (1);
in the second step, assume the big data parallel computing task contains n subtasks; the task computation amounts, data processing amounts and I/O data processing amounts of the n subtasks in these 3 dimensions form an n-row, 3-column matrix A_{n×3} = (a_ij)_{n×3}; each dimension is z-score standardized using the mean and standard deviation of the raw data, so that the processed data follow the standard normal distribution:

$$a_{ij}' = \frac{a_{ij} - \mu_j}{\sigma_j}$$

where

$$\mu_j = \frac{1}{n} \sum_{i=1}^{n} a_{ij}, \qquad \sigma_j = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (a_{ij} - \mu_j)^2}$$

μ_j is the mean and σ_j the standard deviation of the j-th service-capability-demand dimension of A_{n×3}; the z-score standardization yields the matrix A' = (a_ij')_{n×3}, in which each dimension has mean 0 and standard deviation 1 and is forward-increasing, so the larger a dimension value, the greater the demand for the corresponding type of service capability; the service capability demand SD(v_i) of task v_i is defined as follows:

$$SD(v_i) = w_1 \, a_{i1}' + w_2 \, a_{i2}' + w_3 \, a_{i3}'$$

where w_1, w_2, w_3 are the weights of the evaluation indexes cs, ds and os;
in step three, to fully consider the service guarantee capability of the virtual machine, the service dynamic level is defined as follows:

$$SDL(v_i, vm_j) = DL(v_i, vm_j) + SA(v_i, vm_j) - \frac{SD(v_i)}{\max_k\{SD(v_k)\}}$$

where SA(v_i, vm_j) is the service guarantee capability of the virtual machine when task v_i is scheduled onto vm_j; SD(v_i) is the service capability demand of subtask v_i; max{SD(v_k)} is the maximum service capability demand over all subtasks of the parallel task; for a task-virtual machine pair (v_i, vm_j), when SD(v_i) increases, i.e., when the task's demand on the virtual machine's service capability grows, the scheduling level decreases correspondingly; SL(v_i) is the static level of the task, the maximum, over all reachable paths from task v_i to the exit task of the parallel task, of the sum of the average computation times of the tasks on the path, which implies the execution priority of the task node; the dynamic level is

$$DL(v_i, vm_j) = SL(v_i) - \max\{DA(v_i, vm_j),\; TF(vm_j)\} + \Delta(v_i, vm_j)$$

where max{DA(v_i, vm_j), TF(vm_j)} is the time at which task v_i begins execution on virtual machine vm_j, reflecting the availability of virtual machine and communication resources and penalizing task node pairs with large communication cost; DA(v_i, vm_j) is the time at which the input data required by task v_i is available when it is scheduled onto vm_j; TF(vm_j) is the time at which vm_j is idle and thus available to execute v_i; the term

$$\Delta(v_i, vm_j) = \bar{t}(v_i) - t(v_i, vm_j)$$

reflects the computing performance of the virtual machine, increasing the priority of virtual machines with higher processing speed and decreasing it for those with lower processing speed, where $\bar{t}(v_i)$ is the average of the time required for task v_i to execute on all machines and t(v_i, vm_j) is the time required for task v_i to execute on virtual machine vm_j.
CN202011031844.4A 2020-09-27 2020-09-27 Big data workflow scheduling method for sensing service capability of virtual machine in cloud environment Active CN112181620B (en)

Publications: CN112181620A, published 2021-01-05; CN112181620B (granted), published 2022-09-20.

Family

ID=73944102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011031844.4A Active CN112181620B (en) 2020-09-27 2020-09-27 Big data workflow scheduling method for sensing service capability of virtual machine in cloud environment

Country Status (1)

Country Link
CN (1) CN112181620B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113568747B (en) * 2021-07-27 2024-04-12 上海交通大学 Cloud robot resource scheduling method and system based on task classification and time sequence prediction
CN113821308B (en) * 2021-09-29 2023-11-24 上海阵量智能科技有限公司 System on chip, virtual machine task processing method and device and storage medium
CN114077498B (en) * 2021-11-20 2023-03-28 郑州轻工业大学 Method and system for selecting and transferring calculation load facing to mobile edge calculation
CN114168353A (en) * 2022-01-13 2022-03-11 中国联合网络通信集团有限公司 Task joint execution method and system based on end edge resource scheduling
CN117527881A (en) * 2023-11-20 2024-02-06 广东省电子商务认证有限公司 Dynamic cipher machine dispatching system and dispatching method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970609A (en) * 2014-04-24 2014-08-06 南京信息工程大学 Cloud data center task scheduling method based on improved ant colony algorithm
CN106056294A (en) * 2016-06-06 2016-10-26 四川大学 Hybrid cloud scientific workflow scheduling strategy based on task probability clustering and multi-constraint workflow division
CN106951330A (en) * 2017-04-10 2017-07-14 郑州轻工业学院 A kind of maximized virtual machine distribution method of cloud service center service utility
CN107038070A (en) * 2017-04-10 2017-08-11 郑州轻工业学院 The Parallel Task Scheduling method that reliability is perceived is performed under a kind of cloud environment
CN110232085A (en) * 2019-04-30 2019-09-13 中国科学院计算机网络信息中心 A kind of method of combination and system of big data ETL task
CN111522637A (en) * 2020-04-14 2020-08-11 重庆邮电大学 Storm task scheduling method based on cost benefit

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3226133A1 (en) * 2016-03-31 2017-10-04 Huawei Technologies Co., Ltd. Task scheduling and resource provisioning system and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970609A (en) * 2014-04-24 2014-08-06 南京信息工程大学 Cloud data center task scheduling method based on improved ant colony algorithm
CN106056294A (en) * 2016-06-06 2016-10-26 四川大学 Hybrid cloud scientific workflow scheduling strategy based on task probability clustering and multi-constraint workflow division
CN106951330A (en) * 2017-04-10 2017-07-14 郑州轻工业学院 A kind of maximized virtual machine distribution method of cloud service center service utility
CN107038070A (en) * 2017-04-10 2017-08-11 郑州轻工业学院 The Parallel Task Scheduling method that reliability is perceived is performed under a kind of cloud environment
CN110232085A (en) * 2019-04-30 2019-09-13 中国科学院计算机网络信息中心 A kind of method of combination and system of big data ETL task
CN111522637A (en) * 2020-04-14 2020-08-11 重庆邮电大学 Storm task scheduling method based on cost benefit

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A dynamic resource allocation and task scheduling strategy with uncertain task runtime on IaaS clouds";Shaowei Liu等;《2016 Sixth International Conference on Information Science and Technology (ICIST)》;20160602;第174-180页 *
"云环境下服务信任感知的可信动态级调度方法";曹洁等;《通信学报》;20141130;第35卷(第11期);第39-49页 *
异构云系统中预算成本约束下高效的工作流调度算法;张龙信等;《小型微型计算机系统》;20200529(第06期);全文 *

Also Published As

Publication number Publication date
CN112181620A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN112181620B (en) Big data workflow scheduling method for sensing service capability of virtual machine in cloud environment
Zhou et al. Cost and makespan-aware workflow scheduling in hybrid clouds
Selvarani et al. Improved cost-based algorithm for task scheduling in cloud computing
Polo et al. Performance management of accelerated mapreduce workloads in heterogeneous clusters
Chen et al. A multi-objective optimization for resource allocation of emergent demands in cloud computing
Konjaang et al. Multi-objective workflow optimization strategy (MOWOS) for cloud computing
Li et al. An effective scheduling strategy based on hypergraph partition in geographically distributed datacenters
Thaman et al. Green cloud environment by using robust planning algorithm
Lagwal et al. Load balancing in cloud computing using genetic algorithm
Javadpour et al. An intelligent energy-efficient approach for managing IoE tasks in cloud platforms
Emara et al. Genetic-Based Multi-objective Task Scheduling Algorithm in Cloud Computing Environment.
CN117032902A (en) Load-aware cloud task scheduling method based on an improved discrete particle swarm algorithm
Gopu et al. Energy-efficient virtual machine placement in distributed cloud using NSGA-III algorithm
Ben Hafaiedh et al. A model-based approach for formal verification and performance analysis of dynamic load-balancing protocols in cloud environment
Chhabra et al. Qualitative parametric comparison of load balancing algorithms in parallel and distributed computing environment
Franklin et al. A general matrix iterative model for dynamic load balancing
Gąsior et al. A Sandpile cellular automata-based scheduler and load balancer
Yassir et al. Graph-based model and algorithm for minimising big data movement in a cloud environment
Alatawi et al. Hybrid load balancing approach based on the integration of QoS and power consumption in cloud computing
Moussa et al. Comprehensive study on machine learning-based container scheduling in cloud
Chen et al. A two-level virtual machine self-reconfiguration mechanism for the cloud computing platforms
Filippini et al. Hierarchical Scheduling in on-demand GPU-as-a-Service Systems
Chhabra et al. Qualitative Parametric Comparison of Load Balancing Algorithms in Distributed Computing Environment
JP2022531353A (en) Equipment and methods for dynamically optimizing parallel computing
Manekar et al. Optimizing cost and maximizing profit for multi-cloud-based big data computing by deadline-aware optimize resource allocation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant