CN107704069A - A kind of Spark energy-saving scheduling methods perceived based on energy consumption - Google Patents

A kind of Spark energy-saving scheduling methods perceived based on energy consumption Download PDF

Info

Publication number
CN107704069A
CN107704069A
Authority
CN
China
Prior art keywords
energy consumption
task
spark
tasks
energy
Prior art date
Legal status
Granted
Application number
CN201710452338.4A
Other languages
Chinese (zh)
Other versions
CN107704069B (en)
Inventor
李鸿健
王霍琛
代宇
熊安萍
蒋溢
Current Assignee
Guangzhou Dayu Chuangfu Technology Co ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201710452338.4A priority Critical patent/CN107704069B/en
Publication of CN107704069A publication Critical patent/CN107704069A/en
Application granted granted Critical
Publication of CN107704069B publication Critical patent/CN107704069B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/329Power saving characterised by the action undertaken by task scheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/4893Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues taking into account power or heat criteria
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Power Sources (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a Spark energy-saving scheduling method based on energy consumption perception. A big data computing energy consumption model under the Spark computing framework is built first; based on the model, policy tables of the energy consumption and execution time relationships between tasks and computing resources are established, and Spark task scheduling is guided and optimized through the policy tables, effectively reducing the total computing energy consumption while guaranteeing parallel computing efficiency. The invention overcomes the defect that Spark's original scheduling strategies cannot perceive energy consumption; the method is energy-consumption-aware, dynamically optimizes scheduling, and is highly scalable, effectively reducing the energy consumption generated by applications running under the Spark computing framework.

Description

Spark energy-saving scheduling method based on energy consumption perception
Technical Field
The invention relates to the field of big data processing and energy efficiency, and in particular to an energy-consumption-aware big data energy consumption model and an energy-saving scheduling strategy under the Spark computing framework based on that model.
Background
The huge electric energy consumption generated by big data computing is an urgent problem for data centers. Many enterprises and organizations face large-scale data computing, and computing cost is an important concern alongside computing efficiency; they hope to reduce the power consumed by big data computing and thereby reduce computing cost. The situation, however, is not optimistic in the current big data era. A recent research report [1] shows that the power consumption of German data centers increased by 15% from 2010 to 2015, reaching 12 billion kilowatt-hours per year, and a 2014 report of Greenpeace International [2] predicts that global data center electricity demand will grow by 81% by 2020; operating and maintaining data centers consumes huge amounts of electric energy. By optimizing the way a data center's big data computing tasks are scheduled, the energy cost of big data computing can be reduced effectively and the energy efficiency of the data center improved.
Currently, the Spark computing framework [3] is widely used for big data computing in data centers. Spark provides a distributed memory abstraction over a cluster, the resilient distributed dataset (RDD), which is an immutable set of partitioned records and is Spark's programming model. Spark divides the RDDs into different stages according to the dependency relationships between RDDs, and each stage processes its RDDs in a pipelined fashion through a series of methods. Within each stage, a set of tasks equal in number to the partitions of the last RDD of the stage is created, i.e., the data of one partition is processed by one task. These tasks are scheduled to worker nodes for parallel processing. For a big data application under the Spark computing framework, only one executor process is placed on each worker; that process occupies computing resources and can process multiple tasks. The task scheduling problem can then be described as a bin-packing problem: tasks are the items, executor processes are the bins, the resources occupied by an executor are the bin's size, and the amount of resources required by a task is the item's size; allocating running resources to tasks is clearly an NP-hard problem. Assigning the same task to different executors leads to different running times and energy consumption, so a reasonable allocation plays a vital role in reducing the running energy consumption of a big data application.
Currently, Spark itself provides two scheduling strategies [3]: the FIFO and FAIR scheduling strategies. The scheduling granularity of both strategies is the stage, and the tasks within each stage are randomly shuffled and then distributed to executors for execution according to the locality principle. The two strategies differ in that FIFO orders the stages by Job generation order and by stage generation order within a Job and allocates computing resources in that order, whereas FAIR divides the stages into several groups, each called a Pool, polls the Pools according to their weights, orders the stages, and allocates computing resources in that order. Yang Zhiwei et al. [4] propose an adaptive task scheduling strategy for heterogeneous Spark clusters: the strategy monitors the load and resource utilization of the nodes, analyzes the monitored parameters, and adaptively and dynamically adjusts the task allocation weight of each node, but it does not consider the energy consumption problem. Leverich and Kozyrakis [5] propose an energy management method for MapReduce jobs that selectively shuts down nodes with low utilization to reduce energy consumption: at least one replica of each data block is kept within a covering set, and nodes outside the covering set with low utilization are turned off. L. Mashayekhy et al. [6] propose an efficient energy-saving heuristic task scheduling algorithm for the MapReduce computing framework; by analyzing the energy consumption of a specific job in advance, a low-energy task scheduling strategy is obtained that effectively reduces energy consumption while satisfying SLA requirements. These energy-saving strategies are all based on the MapReduce computing framework; research on Spark energy consumption is currently insufficient.
Task scheduling for Spark is a bin-packing problem. The heuristic task scheduling algorithm proposed by L. Mashayekhy et al. [6] for the MapReduce computing framework sorts the computing resources (slots) of the Map stage and the Reduce stage according to a pre-computed energy consumption analysis and places tasks on the slots with the lowest unit energy consumption in each stage. In practical applications, an energy consumption analysis must be performed in advance for every new job, and when the physical nodes of the cluster change, the original performance analysis results must be regenerated; the approach therefore lacks flexibility and cannot be applied to the energy consumption problem of the Spark computing framework.
Reference documents:
[1] Hintemann R, Beucker S, Clausen J, et al. Energy efficiency of data centers - a system-oriented analysis of current development trends [C] // Electronics Goes Green. IEEE, 2017.
[2] Salahuddin M, Alam K. Information and Communication Technology, electricity consumption and economic growth in OECD countries: A panel data analysis [J]. International Journal of Electrical Power & Energy Systems, 2016, 76: 185-193.
[3] Zhang. Spark Technology Insider (Spark技术内幕) [M]. China Machine Press, 2015.
[4] Yang Zhiwei, et al. Adaptive task scheduling strategy for heterogeneous Spark clusters [J]. Computer Engineering, 2016, 42(1): 31-35.
[5] Leverich J, Kozyrakis C. On the energy (in)efficiency of Hadoop clusters [J]. ACM SIGOPS Oper. Syst. Rev., 2010, 44(1): 61-65.
[6] Mashayekhy L, Nejad M M, Grosu D, et al. Energy-Aware Scheduling of MapReduce Jobs for Big Data Applications [J]. IEEE Transactions on Parallel & Distributed Systems, 2015, 26(10): 2720-2733.
[7] Luo Liang, Wu Wenjun, Zhang Fei. Energy consumption modeling method for cloud computing data centers [J]. Journal of Software, 2014(7): 1371-1387.
Disclosure of Invention
The invention aims to solve the above problems in the prior art, and provides an energy-consumption-aware big data energy consumption model and, based on that model, a scheduling method for the Spark computing framework that can markedly reduce the energy consumption of big data computing. The technical scheme of the invention is as follows:
a Spark energy-saving scheduling method based on energy consumption perception comprises the following steps:
firstly, constructing a big data calculation energy consumption model based on a Spark calculation frame;
secondly, establishing an energy consumption relation strategy table and an execution time relation strategy table of the tasks and the computing resources, wherein the two strategy tables together guide task scheduling;
thirdly, selecting the computing resources with the optimal evaluation criteria according to the policy table, preferentially distributing computing tasks for the computing resources, and simultaneously ensuring the balanced distribution of parallel computing tasks;
fourthly, initializing the decision table through data detection, and updating the energy consumption relation strategy table and the execution time relation strategy table after the phase task is executed.
Further, the big data computing energy consumption model is as follows: the energy consumption generated by a big data application App is App^ec, defined as shown in formula (8):
App^ec = Σ_i Job_i^ec = Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (8)
Reducing the runtime energy consumption of the big data application, i.e. App^ec, the objective function is defined as shown in formula (9):
obj^ec = min App^ec = min Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (9)
where i ∈ Z, j ∈ Z, k ∈ Z, l ∈ Z; x^ij_kl represents the task placement, e^ij_kl represents the energy consumption of task^ij_k running on ex_l, and x^ij_kl represents the relationship between task^ij_k and the resource process ex_l, defined as shown in formula (5):
when task^ij_k is allocated to ex_l, x^ij_kl = 1, otherwise x^ij_kl = 0 (5)
Further, the energy consumption relationship policy table of tasks and computing resources is as follows: it represents the result of the Cartesian product of the Stage_ij sets and Exe, and stores the historical energy consumption e^ij_kl of any task task^ij_k when it runs on any process ex_l; in the initial state, e^ij_kl has no valid value, i.e. valid data needs to be probed;
the execution time relationship policy table of tasks and computing resources is as follows: it stores the execution time t^ij_kl of any task task^ij_k on any process ex_l; in the initial state, t^ij_kl likewise has no valid value and needs to be probed.
Further, selecting the preferred computing resource according to the energy consumption relationship policy table, the execution time relationship policy table and the evaluation criterion specifically comprises: within a batch of stages to be processed in parallel, the evaluation value eva_l of a process ex_l is defined as formula (10):
eva_l = ( Σ_{task_k' ∈ ∪Stage'} e_{k'l} / t_{k'l} ) / Σ|Stage'| (10)
where Stage' denotes a stage that the current DAGScheduler submits to the TaskScheduler for processing, ∪Stage' denotes the union of all the stages the TaskScheduler needs to process, Σ|Stage'| denotes the total number of tasks in all the stages the TaskScheduler needs to process, e_{k'l}/t_{k'l} denotes the unit energy consumption of a task task_k' of the ∪Stage' set when executed on process ex_l, and the numerator sums this unit energy consumption over all tasks task_k' in the ∪Stage' set; the processes are sorted in ascending order according to the evaluation criterion and the policy tables to form a queue.
Further, when e^ij_kl or t^ij_kl in the policy tables, representing the energy consumption or execution time of task^ij_k on process ex_l, is in the initialization state, the process ex_l needs to be placed at the front of the queue, and tasks are preferentially allocated to this process.
Further, the currently optimal process ex_l is obtained from the ascending process queue; according to the execution times t^ij_kl in the execution time policy table, the tasks whose t^ij_kl is still in the initialization state form a set Set_0, and the remaining tasks are sorted in ascending order of their execution time t^ij_kl on ex_l to form a double-ended queue TaskQue;
if Set_0 ≠ ∅, the tasks in Set_0 are preferentially allocated to ex_l until the resources of ex_l are exhausted or Set_0 = ∅; if the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues;
if Set_0 = ∅, i.e. the data of the current decision table either needs no probing or has been fully allocated, tasks are taken alternately from the head and the tail of TaskQue and allocated to ex_l until the resources of ex_l are exhausted or the TaskQue queue is empty; if the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues; if the TaskQue queue is empty, all task allocation has ended.
Further, when the execution time t^ij_kl of a task is in the initialization state, the task needs to be put separately into Set_0 and probed preferentially.
Furthermore, after a task run finishes, the energy consumption and running time of the current run are recorded, and the energy consumption relationship policy table and the execution time relationship policy table are updated, providing a decision basis for the next run of the same task.
The invention has the following advantages and beneficial effects:
the innovation points of the invention are as follows:
1. energy consumption model established for Spark big data application
The invention is the first to provide mathematical models of Spark big data application energy consumption, job energy consumption and stage energy consumption, detailed in formulas (6), (7) and (8), providing a model basis for accurately calculating the energy consumption of Spark big data applications.
The advantages are that: the energy consumption model describes, in set form, the relationship between a big data application (App) and its jobs (Job), between a job (Job) and its stages (Stage), and between a stage (Stage) and its tasks (task). The relationship between tasks and processes (ex) is described using the variable x. The execution energy consumption of a computing task on a process can thus be obtained flexibly, and from it the stage energy consumption, job energy consumption and application energy consumption. At the same time, it provides a computational energy consumption basis for Spark energy consumption research.
2. An energy-consumption-aware Spark energy-saving scheduling method is provided
The innovation points of the energy-consumption-aware Spark energy-saving scheduling method can be subdivided as follows:
2.1 An energy consumption relationship policy table and an execution time relationship policy table of tasks and computing resources are proposed for the first time; they record the historical energy consumption relationships and historical execution time relationships and provide a policy basis for task allocation. At the same time, the data in the policy tables are updated after each task finishes. The advantages are that: the energy-consumption-aware Spark energy-saving scheduling method has dynamic extensibility. The scheduling algorithm is suited to scenarios where the same big data application is run repeatedly; the policy table mechanism is dynamically updated on every run, achieving the energy-consumption-aware effect. When the physical cluster changes, the policy table update mechanism can perceive unknown data in time and probe and update it.
2.2 The process resources are sorted according to the evaluation criterion, and the sorted result is kept in a queue, so that the process that is optimal under the evaluation criterion is easily obtained and tasks are preferentially allocated to it. The evaluation criterion takes the average unit energy consumption over the current stages. In particular, processes that need to be probed are placed at the front of the queue. The advantages are that: this solves the problem that the original Spark scheduling simply shuffles the computing resources randomly and does not consider the different energy consumption of different process resources. The energy-saving scheduling method provided by the invention can effectively reduce the energy consumption of a running application, and placing the processes to be probed at the front of the queue allows the policy tables to be expanded dynamically when the physical cluster is expanded.
2.3 A Set_0 set is provided to hold the tasks that need to be probed, and a TaskQue double-ended queue is provided to hold the tasks whose execution time is known, sorted by their execution time on the process. The tasks in the Set_0 set are allocated preferentially, and then the tasks in the TaskQue double-ended queue are allocated alternately from head and tail. The advantages are that: the tasks kept in Set_0 are probed preferentially, so the policy tables can be expanded dynamically when the physical cluster is expanded. Allocating the tasks in the TaskQue double-ended queue alternately from head and tail keeps the execution time assigned to each process balanced, avoiding the energy consumption caused by one node finishing its run and then waiting for the other nodes; this ensures that the energy-saving scheduling method provided by the invention effectively reduces runtime energy consumption.
The advantages are that: the invention provides an energy-consumption-aware energy-saving Spark scheduling method that can effectively reduce the energy cost of big data computing, thereby improving the energy efficiency of the data center and effectively reducing the greenhouse gas emissions caused by the large amount of electric energy consumed by big data computing.
Drawings
FIG. 1 is an algorithmic diagram of a preferred embodiment of the present invention;
FIG. 2 is a flowchart of the overall calculation;
FIG. 3 is a MySQL database design E-R diagram;
fig. 4 is a diagram of WordCount program logic execution.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly in the following with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the basic idea of the invention is: firstly, constructing a big data calculation energy consumption model based on a Spark calculation frame; secondly, establishing an energy consumption relation strategy table and an execution time relation strategy table of the tasks and the computing resources, wherein the two strategy tables together guide task scheduling; thirdly, selecting the computing resources with the optimal evaluation criteria according to the policy table, preferentially distributing computing tasks for the computing resources, and simultaneously ensuring the balanced distribution of parallel computing tasks; fourthly, initializing the decision table through data detection, and updating the energy consumption relation strategy table and the execution time relation strategy table after the phase task is executed. The technical scheme of the invention comprises the following steps:
the method comprises the following steps: energy consumption model construction
Under the Spark computing framework, a big data application (App) generates a job Job_i every time it encounters an action operation; the definition of App is shown in formula (1). App denotes a big data application consisting of m jobs. Each Job_i is divided into multiple stages Stage_ij according to the RDD dependency relationships; the definition of Job_i is shown in formula (2). Job_i denotes the i-th job within App, consisting of n stages. Each Stage_ij is divided into multiple tasks according to the partitions of its last RDD; the definition of Stage_ij is shown in formula (3). Stage_ij denotes the j-th stage of Job_i, consisting of o tasks.
App = {Job_0, Job_1, Job_2, …, Job_{m-1}} (1)
Job_i = {Stage_i0, Stage_i1, Stage_i2, …, Stage_i(n-1)} (2)
Stage_ij = {task^ij_0, task^ij_1, task^ij_2, …, task^ij_{o-1}} (3)
where 0 ≤ i < m, 0 ≤ j < n, i ∈ Z, j ∈ Z.
The computing resources available within the Spark cluster, i.e. the executor processes existing on each worker node, are defined as shown in formula (4). Exe denotes the set of all available executor process resources on the cluster.
Exe = {ex_0, ex_1, ex_2, …, ex_{p-1}} (4)
e^ij_kl is used to represent the energy consumption of task^ij_k running on ex_l, t^ij_kl is used to represent the running time of task^ij_k on ex_l, and x^ij_kl is used to represent the relationship between task^ij_k and the resource process ex_l, defined as shown in formula (5):
when task^ij_k is allocated to ex_l, x^ij_kl = 1, otherwise x^ij_kl = 0 (5)
where 0 ≤ k < o, 0 ≤ l < p, k ∈ Z, l ∈ Z.
From formula (3), formula (4) and formula (5), the energy consumption generated by the j-th stage of Job_i can be obtained as Stage_ij^ec, defined as shown in formula (6):
Stage_ij^ec = Σ_k Σ_l (x^ij_kl · e^ij_kl) (6)
From formula (2) and formula (6), the energy consumption generated by the i-th job of App can be obtained as Job_i^ec, defined as shown in formula (7):
Job_i^ec = Σ_j Stage_ij^ec (7)
From formula (1) and formula (7), the energy consumption generated by the big data application App can be obtained as App^ec, defined as shown in formula (8):
App^ec = Σ_i Job_i^ec = Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (8)
where i ∈ Z, j ∈ Z, k ∈ Z, l ∈ Z.
The aim of the invention is to reduce the runtime energy consumption of the big data application, i.e. App^ec, so the objective function is defined as shown in formula (9):
obj^ec = min App^ec = min Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (9)
where i ∈ Z, j ∈ Z, k ∈ Z, l ∈ Z, and x^ij_kl is the task placement shown in formula (5). An unreasonable task placement will cause obj^ec to rise, which means that running big data applications of the same size in the same cluster will consume more energy. As can be seen from formula (9), how the tasks are placed is decisive for the objective function.
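For illustration only, the following minimal Scala sketch shows how the model of formulas (5)-(8) can be evaluated from a given task placement; the names (EnergyModel, stageEnergy, appEnergy) and data layout are assumptions of this sketch, not part of the patented implementation.

// Minimal sketch of formulas (5)-(8): accumulate App^ec from the placement x and per-task energies e.
object EnergyModel {
  // e((i, j, k, l)): energy e^ij_kl of task k of Stage_ij on process ex_l (joules)
  type EnergyTable = Map[(Int, Int, Int, Int), Double]
  // x((i, j, k, l)) = 1 if task k of Stage_ij is placed on ex_l, else 0   (formula (5))
  type Placement = Map[(Int, Int, Int, Int), Int]

  // Stage_ij^ec = sum_k sum_l x^ij_kl * e^ij_kl                           (formula (6))
  def stageEnergy(i: Int, j: Int, tasks: Int, procs: Int, x: Placement, e: EnergyTable): Double =
    (for (k <- 0 until tasks; l <- 0 until procs)
      yield x.getOrElse((i, j, k, l), 0) * e.getOrElse((i, j, k, l), 0.0)).sum

  // App^ec = sum_i sum_j Stage_ij^ec                                      (formulas (7)-(8))
  def appEnergy(stageTaskCounts: Map[(Int, Int), Int], procs: Int, x: Placement, e: EnergyTable): Double =
    stageTaskCounts.map { case ((i, j), o) => stageEnergy(i, j, o, procs, x, e) }.sum
}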
Step two: policy table initialization
The energy consumption relationship policy table of tasks and computing resources is defined as shown in table (1). Table (1) represents the result of the Cartesian product of the Stage_ij sets and Exe; it stores the historical energy consumption e^ij_kl of any task task^ij_k when it runs on any process ex_l. In the initial state, e^ij_kl has no valid value, i.e. valid data needs to be probed.
The execution time relationship policy table of tasks and computing resources is defined as table (2). Table (2) differs from table (1) in that it keeps the historical execution time t^ij_kl of any task task^ij_k on any process ex_l. In the initial state, t^ij_kl likewise has no valid value and needs to be probed.
Table (1): energy consumption relationship policy table (J) — rows are tasks task^ij_k, columns are processes ex_l, each cell stores e^ij_kl
Table (2): execution time relationship policy table (ms) — rows are tasks task^ij_k, columns are processes ex_l, each cell stores t^ij_kl
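As an illustration only, the two policy tables can be kept as mappings from (task, process) pairs to measured values, with entries that have not yet been probed marked by a sentinel; the sentinel value of -1 and the names PolicyTables, TaskId and needsProbe below are assumptions of this sketch, not taken from the patent.

import scala.collection.mutable

final case class TaskId(job: Int, stage: Int, index: Int)

// Sketch of the energy consumption and execution time relationship policy tables.
class PolicyTables {
  val Unprobed: Double = -1.0 // assumed sentinel for the "initial state" (valid data still to be probed)
  private val energy = mutable.Map.empty[(TaskId, Int), Double].withDefaultValue(Unprobed) // joules
  private val time   = mutable.Map.empty[(TaskId, Int), Double].withDefaultValue(Unprobed) // milliseconds

  def energyOf(t: TaskId, ex: Int): Double = energy((t, ex))
  def timeOf(t: TaskId, ex: Int): Double   = time((t, ex))
  def needsProbe(t: TaskId, ex: Int): Boolean = time((t, ex)) == Unprobed

  // Step five: after a task run finishes, record the measured energy and running time.
  def update(t: TaskId, ex: Int, joules: Double, millis: Double): Unit = {
    energy((t, ex)) = joules
    time((t, ex)) = millis
  }
}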
Step three: evaluating the unit energy consumption of the processes according to the evaluation criterion and sorting them
In Spark, the DAGScheduler encapsulates a batch of stages that can be processed in parallel into a TaskSet and sends it to the TaskScheduler for task scheduling. Within the range of this batch of stages to be processed in parallel, the algorithm defines the evaluation value eva_l of any process ex_l as formula (10):
eva_l = ( Σ_{task_k' ∈ ∪Stage'} e_{k'l} / t_{k'l} ) / Σ|Stage'| (10)
where Stage' denotes a stage that the current DAGScheduler submits to the TaskScheduler for processing, ∪Stage' denotes the union of all the stages the TaskScheduler needs to process, Σ|Stage'| denotes the total number of tasks in all those stages, e_{k'l}/t_{k'l} denotes the unit energy consumption of task_k' of the ∪Stage' set when executed on process ex_l, and the numerator sums this unit energy consumption over all tasks task_k' in the ∪Stage' set. The processes are sorted in ascending order according to this evaluation criterion and the policy tables to form a queue (ExeQue). In particular, when e^ij_kl or t^ij_kl in the policy tables is still in the initialization state for a task on process ex_l, the process ex_l is placed at the front of the queue, and tasks are preferentially allocated to this process.
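The following sketch, reusing the hypothetical PolicyTables above, shows one way to compute eva_l as the average unit energy consumption e_{k'l}/t_{k'l} over the tasks of the submitted stages and to build ExeQue with un-probed processes at the front; it is an illustration under the stated assumptions, not the patented code.

// Step three sketch: evaluate each process over the batch of stages and sort ascending.
def buildExeQue(batchTasks: Seq[TaskId], processes: Seq[Int], tables: PolicyTables): Seq[Int] = {
  // eva_l = ( sum over tasks of e_{k'l} / t_{k'l} ) / (total number of tasks)   (formula (10))
  def eva(ex: Int): Double =
    batchTasks.map(t => tables.energyOf(t, ex) / tables.timeOf(t, ex)).sum / batchTasks.size

  // Processes with un-probed entries go to the front of the queue so they are probed first.
  val (unprobed, probed) = processes.partition(ex => batchTasks.exists(t => tables.needsProbe(t, ex)))
  unprobed ++ probed.sortBy(eva)
}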
Step four: preferentially allocating tasks to the process that is optimal under the evaluation criterion
The currently optimal process ex_l can be obtained from step three. According to the execution time policy table, the tasks are sorted in ascending order of execution time t^ij_kl to form a double-ended queue (TaskQue). In particular, when t^ij_kl is in the initialization state, the task needs to be put separately into Set_0 and probed preferentially.
If Set_0 ≠ ∅, the tasks in Set_0 are preferentially allocated to ex_l until the resources of ex_l are exhausted or Set_0 = ∅; if the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues.
If Set_0 = ∅, i.e. the data of the current decision table either needs no probing or has been fully allocated, tasks are taken alternately from the head and the tail of TaskQue and allocated to ex_l until the resources of ex_l are exhausted or the TaskQue queue is empty. If the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues. If the TaskQue queue is empty, all task allocation has ended.
Allocating tasks alternately keeps the total execution time allocated to each process as balanced as possible, avoiding the situation where one machine runs for too long while the other machines stand by and cause unnecessary energy consumption.
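A compact sketch of step four follows: tasks whose execution time on the chosen process is unknown go into Set_0 and are assigned first for probing, and the remaining tasks are placed in a double-ended queue sorted by execution time and assigned alternately from head and tail so that each process receives a balanced total execution time. The names assignOnProcess and freeSlots are illustrative assumptions.

import scala.collection.mutable.ArrayDeque

// Step four sketch: returns the (task, process) assignments made on process `ex`.
def assignOnProcess(ex: Int, freeSlots: Int, stageTasks: Seq[TaskId], tables: PolicyTables): Seq[(TaskId, Int)] = {
  val (set0, known) = stageTasks.partition(t => tables.needsProbe(t, ex))  // Set_0: tasks to probe first
  val taskQue = ArrayDeque.from(known.sortBy(t => tables.timeOf(t, ex)))   // ascending execution time
  val assigned = Seq.newBuilder[(TaskId, Int)]
  var slots = freeSlots
  var fromHead = true

  for (t <- set0 if slots > 0) { assigned += (t -> ex); slots -= 1 }       // probe Set_0 preferentially
  while (taskQue.nonEmpty && slots > 0) {                                  // then alternate head / tail
    val t = if (fromHead) taskQue.removeHead() else taskQue.removeLast()
    assigned += (t -> ex); slots -= 1; fromHead = !fromHead
  }
  assigned.result()   // leftover tasks are handed to the next process taken from ExeQue
}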
Step five: energy consumption aware task scheduling policy
After a task run finishes, the algorithm records the energy consumption and running time of that run, updates the energy consumption relationship policy table and the execution time relationship policy table, and provides a decision basis for the next run of the same task. The priority probing of entries still in the initialization state in step four ensures that, as long as unknown entries exist in the policy tables, they will be probed, so the decision data remains complete. The algorithm cooperates with the Spark DAGScheduler to complete the computation of the Spark big data application, and the energy consumption generated by the big data application is obtained through the energy consumption model provided in step one.
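As a usage illustration of the hypothetical PolicyTables sketch above, step five then amounts to writing the measured values back after each run so that the next run of the same task can be scheduled from real data; the task and measurement values below are invented for the example.

// Step five sketch: record the measured energy (J) and running time (ms) after a task finishes.
val tables = new PolicyTables
val t = TaskId(job = 0, stage = 1, index = 2)            // illustrative task^01_2
tables.update(t, ex = 0, joules = 6.0, millis = 1200.0)  // assumed example measurements
// Subsequent scheduling of the same application reads these values instead of probing again.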
The implementation of the invention comprises an energy consumption evaluation module and a scheduling module. After the application program finishes executing, the energy consumption evaluation module obtains the program's running energy consumption according to the energy consumption model and also takes on the function of updating the policy tables. The scheduling module performs task scheduling according to the energy-consumption-aware scheduling algorithm.
1. Energy consumption evaluation module
The energy consumption evaluation module has the function of calculating the execution time of each task thread on a worker node; specifically, the running time of a task on a process is obtained by analyzing the monitoring information provided by Spark, i.e. the execution time t^ij_kl of any task task^ij_k on any process ex_l can be obtained.
The energy consumption evaluation module also has the function of calculating the energy consumption generated by each task thread on the worker node. The specific method is as follows: the CPU usage (U_CPU) and memory usage (U_memory) of process ex_l at runtime are obtained through an operating system resource monitoring program.
P = C_0 + C_1 × U_CPU + C_2 × U_memory (11)
Based on the system-utilization energy consumption power model [7], shown as formula (11), the instantaneous power consumption P_exe of the process while running can be obtained.
The energy consumption P_ec of process ex_l is obtained from formula (12); P_ec accurately reflects the energy consumption generated by the load on process ex_l. The execution time of a task and the energy consumption of that task are taken to be uniformly related, that is, the longer a task's execution time, the greater the energy consumption it generates.
From formula (13), the energy consumption e^ij_kl of task^ij_k executed on any process ex_l can be obtained:
e^ij_kl = P_ec × t^ij_kl / Σ_{k'} t_{k'l} (13)
where Σ_{k'} t_{k'l} denotes the sum of the execution times of all the tasks on process ex_l.
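For clarity, the following sketch shows the two calculations performed by the energy consumption evaluation module: the instantaneous power of formula (11) from CPU and memory utilisation, and the apportionment of the process energy to its tasks in proportion to execution time as in formula (13) as reconstructed above. The function names and coefficient handling are assumptions for illustration; in the invention the utilisation values come from an operating system resource monitoring program.

// Formula (11): instantaneous power from CPU and memory utilisation.
final case class PowerModel(c0: Double, c1: Double, c2: Double) {
  def power(uCpu: Double, uMemory: Double): Double = c0 + c1 * uCpu + c2 * uMemory
}

// Formula (13) sketch: share the process energy pEc among its tasks
// in proportion to each task's execution time on that process.
def taskEnergies(pEc: Double, taskTimesMs: Map[TaskId, Double]): Map[TaskId, Double] = {
  val totalMs = taskTimesMs.values.sum                             // sum of t_{k'l} on ex_l
  taskTimesMs.map { case (t, ms) => t -> (pEc * ms / totalMs) }    // e^ij_kl
}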
The energy consumption evaluation module records the execution time of each task on each process and the energy consumption it generates in a MySQL database, so that the scheduling module can generate decisions from these records and schedule tasks. The data table design is shown in FIG. 3. Finally, when the program finishes executing, the energy consumption evaluation module obtains the program's running energy consumption according to the energy consumption model, formula (8).
2. Scheduling module
The scheduling module modifies the implementation class TaskSchedulerImpl of org.apache.spark.scheduler. The granularity of the original scheduling strategy is changed from stage to task, and tasks are sorted by execution time. Instead of shuffling the process resources as in the original implementation, the process resources are sorted according to the evaluation criterion, and instead of allocating tasks by locality level, the optimal process is allocated according to execution time. Overall, as shown in fig. 1, the Spark energy consumption perception algorithm schedules tasks on the basis of the decision tables.
The invention is described in detail below by taking WordCount as an example:
the WordCount program is a word statistics program for simple statistics. The main function of WordCount is to count the frequency of each word in the text. The program logic executes the graph shown in FIG. 4. The logic execution process mainly goes through the following conversion of RDD:
1) Reading data from the HDFS, generating RDD1: textFile.
2) RDD1 is converted into RDD2 by a transformation operation: flatMap. The text is split into words by spaces, and it is easy to see that RDD2 has a narrow dependency on RDD1.
3) RDD2 is converted into RDD3 by a transformation operation: map. Each word is converted into a key-value pair <word, 1>, and it is easy to see that RDD3 has a narrow dependency on RDD2.
4) RDD3 is converted into RDD4 by a transformation operation: reduceByKey. The words are shuffled and the values of identical words are added; it is easy to see that RDD4 has a wide dependency on RDD3.
5) RDD4 is converted into RDD5 by an action operation: saveAsTextFile. The action operation triggers the Job submission and the computation result is stored in HDFS; it is easy to see that RDD5 has a narrow dependency on RDD4.
The WordCount program contains only one job (Job_0). The Spark DAGScheduler divides the job into 2 stages (Stage_00, Stage_01) according to the dependency relationships between the RDDs: RDD4 and RDD5 are divided into Stage_00, and RDD1, RDD2 and RDD3 are divided into Stage_01. The number of tasks in a stage is determined by the number of partitions of the last RDD in that stage: Stage_00 comprises 2 tasks (task^00_0, task^00_1) and Stage_01 comprises 3 tasks (task^01_0, task^01_1, task^01_2).
Assume a Spark cluster in which the currently available computing resources are Exe = {ex_0, ex_1, ex_2, ex_3}, and each computing resource occupies two CPU cores, i.e. can execute two tasks. The known energy consumption relationship policy table is shown in table (3), and the known execution time relationship policy table is shown in table (4).
Table (3): known energy consumption relationship policy table (J)
Table (4): known execution time relationship policy table (ms)
The scheduling module steps are described as follows:
submitting Stage 01 Phases
1) Submitting execution sequencing to generate ExeQue queue
Evaluation { eva) of executor from equation (10) 0 =0.84,eva 1 =2,eva 2 =1.17,eva 3 If =1.32, exeQue = { ex = 0 ,ex 2 ,ex 3 ,ex 1 }
2) Taking out the optimal process, and sequencing all tasks to TaskQue double-ended queue
Fetch Process ex 0 According to in ex 0 Perform time-up sequencing, then
3) Assigning tasks
First distributionTo ex 0 I.e. byRedistributionTo ex 0 I.e. byTask not allocated, ex 0 The resources are exhausted, and a suboptimal process ex is taken out 2 . DispensingTo ex 2 I.e. by
4) After the task is executed, the energy consumption evaluation module calculates the execution time and energy consumption of the task and records the data into the MySQL database.
Submitting the Stage_00 stage:
1) Sort the executors to generate the ExeQue queue.
The evaluation values of the executors obtained from formula (10) are {eva_0 = 0.5, eva_1 = 3, eva_2 = 0.85, eva_3 = 1.83}, so ExeQue = {ex_0, ex_2, ex_3, ex_1}.
2) Take out the optimal process and sort all tasks into the TaskQue double-ended queue.
The process ex_0 is taken out, and the tasks of Stage_00 are sorted in ascending order of their execution time on ex_0 to form TaskQue.
3) Assign tasks.
First one task is allocated to ex_0, then the other task is allocated to ex_0.
4) After the task is executed, the energy consumption evaluation module calculates the execution time and energy consumption of the task and records the data into the MySQL database.
The example comprises only two stages; after the Spark DAGScheduler completes the computation of all stages, the job computation is finished. In this example the program contains only one job, so the program ends, and the energy consumption evaluation module gives the energy consumption value and running time of the program.
The task scheduling results are as follows:
In the Stage_00 stage, task^00_0 and task^00_1 are allocated to process ex_0.
In the Stage_01 stage, two tasks are allocated to process ex_0 and one task is allocated to process ex_2.
From tables (3) and (4) and formula (8), App^ec = (1 + 3) + (3 + 6 + 10) = 23 (J) can be obtained. The above example is to be construed as merely illustrative and not limitative of the remainder of the disclosure in any way whatsoever. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall within the scope of the invention defined by the claims.

Claims (7)

1. A Spark energy-saving scheduling method based on energy consumption perception is characterized by comprising the following steps:
firstly, constructing a big data calculation energy consumption model under a Spark calculation framework;
secondly, establishing a policy table of energy consumption and execution time relation between the tasks and the computing resources based on the model, and guiding and optimizing Spark task scheduling through the policy table;
and finally, effectively reducing the total computing energy consumption on the premise of ensuring parallel computing efficiency.
2. The Spark energy-saving scheduling method based on energy consumption perception according to claim 1, wherein the big data computing energy consumption model is: the energy consumption App^ec generated by the big data application App is defined as shown in formula (8):
App^ec = Σ_i Job_i^ec = Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (8)
and the objective function for the big data computing energy consumption is defined as shown in formula (9):
obj^ec = min App^ec = min Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (9)
where i ∈ Z, j ∈ Z, k ∈ Z, l ∈ Z; x^ij_kl represents the task placement, and x^ij_kl represents the relationship between task^ij_k and the resource process ex_l, defined as shown in formula (5): when task^ij_k is allocated to ex_l, x^ij_kl = 1, otherwise x^ij_kl = 0.
3. The Spark energy-saving scheduling method based on energy consumption perception according to claim 1 or 2, wherein the energy consumption relationship policy table of tasks and computing resources represents the result of the Cartesian product of the Stage_ij sets and Exe, and stores the historical energy consumption e^ij_kl of any task task^ij_k when it runs on any process ex_l; in the initial state, e^ij_kl has no valid value;
the execution time relationship policy table of tasks and computing resources stores the execution time t^ij_kl of any task task^ij_k on any process ex_l; in the initial state, t^ij_kl likewise has no valid value.
4. The Spark energy-saving scheduling method based on energy consumption perception according to claim 3, wherein the processes are sorted according to the energy consumption relationship policy table and the execution time relationship policy table to form a process queue ExeQue; when e^ij_kl or t^ij_kl in the policy tables, representing the energy consumption or execution time of task^ij_k on process ex_l, is in the initialization state, the process ex_l needs to be placed at the front of the queue, and tasks are preferentially allocated to this process.
5. The Spark energy-saving scheduling method based on energy consumption perception according to claim 4, wherein the currently optimal process ex_l at the head of the queue is obtained from ExeQue; according to the execution times t^ij_kl in the execution time relationship policy table, the tasks whose t^ij_kl is still in the initialization state form a set Set_0, and the remaining tasks are sorted in ascending order of their execution time t^ij_kl on ex_l to form a double-ended queue TaskQue;
if Set_0 ≠ ∅, the tasks in Set_0 are preferentially allocated to ex_l until the resources of ex_l are exhausted or Set_0 = ∅; if the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues;
if Set_0 = ∅, i.e. the data of the current decision table either needs no probing or has been fully allocated, tasks are taken alternately from the head and the tail of TaskQue and allocated to ex_l until the resources of ex_l are exhausted or the TaskQue queue is empty; if the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues; if the TaskQue queue is empty, all task allocation has ended.
6. The Spark energy-saving scheduling method based on energy consumption perception according to claim 5, wherein when the execution time t^ij_kl of a task is in the initialization state, the task needs to be put separately into Set_0 and probed preferentially.
7. The Spark energy-saving scheduling method based on energy consumption perception according to claim 6, wherein after a task run finishes, the energy consumption and running time of the current run are recorded, and the energy consumption relationship policy table and the execution time relationship policy table are updated to provide a decision basis for the next run of the same task.
CN201710452338.4A 2017-06-15 2017-06-15 Spark energy-saving scheduling method based on energy consumption perception Active CN107704069B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710452338.4A CN107704069B (en) 2017-06-15 2017-06-15 Spark energy-saving scheduling method based on energy consumption perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710452338.4A CN107704069B (en) 2017-06-15 2017-06-15 Spark energy-saving scheduling method based on energy consumption perception

Publications (2)

Publication Number Publication Date
CN107704069A true CN107704069A (en) 2018-02-16
CN107704069B CN107704069B (en) 2020-08-04

Family

ID=61170182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710452338.4A Active CN107704069B (en) 2017-06-15 2017-06-15 Spark energy-saving scheduling method based on energy consumption perception

Country Status (1)

Country Link
CN (1) CN107704069B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733195A (en) * 2018-05-29 2018-11-02 郑州易通众联电子科技有限公司 Computer operation method and device based on equipment operational energy efficiency
CN109582119A (en) * 2018-11-28 2019-04-05 重庆邮电大学 The double-deck Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN109614210A (en) * 2018-11-28 2019-04-12 重庆邮电大学 Storm big data energy-saving scheduling method based on energy consumption perception
CN109857084A (en) * 2019-01-18 2019-06-07 湖南大学 A kind of high-performing car electronic Dynamic dispatching algorithm of energy consumption perception
CN110008013A (en) * 2019-03-28 2019-07-12 东南大学 A kind of Spark method for allocating tasks minimizing operation completion date
CN110928666A (en) * 2019-12-09 2020-03-27 湖南大学 Method and system for optimizing task parallelism based on memory in Spark environment
CN111061565A (en) * 2019-12-12 2020-04-24 湖南大学 Two-stage pipeline task scheduling method and system in Spark environment
CN112532464A (en) * 2021-02-08 2021-03-19 中国人民解放军国防科技大学 Data distributed processing acceleration method and system across multiple data centers
CN115825736A (en) * 2023-02-09 2023-03-21 苏州洪昇新能源科技有限公司 Energy consumption comprehensive test method and system for energy-saving equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102533A (en) * 2014-06-17 2014-10-15 华中科技大学 Bandwidth aware based Hadoop scheduling method and system
CN106293933A (en) * 2015-12-29 2017-01-04 北京典赞科技有限公司 A kind of cluster resource configuration supporting much data Computational frames and dispatching method
CN106371924A (en) * 2016-08-29 2017-02-01 东南大学 Task scheduling method for maximizing MapReduce cluster energy consumption

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102533A (en) * 2014-06-17 2014-10-15 华中科技大学 Bandwidth aware based Hadoop scheduling method and system
CN106293933A (en) * 2015-12-29 2017-01-04 北京典赞科技有限公司 A kind of cluster resource configuration supporting much data Computational frames and dispatching method
CN106371924A (en) * 2016-08-29 2017-02-01 东南大学 Task scheduling method for maximizing MapReduce cluster energy consumption

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
RALPH HINTEMANN et al.: "Energy efficiency of data centers - A system-oriented analysis of current development trends", ELECTRONICS GOES GREEN *
ZHANG YANLU: "A low-energy resource scheduling strategy for cloud computing data centers based on a genetic algorithm", China Masters' Theses Full-text Database (Electronic Journal), Information Science and Technology Series *
LI XUEJUN et al.: "Energy-consumption-aware task scheduling algorithm in cloud workflow systems", Pattern Recognition and Artificial Intelligence *
YANG ZHIWEI et al.: "Adaptive task scheduling strategy for heterogeneous Spark clusters", Computer Engineering *
XUE SHENGJUN et al.: "Energy-consumption-aware fairness-improving resource scheduling strategy in cloud environments", Journal of Computer Applications *
HUANG QINGJIA: "Research on energy-cost-aware resource scheduling mechanisms for cloud data centers", China Doctoral Dissertations Full-text Database (Electronic Journal), Information Science and Technology Series *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733195A (en) * 2018-05-29 2018-11-02 郑州易通众联电子科技有限公司 Computer operation method and device based on equipment operational energy efficiency
CN109582119B (en) * 2018-11-28 2022-07-12 重庆邮电大学 Double-layer Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN109582119A (en) * 2018-11-28 2019-04-05 重庆邮电大学 The double-deck Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN109614210A (en) * 2018-11-28 2019-04-12 重庆邮电大学 Storm big data energy-saving scheduling method based on energy consumption perception
CN109614210B (en) * 2018-11-28 2022-11-04 重庆邮电大学 Storm big data energy-saving scheduling method based on energy consumption perception
CN109857084A (en) * 2019-01-18 2019-06-07 湖南大学 A kind of high-performing car electronic Dynamic dispatching algorithm of energy consumption perception
CN110008013A (en) * 2019-03-28 2019-07-12 东南大学 A kind of Spark method for allocating tasks minimizing operation completion date
CN110008013B (en) * 2019-03-28 2023-08-04 东南大学 Spark task allocation method for minimizing job completion time
CN110928666B (en) * 2019-12-09 2022-03-22 湖南大学 Method and system for optimizing task parallelism based on memory in Spark environment
CN110928666A (en) * 2019-12-09 2020-03-27 湖南大学 Method and system for optimizing task parallelism based on memory in Spark environment
CN111061565A (en) * 2019-12-12 2020-04-24 湖南大学 Two-stage pipeline task scheduling method and system in Spark environment
CN111061565B (en) * 2019-12-12 2023-08-25 湖南大学 Two-section pipeline task scheduling method and system in Spark environment
CN112532464A (en) * 2021-02-08 2021-03-19 中国人民解放军国防科技大学 Data distributed processing acceleration method and system across multiple data centers
CN115825736A (en) * 2023-02-09 2023-03-21 苏州洪昇新能源科技有限公司 Energy consumption comprehensive test method and system for energy-saving equipment
CN115825736B (en) * 2023-02-09 2024-01-19 福建明泰嘉讯信息技术有限公司 Comprehensive energy consumption testing method and system for energy-saving equipment

Also Published As

Publication number Publication date
CN107704069B (en) 2020-08-04

Similar Documents

Publication Publication Date Title
CN107704069B (en) Spark energy-saving scheduling method based on energy consumption perception
Khorasani et al. Scalable simd-efficient graph processing on gpus
CN105446816B (en) A kind of energy optimization dispatching method towards heterogeneous platform
CN108572873A (en) A kind of load-balancing method and device solving the problems, such as Spark data skews
CN105373432B (en) A kind of cloud computing resource scheduling method based on virtual resource status predication
CN103500123B (en) Parallel computation dispatching method in isomerous environment
US11816509B2 (en) Workload placement for virtual GPU enabled systems
CN105718479A (en) Execution strategy generation method and device under cross-IDC (Internet Data Center) big data processing architecture
US8527988B1 (en) Proximity mapping of virtual-machine threads to processors
CN108132840B (en) Resource scheduling method and device in distributed system
CN108427602B (en) Distributed computing task cooperative scheduling method and device
Wang et al. Task scheduling algorithm based on improved Min-Min algorithm in cloud computing environment
Yu et al. Fluid: Resource-aware hyperparameter tuning engine
CN114281528A (en) Energy-saving scheduling method and system based on deep reinforcement learning and heterogeneous Spark cluster
Hu et al. Improved heuristic job scheduling method to enhance throughput for big data analytics
CN111309472A (en) Online virtual resource allocation method based on virtual machine pre-deployment
Hu et al. FlowTime: Dynamic scheduling of deadline-aware workflows and ad-hoc jobs
CN116932201A (en) Multi-resource sharing scheduling method for deep learning training task
CN110084507B (en) Scientific workflow scheduling optimization method based on hierarchical perception in cloud computing environment
CN115981843A (en) Task scheduling method and device in cloud-edge cooperative power system and computer equipment
Toporkov et al. Preference-based fair resource sharing and scheduling optimization in Grid VOs
Babu et al. Energy efficient scheduling algorithm for cloud computing systems based on prediction model
Shu-Jun et al. Optimization and research of hadoop platform based on fifo scheduler
Iglesias et al. A methodology for online consolidation of tasks through more accurate resource estimations
Dai et al. Improved greedy strategy for cloud computing resources scheduling

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230823

Address after: Room 801, 85 Kefeng Road, Huangpu District, Guangzhou City, Guangdong Province

Patentee after: Guangzhou Dayu Chuangfu Technology Co.,Ltd.

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS