CN107704069A - A kind of Spark energy-saving scheduling methods perceived based on energy consumption - Google Patents

A kind of Spark energy-saving scheduling methods perceived based on energy consumption Download PDF

Info

Publication number
CN107704069A
CN107704069A
Authority
CN
China
Prior art keywords
energy consumption
task
spark
tasks
energy
Prior art date
Legal status
Granted
Application number
CN201710452338.4A
Other languages
Chinese (zh)
Other versions
CN107704069B (en)
Inventor
李鸿健
王霍琛
代宇
熊安萍
蒋溢
Current Assignee
Guangzhou Dayu Chuangfu Technology Co ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201710452338.4A priority Critical patent/CN107704069B/en
Publication of CN107704069A publication Critical patent/CN107704069A/en
Application granted granted Critical
Publication of CN107704069B publication Critical patent/CN107704069B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/329Power saving characterised by the action undertaken by task scheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/4893Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues taking into account power or heat criteria
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Power Sources (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a Spark energy-saving scheduling method based on energy consumption perception. A big data computing energy consumption model under the Spark computing framework is built first; based on the model, policy tables of the energy consumption and execution time relationships between tasks and computing resources are established, and Spark task scheduling is guided and optimized through the policy tables, effectively reducing the total computing energy consumption while guaranteeing parallel computing efficiency. The invention overcomes the defect that Spark's original scheduling strategies cannot perceive energy consumption; the method is energy-consumption-aware, dynamically optimizes scheduling, and is highly scalable, effectively reducing the energy consumption generated by applications running under the Spark computing framework.

Description

Spark energy-saving scheduling method based on energy consumption perception
Technical Field
The invention relates to the field of big data processing and energy efficiency, and in particular to an energy-consumption-aware big data energy consumption model and an energy-saving scheduling strategy under the Spark computing framework based on that model.
Background
The huge electric energy consumption generated by big data computing is an urgent problem for data centers. Many enterprises and organizations face large-scale data computing, and computing cost is an important concern alongside computing efficiency; they hope to reduce the power consumed by big data computing and thereby reduce computing cost. The situation, however, is not optimistic in the current big data era. A recent research report [1] shows that the power consumption of German data centers increased by 15% from 2010 to 2015, reaching 12 billion kilowatt-hours per year, and a 2014 report of Greenpeace International [2] predicts that global data center electricity demand will grow by 81% by 2020; operating and maintaining data centers consumes huge amounts of electric energy. By optimizing the way a data center's big data computing tasks are scheduled, the energy cost of big data computing can be reduced effectively and the energy efficiency of the data center improved.
Currently, the Spark computing framework [3] is widely used for big data computing in data centers. Spark provides a distributed memory abstraction over a cluster, the resilient distributed dataset (RDD), which is an immutable set of partitioned records and is Spark's programming model. Spark divides the RDDs into different stages according to the dependency relationships between RDDs, and each stage processes its RDDs in a pipelined fashion through a series of methods. Within each stage, a set of tasks equal in number to the partitions of the last RDD of the stage is created, i.e., the data of one partition is processed by one task. These tasks are scheduled to worker nodes for parallel processing. For a big data application under the Spark computing framework, only one executor process is placed on each worker; that process occupies computing resources and can process multiple tasks. The task scheduling problem can then be described as a bin-packing problem: tasks are the items, executor processes are the bins, the resources occupied by an executor are the bin's size, and the amount of resources required by a task is the item's size; allocating running resources to tasks is clearly an NP-hard problem. Assigning the same task to different executors leads to different running times and energy consumption, so a reasonable allocation plays a vital role in reducing the running energy consumption of a big data application.
Currently, Spark itself provides two scheduling strategies [3]: the FIFO and FAIR scheduling strategies. The scheduling granularity of both strategies is the stage, and the tasks within each stage are randomly shuffled and then distributed to executors for execution according to the locality principle. The two strategies differ in that FIFO orders the stages by Job generation order and by stage generation order within a Job and allocates computing resources in that order, whereas FAIR divides the stages into several groups, each called a Pool, polls the Pools according to their weights, orders the stages, and allocates computing resources in that order. Yang Zhiwei et al. [4] propose an adaptive task scheduling strategy for heterogeneous Spark clusters: the strategy monitors the load and resource utilization of the nodes, analyzes the monitored parameters, and adaptively and dynamically adjusts the task allocation weight of each node, but it does not consider the energy consumption problem. Leverich and Kozyrakis [5] propose an energy management method for MapReduce jobs that selectively shuts down nodes with low utilization to reduce energy consumption: at least one replica of each data block is kept within a covering set, and nodes outside the covering set with low utilization are turned off. L. Mashayekhy et al. [6] propose an efficient energy-saving heuristic task scheduling algorithm for the MapReduce computing framework; by analyzing the energy consumption of a specific job in advance, a low-energy task scheduling strategy is obtained that effectively reduces energy consumption while satisfying SLA requirements. These energy-saving strategies are all based on the MapReduce computing framework; research on Spark energy consumption is currently insufficient.
Task scheduling for Spark is a bin-packing problem. The heuristic task scheduling algorithm proposed by L. Mashayekhy et al. [6] for the MapReduce computing framework sorts the computing resources (slots) of the Map stage and the Reduce stage according to a pre-computed energy consumption analysis and places tasks on the slots with the lowest unit energy consumption in each stage. In practical applications, an energy consumption analysis must be performed in advance for every new job, and when the physical nodes of the cluster change, the original performance analysis results must be regenerated; the approach therefore lacks flexibility and cannot be applied to the energy consumption problem of the Spark computing framework.
Reference documents:
[1] Hintemann R, Beucker S, Clausen J, et al. Energy efficiency of data centers - a system-oriented analysis of current development trends [C] // Electronics Goes Green. IEEE, 2017.
[2] Salahuddin M, Alam K. Information and Communication Technology, electricity consumption and economic growth in OECD countries: A panel data analysis [J]. International Journal of Electrical Power & Energy Systems, 2016, 76: 185-193.
[3] Zhang. Spark Technology Insider (Spark技术内幕) [M]. China Machine Press, 2015.
[4] Yang Zhiwei, et al. Adaptive task scheduling strategy for heterogeneous Spark clusters [J]. Computer Engineering, 2016, 42(1): 31-35.
[5] Leverich J, Kozyrakis C. On the energy (in)efficiency of Hadoop clusters [J]. ACM SIGOPS Oper. Syst. Rev., 2010, 44(1): 61-65.
[6] Mashayekhy L, Nejad M M, Grosu D, et al. Energy-Aware Scheduling of MapReduce Jobs for Big Data Applications [J]. IEEE Transactions on Parallel & Distributed Systems, 2015, 26(10): 2720-2733.
[7] Luo Liang, Wu Wenjun, Zhang Fei. Energy consumption modeling method for cloud computing data centers [J]. Journal of Software, 2014(7): 1371-1387.
Disclosure of Invention
The invention aims to solve the above problems in the prior art, and provides an energy-consumption-aware big data energy consumption model and, based on that model, a scheduling method for the Spark computing framework that can markedly reduce the energy consumption of big data computing. The technical scheme of the invention is as follows:
a Spark energy-saving scheduling method based on energy consumption perception comprises the following steps:
firstly, constructing a big data calculation energy consumption model based on a Spark calculation frame;
secondly, establishing an energy consumption relation strategy table and an execution time relation strategy table of the tasks and the computing resources, wherein the two strategy tables together guide task scheduling;
thirdly, selecting the computing resources with the optimal evaluation criteria according to the policy table, preferentially distributing computing tasks for the computing resources, and simultaneously ensuring the balanced distribution of parallel computing tasks;
fourthly, initializing the decision table through data detection, and updating the energy consumption relation strategy table and the execution time relation strategy table after the phase task is executed.
Further, the big data computing energy consumption model is as follows: the energy consumption generated by a big data application App is App^ec, defined as shown in formula (8):
App^ec = Σ_i Job_i^ec = Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (8)
Reducing the runtime energy consumption of the big data application, i.e. App^ec, the objective function is defined as shown in formula (9):
obj^ec = min App^ec = min Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (9)
where i ∈ Z, j ∈ Z, k ∈ Z, l ∈ Z; x^ij_kl represents the task placement, e^ij_kl represents the energy consumption of task^ij_k running on ex_l, and x^ij_kl represents the relationship between task^ij_k and the resource process ex_l, defined as shown in formula (5):
when task^ij_k is allocated to ex_l, x^ij_kl = 1, otherwise x^ij_kl = 0 (5)
Further, the energy consumption relationship policy table of tasks and computing resources is as follows: it represents the result of the Cartesian product of the Stage_ij sets and Exe, and stores the historical energy consumption e^ij_kl of any task task^ij_k when it runs on any process ex_l; in the initial state, e^ij_kl has no valid value, i.e. valid data needs to be probed;
the execution time relationship policy table of tasks and computing resources is as follows: it stores the execution time t^ij_kl of any task task^ij_k on any process ex_l; in the initial state, t^ij_kl likewise has no valid value and needs to be probed.
Further, selecting the preferred computing resource according to the energy consumption relationship policy table, the execution time relationship policy table and the evaluation criterion specifically comprises: within a batch of stages to be processed in parallel, the evaluation value eva_l of a process ex_l is defined as formula (10):
eva_l = ( Σ_{task_k' ∈ ∪Stage'} e_{k'l} / t_{k'l} ) / Σ|Stage'| (10)
where Stage' denotes a stage that the current DAGScheduler submits to the TaskScheduler for processing, ∪Stage' denotes the union of all the stages the TaskScheduler needs to process, Σ|Stage'| denotes the total number of tasks in all the stages the TaskScheduler needs to process, e_{k'l}/t_{k'l} denotes the unit energy consumption of a task task_k' of the ∪Stage' set when executed on process ex_l, and the numerator sums this unit energy consumption over all tasks task_k' in the ∪Stage' set; the processes are sorted in ascending order according to the evaluation criterion and the policy tables to form a queue.
Further, when e^ij_kl or t^ij_kl in the policy tables, representing the energy consumption or execution time of task^ij_k on process ex_l, is in the initialization state, the process ex_l needs to be placed at the front of the queue, and tasks are preferentially allocated to this process.
Further, the currently optimal process ex_l is obtained from the ascending process queue; according to the execution times t^ij_kl in the execution time policy table, the tasks whose t^ij_kl is still in the initialization state form a set Set_0, and the remaining tasks are sorted in ascending order of their execution time t^ij_kl on ex_l to form a double-ended queue TaskQue;
if Set_0 ≠ ∅, the tasks in Set_0 are preferentially allocated to ex_l until the resources of ex_l are exhausted or Set_0 = ∅; if the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues;
if Set_0 = ∅, i.e. the data of the current decision table either needs no probing or has been fully allocated, tasks are taken alternately from the head and the tail of TaskQue and allocated to ex_l until the resources of ex_l are exhausted or the TaskQue queue is empty; if the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues; if the TaskQue queue is empty, all task allocation has ended.
Further, when the execution time t^ij_kl of a task is in the initialization state, the task needs to be put separately into Set_0 and probed preferentially.
Furthermore, after a task run finishes, the energy consumption and running time of the current run are recorded, and the energy consumption relationship policy table and the execution time relationship policy table are updated, providing a decision basis for the next run of the same task.
The invention has the following advantages and beneficial effects:
the innovation points of the invention are as follows:
1. energy consumption model established for Spark big data application
The invention is the first to provide mathematical models of Spark big data application energy consumption, job energy consumption and stage energy consumption, detailed in formulas (6), (7) and (8), providing a model basis for accurately calculating the energy consumption of Spark big data applications.
The advantages are that: the energy consumption model describes, in set form, the relationship between a big data application (App) and its jobs (Job), between a job (Job) and its stages (Stage), and between a stage (Stage) and its tasks (task). The relationship between tasks and processes (ex) is described using the variable x. The execution energy consumption of a computing task on a process can thus be obtained flexibly, and from it the stage energy consumption, job energy consumption and application energy consumption. At the same time, it provides a computational energy consumption basis for Spark energy consumption research.
2. An energy-consumption-aware Spark energy-saving scheduling method is provided
The innovation points of the energy-consumption-aware Spark energy-saving scheduling method can be subdivided as follows:
2.1 An energy consumption relationship policy table and an execution time relationship policy table of tasks and computing resources are proposed for the first time; they record the historical energy consumption relationships and historical execution time relationships and provide a policy basis for task allocation. At the same time, the data in the policy tables are updated after each task finishes. The advantages are that: the energy-consumption-aware Spark energy-saving scheduling method has dynamic extensibility. The scheduling algorithm is suited to scenarios where the same big data application is run repeatedly; the policy table mechanism is dynamically updated on every run, achieving the energy-consumption-aware effect. When the physical cluster changes, the policy table update mechanism can perceive unknown data in time and probe and update it.
2.2 The process resources are sorted according to the evaluation criterion, and the sorted result is kept in a queue, so that the process that is optimal under the evaluation criterion is easily obtained and tasks are preferentially allocated to it. The evaluation criterion takes the average unit energy consumption over the current stages. In particular, processes that need to be probed are placed at the front of the queue. The advantages are that: this solves the problem that the original Spark scheduling simply shuffles the computing resources randomly and does not consider the different energy consumption of different process resources. The energy-saving scheduling method provided by the invention can effectively reduce the energy consumption of a running application, and placing the processes to be probed at the front of the queue allows the policy tables to be expanded dynamically when the physical cluster is expanded.
2.3 A Set_0 set is provided to hold the tasks that need to be probed, and a TaskQue double-ended queue is provided to hold the tasks whose execution time is known, sorted by their execution time on the process. The tasks in the Set_0 set are allocated preferentially, and then the tasks in the TaskQue double-ended queue are allocated alternately from head and tail. The advantages are that: the tasks kept in Set_0 are probed preferentially, so the policy tables can be expanded dynamically when the physical cluster is expanded. Allocating the tasks in the TaskQue double-ended queue alternately from head and tail keeps the execution time assigned to each process balanced, avoiding the energy consumption caused by one node finishing its run and then waiting for the other nodes; this ensures that the energy-saving scheduling method provided by the invention effectively reduces runtime energy consumption.
The advantages are that: the invention provides an energy-consumption-aware energy-saving Spark scheduling method that can effectively reduce the energy cost of big data computing, thereby improving the energy efficiency of the data center and effectively reducing the greenhouse gas emissions caused by the large amount of electric energy consumed by big data computing.
Drawings
FIG. 1 is an algorithmic diagram of a preferred embodiment of the present invention;
FIG. 2 is a flowchart of the overall calculation;
FIG. 3 is a MySQL database design E-R diagram;
fig. 4 is a diagram of WordCount program logic execution.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly in the following with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the basic idea of the invention is: firstly, constructing a big data calculation energy consumption model based on a Spark calculation frame; secondly, establishing an energy consumption relation strategy table and an execution time relation strategy table of the tasks and the computing resources, wherein the two strategy tables together guide task scheduling; thirdly, selecting the computing resources with the optimal evaluation criteria according to the policy table, preferentially distributing computing tasks for the computing resources, and simultaneously ensuring the balanced distribution of parallel computing tasks; fourthly, initializing the decision table through data detection, and updating the energy consumption relation strategy table and the execution time relation strategy table after the phase task is executed. The technical scheme of the invention comprises the following steps:
the method comprises the following steps: energy consumption model construction
Under the Spark computing framework, a big data application (App) generates a job Job_i every time it encounters an action operation; the definition of App is shown in formula (1). App denotes a big data application consisting of m jobs. Each Job_i is divided into multiple stages Stage_ij according to the RDD dependency relationships; the definition of Job_i is shown in formula (2). Job_i denotes the i-th job within App, consisting of n stages. Each Stage_ij is divided into multiple tasks according to the partitions of its last RDD; the definition of Stage_ij is shown in formula (3). Stage_ij denotes the j-th stage of Job_i, consisting of o tasks.
App = {Job_0, Job_1, Job_2, …, Job_{m-1}} (1)
Job_i = {Stage_i0, Stage_i1, Stage_i2, …, Stage_i(n-1)} (2)
Stage_ij = {task^ij_0, task^ij_1, task^ij_2, …, task^ij_{o-1}} (3)
where 0 ≤ i < m, 0 ≤ j < n, i ∈ Z, j ∈ Z.
The computing resources available within the Spark cluster, i.e. the executor processes existing on each worker node, are defined as shown in formula (4). Exe denotes the set of all available executor process resources on the cluster.
Exe = {ex_0, ex_1, ex_2, …, ex_{p-1}} (4)
e^ij_kl is used to represent the energy consumption of task^ij_k running on ex_l, t^ij_kl is used to represent the running time of task^ij_k on ex_l, and x^ij_kl is used to represent the relationship between task^ij_k and the resource process ex_l, defined as shown in formula (5):
when task^ij_k is allocated to ex_l, x^ij_kl = 1, otherwise x^ij_kl = 0 (5)
where 0 ≤ k < o, 0 ≤ l < p, k ∈ Z, l ∈ Z.
From formula (3), formula (4) and formula (5), the energy consumption generated by the j-th stage of Job_i can be obtained as Stage_ij^ec, defined as shown in formula (6):
Stage_ij^ec = Σ_k Σ_l (x^ij_kl · e^ij_kl) (6)
From formula (2) and formula (6), the energy consumption generated by the i-th job of App can be obtained as Job_i^ec, defined as shown in formula (7):
Job_i^ec = Σ_j Stage_ij^ec (7)
From formula (1) and formula (7), the energy consumption generated by the big data application App can be obtained as App^ec, defined as shown in formula (8):
App^ec = Σ_i Job_i^ec = Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (8)
where i ∈ Z, j ∈ Z, k ∈ Z, l ∈ Z.
The aim of the invention is to reduce the runtime energy consumption of the big data application, i.e. App^ec, so the objective function is defined as shown in formula (9):
obj^ec = min App^ec = min Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (9)
where i ∈ Z, j ∈ Z, k ∈ Z, l ∈ Z, and x^ij_kl is the task placement shown in formula (5). An unreasonable task placement will cause obj^ec to rise, which means that running big data applications of the same size in the same cluster will consume more energy. As can be seen from formula (9), how the tasks are placed is decisive for the objective function.
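For illustration only, the following minimal Scala sketch shows how the model of formulas (5)-(8) can be evaluated from a given task placement; the names (EnergyModel, stageEnergy, appEnergy) and data layout are assumptions of this sketch, not part of the patented implementation.

// Minimal sketch of formulas (5)-(8): accumulate App^ec from the placement x and per-task energies e.
object EnergyModel {
  // e((i, j, k, l)): energy e^ij_kl of task k of Stage_ij on process ex_l (joules)
  type EnergyTable = Map[(Int, Int, Int, Int), Double]
  // x((i, j, k, l)) = 1 if task k of Stage_ij is placed on ex_l, else 0   (formula (5))
  type Placement = Map[(Int, Int, Int, Int), Int]

  // Stage_ij^ec = sum_k sum_l x^ij_kl * e^ij_kl                           (formula (6))
  def stageEnergy(i: Int, j: Int, tasks: Int, procs: Int, x: Placement, e: EnergyTable): Double =
    (for (k <- 0 until tasks; l <- 0 until procs)
      yield x.getOrElse((i, j, k, l), 0) * e.getOrElse((i, j, k, l), 0.0)).sum

  // App^ec = sum_i sum_j Stage_ij^ec                                      (formulas (7)-(8))
  def appEnergy(stageTaskCounts: Map[(Int, Int), Int], procs: Int, x: Placement, e: EnergyTable): Double =
    stageTaskCounts.map { case ((i, j), o) => stageEnergy(i, j, o, procs, x, e) }.sum
}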
Step two: policy table initialization
The energy consumption relationship policy table of tasks and computing resources is defined as shown in table (1). Table (1) represents the result of the Cartesian product of the Stage_ij sets and Exe; it stores the historical energy consumption e^ij_kl of any task task^ij_k when it runs on any process ex_l. In the initial state, e^ij_kl has no valid value, i.e. valid data needs to be probed.
The execution time relationship policy table of tasks and computing resources is defined as table (2). Table (2) differs from table (1) in that it keeps the historical execution time t^ij_kl of any task task^ij_k on any process ex_l. In the initial state, t^ij_kl likewise has no valid value and needs to be probed.
Table (1): energy consumption relationship policy table (J) — rows are tasks task^ij_k, columns are processes ex_l, each cell stores e^ij_kl
Table (2): execution time relationship policy table (ms) — rows are tasks task^ij_k, columns are processes ex_l, each cell stores t^ij_kl
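As an illustration only, the two policy tables can be kept as mappings from (task, process) pairs to measured values, with entries that have not yet been probed marked by a sentinel; the sentinel value of -1 and the names PolicyTables, TaskId and needsProbe below are assumptions of this sketch, not taken from the patent.

import scala.collection.mutable

final case class TaskId(job: Int, stage: Int, index: Int)

// Sketch of the energy consumption and execution time relationship policy tables.
class PolicyTables {
  val Unprobed: Double = -1.0 // assumed sentinel for the "initial state" (valid data still to be probed)
  private val energy = mutable.Map.empty[(TaskId, Int), Double].withDefaultValue(Unprobed) // joules
  private val time   = mutable.Map.empty[(TaskId, Int), Double].withDefaultValue(Unprobed) // milliseconds

  def energyOf(t: TaskId, ex: Int): Double = energy((t, ex))
  def timeOf(t: TaskId, ex: Int): Double   = time((t, ex))
  def needsProbe(t: TaskId, ex: Int): Boolean = time((t, ex)) == Unprobed

  // Step five: after a task run finishes, record the measured energy and running time.
  def update(t: TaskId, ex: Int, joules: Double, millis: Double): Unit = {
    energy((t, ex)) = joules
    time((t, ex)) = millis
  }
}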
Step three: evaluating the unit energy consumption of the processes according to the evaluation criterion and sorting them
In Spark, the DAGScheduler encapsulates a batch of stages that can be processed in parallel into a TaskSet and sends it to the TaskScheduler for task scheduling. Within the range of this batch of stages to be processed in parallel, the algorithm defines the evaluation value eva_l of any process ex_l as formula (10):
eva_l = ( Σ_{task_k' ∈ ∪Stage'} e_{k'l} / t_{k'l} ) / Σ|Stage'| (10)
where Stage' denotes a stage that the current DAGScheduler submits to the TaskScheduler for processing, ∪Stage' denotes the union of all the stages the TaskScheduler needs to process, Σ|Stage'| denotes the total number of tasks in all those stages, e_{k'l}/t_{k'l} denotes the unit energy consumption of task_k' of the ∪Stage' set when executed on process ex_l, and the numerator sums this unit energy consumption over all tasks task_k' in the ∪Stage' set. The processes are sorted in ascending order according to this evaluation criterion and the policy tables to form a queue (ExeQue). In particular, when e^ij_kl or t^ij_kl in the policy tables is still in the initialization state for a task on process ex_l, the process ex_l is placed at the front of the queue, and tasks are preferentially allocated to this process.
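The following sketch, reusing the hypothetical PolicyTables above, shows one way to compute eva_l as the average unit energy consumption e_{k'l}/t_{k'l} over the tasks of the submitted stages and to build ExeQue with un-probed processes at the front; it is an illustration under the stated assumptions, not the patented code.

// Step three sketch: evaluate each process over the batch of stages and sort ascending.
def buildExeQue(batchTasks: Seq[TaskId], processes: Seq[Int], tables: PolicyTables): Seq[Int] = {
  // eva_l = ( sum over tasks of e_{k'l} / t_{k'l} ) / (total number of tasks)   (formula (10))
  def eva(ex: Int): Double =
    batchTasks.map(t => tables.energyOf(t, ex) / tables.timeOf(t, ex)).sum / batchTasks.size

  // Processes with un-probed entries go to the front of the queue so they are probed first.
  val (unprobed, probed) = processes.partition(ex => batchTasks.exists(t => tables.needsProbe(t, ex)))
  unprobed ++ probed.sortBy(eva)
}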
Step four: preferentially allocating tasks to the process that is optimal under the evaluation criterion
The currently optimal process ex_l can be obtained from step three. According to the execution time policy table, the tasks are sorted in ascending order of execution time t^ij_kl to form a double-ended queue (TaskQue). In particular, when t^ij_kl is in the initialization state, the task needs to be put separately into Set_0 and probed preferentially.
If Set_0 ≠ ∅, the tasks in Set_0 are preferentially allocated to ex_l until the resources of ex_l are exhausted or Set_0 = ∅; if the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues.
If Set_0 = ∅, i.e. the data of the current decision table either needs no probing or has been fully allocated, tasks are taken alternately from the head and the tail of TaskQue and allocated to ex_l until the resources of ex_l are exhausted or the TaskQue queue is empty. If the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues. If the TaskQue queue is empty, all task allocation has ended.
Allocating tasks alternately keeps the total execution time allocated to each process as balanced as possible, avoiding the situation where one machine runs for too long while the other machines stand by and cause unnecessary energy consumption.
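A compact sketch of step four follows: tasks whose execution time on the chosen process is unknown go into Set_0 and are assigned first for probing, and the remaining tasks are placed in a double-ended queue sorted by execution time and assigned alternately from head and tail so that each process receives a balanced total execution time. The names assignOnProcess and freeSlots are illustrative assumptions.

import scala.collection.mutable.ArrayDeque

// Step four sketch: returns the (task, process) assignments made on process `ex`.
def assignOnProcess(ex: Int, freeSlots: Int, stageTasks: Seq[TaskId], tables: PolicyTables): Seq[(TaskId, Int)] = {
  val (set0, known) = stageTasks.partition(t => tables.needsProbe(t, ex))  // Set_0: tasks to probe first
  val taskQue = ArrayDeque.from(known.sortBy(t => tables.timeOf(t, ex)))   // ascending execution time
  val assigned = Seq.newBuilder[(TaskId, Int)]
  var slots = freeSlots
  var fromHead = true

  for (t <- set0 if slots > 0) { assigned += (t -> ex); slots -= 1 }       // probe Set_0 preferentially
  while (taskQue.nonEmpty && slots > 0) {                                  // then alternate head / tail
    val t = if (fromHead) taskQue.removeHead() else taskQue.removeLast()
    assigned += (t -> ex); slots -= 1; fromHead = !fromHead
  }
  assigned.result()   // leftover tasks are handed to the next process taken from ExeQue
}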
Step five: energy consumption aware task scheduling policy
After a task run finishes, the algorithm records the energy consumption and running time of that run, updates the energy consumption relationship policy table and the execution time relationship policy table, and provides a decision basis for the next run of the same task. The priority probing of entries still in the initialization state in step four ensures that, as long as unknown entries exist in the policy tables, they will be probed, so the decision data remains complete. The algorithm cooperates with the Spark DAGScheduler to complete the computation of the Spark big data application, and the energy consumption generated by the big data application is obtained through the energy consumption model provided in step one.
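As a usage illustration of the hypothetical PolicyTables sketch above, step five then amounts to writing the measured values back after each run so that the next run of the same task can be scheduled from real data; the task and measurement values below are invented for the example.

// Step five sketch: record the measured energy (J) and running time (ms) after a task finishes.
val tables = new PolicyTables
val t = TaskId(job = 0, stage = 1, index = 2)            // illustrative task^01_2
tables.update(t, ex = 0, joules = 6.0, millis = 1200.0)  // assumed example measurements
// Subsequent scheduling of the same application reads these values instead of probing again.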
The implementation of the invention comprises an energy consumption evaluation module and a scheduling module. After the application program finishes executing, the energy consumption evaluation module obtains the program's running energy consumption according to the energy consumption model and also takes on the function of updating the policy tables. The scheduling module performs task scheduling according to the energy-consumption-aware scheduling algorithm.
1. Energy consumption evaluation module
The energy consumption evaluation module has the function of calculating the execution time of each task thread on a worker node; specifically, the running time of a task on a process is obtained by analyzing the monitoring information provided by Spark, i.e. the execution time t^ij_kl of any task task^ij_k on any process ex_l can be obtained.
The energy consumption evaluation module also has the function of calculating the energy consumption generated by each task thread on the worker node. The specific method is as follows: the CPU usage (U_CPU) and memory usage (U_memory) of process ex_l at runtime are obtained through an operating system resource monitoring program.
P = C_0 + C_1 × U_CPU + C_2 × U_memory (11)
Based on the system-utilization energy consumption power model [7], shown as formula (11), the instantaneous power consumption P_exe of the process while running can be obtained.
The energy consumption P_ec of process ex_l is obtained from formula (12); P_ec accurately reflects the energy consumption generated by the load on process ex_l. The execution time of a task and the energy consumption of that task are taken to be uniformly related, that is, the longer a task's execution time, the greater the energy consumption it generates.
From formula (13), the energy consumption e^ij_kl of task^ij_k executed on any process ex_l can be obtained:
e^ij_kl = P_ec × t^ij_kl / Σ_{k'} t_{k'l} (13)
where Σ_{k'} t_{k'l} denotes the sum of the execution times of all the tasks on process ex_l.
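For clarity, the following sketch shows the two calculations performed by the energy consumption evaluation module: the instantaneous power of formula (11) from CPU and memory utilisation, and the apportionment of the process energy to its tasks in proportion to execution time as in formula (13) as reconstructed above. The function names and coefficient handling are assumptions for illustration; in the invention the utilisation values come from an operating system resource monitoring program.

// Formula (11): instantaneous power from CPU and memory utilisation.
final case class PowerModel(c0: Double, c1: Double, c2: Double) {
  def power(uCpu: Double, uMemory: Double): Double = c0 + c1 * uCpu + c2 * uMemory
}

// Formula (13) sketch: share the process energy pEc among its tasks
// in proportion to each task's execution time on that process.
def taskEnergies(pEc: Double, taskTimesMs: Map[TaskId, Double]): Map[TaskId, Double] = {
  val totalMs = taskTimesMs.values.sum                             // sum of t_{k'l} on ex_l
  taskTimesMs.map { case (t, ms) => t -> (pEc * ms / totalMs) }    // e^ij_kl
}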
The energy consumption evaluation module records the execution time of each task on each process and the energy consumption it generates in a MySQL database, so that the scheduling module can generate decisions from these records and schedule tasks. The data table design is shown in FIG. 3. Finally, when the program finishes executing, the energy consumption evaluation module obtains the program's running energy consumption according to the energy consumption model, formula (8).
2. Scheduling module
The scheduling module modifies the implementation class TaskSchedulerImpl of org.apache.spark.scheduler. The granularity of the original scheduling strategy is changed from stage to task, and tasks are sorted by execution time. Instead of shuffling the process resources as in the original implementation, the process resources are sorted according to the evaluation criterion, and instead of allocating tasks by locality level, the optimal process is allocated according to execution time. Overall, as shown in fig. 1, the Spark energy consumption perception algorithm schedules tasks on the basis of the decision tables.
The invention is described in detail below by taking WordCount as an example:
the WordCount program is a word statistics program for simple statistics. The main function of WordCount is to count the frequency of each word in the text. The program logic executes the graph shown in FIG. 4. The logic execution process mainly goes through the following conversion of RDD:
1) Reading data from the HDFS, generating RDD1: textFile.
2) RDD1 is converted into RDD2 by a transformation operation: flatMap. The text is split into words by spaces, and it is easy to see that RDD2 has a narrow dependency on RDD1.
3) RDD2 is converted into RDD3 by a transformation operation: map. Each word is converted into a key-value pair <word, 1>, and it is easy to see that RDD3 has a narrow dependency on RDD2.
4) RDD3 is converted into RDD4 by a transformation operation: reduceByKey. The words are shuffled and the values of identical words are added; it is easy to see that RDD4 has a wide dependency on RDD3.
5) RDD4 is converted into RDD5 by an action operation: saveAsTextFile. The action operation triggers the Job submission and the computation result is stored in HDFS; it is easy to see that RDD5 has a narrow dependency on RDD4.
The WordCount program contains only one job (Job_0). The Spark DAGScheduler divides the job into 2 stages (Stage_00, Stage_01) according to the dependency relationships between the RDDs: RDD4 and RDD5 are divided into Stage_00, and RDD1, RDD2 and RDD3 are divided into Stage_01. The number of tasks in a stage is determined by the number of partitions of the last RDD in that stage: Stage_00 comprises 2 tasks (task^00_0, task^00_1) and Stage_01 comprises 3 tasks (task^01_0, task^01_1, task^01_2).
Assume a Spark cluster in which the currently available computing resources are Exe = {ex_0, ex_1, ex_2, ex_3}, and each computing resource occupies two CPU cores, i.e. can execute two tasks. The known energy consumption relationship policy table is shown in table (3), and the known execution time relationship policy table is shown in table (4).
Table (3): known energy consumption relationship policy table (J)
Table (4): known execution time relationship policy table (ms)
The scheduling module steps are described as follows:
submitting Stage 01 Phases
1) Submitting execution sequencing to generate ExeQue queue
Evaluation { eva) of executor from equation (10) 0 =0.84,eva 1 =2,eva 2 =1.17,eva 3 If =1.32, exeQue = { ex = 0 ,ex 2 ,ex 3 ,ex 1 }
2) Taking out the optimal process, and sequencing all tasks to TaskQue double-ended queue
Fetch Process ex 0 According to in ex 0 Perform time-up sequencing, then
3) Assigning tasks
First distributionTo ex 0 I.e. byRedistributionTo ex 0 I.e. byTask not allocated, ex 0 The resources are exhausted, and a suboptimal process ex is taken out 2 . DispensingTo ex 2 I.e. by
4) After the task is executed, the energy consumption evaluation module calculates the execution time and energy consumption of the task and records the data into the MySQL database.
Submitting the Stage_00 stage:
1) Sort the executors to generate the ExeQue queue.
The evaluation values of the executors obtained from formula (10) are {eva_0 = 0.5, eva_1 = 3, eva_2 = 0.85, eva_3 = 1.83}, so ExeQue = {ex_0, ex_2, ex_3, ex_1}.
2) Take out the optimal process and sort all tasks into the TaskQue double-ended queue.
The process ex_0 is taken out, and the tasks of Stage_00 are sorted in ascending order of their execution time on ex_0 to form TaskQue.
3) Assign tasks.
First one task is allocated to ex_0, then the other task is allocated to ex_0.
4) After the task is executed, the energy consumption evaluation module calculates the execution time and energy consumption of the task and records the data into the MySQL database.
The example comprises only two stages; after the Spark DAGScheduler completes the computation of all stages, the job computation is finished. In this example the program contains only one job, so the program ends, and the energy consumption evaluation module gives the energy consumption value and running time of the program.
The task scheduling results are as follows:
In the Stage_00 stage, task^00_0 and task^00_1 are allocated to process ex_0.
In the Stage_01 stage, two tasks are allocated to process ex_0 and one task is allocated to process ex_2.
From tables (3) and (4) and formula (8), App^ec = (1 + 3) + (3 + 6 + 10) = 23 (J) can be obtained. The above example is to be construed as merely illustrative and not limitative of the remainder of the disclosure in any way whatsoever. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall within the scope of the invention defined by the claims.

Claims (7)

1. A Spark energy-saving scheduling method based on energy consumption perception is characterized by comprising the following steps:
firstly, constructing a big data calculation energy consumption model under a Spark calculation framework;
secondly, establishing a policy table of energy consumption and execution time relation between the tasks and the computing resources based on the model, and guiding and optimizing Spark task scheduling through the policy table;
and finally, effectively reducing the total computing energy consumption on the premise of ensuring parallel computing efficiency.
2. The Spark energy-saving scheduling method based on energy consumption perception according to claim 1, wherein the big data computing energy consumption model is: the energy consumption App^ec generated by the big data application App is defined as shown in formula (8):
App^ec = Σ_i Job_i^ec = Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (8)
and the objective function for the big data computing energy consumption is defined as shown in formula (9):
obj^ec = min App^ec = min Σ_i Σ_j Σ_k Σ_l (x^ij_kl · e^ij_kl) (9)
where i ∈ Z, j ∈ Z, k ∈ Z, l ∈ Z; x^ij_kl represents the task placement, and x^ij_kl represents the relationship between task^ij_k and the resource process ex_l, defined as shown in formula (5): when task^ij_k is allocated to ex_l, x^ij_kl = 1, otherwise x^ij_kl = 0.
3. The Spark energy-saving scheduling method based on energy consumption perception according to claim 1 or 2, wherein the energy consumption relationship policy table of tasks and computing resources represents the result of the Cartesian product of the Stage_ij sets and Exe, and stores the historical energy consumption e^ij_kl of any task task^ij_k when it runs on any process ex_l; in the initial state, e^ij_kl has no valid value;
the execution time relationship policy table of tasks and computing resources stores the execution time t^ij_kl of any task task^ij_k on any process ex_l; in the initial state, t^ij_kl likewise has no valid value.
4. The Spark energy-saving scheduling method based on energy consumption perception according to claim 3, wherein the processes are sorted according to the energy consumption relationship policy table and the execution time relationship policy table to form a process queue ExeQue; when e^ij_kl or t^ij_kl in the policy tables, representing the energy consumption or execution time of task^ij_k on process ex_l, is in the initialization state, the process ex_l needs to be placed at the front of the queue, and tasks are preferentially allocated to this process.
5. The Spark energy-saving scheduling method based on energy consumption perception according to claim 4, wherein the currently optimal process ex_l at the head of the queue is obtained from ExeQue; according to the execution times t^ij_kl in the execution time relationship policy table, the tasks whose t^ij_kl is still in the initialization state form a set Set_0, and the remaining tasks are sorted in ascending order of their execution time t^ij_kl on ex_l to form a double-ended queue TaskQue;
if Set_0 ≠ ∅, the tasks in Set_0 are preferentially allocated to ex_l until the resources of ex_l are exhausted or Set_0 = ∅; if the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues;
if Set_0 = ∅, i.e. the data of the current decision table either needs no probing or has been fully allocated, tasks are taken alternately from the head and the tail of TaskQue and allocated to ex_l until the resources of ex_l are exhausted or the TaskQue queue is empty; if the resources are exhausted, the suboptimal process ex_l' is taken out of ExeQue and task allocation continues; if the TaskQue queue is empty, all task allocation has ended.
6. The Spark energy-saving scheduling method based on energy consumption perception according to claim 5, wherein when the execution time t^ij_kl of a task is in the initialization state, the task needs to be put separately into Set_0 and probed preferentially.
7. The Spark energy-saving scheduling method based on energy consumption perception according to claim 6, wherein after a task run finishes, the energy consumption and running time of the current run are recorded, and the energy consumption relationship policy table and the execution time relationship policy table are updated to provide a decision basis for the next run of the same task.
CN201710452338.4A 2017-06-15 2017-06-15 Spark energy-saving scheduling method based on energy consumption perception Active CN107704069B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710452338.4A CN107704069B (en) 2017-06-15 2017-06-15 Spark energy-saving scheduling method based on energy consumption perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710452338.4A CN107704069B (en) 2017-06-15 2017-06-15 Spark energy-saving scheduling method based on energy consumption perception

Publications (2)

Publication Number Publication Date
CN107704069A true CN107704069A (en) 2018-02-16
CN107704069B CN107704069B (en) 2020-08-04

Family

ID=61170182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710452338.4A Active CN107704069B (en) 2017-06-15 2017-06-15 Spark energy-saving scheduling method based on energy consumption perception

Country Status (1)

Country Link
CN (1) CN107704069B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733195A (en) * 2018-05-29 2018-11-02 郑州易通众联电子科技有限公司 Computer operation method and device based on equipment operational energy efficiency
CN109582119A (en) * 2018-11-28 2019-04-05 重庆邮电大学 The double-deck Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN109614210A (en) * 2018-11-28 2019-04-12 重庆邮电大学 Storm big data energy-saving scheduling method based on energy consumption perception
CN109857084A (en) * 2019-01-18 2019-06-07 湖南大学 A kind of high-performing car electronic Dynamic dispatching algorithm of energy consumption perception
CN110008013A (en) * 2019-03-28 2019-07-12 东南大学 A kind of Spark method for allocating tasks minimizing operation completion date
CN110928666A (en) * 2019-12-09 2020-03-27 湖南大学 Method and system for optimizing task parallelism based on memory in Spark environment
CN111061565A (en) * 2019-12-12 2020-04-24 湖南大学 Two-stage pipeline task scheduling method and system in Spark environment
CN112532464A (en) * 2021-02-08 2021-03-19 中国人民解放军国防科技大学 Data distributed processing acceleration method and system across multiple data centers
CN115825736A (en) * 2023-02-09 2023-03-21 苏州洪昇新能源科技有限公司 Energy consumption comprehensive test method and system for energy-saving equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102533A (en) * 2014-06-17 2014-10-15 华中科技大学 Bandwidth aware based Hadoop scheduling method and system
CN106293933A (en) * 2015-12-29 2017-01-04 北京典赞科技有限公司 A kind of cluster resource configuration supporting much data Computational frames and dispatching method
CN106371924A (en) * 2016-08-29 2017-02-01 东南大学 Task scheduling method for maximizing MapReduce cluster energy consumption

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102533A (en) * 2014-06-17 2014-10-15 华中科技大学 Bandwidth aware based Hadoop scheduling method and system
CN106293933A (en) * 2015-12-29 2017-01-04 北京典赞科技有限公司 A kind of cluster resource configuration supporting much data Computational frames and dispatching method
CN106371924A (en) * 2016-08-29 2017-02-01 东南大学 Task scheduling method for maximizing MapReduce cluster energy consumption

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
RALPH HINTEMANN et al.: "Energy efficiency of data centers - A system-oriented analysis of current development trends", ELECTRONICS GOES GREEN *
ZHANG YANLU: "A low-energy resource scheduling strategy for cloud computing data centers based on a genetic algorithm", China Masters' Theses Full-text Database (Electronic Journal), Information Science and Technology Series *
LI XUEJUN et al.: "Energy-consumption-aware task scheduling algorithm in cloud workflow systems", Pattern Recognition and Artificial Intelligence *
YANG ZHIWEI et al.: "Adaptive task scheduling strategy for heterogeneous Spark clusters", Computer Engineering *
XUE SHENGJUN et al.: "Energy-consumption-aware fairness-improving resource scheduling strategy in cloud environments", Journal of Computer Applications *
HUANG QINGJIA: "Research on energy-cost-aware resource scheduling mechanisms for cloud data centers", China Doctoral Dissertations Full-text Database (Electronic Journal), Information Science and Technology Series *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733195A (en) * 2018-05-29 2018-11-02 郑州易通众联电子科技有限公司 Computer operation method and device based on equipment operational energy efficiency
CN109582119B (en) * 2018-11-28 2022-07-12 重庆邮电大学 Double-layer Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN109582119A (en) * 2018-11-28 2019-04-05 重庆邮电大学 The double-deck Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN109614210A (en) * 2018-11-28 2019-04-12 重庆邮电大学 Storm big data energy-saving scheduling method based on energy consumption perception
CN109614210B (en) * 2018-11-28 2022-11-04 重庆邮电大学 Storm big data energy-saving scheduling method based on energy consumption perception
CN109857084A (en) * 2019-01-18 2019-06-07 湖南大学 A kind of high-performing car electronic Dynamic dispatching algorithm of energy consumption perception
CN110008013A (en) * 2019-03-28 2019-07-12 东南大学 A kind of Spark method for allocating tasks minimizing operation completion date
CN110008013B (en) * 2019-03-28 2023-08-04 东南大学 Spark task allocation method for minimizing job completion time
CN110928666B (en) * 2019-12-09 2022-03-22 湖南大学 Method and system for optimizing task parallelism based on memory in Spark environment
CN110928666A (en) * 2019-12-09 2020-03-27 湖南大学 Method and system for optimizing task parallelism based on memory in Spark environment
CN111061565A (en) * 2019-12-12 2020-04-24 湖南大学 Two-stage pipeline task scheduling method and system in Spark environment
CN111061565B (en) * 2019-12-12 2023-08-25 湖南大学 Two-section pipeline task scheduling method and system in Spark environment
CN112532464A (en) * 2021-02-08 2021-03-19 中国人民解放军国防科技大学 Data distributed processing acceleration method and system across multiple data centers
CN115825736A (en) * 2023-02-09 2023-03-21 苏州洪昇新能源科技有限公司 Energy consumption comprehensive test method and system for energy-saving equipment
CN115825736B (en) * 2023-02-09 2024-01-19 福建明泰嘉讯信息技术有限公司 Comprehensive energy consumption testing method and system for energy-saving equipment

Also Published As

Publication number Publication date
CN107704069B (en) 2020-08-04

Similar Documents

Publication Publication Date Title
CN107704069B (en) Spark energy-saving scheduling method based on energy consumption perception
Khorasani et al. Scalable simd-efficient graph processing on gpus
CN105446816B (en) A kind of energy optimization dispatching method towards heterogeneous platform
CN108572873A (en) A kind of load-balancing method and device solving the problems, such as Spark data skews
CN105373432B (en) A kind of cloud computing resource scheduling method based on virtual resource status predication
CN103500123B (en) Parallel computation dispatching method in isomerous environment
US11816509B2 (en) Workload placement for virtual GPU enabled systems
CN105718479A (en) Execution strategy generation method and device under cross-IDC (Internet Data Center) big data processing architecture
US8527988B1 (en) Proximity mapping of virtual-machine threads to processors
CN108132840B (en) Resource scheduling method and device in distributed system
CN108427602B (en) Distributed computing task cooperative scheduling method and device
Wang et al. Task scheduling algorithm based on improved Min-Min algorithm in cloud computing environment
Yu et al. Fluid: Resource-aware hyperparameter tuning engine
CN114281528A (en) Energy-saving scheduling method and system based on deep reinforcement learning and heterogeneous Spark cluster
Hu et al. Improved heuristic job scheduling method to enhance throughput for big data analytics
CN111309472A (en) Online virtual resource allocation method based on virtual machine pre-deployment
Hu et al. FlowTime: Dynamic scheduling of deadline-aware workflows and ad-hoc jobs
CN116932201A (en) Multi-resource sharing scheduling method for deep learning training task
CN110084507B (en) Scientific workflow scheduling optimization method based on hierarchical perception in cloud computing environment
CN115981843A (en) Task scheduling method and device in cloud-edge cooperative power system and computer equipment
Toporkov et al. Preference-based fair resource sharing and scheduling optimization in Grid VOs
Babu et al. Energy efficient scheduling algorithm for cloud computing systems based on prediction model
Shu-Jun et al. Optimization and research of hadoop platform based on fifo scheduler
Iglesias et al. A methodology for online consolidation of tasks through more accurate resource estimations
Dai et al. Improved greedy strategy for cloud computing resources scheduling

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230823

Address after: Room 801, 85 Kefeng Road, Huangpu District, Guangzhou City, Guangdong Province

Patentee after: Guangzhou Dayu Chuangfu Technology Co.,Ltd.

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS