CN106371924A - Task scheduling method for minimizing MapReduce cluster energy consumption

Info

Publication number: CN106371924A
Application number: CN201610785554.6A
Authority: CN
Inventors: 李小平; 王佳; 陈龙; 陈复超
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2016-08-29
Filing date: 2016-08-29
Publication date: 2017-02-01
Anticipated expiration: 2036-08-29
Also published as: CN106371924B

Abstract

The invention discloses a task scheduling method for minimizing MapReduce cluster energy consumption. The method comprises the following stages. Preprocessing stage: a fuzzy logic control system is constructed on each server in order to dynamically update the number of work slots on the server. Solution stage: jobs and tasks are sequenced according to deadline and data localization constraints, so that more jobs can be finished within their deadlines and the total running time of the cluster is shortened. Update stage: in each heartbeat period the task sequence is updated according to the task execution status, and the cluster environment is updated in real time according to the resource utilization rates of the servers and the fuzzy logic control system. By shortening the total running time of the cluster, the method lowers its energy consumption. The method has high application value and good prospects in the field of green computing.

Description

Task scheduling method for minimizing MapReduce cluster energy consumption

Technical Field

The invention relates to a task scheduling method for minimizing MapReduce cluster energy consumption, and belongs to the fields of cloud computing application, computer technology and green computing.

Background

In recent years, data of various forms in fields such as society, the economy and networks has grown exponentially. Every enterprise, particularly IT enterprises, generates enormous amounts of data each year. According to a report by IDC (International Data Corporation)^[1], nearly 40 ZB of data will be processed in 2020, with the share of machine-generated data rising from 10% in 2005 to 40% in 2020. The energy consumption of data centers will therefore be very considerable. According to the literature^[2], the total energy consumption of data centers worldwide increases at a rate of 15% per year. Mining big data requires a large number of BDA Apps (Big Data Analytics applications) to process the data, and the jobs these applications submit to a data center computing cluster often have different time parameters such as start time, execution time and deadline. How these jobs are scheduled directly affects the energy consumption of the data center. Task scheduling therefore plays a crucial role in saving energy.

To speed up computation on large-scale data sets, the MapReduce programming model^[3] emerged in the field of cloud computing and was rapidly and widely adopted. The model uses two simple steps, Map and Reduce, to achieve a high degree of parallelism between data operations. For the problem described above, the data to be processed by the BDA Apps can be partitioned under this model and handled by parallel tasks. The task scheduling method provided by the invention is mainly applied to MapReduce computing clusters in a cloud computing environment.

Many researchers have proposed task scheduling approaches for MapReduce, but all of them have drawbacks of varying degree in controlling energy consumption. Lang et al.^[4] proposed the CS (Covering Set) method, which takes the data localization constraint into account: when system utilization drops, a covering set is constructed to cover the servers holding useful data, and servers outside the covering set are shut down, thereby reducing cluster energy consumption. Leverich et al.^[5] proposed AIS (All-In Strategy), which shuts down all servers in the cluster only when all submitted jobs are fully completed. When the running time of one server is far greater than that of the other servers, this undoubtedly increases the energy wasted by the cluster. Three factors affect the performance of jobs on MapReduce: data localization, resource utilization and job deadline. Most existing methods consider only one or two of them, whereas in practice all three should be considered.

[1] Data to grow more quickly says IDC's Digital Universe study. http://www.computerweekly.com/news/2240174381/Data-to-grow-more-quickly-says-IDCs-Digital-Universe-study.

[2] Jonathan G. Koomey. Worldwide electricity used in data centers. Environmental Research Letters, 3, 2008.

[3] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. of the 6th USENIX Symposium on Operating System Design and Implementation, 2004, pp. 137–150.

[4] Willis Lang and Jignesh M. Patel. Energy management for MapReduce clusters. Proceedings of the VLDB Endowment, 3(1-2):129–139, 2010.

[5] Jacob Leverich and Christos Kozyrakis. On the energy (in)efficiency of Hadoop clusters. ACM SIGOPS Operating Systems Review, 44(1):61–65, 2010.

Disclosure of Invention

Purpose of the invention: to overcome the defects of the prior art, the invention considers the task scheduling problem in cloud computing mainly from the energy-saving perspective, embodies the concept of green computing, and provides a task scheduling method for minimizing MapReduce cluster energy consumption.

Technical solution: to achieve the above purpose, the invention adopts the following technical solution:

A task scheduling method for minimizing MapReduce cluster energy consumption comprises the following stages:

A. Preprocessing stage: resource utilization rates, comprising the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate, are collected; the frequency distribution of each is obtained, and the membership function of each is determined from its frequency distribution. During task scheduling, the fuzzy control system constructed on each server is used to dynamically change the number of work slots of the server and to optimize the task scheduling sequence, so that the number of work slots on the server is determined dynamically.

B. Solution stage: first, the minimum number of work slots required by each job is calculated from its number of running tasks, number of waiting tasks and number of completed tasks, and a job priority queue is established according to this minimum number of work slots. Then the jobs are taken from the priority queue in order, and the tasks with long execution time and the tasks with short execution time of each job are selected alternately and enqueued to build a task queue, until the length of the task queue equals the number of idle work slots on the cluster. Finally, a task-to-work-slot association list is established: for each task, the idle work slot with the minimum data localization cost is preferentially allocated, and the task is executed.

C. Update stage: in each heartbeat period, the task sequence is updated in real time according to the current execution status of the tasks. Given that a fuzzy control system is installed on each server, whether the number of work slots on a server needs to be changed is determined dynamically from the current CPU utilization rate, memory utilization rate and network bandwidth utilization rate of that server. Changing the number of work slots improves server utilization, so that all jobs are completed as early as possible and the energy consumption of the whole cluster is reduced.

The preprocessing stage comprises the following specific steps:

A1. Record real-time data on the CPU utilization rate, memory utilization rate and network bandwidth utilization rate of the server to form a data set.

A2. Sample and analyze the data set from step A1 to obtain frequency distribution maps of the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate, and determine the membership functions of the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate of each server according to the frequency distribution maps and a mathematical model.

A3. Construct fuzzy rules using experience and expert knowledge.

A4. Construct a fuzzy logic control system on each server from the determined membership functions of the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate and from the constructed fuzzy rules. During task scheduling, the fuzzy logic control system updates the environmental state of the server in real time and dynamically changes the number of work slots on the server.
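Steps A2 and A4 rely on membership functions that map a utilization rate to degrees of "low", "medium" and "high". The text does not give their shape, so the following is only a minimal sketch that assumes piecewise-linear (shoulder and triangular) functions; the function names and the breakpoints 20, 50 and 80 are hypothetical and would in practice be fitted to the frequency distribution maps.

```python
def triangular(x, left, peak, right):
    """Membership degree of x in a triangular fuzzy set peaking at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzify(u, a=20.0, m=50.0, c=80.0):
    """Return (low, medium, high) membership degrees for utilization u in percent.

    The breakpoints a, m, c are hypothetical values, not taken from the patent.
    """
    if u <= a:
        low = 1.0
    elif u >= m:
        low = 0.0
    else:
        low = (m - u) / (m - a)          # falling shoulder between a and m

    medium = triangular(u, a, m, c)

    if u <= m:
        high = 0.0
    elif u >= c:
        high = 1.0
    else:
        high = (u - m) / (c - m)         # rising shoulder between m and c

    return (low, medium, high)
```

One such triple would be computed per server for each of the three resources and passed to the fuzzy rules.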

The solution stage comprises the following steps:

B1. Calculate the minimum number of work slots required by each job from its number of running tasks, number of waiting tasks and number of completed tasks, and sort the jobs in descending order of the required minimum number of work slots to obtain the job priority queue.

B2. In each heartbeat period, take the jobs from the job priority queue in order, alternately select the tasks with long execution time and the tasks with short execution time of each job, and add them to the task queue in turn until the length of the task queue equals the number of idle work slots on the cluster.

B3. Each task has a local-node work slot list, a local-rack work slot list and a remote work slot list, which represent node-level, rack-level and remote-level data localization, respectively. The work slot allocated to each task is determined through a matrix of data localization costs, which yields an association list, and the task is executed. Here data localization means that the node to which a task is assigned is close to the node holding the input data the task needs to process. The degree of data localization is divided into three levels, from high to low: node level, rack level and remote level.
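The three localization levels of step B3 can be sketched as a cost function. This is an illustration under stated assumptions, not the claimed implementation: the costs 1, 3 and 5 are the values used in the embodiment's cost matrix, the rack layout follows the embodiment (S1 to S3 on R1, S4 and S5 on R2), and the function names and the placement of task input data are hypothetical.

```python
NODE_COST, RACK_COST, REMOTE_COST = 1, 3, 5   # values used in the embodiment


def locality_cost(slot_server, data_server, rack_of):
    """Cost of running a task in a slot on slot_server when its data is on data_server."""
    if slot_server == data_server:
        return NODE_COST                       # node-level localization
    if rack_of[slot_server] == rack_of[data_server]:
        return RACK_COST                       # rack-level localization
    return REMOTE_COST                         # remote-level localization


def cost_matrix(slots, tasks, rack_of):
    """Rows are work slots, columns are tasks.

    slots: list of (slot_id, server); tasks: list of (task_id, server_holding_input).
    """
    return [
        [locality_cost(slot_srv, data_srv, rack_of) for _, data_srv in tasks]
        for _, slot_srv in slots
    ]


rack_of = {"S1": "R1", "S2": "R1", "S3": "R1", "S4": "R2", "S5": "R2"}
# Hypothetical data placement, for illustration only.
example = cost_matrix(
    [("L21", "S2"), ("L40", "S4")],
    [("A11", "S2"), ("A21", "S5")],
    rack_of,
)
```

With this layout, a slot on S2 costs 1 for data on S2 and 5 for data on S5, while a slot on S4 costs 3 for data on S5 because both servers share rack R2.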

The update stage comprises the following steps:

C1. For all currently unfinished jobs, calculate the required minimum number of work slots according to the formula used in B1, and sort the jobs in descending order of the required minimum number of work slots.

C2. Use the current CPU utilization rate, memory utilization rate and network bandwidth utilization rate of the server as the input values of the fuzzy control system, and obtain the corresponding output value from the membership functions and the fuzzy rules that are satisfied, namely whether the number of work slots of the current server needs to be changed.

The number of work slots that a job needs to be allocated is calculated as follows:

\xi_{i}^{A} = \frac{\sum_{j=1}^{|A_{i}^{r}|} (p_{ij} - p_{ij}^{e}) + |A_{i}^{w}| \times \frac{\sum_{k=1}^{|A_{i}^{c}|} p_{ik}}{|A_{i}^{c}|}}{D_{i} - t - K} - |A_{i}^{r}|

where ξ_i^A denotes the number of work slots that job i needs to be allocated, |A_i^r| denotes the number of tasks of job i that are running, p_ij denotes the execution time of task A_ij, A_ij denotes the j-th task of job i, p_ij^e denotes the execution time already elapsed for task A_ij, A_i^w denotes the set of waiting tasks of job i, A_i^c denotes the set of completed tasks of job i, D_i denotes the deadline of job i, t denotes the current system time, and K denotes K time units.
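The formula above can be evaluated directly. The sketch below is a minimal illustration; the function names, argument layout and sample numbers are hypothetical, and raising an error when no task has completed yet or when the deadline margin is not positive is an assumption, since the text does not say how those cases are handled.

```python
def required_slots(running, waiting_count, completed_times, deadline, now, k):
    """Evaluate xi_i^A for one job.

    running:         list of (p_ij, p_ij_e) pairs for the running tasks
    waiting_count:   |A_i^w|, number of waiting tasks
    completed_times: execution times p_ik of the completed tasks
    deadline, now, k: D_i, t and K from the formula
    """
    if not completed_times:
        raise ValueError("average task time is undefined without completed tasks")
    margin = deadline - now - k
    if margin <= 0:
        raise ValueError("no time left before the deadline")

    remaining_running = sum(p - elapsed for p, elapsed in running)
    average_completed = sum(completed_times) / len(completed_times)
    remaining_waiting = waiting_count * average_completed
    return (remaining_running + remaining_waiting) / margin - len(running)


def job_priority_queue(xi_by_job):
    """Sort job identifiers in descending order of required work slots (step B1)."""
    return sorted(xi_by_job, key=xi_by_job.get, reverse=True)


# Hypothetical job: two running tasks, three waiting, two completed.
xi = required_slots([(10, 4), (8, 2)], 3, [6, 10], deadline=20, now=10, k=1)
```

Here the remaining work is 12 + 3 × 8 = 36 time units over a margin of 9, so four slots are needed in total and two more than the two already in use.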

The working process comprises the following steps:

S1. Calculate the number of work slots ξ_i^A that each job needs to be allocated, and create the job priority queue.

S2. If the job queue is empty, go to S10. Otherwise, go to S3.

S3. Add the idle work slots of the cluster to the idle work slot list.

S4. According to the running time of the tasks, alternately add long tasks and short tasks to the task list until the task list and the idle work slot list are of equal length.

S5. Construct a cost matrix from the data localization costs of the local-node work slot list, the local-rack work slot list and the remote work slot list, which represent node-level, rack-level and remote-level data localization, respectively. The rows of the cost matrix represent work slots, the columns represent tasks, and each entry is the data localization cost of the corresponding task and work slot.

S6. Allocate work slots to the tasks according to the cost matrix to obtain the work slot association list.

S7. Match the elements of the association list and the task list one to one, add them to the allocation list, allocate the corresponding resources and execute the tasks.

S8. After a task is completed, update the state of the corresponding job, including the number of running tasks, the number of remaining tasks, the number of completed tasks, the task execution times and the like.

S9. Update the cluster environment according to the fuzzy logic control system. The cluster environment mainly comprises the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate; the fuzzy rules determine whether the number of work slots is to be increased, decreased or left unchanged. Go to S1.

S10. Output the result; the algorithm ends.
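Step S4 does not spell out how long and short tasks alternate. One plausible reading, shown here purely as an assumption, is to sort the waiting tasks of a job by execution time and take the longest, then the shortest, then the next longest, and so on; the function names and sample execution times are hypothetical.

```python
def alternate_long_short(times_by_task):
    """Order the tasks of one job as longest, shortest, next longest, next shortest, ..."""
    ordered = sorted(times_by_task, key=times_by_task.get, reverse=True)
    front, back = 0, len(ordered) - 1
    result = []
    take_long = True
    while front <= back:
        if take_long:
            result.append(ordered[front])
            front += 1
        else:
            result.append(ordered[back])
            back -= 1
        take_long = not take_long
    return result


def build_task_list(job_queue, waiting, idle_count):
    """Fill the task list job by job until it is as long as the idle work slot list."""
    task_list = []
    for job in job_queue:
        for task in alternate_long_short(waiting[job]):
            if len(task_list) == idle_count:
                return task_list
            task_list.append(task)
    return task_list


# Hypothetical execution times for the waiting tasks of two jobs.
waiting = {
    "J2": {"A21": 4, "A22": 7, "A23": 2},
    "J1": {"A11": 6, "A12": 3, "A13": 5},
}
task_list = build_task_list(["J2", "J1"], waiting, idle_count=5)
```

With five idle work slots, all three tasks of the higher-priority job J2 are taken first, followed by the longest and the shortest task of J1.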

In step S6, allocating work slots to the tasks according to the cost matrix to obtain the work slot association list comprises the following steps:

S601. Record the minimum value of each column of the cost matrix, and rearrange the columns of the cost matrix in non-decreasing order of their minimum values.

S602. For each column, record its node-level data localization rows, rack-level data localization rows and remote-level data localization rows.

S603. Select the columns of the cost matrix in sequence. If the column number does not exceed the boundary, go to S604. Otherwise, go to S611.

S604. If the set of node-level data localization rows of the column is empty, go to S606. Otherwise, go to S605.

S605. Randomly select a node-level data localization row for the column, and go to S609.

S606. If the set of rack-level data localization rows of the column is empty, go to S608. Otherwise, go to S607.

S607. Randomly select a rack-level data localization row for the column, and go to S609.

S608. Randomly select a remote-level data localization row for the column.

S609. Add the work slot corresponding to the selected row to the association list.

S610. Delete the selected row from the node-level, rack-level and remote-level data localization rows of the other columns, increment the column number, and return to S604.

S611. Output the association list; the algorithm ends.
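Steps S601 to S611 amount to a greedy assignment that serves the best-localized tasks first. The following sketch is one way to express it, assuming the cost values 1, 3 and 5 of the embodiment for the three levels; the function name, the dictionary return type and the optional random generator argument are hypothetical.

```python
import random

LEVELS = (1, 3, 5)   # node-level, rack-level, remote-level costs from the embodiment


def assign_slots(cost, rng=None):
    """Greedy slot assignment; cost[r][c] is the cost of task c in work slot r.

    Returns a dict mapping task (column) index to work slot (row) index.
    """
    rng = rng or random.Random()
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0

    # S601: process columns in non-decreasing order of their minimum cost.
    order = sorted(range(n_cols), key=lambda c: min(cost[r][c] for r in range(n_rows)))

    free_rows = set(range(n_rows))
    association = {}
    for col in order:                      # S603: iterate over the columns
        if not free_rows:
            break
        for level in LEVELS:               # S604 to S608: best non-empty level wins
            candidates = sorted(r for r in free_rows if cost[r][col] == level)
            if candidates:
                row = rng.choice(candidates)
                association[col] = row     # S609: record the chosen work slot
                free_rows.discard(row)     # S610: the row is no longer available
                break
    return association


cost = [
    [1, 5, 3],
    [5, 1, 5],
    [3, 5, 5],
]
result = assign_slots(cost, random.Random(0))
```

In this small matrix every column has a single candidate at its best available level, so the outcome does not depend on the random choice: tasks 0 and 1 get their node-local slots and task 2 falls back to the remaining remote slot.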

In step S9, updating the cluster environment according to the fuzzy logic control system comprises the following steps:

S901. Construct a CPU membership value matrix from the CPU utilization rate of the current cluster environment and the CPU membership function.

S902. Construct a memory membership value matrix from the memory utilization rate of the current cluster environment and the memory membership function.

S903. Construct a network bandwidth membership value matrix from the network bandwidth utilization rate of the current cluster environment and the network bandwidth membership function.

S904. Construct a fuzzy rule matrix and a fuzzy value matrix from the fuzzy rules, the CPU membership value matrix, the memory membership value matrix and the network bandwidth membership value matrix.

S905. Calculate the work slot change value from the fuzzy rule matrix and the fuzzy value matrix by taking the maximum membership value. A value of 1 indicates that one work slot is added to the server, a value of -1 indicates that one work slot is removed from the server, and a value of 0 indicates no change.

S906. The method ends.

Beneficial effects: compared with the prior art, the task scheduling method for minimizing MapReduce cluster energy consumption provided by the invention has the following beneficial effects:

The invention takes the deadline constraint into account when establishing the job priority queue, thereby ensuring that as many jobs as possible are completed before their deadlines. Data localization and resource utilization are both considered in the task sequencing and allocation stage, and these two factors play a key role in controlling energy consumption. Finally, fuzzy logic is used to update the state of each server in real time according to the completion status of the jobs on it, and to determine dynamically whether the number of work slots on the server should be increased, decreased or kept unchanged, so as to improve server utilization. Data localization is particularly important in a data-intensive job scheduling environment, because data migration consumes a large amount of energy. The invention therefore considers the job deadline, resource utilization and data localization together and optimizes the total running time of the servers, thereby reducing cluster energy consumption.

Drawings

Fig. 1 is a schematic diagram of a task scheduling structure of a cloud computing cluster according to the present invention.

FIG. 2 is a schematic diagram of a fuzzy logic system built on a server according to the present invention.

FIG. 3 is a detailed flowchart of a task scheduling method for minimizing energy consumption of a MapReduce cluster according to an embodiment of the present invention.

Detailed Description

The present invention is further illustrated by the following description in conjunction with the accompanying drawings and the specific embodiments, it is to be understood that these examples are given solely for the purpose of illustration and are not intended as a definition of the limits of the invention, since various equivalent modifications will occur to those skilled in the art upon reading the present invention and fall within the limits of the appended claims.

As shown in Fig. 1, a task scheduling method for minimizing MapReduce cluster energy consumption considers data localization, resource utilization and job deadline constraints together. During task scheduling it uses the fuzzy logic control system constructed on each server to dynamically change the number of work slots of the server and optimizes the task scheduling sequence, so that all jobs are completed as early as possible, the total working time of the cluster is reduced and the energy consumption of the cluster is lowered. The method comprises the following stages:

A. Preprocessing stage: a fuzzy logic control system is built on each server to dynamically update the number of work slots on the server.

Resource utilization rates, comprising the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate, are collected, and the frequency distribution of each is obtained. More accurate membership functions for the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate are determined from these frequency distributions, the final membership functions being determined through the frequency distribution maps and a mathematical model. During task scheduling, the fuzzy logic control system constructed on each server is used to dynamically change the number of work slots of the server and to optimize the task scheduling sequence, so that the number of work slots on the server is determined dynamically, which provides a basis for the subsequent improvement of resource utilization.

The preprocessing stage comprises the following specific steps:

A3. Construct fuzzy rules using experience and expert knowledge; some of the fuzzy rules are as follows:

where L, M and H under utilization denote low, medium and high utilization, respectively, and L, M and H under output denote adding one work slot to the server, keeping the number of work slots on the server unchanged, and removing one work slot from the server, respectively.

B. Solution stage: jobs and tasks are sequenced according to deadline and data localization constraints, so that more jobs can be completed within their deadlines and the total running time of the cluster is reduced.

First, the minimum number of work slots required by each job is calculated from its number of running tasks, number of waiting tasks and number of completed tasks, and a job priority queue is established according to this minimum number of work slots. Then the jobs are taken from the priority queue in order, and the tasks with long execution time and the tasks with short execution time of each job are selected alternately and enqueued to build a task queue, until the length of the task queue equals the number of idle work slots on the cluster. Finally, a task-to-work-slot association list is established: for each task, the idle work slot with the minimum data localization cost is preferentially allocated, and the task is executed.

The solution phase comprises the following steps:

\xi_{i}^{A} = \frac{\sum_{j=1}^{|A_{i}^{r}|} (p_{ij} - p_{ij}^{e}) + |A_{i}^{w}| \times \frac{\sum_{k=1}^{|A_{i}^{c}|} p_{ik}}{|A_{i}^{c}|}}{D_{i} - t - K} - |A_{i}^{r}|

B3. To make better use of data localization, each task has a local-node work slot list, a local-rack work slot list and a remote work slot list, which represent node-level, rack-level and remote-level data localization, respectively. The work slot allocated to each task is determined through a matrix of data localization costs, which yields an association list, and the task is executed. Here data localization means that the node to which a task is assigned is close to the node holding the input data the task needs to process. The degree of data localization is divided into three levels, from high to low: node level, rack level and remote level.

C. Update stage: in each heartbeat period, the task sequence is updated according to the execution status of the tasks, and the cluster environment is updated in real time according to the resource utilization rates of the servers and the fuzzy logic control system.

In each heartbeat period, the task sequence is updated in real time according to the current execution status of the tasks, such as the number of completed tasks, the number of tasks currently being processed, the number of tasks not yet allocated and the execution speed of the current job. Given that a fuzzy control system is installed on each server, whether the number of work slots on a server needs to be changed is determined dynamically from the current CPU utilization rate, memory utilization rate and network bandwidth utilization rate of that server. Changing the number of work slots improves server utilization, so that all jobs are completed as early as possible and the energy consumption of the whole cluster is reduced.

The invention considers the job deadline constraint, data localization and resource utilization, and reduces energy consumption by minimizing the total running time of the cluster. It comprises the following three steps: 1) establishing the job priority queue: under the job deadline constraint, the number of work slots that each job needs to be allocated is calculated from its number of completed tasks, number of running tasks and number of tasks waiting to run, and the job priority queue is established; 2) task scheduling based on data localization: the jobs in the job queue are selected in order, the long and short tasks of each job are enqueued in turn, idle work slots with low data localization cost are preferentially allocated to the tasks, a task-to-work-slot association list is obtained, and the tasks are executed; 3) dynamic adjustment of server work slots: the resource utilization of each server, comprising the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate, is recorded in each heartbeat period, and the number of work slots of the server is adjusted dynamically by the fuzzy logic control system. Finally, energy consumption is reduced by reducing the total running time of the cluster. The method has wide application value and good prospects in the field of green computing.

Fig. 1 shows a specific example of the invention, which includes cloud computing server users 11, a job set 12, a cloud computing server cluster 13 and a job scheduler 14. First, a cloud computing server user submits jobs to the cloud platform; the job scheduler of the cloud computing server cluster is then responsible for distributing the tasks of the jobs to the corresponding server racks. Each rack holds several servers, each server has several work slots, and each task of a job occupies one work slot. The fuzzy logic system built on a server is shown schematically in Fig. 2; it comprises fuzzification 21, fuzzy inference 221 performed through fuzzy rules 222, and finally defuzzification 23.

Assume that the set of jobs submitted by users is J = {J₁, J₂, J₃, J₄, J₅}, with five Map tasks per job. The task set is {A₁₀, ..., A₁₄, A₂₀, ..., A₂₄, ..., A₃₀, ..., A₃₄}. The server set of the MapReduce computing cluster is S = {S₁, S₂, S₃, S₄, S₅}; the 5 servers are placed in 2 racks, R = {R₁, R₂}, with {S₁, S₂, S₃} placed on R₁ and {S₄, S₅} placed on R₂. Each server has 5 work slots, and the set of work slots is {L₁₁, ..., L₁₅, L₂₁, ..., L₂₅, L₃₁, ..., L₃₅, ..., L₄₁, ..., L₄₅, L₅₁, ..., L₅₅}; some of the work slots are currently occupied or cannot be used. After the cluster has run for a period of time, three tasks of each job have not yet started executing; the waiting task set is {A₁₁, A₁₂, A₁₃, A₂₁, A₂₂, A₂₃, A₃₁, A₃₂, A₃₃}, and the set of idle work slots is {L₁₀, L₂₁, L₃₁, L₃₀, L₄₀}.

FIG. 3 is a flowchart of task scheduling that minimizes MapReduce cluster energy consumption in the embodiment of the present invention. As shown in Fig. 3, the task scheduling steps are as follows:

S1. Calculate the number of work slots required by each job according to the formula, and initialize the job priority queue to jobList = {J₂, J₁, J₃}.

S2. The job queue is not empty at this moment.

S3. Add the idle work slots of the cluster to IdleList = {L₁₀, L₂₁, L₃₁, L₃₀, L₄₀}.

S4. Following the job queue, alternately select the long tasks and the short tasks of each job in turn until TaskList and IdleList are of equal length; TaskList = { }.

S5. Construct the allocation cost matrix:

Here 1 denotes a work slot with node-level data localization, 3 denotes a work slot with rack-level data localization, and 5 denotes a work slot with remote-level data localization.

S6. Obtain the work slot association list SelList = { } according to S601 to S611 of the algorithm flow.

S7. Join the task list and the work slot association list and add the result to the allocation list; the allocation list is { }. Allocate resources so that the tasks in the list run.

S8. In each heartbeat period, update the job state, including the running tasks, the remaining tasks, the completed tasks and the task execution times of each job.

S9. Use fuzzy logic to update the cluster environment. At this time the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate are 40, 65 and 30, respectively. The fuzzy membership functions give the CPU fuzzy value matrix (0.189, 0.811, 0), the memory fuzzy value matrix (0, 0.237, 0.437) and the network bandwidth fuzzy value matrix (0, 0.842, 0.158), from which the fuzzy rule matrix R1 and the fuzzy value matrix R2 are constructed:

R_1[0] = \begin{pmatrix} 0 & 0 & 0 \\ 0 & B'(z) & C'(z) \\ 0 & A'(z) & B'(z) \end{pmatrix}

R_1[1] = \begin{pmatrix} 0 & 0 & 0 \\ 0 & B'(z) & B'(z) \\ 0 & B'(z) & B'(z) \end{pmatrix}

R_1[2] = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}

R_2[0] = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0.189 & 0.189 \\ 0 & 0.158 & 0.158 \end{pmatrix}

R_2[1] = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0.237 & 0.437 \\ 0 & 0.158 & 0.158 \end{pmatrix}

R_2[2] = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}

where A′(z), B′(z) and C′(z) denote the fuzzy values of the low, medium and high outputs of the membership function, respectively. Taking the element-wise minimum of R1 and R2 gives R3:

R_3[0] = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \min(B'(z), 0.189) & \min(C'(z), 0.189) \\ 0 & \min(A'(z), 0.158) & \min(B'(z), 0.158) \end{pmatrix}

R_3[1] = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \min(B'(z), 0.237) & \min(B'(z), 0.437) \\ 0 & \min(B'(z), 0.158) & \min(B'(z), 0.158) \end{pmatrix}

R_3[2] = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}

The output fuzzy value is:

\mu = \max\{\min(B'(z), 0.189), \min(C'(z), 0.189), \min(A'(z), 0.158), \min(B'(z), 0.158), \min(B'(z), 0.237), \min(B'(z), 0.437), \min(B'(z), 0.158), \min(B'(z), 0.158)\}

That is, μ = 0.437, and the corresponding work slot change value is 0, so no work slot needs to be added to or removed from the server. The algorithm then returns to S1.
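The numbers of this example can be checked with a short max-min computation. The sketch below is a simplified illustration under several assumptions: the index order [CPU][network][memory] is inferred from the values of the R2 matrices, the memory membership value 0.437 is the one that appears in those matrices, the labels "A", "B" and "C" stand for the low, medium and high outputs, and the decision is taken from the label of the strongest fired rule instead of clipping the output membership functions.

```python
# Membership values (low, medium, high) for CPU 40, memory 65, network bandwidth 30.
cpu = (0.189, 0.811, 0.0)
mem = (0.0, 0.237, 0.437)
net = (0.0, 0.842, 0.158)

# Rule labels, indexed [cpu][net][mem]; None marks rules that do not fire.
R1 = [
    [[None, None, None], [None, "B", "C"], [None, "A", "B"]],
    [[None, None, None], [None, "B", "B"], [None, "B", "B"]],
    [[None, None, None], [None, None, None], [None, None, None]],
]

# Firing strength of each rule: minimum of its three input memberships.
R2 = [[[min(c, n, m) for m in mem] for n in net] for c in cpu]

# Work slot change per output label: low adds a slot, medium keeps, high removes.
CHANGE = {"A": 1, "B": 0, "C": -1}

# Pick the fired rule with the largest strength (maximum membership value).
strength, label = max(
    (R2[i][j][k], R1[i][j][k])
    for i in range(3)
    for j in range(3)
    for k in range(3)
    if R1[i][j][k] is not None
)
slot_change = CHANGE[label]
```

The computed strengths reproduce the R2 matrices of the example, the strongest rule has strength 0.437 with a medium output, and the resulting work slot change is 0.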

Through the above process, the task scheduling method on the MapReduce cluster is realized, and the energy consumption of the whole MapReduce cluster is reduced by means of a suitable job priority queue, a suitable task scheduling method and a suitable method for improving cluster resource utilization. While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A task scheduling method for minimizing MapReduce cluster energy consumption, characterized in that the method comprises the following steps:

A. a preprocessing stage: collecting resource utilization rates, which comprise the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate; obtaining the frequency distribution of each of the three utilization rates from the collected data, and determining the membership function of each utilization rate from its frequency distribution; constructing a fuzzy logic control system on each server from the fuzzy rules given by experts and the obtained membership functions; and, during task scheduling, using the fuzzy logic control system on each server to dynamically determine the number of work slots on that server and to optimize the task scheduling sequence;

B. a solution stage: first, calculating the minimum number of work slots required by each job from its number of running tasks, number of tasks waiting to be processed and number of completed tasks, and establishing a job priority queue according to that minimum number of work slots; then taking jobs from the priority queue in order and alternately selecting each job's tasks with long execution times and tasks with short execution times to build a task queue, until the length of the task queue equals the number of idle work slots on the cluster; and establishing a task-to-work-slot association list, in which each task is preferentially allocated the idle work slot with the minimum data localization cost, and executing the tasks;

C. an updating stage: in each heartbeat period, updating the task ordering in real time according to the current execution status of the tasks; and, given that a fuzzy control system is deployed on each server, dynamically determining from each server's current CPU utilization rate, memory utilization rate and network bandwidth utilization rate whether the number of work slots on that server needs to be changed.

2. The task scheduling method for minimizing energy consumption of MapReduce cluster according to claim 1, wherein: the preprocessing stage comprises the following specific steps:

A1. recording real-time data of CPU utilization rate, memory utilization rate and network bandwidth utilization rate of a server to form a data set;

A2. sampling and analyzing the data set in the step A1 to obtain a frequency distribution map of the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate, and determining membership functions of the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate of each server according to the frequency distribution map and a mathematical model;

A3. establishing a fuzzy rule by using experience and expert knowledge;

3. The task scheduling method for minimizing MapReduce cluster energy consumption according to claim 2, wherein: the solution phase comprises the following steps:

B1. calculating the minimum number of work slots required by the job according to the number of running tasks, the number of tasks waiting for processing and the number of completed tasks of each job, and sequencing the jobs from large to small according to the minimum number of the required work slots to obtain a priority queue of the job;

B2. in each heartbeat period, taking jobs from the job priority queue in order, alternately selecting the tasks with long execution times and the tasks with short execution times in each job, and adding them to the task queue in turn until the length of the task queue equals the number of idle work slots on the cluster;

B3. each task has a local node work slot list, a local rack work slot list and a remote work slot list, which respectively represent node-level, rack-level and remote-level data localization; the work slot allocated to each task is determined through a data localization cost matrix to obtain an association list, and the tasks are executed; wherein data localization indicates how close the node to which a task is assigned is to the node holding the input data the task needs to process, and the degree of data localization is divided into three levels from high to low, namely the node level, the rack level and the remote level.
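The alternating selection described in step B2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `alternate_long_short` and `build_task_queue` are hypothetical, each task is a `(task_id, execution_time)` pair, the jobs are assumed to arrive already sorted by priority, and "alternately" is read as taking the longest remaining task, then the shortest, and so on.

```python
def alternate_long_short(tasks, limit):
    """Pick up to `limit` task ids from one job, alternating longest/shortest."""
    ordered = sorted(tasks, key=lambda t: t[1], reverse=True)
    lo, hi = 0, len(ordered) - 1
    queue, take_long = [], True
    while lo <= hi and len(queue) < limit:
        if take_long:
            queue.append(ordered[lo][0])  # longest remaining task
            lo += 1
        else:
            queue.append(ordered[hi][0])  # shortest remaining task
            hi -= 1
        take_long = not take_long
    return queue


def build_task_queue(jobs, free_slots):
    """Fill the task queue job by job until it matches the idle slot count."""
    queue = []
    for job_tasks in jobs:  # jobs are assumed to be in priority order
        queue += alternate_long_short(job_tasks, free_slots - len(queue))
        if len(queue) == free_slots:
            break
    return queue


jobs = [[("a", 5), ("b", 1), ("c", 9)], [("d", 2), ("e", 7)]]
print(build_task_queue(jobs, 4))  # ['c', 'b', 'a', 'e']
```

Interleaving long and short tasks is one way to keep the slots of a heartbeat period from all being occupied by long tasks at once; other interleavings would satisfy the wording of B2 equally well.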

4. The task scheduling method for minimizing energy consumption of MapReduce cluster according to claim 3, wherein: the update phase comprises the steps of:

C1. calculating the corresponding minimum required working slot number of all the current unfinished jobs according to a formula shown in B1, and sequencing the jobs from large to small according to the minimum required working slot number;

5. The task scheduling method for minimizing MapReduce cluster energy consumption according to claim 5, wherein: the number of work slots that need to be allocated to a job is calculated as follows:

\xi_i^A = \frac{\sum_{j=1}^{|A_i^r|} (p_{ij} - p_{ij}^e) + \left( |A_i^w| \times \frac{\sum_{k=1}^{|A_i^c|} p_{ik}}{|A_i^c|} \right)}{D_i - t - K} - |A_i^r|
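A direct transcription of this formula is sketched below. The reading of the symbols is an assumption based on step B1: A_i^r, A_i^w and A_i^c are taken to be the running, waiting and completed tasks of job i, p the task execution time, p^e the time a running task has already executed, D_i the deadline and t the current time. K is passed through as given, since its meaning is not defined in this passage. The function name and the handling of a job with no completed tasks are hypothetical.

```python
def required_slots(running, waiting_count, completed, deadline, now, k):
    """Evaluate the slot-requirement formula for one job.

    running:       list of (p_ij, p_ij_e) pairs for running tasks
    waiting_count: |A_i^w|, number of waiting tasks
    completed:     list of execution times p_ik of completed tasks
    """
    remaining = sum(p - elapsed for p, elapsed in running)
    # Assumption: with no completed tasks the average is taken as 0,
    # a case the formula itself does not cover.
    avg_done = sum(completed) / len(completed) if completed else 0.0
    workload = remaining + waiting_count * avg_done
    return workload / (deadline - now - k) - len(running)


# Two running tasks, three waiting tasks, two completed tasks.
print(required_slots([(10, 4), (20, 5)], 3, [8, 12], deadline=30, now=10, k=3))  # 1.0
```

In this example the remaining work is 21 + 3 × 10 = 51 time units, the time left is 17, so three slots are needed in total and one more than the two already in use.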

6. The task scheduling method for minimizing energy consumption of MapReduce cluster according to claim 1, wherein: the work flow comprises the following steps:

S1, calculating the number of work slots that need to be allocated to each job, and creating a job priority queue;

S2, if the job queue is empty, going to S10; otherwise, going to S3;

S3, adding the idle work slots of the cluster to an idle work slot list;

S4, alternately adding tasks with long running times and tasks with short running times to a task list, until the task list and the idle work slot list have equal lengths;

S5, constructing a cost matrix from the data localization costs of the local node work slot list, the local rack work slot list and the remote work slot list, which respectively represent node-level, rack-level and remote-level data localization, wherein the rows of the cost matrix represent work slots, the columns represent tasks, and each entry is the data localization cost of the corresponding task and work slot;

S6, allocating work slots to the tasks according to the cost matrix to obtain a work slot association list;

S7, matching the elements of the association list and the task list one to one, adding them to an allocation list, allocating the corresponding resources and executing the tasks;

S8, after a task is completed, updating the state of the corresponding job, wherein the state comprises the number of running tasks, the number of remaining tasks, the number of completed tasks, the task execution times and the like;

S9, updating the cluster environment according to the fuzzy logic control system, wherein the cluster environment mainly comprises the CPU utilization rate, the memory utilization rate and the network bandwidth utilization rate, and determining through the fuzzy rules whether to increase, decrease or leave unchanged the number of work slots of the cluster; going to S1;

S10, outputting the result and ending the algorithm.
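The cost matrix of step S5 can be illustrated with a small sketch. The cost values 0, 1 and 2 for node-level, rack-level and remote-level placement are assumptions made for illustration, as are the function name `cost_matrix`, the representation of a work slot by the node that hosts it, and the `rack_of` lookup table; the text only fixes that rows are work slots and columns are tasks.

```python
# Hypothetical cost values; the text only orders the three levels.
NODE_COST, RACK_COST, REMOTE_COST = 0, 1, 2


def cost_matrix(slot_nodes, task_data_nodes, rack_of):
    """Rows are work slots, columns are tasks, entries are locality costs.

    slot_nodes:      node hosting each idle work slot
    task_data_nodes: for each task, the set of nodes holding its input data
    rack_of:         mapping from node name to rack name
    """
    matrix = []
    for node in slot_nodes:
        row = []
        for data_nodes in task_data_nodes:
            if node in data_nodes:
                row.append(NODE_COST)  # input data is on the slot's node
            elif rack_of[node] in {rack_of[d] for d in data_nodes}:
                row.append(RACK_COST)  # input data is on the same rack
            else:
                row.append(REMOTE_COST)  # input data is on another rack
        matrix.append(row)
    return matrix


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
slots = ["n1", "n2", "n3"]
tasks = [{"n2"}, {"n3"}, {"n1"}]
print(cost_matrix(slots, tasks, rack_of))  # [[1, 2, 0], [0, 2, 1], [2, 0, 2]]
```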

7. The task scheduling method for minimizing energy consumption of MapReduce cluster according to claim 6, wherein: in step S6, the method for allocating work slots of a task according to the cost matrix to obtain a work slot association list includes the following steps:

S601, recording the minimum value of each column of the cost matrix, and rearranging the columns of the cost matrix in non-decreasing order of their minimum values;

S602, recording the node-level data localization rows, the rack-level data localization rows and the remote-level data localization rows of each column;

S603, selecting each column of the cost matrix in turn, and if the column number does not exceed the boundary, going to S604; otherwise, going to S611;

S604, if the set of node-level data localization rows of the column is empty, going to S606; otherwise, going to S605;

S605, randomly selecting one node-level data localization row for the column, and going to S609;

S606, if the set of rack-level data localization rows of the column is empty, going to S608; otherwise, going to S607;

S607, randomly selecting one rack-level data localization row for the column, and going to S609;

S608, randomly selecting one remote-level data localization row for the column;

S609, adding the work slot corresponding to the selected row to the association list;

S610, deleting the selected row from the node-level, rack-level and remote-level data localization rows of the other columns, incrementing the column number, and returning to S604;

S611, outputting the association list and ending the algorithm.
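Steps S601 to S611 amount to a greedy assignment that serves columns (tasks) in order of their best achievable locality. The sketch below is one possible reading, not the claimed implementation: the function name `assign_slots` is hypothetical, the cost levels 0, 1 and 2 for node, rack and remote placement are assumed, and the per-column row lists of S602 are replaced by a single set of still-free rows, which has the same effect as the deletion in S610.

```python
import random

# Hypothetical cost levels for node-level, rack-level and remote placement.
NODE, RACK, REMOTE = 0, 1, 2


def assign_slots(cost, rng=None):
    """Map each task (column) to a work slot (row) by best available locality."""
    rng = rng or random.Random()
    if not cost:
        return {}
    n_rows, n_cols = len(cost), len(cost[0])
    # S601: visit columns in non-decreasing order of their column minimum.
    order = sorted(range(n_cols),
                   key=lambda c: min(cost[r][c] for r in range(n_rows)))
    free = set(range(n_rows))  # rows not yet given to any column
    association = {}
    for col in order:  # S603
        for level in (NODE, RACK, REMOTE):  # S604 to S608
            candidates = [r for r in sorted(free) if cost[r][col] == level]
            if candidates:
                row = rng.choice(candidates)
                association[col] = row  # S609
                free.discard(row)  # S610
                break
    return association  # S611


cost = [[1, 2, 0], [0, 2, 1], [2, 0, 2]]
print(assign_slots(cost))  # {0: 1, 1: 2, 2: 0}
```

In this example every task has exactly one node-level slot, so the result is deterministic; with ties the random choice decides, and a task is left unassigned only if no free row remains.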

8. The task scheduling method for minimizing energy consumption of MapReduce cluster according to claim 7, wherein: in step S9, the method for updating the new cluster environment according to the fuzzy logic control system includes:

S901, constructing a CPU membership value matrix according to the CPU utilization rate and the CPU membership function of the current cluster environment;

S902, constructing a memory membership value matrix according to the memory utilization rate and the memory membership function of the current cluster environment;

S903, constructing a network bandwidth membership value matrix according to the network bandwidth utilization rate and the network bandwidth membership function of the current cluster environment;

S904, constructing a fuzzy rule matrix and a fuzzy value matrix according to the fuzzy rules, the CPU membership value matrix, the memory membership value matrix and the network bandwidth membership value matrix;

S905, calculating the work slot change value from the maximum membership value according to the fuzzy rule matrix and the fuzzy value matrix, wherein a value of 1 indicates that one work slot is added to the server, a value of -1 indicates that one work slot is removed from the server, and a value of 0 indicates no change;

S906, ending the method.
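Steps S901 to S903 and the final update in S905 can be sketched as follows. The membership functions here are an assumption: the method derives them from measured frequency distributions, whereas this sketch uses simple piecewise-linear functions with hypothetical breakpoints at 0, 50 and 100, so its outputs will not match the values of the worked example. The function names and the lower bound of one work slot per server are also hypothetical.

```python
def fuzzify(x, lo=0.0, mid=50.0, hi=100.0):
    """Return (low, medium, high) membership values for a utilization rate.

    Piecewise-linear shapes with hypothetical breakpoints; the real
    membership functions come from the measured frequency distributions.
    """
    x = min(max(x, lo), hi)  # clamp to the valid range
    if x <= mid:
        medium = (x - lo) / (mid - lo)
        return (1.0 - medium, medium, 0.0)
    medium = (hi - x) / (hi - mid)
    return (0.0, medium, 1.0 - medium)


def apply_slot_change(slots, change, min_slots=1):
    """Apply the S905 change value (1, -1 or 0) to a server's slot count."""
    if change not in (-1, 0, 1):
        raise ValueError("change must be -1, 0 or 1")
    # Assumption: a server always keeps at least `min_slots` work slots.
    return max(min_slots, slots + change)


cpu_membership = fuzzify(40)  # mostly "medium", partly "low"
mem_membership = fuzzify(65)  # mostly "medium", partly "high"
print(cpu_membership, mem_membership)
print(apply_slot_change(4, 0))  # 4
```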