CN108170530B

CN108170530B - Hadoop load balancing task scheduling method based on a hybrid meta-heuristic algorithm

Info

Publication number: CN108170530B
Application number: CN201711433347.5A
Authority: CN
Inventors: 毕敬; 程煜东; 乔俊飞
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-12-26
Filing date: 2017-12-26
Publication date: 2021-08-17
Anticipated expiration: 2037-12-26
Also published as: CN108170530A

Abstract

The invention relates to a Hadoop load balancing task scheduling method based on a hybrid meta-heuristic algorithm. A resource slot pressure model is established whose goal is to bring the computing pressure of the tasks queued on all Slave nodes in a cluster to the same level. The optimal task scheduling scheme is solved with a hybrid meta-heuristic algorithm based on simulated annealing and particle swarm optimization, realizing load balancing task scheduling in a Hadoop cluster environment. Furthermore, the algorithm is parallelized with MPICH (MPI over CHameleon), a high-performance and widely portable implementation of the Message Passing Interface: the computation of the heuristic optimization algorithm is moved to additional compute nodes and carried out by multiple swarms simultaneously, which reduces the computing pressure on the Master node and improves the ability to find the optimal task scheduling scheme per unit time. The invention allocates the computing resources of the Hadoop cluster as a whole, so that node load is balanced, waste of node computing resources is avoided, and the return on the data center's equipment investment is maximized.

Description

Hadoop load balancing task scheduling method based on a hybrid meta-heuristic algorithm

Technical Field

The invention relates to the field of task scheduling under the Hadoop MapReduce framework. More specifically, it realizes a Hadoop task scheduling algorithm aimed at balancing cluster load by using a particle swarm optimization algorithm, a hybrid meta-heuristic algorithm based on simulated annealing and particle swarm optimization, and the MPICH parallel programming method.

Background

With the rapid development of mobile smart devices, the information age is advancing ever faster, and massive amounts of data are generated, actively or passively, as users use the network. The value of this data usually cannot be extracted by traditional statistical or computational methods, but once the potential value behind it is uncovered, it can bring huge benefits to enterprises and governments. For example, Taobao can infer users' product preferences and needs by analyzing their browsing records and push targeted products on the home page, thereby guiding purchases; video and music service providers can derive user preferences from historical usage data and improve their service through targeted recommendations, giving users a better experience; government agencies can perform public opinion analysis on trending topics on social platforms to better maintain social stability. However, traditional computing modes cannot meet the requirements of processing such big data, and enterprises and government organizations need computer clusters with strong computing power to achieve these goals.

However, building and maintaining a data center is extremely expensive, and most small and medium-sized enterprises cannot build a large-scale data center to meet their business requirements. The pay-as-you-go cloud service model is a great help to such enterprises. By purchasing cloud data center services, an enterprise can deploy a cluster of hundreds of servers within a few hours, at a much lower cost and with far more convenience than building a traditional data center. As the business changes later, users can readily adjust their cloud computing resources to meet business requirements in real time. The development of cloud data centers also benefits large multinational companies, which can purchase services from cloud data centers in different regions of the world. This not only saves the labor and material costs of building traditional data centers but also gives users distributed around the world faster response times, improving user experience. For the data center itself, making the fullest possible use of its computing capability, so as to provide more efficient computing services and process more user tasks faster and better in the same time, is an important way to obtain more revenue.

Hadoop is a distributed system infrastructure capable of distributed processing of large amounts of data. Its MapReduce distributed processing framework is simple in principle and performs well, and is currently used by many data centers to process big data. Task scheduling is a key link in making a Hadoop cluster actually work. Task scheduling allocates cluster computing resources through a task scheduling algorithm so that pending tasks are processed with sufficient resources. A good task scheduling algorithm must not only let tasks be processed faster (i.e., give users a faster response) but also let every machine in the cluster exert its computing power, because machines in a data center generally need to be replaced after about five years of service; if machines are idle for long periods or their computing power is underused, the data center suffers a significant loss. Traditional task scheduling algorithms fall into two types: real-time task scheduling algorithms and heuristic task scheduling algorithms. The core idea of a real-time scheduling algorithm is to schedule jobs in real time and allocate the computing resources they require; its advantages are short response time and low resource overhead for scheduling. A heuristic task scheduling algorithm can consider the resources of all machines in the cluster and, within a given solution space (limited by the actual resources of the cluster), optimize for load balance or fastest processing speed, so that resources are allocated globally and the computing resources of the cluster are better utilized.

However, the number of nodes in a cluster is often very large, so solving with a heuristic algorithm is complex and incurs non-negligible extra overhead. In the traditional Hadoop task scheduling mode, heuristic task scheduling places a heavy burden on the Master node (the node responsible for job scheduling and distribution in Hadoop), which affects the stability of the cluster. In addition, heuristic algorithms such as particle swarm optimization, ant colony optimization, and simulated annealing are prone to falling into local optima, so their performance in actual scheduling is not stable.

In view of the above shortcomings of heuristic task scheduling algorithms in Hadoop, the invention provides a task scheduling method aimed at improving cluster resource utilization, based on the MapReduce distributed processing framework, the MPICH parallel programming method, the particle swarm optimization algorithm, and a hybrid meta-heuristic algorithm combining particle swarm optimization with simulated annealing.

Disclosure of Invention

The invention aims to provide a task scheduling method, based on the Hadoop architecture, for improving the utilization of distributed cluster computing resources. The method takes into account both the tendency of heuristic algorithms to fall into local optima and their poor fit for Hadoop task scheduling. It assigns each machine in the cluster a number of pending tasks appropriate to its computing power, so that the computing pressure of the pending tasks on every machine is at the same level. This balances the load of the task processing nodes (Slave nodes), improves the utilization of cluster computing resources, saves cost, and increases profit. In addition, the ability of the task scheduling algorithm to escape local optima is improved by optimizing with several swarms at once, using heuristic optimization algorithms such as standard particle swarm optimization and particle swarm optimization based on simulated annealing. Finally, the MPICH parallel programming method is used to distribute the solving process of the heuristic algorithm across several machines, which avoids adding computing pressure to the Master node of the Hadoop cluster and improves cluster stability.

In order to achieve the purpose, the invention adopts the following technical scheme:

To assign each task processing node a number of pending tasks matching its computing resources, so that the computing pressure across task processing nodes is at the same level, the invention builds on the concept of the resource slot in the MapReduce distributed framework. The resource slot (Slot) is the unit into which node resources are divided in the traditional MapReduce distributed processing framework. A node determines its total number of Slots according to its computing capacity and total memory. Owing to the characteristics of the MapReduce framework, Slots are divided into Map Slots, which are dedicated to processing Map subtasks, and Reduce Slots, which are dedicated to processing Reduce subtasks. When Hadoop processes a job, the job applies to the JobTracker for Slot resources in advance. Based on this characteristic, a resource slot pressure model is established.

To apply the heuristic algorithm to Hadoop task scheduling without adding extra computing burden to the Master node, while improving the algorithm's ability to escape local optima, the invention uses the MPICH parallel programming method. The computation of the task scheduling algorithm is executed on additional compute nodes, where several particle swarms, including swarms based on the simulated annealing algorithm, are optimized simultaneously. The result of the swarm with the best optimization result at a set time is used for the actual task allocation. This improves the algorithm's ability to escape local optima and avoids burdening the Master node.

In summary, a Hadoop load balancing task scheduling method based on a hybrid meta-heuristic algorithm comprises the following steps:

S1, with the goal of balancing the computing pressure of the tasks on the task processing nodes, establishing a resource slot pressure model according to the resource slot principle;

S2, solving the resource slot pressure model with a particle swarm optimization algorithm;

S3, solving the resource slot pressure model with a hybrid meta-heuristic optimization algorithm based on simulated annealing and particle swarm optimization;

S4, moving the complex computation of the heuristic optimization algorithm to additional compute nodes by means of the MPICH parallel programming method, running several particle swarms simultaneously so that they find more local optimal solutions, and then taking the best solution for task scheduling.

Preferably, the optimization goal of the resource slot pressure model is to minimize the Variance between the computing pressures of the Hadoop cluster's Slave nodes, and the model is defined by the following quantities:

Pressure represents the computing pressure of the pending tasks on a Slave node, defined as the ratio of the total number of Map resource slots required by the M Map subtasks pending on that node to the node's total number m of Map resource slots; t_i represents the number of Map resource slots required by the i-th pending task; Average represents the average computing pressure of the pending tasks over all Slave nodes in the cluster; m_i represents the number of Map resource slots of the i-th Slave node in the cluster; S represents the number of Slave nodes in the cluster; and Variance represents the variance among the computing pressures of the pending tasks of the Slave nodes in the cluster.
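The formulas of the model are not reproduced in this text. A reconstruction that is consistent with the definitions above is sketched below; it is an assumption rather than the published formulas. The node subscript j on Pressure and M is introduced here for clarity, and the population form of the variance (division by S) is assumed.

$$
\mathrm{Pressure}_j = \frac{\sum_{i=1}^{M_j} t_i}{m_j}, \qquad
\mathrm{Average} = \frac{1}{S}\sum_{j=1}^{S} \mathrm{Pressure}_j, \qquad
\mathrm{Variance} = \frac{1}{S}\sum_{j=1}^{S} \left(\mathrm{Pressure}_j - \mathrm{Average}\right)^2
$$

Here M_j is the number of Map subtasks pending on the j-th Slave node and the sum in the first formula runs over those subtasks only.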

Preferably, according to the established resource slot pressure model, a task schedule is encoded as the particle coordinates of the particle swarm optimization algorithm, and the objective function is designed to calculate, from the particle coordinates, the pressure variance of the cluster's Slave nodes under the task scheduling scheme those coordinates represent. The specific parameters and formulas of the particle swarm optimization algorithm based on the resource slot pressure model are as follows:

X_i(x₁,x₂,x₃,...,x_m)；

V_i(v₁,v₂,v₃,...,v_m)；

V_i'＝w*V_i+c₁*r₁*(pBest_i-X_i)+c₂*r₂*(gBest-X_i)；

X_i'＝X_i+V_i'；

wherein X_i(x₁,x₂,x₃,...,x_m) represents the coordinates of the i-th particle in the solution space; x_m indicates that the m-th pending task is assigned to run on the x_m-th Slave node; V_i(v₁,v₂,v₃,...,v_m) represents the velocity of the i-th particle; V_i' represents the velocity of the i-th particle after it is updated according to the learning experience of the previous iteration; w is the inertia weight; c₁ and c₂ are learning factors; r₁ and r₂ are random numbers in [0,1]; pBest_i is the individual best point of the i-th particle; gBest is the best point of the swarm; X_i' represents the coordinates of the i-th particle after one iteration; pBest_i' is the individual best point of the i-th particle after one iteration; gBest' is the best point of the swarm after one iteration; f(X_i) is the objective function of the particle swarm optimization algorithm, which calculates, from the particle coordinates X_i(x₁,x₂,x₃,...,x_m) and the correspondence between pending tasks and Slave nodes that they encode, the Variance of the computing pressure of the pending tasks across the Slave nodes in the cluster.

Preferably, solving the resource slot pressure model with the particle swarm optimization algorithm is subject to the following requirements: the computing resources of all Slave nodes in the cluster must be counted in advance; a pending task queue must be generated and maintained for each Slave node during task scheduling; and before the Map subtask allocation scheme is solved, the number of resource slots required by each Map subtask must be obtained from the JobTracker, so that the solution can be carried out with the formulas of the resource slot pressure model.

Preferably, solving the resource slot pressure model with the hybrid meta-heuristic algorithm based on simulated annealing and particle swarm optimization specifically comprises: after the particle coordinates and initial velocities in the swarm are initialized, the initial annealing temperature T₀ is calculated from the difference between the objective function values of the best particle position (f_min) and the worst particle position (f_max) and from the preset initial acceptance probability p_r; in each iteration, if the objective function value of the new particle coordinates calculated by the particle coordinate update rule is better than that of the current particle coordinates, the current coordinates are directly replaced by the new coordinates; if the objective function value of the new coordinates is worse than that of the current coordinates, the acceptance rule of the simulated annealing algorithm decides whether the new coordinates are accepted, and otherwise the original particle coordinates are kept unchanged. Because simulated annealing may accept suboptimal coordinates, the hybrid algorithm gains the possibility of escaping a local optimum after the swarm has fallen into it, and thus of finally converging to the global optimum. The specific formulas are as follows:

Δf＝f(X_i+V_i')-f(X_i)；

T_{i+1}＝ξT_i；

wherein only the particle coordinate update formula is changed relative to the particle swarm formulas and parameters; T₀ is the initial annealing temperature; p_r is the initial acceptance probability; f_min and f_max are the minimum and maximum objective function fitness values after particle swarm initialization; T_i is the annealing temperature in the i-th iteration; ξ is the temperature decay coefficient; X_i' represents the coordinates of the i-th particle after one iteration.
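The formulas for the initial temperature and for the modified coordinate update are not reproduced in this text. A commonly used form that matches the description (T₀ derived from f_max, f_min, and p_r; worse moves accepted by the simulated annealing rule) is sketched below as an assumption, not as the published formulas. T denotes the current annealing temperature and r a uniform random number in [0,1); both symbols are introduced here.

$$
T_0 = -\frac{f_{max} - f_{min}}{\ln p_r}, \qquad
X_i' =
\begin{cases}
X_i + V_i', & \Delta f < 0 \ \text{or} \ \exp\!\left(-\Delta f / T\right) > r \\
X_i, & \text{otherwise}
\end{cases}
$$

With this choice of T₀, a move that worsens the objective by f_max − f_min is accepted with probability exactly p_r at the start of the search.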

Preferably, within the time limit, if any swarm reaches the convergence condition specified for the objective function, that swarm transmits the task scheduling scheme it has computed to the Master node; after a certain time slice, the Master node compares the objective function values of the task scheduling schemes it has received and selects the scheme that makes the computing pressure of all Slave nodes in the cluster most even for task resource allocation. If a swarm fails to reach the convergence condition within the time limit, the best solution currently held by that swarm is taken as the best task scheduling scheme it has found and is transmitted to the Master node.

The invention has the following beneficial effects:

The technical scheme of the invention solves the problem that a heuristic scheduling algorithm used by a Hadoop cluster places extra computing burden on the Master node and affects cluster stability, and thereby overcomes the load imbalance and partial waste of node resources that occur when a real-time scheduling algorithm is used. By combining simulated annealing with particle swarm optimization, the hybrid meta-heuristic algorithm is better able to escape local optima, which improves the heuristic scheduling algorithm's ability to find the global optimum. Starting from load balancing across the nodes of the Hadoop cluster, and without adding extra computing load to the Master node, the technical scheme fully considers the different processing capacities of the Slave nodes and allocates tasks according to capability. While ensuring the task processing efficiency and stability of the Hadoop cluster, it puts to use the idle computing resources that a real-time scheduling algorithm may leave in the cluster, so that the data center's investment in cluster construction yields a better return and its revenue increases.

Drawings

The following detailed description of embodiments of the invention is provided in conjunction with the appended drawings:

FIG. 1 shows a flow chart of a Hadoop load balancing task scheduling method based on a hybrid heuristic algorithm;

FIG. 2 shows a Hadoop cluster architecture diagram with the addition of additional compute nodes for algorithmic solution.

Detailed Description

In order to more clearly illustrate the invention, the invention is further described below with reference to preferred embodiments and the accompanying drawings. Similar parts in the figures are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.

As shown in FIG. 1 and FIG. 2, the Hadoop load balancing task scheduling method based on the hybrid meta-heuristic algorithm disclosed by the invention comprises the following steps:

S1, with the goal of balancing the computing pressure of the tasks on the task processing nodes, establishing a resource slot pressure model according to the resource slot principle.

The main objective of the resource slot pressure model is to bring the computing pressure that pending tasks place on each task-running child node (Slave node) in the cluster to the same level. In the actual resource allocation process of a Hadoop cluster under the MapReduce distributed processing framework, resources are allocated to pending tasks in units of resource slots (Slots). Specifically, the amount of computing resources contained in each Slot is preset in the Hadoop configuration, and the computing resources contained in a single Slot are equal on all Slave nodes. For Slave nodes with different computing capabilities, the difference in computing capability is therefore reflected in the difference in the number of Slots. In the resource slot pressure model, the computing pressure of a node is quantified by comparing the total number of resource slots required by the node's pending tasks with the number of resource slots the node owns. The practical meaning of the model is that a reasonable task scheduling method brings the computing pressure of the pending tasks on all Slave nodes to the same level (i.e., the smaller the variance, the better). With the task scheduling method disclosed by the invention, all Slave nodes execute tasks under similar computing pressure, so that the load of the cluster is relatively balanced.
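As an illustration of the quantity being minimized, the following minimal Python sketch computes per-node pressure and its variance for a given assignment of tasks to Slave nodes. The function name `pressure_variance`, the list-based data layout, and the use of the population variance are assumptions made for this example; they are not taken from the patent text.

```python
from statistics import pvariance


def pressure_variance(assignment, task_slots, node_slots):
    """Return (pressures, variance) for a task-to-node assignment.

    assignment[i] : index of the Slave node that task i is assigned to
    task_slots[i] : Map slots required by task i (t_i)
    node_slots[j] : Map slots owned by Slave node j (m_j)
    """
    demand = [0.0] * len(node_slots)
    for task, node in enumerate(assignment):
        demand[node] += task_slots[task]
    # Pressure of a node = slots demanded by its pending tasks / slots it owns.
    pressures = [d / m for d, m in zip(demand, node_slots)]
    # Assumption: population variance (division by the number of nodes).
    return pressures, pvariance(pressures)


# Hypothetical cluster: node 0 owns 4 Map slots, node 1 owns 2.
task_slots = [2, 2, 1, 1]
node_slots = [4, 2]
balanced = pressure_variance([0, 0, 1, 1], task_slots, node_slots)
skewed = pressure_variance([0, 1, 1, 1], task_slots, node_slots)
```

In this toy case the first assignment gives both nodes a pressure of 1.0, while the second overloads the smaller node, so its variance is larger.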

In the resource slot pressure model, the computing pressure of a node is denoted Pressure, and the parameter that measures whether the computing pressures of all Slave nodes in the cluster are at the same level is Variance; both are computed as defined for the resource slot pressure model above.

S2, solving the resource slot pressure model with a particle swarm optimization algorithm.

In this step, according to the established resource slot pressure model, a task schedule is encoded as the particle coordinates of the particle swarm optimization algorithm, and the objective function is designed to calculate, from the particle coordinates, the pressure variance of the cluster's Slave nodes under the task scheduling scheme those coordinates represent. The specific parameters and formulas of the particle swarm optimization algorithm based on the resource slot pressure model are as follows:

X_i(x₁,x₂,x₃,...,x_m)；

V_i(v₁,v₂,v₃,...,v_m)；

V_i'＝w*V_i+c₁*r₁*(pBest_i-X_i)+c₂*r₂*(gBest-X_i)；

X_i'＝X_i+V_i'；

wherein X_i(x₁,x₂,x₃,...,x_m) represents the coordinates of the i-th particle in the solution space; x_m indicates that the m-th pending task is assigned to run on the x_m-th Slave node; V_i(v₁,v₂,v₃,...,v_m) represents the velocity of the i-th particle; V_i' represents the velocity of the i-th particle after it is updated according to the learning experience of the previous iteration; w is the inertia weight; c₁ and c₂ are learning factors; r₁ and r₂ are random numbers in [0,1]; pBest_i is the individual best point of the i-th particle; gBest is the best point of the swarm; X_i' represents the coordinates of the i-th particle after one iteration; pBest_i' is the individual best point of the i-th particle after one iteration; gBest' is the best point of the swarm after one iteration; f(X_i) is the objective function of the particle swarm optimization algorithm, which calculates, from the particle coordinates X_i(x₁,x₂,x₃,...,x_m) and the correspondence between pending tasks and Slave nodes that they encode, the Variance of the computing pressure of the pending tasks across the Slave nodes in the cluster.

To solve the resource slot pressure model with the particle swarm optimization algorithm, the computing resources of all Slave nodes in the cluster must be counted in advance, and a pending task queue must be generated and maintained for each Slave node during task scheduling. Before the Map subtask allocation scheme is solved, the number of resource slots required by each Map subtask must be obtained from the JobTracker, so that the solution can be carried out with the formulas of the resource slot pressure model.
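The velocity and position updates of step S2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: coordinates are kept continuous and rounded to the nearest valid node index by a hypothetical `decode` helper, the parameter values (w = 0.7, c₁ = c₂ = 1.5) are arbitrary defaults, and pBest and gBest are replaced only when the new objective value is strictly lower, which is the usual rule but is not spelled out in the text.

```python
import random
from statistics import pvariance


def decode(x, n_nodes):
    # Assumption: round a continuous coordinate to the nearest valid node index.
    return min(n_nodes - 1, max(0, round(x)))


def fitness(position, task_slots, node_slots):
    # Variance of per-node pressure for the assignment encoded by `position`.
    demand = [0.0] * len(node_slots)
    for task, x in enumerate(position):
        demand[decode(x, len(node_slots))] += task_slots[task]
    return pvariance([d / m for d, m in zip(demand, node_slots)])


def pso(task_slots, node_slots, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    n_tasks, n_nodes = len(task_slots), len(node_slots)

    def f(p):
        return fitness(p, task_slots, node_slots)

    X = [[rng.uniform(0, n_nodes - 1) for _ in range(n_tasks)]
         for _ in range(n_particles)]
    V = [[rng.uniform(-1, 1) for _ in range(n_tasks)]
         for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    pbest_f = [f(x) for x in X]
    g = min(range(n_particles), key=pbest_f.__getitem__)
    gbest, gbest_f = pbest[g][:], pbest_f[g]

    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # V_i' = w*V_i + c1*r1*(pBest_i - X_i) + c2*r2*(gBest - X_i)
            V[i] = [w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
                    for v, x, pb, gb in zip(V[i], X[i], pbest[i], gbest)]
            # X_i' = X_i + V_i'
            X[i] = [x + v for x, v in zip(X[i], V[i])]
            fx = f(X[i])
            if fx < pbest_f[i]:
                pbest[i], pbest_f[i] = X[i][:], fx
                if fx < gbest_f:
                    gbest, gbest_f = X[i][:], fx
    return [decode(x, n_nodes) for x in gbest], gbest_f


assignment, variance = pso([2, 2, 1, 1], [4, 2])
```

The returned `assignment` lists, for each pending task, the index of the Slave node it would be queued on, and `variance` is the objective value of that schedule.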

S3, solving the resource slot pressure model with a hybrid meta-heuristic optimization algorithm that combines simulated annealing with particle swarm optimization.

The main purpose of the hybrid meta-heuristic optimization algorithm is to give particle swarm optimization the ability to escape local optima, by way of the simulated annealing algorithm's possibility of accepting suboptimal coordinates. This improves the algorithm's ability to find the global optimal solution and therefore the effect of the task scheduling algorithm. The specific formulas of the hybrid meta-heuristic algorithm are as follows:

Δf＝f(X_i+V_i')-f(X_i)；

T_{i+1}＝ξT_i；

wherein only the particle coordinate update formula is changed relative to the particle swarm formulas and parameters. T₀ is the initial annealing temperature; p_r is the initial acceptance probability; f_min and f_max are the minimum and maximum objective function fitness values after particle swarm initialization; T_i is the annealing temperature in the i-th iteration; ξ is the temperature decay coefficient.

The basic principle of the hybrid meta-heuristic algorithm based on simulated annealing and particle swarm optimization in the invention is as follows. After the particle coordinates and initial velocities in the swarm are initialized, the initial annealing temperature T₀ is calculated from the difference between the objective function values of the best particle position (f_min) and the worst particle position (f_max) and from the preset initial acceptance probability p_r. In each iteration, if the objective function value of the new particle coordinates calculated by the particle coordinate update rule is better than that of the current particle coordinates, the current coordinates are directly replaced by the new coordinates. If the objective function value of the new coordinates is worse than that of the current coordinates, the acceptance rule of the simulated annealing algorithm decides whether the new coordinates are accepted; otherwise the original particle coordinates are kept unchanged. Because simulated annealing may accept suboptimal coordinates, the hybrid algorithm gains the possibility of escaping a local optimum after the swarm has fallen into it, so that it can finally converge to the global optimum.
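The acceptance logic described above can be sketched in Python as follows. The exact formulas for T₀ and for the acceptance probability are not reproduced in this text, so the sketch assumes the standard Metropolis rule, exp(−Δf/T), and an initial temperature chosen so that a move worsening the objective by f_max − f_min is accepted with probability p_r. All function names are hypothetical.

```python
import math
import random


def initial_temperature(f_min, f_max, p_r):
    # Assumed form: exp(-(f_max - f_min) / T0) == p_r, with 0 < p_r < 1.
    return -(f_max - f_min) / math.log(p_r)


def accept(delta_f, temperature, rng=random):
    # Better coordinates are always accepted.
    if delta_f < 0:
        return True
    # At zero temperature no non-improving move is accepted.
    if temperature <= 0:
        return False
    # Assumed Metropolis rule for non-improving moves.
    return rng.random() < math.exp(-delta_f / temperature)


def hybrid_position_update(x, v_new, f, temperature, rng=random):
    """Return X_i' for one particle: the candidate X_i + V_i' if accepted,
    otherwise the unchanged X_i."""
    candidate = [a + b for a, b in zip(x, v_new)]
    delta_f = f(candidate) - f(x)  # delta_f = f(X_i + V_i') - f(X_i)
    return candidate if accept(delta_f, temperature, rng) else x


def cool(temperature, xi=0.95):
    # T_{i+1} = xi * T_i
    return xi * temperature


T0 = initial_temperature(f_min=0.0, f_max=2.0, p_r=0.5)
```

In a full hybrid run, `hybrid_position_update` would replace the plain X_i' = X_i + V_i' step of the particle swarm loop, and `cool` would be applied once per iteration. The decay coefficient 0.95 is an arbitrary example value.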

S4, moving the complex computation of the heuristic optimization algorithm to additional compute nodes by means of the MPICH parallel programming method, thereby improving the ability of the task scheduling method to find the optimal task scheduling scheme within a specified time.

The MPICH parallel programming method is mainly used to move the actual computation of the task scheduling method to dedicated compute nodes, thereby reducing the computing burden on the Master node. At the same time, to meet the timeliness requirements of job processing, the time spent computing the task scheduling scheme must be bounded: the hybrid algorithm may take too long in some cases, and task allocation cannot begin until it has finished. Therefore, using the MPICH parallel programming method, several standard particle swarm optimization runs and the hybrid meta-heuristic optimization based on simulated annealing and particle swarm optimization are carried out simultaneously on multiple compute nodes. Compared with the hybrid algorithm, the standard particle swarm optimization algorithm has a simpler computation and solves faster. Although the standard particle swarm algorithm tends to fall into local optimal solutions, it is randomized, so one way to improve the result is to run several swarms simultaneously so that they find as many local optimal solutions as possible, and then take the best solution for task scheduling. In this way a better task scheduling result can be achieved in a shorter time.

Within the time limit, if any swarm reaches the convergence condition specified for the objective function, that swarm transmits the task scheduling scheme it has computed to the Master node; after a certain time slice, the Master node compares the objective function values of the task scheduling schemes it has received and selects the scheme that makes the computing pressure of all Slave nodes in the cluster most even for task resource allocation. If a swarm fails to converge to the optimal solution within the time limit, the best solution currently held by that swarm is taken as the best task scheduling scheme it has found and is transmitted to the Master node. By comparing the objective function values of the best solutions of several swarms, the task scheduling scheme with the best objective function value can be found within the specified time as far as possible, so that no extra burden is placed on the Master node and both the timeliness and the optimization effect of the task scheduling algorithm are ensured.
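The selection pattern of step S4 (several independent searches with a shared deadline, best result wins) can be illustrated with the Python sketch below. It deliberately uses simplifications that are assumptions of this example: threads from the standard library stand in for MPICH processes on separate compute nodes, and a plain random search stands in for each particle swarm. Only the deadline handling and the comparison on the Master side are the point here.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import pvariance


def variance(assignment, task_slots, node_slots):
    demand = [0.0] * len(node_slots)
    for task, node in enumerate(assignment):
        demand[node] += task_slots[task]
    return pvariance([d / m for d, m in zip(demand, node_slots)])


def worker(seed, task_slots, node_slots, deadline, target=1e-9):
    # Stand-in for one swarm on one compute node: stops when the objective
    # reaches `target` (converged) or when the deadline passes, and in both
    # cases returns the best schedule found so far.
    rng = random.Random(seed)
    best, best_f = None, float("inf")
    while best is None or (best_f > target and time.monotonic() < deadline):
        cand = [rng.randrange(len(node_slots)) for _ in task_slots]
        f = variance(cand, task_slots, node_slots)
        if f < best_f:
            best, best_f = cand, f
    return best_f, best


def master_select(task_slots, node_slots, n_workers=4, time_limit=0.2):
    # Master side: collect one (objective, schedule) pair per worker and keep
    # the schedule with the lowest pressure variance.
    deadline = time.monotonic() + time_limit
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(
            lambda s: worker(s, task_slots, node_slots, deadline),
            range(n_workers)))
    return min(results, key=lambda r: r[0])


best_f, best = master_select([2, 2, 1, 1], [4, 2])
```

In an MPI setting the same pattern would typically be expressed as each rank running its own swarm and the root rank gathering the (objective, schedule) pairs before taking the minimum.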

The Hadoop load balancing task scheduling method based on the hybrid meta-heuristic algorithm addresses the problems of unbalanced load and of idle, under-used computing resources on some nodes that arise in a Hadoop cluster under the real-time scheduling algorithm, and therefore performs overall task scheduling over the Map resource slots of all Slave nodes in the cluster. A resource slot pressure model is established, whose aim is to bring the calculation pressure of all Slave nodes processing tasks in the cluster to the same level, and the optimal task scheduling scheme is solved by a hybrid meta-heuristic algorithm based on simulated annealing and particle swarm optimization, so that load balancing task scheduling in a Hadoop cluster environment is realized. Further, the algorithm is programmed in parallel through MPICH (MPI over Chameleon), a message passing interface implementation with high performance and wide portability; the calculation process of the heuristic optimization algorithm is transferred to additional computing nodes and solved by several populations at the same time, so that the calculation pressure on the Master node is reduced and the ability to solve for the optimal task scheduling scheme per unit time is improved. The invention can allocate the computing resources of the Hadoop cluster as a whole, so that the node load of the cluster is balanced, waste of node computing resources is avoided, and the return on the equipment investment of the data center is maximized.
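
The balancing objective can be illustrated by the following minimal Python sketch. It assumes, purely for illustration, that the pressure of a Slave node is the number of Map resource slots demanded by the tasks assigned to it divided by the number of slots the node provides; this definition, and the names node_pressures and pressure_variance, are assumptions and are not taken from the resource slot pressure model of the invention.

```python
from statistics import pvariance


def node_pressures(assignment, slots_needed, node_slots):
    """Assumed pressure per node: demanded slots divided by available slots."""
    demand = [0.0] * len(node_slots)
    for task, node in enumerate(assignment):
        demand[node] += slots_needed[task]
    return [d / capacity for d, capacity in zip(demand, node_slots)]


def pressure_variance(assignment, slots_needed, node_slots):
    """Objective to minimize: population variance of the node pressures."""
    return pvariance(node_pressures(assignment, slots_needed, node_slots))
```

With slots_needed = [2, 2, 4] and node_slots = [4, 4], the assignment [0, 0, 1] gives both nodes a pressure of 1.0 and a variance of 0, whereas [0, 1, 1] gives pressures of 0.5 and 1.5 and a variance of 0.25, so the first scheme is the more balanced one.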

It should be understood that the above embodiments are merely examples given to illustrate the present invention clearly and are not intended to limit its embodiments. Those skilled in the art can make other variations and modifications on the basis of the above description; the embodiments cannot all be listed exhaustively here, and obvious variations and modifications derived from the technical solution of the present invention remain within its scope of protection.

Claims

1. A Hadoop load balancing task scheduling method based on a hybrid meta-heuristic algorithm, characterized by comprising the following steps:

s4, transferring the complex calculation process of the heuristic optimization algorithm to additional computing nodes by adopting the MPICH parallel programming method, running a plurality of particle swarms simultaneously so that they find more local optimal solutions, and then extracting the best of these solutions for task scheduling;

the optimization target of the resource slot pressure model is to minimize the variance between the calculation pressures of the Slave nodes of the Hadoop cluster, and the resource slot pressure model is as follows:

2. The Hadoop load balancing task scheduling method based on the hybrid meta-heuristic algorithm of claim 1, wherein, according to the established resource slot pressure model, a task scheduling scheme is encoded as the particle coordinates of the particle swarm optimization algorithm, and the variance of the calculation pressure of the Slave nodes of the cluster under the task scheduling scheme represented by the current particle coordinates can be calculated from the particle coordinates by the designed objective function; the specific parameters and formulas of the particle swarm optimization algorithm based on the resource slot pressure model are as follows:

X_i = (x₁, x₂, x₃, ..., x_m);

V_i = (v₁, v₂, v₃, ..., v_m);

V_i′ = w*V_i + c₁*r₁*(pBest_i − X_i) + c₂*r₂*(gBest − X_i);

X_i′ = X_i + V_i′;
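
The velocity and coordinate update formulas above can be illustrated by the following minimal Python sketch. It assumes the usual particle swarm reading of the symbols (w as inertia weight, c₁ and c₂ as learning factors, r₁ and r₂ as random numbers in [0, 1)); the default parameter values and the decode helper, which rounds each coordinate to a Slave node index, are hypothetical and are not specified by the claim.

```python
import random


def pso_step(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5, rng=random):
    """One update: V' = w*V + c1*r1*(pBest - X) + c2*r2*(gBest - X), X' = X + V'."""
    r1, r2 = rng.random(), rng.random()
    new_v = [
        w * vi + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi)
        for xi, vi, pb, gb in zip(x, v, p_best, g_best)
    ]
    new_x = [xi + nvi for xi, nvi in zip(x, new_v)]
    return new_x, new_v


def decode(x, n_nodes):
    """Hypothetical decoding: round each coordinate and clamp it to a node index."""
    return [min(max(int(round(xi)), 0), n_nodes - 1) for xi in x]
```

Here each of the m coordinates of a particle corresponds to one Map subtask, and decode(x, n_nodes) yields the Slave node index assumed for that subtask, which can then be passed to the objective function.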

3. The Hadoop load balancing task scheduling method based on the hybrid meta-heuristic algorithm of claim 2, wherein the conditions for solving the resource slot pressure model by particle swarm optimization comprise: in order to solve the resource slot pressure model with the particle swarm optimization algorithm, the computing resources of all Slave nodes in the cluster need to be counted in advance; a queue of tasks to be processed needs to be generated and maintained for each Slave node during task scheduling; and, before the Map subtask allocation scheme is solved, the number of resource slots needed by each Map subtask needs to be obtained from the JobTracker, so that the solution can be carried out through the relevant formulas of the resource slot pressure model.

4. The Hadoop load balancing task scheduling method based on the hybrid meta-heuristic algorithm of claim 2, wherein solving the resource slot pressure model using the hybrid meta-heuristic algorithm based on simulated annealing and particle swarm optimization specifically comprises: after the particle coordinates and initial velocities in the particle swarm are initialized, calculating the initial annealing temperature T₀ from the difference between the objective function results of the best particle position, f_min, and the worst particle position, f_max, obtained for the current particle coordinates, together with the preset initial acceptance probability p_r; in each iteration, if the objective function value corresponding to the new particle coordinates calculated by the particle coordinate update rule is better than that corresponding to the current particle coordinates, the current particle coordinates are directly replaced by the new particle coordinates; if the objective function value corresponding to the new particle coordinates is worse than that corresponding to the current particle coordinates, whether the new coordinates are accepted is decided by the acceptance rule of the simulated annealing algorithm, and if they are not accepted the original particle coordinates remain unchanged; because the simulated annealing algorithm can accept suboptimal points, the hybrid algorithm is able, after the particle swarm falls into a local optimum, to jump out of it by accepting suboptimal points and thus finally converge to the global optimum, wherein the specific formulas are as follows:

Δf = f(X_i + V_i′) − f(X_i);

T_{i+1} = ξT_i;

wherein, of the formulas and parameters relating to the particle swarm, only the particle coordinate update formula is changed; T₀ is the initial annealing temperature; p_r is the initial acceptance probability; f_min and f_max are the minimum and maximum objective function fitness values after particle swarm initialization; T_i is the annealing temperature in the i-th iteration; ξ is the temperature decay coefficient; X_i′ represents the coordinates of the i-th particle after one iteration.
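
The acceptance step and the temperature decay T_{i+1} = ξT_i can be illustrated by the following minimal Python sketch. The expressions for the initial temperature and the acceptance probability are not reproduced in this text, so the sketch substitutes common simulated annealing choices as assumptions: the Metropolis rule exp(−Δf/T) for acceptance, and an initial temperature chosen so that a move worse by f_max − f_min is accepted with probability p_r. The function names and the default value of ξ are hypothetical.

```python
import math
import random


def initial_temperature(f_min, f_max, p_r):
    """Assumed T0: exp(-(f_max - f_min) / T0) equals p_r, with 0 < p_r < 1."""
    return -(f_max - f_min) / math.log(p_r)


def accept(delta_f, temperature, rng=random):
    """Assumed Metropolis rule applied to delta_f = f(X_i + V_i') - f(X_i)."""
    if delta_f < 0:
        return True  # the new coordinates are better: always accept
    if temperature <= 0:
        return False  # no temperature left: never accept a worse point
    return rng.random() < math.exp(-delta_f / temperature)


def cool(temperature, xi=0.95):
    """Temperature decay T_{i+1} = xi * T_i."""
    return xi * temperature
```

In a hybrid iteration, a particle would move to X_i + V_i′ only when accept(delta_f, T_i) returns True, after which T_i would be replaced by cool(T_i); as the temperature falls, worse coordinates are accepted less and less often.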

5. The Hadoop load balancing task scheduling method based on the hybrid meta-heuristic algorithm of claim 1, wherein, within the time limit, if any population satisfies the termination condition specified for the objective function, that population transmits its computed task scheduling scheme to the Master node; after a set time slice, the Master node compares the objective function values of the task scheduling schemes it has received and selects, for task resource allocation, the scheme that makes the calculation pressure of all Slave nodes in the cluster most even; and if a population fails to reach the convergence condition within the time limit, the best solution currently held by that population is taken as the best task scheduling scheme it has found and is transmitted to the Master node.