CN108170530B - Hadoop load balancing task scheduling method based on mixed element heuristic algorithm - Google Patents

Publication number
CN108170530B
Authority
CN
China
Prior art keywords
particle
task scheduling
algorithm
cluster
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711433347.5A
Other languages
Chinese (zh)
Other versions
CN108170530A (en
Inventor
毕敬
程煜东
乔俊飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201711433347.5A
Publication of CN108170530A
Application granted
Publication of CN108170530B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083: Techniques for rebalancing the load in a distributed system
    • G06F9/5088: Techniques for rebalancing the load in a distributed system involving task migration

Abstract

The invention relates to a Hadoop load balancing task scheduling method based on a hybrid meta-heuristic algorithm. A resource slot pressure model is established whose goal is to bring the computing pressure of the pending tasks on all Slave nodes in a cluster to the same level, and the optimal task scheduling scheme is solved with a hybrid meta-heuristic algorithm based on simulated annealing and particle swarm optimization, thereby achieving load-balanced task scheduling in a Hadoop cluster environment. Furthermore, the algorithm is parallelized with MPICH (MPI over Chameleon), a high-performance and widely portable implementation of the Message Passing Interface: the computation of the heuristic optimization algorithm is moved to additional compute nodes and several swarms are solved simultaneously, which reduces the computing pressure on the Master node and improves the ability to find the optimal task scheduling scheme per unit time. The invention allocates the computing resources of the Hadoop cluster as a whole, so that node loads are balanced, node computing resources are not wasted, and the return on the data center's equipment investment is maximized.

Description

Hadoop load balancing task scheduling method based on mixed element heuristic algorithm
Technical Field
The invention relates to the field of task scheduling under the Hadoop MapReduce architecture. More specifically, it relates to a Hadoop task scheduling algorithm that aims to balance cluster load by means of a particle swarm optimization algorithm, a hybrid meta-heuristic algorithm based on simulated annealing and particle swarm optimization, and the MPICH parallel programming method.
Background
With the rapid development of mobile smart devices, the information age is advancing ever faster, and massive amounts of data are generated, actively or passively, as users use the network. The value of such data usually cannot be mined with traditional statistical or computational methods; once the potential value behind the data is mined, however, it can bring huge benefits to enterprises and governments. For example, Taobao can infer a user's product preferences and needs by analyzing the user's product browsing records and push targeted products on the home page, thereby guiding purchases; video and music service providers can summarize user preferences from historical usage data and improve their service through targeted recommendation, giving users a better experience; and government agencies can perform public opinion analysis on the trending topics that users follow on social platforms, so as to better maintain social stability. Traditional computing modes cannot meet the requirements of processing such big data, so enterprises and government organizations need computer clusters with strong computing power to achieve these goals.
However, building and maintaining a data center is extremely expensive, and most small and medium-sized enterprises are unable to build a large-scale data center to meet their business requirements; the pay-as-you-go cloud service model is of great help to them. By purchasing the services of a cloud data center, an enterprise can deploy a cluster of hundreds of servers within a few hours, at a much lower cost and with far more convenience than building a traditional data center. As the business changes later, users can conveniently adjust their cloud computing resources to meet business requirements in real time. The development of cloud data centers also benefits large multinational companies, which can purchase services from cloud data centers in different regions of the world. This not only saves the large labor and material costs of building traditional data centers, but also gives users distributed around the world faster response times and a better user experience. For the data center itself, making the fullest possible use of its computing capacity, so as to provide more efficient computing services and process more user tasks faster and better in the same amount of time, is an important way to obtain more revenue.
Hadoop is a distributed system infrastructure capable of distributed processing of large amounts of data. Its MapReduce distributed processing framework is simple in principle and performs well, and many data centers currently use it to process big data. Task scheduling is an essential link in making a Hadoop cluster actually work. Task scheduling allocates cluster computing resources through a task scheduling algorithm so that pending tasks receive sufficient resources. A good task scheduling algorithm must not only process tasks faster (that is, give users a faster response), but also let every machine in the cluster exert its computing power: machines running in a data center generally need to be replaced after about five years of service, so if machines sit idle for long periods or their computing power is underused, the data center suffers a considerable loss. Traditional task scheduling algorithms fall into two types: real-time task scheduling algorithms and heuristic task scheduling algorithms. The core idea of a real-time scheduling algorithm is to schedule jobs as they arrive and allocate the computing resources they require; its advantages are short response time and low scheduling overhead. A heuristic task scheduling algorithm takes the resources of all machines in the cluster into account and searches a given solution space (bounded by the cluster's actual resources) for an optimum, with load balance or fastest processing speed as the objective, so that global resource allocation is achieved and the cluster's computing resources are better utilized.
However, the number of nodes in a cluster is often very large, so solving with a heuristic algorithm is complicated and incurs considerable extra overhead. In the traditional Hadoop task scheduling mode, heuristic task scheduling places a heavy burden on the Master node (the main node, responsible for job scheduling and distribution in Hadoop), which affects the stability of the cluster. In addition, heuristic algorithms such as particle swarm optimization, ant colony optimization, and simulated annealing easily fall into local optima, so their performance in actual scheduling is not stable.
To address these shortcomings of heuristic task scheduling in Hadoop, the invention provides a task scheduling method that aims to improve the utilization of cluster resources, based on the MapReduce distributed processing framework, the MPICH parallel programming method, the particle swarm optimization algorithm, and a hybrid meta-heuristic algorithm that combines particle swarm optimization with simulated annealing.
Disclosure of Invention
The invention aims to provide a task scheduling method, based on the Hadoop architecture, that improves the utilization of distributed cluster computing resources. It takes into account both the tendency of heuristic algorithms to fall into local optima and their poor fit with Hadoop task scheduling. The method assigns each machine in the cluster an appropriate number of pending tasks according to its computing power, so that the computing pressure of the pending tasks on every machine is at the same level. This balances the load of the task processing nodes (Slave nodes), improves the utilization of cluster computing resources, saves cost, and increases profit. By optimizing several particle swarms at the same time, including swarms that use heuristic optimization algorithms such as particle swarm optimization based on simulated annealing, the method improves the scheduling algorithm's ability to escape local optima. Using the MPICH parallel programming method, the solving process of the heuristic algorithm is distributed across several machines, which avoids adding computing pressure to the Master node of the Hadoop cluster and improves cluster stability.
To achieve this purpose, the invention adopts the following technical solution:
To assign each task processing node a number of pending tasks that matches its computing resources, and to keep the computing pressure across task processing nodes at the same level, the invention builds on the concept of the resource slot in the MapReduce distributed framework. The resource slot (Slot) is the unit into which node resources are partitioned in the traditional MapReduce distributed processing framework; a node determines its total number of Slots from its computing capacity and total memory. Owing to the characteristics of the MapReduce framework, Slots are divided into Map Slots and Reduce Slots: Map Slots are dedicated to processing Map subtasks and Reduce Slots to processing Reduce subtasks. When Hadoop processes a job, the job requests Slot resources from the JobTracker in advance. The resource slot pressure model is established on the basis of this characteristic.
To apply a heuristic algorithm to Hadoop task scheduling without adding extra computing burden to the Master node, and to improve the algorithm's ability to escape local optima, the invention uses the MPICH parallel programming method. The computation of the task scheduling algorithm is executed on additional compute nodes, where several particle swarms, including swarms based on the simulated annealing algorithm, are optimized simultaneously. The result of the swarm with the best optimization result at a set time is used for the actual task distribution. This improves the algorithm's ability to escape locally optimal solutions and avoids burdening the Master node.
In summary, a Hadoop load balancing task scheduling method based on a hybrid meta-heuristic algorithm includes the following steps:
S1, with the goal of balancing the computing pressure of the pending tasks on the task processing nodes, establishing a resource slot pressure model according to the resource slot principle;
S2, solving the resource slot pressure model with a particle swarm optimization algorithm;
S3, solving the resource slot pressure model with a hybrid meta-heuristic optimization algorithm based on simulated annealing and particle swarm optimization;
S4, using the MPICH parallel programming method to move the complex computation of the heuristic optimization algorithm to additional compute nodes, running several particle swarms simultaneously so that they find more locally optimal solutions, and then selecting the best solution for task scheduling.
Preferably, the optimization goal of the resource slot pressure model is to minimize the variance (Variance) among the computing pressures of the Slave nodes of the Hadoop cluster. The resource slot pressure model is as follows:

$$\mathrm{Pressure} = \frac{\sum_{i=1}^{m} t_i}{M}$$

$$\mathrm{Average} = \frac{1}{s}\sum_{i=1}^{s} \mathrm{Pressure}_i$$

$$\mathrm{Variance} = \frac{1}{s}\sum_{i=1}^{s} \left(\mathrm{Pressure}_i - \mathrm{Average}\right)^2$$

where Pressure represents the computing pressure of the pending tasks on a Slave node, that is, the ratio of the sum of the Map resource slots required by the $m$ pending Map subtasks of the node to the total number $M$ of Map resource slots of the node; $t_i$ represents the number of Map resource slots required by the $i$-th pending task; Average represents the average computing pressure of the pending tasks over all Slave nodes in the cluster; $M_i$ represents the number of Map resource slots of the $i$-th Slave node in the cluster, and $\mathrm{Pressure}_i$ the computing pressure of that node; $s$ represents the number of Slave nodes in the cluster; and Variance represents the variance among the computing pressures of the pending tasks of the Slave nodes in the cluster.
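As a small worked illustration of the model (the numbers are invented for the example, and the variance is assumed to be the population variance, divided by $s$): take $s = 2$ Slave nodes with $M_1 = 4$ and $M_2 = 8$ Map resource slots. If the queue of node 1 holds tasks needing 2 and 2 slots and the queue of node 2 holds tasks needing 3 and 1 slots, then

$$\mathrm{Pressure}_1 = \frac{2+2}{4} = 1, \qquad \mathrm{Pressure}_2 = \frac{3+1}{8} = 0.5, \qquad \mathrm{Average} = 0.75,$$

$$\mathrm{Variance} = \frac{(1-0.75)^2 + (0.5-0.75)^2}{2} = 0.0625 .$$

Moving one 2-slot task from node 1 to node 2 gives

$$\mathrm{Pressure}_1 = \frac{2}{4} = 0.5, \qquad \mathrm{Pressure}_2 = \frac{3+1+2}{8} = 0.75, \qquad \mathrm{Average} = 0.625,$$

$$\mathrm{Variance} = \frac{(0.5-0.625)^2 + (0.75-0.625)^2}{2} = 0.015625 ,$$

so under this model the second schedule is the better balanced of the two.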
Preferably, according to the established resource slot pressure model, a task scheduling scheme is encoded as the coordinates of a particle in the particle swarm optimization algorithm, and the objective function is designed to compute, from the particle coordinates, the pressure variance of the cluster's Slave nodes under the task scheduling scheme that the coordinates represent. The parameters and formulas of the particle swarm optimization algorithm based on the resource slot pressure model are as follows:

$$X_i = (x_1, x_2, x_3, \ldots, x_m)$$

$$V_i = (v_1, v_2, v_3, \ldots, v_m)$$

$$V_i' = w V_i + c_1 r_1 (\mathit{pBest}_i - X_i) + c_2 r_2 (\mathit{gBest} - X_i)$$

$$X_i' = X_i + V_i'$$

$$\mathit{pBest}_i' = \begin{cases} X_i', & f(X_i') < f(\mathit{pBest}_i) \\ \mathit{pBest}_i, & \text{otherwise} \end{cases}$$

$$\mathit{gBest}' = \begin{cases} \mathit{pBest}_i', & f(\mathit{pBest}_i') < f(\mathit{gBest}) \\ \mathit{gBest}, & \text{otherwise} \end{cases}$$

where $X_i = (x_1, x_2, x_3, \ldots, x_m)$ represents the coordinates of the $i$-th particle in the solution space; $x_m$ indicates that the $m$-th pending task is assigned to run on the $x_m$-th Slave node; $V_i = (v_1, v_2, v_3, \ldots, v_m)$ represents the velocity of the $i$-th particle; $V_i'$ represents the velocity of the $i$-th particle after it is updated according to the experience learned in the previous iteration; $w$ is the inertia weight; $c_1$ and $c_2$ are learning factors; $r_1$ and $r_2$ are random numbers in $[0,1]$; $\mathit{pBest}_i$ is the individual best point of the $i$-th particle; $\mathit{gBest}$ is the best point of the swarm; $X_i'$ represents the coordinates of the $i$-th particle after one iteration; $\mathit{pBest}_i'$ is the individual best point of the $i$-th particle after one iteration; $\mathit{gBest}'$ is the best point of the swarm after one iteration; and $f(X_i)$ is the objective function of the particle swarm optimization algorithm, which computes, from the correspondence between pending tasks and Slave nodes given by the particle coordinates $X_i$, the variance (Variance) of the computing pressure of the pending tasks on the Slave nodes in the cluster.
Preferably, solving the resource slot pressure model with the particle swarm optimization algorithm is subject to the following prerequisites: the computing resources of all Slave nodes in the cluster must be counted in advance; a pending task queue must be generated and maintained for each Slave node during task scheduling; and before the Map subtask allocation scheme is solved, the number of resource slots required by each Map subtask must be obtained from the JobTracker, so that the solution can be computed with the formulas of the resource slot pressure model.
Preferably, solving the resource slot pressure model with the hybrid meta-heuristic algorithm based on simulated annealing and particle swarm optimization specifically comprises the following. After the particle coordinates and initial velocities of the swarm are initialized, the initial annealing temperature $T_0$ is calculated from the difference between the objective function values of the best particle position, $f_{min}$, and the worst particle position, $f_{max}$, together with the preset initial acceptance probability $p_r$. In each iteration, if the objective function value of the new particle coordinates computed by the coordinate update rule is better than that of the current coordinates, the current coordinates are replaced by the new ones directly; if it is worse, the acceptance rule of the simulated annealing algorithm decides whether the new coordinates are accepted, and if they are not accepted the original coordinates are kept unchanged. Because the simulated annealing algorithm can accept inferior coordinates, the hybrid algorithm can escape a local optimum after the swarm has fallen into it by accepting an inferior point, and can thus finally converge to the global optimum. The specific formulas are as follows:

$$T_0 = -\frac{f_{max} - f_{min}}{\ln p_r}$$

$$\Delta f = f(X_i + V_i') - f(X_i)$$

$$X_i' = \begin{cases} X_i + V_i', & \Delta f < 0 \\ X_i + V_i', & \Delta f \ge 0 \text{ and } \exp(-\Delta f / T_i) > \mathrm{rand} \\ X_i, & \text{otherwise} \end{cases}$$

$$T_{i+1} = \xi T_i$$

where the formulas and parameters of the particle swarm algorithm are unchanged except for the particle coordinate update formula; $T_0$ is the initial annealing temperature; $p_r$ is the initial acceptance probability; $f_{min}$ and $f_{max}$ are the minimum and maximum objective function values after the swarm is initialized; $T_i$ is the annealing temperature in the $i$-th iteration; $\xi$ is the temperature decay coefficient; rand is a random number in $[0,1]$; and $X_i'$ represents the coordinates of the $i$-th particle after one iteration.
Preferably, within a limited time, if any swarm reaches the convergence condition specified for the objective function, that swarm sends the task scheduling scheme it has computed to the Master node. After a set time slice, the Master node compares the objective function values of the task scheduling schemes it has received and selects, for task resource allocation, the scheme that makes the computing pressure of all Slave nodes in the cluster most even. If a swarm fails to reach the convergence condition within the limited time, its current best solution is taken as the optimal task scheduling scheme found by that swarm and sent to the Master node.
The invention has the following beneficial effects:
The technical solution of the invention solves the problem that a heuristic scheduling algorithm used by a Hadoop cluster places extra computing burden on the Master node and affects the stability of the cluster, and it thereby overcomes the load imbalance and partial waste of node resources that occur when a real-time scheduling algorithm is used. Combining simulated annealing with particle swarm optimization improves the hybrid meta-heuristic algorithm's ability to escape local optima and the heuristic scheduling algorithm's ability to find the global optimum. The technical solution starts from load balancing across the nodes of the Hadoop cluster. Without adding extra computing load to the Master node, it fully considers the different computing capacities of the Slave nodes and allocates tasks according to capability. While ensuring the task processing efficiency and stability of the Hadoop cluster, it puts to use the idle computing resources that a real-time scheduling algorithm may leave in the cluster, so that the data center's investment in cluster construction yields a greater return and the data center's revenue increases.
Drawings
The following detailed description of embodiments of the invention is provided in conjunction with the appended drawings:
FIG. 1 shows a flow chart of a Hadoop load balancing task scheduling method based on a hybrid heuristic algorithm;
FIG. 2 shows a Hadoop cluster architecture diagram with the addition of additional compute nodes for algorithmic solution.
Detailed Description
In order to more clearly illustrate the invention, the invention is further described below with reference to preferred embodiments and the accompanying drawings. Similar parts in the figures are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.
As shown in FIG. 1 and FIG. 2, the Hadoop load balancing task scheduling method based on the hybrid meta-heuristic algorithm disclosed by the invention comprises the following steps:
S1, with the goal of balancing the computing pressure of the pending tasks on the task processing nodes, establishing a resource slot pressure model according to the resource slot principle.
The main objective of the resource slot pressure model is to bring the computing pressure that pending tasks place on each task-running child node (Slave node) in the cluster to the same level. In the actual resource allocation of a Hadoop cluster under the MapReduce distributed processing framework, resources are allocated to pending tasks in units of resource slots (Slots). The amount of computing resources in each Slot is preset in the Hadoop configuration, and a single Slot contains the same amount of computing resources on every Slave node, so a difference in computing capability between Slave nodes shows up as a difference in their number of Slots. In the resource slot pressure model, a node's computing pressure is quantified as the ratio of the total number of resource slots required by the node's pending tasks to the number of resource slots the node owns. In practical terms, the model requires a reasonable task scheduling method to bring the computing pressure of the pending tasks on all Slave nodes to the same level (that is, the smaller the variance, the better). With the task scheduling method of the invention, all Slave nodes execute tasks under similar computing pressure, so the cluster load is relatively balanced.
In the resource slot pressure model, the computing pressure of a node is Pressure, and the parameter that measures whether the computing pressures of all Slave nodes in the cluster are at the same level is Variance. The calculation model is as follows:

$$\mathrm{Pressure} = \frac{\sum_{i=1}^{m} t_i}{M}$$

$$\mathrm{Average} = \frac{1}{s}\sum_{i=1}^{s} \mathrm{Pressure}_i$$

$$\mathrm{Variance} = \frac{1}{s}\sum_{i=1}^{s} \left(\mathrm{Pressure}_i - \mathrm{Average}\right)^2$$

where Pressure represents the computing pressure of the pending tasks on a Slave node, that is, the ratio of the sum of the Map resource slots required by the $m$ pending Map subtasks of the node to the total number $M$ of Map resource slots of the node; $t_i$ represents the number of Map resource slots required by the $i$-th pending task; Average represents the average computing pressure of the pending tasks over all Slave nodes in the cluster; $M_i$ represents the number of Map resource slots of the $i$-th Slave node in the cluster, and $\mathrm{Pressure}_i$ the computing pressure of that node; $s$ represents the number of Slave nodes in the cluster; and Variance represents the variance among the computing pressures of the pending tasks of the Slave nodes in the cluster.
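The pressure model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `node_pressures` and `pressure_variance` are hypothetical, a schedule is represented as a list that maps each task index to a Slave node index, and the variance is taken as the population variance (divided by the number of nodes).

```python
from statistics import pvariance


def node_pressures(assignment, task_slots, node_slots):
    """Pressure of each Slave node: slots demanded by its queued tasks
    divided by the Map slots the node owns."""
    demand = [0] * len(node_slots)
    for task, node in enumerate(assignment):
        demand[node] += task_slots[task]
    return [d / m for d, m in zip(demand, node_slots)]


def pressure_variance(assignment, task_slots, node_slots):
    """Variance of the node pressures (population variance, an assumption)."""
    return pvariance(node_pressures(assignment, task_slots, node_slots))


if __name__ == "__main__":
    task_slots = [2, 2, 4, 4]   # Map slots required by each pending task
    node_slots = [4, 8]         # Map slots owned by each Slave node
    # Tasks 0 and 1 on node 0, tasks 2 and 3 on node 1: both nodes at pressure 1.0.
    print(node_pressures([0, 0, 1, 1], task_slots, node_slots))     # [1.0, 1.0]
    print(pressure_variance([0, 0, 1, 1], task_slots, node_slots))  # 0.0
    # Ignoring node capacity gives pressures 1.5 and 0.75 and a larger variance.
    print(pressure_variance([0, 1, 0, 1], task_slots, node_slots))  # 0.140625
```

A schedule that respects the difference in slot counts drives the variance toward zero, which is the quantity the later optimization steps minimize.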
S2, solving the resource slot pressure model with a particle swarm optimization algorithm.
According to the established resource slot pressure model, a task scheduling scheme is encoded as the coordinates of a particle in the particle swarm optimization algorithm, and the objective function is designed to compute, from the particle coordinates, the pressure variance of the cluster's Slave nodes under the task scheduling scheme that the coordinates represent. The parameters and formulas of the particle swarm optimization algorithm based on the resource slot pressure model are as follows:

$$X_i = (x_1, x_2, x_3, \ldots, x_m)$$

$$V_i = (v_1, v_2, v_3, \ldots, v_m)$$

$$V_i' = w V_i + c_1 r_1 (\mathit{pBest}_i - X_i) + c_2 r_2 (\mathit{gBest} - X_i)$$

$$X_i' = X_i + V_i'$$

$$\mathit{pBest}_i' = \begin{cases} X_i', & f(X_i') < f(\mathit{pBest}_i) \\ \mathit{pBest}_i, & \text{otherwise} \end{cases}$$

$$\mathit{gBest}' = \begin{cases} \mathit{pBest}_i', & f(\mathit{pBest}_i') < f(\mathit{gBest}) \\ \mathit{gBest}, & \text{otherwise} \end{cases}$$

where $X_i = (x_1, x_2, x_3, \ldots, x_m)$ represents the coordinates of the $i$-th particle in the solution space; $x_m$ indicates that the $m$-th pending task is assigned to run on the $x_m$-th Slave node; $V_i = (v_1, v_2, v_3, \ldots, v_m)$ represents the velocity of the $i$-th particle; $V_i'$ represents the velocity of the $i$-th particle after it is updated according to the experience learned in the previous iteration; $w$ is the inertia weight; $c_1$ and $c_2$ are learning factors; $r_1$ and $r_2$ are random numbers in $[0,1]$; $\mathit{pBest}_i$ is the individual best point of the $i$-th particle; $\mathit{gBest}$ is the best point of the swarm; $X_i'$ represents the coordinates of the $i$-th particle after one iteration; $\mathit{pBest}_i'$ is the individual best point of the $i$-th particle after one iteration; $\mathit{gBest}'$ is the best point of the swarm after one iteration; and $f(X_i)$ is the objective function of the particle swarm optimization algorithm, which computes, from the correspondence between pending tasks and Slave nodes given by the particle coordinates $X_i$, the variance (Variance) of the computing pressure of the pending tasks on the Slave nodes in the cluster.
To solve the resource slot pressure model with the particle swarm optimization algorithm, the computing resources of all Slave nodes in the cluster must be counted in advance, and a pending task queue must be generated and maintained for each Slave node during task scheduling. Before the Map subtask allocation scheme is solved, the number of resource slots required by each Map subtask must be obtained from the JobTracker, so that the solution can be computed with the formulas of the resource slot pressure model.
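Step S2 can be sketched as follows. This is an illustrative sketch only: the patent text does not state how continuous particle coordinates are mapped to node indices, so rounding and clamping to a valid index is an assumption here, as are the function names (`decode`, `objective`, `pso_schedule`), the parameter values for `w`, `c1`, `c2`, and the use of the population variance.

```python
import random
from statistics import pvariance


def decode(position, node_count):
    """Round each continuous coordinate to a valid Slave node index (assumption)."""
    return [min(node_count - 1, max(0, round(x))) for x in position]


def objective(position, task_slots, node_slots):
    """f(X): variance of node pressures under the schedule encoded by X."""
    demand = [0] * len(node_slots)
    for task, node in enumerate(decode(position, len(node_slots))):
        demand[node] += task_slots[task]
    return pvariance([d / m for d, m in zip(demand, node_slots)])


def pso_schedule(task_slots, node_slots, particles=20, iterations=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    m, s = len(task_slots), len(node_slots)

    def f(x):
        return objective(x, task_slots, node_slots)

    # One coordinate per pending task; its value selects a Slave node.
    X = [[rng.uniform(0, s - 1) for _ in range(m)] for _ in range(particles)]
    V = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(particles)]
    pbest = [x[:] for x in X]
    pbest_f = [f(x) for x in X]
    best = min(range(particles), key=pbest_f.__getitem__)
    gbest, gbest_f = pbest[best][:], pbest_f[best]

    for _ in range(iterations):
        for i in range(particles):
            r1, r2 = rng.random(), rng.random()
            # V' = w*V + c1*r1*(pBest - X) + c2*r2*(gBest - X)
            V[i] = [w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
                    for v, x, pb, gb in zip(V[i], X[i], pbest[i], gbest)]
            # X' = X + V', then kept inside the index range (assumption).
            X[i] = [min(s - 1, max(0, x + v)) for x, v in zip(X[i], V[i])]
            fx = f(X[i])
            if fx < pbest_f[i]:
                pbest[i], pbest_f[i] = X[i][:], fx
            if fx < gbest_f:
                gbest, gbest_f = X[i][:], fx
    return decode(gbest, s), gbest_f


if __name__ == "__main__":
    assignment, variance = pso_schedule([2, 2, 4, 4, 1, 3], [4, 8, 4])
    print(assignment, variance)
```

The returned list gives, for each pending task, the index of the Slave node whose queue it joins; the second value is the pressure variance of that schedule.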
S3, solving the resource slot pressure model with a hybrid meta-heuristic optimization algorithm that combines simulated annealing with particle swarm optimization.
The main purpose of the hybrid meta-heuristic optimization algorithm is to give particle swarm optimization the ability to escape local optima, by means of the simulated annealing algorithm's possibility of accepting inferior coordinates, thereby improving the algorithm's ability to search for the global optimum and the effectiveness of the task scheduling algorithm. The specific formulas of the hybrid meta-heuristic algorithm are as follows:

$$T_0 = -\frac{f_{max} - f_{min}}{\ln p_r}$$

$$\Delta f = f(X_i + V_i') - f(X_i)$$

$$X_i' = \begin{cases} X_i + V_i', & \Delta f < 0 \\ X_i + V_i', & \Delta f \ge 0 \text{ and } \exp(-\Delta f / T_i) > \mathrm{rand} \\ X_i, & \text{otherwise} \end{cases}$$

$$T_{i+1} = \xi T_i$$

where the formulas and parameters of the particle swarm algorithm are unchanged except for the particle coordinate update formula; $T_0$ is the initial annealing temperature; $p_r$ is the initial acceptance probability; $f_{min}$ and $f_{max}$ are the minimum and maximum objective function values after the swarm is initialized; $T_i$ is the annealing temperature in the $i$-th iteration; $\xi$ is the temperature decay coefficient; and rand is a random number in $[0,1]$.
The basic principle of the hybrid meta-heuristic algorithm based on simulated annealing and particle swarm optimization is as follows. After the particle coordinates and initial velocities of the swarm are initialized, the initial annealing temperature $T_0$ is calculated from the difference between the objective function values of the best particle position, $f_{min}$, and the worst particle position, $f_{max}$, together with the preset initial acceptance probability $p_r$. In each iteration, if the objective function value of the new particle coordinates computed by the coordinate update rule is better than that of the current coordinates, the current coordinates are replaced by the new ones directly; if it is worse, the acceptance rule of the simulated annealing algorithm decides whether the new coordinates are accepted, and if they are not accepted the original coordinates are kept unchanged. Because the simulated annealing algorithm may accept inferior coordinates, the hybrid algorithm has a chance of escaping a local optimum after the swarm has fallen into it by accepting an inferior point, so that it can finally converge to the global optimum.
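The simulated annealing ingredients of the hybrid algorithm can be sketched separately from the swarm loop. The sketch below is an assumption-laden illustration rather than the patent's code: it uses the common Metropolis-style acceptance rule and the common choice of initial temperature at which a worsening of $f_{max} - f_{min}$ is accepted with probability $p_r$; the function names (`initial_temperature`, `accept`, `cool`, `sa_move`) are hypothetical, and a move that leaves the objective unchanged is treated as accepted.

```python
import math
import random


def initial_temperature(f_min, f_max, p_r):
    """T0 such that a worsening of (f_max - f_min) is accepted with probability p_r."""
    return -(f_max - f_min) / math.log(p_r)


def accept(delta_f, temperature, rng):
    """Always accept an improvement; accept a worse point with probability exp(-delta_f / T)."""
    if delta_f < 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-delta_f / temperature)


def cool(temperature, xi):
    """Geometric cooling: T_{i+1} = xi * T_i."""
    return xi * temperature


def sa_move(x, v_new, f, temperature, rng):
    """Replacement for the plain PSO update X' = X + V': the candidate is
    kept only if the annealing rule accepts it, otherwise X is unchanged."""
    candidate = [xi + vi for xi, vi in zip(x, v_new)]
    delta_f = f(candidate) - f(x)
    return candidate if accept(delta_f, temperature, rng) else x


if __name__ == "__main__":
    rng = random.Random(0)
    t0 = initial_temperature(f_min=0.0, f_max=2.0, p_r=0.5)
    # At T0, a worsening equal to f_max - f_min is accepted about half the time.
    print(t0, math.exp(-2.0 / t0))
    print(cool(t0, xi=0.9))
```

In a hybrid swarm, `sa_move` would stand in for the coordinate update of each particle, and `cool` would be applied once per iteration, so that inferior moves become increasingly unlikely as the search proceeds.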
S4, using the MPICH parallel programming method to transfer the complex computation of the heuristic optimization algorithm to additional computing nodes, which improves the ability of the task scheduling method to find the optimal task scheduling scheme within a specified time.
The MPICH parallel programming method is used mainly to move the actual computation of the task scheduling method to dedicated computing nodes, which reduces the computational burden on the Master node. At the same time, to meet the timeliness requirement of job processing, the time spent computing a task scheduling scheme must be bounded: the hybrid algorithm may take too long in some cases, and task allocation cannot start until a solution is available. MPICH therefore allows several standard particle swarm optimization runs and the hybrid heuristic optimization based on simulated annealing and particle swarm optimization to proceed simultaneously on multiple computing nodes. Compared with the hybrid heuristic algorithm, the standard particle swarm optimization algorithm has a simpler computation and reaches a solution faster. Although the standard particle swarm algorithm tends to fall into local optima, it is randomized, so one way to improve its results is to run several swarms at the same time so that they find as many different local optima as possible, and then to take the best of these solutions for task scheduling. In this way a better task scheduling result can be obtained in a shorter time.
Within the time limit, if any population satisfies the constraint condition specified by the objective function, that population sends its computed task scheduling scheme to the Master node. After a certain time slice, the Master node compares the objective function values of the task scheduling schemes it has received and selects, for task resource allocation, the scheme that makes the computing pressure of all Slave nodes in the cluster most even. If a population fails to converge to the optimal solution within the time limit, its current population-best solution is taken as the best task scheduling scheme found by that population and is sent to the Master node. By comparing the objective function values of the population-best solutions of multiple populations, the task scheduling scheme with the best objective function value can be found, as far as possible, within the specified time. This places no extra burden on the Master node and secures both the timeliness and the optimization quality of the task scheduling algorithm.
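The time-limited, multi-population selection described above can be illustrated with a small Python sketch. It is not an MPICH program: threads from Python's standard `concurrent.futures` module stand in for the additional computing nodes, and a plain random search stands in for each swarm, so that the example stays self-contained. The names `run_population` and `master_select`, and all parameter values, are hypothetical.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def run_population(seed, objective, n_tasks, n_nodes, time_limit, target):
    """Stand-in for one swarm: search until the target is met or time runs out."""
    rng = random.Random(seed)
    deadline = time.monotonic() + time_limit
    best, best_value = None, float("inf")
    while True:
        # Candidate schedule: candidate[j] is the node index for task j.
        candidate = [rng.randrange(n_nodes) for _ in range(n_tasks)]
        value = objective(candidate)
        if value < best_value:
            best, best_value = candidate, value
        # Stop early on reaching the target; otherwise report the best
        # solution found so far once the time limit expires.
        if best_value <= target or time.monotonic() >= deadline:
            return best_value, best


def master_select(objective, n_tasks, n_nodes, populations=4,
                  time_limit=0.2, target=0.0):
    """Role of the Master node: collect each population's best, keep the minimum."""
    with ThreadPoolExecutor(max_workers=populations) as pool:
        futures = [
            pool.submit(run_population, seed, objective,
                        n_tasks, n_nodes, time_limit, target)
            for seed in range(populations)
        ]
        results = [future.result() for future in futures]
    return min(results, key=lambda result: result[0])
```

Each worker returns a `(objective value, schedule)` pair, so the selection step on the master side is a single minimum over the reported values and does not repeat any of the search work.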
The Hadoop load balancing task scheduling method based on the mixed element heuristic (hybrid meta-heuristic) algorithm addresses the load imbalance and the idle, under-used computing resources on some nodes of a Hadoop cluster under real-time scheduling algorithms by scheduling tasks globally over the Map resource slots of all Slave nodes in the cluster. A resource slot pressure model is established whose goal is to bring the computing pressure of the tasks processed by all Slave nodes in the cluster to the same level, and the optimal task scheduling scheme is solved with the hybrid algorithm based on simulated annealing and particle swarm optimization, achieving load-balanced task scheduling in a Hadoop cluster environment. Furthermore, the algorithm is parallelized with MPICH (MPI over Chameleon), a high-performance and widely portable implementation of the Message Passing Interface: the computation of the heuristic optimization algorithm is moved to additional computing nodes and several populations are solved at the same time, which reduces the computing pressure on the Master node and improves the ability to find the optimal task scheduling scheme per unit time. The invention can allocate the computing resources of a Hadoop cluster as a whole, so that the load across cluster nodes is balanced, node computing resources are not wasted, and the return on a data center's equipment investment is maximized.
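As an illustration of the resource slot pressure model summarized above, the following Python sketch computes the per-node pressure and the variance that serves as the optimization objective. The patent's formula images are not reproduced on this page, so the sketch assumes that Average is the arithmetic mean of the per-node pressures and that Variance is the population variance (divided by the number of Slave nodes S). The function and parameter names are hypothetical.

```python
def node_pressures(assignment, task_slots, node_slots):
    """Pressure of each Slave node: required Map slots / available Map slots.

    assignment[j] is the index of the Slave node that runs pending Map subtask j,
    task_slots[j] is the number of Map slots that subtask needs, and
    node_slots[k] is the total number of Map slots on Slave node k.
    """
    required = [0] * len(node_slots)
    for task, node in enumerate(assignment):
        required[node] += task_slots[task]
    return [req / total for req, total in zip(required, node_slots)]


def pressure_variance(assignment, task_slots, node_slots):
    """Objective to minimize. ASSUMED: mean of pressures, population variance."""
    pressures = node_pressures(assignment, task_slots, node_slots)
    average = sum(pressures) / len(pressures)
    return sum((p - average) ** 2 for p in pressures) / len(pressures)
```

For example, with `task_slots = [2, 1, 1, 2]` and `node_slots = [4, 2]`, the assignment `[0, 0, 0, 1]` loads both nodes to a pressure of 1.0 and gives a variance of 0, whereas placing every task on node 0 gives pressures of 1.5 and 0.0.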
It should be understood that the above embodiments are merely examples given to illustrate the present invention clearly and do not limit its embodiments. Those skilled in the art can make other variations and modifications on the basis of the above description; it is not possible to list all embodiments exhaustively here, and obvious variations and modifications derived from the present invention remain within its scope of protection.

Claims (5)

1. A Hadoop load balancing task scheduling method based on a mixed element heuristic algorithm, characterized by comprising the following steps:
S1, with the goal of balancing the computing pressure of the tasks handled by the task processing nodes, establishing a resource slot pressure model according to the resource slot principle;
S2, solving the resource slot pressure model by adopting a particle swarm optimization algorithm;
S3, solving the resource slot pressure model by adopting a mixed element heuristic optimization algorithm based on simulated annealing and the particle swarm optimization algorithm;
S4, transferring the complex computation of the heuristic optimization algorithm to additional computing nodes by adopting an MPICH parallel programming method, running a plurality of particle swarms simultaneously so that the swarms find more local optimal solutions, and then extracting the best solution for task scheduling;
wherein the optimization target of the resource slot pressure model is to minimize the Variance between the computing pressures of the Slave nodes of the Hadoop cluster, and the resource slot pressure model is as follows:
Figure FDA0003122581170000011
Figure FDA0003122581170000012
Figure FDA0003122581170000013
wherein Pressure represents the computing pressure of the pending tasks on a Slave node, obtained as the ratio of the sum of the Map resource slots required by the m pending Map subtasks on that node to the total number M of Map resource slots of the node; ti represents the number of Map resource slots required by the i-th pending task; Average represents the average of the computing pressures of the pending tasks over all Slave nodes in the cluster; Mi represents the number of Map resource slots of the i-th Slave node in the cluster; S represents the number of Slave nodes in the cluster; and Variance represents the variance among the computing pressures of the pending tasks of the Slave nodes in the cluster.
2. The Hadoop load balancing task scheduling method based on the hybrid heuristic algorithm of claim 1, wherein, according to the established resource slot pressure model, a task schedule is encoded as the particle coordinates of the particle swarm optimization algorithm, and the variance of the computing pressure of the Slave nodes of the cluster under the task scheduling scheme represented by the current particle coordinates is calculated from the particle coordinates by the designed objective function; the specific parameters and formulas of the particle swarm optimization algorithm based on the resource slot pressure model are as follows:
Xi(x1,x2,x3,...,xm);
Vi(v1,v2,v3,...,vm);
Vi'=w*Vi+c1*r1*(pBesti-Xi)+c2*r2*(gBest-Xi);
Xi′=Xi+Vi′;
Figure FDA0003122581170000021
Figure FDA0003122581170000022
wherein Xi(x1,x2,x3,...,xm) represents the coordinates of the i-th particle in the solution space; xm indicates that the m-th pending task is assigned to run on the xm-th Slave node; Vi(v1,v2,v3,...,vm) represents the velocity of the i-th particle; Vi' represents the velocity of the i-th particle after it is updated according to the learning experience of the previous iteration; w is the inertia weight; c1 and c2 are learning factors; r1 and r2 are random numbers in [0,1]; pBesti is the individual best point of the i-th particle; gBest is the best point of the population; Xi' represents the coordinates of the i-th particle after one iteration; pBesti' is the individual best point of the i-th particle after one iteration update; gBest' is the best point of the population after one iteration update; f(Xi) is the objective function of the particle swarm optimization algorithm, which, from the correspondence between pending tasks and Slave nodes given by the particle coordinates Xi(x1,x2,x3,...,xm), calculates the Variance of the computing pressure of the pending tasks of the Slave nodes in the cluster.
3. The Hadoop load balancing task scheduling method based on the hybrid heuristic algorithm of claim 2, wherein the constraints for solving the resource slot pressure model by the particle swarm optimization algorithm comprise: the computing resources of all Slave nodes in the cluster are counted in advance; a pending task queue is generated and maintained for each Slave node during task scheduling; and, before the Map subtask allocation scheme is solved, the number of resource slots required by each Map subtask is obtained from the JobTracker, so that the solution can be carried out through the formulas of the resource slot pressure model.
4. The Hadoop load balancing task scheduling method based on the mixed element heuristic algorithm as claimed in claim 2, wherein solving the resource slot pressure model using the mixed element heuristic algorithm based on simulated annealing and the particle swarm optimization algorithm specifically comprises: after the particle coordinates and initial velocities of the particle swarm are initialized, calculating the initial annealing temperature T0 from the difference between the objective function values of the best particle position (fmin) and the worst particle position (fmax) under the current particle coordinates and from the preset initial acceptance probability pr; in each iteration, if the objective function value of the new particle coordinates computed by the particle coordinate update rule is better than that of the current particle coordinates, replacing the current particle coordinates directly with the new particle coordinates, and if it is worse, judging whether to accept the new coordinates according to the acceptance rule of the simulated annealing algorithm, the original particle coordinates being kept unchanged if the new coordinates are not accepted; because the simulated annealing algorithm can accept a worse point, the hybrid algorithm is able, after the particle swarm falls into a local optimum, to jump out of it by accepting a worse point and finally converge to the global optimum; the specific formulas are as follows:
Figure FDA0003122581170000031
Δf=f(Xi+Vi')-f(Xi);
Figure FDA0003122581170000032
T(i+1)=ξ*Ti;
wherein, of the formulas and parameters of the particle swarm, only the particle coordinate update formula is changed; T0 is the initial annealing temperature; pr is the initial acceptance probability; fmin and fmax are the minimum and maximum objective function fitness values after particle swarm initialization; Ti is the annealing temperature in the i-th iteration; ξ is the temperature decay coefficient; and Xi' represents the coordinates of the i-th particle after one iteration.
5. The Hadoop load balancing task scheduling method based on the hybrid heuristic algorithm of claim 1, wherein, within a limited time, if any population satisfies the constraint condition specified by the objective function, that population transmits its computed task scheduling scheme to the Master node; after a certain time slice, the Master node compares the objective function values of the task scheduling schemes obtained and selects, for task resource allocation, the task scheduling scheme under which the computing pressure of all Slave nodes in the cluster is most even; and if a population fails to reach the convergence condition within the limited time, the population-best solution of the current population is taken as the best task scheduling scheme found by that population and is transmitted to the Master node.
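For illustration only, the particle encoding and update formulas recited in claim 2 can be sketched in Python as follows. The velocity and position updates follow the formulas Vi'=w*Vi+c1*r1*(pBesti-Xi)+c2*r2*(gBest-Xi) and Xi'=Xi+Vi'. Two parts are assumptions of this sketch, because the corresponding formulas are only available as images: each coordinate is rounded and clamped so that it remains a valid Slave node index, and pBest and gBest are updated by the standard rule of keeping the point with the smaller objective value. The function names and default parameter values are hypothetical.

```python
import random


def pso_step(x, v, p_best, g_best, n_nodes, w=0.7, c1=1.5, c2=1.5, rng=random):
    """One velocity and position update for a single particle.

    x[j] is the index of the Slave node that pending task j is assigned to.
    """
    r1, r2 = rng.random(), rng.random()  # random numbers in [0, 1)
    new_x, new_v = [], []
    for xj, vj, pj, gj in zip(x, v, p_best, g_best):
        vel = w * vj + c1 * r1 * (pj - xj) + c2 * r2 * (gj - xj)
        # ASSUMPTION: round and clamp so the coordinate stays a valid node index.
        pos = min(n_nodes - 1, max(0, round(xj + vel)))
        new_v.append(vel)
        new_x.append(pos)
    return new_x, new_v


def update_bests(x, p_best, g_best, f):
    """ASSUMED standard rule: keep whichever point has the smaller f value."""
    if f(x) < f(p_best):
        p_best = list(x)
    if f(p_best) < f(g_best):
        g_best = list(p_best)
    return p_best, g_best
```

Here `f` would be the pressure variance of the schedule that the coordinates encode, so a smaller value means a more evenly loaded cluster.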

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711433347.5A CN108170530B (en) 2017-12-26 2017-12-26 Hadoop load balancing task scheduling method based on mixed element heuristic algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711433347.5A CN108170530B (en) 2017-12-26 2017-12-26 Hadoop load balancing task scheduling method based on mixed element heuristic algorithm

Publications (2)

Publication Number Publication Date
CN108170530A CN108170530A (en) 2018-06-15
CN108170530B true CN108170530B (en) 2021-08-17

Family

ID=62521076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711433347.5A Active CN108170530B (en) 2017-12-26 2017-12-26 Hadoop load balancing task scheduling method based on mixed element heuristic algorithm

Country Status (1)

Country Link
CN (1) CN108170530B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522120B (en) * 2018-11-14 2022-10-11 重庆邮电大学 Intelligent home management platform based on Hadoop
CN109597682A (en) * 2018-11-26 2019-04-09 华南理工大学 A kind of cloud computing workflow schedule method using heuristic coding strategy
CN109711631B (en) * 2018-12-29 2021-09-07 杭州电子科技大学 Intelligent micro-grid optimized scheduling method for improving particle swarm algorithm
CN110275777B (en) * 2019-06-10 2021-10-29 广州市九重天信息科技有限公司 Resource scheduling system
CN111488209B (en) * 2020-03-22 2023-12-15 深圳市空管实业发展有限公司 Heuristic Storm node task scheduling optimization method
CN112217676B (en) * 2020-10-13 2023-01-31 北京工业大学 Kubernets container cluster node selection method based on mixed element heuristic algorithm
CN113850032B (en) * 2021-12-02 2022-02-08 中国空气动力研究与发展中心计算空气动力研究所 Load balancing method in numerical simulation calculation
CN116737394B (en) * 2023-08-14 2023-10-27 中海智(北京)科技有限公司 Dynamic adjustment security check centralized graph judging task allocation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819651A (en) * 2010-04-16 2010-09-01 浙江大学 Method for parallel execution of particle swarm optimization algorithm on multiple computers
CN103761146A (en) * 2014-01-06 2014-04-30 浪潮电子信息产业股份有限公司 Method for dynamically setting quantities of slots for MapReduce
CN104572297A (en) * 2014-12-24 2015-04-29 西安工程大学 Hadoop job scheduling method based on genetic algorithm
CN105183531A (en) * 2014-06-18 2015-12-23 华为技术有限公司 Distributed development platform and calculation method of same
CN107273209A (en) * 2017-06-09 2017-10-20 北京工业大学 The Hadoop method for scheduling task of improved adaptive GA-IAGA is clustered based on minimum spanning tree

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9183016B2 (en) * 2013-02-27 2015-11-10 Vmware, Inc. Adaptive task scheduling of Hadoop in a virtualized environment

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
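Efficient Hybrid framework for parallel Resource and Task scheduling in the Map reduce programming; S. Hemalatha et al.; 2016 International Conference on Computer Communication and Informatics; 20160109; pp. 1-7 *
Heuristic Virtual Machine Allocation for Multi-Tier Ambient Assisted Living Applications in a Cloud Data Center; Jing Bi et al.; China Communications; 20160530; pp. 56-65 *
Research on a load balancing strategy based on heterogeneous Hadoop clusters (基于异构Hadoop集群的负载均衡策略研究); Qin Jun (秦军) et al.; Computer Technology and Development (《计算机技术与发展》); 20170610; Vol. 27, No. 6; pp. 110-113 *
Cloud computing resource scheduling based on a hybrid optimization algorithm (基于混合优化算法的云计算资源调度); Ren Xiaojin (任小金) et al.; Computer Development & Applications (《电脑开发与应用》); 20141125; Vol. 27, No. 11; pp. 1-6 *
Research on hardware/software partitioning methods based on particle swarm simulated annealing and clustering algorithms (基于粒子群模拟退火和聚类算法的软硬件划分方法研究); Hu Daqing (胡大庆); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 20150115; I135-184 *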

Also Published As

Publication number Publication date
CN108170530A (en) 2018-06-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant