CN105808334B - MapReduce short-job optimization system and method based on resource reuse - Google Patents

MapReduce short-job optimization system and method based on resource reuse

Info

Publication number
CN105808334B
CN105808334B · CN201610124760.2A · CN201610124760A
Authority
CN
China
Prior art keywords
task
node
resource
short
locality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610124760.2A
Other languages
Chinese (zh)
Other versions
CN105808334A (en)
Inventor
史玉良
崔立真
李庆忠
郑永清
张开会
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN201610124760.2A
Publication of CN105808334A
Application granted
Publication of CN105808334B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027: Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5038: Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration

Abstract

The invention discloses a MapReduce short-job optimization system and method based on resource reuse. The system comprises a master node, a first-level slave node and several second-level slave nodes, wherein the master node is connected to the first-level slave node and the first-level slave node is connected to the second-level slave nodes. A resource manager and a first-level task scheduler are deployed on the master node; an application manager, a task performance estimator and a second-level task scheduler are deployed on the first-level slave node, with the second-level task scheduler connected to the task performance estimator and also to the master node; a node manager is deployed on each second-level slave node. The invention optimizes the running performance of short jobs from the perspective of resource utilization: it reduces the frequency of resource allocation and reclamation, uses the time otherwise spent allocating and reclaiming resources to run short jobs, and improves short-job execution performance by reducing the time jobs wait for resources.

Description

MapReduce short-job optimization system and method based on resource reuse
Technical field
The present invention relates to a MapReduce short-job optimization system and method based on resource reuse.
Background technology
Industries such as the Internet, finance and media face the challenge of processing large-scale datasets, and conventional data-processing tools and computation models cannot meet their requirements. The MapReduce model proposed by Google provides an effective solution, and Hadoop is an open-source implementation of MapReduce. Hadoop decomposes a submitted job into fine-grained Map tasks and Reduce tasks and runs these tasks in parallel on multiple nodes of a cluster, greatly shortening the running time of the job. Hadoop hides the details of parallel computation, distributes data to compute nodes, and re-runs failed tasks, allowing users to focus on their concrete business logic. Moreover, Hadoop offers good linear scalability, data redundancy and high fault tolerance, advantages that have made it the mainstream computation framework for data-intensive and compute-intensive applications. Hadoop's success in industry has prompted academia to study it and to propose improvements to its design and implementation.
Hadoop was originally designed to process relatively large jobs in parallel across many compute nodes, but in actual production it is frequently used to run small-scale short jobs. A short job is a job whose completion time is below a threshold that is usually set by the user. Small-scale short jobs differ from large jobs in many respects, such as the size of the input dataset, the number of tasks the job is decomposed into, the amount of resources a task needs, the completion time of a task, and the user's expectation of the job completion time. Because Hadoop does not take short jobs into account, short jobs run rather inefficiently on Hadoop.
Many factors affect job performance; the configuration of the cluster nodes, the job scheduling algorithm, and the cluster load are three of the most critical. When scheduling tasks, Hadoop assumes that the nodes of the cluster are homogeneous, i.e. that the CPU, memory, disk and other hardware of every node are identical. However, as business grows and the cluster is gradually expanded, newly added nodes are configured substantially better than the older ones. The same task therefore takes noticeably different amounts of time on different nodes. Under high cluster load, not all tasks of a job can obtain the resources they need, and some tasks enter a queue to wait for resources. When a task finishes and releases its resources, Hadoop uses the scheduling algorithm to select a suitable task from the waiting queue and assigns the freed resources to it. Consequently, if a job is decomposed into many tasks, the tasks of the same job may need several rounds of execution to complete. In Taobao's Hadoop cluster, for example, jobs whose Map tasks run for more than two rounds account for more than 30% of all jobs. The cluster load therefore has a decisive impact on the response time and completion time of a job.
In recent years, optimizing the execution performance of MapReduce jobs has become a research hotspot, and much work improves job execution efficiency from several angles, such as the Hadoop framework itself, the job scheduling algorithm, the job execution process and hardware accelerators.
Piranha concludes from historical job data that small jobs have few tasks and a low failure rate. Prior work exploits the small number of tasks to optimize the execution flow of small jobs, for example by starting Map tasks and Reduce tasks simultaneously and writing the intermediate results of Map tasks directly to the Reduce side. Since the failure rate of small jobs is below 1%, a simple fault-tolerance mechanism is adopted: if a job fails, Piranha re-executes the whole job rather than only the failed tasks. Other prior work decomposes batch jobs into a large number of small tasks, each processing only 8 MB of data, so that a task finishes its computation within a few seconds. Because a task's input is small and its running time short, data skew and straggler problems do not arise, and interactive jobs no longer wait a long time for resources. Tenzing provides a pool of job processes to reduce the turnaround time of MapReduce jobs. After a new job is submitted, the Tenzing scheduler selects an idle process from the pool to run it. Using a process pool reduces the cost of starting new jobs, but Tenzing has two drawbacks: the number of reserved processes can exceed what is actually needed, wasting cluster resources; and the Tenzing scheduler can only use certain specific processes, compromising the data locality of tasks.
Other prior work analyzes the shortcomings of the Hadoop framework when running short jobs and improves short-job efficiency at the framework level. The optimizations fall into three parts: 1) the setup task and the cleanup task are changed to run on the master node, avoiding job state changes via heartbeat messages and directly shortening the job completion time by one heartbeat period; 2) task assignment is changed from a "pull" model to a "push" model, reducing the latency of task dispatch; 3) control messages between the master node and the slave nodes are separated from the heartbeat mechanism and delivered immediately.
Spark is another computation framework for processing large-scale datasets and provides an engine that executes DAGs. The difference between Spark and Hadoop is that Spark computes in memory: the intermediate results of a job are kept in memory rather than on local disk or in HDFS, and subsequent jobs in the DAG read their input directly from memory. SpongeFiles aims to alleviate data skew, and is also quite effective at reducing job running time. Other prior work introduces a distributed memory layer to which the output of Map tasks is preferentially written, reducing both the time spent spilling data to local disk and the time spent reading Map intermediate results in the shuffle stage. This method is effective for jobs whose Map tasks output large amounts of data, but has no obvious effect on Map-only jobs or jobs with little Map output. All of these approaches improve job execution efficiency by reducing disk I/O.
The design goal of Sparrow is low-latency task scheduling. Sparrow keeps long-running task processes on each node and executes new tasks in these long-running processes, reducing the cost of frequently starting task processes. The number of long-running task processes is either set statically by the user or adjusted automatically by the resource manager according to the cluster load. Quincy is a task-level scheduler similar to Sparrow. Quincy maps the scheduling problem onto a graph used to compute an optimal dispatch order, taking data locality, fairness and starvation into account. Compared with Sparrow, Quincy takes longer to compute the dispatch order.
Hadoop's job scheduling algorithm has an important impact on job running time. The FIFO scheduler dispatches jobs in order of submission time. Because FIFO ignores differences between jobs, small jobs and interactive jobs execute rather inefficiently. The FAIR scheduler guarantees that jobs submitted by users share cluster resources fairly, ensures that short jobs complete within a reasonable time, and avoids job starvation. However, FAIR does not consider cluster heterogeneity or jobs with time constraints. HFSP estimates the size of a job while it is running and gives small jobs high priority, ensuring that they complete in the shortest possible time. Job priorities are also adjusted dynamically to prevent starvation.
Summary of the invention
The purpose of the present invention is to solve the above problems by providing a MapReduce short-job optimization system and method based on resource reuse. It optimizes the running performance of short jobs from the perspective of resource utilization: it reduces the frequency of resource allocation and reclamation, uses the time spent allocating and reclaiming resources to run short jobs, and improves short-job execution performance by reducing the time jobs wait for resources.
To achieve this goal, the present invention adopts the following technical scheme:
A MapReduce short-job optimization system based on resource reuse comprises: a master node, a first-level slave node and several second-level slave nodes, wherein the master node is connected to the first-level slave node, and the first-level slave node is connected to the several second-level slave nodes;
a resource manager and a first-level task scheduler are deployed on the master node;
an application manager, a task performance estimator and a second-level task scheduler are deployed on the first-level slave node, wherein the second-level task scheduler is connected to the task performance estimator and is also connected to the master node;
a node manager is deployed on each second-level slave node.
The resource manager is responsible for global resource allocation and monitoring, and for starting and monitoring application managers.
The first-level task scheduler schedules the job queue according to task priority, the resources required by a task, the task submission time, and so on.
The application manager decomposes a job into Map tasks and Reduce tasks, applies for resources for the Map and Reduce tasks, coordinates their execution with the node managers, and monitors the tasks. The application manager is the control unit of a running job; each job has its own application manager.
The task performance estimator predicts the completion times of running tasks and of unscheduled tasks based on a task performance model.
The second-level task scheduler judges, from the prediction of the task performance estimator, whether a currently executing task is a short task, and selects a task from the unscheduled task queue.
If the executing task is a short task, the second-level task scheduler selects a new short task from the unscheduled task queue, and the new short task reuses the resources that the executing short task is about to release; if the executing task is not a short task, its resources are released directly after it finishes.
When selecting an unscheduled task, the second-level task scheduler takes into account the locality of the task, the running time of the task, resource fairness and the heterogeneity of the cluster.
The node manager monitors the amount of resources used by tasks and prevents a running task from exceeding the amount of resources it applied for.
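For illustration only, the sketch below outlines how the two components added to the application manager might be wired together in Java (Hadoop's implementation language). The interface shapes and names are assumptions made for this sketch; they are not Hadoop APIs.

```java
// Minimal sketch of the two components added to the application manager.
// All names and signatures are hypothetical; this is not Hadoop code.
interface TaskPerformanceEstimator {
    // Predicted completion time (seconds) of a running task from its latest statistics.
    double predictRunningTask(TaskStats stats);
    // Predicted running time (seconds) of an unscheduled task on a given node.
    double predictUnscheduledTask(String taskId, String node);
}

interface SubScheduler {
    // Called when a statistics heartbeat arrives; returns a task that should
    // reuse the sender's resources, or null if the sender is not a short task.
    String selectTaskForReuse(String runningTaskId, String node);
}

// Plain data carrier for the per-task statistics reported in heartbeats.
class TaskStats {
    double bytesRead, bytesMapped, bytesSpilled;   // processed data volumes
    double readRate, mapRate, spillRate;           // bytes per second
    double outputRatio;                            // output volume / input volume
    double progress;                               // 0.0 .. 1.0
}
```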
A MapReduce short-job optimization method based on resource reuse comprises the following steps:
Step (1): the application manager applies to the resource manager for resources through heartbeat messages;
Step (2): the resource manager allocates idle resources to the application manager that applied for them, and the application manager obtains the allocated resources;
Step (3): the application manager assigns the obtained resources to unscheduled tasks, and then notifies the corresponding node managers to start task processes;
Step (4): a node manager starts a task process to run a task; during execution, the task sends a heartbeat message to its application manager at every configured heartbeat period, the heartbeat message containing the task progress, task statistics and task health status; the task statistics include the amount of data processed, the amount of data output, the time consumed and the spill I/O rate; the task health status indicates whether the process executing the task is abnormal;
Step (5): the application manager receives the heartbeat message and predicts the running time of the task with the task-completion-time prediction model of the task performance estimator; if the predicted running time is less than or equal to the task completion time set by the user, the current task is a short task; otherwise it is a long task;
if the current task is a short task, the second-level task scheduler selects a new task from the unscheduled task queue; if it is a long task, the heartbeat message is ignored;
Step (6): the application manager notifies the task process of the selected new task, and the task process continues with the new task after finishing the task it is currently running;
Step (7): if the task process has not received a new task when the current task finishes, the task process exits and releases the resources it occupies, and the node manager notifies the resource manager of the released resources through a heartbeat message.
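A minimal sketch of the task-side reuse loop described in steps (4), (6) and (7) is given below, under the assumption of hypothetical helper methods (execute, fetchAssignedTask, releaseResources); it is an illustration of the idea, not Hadoop code.

```java
// Simplified task-process main loop under the resource-reuse scheme.
// All methods below are hypothetical stand-ins, not Hadoop APIs.
class ReusableTaskProcess {
    void run(String firstTask) {
        String current = firstTask;
        while (current != null) {
            execute(current);               // run the current (short) task to completion
            current = fetchAssignedTask();  // new task pushed by the application manager
                                            // in a heartbeat reply, or null if none
        }
        releaseResources();                 // step (7): exit and free the container
    }

    void execute(String taskId)  { /* read input, run map(), spill output */ }
    String fetchAssignedTask()   { return null; /* filled in by heartbeat handling */ }
    void releaseResources()      { /* notify the node manager via heartbeat */ }
}
```

The key design point is that the container and its process survive across tasks, so the allocation and reclamation delays described in the background occur once per process instead of once per task.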
In step (4), the task sends a heartbeat message to its application manager at every configured heartbeat period during execution, as follows:
Step (401): judge whether the task progress exceeds the configured threshold; if it does, compute the statistics of the current task, send a statistics heartbeat message, and go to step (403); otherwise go to step (402); the statistics of the current task include the amount of data the task has processed, its running time and the amount of data it has output;
Step (402): the task progress does not exceed the threshold, so send a task-health heartbeat message; go to step (404);
Step (403): the task receives the application manager's reply to the heartbeat message; go to step (404);
Step (404): check whether the task process has received a new task; if so, go to step (405); otherwise go to step (406);
Step (405): read the input data of the new task to the current node; after the current task finishes, run the newly received task;
Step (406): if the task process has not received a new task, the task process releases the resources occupied by the task after the current task finishes.
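A small sketch of the heartbeat decision in steps (401)-(402) follows; the message classes and field names are assumptions made for this sketch only.

```java
// Decide which heartbeat to send, per steps (401)-(402). All types hypothetical.
abstract class Heartbeat { final double progress; Heartbeat(double p) { progress = p; } }
class HealthHeartbeat extends Heartbeat { HealthHeartbeat(double p) { super(p); } }
class StatsHeartbeat  extends Heartbeat {
    final long bytesProcessed, bytesOutput, runMillis;
    StatsHeartbeat(double p, long in, long out, long ms) {
        super(p); bytesProcessed = in; bytesOutput = out; runMillis = ms;
    }
}

class HeartbeatSender {
    final double progressThreshold;          // user-set minimum progress (e.g. 0.4)
    HeartbeatSender(double t) { progressThreshold = t; }

    Heartbeat build(double progress, long bytesIn, long bytesOut, long runMillis) {
        return progress > progressThreshold
                ? new StatsHeartbeat(progress, bytesIn, bytesOut, runMillis)  // step (401)
                : new HealthHeartbeat(progress);                              // step (402)
    }
}
```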
The task-completion-time prediction model of the task performance estimator in step (5) predicts the running time of a task as follows:
The execution of a task can be divided into several sub-stages, and the completion time of the task is closely related to the time the task spends in each sub-stage; the task-completion-time prediction model is therefore built from the time consumed in each sub-stage.
The execution of the task is divided into several sub-stages. If the sub-stages do not overlap, the completion time of the task is the sum of the times of the sub-stages;
if sub-stages overlap, the overlapping part must be removed.
Suppose the execution of task i is decomposed into n sub-stages, the data volume processed in each sub-stage is the vector s = [s_1, s_2, ..., s_n], and the rate at which each sub-stage processes data is the vector r = [r_1, r_2, ..., r_n].
The completion time T_i of task i is:

$$T_i = \sum_{j=1}^{n} \frac{s_j}{r_j} + \alpha \qquad (1)$$

where α is the task startup time, which varies within a fixed range and is treated as a constant.
The execution of a Map task is decomposed into four sub-stages: reading the input data, running the map operation, spilling the output data, and merging the intermediate result files; the stages are denoted read, map, spill and combine respectively.
According to formula (1), the completion time T_i of Map task i is:

$$T_i = \frac{s_{read}}{r_{read}} + \frac{s_{map}}{r_{map}} + \frac{s_{spill}}{r_{spill}} + \frac{s_{combine}}{r_{combine}} + \alpha \qquad (2)$$

For a short task, the running time of the intermediate-result file-merge stage is very small and is treated as a constant.
The read stage and the map stage process the same amount of data, namely the task's input data volume s_task.
Because the map stage and the spill stage partly overlap, the data volume of the spill stage is taken as the amount of output remaining after the map operation, i.e. the amount written by the last spill.
Let s_input^done be the amount of input data already processed, s_output^done the amount of data already output, s_buffer the buffer size set by the configuration, and s_spill^last the amount of data written by the last spill. Formula (2) then becomes:

$$T_i = s_{task}\left(\frac{1}{r_{read}} + \frac{1}{r_{map}}\right) + \frac{s_{spill}^{last}}{r_{spill}} + \beta \qquad (3)$$

$$s_{spill}^{last} = (s_{task} \cdot ratio_{output}) \bmod s_{buffer} \qquad (4)$$

$$ratio_{output} = s_{output}^{done} / s_{input}^{done} \qquad (5)$$

where s_task is the input data volume of the task, β is a constant, and r_read, r_map and r_spill are obtained from the statistics heartbeat messages sent by the task side as r_i = s_i^done / t_i^done, with s_i^done the data volume processed in stage i and t_i^done the time consumed in stage i.
When the application manager receives a statistics heartbeat message sent by the task side, it computes the completion time of the current task according to formula (3).
If task k is running on node w, then to predict the completion time of an unscheduled task m on node w, the processing rate of each sub-stage of task m is taken as

$$r_i^m = r_i^k,\ i \in \{read, map, spill\}, \qquad ratio_{output}^m = ratio_{output}^k.$$
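As a sketch only, the prediction of formulas (3)-(5) can be computed from the rates reported in a statistics heartbeat as follows; the parameter names are this sketch's own, not the patent's or Hadoop's.

```java
// Predicted completion time of a Map task per formulas (3)-(5).
// All parameters come from the statistics heartbeat; beta is the constant term.
class MapTaskTimePredictor {
    double predictSeconds(double taskInputBytes,   // s_task
                          double readRate,         // r_read  (bytes/s)
                          double mapRate,          // r_map   (bytes/s)
                          double spillRate,        // r_spill (bytes/s)
                          double inputDoneBytes,   // s_input^done
                          double outputDoneBytes,  // s_output^done
                          double bufferBytes,      // s_buffer from the configuration
                          double beta) {
        double outputRatio = outputDoneBytes / inputDoneBytes;              // formula (5)
        double lastSpill   = (taskInputBytes * outputRatio) % bufferBytes;  // formula (4)
        return taskInputBytes * (1.0 / readRate + 1.0 / mapRate)
                + lastSpill / spillRate + beta;                             // formula (3)
    }
}
```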
In step (5), the second-level task scheduler selects a new task from the unscheduled task queue as follows:
Step (51):
the unscheduled tasks are divided by locality into node-local tasks, rack-local tasks and off-rack tasks;
if the input data of a task reside on the node that processes the task, the task is called a node-local task;
if the input data of a task and the node that processes it are in the same rack, the task is called a rack-local task;
if the input data of a task and the node that processes it are not in the same rack, the task is called an off-rack task;
tasks are then selected in the order node-local, rack-local, off-rack, and within each priority the task with the shortest running time is selected first;
Step (52): the optimal task is selected according to the heterogeneity principle.
Step (51) proceeds as follows:
Step (511): the application manager receives a statistics heartbeat message sent by a task process; go to step (512);
Step (512): the task performance estimator predicts the completion time of the task that sent the heartbeat; go to step (513);
Step (513): if the predicted time of the task that sent the heartbeat is less than the configured completion time, compute the running time of each unscheduled task on every node that is running the current job; go to step (514);
Step (514): according to locality, add the unscheduled tasks to the node-local task queue, the rack-local task queue and the off-rack task queue; go to step (515); locality refers to whether the input data of a task and its processing node are on the same node or in the same rack;
Step (515): sort the node-local task queue, the rack-local task queue and the off-rack task queue by task running time; go to step (516);
Step (516): according to the locality priority of tasks, judge whether an unscheduled task is the optimal task; if it is, delete the selected unscheduled task from the unscheduled list and return this task; otherwise examine the next unscheduled task, until all unscheduled tasks have been checked.
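A compact sketch of steps (511)-(516) is given below. The class, field and method names are assumptions for illustration, and the optimality check is deferred to a stub (it corresponds to algorithm 3 later in the description).

```java
// Sketch of steps (511)-(516): partition unscheduled tasks by locality, sort by
// predicted running time, and return the first one that is "optimal" for the node.
import java.util.*;

class SubSchedulerSketch {
    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    static class PendingTask {
        String id;
        Locality locality;          // relative to the node whose resources will be reused
        double predictedSeconds;    // predicted running time on that node
    }

    PendingTask select(List<PendingTask> unscheduled, String node) {
        Map<Locality, List<PendingTask>> queues = new EnumMap<>(Locality.class);
        for (Locality l : Locality.values()) queues.put(l, new ArrayList<>());
        for (PendingTask t : unscheduled) queues.get(t.locality).add(t);   // step (514)

        for (Locality l : Locality.values()) {                             // locality priority
            List<PendingTask> q = queues.get(l);
            q.sort(Comparator.comparingDouble(t -> t.predictedSeconds));   // step (515)
            for (PendingTask t : q) {
                if (isOptimalFor(t, node)) {                               // step (516)
                    unscheduled.remove(t);
                    return t;
                }
            }
        }
        return null;   // no suitable task: the resources will simply be released
    }

    boolean isOptimalFor(PendingTask t, String node) { return true; }  // see algorithm 3
}
```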
Step (52) proceeds as follows:
Step (521): compute the running time of the candidate task on all nodes that are running the current job; go to step (522);
Step (522): find the node on which the computed running time of the candidate task is shortest; go to step (523);
Step (523): judge whether the node found is the same as the input node; if so, return that the task is the optimal task; otherwise go to step (524);
Step (524): compute the gain of transferring the task from the input node to the node with the shortest running time; if the gain exceeds the configured threshold, return that the candidate task is not the optimal task; otherwise return that it is the optimal task.
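For illustration, a minimal sketch of the check in steps (521)-(524) is shown below; the map of per-node predicted running times and the threshold parameter are assumptions of this sketch.

```java
// Sketch of steps (521)-(524): a candidate task is "optimal" for the current node
// if no other node runs it fast enough to justify moving it (definitions 2 and 3).
import java.util.Map;

class OptimalTaskCheck {
    boolean isOptimal(Map<String, Double> predictedSeconds,   // node -> running time (521)
                      String currentNode,
                      double gainThresholdSeconds) {          // user-set minimum gain
        String fastest = currentNode;
        for (Map.Entry<String, Double> e : predictedSeconds.entrySet()) {
            if (e.getValue() < predictedSeconds.get(fastest)) fastest = e.getKey();  // (522)
        }
        if (fastest.equals(currentNode)) return true;                                // (523)
        double gain = predictedSeconds.get(currentNode)
                    - predictedSeconds.get(fastest);                                 // (524)
        return gain <= gainThresholdSeconds;   // small gain: keep the task here
    }
}
```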
The locality of a task refers to the relationship between the input data of the task and the node that processes it: if the input data and the processing node are on the same node, the task is a node-local task; if they are in the same rack, it is a rack-local task; if they are not in the same rack, it is an off-rack task.
The transfer gain of a task refers to the amount by which the running time is shortened when the task is run on another node instead of the original node; this shortened time is called the gain.
The optimal task is the task that runs in the shortest time on the current node.
Beneficial effects of the present invention: Hadoop is optimized and the running efficiency of short jobs is improved. First, by analyzing the execution process of jobs in Hadoop, the invention describes the problems that arise when short jobs are processed. Then, based on the observation that tasks run in multiple rounds under high load, a short-job optimization mechanism based on resource reuse is proposed, which reuses the resources released by completed tasks and reduces the resource waste incurred during allocation and reclamation. Under high cluster load, Map tasks account for a far larger share of multi-round tasks than Reduce tasks, so the invention optimizes only Map tasks. Experimental results show that the proposed short-job optimization method based on resource reuse effectively reduces the running time of short jobs and significantly improves the utilization of cluster resources.
Description of the drawings
Fig. 1 is a timing diagram of long-task execution;
Fig. 2 shows the short-job computation framework;
Fig. 3 is a timing diagram of the short-task execution flow;
Fig. 4(a) shows how the error of predicting the completion time of a running task changes as the task progress threshold takes different values;
Fig. 4(b) shows the relationship between task running time and prediction error;
Fig. 5(a) shows the running time of the word-count jobs;
Fig. 5(b) shows the running time of the jobs converted from Hive SQL;
Fig. 5(c) shows job running times under short-job optimization;
Fig. 6 shows the CPU utilization of the compute nodes;
Fig. 7 shows the memory utilization of the compute nodes.
Detailed description of the invention
The invention is further described below with reference to the accompanying drawings and embodiments.
Hadoop is a parallel computing platform for processing large-scale datasets and offers good scalability, high fault tolerance and ease of programming. Although its original design goal was to process relatively large jobs in parallel across many compute nodes, in actual production Hadoop is frequently used to run small-scale short jobs. Because Hadoop does not consider the characteristics of short jobs, short jobs execute rather inefficiently on it. To address this challenge, the present invention first analyzes the execution process of jobs in Hadoop and describes the problems present in short-job processing. Then, based on the observation that tasks run in multiple rounds under high load, it proposes a short-job optimization mechanism based on resource reuse, which reuses the resources released by completed tasks and reduces resource waste during allocation and reclamation. Experimental results show that the proposed method effectively reduces the running time of short jobs and significantly improves cluster resource utilization.
The present invention optimizes short jobs on top of Hadoop 2.x. This section first analyzes the Hadoop computation framework and the task execution process, and then describes the problems Hadoop has when executing short jobs.
Hadoop uses a master-slave architecture consisting of one master node and multiple slave nodes, as shown in Fig. 2. The resource manager (ResourceManager) runs on the master node and is responsible for global resource allocation and monitoring, and for starting and monitoring application managers. The application manager (ApplicationMaster) runs on a slave node; its responsibility is to decompose a job into Map tasks and Reduce tasks, apply for resources for them, and coordinate with the node managers to run and monitor the tasks. The application manager is the control unit of a running job; each job has its own application manager. The node manager (NodeManager) also runs on slave nodes and monitors the amount of resources (such as memory and CPU) used by tasks, preventing a running task from exceeding the amount of resources it applied for.
Fig. 1 shows how the application manager applies for resources for tasks and runs them. 1. The application manager divides the job into Map tasks and Reduce tasks and applies to the resource manager for resources through heartbeat messages. 2. The resource manager selects suitable tasks from the queue of resource requests according to its scheduling algorithm and allocates idle resources to the selected tasks; the application manager obtains the allocated resources at the next heartbeat. 3. The application manager assigns the obtained resources to suitable tasks and then notifies the node manager of the node holding the resources to start the tasks. 4. The node manager starts a task process to run the task; during execution the task reports its progress and health status to its application manager every heartbeat period. 5. When the task completes, the resources it occupied are released, and the node manager returns the released resources to the resource manager through a heartbeat message.
From the execution process above, both short-running and long-running tasks follow the same procedure, but when a short-running task follows this procedure the following problems arise:
1) High task startup cost. The application manager needs at least one heartbeat period (3 seconds by default) to apply for resources for a task, and it takes another 1-2 seconds from the application manager notifying the node manager to the task being initialized and running. In Taobao's cluster, Map tasks that run for less than 10 seconds account for more than 50% of Map tasks, and in Yahoo!'s ad-hoc query and data-analysis clusters the average completion time of most Map tasks is about 20 seconds, so the startup time accounts for more than 20% (5/25) of the total time. When a task applies for resources and the resource manager has none available, the task must queue for resources, which further increases the startup cost; and queuing for resources happens frequently in practice.
2) Serious resource waste. When a completed task releases the resources it occupied, the node manager returns them to the resource manager one heartbeat period later. The resource manager then allocates the idle resources to a waiting task, and the application manager of that task obtains the resources only after another heartbeat period. From starting the task process to finishing task initialization takes another 1-2 seconds, so with a 3-second heartbeat period it takes about 7-8 seconds before released resources are used again. Therefore, when a cluster frequently executes short-running tasks, cluster resources are seriously wasted.
From the above analysis, the current Hadoop task execution flow is not suitable for executing short-running tasks. To make the problem easier to describe, we introduce a new concept, the short task, for tasks with a short running time.
Definition 1. Let the completion time of task i be T_i and let the task completion time set by the user be T_shortTask. If the completion time satisfies T_i ≤ T_shortTask, task i is called a short task.
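In code, Definition 1 is simply a threshold check; the sketch below uses names of our own choosing for illustration.

```java
// Definition 1 as a predicate: a task is "short" if its (predicted or actual)
// completion time does not exceed the user-configured threshold T_shortTask.
class ShortTaskPredicate {
    final double thresholdSeconds;                      // T_shortTask, set by the user
    ShortTaskPredicate(double thresholdSeconds) { this.thresholdSeconds = thresholdSeconds; }
    boolean isShort(double completionSeconds) { return completionSeconds <= thresholdSeconds; }
}
```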
A short job consists of a large number of short tasks, and running a large number of short tasks degrades the execution performance of the short job. Based on the observation that tasks run in multiple rounds under high load, the present invention proposes a short-job optimization mechanism based on resource reuse, which reuses the resources released by completed tasks and reduces resource waste during allocation and reclamation. A task that reuses resources starts running earlier, which reduces the running time of the job.
1 Short-job computation framework with resource reuse
This section describes the short-job computation framework with resource reuse and the task execution process of short jobs. Running a task requires a certain amount of resources, including memory, CPU, network, disk space and a task process; these resources can be used by unscheduled tasks. Because tasks of different jobs need different amounts of resources, the present invention only considers resource reuse among the Map tasks of the same job.
1.1 Short-job computation framework
In the Hadoop framework, the application manager is the control center of a job, responsible for applying for resources for tasks and coordinating with the node managers to execute and monitor them. The present invention adds two components to the application manager, a task performance estimator (Task Performance Estimator) and a second-level task scheduler (Sub-scheduler); the framework is shown in Fig. 2.
The task performance estimator uses the task performance model to predict the completion times of two kinds of tasks: running tasks and unscheduled tasks. Because the resource-reuse mechanism applies only to short tasks, the running time of a task is needed to apply the definition of a short task, and this time is unknown before the task completes. The running time of an unscheduled task on a given node is the key basis on which the second-level task scheduler selects tasks, and the computed value directly affects the dispatch order. Since a task can be divided into several sub-stages, the task performance model is built from the time the task consumes in each sub-stage. During task execution, statistics are collected continuously and the time consumed in each sub-stage is computed.
The responsibility of the second-level task scheduler is to judge, from the prediction of the task performance estimator, whether the currently executing task is a short task, and to select a task from the unscheduled task queue. If the executing task is a short task, the second-level scheduler selects a suitable task from the unscheduled queue to reuse the resources the short task is about to release. When selecting an unscheduled task, the second-level scheduler takes into account the locality of the task, the running time of the task, resource fairness and the heterogeneity of the cluster.
1.2 Task execution process of short jobs
Fig. 3 shows the task execution flow of a short job. 1. The application manager applies to the resource manager for resources through heartbeat messages. 2. The resource manager allocates idle resources to the tasks that requested them, and the application manager obtains the allocated resources at the next heartbeat. 3. The application manager assigns the obtained resources to suitable tasks and then notifies the corresponding node managers to start them. 4. A node manager starts a task process to run a task; during execution, the task reports its progress, task statistics and health status to its application manager every heartbeat period. 5. When the application manager receives a heartbeat message and the current task is a short task, the second-level task scheduler selects a suitable task from the unscheduled task queue. 6. The application manager notifies the task process of the selected task, and the task process runs the newly received task after finishing the task it is currently running. 7. If the task process has not received a new task when the current task finishes, the task process exits and releases the resources it occupies, and the node manager notifies the resource manager of the released resources through a heartbeat message.
The short-job computation framework does not affect the execution of long tasks: long tasks still follow the flow described in Fig. 1, and the flow described in Fig. 3 applies only to short tasks.
2 Implementation of short-job optimization with resource reuse
According to the design of the short-job computation framework, the implementation is split between the application manager side and the task process side. The application manager side works on the statistics collected by task processes, so this section first presents the implementation of the task-process heartbeat messages, and then the task performance model and second-level task scheduler on the application manager side.
2.1 Task-process heartbeat messages
Tasks and the application manager communicate through heartbeat messages, whose content includes the task progress, the task health status and similar information. The task performance model predicts the completion time of a task from statistics collected during execution: the amount of input already processed, the amount of data output, the task output ratio, the rate of reading input data, the rate of the map operation, the rate of spilling output data, and so on. Because ordinary heartbeat messages are sent at a high frequency, to relieve the pressure on the application manager the present invention adds a statistics heartbeat message between the task and the application manager that carries the statistics to the application manager. A statistics heartbeat is sent only when the task progress exceeds the configured threshold.
Algorithm 1. sendHeartbeat.
Input: the minimum task progress at which a statistics heartbeat is sent, set by the user;
the progress of the current task;
Output: a heartbeat message.
Step 101: if the task progress exceeds the configured threshold, compute the statistics of the current task (the amount of data processed, the running time, the amount of data output, etc.), send a statistics heartbeat message, and go to step 103; otherwise go to step 102;
Step 102: the task progress does not exceed the threshold, so send a task-health heartbeat message;
Step 103: the task receives the application manager's reply to the heartbeat message;
Step 104: if the task process has received a new task, go to step 105; otherwise go to step 106;
Step 105: read the input data of the new task to the current node; after the current task finishes, run the newly received task;
Step 106: if the task process has not received a new task, the task process releases the resources occupied by the task after the current task finishes.
Algorithm 1 describes the process by which a task process sends heartbeat messages; curTask is the task being run and newTask is the newly received task. When the task progress exceeds the configured threshold, the task computes statistics such as the amount of data processed, the amount of data output and the read/write rates, and then sends a statistics heartbeat message to the application manager. When the task process receives the application manager's heartbeat reply, it checks whether a new task has been assigned. If so, the task process pre-reads the input data of the new task to the current node; reading the input data of the new task runs in parallel with the currently executing task, avoiding having to read the data when the new task starts. After the currently running task completes, the task process continues with the newly received task. If the task process has not received a new task, it releases the resources it occupies after the running task finishes.
2.2 Task-completion-time prediction model
The execution of a task can be divided into several sub-stages, and the completion time of the task is closely related to the time spent in each sub-stage; the present invention therefore builds the task performance model from the time the task consumes in each sub-stage. If the sub-stages do not overlap, the completion time of the task is the sum of the times of the sub-stages; if sub-stages overlap, the overlapping part must be removed. Suppose the execution of task i is decomposed into n sub-stages, the data volume processed in each sub-stage is the vector s = [s_1, s_2, ..., s_n], and the rate at which each sub-stage processes data is the vector r = [r_1, r_2, ..., r_n]. The completion time T_i of task i is:

$$T_i = \sum_{j=1}^{n} \frac{s_j}{r_j} + \alpha \qquad (1)$$

where α is the task startup time; it varies within a fixed range and can be regarded as a constant.
The execution of a Map task is decomposed into four sub-stages: reading the input data, running the map operation, spilling the output data, and merging the intermediate result files, denoted read, map, spill and combine respectively. According to formula (1), the completion time T_i of Map task i is:

$$T_i = \frac{s_{read}}{r_{read}} + \frac{s_{map}}{r_{map}} + \frac{s_{spill}}{r_{spill}} + \frac{s_{combine}}{r_{combine}} + \alpha \qquad (2)$$

For a short task the running time of the intermediate-result file-merge stage is very small and can be regarded as a constant. The read stage and the map stage process the same amount of data, namely the task's input data volume s_task. Because the map stage and the spill stage partly overlap, the data volume of the spill stage is the amount of output remaining after the map operation, i.e. the amount written by the last spill. Let s_input^done be the amount of input already processed, s_output^done the amount of data output, s_buffer the buffer size set by the configuration, and s_spill^last the amount of data written by the last spill. Formula (2) then becomes:

$$T_i = s_{task}\left(\frac{1}{r_{read}} + \frac{1}{r_{map}}\right) + \frac{s_{spill}^{last}}{r_{spill}} + \beta \qquad (3)$$

$$s_{spill}^{last} = (s_{task} \cdot ratio_{output}) \bmod s_{buffer} \qquad (4)$$

$$ratio_{output} = s_{output}^{done} / s_{input}^{done} \qquad (5)$$

where s_task is the input data volume of the task, β is a constant, and r_read, r_map and r_spill are obtained from the statistics heartbeat messages sent by the task side as r_i = s_i^done / t_i^done, with s_i^done the data volume processed in stage i and t_i^done the time consumed in stage i.
When the application manager receives a statistics heartbeat message sent by the task side, it can compute the completion time of the current task according to formula (3). If task k is running on node w, then to predict the completion time of an unscheduled task m on node w, the processing rate of each sub-stage of task m is taken as r_i^m = r_i^k, i ∈ {read, map, spill}, and ratio_output^m = ratio_output^k.
2.3 Second-level task scheduler
After the application manager receives a statistics heartbeat message sent by a task, it predicts the running time of the task with the task performance model. If the task that sent the statistics heartbeat is a short task, a suitable task is selected from the unscheduled task queue. This section describes the process of selecting an unscheduled task, which takes into account three issues: task locality, cluster heterogeneity and resource fairness.
Problem 1. Task locality. The running time of a task is the sum of the time spent reading the input data and the time spent computing on it, so reducing the time spent reading input data shortens the running time of the task. In a cluster, local disk I/O is faster than network transfer, and the network bandwidth within a rack is larger than the bandwidth between racks, so when dispatching tasks they should be assigned to nodes close to their input data whenever possible. This strategy not only reduces data copy time but also relieves the network load. The second-level task scheduler therefore selects tasks in the order node-local, rack-local, off-rack.
Definition 2. Task transfer gain. If the running time of task i on node a is T_a^i and its running time on node b is T_b^i, the gain of transferring task i from node a to node b is gain_{a→b}^i = T_a^i − T_b^i. If gain_{a→b}^i is positive, moving task i from node a to node b shortens its running time; the larger the value, the greater the gain. If gain_{a→b}^i is negative, the running time is lengthened.
Definition 3. Optimal task at a node. If running task i on node a satisfies (6) or (7), task i is called the optimal task to run on node a.

$$T_a^i = T_m^i \qquad (6)$$

$$gain_{a \to m}^i < gain_{config} \qquad (7)$$

where $T_m^i = \min_{x \in X} T_x^i$, X is the set of nodes running the current job, and m is the node on which task i runs in the shortest time. Condition (6) says that task i runs in the shortest time on node a; condition (7) says that the gain from moving task i to the fastest node m is below the configured gain threshold.
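As a small worked example (with invented numbers): if a candidate task is predicted to take 18 s on the node whose resources are being reused and 12 s on the fastest node m, then

$$gain_{a\to m}^i = T_a^i - T_m^i = 18\,\mathrm{s} - 12\,\mathrm{s} = 6\,\mathrm{s},$$

so with a configured threshold of 8 s condition (7) holds and the task is still treated as optimal on node a, whereas with a threshold of 4 s it would be skipped in favor of another candidate.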
Algorithm 2. assignTask.
Input: the set of unassigned tasks;
Output: the task to be dispatched;
Step 201: the application manager receives a statistics heartbeat message sent by a task process; go to step 202;
Step 202: the task performance estimator predicts the completion time of the task that sent the heartbeat; go to step 203;
Step 203: if the predicted time of the task that sent the heartbeat is less than the configured completion time, compute the running time of each unscheduled task on every node running the current job; go to step 204;
Step 204: according to locality, add the unscheduled tasks to the node-local task queue, the rack-local task queue and the off-rack task queue; go to step 205;
Step 205: sort the node-local, rack-local and off-rack task queues by task running time; go to step 206;
Step 206: according to task locality priority, judge whether an unscheduled task is the optimal task; if it is, delete it from the unscheduled list and return it; otherwise examine the next unscheduled task, until all unscheduled tasks have been checked.
Problem 2. Cluster heterogeneity. Cluster heterogeneity means that the hardware configurations of the cluster nodes are not identical; hardware mainly refers to CPU, memory and disk. Heterogeneity has an important impact on task execution efficiency: the same task executed on a high-performance node and on a low-performance node has a very different running time. When selecting an unscheduled task, the second-level task scheduler checks whether the selected task is the optimal task for the given node. If it is, the task is executed; otherwise the task is skipped and another task to execute is selected.
Problem 3. Resource fairness. Resource fairness means that every job shares cluster resources fairly. Because the cluster is heterogeneous and the computing power of the nodes differs greatly, a job must be prevented from occupying high-performance or low-performance nodes for a long time. To share resources fairly, the maximum reuse time of a resource, T_reuse^max, is set by the user and is the same for every resource. If T_avg^x is the average running time of tasks on node x, the number of times a resource can be reused is N_x = T_reuse^max / T_avg^x. Because the average task running time differs from node to node, the reuse count differs from node to node, and resources on high-performance nodes are reused more often.
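As a small numeric illustration (values invented):

$$N_x = \frac{T_{reuse}^{max}}{T_{avg}^{x}} = \frac{60\,\mathrm{s}}{20\,\mathrm{s}} = 3,$$

so a resource on node x would be handed to at most three successive short tasks before being released back to the resource manager.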
The second-level task scheduler selects an unscheduled task in two steps. The first step divides the unscheduled tasks by locality into node-local tasks, rack-local tasks and off-rack tasks, then selects tasks in locality priority order, preferring within each priority the task with the shortest running time (see algorithm 2). The second step selects the optimal task according to the heterogeneity principle (see algorithm 3).
In algorithm 2, k is the node running task i, T_reuse^k is the accumulated reuse time of the resource used by task i, Task_node is the set of node-local tasks, Task_rack is the set of rack-local tasks, Task_offRack is the set of off-rack tasks, and T_reuse^max is the maximum reuse time of a resource, set by the user. Algorithm 2 describes the process by which the application manager selects an unscheduled task when it receives a heartbeat message. It first judges whether the task that sent the heartbeat is a short task and whether the reuse time of the resource exceeds the maximum. It then divides the unscheduled tasks into three groups, node-local, rack-local and off-rack tasks, and finally selects a task according to locality priority.
Algorithm 3. selectOptimalTask.
Input: the set of nodes running the current job;
the minimum gain set by the user;
the selected task;
the node running the current task;
Output: whether the selected task is the optimal task.
Step 301: compute the running time of the candidate task on all nodes running the current job; go to step 302;
Step 302: find the node on which the computed running time of the candidate task is shortest; go to step 303;
Step 303: judge whether the node found is the same as the input node; if so, return that the task is the optimal task; otherwise go to step 304;
Step 304: compute the gain of transferring the task from the input node to the node with the shortest running time; if the gain exceeds the configured threshold, return that the candidate task is not the optimal task; otherwise return that it is the optimal task.
Algorithm 3 determines whether the selected task is the optimal task for the current node. First, the running time of the selected task on the other nodes running the current job is computed, and it is judged whether the running time of the task on the current node is the shortest. If the shortest running time of the task is on some other node m, the gain of running the task on node m is computed. If the gain exceeds the configured threshold, the selected task is skipped; otherwise the selected task is executed on the current node.
3 Experiments and analysis
The present invention implements the short-job optimization on top of Hadoop 2.2.0; the version before optimization (Apache Hadoop) is referred to as AH, and the version after optimization as SJH. We evaluate the optimization effect on short jobs by comparing SJH with AH.
The experimental cluster consists of 1 master node and 8 compute nodes; the main node configurations are listed in Table 1. The master node and four compute nodes use configuration one, and the remaining four compute nodes use configuration two. The nodes are distributed over two racks and connected by gigabit Ethernet. The HDFS block size is 64 MB and the replication factor of data blocks is 3. The cluster uses the FAIR scheduler provided with Hadoop; running a Map task requires 1 GB of memory and 1 CPU, while a Reduce task and an application manager each require 1.5 GB of memory and 1 CPU.
Table 1. Cluster node configuration information
The experiments use two test datasets: dataset one is generated by the randomtextwriter tool provided with Hadoop, and dataset two is user electricity-consumption data collected by a power-utility acquisition system. Randomtextwriter generates a dataset of a specified size whose content consists of random words. The experiments use word count (wordcount), Hive single-table conditional queries and Terasort as benchmarks. Hive is a data warehouse built on HDFS; the SQL-like queries it provides are converted into Hadoop jobs. A Hive single-table conditional query is converted into a Hadoop job with only Map tasks. Word count counts the frequency of each word in the input dataset, and Terasort sorts the input dataset in lexicographic order.
3.1 Accuracy of the task performance model
We first test the accuracy with which the task performance model predicts running tasks and not-yet-run tasks. The word count program is used as the benchmark on dataset one, with an input size of 2.5 GB processed by 40 Map tasks. The experiments use the relative error $E = |\hat{T}_i - T_i| / T_i$ to describe accuracy, where E is the error, $\hat{T}_i$ is the predicted completion time of the task and $T_i$ is the actual completion time. The error of predicting the completion time of a running task is measured when the task progress threshold p_min is 40%, 60%, 70% and 80% respectively.
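For instance (illustrative numbers only), a task predicted to finish in 22 s that actually finishes in 20 s has a relative error of

$$E = \frac{|\hat{T}_i - T_i|}{T_i} = \frac{|22 - 20|}{20} = 10\%.$$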
Fig. 4(a) shows how the error of predicting a running task's completion time changes as the progress threshold takes different values. As Fig. 4(a) shows, the error decreases as the progress increases and then levels off: when the task progress exceeds 60% the error is within 20%, and when the progress exceeds 70% the error is about 10%. Fig. 4(b) shows the relationship between task running time and error. The error grows with the running time; when the completion time of a task reaches 80 seconds the error exceeds 20%. The error of predicting not-yet-run tasks is higher than that of predicting running tasks. The task performance model uses the read rate, map processing rate and spill rate observed over a short period to compute the completion time and does not consider how these rates change over time, so the prediction error grows as the task running time grows. As Fig. 4(b) shows, when the task completion time is below 30 seconds the prediction error is below 15%, and the average completion time of short tasks is 20 seconds, so the prediction model meets the practical requirements.
3.2 operation deadlines
We test short optimization of job effect in the case of short operation and long operation mixed running, test data such as table 2 Shown in.Experiment uses shell script poll to submit benchmark to, and 1 operation of submission per second, All Jobs carried in 30 seconds Friendship completes.The master container quantity that cluster provides is 96, and the Map task of the most each operation needs many wheels to have dispatched.
Fig. 5(a) and Fig. 5(b) show the impact of short-job optimization on the job completion time. The running time of word frequency statistics jobs is shortened by 6%-22%, and the running time of jobs converted from Hive SQL is shortened by 6%-27%. Both kinds of jobs are short jobs; the Hive single-table conditional query has only Map tasks, so its optimization effect is better than that of the word frequency statistics job. As can be seen from Fig. 5(c), short-job optimization has no significant impact on long jobs; the longest prolongation of a job is 5%. Short-job optimization causes the tasks of short jobs to preempt resources, so the completion time of long jobs is somewhat extended.
Table 2 Job test data
3.3 Resource utilization
This section focuses on the CPU and memory utilization of the compute nodes; the benchmarks and data sets are shown in Table 2. Resource utilization is collected with Ganglia, and the final result is the average value of each resource during the computation. Fig. 6 shows the CPU utilization of the compute nodes; after optimization, CPU utilization improves by 4%-13% on average. Among the benchmarks, the word frequency statistics job and the job converted from Hive SQL are computation-intensive jobs. The CPU utilization of nodes N1, N2, N6, and N7 improves more than that of N3, N4, N5, and N8, because the former are high-performance nodes and the latter are low-performance nodes. Fig. 7 shows the change in memory usage of the compute nodes; after optimization, memory utilization improves by 2.6%-6.18%. The improvement in memory utilization is small because, among the benchmarks, only Terasort produces large output data. The experimental results show that the short-job optimization method proposed by the present invention is clearly effective in improving cluster resource utilization.
4 Conclusion
Because Hadoop does not take short jobs into account, short jobs execute relatively inefficiently in Hadoop. To address this challenge, the present invention first analyzes the execution process of jobs in Hadoop and describes the problems that arise when processing short jobs. Then, based on the characteristic that tasks run in multiple rounds under high load, a short-job optimization mechanism based on resource reuse is proposed: the resources about to be released by completed tasks are reused, reducing the waste caused by resource allocation and reclamation. By optimizing the Map task execution process, the present invention reduces the running time of short jobs. Future work will improve the execution efficiency of short jobs from the perspectives of the Reduce task execution process and task scheduling.
Although the specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the protection scope of the present invention. Those skilled in the art should understand that, on the basis of the technical solution of the present invention, various modifications or variations that can be made without creative effort still fall within the protection scope of the present invention.

Claims (10)

1. A MapReduce short-job optimization system based on resource reuse, characterized by comprising: a master node, a first-level slave node, and several second-level slave nodes, wherein the master node is connected to the first-level slave node, and the first-level slave node is connected to the several second-level slave nodes; a resource manager and a first-level task scheduler are deployed on the master node; an application manager, a task performance evaluator, and a second-level task scheduler are deployed on the first-level slave node, wherein the second-level task scheduler is connected to the task performance evaluator and is also connected to the master node; a node manager is deployed on each of the second-level slave nodes;
the task performance evaluator predicts the completion time of running tasks and unscheduled tasks;
the second-level task scheduler judges, according to the prediction result of the task performance evaluator, whether the task currently being executed is a short task, and selects a task from the unscheduled task queue.
2. The MapReduce short-job optimization system based on resource reuse as claimed in claim 1, characterized in that,
if the task currently being executed is a short task, the second-level task scheduler selects a new short task from the unscheduled task queue, and the new short task reuses the resources that the currently executing short task is about to release; if the task currently being executed is not a short task, the resources it occupies are released directly after it completes.
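A minimal sketch of the reuse decision in claims 1-2, assuming a hypothetical Task type and a per-job queue of pending short tasks (none of these names come from the patent or from Hadoop):

```java
import java.util.Optional;
import java.util.Queue;

// Minimal sketch of the second-level scheduler's reuse decision.
// Task, the threshold and the queue contents are illustrative assumptions.
class ReuseDecisionSketch {

    static class Task {
        final String id;
        final double predictedRunSeconds;
        Task(String id, double predictedRunSeconds) {
            this.id = id;
            this.predictedRunSeconds = predictedRunSeconds;
        }
    }

    private final double shortTaskThresholdSeconds;   // user-set completion-time bound
    private final Queue<Task> unscheduledShortTasks;  // pending short tasks of the same job

    ReuseDecisionSketch(double threshold, Queue<Task> pending) {
        this.shortTaskThresholdSeconds = threshold;
        this.unscheduledShortTasks = pending;
    }

    /**
     * If the running task is short, pick a new short task that will reuse
     * the container the running task is about to release; otherwise return
     * empty, meaning the container is released back to the resource manager.
     */
    Optional<Task> selectSuccessor(Task runningTask) {
        boolean isShort = runningTask.predictedRunSeconds <= shortTaskThresholdSeconds;
        if (!isShort) {
            return Optional.empty();                  // long task: release resources on completion
        }
        return Optional.ofNullable(unscheduledShortTasks.poll());
    }
}
```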
3. The MapReduce short-job optimization system based on resource reuse as claimed in claim 1, characterized in that,
when selecting an unscheduled task, the second-level task scheduler considers the locality of the task, the running time of the task, the fairness of resources, and the heterogeneity of the cluster.
4. The MapReduce short-job optimization system based on resource reuse as claimed in claim 1, characterized in that,
the resource manager is responsible for the allocation and monitoring of global resources, and for the startup and monitoring of the application manager;
the application manager decomposes the job into Map tasks and Reduce tasks, applies for resources for the Map tasks and Reduce tasks, coordinates the running of the job, and is further used, together with the node manager, to monitor the tasks; the application manager is the control unit of job execution, and each job corresponds to one application manager;
the node manager is responsible for monitoring the amount of resources used by tasks, and prevents the amount of resources used during task execution from exceeding the amount of resources the task applied for.
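As a reading aid, the division of responsibilities in claims 1 and 4 can be summarized as a set of interfaces; all interface and method names are illustrative assumptions rather than actual Hadoop/YARN APIs:

```java
// Minimal sketch of the component responsibilities described in claims 1 and 4.
interface ResourceManagerRole {
    void allocateGlobalResources();                  // global resource allocation and monitoring
    void startAndMonitorApplicationManager(String jobId);
}

interface ApplicationManagerRole {
    void decomposeJob(String jobId);                 // split the job into Map and Reduce tasks
    void requestResources();                         // apply for resources for Map and Reduce tasks
    void coordinateJob();                            // one application manager per job
    void monitorTasks();                             // jointly with the node manager
}

interface NodeManagerRole {
    void monitorTaskResourceUsage(String taskId);    // keep usage within the requested amount
}
```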
5. A MapReduce short-job optimization method based on resource reuse, characterized by comprising the following steps:
Step (1): the application manager applies to the resource manager for resources through heartbeat messages;
Step (2): the resource manager allocates idle resources to the application manager that applied for them, and the application manager obtains the applied-for resources;
Step (3): the application manager allocates the obtained resources to unscheduled tasks, and then notifies the corresponding node managers to start the task processes;
Step (4): the node manager starts a task process to run the task; during task execution, the task sends a heartbeat message to its application manager at every set heartbeat interval;
Step (5): the application manager receives the heartbeat message and predicts the running time of the task based on the task completion time prediction model in the task performance evaluator; if the running time of the task is less than or equal to the task completion time set by the user, the current task is a short task, otherwise the current task is a long task; if the current task is a short task, the second-level task scheduler selects a new task from the unscheduled task queue; if the current task is a long task, the heartbeat message is ignored;
Step (6): the application manager notifies the task process of the selected new task, and the task process continues to run the new task after the running task has completed;
Step (7): if the task process does not receive a new task after the running task has completed, the task process exits and releases the resources it occupied, and the node manager notifies the resource manager of the released resources through a heartbeat message.
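A minimal sketch of the heartbeat handling in steps (5)-(6) of claim 5; the evaluator, scheduler, and task-process interfaces are illustrative assumptions, not the actual Hadoop or patent APIs:

```java
// Minimal sketch of the application manager's handling of a task heartbeat.
class HeartbeatHandlingSketch {

    interface PerformanceEvaluator { double predictRunSeconds(String taskId); }
    interface SecondLevelScheduler { String selectNewTask(String nodeId); }   // may return null
    interface TaskProcessProxy     { void assignNextTask(String taskId); }

    private final PerformanceEvaluator evaluator;
    private final SecondLevelScheduler scheduler;
    private final double userDeadlineSeconds;   // user-set short-task bound

    HeartbeatHandlingSketch(PerformanceEvaluator e, SecondLevelScheduler s, double deadline) {
        this.evaluator = e;
        this.scheduler = s;
        this.userDeadlineSeconds = deadline;
    }

    /** Called when the application manager receives a heartbeat from a running task. */
    void onHeartbeat(String taskId, String nodeId, TaskProcessProxy taskProcess) {
        double predicted = evaluator.predictRunSeconds(taskId);
        boolean shortTask = predicted <= userDeadlineSeconds;   // step (5)
        if (!shortTask) {
            return;                                             // long task: ignore the heartbeat
        }
        String nextTaskId = scheduler.selectNewTask(nodeId);    // pick from the unscheduled queue
        if (nextTaskId != null) {
            taskProcess.assignNextTask(nextTaskId);             // step (6): run after the current task ends
        }
    }
}
```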
6. The method as claimed in claim 5, characterized in that,
the step in step (4) of sending a heartbeat message to the application manager at every set heartbeat interval during task execution comprises:
Step (401): judging whether the task progress exceeds a set threshold; if so, computing the statistical data of the current task, sending a statistics heartbeat message, and going to step (403); otherwise going to step (402); the statistical data of the current task comprise the amount of data the task has processed, the running time, and the output data volume;
Step (402): if the task progress does not exceed the set threshold, sending a task health heartbeat message; going to step (404);
Step (403): the task receives the heartbeat message returned by the application manager; going to step (404);
Step (404): judging whether the task process has received a new task; if so, going to step (405); otherwise going to step (406);
Step (405): reading the input data of the new task to the current node; after the current task has completed, running the newly received task;
Step (406): if the task process has not received a new task, after the current task has completed, the task process releases the resources the task occupied.
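A minimal sketch of the task-side behaviour in steps (401)-(406); the heartbeat payload, progress threshold, and placeholder methods are illustrative assumptions:

```java
// Minimal sketch of the task process's heartbeat and hand-over logic.
class TaskHeartbeatSketch {

    interface ApplicationManagerProxy {
        void sendStatisticsHeartbeat(long processedBytes, double runSeconds, long outputBytes);
        void sendHealthHeartbeat();
    }

    private final double progressThreshold;       // e.g. the pmin progress threshold
    private volatile String pendingNewTaskId;     // set when the AM assigns a new task

    TaskHeartbeatSketch(double progressThreshold) {
        this.progressThreshold = progressThreshold;
    }

    /** Steps (401)-(402): fired at every heartbeat interval. */
    void onHeartbeatTimer(double progress, long processedBytes, double runSeconds,
                          long outputBytes, ApplicationManagerProxy am) {
        if (progress > progressThreshold) {
            // step (401): report statistics so the AM can predict the completion time
            am.sendStatisticsHeartbeat(processedBytes, runSeconds, outputBytes);
        } else {
            // step (402): only report that the task is alive
            am.sendHealthHeartbeat();
        }
    }

    /** Steps (404)-(406): decide what to do once the current task finishes. */
    void onCurrentTaskFinished(Runnable releaseResources) {
        String next = pendingNewTaskId;
        if (next != null) {
            prefetchInput(next);      // step (405): read the new task's input to the current node
            run(next);                // reuse the same task process and its resources
        } else {
            releaseResources.run();   // step (406): exit and release the container
        }
    }

    void assignNewTask(String taskId) { this.pendingNewTaskId = taskId; }

    private void prefetchInput(String taskId) { /* illustrative placeholder */ }
    private void run(String taskId)           { /* illustrative placeholder */ }
}
```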
7. The method as claimed in claim 5, characterized in that the step in step (5) of the second-level task scheduler selecting a new task from the unscheduled task queue comprises:
Step (51):
dividing the unscheduled tasks into node-local tasks, rack-local tasks, and off-rack tasks according to task locality;
then selecting tasks in the order of node-local tasks, rack-local tasks, and off-rack tasks, and within each priority level preferentially selecting the task with the shortest running time;
Step (52): selecting the optimal task according to the heterogeneity principle.
8. The method as claimed in claim 7, characterized in that,
step (51) comprises:
Step (511): the application manager receives the statistics heartbeat message sent by the task process; go to step (512);
Step (512): the task performance evaluator predicts the completion time of the task that sent the heartbeat; go to step (513);
Step (513): if the predicted completion time of the task that sent the heartbeat is less than the set completion time, compute the running time of the unscheduled tasks on all nodes running the current job; go to step (514);
Step (514): add the unscheduled tasks to the node-local task queue, the rack-local task queue, and the off-rack task queue respectively according to their locality; go to step (515); locality refers to whether the input data of a task and the node processing it are on the same node, and whether they are in the same rack;
Step (515): sort the node-local task queue, the rack-local task queue, and the off-rack task queue according to the running time of the tasks; go to step (516);
Step (516): according to the locality priority of the tasks, judge whether an unscheduled task is the optimal task; if it is, delete the currently selected task from the unscheduled task list and return this task; otherwise judge the next unscheduled task, until all unscheduled tasks have been checked.
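A minimal sketch of the locality-queue selection in steps (514)-(516): classify the unscheduled tasks, sort each queue by estimated running time, and take the head of the highest-priority non-empty queue. The Task fields and the source of the run-time estimates are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of the locality-aware selection in steps (514)-(516).
class LocalitySelectionSketch {

    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    static class Task {
        final String id;
        final Locality locality;          // relative to the node whose container is being reused
        final double estimatedRunSeconds;
        Task(String id, Locality locality, double estimatedRunSeconds) {
            this.id = id;
            this.locality = locality;
            this.estimatedRunSeconds = estimatedRunSeconds;
        }
    }

    /**
     * Split the unscheduled tasks into node-local, rack-local and off-rack
     * queues, sort each queue by estimated run time, and return the first
     * task in the highest-priority non-empty queue.
     */
    static Task select(List<Task> unscheduled) {
        List<Task> nodeLocal = new ArrayList<>();
        List<Task> rackLocal = new ArrayList<>();
        List<Task> offRack = new ArrayList<>();
        for (Task t : unscheduled) {
            switch (t.locality) {
                case NODE_LOCAL: nodeLocal.add(t); break;
                case RACK_LOCAL: rackLocal.add(t); break;
                default:         offRack.add(t);
            }
        }
        Comparator<Task> byRunTime = Comparator.comparingDouble(t -> t.estimatedRunSeconds);
        nodeLocal.sort(byRunTime);
        rackLocal.sort(byRunTime);
        offRack.sort(byRunTime);
        if (!nodeLocal.isEmpty()) return nodeLocal.get(0);
        if (!rackLocal.isEmpty()) return rackLocal.get(0);
        if (!offRack.isEmpty())  return offRack.get(0);
        return null;   // no unscheduled task left
    }
}
```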
9. The method as claimed in claim 7, characterized in that,
step (52) comprises:
Step (521): compute the running time of the input task on all nodes running the current job; go to step (522);
Step (522): obtain the compute node on which the input task has the shortest running time; go to step (523);
Step (523): judge whether the obtained node is the same as the input node; if so, return that the task is the optimal task; otherwise go to step (524);
Step (524): compute the gain of transferring the task from the input node to the node with the shortest running time; if the gain exceeds the set threshold, return that the input task is the optimal task; otherwise return that the input task is not the optimal task.
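A minimal sketch of steps (521)-(524) as literally worded; how the "gain" of moving the task is computed is not specified in the claim, so the relative run-time reduction used here is an interpretive assumption, as are all names:

```java
import java.util.Map;

// Minimal sketch of the heterogeneity check in steps (521)-(524).
class HeterogeneityCheckSketch {

    /**
     * @param runSecondsByNode estimated run time of the input task on every
     *                         node running the current job (step 521); the
     *                         input node is assumed to be present in the map
     * @param inputNode        the node the task is associated with
     * @param gainThreshold    threshold on the gain of moving to the fastest node
     */
    static boolean isOptimalTask(Map<String, Double> runSecondsByNode,
                                 String inputNode, double gainThreshold) {
        // Step (522): find the node with the shortest estimated run time.
        String fastestNode = inputNode;
        double fastestTime = runSecondsByNode.get(inputNode);
        for (Map.Entry<String, Double> e : runSecondsByNode.entrySet()) {
            if (e.getValue() < fastestTime) {
                fastestNode = e.getKey();
                fastestTime = e.getValue();
            }
        }
        // Step (523): the fastest node coincides with the input node.
        if (fastestNode.equals(inputNode)) {
            return true;
        }
        // Step (524): gain of moving the task from the input node to the
        // fastest node, here taken as the relative run-time reduction.
        double timeOnInputNode = runSecondsByNode.get(inputNode);
        double gain = (timeOnInputNode - fastestTime) / timeOnInputNode;
        return gain > gainThreshold;   // per the claim wording: gain above threshold => optimal
    }
}
```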
10. The method as claimed in claim 7, characterized in that,
if the input data of a task are on the node processing the task, the task is called a node-local task;
if the input data of a task and the node processing the task are in the same rack, the task is called a rack-local task;
if the input data of a task and the node processing the task are not in the same rack, the task is called an off-rack task.
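A minimal sketch of the three locality classes defined in claim 10; the rack lookup is an illustrative assumption (in practice it would come from the cluster's topology mapping):

```java
// Minimal sketch of the locality classification in claim 10.
class LocalityClassifierSketch {

    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    interface Topology { String rackOf(String node); }

    static Locality classify(String inputDataNode, String processingNode, Topology topology) {
        if (inputDataNode.equals(processingNode)) {
            return Locality.NODE_LOCAL;   // input data reside on the processing node
        }
        if (topology.rackOf(inputDataNode).equals(topology.rackOf(processingNode))) {
            return Locality.RACK_LOCAL;   // same rack, different node
        }
        return Locality.OFF_RACK;         // different rack
    }
}
```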
CN201610124760.2A 2016-03-04 2016-03-04 A kind of short optimization of job system and method for MapReduce based on resource reuse Active CN105808334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610124760.2A CN105808334B (en) 2016-03-04 2016-03-04 A kind of short optimization of job system and method for MapReduce based on resource reuse


Publications (2)

Publication Number Publication Date
CN105808334A CN105808334A (en) 2016-07-27
CN105808334B true CN105808334B (en) 2016-12-28

Family

ID=56466747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610124760.2A Active CN105808334B (en) 2016-03-04 2016-03-04 A kind of short optimization of job system and method for MapReduce based on resource reuse

Country Status (1)

Country Link
CN (1) CN105808334B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106506255B (en) * 2016-09-21 2019-11-05 微梦创科网络科技(中国)有限公司 A kind of method, apparatus and system of pressure test
CN106991070B (en) * 2016-10-11 2021-02-26 创新先进技术有限公司 Real-time computing method and device
CA3027996C (en) * 2016-12-07 2021-01-19 Tata Consultancy Services Limited Systems and methods for scheduling tasks and managing computing resource allocation for closed loop control systems
CN106844027A (en) * 2017-01-13 2017-06-13 广西电网有限责任公司电力科学研究院 A kind of method for scheduling task based on node load
CN108733462A (en) * 2017-04-18 2018-11-02 北京京东尚科信息技术有限公司 The method and apparatus of delay task
CN107589985B (en) * 2017-07-19 2020-04-24 山东大学 Two-stage job scheduling method and system for big data platform
CN107391250B (en) * 2017-08-11 2021-02-05 成都优易数据有限公司 Controller scheduling method for improving performance of Mapreduce task Shuffle
CN109697117B (en) * 2017-10-20 2021-03-09 中国电信股份有限公司 Terminal control method, terminal control device and computer-readable storage medium
CN108897619B (en) * 2018-06-27 2020-05-05 国家超级计算天津中心 Multi-level resource flexible configuration method for super computer
CN108874549B (en) * 2018-07-19 2021-02-02 北京百度网讯科技有限公司 Resource multiplexing method, device, terminal and computer readable storage medium
CN109117285B (en) * 2018-07-27 2021-12-28 高新兴科技集团股份有限公司 Distributed memory computing cluster system supporting high concurrency
CN111274067A (en) * 2018-12-04 2020-06-12 北京京东尚科信息技术有限公司 Method and device for executing calculation task
CN110262896A (en) * 2019-05-31 2019-09-20 天津大学 A kind of data processing accelerated method towards Spark system
CN110737521B (en) * 2019-10-14 2021-03-05 中国人民解放军32039部队 Disaster recovery method and device based on task scheduling center
CN113391906B (en) * 2021-06-25 2024-03-01 北京字节跳动网络技术有限公司 Job updating method, job updating device, computer equipment and resource management system
CN115904673B (en) * 2023-03-09 2023-06-27 华南师范大学 Cloud computing resource concurrent scheduling method, device, system, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103685492A (en) * 2013-12-03 2014-03-26 北京智谷睿拓技术服务有限公司 Dispatching method, dispatching device and application of Hadoop trunking system
CN104933110A (en) * 2015-06-03 2015-09-23 电子科技大学 MapReduce-based data pre-fetching method
CN105005503A (en) * 2015-07-26 2015-10-28 孙凌宇 Cellular automaton based cloud computing load balancing task scheduling method
CN105117286A (en) * 2015-09-22 2015-12-02 北京大学 Task scheduling and pipelining executing method in MapReduce
CN105224612A (en) * 2015-09-14 2016-01-06 成都信息工程大学 Based on the MapReduce data Localization methodologies of dynamically labeled preferred value

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945185B (en) * 2012-10-24 2015-04-22 深信服网络科技(深圳)有限公司 Task scheduling method and device
CN104657204B (en) * 2013-11-22 2018-05-04 华为技术有限公司 short task processing method, device and operating system
CN105138405B (en) * 2015-08-06 2019-05-14 湖南大学 MapReduce task based on the Resources list to be released, which speculates, executes method and apparatus
CN105302647B (en) * 2015-11-06 2019-04-16 南京信息工程大学 Backup tasks speculate the prioritization scheme of implementation strategy in a kind of MapReduce


Also Published As

Publication number Publication date
CN105808334A (en) 2016-07-27

Similar Documents

Publication Publication Date Title
CN105808334B (en) A kind of short optimization of job system and method for MapReduce based on resource reuse
Hernández et al. Using machine learning to optimize parallelism in big data applications
KR102482122B1 (en) Method for processing tasks in paralall, device and storage medium
CN107038069B (en) Dynamic label matching DLMS scheduling method under Hadoop platform
Shyam et al. Virtual resource prediction in cloud environment: a Bayesian approach
US8595732B2 (en) Reducing the response time of flexible highly data parallel task by assigning task sets using dynamic combined longest processing time scheme
JP6241300B2 (en) Job scheduling apparatus, job scheduling method, and job scheduling program
US9239734B2 (en) Scheduling method and system, computing grid, and corresponding computer-program product
CN105373432B (en) A kind of cloud computing resource scheduling method based on virtual resource status predication
CN103780655A (en) Message transmission interface task and resource scheduling system and method
CN108769162B (en) Distributed message equalization processing method and device, electronic equipment and storage medium
CN114880130B (en) Method, system, device and storage medium for breaking memory limitation in parallel training
CN115373835A (en) Task resource adjusting method and device for Flink cluster and electronic equipment
CN112579267A (en) Decentralized big data job flow scheduling method and device
Shi et al. MapReduce short jobs optimization based on resource reuse
CN111708639A (en) Task scheduling system and method, storage medium and electronic device
CN114217966A (en) Deep learning model dynamic batch processing scheduling method and system based on resource adjustment
CN113946431B (en) Resource scheduling method, system, medium and computing device
CN111639054A (en) Data coupling method, system and medium for ocean mode and data assimilation
Chandio et al. Energy efficient VM scheduling strategies for HPC workloads in cloud data centers
CN110096339A (en) A kind of scalable appearance configuration recommendation system and method realized based on system load
CN110928659B (en) Numerical value pool system remote multi-platform access method with self-adaptive function
Niu et al. An adaptive efficiency-fairness meta-scheduler for data-intensive computing
Anselmi et al. Stability and optimization of speculative queueing networks
CN112698931B (en) Distributed scheduling system for cloud workflow

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant