CN104102533A - Bandwidth aware based Hadoop scheduling method and system

Info

Publication number: CN104102533A
Application number: CN201410270693.6A
Authority: CN
Inventors: 戴彬; 秦鹏; 邵翔; 邹云飞
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2014-06-17
Filing date: 2014-06-17
Publication date: 2014-10-15
Anticipated expiration: 2034-06-17
Also published as: CN104102533B

Abstract

The invention discloses a bandwidth-aware Hadoop scheduling method comprising the following steps: establishing a job completion time model for Hadoop task scheduling and a mathematical model of the Hadoop scheduling system, thereby converting the Hadoop task scheduling problem into the problem of finding, for each job to be scheduled, the task schedule that minimizes the job completion time; using the real-time network management and traffic control functions provided by SDN (Software Defined Networking) to provide a time-slot-based network bandwidth allocation mechanism that divides the occupancy period of each link's remaining bandwidth into equal time slots; and, on the basis of the job completion time model and the time-slot bandwidth allocation mechanism, jointly considering the locality of a task and the real-time network bandwidth before allocating a computing node to it, and allocating to each task the computing node that offers the earliest completion time. The invention solves the problem that existing methods cannot schedule tasks from a global perspective and on the basis of the actually available network bandwidth at the same time.

Description

A bandwidth-aware Hadoop scheduling method and system

Technical field

The invention belongs to the field of information processing and data computing, and more specifically relates to a bandwidth-aware Hadoop scheduling method and system.

Background art

With the progress of science and technology, Internet technology has developed rapidly, which has promoted social development and greatly enriched people's online life. With the arrival of Web 2.0 in particular, the Internet has changed dramatically. A prominent feature of Web 2.0 is user-generated content, and the large volume of such content has caused explosive growth in data. Faced with the challenge of large-scale data processing, cloud computing has emerged as a new model for computing on and processing large-scale data, benefiting from the joint evolution of technologies such as distributed computing and virtualization. Cloud computing adopts the concept of cluster computing: computing tasks are distributed over a pool of computing power formed by a large computer cluster, so that data processing workloads and application systems can dynamically obtain computing power, storage resources and the like according to their actual needs.

To date, most cloud computing systems worldwide are based on the MapReduce computation model and a distributed file storage system, largely imitating Google's core cloud computing technology. The core of Google's cloud computing consists of three parts: the distributed structured data storage system BigTable, GFS (Google File System), and the distributed computing platform MapReduce. However, Google is a commercial company and does not disclose its implementation details, so individuals and research groups who want to pursue cloud computing research cannot learn more. The open-source cloud computing system Hadoop fills this gap. In 2005 the Apache Foundation spun off part of the open-source project Nutch as a separate project and funded it. Hadoop's design follows Google's core cloud computing technology; it is an open framework that supports running applications that process massive amounts of data. The core of Hadoop comprises the distributed file system HDFS and the parallel programming framework MapReduce. HDFS implements the GFS design on a cluster and supports operations such as reading, writing, and transferring files within the cluster. MapReduce provides distributed computation on the cluster and uses the file handling capabilities of HDFS to implement task initialization, scheduling, execution and so on.

Building a Hadoop system does not require a supercomputer; it can be deployed on a cluster made up of a large number of inexpensive hardware devices. The Hadoop platform encapsulates its complex low-level implementation and exposes only a stable API to the applications running on it, hiding the details of parallel data processing such as splitting and replicating input data, cluster scheduling, fault tolerance, and monitoring. Hadoop developers therefore need not pay much attention to the underlying architecture and can concentrate on the core of their program, developing cloud computing applications much as they would ordinary programs. This greatly reduces the burden of application development and significantly improves development efficiency. In addition, to improve the usability of the whole framework, the open-source Hadoop libraries provide comprehensive fault tolerance at the application layer, so that each node in the cluster can independently handle failures that may occur while a job is running. Because it is stable, inexpensive, and efficient, the Hadoop framework is popular with researchers and developers and is widely used in search engines, business data mining, advertising effectiveness analysis, bioinformatics analysis, and web log analysis and storage.

Although Hadoop is currently a widely used open-source cloud computing platform, only a few years have passed since the Apache Foundation released it, and despite the attention it has received from both academia and industry, the platform still needs, and has room for, improvement in many respects. The most important of these is task scheduling. As a key technology of the Hadoop system, task scheduling is responsible for scheduling computing resources and job execution, and the scheduling result directly affects the computing performance and the resource utilization of the Hadoop system. However, industry research on job scheduling is still at an early stage. In increasingly complex network environments and diverse application scenarios, existing job scheduling algorithms still suffer from slow job response times, poor execution and interaction capability of the Hadoop platform, and low utilization of Hadoop system resources.

Summary of the invention

In view of the above defects of, or needs for improvement in, the prior art, the invention provides a bandwidth-aware Hadoop scheduling method and system, which aim to solve the technical problems of slow job response time and low overall Hadoop platform performance in existing Hadoop scheduling algorithms.

To achieve the above object, according to one aspect of the invention, a bandwidth-aware Hadoop scheduling method is provided, comprising the following steps:

(1) Receive a job submitted by a user and initialize it, and create for the job a job ID object that encapsulates the job's tasks and record information so that the execution state and progress of the job can be tracked;

(2) Add the initialized job to the job queue. The job queue maintains the jobs waiting to be scheduled for execution and is responsible for managing and scheduling all job objects in the memory map;

(3) Receive the heartbeat packet sent by a computing node, extract from it the current status information of that computing node, and take a job to be scheduled from the job queue;

(4) Query the job scheduling pool to determine whether the job to be scheduled already exists in it; if so, go to step (6), otherwise go to step (5);

(5) Perform the pre-allocation computation for the job to be scheduled, creating a new task scheduling mapping for the job in the job scheduling pool;

(6) Look up the job to be scheduled in the job scheduling pool and extract its task scheduling mapping; if the mapping is not empty, go to step (7), otherwise go to step (8);

(7) From the task scheduling mapping of the job to be scheduled, extract the task queue of the computing node from step (3). According to the computing capacity of that node, encapsulate all or part of the task queue in the reply to the heartbeat packet and return it to the node for execution. At the same time, update the task queue in the job scheduling pool by deleting the tasks that have been assigned to the computing node; if all of them have been assigned, delete the whole task queue. Then return to step (3);

(8) If the task scheduling mapping is empty, all tasks of the job have finished executing; perform the reduce computation on the results of all tasks and return the result to the user.
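The control flow of steps (3) to (8) can be sketched as follows. This is a minimal illustration and not the patented implementation: the class `Scheduler`, the `preallocate` callback standing in for step (5), and the `free_slots` argument standing in for the node's computing capacity are all hypothetical names, and the reduce computation of step (8) is omitted.

```python
from collections import deque


class Scheduler:
    """Hypothetical skeleton of the heartbeat-driven loop in steps (3) to (8)."""

    def __init__(self, preallocate):
        self.job_queue = deque()        # jobs waiting to be scheduled, step (2)
        self.pool = {}                  # job id -> {node name -> deque of tasks}
        self.preallocate = preallocate  # assumed callable implementing step (5)

    def submit(self, job_id):
        self.job_queue.append(job_id)

    def on_heartbeat(self, node, free_slots):
        """Return (job_id, tasks) for `node`, or None when there is nothing to run."""
        if not self.job_queue:
            return None
        job_id = self.job_queue[0]                        # step (3)
        if job_id not in self.pool:                       # step (4)
            self.pool[job_id] = self.preallocate(job_id)  # step (5)
        mapping = self.pool[job_id]                       # step (6)
        if not mapping:                                   # step (8), reduce omitted
            self.job_queue.popleft()
            del self.pool[job_id]
            return None
        queue = mapping.get(node)                         # step (7)
        tasks = []
        while queue and len(tasks) < free_slots:
            tasks.append(queue.popleft())
        if queue is not None and not queue:
            del mapping[node]                             # whole queue assigned
        return job_id, tasks
```

A node with no pre-allocated queue simply receives an empty task list in this sketch; how the real system answers such a heartbeat is not stated in the text.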

Preferably, in step (3), a task scheduling mapping is maintained in the job allocation pool for each job to be scheduled. The key of the mapping is the name of a computing node, and the value is the queue of computing tasks pre-allocated to that node. Whenever a computing node issues a task allocation request, a job to be scheduled is taken from the task management queue and the task scheduling mapping maintained for that job is queried. Using the name of the requesting node as the key, the corresponding value, namely the task queue pre-allocated to that node, is extracted. According to the computing capacity of the node, all or part of this task queue is encapsulated in the reply sent to the requesting node and returned to it for execution. At the same time, the task queue is updated in the job scheduling pool by deleting the tasks that have been assigned to the computing node; if all of them have been assigned, the whole task queue is deleted.

Preferably, step (5) specifically comprises the following sub-steps:

(5-1) Compute the current load of each node in the whole Hadoop cluster, estimate the remaining execution time of that load, and thereby obtain the idle time of each node;

(5-2) Communicate with the name node to obtain the current replica placement of the input data of the job to be scheduled, then parse and store this information;

(5-3) Communicate with the SDN controller to obtain real-time network bandwidth information and compute the data-moving time. The API of the SDN controller NOX is called to obtain the real-time bandwidth, which is then stored. The data-moving time is defined as the time taken to move the data of a task from the data source node to the computing node, and is computed as T_m = DS / BW, where T_m is the data-moving time, DS is the data block size (which can be set in a configuration file), and BW is the real-time bandwidth;

(5-4) Take the cluster node idle-time information from step (5-1), the input block replica information from step (5-2), and the network bandwidth and data-moving time information from step (5-3), and combine the three to assign the currently optimal computing node to each task;

(5-5) In the job scheduling pool, create the task scheduling mapping for the job to be scheduled and update it.

Preferably, step (5-1) is specifically: monitor and record the current running state of each computing node in the whole cluster, and obtain the progress value of the task currently running on each computing node. The progress value is the proportion of a task's data block that has already been processed, from which the task completion time can be estimated as T_e = T_s + (T_n - T_s) / progress, where T_e is the estimated task completion time, T_s is the time at which the task started executing, and T_n is the current system time.
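As a small worked sketch of this estimate, the function below evaluates T_e = T_s + (T_n - T_s) / progress. The function name, the use of seconds as the time unit, and the treatment of progress as a fraction in (0, 1] are assumptions made for illustration.

```python
def estimated_completion_time(t_start, t_now, progress):
    """Estimate T_e = T_s + (T_n - T_s) / progress.

    `progress` is assumed to be a fraction in (0, 1]; times are assumed
    to be seconds on a common clock.
    """
    if not 0.0 < progress <= 1.0:
        raise ValueError("progress must be in (0, 1]")
    return t_start + (t_now - t_start) / progress


# A task that started at t=100 s and is 25% done at t=130 s
# is estimated to finish at t=220 s.
print(estimated_completion_time(100.0, 130.0, 0.25))
```

A progress of zero is rejected here because the estimate is undefined before any data has been processed; the source does not say how that case is handled.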

Preferably, step (5-3) is specifically: call the API of the SDN controller to obtain the real-time network bandwidth and store it. The data-moving time is defined as the time taken to move the data of a task from the data source node to the computing node, and is computed as T_m = DS / BW, where T_m is the data-moving time, DS is the data block size (which can be set in a configuration file), and BW is the real-time bandwidth.
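The formula T_m = DS / BW can be illustrated with a short sketch. The units (bytes and bytes per second) and the function name are assumptions; the text does not specify units, and the call that would fetch the bandwidth from the SDN controller is deliberately left out because its API is not described here.

```python
def data_moving_time(block_size_bytes, bandwidth_bytes_per_s):
    """Return T_m = DS / BW in seconds (units are an assumption)."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return block_size_bytes / bandwidth_bytes_per_s


MIB = 1024 ** 2
# Moving a 64 MiB block over a link with 16 MiB/s of available bandwidth.
print(data_moving_time(64 * MIB, 16 * MIB))
```

The 64 MiB block size is only an example value for DS.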

Preferably, step (5-4) specifically comprises the following sub-steps:

(5-4-1) In the whole cluster, find the remote node that becomes available earliest as the optimal remote node, and record its idle time rI_minNow;

(5-4-2) In the whole cluster, find the local node that becomes available earliest as the optimal local node, and record its idle time rI_minLoc;

(5-4-3) Check whether the nodes found in steps (5-4-1) and (5-4-2) are the same node; if so, define the optimal local node as the optimal node and go to step (5-4-5); if not, go to step (5-4-4);

(5-4-4) Compare which of the two computing nodes would complete the task earlier if the task were assigned to it, that is, compare rI_minNow + T_m with rI_minLoc; the smaller value means the task finishes earlier, and the corresponding node is defined as the optimal node;

(5-4-5) Assign the task to be allocated to the optimal node.
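Sub-steps (5-4-1) to (5-4-5) can be sketched as below. The `Node` record and the function name are hypothetical. Two interpretations are assumed and are not confirmed by the text: the search in (5-4-1) ranges over every node in the cluster, which is what makes the same-node check in (5-4-3) meaningful, and a tie in (5-4-4) is resolved in favour of the local node. The fallback for a task with no local node is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    idle_time: float      # time at which the node becomes free
    has_local_data: bool  # True if the task's input block is stored here


def pick_optimal_node(nodes, t_m):
    """Choose between the earliest-free node and the earliest-free local node."""
    best_now = min(nodes, key=lambda n: n.idle_time)        # (5-4-1), rI_minNow
    local_nodes = [n for n in nodes if n.has_local_data]
    if not local_nodes:
        return best_now                                     # assumed fallback
    best_loc = min(local_nodes, key=lambda n: n.idle_time)  # (5-4-2), rI_minLoc
    if best_now is best_loc:                                # (5-4-3)
        return best_loc
    # (5-4-4): the remote candidate must also pay the data-moving time T_m.
    if best_now.idle_time + t_m < best_loc.idle_time:
        return best_now
    return best_loc
```

For example, with a local node free at 50 s, a remote node free at 10 s, and T_m = 20 s, the remote node wins because 10 + 20 is less than 50; with T_m = 60 s the local node wins.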

Preferably, in step (5), when a remote task is assigned to a computing node, time-slot reservation is also performed. When a remote task requires data to be moved, the source node ND_dataSrc and the destination node ND_minNow of the move are recorded and encapsulated in a flow table FlowTable, whose header field records ND_dataSrc and ND_minNow. The flow table information is sent to the SDN controller, which pushes the flow table down to the corresponding SDN switches. When an SDN switch detects a flow matching this flow table, it gives priority to that data move;

Preferably, in step (5), if task TK_i is assigned to node ND_j for computation while its input data are stored on node ND_dataSrc, the input data must be moved from ND_dataSrc to ND_j when the task is executed; TM_{i,j} is defined as this data-moving time. The computation time of a task is the interval between the start and the end of its computation; TP_{i,j} is defined as this computation time. From the moment a task is assigned to a computing node it occupies that node's computing resources, so the time for which the task actually occupies computing resources is the interval from its assignment to the end of its computation; TE_{i,j} is defined as this actual execution time. These times satisfy the formula TE_{i,j} = TP_{i,j} + TM_{i,j}. A computing node is a local node if the data to be computed reside on it; otherwise it is a remote node.
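A brief numeric sketch of TE_{i,j} = TP_{i,j} + TM_{i,j} follows. The function name is hypothetical, and treating the data-moving time as zero on a local node is an assumption that follows from the definition of a local node rather than something the text states explicitly.

```python
def actual_execution_time(tp, tm, is_local):
    """Return TE = TP + TM, with TM taken as 0 on a local node (assumption)."""
    return tp + (0.0 if is_local else tm)


# A task needing 30 s of computation and 4 s of data movement:
print(actual_execution_time(30.0, 4.0, is_local=False))  # remote node
print(actual_execution_time(30.0, 4.0, is_local=True))   # local node
```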

Preferably, step (5-5) is specifically: in the task scheduling mapping, if the key of the computed optimal computing node does not exist, create a new key-value pair with the name of that node as the key and add the task to its value. If the key already exists, find the corresponding key-value pair and append the task to the task queue in its value.
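The mapping update of step (5-5) amounts to an insert-or-append on a dictionary of queues. The sketch below assumes the mapping is a plain `dict` from node name to a `deque` of tasks; the function name `assign_task` is hypothetical.

```python
from collections import deque


def assign_task(mapping, node_name, task):
    """Append `task` to the queue of `node_name`, creating the queue if needed."""
    if node_name not in mapping:
        mapping[node_name] = deque()  # new key-value pair for this node
    mapping[node_name].append(task)   # task goes to the tail of the queue


mapping = {}
assign_task(mapping, "node-1", "task-a")
assign_task(mapping, "node-1", "task-b")
assign_task(mapping, "node-2", "task-c")
print(mapping)
```

Appending at the tail keeps the tasks of a node in the order in which they were allocated, which matches the queue semantics described above.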

According to another aspect of the invention, a bandwidth-aware Hadoop scheduling system is provided, comprising:

The first module, which receives a job submitted by a user and initializes it, and creates for the job a job ID object that encapsulates the job's tasks and record information so that the execution state and progress of the job can be tracked;

The second module, which adds the initialized job to the job queue, where the job queue maintains the jobs waiting to be scheduled for execution and is responsible for managing and scheduling all job objects in the memory map;

The third module, which receives the heartbeat packet sent by a computing node, extracts from it the current status information of that computing node, and takes a job to be scheduled from the job queue;

The fourth module, which queries the job scheduling pool to determine whether the job to be scheduled already exists in it; if so, control passes to the sixth module, otherwise to the fifth module;

The fifth module, which performs the pre-allocation computation for the job to be scheduled, creating a new task scheduling mapping for the job in the job scheduling pool;

The sixth module, which looks up the job to be scheduled in the job scheduling pool and extracts its task scheduling mapping; if the mapping is not empty, control passes to the seventh module, otherwise to the eighth module;

The seventh module, which extracts, from the task scheduling mapping of the job to be scheduled, the task queue of the computing node handled by the third module; according to the computing capacity of that node, encapsulates all or part of the task queue in the reply to the heartbeat packet and returns it to the node for execution; at the same time updates the task queue in the job scheduling pool by deleting the tasks that have been assigned to the computing node and, if all of them have been assigned, deletes the whole task queue; and then passes control back to the third module;

The eighth module, which, if the task scheduling mapping is empty, meaning that all tasks of the job have finished executing, performs the reduce computation on the results of all tasks and returns the result to the user.

In general, compared with the prior art, the above technical solutions conceived by the invention achieve the following beneficial effects:

1. Faster job response: the scheduling method of the invention schedules the tasks of a job from a global perspective, abandoning the prior-art approach of scheduling and assigning tasks for a single computing node only when that node issues a task allocation request. The invention ensures task locality from a global viewpoint by adopting a job pre-allocation mechanism: when a job is scheduled for the first time, each of its tasks is assigned the computing node that is optimal for that task, and the allocation result is stored in the job allocation pool. When an optimal computing node issues a task allocation request, the task queue pre-allocated to it is handed over. When every task of the job completes on its optimal computing node, the completion of the whole job is also optimal.

2. Adaptation to complex network environments: the invention uses network bandwidth as a parameter, giving the scheduler a reference for task scheduling. If tasks are assigned as local tasks wherever possible, data transfers over network links can be avoided, which speeds up task execution and reduces network congestion; this property is called task locality. However, when the load on the computing nodes of a Hadoop cluster is unbalanced, blindly preserving data locality and giving every task to a local node may pile tasks onto heavily loaded local nodes and cause jobs to wait. The invention considers both task locality and network bandwidth and flexibly chooses an optimal node from among the local and non-local nodes, so that task scheduling remains efficient even in a complex network environment.

Brief description of the drawings

Fig. 1 is a flowchart of the bandwidth-aware Hadoop scheduling method of the invention.

Fig. 2 is a detailed flowchart of step (5) of the method of the invention.

Fig. 3 is a detailed flowchart of step (5-4) of the method of the invention.

Detailed description of the embodiments

To make the object, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below can be combined with one another as long as they do not conflict.

The overall idea of the invention is a bandwidth-aware Hadoop scheduling algorithm that uses the bandwidth awareness of SDN (software defined networking) to improve Hadoop's own local-node-first algorithm. The link bandwidth obtained through the software defined network is used to compute the data-moving time, the idle time of each node is estimated, and the task completion time on every node is then computed. The node with the earliest task completion time is selected to execute the task.

As shown in Fig. 1, the bandwidth-aware Hadoop scheduling method of the invention comprises the following steps:

(1) Receive a Hadoop job submitted by a user, initialize it, and create a job ID object for it, where the job ID object encapsulates the job's tasks and record information so that the execution state and progress of the Hadoop job can be tracked;

(2) Add the initialized Hadoop job to the job queue, which maintains the Hadoop jobs waiting to be scheduled for execution and is responsible for managing and scheduling all Hadoop jobs in the memory map;

(3) Receive the heartbeat packet sent by a computing node, extract from it the current status information of that node, and take the Hadoop job at the head of the job queue. Specifically, the status information of a computing node includes the number of tasks it is executing and its idle time;

(4) Query the job scheduling pool to determine whether the extracted Hadoop job already exists in it; if so, go to step (6), otherwise go to step (5). The job scheduling pool stores the results of job allocation; it consists of multiple queues, each of which holds the scheduling result of one job, that is, which computing nodes the job is assigned to for execution;

(5) Perform the pre-allocation computation for the extracted Hadoop job: the scheduling algorithm divides the job into multiple tasks and assigns each task to a computing node, and a new task scheduling mapping is created for the job in the job scheduling pool. As shown in Fig. 2, this step specifically comprises the following sub-steps:

(5-1) Determine the current Hadoop task status of each computing node in the whole Hadoop cluster, and from it estimate the remaining execution time of the node's current Hadoop task to obtain the node's idle time. Specifically, first monitor and record the current load of each computing node in the cluster and obtain the progress value of the Hadoop task currently running on each node, where the progress value is the proportion of the task's data block that has already been processed. The remaining execution time of the current task is then estimated as T_e = (T_n - T_s) × (1 / progress - 1), where T_e is the remaining execution time of the current Hadoop task, T_s is the time at which the task started executing, and T_n is the current time of the whole Hadoop cluster; the idle time of the computing node is T_e + T_n;

(5-2) Communicate with the name node (NameNode) to obtain the replica placement of the input data of the extracted Hadoop job, then parse and store this information;

(5-3) Communicate with the software defined network (SDN) controller to obtain real-time bandwidth information for the whole Hadoop cluster, and compute the data-moving time T_m as the data block size divided by the real-time bandwidth.

As shown in Fig. 3, step (5-4) specifically comprises the following sub-steps:

(5-4-5) Assign the task to be allocated to the optimal node.

(5-5) In the job scheduling pool, create the task scheduling mapping for the job to be scheduled and update it. In the task scheduling mapping, if the key of the computed optimal computing node does not exist, create a new key-value pair with the name of that node as the key and add the task to its value. If the key already exists, find the corresponding key-value pair and append the task to the task queue in its value.
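The node idle time obtained in step (5-1) of this embodiment, T_e + T_n with T_e the remaining execution time, agrees with the completion-time estimate T_s + (T_n - T_s) / progress given in the summary. Writing p for the progress value (a symbol introduced here only for brevity), the short derivation is:

```latex
\begin{aligned}
T_{\text{idle}} &= T_n + (T_n - T_s)\left(\frac{1}{p} - 1\right) \\
                &= T_n + \frac{T_n - T_s}{p} - (T_n - T_s) \\
                &= T_s + \frac{T_n - T_s}{p}
\end{aligned}
```

For instance, with T_s = 100, T_n = 130, and p = 0.25, the remaining time is 30 × 3 = 90 and both forms give an idle time of 220.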

The bandwidth-aware Hadoop scheduling system of the invention comprises:

Those skilled in the art will readily understand that the foregoing is only a preferred embodiment of the invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A bandwidth-aware Hadoop scheduling method, characterized in that it comprises the following steps:

2. The Hadoop scheduling method according to claim 1, characterized in that a task scheduling mapping is maintained in the job allocation pool for each job to be scheduled, the key of the mapping being the name of a computing node and the value being the queue of computing tasks pre-allocated to that node; whenever a computing node issues a task allocation request, a job to be scheduled is taken from the task management queue and the task scheduling mapping maintained for that job is queried; using the name of the requesting node as the key, the corresponding value, namely the task queue pre-allocated to that node, is extracted; according to the computing capacity of the node, all or part of this task queue is encapsulated in the reply sent to the requesting node and returned to it for execution; at the same time the task queue is updated in the job scheduling pool by deleting the tasks that have been assigned to the computing node, and if all of them have been assigned, the whole task queue is deleted.

3. Hadoop dispatching method according to claim 1 and 2, is characterized in that, step (5) specifically comprises following sub-step:

(5-3) communicating with the SDN controller to obtain real-time network bandwidth information, and computing the data-moving time,

(5-4) taking the cluster node idle-time information from step (5-1), the input block replica information from step (5-2), and the network bandwidth and data-moving time information from step (5-3), and combining the three to assign the currently optimal computing node to each task,

4. The Hadoop scheduling method according to claim 3, characterized in that step (5-1) is specifically: monitoring and recording the current running state of each computing node in the whole cluster, and obtaining the progress value of the task currently running on each computing node, the progress value being the proportion of a task's data block that has already been processed, from which the task completion time can be estimated as T_e = T_s + (T_n - T_s) / progress, where T_e is the estimated task completion time, T_s is the time at which the task started executing, and T_n is the current system time.

5. The Hadoop scheduling method according to claim 3, characterized in that step (5-3) is specifically: calling the API of the SDN controller to obtain the real-time network bandwidth and storing it, the data-moving time being defined as the time taken to move the data of a task from the data source node to the computing node and being computed as T_m = DS / BW, where T_m is the data-moving time, DS is the data block size, which can be set in a configuration file, and BW is the real-time bandwidth.

6. The Hadoop scheduling method according to claim 3, characterized in that step (5-4) specifically comprises the following sub-steps:

(5-4-4) comparing which of the two computing nodes would complete the task earlier if the task were assigned to it, that is, comparing rI_minNow + T_m with rI_minLoc, the smaller value meaning that the task finishes earlier, and defining the corresponding node as the optimal node,

(5-4-5) assigning the task to be allocated to the optimal node.

7. The Hadoop scheduling method according to claim 6, characterized in that, when a remote task is assigned to a computing node, time-slot reservation is also performed; when a remote task requires data to be moved, the source node ND_dataSrc and the destination node ND_minNow of the move are recorded and encapsulated in a flow table FlowTable, whose header field records ND_dataSrc and ND_minNow; the flow table information is sent to the SDN controller, which pushes the flow table down to the corresponding SDN switches; and when an SDN switch detects a flow matching this flow table, it gives priority to that data move.

8. The Hadoop scheduling method according to claim 8, characterized in that, if task TK_i is assigned to node ND_j for computation while its input data are stored on node ND_dataSrc, the input data must be moved from ND_dataSrc to ND_j when the task is executed, and TM_{i,j} is defined as this data-moving time; the computation time of a task is the interval between the start and the end of its computation, and TP_{i,j} is defined as this computation time; from the moment a task is assigned to a computing node it occupies that node's computing resources, so the time for which the task actually occupies computing resources is the interval from its assignment to the end of its computation, and TE_{i,j} is defined as this actual execution time, these times satisfying the formula TE_{i,j} = TP_{i,j} + TM_{i,j}; a computing node is a local node if the data to be computed reside on it, and otherwise it is a remote node.

9. The Hadoop scheduling method according to claim 3, characterized in that step (5-5) is specifically: in the task scheduling mapping, if the key of the computed optimal computing node does not exist, creating a new key-value pair with the name of that node as the key and adding the task to its value; and if the key already exists, finding the corresponding key-value pair and appending the task to the task queue in its value.

10. A bandwidth-aware Hadoop scheduling system, characterized in that it comprises the following modules: