
CN102096602A - Task scheduling method, and system and equipment thereof

Info

Publication number: CN102096602A
Application number: CN2009102424854A
Authority: CN
Inventors: 郭磊涛; 孙宏伟
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2009-12-15
Filing date: 2009-12-15
Publication date: 2011-06-15

Abstract

The invention discloses a task scheduling method, and a system and equipment thereof. The method provided by the invention is applied to a data processing system which is provided with a main node and a plurality of working nodes, wherein the main node is used for scheduling tasks, and the working nodes are used for executing the tasks. The method comprises the following steps that: each working node transmits a request for acquiring a task to the main node, wherein the request carries available resources of the working node and the used resources of each task executed on the working node; and the main node determines resource demand of each type of tasks on the main node according to the used resource of each task which is transmitted by each working node and executed on each node and schedules the tasks for the working nodes according to the determined resource demand of each type of tasks and the available resources of the working node which transmits the request. By the method, the system and the equipment, overload of the working node can be avoided and the resource utilization ratio of the working node can be improved so as to improve the operating efficiency of the data processing system, particularly a MapReduce system.

Description

Task scheduling method, and system and equipment thereof

Technical field

The present invention relates to data processing technology in the field of communications, and in particular to a task scheduling method and a system and equipment thereof.

Background art

MapReduce is a distributed parallel programming system for processing massive data sets. It automatically parallelizes a MapReduce data processing job into multiple subtasks and schedules them for concurrent execution on a cluster built from ordinary nodes (such as PCs). The system also automatically handles problems such as node failure, task failure, and data exchange between nodes, so that a MapReduce application need not be concerned with these problems and can implement distributed data processing simply by defining the corresponding Map (mapping) and Reduce (reduction) functions.

A MapReduce system consists mainly of three modules; its system architecture is shown in Fig. 1. The client (Client) submits the parallel processing job (Job) written by the user to the main node (Master). The Master automatically decomposes the Job submitted by the client into multiple Map tasks that share the same processing function (though their input data may differ) and multiple Reduce tasks that share the same processing function (though the data they process may differ), where the output data of the Map tasks serves as the input data of the Reduce tasks, and schedules the tasks to working nodes (Workers). A Worker requests tasks from the Master and executes the tasks it obtains.

Because a MapReduce system is generally built on large-scale computing resources (for example, a network system on the scale of thousands of nodes), the Master cannot obtain the load information of all Workers in order to schedule tasks. In a MapReduce system, therefore, each Worker actively requests tasks from the Master according to a pre-deployed configuration file, and the Master selects tasks and schedules them to the Worker for execution according to information such as queue configuration and job priority. The specific scheduling flow, shown in Fig. 2, may include:

Step 201: the Worker triggers the heartbeat message sending flow. The trigger may be periodic or event-based; for example, when a task finishes and the system has idle resources, the Worker may actively trigger the sending of a heartbeat message.

Step 202: the Worker checks the pre-deployed configuration file, which records the maximum task quota, that is, the maximum number of tasks the Worker may execute.

Step 203: the Worker determines from the configuration file whether to request a new task; for example, when the number of tasks the Worker is executing has not reached the quota, it will request a new task from the Master.

Step 204: the Worker sends a heartbeat message to the Master according to the result of the determination. The message carries a "whether to request a task" flag, which is set to true when the Worker requests a new task and to false otherwise.

Step 205: after receiving the heartbeat message sent by the Worker, the Master checks the "whether to request a task" flag. When the flag is true, the Master selects a task and schedules it to this Worker. The scheduling policy on the Master is configurable; for example, it may schedule a task to a node close to the data the task processes, reschedule failed tasks, or redundantly schedule bottleneck tasks.

Step 206: the Master returns a heartbeat response to the Worker and, when the Worker has requested a task, returns the task to the Worker.
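For illustration only, the existing quota-based flow of steps 202 to 206 might be sketched as follows. All names (`LegacyHeartbeat`, `build_legacy_heartbeat`, `legacy_master_reply`) are hypothetical and do not come from any real MapReduce implementation, and the Master's configurable policy is reduced here to "take the first pending task", which is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LegacyHeartbeat:
    worker_id: str
    wants_task: bool  # the "whether to request a task" flag


def build_legacy_heartbeat(worker_id: str, running_tasks: int, max_tasks: int) -> LegacyHeartbeat:
    """Steps 202-204: request a task only while the running count is below the quota."""
    return LegacyHeartbeat(worker_id, wants_task=running_tasks < max_tasks)


def legacy_master_reply(hb: LegacyHeartbeat, pending: List[str]) -> Optional[str]:
    """Steps 205-206: return a task if one was requested (simplified first-come policy)."""
    if hb.wants_task and pending:
        return pending.pop(0)
    return None
```

Note that the decision depends only on the task count, never on how much CPU or memory the running tasks actually consume.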

In the course of making the present invention, the inventors found the following problems in the existing task scheduling mechanism of MapReduce systems:

(1) The Worker requests new tasks purely according to the pre-deployed configuration. As the node scale keeps growing and equipment is upgraded, and given the heterogeneous resource demands of different MapReduce jobs, relying solely on the pre-deployed configuration file for task scheduling causes the following problems:

When the Worker's hardware configuration is low, or the tasks running on it consume a lot of resources (for example, the running tasks have occupied a large amount of system resources such as CPU and/or memory), the Worker will still request a new task from the Master as long as it has not reached its preconfigured maximum task quota. In this case, not only may the new task fail to execute normally because of insufficient memory, but the tasks already executing may also be affected, and the Worker may even fail.

When the Worker's hardware configuration is high, or the tasks running on it consume few resources, the Worker will no longer request new tasks from the Master once it has reached its preconfigured maximum task quota. In this case, the Worker's resources are wasted.

(2) On the Master side, when the Master receives a task request from a Worker, it schedules a task to this Worker according to its own policy. Because jobs in a MapReduce system behave differently, the resources their tasks require differ considerably. The resource amount required by a task that the Master allocates to a Worker can therefore easily exceed the resource amount available on that Worker, causing the task to fail and affecting other tasks running on that Worker.

In summary, in the task scheduling mechanism of current MapReduce systems, on the one hand the Worker relies only on pre-deployed configuration information for scheduling, so that when node configurations and task types differ, Workers become overloaded, or are underloaded and waste resources; on the other hand, the Master cannot schedule tasks to Workers with suitable load, which likewise leads to Worker overload or underload and reduces the operating efficiency of the MapReduce system.

Summary of the invention

Embodiments of the present invention provide a task scheduling method, system, and equipment, to solve the problem in the task scheduling mechanisms of existing data processing systems that working nodes are overloaded or poorly utilized because working node load is not considered.

The task scheduling method provided by an embodiment of the present invention is applied to a data processing system provided with a main node and a plurality of working nodes, wherein the main node is used for scheduling tasks and the working nodes are used for executing tasks. The method comprises the following steps:

A working node sends a request for acquiring a task to the main node, wherein the request carries the available resource amount of the working node and the resource usage of each task executed on the working node;

The main node determines the resource demand of each type of task on the main node according to the resource usage, sent by each working node, of each task executed on that node, and schedules tasks for the working node that sent the request according to the determined resource demand of each type of task and the available resource amount of that working node.

The data processing system provided by an embodiment of the present invention comprises main node equipment and a plurality of pieces of working node equipment;

The working node equipment is used for sending a request for acquiring a task to the main node, wherein the request carries the available resource amount of the working node and the resource usage information of each task executed on the working node;

The main node equipment is used for determining the resource demand of each type of task on the main node according to the resource usage, sent by each working node, of each task executed on that node, and for scheduling tasks for the working node that sent the request according to the determined resource demand of each type of task and the available resource amount of that working node.

The main node equipment provided by an embodiment of the present invention is applied to a data processing system provided with main node equipment and a plurality of pieces of working node equipment, wherein the working node equipment is used for executing the tasks allocated by the main node equipment. The main node equipment comprises:

a task information statistics module, used for determining, after receiving the request for acquiring a task sent by a working node, the resource demand of each type of task on the main node according to the resource usage, sent by each working node, of each task executed on that node;

a task scheduling module, used for scheduling tasks for the working node according to the resource demand of each type of task determined by the task information statistics module and the available resource amount of the working node carried in the received request.

In the above embodiments of the present invention, the working node sends the main node its own available resource amount and the resource usage of each task executed on it. When scheduling tasks for this working node, the main node can thus calculate the resource demand of each type of task on the main node in advance from the per-task resource usage reported by each working node, and can then schedule tasks for the working node according to the resource demand of each type of task and the available resource amount of the requesting working node. Because working node load and task demand are introduced as the basis for task scheduling, tasks whose resource demand matches the working node's actual load can be allocated. This avoids overloading the working node on the one hand and improves the working node's resource utilization on the other, thereby improving the operating efficiency of the whole data processing system.

An embodiment of the present invention also provides working node equipment that sends the available resource amount of the working node and the resource usage of each task executed on the working node to the main node equipment, as the basis on which the main node equipment schedules tasks.

The working node equipment provided by an embodiment of the present invention is applied to a data processing system provided with main node equipment and a plurality of pieces of working node equipment, wherein the main node equipment is used for scheduling tasks. The working node equipment comprises a task execution module for executing the tasks allocated by the main node, and further comprises:

a resource monitoring module, used for monitoring the available resource amount of the working node and the resource usage of each task executed on the working node;

a sending module, used for sending a request for acquiring a task to the main node, wherein the request carries the available resource amount of the working node and the resource usage of each task executed on the working node, as monitored by the resource monitoring module.

In the above embodiment of the present invention, when a request for acquiring a task is sent to the main node, the available resource amount of the working node and the resource usage of each task executed on it are sent to the main node equipment, which enables the main node equipment to schedule tasks on the basis of the resource demand of tasks and the available resource amount of the working node.

Description of drawings

Fig. 1 is a schematic diagram of the architecture of an existing MapReduce system;

Fig. 2 is a schematic flowchart of task scheduling in an existing MapReduce system;

Fig. 3 is a schematic structural diagram of the Worker and the Master in the MapReduce system provided by an embodiment of the present invention;

Fig. 4 is a schematic flowchart of task scheduling in the MapReduce system provided by an embodiment of the present invention.

Detailed description of the embodiments

To address the above problems in the task scheduling mechanism of existing MapReduce systems, embodiments of the present invention propose a task scheduling scheme that is applicable to MapReduce systems and is based on the actual load of the Worker.

The task scheduling scheme may be implemented as a task scheduling method, or as a MapReduce system and related equipment; it may be implemented in hardware, in software, or in a combination of software and hardware. The present invention is described below with reference to the accompanying drawings and specific embodiments.

The architecture of the MapReduce system provided by the embodiments of the present invention is as shown in Fig. 1, but the Worker and the Master are each improved.

The following functions are added to the Worker:

The Worker can decide whether to request a new task from the Master according to its own load or resource usage. When the Worker's load is high, it does not request a new task even if it has not reached the maximum task quota specified in the configuration file; when the Worker's load is low, it still requests a new task even if it has reached the maximum task quota;

When requesting a task from the Master, the Worker also reports its own available resource amount and the resource usage statistics of each task running on it, as the basis on which the Master schedules tasks.

The following functions are added to the Master:

The Master statistically analyzes the resource amount each type of task needs for execution, based on the task resource usage information reported by the Workers. In a MapReduce system, all tasks of the same type within the same Job (for example, all Map tasks or all Reduce tasks) may have different input data but follow the same processing flow, so the multiple Map tasks, and likewise the multiple Reduce tasks, of one job have similar resource demands. The resource amount a Map or Reduce task uses when executed on one Worker therefore reflects the resource amount this type of task needs. As Workers report the resource amounts used by the tasks executing on them, the Master gradually builds an accurate picture of the resource demand of each type of task, which provides the basis for task scheduling;

When the Master receives a Worker's request for a new task, it allocates a suitable task to this Worker according to the available resource amount reported by the Worker and the per-type execution resource demand the Master has compiled, so that the resource demand of the allocated task stays, as far as possible, within the load the Worker can bear while making the fullest possible use of the Worker's load capacity.
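The statistics function added to the Master could be sketched roughly as below. The class name `TaskDemandStats`, the `(job_id, kind)` key, and the choice of a running maximum as the aggregation are all assumptions made for illustration; the text only says that reported usage is statistically analyzed per type of task.

```python
from typing import Dict, Optional, Tuple


class TaskDemandStats:
    """Running per-type demand estimate, keyed by (job_id, kind), e.g. ("job1", "map")."""

    def __init__(self) -> None:
        self._demand: Dict[Tuple[str, str], Dict[str, float]] = {}

    def report(self, job_id: str, kind: str, usage: Dict[str, float]) -> None:
        # Assumption: keep the largest usage seen so far for each resource.
        current = self._demand.setdefault((job_id, kind), {})
        for resource, amount in usage.items():
            current[resource] = max(current.get(resource, 0.0), amount)

    def demand(self, job_id: str, kind: str) -> Optional[Dict[str, float]]:
        # None means no Worker has reported a task of this type yet.
        return self._demand.get((job_id, kind))
```

Because all Map tasks (or all Reduce tasks) of one Job share a processing flow, a single report already gives a usable estimate for the remaining tasks of that type, and later reports refine it.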

Based on the above functions of the Worker and the Master, the internal module structure and connections of the Worker and the Master may be as shown in Fig. 3. Fig. 3 shows only one Master and one Worker, together with the connections between their modules.

As shown in Fig. 3, in addition to conventional modules such as the transceiver module 301 (used for receiving and sending signals, for example sending heartbeat signals and receiving heartbeat response signals) and the task execution module 302 (used for executing the tasks obtained from the Master), the Worker also comprises a resource monitoring module 303 and a task request decision module 304. The resource monitoring module 303 is a newly added module, and the task request decision module 304 may be obtained by improving the original task request decision module.

Resource monitoring module 303: on the one hand, it monitors the load of the Worker itself, including the usage of resources such as CPU, Mem (memory), and Disk, for example the resource amount already used and the remaining usable resource amount, and can provide the monitoring results to the task request decision module 304. On the other hand, it monitors the resource amount used by each task on the Worker, which may include the usage of resources such as CPU, Mem (memory), and Disk, and sends the resource amount used by each task and the available resource amount of this Worker to the Master through the transceiver module 301. For each task on the Worker, the module can monitor and record the resource amount the task uses over a period of time. Preferably, the maximum resource amount the task used during that period is reported as the task's resource usage. This avoids, as far as possible, the situation in which inaccurate usage statistics (for example, a figure lower than the maximum the task actually needs at run time) lead the Master to schedule a task to a Worker with insufficient remaining resources.
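The preferred "report the maximum over a period" behaviour could look roughly like the sketch below. `TaskUsageWindow`, its fixed-length sample window, and the resource field names are hypothetical; how samples are actually collected from the operating system is outside this sketch.

```python
from collections import defaultdict, deque


class TaskUsageWindow:
    """Keeps the last `window_size` samples per task and reports the peak of each resource."""

    def __init__(self, window_size: int = 5) -> None:
        self._samples = defaultdict(lambda: deque(maxlen=window_size))

    def record(self, task_id: str, cpu: float, mem_mb: float, disk_mb: float) -> None:
        self._samples[task_id].append((cpu, mem_mb, disk_mb))

    def peak(self, task_id: str):
        samples = self._samples.get(task_id)  # .get() avoids creating an empty entry
        if not samples:
            return None
        # Take the maximum of each resource column independently.
        cpu, mem_mb, disk_mb = (max(column) for column in zip(*samples))
        return {"cpu": cpu, "mem_mb": mem_mb, "disk_mb": disk_mb}
```

Reporting the peak rather than the latest or average sample errs on the side of over-estimating demand, which matches the stated goal of not sending a task to a Worker with too little remaining resource.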

Task request decision module 304: it abandons the original decision rule based on the task quota specified in the pre-configured file and instead decides whether to request a new task according to the node load monitored by the resource monitoring module 303. It can send the decision result to the transceiver module 301, which generates a heartbeat signal and sends it to the Master. Specifically, if a specified condition is met, namely that the node load does not exceed the set load threshold, the Worker requests a new task from the Master; otherwise it does not. The specified condition can be expressed as:

$$(L_{CPU} < T_{CPU}) \land (L_{MEM} < T_{MEM}) \land (L_{DISK} < T_{DISK})$$

Here, L_* denotes the Worker's current load for each resource, and T_* denotes the corresponding threshold preset by the system. The condition states that the Worker may request a new task when its CPU usage does not exceed the CPU usage threshold preset by the system, its memory usage does not exceed the preset memory usage threshold, and its disk usage does not exceed the preset disk usage threshold.

The above conditional expression is only an example. In practice the expression may be varied, for example by removing a decision factor (such as not checking disk usage) or by adding other decision factors. Any expression that can determine whether the Worker's load exceeds the load threshold preset by the system falls within the protection scope of the present invention.
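A minimal sketch of this decision rule follows, using the strict comparison from the example expression. The function name `should_request_task` and the dictionary layout are assumptions; driving the check from the threshold dictionary is one simple way to add or drop decision factors as the preceding paragraph allows.

```python
from typing import Dict


def should_request_task(load: Dict[str, float], threshold: Dict[str, float]) -> bool:
    """True when every resource that has a configured threshold is below it.

    Resources absent from `threshold` are simply not checked.
    """
    return all(load[name] < limit for name, limit in threshold.items())
```

For example, omitting `"disk"` from the thresholds reproduces the variant in which disk usage is not checked.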

As shown in Fig. 3, in addition to conventional modules such as the transceiver module 310 (used for receiving and sending signals, for example receiving heartbeat signals and sending heartbeat response signals), the Master also comprises a task information statistics module 311 and a task scheduling module 312. The task information statistics module 311 is a newly added module, and the task scheduling module 312 may be obtained by improving the original task scheduling module.

Task information statistics module 311: mainly used for receiving the task resource usage information sent by each Worker and analyzing it statistically to obtain the resource amount each type of task needs for execution, as the basis on which the task scheduling module 312 schedules tasks;

Task scheduling module 312: whereas the existing task scheduling module does not consider the resource amount each task needs when scheduling, the improved task scheduling module 312, after receiving a Worker's request for a task together with the available resource amount of the Worker's node, selects a task whose required resource amount is less than this Worker's available resource amount, based on the per-type resource demand compiled by the task information statistics module 311, and allocates it to this Worker.
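The selection performed by the task scheduling module 312 might be sketched as follows. `pick_task` and its arguments are hypothetical. Two points are assumptions not settled by the text: the first fitting task in queue order is chosen, and a task type with no demand estimate yet is treated as fitting so that it can run once and produce a first measurement.

```python
from typing import Dict, Hashable, List, Optional, Tuple


def pick_task(
    pending: List[Tuple[str, Hashable]],           # (task_id, task_type) in queue order
    demand_by_type: Dict[Hashable, Dict[str, float]],
    available: Dict[str, float],
) -> Optional[str]:
    """Return the first pending task whose estimated demand is below what the Worker has free."""
    for task_id, task_type in pending:
        demand = demand_by_type.get(task_type, {})  # assumption: unknown types are allowed
        if all(need < available.get(res, 0.0) for res, need in demand.items()):
            return task_id
    return None
```

A stricter deployment could instead refuse unknown types or assign them a conservative default demand.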

It should be noted that the above division of the Worker and the Master into modules is only one of many possible divisions. Those skilled in the art will understand that, as long as the Worker and the Master have the functions described above, whether or how they are divided into modules does not limit the protection scope of the present invention.

The task scheduling process is described in detail below, taking the Worker and Master structure shown in Fig. 3 as an example and referring to the flow shown in Fig. 4. As shown in Fig. 4, the flow comprises:

Step 401: the Worker triggers the heartbeat message sending flow. The sending of heartbeat messages may be triggered periodically or by events; for example, when a task finishes and the system has idle resources, or when the node's remaining resources (such as DISK) are insufficient, the Worker may actively trigger the sending of a heartbeat message.

Step 402: the task request decision module 304 in the Worker reads from the resource monitoring module 303 the node's current load (such as CPU/MEM/DISK) and the statistics on the resource amounts occupied by each task running on it.

The resource monitoring module 303 may calculate these figures by methods such as computing the average amount of a given type of resource (such as CPU) used on the node by each process of a task. The resource monitoring module 303 may monitor and collect statistics according to a set statistics period.

Step 403: the task request decision module 304 in the Worker analyzes the node load obtained from the resource monitoring module 303 and decides whether to request a new task. If the node load does not exceed the load threshold specified by the system, a new task is requested; otherwise no new task is requested.

Step 404: the transceiver module 301 of the Worker sends a heartbeat message to the Master. The heartbeat message carries the "whether to request a task" flag, the available resource amount of the Worker node as monitored by the resource monitoring module 303, and the resource amounts occupied by each task running on it. When the task request decision module 304 of the Worker decides to request a new task, the flag is set to true; otherwise it is set to false.
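As a rough illustration of what the heartbeat message of step 404 carries, the sketch below models it as a pair of data classes serialized to JSON. The field names, the JSON encoding, and the classes `Heartbeat` and `TaskUsage` are assumptions; the text does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class TaskUsage:
    task_id: str
    job_id: str
    kind: str          # "map" or "reduce"
    cpu: float
    mem_mb: float
    disk_mb: float


@dataclass
class Heartbeat:
    worker_id: str
    wants_task: bool                 # the "whether to request a task" flag
    available: Dict[str, float]      # resources still free on the Worker node
    task_usage: List[TaskUsage] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested TaskUsage entries.
        return json.dumps(asdict(self))
```

The flag and the resource figures travel in the same message, so the Master can update its statistics even when the Worker is not asking for work.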

Step 405: the transceiver module 310 of the Master receives the heartbeat message sent by the Worker. The task information statistics module 311 calculates the resource amount each type of task needs for execution from the resource amounts, carried in the heartbeat message, occupied by each task running on this Worker, with reference to the resource amounts reported by other Workers for the tasks running on their respective nodes.

Step 406: the task scheduling module 312 of the Master checks the "whether to request a task" flag in the heartbeat message and the available resource amount of the Worker node. When the flag is true, the Master uses the per-type resource demand calculated by the task information statistics module 311 to choose the set of tasks whose resource demand does not exceed the available resource amount of the Worker node, selects a task from this set, and schedules it to this Worker.

Step 407: the Master returns a heartbeat response to the Worker. If the Master has allocated a task to the Worker, the allocated task is returned to the Worker.

After the Worker receives the task returned by the Master, the task execution module 302 executes the received task.

In the above flow, if in step 406 the Master currently has no task whose resource demand does not exceed the available resource amount of the Worker node, it may return task scheduling failure information to the Worker and end the flow.
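Steps 405 to 407, including the failure response, can be tied together in a single handler as sketched below. The function `handle_heartbeat`, the dictionary-based message layout, and the status strings are hypothetical, and the running-maximum estimate is an assumed stand-in for the unspecified statistical analysis.

```python
from typing import Dict, List, Tuple


def handle_heartbeat(hb: dict, demand: Dict[str, Dict[str, float]],
                     pending: List[Tuple[str, str]]) -> dict:
    # Step 405: fold the reported per-task usage into the per-type estimates.
    for usage in hb["task_usage"]:
        estimate = demand.setdefault(usage["task_type"], {})
        for res, amount in usage["used"].items():
            estimate[res] = max(estimate.get(res, 0.0), amount)

    # Step 406: only schedule when the Worker asked for a task.
    if not hb["wants_task"]:
        return {"status": "ok", "task": None}
    for i, (task_id, task_type) in enumerate(pending):
        need = demand.get(task_type, {})
        if all(amount <= hb["available"].get(res, 0.0) for res, amount in need.items()):
            del pending[i]
            # Step 407: return the allocated task in the heartbeat response.
            return {"status": "ok", "task": task_id}

    # No pending task fits the Worker's available resources.
    return {"status": "schedule_failed", "task": None}
```

Statistics are updated before the flag is checked, so usage reports are never lost when a loaded Worker declines to request work.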

As the above description shows, compared with the existing scheme, on the one hand the Worker decides whether to request a new task according to its actual load, which to some extent eliminates the Worker overload or underload caused in the existing scheme by deciding solely on the basis of the pre-deployed configuration file, and thus better guarantees normal and efficient execution on the Worker. On the other hand, the Master statistically analyzes the resource amount used by each type of task based on the per-task resource usage information sent by the Workers, and can therefore schedule suitable tasks to Workers more accurately, which to some extent avoids the overload that occurs when a Worker with insufficient remaining resources executes a new task.

In another embodiment of the present invention, the task request decision module of the Worker still uses the existing approach to decide whether to request a new task, that is, it requests a new task if the number of tasks on the Worker has not reached the task quota. Unlike the prior art, however, the Worker also sends, with the heartbeat message, the available resource amount of the node and the resource usage of the tasks executed on the node, as monitored by the resource monitoring module. After the Master receives the Worker's request for a new task, the task information statistics module determines the resource demand of each task on the Master according to the resource usage reported by each Worker for the tasks executed on its node. The task scheduling module then schedules tasks according to the resource demand of each task determined by the task information statistics module and the available resource amount of this Worker. Specifically, it allocates to this Worker a task whose resource demand is less than the Worker's currently available resource amount; if no such task can be allocated, the Master may return a task allocation failure response.

In this embodiment, when the Worker's load is high at the time it requests a task, a task with relatively small resource demand can be allocated to it, or no new task is allocated to it, which relieves the load pressure on the Worker to some extent compared with the existing task scheduling mechanism. When the Worker's load is low at the time it requests a task, a task with relatively large resource demand can be allocated to it, which improves the Worker's resource utilization to some extent compared with the existing task scheduling mechanism.
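One possible reading of "a larger task for a lightly loaded Worker, a smaller one for a heavily loaded Worker" is a best-fit choice: among the tasks that fit, take the one with the largest demand. The sketch below follows that reading only as an assumption. `pick_best_fit` is a hypothetical name, ranking by a single resource (memory by default) is a simplification, and task types without an estimate are skipped here.

```python
from typing import Dict, List, Optional, Tuple


def pick_best_fit(
    pending: List[Tuple[str, str]],               # (task_id, task_type)
    demand_by_type: Dict[str, Dict[str, float]],
    available: Dict[str, float],
    resource: str = "mem",                        # resource used to rank fitting tasks
) -> Optional[str]:
    """Among tasks whose demand is below the free resources, pick the most demanding one."""
    fitting = [
        (demand_by_type[task_type].get(resource, 0.0), task_id)
        for task_id, task_type in pending
        if task_type in demand_by_type
        and all(need < available.get(res, 0.0)
                for res, need in demand_by_type[task_type].items())
    ]
    if not fitting:
        return None  # corresponds to the task allocation failure response
    return max(fitting)[1]
```

With 1000 units of free memory the 700-unit task is preferred over the 200-unit one, while a Worker with only 300 units free receives the smaller task.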

The above embodiments of the present invention are applicable not only to MapReduce systems but also to data processing systems of similar structure, for example a data processing system provided with a main node and a plurality of working nodes, wherein the main node is used for scheduling tasks and the working nodes are used for executing tasks.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to cover them as well.

Claims

1. A task scheduling method, applied to a data processing system provided with a main node and a plurality of working nodes, wherein the main node is used for scheduling tasks and the working nodes are used for executing tasks, characterized in that the method comprises the following steps:

2. The method of claim 1, characterized in that, when scheduling tasks for the working node, the main node allocates to the working node a task whose resource demand does not exceed the available resource amount of the working node.

3. The method of claim 1, characterized in that the resource usage of a task sent by the working node to the main node is the maximum resource usage of the task within a set duration.

4. The method of claim 1, characterized in that the working node sends the request for acquiring a task through a heartbeat message when the sending of a heartbeat message is triggered;

each of the working nodes sends the resource usage of each task executed on it to the main node through a heartbeat message when the sending of a heartbeat message is triggered.

5. The method of any one of claims 1 to 4, characterized in that the working node sends the request for acquiring a task to the main node when it determines that its own load does not exceed a set load threshold.

6. The method of any one of claims 1 to 4, characterized in that the data processing system is a MapReduce system.

7. Working node equipment, applied to a data processing system provided with main node equipment and a plurality of pieces of working node equipment, wherein the main node equipment is used for scheduling tasks and the working node equipment comprises a task execution module used for executing the tasks allocated by the main node, characterized in that the working node equipment further comprises:

8. The working node equipment of claim 7, characterized in that the sending module is specifically used for sending the maximum resource usage of a task within a set duration, as monitored by the resource monitoring module.

9. The working node equipment of claim 7, characterized in that the sending module is specifically used for sending the request for acquiring a task through a heartbeat message when the sending of a heartbeat message is triggered.

10. The working node equipment of any one of claims 7 to 9, characterized in that it further comprises:

a task request decision module, used for instructing the sending module to send the request for acquiring a task when it determines, according to the available resource amount of the working node monitored by the resource monitoring module, that the load of the working node does not exceed a set load threshold;

the sending module is further used for sending the request for acquiring a task according to the instruction of the task request decision module.

11. Main node equipment, applied to a data processing system provided with main node equipment and a plurality of pieces of working node equipment, wherein the working node equipment is used for executing the tasks allocated by the main node equipment, characterized in that the main node equipment comprises:

12. The main node equipment of claim 11, characterized in that the task scheduling module is specifically used for allocating to the working node a task whose resource demand does not exceed the available resource amount of the working node.

13. The main node equipment of claim 11, characterized in that the task information statistics module is specifically used for receiving the heartbeat message sent by each working node when the sending of a heartbeat message is triggered, and parsing from the heartbeat message the resource usage of each task executed on the working node.

14. A data processing system, characterized in that it comprises main node equipment and a plurality of pieces of working node equipment;

15. The data processing system of claim 14, characterized in that, when scheduling tasks for the working node, the main node allocates to the working node a task whose resource demand does not exceed the available resource amount of the working node.

16. The data processing system of claim 14, characterized in that the working node sends the request for acquiring a task to the main node when it determines that its own load does not exceed a set load threshold.

17. The data processing system of any one of claims 14 to 16, characterized in that the data processing system is a MapReduce system.