CN106354553A

CN106354553A - Task scheduling method and device based on resource estimation in big data system

Info

Publication number: CN106354553A
Application number: CN201510411512.1A
Authority: CN
Inventors: 朱泓; 钟咏; 曾东; 张聪; 夏峻峰; 李小东
Original assignee: MIGU Music Co Ltd
Current assignee: MIGU Music Co Ltd
Priority date: 2015-07-14
Filing date: 2015-07-14
Publication date: 2017-01-25

Abstract

The invention discloses a task scheduling method based on resource estimation in a big data system. The method includes: performing resource estimation on a received task and adding the task to a task list; estimating the current idle resources of the system; and scheduling the tasks in the task list according to the relationship between the total amount of resources required by those tasks and the current idle resources of the system. The invention further discloses a task scheduling device based on resource estimation in a big data system.

Description

Task scheduling method and device based on resource estimation in a big data system

Technical field

The present invention relates to the technical field of task planning and scheduling, and in particular to a task scheduling method and device based on resource estimation in a big data system.

Background

A big data system stores large volumes of data with complex structure, runs many kinds of tasks, processes a large amount of data per task, and has complex dependencies between tasks. Big data systems are very powerful in both computation and storage, but for any specific big data system, its time resources and storage resources are fixed over a given period. Therefore, only by scheduling the tasks in the system reasonably, so that they execute in a coordinated manner, can the limited resources of the system be used to full effect and the true value of the big data system be realized.

A complete task scheduling process includes at least two parts: task resource estimation and task execution planning. However, because the data volume of a big data system is huge and its tasks are numerous and varied, there is currently no effective method for resource estimation, and in practice task resource estimation is generally abandoned. Task execution planning is generally implemented on control-flow principles. This approach is workable when the number of tasks is small, but as the number of tasks grows and task dependencies become complex, it becomes both inefficient and very difficult to implement.

In summary, providing a task scheduling scheme based on resource estimation, which can estimate task resources and complete task planning and scheduling accurately and efficiently, has become an urgent problem.

Summary of the invention

In view of this, embodiments of the present invention are intended to provide a task scheduling method and device based on resource estimation in a big data system, which can estimate task resources and complete task planning and scheduling accurately and efficiently, and which are simple to implement and highly reliable.

To achieve the above purpose, the technical solution of the embodiments of the present invention is realized as follows:

An embodiment of the present invention provides a task scheduling method based on resource estimation in a big data system, the method including:

performing resource estimation on a received task, and adding the task to a task list;

estimating the current idle resources of the system, and scheduling the tasks in the task list according to the relationship between the total amount of resources required by the tasks in the task list and the current idle resources of the system.

In the above scheme, performing resource estimation on the received task includes:

obtaining the data source information of the received task; when the scale of the obtained data source is determined to meet a first condition, choosing n data blocks from the data blocks contained in the data source as the data source of an estimation task; running the estimation task and recording the resources it consumes; and estimating the resources required by the received task from the resources consumed by the estimation task; where n is a positive integer.

In the above scheme, choosing n data blocks from the data blocks contained in the data source as the data source of the estimation task includes:

sorting the data blocks contained in the data source, randomly selecting one data block as the first data block, and then selecting one data block every $\lfloor m/n \rfloor$ data blocks until n data blocks have been selected; where m is the number of data blocks contained in the data source, and m is a positive integer.

In the above scheme, scheduling the tasks in the task list according to the relationship between the total amount of resources required by the tasks in the task list and the current idle resources of the system includes:

when the total amount of resources required by the tasks in the task list is determined to be not greater than the current idle resources of the system, starting all tasks in the task list;

when the total amount of resources required by the tasks in the task list is determined to be greater than the current idle resources of the system, starting the tasks in the task list in turn according to their priority and, among tasks of equal priority, starting the task that occupies fewer resources first.

In the above scheme, starting the tasks in the task list in turn according to their priority includes:

applying for resource occupation for the tasks in the task list in turn according to their priority, and starting the tasks whose resource occupation applications succeed in turn according to their priority.

An embodiment of the present invention further provides a task scheduling device based on resource estimation in a big data system, the device including a processing module and a scheduling module, wherein:

the processing module is configured to perform resource estimation on a received task and add the task to a task list;

the scheduling module is configured to estimate the current idle resources of the system, and to schedule the tasks in the task list according to the relationship between the total amount of resources required by the tasks in the task list and the current idle resources of the system.

In the above scheme, the processing module is specifically configured to obtain the data source information of the received task; when the scale of the obtained data source is determined to meet a first condition, to choose n data blocks from the data blocks contained in the data source as the data source of an estimation task; to run the estimation task and record the resources it consumes; and to estimate the resources required by the received task from the resources consumed by the estimation task; where n is a positive integer.

In the above scheme, the processing module is specifically configured to sort the data blocks contained in the data source, randomly select one data block as the first data block, and then select one data block every $\lfloor m/n \rfloor$ data blocks until n data blocks have been selected; where m is the number of data blocks contained in the data source, and m is a positive integer.

In the above scheme, the scheduling module is specifically configured to start all tasks in the task list when the total amount of resources required by the tasks in the task list is determined to be not greater than the current idle resources of the system;

In the above scheme, the scheduling module is specifically configured to apply for resource occupation for the tasks in the task list in turn according to their priority, and to start the tasks whose resource occupation applications succeed in turn according to their priority.

With the task scheduling method and device based on resource estimation in a big data system provided by the embodiments of the present invention, resource estimation is performed on a received task and the task is added to a task list; the current idle resources of the system are estimated, and the tasks in the task list are scheduled according to the relationship between the total amount of resources required by those tasks and the current idle resources of the system. In this way, task resources can be estimated and task planning and scheduling completed accurately and efficiently, with a simple and highly reliable implementation.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the task scheduling method based on resource estimation in a big data system according to Embodiment One of the present invention;

Fig. 2 is a schematic flowchart of the task scheduling method based on resource estimation in a big data system according to Embodiment Two of the present invention;

Fig. 3 is a schematic structural diagram of the task scheduling device based on resource estimation in a big data system according to an embodiment of the present invention.

Detailed description

The storage strategy of a big data system is to distribute data as randomly and uniformly as possible across the nodes of the cluster. Task planning and scheduling are usually based on two aspects: the running time cost and the storage cost that a task requires. When the system environment is unchanged, time cost and storage cost depend mainly on the volume of data the task processes, its computation logic, and the time complexity of its algorithm. For a given task, the processing logic and algorithm complexity are fixed; therefore, the time cost and storage cost of the task are proportional to the volume of data it processes.

In the embodiments of the present invention, resource estimation is performed on a received task and the task is added to a task list; the current idle resources of the system are estimated, and the tasks in the task list are scheduled according to the relationship between the total amount of resources required by the tasks in the task list and the current idle resources of the system.

Fig. 1 is a schematic flowchart of the task scheduling method based on resource estimation in a big data system according to Embodiment One of the present invention. As shown in Fig. 1, the method includes:

Step 101: perform resource estimation on the received task, and add the task to the task list;

Here, the task may be a data processing task, and one or more tasks may be received;

Performing resource estimation on the received task includes:

obtaining the data source information of the received task; when the scale of the obtained data source is determined to meet a first condition, choosing n data blocks from the data blocks contained in the data source as the data source of an estimation task; running the estimation task and recording the resources it consumes; and estimating the resources required by the received task from the resources consumed by the estimation task; where n is a positive integer;

Here, the resources include time resources and storage resources.

Further, obtaining the data source information of the received task includes:

parsing the task description file of the received task to obtain the data source information of the task.

Further, determining that the scale of the obtained data source meets the first condition includes:

determining that the total number of data blocks contained in the obtained data source reaches a preset data block threshold; where the data block threshold can be set according to actual needs.

Further, choosing n data blocks from the data blocks contained in the data source as the data source of the estimation task includes:

sorting the data blocks contained in the data source, randomly selecting one data block as the first data block, and then selecting one data block every $\lfloor m/n \rfloor$ data blocks, wrapping around to the start of the sequence when the end is reached, until n data blocks have been selected; where m is the number of data blocks contained in the data source, and m is a positive integer;

Here, sorting the data blocks contained in the data source may mean sorting them randomly, or sorting them according to a set rule;

The value of $\lfloor m/n \rfloor$ is the largest integer not exceeding $m/n$; for example, when $5 \le m/n < 6$, the value of $\lfloor m/n \rfloor$ is 5;

The size of n can be set as needed; preferably, n takes the value […], i.e. the largest integer less than […];
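The selection rule above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `select_sample_blocks`, its parameters, and the reading of the interval as a stride of floor(m/n) positions are all assumptions made for the example.

```python
import random


def select_sample_blocks(blocks, n, sort_key=None, rng=None):
    """Pick n blocks by equal-interval sampling with wrap-around.

    Assumption: "every floor(m/n) blocks" is read as a stride of
    floor(m/n) positions in the sorted sequence.
    """
    m = len(blocks)
    if not 1 <= n <= m:
        raise ValueError("n must satisfy 1 <= n <= m")
    rng = rng or random.Random()
    # Sort by a set rule (sort_key); a random order would also be allowed.
    ordered = sorted(blocks, key=sort_key)
    step = m // n              # the interval floor(m/n), at least 1
    start = rng.randrange(m)   # randomly selected first block
    # Indices wrap to the head of the sequence when the tail is reached.
    return [ordered[(start + i * step) % m] for i in range(n)]
```

Because (n - 1) * floor(m/n) is always smaller than m, the n selected positions never collide under this reading, so no block is sampled twice.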

In this embodiment, selecting one data block every $\lfloor m/n \rfloor$ data blocks is equal-interval sampling, which effectively reduces the systematic error of resource estimation. The sampling error can be calculated from the law of large numbers as the sampling error rate $z_{\alpha/2}\,\sigma/\sqrt{n}$, where $z_{\alpha/2}$ is the reliability coefficient for confidence level $1-\alpha$: at a confidence level of 95% the coefficient is 1.96, and at 90% it is 1.645; the higher the confidence level, the larger the sample size required. σ is the standard deviation (the square root of the variance), reflecting the degree of deviation between individual sample values and the population mean; the more dispersed the sample values, the larger the variance and the more samples are needed. n is the sample size; the more samples, the smaller the error;

Further, a threshold for n can be set according to actual needs; preferably, the threshold for n may be 10.
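As a worked illustration of the error estimate, the sketch below assumes the sampling error takes the standard form z·σ/√n, with z the two-sided normal quantile for the chosen confidence level. The formula itself is not legible in this text, so that form, the function name `sampling_error`, and the example inputs are all assumptions.

```python
from math import sqrt
from statistics import NormalDist


def sampling_error(sigma, n, confidence=0.95):
    """Margin of error z * sigma / sqrt(n) (assumed standard form)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    # Two-sided quantile: about 1.96 for 95% and about 1.645 for 90%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)
```

Under this form, quadrupling the sample size halves the error, which matches the statement that more samples give a smaller error.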

Further, running the estimation task and recording the resources it consumes includes:

executing the received task on each of the n selected data blocks, and collecting and recording, for the n data blocks, run information such as CPU consumption and storage consumption from task submission to task completion.

Here, CPU consumption is the time resource occupied by running the task, and storage consumption is the storage resource occupied by running the task; CPU consumption belongs to time resources, and storage consumption belongs to storage resources.
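A minimal sketch of recording the consumption of one sampled block might look like the following. The helper name `run_and_measure` is hypothetical, and peak Python heap usage reported by `tracemalloc` is used only as a stand-in for storage consumption; a real big data system would read these figures from its own job metrics.

```python
import time
import tracemalloc


def run_and_measure(task, block):
    """Run task(block) and return (result, cpu_seconds, peak_bytes)."""
    tracemalloc.start()
    cpu_start = time.process_time()
    try:
        result = task(block)
        # CPU time consumed between submission and completion.
        cpu_seconds = time.process_time() - cpu_start
        # Peak traced memory, used here as a proxy for storage consumption.
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, cpu_seconds, peak_bytes
```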

Further, estimating the resources required by the received task from the resources consumed by the estimation task includes:

determining, from the resources consumed by the estimation task, the average resource required per data block among the n selected data blocks, and determining the resources required by the received task from that per-block average and the total number of data blocks in the data source.
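The extrapolation step can be sketched as below, assuming (as the description states) that cost is proportional to data volume. The `Usage` record and the function name are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    cpu_seconds: float     # time resource
    storage_bytes: float   # storage resource


def estimate_task_resources(sample_usages, total_blocks):
    """Scale the per-block average of the samples to the whole data source."""
    n = len(sample_usages)
    if n == 0:
        raise ValueError("at least one sampled block is required")
    avg_cpu = sum(u.cpu_seconds for u in sample_usages) / n
    avg_storage = sum(u.storage_bytes for u in sample_usages) / n
    return Usage(avg_cpu * total_blocks, avg_storage * total_blocks)
```

For example, two samples costing 2 s and 4 s of CPU average to 3 s per block, so a data source of 50 blocks would be estimated at 150 s.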

Step 102: estimate the current idle resources of the system, and schedule the tasks in the task list according to the relationship between the total amount of resources required by the tasks in the task list and the current idle resources of the system;

Here, estimating the current idle resources of the system includes:

obtaining the current idle resources of the system by querying the system; this is prior art and is not described further here.

Further, scheduling the tasks in the task list according to the relationship between the total amount of resources required by the tasks in the task list and the current idle resources of the system includes:

when the total amount of resources required by the tasks in the task list is determined to be greater than the current idle resources of the system, starting the tasks in the task list in turn according to their priority and, among tasks of equal priority, starting the task that occupies fewer resources first. This prevents large tasks from blocking small ones and improves task scheduling efficiency. Here, the priority of a task can be set according to actual needs.
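The scheduling decision can be illustrated with a short sketch. It assumes a single scalar resource amount per task and that a larger number means a higher priority; both are simplifications chosen for the example, as are the names `Task` and `start_order`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    priority: int     # assumption: larger value = higher priority
    required: float   # estimated resource amount (single scalar here)


def start_order(tasks, idle):
    """Return the order in which tasks would be started."""
    total = sum(t.required for t in tasks)
    if total <= idle:
        # Everything fits into the idle resources: start all tasks.
        return list(tasks)
    # Otherwise: by priority, then smaller resource occupation first.
    return sorted(tasks, key=lambda t: (-t.priority, t.required))
```

With tasks a (priority 1, 50), b (priority 2, 80) and c (priority 2, 30) and 100 units idle, the order is c, b, a: the two priority-2 tasks go first, and the smaller of them leads.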

Further, starting the tasks in the task list in turn according to their priority includes:

Further, applying for resource occupation for the tasks in the task list in turn according to their priority includes:

when the current idle resources of the system are determined to satisfy the demand of the current task, pre-allocating the amount of resources required by the task, deducting that amount from the current idle resources of the system, and determining that the resource occupation application of the current task has succeeded; this continues until the resource occupation applications of all tasks in the task list have succeeded;

when the current idle resources of the system are determined not to satisfy the demand of the current task, checking again at a set interval whether the current idle resources satisfy the demand, until they do; where the interval can be set according to actual needs.
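The application loop described above can be sketched as follows. `ResourcePool` is a hypothetical in-memory stand-in for querying the real system, and the `sleep` parameter is injectable only so the waiting behaviour can be exercised without real delays.

```python
import time


class ResourcePool:
    """Hypothetical view of the system's idle resources."""

    def __init__(self, idle):
        self.idle = idle

    def query_idle(self):
        return self.idle

    def reserve(self, amount):
        # Pre-allocate: deduct the amount from the idle resources.
        self.idle -= amount


def apply_for_resources(ordered_tasks, pool, poll_seconds=1.0, sleep=time.sleep):
    """ordered_tasks: (name, required) pairs, already sorted by priority."""
    granted = []
    for name, required in ordered_tasks:
        # Re-check at a fixed interval until the demand can be met.
        while pool.query_idle() < required:
            sleep(poll_seconds)
        pool.reserve(required)
        granted.append(name)   # application succeeded for this task
    return granted
```

Note that in this sketch a task that cannot yet be satisfied holds up all tasks behind it; whether lower-priority tasks may overtake it is not specified in the description.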

Further, after this step, the method also includes: feeding back the running result of the task; specifically, outputting the running result of the task as a file.

Fig. 2 is a schematic flowchart of the task scheduling method based on resource estimation in a big data system according to Embodiment Two of the present invention. As shown in Fig. 2, the method includes:

Step 201: receive a task submitted by a user and perform resource estimation on the task;

Here, the resources include time resources and storage resources.

In this embodiment, selecting one data block every $\lfloor m/n \rfloor$ data blocks is equal-interval sampling, which effectively reduces the systematic error of resource estimation. The sampling error can be calculated from the law of large numbers as the sampling error rate $z_{\alpha/2}\,\sigma/\sqrt{n}$, where $z_{\alpha/2}$ is the reliability coefficient for confidence level $1-\alpha$: at a confidence level of 95% the coefficient is 1.96, and at 90% it is 1.645; the higher the confidence level, the larger the sample size required. σ is the standard deviation (the square root of the variance), reflecting the degree of deviation between individual sample values and the population mean; the more dispersed the sample values, the larger the variance and the more samples are needed. n is the sample size; the more samples, the smaller the error.

Step 202: add the task to the task list and estimate the current idle resources of the system;

Here, estimating the current idle resources of the system includes:

CPU consumption is the time resource occupied by running a task, and storage consumption is the storage resource occupied by running a task; CPU consumption belongs to time resources, and storage consumption belongs to storage resources.

Step 203: judge whether the total amount of resources required by the tasks in the task list exceeds the current idle resources of the system; if it does, execute Step 204; if it does not, execute Step 205;

Step 204: start the tasks in the task list in turn according to their priority and, among tasks of equal priority, start the task that occupies fewer resources first;

Here, the priority of a task can be set according to actual needs;

Starting the tasks in the task list in turn according to their priority includes:

Step 205: start all tasks in the task list.

Step 206: feed back the task running result and end this processing flow;

Here, feeding back the task running result includes: outputting the running result of the task as a file.
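Steps 201 to 206 can be tied together in a small driver, sketched below. The callables `estimate`, `query_idle` and `start`, the dictionary layout of a task, and the JSON result file are all hypothetical choices for the example. The sketch also assumes that the branch in which the total does not exceed the idle resources leads to Step 205 (start all tasks), and it omits the waiting for resources that Step 204 may involve.

```python
import json
from pathlib import Path


def run_flow(task, task_list, estimate, query_idle, start, result_path):
    """Drive one pass of the flow and return the start order by name."""
    # Step 201: receive the task and estimate its resources.
    required = estimate(task)
    # Step 202: add it to the task list and estimate idle resources.
    task_list.append({**task, "required": required})
    idle = query_idle()
    # Step 203: compare the total demand with the idle resources.
    total = sum(t["required"] for t in task_list)
    if total > idle:
        # Step 204: by priority, smaller occupation first on ties.
        order = sorted(task_list, key=lambda t: (-t["priority"], t["required"]))
    else:
        # Step 205: start all tasks.
        order = list(task_list)
    results = {t["name"]: start(t) for t in order}
    # Step 206: feed the running results back as a file.
    Path(result_path).write_text(json.dumps(results), encoding="utf-8")
    return [t["name"] for t in order]
```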

Fig. 3 is a schematic structural diagram of the task scheduling device based on resource estimation in a big data system according to an embodiment of the present invention. As shown in Fig. 3, the device includes a processing module 31 and a scheduling module 32, wherein:

the processing module 31 is configured to perform resource estimation on a received task and add the task to a task list;

the scheduling module 32 is configured to estimate the current idle resources of the system, and to schedule the tasks in the task list according to the relationship between the total amount of resources required by the tasks in the task list and the current idle resources of the system.

Further, the processing module 31 performing resource estimation on the received task includes:

the processing module 31 obtains the data source information of the received task; when the scale of the obtained data source is determined to meet a first condition, it chooses n data blocks from the data blocks contained in the data source as the data source of an estimation task, runs the estimation task and records the resources it consumes, and estimates the resources required by the received task from the resources consumed by the estimation task; where n is a positive integer;

Here, the resources include time resources and storage resources.

Further, the processing module 31 obtaining the data source information of the received task includes:

the processing module 31 parses the task description file of the received task to obtain the data source information of the task.

Further, the processing module 31 determining that the scale of the obtained data source meets the first condition includes:

the processing module 31 determines that the total number of data blocks contained in the obtained data source reaches a preset data block threshold; where the data block threshold can be set according to actual needs.

Further, the processing module 31 choosing n data blocks from the data blocks contained in the data source as the data source of the estimation task includes:

the processing module 31 sorts the data blocks contained in the data source, randomly selects one data block as the first data block, and then selects one data block every $\lfloor m/n \rfloor$ data blocks, wrapping around to the start of the sequence when the end is reached, until n data blocks have been selected; where m is the number of data blocks contained in the data source, and m is a positive integer;

In this embodiment, selecting one data block every $\lfloor m/n \rfloor$ data blocks is equal-interval sampling, which effectively reduces the systematic error of resource estimation. The sampling error can be calculated from the law of large numbers as the sampling error rate $z_{\alpha/2}\,\sigma/\sqrt{n}$, where $z_{\alpha/2}$ is the reliability coefficient for confidence level $1-\alpha$: at a confidence level of 95% the coefficient is 1.96, and at 90% it is 1.645; the higher the confidence level, the larger the sample size required. σ is the standard deviation (the square root of the variance), reflecting the degree of deviation between individual sample values and the population mean; the more dispersed the sample values, the larger the variance and the more samples are needed. n is the sample size; the more samples, the smaller the error;

Further, the processing module 31 running the estimation task and recording the resources it consumes includes:

the processing module 31 executes the received task on each of the n selected data blocks, and collects and records, for the n data blocks, run information such as CPU consumption and storage consumption from task submission to task completion.

Further, the processing module 31 estimating the resources required by the received task from the resources consumed by the estimation task includes:

the processing module 31 determines, from the resources consumed by the estimation task, the average resource required per data block among the n selected data blocks, and determines the resources required by the received task from that per-block average and the total number of data blocks in the data source.

Further, the scheduling module 32 estimating the current idle resources of the system includes:

the scheduling module 32 can obtain the current idle resources of the system by querying the system; this is prior art and is not described further here.

Further, the scheduling module 32 scheduling the tasks in the task list according to the relationship between the total amount of resources required by the tasks in the task list and the current idle resources of the system includes:

when the scheduling module 32 determines that the total amount of resources required by the tasks in the task list is not greater than the current idle resources of the system, it starts all tasks in the task list;

Further, the scheduling module 32 starting the tasks in the task list in turn according to their priority includes:

the scheduling module 32 applies for resource occupation for the tasks in the task list in turn according to their priority, and starts the tasks whose resource occupation applications succeed in turn according to their priority.

Further, the scheduling module 32 applying for resource occupation for the tasks in the task list in turn according to their priority includes:

when the scheduling module 32 determines that the current idle resources of the system satisfy the demand of the current task, it pre-allocates the amount of resources required by the task, deducts that amount from the current idle resources of the system, and determines that the resource occupation application of the current task has succeeded; this continues until the resource occupation applications of all tasks in the task list have succeeded;

Further, the device also includes a feedback module 33, configured to feed back the running result of the task.

In the embodiments of the present invention, the processing module 31, the scheduling module 32 and the feedback module 33 may each be implemented by a central processing unit (CPU), a digital signal processor (DSP) or a field programmable gate array (FPGA) in a server.

The above are only preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention.

Claims

1. A task scheduling method based on resource estimation in a big data system, characterized in that the method includes:

2. The method according to claim 1, characterized in that performing resource estimation on the received task includes:

3. The method according to claim 2, characterized in that choosing n data blocks from the data blocks contained in the data source as the data source of the estimation task includes:

sorting the data blocks contained in the data source, randomly selecting one data block as the first data block, and then selecting one data block every $\lfloor m/n \rfloor$ data blocks until n data blocks have been selected; where m is the number of data blocks contained in the data source, and m is a positive integer.

4. The method according to claim 1 or 2, characterized in that scheduling the tasks in the task list according to the relationship between the total amount of resources required by the tasks in the task list and the current idle resources of the system includes:

5. The method according to claim 4, characterized in that starting the tasks in the task list in turn according to their priority includes:

6. A task scheduling device based on resource estimation in a big data system, characterized in that the device includes a processing module and a scheduling module, wherein:

7. The device according to claim 6, characterized in that the processing module is specifically configured to obtain the data source information of the received task; when the scale of the obtained data source is determined to meet a first condition, to choose n data blocks from the data blocks contained in the data source as the data source of an estimation task; to run the estimation task and record the resources it consumes; and to estimate the resources required by the received task from the resources consumed by the estimation task; where n is a positive integer.

8. The device according to claim 7, characterized in that the processing module is specifically configured to sort the data blocks contained in the data source, randomly select one data block as the first data block, and then select one data block every $\lfloor m/n \rfloor$ data blocks until n data blocks have been selected; where m is the number of data blocks contained in the data source, and m is a positive integer.

9. The device according to claim 6 or 7, characterized in that the scheduling module is specifically configured to start all tasks in the task list when the total amount of resources required by the tasks in the task list is determined to be not greater than the current idle resources of the system;

10. The device according to claim 9, characterized in that the scheduling module is specifically configured to apply for resource occupation for the tasks in the task list in turn according to their priority, and to start the tasks whose resource occupation applications succeed in turn according to their priority.