CN106502792A - Multi-tenant resource priority scheduling method for different load types - Google Patents

Multi-tenant resource priority scheduling method for different load types

Info

Publication number
CN106502792A
Authority
CN
China
Prior art keywords: node, scheduling, task, job, tenant
Prior art date: 2016-10-20
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610916059.4A
Other languages
Chinese (zh)
Other versions
CN106502792B (en)
Inventor
林伟伟
温昂展
张子龙
张国强
李进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201610916059.4A priority Critical patent/CN106502792B/en
Publication of CN106502792A publication Critical patent/CN106502792A/en
Application granted granted Critical
Publication of CN106502792B publication Critical patent/CN106502792B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system

Abstract

The present invention relates to a multi-tenant resource priority scheduling method for different load types, comprising the following steps: 1. a tenant of the system submits a job, which is added to the job queue; 2. job load information is collected and sent to the resource manager; 3. the resource manager determines the job's load type from the job load information and sends the type information to the job scheduler; 4. the job scheduler schedules the job according to its load type: a compute-intensive job is scheduled on the current node immediately, while an I/O-intensive job waits under delay scheduling; 5. the job scheduling decision information is collected and sent to the scheduling reconstruction decision maker, which reconstructs the target compute nodes, and the job is scheduled according to the final decision result. The method lets multiple tenants share a cluster, reducing the cost of building separate clusters while allowing larger data set resources to be shared among tenants. By optimizing for different load types it achieves better data locality, strikes a good balance between fairness and efficiency in job scheduling, and improves the computing performance of the whole cluster, such as throughput and job response time.

Description

Multi-tenant resource priority scheduling method for different load types
Technical field
The present invention relates to multi-tenant resource management techniques, and more particularly to a multi-tenant resource priority scheduling method for different load types.
Background technology
In recent years, network information technology has advanced rapidly, and a contradiction has emerged between the massive amount of digital information and people's ability to obtain the information they need. On the other hand, distributed computing frameworks such as MapReduce, Dryad and Spark make it possible for users to find the right information in massive data repositories within a tolerable time. Traditional Web multi-tenant systems must handle complex business logic and therefore also need to guarantee a high level of data consistency and support complex data queries. In contrast, multi-tenant systems built on big data platforms are typically characterized by very large data volumes, no requirement for strict consistency, support for only simple queries, and dynamic expansion of available resources. A big data platform provides storage and computing application services to its users; it benefits from the powerful resource management and task scheduling of the open-source Hadoop framework and its load balancing mechanism, and many powerful big data processing components, such as Hive, Spark and Storm, can easily be deployed on top of it. This gives it an unrivalled advantage for sharing cluster resources among many applications. However, an architecture based on shared application instances aims at higher tenant density and lower management and maintenance costs, so realizing a multi-tenant architecture on a big data platform faces challenges in data locality, on-demand scaling and performance optimization.
A multi-tenant shared cluster is a typical Hadoop scenario, but resource sharing introduces competition among jobs: the probability that a free computing resource holds the data a given job needs to process decreases, leading to low data locality. As a result, data often has to be copied from remote nodes, and the resulting network bandwidth consumption and low computational efficiency are significant problems.
In a cloud computing platform based on the Hadoop framework, the data required by a computing resource may reside at a different physical location, so data must be migrated; this is the data locality problem. Since computational efficiency is only guaranteed when the data and the computing task are on the same node, the degree of data locality is an important factor determining the efficiency of cloud computing under the Hadoop framework. How to improve data locality, save network bandwidth, increase task execution efficiency, and maintain overall cluster throughput while guaranteeing the quality of service of user jobs is a key problem that must be solved when applying a multi-user shared cluster.
To solve the data locality problem, the core idea is to move computation rather than data, that is, to move computing tasks closer to their data. Based on this idea, Matei Zaharia et al. in the United States proposed the delay scheduling algorithm on top of the Fair Scheduler. Its core idea is to let a ready job give up several scheduling opportunities within a certain period of time, until the timeout expires or the job has a computing task whose data is close to the offered computing resource, at which point it accepts the offered resource; otherwise it declines. Zhang Boyu et al. studied delay scheduling further and gave an ideal scheme for choosing the delay interval. Tao Yongcai et al. fully considered cluster load balancing on top of delay scheduling and proposed DDS (Dynamic Delay Scheduling), a dynamic delay scheduling mechanism based on load balancing. These studies show that a Fair Scheduler using the delay scheduling algorithm can strike a good balance between fairness and efficiency in job scheduling. Mohammad Hammoud et al. in the United States designed LARTS, a data-locality-aware Reduce task scheduler that uses node network locations and reducer partition sizes as scheduling decision conditions, significantly improving cluster efficiency. Similarly, Shadi Ibrahim et al. in the United States proposed a locality-aware algorithm, implemented by modifying the hash algorithm of the partition function, and Xiangping Bu et al. proposed a locality-aware scheduling algorithm suitable for MapReduce virtual clusters, which maintains data locality while mitigating interference among MapReduce programs.
Although much research has been done on multi-tenant resource management in recent years, current research on multi-tenant resource priority scheduling for big data platforms focuses mainly on fairness of resource scheduling. Meanwhile, most existing delay scheduling algorithms use a fixed waiting time and do not take the different load types of jobs into account. If most of a job's data is concentrated on one node or a few data nodes in the cluster, its computing tasks may likewise be concentrated on one node or a few nodes, resulting in low parallel efficiency and long response times. Therefore, performing multi-tenant resource priority scheduling according to the different load types of jobs is an important problem to be solved in the field of multi-tenant resource management.
Summary of the invention
In view of the deficiencies of the prior art described above, the object of the present invention is to provide a multi-tenant resource priority scheduling method for different load types, which decides whether to apply delay scheduling according to the job's load type and reconfigures the target compute nodes, so as to improve job parallelism while keeping a high proportion of localized scheduling, reduce network I/O load, and improve job execution efficiency.
To achieve the above object, the present invention adopts the following technical scheme.
A multi-tenant resource priority scheduling method for different load types, characterized by comprising the following steps:
Step 1: a tenant of the multi-tenant system submits a job, which is placed in the job queue;
Step 2: job load information is collected and sent to the resource manager;
Step 3: the resource manager obtains the data read/write rate of the Map-Shuffle phase and the disk I/O bandwidth from the job load information, determines the job's load type by comparing the two, and sends the type information to the job scheduler;
Step 4: the job scheduler decides how to schedule the job according to its load type: a compute-intensive job is scheduled on the current node immediately, while an I/O-intensive job undergoes delay scheduling; the scheduling decision information is collected and sent to the scheduling reconstruction decision maker;
Step 5: the scheduling reconstruction decision maker reconstructs the target compute nodes according to the decision information, obtains the final scheduling result, and executes the job scheduling;
Step 6: when a new user submits a job, steps 1 to 5 above are repeated.
Further, in step 1, to support the multi-tenant shared cluster, the resource manager uses Hadoop's existing Fair Scheduler to make a preliminary division of resources.
Further, in step 3, the job load type is determined as follows:
a distributed computing job is divided into three phases, Map, Shuffle and Reduce, and the load type of a job is distinguished by the relative sizes of the data read/write rate of its Map-Shuffle phase and the disk I/O bandwidth. Let MID, MOD, SOD and SID denote the Map read data size, the Map write data size, the Shuffle write data size and the Shuffle read data size, respectively. Let the Map output ratio be ρ, so that MOD = ρ·MID; since the data written by Shuffle is the data written by Map, SOD = MOD. Let n be the number of tasks running concurrently on a node and MCT the completion time of the phase. The data read/write rate of the job can then be computed as
MIOR = n·(MID + MOD + SOD + SID)/MCT = n·((1 + 2ρ)·MID + SID)/MCT    (1)
Let the disk I/O bandwidth be DIOR. The decision relation is: (1) if MIOR < DIOR, the job is compute-intensive; (2) if MIOR ≥ DIOR, the job is I/O-intensive.
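The decision above reduces to evaluating equation (1) and comparing the result with the disk I/O bandwidth. The following Java sketch illustrates it; the class, field and method names (LoadTypeClassifier, JobLoadInfo, classify) and the sample numbers in main are our own illustrative assumptions, not identifiers from the patent or from Hadoop; only the formula and the MIOR/DIOR rule come from the text.

```java
/**
 * Minimal sketch of the load-type decision of step 3.
 * All identifiers are illustrative; only equation (1) and the
 * MIOR >= DIOR rule come from the text.
 */
public class LoadTypeClassifier {

    public enum LoadType { COMPUTE_INTENSIVE, IO_INTENSIVE }

    /** Per-job measurements collected by the job load monitor (assumed units: MB and seconds). */
    public static final class JobLoadInfo {
        final double mapInputSize;        // MID: data read by Map
        final double mapOutputRatio;      // rho: MOD = rho * MID (and SOD = MOD)
        final double shuffleReadSize;     // SID: data read by Shuffle
        final int    concurrentTasks;     // n: tasks running on the node at the same time
        final double phaseCompletionTime; // MCT: completion time of the Map-Shuffle phase

        JobLoadInfo(double mid, double rho, double sid, int n, double mct) {
            this.mapInputSize = mid;
            this.mapOutputRatio = rho;
            this.shuffleReadSize = sid;
            this.concurrentTasks = n;
            this.phaseCompletionTime = mct;
        }
    }

    /** Equation (1): MIOR = n * ((1 + 2*rho) * MID + SID) / MCT. */
    public static double dataReadWriteRate(JobLoadInfo info) {
        return info.concurrentTasks
                * ((1 + 2 * info.mapOutputRatio) * info.mapInputSize + info.shuffleReadSize)
                / info.phaseCompletionTime;
    }

    /** MIOR < DIOR means compute-intensive; MIOR >= DIOR means I/O-intensive. */
    public static LoadType classify(JobLoadInfo info, double diskIoBandwidth) {
        return dataReadWriteRate(info) < diskIoBandwidth
                ? LoadType.COMPUTE_INTENSIVE
                : LoadType.IO_INTENSIVE;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 128 MB Map input, rho = 0.5, 64 MB shuffle read,
        // 2 concurrent tasks, 10 s phase time, 80 MB/s disk bandwidth.
        JobLoadInfo job = new JobLoadInfo(128, 0.5, 64, 2, 10);
        System.out.println("MIOR = " + dataReadWriteRate(job) + " MB/s");
        System.out.println("Type = " + classify(job, 80.0));
    }
}
```

In a real deployment the JobLoadInfo fields would be filled from the job load monitor described in step 2.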
Further, in step 4, delay scheduling is performed as follows:
if a job cannot start a node-local task on the currently offered node, the job is skipped for the moment and other jobs that satisfy the locality requirement are scheduled instead. Two time thresholds t1 and t2 are set, with t1 < t2. If the accumulated time for which a job has been skipped exceeds threshold t1, the scheduler allows it to start a rack-local task on the current node, i.e. a task whose required data is not on this node but on another node in the same rack as this node. Once the accumulated time for which the job has been skipped exceeds the longer threshold t2, the scheduler allows the job to start tasks on any node, so as to guarantee fairness.
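A minimal Java sketch of this rule follows. The class name, the locality levels and the way skipped time is accumulated are illustrative assumptions; only the escalation itself (node-local by default, rack-local once the skipped time exceeds t1, any node once it exceeds t2) comes from the text, and compute-intensive jobs would bypass this policy entirely, as described in step 4.

```java
/**
 * Sketch of the delay-scheduling decision of step 4 (illustrative names;
 * only the t1/t2 escalation rule comes from the text).
 */
public class DelaySchedulingPolicy {

    public enum Allowed { NODE_LOCAL_ONLY, UP_TO_RACK_LOCAL, ANY_NODE }

    private final long t1Millis;    // shorter threshold t1
    private final long t2Millis;    // longer threshold t2 (t1 < t2)
    private long skippedMillis = 0; // accumulated time this job has been skipped

    public DelaySchedulingPolicy(long t1Millis, long t2Millis) {
        if (t1Millis >= t2Millis) throw new IllegalArgumentException("require t1 < t2");
        this.t1Millis = t1Millis;
        this.t2Millis = t2Millis;
    }

    /** Called when the job declines an offer because no node-local task is available. */
    public void recordSkip(long elapsedMillis) {
        skippedMillis += elapsedMillis;
    }

    /** Locality level the scheduler will currently accept for this job. */
    public Allowed allowedLocality() {
        if (skippedMillis > t2Millis) return Allowed.ANY_NODE;         // fairness fallback
        if (skippedMillis > t1Millis) return Allowed.UP_TO_RACK_LOCAL; // rack-local allowed
        return Allowed.NODE_LOCAL_ONLY;                                // keep waiting for locality
    }

    public static void main(String[] args) {
        DelaySchedulingPolicy policy = new DelaySchedulingPolicy(3000, 9000); // hypothetical t1 = 3 s, t2 = 9 s
        policy.recordSkip(2000);
        System.out.println(policy.allowedLocality()); // NODE_LOCAL_ONLY
        policy.recordSkip(2000);
        System.out.println(policy.allowedLocality()); // UP_TO_RACK_LOCAL
        policy.recordSkip(6000);
        System.out.println(policy.allowedLocality()); // ANY_NODE
    }
}
```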
Further, in step 5, the target compute nodes are reconstructed as follows:
let C_i^m be the computation cost of the i-th task on the m-th machine. Assuming the tasks on a single node are executed serially, Σ_i C_i^m represents the time needed to execute all of the job's tasks serially on the m-th machine. The earliest finish time T_min is chosen, and the number of tasks each node can run within T_min is counted and recorded as the computing capacity of the job on that node. The computing capacity is divided into a localized task capacity L and a migrated task capacity M, representing the number of tasks that can be computed locally and the number of tasks that can be computed after migration, respectively. The target compute nodes are then reconstructed: each node selects localized tasks according to its localization capacity, giving priority to I/O-intensive tasks, and a selected task takes that node as its target compute node; if the job still has unassigned tasks, the remaining tasks are assigned in turn, as migrated tasks, to nodes whose migration capacity is not zero.
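The following Java sketch shows one possible reading of this reconstruction. All identifiers are illustrative; the capacities L and M are taken as inputs rather than derived from T_min, and since the text does not fix the order in which migrated tasks are handed out, the sketch sends each leftover task to the node with the most free migration capacity, which happens to reproduce the worked example given later in the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the target-node reconstruction of step 5. Identifiers are
 * illustrative; capacities L and M are assumed to have been computed
 * already (tasks runnable within T_min on each node).
 */
public class ScheduleReconstructor {

    /** A task of some job; dataNode is the node holding its input split. */
    public record Task(String job, int index, String dataNode, boolean ioIntensive) {}

    /** Per-node capacities: L localized tasks and M migrated tasks within T_min. */
    public record Capacity(int localized, int migrated) {}

    /** Returns a map from task to target node. */
    public static Map<Task, String> reconstruct(List<Task> tasks, Map<String, Capacity> capacities) {
        Map<Task, String> assignment = new LinkedHashMap<>();
        Map<String, Integer> localLeft = new LinkedHashMap<>();
        Map<String, Integer> migrateLeft = new LinkedHashMap<>();
        capacities.forEach((node, c) -> {
            localLeft.put(node, c.localized());
            migrateLeft.put(node, c.migrated());
        });

        // Pass 1: localized tasks, I/O-intensive ones first, each on its data node.
        List<Task> ordered = new ArrayList<>(tasks);
        ordered.sort((a, b) -> Boolean.compare(b.ioIntensive(), a.ioIntensive()));
        List<Task> leftover = new ArrayList<>();
        for (Task t : ordered) {
            int free = localLeft.getOrDefault(t.dataNode(), 0);
            if (free > 0) {
                assignment.put(t, t.dataNode());
                localLeft.put(t.dataNode(), free - 1);
            } else {
                leftover.add(t);
            }
        }

        // Pass 2: remaining tasks become migrated tasks. The text does not fix an
        // order, so this sketch sends each leftover task to the node with the most
        // free migration capacity (which reproduces the worked example).
        for (Task t : leftover) {
            String best = null;
            for (String node : migrateLeft.keySet()) {
                if (migrateLeft.get(node) > 0
                        && (best == null || migrateLeft.get(node) > migrateLeft.get(best))) {
                    best = node;
                }
            }
            if (best != null) {
                assignment.put(t, best);
                migrateLeft.put(best, migrateLeft.get(best) - 1);
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // The worked example from the embodiment: Job-1 (I/O-intensive) has all 4
        // splits on Node-1; Job-2 (compute-intensive) has splits on Node-1 and Node-2.
        List<Task> tasks = List.of(
                new Task("Job-1", 1, "Node-1", true), new Task("Job-1", 2, "Node-1", true),
                new Task("Job-1", 3, "Node-1", true), new Task("Job-1", 4, "Node-1", true),
                new Task("Job-2", 1, "Node-1", false), new Task("Job-2", 2, "Node-1", false),
                new Task("Job-2", 3, "Node-2", false), new Task("Job-2", 4, "Node-2", false));
        Map<String, Capacity> caps = new LinkedHashMap<>();
        caps.put("Node-1", new Capacity(4, 0));
        caps.put("Node-2", new Capacity(2, 2));
        caps.put("Node-3", new Capacity(0, 4));
        reconstruct(tasks, caps).forEach((t, node) ->
                System.out.println(t.job() + " task " + t.index() + " -> " + node));
    }
}
```

Running main assigns the four tasks of Job-1 to Node-1, tasks 3 and 4 of Job-2 to Node-2, and tasks 1 and 2 of Job-2 to Node-3, i.e. the assignment {<1,3>, <2,3>, <3,2>, <4,2>} of the worked example.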
Compared with existing multi-tenant resource priority scheduling methods, the present invention has the following advantages:
(1) A multi-tenant resource priority scheduling method for different load types is proposed. By collecting job load information and using the relative sizes of the data read/write rate of each job's Map-Shuffle phase and the disk bandwidth as the criterion, different load types are distinguished, and whether to apply delay scheduling is decided according to the job's load type. This further improves system throughput while achieving resource localization.
(2) The proposed reconstruction-based resource priority scheduling method spreads computing tasks across the nodes of the cluster while keeping jobs localized as far as possible, making full use of cluster resources and achieving distributed load balancing.
(3) The proposed reconfiguration of target compute nodes also takes the different load types of jobs into account and preferentially localizes I/O-intensive jobs, which increases job concurrency and reduces job response time. Compared with other multi-tenant resource optimization scheduling algorithms it achieves higher resource utilization.
Description of the drawings
Fig. 1 is a flowchart of the multi-tenant resource priority scheduling method for different load types in the example.
Fig. 2 is a schematic diagram of an implementation of the multi-tenant resource priority scheduling method for different load types.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings, but the implementation and scope of protection of the present invention are not limited thereto. It should be noted in particular that any part not described in detail below can be implemented by those skilled in the art with reference to the prior art.
As shown in Fig. 1, the flowchart of this example, a tenant of the system submits a job, which is placed in the job waiting queue. The job load monitor collects job load information, such as the data sizes read and written during the Map-Shuffle phase, the read/write completion time, and the disk I/O bandwidth, and sends it to the resource manager. The resource manager determines the job's load type from the relative sizes of the data read/write rate of the Map-Shuffle phase and the disk I/O bandwidth. Concretely, the data read/write rate MIOR of the job is computed from the job load information, the disk I/O bandwidth DIOR is obtained from the feedback of the job load monitor, and the decision relation is: (1) if MIOR < DIOR, the job is compute-intensive; (2) if MIOR ≥ DIOR, the job is I/O-intensive. The type information is sent to the job scheduler. The job scheduler decides how to schedule the job according to its load type: a compute-intensive job is scheduled on the current node immediately, while an I/O-intensive job undergoes delay scheduling. The scheduling decision information is collected and sent to the scheduling reconstruction decision maker. The scheduling reconstruction decision maker reconstructs the target compute nodes according to the decision information: each node selects localized tasks according to its localization capacity, giving priority to I/O-intensive tasks, and a selected task takes that node as its target compute node; if the job still has unassigned tasks, the remaining tasks are assigned in turn, as migrated tasks, to nodes whose migration capacity is not zero. Finally, the job is scheduled according to the final scheduling decision. When a new user submits a job, the above steps are repeated until the system shuts down.
As shown in Fig. 2, one implementation of the multi-tenant resource priority scheduling method for different load types is given. The scheduling system consists of a job I/O monitor, a master node, a resource manager, a job scheduler, a scheduling reconstruction decision maker and compute nodes. The job I/O monitor counts and collects the waiting time for the data reads and writes required before each job is scheduled and feeds this information back to the resource manager. The master node and the resource manager receive tenants' job submission requests and perform global resource management and allocation; they distinguish the load types of jobs according to their I/O read/write waiting times and send the load type information to the job scheduler. The job scheduler decides how to schedule each job according to its load type and allocates system resources to the running jobs. The scheduling reconstruction decision maker reconfigures the scheduling result of the job scheduler, taking the load type information into account: a compute-intensive task can be preferentially scheduled to another node within the local rack, while an I/O-intensive task is preferentially scheduled to a node with data locality, so that computing tasks are spread across the nodes of the cluster while job locality is preserved as much as possible; the job is then scheduled according to the final resource priority scheduling result. The compute nodes are ordinary physical PCs; these physical servers may be heterogeneous, differing in physical resources (CPU, memory, etc.), architecture, energy consumption, and so on.
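To make the division of responsibilities concrete, here is a small Java sketch that wires stub versions of the Fig. 2 components together. The interfaces, stub logic and sample numbers are our own assumptions for illustration; they are not an API defined by the patent or by Hadoop/YARN, and only the flow (monitor, classify, schedule or delay, reconstruct the target) follows the description above.

```java
import java.util.List;

/**
 * Illustrative wiring of the Fig. 2 components (job I/O monitor, resource
 * manager, job scheduler, scheduling reconstruction decision maker). The
 * interfaces and in-memory stubs are a sketch, not a real API.
 */
public class SchedulingPipeline {

    enum LoadType { COMPUTE_INTENSIVE, IO_INTENSIVE }

    record Job(String name) {}
    record LoadReport(Job job, double mior, double dior) {}           // from the job I/O monitor
    record SchedulingDecision(Job job, LoadType type, boolean delayed) {}

    interface JobIoMonitor                { LoadReport measure(Job job); }
    interface ResourceManager             { LoadType classify(LoadReport report); }
    interface JobScheduler                { SchedulingDecision schedule(Job job, LoadType type); }
    interface ReconstructionDecisionMaker { String finalTarget(SchedulingDecision decision); }

    public static void main(String[] args) {
        // Stub components standing in for the real ones in Fig. 2 (hypothetical numbers).
        JobIoMonitor monitor = job -> new LoadReport(job, job.name().startsWith("io") ? 120 : 40, 80);
        ResourceManager rm = r -> r.mior() < r.dior() ? LoadType.COMPUTE_INTENSIVE : LoadType.IO_INTENSIVE;
        JobScheduler scheduler = (job, type) ->
                new SchedulingDecision(job, type, type == LoadType.IO_INTENSIVE); // delay only I/O-intensive jobs
        ReconstructionDecisionMaker reconstructor = d ->
                d.type() == LoadType.IO_INTENSIVE ? "data-local node" : "another node in the local rack";

        for (Job job : List.of(new Job("io-heavy-query"), new Job("cpu-heavy-model"))) {
            LoadReport report = monitor.measure(job);
            LoadType type = rm.classify(report);
            SchedulingDecision decision = scheduler.schedule(job, type);
            System.out.printf("%s: %s, delayed=%s, target=%s%n",
                    job.name(), type, decision.delayed(), reconstructor.finalTarget(decision));
        }
    }
}
```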
Example
To verify the effectiveness of the multi-tenant resource priority scheduling method for different load types, we simulate the proposed scheduling method and compare and analyze the job response time and cluster throughput after optimization.
Four ordinary PCs in the same rack are used to build a Hadoop cluster: one machine serves as the master node and the other three as slave nodes (compute nodes). It is assumed that each compute node has only one resource slot, so at any moment at most one task is running on a node; the tasks on a node are therefore completely serial and their run times on that node can simply be added up.
First, a mathematical model is built to analyze the effectiveness of deciding whether to apply delay scheduling according to the load type. Suppose a task takes about D seconds longer to run on a non-local node than on a local node, that job request arrivals are approximately Poisson, and that the interval between offered nodes that satisfy the task's data locality is t seconds. If the waiting time for submitting a node-local task is w, the expected gain of delay scheduling on the response time can be expressed as
(1 − e^(−w/t))·(D − t)    (2)
Since (1 − e^(−w/t)) is always greater than zero, obtaining a positive gain only requires D > t. If the job is compute-intensive, D clearly approaches 0, so this method schedules it immediately rather than delaying. Conversely, for an I/O-intensive job a delay is usually adopted, and setting reasonable time thresholds t1 and t2 can significantly shorten the job response time.
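A brief sketch of one way to arrive at expression (2) under the stated assumptions; the interpretive steps below are our reading and are not spelled out in the original text.

```latex
% Offers that satisfy the task's data locality are assumed to arrive as a
% Poisson process with mean inter-arrival time t, so the probability that at
% least one such offer appears within the waiting window w is
\[
  \Pr[\text{local offer within } w] \;=\; 1 - e^{-w/t}.
\]
% When a local offer is accepted, the task avoids the non-local penalty D but
% has waited on the order of t seconds for the offer, giving the expected gain
\[
  \mathbb{E}[\text{gain}] \;\approx\; \bigl(1 - e^{-w/t}\bigr)\,(D - t),
\]
% which is positive exactly when D > t, matching the conclusion drawn in the text.
```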
When many jobs are running and load is balanced across nodes, waiting still benefits throughput even if one job cannot submit a local task, because other jobs behind it in the queue may have waiting tasks that satisfy data locality. However, this may also create contention on hot nodes: if most of a job's data is concentrated on one node or a few data nodes in the cluster, its computing tasks may likewise be concentrated on one node or a few nodes, and jobs in several scheduling queues may wait for the same node to execute tasks, resulting in low parallel efficiency and long response times. To further improve system throughput, the scheduling reconstruction decision maker reconfigures the job plan and applies different scheduling strategies according to the job's load type: a compute-intensive task can be preferentially scheduled to another node within the local rack, while an I/O-intensive task is preferentially scheduled to a node with data locality. In this way, computing tasks are spread across the nodes of the cluster while job locality is preserved as much as possible, reducing contention among tasks for hot nodes.
Suppose two tenants submit jobs at the same time. From the feedback of the job load monitor and the job load type decision method, job Job-1 is judged to be I/O-intensive and job Job-2 to be compute-intensive. Each job consists of 4 tasks and the two jobs have the same amount of data to be processed, but the data fragments to be processed are distributed across 3 compute nodes. The distribution of the data fragments on each node is shown in Table 1:
Table 1. Distribution of data fragments across the three compute nodes
Node Job-1 Job-2
Node-1 4 2
Node-2 0 2
Node-3 0 0
All 4 data fragments of the I/O-intensive job Job-1 are on node 1, while the compute-intensive job Job-2 has 2 data fragments on node 1 and 2 data fragments on node 2. A pair <i,j> indicates that the data to be processed by task i is on node j, so the data distribution of Job-1 is {<1,1>, <2,1>, <3,1>, <4,1>} and the data distribution of Job-2 is {<1,1>, <2,1>, <3,2>, <4,2>}.
The target compute nodes are then reconstructed: the earliest finish time T_min of the job is chosen, and the number of tasks each node can run within T_min is counted; this is recorded as the computing capacity of each node, as shown in Table 2:
Table 2. Computing capacity of each node
Node Localized task capacity L Migrated task capacity M
Node-1 4 0
Node-2 2 2
Node-3 0 4
Each node selects localized tasks according to its localization capacity, and a selected task takes that node as its target compute node. On node 1 the I/O-intensive job Job-1 is scheduled preferentially, i.e. the localized tasks of Job-1, {<1,1>, <2,1>, <3,1>, <4,1>}, are assigned, and the localized tasks of Job-2, {<3,2>, <4,2>}, are assigned. Nodes 2 and 3 have migration task capacity; since all three nodes are in the same rack, the remaining unassigned tasks 1 and 2 of Job-2 can choose node 3 as their target compute node. The reconstructed target compute nodes are therefore {<1,3>, <2,3>, <3,2>, <4,2>}. This multi-tenant scheduling instance verifies that the optimized resource scheduling scheme saves 2 time units, so it can clearly reduce the job response time and improve system throughput.
The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the scope of protection of the present invention.

Claims (5)

1. A multi-tenant resource priority scheduling method for different load types, characterized by comprising the following steps:
Step 1: a tenant of the multi-tenant system submits a job, which is placed in the job queue;
Step 2: job load information is collected and sent to the resource manager;
Step 3: the resource manager obtains the data read/write rate of the Map-Shuffle phase and the disk I/O bandwidth from the job load information, determines the job's load type by comparing the two, and sends the type information to the job scheduler;
Step 4: the job scheduler decides how to schedule the job according to its load type: a compute-intensive job is scheduled on the current node immediately, while an I/O-intensive job undergoes delay scheduling; the scheduling decision information is collected and sent to the scheduling reconstruction decision maker;
Step 5: the scheduling reconstruction decision maker reconstructs the target compute nodes according to the decision information, obtains the final scheduling result, and executes the job scheduling;
Step 6: when a new user submits a job, steps 1 to 5 above are repeated.
2. The multi-tenant resource priority scheduling method for different load types according to claim 1, characterized in that:
in step 1, to support the multi-tenant shared cluster, the resource manager uses Hadoop's existing Fair Scheduler to make a preliminary division of resources.
3. The multi-tenant resource priority scheduling method for different load types according to claim 1, characterized in that:
in step 3, the job load type is determined as follows:
a distributed computing job is divided into three phases, Map, Shuffle and Reduce, and the load type of a job is distinguished by the relative sizes of the data read/write rate of its Map-Shuffle phase and the disk I/O bandwidth; let MID, MOD, SOD and SID denote the Map read data size, the Map write data size, the Shuffle write data size and the Shuffle read data size, respectively; let the Map output ratio be ρ, so that MOD = ρ·MID; since the data written by Shuffle is the data written by Map, SOD = MOD; let n be the number of tasks running concurrently on a node and MCT the completion time of the phase; the data read/write rate of the job is then computed as
MIOR = n·(MID + MOD + SOD + SID)/MCT = n·((1 + 2ρ)·MID + SID)/MCT    (1)
let the disk I/O bandwidth be DIOR; the decision relation is: (1) if MIOR < DIOR, the job is compute-intensive; (2) if MIOR ≥ DIOR, the job is I/O-intensive.
4. The multi-tenant resource priority scheduling method for different load types according to claim 1, characterized in that:
in step 4, delay scheduling is performed as follows:
if a job cannot start a node-local task on the currently offered node, the job is skipped for the moment and other jobs that satisfy the locality requirement are scheduled instead; two time thresholds t1 and t2 are set, with t1 < t2; if the accumulated time for which a job has been skipped exceeds threshold t1, the scheduler allows it to start a rack-local task on the current node, i.e. a task whose required data is not on this node but on another node in the same rack as this node; once the accumulated time for which the job has been skipped exceeds the longer threshold t2, the scheduler allows the job to start tasks on any node, so as to guarantee fairness.
5. The multi-tenant resource priority scheduling method for different load types according to claim 1, characterized in that:
in step 5, the target compute nodes are reconstructed as follows:
let C_i^m be the computation cost of the i-th task on the m-th machine; assuming the tasks on a single node are executed serially, Σ_i C_i^m represents the time needed to execute all of the job's tasks serially on the m-th machine; the earliest finish time T_min is chosen, and the number of tasks each node can run within T_min is counted and recorded as the computing capacity of the job on that node; the computing capacity is divided into a localized task capacity L and a migrated task capacity M, representing the number of tasks that can be computed locally and the number of tasks that can be computed after migration, respectively; the target compute nodes are reconstructed: each node selects localized tasks according to its localization capacity, giving priority to I/O-intensive tasks, and a selected task takes that node as its target compute node; if the job still has unassigned tasks, the remaining tasks are assigned in turn, as migrated tasks, to nodes whose migration capacity is not zero.
CN201610916059.4A 2016-10-20 2016-10-20 Multi-tenant resource priority scheduling method for different load types Active CN106502792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610916059.4A CN106502792B (en) 2016-10-20 2016-10-20 Multi-tenant resource priority scheduling method for different load types

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610916059.4A CN106502792B (en) 2016-10-20 2016-10-20 Multi-tenant resource priority scheduling method for different load types

Publications (2)

Publication Number Publication Date
CN106502792A true CN106502792A (en) 2017-03-15
CN106502792B CN106502792B (en) 2019-11-15

Family

ID=58318091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610916059.4A Active CN106502792B (en) 2016-10-20 2016-10-20 Multi-tenant resource priority scheduling method for different load types

Country Status (1)

Country Link
CN (1) CN106502792B (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107656805A (en) * 2017-10-20 2018-02-02 广东电网有限责任公司电力调度控制中心 A kind of electric power data job scheduling method based on Hadoop platform
CN109582119A (en) * 2018-11-28 2019-04-05 重庆邮电大学 The double-deck Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN109783225A (en) * 2018-12-12 2019-05-21 华南理工大学 A kind of tenant's priority management method and system of multi-tenant big data platform
CN109871265A (en) * 2017-12-05 2019-06-11 航天信息股份有限公司 The dispatching method and device of Reduce task
CN109992383A (en) * 2019-03-13 2019-07-09 南京苍穹浩瀚信息科技有限公司 A kind of multi-tenant big data frame dispatching method making full use of network computing resources
WO2019193443A1 (en) * 2018-04-05 2019-10-10 International Business Machines Corporation Workload management with data access awareness in a computing cluster
CN110618865A (en) * 2019-09-20 2019-12-27 中国银行股份有限公司 Hadoop task scheduling method and device
CN110750582A (en) * 2018-07-23 2020-02-04 阿里巴巴集团控股有限公司 Data processing method, device and system
US10585714B2 (en) 2018-04-05 2020-03-10 International Business Machines Corporation Workload management with data access awareness using an ordered list of hosts in a computing cluster
CN110908796A (en) * 2019-11-04 2020-03-24 北京理工大学 Multi-operation merging and optimizing system and method in Gaia system
CN111221645A (en) * 2019-11-14 2020-06-02 中国民航信息网络股份有限公司 Aviation operation processing method and system
CN111258785A (en) * 2020-01-20 2020-06-09 北京百度网讯科技有限公司 Data shuffling method and device
CN111435319A (en) * 2019-01-15 2020-07-21 阿里巴巴集团控股有限公司 Cluster management method and device
CN111480145A (en) * 2018-01-03 2020-07-31 思科技术公司 System and method for scheduling workloads according to a credit-based mechanism
CN111506407A (en) * 2020-04-14 2020-08-07 中山大学 Resource management and job scheduling method, system and medium combining Pull mode and Push mode
US10761891B2 (en) 2018-04-05 2020-09-01 International Business Machines Corporation Workload management with data access awareness by aggregating file locality information in a computing cluster
CN111831418A (en) * 2020-07-14 2020-10-27 华东师范大学 Big data analysis job performance optimization method based on delay scheduling technology
WO2021099903A1 (en) * 2019-11-18 2021-05-27 International Business Machines Corporation Multi-tenant extract transform load resource sharing
CN113553171A (en) * 2021-06-07 2021-10-26 用友汽车信息科技(上海)股份有限公司 Load balancing control method, device and computer readable storage medium
CN113992751A (en) * 2021-10-27 2022-01-28 北京八分量信息科技有限公司 Resource sharing method and device in heterogeneous network and related products
CN114020584A (en) * 2022-01-05 2022-02-08 北京并行科技股份有限公司 Operation distribution method and device and computing equipment
CN114168306A (en) * 2022-02-10 2022-03-11 北京奥星贝斯科技有限公司 Scheduling method and scheduling device
CN115686803A (en) * 2023-01-05 2023-02-03 北京华恒盛世科技有限公司 Computing task management system, method and device for scheduling policy dynamic loading
US11841871B2 (en) 2021-06-29 2023-12-12 International Business Machines Corporation Managing extract, transform and load systems

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365729A (en) * 2013-07-19 2013-10-23 哈尔滨工业大学深圳研究生院 Dynamic MapReduce dispatching method and system based on task type
CN103500123A (en) * 2013-10-12 2014-01-08 浙江大学 Parallel computation dispatch method in heterogeneous environment
CN103645954A (en) * 2013-11-21 2014-03-19 华为技术有限公司 CPU scheduling method, device and system based on heterogeneous multi-core system
CN104317658A (en) * 2014-10-17 2015-01-28 华中科技大学 MapReduce based load self-adaptive task scheduling method
US20150113542A1 (en) * 2013-10-17 2015-04-23 Nec Laboratories America, Inc. Knapsack-based sharing-aware scheduler for coprocessor-based compute clusters
CN104657221A (en) * 2015-03-12 2015-05-27 广东石油化工学院 Multi-queue peak-alternation scheduling model and multi-queue peak-alteration scheduling method based on task classification in cloud computing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365729A (en) * 2013-07-19 2013-10-23 哈尔滨工业大学深圳研究生院 Dynamic MapReduce dispatching method and system based on task type
CN103500123A (en) * 2013-10-12 2014-01-08 浙江大学 Parallel computation dispatch method in heterogeneous environment
US20150113542A1 (en) * 2013-10-17 2015-04-23 Nec Laboratories America, Inc. Knapsack-based sharing-aware scheduler for coprocessor-based compute clusters
CN103645954A (en) * 2013-11-21 2014-03-19 华为技术有限公司 CPU scheduling method, device and system based on heterogeneous multi-core system
CN104317658A (en) * 2014-10-17 2015-01-28 华中科技大学 MapReduce based load self-adaptive task scheduling method
CN104657221A (en) * 2015-03-12 2015-05-27 广东石油化工学院 Multi-queue peak-alternation scheduling model and multi-queue peak-alteration scheduling method based on task classification in cloud computing

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107656805A (en) * 2017-10-20 2018-02-02 广东电网有限责任公司电力调度控制中心 A kind of electric power data job scheduling method based on Hadoop platform
CN109871265A (en) * 2017-12-05 2019-06-11 航天信息股份有限公司 The dispatching method and device of Reduce task
CN111480145B (en) * 2018-01-03 2024-02-20 思科技术公司 System and method for scheduling workloads according to a credit-based mechanism
CN111480145A (en) * 2018-01-03 2020-07-31 思科技术公司 System and method for scheduling workloads according to a credit-based mechanism
US10761891B2 (en) 2018-04-05 2020-09-01 International Business Machines Corporation Workload management with data access awareness by aggregating file locality information in a computing cluster
WO2019193443A1 (en) * 2018-04-05 2019-10-10 International Business Machines Corporation Workload management with data access awareness in a computing cluster
US10977091B2 (en) 2018-04-05 2021-04-13 International Business Machines Corporation Workload management with data access awareness using an ordered list of hosts in a computing cluster
US10585714B2 (en) 2018-04-05 2020-03-10 International Business Machines Corporation Workload management with data access awareness using an ordered list of hosts in a computing cluster
GB2584980A (en) * 2018-04-05 2020-12-23 Ibm Workload management with data access awareness in a computing cluster
US10768998B2 (en) 2018-04-05 2020-09-08 International Business Machines Corporation Workload management with data access awareness in a computing cluster
GB2584980B (en) * 2018-04-05 2021-05-19 Ibm Workload management with data access awareness in a computing cluster
CN110750582A (en) * 2018-07-23 2020-02-04 阿里巴巴集团控股有限公司 Data processing method, device and system
CN110750582B (en) * 2018-07-23 2023-05-02 阿里巴巴集团控股有限公司 Data processing method, device and system
CN109582119B (en) * 2018-11-28 2022-07-12 重庆邮电大学 Double-layer Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN109582119A (en) * 2018-11-28 2019-04-05 重庆邮电大学 The double-deck Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN109783225A (en) * 2018-12-12 2019-05-21 华南理工大学 A kind of tenant's priority management method and system of multi-tenant big data platform
CN109783225B (en) * 2018-12-12 2023-09-08 华南理工大学 Tenant priority management method and system of multi-tenant big data platform
CN111435319A (en) * 2019-01-15 2020-07-21 阿里巴巴集团控股有限公司 Cluster management method and device
CN109992383B (en) * 2019-03-13 2022-11-22 南京苍穹浩瀚信息科技有限公司 Multi-tenant big data framework scheduling method capable of fully utilizing network computing resources
CN109992383A (en) * 2019-03-13 2019-07-09 南京苍穹浩瀚信息科技有限公司 A kind of multi-tenant big data frame dispatching method making full use of network computing resources
CN110618865B (en) * 2019-09-20 2022-07-05 中国银行股份有限公司 Hadoop task scheduling method and device
CN110618865A (en) * 2019-09-20 2019-12-27 中国银行股份有限公司 Hadoop task scheduling method and device
CN110908796A (en) * 2019-11-04 2020-03-24 北京理工大学 Multi-operation merging and optimizing system and method in Gaia system
CN111221645A (en) * 2019-11-14 2020-06-02 中国民航信息网络股份有限公司 Aviation operation processing method and system
GB2603098A (en) * 2019-11-18 2022-07-27 Ibm Multi-tenant extract transform load resource sharing
WO2021099903A1 (en) * 2019-11-18 2021-05-27 International Business Machines Corporation Multi-tenant extract transform load resource sharing
GB2603098B (en) * 2019-11-18 2022-12-14 Ibm Multi-tenant extract transform load resource sharing
CN111258785B (en) * 2020-01-20 2023-09-08 北京百度网讯科技有限公司 Data shuffling method and device
CN111258785A (en) * 2020-01-20 2020-06-09 北京百度网讯科技有限公司 Data shuffling method and device
CN111506407A (en) * 2020-04-14 2020-08-07 中山大学 Resource management and job scheduling method, system and medium combining Pull mode and Push mode
CN111831418A (en) * 2020-07-14 2020-10-27 华东师范大学 Big data analysis job performance optimization method based on delay scheduling technology
CN113553171A (en) * 2021-06-07 2021-10-26 用友汽车信息科技(上海)股份有限公司 Load balancing control method, device and computer readable storage medium
US11841871B2 (en) 2021-06-29 2023-12-12 International Business Machines Corporation Managing extract, transform and load systems
CN113992751A (en) * 2021-10-27 2022-01-28 北京八分量信息科技有限公司 Resource sharing method and device in heterogeneous network and related products
CN114020584A (en) * 2022-01-05 2022-02-08 北京并行科技股份有限公司 Operation distribution method and device and computing equipment
CN114020584B (en) * 2022-01-05 2022-05-03 北京并行科技股份有限公司 Operation distribution method and device and computing equipment
CN114168306A (en) * 2022-02-10 2022-03-11 北京奥星贝斯科技有限公司 Scheduling method and scheduling device
CN115686803A (en) * 2023-01-05 2023-02-03 北京华恒盛世科技有限公司 Computing task management system, method and device for scheduling policy dynamic loading

Also Published As

Publication number Publication date
CN106502792B (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN106502792B (en) A kind of multi-tenant priority scheduling of resource method towards different type load
Liu et al. Priority-based consolidation of parallel workloads in the cloud
CN102822798B (en) Method and apparatus for the intrasystem resource capacity assessment of virtual container
CN104375897B (en) Cloud computing resource scheduling method based on the unbalanced degree of minimum relative load
CN106406987A (en) Task execution method and apparatus in cluster
CN103699433B (en) One kind dynamically adjusts number of tasks purpose method and system in Hadoop platform
CN108469988A (en) A kind of method for scheduling task based on isomery Hadoop clusters
Khan et al. Load balancing in grid computing: Taxonomy, trends and opportunities
Liu et al. Preemptive hadoop jobs scheduling under a deadline
CN106681823A (en) Load balancing method for processing MapReduce data skew
Rathinaraja et al. Dynamic ranking-based MapReduce job scheduler to exploit heterogeneous performance in a virtualized environment
Biswas et al. A novel resource aware scheduling with multi-criteria for heterogeneous computing systems
Polezhaev et al. Network resource control system for HPC based on SDN
Mirtaheri et al. Adaptive load balancing dashboard in dynamic distributed systems
Zhao et al. A dynamic dispatching method of resource based on particle swarm optimization for cloud computing environment
CN111522637B (en) Method for scheduling storm task based on cost effectiveness
Nzanywayingoma et al. Task scheduling and virtual resource optimising in Hadoop YARN-based cloud computing environment
CN112214328A (en) Load balancing method for transaction processing in isomorphic multilink
Jadhav et al. Performance evaluation in distributed system using dynamic load balancing
Gouasmi et al. Geo-distributed bigdata processing for maximizing profit in federated clouds environment
Han et al. Stochastic Matrix Modelling and Scheduling Algorithm of Distributed Intelligent Computing System
Roy et al. A hybrid task scheduling algorithm for efficient task management in multi-cloud environment
Altevogt et al. The IBM performance simulation framework for cloud
Ma et al. A Deployment scheme of virtual machines for campus cloud platform
Bonald et al. Performance of balanced fairness in resource pools: a recursive approach

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant