CN110275765A - Data parallel job scheduling method based on branch DAG dependency - Google Patents


Info

Publication number
CN110275765A
Authority
CN
China
Prior art keywords
branch
scheduler object
dag
urgency
time
Prior art date
Legal status
Granted
Application number
CN201910514403.0A
Other languages
Chinese (zh)
Other versions
CN110275765B (en)
Inventor
Li Dongsheng (李东升)
Hu Zhiyao (胡智尧)
Zhang Yiming (张一鸣)
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201910514403.0A priority Critical patent/CN110275765B/en
Publication of CN110275765A publication Critical patent/CN110275765A/en
Application granted granted Critical
Publication of CN110275765B publication Critical patent/CN110275765B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/901 - Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 - Graphs; Linked lists
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 - Task transfer initiation or dispatching
    • G06F 9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 - Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Abstract

The invention discloses a data parallel job scheduling method based on branch-DAG dependencies, comprising the following steps: 1. the job end receives jobs; 2. the DAG task graph of each job is traversed to find its branches, the branch synchronizations, and the suspended branches among them; 3. the suspended branches of every DAG graph at the job end are collected into a suspended-branch set B; 4. a branch scheduling algorithm is executed on the branches in B to obtain a branch scheduling sequence P; 5. when computing resources are available, execution units are allocated and branch tasks executed according to the branch scheduling sequence P; 6. steps 3 to 5 are repeated until every branch of every DAG graph at the job end has been executed. By determining the urgency of each branch, the invention delays the scheduling of non-urgent branches, saves computing resources for more urgent jobs, and accelerates the completion of branch synchronizations. Compared with other scheduling methods, the disclosed method reduces the average job completion time by 10-15%.

Description

Data parallel job scheduling method based on branch-DAG dependencies
Technical field
The invention belongs to the field of parallel and distributed computing, and in particular relates to a data parallel job scheduling method based on branch-DAG dependencies.
Background technique
Big data analysis jobs, such as machine learning, graph computation, and stream computing, have become a key part of daily life. Platforms such as Hadoop and Spark were proposed to process data parallel jobs efficiently. However, this raises several challenging technical issues, such as job scheduling and network communication. For a big data analysis job, the job completion time (JCT) is an extremely important metric. The JCT is the period from the submission of a data parallel job to its completion. A data parallel job consists of multiple computation stages and the network communication between those stages. The stages must execute in a specified order so that dependencies are not violated; these dependencies form a directed acyclic graph, so the whole job can be represented by a DAG (Directed Acyclic Graph). Currently, the newest DAG scheduling methods, document 1 "R. Grandl, S. Kandula, S. Rao, A. Akella, and J. Kulkarni, 'GRAPHENE: packing and dependency-aware scheduling for data-parallel clusters,' in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 81-97" and document 2 "R. Grandl, M. Chowdhury, A. Akella, and G. Ananthanarayanan, 'Altruistic scheduling in multi-resource clusters,' in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 65-80", schedule computation stages with specialized heuristics. Neither scheduling method, however, takes network communication into account. In practice, the network communication of a data parallel job involves the problem of data shuffle. Document 3 "M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, 'Managing data transfers in computer clusters with orchestra,' ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 98-109, 2011" indicates that this network communication time accounts for up to 50% of the job completion time JCT and therefore has a significant impact on job completion. To address this problem, recent data flow scheduling methods, such as document 4 "Q. Liang and E. Modiano, 'Coflow scheduling in input-queued switches: Optimal delay scaling and algorithms,' in IEEE Conference on Computer Communications (INFOCOM), 2017, pp. 10-18" and document 5 "W. Wang, S. Ma, B. Li, and B. Li, 'Coflex: Navigating the fairness-efficiency tradeoff for coflow scheduling,' in IEEE Conference on Computer Communications (INFOCOM), 2017, pp. 46-54", propose a new abstraction for network data flows called a coflow. A coflow is the group of parallel data streams transmitted between two dependent, consecutive computation stages. Recent coflow scheduling methods aim to reduce the average coflow completion time. Fig. 1 shows the directed acyclic graph (DAG) of a data parallel job composed of 5 computation stages and 4 coflows. A critical path scheduling algorithm schedules the stages on the longest path first, so it preferentially executes Stage1, Stage2, and Stage3. Existing coflow scheduling methods, however, schedule the smallest coflow first. If coflow3 is smaller than coflow1 and coflow2, a coflow scheduler executes coflow3 first and delays coflow1. As a result, after the critical path method executes Stage1, Stage2 cannot be dispatched immediately, because coflow1 must wait for coflow3 to finish. Such a coflow scheduling method therefore cannot cooperate well with a DAG scheduling method, because their optimization objectives differ.
DAG scheduling is a technique that determines the scheduling priority of each task under limited resources according to the job's directed acyclic graph DAG. It is a form of task scheduling widely applied to various computational problems, including multiprocessor DAG job scheduling in stand-alone environments and the data parallel job DAG scheduling that this invention targets. Through DAG scheduling, resource utilization can be maximized or the average job completion time minimized. Unlike ordinary task scheduling methods, DAG scheduling does not treat computing tasks in isolation; it attends to the dependencies between them and emphasizes job-level semantic relations. Any method that uses the job's DAG information during task scheduling can be regarded as a DAG scheduling method. The main current DAG scheduling methods are: (1) critical path algorithms, which find the most critical execution path in the whole job DAG; tasks on the critical path are scheduled preferentially, and tasks on the remaining paths receive secondary consideration, but this method does not consider the concurrency of multiple DAG jobs; (2) breadth-first scheduling, which prioritizes the wider jobs in the DAG, likewise lacks consideration of concurrent scheduling of multiple jobs; (3) job packing scheduling, which considers the DAGs of all jobs comprehensively and schedules jobs with complementary DAG structures together, improving job concurrency, but whose complexity is very high and which is therefore difficult to apply in practice.
Summary of the invention
The technical problem to be solved by the present invention is how to reduce the job completion time when multiple jobs run concurrently; to this end, a data parallel job scheduling method based on branch-DAG dependencies is proposed.
To solve this problem, the technical scheme adopted by the invention is as follows:
A data parallel job scheduling method based on branch-DAG dependencies, comprising the following steps:
Step 1: the job end receives jobs;
Step 2: traverse the DAG task graph of each job and find the convergence points and bifurcation points of the DAG; a convergence point in the DAG task graph is called a branch synchronization, a chain-like part of the DAG task graph with no convergence and no bifurcation is called a branch, and a branch that depends on no other branch, or whose depended-on branches have all finished executing, is called a suspended branch;
Step 3: traverse the DAG task graph of every job at the job end, find the suspended branches in each DAG task graph, and add them to the suspended-branch set B;
Step 4: execute the branch scheduling algorithm on the branches in the suspended-branch set B to obtain the branch scheduling sequence P;
Step 5: when computing resources are available, allocate execution units and execute branch tasks according to the branch scheduling sequence P;
Step 6: repeat steps 3 to 5 until every branch of every DAG graph at the job end has been executed.
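The branch decomposition of step 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the DAG is assumed to be given as stage-level edges, and all names (`find_branches`, `sync_points`) are hypothetical.

```python
from collections import defaultdict

def sync_points(edges):
    """Convergence points (in-degree > 1) are the branch synchronizations."""
    indeg = defaultdict(int)
    for _, v in edges:
        indeg[v] += 1
    return [v for v, d in indeg.items() if d > 1]

def find_branches(edges, stages):
    """Split a DAG into branches: maximal chains that contain no
    convergence point (in-degree > 1) and no bifurcation (out-degree > 1)."""
    indeg, outdeg, succ = defaultdict(int), defaultdict(int), defaultdict(list)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
        succ[u].append(v)

    def starts_branch(s):
        # A branch starts at a source, at a convergence point,
        # or right after a bifurcating predecessor.
        if indeg[s] != 1:
            return True
        pred = [u for u, v in edges if v == s][0]
        return outdeg[pred] > 1

    branches = []
    for s in stages:
        if not starts_branch(s):
            continue
        chain, cur = [s], s
        # extend the chain while it stays strictly linear
        while outdeg[cur] == 1 and indeg[succ[cur][0]] == 1:
            cur = succ[cur][0]
            chain.append(cur)
        branches.append(chain)
    return branches
```

For example, with edges 1→2, 2→3, 4→3, stage 3 is a convergence point (a branch synchronization), and the branches are the chain [1, 2], the single stage [4], and the single stage [3].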
Further, the branch scheduling algorithm described in step 4 is:
Step 4.1: divide the branches in the suspended-branch set B into multiple scheduler objects; if the total computational resource demand of several parallel branches in the suspended-branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the combination as one scheduler object, while every branch that is not packed is a scheduler object on its own;
Step 4.2: construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3: calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4: sort the scheduler objects in the scheduler object set E by urgency to obtain the sorted sequence p, then set P = P ∪ p and output the scheduling sequence P.
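This first variant amounts to packing small parallel branches into a combination and sorting by urgency. A minimal sketch under assumed inputs follows; the dict layout and the greedy packing rule are hypothetical simplifications of steps 4.1-4.4, and a branch's urgency is its slack (smaller slack = stronger urgency), so sorting ascending puts the most urgent first.

```python
def build_schedule(suspended, capacity):
    """suspended: list of dicts with 'name', 'demand' (resource need),
    and 'urgency' (tolerable delay; smaller = more urgent).
    Branches whose combined demand fits under `capacity` are packed into
    one combination; objects are then ordered most-urgent-first."""
    combo, singles, total = [], [], 0
    for br in suspended:
        if total + br['demand'] <= capacity:
            combo.append(br)
            total += br['demand']
        else:
            singles.append(br)
    objects = singles[:]
    if combo:
        objects.append({
            'name': '+'.join(b['name'] for b in combo),
            'demand': total,
            # urgency of a combination is the minimum of its members'
            'urgency': min(b['urgency'] for b in combo),
        })
    return [o['name'] for o in sorted(objects, key=lambda o: o['urgency'])]
```

With branches a (demand 2, urgency 5), b (demand 2, urgency 3), and c (demand 8, urgency 1) under capacity 5, a and b are packed into one combination with urgency 3, and the most urgent object c is scheduled first.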
Further, the branch scheduling algorithm described in step 4 comprises the following steps:
Step 4.1': divide the branches in the suspended-branch set B into multiple scheduler objects; if the total computational resource demand of several parallel branches in the suspended-branch set is less than the capacity limit, pack these parallel branches into a branch combination BC and treat the combination as one scheduler object, while every branch that is not packed is a scheduler object on its own;
Step 4.2': construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3': calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4': in the scheduler object set E, select the scheduler object e with the strongest urgency;
Step 4.5': for every other object o_j in the scheduler object set E except e, denote the order e → o_j as the temporary scheduling sequence p_j, 1 ≤ j ≤ J−1, where J is the total number of scheduler objects in the set;
Step 4.6': calculate the excess time ET_j of each temporary scheduling sequence p_j;
Step 4.7': append the temporary scheduling sequence p_j corresponding to min(ET_j) to the scheduling sequence P, i.e. P = P ∪ p_j;
Step 4.8': remove from the scheduler object set E the branches involved in e and o_j;
Step 4.9': repeat steps 4.4' to 4.8' until every branch or branch combination in the scheduler object set E has been added to the scheduling sequence P.
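The improved variant (steps 4.4'-4.9') repeatedly picks the most urgent object and pairs it with the follower that minimizes the excess time. The sketch below is a simplified illustration under assumed inputs: each object is reduced to a (time span T, urgency U) pair, and the excess-time rule ET = max(0, T(e) − U(o_j)) follows the definition given later in the description.

```python
def excess_time(T_e, U_o):
    # ET for the order e -> o: zero if o can tolerate waiting for e
    return max(0.0, T_e - U_o)

def greedy_schedule(objs):
    """objs: dict name -> (time span T, urgency U; smaller U = more urgent).
    Repeatedly pick the most urgent object e, pair it with the follower
    o_j minimizing ET(e -> o_j), and append e -> o_j to the schedule."""
    objs = dict(objs)
    order = []
    while objs:
        e = min(objs, key=lambda n: objs[n][1])  # strongest urgency
        T_e, _ = objs.pop(e)
        if not objs:                 # odd object out: schedule it alone
            order.append(e)
            break
        o = min(objs, key=lambda n: excess_time(T_e, objs[n][1]))
        objs.pop(o)
        order.extend([e, o])
    return order
```

For example, with a = (T 4, U 1), b = (T 3, U 5), c = (T 2, U 3): a is most urgent; following it with b costs ET = max(0, 4−5) = 0 while c would cost 1, so the schedule is a, b, then c.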
Further, the urgency U(o) of a scheduler object in step 4.3 is calculated as follows:
Step 4.3.1: calculate the time span T(o) of the scheduler object o;
1) when the scheduler object is a branch, its time span T(o) is given by formula (1),
where S is the set of stages in the branch, T_s the set of computing tasks of stage s, and W the set of worker threads; D_{t,w} is the time that worker thread w needs to execute computing task t, and D_{t,w} is predicted by the model of formula (2):
D_{t,w}(i) = a_1·D_{t,w}(i−1) + a_2·D_{t,w}(i−2) + … + a_n·D_{t,w}(i−n)  (2)
Formula (2) states that the time needed by thread w to execute the current task t is predicted from the collected historical execution data of all tasks of the computation stage containing t; the history contains n entries, and i denotes the iteration number of the job. Fitting the model of formula (2) yields estimates of the parameters a_1, a_2, …, a_n;
2) when the scheduler object is a branch combination {a, b}, the time span of the branch combination {a, b} is max{T(a), T(b)};
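The predictor of formula (2) is an order-n autoregressive model over a task's execution-time history. A minimal pure-Python least-squares fit might look like the following; the data and function names are hypothetical, and the normal-equations solver is one simple way to obtain the a_1, …, a_n estimates, not necessarily the fitting procedure the patent uses.

```python
def fit_ar(history, n):
    """Fit D(i) = a1*D(i-1) + ... + an*D(i-n) by least squares
    (normal equations solved by Gaussian elimination)."""
    # rows of predictors [D(i-1), ..., D(i-n)] and targets D(i)
    X = [[history[i - k] for k in range(1, n + 1)] for i in range(n, len(history))]
    y = history[n:]
    A = [[sum(r[p] * r[q] for r in X) for q in range(n)] for p in range(n)]
    b = [sum(r[p] * t for r, t in zip(X, y)) for p in range(n)]
    for col in range(n):                       # forward elimination with pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    a = [0.0] * n                              # back substitution
    for r in range(n - 1, -1, -1):
        a[r] = (b[r] - sum(A[r][c] * a[c] for c in range(r + 1, n))) / A[r][r]
    return a

def predict_next(history, a):
    """Predict the task time of the next iteration from the last n entries."""
    return sum(a[k] * history[-1 - k] for k in range(len(a)))
```

On a history generated by D(i) = 0.5·D(i−1) + 0.5·D(i−2), the fit recovers a ≈ [0.5, 0.5] and predicts the next value accordingly.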
Step 4.3.2: calculate the urgency U(o) of the scheduler object;
1) when the scheduler object is a branch, compute via formula (1) the time spans of all branches in the branch synchronization that the branch belongs to; the urgency U(o) of the branch then equals the difference between the longest branch time span in that synchronization and the branch's own time span, and the shorter this time difference, the stronger the urgency;
2) when the scheduler object is a branch combination {a, b}, the urgency of the branch combination {a, b} is min{U(a), U(b)}.
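The time-span and urgency rules of steps 4.3.1 and 4.3.2 can be sketched together. This is an illustrative simplification with hypothetical names: a branch is represented by its per-stage durations (stages execute serially), a combination by a tuple of its members (which run in parallel), and the sibling branches of a synchronization by their precomputed lengths.

```python
def time_span(obj):
    """Branch: list of stage durations, executed serially -> sum.
    Combination {a, b}: tuple of member objects, run in parallel -> max."""
    if isinstance(obj, tuple):
        return max(time_span(o) for o in obj)
    return sum(obj)

def urgency_of(obj, sync_lengths):
    """Slack before obj becomes the last branch of its branch synchronization.
    sync_lengths: time spans of all branches meeting at the same sync point.
    Combination: the minimum (most urgent) of its members' urgencies."""
    if isinstance(obj, tuple):
        return min(urgency_of(o, sync_lengths) for o in obj)
    return max(sync_lengths) - time_span(obj)
```

With branch a = stages [2, 3] (T = 5) and branch b = stages [4, 4] (T = 8) meeting at one synchronization, a can tolerate 3 units of delay while b, as the longest branch, can tolerate none; their combination inherits urgency 0.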
Further, the excess time ET_j is calculated as:
for the scheduling sequence e → o_j, the excess time is ET_j = max(0, T(e) − U(o_j)),
where T(e) and U(o_j) denote the time span of scheduler object e and the urgency of scheduler object o_j, respectively.
Compared with the prior art, the present invention obtains the following beneficial effects:
The data parallel job scheduling method based on branch-DAG dependencies of the present invention takes a chain-like section of the DAG graph with no convergence and no bifurcation as the smallest schedulable object, called a branch; a branch consists of consecutive computation stages and the network communication between them. The computation stages and the associated network communication within a branch must execute serially, so a branch can be regarded as one schedulable object. Using branches as schedulable objects to allocate resources, the computing tasks of every stage within a branch can be placed on the same machines, which largely satisfies data locality, reduces network overhead, and shortens the job completion time. In addition, the invention determines the urgency of each branch from the delay each branch within a branch synchronization can tolerate, so that short, less urgent branches are delayed in scheduling and the saved computing resources are allocated to other, more urgent jobs, accelerating the completion of branch synchronizations. As a result, the branch-DAG scheduling of the present invention reduces the average job completion time by 10-15% compared with Spark FIFO and critical path scheduling.
Detailed description of the invention
Fig. 1 is the directed acyclic graph of a data parallel job;
Fig. 2 is the overall structure diagram of the present invention;
Fig. 3 illustrates how the present invention converts a job DAG into branches;
Fig. 4 is the flow chart of the branch scheduling method of the invention;
Fig. 5 shows the variation of branch prediction accuracy;
Fig. 6 compares the average JCT of the branch scheduling method with Spark FIFO and shortest-job-first scheduling;
Fig. 7 shows the performance of the branch scheduling method, Spark FIFO, and critical path scheduling at different moments after 30 jobs are submitted simultaneously;
Fig. 8a shows how the prediction overhead changes as the number of branches increases, and Fig. 8b the complexity of the suspended-branch scheduling algorithm as the number of branches increases;
Fig. 9a shows that within the same time the branch scheduling method completes more jobs; Fig. 9b shows the job completion time ratio of the branch scheduling method to the CARBYNE method;
Fig. 10 shows the variation of the average job completion time of the three scheduling methods as the number of machines, the number of jobs, and the job submission time span change; Fig. 10a gives the average JCT of 5000 jobs executed on different numbers of machines, Fig. 10b shows that the average JCT grows with the number of jobs, and Fig. 10c shows the trend of the average JCT of the three scheduling methods as the job submission time span varies.
Specific embodiment
In order to better understand the technical solutions of this application, Figs. 2 to 10 illustrate a specific embodiment of the data parallel job scheduling method based on branch-DAG dependencies of the present invention, comprising the following steps:
Step 1: the job end receives jobs;
Step 2: traverse the DAG task graph of each job and find the convergence points and bifurcation points of the DAG; a convergence point in the DAG task graph is called a branch synchronization, a chain-like part of the DAG task graph with no convergence and no bifurcation is called a branch, and a branch that depends on no other branch, or whose depended-on branches have all finished executing, is called a suspended branch;
Fig. 3 illustrates branches and branch synchronization. For a data parallel job, data processing is completed in a certain operation order, and dependencies exist between these operations: some operations can only execute after certain other operations complete. In Fig. 3, the task corresponding to stage 8 can only execute after stage 2, stage 3, and stage 5 complete. Stage 8 is therefore an aggregation node and is chosen as a branch synchronization point. A chain of stages with no convergence and no bifurcation becomes a branch; for instance, stage 1 and stage 2 constitute branch 1. A branch may contain a single stage or multiple stages.
Step 3: traverse the DAG task graph of every job at the job end, find the suspended branches in each DAG task graph, and add them to the suspended-branch set B;
Step 4: execute the branch scheduling algorithm on the branches in the suspended-branch set B to obtain the branch scheduling sequence P;
This embodiment uses the branch scheduling algorithm shown in Fig. 4:
Step 4.1: divide the branches in the suspended-branch set B into multiple scheduler objects; if the total computational resource demand of several parallel branches in the suspended-branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the combination as one scheduler object, while every branch that is not packed is a scheduler object on its own;
Step 4.2: construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3: calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.3.1: calculate the time span T(o) of the scheduler object o;
1) when the scheduler object is a branch, its time span T(o) is given by formula (1),
where S is the set of stages in the branch, T_s the set of computing tasks of stage s, and W the set of worker threads; D_{t,w} is the time that worker thread w needs to execute computing task t, and D_{t,w} is predicted by the model of formula (2):
D_{t,w}(i) = a_1·D_{t,w}(i−1) + a_2·D_{t,w}(i−2) + … + a_n·D_{t,w}(i−n)  (2)
Formula (2) states that the time needed by thread w to execute the current task t is predicted from the collected historical execution data of all tasks of the computation stage containing t; the history contains n entries, and i denotes the iteration number of the job. Fitting the model of formula (2) yields estimates of the parameters a_1, a_2, …, a_n.
2) when the scheduler object is a branch combination {a, b}, the time span of the branch combination {a, b} is max{T(a), T(b)};
Step 4.3.2: calculate the urgency U(o) of the scheduler object;
1) when the scheduler object is a branch, compute via formula (1) the time spans of all branches in the branch synchronization that the branch belongs to; the urgency U(o) of the branch then equals the difference between the longest branch time span in that synchronization and the branch's own time span, and the shorter this time difference, the stronger the urgency.
2) when the scheduler object is a branch combination {a, b}, the urgency of the branch combination {a, b} is min{U(a), U(b)}.
Step 4.4: sort the scheduler objects in the scheduler object set E by urgency to obtain the sorted sequence p, then set P = P ∪ p and output the scheduling sequence P.
Executing all branches of one job concurrently would occupy excessive resources and hinder the execution of other jobs. Consider a branch synchronization composed of multiple branches: its completion time is determined by the branch that takes the longest. Since the time spans of the branches differ, the branches with shorter execution times can be appropriately delayed; as long as the other, shorter branches finish before the longest branch completes, the time of the branch synchronization is not extended, and delaying the shorter branches reserves computing resources for the long branch. However, if a short branch is delayed too long, the branch synchronization completion time also increases. The scheduling principle of this invention is to postpone short branches without letting any of them become the last branch to complete in its branch synchronization. Based on this heuristic, the invention proposes branch urgency. As shown in Fig. 3, for branch synchronization 1, branch 3 has the longest time in synchronization 1, so its time serves as the time limit of branch synchronization 1. The urgency of a branch is defined as the maximum delay in scheduling the branch can tolerate under the time limit of its branch synchronization. For example, the urgency of branch 4 is the time limit of branch synchronization 1 minus the time of branch 4; it is the longest that branch 4's scheduling can be delayed. This embodiment sorts all branches of every job at the job end by urgency and schedules the more urgent branches first, reducing the delay to branch synchronizations caused by incorrect scheduling.
However, this scheduling method, which simply sorts by urgency, may still extend the time at which some branch synchronizations complete. To address this problem, the invention proposes a scheduling algorithm that minimizes the branch synchronization time, and the branch scheduling method is therefore improved as follows:
Further, the branch scheduling algorithm described in step 4 further comprises the following steps:
Step 4.1': divide the branches in the suspended-branch set B into multiple scheduler objects; if the total computational resource demand of several parallel branches in the suspended-branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the combination as one scheduler object, while every branch that is not packed is a scheduler object on its own;
Step 4.2': construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3': calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4': in the scheduler object set E, select the scheduler object e with the strongest urgency;
Step 4.5': for every other object o_j in the scheduler object set E except e, denote the order e → o_j as the temporary scheduling sequence p_j, 1 ≤ j ≤ J−1, where J is the total number of scheduler objects in the set;
Step 4.6': calculate the excess time ET_j of each temporary scheduling sequence p_j;
The excess time ET_j is calculated as:
for the scheduling sequence e → o_j, the excess time is ET_j = max(0, T(e) − U(o_j)),
where T(e) and U(o_j) denote the time span of scheduler object e and the urgency of scheduler object o_j, respectively.
The concept of excess time is introduced because, as described above, delaying some branches does not increase the completion time of their branch synchronizations. To avoid delaying a branch synchronization, these branches must be identified, and the excess time is used to reason about the consequence of delaying a branch.
The excess time is defined as follows: let a and b be branches of jobs J1 and J2; because computing resources are limited, a and b can only execute serially. If branch b is delayed and branch a executes first, the excess time is ET = max(0, T(a) − U(b)),
where T(a) and U(b) denote the time span of branch a and the urgency of branch b, respectively.
If T(a) ≤ U(b), the excess time is zero, meaning that executing a first and b afterwards does not delay job J2. Otherwise, job J2 is delayed because branch b's scheduling is delayed, and the completion time of J2 increases. In practice it can also happen that whichever of a and b is delayed, the excess time is non-zero: delaying a increases the completion time of J1, while delaying b increases the completion time of J2. A trade-off is therefore needed, and the scheduling scheme with the shortest excess time is selected.
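The trade-off above can be made concrete with a small worked example. The numbers are hypothetical: branch a of job J1 takes 6 time units and tolerates 2 units of delay, while branch b of job J2 takes 3 units and tolerates 1.

```python
def excess_time(T_first, U_second):
    """Completion slip of the delayed branch when it waits for the first:
    ET = max(0, T(first) - U(second))."""
    return max(0.0, T_first - U_second)

# Hypothetical contending branches: a from job J1, b from job J2
T_a, U_a = 6.0, 2.0   # a runs 6 units, tolerates 2 units of delay
T_b, U_b = 3.0, 1.0   # b runs 3 units, tolerates 1 unit of delay

et_a_first = excess_time(T_a, U_b)  # run a first, delay b: max(0, 6-1) = 5
et_b_first = excess_time(T_b, U_a)  # run b first, delay a: max(0, 3-2) = 1
best = 'b->a' if et_b_first < et_a_first else 'a->b'
```

Either order produces a non-zero excess time, but running b first delays J1 by only 1 unit instead of delaying J2 by 5, so the order b → a is chosen.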
Step 4.7': append the temporary scheduling sequence p_j corresponding to min(ET_j) to the scheduling sequence P, i.e. P = P ∪ p_j;
Step 4.8': remove from the scheduler object set E the branches involved in e and o_j;
Step 4.9': repeat steps 4.4' to 4.8' until every branch or branch combination in the scheduler object set E has been added to the scheduling sequence P, then output the scheduling sequence P.
Step 5: when computing resources are available, allocate execution units and execute branch tasks according to the branch scheduling sequence P; after a suspended branch finishes executing, the computing resources it occupied are released;
Step 6: repeat steps 3 to 5 until every branch of every DAG graph at the job end has been executed.
The present invention takes an independent path section of the DAG as the smallest schedulable object, called a branch; a branch consists of consecutive computation stages and the network communication between the stages. For example, branch 1 of Fig. 1 is composed of Stage1, coflow1, Stage2, coflow2, and Stage3. The computation stages and associated network communication within a branch must execute serially, so a branch can be regarded as one schedulable object, which avoids the conflict between DAG scheduling and coflow scheduling when computation stages and network communication are scheduled separately. In addition, the branch concept accounts for the requirement of data locality. In Fig. 1, an upstream computation stage outputs calculated results that serve as the input data of the downstream computation stage. After Stage1 completes, its output is stored on the machines where Stage1's computing tasks were executed, and this temporarily saved data becomes the input of Stage2. When Stage2 starts to execute, its computing tasks pull the intermediate data from the machines that executed Stage1; this process generates the network communication, namely coflow1. If the computing tasks of Stage2 are also placed on the same machines, data locality is largely satisfied. The invention therefore uses branches as schedulable objects to allocate resources, so that Stage1 and Stage2 can be scheduled onto the same group of machines, reducing network overhead and the job completion time.
Furthermore, because the multiple parallel paths of a DAG intersect with each other, synchronization points are formed. The present invention calls the convergence of multiple branches a branch synchronization: only after the branches executing in parallel have synchronized can the subsequent computation stages start to execute, and within a data-parallel job, job scheduling depends on the slowest of the parallel branches. In Fig. 1, branch 1 and branch 2 intersect at the last computing task, which constitutes a branch synchronization; the job can finish only after this branch synchronization completes. Branch synchronization therefore strongly affects the job completion time (JCT). Executing all branches of one branch synchronization simultaneously occupies a large amount of computing resources and hinders the execution of other branch synchronizations. Since the job completion time is in fact determined by the slowest branch, the present invention predicts the completion time of each branch to compute each branch's tolerable delay, thereby determining the priority, or urgency, of each branch: the urgent branches of different jobs are executed in parallel first, and the less urgent branches are executed serially afterwards. Postponing the less urgent branches leaves more computing resources for completing the more urgent jobs. To verify the effect of the invention, experiments were conducted on a laboratory Spark cluster; the results show that the branch scheduling method outperforms Spark FIFO, shortest-job-first scheduling, and the critical-path scheduling method, reducing the average job completion time (JCT) by 10-15%. In addition, large-scale simulation tests were carried out. The simulation experiments use the Google cluster trace (https://github.com/google/cluster-data) and compare against CARBYNE, the latest DAG scheduling method described in document 2. The experimental results also show that the branch scheduling (BS) method achieves faster JCT than CARBYNE.
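The urgency rule above — a branch's tolerable delay is the gap between the longest branch in its synchronization and the branch itself — can be illustrated with a small sketch. The function name, the data layout, and the numbers are assumptions for illustration only:

```python
def urgency(sync_groups, pred_time):
    """For each branch, urgency = (longest predicted branch time in its
    branch synchronization) - (its own predicted time).  A smaller value
    means less tolerable delay, i.e. a more urgent branch."""
    u = {}
    for group in sync_groups:
        longest = max(pred_time[b] for b in group)
        for b in group:
            u[b] = longest - pred_time[b]
    return u

# Assumed example: branch1 and branch2 converge in one synchronization.
pred_time = {"branch1": 120.0, "branch2": 90.0}   # predicted seconds
u = urgency([["branch1", "branch2"]], pred_time)
# branch1 lies on the longest path, so it can tolerate no delay at all,
# while branch2 may be postponed by up to 30 seconds.
print(u)
```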
The effect of the invention is verified below by constructing an experimental environment.
As shown in Fig. 2, in the system architecture of the branch scheduling method, the master node is responsible for receiving job submissions and allocating computing resources. A user submits jobs to the Spark cluster through a driver interface. The driver manages the progress of the jobs the user has submitted, which includes decomposing a data-parallel job into multiple parallel computing tasks and generating a DAG. During the execution of a data-parallel job, the driver communicates with the master node to track changes in cluster resources (for example, to determine which servers have idle computing resources on which computing tasks can be launched). The nodes that actually execute the computing tasks are called worker nodes. The master node also manages the whole cluster, maintains the state of each server, and indexes the computation results on each server, so that subsequent computing tasks can determine where their input data is located.
We call the process of assigning computing tasks scheduling. After the driver interface determines the total amount of computing resources required and reports it to the master, the master node decides, according to the state of each server it tracks (for example, whether the server has been allocated to other computing jobs and whether it still has assignable computing resources), the total amount of computing resources to allocate to the job handled by that driver. After receiving the master node's feedback, the driver interface must also communicate with each allocated worker node to establish communication channels for distributing computing tasks later. During the execution of a data-parallel job, the worker nodes provide a group of executor units; each executor is a Java virtual machine wrapping a certain number of CPU cores and an amount of memory. Through its communication with the worker nodes, the driver node learns about each executor. The driver interface then dispatches the parallel tasks produced by decomposing the data-parallel job to these executors one by one. After an executor finishes its computing operations, it stores the computation result locally or feeds it back to the driver.
As shown in Fig. 2, after a job submitted to the driver is decomposed into a DAG, a branch-structured DAG can be derived from it for branch scheduling. We also added a module for communication between the driver interface and the executors. After an executor finishes a computing task, this communication module feeds the execution information of that task — such as the computation time, the network communication time, and the disk read/write time — back to the driver node. From the completion-time information of each task of the job, the driver node estimates the length of each branch and computes each branch's urgency. Once the driver interface has determined the urgency of each branch, it reports these urgencies to the master node, and the master node schedules the branches from different jobs according to their urgency.
The following verifies that the branch scheduling method can schedule multiple concurrent jobs and reduce the job completion time (JCT). This embodiment runs a large number of experiments on a real Spark cluster to evaluate the performance of the branch scheduling method. The experiments compare Spark's first-in-first-out (FIFO) scheduling, shortest-job-first (SJF) scheduling, and the critical-path (CP) method. Note that CP is a job scheduling method implemented for comparison, intended to prioritize the computing tasks on the critical path of a DAG job. We measure the average JCT of multiple jobs running simultaneously; these jobs are submitted to the cluster concurrently. In addition, we evaluate the prediction accuracy, the system overhead, and the performance improvement of the branch scheduling method in an online job-submission scenario.
The experiments in this embodiment use a cluster of 30 servers distributed across 6 racks. Each server is equipped with an Intel Xeon E5-2650 2.2GHz 12-core processor. Three kinds of Spark jobs are mixed in the experiments — 50% PageRank, 30% logistic regression, and 20% machine learning jobs — to ensure diversity in branch time lengths. PageRank is a typical iterative graph computation application; logistic regression is a common regression analysis application; the machine learning application uses an iterative learning process. In the experiments, the number of jobs is increased from 10 to 50, and the input data size of a job is increased from 10GB to 50GB.
Testing the online branch completion-time prediction accuracy: to compute branch urgency, the branch time is predicted by the prediction formulas described above, so prediction accuracy is an important evaluation metric. After jobs are submitted, we first measure the actual execution time of each branch and record the predicted branch time, and then compute the prediction error as |Tpred − Tmeas| / Tmeas, where Tpred and Tmeas denote the predicted and measured branch times, respectively. The evolution of the branch prediction accuracy is shown in Fig. 5. The experimental results show that most branch predictions are accurate. At the beginning the prediction error grows, but after the 6000th prediction the error stabilizes at around 25%. As the number of prediction rounds increases, the prediction becomes more accurate; the repeated execution of iterative jobs therefore improves the prediction accuracy.
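As a hedged illustration of the auto-regressive task-time model (formula 2 of the claims) and of the relative prediction error evaluated here, the following sketch fits the coefficients by least squares. The history values and the window size n are assumed, and least-squares fitting is one plausible reading of the "model fitting" the patent mentions:

```python
import numpy as np

def fit_ar_model(history, n):
    """Fit the auto-regressive model of formula 2,
    D(i) = a1*D(i-1) + ... + an*D(i-n),
    by least squares over the collected per-iteration execution times."""
    # Each row holds the n most recent times, newest first.
    X = np.array([history[i - n:i][::-1] for i in range(n, len(history))])
    y = np.array(history[n:])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_next(history, coeffs):
    """Predict the next execution time from the last n measurements."""
    n = len(coeffs)
    return float(np.dot(coeffs, history[-1:-n - 1:-1]))

# Assumed measurements of one task's execution time over iterations (s).
history = [10.0, 10.5, 10.2, 10.4, 10.3, 10.35, 10.32]
a = fit_ar_model(history, n=2)
t_pred = predict_next(history, a)
t_meas = 10.33                        # the time actually measured next
err = abs(t_pred - t_meas) / t_meas   # relative error, as in Fig. 5
```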
Evaluating the average job completion time:
After a batch of jobs is submitted simultaneously, the average JCT of this batch of DAG jobs is evaluated. The number of jobs in the batch increases from 10 to 50. Fig. 6 shows that the branch scheduling method achieves a 10-15% lower average JCT than Spark's FIFO method. When the number of jobs increases to 40 and 50, the reduction in average JCT stabilizes at around 15%. The SJF and CP methods achieve an average JCT only about 5% lower than Spark FIFO, and this hardly changes as the number of jobs grows; the JCT reduction of the branch scheduling method, by contrast, grows as the number of jobs increases. In general, the more jobs are submitted simultaneously, the larger the average JCT, but the branch scheduling method can pack branches more effectively. In comparison, Spark FIFO, SJF, and CP cannot perceive the branch synchronizations in the DAG, which causes a longer exceeded time as the number of jobs increases.
Evaluating the number of jobs completed over time:
Fig. 7 shows the performance at different moments after 30 jobs are submitted simultaneously. In the first 500 seconds after submission, the performance of the three methods is very close. Then the completion rate of the Spark FIFO method drops significantly. The branch scheduling method completes almost half of the jobs by around the 1000th second, and two thirds of the jobs within the first 1500 seconds. This result shows that the branch scheduling method completes the same batch of jobs faster than the other two methods.
Overhead experiment: the main system overhead is caused by branch prediction and branch scheduling. The prediction overhead involves collecting task completion-time information and predicting branch time lengths. Fig. 8a shows that the prediction overhead grows gradually as the number of branches increases; compared with executing the branch scheduling algorithm, the prediction overhead is relatively small. Fig. 8b shows the complexity of the branch scheduling algorithm as the number of pending branches increases from 20 to 70. The time complexity of the algorithm grows with the number of jobs: when more jobs are submitted, there are more branch combinations, which increases the running overhead of the branch scheduling algorithm and makes its complexity grow superlinearly. Nevertheless, compared with the JCT, this overhead is acceptable.
This embodiment simulates 5000 typical DAG jobs from the Google cluster trace, using SJF, CP, and CARBYNE as comparison methods. The CARBYNE method altruistically contributes leftover computing resources to the shortest jobs. All simulations are completed on a computer equipped with an Intel(R) Core(TM) i7-4700MQ CPU 2.40GHz and 32GB RAM.
Experimental data set: the Google cluster trace records one month of job completion logs. These logs contain detailed task completion information, resource requirements, machine states, constraints, and so on. However, the trace does not provide useful DAG information: most tasks have only simple dependencies, such as MapReduce. To simulate the dependencies of realistic jobs, this embodiment synthesizes actual job dependencies. Most DAGs contain fewer than 10 branches and 5 branch synchronizations (50th percentile); a small fraction of DAGs contain up to 34 branches and 13 branch synchronizations (95th percentile).
Simulation of the job completion time (JCT): the experiments evaluate the average JCT of the branch scheduling method, CARBYNE, SJF, and CP. 5000 jobs are used, all submitted within 600 seconds and sharing 500 machines. First, the cumulative distribution function (CDF) of job completion is measured. As shown in Fig. 9a, at any given moment the branch scheduling method completes more jobs. We further evaluate the JCT reduction in detail: Fig. 9b shows the ratio of the job completion times of the branch scheduling method to those of the CARBYNE method. The branch scheduling method achieves lower JCT; for more than 15% of the jobs, the JCT reduction reaches 20%. For most jobs, the branch scheduling method achieves a larger JCT reduction than CARBYNE.
The influence of the number of machines: resource capacity is an important performance factor. This embodiment increases the number of simulated machines from 500 to 1000. Fig. 10a shows the average JCT of the 5000 jobs executed on different numbers of machines. The average JCT decreases as the number of machines increases, and the branch scheduling method achieves the smallest average JCT. However, when the number of machines reaches 1000, the performance of the CARBYNE method gradually approaches that of the branch scheduling method.
The influence of the number of jobs: Fig. 10b shows that the average JCT increases with the number of jobs. In this experiment, the number of jobs is increased from 2000 to 5000. The branch scheduling method achieves a smaller average JCT than the other methods; thus, when more jobs compete for cluster resources, the branch scheduling method achieves more economical scheduling, and its JCT reduction grows with the number of jobs.
The influence of the job submission time: when jobs are submitted at different rates, the scheduler's performance differs because of the different backlog conditions. We increase the interval between job submissions from 300 seconds to 1500 seconds. Fig. 10c shows that the average JCT of the branch scheduling method is the smallest. The average JCT of the three methods rises as the interval time increases.
The above simulation experiments demonstrate from different angles that, under the condition of limited cluster resources, when many jobs are submitted within a relatively short period of time, the branch scheduling method achieves higher performance.
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions under the concept of the present invention belong to its protection scope. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications made without departing from the principles of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims (5)

1. A data-parallel job scheduling method based on branch DAG dependency, characterized by comprising the following steps:
Step 1: collect the jobs to be run into a job pool;
Step 2: traverse the DAG task graph of a job and find the convergence points and bifurcation points of the DAG; a convergence point in the DAG task graph is called a branch synchronization; a chain-shaped part of the DAG task graph without convergence or bifurcation is called a branch; a branch that does not depend on other branches, or whose depended-on branches have all finished executing, is called a pending branch;
Step 3: traverse the DAG task graph of every job in the job pool, find the pending branches in each DAG task graph, and add the pending branches found to the pending branch set B;
Step 4: execute the branch scheduling algorithm on the branches in the pending branch set B to obtain a branch scheduling sequence P;
Step 5: when computing resources are available, allocate execution units according to the branch scheduling sequence P and execute the branch tasks;
Step 6: repeat steps 3 to 5 until every branch in the DAG graph of every job in the job pool has been executed.
2. The data-parallel job scheduling method based on branch DAG dependency according to claim 1, characterized in that the branch scheduling algorithm in step 4 comprises:
Step 4.1: divide the branches in the pending branch set B into multiple scheduler objects: if the total computing resource demand of several parallel branches in the pending branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the branch combination as one scheduler object; every branch that is not packed is a scheduler object by itself;
Step 4.2: construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3: compute the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4: sort the scheduler objects in the scheduler object set E by urgency to obtain a sorted sequence p, then set P = P ∪ p and output the scheduling sequence P.
3. The data-parallel job scheduling method based on branch DAG dependency according to claim 1, characterized in that the branch scheduling algorithm in step 4 comprises:
Step 4.1': divide the branches in the pending branch set B into multiple scheduler objects: if the total computing resource demand of several parallel branches in the pending branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the branch combination as one scheduler object; every branch that is not packed is a scheduler object by itself;
Step 4.2': construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3': compute the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4': select the scheduler object e with the strongest urgency from the scheduler object set E;
Step 4.5': for every other object o_j in the scheduler object set E apart from the scheduler object e, denote e → o_j as a temporary scheduling sequence p_j, 1 ≤ j ≤ J−1, where J is the total number of scheduler objects in the scheduler object set;
Step 4.6': compute the exceeded time ET_j of each temporary scheduling sequence p_j;
Step 4.7': append the temporary scheduling sequence p_j corresponding to min(ET_j) to the scheduling sequence P: P = P ∪ p_j;
Step 4.8': remove the branches involved in e and o_j from the scheduler object set E;
Step 4.9': repeat steps 4.4' to 4.8' until every branch or branch combination in the scheduler object set E has been added to the scheduling sequence P, then output the scheduling sequence P.
4. The data-parallel job scheduling method based on branch DAG dependency according to claim 2 or 3, characterized in that the urgency U(o) of a scheduler object is computed as follows:
Step 4.3.1: compute the time length T(o) of the scheduler object o;
1) when the scheduler object is a branch, its time length T(o) is given by formula 1, where S denotes the set of stages in the branch, T_s denotes the set of computing tasks of a stage s, W denotes the set of worker threads, and D_{t,w} denotes the time used by thread w to execute computing task t; D_{t,w} is predicted by the prediction model of formula 2:
D_{t,w}(i) = a_1·D_{t,w}(i−1) + a_2·D_{t,w}(i−2) + ... + a_n·D_{t,w}(i−n)   (2)
Formula 2 states that the time used by thread w to execute the current task t is predicted from the historical execution data of all tasks in the computation stage where task t is located; the history contains n entries in total, and i denotes the iteration number of the job; fitting the model of formula 2 yields the estimated values of the parameters a_1, a_2, ..., a_n;
2) when the scheduler object is a branch combination {a, b}, the time length of the branch combination {a, b} is max{T(a), T(b)};
Step 4.3.2: compute the urgency U(o) of the scheduler object;
1) when the scheduler object is a branch, compute by formula 1 the time lengths of all branches in the branch synchronization to which the branch belongs; the urgency U(o) of the branch is the difference between the longest branch time length in that synchronization and the branch's own time length; the shorter this difference, the stronger the urgency;
2) when the scheduler object is a branch combination {a, b}, the urgency of the branch combination {a, b} is min{U(a), U(b)}.
5. The data-parallel job scheduling method based on branch DAG dependency according to claim 4, characterized in that the exceeded time ET_j in step 4.6' is computed as follows: for the scheduling sequence e → o_j, the exceeded time is ET_j = T(e) − U(o_j), where T(e) and U(o_j) denote the time length of the scheduler object e and the urgency of the scheduler object o_j, respectively.
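Outside the claims themselves, the selection-and-ordering loop of claims 3-5 can be sketched in Python. This is an illustrative reading, not the patented implementation: the time lengths and urgencies are assumed inputs, the exceeded time of a pair e → o_j is taken as T(e) − U(o_j) per claim 5, and the simplification that each scheduler object is a single branch replaces step 4.8's removal of "the branches involved":

```python
def schedule_pending(objects, T, U):
    """Greedy ordering per claims 3-5: repeatedly take the most urgent
    scheduler object e (smallest tolerable delay U), pair it with the
    follower o_j that minimizes the exceeded time ET_j = T(e) - U(o_j),
    and append the pair e -> o_j to the scheduling sequence P."""
    E = set(objects)
    P = []
    while len(E) >= 2:
        e = min(E, key=lambda o: U[o])             # strongest urgency
        rest = E - {e}
        oj = min(rest, key=lambda o: T[e] - U[o])  # min exceeded time
        P += [e, oj]
        E -= {e, oj}
    P += list(E)                                   # at most one left over
    return P

# Assumed example: four pending scheduler objects with predicted time
# lengths T and urgencies U (smaller U = more urgent).
T = {"b1": 120, "b2": 90, "b3": 60, "b4": 40}
U = {"b1": 0, "b2": 30, "b3": 10, "b4": 70}
print(schedule_pending(T.keys(), T, U))
# -> ['b1', 'b4', 'b3', 'b2']
```

Note that minimizing T(e) − U(o_j) pairs the most urgent object with the follower that can tolerate the longest delay, so the least harm is done by making it wait behind e.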
CN201910514403.0A 2019-06-14 2019-06-14 Data parallel job scheduling method based on branch DAG dependency Active CN110275765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910514403.0A CN110275765B (en) 2019-06-14 2019-06-14 Data parallel job scheduling method based on branch DAG dependency


Publications (2)

Publication Number Publication Date
CN110275765A true CN110275765A (en) 2019-09-24
CN110275765B CN110275765B (en) 2021-02-26

Family

ID=67960808


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688993A (en) * 2019-12-10 2020-01-14 中国人民解放军国防科技大学 Spark operation-based computing resource determination method and device
CN110730470A (en) * 2019-10-24 2020-01-24 北京大学 Mobile communication equipment integrating multiple access technologies
CN111857984A (en) * 2020-06-01 2020-10-30 北京文思海辉金信软件有限公司 Job calling processing method and device in bank system and computer equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868019A (en) * 2016-02-01 2016-08-17 中国科学院大学 Automatic optimization method for performance of Spark platform
US20190065336A1 (en) * 2017-08-24 2019-02-28 Tata Consultancy Services Limited System and method for predicting application performance for large data size on big data cluster


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DONGSHENG LI et al.: "ReB: Balancing Resource Allocation for Iterative Data-Parallel Jobs", In Proceedings of ACM Conference (Conference'17) *
MASTERT-J: "Spark Explained (5): Spark Job Execution Principles", https://blog.csdn.net/qq_21125183/article/details/87875902 *
WEI WANG et al.: "Coflex: Navigating the Fairness-Efficiency Tradeoff for Coflow Scheduling", IEEE INFOCOM 2017 - IEEE Conference on Computer Communications *
TIAN Guozhong et al.: "A Cost Optimization Method for Scheduling Multiple DAG Tasks Sharing Heterogeneous Resources", Acta Electronica Sinica *
HU Zhiyao et al.: "Advances in Data Center Network Flow Scheduling Techniques", Journal of Computer Research and Development *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant