CN110275765A - Data parallel job scheduling method based on branch DAG dependency - Google Patents


Info

Publication number
CN110275765A
Authority
CN
China
Prior art keywords
branch
scheduler object
dag
urgency
time
Prior art date
Legal status
Granted
Application number
CN201910514403.0A
Other languages
Chinese (zh)
Other versions
CN110275765B (en)
Inventor
Li Dongsheng (李东升)
Hu Zhiyao (胡智尧)
Zhang Yiming (张一鸣)
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201910514403.0A priority Critical patent/CN110275765B/en
Publication of CN110275765A publication Critical patent/CN110275765A/en
Application granted granted Critical
Publication of CN110275765B publication Critical patent/CN110275765B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/901 - Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 - Graphs; Linked lists
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 - Task transfer initiation or dispatching
    • G06F 9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 - Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Abstract

The invention discloses a data parallel job scheduling method based on branch-DAG dependencies, comprising the following steps: 1. the job end receives jobs; 2. the DAG task graph of each job is traversed to find its branches, the branch synchronizations, and the suspended branches among them; 3. the suspended branches of every DAG graph at the job end are collected into a suspended-branch set B; 4. a branch scheduling algorithm is executed on the branches in B to obtain a branch scheduling sequence P; 5. when computing resources are available, execution units are allocated and branch tasks executed according to the branch scheduling sequence P; 6. steps 3 to 5 are repeated until every branch of every DAG graph at the job end has been executed. By determining the urgency of each branch, the invention delays the scheduling of non-urgent branches, saves computing resources for more urgent jobs, and accelerates the completion of branch synchronizations. Compared with other scheduling methods, the disclosed method reduces the average job completion time by 10-15%.

Description

Data parallel job scheduling method based on branch-DAG dependencies
Technical field
The invention belongs to the field of parallel and distributed computing, and in particular relates to a data parallel job scheduling method based on branch-DAG dependencies.
Background technique
Big data analysis jobs, such as machine learning, graph computation, and stream computing, have become a key part of daily life. Platforms such as Hadoop and Spark were proposed to process data parallel jobs efficiently. However, this raises several challenging technical issues, such as job scheduling and network communication. For a big data analysis job, the job completion time (JCT) is an extremely important metric. The JCT is the period from the submission of a data parallel job to its completion. A data parallel job consists of multiple computation stages and the network communication between those stages. The stages must execute in a specified order so that dependencies are not violated; these dependencies form a directed acyclic graph, so the whole job can be represented by a DAG (Directed Acyclic Graph). Currently, the newest DAG scheduling methods, document 1 "R. Grandl, S. Kandula, S. Rao, A. Akella, and J. Kulkarni, 'GRAPHENE: packing and dependency-aware scheduling for data-parallel clusters,' in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 81-97" and document 2 "R. Grandl, M. Chowdhury, A. Akella, and G. Ananthanarayanan, 'Altruistic scheduling in multi-resource clusters,' in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 65-80", schedule computation stages with specialized heuristics. Neither scheduling method, however, takes network communication into account. In practice, the network communication of a data parallel job involves the problem of data shuffle. Document 3 "M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, 'Managing data transfers in computer clusters with orchestra,' ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 98-109, 2011" indicates that this network communication time accounts for up to 50% of the job completion time JCT and therefore has a significant impact on job completion. To address this problem, recent data flow scheduling methods, such as document 4 "Q. Liang and E. Modiano, 'Coflow scheduling in input-queued switches: Optimal delay scaling and algorithms,' in IEEE Conference on Computer Communications (INFOCOM), 2017, pp. 10-18" and document 5 "W. Wang, S. Ma, B. Li, and B. Li, 'Coflex: Navigating the fairness-efficiency tradeoff for coflow scheduling,' in IEEE Conference on Computer Communications (INFOCOM), 2017, pp. 46-54", propose a new abstraction for network data flows called a coflow. A coflow is the group of parallel data streams transmitted between two dependent, consecutive computation stages. Recent coflow scheduling methods aim to reduce the average coflow completion time. Fig. 1 shows the directed acyclic graph (DAG) of a data parallel job composed of 5 computation stages and 4 coflows. A critical path scheduling algorithm schedules the stages on the longest path first, so it preferentially executes Stage1, Stage2, and Stage3. Existing coflow scheduling methods, however, schedule the smallest coflow first. If coflow3 is smaller than coflow1 and coflow2, a coflow scheduler executes coflow3 first and delays coflow1. As a result, after the critical path method executes Stage1, Stage2 cannot be dispatched immediately, because coflow1 must wait for coflow3 to finish. Such a coflow scheduling method therefore cannot cooperate well with a DAG scheduling method, because their optimization objectives differ.
DAG scheduling is a technique that determines the scheduling priority of each task under limited resources according to the job's directed acyclic graph DAG. It is a form of task scheduling widely applied to various computational problems, including multiprocessor DAG job scheduling in stand-alone environments and the data parallel job DAG scheduling that this invention targets. Through DAG scheduling, resource utilization can be maximized or the average job completion time minimized. Unlike ordinary task scheduling methods, DAG scheduling does not treat computing tasks in isolation; it attends to the dependencies between them and emphasizes job-level semantic relations. Any method that uses the job's DAG information during task scheduling can be regarded as a DAG scheduling method. The main current DAG scheduling methods are: (1) critical path algorithms, which find the most critical execution path in the whole job DAG; tasks on the critical path are scheduled preferentially, and tasks on the remaining paths receive secondary consideration, but this method does not consider the concurrency of multiple DAG jobs; (2) breadth-first scheduling, which prioritizes the wider jobs in the DAG, likewise lacks consideration of concurrent scheduling of multiple jobs; (3) job packing scheduling, which considers the DAGs of all jobs comprehensively and schedules jobs with complementary DAG structures together, improving job concurrency, but whose complexity is very high and which is therefore difficult to apply in practice.
Summary of the invention
The technical problem to be solved by the present invention is how to reduce the job completion time when multiple jobs run concurrently; to this end, a data parallel job scheduling method based on branch-DAG dependencies is proposed.
To solve this problem, the technical scheme adopted by the invention is as follows:
A data parallel job scheduling method based on branch-DAG dependencies, comprising the following steps:
Step 1: the job end receives jobs;
Step 2: traverse the DAG task graph of each job and find the convergence points and bifurcation points of the DAG; a convergence point in the DAG task graph is called a branch synchronization, a chain-like part of the DAG task graph with no convergence and no bifurcation is called a branch, and a branch that depends on no other branch, or whose depended-on branches have all finished executing, is called a suspended branch;
Step 3: traverse the DAG task graph of every job at the job end, find the suspended branches in each DAG task graph, and add them to the suspended-branch set B;
Step 4: execute the branch scheduling algorithm on the branches in the suspended-branch set B to obtain the branch scheduling sequence P;
Step 5: when computing resources are available, allocate execution units and execute branch tasks according to the branch scheduling sequence P;
Step 6: repeat steps 3 to 5 until every branch of every DAG graph at the job end has been executed.
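The branch decomposition of step 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the DAG is assumed to be given as stage-level edges, and all names (`find_branches`, `sync_points`) are hypothetical.

```python
from collections import defaultdict

def sync_points(edges):
    """Convergence points (in-degree > 1) are the branch synchronizations."""
    indeg = defaultdict(int)
    for _, v in edges:
        indeg[v] += 1
    return [v for v, d in indeg.items() if d > 1]

def find_branches(edges, stages):
    """Split a DAG into branches: maximal chains that contain no
    convergence point (in-degree > 1) and no bifurcation (out-degree > 1)."""
    indeg, outdeg, succ = defaultdict(int), defaultdict(int), defaultdict(list)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
        succ[u].append(v)

    def starts_branch(s):
        # A branch starts at a source, at a convergence point,
        # or right after a bifurcating predecessor.
        if indeg[s] != 1:
            return True
        pred = [u for u, v in edges if v == s][0]
        return outdeg[pred] > 1

    branches = []
    for s in stages:
        if not starts_branch(s):
            continue
        chain, cur = [s], s
        # extend the chain while it stays strictly linear
        while outdeg[cur] == 1 and indeg[succ[cur][0]] == 1:
            cur = succ[cur][0]
            chain.append(cur)
        branches.append(chain)
    return branches
```

For example, with edges 1→2, 2→3, 4→3, stage 3 is a convergence point (a branch synchronization), and the branches are the chain [1, 2], the single stage [4], and the single stage [3].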
Further, the branch scheduling algorithm described in step 4 is:
Step 4.1: divide the branches in the suspended-branch set B into multiple scheduler objects; if the total computational resource demand of several parallel branches in the suspended-branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the combination as one scheduler object, while every branch that is not packed is a scheduler object on its own;
Step 4.2: construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3: calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4: sort the scheduler objects in the scheduler object set E by urgency to obtain the sorted sequence p, then set P = P ∪ p and output the scheduling sequence P.
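This first variant amounts to packing small parallel branches into a combination and sorting by urgency. A minimal sketch under assumed inputs follows; the dict layout and the greedy packing rule are hypothetical simplifications of steps 4.1-4.4, and a branch's urgency is its slack (smaller slack = stronger urgency), so sorting ascending puts the most urgent first.

```python
def build_schedule(suspended, capacity):
    """suspended: list of dicts with 'name', 'demand' (resource need),
    and 'urgency' (tolerable delay; smaller = more urgent).
    Branches whose combined demand fits under `capacity` are packed into
    one combination; objects are then ordered most-urgent-first."""
    combo, singles, total = [], [], 0
    for br in suspended:
        if total + br['demand'] <= capacity:
            combo.append(br)
            total += br['demand']
        else:
            singles.append(br)
    objects = singles[:]
    if combo:
        objects.append({
            'name': '+'.join(b['name'] for b in combo),
            'demand': total,
            # urgency of a combination is the minimum of its members'
            'urgency': min(b['urgency'] for b in combo),
        })
    return [o['name'] for o in sorted(objects, key=lambda o: o['urgency'])]
```

With branches a (demand 2, urgency 5), b (demand 2, urgency 3), and c (demand 8, urgency 1) under capacity 5, a and b are packed into one combination with urgency 3, and the most urgent object c is scheduled first.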
Further, the branch scheduling algorithm described in step 4 comprises the following steps:
Step 4.1': divide the branches in the suspended-branch set B into multiple scheduler objects; if the total computational resource demand of several parallel branches in the suspended-branch set is less than the capacity limit, pack these parallel branches into a branch combination BC and treat the combination as one scheduler object, while every branch that is not packed is a scheduler object on its own;
Step 4.2': construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3': calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4': in the scheduler object set E, select the scheduler object e with the strongest urgency;
Step 4.5': for every other object o_j in the scheduler object set E except e, denote the order e → o_j as the temporary scheduling sequence p_j, 1 ≤ j ≤ J−1, where J is the total number of scheduler objects in the set;
Step 4.6': calculate the excess time ET_j of each temporary scheduling sequence p_j;
Step 4.7': append the temporary scheduling sequence p_j corresponding to min(ET_j) to the scheduling sequence P, i.e. P = P ∪ p_j;
Step 4.8': remove from the scheduler object set E the branches involved in e and o_j;
Step 4.9': repeat steps 4.4' to 4.8' until every branch or branch combination in the scheduler object set E has been added to the scheduling sequence P.
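The improved variant (steps 4.4'-4.9') repeatedly picks the most urgent object and pairs it with the follower that minimizes the excess time. The sketch below is a simplified illustration under assumed inputs: each object is reduced to a (time span T, urgency U) pair, and the excess-time rule ET = max(0, T(e) − U(o_j)) follows the definition given later in the description.

```python
def excess_time(T_e, U_o):
    # ET for the order e -> o: zero if o can tolerate waiting for e
    return max(0.0, T_e - U_o)

def greedy_schedule(objs):
    """objs: dict name -> (time span T, urgency U; smaller U = more urgent).
    Repeatedly pick the most urgent object e, pair it with the follower
    o_j minimizing ET(e -> o_j), and append e -> o_j to the schedule."""
    objs = dict(objs)
    order = []
    while objs:
        e = min(objs, key=lambda n: objs[n][1])  # strongest urgency
        T_e, _ = objs.pop(e)
        if not objs:                 # odd object out: schedule it alone
            order.append(e)
            break
        o = min(objs, key=lambda n: excess_time(T_e, objs[n][1]))
        objs.pop(o)
        order.extend([e, o])
    return order
```

For example, with a = (T 4, U 1), b = (T 3, U 5), c = (T 2, U 3): a is most urgent; following it with b costs ET = max(0, 4−5) = 0 while c would cost 1, so the schedule is a, b, then c.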
Further, the urgency U(o) of a scheduler object in step 4.3 is calculated as follows:
Step 4.3.1: calculate the time span T(o) of the scheduler object o;
1) when the scheduler object is a branch, its time span T(o) is given by formula (1),
where S is the set of stages in the branch, T_s the set of computing tasks of stage s, and W the set of worker threads; D_{t,w} is the time that worker thread w needs to execute computing task t, and D_{t,w} is predicted by the model of formula (2):
D_{t,w}(i) = a_1·D_{t,w}(i−1) + a_2·D_{t,w}(i−2) + … + a_n·D_{t,w}(i−n)  (2)
Formula (2) states that the time needed by thread w to execute the current task t is predicted from the collected historical execution data of all tasks of the computation stage containing t; the history contains n entries, and i denotes the iteration number of the job. Fitting the model of formula (2) yields estimates of the parameters a_1, a_2, …, a_n;
2) when the scheduler object is a branch combination {a, b}, the time span of the branch combination {a, b} is max{T(a), T(b)};
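The predictor of formula (2) is an order-n autoregressive model over a task's execution-time history. A minimal pure-Python least-squares fit might look like the following; the data and function names are hypothetical, and the normal-equations solver is one simple way to obtain the a_1, …, a_n estimates, not necessarily the fitting procedure the patent uses.

```python
def fit_ar(history, n):
    """Fit D(i) = a1*D(i-1) + ... + an*D(i-n) by least squares
    (normal equations solved by Gaussian elimination)."""
    # rows of predictors [D(i-1), ..., D(i-n)] and targets D(i)
    X = [[history[i - k] for k in range(1, n + 1)] for i in range(n, len(history))]
    y = history[n:]
    A = [[sum(r[p] * r[q] for r in X) for q in range(n)] for p in range(n)]
    b = [sum(r[p] * t for r, t in zip(X, y)) for p in range(n)]
    for col in range(n):                       # forward elimination with pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    a = [0.0] * n                              # back substitution
    for r in range(n - 1, -1, -1):
        a[r] = (b[r] - sum(A[r][c] * a[c] for c in range(r + 1, n))) / A[r][r]
    return a

def predict_next(history, a):
    """Predict the task time of the next iteration from the last n entries."""
    return sum(a[k] * history[-1 - k] for k in range(len(a)))
```

On a history generated by D(i) = 0.5·D(i−1) + 0.5·D(i−2), the fit recovers a ≈ [0.5, 0.5] and predicts the next value accordingly.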
Step 4.3.2: calculate the urgency U(o) of the scheduler object;
1) when the scheduler object is a branch, compute via formula (1) the time spans of all branches in the branch synchronization that the branch belongs to; the urgency U(o) of the branch then equals the difference between the longest branch time span in that synchronization and the branch's own time span, and the shorter this time difference, the stronger the urgency;
2) when the scheduler object is a branch combination {a, b}, the urgency of the branch combination {a, b} is min{U(a), U(b)}.
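The time-span and urgency rules of steps 4.3.1 and 4.3.2 can be sketched together. This is an illustrative simplification with hypothetical names: a branch is represented by its per-stage durations (stages execute serially), a combination by a tuple of its members (which run in parallel), and the sibling branches of a synchronization by their precomputed lengths.

```python
def time_span(obj):
    """Branch: list of stage durations, executed serially -> sum.
    Combination {a, b}: tuple of member objects, run in parallel -> max."""
    if isinstance(obj, tuple):
        return max(time_span(o) for o in obj)
    return sum(obj)

def urgency_of(obj, sync_lengths):
    """Slack before obj becomes the last branch of its branch synchronization.
    sync_lengths: time spans of all branches meeting at the same sync point.
    Combination: the minimum (most urgent) of its members' urgencies."""
    if isinstance(obj, tuple):
        return min(urgency_of(o, sync_lengths) for o in obj)
    return max(sync_lengths) - time_span(obj)
```

With branch a = stages [2, 3] (T = 5) and branch b = stages [4, 4] (T = 8) meeting at one synchronization, a can tolerate 3 units of delay while b, as the longest branch, can tolerate none; their combination inherits urgency 0.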
Further, the excess time ET_j is calculated as:
for the scheduling sequence e → o_j, the excess time is ET_j = max(0, T(e) − U(o_j)),
where T(e) and U(o_j) denote the time span of scheduler object e and the urgency of scheduler object o_j, respectively.
Compared with the prior art, the present invention obtains the following beneficial effects:
The data parallel job scheduling method based on branch-DAG dependencies of the present invention takes a chain-like section of the DAG graph with no convergence and no bifurcation as the smallest schedulable object, called a branch; a branch consists of consecutive computation stages and the network communication between them. The computation stages and the associated network communication within a branch must execute serially, so a branch can be regarded as one schedulable object. Using branches as schedulable objects to allocate resources, the computing tasks of every stage within a branch can be placed on the same machines, which largely satisfies data locality, reduces network overhead, and shortens the job completion time. In addition, the invention determines the urgency of each branch from the delay each branch within a branch synchronization can tolerate, so that short, less urgent branches are delayed in scheduling and the saved computing resources are allocated to other, more urgent jobs, accelerating the completion of branch synchronizations. As a result, the branch-DAG scheduling of the present invention reduces the average job completion time by 10-15% compared with Spark FIFO and critical path scheduling.
Detailed description of the invention
Fig. 1 is the directed acyclic graph of a data parallel job;
Fig. 2 is the overall structure diagram of the present invention;
Fig. 3 illustrates how the present invention converts a job DAG into branches;
Fig. 4 is the flow chart of the branch scheduling method of the invention;
Fig. 5 shows the variation of branch prediction accuracy;
Fig. 6 compares the average JCT of the branch scheduling method with Spark FIFO and shortest-job-first scheduling;
Fig. 7 shows the performance of the branch scheduling method, Spark FIFO, and critical path scheduling at different moments after 30 jobs are submitted simultaneously;
Fig. 8a shows how the prediction overhead changes as the number of branches increases, and Fig. 8b the complexity of the suspended-branch scheduling algorithm as the number of branches increases;
Fig. 9a shows that within the same time the branch scheduling method completes more jobs; Fig. 9b shows the job completion time ratio of the branch scheduling method to the CARBYNE method;
Fig. 10 shows the variation of the average job completion time of the three scheduling methods as the number of machines, the number of jobs, and the job submission time span change; Fig. 10a gives the average JCT of 5000 jobs executed on different numbers of machines, Fig. 10b shows that the average JCT grows with the number of jobs, and Fig. 10c shows the trend of the average JCT of the three scheduling methods as the job submission time span varies.
Specific embodiment
In order to better understand the technical solutions of this application, Figs. 2 to 10 illustrate a specific embodiment of the data parallel job scheduling method based on branch-DAG dependencies of the present invention, comprising the following steps:
Step 1: the job end receives jobs;
Step 2: traverse the DAG task graph of each job and find the convergence points and bifurcation points of the DAG; a convergence point in the DAG task graph is called a branch synchronization, a chain-like part of the DAG task graph with no convergence and no bifurcation is called a branch, and a branch that depends on no other branch, or whose depended-on branches have all finished executing, is called a suspended branch;
Fig. 3 illustrates branches and branch synchronization. For a data parallel job, data processing is completed in a certain operation order, and dependencies exist between these operations: some operations can only execute after certain other operations complete. In Fig. 3, the task corresponding to stage 8 can only execute after stage 2, stage 3, and stage 5 complete. Stage 8 is therefore an aggregation node and is chosen as a branch synchronization point. A chain of stages with no convergence and no bifurcation becomes a branch; for instance, stage 1 and stage 2 constitute branch 1. A branch may contain a single stage or multiple stages.
Step 3: traverse the DAG task graph of every job at the job end, find the suspended branches in each DAG task graph, and add them to the suspended-branch set B;
Step 4: execute the branch scheduling algorithm on the branches in the suspended-branch set B to obtain the branch scheduling sequence P;
This embodiment uses the branch scheduling algorithm shown in Fig. 4:
Step 4.1: divide the branches in the suspended-branch set B into multiple scheduler objects; if the total computational resource demand of several parallel branches in the suspended-branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the combination as one scheduler object, while every branch that is not packed is a scheduler object on its own;
Step 4.2: construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3: calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.3.1: calculate the time span T(o) of the scheduler object o;
1) when the scheduler object is a branch, its time span T(o) is given by formula (1),
where S is the set of stages in the branch, T_s the set of computing tasks of stage s, and W the set of worker threads; D_{t,w} is the time that worker thread w needs to execute computing task t, and D_{t,w} is predicted by the model of formula (2):
D_{t,w}(i) = a_1·D_{t,w}(i−1) + a_2·D_{t,w}(i−2) + … + a_n·D_{t,w}(i−n)  (2)
Formula (2) states that the time needed by thread w to execute the current task t is predicted from the collected historical execution data of all tasks of the computation stage containing t; the history contains n entries, and i denotes the iteration number of the job. Fitting the model of formula (2) yields estimates of the parameters a_1, a_2, …, a_n.
2) when the scheduler object is a branch combination {a, b}, the time span of the branch combination {a, b} is max{T(a), T(b)};
Step 4.3.2: calculate the urgency U(o) of the scheduler object;
1) when the scheduler object is a branch, compute via formula (1) the time spans of all branches in the branch synchronization that the branch belongs to; the urgency U(o) of the branch then equals the difference between the longest branch time span in that synchronization and the branch's own time span, and the shorter this time difference, the stronger the urgency.
2) when the scheduler object is a branch combination {a, b}, the urgency of the branch combination {a, b} is min{U(a), U(b)}.
Step 4.4: sort the scheduler objects in the scheduler object set E by urgency to obtain the sorted sequence p, then set P = P ∪ p and output the scheduling sequence P.
Executing all branches of one job concurrently would occupy excessive resources and hinder the execution of other jobs. Consider a branch synchronization composed of multiple branches: its completion time is determined by the branch that takes the longest. Since the time spans of the branches differ, the branches with shorter execution times can be appropriately delayed; as long as the other, shorter branches finish before the longest branch completes, the time of the branch synchronization is not extended, and delaying the shorter branches reserves computing resources for the long branch. However, if a short branch is delayed too long, the branch synchronization completion time also increases. The scheduling principle of this invention is to postpone short branches without letting any of them become the last branch to complete in its branch synchronization. Based on this heuristic, the invention proposes branch urgency. As shown in Fig. 3, for branch synchronization 1, branch 3 has the longest time in synchronization 1, so its time serves as the time limit of branch synchronization 1. The urgency of a branch is defined as the maximum delay in scheduling the branch can tolerate under the time limit of its branch synchronization. For example, the urgency of branch 4 is the time limit of branch synchronization 1 minus the time of branch 4; it is the longest that branch 4's scheduling can be delayed. This embodiment sorts all branches of every job at the job end by urgency and schedules the more urgent branches first, reducing the delay to branch synchronizations caused by incorrect scheduling.
However, this scheduling method, which simply sorts by urgency, may still extend the time at which some branch synchronizations complete. To address this problem, the invention proposes a scheduling algorithm that minimizes the branch synchronization time, and the branch scheduling method is therefore improved as follows:
Further, the branch scheduling algorithm described in step 4 further comprises the following steps:
Step 4.1': divide the branches in the suspended-branch set B into multiple scheduler objects; if the total computational resource demand of several parallel branches in the suspended-branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the combination as one scheduler object, while every branch that is not packed is a scheduler object on its own;
Step 4.2': construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3': calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4': in the scheduler object set E, select the scheduler object e with the strongest urgency;
Step 4.5': for every other object o_j in the scheduler object set E except e, denote the order e → o_j as the temporary scheduling sequence p_j, 1 ≤ j ≤ J−1, where J is the total number of scheduler objects in the set;
Step 4.6': calculate the excess time ET_j of each temporary scheduling sequence p_j;
The excess time ET_j is calculated as:
for the scheduling sequence e → o_j, the excess time is ET_j = max(0, T(e) − U(o_j)),
where T(e) and U(o_j) denote the time span of scheduler object e and the urgency of scheduler object o_j, respectively.
The concept of excess time is introduced because, as described above, delaying some branches does not increase the completion time of their branch synchronizations. To avoid delaying a branch synchronization, these branches must be identified, and the excess time is used to reason about the consequence of delaying a branch.
The excess time is defined as follows: let a and b be branches of jobs J1 and J2; because computing resources are limited, a and b can only execute serially. If branch b is delayed and branch a executes first, the excess time is ET = max(0, T(a) − U(b)),
where T(a) and U(b) denote the time span of branch a and the urgency of branch b, respectively.
If T(a) ≤ U(b), the excess time is zero, meaning that executing a first and b afterwards does not delay job J2. Otherwise, job J2 is delayed because branch b's scheduling is delayed, and the completion time of J2 increases. In practice it can also happen that whichever of a and b is delayed, the excess time is non-zero: delaying a increases the completion time of J1, while delaying b increases the completion time of J2. A trade-off is therefore needed, and the scheduling scheme with the shortest excess time is selected.
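The trade-off above can be made concrete with a small worked example. The numbers are hypothetical: branch a of job J1 takes 6 time units and tolerates 2 units of delay, while branch b of job J2 takes 3 units and tolerates 1.

```python
def excess_time(T_first, U_second):
    """Completion slip of the delayed branch when it waits for the first:
    ET = max(0, T(first) - U(second))."""
    return max(0.0, T_first - U_second)

# Hypothetical contending branches: a from job J1, b from job J2
T_a, U_a = 6.0, 2.0   # a runs 6 units, tolerates 2 units of delay
T_b, U_b = 3.0, 1.0   # b runs 3 units, tolerates 1 unit of delay

et_a_first = excess_time(T_a, U_b)  # run a first, delay b: max(0, 6-1) = 5
et_b_first = excess_time(T_b, U_a)  # run b first, delay a: max(0, 3-2) = 1
best = 'b->a' if et_b_first < et_a_first else 'a->b'
```

Either order produces a non-zero excess time, but running b first delays J1 by only 1 unit instead of delaying J2 by 5, so the order b → a is chosen.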
Step 4.7': append the temporary scheduling sequence p_j corresponding to min(ET_j) to the scheduling sequence P, i.e. P = P ∪ p_j;
Step 4.8': remove from the scheduler object set E the branches involved in e and o_j;
Step 4.9': repeat steps 4.4' to 4.8' until every branch or branch combination in the scheduler object set E has been added to the scheduling sequence P, then output the scheduling sequence P.
Step 5: when computing resources are available, allocate execution units and execute branch tasks according to the branch scheduling sequence P; after a suspended branch finishes executing, the computing resources it occupied are released;
Step 6: repeat steps 3 to 5 until every branch of every DAG graph at the job end has been executed.
The present invention takes an independent path section of the DAG as the smallest schedulable object, called a branch; a branch consists of consecutive computation stages and the network communication between the stages. For example, branch 1 of Fig. 1 is composed of Stage1, coflow1, Stage2, coflow2, and Stage3. The computation stages and associated network communication within a branch must execute serially, so a branch can be regarded as one schedulable object, which avoids the conflict between DAG scheduling and coflow scheduling when computation stages and network communication are scheduled separately. In addition, the branch concept accounts for the requirement of data locality. In Fig. 1, an upstream computation stage outputs calculated results that serve as the input data of the downstream computation stage. After Stage1 completes, its output is stored on the machines where Stage1's computing tasks were executed, and this temporarily saved data becomes the input of Stage2. When Stage2 starts to execute, its computing tasks pull the intermediate data from the machines that executed Stage1; this process generates the network communication, namely coflow1. If the computing tasks of Stage2 are also placed on the same machines, data locality is largely satisfied. The invention therefore uses branches as schedulable objects to allocate resources, so that Stage1 and Stage2 can be scheduled onto the same group of machines, reducing network overhead and the job completion time.
Furthermore, because the multiple parallel paths of a DAG intersect with each other, synchronization points are formed. The present invention calls the convergence of multiple branches a branch synchronization: only after the branches executing in parallel have synchronized can the subsequent computation stages start to execute, and within a data-parallel job, job scheduling depends on the slowest of the parallel branches. In Fig. 1, branch 1 and branch 2 intersect at the last computing task, which constitutes a branch synchronization; the job can finish only after this branch synchronization completes. Branch synchronization therefore strongly affects the job completion time (JCT). Executing all branches of one branch synchronization simultaneously occupies a large amount of computing resources and hinders the execution of other branch synchronizations. Since the job completion time is in fact determined by the slowest branch, the present invention predicts the completion time of each branch to compute each branch's tolerable delay, thereby determining the priority, or urgency, of each branch: the urgent branches of different jobs are executed in parallel first, and the less urgent branches are executed serially afterwards. Postponing the less urgent branches leaves more computing resources for completing the more urgent jobs. To verify the effect of the invention, experiments were conducted on a laboratory Spark cluster; the results show that the branch scheduling method outperforms Spark FIFO, shortest-job-first scheduling, and the critical-path scheduling method, reducing the average job completion time (JCT) by 10-15%. In addition, large-scale simulation tests were carried out. The simulation experiments use the Google cluster trace (https://github.com/google/cluster-data) and compare against CARBYNE, the latest DAG scheduling method described in document 2. The experimental results also show that the branch scheduling (BS) method achieves faster JCT than CARBYNE.
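The urgency rule above — a branch's tolerable delay is the gap between the longest branch in its synchronization and the branch itself — can be illustrated with a small sketch. The function name, the data layout, and the numbers are assumptions for illustration only:

```python
def urgency(sync_groups, pred_time):
    """For each branch, urgency = (longest predicted branch time in its
    branch synchronization) - (its own predicted time).  A smaller value
    means less tolerable delay, i.e. a more urgent branch."""
    u = {}
    for group in sync_groups:
        longest = max(pred_time[b] for b in group)
        for b in group:
            u[b] = longest - pred_time[b]
    return u

# Assumed example: branch1 and branch2 converge in one synchronization.
pred_time = {"branch1": 120.0, "branch2": 90.0}   # predicted seconds
u = urgency([["branch1", "branch2"]], pred_time)
# branch1 lies on the longest path, so it can tolerate no delay at all,
# while branch2 may be postponed by up to 30 seconds.
print(u)
```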
The effect of the invention is verified below by constructing an experimental environment.
As shown in Fig. 2, in the system architecture of the branch scheduling method, the master node is responsible for receiving job submissions and allocating computing resources. A user submits jobs to the Spark cluster through a driver interface. The driver manages the progress of the jobs the user has submitted, which includes decomposing a data-parallel job into multiple parallel computing tasks and generating a DAG. During the execution of a data-parallel job, the driver communicates with the master node to track changes in cluster resources (for example, to determine which servers have idle computing resources on which computing tasks can be launched). The nodes that actually execute the computing tasks are called worker nodes. The master node also manages the whole cluster, maintains the state of each server, and indexes the computation results on each server, so that subsequent computing tasks can determine where their input data is located.
We call the process of assigning computing tasks scheduling. After the driver interface determines the total amount of computing resources required and reports it to the master, the master node decides, according to the state of each server it tracks (for example, whether the server has been allocated to other computing jobs and whether it still has assignable computing resources), the total amount of computing resources to allocate to the job handled by that driver. After receiving the master node's feedback, the driver interface must also communicate with each allocated worker node to establish communication channels for distributing computing tasks later. During the execution of a data-parallel job, the worker nodes provide a group of executor units; each executor is a Java virtual machine wrapping a certain number of CPU cores and an amount of memory. Through its communication with the worker nodes, the driver node learns about each executor. The driver interface then dispatches the parallel tasks produced by decomposing the data-parallel job to these executors one by one. After an executor finishes its computing operations, it stores the computation result locally or feeds it back to the driver.
As shown in Fig. 2, after a job submitted to the driver is decomposed into a DAG, a branch-structured DAG can be derived from it for branch scheduling. We also added a module for communication between the driver interface and the executors. After an executor finishes a computing task, this communication module feeds the execution information of that task — such as the computation time, the network communication time, and the disk read/write time — back to the driver node. From the completion-time information of each task of the job, the driver node estimates the length of each branch and computes each branch's urgency. Once the driver interface has determined the urgency of each branch, it reports these urgencies to the master node, and the master node schedules the branches from different jobs according to their urgency.
The following verifies that the branch scheduling method can schedule multiple concurrent jobs and reduce the job completion time (JCT). This embodiment runs a large number of experiments on a real Spark cluster to evaluate the performance of the branch scheduling method. The experiments compare Spark's first-in-first-out (FIFO) scheduling, shortest-job-first (SJF) scheduling, and the critical-path (CP) method. Note that CP is a job scheduling method implemented for comparison, intended to prioritize the computing tasks on the critical path of a DAG job. We measure the average JCT of multiple jobs running simultaneously; these jobs are submitted to the cluster concurrently. In addition, we evaluate the prediction accuracy, the system overhead, and the performance improvement of the branch scheduling method in an online job-submission scenario.
The experiments in this embodiment use a cluster of 30 servers distributed across 6 racks. Each server is equipped with an Intel Xeon E5-2650 2.2GHz 12-core processor. Three kinds of Spark jobs are mixed in the experiments — 50% PageRank, 30% logistic regression, and 20% machine learning jobs — to ensure diversity in branch time lengths. PageRank is a typical iterative graph computation application; logistic regression is a common regression analysis application; the machine learning application uses an iterative learning process. In the experiments, the number of jobs is increased from 10 to 50, and the input data size of a job is increased from 10GB to 50GB.
Testing the online branch completion-time prediction accuracy: to compute branch urgency, the branch time is predicted by the prediction formulas described above, so prediction accuracy is an important evaluation metric. After jobs are submitted, we first measure the actual execution time of each branch and record the predicted branch time, and then compute the prediction error as |Tpred − Tmeas| / Tmeas, where Tpred and Tmeas denote the predicted and measured branch times, respectively. The evolution of the branch prediction accuracy is shown in Fig. 5. The experimental results show that most branch predictions are accurate. At the beginning the prediction error grows, but after the 6000th prediction the error stabilizes at around 25%. As the number of prediction rounds increases, the prediction becomes more accurate; the repeated execution of iterative jobs therefore improves the prediction accuracy.
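As a hedged illustration of the auto-regressive task-time model (formula 2 of the claims) and of the relative prediction error evaluated here, the following sketch fits the coefficients by least squares. The history values and the window size n are assumed, and least-squares fitting is one plausible reading of the "model fitting" the patent mentions:

```python
import numpy as np

def fit_ar_model(history, n):
    """Fit the auto-regressive model of formula 2,
    D(i) = a1*D(i-1) + ... + an*D(i-n),
    by least squares over the collected per-iteration execution times."""
    # Each row holds the n most recent times, newest first.
    X = np.array([history[i - n:i][::-1] for i in range(n, len(history))])
    y = np.array(history[n:])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_next(history, coeffs):
    """Predict the next execution time from the last n measurements."""
    n = len(coeffs)
    return float(np.dot(coeffs, history[-1:-n - 1:-1]))

# Assumed measurements of one task's execution time over iterations (s).
history = [10.0, 10.5, 10.2, 10.4, 10.3, 10.35, 10.32]
a = fit_ar_model(history, n=2)
t_pred = predict_next(history, a)
t_meas = 10.33                        # the time actually measured next
err = abs(t_pred - t_meas) / t_meas   # relative error, as in Fig. 5
```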
Evaluating the average job completion time:
After a batch of jobs is submitted simultaneously, the average JCT of this batch of DAG jobs is evaluated. The number of jobs in the batch increases from 10 to 50. Fig. 6 shows that the branch scheduling method achieves a 10-15% lower average JCT than Spark's FIFO method. When the number of jobs increases to 40 and 50, the reduction in average JCT stabilizes at around 15%. The SJF and CP methods achieve an average JCT only about 5% lower than Spark FIFO, and this hardly changes as the number of jobs grows; the JCT reduction of the branch scheduling method, by contrast, grows as the number of jobs increases. In general, the more jobs are submitted simultaneously, the larger the average JCT, but the branch scheduling method can pack branches more effectively. In comparison, Spark FIFO, SJF, and CP cannot perceive the branch synchronizations in the DAG, which causes a longer exceeded time as the number of jobs increases.
Evaluating the number of jobs completed over time:
Fig. 7 shows the performance at different moments after 30 jobs are submitted simultaneously. In the first 500 seconds after submission, the performance of the three methods is very close. Then the completion rate of the Spark FIFO method drops significantly. The branch scheduling method completes almost half of the jobs by around the 1000th second, and two thirds of the jobs within the first 1500 seconds. This result shows that the branch scheduling method completes the same batch of jobs faster than the other two methods.
Overhead experiment: the main system overhead is caused by branch prediction and branch scheduling. The prediction overhead involves collecting task completion-time information and predicting branch time lengths. Fig. 8a shows that the prediction overhead grows gradually as the number of branches increases; compared with executing the branch scheduling algorithm, the prediction overhead is relatively small. Fig. 8b shows the complexity of the branch scheduling algorithm as the number of pending branches increases from 20 to 70. The time complexity of the algorithm grows with the number of jobs: when more jobs are submitted, there are more branch combinations, which increases the running overhead of the branch scheduling algorithm and makes its complexity grow superlinearly. Nevertheless, compared with the JCT, this overhead is acceptable.
This embodiment simulates 5000 typical DAG jobs from the Google cluster trace, using SJF, CP, and CARBYNE as comparison methods. The CARBYNE method altruistically contributes leftover computing resources to the shortest jobs. All simulations are completed on a computer equipped with an Intel(R) Core(TM) i7-4700MQ CPU 2.40GHz and 32GB RAM.
Experimental data set: the Google cluster trace records one month of job completion logs. These logs contain detailed task completion information, resource requirements, machine states, constraints, and so on. However, the trace does not provide useful DAG information: most tasks have only simple dependencies, such as MapReduce. To simulate the dependencies of realistic jobs, this embodiment synthesizes actual job dependencies. Most DAGs contain fewer than 10 branches and 5 branch synchronizations (50th percentile); a small fraction of DAGs contain up to 34 branches and 13 branch synchronizations (95th percentile).
Simulation of the job completion time (JCT): the experiments evaluate the average JCT of the branch scheduling method, CARBYNE, SJF, and CP. 5000 jobs are used, all submitted within 600 seconds and sharing 500 machines. First, the cumulative distribution function (CDF) of job completion is measured. As shown in Fig. 9a, at any given moment the branch scheduling method completes more jobs. We further evaluate the JCT reduction in detail: Fig. 9b shows the ratio of the job completion times of the branch scheduling method to those of the CARBYNE method. The branch scheduling method achieves lower JCT; for more than 15% of the jobs, the JCT reduction reaches 20%. For most jobs, the branch scheduling method achieves a larger JCT reduction than CARBYNE.
The influence of the number of machines: resource capacity is an important performance factor. This embodiment increases the number of simulated machines from 500 to 1000. Fig. 10a shows the average JCT of the 5000 jobs executed on different numbers of machines. The average JCT decreases as the number of machines increases, and the branch scheduling method achieves the smallest average JCT. However, when the number of machines reaches 1000, the performance of the CARBYNE method gradually approaches that of the branch scheduling method.
The influence of the number of jobs: Fig. 10b shows that the average JCT increases with the number of jobs. In this experiment, the number of jobs is increased from 2000 to 5000. The branch scheduling method achieves a smaller average JCT than the other methods; thus, when more jobs compete for cluster resources, the branch scheduling method achieves more economical scheduling, and its JCT reduction grows with the number of jobs.
The influence of the job submission time: when jobs are submitted at different rates, the scheduler's performance differs because of the different backlog conditions. We increase the interval between job submissions from 300 seconds to 1500 seconds. Fig. 10c shows that the average JCT of the branch scheduling method is the smallest. The average JCT of the three methods rises as the interval time increases.
The above simulation experiments demonstrate from different angles that, under the condition of limited cluster resources, when many jobs are submitted within a relatively short period of time, the branch scheduling method achieves higher performance.
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions under the concept of the present invention belong to its protection scope. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications made without departing from the principles of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims (5)

1. A data-parallel job scheduling method based on branch DAG dependency, characterized by comprising the following steps:
Step 1: collect the jobs to be run into a job pool;
Step 2: traverse the DAG task graph of a job and find the convergence points and bifurcation points of the DAG; a convergence point in the DAG task graph is called a branch synchronization; a chain-shaped part of the DAG task graph without convergence or bifurcation is called a branch; a branch that does not depend on other branches, or whose depended-on branches have all finished executing, is called a pending branch;
Step 3: traverse the DAG task graph of every job in the job pool, find the pending branches in each DAG task graph, and add the pending branches found to the pending branch set B;
Step 4: execute the branch scheduling algorithm on the branches in the pending branch set B to obtain a branch scheduling sequence P;
Step 5: when computing resources are available, allocate execution units according to the branch scheduling sequence P and execute the branch tasks;
Step 6: repeat steps 3 to 5 until every branch in the DAG graph of every job in the job pool has been executed.
2. The data-parallel job scheduling method based on branch DAG dependency according to claim 1, characterized in that the branch scheduling algorithm in step 4 comprises:
Step 4.1: divide the branches in the pending branch set B into multiple scheduler objects: if the total computing resource demand of several parallel branches in the pending branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the branch combination as one scheduler object; every branch that is not packed is a scheduler object by itself;
Step 4.2: construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3: compute the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4: sort the scheduler objects in the scheduler object set E by urgency to obtain a sorted sequence p, then set P = P ∪ p and output the scheduling sequence P.
3. The data-parallel job scheduling method based on branch DAG dependency according to claim 1, characterized in that the branch scheduling algorithm in step 4 comprises:
Step 4.1': divide the branches in the pending branch set B into multiple scheduler objects: if the total computing resource demand of several parallel branches in the pending branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the branch combination as one scheduler object; every branch that is not packed is a scheduler object by itself;
Step 4.2': construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3': compute the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4': select the scheduler object e with the strongest urgency from the scheduler object set E;
Step 4.5': for every other object o_j in the scheduler object set E apart from the scheduler object e, denote e → o_j as a temporary scheduling sequence p_j, 1 ≤ j ≤ J−1, where J is the total number of scheduler objects in the scheduler object set;
Step 4.6': compute the exceeded time ET_j of each temporary scheduling sequence p_j;
Step 4.7': append the temporary scheduling sequence p_j corresponding to min(ET_j) to the scheduling sequence P: P = P ∪ p_j;
Step 4.8': remove the branches involved in e and o_j from the scheduler object set E;
Step 4.9': repeat steps 4.4' to 4.8' until every branch or branch combination in the scheduler object set E has been added to the scheduling sequence P, then output the scheduling sequence P.
4. The data-parallel job scheduling method based on branch DAG dependency according to claim 2 or 3, characterized in that the urgency U(o) of a scheduler object is computed as follows:
Step 4.3.1: compute the time length T(o) of the scheduler object o;
1) when the scheduler object is a branch, its time length T(o) is given by formula 1, where S denotes the set of stages in the branch, T_s denotes the set of computing tasks of a stage s, W denotes the set of worker threads, and D_{t,w} denotes the time used by thread w to execute computing task t; D_{t,w} is predicted by the prediction model of formula 2:
D_{t,w}(i) = a_1·D_{t,w}(i−1) + a_2·D_{t,w}(i−2) + ... + a_n·D_{t,w}(i−n)   (2)
Formula 2 states that the time used by thread w to execute the current task t is predicted from the historical execution data of all tasks in the computation stage where task t is located; the history contains n entries in total, and i denotes the iteration number of the job; fitting the model of formula 2 yields the estimated values of the parameters a_1, a_2, ..., a_n;
2) when the scheduler object is a branch combination {a, b}, the time length of the branch combination {a, b} is max{T(a), T(b)};
Step 4.3.2: compute the urgency U(o) of the scheduler object;
1) when the scheduler object is a branch, compute by formula 1 the time lengths of all branches in the branch synchronization to which the branch belongs; the urgency U(o) of the branch is the difference between the longest branch time length in that synchronization and the branch's own time length; the shorter this difference, the stronger the urgency;
2) when the scheduler object is a branch combination {a, b}, the urgency of the branch combination {a, b} is min{U(a), U(b)}.
5. The data-parallel job scheduling method based on branch DAG dependency according to claim 4, characterized in that the exceeded time ET_j in step 4.6' is computed as follows: for the scheduling sequence e → o_j, the exceeded time is ET_j = T(e) − U(o_j), where T(e) and U(o_j) denote the time length of the scheduler object e and the urgency of the scheduler object o_j, respectively.
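Outside the claims themselves, the selection-and-ordering loop of claims 3-5 can be sketched in Python. This is an illustrative reading, not the patented implementation: the time lengths and urgencies are assumed inputs, the exceeded time of a pair e → o_j is taken as T(e) − U(o_j) per claim 5, and the simplification that each scheduler object is a single branch replaces step 4.8's removal of "the branches involved":

```python
def schedule_pending(objects, T, U):
    """Greedy ordering per claims 3-5: repeatedly take the most urgent
    scheduler object e (smallest tolerable delay U), pair it with the
    follower o_j that minimizes the exceeded time ET_j = T(e) - U(o_j),
    and append the pair e -> o_j to the scheduling sequence P."""
    E = set(objects)
    P = []
    while len(E) >= 2:
        e = min(E, key=lambda o: U[o])             # strongest urgency
        rest = E - {e}
        oj = min(rest, key=lambda o: T[e] - U[o])  # min exceeded time
        P += [e, oj]
        E -= {e, oj}
    P += list(E)                                   # at most one left over
    return P

# Assumed example: four pending scheduler objects with predicted time
# lengths T and urgencies U (smaller U = more urgent).
T = {"b1": 120, "b2": 90, "b3": 60, "b4": 40}
U = {"b1": 0, "b2": 30, "b3": 10, "b4": 70}
print(schedule_pending(T.keys(), T, U))
# -> ['b1', 'b4', 'b3', 'b2']
```

Note that minimizing T(e) − U(o_j) pairs the most urgent object with the follower that can tolerate the longest delay, so the least harm is done by making it wait behind e.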
CN201910514403.0A 2019-06-14 2019-06-14 Data parallel job scheduling method based on branch DAG dependency Active CN110275765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910514403.0A CN110275765B (en) 2019-06-14 2019-06-14 Data parallel job scheduling method based on branch DAG dependency


Publications (2)

Publication Number Publication Date
CN110275765A true CN110275765A (en) 2019-09-24
CN110275765B CN110275765B (en) 2021-02-26

Family

ID=67960808


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688993A (en) * 2019-12-10 2020-01-14 中国人民解放军国防科技大学 Spark operation-based computing resource determination method and device
CN110730470A (en) * 2019-10-24 2020-01-24 北京大学 Mobile communication equipment integrating multiple access technologies
CN111857984A (en) * 2020-06-01 2020-10-30 北京文思海辉金信软件有限公司 Job calling processing method and device in bank system and computer equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868019A (en) * 2016-02-01 2016-08-17 中国科学院大学 Automatic optimization method for performance of Spark platform
US20190065336A1 (en) * 2017-08-24 2019-02-28 Tata Consultancy Services Limited System and method for predicting application performance for large data size on big data cluster


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DONGSHENG LI et al.: "ReB: Balancing Resource Allocation for Iterative Data-Parallel Jobs", In Proceedings of ACM Conference (Conference'17) *
MASTERT-J: "Spark Explained (5): Spark Job Execution Principles", https://blog.csdn.net/qq_21125183/article/details/87875902 *
WEI WANG et al.: "Coflex: Navigating the Fairness-Efficiency Tradeoff for Coflow Scheduling", IEEE INFOCOM 2017 - IEEE Conference on Computer Communications *
TIAN Guozhong et al.: "A Cost Optimization Method for Scheduling Multiple DAG Tasks Sharing Heterogeneous Resources", Acta Electronica Sinica *
HU Zhiyao et al.: "Advances in Data Center Network Flow Scheduling Techniques", Journal of Computer Research and Development *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant