CN110275765A - Data parallel job scheduling method based on branch DAG dependency - Google Patents
- Publication number: CN110275765A (application CN201910514403.0A)
- Authority: CN (China)
- Prior art keywords: branch; scheduler object; DAG; urgency; time
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
Abstract
The invention discloses a data parallel job scheduling method based on branch DAG dependency, comprising the following steps: 1. the job end receives jobs; 2. the DAG task graph of each job is traversed to find the branches and branch synchronizations in the graph and the suspended branches among them; 3. the suspended branches of every DAG graph at the job end are collected into a suspended branch set B; 4. a branch scheduling algorithm is executed on the branches in the suspended branch set B to obtain a branch scheduling sequence P; 5. when computing resources are available, execution units are allocated and branch tasks are executed according to the branch scheduling sequence P; 6. steps 3 to 5 are repeated until every branch of every DAG graph at the job end has been executed. By determining the urgency of each branch, the invention delays the scheduling of non-urgent branches, saving computing resources for more urgent jobs and accelerating the completion of branch synchronizations. Compared with other scheduling methods, the average job completion time is reduced by 10-15%.
Description
Technical field
The invention belongs to the field of parallel and distributed computing, and in particular relates to a data parallel job scheduling method based on branch DAG dependency.
Background art
Big data analysis jobs, such as machine learning, graph computation, and stream computing, have become a key part of daily life. Platforms such as Hadoop and Spark were proposed to process data parallel jobs efficiently. However, this raises several challenging technical issues, such as job scheduling and network communication. For big data analysis jobs, the job completion time (JCT) is an extremely important metric: the JCT is the period from the submission of a data parallel job to its completion. A data parallel job consists of multiple computation stages and the network communication between them. These stages must execute in a specified order so that their dependencies are not violated. The dependencies form a directed acyclic graph, so the whole job can be represented by a DAG (Directed Acyclic Graph).
Show.Currently, newest DAG dispatching method document 1 " R.Grandl, S.Kandula, S.Rao, A.Akella, and
J.Kulkarni,“GRAPHENE:packing and dependency-aware scheduling for data-
parallel clusters,”in USENIX Symposium on Operating Systems Design and
Implementation (OSDI), 2016, pp.81-97 ", and document 2 " R.Grandl, M.Chowdhury, A.Akella, and
G.Ananthanarayanan,“Al-truistic scheduling in multi-resource clusters,”in
USENIX Symposium on Operating Systems Design and Implementation(OSDI),2016,
Pp.65-80. " can dispatch calculation stages using certain special heuristic mutation operations method.But both dispatching methods are simultaneously
Network communication is not accounted for.The network communication of this data parallel operation can be related to data shuffling (data in practice
Shuffle) the problem of.Document 3 " M.Chowdhury, M.Zaharia, J.Ma, M.I.Jordan, and I.Stoica,
“Manag-ing data transfers in computer clusters with orchestra,”ACN SIGCOMM
Computer Communication Review, vol.41, no.4, pp.98-109,2011. " indicates this network communication
Time accounts for the 50% of operation deadline JCT, therefore can have a significant impact to the deadline of operation.It is asked to solve this
Topic, nearest data stream scheduling method such as document 4 " Q.Liang and E.Modiano, " Coflow scheduling in
input-queued switches:Optimal delay scaling and algorithms,”in IEEE
Conference on Comput-er Communications(INFOCOM),2017,pp.10–18.
" and document 5 " W.Wang, S.Ma, B.Li, and B.Li, " Coflex:Navigating the fairness-
efficiency tradeoff for coflow scheduling,”in IEEE Conference on Computer
Communications (INFOCOM), 2017, pp.46-54. " is described, propose it is a kind of for the new abstract of network data flow,
Referred to as coflow.Coflow refers to the one group of parallel data transmitted between continuous calculation stages in two dependences
Stream.Recently, for coflow dispatching method, it is intended to reduce the average coflow deadline.Fig. 1 shows a data parallel
The circulant Digraph (DAG) of operation.This operation is made of 5 calculation stages and 4 coflow.Critical path scheduling algorithm meeting
Calculation stages on priority scheduling longest path.Critical path scheduling method can preferentially execute Stage1, Stage2, Stage3.
However, already existing coflow dispatching method can the smallest coflow of priority scheduling.If it is assumed that coflow3 ratio
If coflow1, coflow2 want small, coflow dispatching method can first carry out coflow3, delay to execute coflow1.This can lead
After causing key methodology to execute Stage1, Stage2 cannot be immediately dispatched, because coflow1 will wait coflow3 to first carry out
It completes.But the dispatching method of this coflow, it can not cooperate well with DAG dispatching method.This is because the two
Optimization aim it is not identical.
DAG scheduling is a technique that determines the scheduling priority of each task, under limited resources, from the directed acyclic graph (DAG) of a job. It is widely used across computational problems, from multiprocessor DAG job scheduling in stand-alone environments to the data parallel job DAG scheduling that we address here. Through DAG scheduling, resource utilization can be maximized or the average job completion time minimized. Unlike ordinary task scheduling methods, DAG scheduling does not treat computing tasks in isolation; it pays attention to the dependencies between them and emphasizes the semantic relations at the job level. Any method that uses the DAG information of a job during task scheduling can be regarded as a DAG scheduling method. The main current DAG scheduling methods are: (1) critical path algorithms, which find the most critical execution path in the whole job DAG; tasks on the critical path are prioritized and tasks on the remaining paths receive secondary consideration, but this approach does not consider the concurrency of multiple DAG jobs; (2) breadth-first scheduling, which prioritizes jobs with wider DAGs, and likewise lacks consideration of concurrent scheduling across jobs; (3) job packing scheduling, which considers the DAGs of all jobs comprehensively and schedules together jobs whose DAG structures complement each other, improving concurrency; however, its complexity is very high, making it difficult to apply in practice.
Summary of the invention
The technical problem to be solved by the present invention is how to reduce the job completion time when multiple jobs run concurrently. To this end, a data parallel job scheduling method based on branch DAG dependency is proposed.
To solve this problem, the technical scheme adopted by the invention is as follows:
A data parallel job scheduling method based on branch DAG dependency, comprising the following steps:
Step 1: the job end receives jobs;
Step 2: traverse the DAG task graph of each job and find the merge points and fork points of the DAG; a merge point in the DAG task graph is called a branch synchronization, a chain segment of the DAG task graph without merges or forks is called a branch, and a branch that depends on no other branches, or whose dependencies have all finished executing, is called a suspended branch;
Step 3: traverse the DAG task graph of every job at the job end, find the suspended branches in each DAG task graph, and add them to the suspended branch set B;
Step 4: execute the branch scheduling algorithm on the branches in the suspended branch set B to obtain a branch scheduling sequence P;
Step 5: when computing resources are available, allocate execution units and execute branch tasks according to the branch scheduling sequence P;
Step 6: repeat steps 3 to 5 until every branch of every DAG graph at the job end has been executed.
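The six steps above can be sketched as a small driver loop. The sketch below is illustrative only: the names (`suspended_branches`, `run_all`) are hypothetical, a toy mapping of branches to prerequisite sets stands in for real DAG task graphs, and `order_fn` stands in for the branch scheduling algorithm of step 4.

```python
def suspended_branches(deps, done):
    """Step 3: branches not yet executed whose prerequisite branches have
    all completed.  deps maps branch -> set of prerequisite branches."""
    return {b for b, pre in deps.items() if b not in done and pre <= done}

def run_all(deps, order_fn):
    """Steps 3-6: collect suspended branches, order them with the branch
    scheduling algorithm (order_fn stands in for step 4), execute them in
    that order (step 5), and repeat until the DAG drains (step 6)."""
    done, trace = set(), []
    while len(done) < len(deps):
        P = order_fn(suspended_branches(deps, done))  # step 4
        for b in P:                                   # step 5: execute
            trace.append(b)
            done.add(b)
    return trace
```

For a diamond-shaped DAG where branch 3 merges branches 1 and 2, branch 3 only becomes suspended after the first iteration completes both prerequisites.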
Further, the branch scheduling algorithm in step 4 is:
Step 4.1: divide the branches in the suspended branch set B into scheduler objects; if several parallel branches in the suspended branch set have a total computing resource demand below the resource capacity limit, pack these parallel branches into a branch combination BC and treat the combination as one scheduler object; each branch that is not packed is a scheduler object on its own;
Step 4.2: construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3: calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4: sort the scheduler objects in the scheduler object set E by urgency to obtain the sequence p, then let P = P ∪ p and output the scheduling sequence P.
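Steps 4.1 to 4.4 can be sketched as follows. The greedy first-fit packing and the dict-based `demand`/`urgency` inputs are simplifying assumptions of this sketch, not the patent's exact procedure; only the packing condition (combined demand within capacity) and the urgency ordering come from the text.

```python
def build_schedule(branches, demand, capacity, urgency):
    """Steps 4.1-4.4 (first variant), as a simplified sketch: greedily pack
    parallel branches whose combined resource demand stays within capacity
    into one branch combination BC; every other branch is a scheduler
    object on its own.  Then sort all objects by urgency, ascending: less
    slack means more urgent, so it is scheduled earlier."""
    objs, combo, load = [], [], 0
    for b in sorted(branches, key=lambda b: demand[b]):
        if load + demand[b] <= capacity:
            combo.append(b)
            load += demand[b]
        else:
            objs.append((b,))           # unpacked branch: singleton object
    if combo:
        objs.append(tuple(combo))       # the packed branch combination BC
    # urgency of a combination {a, b} is min{U(a), U(b)} (step 4.3.2)
    return sorted(objs, key=lambda o: min(urgency[b] for b in o))
```

With capacity 3, two small branches a and b (demand 1 each) are packed together, while branch c (demand 5) stays a singleton; c, the most urgent, is scheduled first.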
Further, the branch scheduling algorithm in step 4 may instead comprise the following steps:
Step 4.1': divide the branches in the suspended branch set B into scheduler objects; if several parallel branches in the suspended branch set have a total computing resource demand below the capacity limit, pack them into a branch combination BC and treat the combination as one scheduler object; each branch that is not packed is a scheduler object on its own;
Step 4.2': construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3': calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4': select the most urgent scheduler object e in the scheduler object set E;
Step 4.5': for every other object o_j in the scheduler object set E besides e, denote the sequence e → o_j as a temporary scheduling sequence p_j, where 1 ≤ j ≤ J−1 and J is the total number of scheduler objects in the set;
Step 4.6': calculate the exceed time ET_j of each temporary scheduling sequence p_j;
Step 4.7': append the temporary scheduling sequence p_j corresponding to min(ET_j) to the scheduling sequence P, i.e., P = P ∪ p_j;
Step 4.8': remove the branches involved in e and o_j from the scheduler object set E;
Step 4.9': repeat steps 4.4' to 4.8' until every branch or branch combination in the scheduler object set E has been added to the scheduling sequence P.
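Steps 4.4' to 4.9' amount to a greedy loop that pairs the most urgent object with the follower that minimizes the exceed time. The sketch below assumes precomputed time lengths `T` and urgencies `U` as plain dicts; the function names are hypothetical.

```python
def exceed_time(T_e, U_o):
    """ET = max{0, T(e) - U(o)}: scheduling e before o delays o's job only
    when e's time length exceeds o's tolerable delay (its urgency)."""
    return max(0.0, T_e - U_o)

def min_exceed_schedule(objects, T, U):
    """Steps 4.4'-4.9' as a sketch: repeatedly take the most urgent object
    e, pair it with the follower o_j whose temporary sequence e -> o_j has
    the smallest exceed time, and append that pair to the schedule P."""
    remaining, P = set(objects), []
    while remaining:
        e = min(remaining, key=lambda o: U[o])   # step 4.4': most urgent
        remaining.remove(e)
        if remaining:                            # steps 4.5'-4.8'
            o_j = min(remaining, key=lambda o: exceed_time(T[e], U[o]))
            remaining.remove(o_j)
            P += [e, o_j]
        else:
            P.append(e)
    return P
```

With three objects, the most urgent one ('a') goes first, and 'b' follows because placing it after 'a' incurs zero exceed time while 'c' would incur a positive one.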
Further, the urgency U(o) of a scheduler object in step 4.3 is calculated as follows:
Step 4.3.1: calculate the time length T(o) of the scheduler object o;
1) When the scheduler object is a branch, its time length T(o) is

T(o) = Σ_{s∈S} (Σ_{t∈T_s} D_{t,w}) / |W|   (1)

where S is the set of stages in the branch, T_s is the set of computing tasks of stage s, and W is the set of worker threads; that is, the total task time of each stage is spread over the |W| worker threads, and the stages of a branch execute serially. D_{t,w} is the time thread w takes to execute computing task t, predicted by the model of formula (2):

D_{t,w}(i) = a_1 D_{t,w}(i−1) + a_2 D_{t,w}(i−2) + ... + a_n D_{t,w}(i−n)   (2)

Formula (2) states that the time thread w needs to execute the current task t is predicted from the collected historical execution data of all tasks of the computation stage containing t; the history contains n records, and i is the iteration number of the job. Fitting the model of formula (2) to the history yields estimates of the parameters a_1, a_2, ..., a_n.
2) When the scheduler object is a branch combination {a, b}, the time length of the combination {a, b} is max{T(a), T(b)};
Step 4.3.2: calculate the urgency U(o) of the scheduler object;
1) When the scheduler object is a branch, calculate by formula (1) the time lengths of all branches in the branch synchronization containing it; the urgency U(o) of the branch is then the difference between the longest branch time length in that synchronization and the branch's own time length. The shorter this time difference, the more urgent the branch.
2) When the scheduler object is a branch combination {a, b}, the urgency of the combination {a, b} is min{U(a), U(b)}.
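The time length and urgency of a branch can be sketched directly from the definitions above. The sketch assumes formula (1) as reconstructed here (per-stage total task time divided across the worker threads, stages summed serially); `stages` is a hypothetical list of per-stage task-time lists.

```python
def branch_length(stages, n_threads):
    """Time length T(o) of a branch (formula (1)): the stages run
    serially, and each stage's total task time is assumed to be spread
    evenly over the n_threads worker threads."""
    return sum(sum(task_times) / n_threads for task_times in stages)

def branch_urgency(lengths, i):
    """Urgency of branch i within one branch synchronisation: the longest
    branch's length minus branch i's own length, i.e. the maximum delay
    branch i can tolerate without prolonging the synchronisation."""
    return max(lengths) - lengths[i]
```

The longest branch in a synchronization has urgency 0 (no tolerable delay), and a combination's urgency is the minimum over its members, per step 4.3.2.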
Further, the exceed time ET_j is calculated as follows:
For the scheduling sequence e → o_j, the exceed time is

ET_j = max{0, T(e) − U(o_j)}

where T(e) is the time length of scheduler object e and U(o_j) is the urgency of scheduler object o_j.
Compared with the prior art, the present invention has the following beneficial effects:
In the data parallel job scheduling method based on branch DAG dependency of the present invention, a chain segment of the DAG graph without merges or forks is treated as the smallest schedulable object, called a branch; a branch consists of consecutive computation stages and the network communication between them. The computation stages and the associated network communication within a branch must execute serially, so they can be regarded as one schedulable object. By using branches as the schedulable objects for resource allocation, the computing tasks of all stages of a branch can be placed on the same machines, which greatly improves data locality, reduces network overhead, and shortens the job completion time. In addition, the invention determines the urgency of each branch from the delay each branch in a branch synchronization can tolerate, so that shorter, less urgent branches are delayed; the computing resources thus saved are allocated to other, more urgent jobs, accelerating the completion of branch synchronizations. As a result, the branch DAG scheduling of the present invention reduces the average job completion time by 10-15% compared with Spark FIFO and critical path scheduling.
Brief description of the drawings
Fig. 1 is the directed acyclic graph of a data parallel job;
Fig. 2 is a schematic diagram of the overall architecture of the present invention;
Fig. 3 illustrates how the present invention converts a job DAG into branches;
Fig. 4 is a flow chart of the branch scheduling method of the invention;
Fig. 5 shows the variation of branch prediction accuracy;
Fig. 6 compares the average JCT of the branch scheduling method against Spark FIFO and shortest-job-first scheduling;
Fig. 7 shows the performance at different moments of the branch scheduling method, Spark FIFO, and critical path scheduling after 30 jobs are submitted simultaneously;
Fig. 8a shows how the prediction overhead changes as the number of branches increases, and Fig. 8b the complexity of the suspended branch scheduling algorithm as the number of branches increases;
Fig. 9a shows that within the same time the branch scheduling method completes more jobs; Fig. 9b shows the job completion time ratio of the branch scheduling method to the CARBYNE method;
Fig. 10 shows the variation of the average job completion time of the three scheduling methods as the number of machines, the number of jobs, and the job submission time span vary; Fig. 10a is the average JCT of 5000 jobs executed on different numbers of machines, Fig. 10b shows that the average JCT increases with the number of jobs, and Fig. 10c shows the trend of the average JCT of the three scheduling methods as the job submission time span varies.
Specific embodiment
To better understand the technical solution of the present application, Figs. 2 to 10 show a specific embodiment of the data parallel job scheduling method based on branch DAG dependency of the present invention, comprising the following steps:
Step 1: the job end receives jobs;
Step 2: traverse the DAG task graph of each job and find the merge points and fork points of the DAG; a merge point in the DAG task graph is called a branch synchronization, a chain segment of the DAG task graph without merges or forks is called a branch, and a branch that depends on no other branches, or whose dependencies have all finished executing, is called a suspended branch;
Fig. 3 is a schematic diagram of branches and branch synchronizations. For a data parallel job, data processing is completed in a certain order of operations, and there are dependencies between these operations: some operations can only execute after certain other operations have completed. In Fig. 3, the tasks of stage 8 can only execute after stages 2, 3, and 5 have completed. Stage 8 is therefore an aggregation node and is chosen as a branch synchronization point. A chain of stages without merges or forks becomes a branch; for example, stage 1 and stage 2 constitute branch 1. A branch may contain one stage or several.
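The fork and merge points of step 2 can be found with a simple degree count over the stage DAG. The sketch below is illustrative, with a hypothetical adjacency-list representation of the graph.

```python
def forks_and_syncs(adj):
    """Step 2 on a stage DAG given as adjacency lists {stage: [children]}:
    a stage with more than one incoming edge is a merge point (branch
    synchronisation); a stage with more than one outgoing edge is a fork.
    The maximal chains free of both form the branches."""
    indeg = {v: 0 for v in adj}
    for outs in adj.values():
        for u in outs:
            indeg[u] += 1
    forks = {v for v, outs in adj.items() if len(outs) > 1}
    syncs = {v for v, d in indeg.items() if d > 1}
    return forks, syncs
```

For a DAG where stage 1 fans out to stages 2 and 3, which both feed stage 4, stage 1 is a fork and stage 4 a branch synchronization; the chains 1, 2→4-side, and 3→4-side are the branches.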
Step 3: traverse the DAG task graph of every job at the job end, find the suspended branches in each DAG task graph, and add them to the suspended branch set B;
Step 4: execute the branch scheduling algorithm on the branches in the suspended branch set B to obtain a branch scheduling sequence P;
The branch scheduling algorithm used in this embodiment, shown in Fig. 4, is:
Step 4.1: divide the branches in the suspended branch set B into scheduler objects; if several parallel branches in the suspended branch set have a total computing resource demand below the resource capacity limit, pack them into a branch combination BC and treat the combination as one scheduler object; each branch that is not packed is a scheduler object on its own;
Step 4.2: construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3: calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.3.1: calculate the time length T(o) of the scheduler object o;
1) When the scheduler object is a branch, the time length T(o) of the scheduler object o is

T(o) = Σ_{s∈S} (Σ_{t∈T_s} D_{t,w}) / |W|   (1)

where S is the set of stages in the branch, T_s is the set of computing tasks of stage s, and W is the set of worker threads; that is, the total task time of each stage is spread over the |W| worker threads, and the stages of a branch execute serially. D_{t,w} is the time thread w takes to execute computing task t, predicted by the model of formula (2):

D_{t,w}(i) = a_1 D_{t,w}(i−1) + a_2 D_{t,w}(i−2) + ... + a_n D_{t,w}(i−n)   (2)

Formula (2) states that the time thread w needs to execute the current task t is predicted from the collected historical execution data of all tasks of the computation stage containing t; the history contains n records, and i is the iteration number of the job. Fitting the model of formula (2) to the history yields estimates of the parameters a_1, a_2, ..., a_n.
When the scheduler object is a branch combination {a, b}, the time length of the combination {a, b} is max{T(a), T(b)};
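The autoregressive model of formula (2) can be fitted to a task's execution-time history, for example by least squares. The patent only says the model is fitted to the history; using `np.linalg.lstsq` here is our choice, and the function names are hypothetical.

```python
import numpy as np

def fit_ar(history, n):
    """Least-squares fit of formula (2),
    D(i) = a1*D(i-1) + ... + an*D(i-n),
    to a task's execution-time history from earlier iterations."""
    # Each row holds the n most recent observations before iteration i,
    # newest first, so column k corresponds to coefficient a_{k+1}.
    X = np.array([history[i - n:i][::-1] for i in range(n, len(history))])
    y = np.array(history[n:])
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def predict_next(history, a):
    """Predict the next D_{t,w} from the n most recent observations."""
    n = len(a)
    return float(np.dot(a, history[-1:-n - 1:-1]))
```

On a linearly growing history the order-2 model recovers the exact recurrence D(i) = 2 D(i−1) − D(i−2) and extrapolates the next value.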
Step 4.3.2: calculate the urgency U(o) of the scheduler object;
When the scheduler object is a branch, calculate by formula (1) the time lengths of all branches in the branch synchronization containing it; the urgency U(o) of the branch is then the difference between the longest branch time length in that synchronization and the branch's own time length. The shorter this time difference, the more urgent the branch.
When the scheduler object is a branch combination {a, b}, the urgency of the combination {a, b} is min{U(a), U(b)}.
Step 4.4: sort the scheduler objects in the scheduler object set E by urgency to obtain the sequence p, then let P = P ∪ p and output the scheduling sequence P.
Executing all branches of one job concurrently would occupy excessive resources and hinder the execution of other jobs. Consider a branch synchronization composed of several branches: its completion time is determined by the longest branch. Since the branches differ in time length, the shorter branches can be moderately delayed; as long as they finish before the longest branch does, the branch synchronization time is not prolonged, and delaying the shorter branches reserves computing resources for the long ones. However, if a short branch is delayed too long, the branch synchronization completion time also increases. The scheduling principle of the present invention is therefore to postpone short branches without letting any of them become the last branch to finish in its synchronization. Based on this heuristic, the invention proposes the branch urgency. As shown in Fig. 3, for branch synchronization 1, branch 3 is the longest branch in the synchronization, and its time serves as the time limit of branch synchronization 1. The urgency of a branch is defined as the maximum time the branch can tolerate being delayed under the time limit of its branch synchronization; for example, the urgency of branch 4 is the time limit of branch synchronization 1 minus the time of branch 4, i.e., the maximum time branch 4 can be delayed. This embodiment sorts the branches of all jobs at the job end by urgency and schedules the more urgent branches first, lest the branch synchronization time be delayed; this reduces the delay caused by incorrect scheduling.
However, this urgency-sorted scheduling may still prolong the time at which a branch synchronization completes. To address this problem, the invention proposes a scheduling algorithm that minimizes the branch synchronization time, improving the branch scheduling method as follows:
Further, the branch scheduling algorithm of step 4 may instead comprise the following steps:
Step 4.1': divide the branches in the suspended branch set B into scheduler objects; if several parallel branches in the suspended branch set have a total computing resource demand below the resource capacity limit, pack them into a branch combination BC and treat the combination as one scheduler object; each branch that is not packed is a scheduler object on its own;
Step 4.2': construct the total scheduler object set E = B ∪ BC, and construct an empty scheduling sequence P;
Step 4.3': calculate the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4': select the most urgent scheduler object e in the scheduler object set E;
Step 4.5': for every other object o_j in the scheduler object set E besides e, denote the sequence e → o_j as a temporary scheduling sequence p_j, where 1 ≤ j ≤ J−1 and J is the total number of scheduler objects in the set;
Step 4.6': calculate the exceed time ET_j of each temporary scheduling sequence p_j;
The exceed time ET_j is calculated as follows: for the scheduling sequence e → o_j, the exceed time is

ET_j = max{0, T(e) − U(o_j)}

where T(e) is the time length of scheduler object e and U(o_j) is the urgency of scheduler object o_j.
The concept of the exceed time is introduced because, as described above, delaying some branches does not increase the branch synchronization completion time. To avoid branch synchronizations being delayed, these branches must be identified, and the exceed time is used to judge the consequence of delaying a branch rationally.
The exceed time is defined as follows: let a and b be branches of jobs J1 and J2; since computing resources are limited, a and b can only execute serially. If branch b is delayed and branch a executes first, the exceed time is

ET = max{0, T(a) − U(b)}

where T(a) is the time of branch a and U(b) is the urgency of branch b.
If T(a) ≤ U(b), the exceed time is zero, meaning that executing a before b does not delay job J2. Otherwise, job J2 is delayed because branch b is delayed, and the completion time of J2 increases. In practice both orders may have a non-zero exceed time: delaying a increases the completion time of J1, while delaying b increases that of J2. A trade-off is therefore needed, selecting the scheduling order with the shortest exceed time.
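This trade-off between the two serial orders can be sketched directly; the function name is hypothetical, and `T` and `U` are assumed to be precomputed time lengths and urgencies.

```python
def choose_order(T, U, a, b):
    """Compare the two serial orders of branches a (job J1) and b (job J2)
    and keep the one with the smaller exceed time, i.e. the smaller forced
    delay on the job whose branch is scheduled second."""
    et_ab = max(0.0, T[a] - U[b])   # run a first: J2 delayed by this much
    et_ba = max(0.0, T[b] - U[a])   # run b first: J1 delayed by this much
    return (a, b) if et_ab <= et_ba else (b, a)
```

If T(a) = 3 and U(b) = 4, running a first costs nothing (ET = 0), whereas running b first with T(b) = 5 and U(a) = 1 would delay J1 by 4; the order a → b is chosen.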
Step 4.7': append the temporary scheduling sequence p_j corresponding to min(ET_j) to the scheduling sequence P, i.e., P = P ∪ p_j;
Step 4.8': remove the branches involved in e and o_j from the scheduler object set E;
Step 4.9': repeat steps 4.4' to 4.8' until every branch or branch combination in the scheduler object set E has been added to the scheduling sequence P, and output P.
Step 5: when computing resources are available, allocate execution units and execute branch tasks according to the branch scheduling sequence P; when a suspended branch finishes executing, the computing resources it occupied are released;
Step 6: repeat steps 3 to 5 until every branch of every DAG graph at the job end has been executed.
The present invention treats an independent path segment of the DAG as the smallest schedulable object, called a branch: a branch consists of consecutive computation stages and the network communication between them. For example, branch 1 of Fig. 1 consists of Stage1, coflow1, Stage2, coflow2, and Stage3. The computation stages and the associated network communication within a branch must execute serially, so they can be regarded as one schedulable object; this avoids the conflict between DAG scheduling and coflow scheduling that arises when computation stages and network communication are scheduled separately. In addition, the branch concept takes the requirement of data locality into account. In Fig. 1, an upstream computation stage outputs its results, which serve as the input data of the downstream stage. After Stage1 completes, its output is stored on the machines where the computing tasks of Stage1 executed; this temporarily saved data serves as the input of Stage2. When Stage2 starts, its computing tasks pull the intermediate data from the machines of Stage1, generating the network communication coflow1. If the computing tasks of Stage2 are also placed on the same machines, data locality is largely satisfied. The present invention therefore uses branches as schedulable objects for resource allocation: Stage1 and Stage2 can be scheduled onto the same group of machines, reducing network overhead and shortening the job completion time.
Furthermore, the parallel paths of a DAG intersect and form synchronizations. The present invention calls the convergence of multiple branches a branch synchronization; only after the parallel branches have synchronized can the subsequent computation stages start. In a data parallel job, job completion depends on the slowest of the parallel branches. In Fig. 1, branch 1 and branch 2 intersect at the last computing task, forming a branch synchronization; the job can only finish after this synchronization completes. Branch synchronizations therefore seriously affect the job completion time (JCT). Executing all branches of a synchronization simultaneously occupies a large amount of computing resources and hinders the execution of other branch synchronizations. In fact, the job completion time is governed by the slowest branch: by predicting the completion time of each branch, the invention computes the delay each branch can tolerate and thus determines the priority, or urgency, of each branch; the urgent branches of different jobs are executed in parallel first, and the less urgent branches are then executed serially. Postponing short branches frees more computing resources in advance to complete the more urgent jobs. To verify the effect of the invention, tests were run on a laboratory Spark cluster; the results show that the branch scheduling method outperforms Spark FIFO, shortest-job-first scheduling, and critical path scheduling, reducing the average job completion time (JCT) by 10-15%. In addition, large-scale simulation tests were carried out. The simulation experiments use the Google cluster data (https://github.com/google/cluster-data) and compare against CARBYNE, the state-of-the-art DAG scheduling method described in document 2. The experimental results also show that the BS (branch scheduling) method achieves a faster JCT than CARBYNE.
An experimental environment is built below to verify the effect of the invention.
As shown in Fig. 2, in the system architecture of the branch scheduling method, the master node is responsible for receiving job submissions and allocating computing resources. A user submits jobs to the Spark cluster through a driver interface. The driver manages the progress of the jobs submitted by the user, including decomposing a data parallel job into multiple parallel computing tasks and generating a DAG. During the execution of a data parallel job, the driver communicates with the master node to track changes in cluster resources (for example, to determine the servers with idle computing resources on which computing tasks can be launched). The nodes that actually execute the computing tasks are called worker nodes. The master node also manages the whole cluster, maintains the state of each server, and indexes the computation results of each server so that subsequent computing tasks can locate their input data. We call the process of assigning computing tasks scheduling. After the total computing resources required by the driver interface are determined, the master node decides, according to the state of each server it tracks (such as whether a server has been allocated to other computing jobs and whether it still has assignable computing resources), the total computing resources allocated to the job handled by the driver. After receiving the feedback of the master node, the driver interface must also communicate with each allocated worker node and establish communication channels for distributing computing tasks later. In the execution stage of a data parallel job, a worker node provides a group of executor units: Java virtual machines that wrap a number of CPU cores and some memory. The driver node learns of each executor from its communication with the worker nodes. The driver interface then dispatches the parallel tasks of the decomposed data parallel job to these executors one by one. After an executor finishes these computing operations, it stores the results locally or feeds them back to the driver.
As shown in Fig. 2, after a job submitted to the driver is decomposed into a DAG, a branch-structured DAG is derived for branch scheduling. We also add a module for communication between the driver interface and the executors. After an executor finishes a computing task, this module feeds the execution information of the task, such as computation time, network communication time, and disk read/write time, back to the driver node. From the completion-time information of each task of the job, the driver node estimates the length of each branch and computes the branch's urgency. After the driver interface determines the urgency of each branch, the urgencies are fed to the master node, which schedules the branches from different jobs according to their urgency.
The following verifies that the branch scheduling method can schedule multiple concurrent jobs and reduce the job completion time (JCT). This embodiment runs extensive experiments on a real Spark cluster to evaluate the performance of the branch scheduling method. The experiments compare Spark's first-in-first-out (FIFO) scheduling, shortest-job-first (SJF) scheduling, and the critical-path (CP) method. Note that the CP method is an intra-job scheduling method that preferentially schedules the computing tasks on the critical path of a single DAG job. We measure the average JCT when multiple jobs run simultaneously; these jobs are submitted to the cluster concurrently. In addition, we evaluate the prediction accuracy, the system overhead, and the performance gain of the branch scheduling method in an online job-submission scenario.
The experiments in this embodiment use 30 servers distributed across 6 racks. Each server is equipped with an Intel Xeon E5-2650 2.2 GHz 12-core processor. The experiments use a mix of three kinds of Spark jobs, 50% PageRank, 30% logistic regression, and 20% machine learning jobs, to ensure diversity in branch time lengths. PageRank is a typical iterative graph-computation application, logistic regression is a common regression-analysis application, and the machine learning application uses an iterative learning process. In the experiments the number of jobs is varied from 10 to 50, and the input data size of a job from 10 GB to 50 GB.
Testing the accuracy of online branch completion-time prediction: to compute the urgency of a branch, the branch time is predicted by formula 14, so prediction accuracy is an important evaluation index. After a job is submitted, the actual branch execution time is measured first and compared with the predicted branch time to compute the prediction error.
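The prediction-error metric and an autoregressive task-time predictor in the style of formula 2 can be sketched as follows. The least-squares fitting procedure and the relative-error definition are our assumptions, since the error formula is not reproduced in the text; the toy duration series is purely illustrative.

```python
import numpy as np

def fit_ar_model(history, n):
    """Least-squares fit of an order-n autoregressive model in the style of
    formula 2: D(i) = a1*D(i-1) + ... + an*D(i-n), from past task durations."""
    X = np.array([history[i - n:i][::-1] for i in range(n, len(history))])
    y = np.array(history[n:])
    a, *_ = np.linalg.lstsq(X, y, rcond=None)  # coefficient estimates a1..an
    return a

def predict_next(history, a):
    """Predict the next task duration from the last n observed durations."""
    n = len(a)
    return float(np.dot(a, history[-1:-n - 1:-1]))

def prediction_error(t_pred, t_meas):
    # relative error between predicted (Tpred) and measured (Tmeas) times
    return abs(t_pred - t_meas) / t_meas

# toy example: durations of an iterative task shrink geometrically per round
hist = [100.0 * 0.9 ** i for i in range(12)]
a = fit_ar_model(hist, n=3)
t_pred = predict_next(hist, a)
```

For the geometric toy series the fitted model predicts the next duration essentially exactly, so `prediction_error` is near zero; on real traces the error would behave as Fig. 5 reports.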
Here Tpred and Tmeas denote the predicted and measured branch times respectively, and the prediction error is their relative difference |Tpred - Tmeas| / Tmeas. The change in branch prediction accuracy is shown in Fig. 5. The experimental results show that most branch predictions are accurate. At the beginning the prediction error increases; after the 6000th prediction it stabilizes at around 25%. As the number of prediction rounds grows, the prediction becomes more accurate, so the repeated execution of iterative jobs improves the prediction accuracy.
Evaluating the average job completion time:
After a batch of jobs is submitted simultaneously, the average JCT of this batch of DAG jobs is evaluated. The batch size is varied from 10 to 50. Fig. 6 shows that the branch scheduling method achieves a lower average JCT than the Spark FIFO method, a reduction of 10-15%. When the number of jobs increases to 40 and 50, the reduction in average JCT stabilizes at around 15%. The SJF and CP methods shorten the average job completion time by only about 5% relative to Spark FIFO, and this hardly changes as the number of jobs grows, whereas the JCT reduction of the branch scheduling method grows with the number of jobs. In general, the more jobs are submitted simultaneously, the larger the average JCT becomes. The branch scheduling method, however, packs branches more effectively; in contrast, the Spark FIFO, SJF, and CP methods cannot perceive the branch synchronizations in the DAG, which leads to longer exceed times as the number of jobs increases.
Evaluating the number of completed jobs:
Fig. 7 shows the performance at different moments after 30 jobs are submitted simultaneously. In the first 500 seconds after submission the three methods perform very similarly; then the completion rate of the Spark FIFO method drops significantly. The branch scheduling method completes almost half of the jobs by around the 1000th second, and two thirds of the jobs within the first 1500 seconds. This shows that the branch scheduling method completes the same batch of jobs faster than the other two methods.
Overhead experiments: the main overhead is caused by branch prediction and branch scheduling. The prediction overhead involves collecting task completion-time information and predicting branch time lengths. Fig. 8a shows that the prediction overhead grows gradually as the number of branches increases; compared with executing the branch scheduling algorithm, the prediction overhead is relatively small. Fig. 8b shows the complexity of the branch scheduling algorithm as the number of suspended branches increases from 20 to 70. The running time of the algorithm grows with the number of jobs: when more jobs are submitted there are more branch combinations, which increases the overhead of running the branch scheduling algorithm and makes its complexity grow superlinearly. Compared with the JCT, however, the overhead is acceptable.
This embodiment simulates 5000 typical DAG jobs from the Google cluster data, using SJF and CP as comparison methods together with CARBYNE, which altruistically donates leftover computing resources to the shortest jobs. All simulations are completed on a computer equipped with an Intel(R) Core(TM) i7-4700MQ CPU at 2.40 GHz and 32 GB of RAM.
Experimental data set: the Google cluster data records one month of job completion logs. These logs contain detailed task completion information, resource requirements, machine states, constraints, and so on. However, they provide no useful DAG information: most tasks have only simple dependencies, such as MapReduce. To reflect the dependencies of realistic jobs, this embodiment simulates real job dependency structures. Most DAGs contain fewer than 10 branches and 5 branch synchronizations (50th percentile); a small fraction of DAGs contain 34 branches and 13 branch synchronizations (95th percentile).
Simulation of the job completion time (JCT): the experiments evaluate the average JCT of the branch scheduling method, CARBYNE, SJF, and CP. 5000 jobs are used, all submitted within 600 seconds and sharing 500 machines. First, the cumulative distribution function (CDF) of completed jobs is measured. As shown in Fig. 9a, at any given moment the branch scheduling method completes more jobs. We further evaluate the JCT reduction in detail: Fig. 9b shows the ratio of job completion times between the branch scheduling method and the CARBYNE method. The branch scheduling method achieves a lower JCT; for more than 15% of the jobs, the JCT reduction reaches 20%. For most jobs, the branch scheduling method achieves a larger JCT reduction than CARBYNE.
Influence of the number of machines: resource capacity is an important performance factor. This embodiment varies the number of simulated machines from 500 to 1000. Fig. 10a shows the average JCT of the 5000 jobs executed on different numbers of machines. The average JCT decreases as the number of machines increases, and the branch scheduling method achieves the smallest average JCT. When the number of machines reaches 1000, however, the performance of the CARBYNE method gradually approaches that of the branch scheduling method.
Influence of the number of jobs: Fig. 10b shows that the average JCT increases with the number of jobs. In this experiment the number of jobs is varied from 2000 to 5000. The branch scheduling method achieves a smaller average JCT than the other methods. Therefore, when more jobs compete for the cluster resources, the branch scheduling method achieves more economical scheduling, and its JCT reduction grows with the number of jobs.
Influence of the job submission time: when jobs are submitted at different rates, the scheduler faces different backlogs and therefore achieves different performance. We vary the interval between job submissions from 300 seconds to 1500 seconds. Fig. 10c shows that the branch scheduling method has the smallest average JCT; the average JCT of all three methods rises as the interval time increases.
The above simulation experiments demonstrate from different angles that, under limited cluster resources, when many jobs are submitted within a relatively short period, the branch scheduling method achieves higher performance.
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments: all technical solutions under the concept of the present invention belong to its protection scope. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications made without departing from the principles of the present invention shall also be regarded as falling within the protection scope of the present invention.
Claims (5)
1. A data parallel job scheduling method based on branch DAG dependency, characterized by comprising the following steps:
Step 1: the job end receives jobs;
Step 2: traverse the DAG task graph of each job and find the convergence points and bifurcation points of the DAG; a convergence point in the DAG task graph is called a branch synchronization, a chain-shaped part of the DAG task graph without convergence or bifurcation is called a branch, and a branch that depends on no other branch, or whose depended-on branches have all finished executing, is called a suspended branch;
Step 3: traverse the DAG task graph of every job at the job end, find the suspended branches in each DAG task graph, and add the suspended branches found to the suspended branch set B;
Step 4: execute the branch scheduling algorithm on the branches in the suspended branch set B to obtain a branch scheduling sequence P;
Step 5: when computing resources are available, allocate execution units and execute branch tasks according to the branch scheduling sequence P;
Step 6: repeat steps 3 to 5 until all the branches in every DAG graph at the job end have been executed.
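The scheduling loop of steps 1 to 6 can be sketched as follows. The in-memory job and branch records and the `execute` callback are illustrative assumptions, and `schedule_branches` is only a placeholder for the branch scheduling algorithm of step 4 (detailed in claims 2 and 3).

```python
def suspended_branches(dag):
    """Step 2/3: a branch is suspended when every branch it depends on
    has finished executing (a branch with no dependencies qualifies)."""
    return [b for b in dag["branches"]
            if not b["done"] and all(d["done"] for d in b["deps"])]

def schedule_branches(branch_set):
    # placeholder for step 4: most urgent (smallest slack) branches first
    return sorted(branch_set, key=lambda b: b["urgency"])

def run_jobs(jobs, execute):
    while True:
        B = [b for dag in jobs for b in suspended_branches(dag)]  # step 3
        if not B:
            break  # step 6: stop when every branch of every DAG is done
        for b in schedule_branches(B):  # step 4
            execute(b)                  # step 5: run the branch's tasks
            b["done"] = True

# toy run: one DAG in which branch b2 depends on branch b1
b1 = {"done": False, "deps": [], "urgency": 0.0}
b2 = {"done": False, "deps": [b1], "urgency": 1.0}
order = []
run_jobs([{"branches": [b1, b2]}], lambda b: order.append(b["urgency"]))
```

In the toy run, b1 is scheduled first because b2 only becomes suspended once b1 completes, mirroring the repeat of steps 3 to 5.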
2. The data parallel job scheduling method based on branch DAG dependency according to claim 1, characterized in that the branch scheduling algorithm in step 4 is:
Step 4.1: divide the branches in the suspended branch set B into multiple scheduler objects; if the total computing resource demand of multiple parallel branches in the suspended branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the branch combination as one scheduler object, while each branch that is not packed is a scheduler object by itself;
Step 4.2: construct the total scheduler object set E = B ∪ BC and construct an empty scheduling sequence P;
Step 4.3: compute the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4: sort the scheduler objects in the scheduler object set E by urgency to obtain a sorted sequence p, then set P = P ∪ p and output the scheduling sequence P.
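A minimal sketch of this variant, assuming illustrative branch records with a resource demand `res`, an urgency `U` (a smaller value meaning a more urgent branch, per claim 4), and a `parallel` flag marking branches that may be packed together:

```python
def pack_and_sort(branches, capacity):
    """Step 4.1: pack parallel suspended branches into one combination BC
    when their total resource demand fits the capacity limit, then
    steps 4.3-4.4: sort all scheduler objects by urgency."""
    parallel = [b for b in branches if b.get("parallel")]
    singles = [b for b in branches if not b.get("parallel")]
    objects = []
    if parallel and sum(b["res"] for b in parallel) <= capacity:
        # the combination is scheduled as a single object; per claim 4
        # its urgency is that of its most urgent member
        objects.append({"branches": parallel,
                        "U": min(b["U"] for b in parallel)})
        parallel = []
    for b in singles + parallel:  # unpacked branches stand alone
        objects.append({"branches": [b], "U": b["U"]})
    return sorted(objects, key=lambda o: o["U"])  # most urgent first

# toy input: two packable parallel branches and one standalone branch
branches = [{"parallel": True, "res": 2, "U": 3.0},
            {"parallel": True, "res": 3, "U": 1.0},
            {"parallel": False, "res": 4, "U": 2.0}]
order = [o["U"] for o in pack_and_sort(branches, capacity=6)]
```

With capacity 6 the two parallel branches (total demand 5) are packed into one combination of urgency 1.0, which is scheduled before the standalone branch of urgency 2.0.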
3. The data parallel job scheduling method based on branch DAG dependency according to claim 1, characterized in that the branch scheduling algorithm in step 4 is:
Step 4.1': divide the branches in the suspended branch set B into multiple scheduler objects; if the total computing resource demand of multiple parallel branches in the suspended branch set is less than the resource capacity limit, pack these parallel branches into a branch combination BC and treat the branch combination as one scheduler object, while each branch that is not packed is a scheduler object by itself;
Step 4.2': construct the total scheduler object set E = B ∪ BC and construct an empty scheduling sequence P;
Step 4.3': compute the urgency U(o) of each scheduler object o in the scheduler object set E;
Step 4.4': in the scheduler object set E, select the scheduler object e with the strongest urgency;
Step 4.5': for every other object oj in the scheduler object set E besides the scheduler object e, denote e → oj as a temporary scheduling sequence pj, 1 ≤ j ≤ J−1, where J is the total number of scheduler objects in the scheduler object set;
Step 4.6': compute the exceed time ETj of each temporary scheduling sequence pj;
Step 4.7': append the temporary scheduling sequence pj corresponding to min(ETj) to the scheduling sequence P, P = P ∪ pj;
Step 4.8': remove the branches involved in e and oj from the scheduler object set E;
Step 4.9': repeat steps 4.4' to 4.8' until every branch or branch combination in the scheduler object set E has been added to the scheduling sequence P; output the scheduling sequence P.
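This greedy variant can be sketched as follows. The exceed-time formula of claim 5 is not reproduced in this text, so we assume ETj = max(T(e) − U(oj), 0), i.e. oj overruns its tolerable delay by however much e's time span exceeds it; the object records with fields `T` (time span) and `U` (urgency, smaller meaning more urgent) are illustrative.

```python
def exceed_time(e, o):
    """Assumed ET: how far running e first pushes o past its tolerable delay."""
    return max(e["T"] - o["U"], 0.0)

def greedy_schedule(objects):
    E = list(objects)
    P = []
    while len(E) >= 2:
        e = min(E, key=lambda o: o["U"])  # step 4.4': strongest urgency
        rest = [o for o in E if o is not e]
        # steps 4.5'-4.7': pick the follower oj minimizing the exceed time
        o_best = min(rest, key=lambda o: exceed_time(e, o))
        P += [e, o_best]
        E.remove(e)        # step 4.8': drop the scheduled objects
        E.remove(o_best)
    P += E                 # a leftover single object goes last
    return P               # step 4.9': all objects are now in P

objs = [{"name": "a", "T": 5.0, "U": 1.0},
        {"name": "b", "T": 3.0, "U": 6.0},
        {"name": "c", "T": 2.0, "U": 4.0}]
seq = [o["name"] for o in greedy_schedule(objs)]
```

Here "a" is most urgent, and "b" follows it because b's larger slack absorbs a's 5-unit time span without any exceed time, while "c" would overrun by 1.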
4. The data parallel job scheduling method based on branch DAG dependency according to claim 2 or 3, characterized in that the urgency U(o) of a scheduler object is computed as follows:
Step 4.3.1: compute the time span T(o) of the scheduler object o;
1) when the scheduler object is a branch, its time span T(o) is given by formula 1, where S is the set of stages in the branch, Ts is the set of computing tasks of stage s, W is the set of worker threads, and Dt,w is the time for thread w to execute computing task t; Dt,w is predicted by the model of formula 2:
Dt,w(i) = a1·Dt,w(i−1) + a2·Dt,w(i−2) + ... + an·Dt,w(i−n) (2)
Formula 2 states that the time for thread w to execute the current task t is predicted from the historical execution data of all tasks in the stage where task t resides; the history contains n entries, and i denotes the iteration number of the job. Fitting the model of formula 2 yields the estimated values of the parameters a1, a2, ..., an;
2) when the scheduler object is a branch combination {a, b}, the time span of the branch combination {a, b} is max{T(a), T(b)};
Step 4.3.2: compute the urgency U(o) of the scheduler object;
1) when the scheduler object is a branch, compute by formula 1 the time spans of all branches in the branch synchronization to which this branch belongs; the urgency U(o) of the branch is then the difference between the longest branch time span in that branch synchronization and this branch's own time span; the shorter this time difference, the stronger the urgency;
2) when the scheduler object is a branch combination {a, b}, the urgency of the branch combination {a, b} is min{U(a), U(b)}.
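The urgency computation of this claim can be sketched as follows. Formula 1 for a branch's time span is not reproduced in this text, so the sketch assumes each stage contributes its tasks' predicted durations balanced evenly over the worker threads; the stage/task representation is illustrative.

```python
def branch_time(branch, n_threads):
    """Assumed stand-in for formula 1: per stage, total predicted task
    time divided across the worker threads, summed over the stages."""
    return sum(sum(stage) / n_threads for stage in branch["stages"])

def urgency(branch, sync_branches, n_threads):
    """U(o) = longest branch time in the branch synchronization minus this
    branch's own time span; a smaller slack means a more urgent branch."""
    longest = max(branch_time(b, n_threads) for b in sync_branches)
    return longest - branch_time(branch, n_threads)

def combo_time(a, b, n_threads):
    # a combination runs its branches in parallel, so the slower one governs
    return max(branch_time(a, n_threads), branch_time(b, n_threads))

def combo_urgency(ua, ub):
    return min(ua, ub)  # a combination is as urgent as its most urgent member

# toy synchronization of two branches on 2 worker threads
a = {"stages": [[4.0, 4.0], [2.0, 2.0]]}  # two stages of two tasks each
b = {"stages": [[8.0, 8.0]]}              # one stage of two tasks
ua = urgency(a, [a, b], n_threads=2)
ub = urgency(b, [a, b], n_threads=2)
```

Branch b is the longest branch of the synchronization, so its slack is 0 (most urgent), while a can tolerate a delay of 2 time units.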
5. The data parallel job scheduling method based on branch DAG dependency according to claim 4, characterized in that the exceed time ETj in step 4.6' is computed as follows: for the scheduling sequence e → oj, the exceed time is determined from T(e) and U(oj), where T(e) and U(oj) respectively denote the time span of the scheduler object e and the urgency of the scheduler object oj.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910514403.0A CN110275765B (en) | 2019-06-14 | 2019-06-14 | Data parallel job scheduling method based on branch DAG dependency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110275765A true CN110275765A (en) | 2019-09-24 |
CN110275765B CN110275765B (en) | 2021-02-26 |
Family
ID=67960808
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910514403.0A Active CN110275765B (en) | 2019-06-14 | 2019-06-14 | Data parallel job scheduling method based on branch DAG dependency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110275765B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110688993A (en) * | 2019-12-10 | 2020-01-14 | 中国人民解放军国防科技大学 | Spark operation-based computing resource determination method and device |
CN110730470A (en) * | 2019-10-24 | 2020-01-24 | 北京大学 | Mobile communication equipment integrating multiple access technologies |
CN111857984A (en) * | 2020-06-01 | 2020-10-30 | 北京文思海辉金信软件有限公司 | Job calling processing method and device in bank system and computer equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868019A (en) * | 2016-02-01 | 2016-08-17 | 中国科学院大学 | Automatic optimization method for performance of Spark platform |
US20190065336A1 (en) * | 2017-08-24 | 2019-02-28 | Tata Consultancy Services Limited | System and method for predicting application performance for large data size on big data cluster |
Non-Patent Citations (5)
Title |
---|
DONGSHENG LI et al.: "ReB: Balancing Resource Allocation for Iterative Data-Parallel Jobs", in Proceedings of ACM Conference (Conference'17) *
MASTERT-J: "Spark Explained (5): Spark Job Execution Principles", https://blog.csdn.net/qq_21125183/article/details/87875902 *
WEI WANG et al.: "Coflex: Navigating the Fairness-Efficiency Tradeoff for Coflow Scheduling", IEEE INFOCOM 2017 - IEEE Conference on Computer Communications *
TIAN Guozhong et al.: "A Cost Optimization Method for Scheduling Multiple DAG Tasks Sharing Heterogeneous Resources", Acta Electronica Sinica *
HU Zhiyao et al.: "Frontier Advances in Data Center Network Flow Scheduling", Journal of Computer Research and Development *
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||