CN103970613A - Multi-copy task fault tolerance scheduling method of heterogeneous distributed system - Google Patents

Publication number
CN103970613A
Authority
CN
China
Prior art keywords
task
node
reliability
copy
scheduling
Prior art date
Legal status
Granted
Application number
CN201410216137.0A
Other languages
Chinese (zh)
Other versions
CN103970613B (en)
Inventor
门朝光
何忠政
李香
蒋庆丰
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN201410216137.0A
Publication of CN103970613A
Application granted
Publication of CN103970613B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Computer And Data Communications (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention belongs to the field of computing and relates in particular to a multi-copy task fault-tolerant scheduling method for heterogeneous distributed systems. The method comprises: computing, from the load of each task and the execution speed of each node in the system, the average execution time of every task over all processor nodes and the average communication time of every message over all links; computing the bottom-level priority of each task in the task set; adding the ready tasks to a scheduling queue in non-increasing order of priority; and selecting the highest-priority task among the ready tasks in the scheduling queue. The method can further reduce the start time of the task copy currently being scheduled and can therefore further reduce the makespan of the schedule.

Description

A multi-copy task fault-tolerant scheduling method for heterogeneous distributed systems
Technical field
The invention belongs to the field of computing and relates in particular to a multi-copy task fault-tolerant scheduling method for heterogeneous distributed systems.
Background
With the advent of high-speed networks, it has become feasible to connect distributed, low-cost and quite possibly heterogeneous resources into a single computing environment. The heterogeneity of computers in distributed systems (such as cloud computing, grid computing and distributed mobile computing) will therefore keep increasing, giving rise to a computing platform known as a heterogeneous distributed computing (HDC) system. HDC systems have become popular platforms for high-performance computing and information processing and are increasingly used in critical systems. They often offer high throughput and availability and give efficient access to widely distributed network information. An HDC system is, however, more complex than a homogeneous or centralized system, and the extra complexity can cause more system failures. In an HDC system, a safety-critical application must not only tolerate faults when they occur but also meet its timing constraints.
In large-scale heterogeneous distributed computing systems, an effective task scheduling algorithm plays a key role in meeting user or system requirements and in achieving high performance. Task scheduling maps tasks to processors and sets the start time of each task so that the execution order respects the dependencies between tasks while the reliability of the schedule is maximized and its makespan is minimized. Most existing work on task scheduling addresses independent tasks and ignores the data dependencies and precedence constraints between tasks. Minimizing the makespan and minimizing the failure probability of the application usually conflict, so a scheduling algorithm must be designed to consider makespan and reliability together.
Fault-tolerant scheduling has been studied extensively. A fault-tolerant scheduling method can improve reliability through passive replication (primary/backup, PB) or active replication. The PB mechanism assigns two versions of each task to different processors; when the processor of the primary version fails, the task is executed on the processor of the backup version. The PB mechanism can tolerate only a single failure, and when a failure occurs the task takes longer to complete, so it may well miss the deadline of a real-time task. Active replication is based on spatial redundancy: several copies of a task are scheduled on different processors and executed in parallel to achieve fault tolerance.

There are two multi-copy scheduling modes: strict scheduling and general scheduling. Under strict scheduling, a task may start only after all copies of all its direct predecessors have finished and their dependency messages have arrived at the node to which the task is mapped. Most copy-based fault-tolerant scheduling methods use this mode. Under general scheduling, a task may start as soon as, for each direct predecessor, at least one copy has finished and its message has been delivered to the node to which the task is mapped. Strict scheduling is clearly a special case of general scheduling. Under strict scheduling, the start time of the copy being scheduled must be the maximum, over all predecessor copies, of the copy's finish time plus its communication time. Under general scheduling, the start time may be the maximum over only a subset of the predecessor copies. Not every inter-task message has to be sent; only the dependency messages between the required task copies are needed. General scheduling can therefore further reduce the start time of the copy being scheduled, so its makespan is likely to be shorter than that of strict scheduling.

The two modes also differ in how reliability is computed. Under strict scheduling, the reliability of a task copy must take into account all copies of all predecessors of the task. Under general scheduling, it needs to take into account only those predecessor copies whose finish time plus message communication time does not exceed the start time of the copy being scheduled.
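The difference between the two modes can be illustrated with a small sketch. The function name `start_bounds` and the sample arrival times are assumptions made for illustration only; an arrival time here stands for a predecessor copy's finish time plus its communication time.

```python
def start_bounds(arrivals):
    """Return (general_earliest, strict) start times for one task copy.

    arrivals maps each direct predecessor to the arrival times of the
    messages sent by its copies (finish time plus communication time).
    """
    # General scheduling: wait only for the first copy of each predecessor.
    general_earliest = max(min(times) for times in arrivals.values())
    # Strict scheduling: wait for every copy of every predecessor.
    strict = max(max(times) for times in arrivals.values())
    return general_earliest, strict


# Hypothetical example: two predecessors with two copies each.
arrivals = {"v1": [4.0, 9.0], "v2": [6.0, 7.5]}
print(start_bounds(arrivals))  # (6.0, 9.0)
```

In this example the copy could start at time 6 under general scheduling but only at time 9 under strict scheduling, which is the effect on the makespan that the paragraph above describes.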
Most active-replication fault-tolerant scheduling mechanisms blindly replicate each task a fixed number of times in order to tolerate a specified number of processor failures. The paper "An algorithm for automatically obtaining distributed and fault-tolerant static schedules", presented by A. Girault et al. in 2003 at the conference "Dependable Systems and Networks", and the paper "Fault tolerant scheduling of precedence task graphs on heterogeneous platforms", presented by Anne Benoit et al. in 2008 at the conference "Parallel and Distributed Processing", proposed the FTBAR and FTSA algorithms respectively. To tolerate ε processor failures, these algorithms schedule each task on the ε+1 processors with, respectively, the lowest schedule pressure and the earliest finish time. Neither algorithm provides a reliability analysis of the schedule.

The paper "Reliable workflow scheduling with less resource redundancy", published by Zhao Laiping et al. in 2013 in the journal "Parallel Computing", proposed a fault-tolerant scheduling algorithm based on active replication that minimizes resource overhead. The algorithm selects the most reliable nodes to execute task copies and does not consider the finish time of the copies. A task may therefore be scheduled on a node that is highly reliable but slow, which is bad for load balancing. The method also relies on strict scheduling, so its makespan is longer than under general scheduling.

The paper "Efficient task replication and management for adaptive fault tolerance in Mobile Grid environments", published by Antonios Litke et al. in 2007 in the journal "Future Generation Computer Systems", proposed a fault-tolerant scheduling mechanism for mobile grid computing environments, but it targets independent tasks and does not consider the makespan. The paper "Reliability versus performance for critical applications", published by Alain Girault et al. in 2009 in the "Journal of Parallel and Distributed Computing", proposed a two-stage algorithm that optimizes reliability and schedule length together. That algorithm uses strict scheduling, so its makespan is long; it does not consider link failures, and it chooses the mapping nodes of task copies at random. The paper "A Resource Minimizing Scheduling Algorithm with Ensuring the Deadline and Reliability in Heterogeneous Systems", presented by Laiping Zhao et al. in 2011 at the conference "Advanced Information Networking and Applications", studied fault-tolerant task scheduling with active replication. Its algorithm considers both node and link failures but does not use general scheduling.
A randomized search algorithm produces new results by combining information from existing search results with some random features. The genetic algorithm (GA) searches a high-dimensional space using the ideas of natural selection and evolution. It is simple, fast and robust, and it often yields good solutions. It uses a guided, non-exhaustive search mechanism and can converge quickly to a globally near-optimal solution. The running cost of a GA is usually high, but this is acceptable for long-running tasks, and parallel GA techniques can speed up the computation.

The paper "Genetic Algorithm based Scheduling Method for Efficiency and Reliability in Mobile Grid", presented by SungHo Chin et al. in 2009 at the conference "Ubiquitous Information Technologies & Applications", uses a GA to schedule task copies in a heterogeneous mobile grid environment in order to improve task reliability, but it targets independent tasks. The bi-objective genetic algorithm (BGA) proposed in the paper "Biobjective scheduling algorithms for execution time-reliability trade-off in heterogeneous computing systems", published by Atakan Dogan et al. in 2005 in "The Computer Journal", optimizes makespan and reliability together, but during evolution BGA may produce invalid solutions that violate the dependencies between tasks. The paper "Optimizing the makespan and reliability for workflow applications with reputation and a look-ahead genetic algorithm", published by Xiaofeng Wang et al. in 2011 in the journal "Future Generation Computer Systems", uses a GA that respects task dependencies and a two-stage strategy to optimize makespan and reliability together. That algorithm does not use task copies to improve reliability, however, so its reliability gain is limited.
Copy scheduling that optimizes reliability in a heterogeneous computing system is an NP-complete problem, and no polynomial-time algorithm is known that maximizes reliability. This method therefore takes an alternative approach: the schedule only has to meet the task reliability requirement and does not have to maximize the reliability of the scheduled tasks. For both transient and permanent faults, evaluating the reliability of a general schedule has been proved to be a #P'-complete problem, which is at least as hard as an NP-complete problem. Even when a schedule for the task set is available, the reliability of the task set therefore cannot be computed in polynomial time. This method instead computes a reliability requirement for each task and, provided the task reliability requirements and the dependency constraints between tasks are met, further optimizes the makespan.

Reliability is computed differently under strict scheduling and general scheduling. The reliability of a task is closely tied to its start time, because the start time determines how many predecessor messages the task can receive. Most existing approaches do not consider where on a node the start time of a task copy is placed, and so they are deficient in optimizing the finish time. This method therefore uses a multi-copy fault-tolerant scheduling mechanism based on general scheduling and, provided the reliability requirement is met, uses a genetic algorithm to further optimize the makespan.
Summary of the invention
The object of the present invention is to provide a multi-copy task fault-tolerant scheduling method for heterogeneous distributed systems based on active replication.
The object of the present invention is achieved as follows:
(1) From the load of each task and the execution speed of each node in the system, compute the execution time $ET(v_j, p_k)$ of each task $v_j$ of the application when scheduled on each node $p_k$ of the system. The application with dependency constraints is modeled as $G = \langle V, E \rangle$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the task set, $N = |V|$ is the number of tasks and $E$ is the set of weighted directed communication edges between tasks in $V$. The system is modeled as an undirected graph $G_s = \langle P, L \rangle$, where $P = \{p_1, p_2, \ldots, p_M\}$ is the set of $M$ heterogeneous nodes, $M = |P|$, and $L$ is the set of $|L|$ communication links. $R$ is the reliability requirement of the task set;
(2) Compute the average execution time of each task over all processor nodes and the average communication time of each message over all links;
(3) Use the bottom-level priority method to compute the bottom-level priority $bl(v_j)$ of every task $v_j$ in the task set. In the formula, $succ(v_j)$ is the set of direct successors of task $v_j$, and the two remaining terms are the average execution time of task $v_j$ over all nodes in the node set $P$ and the average transmission time of message $e_{j,i}$ over all links in the link set $L$;
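The formula for step (3) is not present in this text. A sketch of the usual bottom-level definition, written with the quantities that step (3) names, is given below. The overline notation for the two averages is an assumption, and the form shown is the standard one rather than a confirmed transcription of the patent's formula.

```latex
% Assumed notation: \overline{w(v_j)} is the average execution time of v_j over all nodes in P,
% \overline{w(e_{j,i})} is the average transmission time of message e_{j,i} over all links in L.
\overline{w(v_j)} = \frac{1}{M} \sum_{k=1}^{M} ET(v_j, p_k)

bl(v_j) = \overline{w(v_j)} + \max_{v_i \in succ(v_j)} \left\{ \overline{w(e_{j,i})} + bl(v_i) \right\}

% For an exit task, succ(v_j) is empty and bl(v_j) = \overline{w(v_j)}.
```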
(4) Add the ready tasks to the scheduling queue in non-increasing order of priority;
(5) Select the highest-priority task $v_j$ among the ready tasks in the scheduling queue and compute its reliability requirement $r_x$, where $x$ is the position of the task in the priority queue:
$$r_x = \sqrt[n-x+1]{\frac{R}{\prod_{i=0}^{x-1} r'_i}}$$
where $1 \le x \le n$ follows the priority order of the tasks, $R$ is the reliability requirement of the task set, $r'_i$ is the reliability actually achieved by the task at position $i$ of the priority queue, and $r'_0 = 1$. If the highest-priority task is the entry task ($x = 1$), its reliability requirement is $r_1 = \sqrt[n]{R}$;
(6) If the reliability requirement is infeasible, i.e. the reliability requirement of task $v_j$ satisfies $r_x \ge 1$, reject the task and return. Otherwise, call the multi-copy general scheduling method to determine the nodes and start times of the copies of this task;
(7) Remove the scheduled task from the scheduling queue and add any newly ready tasks to the queue according to priority. Select the next highest-priority task in the queue and repeat steps (5) to (7) until all tasks have been scheduled.
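Steps (5) to (7) can be sketched as a loop that hands out reliability requirements along the priority order. This is a minimal illustration only: the function names and the list of achieved reliabilities are assumptions, and the actual copy scheduling of step (6) is replaced here by values supplied by the caller.

```python
def reliability_requirement(R, achieved, n):
    """r_x for the task at position x = len(achieved) + 1 of the priority order."""
    x = len(achieved) + 1
    remaining = R
    for r in achieved:          # divide by the reliabilities already achieved
        remaining /= r
    return remaining ** (1.0 / (n - x + 1))


def allocate(R, achieved_per_task):
    """Walk the priority order; return all requirements, or None if rejected.

    achieved_per_task is a hypothetical stand-in for the reliability that the
    copy scheduling of step (6) would actually reach for each task.
    """
    n = len(achieved_per_task)
    achieved, requirements = [], []
    for r_actual in achieved_per_task:
        r_x = reliability_requirement(R, achieved, n)
        if r_x >= 1.0:          # step (6): infeasible requirement, reject
            return None
        requirements.append(r_x)
        achieved.append(r_actual)
    return requirements


print(allocate(0.9, [0.98, 0.98, 0.98, 0.98]))
print(allocate(0.9, [0.5, 0.99]))  # None: the second requirement would be 1.8
```

When earlier tasks exceed their requirement, later requirements drop; when an earlier task falls short, later requirements rise and may become infeasible.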
The multi-copy general scheduling method for a task is as follows:
(6.1) Initialize the relevant information: set the number of copies of task $v_j$ to 0, set its mapped node set to empty, and set the idle node set to the node set $P$;
(6.2) If task $v_j$ is an entry task, select from the idle node queue the node with the earliest finish time to execute a copy of the task, and compute the reliability of task $v_j$:
$$P[E_{v_j}] = 1 - \prod_{p_n \in proc(v_j)} \left( 1 - \exp\{ -\lambda_{p_n} \cdot w(v_j) / w(p_n) \} \right)$$
where $proc(v_j)$ is the set of nodes to which task $v_j$ is mapped, $\lambda_{p_n}$ is the permanent failure rate of processor node $p_n$, $w(v_j)$ is the load of task $v_j$, and $w(p_n)$ is the amount of computation that node $p_n$ can perform per unit time. If the task reliability requirement is not met, keep selecting the node with the earliest finish time from the idle queue to execute another copy and recompute the task reliability until the requirement is met. If the idle node queue becomes empty and the requirement is still not met, the reliability shortfall is made up through the reliability requirement formula when the copies of subsequent tasks are scheduled;
(6.3) If task $v_j$ has predecessors, call the multi-copy general scheduling method based on a genetic algorithm to schedule its copies.
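Step (6.2) can be sketched as a greedy loop over the idle nodes. The function name, the tuple layout of `nodes` and the sample figures are assumptions for illustration. Ties between nodes are broken here by sort order rather than at random.

```python
import math


def schedule_entry_task(load, nodes, required):
    """Add copies on the earliest-finishing idle nodes until `required` is met.

    nodes maps a node name to (speed, failure_rate, ready_time); this layout
    is an illustrative assumption. Returns (chosen_nodes, task_reliability).
    """
    # Order idle nodes by finish time = ready time + load / speed.
    idle = sorted(nodes, key=lambda p: nodes[p][2] + load / nodes[p][0])
    chosen, all_fail = [], 1.0
    for p in idle:
        speed, rate, _ = nodes[p]
        # Probability that this copy fails: 1 - exp(-lambda * w(v) / w(p)).
        all_fail *= 1.0 - math.exp(-rate * load / speed)
        chosen.append(p)
        if 1.0 - all_fail >= required:
            break
    return chosen, 1.0 - all_fail


nodes = {
    "p1": (2.0, 0.001, 0.0),
    "p2": (2.5, 0.002, 0.0),
    "p3": (1.0, 0.0005, 0.0),
}
print(schedule_entry_task(load=20.0, nodes=nodes, required=0.99))
```

With these figures the first copy goes to `p2` (finish time 8) and reaches a reliability of about 0.984, so a second copy is placed on `p1`, which lifts the task reliability to about 0.9998.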
The multi-copy general scheduling method based on a genetic algorithm is as follows:
(6.3.1) Initialize the crossover probability $p_c$, the mutation probability $p_m$, the population size $GN$ and the number of generations $EN$;
(6.3.2) Generate the initial population:
For the copy of predecessor $v_i$ of the task being scheduled that is mapped to node $p_k$, compute the time $ave(v_i^k, p_n)$ at which its message arrives at node $p_n$:
$$ave(v_i^k, p_n) = \max\{ FT(v_i, p_k),\ rdy(l_{k,n}) \} + w(e_{i,j}) / w(l_{k,n})$$
where $FT(v_i, p_k)$ is the finish time of task $v_i$ on node $p_k$; $rdy(l_{k,n})$ is the time at which link $l_{k,n}$ is ready for communication, i.e. the finish time of the last message transmitted on the link; $w(e_{i,j})$ is the size of the message $e_{i,j}$ between task $v_i$ and task $v_j$; and $w(l_{k,n})$ is the amount of data that the link $l_{k,n}$ between node $p_k$ and node $p_n$ can transmit per unit time. If the two copies are mapped to the same node, i.e. $p_k = p_n$, then $rdy(l_{k,n})$ is 0 and the communication overhead is 0, so $ave(v_i^k, p_n) = FT(v_i, p_k)$.
The encoding scheme encodes, as the genes of an individual, all positions on each node between the minimum valid start-time position and the maximum valid start-time position. The minimum valid start-time position $EST(v_j, p_n)$ of task $v_j$ on processor $p_n$ is:
$$EST(v_j, p_n) = \max\left\{ \max_{v_i \in pred(v_j)} \left\{ \min_{v_i^k \in rep(v_i)} \{ ave(v_i^k, p_n) \} \right\},\ rdy(p_n) \right\}$$
where $pred(v_j)$ is the set of direct predecessors of task $v_j$; $rep(v_i)$ is the set of copies of task $v_i$; and $rdy(p_n)$ is the finish time $PFT(p_n)$ of the last task mapped to node $p_n$ in the current schedule:
$$PFT(p_k) = \max_{v_i \in V,\ p_k \in proc(v_i)} \{ FT(v_i, p_k) \}$$
where $proc(v_i)$ is the set of processors to which task $v_i$ is mapped.
The maximum valid start-time position $LST(v_j, p_n)$ of task $v_j$ on processor $p_n$ is:
$$LST(v_j, p_n) = \max\left\{ \max_{v_i \in pred(v_j)} \left\{ \min_{v_i^k \in rep(v_i)} \{ ave(v_i^k, p_n) \} \right\},\ rdy(p_n) \right\}$$
Select a processor node from the idle node queue, select a valid start-time position on that node, map a copy of the task being scheduled to it, and compute the reliability of the task copy. If the reliability of the task does not meet the requirement, keep selecting processor nodes from the idle node queue and valid start-time positions on them until it does. The resulting task copy mapping is one individual of the population. Repeat this process to generate individuals until the population size is reached. If the reliability of the task still falls short of the requirement when the number of task copies reaches $M$, the mapping is nevertheless kept as an individual of the population, because the reliability shortfall of this task can be compensated when subsequent tasks are scheduled. The task reliability is:
$$P[E_{v_j}] = 1 - \prod_{p_n \in proc(v_j)} \left( 1 - AP_{v_j}^{p_n} \right)$$
$$AP_{v_j}^{p_n} = \left( \prod_{v_i^n \in Prep_n,\ tv_i^n \le tv_j^n} e^{-\lambda_{p_n} \cdot w(v_i) / w(p_n)} \right) \times \prod_{v_l \in pred(v_j)} \left( 1 - \prod_{p_k \in proc(v_l),\ ave(v_l^k, p_n) \le ST(v_j, p_n)} \left( 1 - \prod_{et_{p,q} \in ON(l_{k,n}),\ et_{p,q} \le et_{l,j}} e^{-\lambda_{l_{k,n}} \cdot w(e_{p,q}) / w(l_{k,n})} \right) \right)$$
where $AP_{v_j}^{p_n}$ is the reliability of the copy of task $v_j$ mapped to node $p_n$; the condition $tv_i^n \le tv_j^n$ selects the task copies $v_i^n$ executed on node $p_n$ no later than the copy $v_j^n$ currently being scheduled; $Prep_n$ is the set of task copies executed on node $p_n$; $ST(v_j, p_n)$ is the start time of task $v_j$ on node $p_n$; $et_{p,q}$ is the time at which the message between tasks $v_p$ and $v_q$ starts to be transmitted; $ON(l_{k,n})$ is the set of all communications that take place on link $l_{k,n}$; $et_{p,q} \le et_{l,j}$ (with $v_p, v_q \in V$) means that message $e_{p,q}$ starts on link $l_{k,n}$ no later than message $e_{l,j}$; and $\lambda_{l_{k,n}}$ is the failure rate of the link $l_{k,n}$ between node $p_k$ and node $p_n$. If the predecessor copy and the copy being scheduled are mapped to the same node, the link communication time is 0 and the reliability of that message is 1.
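The structure of the copy reliability can be sketched as follows. The function names and the input layout are assumptions for illustration: execution and transfer durations are passed in already divided by node speed or link bandwidth, and the products of exponentials are collapsed into a single exponential of a sum.

```python
import math


def copy_reliability(node_rate, exec_times, preds, start_time):
    """Reliability of one task copy on a node (illustrative input layout).

    exec_times: durations w(v_i)/w(p_n) of the copies run on the node up to
        and including the current copy.
    preds: predecessor -> list of (arrival, link_rate, transfer_times), one
        entry per predecessor copy. transfer_times holds the durations of the
        messages sent on that link up to and including this one, and is empty
        when both copies share a node (link reliability 1).
    """
    ap = math.exp(-node_rate * sum(exec_times))       # node survives
    for copies in preds.values():
        all_lost = 1.0
        for arrival, link_rate, transfer_times in copies:
            if arrival <= start_time:                  # copy counts only if on time
                link_ok = math.exp(-link_rate * sum(transfer_times))
                all_lost *= 1.0 - link_ok
        ap *= 1.0 - all_lost                           # at least one message arrives
    return ap


def task_reliability(copy_reliabilities):
    """The task succeeds if at least one of its copies succeeds."""
    all_fail = 1.0
    for ap in copy_reliabilities:
        all_fail *= 1.0 - ap
    return 1.0 - all_fail


preds = {"v1": [(5.0, 0.002, [4.0]), (20.0, 0.002, [3.0])]}
ap = copy_reliability(0.001, [10.0], preds, start_time=12.0)
print(ap, task_reliability([ap, ap]))
```

In the example the second predecessor copy arrives at time 20, after the start time 12, so it does not contribute to the reliability of the copy.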
The gene corresponding to the valid start-time position at which a task copy is mapped has the value 1, and positions to which no copy is mapped have the value 0. For a given task mapping, at most one position among the genes of each node has the value 1; all other positions are 0.
The encoding of an individual also records the number of valid mapping positions of each node, held in an array $s$: $|s_i|$ is the number of valid mapping positions of the node $p_i$ represented by element $s_i$, with $|s_0| = 0$. If task $v_j$ is assigned to the $k$-th valid start-time position of node $p_n$, the $l$-th gene of individual $g_j$ is set, $g_{j,l} = 1$, where $l$ is the index of that position in the individual. The length of an individual is the total number of valid positions over all nodes, and each array element $s_i$ corresponds to the contiguous set of genes of node $p_i$ in individual $g_j$.
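The index expressions for this encoding are not present in the text. One layout consistent with the index range used by the fitness functions $F_{Tim}$ and $F_{Rel}$ is sketched below. It assumes that the genes of nodes $p_1, \ldots, p_M$ are concatenated in order, and it should be read as a reconstruction rather than as the patent's own formulas.

```latex
% Index of the k-th valid position of node p_n within individual g_j (assumed layout)
l = \sum_{q=0}^{n-1} |s_q| + k, \qquad 1 \le k \le |s_n|

% Total length of an individual
|g_j| = \sum_{i=1}^{M} |s_i|

% Genes of individual g_j that belong to node p_i
\left\{ g_{j,p} \;\middle|\; \sum_{q=0}^{i-1} |s_q| < p \le \sum_{q=0}^{i} |s_q| \right\}
```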
(6.3.3) Apply crossover to the individuals of the population with crossover probability $p_c$:
If a random number is less than the crossover probability $p_c$, then for the two selected individuals choose, from array $s$, the nodes whose genes differ between the two individuals, exchange the genes of the chosen nodes between the two individuals, and add the newly generated individuals to the population;
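A minimal sketch of this node-wise crossover follows. Individuals are assumed to be flat lists of 0/1 genes, `sizes` is an assumed stand-in for the array $s$ (positions per node), and swapping a single differing node is a simplification chosen for this example.

```python
import random


def node_slices(sizes):
    """Turn per-node position counts into slices of the flat gene list."""
    slices, start = [], 0
    for n in sizes:
        slices.append(slice(start, start + n))
        start += n
    return slices


def crossover(a, b, sizes, pc, rng=random):
    """Swap the gene segment of one node on which parents a and b differ.

    Returns two children, or None when no crossover takes place.
    """
    if rng.random() >= pc:
        return None
    differing = [s for s in node_slices(sizes) if a[s] != b[s]]
    if not differing:
        return None
    s = rng.choice(differing)
    child_a, child_b = a[:], b[:]
    child_a[s], child_b[s] = b[s], a[s]     # exchange the whole node segment
    return child_a, child_b


a = [1, 0, 0, 0, 1]   # node 1 has 3 positions, node 2 has 2 positions
b = [1, 0, 0, 1, 0]
print(crossover(a, b, sizes=[3, 2], pc=1.0, rng=random.Random(0)))
```

Because whole node segments are exchanged, each node still carries at most one gene set to 1 in the children, which keeps them valid under the encoding rule.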
(6.3.4) Apply mutation to the individuals of the population with mutation probability $p_m$:
Add the newly generated individuals to the population;
(6.3.5) Use the completion-time evaluation function $F_{Tim}$ and the reliability evaluation function $F_{Rel}$ to compute the fitness of each individual $g_i$ in the population, and sort all individuals in descending order of their $F_{Tim}$ and $F_{Rel}$ values to obtain two sorted queues of individuals:
$$F_{Tim}(g_i) = 1 - \max_{1 \le k \le M} \left\{ FT(v_j, p_k) \;\middle|\; \sum_{\sum_{q=0}^{k-1} |s_q| < p < 1 + \sum_{l=0}^{k} |s_l|} g_{i,p} = 1 \right\}$$
$$F_{Rel}(g_i) = P[E_{v_j}] = 1 - \prod_{1 \le k \le M,\ \sum_{\sum_{q=0}^{k-1} |s_q| < p < 1 + \sum_{l=0}^{k} |s_l|} g_{i,p} = 1} \left( 1 - AP_{v_j}^{p_k} \right);$$
(6.3.6) Select individuals from the two queues in round-robin (RR) fashion as the individuals of the new population until the population size is reached;
(6.3.7) If the stopping condition is not met, repeat steps (6.3.3) to (6.3.6). The search stops when neither the reliability nor the makespan improves within the specified number of generations.
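The selection of step (6.3.6) can be sketched as an alternating pick from the two sorted queues. The function name, the fitness callables and the choice to skip duplicates are assumptions made for this example.

```python
def round_robin_select(population, f_time, f_rel, size):
    """Alternate between the time-sorted and reliability-sorted queues."""
    by_time = sorted(population, key=f_time, reverse=True)
    by_rel = sorted(population, key=f_rel, reverse=True)
    chosen, seen = [], set()
    for pair in zip(by_time, by_rel):       # one pick from each queue per round
        for individual in pair:
            key = tuple(individual)
            # Assumption: an individual already picked is not picked again.
            if key not in seen and len(chosen) < size:
                seen.add(key)
                chosen.append(individual)
    return chosen


# Hypothetical fitness values for three individuals.
time_score = {(1, 0): 0.9, (0, 1): 0.5, (1, 1): 0.1}
rel_score = {(1, 0): 0.2, (0, 1): 0.6, (1, 1): 0.95}
population = [[1, 0], [0, 1], [1, 1]]
survivors = round_robin_select(
    population,
    lambda g: time_score[tuple(g)],
    lambda g: rel_score[tuple(g)],
    size=2,
)
print(survivors)  # [[1, 0], [1, 1]]
```

Alternating between the two queues keeps both the fastest and the most reliable individuals in the population, so neither objective dominates the search.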
The beneficial effects of the present invention are as follows:
The general scheduling mode of the multi-copy task fault-tolerant scheduling method allows the start time of the copy being scheduled to be the maximum, over only a subset of the predecessor copies, of the copy's finish time plus its communication time. Not every inter-task message has to be sent; only the dependency messages between the required task copies are needed. The method can therefore further reduce the start time of the task copy being scheduled and, with it, the makespan of the schedule. Provided the reliability requirement of the task set is met, the method applies repeated crossover and mutation operations of a genetic algorithm to further optimize the makespan, and its reliability calculation considers both node failures and link failures. In addition, the evolutionary process of the genetic algorithm does not produce invalid task copies.
Brief description of the drawings
Fig. 1 is a flowchart of the multi-copy task fault-tolerant scheduling method for heterogeneous distributed systems;
Fig. 2 is the DAG of the tasks to be scheduled;
Fig. 3 shows the configuration parameters of the nodes and links in the system;
Fig. 4 is the first individual generated by population initialization when mapping task $v_3$;
Fig. 5 is the second individual generated by population initialization when mapping task $v_3$;
Fig. 6 is the third individual generated by population initialization when mapping task $v_3$;
Fig. 7 is the fourth individual generated by population initialization when mapping task $v_3$;
Fig. 8 is the fifth individual, generated by the first mutation of individual $g_3$ when mapping task $v_3$;
Fig. 9 is the first individual generated by population initialization when mapping task $v_4$;
Fig. 10 is the second individual generated by population initialization when mapping task $v_4$;
Fig. 11 is the third individual generated by population initialization when mapping task $v_4$;
Fig. 12 is the fourth individual generated by population initialization when mapping task $v_4$;
Fig. 13 is the fifth individual, generated by crossover when mapping task $v_4$;
Fig. 14 is the sixth individual, generated by mutation when mapping task $v_4$;
Fig. 15 is the final schedule generated.
Detailed description
The present invention is described in more detail below with reference to the accompanying drawings:
Blind replication wastes resources, other reliability-aware scheduling methods ignore the makespan, the dependencies between tasks and the link failure probability, and strict scheduling yields a longer makespan. To address these shortcomings, the object of the invention is to provide a multi-copy task fault-tolerant scheduling method for heterogeneous distributed systems based on active replication. The method uses a multi-copy fault-tolerance mechanism based on general scheduling and, provided the task reliability requirements are met, further optimizes the makespan of the task set through the crossover and mutation operations of a genetic algorithm.
The specific steps of the multi-copy task fault-tolerant scheduling method for heterogeneous distributed systems are as follows:
The application with dependency constraints is modeled as $G = \langle V, E \rangle$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the task set, $N = |V|$ is the number of tasks and $E$ is the set of weighted directed communication edges between tasks in $V$. The system is modeled as an undirected graph $G_s = \langle P, L \rangle$, where $P = \{p_1, p_2, \ldots, p_M\}$ is the set of $M$ heterogeneous nodes, $M = |P|$, and $L$ is the set of $|L|$ communication links. $R$ is the reliability requirement of the task set:
1. First, from the load of each task and the execution speed of each node in the system, compute the execution time $ET(v_j, p_k)$ of each task $v_j$ of the application when scheduled on each node $p_k$ of the system.
2. Compute the average execution time of each task over all processor nodes and the average communication time of each message over all links.
3. Using formula (1), compute the bottom-level priority $bl(v_j)$ of every task $v_j$ in the task set by the bottom-level priority method.
In the formula, $succ(v_j)$ is the set of direct successors of task $v_j$, and the two remaining terms are the average execution time of task $v_j$ over all nodes in the node set $P$ and the average transmission time of message $e_{j,i}$ over all links in the link set $L$.
4. Add the ready tasks to the scheduling queue in non-increasing order of priority. A ready task is a task whose predecessors have all been scheduled or that has no predecessors.
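Steps 3 and 4 can be sketched together. The sketch assumes the standard bottom-level definition (average execution time plus the largest value of average communication time plus successor priority), because formula (1) itself is not present in this text. The function names and the sample graph are illustrative.

```python
import heapq


def bottom_levels(avg_exec, avg_comm, succ):
    """avg_exec: task -> average execution time; avg_comm: (task, successor)
    -> average transfer time; succ: task -> list of direct successors."""
    memo = {}

    def bl(v):
        if v not in memo:
            memo[v] = avg_exec[v] + max(
                (avg_comm[(v, s)] + bl(s) for s in succ.get(v, [])),
                default=0.0,   # exit task: no successors
            )
        return memo[v]

    for v in avg_exec:
        bl(v)
    return memo


def priority_order(avg_exec, avg_comm, succ):
    """Pop ready tasks in non-increasing order of bottom-level priority."""
    bl = bottom_levels(avg_exec, avg_comm, succ)
    pending = {v: 0 for v in avg_exec}          # unscheduled predecessor counts
    for successors in succ.values():
        for s in successors:
            pending[s] += 1
    ready = [(-bl[v], v) for v in avg_exec if pending[v] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, v = heapq.heappop(ready)
        order.append(v)
        for s in succ.get(v, []):               # successors may become ready
            pending[s] -= 1
            if pending[s] == 0:
                heapq.heappush(ready, (-bl[s], s))
    return order


avg_exec = {"v1": 3.0, "v2": 2.0, "v3": 4.0, "v4": 1.0}
succ = {"v1": ["v2", "v3"], "v2": ["v4"], "v3": ["v4"]}
avg_comm = {("v1", "v2"): 1.0, ("v1", "v3"): 2.0, ("v2", "v4"): 1.0, ("v3", "v4"): 1.0}
print(bottom_levels(avg_exec, avg_comm, succ))
print(priority_order(avg_exec, avg_comm, succ))  # ['v1', 'v3', 'v2', 'v4']
```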
5. Select the highest-priority task $v_j$ among the ready tasks in the scheduling queue and compute its reliability requirement $r_x$ by formula (2), where $x$ is the position of the task in the priority queue.
$$r_x = \sqrt[n-x+1]{\frac{R}{\prod_{i=0}^{x-1} r'_i}} \qquad (2)$$
where $1 \le x \le n$ follows the priority order of the tasks, $R$ is the reliability requirement of the task set, $r'_i$ is the reliability actually achieved by the task at position $i$ of the priority queue, and $r'_0 = 1$. If the highest-priority task is the entry task ($x = 1$), its reliability requirement is $r_1 = \sqrt[n]{R}$.
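A short worked example shows how formula (2) redistributes the requirement. The numbers ($n = 4$, $R = 0.9$ and the achieved reliabilities) are chosen for illustration only.

```latex
% Entry task (x = 1), with r'_0 = 1:
r_1 = \sqrt[4]{0.9} \approx 0.9740

% If the first task actually achieves r'_1 = 0.98 (better than required):
r_2 = \sqrt[3]{0.9 / 0.98} \approx 0.9720

% If it achieves only r'_1 = 0.96 (a shortfall):
r_2 = \sqrt[3]{0.9 / 0.96} \approx 0.9787
```

A surplus on an earlier task lowers the requirement of the remaining tasks, and a shortfall raises it, which is how the reliability loss of one task is compensated by later ones.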
6. If the reliability requirement is infeasible, i.e. the reliability requirement of task $v_j$ satisfies $r_x \ge 1$, reject the task and return. Otherwise, call the multi-copy general scheduling method to determine the nodes and start times of the copies of this task.
The multi-copy general scheduling method for a task is implemented as follows:
Given the task $v_j$ to be scheduled, the node set $P$ of the system and the task reliability requirement $r_x$:
(1) First initialize the relevant information. Set the number of copies of task $v_j$ to 0 and set its mapped node set to empty. Set the idle node set to the node set $P$.
(2) If task $v_j$ is an entry task, first select from the idle node queue the node with the earliest finish time to execute a copy of the task (if two nodes have the same finish time, choose either at random). Compute the reliability $P[E_{v_j}]$ of task $v_j$ by formula (3). As long as the combined reliability of the copies of this task meets the requirement, the dependency messages between tasks need not be considered.
$$P[E_{v_j}] = 1 - \prod_{p_n \in proc(v_j)} \left( 1 - \exp\{ -\lambda_{p_n} \cdot w(v_j) / w(p_n) \} \right) \qquad (3)$$
In the formula, $proc(v_j)$ is the set of nodes to which task $v_j$ is mapped, $\lambda_{p_n}$ is the permanent failure rate of processor node $p_n$, $w(v_j)$ is the load of task $v_j$, and $w(p_n)$ is the amount of computation that node $p_n$ can perform per unit time.
If the task reliability requirement is not met, keep selecting the node with the earliest finish time from the idle queue to execute another copy, and recompute the reliability of the task by formula (3) until the requirement is met. If the idle node queue becomes empty and the requirement is still not met, the reliability shortfall of this task can be made up through the reliability requirement formula when the copies of subsequent tasks are scheduled, so that the reliability requirement of the task set is still met.
(3) If task $v_j$ has predecessor tasks, call the general multi-copy task scheduling method based on a genetic algorithm to schedule its copies.
The concrete steps of the genetic-algorithm-based general multi-copy task scheduling method are as follows:
1) First initialize the crossover probability $p_c$, the mutation probability $p_m$, the population size GN, and the number of population evolutions EN.
2) Then generate the initial population.
To preserve the dependencies between tasks, the encoding may contain only valid start execution time positions. A valid start execution position must guarantee that the task can receive the messages sent from the nodes to which its predecessor tasks are mapped. For any predecessor task $v_i$ of the currently scheduled task, the time at which the message of its copy $v_i^k$ mapped on node $p_k$ arrives at node $p_n$ is computed according to formula (4).
$$ave(v_i^k, p_n) = \max\{FT(v_i, p_k),\ rdy(l_{k,n})\} + w(e_{i,j}) / w(l_{k,n}) \tag{4}$$
In the formula, $FT(v_i, p_k)$ is the finish time of task $v_i$ on node $p_k$. $rdy(l_{k,n})$ is the time at which link $l_{k,n}$ is ready to communicate, i.e. the finish time of the last message on the link. $w(e_{i,j})$ is the size of the communication message $e_{i,j}$ between task $v_i$ and task $v_j$. $w(l_{k,n})$ is the amount of data that link $l_{k,n}$ between node $p_k$ and node $p_n$ can transmit per unit time. If the mapped nodes are the same, i.e. $p_k = p_n$, then $rdy(l_{k,n})$ is 0 and the communication overhead is 0, so that $ave(v_i^k, p_n) = FT(v_i, p_k)$.
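As a small illustration of formula (4), the sketch below computes the arrival time of one predecessor copy's message. The function name and the numeric values are assumptions made for the example only.

```python
def arrival_time(finish_time, link_ready, msg_size, bandwidth, same_node=False):
    """Formula (4): max(FT, rdy(link)) + message size / link bandwidth.

    When both copies sit on the same node there is no transfer,
    so the message is available as soon as the predecessor finishes.
    """
    if same_node:
        return finish_time
    return max(finish_time, link_ready) + msg_size / bandwidth

# Illustrative values: predecessor finishes at 18, idle link, 10 data units at 2 per time unit.
print(arrival_time(18.0, 0.0, 10.0, 2.0))                   # transfer starts at 18
print(arrival_time(18.0, 25.0, 10.0, 2.0))                  # link busy until 25
print(arrival_time(18.0, 99.0, 10.0, 2.0, same_node=True))  # no link involved
```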
The task encoding scheme encodes, for each node, every position between the minimum valid start execution time position and the maximum valid start execution time position as a gene of the individual. The minimum valid start execution time position $EST(v_j, p_n)$ of task $v_j$ on processor $p_n$ is computed according to formula (5).
$$EST(v_j, p_n) = \max\left\{ \max_{v_i \in pred(v_j)} \left\{ \min_{v_i^k \in rep(v_i)} \{ ave(v_i^k, p_n) \} \right\},\ rdy(p_n) \right\} \tag{5}$$
In the formula, $pred(v_j)$ is the set of direct predecessor tasks of task $v_j$; $rep(v_i)$ is the set of copies of task $v_i$; $rdy(p_n)$ is, under the current schedule, the finish time $PFT(p_n)$ of the last task mapped to node $p_n$, computed as shown in formula (6).
$$PFT(p_k) = \max_{v_i \in V,\ p_k \in proc(v_i)} \{ FT(v_i, p_k) \} \tag{6}$$
In the formula, $proc(v_i)$ is the set of processors to which task $v_i$ is mapped.
The maximum valid start execution time position $LST(v_j, p_n)$ of task $v_j$ on processor $p_n$ is computed according to formula (7).
$$LST(v_j, p_n) = \max\left\{ \max_{v_i \in pred(v_j)} \left\{ \max_{v_i^k \in rep(v_i)} \{ ave(v_i^k, p_n) \} \right\},\ rdy(p_n) \right\} \tag{7}$$
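The range of valid start positions on a node can be sketched as below. The function name is hypothetical. The earliest position follows formula (5); the latest position is taken here as the latest arrival among the copies of each predecessor, an assumption that matches the EST and LST values quoted in the worked example (18 and 38 on $p_1$, 23 and 28 on $p_2$ for task $v_3$). The arrival times are inferred from those quoted values, not read from the figures.

```python
def est_lst(arrivals_by_pred, node_ready):
    """Return (EST, LST) for one task on one node.

    arrivals_by_pred: one list per direct predecessor, holding the arrival
    times ave(v_i^k, p_n) of the messages of that predecessor's copies.
    node_ready: rdy(p_n), finish time of the last task mapped to the node.
    """
    earliest = max([min(a) for a in arrivals_by_pred] + [node_ready])
    latest = max([max(a) for a in arrivals_by_pred] + [node_ready])
    return earliest, latest

# v3 has one predecessor (v1) with two copies.
print(est_lst([[18.0, 38.0]], 18.0))  # node p1, already busy until 18
print(est_lst([[23.0, 28.0]], 0.0))   # node p2, idle
```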
Randomly choose a processor node from the idle node queue, randomly choose a valid start execution time position on that node, and map a copy of the currently scheduled task there. Compute the reliability of this task copy according to formula (8). If the reliability of the task does not meet the requirement, continue choosing processor nodes from the idle node queue and valid start execution time positions on them until the reliability of the task meets the requirement; the resulting task copy mapping scheme becomes one individual of the population. Repeat this to generate individuals until the population size is reached. If the number of task copies reaches M and the task reliability still does not meet the requirement, the mapping scheme is nevertheless taken as an individual of the population, because the reliability loss of this task can be compensated when subsequent tasks are scheduled.
$$P[E_{v_j}] = 1 - \prod_{p_n \in proc(v_j)} \left(1 - AP_{v_j}^{p_n}\right) = 1 - \prod_{p_n \in proc(v_j)} \left( 1 - \left( \prod_{v_i^n \in Prep_n \,\cap\, tv_i^n \le tv_j^n} e^{-\lambda_{p_n} \cdot w(v_i) / w(p_n)} \right) \times \prod_{v_l \in pred(v_j)} \left( 1 - \prod_{p_k \in proc(v_l),\ ave(v_l^k, p_n) \le ST(v_j, p_n)} \left( 1 - \prod_{et_{p,q} \in ON(l_{k,n}) \,\cap\, et_{p,q} \le et_{l,j}} e^{-\lambda_{l_{k,n}} \cdot w(e_{p,q}) / w(l_{k,n})} \right) \right) \right) \tag{8}$$
In the formula, $AP_{v_j}^{p_n}$ is the reliability of the copy of task $v_j$ mapped on node $p_n$; $v_i^n$ is a task copy executed on node $p_n$ before the currently scheduled task copy $v_j^n$; $Prep_n$ is the set of task copies executed on node $p_n$; $ST(v_j, p_n)$ is the start execution time of task $v_j$ on node $p_n$; $et_{p,q}$ is the communication start time of the message between tasks $v_p$ and $v_q$; $ON(l_{k,n})$ is the set of all communications occurring on link $l_{k,n}$; $et_{p,q} \le et_{l,j}$ ($v_p, v_q \in V$) means that on link $l_{k,n}$ the communication start time of message $e_{p,q}$ is less than or equal to that of message $e_{l,j}$. $\lambda_{l_{k,n}}$ is the failure probability of link $l_{k,n}$ between node $p_k$ and node $p_n$. If task copies $v_l^k$ and $v_j^n$ are mapped to the same node, the link communication time is 0 and the reliability of that communication message is 1.
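One possible reading of formula (8) is sketched below: the reliability of a single copy is the product of a node term (all work executed on the node up to and including this copy) and, for each predecessor, the probability that at least one of its timely messages arrives intact. The function names, the argument layout, and the use of `None` as bandwidth to mark a same-node predecessor are assumptions of this sketch. The λ value is an assumed rate chosen so that the output matches the 0.996008 quoted in the worked example for the first copy of $v_3$ on $p_1$.

```python
import math

def copy_reliability(node_lam, node_speed, loads_on_node, pred_links):
    """AP for one copy, following the structure of formula (8).

    loads_on_node: loads of the copies run on this node up to and including this one.
    pred_links: one list per predecessor; each entry is (link_lam, bandwidth, sizes)
    for a predecessor copy whose message arrives before the start time, where sizes
    are the messages sent on that link up to and including this one.
    bandwidth=None marks a predecessor copy on the same node (message reliability 1).
    """
    node_part = math.exp(-node_lam * sum(loads_on_node) / node_speed)
    message_part = 1.0
    for links in pred_links:
        all_fail = 1.0
        for link_lam, bandwidth, sizes in links:
            if bandwidth is None:
                ok = 1.0
            else:
                ok = math.exp(-link_lam * sum(sizes) / bandwidth)
            all_fail *= 1.0 - ok
        message_part *= 1.0 - all_fail
    return node_part * message_part

def combine_copies(copy_reliabilities):
    """Task reliability: at least one copy succeeds."""
    all_fail = 1.0
    for ap in copy_reliabilities:
        all_fail *= 1.0 - ap
    return 1.0 - all_fail

# v3 on p1 after v1 (loads 18 and 22), predecessor copy on the same node.
ap = copy_reliability(0.0001, 1.0, [18.0, 22.0], [[(0.0, None, [])]])
print(round(ap, 6))
```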
The gene corresponding to the valid start execution time position at which a task copy is mapped has value 1, and the genes of positions with no mapped task have value 0. To prevent a task from being mapped more than once to the same node, at most one position among the genes of each node may have value 1 and all other positions are 0; that is, at most one copy of a given task can be mapped to a given node.
The encoding scheme must also record the number of valid mapping positions of each node in the individual encoding; this is represented by the array $s$. If task $v_j$ is assigned to the $k$-th valid start execution time position of node $p_n$, then the $l$-th gene of individual $g_j$ is $g_{j,l} = 1$, where $l = \sum_{i=0}^{n-1} |s_i| + k$. Here $|s_i|$ is the number of valid mapping positions of the node $p_i$ represented by element $s_i$ of array $s$, and $|s_0| = 0$. The encoding length of an individual is $\sum_{i=1}^{M} |s_i|$. The gene set of individual $g_j$ corresponding to array element $s_i$ is $\{ g_{j,p} \mid \sum_{q=0}^{i-1} |s_q| < p < 1 + \sum_{l=0}^{i} |s_l| \}$.
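The encoding can be illustrated with a flat 0/1 list in which each node owns a contiguous segment. This is a sketch under the assumption that the genes of node $p_n$ follow those of nodes $p_1 \ldots p_{n-1}$, as the index ranges in formulas (9) and (10) suggest; the helper names are hypothetical.

```python
def gene_index(s, n, k):
    """1-based index of the k-th valid position of node p_n (s[0] is 0, s[i] is |s_i|)."""
    return sum(s[:n]) + k

def encode(s, mapping):
    """mapping: {node number n: position k}. A dict allows at most one copy per node."""
    genes = [0] * sum(s)
    for n, k in mapping.items():
        if not 1 <= k <= s[n]:
            raise ValueError("position outside the valid range of the node")
        genes[gene_index(s, n, k) - 1] = 1
    return genes

def decode(s, genes):
    """Recover {node: position} and reject a node segment holding more than one 1."""
    mapping, offset = {}, 0
    for n in range(1, len(s)):
        segment = genes[offset:offset + s[n]]
        if sum(segment) > 1:
            raise ValueError("a task may be mapped to a node at most once")
        if 1 in segment:
            mapping[n] = segment.index(1) + 1
        offset += s[n]
    return mapping

s = [0, 3, 2, 2]                # three nodes with 3, 2 and 2 valid positions
genes = encode(s, {1: 1, 3: 2})
print(genes, decode(s, genes))
```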
3) Apply the crossover operation to all individuals in the population according to the crossover probability $p_c$. If the random number is less than the crossover probability $p_c$, then for the two selected individuals, randomly select one or several nodes in array $s$ whose corresponding gene values differ between the two individuals, and exchange the genes of all selected nodes between the two individuals. Finally, add the newly generated individuals to the population.
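The segment exchange of the crossover step can be sketched as follows. The function name is hypothetical, and for brevity each node is given a single valid position, which is a simplification. The individuals mirror the crossover described in the worked example for task $v_4$, where an individual on $p_1, p_2$ and one on $p_3, p_4$ exchange the genes of nodes $p_2$ and $p_3$.

```python
def crossover(s, parent_a, parent_b, nodes):
    """Swap the gene segments of the chosen nodes between two individuals.

    s: position counts per node with s[0] == 0; nodes: node numbers to exchange.
    """
    child_a, child_b = parent_a[:], parent_b[:]
    for n in nodes:
        lo = sum(s[:n])
        hi = lo + s[n]
        child_a[lo:hi], child_b[lo:hi] = parent_b[lo:hi], parent_a[lo:hi]
    return child_a, child_b

s = [0, 1, 1, 1, 1, 1]      # five nodes, one position each (simplification)
g1 = [1, 1, 0, 0, 0]        # copies on p1 and p2
g4 = [0, 0, 1, 1, 0]        # copies on p3 and p4
child_a, child_b = crossover(s, g1, g4, [2, 3])
print(child_a, child_b)     # child_b holds copies on p2 and p4
```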
4) Apply the mutation operation to all individuals in the population according to the mutation probability $p_m$. If the random number is less than the mutation probability $p_m$, perform the mutation on a randomly selected node position of array $s$ in the selected individual. There are two kinds of mutation: changing the start execution time position of a task copy on its mapped node, and changing the nodes to which task copies are mapped.
If the reliability of the individual is lower than the reliability requirement of the currently scheduled task, and the node selected in array $s$ already has a mapped task copy, postpone the start execution time of the task mapped on the selected node in order to increase the task reliability.
If the reliability of the individual is lower than the reliability requirement of the currently scheduled task, and the node selected in array $s$ has no mapped task copy, set the gene of a start execution time position of that node to 1, adding a new task copy to improve reliability.
If the reliability of the individual is higher than the reliability requirement of the currently scheduled task, and a valid start execution time position earlier than the current start execution time of the task copy exists, move the start execution time of the task mapped on the selected node earlier. This may reduce the task reliability, which is acceptable as long as the reliability requirement is still met.
If the reliability of the individual is higher than the reliability requirement of the currently scheduled task, and the start execution time of the task copy is already the earliest valid start execution time of the mapped node, set the gene of that start execution time position to 0, cancelling the copy on the selected node. This reduces the reliability, which is acceptable as long as the task reliability requirement is still met.
Finally, add the newly generated individual to the population.
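The four mutation cases can be sketched on the gene segment of a single node. The function name is hypothetical, and two details are assumptions of this sketch because the text leaves them open: a start position is shifted by exactly one slot, and a new copy is placed at the first valid position.

```python
def mutate_node(segment, reliability, requirement):
    """Apply one of the four mutation rules to the genes of one node."""
    seg = segment[:]
    mapped = 1 in seg
    if reliability < requirement:
        if mapped:
            i = seg.index(1)
            if i + 1 < len(seg):          # postpone the start by one position
                seg[i], seg[i + 1] = 0, 1
        elif seg:
            seg[0] = 1                    # add a new copy on this node
    elif mapped:
        i = seg.index(1)
        seg[i] = 0                        # remove the copy at its current position
        if i > 0:
            seg[i - 1] = 1                # an earlier position exists: start earlier
    return seg

print(mutate_node([0, 1, 0], 0.90, 0.99))   # too unreliable, mapped: start later
print(mutate_node([0, 0, 0], 0.90, 0.99))   # too unreliable, unmapped: add copy
print(mutate_node([0, 1, 0], 0.999, 0.99))  # reliable enough: start earlier
print(mutate_node([1, 0, 0], 0.999, 0.99))  # already earliest: cancel the copy
```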
5) Compute the fitness of each individual $g_i$ in the population using the completion time evaluation function $F_{Tim}$ of formula (9) and the reliability evaluation function $F_{Rel}$ of formula (10). Sorting all individuals in descending order of their $F_{Tim}$ and $F_{Rel}$ values yields two sorted queues of individuals.
$$F_{Tim}(g_i) = 1 - \max_{1 \le k \le M} \left\{ FT(v_j, p_k) \ \middle|\ \sum_{q=0}^{k-1} |s_q| < p < 1 + \sum_{l=0}^{k} |s_l|,\ g_{i,p} = 1 \right\} \tag{9}$$
$$F_{Rel}(g_i) = P[E_{v_j}] = 1 - \prod_{1 \le k \le M,\ \sum_{q=0}^{k-1} |s_q| < p < 1 + \sum_{l=0}^{k} |s_l|,\ g_{i,p} = 1} \left(1 - AP_{v_j}^{p_k}\right) \tag{10}$$
6) Select individuals from the two queues by a round-robin (RR) mechanism as the individuals of the new population, until the population size is reached.
7) If the stop condition is not met, repeat steps 3)-6). The search stops after the specified number of population evolutions, or when the algorithm has converged (neither the reliability nor the Makespan improves significantly within the specified number of evolutions).
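The round-robin selection of step 6) can be sketched as below: the two sorted queues are visited alternately and each contributes its best individual that has not been selected yet. The function name is hypothetical and starting with the completion time queue is an assumption. With the two queues listed in the worked example for task $v_3$, this reading reproduces the selection reported there.

```python
def round_robin_select(time_queue, reliability_queue, size):
    """Alternate between the two sorted queues, skipping individuals already chosen."""
    queues = [list(time_queue), list(reliability_queue)]
    selected, turn = [], 0
    while len(selected) < size and any(queues):
        queue = queues[turn % 2]
        while queue and queue[0] in selected:
            queue.pop(0)                  # already picked from the other queue
        if queue:
            selected.append(queue.pop(0))
        turn += 1
    return selected

time_q = ["g3", "g5", "g1", "g2", "g4"]   # best completion time first
rel_q = ["g1", "g3", "g2", "g4", "g5"]    # best reliability first
print(round_robin_select(time_q, rel_q, 4))
```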
7. Delete the scheduled task from the scheduling queue, and add the newly schedulable tasks to the scheduling queue according to their priority. Continue by selecting the next highest-priority task in the scheduling queue for scheduling, repeating steps 5-7 until all tasks have been scheduled.
Fig. 1 shows the flow chart of the multi-copy task fault-tolerant scheduling method for a heterogeneous distributed system. The implementation of the method is described in detail below with reference to the flow chart and an example.
The example schedules the task set $V = \{v_1, v_2, v_3, v_4\}$ of Fig. 2 onto a heterogeneous distributed system whose node set $P = \{p_1, p_2, p_3, p_4, p_5\}$ has the configuration parameters given in Fig. 3. The reliability requirement R is 0.999.
1. First, compute the execution time of each task on each node of the system from the load of each task and the execution speed of each node. The results are: the execution times of task $v_1$ on the five nodes are {18, 9, 9, 18, 6}; those of task $v_2$ are {20, 10, 10, 20, 6.7}; those of task $v_3$ are {22, 11, 11, 22, 7.3}; those of task $v_4$ are {24, 12, 12, 24, 8}.
2. Compute the average execution time of each task over all processor nodes: 12 for task $v_1$, 13.3 for task $v_2$, 14.7 for task $v_3$, and 16 for task $v_4$. Compute the average communication time of each message over all links: 8.5 for message $e_{1,2}$, 10.6 for message $e_{1,3}$, 6.4 for message $e_{2,4}$, and 12.8 for message $e_{3,4}$.
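The numbers in steps 1 and 2 can be reproduced with a few lines. The task loads and node speeds below are assumptions inferred from the execution times listed above (they are not read from Fig. 2 or Fig. 3), and the helper names are hypothetical.

```python
def execution_times(load, speeds):
    """Execution time of one task on every node: load / node speed."""
    return [load / speed for speed in speeds]

def average(values):
    return sum(values) / len(values)

# Inferred, illustrative parameters.
loads = {"v1": 18, "v2": 20, "v3": 22, "v4": 24}
speeds = [1, 2, 2, 1, 3]

for task, load in loads.items():
    times = execution_times(load, speeds)
    print(task, [round(t, 1) for t in times], round(average(times), 1))
```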
3. Compute the priority of each task with the bottom-level priority method according to formula (1): $bl(v_1)$ is 54.1, $bl(v_2)$ is 35.7, $bl(v_3)$ is 43.5, and $bl(v_4)$ is 16. The resulting priority order of the tasks is $\{v_1, v_3, v_2, v_4\}$.
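A bottom-level computation can be sketched as follows, assuming formula (1) is the usual recursion in which a task's level is its average execution time plus the largest sum of average communication time and level over its direct successors. That recursion is an assumption of this sketch; under it, the values for $v_2$, $v_3$, $v_4$ and the resulting priority order agree with the text. The function name is hypothetical.

```python
def bottom_levels(avg_exec, avg_comm):
    """Bottom level per task: own average time + max over successors of (comm + level)."""
    successors = {}
    for (src, dst), cost in avg_comm.items():
        successors.setdefault(src, []).append((dst, cost))
    memo = {}

    def level(task):
        if task not in memo:
            tail = max((cost + level(nxt) for nxt, cost in successors.get(task, [])),
                       default=0.0)   # exit tasks have no successors
            memo[task] = avg_exec[task] + tail
        return memo[task]

    return {task: level(task) for task in avg_exec}

avg_exec = {"v1": 12.0, "v2": 13.3, "v3": 14.7, "v4": 16.0}
avg_comm = {("v1", "v2"): 8.5, ("v1", "v3"): 10.6, ("v2", "v4"): 6.4, ("v3", "v4"): 12.8}
levels = bottom_levels(avg_exec, avg_comm)
order = sorted(levels, key=levels.get, reverse=True)  # non-increasing priority
print(order)
```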
4. According to the task priorities, add the schedulable task $v_1$ to the scheduling queue.
5. From all schedulable tasks in the scheduling queue, select the highest-priority task $v_1$ and compute its reliability requirement according to formula (2): $r_1 = \sqrt[4]{R} = 0.9997499$.
6. Call the general multi-copy task scheduling method to compute the copy scheduling nodes and start execution times of task $v_1$.
(1) First set the copy count of task $v_1$ to 0 and its mapped nodes to empty. Set the idle node set to the node set $P$.
(2) Choose the node with the earliest completion time in the idle node queue, $p_1$, to execute a copy of the task. According to formula (3), the reliability of task $v_1$ is 0.998202, which does not meet the task reliability requirement. Therefore continue by choosing the node with the earliest completion time in the idle queue, $p_4$, to execute another copy; the reliability of the task according to formula (3) is then 0.999981, which meets the requirement. That is, $r'_1$ is 0.999981.
7. Delete the scheduled task $v_1$ from the scheduling queue, and add the newly schedulable tasks $v_2$ and $v_3$ to the scheduling queue according to their priority. Continue by selecting the highest-priority task $v_3$ for scheduling.
8. From all schedulable tasks in the scheduling queue, select the highest-priority task $v_3$ and compute its reliability requirement according to formula (2): $r_2 = \sqrt[3]{R / r'_1} = 0.9996729$.
9. Call the general multi-copy task scheduling method to compute the copy scheduling nodes and start execution times of task $v_3$.
(1) First set the copy count of task $v_3$ to 0 and its mapped nodes to empty. Set the idle node set to the node set $P$.
(2) Call the genetic-algorithm-based general multi-copy task scheduling method to schedule the copies. The concrete steps are as follows:
1) First initialize the crossover probability $p_c = 0.5$, the mutation probability $p_m = 0.25$, the population size GN = 4, and the number of population evolutions EN = 3.
2) Then generate the initial population.
First generate the first individual $g_1$ by mapping task $v_3$ to node $p_1$. Here $EST(v_3, p_1)$ is 18 and $LST(v_3, p_1)$ is 38. The start execution time position of the copy is 18, and the reliability of task $v_3$ is then 0.996008. Continue by choosing node $p_2$ to map a copy of task $v_3$. Here $EST(v_3, p_2)$ is 23 and $LST(v_3, p_2)$ is 28. The start execution time position of the copy is 23, and the reliability of task $v_3$ is then 0.999967. The encoding scheme of individual $g_1$ of task $v_3$ is shown in Fig. 4.
Then generate the second individual $g_2$ by mapping task $v_3$ to node $p_1$. Here $EST(v_3, p_1)$ is 18 and $LST(v_3, p_1)$ is 38. The start execution time position of the copy is 18, and the reliability of task $v_3$ is then 0.996008. Continue by choosing node $p_4$ to map a copy of task $v_3$. Here $EST(v_3, p_4)$ is 18 and $LST(v_3, p_4)$ is 38. The start execution time position of the copy is 18, and the reliability of task $v_3$ is then 0.999905. The encoding scheme of individual $g_2$ of task $v_3$ is shown in Fig. 5.
Then generate the third individual $g_3$ by mapping task $v_3$ to node $p_2$. Here $EST(v_3, p_2)$ is 23 and $LST(v_3, p_2)$ is 28. The start execution time position of the copy is 23, and the reliability of task $v_3$ is then 0.991635. Continue by choosing node $p_3$ to map a copy of task $v_3$. Here $EST(v_3, p_3)$ is 23 and $LST(v_3, p_3)$ is 28. The start execution time position of the copy is 23, and the reliability of task $v_3$ is then 0.999946. The encoding scheme of individual $g_3$ of task $v_3$ is shown in Fig. 6.
Finally generate the fourth individual $g_4$ by mapping task $v_3$ to node $p_2$. Here $EST(v_3, p_2)$ is 23 and $LST(v_3, p_2)$ is 28. The start execution time position of the copy is 23, and the reliability of task $v_3$ is then 0.991635. Continue by choosing node $p_4$ to map a copy of task $v_3$. Here $EST(v_3, p_4)$ is 18 and $LST(v_3, p_4)$ is 38. The start execution time position of the copy is 18, and the reliability of task $v_3$ is then 0.999802. The encoding scheme of individual $g_4$ of task $v_3$ is shown in Fig. 7.
3) Apply the first crossover operation to all individuals in the population according to the crossover probability $p_c = 0.5$. Suppose that the random number of each crossover operation is less than the crossover probability $p_c$, so that no crossover is performed.
4) Apply the mutation operation to all individuals in the population according to the mutation probability $p_m = 0.25$. Suppose that for the third individual $g_3$ the random number is greater than the mutation probability $p_m$, so that the third individual is mutated. The mapping position of node $p_2$ of the third individual is randomly chosen for mutation. The reliability of this individual is higher than the reliability requirement of task $v_3$, and the start execution time of the task copy is already the earliest valid start execution time of the mapped node, so the gene of the start execution time position in node $p_2$ is set to 0, cancelling the copy on the mutated node $p_2$ and generating individual $g_5$. The encoding scheme of the individual $g_5$ generated by mutation for task $v_3$ is shown in Fig. 8. The newly generated individual is added to the population.
5) Compute the fitness of each individual in the population using the completion time evaluation function $F_{Tim}$ of formula (9) and the reliability evaluation function $F_{Rel}$ of formula (10). Sorting in descending order of the $F_{Tim}$ and $F_{Rel}$ values yields two sorted queues of individuals. Completion time evaluation queue: $g_3, g_5, g_1, g_2, g_4$. Reliability evaluation queue: $g_1, g_3, g_2, g_4, g_5$.
6) Select individuals from the two queues by the RR mechanism as the individuals of the new population, until the population size is reached. The individuals selected from the queues are $g_3, g_1, g_5, g_2$. This completes the first evolution.
7) Carry out the remaining evolutions according to steps 3)-6) above. In the mutation operation of the second evolution, suppose that individual $g_5$ meets the mutation condition and that the mutation adds a mapped copy of task $v_3$ on node $p_5$. After three evolutions the final copy mapping scheme of task $v_3$ is generated, consisting of two task copies whose start execution times are 23 and 21.3 respectively. The finish time of task $v_3$ is 34 and its reliability is 0.999970. That is, $r'_2$ is 0.999970.
10. Delete the scheduled task $v_3$ from the scheduling queue, and continue by selecting the next highest-priority task $v_2$ in the scheduling queue for scheduling. Compute the reliability requirement $r_3$ of task $v_2$ according to formula (2).
11. Call the general multi-copy task scheduling method to compute the copy scheduling nodes and start execution times of task $v_2$. The computation steps are similar to those for task $v_3$ and are not repeated here. After three evolutions the final task copy mapping scheme is generated, consisting of two task copies whose start execution times are 18 and 22 respectively. The finish time of task $v_2$ is 38 and its reliability is 0.999973. That is, $r'_3$ is 0.999973.
12. Delete the scheduled task $v_2$ from the scheduling queue, and add the newly schedulable task $v_4$ to the scheduling queue. Continue by selecting the highest-priority task $v_4$ for scheduling.
13. From all schedulable tasks in the scheduling queue, select the highest-priority task $v_4$ and compute its reliability requirement $r_4$ according to formula (2): $r_4 = R / (r'_1 \cdot r'_2 \cdot r'_3) = 0.99907593$.
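The reliability requirements quoted in steps 5, 8 and 13 can be checked with a short sketch of formula (2). The function name is hypothetical; the inputs are the task-set requirement R = 0.999, the achieved reliabilities reported in this example, and n = 4 tasks.

```python
def reliability_requirement(R, achieved, n):
    """Formula (2): the (n - x + 1)-th root of R divided by the reliabilities
    already achieved by the x - 1 tasks scheduled before this one."""
    x = len(achieved) + 1
    remaining = R
    for r in achieved:
        remaining /= r
    return remaining ** (1.0 / (n - x + 1))

R, n = 0.999, 4
r1 = reliability_requirement(R, [], n)
r2 = reliability_requirement(R, [0.999981], n)
r4 = reliability_requirement(R, [0.999981, 0.999970, 0.999973], n)
print(round(r1, 7), round(r2, 7), round(r4, 8))
```

A returned value of 1 or more corresponds to the invalid requirement of step 6, in which case the task is refused.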
14. Call the general multi-copy task scheduling method to compute the copy scheduling nodes and start execution times of task $v_4$.
(1) First set the copy count of task $v_4$ to 0 and its mapped nodes to empty. Set the idle node set to the node set $P$.
(2) Call the genetic-algorithm-based general multi-copy task scheduling method to schedule the copies. The concrete steps are as follows:
1) First initialize the crossover probability $p_c = 0.5$, the mutation probability $p_m = 0.25$, the population size GN = 4, and the number of population evolutions EN = 3.
2) Then generate the initial population.
First generate the first individual $g_1$ by mapping task $v_4$ to node $p_1$. Here $EST(v_4, p_1)$ is 40 and $LST(v_4, p_1)$ is 40.6. The start execution time position of the copy is 40, and the reliability of task $v_4$ is then 0.992627. Continue by choosing node $p_2$ to map a copy of task $v_4$. Here $EST(v_4, p_2)$ is 38 and $LST(v_4, p_2)$ is 52.6. The start execution time position of the copy is 38, and the reliability of task $v_4$ is then 0.999927. The encoding scheme of individual $g_1$ of task $v_4$ is shown in Fig. 9.
Then generate the second individual $g_2$ by mapping task $v_4$ to node $p_1$. Here $EST(v_4, p_1)$ is 40 and $LST(v_4, p_1)$ is 40.6. The start execution time position of the copy is 40, and the reliability of task $v_4$ is then 0.992627. Continue by choosing node $p_3$ to map a copy of task $v_4$. Here $EST(v_4, p_3)$ is 34 and $LST(v_4, p_3)$ is 52.6. The start execution time position of the copy is 34, and the reliability of task $v_4$ is then 0.999911. The encoding scheme of individual $g_2$ of task $v_4$ is shown in Fig. 10.
Then generate the third individual $g_3$ by mapping task $v_4$ to node $p_2$. Here $EST(v_4, p_2)$ is 38 and $LST(v_4, p_2)$ is 52.6. The start execution time position of the copy is 38, and the reliability of task $v_4$ is then 0.990050. Continue by choosing node $p_3$ to map a copy of task $v_4$. Here $EST(v_4, p_3)$ is 34 and $LST(v_4, p_3)$ is 52.6. The start execution time position of the copy is 41, and the reliability of task $v_4$ is then 0.999886. The encoding scheme of individual $g_3$ of task $v_4$ is shown in Fig. 11.
Finally generate the fourth individual $g_4$ by mapping task $v_4$ to node $p_3$. Here $EST(v_4, p_3)$ is 34 and $LST(v_4, p_3)$ is 52.6. The start execution time position of the copy is 34, and the reliability of task $v_4$ is then 0.987973. Continue by choosing node $p_4$ to map a copy of task $v_4$. Here $EST(v_4, p_4)$ is 35 and $LST(v_4, p_4)$ is 50. The start execution time position of the copy is 35, and the reliability of task $v_4$ is then 0.999603. The encoding scheme of individual $g_4$ of task $v_4$ is shown in Fig. 12.
3) Apply the first crossover operation to all individuals in the population according to the crossover probability $p_c = 0.5$. Suppose that only for the pair of individuals $g_1$ and $g_4$ the random number is greater than the crossover probability $p_c$, so that crossover is performed on them. The two positions of array $s$ randomly chosen for the crossover cover the mapping positions of array elements $s_2$ and $s_3$. Therefore the genes of nodes $p_2$ and $p_3$ are exchanged between the two individuals, producing the new individual $g_5$. Individual $g_5$ maps task $v_4$ to nodes $p_2$ and $p_4$. Here $EST(v_4, p_2)$ is 38 and $LST(v_4, p_2)$ is 52.6, and the start execution time position of the copy on $p_2$ is 38. $EST(v_4, p_4)$ is 35 and $LST(v_4, p_4)$ is 50, and the start execution time position of the copy on $p_4$ is 35. The reliability of task $v_4$ is then 0.999671. The newly generated individual $g_5$ is added to the population. The encoding scheme of the individual $g_5$ generated by crossover for task $v_4$ is shown in Fig. 13.
4) Apply the mutation operation to all individuals in the population according to the mutation probability $p_m = 0.25$. Suppose that for the third individual $g_3$ the random number is greater than the mutation probability $p_m$, so that the third individual is mutated. The mapping position of node $p_3$ of the third individual is randomly chosen for mutation. The reliability of this individual is higher than the reliability requirement of task $v_4$, and valid start execution time positions still exist before the start execution time of the task copy, so the start execution time position on node $p_3$ is moved earlier and the start execution time position of the copy becomes 34. This generates individual $g_6$, whose reliability is 0.999880. The newly generated individual $g_6$ is added to the population. The encoding scheme of the individual $g_6$ generated by mutation for task $v_4$ is shown in Fig. 14.
5) Compute the fitness of each individual in the population using the completion time evaluation function $F_{Tim}$ of formula (9) and the reliability evaluation function $F_{Rel}$ of formula (10). Sorting in descending order of the $F_{Tim}$ and $F_{Rel}$ values yields two sorted queues of individuals. Completion time evaluation queue: $g_6, g_3, g_4, g_5, g_1, g_2$. Reliability evaluation queue: $g_1, g_2, g_3, g_6, g_5, g_4$.
6) Select individuals from the two queues by the RR mechanism as the individuals of the new population, until the population size is reached. The individuals selected from the queues are $g_6, g_1, g_3, g_2$. This completes the first evolution.
7) Carry out the remaining evolutions according to steps 3)-6) above. After three evolutions the final copy mapping scheme of task $v_4$ is generated, consisting of two task copies whose start execution times are 38 and 34 respectively. The finish time of task $v_4$ is 50 and its reliability is 0.999880.
15. The finally generated scheduling scheme is shown in Fig. 15. The scheduling Makespan of the task set is 50 and its reliability is 0.99980401.
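The final figures can be cross-checked from the per-task results reported in the example. The sketch assumes that the task-set reliability is the product of the achieved task reliabilities, which is consistent with the reported 0.99980401, and that the Makespan is the latest task finish time. The finish time 18 for $v_1$ is inferred from its execution time of 18 on $p_1$ and $p_4$ starting at time 0; the other values are quoted from the text.

```python
import math

# Achieved reliabilities r'_1..r'_4 and finish times, in scheduling order v1, v3, v2, v4.
achieved = [0.999981, 0.999970, 0.999973, 0.999880]
finish_times = [18.0, 34.0, 38.0, 50.0]

task_set_reliability = math.prod(achieved)
makespan = max(finish_times)

print(round(task_set_reliability, 8), makespan)
print(task_set_reliability >= 0.999)   # the task-set requirement R is met
```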

Claims (3)

1. A multi-copy task fault-tolerant scheduling method for a heterogeneous distributed system, characterized in that:
(1) according to the load of each task and the execution speed of each node in the system, compute the execution time $ET(v_j, p_k)$ of each task $v_j$ of the application when scheduled on each node $p_k$ of the system; for an application $G = \langle V, E \rangle$ with dependency constraints, the task set is $V = \{v_1, v_2, \ldots, v_N\}$ with task count $N = |V|$, and $E$ is the set of directed weighted communication edges between the tasks in $V$; the system model is an undirected graph $G_s = \langle P, L \rangle$, where $P = \{p_1, p_2, \ldots, p_M\}$ is the set of $M$ heterogeneous nodes, $M = |P|$, and $L$ is the set of $|L|$ communication links; the reliability requirement of the task set is $R$;
(2) compute the average execution time of each task over all processor nodes and the average communication time of each message over all links;
(3) compute the bottom-level priority $bl(v_j)$ of any task $v_j$ in the task set with the bottom-level priority method:
in the formula, $succ(v_j)$ is the set of direct successor tasks of task $v_j$, the average execution time of task $v_j$ is taken over all nodes in node set $P$, and the average transmission time of message $e_{j,i}$ is taken over all links of the system link set $L$;
(4) according to the task priorities, add the schedulable tasks to the scheduling queue in non-increasing order of priority;
(5) from all schedulable tasks in the scheduling queue, select the highest-priority task $v_j$ and compute its reliability requirement $r_x$, where $x$ is the position of this task in the priority queue:
$$r_x = \sqrt[n-x+1]{R \Big/ \prod_{i=0}^{x-1} r'_i}$$
in the formula, $1 \le x \le n$ follows the priority order of the tasks; $R$ is the reliability requirement of the task set; $r'_i$ is the reliability actually achieved by the task at position $i$ of the priority queue, with $r'_0 = 1$; if the highest-priority task is an entry task, its reliability requirement is $r_1 = \sqrt[n]{R}$;
(6) if the reliability requirement is invalid, i.e. the reliability requirement of task $v_j$ satisfies $r_x \ge 1$, refuse to schedule the task and return; otherwise, call the general multi-copy task scheduling method to compute the copy scheduling nodes and start execution times of this task;
(7) delete the scheduled task from the scheduling queue, and add the newly schedulable tasks to the scheduling queue according to priority; continue by selecting the next highest-priority task in the scheduling queue for scheduling, repeating steps (5)-(7) until all tasks have been scheduled.
2. The multi-copy task fault-tolerant scheduling method for a heterogeneous distributed system according to claim 1, characterized in that the general multi-copy task scheduling method is:
(6.1) initialize the relevant information: set the copy count of task $v_j$ to 0, set its mapped nodes to empty, and set the idle node set to the node set $P$;
(6.2) if task $v_j$ is an entry task, choose the node with the earliest completion time in the idle node queue to execute a copy of the task, and compute the reliability of task $v_j$:
$$P[E_{v_j}] = 1 - \prod_{p_n \in proc(v_j)} \left(1 - \exp\{-\lambda_{p_n} \cdot w(v_j) / w(p_n)\}\right)$$
where $proc(v_j)$ is the set of nodes to which task $v_j$ is mapped, $\lambda_{p_n}$ is the permanent failure probability of processor node $p_n$, $w(v_j)$ is the load of task $v_j$, and $w(p_n)$ is the amount of computation that node $p_n$ can execute per unit time; if the task reliability requirement is not met, continue by choosing the node with the earliest completion time in the idle queue to execute another copy of the task and recompute the task reliability, until the task reliability requirement is met; if the idle node queue becomes empty and the task reliability still does not meet the requirement, compensate the task reliability loss through the reliability requirement formula when the copies of subsequent tasks are scheduled;
(6.3) if task $v_j$ has predecessor tasks, call the general multi-copy task scheduling method based on a genetic algorithm to schedule its copies.
3. The multi-copy task fault-tolerant scheduling method for a heterogeneous distributed system according to claim 1, characterized in that the genetic-algorithm-based general multi-copy task scheduling method is:
(6.3.1) initialize the crossover probability $p_c$, the mutation probability $p_m$, the population size GN, and the number of population evolutions EN;
(6.3.2) generate the initial population:
compute the time $ave(v_i^k, p_n)$ at which the message of the copy of predecessor task $v_i$ of the currently scheduled task, mapped on node $p_k$, arrives at node $p_n$:
$$ave(v_i^k, p_n) = \max\{FT(v_i, p_k),\ rdy(l_{k,n})\} + w(e_{i,j}) / w(l_{k,n})$$
in the formula, $FT(v_i, p_k)$ is the finish time of task $v_i$ on node $p_k$, $rdy(l_{k,n})$ is the time at which link $l_{k,n}$ is ready to communicate, i.e. the finish time of the last message on the link, $w(e_{i,j})$ is the size of the communication message $e_{i,j}$ between task $v_i$ and task $v_j$, and $w(l_{k,n})$ is the amount of data that link $l_{k,n}$ between node $p_k$ and node $p_n$ can transmit per unit time; if the mapped nodes are the same, i.e. $p_k = p_n$, then $rdy(l_{k,n})$ is 0 and the communication overhead is 0, so that $ave(v_i^k, p_n) = FT(v_i, p_k)$;
the task encoding scheme encodes, for each node, every position between the minimum valid start execution time position and the maximum valid start execution time position as a gene of the individual; the minimum valid start execution time position $EST(v_j, p_n)$ of task $v_j$ on processor $p_n$ is computed as
$$EST(v_j, p_n) = \max\left\{ \max_{v_i \in pred(v_j)} \left\{ \min_{v_i^k \in rep(v_i)} \{ ave(v_i^k, p_n) \} \right\},\ rdy(p_n) \right\}$$
in the formula, $pred(v_j)$ is the set of direct predecessor tasks of task $v_j$; $rep(v_i)$ is the set of copies of task $v_i$; $rdy(p_n)$ is, under the current schedule, the finish time $PFT(p_n)$ of the last task mapped to node $p_n$:
$$PFT(p_k) = \max_{v_i \in V,\ p_k \in proc(v_i)} \{ FT(v_i, p_k) \}$$
in the formula, $proc(v_i)$ is the set of processors to which task $v_i$ is mapped;
the maximum valid start execution time position $LST(v_j, p_n)$ of task $v_j$ on processor $p_n$ is
$$LST(v_j, p_n) = \max\left\{ \max_{v_i \in pred(v_j)} \left\{ \max_{v_i^k \in rep(v_i)} \{ ave(v_i^k, p_n) \} \right\},\ rdy(p_n) \right\}$$
From node idle queues, choose processor node, at processor node, choose effective Starting Executing Time position, the copy of mapping current scheduling task, the reliability of calculation task copy, if the reliability of this task does not meet the demands, continuation is chosen processor node and in the effective Starting Executing Time of node selection task position from node idle queues, until the reliability of task meets the demands, body one by one using task copy mapping scheme in population, repeat to generate individual, until reach population scale, when if task copy amount is M, the reliability of task does not also reach reliability requirement, will this task copy mapping scheme as the body one by one in population, because can compensate in right amount the reliability loss of this task during follow-up work scheduling,
P [ E v j ] = 1 - &Pi; p n &Element; proc ( v j ) ( 1 - AP v j p n ) = 1 - &Pi; p n &Element; proc ( v j ) ( 1 - ( &Pi; v i n &Element; Prep n &cap; tv i n &le; tv j n ( e - &lambda;p n * w ( v i ) / w ( p n ) ) ) ) &times; &Pi; v l &Element; pred ( v j ) ( 1 - &Pi; p k &Element; proc ( v l ) , ave ( v l k , p n ) &le; ST ( v j , p n ) ( 1 - ( &Pi; et p , q &Element; ON ( l k , n ) &cap; et p , q &le; etl , j ( e - &lambda;l k , n * w ( e p , q ) / w ( l k , n ) ) ) ) ) )
In formula for task v jbe mapped in node p ncopy reliability, for node p nupper current scheduler task copy the task copy of before carrying out prep nfor node p nthe task copy set of carrying out; ST (v j, p n) be task v jat node p nstarting Executing Time; Et p,qfor task v pwith v qbetween the beginning call duration time of communication information; ON (l k,n) be at link l k,nthe all communication occurring; Et p,q≤ et l,j(v p, v q∈ V) be link l k,nupper communication information e p,qbeginning call duration time be less than or equal to message e l,jbeginning call duration time; λ l k,nfor node p kwith node p nbetween link l k,nfailure probability; If task copy with mapping node identical, its link communication time is 0 so, the reliability of this communication information is 1;
The gene value corresponding to the effective start-execution-time position to which the task is mapped is 1, and the value of every position to which no task is mapped is 0. When a task is mapped, at most one position in the gene segment corresponding to each node has the value 1, and all other positions have the value 0;
The individual encoding also includes the number of effective mapping positions of each node, which is represented by the array s. If task $v_j$ is assigned to the k-th effective start-execution-time position of node $p_n$, then the l-th gene of individual $g_j$ is $g_{j,l} = 1$; $|s_i|$ is the number of effective mapping positions of the node $p_i$ represented by element $s_i$ of array s, with $|s_0| = 0$; the length of the individual encoding is array element $s_i$ at individual $g_j$ corresponding gene set is
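The encoding described above can be sketched in Python as a flat 0/1 list in which the gene segments of the nodes are concatenated. The flat layout, the 0-based node indices (without the $s_0$ sentinel), and the function names `encode` and `decode` are assumptions of this example, not details confirmed by the text.

```python
from typing import Dict, List


def segment_offsets(s: List[int]) -> List[int]:
    """Start index of each node's gene segment, plus the total length at the end."""
    offsets = [0]
    for count in s:
        offsets.append(offsets[-1] + count)
    return offsets


def encode(s: List[int], placements: Dict[int, int]) -> List[int]:
    """s[i] is the number of effective positions of node i (0-based here).

    placements maps a node index to the chosen position (1-based within the
    node). At most one position per node can be set, as the text requires.
    """
    offsets = segment_offsets(s)
    genes = [0] * offsets[-1]
    for node, k in placements.items():
        if not 1 <= k <= s[node]:
            raise ValueError(f"position {k} is not valid for node {node}")
        genes[offsets[node] + k - 1] = 1
    return genes


def decode(s: List[int], genes: List[int]) -> Dict[int, int]:
    """Recover the node -> position mapping from a gene list."""
    offsets = segment_offsets(s)
    placements = {}
    for node in range(len(s)):
        segment = genes[offsets[node]:offsets[node + 1]]
        if 1 in segment:
            placements[node] = segment.index(1) + 1
    return placements
```

For example, with `s = [2, 3, 1]`, placing copies at position 2 of node 0 and position 1 of node 2 gives a six-gene individual with exactly two genes set.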
(6.3.3) Perform the crossover operation on all individuals in the population according to the crossover probability $p_c$:
If the random number is less than the crossover probability $p_c$, then for the two selected individuals, select the nodes in array s whose corresponding gene values differ between the two individuals, exchange the genes of all selected nodes between the two individuals, and add the newly generated individuals to the population;
(6.3.4) Perform the mutation operation on all individuals in the population according to the mutation probability $p_m$:
The newly generated individuals are added to the population;
(6.3.5) Use the completion-time evaluation function $F_{Tim}$ and the reliability evaluation function $F_{Rel}$ to calculate the fitness of each individual $g_i$ in the population, and sort all individuals in descending order of their $F_{Tim}$ and $F_{Rel}$ values, obtaining two sorted individual queues; the two functions are:
$$
F_{Tim}(g_i) = 1 - \max_{1 \le k \le M} \left\{ FT(v_j, p_k) \;\middle|\; \sum_{q=0}^{k-1} |s_q| < p < 1 + \sum_{l=0}^{k} |s_l|,\ g_{i,p} = 1 \right\}
$$
$$
F_{Rel}(g_i) = P[E_{v_j}] = 1 - \prod_{1 \le k \le M,\ \sum_{q=0}^{k-1} |s_q| < p < 1 + \sum_{l=0}^{k} |s_l|,\ g_{i,p} = 1} \left( 1 - AP_{v_j}^{p_k} \right);
$$
(6.3.6) Based on the RR (round-robin) mechanism, select individuals from the two queues as individuals of the new population until the population size requirement is reached;
(6.3.7) If the stop condition is not met, repeat steps (6.3.3)-(6.3.6); if neither the reliability nor the Makespan improves within the specified number of evolution generations, stop solving.
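Steps (6.3.3), (6.3.5) and (6.3.6) can be sketched in Python as follows. This is only an illustration under stated assumptions: individuals are flat 0/1 lists with per-node segments given by array s (0-based, no $s_0$ sentinel); the crossover swaps the segment of one randomly chosen differing node, because the text does not state how many nodes are chosen; `finish_time` and `ap` are hypothetical per-position tables of $FT$ and $AP$ values; and the round-robin selection skips duplicates. All function names are invented for this example.

```python
import math
import random
from typing import Callable, List, Tuple

Individual = List[int]


def segments(s: List[int]) -> List[Tuple[int, int]]:
    """(start, end) index range of each node's gene segment."""
    bounds, start = [], 0
    for count in s:
        bounds.append((start, start + count))
        start += count
    return bounds


def crossover(a: Individual, b: Individual, s: List[int],
              pc: float, rng: random.Random) -> List[Individual]:
    """With probability pc, swap one differing node segment; return the new individuals."""
    if rng.random() >= pc:
        return []
    differing = [(lo, hi) for lo, hi in segments(s) if a[lo:hi] != b[lo:hi]]
    if not differing:
        return []
    lo, hi = rng.choice(differing)  # assumption: a single node is exchanged
    child_a, child_b = a[:], b[:]
    child_a[lo:hi], child_b[lo:hi] = b[lo:hi], a[lo:hi]
    return [child_a, child_b]


def f_tim(genes: Individual, finish_time: List[float]) -> float:
    """1 - max finish time over the selected positions (needs at least one copy)."""
    return 1.0 - max(ft for g, ft in zip(genes, finish_time) if g == 1)


def f_rel(genes: Individual, ap: List[float]) -> float:
    """1 - product of (1 - AP) over the selected positions."""
    return 1.0 - math.prod(1.0 - r for g, r in zip(genes, ap) if g == 1)


def rr_select(population: List[Individual],
              time_fitness: Callable[[Individual], float],
              rel_fitness: Callable[[Individual], float],
              size: int) -> List[Individual]:
    """Alternate between the two descending queues until `size` individuals are chosen."""
    by_time = sorted(population, key=time_fitness, reverse=True)
    by_rel = sorted(population, key=rel_fitness, reverse=True)
    chosen, seen = [], set()
    for pair in zip(by_time, by_rel):
        for individual in pair:
            key = tuple(individual)
            if key not in seen and len(chosen) < size:
                seen.add(key)
                chosen.append(individual)
    return chosen
```

Taking individuals alternately from the two queues keeps both objectives represented in the new population instead of collapsing them into a single weighted score.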
CN201410216137.0A 2014-05-21 2014-05-21 Multi-copy task fault tolerance scheduling method of heterogeneous distributed system Expired - Fee Related CN103970613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410216137.0A CN103970613B (en) 2014-05-21 2014-05-21 Multi-copy task fault tolerance scheduling method of heterogeneous distributed system

Publications (2)

Publication Number Publication Date
CN103970613A true CN103970613A (en) 2014-08-06
CN103970613B CN103970613B (en) 2017-05-24

Family

ID=51240145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410216137.0A Expired - Fee Related CN103970613B (en) 2014-05-21 2014-05-21 Multi-copy task fault tolerance scheduling method of heterogeneous distributed system

Country Status (1)

Country Link
CN (1) CN103970613B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108233A (en) * 2017-11-29 2018-06-01 上海交通大学 The cluster job scheduling method and system that the more copies of task perform
CN108628708A (en) * 2017-03-20 2018-10-09 中兴通讯股份有限公司 Cloud computing fault-tolerance approach and device
CN109254841A (en) * 2018-09-30 2019-01-22 湘潭大学 A kind of two-objective optimization method for scheduling task for distributed system
CN109976890A (en) * 2019-03-28 2019-07-05 东南大学 A kind of conversion method minimizing the privately owned cloud computing resources energy consumption of isomery
CN111090783A (en) * 2019-12-18 2020-05-01 北京百度网讯科技有限公司 Recommendation method, device and system, graph-embedded wandering method and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799474A (en) * 2012-06-21 2012-11-28 浙江工商大学 Cloud resource fault-tolerant scheduling method based on reliability drive

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LAIPING ZHAO ET AL.: "A Resource Minimizing Scheduling Algorithm with Ensuring the Deadline and Reliability in Heterogeneous Systems", 《2011 IEEE INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION NETWORKING AND APPLICATION》 *
SUNGHO CHIN ET AL.: "Genetic Algorithm based Scheduling Method for Efficiency and Reliability in Mobile Grid", 《PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON UBIQUITOUS INFORMATION TECHNOLOGIES & APPLICATIONS, 2009》 *
WANG X ET AL.: "Optimizing Makespan and Reliability for Workflow Applications with Reputation and Look-ahead Genetic Algorithm", 《FUTURE GENERATION COMPUTER SYSTEMS》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108628708A (en) * 2017-03-20 2018-10-09 中兴通讯股份有限公司 Cloud computing fault-tolerance approach and device
CN108108233A (en) * 2017-11-29 2018-06-01 上海交通大学 The cluster job scheduling method and system that the more copies of task perform
CN108108233B (en) * 2017-11-29 2021-10-01 上海交通大学 Cluster job scheduling method and system for task multi-copy execution
CN109254841A (en) * 2018-09-30 2019-01-22 湘潭大学 A kind of two-objective optimization method for scheduling task for distributed system
CN109254841B (en) * 2018-09-30 2021-11-26 湘潭大学 Dual-objective optimization task scheduling method for distributed system
CN109976890A (en) * 2019-03-28 2019-07-05 东南大学 A kind of conversion method minimizing the privately owned cloud computing resources energy consumption of isomery
CN109976890B (en) * 2019-03-28 2023-05-30 东南大学 Variable frequency method for minimizing heterogeneous private cloud computing resource energy consumption
CN111090783A (en) * 2019-12-18 2020-05-01 北京百度网讯科技有限公司 Recommendation method, device and system, graph-embedded wandering method and electronic equipment
CN111090783B (en) * 2019-12-18 2023-10-03 北京百度网讯科技有限公司 Recommendation method, device and system, graph embedded wandering method and electronic equipment

Also Published As

Publication number Publication date
CN103970613B (en) 2017-05-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170524