CN103197982B - A kind of task local optimum check point interval searching method - Google Patents
- Publication number
- CN103197982B (application number CN201310104518.5A)
- Authority
- CN
- China
- Prior art keywords
- task
- fault
- local optimum
- worst
- expense
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The present invention relates to the field of transient-fault tolerance for tasks in computer systems, and in particular to a method for searching for the locally optimal checkpoint interval used to tolerate transient faults of a task. The method comprises: setting the initial value of the task's worst-case response time; dividing the task's worst-case response time by the system fault occurrence interval and rounding the quotient; obtaining the locally optimal number of checkpoints; obtaining the worst-case response time of the task; comparing the rounded quotient of the worst-case response time and the fault occurrence interval with the current fault count; and determining the fault count and the locally optimal number of checkpoints. The invention gives a task transient-fault tolerance while minimizing the overhead of checkpointing and rollback recovery. The method is simple to implement and can determine the corresponding locally optimal checkpoint interval for differently configured tasks in different, more complex system environments.
Description
Technical field
The present invention relates to the field of transient-fault tolerance for tasks in computer systems, and in particular to a method for searching for the locally optimal checkpoint interval used to tolerate transient faults of a task.
Background art
With the advance of informatization, computers have become indispensable in scientific research, commerce, and military operations, and improving their performance is a constant goal. As performance improves, advances in integrated-circuit manufacturing reduce transistor size and operating voltage, and higher integration density makes chips more susceptible to transient faults. Faster devices also consume more power, which shortens device lifetime, lowers device reliability, and thereby affects system reliability. Harsh environments expose computers to even more severe transient faults: high-energy particle radiation in the external environment and electronic noise such as voltage disturbances and electromagnetic interference (EMI) can momentarily charge or discharge a transistor's PN junction and change its internal logic state. Although a transient fault generally does not damage the physical circuit and does not persist, it can disturb normal system operation and, in severe cases, crash the system. Transient faults affect both the performance of the computer and the reliable, effective, and correct execution of its tasks, so transient-fault tolerance is essential to guaranteeing system reliability.
Fault tolerance is built on redundant resources (hardware, time, information, and software redundancy) through suitably designed architectures and algorithms. A transient fault is momentary and highly random, the same fault is very unlikely to recur, and the fault does not persist: once new data is written, the fault symptom disappears. Software-based transient-fault tolerance can therefore detect and recover from errors by re-executing the program. Time-redundant fault tolerance based on checkpointing and rollback recovery is simple to implement and tolerates transient faults by rolling back. When transient faults occur repeatedly, checkpointing and rollback recovery restores the task to an earlier correct state and lets it continue from there. The lost computation is limited to the work done between the last checkpoint and the moment the fault occurred, which avoids the execution time wasted by re-executing the whole task.
Checkpointing and rollback recovery add overhead to task execution, mainly the checkpointing overhead in the fault-free case and the rollback recovery overhead after a fault; fault detection adds overhead as well. Checkpointing, fault recovery, and fault detection are all required for fault tolerance. Their overheads differ from task to task, and even for the same task they differ with time and load. Minimizing the overhead caused by these three operations while providing fault tolerance has therefore become a focus of research. The execution time of a task is closely related to its checkpoint interval: the locally optimal checkpoint interval provides fault tolerance while minimizing the extra execution overhead that fault tolerance introduces. The overhead of checkpoint-based rollback recovery is determined by the checkpoint interval, so the checkpoint interval is critical to the fault-tolerant scheduling performance of real-time tasks. Under different fault occurrence intervals, a task experiences a different number of faults, its locally optimal checkpoint interval differs, and so does its worst-case response time.
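The trade-off described here can be illustrated with a small brute-force scan. This is only a sketch under an assumed cost model that this text does not state: fault-free overhead of n * (alpha + O) for n checkpoints, plus k * (C / n + mu) + (k - 1) * alpha for recovering from k faults. The function names `extra_time` and `best_checkpoint_count` are hypothetical.

```python
def extra_time(n, k, C, O, alpha, mu):
    """Extra execution time (beyond C) with n checkpoints and k faults.

    Assumed cost model, not taken from this text:
      fault-free part: n * (alpha + O)
      recovery part:   k * (C / n + mu) + (k - 1) * alpha
    """
    return n * (alpha + O) + k * (C / n + mu) + (k - 1) * alpha


def best_checkpoint_count(k, C, O, alpha, mu, n_max=50):
    """Scan n = 1..n_max and return the n with the smallest extra time."""
    return min(range(1, n_max + 1),
               key=lambda n: extra_time(n, k, C, O, alpha, mu))


if __name__ == "__main__":
    # Illustrative task: C = 40 ms, O = 2 ms, alpha = 3 ms, mu = 1 ms, one fault.
    for n in range(1, 7):
        print(n, round(extra_time(n, 1, 40, 2, 3, 1), 2))
    print("best n:", best_checkpoint_count(1, 40, 2, 3, 1))
```

Too few checkpoints make each rollback expensive, while too many inflate the fault-free overhead, so under this model the extra time has a minimum at an intermediate checkpoint count.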
In 2001, S. Punnekkat, A. Burns, and R. Davis published the article "Analysis of checkpointing for real-time systems" in the journal "Real-Time Systems", which proposed a formula for the optimal checkpoint interval when a single fault occurs. In practice, however, a task is quite likely to experience multiple faults during execution, so this checkpoint interval does not fit the actual execution of a task. In 2009, Paul Pop, Viacheslav Izosimov, Petru Eles, and Zebo Peng published the article "Design Optimization of Time- and Cost-Constrained Fault-Tolerant Embedded Systems With Checkpointing and Replication" in the journal "IEEE Trans on Very Large Scale Integration Systems". That article proposed a fault-tolerant task execution model for the case of k faults, established a formula for the worst-case response time of a task, and derived a formula for the task's optimal checkpoint interval under k faults. The fault count k, however, is an assumed constant there, which again does not fit actual execution: at run time, the number of faults k that occur while task τ_i executes is determined by the task's worst-case response time R_i(n_i, k) and the system's fault occurrence interval T_e. When T_e is smaller than the task's worst-case response time, the task may experience multiple faults. Differently configured tasks have different fault counts, and even the same task has a different fault occurrence interval, and hence a different fault count, in a different system environment. The present invention is designed for this reason: the method determines the locally optimal checkpoint interval of a task from the task's configuration (worst-case execution time C_i, checkpointing overhead O_i, rollback recovery overhead μ_i, fault detection overhead α_i) and the configuration of its runtime environment (fault occurrence interval T_e).
Summary of the invention
The object of the invention is to provide a method for searching for the locally optimal checkpoint interval of tasks in a fault-tolerant computer system, adaptable to more complex environments.
The present invention comprises the following steps:
(1) Set the task's worst-case execution time C_i as the initial value of the task's worst-case response time R_i(n_i, k);
(2) Divide the task's worst-case response time R_i(n_i, k) by the system fault occurrence interval T_e, round the quotient up, i.e. ⌈R_i(n_i, k) / T_e⌉, and set the result as the current value of the fault count k;
(3) From the task's worst-case execution time C_i, fault detection overhead α_i, checkpointing overhead O_i, and the current value of the fault count k, obtain the locally optimal number of checkpoints n_i, which is given by one of two cases [formulas not reproduced in this text];
(4) From the task's worst-case execution time C_i, fault detection overhead α_i, checkpointing overhead O_i, rollback recovery overhead μ_i, and the current values of the fault count k and the locally optimal number of checkpoints n_i, obtain the task's worst-case response time R_i(n_i, k) [formula not reproduced in this text];
(5) If the value of ⌈R_i(n_i, k) / T_e⌉ is greater than the current value of k, add 1 to the current value of the fault count k and go to step (3); otherwise go to step (6);
(6) The current value of k is the fault count, and the current n_i is the locally optimal number of checkpoints.
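The formulas for steps (2) to (5) were images in the original publication and are absent from this text. The block below is a hedged reconstruction, assuming they follow the k-fault model of the cited Pop et al. article. It reproduces every value of k and n_i reported in Examples 1 and 2, but the exact form and the helper symbols n_i^- and n_i^+ are assumptions rather than text from the patent.

```latex
% Step (2) and the test in step (5): fault count implied by the response time
k = \left\lceil \frac{R_i(n_i, k)}{T_e} \right\rceil

% Step (3): locally optimal number of checkpoints (assumed two-case form)
n_i =
\begin{cases}
n_i^- = \left\lfloor \sqrt{\dfrac{k\,C_i}{\alpha_i + O_i}} \right\rfloor,
  & \text{if } C_i \le \dfrac{n_i^-\,(n_i^- + 1)\,(\alpha_i + O_i)}{k}, \\[2ex]
n_i^+ = \left\lceil \sqrt{\dfrac{k\,C_i}{\alpha_i + O_i}} \right\rceil,
  & \text{if } C_i > \dfrac{n_i^-\,(n_i^- + 1)\,(\alpha_i + O_i)}{k}.
\end{cases}

% Step (4): worst-case response time (assumed form)
R_i(n_i, k) = C_i + n_i\,(\alpha_i + O_i)
            + k \left( \frac{C_i}{n_i} + \mu_i \right) + (k - 1)\,\alpha_i
```

Under this reading, the first two terms of R_i(n_i, k) are the fault-free execution time including checkpointing and detection, and the remaining terms are the cost of re-executing one checkpoint segment per fault.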
The beneficial effects of the present invention are:
The invention determines the locally optimal checkpoint interval of a task from the task's worst-case execution time, checkpointing overhead, fault detection overhead, and rollback recovery overhead, together with the system's fault occurrence interval, and gives the task transient-fault tolerance while minimizing the overhead of checkpointing and rollback recovery. The method is simple to implement and can determine the corresponding locally optimal checkpoint interval for differently configured tasks in different, more complex system environments.
Brief description of the drawings
Fig. 1 is a flowchart of the locally optimal checkpoint interval search.
Detailed description of the embodiments
According to the configuration attributes of a task, the method searches for the task's locally optimal checkpoint interval iteratively, so that the task gains transient-fault tolerance while its extra execution time is minimized. While providing fault tolerance, the invention minimizes the overhead caused by checkpointing in the fault-free case, by fault detection, and by rollback recovery after a fault. From the task's response time and the fault occurrence interval, the method determines the number of faults the task actually experiences, and from that the task's locally optimal number of checkpoints.
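The iterative search can be sketched in Python as follows. The two formulas inside `optimal_checkpoints` and `worst_case_response_time` are assumptions: they follow the k-fault model of the cited Pop et al. article and are not reproduced in this text. All function names and the `max_iter` guard are hypothetical additions.

```python
import math


def optimal_checkpoints(k, C, O, alpha):
    """Locally optimal checkpoint count for k faults (assumed two-case rule)."""
    root = math.sqrt(k * C / (O + alpha))
    n_low = math.floor(root)
    # Assumed rule: keep the floor when C is small enough, else round up.
    if C <= n_low * (n_low + 1) * (O + alpha) / k:
        return n_low
    return math.ceil(root)


def worst_case_response_time(n, k, C, O, alpha, mu):
    """Assumed model: fault-free time plus the cost of recovering from k faults."""
    return C + n * (alpha + O) + k * (C / n + mu) + (k - 1) * alpha


def search_checkpoint_count(C, O, alpha, mu, T_e, max_iter=1000):
    """Return (k, n, R) once the fault count k stops growing."""
    R = C                                    # step (1)
    k = math.ceil(R / T_e)                   # step (2)
    for _ in range(max_iter):
        n = optimal_checkpoints(k, C, O, alpha)               # step (3)
        R = worst_case_response_time(n, k, C, O, alpha, mu)   # step (4)
        if math.ceil(R / T_e) > k:           # step (5)
            k += 1
        else:
            return k, n, R                   # step (6)
    # Guard added for this sketch: a very small T_e may never converge.
    raise RuntimeError("fault count did not converge within max_iter passes")


if __name__ == "__main__":
    k, n, R = search_checkpoint_count(C=50, O=2, alpha=3, mu=2, T_e=30)
    # Assuming equidistant checkpoints, the interval is C / n.
    print(k, n, round(R, 2), round(50 / n, 2))
```

Converting the checkpoint count into an interval as C / n assumes equally spaced checkpoints, which this text does not state explicitly.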
The invention is described in more detail below with reference to the examples:
Fig. 1 shows the flowchart of the task locally optimal checkpoint interval search method; the implementation of the method is explained in detail with the flowchart and the examples.
Example 1:
The configuration of task τ_i is: worst-case execution time C_i = 40 ms, checkpointing overhead O_i = 2 ms, fault detection overhead α_i = 3 ms, rollback recovery overhead μ_i = 1 ms, and system fault occurrence interval T_e = 80 ms. The detailed steps of the locally optimal checkpoint interval search are as follows:
(1) Assign C_i to R_i(n_i, k) as its initial value, i.e. R_i(n_i, k) = 40;
(2) From R_i(n_i, k) = 40 and T_e = 80, compute the current value of the fault count k, i.e. k = ⌈40 / 80⌉ = 1;
(3) From C_i = 40, O_i = 2, α_i = 3 and k = 1, obtain the locally optimal number of checkpoints n_i = 3;
(4) From C_i = 40, O_i = 2, α_i = 3, μ_i = 1, k = 1 and n_i = 3, obtain the worst-case response time R_i(n_i, k) of the task;
(5) ⌈R_i(n_i, k) / T_e⌉ equals k;
(6) The fault count is therefore k = 1, and the locally optimal number of checkpoints is n_i = 3.
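The intermediate values of Example 1 are not present in this text. As a check, the arithmetic below works them out under the assumption that the checkpoint-count and response-time formulas take the form used in the cited Pop et al. article; the value 69.33 is therefore a derived figure, not one quoted from the patent.

```latex
\begin{align*}
\sqrt{\frac{k\,C_i}{\alpha_i + O_i}} &= \sqrt{\frac{1 \cdot 40}{3 + 2}} = \sqrt{8} \approx 2.83,
  \qquad n_i^- = 2, \\
\frac{n_i^-\,(n_i^- + 1)\,(\alpha_i + O_i)}{k} &= \frac{2 \cdot 3 \cdot 5}{1} = 30 < C_i = 40
  \;\Rightarrow\; n_i = \lceil 2.83 \rceil = 3, \\
R_i(3, 1) &= 40 + 3\,(3 + 2) + 1 \cdot \left( \tfrac{40}{3} + 1 \right) + 0 \cdot 3 \approx 69.33, \\
\left\lceil \frac{69.33}{80} \right\rceil &= 1 = k .
\end{align*}
```

Because the rounded quotient equals the current k, the search stops after a single pass, which matches the reported result k = 1, n_i = 3.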
Example 2:
The configuration of task τ_i is: worst-case execution time C_i = 50 ms, checkpointing overhead O_i = 2 ms, fault detection overhead α_i = 3 ms, rollback recovery overhead μ_i = 2 ms, and system fault occurrence interval T_e = 30 ms. The detailed steps of the locally optimal checkpoint interval search are as follows:
(1) Assign C_i to R_i(n_i, k) as its initial value, i.e. R_i(n_i, k) = 50;
(2) From R_i(n_i, k) = 50 and T_e = 30, compute the current value of the fault count k, i.e. k = ⌈50 / 30⌉ = 2;
(3) From C_i = 50, O_i = 2, α_i = 3 and k = 2, obtain the locally optimal number of checkpoints n_i = 4;
(4) From C_i = 50, O_i = 2, α_i = 3, μ_i = 2, k = 2 and n_i = 4, obtain the worst-case response time R_i(n_i, k) of the task;
(5) ⌈R_i(n_i, k) / T_e⌉ is greater than k, so k = 2 + 1 = 3;
(6) From C_i = 50, O_i = 2, α_i = 3 and k = 3, obtain the locally optimal number of checkpoints n_i = 5;
(7) From C_i = 50, O_i = 2, α_i = 3, μ_i = 2, k = 3 and n_i = 5, obtain the worst-case response time R_i(n_i, k) of the task;
(8) ⌈R_i(n_i, k) / T_e⌉ is greater than k, so k = 3 + 1 = 4;
(9) From C_i = 50, O_i = 2, α_i = 3 and k = 4, obtain the locally optimal number of checkpoints n_i = 6;
(10) From C_i = 50, O_i = 2, α_i = 3, μ_i = 2, k = 4 and n_i = 6, obtain the worst-case response time R_i(n_i, k) of the task;
(11) ⌈R_i(n_i, k) / T_e⌉ is greater than k, so k = 4 + 1 = 5;
(12) From C_i = 50, O_i = 2, α_i = 3 and k = 5, obtain the locally optimal number of checkpoints n_i = 7;
(13) From C_i = 50, O_i = 2, α_i = 3, μ_i = 2, k = 5 and n_i = 7, obtain the worst-case response time R_i(n_i, k) of the task;
(14) ⌈R_i(n_i, k) / T_e⌉ equals k;
(15) The fault count is therefore k = 5, and the locally optimal number of checkpoints is n_i = 7.
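The passes of Example 2 can be replayed with a short trace generator. It is a sketch only: the checkpoint-count and response-time formulas are assumed to follow the cited Pop et al. model, since they are not reproduced in this text, and `trace_search` is a hypothetical name. The response times it prints are derived from that assumption, not quoted from the patent.

```python
import math


def trace_search(C, O, alpha, mu, T_e, max_iter=100):
    """Yield (k, n, R, ceil(R / T_e)) for each pass through the search loop."""
    k = math.ceil(C / T_e)  # initial fault count from R = C
    for _ in range(max_iter):
        root = math.sqrt(k * C / (O + alpha))
        n_low = math.floor(root)
        # Assumed two-case rule for the locally optimal checkpoint count.
        if C <= n_low * (n_low + 1) * (O + alpha) / k:
            n = n_low
        else:
            n = math.ceil(root)
        # Assumed worst-case response time model.
        R = C + n * (alpha + O) + k * (C / n + mu) + (k - 1) * alpha
        faults = math.ceil(R / T_e)
        yield k, n, R, faults
        if faults <= k:
            return  # fault count is stable, search ends
        k += 1      # otherwise increase k by one and repeat


if __name__ == "__main__":
    print("k  n  R        ceil(R/T_e)")
    for k, n, R, faults in trace_search(C=50, O=2, alpha=3, mu=2, T_e=30):
        print(f"{k}  {n}  {R:7.2f}  {faults}")
```

Note that k grows by exactly one per pass, as in the steps of the example, even when the rounded quotient exceeds k by more than one.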
Claims (1)
1. A method for searching for the locally optimal checkpoint interval of a task, characterized in that it comprises the following steps:
(1) Set the task's worst-case execution time C_i as the initial value of the task's worst-case response time R_i(n_i, k);
(2) Divide the task's worst-case response time R_i(n_i, k) by the system fault occurrence interval T_e, round the quotient up, i.e. ⌈R_i(n_i, k) / T_e⌉, and set the result as the current value of the fault count k;
(3) From the task's worst-case execution time C_i, fault detection overhead α_i, checkpointing overhead O_i, and the current value of the fault count k, obtain the locally optimal number of checkpoints n_i, which is given by one of two cases [formulas not reproduced in this text];
(4) From the task's worst-case execution time C_i, fault detection overhead α_i, checkpointing overhead O_i, rollback recovery overhead μ_i, and the current values of the fault count k and the locally optimal number of checkpoints n_i, obtain the task's worst-case response time R_i(n_i, k) [formula not reproduced in this text];
(5) If the value of ⌈R_i(n_i, k) / T_e⌉ is greater than the current value of k, add 1 to the current value of the fault count k and go to step (3); otherwise go to step (6);
(6) The current value of k is the fault count, and the current n_i is the locally optimal number of checkpoints.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310104518.5A CN103197982B (en) | 2013-03-28 | 2013-03-28 | A kind of task local optimum check point interval searching method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103197982A CN103197982A (en) | 2013-07-10 |
CN103197982B true CN103197982B (en) | 2016-03-09 |
Family
ID=48720570
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310104518.5A Expired - Fee Related CN103197982B (en) | 2013-03-28 | 2013-03-28 | A kind of task local optimum check point interval searching method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103197982B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103560915A (en) * | 2013-11-07 | 2014-02-05 | 浪潮(北京)电子信息产业有限公司 | Method and system for managing resources in cloud computing system |
US9348710B2 (en) * | 2014-07-29 | 2016-05-24 | Saudi Arabian Oil Company | Proactive failure recovery model for distributed computing using a checkpoint frequency determined by a MTBF threshold |
CN106383995B (en) * | 2016-09-05 | 2018-08-07 | 南京臻融软件科技有限公司 | A kind of checkpoint laying method based on node failure relevance |
CN111124720B (en) * | 2019-12-26 | 2021-05-04 | 江南大学 | Self-adaptive check point interval dynamic setting method |
CN111682981B (en) * | 2020-06-02 | 2021-09-14 | 深圳大学 | Check point interval setting method and device based on cloud platform performance |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1457577A (en) * | 2001-03-06 | 2003-11-19 | 皇家菲利浦电子有限公司 | System, method and measuring node for determining worst case gap-count value in multi-station network |
CN101303657A (en) * | 2008-06-13 | 2008-11-12 | 上海大学 | Method of optimization of multiprocessor real-time task execution power consumption |
CN102541646A (en) * | 2010-12-09 | 2012-07-04 | 中国科学院沈阳计算技术研究所有限公司 | Task scheduling method suitable for hard real-time system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100589362C (en) * | 2007-09-20 | 2010-02-10 | 哈尔滨工程大学 | Check point migration method under error tolerance mobile computing environment |
CN102369514B (en) * | 2011-08-31 | 2013-09-11 | 华为技术有限公司 | Method and system for establishing detection points |
Non-Patent Citations (2)
Title |
---|
Punnekkat S. et al., "Analysis of checkpointing for real-time systems", Real-Time Systems, 2001-01-31, pp. 83-102 * |
Paul Pop et al., "Design optimization of time- and cost-constrained fault-tolerant embedded systems with checkpointing and replication", IEEE Trans on Very Large Scale Integration Systems, 2009-03-31, p. 395, left column, paragraphs 2-7, and right column, paragraphs 4-6 * |
Also Published As
Publication number | Publication date |
---|---|
CN103197982A (en) | 2013-07-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160309 |