CN103197982B - A kind of task local optimum check point interval searching method - Google Patents

A kind of task local optimum check point interval searching method

Info

Publication number: CN103197982B
Authority: CN (China)
Prior art keywords: task, fault, local optimum, worst, expense
Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201310104518.5A
Other languages: Chinese (zh)
Other versions: CN103197982A (en)
Inventors: 门朝光, 何忠政, 李香
Current Assignee: Harbin Engineering University (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Harbin Engineering University
Priority date: 2013-03-28 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2013-03-28
Publication date: 2016-03-09
Application filed by Harbin Engineering University
Priority to CN201310104518.5A
Publication of CN103197982A
Application granted
Publication of CN103197982B
Expired - Fee Related

Abstract

The present invention relates to the field of transient-fault tolerance for computer system tasks, and specifically to a method for searching for the locally optimal checkpoint interval of a task that tolerates transient faults. The method comprises: setting the initial value of the task's worst-case response time; dividing the task's worst-case response time by the system's fault occurrence interval and rounding the quotient up; obtaining the locally optimal number of checkpoints; obtaining the task's worst-case response time; testing whether the rounded-up quotient exceeds the current number of faults; and determining the number of faults and the locally optimal number of checkpoints. The invention provides tasks with transient-fault tolerance while minimizing the fault-tolerance overhead of checkpointing and rollback recovery. The method is simple to implement and can determine the corresponding locally optimal checkpoint interval for differently configured tasks in different, more complex system environments.

Description

A kind of task local optimum check point interval searching method
Technical field
The present invention relates to the field of transient-fault tolerance for computer system tasks, and specifically to a method for searching for the locally optimal checkpoint interval of a task that tolerates transient faults.
Background
With the advance of informatization, computers have become indispensable to scientific research, commerce, and military operations, and improving their performance is a constant goal. As performance has improved, advances in integrated-circuit manufacturing have reduced transistor sizes and operating voltages, and the resulting higher integration makes chips more susceptible to transient faults. Faster devices also consume more power, which shortens device lifetime and lowers device reliability, and thus lowers system reliability. Harsh environments expose computers to even more severe transient faults: high-energy particle radiation from the external environment and electronic noise such as voltage disturbances and electromagnetic interference (EMI) can momentarily charge or discharge a transistor's PN junction and so change its internal logic state. Although a transient fault generally does not damage the physical circuit and does not persist, it can disrupt normal system operation and, in severe cases, crash the system. Transient faults affect both the performance of a computer and the reliable, efficient, and correct execution of its tasks, so transient-fault tolerance techniques are essential to ensuring system reliability.
Fault-tolerance techniques build on redundant resources (hardware redundancy, time redundancy, information redundancy, and software redundancy), combined with a well-designed architecture and algorithms. Transient faults are brief and highly random, the same fault is very unlikely to recur, and the fault does not persist: once new data is written, the fault symptom disappears. Software-based transient-fault tolerance can therefore detect and recover from errors by re-executing the program. Time-redundant fault tolerance based on checkpointing and rollback recovery is simple to implement and tolerates transient faults by rolling back. Even when transient faults occur repeatedly, checkpointing and rollback recovery can restore the task to a past correct state and let it continue from that state. The lost computation is reduced to the work done between the most recent checkpoint and the moment the fault occurred, which avoids the execution time wasted by re-executing the whole task.
Checkpointing and rollback recovery add overhead to task execution, mainly the checkpointing overhead incurred when no fault occurs and the rollback-recovery overhead incurred after a fault; fault detection adds overhead as well. Checkpointing, fault recovery, and fault detection are all required to achieve fault tolerance. Their overheads differ from task to task, and even for the same task they differ at different times and under different loads. Minimizing the overhead caused by these three while still achieving fault tolerance has therefore become a focus of research. A task's execution time is closely tied to its checkpoint interval: a locally optimal checkpoint interval provides the task with fault tolerance while minimizing the extra execution overhead that fault tolerance introduces. The overhead of checkpoint-based rollback recovery is determined by the checkpoint interval, so the checkpoint interval is critical to the fault-tolerant scheduling performance of real-time tasks. Under different fault occurrence intervals, a task experiences a different number of faults and has a different locally optimal checkpoint interval, and therefore a different worst-case response time.
In 2001, S. Punnekkat, A. Burns, and R. Davis published the article "Analysis of checkpointing for real-time systems" in the journal Real-Time Systems, which proposed a formula for the optimal checkpoint interval in the single-fault case. In practice, however, a task is quite likely to experience multiple faults during execution, so this checkpoint interval does not suit actual task execution. In 2009, Paul Pop, Viacheslav Izosimov, Petru Eles, and Zebo Peng published the article "Design Optimization of Time- and Cost-Constrained Fault-Tolerant Embedded Systems With Checkpointing and Replication" in the journal IEEE Trans. on Very Large Scale Integration Systems. That article proposed a fault-tolerant execution model for a task subject to k faults, established a formula for the task's worst-case response time, and derived a formula for the task's optimal checkpoint interval under k faults. However, the number of faults k is assumed to be a fixed value, which also does not suit actual task execution: at run time, the number of faults k that occur while task $\tau_i$ executes is determined by the task's worst-case response time $R_i(n_i, k)$ and the system's fault occurrence interval $T_e$. When $T_e$ is less than the task's worst-case response time, the task may experience multiple faults. Differently configured tasks experience different numbers of faults, and even the same task has a different fault occurrence interval, and hence a different number of faults, in different system environments. The present invention was therefore designed: the method determines a task's locally optimal checkpoint interval from the task's configuration (worst-case execution time $C_i$, checkpointing overhead $O_i$, rollback-recovery overhead $\mu_i$, and fault-detection overhead $\alpha_i$) and the configuration of its runtime environment (fault occurrence interval $T_e$).
Summary of the invention
The object of the present invention is to provide a locally optimal checkpoint interval searching method for tasks in fault-tolerant computer systems that adapts to more complex environments.
The present invention comprises the following steps:
(1) Set the task's worst-case execution time $C_i$ as the initial value of the task's worst-case response time $R_i(n_i, k)$;
(2) Divide the task's worst-case response time $R_i(n_i, k)$ by the system's fault occurrence interval $T_e$ and round the quotient up; this value, $\lceil R_i(n_i, k) / T_e \rceil$, is set as the current value of the number of faults k;
(3) From the task's worst-case execution time $C_i$, fault-detection overhead $\alpha_i$, checkpointing overhead $O_i$, and the current value of the number of faults k, obtain the locally optimal number of checkpoints $n_i$:
$$n_i = \begin{cases} n_i^{-} = \left\lfloor \sqrt{\dfrac{k C_i}{O_i + \alpha_i}} \right\rfloor, & \text{if } C_i \le n_i^{-}(n_i^{-} + 1) \cdot \dfrac{O_i + \alpha_i}{k} \\[2ex] n_i^{+} = \left\lceil \sqrt{\dfrac{k C_i}{O_i + \alpha_i}} \right\rceil, & \text{if } C_i > n_i^{-}(n_i^{-} + 1) \cdot \dfrac{O_i + \alpha_i}{k} \end{cases}$$
(4) From the task's worst-case execution time $C_i$, fault-detection overhead $\alpha_i$, checkpointing overhead $O_i$, rollback-recovery overhead $\mu_i$, the current value of the number of faults k, and the locally optimal number of checkpoints $n_i$, obtain the task's worst-case response time $R_i(n_i, k)$:
$$R_i(n_i, k) = C_i + n_i (\alpha_i + O_i) + \left( \frac{C_i}{n_i} + \mu_i + \alpha_i \right) k$$
(5) If $\lceil R_i(n_i, k) / T_e \rceil$ is greater than the current value of k, add 1 to the current value of the number of faults k and return to step (3); otherwise go to step (6);
(6) The current value of k is the number of faults, and the current $n_i$ is the locally optimal number of checkpoints.
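The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: it assumes that the checkpoint count in step (3) is the floor or ceiling of sqrt(k·C_i / (O_i + α_i)), selected by the inequality in that step, and that the quotient in steps (2) and (5) is rounded up. The function names and the `max_k` guard against non-termination are hypothetical additions for illustration.

```python
import math


def optimal_checkpoints(C, O, alpha, k):
    """Step (3): locally optimal number of checkpoints for k faults."""
    root = math.sqrt(k * C / (O + alpha))
    n_low = math.floor(root)
    # Keep the floor if it satisfies the bound, otherwise take the ceiling.
    if C <= n_low * (n_low + 1) * (O + alpha) / k:
        return n_low
    return math.ceil(root)


def worst_case_response_time(C, O, alpha, mu, n, k):
    """Step (4): worst-case response time R_i(n_i, k)."""
    return C + n * (alpha + O) + (C / n + mu + alpha) * k


def search_checkpoint_interval(C, O, alpha, mu, T_e, max_k=10_000):
    """Iterate steps (1)-(6) and return (k, n, R).

    max_k is an assumed safety limit (not part of the described method):
    if faults arrive faster than one recovery can complete, k never settles.
    """
    R = C                                   # step (1)
    k = math.ceil(R / T_e)                  # step (2)
    while k <= max_k:
        n = optimal_checkpoints(C, O, alpha, k)               # step (3)
        R = worst_case_response_time(C, O, alpha, mu, n, k)   # step (4)
        if math.ceil(R / T_e) > k:          # step (5)
            k += 1
        else:
            return k, n, R                  # step (6)
    raise ValueError("no stable fault count found up to max_k")
```

Called with the values of Example 1 (`40, 2, 3, 1, 80`) the sketch returns k = 1 and n = 3; with the values of Example 2 (`50, 2, 3, 2, 30`) it returns k = 5 and n = 7.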
The beneficial effects of the present invention are:
The invention determines a task's locally optimal checkpoint interval from the task's worst-case execution time, checkpointing overhead, fault-detection overhead, and rollback-recovery overhead together with the system's fault occurrence interval, providing the task with transient-fault tolerance while minimizing the fault-tolerance overhead of checkpointing and rollback recovery. The method is simple to implement and can determine the corresponding locally optimal checkpoint interval for differently configured tasks in different, more complex system environments.
Brief description of the drawings
Fig. 1 is a flowchart of the locally optimal checkpoint interval search.
Detailed description of the embodiments
Based on the task's configuration attributes, the method searches iteratively for the task's locally optimal checkpoint interval, so that the task gains transient-fault tolerance while its extra execution time is minimized. While providing fault tolerance, the invention minimizes the overhead caused by checkpointing in the fault-free case, by fault detection, and by rollback recovery after a fault. The method determines the number of faults a task actually experiences from the task's response time and the fault occurrence interval, and from that determines the task's locally optimal number of checkpoints.
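One way to see where the checkpoint count in step (3) comes from is to treat $n_i$ as a continuous variable in the worst-case response time of step (4) and minimize over it for a fixed k. The derivation below is a sketch under that assumption; the symbols $n_i^{*}$ and $n_i^{-}$ are introduced here for illustration.

```latex
% Worst-case response time for a fixed number of faults k (step 4)
R_i(n_i, k) = C_i + n_i(\alpha_i + O_i)
            + \left(\frac{C_i}{n_i} + \mu_i + \alpha_i\right) k

% Stationary point in n_i; the second derivative 2kC_i/n_i^3 is positive,
% so the stationary point is a minimum
\frac{\partial R_i}{\partial n_i}
  = (\alpha_i + O_i) - \frac{k\,C_i}{n_i^{2}} = 0
\quad\Longrightarrow\quad
n_i^{*} = \sqrt{\frac{k\,C_i}{O_i + \alpha_i}}

% n_i must be an integer. With n_i^- = \lfloor n_i^* \rfloor, compare neighbours:
R_i(n_i^{-} + 1, k) - R_i(n_i^{-}, k)
  = (\alpha_i + O_i) - \frac{k\,C_i}{n_i^{-}(n_i^{-} + 1)} \ge 0
\quad\Longleftrightarrow\quad
C_i \le n_i^{-}(n_i^{-} + 1)\,\frac{O_i + \alpha_i}{k}
```

Under these assumptions, the floor of $n_i^{*}$ is kept exactly when adding one more checkpoint would not shorten the worst-case response time; otherwise the ceiling is used.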
The present invention is described in more detail below with reference to examples:
Fig. 1 shows the flowchart of the task locally optimal checkpoint interval searching method; the implementation of the method is described in detail with reference to the flowchart and the examples.
Example 1:
The configuration of task $\tau_i$ is: worst-case execution time $C_i$ = 40 ms, checkpointing overhead $O_i$ = 2 ms, fault-detection overhead $\alpha_i$ = 3 ms, rollback-recovery overhead $\mu_i$ = 1 ms, and system fault occurrence interval $T_e$ = 80 ms. The detailed steps of the locally optimal checkpoint interval search are as follows:
(1) Assign $C_i$ to $R_i(n_i, k)$ as its initial value, i.e. $R_i(n_i, k) = 40$;
(2) From $R_i(n_i, k) = 40$ and $T_e = 80$, compute the current value of the number of faults k, namely $k = \lceil 40 / 80 \rceil = 1$;
(3) From $C_i = 40$, $O_i = 2$, $\alpha_i = 3$ and $k = 1$, obtain the locally optimal number of checkpoints $n_i = 3$;
(4) From $C_i = 40$, $O_i = 2$, $\alpha_i = 3$, $\mu_i = 1$, $k = 1$ and $n_i = 3$, obtain the task's worst-case response time
$$R_i(n_i, k) = 40 + 3 \times (3 + 2) + \left( \frac{40}{3} + 3 + 1 \right) \times 1 \approx 72;$$
(5) $\lceil 72 / 80 \rceil = 1$ equals k;
(6) Therefore the number of faults is $k = 1$ and the locally optimal number of checkpoints is $n_i = 3$.
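As a cross-check of Example 1, the worst-case response time can be evaluated by brute force for a range of checkpoint counts with k fixed at 1. The snippet is an illustrative sketch: the function name and the candidate range 1 to 8 are arbitrary choices made here, and the response-time expression is the one from step (4).

```python
def response_time(C, O, alpha, mu, n, k):
    """Worst-case response time from step (4)."""
    return C + n * (alpha + O) + (C / n + mu + alpha) * k


# Example 1 parameters, all in milliseconds
C, O, alpha, mu, k = 40, 2, 3, 1, 1

# Evaluate every candidate checkpoint count and pick the smallest R
candidates = {n: response_time(C, O, alpha, mu, n, k) for n in range(1, 9)}
best_n = min(candidates, key=candidates.get)

for n, R in candidates.items():
    print(f"n={n}: R={R:.2f} ms")
print("best n:", best_n)
```

The minimum falls at n = 3 with an unrounded response time of about 72.33 ms, which the example reports as 72.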
Example 2:
The configuration of task $\tau_i$ is: worst-case execution time $C_i$ = 50 ms, checkpointing overhead $O_i$ = 2 ms, fault-detection overhead $\alpha_i$ = 3 ms, rollback-recovery overhead $\mu_i$ = 2 ms, and system fault occurrence interval $T_e$ = 30 ms. The detailed steps of the locally optimal checkpoint interval search are as follows:
(1) Assign $C_i$ to $R_i(n_i, k)$ as its initial value, i.e. $R_i(n_i, k) = 50$;
(2) From $R_i(n_i, k) = 50$ and $T_e = 30$, compute the current value of the number of faults k, namely $k = \lceil 50 / 30 \rceil = 2$;
(3) From $C_i = 50$, $O_i = 2$, $\alpha_i = 3$ and $k = 2$, obtain the locally optimal number of checkpoints $n_i = 4$;
(4) From $C_i = 50$, $O_i = 2$, $\alpha_i = 3$, $\mu_i = 2$, $k = 2$ and $n_i = 4$, obtain the task's worst-case response time
$$R_i(n_i, k) = 50 + 4 \times (3 + 2) + \left( \frac{50}{4} + 2 + 3 \right) \times 2 = 105;$$
(5) $\lceil 105 / 30 \rceil = 4$ is greater than k, so $k = 2 + 1 = 3$;
(6) From $C_i = 50$, $O_i = 2$, $\alpha_i = 3$ and $k = 3$, obtain the locally optimal number of checkpoints $n_i = 5$;
(7) From $C_i = 50$, $O_i = 2$, $\alpha_i = 3$, $\mu_i = 2$, $k = 3$ and $n_i = 5$, obtain the task's worst-case response time
$$R_i(n_i, k) = 50 + 5 \times (3 + 2) + \left( \frac{50}{5} + 2 + 3 \right) \times 3 = 120;$$
(8) $\lceil 120 / 30 \rceil = 4$ is greater than k, so $k = 3 + 1 = 4$;
(9) From $C_i = 50$, $O_i = 2$, $\alpha_i = 3$ and $k = 4$, obtain the locally optimal number of checkpoints $n_i = 6$;
(10) From $C_i = 50$, $O_i = 2$, $\alpha_i = 3$, $\mu_i = 2$, $k = 4$ and $n_i = 6$, obtain the task's worst-case response time
$$R_i(n_i, k) = 50 + 6 \times (3 + 2) + \left( \frac{50}{6} + 2 + 3 \right) \times 4 \approx 133;$$
(11) $\lceil 133 / 30 \rceil = 5$ is greater than k, so $k = 4 + 1 = 5$;
(12) From $C_i = 50$, $O_i = 2$, $\alpha_i = 3$ and $k = 5$, obtain the locally optimal number of checkpoints $n_i = 7$;
(13) From $C_i = 50$, $O_i = 2$, $\alpha_i = 3$, $\mu_i = 2$, $k = 5$ and $n_i = 7$, obtain the task's worst-case response time
$$R_i(n_i, k) = 50 + 7 \times (3 + 2) + \left( \frac{50}{7} + 2 + 3 \right) \times 5 \approx 146;$$
(14) $\lceil 146 / 30 \rceil = 5$ equals k;
(15) The number of faults is therefore $k = 5$ and the locally optimal number of checkpoints is $n_i = 7$.
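The iteration in Example 2 can be replayed as a trace, one row per pass through the checkpoint-count, response-time, and comparison steps. This is a sketch under the same assumptions as the method description (checkpoint count taken as the floor or ceiling of sqrt(k·C_i / (O_i + α_i)), quotient rounded up); the generator name `trace_search` and the `max_iter` guard are hypothetical.

```python
import math


def trace_search(C, O, alpha, mu, T_e, max_iter=1000):
    """Yield (k, n, R) for each pass of the search."""
    k = math.ceil(C / T_e)  # R starts at C, so the first k is ceil(C / T_e)
    for _ in range(max_iter):  # max_iter is an assumed safety guard
        root = math.sqrt(k * C / (O + alpha))
        n = math.floor(root)
        if C > n * (n + 1) * (O + alpha) / k:
            n = math.ceil(root)
        R = C + n * (alpha + O) + (C / n + mu + alpha) * k
        yield k, n, R
        if math.ceil(R / T_e) <= k:
            return  # k is stable: the search has converged
        k += 1


T_e = 30
rows = list(trace_search(50, 2, 3, 2, T_e))
for k, n, R in rows:
    print(f"k={k}  n={n}  R={R:.2f}  ceil(R/T_e)={math.ceil(R / T_e)}")
```

The trace visits k = 2, 3, 4, 5 with n = 4, 5, 6, 7 and stops at k = 5, where the unrounded response time is about 145.71 ms.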

Claims (1)

1. A method for searching for the locally optimal checkpoint interval of a task, characterized in that it comprises the following steps:
(1) Set the task's worst-case execution time $C_i$ as the initial value of the task's worst-case response time $R_i(n_i, k)$;
(2) Divide the task's worst-case response time $R_i(n_i, k)$ by the system's fault occurrence interval $T_e$ and round the quotient up; this value, $\lceil R_i(n_i, k) / T_e \rceil$, is set as the current value of the number of faults k;
(3) From the task's worst-case execution time $C_i$, fault-detection overhead $\alpha_i$, checkpointing overhead $O_i$, and the current value of the number of faults k, obtain the locally optimal number of checkpoints $n_i$:
$$n_i = \begin{cases} n_i^{-} = \left\lfloor \sqrt{\dfrac{k C_i}{O_i + \alpha_i}} \right\rfloor, & \text{if } C_i \le n_i^{-}(n_i^{-} + 1) \cdot \dfrac{O_i + \alpha_i}{k} \\[2ex] n_i^{+} = \left\lceil \sqrt{\dfrac{k C_i}{O_i + \alpha_i}} \right\rceil, & \text{if } C_i > n_i^{-}(n_i^{-} + 1) \cdot \dfrac{O_i + \alpha_i}{k} \end{cases}$$
(4) From the task's worst-case execution time $C_i$, fault-detection overhead $\alpha_i$, checkpointing overhead $O_i$, rollback-recovery overhead $\mu_i$, the current value of the number of faults k, and the locally optimal number of checkpoints $n_i$, obtain the task's worst-case response time $R_i(n_i, k)$:
$$R_i(n_i, k) = C_i + n_i (\alpha_i + O_i) + \left( \frac{C_i}{n_i} + \mu_i + \alpha_i \right) k$$
(5) If $\lceil R_i(n_i, k) / T_e \rceil$ is greater than the current value of k, add 1 to the current value of the number of faults k and return to step (3); otherwise go to step (6);
(6) The current value of k is the number of faults, and the current $n_i$ is the locally optimal number of checkpoints.
CN201310104518.5A 2013-03-28 2013-03-28 A kind of task local optimum check point interval searching method Expired - Fee Related CN103197982B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310104518.5A CN103197982B (en) 2013-03-28 2013-03-28 A kind of task local optimum check point interval searching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310104518.5A CN103197982B (en) 2013-03-28 2013-03-28 A kind of task local optimum check point interval searching method

Publications (2)

Publication Number Publication Date
CN103197982A CN103197982A (en) 2013-07-10
CN103197982B CN103197982B (en) 2016-03-09

Family

ID=48720570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310104518.5A Expired - Fee Related CN103197982B (en) 2013-03-28 2013-03-28 A kind of task local optimum check point interval searching method

Country Status (1)

Country Link
CN (1) CN103197982B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103560915A (en) * 2013-11-07 2014-02-05 浪潮(北京)电子信息产业有限公司 Method and system for managing resources in cloud computing system
US9348710B2 (en) * 2014-07-29 2016-05-24 Saudi Arabian Oil Company Proactive failure recovery model for distributed computing using a checkpoint frequency determined by a MTBF threshold
CN106383995B (en) * 2016-09-05 2018-08-07 南京臻融软件科技有限公司 A kind of checkpoint laying method based on node failure relevance
CN111124720B (en) * 2019-12-26 2021-05-04 江南大学 Self-adaptive check point interval dynamic setting method
CN111682981B (en) * 2020-06-02 2021-09-14 深圳大学 Check point interval setting method and device based on cloud platform performance

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1457577A (en) * 2001-03-06 2003-11-19 皇家菲利浦电子有限公司 System, method and measuring node for determining worst case gap-count value in multi-station network
CN101303657A (en) * 2008-06-13 2008-11-12 上海大学 Method of optimization of multiprocessor real-time task execution power consumption
CN102541646A (en) * 2010-12-09 2012-07-04 中国科学院沈阳计算技术研究所有限公司 Task scheduling method suitable for hard real-time system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100589362C (en) * 2007-09-20 2010-02-10 哈尔滨工程大学 Check point migration method under error tolerance mobile computing environment
CN102369514B (en) * 2011-08-31 2013-09-11 华为技术有限公司 Method and system for establishing detection points

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Punnekkat S. et al.; "Analysis of checkpointing for real-time systems"; Real-Time Systems; 2001-01-31; pp. 83-102 *
Paul Pop et al.; "Design optimization of time- and cost-constrained fault-tolerant embedded systems with checkpointing and replication"; IEEE Trans. on Very Large Scale Integration Systems; 2009-03-31; p. 395, left column paragraphs 2-7, right column paragraphs 4-6 *

Also Published As

Publication number Publication date
CN103197982A (en) 2013-07-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160309