CN103197982B - Method for searching for the locally optimal checkpoint interval of a task

Info

Publication number: CN103197982B
Application number: CN201310104518.5A
Authority: CN
Inventors: 门朝光; 何忠政; 李香
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2013-03-28
Filing date: 2013-03-28
Publication date: 2016-03-09
Anticipated expiration: 2033-03-28
Also published as: CN103197982A

Abstract

The present invention relates to the field of transient-fault tolerance for computer system tasks, and specifically to a method for searching for the locally optimal checkpoint interval of a task under transient-fault tolerance. The method comprises: setting the initial value of the task's worst-case response time; dividing the worst-case response time by the system fault occurrence interval and rounding the quotient up; obtaining the locally optimal number of checkpoints; obtaining the task's worst-case response time; comparing the newly rounded quotient with the current number of faults; and determining the number of faults and the locally optimal number of checkpoints. The invention gives tasks transient-fault tolerance while minimizing the overhead of checkpointing and rollback recovery. The method is simple to implement and can determine the corresponding locally optimal checkpoint interval for differently configured tasks in different, more complex system environments.

Description

Method for searching for the locally optimal checkpoint interval of a task

Technical field

The present invention relates to the field of transient-fault tolerance for computer system tasks, and specifically to a method for searching for the locally optimal checkpoint interval of a task under transient-fault tolerance.

Background art

With the advance of informatization, computers have become indispensable equipment for scientific research, commerce, and military operations, and improving their performance is a constant goal. As performance improves, advances in integrated-circuit manufacturing reduce transistor sizes and operating voltages and raise integration density, which makes chips more susceptible to transient faults. Faster devices also consume more power, which shortens device lifetime, lowers device reliability, and in turn degrades system reliability. Harsh environments expose computers to even more severe transient faults: high-energy particle radiation and electronic noise such as voltage disturbances and electromagnetic interference (EMI) can cause transistor PN junctions to charge and discharge momentarily, changing their internal logic state. Although a transient fault generally does not damage the physical circuit and does not persist, it may disrupt normal system operation and, in severe cases, crash the system. Because transient faults affect both computer performance and the reliable, efficient, and correct execution of tasks, transient-fault-tolerance techniques are essential to guaranteeing system reliability.

Fault tolerance builds on redundant resources (hardware, time, information, and software redundancy) combined with a well-designed architecture and algorithms. Transient faults are momentary and highly random, the same fault is very unlikely to recur, the fault does not persist, and the fault symptom disappears once new data are written. Software-based transient-fault tolerance can therefore implement error detection and recovery by re-executing the program. Time-redundant fault tolerance based on checkpointing and rollback recovery is simple to implement and tolerates transient faults through rollback. Even when transient faults occur repeatedly, checkpointing and rollback recovery can restore a task to an earlier correct state and resume execution from there. The lost computation is thus limited to the work done between the last checkpoint and the moment of the fault, which avoids the wasted execution time of re-running the whole task.

Checkpointing and rollback recovery add overhead to task execution, mainly the checkpointing overhead incurred when no fault occurs and the rollback-recovery overhead incurred after a fault; fault detection adds overhead as well. Checkpointing, fault recovery, and fault detection are all required to achieve fault tolerance. The three overheads differ from task to task, and even for the same task they differ at different times and under different loads. Minimizing the overhead caused by these three operations while achieving fault tolerance has therefore become a focus of research. A task's execution time is closely tied to its checkpoint interval: a locally optimal checkpoint interval provides fault tolerance while minimizing the extra execution overhead that fault tolerance introduces. Because the checkpoint interval determines the rollback-recovery overhead, it is critical to the fault-tolerant scheduling performance of real-time tasks. Under different fault occurrence intervals, a task experiences a different number of faults, so its locally optimal checkpoint interval differs, and so does its worst-case response time.

In 2001, Punnekkat S., Burns A., and Davis R. published the article "Analysis of checkpointing for real-time systems" in the journal "Real-Time Systems", which proposed a formula for the optimal checkpoint interval under a single fault. In practice, however, a task is quite likely to experience multiple faults during execution, so this checkpoint interval does not fit actual task execution. In 2009, Paul Pop, Viacheslav Izosimov, Petru Eles, and Zebo Peng published the article "Design Optimization of Time- and Cost-Constrained Fault-Tolerant Embedded Systems With Checkpointing and Replication" in the journal "IEEE Transactions on Very Large Scale Integration (VLSI) Systems". That article proposed a fault-tolerant execution model for a task subject to k faults, established a formula for the task's worst-case response time, and derived a formula for the task's optimal checkpoint interval under k faults. However, it takes the number of faults k to be an assumed constant, which also does not fit actual task execution: at run time, the number of faults k that occur while task τ_i executes is determined by the task's worst-case response time R_i(n_i, k) and the system's fault occurrence interval T_e. When T_e is smaller than the task's worst-case response time, the task may experience multiple faults. Tasks with different configurations experience different numbers of faults, and even the same task faces different fault occurrence intervals in different system environments, which again changes its number of faults. The present invention was designed for this reason: the method determines the locally optimal checkpoint interval of a task from the task's configuration (worst-case execution time C_i, checkpointing overhead O_i, rollback-recovery overhead μ_i, fault-detection overhead α_i) and the configuration of its run-time environment (fault occurrence interval T_e).

Summary of the invention

The object of the present invention is to provide a method for searching for the locally optimal checkpoint interval of tasks in fault-tolerant computer systems that adapts to more complex environments.

The present invention comprises the following steps:

(1) Set the task's worst-case execution time C_i as the initial value of the task's worst-case response time R_i(n_i, k);

(2) Divide the task's worst-case response time R_i(n_i, k) by the system fault occurrence interval T_e, round the quotient up, and set the result as the current value of the number of faults k, namely

k = \left\lceil \frac{R_{i} (n_{i}, k)}{T_{e}} \right\rceil;

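(3) From the task's worst-case execution time C_i, fault-detection overhead α_i, checkpointing overhead O_i, and the current value of the number of faults k, obtain the locally optimal number of checkpoints n_i:

n_{i} = \left\lfloor \sqrt{\frac{k \cdot C_{i}}{O_{i} + α_{i}}} \right\rfloor, if

C_{i} \leq n_{i} (n_{i} + 1) \cdot \frac{O_{i} + α_{i}}{k},

n_{i} = \left\lceil \sqrt{\frac{k \cdot C_{i}}{O_{i} + α_{i}}} \right\rceil, if

C_{i} > n_{i} (n_{i} + 1) \cdot \frac{O_{i} + α_{i}}{k},

where n_i in both conditions denotes the rounded-down value;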

(4) From the task's worst-case execution time C_i, fault-detection overhead α_i, checkpointing overhead O_i, rollback-recovery overhead μ_i, the current value of the number of faults k, and the locally optimal number of checkpoints n_i, obtain the task's worst-case response time R_i(n_i, k):

R_{i} (n_{i}, k) = C_{i} + n_{i} * (α_{i} + O_{i}) + (\frac{C_{i}}{n_{i}} + μ_{i} + α_{i}) * k;

(5) If \lceil R_{i} (n_{i}, k) / T_{e} \rceil is greater than the current value of k, add 1 to the current value of the number of faults k and return to step (3); otherwise go to step (6);

(6) The current value of k is the number of faults, and the current n_i is the locally optimal number of checkpoints.
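Steps (1) to (6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names and the `max_iter` guard are introduced here, the rounding in steps (2) and (5) is taken to be a ceiling, and n_i is taken to be the floor or the ceiling of sqrt(k * C_i / (O_i + α_i)). Both assumptions reproduce the values of Examples 1 and 2 in the embodiment.

```python
import math


def optimal_checkpoints(C, O, alpha, k):
    """Step (3): pick n_i between floor and ceil of sqrt(k*C/(O+alpha))."""
    root = math.sqrt(k * C / (O + alpha))
    n_low = max(1, math.floor(root))   # at least one segment (assumption)
    n_high = max(1, math.ceil(root))
    # Keep the floor when one more checkpoint would not reduce R.
    if C <= n_low * (n_low + 1) * (O + alpha) / k:
        return n_low
    return n_high


def worst_case_response(C, O, alpha, mu, n, k):
    """Step (4): R_i(n_i, k) for n checkpoints and k faults."""
    return C + n * (alpha + O) + (C / n + mu + alpha) * k


def search_checkpoint_count(C, O, alpha, mu, T_e, max_iter=10_000):
    """Steps (1)-(6): return (k, n, R) once the fault count stops growing."""
    R = C                                   # step (1)
    k = math.ceil(R / T_e)                  # step (2)
    for _ in range(max_iter):               # guard is an assumption, not in the source
        n = optimal_checkpoints(C, O, alpha, k)          # step (3)
        R = worst_case_response(C, O, alpha, mu, n, k)   # step (4)
        if math.ceil(R / T_e) > k:          # step (5)
            k += 1
        else:
            return k, n, R                  # step (6)
    raise RuntimeError("no fixed point found; T_e may be too small")


if __name__ == "__main__":
    print(search_checkpoint_count(C=40, O=2, alpha=3, mu=1, T_e=80))
    print(search_checkpoint_count(C=50, O=2, alpha=3, mu=2, T_e=30))
```

With the parameters of Example 1 and Example 2 the sketch returns k = 1, n_i = 3 and k = 5, n_i = 7 respectively. The guard matters because the iteration cannot terminate when μ_i + α_i ≥ T_e: every additional fault then lengthens the response time by at least one fault interval.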

The beneficial effects of the present invention are as follows:

The present invention determines the locally optimal checkpoint interval of a task from the task's worst-case execution time, checkpointing overhead, fault-detection overhead, and rollback-recovery overhead together with the system's fault occurrence interval, and gives the task transient-fault tolerance while minimizing the overhead of checkpointing and rollback recovery. The method is simple to implement and can determine the corresponding locally optimal checkpoint interval for differently configured tasks in different, more complex system environments.
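The minimization of checkpointing and rollback-recovery overhead can be motivated directly from the response-time formula of step (4). The derivation below is a sketch added for illustration, not part of the patent text; it first treats n_i as a continuous variable and then compares neighbouring integers n and n + 1.

```latex
\begin{aligned}
R_i(n_i, k) &= C_i + n_i(\alpha_i + O_i) + \left(\frac{C_i}{n_i} + \mu_i + \alpha_i\right) k \\
\frac{\partial R_i}{\partial n_i} &= (\alpha_i + O_i) - \frac{k\,C_i}{n_i^{2}} = 0
\;\Longrightarrow\; n_i^{*} = \sqrt{\frac{k\,C_i}{O_i + \alpha_i}} \\
R_i(n, k) \le R_i(n+1, k) &\iff \frac{k\,C_i}{n(n+1)} \le O_i + \alpha_i
\iff C_i \le n(n+1)\cdot\frac{O_i + \alpha_i}{k}
\end{aligned}
```

Because the second derivative 2kC_i/n_i^3 is positive, R_i is convex in n_i, so the best integer is either the floor or the ceiling of n_i^*, and the last line gives the test for keeping the floor.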

Brief description of the drawings

Fig. 1 is a flowchart of the locally optimal checkpoint interval search.

Embodiments

The method searches for the locally optimal checkpoint interval of a task iteratively, based on the task's configuration attributes, so that the task's extra execution time is minimized while the task is given transient-fault tolerance. In providing fault tolerance, the invention minimizes the overhead caused by checkpointing in the fault-free case, by fault detection, and by rollback recovery after a fault. The method determines the actual number of faults of a task from the task's response time and the fault occurrence interval, and from that determines the task's locally optimal number of checkpoints.

The present invention is described in more detail below with reference to the examples:

Fig. 1 shows the flowchart of the task locally optimal checkpoint interval search method; the implementation of the method is explained in detail below with reference to the flowchart and the examples.

Example 1:

The configuration of task τ_i is: worst-case execution time C_i = 40 ms, checkpointing overhead O_i = 2 ms, fault-detection overhead α_i = 3 ms, rollback-recovery overhead μ_i = 1 ms, and system fault occurrence interval T_e = 80 ms. The detailed steps of the locally optimal checkpoint interval search are as follows:

(1) Assign C_i to R_i(n_i, k) as its initial value, i.e. R_i(n_i, k) = 40;

(2) From R_i(n_i, k) = 40 and T_e = 80, compute the current value of the number of faults k, namely k = \lceil 40 / 80 \rceil = 1;

(3) From C_i = 40, O_i = 2, α_i = 3, and k = 1, obtain the locally optimal number of checkpoints n_i = 3;

(4) From C_i = 40, O_i = 2, α_i = 3, μ_i = 1, k = 1, and n_i = 3, obtain the task's worst-case response time

R_{i} (n_{i}, k) = 40 + 3 * (3 + 2) + (\frac{40}{3} + 1 + 3) * 1 \approx 72;

(5) \lceil 72 / 80 \rceil = 1 equals k;

(6) Therefore the number of faults is k = 1 and the locally optimal number of checkpoints is n_i = 3.
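As a cross-check of Example 1, the response-time formula of step (4) can be evaluated for a range of checkpoint counts with k = 1. The snippet below is only an illustrative brute-force check; the helper name `response_time` and the search range 1 to 10 are assumptions made here.

```python
def response_time(C, O, alpha, mu, n, k):
    """Worst-case response time R_i(n_i, k) from step (4)."""
    return C + n * (alpha + O) + (C / n + mu + alpha) * k


# Example 1 parameters (ms): C_i = 40, O_i = 2, alpha_i = 3, mu_i = 1, k = 1.
candidates = {n: response_time(40, 2, 3, 1, n, 1) for n in range(1, 11)}
best_n = min(candidates, key=candidates.get)

for n in range(1, 6):
    print(n, round(candidates[n], 2))
print("best n:", best_n)
```

Among these candidates, n_i = 3 gives the smallest response time (about 72.33 ms), in agreement with step (3), and that value stays below T_e = 80 ms, so a single fault remains the right assumption.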

Example 2:

The configuration of task τ_i is: worst-case execution time C_i = 50 ms, checkpointing overhead O_i = 2 ms, fault-detection overhead α_i = 3 ms, rollback-recovery overhead μ_i = 2 ms, and system fault occurrence interval T_e = 30 ms. The detailed steps of the locally optimal checkpoint interval search are as follows:

(1) Assign C_i to R_i(n_i, k) as its initial value, i.e. R_i(n_i, k) = 50;

(2) From R_i(n_i, k) = 50 and T_e = 30, compute the current value of the number of faults k, namely k = \lceil 50 / 30 \rceil = 2;

(3) From C_i = 50, O_i = 2, α_i = 3, and k = 2, obtain the locally optimal number of checkpoints n_i = 4;

(4) From C_i = 50, O_i = 2, α_i = 3, μ_i = 2, k = 2, and n_i = 4, obtain the task's worst-case response time

R_{i} (n_{i}, k) = 50 + 4 * (3 + 2) + (\frac{50}{4} + 2 + 3) * 2 = 105;

(5) \lceil 105 / 30 \rceil = 4 is greater than k, so k = 2 + 1 = 3;

(6) From C_i = 50, O_i = 2, α_i = 3, and k = 3, obtain the locally optimal number of checkpoints n_i = 5;

(7) From C_i = 50, O_i = 2, α_i = 3, μ_i = 2, k = 3, and n_i = 5, obtain the task's worst-case response time

R_{i} (n_{i}, k) = 50 + 5 * (3 + 2) + (\frac{50}{5} + 2 + 3) * 3 = 120;

(8) \lceil 120 / 30 \rceil = 4 is greater than k, so k = 3 + 1 = 4;

(9) From C_i = 50, O_i = 2, α_i = 3, and k = 4, obtain the locally optimal number of checkpoints n_i = 6;

(10) From C_i = 50, O_i = 2, α_i = 3, μ_i = 2, k = 4, and n_i = 6, obtain the task's worst-case response time

R_{i} (n_{i}, k) = 50 + 6 * (3 + 2) + (\frac{50}{6} + 2 + 3) * 4 \approx 133;

(11) \lceil 133 / 30 \rceil = 5 is greater than k, so k = 4 + 1 = 5;

(12) From C_i = 50, O_i = 2, α_i = 3, and k = 5, obtain the locally optimal number of checkpoints n_i = 7;

(13) From C_i = 50, O_i = 2, α_i = 3, μ_i = 2, k = 5, and n_i = 7, obtain the task's worst-case response time

R_{i} (n_{i}, k) = 50 + 7 * (3 + 2) + (\frac{50}{7} + 2 + 3) * 5 \approx 146;

(14) \lceil 146 / 30 \rceil = 5 equals k;

(15) Therefore the number of faults is k = 5 and the locally optimal number of checkpoints is n_i = 7.
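The fifteen steps of Example 2 repeat the same three operations for k = 2, 3, 4, 5. The sketch below prints one row per iteration. It assumes that the rounding is a ceiling and that n_i is the floor or the ceiling of sqrt(k * C_i / (O_i + α_i)), both of which reproduce the values above; the function name `trace` and the iteration cap `limit` are hypothetical.

```python
import math


def trace(C, O, alpha, mu, T_e, limit=100):
    """Yield (k, n, R, faults) per iteration until faults no longer exceeds k."""
    k = math.ceil(C / T_e)                      # steps (1) and (2)
    for _ in range(limit):                      # cap is an assumption, not in the source
        root = math.sqrt(k * C / (O + alpha))
        n = max(1, math.floor(root))            # try the rounded-down value first
        if C > n * (n + 1) * (O + alpha) / k:   # otherwise take the rounded-up value
            n = max(1, math.ceil(root))
        R = C + n * (alpha + O) + (C / n + mu + alpha) * k
        faults = math.ceil(R / T_e)
        yield k, n, R, faults
        if faults <= k:                         # fault count is consistent: stop
            return
        k += 1


rows = list(trace(C=50, O=2, alpha=3, mu=2, T_e=30))
for k, n, R, faults in rows:
    print(f"k={k} n={n} R={R:.2f} ceil(R/T_e)={faults}")
```

The four rows correspond to k = 2, 3, 4, 5 with n_i = 4, 5, 6, 7, and the rounded response times 105, 120, 133, and 146 match steps (4), (7), (10), and (13).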

Claims

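1. A method for searching for the locally optimal checkpoint interval of a task, characterized in that it comprises the following steps:

(1) Set the task's worst-case execution time C_i as the initial value of the task's worst-case response time R_i(n_i, k);

(2) Divide the task's worst-case response time R_i(n_i, k) by the system fault occurrence interval T_e, round the quotient up, and set the result as the current value of the number of faults k, namely

k = \left\lceil \frac{R_{i} (n_{i}, k)}{T_{e}} \right\rceil;

(3) From the task's worst-case execution time C_i, fault-detection overhead α_i, checkpointing overhead O_i, and the current value of the number of faults k, obtain the locally optimal number of checkpoints n_i:

n_{i} = \left\lfloor \sqrt{\frac{k \cdot C_{i}}{O_{i} + α_{i}}} \right\rfloor, if

C_{i} \leq n_{i} (n_{i} + 1) \cdot \frac{O_{i} + α_{i}}{k},

n_{i} = \left\lceil \sqrt{\frac{k \cdot C_{i}}{O_{i} + α_{i}}} \right\rceil, if

C_{i} > n_{i} (n_{i} + 1) \cdot \frac{O_{i} + α_{i}}{k},

where n_i in both conditions denotes the rounded-down value;

(4) From the task's worst-case execution time C_i, fault-detection overhead α_i, checkpointing overhead O_i, rollback-recovery overhead μ_i, the current value of the number of faults k, and the locally optimal number of checkpoints n_i, obtain the task's worst-case response time R_i(n_i, k):

R_{i} (n_{i}, k) = C_{i} + n_{i} * (α_{i} + O_{i}) + (\frac{C_{i}}{n_{i}} + μ_{i} + α_{i}) * k;

(5) If \lceil R_{i} (n_{i}, k) / T_{e} \rceil is greater than the current value of k, add 1 to the current value of the number of faults k and return to step (3); otherwise go to step (6);

(6) The current value of k is the number of faults, and the current n_i is the locally optimal number of checkpoints.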