CN112131034A - Checkpoint soft error recovery method based on detector position - Google Patents

Checkpoint soft error recovery method based on detector position

Info

Publication number
CN112131034A
Authority
CN
China
Prior art keywords
program
checkpoint
program segment
time
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011005441.2A
Other languages
Chinese (zh)
Other versions
CN112131034B (en)
Inventor
汪芸
杨娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202011005441.2A
Publication of CN112131034A
Application granted
Publication of CN112131034B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Retry When Errors Occur (AREA)

Abstract

The invention discloses a checkpoint soft error recovery method based on detector position, named DPCKPT. The method comprises the following steps. First, load a program into which detectors have been inserted as the input of the method. Second, deploy initial checkpoints and divide the program into program segments. Third, calculate the time overhead of each program segment. Fourth, judge whether the checkpoints in each program segment are sufficient and, if they are not, add an additional checkpoint to the segment. Fifth, judge whether the checkpoint of each program segment can be deleted, and delete the deletable checkpoints. Building on the checkpoint method with isochronous (equal-time) checkpoint intervals, DPCKPT takes detector positions into account and redeploys the checkpoints placed by that method, which reduces the time overhead introduced by checkpointing and thereby the total running time of the program.

Description

Checkpoint soft error recovery method based on detector position
Technical Field
The invention relates to a checkpoint soft error recovery method based on detector position, and belongs to the technical field of soft error fault tolerance for computers.
Background
A single event upset is the phenomenon in which a single high-energy particle from space strikes a sensitive region of a semiconductor device and flips the logic state of the device. A single event upset may cause a hard error or a soft error in the device. Hard errors are permanent failures and cannot be recovered. Soft errors are transient failures and can be recovered. Soft errors occur approximately 100 times more often than hard errors. Soft error detection and soft error recovery are the two main aspects of handling soft errors.
Source-code-level soft error recovery methods provide recovery for a program at the source code layer and are simple to apply. They mainly use replica techniques and checkpoint techniques to recover from soft errors. A replica-based source-code-level soft error recovery method generates replica variables for program variables and, after an error is detected, determines whether the original variable or the replica holds the correct value. It then repairs the erroneous variable from the correct one and continues running the program. A checkpoint-based source-code-level soft error recovery method takes checkpoints of the program at a checkpoint interval. While a checkpoint is being taken the program is usually blocked, and it resumes once the checkpoint is complete. If a detector detects an error during execution, the program rolls back to the most recent checkpoint and continues running from there. When checkpointing is applied to a program, the total running time includes not only the original running time of the program but also the time overhead introduced by the checkpoint method, such as the checkpointing time and the recovery time when an error occurs. Current replica-based source-code-level methods have high overhead and may suffer from data overflow problems. Checkpoint-based source-code-level methods have relatively low overhead; however, their checkpoints are typically placed at isochronous checkpoint intervals (periodic checkpoints).
The outcomes of soft errors can be classified as benign, crash, hang, and SDC (silent data corruption). SDC is the most hidden and most harmful outcome of a soft error; if an SDC occurs and cannot be handled, the consequences are severe. Many source-code-level soft error detection methods for SDC exist, for example assertion-based source-code-level SDC detection, which inserts assertion detectors at the source code level to detect SDC errors. To reduce detection cost, such methods usually insert assertion detectors only at program locations where SDC is likely to occur, so the detectors are unevenly distributed in the program.
Existing checkpoint-based source-code-level soft error recovery methods deploy checkpoints mainly at equal checkpoint intervals and do not consider the distribution of detector positions, so they cannot sufficiently reduce the time overhead. The invention provides a checkpoint soft error recovery method based on detector position, called DPCKPT. DPCKPT takes detector positions into account and redeploys the checkpoints of the isochronous-interval checkpoint method, so as to reduce the time overhead introduced by checkpointing and thereby the total running time of the program.
Disclosure of Invention
The checkpoint method based on isochronous checkpoint intervals takes checkpoints periodically. Because the detectors in a program are unevenly distributed, this method cannot sufficiently reduce the time overhead introduced by checkpointing. The invention aims to provide a checkpoint soft error recovery method based on detector position that reduces this time overhead and thereby the total running time of the program.
To achieve this purpose, the invention adopts the following technical scheme: a checkpoint soft error recovery method based on detector position, comprising the following steps:
The first step: load a program into which detectors have been inserted as the input of the method.
The second step: deploy initial checkpoints and divide the program into program segments, specifically as follows.
Initial checkpoints are deployed at isochronous checkpoint intervals, which divides the program into program segments of equal running time. At the program level, $c_1$ to $c_n$ are the $n$ checkpoints and $p_1$ to $p_n$ are the $n$ program segments delimited by the checkpoints, each with the same running time, as shown in FIG. 1. At the instruction level, the dynamic instructions of the program are divided into $n$ instruction segments $s_1$ to $s_n$ of equal size, where $i_1$ to $i_{n \cdot \Delta s}$ denote the dynamic instructions of the program and $\Delta s$ denotes the instruction segment size. The instructions of instruction segment $s_j$ are
$$i_{m(s_j)},\ i_{m(s_j)+1},\ \dots,\ i_{m(s_j)+\Delta s-1}$$
where $m(s_j)$ denotes the number of the first instruction of $s_j$; its value is given by equation (1).
$$m(s_j) = (j-1)\,\Delta s + 1 \qquad (1)$$
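As an illustration of the instruction-level division, the following minimal Python sketch computes the first and last instruction number of each segment. The function name `partition_instructions` is hypothetical, and two details are assumptions rather than statements from the patent text: the segment size is taken as the integer quotient of the instruction count by n, and the last segment absorbs any remainder.

```python
def partition_instructions(total_instructions, n):
    """Split dynamic instructions 1..total_instructions into n segments.

    Assumptions (not stated in the source text): delta_s is the integer
    quotient total_instructions // n, and the last segment absorbs the
    remainder so that every instruction belongs to some segment.
    """
    delta_s = total_instructions // n
    segments = []
    for j in range(1, n + 1):
        first = (j - 1) * delta_s + 1          # number of the first instruction of s_j
        last = j * delta_s if j < n else total_instructions
        segments.append((first, last))
    return segments


# Example with 10 instructions and 3 segments.
example = partition_instructions(10, 3)
print(example)
```

With 527179283 instructions and n = 3, this sketch yields the ranges [1, 175726427], [175726428, 351452854] and [351452855, 527179283].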
The third step: calculate the time overhead of each program segment, specifically as follows.
The time overhead of a program segment is the overhead incurred in tolerating the errors detected by the detectors within that segment, and consists of checkpointing overhead and recovery overhead. Checkpointing overhead is the time required to save the running state of the program, $t_1$ in FIG. 2. Recovery overhead is the time required to restore the program state saved at the checkpoint and then re-execute the program from the checkpoint position to the position where the detector reported the error, that is, the historical state recovery time plus the current state recovery time. When a detector detects an error it issues an error message, and the position at which it does so is called the failure point. Taking F in FIG. 2 as a failure point, the historical state recovery time and the current state recovery time for handling the error at F are $t_2$ and $t_3$ respectively. It is assumed here that taking a checkpoint at a given position takes the same time as restoring the program state saved at that position, i.e., $t_1 = t_2$.
The time to restore the program state saved by the checkpoint is the historical state recovery time, and the time required to execute the instructions between the checkpoint and the detector is the current state recovery time. It is assumed that at most one soft error occurs during program execution. Because a program segment contains multiple detectors, the current state recovery time of the segment is taken as the average of the current state recovery times incurred when each detector in the segment detects an error. Thus, the recovery overhead of program segment $p_j$ can be expressed as equation (2). The first term, $sr(c_j)$, is the time required to save or restore the state of checkpoint $c_j$, i.e., the checkpointing time or the historical state recovery time. The second term is the average current state recovery time, where $c(p_j)$ is the number of dynamic detectors in $p_j$, $\theta$ is the average execution time of a single instruction of the program, and $a(p_j, k)$ is the number of the instruction corresponding to the $k$-th dynamic detector of $p_j$. Note that a source-code-level detector corresponds to several instructions; the number of the first of these instructions is used as the instruction number of the detector. Finally, the time overhead of $p_j$ can be expressed as equation (3).
$$r(p_j) = sr(c_j) + \frac{\theta}{c(p_j)} \sum_{k=1}^{c(p_j)} \bigl(a(p_j,k) - m(s_j)\bigr) \qquad (2)$$
$$o(p_j) = sr(c_j) + r(p_j) \qquad (3)$$
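The overhead calculation of the third step can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the `Segment` class and function names are hypothetical, and it assumes that the replay distance of a detector is its instruction number minus the number of the first instruction of the segment.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    first_instr: int                 # m(s_j): first instruction of the segment
    sr_checkpoint: float             # sr(c_j): save/restore time of checkpoint c_j, in seconds
    detector_instrs: list = field(default_factory=list)  # a(p_j, k) for each dynamic detector


def recovery_overhead(seg, theta):
    """r(p_j): historical state recovery plus average current state recovery."""
    if not seg.detector_instrs:
        # Assumption: with no detector there is nothing to replay.
        return seg.sr_checkpoint
    replay_instrs = sum(a - seg.first_instr for a in seg.detector_instrs)
    avg_current_state = theta * replay_instrs / len(seg.detector_instrs)
    return seg.sr_checkpoint + avg_current_state


def time_overhead(seg, theta):
    """o(p_j): checkpointing overhead plus recovery overhead."""
    return seg.sr_checkpoint + recovery_overhead(seg, theta)


# Toy values chosen for illustration only.
seg = Segment(first_instr=1, sr_checkpoint=0.13, detector_instrs=[11, 21, 31])
print(recovery_overhead(seg, theta=0.01), time_overhead(seg, theta=0.01))
```

In the toy call, the three detectors lie 10, 20 and 30 instructions after the checkpoint, so the average replay is 20 instructions, or 0.2 s at 0.01 s per instruction.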
The fourth step: judge whether the checkpoints in each program segment are sufficient and, if they are not, add an additional checkpoint to the segment, specifically as follows.
The third step gives the time overhead of each program segment. Next, for each program segment, the change in its time overhead after a checkpoint is added to it is evaluated, and the sufficiency of its checkpoints is judged from that change. If the checkpoints are insufficient, an additional checkpoint is added. The invention only considers adding one checkpoint at the midpoint of a program segment, so only the change in time overhead after adding a checkpoint at the midpoint is evaluated. The analysis is given below using program segment $p_j$ as an example.
Suppose a checkpoint $ac_j$ is added at the midpoint of $p_j$, as shown in FIG. 3. After $ac_j$ is added, the time overhead of $p_j$ changes. On the one hand, errors in the second half of $p_j$ are recovered through $ac_j$ instead of $c_j$, so the recovery overhead of $p_j$ changes. The changed recovery overhead can be expressed as equation (4), where $A$ is the changed historical state recovery time, $B$ is the changed current state recovery time, and $u(p_j)$ in $B$ is the number of dynamic detectors in the first half of $p_j$. On the other hand, the checkpointing overhead of $p_j$ changes; the changed checkpointing overhead is given by equation (5).
$$A = \frac{sr(c_j) + sr(ac_j)}{2}$$
$$B = \frac{\theta}{c(p_j)} \left[ \sum_{k=1}^{u(p_j)} \bigl(a(p_j,k) - m(s_j)\bigr) + \sum_{k=u(p_j)+1}^{c(p_j)} \Bigl(a(p_j,k) - m(s_j) - \frac{\Delta s}{2}\Bigr) \right]$$
$$r(p_j)' = A + B \qquad (4)$$
$$sr(p_j)' = sr(c_j) + sr(ac_j) \qquad (5)$$
Thus, after $ac_j$ is added, the time overhead of $p_j$ can be expressed as equation (6), and the change in the time overhead of $p_j$ is given by equation (7). When $ag(p_j)$ is greater than 0, adding checkpoint $ac_j$ reduces the time overhead of $p_j$, which means the checkpoints in $p_j$ are insufficient, so a checkpoint is added at the midpoint of $p_j$. Otherwise, the checkpoints in $p_j$ are sufficient and no checkpoint needs to be added to $p_j$.
$$o(p_j)' = r(p_j)' + sr(p_j)' \qquad (6)$$
$$ag(p_j) = o(p_j) - o(p_j)' \qquad (7)$$
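The sufficiency test of the fourth step can be sketched in Python as below. The function name `add_gain` and its parameters are hypothetical. The sketch assumes that the added checkpoint sits at the first instruction of the segment plus half the segment size, that detectors before that position recover from the original checkpoint and the rest from the added one, and that the historical state recovery time after the addition is the mean of the two checkpoint times.

```python
def add_gain(first_instr, delta_s, detector_instrs, sr_c, sr_ac, theta):
    """Return ag(p_j) = o(p_j) - o(p_j)'; a positive value suggests adding ac_j."""
    n = len(detector_instrs)
    mid = first_instr + delta_s // 2  # assumed position of the added checkpoint ac_j

    # Overhead before adding ac_j: checkpointing plus recovery through c_j.
    replay_before = sum(a - first_instr for a in detector_instrs)
    o_before = sr_c + (sr_c + theta * replay_before / n)

    # Overhead after adding ac_j: first half rolls back to c_j, second half to ac_j.
    replay_after = sum(a - (first_instr if a < mid else mid) for a in detector_instrs)
    hist_recovery = (sr_c + sr_ac) / 2          # A
    cur_recovery = theta * replay_after / n     # B
    o_after = (sr_c + sr_ac) + (hist_recovery + cur_recovery)

    return o_before - o_after


# Detectors clustered near the end of the segment: the extra checkpoint pays off.
late = add_gain(1, 1000, [900, 950, 1000], sr_c=0.1, sr_ac=0.1, theta=0.001)
# Detectors clustered near the start: the extra checkpoint only adds cost.
early = add_gain(1, 1000, [10, 20, 30], sr_c=0.1, sr_ac=0.1, theta=0.001)
print(late, early)
```

The two toy calls show why detector position matters: the same added checkpoint is worthwhile when detectors sit late in the segment and wasteful when they sit early.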
The fifth step: judge whether the checkpoint of each program segment can be deleted, and delete the deletable checkpoints, specifically as follows.
When the checkpoints of a program segment are sufficient, the change in the time overhead after deleting the checkpoint of the segment is evaluated. If the time overhead decreases, the checkpoint of the segment is redundant and is deleted; otherwise it is not redundant and is kept. A detailed description is given below using program segment $p_j$ as an example.
After checkpoint $c_j$ is deleted, the errors detected by the detectors within $p_j$ are handled by the nearest preceding checkpoint $c_{j-1}$. We treat $p_j$ and its nearest preceding program segment $p_{j-1}$ as one program segment $p_m$, as shown in FIG. 4, and compute the change in the time overhead of $p_m$ after deleting $c_j$ to determine whether $c_j$ can be deleted. When $c_j$ is kept, the time overhead of $p_m$ can be expressed as equation (8). After $c_j$ is deleted, the time overhead of $p_m$ can be expressed as equation (9). The time overhead saved in $p_m$ by deleting $c_j$ can then be expressed as equation (10). If $dg(p_m)$ is greater than 0, deleting $c_j$ reduces the time overhead of $p_m$, and $c_j$ is deleted. Otherwise, $c_j$ is kept.
$$o(p_m) = sr(c_{j-1}) + sr(c_j) + \frac{sr(c_{j-1}) + sr(c_j)}{2} + \frac{\theta}{c(p_{j-1}) + c(p_j)} \left[ \sum_{k=1}^{c(p_{j-1})} \bigl(a(p_{j-1},k) - m(s_{j-1})\bigr) + \sum_{k=1}^{c(p_j)} \bigl(a(p_j,k) - m(s_j)\bigr) \right] \qquad (8)$$
$$o(p_m)' = sr(c_{j-1}) + sr(c_{j-1}) + \frac{\theta}{c(p_{j-1}) + c(p_j)} \left[ \sum_{k=1}^{c(p_{j-1})} \bigl(a(p_{j-1},k) - m(s_{j-1})\bigr) + \sum_{k=1}^{c(p_j)} \bigl(a(p_j,k) - m(s_{j-1})\bigr) \right] \qquad (9)$$
$$dg(p_m) = o(p_m) - o(p_m)' \qquad (10)$$
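The deletion test of the fifth step can be sketched in the same style. The function name `delete_gain` and its parameters are hypothetical; the sketch assumes that, with the checkpoint kept, each detector replays from the start of its own segment and the historical state recovery time is the mean of the two checkpoint times, and that, with the checkpoint deleted, every detector replays from the start of the preceding segment.

```python
def delete_gain(first_prev, first_cur, det_prev, det_cur, sr_prev, sr_cur, theta):
    """Return dg(p_m) = o(p_m) - o(p_m)'; a positive value suggests deleting c_j."""
    n = len(det_prev) + len(det_cur)

    # Keep c_j: each detector rolls back to the checkpoint of its own segment.
    replay_keep = (sum(a - first_prev for a in det_prev)
                   + sum(a - first_cur for a in det_cur))
    o_keep = (sr_prev + sr_cur) + (sr_prev + sr_cur) / 2 + theta * replay_keep / n

    # Delete c_j: every detector rolls back to c_{j-1}.
    replay_del = sum(a - first_prev for a in det_prev + det_cur)
    o_del = sr_prev + sr_prev + theta * replay_del / n

    return o_keep - o_del


# An expensive checkpoint followed by an early detector: deletion helps.
helps = delete_gain(1, 1001, [500], [1010], sr_prev=0.1, sr_cur=0.5, theta=1e-4)
# A cheap checkpoint followed by a late detector: deletion hurts.
hurts = delete_gain(1, 1001, [500], [1990], sr_prev=0.01, sr_cur=0.01, theta=1e-3)
print(helps, hurts)
```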
Through the above steps, the redeployment of the checkpoints of the isochronous-interval checkpoint method is completed.
The checkpoint method with isochronous checkpoint intervals is chosen for comparison, with the checkpoint interval set to T/4 and T/3 respectively, where T is the original running time of the program. The beneficial effect of the invention is that the total running time of the program is reduced. Programs such as replace, bitstrng, rad2deg and isqrt from the Siemens and MiBench suites were selected for testing, and BLCR (Berkeley Lab Checkpoint/Restart) was used as the checkpoint tool. The percentage reduction in total program running time achieved by DPCKPT is denoted $e_p$, with $e_p = (t_{fix} - t_{dpckpt})/t_{fix}$, where $t_{fix}$ is the total program running time under the isochronous-interval checkpoint method and $t_{dpckpt}$ is the total program running time after DPCKPT redeploys the checkpoints of that method. Faults were injected into the programs to simulate single event upsets, and the proposed method was evaluated.
Compared with the prior art, the invention has the following advantages. Compared with the checkpoint method with isochronous checkpoint intervals, the method of the invention reduces the total running time of the program when a single event upset occurs and the upset disables the detector: after DPCKPT redeploys the checkpoints of the T/4 and T/3 isochronous-interval methods, the total program running time falls by 15% and 11.4% respectively. When a single event upset occurs, the method also reduces the total program running time: after DPCKPT redeploys the checkpoints of the T/4 and T/3 isochronous-interval methods, the total program running time falls by 16% and 11% respectively.
Drawings
FIG. 1 is a diagram of deploying initial checkpoints and dividing program segments according to the present invention;
FIG. 2 is a diagram of the time overhead of a program segment according to the present invention;
FIG. 3 is a diagram of adding a checkpoint according to the present invention;
FIG. 4 is a diagram of deleting a checkpoint according to the present invention;
FIG. 5 is a flow chart of the checkpoint soft error recovery method based on detector position according to the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are purely exemplary and are not intended to limit the scope of the invention. The scope is defined by the appended claims, including equivalent modifications made by those skilled in the art after reading this disclosure.
Example 1: referring to FIGS. 1 to 5, a checkpoint soft error recovery method based on detector position comprises the following steps:
The first step: load a program into which detectors have been inserted as the input of the method.
The second step: deploy initial checkpoints and divide the program into program segments, specifically as follows.
Initial checkpoints are deployed at isochronous checkpoint intervals, which divides the program into program segments of equal running time. At the program level, as shown in FIG. 1, $c_1$ to $c_n$ are the $n$ checkpoints and $p_1$ to $p_n$ are the $n$ program segments delimited by the checkpoints, each with the same running time. At the instruction level, the dynamic instructions of the program are divided into $n$ instruction segments $s_1$ to $s_n$ of equal size, where $i_1$ to $i_{n \cdot \Delta s}$ denote the dynamic instructions of the program and $\Delta s$ denotes the instruction segment size. The instructions of instruction segment $s_j$ are
$$i_{m(s_j)},\ i_{m(s_j)+1},\ \dots,\ i_{m(s_j)+\Delta s-1}$$
where $m(s_j)$ denotes the number of the first instruction of $s_j$; its value is given by equation (1).
$$m(s_j) = (j-1)\,\Delta s + 1 \qquad (1)$$
The third step: calculate the time overhead of each program segment, specifically as follows.
The time overhead of a program segment is the overhead incurred in tolerating the errors detected by the detectors within that segment, and consists of checkpointing overhead and recovery overhead. Checkpointing overhead is the time required to save the running state of the program, $t_1$ in FIG. 2. Recovery overhead is the time required to restore the program state saved at the checkpoint and then re-execute the program from the checkpoint position to the position where the detector reported the error, that is, the historical state recovery time plus the current state recovery time. When a detector detects an error it issues an error message, and the position at which it does so is called the failure point. Taking F in FIG. 2 as a failure point, the historical state recovery time and the current state recovery time for handling the error at F are $t_2$ and $t_3$ respectively. It is assumed here that taking a checkpoint at a given position takes the same time as restoring the program state saved at that position, i.e., $t_1 = t_2$.
The time to restore the program state saved by the checkpoint is the historical state recovery time, and the time required to execute the instructions between the checkpoint and the detector is the current state recovery time. It is assumed that at most one soft error occurs during program execution. Because a program segment contains multiple detectors, the current state recovery time of the segment is taken as the average of the current state recovery times incurred when each detector in the segment detects an error. Thus, the recovery overhead of program segment $p_j$ can be expressed as equation (2). The first term, $sr(c_j)$, is the time required to save or restore the state of checkpoint $c_j$, i.e., the checkpointing time or the historical state recovery time. The second term is the average current state recovery time, where $c(p_j)$ is the number of dynamic detectors in $p_j$, $\theta$ is the average execution time of a single instruction of the program, and $a(p_j, k)$ is the number of the instruction corresponding to the $k$-th dynamic detector of $p_j$. Note that a source-code-level detector corresponds to several instructions; the number of the first of these instructions is used as the instruction number of the detector. Finally, the time overhead of $p_j$ can be expressed as equation (3).
$$r(p_j) = sr(c_j) + \frac{\theta}{c(p_j)} \sum_{k=1}^{c(p_j)} \bigl(a(p_j,k) - m(s_j)\bigr) \qquad (2)$$
$$o(p_j) = sr(c_j) + r(p_j) \qquad (3)$$
The fourth step: judge whether the checkpoints in each program segment are sufficient and, if they are not, add an additional checkpoint to the segment, specifically as follows.
The third step gives the time overhead of each program segment. Next, for each program segment, the change in its time overhead after a checkpoint is added to it is evaluated, and the sufficiency of its checkpoints is judged from that change. If the checkpoints are insufficient, an additional checkpoint is added. The invention only considers adding one checkpoint at the midpoint of a program segment, so only the change in time overhead after adding a checkpoint at the midpoint is evaluated. The analysis is given below using program segment $p_j$ as an example.
As shown in FIG. 3, suppose a checkpoint $ac_j$ is added at the midpoint of $p_j$. After $ac_j$ is added, the time overhead of $p_j$ changes. On the one hand, errors in the second half of $p_j$ are recovered through $ac_j$ instead of $c_j$, so the recovery overhead of $p_j$ changes. The changed recovery overhead can be expressed as equation (4), where $A$ is the changed historical state recovery time, $B$ is the changed current state recovery time, and $u(p_j)$ in $B$ is the number of dynamic detectors in the first half of $p_j$. On the other hand, the checkpointing overhead of $p_j$ changes; the changed checkpointing overhead is given by equation (5).
$$A = \frac{sr(c_j) + sr(ac_j)}{2}$$
$$B = \frac{\theta}{c(p_j)} \left[ \sum_{k=1}^{u(p_j)} \bigl(a(p_j,k) - m(s_j)\bigr) + \sum_{k=u(p_j)+1}^{c(p_j)} \Bigl(a(p_j,k) - m(s_j) - \frac{\Delta s}{2}\Bigr) \right]$$
$$r(p_j)' = A + B \qquad (4)$$
$$sr(p_j)' = sr(c_j) + sr(ac_j) \qquad (5)$$
Thus, after $ac_j$ is added, the time overhead of $p_j$ can be expressed as equation (6), and the change in the time overhead of $p_j$ is given by equation (7). When $ag(p_j)$ is greater than 0, adding checkpoint $ac_j$ reduces the time overhead of $p_j$, which means the checkpoints in $p_j$ are insufficient, so a checkpoint is added at the midpoint of $p_j$. Otherwise, the checkpoints in $p_j$ are sufficient and no checkpoint needs to be added to $p_j$.
$$o(p_j)' = r(p_j)' + sr(p_j)' \qquad (6)$$
$$ag(p_j) = o(p_j) - o(p_j)' \qquad (7)$$
The fifth step: judge whether the checkpoint of each program segment can be deleted, and delete the deletable checkpoints, specifically as follows.
When the checkpoints of a program segment are sufficient, the change in the time overhead after deleting the checkpoint of the segment is evaluated. If the time overhead decreases, the checkpoint of the segment is redundant and is deleted; otherwise it is not redundant and is kept. A detailed description is given below using program segment $p_j$ as an example.
As shown in FIG. 4, after checkpoint $c_j$ is deleted, the errors detected by the detectors within $p_j$ are handled by the nearest preceding checkpoint $c_{j-1}$. We treat $p_j$ and its nearest preceding program segment $p_{j-1}$ as one program segment $p_m$ and compute the change in the time overhead of $p_m$ after deleting $c_j$ to determine whether $c_j$ can be deleted. When $c_j$ is kept, the time overhead of $p_m$ can be expressed as equation (8). After $c_j$ is deleted, the time overhead of $p_m$ can be expressed as equation (9). The time overhead saved in $p_m$ by deleting $c_j$ can then be expressed as equation (10). If $dg(p_m)$ is greater than 0, deleting $c_j$ reduces the time overhead of $p_m$, and $c_j$ is deleted. Otherwise, $c_j$ is kept.
$$o(p_m) = sr(c_{j-1}) + sr(c_j) + \frac{sr(c_{j-1}) + sr(c_j)}{2} + \frac{\theta}{c(p_{j-1}) + c(p_j)} \left[ \sum_{k=1}^{c(p_{j-1})} \bigl(a(p_{j-1},k) - m(s_{j-1})\bigr) + \sum_{k=1}^{c(p_j)} \bigl(a(p_j,k) - m(s_j)\bigr) \right] \qquad (8)$$
$$o(p_m)' = sr(c_{j-1}) + sr(c_{j-1}) + \frac{\theta}{c(p_{j-1}) + c(p_j)} \left[ \sum_{k=1}^{c(p_{j-1})} \bigl(a(p_{j-1},k) - m(s_{j-1})\bigr) + \sum_{k=1}^{c(p_j)} \bigl(a(p_j,k) - m(s_{j-1})\bigr) \right] \qquad (9)$$
$$dg(p_m) = o(p_m) - o(p_m)' \qquad (10)$$
Through the above steps, the redeployment of the checkpoints of the isochronous-interval checkpoint method is completed.
The checkpoint method with isochronous checkpoint intervals is chosen for comparison, with the checkpoint interval set to T/4 and T/3 respectively, where T is the original running time of the program. The beneficial effect of the invention is that the total running time of the program is reduced. Programs such as replace, bitstrng, rad2deg and isqrt from the Siemens and MiBench suites were selected for testing, and BLCR (Berkeley Lab Checkpoint/Restart) was used as the checkpoint tool. The percentage reduction in total program running time achieved by DPCKPT is denoted $e_p$, with $e_p = (t_{fix} - t_{dpckpt})/t_{fix}$, where $t_{fix}$ is the total program running time under the isochronous-interval checkpoint method and $t_{dpckpt}$ is the total program running time after DPCKPT redeploys the checkpoints of that method. Faults were injected into the programs to simulate single event upsets, and the proposed method was evaluated.
Application example 1: FIG. 5 shows a flow chart of the checkpoint soft error recovery method based on detector position according to the present invention. FIG. 5 comprises two parts. The left part deploys initial checkpoints at isochronous checkpoint intervals, divides the program into program segments, and calculates the time overhead of each segment. The right part is the process of redeploying checkpoints for each program segment: the upper right judges whether the checkpoints in the program segment are sufficient and, if they are not, adds a checkpoint to the segment; the lower right judges whether the checkpoint of the program segment can be deleted and, if so, deletes it.
The first step: load a program into which detectors have been inserted as the input of the method.
Taking the rad2deg program in the MiBench test suite as an example, a soft error detection method based on application logic invariant assertions is used to generate detectors for rad2deg. The detectors are then inserted into the original program. Finally, the program with detectors is used as the input of the method.
The second step: deploy initial checkpoints and divide the program into program segments. In the isochronous-interval checkpoint method, the checkpoint interval is set to T/3. The original running time of the program is 1.2 s, so a checkpoint is set every 0.4 s. The program executes 527179283 instructions in total; according to equation (1), the instruction ranges of program segments $p_1$ to $p_3$ are [1, 175726427], [175726428, 351452854] and [351452855, 527179283] respectively. The checkpoint redeployment process is described below using $p_2$ and $p_3$ as examples.
The third step: calculate the time overhead of the program segments. For program segment $p_2$, a checkpoint is taken at 0.4 s of the program using the BLCR checkpoint tool, giving a checkpointing time $sr(c_2) = 0.13$ s. Analysis of the program's instruction execution shows that the detectors in $p_2$ are executed 9195000 times in total, i.e., there are 9195000 dynamic detectors, so $c(p_2) = 9195000$. The same analysis shows that if every assertion detector detects an error and rolls back to $c_2$, $80317.79 \times 10^{10}$ instructions are executed in total, so each assertion detector requires on average $80317.79 \times 10^{10} / 9195000 = 87349418$ instructions. The average running time of a single instruction in the program is $1200000000\ \text{ns} / 527179283 = 2.28$ ns. Thus, the average current state recovery time of errors in $p_2$ is $87349418 \times 2.28\ \text{ns} = 0.2$ s, and equation (2) gives the recovery overhead $r(p_2) = 0.13\ \text{s} + 0.2\ \text{s} = 0.33$ s. Equation (3) gives the time overhead $o(p_2) = 0.13\ \text{s} + 0.33\ \text{s} = 0.46$ s. In the same manner, $o(p_3) = 0.58$ s.
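The arithmetic of this step can be re-checked with a few lines of Python. The variable names are illustrative; the inputs are the figures quoted above, and the unrounded per-instruction time is used, so intermediate values differ slightly from the rounded ones in the text.

```python
total_time_s = 1.2                 # original running time of rad2deg
total_instr = 527_179_283          # dynamic instructions executed
theta = total_time_s / total_instr # average time per instruction, in seconds

sr_c2 = 0.13                       # BLCR checkpoint time at 0.4 s
detectors_p2 = 9_195_000           # c(p_2): dynamic detectors in p_2
replay_instr_p2 = 80317.79e10      # instructions replayed if every detector rolls back to c_2

avg_replay = replay_instr_p2 / detectors_p2   # average instructions replayed per detector
current_state = avg_replay * theta            # average current state recovery time
r_p2 = sr_c2 + current_state                  # equation (2)
o_p2 = sr_c2 + r_p2                           # equation (3)

print(f"theta = {theta * 1e9:.2f} ns, r(p2) = {r_p2:.2f} s, o(p2) = {o_p2:.2f} s")
```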
The fourth step: and judging the sufficiency of the check points in the program segment, and if the check points are insufficient, adding additional check points for the program segment. For p2With addition of a check point ac at an intermediate position thereof2I.e. adding a checkpoint ac at 0.6s2. Checkpointing using the BLCR checkpointing tool at 0.6s of the program, resulting in a checkpointing time of 0.14s, sr (ac)2) 0.14 s. Thus, p2Has a history state recovery time of (sr (c)2)+sr(ac2) 2 ═ 0.13s +0.14s)/2 ═ 0.135s, that is, a in formula (4) is 0.135 s. ac2P is to be2Divided into two parts, p2Error pass c of the first half of2Recovery, error in the second half passing ac2And (6) recovering. Analyzing the program for instruction execution results in p2Is performed 4650000 times in total, i.e. u (p)2) 4650000. All detectors in the second half were performed 4545000 times in total. Analyzing the instruction condition of program operation to obtain if p2All predicate detectors in the first half detect an error and roll back to the nearest checkpoint, then a total of 20406.20 × e10 instructions are executed. If p is2All predicate detectors in the second half detect errors and roll back to the nearest checkpoint, then a total of 19970.58 × e10 instructions are executed. Thus, p2When the inner detector detects an error and performs a rollback, on average, there are (20406.20 e10+19970.58 e10)/9195000 43911669 instructions that need to be executed, and p2Current state ofThe recovery time was 43911669 × 2.28ns ═ 0.1s, i.e., B in formula (4) was 0.1 s. Thus, r (p) can be obtained by using the formula (4)2) ' -0.135 s +0.1 s-0.235 s. Obtained by using the formula (5), adding ac2Rear p2Checkpointing time sr (p)2) ' -0.13 s +0.14 s-0.27 s. Obtained by using the formula (6), to which ac is added2Rear p2Time overhead of o (p)2) ' -0.27 s +0.235 s-0.505 s. Obtaining additive ac by Using formula (7)2Rear p2Reduced time overhead ag (p)2)=o(p2)-o(p2)′=0.46s-0.505s<0. 
This shows that adding $ac_2$ to $p_2$ increases the time overhead of $p_2$, so no checkpoint needs to be added to $p_2$. By the same analysis, adding a checkpoint in the middle of $p_3$ also increases the time overhead of $p_3$. Therefore, no checkpoint needs to be added to $p_3$ either.
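The sufficiency check for $p_2$ can be replayed numerically. The sketch below is illustrative only: the names are hypothetical, the inputs are the figures quoted in the embodiment, and the baseline $o(p_2) = 0.46$ s is taken directly from the third step rather than recomputed.

```python
# Figures quoted in the worked example for p2 with a candidate checkpoint ac2.
SR_C2 = 0.13                    # checkpoint setting time of c2, seconds
SR_AC2 = 0.14                   # checkpoint setting time of ac2, seconds
ROLLBACK_FIRST = 20406.20e10    # instructions if all first-half detectors fire
ROLLBACK_SECOND = 19970.58e10   # instructions if all second-half detectors fire
DETECTOR_RUNS = 9_195_000       # dynamic detectors in p2 (4650000 + 4545000)
NS_PER_INSTRUCTION = 2.28       # rounded average time per instruction, ns
OVERHEAD_WITHOUT_AC2 = 0.46     # o(p2) from the third step, seconds


def overhead_with_extra_checkpoint(sr_c, sr_ac, rollback_first, rollback_second,
                                   detector_runs, ns_per_instruction):
    """Time overhead o(p)' of a segment after one checkpoint is added mid-segment."""
    history_state = (sr_c + sr_ac) / 2                      # A in formula (4)
    avg_instructions = (rollback_first + rollback_second) / detector_runs
    current_state = avg_instructions * ns_per_instruction * 1e-9  # B in formula (4)
    recovery = history_state + current_state                # formula (4)
    checkpoint_time = sr_c + sr_ac                          # formula (5)
    return checkpoint_time + recovery                       # formula (6)


new_overhead = overhead_with_extra_checkpoint(
    SR_C2, SR_AC2, ROLLBACK_FIRST, ROLLBACK_SECOND, DETECTOR_RUNS, NS_PER_INSTRUCTION
)
gain = OVERHEAD_WITHOUT_AC2 - new_overhead                  # formula (7)
add_checkpoint = gain > 0   # add ac2 only if it reduces the time overhead

if __name__ == "__main__":
    print(f"o(p2)' = {new_overhead:.3f} s, ag(p2) = {gain:.3f} s, add: {add_checkpoint}")
```

With these inputs the gain is negative, which agrees with the conclusion that $ac_2$ should not be added.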
The fifth step: judge whether the checkpoint in each program segment is deletable, and delete the deletable checkpoints. Consider the deletability of the checkpoint of $p_3$. $p_2$ was not merged with its preceding program segment, so the preceding program segment of $p_3$ is still $p_2$, and $p_2$ and $p_3$ together are regarded as $p_{23}$. First, analyze the time overhead of $p_{23}$ when checkpoint $c_3$ inside $p_{23}$ is not deleted. The checkpoint setting times with BLCR at 0.4 s and 0.8 s of the program are 0.13 s and 0.19 s respectively, i.e. $sr(c_2) = 0.13$ s and $sr(c_3) = 0.19$ s. From the analysis in the third step, $c(p_2) = 9195000$, and if every assertion detector in $p_2$ detected an error and rolled back to $c_2$, a total of $80317.79 \times 10^{10}$ instructions would be executed. Similarly, analyzing the program's instruction execution shows that the detectors in $p_3$ are executed 9091393 times in total, i.e. $c(p_3) = 9091393$, and that if every assertion detector in $p_3$ detected an error and rolled back to $c_3$, a total of $79879.81 \times 10^{10}$ instructions would be executed. Thus, by formula (8), when $c_3$ is retained the time overhead of $p_{23}$ is $(80317.79 \times 10^{10} + 79879.81 \times 10^{10})/(9195000 + 9091393) \times (2.28/1000000000) + (0.13 + 0.19)/2 + 0.13 + 0.19 \approx 0.68$ s. Next, analyze the time overhead of $p_{23}$ when checkpoint $c_3$ inside $p_{23}$ is deleted. After $c_3$ is deleted, the number of dynamic detectors in $p_{23}$ is still $c(p_2) + c(p_3) = 18286393$. Analyzing the program's instruction execution shows that if every detector in $p_{23}$ detected an error and rolled back to the nearest checkpoint $c_2$, a total of $319957.41 \times 10^{10}$ instructions would be executed. By formula (9), when $c_3$ is deleted the time overhead of $p_{23}$ is $(319957.41 \times 10^{10}/18286393) \times (2.28/1000000000) + 0.13 + 0.13 \approx 0.66$ s. Finally, by formula (10), deleting $c_3$ reduces the time overhead of $p_{23}$ by $0.68 - 0.66 > 0$. Therefore, $c_3$ is deleted.
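The deletability decision for $c_3$ can be checked the same way. The sketch below uses hypothetical names and the figures quoted in the embodiment. Reading the constant terms of formulas (8) and (9) as "history state recovery time plus checkpoint setting time" is an assumption made for labeling purposes; the numeric expressions themselves are copied from the text.

```python
# Figures quoted in the worked example for the merged segment p23.
SR_C2, SR_C3 = 0.13, 0.19         # checkpoint setting times, seconds
C_P2, C_P3 = 9_195_000, 9_091_393 # dynamic detector counts c(p2), c(p3)
ROLLBACK_P2 = 80317.79e10         # instructions if all p2 detectors roll back to c2
ROLLBACK_P3 = 79879.81e10         # instructions if all p3 detectors roll back to c3
ROLLBACK_P23 = 319957.41e10       # instructions if all p23 detectors roll back to c2
SECONDS_PER_INSTRUCTION = 2.28 / 1_000_000_000


def overhead_keeping_c3():
    """Formula (8) as evaluated in the text: both checkpoints are retained."""
    current_state = (ROLLBACK_P2 + ROLLBACK_P3) / (C_P2 + C_P3) * SECONDS_PER_INSTRUCTION
    history_state = (SR_C2 + SR_C3) / 2   # assumed meaning of the averaged term
    checkpoint_time = SR_C2 + SR_C3       # assumed meaning of the summed term
    return current_state + history_state + checkpoint_time


def overhead_deleting_c3():
    """Formula (9) as evaluated in the text: only c2 remains for p23."""
    current_state = ROLLBACK_P23 / (C_P2 + C_P3) * SECONDS_PER_INSTRUCTION
    # The text adds 0.13 s twice; read here as history state + checkpoint time of c2.
    return current_state + SR_C2 + SR_C2


keep = overhead_keeping_c3()
delete = overhead_deleting_c3()
saved = keep - delete                     # formula (10)
delete_c3 = saved > 0   # delete c3 only if doing so reduces the time overhead

if __name__ == "__main__":
    print(f"keep c3: {keep:.2f} s, delete c3: {delete:.2f} s, delete: {delete_c3}")
```

Deleting $c_3$ roughly doubles the average re-execution work, yet it still wins in this example because it removes one checkpoint setting time and lowers the history state recovery term.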
It should be noted that the above embodiments are only examples of the present invention and are not intended to limit its scope; equivalents and substitutions made on the basis of the above technical solutions fall within the scope of protection of the present invention.

Claims (5)

1. A method for checkpoint soft error recovery based on detector location, the method comprising the steps of:
the first step: loading a program to which detectors have been added, as the input of the method;
the second step: deploying initial checkpoints and dividing the program into program segments;
the third step: calculating the time overhead of each program segment;
the fourth step: judging whether the checkpoints in a program segment are sufficient and, if they are insufficient, adding an additional checkpoint to the program segment;
the fifth step: judging whether the checkpoint in a program segment is deletable, and deleting the deletable checkpoints.
2. The detector location based checkpoint soft error recovery method of claim 1, wherein the second step, deploying initial checkpoints and dividing program segments, specifically comprises:
using a checkpointing method with equal checkpoint intervals to set a checkpoint for the program at regular time intervals, so that the checkpoints divide the program into program segments of equal running time.
3. The detector location based checkpoint soft error recovery method of claim 1, wherein the third step, calculating the time overhead of the program segment, specifically comprises:
the time overhead of a program segment is the time spent on fault tolerance for errors detected by the detectors in that program segment, and comprises the checkpoint setting time and the recovery time; the time required to save the running state of the program is taken as the checkpoint setting time; the recovery time comprises the historical state recovery time and the current state recovery time, namely the time to restore the program running state saved by the checkpoint, and the time required to continue running the program from the checkpoint position to the position at which the detector reported the error; for each program segment, the checkpoint setting time and the recovery time are calculated, and their sum is taken as the time overhead of the program segment.
4. The detector location based checkpoint soft error recovery method of claim 1, wherein the fourth step, judging the sufficiency of the checkpoints in the program segment and adding an additional checkpoint to the program segment if the checkpoints are insufficient, specifically comprises:
for each program segment, evaluating the checkpoint setting time, the current state recovery time and the historical state recovery time of the program segment after a checkpoint is added, and thereby obtaining its time overhead; comparing the time overhead of the program segment with the added checkpoint against its time overhead without the added checkpoint; if the time overhead is reduced, the checkpoints in the program segment are insufficient and the checkpoint is added to the program segment; otherwise, the checkpoints in the program segment are sufficient and no checkpoint is added to the program segment.
5. The detector location based checkpoint soft error recovery method of claim 1, wherein the fifth step, judging the deletability of the checkpoint in the program segment and deleting the deletable checkpoints, specifically comprises:
when the checkpoints of a program segment are sufficient, the deletability of the checkpoint in the program segment is evaluated; after the checkpoint of the program segment is deleted, errors in the program segment are handled by the checkpoint of the nearest preceding program segment, and the program segment and its nearest preceding program segment are regarded as one program segment, called $p_m$; when evaluating the deletion of the checkpoint of a program segment, if the time overhead of $p_m$ is reduced, the checkpoint in the program segment is redundant and is deleted; otherwise, the checkpoint in the program segment is not redundant and is not deleted; assuming that the preceding program segment of the current program segment is $p_j$, if the analysis in the fourth step added a checkpoint in $p_j$, the newly added checkpoint divides $p_j$ into two program segments, $p_{j1}$ and $p_{j2}$, and in that case the nearest preceding program segment of the current program segment is $p_{j2}$ rather than $p_j$.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011005441.2A CN112131034B (en) 2020-09-22 2020-09-22 Checkpoint soft error recovery method based on detector position

Publications (2)

Publication Number Publication Date
CN112131034A true CN112131034A (en) 2020-12-25
CN112131034B CN112131034B (en) 2023-07-25

Family

ID=73841630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011005441.2A Active CN112131034B (en) 2020-09-22 2020-09-22 Checkpoint soft error recovery method based on detector position

Country Status (1)

Country Link
CN (1) CN112131034B (en)

Also Published As

Publication number Publication date
CN112131034B (en) 2023-07-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant