CN112131034A

CN112131034A - Checkpoint soft error recovery method based on detector position

Info

Publication number: CN112131034A
Application number: CN202011005441.2A
Authority: CN
Inventors: 汪芸; 杨娜
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2020-09-22
Filing date: 2020-09-22
Publication date: 2020-12-25
Anticipated expiration: 2040-09-22
Also published as: CN112131034B

Abstract

The invention discloses a checkpoint soft error recovery method based on detector position, named DPCKPT. The method comprises the following steps. The first step: load a program to which detectors have been added as the input of the method. The second step: deploy initial checkpoints and divide the program into program segments. The third step: calculate the time overhead of each program segment. The fourth step: judge whether the checkpoints in each program segment are sufficient and, if they are not, add an additional checkpoint to the segment. The fifth step: judge whether the checkpoint of each program segment can be deleted, and delete the deletable checkpoints. Starting from the checkpoint method with isochronous (equal-time) checkpoint intervals, DPCKPT takes the detector positions into account and redeploys the checkpoints placed by that method, reducing the time overhead introduced by checkpointing and thereby the total running time of the program.

Description

Checkpoint soft error recovery method based on detector position

Technical Field

The invention relates to a checkpoint soft error recovery method based on detector position, and belongs to the technical field of soft error fault tolerance in computers.

Background

A single event upset is the phenomenon in which a single high-energy particle from space strikes a sensitive region of a semiconductor device and flips the logic state of the device. A single event upset may cause a hard error or a soft error in the device. Hard errors are permanent failures and cannot be recovered from. Soft errors are transient failures and can be recovered from. Soft errors occur approximately 100 times more often than hard errors. Soft error detection and soft error recovery are the two main aspects of handling soft errors.

The source code level soft error recovery method provides recovery service for a program at the source code layer and is simple to operate. It mainly uses copy (replica) techniques and checkpoint techniques to recover from soft errors. The copy-based source code level soft error recovery method generates copy variables for the program variables and, after an error is detected, determines whether the original variable or the copy variable holds the correct value. The erroneous variable is then restored from the correct one, and the program continues to run. The checkpoint-based source code level soft error recovery method sets checkpoints in a program according to a checkpoint interval. When a checkpoint is taken, the program is usually blocked, and it continues to run after the checkpoint is completed. During the program run, if a detector detects an error, the program rolls back to the most recent checkpoint and continues running from there. When the checkpoint technique is applied to a program, the total running time of the program includes not only the original running time of the program but also the time overhead introduced by the checkpoint method, such as the checkpointing time and the recovery time when an error occurs. Current copy-based source code level soft error recovery methods are costly and may suffer from data overflow problems. Checkpoint-based source code level soft error recovery methods have relatively low overhead; however, their checkpoints are typically placed at isochronous checkpoint intervals (periodic checkpoints).

The outcomes of soft errors can be classified as benign (Benign), crash (Crash), hang (Hang), and silent data corruption (SDC). SDC is the most hidden and most harmful outcome of a soft error; if an SDC occurs and is not handled, the consequences can be extremely severe. There are currently many source code level soft error detection methods for SDC, e.g., assertion-based source code level SDC detection methods, which insert assertion detectors at the source code level to detect SDC errors. To reduce the detection cost, such methods typically insert assertion detectors only at program locations where SDC is highly likely to occur, so the detectors are unevenly distributed in the program.

Existing checkpoint-based source code level soft error recovery methods deploy checkpoints mainly at isochronous checkpoint intervals without considering the position distribution of the detectors, and therefore cannot sufficiently reduce the time overhead. The invention provides a checkpoint soft error recovery method based on detector position, called DPCKPT. DPCKPT takes the detector positions into account and redeploys the checkpoints placed by the isochronous checkpoint interval method, so as to reduce the time overhead introduced by the checkpoint method and thereby reduce the total running time of the program.

Disclosure of Invention

The checkpoint method based on isochronous checkpoint intervals sets checkpoints periodically. Because the detectors in a program are unevenly distributed, this method cannot sufficiently reduce the time overhead introduced by checkpointing. The invention aims to provide a checkpoint soft error recovery method based on detector position that reduces the time overhead introduced by the checkpoint method and thereby reduces the total running time of the program.

In order to achieve the purpose of the invention, the technical scheme adopted by the invention is as follows: a method of checkpoint soft error recovery based on detector location, the method comprising the steps of:

The first step: load a program to which detectors have been added as the input of the method.

The second step: deploy initial checkpoints and divide the program into program segments. The specific steps are as follows:

Initial checkpoints are deployed at an isochronous checkpoint interval, and these checkpoints divide the program into program segments of equal running time. At the program level, c_1 to c_n are the n checkpoints and p_1 to p_n are the n program segments into which the checkpoints divide the program, each with equal running time, as shown in FIG. 1. At the instruction level, the dynamic instructions of the program are correspondingly divided into n instruction segments s_1 to s_n of equal size, where i_1 to i_{n·Δs} denote the dynamic instructions of the program and Δs denotes the instruction segment size. The instructions of instruction segment s_j are i_{m(s_j)} to i_{m(s_j)+Δs-1}, where m(s_j) denotes the number of the first instruction of s_j, as given by equation (1).

m(s_j)＝(j-1)·Δs+1 (1)
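As a minimal sketch of this segmentation, the following Python function computes the first and last dynamic-instruction number of each instruction segment. It assumes that equation (1) has the form m(s_j) = (j-1)·Δs + 1 with Δs equal to the total instruction count divided by n and rounded down, and that the last segment absorbs any remainder. Both assumptions are inferred from the instruction ranges given in the application example, and the name `segment_bounds` is illustrative only.

```python
def segment_bounds(total_instructions: int, n: int) -> list[tuple[int, int]]:
    """Return (first, last) dynamic-instruction numbers for n instruction segments."""
    delta_s = total_instructions // n  # segment size, written as delta-s in the text
    bounds = []
    for j in range(1, n + 1):
        first = (j - 1) * delta_s + 1  # m(s_j), assumed form of equation (1)
        # Assumption: the last segment absorbs any remainder instructions.
        last = total_instructions if j == n else j * delta_s
        bounds.append((first, last))
    return bounds


# Values from the application example: 527179283 instructions, interval T/3.
print(segment_bounds(527179283, 3))
```

With the 527179283 dynamic instructions and three segments of the application example, this yields the ranges [1, 175726427], [175726428, 351452854], and [351452855, 527179283].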

The third step: calculating the time overhead of the program segment; the method comprises the following specific steps:

The time overhead of a program segment is the time overhead incurred in tolerating the errors detected by the detectors within the program segment, and includes the checkpointing overhead and the recovery overhead. The checkpointing overhead is the time required to save the running state of the program, shown as t₁ in FIG. 2. The recovery overhead is the time required to restore the running state of the program saved at the checkpoint and to re-run the program from the checkpoint position to the position where the detector reported the error, i.e., the historical state recovery time plus the current state recovery time. When a detector detects an error, it issues an error report, and the position of the detector issuing the report is called the failure point. Taking F in FIG. 2 as a failure point, the historical state recovery time and the current state recovery time for handling the error reported at F are denoted t₂ and t₃, respectively. It is assumed here that the time to set a checkpoint at a given position equals the time to restore the program state saved at that position, i.e., t₁ = t₂.

The time for restoring the running state of the program saved by the checkpoint is used as the historical state recovery time, and the time required to execute the instructions between the checkpoint and the detector is used as the current state recovery time. It is assumed that at most one soft error occurs during program execution. Because a program segment contains multiple detectors, the current state recovery time of the segment is taken to be the average of the current state recovery times over the errors detected by each detector in the segment. Thus, the recovery overhead r(p_j) of program segment p_j can be expressed as equation (2), where the first term, sr(c_j), represents the time required to save or restore the state at checkpoint c_j, i.e., the checkpointing time or the historical state recovery time. The second term represents the average current state recovery time, in which c(p_j) is the number of dynamic detectors in p_j, θ is the average execution time of a single instruction of the program, and a(p_j, k) is the number of the instruction corresponding to the k-th dynamic detector of p_j. Note that a source code level detector corresponds to several instructions; the number of the first of these instructions is used as the instruction number of the detector. Finally, the time overhead of p_j can be expressed as equation (3):

o(p_j)＝sr(c_j)+r(p_j) (3)
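The calculation of the third step can be sketched in Python as follows. The sketch assumes that equation (2) has the form r(p_j) = sr(c_j) + θ · (1/c(p_j)) · Σ_k (a(p_j, k) - m(s_j)), that is, the restore time plus the mean re-execution time over all dynamic detectors. This form is inferred from the term descriptions above and from the arithmetic of the application example, and whether the distance includes an offset of one instruction is not stated. The function names and the toy input values are illustrative and are not taken from the patent.

```python
def recovery_overhead(sr_cj: float, theta: float,
                      detector_instrs: list[int], first_instr: int) -> float:
    """r(p_j): restore time plus mean re-execution time (assumed form of equation (2))."""
    if not detector_instrs:
        raise ValueError("segment has no dynamic detectors")
    # Instructions re-executed from the checkpoint to each detector (assumed a - m).
    redo = [a - first_instr for a in detector_instrs]
    return sr_cj + theta * sum(redo) / len(redo)


def time_overhead(sr_cj: float, theta: float,
                  detector_instrs: list[int], first_instr: int) -> float:
    """o(p_j) = sr(c_j) + r(p_j), equation (3)."""
    return sr_cj + recovery_overhead(sr_cj, theta, detector_instrs, first_instr)


# Toy values: checkpoint at instruction 100, three dynamic detectors.
o = time_overhead(sr_cj=0.5, theta=0.001, detector_instrs=[200, 300, 500], first_instr=100)
print(round(o, 4))
```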

The fourth step: judge whether the checkpoints in each program segment are sufficient and, if they are not, add an additional checkpoint to the program segment. The specific steps are as follows:

The time overhead of each program segment is obtained in the third step. Next, for each program segment, the change in its time overhead after a checkpoint is added to it is evaluated, and the sufficiency of its checkpoints is judged from that change. If the checkpoints are insufficient, an additional checkpoint is added to the segment. The invention only considers adding a checkpoint in the middle of a program segment, so only the change in time overhead after adding a checkpoint in the middle of the segment is evaluated. The analysis is described below using program segment p_j as an example.

Suppose that a checkpoint ac_j is added in the middle of p_j, as shown in FIG. 3. After ac_j is added, the time overhead of p_j changes. On the one hand, part of p_j is now recovered through ac_j instead of c_j, so the recovery overhead of p_j changes. The changed recovery overhead can be expressed as equation (4), where A represents the changed historical state recovery time, B represents the changed current state recovery time, and u(p_j), which appears in B, represents the number of dynamic detectors in the first half of p_j. On the other hand, the checkpointing overhead of p_j also changes, and the changed checkpointing overhead can be expressed as equation (5).

r(p_j)′＝A+B (4)

sr(p_j)′＝sr(c_j)+sr(ac_j) (5)

Thus, after ac_j is added, the time overhead of p_j can be expressed as equation (6), and the change in the time overhead of p_j is given by equation (7). When ag(p_j) is greater than 0, adding checkpoint ac_j to p_j reduces the time overhead of p_j; this indicates that the checkpoints in p_j are insufficient, so a checkpoint is added in the middle of p_j. Otherwise, the checkpoints in p_j are sufficient, and no checkpoint needs to be added to p_j.

o(p_j)′＝r(p_j)′+sr(p_j)′ (6)

ag(p_j)＝o(p_j)-o(p_j)′ (7)
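The sufficiency test of equations (4) to (7) can be sketched in Python as follows. The expressions for A and B are not written out in the text, so the sketch assumes the forms used in the arithmetic of the application example: A is the simple average of sr(c_j) and sr(ac_j), and B is θ times the mean number of instructions re-executed when first-half detectors roll back to c_j and second-half detectors roll back to ac_j. The midpoint position of ac_j, the function name `add_gain`, and the toy inputs are likewise assumptions.

```python
def add_gain(sr_cj: float, sr_acj: float, theta: float,
             detector_instrs: list[int], first_instr: int, last_instr: int) -> float:
    """ag(p_j) of equation (7); a positive value means the extra checkpoint pays off."""
    n = len(detector_instrs)
    mid = (first_instr + last_instr) // 2  # assumed instruction position of ac_j

    # Before adding ac_j: every detector rolls back to c_j (equations (2) and (3)).
    r_old = sr_cj + theta * sum(a - first_instr for a in detector_instrs) / n
    o_old = sr_cj + r_old

    # After adding ac_j: first-half detectors roll back to c_j, the rest to ac_j.
    a_term = (sr_cj + sr_acj) / 2  # A, assumed simple average
    redo = sum(a - (first_instr if a < mid else mid) for a in detector_instrs)
    b_term = theta * redo / n      # B
    r_new = a_term + b_term        # equation (4)
    sr_new = sr_cj + sr_acj        # equation (5)
    o_new = r_new + sr_new         # equation (6)
    return o_old - o_new           # equation (7)


# Toy segment [0, 1000]: detectors clustered near the end, then near the start.
print(add_gain(0.05, 0.05, 0.001, [900, 950, 990], 0, 1000) > 0)
print(add_gain(0.05, 0.05, 0.001, [10, 20, 30], 0, 1000) > 0)
```

In this toy setting the extra checkpoint helps only when the detectors sit far from c_j, which is the effect of uneven detector placement that the method exploits.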

The fifth step: judge whether the checkpoint of each program segment can be deleted, and delete the deletable checkpoints. The specific steps are as follows:

When the checkpoints of a program segment are sufficient, the change in time overhead after deleting the checkpoint of the program segment is evaluated. If the time overhead decreases, the checkpoint of the program segment is redundant and is deleted; otherwise, the checkpoint is not redundant and is retained. A detailed description is given below using program segment p_j as an example.

After checkpoint c_j is deleted, errors detected by the detectors within p_j are handled by the nearest preceding checkpoint, c_{j-1}. p_j and its nearest preceding program segment p_{j-1} are regarded as one program segment p_m, as shown in FIG. 4. The time overhead of p_m after deleting c_j is calculated to determine whether c_j can be deleted. When c_j is retained, the time overhead of p_m can be expressed as equation (8). After c_j is deleted, the time overhead of p_m can be expressed as equation (9). The reduction in the time overhead of p_m after deleting c_j can then be expressed as equation (10). If dg(p_m) is greater than 0, the time overhead of p_m decreases after c_j is deleted, and c_j is deleted. Otherwise, c_j is retained.

dg(p_m)＝o(p_m)-o(p_m)′ (10)
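The deletability test can be sketched in Python as follows. Equations (8) and (9) are not written out in the text, so the sketch assumes the forms used in the arithmetic of the application example: with c_j retained, the overhead is the mean re-execution time plus the average of the two restore times plus both checkpointing times; with c_j deleted, every detector rolls back to c_{j-1} and the overhead is the mean re-execution time plus sr(c_{j-1}) counted once for restoring and once for checkpointing. The function name `delete_gain` and the toy inputs are illustrative.

```python
def delete_gain(sr_prev: float, sr_cj: float, theta: float,
                prev_detectors: list[int], cur_detectors: list[int],
                prev_first: int, cur_first: int) -> float:
    """dg(p_m) of equation (10); a positive value means c_j can be deleted."""
    n = len(prev_detectors) + len(cur_detectors)

    # Retain c_j: each detector rolls back to the checkpoint of its own segment.
    redo_keep = (sum(a - prev_first for a in prev_detectors)
                 + sum(a - cur_first for a in cur_detectors))
    o_keep = theta * redo_keep / n + (sr_prev + sr_cj) / 2 + sr_prev + sr_cj  # assumed eq. (8)

    # Delete c_j: every detector in the merged segment rolls back to c_{j-1}.
    redo_del = sum(a - prev_first for a in prev_detectors + cur_detectors)
    o_del = theta * redo_del / n + sr_prev + sr_prev                          # assumed eq. (9)

    return o_keep - o_del                                                     # equation (10)


# Toy segments [0, 999] and [1000, 1999] with cheap, then expensive, re-execution.
print(delete_gain(0.05, 0.06, 1e-5, [900], [1010, 1020], 0, 1000) > 0)
print(delete_gain(0.05, 0.06, 1e-3, [900], [1010, 1020], 0, 1000) > 0)
```

Deleting c_j trades a saved checkpoint against longer rollbacks, so in this toy setting it pays off only when re-executing instructions is cheap relative to taking a checkpoint.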

Through the above steps, the redeployment of the checkpoints placed by the isochronous checkpoint interval method is completed.

The checkpoint method with isochronous checkpoint intervals is selected for comparison, with its checkpoint interval set to T/4 and T/3, respectively, where T is the original running time of the program. The beneficial effect of the invention is that the total running time of the program is reduced. Programs such as replace, bitstrng, rad2deg, and isqrt from the Siemens and MiBench suites were chosen for testing, and BLCR (Berkeley Lab Checkpoint/Restart) was used as the checkpoint tool. The percentage reduction in total program running time achieved by the DPCKPT method is denoted ep, where ep = (t_fix - t_dpckpt)/t_fix, t_fix is the total program running time obtained with the isochronous checkpoint interval method, and t_dpckpt is the total program running time obtained after DPCKPT redeploys the checkpoints of the isochronous checkpoint interval method. Faults were injected into the programs to simulate single event upsets, and the proposed method was evaluated.

Compared with the prior art, the invention has the following advantages. Relative to the checkpoint method with isochronous checkpoint intervals, the method of the invention reduces the total running time of the program when a single event upset occurs and the upset disables the detector: after DPCKPT redeploys the checkpoints of the T/4 and T/3 isochronous checkpoint interval methods, the total program running time is reduced by 15% and 11.4%, respectively. The method also reduces the total program running time when a single event upset occurs: after DPCKPT redeploys the checkpoints of the T/4 and T/3 isochronous checkpoint interval methods, the total program running time is reduced by 16% and 11%, respectively.

Drawings

FIG. 1 is a schematic diagram of deploying initial checkpoints and dividing program segments according to the present invention;

FIG. 2 is a schematic diagram of the time overhead of a program segment according to the present invention;

FIG. 3 is a schematic diagram of adding a checkpoint according to the present invention;

FIG. 4 is a schematic diagram of deleting a checkpoint according to the present invention;

FIG. 5 is a flow chart of the checkpoint soft error recovery method based on detector position according to the present invention.

Detailed Description

The present invention is further illustrated by the following examples. These examples are purely illustrative and do not limit the scope of the invention; after reading the present invention, modifications of equivalent form made by those skilled in the art fall within the scope defined by the appended claims.

Example 1: referring to FIGS. 1 to 5, a checkpoint soft error recovery method based on detector position comprises the following steps:

Initial checkpoints are deployed at an isochronous checkpoint interval, and these checkpoints divide the program into program segments of equal running time. At the program level, as shown in FIG. 1, c_1 to c_n are the n checkpoints and p_1 to p_n are the n program segments into which the checkpoints divide the program, each with equal running time. At the instruction level, the dynamic instructions of the program are correspondingly divided into n instruction segments s_1 to s_n of equal size, where i_1 to i_{n·Δs} denote the dynamic instructions of the program and Δs denotes the instruction segment size. The instructions of instruction segment s_j are i_{m(s_j)} to i_{m(s_j)+Δs-1}, where m(s_j)＝(j-1)·Δs+1 is the number of the first instruction of s_j, as given by equation (1).

The time for restoring the running state of the program saved by the checkpoint is used as the historical state recovery time, and the time required to execute the instructions between the checkpoint and the detector is used as the current state recovery time. It is assumed that at most one soft error occurs during program execution. Because a program segment contains multiple detectors, the current state recovery time of the segment is taken to be the average of the current state recovery times over the errors detected by each detector in the segment. Thus, the recovery overhead r(p_j) of program segment p_j can be expressed as equation (2), where the first term, sr(c_j), represents the time required to save or restore the state at checkpoint c_j, i.e., the checkpointing time or the historical state recovery time. The second term represents the average current state recovery time, in which c(p_j) is the number of dynamic detectors in p_j, θ is the average execution time of a single instruction of the program, and a(p_j, k) is the number of the instruction corresponding to the k-th dynamic detector of p_j. Note that a source code level detector corresponds to several instructions; the number of the first of these instructions is used as the instruction number of the detector. Finally, the time overhead of p_j can be expressed as equation (3):

o(p_j)＝sr(c_j)+r(p_j) (3)

As shown in FIG. 3, suppose that a checkpoint ac_j is added in the middle of p_j. After ac_j is added, the time overhead of p_j changes. On the one hand, part of p_j is now recovered through ac_j instead of c_j, so the recovery overhead of p_j changes. The changed recovery overhead can be expressed as equation (4), where A represents the changed historical state recovery time, B represents the changed current state recovery time, and u(p_j), which appears in B, represents the number of dynamic detectors in the first half of p_j. On the other hand, the checkpointing overhead of p_j also changes, and the changed checkpointing overhead can be expressed as equation (5).

r(p_j)′＝A+B (4)

sr(p_j)′＝sr(c_j)+sr(ac_j) (5)

o(p_j)′＝r(p_j)′+sr(p_j)′ (6)

ag(p_j)＝o(p_j)-o(p_j)′ (7)

As shown in FIG. 4, after checkpoint c_j is deleted, errors detected by the detectors within p_j are handled by the nearest preceding checkpoint, c_{j-1}. p_j and its nearest preceding program segment p_{j-1} are regarded as one program segment p_m, and the time overhead of p_m after deleting c_j is calculated to determine whether c_j can be deleted. When c_j is retained, the time overhead of p_m can be expressed as equation (8). After c_j is deleted, the time overhead of p_m can be expressed as equation (9). The reduction in the time overhead of p_m after deleting c_j can then be expressed as equation (10). If dg(p_m) is greater than 0, the time overhead of p_m decreases after c_j is deleted, and c_j is deleted. Otherwise, c_j is retained.

dg(p_m)＝o(p_m)-o(p_m)′ (10)

The checkpoint method with isochronous checkpoint intervals is selected for comparison, with its checkpoint interval set to T/4 and T/3, respectively, where T is the original running time of the program. The beneficial effect of the invention is that the total running time of the program is reduced. Programs such as replace, bitstrng, rad2deg, and isqrt from the Siemens and MiBench suites were chosen for testing, and BLCR (Berkeley Lab Checkpoint/Restart) was used as the checkpoint tool. The percentage reduction in total program running time achieved by the DPCKPT method is denoted ep, where ep = (t_fix - t_dpckpt)/t_fix, t_fix is the total program running time obtained with the isochronous checkpoint interval method, and t_dpckpt is the total program running time obtained after DPCKPT redeploys the checkpoints of the isochronous checkpoint interval method. Faults were injected into the programs to simulate single event upsets, and the proposed method was evaluated.

Application example 1: the flow chart of the checkpoint soft error recovery method based on detector position is shown in FIG. 5. FIG. 5 consists of two parts. The left part deploys initial checkpoints at an isochronous checkpoint interval, divides the program into program segments, and calculates the time overhead of the program segments. The right part is the process of redeploying checkpoints for each program segment: the upper right judges the sufficiency of the checkpoints in the program segment and adds a checkpoint to the segment if they are insufficient, and the lower right judges the deletability of the checkpoint of the program segment and deletes it if it can be deleted.

The first step: load a program to which detectors have been added as the input of the method. Taking the rad2deg program in the MiBench test suite as an example, a soft error detection method based on application logic invariant assertions is used to generate detectors for rad2deg. The detectors are then inserted into the original program. Finally, the program with detectors is used as the input of the method.

The second step: deploy initial checkpoints and divide the program into program segments. In the isochronous checkpoint interval method, the checkpoint interval is set to T/3. The original running time of the program is 1.2 s, so a checkpoint is set every 0.4 s. The program executes 527179283 instructions in total; according to equation (1), the instruction ranges of program segments p₁ to p₃ are [1, 175726427], [175726428, 351452854], and [351452855, 527179283], respectively. The checkpoint redeployment process is described below using p₂ and p₃ as examples.

The third step: calculate the time overhead of the program segments. For program segment p₂, a checkpoint is taken at 0.4 s of the program using the BLCR checkpoint tool, giving a checkpointing time sr(c₂) = 0.13 s. Analysis of the instructions executed by the program shows that the detectors in p₂ are executed 9195000 times in total, i.e., p₂ has 9195000 dynamic detectors, so c(p₂) = 9195000. The analysis also shows that if every assertion detector in p₂ detects an error and rolls back to c₂, a total of 80317.79 × 10^10 instructions are re-executed, so each assertion detector re-executes 80317.79 × 10^10 / 9195000 = 87349418 instructions on average. The average running time of a single instruction of the program is 1200000000 ns / 527179283 ≈ 2.28 ns. Thus, the average current state recovery time for an error in p₂ is 87349418 × 2.28 ns ≈ 0.2 s, and equation (2) gives the recovery overhead of p₂ as r(p₂) = 0.13 s + 0.2 s = 0.33 s. Equation (3) then gives the time overhead of p₂ as o(p₂) = 0.13 s + 0.33 s = 0.46 s. In the same manner, o(p₃) = 0.58 s.
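The arithmetic of this step can be re-traced with a few lines of Python. The inputs are the figures quoted in the paragraph above, the constant names are illustrative, and the rounded value of 2.28 ns per instruction is used as in the text.

```python
THETA = 2.28e-9               # s per instruction, rounded as in the text
SR_C2 = 0.13                  # s, BLCR checkpointing time at 0.4 s
DETECTORS_P2 = 9_195_000      # c(p2), dynamic detectors in p2
REDO_TOTAL_P2 = 80317.79e10   # instructions re-executed if every detector rolls back to c2

theta_exact = 1.2 / 527179283           # s per instruction before rounding
mean_redo = REDO_TOTAL_P2 / DETECTORS_P2
current_state = mean_redo * THETA       # average current state recovery time
r_p2 = SR_C2 + current_state            # recovery overhead of p2
o_p2 = SR_C2 + r_p2                     # equation (3)

print(f"theta = {theta_exact * 1e9:.2f} ns, r(p2) = {r_p2:.2f} s, o(p2) = {o_p2:.2f} s")
```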

The fourth step: judge whether the checkpoints in each program segment are sufficient and, if they are not, add an additional checkpoint to the program segment. For p₂, a checkpoint ac₂ is added at its middle position, i.e., at 0.6 s. Taking a checkpoint at 0.6 s of the program with the BLCR checkpoint tool gives a checkpointing time of 0.14 s, i.e., sr(ac₂) = 0.14 s. Thus, the historical state recovery time of p₂ is (sr(c₂) + sr(ac₂))/2 = (0.13 s + 0.14 s)/2 = 0.135 s, i.e., A in equation (4) is 0.135 s. ac₂ divides p₂ into two halves: errors in the first half are recovered through c₂, and errors in the second half are recovered through ac₂. Analysis of the instructions executed by the program shows that the detectors in the first half of p₂ are executed 4650000 times in total, i.e., u(p₂) = 4650000, and the detectors in the second half are executed 4545000 times in total. If every assertion detector in the first half of p₂ detects an error and rolls back to the nearest checkpoint, a total of 20406.20 × 10^10 instructions are re-executed; if every assertion detector in the second half detects an error and rolls back to the nearest checkpoint, a total of 19970.58 × 10^10 instructions are re-executed. Thus, when a detector in p₂ detects an error and a rollback is performed, (20406.20 × 10^10 + 19970.58 × 10^10)/9195000 = 43911669 instructions need to be re-executed on average, and the current state recovery time of p₂ is 43911669 × 2.28 ns ≈ 0.1 s, i.e., B in equation (4) is 0.1 s. Equation (4) therefore gives r(p₂)′ = 0.135 s + 0.1 s = 0.235 s. Equation (5) gives the checkpointing time of p₂ after adding ac₂ as sr(p₂)′ = 0.13 s + 0.14 s = 0.27 s. Equation (6) gives the time overhead of p₂ after adding ac₂ as o(p₂)′ = 0.27 s + 0.235 s = 0.505 s. Equation (7) gives the reduction in the time overhead of p₂ after adding ac₂ as ag(p₂) = o(p₂) - o(p₂)′ = 0.46 s - 0.505 s < 0. This shows that the time overhead of p₂ increases after ac₂ is added, so no checkpoint needs to be added to p₂. By the same analysis, the time overhead of p₃ also increases after a checkpoint is added in the middle of p₃, so no checkpoint needs to be added to p₃ either.
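The arithmetic of the fourth step can be re-traced in the same way. The inputs are the figures quoted above, the constant names are illustrative, and o(p₂) = 0.46 s is taken from the third step.

```python
THETA = 2.28e-9                  # s per instruction, rounded as in the text
SR_C2, SR_AC2 = 0.13, 0.14       # s, checkpointing times at 0.4 s and 0.6 s
DETECTORS_P2 = 9_195_000         # 4650000 in the first half + 4545000 in the second
REDO_FIRST_HALF = 20406.20e10    # instructions re-executed, rollback to c2
REDO_SECOND_HALF = 19970.58e10   # instructions re-executed, rollback to ac2
O_P2 = 0.46                      # s, time overhead of p2 from the third step

a_term = (SR_C2 + SR_AC2) / 2                                          # A
b_term = THETA * (REDO_FIRST_HALF + REDO_SECOND_HALF) / DETECTORS_P2   # B
r_new = a_term + b_term          # equation (4)
sr_new = SR_C2 + SR_AC2          # equation (5)
o_new = r_new + sr_new           # equation (6)
ag = O_P2 - o_new                # equation (7)

print(f"o(p2)' = {o_new:.3f} s, ag(p2) = {ag:.3f} s, add checkpoint: {ag > 0}")
```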

The fifth step: judge whether the checkpoint of each program segment can be deleted, and delete the deletable checkpoints. The deletability of the checkpoint of p₃ is described as an example. p₂ was not merged with its preceding program segment, so the preceding program segment of p₃ is still p₂, and p₂ and p₃ are regarded as one program segment p₂₃. First, the time overhead of p₂₃ when checkpoint c₃ is retained is analyzed. Checkpointing with BLCR at 0.4 s and 0.8 s of the program takes 0.13 s and 0.19 s, respectively, i.e., sr(c₂) = 0.13 s and sr(c₃) = 0.19 s. From the analysis in the third step, c(p₂) = 9195000, and if every assertion detector in p₂ detects an error and rolls back to c₂, a total of 80317.79 × 10^10 instructions are re-executed. Similarly, analysis of the instructions executed by the program shows that the detectors in p₃ are executed 9091393 times in total, i.e., c(p₃) = 9091393, and if every assertion detector in p₃ detects an error and rolls back to c₃, a total of 79879.81 × 10^10 instructions are re-executed. Thus, by equation (8), the time overhead of p₂₃ when c₃ is retained is (80317.79 × 10^10 + 79879.81 × 10^10)/(9195000 + 9091393) × (2.28/1000000000) + (0.13 + 0.19)/2 + 0.13 + 0.19 ≈ 0.68 s. Next, the time overhead of p₂₃ when checkpoint c₃ is deleted is analyzed. When c₃ is deleted, the number of dynamic detectors in p₂₃ is still c(p₂) + c(p₃) = 18286393. Analysis of the instructions executed by the program shows that if every detector in p₂₃ detects an error and rolls back to the nearest checkpoint c₂, a total of 319957.41 × 10^10 instructions are re-executed. By equation (9), the time overhead of p₂₃ when c₃ is deleted is (319957.41 × 10^10 / 18286393) × (2.28/1000000000) + 0.13 + 0.13 ≈ 0.66 s. Finally, by equation (10), the reduction in the time overhead of p₂₃ when c₃ is deleted is 0.68 s - 0.66 s = 0.02 s > 0. Therefore, c₃ is deleted.
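The arithmetic of the fifth step can be re-traced in the same way. The inputs are the figures quoted above and the constant names are illustrative.

```python
THETA = 2.28e-9                         # s per instruction, rounded as in the text
SR_C2, SR_C3 = 0.13, 0.19               # s, checkpointing times at 0.4 s and 0.8 s
DETECTORS_P23 = 9_195_000 + 9_091_393   # c(p2) + c(p3)
REDO_KEEP = 80317.79e10 + 79879.81e10   # rollback to c2 for p2 and to c3 for p3
REDO_DELETE = 319957.41e10              # every detector in p23 rolls back to c2

# Retain c3: the expression evaluated for equation (8) in the text.
o_keep = THETA * REDO_KEEP / DETECTORS_P23 + (SR_C2 + SR_C3) / 2 + SR_C2 + SR_C3
# Delete c3: the expression evaluated for equation (9) in the text.
o_delete = THETA * REDO_DELETE / DETECTORS_P23 + SR_C2 + SR_C2
dg = o_keep - o_delete                  # equation (10)

print(f"keep: {o_keep:.2f} s, delete: {o_delete:.2f} s, delete c3: {dg > 0}")
```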

It should be noted that the above-mentioned embodiments are only examples of the present invention, and are not intended to limit the scope of the present invention, and equivalents and substitutions made on the above-mentioned technical solutions are included in the scope of the present invention.

Claims

1. A checkpoint soft error recovery method based on detector position, the method comprising the following steps:

the first step: loading a program to which detectors have been added as the input of the method;

the second step: deploying initial checkpoints and dividing the program into program segments;

the third step: calculating the time overhead of the program segments;

the fourth step: judging the sufficiency of the checkpoints in each program segment and, if the checkpoints are insufficient, adding an additional checkpoint to the program segment;

the fifth step: judging the deletability of the checkpoint of each program segment, and deleting the deletable checkpoints.

2. The checkpoint soft error recovery method based on detector position of claim 1, wherein the second step, deploying initial checkpoints and dividing the program into program segments, specifically comprises:

setting checkpoints in the program at regular intervals using the checkpoint method based on an isochronous checkpoint interval, so that the checkpoints divide the program into program segments of equal running time.

3. The checkpoint soft error recovery method based on detector position of claim 1, wherein the third step, calculating the time overhead of the program segments, specifically comprises:

the time overhead of a program segment is the time incurred in tolerating the errors detected by the detectors in the program segment, and comprises the checkpointing time and the recovery time; the time required to save the running state of the program is taken as the checkpointing time; the recovery time comprises the historical state recovery time and the current state recovery time, namely the time to restore the running state of the program saved by the checkpoint and the time required to re-run the program from the checkpoint position to the position where the detector reported the error; for each program segment, the checkpointing time and the recovery time of the program segment are calculated, and their sum is taken as the time overhead of the program segment.

4. The checkpoint soft error recovery method based on detector position of claim 1, wherein the fourth step, judging the sufficiency of the checkpoints in each program segment and, if the checkpoints are insufficient, adding an additional checkpoint to the program segment, specifically comprises:

for each program segment, evaluating the checkpointing time, the current state recovery time, and the historical state recovery time of the program segment after a checkpoint is added, and thereby obtaining the time overhead of the program segment with the added checkpoint; comparing the time overhead of the program segment with the added checkpoint against the time overhead of the program segment without it; if the time overhead is reduced, the checkpoints in the program segment are insufficient and the checkpoint is added to the program segment; otherwise, the checkpoints in the program segment are sufficient and no checkpoint is added to the program segment.

5. The checkpoint soft error recovery method based on detector position of claim 1, wherein the fifth step, judging the deletability of the checkpoint of each program segment and deleting the deletable checkpoints, specifically comprises:

when the checkpoints of a program segment are sufficient, evaluating the deletability of the checkpoint of the program segment; after the checkpoint of the program segment is deleted, errors in the program segment are handled by the checkpoint of the nearest preceding program segment, and the program segment and its nearest preceding program segment are regarded as one program segment, called p_m; the time overhead of p_m after deleting the checkpoint of the program segment is evaluated, and if the time overhead is reduced, the checkpoint of the program segment is redundant and is deleted; otherwise, the checkpoint of the program segment is not redundant and is not deleted; if the program segment preceding the current program segment is p_j and, after the analysis of the fourth step, a checkpoint has been added to p_j, the newly added checkpoint divides p_j into two program segments, p_j1 and p_j2, and in this case the nearest preceding program segment of the current program segment is p_j2 instead of p_j.