CN111124720A - Self-adaptive check point interval dynamic setting method - Google Patents


Info

Publication number
CN111124720A
Authority
CN
China
Prior art keywords
interval
soft error
seiht
flag
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911361269.1A
Other languages
Chinese (zh)
Other versions
CN111124720B (en)
Inventor
虞致国
常龙鑫
顾晓峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN201911361269.1A
Publication of CN111124720A
Application granted
Publication of CN111124720B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/006 - Identification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 - Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/81 - Threshold

Abstract

The invention discloses a self-adaptive method for dynamically setting the checkpoint interval, belonging to the technical field of processor fault tolerance. The method uses a soft error interval history table (SEIHT) and a pattern history table (PHT) to predict the next soft error interval, and lengthens or shortens the checkpoint interval according to the comparison between the predicted soft error interval and a set threshold, thereby adaptively adjusting the checkpoint interval. By drawing on both the long-term and the short-term characteristics of the soft error history, the method predicts the soft error interval accurately and with good timeliness; it can rapidly learn and predict any repeating pattern, so it copes with complex or unknown soft error distributions and is broadly applicable.

Description

Self-adaptive check point interval dynamic setting method
Technical Field
The invention relates to a self-adaptive check point interval dynamic setting method, and belongs to the technical field of processor fault tolerance.
Background
Checkpointing is a common technique for improving the reliability of fault-tolerant systems and an important means of fault recovery. The additional time overhead introduced by checkpointing and rollback recovery consists of the time spent establishing checkpoints and the time spent recovering after a failure. The checkpoint interval strongly affects task execution time: setting checkpoints more often shortens the recovery time after a failure but increases the time spent establishing checkpoints, while setting them less often reduces the impact of checkpointing on task execution but lengthens the recovery time. The checkpoint interval must therefore balance checkpoint establishment overhead against fault recovery overhead.
At present, most checkpoint-based fault-tolerant systems still use a fixed checkpoint interval (see Oliner A J, Rudolph L, Sahoo R K. "Cooperative checkpointing: a robust approach to large-scale systems reliability". Proceedings of the International Conference on Supercomputing, 2006). A fixed interval is optimal only when system failures follow a Poisson distribution, and that assumption does not fully match reality.
Existing work proposes predicting the occurrence of soft errors in the near future from historical soft error information and adjusting subsequent checkpoint intervals accordingly; this is known as dynamic checkpointing.
The conventional approach in dynamic checkpointing is to calculate an optimal checkpoint interval (see Young J W. "A first order approximation to the optimum checkpoint interval". Communications of the ACM, 1974, 17(9): 530-).
How much historical failure information to use when calculating the average failure rate is an important question, and the best amount is often hard to determine. With too much history, the estimate reflects only the long-term characteristics of system failures and cannot track short-term changes in the failure rate, so the soft error rate prediction has poor timeliness. With too little history, the estimate may be distorted by incidental factors and mispredict the failure rate.
Existing dynamic checkpoint algorithms take the historical average soft error rate directly as the average failure rate for a coming period (that is, they predict the future average failure rate from the historical failure rate) and then recompute the checkpoint interval with the optimal checkpoint formula for the Poisson case. This way of predicting the soft error distribution (interval) suits long-term characteristics, but because the historical average soft error rate cannot promptly reflect changes in the soft error distribution over the short term, its short-term prediction accuracy is low. In summary, conventional dynamic checkpoint algorithms achieve neither good timeliness nor high short-term prediction accuracy.
In addition, existing dynamic checkpoint algorithms are generally suitable only when failures follow a Poisson distribution, and are poorly suited to cases where the soft error distribution is complex or unknown (for example, when the working environment of the system changes constantly).
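The Poisson-case optimum that the Background refers to can be sketched with the standard first-order argument. The failure rate λ and the overhead function O are symbols introduced here for illustration, and the derivation assumes that the checkpoint cost C is small compared with the interval T_c and that a failure loses, on average, half an interval of work.

```latex
O(T_c) \approx \frac{C}{T_c} + \frac{\lambda T_c}{2},
\qquad
\frac{dO}{dT_c} = -\frac{C}{T_c^{2}} + \frac{\lambda}{2} = 0
\;\Longrightarrow\;
T_c^{\mathrm{opt}} = \sqrt{\frac{2C}{\lambda}}
```

The first term is the checkpointing overhead per unit time and the second is the expected rework per unit time, which is the same trade-off described above between checkpoint establishment overhead and fault recovery overhead.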
Disclosure of Invention
Existing dynamic checkpoint algorithms take the historical failure rate directly as the average failure rate for the next period, which leads to poor timeliness or low prediction accuracy, and the optimal checkpoint formula is not suited to soft errors that do not follow a Poisson distribution. To solve these problems, the invention provides a method for dynamically setting the checkpoint interval.
Optionally, the method includes:
step one: increment the soft error count N by 1, record the time indicated by the processor's cycle counter when the soft error is detected, and execute step two;
step two: determine whether this is the first record; if not, execute step three; if so, execute step four;
step three: denote the previously recorded time as t_0 and the time recorded this time as t_1, take Δt = t_1 - t_0 as the time interval from the previous soft error to the current one, assign t_1 to T_final, the time of the latest soft error in the system, and execute step five;
step four: denote the recorded time as t_0, assign t_0 to both T_start, the time of the first soft error in the system, and T_start', the earliest soft error occurrence time recorded in the SEIHT, and end the algorithm;
step five: determine whether the soft error interval history table SEIHT is full; if so, execute step six; if not, execute step seven;
step six: assign the timestamp corresponding to the highest bit SEIHT.FLAG.HSB of the soft error interval flag in the SEIHT to T_start', after which the storage holding that timestamp may be overwritten to record a new soft error occurrence time; then calculate the soft error interval threshold T
[formula image BDA0002337225890000021 not reproduced]
where T is the average of the k soft error intervals recorded in the SEIHT; then execute step seven;
step seven: compare the Δt calculated in step three with the soft error interval threshold T, and update the soft error interval flag SEIHT.FLAG and the corresponding timestamp SEIHT.TIMESTAMP according to the result:
if Δt ≥ T, the interval is regarded as a "long interval": SEIHT.FLAG is shifted left by one bit, 1 is written to its lowest bit SEIHT.FLAG.LSB, and the XLEN-bit timestamp corresponding to that flag bit is assigned t_0;
if Δt < T, the interval is regarded as a "short interval": SEIHT.FLAG is shifted left by one bit, 0 is written to its lowest bit SEIHT.FLAG.LSB, and the XLEN-bit timestamp corresponding to that flag bit is assigned t_0;
then execute step eight;
step eight: predict the soft error occurrence interval; first determine whether this is the first prediction; if not, execute step nine; if so, execute step ten;
step nine: update the pattern history table PHT entry used by the previous prediction according to SEIHT.FLAG.LSB and the state transition diagram of the two-bit saturating counter, and execute step ten;
step ten: using the k-bit value of SEIHT.FLAG as the index, query the PHT, which has 2^k entries, and execute step eleven;
step eleven: predict the soft error occurrence interval from the value of the two-bit saturating counter retrieved by the query in step ten; if the counter value is "00" or "01", the next soft error interval is predicted to be a "short interval" and the checkpoint interval T_c is assigned the shorter interval T_s;
if the counter value is "10" or "11", the next soft error interval is predicted to be a "long interval" and the checkpoint interval T_c is assigned the longer interval T_l.
Optionally, T_l > T_s, where the shorter interval T_s is given by
[formula image BDA0002337225890000031 not reproduced]
where C denotes the time overhead of establishing a checkpoint and p denotes the soft error occurrence rate over the period from system startup until the latest soft error,
[formula image BDA0002337225890000032 not reproduced]
where T_start denotes the time at which the first soft error occurred in the system.
Optionally, T_l = 2T_s.
Optionally, the time overhead C of establishing a checkpoint is determined by system characteristics, including the micro-architecture of the processor and the characteristics of the data that needs to be protected.
Optionally, the soft error interval history table SEIHT has k entries, each consisting of two parts: a 1-bit soft error interval flag indicating whether the soft error interval is not less than the threshold T, and an XLEN-bit timestamp corresponding to that flag; the k flags together form a k-bit shift register, denoted SEIHT.FLAG.
Optionally, the pattern history table PHT creates a two-bit saturating counter for each of the 2^k patterns of SEIHT.FLAG, in which "00" indicates "strong short interval", "01" indicates "weak short interval", "10" indicates "weak long interval", and "11" indicates "strong long interval".
Optionally, the value of k is determined according to specific requirements of the system in terms of performance, power consumption, area, and the like.
Optionally, the soft error interval flag of the soft error interval history table SEIHT indicates whether the soft error interval is not less than the threshold T, as follows:
if the interval is greater than or equal to T, SEIHT.FLAG is shifted left by one bit and 1 is written to SEIHT.FLAG.LSB, indicating a "long interval";
if the interval is less than T, SEIHT.FLAG is shifted left by one bit and 0 is written to SEIHT.FLAG.LSB, indicating a "short interval".
Optionally, if the lowest bit of SEIHT.FLAG is 1, there are two cases: if the current state is not "11", the saturating counter moves in the direction of "10" and "11"; if the current state is "11", the state does not change.
If the lowest bit of SEIHT.FLAG is 0, there are two cases: if the current state is not "00", the saturating counter moves in the direction of "01" and "00"; if the current state is "00", the state does not change.
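The two update rules above can be sketched as a single function. This is a minimal illustration, not the patent's circuit: the name `update_counter` is hypothetical, and it assumes the counter moves exactly one state per observation, which is the usual reading of a two-bit saturating counter (the state transition diagram itself is not reproduced in this text).

```python
STRONG_SHORT, WEAK_SHORT, WEAK_LONG, STRONG_LONG = 0b00, 0b01, 0b10, 0b11


def update_counter(state: int, flag_lsb: int) -> int:
    """Return the next state of a two-bit saturating counter.

    flag_lsb is SEIHT.FLAG.LSB: 1 for a long interval, 0 for a short one.
    Assumption: the counter moves by one state per observation.
    """
    if flag_lsb == 1:
        # Move toward "11", saturating there.
        return min(state + 1, STRONG_LONG)
    # Move toward "00", saturating there.
    return max(state - 1, STRONG_SHORT)
```

Under this reading, two consecutive contrary observations are needed to flip a "strong" state to the opposite prediction, which is what lets the counter ignore a single incidental outlier.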
Optionally, the method dynamically predicts the next soft error interval from the SEIHT and PHT as follows:
using the pattern in SEIHT.FLAG as the index, query the state of the corresponding two-bit counter in the PHT; if the state is "00" or "01", the next soft error interval is predicted to be a short interval;
if the state is "10" or "11", the next soft error interval is predicted to be a long interval;
the checkpoint interval is then adjusted as follows: if the next soft error interval is predicted to be a short interval, the checkpoint interval T_c is assigned T_s; otherwise, T_c is assigned T_l.
The invention also provides a microprocessor soft error fault-tolerant system, which adopts the dynamic setting method of the check point interval to set the check point interval.
The invention has the beneficial effects that:
1) Unlike traditional dynamic checkpoint methods, which struggle to take both the long-term and the short-term characteristics of the soft error interval into account, the method reduces the adverse effect of incidental factors on soft error interval prediction and improves both the timeliness and the short-term accuracy of that prediction.
Traditional dynamic checkpointing takes the historical soft error rate directly as the soft error rate for a coming period. The method of the invention instead uses an adaptive prediction module based on the SEIHT and PHT to predict the next soft error interval accurately from the short-term characteristics of the soft error intervals, and then adjusts the checkpoint interval to a shorter or a longer value according to the prediction. It takes the average failure rate from the start time to the latest time as the reference for the new checkpoint interval, so that the long-term characteristics of the soft error distribution are also considered.
2) The self-adaptive prediction structure can rapidly learn and predict any repeated mode, so that the method can cope with the conditions of complex and unknown soft error distribution and has strong universality.
3) The hardware part required for realizing the method mainly comprises an SEIHT and a PHT, and the method is easy to realize.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is an algorithmic flow chart of the method of the present invention.
Fig. 2 is a diagram of the hardware architecture of the method of the present invention.
Fig. 3 is a schematic diagram of the SEIHT update procedure in the method of the present invention.
FIG. 4 is a state transition diagram of a two-bit saturating counter included in the PHT of the present invention.
FIG. 5 is a flow chart of soft error interval prediction and checkpoint interval adjustment in the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The first embodiment is as follows:
This embodiment provides a method for dynamically setting the checkpoint interval. The method uses a Soft Error Interval History Table (SEIHT) and a Pattern History Table (PHT) to predict the next soft error interval, compares the predicted soft error interval with a set threshold, and lengthens or shortens the checkpoint interval according to the comparison, so that the checkpoint interval adapts to both the long-term and the short-term characteristics of the history information.
The method uses an adaptive prediction structure to predict the soft error interval and set the checkpoint interval dynamically, which improves the timeliness and short-term accuracy of soft error interval prediction and thereby reduces the average execution time of processor tasks. It copes with complex, unknown, non-Poisson soft error distributions and is broadly applicable. The hardware it requires consists mainly of an SEIHT and a PHT and is easy to implement.
As shown in fig. 1, the method for dynamically setting the checkpoint interval provided by the present application includes the following specific processes:
step one: increment the soft error count N by 1, record the time indicated by the processor's cycle counter when the soft error is detected, and execute step two;
step two: determine whether this is the first record; if not, execute step three; if so, execute step four;
step three: denote the previously recorded time as t_0 and the time recorded this time as t_1, take Δt = t_1 - t_0 as the time interval from the previous soft error to the current one, assign t_1 to T_final, the time of the latest soft error in the system, and execute step five;
step four: denote the recorded time as t_0, assign t_0 to both T_start, the time of the first soft error in the system, and T_start', the earliest soft error occurrence time (timestamp) recorded in the SEIHT, and end the algorithm;
step five: determine whether the soft error interval history table SEIHT is full; if so, execute step six; if not, execute step seven;
step six: assign the timestamp corresponding to the highest bit SEIHT.FLAG.HSB of the soft error interval flag in the SEIHT to T_start', after which the storage holding that timestamp may be overwritten to record a new soft error occurrence time; then calculate the soft error interval threshold T
[formula image BDA0002337225890000061 not reproduced]
where T is the average of the k soft error intervals recorded in the SEIHT; then execute step seven;
step seven: compare the Δt calculated in step three with the soft error interval threshold T, and update the soft error interval flag SEIHT.FLAG and the corresponding timestamp SEIHT.TIMESTAMP according to the result:
if Δt ≥ T, the interval is regarded as a "long interval": SEIHT.FLAG is shifted left by one bit, 1 is written to its lowest bit SEIHT.FLAG.LSB, and the XLEN-bit timestamp corresponding to that flag bit is assigned t_0;
if Δt < T, the interval is regarded as a "short interval": SEIHT.FLAG is shifted left by one bit, 0 is written to its lowest bit SEIHT.FLAG.LSB, and the XLEN-bit timestamp corresponding to that flag bit is assigned t_0;
then execute step eight;
step eight: predict the soft error occurrence interval; first determine whether this is the first prediction; if not, execute step nine; if so, execute step ten;
step nine: update the pattern history table PHT entry used by the previous prediction according to SEIHT.FLAG.LSB and the state transition diagram of the two-bit saturating counter, and execute step ten;
step ten: using the k-bit value of SEIHT.FLAG as the index, query the PHT, which has 2^k entries, and execute step eleven;
step eleven: predict the soft error occurrence interval from the value of the two-bit saturating counter retrieved by the query in step ten; if the counter value is "00" or "01", the next soft error interval is predicted to be a "short interval" and the checkpoint interval T_c is assigned the shorter interval T_s;
if the counter value is "10" or "11", the next soft error interval is predicted to be a "long interval" and the checkpoint interval T_c is assigned the longer interval T_l; the algorithm then ends.
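The eleven steps can be sketched in software as one event handler, called each time a soft error is detected. This is a minimal illustration under several stated assumptions, not the patented hardware. The class name `AdaptiveCheckpointer` and its fields are hypothetical. The formula images are not reproduced in this text, so the sketch assumes T = (T_final - T_start') / k, p = N / (T_final - T_start) and T_s = sqrt(2C / p), in the spirit of the Poisson-case optimum the Background cites, together with the stated T_l = 2T_s. It also assumes counters start at "00", the threshold is 0 until the SEIHT first fills, and detection times strictly increase.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AdaptiveCheckpointer:
    k: int                        # SEIHT depth
    c: float                      # checkpoint time overhead C
    n: int = 0                    # soft error count N
    t_start: float = 0.0          # T_start
    t_final: float = 0.0          # T_final (also the previously recorded time)
    threshold: float = 0.0        # T; assumed 0 until the SEIHT first fills
    flag: int = 0                 # SEIHT.FLAG shift register
    stamps: List[float] = field(default_factory=list)  # SEIHT.TIMESTAMP, oldest first
    pht: List[int] = field(default_factory=list)       # 2**k two-bit counters
    last_index: Optional[int] = None                   # PHT entry used last time

    def __post_init__(self) -> None:
        self.pht = [0] * (1 << self.k)    # assumed initial state "00"

    def on_soft_error(self, now: float) -> Optional[float]:
        """Process one detected soft error; return the new T_c (None on the first)."""
        self.n += 1                                    # step one
        if self.n == 1:                                # steps two and four
            self.t_start = self.t_final = now
            return None
        t0, t1 = self.t_final, now                     # step three
        dt = t1 - t0
        self.t_final = t1
        if len(self.stamps) == self.k:                 # steps five and six
            t_start_prime = self.stamps.pop(0)         # timestamp at FLAG.HSB
            self.threshold = (self.t_final - t_start_prime) / self.k  # assumed form
        is_long = dt >= self.threshold                 # step seven
        self.flag = ((self.flag << 1) | int(is_long)) & ((1 << self.k) - 1)
        self.stamps.append(t0)
        if self.last_index is not None:                # steps eight and nine
            ctr = self.pht[self.last_index]
            self.pht[self.last_index] = min(ctr + 1, 3) if is_long else max(ctr - 1, 0)
        self.last_index = self.flag                    # step ten
        p = self.n / (self.t_final - self.t_start)     # assumed form of p
        t_s = math.sqrt(2.0 * self.c / p)              # assumed form of T_s
        return 2.0 * t_s if self.pht[self.flag] >= 2 else t_s   # step eleven
```

Feeding the detection times 0, 8, 10 and 11 with k = 2 fills the table on the fourth error, at which point this sketch computes T = (11 - 0) / 2 = 5.5 and classifies the last interval (1 time unit) as short.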
In the method of this embodiment, the shorter interval T_s and the longer interval T_l are the two possible values of the checkpoint interval, with T_l > T_s. Both values are based on the long-term characteristics of the soft error history. The shorter interval T_s is given by
[formula image BDA0002337225890000071 not reproduced]
where C denotes the time overhead of establishing a checkpoint, whose value must be determined from specific system characteristics such as the micro-architecture of the processor and the data that needs to be protected, and p denotes the soft error occurrence rate from system startup to the latest soft error,
[formula image BDA0002337225890000072 not reproduced]
where T_start denotes the time at which the first soft error occurred in the system; the longer interval is T_l = 2T_s.
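Because the formula images for T_s and p are not reproduced here, the following is only a plausible sketch of how the two candidate intervals could be computed. It assumes p = N / (T_final - T_start) and T_s = sqrt(2C / p), the first-order Poisson-case optimum, and uses the stated relation T_l = 2T_s. The function name `checkpoint_intervals` and its parameter names are hypothetical.

```python
import math
from typing import Tuple


def checkpoint_intervals(c: float, n: int, t_start: float, t_final: float) -> Tuple[float, float]:
    """Return (T_s, T_l) from the long-term soft error history.

    c       : time overhead of establishing one checkpoint (C)
    n       : number of soft errors observed so far (N)
    t_start : time of the first soft error (T_start)
    t_final : time of the latest soft error (T_final)
    """
    if t_final <= t_start:
        raise ValueError("need at least two soft errors at distinct times")
    p = n / (t_final - t_start)      # assumed form of the occurrence rate p
    t_s = math.sqrt(2.0 * c / p)     # assumed form of the shorter interval T_s
    return t_s, 2.0 * t_s            # T_l = 2 * T_s as stated in the text


t_s, t_l = checkpoint_intervals(c=2.0, n=4, t_start=0.0, t_final=100.0)
print(f"T_s = {t_s:.3f}, T_l = {t_l:.3f}")
```

With C = 2 and four errors over 100 time units, the assumed rate is p = 0.04, so T_s comes out at about 10 and T_l at about 20 in the same time units.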
The hardware architecture of this embodiment consists mainly of a soft error interval history table SEIHT with k entries and a pattern history table PHT containing 2^k two-bit saturating counters, as shown in Fig. 2. Each SEIHT entry consists of two parts: a 1-bit soft error interval flag and an XLEN-bit timestamp. The flag indicates whether a soft error interval is a "short interval" or a "long interval", and the timestamp records the time at which the corresponding soft error occurred. The SEIHT therefore holds k soft error interval flags (together denoted SEIHT.FLAG) and k corresponding XLEN-bit timestamps (denoted SEIHT.TIMESTAMP). The PHT creates one two-bit saturating counter for each of the 2^k patterns of SEIHT.FLAG; the value of SEIHT.FLAG is used as the index into the PHT, so each pattern of SEIHT.FLAG has its own dedicated two-bit saturating counter.
In this embodiment, SEIHT.FLAG is a k-bit shift register formed by the 1-bit soft error interval flags of all k entries. It records whether each soft error interval Δt is greater than or equal to the threshold T: if so, SEIHT.FLAG is shifted left by one bit and 1 is written to its lowest bit SEIHT.FLAG.LSB, marking a "long interval"; otherwise, SEIHT.FLAG is shifted left by one bit and 0 is written to SEIHT.FLAG.LSB, marking a "short interval". SEIHT.TIMESTAMP holds the soft error occurrence times corresponding to each bit of SEIHT.FLAG. The update procedure of the SEIHT is shown in Fig. 3.
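The SEIHT update described above can be modelled in software as a masked shift register plus a bounded queue of timestamps. The class `Seiht` and its method names are hypothetical, and the sketch treats "full" as "k timestamps stored", which is an assumption about how the hardware tracks occupancy.

```python
from collections import deque


class Seiht:
    """Software model of the soft error interval history table (illustrative)."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.flag = 0                        # SEIHT.FLAG, k-bit shift register
        self.timestamps = deque(maxlen=k)    # SEIHT.TIMESTAMP, oldest first

    def is_full(self) -> bool:
        return len(self.timestamps) == self.k

    def oldest_timestamp(self) -> float:
        # Timestamp paired with SEIHT.FLAG.HSB once the table is full.
        return self.timestamps[0]

    def record(self, dt: float, threshold: float, t0: float) -> int:
        """Shift in one interval flag and store its timestamp; return the new LSB."""
        bit = 1 if dt >= threshold else 0    # 1 = long interval, 0 = short interval
        self.flag = ((self.flag << 1) | bit) & ((1 << self.k) - 1)
        self.timestamps.append(t0)           # deque drops the oldest entry when full
        return bit
```

Masking with `(1 << k) - 1` discards the bit shifted out of the highest position, which corresponds to the oldest entry's storage being released for reuse.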
In this embodiment, the PHT creates a two-bit saturating counter for each of the 2^k patterns of SEIHT.FLAG, in which "00" denotes "strong short interval", "01" denotes "weak short interval", "10" denotes "weak long interval", and "11" denotes "strong long interval". The two-bit saturating counter is updated according to the state transition rule shown in the state transition diagram of Fig. 4. If the lowest bit of SEIHT.FLAG is 1, there are two cases: if the current state is not "11", the saturating counter moves in the direction of "10" and "11"; if the current state is "11", the state does not change. If the lowest bit of SEIHT.FLAG is 0, there are two cases: if the current state is not "00", the saturating counter moves in the direction of "01" and "00"; if the current state is "00", the state does not change.
In this embodiment, the flow of soft error interval prediction and checkpoint interval adjustment is shown in Fig. 5. The SEIHT and the PHT dynamically predict the next soft error interval as follows:
using the pattern in SEIHT.FLAG as the index, query the state of the corresponding two-bit counter in the PHT; if the state is "00" or "01", the next soft error interval is predicted to be a short interval; if the state is "10" or "11", the next soft error interval is predicted to be a long interval. The checkpoint interval is then adjusted as follows: if the next soft error interval is predicted to be a short interval, the checkpoint interval T_c is assigned T_s; otherwise, T_c is assigned T_l.
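The lookup and adjustment just described amount to one table read and one comparison. The sketch below is illustrative only; the function name `select_checkpoint_interval` is hypothetical, and the PHT is modelled as a plain list of integers in the range 0 to 3.

```python
from typing import Sequence


def select_checkpoint_interval(pht: Sequence[int], seiht_flag: int,
                               t_s: float, t_l: float) -> float:
    """Return the checkpoint interval T_c chosen by the predictor.

    pht        : 2**k two-bit counter states, indexed by the SEIHT.FLAG pattern
    seiht_flag : current k-bit value of SEIHT.FLAG
    t_s, t_l   : the shorter and longer candidate intervals
    """
    counter = pht[seiht_flag]
    # "10" and "11" predict a long interval; "00" and "01" predict a short one.
    return t_l if counter >= 0b10 else t_s
```

Only the upper bit of the counter decides the prediction; the lower bit provides the hysteresis that keeps a single contrary observation from flipping it.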
[formula images BDA0002337225890000081 and BDA0002337225890000082 not reproduced]
T_l = 2T_s
Unlike traditional dynamic checkpoint methods, which struggle to take both the long-term and the short-term characteristics of the soft error interval into account, the method reduces the adverse effect of incidental factors on soft error interval prediction and improves both the timeliness and the short-term accuracy of that prediction.
Traditional dynamic checkpointing takes the historical soft error rate directly as the soft error rate for a coming period. The method of the invention instead uses an adaptive prediction module based on the SEIHT and PHT to predict the next soft error interval accurately from the short-term characteristics of the soft error intervals, and then adjusts the checkpoint interval to a shorter or a longer value according to the prediction. It takes the average failure rate from the start time to the latest time as the reference for the new checkpoint interval, so that the long-term characteristics of the soft error distribution are also considered.
In addition, the self-adaptive prediction structure can rapidly learn and predict any repeated mode, so that the method can cope with the conditions of complex and unknown soft error distribution and has strong universality.
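The claim about repeated patterns can be illustrated with a toy simulation of the two-level structure on a strictly alternating long/short sequence. This is a sketch under assumptions, not a measurement: the function `simulate` is hypothetical, it works on interval flags only (no timestamps or threshold), and it assumes counters start at "00" and move one state per observation.

```python
from typing import List


def simulate(outcomes: List[int], k: int = 2) -> List[int]:
    """Return, for each observed flag, the prediction made for the NEXT flag.

    outcomes: sequence of interval flags (1 = long interval, 0 = short interval).
    """
    mask = (1 << k) - 1
    pht = [0] * (1 << k)      # assumed initial counter state "00"
    history = 0               # plays the role of SEIHT.FLAG
    last_index = None         # PHT entry used by the previous prediction
    predictions = []
    for outcome in outcomes:
        if last_index is not None:
            # Train the entry that produced the previous prediction.
            ctr = pht[last_index]
            pht[last_index] = min(ctr + 1, 3) if outcome else max(ctr - 1, 0)
        history = ((history << 1) | outcome) & mask
        last_index = history
        predictions.append(1 if pht[history] >= 2 else 0)
    return predictions


outcomes = [1, 0] * 10
predictions = simulate(outcomes)
hits = [p == nxt for p, nxt in zip(predictions, outcomes[1:])]
print(f"correct predictions: {sum(hits)} of {len(hits)}")
```

In this toy run the mispredictions are confined to the first four predictions; once the counter for the "10" pattern has been trained upward, every later prediction matches the alternating sequence. A single long-run average rate could not follow such an alternation.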
The hardware part required for realizing the method mainly comprises an SEIHT and a PHT, and the method is easy to realize.
Some steps in the embodiments of the present invention may be implemented by software, and corresponding software programs may be stored in a readable storage medium, such as a main memory, a magnetic disk, a hard disk, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method for dynamically setting a checkpoint interval, characterized in that, after a soft error occurs, a soft error interval history table SEIHT and a pattern history table PHT are updated according to the time interval Δt from the previous soft error to the current soft error, the next soft error interval is predicted using the SEIHT and the PHT, the predicted time interval from the current soft error to the next soft error is compared with a set threshold T, and the frequency of checkpoint setting is determined according to the comparison result, thereby dynamically adjusting the checkpoint interval.
2. The method according to claim 1, characterized in that it comprises:
the method comprises the following steps: adding 1 to the value of the soft error occurrence frequency N, recording the time indicated by a period counter of the processor when the soft error is detected, and executing the step two;
step two: judging whether the record is the first record or not, if not, executing the third step; as a result, step four is performed;
step three: the last recorded time is recorded as t0Recording the time as t1Changing Δ t to t1-t0As the time interval from the last soft error to the current soft error, and the time T of the latest soft error in the systemfinalAssigned a value of t1Executing the step five;
step four: recording the time as t0And the time T of the first soft error in the systemstartAnd the earliest soft error occurrence time T recorded by SEIHTstart' average value is t0And ending the algorithm;
step five: judging whether the soft error interval history table SEIHT is full, if so, executing a sixth step; if not, executing the step seven;
step six: assigning a timestamp of a most significant bit SEIHT.FLAG.HSB of a soft error interval FLAG in SEIHT to Tstart' after that, the storage space where the time stamp is located is allowed to be covered and is allowed to be used for recording the new soft error occurrence time, and a soft error interval threshold value is calculated
Figure FDA0002337225880000011
T represents the average soft error interval of k soft error intervals recorded in SEIHT, and step seven is executed;
step seven: comparing the delta T calculated in the third step with the soft error interval threshold value T, and updating a soft error interval mark SEIHT.FLAG of SEIHT and a corresponding timestamp SEIHT.TIMESTAMP according to the comparison result:
if Δ T ≧ T, the interval is considered as "long interval", SEIHT.FLAG is shifted to the left by one bit, while 1 is written into the lowest bit SEIHT.FLAG.LSB of the soft error interval FLAG in SEIHT, and the timestamp TIMESTAMP of the XLEN bit corresponding to the FLAG bit is assigned as T0
If Δ t<T, considering the interval as "short interval", shift SEIHT. FLAG by one bit to the left, while writing 0 to the lowest bit SEIHT. flag.lsb of the soft error interval FLAG in SEIHT, assign a timestamp TIMESTAMP of the XLen bit corresponding to this FLAG bit as T0
Executing the step eight;
step eight: predicting the soft error occurrence interval, firstly judging whether the prediction is carried out for the first time or not, and if not, executing the ninth step; as a result, step ten is performed;
step nine: updating a mode history table PHT item used by the last prediction according to the SEIHT, FLAG, LSB and a state transition diagram of a two-bit saturation counter, and executing a step ten;
step ten: flag with k bits of seiht value as index, query has 2kThe PHT of each table entry executes the step eleven;
step eleven: predict the soft error occurrence interval according to the value of the two-bit saturating counter queried in step ten: if the counter value is "00" or "01", the next soft error interval is predicted to be a "short interval", and the checkpoint interval Tc is assigned the shorter interval Ts;
if the counter value is "10" or "11", the next soft error interval is predicted to be a "long interval", and the checkpoint interval Tc is assigned the longer interval Tl.
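Steps eight to eleven resemble a two-level branch predictor, and can be sketched in Python as below. The class `IntervalPredictor`, its method names, and the initial counter value of "01" are assumptions; the claims do not state how the PHT is initialised. The sketch also assumes each update moves the counter by exactly one state.

```python
from typing import List, Optional, Tuple


class IntervalPredictor:
    """Hypothetical model of steps eight to eleven."""

    def __init__(self, k: int, t_short: float, t_long: float, initial: int = 0b01):
        if t_long <= t_short:
            raise ValueError("t_long must exceed t_short")
        self.k = k
        self.t_short = t_short
        self.t_long = t_long
        self.pht: List[int] = [initial] * (1 << k)   # 2**k two-bit counters
        self.last_index: Optional[int] = None         # None until first prediction

    def predict(self, seiht_flag: int) -> Tuple[str, float]:
        flag = seiht_flag & ((1 << self.k) - 1)
        # Steps eight and nine: on every prediction except the first, update
        # the entry used last time with the newest outcome (the flag's LSB).
        if self.last_index is not None:
            outcome = flag & 1
            c = self.pht[self.last_index]
            self.pht[self.last_index] = min(c + 1, 3) if outcome else max(c - 1, 0)
        # Step ten: index the PHT with the k-bit flag value.
        self.last_index = flag
        counter = self.pht[flag]
        # Step eleven: "10"/"11" predict a long interval, "00"/"01" a short one.
        if counter >= 0b10:
            return "long", self.t_long
        return "short", self.t_short
```

The returned pair is the predicted interval class and the value to assign to the checkpoint interval Tc.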
3. The method of claim 2, wherein Tl > Ts, and the shorter interval is
Figure FDA0002337225880000021
where C denotes the time overhead of setting a checkpoint and p denotes the soft error occurrence rate over the period from system startup until the most recent soft error,
Figure FDA0002337225880000022
where Tstart represents the time at which the first soft error occurred in the system.
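The formula images for Ts and p are not reproduced in this text. Purely as an illustration of how a checkpoint interval can be derived from the two quantities named in the claim, the sketch below uses Young's first-order approximation, sqrt(2C/p). This is a swapped-in, well-known formula; whether claim 3 uses this exact form is an assumption, and the function name is hypothetical.

```python
import math


def young_interval(c: float, p: float) -> float:
    """Young's first-order checkpoint interval, sqrt(2 * C / p).

    c: time overhead of setting one checkpoint
    p: soft error occurrence rate (errors per unit time)
    Assumption: used here only as a stand-in for the patent's formula image.
    """
    if c <= 0 or p <= 0:
        raise ValueError("c and p must be positive")
    return math.sqrt(2.0 * c / p)
```

With C = 2 time units and p = 0.01 errors per time unit, this gives an interval of 20 time units.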
4. The method of claim 3, wherein the soft error interval history table SEIHT has k entries, each consisting of two parts: a 1-bit soft error interval flag indicating whether the soft error interval is greater than the threshold T, and an XLEN-bit timestamp corresponding to that flag; the soft error interval flags together form a k-bit shift register, denoted SEIHT.FLAG.
5. The method of claim 4, wherein the pattern history table PHT establishes a two-bit saturating counter for each of the 2^k patterns of SEIHT.FLAG, in which "00" indicates "strong short interval", "01" indicates "weak short interval", "10" indicates "weak long interval", and "11" indicates "strong long interval".
6. The method of claim 5, wherein using the soft error interval flag of the soft error interval history table SEIHT to indicate whether the soft error interval is greater than the threshold T comprises:
if the interval is greater than or equal to the threshold, SEIHT.FLAG is shifted left by one bit and 1 is written into SEIHT.FLAG.LSB, indicating that this interval is a "long interval";
if the interval is less than the threshold, SEIHT.FLAG is shifted left by one bit and 0 is written into SEIHT.FLAG.LSB, indicating that this interval is a "short interval".
7. The method of claim 6, wherein if the lowest bit of SEIHT.FLAG is 1, there are two cases: if the current state is not "11", the state of the saturating counter moves in the direction of "10" and "11"; if the current state is "11", the state does not change.
8. The method of claim 7, wherein if the lowest bit of SEIHT.FLAG is 0, there are two cases: if the current state is not "00", the state of the saturating counter moves in the direction of "01" and "00"; if the current state is "00", the state does not change.
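The transitions in claims 7 and 8 are those of a conventional two-bit saturating counter. A minimal Python sketch follows, assuming each update moves the counter by exactly one state; the names `STATE_NAMES` and `next_state` are hypothetical.

```python
# State labels taken from claim 5.
STATE_NAMES = {
    0b00: "strong short interval",
    0b01: "weak short interval",
    0b10: "weak long interval",
    0b11: "strong long interval",
}


def next_state(state: int, lsb: int) -> int:
    """Return the counter state after observing SEIHT.FLAG.LSB."""
    if state not in STATE_NAMES:
        raise ValueError("state must be a two-bit value")
    if lsb == 1:
        # Long interval observed: step toward "11", saturating there.
        return state if state == 0b11 else state + 1
    if lsb == 0:
        # Short interval observed: step toward "00", saturating there.
        return state if state == 0b00 else state - 1
    raise ValueError("lsb must be 0 or 1")
```

Starting from "01", two consecutive long intervals take the counter to "11", and any further long intervals leave it there.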
9. The method of claim 8, wherein dynamically predicting the next soft error interval based on SEIHT and PHT comprises:
using the SEIHT.FLAG pattern as the index, querying the state value of the corresponding two-bit counter in the PHT; if the state value is "00" or "01", predicting the next soft error interval to be a short interval;
if the state value is "10" or "11", predicting the next soft error interval to be a long interval;
and the checkpoint interval is adjusted as follows: if the next soft error interval is predicted to be a short interval, the checkpoint interval Tc is assigned the value Ts; otherwise, the checkpoint interval Tc is assigned the value Tl.
10. A microprocessor soft-error-tolerant system, wherein the system sets checkpoints using the adaptive checkpoint interval dynamic setting method of any one of claims 1 to 9.
CN201911361269.1A 2019-12-26 2019-12-26 Self-adaptive check point interval dynamic setting method Active CN111124720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911361269.1A CN111124720B (en) 2019-12-26 2019-12-26 Self-adaptive check point interval dynamic setting method

Publications (2)

Publication Number Publication Date
CN111124720A (en) 2020-05-08
CN111124720B (en) 2021-05-04

Family

ID=70502662

Country Status (1)

Country Link
CN (1) CN111124720B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002075483A2 (en) * 2001-01-11 2002-09-26 Nortel Networks Limtied Check pointing of processing context of network accounting components
US20060143528A1 (en) * 2004-12-27 2006-06-29 Stratus Technologies Bermuda Ltd Systems and methods for checkpointing
CN102369514A (en) * 2011-08-31 2012-03-07 华为技术有限公司 Method and system for establishing detection points
CN103197982A (en) * 2013-03-28 2013-07-10 哈尔滨工程大学 Task local optimum check point interval searching method
CN104657229A (en) * 2015-03-19 2015-05-27 哈尔滨工业大学 Multi-core processor rollback recovering system and method based on high-availability hardware checking point
CN105718355A (en) * 2016-01-21 2016-06-29 中国人民解放军国防科学技术大学 Online learning-based super computer node active fault-tolerant method
CN106383995A (en) * 2016-09-05 2017-02-08 南京臻融软件科技有限公司 Node failure relevance-based check point placing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
印杰: "Dynamic checkpoint setting under complex failure distributions", 《小型微型计算机系统》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111682981A (en) * 2020-06-02 2020-09-18 深圳大学 Check point interval setting method and device based on cloud platform performance
WO2021244066A1 (en) * 2020-06-02 2021-12-09 深圳大学 Method and apparatus for setting checkpoint interval on the basis of performance of cloud platform
CN112131034A (en) * 2020-09-22 2020-12-25 东南大学 Checkpoint soft error recovery method based on detector position
CN112131034B (en) * 2020-09-22 2023-07-25 东南大学 Checkpoint soft error recovery method based on detector position
CN112445641A (en) * 2020-11-05 2021-03-05 德州职业技术学院(德州市技师学院) Operation maintenance method and system for big data cluster
CN116361060A (en) * 2023-05-25 2023-06-30 中国地质大学(北京) Multi-feature-aware stream computing system fault tolerance method and system
CN116361060B (en) * 2023-05-25 2023-09-15 中国地质大学(北京) Multi-feature-aware stream computing system fault tolerance method and system

Also Published As

Publication number Publication date
CN111124720B (en) 2021-05-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant