CN112269686B - LUTRAM self-repairing structure and self-repairing method based on cold backup dual-mode error detection code - Google Patents

Info

Publication number
CN112269686B
Authority
CN
China
Prior art keywords
fault
module
sub
self
repairing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011178870.XA
Other languages
Chinese (zh)
Other versions
CN112269686A (en)
Inventor
张砦
刘燕
黄莉莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202011178870.XA
Publication of CN112269686A
Application granted
Publication of CN112269686B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1456 Hardware arrangements for backup
    • G06F11/1458 Management of the backup or restore process
    • G06F11/1469 Backup restoration techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)
  • For Increasing The Reliability Of Semiconductor Memories (AREA)

Abstract

The invention discloses a LUTRAM self-repairing structure and self-repairing method based on a cold backup dual-mode error detection code. The self-repairing structure comprises two hot backup sub-modules, a cold backup sub-module and a control module, which form two repair structures: one based on dual-mode error detection code refreshing and one based on cold backup error detection code replacement. The structure improves the reliability of a look-up table memory under both transient faults and permanent faults, and the transient fault and permanent fault self-repairing methods built on it provide a new approach to the design of FPGA memory self-repair.

Description

LUTRAM self-repairing structure and self-repairing method based on cold backup dual-mode error detection code
Technical Field
The invention belongs to the field of design of reconfigurable hardware fault-tolerant methods, and particularly relates to an LUTRAM self-repairing structure and a self-repairing method applied to fault-tolerant design of a Field Programmable Gate Array (FPGA) memory.
Background
The SRAM-type FPGA, as a programmable ASIC (Application Specific Integrated Circuit), can change its logic function at any time, has rich logic resources, and is flexible, parallel and highly integrated; it is widely used in space electronic equipment in aerospace environments. LUTRAM is a smaller and faster memory built from the LUTs in FPGA logic units and provides better access to memory contents. At present, all SRAM-type circuit devices in an FPGA are highly susceptible to high-intensity radiation and prone to single event effects. LUTRAM, being an SRAM-type storage device, is very likely to suffer storage bit flips and even permanent failures under single event effects. If these faults are not repaired, normal operation of the system can be seriously affected, so system reliability must be considered, and the reliability of LUTRAM in particular must be addressed.
Single event effects differ in the nature of their impact on the device and can be categorized into transient faults, which are recoverable, and permanent faults, which are non-recoverable. LUTRAM has a high transient fault rate but also a non-zero permanent fault rate, and a permanent fault, once it occurs, has a far greater effect than a transient fault. The permanent fault problem therefore needs to be considered together with the transient fault problem of LUTRAM.
Current fault-tolerance methods for improving the reliability of LUTRAM memory include the dual-mode error detection code DEDC (Double Error Detection Code), triple modular redundancy TMR (Triple Modular Redundancy), the error correction code ECC (Error Correction Code) and memory refreshing (Memory-Scrubbing). The basic principle of DEDC is to detect faults through coding and, when the dual-mode comparison indicates a fault, to select and output only the data of the fault-free module. The basic principle of TMR is to keep multiple hot backups of the data and output only the correct result through majority voting. The basic principle of ECC is to add an error correction code to the data stored in the memory and output only the correct result through decoding and error correction. These three methods only mask the fault and do not repair it. Memory-Scrubbing is classified as periodic refresh, which is performed at regular intervals, or aperiodic refresh, which is performed when a failed memory cell is read; the latter is more suitable for small memories (e.g., LUTRAM). Its basic principle is to write the correct output data back to the failed cell, which can repair LUTRAM faults but requires locating the failed cell.
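As a rough illustration of the masking and repair principles just described, the following Python sketch shows a bitwise TMR voter, a DEDC-style output selector, and an aperiodic scrub write-back. The function names and the dictionary used as a memory are hypothetical and are not taken from the patent.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


def dedc_select(word1: int, fault1: bool, word2: int, fault2: bool) -> int:
    """Output the word of a module whose own error detection reports no fault."""
    if not fault1:
        return word1
    if not fault2:
        return word2
    raise ValueError("both modules report a fault; DEDC cannot mask this case")


def scrub(memory: dict, addr: int, correct_word: int) -> None:
    """Aperiodic scrubbing: write the correct word back to the failed cell."""
    memory[addr] = correct_word


# One corrupted copy is outvoted by the other two.
voted = tmr_vote(0b1011, 0b1011, 0b0011)

# Module 1 reports a fault, so module 2's word is selected and written back.
mem1 = {2: 0b0101}
selected = dedc_select(mem1[2], True, 0b0111, False)
scrub(mem1, 2, selected)
```

Masking alone (the first two functions) leaves the corrupted cell in place; only the write-back in `scrub` repairs it, which is why the faulty cell has to be located first.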
Combining the fault masking methods with the refresh repair method achieves both fault location and refresh repair. When LUTRAM fault detection, location and repair are controlled autonomously by the system, the result can be called a LUTRAM self-repairing method. The self-repairing methods formed in this way are DEDC-Scrubbing, TMR-Scrubbing and ECC-Scrubbing. TMR-Scrubbing consumes more control and storage resources, and the ECC-Scrubbing circuit is more complex. DEDC-Scrubbing consumes fewer resources than TMR-Scrubbing and has lower circuit complexity than ECC-Scrubbing, so it is better suited to the transient fault problem of LUTRAM.
With DEDC-Scrubbing, transient fault self-repair depends on locating and refreshing the faulty unit, so the faulty storage unit must be located before it can be refreshed. However, DEDC-Scrubbing still cannot self-repair permanent faults. When a redundant unit with a low fault rate (a cold backup unit) in the LUTRAM memory is used to replace a permanently faulty unit, permanent fault self-repair depends on the condition used to judge the fault type and on the structure used to replace the faulty unit. A judging condition is therefore defined to identify permanent faults, and the low fault rate of the cold backup is exploited to self-repair permanent faults by replacement.
Disclosure of Invention
Object of the invention: to handle different types of faults and improve system reliability, the invention provides a LUTRAM self-repairing structure and self-repairing method based on a cold backup dual-mode error detection code.
In order to achieve the above object, the present invention is realized by the following technical scheme: a LUTRAM self-healing structure based on a cold-backup dual-mode error detection code, comprising:
two hot backup sub-modules, used for outputting encoded data and fault counts to the control circuit;
one or more cold backup sub-modules, used for replacing a hot backup sub-module when that hot backup sub-module fails permanently;
a control circuit, used for locating the faulty hot backup sub-module according to the fault count output by each hot backup sub-module, judging the fault type according to the fault count, and performing fault self-repair according to the fault type;
the fault types include transient faults and permanent faults;
when a transient fault occurs, the control circuit uses the encoded data output by the fault-free hot backup sub-module as the refresh data and asserts the refresh enable of the faulty hot backup sub-module;
when a permanent fault occurs, the control circuit controls the cold backup sub-module to replace the faulty hot backup sub-module.
When a transient fault occurs, the two hot backup sub-modules and the control module form the dual-mode error detection code refreshing repair structure (Double Error Detection Code with Scrubbing, DEDC-Scrubbing). The hot backup sub-modules output a fault indication after detecting the fault, and the control module locates the faulty sub-module and controls its refresh data and refresh enable, so that the transiently faulty sub-module is refreshed and repaired and the dual-mode error detection code repair structure of the system is maintained.
When a permanent fault occurs, the two hot backup sub-modules, the cold backup sub-module and the control module form the cold backup error detection code replacing repair structure (Cold Backup Error Detection Code with Replacing, CBEDC-Replacing). If the faulty sub-module of the two hot backup sub-modules is not repaired by refreshing, the control module converts the cold backup sub-module into a hot backup and stops the read-write enable of the faulty sub-module, so that the permanently faulty sub-module is replaced and the dual-mode error detection code repair structure of the system is maintained.
Further, the hot backup sub-module and the cold backup sub-module have the same composition, and both the hot backup sub-module and the cold backup sub-module comprise:
an encoding module, used for encoding the input data according to the even parity encoding principle and storing the encoded data into the LUTRAM;
a LUTRAM, used for storing the encoded data;
a detection module, used for performing fault detection on the encoded data stored in the LUTRAM according to the even parity decoding principle, and for triggering a fault signal and counting the number of faults when a single-bit or odd-bit failure is detected.
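The encoding and detection modules can be sketched in a few lines of Python. This is a minimal illustration of even parity with a single check bit, not the patent's circuit; the function names and the list-of-bits representation are assumptions made for the example.

```python
def encode_even_parity(data_bits):
    """Encoding module: append one check bit so the number of 1s is even."""
    check_bit = sum(data_bits) % 2
    return list(data_bits) + [check_bit]


def detect_fault(codeword):
    """Detection module: an odd number of 1s means an odd number of bits failed."""
    return sum(codeword) % 2 == 1


word = encode_even_parity([1, 0, 1, 1])  # three 1s in the data, so the check bit is 1

single_flip = list(word)
single_flip[0] ^= 1                      # one bit upset

double_flip = list(word)
double_flip[0] ^= 1
double_flip[1] ^= 1                      # two bit upsets cancel in the parity sum
```

A single check bit flags any odd number of failed bits but cannot see an even number of failures, which is why the text speaks of "single-bit or odd-bit" failures.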
Further, for the same hot backup sub-module, if a transient fault was judged in the previous round of fault detection and a fault is still detected in the next round, the fault is judged to be a permanent fault.
Every fault detected for the first time is treated as a transient fault. If the faulty sub-module is detected as faulty in two consecutive decoding rounds, so that its fault count goes from 1 to 2, it is judged to be a permanently faulty sub-module.
Further, replacing the faulty hot backup sub-module with the cold backup sub-module includes: enabling the read-write enable of the cold backup sub-module and stopping the read-write enable of the faulty hot backup sub-module.
Based on the structure, the invention also provides a LUTRAM self-repairing method based on the cold backup dual-mode error detection code, which comprises the following steps:
receiving the encoded data and the fault count from each hot backup sub-module, locating the faulty hot backup sub-module, and judging the fault type according to the fault count;
performing self-repair on the faulty hot backup sub-module according to the fault type, including: when the fault type is a transient fault, using the encoded data output by the fault-free hot backup sub-module as the refresh data and asserting the refresh enable of the faulty hot backup sub-module; when the fault type is a permanent fault, controlling the cold backup sub-module to replace the faulty hot backup sub-module;
repeating the above steps to complete the self-repair.
Initially the two hot backup sub-modules work simultaneously and the cold backup sub-module does not work. When one hot backup sub-module fails, its detection module detects the data fault (the fault count increases by 1), and the control module locates the faulty sub-module and judges the first fault to be transient. The faulty sub-module is refreshed with the correct encoded data of the other hot backup sub-module and is then decoded and checked again; if the fault count is still 1, the transient fault has been self-repaired.
If the faulty sub-module is detected as faulty in two consecutive decoding rounds and the fault count becomes 2, refreshing cannot repair the fault and the fault is judged to be permanent. The control module then converts the cold backup sub-module into the hot backup state to replace the permanently faulty sub-module; when the system no longer detects a fault, the permanent fault has been self-repaired.
Further, judging the fault type according to the fault count includes:
when the fault count changes only once, the fault type is a transient fault;
when the fault count changes in two consecutive detection rounds, the fault type is a permanent fault.
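The judging rule can be sketched as a function over the per-round detection results of one sub-module. The sketch assumes that "twice continuously" means a fault in two consecutive detection rounds, with a refresh attempted between them; the enum and function names are hypothetical.

```python
from enum import Enum


class FaultType(Enum):
    NONE = "none"
    TRANSIENT = "transient"
    PERMANENT = "permanent"


def classify_rounds(fault_flags):
    """Classify each detection round of one sub-module.

    fault_flags holds one boolean per round: True when the detection
    module reported a fault in that round.
    """
    verdicts = []
    previous_faulty = False
    for faulty in fault_flags:
        if not faulty:
            verdicts.append(FaultType.NONE)
        elif previous_faulty:
            # Still faulty in the round right after a refresh: permanent.
            verdicts.append(FaultType.PERMANENT)
        else:
            # First detection is always treated as transient.
            verdicts.append(FaultType.TRANSIENT)
        previous_faulty = faulty
    return verdicts


verdicts = classify_rounds([False, True, False, True, True])
```

In this reading, an isolated fault that disappears after the refresh never escalates, while a fault that survives the refresh is escalated in the very next round.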
Further, controlling the cold backup sub-module to replace the faulty hot backup sub-module includes: enabling the read-write enable of the cold backup sub-module and stopping the read-write enable of the faulty hot backup sub-module.
The beneficial effects are that: compared with the prior art, the invention has the following advantages:
(1) The LUTRAM self-repairing structure provided by the invention improves the reliability of the FPGA memory in a strong radiation environment through the transient fault repairing structure based on dual-mode error detection code refreshing and the permanent fault repairing structure based on cold backup error detection code replacement, thereby realizing self-repair of the FPGA memory;
(2) According to the LUTRAM self-repairing method provided by the invention, through fault positioning, fault type judgment and fault repairing, the reliability of the FPGA memory in a strong radiation environment is improved, and the self-repairing of the FPGA memory is realized.
Drawings
FIG. 1 is a LUTRAM self-healing structure based on a cold-backup dual-mode error detection code;
FIG. 2 is a LUTRAM self-repair method based on a cold-backup dual-mode error detection code;
FIG. 3 is a timing diagram of LUTRAM normal operation based on a cold-backup dual-mode error detection code;
FIG. 4 is a LUTRAM transient fault self-repairing timing diagram based on DEDC-Scrubbing;
FIG. 5 is a LUTRAM permanent fault self-repairing timing diagram based on CBEDC-Replacing;
FIG. 6 is a Markov model based on different self-healing structures;
FIG. 7 is a graph of LUTRAM reliability R(t) under the influence of transient failure rate s;
FIG. 8 is a graph of LUTRAM mean time to failure MTTF under the influence of transient failure rate s;
FIG. 9 is a graph of LUTRAM reliability R(t) under the influence of permanent failure rate p;
FIG. 10 is a graph of LUTRAM mean time to failure MTTF under the influence of permanent failure rate p.
Detailed Description
The invention provides a LUTRAM self-repairing structure based on a cold backup dual-mode error detection code for repairing transient faults and permanent faults. It comprises two hot backup sub-modules, a cold backup sub-module and a Control Module (CM). Each sub-module has self error detection capability, and the control module performs faulty module location, fault type judgment and fault repair.
The hot backup sub-modules and the cold backup sub-module have the same composition: an Encoding Module (EM), a LUTRAM and a Detection Module (DM). The encoding module derives a check code from the input data according to the even parity encoding principle, combines it with the data to obtain the encoded data, and stores the encoded data into the LUTRAM. The detection module, on the one hand, judges whether the encoded data read out has a unit failure (the encoded data here has only one data bit and one check bit); on the other hand, when a unit failure is detected, it triggers a fault signal, counts the number of faults, and outputs the count to the control module as the key condition for fault location and fault type judgment. The control module locates the faulty sub-module, judges the fault type and controls the corresponding self-repairing process according to the fault counts of the two hot backup sub-modules provided by the detection modules.
For transient faults, the two hot backup sub-modules and the control module form the dual-mode error detection code refreshing repair structure (DEDC-Scrubbing). The hot backup sub-modules output a fault indication after detecting the fault, and the control module locates the faulty sub-module and controls its refresh data and refresh enable, so that the transiently faulty sub-module is refreshed and repaired and the dual-mode error detection code repair structure of the system is maintained.
For permanent faults, the cold backup sub-module is added to the above structure to form the cold backup error detection code replacing repair structure (CBEDC-Replacing). If the faulty sub-module of the two hot backup sub-modules is not repaired by refreshing, the control module converts the cold backup sub-module into a hot backup, that is, it enables the read-write enable of the cold backup sub-module and stops the read-write enable of the faulty sub-module. The permanently faulty sub-module is thereby replaced, the replacement sub-module and the fault-free hot backup sub-module form a new dual-mode error detection code structure, and the dual-mode error detection code repair structure of the system is maintained.
The structure is further described with reference to FIG. 1. In FIG. 1, the encoding module EM1, LUTRAM1 and the detection module DM1 form the first hot backup sub-module, and the encoding module EM2, LUTRAM2 and the detection module DM2 form the second hot backup sub-module. The hot backup sub-modules work as follows. The encoding modules EM1 and EM2 encode the input data din using the even parity encoding principle; under normal working conditions the encoded data are stored through the inputs cin1 and cin2 of LUTRAM1 and LUTRAM2. The detection module DM1 checks the encoded data cout1 read out, according to the even parity error detection principle, to judge whether a bit other than the check bit has failed; when a bit failure occurs, it triggers an internal fault signal and updates the fault count cnt1_flag. Similarly, the detection module DM2 checks the encoded data cout2 read out, according to the even parity error detection principle, to judge whether a bit other than the check bit has failed; when a bit failure occurs, it triggers an internal fault signal and updates the fault count cnt2_flag. The encoded data cout1 and cout2 and the fault counts cnt1_flag and cnt2_flag are passed to the control module CM as the key conditions for fault location and fault type judgment.
The control module CM locates and refreshes the faulty sub-module according to the fault counts cnt1_flag and cnt2_flag. If cnt1_flag=1 and cnt2_flag=0, the first hot backup sub-module is the faulty sub-module and the second hot backup sub-module is fault-free, and the correct encoded data cout2 is output from the second hot backup sub-module. A fault detected for the first time is treated as a transient fault, and the faulty sub-module is refreshed with the correct encoded data to achieve self-repair. Specifically, when the refresh enable signal sr_en1=1, the first hot backup sub-module is refreshed through the selector Mux. If the faulty sub-module is detected as faulty in two consecutive decoding rounds and its fault count goes from 1 to 2, it is judged to be permanently faulty. That is, if cnt1_flag=2 and cnt2_flag=0, the first hot backup sub-module is the faulty sub-module and the fault is permanent; refreshing cannot repair it, and it must be repaired through cold backup replacement. The read enable ce1 of LUTRAM1 in the first hot backup sub-module is therefore set to 0 and the read enable ce3 of LUTRAM3 in the cold backup sub-module is set to 1, which completes the permanent fault self-repairing process. The control module CM thus determines the fault type and selects the repair method by continuously checking the faulty sub-module through cnt1_flag and cnt2_flag.
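The mapping from fault counts to control signals can be sketched as a small decision function. The signal names sr_en1, sr_en2, ce1, ce2 and ce3 follow the description; the dataclass, the function name, and the choice to leave all signals at their defaults when both counters are non-zero are assumptions, since the text does not describe that case.

```python
from dataclasses import dataclass


@dataclass
class ControlSignals:
    sr_en1: int = 0  # refresh enable for the first hot backup sub-module
    sr_en2: int = 0  # refresh enable for the second hot backup sub-module
    ce1: int = 1     # read enable of LUTRAM1
    ce2: int = 1     # read enable of LUTRAM2
    ce3: int = 0     # read enable of LUTRAM3 (cold backup)


def control_module(cnt1_flag: int, cnt2_flag: int) -> ControlSignals:
    """Hypothetical decision logic of the control module CM."""
    signals = ControlSignals()
    if cnt1_flag == 1 and cnt2_flag == 0:
        signals.sr_en1 = 1                 # transient fault in sub-module 1: refresh
    elif cnt2_flag == 1 and cnt1_flag == 0:
        signals.sr_en2 = 1                 # transient fault in sub-module 2: refresh
    elif cnt1_flag >= 2 and cnt2_flag == 0:
        signals.ce1, signals.ce3 = 0, 1    # permanent fault in 1: swap in cold backup
    elif cnt2_flag >= 2 and cnt1_flag == 0:
        signals.ce2, signals.ce3 = 0, 1    # permanent fault in 2: swap in cold backup
    # Both counters non-zero is not covered by the description: signals unchanged.
    return signals


transient_1 = control_module(1, 0)
permanent_2 = control_module(0, 2)
```

In hardware this would be combinational logic feeding the selector Mux and the enable pins; the function form only makes the case analysis explicit.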
Based on the above self-repairing structure, the invention provides a LUTRAM self-repairing method based on a cold backup dual-mode error detection code, which applies a corresponding repair method to each fault type. Specifically:
The two hot backup sub-modules work simultaneously while the cold backup sub-module does not work. If the detection module in one of the hot backup sub-modules detects a data fault, the fault count increases by 1 and the control module locates the faulty hot backup sub-module according to the fault counts. In this method the first fault is judged to be transient, and the encoded data output by the fault-free hot backup sub-module is used to refresh the faulty hot backup sub-module. The faulty sub-module is then decoded again: if the fault count is still 1, the transient fault has been self-repaired; if the fault count is 2, refreshing cannot repair the fault, so the control module converts the cold backup sub-module into the hot backup state to replace the permanently faulty hot backup sub-module. When the system no longer detects a fault, the permanent fault has been self-repaired.
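The whole method can be exercised with a small behavioral model. This is a sketch under stated assumptions, not the patented circuit: a permanent fault is modeled as a stuck bit, the class and method names are hypothetical, and the cold backup is loaded with a copy of the healthy sub-module's contents at replacement time, a step the description does not spell out.

```python
def encode(data_bits):
    """Even parity: append one check bit so the number of 1s is even."""
    return list(data_bits) + [sum(data_bits) % 2]


class SubModule:
    """Behavioral model of one sub-module's LUTRAM plus its detector."""

    def __init__(self):
        self.mem = {}    # address -> stored codeword (list of bits)
        self.stuck = {}  # (address, bit index) -> stuck value (permanent fault)

    def write(self, addr, codeword):
        word = list(codeword)
        for (a, i), value in self.stuck.items():
            if a == addr:
                word[i] = value  # a stuck bit ignores the written value
        self.mem[addr] = word

    def read(self, addr):
        word = list(self.mem[addr])
        return word, sum(word) % 2 == 1  # odd weight means a fault is detected


class SelfRepairingLutram:
    def __init__(self):
        self.hot = [SubModule(), SubModule()]
        self.cold = SubModule()

    def write(self, addr, data_bits):
        for module in self.hot:  # the cold backup is not written
            module.write(addr, encode(data_bits))

    def read(self, addr):
        results = [m.read(addr) for m in self.hot]
        faults = [faulty for _, faulty in results]
        if not any(faults):
            return results[0][0][:-1], "ok"
        if all(faults):
            raise RuntimeError("both hot backup sub-modules report a fault")
        bad = faults.index(True)
        good = 1 - bad
        good_word = results[good][0]
        # First detection is treated as transient: refresh from the good copy.
        self.hot[bad].write(addr, good_word)
        _, still_faulty = self.hot[bad].read(addr)
        if not still_faulty:
            return good_word[:-1], "refreshed"
        # Fault persists after the refresh: treat as permanent and replace.
        if self.cold is None:
            raise RuntimeError("no cold backup left")
        spare, self.cold = self.cold, None
        # Assumption: the spare is loaded from the healthy sub-module.
        spare.mem = {a: list(w) for a, w in self.hot[good].mem.items()}
        self.hot[bad] = spare
        return good_word[:-1], "replaced"


ram = SelfRepairingLutram()
ram.write(2, [0])             # both hot copies store codeword 00
ram.hot[1].mem[2][1] ^= 1     # transient upset: second copy now holds 01
first = ram.read(2)
ram.hot[1].stuck[(2, 1)] = 1  # permanent fault: bit stuck at 1
ram.hot[1].mem[2][1] = 1
second = ram.read(2)
third = ram.read(2)
```

The three reads walk through the cases in order: the upset is refreshed away, the stuck bit survives the refresh and triggers replacement, and the final read finds both active sub-modules fault-free.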
When the two hot backup sub-modules start to work, error detection is performed on both sets of encoded data.
The self-repairing method is further described with reference to FIG. 2. Part (a) of FIG. 2 shows transient fault self-repair based on dual-mode error detection code refreshing (DEDC-Scrubbing). The LUTRAM1 and LUTRAM2 sub-modules are in the hot backup state and form a dual-mode error detection code structure, while the LUTRAM3 sub-module is in the cold backup state. When the LUTRAM2 sub-module is detected as faulty for the first time, the fault is judged to be transient and the LUTRAM1 sub-module is the fault-free sub-module. The fault counts and the output encoded data of the two sub-modules are input to the control module simultaneously. The control module applies the DEDC-Scrubbing self-repairing method: it locates the faulty unit of the LUTRAM2 sub-module and refreshes it with the correct encoded data output by the fault-free LUTRAM1 sub-module. When the faulty unit of the LUTRAM2 sub-module is checked again and no fault is detected, the transient fault of the LUTRAM2 sub-module has been self-repaired, and the system keeps the dual-mode error detection code structure formed by the LUTRAM1 and LUTRAM2 sub-modules.
Part (b) of FIG. 2 shows the case in which the refresh of the LUTRAM2 sub-module in part (a) fails. When the LUTRAM2 sub-module is detected as faulty again, the fault is judged to be permanent and the control module applies the cold backup replacement self-repairing method: the LUTRAM2 sub-module is replaced by the LUTRAM3 sub-module and no longer operates, and the system forms a dual-mode error detection code structure from the LUTRAM1 and LUTRAM3 sub-modules. The permanent fault of the LUTRAM2 sub-module is thereby self-repaired.
FIG. 3 is the timing diagram of normal LUTRAM operation based on the cold backup dual-mode error detection code. At 110 ns, data din is input and only the write enable signals of the LUTRAM1 sub-module (the first hot backup sub-module) and the LUTRAM2 sub-module (the second hot backup sub-module) are asserted, i.e., we1=1 and we2=1, while the write enable signal of the LUTRAM3 sub-module (the cold backup sub-module) is deasserted, i.e., we3=0. The LUTRAM1 and LUTRAM2 sub-modules therefore store the data and the LUTRAM3 sub-module does not. At 155 ns, the LUTRAM1 and LUTRAM2 sub-modules output data simultaneously and the LUTRAM3 sub-module outputs none, which shows that in initial normal operation the LUTRAM1 and LUTRAM2 sub-modules are in the hot backup state and the LUTRAM3 sub-module is in the cold backup state. The system outputs the final result dout at 175 ns.
FIG. 4 is the timing diagram of LUTRAM transient fault self-repair based on DEDC-Scrubbing. At 120 ns, the fault injection signal fault_in=1 injects a fault into the data unit addr2=000010 of the LUTRAM2 sub-module, while the data unit addr1=000010 of the LUTRAM1 sub-module is written with the correct encoded data cin1=00. After the encoded data of the LUTRAM1 and LUTRAM2 sub-modules are read simultaneously, the LUTRAM2 sub-module outputs the erroneous data cout2=01 at 185 ns. After one cycle of decoding and error detection, the fault signal flag2=1 is triggered and the fault count becomes cnt2_flag=1. At 215 ns the control module outputs the refresh enable signal sr_en2=1, and the correct encoded data cout1=00 output by the LUTRAM1 sub-module is routed through the selector Mux to the input cin2=00 of the LUTRAM2 sub-module, refreshing its faulty unit. In the next cycle the LUTRAM2 sub-module outputs the correct data cout2=00, which shows that the refresh operation has repaired the faulty data unit of LUTRAM2.
FIG. 5 is the timing diagram of LUTRAM permanent fault self-repair based on CBEDC-Replacing. As the transient fault repair result in FIG. 4 shows, after the refresh operation the faulty data unit is no longer detected as faulty, so to simulate the permanent fault repair process a fault must be injected again with the fault injection signal fault_in=1. As shown in the figure, at 225 ns the fault data cin2=01 is injected again into the data unit addr2=000010 of the LUTRAM2 sub-module. The fault data cout2=01 is then read out, and after decoding and error detection the fault signal flag2=1 is triggered again at 245 ns and the fault count changes from 1 to 2, i.e., cnt2_flag=2. At 265 ns the control module deasserts the read enable of the LUTRAM2 sub-module, i.e., ce2=0, and asserts the read enable of the LUTRAM3 sub-module, i.e., ce3=1, so that the encoded data of the LUTRAM3 sub-module is read and that of the LUTRAM2 sub-module is not. The LUTRAM3 sub-module outputs the correct data cout2=00, and the above operations complete the repair process in which the cold backup LUTRAM3 sub-module replaces the faulty LUTRAM2 sub-module.
FIG. 6 shows the Markov models of the different self-repairing structures, where state 0 indicates that the LUTRAM sub-modules are in the normal operating state, state 1 indicates that one LUTRAM sub-module is in a failure state, state 2 indicates that one LUTRAM sub-module is in a failure state after the cold backup has replaced a failed module, and state F indicates that two LUTRAM sub-modules are in a failure state. Part (a) of FIG. 6 is the reliability model of the DEDC structure under transient faults, part (b) is the reliability model of the DEDC-Scrubbing structure under transient faults, part (c) is the reliability model of the DEDC structure under permanent faults, and part (d) is the reliability model of the CBEDC-Replacing structure under permanent faults.
The reliability of the DEDC structure and the DEDC-Scrubbing structure under transient faults is now illustrated by the reliability R(t) and the mean time to failure MTTF.
Introducing the transient fault rate s, the reliability of the DEDC structure under transient faults is calculated according to formula (1), and its mean time to failure under transient faults is calculated according to formula (2):
$$R_{\text{DEDC}}(t) = 2e^{-(b+c)(1-s)\lambda t} - e^{-2(b+c)(1-s)\lambda t} \tag{1}$$
where b and c are the number of data bits and the number of check code bits respectively, λ is the bit failure rate, and t is the running time.
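Formula (2) is not reproduced in this text. As a worked check, assuming the usual definition of mean time to failure as the integral of the reliability over all time (an assumption about the definition used here, not a restatement of the patent's own formula (2)), integrating the DEDC reliability of formula (1) term by term gives, for s < 1:

```latex
\begin{aligned}
\mathrm{MTTF}_{\text{DEDC}}
  &= \int_0^\infty \left( 2e^{-(b+c)(1-s)\lambda t} - e^{-2(b+c)(1-s)\lambda t} \right) dt \\
  &= \frac{2}{(b+c)(1-s)\lambda} - \frac{1}{2(b+c)(1-s)\lambda}
   = \frac{3}{2(b+c)(1-s)\lambda}
\end{aligned}
```

Under this reading the result grows as s increases, which agrees with the trend the description reports for the DEDC curves in FIG. 7 and FIG. 8.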
The reliability of the DEDC-Scrubbing structure with transient-fault self-repair is given by formula (3), and its mean time to failure by formula (4):
$$R_{\mathrm{DEDC\text{-}S}}(t) = 1 \tag{3}$$
$$\mathrm{MTTF}_{\mathrm{DEDC\text{-}S}} = C \tag{4}$$
where C is a time constant.
Introducing the permanent fault rate p, the reliability of the DEDC structure under permanent faults is given by formula (5), and its mean time to failure by formula (6):
$$R_{\mathrm{DEDC}}(t) = 2e^{-(b+c)p\lambda t} - e^{-2(b+c)p\lambda t} \tag{5}$$
Based on the Markov model of the CBEDC-REPLACING structure, the reliability under the permanent fault rate p is given by formula (7), and the mean time to failure by formula (8):
$$R_{\mathrm{CBEDC\text{-}R}}(t) = 4e^{-(b+c)p\lambda t} - 3e^{-2(b+c)p\lambda t} - 2(b+c)p\lambda t\,e^{-2(b+c)p\lambda t} \tag{7}$$
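To make the comparison between formulas (5) and (7) concrete, the following Python sketch evaluates both reliability expressions and estimates the corresponding MTTF by numerical integration. Formulas (6) and (8) are not reproduced in this text, so the sketch assumes that MTTF is the integral of R(t) from 0 to infinity. The function names and the parameter values (b=8, c=1, p=0.5, λ=1e-6) are illustrative assumptions, not values from the patent.

```python
import math


def r_dedc(t, b, c, p, lam):
    """Formula (5): DEDC reliability under permanent faults."""
    a = (b + c) * p * lam  # failure rate of one sub-module
    return 2 * math.exp(-a * t) - math.exp(-2 * a * t)


def r_cbedc_r(t, b, c, p, lam):
    """Formula (7): CBEDC-REPLACING reliability under permanent faults."""
    a = (b + c) * p * lam
    decay = math.exp(-2 * a * t)
    return 4 * math.exp(-a * t) - 3 * decay - 2 * a * t * decay


def mttf(r, b, c, p, lam, steps=100_000):
    """Trapezoidal estimate of the integral of R(t) over [0, 40/a].

    Assumption: MTTF is the integral of R(t); the tail beyond 40/a is negligible.
    """
    a = (b + c) * p * lam
    t_end = 40.0 / a
    h = t_end / steps
    total = 0.5 * (r(0.0, b, c, p, lam) + r(t_end, b, c, p, lam))
    for i in range(1, steps):
        total += r(i * h, b, c, p, lam)
    return total * h


if __name__ == "__main__":
    b, c, p, lam = 8, 1, 0.5, 1e-6  # illustrative values only
    a = (b + c) * p * lam
    t = 1.0 / a
    print("R_DEDC    at t=1/a:", r_dedc(t, b, c, p, lam))
    print("R_CBEDC-R at t=1/a:", r_cbedc_r(t, b, c, p, lam))
    print("MTTF_DEDC    * a:", mttf(r_dedc, b, c, p, lam) * a)
    print("MTTF_CBEDC-R * a:", mttf(r_cbedc_r, b, c, p, lam) * a)
```

Under the stated assumption, integrating the two expressions analytically gives 3/(2(b+c)pλ) for the DEDC structure and 2/((b+c)pλ) for the CBEDC-REPLACING structure, so the numerical estimates scaled by (b+c)pλ should be close to 1.5 and 2 respectively.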
Fig. 7 shows the LUTRAM reliability R(t) under the influence of the transient fault rate s, and Fig. 8 shows the LUTRAM mean time to failure MTTF under the influence of s. Both R(t) and MTTF of the DEDC structure increase gradually as s increases, but $R_{\mathrm{DEDC\text{-}S}}(t) > R_{\mathrm{DEDC}}(t)$ and $\mathrm{MTTF}_{\mathrm{DEDC\text{-}S}} > \mathrm{MTTF}_{\mathrm{DEDC}}$ hold throughout. The DEDC-Scrubbing structure, formed by adding a refresh repair mechanism, achieves transient-fault self-repair through refreshing and thereby improves system reliability under transient faults. Figs. 7 and 8 both show that adopting the DEDC-Scrubbing structure better improves LUTRAM reliability under transient faults.
Fig. 9 shows the LUTRAM reliability R(t) under the influence of the permanent fault rate p, and Fig. 10 shows the LUTRAM mean time to failure MTTF under the influence of p. Both R(t) and MTTF of the DEDC and CBEDC-REPLACING structures decrease gradually as p increases, but $R_{\mathrm{CBEDC\text{-}R}}(t) > R_{\mathrm{DEDC}}(t)$ and $\mathrm{MTTF}_{\mathrm{CBEDC\text{-}R}} > \mathrm{MTTF}_{\mathrm{DEDC}}$ hold throughout. The CBEDC-REPLACING structure, formed by adding a cold backup, achieves permanent-fault self-repair through replacement and thereby further improves system reliability under permanent faults. Figs. 9 and 10 both show that adopting the CBEDC-REPLACING structure better improves LUTRAM reliability under permanent faults.

Claims (6)

1. A LUTRAM self-repairing structure based on a cold-backup dual-mode error detection code, characterized by comprising:
two hot backup sub-modules, configured to output encoded data and a fault count to the control circuit;
one or more cold backup sub-modules, configured to replace a failed hot backup sub-module when that hot backup sub-module fails permanently;
a control circuit, configured to locate the failed hot backup sub-module according to the fault count output by each hot backup sub-module, to determine the fault type according to the fault count, and to perform fault self-repair according to the fault type;
the fault types include transient faults and permanent faults;
when a transient fault occurs, the control circuit refreshes the failed hot backup sub-module, using the encoded data output by the hot backup sub-module that has not failed as the refresh encoded data and asserting the refresh enable of the failed sub-module;
when a permanent fault occurs, the control circuit controls the cold backup sub-module to replace the failed hot backup sub-module;
the hot backup sub-module and the cold backup sub-module have the same composition, and each comprises:
an encoding module, configured to encode the input data according to the even parity encoding principle and to store the encoded data in the LUTRAM;
a LUTRAM, configured to store the encoded data;
a detection module, configured to perform fault detection on the encoded data stored in the LUTRAM according to the even parity decoding principle, and to trigger a fault signal and count the number of faults when a single-bit failure or an odd number of bit failures is detected.
2. The LUTRAM self-repairing structure based on a cold-backup dual-mode error detection code according to claim 1, characterized in that: for the same hot backup sub-module, when a transient fault was determined in the previous round of fault detection and a fault still occurs in the next round of fault detection, the fault is determined to be a permanent fault.
3. The LUTRAM self-repairing structure based on a cold-backup dual-mode error detection code according to claim 1, characterized in that: replacing the failed hot backup sub-module with the cold backup sub-module comprises: enabling read/write of the cold backup sub-module and disabling read/write of the failed hot backup sub-module.
4. A self-repairing method for the LUTRAM self-repairing structure based on a cold-backup dual-mode error detection code according to any one of claims 1 to 3, characterized by comprising:
receiving the encoded data and the fault count from each hot backup sub-module, locating the failed hot backup sub-module, and determining the fault type according to the fault count;
performing self-repair on the failed hot backup sub-module according to the fault type, comprising: when the fault type is a transient fault, refreshing the failed hot backup sub-module, using the encoded data output by the hot backup sub-module that has not failed as the refresh encoded data and asserting the refresh enable of the failed sub-module; when the fault type is a permanent fault, controlling the cold backup sub-module to replace the failed hot backup sub-module;
repeating the above steps to complete the self-repair.
5. The self-repairing method according to claim 4, characterized in that: determining the fault type according to the fault count comprises:
when the fault count changes only once, the fault type is a transient fault;
when the fault count changes twice in succession, the fault type is a permanent fault.
6. The self-repairing method according to claim 4, characterized in that: controlling the cold backup sub-module to replace the failed hot backup sub-module comprises: enabling read/write of the cold backup sub-module and disabling read/write of the failed hot backup sub-module.
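For readers unfamiliar with the even parity principle named in claim 1, the following Python sketch shows one way the encoding and detection steps could behave. It is an illustration only: the function names `encode_even_parity` and `detect_fault` and the use of a single appended check bit are assumptions, and the claims do not specify the word width or bit layout.

```python
def encode_even_parity(data_bits):
    """Append one check bit so that the codeword holds an even number of 1s."""
    parity = sum(data_bits) % 2
    return list(data_bits) + [parity]


def detect_fault(codeword):
    """Return 1 if the codeword has odd parity (fault detected), else 0."""
    return sum(codeword) % 2


if __name__ == "__main__":
    word = encode_even_parity([1, 0, 1])
    print(word, detect_fault(word))

    single = word.copy()
    single[0] ^= 1                      # one bit flipped: detected
    print(single, detect_fault(single))

    double = single.copy()
    double[1] ^= 1                      # two bits flipped: not detected
    print(double, detect_fault(double))
```

The last case shows why the claim limits detection to a single-bit failure or an odd number of bit failures: an even number of flipped bits leaves the parity unchanged.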
CN202011178870.XA 2020-10-29 2020-10-29 LUTRAM self-repairing structure and self-repairing method based on cold backup dual-mode error detection code Active CN112269686B (en)


Publications (2)

Publication Number Publication Date
CN112269686A (en) 2021-01-26
CN112269686B (en) 2024-04-26




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant