GB2613120A - Online parallel processing soft error real-time error detection and recovery method and system - Google Patents

Online parallel processing soft error real-time error detection and recovery method and system

Info

Publication number
GB2613120A
Authority
GB
United Kingdom
Prior art keywords
linked list
data
backups
protected
recovery
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2303510.8A
Other versions
GB202303510D0 (en
Inventor
Zhou Hualiang
Zheng Yuping
Xu Guanghui
Li Youjun
Liu Zheng
Zou Zhiyang
Jiang Lei
Gao Shihang
Wang Shiping
Zhang Jiasen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nari Technology Co Ltd
NARI Nanjing Control System Co Ltd
Original Assignee
Nari Technology Co Ltd
NARI Nanjing Control System Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nari Technology Co Ltd, NARI Nanjing Control System Co Ltd
Publication of GB202303510D0
Publication of GB2613120A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/004Error avoidance
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44Indication or identification of errors, e.g. for repair
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1004Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1048Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using arrangements adapted for a specific error detection or correction feature
    • G06F11/106Correcting systematically all correctable errors, i.e. scrubbing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1458Management of the backup or restore process
    • G06F11/1469Backup restoration techniques
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/18Address generation devices; Devices for accessing memories, e.g. details of addressing circuits
    • G11C29/24Accessing extra cells, e.g. dummy cells or redundant cells
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/52Protection of memory contents; Detection of errors in memory contents
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/70Masking faults in memories by using spares or by reconfiguring
    • G11C29/74Masking faults in memories by using spares or by reconfiguring using duplex memories, i.e. using dual copies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/83Indexing scheme relating to error detection, to error correction, and to monitoring the solution involving signatures
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C2029/0409Online test
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C2029/0411Online error correction

Abstract

An online parallel processing soft error real-time error detection and recovery method and system. The method comprises: dividing a protected RAM space into multiple protected areas; classifying all of the protected areas into one or more levels, wherein error detection and recovery for the highest level are completed in each interrupt cycle, and error detection and recovery for the other levels are completed once over multiple interrupt cycles; registering the protected areas of each level to generate linked lists corresponding in number to the levels, and backing up at least two copies of each linked list and of the protected areas in the linked list to other RAM space; and processing in parallel the error detection and recovery of each level of linked list and each protected area in the linked list. In key scenarios, the described solution can verify, judge, correct, and recover high-importance data in a control system within a single interrupt cycle. It does not rely on the CPU itself, can process in real time and in parallel, and can detect and correct errors online in real time even when multiple bits of the CPU and the RAM flip abnormally at the same time.

Description

ONLINE PARALLEL PROCESSING SOFT ERROR REAL-TIME ERROR DETECTION AND RECOVERY METHOD AND SYSTEM
TECHNICAL FIELD
[0001] The invention belongs to the technical field of error detection of RAM abnormal bit changes, and particularly relates to a soft error real-time detection and recovery method and system based on online parallel processing.
BACKGROUND
[0002] With the development of the microprocessor technique towards low power, low voltage and high integration, the influence of random access memory (RAM) abnormal bit changes (also called soft errors or single event effects) on system security and stability becomes non-negligible. The main causes of RAM abnormal bit changes include: (1) alpha particle radiation: alpha particle radiation in packaging materials used by processors will result in abnormal bit changes of the chip memory area; (2) scale increase of integrated circuits: the size of transistors becomes increasingly smaller and the frequency increasingly higher, while the limit voltage of the transistors becomes increasingly lower and the noise tolerance increasingly narrower, so processors become more sensitive to interference, voltage disturbance and electromagnetic radiation, and the reliability is reduced; (3) cosmic radiation: high-energy charged particles in cosmic space, before reaching the earth's surface, undergo cascade interaction with atmospheric elements (oxygen and nitrogen) to produce a large quantity of secondary neutrons, which then enter integrated circuits and generate a large quantity of electron-hole pairs due to direct ionization or nuclear reaction-induced indirect ionization; when the electric charges collected by sensitive electrodes of microelectronic devices exceed the critical charge, the single event effect occurs and causes memory bit flips, which in turn lead to program dysfunctions.
There are typically three types of RAM abnormal bit changes: first, single-bit flip of data: the flip of a single data bit of a memory leads to a data exception, and this error is common in SRAMs, DRAMs and NVMs; second, single-bit transient flip of data: a single data bit of a memory jitters due to voltage or current, the jitter is sampled and recorded by a circuit and is then recovered, and such an abnormal bit change often happens to DRAMs; third, multi-bit flip of data: multiple data bits in a memory are changed at the same time. If the abnormal bit change happens to a critical position of RAMs (such as the code area or the key data area), a program running error may be caused, which in turn leads to a system exception. If the abnormal bit change can be detected in advance for data recovery, the system stability and reliability will be greatly improved. With the development of the microprocessor technique and the expansion of its application range, RAM abnormal bit changes become more and more detrimental, and their scope of influence becomes increasingly wider. So, how to solve the problem of RAM abnormal bit changes to improve system stability and reliability and ensure safe system operation has drawn more and more attention from all industries.
[0003] At present, for detecting and correcting RAM abnormal bit changes (also called soft errors or single event effects), many approaches have been proposed to monitor and correct the abnormal bit changes from the perspective of hardware and software. Monitoring and correcting schemes based on hardware include, for example, process strengthening measures and device strengthening measures. The process strengthening measures generally include strengthening of capacitance condition, resistance condition, structure, layout and dopants, and latch circuit (DICE). The device strengthening measures generally include parity, ECC, interleaving, and the like.
[0004] The hardware of devices such as CPUs supports ECC (Error Correcting Code) check of RAMs; the hardware DDR controllers of new TI processors and new Xilinx MPSoC processors support ECC check, SoC RAM supports ECC check, and L1/L2 cache supports ECC check, wherein ECC check can realize single-bit abnormal change correction and two-bit change warning by adding an 8-bit Hamming code every 32 bits or 64 bits, but it cannot realize two-bit correction or multi-bit change warning and correction.
[0005] Most of the above measures prevent soft errors on one level, and monitoring and detection approaches based on software have also been proposed. For example, Patent (Publication No. CN105446842A, 2016-03-30) provides an online monitoring method for ADI DSP codes, which comprises: adding a code monitoring tag to a to-be-monitored section of a loadable file of a DSP according to a link mapping file of the DSP; during operation, continuously reading, by the DSP, an LDR file in a nonvolatile memory, and if the code monitoring tag is found, comparing a code in the nonvolatile memory with a running code in the RAM; if the code in the nonvolatile memory and the running code in the RAM are inconsistent, checking again; if it is confirmed that the code in the nonvolatile memory and the running code in the RAM are inconsistent, determining that the running code in the RAM is wrong, instantly giving an alarm, and recording error information in the nonvolatile memory for fault analysis. This method can realize online monitoring of running codes by simply modifying the LDR file. However, this method cannot realize online correction and is poor in response timeliness, the processing time of the CPU is occupied by monitoring, error detection and correction rely on the CPU, erroneous data cannot be located, and online recovery cannot be realized.
[0006] Existing techniques and methods for realizing system-level protection against soft errors in a CPU system mainly have the following disadvantages and defects.
[0007] (1) Considering the consumption of hardware resources, the ECC check function based on hardware can only realize single-bit change correction and two-bit change detection, and cannot give a warning when more bits change synchronously, so the correction function is limited.
[0008] (2) The ECC check function based on hardware can be realized only in new processors and is not suitable for all types of processors; the ECC check function and performance vary between processors, so the reliability of systems cannot be guaranteed.
[0009] (3) The software of the CPU can only realize online program monitoring and alarming; there is no systematic RAM data monitoring and recovery scheme, and in particular, the real-time performance is unsatisfactory and the detection and recovery response is slow.
[0010] (4) The detection and recovery function based on software takes up processor time and affects normal functions of the processors, the error detection and correction program relies on the CPU, and erroneous data cannot be located.
[0011] (5) The detection and recovery function based on software of the CPU cannot be realized normally when the RAM where the error detection and correction program is located is abnormal, and full coverage of the RAM space is impossible.
SUMMARY OF THE INVENTION
[0012] The objective of the invention is to overcome the defects of the prior art by providing a soft error real-time detection and recovery method and system based on online parallel processing to solve the problems of hidden faults or protection and control dysfunction of equipment caused by abnormal bit changes in power secondary equipment.
[0013] To solve the above-mentioned problems, the invention provides a soft error real-time detection and recovery method based on online parallel processing, which comprises the following steps: dividing a protected RAM space into multiple protected areas; classifying all the protected areas into one or more levels, completing error detection and recovery of the protected area of a highest level in each interrupt cycle, and completing error detection and recovery of the protected areas of the other levels in multiple interrupt cycles; registering all the levels of the protected areas to generate linked lists that correspond to the levels in number and are located in the protected RAM space, and creating at least two backups of the linked lists and the protected areas in the linked lists in other spaces, wherein contents of each linked list comprise positions and lengths of the protected areas of the same level, and positions of the backups; and performing error detection and recovery on all the levels of the linked lists and the protected areas in the linked lists.
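As an illustration only, the registration step can be sketched in Python. The class and function names (`ProtectedArea`, `LevelList`, `register_areas`) and the example addresses are hypothetical, and a plain Python list stands in for the linked list; the only details taken from the text are that each entry records the position and length of a protected area plus the positions of at least two backups, and that there is one list per level.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ProtectedArea:
    position: int                 # start address inside the protected RAM space
    length: int                   # length of the area in bytes
    backup_positions: List[int]   # start addresses of the backups (at least two)


@dataclass
class LevelList:
    level: int                    # assumption: 0 denotes the highest level
    areas: List[ProtectedArea] = field(default_factory=list)


def register_areas(
    areas_by_level: Dict[int, List[Tuple[int, int, List[int]]]]
) -> Dict[int, LevelList]:
    """Build one list per level from {level: [(position, length, backups), ...]}."""
    lists: Dict[int, LevelList] = {}
    for level, entries in sorted(areas_by_level.items()):
        level_list = LevelList(level)
        for position, length, backups in entries:
            if len(backups) < 2:
                raise ValueError("each protected area needs at least two backups")
            level_list.areas.append(ProtectedArea(position, length, list(backups)))
        lists[level] = level_list
    return lists


# Hypothetical layout: one highest-level area and two lower-level areas.
lists = register_areas({
    0: [(0x1000, 256, [0x8000, 0x9000])],
    1: [(0x2000, 512, [0xA000, 0xB000]), (0x3000, 128, [0xC000, 0xD000])],
})
```

In a real system the lists themselves would also be backed up at least twice, as the paragraph above requires; the sketch leaves that out.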
[0014] Further, the method is executed by a parallel processing module, the parallel processing module is connected to and accesses the protected RAM space through a high-speed interface, and independent DDRs, SRAMs or RAM spaces are used for storing the backups.
[0015] Further, the process of performing error detection and recovery on any one level of a linked list and the protected areas in the linked list comprises: performing error detection on the linked list and the backups of the linked list, and if an exception is detected, recovering the exception; and performing error detection on the protected areas in the linked list and the backups of the protected areas in the linked list, and if an exception is detected, recovering the exception.
[0016] Further, performing error detection on the linked list and the backups of the linked list comprises: reading the linked list and the backups of the linked list; checking the linked list and the backups of the linked list with any one or more of SM3 signature information, MD5 message digests and BCC codes; and comparing determination results obtained through different check methods to determine whether the linked list and the backups of the linked list are correct.
[0017] Further, when multiple check methods are used, check processes of the multiple check methods are performed in parallel.
[0018] Further, the soft error real-time detection and recovery method further comprises: performing communication process check on the read linked list and the backups of the linked list, comprising: successively reading the linked list at each position at least three times, and calculating corresponding CRC codes; and determining data with at least two identical CRC codes at each position as correct, and taking one copy of correct data as correct read data.
[0019] Further, checking the linked list and the two backups A and B of the linked list through one check method comprises: with the SM3 signature information as the check method, calculating and comparing SM3 signature information of the linked list and the two backups A and B of the linked list; and if the SM3 signature information of at least two copies of data is the same, determining that all the copies of the data with the same SM3 signature information are normal and the other copy of the data is abnormal.
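The communication process check (read each position at least three times, compute a CRC per read, accept data whose CRC appears at least twice) might look roughly like the following Python sketch. The function name `checked_read` and the `read_fn` callback are hypothetical stand-ins for a transfer over the high-speed interface, and CRC-32 from `zlib` is an assumed choice, since the text does not name a CRC polynomial.

```python
import zlib
from collections import Counter
from typing import Callable


def checked_read(read_fn: Callable[[], bytes], reads: int = 3) -> bytes:
    """Read the same location several times and return a copy whose CRC
    is shared by at least two of the reads."""
    copies = [read_fn() for _ in range(reads)]
    crcs = [zlib.crc32(copy) for copy in copies]
    crc, votes = Counter(crcs).most_common(1)[0]
    if votes < 2:
        raise IOError("no two reads agree; the transfer cannot be trusted")
    # Any copy with the majority CRC is taken as the correct read data.
    return copies[crcs.index(crc)]


# Simulated transfer in which the second read is corrupted in flight.
transfers = iter([b"abcd", b"abXd", b"abcd"])
data = checked_read(lambda: next(transfers))
```

This only guards the transfer itself; whether the stored copy is correct is decided afterwards by comparing the original with its backups.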
[0020] Further, comparing determination results obtained by different check methods to determine whether the linked list and the backups of the linked list are correct comprises: comparing determination results of the linked list and the backups of the linked list obtained with the SM3 signature information, the MD5 message digests and the BCC codes; if the determination results obtained with two or more of the check methods are the same, taking the same result as a final determination result.
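The two-stage vote described above (first among the three copies under one check method, then among the check methods) can be sketched as follows. This is an illustration under stated assumptions: Python's standard library does not guarantee an SM3 implementation, so SHA-256 is used as a labeled stand-in for SM3; the BCC is taken to be the XOR of all bytes; and all function names are hypothetical.

```python
import hashlib
from collections import Counter
from functools import reduce
from typing import Callable, Optional, Sequence, Tuple


def bcc(data: bytes) -> bytes:
    """Block check character, assumed here to be the XOR of all bytes."""
    return bytes([reduce(lambda a, b: a ^ b, data, 0)])


def md5(data: bytes) -> bytes:
    return hashlib.md5(data).digest()


def sm3_stand_in(data: bytes) -> bytes:
    # Stand-in only: SHA-256 replaces SM3, which hashlib may not provide.
    return hashlib.sha256(data).digest()


CHECKS: Sequence[Callable[[bytes], bytes]] = (sm3_stand_in, md5, bcc)


def verdict_for(check, copies) -> Optional[Tuple[bool, ...]]:
    """One check method: copies sharing the majority digest are normal (True)."""
    digests = [check(copy) for copy in copies]
    digest, votes = Counter(digests).most_common(1)[0]
    if votes < 2:
        return None  # no two copies agree under this method
    return tuple(d == digest for d in digests)


def final_verdict(copies) -> Optional[Tuple[bool, ...]]:
    """Across methods: accept a per-copy verdict that at least two methods share."""
    verdicts = [v for v in (verdict_for(c, copies) for c in CHECKS) if v is not None]
    if not verdicts:
        return None
    verdict, votes = Counter(verdicts).most_common(1)[0]
    return verdict if votes >= 2 else None


# Original plus backups A and B; B has two bytes with the same bit flipped,
# which the XOR-based BCC cannot see but the two hash-based checks can.
copies = [b"\x00\x00", b"\x00\x00", b"\x01\x01"]
result = final_verdict(copies)
```

The example shows why methods based on different principles are combined: one method missing an error is outvoted by the other two.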
[0021] Further, performing error detection on the protected areas in the linked list and the backups of the protected areas in the linked list comprises, when the level of the linked list is the highest level: performing error detection on the protected areas in the linked list and the backups of the protected areas in sequence, wherein performing error detection on any one protected area in the linked list and the backups of the protected area comprises: reading a current protected area in the linked list and the backups of the current protected area; checking the current protected area in the linked list and the backups of the current protected area with any one or more of SM3 signature information, MD5 message digests and BCC codes; and comparing determination results obtained by different check methods to determine whether the current protected area in the linked list and the backups of the current protected area are correct.
[0022] Further, performing error detection on the protected areas in the linked list and the backups of the protected areas comprises, when the linked list is of another level: dividing all the protected areas in the linked list into multiple groups; in each interrupt cycle, performing error detection on any one protected area in any one group and the backups of the protected area, which comprises: reading a current protected area in a current detection group in the linked list and the backups of the current protected area; checking the current protected area in the current detection group in the linked list and the backups of the current protected area with any one or more of SM3 signature information, MD5 message digests and BCC codes; and comparing determination results obtained by different check methods to determine whether the current protected area in the current detection group in the linked list and the backups of the current protected area are correct.
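A possible scheduling of the two cases above is sketched below in Python. The text does not fix how lower-level areas are grouped or how many are handled per interrupt cycle, so the sketch assumes round-robin grouping and one group per cycle; `split_into_groups`, `make_schedule` and the area labels are hypothetical.

```python
from itertools import cycle
from typing import Iterator, List


def split_into_groups(areas: List[str], group_count: int) -> List[List[str]]:
    """Deal the areas round-robin into group_count groups (assumed grouping rule)."""
    return [areas[i::group_count] for i in range(group_count)]


def make_schedule(
    highest_level_areas: List[str], lower_level_groups: List[List[str]]
) -> Iterator[List[str]]:
    """Yield, per interrupt cycle, the areas due for checking in that cycle."""
    rotation = cycle(lower_level_groups) if lower_level_groups else None
    while True:
        due = list(highest_level_areas)   # highest level: every cycle, in sequence
        if rotation is not None:
            due.extend(next(rotation))    # other levels: one group per cycle
        yield due


groups = split_into_groups(["B1", "B2", "B3", "B4"], 2)
schedule = make_schedule(["A1"], groups)
first_cycle = next(schedule)
second_cycle = next(schedule)
third_cycle = next(schedule)
```

With two groups, every lower-level area is revisited once every two interrupt cycles, while the highest-level area is checked in each cycle.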
[0023] Further, when multiple check methods are used, check processes of the multiple check methods are performed in parallel.
[0024] Further, the soft error real-time detection and recovery method based on online parallel processing further comprises: performing communication process check on the current protected area in the linked list and the backups of the current protected area, comprising: successively reading the linked list at each position at least three times, and calculating corresponding CRC codes; and determining data with at least two identical CRC codes at each position as correct, and taking one copy of correct data as correct read data.
[0025] Further, checking the current protected area in the linked list and the backups of the current protected area through one check method comprises: with SM3 signature information as the check method, calculating and comparing SM3 signature information of the current protected area in the linked list and the backups of the current protected area; and if the SM3 signature information of at least two copies of data is the same, determining that all the copies of data with the same SM3 signature information are correct and the other copy of data is abnormal.
[0026] Further, comparing determination results obtained by different check methods to determine whether the current protected area in the linked list and the two backups of the current protected area are correct comprises: comparing determination results of the current protected area in the linked list and the two backups of the current protected area obtained with the SM3 signature information, the MD5 message digests and the BCC codes, and if the determination results obtained by at least two check methods are the same, taking the same result as a final determination result.
[0027] Further, recovering the exception comprises: dividing an abnormal data area and a normal data area into N basic data blocks in sequence; calculating a data CRC code $C_e^n$ of the abnormal data area, wherein $n = 0, 1, 2, 3, \ldots, m$ and $N/2 < 2^m \le N$, and $C_e^n$ is calculated by: selecting $P_n$ basic data blocks starting from the first basic data block, then selecting $P_n$ basic data blocks every $P_n$ basic data blocks, and using all the selected basic data blocks as data resources to calculate the CRC code, wherein $P_n = 2^n$; calculating a data CRC code $C_c^n$ of the normal data area, wherein $n = 0, 1, 2, 3, \ldots, m$, and $C_c^n$ is calculated by: selecting $P_n$ basic data blocks starting from the first basic data block, then selecting $P_n$ basic data blocks every $P_n$ basic data blocks, and using all the selected basic data blocks as data resources to calculate the CRC code; determining whether $C_e^n$ and $C_c^n$ are equal; if so, updating an abnormal data tag position to point to a latter half of the abnormal data area; if not, updating the abnormal data tag position to point to a front half of the abnormal data area; repeating this process to calculate $C_e^n$ and $C_c^n$, wherein n is gradually decreased from m-1 to 0, and the abnormal data tag position is updated in each step; when n=0, the abnormal data tag position is located to a basic data block of the abnormal data area where abnormal data is located; copying contents of the corresponding correct basic data block in the normal data area into the abnormal basic data block in the abnormal data area to which the abnormal data tag position points, such that recovery of the abnormal basic data block is completed; and calculating CRC codes of all the basic data blocks in the normal data area and the abnormal data area, and determining, by comparison, whether the two CRC codes are equal; if not, repeating all the steps; if so, ending recovery of abnormal data.
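The locate-then-copy idea can be illustrated with a simplified Python sketch. It is not the block selection scheme described above: instead of the strided selection of basic data blocks, it halves a search window and compares CRCs of the front half only, which is an ordinary binary search substituted here for brevity. CRC-32 from `zlib`, the block size of 4 bytes, and all function names are assumptions.

```python
import zlib
from typing import List


def split_blocks(data: bytes, block_size: int) -> List[bytes]:
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def crc_of(blocks: List[bytes]) -> int:
    return zlib.crc32(b"".join(blocks))


def locate_bad_block(bad_blocks: List[bytes], good_blocks: List[bytes]) -> int:
    """Return the index of one differing block by halving the search window."""
    lo, hi = 0, len(bad_blocks)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crc_of(bad_blocks[lo:mid]) == crc_of(good_blocks[lo:mid]):
            lo = mid   # front half matches: the tag moves to the latter half
        else:
            hi = mid   # front half differs: the tag moves to the front half
    return lo


def recover(bad: bytearray, good: bytes, block_size: int = 4) -> int:
    """Copy good blocks over bad ones until the whole-area CRCs match.
    Returns the number of blocks repaired."""
    if len(bad) != len(good):
        raise ValueError("areas must have the same length")
    good_blocks = split_blocks(good, block_size)
    repaired = 0
    for _ in range(len(good_blocks)):       # at most one repair per block
        if zlib.crc32(bytes(bad)) == zlib.crc32(good):
            return repaired
        idx = locate_bad_block(split_blocks(bytes(bad), block_size), good_blocks)
        start = idx * block_size
        bad[start:start + block_size] = good[start:start + block_size]
        repaired += 1
    if zlib.crc32(bytes(bad)) != zlib.crc32(good):
        raise RuntimeError("recovery did not converge")
    return repaired


good = bytes(range(32))
bad = bytearray(good)
bad[5] ^= 0xFF    # corrupts block 1
bad[22] ^= 0x01   # corrupts block 5
first_hit = locate_bad_block(split_blocks(bytes(bad), 4), split_blocks(good, 4))
repaired = recover(bad, good)
```

Each location pass costs a logarithmic number of CRC comparisons in the block count, and the final whole-area comparison plays the role of the "if not, repeating all the steps" check when more than one block is damaged.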
[0028] Further, after recovery of the abnormal basic data block is ended, the soft error real-time detection and recovery method further comprises: rereading the abnormal basic data block, recalculating the CRC code of the abnormal basic data block and calculating the CRC code of the corresponding correct basic data block; and determining, by comparison, whether the two CRC codes are equal, and if so, completing the recovery of the abnormal basic data block.
[0029] Correspondingly, the invention further provides a soft error real-time detection and recovery system based on online parallel processing, which comprises a linked list management module and an error detection and recovery module.
[0030] The linked list management module is used for classifying all protected areas in a protected RAM space into one or more levels, completing error detection and recovery of the protected area of a highest level in each interrupt cycle, and completing error detection and recovery of the protected areas of the other levels in multiple interrupt cycles; registering all the levels of protected areas to generate linked lists which correspond to the levels in number and are located in the protected RAM space, and creating at least two backups of the linked lists and the protected areas in the linked lists in other spaces, contents of each linked list comprising positions and lengths of the protected areas of the same level, and positions of the backups.
[0031] The error detection and recovery module is used for performing error detection and recovery on all the levels of linked lists and the protected areas in the linked lists, wherein the process of performing error detection and recovery on any one level of linked list and the protected areas in the linked list comprises: performing error detection on the linked list and the backups of the linked list, and if an exception is detected, recovering the exception; and performing error detection on the protected areas in the linked list and the backups of the protected areas in the linked list, and if an exception is detected, recovering the exception.
[0032] Compared with the prior art, the invention has the following beneficial effects.
[0033] (1) The invention can realize online error detection and recovery of multiple or even all synchronous changes in a designated memory area, can solve the problem that hardware-based ECC cannot detect or correct multi-bit abnormal changes, has powerful functions, and improves the running robustness and stability of system programs.
[0034] (2) The invention can realize error detection and recovery of abnormal bit changes of RAMs independent of hardware ECC of a CPU, and provides a feasible RAM abnormal bit change detection method for processor systems without the hardware ECC function.
[0035] (3) The RAM exception detection and recovery function of the invention can locate erroneous data in an area and recover the erroneous data within one interrupt tick, so the real-time performance of the system in processing RAM exceptions is improved, and the influence of RAM exceptions on the system is reduced.
[0036] (4) The FPGA parallel processing module is used to complete RAM detection and recovery of the CPU, so the time of the CPU is not occupied, parallel detection can be realized, and RAM exceptions can be corrected and recovered.
[0037] (5) Detection and correction of a whole protected RAM space are realized through the FPGA parallel processing module.
BRIEF DESCRIPTION OF THE DRAWINGS
[0038] FIG. 1 is a schematic diagram of a system according to the invention;
[0039] FIG. 2 is a schematic diagram of data in linked lists according to the invention;
[0040] FIG. 3 is a schematic diagram of contents of the linked list according to the invention;
[0041] FIG. 4 is a schematic diagram of a check and determination process of an FPGA parallel processing module according to the invention;
[0042] FIG. 5 is a schematic diagram of a determination function of the FPGA parallel processing module according to the invention;
[0043] FIG. 6 is a schematic diagram of data block partition and check code calculation implemented by a location function of the FPGA parallel processing module according to the invention;
[0044] FIG. 7 is a schematic diagram of abnormal data location implemented by the location function of the FPGA parallel processing module according to the invention;
[0045] FIG. 8 is a schematic diagram of a data recovery function of the FPGA parallel processing module according to the invention;
[0046] FIG. 9 is a sequence diagram of determination and recovery check of the FPGA parallel processing module according to the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0047] The invention will be further described below in conjunction with the accompanying drawings. The following embodiments are merely used to explain the technical solutions of the invention more clearly, and should not be construed as limitations of the protection scope of the invention.
[0048] Embodiment 1 100491 The invention provides a soft error real-time detection and recovery method based on online parallel processing, which comprises the following steps.
100501 A protected RAM space is divided into multiple protected areas.
100511 All the protected areas are classified to one or more levels, wherein error detection and recovery of the protected area of a highest level are completed within each interrupt cycle, and error detection and recovery of the protected areas of the other levels are completed within multiple interrupt cycles.
100521 All the levels of protected areas are registered in linked lists which correspond to the levels in number and are located in the protected RAM space; at least two backups A and B of the linked Fists and the protected areas in the linked lists are created in other RAM spaces, wherein contents of each linked list comprise positions and lengths of the protected areas of the same level, and positions of the two backups A and B. 100531 Error detection and recovery of all the levels of linked lists and the protected areas in the linked lists are performed in parallel, wherein the process of performing error detection and recovery on any one level of linked list and the protected areas in the linked list comprises that: error detection is performed on the linked list and the two backups A and B of the linked list and an exception, if detected, is recovered; and error detection is performed on the protected areas in the linked list and the two backups A and B of the protected areas, and an exception, if detected, is recovered.
[0054] The invention can realize RAM exception detection and recovery of a system and can locate erroneous data and recover the erroneous data within one interrupt cycle, so the system can handle RAM exceptions in real time, and the influence of RAM exceptions on the system is reduced.
[0055] Embodiment 2
[0056] To realize the parallel data processing function of a parallel processing module (such as an FPGA processing module, which is referred to as FPGA parallel processing module hereinafter), the FPGA parallel processing module in the system needs to access a RAM of a CPU system through a high-speed interface such as PCIe, SRIO or HyperLink. Moreover, to minimize the influence of the FPGA parallel processing module on the CPU system during data processing, the FPGA parallel processing module should independently control memory spaces used for backing-up data, such as independent DDRs, SRAMs, or RAMs in the FPGA parallel processing module. If the RAM space of a CPU is large enough and the bandwidth of the high-speed interface is sufficient, independent spaces may be partitioned in the RAM of the CPU to be used for the FPGA parallel processing module to store data backups.
[0057] The FPGA parallel processing module may be realized by a dedicated FPGA parallel processing module designed on a hardware module or an FPGA parallel processing module of a CPU control system, or be realized based on a SoC integrated with an FPGA parallel processing circuit.
[0058] In a software system, when the FPGA parallel processing module is used for performing online error detection and recovery, the correctness of data transmission should also be taken into account, so the invention provides a communication-level check mechanism; and the accuracy of check algorithms should also be taken into account, so three check methods based on different principles, namely SM3 signature information, MD5 information abstracts and BCC codes, are used to prevent homologous errors.
[0059] The invention provides a soft error real-time detection and recovery method based on online parallel processing, which, as shown in FIG. 1, requires that a CPU in a control system is connected to an FPGA parallel processing module through a high-speed interface (such as PCIe, SRIO or HyperLink; PCIe is used in the following description), and that the FPGA parallel processing module controls independent RAM spaces for backing-up programs or data. The soft error real-time detection and recovery method based on online parallel processing comprises the following steps:
[0060] S1: a protected RAM space is divided into multiple protected areas.
[0061] According to the use condition of the RAM space of the protected CPU system, the protected RAM space is divided into multiple protected areas. When program codes are compiled, the RAM spatial position of each protected area is fixed through a link mapping file, which is a description file specifying the address of each program section during program compiling. The file comprises: the memory address and length of the CPU system, and the name, length and memory address of each memory area. The storage position of individual program objects can be specified in the program source code through a compiler preprocessing directive (#pragma DATA_SECTION(program name, "domain name of memory area")).
[0062] The FPGA parallel processing module directly reads, writes and accesses the RAM space of the protected CPU system through a high-speed bus (such as PCIe), and controls independent RAM spaces for backing-up contents in the protected areas; these independent RAM spaces can be read, written and accessed only by the FPGA parallel processing module.
[0063] To ensure real-time error detection and recovery of the protected RAM space, the protected areas are classified into different levels for management: the protected areas are classified into multiple levels according to their importance, wherein LV1 is the highest level and corresponds to the highest error detection and recovery frequency, with error detection and recovery completed within each interrupt cycle; the error detection and recovery frequencies corresponding to the other levels decrease gradually. When the protected areas of the system are small and the FPGA can complete error detection and recovery within one interrupt cycle, there may be only one level, LV1. When the importance of the protected areas in the system is not high and the completion of error detection and recovery over multiple interrupt cycles is tolerated, there may also be only the low level LV2. For the sake of brevity, this embodiment is described with two levels (LV1/LV2).
[0064] According to the importance of the protected areas in the system, the protected areas are classified into a first level (LV1) and a second level (LV2) in this embodiment, wherein LV1 is the highest level, and the system completes check and error correction of all contents of the protected areas of LV1 within each interrupt tick; the protected areas of LV2 are divided into multiple groups, check and error correction of contents in one group of protected areas are completed in each interrupt tick, and check and error correction of the protected areas of LV2 are completed over multiple interrupts. When the system is initialized, the protected areas are registered to generate two levels of linked lists. Information of the protected RAM areas of LV1/LV2 is stored in the two levels of linked lists or in the RAM, and the linked lists are used for the FPGA parallel processing module to search for the position, size and other information of the protected areas.
[0065] S2: all the protected areas in the protected RAM space are registered into the linked list of the first level (LV1) and the linked list of the second level (LV2) according to the importance of the protected areas, and the FPGA parallel processing module creates at least two backups A and B of the linked lists (LV1/LV2) and the protected areas in the linked lists in independent RAMs.
[0066] The number of backups of data can be determined according to the actual condition, and at least two backups are created. When two backups of data are created, which copy of data is correct or abnormal is determined based on the two-out-of-three rule. Alternatively, three backups of data are created, and which copy of data is correct or abnormal is determined based on the three-out-of-four rule or the two-out-of-four rule. In the invention, two backups A and B are described in detail by way of example.
[0067] An initial backup functional module of the FPGA parallel processing module initializes contents in the two backups A and B of the linked lists and the protected areas in an independent RAM according to the linked lists LV1/LV2.
[0068] The linked lists at the three positions have the same contents, and comprise the initial address and length of the protected areas and the positions of the two backups A/B.
[0069] Register principle: the protected areas, to which the linked lists LV1 and LV2 point, are registered section by section, with the maximum length of each section being 1K; the linked lists LV1 and LV2 are stored in the form of an array of structures, as shown in FIG. 2; each entry in the linked lists is a BD and records one protected area, and the data format is:
typedef struct {
    UINT32 length;      /* linked list entry length: 0: not used; others: data length */
    UINT32 addr;        /* address of data in RAM of processor */
    UINT32 back_addr_A; /* offset of backup A in the independent RAM controlled by the FPGA parallel processing module */
    UINT32 back_addr_B; /* offset of backup B in the independent RAM controlled by the FPGA parallel processing module */
} BDTable;
[0070] The corresponding relation between parameters is shown in FIG. 3, wherein length is the length of the protected area; addr is the initial address of the protected area in the protected RAM space, that is, the protected area to which the address points in the protected RAM space; back_addr_A is the initial address of the backup A in the independent RAM controlled by the FPGA parallel processing module, that is, the area to which the address points in the independent RAM; back_addr_B is the initial address of the backup B in the independent RAM controlled by the FPGA parallel processing module, that is, the area to which the address points in the independent RAM.
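By way of a non-limiting illustration, the register principle may be sketched in C as follows. The sketch assumes that UINT32 is a 32-bit unsigned integer and that "1K" means 1024 bytes; the helper bd_register, the constant BD_MAX_SECTION and the contiguous layout of the backups from the offsets off_a and off_b are hypothetical and are not taken from the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t UINT32;          /* assumption: UINT32 is a 32-bit unsigned integer */

typedef struct {
    UINT32 length;                /* 0: entry not used; otherwise data length in bytes */
    UINT32 addr;                  /* address of the data in the processor RAM */
    UINT32 back_addr_A;           /* offset of backup A in the FPGA-controlled RAM */
    UINT32 back_addr_B;           /* offset of backup B in the FPGA-controlled RAM */
} BDTable;

#define BD_MAX_SECTION 1024u      /* assumption: "1K" means 1024 bytes */

/*
 * Hypothetical helper: split one protected area [addr, addr + length) into
 * sections of at most BD_MAX_SECTION bytes and write one BD per section.
 * The backups are assumed to be laid out contiguously from off_a and off_b.
 * Returns the number of entries written, or 0 if the table is too small.
 */
size_t bd_register(BDTable *table, size_t capacity,
                   UINT32 addr, UINT32 length, UINT32 off_a, UINT32 off_b)
{
    size_t needed = (size_t)(length / BD_MAX_SECTION)
                  + (length % BD_MAX_SECTION != 0u ? 1u : 0u);
    size_t n = 0;
    UINT32 done = 0;

    assert(table != NULL);
    if (needed > capacity)
        return 0;

    while (done < length) {
        UINT32 chunk = length - done;
        if (chunk > BD_MAX_SECTION)
            chunk = BD_MAX_SECTION;
        table[n].length = chunk;
        table[n].addr = addr + done;
        table[n].back_addr_A = off_a + done;
        table[n].back_addr_B = off_b + done;
        done += chunk;
        n++;
    }
    return n;
}
```

Under these assumptions, a protected area of 2500 bytes would occupy three BD entries of 1024, 1024 and 452 bytes.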
[0071] For the sake of convenience, the protected areas in the linked lists are protected areas to which the addresses of BD blocks in the linked lists point.
[0072] S3: when the system is powered on for initialization, the system configuration is completed: a PCIe interface is configured, and the protected RAM space is mapped to a PCIe address space of the FPGA parallel processing module; through a register of the FPGA parallel processing module, the number of entries of the linked list LV1 processed in each interrupt is configured as the total entry number, and the number of entries of the linked list LV2 processed in each interrupt is configured as the total entry number/P, that is, all processing is completed within P interrupts.
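The per-interrupt entry counts configured in S3 may be illustrated with the following minimal C sketch. The function names are hypothetical, and rounding the LV2 quotient up (so that P interrupts always cover every entry when the total is not a multiple of P) is an assumption; the source only states "total entry number/P".

```c
#include <assert.h>
#include <stdint.h>

/* LV1: every entry of the linked list is processed in every interrupt. */
uint32_t lv1_entries_per_interrupt(uint32_t total_entries)
{
    return total_entries;
}

/*
 * LV2: the entries are spread over p interrupts.
 * Assumption: the quotient is rounded up so that p interrupts are enough
 * even when total_entries is not a multiple of p.
 */
uint32_t lv2_entries_per_interrupt(uint32_t total_entries, uint32_t p)
{
    assert(p > 0u);
    return total_entries / p + (total_entries % p != 0u ? 1u : 0u);
}

/* Time for one complete LV2 sweep, given the interrupt period in microseconds. */
uint32_t lv2_sweep_time_us(uint32_t p, uint32_t period_us)
{
    return p * period_us;
}
```

For example, with 100 LV2 entries and P = 8, 13 entries would be processed per interrupt under this rounding.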
[0073] The interrupt cycle varies within 500-1000 µs. Abnormal bit change detection and recovery are completed in each interrupt, such that the detection and recovery response is improved, and the capacity of power protection equipment to resist exceptions (misoperation and failure to operate) is improved.
[0074] In the detection process, as shown in FIG. 4, the linked lists LV1/LV2 are detected first, and exceptions of the linked lists are recovered, such that the correctness of information in the linked lists is guaranteed, and the corresponding protected areas can be detected and recovered correctly later according to the information recorded in the linked lists LV1/LV2. There is no requirement for the sequential order of error detection and recovery of the linked lists LV1 and LV2. Error detection and recovery of the linked list LV1 or LV2 and the corresponding protected areas must be serial. To facilitate understanding, in this embodiment of the invention, error detection and recovery of the linked lists LV1 and LV2 are described first, and then error detection and recovery of the protected areas in the linked lists LV1 and LV2 are described.
[0075] As shown in FIG. 5, the specific process of determining whether the linked list LV1 and the two backups A and B of the linked list LV1 are correct comprises S4-S8.
[0076] S4: the FPGA parallel processing module reads the linked list LV1 and the two backups A and B of the linked list LV1 according to the interrupt tick. The linked list at each position is successively read three times, and corresponding CRC codes are calculated respectively; one copy of correct read data is selected from the three copies of data at each position based on the two-out-of-three rule.
[0077] Data read from the three positions is subjected to communication process check as follows: the three copies of linked list data read at each position are subjected to CRC; if the three CRC codes of a position are all different, it indicates that the reading function is abnormal or the system hardware is abnormal, the device should be locked, and the system should be restarted; if two of the three CRC codes are the same, the two corresponding copies of linked list data are correct, and either of the two copies is taken as correct read data.
[0078] The communication process check aims to prevent misjudgment caused by an error of the communication process, and is optional. If the system does not take into account communication interface errors, this check process can be omitted; that is, data is read once and taken as correct data for later determination.
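The communication process check may be sketched in C as follows. The source does not name a CRC variant, so the common reflected CRC-32 (polynomial 0xEDB88320) is assumed here; the function names crc32_calc and select_read_2oo3 are hypothetical, and the three reads are assumed to be already buffered by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). Assumed variant. */
uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/*
 * Two-out-of-three selection over three successive reads of one position.
 * Returns the index (0..2) of a read whose CRC matches at least one other
 * read, or -1 if all three CRC codes differ (read function or hardware
 * abnormal: the caller should lock the device and restart the system).
 */
int select_read_2oo3(const uint8_t *reads[3], size_t len)
{
    uint32_t c[3];
    for (int i = 0; i < 3; i++) {
        assert(reads[i] != NULL);
        c[i] = crc32_calc(reads[i], len);
    }
    if (c[0] == c[1] || c[0] == c[2])
        return 0;
    if (c[1] == c[2])
        return 1;
    return -1;
}
```

When the check is omitted, the caller would simply take a single read as the correct read data.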
[0079] It should be pointed out that data may be read more than three times. When data is read four times, correct read data can be determined based on the two-out-of-four rule or the three-out-of-four rule.
[0080] S5: the FPGA parallel processing module calculates and compares SM3 signature information of the correct read data of the linked list LV1 and the two backups A and B of the linked list LV1, and determines whether an abnormal bit change happens to the data at the three positions based on the two-out-of-three rule; if the SM3 signature information of the three positions is identical, it is determined that there is no exception, that is, the data at the three positions is identical; if the SM3 signature information of two positions is identical, it is determined that data at the two positions is normal, and data at the other position is abnormal; if the SM3 signature information of the three positions is different from each other, it is determined that the data is invalid.
[0081] S6: the FPGA parallel processing module calculates and compares MD5 information abstracts of the correct read data of the linked list LV1 and the two backups A and B of the linked list LV1, and determines whether an abnormal bit change happens to data at the three positions based on the two-out-of-three rule; if the MD5 information abstracts of the three positions are identical, it is determined that there is no exception, that is, data at the three positions is identical; if the MD5 information abstracts of two positions are identical, it is determined that data at the two positions is normal, and data at the other position is abnormal; if the MD5 information abstracts of the three positions are different from each other, it is determined that the data is invalid.
[0082] S7: the FPGA parallel processing module calculates and compares BCC codes of the correct read data of the linked list LV1 and the two backups A and B of the linked list LV1, and determines whether an abnormal bit change happens to data at the three positions according to the two-out-of-three rule; if the BCC codes of the three positions are identical, it is determined that there is no exception, that is, data at the three positions is identical; if the BCC codes of two positions are identical, it is determined that data at the two positions is normal, and data at the other position is abnormal; if the BCC codes of the three positions are different from each other, it is determined that the data is invalid.
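The per-method determination of S5-S7 follows one pattern: a check code is calculated for each of the three positions and the three codes are compared. A minimal C sketch is given below using the BCC code, which is assumed here to be the XOR of all bytes (the source does not define the BCC variant). The type vote_t and the function names are hypothetical. For SM3 or MD5, the same comparison would be applied to the full digests rather than to a 32-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of comparing the check codes of the three positions. */
typedef enum {
    VOTE_ALL_SAME,   /* no exception: the three positions are identical */
    VOTE_ORIG_BAD,   /* backups A and B agree, protected RAM copy differs */
    VOTE_A_BAD,      /* protected RAM copy and backup B agree */
    VOTE_B_BAD,      /* protected RAM copy and backup A agree */
    VOTE_INVALID     /* all three differ: data invalid */
} vote_t;

/* BCC, assumed to be the XOR of all bytes of the data. */
uint8_t bcc_calc(const uint8_t *data, size_t len)
{
    uint8_t bcc = 0;
    assert(data != NULL || len == 0u);
    for (size_t i = 0; i < len; i++)
        bcc ^= data[i];
    return bcc;
}

/* Two-out-of-three determination over the check codes of the three positions. */
vote_t vote_positions(uint32_t code_orig, uint32_t code_a, uint32_t code_b)
{
    if (code_orig == code_a && code_a == code_b)
        return VOTE_ALL_SAME;
    if (code_a == code_b)
        return VOTE_ORIG_BAD;
    if (code_orig == code_b)
        return VOTE_A_BAD;
    if (code_orig == code_a)
        return VOTE_B_BAD;
    return VOTE_INVALID;
}
```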
[0083] It should be noted that, in this embodiment, data is checked three times respectively with SM3 signature information, MD5 information abstracts and BCC codes (each once). However, those skilled in the art should understand that, the check methods are not limited to the three methods described here, and data can also be checked through other common methods such as CRC32 and CRC64. The check methods and the check times can be selected freely. For example, data may be checked with SM3 signature information twice or be checked with CRC32 twice. By using different check methods, a final misjudgment caused by the principle of one method can be prevented. During specific application, the check methods and check times can be selected/combined according to the tolerance to misjudgment of the system.
[0084] In this embodiment of the invention, the three check processes are not associated, and can be performed in parallel.
[0085] S8: the FPGA parallel processing module compares determination results of the linked list LV1 obtained with SM3 signature information, MD5 information abstracts and BCC codes, and determines whether the linked list LV1 and the two backups A and B of the linked list LV1 are correct; if the determination results obtained by two or more check methods are the same, the same result is taken as a final determination result; otherwise, the determination is invalid.
[0086] If the final result is normal (that is, data at the three positions is identical), the linked list LV1 and the two backups A and B of the linked list LV1 are correct, and the detection of the linked list LV1 is ended.
[0087] If the determination is invalid, it indicates that there is an unrecoverable error, and the device needs to be locked, and the system needs to be restarted to be recovered.
[0088] If two copies of data are normal and the other copy of data is abnormal, the abnormal data can be recovered according to an abnormal data recovery method described below. Here, the abnormal data may be the linked list LV1, the backup A or the backup B; that is, either the copy in the protected RAM or one of the backup copies may be incorrect and need to be recovered.
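The final determination of S8 and the three possible outcomes may be sketched in C as follows. The types vote_t and action_t and both function names are hypothetical; the sketch only assumes that each check method has already produced one verdict for the three positions.

```c
#include <assert.h>

/* Verdict of one check method (SM3, MD5 or BCC) over the three positions. */
typedef enum {
    VOTE_ALL_SAME,   /* the three positions are identical */
    VOTE_ORIG_BAD,   /* the copy in the protected RAM is abnormal */
    VOTE_A_BAD,      /* backup A is abnormal */
    VOTE_B_BAD,      /* backup B is abnormal */
    VOTE_INVALID     /* all three positions differ */
} vote_t;

typedef enum { ACT_NONE, ACT_RECOVER, ACT_LOCK_AND_RESTART } action_t;

/* Final result: the verdict shared by two or more methods, otherwise invalid. */
vote_t combine_methods(vote_t sm3, vote_t md5, vote_t bcc)
{
    if (sm3 == md5 || sm3 == bcc)
        return sm3;
    if (md5 == bcc)
        return md5;
    return VOTE_INVALID;
}

/* Map the final result to the handling described for S8. */
action_t action_for(vote_t final_result)
{
    switch (final_result) {
    case VOTE_ALL_SAME:
        return ACT_NONE;               /* detection of this item is ended */
    case VOTE_ORIG_BAD:
    case VOTE_A_BAD:
    case VOTE_B_BAD:
        return ACT_RECOVER;            /* one abnormal copy, two normal copies */
    case VOTE_INVALID:
        return ACT_LOCK_AND_RESTART;   /* unrecoverable error */
    }
    assert(!"unknown verdict");
    return ACT_LOCK_AND_RESTART;
}
```

Using check methods based on different principles means that a misjudgment inherent to one method is outvoted by the other two.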
[0089] S9: whether the linked list LV2 and the backups A and B of the linked list LV2 are correct is determined according to S4 to S8, and if there is abnormal data in data at the three positions, the data needs to be recovered.
[0090] The above processing process guarantees the correctness of the linked lists LV1/LV2 and the two backups A and B of the linked lists LV1/LV2, and error detection and recovery are performed on the corresponding protected areas later according to correct information recorded in the linked lists LV1/LV2.
[0091] As shown in FIG. 4, the specific process of determining whether the protected areas in the linked list LV1 and the two backups A and B of the protected areas in the linked list LV1 are correct comprises S10-S15.
[0092] S10: the FPGA parallel processing module reads data of the corresponding RAM space and the two backups A and B of the first protected area in the linked list LV1. Data at each position is successively read three times, and corresponding CRC codes are calculated respectively; and one copy of data is selected from the three copies of data at each position based on the two-out-of-three rule.
[0093] Data read from the three positions is subjected to communication process check as follows: the three copies of data read from each position are subjected to CRC; if the three CRC codes of a position are all different, it indicates that the reading function is abnormal or the system hardware is abnormal, the device needs to be locked, and the system needs to be restarted. If two of the three CRC codes are identical, it indicates that the two corresponding copies of data are correct, and either of the two copies is taken as correct read data.
[0094] S11: the FPGA parallel processing module calculates and compares SM3 signature information of the correct read data of the corresponding RAM space and the backups A and B of the first protected area in the linked list LV1, and determines whether an abnormal bit change happens to data at the three positions based on the two-out-of-three rule; if the SM3 signature information of the three positions is identical, it is determined that there is no exception, that is, data at the three positions is identical; if the SM3 signature information of two positions is identical, it is determined that data at the two positions is correct, and data at the other position is abnormal; if the SM3 signature information of the three positions is different from each other, it is determined that the data is invalid.
[0095] S12: the FPGA parallel processing module calculates and compares MD5 information abstracts of the correct read data of the corresponding RAM space and the backups A and B of the first protected area in the linked list LV1, and determines whether an abnormal bit change happens to data at the three positions based on the two-out-of-three rule; if the MD5 information abstracts of the three positions are identical, it is determined that there is no exception, that is, data at the three positions is identical; if the MD5 information abstracts of two positions are identical, it is determined that data at the two positions is correct, and data at the other position is abnormal; if the MD5 information abstracts of the three positions are different from each other, it is determined that the data is invalid.
[0096] S13: the FPGA parallel processing module calculates and compares BCC check codes of the correct read data of the corresponding RAM space and the backups A and B of the first protected area in the linked list LV1, and determines whether an abnormal bit change happens to data at the three positions based on the two-out-of-three rule; if the BCC check codes of the three positions are identical, it is determined that there is no exception, that is, data at the three positions is identical; if the BCC check codes of two positions are identical, it is determined that data at the two positions is correct, and data at the other position is abnormal; if the BCC check codes of the three positions are different from each other, it is determined that the data is invalid.
[0097] S14: the FPGA parallel processing module compares determination results obtained with SM3 signature information, MD5 information abstracts and BCC codes, and determines whether the corresponding RAM space and the two backups A and B of the first protected area in the linked list LV1 are correct based on the two-out-of-three rule; if the determination results obtained by two or more check methods are the same, the same result is taken as a final result; otherwise, the determination is invalid.
[0098] If the final determination result is normal (that is, data at the three positions is identical), the corresponding RAM space and the two backups A and B of the first protected area in the linked list LV1 are correct, and the detection of the first protected area in the linked list LV1 is ended.
[0099] If the determination is invalid, it indicates that there is an unrecoverable error, and the device needs to be locked, and the system needs to be restarted to be recovered.
[0100] If two copies of data are normal and the other copy of data is abnormal, the abnormal data can be recovered according to an abnormal data recovery method described below. Here, the abnormal data may be the RAM space, the backup A or the backup B; that is, either the copy in the protected RAM or one of the backup copies may be incorrect and need to be recovered.
[0101] S15: the FPGA parallel processing module repeats S10-S14 to determine whether data of all the protected areas in the linked list LV1 is correct, and if there is an exception, the exception is recovered. In this way, data detection of all the protected areas in the linked list LV1 is ended.
[0102] The protected areas in the linked list LV2 are detected in multiple interrupts. The protected areas in the linked list LV2 are divided into multiple groups, and one group of protected areas (multiple BDs) is detected in each interrupt tick, so every time LV2 is detected in one interrupt tick, the position of the next group of protected areas needs to be updated.
[0103] As shown in FIG. 4, the specific process of determining whether the protected areas in the current detection group in the linked list LV2 and the two backups A and B of the protected areas in the current detection group are correct comprises S16-S22.
[0104] S16: the FPGA parallel processing module reads data of the corresponding RAM space and the two backups A and B of the first protected area in the current detection group in the linked list LV2. Data at each position is successively read three times, and corresponding CRC codes are calculated respectively; one copy of data is selected from the three copies of data at each position based on the two-out-of-three rule.
[0105] The current detection group: the protected areas in the linked list LV2 are divided into multiple groups; detection starts from the first group, and the group which is currently being detected is called the current detection group.
[0106] S17: the FPGA parallel processing module calculates and compares SM3 signature information of the correct read data of the corresponding RAM space and the backups A and B of the first protected area in the current detection group of the linked list LV2, and determines whether an abnormal bit change happens to data at the three positions based on the two-out-of-three rule; if the SM3 signature information of the three positions is identical, it is determined that there is no exception, that is, data at the three positions is identical; if the SM3 signature information of two positions is identical, it is determined that data at the two positions is correct, and data at the other position is abnormal; if the SM3 signature information of the three positions is different from each other, it is determined that the data is invalid.
[0107] S18: the FPGA parallel processing module calculates and compares MD5 information abstracts of the correct read data of the corresponding RAM space and the backups A and B of the first protected area in the current detection group in the linked list LV2, and determines whether an abnormal bit change happens to data at the three positions based on the two-out-of-three rule; if the MD5 information abstracts of the three positions are identical, it is determined that there is no exception, that is, data at the three positions is identical; if the MD5 information abstracts of two positions are identical, it is determined that data at the two positions is correct, and data at the other position is abnormal; if the MD5 information abstracts of the three positions are different from each other, it is determined that the data is invalid.
[0108] S19: the FPGA parallel processing module calculates and compares BCC check codes of the correct read data of the corresponding RAM space and the backups A and B of the first protected area in the current detection group in the linked list LV2, and determines whether an abnormal bit change happens to data at the three positions based on the two-out-of-three rule; if the BCC check codes of the three positions are identical, it is determined that there is no exception, that is, data at the three positions is identical; if the BCC check codes of two positions are identical, it is determined that data at the two positions is correct, and data at the other position is abnormal; if the BCC check codes of the three positions are different from each other, it is determined that the data is invalid.
[0109] S20: the FPGA parallel processing module compares determination results obtained with SM3 signature information, MD5 information abstracts and BCC codes, and determines whether the corresponding RAM space and the two backups A and B of the first protected area in the current detection group of the linked list LV2 are correct based on the two-out-of-three rule; if the determination results obtained by two or more check methods are the same, the same result is taken as a final result; otherwise, the determination is invalid.
[0110] If the final determination result is normal (that is, data at the three positions is identical), the corresponding RAM space and the two backups A and B of the first protected area in the current detection group in the linked list LV2 are correct, and the detection of the first protected area in the current detection group in the linked list LV2 is ended.
[0111] If the determination is invalid, it indicates that there is an unrecoverable error, and the device needs to be locked, and the system needs to be restarted to be recovered.
[0112] If two copies of data are normal and the other copy of data is abnormal, the abnormal data can be recovered according to an abnormal data recovery method described below. Here, the abnormal data may be the RAM space, the backup A or the backup B.
[0113] S21: the FPGA parallel processing module repeats S16-S20 to determine whether data of all the protected areas in the current detection group in the linked list LV2 is correct, and if there is an exception, the exception is recovered. In this way, data detection of all the protected areas in the current detection group of the linked list LV2 is ended.
[0114] S22: finally, the current detection group in the linked list LV2 is changed into the next group, and S16-S21 are repeated in the next interrupt cycle to determine and recover data of all the protected areas in the next group in the linked list LV2, until data detection of all the protected areas in all the groups in the linked list LV2 is ended.
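The group-by-group progression of S16-S22 may be sketched in C as a cursor over the LV2 entries. The structure lv2_cursor_t and the function lv2_next_group are hypothetical, and restarting from the first group after the last group has been detected is an assumption; the source only describes one complete sweep.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over the BD entries of the linked list LV2. */
typedef struct {
    uint32_t total;     /* total number of LV2 entries */
    uint32_t per_tick;  /* entries detected per interrupt (one group) */
    uint32_t next;      /* index of the first entry of the next group */
} lv2_cursor_t;

/*
 * Called once per interrupt tick: reports the current detection group as
 * the range [*first, *first + *count) and moves the cursor to the next
 * group. Assumption: after the last group, detection restarts at group 1.
 */
void lv2_next_group(lv2_cursor_t *c, uint32_t *first, uint32_t *count)
{
    uint32_t remaining;

    assert(c != NULL && first != NULL && count != NULL);
    assert(c->per_tick > 0u);

    if (c->next >= c->total)
        c->next = 0u;                       /* start a new sweep */

    remaining = c->total - c->next;
    *first = c->next;
    *count = remaining < c->per_tick ? remaining : c->per_tick;
    c->next += *count;
}
```

With 10 entries and 4 entries per tick, successive ticks would cover entries 0-3, 4-7 and 8-9, after which the sweep starts again.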
[0115] After the system determines the correctness of the protected RAM space, the FPGA parallel processing module performs data recovery. During the data recovery process, in order to improve the real-time performance of response, data is divided into a plurality of basic data blocks in the invention, CRC values are calculated through multiple methods, and abnormal basic data blocks are rapidly located according to the CRC values. When protected data is recovered, the correctness of the communication process should be taken into account, so in the invention, after a correct basic data block is written to an abnormal position, the data is read back immediately, and whether the data has been written in correctly is determined; if the read-back data is inconsistent with the written-in data, the data is rewritten.
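The write-then-read-back step may be sketched in C as follows. The function write_verified is hypothetical, and the bound max_attempts on the number of rewrites is an assumption added so that the loop terminates; the source only states that the data is rewritten when the read-back data is inconsistent. A real implementation would write over the high-speed interface rather than through a plain pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write len bytes from src to dst, read them back at once and compare.
 * On a mismatch the data is rewritten, up to max_attempts writes in total
 * (assumed bound). Returns true once a read-back matches the written data.
 */
bool write_verified(volatile uint8_t *dst, const uint8_t *src,
                    size_t len, unsigned max_attempts)
{
    assert(dst != NULL && src != NULL);

    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        bool ok = true;

        for (size_t i = 0; i < len; i++)      /* write the correct block */
            dst[i] = src[i];

        for (size_t i = 0; i < len; i++) {    /* read back and compare */
            if (dst[i] != src[i]) {
                ok = false;
                break;
            }
        }
        if (ok)
            return true;
    }
    return false;
}
```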
[0116] In case of multiple abnormal positions, one abnormal basic data block is located first, and then the remaining abnormal basic data blocks are located by iteration. In this way, the invention solves the problem of multiple abnormal positions by iteration.
[0117] The invention provides an abnormal data recovery method, which is implemented under the premise that the FPGA parallel processing module finally determines that two copies of data are normal and the other copy of data is abnormal, and in this case, the abnormal data can be recovered. Any one copy of normal data is used as a normal data area, and the copy of abnormal data is used as an abnormal data area. The abnormal data area may be the protected RAM, or an independent RAM area. As shown in FIG. 8, the abnormal data recovery method comprises the following specific steps.
[0118] S1: the abnormal data area and the normal data area are sequentially divided into N blocks, which are referred to as basic data blocks in the following description, according to a minimum data length.
[0119] S2: the FPGA parallel processing module calculates data CRC codes $C_{en}$ of the abnormal data area, wherein n = 0, 1, 2, 3, ..., m, and the data CRC code $C_{en}$ is calculated every $P_n = 2^n$ basic data blocks, with $m \ge 0$ and $N/2 \le 2^m < N$.
[0120] $C_{en}$ is calculated as follows: starting from the first basic data block, $P_n$ consecutive basic data blocks are selected, the next $P_n$ basic data blocks are skipped, and this pattern is repeated to the end of the data area; the CRC code is calculated with all the selected basic data blocks as data sources.
[0121] In this step, $C_{e0}$, $C_{e1}$, ..., $C_{em}$ are calculated in parallel or in series.
[0122] As shown in FIG. 6, the abnormal data area is divided into N=16 basic data blocks which are marked as B1, B2, ..., and B16 respectively; $C_{e0}$ is calculated with B1, B3, B5, ..., and B15 as data sources, which are selected from B1 every one basic data block; $C_{e1}$ is calculated with B1, B2, B5, B6, ..., B13 and B14 as data sources, which are selected from B1 every two basic data blocks; and similarly, $C_{e3}$ is calculated with B1, B2, ..., and B8 as data sources, which are selected from B1 every eight basic data blocks.
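The block selection of FIG. 6 may be sketched in C as follows. With 0-based block indices (B1 has index 0), a block is taken as a data source of level n exactly when bit n of its index is 0, which reproduces the selections listed for N=16. The function names are hypothetical, and CRC-32 (reflected polynomial 0xEDB88320) is an assumed CRC variant; the same routine would be applied to the abnormal and to the normal data area.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if block i (0-based, B1 -> 0) is a data source of level n. */
bool block_selected(size_t i, unsigned n)
{
    return ((i >> n) & 1u) == 0u;
}

/* Feed len bytes into a running CRC-32 state (assumed CRC variant). */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    for (size_t k = 0; k < len; k++) {
        crc ^= data[k];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* CRC code of level n over a data area of n_blocks basic data blocks. */
uint32_t level_crc(const uint8_t *area, size_t n_blocks,
                   size_t block_len, unsigned n)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(area != NULL && block_len > 0u);
    for (size_t i = 0; i < n_blocks; i++) {
        if (block_selected(i, n))
            crc = crc32_update(crc, area + i * block_len, block_len);
    }
    return ~crc;
}
```

Because B16 is not a data source of any level, a fault confined to B16 leaves every level CRC code equal to that of the normal data area.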
[0123] S3: the FPGA parallel processing module calculates data CRC codes $C_{cn}$ of the normal data area, wherein n = 0, 1, 2, 3, ..., m, and the data CRC code $C_{cn}$ is calculated every $P_n = 2^n$ basic data blocks, with $N/2 \le 2^m < N$.
[0124] $C_{cn}$ is calculated as follows: starting from the first basic data block, $P_n$ consecutive basic data blocks are selected, the next $P_n$ basic data blocks are skipped, and this pattern is repeated to the end of the data area; the CRC code is calculated with all the selected basic data blocks as data sources.
[0125] In this step, $C_{c0}$, $C_{c1}$, ..., and $C_{cm}$ are calculated in parallel or in series.
[0126] As shown in FIG. 6, the normal data area is divided into N=16 basic data blocks which are marked as B1, B2, ..., and B16 respectively; $C_{c0}$ is calculated with B1, B3, B5, ..., and B15 as data sources, which are selected from B1 every one basic data block; $C_{c1}$ is calculated with B1, B2, B5, B6, ..., B13 and B14 as data sources, which are selected from B1 every two basic data blocks; and similarly, $C_{c3}$ is calculated with B1, B2, ..., and B8 as data sources, which are selected from B1 every eight basic data blocks.
[0127] S4: the FPGA parallel processing module determines whether $C_{em}$ and $C_{cm}$ are equal; if $C_{em}$ and $C_{cm}$ are equal, it indicates that the abnormal data is located in the latter half of the abnormal data area, and an abnormal data tag position is updated to point to the latter half of the abnormal data area; if $C_{em}$ and $C_{cm}$ are not equal, it indicates that data in the front half of the abnormal data area is abnormal, and the abnormal data tag position is updated to point to the front half of the abnormal data area.
[0128] This step is performed when n=m, and in this case, all the basic data blocks in the whole data area are divided into a front part and a rear part. In S4, whether the abnormal data is in the front half or the latter half of the data area is determined; after the scope of the abnormal data is narrowed, n is set to m-1 and the process is repeated in S5 until n=0, such that the scope of the abnormal data is narrowed to a single abnormal basic data block.
[0129] S5: S4 is repeated to calculate and compare C_en and C_cn, wherein n is decreased gradually from m-1 to 0, and the abnormal data tag position is updated in each step. When n=0, the abnormal data tag position points to the basic data block in the abnormal data area where the abnormal data is located, which is abbreviated as the abnormal basic data block, and the correct data corresponding to the abnormal basic data block is the corresponding basic data block in the normal data area, which is abbreviated as the correct basic data block.
[0130] In this embodiment of the invention, the determination process, as shown in FIG. 7, is as follows: whether C_e3 and C_c3 are equal is determined first; if so, it indicates that the abnormal data is located in the latter half B9->B16; then, whether C_e2 and C_c2 are equal is determined; if so, it indicates that the abnormal data is located in the latter half B13->B16; next, whether C_e1 and C_c1 are equal is determined; if so, it indicates that the abnormal data is located in the latter half B15->B16; and then, whether C_e0 and C_c0 are equal is determined; if so, it indicates that the abnormal data is located in the basic data block B16. In this way, the specific abnormal basic data block is located. As shown in FIG. 7, other determination steps are similar and will not be detailed.
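The localization in S4 and S5 can be sketched in Python as follows. This is a hedged illustration under stated assumptions: CRC-32 from the standard `zlib` module stands in for the unspecified CRC code, exactly one basic data block is assumed to be abnormal, and the names `group_crc` and `locate_abnormal_block` are introduced here for illustration only.

```python
import zlib


def group_crc(blocks: list[bytes], n: int) -> int:
    """CRC-32 over the blocks selected for level n (take 2**n, skip 2**n, repeat)."""
    p = 1 << n
    crc = 0
    for i, block in enumerate(blocks):
        if (i // p) % 2 == 0:
            crc = zlib.crc32(block, crc)  # chain the running CRC across blocks
    return crc


def locate_abnormal_block(abnormal: list[bytes], normal: list[bytes]) -> int:
    """Return the 0-based index of the single abnormal block (S4 and S5)."""
    num_blocks = len(normal)
    m = max((num_blocks - 1).bit_length() - 1, 0)  # satisfies N/2 <= 2**m < N
    tag = 0  # abnormal data tag position: start of the suspect range
    for n in range(m, -1, -1):
        if group_crc(abnormal, n) == group_crc(normal, n):
            tag += 1 << n  # equal: the fault lies in the latter half
        # not equal: the fault lies in the front half, so the tag stays put
    return tag


normal = [bytes([i]) * 8 for i in range(16)]
abnormal = list(normal)
abnormal[15] = b"\xff" + normal[15][1:]  # corrupt B16, as in the FIG. 7 example
print("abnormal block: B%d" % (locate_abnormal_block(abnormal, normal) + 1))
# abnormal block: B16
```

Only m+1 CRC comparisons are needed per localization (four for N=16). If no block or more than one block is abnormal, this sketch still returns an index, which is why the full comparison in S8 is needed afterwards.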
[0131] S6: the FPGA parallel processing module copies contents in the corresponding correct basic data block in the normal data area to the abnormal data block, to which the abnormal data tag position points, in the abnormal data area.
[0132] S7: then, the FPGA parallel processing module rereads the abnormal basic data block, calculates the CRC code of the reread abnormal basic data block, and calculates the CRC code of the corresponding correct basic data block. Whether the two CRC codes are equal is determined by comparison; if the two CRC codes are not equal, S6 and S7 are repeated for recovery, and in this embodiment, S6 and S7 are repeated at most twice; and if the two CRC codes are equal, recovery of the abnormal basic data block is completed.
[0133] S8: the FPGA parallel processing module calculates the CRC codes of all the basic data blocks in the normal data area and the CRC codes of all the basic data blocks in the abnormal data area and determines, by comparison, whether the CRC codes of all the basic data blocks in the normal data area and the CRC codes of all the basic data blocks in the abnormal data area are equal; if not, it indicates that there is still abnormal data, and S2-S7 are repeated to locate and recover other abnormal data; if so, recovery of abnormal data is ended.
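Steps S6 to S8 can be sketched as follows. This is a minimal Python illustration, not the FPGA implementation: CRC-32 from `zlib` is assumed as the CRC code, "repeated at most twice" is interpreted as one initial attempt plus up to two repeats, and the names `repair_block` and `mismatched_blocks` are hypothetical.

```python
import zlib


def repair_block(abnormal: list[bytearray], normal: list[bytes], index: int,
                 max_repeats: int = 2) -> bool:
    """S6 and S7: copy the correct block over the abnormal one, then verify by CRC."""
    for _ in range(1 + max_repeats):
        abnormal[index][:] = normal[index]   # S6: overwrite the block in place
        reread = bytes(abnormal[index])      # S7: reread the recovered block
        if zlib.crc32(reread) == zlib.crc32(normal[index]):
            return True                      # CRC codes match: recovery completed
    return False


def mismatched_blocks(abnormal: list[bytearray], normal: list[bytes]) -> list[int]:
    """S8: indices of blocks whose per-block CRC codes still differ."""
    return [i for i, (a, c) in enumerate(zip(abnormal, normal))
            if zlib.crc32(bytes(a)) != zlib.crc32(c)]


normal = [bytes([i]) * 4 for i in range(4)]
abnormal = [bytearray(b) for b in normal]
abnormal[2][0] ^= 0x01                       # simulate a single bit flip
print(mismatched_blocks(abnormal, normal))   # [2]
print(repair_block(abnormal, normal, 2))     # True
print(mismatched_blocks(abnormal, normal))   # []
```

In software the copy always succeeds on the first attempt; the retry loop only matters when the write or the reread can itself be disturbed, as on real RAM.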
[0134] As shown in FIG. 9, the typical flow of the above process is as follows: first, error detection and recovery of the linked list LV1 are performed; then, error detection and recovery of the linked list LV2 are performed; then, error detection and recovery of all the protected areas in the linked list LV1 are performed; and finally, error detection and recovery of the protected areas in the current group in the linked list LV2 are performed. Error detection and recovery of the corresponding protected areas in the linked list LV1 are performed after error detection and recovery of the linked list LV1 are completed; similarly, error detection and recovery of the corresponding protected areas in the linked list LV2 are performed after error detection and recovery of the linked list LV2 are completed; the linked list LV1 and the linked list LV2 have no precedence relationship, and their processing can be performed in parallel.
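The serial form of this flow can be sketched in Python as a per-interrupt work plan. This is an illustration under assumptions: the LV2 groups are visited in round-robin order, one group per interrupt cycle, which the text does not state explicitly, and the function name and area labels are hypothetical.

```python
def interrupt_plan(cycle: int, lv1_areas: list[str],
                   lv2_groups: list[list[str]]) -> list[str]:
    """Ordered work items for one interrupt cycle, following the serial flow."""
    plan = ["check linked list LV1", "check linked list LV2"]
    plan += [f"check LV1 area {a}" for a in lv1_areas]   # every LV1 area, every cycle
    if lv2_groups:
        group = lv2_groups[cycle % len(lv2_groups)]      # assumed round-robin over groups
        plan += [f"check LV2 area {a}" for a in group]
    return plan


lv1 = ["key_params", "trip_logic"]
lv2 = [["waveform_a"], ["waveform_b", "event_log"]]
for cycle in range(3):
    print(cycle, interrupt_plan(cycle, lv1, lv2))
```

With this arrangement the highest-level areas are covered in every interrupt cycle, while each LV2 area is covered once every `len(lv2_groups)` cycles.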
[0135] If data in the protected RAM space needs to be normally modified during running of the system (such as normal data parameter modification), the data should be modified in the following sequence: first, the software stops the check function and the determination function, then stops the data recovery function, and reads back the running state of the FPGA parallel processing module to confirm that the functions have been stopped; then, the data is modified normally; and finally, the check function and the determination function are restarted, and then the data recovery function is restarted.
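The modification sequence can be sketched in Python as a context manager. This is only an illustration: the class `FakeFpga` and its attributes are hypothetical stand-ins for the control interface of the FPGA parallel processing module, which the text does not specify.

```python
from contextlib import contextmanager


class FakeFpga:
    """Hypothetical stand-in for the FPGA module's control interface."""

    def __init__(self):
        self.checking = True   # check and determination functions
        self.recovery = True   # data recovery function

    def read_state(self):
        return (self.checking, self.recovery)


@contextmanager
def protection_paused(fpga):
    fpga.checking = False                      # 1. stop check and determination
    fpga.recovery = False                      # 2. stop data recovery
    if fpga.read_state() != (False, False):    # 3. read back to confirm the stop
        raise RuntimeError("FPGA functions are still running")
    try:
        yield                                  # 4. caller modifies the data here
    finally:
        fpga.checking = True                   # 5. restart check and determination
        fpga.recovery = True                   # 6. then restart data recovery


fpga = FakeFpga()
ram = {"setpoint": 10}
with protection_paused(fpga):
    ram["setpoint"] = 12                       # normal parameter modification
print(ram, fpga.read_state())
# {'setpoint': 12} (True, True)
```

Stopping recovery before the write prevents the module from treating the intended change as a soft error and restoring the old value.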
[0136] Online parallel processing of the invention: "online" means that error detection and recovery of a protected RAM are performed when the system functions and runs normally. In this specification, read checking, checking with SM3 signature information, checking with MD5 information abstracts, checking with BCC codes, comprehensive determination and abnormal data recovery of the linked list LV1/LV2 and the corresponding protected RAM are performed synchronously according to system interrupts to realize error detection and recovery of RAM abnormal changes. In the invention, "parallel" first means that normal functions of the system and error detection and recovery are performed in parallel; error detection and recovery are implemented through the FPGA parallel processing module and do not take the time of the processor. Error detection and recovery of LV1 and LV2 can be performed in parallel or in series according to the design of the system and the FPGA resources.
[0137] Real-time: read checking, checking with SM3 signature information, checking with MD5 information abstracts, checking with BCC codes, comprehensive determination and abnormal data recovery of the linked list LV1/LV2 and the corresponding protected RAM are performed synchronously according to system interrupts, error detection and recovery are performed in real time, and error detection and recovery of a key data area are completed in each interrupt, so the real-time performance is high.
[0138] It should be noted that the communication process check aims to prevent misjudgment caused by an error of the communication process, and is optional. If the system does not take into account communication interface errors, this check process can be omitted; that is, data is read once and taken as correct data for later determination.
[0139] It should be noted that, in this embodiment, data is checked three times respectively with SM3 signature information, MD5 information abstracts and BCC codes (each once). However, those skilled in the art should understand that, the check methods are not limited to the three methods described here, and data can also be checked through other common methods such as CRC32 and CRC64. The check methods and the check times can be selected freely. For example, data may be checked with SM3 signature information twice or be checked with CRC32 twice. By using different check methods, a final misjudgment caused by the principle of one method can be prevented. During specific application, the check methods and check times can be selected/combined according to the tolerance to misjudgment of the system.
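The combination of several check methods can be sketched in Python as two rounds of voting: first over the three copies within each check method, then over the check methods. This is a hedged illustration: CRC-32 is used in place of SM3 (the text allows other common check methods, and SM3 is not guaranteed to be available in the Python standard library), the BCC code is assumed to be a byte-wise XOR, and all function names are introduced here.

```python
import hashlib
import zlib
from collections import Counter
from functools import reduce


def bcc(data: bytes) -> int:
    """Block check character, assumed here to be the XOR of all bytes."""
    return reduce(lambda acc, b: acc ^ b, data, 0)


CHECKS = {
    "md5": lambda d: hashlib.md5(d).digest(),
    "crc32": zlib.crc32,   # stands in for SM3 in this sketch
    "bcc": bcc,
}


def judge_one(check, copies: list[bytes]):
    """Abnormal copy indices under one check method, or None if no two copies agree."""
    digests = [check(c) for c in copies]
    value, count = Counter(digests).most_common(1)[0]
    if count < 2:
        return None
    return frozenset(i for i, d in enumerate(digests) if d != value)


def judge(copies: list[bytes]):
    """Final determination: a result shared by at least two check methods."""
    results = [judge_one(check, copies) for check in CHECKS.values()]
    votes = Counter(r for r in results if r is not None)
    if not votes:
        return None
    result, count = votes.most_common(1)[0]
    return result if count >= 2 else None


# Original, backup A, backup B; backup B has two bytes swapped.
copies = [b"ab", b"ab", b"ba"]
print(judge(copies))   # frozenset({2})
```

In this example the XOR-based BCC code cannot see the swapped bytes and reports all copies as normal, but MD5 and CRC-32 agree that copy 2 is abnormal, so the final determination is still correct. This is the kind of method-specific misjudgment that combining different check methods is meant to prevent.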
[0140] Embodiment 3
[0141] The invention further provides a soft error real-time detection and recovery device based on parallel processing, which comprises a linked list management module and an error detection and recovery module.
[0142] The linked list management module is used for classifying all protected areas in a protected RAM space into one or more levels, completing error detection and recovery of the protected area of a highest level in each interrupt cycle, and completing error detection and recovery of the protected areas of the other levels in multiple interrupt cycles; registering all the levels of protected areas to generate linked lists which correspond to the levels in number and are located in the protected RAM space, and creating at least two backups of the linked lists and the protected areas in the linked lists in other spaces, contents of each linked list comprising positions and lengths of the protected areas of the same level, and positions of the backups.
[0143] The error detection and recovery module is used for performing error detection and recovery on all the levels of linked lists and the protected areas in the linked lists, wherein the process of performing error detection and recovery on any one level of linked list and the protected areas in the linked list comprises: performing error detection on the linked list and the backups of the linked list, and if an exception is detected, recovering the exception; and performing error detection on the protected areas in the linked list and the backups of the protected areas in the linked list, and if an exception is detected, recovering the exception.
[0144] The specific implementation of the modules of the device, and error detection and recovery of the linked list and the backups of the linked list are the same as those in Embodiment 1 and Embodiment 2.
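The contents of a linked list as described for the linked list management module can be sketched with Python data classes. This is an illustrative data layout only; the class names, field names and addresses are assumptions, and the real lists reside in the protected RAM space rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectedArea:
    position: int                       # start address in the protected RAM space
    length: int                         # size of the protected area in bytes
    backup_positions: tuple[int, int]   # start addresses of backups A and B


@dataclass
class LevelList:
    level: int                          # 1 is assumed to be the highest level (LV1)
    areas: list[ProtectedArea] = field(default_factory=list)

    def register(self, position: int, length: int,
                 backup_a: int, backup_b: int) -> ProtectedArea:
        """Register one protected area together with its two backup positions."""
        area = ProtectedArea(position, length, (backup_a, backup_b))
        self.areas.append(area)
        return area


lv1 = LevelList(level=1)
lv1.register(position=0x1000, length=256, backup_a=0x8000, backup_b=0xC000)
print(len(lv1.areas), hex(lv1.areas[0].backup_positions[1]))
# 1 0xc000
```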
[0145] The device in this embodiment can realize error detection and recovery of a RAM space and can locate and recover error data within one interrupt tick, thus improving the real-time performance of the system in processing RAM exceptions and reducing the influence of RAM exceptions on the system.
[0146] Those skilled in the art would appreciate that the embodiments of the application can be provided as a method, a system or a computer program product. So, the embodiments of the application may be completely hardware embodiments, completely software embodiments, or embodiments combining software and hardware. In addition, the embodiments of the application may be in the form of a computer program product to be implemented on one or more computer-available storage media (including, but not limited to, a disk memory, a CD-ROM, an optical memory, and the like) comprising computer-available program codes.
[0147] The application is described with reference to the flow diagram and/or block diagram of the method, device (system) and computer program product of the embodiments of the application. It should be understood that each process and/or block in the flow diagram and/or block diagram and the combinations of processes and/or blocks in the flow diagram and/or block diagram can be implemented by computer program instructions. These computer program instructions can be configured in a general-purpose computer, a special-purpose computer, an embedded processor, or a processor of other programmable data processing terminals to create a machine, so that the instructions can be executed by the computer or the processor of other programmable data processing terminals to create a device for realizing specific functions in one or more processes in the flow diagram and/or in one or more blocks in the block diagram.
[0148] These computer program instructions may also be stored in a computer-readable memory that can guide the computer or other program data processing terminals to work in a specific manner, so that the instructions stored in the computer-readable memory can create a product including an instruction device, and the instruction device implements specific functions in one or more processes of the flow diagram and/or one or more blocks in the block diagram.
[0149] These computer program instructions may also be loaded on a computer or other programmable data processing terminal devices, so that the computer or other programmable terminal devices can perform a series of operation steps to carry out processing realized by the computer, and the instructions are executed on the computer or other programmable terminal devices to realize specific functions in one or more processes in the flow diagram and/or one or more blocks in the block diagram.
[0150] The above embodiments are merely preferred ones of the invention. It should be pointed out that those ordinarily skilled in the art can make various improvements and transformations without departing from the technical principle of the invention, and all these improvements and transformations should also fall within the protection scope of the invention.

Claims (17)

  1. A soft error real-time detection and recovery method based on online parallel processing, comprising: classifying all protected areas in a protected RAM space into one or more levels, completing error detection and recovery of the protected area of a highest level in each interrupt cycle, and completing error detection and recovery of the protected areas of the other levels in multiple interrupt cycles; registering all the levels of the protected areas to generate linked lists that correspond to the levels in number and are located in the protected RAM space, and creating at least two backups of the linked lists and the protected areas in the linked lists in other spaces; and performing error detection and recovery on all the levels of the linked lists and the protected areas in the linked lists.
  2. The soft error real-time detection and recovery method based on online parallel processing according to Claim 1, wherein the method is executed by a parallel processing module, the parallel processing module is connected to and accesses the protected RAM space through a high-speed interface, and independent DDRs, SRAMs or RAM spaces are used for storing the backups.
  3. The soft error real-time detection and recovery method based on online parallel processing according to Claim 1, wherein the process of performing error detection and recovery on any one level of a linked list and the protected areas in the linked list comprises: performing error detection on the linked list and the backups of the linked list, and if an exception is detected, recovering the exception; and performing error detection on the protected areas in the linked list and the backups of the protected areas in the linked list, and if an exception is detected, recovering the exception.
  4. The soft error real-time detection and recovery method based on online parallel processing according to Claim 3, wherein performing error detection on the linked list and the backups of the linked list comprises: reading the linked list and the backups of the linked list; checking the linked list and the backups of the linked list with any one or more of check methods selected from SM3 signature information, MD5 information abstracts and BCC codes; and comparing determination results obtained through different check methods to determine whether the linked list and the backups of the linked list are correct.
  5. The soft error real-time detection and recovery method based on online parallel processing according to Claim 4, wherein when multiple check methods are used, check processes of the multiple check methods are performed in parallel.
  6. The soft error real-time detection and recovery method based on online parallel processing according to Claim 4, further comprising: performing communication process check on the read linked list and the backups of the linked list.
  7. The soft error real-time detection and recovery method based on online parallel processing according to Claim 4, wherein checking the linked list and the backups of the linked list through one check method comprises: with the SM3 signature information as the check method: calculating and comparing SM3 signature information of the linked list and the backups of the linked list; and if the SM3 signature information of at least two copies of data is the same, determining that all the copies of the data with the same SM3 signature information are normal and the other copy of the data is abnormal.
  8. The soft error real-time detection and recovery method based on online parallel processing according to Claim 7, wherein comparing determination results obtained by different check methods to determine whether the linked list and the backups of the linked list are correct comprises: comparing determination results of the linked list and the backups of the linked list obtained with the SM3 signature information, the MD5 information abstracts and the BCC codes; and if the determination results obtained with two or more of the check methods are the same, taking the same result as a final determination result.
  9. The soft error real-time detection and recovery method based on online parallel processing according to Claim 3, wherein performing error detection on the protected areas in the linked list and the backups of the protected areas in the linked list comprises: when the level of the linked list is the highest level: performing error detection on the protected areas in the linked list and the backups of the protected areas in sequence, wherein performing error detection on any one protected area in the linked list and the backups of the protected area comprises: reading a current protected area in the linked list and the backups of the current protected area; checking the current protected area in the linked list and the backups of the current protected area with any one or more of check methods selected from SM3 signature information, MD5 information abstracts and BCC codes; and comparing determination results obtained by different check methods to determine whether the current protected area in the linked list and the backups of the current protected area are correct.
  10. The soft error real-time detection and recovery method based on online parallel processing according to Claim 3, wherein performing error detection on the protected areas in the linked list and the backups of the protected areas comprises: when the linked list is of other levels: dividing all the protected areas in the linked list into multiple groups; and in each interrupt cycle, performing error detection on any one protected area in any one group and the backups of the protected area, which comprises: reading a current protected area in a current detection group in the linked list and the backups of the current protected area; checking the current protected area in the current detection group in the linked list and the backups of the current protected area with any one or more of check methods selected from SM3 signature information, MD5 information abstracts and BCC codes; and comparing determination results obtained by different check methods to determine whether the current protected area in the current detection group in the linked list and the backups of the current protected area are correct.
  11. The soft error real-time detection and recovery method based on online parallel processing according to Claim 10, wherein when multiple check methods are used, check processes of the multiple check methods are performed in parallel.
  12. The soft error real-time detection and recovery method based on online parallel processing according to Claim 10, further comprising: performing communication process check on the current protected area in the linked list and the backups of the current protected area.
  13. The soft error real-time detection and recovery method based on online parallel processing according to Claim 10, wherein checking the current protected area in the linked list and the backups of the current protected area through one check method comprises: with SM3 signature information as the check method: calculating and comparing SM3 signature information of the current protected area in the linked list and the backups of the current protected area; and if the SM3 signature information of at least two copies of data is the same, determining that all the copies of the data with the same SM3 signature information are correct and the other copy of the data is abnormal.
  14. The soft error real-time detection and recovery method based on online parallel processing according to Claim 13, wherein comparing determination results obtained by different check methods to determine whether the current protected area in the linked list and the two backups A and B of the current protected area are correct comprises: comparing determination results of the current protected area in the linked list and the two backups A and B of the current protected area obtained with the SM3 signature information, the MD5 information abstracts and the BCC codes; and if the determination results obtained by at least two check methods are the same, taking the same result as a final determination result.
  15. The soft error real-time detection and recovery method based on online parallel processing according to Claim 3, wherein recovering the exception comprises: dividing an abnormal data area and a normal data area into N basic data blocks in sequence; calculating a data CRC code C_en of the abnormal data area, wherein n=0, 1, 2, 3, ..., m, N/2 ≤ 2^m < N, and C_en is calculated by: selecting P_n basic data blocks starting from the first basic data block, then selecting P_n basic data blocks every P_n basic data blocks, and using all the selected basic data blocks as data sources to calculate the CRC code, wherein P_n=2^n; calculating a data CRC code C_cn of the normal data area, wherein C_cn is calculated by: selecting P_n basic data blocks starting from the first basic data block, then selecting P_n basic data blocks every P_n basic data blocks, and using all the selected basic data blocks as data sources to calculate the CRC code; determining whether C_em and C_cm are equal; if so, updating an abnormal data tag position to point to a latter half of the abnormal data area; if not, updating the abnormal data tag position to point to a front half of the abnormal data area; repeating the preceding process to compare C_cn and C_en, wherein n is gradually decreased from m-1 to 0, the abnormal data tag position is updated in each step, and when n=0, the abnormal data tag position is located to a basic data block of the abnormal data area where abnormal data is located; copying contents of a corresponding correct basic data block in the normal data area into the abnormal basic data block in the abnormal data area to which the abnormal data tag position points, such that recovery of the abnormal basic data block is completed; and calculating CRC codes of all the basic data blocks in the normal data area and the abnormal data area and determining, by comparison, whether the two CRC codes are equal; if not, repeating all the steps; and if so, ending recovery of abnormal data.
  16. The soft error real-time detection and recovery method based on online parallel processing according to Claim 15, wherein after recovery of the abnormal basic data block is ended, the soft error real-time detection and recovery method further comprises: rereading the abnormal basic data block; recalculating a CRC code of the abnormal basic data block and calculating a CRC code of the corresponding correct basic data block; and determining, by comparison, whether the two CRC codes are equal, and if so, completing the recovery of the abnormal basic data block.
  17. A soft error real-time detection and recovery device based on online parallel processing, comprising a linked list management module and an error detection and recovery module, wherein: the linked list management module is configured for classifying all protected areas in a protected RAM space into one or more levels; completing error detection and recovery of the protected area of a highest level in each interrupt cycle; completing error detection and recovery of the protected areas of the other levels in multiple interrupt cycles; registering all the levels of the protected areas to generate linked lists that correspond to the levels in number and are located in the protected RAM space; and creating at least two backups of the linked lists and the protected areas in the linked lists in other spaces; and the error detection and recovery module is configured for performing error detection and recovery on all the levels of the linked lists and the protected areas in the linked lists.
GB2303510.8A 2020-08-21 2021-02-02 Online parallel processing soft error real-time error detection and recovery method and system Pending GB2613120A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010849103.0A CN112053737B (en) 2020-08-21 2020-08-21 Online parallel processing soft error real-time error detection and recovery method and system
PCT/CN2021/074836 WO2022037022A1 (en) 2020-08-21 2021-02-02 Online parallel processing soft error real-time error detection and recovery method and system

Publications (2)

Publication Number Publication Date
GB202303510D0 GB202303510D0 (en) 2023-04-26
GB2613120A true GB2613120A (en) 2023-05-24

Family

ID=73600711

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2303510.8A Pending GB2613120A (en) 2020-08-21 2021-02-02 Online parallel processing soft error real-time error detection and recovery method and system

Country Status (3)

Country Link
CN (1) CN112053737B (en)
GB (1) GB2613120A (en)
WO (1) WO2022037022A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053737B (en) * 2020-08-21 2022-08-26 国电南瑞科技股份有限公司 Online parallel processing soft error real-time error detection and recovery method and system
CN115426028B (en) * 2022-08-29 2023-10-20 鹏城实验室 Fault tolerance method and system for data encoding and decoding and high-speed communication system
CN115421967B (en) * 2022-11-04 2022-12-30 中国电力科学研究院有限公司 Method and system for evaluating storage abnormal risk point of secondary equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003323353A (en) * 2002-05-01 2003-11-14 Denso Corp Memory diagnostic device and control device
CN102356384A (en) * 2011-08-23 2012-02-15 华为技术有限公司 Method and device for data reliability detection
CN102779557A (en) * 2011-05-13 2012-11-14 苏州雄立科技有限公司 Method and system for data detection and correction of memory module integrated chip
US20190088350A1 (en) * 2017-09-21 2019-03-21 Canon Kabushiki Kaisha Information processing apparatus, control method thereof, and storage medium
CN111552590A (en) * 2020-04-16 2020-08-18 国电南瑞科技股份有限公司 Detection and recovery method and system for memory bit overturning of power secondary equipment
CN112053737A (en) * 2020-08-21 2020-12-08 国电南瑞科技股份有限公司 Online parallel processing soft error real-time error detection and recovery method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937375B (en) * 2010-08-27 2013-07-31 浙江大学 Code and data real-time error correcting and detecting method and device for pico-satellite central processing unit

Also Published As

Publication number Publication date
CN112053737B (en) 2022-08-26
CN112053737A (en) 2020-12-08
GB202303510D0 (en) 2023-04-26
WO2022037022A1 (en) 2022-02-24

Legal Events

Date Code Title Description
789A Request for publication of translation (sect. 89(a)/1977)

Ref document number: 2022037022

Country of ref document: WO