WO2022037022A1 - Online parallel processing soft error real-time error detection and recovery method and system - Google Patents

Online parallel processing soft error real-time error detection and recovery method and system

Publication number
WO2022037022A1
Authority
WO
WIPO (PCT)
Prior art keywords
linked list
error detection
backup
recovery
data
Prior art date
Application number
PCT/CN2021/074836
Other languages
French (fr)
Chinese (zh)
Inventor
周华良
郑玉平
徐广辉
李友军
刘拯
邹志杨
姜雷
高诗航
汪世平
张家森
Original Assignee
国电南瑞科技股份有限公司
国电南瑞南京控制系统有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国电南瑞科技股份有限公司, 国电南瑞南京控制系统有限公司 filed Critical 国电南瑞科技股份有限公司
Priority to GB2303510.8A priority Critical patent/GB2613120A/en
Publication of WO2022037022A1 publication Critical patent/WO2022037022A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/004Error avoidance
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44Indication or identification of errors, e.g. for repair
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1004Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1048Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices using arrangements adapted for a specific error detection or correction feature
    • G06F11/106Correcting systematically all correctable errors, i.e. scrubbing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1458Management of the backup or restore process
    • G06F11/1469Backup restoration techniques
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/18Address generation devices; Devices for accessing memories, e.g. details of addressing circuits
    • G11C29/24Accessing extra cells, e.g. dummy cells or redundant cells
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/52Protection of memory contents; Detection of errors in memory contents
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/70Masking faults in memories by using spares or by reconfiguring
    • G11C29/74Masking faults in memories by using spares or by reconfiguring using duplex memories, i.e. using dual copies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/83Indexing scheme relating to error detection, to error correction, and to monitoring the solution involving signatures
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C2029/0409Online test
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C2029/0411Online error correction

Definitions

  • the invention belongs to the technical field of detecting abnormal displacement (bit flips) in RAM, and in particular relates to a method and system for real-time detection and recovery of soft errors using online parallel processing.
  • RAM Random Access Memory
  • Soft errors, also called single event effects, have a significant impact on the security and stability of the system that cannot be ignored.
  • the main causes of RAM abnormalities are: (1) alpha particle radiation: alpha particles emitted by the packaging material used by the processor cause abnormal displacement in the memory area of the chip; (2) the growing scale of integrated circuits: transistors are getting smaller and frequencies higher, while the limit voltage of transistors is getting lower and the noise tolerance narrower, which makes the processor more sensitive to crosstalk, voltage disturbance, and electromagnetic radiation and reduces reliability; (3) cosmic radiation.
  • This type of error is common in static random access memory, dynamic random access memory, and non-volatile memory; the second is transient inversion of a data unit, in which a single data bit of the memory flips because of voltage or current jitter, is sampled and recorded by the circuit, and then recovers; this type of abnormal displacement mainly occurs in dynamic random access memory; the third is multi-bit data inversion, in which multiple data bits in memory change at the same time. If abnormal displacement occurs in a key position of the RAM (for example, the code area or a key data area), it may cause a program running error and then a system abnormality. If abnormal displacement can be detected and the data recovered before a problem occurs, the stability and reliability of the system will be greatly improved.
  • reinforcement at the process level usually covers capacitance conditions, resistance conditions, structure, layout, doping, latch circuits (DICE), etc.
  • reinforcement measures at the component level usually include parity checking (Parity), ECC (Error Correction Code), and interleaving (Interleave).
  • Some CPU-class devices support ECC (Error Correcting Code) checking of RAM in hardware: for example, TI's new processors and XILINX's new MPSoC processors support ECC checking in the hardware DDR controller, in on-chip RAM, and in the L1/L2 cache.
  • By adding an 8-bit Hamming code to every 32 or 64 bits, the ECC check function can correct single-bit abnormal displacement and raise an alarm on two-bit displacement, but it cannot correct two-bit displacement, and it can neither alarm on nor correct simultaneous displacement of more bits.
  • If the code monitoring flag is found, the code in non-volatile memory is compared with the running code in RAM; if the two are inconsistent, the check is repeated; if the inconsistency is confirmed, the code running in RAM is judged to be in error, alarm measures are taken immediately, and the error information is recorded to non-volatile memory for failure analysis.
  • This method enables online monitoring of running code by simply modifying the LDR file. However, it cannot correct errors online, its real-time response is poor, and the monitoring function consumes CPU processing time.
  • the error detection and correction program also depends on the CPU itself, cannot locate error data, and has no online recovery function.
  • to limit the consumption of hardware resources, the hardware ECC check function only corrects single-bit displacement and detects two-bit displacement; it cannot raise an alarm when more bits are displaced simultaneously, and its correction capability is very limited;
  • the software-based detection and recovery function consumes processor time and affects the normal operation of the processor.
  • the error detection and correction program also depends on the CPU itself, and it is impossible to locate the erroneous data;
  • the purpose of the present invention is to overcome the deficiencies in the prior art and to provide a real-time error detection and recovery method and system for online parallel processing of soft errors, so as to solve the problem of latent equipment faults or abnormal protection and control functions caused by abnormal memory displacement in existing power secondary equipment.
  • the present invention provides an online parallel processing method for real-time soft error detection and recovery, including the following process:
  • the protected areas of each level are registered to generate one linked list per level, and the linked lists are located in the protected RAM space; each linked list and the protected areas in it are backed up in at least two copies to other RAM spaces; the content of a linked list includes the position, length, and backup positions of each protected area of the same level.
  • the method is performed by a parallel processing module, the parallel processing module accesses the protected RAM space through a high-speed interface connection, and uses an independent DDR, SRAM, or RAM space for backup storage.
  • Error detection is performed on each protected area and its backup in the linked list, and its abnormality is recovered.
  • error detection of the linked list and its backup includes:
  • the linked list and its backup are verified by using any one or more combination verification methods including SM3 signature information, MD5 information digest and BCC verification code;
  • it also includes verifying the communication process when the linked list and its backups are read: the linked list at each position is read at least three times in succession, and a CRC check code is calculated for each read; for each position, data whose CRC check code is the same in at least two reads is judged correct, and one of those correct reads is taken as the correct read data.
  • a verification method is used to verify the linked list and its two backups A and B, including:
  • the verification method is described by taking SM3 signature information as an example:
  • the comparison of the judgment results of each verification method determines the correctness of the linked list and its backup, including:
  • performing error detection on each protected area and its backup in the linked list includes:
  • each protected area in any group in the linked list, together with its backups, is checked for errors in sequence; checking any one protected area and its backups includes:
  • the communication process is verified when the current protected area in the linked list and its backups are read: the data at each position is read at least three times in succession, and a CRC check code is calculated for each read; for each position, data whose CRC check code is the same in at least two reads is judged correct, and one of those correct reads is taken as the correct read data.
  • a verification method is used to verify the currently protected area and its backup in the linked list, including:
  • the verification method is described by taking SM3 signature information as an example:
  • the comparison of the judgment results of each verification method determines the correctness of the currently protected area and its backup in the linked list, including:
  • recovery of the exception includes:
  • the recovery of the abnormal basic data block further includes: re-reading the abnormal basic data block; calculating the CRC value of the re-read block and the CRC value of the corresponding correct basic data block; and comparing whether the two CRC values are equal. If they are, the abnormal basic data block has been restored.
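The re-read-and-compare step above can be sketched in C as follows. This is a minimal illustration, not the patent's implementation: the function names `crc32_calc` and `recover_block` are hypothetical, the CRC-32 variant (reflected polynomial 0xEDB88320) is an assumption because the text does not name one, and the write-back via `memcpy` stands in for whatever write path the FPGA module actually uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320).
   Assumption: the source only says "CRC" and does not fix a variant. */
static uint32_t crc32_calc(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Overwrite an abnormal basic data block with the correct copy, then
   re-read it and confirm the repair by comparing CRC values.
   Returns 1 if the block is restored, 0 otherwise. */
static int recover_block(uint8_t *abnormal, const uint8_t *correct, size_t len) {
    memcpy(abnormal, correct, len);                 /* assumed write-back step */
    uint32_t crc_reread  = crc32_calc(abnormal, len);
    uint32_t crc_correct = crc32_calc(correct, len);
    return crc_reread == crc_correct;
}
```

A failed comparison would indicate that the write-back itself did not take effect, which the caller could treat as a hardware fault.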
  • the present invention also provides an online parallel processing soft error real-time error detection and recovery system, including a linked list management module and an error detection recovery module, wherein:
  • the linked list management module is used to divide the protected RAM space into multiple protected areas; all protected areas are divided into one or more levels, where the highest level completes one error detection and recovery pass in every interrupt cycle and the other levels complete one pass over multiple interrupt cycles; the protected areas of each level are registered to generate one linked list per level, located in the protected RAM space; each linked list and the protected areas in it are backed up in at least two copies to other RAM space; the content of a linked list includes the position, length, and backup positions of each protected area of the same level;
  • the error detection and recovery module is used to perform, in parallel, error detection and recovery of the linked list of each level and of each protected area in it; for the linked list of any level and each protected area in it, the process is as follows:
  • Error detection is performed on each protected area and its backup in the linked list, and its abnormality is recovered.
  • the present invention can detect and recover online from simultaneous changes of multiple or even all bits in a designated storage area, solves the problem that hardware ECC cannot detect and correct abnormal displacement of multiple bits, is more powerful, and improves the robustness and stability of system program operation.
  • the present invention can detect and recover from abnormal displacement of RAM without relying on the hardware ECC function of the CPU, and provides a feasible way to detect abnormal RAM displacement on processor systems that lack hardware ECC.
  • the abnormality detection and recovery function of the present invention can locate erroneous data to a specific area and can complete recovery within one interrupt cycle, which improves the real-time performance of the system in handling RAM abnormalities and reduces their effect on the system.
  • the present invention assists in completing the processor RAM verification and correction and recovery through the FPGA parallel processing module, without occupying the processor's CPU time, and realizes parallel detection, parallel correction and recovery of RAM abnormalities.
  • Fig. 1 is the system schematic diagram of the present invention
  • Fig. 2 is the schematic diagram of linked list data of the present invention.
  • Fig. 3 is the content schematic diagram of linked list of the present invention.
  • Fig. 4 is the schematic diagram of FPGA parallel processing module checking and adjudication process of the present invention.
  • Fig. 5 is the FPGA parallel processing module adjudication function schematic diagram of the present invention.
  • FIG. 6 is a schematic diagram of data block division and check code calculation in the FPGA parallel processing module positioning function of the present invention.
  • FIG. 7 is a schematic diagram of the abnormal data position positioning function in the FPGA parallel processing module positioning function of the present invention.
  • FIG. 8 is a schematic diagram of the data recovery function of the FPGA parallel processing module of the present invention.
  • FIG. 9 is a time sequence diagram of the verification and recovery of the FPGA parallel processing module of the present invention.
  • a kind of online parallel processing soft error real-time error detection and recovery method of the present invention comprises the following process:
  • Error detection is performed on each protected area and its A and B backups in the linked list, and its abnormality is recovered.
  • the present invention detects and recovers from abnormalities in the system RAM; it can locate the position of erroneous data and complete recovery within one interrupt cycle, which improves the real-time performance of the system in handling RAM abnormalities and reduces their influence on the system.
  • the FPGA parallel processing module in the system needs to access the RAM of the CPU system through a high-speed interface, such as PCIe, SRIO, or HyperLink.
  • the FPGA parallel processing module needs to independently control the memory storage space for backing up data, such as a separate DDR, SRAM or internal RAM of the FPGA parallel processing module. If the RAM space of the CPU processor is very sufficient, and the high-speed interface bandwidth is sufficient, a separate space can also be divided into the CPU processor RAM for the FPGA parallel processing module to store backup data.
  • the parallel processing can be implemented by designing a dedicated FPGA parallel processing circuit module, by using an FPGA parallel processing module already present in the CPU control system, or by using an SoC that integrates an FPGA parallel processing circuit.
  • the present invention provides a verification mechanism at the communication level; at the same time, the accuracy of the verification algorithm itself must be considered, so the present invention adopts three verification methods based on different principles (SM3 signature information, MD5 digest information, and BCC check code) to prevent homologous errors from occurring.
  • a real-time error detection and recovery method for online parallel processing of soft errors requires that the CPU and the FPGA parallel processing module in the control system be connected through a high-speed interface (such as PCIe, SRIO, or HyperLink; PCIe is used in the following description), and that the FPGA parallel processing module control an independent RAM space for backing up programs or data. The method includes the following processes:
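The value of combining checks built on different principles can be illustrated with a small C sketch. It assumes that "BCC" denotes the common byte-wise XOR block check character (the text does not define it) and uses CRC-32 as a stand-in for a stronger check such as SM3 or MD5; `bcc_calc` and `crc32_calc` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Block check character, assumed here to be the XOR of all bytes. */
static uint8_t bcc_calc(const uint8_t *data, size_t len) {
    uint8_t x = 0;
    for (size_t i = 0; i < len; i++)
        x ^= data[i];
    return x;
}

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), used only as a
   stand-in for a check based on a different principle than XOR. */
static uint32_t crc32_calc(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

If bit 0 flips in two different bytes of a block, the two flips cancel in the XOR and the BCC is unchanged, while the CRC-32 of the block changes. A single check method can therefore have a blind spot that a second method covers.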
  • the protected RAM space is divided into multiple protected areas.
  • the mapping file refers to the description file that specifies the address of each program segment in the program compilation.
  • the contents of this file include: the memory address and length in the processor system; the name, length, and memory address of each memory area.
  • the FPGA parallel processing module directly reads and writes to the protected CPU system RAM space through a high-speed bus (such as PCIe); and controls the independent RAM space for content backup of the protected area, which can only be read and written by the FPGA parallel processing module alone.
  • each protected area is managed hierarchically: according to the importance of the protected area, it is divided into multiple levels, LV1 is the highest level, and the frequency of error detection and recovery is the highest.
  • Each interrupt cycle completes one error detection and recovery function; the frequency of error detection and recovery at other levels decreases sequentially.
  • if the protected area of the system is small and the FPGA can complete error detection and recovery within one interrupt cycle, there can be only the LV1 level; if the protected area is not very important to the system and it is tolerable for one error detection and recovery pass to take multiple interrupts, it is also possible to have only one lower LV level.
  • the following contents in this embodiment are stated according to two levels (LV1/LV2).
  • the protected area is divided into two levels: level 1 (LV1) and level 2 (LV2), of which the level of LV1 is the highest.
  • For the LV1-level protected area, each interrupt beat completes checking and error correction of the content of the entire area; the LV2-level protected area is divided into multiple groups, each interrupt beat completes checking and error correction of one group, and the whole level is covered over multiple interrupts.
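The two-level schedule described above can be sketched as a small C state machine. The group count `LV2_GROUPS`, the type `scrub_sched_t`, and the function `on_interrupt` are hypothetical, and the counters merely stand in for the actual check-and-recover work.

```c
#include <stddef.h>

#define LV2_GROUPS 4  /* hypothetical number of LV2 groups */

typedef struct {
    size_t   next_lv2_group;               /* group to check on the next beat */
    unsigned lv1_passes;                   /* full LV1 passes completed       */
    unsigned lv2_group_checks[LV2_GROUPS]; /* checks completed per LV2 group  */
} scrub_sched_t;

/* Called once per interrupt beat. LV1 is checked in full on every beat;
   exactly one LV2 group is checked, in round-robin order.
   Returns the index of the LV2 group handled on this beat. */
static size_t on_interrupt(scrub_sched_t *s) {
    s->lv1_passes++;                       /* placeholder for the LV1 pass */
    size_t g = s->next_lv2_group;
    s->lv2_group_checks[g]++;              /* placeholder for group g      */
    s->next_lv2_group = (g + 1) % LV2_GROUPS;
    return g;
}
```

With four groups, every LV2 area is revisited once every four beats, while LV1 is revisited on every beat.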
  • each protected area is registered to generate a two-level linked list.
  • the linked list is used by the FPGA to find the location and size information of the protected area.
  • S2: Register all protected areas in the protected RAM space into the level 1 (LV1) and level 2 (LV2) linked lists according to their importance; the FPGA parallel processing module backs up the linked lists (LV1/LV2) and the protected areas they point to into the independent RAM, keeping at least two backup copies, A and B.
  • the number of copies of backup data can be determined according to the actual situation, and at least two copies are backed up.
  • with two backup copies, the two-out-of-three rule is used to determine which data is correct and which is abnormal; alternatively, three copies can be backed up and the three-out-of-four or two-out-of-four rule used.
  • two backup copies A and B are taken as an example for detailed description.
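The k-out-of-n rules mentioned above (two out of three, three out of four, two out of four) can be sketched with one generic C helper. The function name `vote_k_of_n` is hypothetical, and each copy is represented by a 32-bit check value purely for illustration.

```c
#include <stdint.h>

/* Returns the index of the first copy whose check value is shared by at
   least k of the n copies (counting itself), or -1 if no value reaches
   k votes. */
static int vote_k_of_n(const uint32_t *check_values, int n, int k) {
    for (int i = 0; i < n; i++) {
        int same = 0;
        for (int j = 0; j < n; j++)
            if (check_values[j] == check_values[i])
                same++;
        if (same >= k)
            return i;
    }
    return -1;
}
```

Note that a two-out-of-four rule can be ambiguous when the copies split into two matching pairs; this sketch simply returns the first pair it finds, which a real design would need to handle explicitly.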
  • after system initialization is completed, the initial backup function module of the FPGA initializes, in the independent RAM, the A and B backup areas of the linked lists and of the protected areas, according to the LV1/LV2 linked lists.
  • the contents of the linked list at the three positions are the same, and include the starting address, length, and A/B backup positions of each protected area.
  • the protected areas pointed to by the LV1 and LV2 linked lists are registered in segments of at most 1K each; the LV1 and LV2 linked lists are stored as arrays of structures, as shown in Figure 2. Each item in a linked list is a BD that records one protected area, and its data format is:
  • UINT32 back_addr_A; /* Offset of backup A within the independent RAM controlled by the FPGA parallel processing module */
  • UINT32 back_addr_B; /* Offset of backup B within the independent RAM controlled by the FPGA parallel processing module */
  • for brevity, "the protected area in the linked list" refers to the protected area pointed to by the address in a BD block of the linked list.
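A possible complete BD layout and registration routine is sketched below in C. Only `back_addr_A` and `back_addr_B` come from the text; the field names `addr` and `len`, the type `bd_t`, the function `register_region`, the reading of "1K" as 1024 bytes, and the contiguous backup layout are all assumptions made for illustration.

```c
#include <stdint.h>

#define BD_MAX_LEN 1024u  /* assumed meaning of "at most 1K per segment" */

/* One BD entry describing one protected area. */
typedef struct {
    uint32_t addr;        /* assumed name: start address in protected RAM    */
    uint32_t len;         /* assumed name: length in bytes, <= BD_MAX_LEN    */
    uint32_t back_addr_A; /* offset of backup A in the FPGA-controlled RAM   */
    uint32_t back_addr_B; /* offset of backup B in the FPGA-controlled RAM   */
} bd_t;

/* Split a region into BD entries of at most BD_MAX_LEN bytes each.
   Returns the number of entries written, or -1 if the table is too small. */
static int register_region(bd_t *table, int capacity,
                           uint32_t start, uint32_t total,
                           uint32_t back_a_base, uint32_t back_b_base) {
    int n = 0;
    uint32_t off = 0;
    while (off < total) {
        if (n >= capacity)
            return -1;
        uint32_t chunk = total - off;
        if (chunk > BD_MAX_LEN)
            chunk = BD_MAX_LEN;
        table[n].addr        = start + off;
        table[n].len         = chunk;
        table[n].back_addr_A = back_a_base + off;
        table[n].back_addr_B = back_b_base + off;
        off += chunk;
        n++;
    }
    return n;
}
```

Storing the entries in a fixed array, as the text describes, lets the FPGA walk the list with simple index arithmetic instead of following pointers.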
  • the interrupt period varies between 500 and 1000 µs. Completing abnormal displacement detection and recovery in every interrupt improves the response performance of detection and recovery, and improves the ability of power protection equipment to resist abnormal behavior (misoperation, refusal to operate).
  • This detection process first checks the LV1/LV2 linked list areas and recovers any abnormality in them to ensure that the linked list information is correct, so that the corresponding protected areas can subsequently be detected and recovered according to the information recorded in the LV1/LV2 linked lists. Error detection and recovery at the LV1 and LV2 levels have no ordering requirement relative to each other.
  • within a single level (LV1 or LV2), however, error detection and recovery of the linked list and of its corresponding protected areas must be performed serially.
  • the embodiment of the present invention first describes the error detection and recovery of the LV1 and LV2 linked lists, and then describes the error detection and recovery of the protected areas in the LV1 and LV2 linked lists.
  • S4-S8 steps are the specific process of judging the correctness of the LV1 linked list and its two backups A and B, as shown in Figure 5, including:
  • the FPGA parallel processing module reads the LV1 linked list and its two backups A and B according to the interrupt rhythm.
  • the linked list at each position is read three times in a row, and a CRC check code is calculated for each read; for each position, one of the three reads is selected as the correct read data according to the two-out-of-three rule;
  • the three position data are all checked during the communication process.
  • the check method is as follows: a CRC check is performed on each of the three linked list reads from each position. If all three CRC check codes of a position differ from one another, the read function or the system hardware is abnormal; the device must be locked and the system restarted. If at least two CRC check codes are the same, those reads are correct, and either of them is taken as the correct read data.
  • the purpose of communication process verification is to prevent misjudgment caused by errors in the communication process, which is an optional function. If the system does not consider the error of the communication interface, this verification process can be removed, that is, read the data once as the correct data, and make subsequent judgments.
  • the number of readings can be three or more times, and when the number of readings is four, the rule of taking two out of four or three out of four can be used to determine the correct read data.
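The three-read communication check can be sketched in C as follows. The names `crc32_calc` and `pick_correct_read` are hypothetical, and the CRC-32 variant is an assumption since the text only says "CRC check code".

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); variant assumed. */
static uint32_t crc32_calc(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Given three consecutive reads of the same position, return the index
   (0..2) of a read whose CRC matches at least one other read, or -1 if
   all three CRCs differ (read function or hardware fault). */
static int pick_correct_read(const uint8_t *r0, const uint8_t *r1,
                             const uint8_t *r2, size_t len) {
    uint32_t c0 = crc32_calc(r0, len);
    uint32_t c1 = crc32_calc(r1, len);
    uint32_t c2 = crc32_calc(r2, len);
    if (c0 == c1 || c0 == c2) return 0;
    if (c1 == c2) return 1;
    return -1;
}
```

A return value of -1 corresponds to the case in which the text calls for locking the device and restarting the system.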
  • the FPGA parallel processing module calculates the SM3 signature information of the correctly read data of the LV1 linked list and its two backups A and B, and compares them. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the SM3 signature information of the three positions is the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the other is abnormal; if all three differ, the judgment is invalid;
  • the FPGA parallel processing module calculates the MD5 information digest of the correctly read data of the LV1 linked list and its two backups A and B, and compares them. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the MD5 information digests of the three positions are the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the other is abnormal; if all three differ, the judgment is invalid;
  • the FPGA parallel processing module calculates the BCC check code of the correctly read data of the LV1 linked list and its two backups A and B, and compares them. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the BCC check codes of the three positions are the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the other is abnormal; if all three differ, the judgment is invalid;
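The per-method comparison described in the three steps above follows the same pattern whatever check is used, so it can be sketched once in C. The enum `vote_t` and the function `vote_2oo3` are hypothetical names; the check values are treated as opaque byte strings of equal length (32 bytes for SM3, 16 for MD5, 1 for a byte-wise BCC), and computing them is outside this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    VOTE_ALL_SAME,  /* all three positions agree: no abnormality   */
    VOTE_ORIG_BAD,  /* A and B agree: the original is abnormal     */
    VOTE_A_BAD,     /* original and B agree: backup A is abnormal  */
    VOTE_B_BAD,     /* original and A agree: backup B is abnormal  */
    VOTE_INVALID    /* all three differ: the judgment is invalid   */
} vote_t;

/* Two-out-of-three comparison of the check values of the original data
   and its backups A and B. */
static vote_t vote_2oo3(const uint8_t *d_orig, const uint8_t *d_a,
                        const uint8_t *d_b, size_t dlen) {
    int oa = memcmp(d_orig, d_a, dlen) == 0;
    int ob = memcmp(d_orig, d_b, dlen) == 0;
    int ab = memcmp(d_a, d_b, dlen) == 0;
    if (oa && ob) return VOTE_ALL_SAME;
    if (ab) return VOTE_ORIG_BAD;
    if (ob) return VOTE_A_BAD;
    if (oa) return VOTE_B_BAD;
    return VOTE_INVALID;
}
```

Because the result identifies which position is abnormal, it also tells the recovery step which copy to overwrite and which copy to use as the source.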
  • in the embodiment of the present invention, three verification methods (SM3 signature information, MD5 digest information, and BCC check) are used for three verifications (one per method) for the purpose of description, but those skilled in the art will understand that the verification methods are not limited to these three, and other common check methods, such as CRC32 and CRC64, can also be selected.
  • the verification methods and the number of verifications can be combined arbitrarily; for example, two SM3 signature verifications and two CRC32 verifications can be used. Using different verification methods prevents a final misjudgment caused by a weakness in the principle of any one method. In application, the methods and number of verifications can be selected and combined according to the system's tolerance for misjudgment.
  • the three verification processes are not related and can be performed in parallel.
  • the FPGA parallel processing module compares the judgment results of the three verification methods (SM3 signature information, MD5 information digest, and BCC check code) for the LV1 linked list and makes a ruling according to the two-out-of-three rule to determine the correctness of the LV1 linked list and its two backups A and B: if the judgment results of two or more verification methods are consistent, the consistent result is taken as the final judgment result; otherwise, the judgment is invalid.
  • the abnormal data here may be the LV1 linked list, backup A, or backup B: the backup area is also RAM, so it too may be corrupted and need to be restored.
  • S9: Following steps S4 to S8, check the correctness of the LV2 linked list and its two backups A and B. If any of the data at the three locations is abnormal, restore it.
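The final ruling across the three verification methods can be sketched in C as a second two-out-of-three vote, this time over the per-method judgments rather than over the data. The enum `verdict_t` and the function `adjudicate` are hypothetical names.

```c
/* Judgment produced by one verification method for one data set. */
typedef enum {
    V_ALL_OK,    /* original and both backups agree */
    V_ORIG_BAD,  /* the original is abnormal        */
    V_A_BAD,     /* backup A is abnormal            */
    V_B_BAD,     /* backup B is abnormal            */
    V_INVALID    /* the method could not decide     */
} verdict_t;

/* If at least two of the three methods reach the same judgment, that
   judgment is final; otherwise the ruling is invalid. */
static verdict_t adjudicate(verdict_t sm3, verdict_t md5, verdict_t bcc) {
    if (sm3 == md5 || sm3 == bcc) return sm3;
    if (md5 == bcc) return md5;
    return V_INVALID;
}
```

For example, if SM3 and MD5 both report backup A as abnormal while BCC reports no abnormality, the ruling is that backup A is abnormal.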
  • the above processing process ensures the correctness of the LV1/LV2 linked list and its two backups A and B. Subsequently, according to the correct LV1/LV2 linked list record information, the corresponding protected area is detected and restored.
  • S10-S15 steps are the specific process of judging the correctness of each protected area and its two backups A and B in the LV1 linked list, as shown in Figure 4, including:
  • the FPGA parallel processing module reads the RAM space data corresponding to the first protected area in the LV1 linked list and its two backup data A and B.
  • the data at each location is read three consecutive times, and a CRC check code is calculated for each read; for each location, one of the three reads is selected as the correctly read data according to the two-out-of-three rule;
  • the data of the three locations is checked during the communication process.
  • the check method is as follows: the three reads of each location are verified by CRC. If the three CRC check codes of a location all differ, the read function or the system hardware is abnormal, so the device must be locked and the system restarted. If at least two CRC check codes are the same, those reads are correct, and either of them is taken as the correctly read data.
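The communication-process check described above can be illustrated with a short Python sketch. It assumes CRC32 from the standard `zlib` module as the CRC (the source does not specify the polynomial), and `read_once` is a hypothetical callable standing in for one read over the high-speed interface.

```python
import zlib

def read_with_check(read_once):
    """Read one location three times and return a read confirmed by a matching CRC.

    read_once: hypothetical zero-argument callable returning the bytes read.
    Raises RuntimeError when all three CRC values differ.
    """
    reads = [read_once() for _ in range(3)]
    crcs = [zlib.crc32(r) for r in reads]
    for i in range(3):
        for j in range(i + 1, 3):
            if crcs[i] == crcs[j]:
                # Two reads agree: either one is taken as the correct data.
                return reads[i]
    # No two reads agree: read path or hardware fault.
    raise RuntimeError("all three reads differ: lock the device and restart")

# Example: the second read is disturbed, the first and third agree.
samples = iter([b"\x12\x34", b"\x12\x30", b"\x12\x34"])
data = read_with_check(lambda: next(samples))
```

In this model the exception stands for the "lock the device and restart" branch; a real system would take that action in hardware or firmware.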
  • the FPGA parallel processing module calculates the SM3 signature information of the correctly read data of the RAM space corresponding to the first protected area in the LV1 linked list and of its two backups A and B, compares them, and judges according to the two-out-of-three rule whether any of the three locations has suffered an abnormal bit flip: if the SM3 signature information of all three locations is the same, there is no abnormality, that is, the data at the three locations is identical; if two are the same, those two are normal and the third is abnormal; if all three are different, the judgment is invalid;
  • the FPGA parallel processing module calculates the MD5 message digest of the correctly read data of the RAM space corresponding to the first protected area in the LV1 linked list and of its two backups A and B, compares them, and judges according to the two-out-of-three rule whether any of the three locations has suffered an abnormal bit flip: if the MD5 message digests of all three locations are the same, there is no abnormality, that is, the data at the three locations is identical; if two are the same, those two are normal and the third is abnormal; if all three are different, the judgment is invalid;
  • the FPGA parallel processing module calculates the BCC check code of the correctly read data of the RAM space corresponding to the first protected area in the LV1 linked list and of its two backups A and B, compares them, and judges according to the two-out-of-three rule whether any of the three locations has suffered an abnormal bit flip: if the BCC check codes of all three locations are the same, there is no abnormality, that is, the data at the three locations is identical; if two are the same, those two are normal and the third is abnormal; if all three are different, the judgment is invalid;
  • the FPGA parallel processing module compares the judgment results of the three verification methods (SM3 signature information, MD5 message digest, and BCC check code), adjudicates according to the two-out-of-three rule, and determines the correctness of the RAM space corresponding to the first protected area in the LV1 linked list and of its two backups A and B: if the judgment results of two or more methods are consistent, the consistent result is taken as the final judgment result; otherwise, the judgment is invalid.
  • if the RAM space data corresponding to the first protected area in the LV1 linked list and its two backups A and B are all correct, the detection of the first protected area in the LV1 linked list is finished;
  • the abnormal data may be the RAM space data, backup A, or backup B; the backup area is also RAM, so it too may be corrupted and need to be restored.
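The per-method comparison of the three locations can be sketched in Python as below. Several points are assumptions for illustration only: BCC is modeled as a byte-wise XOR checksum, SHA-256 is used as a stand-in for SM3 because the Python standard library does not guarantee an SM3 implementation, and the names `bcc`, `judge_locations`, and `judge_with` are hypothetical.

```python
import hashlib
from functools import reduce

def bcc(data: bytes) -> int:
    """Block check character modeled as the XOR of all bytes (assumed definition)."""
    return reduce(lambda acc, b: acc ^ b, data, 0)

def judge_locations(digests):
    """digests = [ram, backup_a, backup_b] as computed by one verification method.

    Returns -1 if all three agree, the index (0, 1, 2) of the odd one out
    if exactly two agree, or None if all three differ (invalid judgment).
    """
    ram, a, b = digests
    if ram == a == b:
        return -1
    if a == b:
        return 0
    if ram == b:
        return 1
    if ram == a:
        return 2
    return None

def judge_with(method, ram, backup_a, backup_b):
    """Apply one verification method to the three locations and judge them."""
    return judge_locations([method(x) for x in (ram, backup_a, backup_b)])

md5 = lambda d: hashlib.md5(d).digest()
sm3_stand_in = lambda d: hashlib.sha256(d).digest()  # NOT SM3; placeholder only

good = b"\x01\x02\x03"
flipped = b"\x01\x02\x07"  # one bit changed in backup A
verdicts = [judge_with(m, good, flipped, good) for m in (sm3_stand_in, md5, bcc)]
```

Here every method reports index 1 (backup A) as abnormal, so a subsequent two-out-of-three adjudication over `verdicts` would confirm backup A as the location to restore.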
  • the FPGA parallel processing module repeats the steps S10 to S14 to judge all the protected area data in the LV1 linked list, and if there is an abnormality, restore the abnormality. At this point, the data detection of all protected areas in the LV1 linked list is completed.
  • Detection of the protected areas in the LV2 linked list is completed over multiple interrupts.
  • the protected areas of the LV2 linked list are divided into multiple groups, and each interrupt beat processes one group (multiple BDs), so the group position must be updated after each interrupt's LV2 detection is completed.
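A minimal Python sketch of this group-by-group scheduling is given below. The class name `Lv2Scheduler`, the fixed `group_size`, and the wrap-around to the first group after the last one are assumptions made for the example; the source only states that one group is handled per interrupt and that the group position is then updated.

```python
class Lv2Scheduler:
    """Spreads LV2 protected-area checks over successive interrupts."""

    def __init__(self, areas, group_size):
        # Split the registered areas into consecutive groups.
        self.groups = [areas[i:i + group_size]
                       for i in range(0, len(areas), group_size)]
        self.position = 0  # index of the current detection group

    def on_interrupt(self, check_area):
        """Check every area of the current group, then advance the position."""
        for area in self.groups[self.position]:
            check_area(area)
        # Assumed behavior: start again from the first group after the last.
        self.position = (self.position + 1) % len(self.groups)

checked = []
sched = Lv2Scheduler(["bd0", "bd1", "bd2", "bd3", "bd4"], group_size=2)
for _ in range(3):          # three interrupts cover all five areas
    sched.on_interrupt(checked.append)
```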
  • Steps S16 to S22 are the specific process of judging the correctness of each protected area in the current detection group of the LV2 linked list and its two backups A and B, as shown in Figure 4, including:
  • the FPGA parallel processing module reads the RAM space data corresponding to the first protected area in the current detection group of the LV2 linked list and its two backups A and B. The data at each location is read three consecutive times, and a CRC check code is calculated for each read; for each location, one of the three reads is selected as the correctly read data according to the two-out-of-three rule;
  • the content of the LV2 linked list is divided into multiple groups.
  • the detection starts from the first group, and one group is detected each time.
  • the currently detected group is called the current detection group.
  • the FPGA parallel processing module calculates the SM3 signature information of the correctly read data of the RAM space corresponding to the first protected area in the current detection group of the LV2 linked list and of its two backups A and B, compares them, and judges according to the two-out-of-three rule whether any of the three locations has suffered an abnormal bit flip: if the SM3 signature information of all three locations is the same, there is no abnormality, that is, the data at the three locations is identical; if two are the same, those two are normal and the third is abnormal; if all three are different, the judgment is invalid;
  • the FPGA parallel processing module calculates the MD5 message digest of the correctly read data of the RAM space corresponding to the first protected area in the current detection group of the LV2 linked list and of its two backups A and B, compares them, and judges according to the two-out-of-three rule whether any of the three locations has suffered an abnormal bit flip: if the MD5 message digests of all three locations are the same, there is no abnormality, that is, the data at the three locations is identical; if two are the same, those two are normal and the third is abnormal; if all three are different, the judgment is invalid;
  • the FPGA parallel processing module calculates the BCC check code of the correctly read data of the RAM space corresponding to the first protected area in the current detection group of the LV2 linked list and of its two backups A and B, compares them, and judges according to the two-out-of-three rule whether any of the three locations has suffered an abnormal bit flip: if the BCC check codes of all three locations are the same, there is no abnormality, that is, the data at the three locations is identical; if two are the same, those two are normal and the third is abnormal; if all three are different, the judgment is invalid;
  • the FPGA parallel processing module compares the judgment results of the three verification methods (SM3 signature information, MD5 message digest, and BCC check code), adjudicates according to the two-out-of-three rule, and determines the correctness of the RAM space corresponding to the first protected area in the current detection group of the LV2 linked list and of its backups A and B: if the judgment results of two or more methods are consistent, the consistent result is taken as the final judgment result; otherwise, the judgment is invalid.
  • if the RAM space data corresponding to the first protected area in the current detection group of the LV2 linked list and its two backups A and B are all correct, the data detection of the first protected area in the current detection group of the LV2 linked list is complete;
  • the abnormal data here may be RAM space data, A backup or B backup.
  • S21 Repeat the steps from S16 to S20 to adjudicate all the protected area data in the current detection group in the LV2 linked list, and restore the abnormality if there is an abnormality. So far, the detection of all the protected area data in the current detection group in the LV2 linked list is completed;
  • the FPGA parallel processing module needs to restore the data.
  • the present invention adopts the method of dividing the data into several basic data blocks, and calculates the CRC value in multiple ways, and quickly locates the abnormal basic data block according to these CRC values.
  • the correctness of the communication process should also be considered. Therefore, the present invention reads the data back immediately after restoring the correct basic data block to the abnormal position and judges whether the data was written correctly. If the data read back is inconsistent with the written data, it is written again.
  • the location of an abnormal basic data block can be located first, and then the locations of multiple abnormal basic data blocks can be located in an iterative manner. Therefore, the present invention solves the problem of multi-position anomalies through an iterative manner.
  • the premise of the method is that the FPGA parallel processing module finally decides that two pieces of data are normal and one is abnormal, and the abnormal data can be recovered at this time.
  • Select one of the normal copies as the normal data area and the abnormal copy as the abnormal data area.
  • the abnormal data area may be a protected RAM area or an independent RAM area.
  • S1: The abnormal data area and the normal data area are each divided, in order, into N blocks of the minimum data length, referred to as basic data blocks in the following description.
  • C_en is calculated as follows: starting from the first basic data block, take P_n basic data blocks, then skip P_n basic data blocks and take the next P_n, and so on; all the basic data blocks taken are used as the data source for calculating the CRC value.
  • C_e0, C_e1, ..., C_em in this step can be calculated in parallel or serially.
  • Taking 1 basic data block and then skipping 1 basic data block selects B1, B3, B5, ...
  • Taking 2 basic data blocks and then skipping 2 basic data blocks selects B1, B2, B5, B6, ..., B13, B14 as the data source for the calculation.
  • C_e3 is calculated by taking 8 basic data blocks starting from B1 and then 8 more after skipping every 8 basic data blocks; that is, B1, B2, ..., B8 are the data source for the calculation.
  • C_cn is calculated as follows: starting from the first basic data block, take P_n basic data blocks, then skip P_n basic data blocks and take the next P_n, and so on; all the basic data blocks taken are used as the data source for calculating the CRC value.
  • C_c0, C_c1, ..., C_cm in this step can be calculated in parallel or serially.
  • Taking 1 basic data block and then skipping 1 basic data block selects B1, B3, B5, ...
  • Taking 2 basic data blocks and then skipping 2 basic data blocks selects B1, B2, B5, B6, ..., B13, B14 as the data source for the calculation.
  • C_c3 is calculated by taking 8 basic data blocks starting from B1 and then 8 more after skipping every 8 basic data blocks; that is, B1, B2, ..., B8 are the data source for the calculation.
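The block-selection pattern behind C_en and C_cn can be sketched in Python as follows. The sketch assumes P_n = 2^n, which is inferred from the examples above (1 block for n = 0, 2 blocks for n = 1, 8 blocks for n = 3) rather than stated in this excerpt, and it uses CRC32 from `zlib` as the CRC. The function names are hypothetical.

```python
import zlib

def select_blocks(n_blocks, n):
    """0-based indices used for level n: take P_n blocks, skip P_n, repeat.

    Assumes P_n = 2**n, as suggested by the worked examples.
    """
    p = 2 ** n
    return [i for i in range(n_blocks) if (i // p) % 2 == 0]

def strided_crc(blocks, n):
    """CRC over the concatenation of the blocks selected for level n."""
    crc = 0
    for i in select_blocks(len(blocks), n):
        crc = zlib.crc32(blocks[i], crc)  # running CRC across selected blocks
    return crc

# For 16 blocks, level 1 selects B1, B2, B5, B6, B9, B10, B13, B14 (1-based).
level1 = [i + 1 for i in select_blocks(16, 1)]
```

Because each level depends only on the input blocks, the values for n = 0 ... m can be computed independently, which matches the remark that they may be calculated in parallel or serially.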
  • the FPGA parallel processing module judges whether the two CRC values C_cm and C_em are equal: if they are equal, the abnormal data lies in the second half of the abnormal data area, and the abnormal data mark position is updated to point to the second half; if they are not equal, there must be an abnormality in the first half of the abnormal data area, and the abnormal data mark position is updated to point to the first half.
  • Step S4 is repeated to compare the values of C_cn and C_en, with n decreasing from m-1 to 0, and the abnormal data mark position is updated at each step.
  • the abnormal data mark position has now been narrowed down to the basic data block in the abnormal data area that contains the abnormal data, referred to as the abnormal basic data block; the corresponding basic data block in the normal data area, which holds the correct data, is referred to as the correct basic data block.
  • the judgment process is shown in FIG. 7.
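The halving search over the CRC pairs can be modeled in Python as shown below. This is a sketch under explicit assumptions: P_n = 2^n, the number of basic data blocks is a power of two, CRC32 stands in for the unspecified CRC, and exactly one basic data block is abnormal in a given pass (the patent handles several abnormal blocks by iterating). The names `level_crc` and `locate_abnormal_block` are hypothetical.

```python
import zlib

def level_crc(blocks, n):
    """CRC over blocks selected by 'take 2**n, skip 2**n' starting at block 0."""
    p = 2 ** n
    crc = 0
    for i, blk in enumerate(blocks):
        if (i // p) % 2 == 0:
            crc = zlib.crc32(blk, crc)
    return crc

def locate_abnormal_block(normal, abnormal):
    """Return the 0-based index of the single abnormal block.

    Requires len(normal) == len(abnormal) == 2**(m + 1).
    """
    n_blocks = len(normal)
    m = n_blocks.bit_length() - 2       # e.g. 16 blocks -> m = 3
    lo, size = 0, n_blocks              # current marked range [lo, lo + size)
    for n in range(m, -1, -1):
        size //= 2
        if level_crc(normal, n) == level_crc(abnormal, n):
            lo += size                  # covered halves match: go to second half
        # otherwise the mark stays on the first half of the current range
    return lo

good = [bytes([i]) * 4 for i in range(16)]
bad = list(good)
bad[9] = bytes([good[9][0] ^ 0x01]) + good[9][1:]   # flip one bit in B10
found = locate_abnormal_block(good, bad)
```

With 16 blocks the search needs only m + 1 = 4 comparisons, one per level, instead of comparing every block individually.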
  • the FPGA parallel processing module copies the content of the correct basic data block in the normal data area, through the high-speed interface, to the abnormal basic data block in the abnormal data area pointed to by the abnormal data mark position.
  • the FPGA parallel processing module re-reads the abnormal basic data block, calculates the CRC value of the re-read block, and calculates the CRC value of the corresponding correct basic data block. It then compares the two CRC values: if they are not equal, steps S6 and S7 are repeated to restore the block again (in this embodiment, at most 2 repetitions); if they are equal, the abnormal basic data block has been recovered.
  • the FPGA parallel processing module calculates the CRC values over all the basic data blocks in the normal data area and in the abnormal data area and compares the two values. If they are not equal, abnormal data remains, and steps S2 to S7 are repeated to locate and recover the other abnormal data; if they are equal, the abnormal data recovery ends.
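The write, read-back, and retry behavior of the restore step can be sketched in Python as follows. The `write` and `read_back` callables are hypothetical stand-ins for accesses over the high-speed interface, CRC32 is assumed as the CRC, and `FlakyCell` is an invented test double that simulates a failed first write.

```python
import zlib

def restore_block(write, read_back, correct, max_retries=2):
    """Write the correct block, read it back, and compare CRC values.

    Repeats the write at most max_retries more times (2 in the embodiment).
    Returns True once the read-back CRC matches, False otherwise.
    """
    for _ in range(1 + max_retries):
        write(correct)
        if zlib.crc32(read_back()) == zlib.crc32(correct):
            return True
    return False

class FlakyCell:
    """Hypothetical memory cell whose first `failures` writes are corrupted."""
    def __init__(self, failures):
        self.failures = failures
        self.data = b""
    def write(self, data):
        if self.failures > 0:
            self.failures -= 1
            self.data = b"\x00" * len(data)   # simulated bad write
        else:
            self.data = data
    def read(self):
        return self.data

cell = FlakyCell(failures=1)
ok = restore_block(cell.write, cell.read, b"\x12\x34")
```

After a successful restore, the overall CRC comparison of the two data areas decides whether another locate-and-restore pass is needed.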
  • the typical flow of the above process is shown in Figure 9.
  • the timing diagram of the protected-area error detection and recovery is shown in Figure 9.
  • once the error detection and recovery of the LV1 linked list is completed, the error detection and recovery of the protected areas corresponding to the LV1 linked list can be performed; the same is true for the LV2 linked list.
  • the LV1 linked list and the LV2 linked list have no sequential relationship and can be designed for parallel processing.
  • the modification should be carried out in the following order: first, the software stops the checking and adjudication functions and the data recovery function, then reads back the running status of the FPGA parallel processing module to confirm that these functions have stopped; next, the data is modified normally; finally, the checking and adjudication functions and the data recovery function are restarted.
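The stop, confirm, modify, restart order can be expressed as a small Python context manager. Everything here is an illustrative assumption: `FpgaChecker` is a hypothetical software model of the FPGA module's control and status interface, not an interface defined in the source.

```python
from contextlib import contextmanager

class FpgaChecker:
    """Hypothetical model of the FPGA module's run/stop control."""
    def __init__(self):
        self.running = True
    def stop(self):
        self.running = False
    def start(self):
        self.running = True
    def is_stopped(self):
        return not self.running

@contextmanager
def checking_paused(checker):
    """Stop checking and recovery, confirm the stop, and restart afterwards."""
    checker.stop()
    if not checker.is_stopped():          # read back the running status
        raise RuntimeError("FPGA checking did not stop")
    try:
        yield                             # protected data is modified here
    finally:
        checker.start()                   # restart checking and recovery

checker = FpgaChecker()
protected = bytearray(b"\x00\x00")
with checking_paused(checker):
    protected[0] = 0x5A                   # normal data modification
```

Using `finally` ensures that checking is restarted even if the modification raises an error, which is a design choice of this sketch rather than something the source prescribes.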
  • the meaning of online parallel processing in the present invention is as follows. Online means that error detection and recovery of the protected RAM are performed while the system functions are running normally.
  • the read verification of the LV1/LV2 linked lists and the corresponding protected RAM, the SM3 signature, MD5 digest, and BCC checks, the comprehensive adjudication, and the abnormal data recovery functions are performed in step with system interrupts to detect and recover from abnormal changes in RAM.
  • parallel in the present invention means, first, that the normal functions of the system and the error detection and recovery functions run in parallel; the latter are implemented in the FPGA and do not occupy processor time.
  • the error detection and recovery of LV1 and LV2 can be designed as parallel or serial according to the system design and FPGA resources.
  • real-time in the present invention means that the read verification of the LV1/LV2 linked lists and the corresponding protected RAM, the SM3 signature, MD5 digest, and BCC checks, the comprehensive adjudication, and the abnormal data recovery functions are performed in step with system interrupts, so that each interrupt completes the error detection and recovery of the key data area with high real-time performance.
  • the purpose of communication process verification is to prevent misjudgment caused by errors in the communication process, which is an optional function. If the system does not consider the error of the communication interface, this verification process can be removed, that is, read the data once as the correct data, and make subsequent judgments.
  • three verification methods (SM3 signature information, MD5 digest information, and BCC check) are used for three verifications, one per method, for the purpose of description; however, those skilled in the art will appreciate that the verification methods are not limited to these three, and other common check methods, such as CRC32 and CRC64, can also be selected.
  • the verification methods and the number of verifications can be combined arbitrarily. Using different verification methods prevents a final misjudgment caused by a weakness inherent in any single method. In application, the methods and the number of verifications can be selected and combined according to the system's tolerance for misjudgment.
  • the present invention also provides an online parallel processing soft error real-time error detection and recovery device, including a linked list management module and an error detection recovery module, wherein:
  • the linked list management module is used to divide the protected RAM space into multiple protected areas; to divide all protected areas into one or more levels, where the highest level completes one error detection and recovery pass in every interrupt cycle and the other levels complete one error detection and recovery pass over multiple interrupt cycles; to register the protected areas of each level so as to generate as many linked lists as there are levels, the linked lists being located in the protected RAM space; and to back up each linked list and the protected areas in it, at least two copies each, to other RAM space. The content of a linked list includes the position, the length, and the position of each backup of every protected area of the same level;
  • the error detection and recovery module is used to process in parallel the error detection and recovery of the linked list of each level and of each protected area in the linked list. The process of error detection and recovery for the linked list of any level and each protected area in it is as follows:
  • Error detection is performed on each protected area and its backup in the linked list, and its abnormality is recovered.
  • the device of this embodiment can detect and recover errors in the RAM space, can locate and correct the erroneous data, and can complete the recovery within one interrupt cycle, which improves the real-time performance of the system in handling RAM exceptions and reduces their impact on the system.
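The linked-list content described for the linked list management module (position, length, and backup positions of each protected area) can be pictured with the following Python sketch. The class and function names, the field layout, and the example addresses are all hypothetical; the real structure resides in protected RAM and is handled by the FPGA.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple

@dataclass
class AreaNode:
    """One linked-list entry describing a protected area of a given level."""
    position: int                       # start address of the protected area
    length: int                         # size of the area in bytes
    backup_positions: Tuple[int, ...]   # start addresses of the backups
    next: Optional["AreaNode"] = None

    def __post_init__(self):
        # The method requires at least two backups per protected area.
        if len(self.backup_positions) < 2:
            raise ValueError("each protected area needs at least two backups")

def build_list(entries) -> Optional[AreaNode]:
    """Register areas given as a list of (position, length, backups) tuples."""
    head = None
    for position, length, backups in reversed(entries):
        head = AreaNode(position, length, tuple(backups), head)
    return head

def walk(head: Optional[AreaNode]) -> Iterator[AreaNode]:
    """Iterate over the registered areas in order."""
    while head is not None:
        yield head
        head = head.next

# Example LV1 list with two areas, each backed up at locations A and B.
lv1 = build_list([(0x1000, 256, (0x8000, 0x9000)),
                  (0x1100, 64, (0x8100, 0x9100))])
positions = [node.position for node in walk(lv1)]
```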
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Abstract

An online parallel processing soft error real-time error detection and recovery method and system. The method comprises: dividing a protected RAM space into multiple protected areas; dividing all of the protected areas into one or more levels, where the highest level completes an error detection and recovery function in each interrupt cycle and the other levels complete an error detection and recovery function once in multiple interrupt cycles; registering the protected areas of each level to generate a linked list corresponding to the number of levels, and backing up at least two copies of each linked list and the protected areas in the linked list to other RAM space; and processing in parallel the error detection and recovery of each level of linked list and each protected area in the linked list. In key scenarios, the described solution can verify, adjudicate, correct, and recover high-importance data in a control system within a single interrupt beat. At the same time, it does not rely on the CPU processor itself, can process in parallel in real time, and can provide online real-time error detection and correction when multiple locations in the CPU processor and RAM suffer abnormal bit flips at the same time.

Description

An online parallel processing soft error real-time error detection and recovery method and system
Technical Field
The invention belongs to the technical field of error detection for abnormal bit flips in RAM memory, and in particular relates to an online parallel processing soft error real-time error detection and recovery method and system.
Background
As microprocessor technology moves toward low power consumption, low voltage, and high integration, the impact of abnormal bit flips in RAM (Random Access Memory), also called soft errors or single event effects, on system safety and stability cannot be ignored. The main causes of RAM abnormalities are: (1) Alpha particle radiation. Alpha particles emitted by the packaging materials used in processors can cause abnormal bit flips in the chip's memory area. (2) The increasing scale of integrated circuits. Transistors are becoming smaller and their frequencies higher, but their limit voltages are becoming lower and their noise tolerance narrower, which makes processors more sensitive to crosstalk, voltage disturbance, and electromagnetic radiation and reduces reliability. (3) Cosmic radiation. Before reaching the earth's surface, high-energy charged particles from space undergo cascade interactions with elements of the earth's atmosphere (mainly oxygen and nitrogen) and produce a large number of secondary neutrons. These secondary neutrons strike integrated circuits and generate a large number of electron-hole pairs through direct ionization or through indirect ionization by nuclear reactions. When the charge collected by a sensitive electrode of a microelectronic device exceeds the critical charge of the circuit, a single event effect occurs and flips a memory bit, which in turn causes the program to malfunction.
There are three main types of abnormal bit flips in RAM memory. The first is a single-bit flip: a single data bit in memory is inverted, which corrupts the data; this type of error is common in static random access memory, dynamic random access memory, and non-volatile memory. The second is a single-bit transient flip: a disturbance of a single data bit caused by voltage or current jitter is sampled and recorded by the circuit and then recovers; this type mainly occurs in dynamic random access memory. The third is a multi-bit flip: multiple data bits in memory change at the same time. If an abnormal bit flip occurs in a key location of the RAM memory (for example, a code area or a key data area), it may cause a program execution error and, in turn, a system abnormality. If abnormal bit flips can be detected and the data recovered before a problem occurs, the stability and reliability of the system will be greatly improved. As microprocessor technology develops and its range of applications expands, the harm caused by abnormal bit flips in RAM memory grows and its scope widens. How to deal with abnormal RAM bit flips, improve system stability and reliability, and ensure safe system operation is therefore attracting increasing attention across industries.
At present, various hardware and software approaches have been devised for detecting and correcting processor RAM storage anomalies (soft errors, or single event effects). Hardware approaches include hardening at the process and device level, where process hardening typically covers capacitance hardening, resistance hardening, structure, layout, and doping hardening, latch circuits (DICE), and so on; hardening at the component level typically includes parity checking, error correction codes (ECC), and interleaving.
CPU-class devices support ECC (Error Correcting Code) checking of RAM in hardware: in TI's new processors and XILINX's new MPSoC processors, the hardware DDR controller, the on-chip RAM, and the L1/L2 caches all support ECC checking. The ECC function adds an 8-bit Hamming code to every 32 or 64 bits, which corrects single-bit flips and raises an alarm on two-bit flips, but it can neither correct two-bit flips nor alarm on or correct simultaneous flips of more bits.
The above are mostly single-level soft error protections; schemes for monitoring and detection through software have also been proposed. For example, the patent "An ADI DSP code online monitoring method" (Publication No. CN105446842A, 2016-03-30) adds code monitoring flags to the code segments to be monitored in the DSP's loadable file, based on the DSP's link map file. At run time the DSP continuously reads the LDR in non-volatile memory; when it finds a code monitoring flag, it compares the code in non-volatile memory with the running code in RAM. If the two are inconsistent, the check is repeated; if the inconsistency is confirmed, the running code in RAM is judged to be erroneous, an alarm is raised immediately, and the error information is recorded in non-volatile memory for failure analysis. This method enables online monitoring of running code by simply modifying the LDR file. However, it cannot correct errors online, its real-time response is poor, the monitoring function consumes CPU processing time, the error detection and correction program depends on the CPU itself, it cannot locate the erroneous data, and it has no online recovery function.
For implementing system-level soft error protection in a single-CPU system, the prior art mainly has the following shortcomings and deficiencies:
(1) Because of hardware resource consumption, current hardware ECC checking only corrects single-bit flips and detects two-bit flips; it cannot raise an alarm when more bits flip simultaneously, and its correction capability is very limited;
(2) Hardware ECC is available only in newer processors. Not all processors provide ECC checking, so it cannot be applied to all processor types, and its functions and performance vary, so it cannot guarantee true system reliability;
(3) The CPU software approach only implements monitoring and alarm functions for the online program and offers no systematic RAM data monitoring and recovery scheme; in particular, its real-time performance is insufficient and its detection and recovery response is slow;
(4) The software-based detection and recovery function consumes processor time and affects the processor's normal functions; at the same time, the error detection and correction program depends on the CPU itself and cannot locate the erroneous data;
(5) With the CPU software approach, when the RAM holding the error detection and correction program itself becomes abnormal, the function cannot work properly, so the RAM space cannot be fully covered.
Summary of the Invention
The purpose of the present invention is to overcome the deficiencies of the prior art by providing an online parallel processing soft error real-time error detection and recovery method and system, so as to solve the problem of hidden equipment faults or abnormal protection and control functions caused by abnormal memory bit flips in existing power secondary equipment.
To solve the above technical problem, the present invention provides an online parallel processing soft error real-time error detection and recovery method, including the following process:
dividing the protected RAM space into multiple protected areas;
dividing all protected areas into one or more levels, where the highest level completes one error detection and recovery function in each interrupt cycle, and the other levels complete one error detection and recovery function over multiple interrupt cycles;
registering the protected areas of each level to generate a number of linked lists corresponding to the number of levels, the linked lists being located in the protected RAM space; backing up each linked list and the protected areas in the linked list, at least two copies each, to other RAM space; the content of a linked list includes the position, the length, and the position of each backup of every protected area of the same level;
processing in parallel the error detection and recovery of the linked list of each level and of each protected area in the linked list.
Further, the method is executed by a parallel processing module, which accesses the protected RAM space through a high-speed interface and uses an independent DDR, SRAM, or RAM space for backup storage.
Further, the process of error detection and recovery for the linked list of any level and the protected areas in it is:
performing error detection on the linked list and its backups, and recovering any abnormality;
performing error detection on each protected area in the linked list and its backups, and recovering any abnormality.
Further, performing error detection on the linked list and its backups includes:
reading the linked list and its backups;
checking the linked list and its backups using any one or a combination of check methods including the SM3 signature, the MD5 message digest, and the BCC check code;
comparing the judgment results of the check methods to determine the correctness of the linked list and its backups.
Further, when a combination of check methods is used, the check processes of the individual methods run in parallel.
Further, the method also includes a communication-process check on the linked list and its backups as read: the linked list at each location is read at least three times in succession and a CRC check code is computed for each read; for each location, data for which at least two CRC check codes are identical is judged correct, and one of those correct reads is taken as the correctly read data.
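The communication-process check described above can be sketched in C as follows. This is only an illustration under stated assumptions: the source does not name a CRC variant, so a bitwise CRC-32 is used here, and the function names `crc32_calc` and `select_read_2oo3` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320).
 * Assumption: the source does not specify which CRC is used. */
static uint32_t crc32_calc(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Given three consecutive reads of the same location, returns the index
 * (0..2) of a read whose CRC matches at least one other read, or -1 when
 * all three CRCs differ (read path or hardware fault). */
static int select_read_2oo3(const uint8_t *reads[3], size_t len)
{
    uint32_t c0 = crc32_calc(reads[0], len);
    uint32_t c1 = crc32_calc(reads[1], len);
    uint32_t c2 = crc32_calc(reads[2], len);
    if (c0 == c1 || c0 == c2) return 0;
    if (c1 == c2) return 1;
    return -1;
}
```

A return value of -1 corresponds to the case in which no two reads agree, which a caller would treat as a fault of the read path rather than of the stored data.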
Further, checking the linked list and its two backups A and B with a single check method includes the following, described using the SM3 signature as an example:
computing the SM3 signatures of the linked list and of its backups A and B, and comparing them;
if at least two SM3 signatures are identical, all copies with that identical SM3 signature are judged normal and the other copies are judged abnormal.
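The comparison rule above is a two-out-of-three vote over the digests of the original and its two backups. A minimal sketch in C follows, assuming the three digests have already been computed elsewhere; the function name `vote_2oo3` and the bitmask encoding of the result are hypothetical choices for illustration and apply equally to SM3, MD5, or BCC values.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit positions in the returned mask (hypothetical encoding). */
enum { COPY_ORIGINAL = 1u << 0, COPY_A = 1u << 1, COPY_B = 1u << 2 };

/* Compares three digests of equal length. Returns a bitmask of the copies
 * judged normal: a copy is normal when its digest equals at least one other
 * digest. Returns 0 when all three digests differ (the vote fails). */
static unsigned vote_2oo3(const uint8_t *orig, const uint8_t *a,
                          const uint8_t *b, size_t digest_len)
{
    int oa = memcmp(orig, a, digest_len) == 0;
    int ob = memcmp(orig, b, digest_len) == 0;
    int ab = memcmp(a, b, digest_len) == 0;
    unsigned normal = 0;
    if (oa || ob) normal |= COPY_ORIGINAL;
    if (oa || ab) normal |= COPY_A;
    if (ob || ab) normal |= COPY_B;
    return normal;
}
```

A result with all three bits set means no abnormality; a result with two bits set identifies the single abnormal copy, which is then the target of recovery.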
Further, comparing the judgment results of the check methods to determine the correctness of the linked list and its backups includes:
comparing the judgment results obtained for the linked list and its backups by the three check methods (SM3 signature, MD5 message digest, and BCC check code); if two or more check methods give the same judgment result, that result is taken as the final adjudication result.
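The adjudication across check methods can be sketched as a second vote, this time over the per-method verdicts. In the sketch below each verdict is assumed to be a small integer code (for example a bitmask of the copies judged normal); the name `final_verdict` and the marker `VERDICT_UNDECIDED` are hypothetical and not taken from the source.

```c
#include <stdint.h>

#define VERDICT_UNDECIDED 0xFFu  /* hypothetical marker: no two methods agree */

/* Each argument is the verdict reached by one check method.
 * Returns the verdict shared by at least two methods, or
 * VERDICT_UNDECIDED when all three verdicts differ. */
static uint8_t final_verdict(uint8_t sm3, uint8_t md5, uint8_t bcc)
{
    if (sm3 == md5 || sm3 == bcc) return sm3;
    if (md5 == bcc) return md5;
    return VERDICT_UNDECIDED;
}
```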
Further, performing error detection on each protected area in the linked list and its backups includes:
when the linked list is of the highest level:
performing error detection in turn on each protected area in the linked list and its backups, where performing error detection on any one protected area in the linked list and its backups includes:
reading the current protected area in the linked list and its backups;
checking the current protected area in the linked list and its backups using any one or a combination of check methods including the SM3 signature, the MD5 message digest, and the BCC check code;
comparing the judgment results of the check methods to determine the correctness of the current protected area in the linked list and its backups.
Further, performing error detection on each protected area in the linked list and its backups includes:
when the linked list is of another level:
dividing all protected areas in the linked list into multiple groups;
in each interrupt cycle, performing error detection in turn on each protected area of one group in the linked list and its backups, where performing error detection on any one protected area of a group and its backups includes:
reading the current protected area of the group currently being checked and its backups;
checking the current protected area of the group currently being checked and its backups using any one or a combination of check methods including the SM3 signature, the MD5 message digest, and the BCC check code;
comparing the judgment results of the check methods to determine the correctness of the current protected area of the group currently being checked and its backups.
Further, when a combination of check methods is used, the check processes of the individual methods run in parallel.
Further, the method also includes a communication-process check on the current protected area in the linked list and its backups as read: the data at each location is read at least three times in succession and a CRC check code is computed for each read; for each location, data for which at least two CRC check codes are identical is judged correct, and one of those correct reads is taken as the correctly read data.
Further, checking the current protected area in the linked list and its backups with a single check method includes the following, described using the SM3 signature as an example:
computing the SM3 signatures of the current protected area in the linked list and of its backups, and comparing them;
if at least two SM3 signatures are identical, all copies with that identical SM3 signature are judged normal and the other copies are judged abnormal.
Further, comparing the judgment results of the check methods to determine the correctness of the current protected area in the linked list and its backups includes:
comparing the judgment results obtained for the current protected area in the linked list and its backups by the three check methods (SM3 signature, MD5 message digest, and BCC check code); if two or more check methods give the same judgment result, that result is taken as the final adjudication result.
Further, recovering an abnormality includes:
dividing the abnormal data area and the normal data area, in order, into N basic data blocks each;
computing the data CRC check codes $C_{en}$ of the abnormal data area, where $n = 0, 1, 2, \ldots, m$ and $N/2 \le 2^m < N$; $C_{en}$ is computed as follows: starting from the first basic data block, take $P_n$ basic data blocks, then skip $P_n$ basic data blocks and take the next $P_n$, and so on, and compute the CRC value with all the basic data blocks taken as the data source, where the number of basic data blocks in each interval is $P_n = 2^n$;
computing the data CRC check codes $C_{cn}$ of the normal data area, where $n = 0, 1, 2, \ldots, m$; $C_{cn}$ is computed in the same way: starting from the first basic data block, take $P_n$ basic data blocks, then skip $P_n$ basic data blocks and take the next $P_n$, and so on, and compute the CRC value with all the basic data blocks taken as the data source;
judging whether the two CRCs $C_{cm}$ and $C_{em}$ are equal: if they are equal, the abnormal data lies in the second half of the abnormal data area, and the abnormal-data marker is updated to point to the second half; if they are not equal, the first half of the abnormal data area must contain abnormal data, and the abnormal-data marker is updated to point to the first half;
repeating the above judgment for $C_{cn}$ and $C_{en}$ with n decreasing from m-1 to 0, updating the abnormal-data marker at each step; after the n = 0 step, the marker has narrowed to the basic data block of the abnormal data area that contains the abnormal data;
copying the content of the correct basic data block in the normal data area, through the high-speed interface, to the abnormal basic data block in the abnormal data area indicated by the abnormal-data marker, which completes recovery of that abnormal basic data block;
computing the CRC values over all basic data blocks of the normal data area and of the abnormal data area and comparing the two: if they are not equal, abnormal data remains, and all the above steps are repeated to locate and recover the other abnormal data; if they are equal, abnormal data recovery is finished.
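The localization and recovery steps above can be sketched in C as follows. The sketch rests on several assumptions that are not in the source: the block size `BLOCK_SIZE`, the CRC-32 variant, and all function names are hypothetical, and the per-pattern CRCs are computed over the whole area, which locates a block correctly only when a single basic data block is corrupted per pass. Because several simultaneously corrupted blocks can make the interleaved patterns point at a healthy block, the loop is bounded and reports failure instead of repeating forever; that bound is an addition for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16u   /* hypothetical basic data block size, in bytes */

/* Continue a bitwise CRC-32 (reflected polynomial 0xEDB88320) over len bytes. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* CRC over every block of the area. */
static uint32_t area_crc(const uint8_t *area, size_t nblocks)
{
    return ~crc32_update(0xFFFFFFFFu, area, nblocks * BLOCK_SIZE);
}

/* CRC over the blocks picked by pattern n: take 2^n blocks, skip 2^n, repeat.
 * Block i is picked exactly when bit n of i is 0. */
static uint32_t pattern_crc(const uint8_t *area, size_t nblocks, unsigned n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < nblocks; i++)
        if (((i >> n) & 1u) == 0)
            crc = crc32_update(crc, area + i * BLOCK_SIZE, BLOCK_SIZE);
    return ~crc;
}

/* Index of the corrupted block in bad[], assuming exactly one block differs
 * from good[]. May return a value >= nblocks if that assumption is violated. */
static size_t locate_bad_block(const uint8_t *bad, const uint8_t *good,
                               size_t nblocks)
{
    assert(nblocks >= 2);
    unsigned m = 0;
    while (((size_t)2 << m) < nblocks)   /* largest m with 2^m < nblocks */
        m++;
    size_t index = 0;
    for (unsigned n = m + 1; n-- > 0; ) {
        /* Equal CRCs: the picked blocks are clean, so the fault lies in
         * the skipped blocks, whose index has bit n set. */
        if (pattern_crc(bad, nblocks, n) == pattern_crc(good, nblocks, n))
            index |= (size_t)1 << n;
    }
    return index;
}

/* Repairs bad[] from good[] one block per pass. Returns 0 once the whole-area
 * CRCs match, or -1 if they still differ after nblocks passes. */
static int recover_area(uint8_t *bad, const uint8_t *good, size_t nblocks)
{
    for (size_t pass = 0; pass < nblocks; pass++) {
        if (area_crc(bad, nblocks) == area_crc(good, nblocks))
            return 0;
        size_t i = locate_bad_block(bad, good, nblocks);
        if (i >= nblocks)
            return -1;
        memcpy(bad + i * BLOCK_SIZE, good + i * BLOCK_SIZE, BLOCK_SIZE);
    }
    return area_crc(bad, nblocks) == area_crc(good, nblocks) ? 0 : -1;
}
```

For N = 8 blocks the sketch compares three pattern CRCs per area (n = 2, 1, 0) rather than eight per-block CRCs, which is the saving the interleaved patterns are meant to provide.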
Further, after recovery of the abnormal basic data block, the method also includes: re-reading the recovered basic data block; computing the CRC value of the re-read block and the CRC value of the corresponding correct basic data block; and comparing the two CRC values; if they are equal, recovery of this abnormal basic data block is complete.
Correspondingly, the present invention also provides a real-time soft error detection and recovery system with online parallel processing, comprising a linked list management module and an error detection and recovery module, wherein:
the linked list management module is used for dividing the protected RAM space into multiple protected areas; dividing all protected areas into one or more levels, where the highest level completes one round of error detection and recovery in every interrupt cycle and the other levels complete one round over multiple interrupt cycles; registering the protected areas of each level to generate linked lists, one per level, the linked lists being located in the protected RAM space; and backing up each linked list and the protected areas in it at least twice to other RAM space; the content of each linked list includes the position and length of each protected area of the same level and the position of each of its backups;
the error detection and recovery module is used for processing in parallel the error detection and recovery of the linked list of each level and of the protected areas in each linked list, where the process of error detection and recovery for the linked list of any level and the protected areas in it is:
performing error detection on the linked list and its backups, and recovering any abnormality;
performing error detection on each protected area in the linked list and its backups, and recovering any abnormality.
Compared with the prior art, the present invention achieves the following beneficial effects:
(1) The present invention enables online error detection and recovery when multiple locations, or even all locations, in a designated storage area change simultaneously; it solves the problem that hardware ECC cannot detect and correct multi-bit abnormal flips, provides more powerful functionality, and improves the robustness and stability of system program operation.
(2) The present invention detects and recovers abnormal bit flips in RAM without relying on the hardware ECC function of the CPU, providing a feasible way to implement RAM bit-flip detection on processor systems that lack hardware ECC.
(3) The abnormality detection and recovery function of the present invention for system RAM can locate erroneous data to within a certain area and can complete recovery within one interrupt tick, which improves the real-time handling of RAM abnormalities and reduces their impact on the system.
(4) The present invention uses an FPGA parallel processing module to assist in checking, correcting, and recovering processor RAM without consuming CPU time, achieving parallel detection and parallel correction and recovery of RAM abnormalities.
(5) The FPGA parallel processing module assists in detecting and correcting the entire protected RAM space.
DESCRIPTION OF THE DRAWINGS
Fig. 1 is a schematic diagram of the system of the present invention;
Fig. 2 is a schematic diagram of the linked list data of the present invention;
Fig. 3 is a schematic diagram of the linked list content of the present invention;
Fig. 4 is a schematic diagram of the check and adjudication process of the FPGA parallel processing module of the present invention;
Fig. 5 is a schematic diagram of the adjudication function of the FPGA parallel processing module of the present invention;
Fig. 6 is a schematic diagram of data block division and check code calculation in the localization function of the FPGA parallel processing module of the present invention;
Fig. 7 is a schematic diagram of abnormal data localization in the localization function of the FPGA parallel processing module of the present invention;
Fig. 8 is a schematic diagram of the data recovery function of the FPGA parallel processing module of the present invention;
Fig. 9 is a timing diagram of check calculation, adjudication, and recovery in the FPGA parallel processing module of the present invention.
DETAILED DESCRIPTION
The present invention is further described below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solutions of the present invention more clearly and do not limit its scope of protection.
Embodiment 1
A real-time soft error detection and recovery method with online parallel processing according to the present invention includes the following process:
dividing the protected RAM space into multiple protected areas;
dividing all protected areas into one or more levels, where the highest level completes one round of error detection and recovery in every interrupt cycle and the other levels complete one round over multiple interrupt cycles;
registering the protected areas of each level to generate linked lists, one per level, the linked lists being located in the protected RAM space; backing up each linked list and the protected areas in it to other RAM space in at least two copies, A and B; the content of each linked list includes the position and length of each protected area of the same level and the positions of backups A and B;
processing in parallel the error detection and recovery of the linked list of each level and of the protected areas in each linked list, where the process of error detection and recovery for the linked list of any level and the protected areas in it is:
performing error detection on the linked list and its two backups A and B, and recovering any abnormality;
performing error detection on each protected area in the linked list and its two backups A and B, and recovering any abnormality.
The abnormality detection and recovery function of the present invention for system RAM can locate the erroneous data and can complete recovery within one interrupt tick, which improves the real-time handling of RAM abnormalities and reduces their impact on the system.
Embodiment 2
To let a parallel processing module (such as an FPGA processing module; hereinafter collectively called the FPGA parallel processing module) assist with data processing in parallel, the FPGA parallel processing module in the system must access the RAM of the CPU system through a high-speed interface such as PCIe, SRIO, or HyperLink. To minimize its impact on the CPU system while assisting, the FPGA parallel processing module also needs separately controlled memory for backup data, for example a separate DDR, SRAM, or the internal RAM of the FPGA parallel processing module. If the RAM space of the CPU is ample and the high-speed interface bandwidth is sufficient, a separate space may instead be set aside in the CPU's RAM for the FPGA parallel processing module to store backup data.
In hardware, this can be implemented by designing a dedicated FPGA parallel processing circuit module, by using an FPGA parallel processing module already present in the CPU control system, or by using an SoC that integrates an FPGA parallel processing circuit.
In software, when the FPGA parallel processing module co-processes online error detection and recovery, the correctness of data transmission must also be considered, and the present invention provides a check mechanism at the communication level. The accuracy of the check algorithm must be considered as well, so the present invention uses three check algorithms based on different principles (SM3 signature, MD5 message digest, and BCC check code) to prevent common-cause errors.
Referring to Fig. 1, the real-time soft error detection and recovery method with online parallel processing of the present invention requires that the CPU and the FPGA parallel processing module in the control system be connected through a high-speed interface (for example PCIe, SRIO, or HyperLink; PCIe is used in the description below), and that the FPGA parallel processing module control an independent RAM space for backing up programs or data. The method includes the following process:
S1: Divide the protected RAM space into multiple protected areas.
According to how the RAM space of the protected CPU system is used, the protected RAM space is divided into multiple protected areas. When the program code is compiled, a link map file fixes the RAM location of each protected area. The link map file is the description file that specifies the address of each program section during compilation; it contains the memory addresses and lengths of the processor system and the name, length, and memory address of each memory area. In the program source code, the compiler pragma (#pragma DATA_SECTION(program name, "storage area name")) can be used to specify where certain program items are stored.
The FPGA parallel processing module reads and writes the RAM space of the protected CPU system directly over a high-speed bus (such as PCIe), and controls an independent RAM space used to back up the contents of the protected areas; this space can be read and written only by the FPGA parallel processing module.
To keep error detection and recovery of the protected RAM space real-time, the protected areas are managed by level: they are divided into multiple levels according to their importance, with LV1 as the highest level, which has the highest detection and recovery frequency and completes one round of error detection and recovery in every interrupt cycle; the frequency decreases for each lower level in turn. If the protected area of the system is small and the FPGA can complete error detection and recovery within one interrupt cycle, there may be only the single LV1 level; if none of the protected areas is very important to the system and one round of error detection and recovery spread over multiple interrupts is tolerable, there may be only a single lower LV level. For convenience of description, the rest of this embodiment assumes two levels (LV1/LV2).
According to the importance of the protected areas in the system, this embodiment divides them into two levels, level 1 (LV1) and level 2 (LV2), of which LV1 is the higher. For LV1 protected areas, the system checks and corrects the content of all areas in every interrupt tick; LV2 protected areas are divided into multiple groups, and each interrupt tick checks and corrects the content of one group, so that a full pass is completed over several interrupts. At system initialization, each protected area is registered to generate the linked lists of the two levels. The two linked lists, one per level, store the information on the LV1/LV2 protected RAM areas; they are also stored in RAM and are used by the FPGA to find the location and size of each protected area.
S2: Register all protected areas in the protected RAM space in the level 1 (LV1) or level 2 (LV2) linked list according to their importance; the FPGA parallel processing module backs up the linked lists (LV1/LV2) and the protected areas they point to into the independent RAM, making at least two copies, A and B.
The number of backup copies can be decided according to the actual situation, with a minimum of two. With two backups, a two-out-of-three rule decides which copy is correct or abnormal; three backups may also be kept, in which case a three-out-of-four or two-out-of-four rule is used. The present invention is described in detail using two backups, A and B, as an example.
The initial backup function module of the FPGA runs after system initialization is complete: following the LV1/LV2 linked lists, it initializes in the independent RAM the contents of the two backup areas, A and B, for the linked lists and the protected areas.
The linked list content at the three locations is identical and includes, for each protected area, its start address, its length, and the locations of backups A and B.
Registration principle: the protected areas pointed to by the LV1 and LV2 linked lists are registered in segments of at most 1K each; the LV1 and LV2 linked lists are stored as arrays of structures, as shown in Fig. 2. Each item in a linked list is a BD that records one protected area, with the following data format:
typedef struct {
    UIN32 length;       /* linked list entry length: 0 means unused, otherwise the data length */
    UIN32 addr;         /* address of the data in processor RAM */
    UIN32 back_addr_A;  /* offset of backup A in the independent RAM controlled by the FPGA parallel processing module */
    UIN32 back_addr_B;  /* offset of backup B in the independent RAM controlled by the FPGA parallel processing module */
} BDTable;
The correspondence of the parameters is shown in Fig. 3: length is the length of the protected area; addr is the start address of the protected area and is an address in the protected RAM space, that is, the protected area it points to lies in the protected RAM space; back_addr_A is the start address of backup A in the independent RAM controlled by the FPGA, that is, the area it points to lies in the independent RAM; back_addr_B is the start address of backup B in the independent RAM controlled by the FPGA, that is, the area it points to also lies in the independent RAM.
For convenience, "a protected area in the linked list" is used as shorthand for the protected area pointed to by the address in a BD entry of the linked list.
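The registration principle (segments of at most 1K per entry) can be sketched as a small helper that fills BDTable entries. The following is an illustration only: `UIN32` is assumed to be a 32-bit unsigned type, 1K is taken as 1024 bytes, and the function `register_region` together with its contiguous backup layout (`base_A`, `base_B`) is hypothetical rather than taken from the source.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t UIN32;   /* assumption: UIN32 is a 32-bit unsigned type */

typedef struct {
    UIN32 length;       /* 0 means unused, otherwise the data length */
    UIN32 addr;         /* address of the data in processor RAM */
    UIN32 back_addr_A;  /* offset of backup A in the independent RAM */
    UIN32 back_addr_B;  /* offset of backup B in the independent RAM */
} BDTable;

#define SEG_MAX 1024u   /* assumption: "1K" means 1024 bytes */

/* Splits the region [addr, addr + length) into entries of at most SEG_MAX
 * bytes, laying the backups out contiguously from base_A and base_B.
 * Returns the number of entries written, or 0 if the table is too small
 * (entries written before the overflow are left in place). */
static size_t register_region(BDTable *table, size_t capacity,
                              UIN32 addr, UIN32 length,
                              UIN32 base_A, UIN32 base_B)
{
    size_t n = 0;
    UIN32 off = 0;
    while (off < length) {
        UIN32 seg = length - off;
        if (seg > SEG_MAX)
            seg = SEG_MAX;
        if (n == capacity)
            return 0;
        table[n].length = seg;
        table[n].addr = addr + off;
        table[n].back_addr_A = base_A + off;
        table[n].back_addr_B = base_B + off;
        n++;
        off += seg;
    }
    return n;
}
```

With this layout a 2500-byte region yields three entries of 1024, 1024, and 452 bytes, each carrying its own pair of backup offsets.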
S3: At power-on initialization, complete the system configuration: configure the PCIe interface and map the protected RAM space into the PCIe address space of the FPGA parallel processing module; through the registers of the FPGA parallel processing module, configure the number of LV1 linked list entries processed per interrupt as the total number of entries, and the number of LV2 entries processed per interrupt as the total number of entries divided by P, so that P interrupts complete all the processing.
The interrupt period varies between 500 and 1000 µs. Completing abnormal bit-flip detection and recovery within every interrupt improves the response of detection and recovery and strengthens the ability of power protection equipment to resist abnormal behavior (maloperation or failure to operate).
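The per-interrupt workload configured in S3 can be sketched as a function that maps an interrupt counter to the range of LV2 entries to check. This is an illustration under assumptions: the names `EntryRange` and `lv2_slice` are hypothetical, and rounding the per-interrupt count up when the total is not divisible by P is a choice made for the sketch, since the source only states "total entries / P".

```c
#include <stddef.h>

typedef struct {
    size_t first;   /* index of the first entry to check in this interrupt */
    size_t count;   /* number of entries to check in this interrupt */
} EntryRange;

/* Entries of a lower-level list to check in interrupt number `tick`, when
 * one full pass is spread over `p` interrupts. Requires p > 0.
 * For the LV1 list, p = 1 selects every entry in every interrupt. */
static EntryRange lv2_slice(size_t total, size_t p, size_t tick)
{
    size_t per_tick = (total + p - 1) / p;   /* ceil(total / p): an assumption */
    size_t first = (tick % p) * per_tick;
    EntryRange r = { 0, 0 };
    if (first < total) {
        r.first = first;
        r.count = (total - first < per_tick) ? total - first : per_tick;
    }
    return r;
}
```

With 10 entries and P = 4, interrupts 0 to 3 cover entries 0-2, 3-5, 6-8, and 9, and interrupt 4 starts the next pass at entry 0.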
In this detection process, shown in Fig. 4, the LV1/LV2 linked list areas are checked first and any abnormality in them is recovered, which ensures that the linked list information is correct; only then can the corresponding protected areas be checked and recovered correctly from the information recorded in the LV1/LV2 linked lists. There is no ordering requirement between error detection and recovery at the LV1 and LV2 levels, but within a single level the linked list and its protected areas must be checked and recovered serially. For ease of understanding, this embodiment first describes error detection and recovery of the LV1 and LV2 linked lists and then that of the protected areas in them.
Steps S4 to S8 below are the specific process for adjudicating the correctness of the LV1 linked list and its two backups A and B. As shown in Figure 5, they include:
S4: On each interrupt tick, the FPGA parallel processing module reads the LV1 linked list and its two backups A and B. The linked list at each location is read three times in succession, and a CRC check code is calculated for each read; for each location, one correctly read copy of the data is selected from the three reads according to the two-out-of-three rule;
Communication-process verification is performed on the data at all three locations as follows: a CRC check is performed on each of the three reads of the linked-list data at each location. If the three CRC check codes for a location all differ, the read function or the system hardware is abnormal, and the device must be locked out and the system restarted. If two CRC check codes are the same, those reads are correct, and either of them is taken as the correctly read data.
The purpose of communication-process verification is to prevent misjudgment caused by errors in the communication process; it is an optional function. If the system does not need to consider communication-interface errors, this verification can be omitted, that is, the data is read once, treated as correct, and used for the subsequent judgment.
It should be noted that the number of reads may be three or more; when four reads are used, a two-out-of-four or three-out-of-four rule may be used to determine the correctly read data.
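The read voting of step S4 can be sketched in software as follows. This is only an illustrative model, not the FPGA implementation: the function name `read_with_vote`, the `read` callback, and the use of `zlib.crc32` as the CRC are assumptions made for the example.

```python
import zlib
from collections import Counter
from typing import Callable, Optional


def read_with_vote(read: Callable[[], bytes],
                   reads: int = 3,
                   quorum: int = 2) -> Optional[bytes]:
    """Read one location `reads` times and vote on the CRC of each read.

    Returns one sample whose CRC is shared by at least `quorum` reads,
    or None when no CRC reaches the quorum (the caller would then lock
    out the device and restart, as the text describes).
    `read` is a hypothetical callback standing in for the RAM interface.
    """
    samples = [read() for _ in range(reads)]
    crcs = [zlib.crc32(s) for s in samples]
    crc, count = Counter(crcs).most_common(1)[0]
    if count < quorum:
        return None
    # Any read carrying the majority CRC is accepted as the correct data.
    return samples[crcs.index(crc)]
```

With `reads=4`, passing `quorum=2` or `quorum=3` gives the two-out-of-four and three-out-of-four variants mentioned above.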
S5: The FPGA parallel processing module calculates the SM3 signature information of the correctly read data of the LV1 linked list and of its two backups A and B, compares the results, and judges according to the two-out-of-three rule whether the data at the three locations has changed abnormally: if the SM3 signature information of the three locations is identical, there is no abnormality, that is, the data at the three locations is the same; if two are identical, those two copies are normal and the third is abnormal; if all three differ, the judgment fails;
S6: The FPGA parallel processing module calculates the MD5 message digest of the correctly read data of the LV1 linked list and of its two backups A and B, compares the results, and judges according to the two-out-of-three rule whether the data at the three locations has changed abnormally: if the MD5 message digests of the three locations are identical, there is no abnormality, that is, the data at the three locations is the same; if two are identical, those two copies are normal and the third is abnormal; if all three differ, the judgment fails;
S7: The FPGA parallel processing module calculates the BCC check code of the correctly read data of the LV1 linked list and of its two backups A and B, compares the results, and judges according to the two-out-of-three rule whether the data at the three locations has changed abnormally: if the BCC check codes of the three locations are identical, there is no abnormality, that is, the data at the three locations is the same; if two are identical, those two copies are normal and the third is abnormal; if all three differ, the judgment fails;
It should be noted that this embodiment is described using three checks, one each of three methods: SM3 signature information, MD5 digest information, and BCC check. Those skilled in the art will understand that the check methods are not limited to these three; other common methods such as CRC32 or CRC64 may also be chosen. Check methods and numbers of checks may be combined freely, for example two SM3 signature checks and two CRC32 checks. Using different check methods prevents a weakness inherent in one method from causing a wrong final judgment. In an application, the methods and the number of checks can be selected and combined according to the system's tolerance for misjudgment.
In this embodiment the three check processes are independent of one another and can be performed in parallel.
S8: The FPGA parallel processing module compares the judgment results of the three check methods (SM3 signature information, MD5 message digest, and BCC check code) for the LV1 linked list and adjudicates according to the two-out-of-three rule to determine the correctness of the LV1 linked list and its two backups A and B: if two or more check methods give the same judgment result, that result is taken as the final adjudication result; otherwise the adjudication result is a failed judgment.
If the final adjudication result is "no abnormality" (that is, the data at the three locations is the same), the LV1 linked list and its two backups A and B are all correct, and the check of the LV1 linked list ends;
If the result is a failed judgment, the error is unrecoverable, and the device must be locked out and restarted to recover;
If the result is that two copies are normal and one is abnormal, the abnormal data is recoverable and is handled by the abnormal data recovery method described later. The abnormal data here may be the LV1 linked list, backup A, or backup B; the backup areas are also RAM, so they too may contain errors and need recovery.
S9: Following steps S4 to S8, verify the correctness of the LV2 linked list and its two backups A and B; if abnormal data exists among the three locations, recover it.
The above process guarantees the correctness of the LV1/LV2 linked lists and their two backups A and B. The corresponding protected areas are subsequently checked and recovered according to the information recorded in the correct LV1/LV2 linked lists.
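The two-level voting of steps S5 to S8 (a two-out-of-three vote across the three copies for each check method, then a two-out-of-three vote across the methods) can be sketched as below. The names `vote_copies`, `adjudicate`, `OK`, and `FAIL` are hypothetical. SM3 is not guaranteed to be available in Python's standard `hashlib`, so `zlib.crc32` stands in for it here as an assumption; the text itself lists CRC32 as an acceptable alternative check.

```python
import hashlib
import zlib
from typing import Callable, List, Sequence, Union

OK, FAIL = "ok", "fail"          # verdict labels (hypothetical names)
Verdict = Union[str, int]        # OK, FAIL, or the index 0..2 of the odd copy


def bcc(data: bytes) -> int:
    """Block check character: XOR of all bytes."""
    out = 0
    for b in data:
        out ^= b
    return out


def vote_copies(copies: Sequence[bytes], check: Callable[[bytes], object]) -> Verdict:
    """Two-out-of-three comparison of one check value across the copies."""
    v = [check(c) for c in copies]
    if v[0] == v[1] == v[2]:
        return OK
    if v[0] == v[1]:
        return 2
    if v[0] == v[2]:
        return 1
    if v[1] == v[2]:
        return 0
    return FAIL


CHECKS: List[Callable[[bytes], object]] = [
    zlib.crc32,                            # stand-in for SM3 (assumption)
    lambda d: hashlib.md5(d).digest(),     # MD5 message digest
    bcc,                                   # BCC check code
]


def adjudicate(copies: Sequence[bytes], checks=CHECKS) -> Verdict:
    """Final verdict: a result reported by at least two check methods."""
    verdicts = [vote_copies(copies, c) for c in checks]
    for v in verdicts:
        if verdicts.count(v) >= 2:
            return v
    return FAIL
```

Here `copies` is the original followed by backups A and B. In the case `[b"ab", b"ab", b"ba"]` the BCC alone sees no difference, because XOR ignores byte order, but the other two checks agree that the third copy is abnormal, which illustrates why the text combines different check methods.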
Steps S10 to S15 below are the specific process for adjudicating the correctness of each protected area in the LV1 linked list and its two backups A and B. As shown in Figure 4, they include:
S10: The FPGA parallel processing module then reads the RAM-space data corresponding to the first protected area in the LV1 linked list, together with its two backups A and B. The data at each location is read three times in succession, and a CRC check code is calculated for each read; for each location, one correctly read copy of the data is selected from the three reads according to the two-out-of-three rule;
Communication-process verification is performed on the data at all three locations as follows: a CRC check is performed on each of the three reads of the data at each location. If the three CRC check codes for a location all differ, the read function or the system hardware is abnormal, and the device must be locked out and the system restarted. If two CRC check codes are the same, those reads are correct, and either of them is taken as the correctly read data.
S11: The FPGA parallel processing module calculates the SM3 signature information of the correctly read RAM-space data corresponding to the first protected area in the LV1 linked list and of its two backups A and B, compares the results, and judges according to the two-out-of-three rule whether the data at the three locations has changed abnormally: if the SM3 signature information of the three locations is identical, there is no abnormality, that is, the data at the three locations is the same; if two are identical, those two copies are normal and the third is abnormal; if all three differ, the judgment fails;
S12: The FPGA parallel processing module calculates the MD5 message digest of the correctly read RAM-space data corresponding to the first protected area in the LV1 linked list and of its two backups A and B, compares the results, and judges according to the two-out-of-three rule whether the data at the three locations has changed abnormally: if the MD5 message digests of the three locations are identical, there is no abnormality, that is, the data at the three locations is the same; if two are identical, those two copies are normal and the third is abnormal; if all three differ, the judgment fails;
S13: The FPGA parallel processing module calculates the BCC check code of the correctly read RAM-space data corresponding to the first protected area in the LV1 linked list and of its two backups A and B, compares the results, and judges according to the two-out-of-three rule whether the data at the three locations has changed abnormally: if the BCC check codes of the three locations are identical, there is no abnormality, that is, the data at the three locations is the same; if two are identical, those two copies are normal and the third is abnormal; if all three differ, the judgment fails;
S14: The FPGA parallel processing module compares the judgment results of the three check methods (SM3 signature information, MD5 message digest, and BCC check code) and adjudicates according to the two-out-of-three rule to determine the correctness of the RAM space corresponding to the first protected area in the LV1 linked list and of its two backups A and B: if two or more methods give the same judgment result, that result is taken as the final adjudication result; otherwise the adjudication result is a failed judgment.
If the final adjudication result is "no abnormality" (the data at the three locations is the same), the RAM-space data corresponding to the first protected area in the LV1 linked list and its two backups A and B are all correct, and the check of the first protected area in the LV1 linked list ends;
If the result is a failed judgment, the error is unrecoverable, and the device must be locked out and restarted to recover;
If the result is that two copies are normal and one is abnormal, the abnormal data is recoverable and is handled by the abnormal data recovery method described later. The abnormal data here may be the RAM-space data, backup A, or backup B; the backup areas are also RAM, so they too may contain errors and need recovery.
S15: The FPGA parallel processing module repeats steps S10 to S14 to adjudicate the data of every protected area in the LV1 linked list and recovers any abnormality found. This completes the check of all protected-area data in the LV1 linked list.
The protected areas in the LV2 linked list are processed over multiple interrupts: the protected areas of the LV2 linked list are divided into multiple groups, and one group (multiple BDs) is completed on each interrupt tick. Therefore, each time the LV2 check for an interrupt has been completed, the group position must be updated.
Steps S16 to S22 below are the specific process for adjudicating the correctness of each protected area in the current detection group of the LV2 linked list and its two backups A and B. As shown in Figure 4, they include:
S16: The FPGA parallel processing module reads the RAM-space data corresponding to the first protected area in the current detection group of the LV2 linked list, together with its two backups A and B. The data at each location is read three times in succession, and a CRC check code is calculated for each read; for each location, one correctly read copy of the data is selected from the three reads according to the two-out-of-three rule;
Current detection group: the content of the LV2 linked list is divided into multiple groups; detection starts from the first group, and one group is checked at a time. The group currently being checked is called the current detection group.
S17: The FPGA parallel processing module calculates the SM3 signature information of the correctly read RAM-space data corresponding to the first protected area in the current detection group of the LV2 linked list and of its two backups A and B, compares the results, and judges according to the two-out-of-three rule whether the data at the three locations has changed abnormally: if the SM3 signature information of the three locations is identical, there is no abnormality, that is, the data at the three locations is the same; if two are identical, those two copies are normal and the third is abnormal; if all three differ, the judgment fails;
S18: The FPGA parallel processing module calculates the MD5 message digest of the correctly read RAM-space data corresponding to the first protected area in the current detection group of the LV2 linked list and of its two backups A and B, compares the results, and judges according to the two-out-of-three rule whether the data at the three locations has changed abnormally: if the MD5 message digests of the three locations are identical, there is no abnormality, that is, the data at the three locations is the same; if two are identical, those two copies are normal and the third is abnormal; if all three differ, the judgment fails;
S19: The FPGA parallel processing module calculates the BCC check code of the correctly read RAM-space data corresponding to the first protected area in the current detection group of the LV2 linked list and of its two backups A and B, compares the results, and judges according to the two-out-of-three rule whether the data at the three locations has changed abnormally: if the BCC check codes of the three locations are identical, there is no abnormality, that is, the data at the three locations is the same; if two are identical, those two copies are normal and the third is abnormal; if all three differ, the judgment fails;
S20: The FPGA parallel processing module compares the judgment results of the three check methods (SM3 signature information, MD5 message digest, and BCC check code) and adjudicates according to the two-out-of-three rule to determine the correctness of the RAM space corresponding to the first protected area in the current detection group of the LV2 linked list and of its two backups A and B: if two or more methods give the same judgment result, that result is taken as the final adjudication result; otherwise the adjudication result is a failed judgment.
If the final adjudication result is "no abnormality" (the data at the three locations is the same), the RAM-space data corresponding to the first protected area in the current detection group of the LV2 linked list and its two backups A and B are all correct, and the check of that protected area ends;
If the result is a failed judgment, the error is unrecoverable, and the device must be locked out and restarted to recover;
If the result is that two copies are normal and one is abnormal, the abnormal data is recoverable and is handled by the abnormal data recovery method described later. The abnormal data here may be the RAM-space data, backup A, or backup B.
S21: Repeat steps S16 to S20 to adjudicate the data of every protected area in the current detection group of the LV2 linked list and recover any abnormality found. This completes the check of all protected-area data in the current detection group of the LV2 linked list;
S22: Finally, the current-detection-group position in the LV2 linked list is advanced to the next group, and steps S16 to S21 are repeated in the next interrupt cycle to adjudicate and recover the data of all protected areas in the next group, until the data of all protected areas in all groups of the LV2 linked list has been checked.
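The group-by-group LV2 scan of steps S16 to S22 can be modelled as a cursor that advances once per interrupt. The sketch below is illustrative only: the class name `Lv2Scanner`, the `group_size` parameter, and the `check_region` callback are hypothetical, and wrapping back to the first group after the last one is an assumption, since the text only says the scan continues until all groups have been checked.

```python
from typing import Callable, List, Sequence


class Lv2Scanner:
    """Checks one group of LV2 protected regions per interrupt tick."""

    def __init__(self, regions: Sequence[object], group_size: int) -> None:
        if not regions or group_size < 1:
            raise ValueError("need at least one region and group_size >= 1")
        # Split the LV2 region list into consecutive groups.
        self.groups: List[Sequence[object]] = [
            regions[i:i + group_size]
            for i in range(0, len(regions), group_size)
        ]
        self.current = 0  # index of the current detection group

    def on_interrupt(self, check_region: Callable[[object], None]) -> int:
        """Check every region of the current group, then advance the cursor.

        Returns the index of the group that was just checked.
        """
        done = self.current
        for region in self.groups[done]:
            check_region(region)   # S16 to S20 for one protected area
        # S22: point at the next group (wrap-around is an assumption).
        self.current = (done + 1) % len(self.groups)
        return done
```

With five regions and `group_size=2`, three interrupts cover all regions once, so the worst-case detection latency for an LV2 region in this model is the number of groups multiplied by the interrupt period.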
After the system has judged the correctness of the protected RAM space, the FPGA parallel processing module needs to recover the data. To improve real-time response during data recovery, the present invention divides the data into a number of basic data blocks and calculates CRC values in several ways; from these CRC values, the abnormal basic data block is located quickly. The correctness of the communication process must also be considered when restoring protected data. Therefore, after the correct basic data block has been written to the abnormal location, the present invention immediately reads the data back and judges whether it was written correctly; if the read-back data is inconsistent with the written data, it is written again.
When abnormalities occur at multiple locations, one abnormal basic data block can be located first, and further abnormal basic data blocks can then be located iteratively. The present invention thus handles multi-location abnormalities by iteration.
The precondition of the abnormal data recovery method of the present invention is that the final adjudication result of the FPGA parallel processing module is that two copies of the data are normal and one is abnormal, in which case the abnormal data is recoverable. Either normal copy is chosen as the normal data area, and the abnormal copy serves as the abnormal data area. The abnormal data area may be the protected RAM area or an independent RAM area. The specific steps, shown in Figure 8, are as follows:
S1: The abnormal data area and the normal data area are each divided, in order, into N blocks of the minimum data length, referred to below as basic data blocks.
S2: The FPGA parallel processing module calculates the data CRC check codes $C_{en}$ of the abnormal data area, where $n = 0, 1, 2, 3, \dots, m$, the number of basic data blocks per interval is $P_n = 2^n$, and $N/2 \le 2^m < N$.
$C_{en}$ is calculated as follows: starting from the first basic data block, take $P_n$ basic data blocks, then skip $P_n$ basic data blocks and take the next $P_n$ basic data blocks, and so on; the CRC value is calculated over all the basic data blocks taken.
$C_{e0}, C_{e1}, \dots, C_{em}$ in this step may be calculated in parallel or serially.
As shown in Figure 6, the abnormal data area is divided into N = 16 basic data blocks, denoted B1, B2, ..., B16. $C_{e0}$ is calculated by taking 1 basic data block starting from B1, then skipping 1 and taking 1, and so on, that is, B1, B3, B5, ..., B15 form the data source. $C_{e1}$ is calculated by taking 2 basic data blocks starting from B1, then skipping 2 and taking 2, and so on, that is, B1, B2, B5, B6, ..., B13, B14 form the data source. Likewise, $C_{e3}$ is calculated by taking 8 basic data blocks starting from B1 and then skipping 8, that is, B1, B2, ..., B8 form the data source.
S3: The FPGA parallel processing module calculates the data CRC check codes $C_{cn}$ of the normal data area, where $n = 0, 1, 2, 3, \dots, m$, the number of basic data blocks per interval is $P_n = 2^n$, and $N/2 \le 2^m < N$.
$C_{cn}$ is calculated as follows: starting from the first basic data block, take $P_n$ basic data blocks, then skip $P_n$ basic data blocks and take the next $P_n$ basic data blocks, and so on; the CRC value is calculated over all the basic data blocks taken.
$C_{c0}, C_{c1}, \dots, C_{cm}$ in this step may be calculated in parallel or serially.
As shown in Figure 6, the normal data area is divided into N = 16 basic data blocks, denoted B1, B2, ..., B16. $C_{c0}$ is calculated by taking 1 basic data block starting from B1, then skipping 1 and taking 1, and so on, that is, B1, B3, B5, ..., B15 form the data source. $C_{c1}$ is calculated by taking 2 basic data blocks starting from B1, then skipping 2 and taking 2, and so on, that is, B1, B2, B5, B6, ..., B13, B14 form the data source. Likewise, $C_{c3}$ is calculated by taking 8 basic data blocks starting from B1 and then skipping 8, that is, B1, B2, ..., B8 form the data source.
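The block selection used for both the abnormal-area and the normal-area check codes in S2 and S3 can be written down compactly. The sketch below is an illustration under stated assumptions: the function names are hypothetical, block indices are 0-based (so index 0 is B1), and `zlib.crc32` stands in for whatever CRC the FPGA computes.

```python
import zlib
from typing import List, Sequence


def max_level(num_blocks: int) -> int:
    """Largest m with N/2 <= 2**m < N (requires N >= 2)."""
    if num_blocks < 2:
        raise ValueError("need at least two basic data blocks")
    return (num_blocks - 1).bit_length() - 1


def stride_blocks(num_blocks: int, n: int) -> List[int]:
    """0-based indices covered at level n: take 2**n blocks, skip 2**n, repeat."""
    p = 1 << n
    return [i for i in range(num_blocks) if (i // p) % 2 == 0]


def stride_crc(blocks: Sequence[bytes], n: int) -> int:
    """CRC over the concatenation of the blocks selected at level n."""
    crc = 0
    for i in stride_blocks(len(blocks), n):
        crc = zlib.crc32(blocks[i], crc)   # running CRC across blocks
    return crc
```

For N = 16 this gives m = 3, and level 1 selects indices 0, 1, 4, 5, 8, 9, 12, 13, that is, B1, B2, B5, B6, ..., B13, B14, matching the Figure 6 example. A block is selected at level n exactly when bit n of its 0-based index is 0, which is what makes the later localization a bit-by-bit search.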
S4: The FPGA parallel processing module judges whether the two CRCs $C_{cm}$ and $C_{em}$ are equal. If they are equal, the abnormal data lies in the second half of the abnormal data area, and the abnormal-data marker position is updated to point to the second half of the abnormal data area; if they are not equal, the first half of the abnormal data area must contain an abnormality, and the abnormal-data marker position is updated to point to the first half of the abnormal data area.
This step is the processing for n = m, at which point all basic data blocks of the data area are split into a first part and a second part. Step S4 determines whether the abnormal data lies in the first or the second part; once the abnormal range has been narrowed, S5 repeats the judgment for n = m-1 and so on, narrowing the range further until, at n = 0, it is reduced to one specific abnormal basic data block.
S5: Step S4 is repeated to compare $C_{cn}$ and $C_{en}$, with n decreasing from m-1 down to 0, and the abnormal-data marker position is updated at each step. After the n = 0 step, the marker position has been narrowed to the basic data block of the abnormal data area that contains the abnormal data, referred to as the abnormal basic data block. The correct data for this abnormal basic data block is the corresponding basic data block in the normal data area, referred to as the correct basic data block.
In this embodiment the judgment process is as shown in Figure 7. First, $C_{e3}$ and $C_{c3}$ are compared; if they are equal, the abnormal data lies in the second half, B9 to B16. Next, $C_{e2}$ and $C_{c2}$ are compared; if they are equal, the abnormal data lies in the range B13 to B16. Next, $C_{e1}$ and $C_{c1}$ are compared; if they are equal, the abnormal data lies in the range B15 to B16. Finally, $C_{e0}$ and $C_{c0}$ are compared; if they are equal, the abnormal data lies in basic data block B16. The specific abnormal basic data block has now been located. The other judgment branches follow in the same way from Figure 7 and are not repeated here.
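The narrowing of steps S4 and S5 amounts to recovering the index of the abnormal block one bit at a time. The following sketch assumes exactly one abnormal block, two equally long block lists with at least two blocks, and `zlib.crc32` as a stand-in CRC; the function names are hypothetical and indices are 0-based (index 15 is B16).

```python
import zlib
from typing import Sequence


def stride_crc(blocks: Sequence[bytes], n: int) -> int:
    """CRC over blocks selected at level n (take 2**n, skip 2**n, repeat)."""
    p = 1 << n
    crc = 0
    for i, blk in enumerate(blocks):
        if (i // p) % 2 == 0:
            crc = zlib.crc32(blk, crc)
    return crc


def locate_bad_block(good: Sequence[bytes], bad: Sequence[bytes]) -> int:
    """Return the 0-based index of the single block where `bad` differs.

    At each level n, from m down to 0, the covered blocks are compared
    by CRC. Equal CRCs mean the fault is in the uncovered (second) part
    of the current range, so the range start moves forward by 2**n.
    """
    m = (len(good) - 1).bit_length() - 1   # N/2 <= 2**m < N
    start = 0
    for n in range(m, -1, -1):
        if stride_crc(good, n) == stride_crc(bad, n):
            start += 1 << n                # fault is in the second part
    return start
```

For N = 16 only four CRC comparisons are needed instead of sixteen per-block comparisons. The walk in Figure 7 where every comparison is equal corresponds to `start` accumulating 8 + 4 + 2 + 1 = 15, that is, block B16.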
S6: The FPGA parallel processing module copies, over the high-speed interface, the content of the correct basic data block in the normal data area to the abnormal basic data block in the abnormal data area pointed to by the abnormal-data marker position.
S7: The FPGA parallel processing module then re-reads the abnormal basic data block, calculates the CRC value of the re-read block, and calculates the CRC value of the corresponding correct basic data block. The two CRC values are compared: if they are not equal, steps S6 and S7 are repeated to restore the block again, with at most 2 repetitions in this embodiment; if they are equal, recovery of this abnormal basic data block is complete.
S8: The FPGA parallel processing module calculates the CRC value over all basic data blocks of the normal data area and over all basic data blocks of the abnormal data area and compares the two CRC values. If they are not equal, abnormal data still remains, and steps S2 to S7 are repeated to locate and recover the other abnormal data; if they are equal, abnormal data recovery ends.
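Steps S6 to S8 can be sketched as a restore-and-verify loop around the localization step. This is a software model under assumptions, not the FPGA logic: the data areas are plain Python lists of equally sized blocks, the write and read-back are list operations standing in for the high-speed interface, all function names are hypothetical, and the cap on the number of passes is an added safeguard so that the sketch always terminates.

```python
import zlib
from typing import List, Optional, Sequence


def stride_crc(blocks: Sequence[bytes], n: int) -> int:
    p, crc = 1 << n, 0
    for i, blk in enumerate(blocks):
        if (i // p) % 2 == 0:
            crc = zlib.crc32(blk, crc)
    return crc


def locate_bad_block(good: Sequence[bytes], bad: Sequence[bytes]) -> int:
    m = (len(good) - 1).bit_length() - 1
    start = 0
    for n in range(m, -1, -1):
        if stride_crc(good, n) == stride_crc(bad, n):
            start += 1 << n
    return start


def total_crc(blocks: Sequence[bytes]) -> int:
    """CRC over all basic data blocks of one area (used in S8)."""
    crc = 0
    for blk in blocks:
        crc = zlib.crc32(blk, crc)
    return crc


def restore_block(good: Sequence[bytes], bad: List[bytes], k: int,
                  max_retries: int = 2) -> bool:
    """S6 and S7: write the correct block, read it back, compare CRCs."""
    for _ in range(1 + max_retries):
        bad[k] = good[k]                              # S6: copy (stand-in write)
        if zlib.crc32(bad[k]) == zlib.crc32(good[k]):  # S7: read-back check
            return True
    return False


def recover(good: Sequence[bytes], bad: List[bytes],
            max_passes: Optional[int] = None) -> bool:
    """S8: repeat locate and restore until both areas have the same CRC."""
    passes = len(good) if max_passes is None else max_passes
    for _ in range(passes):
        if total_crc(good) == total_crc(bad):
            return True
        if not restore_block(good, bad, locate_bad_block(good, bad)):
            return False
    return total_crc(good) == total_crc(bad)
```

`recover` returns `False` when a block cannot be written back within the retry limit or when the passes run out with the areas still different; a caller would then treat the error as unrecoverable.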
The typical flow of the above process, with its timing, is shown in Figure 9: first error detection and recovery of the LV1 linked list, then error detection and recovery of the LV2 linked list, then error detection and recovery of all protected areas of the LV1 linked list, and finally error detection and recovery of the protected areas in the current group of the LV2 linked list. Error detection and recovery of the protected areas recorded in the LV1 linked list can begin only after error detection and recovery of the LV1 linked list itself has been completed; the same holds for the LV2 linked list and its protected areas. The LV1 and LV2 linked lists, however, have no ordering dependency on each other and can be designed to be processed in parallel.
If data in the protected RAM space must be changed legitimately while the system is running (for example by a normal parameter-modification service), the modification should be carried out in the following order: first, the software stops the verification and adjudication functions and the data recovery function, and reads back the running state of the FPGA parallel processing module to confirm that the functions have stopped; then the data is modified normally; finally, the verification and adjudication functions and the data recovery function are restarted.
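The stop, confirm, modify, restart order described above maps naturally onto a guarded section. The sketch below is a hypothetical software-side model: `CheckerControl` and its methods are invented for illustration and do not correspond to any real FPGA register interface.

```python
from contextlib import contextmanager
from typing import Iterator


class CheckerControl:
    """Hypothetical handle to the FPGA checking and recovery functions."""

    def __init__(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False       # stop verification, adjudication, recovery

    def start(self) -> None:
        self.running = True        # restart all three functions

    def is_stopped(self) -> bool:
        return not self.running    # models reading back the FPGA run state


@contextmanager
def checker_paused(ctrl: CheckerControl) -> Iterator[None]:
    """Stop the checker, confirm it stopped, and restart it afterwards."""
    ctrl.stop()
    if not ctrl.is_stopped():
        ctrl.start()
        raise RuntimeError("checker did not confirm that it stopped")
    try:
        yield                      # protected data may be modified here
    finally:
        ctrl.start()               # restart even if the modification failed


# Usage: modify protected data only while the checker is paused.
ctrl = CheckerControl()
protected = bytearray(b"old-param")
with checker_paused(ctrl):
    protected[:] = b"new-param"
```

Restarting in the `finally` clause is a design choice of this sketch, so that a failed modification does not leave the protected RAM unchecked.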
The meaning of "online parallel processing" in the present invention is as follows. "Online" means that error detection and recovery of the protected RAM is performed while the system functions are running normally. In this document, the read verification, SM3 signature information, MD5 digest information, BCC check, combined adjudication, and abnormal data recovery functions for the LV1/LV2 linked lists and the corresponding protected RAM run concurrently with the system, driven by the system interrupt, to detect and recover from abnormal changes in RAM. "Parallel" means, first of all, that the normal system functions and the error detection and recovery functions run in parallel; the latter are implemented in the FPGA and take no processor time. Error detection and recovery for LV1 and LV2 can be designed to run in parallel or serially, depending on the system design and the available FPGA resources.
The meaning of "real-time" in the present invention is as follows. In this document, the read verification, SM3 signature information, MD5 digest information, BCC check, combined adjudication, and abnormal data recovery functions for the LV1/LV2 linked lists and the corresponding protected RAM run concurrently with the system, driven by the system interrupt, so that errors are detected and recovered in real time; error detection and recovery of the critical data areas is completed within every interrupt, giving high real-time performance.
需要说明的是:通讯过程校验的目的是防止通讯过程出错导致的误判,为可选功能。如果系统不考虑通讯接口出错的情况时,可以去掉此校验过程,即:读取一次数据作为正确数据,进行后续判断。It should be noted that the purpose of communication process verification is to prevent misjudgment caused by errors in the communication process, which is an optional function. If the system does not consider the error of the communication interface, this verification process can be removed, that is, read the data once as the correct data, and make subsequent judgments.
Note also that the embodiments of the present invention are described using three verification methods, namely SM3 signature, MD5 digest, and BCC check, each applied once, for three verifications in total. Those skilled in the art will understand that the verification is not limited to these three; other common methods, such as CRC32 or CRC64, may also be used. The verification methods and the number of verifications may be combined arbitrarily. Using different verification methods prevents a weakness inherent in any single method from causing a final misjudgment. In practice, the methods and the number of verifications can be selected and combined according to the system's tolerance for misjudgment.
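As a rough illustration of why mixing methods helps, the sketch below checks one protected region and its two backups with three methods and then takes the verdict that at least two methods agree on. It is not the patented implementation: CRC32 stands in for SM3 because the Python standard library does not reliably provide SM3, and the function names (`bcc`, `judge_by_method`, `adjudicate`) are hypothetical.

```python
import hashlib
import zlib
from functools import reduce


def bcc(data: bytes) -> int:
    """Block check character: XOR of all bytes."""
    return reduce(lambda acc, byte: acc ^ byte, data, 0)


# Assumption: CRC32 replaces SM3 here purely for portability.
CHECKS = {
    "md5": lambda data: hashlib.md5(data).digest(),
    "bcc": bcc,
    "crc32": zlib.crc32,
}


def judge_by_method(copies, check):
    """Indices of copies judged normal by one method.

    A copy is normal when its check value is shared by at least two copies.
    """
    values = [check(copy) for copy in copies]
    return frozenset(i for i, v in enumerate(values) if values.count(v) >= 2)


def adjudicate(copies):
    """Return the verdict given by at least two methods, or None."""
    verdicts = [judge_by_method(copies, check) for check in CHECKS.values()]
    for verdict in verdicts:
        if verdicts.count(verdict) >= 2:
            return verdict
    return None
```

For example, with copies `[b"abcd", b"abcd", b"bacd"]` the BCC check alone cannot see the swapped bytes, since XOR is order-independent, while MD5 and CRC32 both flag the third copy, so the combined verdict marks only the first two copies as normal.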
Embodiment 3
The present invention also provides a real-time soft error detection and recovery device with online parallel processing, comprising a linked list management module and an error detection and recovery module, wherein:
The linked list management module is configured to: divide the protected RAM space into multiple protected regions; divide all protected regions into one or more levels, where regions at the highest level complete one round of error detection and recovery in every interrupt cycle and regions at the other levels complete one round over multiple interrupt cycles; register the protected regions of each level to generate as many linked lists as there are levels, the linked lists being located in the protected RAM space; and back up each linked list and the protected regions in it, in at least two copies, to other RAM space. The content of each linked list includes the position and length of every protected region of that level and the position of each of its backups.
The error detection and recovery module is configured to process, in parallel, the error detection and recovery of the linked list of each level and of the protected regions in it, where the error detection and recovery of the linked list of any level and of its protected regions proceeds as follows:
performing error detection on the linked list and its backups, and recovering from any abnormality found;
performing error detection on each protected region in the linked list and its backups, and recovering from any abnormality found.
The specific implementation of each module of the device in this embodiment, and the error detection and recovery of the linked lists and their backups, follow Embodiment 1 and Embodiment 2.
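The per-level linked lists and their scheduling can be sketched as a small data structure. This is an illustrative model only, not the FPGA design: the field names, the round-robin grouping of lower-level regions, and the method `regions_for_interrupt` are assumptions, and a plain Python list stands in for the linked list.

```python
from dataclasses import dataclass, field


@dataclass
class RegionEntry:
    """One linked-list node: where a protected region and its backups live."""
    position: int            # start address of the protected region
    length: int              # size of the region in bytes
    backup_positions: tuple  # start address of each backup (at least two)


@dataclass
class LevelList:
    """All protected regions registered at one level (1 = highest)."""
    level: int
    regions: list = field(default_factory=list)
    groups: int = 1  # interrupt cycles over which one full pass is spread

    def regions_for_interrupt(self, tick: int) -> list:
        """Regions to check in interrupt number `tick`.

        The highest level checks every region in every interrupt; other
        levels check one group per interrupt (hypothetical round-robin).
        """
        if self.level == 1 or self.groups <= 1:
            return list(self.regions)
        return self.regions[tick % self.groups::self.groups]
```

With five LV2 regions and `groups=2`, for instance, even-numbered interrupts cover regions 0, 2, and 4 and odd-numbered interrupts cover regions 1 and 3, so a full pass takes two interrupt cycles.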
The device of this embodiment performs error detection and recovery of the RAM space. It can locate the position of the erroneous data and complete recovery within one interrupt tick, which improves the real-time performance of the system's handling of RAM abnormalities and reduces their impact on the system.
Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to its embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in them, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by that processor produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make further improvements and modifications without departing from the technical principles of the present invention, and such improvements and modifications shall also fall within the scope of protection of the present invention.

Claims (17)

  1. A real-time soft error detection and recovery method with online parallel processing, characterized by comprising the following process:
    dividing the protected regions of the protected RAM space into one or more levels, where regions at the highest level complete one round of error detection and recovery in every interrupt cycle and regions at the other levels complete one round over multiple interrupt cycles;
    registering the protected regions of each level to generate as many linked lists as there are levels, the linked lists being located in the protected RAM space; and backing up each linked list and the protected regions in it, in at least two copies, to other space;
    processing, in parallel, the error detection and recovery of the linked list of each level and of the protected regions in it.
  2. The real-time soft error detection and recovery method with online parallel processing according to claim 1, characterized in that the method is executed by a parallel processing module, which accesses the protected RAM space through a high-speed interface and uses independent DDR, SRAM, or RAM space for backup storage.
  3. The real-time soft error detection and recovery method with online parallel processing according to claim 1, characterized in that the error detection and recovery of the linked list of any level and of the protected regions in it proceeds as follows:
    performing error detection on the linked list and its backups, and recovering from the abnormality if the detection result is abnormal;
    performing error detection on each protected region in the linked list and its backups, and recovering from the abnormality if the detection result is abnormal.
  4. The real-time soft error detection and recovery method with online parallel processing according to claim 3, characterized in that performing error detection on the linked list and its backups comprises:
    reading the linked list and its backups;
    verifying the linked list and its backups using any one of, or a combination of several of, verification methods including SM3 signature, MD5 digest, and BCC check code;
    comparing the judgment results of the verification methods to determine the correctness of the linked list and its backups.
  5. The real-time soft error detection and recovery method with online parallel processing according to claim 4, characterized in that, when a combination of several verification methods is used, the verification processes of the individual methods run in parallel.
  6. The real-time soft error detection and recovery method with online parallel processing according to claim 4, characterized by further comprising: performing communication process verification on the linked list and its backups as read.
  7. The real-time soft error detection and recovery method with online parallel processing according to claim 4, characterized in that verifying the linked list and its backups with one verification method comprises:
    taking the SM3 signature as the example verification method:
    computing the SM3 signatures of the linked list and its backups, and comparing them;
    if at least two SM3 signatures are identical, judging all copies of the data whose SM3 signatures are identical to be normal and the other copies to be abnormal.
  8. The real-time soft error detection and recovery method with online parallel processing according to claim 7, characterized in that comparing the judgment results of the verification methods to determine the correctness of the linked list and its backups comprises:
    comparing the judgment results of the three verification methods, SM3 signature, MD5 digest, and BCC check code, for the linked list and its backups; if two or more of the verification methods give the same judgment result, taking that result as the final adjudication result.
  9. The real-time soft error detection and recovery method with online parallel processing according to claim 3, characterized in that performing error detection on each protected region in the linked list and its backups comprises:
    when the linked list is of the highest level:
    performing error detection on each protected region in the linked list and its backups in turn, where performing error detection on any one protected region in the linked list and its backups comprises:
    reading the current protected region in the linked list and its backups;
    verifying the current protected region in the linked list and its backups using any one of, or a combination of several of, verification methods including SM3 signature, MD5 digest, and BCC check code;
    comparing the judgment results of the verification methods to determine the correctness of the current protected region in the linked list and its backups.
  10. The real-time soft error detection and recovery method with online parallel processing according to claim 3, characterized in that performing error detection on each protected region in the linked list and its backups comprises:
    when the linked list is of another level:
    dividing all protected regions in the linked list into multiple groups;
    in each interrupt cycle, performing error detection on any protected region in any one group of the linked list and its backups, comprising:
    reading the current protected region in the group currently being checked and its backups;
    verifying the current protected region in the group currently being checked and its backups using any one of, or a combination of several of, verification methods including SM3 signature, MD5 digest, and BCC check code;
    comparing the judgment results of the verification methods to determine the correctness of the current protected region in the group currently being checked and its backups.
  11. The real-time soft error detection and recovery method with online parallel processing according to claim 10, characterized in that, when a combination of several verification methods is used, the verification processes of the individual methods run in parallel.
  12. The real-time soft error detection and recovery method with online parallel processing according to claim 10, characterized by further comprising: performing communication process verification on the current protected region in the linked list and its backups as read.
  13. The real-time soft error detection and recovery method with online parallel processing according to claim 10, characterized in that verifying the current protected region in the linked list and its backups with one verification method comprises:
    taking the SM3 signature as the example verification method:
    computing the SM3 signatures of the current protected region in the linked list and its backups, and comparing them;
    if at least two SM3 signatures are identical, judging all copies of the data whose SM3 signatures are identical to be normal and the other copies to be abnormal.
  14. The real-time soft error detection and recovery method with online parallel processing according to claim 13, characterized in that comparing the judgment results of the verification methods to determine the correctness of the current protected region in the linked list and its two backups A and B comprises:
    comparing the judgment results of the three verification methods, SM3 signature, MD5 digest, and BCC check code, for the current protected region in the linked list and its two backups A and B; if two or more of the verification methods give the same judgment result, taking that result as the final adjudication result.
  15. The real-time soft error detection and recovery method with online parallel processing according to claim 3, characterized in that recovering from the abnormality comprises:
    dividing the abnormal data area and the normal data area, in order, into N basic data blocks each;
    computing the CRC check codes $C_{en}$ of the abnormal data area, where $n = 0, 1, 2, 3, \ldots, m$ and $N/2 \le 2^m < N$; $C_{en}$ is computed by taking $P_n$ basic data blocks starting from the first basic data block, then repeatedly skipping $P_n$ basic data blocks and taking the next $P_n$ basic data blocks, and computing the CRC value with all the basic data blocks taken as the data source, the number of basic data blocks in each interval being $P_n = 2^n$;
    computing the CRC check codes $C_{cn}$ of the normal data area, where $C_{cn}$ is computed by taking $P_n$ basic data blocks starting from the first basic data block, then repeatedly skipping $P_n$ basic data blocks and taking the next $P_n$ basic data blocks, and computing the CRC value with all the basic data blocks taken as the data source;
    judging whether the two CRCs $C_{cm}$ and $C_{em}$ are equal: if they are equal, updating the abnormal data marker position to point to the second half of the abnormal data area; if they are not equal, updating the abnormal data marker position to point to the first half of the abnormal data area;
    repeating the above judgment for the values $C_{cn}$ and $C_{en}$, with $n$ decreasing from $m-1$ to 0, and updating the abnormal data marker position at each step; after the step $n = 0$ has been judged, the abnormal data marker position has narrowed to the basic data block of the abnormal data area in which the abnormal data lie;
    copying the content of the correct basic data block in the normal data area into the abnormal basic data block in the abnormal data area pointed to by the abnormal data marker position, which completes recovery of that abnormal basic data block;
    computing the CRC values of all basic data blocks in the normal data area and in the abnormal data area and comparing the two CRC values: if they are not equal, repeating all of the above steps; if they are equal, ending the abnormal data recovery.
  16. The real-time soft error detection and recovery method with online parallel processing according to claim 15, characterized in that, after recovery of the abnormal basic data block is completed, the method further comprises: re-reading the abnormal basic data block; computing the CRC value of the re-read abnormal basic data block and the CRC value of the corresponding correct basic data block; and comparing the two CRC values, where, if they are equal, recovery of the abnormal basic data block is complete.
  17. A real-time soft error detection and recovery system with online parallel processing, comprising a linked list management module and an error detection and recovery module, wherein:
    the linked list management module is configured to divide the protected regions of the protected RAM space into one or more levels, where regions at the highest level complete one round of error detection and recovery in every interrupt cycle and regions at the other levels complete one round over multiple interrupt cycles; to register the protected regions of each level to generate as many linked lists as there are levels, the linked lists being located in the protected RAM space; and to back up each linked list and the protected regions in it, in at least two copies, to other RAM space;
    the error detection and recovery module is configured to process, in parallel, the error detection and recovery of the linked list of each level and of the protected regions in it.
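The block-localization procedure of claim 15 can be read as a binary search over the bits of the faulty block's index: the CRC taken with stride $P_n = 2^n$ covers exactly the blocks whose index has bit $n$ equal to 0, so equal CRCs mean the fault lies in a block with bit $n$ equal to 1. The Python sketch below follows that reading. It is an interpretation under stated assumptions, not the patented FPGA logic: `zlib.crc32` is an arbitrary CRC choice, the function names are hypothetical, each pass assumes a single corrupted block, and the retry loop is bounded to avoid spinning when that assumption fails.

```python
import zlib


def _crc_of_blocks(data, block_size, indices):
    """CRC32 over the selected basic data blocks, in index order."""
    crc = 0
    for i in indices:
        crc = zlib.crc32(data[i * block_size:(i + 1) * block_size], crc)
    return crc


def locate_bad_block(normal, abnormal, block_size):
    """Index of the corrupted block, assuming exactly one block differs."""
    n_blocks = len(normal) // block_size
    m = (n_blocks - 1).bit_length() - 1  # largest m with 2**m < N
    index = 0
    for n in range(m, -1, -1):
        p = 1 << n  # P_n = 2**n: take P_n blocks, skip P_n blocks, repeat
        sampled = [i for i in range(n_blocks) if (i // p) % 2 == 0]
        same = (_crc_of_blocks(normal, block_size, sampled)
                == _crc_of_blocks(abnormal, block_size, sampled))
        if same:
            index |= p  # fault is among the skipped blocks (bit n is 1)
    return index


def recover(normal, abnormal, block_size):
    """Repair `abnormal` (a bytearray) in place; True when it matches."""
    assert len(normal) == len(abnormal) and len(normal) % block_size == 0
    n_blocks = len(normal) // block_size
    everything = range(n_blocks)
    for _ in range(n_blocks):  # bounded number of localization passes
        if (_crc_of_blocks(normal, block_size, everything)
                == _crc_of_blocks(abnormal, block_size, everything)):
            return True
        i = locate_bad_block(normal, abnormal, block_size)
        if i >= n_blocks:
            return False  # inconsistent result: more than one faulty block
        abnormal[i * block_size:(i + 1) * block_size] = \
            normal[i * block_size:(i + 1) * block_size]
    return (_crc_of_blocks(normal, block_size, everything)
            == _crc_of_blocks(abnormal, block_size, everything))
```

Under this reading, locating one faulty block among N takes about $\log_2 N$ pairs of CRC comparisons instead of N block-by-block comparisons, and only the faulty basic data block is rewritten.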
PCT/CN2021/074836 2020-08-21 2021-02-02 Online parallel processing soft error real-time error detection and recovery method and system WO2022037022A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2303510.8A GB2613120A (en) 2020-08-21 2021-02-02 Online parallel processing soft error real-time error detection and recovery method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010849103.0 2020-08-21
CN202010849103.0A CN112053737B (en) 2020-08-21 2020-08-21 Online parallel processing soft error real-time error detection and recovery method and system

Publications (1)

Publication Number Publication Date
WO2022037022A1 true WO2022037022A1 (en) 2022-02-24

Family

ID=73600711

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/074836 WO2022037022A1 (en) 2020-08-21 2021-02-02 Online parallel processing soft error real-time error detection and recovery method and system

Country Status (3)

Country Link
CN (1) CN112053737B (en)
GB (1) GB2613120A (en)
WO (1) WO2022037022A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421967A (en) * 2022-11-04 2022-12-02 中国电力科学研究院有限公司 Method and system for evaluating storage abnormal risk point of secondary equipment

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053737B (en) * 2020-08-21 2022-08-26 国电南瑞科技股份有限公司 Online parallel processing soft error real-time error detection and recovery method and system
CN115426028B (en) * 2022-08-29 2023-10-20 鹏城实验室 Fault tolerance method and system for data encoding and decoding and high-speed communication system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003323353A (en) * 2002-05-01 2003-11-14 Denso Corp Memory diagnostic device and control device
CN102356384A (en) * 2011-08-23 2012-02-15 华为技术有限公司 Method and device for data reliability detection
CN102779557A (en) * 2011-05-13 2012-11-14 苏州雄立科技有限公司 Method and system for data detection and correction of memory module integrated chip
US20190088350A1 (en) * 2017-09-21 2019-03-21 Canon Kabushiki Kaisha Information processing apparatus, control method thereof, and storage medium
CN111552590A (en) * 2020-04-16 2020-08-18 国电南瑞科技股份有限公司 Detection and recovery method and system for memory bit overturning of power secondary equipment
CN112053737A (en) * 2020-08-21 2020-12-08 国电南瑞科技股份有限公司 Online parallel processing soft error real-time error detection and recovery method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937375B (en) * 2010-08-27 2013-07-31 浙江大学 Code and data real-time error correcting and detecting method and device for pico-satellite central processing unit

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421967A (en) * 2022-11-04 2022-12-02 中国电力科学研究院有限公司 Method and system for evaluating storage abnormal risk point of secondary equipment
CN115421967B (en) * 2022-11-04 2022-12-30 中国电力科学研究院有限公司 Method and system for evaluating storage abnormal risk point of secondary equipment

Also Published As

Publication number Publication date
GB2613120A (en) 2023-05-24
CN112053737B (en) 2022-08-26
CN112053737A (en) 2020-12-08
GB202303510D0 (en) 2023-04-26

Similar Documents

Publication Publication Date Title
WO2022037022A1 (en) Online parallel processing soft error real-time error detection and recovery method and system
US5692121A (en) Recovery unit for mirrored processors
US6948091B2 (en) High integrity recovery from multi-bit data failures
JP5792380B2 (en) Apparatus and method for providing data integrity
US8589759B2 (en) RAM single event upset (SEU) method to correct errors
CN111552590B (en) Detection and recovery method and system for memory bit overturning of power secondary equipment
US6397357B1 (en) Method of testing detection and correction capabilities of ECC memory controller
CN100419695C (en) Vectoring process-kill errors to an application program
Argyrides et al. Matrix-based codes for adjacent error correction
US8996953B2 (en) Self monitoring and self repairing ECC
JP7418397B2 (en) Memory scan operation in response to common mode fault signals
Gottscho et al. Software-defined error-correcting codes
US9329926B1 (en) Overlapping data integrity for semiconductor devices
Tan et al. CFEDR: Control-flow error detection and recovery using encoded signatures monitoring
US20230214295A1 (en) Error rates for memory with built in error correction and detection
Shao et al. An error location and correction method for memory based on data similarity analysis
Karsli et al. Enhanced duplication: a technique to correct soft errors in narrow values
Kajmakovic et al. Challenges in Mitigating Errors in 1oo2D Safety Architecture with COTS Micro-controllers
US20230195565A1 (en) Multilevel Memory System with Copied Error Detection Bits
Shao et al. Fault Tolerance Method for Memory Based on Inner Product Similarity and Experimental Study on Heavy Ion Irradiation
RU2465636C1 (en) Method of correcting single errors and preventing double errors in register file and apparatus for realising said method
Du et al. Cache Tag Array Fault Tolerance Method Based on Redundancy and Similarity of Adjacent Cache Line Tag Bits
Li et al. A low-cost correction algorithm for transient data errors
Ramesh et al. Verification of Error Correction Codes for CPU Memories
Garg Soft error fault tolerant systems: cs456 survey

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21857128

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 202303510

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20210202

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21857128

Country of ref document: EP

Kind code of ref document: A1