WO2022037022A1

WO2022037022A1 - Online parallel processing soft error real-time error detection and recovery method and system

Info

Publication number: WO2022037022A1
Application number: PCT/CN2021/074836
Authority: WO
Inventors: 周华良; 郑玉平; 徐广辉; 李友军; 刘拯; 邹志杨; 姜雷; 高诗航; 汪世平; 张家森
Original assignee: 国电南瑞科技股份有限公司; 国电南瑞南京控制系统有限公司
Priority date: 2020-08-21
Filing date: 2021-02-02
Publication date: 2022-02-24
Also published as: GB2613120A; CN112053737B; CN112053737A; GB202303510D0

Abstract

An online, parallel-processing method and system for real-time detection of and recovery from soft errors. The method comprises: dividing a protected RAM space into multiple protected areas; dividing all of the protected areas into one or more levels, where the highest level completes error detection and recovery once in every interrupt cycle and each other level completes it once over multiple interrupt cycles; registering the protected areas of each level to generate one linked list per level, and backing up at least two copies of each linked list and of the protected areas in it to other RAM space; and processing error detection and recovery for each level's linked list and each protected area in it in parallel. In critical scenarios, the solution can verify, adjudicate, correct, and recover high-importance data in a control system within a single interrupt beat. It does not rely on the CPU itself, processes in parallel in real time, and can detect and correct errors online in real time even when multiple bits in the CPU and RAM flip abnormally at the same time.

Description

An online parallel processing soft error real-time error detection and recovery method and system

Technical field

The invention belongs to the technical field of error detection for abnormal bit flips in RAM, and in particular relates to a method and system for online, parallel-processing, real-time detection of and recovery from soft errors.

Background

With microprocessor technology developing toward low power consumption, low voltage, and high integration, abnormal bit flips in RAM (Random Access Memory), also known as soft errors or single event effects, have an impact on system security and stability that cannot be ignored. The main causes of RAM abnormalities are: (1) Alpha particle radiation. Alpha particles emitted by the packaging material of the processor cause abnormal bit flips in the memory area of the chip. (2) Growing integrated-circuit scale. Transistors keep getting smaller and frequencies higher, while the limit voltage of transistors keeps dropping and the noise tolerance keeps narrowing, which makes the processor more sensitive to crosstalk, voltage disturbance, and electromagnetic radiation and so reduces reliability. (3) Cosmic radiation. Before reaching the earth's surface, high-energy charged particles from space interact in a cascade with elements of the earth's atmosphere (mainly oxygen and nitrogen) and generate a large number of secondary neutrons. Direct ionization, or indirect ionization through nuclear reactions, produces a large number of electron-hole pairs. When the charge collected by a sensitive electrode of a microelectronic device exceeds the critical charge of the circuit, a single event effect occurs, which flips a memory bit and in turn causes the program to malfunction.

There are three main types of abnormal bit flips in RAM. The first is a data-unit flip: a single data bit of the memory is inverted, resulting in abnormal data. This type of error is common in static random access memory, dynamic random access memory, and non-volatile memory. The second is a transient flip of a data unit: because of voltage or current jitter, a single data bit of the memory is sampled and recorded by the circuit in its disturbed state and then recovers. This type mainly occurs in dynamic random access memory. The third is a multi-bit flip: multiple data bits in memory change at the same time.

If an abnormal bit flip occurs at a key position in RAM (for example, the code area or a key data area), it may cause a program error and in turn a system abnormality. If abnormal bit flips can be detected and the data recovered before a problem occurs, the stability and reliability of the system are greatly improved. As microprocessor technology develops and its range of applications expands, the harm done by abnormal bit flips in RAM grows and their scope of influence widens. How to deal with abnormal bit flips in RAM, improve the stability and reliability of the system, and ensure its safe operation is therefore attracting more and more attention across industries.

At present, various hardware and software methods have been proposed to detect and correct processor RAM storage anomalies (soft errors, or single event effects). On the hardware side, hardening measures are applied at the process and device levels: process-level hardening usually covers capacitance, resistance, structure, layout, doping, and latch circuits (DICE), while component-level hardening usually includes parity checking (Parity), error correction code (ECC), and interleaving (Interleave).

CPU-class devices support ECC (Error Correcting Code) checking of RAM in hardware. In TI's new processors and XILINX's new MPSoC processors, the hardware DDR controller, the on-chip RAM, and the L1/L2 cache all support ECC checking. By adding an 8-bit Hamming code to every 32 or 64 bits of data, ECC can correct a single-bit abnormal flip and raise an alarm for a two-bit flip, but it cannot correct two-bit flips, and it can neither raise an alarm for nor correct simultaneous flips of more bits.

Most of the above measures address soft-error protection at a single level. There are also software solutions for monitoring and detection. For example, the patent "An ADI DSP Code Online Monitoring Method" (Publication No. CN105446842A, 2016-03-30) adds code monitoring flags, based on the link mapping file of the DSP, to the code segments to be monitored in the loadable file (LDR) of the DSP. The DSP continuously reads the LDR in non-volatile memory; when it finds a code monitoring flag, it compares the code in non-volatile memory with the running code in RAM. If the two are inconsistent, it repeats the check; if the inconsistency is confirmed, it judges that the code running in RAM is in error, immediately raises an alarm, and records the error information to non-volatile memory for failure analysis. This method enables online monitoring of running code by simply modifying the LDR file. However, it cannot correct errors online, its real-time response is poor, and the monitoring function consumes CPU processing time. The error detection and correction program also depends on the CPU itself, cannot locate the erroneous data, and has no online recovery function.

For system-level soft-error protection in a single-CPU system, the prior art mainly has the following shortcomings:

(1) Because of hardware resource consumption, current hardware ECC checking only corrects single-bit flips and detects two-bit flips; it cannot raise an alarm when more bits flip simultaneously, and its correction capability is very limited;

(2) Hardware ECC is only available in newer processors. Not all processors have an ECC check function, so it cannot be applied across processor types, and its functions, performance, and actual reliability vary between processors;

(3) The CPU software approach only provides monitoring and alarm functions for the running program; there is no systematic scheme for monitoring and recovering RAM data, and in particular real-time performance is insufficient and the detection and recovery response is slow;

(4) Software-based detection and recovery consumes processor time and affects the normal operation of the processor. In addition, the error detection and correction program depends on the CPU itself and cannot locate the erroneous data;

(5) With CPU software-based detection and recovery, when the RAM holding the error detection and correction program is itself abnormal, the function cannot be guaranteed to work normally, and full coverage of the RAM space cannot be achieved.

Summary of the invention

The purpose of the present invention is to overcome the deficiencies of the prior art and to provide an online, parallel-processing method and system for real-time detection of and recovery from soft errors, so as to solve the problem of latent equipment faults or abnormal protection and control functions caused by abnormal memory bit flips in existing power secondary equipment.

In order to solve the above technical problem, the present invention provides an online parallel processing soft error real-time error detection and recovery method, including the following process:

Divide the protected RAM space into multiple protected areas;

Divide all protected areas into one or more levels: the highest level completes error detection and recovery once in every interrupt cycle, and each other level completes error detection and recovery once over multiple interrupt cycles;

Register the protected areas of each level to generate one linked list per level, the linked list being located in the protected RAM space; back up at least two copies of each linked list and of the protected areas in the linked list to other RAM space; the content of the linked list includes the position and length of each protected area of the same level and the position of each of its backups;

Process error detection and recovery for each level's linked list and each protected area in the linked list in parallel.

Further, the method is performed by a parallel processing module, the parallel processing module accesses the protected RAM space through a high-speed interface connection, and uses an independent DDR, SRAM, or RAM space for backup storage.

Further, the process of performing error detection and recovery on the linked list of any level and each protected area in the linked list is as follows:

Check the linked list and its backup for errors, and recover its abnormality;

Error detection is performed on each protected area and its backup in the linked list, and its abnormality is recovered.

Further, the error detection of the linked list and its backup includes:

Read the linked list and its backup;

The linked list and its backup are verified by using any one or more combination verification methods including SM3 signature information, MD5 information digest and BCC verification code;

Compare the judgment results of each verification method to determine the correctness of the linked list and its backup.

Further, when multiple combined verification methods are used, the verification processes of each verification method are processed in parallel.

Further, the method also includes verifying the communication process when reading the linked list and its backups: the linked list at each location is read at least three times in succession and a CRC check code is calculated for each read; at each location, data whose CRC check code is the same in at least two reads is judged to be correct, and one of the correct reads is taken as the correctly read data.
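The read-three-times rule above can be sketched in C as follows. This is a minimal illustration, not the patented implementation: the source does not name a CRC variant, so the common reflected CRC-32 is assumed, and the names `crc32_calc` and `pick_voted_read` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). The source does not
 * name a CRC variant, so this particular CRC is an assumption. */
static uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Given three consecutive reads of the same location, return the index
 * (0..2) of a read whose CRC matches at least one other read, or -1 if
 * all three CRCs differ. */
static int pick_voted_read(const uint8_t *const reads[3], size_t len)
{
    uint32_t crc[3];
    for (int i = 0; i < 3; i++)
        crc[i] = crc32_calc(reads[i], len);

    if (crc[0] == crc[1] || crc[0] == crc[2])
        return 0;
    if (crc[1] == crc[2])
        return 1;
    return -1;
}
```

A caller would pass the three buffers filled by consecutive reads over the high-speed interface and use the buffer at the returned index; a result of -1 means the transfer itself is unreliable and the read should be retried.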

Further, a verification method is used to verify the linked list and its two backups A and B, including:

The verification method is described by taking SM3 signature information as an example:

Calculate the SM3 signature information of the linked list and of its two backups A and B, and compare them;

If at least two copies have the same SM3 signature information, all copies with that same SM3 signature information are judged to be normal, and the remaining copy is judged to be abnormal.
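The copy-level judgment for one verification method can be sketched as below. SM3 is not part of the C standard library, so this sketch substitutes a 64-bit FNV-1a hash purely as a stand-in digest; the function and enum names are hypothetical and only the comparison logic mirrors the text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in digest (64-bit FNV-1a). This is NOT SM3; it only takes the
 * place of the signature calculation so that the vote can be shown. */
static uint64_t digest64(const uint8_t *p, size_t len)
{
    uint64_t h = 14695981039346656037ull;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ull;
    }
    return h;
}

enum { COPY_ORIG, COPY_A, COPY_B, COPY_COUNT };

/* Marks each copy normal (true) or abnormal (false): a copy is normal
 * when its digest equals that of at least one other copy. Returns the
 * index of a normal copy usable as recovery source, or -1 if all three
 * digests differ. */
static int judge_copies(const uint8_t *const copy[COPY_COUNT], size_t len,
                        bool normal[COPY_COUNT])
{
    uint64_t d[COPY_COUNT];
    for (int i = 0; i < COPY_COUNT; i++) {
        d[i] = digest64(copy[i], len);
        normal[i] = false;
    }
    for (int i = 0; i < COPY_COUNT; i++)
        for (int j = i + 1; j < COPY_COUNT; j++)
            if (d[i] == d[j])
                normal[i] = normal[j] = true;

    for (int i = 0; i < COPY_COUNT; i++)
        if (normal[i])
            return i;
    return -1;
}
```

Any copy left marked abnormal would then be rewritten from the returned normal copy.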

Further, the comparison of the judgment results of each verification method determines the correctness of the linked list and its backup, including:

Compare the judgment results obtained for the linked list and its backups with SM3 signature information, MD5 information digest, and BCC verification code. If the judgment results of two or more verification methods are consistent, the consistent result is taken as the final judgment result.
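The adjudication between methods can be illustrated with a short C sketch. The bitmask encoding of a judgment result (one bit per copy, set when that copy is judged abnormal) and the names `verdict_t` and `adjudicate` are assumptions made for the example; the source only states that two or more consistent results decide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Judgment result of one verification method. Assumed encoding: bit 0 =
 * original, bit 1 = backup A, bit 2 = backup B; a set bit means that
 * copy was judged abnormal. */
typedef uint8_t verdict_t;

/* Two-out-of-three adjudication across the three methods. Stores the
 * result shared by at least two methods in *final_verdict and returns
 * true, or returns false when all three methods disagree. */
static bool adjudicate(verdict_t sm3, verdict_t md5, verdict_t bcc,
                       verdict_t *final_verdict)
{
    if (sm3 == md5 || sm3 == bcc) {
        *final_verdict = sm3;
        return true;
    }
    if (md5 == bcc) {
        *final_verdict = md5;
        return true;
    }
    return false;
}
```

With this encoding, a final verdict of 0 means all copies are normal, and any set bit selects a copy to be recovered.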

Further, performing error detection on each protected area and its backup in the linked list includes:

If the linked list level is the highest level:

Perform error detection on each protected area and its backup in the linked list in turn, wherein, perform error detection on any protected area and its backup in the linked list, including:

Read the current protected area in the linked list and its backup;

Use any one or more combined verification methods including SM3 signature information, MD5 information digest and BCC verification code to verify the currently protected area and its backup in the linked list;

Compare the judgment results of each verification method to determine the correctness of the currently protected area and its backup in the linked list.

When the linked list level is another level:

Divide all protected areas in the linked list into multiple groups,

In each interrupt cycle, error detection is performed in sequence on each protected area and its backups in one group of the linked list, where error detection on any protected area and its backups in a group includes:

Read the current protected area and its backup in the current detection group in the linked list;

Use any one or more combined verification methods including SM3 signature information, MD5 information digest and BCC verification code to verify the currently protected area and its backup in the current detection group in the linked list;

Comparing the judgment results of each verification method, determine the correctness of the currently protected area and its backup in the current detection group in the linked list.

Further, the method also includes verifying the communication process when reading the current protected area in the linked list and its backups: the data at each location is read at least three times in succession and a CRC check code is calculated for each read; at each location, data whose CRC check code is the same in at least two reads is judged to be correct, and one of the correct reads is taken as the correctly read data.

Further, a verification method is used to verify the currently protected area and its backup in the linked list, including:

Calculate the SM3 signature information of the current protected area in the linked list and of its backups, and compare them;

Further, the comparison of the judgment results of each verification method determines the correctness of the currently protected area and its backup in the linked list, including:

Compare the judgment results obtained for the current protected area in the linked list and its backups with the three verification methods (SM3 signature information, MD5 information digest, and BCC verification code). If the judgment results of two or more verification methods are consistent, the consistent result is taken as the final judgment result.

Further, the recovery of the exception includes:

Divide the abnormal data area and the normal data area into N basic data blocks in sequence;

Calculate the data CRC check codes $C_{en}$ of the abnormal data area, where $n = 0, 1, 2, 3, \ldots, m$ and $N/2 \le 2^m < N$. $C_{en}$ is calculated as follows: starting from the first basic data block, take $P_n$ basic data blocks, then skip $P_n$ basic data blocks and take the next $P_n$ basic data blocks, and so on; all the basic data blocks taken out are used as the data source for calculating the CRC value. The interval is $P_n = 2^n$ basic data blocks;

Calculate the data CRC check codes $C_{cn}$ of the normal data area, where $n = 0, 1, 2, 3, \ldots, m$. $C_{cn}$ is calculated in the same way: starting from the first basic data block, take $P_n$ basic data blocks, then skip $P_n$ basic data blocks and take the next $P_n$ basic data blocks, and so on; all the basic data blocks taken out are used as the data source for calculating the CRC value;

Determine whether the two CRCs $C_{cm}$ and $C_{em}$ are equal: if they are equal, the abnormal data is located in the second half of the abnormal data area, and the abnormal data mark position is updated to point to the second half of the abnormal data area; if they are not equal, the first half of the abnormal data area must contain abnormal data, and the abnormal data mark position is updated to point to the first half of the abnormal data area;

Repeat the above judgment for $C_{cn}$ and $C_{en}$, with $n$ decreasing from $m-1$ to 0 and the abnormal data mark position updated at each step; after the $n = 0$ step, the abnormal data mark position has been narrowed down to the basic data block of the abnormal data area in which the abnormal data is located;

Copy the content of the correct basic data block in the normal data area, through the high-speed interface, to the abnormal basic data block in the abnormal data area pointed to by the abnormal data mark position; the abnormal basic data block is thereby restored;

Calculate the CRC values over all basic data blocks of the normal data area and of the abnormal data area, and compare whether the two CRC values are equal. If they are not equal, abnormal data still remains, and all the above steps are repeated to locate and recover the other abnormal data; if they are equal, abnormal data recovery ends.

Further, after the abnormal basic data block has been recovered, the method further includes: re-reading the abnormal basic data block; calculating the CRC value of the re-read block and the CRC value of the corresponding correct basic data block; and comparing whether the two CRC values are equal; if so, recovery of this abnormal basic data block is complete.
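The locating and recovery steps above can be sketched in C as follows. This is an illustrative reading of the procedure, not the patented implementation: the block size, the CRC-32 variant, and all function names are assumptions. Picking blocks with an interval of 2^n selects exactly the blocks whose index has bit n clear, so each comparison fixes one bit of the faulty block index. The search assumes one corrupted block per pass; the outer loop repeats for further faults and gives up after N passes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16u  /* bytes per basic data block (assumed value) */

/* Feed len bytes into a running CRC-32 (reflected polynomial 0xEDB88320;
 * the CRC variant is an assumption, the source does not name one). */
static uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* C_n of the text: CRC over the blocks picked with P_n = 2^n (take 2^n
 * blocks, skip 2^n blocks, ...), i.e. blocks whose index has bit n clear. */
static uint32_t stride_crc(const uint8_t *area, size_t nblocks, unsigned n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < nblocks; i++)
        if (((i >> n) & 1u) == 0)
            crc = crc32_update(crc, area + i * BLOCK_SIZE, BLOCK_SIZE);
    return crc;
}

/* Locate the corrupted block of bad[] by comparing C_en with C_cn for
 * n = m down to 0. Assumes exactly one block differs from good[]. */
static size_t locate_bad_block(const uint8_t *bad, const uint8_t *good,
                               size_t nblocks)
{
    unsigned m = 0;
    while (((size_t)2 << m) < nblocks)  /* smallest m with N/2 <= 2^m */
        m++;

    size_t index = 0;
    for (unsigned n = m + 1; n-- > 0; ) {
        /* Equal CRCs: the covered blocks are clean, so the fault lies in
         * the other half of the current range (index bit n is 1). */
        if (stride_crc(bad, nblocks, n) == stride_crc(good, nblocks, n))
            index |= (size_t)1 << n;
    }
    return index;
}

/* Restore bad[] from good[] one located block at a time. Returns the
 * number of blocks copied, or -1 if the areas still differ after
 * nblocks passes (for example when several faults confuse the search). */
static int recover_area(uint8_t *bad, const uint8_t *good, size_t nblocks)
{
    size_t bytes = nblocks * BLOCK_SIZE;
    int copied = 0;

    for (size_t pass = 0; pass < nblocks; pass++) {
        if (crc32_update(0xFFFFFFFFu, bad, bytes) ==
            crc32_update(0xFFFFFFFFu, good, bytes))
            return copied;
        size_t i = locate_bad_block(bad, good, nblocks);
        if (i >= nblocks)
            return -1;
        memcpy(bad + i * BLOCK_SIZE, good + i * BLOCK_SIZE, BLOCK_SIZE);
        copied++;
    }
    return crc32_update(0xFFFFFFFFu, bad, bytes) ==
           crc32_update(0xFFFFFFFFu, good, bytes) ? copied : -1;
}
```

Because the m+1 check codes of each area depend only on that area's data, hardware could compute them all in a single pass, which is presumably why the procedure uses fixed interval patterns rather than re-reading a shrinking range at each step.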

Correspondingly, the present invention also provides an online parallel processing soft error real-time error detection and recovery system, including a linked list management module and an error detection recovery module, wherein:

The linked list management module is used to divide the protected RAM space into multiple protected areas; divide all protected areas into one or more levels, where the highest level completes error detection and recovery once in every interrupt cycle and each other level completes error detection and recovery once over multiple interrupt cycles; register the protected areas of each level to generate one linked list per level, the linked list being located in the protected RAM space; and back up at least two copies of each linked list and of the protected areas in the linked list to other RAM space; the content of the linked list includes the position and length of each protected area of the same level and the position of each of its backups;

The error detection and recovery module is used to process the error detection and recovery of each level of linked list and each protected area in the linked list in parallel: the process of error detection and recovery of any level of linked list and each protected area in the linked list is as follows:

Check the linked list and its backup for errors, and recover its abnormality;

Compared with the prior art, the beneficial effects achieved by the present invention are:

(1) The present invention can detect and recover online from simultaneous changes of multiple or even all bits in a designated storage area, solves the problem that hardware ECC cannot detect and correct abnormal flips of multiple bits, is more powerful, and improves the robustness and stability of system program operation.

(2) The present invention can detect and recover from abnormal bit flips in RAM without relying on the hardware ECC function of the CPU processor, and provides a feasible method of detecting abnormal RAM bit flips for processor systems that lack hardware ECC.

(3) The abnormality detection and recovery function of the present invention for system RAM can locate the erroneous data to a specific area and can complete recovery within one interrupt cycle, which improves the real-time performance of the system in handling RAM abnormalities and reduces their effect on the system.

(4) The present invention uses the FPGA parallel processing module to assist in verifying, correcting, and recovering the processor RAM without occupying the processor's CPU time, and realizes parallel detection, correction, and recovery of RAM abnormalities.

(5) The detection and correction of the entire protected RAM space are assisted by the FPGA parallel processing module.

Description of drawings

Fig. 1 is the system schematic diagram of the present invention;

Fig. 2 is the schematic diagram of linked list data of the present invention;

Fig. 3 is the content schematic diagram of linked list of the present invention;

Fig. 4 is the schematic diagram of FPGA parallel processing module checking and adjudication process of the present invention;

Fig. 5 is the FPGA parallel processing module adjudication function schematic diagram of the present invention;

Fig. 6 is a schematic diagram of data block division and check code calculation in the FPGA parallel processing module positioning function of the present invention;

Fig. 7 is a schematic diagram of the abnormal data position positioning function in the FPGA parallel processing module positioning function of the present invention;

Fig. 8 is a schematic diagram of the data recovery function of the FPGA parallel processing module of the present invention;

Fig. 9 is a time sequence diagram of the verification and recovery of the FPGA parallel processing module of the present invention.

Detailed description

The present invention will be further described below in conjunction with the accompanying drawings. The following examples are only used to illustrate the technical solutions of the present invention more clearly, and cannot be used to limit the protection scope of the present invention.

Example 1

A kind of online parallel processing soft error real-time error detection and recovery method of the present invention comprises the following process:

Divide the protected RAM space into multiple protected areas;

Register the protected area of each level to generate a linked list corresponding to the number of levels, and the linked list is located in the protected RAM space; backup at least two copies A and B of each linked list and the protected area in the linked list to other RAM spaces; the contents of the linked list include: The position and length of each protected area of the same level and the positions of the two backups A and B;

Parallel processing of error detection and recovery of each level of linked list and each protected area in the linked list: The process of error detection and recovery of each level of linked list and each protected area in the linked list is as follows:

Check the linked list and its two backups A and B, and restore the abnormality;

Error detection is performed on each protected area and its A and B backups in the linked list, and its abnormality is recovered.

The present invention detects and recovers from abnormalities of the system RAM; it can locate the erroneous data and complete recovery within one interrupt cycle, which improves the real-time performance of the system in handling RAM abnormalities and reduces the influence of RAM abnormalities on the system.

Example 2

To let a parallel processing module handle the data in parallel (for example an FPGA processing module; such modules are collectively referred to below as the FPGA parallel processing module), the FPGA parallel processing module in the system needs to access the RAM of the CPU system through a high-speed interface such as PCIe, SRIO, or HyperLink. At the same time, to minimize the impact on the CPU system while the FPGA parallel processing module assists in processing, the module needs to independently control the memory used for backing up data, such as a separate DDR, a separate SRAM, or the internal RAM of the FPGA parallel processing module. If the CPU processor has ample RAM and the high-speed interface has sufficient bandwidth, a separate space can instead be set aside in the CPU processor RAM for the FPGA parallel processing module to store backup data.

On the hardware side, a dedicated FPGA parallel processing circuit module can be designed, or the function can be realized with an FPGA parallel processing module already present in the CPU control system, or on an SoC with integrated FPGA parallel processing circuitry.

On the software side, when the FPGA parallel processing module co-processes online error detection and recovery, two things must be considered. The first is the correctness of data transmission, for which the present invention provides a verification mechanism at the communication level. The second is the accuracy of the verification algorithm, for which the present invention adopts three different verification principles (SM3 signature information, MD5 digest information, and BCC verification code) to prevent homologous errors from occurring.
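The value of combining different verification principles can be seen from the simplest of the three. The sketch below computes a BCC as the XOR of all bytes, which is a common definition but an assumption here, since the source does not spell out its BCC rule; `bcc_calc` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Block check character, computed here as the XOR of all bytes; the
 * source does not define the exact BCC rule, so this is an assumption. */
static uint8_t bcc_calc(const uint8_t *data, size_t len)
{
    uint8_t bcc = 0;
    for (size_t i = 0; i < len; i++)
        bcc ^= data[i];
    return bcc;
}
```

With this definition, two flips in the same bit position of different bytes cancel out: the byte sequences 0x12 0x34 0x56 and 0x13 0x35 0x56 both yield the BCC 0x70. A digest built on a different principle would be expected to catch such a case, which is one way to read the motivation for adjudicating between several methods.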

In the real-time error detection and recovery method for online parallel processing of soft errors according to the present invention, as shown in FIG. 1, the CPU processor and the FPGA parallel processing module in the control system are connected through a high-speed interface (such as PCIe, SRIO, or HyperLink; PCIe is used in the following description), and the FPGA parallel processing module controls an independent RAM space for backing up programs or data. The method includes the following process:

S1: Divide the protected RAM space into multiple protected areas.

According to the usage of the RAM space of the protected CPU system, the protected RAM space is divided into multiple protected areas. When the program code is compiled, the RAM location of each protected area is fixed through the link mapping file, which is the description file that specifies the address of each program segment at build time. Its contents include the memory addresses and lengths in the processor system and the name, length, and memory address of each memory area. In the program source code, the storage location of selected program parts can be specified with a compiler pragma (#pragma DATA_SECTION(program name, "storage area name")).

The FPGA parallel processing module directly reads and writes to the protected CPU system RAM space through a high-speed bus (such as PCIe); and controls the independent RAM space for content backup of the protected area, which can only be read and written by the FPGA parallel processing module alone.

To ensure real-time error detection and recovery for the protected RAM space, the protected areas are managed hierarchically: they are divided into multiple levels according to their importance. LV1 is the highest level and has the highest error detection and recovery frequency, completing error detection and recovery once in every interrupt cycle; the frequency decreases level by level for the other levels. If the protected area of the system is small enough for the FPGA to complete error detection and recovery within one interrupt cycle, a single LV1 level is sufficient; if the protected areas are not very important to the system and it can tolerate error detection and recovery being completed once over multiple interrupts, a single lower LV level is also possible. For convenience of description, the rest of this embodiment is described in terms of two levels (LV1/LV2).

According to the importance of the protected areas in the system, in the embodiment of the present invention the protected areas are divided into two levels: level 1 (LV1) and level 2 (LV2), of which LV1 is the higher. For the LV1 protected areas, each interrupt beat completes checking and error correction of the content of the entire level; the LV2 protected areas are divided into multiple groups, each interrupt beat completes checking and error correction of the content of one group, and the whole level is completed over multiple interrupts. When the system is initialized, each protected area is registered to generate a two-level linked list. The two linked lists store the information of the LV1 and LV2 protected RAM areas respectively and are themselves stored in the RAM; the FPGA uses them to find the location and size of each protected area.

S2: Register all protected areas in the protected RAM space into the level 1 (LV1) and level 2 (LV2) linked lists according to their importance; the FPGA parallel processing module backs up the linked lists (LV1/LV2) and the protected areas they point to into the independent RAM, keeping at least two backup copies, A and B.

The number of copies of backup data can be determined according to the actual situation, and at least two copies are backed up. When the backup is two copies, the rule of taking two out of three is used to determine which data is correct or abnormal, or three copies can be backed up, and the rule of taking three out of four or two out of four is used to determine. In the present invention, two backup copies A and B are taken as an example for detailed description.

After system initialization is completed, the initial backup function module of the FPGA initializes, in the independent RAM, the contents of the two backup areas A and B of the linked lists and of the protected areas according to the LV1/LV2 linked lists.

The linked list data at the three locations is identical and includes the starting address and length of each protected area and the locations of its two backups A and B.

Registration principle: the protected areas pointed to by the LV1 and LV2 linked lists are registered in segments of at most 1K each; the LV1 and LV2 linked lists are stored as structure arrays, as shown in Figure 2. Each item in a linked list is a BD that records one protected area, and its data format is:

typedef struct {
    UIN32 length;      /* linked list entry length: 0 means unused, any other value is the data length */
    UIN32 addr;        /* address of the data in processor RAM */
    UIN32 back_addr_A; /* offset of backup A within the independent RAM controlled by the FPGA parallel processing module */
    UIN32 back_addr_B; /* offset of backup B within the independent RAM controlled by the FPGA parallel processing module */
} BDTable;

The parameter correspondence is shown in Figure 3: length is the length of the protected area; addr is the starting address of the protected area, which is an address in the protected RAM space, that is, the protected area pointed to by this address lies in the protected RAM space; back_addr_A is the starting address of backup A in the independent RAM controlled by the FPGA, that is, the area pointed to by this address lies in the independent RAM; back_addr_B is the starting address of backup B in the independent RAM controlled by the FPGA, that is, the area pointed to by this address also lies in the independent RAM.
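The registration rule (segments of at most 1K, a structure array in which a length of 0 marks an unused entry) can be sketched in C as follows. The typedef of `UIN32`, the helper name `bd_register`, and the contiguous layout of the two backups are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t UIN32;  /* assumption: UIN32 is a 32-bit unsigned type */

typedef struct {
    UIN32 length;        /* 0 means unused, otherwise the data length */
    UIN32 addr;          /* address of the data in processor RAM */
    UIN32 back_addr_A;   /* offset of backup A in the independent RAM */
    UIN32 back_addr_B;   /* offset of backup B in the independent RAM */
} BDTable;

#define SEG_MAX 1024u    /* one BD covers at most 1K */

/* Hypothetical helper: registers [addr, addr + len) as segments of at
 * most SEG_MAX bytes, one BD per segment, using free entries of list[].
 * The backups are assumed to be laid out contiguously from back_a and
 * back_b. Returns the number of entries written, or -1 if len is 0 or
 * the list has too few free entries (the list is then left unchanged). */
static int bd_register(BDTable *list, size_t cap, UIN32 addr, UIN32 len,
                       UIN32 back_a, UIN32 back_b)
{
    size_t needed = ((size_t)len + SEG_MAX - 1u) / SEG_MAX;
    size_t free_slots = 0;
    for (size_t i = 0; i < cap; i++)
        if (list[i].length == 0)
            free_slots++;
    if (needed == 0 || free_slots < needed)
        return -1;

    UIN32 off = 0;
    for (size_t i = 0; i < cap && off < len; i++) {
        if (list[i].length != 0)
            continue;                       /* entry already in use */
        UIN32 seg = (len - off < SEG_MAX) ? len - off : SEG_MAX;
        list[i].length = seg;
        list[i].addr = addr + off;
        list[i].back_addr_A = back_a + off;
        list[i].back_addr_B = back_b + off;
        off += seg;
    }
    return (int)needed;
}
```

Registering a 2500-byte area, for instance, fills three entries with lengths 1024, 1024, and 452.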

For brevity, "the protected area in the linked list" refers to the protected area pointed to by the address in a BD entry of the linked list.

S3: When the system is powered on and initialized, complete the system configuration: configure the PCIe interface and map the protected RAM space into the PCIe address space of the FPGA parallel processing module; then, through the registers of the FPGA parallel processing module, configure the number of linked-list entries processed in each interrupt. For LV1 this is the total number of entries; for LV2 it is the total number of entries divided by P, so that all LV2 processing is completed over P interrupts.

The interrupt period varies between 500 and 1000 µs. Completing abnormal-displacement detection and recovery in every interrupt improves the response time of detection and recovery, and improves the ability of power protection equipment to resist abnormal behavior (misoperation or refusal to operate).

As shown in Figure 4, the detection process first checks the LV1/LV2 linked-list areas and recovers any abnormality in them. This ensures that the linked-list information is correct, so that the corresponding protected areas can subsequently be detected and recovered according to the information recorded in the LV1/LV2 linked lists. There is no ordering requirement between error detection and recovery at the LV1 level and at the LV2 level. Within a single level, however, error detection and recovery of the linked list and of its corresponding protected areas must be performed serially. For ease of understanding, the embodiment of the present invention first describes the error detection and recovery of the LV1 and LV2 linked lists, and then describes the error detection and recovery of the protected areas in the LV1 and LV2 linked lists.

The following S4-S8 steps are the specific process of judging the correctness of the LV1 linked list and its two backups A and B, as shown in Figure 5, including:

S4: The FPGA parallel processing module reads the LV1 linked list and its two backups A and B at the interrupt rhythm. The linked list at each position is read three times in succession, and a CRC check code is calculated for each read; for each position, one of the three reads is selected as the correctly read data according to the two-out-of-three rule;

The data of all three positions is verified during the communication process as follows: a CRC check is performed on each of the three reads of the linked-list data from each position. If the three CRC check codes of a position all differ, the read function or the system hardware is abnormal; the device must be locked and the system restarted. If two of the CRC check codes are the same, those reads are correct, and either of them is taken as the correctly read data.

The purpose of communication process verification is to prevent misjudgment caused by errors in the communication process, which is an optional function. If the system does not consider the error of the communication interface, this verification process can be removed, that is, read the data once as the correct data, and make subsequent judgments.

It should be noted that the number of readings can be three or more times, and when the number of readings is four, the rule of taking two out of four or three out of four can be used to determine the correct read data.
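
The communication-process check of S4 can be sketched as follows. This is a minimal illustration rather than the FPGA logic itself: the names crc32_update and pick_consistent_read are hypothetical, a software CRC-32 (reflected polynomial 0xEDB88320) stands in for whichever CRC the module actually uses, and the three reads of one position are passed in as memory buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). Pass 0 as the initial
 * value; the result of one call may be fed into the next to chain buffers. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Two-out-of-three selection over three reads of the same position.
 * Returns the index (0..2) of a read whose CRC matches at least one other
 * read, or -1 if all three CRCs differ (read function or hardware fault:
 * the device would be locked and the system restarted). */
int pick_consistent_read(const uint8_t *reads[3], size_t len)
{
    uint32_t crc[3];

    assert(reads != NULL);
    for (int i = 0; i < 3; i++)
        crc[i] = crc32_update(0, reads[i], len);

    if (crc[0] == crc[1] || crc[0] == crc[2])
        return 0;
    if (crc[1] == crc[2])
        return 1;
    return -1;
}
```

With four or more reads, the same idea extends to a two-out-of-four or three-out-of-four rule by counting matching CRC values.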

S5: The FPGA parallel processing module calculates the SM3 signature information of the correctly read data of the LV1 linked list and of its two backups A and B, and compares the results. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the SM3 signature information of all three positions is the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the third is abnormal; if all three differ, the judgment is invalid;

S6: The FPGA parallel processing module calculates the MD5 information digest of the correctly read data of the LV1 linked list and of its two backups A and B, and compares the results. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the MD5 information digests of all three positions are the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the third is abnormal; if all three differ, the judgment is invalid;

S7: The FPGA parallel processing module calculates the BCC check code of the correctly read data of the LV1 linked list and of its two backups A and B, and compares the results. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the BCC check codes of all three positions are the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the third is abnormal; if all three differ, the judgment is invalid;

It should be noted that the embodiment of the present invention is described using three verification methods (SM3 signature information, MD5 digest information, and BCC check), each applied once, for a total of three verifications. Those skilled in the art will appreciate, however, that the verification methods are not limited to these three; other common check methods, such as CRC32 and CRC64, can also be selected. Verification methods and the number of verifications can be combined arbitrarily, for example two rounds of SM3 signature information and two rounds of CRC32 check. Using different verification methods prevents a final misjudgment caused by a weakness in the principle of any single method. In application, the methods and number of rounds can be selected and combined according to the system's tolerance for misjudgment.

In the embodiment of the present invention, the three verification processes are not related and can be performed in parallel.
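
Each of S5 to S7 applies the same three-way comparison to a different check value. The sketch below shows that comparison using BCC as the example; it assumes BCC is the byte-wise XOR of the data (a common definition, not stated in the text), and the names verdict_t, bcc and compare_three are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Outcome of comparing one check value across the three positions:
 * position 0 = original, 1 = backup A, 2 = backup B. */
typedef enum {
    V_ALL_SAME,   /* no abnormality: all three positions agree */
    V_ODD_0,      /* positions 1 and 2 agree, position 0 is abnormal */
    V_ODD_1,      /* positions 0 and 2 agree, position 1 is abnormal */
    V_ODD_2,      /* positions 0 and 1 agree, position 2 is abnormal */
    V_INVALID     /* all three differ: the judgment is invalid */
} verdict_t;

/* Assumption: BCC is the XOR of all bytes in the buffer. */
uint8_t bcc(const uint8_t *data, size_t len)
{
    uint8_t x = 0;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        x ^= data[i];
    return x;
}

/* Two-out-of-three comparison of one check value per position. */
verdict_t compare_three(uint32_t c0, uint32_t c1, uint32_t c2)
{
    if (c0 == c1 && c1 == c2)
        return V_ALL_SAME;
    if (c0 == c1)
        return V_ODD_2;
    if (c0 == c2)
        return V_ODD_1;
    if (c1 == c2)
        return V_ODD_0;
    return V_INVALID;
}
```

The same compare_three call would be applied to SM3 or MD5 results after reducing them to a comparable value; because the three methods are independent, the three comparisons can run in parallel.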

S8: The FPGA parallel processing module compares the judgment results of the three verification methods (SM3 signature information, MD5 information digest, and BCC check code) for the LV1 linked list, and rules on the correctness of the LV1 linked list and its two backups A and B according to the two-out-of-three rule: if the judgment results of two or more verification methods are consistent, the consistent result is taken as the final ruling; otherwise, the ruling is invalid.

If the final ruling is no abnormality (that is, the data at the three positions is identical), the LV1 linked list and its two backups A and B are all correct, and detection of the LV1 linked list ends;

If the ruling is invalid, the error is unrecoverable, and the device must be locked and restarted to recover;

If two copies are normal and the third is abnormal, the abnormal copy is recovered using the abnormal data recovery method described later. The abnormal copy may be the LV1 linked list itself, backup A, or backup B; because the backup area is also RAM, it too can be corrupted and need recovery.
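
The ruling of S8 and the three possible follow-up actions can be sketched as below. The enum values and the names adjudicate and action_for are hypothetical; the sketch assumes each verification method has already been reduced to one verdict of the kind described in S5 to S7.

```c
#include <assert.h>

/* Per-method verdict over the three positions
 * (0 = original, 1 = backup A, 2 = backup B). */
typedef enum {
    V_ALL_SAME,   /* all three positions agree */
    V_ODD_0,      /* position 0 is the abnormal one */
    V_ODD_1,      /* position 1 is the abnormal one */
    V_ODD_2,      /* position 2 is the abnormal one */
    V_INVALID     /* all three differ */
} verdict_t;

typedef enum {
    ACT_NONE,               /* data is correct, detection ends */
    ACT_RECOVER,            /* restore the one abnormal copy */
    ACT_LOCK_AND_RESTART    /* unrecoverable: lock the device and restart */
} action_t;

/* Two-out-of-three ruling across the three verification methods:
 * a verdict shared by at least two methods wins, otherwise invalid. */
verdict_t adjudicate(verdict_t sm3, verdict_t md5, verdict_t bcc)
{
    assert(sm3 <= V_INVALID && md5 <= V_INVALID && bcc <= V_INVALID);
    if (sm3 == md5 || sm3 == bcc)
        return sm3;
    if (md5 == bcc)
        return md5;
    return V_INVALID;
}

/* Maps the final ruling to the action described in the text. */
action_t action_for(verdict_t final_verdict)
{
    switch (final_verdict) {
    case V_ALL_SAME:
        return ACT_NONE;
    case V_ODD_0:
    case V_ODD_1:
    case V_ODD_2:
        return ACT_RECOVER;
    default:
        return ACT_LOCK_AND_RESTART;
    }
}
```

For example, if SM3 and BCC both report that backup B is abnormal while MD5 reports no abnormality, the ruling is that backup B is abnormal and recovery is started.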

S9: Following steps S4 to S8, check the correctness of the LV2 linked list and its two backups A and B; if any of the three positions holds abnormal data, recover it.

The above processing ensures the correctness of the LV1/LV2 linked lists and their two backups A and B. The corresponding protected areas are then detected and recovered according to the information recorded in the verified LV1/LV2 linked lists.

The following S10-S15 steps are the specific process of judging the correctness of each protected area and its two backups A and B in the LV1 linked list, as shown in Figure 4, including:

S10: The FPGA parallel processing module reads the RAM space data corresponding to the first protected area in the LV1 linked list, together with its two backups A and B. The data at each position is read three times in succession, and a CRC check code is calculated for each read; for each position, one of the three reads is selected as the correctly read data according to the two-out-of-three rule;

The data of all three positions is verified during the communication process as follows: a CRC check is performed on each of the three reads from each position. If the three CRC check codes of a position all differ, the read function or the system hardware is abnormal; the device must be locked and the system restarted. If two of the CRC check codes are the same, those reads are correct, and either of them is taken as the correctly read data.

S11: The FPGA parallel processing module calculates the SM3 signature information of the correctly read data of the RAM space corresponding to the first protected area in the LV1 linked list and of its two backups A and B, and compares the results. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the SM3 signature information of all three positions is the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the third is abnormal; if all three differ, the judgment is invalid;

S12: The FPGA parallel processing module calculates the MD5 information digest of the correctly read data of the RAM space corresponding to the first protected area in the LV1 linked list and of its two backups A and B, and compares the results. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the MD5 information digests of all three positions are the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the third is abnormal; if all three differ, the judgment is invalid;

S13: The FPGA parallel processing module calculates the BCC check code of the correctly read data of the RAM space corresponding to the first protected area in the LV1 linked list and of its two backups A and B, and compares the results. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the BCC check codes of all three positions are the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the third is abnormal; if all three differ, the judgment is invalid;

S14: The FPGA parallel processing module compares the judgment results of the three verification methods (SM3 signature information, MD5 information digest, and BCC check code), and rules on the correctness of the RAM space corresponding to the first protected area in the LV1 linked list and of its two backups A and B according to the two-out-of-three rule: if the judgment results of two or more methods are consistent, the consistent result is taken as the final ruling; otherwise, the ruling is invalid.

If the final ruling is no abnormality (the data at the three positions is identical), the RAM space data corresponding to the first protected area in the LV1 linked list and its two backups A and B are all correct, and detection of the first protected area in the LV1 linked list ends;

If two copies are normal and the third is abnormal, the abnormal copy is recovered using the abnormal data recovery method described later. The abnormal copy may be the RAM space data, backup A, or backup B; because the backup area is also RAM, it too can be corrupted and need recovery.

S15: The FPGA parallel processing module repeats steps S10 to S14 to adjudicate the data of every protected area in the LV1 linked list, recovering any abnormality found. At this point, data detection of all protected areas in the LV1 linked list is complete.

The protected areas in the LV2 linked list are processed over multiple interrupts. They are divided into multiple groups, and each interrupt beat completes one group (multiple BDs); therefore, after LV2 detection finishes in each interrupt, the group position must be updated.

The following S16-S22 steps are the specific process of judging the correctness of each protected area in the current detection group and its two backups A and B in the LV2 linked list, as shown in Figure 4, including:

S16: The FPGA parallel processing module reads the RAM space data corresponding to the first protected area in the current detection group of the LV2 linked list, together with its two backups A and B. The data at each position is read three times in succession, and a CRC check code is calculated for each read; for each position, one of the three reads is selected as the correctly read data according to the two-out-of-three rule;

Regarding the current detection group: the content of the LV2 linked list is divided into multiple groups. Detection starts from the first group, and one group is detected each time. The group being detected is called the current detection group.

S17: The FPGA parallel processing module calculates the SM3 signature information of the correctly read data of the RAM space corresponding to the first protected area in the current detection group of the LV2 linked list and of its two backups A and B, and compares the results. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the SM3 signature information of all three positions is the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the third is abnormal; if all three differ, the judgment is invalid;

S18: The FPGA parallel processing module calculates the MD5 information digest of the correctly read data of the RAM space corresponding to the first protected area in the current detection group of the LV2 linked list and of its two backups A and B, and compares the results. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the MD5 information digests of all three positions are the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the third is abnormal; if all three differ, the judgment is invalid;

S19: The FPGA parallel processing module calculates the BCC check code of the correctly read data of the RAM space corresponding to the first protected area in the current detection group of the LV2 linked list and of its two backups A and B, and compares the results. According to the two-out-of-three rule, it judges whether the data at the three positions has been abnormally displaced: if the BCC check codes of all three positions are the same, there is no abnormality, that is, the data at the three positions is identical; if two are the same, those two copies are normal and the third is abnormal; if all three differ, the judgment is invalid;

S20: The FPGA parallel processing module compares the judgment results of the three verification methods (SM3 signature information, MD5 information digest, and BCC check code), and rules on the correctness of the RAM space corresponding to the first protected area in the current detection group of the LV2 linked list and of its two backups A and B according to the two-out-of-three rule: if the judgment results of two or more methods are consistent, the consistent result is taken as the final ruling; otherwise, the ruling is invalid.

If the final ruling is no abnormality (the data at the three positions is identical), the RAM space data corresponding to the first protected area in the current detection group of the LV2 linked list and its two backups A and B are all correct, and data detection of that protected area is complete;

If two copies are normal and the third is abnormal, the abnormal copy is recovered using the abnormal data recovery method described later. The abnormal copy may be the RAM space data, backup A, or backup B.

S21: Repeat steps S16 to S20 to adjudicate the data of every protected area in the current detection group of the LV2 linked list, recovering any abnormality found. At this point, detection of all protected-area data in the current detection group of the LV2 linked list is complete;

S22: Finally, the current-detection-group position in the LV2 linked list is advanced to the next group, and steps S16 to S21 are repeated in the next interrupt cycle to adjudicate and recover all protected-area data in that group, until the data of all protected areas in all groups of the LV2 linked list has been detected.
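
The group bookkeeping of S16 to S22 can be sketched as a small cursor over the LV2 entries. The names lv2_cursor_t, lv2_cursor_init and lv2_next_group are hypothetical. Two details are assumptions not stated in the text: the per-interrupt count is rounded up when the total is not divisible by P, and the cursor wraps to the first group after the last one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t total;     /* number of BD entries in the LV2 linked list */
    uint32_t per_irq;   /* entries checked per interrupt: ceil(total / P) */
    uint32_t next;      /* first entry of the current detection group */
} lv2_cursor_t;

/* Configures the cursor so that P interrupts cover all LV2 entries. */
void lv2_cursor_init(lv2_cursor_t *c, uint32_t total, uint32_t p)
{
    assert(c != NULL && p > 0);
    c->total = total;
    c->per_irq = (total + p - 1) / p;   /* assumption: round up */
    c->next = 0;
}

/* Called once per interrupt. Stores the index of the first BD entry of the
 * current detection group in *first, returns how many entries the group
 * contains, and advances the cursor to the next group (wrapping to the
 * first group after the last one, which is an assumption). */
uint32_t lv2_next_group(lv2_cursor_t *c, uint32_t *first)
{
    uint32_t count;

    assert(c != NULL && first != NULL);
    count = c->total - c->next;
    if (count > c->per_irq)
        count = c->per_irq;
    *first = c->next;
    c->next += count;
    if (c->next >= c->total)
        c->next = 0;
    return count;
}
```

With 10 entries and P = 4, successive interrupts would check entries 0 to 2, 3 to 5, 6 to 8, and 9, then start again from entry 0.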

After the system completes the correctness judgment of the protected RAM space, the FPGA parallel processing module must recover any abnormal data. To improve the real-time response of data recovery, the present invention divides the data into several basic data blocks, calculates CRC values over them in several different ways, and uses these CRC values to locate the abnormal basic data block quickly. The correctness of the communication process must also be considered when restoring protected data. Therefore, immediately after writing the correct basic data block to the abnormal position, the present invention reads the data back and checks whether it was written correctly; if the data read back is inconsistent with the data written, it is written again.

When abnormalities exist at multiple locations, one abnormal basic data block is located first, and the remaining abnormal basic data blocks are then located iteratively. The present invention thus solves the multi-location abnormality problem by iteration.

The abnormal data recovery method of the present invention presupposes that the FPGA parallel processing module has finally ruled that two copies of the data are normal and one is abnormal, in which case the abnormal copy can be recovered. One of the normal copies is chosen as the normal data area, and the abnormal copy is the abnormal data area. The abnormal data area may be a protected RAM area or an independent RAM area. The specific steps, shown in Figure 8, are as follows:

S1: The abnormal data area and the normal data area are divided into N blocks in order according to the minimum data length, which are referred to as basic data blocks in the following description.

S2: The FPGA parallel processing module calculates the CRC check codes C_en of the abnormal data area, where n = 0, 1, 2, ..., m, the number of basic data blocks per interval is P_n = 2^n, and N/2 ≤ 2^m < N.

C_en is calculated as follows: starting from the first basic data block, take P_n basic data blocks, skip the next P_n basic data blocks, take the following P_n basic data blocks, and so on; all the basic data blocks taken are used together as the data source for calculating the CRC value.

C_e0, C_e1, ..., C_em in this step can be calculated in parallel or in series.

Referring to Figure 6, the abnormal data area is divided into N = 16 basic data blocks, denoted B1, B2, ..., B16. C_e0 is calculated by taking 1 basic data block starting from B1 and then taking 1 basic data block after every 1 skipped, that is, B1, B3, B5, ..., B15 are used as the data source. C_e1 is calculated by taking 2 basic data blocks starting from B1 and then taking 2 basic data blocks after every 2 skipped, that is, B1, B2, B5, B6, ..., B13, B14 are used as the data source. Similarly, C_e3 is calculated by taking 8 basic data blocks starting from B1 and then taking 8 after every 8 skipped, that is, B1, B2, ..., B8 are used as the data source.

S3: The FPGA parallel processing module calculates the CRC check codes C_cn of the normal data area, where n = 0, 1, 2, ..., m, the number of basic data blocks per interval is P_n = 2^n, and N/2 ≤ 2^m < N.

C_cn is calculated as follows: starting from the first basic data block, take P_n basic data blocks, skip the next P_n basic data blocks, take the following P_n basic data blocks, and so on; all the basic data blocks taken are used together as the data source for calculating the CRC value.

C_c0, C_c1, ..., C_cm in this step can be calculated in parallel or in series.

Referring to Figure 6, the normal data area is divided into N = 16 basic data blocks, denoted B1, B2, ..., B16. C_c0 is calculated by taking 1 basic data block starting from B1 and then taking 1 basic data block after every 1 skipped, that is, B1, B3, B5, ..., B15 are used as the data source. C_c1 is calculated by taking 2 basic data blocks starting from B1 and then taking 2 basic data blocks after every 2 skipped, that is, B1, B2, B5, B6, ..., B13, B14 are used as the data source. Similarly, C_c3 is calculated by taking 8 basic data blocks starting from B1 and then taking 8 after every 8 skipped, that is, B1, B2, ..., B8 are used as the data source.
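
The block selection behind C_en and C_cn can be sketched as follows. Block indices in the code are zero-based, so B1 is index 0. The names block_selected, stride_crc and crc32_update are hypothetical, and a software CRC-32 stands in for whichever CRC the FPGA module actually computes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); chainable across buffers. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Returns nonzero if zero-based block b contributes to the level-n CRC:
 * runs of P_n = 2^n blocks are alternately taken and skipped, starting
 * with a taken run, so a block is taken when bit n of its index is 0. */
int block_selected(size_t b, unsigned n)
{
    return ((b >> n) & 1u) == 0;
}

/* Level-n CRC over an area of n_blocks basic data blocks of block_len
 * bytes each: the selected blocks are fed to the CRC in ascending order. */
uint32_t stride_crc(const uint8_t *area, size_t n_blocks,
                    size_t block_len, unsigned n)
{
    uint32_t crc = 0;

    assert(area != NULL);
    for (size_t b = 0; b < n_blocks; b++) {
        if (block_selected(b, n))
            crc = crc32_update(crc, area + b * block_len, block_len);
    }
    return crc;
}
```

For N = 16, level 1 selects indices 0, 1, 4, 5, 8, 9, 12, 13 (B1, B2, B5, B6, ..., B13, B14), matching the example in the text.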

S4: The FPGA parallel processing module judges whether the two CRC values C_cm and C_em are equal. If they are equal, the abnormal data lies in the second half of the abnormal data area, and the abnormal data mark position is updated to point to the second half; if they are not equal, there must be an abnormality in the first half of the abnormal data area, and the abnormal data mark position is updated to point to the first half.

This step is the case n = m, in which all the basic data blocks of the data area are divided into two halves and step S4 determines whether the abnormal data lies in the first or the second half. Once the abnormal range has been narrowed, S5 repeats the judgment for n = m-1 and so on, continually narrowing the range until, at n = 0, it has been narrowed to one specific abnormal basic block.

S5: Step S4 is repeated to compare C_cn and C_en, with n decreasing from m-1 to 0, and the abnormal data mark position is updated at each step. After the judgment for n = 0, the abnormal data mark position has been narrowed to the single basic data block of the abnormal data area that contains the abnormal data, referred to as the abnormal basic data block. The basic data block at the corresponding position in the normal data area holds the correct data and is referred to as the correct basic data block.

In the embodiment of the present invention, the judgment process is shown in Figure 7. First, C_e3 and C_c3 are compared. If they are equal, the abnormal data lies in the second half, B9 to B16, and C_e2 and C_c2 are compared next. If they are equal, the abnormal data lies in the second half of that range, B13 to B16, and C_e1 and C_c1 are compared next. If they are equal, the abnormal data lies in B15 to B16, and C_e0 and C_c0 are compared next. If they are equal, the abnormal data lies in basic data block B16. At this point the specific abnormal basic data block has been located. The other judgment branches are handled in the same way, as shown in Figure 7, and are not repeated here.
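
The narrowing of S4 and S5 can be sketched as a loop from n = m down to n = 0. The sketch assumes a single abnormal basic data block and that N is a power of two (as in the N = 16 example); the names locate_bad_block, stride_crc and crc32_update are hypothetical, block indices are zero-based, and a software CRC-32 stands in for the FPGA's CRC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); chainable across buffers. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Level-n CRC: runs of 2^n blocks are alternately taken and skipped. */
static uint32_t stride_crc(const uint8_t *area, size_t n_blocks,
                           size_t block_len, unsigned n)
{
    uint32_t crc = 0;
    for (size_t b = 0; b < n_blocks; b++) {
        if (((b >> n) & 1u) == 0)
            crc = crc32_update(crc, area + b * block_len, block_len);
    }
    return crc;
}

/* Returns the zero-based index of the abnormal basic data block, assuming
 * exactly one block of `bad` differs from `good`. At each level n the
 * current range is halved: unequal CRCs keep the first half, equal CRCs
 * move the mark position to the second half. */
size_t locate_bad_block(const uint8_t *bad, const uint8_t *good,
                        size_t n_blocks, size_t block_len)
{
    unsigned m = 0;
    size_t mark = 0;

    /* assumption: N is a power of two and at least 2 */
    assert(n_blocks >= 2 && (n_blocks & (n_blocks - 1)) == 0);
    while (((size_t)2 << m) < n_blocks)   /* largest m with N/2 <= 2^m < N */
        m++;

    for (unsigned n = m + 1; n-- > 0; ) {
        if (stride_crc(bad, n_blocks, block_len, n) ==
            stride_crc(good, n_blocks, block_len, n))
            mark += (size_t)1 << n;       /* abnormality is in the second half */
    }
    return mark;
}
```

For N = 16 this performs four comparisons; if all four pairs are equal, the result is index 15, that is, B16, as in the branch of Figure 7 described above.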

S6: Through the high-speed interface, the FPGA parallel processing module copies the content of the correct basic data block in the normal data area to the abnormal basic data block in the abnormal data area pointed to by the abnormal data mark position.

S7: The FPGA parallel processing module then re-reads the abnormal basic data block, calculates the CRC value of the re-read block, and calculates the CRC value of the corresponding correct basic data block. The two CRC values are compared: if they are equal, the abnormal basic data block has been recovered; if they are not equal, steps S6 and S7 are repeated to recover it again. In this embodiment, the number of repetitions is at most 2.
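
The write, read-back and retry behavior of S6 and S7 can be sketched as below. The high-speed interface is simulated with a callback, so block_write_fn, plain_write, flaky_write and flaky_failures are all hypothetical stand-ins; flaky_write deliberately corrupts a configurable number of writes to exercise the retry path. The sketch reads "at most 2 repetitions" as one initial attempt plus up to two repeats, which is an interpretation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_REPEATS 2   /* repetitions of S6/S7 after the first attempt */

/* Stand-in for a write over the high-speed interface (hypothetical). */
typedef void (*block_write_fn)(uint8_t *dst, const uint8_t *src, size_t len);

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Simulated write that always succeeds. */
void plain_write(uint8_t *dst, const uint8_t *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Simulated write that corrupts the first byte while flaky_failures > 0. */
int flaky_failures = 0;
void flaky_write(uint8_t *dst, const uint8_t *src, size_t len)
{
    memcpy(dst, src, len);
    if (flaky_failures > 0 && len > 0) {
        dst[0] ^= 0xFF;
        flaky_failures--;
    }
}

/* S6: copy the correct block over the abnormal block.
 * S7: read it back and compare CRCs; repeat up to MAX_REPEATS times.
 * Returns 1 if the block was recovered, 0 if every attempt failed. */
int restore_block(uint8_t *bad, const uint8_t *good, size_t len,
                  block_write_fn wr)
{
    assert(bad != NULL && good != NULL && wr != NULL);
    for (int attempt = 0; attempt <= MAX_REPEATS; attempt++) {
        wr(bad, good, len);
        if (crc32_update(0, bad, len) == crc32_update(0, good, len))
            return 1;
    }
    return 0;
}
```

What the device does when every attempt fails is not stated in this passage, so the sketch only reports the failure to the caller.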

S8: The FPGA parallel processing module calculates the CRC value over all the basic data blocks of the normal data area and the CRC value over all the basic data blocks of the abnormal data area, and compares the two. If they are not equal, abnormal data still remains: repeat steps S2 to S7 to locate and recover the other abnormal data. If they are equal, abnormal data recovery ends.

The typical flow of the above process, together with its timing diagram, is shown in Figure 9. First, error detection and recovery of the LV1 linked list is performed, then that of the LV2 linked list, then that of all the protected areas of the LV1 linked list, and finally that of the protected areas of the LV2 linked list. Error detection and recovery of the protected areas corresponding to the LV1 linked list can begin only after error detection and recovery of the LV1 linked list itself is complete; the same holds for the LV2 linked list. The LV1 and LV2 linked lists, however, have no ordering relationship with each other and can be designed for parallel processing.

If data in the protected RAM space must be changed legitimately during system operation (for example by a normal parameter modification service), the modification should be carried out in the following order: first, the software stops the verification and adjudication functions and the data recovery function, and reads back the running status of the FPGA parallel processing module to confirm that these functions have stopped; then the data is modified as required; finally, the verification and adjudication functions and the data recovery function are restarted.

The meaning of "online parallel processing" in the present invention is as follows. "Online" means that error detection and recovery of the protected RAM is performed while the system functions are running normally: the read verification, SM3 signature information, MD5 digest information, BCC check, comprehensive adjudication, and abnormal data recovery functions for the LV1/LV2 linked lists and the corresponding protected RAM run in step with the system interrupts to detect and recover from abnormal changes in RAM. "Parallel" means, first, that the normal system functions and the error detection and recovery functions run in parallel; the latter are implemented by the FPGA and do not occupy processor time. In addition, error detection and recovery of LV1 and LV2 can be designed as parallel or serial, depending on the system design and the FPGA resources.

The meaning of "real-time" in the present invention is as follows: the read verification, SM3 signature information, MD5 digest information, BCC check, comprehensive adjudication, and abnormal data recovery functions for the LV1/LV2 linked lists and the corresponding protected RAM run in step with the system interrupts, so error detection and recovery of the key data area is completed in every interrupt, giving high real-time performance.

It should be noted that the purpose of communication process verification is to prevent misjudgment caused by errors in the communication process, which is an optional function. If the system does not consider the error of the communication interface, this verification process can be removed, that is, read the data once as the correct data, and make subsequent judgments.

It should be noted that: in the embodiment of the present invention, three verification methods of SM3 signature information, MD5 digest information, and BCC verification are used for three verifications (one verification for each method) for description, but those skilled in the art need to know that verification The methods are not limited to these three, and other common check methods, such as CRC32 and CRC64, can also be selected. The verification method and the verification times can be combined arbitrarily. Using different verification methods can prevent the final misjudgment caused by the principle problem of a certain verification. In application, the method and number of times can be selected/combined according to the tolerance of the system for misjudgment.

Example 3

The present invention also provides an online parallel processing soft error real-time error detection and recovery device, including a linked list management module and an error detection recovery module, wherein:

The linked list management module is used to: divide the protected RAM space into multiple protected areas; divide all protected areas into one or more levels, where the highest level completes one error detection and recovery pass in every interrupt cycle and each other level completes one error detection and recovery pass over multiple interrupt cycles; register the protected areas of each level to generate a number of linked lists corresponding to the number of levels, the linked lists being located in the protected RAM space; and back up each linked list and the protected areas in it, in at least two copies, to other RAM space. The content of a linked list includes the position, the length, and the position of each backup of every protected area of the same level;

The error detection and recovery module is used to check each linked list and its backups for errors and to recover any abnormality;

The specific implementation of each module in the device of this embodiment, as well as the error detection and recovery of the linked list and its backup, adopt the implementation manners of Embodiment 1 and Embodiment 2.

The device of this embodiment realizes error detection and recovery of the RAM space, can locate and correct erroneous data, and can complete recovery within one interrupt cycle, which improves the real-time performance of the system in handling RAM abnormalities and reduces their impact on the system.

As will be appreciated by those skilled in the art, the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flow of the flowcharts and/or the block or blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the technical principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims

An online parallel processing soft error real-time error detection and recovery method, characterized in that it includes the following processes:

Divide the protected areas of the protected RAM space into one or more levels, wherein the highest level completes error detection and recovery once in every interrupt cycle, and each other level completes error detection and recovery once over multiple interrupt cycles;

Register the protected areas of each level to generate linked lists corresponding in number to the levels, the linked lists being located in the protected RAM space; back up at least two copies of each linked list and of the protected areas in the linked list to other spaces;

Process in parallel the error detection and recovery of each level's linked list and of each protected area in the linked list.
The online parallel processing soft error real-time error detection and recovery method according to claim 1, wherein the method is performed by a parallel processing module, and the parallel processing module accesses the protected RAM space through a high-speed interface connection and uses independent DDR, SRAM, or RAM space for backup storage.
The online parallel processing soft error real-time error detection and recovery method according to claim 1, characterized in that the process of error detection and recovery for the linked list of any level and for each protected area in the linked list is:

Perform error detection on the linked list and its backups, and recover the abnormal data if the error detection result is abnormal;

Perform error detection on each protected area in the linked list and its backups, and recover the abnormal data if the error detection result is abnormal.
The online parallel processing soft error real-time error detection and recovery method according to claim 3, characterized in that performing error detection on the linked list and its backups comprises:

Read the linked list and its backups;

Verify the linked list and its backups using any one of, or a combination of several of, the verification methods including SM3 signature information, MD5 message digest, and BCC check code;

Compare the judgment results of the verification methods to determine the correctness of the linked list and its backups.
The online parallel processing soft error real-time error detection and recovery method according to claim 4, wherein when a combination of several verification methods is adopted, the verification processes of the individual verification methods are performed in parallel.
The online parallel processing soft error real-time error detection and recovery method according to claim 4, further comprising: performing a communication process check on the linked list and its backups as read.
The online parallel processing soft error real-time error detection and recovery method according to claim 4, characterized in that verifying the linked list and its backups using one verification method comprises:

Taking SM3 signature information as an example of the verification method:

Calculate the SM3 signature information of the linked list and of each of its backups, and compare the results;

If at least two copies have the same SM3 signature information, all copies sharing that SM3 signature information are judged to be normal, and the other copies are judged to be abnormal.
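As an illustration of the copy-voting rule in the preceding claim, the following minimal Python sketch compares digests of the original and its backups and treats the copies that share the majority digest as normal. It is a sketch under stated assumptions, not the claimed implementation: MD5 from Python's hashlib stands in for SM3 (both are named as verification methods in the claims), three copies are assumed in the example, and the function name vote_on_copies is hypothetical.

```python
import hashlib
from collections import Counter


def vote_on_copies(copies):
    """Vote among copies of one data area using a single digest method.

    Assumption: MD5 stands in for SM3 here. Returns (normal, abnormal)
    index lists, or None when no two copies share the same digest.
    """
    digests = [hashlib.md5(c).hexdigest() for c in copies]
    winner, count = Counter(digests).most_common(1)[0]
    if count < 2:
        return None  # no agreement at all: this method cannot decide
    normal = [i for i, d in enumerate(digests) if d == winner]
    abnormal = [i for i, d in enumerate(digests) if d != winner]
    return normal, abnormal


# Example: copy 0 is the original, copies 1 and 2 are its two backups.
original = b"\x01\x02\x03\x04"
flipped = b"\x01\x02\x07\x04"  # one byte differs, as after a bit flip
result = vote_on_copies([original, original, flipped])
```

Here result marks copies 0 and 1 as normal and copy 2 as abnormal, so copy 2 would be the candidate for recovery.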
The online parallel processing soft error real-time error detection and recovery method according to claim 7, characterized in that comparing the judgment results of the verification methods to determine the correctness of the linked list and its backups comprises:

Compare the judgment results obtained for the linked list and its backups with the SM3 signature information, the MD5 message digest, and the BCC check code. If the judgment results of two or more verification methods are consistent, the consistent result is taken as the final judgment result.
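The two-out-of-three decision across verification methods can be sketched as follows. This is a hedged illustration only: SHA-256 is used as a stand-in for SM3 because SM3 is not guaranteed to be available in Python's hashlib, the BCC is assumed to be a byte-wise XOR, and the names bcc, verdict, final_verdict, and CHECKS are hypothetical.

```python
import hashlib
from collections import Counter
from functools import reduce


def bcc(data: bytes) -> int:
    """Block check character, assumed here to be the XOR of all bytes."""
    return reduce(lambda a, b: a ^ b, data, 0)


def verdict(copies, checksum):
    """Indices of copies judged abnormal by one method, or None if undecided."""
    values = [checksum(c) for c in copies]
    winner, count = Counter(values).most_common(1)[0]
    if count < 2:
        return None
    return tuple(i for i, v in enumerate(values) if v != winner)


def final_verdict(copies, checks):
    """Accept a verdict only if at least two methods reach the same one."""
    decided = [v for v in (verdict(copies, chk) for chk in checks) if v is not None]
    if not decided:
        return None
    best, count = Counter(decided).most_common(1)[0]
    return best if count >= 2 else None


CHECKS = [
    lambda c: hashlib.sha256(c).digest(),  # stand-in for SM3 (assumption)
    lambda c: hashlib.md5(c).digest(),
    bcc,
]

good = b"\x01\x02"
swapped = b"\x02\x01"  # same XOR as `good`, so the BCC alone misses it
bcc_only = verdict([good, good, swapped], bcc)
combined = final_verdict([good, good, swapped], CHECKS)
```

In this example the BCC reports no abnormal copy, while the two digest methods both flag copy 2, so the combined decision follows the two consistent results.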
The online parallel processing soft error real-time error detection and recovery method according to claim 3, characterized in that performing error detection on each protected area in the linked list and its backups comprises:

If the level of the linked list is the highest level:

Perform error detection on each protected area in the linked list and its backups in turn, wherein performing error detection on any protected area in the linked list and its backups comprises:

Read the current protected area in the linked list and its backups;

Verify the current protected area in the linked list and its backups using any one of, or a combination of several of, the verification methods including SM3 signature information, MD5 message digest, and BCC check code;

Compare the judgment results of the verification methods to determine the correctness of the current protected area in the linked list and its backups.
The online parallel processing soft error real-time error detection and recovery method according to claim 3, characterized in that performing error detection on each protected area in the linked list and its backups comprises:

If the level of the linked list is one of the other levels:

Divide all protected areas in the linked list into multiple groups;

In each interrupt cycle, check the protected areas of one group in the linked list and their backups, wherein checking any protected area of the group and its backups comprises:

Read the current protected area of the current detection group in the linked list and its backups;

Verify the current protected area of the current detection group in the linked list and its backups using any one of, or a combination of several of, the verification methods including SM3 signature information, MD5 message digest, and BCC check code;

Compare the judgment results of the verification methods to determine the correctness of the current protected area of the current detection group in the linked list and its backups.
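The grouping of a lower-level linked list across interrupt cycles can be sketched as below. The claims do not fix how areas are assigned to groups or in which order groups are visited, so the round-robin assignment, the cycle counter, and the names split_into_groups and areas_for_cycle are assumptions made purely for illustration.

```python
def split_into_groups(areas, group_count):
    """Distribute the protected areas of one linked list over group_count groups.

    Assumption: areas are dealt out round-robin so the groups are of
    nearly equal size.
    """
    return [areas[i::group_count] for i in range(group_count)]


def areas_for_cycle(groups, cycle):
    """Select the group to check in the given interrupt cycle (round-robin)."""
    return groups[cycle % len(groups)]


# Example: five protected areas checked over two interrupt cycles.
areas = ["area0", "area1", "area2", "area3", "area4"]
groups = split_into_groups(areas, 2)

# After len(groups) consecutive cycles every area has been checked once.
checked = [a for cycle in range(len(groups)) for a in areas_for_cycle(groups, cycle)]
```

With this arrangement a list of the highest level corresponds to a single group, so all of its areas are checked in every interrupt cycle, while a lower level spreads the same work over several cycles.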
The online parallel processing soft error real-time error detection and recovery method according to claim 10, characterized in that when a combination of several verification methods is adopted, the verification processes of the individual verification methods are performed in parallel.
The online parallel processing soft error real-time error detection and recovery method according to claim 10, further comprising: performing a communication process check on the current protected area in the linked list and its backups as read.
The online parallel processing soft error real-time error detection and recovery method according to claim 10, characterized in that verifying the current protected area in the linked list and its backups using one verification method comprises:

Taking SM3 signature information as an example of the verification method:

Calculate the SM3 signature information of the current protected area in the linked list and of each of its backups, and compare the results;

If at least two copies have the same SM3 signature information, all copies sharing that SM3 signature information are judged to be normal, and the other copies are judged to be abnormal.
The online parallel processing soft error real-time error detection and recovery method according to claim 13, wherein comparing the judgment results of the verification methods to determine the correctness of the current protected area in the linked list and its A and B backups comprises:

Compare the judgment results obtained for the current protected area in the linked list and its A and B backups with the three verification methods, namely SM3 signature information, MD5 message digest, and BCC check code. If the judgment results of two or more verification methods are consistent, the consistent result is taken as the final judgment result.
The online parallel processing soft error real-time error detection and recovery method according to claim 3, characterized in that recovering the abnormal data comprises:

Divide the abnormal data area and the normal data area into N basic data blocks in sequence;

Calculate the data CRC check codes $C_{en}$ of the abnormal data area, where $n = 0, 1, 2, 3, \ldots, m$ and $N/2 \le 2^m < N$; $C_{en}$ is calculated as follows: starting from the first basic data block, take $P_n$ basic data blocks, then take a further $P_n$ basic data blocks after each interval of $P_n$ basic data blocks, and use all the basic data blocks taken as the data source to calculate the CRC value, where the number of basic data blocks in the interval is $P_n = 2^n$;

Calculate the data CRC check codes $C_{cn}$ of the normal data area, where $C_{cn}$ is calculated in the same way: starting from the first basic data block, take $P_n$ basic data blocks, then take a further $P_n$ basic data blocks after each interval of $P_n$ basic data blocks, and use all the basic data blocks taken as the data source to calculate the CRC value;

Determine whether the two CRC values $C_{cm}$ and $C_{em}$ are equal: if they are equal, update the abnormal data mark position to point to the second half of the abnormal data area; if they are not equal, update the abnormal data mark position to point to the first half of the abnormal data area;

Repeat the above judgment for the values of $C_{cn}$ and $C_{en}$, with $n$ decreasing from $m-1$ to 0 and the abnormal data mark position updated at each step; after the $n = 0$ step, the abnormal data mark position has been narrowed down to the basic data block of the abnormal data area in which the abnormal data is located;

Copy the content of the correct basic data block in the normal data area to the abnormal basic data block in the abnormal data area pointed to by the abnormal data mark position, whereby the abnormal basic data block is restored;

Calculate the CRC values of all basic data blocks in the normal data area and the abnormal data area, and compare whether the two CRC values are equal. If the two are not equal, repeat all the above steps; if they are equal, the abnormal data recovery ends.
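The localization and recovery steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the claimed implementation: CRC-32 from zlib is used as the CRC, each pass is assumed to contain at most one abnormal basic data block, the area is assumed to hold at least two blocks, the whole-area comparison is placed at the top of the loop, and a pass limit (max_passes) is added as a safeguard. The names strided_crc, locate_bad_block, and recover are hypothetical.

```python
import zlib


def strided_crc(blocks, n):
    """CRC over blocks taken P_n at a time, skipping the next P_n (P_n = 2**n)."""
    crc = 0
    for i, block in enumerate(blocks):
        if (i >> n) & 1 == 0:  # block i lies in a "taken" run of length 2**n
            crc = zlib.crc32(block, crc)
    return crc


def locate_bad_block(bad_blocks, good_blocks):
    """Narrow the abnormal data mark position down to one block index.

    Assumes exactly one block differs and len(good_blocks) >= 2.
    """
    m = (len(good_blocks) - 1).bit_length() - 1  # satisfies N/2 <= 2**m < N
    index = 0
    for n in range(m, -1, -1):
        if strided_crc(bad_blocks, n) == strided_crc(good_blocks, n):
            # The taken blocks are clean, so the fault lies in the skipped
            # ("second") half of the current range.
            index |= 1 << n
    return index


def recover(bad_blocks, good_blocks, max_passes=8):
    """Repair bad_blocks in place; return True once both areas have equal CRCs."""
    def whole(blocks):
        return zlib.crc32(b"".join(blocks))

    for _ in range(max_passes):
        if whole(bad_blocks) == whole(good_blocks):
            return True
        i = locate_bad_block(bad_blocks, good_blocks)
        if i >= len(bad_blocks):
            return False  # more than one bad block confused the search
        bad_blocks[i] = good_blocks[i]  # copy the correct block over
    return whole(bad_blocks) == whole(good_blocks)


good = [bytes([i]) * 4 for i in range(5)]  # N = 5 basic data blocks, m = 2
bad = list(good)
bad[4] = b"\xff\x00\xff\x00"               # simulate a corrupted block
located = locate_bad_block(bad, good)
recovered = recover(bad, good)
```

Each step effectively decides one bit of the abnormal block's index, so only m + 1 pairs of CRC values are needed instead of one CRC per block.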
The online parallel processing soft error real-time error detection and recovery method according to claim 15, wherein after the recovery of the abnormal basic data block is completed, the method further comprises: re-reading the abnormal basic data block; calculating the CRC value of the re-read abnormal basic data block and the CRC value of the corresponding correct basic data block; and comparing whether the two CRC values are equal, wherein if they are equal, the abnormal basic data block has been restored.
An online parallel processing soft error real-time error detection and recovery system, comprising a linked list management module and an error detection recovery module, wherein:

The linked list management module is used to divide the protected areas of the protected RAM space into one or more levels, wherein the highest level completes error detection and recovery once in every interrupt cycle, and each other level completes error detection and recovery once over multiple interrupt cycles; to register the protected areas of each level to generate linked lists corresponding in number to the levels, the linked lists being located in the protected RAM space; and to back up at least two copies of each linked list and of the protected areas in the linked list to other RAM spaces;

The error detection and recovery module is used to process in parallel the error detection and recovery of each level's linked list and of each protected area in the linked list.