WO2021208341A1 - Method and system for detecting and recovering memory bit flipping in secondary power equipment - Google Patents

Info

Publication number
WO2021208341A1
WO2021208341A1 PCT/CN2020/114368 CN2020114368W WO2021208341A1 WO 2021208341 A1 WO2021208341 A1 WO 2021208341A1 CN 2020114368 W CN2020114368 W CN 2020114368W WO 2021208341 A1 WO2021208341 A1 WO 2021208341A1
Authority
WO
WIPO (PCT)
Prior art keywords
check code
code
crc check
segment
ecc
Prior art date
Application number
PCT/CN2020/114368
Other languages
French (fr)
Chinese (zh)
Inventor
周华良
刘拯
郑玉平
李友军
张吉
邹志扬
张连生
张成彬
朱彬彬
戴欣欣
郑奕
Original Assignee
国电南瑞科技股份有限公司
国电南瑞南京控制系统有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国电南瑞科技股份有限公司, 国电南瑞南京控制系统有限公司
Publication of WO2021208341A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1044 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1004 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G06F11/1469 Backup restoration techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based

Definitions

  • The invention relates to a method and system for detecting and recovering from memory bit flips in power secondary equipment, and belongs to the technical field of memory error correction.
  • The secondary equipment of modern power systems consists mostly of embedded devices built from a large number of chips.
  • The memory chip is one of the core components. It has millions of memory cells, each of which stores a "0" or a "1". Because memory temporarily holds the program and data being executed, a data error inside it affects the normal operation of the program and, in severe cases, can cause the entire system to fail. Its reliability and fault tolerance have therefore long been a focus of the industry.
  • Hard errors occur repeatedly and are mainly caused by hardware damage to a memory cell in the memory chip or by external wiring errors.
  • Hard errors can generally only be fixed by replacing the hardware.
  • Soft errors occur randomly and can be recovered from by reloading the program.
  • A soft error is usually an unexpected flip of a single data bit (0->1 or 1->0), and its probability of occurrence is high.
  • Single Event Effects (SEEs) refer to the phenomenon in which radiation particles (heavy particles, protons, neutrons, X-rays, gamma rays, alpha particles, etc.) from cosmic radiation and the terrestrial radiation environment cause severe damage to integrated circuits and even electronic devices.
  • When radiation particles strike the silicon material, direct or indirect ionization generates additional electron-hole pairs. The electric field in the reverse-biased depletion layer of the device separates these pairs and collects them efficiently through drift, so extra charge accumulates in the sensitive regions of the device.
  • SETs: single event transients
  • SRAM: Static Random Access Memory
  • ECC: Error Correcting Code
  • DSP: digital signal processor
  • Hardware ECC error detection and correction is available in chips such as TI's multi-core ARM processor AM572x, TI's DSP chip C665X, and Xilinx's fully programmable SoC UltraScale MPSoC.
  • Most existing bit flip detection and recovery solutions require the support of hardware units and implement bit flip protection from a hardware perspective.
  • Hardware ECC error detection and correction requires hardware error detection and correction control modules to be integrated into the processor's memory controller, which significantly increases chip cost. As a result, the processors and RAM devices currently in wide use in power secondary equipment do not support hardware ECC detection and recovery, and bit flips cannot be protected against at the hardware level.
  • The purpose of the present invention is to provide a method and system for detecting and recovering from memory bit flips in power secondary equipment, so as to solve the prior-art problem that memory bit flips cause abnormal functions or results in power secondary equipment.
  • A method for detecting and recovering from memory bit flips in power secondary equipment includes the following steps:
  • The method further includes:
  • The stored ECC check code whose comparison results are consistent is extracted and compared with the ECC check code calculated at run time.
  • Comparing the stored ECC check code with the run-time ECC check code also includes: extracting the ECC check code whose comparison results are consistent and using it to replace the ECC check codes whose comparison results are inconsistent.
  • Comparing the stored ECC check code with the run-time ECC check code also includes:
  • The redundant backup and recovery mechanism is used to restore the code segment of the application program while the application program is running.
  • Using the redundant backup and recovery mechanism to restore the code segment of the running application program includes:
  • The code segment corresponding to the CRC check code is extracted and used to overwrite the code segment of the running application program.
  • The method further includes: compressing the obtained code segment;
  • Before the code segment corresponding to the CRC check code is extracted, the method also includes: decompressing the compressed code segment corresponding to the CRC check code.
  • Extracting the code segment corresponding to the CRC check code includes:
  • Extracting the code segment corresponding to the CRC check code also includes: extracting the CRC check code and code segment whose online check results are always consistent, and using them to replace the CRC check code and code segment whose online check results are not always consistent.
  • Performing online verification on the stored CRC check code includes: comparing the current state of the stored CRC check code with its previous state at a preset time interval.
  • The method also includes: detecting and recovering the pre-registered key data in the application;
  • The key data detection and recovery method includes:
  • The memory data corresponding to the CRC check code is extracted and used to overwrite the memory data at the memory address while the application program is running.
  • The method further includes: storing the obtained memory data and its CRC check code in no fewer than two copies;
  • Extracting the memory data corresponding to the CRC check code includes:
  • Extracting the memory data corresponding to the CRC check code also includes: extracting the CRC check code and memory data whose online check results are always consistent, and using them to replace the CRC check code and memory data whose online check results are not always consistent.
  • The present invention also provides a system for detecting and recovering from memory bit flips in power secondary equipment, which includes a single-bit error recovery module. The single-bit error recovery module includes:
  • ECC check code calculation sub-module: calculates ECC check codes over the application's running area at load time, according to the preset ECC segment length, to obtain the load-time ECC check code of each data segment; and calculates ECC check codes over the application's segment data at run time, according to the same segment length, to obtain the run-time ECC check code of each data segment;
  • ECC check code storage sub-module: after the load-time ECC check codes of the segment data are obtained, stores them in no fewer than three copies;
  • ECC check code comparison sub-module: compares the stored ECC check codes with one another. If at least two of the stored copies are consistent, it extracts the consistent ECC check code and compares it with the run-time ECC check code;
  • ECC check code replacement sub-module: extracts the ECC check code whose comparison results are consistent and uses it to replace the copies whose comparison results are inconsistent;
  • Single-bit error correction sub-module: if the comparison shows that a single-bit error has occurred in the application's segment data at run time, corrects the bit in error.
  • Redundancy backup and recovery module: recovers the code segment of the running application program by means of a redundant backup and recovery mechanism.
  • The redundancy backup and recovery module includes:
  • Code segment acquisition sub-module: acquires the code segment of the application program at load time;
  • CRC check code calculation sub-module: calculates the CRC check code of the application's code segment at load time, and calculates the CRC check code of the application's code segment at run time;
  • CRC check code storage sub-module: after the code segment and its CRC check code are obtained at load time, stores them in no fewer than two copies;
  • CRC check code online check sub-module: performs online checks on the stored CRC check codes;
  • CRC check code comparison sub-module: compares the stored CRC check code with the run-time CRC check code;
  • Code segment coverage sub-module: if the online check results of a stored CRC check code are always consistent, extracts the code segment corresponding to that CRC check code and uses it to overwrite the code segment of the running application program;
  • CRC check code replacement sub-module: extracts the CRC check code and code segment whose online check results are always consistent, and uses them to replace the CRC check code and code segment whose online check results are not always consistent;
  • Code segment compression sub-module: compresses the code segment after it is obtained at load time, and decompresses the compressed code segment corresponding to the CRC check code before that code segment is extracted.
  • The key data detection and recovery module includes:
  • Memory address extraction sub-module: extracts the memory address of the key data based on the key data's pre-registered information;
  • Memory data acquisition sub-module: acquires the memory data at the memory address when the application program is loaded;
  • The CRC check code calculation sub-module is also used to calculate the CRC check code of the memory data at the memory address at load time, and to calculate the CRC check code of the key data's memory data at the memory address at run time;
  • The CRC check code storage sub-module is also used to store the memory data at the memory address and its CRC check code, once obtained at load time, in no fewer than two copies;
  • The code segment coverage sub-module is also used, if the online check results of a stored CRC check code are always consistent, to extract the memory data corresponding to that CRC check code and overwrite the memory data at the memory address while the application program is running;
  • The CRC check code replacement sub-module is also used to extract the CRC check code and memory data whose online check results are always consistent, and to use them to replace the CRC check code and memory data whose online check results are not always consistent.
  • The present invention also provides a computer processing control device, including:
  • A memory, used to store instructions;
  • A processor, configured to operate according to the instructions to execute the steps of the method for detecting and recovering from memory bit flips in power secondary equipment provided by the present invention.
  • The present invention also provides a computer-readable storage medium on which a computer program is stored.
  • When the program is executed by a processor, the steps of the method for detecting and recovering from memory bit flips in power secondary equipment provided by the present invention are implemented.
  • The beneficial effects achieved by the present invention: the method and system combine two mechanisms, ECC error detection and correction and redundant backup and recovery, to quickly locate and recover from single-bit errors and multi-bit errors, respectively.
  • This balances error correction efficiency against error correction capability and makes up for the limitation that hardware ECC error detection and correction alone cannot correct errors of two or more bits, which helps ensure the reliable operation of power secondary equipment. Because no detection and recovery circuits are needed inside the processor, the approach is unaffected by changes in the processor's own process design, reuses the existing architecture, and saves development cost. It can also be extended to processors of different architectures and is therefore widely applicable.
  • Figure 1 is a schematic diagram of the address allocation of the on-chip memory of the microcontroller in the system embodiment of the present invention.
  • Figure 2 is a schematic diagram of the backup process of the code segment of the application program in the system embodiment of the present invention.
  • Figure 3 is a schematic diagram of the backup process of key data in the system embodiment of the present invention.
  • Figure 4 is a schematic diagram of the recovery process of the code segment of the application program in the system embodiment of the present invention.
  • Figure 5 is a schematic diagram of the recovery process of key data in the system embodiment of the present invention.
  • Program segment data generally occupies contiguous addresses, which meets the operating requirements of the ECC error detection and correction algorithm. A software ECC error detection and correction algorithm can therefore be used to quickly locate and recover from single-bit flip errors in the program segment data.
  • Because multi-bit flips can also occur in system memory, a redundant backup and recovery mechanism is added on top of ECC error detection and correction. The high efficiency of the ECC mechanism and the strong error correction capability of the redundant backup and recovery mechanism complement each other, further ensuring the correctness of the program segment data and thereby the reliability of the system.
  • Step 1: The error detection and correction program loads the application program and calculates ECC check codes over the application's running area, segment by segment, according to the ECC segment length (the segment length is determined by the number of bits in the ECC code).
  • To guard against the ECC check codes themselves being corrupted at run time, the ECC check codes are stored redundantly.
  • The ECC check codes of all segments are stored in the memory reserved area, three copies per segment, in segment order.
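
A minimal sketch of Step 1, under stated assumptions. The text does not name the ECC algorithm, so this example substitutes a simple Hamming-style code: the XOR of the 1-based positions of all set bits in a segment, plus one overall parity bit. The function names `ecc_code` and `build_ecc_table` are hypothetical, and a Python list stands in for the memory reserved area.

```python
def ecc_code(segment: bytes) -> tuple[int, int]:
    """Return (XOR of 1-based positions of all set bits, overall parity bit)."""
    position_xor = 0
    parity = 0
    for byte_index, byte in enumerate(segment):
        for bit in range(8):
            if (byte >> bit) & 1:
                # Positions are 1-based so that a flip of bit 0 is still visible.
                position_xor ^= byte_index * 8 + bit + 1
                parity ^= 1
    return position_xor, parity


def build_ecc_table(image: bytes, segment_length: int) -> list[list[tuple[int, int]]]:
    """One row per segment, in segment order; each row holds three copies."""
    table = []
    for offset in range(0, len(image), segment_length):
        code = ecc_code(image[offset:offset + segment_length])
        table.append([code, code, code])  # stands in for ECC code A, B, C
    return table
```

With this code, a longer segment needs more bits for the position XOR, which is one way to read the remark that the segment length is tied to the number of bits in the ECC code.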
  • Step 2: When the application program is loaded, the error detection and correction program separates out the application's code segment according to the file type, compresses the code segment data, and calculates its CRC check code.
  • At least two copies of the compressed code segment data and the CRC check code are backed up, stored in system memory or external storage.
  • The data is compressed mainly to reduce the memory overhead of the redundant backup.
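
A minimal sketch of the Step 2 backup, using Python's `zlib` for both compression and CRC-32. The text names neither the compression algorithm nor the CRC polynomial, so both are assumptions, as is the hypothetical function `backup_code_segment`. The sketch also assumes the CRC is taken over the uncompressed code, so that the same value can later be compared with a CRC of the running code segment.

```python
import zlib


def backup_code_segment(code: bytes, copies: int = 2) -> list[dict]:
    """Return `copies` independent backups of the compressed code and its CRC."""
    if copies < 2:
        raise ValueError("the scheme calls for at least two backup copies")
    packed = zlib.compress(code)
    crc = zlib.crc32(code)  # assumption: CRC of the uncompressed code segment
    # Each copy gets its own buffer so that corruption of one cannot affect another.
    return [{"data": bytearray(packed), "crc": crc} for _ in range(copies)]
```

For example, `backup_code_segment(code)` returns two entries that play the roles of backup A and backup B.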
  • Step 3: After the application program starts, it configures the error detection and correction program with the number of segments to be ECC-checked in a single run and with the handling mode to apply after an unrecoverable error occurs.
  • The application program registers the key data that needs to be protected with the error detection and correction program through the interface provided by the error detection and correction program.
  • The error detection and correction program obtains the memory addresses of the key data, backs up the memory data at those addresses, serializes it according to a hash table, and stores it sequentially.
  • The entire serialized content is stored in two or more copies, and a CRC check code is calculated separately for each copy.
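
A minimal sketch of the Step 3 registration and backup, under stated assumptions: a `bytearray` stands in for device memory, a Python dict keyed by address stands in for the hash table, and the serialized layout (little-endian address, length, then raw bytes) is invented for illustration. The class `KeyDataRegistry` and its methods are hypothetical names, not an interface described in the text.

```python
import struct
import zlib


class KeyDataRegistry:
    def __init__(self, memory: bytearray):
        self.memory = memory
        self.table: dict[int, int] = {}  # address -> length (the "hash table")

    def register(self, address: int, length: int) -> None:
        """Called by the application for each item of key data to protect."""
        self.table[address] = length

    def serialize(self) -> bytes:
        """Serialize every registered item in address order."""
        blob = bytearray()
        for address in sorted(self.table):
            length = self.table[address]
            blob += struct.pack("<II", address, length)
            blob += self.memory[address:address + length]
        return bytes(blob)

    def make_backups(self, copies: int = 2) -> list[dict]:
        """Store the whole serialized content `copies` times, each with its CRC."""
        blob = self.serialize()
        return [{"data": bytearray(blob), "crc": zlib.crc32(blob)}
                for _ in range(copies)]
```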
  • Step 4: Once the system is running normally, the error detection and correction program periodically performs ECC error detection and correction on the original program segment data, one segment length at a time. Each time, it calculates the ECC check code of one segment length of program data and judges it against the three pre-stored ECC check codes. First, the three stored ECC check codes are compared with one another. If two of them are the same, that ECC check code is considered correct; if one of the three stored ECC check codes is wrong, it is restored to the correct value. Then the correct stored ECC check code is compared with the newly calculated ECC check code.
  • If the comparison results are consistent, no bit error has occurred in that data segment; otherwise a bit error has occurred. If a single-bit error is found in the application's segment data at run time, the correct ECC check code is used directly for error correction and recovery. To reduce the impact of error detection and correction on system load, the error detection and correction program checks only one segment length of data per run and checks the next segment on the next run.
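
The Step 4 check can be sketched as follows, under the same kind of assumptions: the ECC here is a stand-in Hamming-style code (XOR of 1-based set-bit positions plus an overall parity bit), not necessarily the code used in the invention, and `ecc_code`, `vote`, and `check_segment` are hypothetical names. The sketch votes among the three stored copies, repairs a damaged copy, then uses the syndrome to locate and flip back a single wrong bit.

```python
def ecc_code(segment) -> tuple[int, int]:
    """Return (XOR of 1-based positions of all set bits, overall parity bit)."""
    position_xor, parity = 0, 0
    for byte_index, byte in enumerate(segment):
        for bit in range(8):
            if (byte >> bit) & 1:
                position_xor ^= byte_index * 8 + bit + 1
                parity ^= 1
    return position_xor, parity


def vote(copies):
    """Return the value held by at least two of the three copies, else None."""
    a, b, c = copies
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


def check_segment(segment: bytearray, copies: list) -> str:
    good = vote(copies)
    if good is None:
        return "ecc_lost"          # all three stored codes disagree
    copies[:] = [good] * 3         # restore any damaged stored copy
    current = ecc_code(segment)
    if current == good:
        return "ok"
    syndrome = current[0] ^ good[0]
    parity_changed = current[1] != good[1]
    if parity_changed and 1 <= syndrome <= len(segment) * 8:
        position = syndrome - 1    # back to a 0-based bit position
        segment[position // 8] ^= 1 << (position % 8)
        return "corrected"
    return "multi_bit"             # hand over to redundant backup recovery
```

Like any single-error-correcting code, this stand-in can miscorrect when three or more bits flip in one segment, which is one reason a second, backup-based mechanism is useful.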
  • Step 5: If it is determined that the application's segment data has an error of two or more flipped bits at run time, or that all three ECC check codes are wrong, the redundant comparison recovery process is entered. First, online verification is performed on the CRC check codes backed up in Step 2: the current state of each CRC check code is compared with its previous state at a preset time interval. If the comparison results are always consistent, the CRC check code is judged correct; otherwise it is judged wrong. A wrong CRC check code and its corresponding code segment backup can be replaced promptly with a correct CRC check code and its corresponding code segment backup.
  • The error detection and correction program calls a system interface to turn off system program preemption or to put the running program into a dormant state, then decompresses the correct backup version and restores it to the corresponding memory. It also restores the correct backup version, backup check code, and ECC check code over the backup version in which the error occurred, and finally re-enables system program preemption or resumes the program.
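
A minimal sketch of the Step 5 flow, under stated assumptions: `zlib` stands in for the unspecified compression and CRC, each backup's CRC is assumed to cover the uncompressed code, and `preemption_disabled` is a placeholder for whatever system interface the platform offers. The names `CrcWatch` and `recover_code` are hypothetical.

```python
import zlib
from contextlib import contextmanager


@contextmanager
def preemption_disabled():
    """Placeholder: a real port would lock and unlock the scheduler here."""
    yield


class CrcWatch:
    """Online check: a stored CRC is trusted only while it never changes."""

    def __init__(self, backups: list[dict]):
        self.backups = backups
        self.last_seen = [b["crc"] for b in backups]
        self.stable = [True] * len(backups)

    def poll(self) -> None:
        """Call at the preset interval; compares each CRC with its previous state."""
        for i, backup in enumerate(self.backups):
            if backup["crc"] != self.last_seen[i]:
                self.stable[i] = False
            self.last_seen[i] = backup["crc"]

    def trusted_index(self):
        for i, backup in enumerate(self.backups):
            if not self.stable[i]:
                continue
            try:
                code = zlib.decompress(bytes(backup["data"]))
            except zlib.error:
                continue
            if zlib.crc32(code) == backup["crc"]:
                return i
        return None


def recover_code(running: bytearray, watch: CrcWatch) -> bool:
    good = watch.trusted_index()
    if good is None:
        return False  # unrecoverable: no trustworthy backup remains
    source = watch.backups[good]
    for i, backup in enumerate(watch.backups):
        if not watch.stable[i]:  # replace the damaged backup and its CRC
            backup["data"] = bytearray(source["data"])
            backup["crc"] = source["crc"]
            watch.last_seen[i] = source["crc"]
            watch.stable[i] = True
    with preemption_disabled():
        running[:] = zlib.decompress(bytes(source["data"]))
    return True
```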
  • Step 6: Once the system is running normally, the error detection and correction program periodically checks the correctness of the registered key data. It obtains the memory address of the key data by querying the hash table and compares the content at that address with the backup version: it calculates the CRC check code of the content at the memory address and compares it with the correct CRC check code backed up in Step 3. Because the stored CRC check codes are checked online, and a wrong CRC check code and its corresponding memory data can be replaced at any time with a correct CRC check code and its corresponding memory data, the correctness of the stored CRC check code is ensured.
  • The memory data corresponding to the correct CRC check code is extracted and used to overwrite the memory data at the corresponding memory address while the application program is running.
  • The error detection and correction program calls a system interface to turn off system program preemption or to put the running program into a dormant state, then restores the correct backup version to the corresponding memory address. It also restores the correct backup check code over the backup version in which the error occurred, and finally re-enables system program preemption or resumes the program.
  • The error detection and correction program always runs in the background of the system, continuously performing error detection and correction on program segment data and key data.
  • A sandbox-like mechanism is used to prevent the error recovery process from being abnormally interrupted or from executing abnormal code and driving the entire system into an abnormal state, thereby ensuring the correctness of the error recovery process and its results.
  • After the error detection and correction program detects and repairs an error, it informs the application program, through the interface, of the error's address, the error type, and the current total number of errors.
  • When an unrecoverable error occurs, the abnormality can be handled according to behavior predefined by the application, including but not limited to reloading the application, exiting the system, or restarting the system through the watchdog.
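
The notification interface described above might look like the following sketch. Everything here is hypothetical: the `ErrorType` values, the `ErrorReporter` class, and the two callbacks are illustrative names, and the text does not define the actual interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ErrorType(Enum):
    SINGLE_BIT = "single_bit"
    MULTI_BIT = "multi_bit"
    KEY_DATA = "key_data"
    UNRECOVERABLE = "unrecoverable"


@dataclass
class ErrorReporter:
    # Called after a repaired error with (address, error type, running total).
    on_repaired: Callable[[int, ErrorType, int], None]
    # Predefined by the application: reload, exit, watchdog restart, etc.
    on_unrecoverable: Callable[[int], None]
    total_errors: int = 0

    def report(self, address: int, kind: ErrorType) -> None:
        self.total_errors += 1
        if kind is ErrorType.UNRECOVERABLE:
            self.on_unrecoverable(address)
        else:
            self.on_repaired(address, kind, self.total_errors)
```

Passing the unrecoverable handler in at construction time mirrors the idea that the application chooses its failure behavior in advance rather than at the moment of the fault.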
  • The specific embodiment of the present invention provides a system for detecting and recovering from memory bit flips in power secondary equipment, which is used to implement the aforementioned method.
  • The number of segments checked in each ECC pass of the error detection and correction program can be flexibly configured by the application program, so the system of the present invention can be applied conveniently to systems with different processor load levels.
  • The frequency of error detection and correction is set according to the actual processor load so that the overall system load stays at a reasonable level.
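
A minimal sketch of the configurable scan, assuming a simple round-robin cursor over segment indices. The class name `SegmentScanner` and the parameter `per_run` are hypothetical; the text only states that the number of segments checked per run is configurable.

```python
class SegmentScanner:
    """Hands out the next few segment indices on each periodic run."""

    def __init__(self, total_segments: int, per_run: int = 1):
        if total_segments <= 0 or per_run <= 0:
            raise ValueError("segment counts must be positive")
        self.total = total_segments
        self.per_run = min(per_run, total_segments)
        self.cursor = 0

    def next_batch(self) -> list[int]:
        batch = [(self.cursor + i) % self.total for i in range(self.per_run)]
        self.cursor = (self.cursor + self.per_run) % self.total
        return batch
```

A lightly loaded processor could use a larger `per_run` to shorten the time between two checks of the same segment, while a heavily loaded one keeps it at 1.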
  • The system of the present invention is based on a microcontroller environment without an operating system. Because no cache memory (Cache) or memory management unit (MMU) intervenes, logical addresses are identical to physical addresses, and what the processor reads from an address is the real content of physical memory rather than a temporary copy held in the Cache, which makes the method of the present invention easier to implement.
  • The method of the present invention is implemented in the on-chip memory of the microcontroller to provide online error detection and recovery for that memory.
  • Other operating environments can also apply the method of the present invention after some additional operations are added.
  • Figure 1 is a schematic diagram of the address allocation of the on-chip memory of the microcontroller in the system embodiment of the present invention.
  • The BOOT startup program initializes the operating environment of the microcontroller and the processor's peripherals and brings the entire system into its initial operating state; the error detection and correction program boots the application program, performs ECC error detection and correction on the application code, and backs up and restores the application code and key data; the application program implements the specific functions of the system.
  • The memory between the low address of the memory reserved area (Resv_low) and the high address of the memory reserved area (Resv_high) is reserved for the exclusive use of the error detection and correction program. It stores the segmented ECC check codes of the application code and the backups of the application code and key data.
  • Other programs cannot use the memory reserved area. This can be achieved by specifying, at the compile and link stage, a program address space that avoids the memory reserved area.
  • The segmented ECC check codes of the code segment are stored in three copies: ECC code A, ECC code B, and ECC code C. The redundant backups of the code segment and of the key data are each kept in at least two copies, backup A and backup B.
  • The application parameter setting, ECC check code calculation, and program segment backup process are shown in Figure 2.
  • The error detection and correction program calculates the ECC check code of the application code segment by segment, according to the ECC segment length, and stores it in the application code ECC area of the memory Resv area.
  • The ECC check code of each data segment is stored in three copies: ECC code A, ECC code B, and ECC code C.
  • The error detection and correction program separates out the code segment according to the application file type, compresses the code segment data and calculates its check code, stores the compressed data and check code in application code backup A and backup B in the memory Resv area, and then boots the application.
  • The key parameter setting and application key data backup process are shown in Figure 3.
  • The application program registers the key data that needs to be protected with the error detection and correction program through the interface provided by the error detection and correction program.
  • The error detection and correction program obtains the memory addresses of all key data, backs up the content at those addresses, serializes it according to the hash table, and stores it in the key data backup area.
  • The entire serialized content is stored in turn in data backup A and data backup B in the key data backup area. A check code is calculated separately for each backup, used to verify the correctness of that backup, and appended to the end of the backup data.
  • The program segment error detection and recovery process of the application program is shown in Figure 4.
  • The error detection and correction program periodically performs ECC error detection and correction on the original program segment data according to the segment length.
  • The segment length is configured by the application program in the error detection and correction program in the previous step.
  • Each time, the ECC check code is calculated for one segment length of program data and judged against the pre-stored ECC code A, ECC code B, and ECC code C. If the original data has a single-bit error, it is corrected and recovered directly; if a pre-stored ECC check code is in error, that check code is restored to the correct value.
  • To reduce the impact of error detection and correction on system load, the error detection and correction program checks only one data segment each time it runs and checks the next segment on the next run. When a multi-bit error that ECC cannot recover is detected, or when all ECC check codes are wrong, the redundant comparison recovery process is entered. When all ECC check codes are found to be wrong, the program first verifies whether the currently running program is correct. If it is not, it checks whether backup A and backup B are each correct.
  • System preemption must be turned off during restoration to prevent the error detection and correction program from being preempted by the application while it restores the code segment, which would cause program operation problems. If backup A is incorrect and backup B is correct, backup B is restored to the running code segment, and backup B must also be restored to backup A so that subsequent verification can continue correctly. If both backup A and backup B are found to be incorrect, a very serious error has occurred and cannot be recovered. In that case the error detection and correction program handles the system according to the handling mode set in advance by the application, such as restarting the system or making the application exit.
  • The error detection and correction program also records detected errors in its own dedicated variables for the application program to consult.
  • When a multi-bit error that ECC cannot recover is detected, only backup A and backup B need to be verified; there is no need to verify whether the currently running program is correct.
  • The subsequent process is the same as the handling flow used when all ECC check codes are wrong.
  • After the system is running normally, the error detection and correction program periodically checks the registered key data. It obtains the memory address of the data by querying the hash table and compares the key data with backup A. If they are consistent, no error has occurred, and the program returns and waits for its next run. If the key data is inconsistent with backup A, one of the two is wrong, and the key data is then compared with backup B. If these are the same, the key data and backup B are correct and the content of backup A is wrong; backup B is restored to backup A, and the error detection and correction program returns.
  • If the key data is also inconsistent with backup B, backup A and backup B must each be verified. If backup A is correct, backup A is restored to the key data; if backup A is incorrect and backup B is correct, backup B is restored to the key data and also to backup A, so that subsequent verification can continue correctly.
  • System preemption must be turned off during recovery, so that the application program cannot preempt the program while the recovery is in progress. If both backup A and backup B are incorrect, a very serious error has occurred and cannot be recovered.
  • In that case the error detection and correction program handles the system according to the processing method set in advance by the application, such as restarting the system or making the application exit.
  • The error detection and correction program also records the detected errors in its own dedicated variables for the application program to consult.
  • The system of the present invention is divided into functional modules, mainly including:
  • ECC check code calculation sub-module: used to perform an ECC check code calculation on the application running area when the application is loaded, according to the preset ECC segment length, to obtain the load-time ECC check code of the segment data, and to perform an ECC check code calculation on the segment data of the running application, according to the preset ECC segment length, to obtain the run-time ECC check code of the segment data;
  • ECC check code storage sub-module: used to store the load-time ECC check code, once it has been obtained, in no fewer than three copies;
  • ECC check code comparison sub-module: used to compare the stored copies of the ECC check code with one another, one by one, and, if at least two of the stored copies agree, to take the agreed ECC check code and compare it with the run-time ECC check code;
  • ECC check code replacement sub-module: used to take the ECC check code on which the stored copies agree and use it to replace any stored copy that disagrees;
  • Single-bit error correction sub-module: used to correct the erroneous bit if the comparison result shows that a single-bit error has occurred in the segment data of the running application.
  • Code segment acquisition sub-module: used to obtain the code segment of the application when the application is loaded;
  • CRC check code calculation sub-module: used to calculate the CRC check code of the application's code segment when the application is loaded, to obtain the load-time CRC check code of the code segment, and to calculate the CRC check code of the code segment of the running application, to obtain the run-time CRC check code of the code segment;
  • CRC check code storage sub-module: used to store the code segment and its CRC check code, once obtained at load time, in no fewer than two copies;
  • CRC check code online check sub-module: used to perform an online check of the stored CRC check codes;
  • CRC check code comparison sub-module: used to compare the load-time CRC check code with the run-time CRC check code;
  • Code segment overwrite sub-module: used, if the online check result of a stored CRC check code remains consistent, to extract the code segment corresponding to that CRC check code and use it to overwrite the code segment of the running application;
  • CRC check code replacement sub-module: used to take a CRC check code whose online check result remains consistent, together with its code segment, and use them to replace any CRC check code and code segment whose online check result does not remain consistent;
  • Code segment compression sub-module: used to compress the code segment after it has been obtained at load time, and to decompress the compressed code segment corresponding to the CRC check code before that code segment is extracted.
  • Memory address extraction sub-module: used to extract the memory address of the key data based on the pre-registration information of the key data;
  • Memory data acquisition sub-module: used to obtain the memory data at that memory address when the application is loaded;
  • The aforementioned CRC check code calculation sub-module is also used to calculate the CRC check code of the memory data at the memory address when the application is loaded, to obtain the load-time CRC check code, and to calculate the CRC check code of the memory data at the memory address of the key data while the application is running, to obtain the run-time CRC check code;
  • The aforementioned CRC check code storage sub-module is also used to store the memory data and its CRC check code, once obtained at load time, in no fewer than two copies;
  • The aforementioned code segment overwrite sub-module is also used, if the online check result of a stored CRC check code remains consistent, to extract the memory data corresponding to that CRC check code and use it to overwrite the memory data at the memory address of the running application;
  • The aforementioned CRC check code replacement sub-module is also used to take a CRC check code whose online check result remains consistent, together with its memory data, and use them to replace any CRC check code and memory data whose online check result does not remain consistent.
  • A specific embodiment of the present invention also provides a computer processing control device, including:
  • Memory: used to store instructions;
  • Processor: used to operate according to the instructions to execute the steps of the method for detecting and recovering memory bit flipping in secondary power equipment provided by the present invention.
  • A specific embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the steps of the method for detecting and recovering memory bit flipping in secondary power equipment provided by the present invention are implemented.
  • This application can be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device. The instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps is executed on the computer or other programmable equipment to produce computer-implemented processing. The instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Abstract

A method and system for detecting and recovering memory bit flipping in secondary power equipment, related to the technical field of memory error correction. The technical problem to be solved is equipment malfunctioning or result abnormality caused by memory bit flipping in secondary power equipment in the prior art. The method comprises the following steps: performing an ECC verification code calculation according to a preset ECC segment length with respect to an application running area when an application is loading, acquiring an ECC verification code for segment data of the application when the application is loading; performing an ECC verification code calculation according to the preset ECC segment length with respect to the segment data of the application when the application is running, acquiring an ECC check code for the segment data of the application when the application is running; comparing the ECC verification code with the ECC check code; and, if the segment data of the application when the application is running is determined, on the basis of the comparison result, to have experienced a single bit error, correcting the bit where the error occurred.

Description

Method and system for detecting and recovering memory bit flipping in secondary power equipment
Technical field
The invention relates to a method and system for detecting and recovering memory bit flipping in secondary power equipment, and belongs to the technical field of memory error correction.
Background
Most secondary equipment in modern power systems consists of embedded devices built from a large number of chips. The memory chip, one of the core components, contains millions of memory cells, each of which stores a "0" or a "1". Because memory temporarily holds the program and data being executed, a data error inside it affects the normal operation of the program and, in severe cases, can cause the entire system to fail. Memory reliability and fault tolerance have therefore long been a focus of research in the industry.
Years of industry research on semiconductor and memory technology have shown that memory errors fall into two main categories: hard errors and soft errors. Hard errors are repeatable and are mainly caused by hardware damage to memory cells in the memory chip or by external wiring faults; they can generally be resolved only by replacing hardware. Soft errors occur randomly and disappear once the program is reloaded. A soft error is usually an unexpected flip of one bit of data (0 -> 1 or 1 -> 0), and the probability of occurrence is relatively high.
Soft errors are usually caused by single event effects. Single event effects (SEEs) are the phenomena in which radiation particles (heavy particles, protons, neutrons, X-rays, gamma rays, alpha particles, and so on) from cosmic and terrestrial radiation environments seriously disrupt integrated circuits and even electronic equipment. When a radiation particle collides with the silicon material, direct or indirect ionization generates additional radiation-induced electron-hole pairs. The electric field in the reverse-biased depletion layer of the device separates these electron-hole pairs and collects them efficiently through drift, so that extra charge accumulates in the sensitive region of the device. When enough charge has accumulated, a large transient voltage pulse is produced that can temporarily flip the voltage of a sensitive circuit node. In combinational circuits, such transient voltage pulses are called single event transients (SETs). SETs can propagate along the circuit into storage cells, such as those of static random access memory (SRAM). Under the right conditions, a SET can cause a storage cell to capture incorrect timing information, which changes the stored bit.
Error checking and correcting (Error Correcting Code, ECC) technology can be used to address soft errors caused by single-bit flips in memory and NAND flash devices. It appeared after parity checking, is a more advanced means of checking and correcting storage errors, and is widely used in workstation and server products. ECC stores an extra code, derived from the data, in additional bits alongside the data bits. When data is written to memory, the corresponding ECC code is saved at the same time. When the data is read back, the saved ECC code is compared with the ECC code calculated in real time from the data read. If the two codes differ, they are decoded to determine which bit of the data is incorrect. The erroneous bit is then discarded and the memory controller outputs the correct data. If the same erroneous data is read again, the correction process is performed again.
In recent years, some embedded processors and digital signal processor (DSP) chips have begun to incorporate hardware ECC error detection and correction, such as TI's multi-core ARM processor AM572x, TI's DSP chip C665X, and Xilinx's recently introduced fully programmable SoC, the UltraScale MPSoC. However, most bit flip detection and recovery schemes require the support of hardware units and implement protection from the hardware side. Hardware ECC requires a hardware error detection and correction control module to be integrated into the processor's memory controller, which significantly increases chip cost. As a result, most of the processors and RAM devices now widely used in secondary power equipment do not support hardware ECC detection and recovery and cannot guard against bit flips at the hardware level.
Summary of the invention
In view of the shortcomings of the prior art, the purpose of the present invention is to provide a method and system for detecting and recovering memory bit flipping in secondary power equipment, so as to solve the technical problem in the prior art that memory bit flipping causes abnormal functions or results in secondary power equipment.
To solve the above technical problem, the technical solution adopted by the present invention is:
A method for detecting and recovering memory bit flipping in secondary power equipment, comprising the following steps:
performing an ECC check code calculation on the application running area when the application is loaded, according to a preset ECC segment length, to obtain the load-time ECC check code of the segment data of the application;
performing an ECC check code calculation on the segment data of the application while the application is running, according to the preset ECC segment length, to obtain the run-time ECC check code of the segment data;
comparing the load-time ECC check code with the run-time ECC check code;
if the comparison result shows that a single-bit error has occurred in the segment data of the running application, correcting the bit in which the error occurred.
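By way of illustration only, the steps above can be sketched in software as follows. The patent does not specify the ECC algorithm, so this sketch assumes a simple Hamming-style code: the XOR of the 1-based positions of all set bits, plus an overall parity bit. The names `ecc_code` and `check_and_correct` are hypothetical.

```python
from typing import Tuple


def ecc_code(segment: bytes) -> Tuple[int, int]:
    """Return (position_xor, parity) for one segment (assumed code, not the patent's)."""
    pos_xor = 0
    parity = 0
    for byte_index, byte in enumerate(segment):
        for bit in range(8):
            if (byte >> bit) & 1:
                # Positions are 1-based so that a flip of bit 0 is still visible.
                pos_xor ^= byte_index * 8 + bit + 1
                parity ^= 1
    return pos_xor, parity


def check_and_correct(segment: bytearray, reference: Tuple[int, int]) -> str:
    """Compare the run-time code with the load-time code and fix a single-bit flip."""
    pos_xor, parity = ecc_code(segment)
    syndrome = pos_xor ^ reference[0]
    parity_changed = parity != reference[1]
    if syndrome == 0 and not parity_changed:
        return "ok"
    if parity_changed and 1 <= syndrome <= len(segment) * 8:
        # One flipped bit: the syndrome is its 1-based position.
        index = syndrome - 1
        segment[index // 8] ^= 1 << (index % 8)
        return "corrected"
    # Two flipped bits leave the parity unchanged with a non-zero syndrome.
    return "uncorrectable"
```

With this code a two-bit error is detected but not corrected, which corresponds to the case handed to the redundant backup recovery mechanism; errors of three or more bits may be miscorrected, as with any single-error-correcting code.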
Further, after obtaining the load-time ECC check code of the segment data of the application, the method also includes:
storing the obtained load-time ECC check code in no fewer than three copies;
comparing the load-time ECC check code with the run-time ECC check code includes:
comparing the stored copies of the ECC check code with one another, one by one;
if at least two of the stored copies agree, taking the agreed ECC check code and comparing it with the run-time ECC check code.
Further, comparing the load-time ECC check code with the run-time ECC check code also includes: taking the ECC check code on which the stored copies agree and using it to replace any stored copy that disagrees.
Further, comparing the load-time ECC check code with the run-time ECC check code also includes:
if none of the stored copies agree with one another, using the redundant backup recovery mechanism to restore the code segment of the running application.
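As a minimal sketch of the voting among stored copies described above, assuming each stored ECC check code can be represented as an integer, the following hypothetical helper `vote_check_codes` returns the value on which at least two copies agree, repairs any copy that disagrees, and returns `None` when no two copies agree (the case that falls through to redundant backup recovery).

```python
from collections import Counter
from typing import List, Optional


def vote_check_codes(copies: List[int]) -> Optional[int]:
    """Majority-vote the stored copies and repair the outvoted ones in place."""
    value, count = Counter(copies).most_common(1)[0]
    if count < 2:
        # No two copies agree: the stored check codes cannot be trusted.
        return None
    for i, copy in enumerate(copies):
        if copy != value:
            copies[i] = value  # restore the damaged copy from the agreed value
    return value
```

Three copies are the smallest number that lets one damaged copy be outvoted, which is consistent with the requirement of no fewer than three copies.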
Further, if the comparison result shows that two or more bits of the segment data of the running application are in error, the redundant backup recovery mechanism is used to restore the code segment of the running application.
Further, using the redundant backup recovery mechanism to restore the code segment of the running application includes:
obtaining the code segment of the application when the application is loaded and calculating its CRC check code, to obtain the load-time CRC check code of the code segment;
calculating the CRC check code of the code segment of the running application, to obtain the run-time CRC check code of the code segment;
comparing the load-time CRC check code with the run-time CRC check code;
if the two differ, extracting the code segment corresponding to the load-time CRC check code and using it to overwrite the code segment of the running application.
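The CRC comparison and overwrite can be sketched as follows. This is only an illustration: it assumes CRC-32 as provided by Python's `zlib.crc32` (the patent does not name a CRC polynomial) and models the running code segment as a `bytearray`. The function names are hypothetical.

```python
import zlib
from typing import Tuple


def snapshot(code: bytes) -> Tuple[bytes, int]:
    """Take a load-time backup of a code segment together with its CRC-32."""
    return bytes(code), zlib.crc32(code)


def restore_if_changed(running: bytearray, backup: bytes, backup_crc: int) -> bool:
    """Overwrite the running segment from the backup when the CRCs differ."""
    if zlib.crc32(running) == backup_crc:
        return False  # run-time CRC matches the load-time CRC, nothing to do
    running[:] = backup  # overwrite in place so existing references stay valid
    return True
```

On a real device the overwrite would have to run with preemption disabled, as the detailed description notes; that is outside the scope of this sketch.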
Further, after obtaining the code segment of the application when the application is loaded, the method also includes: compressing the obtained code segment;
before extracting the code segment corresponding to the CRC check code, the method also includes: decompressing the compressed code segment corresponding to the CRC check code.
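A compressed backup might look like the sketch below. The compression algorithm is not specified in the source, so DEFLATE via Python's `zlib` is assumed here, with the CRC-32 taken over the uncompressed code segment; `make_compressed_backup` and `load_backup` are hypothetical names.

```python
import zlib
from typing import Optional, Tuple


def make_compressed_backup(code: bytes) -> Tuple[bytes, int]:
    """Compress a code segment at load time; the CRC covers the original bytes."""
    return zlib.compress(code), zlib.crc32(code)


def load_backup(compressed: bytes, expected_crc: int) -> Optional[bytes]:
    """Decompress a backup and return it only if it still matches its CRC."""
    try:
        code = zlib.decompress(compressed)
    except zlib.error:
        return None  # the compressed backup itself has been damaged
    return code if zlib.crc32(code) == expected_crc else None
```

Compression trades CPU time during recovery for a smaller memory footprint of the backups, which matters when two or more copies are kept.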
Further, after obtaining the code segment of the application and its CRC check code when the application is loaded, the method also includes:
storing the obtained code segment and its CRC check code in no fewer than two copies;
extracting the code segment corresponding to the CRC check code includes:
performing an online check of the stored CRC check codes;
if the online check result of a stored CRC check code remains consistent, extracting the code segment corresponding to that CRC check code.
Further, extracting the code segment corresponding to the CRC check code also includes: taking a CRC check code whose online check result remains consistent, together with its code segment, and using them to replace any CRC check code and code segment whose online check result does not remain consistent.
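The use of two or more backup copies can be illustrated as follows. This sketch makes a simplifying assumption: a backup counts as trustworthy when its data still matches its own stored CRC-32, which stands in for the online check of the source. `Backup`, `make_backups`, and `restore_from_backups` are hypothetical names.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class Backup:
    data: bytearray
    crc: int

    def is_valid(self) -> bool:
        return zlib.crc32(self.data) == self.crc


def make_backups(code: bytes, copies: int = 2) -> List[Backup]:
    """Create the load-time backup copies (no fewer than two)."""
    return [Backup(bytearray(code), zlib.crc32(code)) for _ in range(copies)]


def restore_from_backups(running: bytearray, backups: List[Backup]) -> bool:
    """Repair damaged backups from a good one, then repair the running segment."""
    good = next((b for b in backups if b.is_valid()), None)
    if good is None:
        # Every backup is damaged: hand over to the application's preset policy.
        raise RuntimeError("all backups are corrupted")
    for b in backups:
        if not b.is_valid():
            b.data[:] = good.data
            b.crc = good.crc
    if zlib.crc32(running) == good.crc:
        return False
    running[:] = good.data
    return True
```

Repairing the damaged backup at the same time keeps later checks meaningful, which matches the stated reason for restoring one backup from the other.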
Further, performing the online check of the stored CRC check codes includes: at a preset time interval, comparing the current state of each stored CRC check code with its previous state.
Further, the method also includes: detecting and recovering the key data pre-registered in the application;
the method for detecting and recovering the key data includes:
extracting the memory address of the key data based on the pre-registration information of the key data;
obtaining the memory data at that memory address when the application is loaded and calculating its CRC check code, to obtain the load-time CRC check code;
calculating the CRC check code of the memory data at the memory address of the key data while the application is running, to obtain the run-time CRC check code;
comparing the load-time CRC check code with the run-time CRC check code;
if the two differ, extracting the memory data corresponding to the load-time CRC check code and using it to overwrite the memory data at the memory address of the running application.
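The key data procedure can be sketched as below, under several assumptions that are not in the source: memory is modelled as a `bytearray`, a memory address is an offset into it, the pre-registration information is a name mapped to an offset and length in a dictionary, and the key data does not legitimately change after registration. `KeyDataGuard` and its methods are hypothetical.

```python
import zlib
from typing import Dict, Tuple


class KeyDataGuard:
    """Keeps a backup and CRC-32 for each registered region of key data."""

    def __init__(self, memory: bytearray) -> None:
        self.memory = memory
        self.regions: Dict[str, Tuple[int, int]] = {}    # name -> (offset, length)
        self.backups: Dict[str, Tuple[bytes, int]] = {}  # name -> (data, crc)

    def register(self, name: str, offset: int, length: int) -> None:
        """Record where the key data lives and snapshot it with its CRC."""
        self.regions[name] = (offset, length)
        data = bytes(self.memory[offset:offset + length])
        self.backups[name] = (data, zlib.crc32(data))

    def check(self, name: str) -> bool:
        """Return True if the key data was found damaged and has been restored."""
        offset, length = self.regions[name]
        backup, crc = self.backups[name]
        if zlib.crc32(self.memory[offset:offset + length]) == crc:
            return False
        self.memory[offset:offset + length] = backup
        return True
```

For example, a setting stored at offset 8 could be registered once after loading and then checked on each periodic run of the detection task.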
Further, after obtaining the memory data at the memory address and its CRC check code when the application is loaded, the method also includes: storing the obtained memory data and its CRC check code in no fewer than two copies;
extracting the memory data corresponding to the CRC check code includes:
performing an online check of the stored CRC check codes;
if the online check result of a stored CRC check code remains consistent, extracting the memory data corresponding to that CRC check code.
Further, extracting the memory data corresponding to the CRC check code also includes: taking a CRC check code whose online check result remains consistent, together with its memory data, and using them to replace any CRC check code and memory data whose online check result does not remain consistent.
To achieve the above purpose, the present invention also provides a system for detecting and recovering memory bit flipping in secondary power equipment, including a single-bit error recovery module, the single-bit error recovery module including:
ECC check code calculation sub-module: used to perform an ECC check code calculation on the application running area when the application is loaded, according to the preset ECC segment length, to obtain the load-time ECC check code of the segment data, and to perform an ECC check code calculation on the segment data of the running application, according to the preset ECC segment length, to obtain the run-time ECC check code of the segment data;
ECC check code storage sub-module: used to store the load-time ECC check code, once it has been obtained, in no fewer than three copies;
ECC check code comparison sub-module: used to compare the stored copies of the ECC check code with one another, one by one, and, if at least two of the stored copies agree, to take the agreed ECC check code and compare it with the run-time ECC check code;
ECC check code replacement sub-module: used to take the ECC check code on which the stored copies agree and use it to replace any stored copy that disagrees;
Single-bit error correction sub-module: used to correct the erroneous bit if the comparison result shows that a single-bit error has occurred in the segment data of the running application.
Further, the system also includes a redundant backup recovery module that uses the redundant backup recovery mechanism to restore the code segment of the running application, the redundant backup recovery module including:
Code segment acquisition sub-module: used to obtain the code segment of the application when the application is loaded;
CRC check code calculation sub-module: used to calculate the CRC check code of the application's code segment when the application is loaded, to obtain the load-time CRC check code of the code segment, and to calculate the CRC check code of the code segment of the running application, to obtain the run-time CRC check code of the code segment;
CRC check code storage sub-module: used to store the code segment and its CRC check code, once obtained at load time, in no fewer than two copies;
CRC check code online check sub-module: used to perform an online check of the stored CRC check codes;
CRC check code comparison sub-module: used to compare the load-time CRC check code with the run-time CRC check code;
Code segment overwrite sub-module: used, if the online check result of a stored CRC check code remains consistent, to extract the code segment corresponding to that CRC check code and use it to overwrite the code segment of the running application;
CRC check code replacement sub-module: used to take a CRC check code whose online check result remains consistent, together with its code segment, and use them to replace any CRC check code and code segment whose online check result does not remain consistent;
Code segment compression sub-module: used to compress the code segment after it has been obtained at load time, and to decompress the compressed code segment corresponding to the CRC check code before that code segment is extracted.
Further, the system also includes a key data detection and recovery module, the key data detection and recovery module including:
Memory address extraction sub-module: used to extract the memory address of the key data based on the pre-registration information of the key data;
Memory data acquisition sub-module: used to obtain the memory data at that memory address when the application is loaded;
The CRC check code calculation sub-module is also used to calculate the CRC check code of the memory data at the memory address when the application is loaded, to obtain the load-time CRC check code, and to calculate the CRC check code of the memory data at the memory address of the key data while the application is running, to obtain the run-time CRC check code;
The CRC check code storage sub-module is also used to store the memory data and its CRC check code, once obtained at load time, in no fewer than two copies;
The code segment overwrite sub-module is also used, if the online check result of a stored CRC check code remains consistent, to extract the memory data corresponding to that CRC check code and use it to overwrite the memory data at the memory address of the running application;
The CRC check code replacement sub-module is also used to take a CRC check code whose online check result remains consistent, together with its memory data, and use them to replace any CRC check code and memory data whose online check result does not remain consistent.
To achieve the above objective, the present invention also provides a computer processing control device, comprising:
Memory: used to store instructions;
Processor: used to operate according to the instructions so as to execute the steps of the method provided by the present invention for detecting and recovering from memory bit flips in secondary power equipment.
To achieve the above objective, the present invention also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the steps of the method provided by the present invention for detecting and recovering from memory bit flips in secondary power equipment.
Compared with the prior art, the present invention achieves the following beneficial effects. The method and system combine two mechanisms, ECC error detection and correction and redundant backup and recovery, to provide fast location and repair of single-bit errors and recovery from multi-bit errors respectively. This balances correction efficiency against correction capability and makes up for the inability of hardware ECC alone to correct errors of two or more bits, which helps ensure the reliable operation of secondary power equipment. Because no detection and recovery circuit is needed inside the processor, the approach is unaffected by changes in the processor's own process and design, so the existing architecture can be retained and development cost saved. It can also be extended to processors of different architectures and is therefore widely applicable.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the address allocation of the microcontroller's on-chip memory in the system embodiment of the present invention;
Figure 2 is a schematic diagram of the backup process for the application's code segment in the system embodiment of the present invention;
Figure 3 is a schematic diagram of the backup process for key data in the system embodiment of the present invention;
Figure 4 is a schematic diagram of the recovery process for the application's code segment in the system embodiment of the present invention;
Figure 5 is a schematic diagram of the recovery process for key data in the system embodiment of the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solutions of the present invention more clearly and do not limit its scope of protection.
Program segment data generally occupies contiguous addresses and therefore meets the operating requirements of an ECC error detection and correction algorithm, so a software ECC algorithm can be used to quickly locate and repair single-bit flip errors in program segment data. Because multi-bit data errors can also occur in system memory, a redundant backup and recovery mechanism is added on top of ECC to strengthen the correction capability of the error detection and correction program. Combining the high efficiency of ECC with the strong correction capability of redundant backup and recovery lets each mechanism compensate for the other's weakness, further ensuring the correctness of the program segment data and hence the reliability of the system. Key data generally does not occupy contiguous addresses, which makes ECC difficult to apply, and its volume is usually small. Redundant backup and recovery alone already meets the needs of most applications, so key data is protected by redundant backup and recovery only.
Based on the above technical approach, a specific embodiment of the present invention provides a method for detecting and recovering from memory bit flips in secondary power equipment, comprising the following steps:
Step 1: The error detection and correction program loads the application and computes an ECC check code over the application's run area segment by segment, using the ECC segment length (which is determined by the number of bits in the ECC code). To guard against the ECC check codes themselves being corrupted at run time, each ECC check code is stored in three copies. The ECC check codes of all segments are placed in the reserved memory area in segment order, three copies per segment.
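The per-segment calculation in Step 1 can be sketched as follows. The patent does not name a specific ECC, so this sketch assumes a simple Hamming-style code (the XOR of the 1-based positions of all set bits, plus an overall parity bit). `SEGMENT_BYTES`, `ecc_code` and `build_ecc_table` are hypothetical names, and a Python list stands in for the reserved memory area.

```python
SEGMENT_BYTES = 64  # hypothetical segment length; the patent derives it from the ECC width


def ecc_code(segment: bytes) -> tuple:
    """Return (position_xor, parity) for one segment (assumed Hamming-style code)."""
    position_xor = 0
    parity = 0
    for byte_index, byte in enumerate(segment):
        for bit in range(8):
            if (byte >> bit) & 1:
                position_xor ^= byte_index * 8 + bit + 1  # 1-based bit position
                parity ^= 1
    return position_xor, parity


def build_ecc_table(image: bytes, segment_bytes: int = SEGMENT_BYTES) -> list:
    """Compute the ECC of every segment and keep three copies per segment, in segment order."""
    table = []
    for offset in range(0, len(image), segment_bytes):
        code = ecc_code(image[offset:offset + segment_bytes])
        table.append([code, code, code])  # copies A, B and C
    return table
```

With this encoding a single flipped bit changes `position_xor` by exactly the position of that bit, which is what makes later single-bit location possible.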
Step 2: When the application is loaded, the error detection and correction program separates out the application's code segment according to the file type, compresses the code segment data and computes a CRC over it to obtain the CRC check code. At least two copies of the compressed code segment data and its CRC check code are stored in system memory or external storage. The data is compressed mainly to reduce the memory overhead of the redundant backups.
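A minimal sketch of the backup in Step 2, using Python's standard `zlib` module for both compression and CRC-32. The patent does not specify the compression algorithm, the CRC width, or whether the CRC covers the compressed or uncompressed bytes; this sketch assumes the CRC is taken over the uncompressed code segment so that the same value can later be compared against the running code. The dictionary layout and function names are hypothetical.

```python
import zlib
from typing import Optional


def backup_code_segment(code: bytes, copies: int = 2) -> list:
    """Create the redundant backups: compressed code plus a CRC-32 of the original code."""
    if copies < 2:
        raise ValueError("the method requires at least two backup copies")
    return [
        {"packed": bytearray(zlib.compress(code)), "crc": zlib.crc32(code)}
        for _ in range(copies)
    ]


def unpack_if_valid(backup: dict) -> Optional[bytes]:
    """Decompress one backup and return the code only if its CRC still matches."""
    try:
        code = zlib.decompress(bytes(backup["packed"]))
    except zlib.error:
        return None  # the compressed stream itself was damaged
    return code if zlib.crc32(code) == backup["crc"] else None
```

For example, `backup_code_segment(code)` yields two independent copies, and `unpack_if_valid` returns `None` for a copy whose stored CRC or compressed stream no longer checks out.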
Step 3: After the application starts, it passes parameters to the error detection and correction program, such as the number of segments to ECC-check in a single run and the action to take after an unrecoverable error. The application registers the key data that needs protection through an interface provided by the error detection and correction program. The error detection and correction program obtains the memory addresses of this key data, backs up the memory data at those addresses, and stores it sequentially after serializing it according to a hash table. The complete serialized content is stored in two or more copies, and a CRC is computed separately over each copy to obtain its CRC check code.
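The registration and backup in Step 3 might look like the sketch below. A `bytearray` stands in for device memory, and a dictionary keyed by address plays the role of the hash table. The class name, the record layout (little-endian address and length followed by the data) and the use of CRC-32 are all assumptions made for illustration; the patent does not define them.

```python
import struct
import zlib


class KeyDataRegistry:
    """Hypothetical registration interface offered to the application."""

    def __init__(self, memory: bytearray):
        self.memory = memory   # stand-in for the device's RAM
        self.regions = {}      # hash table: address -> length

    def register(self, address: int, length: int) -> None:
        if address < 0 or address + length > len(self.memory):
            raise ValueError("region lies outside memory")
        self.regions[address] = length

    def serialize(self) -> bytes:
        """Flatten every registered region into one contiguous blob."""
        blob = bytearray()
        for address, length in sorted(self.regions.items()):
            blob += struct.pack("<II", address, length)
            blob += self.memory[address:address + length]
        return bytes(blob)

    def make_backups(self, copies: int = 2) -> list:
        """Return the backup copies, each followed by its own CRC-32."""
        blob = self.serialize()
        return [bytearray(blob + struct.pack("<I", zlib.crc32(blob)))
                for _ in range(copies)]
```

Keeping a separate CRC on each copy means a damaged copy can be identified on its own, without first comparing it to the other one.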
Step 4: Once the system is running normally, the error detection and correction program periodically performs ECC error detection and correction on the original program segment data, one segment length at a time. On each pass it computes an ECC over one segment length of program data to obtain the ECC verification code (the value computed at run time) and evaluates it against the three pre-stored ECC check codes. First, the three ECC check codes are compared with one another; if two of them agree, that value is taken to be the correct ECC check code, and any stored copy that disagrees is restored to the correct value. Next, the correct ECC check code is compared with the ECC verification code. If they match, the segment is judged free of bit errors; otherwise a bit error is judged to have occurred in the segment. If the evaluation shows a single-bit error in the running application's segment data, the error is corrected directly using the correct ECC check code. To limit the effect of error detection and correction on system load, each run of the error detection and correction program checks only one segment length of data, and the next run checks the next segment.
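The evaluation in Step 4 can be sketched as a two-out-of-three vote followed by a syndrome check. As before, the Hamming-style code (XOR of 1-based set-bit positions plus overall parity) is an assumption, since the patent does not name the ECC; `vote` and `check_segment` are hypothetical names.

```python
def ecc_code(segment: bytes) -> tuple:
    """Assumed Hamming-style code: (XOR of 1-based set-bit positions, overall parity)."""
    position_xor, parity = 0, 0
    for byte_index, byte in enumerate(segment):
        for bit in range(8):
            if (byte >> bit) & 1:
                position_xor ^= byte_index * 8 + bit + 1
                parity ^= 1
    return position_xor, parity


def vote(copies: list):
    """Return the value held by at least two of the three copies, or None."""
    a, b, c = copies
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


def check_segment(segment: bytearray, copies: list) -> str:
    """Check one segment in place; repair a single flipped bit if possible."""
    good = vote(copies)
    if good is None:
        return "ecc_lost"            # all three stored copies disagree
    copies[:] = [good, good, good]   # restore any corrupted stored copy
    stored_xor, stored_parity = good
    now_xor, now_parity = ecc_code(segment)
    syndrome = stored_xor ^ now_xor
    if syndrome == 0 and stored_parity == now_parity:
        return "ok"
    if stored_parity != now_parity and 1 <= syndrome <= len(segment) * 8:
        position = syndrome - 1      # the syndrome is the flipped bit's position
        segment[position // 8] ^= 1 << (position % 8)
        return "corrected"
    return "multi_bit"               # hand over to redundant backup recovery
```

The `"ecc_lost"` and `"multi_bit"` results correspond to the two conditions that send the method into the redundant comparison and recovery flow. Like any single-error-correcting code, this sketch can miscorrect when three or more bits flip in one segment.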
Step 5: If the evaluation shows that the running application's segment data contains a flip error of two or more bits, or that all three ECC check codes are wrong, the redundant comparison and recovery flow is entered. First, the CRC check codes backed up in Step 2 are verified online: at a preset interval the current state of each CRC check code is compared with its previous state. If the comparison results are always consistent, the CRC check code is deemed correct; otherwise it is deemed wrong. A wrong CRC check code and its corresponding code segment backup can be replaced promptly with a correct CRC check code and its corresponding code segment backup. Then, for a flip error of two or more bits, the code segment corresponding to the correct CRC check code is used to overwrite the code segment of the running application. When all three ECC check codes are wrong, a CRC is computed over the running application's code segment to obtain the CRC verification code, which is compared with the correct CRC check code. If they match, the running application is deemed free of bit flip errors; if they do not match, a bit flip error is deemed to have occurred and the code segment corresponding to the correct CRC check code is used to overwrite the code segment of the running application. When overwriting the running application's code segment, the error detection and correction program calls a system interface to disable program preemption or to put the running program to sleep, decompresses the correct backup and restores it to the corresponding memory, overwrites the erroneous backup with the correct backup, backup check code and ECC check code, and finally re-enables program preemption or resumes the program.
Step 6: Once the system is running normally, the error detection and correction program periodically cycles through the registered key data to check its correctness. It looks up the memory addresses of the key data in the hash table and compares the content at those addresses with the backup: a CRC is computed over the content at the memory addresses to obtain the CRC verification code, which is compared with the correct CRC check code backed up in Step 3. The CRC check codes are themselves verified online, and a wrong CRC check code and its memory data are replaced at any time with a correct CRC check code and its memory data, which ensures that the CRC check code used for comparison is correct. If the CRC verification code does not match the correct CRC check code, the memory data corresponding to the correct CRC check code is extracted and used to overwrite the memory data at the corresponding addresses of the running application. During the overwrite, the error detection and correction program calls a system interface to disable program preemption or to put the running program to sleep, restores the correct backup to the corresponding memory addresses, restores the correct backup check code to any backup in which an error occurred, and finally re-enables program preemption or resumes the program.
In the method embodiment of the present invention, the error detection and correction program runs continuously in the background, repeatedly checking and correcting the program segment data and the key data. Error recovery operations use a sandbox-like mechanism so that the recovery process is not interrupted abnormally and corrupted code is not executed, either of which could put the whole system into an abnormal state; this ensures the correctness of the recovery flow and its results. After detecting and repairing an error, the error detection and correction program reports the address of the error, the error type and the current total error count to the application through an interface. When an unrecoverable error occurs, the exception can be handled according to behaviour defined in advance by the application, including but not limited to reloading the application, taking the system out of operation, or restarting the system through the watchdog.
A specific embodiment of the present invention provides a system for detecting and recovering from memory bit flips in secondary power equipment, used to implement the method described above. The system is the error detection and correction program itself, which performs application loading and booting, ECC check code calculation, ECC error detection and correction, backup and recovery of data, and handling of unrecoverable errors. The number of segments ECC-checked per run can be configured flexibly by the application, so the system can easily be applied to systems with different processor load levels: the frequency of error detection and correction is chosen according to the actual processor load so that the overall system load stays at a reasonable level.
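The configurable checking rate can be illustrated with a small round-robin scheduler. This is only a sketch: `SegmentScheduler` and its parameter names are hypothetical, and the patent states only that the number of segments checked per run is set by the application.

```python
class SegmentScheduler:
    """Hand out the segment indices to ECC-check on each periodic run."""

    def __init__(self, segment_count: int, segments_per_run: int = 1):
        if segment_count <= 0 or segments_per_run <= 0:
            raise ValueError("counts must be positive")
        self.segment_count = segment_count
        # Never check the same segment twice within one run.
        self.segments_per_run = min(segments_per_run, segment_count)
        self.next_segment = 0

    def next_batch(self) -> list:
        """Return this run's segments and advance, wrapping at the end of the image."""
        batch = [(self.next_segment + i) % self.segment_count
                 for i in range(self.segments_per_run)]
        self.next_segment = (self.next_segment + self.segments_per_run) % self.segment_count
        return batch
```

A lightly loaded processor could be given a larger `segments_per_run` so the whole image is swept more often, while a heavily loaded one keeps it at 1.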
The system of the present invention is based on a microcontroller environment without an operating system. Because neither a cache nor a memory management unit (MMU) is involved, logical addresses coincide with physical addresses, and what the processor reads from an address is the actual content of physical memory rather than a temporary copy held in a cache, which makes the method of the present invention comparatively easy to implement. This embodiment takes the microcontroller environment without an operating system as an example and implements the method in the microcontroller's on-chip memory, providing online error detection and recovery for that memory. The method can also be applied in other operating environments once some additional operations are added.
Figure 1 is a schematic diagram of the address allocation of the microcontroller's on-chip memory in the system embodiment of the present invention. Three programs run on it in total. The BOOT program initializes the microcontroller's operating environment and the processor's peripherals, bringing the whole system into its initial operating state. The error detection and correction program boots the application, performs ECC error detection and correction on the application code, and backs up and restores the application code and key data. The application implements the specific functions of the system. At design time, the memory between the low address of the reserved area (Resv_low) and the high address of the reserved area (Resv_high) is set aside for the exclusive use of the error detection and correction program, which stores there the per-segment ECC check codes of the application code and the backups of the application code and key data. Other programs must not use the reserved area; this can be arranged at the compile and link stage by specifying run address spaces for those programs that avoid the reserved area.
In this embodiment, the per-segment ECC check codes of the code segment are stored in three copies, namely ECC code A, ECC code B and ECC code C. The redundant backups of the code segment and of the key data are each kept in at least two copies, namely backup A and backup B.
The parameter setting, ECC check code calculation and program segment backup process for the application are shown in Figure 2. The error detection and correction program computes an ECC check code over the application code segment by segment, using the ECC segment length, and stores the results in the application code ECC area of the Resv region. To guard against the ECC check codes changing at run time, the ECC check code of each segment is stored in three copies, namely ECC code A, ECC code B and ECC code C. The error detection and correction program then separates out the code segment according to the application file type, compresses the code segment data and computes its check code, stores the compressed data and check code in application code backup A and backup B in the Resv region, and boots the application.
The key parameter setting and the backup process for the application's key data are shown in Figure 3. After the application starts running, it passes parameters to the error detection and correction program, such as the number of segments to ECC-check per run and the action to take after an unrecoverable error. The application then registers the key data that needs protection through the interface provided by the error detection and correction program. The error detection and correction program obtains the memory addresses of all key data, backs up the content at those addresses, and serializes it according to a hash table. The complete serialized content is stored sequentially in data backup A and data backup B in the key data backup area. A check code is computed separately for each backup to verify the correctness of that backup and is appended to the end of the backup data.
The error detection and recovery process for the application's program segments is shown in Figure 4. Once the system is running normally, the error detection and correction program periodically performs ECC error detection and correction on the original program segment data, one segment length at a time; the segment length was configured for the error detection and correction program by the application in the previous step. On each pass an ECC is computed over one segment length of program data and evaluated together with the pre-stored ECC code A, ECC code B and ECC code C. If the original data contains a single-bit error, it is corrected directly; if a pre-stored ECC check code is in error, that check code is restored to the correct value. To limit the effect of error detection and correction on system load, each run checks only one segment length of data, and the next run checks the next segment. When a multi-bit error that ECC cannot repair is detected, or when all of the ECC check codes are wrong, the redundant comparison and recovery flow is entered. When all of the ECC check codes are wrong, the program first verifies whether the currently running program is correct. If it is not, backup A and backup B are each verified. If backup A is correct, backup A is restored to the running code segment; system preemption must be disabled during the restore so that the application cannot preempt the restore and run partially restored code. If backup A is incorrect and backup B is correct, backup B is restored to the running code segment and also copied over backup A so that subsequent checks can continue correctly. If both backup A and backup B are incorrect, a very serious error has occurred and recovery is no longer possible; the error detection and correction program then handles the system in the way the application configured in advance, for example by restarting the system or making the application exit. The error detection and correction program also records the detected errors in its own dedicated variables for the application to consult. When a multi-bit error that ECC cannot repair is detected, only backup A and backup B need to be verified; there is no need to verify whether the currently running program is correct, and the remaining flow is the same as when all of the ECC check codes are wrong.
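The decision flow of Figure 4 can be sketched as below. This is an illustration under stated assumptions, not the patent's implementation: backups are modelled as dictionaries holding zlib-compressed code and a CRC-32 of the uncompressed code, the running program is verified by comparing its CRC with that of a valid backup, and all function names are hypothetical. Disabling preemption is represented only by a comment, because it depends on the target system.

```python
import zlib
from typing import Optional


def make_backup(code: bytes) -> dict:
    """Compressed copy plus a CRC-32 of the uncompressed code (assumed layout)."""
    return {"packed": bytearray(zlib.compress(code)), "crc": zlib.crc32(code)}


def unpack_if_valid(backup: dict) -> Optional[bytes]:
    """Decompress a backup and return it only if its CRC still matches."""
    try:
        code = zlib.decompress(bytes(backup["packed"]))
    except zlib.error:
        return None
    return code if zlib.crc32(code) == backup["crc"] else None


def recover_code(running: bytearray, backup_a: dict, backup_b: dict,
                 all_ecc_lost: bool) -> str:
    """Redundant comparison and recovery for the running code segment."""
    image_a = unpack_if_valid(backup_a)
    image_b = unpack_if_valid(backup_b)
    if image_a is None and image_b is None:
        return "unrecoverable"  # caller applies the application's configured policy
    if image_a is not None:
        good_backup, good_image = backup_a, image_a
    else:
        good_backup, good_image = backup_b, image_b
        # Backup A is bad and B is good: copy B over A so later checks still work.
        backup_a["packed"] = bytearray(backup_b["packed"])
        backup_a["crc"] = backup_b["crc"]
    if all_ecc_lost and zlib.crc32(bytes(running)) == good_backup["crc"]:
        return "running_ok"     # only the stored ECC copies were damaged
    # A real system would disable preemption (or suspend the application) here.
    running[:] = good_image
    return "restored"
```

A caller would map `"unrecoverable"` to whichever action the application registered, such as a watchdog reset, and would recompute the stored ECC copies after `"running_ok"` or `"restored"`.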
The error detection and recovery process for the application's key data is shown in Figure 5. Once the system is running normally, the error detection and correction program periodically cycles through the registered key data to check its correctness. It looks up the memory addresses of the data in the hash table and compares the key data with backup A. If they match, no error has occurred and the error detection and correction program returns to wait for its next run. If the key data does not match backup A, one of the two is in error, and the key data is then compared with backup B. If these match, the key data and backup B are both correct and the content of backup A is in error; backup B is copied over backup A and the error detection and correction program returns. If the key data differs from both backup A and backup B, backup A and backup B must each be verified. If backup A is correct, backup A is restored to the key data. If backup A is incorrect and backup B is correct, backup B is restored to the key data and also copied over backup A so that subsequent checks can continue correctly. System preemption must be disabled during the restore so that the application cannot preempt the restore and read partially restored data. If both backup A and backup B are incorrect, a very serious error has occurred and recovery is no longer possible; the error detection and correction program then handles the system in the way the application configured in advance, for example by restarting the system or making the application exit. The error detection and correction program also records the detected errors in its own dedicated variables for the application to consult.
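The comparison order of Figure 5 translates almost directly into code. In this sketch each backup is a `bytearray` holding the data followed by a 4-byte CRC-32, matching the description of a check code appended to the end of each backup; the use of CRC-32 and every function name are assumptions made for illustration.

```python
import zlib


def make_copy(data: bytes) -> bytearray:
    """Backup layout assumed here: data followed by a little-endian CRC-32."""
    return bytearray(data + zlib.crc32(data).to_bytes(4, "little"))


def payload(copy: bytearray) -> bytes:
    return bytes(copy[:-4])


def copy_ok(copy: bytearray) -> bool:
    return zlib.crc32(payload(copy)) == int.from_bytes(copy[-4:], "little")


def check_key_data(live: bytearray, copy_a: bytearray, copy_b: bytearray) -> str:
    """Compare live key data with backups A and B and repair whichever is wrong."""
    if bytes(live) == payload(copy_a):
        return "ok"
    if bytes(live) == payload(copy_b):
        copy_a[:] = copy_b              # live data and B agree, so A was damaged
        return "backup_a_repaired"
    # Live data differs from both backups: fall back to the backups' own CRCs.
    # A real system would disable preemption around the writes to live data.
    if copy_ok(copy_a):
        live[:] = payload(copy_a)
        return "restored_from_a"
    if copy_ok(copy_b):
        live[:] = payload(copy_b)
        copy_a[:] = copy_b              # keep A usable for later checks
        return "restored_from_b"
    return "unrecoverable"
```

The common case costs a single comparison against backup A; the CRCs are consulted only when the live data disagrees with both copies.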
Divided by functional module, the system of the present invention mainly comprises:
(1) Single-bit error recovery module:
ECC check code calculation sub-module: computes an ECC over the application's run area, using the preset ECC segment length, when the application is loaded, yielding the ECC check code of each segment at load time; and computes an ECC over the application's segment data, using the preset ECC segment length, while the application is running, yielding the ECC verification code of each segment at run time;
ECC check code storage sub-module: stores the ECC check codes obtained at load time, in no fewer than three copies;
ECC check code comparison sub-module: compares the stored ECC check codes with one another and, if at least two of them agree, compares the agreed ECC check code with the ECC verification code;
ECC check code replacement sub-module: takes the ECC check code on which the stored copies agree and uses it to replace any copy that disagrees;
Single-bit error correction sub-module: corrects the erroneous bit when the comparison shows that a single-bit error has occurred in the running application's segment data.
(2) Redundant backup recovery module
Code segment acquisition sub-module: acquires the application's code segment when the application is loaded;
CRC check code calculation sub-module: computes a CRC over the application's code segment when the application is loaded, to obtain the code segment's CRC check code, and computes a CRC over the application's code segment while the application is running, to obtain the code segment's CRC verification code;
CRC check code storage sub-module: after the code segment and its CRC check code have been obtained at load time, stores the code segment and its CRC check code in no fewer than two copies;
CRC check code online check sub-module: performs online checks on the stored CRC check codes;
CRC check code comparison sub-module: compares the CRC check code with the CRC verification code;
Code segment overwrite sub-module: if the online check result of a stored CRC check code is always consistent, extracts the code segment corresponding to that CRC check code and overwrites the application's running code segment with it;
CRC check code replacement sub-module: extracts a CRC check code whose online check result is always consistent, together with its code segment, and uses them to replace any CRC check code and code segment whose online check result is not always consistent;
Code segment compression sub-module: compresses the code segment after it is acquired at load time, and decompresses the compressed code segment corresponding to a CRC check code before that code segment is extracted.
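A minimal sketch of how the redundant backup recovery module could fit together, assuming CRC-32 and zlib compression (the text names neither a CRC polynomial nor a compression algorithm). The class `CodeBackup` and its methods are hypothetical names. As a simplification, a stored copy is treated as trustworthy when its decompressed payload still matches its stored CRC; the periodic online check of the stored CRC check codes is not modeled here.

```python
import zlib
from typing import List, Optional


class CodeBackup:
    """Keeps N compressed copies of a code segment plus its load-time CRC."""

    def __init__(self, code: bytes, copies: int = 2) -> None:
        if copies < 2:
            raise ValueError("the text calls for at least two stored copies")
        crc = zlib.crc32(code)  # CRC check code computed at load time
        # Each copy is an independent [compressed payload, CRC] pair.
        self.copies: List[list] = [[zlib.compress(code), crc] for _ in range(copies)]

    def _trusted_code(self) -> Optional[bytes]:
        """Return the first copy that decompresses and matches its own CRC."""
        for packed, crc in self.copies:
            try:
                code = zlib.decompress(packed)
            except zlib.error:
                continue  # this copy was itself damaged
            if zlib.crc32(code) == crc:
                return code
        return None

    def restore_if_corrupt(self, running: bytearray) -> bool:
        """Compare the run-time CRC with the backup; overwrite on mismatch."""
        good = self._trusted_code()
        if good is None:
            raise RuntimeError("no intact backup copy left")
        if zlib.crc32(running) == zlib.crc32(good):
            return False  # running code segment is intact
        running[:] = good  # overwrite the running code segment
        return True
```

Here `running` stands in for the application's code segment in memory; on a real device the overwrite would target the actual memory region rather than a `bytearray`.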
(3) Key data detection and recovery module
Memory address extraction sub-module: extracts the memory address of the key data from the key data's pre-registration information;
Memory data acquisition sub-module: acquires the memory data at that memory address when the application is loaded;
The aforementioned CRC check code calculation sub-module also computes a CRC over the memory data at that memory address when the application is loaded, to obtain a CRC check code, and computes a CRC over the memory data at the key data's memory address while the application is running, to obtain a CRC verification code;
The aforementioned CRC check code storage sub-module also stores the memory data and its CRC check code, once they have been obtained at load time, in no fewer than two copies;
The aforementioned code segment overwrite sub-module also, if the online check result of a stored CRC check code is always consistent, extracts the memory data corresponding to that CRC check code and overwrites the memory data at that memory address in the running application with it;
The aforementioned CRC check code replacement sub-module also extracts a CRC check code whose online check result is always consistent, together with its memory data, and uses them to replace any CRC check code and memory data whose online check result is not always consistent.
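The key data path can be sketched in the same spirit. The example below models device memory as a `bytearray` and the pre-registration information as a name mapped to an (offset, length) pair; both are assumptions for illustration, as are the names `KeyDataGuard`, `register`, `snapshot`, and `check`. It also assumes CRC-32, and that registered data is not legitimately modified after load.

```python
import zlib
from typing import Dict, List, Tuple


class KeyDataGuard:
    """Snapshots registered memory regions at load time and repairs them later."""

    def __init__(self, memory: bytearray) -> None:
        self.memory = memory
        self.regions: Dict[str, Tuple[int, int]] = {}             # name -> (offset, length)
        self.backups: Dict[str, List[Tuple[bytes, int]]] = {}     # name -> [(data, crc), ...]

    def register(self, name: str, offset: int, length: int) -> None:
        """Pre-register a key data item by its memory address and size."""
        self.regions[name] = (offset, length)

    def snapshot(self, copies: int = 2) -> None:
        """At load time: store the data and its CRC check code, at least twice."""
        for name, (offset, length) in self.regions.items():
            data = bytes(self.memory[offset:offset + length])
            self.backups[name] = [(data, zlib.crc32(data)) for _ in range(copies)]

    def check(self) -> List[str]:
        """At run time: recompute each CRC and overwrite regions that differ."""
        restored = []
        for name, (offset, length) in self.regions.items():
            # Use the first backup copy that still matches its own CRC.
            data, crc = next((d, c) for d, c in self.backups[name] if zlib.crc32(d) == c)
            if zlib.crc32(self.memory[offset:offset + length]) != crc:
                self.memory[offset:offset + length] = data
                restored.append(name)
        return restored
```

A call to `check()` returns the names of the regions it had to restore, which a caller could log or count as detected bit-flip events.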
An embodiment of the present invention further provides a computer processing control device, comprising:
Memory: stores instructions;
Processor: operates according to the instructions to execute the steps of the method for detecting and recovering from memory bit flips in secondary power equipment provided by the present invention.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the steps of the method for detecting and recovering from memory bit flips in secondary power equipment provided by the present invention.
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. This application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to its embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in them, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by that processor produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a particular manner, such that the instructions stored in that memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is executed on it to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
The above describes only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make further improvements and modifications without departing from the technical principles of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims (18)

  1. A method for detecting and recovering from memory bit flips in secondary power equipment, characterized by comprising the following steps:
    computing ECC check codes over the application's running area when the application is loaded, according to a preset ECC segment length, to obtain the ECC check code of the application's segment data at load time;
    computing ECC check codes over the application's segment data while the application is running, according to the preset ECC segment length, to obtain the ECC verification code of the application's segment data at run time;
    comparing the ECC check code with the ECC verification code;
    if the comparison result indicates that a single-bit error has occurred in the application's segment data at run time, correcting the erroneous bit.
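A minimal sketch of the load-time and run-time ECC comparison with single-bit correction, assuming a Hamming-style code in which the check value is the XOR of the 1-based positions of all set bits plus one overall parity bit. The text does not say which ECC is used, so this choice, and the names `ecc_code` and `check_and_correct`, are illustrative assumptions.

```python
from typing import Tuple


def ecc_code(segment: bytes) -> Tuple[int, int]:
    """Return (XOR of 1-based positions of set bits, overall parity)."""
    pos_xor = 0
    parity = 0
    for byte_index, byte in enumerate(segment):
        for bit in range(8):
            if (byte >> bit) & 1:
                pos_xor ^= byte_index * 8 + bit + 1
                parity ^= 1
    return pos_xor, parity


def check_and_correct(segment: bytearray, stored: Tuple[int, int]) -> str:
    """Compare the load-time code with a freshly computed one and repair."""
    current = ecc_code(bytes(segment))
    syndrome = stored[0] ^ current[0]       # position of a single flipped bit
    parity_changed = stored[1] != current[1]
    if syndrome == 0 and not parity_changed:
        return "ok"
    if parity_changed and 1 <= syndrome <= len(segment) * 8:
        pos = syndrome - 1
        segment[pos // 8] ^= 1 << (pos % 8)  # flip the erroneous bit back
        return "corrected"
    # Parity unchanged with a non-zero syndrome indicates an even number of
    # flips (for example two), which this code can detect but not correct.
    return "uncorrectable"
```

With this kind of code, one flipped bit changes the parity and makes the syndrome equal to its position, so it can be repaired in place. Three or more simultaneous flips can be mistaken for a single-bit error, which is a known limit of single-error-correcting, double-error-detecting codes.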
  2. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 1, characterized in that, after obtaining the ECC check code of the application's segment data at load time, the method further comprises: storing the obtained ECC check code in no fewer than three copies;
    comparing the ECC check code with the ECC verification code comprises:
    comparing the stored ECC check codes with one another, one by one;
    if at least two of the stored ECC check codes agree, extracting an ECC check code that agrees and comparing it with the ECC verification code.
  3. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 2, characterized in that comparing the ECC check code with the ECC verification code further comprises: extracting an ECC check code on which the mutual comparison agrees, and using it to replace any ECC check code on which the mutual comparison disagrees.
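The pairwise comparison and replacement of stored ECC check codes amounts to a majority vote with repair. A minimal sketch, assuming the copies are held in a Python list of `bytes` (the function name `vote_and_repair` is hypothetical):

```python
from typing import List, Optional


def vote_and_repair(copies: List[bytes]) -> Optional[bytes]:
    """Pairwise-compare stored copies; return the agreed value or None.

    When at least two copies agree, every disagreeing copy is overwritten
    with the agreed value. When no two copies agree, None is returned and
    the caller is expected to fall back to redundant backup recovery.
    """
    for i in range(len(copies)):
        for j in range(i + 1, len(copies)):
            if copies[i] == copies[j]:
                agreed = copies[i]
                for k in range(len(copies)):
                    if copies[k] != agreed:
                        copies[k] = agreed  # replace the inconsistent copy
                return agreed
    return None
```

Storing three copies means a bit flip in any one stored check code can be outvoted by the other two.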
  4. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 2, characterized in that comparing the ECC check code with the ECC verification code further comprises:
    if none of the stored ECC check codes agree with one another, using a redundant backup recovery mechanism to recover the application's running code segment.
  5. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 1, characterized in that, if the comparison result indicates that no fewer than two bits of the application's segment data are in error at run time, a redundant backup recovery mechanism is used to recover the application's running code segment.
  6. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 4 or 5, characterized in that using a redundant backup recovery mechanism to recover the application's running code segment comprises:
    acquiring the application's code segment when the application is loaded and computing a CRC over it, to obtain the code segment's CRC check code;
    computing a CRC over the application's code segment while the application is running, to obtain the code segment's CRC verification code;
    comparing the CRC check code with the CRC verification code;
    if the CRC check code and the CRC verification code do not match, extracting the code segment corresponding to the CRC check code and overwriting the application's running code segment with it.
  7. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 6, characterized in that, after acquiring the application's code segment at load time, the method further comprises: compressing the acquired code segment;
    before extracting the code segment corresponding to the CRC check code, the method further comprises: decompressing the compressed code segment corresponding to the CRC check code.
  8. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 6, characterized in that, after acquiring the application's code segment and its CRC check code at load time, the method further comprises: storing the acquired code segment and its CRC check code in no fewer than two copies;
    extracting the code segment corresponding to the CRC check code comprises:
    performing online checks on the stored CRC check codes;
    if the online check result of a stored CRC check code is always consistent, extracting the code segment corresponding to that CRC check code.
  9. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 8, characterized in that extracting the code segment corresponding to the CRC check code further comprises: extracting a CRC check code whose online check result is always consistent, together with its code segment, and using them to replace any CRC check code and code segment whose online check result is not always consistent.
  10. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 8, characterized in that performing online checks on the stored CRC check codes comprises: at a preset time interval, comparing the current state of each stored CRC check code with its previous state.
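One way to read the online check is as a periodic scan that remembers the value of each stored CRC check code from the previous pass and marks a copy as no longer "always consistent" once it is seen to change. The sketch below follows that reading; the class `OnlineChecker` and its methods are hypothetical, and scheduling at the preset interval is left to the caller (for example a timer task).

```python
from typing import List


class OnlineChecker:
    """Tracks whether each stored CRC check code has ever changed."""

    def __init__(self, stored_crcs: List[int]) -> None:
        self.stored = stored_crcs            # live list that bit flips may hit
        self.previous = list(stored_crcs)    # state seen at the previous check
        self.consistent = [True] * len(stored_crcs)

    def tick(self) -> None:
        """Run one check; call once per preset time interval."""
        for i, current in enumerate(self.stored):
            if current != self.previous[i]:
                self.consistent[i] = False   # state changed since last check
            self.previous[i] = current

    def trusted_indices(self) -> List[int]:
        """Indices of copies whose check result has always been consistent."""
        return [i for i, ok in enumerate(self.consistent) if ok]
```

A copy listed by `trusted_indices()` would be the source for overwriting the running code segment and for replacing copies that failed the check. In this sketch the remembered previous state is itself ordinary memory and is not protected further.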
  11. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 1, characterized by further comprising: detecting and recovering key data pre-registered in the application;
    the method for detecting and recovering the key data comprises:
    extracting the memory address of the key data from the key data's pre-registration information;
    acquiring the memory data at that memory address when the application is loaded and computing a CRC over it, to obtain a CRC check code;
    computing a CRC over the memory data at the key data's memory address while the application is running, to obtain a CRC verification code;
    comparing the CRC check code with the CRC verification code;
    if the CRC check code and the CRC verification code do not match, extracting the memory data corresponding to the CRC check code and overwriting the memory data at that memory address in the running application with it.
  12. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 11, characterized in that, after acquiring the memory data at that memory address and its CRC check code at load time, the method further comprises: storing the acquired memory data and its CRC check code in no fewer than two copies;
    extracting the memory data corresponding to the CRC check code comprises:
    performing online checks on the stored CRC check codes;
    if the online check result of a stored CRC check code is always consistent, extracting the memory data corresponding to that CRC check code.
  13. The method for detecting and recovering from memory bit flips in secondary power equipment according to claim 12, characterized in that extracting the memory data corresponding to the CRC check code further comprises: extracting a CRC check code whose online check result is always consistent, together with its memory data, and using them to replace any CRC check code and memory data whose online check result is not always consistent.
  14. A system for detecting and recovering from memory bit flips in secondary power equipment, characterized by comprising a single-bit error recovery module, the single-bit error recovery module comprising:
    ECC check code calculation sub-module: computes ECC check codes over the application's running area when the application is loaded, according to a preset ECC segment length, to obtain the ECC check code of the application's segment data at load time, and computes ECC check codes over the application's segment data while the application is running, according to the preset ECC segment length, to obtain the ECC verification code of the application's segment data at run time;
    ECC check code storage sub-module: after the ECC check code of the application's segment data has been obtained at load time, stores the obtained ECC check code in no fewer than three copies;
    ECC check code comparison sub-module: compares the stored ECC check codes with one another, one by one, and, if at least two of the stored ECC check codes agree, extracts an ECC check code that agrees and compares it with the ECC verification code;
    ECC check code replacement sub-module: extracts an ECC check code on which the mutual comparison agrees, and uses it to replace any ECC check code on which the mutual comparison disagrees;
    Single-bit error correction sub-module: if the comparison result indicates that a single-bit error has occurred in the application's segment data at run time, corrects the erroneous bit.
  15. The system for detecting and recovering from memory bit flips in secondary power equipment according to claim 14, characterized by further comprising a redundant backup recovery module that uses a redundant backup recovery mechanism to recover the application's running code segment, the redundant backup recovery module comprising:
    Code segment acquisition sub-module: acquires the application's code segment when the application is loaded;
    CRC check code calculation sub-module: computes a CRC over the application's code segment when the application is loaded, to obtain the code segment's CRC check code, and computes a CRC over the application's code segment while the application is running, to obtain the code segment's CRC verification code;
    CRC check code storage sub-module: after the code segment and its CRC check code have been obtained at load time, stores the code segment and its CRC check code in no fewer than two copies;
    CRC check code online check sub-module: performs online checks on the stored CRC check codes;
    CRC check code comparison sub-module: compares the CRC check code with the CRC verification code;
    Code segment overwrite sub-module: if the online check result of a stored CRC check code is always consistent, extracts the code segment corresponding to that CRC check code and overwrites the application's running code segment with it;
    CRC check code replacement sub-module: extracts a CRC check code whose online check result is always consistent, together with its code segment, and uses them to replace any CRC check code and code segment whose online check result is not always consistent;
    Code segment compression sub-module: compresses the code segment after it is acquired at load time, and decompresses the compressed code segment corresponding to a CRC check code before that code segment is extracted.
  16. The system for detecting and recovering from memory bit flips in secondary power equipment according to claim 15, characterized by further comprising a key data detection and recovery module, the key data detection and recovery module comprising:
    Memory address extraction sub-module: extracts the memory address of the key data from the key data's pre-registration information;
    Memory data acquisition sub-module: acquires the memory data at that memory address when the application is loaded;
    the CRC check code calculation sub-module also computes a CRC over the memory data at that memory address when the application is loaded, to obtain a CRC check code, and computes a CRC over the memory data at the key data's memory address while the application is running, to obtain a CRC verification code;
    the CRC check code storage sub-module also stores the memory data and its CRC check code, once they have been obtained at load time, in no fewer than two copies;
    the code segment overwrite sub-module also, if the online check result of a stored CRC check code is always consistent, extracts the memory data corresponding to that CRC check code and overwrites the memory data at that memory address in the running application with it;
    the CRC check code replacement sub-module also extracts a CRC check code whose online check result is always consistent, together with its memory data, and uses them to replace any CRC check code and memory data whose online check result is not always consistent.
  17. A computer processing control device, characterized by comprising:
    Memory: stores instructions;
    Processor: operates according to the instructions to execute the steps of the method according to any one of claims 1 to 13.
  18. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 13.
PCT/CN2020/114368 2020-04-16 2020-09-10 Method and system for detecting and recovering memory bit flipping in secondary power equipment WO2021208341A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010299597.X 2020-04-16
CN202010299597.XA CN111552590B (en) 2020-04-16 2020-04-16 Detection and recovery method and system for memory bit overturning of power secondary equipment

Publications (1)

Publication Number Publication Date
WO2021208341A1 true WO2021208341A1 (en) 2021-10-21

Family

ID=72007435

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/114368 WO2021208341A1 (en) 2020-04-16 2020-09-10 Method and system for detecting and recovering memory bit flipping in secondary power equipment

Country Status (2)

Country Link
CN (1) CN111552590B (en)
WO (1) WO2021208341A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238035A (en) * 2022-02-23 2022-03-25 南京芯驰半导体科技有限公司 Method and system for error detection through running state fingerprint
CN114579352A (en) * 2022-04-29 2022-06-03 阿里云计算有限公司 Data reconstruction method and device

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552590B (en) * 2020-04-16 2022-09-30 国电南瑞科技股份有限公司 Detection and recovery method and system for memory bit overturning of power secondary equipment
CN112053737B (en) * 2020-08-21 2022-08-26 国电南瑞科技股份有限公司 Online parallel processing soft error real-time error detection and recovery method and system
CN114253758A (en) * 2020-09-21 2022-03-29 华为技术有限公司 Data processing method and related device
CN114598418A (en) * 2020-12-07 2022-06-07 山东新松工业软件研究院股份有限公司 Method, device and system applied to encoder data transmission
CN112860500B (en) * 2021-02-22 2024-03-22 四川腾盾科技有限公司 Power-on self-detection method for redundant aircraft management computer board
CN115421967B (en) * 2022-11-04 2022-12-30 中国电力科学研究院有限公司 Method and system for evaluating storage abnormal risk point of secondary equipment
CN116107800B (en) * 2023-04-12 2023-08-15 浙江恒业电子股份有限公司 Verification code generation method, data recovery method, medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120110411A1 (en) * 2010-10-29 2012-05-03 Brocade Communications Systems, Inc. Content Addressable Memory (CAM) Parity And Error Correction Code (ECC) Protection
CN104616698A (en) * 2015-01-28 2015-05-13 山东华翼微电子技术股份有限公司 Method for sufficiently utilizing memory redundancy unit
CN108345430A (en) * 2017-12-27 2018-07-31 北京兆易创新科技股份有限公司 A kind of Nand flash elements and its progress control method and device
CN110222501A (en) * 2019-05-31 2019-09-10 河南思维轨道交通技术研究院有限公司 A kind of inspection method of runtime code, storage medium
CN111552590A (en) * 2020-04-16 2020-08-18 国电南瑞科技股份有限公司 Detection and recovery method and system for memory bit overturning of power secondary equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101232348B (en) * 2006-10-04 2014-03-26 马维尔国际贸易有限公司 Method and device for error correcting using cyclic redundancy check
CN104598342B (en) * 2014-12-31 2018-05-01 曙光信息产业(北京)有限公司 The detection method and device of memory
CN109800104A (en) * 2018-12-18 2019-05-24 盛科网络(苏州)有限公司 Detection method, device, storage medium and the electronic device of data storage
CN110289041B (en) * 2019-06-25 2021-05-18 浙江大学 Memory detection device combining BIST and ECC in system chip

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238035A (en) * 2022-02-23 2022-03-25 南京芯驰半导体科技有限公司 Method and system for error detection through running state fingerprint
CN114238035B (en) * 2022-02-23 2022-06-21 南京芯驰半导体科技有限公司 Method and system for error detection through running state fingerprint
CN114579352A (en) * 2022-04-29 2022-06-03 阿里云计算有限公司 Data reconstruction method and device

Also Published As

Publication number Publication date
CN111552590B (en) 2022-09-30
CN111552590A (en) 2020-08-18

Similar Documents

Publication Publication Date Title
WO2021208341A1 (en) Method and system for detecting and recovering memory bit flipping in secondary power equipment
TWI537967B (en) Methods and apparatus to protect segments of memory
US8589759B2 (en) RAM single event upset (SEU) method to correct errors
US9891917B2 (en) System and method to increase lockstep core availability
KR101557572B1 (en) Memory circuits, method for accessing a memory and method for repairing a memory
US8996953B2 (en) Self monitoring and self repairing ECC
US9208027B2 (en) Address error detection
JP7418397B2 (en) Memory scan operation in response to common mode fault signals
US9934085B2 (en) Invoking an error handler to handle an uncorrectable error
WO2022037022A1 (en) Online parallel processing soft error real-time error detection and recovery method and system
US7373558B2 (en) Vectoring process-kill errors to an application program
CN112328396A (en) Dynamic self-adaptive SOPC fault-tolerant method based on task level
CN113608720B (en) Single event upset resistant satellite-borne data processing system and method
US9329926B1 (en) Overlapping data integrity for semiconductor devices
US7240272B2 (en) Method and system for correcting errors in a memory device
CN112559395A (en) Relay protection device and method based on dual-Soc storage system exception handling mechanism
CN113626246A (en) Single-bit overturning fast repairing method and device, computer equipment and storage medium
Garg Soft error fault tolerant systems: cs456 survey
Zhai et al. A software approach to protecting embedded system memory from single event upsets
Kim et al. A Page-mapping Consistency Protecting Method for Soft Error Damage in Flash-based Storage
Eichhorn et al. Techniques to maximize software reliability in radiation fields
RU2465636C1 (en) Method of correcting single errors and preventing double errors in register file and apparatus for realising said method
SE1300783A1 (en) Handling soft errors in connection with data storage
Lee et al. Error Detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20930910; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20930910; Country of ref document: EP; Kind code of ref document: A1)