CN111858192A - Spatial single-particle upset autonomous fault-tolerant method - Google Patents
Spatial single-particle upset autonomous fault-tolerant method
- Publication number
- CN111858192A (application CN202010713393.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- data block
- partitions
- auxiliary
- redundant memory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2017—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where memory access, memory control or I/O control functionality is redundant
- G06F11/1629—Error detection by comparing the output of redundant processing systems
- G06F11/165—Error detection by comparing the output of redundant processing systems with continued operation after detection of the error
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Hardware Redundancy (AREA)
Abstract
The invention discloses a space single event upset autonomous fault-tolerant method, relating to the technical field of computers. Three redundant memory pools are created for data according to the size of the data and are allocated to three partitions of a system respectively. One of the three redundant memory pools is set as a main data block and the other two as auxiliary data blocks, giving a first auxiliary data block and a second auxiliary data block, and the storage addresses of the data in the three partitions are recorded in a variable-address mapping table. The data is written into the data areas of the three partitions in sequence. The data in the main data block is then acquired according to the variable-address mapping table, and it is determined whether the EDAC circuit has been triggered to perform a memory error check; if so, the data in the three partitions is processed using a two-out-of-three strategy. The method ensures the address distribution among the redundant memory pools and the isolation among the data and, by also making use of the EDAC memory exception capture mechanism, improves fault-tolerance efficiency and reduces fault-tolerance cost.
Description
Technical Field
The invention relates to the technical field of computers, and in particular to a space single event upset autonomous fault-tolerant method.
Background
In computer systems used in the aerospace field, device failure due to the harsh space environment is one of the root causes of reliability degradation or failure of computer systems.
Radiation in space consists of particles such as electrons, neutrons, and protons emitted by various sources inside and outside the solar system. These particles generally have high energy, and the resulting radiation effects can not only degrade electronics but also cause system failure.
The radiation sources inside and outside the solar system produce two main types of radiation: solar radiation and galactic cosmic rays. The radiation to which electronic equipment in space is exposed comes mainly from the Earth's radiation belts, galactic cosmic rays, solar particle events, and so on. A satellite-borne system is easily disturbed by the Earth's electromagnetic field, in particular by the Earth's belts of charged particles (the Van Allen radiation belts, in which the proton concentration zone lies 400-900 km above the Earth's surface and the electron concentration zone 900-56000 km above it), and a solar flare containing a large number of high-energy protons and heavy ions can increase the influence of the Van Allen radiation belts by 3 orders of magnitude. Cosmic rays, which consist of ions such as hydrogen and nickel nuclei (about 85% protons, about 13% α particles, and about 2% heavy nuclei), also carry very high radiation energy.
The particles in the space environment span a wide range of energies and types, and aerospace equipment is strongly affected by radiation, so the radiation effect is the primary problem to be considered in the design and test verification of aerospace weapon model software.
Radiation effects can be roughly divided into two main categories according to how they act: the total dose accumulation effect and the single event effect. The total dose accumulation effect is the gradual aging of electronic equipment that stays in a high-intensity irradiation environment for a long time. The single event effect is a change in the state of an electronic device caused by a single high-energy particle, and includes three different effects: bit upset (single event upset), single event latch-up, and single event breakdown.
Single event upsets cause transient faults, which can largely be avoided by software methods, and bit upsets account for the majority of single event effects. A software fault-tolerance method can therefore greatly reduce the influence of the single event effect on an aerospace computer system. Existing technical means of addressing the single event effect include:
(1) data redundancy coding: key data of the system is encoded with check codes such as Hamming codes and cyclic codes, and errors detected in the data are corrected automatically;
(2) memory redundancy allocation: key data is allocated to three memory blocks at the same time as redundant backups, and when the data is read the copies are compared before the data is output;
(3) EDAC memory error detection: abnormal memory data is detected by an EDAC service or an EDAC kernel module, and when a single event upset is detected, recovery is performed through a system backup mechanism;
(4) system-level fault tolerance: heartbeat detection and state detection are used among multi-level systems, and when a single event upset is detected on the host, the system is restarted or a backup system is started so that operation continues normally.
The main disadvantages of the prior art are as follows:
(1) low execution efficiency
Encoding and decoding key data with data redundancy codes such as Hamming codes and cyclic codes, and fault tolerance by memory redundancy allocation, both suffer from low execution efficiency;
(2) insufficient data isolation
Data fault tolerance by memory redundancy allocation suffers from insufficient isolation among data: the physical address distribution among the redundant memory pools cannot be guaranteed, data races occur when several tasks perform redundant memory allocation at the same time, and so on;
(3) high fault-tolerance cost
EDAC memory error detection and system-level fault tolerance both require additional hardware, which increases the fault-tolerance cost.
Disclosure of Invention
To overcome the defects in the prior art, an embodiment of the invention provides a space single event upset autonomous fault-tolerant method, which comprises the following steps:
creating three redundant memory pools for data according to the size of the data;
performing three consecutive memory allocation operations under a semaphore mechanism, and allocating the three redundant memory pools to three partitions of a system respectively;
setting one of the three redundant memory pools as a main data block and the other two as auxiliary data blocks, to obtain a first auxiliary data block and a second auxiliary data block, and recording the storage addresses of the data in the three partitions in a variable-address mapping table;
writing the data into the data areas of the three partitions in sequence;
acquiring the data in the main data block according to the variable-address mapping table;
determining whether the EDAC circuit has been triggered to perform a memory error check;
if not, sending the data to the user;
if so, processing the data in the three partitions using a two-out-of-three strategy.
Preferably, processing the data in the three partitions using the two-out-of-three strategy comprises:
when the data in the main data block differs from the data in the first auxiliary data block and the second auxiliary data block, and no EDAC error has occurred in the data in the first auxiliary data block or the second auxiliary data block, acquiring the data in the first auxiliary data block and the second auxiliary data block according to the variable-address mapping table, sending that data to the user, and replacing the data in the main data block with it.
Preferably, processing the data in the three partitions using the two-out-of-three strategy further comprises:
when the data in the main data block, the first auxiliary data block, and the second auxiliary data block differ from one another and EDAC errors have occurred in the data in all three blocks, sending the data to the user and replacing the data in the first auxiliary data block and in the second auxiliary data block with the data.
Preferably, after the data is sent to the user, the method further comprises:
if the user releases the data, releasing the data in the three partitions at the same time according to the variable-address mapping table, and releasing the variable-address mapping table.
Preferably, the three redundant memory pools are the same size and are separated by a certain distance in the address distribution.
Preferably, the data includes data in a ready queue, a blocking queue, and a delay queue.
The space single event upset autonomous fault-tolerant method provided by the embodiment of the invention has the following beneficial effects:
By applying cross-partition redundant backup and a mutual exclusion lock mechanism to key data, the method ensures the address distribution among the redundant memory pools and the isolation among the data; by also making use of the EDAC memory exception capture mechanism, it improves fault-tolerance efficiency and reduces fault-tolerance cost.
Drawings
Fig. 1 is a schematic flow chart of the space single event upset autonomous fault-tolerant method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the technical framework corresponding to the space single event upset autonomous fault-tolerant method according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the embodiments.
As shown in Fig. 1, the space single event upset autonomous fault-tolerant method provided in the embodiment of the present invention includes the following steps:
S101, creating three redundant memory pools for the data according to the size of the data.
As a specific embodiment of the present invention, the data consists of operating system kernel data structures, including a ready queue, a blocking queue, a delay queue, and the like. These structures manage and control the execution of the operating system kernel, and tampering with them has serious consequences.
The redundant memory pools are requested through the memory allocation interface provided by the operating system (a first-fit algorithm); if memory space is insufficient or the request fails, all steps end.
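A first-fit request of the kind named in this step can be illustrated with a minimal Python sketch. The free-list representation and the names `FreeBlock` and `first_fit_alloc` are assumptions made for illustration and are not the operating system's actual interface; a failed request returns `None`, which corresponds to the case where the method ends.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FreeBlock:
    start: int  # start address of the free region
    size: int   # size of the free region in bytes


def first_fit_alloc(free_list: List[FreeBlock], size: int) -> Optional[int]:
    """Return the start address of the first free block that fits, or None."""
    for i, block in enumerate(free_list):
        if block.size >= size:
            addr = block.start
            if block.size == size:
                free_list.pop(i)      # block consumed entirely
            else:
                block.start += size   # shrink the block from the front
                block.size -= size
            return addr
    return None  # insufficient memory: the caller stops here


# Hypothetical free list: the first block is too small, so the second is used.
free_list = [FreeBlock(0x1000, 64), FreeBlock(0x2000, 256)]
addr = first_fit_alloc(free_list, 128)
```

First fit scans the free list in order and takes the first region that is large enough, which keeps the search short at the price of some fragmentation near the start of the list.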
S102, performing three consecutive memory allocation operations under a semaphore mechanism, and allocating the three redundant memory pools to three partitions of the system respectively.
S103, setting one of the three redundant memory pools as a main data block and the other two as auxiliary data blocks, to obtain a first auxiliary data block and a second auxiliary data block, and recording the storage addresses of the data in the three partitions in a variable-address mapping table.
As a specific embodiment of the present invention, as shown in Fig. 2, the technical framework corresponding to the space single event upset autonomous fault-tolerant method includes 3 redundant memory pools, each of which corresponds to one mapping address.
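Steps S102 and S103 might be sketched as follows. This is a simplified illustration under stated assumptions: the partition base addresses, `PARTITION_SIZE`, the bump-pointer allocation, and the names `alloc_redundant` and `mapping_table` are all hypothetical, and a binary semaphore stands in for the semaphore mechanism so that the three allocations of one task cannot interleave with those of another.

```python
import threading
from typing import Dict, List, Optional

# Hypothetical layout: three partitions at widely separated base addresses.
PARTITION_BASES = [0x1000_0000, 0x2000_0000, 0x3000_0000]
PARTITION_SIZE = 0x1000

alloc_sem = threading.Semaphore(1)        # binary semaphore guarding allocation
next_free = [0, 0, 0]                     # next free offset in each partition
mapping_table: Dict[str, List[int]] = {}  # variable name -> [main, aux1, aux2]


def alloc_redundant(name: str, size: int) -> Optional[List[int]]:
    """Allocate one pool per partition inside a single critical section."""
    with alloc_sem:
        if any(off + size > PARTITION_SIZE for off in next_free):
            return None                   # at least one partition is full
        addrs = []
        for p in range(3):
            addrs.append(PARTITION_BASES[p] + next_free[p])
            next_free[p] += size
        mapping_table[name] = addrs       # index 0 is the main data block
        return addrs


addrs = alloc_redundant("ready_queue", 64)
```

Holding the semaphore across all three allocations is what prevents two tasks from ending up with interleaved or partially allocated copies, and placing each copy in its own partition keeps the copies apart in the address space.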
S104, writing the data into the data areas of the three partitions in sequence.
S105, acquiring the data in the main data block according to the variable-address mapping table.
S106, determining whether the EDAC circuit has been triggered to perform a memory error check.
The EDAC circuit implements error detection and correction with a linear block code. During encoding, check bits are generated for the information data to be encoded, and the check bits are stored in memory together with the data. During decoding, check bits are generated again from the information data and XORed with the check bits generated during encoding to obtain the syndrome; the syndrome locates the error in the information data, and correctable errors are corrected.
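The syndrome computation can be illustrated with a small software model. The sketch below assumes a Hamming(7,4) code purely for illustration; the text does not say which linear block code the EDAC circuit uses, and real EDAC hardware typically protects wider words. The function names `check_bits` and `decode` are hypothetical.

```python
from typing import List, Tuple


def check_bits(d: List[int]) -> List[int]:
    """Compute the three Hamming(7,4) check bits for data bits d1..d4."""
    d1, d2, d3, d4 = d
    return [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4]


def decode(data: List[int], stored_check: List[int]) -> Tuple[List[int], int]:
    """Return (corrected data bits, syndrome); syndrome 0 means no error."""
    recomputed = check_bits(data)
    s = [a ^ b for a, b in zip(recomputed, stored_check)]
    syndrome = s[0] | (s[1] << 1) | (s[2] << 2)  # codeword position of the error
    # Codeword positions 3, 5, 6, 7 hold d1..d4; positions 1, 2, 4 hold check bits.
    position_to_data_index = {3: 0, 5: 1, 6: 2, 7: 3}
    corrected = list(data)
    if syndrome in position_to_data_index:
        corrected[position_to_data_index[syndrome]] ^= 1
    return corrected, syndrome


word = [1, 0, 1, 1]
check = check_bits(word)   # stored alongside the data on write
upset = [1, 0, 0, 1]       # d3 flipped by a simulated single event upset
fixed, syndrome = decode(upset, check)
```

A nonzero syndrome is the software analogue of the EDAC circuit being triggered; in this model it also identifies which single bit to flip back.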
S107, if not, sending the data to the user.
S108, if so, processing the data in the three partitions using a two-out-of-three strategy.
Optionally, processing the data in the three partitions using the two-out-of-three strategy includes:
when the data in the main data block differs from the data in the first auxiliary data block and the second auxiliary data block, and no EDAC error has occurred in the data in the first auxiliary data block or the second auxiliary data block, acquiring the data in the first auxiliary data block and the second auxiliary data block according to the variable-address mapping table, sending that data to the user, and replacing the data in the main data block with it.
Optionally, processing the data in the three partitions using the two-out-of-three strategy further includes:
when the data in the main data block, the first auxiliary data block, and the second auxiliary data block differ from one another and EDAC errors have occurred in the data in all three blocks, sending the data to the user and replacing the data in the first auxiliary data block and in the second auxiliary data block with the data.
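The read path of steps S105 to S108, including the two cases just described, could be modelled roughly as below. This is a sketch under assumptions: each copy is represented as a `bytes` value, the per-copy EDAC status is passed in as a list of booleans, the first case additionally requires the two auxiliary copies to agree, and the final majority fallback covers combinations the text does not spell out. The name `read_with_voting` is hypothetical.

```python
from typing import List


def read_with_voting(blocks: List[bytes], edac_error: List[bool]) -> bytes:
    """blocks = [main, aux1, aux2]; edac_error[i] is True if EDAC flagged copy i.

    Repairs `blocks` in place and returns the value handed to the user.
    """
    main, aux1, aux2 = blocks
    if not edac_error[0]:
        return main                  # EDAC not triggered on the main read
    if not edac_error[1] and not edac_error[2] and aux1 == aux2:
        if main != aux1:
            blocks[0] = aux1         # repair the main data block
        return aux1
    if all(edac_error) and len({main, aux1, aux2}) == 3:
        blocks[1] = main             # realign both auxiliary copies
        blocks[2] = main
        return main
    # Remaining combinations (assumption): plain two-out-of-three majority.
    for candidate in (main, aux1, aux2):
        if blocks.count(candidate) >= 2:
            blocks[:] = [candidate] * 3
            return candidate
    raise RuntimeError("no two copies agree")


blocks = [b"\x0f", b"\x0e", b"\x0e"]  # main copy upset, auxiliaries intact
value = read_with_voting(blocks, [True, False, False])
```

Because the auxiliary copies are consulted only after EDAC has been triggered, the common error-free read touches a single copy, which is where the efficiency gain over always comparing three copies would come from.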
Optionally, after the data is sent to the user, the method further includes:
if the user releases the data, releasing the data in the three partitions at the same time according to the variable-address mapping table, and releasing the variable-address mapping table.
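The release step might look like the following sketch, which assumes a dictionary as the variable-address mapping table and a set of live pool addresses standing in for the partitions' allocators; `release`, `mapping_table`, and `live_pools` are hypothetical names and the addresses are made up.

```python
from typing import Dict, List, Set

# Hypothetical state: one mapping-table entry and the pools it refers to.
mapping_table: Dict[str, List[int]] = {"delay_queue": [0x1000, 0x5000, 0x9000]}
live_pools: Set[int] = {0x1000, 0x5000, 0x9000}


def release(name: str) -> bool:
    """Free all three copies of `name`, then drop its mapping-table entry."""
    addrs = mapping_table.pop(name, None)
    if addrs is None:
        return False          # unknown or already released
    for addr in addrs:
        live_pools.discard(addr)
    return True


released = release("delay_queue")
```

Removing the table entry in the same operation that frees the copies avoids leaving a stale mapping that points at memory which may since have been reallocated.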
Optionally, the three redundant memory pools are the same size and are separated by a certain distance in the address distribution.
Optionally, the data includes data in a ready queue, a blocking queue, and a delay queue.
The invention provides a space single event upset autonomous fault-tolerant method. Three redundant memory pools are created for data according to the size of the data; three consecutive memory allocation operations are performed under a semaphore mechanism, and the three redundant memory pools are allocated to three partitions of a system respectively. One of the three redundant memory pools is set as a main data block and the other two as auxiliary data blocks, giving a first auxiliary data block and a second auxiliary data block, and the storage addresses of the data in the three partitions are recorded in a variable-address mapping table. The data is written into the data areas of the three partitions in sequence, and the data in the main data block is acquired according to the variable-address mapping table. It is then determined whether the EDAC circuit has been triggered to perform a memory error check: if not, the data is sent to the user; if so, the data in the three partitions is processed using a two-out-of-three strategy. The method ensures the address distribution among the redundant memory pools and the isolation among the data and, by also making use of the EDAC memory exception capture mechanism, improves fault-tolerance efficiency and reduces fault-tolerance cost.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be appreciated that the relevant features of the method and apparatus described above may be referred to one another. In addition, "first", "second", and the like in the above embodiments are used to distinguish the embodiments and do not indicate the relative merits of the embodiments.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (6)
1. A space single event upset autonomous fault-tolerant method, characterized by comprising the following steps:
creating three redundant memory pools for data according to the size of the data;
performing three consecutive memory allocation operations under a semaphore mechanism, and allocating the three redundant memory pools to three partitions of a system respectively;
setting one of the three redundant memory pools as a main data block and the other two as auxiliary data blocks, to obtain a first auxiliary data block and a second auxiliary data block, and recording the storage addresses of the data in the three partitions in a variable-address mapping table;
writing the data into the data areas of the three partitions in sequence;
acquiring the data in the main data block according to the variable-address mapping table;
determining whether the EDAC circuit has been triggered to perform a memory error check;
if not, sending the data to the user;
if so, processing the data in the three partitions using a two-out-of-three strategy.
2. The space single event upset autonomous fault-tolerant method of claim 1, wherein processing the data in the three partitions using the two-out-of-three strategy comprises:
when the data in the main data block differs from the data in the first auxiliary data block and the second auxiliary data block, and no EDAC error has occurred in the data in the first auxiliary data block or the second auxiliary data block, acquiring the data in the first auxiliary data block and the second auxiliary data block according to the variable-address mapping table, sending that data to the user, and replacing the data in the main data block with it.
3. The space single event upset autonomous fault-tolerant method of claim 1, wherein processing the data in the three partitions using the two-out-of-three strategy further comprises:
when the data in the main data block, the first auxiliary data block, and the second auxiliary data block differ from one another and EDAC errors have occurred in the data in all three blocks, sending the data to the user and replacing the data in the first auxiliary data block and in the second auxiliary data block with the data.
4. The space single event upset autonomous fault-tolerant method of any of claims 1-3, wherein after the data is sent to the user, the method further comprises:
if the user releases the data, releasing the data in the three partitions at the same time according to the variable-address mapping table, and releasing the variable-address mapping table.
5. The space single event upset autonomous fault-tolerant method of claim 1, wherein the three redundant memory pools are the same size and are separated by a certain distance in the address distribution.
6. The space single event upset autonomous fault-tolerant method of claim 1, wherein the data comprises data in a ready queue, a blocking queue, and a delay queue.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010713393.6A CN111858192A (en) | 2020-07-22 | 2020-07-22 | Spatial single-particle upset autonomous fault-tolerant method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010713393.6A CN111858192A (en) | 2020-07-22 | 2020-07-22 | Spatial single-particle upset autonomous fault-tolerant method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111858192A (en) | 2020-10-30 |
Family
ID=72950272
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010713393.6A Pending CN111858192A (en) | 2020-07-22 | 2020-07-22 | Spatial single-particle upset autonomous fault-tolerant method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111858192A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102915768A (en) * | 2012-10-01 | 2013-02-06 | Institute of Modern Physics, Chinese Academy of Sciences (中国科学院近代物理研究所) | Device and method for tolerating faults of storage based on triple modular redundancy of EDAC module |
CN108446189A (en) * | 2018-06-12 | 2018-08-24 | Shanghai Institute of Technical Physics, Chinese Academy of Sciences (中国科学院上海技术物理研究所) | A kind of fault-tolerant activation system of spaceborne embedded software and method |
CN109669823A (en) * | 2018-12-03 | 2019-04-23 | Institute of Electronic Engineering, China Academy of Engineering Physics (中国工程物理研究院电子工程研究所) | Anti-multiple-bit-upset error chip reinforcement means based on modified triple-modular redundancy system |
CN111176890A (en) * | 2019-12-16 | 2020-05-19 | Shanghai Aerospace Control Technology Institute (上海航天控制技术研究所) | Data storage and exception recovery method for satellite-borne software |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||