CN111858192A - Spatial single-particle upset autonomous fault-tolerant method - Google Patents

Publication number
CN111858192A
Authority
CN
China
Prior art keywords
data
data block
partitions
auxiliary
redundant memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010713393.6A
Other languages
Chinese (zh)
Inventor
程胜
邱化强
蔡铭
赵新鹏
崔小磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenzhou Aerospace Software Technology Co ltd
Original Assignee
Beijing Shenzhou Aerospace Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Aerospace Software Technology Co ltd
Priority to CN202010713393.6A
Publication of CN111858192A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2017 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where memory access, memory control or I/O control functionality is redundant
    • G06F11/1629 Error detection by comparing the output of redundant processing systems
    • G06F11/165 Error detection by comparing the output of redundant processing systems with continued operation after detection of the error

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention discloses a space single event upset autonomous fault-tolerant method and relates to the technical field of computers. The method comprises: creating three redundant memory pools for data according to the size of the data; allocating the three redundant memory pools to three partitions of a system respectively; setting one of the three redundant memory pools as a main data block and the other two as auxiliary data blocks, to obtain a first auxiliary data block and a second auxiliary data block; recording the storage addresses of the data in the three partitions in a variable-address mapping table; writing the data into the data areas of the three partitions in sequence; acquiring the data in the main data block according to the variable-address mapping table; determining whether the memory error check of the EDAC circuit has been triggered; and, if so, processing the data in the three partitions using a two-out-of-three strategy. The method ensures the address distribution among the redundant memory pools and the isolation between data and, by also using the EDAC memory exception capture mechanism, improves fault-tolerance efficiency and reduces fault-tolerance cost.

Description

Spatial single-particle upset autonomous fault-tolerant method
Technical Field
The invention relates to the technical field of computers, and in particular to a space single event upset autonomous fault-tolerant method.
Background
In computer systems used in the aerospace field, device failure due to the harsh space environment is one of the root causes of reliability degradation or failure of computer systems.
Radiation in space consists of particles such as electrons, neutrons, and protons emitted by various sources inside and outside the solar system. These particles generally carry high energy, and the resulting radiation effects can degrade electronic devices and can also cause system failure.
The radiation generated by sources inside and outside the solar system is of two main types: solar radiation and galactic cosmic rays. The radiation to which electronic equipment in space is exposed comes mainly from the Earth's radiation belts, galactic cosmic rays, solar particle events, and similar sources. Satellite-borne systems are easily disturbed by the Earth's electromagnetic field, in particular by the Earth's belts of charged particles (the Van Allen radiation belts, in which the proton concentration zone lies 400-900 km above the Earth's surface and the electron concentration zone lies 900-56000 km above it). Solar flares, which contain large numbers of high-energy protons and heavy ions, can expand the influence of the Van Allen radiation belts by 3 orders of magnitude. Cosmic rays (about 85% protons, about 13% α particles, and about 2% heavy nuclei), composed of ions of elements such as hydrogen and nickel, also carry very high radiation energy.
Particles in the space environment span a wide range of energies and types, and aerospace equipment is strongly affected by radiation, so the radiation effect is the first problem to consider in the design and test verification of aerospace weapon model software.
Radiation effects can be roughly divided into two categories according to how they act: the total dose accumulation effect and the single event effect. The total dose accumulation effect is the gradual aging of electronic equipment that remains in a high-intensity radiation environment for a long time. The single event effect is a change in the state of an electronic device caused by a single high-energy particle, and it includes three different effects: bit upset (single event upset), single event latch-up, and single event breakdown.
Single event upsets cause transient faults, which software methods can largely avoid, and bit upsets account for the majority of single event effects. Software fault tolerance can therefore greatly reduce the influence of single event effects on an aerospace computer system. Existing techniques for dealing with single event effects include:
(1) data redundancy coding: check codes such as Hamming codes and cyclic codes are used to encode the key data of the system, and errors detected in the data are corrected automatically;
(2) memory redundancy allocation: key data are allocated to three memory blocks at the same time as redundant backups, and on retrieval the data are output only after the copies in the data blocks have been compared;
(3) EDAC memory error detection: abnormal memory data are detected by an EDAC service or an EDAC loadable kernel module, and when a single event upset is detected, the system recovers through its backup mechanism;
(4) system-level fault tolerance: heartbeat detection and state detection are used among multi-level systems, and when a single event upset is detected on the host, the system is restarted or a backup system is started so that operation continues normally.
The main disadvantages of the prior art are as follows:
(1) low execution efficiency
Fault tolerance through data redundancy coding (encoding and decoding key data with Hamming codes, cyclic codes, and the like) and fault tolerance through memory redundancy allocation both suffer from low execution efficiency;
(2) insufficient data isolation
Data fault tolerance by memory redundancy allocation suffers from insufficient isolation between data: the physical address distribution among the redundant memory pools cannot be guaranteed, and data races can occur when several tasks perform redundant memory allocation at the same time;
(3) high fault-tolerance cost
EDAC memory error detection and system-level fault tolerance both require additional hardware, which raises the cost of fault tolerance.
Disclosure of Invention
In order to overcome the defects in the prior art, the embodiment of the invention provides a spatial single event upset autonomous fault tolerance method, which comprises the following steps:
according to the size of data, creating three redundant memory pools for the data;
continuously performing three times of memory allocation operations by using a semaphore mechanism, and respectively allocating three redundant memory pools to three partitions of a system;
setting one redundant memory pool of the three redundant memory pools as a main data block, and the other two redundant memory pools as auxiliary data blocks to obtain a first auxiliary data block and a second auxiliary data block, and recording storage addresses of the data in the three partitions in a variable-address mapping table;
writing the data into the data areas of the three partitions in sequence;
acquiring data in the main data block according to the variable-address mapping table;
determining whether the memory error check of the EDAC circuit has been triggered;
if not, the data is sent to the user;
if so, processing the data in the three partitions by using a two-out-of-three strategy.
Preferably, processing the data in the three partitions using the two-out-of-three strategy comprises:
when the data in the main data block differs from the data in the first auxiliary data block and the second auxiliary data block, and no EDAC error has occurred in the data in the first and second auxiliary data blocks, acquiring the data in the first and second auxiliary data blocks according to the variable-address mapping table, sending the data to the user, and replacing the data in the main data block with that data.
Preferably, processing the data in the three partitions using the two-out-of-three strategy further comprises:
when the data in the main data block, the first auxiliary data block and the second auxiliary data block are different, and the data in the main data block, the first auxiliary data block and the second auxiliary data block have all incurred EDAC errors, sending the data to the user and replacing the data in the first auxiliary data block and the second auxiliary data block, respectively, with that data.
Preferably, after sending the data to the user, the method further comprises:
if the user releases the data, simultaneously releasing the data in the three partitions according to the variable-address mapping table, and releasing the variable-address mapping table.
Preferably, the three redundant memory pools have the same size and are separated by a certain distance in their address distribution.
Preferably, the data includes data in a ready queue, a blocking queue, and a delay queue.
The space single event upset autonomous fault-tolerant method provided by the embodiment of the invention has the following beneficial effects:
By applying cross-partition redundant backup and a mutual exclusion locking mechanism to key data, the method ensures the address distribution among the redundant memory pools and the isolation between data; by also using the EDAC memory exception capture mechanism, it improves fault-tolerance efficiency and reduces fault-tolerance cost.
Drawings
Fig. 1 is a schematic flow chart of a spatial single event upset autonomous fault-tolerant method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a technical framework corresponding to the spatial single event upset autonomous fault-tolerant method according to the embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the embodiments.
As shown in fig. 1, the spatial single event upset autonomous fault-tolerant method provided in the embodiment of the present invention includes the following steps:
S101, creating three redundant memory pools for the data according to the size of the data.
As a specific embodiment of the present invention, the data is operating system kernel data structures, including the ready queue, the blocking queue, the delay queue, and the like. These structures manage and control the execution of the operating system kernel, and tampering with them would have serious consequences.
The redundant memory pools are requested through the memory allocation interface provided by the operating system (a first-fit algorithm); if memory space is insufficient or the request fails, all steps end.
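As an illustration of step S101, the following is a minimal sketch of a first-fit allocator from which three equally sized pools are requested. The class `FirstFitPool`, the function `create_redundant_pools`, and the addresses used are hypothetical and are not taken from the patent; the operating system's actual allocation interface is not described in the source.

```python
from typing import List, Optional, Tuple


class FirstFitPool:
    """Toy first-fit allocator over a linear address range (illustrative only)."""

    def __init__(self, base: int, size: int) -> None:
        # Free list of (start, length) holes, kept sorted by start address.
        self.free: List[Tuple[int, int]] = [(base, size)]

    def alloc(self, size: int) -> Optional[int]:
        """Return the start address of the first hole that fits, or None."""
        for i, (start, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    del self.free[i]          # hole consumed entirely
                else:
                    self.free[i] = (start + size, length - size)
                return start
        return None                           # insufficient memory


def create_redundant_pools(heap: FirstFitPool, size: int) -> Optional[List[int]]:
    """Request three equally sized pools; stop if any request fails.

    Assumption: this sketch does not roll back pools already obtained.
    """
    pools: List[int] = []
    for _ in range(3):
        addr = heap.alloc(size)
        if addr is None:
            return None                       # all steps end, as in S101
        pools.append(addr)
    return pools
```

A caller would treat a `None` result as the "memory space is insufficient" case and abandon the remaining steps.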
S102, continuously performing three times of memory allocation operations by using a semaphore mechanism, and allocating three redundant memory pools to three partitions of the system respectively.
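The role of the semaphore in S102 can be sketched as follows: the three allocations are performed back to back while a binary semaphore is held, so that another task cannot interleave its own allocations between them. The partition names, the bump-pointer allocation, and the class `PartitionedAllocator` are assumptions made for illustration only.

```python
import threading
from typing import Dict

PARTITIONS = ("partition_a", "partition_b", "partition_c")  # hypothetical names


class PartitionedAllocator:
    """Allocates one pool per partition as a single guarded operation."""

    def __init__(self, partition_size: int) -> None:
        self._sem = threading.Semaphore(1)        # binary semaphore
        self._next = {p: 0 for p in PARTITIONS}   # next free offset per partition
        self._limit = partition_size

    def alloc_triple(self, size: int) -> Dict[str, int]:
        """Perform the three allocations consecutively under the semaphore."""
        with self._sem:
            if any(self._next[p] + size > self._limit for p in PARTITIONS):
                raise MemoryError("not enough space in every partition")
            offsets: Dict[str, int] = {}
            for p in PARTITIONS:
                offsets[p] = self._next[p]
                self._next[p] += size
            return offsets
```

Because the semaphore is held across all three allocations, two tasks that request pools at the same time each receive a consistent triple.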
S103, setting one redundant memory pool of the three redundant memory pools as a main data block, and the other two redundant memory pools as auxiliary data blocks to obtain a first auxiliary data block and a second auxiliary data block, and recording storage addresses of the data in the three partitions in a variable-address mapping table.
As a specific embodiment of the present invention, as shown in fig. 2, a technical framework corresponding to the spatial single event upset autonomous fault-tolerant method provided in the embodiment of the present invention includes 3 redundant memory pools, where each redundant memory pool corresponds to a mapping address.
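The variable-address mapping table of S103 can be pictured as a lookup from a piece of data to the three addresses that hold its copies. The sketch below is one possible shape for such a table; the names `MappingEntry` and `VariableAddressTable`, the string key, and the field layout are hypothetical, since the patent does not specify the table format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class MappingEntry:
    main_addr: int   # address of the main data block
    aux1_addr: int   # address of the first auxiliary data block
    aux2_addr: int   # address of the second auxiliary data block
    size: int        # size of each copy in bytes

    def addresses(self) -> Tuple[int, int, int]:
        return (self.main_addr, self.aux1_addr, self.aux2_addr)


class VariableAddressTable:
    """Maps a variable name to the addresses of its three redundant copies."""

    def __init__(self) -> None:
        self._entries: Dict[str, MappingEntry] = {}

    def record(self, name: str, entry: MappingEntry) -> None:
        self._entries[name] = entry

    def lookup(self, name: str) -> MappingEntry:
        return self._entries[name]

    def release(self, name: str) -> MappingEntry:
        # Returns the entry so the caller can free all three copies together.
        return self._entries.pop(name)
```

Reads go through `lookup` to find the main data block, and `release` hands back all three addresses so the copies can be freed at the same time.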
S104, sequentially writing the data into the data areas of the three partitions.
S105, acquiring the data in the main data block according to the variable-address mapping table.
S106, determining whether the memory error check of the EDAC circuit has been triggered.
The EDAC circuit is implemented with a linear block code, which provides error detection and correction. During encoding, check bits are generated for the information data to be encoded, and the check bits are stored in memory together with the data. During decoding, check bits are generated again for the information data and XORed with the check bits generated during encoding to obtain the syndrome; the syndrome is used to locate errors in the information data, and correctable errors are corrected.
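The syndrome computation described above can be illustrated with the smallest common linear block code, Hamming(7,4). This choice is an assumption for illustration: the patent does not name the code used, and real EDAC hardware typically protects wider words. The functions `check_bits` and `decode` are hypothetical names.

```python
from typing import Tuple

# Codeword positions (1-based) that hold the four data bits in a Hamming(7,4)
# layout; positions 1, 2 and 4 hold the three check bits.
DATA_POSITIONS = (3, 5, 6, 7)


def check_bits(data: int) -> int:
    """Return the three check bits for a 4-bit value as an integer 0..7."""
    check = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            check ^= pos   # a set data bit toggles every check bit covering it
    return check


def decode(data: int, stored_check: int) -> Tuple[int, int]:
    """Regenerate the check bits, XOR with the stored ones, and correct.

    Returns (corrected_data, syndrome). A syndrome of 0 means no error;
    1, 2 or 4 means a check bit flipped; 3, 5, 6 or 7 locates a data bit.
    """
    syndrome = check_bits(data) ^ stored_check
    if syndrome in DATA_POSITIONS:
        data ^= 1 << DATA_POSITIONS.index(syndrome)
    return data, syndrome
```

A nonzero syndrome corresponds to the event that S106 tests for: the stored word no longer matches its check bits.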
S107, if not, sending the data to the user.
S108, if so, processing the data in the three partitions by using a two-out-of-three strategy.
Optionally, processing the data in the three partitions using the two-out-of-three strategy includes:
when the data in the main data block differs from the data in the first auxiliary data block and the second auxiliary data block, and no EDAC error has occurred in the data in the first and second auxiliary data blocks, acquiring the data in the first and second auxiliary data blocks according to the variable-address mapping table, sending the data to the user, and replacing the data in the main data block with that data.
Optionally, processing the data in the three partitions using the two-out-of-three strategy further includes:
when the data in the main data block, the first auxiliary data block and the second auxiliary data block are different, and the data in the main data block, the first auxiliary data block and the second auxiliary data block have all incurred EDAC errors, sending the data to the user and replacing the data in the first auxiliary data block and the second auxiliary data block, respectively, with that data.
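A generic two-out-of-three vote with repair can be sketched as below. This is a plain majority vote over the three copies and does not reproduce the per-block EDAC status conditions stated above; the function name `vote_and_repair` and the use of `bytearray` for the copies are assumptions for illustration.

```python
from typing import List


def vote_and_repair(copies: List[bytearray]) -> bytes:
    """Return the value held by at least two of [main, aux1, aux2].

    Any copy that disagrees with the majority is overwritten in place.
    Raises RuntimeError when no two copies agree.
    """
    main, aux1, aux2 = copies
    if aux1 == aux2:
        winner = bytes(aux1)          # main is either correct or outvoted
    elif main == aux1 or main == aux2:
        winner = bytes(main)          # one auxiliary copy is outvoted
    else:
        raise RuntimeError("no two copies agree")
    for copy in copies:
        if copy != winner:
            copy[:] = winner          # repair the dissenting copy
    return winner
```

In the first case described above, the main data block is the dissenting copy, so the value of the two auxiliary blocks is returned to the user and written back over the main data block.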
Optionally, after sending the data to the user, the method further comprises:
if the user releases the data, simultaneously releasing the data in the three partitions according to the variable-address mapping table, and releasing the variable-address mapping table.
Optionally, the three redundant memory pools have the same size and are separated by a certain distance in their address distribution.
Optionally, the data includes data in a ready queue, a blocking queue, and a delay queue.
The invention provides a space single event upset autonomous fault-tolerant method. The method creates three redundant memory pools for data according to the size of the data, performs three consecutive memory allocation operations using a semaphore mechanism, and allocates the three redundant memory pools to three partitions of a system respectively. One of the three redundant memory pools is set as a main data block and the other two as auxiliary data blocks, giving a first auxiliary data block and a second auxiliary data block, and the storage addresses of the data in the three partitions are recorded in a variable-address mapping table. The data is written into the data areas of the three partitions in sequence, and the data in the main data block is acquired according to the variable-address mapping table. The method then determines whether the memory error check of the EDAC circuit has been triggered: if not, the data is sent to the user; if so, the data in the three partitions is processed using a two-out-of-three strategy. The method ensures the address distribution among the redundant memory pools and the isolation between data and, by also using the EDAC memory exception capture mechanism, improves fault-tolerance efficiency and reduces fault-tolerance cost.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be appreciated that the relevant features of the method and apparatus described above may be referred to one another. In addition, "first", "second", and the like in the above embodiments serve to distinguish the embodiments and do not indicate that one embodiment is better than another.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (6)

1. A spatial single event upset autonomous fault-tolerant method is characterized by comprising the following steps:
according to the size of data, creating three redundant memory pools for the data;
continuously performing three times of memory allocation operations by using a semaphore mechanism, and respectively allocating three redundant memory pools to three partitions of a system;
setting one redundant memory pool of the three redundant memory pools as a main data block, and the other two redundant memory pools as auxiliary data blocks to obtain a first auxiliary data block and a second auxiliary data block, and recording storage addresses of the data in the three partitions in a variable-address mapping table;
writing the data into the data areas of the three partitions in sequence;
acquiring data in the main data block according to the variable-address mapping table;
judging whether an EDAC circuit is triggered to check the memory error operation or not;
if not, the data is sent to the user;
if so, processing the data in the three partitions by using a two-out-of-three strategy.
2. The spatial single event upset autonomous fault-tolerant method of claim 1, wherein processing data in three partitions using a two-out-of-three strategy comprises:
and when the data in the main data block is different from the data in the first auxiliary data block and the second auxiliary data block and no EDAC error occurs in the data in the first auxiliary data block and the second auxiliary data block, acquiring the data in the first auxiliary data block and the second auxiliary data block according to the variable-address mapping table, sending the data to a user, and replacing the data in the main data block with the data.
3. The spatial single event upset autonomous fault-tolerant method of claim 1, wherein processing data in three partitions using a two out of three strategy further comprises:
and when the data in the main data block, the first auxiliary data block and the second auxiliary data block are different and the data in the main data block, the first auxiliary data block and the second auxiliary data block are all subjected to EDAC errors, sending the data to a user and respectively replacing the data in the first auxiliary data block and the second auxiliary data block with the data.
4. The spatial single event upset autonomous fault tolerance method of any of claims 1-3, wherein after sending the data to a user, the method further comprises:
and if the user releases the data, simultaneously releasing the data in the three partitions according to the variable-address mapping table, and releasing the variable-address mapping table.
5. The spatial single event upset autonomous fault-tolerant method of claim 1, wherein the three redundant memory pools have the same size and have a certain distance in address distribution.
6. The spatial single event upset autonomous fault tolerance method of claim 1, wherein the data comprises data in a ready queue, a blocking queue, and a delay queue.
CN202010713393.6A 2020-07-22 2020-07-22 Spatial single-particle upset autonomous fault-tolerant method Pending CN111858192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010713393.6A CN111858192A (en) 2020-07-22 2020-07-22 Spatial single-particle upset autonomous fault-tolerant method

Publications (1)

Publication Number Publication Date
CN111858192A true CN111858192A (en) 2020-10-30

Family

ID=72950272

Country Status (1)

Country Link
CN (1) CN111858192A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915768A (en) * 2012-10-01 2013-02-06 中国科学院近代物理研究所 Device and method for tolerating faults of storage based on triple modular redundancy of EDAC module
CN108446189A (en) * 2018-06-12 2018-08-24 中国科学院上海技术物理研究所 A kind of fault-tolerant activation system of spaceborne embedded software and method
CN109669823A (en) * 2018-12-03 2019-04-23 中国工程物理研究院电子工程研究所 Anti- Multiple-bit upsets error chip reinforcement means based on modified triple-modular redundancy system
CN111176890A (en) * 2019-12-16 2020-05-19 上海航天控制技术研究所 Data storage and exception recovery method for satellite-borne software

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination