CN102521066A - On-board computer space environment event fault tolerance method - Google Patents

On-board computer space environment event fault tolerance method

Info

Publication number
CN102521066A
Authority
CN
China
Prior art keywords
spaceborne computer
software
spaceborne
computer software
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103619895A
Other languages
Chinese (zh)
Inventor
翟君武
陶利民
李林
汪路元
唐自新
李伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Spacecraft System Engineering
Original Assignee
Beijing Institute of Spacecraft System Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Spacecraft System Engineering
Priority to CN2011103619895A
Publication of CN102521066A
Legal status: Pending

Landscapes

  • Techniques For Improving Reliability Of Storages (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention relates to a fault-tolerance method for space environment events in an on-board computer. It covers three areas: handling of single-event upsets (SEUs) in memory, tolerance of changes to chip-internal registers caused by space radiation, and tolerance of partial circuit failures caused by space radiation. For memory SEUs, the memory region is protected by error detection and correction (EDAC) check bits, and the on-board computer periodically reads and writes the region so that errors are detected and corrected. For chip-internal register changes caused by space radiation, the on-board computer protects unused interrupts; it inspects the working-mode registers periodically and re-initializes them when a value differs from the expected value; and for registers related to sending bus messages, it writes the values again before each message is sent. For partial circuit failures caused by space radiation, fault tolerance is achieved by replacing a faulty random-access memory (RAM) chip, by detecting bus interface chip faults and switching over, and by detecting central processing unit (CPU) chip faults and switching over. The method can effectively improve the reliability of the on-board computer during launch and in-orbit operation.

Description

Fault-tolerance method for space environment events in a spaceborne computer
Technical field
The present invention relates to a fault-tolerance method for a spaceborne computer.
Background art
Throughout launch and operation, a spacecraft can experience various space environment events, owing to the space environment, the characteristics of the spacecraft, and other causes. If these events are not handled, they can cause the satellite system to lose functions or even to crash. Measures should therefore be taken to deal with these abnormal conditions so that the satellite can continue to operate correctly and stably, thereby ensuring the stable operation and service of the whole satellite system.
Space environment events mainly include single-event upsets (SEUs) in memory, changes to chip-internal registers caused by space radiation, and partial circuit failures caused by space radiation. A memory SEU can cause on-board software or FPGA logic to produce incorrect results, and can even cause the software to run away or crash. A radiation-induced change to a chip-internal register can cause some spacecraft chips to malfunction, which in turn affects the functions that depend on them. Partial circuit failure caused by space radiation mainly refers to the failure of part of a circuit after a single-event latch-up.
At present, fault-tolerance methods for abnormal space environment events in spaceborne computers have not been studied systematically.
Summary of the invention
The technical problem solved by the present invention is to overcome the deficiencies of the prior art by providing a fault-tolerance method for space environment events in a spaceborne computer, thereby establishing a fault-tolerance strategy for space environment events that is suitable for spaceborne computer design and improving the reliability of the spaceborne computer during launch and in orbit.
The technical solution of the present invention is a fault-tolerance method for space environment events in a spaceborne computer, with the following steps:
(1) After the spaceborne computer is first powered on, check whether the spaceborne computer software can start normally. If it can, the software kicks the software watchdog at a fixed period and runs normally. If the software cannot start, or fails to kick the software watchdog at the fixed period, the reset circuit issues a reset signal to the spaceborne computer, and the spaceborne computer restarts. If the spaceborne computer fails to start normally three times in a row, switch to the backup spaceborne computer;
(2) Once the spaceborne computer software is running normally, perform a read/write test on all RAM. If any RAM region does not read and write correctly, the spaceborne computer replaces the faulty RAM with the backup RAM through software configuration;
(3) While the spaceborne computer software is running normally, periodically send polling bus messages to each bus terminal. When none of the bus terminals responds, the software sends a switchover signal to the spaceborne computer, and the spaceborne computer switches to the backup machine;
(4) While the spaceborne computer software is running normally, enable all interrupt sources that are actually in use and mask all other interrupt sources. When the spaceborne computer responds to an interrupt, first confirm the interrupt source; if the interrupt does not come from one of the sources actually in use, re-initialize the interrupt mask register (IMR);
(5) While the spaceborne computer software is running normally, periodically check whether the values of the registers that set the working state of the bus driver chip have changed. If any register value has changed, the spaceborne computer re-initializes that register and its related registers. In addition, registers whose state is valid only part of the time are assigned again each time their valid period begins;
(6) While the spaceborne computer software is running normally, use a Hamming code to compute a check code for the data at each memory address, and store the check code. The spaceborne computer periodically checks the data at each memory address. When a single-bit error is found, it is corrected; when an error of two or more bits is found, the spaceborne computer is reset and restarted.
Compared with the prior art, the present invention has the following advantages:
(1) The method specifically targets the events caused by the space environment and applies a different fault-tolerance technique to each type, which can effectively improve the in-orbit reliability of the spaceborne computer;
(2) Fault tolerance for space environment events is implemented in the spaceborne computer software, which can improve the autonomous management capability of the satellite;
(3) With hardware support, error detection and fault tolerance are carried out mainly in software. The principle is simple, the method is easy to implement and maintain, and it is applicable to most satellites, so it generalizes well.
Description of drawings
Fig. 1 is a flow chart of the method of the present invention;
Fig. 2 is a diagram of the specific fault-tolerance measures that make up the method of the present invention;
Fig. 3 is a hardware configuration diagram of the on-board computer in an embodiment of the present invention.
Detailed description of the embodiments
The fault tolerance for abnormal space environment events in the present invention uses the software and hardware resources of the spaceborne computer and handles each type of space environment event differently, while still meeting the spaceborne computer's tight weight and power constraints.
As shown in Fig. 1, the method applies different types of fault tolerance to events caused by the space environment, such as single-event upsets and single-event latch-up. It is applicable to most spacecraft and can improve the in-orbit survivability and reliability of satellite equipment. As shown in Fig. 2, it mainly covers three aspects: handling of memory single-event upsets, fault tolerance for chip-internal register changes caused by space radiation, and fault tolerance for partial circuit failures caused by space radiation.
(1) Handling of memory single-event upsets
For single-event upsets in memory, the memory region is protected by EDAC check bits, and the spaceborne computer verifies the region by reading and writing it periodically. An EDAC check code is characterized as "correct one, detect two": a single-bit error can be corrected, whereas an error of two or more bits cannot be corrected and can only be reported. The on-board computer hardware therefore includes an EDAC check circuit for the memory. When an EDAC check fails, an interrupt is raised. In the interrupt handler, the software reads the EDAC check status to determine whether the error is a single-bit or a multi-bit error. For a single-bit error, the software corrects the error in memory by writing the data it has read back to the same location. For a double-bit error, the software removes the effect of the error by resetting autonomously.
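The "correct one, detect two" behaviour can be illustrated in software with an extended Hamming code. The following is a minimal sketch only: the patent's EDAC is a hardware circuit whose code word size is not stated, so the 4-data-bit, 8-bit code word layout and the names `encode` and `decode` are assumptions chosen to keep the example small.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming code word."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    word = [0] * 8                              # index 0 holds the overall parity bit
    # Data bits occupy positions 3, 5, 6, 7; parity bits occupy positions 1, 2, 4.
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) & 1                 # makes the total parity even
    return sum(bit << i for i, bit in enumerate(word))


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = [(code >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos                     # XOR of the positions of all set bits
    odd_parity = sum(word) & 1                  # 1 when an odd number of bits flipped
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Single flip: the syndrome is its position (0 means the overall parity bit).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even total parity: two bits flipped.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(n)` decodes as `"corrected"` with the original value, while flipping any two bits decodes as `"uncorrectable"`, which is the case in which the method resets the computer.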
(2) Chip-internal register changes caused by space radiation
Chip-internal registers are interfaces that the chip designers provide for the user's convenience; different register values change the chip's working mode and main functions. A change to a chip-internal register caused by space radiation can make the spaceborne computer perform its normal functions incorrectly. The spaceborne computer mainly uses the following measures against such changes: unused interrupts are protected, so that a change in an interrupt-related register cannot trigger an unexpected interrupt; the working-mode registers are inspected periodically and re-initialized if they do not hold the expected values; and for registers related to sending bus messages, the values are written again before each message is sent.
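The periodic inspection and the rewrite-before-send measures can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the register addresses, the expected values, the `ChipRegisters` class, and all function names are hypothetical, and the register file is simulated rather than memory-mapped.

```python
# Hypothetical register addresses and expected values, for illustration only.
EXPECTED_MODE = {0x00: 0x8000, 0x01: 0x0012, 0x02: 0x0400}


class ChipRegisters:
    """Simulated register file standing in for memory-mapped chip registers."""

    def __init__(self):
        self.cells = {}

    def read(self, addr):
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.cells[addr] = value


def init_mode_registers(regs):
    """Write the expected value into every working-mode register."""
    for addr, value in EXPECTED_MODE.items():
        regs.write(addr, value)


def inspect_mode_registers(regs):
    """Periodic inspection: re-initialise all mode registers if any has drifted.

    Returns the list of addresses whose value differed from the expected one.
    """
    changed = [a for a, v in EXPECTED_MODE.items() if regs.read(a) != v]
    if changed:
        init_mode_registers(regs)
    return changed


def send_bus_message(regs, send_setup, payload):
    """Rewrite the send-related registers right before every transmission."""
    for addr, value in send_setup.items():
        regs.write(addr, value)
    # A real driver would now trigger the transfer; here we only report it.
    return {"setup": dict(send_setup), "payload": payload}
```

Re-initializing the whole group rather than only the changed register mirrors the text's "this register and its related registers"; which registers count as related is chip-specific and is not stated in the source.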
(3) Partial circuit failure caused by space radiation
Partial circuit failure caused by space radiation mainly refers to the failure of part of a circuit after a single-event latch-up. The spaceborne computer applies fault isolation and system reconfiguration to the affected circuits to remove the effect of the latch-up. The main measures are replacement of a faulty RAM chip, fault detection and switchover for the bus interface chip, and fault detection and switchover for the CPU chip. The spaceborne computer uses a standby redundancy strategy: when a RAM chip can no longer be read and written correctly, it switches to the backup RAM; when the bus interface chip or the CPU chip becomes abnormal, it switches autonomously to the backup machine.
The main steps of the method are as follows:
(1) The spaceborne computer is first powered on and begins to run.
(2) Check whether the spaceborne computer software can start normally. If it can, the software kicks the watchdog at a fixed period and runs normally. Otherwise the software cannot kick the watchdog, the reset circuit issues a reset signal to the spaceborne computer, and the spaceborne computer restarts. If it fails to start normally 3 times in a row, switch to the backup spaceborne computer.
(3) Once the spaceborne computer software is running, perform a read/write test on all RAM. If any RAM region does not read and write correctly, the RAM has been damaged for an unknown reason, and the software configures the system to use the backup RAM.
(4) While the spaceborne computer is running, periodically send polling bus messages to each bus terminal. When none of the bus terminals responds, this indicates that the bus driver chip has been damaged for some reason. The software then sends a switchover signal to the spaceborne computer, and the spaceborne computer switches to the backup machine, which uses another bus driver chip.
(5) While the spaceborne computer is running, enable all interrupt sources that are actually in use and mask all other interrupt sources. When the spaceborne computer responds to an interrupt, first confirm the interrupt source. If the interrupt does not come from one of the sources actually in use, a single-event upset has occurred in the interrupt mask register (IMR), and the IMR is re-initialized.
(6) While the spaceborne computer is running, periodically check whether the values of the registers that set the working state of the bus driver chip have changed. A change indicates that the register has been affected by a single event. The spaceborne computer then re-initializes that register and its related registers.
(7) While the spaceborne computer is running, some registers hold a valid state only part of the time. Each time these registers are about to be used, they are assigned again, which removes the effect of any single event that may have struck them beforehand.
(8) While the spaceborne computer is running, use a Hamming code to compute a check code for the data at each memory address, and store the check code. The spaceborne computer periodically checks the data at each memory address. When a single-bit error (a single-event upset) is found, it is corrected; when a multi-bit error is found, the computer is reset and the program is reloaded.
(9) Repeat steps (4) to (8) to continue fault-tolerant handling of space environment events.
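The start-up logic of step (2) can be sketched as below. It is a simulation under stated assumptions: in the method the watchdog and the reset circuit are hardware, and the tick-based `Watchdog` class, the `supervise_startup` function, and the outcome strings are hypothetical names introduced for illustration.

```python
MAX_FAILED_STARTS = 3  # from the method: three consecutive failed starts


class Watchdog:
    """Counts down once per tick; software must kick it before it reaches zero."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks

    def kick(self):
        """Called by healthy software at a fixed period."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance one tick; return True when the reset signal should be asserted."""
        self.remaining -= 1
        return self.remaining <= 0


def supervise_startup(start_results, max_failed=MAX_FAILED_STARTS):
    """Decide the outcome of a sequence of start attempts.

    start_results yields True when the software started and kicked the watchdog,
    and False when the attempt ended in a watchdog reset.
    Returns (outcome, restarts), where restarts counts restarts of the same computer.
    """
    failed = 0
    restarts = 0
    for started in start_results:
        if started:
            return "running", restarts
        failed += 1
        if failed >= max_failed:
            return "switch_to_backup", restarts
        restarts += 1  # the reset circuit restarts the same computer
    return "undecided", restarts
```

For example, `supervise_startup([False, False, True])` reports a running computer after two restarts, whereas three failures in a row yield `"switch_to_backup"`.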
Example
Taking a certain satellite as an example, the space environment event fault-tolerance strategy of the spaceborne computer is described below.
As shown in Fig. 3, the spaceborne computer of this satellite uses a TSC695f as its CPU, which has a built-in EDAC circuit, and also has a replacement circuit for redundant RAM. It uses a 61580 as the bus interface chip. It has 128K of PROM and 8M of RAM; the RAM consists of four 9Q512K32 chips of 2M each, and the system additionally carries one 2M RAM as a backup. It also has a telemetry interface and a telecommand interface. The application software runs on top of an operating system and implements the application-layer functions. The spaceborne computer is configured as a dual-machine cold standby, and the two units are identical in composition.
During start-up of the spaceborne computer, the operating system first performs a self-test on the 4 RAM chips. If one RAM chip does not read and write correctly, the backup RAM is used in its place. If the RAM is still abnormal after the replacement, the computer is reset.
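A start-up self-test of this kind can be sketched as follows. The test patterns, the simulated `RamChip` class with its `stuck_low` fault model, and the function names are assumptions made for illustration; the source does not describe how the self-test is performed or how the replacement circuit is programmed.

```python
TEST_PATTERNS = (0x55555555, 0xAAAAAAAA)  # alternating bits, then their complement


class RamChip:
    """Simulated 32-bit-wide RAM chip; stuck_low marks bits that always read 0."""

    def __init__(self, words, stuck_low=0):
        self.cells = [0] * words
        self.stuck_low = stuck_low

    def write(self, addr, value):
        self.cells[addr] = value & ~self.stuck_low & 0xFFFFFFFF

    def read(self, addr):
        return self.cells[addr]


def chip_passes_self_test(chip):
    """Destructive write/read-back test, intended for start-up only."""
    for pattern in TEST_PATTERNS:
        for addr in range(len(chip.cells)):
            chip.write(addr, pattern)
        if any(chip.read(addr) != pattern for addr in range(len(chip.cells))):
            return False
    return True


def configure_ram(primary_chips, spare_chip):
    """Return (active_chips, reset_needed) after the start-up self-test."""
    active = []
    spare_free = True
    for chip in primary_chips:
        if chip_passes_self_test(chip):
            active.append(chip)
        elif spare_free and chip_passes_self_test(spare_chip):
            active.append(spare_chip)   # the spare takes the faulty chip's place
            spare_free = False
        else:
            return active, True         # still abnormal after replacement: reset
    return active, False
```

With a single spare, a second faulty chip (or a faulty spare) cannot be covered, so the sketch reports that a reset is needed, matching the fallback described above.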
The TSC695f has a built-in EDAC circuit. When the EDAC check of a memory region fails, the corresponding interrupt is raised, and the hardware records whether the error is a single-bit or a double-bit error together with the value that was read. During initialization, the application software installs a handler for this interrupt. When the interrupt occurs, the software first determines whether the error is a single-bit error. If it is, the software writes the EDAC-corrected value back to the RAM region, which removes the effect of the single-bit error. If it is a double-bit error, the software resets immediately.
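The decision made inside the interrupt handler can be sketched as below. The sketch deliberately does not model any TSC695f register: the `EdacStatus` enumeration, the handler signature, and the `request_reset` callback are hypothetical stand-ins for whatever status information and reset mechanism the real hardware provides.

```python
from enum import Enum


class EdacStatus(Enum):
    SINGLE_BIT = "single"
    DOUBLE_BIT = "double"


def on_edac_interrupt(ram, status, address, corrected_value, request_reset):
    """Handle one EDAC interrupt.

    ram is any mutable sequence or mapping of words; corrected_value is the value
    delivered after EDAC correction (hypothetical interface).
    """
    if status is EdacStatus.SINGLE_BIT:
        ram[address] = corrected_value   # write-back clears the stored upset
        return "corrected"
    request_reset()                      # double-bit errors cannot be corrected
    return "reset_requested"
```

The write-back matters because correction on read alone leaves the flipped bit in memory, where a second upset in the same word would turn it into an uncorrectable double-bit error.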
Every 0.5 seconds, the spaceborne computer application software checks the working mode of the 61580 chip and the working mode of the 695f chip, and assigns the configured values again if they differ. Each time it sends a bus message, the application software assigns again the registers of the 61580 chip that are related to sending the message. The operating system software protects the unused interrupts: when an abnormal interrupt occurs, it clears the corresponding bit in the interrupt status register and exits the interrupt handler.
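The protection of unused interrupts can be sketched as follows. The interrupt numbers, the mask polarity (1 = enabled), and the simulated `InterruptController` are assumptions for illustration; re-initializing the mask on a spurious interrupt follows the method's step on confirming the interrupt source.

```python
USED_SOURCES = (3, 5, 12)                         # hypothetical interrupt numbers in use
ENABLE_MASK = sum(1 << s for s in USED_SOURCES)   # assumed polarity: 1 = enabled


class InterruptController:
    """Simulated controller with an enable mask and a pending-status register."""

    def __init__(self):
        self.mask = ENABLE_MASK
        self.status = 0


def dispatch_interrupt(ctrl, source, handlers):
    """Confirm the source first; an unused source suggests a corrupted mask."""
    ctrl.status &= ~(1 << source)        # clear the pending bit in either case
    if source not in USED_SOURCES:
        ctrl.mask = ENABLE_MASK          # re-initialise the mask register
        return "spurious"
    handlers[source]()
    return "handled"
```

A spurious interrupt is treated here as a symptom rather than an event to service: the pending bit is cleared, the mask is restored, and no handler runs.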
The spaceborne computer has an autonomous switchover function. When the application software detects that none of the bus terminals responds, it concludes that the 61580 chip has failed and switches over immediately. When the CPU chip becomes abnormal, the watchdog resets it; if 3 resets still do not restore normal operation, the computer switches over immediately. In this way the location of the fault caused by the space environment event is isolated.
Content not described in detail in this specification belongs to technology well known to those skilled in the art.
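The bus-side switchover decision can be sketched as below. The `poll` and `switch_to_backup` callbacks and the function names are hypothetical; the sketch captures only the rule that a switchover is triggered when every terminal is silent.

```python
def bus_interface_suspect(poll_results):
    """True only when at least one terminal was polled and none answered."""
    return len(poll_results) > 0 and not any(poll_results.values())


def poll_cycle(terminals, poll, switch_to_backup):
    """Poll every bus terminal once and switch over if none responds.

    poll(terminal) -> bool and switch_to_backup() are hypothetical callbacks.
    """
    results = {terminal: bool(poll(terminal)) for terminal in terminals}
    if bus_interface_suspect(results):
        switch_to_backup()
        return True
    return False
```

Requiring all terminals to be silent distinguishes a failed bus interface chip from a single failed terminal: if even one terminal answers, the local interface is evidently still working.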

Claims (1)

1. A fault-tolerance method for space environment events in a spaceborne computer, characterized by the following steps:
(1) After the spaceborne computer is first powered on, check whether the spaceborne computer software can start normally. If it can, the software kicks the software watchdog at a fixed period and runs normally. If the software cannot start, or fails to kick the software watchdog at the fixed period, the reset circuit issues a reset signal to the spaceborne computer, and the spaceborne computer restarts. If the spaceborne computer fails to start normally three times in a row, switch to the backup spaceborne computer;
(2) Once the spaceborne computer software is running normally, perform a read/write test on all RAM. If any RAM region does not read and write correctly, the spaceborne computer replaces the faulty RAM with the backup RAM through software configuration;
(3) While the spaceborne computer software is running normally, periodically send polling bus messages to each bus terminal. When none of the bus terminals responds, the software sends a switchover signal to the spaceborne computer, and the spaceborne computer switches to the backup machine;
(4) While the spaceborne computer software is running normally, enable all interrupt sources that are actually in use and mask all other interrupt sources. When the spaceborne computer responds to an interrupt, first confirm the interrupt source; if the interrupt does not come from one of the sources actually in use, re-initialize the interrupt mask register (IMR);
(5) While the spaceborne computer software is running normally, periodically check whether the values of the registers that set the working state of the bus driver chip have changed. If any register value has changed, the spaceborne computer re-initializes that register and its related registers. In addition, registers whose state is valid only part of the time are assigned again each time their valid period begins;
(6) While the spaceborne computer software is running normally, use a Hamming code to compute a check code for the data at each memory address, and store the check code. The spaceborne computer periodically checks the data at each memory address. When a single-bit error is found, it is corrected; when an error of two or more bits is found, the spaceborne computer is reset and restarted.
CN2011103619895A 2011-11-15 2011-11-15 On-board computer space environment event fault tolerance method Pending CN102521066A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103619895A CN102521066A (en) 2011-11-15 2011-11-15 On-board computer space environment event fault tolerance method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011103619895A CN102521066A (en) 2011-11-15 2011-11-15 On-board computer space environment event fault tolerance method

Publications (1)

Publication Number Publication Date
CN102521066A true CN102521066A (en) 2012-06-27

Family

ID=46292001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103619895A Pending CN102521066A (en) 2011-11-15 2011-11-15 On-board computer space environment event fault tolerance method

Country Status (1)

Country Link
CN (1) CN102521066A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103076779A (en) * 2012-12-28 2013-05-01 中国人民解放军国防科学技术大学 Independent control method and device of satellite-borne equipment on microsatellite
CN103984630A (en) * 2014-05-27 2014-08-13 中国科学院空间科学与应用研究中心 Single event upset fault processing method based on AT697 processor
CN104035828A (en) * 2014-05-19 2014-09-10 上海微小卫星工程中心 FPGA space irradiation comprehensive protection method and device
CN104246629A (en) * 2012-10-02 2014-12-24 富士电机株式会社 Redundant computation processing system
CN106354579A (en) * 2016-10-14 2017-01-25 上海微小卫星工程中心 Spaceborne computer
CN107273240A (en) * 2017-05-18 2017-10-20 北京空间飞行器总体设计部 A kind of spaceborne phased array TR components single-particle inversion means of defence
CN108021473A (en) * 2017-11-29 2018-05-11 山东航天电子技术研究所 The aerospace computer system and safe starting method that a kind of more backups start
CN109001778A (en) * 2018-05-21 2018-12-14 北京空间飞行器总体设计部 A kind of processing method based on satellite-based navigation satellite receiving system single event
CN109491290A (en) * 2018-11-16 2019-03-19 西安空间无线电技术研究所 A kind of cold standby bus complexing circuit suitable for digital processing system
CN109739697A (en) * 2018-12-13 2019-05-10 北京计算机技术及应用研究所 A kind of hard real-time two-shipper synchronous fault-tolerant system based on high-speed data exchange
CN111708695A (en) * 2020-06-12 2020-09-25 上海航天计算机技术研究所 AT 697-based cache single event upset resistant effect verification method
CN112860467A (en) * 2021-01-20 2021-05-28 北京国电高科科技有限公司 On-orbit fault smooth repairing device and method for satellite-borne computer
CN113744787A (en) * 2021-07-27 2021-12-03 北京空间飞行器总体设计部 SRAM (static random Access memory) type FPGA (field programmable Gate array) user register single event upset fault injection method
CN114090327A (en) * 2022-01-20 2022-02-25 浙江吉利控股集团有限公司 Single-particle error processing method, system and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7805245B2 (en) * 2007-04-18 2010-09-28 Honeywell International Inc. Inertial measurement unit fault detection isolation reconfiguration using parity logic
CN101907888A (en) * 2010-07-29 2010-12-08 航天东方红卫星有限公司 Double-machine cold standby non-distance switching method for small satellite affair system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
段星辉等: "一种提高星载软件可靠性的开发方法", 《计算机工程》 *
贾文涛等: "一种高可靠双机温备星载计算机的设计与实现", 《计算机研究与发展》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104246629A (en) * 2012-10-02 2014-12-24 富士电机株式会社 Redundant computation processing system
CN104246629B (en) * 2012-10-02 2016-10-12 富士电机株式会社 redundant operation processing system
CN103076779A (en) * 2012-12-28 2013-05-01 中国人民解放军国防科学技术大学 Independent control method and device of satellite-borne equipment on microsatellite
CN104035828A (en) * 2014-05-19 2014-09-10 上海微小卫星工程中心 FPGA space irradiation comprehensive protection method and device
CN103984630B (en) * 2014-05-27 2017-02-01 中国科学院空间科学与应用研究中心 Single event upset fault processing method based on AT697 processor
CN103984630A (en) * 2014-05-27 2014-08-13 中国科学院空间科学与应用研究中心 Single event upset fault processing method based on AT697 processor
CN106354579B (en) * 2016-10-14 2019-07-19 上海微小卫星工程中心 Spaceborne computer
CN106354579A (en) * 2016-10-14 2017-01-25 上海微小卫星工程中心 Spaceborne computer
CN107273240A (en) * 2017-05-18 2017-10-20 北京空间飞行器总体设计部 A kind of spaceborne phased array TR components single-particle inversion means of defence
CN107273240B (en) * 2017-05-18 2020-04-28 北京空间飞行器总体设计部 Single event upset protection method for satellite-borne phased array TR (transmitter-receiver) assembly
CN108021473A (en) * 2017-11-29 2018-05-11 山东航天电子技术研究所 The aerospace computer system and safe starting method that a kind of more backups start
CN109001778A (en) * 2018-05-21 2018-12-14 北京空间飞行器总体设计部 A kind of processing method based on satellite-based navigation satellite receiving system single event
CN109491290A (en) * 2018-11-16 2019-03-19 西安空间无线电技术研究所 A kind of cold standby bus complexing circuit suitable for digital processing system
CN109739697A (en) * 2018-12-13 2019-05-10 北京计算机技术及应用研究所 A kind of hard real-time two-shipper synchronous fault-tolerant system based on high-speed data exchange
CN111708695A (en) * 2020-06-12 2020-09-25 上海航天计算机技术研究所 AT 697-based cache single event upset resistant effect verification method
CN112860467A (en) * 2021-01-20 2021-05-28 北京国电高科科技有限公司 On-orbit fault smooth repairing device and method for satellite-borne computer
CN113744787A (en) * 2021-07-27 2021-12-03 北京空间飞行器总体设计部 SRAM (static random Access memory) type FPGA (field programmable Gate array) user register single event upset fault injection method
CN113744787B (en) * 2021-07-27 2023-09-08 北京空间飞行器总体设计部 SRAM type FPGA user register single event upset fault injection method
CN114090327A (en) * 2022-01-20 2022-02-25 浙江吉利控股集团有限公司 Single-particle error processing method, system and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120627