CN102521066A

CN102521066A - On-board computer space environment event fault tolerance method

Info

Publication number: CN102521066A
Application number: CN2011103619895A
Authority: CN
Inventors: 翟君武; 陶利民; 李林; 汪路元; 唐自新; 李伟
Original assignee: Beijing Institute of Spacecraft System Engineering
Current assignee: Beijing Institute of Spacecraft System Engineering
Priority date: 2011-11-15
Filing date: 2011-11-15
Publication date: 2012-06-27

Abstract

The invention relates to a space environment event fault-tolerance method for an on-board computer. It mainly covers the handling of memory single-event upsets, tolerance of changes in chip internal registers caused by space radiation, and tolerance of partial circuit failures caused by space radiation. For memory single-event upsets, error detection and correction (EDAC) checking is added to the memory region, and the on-board computer checks the region by reading and writing it periodically. For changes in chip internal registers caused by space radiation, the on-board computer protects unused interrupts; working-mode registers are inspected periodically and re-initialized when a value differs from the expected value; and registers related to bus message transmission are assigned again before each message is sent. For partial circuit failures caused by space radiation, fault tolerance is achieved by replacing faulty random-access memory (RAM) chips, by detecting bus interface chip faults and switching over, and by detecting central processing unit (CPU) chip faults and switching over. The method can effectively improve the reliability of the on-board computer during launch and in-orbit operation.

Description

Space environment event fault-tolerance method for a spaceborne computer

Technical field

The present invention relates to a fault-tolerance method for a spaceborne computer.

Background technology

Throughout launch and operation, a spacecraft may encounter various space environment events owing to factors such as the space environment and the characteristics of the spacecraft itself. If left unhandled, these events can cause satellite system functions to fail or even collapse. Measures should therefore be taken to deal with these abnormal conditions so that the satellite can continue to operate correctly and stably, thereby ensuring the stable operation and service of the whole satellite system.

Space environment events mainly include memory single-event upsets, changes in chip internal registers caused by space radiation, and partial circuit failures caused by space radiation. A memory single-event upset can produce wrong results in on-board software or FPGAs, and can even cause the software to run away or hang. A change in a chip internal register caused by space radiation can make certain chips of the spacecraft malfunction and thus affect the functions they implement. Partial circuit failure caused by space radiation mainly refers to the failure of part of a circuit after a single-event latch-up.

At present, fault-tolerance methods for space environment anomalies in spaceborne computers have not yet been studied systematically.

Summary of the invention

The technical problem solved by the present invention is to overcome the deficiencies of the prior art by providing a fault-tolerance method for space environment events in a spaceborne computer, thereby establishing a space environment event fault-tolerance strategy suitable for spaceborne computer design and improving the reliability of the spaceborne computer during launch and in-orbit operation.

The technical solution of the present invention is a space environment event fault-tolerance method for a spaceborne computer, with the following steps:

(1) After the spaceborne computer is initially powered on, first detect whether the spaceborne computer software can start normally. If it can, the software feeds the software watchdog at a fixed period and runs normally. If the software cannot start, or fails to feed the software watchdog at the fixed period, the reset circuit issues a reset signal to the spaceborne computer and the spaceborne computer restarts. If the spaceborne computer fails to start normally three times in a row, switch to the backup spaceborne computer.

(2) Once the spaceborne computer software is running normally, perform read/write tests on all RAM. If a RAM region cannot be read or written correctly, the spaceborne computer is configured by software to replace the faulty RAM with the backup RAM.

(3) While the spaceborne computer software is running normally, periodically send polling bus messages to each bus terminal. When none of the bus terminals can be reached, the software sends a switchover signal to the spaceborne computer, and the spaceborne computer switches to the backup unit.

(4) While the spaceborne computer software is running normally, enable all interrupt sources that are actually used and mask all others. When the spaceborne computer responds to an interrupt, first confirm the interrupt source; if the interrupt does not come from one of the interrupts actually in use, re-initialize the interrupt mask register.

(5) While the spaceborne computer software is running normally, periodically check whether the values of the registers that set the working state of the bus driver chip have changed. If a register value has changed, the spaceborne computer re-initializes that register and the related registers. In addition, registers whose state is valid only part of the time are assigned again each time their valid period begins.

(6) While the spaceborne computer software is running normally, compute a Hamming check code for the data at each memory address and store it. The spaceborne computer periodically checks the data at each memory address: when a single-bit error is found, it is corrected; when an error of two or more bits is found, the spaceborne computer is reset and restarted.
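The start-up supervision in step (1) can be sketched as follows. This is a minimal illustration, not the patented implementation: `try_start` is a hypothetical stand-in for one power-on or reset attempt, and the sketch assumes that the first power-on counts as one of the three attempts, which the text does not state explicitly.

```python
from typing import Callable

MAX_FAILED_STARTS = 3  # step (1): three consecutive failed starts


def supervise_boot(try_start: Callable[[str], bool]) -> str:
    """Return the unit that ends up running: 'primary' or 'backup'.

    try_start(unit) is a hypothetical hook that models one start attempt.
    It returns True when the software starts and keeps feeding the watchdog.
    """
    for _ in range(MAX_FAILED_STARTS):
        if try_start("primary"):
            return "primary"
        # A failed start means the watchdog is not fed, so the reset
        # circuit resets the computer and the loop tries again.
    return "backup"
```

A start that succeeds on the third attempt keeps the primary unit in service; only three failures in a row hand control to the backup.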

Compared with the prior art, the present invention has the following advantages:

(1) The method targets the specific events caused by the space environment and applies a different fault-tolerance technique to each type, which can effectively improve the in-orbit reliability of the spaceborne computer.

(2) Fault tolerance for space environment events is implemented by the spaceborne computer software, which can improve the autonomous management capability of the satellite.

(3) With hardware support, the method relies mainly on software for error detection and fault tolerance in the spaceborne computer. Its principle is simple, it is easy to implement and maintain, and it is applicable to most satellites and therefore widely generalizable.

Description of drawings

Fig. 1 is a flow chart of the method of the present invention;

Fig. 2 is a diagram of the specific fault-tolerance content of the method of the present invention;

Fig. 3 is a hardware configuration diagram of the spaceborne computer in the embodiment of the present invention.

Embodiment

The fault tolerance for space environment anomalies in the present invention uses the software and hardware resources of the spaceborne computer and handles each type of space environment event differently, while still meeting the weight and power constraints of the spaceborne computer.

As shown in Fig. 1, the method applies different types of fault tolerance to events caused by the space environment, such as single-event upsets and single-event latch-ups. It is applicable to most spacecraft and can improve the in-orbit autonomy and reliability of satellite equipment. As shown in Fig. 2, it mainly covers three aspects: handling of memory single-event upsets, tolerance of chip internal register changes caused by space radiation, and tolerance of partial circuit failures caused by space radiation.

(1) Handling of memory single-event upsets

For memory single-event upsets, EDAC checking is added to the memory region, and the spaceborne computer checks the region by reading and writing it periodically. An EDAC check code is characterized as "correct one, detect two": a single-bit error can be corrected, whereas a double-bit or multi-bit error cannot be corrected and can only be reported. The spaceborne computer hardware is therefore designed with an EDAC checking circuit for the memory. When an EDAC check fails, an interrupt is raised to the software. In the interrupt handler, the software reads the EDAC status to determine whether the error is a single-bit or a multi-bit error. For a single-bit error, the software corrects the error in memory by writing the read data back; for a double-bit error, it eliminates the effect of the error through an autonomous software reset.
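The "correct one, detect two" behaviour can be illustrated with a toy extended Hamming code. The sketch below protects only 4 data bits with 4 check bits; the word width and bit layout are assumptions made for brevity and do not describe the EDAC circuit of any particular chip.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit codeword, returned as a list of bits.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming(7,4) positions, with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of ones even
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data), where status is 'ok', 'corrected' or 'double'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Exactly one bit flipped: the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code = code.copy()
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity holds but the syndrome is non-zero: two bits flipped.
        return "double", None
    data = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, data
```

In the terms of the text, a `corrected` result corresponds to writing the repaired word back to memory, and a `double` result corresponds to the autonomous reset.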

(2) Chip internal register changes caused by space radiation

Chip internal registers are interfaces that the chip designers leave to the user for convenience; different register values change the chip's working mode and main functions. A change in a chip internal register caused by space radiation can make the spaceborne computer perform its normal functions incorrectly. The spaceborne computer mainly uses the following measures against such changes: unused interrupts are protected, to prevent unpredictable interrupts caused by changes in interrupt-related registers; working-mode registers are inspected periodically and re-initialized if a value is not the expected one; and registers related to bus message transmission are assigned again before each message is sent.
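The periodic inspection and the reassignment before each send can be sketched as below. The register names, values and the dictionary that models the chip are all hypothetical; real registers would be memory-mapped and chip-specific.

```python
from typing import Callable, Dict, List

# Hypothetical register names and values, chosen only for illustration.
EXPECTED_MODE = {"MODE": 0x0001, "CONFIG": 0x0020}
TX_SETUP = {"TX_CTRL": 0x0400, "TX_PTR": 0x0100}


def scrub_registers(regs: Dict[str, int], expected: Dict[str, int]) -> List[str]:
    """Periodic inspection of working-mode registers.

    Returns the names of registers found changed. If any register differs,
    the whole related group is re-initialized to its expected values.
    """
    changed = [name for name, value in expected.items() if regs.get(name) != value]
    if changed:
        regs.update(expected)
    return changed


def send_message(regs: Dict[str, int], payload: bytes,
                 transmit: Callable[[bytes], None]) -> None:
    """Rewrite the transmit-related registers before every send, so an
    upset that occurred since the last send cannot affect this one."""
    regs.update(TX_SETUP)
    transmit(payload)
```

The two measures differ in cost: always-valid registers are compared on a timer, while registers that matter only at send time are simply rewritten without a comparison.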

(3) Partial circuit failures caused by space radiation

Partial circuit failure caused by space radiation mainly refers to the failure of part of a circuit after a single-event latch-up. The spaceborne computer applies fault isolation and system reconfiguration mechanisms to such circuits to eliminate the effect of a single-event latch-up on part of a circuit. The main measures are replacement of faulty RAM chips, bus interface chip fault detection and switchover, and CPU chip fault detection and switchover. The spaceborne computer adopts a standby redundancy strategy: when a RAM chip can no longer be read or written normally, the computer switches to the backup RAM; when the bus interface chip or the CPU chip becomes abnormal, the computer autonomously switches to the backup unit.

The main steps of the method of the present invention are as follows:

(1) The spaceborne computer is initially powered on and starts running.

(2) Detect whether the spaceborne computer software can start normally. If it can, the software feeds the watchdog at a fixed period and runs normally. Otherwise the software cannot feed the watchdog, the reset circuit issues a reset signal to the spaceborne computer, and the spaceborne computer restarts. If it fails to start normally three times in a row, switch to the backup spaceborne computer.

(3) After the spaceborne computer software is running, perform read/write tests on all RAM. If a RAM region cannot be read or written correctly, the RAM has been damaged for an unknown reason, and the software configures the computer to use the backup RAM.

(4) After the spaceborne computer is running, periodically send polling bus messages to each bus terminal. When none of the bus terminals can be reached, the bus driver chip is taken to be damaged. The software then sends a switchover signal to the spaceborne computer, and the spaceborne computer switches to the backup unit, which uses another bus driver chip.

(5) While the spaceborne computer is running, enable all interrupt sources that are actually used and mask all others. When the spaceborne computer responds to an interrupt, first confirm the interrupt source. If the interrupt does not come from one of the interrupts actually in use, a single-event upset has occurred in the interrupt mask register, and the interrupt mask register is re-initialized.

(6) While the spaceborne computer is running, periodically check whether the values of the registers that set the working state of the bus driver chip have changed. A change indicates that the register has been affected by a single event; the spaceborne computer then re-initializes that register and the related registers.

(7) While the spaceborne computer is running, some registers hold state that is valid only part of the time. Each time these registers are needed, they are assigned again, which eliminates any single-event effect they may have suffered beforehand.

(8) While the spaceborne computer is running, compute a Hamming check code for the data at each memory address and store it. The spaceborne computer periodically checks the data at each memory address: when a single-bit error (a single-event upset) is found, it is corrected; when a multi-bit error is found, the computer is reset and the program is reloaded.

(9) Repeat steps (4) to (8) to carry out fault-tolerant processing of space environment events.
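The interrupt check in step (5) can be sketched as follows. The interrupt numbers are hypothetical, and the sketch assumes a mask word in which a set bit enables a source; real controllers may use the opposite polarity.

```python
from typing import Callable, Dict, Iterable

USED_SOURCES = frozenset({3, 5, 12})  # hypothetical interrupt numbers


def mask_for(sources: Iterable[int]) -> int:
    """Build a mask word with bit n set for every enabled source n."""
    mask = 0
    for n in sources:
        mask |= 1 << n
    return mask


class InterruptController:
    """Toy model of an interrupt mask register and its dispatch logic."""

    def __init__(self) -> None:
        self.mask = mask_for(USED_SOURCES)
        self.reinit_count = 0

    def handle(self, source: int, handlers: Dict[int, Callable[[], None]]) -> bool:
        """Dispatch a used interrupt; treat any other source as an upset."""
        if source not in USED_SOURCES:
            # An unused source fired, so the mask register is assumed
            # to have been corrupted: rebuild it and ignore the interrupt.
            self.mask = mask_for(USED_SOURCES)
            self.reinit_count += 1
            return False
        handlers[source]()
        return True
```

Confirming the source before dispatch means a corrupted mask is repaired the first time it lets a spurious interrupt through.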

Embodiment

Taking a particular satellite as an example, the space environment event fault-tolerance strategy of the spaceborne computer is described below.

As shown in Fig. 3, the spaceborne computer of the satellite uses a TSC695F as its CPU, which has a built-in EDAC circuit, and also has a replacement circuit for redundant RAM. The spaceborne computer uses a 61580 as its bus interface chip. It has 128K of PROM and 8M of RAM; the RAM consists of four 9Q512K32 chips of 2M each, and the system additionally backs up one 2M RAM. It also has a telemetry interface and a telecommand interface. Application software implements the application-layer functions on top of the operating system. The spaceborne computer uses dual-unit cold standby, and the two units are identical in composition.

During start-up of the spaceborne computer, the operating system first self-tests the four RAM chips. If a RAM chip cannot be read or written correctly, the backup RAM is substituted for it; if the result is still abnormal after the replacement, the computer is reset.
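The start-up self-test and substitution can be sketched as below. The `Bank` class is a toy model of one RAM chip, and the two alternating-bit test patterns are an assumption; the text does not say which test the operating system actually runs. The test overwrites memory, which is acceptable only because it runs before the memory is in use.

```python
from typing import List, Tuple

PATTERNS = (0x55555555, 0xAAAAAAAA)  # assumed alternating-bit test words


class Bank:
    """Toy model of one RAM chip; stuck_at simulates a cell that ignores writes."""

    def __init__(self, size: int, stuck_at: int = -1) -> None:
        self.cells = [0] * size
        self.stuck_at = stuck_at

    def write(self, addr: int, value: int) -> None:
        if addr != self.stuck_at:
            self.cells[addr] = value

    def read(self, addr: int) -> int:
        return self.cells[addr]


def bank_ok(bank: Bank) -> bool:
    """Write each pattern to every cell and read it back."""
    size = len(bank.cells)
    for pattern in PATTERNS:
        for addr in range(size):
            bank.write(addr, pattern)
        if any(bank.read(addr) != pattern for addr in range(size)):
            return False
    return True


def select_banks(banks: List[Bank], spare: Bank) -> Tuple[List[Bank], bool]:
    """Return (active banks, reset_needed).

    One faulty bank is replaced by the spare. A reset is requested when
    the spare is also faulty or a second bank fails.
    """
    active = list(banks)
    spare_used = False
    for i, bank in enumerate(active):
        if bank_ok(bank):
            continue
        if spare_used or not bank_ok(spare):
            return active, True
        active[i] = spare
        spare_used = True
    return active, False
```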

The TSC695F has a built-in EDAC circuit. When an EDAC check of the memory region fails, the corresponding interrupt is raised, and the circuit records whether the error is a single-bit or a double-bit error together with the value read at that time. During initialization, the application software hooks this interrupt. After the interrupt occurs, the software first determines whether the error is a single-bit error. If it is, the value corrected by the EDAC is written back to the RAM region to eliminate the single-bit error; if it is a double-bit error, the computer is reset immediately.
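The decision made in the hooked interrupt handler can be sketched as follows. The function signature, the dictionary that models RAM and the `reset` callback are hypothetical; on real hardware the error type, address and corrected value would be read from the processor's status registers.

```python
from enum import Enum
from typing import Callable, Dict


class EdacError(Enum):
    SINGLE = 1  # correctable single-bit error
    DOUBLE = 2  # uncorrectable double-bit error


def on_edac_interrupt(kind: EdacError, address: int, corrected_value: int,
                      ram: Dict[int, int], reset: Callable[[], None]) -> str:
    """Handle one EDAC interrupt and report the action taken."""
    if kind is EdacError.SINGLE:
        # Write the corrected value back so the stored word is clean again
        # and a later second upset in the same word stays correctable.
        ram[address] = corrected_value
        return "corrected"
    reset()
    return "reset"
```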

Every 0.5 seconds, the application software of the spaceborne computer checks the working mode of the 61580 chip and the working mode of the 695F chip, and assigns the set values again if they differ. Each time it sends a bus message, the application software again assigns the registers of the 61580 chip that relate to message transmission. The operating system protects unused interrupts: when an unused interrupt occurs abnormally, it clears the corresponding bit of the interrupt status register and exits the interrupt handler.

The spaceborne computer has an autonomous switchover function. When the application software detects that none of the bus terminals can be reached, it considers the 61580 chip to have failed and switches over immediately. When the CPU chip becomes abnormal, it is reset by the watchdog; if it still has not recovered after three resets, the computer switches over immediately, thereby isolating the location of the fault caused by the space environment event.
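The bus-side switchover condition can be sketched as below. The `poll` callback is a hypothetical stand-in for sending one polling message and checking for a reply. The guard for an empty terminal list is an added assumption, so that a unit with no configured terminals does not switch over spuriously.

```python
from typing import Callable, Iterable


def should_switch_over(terminals: Iterable[int],
                       poll: Callable[[int], bool]) -> bool:
    """Return True only when every polled bus terminal fails to respond."""
    responses = [poll(t) for t in terminals]
    return bool(responses) and not any(responses)
```

Requiring every terminal to be silent is what points the fault at the local bus interface chip: a single silent terminal is more likely a fault on the terminal side.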

Content not described in detail in this specification belongs to techniques well known to those skilled in the art.

Claims

1. A space environment event fault-tolerance method for a spaceborne computer, characterized in that the steps are as follows: