CN107844407A

CN107844407A - A PRISM-based reliability verification method for SEU mitigation

Info

Publication number: CN107844407A
Application number: CN201711102436.1A
Authority: CN
Inventors: 王超; 李静; 陈阳
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2017-11-06
Filing date: 2017-11-06
Publication date: 2018-03-27

Abstract

The invention discloses a PRISM-based reliability verification method for SEU mitigation. The method applies formal verification techniques at the early design stage of a system to analyze the reliability, availability, and safety of the system under different hardening techniques and parameters, helping designers develop more reliable and more effective solutions. The PRISM probabilistic model checker is used to model the system design with hardening techniques such as memory scrubbing and redundancy taken into account. The method uses CDFG scheduling (for fault recovery), fault detection coverage (safety-related), and a feature library (which provides the single-event upset probability of each component). The resulting model can describe all states that the system can reach.

Description

A PRISM-based reliability verification method for SEU mitigation

Technical field

The invention belongs to the field of computer system model checking, and specifically relates to a reliability verification method for embedded systems in the aerospace field.

Background art

In recent years, with the continuous development of aerospace technologies, more and more integrated circuit devices (such as the integrated circuits in satellites) need to work in radiation environments. In 1962, Wallmark and Marcus predicted the influence of cosmic rays on microelectronic devices. In 1978, Pickel and Blandford analyzed memory system anomalies on a classified U.S. satellite and confirmed that these anomalies were caused by single-event upsets (SEUs). If no protective measures are taken against single-event upsets, the consequences can be catastrophic once an upset occurs in a critical electronic chip. Of 1589 anomalies recorded on U.S. satellites, 621 were caused by single-event upsets, accounting for 39.1%. In China, the main control computer of the "Fengyun-1 (B)" satellite was irradiated by high-energy charged particles and suffered multiple single-event upsets, which eventually caused the attitude control system to fail and ended the satellite's life prematurely. Therefore, compared with other kinds of devices, devices used in radiation environments require a stronger guarantee of radiation-tolerant reliability.

Current radiation-tolerant chip devices fall mainly into three types: application-specific integrated circuits (ASICs), one-time programmable devices (such as antifuse-based FPGAs), and reprogrammable devices (such as SRAM-based FPGAs). Taking aerospace applications as an example, reprogrammable devices have advantages that the first two types lack. The first is cost. Because the production volume of a device for a single aerospace application is often very low, the cost of a single ASIC device is quite high, and the device must be redesigned and remanufactured for each application. SRAM-based FPGAs, by contrast, can be mass-produced, which reduces the manufacturing cost of each device, and because process and manufacturing issues largely do not need to be considered, design cost is greatly reduced and the product development cycle is shortened. The second is reconfigurability. Given the environmental complexity and high cost of aerospace applications, reconfigurability is of great significance and value. On the one hand, if a design problem is found or a fault occurs after launch, a reconfigurable device only needs to download the modified circuit remotely for the system to recover, which greatly improves the maintainability of the system. On the other hand, because the system is reconfigurable, its tasks and functions can be changed and updated as needed, which avoids redevelopment and gives the aerospace system a high degree of "intelligence". Owing to these advantages, SRAM-based FPGAs have received widespread attention and have been used in many space projects. For example, Xilinx Virtex FPGAs (XQVR4000XL) were used successfully in a Mars exploration mission in 2003, and the European Venus exploration mission in 2005 also used a system based on Virtex FPGAs (XQVR600). Compared with the first two device types, however, SRAM-based FPGAs are more susceptible to SEUs among radiation effects, so the device must be hardened against SEUs to give it fault tolerance and thereby improve system reliability.

In 2011, researchers at the manned spaceflight general department of the China Academy of Space Technology proposed a hardening scheme against SEUs. The scheme applies several measures to harden the FPGA: first, detection by readback comparison; second, memory scrubbing; and third, BRAM macro self-correction. All three measures are recommended by Xilinx, and they effectively mitigate SEUs and improve the reliability of the system. Researchers at the Indian Institute of Technology proposed a dynamic partial reconfiguration scheme to resist single-event upsets and improve system reliability. This scheme uses partial reconfiguration based on modular design (RM) and can reconfigure the logic resources in the FPGA that are prone to single-event upsets. It not only makes effective use of FPGA resources but also improves system reliability. Much work has thus been done on FPGA reliability both in China and abroad, and it can improve system reliability to a certain extent. However, even with these hardening schemes, the system finally produced may still fail to meet the reliability standard. Rigorous reliability verification is therefore required before the system is formally put into operation.

Formal verification refers to the use of mathematical methods, during the design of computer software and hardware systems, to prove or disprove the correctness of a system with respect to certain formal specifications or properties. For traditional methods such as simulation, covering all the possible behaviors of a system has become a classic problem. Formal verification is better than simulation in this respect, because a formal method can verify a given formal specification by exhausting the state space of the system. Model checking is a mature, automated formal verification technique for verifying the correctness of finite-state systems. Given a formal model of the system and the specifications and properties to be verified, a model checking algorithm automatically and exhaustively searches all states the system can reach to verify whether the specifications and properties are satisfied. If they are not, a counterexample can be produced. Model checking is used very widely, for example to examine the safety, performance, or independence of a system.

Summary of the invention

In view of the shortcomings of existing reliability verification methods for single-event effect mitigation, the present invention provides a verification method for assessing the reliability, availability, and safety of integrated circuit devices in the aerospace field. Formal verification techniques are applied at the early design stage of the system to analyze system reliability, availability, and safety under different hardening techniques and parameters, helping designers develop more reliable and more effective solutions and reducing overall design cost.

The technical solution adopted by the present invention is a PRISM-based reliability verification method for SEU mitigation, which uses a probabilistic model checker to assist chip design in the aerospace field. First, a CDFG is generated from the program to be verified. The CDFG comprises a control flow graph (CFG) and a data flow graph (DFG) and can describe all the state and behavior information of the system. The CDFG is then divided into time steps, each corresponding to a clock cycle at the register-transfer level (RTL), and a rescheduling fault recovery technique is applied. Next, the CDFG is modeled with different numbers of allocated components. Finally, the PRISM probabilistic model checker is used to model and verify the system design. The method specifically comprises the following steps:

Step 1: Build the reliability verification platform for single-event effect mitigation: set up PRISM, an open-source probabilistic model checker, then test it and perform the initial configuration.

PRISM is an open-source probabilistic model checker. It includes multiple model checking engines, several of which are symbolic, that is, based on binary decision diagrams and their extensions, such as multi-terminal binary decision diagrams. These engines can perform probabilistic verification of models containing up to 10^10 states (on average, PRISM handles models of at most 10^7 to 10^8 states). PRISM also offers several advanced techniques, such as abstraction refinement and symmetry reduction. In addition, it supports approximate/statistical model checking through a discrete-event simulation engine.

Step 2: Extract the control and data flow graph (CDFG) from a description in a high-level language (such as C/C++).

The CDFG model is composed of structures for arithmetic and logical operations and can represent all the behaviors of an algorithm. Tools such as GAUT and SUIF can be used to extract a CDFG from a high-level design description expressed in a high-level language (such as C/C++). In the aerospace field, C/C++ is the mainstream high-level language in common use, and many programs written in C/C++ are loaded into embedded chips and used in rockets and spacecraft. The method is equally applicable to other languages, such as Java, and to older languages such as Fortran.
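The division of a data flow graph into clock-cycle time steps, and its rescheduling when fewer components are available, can be illustrated with a minimal sketch. The graph, the operation names, and the function `list_schedule` below are hypothetical and are not taken from GAUT, SUIF, or the patent's own tooling; each operation is assumed to take one clock cycle.

```python
# Hypothetical DFG for y = (a*b + c*d) + e*f.
# Each node maps to (operation type, list of predecessor nodes).
DFG = {
    "m1": ("mul", []),
    "m2": ("mul", []),
    "m3": ("mul", []),
    "a1": ("add", ["m1", "m2"]),
    "a2": ("add", ["a1", "m3"]),
}

def list_schedule(dfg, resources):
    """Assign every operation to a time step (one clock cycle each),
    using at most resources[op_type] components of each type per step."""
    done_step = {}  # node -> time step in which it executes
    step = 0
    while len(done_step) < len(dfg):
        step += 1
        used = {}  # components of each type already busy in this step
        for node, (op, preds) in dfg.items():
            if node in done_step:
                continue
            # An operation is ready once all its inputs were produced earlier.
            ready = all(p in done_step and done_step[p] < step for p in preds)
            if ready and used.get(op, 0) < resources.get(op, 0):
                done_step[node] = step
                used[op] = used.get(op, 0) + 1
        if not used:
            raise ValueError("no progress: a required component type is unavailable")
    return done_step

# Full configuration versus a configuration after one adder and one multiplier failed.
full = list_schedule(DFG, {"add": 2, "mul": 2})
degraded = list_schedule(DFG, {"add": 1, "mul": 1})
print(max(full.values()), max(degraded.values()))
```

Under these assumptions the schedule with fewer components needs more time steps, which corresponds to the reduced throughput of a degraded configuration; with no component of a required type left, no schedule exists and the function raises an error.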

Step 3: Model the extracted CDFG using the PRISM modeling language. Specifically, the CDFG is first extracted, and the PRISM modeling language is then used to model the CDFG with different configurations (numbers of available components) under radiation-environment conditions (hardening techniques, fault occurrence and recovery, and so on). The failure rates of the components are obtained from the component feature library. PRISM is then used to verify various reliability properties automatically, to check whether the system meets the requirements.

Probabilistic model checking is a technique for analyzing systems that exhibit random behavior and for automatically verifying defined probabilistic properties. It has been applied successfully in many fields, such as randomized distributed algorithms, communication and security protocols, and biology. A Markov model is the model corresponding to the class of random processes known as Markov chains, which have the following property: given the current state (the present), the future evolution of the process does not depend on its past evolution. Many real-world processes can be regarded as Markov processes, such as the Brownian motion of particles in a liquid, the number of people infected by an infectious disease, and the number of people waiting at a station. In 1931, A. N. Kolmogorov, in the paper "Analytical Methods in Probability Theory", first applied analytical methods such as differential equations to this class of processes, laying the theoretical foundation of Markov processes. Four models are most common in probabilistic model checking: discrete-time Markov chains (DTMCs), continuous-time Markov chains (CTMCs), Markov decision processes (MDPs), and probabilistic timed automata (PTAs). When modeling a system, the model is chosen according to the characteristics of the system's behavior.

Brief description of the drawings

Fig. 1 is a diagram of radiation-induced single-event effects;

Fig. 2 is a schematic diagram of the single-event upset effect;

Fig. 3 is a schematic diagram of the basic structure of a Virtex-series FPGA;

Fig. 4 is a schematic diagram of a logic error in a three-input look-up table caused by an SEU;

Fig. 5 is a schematic diagram of the basic structure of the TMR method;

Fig. 6 shows pseudocode and the corresponding DFG;

Fig. 7 is an example of the CDFG rescheduling fault recovery technique;

Fig. 8 is a schematic diagram of probabilistic model checking;

Fig. 9 is an example of CTMC reliability and availability analysis;

Fig. 10 is the safety model that considers whether failures are safe.

Detailed description of the embodiments

To make the technical means, inventive features, objects, and effects of the present invention easy to understand, the present invention is further described below with reference to specific embodiments.

The present invention is a verification method for assessing the reliability, availability, and safety of integrated circuit devices in the aerospace field. Formal verification techniques are applied at the early design stage of the system to analyze system reliability, availability, and safety under different hardening techniques and parameters, helping designers develop more reliable and more effective solutions and reducing overall design cost. The PRISM probabilistic model checker is used to model the system design with hardening techniques such as memory scrubbing and redundancy taken into account, and the resulting model can describe all states that the system can reach. The invention uses CDFG scheduling (for fault recovery), fault detection coverage (safety-related), and a feature library (which provides the single-event upset probability of each component). The theory of probabilistic model checking and the specific modeling are described in detail below.

There are many model checking techniques, each suited to a different type of model. Probabilistic model checking is a technique for analyzing systems that exhibit random behavior and for automatically verifying defined probabilistic properties. It has been applied successfully in many fields, such as randomized distributed algorithms, communication and security protocols, and biology.

At present, all of these probabilistic models are Markov models, that is, models corresponding to the class of random processes known as Markov chains. The Markov chain was proposed by the Russian mathematician A. A. Markov in 1907. The process has the following property: given the current state (the present), its future evolution does not depend on its past evolution. Many real-world processes can be regarded as Markov processes, such as the Brownian motion of particles in a liquid, the number of people infected by an infectious disease, and the number of people waiting at a station. In 1931, A. N. Kolmogorov, in the paper "Analytical Methods in Probability Theory", first applied analytical methods such as differential equations to this class of processes, laying the theoretical foundation of Markov processes.

A CTMC is a random process with the Markov property: given the state X(s) at the present time s and the states X(u) at all past times u, 0 ≤ u ≤ s, the conditional distribution of the state X(t+s) at the future time t+s depends only on the present state X(s) and is independent of the past. A CTMC is continuous in time and discrete in state; examples include weather forecasting, the random walk of a particle, and the gambler's ruin problem. The process by which an SRAM-based FPGA fails over time because of SEUs can be regarded as a continuous-time Markov process. A CTMC comprises a set of states S and a transition rate matrix R: S × S → ℝ≥0. The rate R(s, s') determines the delay before a transition between states s and s'. If R(s, s') ≠ 0, the probability that the transition between s and s' occurs within time t is 1 - e^{-R(s, s') × t}. If R(s, s') = 0, no transition occurs.

PRISM (Probabilistic Symbolic Model Checker) is a commonly used probabilistic model checking tool. It is free, open-source software released by Oxford University in 1999. Once a model file and a properties file have been loaded into PRISM, a specific property or all properties can be verified. PRISM also defines the concept of an experiment. An experiment assigns initial values to all state variables of the model and runs through one execution of the model. As the model parameters change, PRISM can plot the trend in model behavior, so experiments make it possible to analyze the factors that influence system behavior in an intuitive way. The PRISM language has two parts: the model language and the property language. The model language is a state-based specification language used for modeling. The property language includes temporal logics such as PCTL and CSL. PRISM supports the automated analysis of a wide range of quantitative properties of models.
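As a small illustration of the CTMC definition above, the following sketch evaluates the probability 1 - e^{-R(s, s') × t} for a single transition. The function names and the example rate are hypothetical. The second function uses the usual race interpretation of a CTMC state with several outgoing transitions, which is an assumption added here and is not spelled out in this description.

```python
import math

def transition_probability(rate, t):
    """Probability that a transition with rate R(s, s') fires within time t."""
    if rate < 0 or t < 0:
        raise ValueError("rate and time must be non-negative")
    # 1 - e^(-rate * t); expm1 keeps precision when rate * t is very small.
    return -math.expm1(-rate * t)

def branch_probabilities(out_rates):
    """Assumed race semantics: with several outgoing transitions, successor s'
    is taken with probability R(s, s') / E(s), where E(s) is the total exit rate."""
    exit_rate = sum(out_rates.values())
    if exit_rate == 0:
        return {}
    return {succ: r / exit_rate for succ, r in out_rates.items()}

# Hypothetical example: a failure rate of 1e-6 per second observed over one day.
print(transition_probability(1e-6, 86400.0))
print(branch_probabilities({"adder_fails": 1.0e-6, "multiplier_fails": 3.0e-6}))
```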

Because the SEU rate λ of an SRAM-based FPGA depends strongly on the device process technology, the architecture, and the orbit, this parameter differs for each orbit. CREME96 [7] is used to obtain the per-bit single-event upset rate λ_bit of the Xilinx Virtex-5 in highly elliptical orbit (HEO) and low Earth orbit (LEO). The failure rate of a component can be calculated with the following equation:

λ_component=λ_bit×Number of critical bits (1)

In the present invention, λ_bit = 7.31 × 10^-12 SEUs/bit/sec, and Number of critical bits is the number of critical bits of the component.

Table 1 gives the parameters for component failures caused by SEUs. The first column is the component and the second column is the number of essential bits. In general, the number of critical bits of a component is smaller than its number of essential bits; the experiments adopt the worst case, in which all essential bits of a component are critical bits. The third column is the mean time between failures (MTBF), in days. The component failure rate λ and the mean time between failures are related as follows:

MTBF = 1/λ_component

Table 1. Feature library
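Equation (1) can be sketched numerically as follows. The per-bit rate is the value given in this description; the bit count of 100 000 is a hypothetical placeholder, not an entry from Table 1, and the conversion to MTBF assumes the reciprocal relation MTBF = 1/λ that holds for exponentially distributed failures.

```python
LAMBDA_BIT = 7.31e-12      # SEUs/bit/sec, value given in the description
SECONDS_PER_DAY = 86400.0

def component_failure_rate(critical_bits, lambda_bit=LAMBDA_BIT):
    """Equation (1): lambda_component = lambda_bit * number of critical bits."""
    return lambda_bit * critical_bits

def mtbf_days(rate_per_sec):
    """Mean time between failures in days, assuming MTBF = 1 / lambda."""
    if rate_per_sec <= 0:
        raise ValueError("failure rate must be positive")
    return 1.0 / rate_per_sec / SECONDS_PER_DAY

# Hypothetical component with 100 000 critical bits (worst case: all essential bits critical).
rate = component_failure_rate(100_000)
print(rate, mtbf_days(rate))
```

For this hypothetical bit count the MTBF comes out at roughly 15.8 days; real entries would use the essential-bit counts reported for each component.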

1) Reliability and availability analysis modeling

CTMC models are often used for reliability modeling of degradable systems. Each state in the CTMC model of a particular configuration can be classified into a type according to the number of working components. For example, an FIR filter needs at least one adder and one multiplier to complete a successful operation. Any state that does not meet this minimum resource availability is therefore marked as a failed state. Finally, the state labeled "all failed" represents the situation in which all components of the system have failed one after another because of SEUs. Safe and unsafe failures are not considered at this modeling stage; how safety is incorporated into the model is described in detail in the next subsection. In the initial state of a configuration, all components are available and the system has maximum throughput. The edges between states represent transition rates. The modeling assumptions are as follows.

Assumption 1: Components fail independently, and both the time to component failure caused by configuration bit flips and the memory scrubbing interval follow exponential distributions. Under this assumption, the repair rate is μ = 1/τ, where τ is the memory scrubbing interval.

Assumption 2: Only data flow components fail. In many systems, data flow components make up the vast majority of the design compared with control flow components, so the probability that a data flow component is affected by an SEU is much greater than that for a control flow component.

Assumption 3: Only one component can fail because of an SEU at a time, and the failure of each component can easily be detected by the system. This assumption keeps the complexity of the Markov model manageable.

Assumption 4: A cold spare component can fail because of cosmic radiation only while it is active. Cold spare components provide redundancy and become active only when a component of the same type fails.

Assumption 5: The reconfiguration and rescheduling times (that is, the time the system needs to reschedule when a component fails and the time needed for repair by memory scrubbing) are very small compared with the time between failures and repairs. Rescheduling takes at most a few clock cycles, and memory scrubbing takes only a few milliseconds.

Assumption 6: All states in the CTMC model can be divided into three types:

1) Operational: all components of the system work normally, and the throughput of the system is at its maximum.

2) Degraded: at least one component of the system has failed. The system continues to work with the remaining component resources, but its throughput is reduced.

3) Failed: the remaining components of the system are insufficient to complete a successful operation, and the throughput is 0.

Take a device with 2 adders and 2 multipliers as an example. The Markov state transition model of the device is shown in Fig. 8, and the device is assumed to need at least 1 adder and 1 multiplier to complete a successful operation. In each state, A and M and the accompanying numbers denote adders and multipliers and the number of each that are working. Based on the state transition diagram, PRISM code is used to describe all the states and behaviors of this system. The states of the system are classified with formulas: three formulas, operational, degraded, and failed, identify whether the system is operating normally, degraded, or failed, respectively.

The process of modeling a system with 2 adders and 2 multipliers in PRISM is as shown above, where num_A and num_M denote the numbers of adders and multipliers available in the initial state of the configuration. The variables λA and λM denote the failure rates of the adders and multipliers, and miu denotes the repair rate. Each repair (memory scrub) returns the system to its initial state. PRISM then builds the corresponding probabilistic model, which in this case is a CTMC. The repair transitions in the code are synchronized on the label [repair] to model the fact that, when the FPGA memory is scrubbed, all components are repaired at the same time. The formulas operational, degraded, and failed classify the operational, degraded, and failed states of the model.
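The structure of such a model can be approximated outside PRISM with a short sketch. It is not the PRISM code of this description: the numeric rates, the assumption that each of the a working adders (or m working multipliers) fails independently so that the failure rate out of a state is a × λA (or m × λM), and the function names are all hypothetical. The sketch builds the CTMC over states (a, m), classifies them as operational, degraded, or failed, and solves the balance equations for the long-run state probabilities.

```python
from itertools import product

NUM_A, NUM_M = 2, 2     # num_A and num_M in the description
LAMBDA_A = 7.31e-7      # hypothetical adder failure rate (1/s)
LAMBDA_M = 2.0e-6       # hypothetical multiplier failure rate (1/s)
MIU = 1.0 / 100.0       # repair rate 1/tau for a hypothetical 100 s scrubbing interval

def classify(a, m):
    """Label a state by the number of working adders a and multipliers m."""
    if a < 1 or m < 1:
        return "failed"
    if a == NUM_A and m == NUM_M:
        return "operational"
    return "degraded"

def build_rates():
    """Return {state: {successor: rate}} for all states (a, m)."""
    rates = {}
    for a, m in product(range(NUM_A + 1), range(NUM_M + 1)):
        out = {}
        if a > 0:
            out[(a - 1, m)] = a * LAMBDA_A   # assumed: each working adder fails independently
        if m > 0:
            out[(a, m - 1)] = m * LAMBDA_M
        if (a, m) != (NUM_A, NUM_M):
            out[(NUM_A, NUM_M)] = MIU        # a scrub restores the initial state
        rates[(a, m)] = out
    return rates

def steady_state(rates):
    """Solve pi * Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    states = list(rates)
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    mat = [[0.0] * (n + 1) for _ in range(n)]   # augmented matrix of the balance equations
    for s, out in rates.items():
        for t, r in out.items():
            mat[idx[t]][idx[s]] += r            # probability flow into t from s
            mat[idx[s]][idx[s]] -= r            # probability flow out of s
    mat[n - 1] = [1.0] * n + [1.0]              # replace one equation by sum(pi) = 1
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(mat[r][col]))
        mat[col], mat[piv] = mat[piv], mat[col]
        for r in range(n):
            if r != col:
                f = mat[r][col] / mat[col][col]
                mat[r] = [x - f * y for x, y in zip(mat[r], mat[col])]
    return {s: mat[idx[s]][n] / mat[idx[s]][idx[s]] for s in states}

def availability(pi):
    """Long-run probability of being in an operational or degraded state."""
    return sum(p for (a, m), p in pi.items() if classify(a, m) != "failed")

pi = steady_state(build_rates())
print(pi[(NUM_A, NUM_M)], availability(pi))
```

Because every non-initial state returns to the initial state at rate miu, the long-run probability of the initial state under these assumptions equals miu / (miu + 2λA + 2λM), which gives a simple check on the solver.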

2) Safety analysis modeling

The model above assumes that the fault detection algorithm correctly detects and handles all faults. In practice this is not the case, because some faults always escape the implemented fault detection mechanism. In that case the system cannot reschedule and continues to run in a faulty mode. This means that each component in the configuration implementing the CDFG may fail in either a safe or an unsafe manner. The model therefore needs to be improved by introducing the concepts of safe failure and unsafe failure, which are defined as follows:

Definition 1: A safe failure state is entered when the failures of components caused by SEUs are correctly detected and the system, on rescheduling, finds that the number of remaining components is insufficient to complete a successful operation.

Definition 2: An unsafe failure state refers to fail-silent behavior, in which the system cannot detect a fault when a component fails. If all component failures are detected safely, the system eventually enters the safe failure state; but if even a single unsafe failure occurs, the system immediately enters the unsafe failure state. The fault detection coverage of a component can be defined by the conditional probability C:

C = P(fault detection | fault existence) (3.2)

Fig. 10 shows the safety model (including repair transitions) of a simple system with a single component type and only two adders. For this case, the system is assumed to need at least one adder to complete an operation successfully. Initially, the system is in the operational mode with two adders. When one adder fails and the failure is detected, the system reschedules and simply continues with one adder. If the failure is not detected, the system moves to the unsafe failure state. If the other adder then fails, the system can no longer operate and therefore fails safely; but if this failure is not detected, the system ends up failing in an unsafe manner. Incorporating safety into the model requires Assumption 6 to be modified as follows:

Assumption 6: All states in the CTMC can be divided into four types:

1) Operational: all components function normally, and the throughput of the system is at its highest.

2) Degraded: at least one component has failed.

3) Safe failure: the number of remaining fault-free components is insufficient to perform a successful operation, so the throughput is 0. To reach the safe failure state, all the faults that caused the system to fail must be safe.

4) Unsafe failure: at least one fault was not detected by the fault detection algorithm. Fail-silent behavior of a component causes the system to enter the unsafe failure state immediately.
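The effect of the coverage C on the two-adder example can be sketched as follows. The state names, the rate layout, and both functions are hypothetical; the sketch assumes that both adders are active in the initial state, that a scrub repairs the degraded state only, and that the two failure states are absorbing. None of these details is confirmed by the description or by Fig. 10.

```python
def safety_rates(lam, miu, coverage):
    """Transition rates of the assumed two-adder safety model."""
    c = coverage
    return {
        "two_adders": {"one_adder": 2 * lam * c,          # detected failure, system reschedules
                       "unsafe_failed": 2 * lam * (1 - c)},  # undetected failure
        "one_adder": {"safe_failed": lam * c,             # detected failure, no adder left
                      "unsafe_failed": lam * (1 - c),
                      "two_adders": miu},                 # scrub restores both adders
        "safe_failed": {},                                # assumed absorbing
        "unsafe_failed": {},                              # assumed absorbing
    }

def unsafe_failure_probability(lam, miu, coverage):
    """Probability that the system, started with two adders, ends in unsafe failure.

    With u2 and u1 the probabilities from the two-adder and one-adder states:
      u2 = (1 - C) + C * u1
      u1 = (lam * (1 - C) + miu * u2) / (lam + miu)
    Solving gives u1 = (lam + miu) * (1 - C) / (lam + miu * (1 - C)).
    """
    if lam <= 0 or miu < 0 or not 0.0 <= coverage <= 1.0:
        raise ValueError("need lam > 0, miu >= 0 and 0 <= coverage <= 1")
    u1 = (lam + miu) * (1 - coverage) / (lam + miu * (1 - coverage))
    return (1 - coverage) + coverage * u1

# Hypothetical parameters: failure rate 1e-6 per second, no scrubbing, 90% coverage.
print(unsafe_failure_probability(1e-6, 0.0, 0.9))
```

Without repair the result reduces to 1 - C^2, since the system fails safely only if both adder failures are detected. With frequent scrubbing and C < 1 the probability of eventually failing unsafely grows, because a safe failure then requires two detected failures within one scrubbing interval.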

The fault detection coverage C is then added, and the code above is modified as follows:

The above is the PRISM model, with the coverage C added, of the system with 2 adders and 2 multipliers. The preceding steps describe in detail how the analysis method proposed in this patent is carried out. The analysis method centers on probabilistic model checking: an SRAM-based FPGA in a radiation environment is idealized and abstracted as a continuous-time Markov model, and the system states are divided according to the number of working components into three classes: operational, degraded, and failed. The transitions between states are determined by the SEU rate λ and the repair rate μ, where λ is related to the component MTBF in the feature library and μ is related to the memory scrubbing rate.

Claims

1. A PRISM-based reliability verification method for SEU mitigation, being a method that applies formal verification techniques at the early design stage of a system to analyze system reliability, availability, and safety under different hardening techniques and parameters and helps designers develop more reliable and more effective solutions, characterized in that it is a verification method for assessing the reliability, availability, and safety of integrated circuit devices in the aerospace field; the PRISM probabilistic model checker is used to model the system design with hardening techniques such as memory scrubbing and redundancy taken into account, and the resulting model can describe all states that the system can reach; and the method uses CDFG scheduling (for fault recovery), fault detection coverage (safety-related), and a feature library (which provides the single-event upset probability of each component).

2. The PRISM-based reliability verification method for SEU mitigation according to claim 1, characterized in that it comprises the following steps. A: Build the reliability verification platform for single-event effect mitigation: set up PRISM, an open-source probabilistic model checker, then test it and perform the initial configuration. B: Extract the control and data flow graph (CDFG) from a description in a high-level language (such as C/C++). C: Model the extracted CDFG using the PRISM modeling language; specifically, the CDFG is first extracted, and the PRISM modeling language is then used to model the CDFG with different configurations (numbers of available components) under radiation-environment conditions (hardening techniques, fault occurrence and recovery, and so on). The failure rates of the components are obtained from the component feature library, and PRISM is then used to verify various reliability properties automatically, to check whether the system meets the requirements.

3. The PRISM-based reliability verification method for SEU mitigation according to claim 1, characterized in that six assumptions are used in the reliability and availability analysis modeling: components are assumed to fail independently, and the time to component failure caused by configuration bit flips and the memory scrubbing interval are assumed to follow exponential distributions, which simplifies the problem; because data flow components make up the vast majority of the design in many systems, only data flow components are assumed to fail; only one component is assumed to suffer an SEU failure at a time, which keeps the complexity of the Markov chain low; a cold spare component is assumed to fail because of cosmic radiation only while it is active, cold spare components being used to provide redundancy and becoming active only when a component of the same type fails; the reconfiguration and rescheduling times are assumed to be very small compared with the time between failures and repairs; and all states in the CTMC model are assumed to be divisible into three types: operational, degraded, and failed.

4. The PRISM-based reliability verification method for SEU mitigation according to claim 1, characterized in that, in the safety analysis modeling, the model is improved to cover the case in which faults cannot be handled by introducing the concepts of safe failure and unsafe failure: a safe failure state is entered when the failures of components caused by SEUs are correctly detected and the system, on rescheduling, finds that the number of remaining components is insufficient to complete a successful operation; an unsafe failure state refers to fail-silent behavior, in which the system cannot detect a fault when a component fails; if all component failures are detected safely, the system eventually enters the safe failure state, but if even a single unsafe failure occurs, the system immediately enters the unsafe failure state; and the fault detection coverage of a component can be defined by the conditional probability C.

5. The fault detection coverage of a component according to claim 4, characterized in that the fault detection coverage of a component can be defined by the conditional probability C, given by C = P(fault detection | fault existence), as illustrated by the safety model (including repair transitions) of a simple system with a single component type and only two adders. In this case, the system is assumed to need at least one adder to complete an operation successfully. Initially, the system is in the operational mode with two adders. When one adder fails and the failure is detected, the system reschedules and simply continues with one adder. If the failure is not detected, the system moves to the unsafe failure state. If the other adder then fails, the system can no longer operate and therefore fails safely.

6. The fault detection coverage of a component according to claim 5, characterized in that, when safety is incorporated into the model, Assumption 6 is modified so that all states in the CTMC are assumed to be divisible into four types: (1) operational: all components function normally, and the throughput of the system is at its highest; (2) degraded: at least one component has failed; (3) safe failure: the number of remaining fault-free components is insufficient to perform a successful operation, so the throughput is 0, and to reach the safe failure state all the faults that caused the system to fail must be safe; (4) unsafe failure: at least one fault was not detected by the fault detection algorithm, and fail-silent behavior of a component causes the system to enter the unsafe failure state immediately.