CN103744760A - Gate-level symptom based hardware fault detection method - Google Patents

Info

Publication number
CN103744760A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310743467.0A
Other languages
Chinese (zh)
Inventor
崔刚
傅忠传
王超
朱东杰
潘波
王秀峰
季春光
张明
王彦
张毕英
张策
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Application filed by Harbin Institute of Technology
Priority to CN201310743467.0A
Publication of CN103744760A
Legal status: Pending

Landscapes

  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The invention provides a hardware fault detection method based on gate-level symptoms and belongs to the field of hardware fault detection. It addresses the problem that hardware faults are currently not detected from gate-level symptoms. In the method, hardware fault detection is achieved by detecting gate-level symptoms, which comprise invalid packet (IPacket), processor hang, and application timeout; detecting these three symptoms achieves hardware fault detection. The method applies specifically to the field of hardware fault detection.

Description

Hardware fault detection method based on gate-level symptoms
Technical field
The invention belongs to the field of hardware fault detection methods.
Background
As semiconductor technology advances, the physical dimensions of transistors and interconnect keep shrinking, but the supply voltage cannot scale accordingly. Current density and temperature therefore rise, and integrated circuits become more susceptible to faults. Process scaling has brought processors into the multi-core era, and multi-core/many-core architectures represented by CMT and CMP (Chip Multi-Processor) have become mainstream. Variation in transistor dimensions between cores, and even within a core, accelerates the degradation of some devices and makes them more likely to fail. These factors make processor lifetime hard to predict, so corresponding protection mechanisms are urgently needed.
At nanometer technology nodes, the susceptibility of combinational logic is rising rapidly and is about to match or exceed that of sequential logic. On the one hand, its irregular structure makes it hard to protect; on the other hand, because of three masking effects (electrical masking, logical masking, and latching-window masking), its protection has received too little attention.
A fault injection study by G. P. Saggese on a DLX RISC processor showed that the susceptibility of combinational logic to transient faults lasting one clock cycle reaches 4.2%, an impact that cannot be ignored. In physical implementations, combinational logic often occupies more chip area in a processor than sequential logic. In the DLX RISC example, 90% of externally visible faults originate in combinational logic [84]; similarly, the results reported here show that 94.7% of faults originate in combinational logic. Traditional mechanisms such as ECC and parity checking protect sequential logic very well. However, studies have found that on out-of-order processors, for more than 90% of benchmarks a single-bit upset is detected by such checks with a probability below 50%, because the instruction has already left the pipeline by the time the out-of-order processor reads the operands.
Protection mechanisms must therefore give sufficient attention to combinational logic, and a mechanism that covers both combinational and sequential devices has a clear competitive advantage.
Fault detection has long been studied. Information-theoretic techniques such as ECC or parity checking are widely used to protect sequential logic but are difficult to apply to combinational logic. Research has found that more than 90% of the single-bit upset faults occurring in the combinational logic of out-of-order processors are detected by traditional mechanisms with a probability below 50%. Screening out latent faults by testing under specific stress conditions, typically burn-in accelerated testing, is another popular approach. In addition, many test circuits are embedded on chip to perform fault detection and testing during the processor's lifetime.
In recent years, fine-grained detection techniques at the level of individual processor components have emerged. Representative detection mechanisms for sequential logic components target the memory order buffer, register files, and caches; those for combinational logic target the ALU, wires, and so on. These techniques exploit the characteristics of one particular class of component to detect or test for faults, and so lack generality.
Mechanisms such as DIVA and RMT detect faults through spatial or temporal redundancy. Redundancy covers both combinational and sequential components and handles different fault types, so it is highly general, but the space and time overhead it introduces is prohibitive.
As process technology advances, intermittent faults occur more frequently and have become a significant obstacle to the reliability of integrated circuits and multi-core processors. No authoritative report has yet shown that existing detection mechanisms are effective against intermittent faults. A low-cost protection mechanism that covers both combinational and sequential devices and also covers intermittent faults can be expected to have a clear advantage.
If a fault causes a failure without propagating to the processor structural level, the execution result is classified as a gate-level event. If fault propagation causes an instruction synchronization error (involving store, load, or branch instructions), the execution result is classified as a memory-access-level event. If the fault corrupts processor structural-level state but does not cause an instruction synchronization error, the result is classified as a structural-level event. The fault manifestations at each level include detectable symptoms (symptoms), masking (mask), and erroneous results, SDC (Silent Data Corruption). The detectable symptoms are: invalid packet (IPacket), application timeout, and processor hang.
Summary of the invention
To solve the problem that hardware faults are currently not detected from gate-level symptoms, the invention provides a hardware fault detection method based on gate-level symptoms.
In the hardware fault detection method based on gate-level symptoms, hardware fault detection is achieved by detecting gate-level symptoms.
In the described method, the gate-level symptom is invalid packet (IPacket), processor hang, or application timeout.
In the described method, hardware fault detection is achieved by capturing the invalid packet (IPacket) symptom. The method is based on the PCX interface structure: the 8 control information input/output terminals of the PCX interface structure are connected to 8 CPU cores through independent buses; the PCX interface structure is connected to 4 L2Cache units through 4 independent buses, where L2Cache denotes the L2 cache; and the PCX interface structure is also connected to the I/O port and the FPU through 1 bus.
The L2 cache checks and processes requests from the CPU cores, then performs the memory access and returns the requested data packet to the CPU core through the crossbar (crosspoint switch matrix). The invalid packet (IPacket) among the data packets is a packet marked Invalid.
The detailed process of the method is as follows.
An invalid packet (IPacket) informs the system that the currently requested data is undergoing another operation.
For an Invalid packet, the valid flag is 0. The valid bit of the Invalid packet is modified through the crossbar, so that the invalid packet is captured and a hardware fault is detected.
In the described method, hardware fault detection is achieved by detecting the application timeout symptom. The method is based on the last_act_cycle register, th_last_act_cycle[63:0], the Global_cycle_cnt register, and core_cycle_cnt:
The last_act_cycle register indicates the cycle in which the processor was most recently active,
th_last_act_cycle[63:0] indicates the cycle in which each CMT thread was most recently active,
The Global_cycle_cnt register records the number of cycles the module under test has run, and is driven by the system clock,
core_cycle_cnt is the number of cycles the core under test has run.
The detailed process of the method is as follows.
While the processor executes, the registers above are updated and application timeout detection is performed from this information. Each thread in the CPU has a thread_running register that indicates whether the thread is active; it is initialized from the corresponding signal in tcu_core_running[7:0], which starts monitoring of the thread's running state.
When an instruction completes and leaves the pipeline, the last_act_cycle register is updated with core_cycle_cnt, and the contents of th_last_act_cycle[63:0] are updated at the same time;
When the number of cycles between core_cycle_cnt and the monitored thread's th_last_act_cycle[mytnum] exceeds the specified threshold, the thread-level application timeout symptom is captured;
When the difference between core_cycle_cnt and the monitored last_act_cycle exceeds the threshold, the global application timeout symptom is captured and a hardware fault is detected.
In the described method, hardware fault detection is achieved by detecting the processor hang symptom. The detailed process is as follows: a heuristic algorithm is used to improve the hang detector; tight loops are identified by monitoring branch instructions, and the number of loop iterations is accumulated; when that count exceeds a threshold, the hang symptom is captured and a hardware fault is detected.
Fault injection method:
To study the characteristics of intermittent faults comprehensively, the experiments use three fault types (transient, permanent, and intermittent) and six fault models, with different fault parameters.
The active time T_a of an intermittent fault is generated randomly under three configurations: [0.01T-0.1T], [0.1T-1T], and [1T-10T]. The burst length L_burst takes the values 2, 4, and 8, in order to simulate aging of different depths. The duration of a transient fault is generated randomly under the [0.01T-0.1T] configuration, and a permanent fault, once injected, lasts until the simulation ends.
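The parameter choices above can be illustrated with a minimal sketch. The names `FaultConfig`, `intermittent`, `transient`, and `permanent` are hypothetical, and uniform sampling is an assumption, since the text says only that the values are generated randomly.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Active-time ranges for intermittent faults, in units of T (as in the text).
TA_RANGES = [(0.01, 0.1), (0.1, 1.0), (1.0, 10.0)]
# Burst lengths used to model progressively deeper aging.
BURST_LENGTHS = [2, 4, 8]


@dataclass
class FaultConfig:
    kind: str                  # "transient", "intermittent", or "permanent"
    duration: Optional[float]  # in units of T; None means "until simulation ends"
    burst_length: int          # 1 for non-intermittent faults (assumption)


def intermittent(ta_range, burst_length, rng):
    lo, hi = ta_range
    # Assumption: uniform sampling within the configured range.
    return FaultConfig("intermittent", rng.uniform(lo, hi), burst_length)


def transient(rng):
    lo, hi = TA_RANGES[0]  # transient durations use the [0.01T-0.1T] range
    return FaultConfig("transient", rng.uniform(lo, hi), 1)


def permanent():
    return FaultConfig("permanent", None, 1)


rng = random.Random(0)  # fixed seed so that a run is repeatable
intermittent_configs = [
    intermittent(r, b, rng) for r in TA_RANGES for b in BURST_LENGTHS
]
print(len(intermittent_configs))  # 9 = 3 T_a ranges x 3 burst lengths
```

The nine intermittent configurations correspond to the three T_a ranges combined with the three burst lengths.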
To examine the distinct characteristics of intermittent faults, fault injection is performed with transient, permanent, and intermittent faults for comparison. The correspondence between fault models and fault parameters is shown in Table 4-1.
Table 4-1 Fault injection methods and fault models
Figure BDA0000449446740000041
Sequential logic uses two fault models. The bit-flip model represents an SEU (Single Event Upset) under a transient fault: the logic value of the injected object changes from 0 to 1 or from 1 to 0, and the fault duration is configurable. The stuck-at model represents a storage cell whose logic value is fixed at 0 or 1 and is unaffected by reads and writes. This model represents a permanent fault in sequential logic, and the fault lasts from the moment of injection until the simulation ends.
For combinational logic, VPFIT supports four fault models. The pulse model represents an SET in a transient fault, which is how an SEU manifests on combinational nets. The indeterminate-value model sets the logic value of the injected object to an undefined value. The open-circuit model represents a line that becomes high-impedance, and the bridging model represents a short circuit between two combinational nets.
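As an illustration of the two sequential-logic fault models described above, the following sketch applies them to a plain integer that stands in for a register. It is not VPFIT code; `bit_flip` and `StuckAtRegister` are hypothetical names, and the 8-bit width is an arbitrary assumption.

```python
def bit_flip(value: int, bit: int) -> int:
    """Bit-flip (SEU) model: invert one bit of the stored value."""
    return value ^ (1 << bit)


class StuckAtRegister:
    """Stuck-at model: one bit reads as a fixed level regardless of writes."""

    def __init__(self, stuck_bit: int, stuck_level: int, width: int = 8):
        self.stuck_bit = stuck_bit
        self.stuck_level = stuck_level
        self.mask = (1 << width) - 1
        self._value = 0

    def write(self, value: int) -> None:
        self._value = value & self.mask

    def read(self) -> int:
        # The stuck bit overrides whatever was written.
        if self.stuck_level:
            return self._value | (1 << self.stuck_bit)
        return self._value & ~(1 << self.stuck_bit) & self.mask


print(bin(bit_flip(0b1010, 0)))  # 0b1011

reg = StuckAtRegister(stuck_bit=3, stuck_level=0)
reg.write(0xFF)
print(hex(reg.read()))  # 0xf7
```

The four combinational models (pulse, indeterminate value, open circuit, bridging) act on nets rather than stored values and are not covered by this sketch.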
Because intermittent faults tend to recur at a relatively fixed location, the experiments use the functional modules inside each component as injection targets and assume that every module in a component is equally likely to fail. Within a module, faults follow a random distribution, and different device types, such as NET and REG, are assumed equally likely to fail. Five representative components of the SPARC T2 processor, comprising 13 modules in total, are used; they are described in Table 4-2.
In total, each benchmark receives 40950 intermittent fault injections (350 injections × 3 L_burst × 3 T_a × 13 structures) and 4550 injections each of transient and permanent faults, for 100100 fault injection experiments overall.
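The injection counts can be checked with a few lines of arithmetic. The variable names are illustrative; the figures are those given in the text. Note that 4550 equals 350 × 13, which would be consistent with one configuration per module for transient and permanent faults, although the text states only the totals.

```python
INJECTIONS_PER_CONFIG = 350
NUM_BURST_LENGTHS = 3   # L_burst in {2, 4, 8}
NUM_TA_RANGES = 3       # three T_a ranges
NUM_MODULES = 13        # 13 modules across 5 components

intermittent = (
    INJECTIONS_PER_CONFIG * NUM_BURST_LENGTHS * NUM_TA_RANGES * NUM_MODULES
)
transient = 4550  # as stated in the text
permanent = 4550  # as stated in the text

per_benchmark = intermittent + transient + permanent
total = per_benchmark * 2  # two benchmarks

print(intermittent, per_benchmark, total)  # 40950 50050 100100
```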
Table 4-2 Gate-level fault injection target structures
Figure BDA0000449446740000051
Fault injection using the hardware fault detection method based on gate-level symptoms of the invention:
UltraSPARC T2 is representative of current chip multithreading (CMT) architectures. The target processor has eight CPU cores with eight hardware threads per core, and each core has four pipelines: two integer pipelines, one floating-point pipeline, and one memory-access pipeline. Hardware thread scheduling hides memory access latency to improve processor throughput. Now that the memory wall is increasingly prominent, chip multithreading is an effective solution.
The pick logic is a representative component of a CMT processor. It decides which thread may use the pipeline by predicting thread throughput within a thread group (TG); each core has eight threads, and four hardware threads form a thread group. The address generation component generates the PC and NPC of instructions. The decode unit detects and decodes the instructions in the instruction buffer. These three components are representative of combinational logic in the processor and are therefore grouped as control functional blocks. The arithmetic-logic component is the core of the integer pipeline, and the integer register file holds the context of the hardware threads; they are classified as execution and storage functional blocks respectively.
Each component consists of multiple modules. The gate-level implementation size of each component and its internal modules is given in Table 4-1, where NET denotes combinational logic devices and REG denotes sequential logic devices.
Fault injection uses two benchmarks. IFU_BASIC_EX_RAW is representative of arithmetic-logic workloads (CPU-intensive), and LDST_ATOMIC is representative of memory-access workloads (memory-intensive). Their detailed characteristics are given in Table 4-3.
Table 4-3 Benchmark characteristics
Figure BDA0000449446740000052
Figure BDA0000449446740000061
The fault detection method based on gate-level symptoms is evaluated quantitatively according to the level to which a fault propagates. Fault propagation is divided into three levels: gate level, structural level (Arch-level), and memory access level.
If a fault causes a failure without propagating to the processor structural level, the execution result is classified as a gate-level event. If fault propagation causes an instruction synchronization error (involving store, load, or branch instructions), the execution result is classified as a memory-access-level event. If the fault corrupts processor structural-level state but does not cause an instruction synchronization error, the result is classified as a structural-level event.
The fault manifestations at each level include detectable symptoms (symptoms), masking (mask), and erroneous results, SDC (Silent Data Corruption). The detectable symptoms are: invalid packet (IPacket), application timeout, and processor hang. The other fault injection execution results are described in detail in Table 4-4.
Table 4-4 Description of gate-level symptoms and execution results
Figure BDA0000449446740000062
The fault manifestations of the LDST and EXU benchmarks at each level are shown in Table 4-5.
Table 4-5 Fault manifestations at each level
Figure BDA0000449446740000071
Experiments show that on average 93.36% of faults are masked, and that about 6.19% of faults are activated and detected as symptoms by the gate-level symptom detectors. The fault coverage of the gate-level symptom based fault detection method therefore averages 99.55%, with SDC at only 0.45%, which demonstrates the effectiveness of the gate-level symptom fault detection mechanism.
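The coverage figure follows directly from the two percentages above: a fault counts as covered if it is either masked or detected as a symptom, and whatever remains is SDC. A short check, with illustrative variable names:

```python
masked_pct = 93.36    # faults with no visible effect
detected_pct = 6.19   # faults caught as gate-level symptoms

coverage_pct = round(masked_pct + detected_pct, 2)
sdc_pct = round(100 - coverage_pct, 2)

print(coverage_pct, sdc_pct)  # 99.55 0.45
```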
Masking:
Masking means that although a fault is injected into the benchmark, the actual output is fully consistent with the fault-free execution result (the golden run). Accurately identifying the set of masked faults is a task that fault detection mechanism design must solve, and it is critical for reducing the false positive rate. The data show that the distribution of masking varies markedly across the three levels: the lower the level, the stronger the fault masking, with adjacent levels differing by about an order of magnitude (89.26%/3.99%/0.11%). Combinational logic (the _net rows) has lower masking capability than sequential logic (the _reg rows). For example, the sum of SDC and symptoms for gate-level combinational logic exceeds 7%, compared with only about 1% for sequential logic, which again shows that the fault susceptibility of combinational logic cannot be ignored.
The fault detection mechanism based on gate-level symptoms is clearly effective at detecting combinational logic faults: at the gate level, the two benchmarks detect 7.19% and 6.69% of combinational logic faults respectively, significantly reducing SDC. Fault manifestations also correlate with the benchmark. The LDST benchmark contains many memory access instructions and is intended to test the LSU, so its fault manifestations at the memory access level (masking/SDC/symptoms) are pronounced. By contrast, the EXU benchmark consists mainly of arithmetic-logic instructions intended to test the ALU and has few memory access instructions, so its fault manifestations at the memory access level are not significant.
SDC:
SDC measures the effectiveness of a detection mechanism: the lower this figure, the higher the fault coverage. For the fault detection mechanism based on gate-level symptoms, SDC has two sources: incomplete execution (Incomplete Execution) and incorrect result (Incorrect Result). In the former, the benchmark terminates before producing a correct result; in the latter, it produces erroneous output. The experimental results show SDC ratios of 0.09%/0.25% at the gate level, 0.5%/0.06% at the structural level, and 0.04%/0% at the memory access level. That is, the fault coverage of the fault detection mechanism based on gate-level symptoms reaches 99.5%, which demonstrates the effectiveness of the mechanism.
Symptom distribution:
The three mechanisms together detect 6027 fault cases, accounting for 92.7% of the fault total. The three symptoms behave differently at each level: the fault detection capability of invalid packet (IPacket) and application timeout weakens as the level rises, whereas the contribution of processor hang grows as the level rises.
Detailed analysis shows that, for the fault detection mechanism based on gate-level symptoms, the improvement in detection coverage comes mainly from exploiting gate-level and structural-level symptoms. The fault cases captured by invalid packet (IPacket) and application timeout cover almost all gate-level symptoms and account for 81.3% (5284/6499) of all unmasked faults. In other words, these two symptoms are the main reason that fault coverage rises from 95% to 99.5%.
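The percentages in this section can be reproduced from the counts given. Since 6027/6499 also comes to 92.7%, the "fault total" for the three mechanisms appears to be the same 6499 unmasked faults; that reading is an inference, as is attributing the remaining cases to processor hang.

```python
unmasked_faults = 6499          # denominator given in the text
by_ipacket_and_timeout = 5284   # cases caught by invalid packet + application timeout
by_all_three = 6027             # cases caught by all three mechanisms


def pct(part: int, whole: int) -> float:
    return round(100 * part / whole, 1)


print(pct(by_ipacket_and_timeout, unmasked_faults))  # 81.3
print(pct(by_all_three, unmasked_faults))            # 92.7

# Inference: the difference would be the cases caught by processor hang alone.
print(by_all_three - by_ipacket_and_timeout)         # 743
```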
At the application level, processor hang contributes significantly, accounting for 25.9% of symptoms; for the EXU benchmark in particular, processor hang reaches 50.5%. At the memory access level, processor hang accounts for 81.5% of the three symptoms. Note that the EXU benchmark produces almost no symptoms at the memory access level, because it contains only 8 memory access instructions and therefore performs few memory accesses.
Comparative analysis of combinational and sequential logic
The detection capability for combinational and sequential logic is analyzed quantitatively. Each of the two benchmarks receives 50050 fault injections, of which about 36000 target combinational logic (NET) and about 14000 target sequential logic (REG); this ratio is determined by the relative implementation sizes of combinational and sequential logic in the processor (2.6:1). Because intermittent faults tend to recur at a relatively fixed location, the modules inside each component are used as fault injection targets, and every module in a component, and every device type, is assumed to have the same fault probability.
First, from the perspective of fault propagation, the two benchmarks capture a total of 5456 symptoms (execution results other than a correct result). Combinational logic accounts for 5166 of them, or 94.7% of all symptoms; sequential logic accounts for only 290, or 5.3%. The ratio of symptoms triggered by combinational logic to those triggered by sequential logic is thus 17.8:1, far above the 2.6:1 implementation size ratio. This shows that faults originating in combinational logic are far more destructive than those originating in sequential logic.
Second, the three symptoms differ in their ability to detect faults in combinational and sequential logic. Invalid packet detects about 2300 fault cases at the gate level, of which only 2 have a fault site in sequential logic. At the structural level, the invalid packet symptom accounts for 94.1% of all three symptoms captured for combinational logic. This shows how effective invalid packet is for combinational logic. The analysis shows that application timeout and processor hang are effective for both combinational and sequential logic. Processor hang has notable detection capability at the memory access level, accounting for 81.5% of memory-access-level symptoms.
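The symptom shares and ratios reported above for combinational and sequential logic can be recomputed from the counts in the text. The injection counts of 36000 and 14000 are the approximate figures given there.

```python
comb_symptoms = 5166   # symptoms traced to combinational logic (NET)
seq_symptoms = 290     # symptoms traced to sequential logic (REG)
total_symptoms = comb_symptoms + seq_symptoms

comb_share = round(100 * comb_symptoms / total_symptoms, 1)
seq_share = round(100 * seq_symptoms / total_symptoms, 1)
symptom_ratio = round(comb_symptoms / seq_symptoms, 1)

# Approximate injection counts per benchmark, as given in the text.
injection_ratio = round(36000 / 14000, 1)

print(total_symptoms, comb_share, seq_share)  # 5456 94.7 5.3
print(symptom_ratio, injection_ratio)         # 17.8 2.6
```

Combinational logic receives about 2.6 times as many injections but produces about 17.8 times as many symptoms, which is the basis for the claim that its faults are more destructive.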
Component-level and module-level quantitative evaluation:
The fault detection mechanism was evaluated in detail above from the perspective of fault propagation. Here, the symptom distribution is first analyzed in detail at the component level, and the effect of different fault models and fault parameters on fault manifestation is examined; fault susceptibility bottlenecks are then analyzed at the module level. All execution results are referred to as symptoms, of which invalid packet, application timeout, and processor hang are called detectable symptoms.
Component-level symptoms:
Components differ in function and design and in their proportions of NET and REG, all of which lead to different fault propagation paths and hence different symptom distributions. The component-level symptom distribution is shown in Table 4-6.
Table 4-6 Component-level symptom distribution (percent)
Figure BDA0000449446740000091
First, Mask shows the fault masking capability of the different components. Masking is strongest in the arithmetic-logic unit (ALU) and the integer register file (IRF), at 97.27% and 97.21% respectively; these belong to the execution and storage functional blocks. The three components of the control functional blocks have lower fault masking rates: 93.45% for the address generation unit, 95.59% for the decode unit, and only 87.32% for the pick unit. This can be explained by the function of each component. The pick unit decides which thread may use the pipeline by predicting thread throughput within a thread group (TG) of the CMT processor; the address generation unit generates the PC and NPC addresses of instructions for each hardware thread to direct instruction fetch; and the decode unit decodes instructions after fetch. All three are control components that steer the generation of the program's control flow, so a fault in any of them has a large effect on the program's execution result. The pick unit, which is responsible for thread selection, steers global control flow, so its faults are the most destructive. The address generation unit and the decode unit affect control flow only at single-thread and instruction granularity, so they are less destructive than the pick unit. The arithmetic-logic unit and the integer register file mainly affect the program's data flow, so their faults are the least destructive.
Second, the three symptoms contribute 2.61% and 2.71% for the integer register file and the arithmetic-logic unit respectively. Their contribution for the control functional units is larger: 4.23% (Decoder), 6.39% (AGEN), and 12.54% (pick unit). Of this, invalid packet (IPacket) accounts for 86.5%, which shows how well this symptom detects faults in control functional components. Application timeout contributes less, but at the structural level it complements the invalid packet symptom well. Processor hang protects the arithmetic-logic unit and the integer register file more strongly, at an average of 0.77%, compared with only 0.18% for control logic.
Module-level symptoms:
To reveal fault susceptibility bottlenecks, sensitivity analysis is needed at the finer-grained module level; this provides a basis for designing intermittent fault protection and aging detection mechanisms. The symptoms captured by the detection system show the effectiveness of the detection mechanism on the one hand and the susceptibility of the components on the other. Table 4-7 gives the symptom distribution of each module under the permanent fault model; for reasons of space, only modules with a fault masking rate below 90% are listed.
Table 4-7 Module-level symptoms (permanent faults, percent)
Figure BDA0000449446740000101
The data show that the PKD module in the PKU component has a fault masking rate of only 48.29% and an SDC of only 1.14%; in other words, about 50% of the faults in this module trigger symptoms. The symptom ratio of EDP in the arithmetic-logic unit is 24%, and that of the IRF module in the integer register file reaches 28.86%, second only to the PKD module. This shows that the PKD module in the PKU component is the susceptibility bottleneck for permanent faults. The PKD module performs checking and detection in the PKU. The bottleneck for intermittent faults is similar: combinational logic in the PKD module is the susceptibility bottleneck for intermittent faults. The susceptibility bottlenecks for transient and intermittent faults have similar characteristics.
Fault models:
The fault coverage of the fault detection method based on gate-level symptoms for transient, intermittent, and permanent faults is 99.91%, 99.95%, and 99.08% respectively.
For a given burst length L_burst, the fault masking rate of every component falls significantly as the active time T_a grows. Taking the Decoder as an example, at a burst length of 4 the fault masking rate is 98.74% (0.01T-0.1T), 96.09% (0.1T-1T), and 93.20% (1T-10T), and under permanent faults it falls to 85.24%.
Within a given T_a range, the fault masking rate of every component falls as the burst length L_burst rises. Taking the pick logic as an example, with T_a in 0.01T-0.1T the fault masking rate falls from 97.43% (burst length 2) to 95.62% (burst length 4) and finally to 93.81% (burst length 8). This shows that as aging deepens, a longer burst length makes faults more destructive.
The data show that the active time T_a is more destructive than the burst length L_burst: a configuration with small T_a but large L_burst does not cause a significant drop in the fault masking rate. For example, for the PKU component, the fault masking rate with a burst length of 8 and T_a in 0.1T-1T is 85.43%, still higher than the 81.90% obtained with a burst length of 2 and T_a in 1T-10T. Note that, for the detection mechanism, active time and burst length show exactly the opposite trend.
Detection latency:
Detection latency is an important metric for a fault detection mechanism and directly determines the effectiveness of the protection mechanism. Short latency requires only a lightweight recovery mechanism and leaves time for fault diagnosis. Long latency, conversely, may make recovery too costly or even cause fault diagnosis and recovery to fail. Table 4-8 lists the mean fault detection latency of the three gate-level symptoms.
Table 4-8 Detection latency of the three gate-level symptoms (unit: clock cycles)
Figure BDA0000449446740000111
Invalid bag symptom average detection delay under instantaneous and intermittent fault model is less than 30 clock period, even average detection delay is also less than the 1K clock period (760 clock period) under permanent fault model.The detection of applying overtime symptom postpones in 3000 clock period magnitudes.The delay of processor hang-up symptom is slightly long, in 10K clock period magnitude.
The mean failure rate of transient fault, permanent fault and intermittent fault detects and postpones to be respectively 256.2,162.8, and 2930.3 clock period.Intermittent fault detects and postpones the ratios of approximately 23.7 clock period and exceed 96.5%, and maximum detection delay is that the ratio of 10K clock period magnitude is only 0.35%, allows that instruction re-executes, the lightweight mechanism such as rollback and hardware check point recovers.
Evaluation result:
The gate-level symptom based hardware fault detection method was verified: its fault coverage for transient, intermittent and permanent faults is 99.91%, 99.95% and 99.08%, respectively. For intermittent faults, more than 96.5% of detections have a latency of about 23.7 clock cycles, and only 0.35% reach the maximum detection latency on the order of 10K clock cycles, which allows recovery by lightweight hardware mechanisms. The three symptoms are independent of the faulty component, effectively cover both combinational and sequential devices, and are effective for transient, permanent and intermittent faults; the method is highly general and has low space and time overhead.
Brief description of the drawings
Fig. 1 is a schematic diagram of the PCX interface structure described in embodiment three and the components around it;
Fig. 2 is a schematic diagram of the CPU interface structure referred to in the summary of the invention.
Embodiments
Embodiment one: In the gate-level symptom based hardware fault detection method of this embodiment, hardware faults are detected by detecting gate-level symptoms.
Embodiment two: This embodiment differs from embodiment one in that the gate-level symptom is an invalid packet (IPacket), a processor hang, or an application timeout.
Embodiment three: This embodiment is described with reference to Fig. 1. It differs from embodiment two in that the hardware fault detection method is implemented by capturing the invalid packet (IPacket) symptom. The method is built on the PCX interface structure: the 8 control information input/output ends of the PCX interface structure are each connected to one of 8 CPU cores by an independent bus; the PCX interface structure is connected to 4 L2Caches by 4 independent buses, where L2Cache denotes the level-2 cache; and the PCX interface structure is also connected to the I/O port and the FPU processor by 1 bus.
The L2 cache verifies and processes the requests of the CPU cores, then performs the memory access and returns the requested data packet to the CPU core through the crossbar switch matrix (crossbar). The invalid packet IPacket among the data packets is the Invalid packet.
The detailed process of the method is as follows:
The invalid packet IPacket notifies the system that the currently requested data is undergoing another operation.
For an Invalid packet, the valid flag is 0. The valid bit of the Invalid packet is modified through the crossbar, so that the invalid packet is captured and a hardware fault is detected.

Embodiment four: This embodiment differs from embodiment two in that the hardware fault detection method is implemented by detecting the application timeout symptom. The method is based on the last_act_cycle register, Th_last_act_cycle[63:0], the Global_cycle_cnt register and core_cycle_cnt:
The last_act_cycle register indicates the cycle in which the processor was most recently active,
Th_last_act_cycle[63:0] indicates the cycle in which each CMT thread was most recently active,
The Global_cycle_cnt register records the number of cycles the module under test has run; this register is driven by the system clock,
Core_cycle_cnt denotes the number of cycles the core under test has run.
The detailed process of the method is as follows:
While the processor is executing, the above registers are updated and application timeout detection is performed according to this information. Each thread in the CPU has a thread_running register that marks whether the thread is active; monitoring of the thread running state is started by initializing the corresponding signal in tcu_core_running[7:0].
When an instruction completes and leaves the pipeline, the last_act_cycle register is updated with core_cycle_cnt, and the content of th_last_act_cycle[63:0] is updated at the same time;
When the number of cycles elapsed between th_last_act_cycle[mytnum] of the monitored thread and core_cycle_cnt exceeds the specified threshold, the thread-level application timeout symptom is captured;
When the difference between core_cycle_cnt and the monitored last_act_cycle exceeds the threshold, the global application timeout symptom is captured and a hardware fault is detected.
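The timeout check of embodiment four can be illustrated with a small behavioural model. This is only a software sketch under stated assumptions, not the hardware described in the application: the class name `TimeoutMonitor`, the `tick` and `start_thread` methods, the choice of 8 threads (taken from the width of tcu_core_running[7:0]) and the threshold values are all hypothetical, and the comparison is assumed to be "cycles since last activity greater than threshold".

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple

NUM_THREADS = 8  # assumption: one entry per bit of tcu_core_running[7:0]


@dataclass
class TimeoutMonitor:
    """Behavioural model of the application timeout symptom (hypothetical names)."""
    thread_threshold: int   # assumed per-thread inactivity limit, in cycles
    global_threshold: int   # assumed core-wide inactivity limit, in cycles
    core_cycle_cnt: int = 0
    last_act_cycle: int = 0
    th_last_act_cycle: List[int] = field(default_factory=lambda: [0] * NUM_THREADS)
    thread_running: List[bool] = field(default_factory=lambda: [False] * NUM_THREADS)

    def start_thread(self, tnum: int) -> None:
        # Start monitoring a thread, as the initialization signal would.
        self.thread_running[tnum] = True
        self.th_last_act_cycle[tnum] = self.core_cycle_cnt

    def tick(self, retired: Iterable[int] = ()) -> List[Tuple[str, Optional[int]]]:
        """Advance one cycle; `retired` lists threads that retire an instruction."""
        self.core_cycle_cnt += 1
        for tnum in retired:
            # An instruction left the pipeline: record the activity cycle.
            self.last_act_cycle = self.core_cycle_cnt
            self.th_last_act_cycle[tnum] = self.core_cycle_cnt
        symptoms: List[Tuple[str, Optional[int]]] = []
        for tnum in range(NUM_THREADS):
            idle = self.core_cycle_cnt - self.th_last_act_cycle[tnum]
            if self.thread_running[tnum] and idle > self.thread_threshold:
                symptoms.append(("thread_timeout", tnum))
        if any(self.thread_running) and \
                self.core_cycle_cnt - self.last_act_cycle > self.global_threshold:
            symptoms.append(("global_timeout", None))
        return symptoms
```

In this model a thread that stops retiring instructions raises a thread-level symptom first, and the global symptom follows only when no thread at all has retired an instruction for longer than the global threshold.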
Embodiment five: This embodiment differs from embodiment two in that the hardware fault detection method is implemented by detecting the processor hang symptom. The detailed process of the method is as follows: a heuristic algorithm is used to improve the hang detector; tight loops are identified by monitoring branch instructions, and the number of loop iterations is accumulated; when this count exceeds the threshold, the hang symptom is captured and a hardware fault is detected.
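The hang detection heuristic of embodiment five can be sketched in software as follows. The application does not specify how a tight loop is recognised, so this sketch assumes one plausible rule: a taken backward branch whose target lies within a small distance of the branch itself, repeated with the same (pc, target) pair. The class name `HangDetector`, the parameter `max_loop_span` and all numeric values are hypothetical.

```python
class HangDetector:
    """Counts iterations of a tight loop by watching branch instructions.

    Assumed heuristic, not taken from the source: a tight loop is a taken
    backward branch whose target is at most max_loop_span bytes before it.
    """

    def __init__(self, iteration_threshold: int, max_loop_span: int = 16):
        self.iteration_threshold = iteration_threshold
        self.max_loop_span = max_loop_span
        self.loop_key = None   # (pc, target) of the loop currently tracked
        self.iterations = 0    # accumulated iterations of that loop

    def observe_branch(self, pc: int, target: int, taken: bool) -> bool:
        """Feed one branch; return True when the hang symptom is captured."""
        is_tight = taken and target <= pc and pc - target <= self.max_loop_span
        if not is_tight:
            # Control flow left the loop: clear the accumulated count.
            self.loop_key = None
            self.iterations = 0
            return False
        key = (pc, target)
        if key == self.loop_key:
            self.iterations += 1
        else:
            # A different tight loop starts; begin counting afresh.
            self.loop_key = key
            self.iterations = 1
        return self.iterations > self.iteration_threshold
```

A legitimate long-running loop would also trip this counter, so in practice the threshold would have to be chosen well above the iteration counts expected from fault-free code.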

Claims (5)

1. A gate-level symptom based hardware fault detection method, characterized in that the hardware fault detection method detects hardware faults by detecting gate-level symptoms.
2. The gate-level symptom based hardware fault detection method according to claim 1, characterized in that the gate-level symptom is an invalid packet (IPacket), a processor hang, or an application timeout.
3. The gate-level symptom based hardware fault detection method according to claim 2, characterized in that the hardware fault detection method is implemented by capturing the invalid packet (IPacket) symptom; the method is built on the PCX interface structure; the 8 control information input/output ends of the PCX interface structure are each connected to one of 8 CPU cores by an independent bus; the PCX interface structure is connected to 4 L2Caches by 4 independent buses, where L2Cache denotes the level-2 cache; and the PCX interface structure is also connected to the I/O port and the FPU processor by 1 bus,
the L2 cache verifies and processes the requests of the CPU cores, then performs the memory access and returns the requested data packet to the CPU core through the crossbar; the invalid packet IPacket among the data packets is the Invalid packet,
the detailed process of the method is as follows:
the invalid packet IPacket notifies the system that the currently requested data is undergoing another operation,
for an Invalid packet, the valid flag is 0; the valid bit of the Invalid packet is modified through the crossbar, so that the invalid packet is captured and a hardware fault is detected.
4. The gate-level symptom based hardware fault detection method according to claim 2, characterized in that the hardware fault detection method is implemented by detecting the application timeout symptom; the method is based on the last_act_cycle register, Th_last_act_cycle[63:0], the Global_cycle_cnt register and core_cycle_cnt,
the last_act_cycle register indicates the cycle in which the processor was most recently active,
Th_last_act_cycle[63:0] indicates the cycle in which each CMT thread was most recently active,
the Global_cycle_cnt register records the number of cycles the module under test has run, and this register is driven by the system clock,
Core_cycle_cnt denotes the number of cycles the core under test has run,
the detailed process of the method is as follows:
while the processor is executing, the above registers are updated and application timeout detection is performed according to this information; each thread in the CPU has a thread_running register that marks whether the thread is active; monitoring of the thread running state is started by initializing the corresponding signal in tcu_core_running[7:0],
when an instruction completes and leaves the pipeline, the last_act_cycle register is updated with core_cycle_cnt, and the content of th_last_act_cycle[63:0] is updated at the same time;
when the number of cycles elapsed between th_last_act_cycle[mytnum] of the monitored thread and core_cycle_cnt exceeds the specified threshold, the thread-level application timeout symptom is captured;
when the difference between core_cycle_cnt and the monitored last_act_cycle exceeds the threshold, the global application timeout symptom is captured and a hardware fault is detected.
5. The gate-level symptom based hardware fault detection method according to claim 2, characterized in that the hardware fault detection method is implemented by detecting the processor hang symptom; the detailed process of the method is as follows: a heuristic algorithm is used to improve the hang detector; tight loops are identified by monitoring branch instructions, and the number of loop iterations is accumulated; when this count exceeds the threshold, the hang symptom is captured and a hardware fault is detected.
CN201310743467.0A 2013-12-30 2013-12-30 Gate-level symptom based hardware fault detection method Pending CN103744760A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310743467.0A CN103744760A (en) 2013-12-30 2013-12-30 Gate-level symptom based hardware fault detection method

Publications (1)

Publication Number Publication Date
CN103744760A true CN103744760A (en) 2014-04-23

Family

ID=50501780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310743467.0A Pending CN103744760A (en) 2013-12-30 2013-12-30 Gate-level symptom based hardware fault detection method

Country Status (1)

Country Link
CN (1) CN103744760A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271268A (en) * 2018-09-04 2019-01-25 山东超越数控电子股份有限公司 A kind of intelligent fault-tolerance method based on DPDK

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060005062A1 (en) * 2004-06-30 2006-01-05 Fujitsu Limited Buffer and method of diagnosing buffer failure
CN1852541A (en) * 2005-11-30 2006-10-25 华为技术有限公司 Base-station fault detecting method and fault detecting system
CN1882917A (en) * 2003-09-26 2006-12-20 Ati技术公司 Method and apparatus for monitoring and resetting a co-processor
CN101149723A (en) * 2006-09-19 2008-03-26 国际商业机器公司 Livelock resolution method and apparatus and system
CN101751980A (en) * 2008-12-17 2010-06-23 中国科学院电子学研究所 Embedded programmable memory based on memory IP core
CN102761439A (en) * 2012-06-13 2012-10-31 烽火通信科技股份有限公司 Device and method for detecting and recording abnormity on basis of watchdog in PON (Passive Optical Network) access system
CN103092734A (en) * 2011-10-28 2013-05-08 株式会社东芝 Periodic error detection method and periodic error detection circuit

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BARAMOWSKI RAFAL ET AL.: "Efficient multi-level fault simulation of HW/SW systems for structural faults", 《SCIENCE CHINA INFORMATION SCIENCES》 *
MICHAIL MANIATAKOS: "Instruction-Level Impact Comparison of RT- vs. Gate-Level Faults in a Modern Microprocessor Controller", 《VLSI TEST SYMPOSIUM, 2009. VTS "09. 27TH IEEE》 *
欧国东 和 张民选: "一种基于线程的数据预取方法", 《计算机工程与科学》 *
王超: "低代价锁步EDDI:处理器瞬时故障检测机制", 《计算机学报》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140423
