CN1851826A

CN1851826A - Method and system for detecting and handling random access memory failure

Info

Publication number: CN1851826A
Application number: CN 200610001749
Authority: CN
Inventors: 李强
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2006-01-25
Filing date: 2006-01-25
Publication date: 2006-10-25
Anticipated expiration: 2026-01-25
Also published as: CN100536031C

Abstract

The invention discloses a method for detecting and handling random access memory (RAM) failure, applied to the RAM of a CPU/DSP. The method comprises: reading the program content in the RAM; and either comparing the content read with the correct program content and, when the two differ, judging that the RAM has failed and repairing the data; or checking the content read using a set check method, comparing the result with the correct check result and, when the two differ, judging that the RAM has failed and raising an alarm.

Description

Method and system for detecting and handling random access memory failure

Technical field

The present invention relates to the field of detection, and in particular to a method and system for detecting and handling failure of the RAM storage space of a CPU/DSP.

Background technology

Over its life cycle, a device containing random access memory (Random Access Memory, RAM) can suffer functional failure for a variety of reasons. In general, a failure in which the hardware itself is damaged is called a device hard failure (Firm Error); otherwise it is called a device soft failure (Soft Error).

Soft failures are mainly caused by charged particles striking the RAM storage cells of a device. These high-energy particles interact with the atoms of the semiconductor memory and produce electron-hole pairs, which change the information stored in the cell and in turn cause the device to malfunction.

Engineers first observed soft failures at the end of the 1970s. The cause at that time was α particles emitted by the decay of radioactive impurities in device packaging material, which produced ionization effects and changed the state of memory cells. With the development of semiconductor technology, shrinking process dimensions and lower operating voltages, it has since been found that cosmic rays can also cause device soft failures, with far more serious effects than before, so device soft failure is once again receiving industry attention.

Soft failures can occur in any device that contains RAM. To date, however, the industry has focused mainly on soft failures in RAM-based logic devices such as field programmable gate arrays (Field Programmable Gate Array, FPGA) and application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), and has accumulated some experience in designing for and protecting against them. Soft failures in the static RAM (Static Random Access Memory, SRAM) and dynamic RAM (Dynamic Random Access Memory, DRAM) widely used with CPUs/DSPs have received far less attention.

In fact, the SRAM and DRAM of a CPU/DSP suffer from equally serious soft failures. According to data provided by several device manufacturers, converted to an annual failure rate, the probability of a 1-bit soft failure in a 1-megabit SRAM is on the order of ppt. For products with very strict stability requirements, such as communications, aerospace and military products, this is a very high failure rate.

Soft-failure cases caught in practical applications are not rare. A communications vendor once compiled a year's statistics on boards of a particular product returned for repair, and found that nearly 80% of them could be repaired by erasing the database in memory and re-downloading the host software; analysis indicated that, with high probability, these faults were caused by SRAM device soft failures. While locating a problem in another product, developers captured a case in which only a single bit of program information in DSP memory had been lost abnormally. Code analysis showed that nothing in the code could have changed this bit, so it could essentially be concluded that the problem was caused by a soft failure of a program bit stored in the DSP's SRAM.

What is certain is that in commercial products that make heavy use of CPUs/DSPs, the probability of SRAM or DRAM soft failures is very high. Some of these can be found through memory analysis and confirmed as soft failures, but most show up as faults that cannot be reproduced, such as resets, deadlocks, or partial functional errors with no exception record. This affects product reliability on the one hand, and on the other hand a great deal of manpower is spent locating such problems. It is therefore necessary to limit the influence of soft failures to a minimum.

Summary of the invention

The invention provides a method for detecting and handling RAM failure, in order to discover RAM failures of a CPU/DSP early and to reduce the occurrence of RAM failures.

Based on the above method, the invention also provides a system for detecting and handling RAM failure.

The method of the invention comprises:

A. reading the program content in the RAM;

B. comparing the program content read with the correct program content and, when the two are inconsistent, judging that the RAM has failed and performing data repair; or

checking the program content read using a set check method, comparing the result with the correct check result and, when the two are inconsistent, judging that the RAM has failed and raising an alarm.

According to the above method, if the RAM can be accessed by a high-level control entity of the CPU/DSP, the high-level control entity reads the program content in the RAM and compares it with the correct program content stored locally in the high-level control entity; when the two are inconsistent, the high-level control entity loads the correct program content into the RAM to repair the data.

If the RAM cannot be accessed by a high-level control entity of the CPU/DSP, the CPU/DSP itself reads and checks the program content in the RAM; or

the correct program content is backed up in the RAM in advance;

the CPU/DSP compares the program content in the RAM with the backed-up program content and, when the two are inconsistent, repairs the program content in the RAM from the backup.

In the above method, the check or comparison of the program content in the RAM is triggered by a low-priority task.

According to the above method, when the program content is compared, the data units storing program content in the RAM are compared one by one with the data units storing the correct program content, and the bits within each data unit are compared one by one.
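The unit-by-unit, bit-by-bit comparison can be illustrated with a minimal sketch. It assumes 32-bit data units held in ordinary Python lists; the function name `diff_bits` and the sample values are hypothetical and are not taken from the patent.

```python
def diff_bits(ram_units, reference_units, width=32):
    """Return (unit_index, bit_index) for every bit that differs."""
    mismatches = []
    for index, (actual, expected) in enumerate(zip(ram_units, reference_units)):
        delta = actual ^ expected          # set bits mark the positions that differ
        for bit in range(width):
            if (delta >> bit) & 1:
                mismatches.append((index, bit))
    return mismatches


reference = [0x12345678, 0xDEADBEEF, 0x00000000]
ram = list(reference)
ram[1] ^= 1 << 5                           # simulate one flipped bit in unit 1
```

Comparing at bit granularity lets the location of the flipped bit be logged alongside the address of the affected unit.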

According to the above method, after data repair it is judged whether the repair succeeded. If it succeeded, the RAM is judged to have suffered a soft failure and the relevant soft-failure information is recorded; otherwise, the repair is repeated;

when the number of repeated repairs reaches a preset threshold, the RAM is judged to have suffered a hard failure and a hardware fault alarm is reported.
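A minimal sketch of this soft/hard classification follows. The `SimulatedRam` class, its `stuck_at_zero` fault model and the default of three repair attempts are assumptions made for illustration only; the patent does not specify a threshold value or a fault model.

```python
class SimulatedRam:
    """Toy RAM model; `stuck_at_zero` maps an address to bits that cannot be set."""

    def __init__(self, units, stuck_at_zero=None):
        self.units = list(units)
        self.stuck_at_zero = dict(stuck_at_zero or {})

    def read(self, address):
        return self.units[address]

    def write(self, address, value):
        mask = self.stuck_at_zero.get(address, 0)
        self.units[address] = value & ~mask    # stuck bits silently stay at 0


def repair_unit(ram, address, correct_value, max_repairs=3):
    """Rewrite one data unit and classify the failure as 'soft' or 'hard'."""
    for attempt in range(1, max_repairs + 1):
        ram.write(address, correct_value)
        if ram.read(address) == correct_value:  # read-back check
            return "soft", attempt
    return "hard", max_repairs                  # threshold reached: report hardware fault


healthy = SimulatedRam([0xFFFF0000])
healthy.units[0] ^= 1 << 2                      # transient bit flip
damaged = SimulatedRam([0], stuck_at_zero={0: 0x1})
```

A transient flip is cleared by the first rewrite, while the unit with a stuck bit never reads back correctly and is reported as a hard failure once the threshold is reached.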

According to the above method, the comparison or check of the program content in the RAM is carried out in batches, at a set cycle.

In the above method, checking the program content in the RAM in batches comprises:

combining the check result value of the previous batch of data with the next batch of data in the check calculation, to obtain the check result value of the next batch;

taking the check result value of the last batch of data as the final check result for the program content in the RAM.

In the above method, the program content in the RAM is checked using a cyclic redundancy check or a parity check.
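As a hedged illustration of a cyclic redundancy check computed batch by batch, the sketch below uses the CRC-32 provided by Python's standard `zlib` module. The patent does not name a particular CRC polynomial, so CRC-32, the little-endian 32-bit packing and the function name `crc_in_batches` are all assumptions.

```python
import struct
import zlib


def crc_in_batches(units, batch_size):
    """CRC-32 over 32-bit units, fed to zlib one batch at a time."""
    crc = 0
    for start in range(0, len(units), batch_size):
        batch = units[start:start + batch_size]
        data = struct.pack("<%dI" % len(batch), *batch)
        crc = zlib.crc32(data, crc)    # previous batch result seeds the next batch
    return crc


units = [0x12345678, 0x9ABCDEF0, 0x0BADF00D, 0xCAFEBABE, 0x00000001]
whole = zlib.crc32(struct.pack("<5I", *units))
flipped = list(units)
flipped[3] ^= 1 << 9                   # simulate a single-bit soft failure
```

Because the running value is carried from one batch to the next, the result after the last batch equals the CRC of the whole program image, whatever the batch size.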

According to the above method, steps A and B are repeated at a set interval.

The system for detecting and handling RAM failure provided by the invention comprises a detection and judgment module and a failure handling module;

the detection and judgment module reads the program content in the RAM;

it compares the program content read with the correct program content and, when the two are inconsistent, judges that the RAM has failed and sends a repair instruction to the failure handling module; or

it checks the program content read using a set check method, compares the result with the correct check result and, when the two are inconsistent, judges that the RAM has failed and sends an alarm instruction to the failure handling module;

on receiving a repair instruction, the failure handling module repairs the program content in the RAM; on receiving an alarm instruction, it raises an alarm.

According to the above system, if the RAM can be accessed by a high-level control entity of the CPU/DSP, the detection and judgment module and the failure handling module are located in the high-level control entity of the CPU/DSP;

the detection and judgment module compares the program content read with the correct program content stored locally in the high-level control entity of the CPU/DSP and, when the two are inconsistent, judges that the RAM has failed and sends a repair instruction to the failure handling module;

on receiving the repair instruction, the failure handling module repairs the program content in the RAM.

According to the above system, if the RAM cannot be accessed by a high-level control entity of the CPU/DSP, the detection and judgment module and the failure handling module are located in the CPU/DSP;

the detection and judgment module checks the program content read using a set check method, compares the result with the correct check result and, when the two are inconsistent, judges that the RAM has failed and sends an alarm instruction to the failure handling module; or

it compares the program content read with the correct program content backed up in the RAM and, when the two are inconsistent, judges that the RAM has failed and sends a repair instruction to the failure handling module;

on receiving an alarm instruction, the failure handling module raises an alarm; on receiving a repair instruction, it repairs the program content in the RAM.

The beneficial effects of the invention are as follows:

(1) The method performs periodic, real-time detection on the program content in the CPU/DSP RAM, so that failures of the program-space RAM in the CPU/DSP are detected promptly and a great deal of work locating and analysing RAM failures is saved.

(2) For RAM that can be accessed by a high-level control entity, the invention uses data comparison: the program content running in the RAM is compared with the correct program content held in the high-level control entity, and when a failure is found the data can be repaired automatically, greatly reducing the impact of the RAM failure. For RAM that cannot be accessed by a high-level control entity, the invention uses a data check: the program content in the RAM is checked, and when a failure is found an alarm is raised promptly so that appropriate measures can be taken in time and the impact of the failure is minimized.

(3) The invention is simple and practical. It improves system reliability without additional investment, detects failures promptly, does not consume excessive system processing resources, and does not affect normal service.

Description of drawings

Fig. 1 is a flowchart of detecting and handling RAM failure of a Class A CPU/DSP;

Fig. 2 is a flowchart of detecting and handling RAM failure of a Class B CPU/DSP;

Fig. 3 is a structural diagram of the system for detecting and handling RAM failure of a Class A CPU/DSP;

Fig. 4 is a structural diagram of the system for detecting and handling RAM failure of a Class B CPU/DSP.

Embodiment

For the program content held in the CPU/DSP RAM, the invention periodically and continuously compares it with the correct data or performs a data check on it, so that failures of the RAM are detected promptly. If a failure is found, it is handled by measures such as re-downloading the program to repair it, reporting exception information, or raising an alarm indication, so as to reduce the impact of the device failure on the system as far as possible.

In actual products, CPUs/DSPs can be divided into two application types according to whether their storage space can be accessed from a higher level. In the first, the storage space of the CPU/DSP can be accessed by a high-level control entity; for convenience this is called Class A. In the second, the storage space of the CPU/DSP cannot be accessed by a high-level control entity; this is called Class B. The two classes of storage space require different detection and handling methods.

For Class A CPU/DSP storage space, the on-chip and off-chip RAM of the CPU/DSP can be accessed directly by a high-level control entity (generally itself a higher-level CPU or DSP). In this case the program of the CPU/DSP is generally also downloaded and loaded by the high-level control entity, whose own memory holds the correct program code of the CPU/DSP. The failure detection and handling method for the RAM of a Class A CPU/DSP is shown in Fig. 1.

Referring to Fig. 1, the flow for detecting and handling RAM failure of a Class A CPU/DSP comprises the following steps:

Step S1, initialization: set up a periodically triggered task and set the number of data units to be checked in each detection cycle. To find failures promptly without consuming too much of the high-level control entity's processing capacity, the cycle time and the amount checked per cycle must balance detection timeliness against that capacity; for example, a 1-second cycle in which 100 32-bit data units are checked. A maximum number of repeated repairs is also set;

Step S2: within a detection cycle, the high-level control entity reads the program code in the CPU/DSP RAM one data unit at a time, without interruption, and compares each unit with the corresponding data unit of the correct program code held in its own memory, comparing each unit bit by bit;

Step S3: during the comparison, judge whether the program code in the RAM is consistent with the correct program code. If an inconsistency is found, execute step S4 to start the follow-up handling;

Otherwise, continue the comparison. When the cycle ends or the amount of data set for the cycle has been checked, record the highest address of the data units checked in this cycle; the next detection cycle starts checking at the first address after it. Program code is generally stored sequentially in the CPU/DSP RAM, so the start address for the next cycle can be obtained from the highest address checked in the current cycle.

Step S4: the high-level control entity reloads the correct program code it holds into the data unit of the RAM in which the data error occurred, to repair it;

Step S5: after the repair, the high-level control entity judges whether the repair succeeded, for example by reading the data back. If the repair succeeded, execute step S6; if not, execute step S7;

Step S6: judge that the data error was caused by a soft failure, and record the relevant soft-failure information in the background, such as the location of the affected data unit, for later analysis; then return to step S2 and wait for the next detection cycle;

Step S7: judge whether the preset maximum number of repeated repairs has been reached. If not, execute step S4 to repair the faulty data unit again; if so, execute step S8;

Step S8: if the repair has still not succeeded when the maximum number of repeated repairs is reached, the high-level control entity judges that the data error was caused by a device hard failure, reports a device fault alarm, and triggers fault handling at a higher level of the system.
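The cycle-by-cycle scanning of steps S1 to S3 can be sketched as follows. This is an illustrative model only: the RAM and the reference copy are plain Python lists, `run_detection_cycle` and the `on_mismatch` hook are hypothetical names, and the sketch keeps scanning the rest of the batch after a mismatch rather than waiting for the next cycle, which is a simplification of the flow in Fig. 1.

```python
def run_detection_cycle(ram, reference, start, units_per_cycle, on_mismatch):
    """Compare one cycle's worth of units; return the start address for the next cycle."""
    end = min(start + units_per_cycle, len(reference))
    for address in range(start, end):
        if ram[address] != reference[address]:
            on_mismatch(address)       # hand over to steps S4 onwards (repair, verify, log)
    # Remember where this cycle stopped; wrap to the beginning after a full pass.
    return end if end < len(reference) else 0


reference = list(range(250))           # stand-in for the correct program code
ram = list(reference)
ram[120] ^= 0x10                       # simulated soft failure in unit 120

found = []

def repair(address):
    found.append(address)
    ram[address] = reference[address]  # reload the correct unit

start, starts = 0, []
for _ in range(3):                     # three cycles of 100 units cover 250 units
    start = run_detection_cycle(ram, reference, start, 100, repair)
    starts.append(start)
```

With 100 units per cycle, the corrupted unit is reached in the second cycle, and the scan wraps back to address 0 after the third.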

For Class B CPU/DSP storage space, the on-chip and off-chip RAM of the CPU/DSP cannot be accessed directly from outside. Messages are generally exchanged with the outside either in mailbox mode (a shared memory area is opened in each of the sending and receiving CPUs/DSPs, and information is passed through some communication mechanism) or through peripheral communication. The application program in Class B CPU/DSP storage space is either loaded through the mailbox at reset and start-up, or is held permanently in the RAM, with measures such as uninterrupted power ensuring that it is not lost. A CPU/DSP of this type generally has two ways of feeding abnormal information back to the outside: reporting exception/alarm information to the higher level of the system over a communication channel, or driving alarm equipment through a peripheral to produce an audible, visual or electrical alarm. The failure detection and handling method for the RAM of a Class B CPU/DSP is shown in Fig. 2.

Referring to Fig. 2, the flow for detecting and handling RAM failure of a Class B CPU/DSP comprises the following steps:

Step S1, initialization: set up a periodically triggered task and set the number of data units to be checked in each cycle. To find failures promptly without consuming too much of the CPU/DSP's processing capacity, the cycle time and the amount checked per cycle must balance detection timeliness against that capacity; for example, a 1-second cycle in which 100 32-bit data units are checked. To protect the critical tasks of the CPU/DSP, the periodically triggered task can be given a low priority.

An ideal check result value is also set: the correct program content is checked in advance, and the resulting check value is used as the ideal check result value. For example, when even parity is used, even parity is first computed over all the data of the correct program code; that is, the corresponding bit positions of the data in all data units are XORed together to give a result value, which is the ideal check result value.

Step S2: within a detection cycle, the CPU/DSP reads the program code in the CPU/DSP RAM one data unit at a time, without interruption, and performs the data check. The check method can be a cyclic redundancy check (Cyclic Redundancy Check, CRC) or a parity check. A CRC has a low miss rate but takes longer; even parity takes less time but has a higher miss rate than a CRC;

For example, when even parity is used, in the first detection cycle the data of 100 32-bit data units are read and their corresponding bit positions are XORed to give a check result value; in each later detection cycle, the check result value of the previous cycle is XORed bitwise with the 100 32-bit data units checked in the current cycle, to give the check result value of the current cycle;

At the end of each detection cycle, the highest address of the data units already checked is recorded, and the CPU/DSP uses the address following the recorded one as the start address for the next detection cycle;

Step S3: at the end of each detection cycle, judge from the highest address checked whether the program code in the RAM has been checked through once. If not, return to step S2 and continue the data check in the next detection cycle; otherwise, execute step S4;

Step S4: once the program content in the RAM has been checked through, the CPU/DSP compares the final check result value with the preset ideal check result value. If they are consistent, the program content in the RAM is correct; return to step S2 and continue the periodic data check of the program content in the RAM. If they are inconsistent, a data error has occurred in the program content; execute step S5;

Step S5: when the final check result value and the preset ideal check result value are inconsistent, the CPU/DSP judges that this is caused by a RAM failure. According to system requirements, the CPU/DSP then reports exception/alarm information to the higher level of the system over the communication channel, or drives external alarm equipment through a peripheral, or raises the alarm through both routes at once.
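The even-parity flow of Fig. 2 can be sketched as a small state machine that folds each cycle's units into a running XOR value and compares it with the ideal value after a full pass. The class name `ClassBChecker`, the list-based RAM model and the sample program values are assumptions for illustration.

```python
from functools import reduce
from operator import xor


class ClassBChecker:
    """Running even-parity (XOR) check spread over several detection cycles."""

    def __init__(self, ideal, total_units, units_per_cycle=100):
        self.ideal = ideal                     # check value of the correct program (S1)
        self.total_units = total_units
        self.units_per_cycle = units_per_cycle
        self.cursor = 0                        # next address to check
        self.running = 0                       # check value accumulated so far

    def run_cycle(self, ram):
        """Return None while a pass is in progress, else True (ok) or False (alarm)."""
        end = min(self.cursor + self.units_per_cycle, self.total_units)
        self.running = reduce(xor, ram[self.cursor:end], self.running)   # S2
        self.cursor = end
        if self.cursor < self.total_units:     # S3: pass not finished yet
            return None
        ok = self.running == self.ideal        # S4: compare with the ideal value
        self.cursor, self.running = 0, 0       # start a fresh pass
        return ok                              # False corresponds to the alarm in S5


program = [(i * 2654435761) & 0xFFFFFFFF for i in range(250)]
ram = list(program)
checker = ClassBChecker(reduce(xor, program, 0), len(program))
clean_pass = [checker.run_cycle(ram) for _ in range(3)]
ram[7] ^= 1 << 3                               # simulated single-bit soft failure
faulty_pass = [checker.run_cycle(ram) for _ in range(3)]
```

A single flipped bit changes the final XOR value, so the second pass ends with a mismatch; the check reports only that a failure exists, not where it is, which is why this flow ends in an alarm rather than a repair.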

For Class B CPU/DSP storage space, if the remaining on-chip and off-chip space of the CPU/DSP is large enough, a copy of the program can be stored in that remaining space when the program is loaded at start-up. During periodic operation, failure detection is then carried out by data comparison rather than by a check. The comparison follows the flow of Fig. 1, except that the comparison and repair are performed by the CPU/DSP itself rather than by a high-level control entity. To protect the critical tasks of the CPU/DSP, the data comparison task can be given a low priority.
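A minimal sketch of this backup variant is given below, assuming a flat list as the RAM model and placing the backup at the top of the address range; the layout and the names `load_with_backup` and `compare_with_backup` are hypothetical.

```python
def load_with_backup(program, ram_size):
    """Load the program at address 0 and a backup copy in the remaining space."""
    length = len(program)
    if 2 * length > ram_size:
        raise ValueError("not enough spare RAM for a backup copy")
    ram = [0] * ram_size
    backup_base = ram_size - length
    ram[:length] = program
    ram[backup_base:] = program
    return ram, backup_base


def compare_with_backup(ram, length, backup_base):
    """Repair running units from the backup; return the repaired addresses."""
    repaired = []
    for offset in range(length):
        if ram[offset] != ram[backup_base + offset]:
            ram[offset] = ram[backup_base + offset]
            repaired.append(offset)
    return repaired


program = [0xA, 0xB, 0xC]
ram, backup_base = load_with_backup(program, 8)
ram[1] ^= 1                                    # simulated soft failure in the running copy
repaired = compare_with_backup(ram, len(program), backup_base)
```

The sketch treats the backup as the trusted copy, as the text does; it does not address a failure that strikes the backup region itself.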

For Class A CPU/DSP storage space, failure detection by data check is also possible. However, both CRC and parity checks miss a small number of error patterns, so they are not as complete as data comparison, and they cannot repair the erroneous data automatically. Data comparison is therefore the preferred way to detect and handle failures of Class A CPU/DSP storage space.
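One such missed pattern is easy to reproduce for the XOR-based even parity check: two flips in the same bit position of different data units cancel out. The sketch below uses hypothetical sample values to show the check value staying unchanged while a direct comparison still finds both corrupted units.

```python
from functools import reduce
from operator import xor


def xor_check(units):
    """Even parity per bit position, computed as the XOR of all units."""
    return reduce(xor, units, 0)


correct = [0x0000000F, 0x000000F0, 0x00000F00]
corrupted = list(correct)
corrupted[0] ^= 1 << 4          # flip bit 4 in unit 0
corrupted[2] ^= 1 << 4          # flip bit 4 in unit 2: cancels in the XOR

parity_sees_error = xor_check(corrupted) != xor_check(correct)
mismatched_units = [i for i, (a, b) in enumerate(zip(corrupted, correct)) if a != b]
```

Here the parity check reports no error, whereas the comparison identifies units 0 and 2, which is also the information needed to repair them.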

For the failure detection and handling flow of Class A CPU/DSP storage space, the invention provides a corresponding detection and handling system.

Referring to Fig. 3, the system for detecting and handling RAM failure of a Class A CPU/DSP comprises a detection and judgment module and a failure handling module, both located in the high-level control entity of the CPU/DSP.

The detection and judgment module reads the program content in the RAM and compares it with the correct program content stored locally in the high-level control entity. During the comparison, it compares the data units storing program content in the RAM one by one with the corresponding data units of the correct program content, and compares the bits within each data unit one by one. When the data are found to be inconsistent, it judges that the RAM has failed and sends a repair instruction to the failure handling module; the instruction carries the location of the faulty data unit in the RAM and the address of the corresponding data unit of the correct program content;

On receiving the repair instruction, the failure handling module uses the data unit addresses it carries to load the specified correct program content from the high-level control entity into the specified data unit of the RAM, thereby repairing the data, and records or outputs the location of the faulty data unit in the RAM for later analysis.
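The division of labour between the two modules can be sketched as two small classes exchanging a repair instruction that carries both addresses. The class and field names (`RepairInstruction`, `DetectionJudgmentModule`, `FailureHandlingModule`) and the list-based memory model are assumptions chosen for illustration; the patent describes the modules only functionally.

```python
from dataclasses import dataclass


@dataclass
class RepairInstruction:
    ram_address: int         # faulty data unit in the monitored RAM
    reference_address: int   # matching unit in the control entity's local copy


class FailureHandlingModule:
    def __init__(self, ram, reference):
        self.ram, self.reference = ram, reference
        self.log = []        # locations of faulty units, kept for later analysis

    def handle(self, instruction):
        self.ram[instruction.ram_address] = self.reference[instruction.reference_address]
        self.log.append(instruction.ram_address)


class DetectionJudgmentModule:
    def __init__(self, ram, reference, handler):
        self.ram, self.reference, self.handler = ram, reference, handler

    def scan(self):
        for address, expected in enumerate(self.reference):
            if self.ram[address] != expected:
                # Same index in both memories here; real layouts may differ.
                self.handler.handle(RepairInstruction(address, address))


reference = [0x11, 0x22, 0x33, 0x44]
ram = list(reference)
ram[2] ^= 0x01               # simulated soft failure
handler = FailureHandlingModule(ram, reference)
DetectionJudgmentModule(ram, reference, handler).scan()
```

Keeping detection and handling separate means the same detection code could drive either a repair or an alarm, depending on which instruction it sends.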

For the failure detection and handling flow of Class B CPU/DSP storage space, the invention provides a corresponding detection and handling system.

Referring to Fig. 4, the system for detecting and handling RAM failure of a Class B CPU/DSP comprises a detection and judgment module and a failure handling module, both located in the CPU/DSP.

The detection and judgment module reads the program content in the RAM, checks it using a CRC or a parity check, and compares the result with the check result obtained by applying the same check method to the correct program content. When the two are inconsistent, it judges that the RAM has failed and sends an alarm instruction to the failure handling module;

On receiving the alarm instruction, the failure handling module raises an alarm.

If the RAM of a Class B CPU/DSP has enough storage space, the correct program content can be backed up in the RAM when the RAM is loaded at start-up. The detection and judgment module then reads the program content in the RAM and compares it with the correct program content backed up at start-up. When the two are inconsistent, it judges that the RAM has failed and sends a repair instruction to the failure handling module; on receiving the repair instruction, the failure handling module repairs the program content in the RAM.

As the above flows show, the method performs periodic, real-time detection on the program content in the CPU/DSP RAM, so that failures of the program-space RAM in the CPU/DSP are detected promptly and a great deal of work locating and analysing RAM failures is saved. Once a RAM failure is detected, the corresponding handling is started automatically, minimizing the impact of the failure. For RAM that can be accessed by a high-level control entity, the invention uses data comparison: the program content in the RAM is compared with the correct program content held in the high-level control entity, and when a RAM failure is found the data can be repaired automatically, greatly reducing the impact of the failure. For RAM that cannot be accessed by a high-level control entity, the invention uses a data check: the program content in the RAM is checked, and when a RAM failure is found an alarm is raised promptly so that appropriate measures can be taken in time. The invention is simple and practical. It improves system reliability without additional investment, detects failures promptly, does not consume excessive system processing resources, and does not affect normal service.

Obviously, those skilled in the art can make various changes and variations to the invention without departing from its spirit and scope. Thus, if these changes and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims

1. A method for detecting and handling failure of a random access memory (RAM), applied to detecting and handling RAM failure of a central processing unit (CPU)/digital signal processor (DSP), the method comprising:

A. reading the program content in the RAM;

2, the method for claim 1, it is characterized in that, if described RAM can be by the high-rise controlled entity visit of CPU/DSP, then read the contents of program among the described RAM and the correct contents of program of the local storage of itself and this high level controlled entity compared by this high level controlled entity, when both were inconsistent, this high level controlled entity was loaded into described correct contents of program and carries out data repair among the described RAM.

3, method as claimed in claim 2 is characterized in that, if described RAM can not be by the high-rise controlled entity visit of CPU/DSP, then the contents of program that is read among the described RAM by described CPU/DSP carries out verification; Perhaps

In described RAM, back up correct contents of program in advance;

4, method as claimed in claim 3 is characterized in that, the contents of program among the described RAM is carried out verification or comparison, adopts the task of low priority to trigger.

5, the method for claim 1, it is characterized in that, when carrying out the contents of program comparison, the data cell and the data cell of storage correct procedure content of stored programme content among the RAM are compared one by one, and one by one the bit in the described data cell is compared.

6, the method for claim 1 is characterized in that, judges after carrying out data repair whether reparation is successful, if repair successfully, judges that then soft failure takes place described RAM, and record soft failure relevant information; Otherwise, repeat repair process;

7, the method for claim 1 is characterized in that, the contents of program among the described RAM is carried out described comparison or verification by setting cycle in batches.

8, method as claimed in claim 7 is characterized in that, the contents of program among the described RAM is carried out verification in batches, comprising:

9, the method for claim 1 is characterized in that, adopts cyclic redundancy or parity check method that the contents of program among the described RAM is carried out verification.

10, the method for claim 1 is characterized in that, according to the interval time of setting, repeating said steps A and B.

11. A system for detecting and handling RAM failure, applied to detecting and handling RAM failure of a CPU/DSP, characterized by comprising a detection and judgment module and a failure handling module;

12. The system of claim 11, characterized in that, if the RAM can be accessed by a high-level control entity of the CPU/DSP, the detection and judgment module and the failure handling module are located in the high-level control entity of the CPU/DSP;

13. The system of claim 11, characterized in that, if the RAM cannot be accessed by a high-level control entity of the CPU/DSP, the detection and judgment module and the failure handling module are located in the CPU/DSP;