JPH07306811A - Memory fault diagnosing method

Info

Publication number: JPH07306811A
Application number: JP6100958A
Authority: JP
Inventors: Katsunobu Miyake; 勝伸三宅
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1994-05-16
Filing date: 1994-05-16
Publication date: 1995-11-21

Abstract

PURPOSE: To provide a fault diagnosis function that is effective against memory faults with a low recurrence frequency, by making effective use of the surplus processing capacity of a controlled device. CONSTITUTION: The microprocessor of the controlled device 101 performs a first-type fault diagnosis, in which it sequentially reads the contents of a first-type designated area of a memory device and an error detecting means detects errors. If the error detecting means detects an error during the first-type fault diagnosis, a second-type fault diagnosis, which repeatedly checks for errors, is performed on a second-type area, which contains the address where the error was detected and is narrower than the first-type designated area of the memory device. When a control signal is received from a host processor during the first-type or second-type fault diagnosis, the diagnosis is suspended.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a method for efficiently diagnosing faults in a large-capacity memory by making effective use of the idle time of a sub-processor (hereinafter, μ-processor) that is mounted in a controlled device controlled by a host processor and that can operate under the control of that host processor.

[0002]

[Prior Art] It has become common for controlled devices that operate on control orders from a processing device to carry a μ-processor, reflecting recent advances in component technology such as LSI. A controlled device of this kind is often required to execute the associated processing extremely quickly when it receives a control order or an emergency interrupt signal from the host processor. In conventional systems, from the time the processing of one control order finishes until the next control order arrives, the μ-processor, despite its high processing capacity, either idles wastefully while waiting for the next control order or waits for an interrupt related to it. In a system of this kind, the lower the frequency of control orders, the lower the utilization of the μ-processor, so its processing capacity is not used effectively.

[0003] On the other hand, as the functions assigned to the controlled device grow, the amount of memory the μ-processor needs also grows, and the time required for memory fault diagnosis becomes longer. In particular, with the spread of large-scale integrated circuit technology in recent years, complex failure modes have become more common in which a fault manifests only rarely, after a given memory address has been accessed many times. This growing complexity of failure modes has been another factor lengthening fault diagnosis time.

[0004] The problems of the prior art are described below with reference to the drawings. FIG. 1 is a conceptual diagram of a general system consisting of a host processor and a plurality of controlled devices operating under its control; this arrangement is used in many computer-controlled systems, including switching systems. In FIG. 1, 100 is the host processor and 101-n is the n-th controlled device, which is controlled by the host processor 100 and can operate under its control. A signal path 102 that carries control signals and their response signals is provided between 100 and 101. The controlled device 101 receives a control signal from 100 via the signal path 102, performs the necessary processing, and returns the result to 100 via the signal path 102.

[0005] FIG. 2 illustrates an example of a general functional block configuration of the controlled device 101. In FIG. 2, reference numerals 100, 101, and 102 denote the same elements as in FIG. 1. The connections among the functional blocks in FIG. 2 are described first.

[0006] In FIG. 2, 200 is the μ-processor built into the controlled device 101. A control signal from 100 reaches the controlled device 101 via the signal path 102. Inside the controlled device 101, the signal receiving circuit 204 receives the signal and notifies 200 by an interrupt or a bus message. 201, 202, and 203 are the RAM (random access memory; main memory), ROM (read-only memory), and HD (file memory; hard disk) connected to the μ-processor 200. 205 is the controlled target device that is ultimately controlled on the basis of the processing by 200 to 203.

[0007] Next, an outline of the operation is described with reference to the same figure. The host processor 100 performs the necessary processing under the control of its stored program and, based on the result, selects one of the controlled devices 101 under it and sends the required control order. In the controlled device 101, the signal receiving circuit 204 accepts the order and notifies the built-in μ-processor 200, either by interrupting it or through periodic polling. The μ-processor 200 interprets the order and quickly executes, on the controlled target device 205, the processing required by the received signal. When the μ-processor 200 has finished all of this processing, it resumes the previously interrupted processing from the point of interruption.

[0008] The system configuration and processing outline are as described above. To improve overall system performance, however, the functions handled by the controlled device 101 have steadily increased in recent years, and the capacity of the memory built into the controlled device (201, 202, 203) has tended to grow dramatically along with them. One of the functions that controlled devices have come to handle is fault diagnosis of the controlled device itself, generally called self-diagnosis or autonomous diagnosis.

[0009] In earlier systems, before LSI technology had advanced very far, fault diagnosis of subordinate controlled devices was executed during the idle time of the host processor (100), or at a low priority level, in order to reduce cost. This caused several problems: circuit-level effects in a subordinate controlled device reached as far as the application programs of the host processor; the load on the host processor could no longer be ignored; and fault diagnosis time grew because of the complexity of failure modes and the low priority of diagnostic processing. Now that μ-processors can be used at negligible cost, recent controlled devices have increasingly adopted a self-diagnosis scheme, in which the device diagnoses its own circuit faults, to solve these problems. Even so, memory capacity has kept growing faster than the benefit of these measures, and as a result it has become difficult to shorten diagnosis time with the self-diagnosis scheme.

[0010] On the other hand, as the share of functions assigned to the controlled device rises, so does the performance required of the μ-processor: once a control order is accepted, the processing based on it is executed extremely quickly. However, as a single processing device comes to access multiple controlled devices more often, the frequency with which any one controlled device receives control orders is not necessarily high, and the proportion of time spent waiting for control orders rises. As a result, the μ-processor cannot always be used effectively.

[0011] FIG. 3 explains the conventional scheme in more detail. For a memory device, it is important to detect errors in stored data even while the storage device is in operation. Error-detecting codes have long been widely used for this purpose. One of the simplest error-detecting codes is the vertical parity code; codes that can correct as well as detect errors include vertical-horizontal error-detecting codes and ECC codes. These detection and correction schemes are widely known and are not needed to explain the constituent features of the present invention, so their description is omitted here.

[0012] In systems that require continuous, stable operation, such as switching systems, important devices such as memory devices are protected by measures such as duplication. When an abnormality is detected in such a device, the system switches to the standby unit while continuing the communication switching service, and after service has started on the new unit, the faulty unit that was switched out is diagnosed.

[0013] In fault diagnosis performed at such a time, so that the result does not differ from one run to the next, the usual approach is to set the device under diagnosis to its initial state, write a suitable pattern that makes faults easy to detect into the fault diagnosis area of the memory device, read the written data back, and compare it with the correct data (the written data). This fault diagnosis scheme is described in detail below with reference to the drawings.

[0014] In FIG. 3, 201 is the memory device. It takes in from outside, via the data bus 301 and the address bus 302, the write data and address information needed for memory access. Based on this information the memory device 201 accesses the memory element 307 and, on a read, outputs the read data to the answer bus 303.

[0015] The memory device 201 comprises a memory address register 304 that stores memory address information, a data write register 305, a data read register 306, the memory element 307, and so on. The μ-processor 200 is provided with circuits (308-1, 308-2) that generate parity information so that, when memory is accessed over the memory data bus 301 and the memory address bus 302, it can be checked whether the memory side received the information correctly. The memory device, which receives this data, is provided with first-type parity check circuits (309-1, 309-2) to detect errors in it. In addition, a second-type parity check circuit (310) is provided on the answer bus on the μ-processor side to detect errors when memory data with parity information is read.

[0016] In a memory device in operation, a check is performed on every write and every read using the error-check code generation circuits and check circuits such as 308-1, 308-2, 309-1, and 309-2. When an error is detected, the entire controlled device 101 is disconnected from the system, and fault diagnosis of the disconnected memory device is carried out. In this case, the memory 201 under diagnosis is initialized and, following a fixed procedure (generally updating the address sequentially from the lowest memory address), specified data (for example, all zeros) is written to all addresses, then read back and compared with the correct value. Next, the inverse of the previous data (for example, all ones) is written and the process is repeated.
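The conventional procedure in [0016] can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the memory is modeled as any Python object supporting indexing, and `pattern_test` and `StuckBitMemory` are hypothetical names introduced here.

```python
def pattern_test(memory, word_bits=8):
    """Write all-0s, then all-1s, to every address; read back and compare.

    `memory` is any object with __len__, __getitem__ and __setitem__.
    Returns a list of (address, expected, actual) mismatches.
    """
    mask = (1 << word_bits) - 1
    failures = []
    for pattern in (0, mask):             # all zeros, then the inverted data
        for addr in range(len(memory)):   # lowest address first
            memory[addr] = pattern
        for addr in range(len(memory)):
            actual = memory[addr]
            if actual != pattern:
                failures.append((addr, pattern, actual))
    return failures


class StuckBitMemory:
    """Hypothetical test double: one bit at one address always reads as 1."""

    def __init__(self, size, bad_addr, bad_bit):
        self.cells = [0] * size
        self.bad_addr = bad_addr
        self.bad_mask = 1 << bad_bit

    def __len__(self):
        return len(self.cells)

    def __setitem__(self, addr, value):
        self.cells[addr] = value

    def __getitem__(self, addr):
        value = self.cells[addr]
        if addr == self.bad_addr:
            value |= self.bad_mask        # stuck-at-1 fault
        return value
```

A stuck-at-1 bit is caught by the all-zeros pass and a stuck-at-0 bit by the all-ones pass, which is why both patterns are written. Note that this test overwrites the memory contents, so it can only run on a unit that has been taken out of service.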

[0017] FIG. 4 shows an outline flow of one example of a conventional test procedure. Running such a test reliably detects single-bit stuck-at faults in the memory (faults in which a given memory bit always reads as 0 or always reads as 1).

[0018] However, as a general tendency, a memory fault cannot always be detected with a single read. In many cases there are periods in which a fault occurs only occasionally on reads, and even if the data at the address is read and tested during such a period, a single read does not necessarily reveal the abnormality (the memory data error). Such a fault is called a temporary (intermittent) fault and has been one of the factors that make a system unstable.

[0019] Furthermore, as controlled devices with memory exceeding 100,000 bytes became common, merely writing the diagnostic data to all addresses once, reading it back, and comparing it with the correct data came to take a long time, from several minutes to several tens of minutes. Moreover, as the proportion of temporary faults rises, the diagnosis must be repeated, so the required time becomes even longer.

[0020]

[Problem to Be Solved by the Invention] The object of the present invention is to make effective use of the surplus processing capacity of the controlled device so as to diagnose faults in a large-capacity memory efficiently, and to provide a fault diagnosis function that is effective against memory faults that recur infrequently.

[0021]

[Means for Solving the Problem] To achieve this object, in the present invention the μ-processor of the controlled device performs a first-type fault diagnosis, in which it sequentially reads the contents of a first-type designated area of the memory device and detects errors by an error detection means. If the error detection means detects an error during the first-type fault diagnosis, the μ-processor performs a second-type fault diagnosis, in which it repeatedly checks for errors in a second-type area, that is, an area that contains the address where the error was detected and is narrower than the first-type designated area of the memory device.

[0022] Furthermore, in the present invention, when a control signal from the host processor is received while the first-type or second-type fault diagnosis is running, that diagnosis is suspended.

[0023]

[Operation] With the recent improvement in μ-processor performance, control orders from the host processor now arrive at intervals that are, in effect, long relative to the performance of the μ-processor mounted in the controlled device. Even in large-scale systems such as switching systems, there are quite a few devices in which only about half of the controlled device's available time is used for processing related to control orders. Nevertheless, once a control order is accepted it must be processed quickly and a response returned in a short time, so the performance requirement on the μ-processor does not relax. Attempts to use this idle time effectively to reduce the load on the processing device have been widely considered. The first feature of the present invention is that the surplus capacity of the μ-processor mounted in the controlled device is applied to fault diagnosis of the controlled device's memory.

[0024] On the other hand, as more functions have been assigned to the controlled device, the scale of the memory mounted in it has grown. The failure modes of such large memories are complex and often cannot be detected with a single write or read, so fault diagnosis of such memories takes a long time. The second feature of the present invention is that, to solve this problem, fault diagnosis is divided into two tiers. In the normal operating state, error detection on memory data is carried out intermittently, using the idle time of the μ-processor mounted in the controlled device, on memory under diagnosis whose faults may recur infrequently. Once a fault is detected in this tier, fault diagnosis is executed repeatedly and intensively, restricted to the failing address or to a memory area containing it.

[0025]

[Embodiment] Recent memories are increasingly equipped not only with an error detection mechanism but also with an automatic error correction mechanism, and these mechanisms are increasingly built into the memory device itself (the memory peripheral circuitry). ECC and horizontal/vertical parity schemes are widely used for this purpose. They have in common that, when memory data is stored, extra information for detecting or correcting errors in that data is stored alongside it, and when the data in that area is read, the extra information is used to detect or correct errors. The present invention does not depend on which error detection or correction scheme is used, so the embodiment is described using the simplest case, error detection with a parity code.

[0026] The parts needed to explain the embodiment in detail are described further using FIG. 3 as an example. In FIG. 3, 201 is the memory device. It takes in from outside, via the data bus 301 and the address bus 302, the write data and address information needed for memory access. Based on this information the memory 201 accesses the memory element 307 and, on a read, outputs the read data to the answer bus 303. The memory device 201 includes a memory address register 304 that stores memory address information, a data write register 305, a data read register 306, and an error check circuit that can check whether an error has occurred in the data when the memory is read and can generate the check data needed for that check when data is written.

[0027] The memory element 307 is addressed in units called words. When address information m is stored in the address register 304, the contents of address m are read out into the read register 306.

[0028] The contents of address m consist of a data part, which holds the data to be stored that arrives over the data bus, and an error-detecting code (parity code) that can detect an error in the contents of the data part. When this data is read, the parity check circuit 309 can check it for errors using that error-detecting code. This scheme appends one bit so that the number of 0s or 1s in the memory bits being checked is always even or always odd. Because it is in wide general use, a detailed description is omitted in this embodiment.
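As a small illustration of the parity scheme summarized in [0028], the sketch below models a stored word as a pair of data part and parity bit. The function names are hypothetical and the check is done in software only for clarity; in the patent it is performed by the parity check circuit 309.

```python
def parity_bit(data, even=True):
    """Return the bit that makes the total count of 1s even (or odd)."""
    ones = bin(data).count("1")
    bit = ones & 1             # 1 if the data part has an odd number of 1s
    return bit if even else bit ^ 1


def store_word(data, even=True):
    """Model a stored word as (data part, parity bit)."""
    return (data, parity_bit(data, even))


def check_word(word, even=True):
    """True if the stored parity bit still matches the data part."""
    data, stored = word
    return parity_bit(data, even) == stored
```

A single flipped bit changes the count of 1s from even to odd (or the reverse) and is therefore detected on the next read. Two flipped bits in the same word cancel out and pass the check, which is a known limit of single-bit parity.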

[0029] With the memory device configured in this way, when a single-bit fault occurs in the memory, the error can be detected as a parity code abnormality simply by reading the memory data.

[0030] However, as a general tendency, a memory fault cannot always be detected with a single read. In many cases there are periods in which a fault occurs only occasionally on reads, and even if the data at the address is read and tested during such a period, the abnormality (the memory data error) may not be confirmed. Such a fault is called a temporary (intermittent) fault, and it makes the system unstable.

[0031] The first-type fault diagnosis is characterized in that error detection is performed by sequentially reading the entire area of a large-capacity memory device, or a large area within it (the first-type designated area). In the second-type fault diagnosis, when the memory device is incorporated in the online system, the first-type fault diagnosis is executed at a lowered priority so as to minimize the effect on the host processor. The first-type fault diagnosis therefore runs when the load on the μ-processor is light, and by repeating this operation many times, a degraded bit in the memory device will eventually be detected as a temporary or permanent parity error.

[0032] However, because errors can be detected by reads alone, one round of fault diagnosis over the entire designated region of a large-capacity memory can be carried out in a relatively short time, in parallel with online processing based on orders from the host processor. When the memory device is not incorporated in the online system, the same processing as in the online case is not accompanied by orders from the host processor, so the fault detection rate of the first-type fault diagnosis is effectively higher.

[0033] The second-type fault diagnosis, performed in connection with the result of the first-type fault diagnosis, is the distinctive feature of the present invention. That is, when an error is detected in the first-type fault diagnosis, additional information (erroneous data, etc.) including the address where the error occurred is collected, and fault detection is then carried out intensively on a narrow area containing that address (the second-type fault diagnosis).

[0034] The number of fault diagnoses per memory bit is about once every several minutes to several tens of minutes in the first-type fault diagnosis, but in the second-type fault diagnosis the frequency can be raised by a factor of several thousand to several million. Performing high-frequency fault diagnosis on a narrow area where an error has been detected makes it possible to bring out efficiently faults that can be detected only rarely.

[0035] The operation of the embodiment described above is shown in more detail in the flowchart of FIG. 5. In this flowchart, i is the number of faults detected in the first-type fault diagnosis. In this embodiment, i is initialized to 0 when the first-type fault diagnosis starts. Thereafter, each time the first-type fault diagnosis detects a fault, i is incremented by 1, so that it records the number of faults detected in the first-type fault diagnosis.

[0036] The second-type fault diagnosis, on the other hand, is performed for each fault detected by the first-type fault diagnosis. Each time one second-type fault diagnosis completes, i is decremented by 1, and for as long as i remains 0 no second-type fault diagnosis is performed.
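The bookkeeping around the counter i can be sketched as below. This is a minimal illustration, assuming that fault records are kept in a first-in, first-out queue; `FaultLedger` and its method names are hypothetical, and the record fields simply mirror the kinds of data the text says are collected (address, erroneous data, correct value).

```python
from collections import deque


class FaultLedger:
    """Tracks faults found by the first-type diagnosis (counter i)."""

    def __init__(self):
        self.i = 0                 # faults still awaiting second-type diagnosis
        self.records = deque()     # saved fault-related data, oldest first

    def record_fault(self, addr, observed=None, expected=None):
        """Called by the first-type diagnosis each time it detects an error."""
        self.records.append(
            {"addr": addr, "observed": observed, "expected": expected}
        )
        self.i += 1

    def run_second_type(self, diagnose):
        """Run `diagnose(record)` once per pending fault; does nothing if i == 0."""
        results = []
        while self.i > 0:
            record = self.records.popleft()
            results.append(diagnose(record))
            self.i -= 1            # one second-type diagnosis completed
        return results
```

Here `diagnose` stands in for whatever routine performs the second-type diagnosis on one record.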

[0037] In FIG. 5, when the first-type fault diagnosis is first started, i is naturally set to 0, and the first-type fault diagnosis mode, which runs repeatedly at the lowest priority, is set. Fault diagnosis in this mode needs information on the range to be diagnosed, such as the diagnosis start address and end address, and this information is specified before the diagnosis runs. The first-type fault diagnosis begins by reading data from the diagnosis start address and checking it. In this flowchart, before each read the address is tested to see whether it is inside or outside the diagnosis range; if it is outside, the diagnosis starts again from the initial diagnosis start address.

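[0038] If the address is within the diagnosis range, the error control information attached to the read data (parity and ECC codes are well-known examples) is checked to verify that the data is sound. This check is repeated with the address updated after each word, so if no data error ever occurs, the first-type fault diagnosis repeats indefinitely.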
The error control information (parity ECC code or the like is well known) added to the read data is inspected to check the normality of the data. Since this inspection is repeatedly performed by updating the address every time one word is executed, if no error occurs in the data, this type 1 failure diagnosis will be repeated indefinitely.

[0039] Each time the type 1 failure diagnosis detects a failure, related data including the address where the error was detected (for example, the erroneous data pattern and the correct value) is collected and saved, and i is incremented by 1.
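The scan and bookkeeping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes one even-parity bit per word held in a separate list, and the names `even_parity`, `FailureRecord`, and `type1_pass` are hypothetical. With parity alone the correct value cannot be recovered, so the record keeps the stored parity bit instead.

```python
from dataclasses import dataclass


def even_parity(word: int) -> int:
    """Return the parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") & 1


@dataclass
class FailureRecord:
    address: int        # address where the check failed
    read_value: int     # erroneous data pattern as read
    stored_parity: int  # parity bit stored alongside the word


def type1_pass(memory, parity, start, end, records):
    """Scan [start, end) once, one word per step; return failures found."""
    found = 0
    for address in range(start, end):
        word = memory[address]
        if even_parity(word) != parity[address]:
            records.append(FailureRecord(address, word, parity[address]))
            found += 1
    return found


# Hypothetical 8-word memory with one flipped bit at address 5.
memory = [0x10, 0x23, 0x7F, 0x00, 0xA5, 0x3C, 0xFF, 0x01]
parity = [even_parity(w) for w in memory]
memory[5] ^= 0x04  # simulate a single-bit error

records = []
i = 0  # number of failures awaiting type 2 diagnosis
i += type1_pass(memory, parity, 0, len(memory), records)
```

In a real device the outer loop would call `type1_pass` again and again at the lowest priority; a single pass is shown here to keep the example finite.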

[0040] Many triggers for starting the type 2 failure diagnosis are conceivable. In the immediate trigger mode, it is started each time one failure is detected. In the scheduled trigger mode, it is started each time a fixed period elapses (for example, once a day). This decision is the start mode determination in the flowchart.

[0041] When the type 2 failure diagnosis is started, the failure-related data saved during the type 1 failure diagnosis is retrieved one entry at a time, and the diagnosis is repeated a sufficiently large number of times (for example, 100 to 1,000,000 times) over a very narrow range (about 1 to 100 words) that includes the address where the failure occurred.
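The focused, repeated diagnosis can be illustrated with a small simulation. The sketch below makes several assumptions that do not come from the specification: the fault model (`IntermittentMemory`, which flips one bit on every `period`-th read of a single address), the even-parity check, and the `half_width` parameter that defines the narrow window are all hypothetical.

```python
class IntermittentMemory:
    """Hypothetical memory whose word at `bad_address` reads back with one
    flipped bit on every `period`-th read of that address."""

    def __init__(self, words, bad_address, period):
        self.words = list(words)
        self.parity = [bin(w).count("1") & 1 for w in self.words]
        self.bad_address = bad_address
        self.period = period
        self.reads = 0

    def read(self, address):
        word = self.words[address]
        if address == self.bad_address:
            self.reads += 1
            if self.reads % self.period == 0:
                word ^= 0x01  # transient single-bit error
        return word, self.parity[address]


def type2_diagnosis(mem, fault_address, half_width, repeats):
    """Re-read a narrow window around fault_address `repeats` times.

    Returns a dict mapping address -> number of parity errors observed.
    """
    low = max(0, fault_address - half_width)
    high = min(len(mem.words), fault_address + half_width + 1)
    errors = {}
    for _ in range(repeats):
        for address in range(low, high):
            word, stored = mem.read(address)
            if (bin(word).count("1") & 1) != stored:
                errors[address] = errors.get(address, 0) + 1
    return errors


mem = IntermittentMemory(range(16), bad_address=9, period=250)
result = type2_diagnosis(mem, fault_address=9, half_width=2, repeats=1000)
```

With this fault model, a single type 1 read of address 9 would almost always see correct data, whereas the 1,000 focused repetitions reach the failing read several times.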

[0042] Because the type 1 failure diagnosis covers a wide range, it is not unusual for each memory element to be diagnosed only about once every 15 minutes. A failure such as a transient fault may occur only once in 100 to 10,000 accesses, and in a memory with automatic error correction the erroneous data is also corrected automatically, so reproducing the failure with this diagnosis alone is often extremely difficult.

[0043] However, this diagnosis runs at a low priority level and is repeated indefinitely over a long time, so once enough time has passed, a failure is detected with some probability. The type 2 diagnosis, in contrast, applies a sufficiently large number of repetitions to a limited range around a location where a failure has already occurred, so it can dramatically raise the probability of reproducing that failure.
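The gain in reproduction probability can be estimated with a short calculation. The sketch below assumes, purely for illustration, that each access fails independently with a fixed probability; the specification does not state such a model. The figures used are the ones quoted above: one error per 10,000 accesses, one type 1 visit per 15 minutes (96 visits per day), and 1,000,000 type 2 repetitions.

```python
def reproduction_probability(p_per_access: float, accesses: int) -> float:
    """Probability of at least one error in `accesses` independent reads."""
    return 1.0 - (1.0 - p_per_access) ** accesses


# Type 1: one visit per 15 minutes gives 24 * 60 / 15 = 96 reads per day.
type1_per_day = reproduction_probability(1 / 10_000, 96)

# Type 2: 1,000,000 focused repetitions on the same word.
type2_run = reproduction_probability(1 / 10_000, 1_000_000)
```

Under these assumptions the type 1 diagnosis has roughly a 1% chance per day of catching the fault at a given word, while a single type 2 run is practically certain to reproduce it.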

[0044] Here, the prescribed type 2 failure diagnosis is performed, its result is reported to the host software, and the same test is then repeated for the next failure location. Each time the type 2 failure diagnosis finishes for one failure, i is decremented by 1, and this processing is repeated until i reaches 0.

[0045] As explained above, the type 2 failure diagnosis may be started each time the type 1 failure diagnosis detects an error, or it may be started periodically. This is because, with either trigger, the type 2 failure diagnosis checks whether the type 1 failure diagnosis has ever detected an error and, if any such locations exist, diagnoses them one after another. The type 2 designated area may be a single word (or byte), or multiple words (or bytes) including that address.
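The two trigger modes and the loop that runs until i reaches 0 can be sketched as below. The names `should_start_type2` and `drain_pending`, the use of a queue whose length plays the role of i, and the time values in seconds are all assumptions made for illustration.

```python
from collections import deque

IMMEDIATE = "immediate"
SCHEDULED = "scheduled"


def should_start_type2(mode, pending, now, last_run, interval):
    """Start mode determination: decide whether type 2 diagnosis runs now."""
    if not pending:          # i == 0: nothing to diagnose
        return False
    if mode == IMMEDIATE:    # start as soon as any failure is recorded
        return True
    return now - last_run >= interval  # scheduled: wait for the period


def drain_pending(pending, diagnose, report):
    """Run type 2 diagnosis for each saved failure until none remain."""
    while pending:
        record = pending.popleft()        # corresponds to decrementing i
        report(record, diagnose(record))  # notify the host software
    return len(pending)


pending = deque([0x1A0, 0x2F4])  # hypothetical failing addresses
reports = []
if should_start_type2(SCHEDULED, pending, now=86_400, last_run=0, interval=86_400):
    remaining = drain_pending(
        pending,
        lambda addr: "no error reproduced",                  # stand-in diagnosis
        lambda addr, result: reports.append((addr, result)),
    )
```

Because `drain_pending` always works from the saved failure records, the same loop serves both trigger modes, which is the point made in the paragraph above.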

[0046] The type 1 failure diagnosis may be performed in a time-shared manner during idle time while online processing continues under the host processor, or it may be performed offline with the device detached from the system.
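One way to picture the time-shared operation is a diagnosis that advances one word at a time and stops as soon as the host processor needs the μ processor. The sketch below is an assumption-laden illustration: the generator `diagnosis_steps`, the polling callback `host_request_pending`, and the choice to resume from the next address after an interruption are hypothetical and are not taken from the specification.

```python
def diagnosis_steps(start, end, check_word):
    """Check one word per step, wrapping around at `end`; yield after each."""
    address = start
    while True:
        check_word(address)
        address = address + 1 if address + 1 < end else start  # wrap around
        yield address


def run_while_idle(steps, host_request_pending):
    """Advance the diagnosis until the host processor issues a request."""
    checked = 0
    for _ in steps:
        checked += 1
        if host_request_pending():
            break  # suspend; the generator keeps its position
    return checked


visited = []
steps = diagnosis_steps(0, 4, visited.append)

# First idle period: the host request arrives on the fifth poll.
polls = iter([False, False, False, False, True])
first = run_while_idle(steps, lambda: next(polls))

# Second idle period: interrupted immediately after one more word.
second = run_while_idle(steps, lambda: True)
```

Polling between words keeps the latency seen by the host to at most one word check, at the cost of one poll per word.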

[0047] The type 1 failure diagnosis checks for transient failures over a long period and uses a simple error detection method such as parity. For the type 2 failure diagnosis, which targets locations where an error has been detected at least once and where the failure reproduction probability is therefore high, a stricter error checking method may be used (for example, one that also reads back after writing inverted data).
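A read-back check after writing inverted data might look like the following sketch. The memory model `StuckBitMemory` (one bit of one word stuck at 0) and the function `inverted_write_check` are hypothetical and serve only to show why this check is stricter than reading alone: while the faulty cell holds 0, its contents are self-consistent and a read-only check sees nothing wrong.

```python
class StuckBitMemory:
    """Hypothetical RAM model in which one bit of one word is stuck at 0."""

    def __init__(self, size, bad_address, stuck_mask, width=8):
        self.cells = [0] * size
        self.bad_address = bad_address
        self.stuck_mask = stuck_mask
        self.mask = (1 << width) - 1

    def write(self, address, value):
        value &= self.mask
        if address == self.bad_address:
            value &= ~self.stuck_mask  # the stuck bit never stores a 1
        self.cells[address] = value

    def read(self, address):
        return self.cells[address]


def inverted_write_check(mem, address):
    """Write the inverted word, read it back, then restore the original.

    Returns True if both read-backs match what was written.
    """
    original = mem.read(address)
    inverted = ~original & mem.mask
    mem.write(address, inverted)
    ok_inverted = mem.read(address) == inverted
    mem.write(address, original)
    ok_restored = mem.read(address) == original
    return ok_inverted and ok_restored


mem = StuckBitMemory(size=4, bad_address=2, stuck_mask=0x10)
for a in range(4):
    mem.write(a, 0x00)
results = [inverted_write_check(mem, a) for a in range(4)]
```

Because the check writes to memory, it suits the narrow type 2 area, where the original contents can be restored immediately, better than the wide type 1 scan.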

[0048] Furthermore, although the embodiment was described only for main memory (RAM), the target is not limited: the memory device may be a ROM, or a file memory device such as a hard disk device or a magneto-optical disk device.

[0049]

[Effects of the Invention] Applying the present invention yields the following effects, owing to the features described above. The μ processor in the controlled device, which was previously not used effectively, can contribute to preventive maintenance through memory failure diagnosis, and the more idle time it has, the more it contributes. Once a failure with low reproducibility has occurred even once, the affected address is diagnosed repeatedly at an access frequency thousands or tens of thousands of times higher than that of the first-level diagnosis (the type 1 failure diagnosis), so the failure can readily be made to manifest. In addition, because the failure diagnosis runs automatically during the μ processor's idle time, there is no apparent waiting time for failure diagnosis.

[Brief description of drawings]

FIG. 1 is a conceptual diagram of a typical system consisting of a host processor and a plurality of controlled devices operating under its control.

FIG. 2 is a general functional block diagram of a controlled device.

FIG. 3 is a detailed functional block diagram of a controlled device, used to explain the conventional method in more detail.

FIG. 4 is an outline flowchart of the conventional test procedure.

FIG. 5 is an outline flowchart showing an embodiment of the present invention.

[Explanation of symbols]

100 host processor
101 controlled device
102 signal path
200 μ processor
201 RAM
202 ROM
203 hard disk
204 signal transmission/reception circuit
205 device under control
301 data bus
302 address bus
303 answer bus
304 memory address register
305 write register
306 read register
307 memory element
308 parity addition circuit
309 type 1 parity check circuit
310 type 2 parity check circuit

Claims

[Claims]

1. A memory failure diagnosis method for a system comprising a host processor, a sub-processor (hereinafter referred to as a μ processor) operable under the control of the host processor, a memory device connected to the μ processor, and error detecting means for reading the contents of the memory device and checking the data for errors, wherein the μ processor performs a type 1 failure diagnosis in which it sequentially reads the contents of a type 1 designated area of the memory device and the error detecting means detects errors, and, when the error detecting means detects an error in the type 1 failure diagnosis, performs a type 2 failure diagnosis in which error detection is repeated over a type 2 area, which is an area of the memory device that includes the address where the error was detected and is narrower than the type 1 designated area.

2. The memory failure diagnosis method according to claim 1, wherein the type 1 failure diagnosis or the type 2 failure diagnosis is interrupted when a control signal from the host processor is received during execution of that diagnosis.