JP2010170462A

JP2010170462A - Fault handling device and method

Info

Publication number: JP2010170462A
Application number: JP2009014164A
Authority: JP
Inventors: Tsuneshi Sentoda; 恒志仙洞田
Original assignee: NEC Computertechno Ltd
Current assignee: NEC Computertechno Ltd
Priority date: 2009-01-26
Filing date: 2009-01-26
Publication date: 2010-08-05
Anticipated expiration: 2029-01-26
Also published as: JP5451087B2

Abstract

PROBLEM TO BE SOLVED: To allow software to confirm the latest fault occurrence status even during a fault detection suppression period.

SOLUTION: In response to error detection by an error handling part 11, a fault report control part 12 notifies a diagnostic device 30 of fault detection indicating that the error handling part 11 has detected a fault, and suppresses notification of fault detection for subsequent error detections over a predetermined suppression period starting from that error detection. In response to error detection by the error handling part 11, a fault log control part 13 counts and stores the number of error occurrences as fault log information for each address space containing the error data, and notifies the diagnostic device 30 of fault detection when the number of error occurrences for any address space reaches a predetermined error count threshold value.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to failure processing technology, and more particularly to failure processing technology that detects a failure occurring in a data processing device and notifies a higher-level device of it.

In computer systems that require high reliability, an error correction function called ECC (Error Checking and Correction) is used to protect data in memory and on data transmission paths. When incorrect data is recorded in memory or sent over a transmission path, ECC is applied as follows. For a correctable error, the erroneous bit is corrected and the computer system continues to operate. For an uncorrectable error, the system detects that the error cannot be corrected, determines that operation cannot continue, and brings the system down.

Errors in memory and similar components have several causes. Some arise from logic design mistakes or electrical circuit design mistakes in hardware, and others from hardware breakdown due to deterioration of semiconductor elements or wiring. There are also soft errors, in which α rays or the like temporarily cause a bit error in memory. Temperature abnormalities arising from the power supply environment or installation environment of the computer system are a further cause.

Under these conditions, a large number of failures may occur from many causes. In general, when a failure occurs in a computer system, the diagnostic device performs interrupt processing in response to the failure report, such as collecting log information on the failure location.
However, if failures occur frequently for various reasons and interrupt processing is requested more often than the diagnostic device can handle, some interrupt processing goes unexecuted and the necessary failure processing cannot be performed.

To reduce the interrupt processing caused by such frequent failures, a related technique has been proposed that counts the occurrences of correctable errors after a correctable error is detected and notifies failure detection only when the count reaches a threshold. Failure detection is thereby suppressed for a certain period, which prevents interrupt processing from going unexecuted (see, for example, Patent Document 1). Correctable errors detected during this suppression period are still corrected.

JP 2008-027284 A

In such a related technique, however, failure reports are also suppressed during the failure detection suppression period, so processing such as failure log collection is not performed and the next failure report comes only after the suppression is lifted. During the suppression period the system is therefore treated as if no failure were occurring. If similar failures occur frequently during this period, they may develop into an uncorrectable error and lead to a system down.

In addition, some software, such as an OS, has a function that divides memory into pages, counts the correctable memory errors occurring in each page, and logically separates a faulty memory page when its error count reaches a threshold. The error count is incremented in response to the failure log reported from the hardware. With a failure processing method that suppresses failure detection for a certain period as described above, this software-based failure monitoring does not function, and the effect the software is intended to provide is not obtained.

The present invention is intended to solve these problems, and its object is to provide a failure processing apparatus and method that allow software to confirm the latest failure occurrence status even during a failure detection suppression period.

To achieve this object, a failure processing apparatus according to the present invention includes: an error processing unit that detects a correctable error in data acquired from a target data processing device; a failure report control unit that, in response to error detection by the error processing unit, notifies a higher-level device of failure detection indicating that the error processing unit has detected a failure, and that suppresses notification of failure detection for subsequent error detections over a predetermined suppression period starting from that error detection; and a failure log control unit that, in response to error detection by the error processing unit, counts and holds the number of occurrences of the error as failure log information for each data block to which the error data belongs, and notifies the higher-level device of failure detection when the number of error occurrences for any data block reaches a preset error count threshold.

A failure processing method according to the present invention is used in a failure processing apparatus that detects the occurrence of a failure based on data acquired from a target data processing device and notifies a higher-level device. The method includes: an error processing step in which an error processing unit detects a correctable error in the data acquired from the data processing device; a failure report control step in which a failure report control unit, in response to error detection by the error processing unit, notifies the higher-level device of failure detection indicating that the error processing unit has detected a failure, and suppresses notification of failure detection for subsequent error detections over a predetermined suppression period starting from that error detection; and a failure log control step in which a failure log control unit, in response to error detection by the error processing unit, counts and holds the number of occurrences of the error as failure log information for each data block to which the error data belongs, and notifies the higher-level device of failure detection when the number of error occurrences for any data block reaches a preset error count threshold.

According to the present invention, failure detection is notified from the failure processing apparatus to the higher-level device according to the number of error occurrences, even during the suppression period for failure detection notification. As a result, even in a computer system whose hardware uses a failure processing method that suppresses failure detection notification for a certain period after error detection, the diagnostic device and the software (OS) can confirm the latest failure occurrence status of the hardware. The original failure processing for that hardware can therefore be executed properly without being hindered, and the probability of a system down occurring during the suppression period can be reduced.

A block diagram showing the configuration of a failure processing apparatus according to an embodiment of the present invention. An explanatory diagram showing the functions of the software. A schematic flow of conventional failure processing. A schematic flow of failure processing according to an embodiment of the present invention.

Next, embodiments of the present invention will be described with reference to the drawings.
[One Embodiment]
First, a failure processing apparatus according to an embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the failure processing apparatus according to an embodiment of the present invention.
The computer system 1 in FIG. 1 includes a failure processing apparatus 10, a memory 20, a diagnostic device 30, a storage device 40, and a processor 50, which are connected via an internal bus.

The failure processing apparatus 10 consists of a dedicated signal processing circuit and has a function of detecting an error in data acquired from a data processing device and notifying a higher-level device.
In the present embodiment, the failure processing apparatus 10 is applied to a memory controller used in the computer system 1. The description takes as an example the case where a correctable error is detected in data acquired from the memory 20, which is the data processing device subject to failure processing, and failure detection is notified to the diagnostic device 30, which is the higher-level device.

The memory 20 consists of a semiconductor memory device (main memory). It has a function of writing and reading various data in response to accesses from the processor 50 via the failure processing apparatus 10, and a function of holding the ECC error correction information supplied when data is written.
The diagnostic device 30 performs diagnostic control of the computer system 1. On receiving a failure detection notification from the failure processing apparatus 10, it performs failure diagnosis processing such as failure log collection and failure recovery.

The storage device 40 consists of a storage device such as a hard disk or nonvolatile memory, and stores the failure log information collected by the diagnostic device 30 as well as the software programs executed by the processor.
The processor 50 consists of an arithmetic processing circuit such as a CPU, and performs various kinds of information processing by executing the programs in the storage device 40.

FIG. 2 is an explanatory diagram showing the functions of the software.
The software 51 is an OS or application program that controls the computer system 1. When executed by the processor 50, it makes various settings in the failure processing apparatus 10. In the present embodiment, instructions from the processor 50 are issued via the diagnostic device 30.
The software 51 also has a failure processing function: it divides the memory 20 used by the computer system 1 into data blocks called pages, counts for each page the number of errors detected in data acquired from the memory 20, and, when the error count for a page reaches a threshold, logically separates that page so that it can no longer be used.
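The per-page counting and separation performed by the software 51 can be illustrated with a minimal Python sketch. The class name, the 4 KB page size, and the threshold of 3 are assumptions chosen for illustration; the source does not specify how the OS implements this function.

```python
from collections import Counter

PAGE_SIZE = 4096        # assumed page size in bytes (illustrative)
OFFLINE_THRESHOLD = 3   # assumed per-page error threshold (illustrative)


class PageErrorMonitor:
    """Hypothetical model of the per-page error counting done by software 51."""

    def __init__(self, page_size=PAGE_SIZE, threshold=OFFLINE_THRESHOLD):
        self.page_size = page_size
        self.threshold = threshold
        self.error_counts = Counter()  # page number -> correctable error count
        self.offlined = set()          # pages logically separated from use

    def report_error(self, address):
        """Record one correctable error; return True if the page is separated now."""
        page = address // self.page_size
        if page in self.offlined:
            return False
        self.error_counts[page] += 1
        if self.error_counts[page] >= self.threshold:
            self.offlined.add(page)
            return True
        return False
```

The counter only advances when an error report actually reaches the software, which is why suppressing reports for a period defeats this mechanism.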

The software 51 acquires, via the processor 50, the log information collected by the diagnostic device 30, and uses the configuration control function 51A and the memory management function 51B for the computer system 1 as a whole to separate any page of the memory 20 as described above.
The detailed operations of the diagnostic device 30, the processor 50, and the software 51 are based on well-known techniques, and a detailed description is omitted here.

[Failure Processing Apparatus]
Next, the configuration of the failure processing apparatus according to the present embodiment will be described in detail with reference to FIG. 1.
The failure processing apparatus 10 includes, as its main functional units, an error processing unit 11, a failure report control unit 12, a failure log control unit 13, and a diagnostic command control unit 14. A memory controller to which the failure processing apparatus 10 is applied generally contains other functional units as well, such as a data writing unit that writes data to a given address in the memory 20 in response to a write command from the processor 50, and a data reading unit that reads data from a given address in the memory 20 in response to a read command from the processor 50. FIG. 1 shows only the functional units related to failure processing in the present embodiment.

The error processing unit 11 has three functions. It performs an ECC check based on the data read from the memory 20 and the associated error correction information. If the data contains a correctable error, it corrects the erroneous bit, notifies the failure report control unit 12 of the error detection, and sends the address at which the error was detected to the failure log control unit 13. If the data read from the memory 20 contains an uncorrectable error, it performs error detection only. The operation after an uncorrectable error is detected may use well-known techniques, and its description is omitted here.

The failure report control unit 12 controls the notification of failure detection to the diagnostic device 30 in response to the error detection reported from the error processing unit 11. Its main components are an error flag 12A, a mask flag 12B, an AND logic circuit 12C, a counter 12D, a mask count threshold 12E, a comparator 12F, and an OR logic circuit 12G. Of these, the error flag 12A, the mask flag 12B, and the mask count threshold 12E are implemented as registers.

The error flag 12A holds a flag value indicating whether there is a failure detection to be notified to the diagnostic device 30. It is updated with the output of the AND logic circuit 12C, which is the logical product of the error detection signal from the error processing unit 11 (not detected = “0”, detected = “1”) and the inverted value of the mask flag 12B (not suppressed = “0”, suppressed = “1”). In other words, the error flag 12A is set to “1” only when the mask flag 12B is “0” and the error detection signal is “1”.
The error flag 12A is reset to “0” when the diagnostic device 30 performs failure processing and, as described later, is also reset to “0” according to the comparison result of the comparator 12F.

The mask flag 12B holds a flag value indicating whether failure detection notification is suppressed. While this flag is set, for a certain period after an error detection reported from the error processing unit 11, the AND logic circuit 12C prevents the error flag 12A from being set, that is, it suppresses notification of failure detection to the diagnostic device 30.
When the error flag 12A is set to “1”, the mask flag 12B is also set to “1”. Consequently, after the error flag 12A has been set to “1”, further setting of the error flag 12A is suppressed for as long as the mask flag 12B is held at “1”.

The counter 12D starts counting at regular intervals when the mask flag 12B changes to the value indicating suppression. While the mask flag 12B is set to “1”, the counter value is incremented at a fixed interval.
The mask count threshold 12E is preset with a threshold that specifies the length of the failure detection suppression period.

The comparator 12F compares the value of the mask count threshold 12E with the value of the counter 12D and, according to the result, resets the mask flag 12B to the not-suppressed value and resets the counter value.
When the counter value reaches the mask count threshold 12E, the comparison result of the comparator 12F inverts, and the mask flag 12B and the counter 12D are reset.
In this way, the failure report control unit 12 suppresses failure detection for a certain period after a failure is detected.
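The interaction of the error flag 12A, mask flag 12B, counter 12D, mask count threshold 12E, and comparator 12F can be modeled with a short Python sketch. This is a behavioral illustration only, not the circuit itself: the class and method names are hypothetical, one call to `tick()` stands for one counting interval, and clearing of the error flag is modeled only as an explicit acknowledgment by the diagnostic device.

```python
class FailureReportControl:
    """Hypothetical behavioral model of failure report control unit 12."""

    def __init__(self, mask_count_threshold):
        self.error_flag = 0                                # error flag 12A
        self.mask_flag = 0                                 # mask flag 12B
        self.counter = 0                                   # counter 12D
        self.mask_count_threshold = mask_count_threshold   # threshold 12E

    def on_error_detected(self):
        """Error detection from unit 11; return True if the error flag is set."""
        # AND logic 12C: error detected AND NOT mask flag
        if not self.mask_flag:
            self.error_flag = 1
            self.mask_flag = 1   # suppression period starts
            return True
        return False             # suppressed: error flag is not set

    def tick(self):
        """Advance one counting interval while suppression is active."""
        if self.mask_flag:
            self.counter += 1
            # comparator 12F: counter reached the mask count threshold
            if self.counter >= self.mask_count_threshold:
                self.mask_flag = 0
                self.counter = 0

    def acknowledge(self):
        """Diagnostic device 30 has handled the failure (assumed interface)."""
        self.error_flag = 0
```

With a threshold of 3, a second error arriving before three ticks have elapsed is corrected by the error processing unit but does not set the error flag again.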

The failure log control unit 13 has two functions. In response to error detection by the error processing unit 11, it counts and holds the number of occurrences of the error as failure log information for each data block to which the error data belongs. When the number of error occurrences for any data block reaches a preset error count threshold, it notifies the diagnostic device 30 of failure detection via the failure report control unit 12.

The main components of the failure log control unit 13 are a position information holding unit 13A, an error count control unit 13B, a failure log holding unit 13F, an error count threshold 13G, and a comparator 13H.

The position information holding unit 13A is a register that holds the error position information for the error data, as notified by the error processing unit 11 when it detects an error. In the present embodiment, the error processing unit 11 notifies the address in the memory 20 at which the error data was stored, and the position information holding unit 13A holds this address as the error position information.
The error count control unit 13B counts, for each data block, the number of times the error position information matches the position information of that data block, and treats this as the number of error occurrences for the data block.

The error count control unit 13B contains one entry per data block, each consisting of a comparison address 13C, a comparator 13D, and a counter 13E.
In the present embodiment, the address spaces obtained by dividing the memory 20 serve as the data blocks, and the number of error occurrences is measured per address space. The comparison address 13C is therefore preset with the upper and lower limit addresses of the corresponding address space.

For example, when the address space of the memory 20 is divided into 4 KB units, the comparison address 13C of entry 0 is set with a lower limit of 32'h0000_0000 and an upper limit of 32'h0000_0FFF. (The notation 32'h0000_0000 and 32'h0000_0FFF denotes 32-bit addresses written in hexadecimal.)
The software (OS) that manages the memory capacity of the computer system 1 calculates the optimum values and makes this setting. This is only one example of a setting, and the setting method is not limited to it.
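The arithmetic behind the 4 KB example can be checked with a small Python sketch. The function names are hypothetical, and the assumption that entry N covers the N-th contiguous 4 KB block goes beyond the source, which gives values only for entry 0.

```python
BLOCK_SIZE = 4 * 1024  # 4 KB address space per entry, as in the example


def comparison_bounds(entry, block_size=BLOCK_SIZE):
    """Return (lower, upper) limit addresses for a comparison address entry.

    Assumes entries map to contiguous, equally sized blocks starting at 0.
    """
    lower = entry * block_size
    upper = lower + block_size - 1
    return lower, upper


def format_address(address):
    """Format a 32-bit address in the 32'hXXXX_XXXX notation used above."""
    return f"32'h{address >> 16:04X}_{address & 0xFFFF:04X}"
```

For entry 0 this gives a lower limit of 0x00000000 and an upper limit of 0x00000FFF, matching the values stated in the text.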

The comparator 13D compares the address space set in the comparison address 13C with the address held in the position information holding unit 13A.
The counter 13E increments its value when the comparison result of the comparator 13D indicates a match.

Accordingly, when the address information for the error data is stored from the error processing unit 11 into the position information holding unit 13A, the error count control unit 13B increments the error count for the address space corresponding to that address and outputs the new error count to the failure log holding unit 13F.
In this way, the number of error occurrences is counted for each configured address space within the memory space managed by the failure processing apparatus 10.
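The entry structure of the error count control unit 13B (comparison address 13C, comparator 13D, counter 13E) can be sketched in Python as a list of address ranges with counters. The class and field names are hypothetical, and the sequential loop stands in for comparators that would operate in parallel in hardware.

```python
class ErrorCountControl:
    """Hypothetical model of error count control unit 13B."""

    def __init__(self, ranges):
        # One entry per data block: comparison address 13C (lower/upper limits)
        # and counter 13E.
        self.entries = [
            {"lower": lower, "upper": upper, "count": 0}
            for lower, upper in ranges
        ]

    def record(self, address):
        """Count an error at `address`; return (entry index, new count) or None."""
        for index, entry in enumerate(self.entries):
            # comparator 13D: does the error address fall in this address space?
            if entry["lower"] <= address <= entry["upper"]:
                entry["count"] += 1
                return index, entry["count"]
        return None  # address is outside every configured address space
```

The returned pair corresponds to what is passed on to the failure log holding unit 13F: which entry was hit and its updated error count.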

The failure log holding unit 13F consists of a plurality of registers and holds, as failure log information, pairs consisting of the error position information of the error data notified by the error processing unit 11 and the number of error occurrences for the corresponding error block.
The failure log holding unit 13F has the same number of register entries as the error count control unit 13B. It stores failure log information, consisting of the failure address (error position information) received from the position information holding unit 13A and the error count for the address space (error block) received from the error count control unit 13B, in the entry corresponding to that address space.

In the present embodiment the address is stored in the failure log, but the address spaces used for error counting are configured by the software (OS) 51 executed by the processor 50. Because the software 51 already knows, from the configuration stage, which address space is assigned to each entry, the address need not be stored.

The error count threshold 13G is a register that holds, in advance, the threshold for the count values stored in the failure log holding unit 13F.
The comparator 13H compares the number of error occurrences held in each entry of the failure log holding unit 13F with the value of the error count threshold 13G and outputs the comparison result as failure detection. When the number of error occurrences in any address space (error block) reaches the threshold, the comparison result of the comparator 13H changes from not detected = “0” to detected = “1”, and failure detection is thereby notified to the failure report control unit 12.

In the present embodiment, the error count threshold 13G is a single threshold shared by all entries managed by the failure log holding unit 13F, but a separate threshold may instead be set for each entry. The threshold may be set either by hardware as an initial value (for example, an initial value of 3) or by the software (OS).
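The comparison made by the comparator 13H, with either a shared threshold or one threshold per entry, can be sketched as follows. The function name and the list-based representation of the failure log entries are assumptions made for illustration.

```python
def failure_detected(error_counts, threshold):
    """Return 1 if any entry's error count has reached its threshold, else 0.

    `threshold` is either a single int shared by all entries (as in the
    embodiment) or a sequence with one threshold per entry (the variation).
    """
    if isinstance(threshold, int):
        thresholds = [threshold] * len(error_counts)
    else:
        thresholds = list(threshold)
    return int(any(count >= limit
                   for count, limit in zip(error_counts, thresholds)))
```

With the example initial value of 3, `failure_detected([0, 2, 1], 3)` yields 0 and `failure_detected([0, 3, 1], 3)` yields 1.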

The diagnostic command control unit 14 accepts diagnostic commands from the diagnostic device 30 and the processor 50 and controls the units in the failure processing apparatus 10 accordingly. It thereby executes diagnostic commands such as setting values in those units and collecting the failure log from the failure log holding unit 13F.

[Operation of the Embodiment]
Next, the operation of the failure processing apparatus according to the present embodiment will be described with reference to FIG. 1 and FIG. 2.
ECC error correction information is added to the data written to the memory 20. When data is read from the memory 20, the error processing unit 11 performs an ECC check based on the error correction information, which is read from the memory 20 along with the data.
If the data read from the memory 20 contains a correctable error, the failure processing apparatus 10 executes the following failure processing.

The error processing unit 11 detects through the ECC check that a correctable error has occurred and corrects the erroneous bit. Having detected the error, the error processing unit 11 notifies the failure report control unit 12 of the error detection and, so that failure log information at the time of detection can be saved, sends the address at which the error was detected to the failure log control unit 13.

The failure report control unit 12 uses the AND logic circuit 12C to obtain the logical product of the error detection signal from the error processing unit 11 and the inverted value of the mask flag 12B, and sets the error flag 12A according to the result. Because the mask flag 12B is “0” when the error detection is notified, the error flag 12A is set to “1”, after which the mask flag 12B is set to “1”.

When the mask flag 12B is set to “1”, the counter 12D starts incrementing. The value of the counter 12D is “0” at the moment the mask flag 12B becomes “1”. The counter 12D is incremented at a predetermined interval, and the comparator 12F compares its value with the value set in the mask count threshold 12E. When the value of the counter 12D reaches the mask count threshold 12E, the comparison result of the comparator 12F inverts, and the mask flag 12B and the counter 12D are reset.

Thus, while the mask flag 12B is set to “1”, the error flag 12A, which indicates failure detection, is not set to “1” even if the error processing unit 11 reports a new error detection. Failure detection notification to the diagnostic device 30 is therefore suppressed for a fixed period after a failure is detected.
The error flag 12A and the failure detection notification from the failure log control unit 13 are input to the OR logic circuit 12G of the failure report control unit 12. When either input becomes “1” (here, the error flag 12A), a failure detection notification is sent to the diagnostic device 30.
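The behavior of the error flag 12A, mask flag 12B, counter 12D, and OR logic circuit 12G described above can be sketched as a small behavioral model. The Python below is an illustrative sketch only, not the circuit of the embodiment. The class and method names are hypothetical, one call to tick() stands for one counter interval, and the comparator 12F is assumed to fire when the counter value reaches or exceeds the threshold. The reset_error_flag() method models the reset of the error flag 12A by the failure recovery command described in this embodiment.

```python
class FailureReportControl:
    """Hypothetical behavioral model of the failure report control unit 12."""

    def __init__(self, mask_count_threshold):
        self.error_flag = 0                                 # error flag 12A
        self.mask_flag = 0                                  # mask flag 12B
        self.counter = 0                                    # counter 12D
        self.mask_count_threshold = mask_count_threshold    # mask count threshold 12E

    def on_error_detected(self):
        # AND logic circuit 12C: error detected AND (NOT mask flag).
        if 1 & (self.mask_flag ^ 1):
            self.error_flag = 1
            self.mask_flag = 1      # suppression period starts
            self.counter = 0        # counter is 0 when the mask flag becomes 1

    def tick(self):
        # The counter 12D advances only while the mask flag is set.
        if self.mask_flag:
            self.counter += 1
            # Comparator 12F (assumed: fires once the threshold is reached).
            if self.counter >= self.mask_count_threshold:
                self.mask_flag = 0
                self.counter = 0

    def reset_error_flag(self):
        # Models the failure recovery command resetting the error flag 12A.
        self.error_flag = 0

    def notification(self, log_detection=0):
        # OR logic circuit 12G: error flag OR detection from the log control unit 13.
        return self.error_flag | log_detection
```

In this sketch, a second call to on_error_detected() before the counter reaches the threshold leaves the error flag unchanged, while notification(log_detection=1) still returns 1, which corresponds to the path through the OR logic circuit 12G.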

Meanwhile, the failure log control unit 13 receives address information for the error data from the error processing unit 11 and controls the failure log information.
The address information sent from the error processing unit 11 is temporarily stored in the position information holding unit 13A and then sent to the error count control unit 13B and the failure log holding unit 13F.

First, the error count control unit 13B counts the number of error occurrences for each address space in which a failure has occurred. Suppose the failed address falls within the address space of entry 0. In entry 0, the comparator 13D checks whether the address held in the position information holding unit 13A matches the address range of that address space, which is set in the comparison address 13C. The address of the error data is thereby recognized as belonging to entry 0, and the counter 13E of entry 0 is incremented. The value of the counter 13E is output to the failure log holding unit 13F.

The failure log holding unit 13F writes the address information held in the position information holding unit 13A and the count value of the counter 13E to entry 0, as instructed by the error count control unit 13B.
At this point, a failure log has been written only to entry 0 of the failure log holding unit 13F, and the number of error occurrences is “1”.

To manage the number of error occurrences against a threshold, the failure log control unit 13 has a threshold for the number of error occurrences set in advance in the error count threshold 13G. Here, the case where the threshold is “3” is described.
The comparator 13H compares the count value in the failure log holding unit 13F with the value set in the error count threshold 13G, and the comparison result is reported to the failure report control unit 12 as the presence or absence of failure detection. In the case above, a failure has occurred only once, so the count value in the failure log holding unit 13F does not match the value set in the error count threshold 13G and the comparison result is “0”. At this point, therefore, the failure report control unit 12 is notified of “0”, meaning no failure detected.
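The per-entry counting and threshold comparison described above can be illustrated with the following Python sketch. It is a simplified model under stated assumptions, not the embodiment's circuit: the class and method names are hypothetical, each entry is represented by an inclusive (low, high) address range, the second address range in the usage note is invented for illustration, and the comparator 13H is modeled as an equality check because the text describes the comparison in terms of the values matching.

```python
class FailureLogControl:
    """Hypothetical behavioral model of the failure log control unit 13."""

    def __init__(self, address_ranges, error_count_threshold):
        self.address_ranges = address_ranges                   # comparison address 13C per entry
        self.counters = [0] * len(address_ranges)              # counter 13E per entry
        self.log = {}                                          # failure log holding unit 13F
        self.error_count_threshold = error_count_threshold     # error count threshold 13G

    def on_error_detected(self, address):
        """Record one error and return 1 if failure detection should be notified."""
        for entry, (low, high) in enumerate(self.address_ranges):
            if low <= address <= high:                         # comparator 13D
                self.counters[entry] += 1
                # Store the error address together with the updated count.
                self.log[entry] = (address, self.counters[entry])
                # Comparator 13H: 1 when the count matches the threshold.
                return 1 if self.counters[entry] == self.error_count_threshold else 0
        return 0                                               # address matches no entry
```

For example, with FailureLogControl([(0x0000, 0x0FFF), (0x1000, 0x1FFF)], 3), the first two errors inside entry 0 return 0 and the third returns 1, regardless of whether the failure report control unit is currently suppressing notifications.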

Next, the operation when the failure report control unit 12 sends a failure detection notification to the diagnostic device 30 will be described.
When a failure occurs while failure detection is not suppressed, the failure report control unit 12 sends a failure detection notification to the diagnostic device 30. In response, the diagnostic device 30 starts the following failure processing operation.

Triggered by the failure detection notification from the failure processing apparatus 10, the diagnostic device 30 issues a failure log collection command to the diagnostic command control unit 14 of the failure processing apparatus 10.
On receiving the failure log collection command from the diagnostic device 30, the diagnostic command control unit 14 reads the failure log from the failure log holding unit 13F and returns it to the diagnostic device 30.
When collection of the failure log from the failure processing apparatus 10 is complete, the diagnostic device 30 issues a failure recovery command to the diagnostic command control unit 14 of the failure processing apparatus 10.
In response, the diagnostic command control unit 14 resets the error flag 12A of the failure report control unit 12.

In the failure report control unit 12, the mask flag 12B is still “1” when the error flag 12A is reset. Therefore, even if the error processing unit 11 reports a new error detection, the error flag 12A is not set to “1” while the mask flag 12B remains “1”, that is, throughout the fixed suppression period, and no failure detection is notified to the diagnostic device 30.
In the failure log control unit 13, on the other hand, failure log information for failures that occur during this suppression period is stored sequentially in the failure log holding unit 13F by the operation described above.

The diagnostic device 30 stores the failure log collected from the failure log holding unit 13F of the failure processing apparatus 10 in the storage device 40. The failure log stored in the storage device 40 is then acquired by the processor 50 and passed to the software 51.

Based on this failure log, the software 51 analyzes the failure state of the memory 20. If the number of error occurrences for any address space has reached the error count threshold managed in advance by the software 51, the software 51 performs failure processing that logically isolates the memory page corresponding to that address space. In the description so far, the failure has occurred only once, so no failure processing is performed and the computer system 1 continues operating.
The above describes the operation when the number of error occurrences has not reached the threshold.

Next, the case where a failure occurs while failure detection is suppressed will be described.
The error processing unit 11 detects correctable errors in data read from the memory 20 even during the suppression period for failure detection notification.
When the error processing unit 11 detects a correctable error during the suppression period, failure detection is suppressed by the mask flag 12B in the failure report control unit 12. The error flag 12A is therefore not set to “1”, and no failure detection notification is sent to the diagnostic device 30.

The failure log control unit 13, on the other hand, stores a failure log in the failure log holding unit 13F based on the address information received from the error processing unit 11, in the same way as described above, even when the correctable error is detected during the suppression period.

Regarding threshold management of the number of error occurrences: if the address of the second failure lies in the address space set in entry 0, the count value becomes “2”. Because the error count threshold 13G is “3” here, the threshold has still not been reached, and the failure log control unit 13 does not send a failure detection notification to the failure report control unit 12. As a result, no failure detection notification is sent to the diagnostic device 30 either, and the computer system 1 continues operating.
By repeating the above operation, failure logs continue to be stored and failure log information accumulates even while failure detection is suppressed.

When this operation is repeated and the number of errors that have occurred within the address space set in entry 0 becomes “3”, the error count threshold 13G is reached and the comparison result of the comparator 13H becomes “1”. The failure log control unit 13 therefore sends a failure detection notification to the failure report control unit 12 even during the suppression period.

The error flag 12A and the failure detection notification from the failure log control unit 13 are input to the OR logic circuit 12G of the failure report control unit 12. Because one of the two inputs, here the failure detection notification, is “1”, this notification is passed through the failure report control unit 12 to the diagnostic device 30, and the diagnostic device 30 starts failure processing.
The log collection performed in this failure processing is the same as described above, so its description is omitted.

The collected failure log information is again passed to the software 51, which analyzes the failure state. Based on the failure log information, the software 51 recognizes that the number of error occurrences in this address space has reached the threshold “3” and executes a failure processing operation that logically isolates the memory page corresponding to the address space. From the address information in the failure log information, the memory page to be isolated is determined to be the range 32'h0000_0000 to 32'h0000_0FFF of the address space, and the software 51 isolates this address space.
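How the software might derive the page to isolate from the logged address can be illustrated as follows. This Python sketch is hypothetical: the function names, the representation of the failure log as (address, count) pairs, and the 4 KiB page size are assumptions, the last chosen only because it matches the 32'h0000_0000 to 32'h0000_0FFF range given above. The text does not specify how the software 51 actually performs this step.

```python
PAGE_SIZE = 0x1000  # assumed 4 KiB page, consistent with the range 0x0000_0000-0x0000_0FFF


def page_range(address, page_size=PAGE_SIZE):
    """Return the inclusive (first, last) address of the page containing address."""
    base = address & ~(page_size - 1)
    return base, base + page_size - 1


def pages_to_isolate(failure_log, software_threshold):
    """Select pages whose logged error count has reached the software threshold.

    failure_log is a list of (error_address, error_count) pairs, one per entry.
    """
    return [page_range(address)
            for address, count in failure_log
            if count >= software_threshold]
```

With a log of [(0x0000_0030, 3), (0x0000_1004, 1)] and a threshold of 3, only the page 0x0000_0000 to 0x0000_0FFF is selected in this sketch.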

FIG. 3 is a schematic flow of conventional failure processing. FIG. 4 is a schematic flow of failure processing according to the present embodiment. In FIG. 3 and FIG. 4, the vertical direction is the time axis, and the order of processing from the occurrence of an error to log collection is shown.

In the conventional failure processing method, as shown in FIG. 3, when a correctable error is first detected, the failure processing apparatus notifies the diagnostic device of failure detection and records the number of error occurrences as a log. In response to this notification, the diagnostic device collects the failure log from the failure processing apparatus and reports its contents to the software. Since the number of error occurrences is “1” at this point, it does not match the software's error count threshold, which is set to 3 as described above, and the software does not execute failure processing.

After that, the failure processing apparatus enters the suppression period for failure detection notification, which lasts for a fixed period after the error detection. Failure detection for new errors detected during this suppression period is not reported to the diagnostic device, and the diagnostic device does not monitor failures by collecting logs.
Failure detection is therefore notified only when an error is detected outside the suppression period. If the software's error count threshold is set to 3, the software executes failure processing and isolates the failed memory page only once errors have been detected three times outside the suppression period, that is, once two or more suppression periods have elapsed since the first error detection.

Thus, with the conventional failure processing method, even if failures occur frequently during the suppression period, operation continues for some time as though no failure had occurred. In some cases the failure develops into an uncorrectable error and leads to a system down.

In the failure processing according to the present embodiment, on the other hand, as shown in FIG. 4, when a correctable error is first detected, the failure processing apparatus notifies the diagnostic device of failure detection and records the number of error occurrences as a log. In response to this notification, the diagnostic device collects the failure log from the failure processing apparatus and reports its contents to the software. Since the number of error occurrences is “1” at this point, it does not match the software's error count threshold, which is set to 3 as described above, and the software does not execute failure processing.

At this time, the failure processing apparatus compares the number of error occurrences in the updated failure log with its own error count threshold, and decides from the comparison result whether failure detection needs to be notified.
Therefore, if the error count threshold of the failure processing apparatus is set to 3, then, as shown in FIG. 4, the number of error occurrences reaches that threshold when the third error is detected, even within the suppression period, and the failure processing apparatus notifies the diagnostic device of failure detection.

In response, the diagnostic device collects the log from the failure processing apparatus and reports it to the software.
If the software's error count threshold is set to 3, the number of error occurrences reported in the log, “3”, matches it, so the software performs failure processing, that is, isolation of the address space, even while the failure processing apparatus is in the suppression period for failure detection notification. As a result, the failed memory page that could lead to a system down is isolated promptly, and the probability of a system down can be reduced.
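The difference between the two flows can be made concrete with a small timeline simulation. The Python below is a sketch under stated assumptions rather than a reproduction of FIG. 3 or FIG. 4: all errors are taken to fall in one address space, times are in arbitrary units, the function names are hypothetical, and the conventional method is modeled as counting only errors detected outside the suppression period, which is one reading of the description above.

```python
def simulate(error_times, suppression_period, hw_threshold=None):
    """Return (time, reported_count) for each failure detection notification.

    hw_threshold=None models the conventional method; an integer models the
    embodiment, where reaching the threshold notifies even while suppressed.
    """
    notifications = []
    mask_until = None   # end of the current suppression period, if any
    count = 0
    for t in error_times:
        masked = mask_until is not None and t < mask_until
        # Embodiment: every error is logged. Conventional (assumed): only
        # errors detected outside the suppression period are counted.
        if hw_threshold is not None or not masked:
            count += 1
        reached = hw_threshold is not None and count == hw_threshold
        if not masked:
            mask_until = t + suppression_period   # a new suppression period starts
        if not masked or reached:
            notifications.append((t, count))
    return notifications


def first_isolation_time(notifications, software_threshold):
    """Time of the first notification whose count reaches the software threshold."""
    for t, count in notifications:
        if count >= software_threshold:
            return t
    return None
```

For errors at times 0, 2, 4, 100, and 200 with a suppression period of 50 and both thresholds set to 3, this model reports a count of 3 at time 200 in the conventional case and at time 4 in the embodiment, which is the earlier isolation the embodiment aims for.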

[Effect of one embodiment]
As described above, in the present embodiment, the failure report control unit 12, in response to error detection by the error processing unit 11, notifies the diagnostic device (higher-level device) 30 of failure detection indicating that the error processing unit 11 has detected a failure, and suppresses notification of failure detection in response to subsequent error detection over a predetermined suppression period from that error detection. The failure log control unit 13, in response to error detection by the error processing unit 11, counts and holds the number of occurrences of the error as failure log information for each address space (data block) to which the error data in which the error was detected belongs, and notifies the diagnostic device 30 of failure detection when the number of error occurrences for any address space reaches the preset error count threshold.

Consequently, even in a computer system 1 equipped with hardware whose failure processing method suppresses failure detection notification for a fixed period after an error is detected, the diagnostic device, and in turn the software (OS), can check the latest failure status of the hardware. The intended failure processing for that hardware can thus be executed properly without being hindered, and the probability of a system down occurring during the suppression period can be reduced.

[Extension of the embodiment]
The above description took as an example the case where the failure processing apparatus 10 is applied to a memory controller and detects errors in data read from the memory 20. The invention is not limited to this. The failure processing apparatus 10 according to the present invention can be applied in the same way to any electronic circuit that, like a memory controller, acquires data at high speed, such as a data communication interface circuit, and the same effects can be obtained.

The above description also took as an example the case where the failure processing apparatus 10, the diagnostic device 30, and the processor 50 are realized as separate circuit configurations. The invention is not limited to this, and any of these circuit configurations may be combined into a single circuit configuration.
Likewise, although ECC was used as the error correction method in the above description, the invention is not limited to this, and other error correction methods may be applied.

DESCRIPTION OF SYMBOLS
1 ... Computer system, 10 ... Failure processing apparatus (memory controller), 11 ... Error processing unit, 12 ... Failure report control unit, 12A ... Error flag, 12B ... Mask flag, 12C ... AND logic circuit, 12D ... Counter, 12E ... Mask count threshold, 12F ... Comparator, 12G ... OR logic circuit, 13 ... Failure log control unit, 13A ... Position information holding unit, 13B ... Error count control unit, 13C ... Comparison address, 13D ... Comparator, 13E ... Counter, 13F ... Failure log holding unit, 13G ... Error count threshold, 13H ... Comparator, 14 ... Diagnostic command control unit, 20 ... Memory, 30 ... Diagnostic device, 40 ... Storage device, 50 ... Processor, 51 ... Software, 51A ... Configuration control function, 51B ... Memory management function.

Claims

A failure processing apparatus comprising:
an error processing unit that detects a correctable error in data acquired from a target data processing device;
a failure report control unit that, in response to error detection by the error processing unit, notifies a host device of failure detection indicating that the error processing unit has detected a failure, and suppresses notification of failure detection in response to subsequent error detection over a predetermined suppression period from that error detection; and
a failure log control unit that, in response to error detection by the error processing unit, counts and holds, as failure log information, the number of occurrences of the error for each data block to which the error data in which the error was detected belongs, and notifies the host device of the failure detection when the number of error occurrences for any data block reaches a preset error count threshold.

The failure processing apparatus according to claim 1,
further comprising a diagnostic command control unit that, in response to a diagnostic command from the host device, acquires the failure log information held in the failure log control unit and reports it to the host device.

The failure processing apparatus according to claim 1,
wherein the failure report control unit comprises:
an error flag that holds a flag value indicating the presence or absence of failure detection to be notified to the host device;
a mask flag that holds a flag value indicating whether notification of the failure detection is suppressed, and that sets the flag value to indicate suppression when the error flag changes to a flag value indicating that a failure is detected;
an AND logic circuit that registers, in the error flag, the logical AND of an error detection signal indicating the presence or absence of error detection in the error processing unit and the inverted value of the mask flag;
a counter that starts counting at regular intervals when the mask flag changes to a flag value indicating suppression; and
a comparator that compares the count value of the counter with a preset mask count threshold and, according to the comparison result, resets the flag value of the mask flag and the count value of the counter.

The failure processing apparatus according to claim 1,
wherein the failure log control unit comprises:
a position information holding unit that holds error position information for the error data, notified from the error processing unit in response to error detection in the error processing unit;
an error count control unit that, for each data block, counts the number of matches between the position information of the data block and the error position information as the number of error occurrences in the data block;
a failure log information holding unit that holds, as the failure log information, sets each consisting of the error position information of the error data and the number of error occurrences in the data block; and
a comparator that, for each set, compares the number of error occurrences of the set with the error count threshold and outputs the comparison result as the failure detection.

A failure processing method used in a failure processing apparatus that detects the occurrence of a failure based on data acquired from a target data processing device and notifies a host device, the method comprising:
an error processing step in which an error processing unit detects a correctable error in data acquired from the data processing device;
a failure report control step in which a failure report control unit, in response to error detection by the error processing unit, notifies the host device of failure detection indicating that the error processing unit has detected a failure, and suppresses notification of failure detection in response to subsequent error detection over a predetermined suppression period from that error detection; and
a failure log control step in which a failure log control unit, in response to error detection by the error processing unit, counts and holds, as failure log information, the number of occurrences of the error for each data block to which the error data in which the error was detected belongs, and notifies the host device of the failure detection when the number of error occurrences for any data block reaches a preset error count threshold.