JP2007265157A

JP2007265157A - System and method for detecting fault of I/O device

Info

Publication number: JP2007265157A
Application number: JP2006091028A
Authority: JP
Inventors: Seiichi Ishizuka; 誠一石塚
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2006-03-29
Filing date: 2006-03-29
Publication date: 2007-10-11

Abstract

PROBLEM TO BE SOLVED: To provide an abnormality detection system for an I/O device that can execute processing according to an abnormality of the I/O device even when the processor stops operating.

SOLUTION: An external storage device management means 101 counts the number of normal terminations and the number of abnormal terminations of reads/writes to a magnetic disk device 130. The external storage device management means 101 compares the counted numbers of normal and abnormal terminations with a predetermined threshold and determines whether an abnormality has occurred in the magnetic disk device 130. A watchdog timer management means 102 periodically resets a watchdog timer 111 while the magnetic disk device 130 operates normally. When the watchdog timer 111 times out, a BMC (Base Management Controller) 110 generates an NMI interrupt and informs a processor 100 that the magnetic disk device 130 is abnormal.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a failure detection system and method for an external device, and more particularly to a failure detection system and method that detect that a failure has occurred in an external device such as an I/O device.

In general, a computer system includes a processor that performs system control and an I/O device such as an external storage device, and the processor and the I/O device are connected via a bus. One method for detecting a failure of an I/O device in such a computer system is the technique described in Patent Document 1. This technique detects the occurrence of I/O timeouts and, among a plurality of I/O devices, identifies an I/O device in which I/O timeouts occur frequently as having a fixed failure.

Patent Document 1: JP-A-5-334205

In a conventional computer system, when, for example, a magnetic disk becomes unresponsive, the OS can no longer operate and the user has to perform a hard reset. Alternatively, a watchdog timer resets the system when the OS has been unresponsive for a certain time. However, when a failure is detected from the unresponsiveness of the magnetic disk, intermittent unresponsiveness can cause a failure to be detected erroneously, and an unnecessary reset may occur. There is also the problem that normal interrupt processing does not run when the processor stops with interrupts disabled.

An object of the present invention is to solve the above problems of the prior art and to provide an I/O device abnormality detection system and method that can execute processing according to an abnormality of the I/O device even when the processor has stopped operating. Another object of the present invention is to provide an I/O device abnormality detection system and method that can prevent erroneous detection of an abnormality of the I/O device.

To achieve the above object, the I/O device failure detection system of the present invention is a failure detection system that detects that a failure has occurred in an I/O device, and comprises: timer means that advances a count as time elapses and generates a count-out when the count value reaches a predetermined value; I/O device management means that monitors data input/output with the I/O device, counts the number of normal terminations and the number of abnormal terminations of data input/output, and determines whether the I/O device is operating normally based on at least one of the counted number of normal terminations and the counted number of abnormal terminations; timer management means that resets the count value of the timer means to an initial value when the I/O device management means determines that the I/O device is operating normally; and interrupt generation means that generates an NMI interrupt when the timer means generates a timeout.

The failure detection method of the present invention is a failure detection method that detects that a failure has occurred in an I/O device. While a timer count advances as time elapses, the method monitors data input/output with the I/O device and counts the number of normal terminations and the number of abnormal terminations of data input/output; determines whether the I/O device is operating normally based on at least one of the counted number of normal terminations and the counted number of abnormal terminations; resets the count value of the timer to an initial value when the I/O device is determined to be operating normally; and generates an NMI interrupt when the timer count reaches a predetermined value.

In the I/O device failure detection system and method of the present invention, whether the I/O device is operating normally is determined based on at least one of the number of normal terminations and the number of abnormal terminations of data input/output, and when the device is determined to be operating normally, the timer count is reset to its initial state. When an abnormality occurs in the I/O device, the timer is not reset, so a timeout occurs, and the resulting NMI interrupt lets the processor that receives it detect the abnormality of the I/O device. Because the present invention uses at least one of the number of normal terminations and the number of abnormal terminations for this determination, an appropriate determination on that basis distinguishes an abnormal termination that merely occurred by chance from one caused by an abnormality in the device, and prevents erroneous detection of recoverable errors. When the I/O device becomes unresponsive, the means (processor) that counts the normal and abnormal terminations may stop operating altogether. Even in that case, the timer is not reset, so a timeout occurs and an NMI interrupt is generated, and the processor can detect the abnormality of the I/O device and execute the corresponding processing.

In the I/O device failure detection system of the present invention, the timer means may include a watchdog timer that counts down a count value from a predetermined initial value as time elapses and generates a timeout when the count value reaches 0. In this case, if the timer control means does not issue a reset before the count value reaches 0, either because the I/O device management means has determined that the I/O device is operating abnormally or because the I/O device management means itself has stopped operating for some reason, the countdown proceeds until the count value reaches 0, an NMI interrupt is generated, and the abnormality of the I/O device can be detected.

In the I/O device failure detection system of the present invention, the I/O device management means may determine that the I/O device is not operating normally when the number of abnormal terminations exceeds a predetermined threshold. For example, a threshold that serves as the criterion for judging the I/O device abnormal is set for a predetermined number of data input/output operations, and whether the I/O device is abnormal is determined by whether the number of abnormal terminations exceeds that threshold.

In the I/O device failure detection system of the present invention, the I/O device management means may determine that the I/O device is operating normally when the number of normal terminations exceeds a predetermined threshold. For example, a threshold that serves as the criterion for judging the I/O device normal is set for a predetermined number of data input/output operations, and whether the I/O device is normal is determined by whether the number of normal terminations exceeds that threshold.

In the I/O device failure detection system of the present invention, the number of abnormal terminations may include the number of I/O timeout occurrences, and the I/O device management means may determine that the I/O device is not operating normally when the ratio of the number of I/O timeout occurrences to the number of normal terminations exceeds a predetermined value. For example, a threshold for this ratio, serving as the criterion for judging the I/O device abnormal, is set for a predetermined number of data input/output operations, and whether the I/O device is abnormal is determined by whether the ratio of the number of I/O timeout occurrences to the number of normal terminations exceeds that threshold.

In the I/O device failure detection system of the present invention, the I/O device may include an external storage device. The I/O device may also include a network device.

In the I/O device failure detection system and method of the present invention, whether the I/O device is operating normally is determined based on at least one of the number of normal terminations and the number of abnormal terminations of data input/output, and when the device is determined to be operating normally, the timer count is reset to its initial state. When an abnormality occurs in the I/O device, or when the processor stops operating because the I/O device has become unresponsive, the timer is not reset, so a timeout occurs and an NMI interrupt is generated. Inputting this NMI interrupt to the processor allows the processor to execute processing according to the abnormality of the I/O device. In addition, because the present invention uses at least one of the number of normal terminations and the number of abnormal terminations for this determination, an appropriate determination on that basis distinguishes an abnormal termination that merely occurred by chance from one caused by an abnormality in the device, and prevents erroneous detection of recoverable errors.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 shows the configuration of an I/O device failure detection system according to an embodiment of the present invention. The failure detection system 10 includes a processor 100, a BMC (Base Management Controller) 110, and a SCSI controller 120. The processor 100 is connected to the SCSI controller 120 via an I/O bus 150. The SCSI controller 120 is connected to a magnetic disk device 130 via a SCSI bus 140.

The processor 100 includes an external storage device management means 101 and a watchdog timer management means 102. The BMC 110 has a watchdog timer 111. Each means in the processor 100 and the BMC 110 is realized by program operation. The external storage device management means 101 counts the number of normal terminations and the number of abnormal terminations of reads/writes to the magnetic disk device 130, stores them in a memory (not shown), and checks the normality of the external storage device (magnetic disk device) 130 at startup of the processor 100, during system operation, or at system shutdown.

The watchdog timer management means 102 issues a reset of the watchdog timer 111 to the BMC 110. When the watchdog timer 111 times out, the BMC 110 generates an NMI interrupt (non-maskable interrupt) 160 to the processor 100 and notifies the processor 100 that the magnetic disk device 130 is abnormal.

FIG. 2 shows the operation procedure of the processor 100. At its own startup or when an external storage device management command is issued, the processor 100 issues a reset of the watchdog timer 111 through the watchdog timer management means 102 and resets the watchdog timer 111 to its initial state (step A1). The external storage device management means 101 performs a read/write on the magnetic disk device 130 via the I/O bus 150, the SCSI controller 120, and the SCSI bus 140 (step A2). The external storage device management means 101 determines whether the read/write terminated normally (step A3) and, if so, performs normal-count addition processing to update the cumulative number of normal terminations (step A4).

When the external storage device management means 101 determines that the read/write terminated abnormally, it determines whether the abnormal termination was caused by a timeout (step A5). If the abnormal termination was not caused by a timeout, it performs abnormal-count addition processing and updates the cumulative number of abnormal terminations (step A6). If the abnormal termination was caused by a timeout, it performs timeout-count addition processing and updates the cumulative number of timeout occurrences (step A7). The external storage device management means 101 then determines whether the read/write has been performed N times (step A8) and, if not, returns to step A2. This yields the cumulative numbers of normal terminations, abnormal terminations, and timeouts over N reads/writes.
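The counting in steps A2 to A8 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `IoResult`, `IoCounts`, `run_io_batch`, and the `do_io` callback are assumptions introduced here, and the actual read/write is replaced by a caller-supplied function.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable


class IoResult(Enum):
    """Hypothetical outcome of one read/write (not named in the source)."""
    OK = auto()       # terminated normally
    ERROR = auto()    # abnormal termination not caused by a timeout
    TIMEOUT = auto()  # abnormal termination caused by an I/O timeout


@dataclass
class IoCounts:
    normal: int = 0    # cumulative normal terminations (step A4)
    abnormal: int = 0  # cumulative non-timeout abnormal terminations (step A6)
    timeouts: int = 0  # cumulative timeout occurrences (step A7)


def run_io_batch(do_io: Callable[[], IoResult], n: int) -> IoCounts:
    """Perform n reads/writes and tally their outcomes (steps A2 to A8)."""
    counts = IoCounts()
    for _ in range(n):                    # step A8: repeat until N operations are done
        result = do_io()                  # step A2: perform the read/write
        if result is IoResult.OK:         # step A3: normal termination?
            counts.normal += 1            # step A4
        elif result is IoResult.TIMEOUT:  # step A5: caused by a timeout?
            counts.timeouts += 1          # step A7
        else:
            counts.abnormal += 1          # step A6
    return counts
```

Keeping timeouts in a separate counter follows the flow of FIG. 2, where steps A6 and A7 update different totals.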

After the processor 100 has executed N reads/writes, the external storage device management means 101 performs normality determination processing (step A9) and judges the normality of the result (step A10). In step A9, for example, the device is judged abnormal when the cumulative number of abnormal terminations exceeds a predetermined threshold. Normality is also judged from the ratio of the cumulative number of timeouts to the cumulative number of normal terminations. Specifically, the device is judged abnormal when (number of timeouts) / (number of normal terminations) exceeds a predetermined threshold.
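The two criteria of step A9 can be written as a single predicate. The sketch below is illustrative only: the function name, the parameter names, and the threshold values used in the usage lines are assumptions, since the source gives no concrete numbers. The ratio test is written as a multiplication so that a batch with zero normal terminations does not divide by zero; treating such a batch as abnormal whenever any timeout occurred is also an assumption.

```python
def is_device_normal(normal: int, abnormal: int, timeouts: int,
                     abnormal_limit: int, timeout_ratio_limit: float) -> bool:
    """Return True when the counts from N reads/writes look healthy (step A9)."""
    # Criterion 1: too many non-timeout abnormal terminations.
    if abnormal > abnormal_limit:
        return False
    # Criterion 2: timeouts / normal > limit, rearranged to avoid dividing by zero.
    if timeouts > timeout_ratio_limit * normal:
        return False
    return True


# Usage with made-up thresholds: at most 2 abnormal terminations,
# and at most 1 timeout per 2 normal terminations.
print(is_device_normal(normal=8, abnormal=1, timeouts=1,
                       abnormal_limit=2, timeout_ratio_limit=0.5))
print(is_device_normal(normal=4, abnormal=0, timeouts=3,
                       abnormal_limit=2, timeout_ratio_limit=0.5))
```

With these values the first call reports a normal device and the second an abnormal one, because 3 timeouts against 4 normal terminations exceeds the assumed ratio limit of 0.5.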

When the external storage device management means 101 judges the result normal in step A10, it performs sleep processing (step A11), waits for a predetermined time, and then returns to step A1, where the watchdog timer management means 102 resets the watchdog timer 111. Steps A2 to A10 are then executed again. When the result is judged abnormal, the processing is stopped (step A12).
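The outer loop of FIG. 2 (steps A1 and A9 to A12) might look like the sketch below. The callbacks `reset_watchdog` and `run_check`, the `max_cycles` bound, and the injectable `sleep` function are all hypothetical conveniences added so the loop can be exercised without hardware; the source describes only the control flow.

```python
import time
from typing import Callable, Optional


def monitor_loop(reset_watchdog: Callable[[], None],
                 run_check: Callable[[], bool],
                 sleep_seconds: float,
                 max_cycles: Optional[int] = None,
                 sleep: Callable[[float], None] = time.sleep) -> int:
    """Reset the watchdog, check the device, sleep, and repeat.

    Returns the number of cycles executed. The loop ends when run_check()
    reports an abnormal device (step A12) or when max_cycles is reached.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        reset_watchdog()      # step A1: re-arm the watchdog timer
        cycles += 1
        if not run_check():   # steps A2 to A10: N reads/writes plus the judgement
            break             # step A12: stop, so the watchdog is never reset again
        sleep(sleep_seconds)  # step A11: wait before the next cycle
    return cycles
```

The point of the design is that an abnormal result needs no explicit alarm: simply leaving the loop withholds the next reset, and the watchdog timeout does the rest. For that to work, the watchdog's initial value must correspond to a longer period than one full cycle including the sleep.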

FIG. 3 shows the operation procedure of the BMC 110. When the watchdog timer management means 102 issues a reset in step A1 of FIG. 2, the watchdog timer 111 resets its count value WDT to the initial value (step B1). The watchdog timer 111 then decrements the count value WDT by 1 (step B2) and determines whether the count value WDT has reached 0 (step B3).

If the count value WDT has not yet reached 0, the process returns to step B2 and the count value WDT is decremented by 1 again. The count value WDT thus approaches 0 as time elapses. When the watchdog timer management means 102 issues a reset, the process moves to step B1 and the count value WDT is set to the initial value. When the count value WDT reaches 0, the NMI interrupt 160 is generated to notify the CPU that a timeout has occurred (step B4).
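The countdown of steps B1 to B4 can be modeled in a few lines. This is a software simulation under stated assumptions, not BMC firmware: the class name `WatchdogTimer`, the `tick()` method standing in for the passage of one time unit, and the `on_timeout` callback standing in for the NMI interrupt 160 are all introduced here for illustration. Re-arming an expired timer on reset is also an assumption, since the source does not say what happens after step B4.

```python
from typing import Callable


class WatchdogTimer:
    """Countdown timer that fires a callback once when the count reaches 0."""

    def __init__(self, initial_count: int, on_timeout: Callable[[], None]) -> None:
        if initial_count <= 0:
            raise ValueError("initial_count must be positive")
        self.initial_count = initial_count
        self.on_timeout = on_timeout  # stands in for raising the NMI interrupt
        self.count = initial_count
        self.expired = False

    def reset(self) -> None:
        """Step B1: reload the count value WDT with its initial value."""
        self.count = self.initial_count
        self.expired = False          # assumption: a reset re-arms an expired timer

    def tick(self) -> None:
        """Steps B2 to B4: decrement once and fire the callback on reaching 0."""
        if self.expired:
            return                    # already timed out; do not fire again
        self.count -= 1               # step B2
        if self.count == 0:           # step B3
            self.expired = True
            self.on_timeout()         # step B4


# Usage: a reset before expiry postpones the timeout.
events = []
wdt = WatchdogTimer(3, lambda: events.append("NMI"))
wdt.tick()
wdt.tick()
wdt.reset()   # the monitored side is still healthy
wdt.tick()
wdt.tick()
wdt.tick()    # no reset arrived in time, so the callback fires here
```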

In the failure detection system 10, when the operation of the magnetic disk device 130 is judged normal in step A10 of FIG. 2, the watchdog timer 111 is reset. When the operation is judged not normal in step A10, the process moves to step A12 and stops, so the watchdog timer 111 is not reset. Consequently, when a failure occurs in the magnetic disk device 130, the watchdog timer 111 times out and the NMI interrupt 160 is input to the processor 100. Likewise, when the OS becomes unable to operate, the processing shown in FIG. 2 stops, the watchdog timer 111 times out, and the NMI interrupt 160 is generated.

In this embodiment, while the magnetic disk device 130 operates normally, the processor 100 periodically resets the watchdog timer 111; when an abnormality occurs, the watchdog timer 111 is not reset, which causes a timeout. When the magnetic disk device 130 becomes unresponsive, the processor 100 may be unable to operate, but even in that case the watchdog timer 111 is not reset and a timeout occurs. When the watchdog timer 111 times out, the BMC 110 inputs the NMI interrupt 160 to the processor 100. As a result, the processor 100 can be notified that the magnetic disk device 130 is abnormal both when an abnormality occurs in the magnetic disk device 130 and when the processor 100 has become unable to operate. In addition, observing the numbers of normal read/write terminations, abnormal terminations, and I/O timeout occurrences, and judging from them whether the magnetic disk device 130 is operating normally, prevents erroneous detection of recoverable errors.

In the above embodiment, the magnetic disk device 130 was used as an example of the I/O device subject to abnormality detection, but the invention is not limited to this. FIG. 4 shows the configuration of an abnormality detection system according to a modification of the present invention. In the abnormality detection system 10a of the modification, the input/output management means 101a corresponds to the external storage device management means 101 of FIG. 1, and the I/O controller 120a corresponds to the SCSI controller 120. The magnetic disk device 130 and a LAN controller 180 are connected to the I/O controller 120a via a bus 170. The I/O devices connected to the I/O controller 120a are not limited to the magnetic disk device 130 and the LAN controller 180 and can be any of various I/O devices.

The present invention has been described above based on its preferred embodiment. However, the I/O device abnormality detection system and method of the present invention are not limited to the above embodiment, and configurations obtained by applying various modifications and changes to the above embodiment are also included in the scope of the present invention.

FIG. 1 is a block diagram showing the configuration of an I/O device failure detection system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation procedure of the processor 100.
FIG. 3 is a flowchart showing the operation procedure of the BMC 110.
FIG. 4 is a block diagram showing the configuration of an I/O device failure detection system according to a modification of the present invention.

Explanation of symbols

10: Failure detection system
100: Processor
101: External storage device management means
102: Watchdog timer management means
110: BMC (Base Management Controller)
111: Watchdog timer
120: SCSI controller
130: Magnetic disk device
140: SCSI bus
150: I/O bus
160: NMI interrupt

Claims

1. A failure detection system for an I/O device, which detects that a failure has occurred in the I/O device, comprising:
timer means for advancing a count as time elapses and generating a count-out when the count value reaches a predetermined value;
I/O device management means for monitoring data input/output with the I/O device, counting the number of normal terminations and the number of abnormal terminations of data input/output, and determining whether the I/O device is operating normally based on at least one of the counted number of normal terminations and the counted number of abnormal terminations;
timer management means for resetting the count value of the timer means to an initial value when the I/O device management means determines that the I/O device is operating normally; and
interrupt generation means for generating an NMI interrupt when the timer means generates a timeout.

2. The I/O device failure detection system according to claim 1, wherein the timer means includes a watchdog timer that counts down a count value from a predetermined initial value as time elapses and generates a timeout when the count value reaches 0.

3. The I/O device failure detection system according to claim 1, wherein the I/O device management means determines that the I/O device is not operating normally when the number of abnormal terminations exceeds a predetermined threshold.

4. The I/O device failure detection system according to claim 1, wherein the I/O device management means determines that the I/O device is operating normally when the number of normal terminations exceeds a predetermined threshold.

5. The I/O device failure detection system according to claim 1, wherein the number of abnormal terminations includes the number of I/O timeout occurrences, and the I/O device management means determines that the I/O device is not operating normally when the ratio of the number of I/O timeout occurrences to the number of normal terminations exceeds a predetermined value.

6. The I/O device failure detection system according to claim 1, wherein the I/O device includes an external storage device.

7. The I/O device failure detection system according to claim 1, wherein the I/O device includes a network device.

8. A failure detection method for detecting that a failure has occurred in an I/O device, comprising:
while advancing a count of a timer as time elapses,
monitoring data input/output with the I/O device and counting the number of normal terminations and the number of abnormal terminations of data input/output;
determining whether the I/O device is operating normally based on at least one of the counted number of normal terminations and the counted number of abnormal terminations;
resetting the count value of the timer to an initial value when the I/O device is determined to be operating normally; and
generating an NMI interrupt when the timer count reaches a predetermined value.