JP2008171217A

JP2008171217A - Computer system for estimating preventive maintenance of disk drive in disk sub-system

Info

Publication number: JP2008171217A
Application number: JP2007003984A
Authority: JP
Inventors: Hiromichi Shibata; 洋道柴田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2007-01-12
Filing date: 2007-01-12
Publication date: 2008-07-24

Abstract

PROBLEM TO BE SOLVED: To collect error information from the disk drives that make up a logical unit in a disk subsystem, drives that the computer system cannot itself recognize, and to estimate a preventive maintenance replacement time from the collected error information.

SOLUTION: An error information collection and analysis function is added to the device driver that controls a host bus adapter mounted on the computer system. The function collects error information from the disk drives that make up a logical unit (LU) in the disk subsystem, and the preventive maintenance replacement time estimated from that information is provided to the system administrator. Because the function resides in the device driver and the estimated preventive replacement time is recorded as a log in the service processor (SVP) of the computer system, the resulting computer system is unaffected by the choice of operating system or operating system version.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a system that collects error information from the disk drives constituting a logical unit in a disk subsystem connected to a host bus adapter controlled by a device driver, and to a computer system that uses the collected error information to estimate when those disk drives will need preventive maintenance.

A disk drive automatically assigns a replacement block to any defective data block detected during a read or write. It also records error information such as the number of read or write retries and the number of times a replacement block was allocated because the retries failed. This automatic reallocation is handled inside the disk drive and is not reported to the computer system. A disk drive that has been in operation for many years becomes more likely to incur such read or write retries and automatic reallocations, and its access performance may degrade.

For a disk drive that the computer system can physically recognize, this error information can be collected by issuing an error information collection command to the drive (for SCSI, commands such as READ DEFECT DATA and LOG SENSE). Analyzing the collected information then makes preventive maintenance replacement of the disk drive possible.
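As a rough illustration of the SCSI commands named above, the following Python sketch builds the 10-byte command descriptor blocks (CDBs) for LOG SENSE and READ DEFECT DATA (10). The function names, the default allocation length, and the choice of the read and write error counter log pages are assumptions made for illustration; the patent does not specify them. The sketch only assembles the bytes. Sending them to a drive requires an OS-specific pass-through interface that is not shown.

```python
import struct

LOG_SENSE = 0x4D             # SCSI LOG SENSE operation code
READ_DEFECT_DATA_10 = 0x37   # SCSI READ DEFECT DATA (10) operation code

WRITE_ERROR_COUNTER_PAGE = 0x02  # standard log page: write error counters
READ_ERROR_COUNTER_PAGE = 0x03   # standard log page: read error counters


def log_sense_cdb(page_code: int, alloc_len: int = 512) -> bytes:
    """Build a LOG SENSE CDB asking for cumulative values (PC = 01b)."""
    if not 0 <= page_code <= 0x3F:
        raise ValueError("page code is a 6-bit field")
    page_control = 0b01  # cumulative values
    return struct.pack(
        ">BBBBBHHB",
        LOG_SENSE,
        0x00,                             # SP = 0, PPC = 0
        (page_control << 6) | page_code,  # PC in bits 7-6, page code in bits 5-0
        0x00,                             # subpage code
        0x00,                             # reserved
        0x0000,                           # parameter pointer
        alloc_len,                        # allocation length (big-endian)
        0x00,                             # control
    )


def read_defect_data_cdb(grown: bool = True, primary: bool = False,
                         list_format: int = 0, alloc_len: int = 512) -> bytes:
    """Build a READ DEFECT DATA (10) CDB for the grown and/or primary list."""
    flags = (int(primary) << 4) | (int(grown) << 3) | (list_format & 0x07)
    # Bytes 3-6 are reserved, hence the four pad bytes.
    return struct.pack(">BBB4xHB", READ_DEFECT_DATA_10, 0x00, flags,
                       alloc_len, 0x00)
```

The grown defect list is requested by default because it reflects blocks reallocated during operation, which is the kind of information the text describes the drive as recording.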

However, for the disk drives that make up a logical unit (LU) in a disk subsystem, unless the computer system can recognize the disk controller in the disk subsystem as an I/O device, the computer system sees only the logical units as I/O devices and cannot collect information about the physical drives.

Patent Document 1, meanwhile, discloses a mechanism that collects trace information within a computer system and thereby provides the information leading up to a failure.

Patent Document 2 discloses a mechanism in which a management computer equipped with a service processor board that performs fault monitoring and power control is connected over a network to the service processor of the managed computer system, so that fault monitoring and power control of the computer system can be performed remotely.

Patent Document 3 discloses a mechanism that counts how often consumables in a system, such as disk drives, liquid crystal displays, and batteries, are used and sends the deterioration state of each consumable to a management host based on the count.

Patent Document 1: JP 2006-202076 A
Patent Document 2: Japanese Patent Laid-Open No. 9-50386
Patent Document 3: JP 2001-325381 A

In recent years, computer systems have come to play an indispensable role in core business operations, and prolonged outages caused by failures are not tolerated. It is therefore necessary to give the system administrator information about components that are likely to fail before a failure occurs. In addition, a program that collects fault information must take into account its compatibility with the operating system running on the managed system, including the version of that operating system.

Many failures occur in consumables such as disk drives that operate mechanically within the system. If the maintenance replacement time is set uniformly from the operating hours of these consumables, then in a disk subsystem containing many disk drives, all of the drives come due for replacement at once. It is therefore necessary to collect error information from each individual disk drive in the disk subsystem and judge its state of deterioration.

However, for the disk drives that make up a logical unit (LU) in a disk subsystem, unless the computer system can recognize the disk controller in the disk subsystem as an I/O device, the computer system sees only the logical units as I/O devices and cannot collect information about the physical disk drives.

To solve the problems described above, the following objectives must be addressed.

The first objective is to collect error information from the disk drives constituting the logical units in a disk subsystem, drives that the computer system cannot recognize, and to estimate the preventive maintenance replacement time from the collected error information.

The second objective is to provide the system administrator automatically with the error information collected from the disk drives constituting the logical units in the disk subsystem, together with the preventive maintenance replacement time estimated from it.

The third objective is for the system to be independent of the operating system running on the computer system and of the version of that operating system.

The present invention aims to achieve these objectives.

As its means of solving these problems, the present invention adds an error information collection and analysis function to the device driver that controls the host bus adapter mounted on the computer system. This function collects error information from the disk drives constituting a logical unit (LU) in the disk subsystem, and the preventive maintenance replacement time estimated from the collected error information is provided to the system administrator.

Furthermore, rather than collecting and analyzing disk drive error information with dedicated application software, the invention adds the error information collection and analysis function to the device driver that controls the host bus adapter and records the preventive replacement time estimated from the collected error information as a hardware event log in the service processor (SVP) of the computer system. This provides a computer system that is unaffected by the choice of operating system or operating system version.

According to the present invention, error information from the disk drives constituting a logical unit in a disk subsystem connected to the computer system can be collected on the computer system, and the preventive maintenance time estimated from that error information can be provided to the system administrator.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows a disk subsystem 102 in which the disk controller 103 and the disk drives 107, 108, and 109 can be recognized as I/O devices. FIG. 2 shows a disk subsystem 202 in which the disk controller 203 can be recognized as a network device. Embodiment 1 describes the means for collecting the error information 112, 113, and 114 of the disk drives 107, 108, and 109 mounted in the disk subsystem 102 of FIG. 1. Embodiment 2 describes the means for collecting the error information 212, 213, and 214 of the disk drives 207, 208, and 209 of FIG. 2.

FIG. 1 shows a host bus adapter 101 mounted on a computer system 100 and connected to the disk subsystem 102 via a SAN or IP-SAN. The computer system 100 can recognize the disk controller 103 and the disk drives 107, 108, and 109 in the disk subsystem 102 as I/O devices. An error information collection and analysis unit 105 is newly provided in the device driver 104 that controls the host bus adapter 101. This unit collects the error information 112, 113, and 114 of the disk drives 107, 108, and 109 that make up the logical units (LU0, LU1) 111 and 110 under the disk controller 103.

The error information collection operation in this embodiment is described with reference to FIG. 3, a flowchart of the process that collects the error information 112, 113, and 114 of the disk drives 107, 108, and 109.

In FIG. 3, the error information collection and analysis unit 105 newly provided in the device driver 104 of FIG. 1 first issues an error information collection command to disk drive (i) under the disk controller 103 (S301). Here i is a variable whose initial value is 0 (S300) and which takes the values i = 0, 1, 2, .... Taking SCSI as an example, the error information collection command corresponds to READ DEFECT DATA, LOG SENSE, and the like.

Next, the device driver 104 receives from the disk controller 103 the error information for disk drive (i), one of the drives constituting the logical units (LU0, LU1) 111 and 110 (S302).

The error information collection and analysis unit 105 determines whether steps S301 and S302 have been performed A times, once for each disk drive constituting the logical units (LU0, LU1) 111 and 110 (S303). If so, the unit estimates the preventive maintenance time from the received error information 112, 113, and 114 of the disk drives (i) 107, 108, and 109 (S305). Here A is a constant whose value is fixed and known when the disk subsystem 102 is built. In the configuration of FIG. 1, A = 3.

The error information collection and analysis unit 105 then starts the process of displaying, as a hardware event log, the preventive maintenance time estimated from the error information 112, 113, and 114 collected from the disk drives 107, 108, and 109 (S306).
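The flow of FIG. 3 can be sketched in Python as follows. This is only an outline under stated assumptions: `fetch`, `estimate`, and `report` are hypothetical callables standing in for the command issue and receipt (S301, S302), the estimation (S305), and the start of log display (S306), and the error information is modeled as a plain dictionary. None of these names come from the patent.

```python
from typing import Callable, Dict, List

ErrorInfo = Dict[str, int]  # assumed shape of one drive's error information


def run_collection_cycle(
    num_drives: int,
    fetch: Callable[[int], ErrorInfo],
    estimate: Callable[[List[ErrorInfo]], str],
    report: Callable[[str], None],
) -> str:
    """Collect error information from drives 0..A-1, then estimate and report."""
    collected: List[ErrorInfo] = []
    # S300: i starts at 0. S303: the loop ends once A drives have been handled.
    for i in range(num_drives):
        # S301: issue the collection command to drive i.
        # S302: receive the error information for drive i.
        collected.append(fetch(i))
    result = estimate(collected)  # S305: estimate the preventive maintenance time
    report(result)                # S306: start displaying the hardware event log
    return result
```

With A = 3, as in FIG. 1, `fetch` is called for i = 0, 1, and 2 before the estimate is made.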

FIG. 2 shows a host bus adapter 201 mounted on a computer system 200 and connected via a SAN or IP-SAN to a disk subsystem 202 whose disk drives 207, 208, and 209 cannot be recognized as I/O devices. The computer system 200 recognizes only the logical units (LU0, LU1) 211 and 210 as I/O devices, so the error information 212, 213, and 214 of the disk drives 207, 208, and 209 cannot be collected over the SAN or IP-SAN. Instead, the disk controller 203 is controlled through its management NIC port 217, and the error information 212, 213, and 214 of the disk drives 207, 208, and 209 constituting the logical units (LU0, LU1) 211 and 210 is collected that way.

The error information collection operation in this embodiment is described with reference to FIG. 4, a flowchart of the process that collects the error information 212, 213, and 214 of the disk drives 207, 208, and 209.

In FIG. 4, the error information collection and analysis unit 205 newly provided in the device driver 204 first controls the disk controller 203, which is connected over the network to the NIC port 216 of the host bus adapter 201, and requests the error information of disk drive (i) from the disk controller 203 (S401). Here i is a variable whose initial value is 0 (S300) and which takes the values i = 0, 1, 2, ....

Next, the disk controller 203 collects the error information for disk drive (i), one of the drives constituting the logical units (LU0, LU1) 211 and 210, and the device driver 204 receives that error information from the disk controller 203 (S402).

The error information collection and analysis unit 205 determines whether steps S401 and S402 have been performed A times, once for each disk drive constituting the logical units (LU0, LU1) 211 and 210 (S403). If so, the unit estimates the preventive maintenance time from the received error information 212, 213, and 214 of the disk drives (i) 207, 208, and 209 (S405). Here A is a constant whose value is fixed and known when the disk subsystem 202 is built. In the configuration of FIG. 2, A = 3.

The error information collection and analysis unit 205 then starts the process of displaying, as a hardware event log, the preventive maintenance time estimated from the error information 212, 213, and 214 collected from the disk drives 207, 208, and 209 (S406).
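The two embodiments differ only in how the error information reaches the device driver: directly by command in Embodiment 1, and by a request to the disk controller over the management network in Embodiment 2. One way to express that in code is a common source interface with two implementations, sketched below. The class names, the injected callables, and the controller address parameter are all hypothetical and serve only to illustrate the separation; the patent does not define a request protocol for the disk controller.

```python
from typing import Callable, Dict, List, Protocol

ErrorInfo = Dict[str, int]  # assumed shape of one drive's error information


class ErrorSource(Protocol):
    """Anything that can return the error information of drive i."""

    def fetch(self, drive_index: int) -> ErrorInfo: ...


class DirectCommandSource:
    """Embodiment 1: the drive is visible, so a command is sent to it."""

    def __init__(self, send_command: Callable[[int], ErrorInfo]) -> None:
        self._send_command = send_command  # hypothetical pass-through hook

    def fetch(self, drive_index: int) -> ErrorInfo:
        return self._send_command(drive_index)


class ControllerRequestSource:
    """Embodiment 2: the disk controller is asked over the management NIC."""

    def __init__(self, request: Callable[[str, int], ErrorInfo],
                 controller_address: str) -> None:
        self._request = request            # hypothetical network request hook
        self._address = controller_address

    def fetch(self, drive_index: int) -> ErrorInfo:
        return self._request(self._address, drive_index)


def collect(source: ErrorSource, num_drives: int) -> List[ErrorInfo]:
    """Gather error information from every drive through either source."""
    return [source.fetch(i) for i in range(num_drives)]
```

Because both sources return the same structure, the estimation step that follows does not need to know which embodiment supplied the data.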

The host bus adapter 201 of FIG. 2 may be a LAN adapter carrying only the NIC 216, or a host bus adapter that also carries the FC or iSCSI controller 219 on the same board. In FIG. 2 the host bus adapter 201 carries the NIC 216 and the FC or iSCSI controller 219, which have different functions, on the same board, but the two may also be consolidated into a single component.
(Means for estimating the preventive maintenance time)
The error information collection and analysis units 105 and 205 of the device drivers 104 and 204 estimate the preventive maintenance time from the error information (i) collected from disk drive (i) as follows. Here i is a variable, and in FIGS. 1 and 2, i = 0, 1, 2.

For example, the error information of each disk drive (i) constituting a logical unit is collected once a day at a fixed time, H hours M minutes, and the collected values are denoted as follows. Here H is a constant from 0 to 23 and M is a constant from 0 to 59; both are set by the system administrator and can be changed.
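To make the once-a-day schedule concrete, the following Python sketch computes the next collection time from the administrator-set H and M. The function name and the choice to roll over to the next day when today's time has already passed are assumptions made for illustration; the patent only states that collection happens daily at H hours M minutes.

```python
from datetime import datetime, timedelta


def next_collection_time(now: datetime, hour: int, minute: int) -> datetime:
    """Return the next occurrence of hour:minute strictly after `now`."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("H must be 0-23 and M must be 0-59")
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's collection time has passed, so schedule tomorrow's.
        candidate += timedelta(days=1)
    return candidate
```

For instance, with H = 3 and M = 30, a call made at 10:00 on a given day yields 03:30 on the following day.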

Day n of September 2006:
  Automatic reallocation count: A_n(i)
  Read error count: R_n(i)
  Write error count: W_n(i)   ... (a)
Day n+1 of September 2006:
  Automatic reallocation count: A_{n+1}(i)
  Read error count: R_{n+1}(i)
  Write error count: W_{n+1}(i)   ... (b)
From (a) and (b), the day-over-day differences are defined as (c):
  A_{n+1}(i) - A_n(i) = A(i)
  R_{n+1}(i) - R_n(i) = R(i)
  W_{n+1}(i) - W_n(i) = W(i)   ... (c)

The error information collection and analysis units 105 and 205 of the device drivers generate the hardware log 115 or 215 when A(i), R(i), and W(i) of (c) exceed the preset values A, R, and W. Here A, R, and W are constants.
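The estimation rule above amounts to subtracting yesterday's counters from today's and comparing each difference with a preset limit. A minimal Python sketch follows. The `Counters` type, its field names, and the function names are hypothetical. The text does not say whether all three limits must be exceeded or any one of them, so this sketch reports each counter that exceeds its limit and leaves that policy to the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Counters:
    reallocations: int  # automatic replacement-block allocations, A
    read_errors: int    # read errors, R
    write_errors: int   # write errors, W


def daily_delta(previous: Counters, current: Counters) -> Counters:
    """Equation (c): per-drive differences A(i), R(i), W(i) between two days."""
    return Counters(
        current.reallocations - previous.reallocations,
        current.read_errors - previous.read_errors,
        current.write_errors - previous.write_errors,
    )


def exceeded_limits(delta: Counters, limits: Counters) -> List[str]:
    """Return the names of the counters whose daily increase exceeds its limit."""
    names = ("reallocations", "read_errors", "write_errors")
    return [n for n in names if getattr(delta, n) > getattr(limits, n)]
```

A caller that wants the stricter reading can generate the hardware log only when all three names are returned; one that wants the looser reading can do so when the list is non-empty.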

This hardware log informs the system administrator of the preventive replacement time of the disk drive and thus prompts preventive maintenance replacement of the drive.
(Means for providing the preventive maintenance time)
The following describes how the error information collected from the disk drives and the preventive maintenance replacement time estimated from it are recorded and displayed.

FIG. 5 shows the means by which the disk drive error information collected by the error information collection and analysis unit 505 of the device driver, and the preventive maintenance replacement time estimated from it, are provided to the system administrator as a hardware event log. The process of recording and displaying the hardware event log in this embodiment is described with reference to the flowchart of FIG. 6.

In FIG. 5, the error information collection and analysis unit 505 of the device driver 504 first stores the error information 512, 513, and 514 collected from the disk drives 507, 508, and 509, together with the preventive replacement time estimated from it, as a hardware log 515 in the management area of the device driver 504 (S600). In the example of FIG. 5, the management area of the device driver 504 physically resides on the disk drive 524 mounted in the computer system 500, but it may instead reside on a recording medium connected externally to the computer system 500. Moreover, because any user of the operating system can read the log stored in the management area of the device driver 504, the content that is stored may be restricted, or the storing itself may be suppressed.

Next, the error information collection and analysis unit 505 of the device driver 504 sets the physical disk number i, the automatic reallocation count A(i), the read error count R(i), and the write error count W(i) in the service processor report register 521 in the host bus adapter 501 (S601) and sets the check bit 520 (S602). The service processor report register 521 and the check bit 520 of the host bus adapter 501 must be registers that the service processor 522 of the computer system 500 can scan in and scan out and that have no effect on the operation of the host bus adapter 501. The check bit 520 is a register through which events such as failures occurring in the host bus adapter 501 can be reported to the service processor 522, and it must be a register whose value the service processor can monitor periodically. FIG. 5 draws the service processor report register 521 and the check bit 520 outside the FC or iSCSI controller 519 and the NIC 516, but they may instead be a register and check bit inside the FC or iSCSI controller 519 or the NIC 516.

The service processor 522 periodically checks all check bits in the computer system 500, including that of the host bus adapter 501 (S603). On finding that the check bit 520 of the host bus adapter 501 is set, the service processor 522 reads the physical disk number i, the automatic reallocation count A(i), the read error count R(i), and the write error count W(i) stored in the report register 521 of that host bus adapter (S604) and resets the check bit 520 and the report register 521 of that adapter (S605).

The service processor 522 generates a hardware event log from the physical disk number i, the automatic reallocation count A(i), the read error count R(i), and the write error count W(i) read from the host bus adapter 501 (S606), stores this hardware event log in the log area in the service processor, and displays the hardware log information on the console 523 connected to the service processor 522 (S607).
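The exchange between the device driver and the service processor in steps S601 to S607 is a simple mailbox handshake: the driver writes a report and raises a flag, and the service processor polls the flag, reads the report, and clears both. The Python sketch below models that handshake in memory. The `AdapterRegisters` class, the tuple layout of the report, and the log line format are assumptions for illustration; the real registers are hardware accessed by scan-in and scan-out, which is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed report layout: (disk number i, A(i), R(i), W(i))
Report = Tuple[int, int, int, int]


@dataclass
class AdapterRegisters:
    """In-memory stand-in for report register 521 and check bit 520."""
    report: Optional[Report] = None
    check_bit: bool = False


def driver_publish(regs: AdapterRegisters, report: Report) -> None:
    """Driver side: fill the report register, then raise the check bit."""
    regs.report = report   # S601
    regs.check_bit = True  # S602


def svp_poll(regs: AdapterRegisters, log_area: List[str]) -> bool:
    """Service processor side: one polling pass over a single adapter."""
    if not regs.check_bit:      # S603: nothing to report
        return False
    i, a, r, w = regs.report    # S604: read the reported values
    regs.check_bit = False      # S605: reset check bit and register
    regs.report = None
    entry = f"disk {i}: reallocations={a} read_errors={r} write_errors={w}"
    log_area.append(entry)      # S606: generate and store the event log entry
    print(entry)                # S607: stand-in for display on the console
    return True
```

Setting the check bit only after the register is filled means a polling pass never observes a raised flag with an empty report, which is why the sketch keeps the S601 and S602 order from the text.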

Storing this hardware event log in the log area in the service processor 522 and displaying it on the console 523 connected to the service processor 522 informs the system administrator of the preventive replacement time of the disk drive and thus prompts preventive maintenance replacement of the drive.

FIG. 1 shows a first embodiment in which a computer system to which the present invention is applied is connected to a disk subsystem (Embodiment 1).
FIG. 2 shows a second embodiment in which a computer system to which the present invention is applied is connected to a disk subsystem (Embodiment 2).
FIG. 3 is a flowchart of the process for collecting disk drive error information (Embodiment 1).
FIG. 4 is a flowchart of the process for collecting disk drive error information (Embodiment 2).
FIG. 5 shows the means for providing disk drive error information and the preventive maintenance replacement time to the system administrator as a hardware event log.
FIG. 6 is a flowchart of the process for recording and displaying the hardware event log.

Explanation of symbols

100, 200, 500: computer system
101, 201, 501: host bus adapter
102, 202, 502: disk subsystem
103, 203, 503: disk controller
104, 204, 504: device driver
105, 205, 505: error information collection and analysis unit
106, 206, 506: FC or iSCSI controller control unit
107, 207, 507: disk drive 0
108, 208, 508: disk drive 1
109, 209, 509: disk drive 2
110, 210, 510: LU1
111, 211, 511: LU0
112, 113, 114, 212, 213, 214, 512, 513, 514: error information
115, 215, 515: hardware log
216, 217, 516, 517: NIC
218, 518: NIC control unit
219, 519: FC or iSCSI controller
520: check bit
521: service processor report register
522: service processor (SVP)
523: console
524: disk drive

Claims

1. A computer system comprising a device driver that includes:
means for issuing an error information collection command to the disk drives constituting a logical unit in a disk subsystem connected to a host bus adapter, and periodically collecting error information from the disk drives; and
means for requesting, from a disk controller in the disk subsystem connected to the host bus adapter, the error information of the disk drives constituting the logical unit in the disk subsystem, and periodically collecting the disk drive error information from the disk controller,
wherein the device driver estimates a preventive maintenance replacement time of the disk drives based on the collected error information.

2. The computer system according to claim 1, further comprising:
means for comparing the error information periodically collected from the disk drives constituting the logical unit in the disk subsystem connected to the host bus adapter, and estimating the preventive replacement time of the disk drives; and
storage means for storing the error information collected from the disk drives and the estimated preventive replacement time in the management area of the device driver.

3. The computer system according to claim 1, wherein the error information periodically collected from the disk drives constituting the logical unit in the disk subsystem connected to the host bus adapter and the estimated preventive replacement time are displayed on a monitor screen.

4. A computer system comprising:
means for issuing an error information collection command to the disk drives constituting a logical unit in a disk subsystem connected to a host bus adapter, and periodically collecting error information from the disk drives;
means for requesting, from a disk controller in the disk subsystem connected to the host bus adapter, the error information of the disk drives constituting the logical unit in the disk subsystem, and periodically collecting the disk drive error information from the disk controller; and
storage means for storing the collected error information and the estimated preventive replacement time in a service processor of the computer system.