JP2008171231A

JP2008171231A - Array disk group maintenance management system, array disk group maintenance management apparatus, array disk group maintenance management method, and array disk group maintenance management program

Info

Publication number: JP2008171231A
Application number: JP2007004294A
Authority: JP
Inventors: Mitsuru Maejima; 満前嶋; Shoichi Murano; 正一村野
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2007-01-12
Filing date: 2007-01-12
Publication date: 2008-07-24
Anticipated expiration: 2027-01-12
Also published as: JP4854522B2

Abstract

PROBLEM TO BE SOLVED: To provide an array disk group maintenance management system, an array disk group maintenance management device, an array disk group maintenance management method, and an array disk group maintenance management program that dynamically change the criterion for preventive replacement of array disks for each generation of array disks.

SOLUTION: In the array disk group maintenance management device 1, a transmitting and receiving part 11 receives log information from a disk device 2; a fault count information acquisition part 12 acquires fault count information based on the log information; a failure rate calculation part 13 calculates the failure rate of the array disks 101 of each generation based on the acquired fault count information; and a threshold update part 14 updates the threshold for the number of recovery errors of the array disks 101 of each generation based on the calculated failure rate. The transmitting and receiving part 11 transmits the updated recovery error threshold information to the disk device 2.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to maintenance management technology for array disk groups, and in particular to an array disk group maintenance management system, an array disk group maintenance management apparatus, an array disk group maintenance management method, and an array disk group maintenance management program that dynamically change the criterion for preventive replacement of array disks for each generation of array disks.

A system is in operation in which multiple fields (the sites where disk devices operate) are connected to a support center via communication lines or the like, the support center (centralized monitoring device) receives error information and other data sent from the disk devices installed in each field, and the disk devices are monitored centrally.
When a redundant array disk device is used as the disk device, operation can continue even if a disk fails (faults), because the data is copied to a spare disk, the HS (Hot Spare).
In such a system, maintenance management of the disk devices installed in each field has conventionally been performed, for example, as follows.

For example, when a disk in a redundant array faults, the data is copied to the HS, the spare disk, and the center is notified that a fault has occurred.
On receiving the notification, the center instructs a CE (customer engineer) to service the disk device. The CE prepares maintenance parts, travels to the site where the faulted disk is installed, and performs maintenance such as replacing the disk. Here, a disk fault means that the disk has failed to an extent that cannot be recovered.

However, when multiple disks fault, the supply of HS disks may run short and RAID redundancy may be lost. It is therefore desirable to perform preventive replacement in advance, that is, to replace a disk that is likely to fault before it actually faults.
One known technique for preventive disk replacement is described in Patent Document 1. That technique acquires disk error information and records it in a predetermined file, statistically analyzes the acquired error information to decide whether a disk should be preventively replaced before a failure occurs, and automatically reconfigures the array to use healthy disks before a disk failure occurs.
Patent Document 1: JP 2000-305720 A

New hard disks are developed and released every year, but their reliability is not uniform; it varies with the disk model, the time of development and release, and so on.
For example, even among interchangeable disks of similar performance, disks released before or after a given period are often more reliable than disks released during that period. In the following, a set of disks that were developed and released at around the same time, so that their reliability is considered to be the same, and that are classified by performance, capacity, intended use, and so on, is called a generation. In other words, disks of different generations are disks developed and released at different times and considered to differ in reliability.
Accordingly, when performing preventive replacement as described above, it is desirable to take the disk generation into account, for example by replacing disks of a low-reliability generation earlier.

Conventionally, although it was recognized that reliability differs by generation, there was no concept of automatically performing preventive replacement of disks in consideration of that difference. At most, the difference in reliability between generations was identified manually from disk device error information and the like, and disks of a low-reliability generation were replaced early.
However, when there are many array disk devices to monitor, the number of disk generations is large, and disks of different generations are mixed within an array disk device, it is difficult to recognize disk reliability for each generation manually and perform preventive replacement as described above.
The present invention has been made to solve these problems. Its object is to enable preventive replacement of disks that takes the reliability of each generation into account when performing maintenance management of a group of array disk devices, and thereby to improve the reliability of the array disk devices installed in each field.

Each array disk device counts the occurrences of recovery errors (errors that can be repaired without replacing the disk). When the count exceeds a predetermined threshold, which serves as the judgment criterion, the array disk device automatically replaces the disk whose recovery errors exceeded the threshold with the HS, as described above, and reconfigures the array.
In the present invention, the centralized monitoring device and the array disk devices installed at each site are connected by communication means. The centralized monitoring device acquires log information such as error information from the array disk devices, obtains from the log information the number of disk faults (unrecoverable failures) in each array disk device, and calculates the failure rate for each disk generation from the number of faults.
Then, based on the calculated failure rate for each disk generation, the threshold in each array disk device is updated. For example, when the failure rate of a certain disk generation is high, the threshold for those disks is lowered in the array disk devices that contain them, so that they are preventively replaced even when the number of recovery errors is relatively small.
By updating the threshold, which is the criterion for deciding whether to perform preventive replacement, based on the failure rate of each disk generation, disks of a high-failure-rate generation are preventively replaced earlier. This lowers the rate at which unrecoverable failures occur and improves the reliability of the array disk devices as a whole.

That is, the present invention solves the above problems as follows.
(1) An array disk group maintenance management system is provided that comprises a plurality of disk devices, each of which performs preventive replacement of an array disk included in an array disk group when the number of occurrences of recovery errors of that array disk exceeds a predetermined threshold, and a centralized monitoring device that is connected to the plurality of disk devices by communication means and centrally monitors them.
The centralized monitoring device comprises: log information acquisition means for acquiring log information from the plurality of disk devices; fault number information acquisition means for acquiring, from the acquired log information, information on the number of faulted array disks; failure rate calculation means for calculating the failure rate for each generation of array disks based on the acquired information on the number of faulted array disks; and threshold update means for updating the predetermined threshold for the number of occurrences of recovery errors of the array disks of each generation based on the calculated failure rate for each generation.
(2) When the number of occurrences of recovery errors of an array disk exceeds the predetermined threshold, the disk devices copy the data in that array disk to a spare array disk and detach the array disk.
(3) The threshold update means compares the calculated failure rate for each generation of array disks with a predetermined threshold for the failure rate and, based on the comparison result, updates the predetermined threshold for the number of occurrences of recovery errors of the array disks of each generation.

The array disk group maintenance management system, apparatus, method, and program of the present invention acquire log information from a plurality of disk devices, acquire from the log information the number of faulted array disks (fault number information), calculate the failure rate for each generation of array disks based on the fault number information, and update the predetermined threshold for the number of occurrences of recovery errors of the array disks of each generation based on the calculated failure rate.
In addition, when the number of occurrences of recovery errors of an array disk exceeds the predetermined threshold, the disk devices copy the data in that array disk to a spare array disk and detach the array disk (that is, they perform preventive replacement).

Therefore, according to the present invention, the criterion for preventive replacement of array disks can be changed dynamically for each generation of array disks. As a result, array disks of each generation, which are distributed across a plurality of disk devices, can be preventively replaced automatically on a per-generation basis.
Disks of a generation with a high failure rate are thus preventively replaced earlier, which lowers the rate at which unrecoverable failures occur and improves the reliability of the array disk devices.

FIG. 1 is a diagram showing an example of the configuration of the array disk group maintenance management system of the present invention. The system includes an array disk group maintenance management device 1 and a plurality of disk devices 2.

The array disk group maintenance management device 1 is a processing device that is connected to the disk devices 2 by communication means 3, such as a network, and centrally monitors the disk devices 2. Specifically, it acquires log information from the disk devices 2, which include array disks 101, acquires fault number information based on the log information, and calculates the failure rate for each generation of array disks 101 based on the acquired fault number information. The fault number information is information on the number of faulted array disks.

The array disk group maintenance management device 1 also updates, based on the calculated failure rate for each generation of array disks 101, the threshold for the number of occurrences of recovery errors of the array disks 101 of each generation in the disk devices 2, transmits the updated threshold information to the disk devices 2, and has it stored in the threshold storage unit 205 in each disk device 2.
In addition, the array disk group maintenance management device 1 receives an error notification, described later, from a disk device 2 and notifies an operator (not shown) of the device. The operator who receives the notification contacts the maintenance personnel 100 for the disk device 2 by e-mail, telephone, or the like and has them service the disk device 2 that sent the error notification.

Each of the plurality of disk devices 2 includes array disks 101 and manages them. The disk devices 2 are installed, for example, in a plurality of fields (sites), and each includes, for example, array disks 101 of multiple generations. The array disks 101 of each generation are therefore distributed, for example, across a plurality of fields.
The disk device 2 transmits log information to the array disk group maintenance management device 1. The log information includes information on which of the array disks 101 in the disk device 2 have faulted. It also includes, for example, information on the types of recovery errors of the array disks 101, described later.

The disk device 2 stores, in the threshold storage unit 205, the thresholds for the number of occurrences of recovery errors of the array disks of each generation transmitted from the array disk group maintenance management device 1. There are several types of recovery error for the array disks 101. For example, a recovery error exists for each kind of operation performed by the disk device 2, such as read, write, and head seek operations. The disk device 2 may therefore store in the threshold storage unit 205 a threshold for the number of occurrences of recovery errors of the array disks 101 of each generation for each type of recovery error.
The disk device 2 performs preventive replacement of an array disk 101 when the number of occurrences of recovery errors of that array disk 101 exceeds the threshold stored in the threshold storage unit 205. That is, the threshold stored in the threshold storage unit 205 is the criterion for preventive replacement of array disks.
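The per-generation, per-error-type threshold check described above can be sketched as follows. This is a minimal illustration, not the implementation disclosed in the source: the class `ThresholdStore`, its method names, and the generation and error-type labels are hypothetical stand-ins for the threshold storage unit 205 and for the comparison performed by the disk device 2.

```python
from dataclasses import dataclass, field


@dataclass
class ThresholdStore:
    """Hypothetical stand-in for the threshold storage unit 205."""

    # Maps (generation, error_type) to the stored recovery-error threshold.
    limits: dict = field(default_factory=dict)

    def update(self, generation, error_type, limit):
        """Store a threshold received from the maintenance management device."""
        self.limits[(generation, error_type)] = limit

    def needs_preventive_replacement(self, generation, error_counts):
        """Return True when any error type's count exceeds its threshold."""
        for error_type, count in error_counts.items():
            limit = self.limits.get((generation, error_type))
            # Only a count strictly above the threshold triggers replacement.
            if limit is not None and count > limit:
                return True
        return False


store = ThresholdStore()
store.update("a", "read", 10)
store.update("a", "write", 5)
replace_disk = store.needs_preventive_replacement("a", {"read": 3, "write": 6})
```

Keying the table by both generation and error type mirrors the optional per-type thresholds mentioned in the text; a store keyed by generation alone would work the same way.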

The array disks 101 in the disk device 2 (the array disks 101 labeled #1 and #2 in FIG. 1) are, for example, in a mirrored configuration. When, for example, the number of occurrences of recovery errors of the array disk 101 labeled #2 exceeds the threshold, the disk device 2 copies the data in that array disk 101 to the HS 102, a spare array disk, and detaches the array disk 101. This process of copying the data of the array disk 101 to the HS 102 and detaching the array disk, thereby swapping the array disk 101 with the HS 102, is called redundant copy. The detached array disk 101 is treated as faulted.
After detaching the array disk 101 treated as faulted, the disk device 2 sends an error notification to the array disk group maintenance management device 1. The error notification reports that the array disk 101 has faulted.

The array disk group maintenance management device 1 includes a transmission/reception unit 11, a fault number information acquisition unit 12, a failure rate calculation unit 13, a threshold update unit 14, a log information storage unit 15, and a threshold storage unit 16.
The transmission/reception unit 11 receives the log information transmitted from the disk devices 2 and stores it in the log information storage unit 15. It also transmits the threshold information for the number of occurrences of recovery errors, as updated by the threshold update unit 14, to the disk devices 2.
Based on selection information entered by the operator of the array disk group maintenance management device 1, which indicates whether the updated threshold information is to be sent to all disk devices 2 or to specific disk devices 2, the transmission/reception unit 11 transmits the updated threshold information either to all disk devices 2 or to specific disk devices 2 (for example, the disk devices 2 that include array disks 101 of the generation corresponding to the threshold). The transmission/reception unit 11 also receives error notifications from the disk devices 2 and notifies the operator.
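The choice between sending an updated threshold to all disk devices or only to specific ones can be sketched as below. The function name `select_recipients`, the device names, and the inventory layout (device name mapped to the set of generations it contains) are assumptions made for illustration; the source does not describe how the selection information is represented.

```python
def select_recipients(inventory, generation, send_to_all):
    """Pick the disk devices that should receive an updated threshold.

    inventory: dict mapping a device name to the set of disk generations
    installed in that device (hypothetical layout).
    send_to_all: operator's selection; True sends to every device.
    """
    if send_to_all:
        return sorted(inventory)
    # Otherwise send only to devices containing disks of that generation.
    return sorted(name for name, gens in inventory.items() if generation in gens)


inventory = {
    "field1-dev": {"a", "b"},
    "field2-dev": {"b", "c"},
    "field3-dev": {"c"},
}
targets = select_recipients(inventory, "b", send_to_all=False)
```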

The fault number information acquisition unit 12 acquires fault number information based on the log information stored in the log information storage unit 15. Specifically, the fault number information acquisition unit 12 extracts log information from the log information storage unit 15. Because the extracted log information includes information on the array disks 101 that have faulted in the disk devices 2, the fault number information acquisition unit 12 acquires the fault number information from the extracted log information.
The failure rate calculation unit 13 calculates the failure rate of the array disks 101 for each generation based on the fault number information acquired by the fault number information acquisition unit 12. The failure rate calculation unit 13 calculates, for example, the annual failure rate (AFR: Annual Failure Rate) of the array disks. An example of the AFR calculation is described later.
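A minimal sketch of how fault number information might be derived from stored log information follows. The record layout (the `generation` and `event` keys and the `"fault"` event label) is an assumption; the source states only that the log information identifies faulted array disks.

```python
from collections import Counter


def count_faults(log_records):
    """Count faulted array disks per generation from log records.

    Each record is assumed to be a dict with hypothetical keys
    'generation' and 'event'; only 'fault' events are counted.
    """
    return Counter(
        record["generation"]
        for record in log_records
        if record["event"] == "fault"
    )


log_records = [
    {"generation": "a", "event": "fault"},
    {"generation": "a", "event": "recovery_error"},
    {"generation": "b", "event": "fault"},
    {"generation": "a", "event": "fault"},
]
fault_counts = count_faults(log_records)
```

A `Counter` returns zero for generations with no recorded faults, which keeps later per-generation calculations simple.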

Based on the failure rate of the array disks 101 calculated by the failure rate calculation unit 13, the threshold update unit 14 updates the thresholds for the number of occurrences of recovery errors of the array disks 101 of each generation stored in the threshold storage unit 16. According to one embodiment of the present invention, the threshold update unit 14 may update the threshold for the number of occurrences of recovery errors of the array disks 101 of each generation for each type of recovery error.
The log information storage unit 15 stores the log information received from the disk devices 2. The threshold storage unit 16 stores the thresholds for the number of occurrences of recovery errors as updated by the threshold update unit 14.

The functions of the array disk group maintenance management device 1 and of each of its units are implemented by a CPU and a program executed on it. The program that implements the present invention can be stored on a computer-readable recording medium such as a semiconductor memory, a hard disk, a CD-ROM, or a DVD, and is provided either recorded on such a medium or by transmission and reception over a network via a communication interface.

FIG. 2 is a diagram showing an example of the configuration of the disk device. Here, the case where the disk device 2 is a RAID 5 device is described as an example.
The disk device 2 includes a disk device unit 20 and a RAID controller 21. The disk device unit 20 includes the array disks 101 and the HS 102, and performs redundant copy of an array disk 101 (copying the data in the array disk 101 to the HS 102 and detaching the array disk 101). An example of redundant copy of an array disk 101 is described later with reference to FIG. 3. The array disks 101 labeled #1 to #4 in FIG. 2 constitute, for example, a RAID 5 array. The RAID controller 21 is implemented in firmware and instructs the disk device unit 20 to perform redundant copy.

The RAID controller 21 includes an error notification unit 201, a log transmission unit 202, a transmission/reception unit 203, a redundant copy instruction unit 204, and a threshold storage unit 205.
When the disk device unit 20 detaches an array disk 101, the error notification unit 201 sends an error notification to the array disk group maintenance management device 1 through the transmission/reception unit 203.
The log transmission unit 202 transmits the log information of the disk device 2 to the array disk group maintenance management device 1 through the transmission/reception unit 203.

The transmission/reception unit 203 sends error notifications and transmits log information to the array disk group maintenance management device 1. It also receives threshold information for the number of occurrences of recovery errors of the array disks 101 from the array disk group maintenance management device 1 and stores it in the threshold storage unit 205.
The redundant copy instruction unit 204 monitors the number of occurrences of recovery errors of the array disks 101 and, when it determines that the number has exceeded the threshold stored in the threshold storage unit 205, instructs the disk device unit 20 to perform redundant copy.

FIG. 3 is a diagram showing an example of redundant copy in the disk device shown in FIG. 2. When the redundant copy instruction unit 204 determines that the number of occurrences of recovery errors of the array disk 101 labeled #3 in FIG. 3(A) has exceeded the threshold stored in the threshold storage unit 205, it instructs the disk device unit 20 to perform redundant copy with the array disk 101 labeled #3 as the drive targeted for preventive replacement.

On receiving the redundant copy instruction, the disk device unit 20 copies the data in the array disk 101 labeled #3 to the HS 102, as shown in FIG. 3(A), and then detaches the array disk 101 labeled #3 and incorporates the HS 102 into the RAID 5 array, as shown in FIG. 3(B).
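The redundant copy sequence (copy to the hot spare, detach the original, incorporate the spare) can be sketched as follows. The function `redundant_copy` and the dictionary representation of a disk are hypothetical; a real RAID controller performs these steps in firmware, and the sketch ignores parity handling and I/O arriving during the copy.

```python
def redundant_copy(array, hot_spares, slot):
    """Swap the disk in `slot` for a hot spare and return the detached disk.

    array: list of disk dicts forming the array (hypothetical representation).
    hot_spares: list of spare disk dicts; the first available one is used.
    """
    if not hot_spares:
        raise RuntimeError("no hot spare available")
    spare = hot_spares.pop(0)
    target = array[slot]
    spare["data"] = target["data"]   # copy the data to the spare first
    array[slot] = spare              # incorporate the spare into the array
    target["state"] = "fault"        # the detached disk is treated as faulted
    return target


array = [{"id": f"#{i}", "data": f"block{i}", "state": "ok"} for i in range(1, 5)]
hot_spares = [{"id": "HS", "data": None, "state": "ok"}]
detached = redundant_copy(array, hot_spares, slot=2)  # slot 2 holds disk #3
```

Copying before detaching means the array never runs without the data of the disk being replaced, which is the point of replacing it before it faults.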

FIG. 4 is a diagram illustrating an example of how the array disk group maintenance management device calculates the AFR for each generation of array disks. Here, the case where the array disks 101 of each generation are distributed across a plurality of fields is described as an example. A field is a site where a disk device 2 operates.

As shown in FIG. 4(A), the array disks 101 of each generation (generation a, generation b, generation c, ...) are distributed across a plurality of fields (the first field, the second field, ...).
For example, in the first field, the annual number of faults is na1 for the generation a array disks 101, nb1 for generation b, nc1 for generation c, and so on. Likewise, in the second field, the annual number of faults is na2 for generation a, nb2 for generation b, nc2 for generation c, and so on. This information on the annual number of faults of each generation is transmitted, as part of the log information, from the disk devices 2 located in each field to the array disk group maintenance management device 1.

The fault number information acquisition unit 12 of the array disk group maintenance management device 1 calculates the annual fault count for each generation. That is, as shown in FIG. 4(B), the fault number information acquisition unit 12 calculates na1 + na2 + ... as the annual fault count of the generation a array disks 101, nb1 + nb2 + ... as the annual fault count of the generation b array disks 101, and nc1 + nc2 + ... as the annual fault count of the generation c array disks 101. The annual fault counts of array disks 101 of generations other than a, b, and c are calculated in the same way.
Then, as shown in FIG. 4(C), the failure rate calculation unit 13 calculates the AFR of the array disks of each generation from the calculated annual fault counts, using a predetermined AFR calculation formula.
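The aggregation of FIG. 4(B) and the AFR calculation of FIG. 4(C) can be sketched as follows. This is only an illustration: the text says a "predetermined AFR calculation formula" is used without stating it, so the formula below (annual faulted disks divided by installed disks) is an assumption, as are the function names, the installed-disk counts, and all sample figures.

```python
from collections import Counter


def total_faults_by_generation(field_fault_counts):
    """Sum the annual fault counts per generation over all fields (FIG. 4(B))."""
    totals = Counter()
    for counts in field_fault_counts.values():
        totals.update(counts)  # adds na1 + na2 + ..., nb1 + nb2 + ..., etc.
    return dict(totals)


def afr_by_generation(total_faults, installed_disks):
    """ASSUMED formula: faulted disks per year / installed disks of that generation."""
    return {
        gen: total_faults.get(gen, 0) / installed_disks[gen]
        for gen in installed_disks
    }


# Hypothetical sample data: annual fault counts reported from two fields.
field_fault_counts = {
    "field1": {"a": 2, "b": 5, "c": 9},
    "field2": {"a": 1, "b": 4, "c": 6},
}
# Hypothetical number of installed disks per generation (not in the source).
installed_disks = {"a": 1000, "b": 1000, "c": 1000}

totals = total_faults_by_generation(field_fault_counts)
afr = afr_by_generation(totals, installed_disks)
print(totals)
print(afr)
```

Keeping the per-field counts separate until the final sum mirrors the text, in which each disk device 2 reports its own counts in its log information and the maintenance management device 1 performs the aggregation.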

FIG. 5 is a diagram illustrating an example of how the array disk group maintenance management device updates the threshold for the number of occurrences of recovery errors. In this example, the threshold update unit 14 of the array disk group maintenance management device 1 compares the AFR for each generation of array disks, calculated by the failure rate calculation unit 13 as in the AFR calculation example described above with reference to FIG. 4, with a predetermined threshold for the AFR, and updates the threshold for the number of occurrences of recovery errors of each generation based on the comparison result.

More specifically, when the threshold update unit 14 determines that an AFR is less than the AFR threshold, it does not update the threshold for the number of occurrences of recovery errors for the array disks 101 of the generation corresponding to that AFR. When it determines that the AFR is equal to or greater than the AFR threshold, it lowers the threshold for the number of occurrences of recovery errors for the array disks 101 of that generation, setting it to a value that depends on how far the AFR exceeds the AFR threshold. For example, the more the AFR exceeds the AFR threshold, the smaller the threshold for the number of occurrences of recovery errors that the threshold update unit 14 sets.

In the graph shown in FIG. 5, the horizontal axis is the generation of the array disks 101 and the vertical axis is the AFR corresponding to each generation. Et is the AFR threshold, Ea is the AFR of the generation a array disks 101, Eb is the AFR of the generation b array disks 101, and Ec is the AFR of the generation c array disks 101.

The threshold update unit 14 compares Ea with Et and determines that Ea is less than Et. The threshold update unit 14 therefore does not update the threshold for the number of occurrences of recovery errors for the generation a array disks.
The threshold update unit 14 also compares Eb with Et and determines that Eb is equal to or greater than Et. It therefore lowers the threshold for the number of occurrences of recovery errors for the generation b array disks, setting it, for example, to a value that depends on the difference obtained by subtracting Et from Eb. Similarly, the threshold update unit 14 compares Ec with Et and determines that Ec is equal to or greater than Et. It therefore lowers the threshold for the number of occurrences of recovery errors for the generation c array disks, setting it to a value that depends on the difference obtained by subtracting Ec minus Et.
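The FIG. 5 rule can be illustrated with a short sketch. The text states only that the new threshold depends on the amount by which the AFR exceeds Et, so the concrete mapping used here (one count of reduction per started `step` of excess, with a floor of `minimum`) is a hypothetical choice, as are the function name, the parameters, and the sample values, which are expressed in percent.

```python
import math


def update_recovery_error_threshold(current, afr, afr_threshold,
                                    step=1.0, minimum=1):
    """Lower the recovery error count threshold according to the AFR excess.

    ASSUMPTION: the reduction is ceil((afr - afr_threshold) / step) and the
    result never drops below `minimum`. The source does not specify a mapping.
    """
    if afr < afr_threshold:
        return current  # AFR below Et: leave the threshold unchanged
    reduction = math.ceil((afr - afr_threshold) / step)
    return max(minimum, current - reduction)


# Hypothetical AFR values in percent: Ea < Et <= Eb < Ec.
Et = 2.0
afr_percent = {"a": 1.0, "b": 2.5, "c": 4.0}
thresholds = {gen: 10 for gen in afr_percent}  # assumed initial thresholds

new_thresholds = {
    gen: update_recovery_error_threshold(thresholds[gen], e, Et)
    for gen, e in afr_percent.items()
}
print(new_thresholds)
```

With these sample values, generation a keeps its threshold, while generation c, whose AFR exceeds Et by more than that of generation b, receives the lowest threshold and is therefore preventively replaced soonest.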

The processing for updating the threshold for the number of occurrences of recovery errors in the present invention is not limited to the processing described above. According to one embodiment of the present invention, the threshold update unit 14 may, for example, raise the threshold for the number of occurrences of recovery errors for the array disks 101 of the corresponding generation when it determines that the AFR is less than the AFR threshold, leave that threshold unchanged when it determines that the AFR is equal to the AFR threshold, and lower that threshold when it determines that the AFR is greater than the AFR threshold.

FIG. 6 is a diagram showing an example of the processing flow by which the array disk group maintenance management device updates the threshold for the number of occurrences of recovery errors.
First, the transmission/reception unit 11 of the array disk group maintenance management device 1 receives log information from the disk devices 2 (step S1) and stores it in the log information storage unit 15. Next, the fault number information acquisition unit 12 extracts the log information from the log information storage unit 15 (step S2).
The fault number information acquisition unit 12 analyzes the extracted log information and acquires the fault number information (step S3). The failure rate calculation unit 13 then calculates the AFR of the array disks 101 of each generation based on the acquired fault number information (step S4), and determines whether the AFR calculated in step S4 is less than the AFR threshold (step S5).

When the failure rate calculation unit 13 determines that the AFR is less than the AFR threshold, it adds 1 to the threshold for the number of occurrences of recovery errors for the array disks 101 of the generation corresponding to that AFR (step S10), and the processing proceeds to step S8. When the failure rate calculation unit 13 determines that the AFR is not less than the AFR threshold, it determines whether the AFR is greater than the AFR threshold (step S6).

When the failure rate calculation unit 13 determines that the AFR is not greater than the AFR threshold (that is, the AFR equals the AFR threshold), it does not update the threshold for the number of occurrences of recovery errors for the array disks 101 of the generation corresponding to that AFR (step S11). When the failure rate calculation unit 13 determines that the AFR is greater than the AFR threshold, it subtracts 1 from the threshold for the number of occurrences of recovery errors for the array disks 101 of that generation (step S7), and the processing proceeds to step S8.
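The three-way branch of steps S5 to S7, S10, and S11 can be written compactly as below. The function name and the sample values are hypothetical. The flow as described sets no lower or upper bound on the threshold, so none is applied here.

```python
def step_recovery_error_threshold(current, afr, afr_threshold):
    """Apply the FIG. 6 branch to one generation's recovery error threshold."""
    if afr < afr_threshold:   # step S5: yes
        return current + 1    # step S10: add 1
    if afr > afr_threshold:   # step S6: yes
        return current - 1    # step S7: subtract 1
    return current            # step S11: AFR equals the threshold, no update


# Hypothetical AFR values (percent) for three generations and a threshold of 2.0.
afr_threshold = 2.0
afr_percent = {"a": 1.0, "b": 2.0, "c": 3.5}
thresholds = {"a": 10, "b": 10, "c": 10}  # assumed current thresholds

stepped = {
    gen: step_recovery_error_threshold(thresholds[gen], afr_percent[gen], afr_threshold)
    for gen in thresholds
}
print(stepped)
```

Because each pass changes the threshold by at most 1, repeated log collection moves the threshold gradually, in contrast to the FIG. 5 example, where the size of the adjustment depends on how far the AFR exceeds its threshold.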

Next, the operator of the array disk group maintenance management device 1 inputs, to the transmission/reception unit 11, selection information indicating whether the threshold for the number of occurrences of recovery errors updated in step S7 or step S10 is to be sent to all disk devices 2 or to a specific disk device 2. Based on the input selection information, the transmission/reception unit 11 determines whether to send the updated threshold information to all disk devices 2 or to a specific disk device 2 (step S8). Alternatively, in step S8 the transmission/reception unit 11 may make this determination based on selection information set in advance.

When the transmission/reception unit 11 determines that the updated threshold for the number of occurrences of recovery errors is to be sent to all disk devices 2, it sends the updated threshold information to all disk devices 2 (step S9).
When the transmission/reception unit 11 determines that the updated threshold for the number of occurrences of recovery errors is to be sent to a specific disk device 2, it sends the updated threshold information to that specific disk device 2 (step S12).
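The target selection of steps S8, S9, and S12 might look like the following sketch. The selection values `"all"` and `"specific"`, the device names, and the `send` callback are hypothetical stand-ins for the selection information and the communication means 3, which the text does not specify in detail.

```python
def select_targets(all_devices, selection, specific_devices=()):
    """Step S8: decide which disk devices receive the updated thresholds."""
    if selection == "all":
        return list(all_devices)
    if selection == "specific":
        wanted = set(specific_devices)
        return [d for d in all_devices if d in wanted]
    raise ValueError(f"unknown selection: {selection!r}")


def distribute_thresholds(thresholds, all_devices, selection, send,
                          specific_devices=()):
    """Steps S9 and S12: send the updated thresholds to the selected devices."""
    targets = select_targets(all_devices, selection, specific_devices)
    for device in targets:
        send(device, thresholds)
    return targets


# Stand-in for the real transport: record what would be sent.
sent = []


def record_send(device, thresholds):
    sent.append((device, dict(thresholds)))


devices = ["disk-device-1", "disk-device-2", "disk-device-3"]
updated = {"b": 9, "c": 8}  # hypothetical per-generation thresholds
distribute_thresholds(updated, devices, "specific", record_send,
                      specific_devices=["disk-device-2"])
print(sent)
```

Passing the transport in as a callable keeps the selection logic independent of how the thresholds actually reach the disk devices.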

As described above, according to the present invention, the criterion for preventive replacement of array disks can be changed dynamically for each generation of array disks. As a result, array disks of each generation that are distributed across a plurality of disk devices can be preventively replaced automatically on a per-generation basis.

FIG. 1 is a diagram showing an example of the configuration of the array disk group maintenance management system of the present invention.
FIG. 2 is a diagram showing an example of the configuration of a disk device.
FIG. 3 is a diagram showing an example of redundant copy in a disk device.
FIG. 4 is a diagram illustrating an example of calculating the AFR for each generation of array disks.
FIG. 5 is a diagram illustrating an example of the processing for updating the threshold for the number of occurrences of recovery errors.
FIG. 6 is a diagram showing an example of the processing flow for updating the threshold for the number of occurrences of recovery errors.

Explanation of symbols

1 Array disk group maintenance management device
2 Disk device
3 Communication means
11 Transmission/reception unit
12 Fault number information acquisition unit
13 Failure rate calculation unit
14 Threshold update unit
15 Log information storage unit
16, 205 Threshold storage unit
20 Disk device unit
21 RAID controller
100 Maintenance personnel
101 Array disk
102 HS
201 Error notification unit
202 Log transmission unit
203 Transmission/reception unit
204 Redundant copy instruction unit

Claims

An array disk group maintenance management system comprising:
a plurality of disk devices that perform preventive replacement of an array disk included in an array disk group when the number of occurrences of recovery errors of the array disk exceeds a predetermined threshold; and
a centralized monitoring device that is connected to the plurality of disk devices by communication means and centrally monitors the plurality of disk devices,
wherein the centralized monitoring device comprises:
log information acquisition means for acquiring log information from the plurality of disk devices;
fault number information acquisition means for acquiring information on the number of faulted array disks from the acquired log information;
failure rate calculation means for calculating a failure rate for each generation of array disks based on the acquired information on the number of faulted array disks; and
threshold update means for updating the predetermined threshold for the number of occurrences of recovery errors of the array disks of each generation based on the calculated failure rate for each generation of array disks.

The array disk group maintenance management system according to claim 1, wherein, when the number of occurrences of recovery errors of an array disk exceeds the predetermined threshold, the plurality of disk devices copy the data in the array disk to a spare array disk and detach the array disk.

The array disk group maintenance management system according to claim 1, wherein the threshold update means compares the calculated failure rate for each generation of array disks with a predetermined threshold for the failure rate, and updates the predetermined threshold for the number of occurrences of recovery errors of the array disks of each generation based on the result of the comparison.

An array disk group maintenance management device connected by communication means to a plurality of disk devices that perform preventive replacement of an array disk included in an array disk group when the number of occurrences of recovery errors of the array disk exceeds a predetermined threshold, the device comprising:
log information acquisition means for acquiring log information from the plurality of disk devices;
fault number information acquisition means for acquiring information on the number of faulted array disks from the acquired log information;
failure rate calculation means for calculating a failure rate for each generation of array disks based on the acquired information on the number of faulted array disks; and
threshold update means for updating the predetermined threshold for the number of occurrences of recovery errors of the array disks of each generation based on the calculated failure rate for each generation of array disks.

An array disk group maintenance management method performed in an array disk group maintenance management device connected by communication means to a plurality of disk devices that perform preventive replacement of an array disk included in an array disk group when the number of occurrences of recovery errors of the array disk exceeds a predetermined threshold, the method comprising the steps of:
acquiring log information from the plurality of disk devices;
acquiring information on the number of faulted array disks from the acquired log information;
calculating a failure rate for each generation of array disks based on the acquired information on the number of faulted array disks; and
updating the predetermined threshold for the number of occurrences of recovery errors of the array disks of each generation based on the calculated failure rate for each generation of array disks.

An array disk group maintenance management program for use in an array disk group maintenance management device connected by communication means to a plurality of disk devices that perform preventive replacement of an array disk included in an array disk group when the number of occurrences of recovery errors of the array disk exceeds a predetermined threshold,
the program causing a computer to execute:
a process of acquiring log information from the plurality of disk devices;
a process of acquiring information on the number of faulted array disks from the acquired log information;
a process of calculating a failure rate for each generation of array disks based on the acquired information on the number of faulted array disks; and
a process of updating the predetermined threshold for the number of occurrences of recovery errors of the array disks of each generation based on the calculated failure rate for each generation of array disks.