JP2023134170A

JP2023134170A - Storage medium management device, method for managing storage medium, and storage medium management program

Info

Publication number: JP2023134170A
Application number: JP2022039539A
Authority: JP
Inventors: 和一柴; Kazuichi Shiba
Original assignee: NEC Platforms Ltd
Current assignee: NEC Platforms Ltd
Priority date: 2022-03-14
Filing date: 2022-03-14
Publication date: 2023-09-27

Abstract

To precisely predict a storage medium with a high failure risk and reduce rebuild time. SOLUTION: An HDD information collection unit 211 collects the operating states of a plurality of physical disks HDD1 to HDD3 forming a RAID array. A statistical information analysis unit 212 determines whether there is a relative difference among the collected operating states of the physical disks HDD1 to HDD3. A failed HDD prediction unit 213 predicts the physical disk with the highest failure risk on the basis of the relative difference in operating state. An HDD copy control unit 214 copies the data of the physical disk predicted to have the highest failure risk to a standby disk SHDD. SELECTED DRAWING: Figure 1

Description

The present invention relates to a storage medium management device, a storage medium management method, and a storage medium management program.

In recent years, as the capacity of physical disks in RAID (Redundant Arrays of Inexpensive Disks, or Redundant Arrays of Independent Disks) systems has increased, the RAID rebuild (reconstruction) process has come to take a great deal of time when a physical disk fails. To shorten the rebuild time, a technique has been proposed in which, for example, a disk device is judged to be a suspect disk device when its error statistic exceeds a predetermined threshold and a restoration mode is set, and, during idle time between accesses from the host device while that mode is set, data is sequentially copied to a spare disk and restored while the address range of the suspect disk device is specified (see, for example, Patent Document 1). Here, rebuild processing refers to processing such as restoring and reconstructing the data of a failed disk from the other, non-failed HDDs when a disk fails in a RAID system built and operated with a plurality of physical disks such as HDDs.

Japanese Patent Application Publication No. 2006-139339

However, in the method of Patent Document 1, an increment value is set in advance for each error type among the various failures of a disk device (storage medium), such as motor stop, medium defect error, and mode error. Each time an error occurs, the corresponding increment is added to an error statistics sum, and a disk device whose error statistics sum exceeds a predetermined threshold is identified as a suspect disk device. However, the fact that the error statistics sum has exceeded the predetermined threshold does not necessarily mean that the suspect disk device will fail in the future. Consequently, the disk device with a high risk of future failure cannot be identified accurately, a different disk device may fail instead, and as a result the rebuild process takes a great deal of time.
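The fixed-threshold scheme described above can be illustrated with a minimal Python sketch. The increment values, the threshold, and the names `ERROR_INCREMENTS`, `SUSPECT_THRESHOLD`, and `find_suspect_disks` are hypothetical; Patent Document 1's actual values are not given in this document.

```python
from collections import defaultdict

# Hypothetical per-error-type increments and threshold (not taken from the source).
ERROR_INCREMENTS = {"motor_stop": 10, "medium_defect": 3, "mode_error": 1}
SUSPECT_THRESHOLD = 20


def find_suspect_disks(error_log, increments=ERROR_INCREMENTS,
                       threshold=SUSPECT_THRESHOLD):
    """Return the disks whose accumulated error score exceeds the fixed threshold."""
    totals = defaultdict(int)
    for disk, error_type in error_log:
        # Add the increment assigned to this error type to the disk's running sum.
        totals[disk] += increments[error_type]
    return sorted(disk for disk, total in totals.items() if total > threshold)


# Seven medium-defect errors on HDD1 (7 * 3 = 21) and five mode errors on HDD2 (5 * 1 = 5).
log = [("HDD1", "medium_defect")] * 7 + [("HDD2", "mode_error")] * 5
suspects = find_suspect_disks(log)
```

Each disk is judged only against an absolute score here, never against the other members of the array, which is the limitation the description points to.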

Accordingly, an object of the present invention is to provide a storage medium management device, a storage medium management method, and a storage medium management program that more accurately predict a storage medium with a high failure risk and shorten the rebuild time.

To solve the above problem, one aspect of the present invention is characterized by comprising: collection means for collecting the operating states of a plurality of storage media constituting a RAID array; operating state determination means for determining whether there is a relative difference among the operating states of the plurality of storage media collected by the collection means; prediction means for predicting the storage medium with the highest failure risk based on the determination result of the operating state determination means; and copy control means for copying the data of the storage medium predicted by the prediction means to have the highest failure risk to a standby storage medium.

Another aspect of the present invention is a storage medium management method for managing a plurality of storage media constituting a RAID array, the method being characterized by including the steps of: collecting the operating states of the plurality of storage media; determining whether there is a relative difference among the collected operating states of the plurality of storage media; predicting the storage medium with the highest failure risk based on the result of the determination; and copying the data of the storage medium predicted to have the highest failure risk to a standby storage medium.

Another aspect of the present invention is characterized by causing a computer of a storage medium management device that manages a plurality of storage media constituting a RAID array to execute the steps of: collecting the operating states of the plurality of storage media; determining whether there is a relative difference among the collected operating states of the plurality of storage media; predicting the storage medium with the highest failure risk based on the result of the determination; and copying the data of the storage medium predicted to have the highest failure risk to a standby storage medium.

As described above, the present invention has the advantage that a storage medium with a high failure risk can be predicted more accurately and the rebuild time can be shortened.

FIG. 1 is a block diagram showing the configuration of a server system to which a RAID controller (storage medium management device) 20 according to the present embodiment is applied.
FIG. 2 is a flowchart for explaining the operation periodically executed by the RAID controller 20 according to the present embodiment.
FIG. 3 is a flowchart for explaining the normal operation of the RAID controller 20 according to the present embodiment.
FIGS. 4 to 10 are schematic diagrams showing disk control operations performed by the RAID controller 20 according to the present embodiment.
FIG. 11 is a block diagram showing the minimum configuration of the storage medium management device according to the present embodiment.

Embodiments of the present invention will be described below with reference to the drawings.

A. Embodiment
FIG. 1 is a block diagram showing the configuration of a server system to which a RAID controller (storage medium management device) according to the present embodiment is applied. The server system 1 comprises an OS (Operating System) 10, a RAID controller 20, and a disk array 30. The OS 10 consists of a RAID driver 11 and a RAID management utility 12. The RAID driver 11 is software for exchanging data (reading and writing) with the disk array 30. The RAID management utility 12 is software used at the OS level to manage the reading and writing of data to and from the disk array 30.

The RAID controller 20 includes RAID firmware 21, and the RAID firmware 21 comprises an HDD information collection unit 211, a statistical information analysis unit 212, a failed HDD prediction unit 213, and an HDD copy control unit 214. The disk array 30 has physical disks HDD1, HDD2, and HDD3, which constitute a RAID 5 array, and a standby disk SHDD. RAID 5 is a technology that manages a plurality of external storage devices (such as hard disks) together as a single device. In this embodiment, the physical disks HDD1, HDD2, and HDD3 and the standby disk SHDD are assumed to be hard disks, but they are not limited to this and may be semiconductor storage media such as SSDs (Solid State Drives).

The HDD information collection unit 211 periodically collects statistical information on each of the physical disks HDD1 to HDD3 in the disk array 30. The statistical information indicates at least one operating state, such as operating time, response speed, error rate, or number of registered alternate sectors. The statistical information analysis unit 212 analyzes the collected statistical information, which includes at least one operating state of the physical disks HDD1 to HDD3 such as operating time, response speed, error rate, or number of registered alternate sectors. The analysis method is described later.
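As an illustration of the statistical information described above, the following Python sketch models one snapshot per disk. The record type `DiskStats`, its field names and units, and the `read_stats` callback are assumptions made for this example; the document does not specify how the RAID firmware obtains or stores these values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class DiskStats:
    """One snapshot of the four operating states named in the description."""
    operating_hours: float    # operating time
    response_ms: float        # response speed (larger means slower)
    error_rate: float         # error rate
    alternate_sectors: int    # number of registered alternate sectors


def collect_stats(disks: Iterable[str],
                  read_stats: Callable[[str], DiskStats]) -> Dict[str, DiskStats]:
    """Collect one snapshot per member disk using a caller-supplied reader."""
    return {disk: read_stats(disk) for disk in disks}


# Hypothetical readings standing in for values reported by the drives.
fake_readings = {
    "HDD1": DiskStats(1000.0, 8.0, 0.001, 2),
    "HDD2": DiskStats(1000.0, 9.0, 0.001, 4),
    "HDD3": DiskStats(1000.0, 8.0, 0.001, 2),
}
snapshot = collect_stats(["HDD1", "HDD2", "HDD3"], fake_readings.__getitem__)
```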

Based on the results of the analysis of the statistical information of the physical disks HDD1 to HDD3 by the statistical information analysis unit 212, the failed HDD prediction unit 213 identifies, from among the plurality of physical disks HDD1 to HDD3 connected to the RAID controller 20, the disk HDDm (m = 1, 2, or 3) with the highest risk of future failure. The HDD copy control unit 214 copies the data of the identified HDDm, which has the highest risk of future failure, to the standby disk SHDD.

B. Operation of the Embodiment
FIG. 2 is a flowchart for explaining the operation periodically executed by the RAID controller 20 according to the present embodiment. FIGS. 4 to 6 are schematic diagrams showing disk control operations performed by the RAID controller 20 according to the present embodiment.

The flowchart shown in FIG. 2 is executed periodically (for example, once a week). The HDD information collection unit 211 collects statistical information such as the operating time, response speed, error rate, and number of registered alternate sectors of each of the physical disks HDD1 to HDD3 in the disk array 30 (step S10). The statistical information analysis unit 212 analyzes the collected statistical information and determines whether the operating states of the physical disks HDD1 to HDD3 differ enough to reasonably predict a risk of future failure (step S12). For example, it analyzes the statistical information of the member physical disks HDD1 to HDD3 that make up the logical drive and determines whether any disk shows a relative difference (a difference indicating degradation) from the other physical disks in at least one operating state, such as an even slightly longer operating time, slower response speed, higher error rate, or larger number of registered alternate sectors.

If there is no difference in the at least one operating state included in the collected statistical information of the physical disks HDD1 to HDD3 (NO in step S14), the HDD copy control unit 214 selects an arbitrary physical disk HDDn (n = 1, 2, or 3), for example HDD2, and copies the data of the physical disk HDDn to the standby disk SHDD (step S16). For example, as shown in FIG. 4, the physical disk HDD2 is selected and its data is copied to the standby disk SHDD. The process then ends.

On the other hand, if at least one operating state included in the collected statistical information of the physical disks HDD1 to HDD3 differs enough to reasonably predict a risk of future failure (YES in step S14), the failed HDD prediction unit 213 identifies the physical disk HDDm (m = 1, 2, or 3) with the highest risk of future failure from among the plurality of physical disks HDD1 to HDD3 connected to the RAID controller 20 (step S18). It does so based on the analysis result of the statistical information analysis unit 212, that is, the relative difference in at least one operating state (operating time, response speed, error rate, number of registered alternate sectors, and so on) of the physical disks HDD1 to HDD3.

For example, among the member physical disks HDD1 to HDD3 that make up the logical drive, the failed HDD prediction unit 213 identifies as the physical disk HDDm with the highest risk of failure the disk that shows a relative difference from the other physical disks in at least one operating state, such as an even slightly longer operating time, slower response speed, higher error rate, or larger number of registered alternate sectors. For example, as shown in FIG. 5, if the physical disk HDD2 has an even slightly longer operating time, slower response speed, or larger number of registered alternate sectors than the other physical disks, the physical disk HDD2 is identified as the physical disk HDDm predicted to have the highest risk of future failure.

As for what counts as "slight", a threshold may be set in advance for the "difference", and a physical disk whose difference exceeds that threshold may be predicted to be the physical disk HDDm with the highest risk of future failure. As for which of the operating states (operating time, response speed, number of registered alternate sectors) is used for the determination, the operating states may each be weighted, or a priority order may be set among them. At least one of operating time, response speed, and number of registered alternate sectors may be used as the determination condition, or two or more may be combined.
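One possible reading of this relative-difference test, with a threshold and per-metric weights, is sketched below in Python. The scoring formula (each disk's excess over the mean of the other disks, normalised by the largest value in the array), the function names, and all numbers are assumptions for illustration; the document does not define a formula.

```python
def risk_scores(stats, weights):
    """Score each disk by how far it exceeds the other disks on each weighted metric.

    `stats` maps disk name -> {metric: value}, where larger values mean worse.
    """
    scores = {}
    for disk, own in stats.items():
        score = 0.0
        for metric, weight in weights.items():
            others = [s[metric] for name, s in stats.items() if name != disk]
            peak = max(s[metric] for s in stats.values())
            if peak > 0:
                # Excess over the mean of the other disks, scaled into [0, 1].
                excess = own[metric] - sum(others) / len(others)
                score += weight * max(0.0, excess) / peak
        scores[disk] = score
    return scores


def predict_highest_risk(stats, weights, threshold):
    """Return the disk with the highest score, or None if no score exceeds the threshold."""
    scores = risk_scores(stats, weights)
    disk = max(scores, key=scores.get)
    return disk if scores[disk] > threshold else None


# Hypothetical weights and readings; HDD2 is slightly slower and has more alternate sectors.
weights = {"operating_hours": 1.0, "response_ms": 1.0, "alternate_sectors": 1.0}
stats = {
    "HDD1": {"operating_hours": 1000, "response_ms": 8.0, "alternate_sectors": 2},
    "HDD2": {"operating_hours": 1000, "response_ms": 9.0, "alternate_sectors": 6},
    "HDD3": {"operating_hours": 1000, "response_ms": 8.0, "alternate_sectors": 2},
}
predicted = predict_highest_risk(stats, weights, threshold=0.1)
```

Setting a weight to zero drops that metric from the decision, and a priority order could be approximated by giving the preferred metrics much larger weights.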

Next, the failed HDD prediction unit 213 determines whether the physical disk HDDm identified this time as having a high failure risk has a higher failure risk than the physical disk HDDp (p = 1, 2, or 3) identified last time as having a high failure risk (step S20). Whether the failure risk is higher or lower may be determined from the magnitudes of the respective operating times, response speeds, numbers of registered alternate sectors, and so on. If the failure risk of the physical disk HDDm identified this time is lower than that of the physical disk HDDp identified last time (NO in step S20), the process ends without taking any action.

On the other hand, if the physical disk HDDm identified this time has a higher failure risk than the physical disk HDDp identified last time (YES in step S20), the HDD copy control unit 214 copies the data of the physical disk HDDm identified this time as having the highest risk of failure to the standby disk SHDD (step S22).

For example, suppose the physical disk HDD3 is identified this time as the physical disk HDDm with a high failure risk. If the failure risk of the physical disk HDD3 is higher than that of the previously identified physical disk HDD2, for instance because its operating time is longer, its response speed is slower, or its number of registered alternate sectors is larger, the target is switched from the physical disk HDD2 to the physical disk HDD3 as shown in FIG. 6, and the data of the physical disk HDD3, identified this time as having the highest risk of failure, is copied to the standby disk SHDD. The process then ends.
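The periodic flow of FIG. 2 (steps S14 to S22) can be sketched in Python as follows. This is an interpretation under stated assumptions: each disk's risk is already reduced to a single score (higher is worse), the previously identified disk is compared using its score in the current snapshot, and the "arbitrary" disk of step S16 is taken to be the disk already mirrored, or the first by name if none is. The function name `periodic_check` and the `copy_to_standby` callback are hypothetical.

```python
def periodic_check(scores, threshold, mirrored, copy_to_standby):
    """Run one pass of the periodic flow and return the disk mirrored on the standby.

    `scores` maps disk name -> risk score; `mirrored` is the disk currently
    copied to the standby disk, or None.
    """
    candidate = max(scores, key=scores.get)
    if scores[candidate] <= threshold:
        # S14 NO: no meaningful difference, so any disk may be mirrored (S16).
        chosen = mirrored if mirrored in scores else sorted(scores)[0]
        if chosen != mirrored:
            copy_to_standby(chosen)
        return chosen
    # S18: `candidate` is the disk with the highest failure risk this time.
    if mirrored in scores and scores[candidate] <= scores[mirrored]:
        return mirrored                 # S20 NO: keep the previous disk
    copy_to_standby(candidate)          # S22: switch the standby to the new disk
    return candidate


copies = []                             # records which disks were copied, in order
current = None
for weekly_scores in (
    {"HDD1": 0.0, "HDD2": 0.0, "HDD3": 0.0},   # no difference
    {"HDD1": 0.0, "HDD2": 0.5, "HDD3": 0.2},   # HDD2 stands out
    {"HDD1": 0.0, "HDD2": 0.5, "HDD3": 0.4},   # HDD2 still the worst
    {"HDD1": 0.0, "HDD2": 0.5, "HDD3": 0.9},   # HDD3 overtakes HDD2
):
    current = periodic_check(weekly_scores, 0.1, current, copies.append)
```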

FIG. 3 is a flowchart for explaining the normal operation of the RAID controller 20 according to the present embodiment. FIGS. 7 to 10 are schematic diagrams showing disk control operations performed by the RAID controller 20 according to the present embodiment.

The flowchart shown in FIG. 3 is executed when a write request arrives from the host OS 10, an application, or the like. In the RAID controller 20, when such a write request arrives, the HDD copy control unit 214 writes the data to the physical disk HDDm identified as having a high risk of failure and also writes the same data to the standby disk SHDD (step S30). For example, if the physical disk with the highest risk of failure is the physical disk HDD2, the data is written to the physical disk HDD2 and the same data is also written to the standby disk SHDD, as shown in FIG. 7.

Next, the RAID controller 20 determines whether the physical disk HDDm identified as having a high risk of failure has actually failed (step S32). If the physical disk HDDm has not yet failed (NO in step S32), the process ends without taking any action.

On the other hand, if the physical disk HDDm identified as having a high risk of failure has actually failed (YES in step S32), the RAID controller 20 detaches the failed physical disk HDDm and newly incorporates into the hard disk device (RAID) the standby disk SHDD, to which the same data as the failed physical disk HDDm has been written (step S34). For example, as shown in FIG. 8, if the physical disk HDD2 identified as having a high risk of failure actually fails, the failed physical disk HDD2 is detached, and the standby disk SHDD holding the same data is incorporated into the RAID as the physical disk HDD2. Rebuild work therefore becomes unnecessary, which prevents system downtime and data loss. The process then ends.
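The write path and failover of FIG. 3 (steps S30 to S34) can be modelled with a small in-memory Python sketch. The class `MirroredArray`, its methods, and the dictionary-based block storage are hypothetical simplifications; parity handling and the rebuild path are deliberately left out.

```python
class MirroredArray:
    """Toy model: member disks plus one standby that shadows the at-risk disk."""

    def __init__(self, members):
        self.disks = {name: {} for name in members}   # name -> {block: data}
        self.standby = {}                              # contents of the standby disk
        self.at_risk = None                            # disk currently mirrored

    def mark_at_risk(self, name):
        """Start mirroring `name` by copying its current contents to the standby."""
        self.at_risk = name
        self.standby = dict(self.disks[name])

    def write(self, name, block, data):
        self.disks[name][block] = data
        if name == self.at_risk:
            self.standby[block] = data                 # S30: same data to the standby

    def handle_failure(self, name):
        """Return True if the standby replaced the failed disk without a rebuild."""
        if name != self.at_risk:
            return False                               # a rebuild would be needed
        self.disks[name] = self.standby                # S34: standby joins the array
        self.standby = {}
        self.at_risk = None
        return True


array = MirroredArray(["HDD1", "HDD2", "HDD3"])
array.write("HDD2", 0, b"a")
array.mark_at_risk("HDD2")
array.write("HDD2", 1, b"b")        # mirrored to the standby
array.write("HDD1", 0, b"x")        # not mirrored
standby_before = dict(array.standby)
swapped = array.handle_failure("HDD2")
```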

Note that if a physical disk other than the physical disk HDD2 predicted by the failed HDD prediction unit 213 to have a high failure risk fails, for example the physical disk HDD1, the data is written to the standby disk SHDD from the remaining physical disks HDD2 and HDD3, as shown in FIG. 9; that is, so-called rebuild processing is performed. In this case, rebuild processing is required.
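For context, the rebuild from the remaining disks relies on RAID 5 parity: within a stripe, the parity block is the bytewise XOR of the data blocks, so any one missing block equals the XOR of the others. The Python sketch below shows this for a single stripe of a three-disk array. The helper name `xor_blocks` and the sample bytes are illustrative, and the sketch ignores how parity rotates across disks from stripe to stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: two data blocks and their parity, each on a different disk.
block_on_hdd1 = b"\x01\x02\x03\x04"
block_on_hdd2 = b"\x10\x20\x30\x40"
parity_on_hdd3 = xor_blocks([block_on_hdd1, block_on_hdd2])

# If HDD1 fails, its block is recomputed from HDD2 and HDD3 and written to the standby.
rebuilt_for_standby = xor_blocks([block_on_hdd2, parity_on_hdd3])
```

Every stripe of the failed disk has to be recomputed this way, which is why rebuild time grows with disk capacity.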

However, as described above, in this embodiment the physical hard disk that shows a relative difference from the other physical disks in at least one operating state, such as an even slightly longer operating time, slower response speed, higher error rate, or larger number of registered alternate sectors, is identified as the physical disk HDDm with the highest risk of failure. Therefore, compared with a method whose only criterion is that a single fixed threshold has been exceeded, this embodiment can more accurately predict the physical disk HDDm with the highest risk of future failure, so situations that require rebuild processing are unlikely to arise.

When the failed physical disk HDD2 is replaced with a new physical disk HDD2new, the physical disk HDD2new is automatically set as the standby disk SHDD, as shown in FIG. 10. After the standby disk SHDD has been set, the RAID controller 20 executes the flowchart shown in FIG. 2, identifies the physical disk HDDm with the highest risk of future failure from among the physical disks HDD1 to HDD3, and copies its data to the standby disk SHDD (HDD2new).

According to the embodiment described above, a physical hard disk that shows a relative difference from the other physical disks in at least one operating state, such as an even slightly longer operating time, slower response speed, or larger number of registered alternate sectors, is identified as the physical disk HDDm with the highest risk of failure, so the physical disk most likely to fail in the future can be predicted more accurately. In addition, the data of the physical disk HDDm with the highest risk of failure is copied to the standby disk SHDD before the failure occurs, and if the disk actually fails, the standby disk SHDD is incorporated into the RAID in place of the failed physical disk, so the rebuild time can be shortened.

FIG. 11 is a block diagram showing the minimum configuration of the storage medium management device according to the present embodiment.
The storage medium management device 50 according to the present embodiment need only comprise at least a plurality of storage media 52 to 54 constituting a RAID array 51, a standby storage medium 55, collection means 56, operating state determination means 57, prediction means 58, and copy control means 59. The plurality of storage media 52 to 54 constitute a RAID array (RAID 5). The collection means 56 collects the operating states of the plurality of storage media 52 to 54. The operating state determination means 57 determines whether there is a relative difference among the collected operating states of the plurality of storage media 52 to 54. The prediction means 58 predicts the storage medium with the highest failure risk based on the determination result of the operating state determination means 57. The copy control means 59 copies the data of the storage medium predicted by the prediction means 58 to have the highest failure risk to the standby storage medium 55.

Note that a program for realizing the functions of the processing units of the present invention may be recorded on a computer-readable recording medium, and the control processing of the benefit information may be performed by having a computer system read and execute the program recorded on that recording medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" may also include a plurality of computer devices connected via a network including communication lines such as the Internet, a WAN, a LAN, or a dedicated line. A "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. A "computer-readable recording medium" further includes a medium that holds the program for a certain period of time, such as volatile memory (RAM) inside a computer system that acts as a server or client when the program is transmitted over a network. The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

Some or all of the functions described above may also be realized as an integrated circuit such as an LSI (Large Scale Integration). Each of the functions may be implemented as an individual processor, or some or all of them may be integrated into a single processor. The method of circuit integration is not limited to LSI and may be realized with a dedicated circuit or a general-purpose processor. If advances in semiconductor technology give rise to an integrated circuit technology that replaces LSI, an integrated circuit based on that technology may be used.

1 Server system
10 OS
11 RAID driver
12 RAID management utility
20 RAID controller
21 RAID firmware
211 HDD information collection unit
212 Statistical information analysis unit
213 Failed HDD prediction unit
214 HDD copy control unit
30 Disk array
HDD1 to HDD3 Physical disks
SHDD Standby disk

Claims

1. A storage medium management device comprising:
collection means for collecting the operating states of a plurality of storage media constituting a RAID array;
operating state determination means for determining whether there is a relative difference among the operating states of the plurality of storage media collected by the collection means;
prediction means for predicting the storage medium with the highest failure risk based on the determination result of the operating state determination means; and
copy control means for copying the data of the storage medium predicted by the prediction means to have the highest failure risk to a standby storage medium.

2. The storage medium management device according to claim 1, wherein, when there is a request to write data, the copy control means writes the data to the predicted storage medium and also writes the same data to the standby storage medium.

3. The storage medium management device according to claim 2, wherein, when the storage medium with the highest failure risk fails, the copy control means incorporates the standby storage medium as one of the plurality of storage media in place of the storage medium with the highest failure risk.

4. The storage medium management device according to claim 3, wherein, when the operating state determination means determines that there is no difference in the operating states, the copy control means copies the data of an arbitrary storage medium among the plurality of storage media to the standby storage medium.

5. The storage medium management device described above, wherein the operating state includes at least one of the operating time, response speed, error rate, and number of registered alternate sectors of each of the plurality of storage media.

6. A storage medium management method for managing a plurality of storage media constituting a RAID array, the method comprising:
collecting the operating states of the plurality of storage media;
determining whether there is a relative difference among the collected operating states of the plurality of storage media;
predicting the storage medium with the highest failure risk based on the result of the determination; and
copying the data of the storage medium predicted to have the highest failure risk to a standby storage medium.

7. A storage medium management program that causes a computer of a storage medium management device that manages a plurality of storage media constituting a RAID array to execute:
collecting the operating states of the plurality of storage media;
determining whether there is a relative difference among the collected operating states of the plurality of storage media;
predicting the storage medium with the highest failure risk based on the result of the determination; and
copying the data of the storage medium predicted to have the highest failure risk to a standby storage medium.