JP2019036163A

JP2019036163A - Storage control device and control program

Info

Publication number: JP2019036163A
Application number: JP2017157531A
Authority: JP
Inventors: 康太郎仁村; Kotaro Nimura; 惇猪頭; Jun Ito; 康寛小笠原; Yasuhiro Ogasawara; 麻理恵安部; Marie Abe; 洋今村; Hiroshi Imamura
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-08-17
Filing date: 2017-08-17
Publication date: 2019-03-07
Anticipated expiration: 2037-08-17
Also published as: US10606490B2; JP6965626B2; US20190056875A1

Abstract

To detect a storage device in a latent failure state. SOLUTION: A storage control device 101 acquires performance information (for example, performance information 110) representing a load state and a response state of each of one or more storage devices D. The one or more storage devices D can be accessed in response to a request from a host device 102. The storage control device 101 detects a storage device D whose load is lower than a first threshold and whose response time is equal to or longer than a second threshold among the one or more storage devices D on the basis of the acquired performance information. SELECTED DRAWING: Figure 1

Description

The present invention relates to a storage control device and a control program.

Conventionally, there is a recovery process called Redundant Copy. In Redundant Copy, signs of failure are detected using statistical point-addition processing or the like, and data is migrated in the background from the suspect disk to a replacement disk (hot spare).

As prior art, for example, there is a disk array device that deducts points when a failure occurs, also deducts points on a response delay in which the command processing time exceeds a processing-time reference value without amounting to a failure, and degenerates the defective component when the points fall below a first point reference value. There is also a technique that, when a failure of virtualized storage is detected, examines the range affected by the failure, identifies the devices that need to be dealt with, determines a migration destination device suited to attributes of those devices such as performance and reliability, and instructs the virtualization storage to migrate the devices. There is also a technique that reconfigures a degraded data storage array to include an unassigned data storage device when it is determined that the degraded array can be restored to the best reliability, best performance, and best efficiency by using the unassigned data storage device. There is also a technique that, while a disk storage device is in a standby state with no access from the host, performs a predetermined test of the functions of the disk storage device, preferably at least one of a read test, a write servo test, and a write test. There is also a technique in which the slave-side disk device monitors the exchange of input/output processing information (events) over the data bus between the master-side disk device and the host device, collects and stores that information in itself, and reproduces the stored event information within itself.

JP 2004-252692 A; JP 2005-326935 A; JP 2007-200299 A; JP 2001-5616 A; JP 2003-150326 A

However, with the prior art it is difficult to find a storage device in a latent failure state, that is, one whose operation has slowed down even though no response timeout or medium error has occurred.

In one aspect, an object of the present invention is to detect a storage device in a latent failure state.

In one embodiment, there is provided a storage control device that controls one or more storage devices accessed in response to requests from a host device. The storage control device acquires performance information representing the load status and response status of each of the one or more storage devices and, based on the acquired performance information, detects among the one or more storage devices a storage device whose load is lower than a first threshold and whose response time is equal to or greater than a second threshold.

According to one aspect of the present invention, a storage device in a latent failure state can be detected.

FIG. 1 is an explanatory diagram of an example of the storage control device 101 according to the embodiment.
FIG. 2 is an explanatory diagram showing a system configuration example of the storage system 200.
FIG. 3 is a block diagram showing a hardware configuration example of the storage control device 101.
FIG. 4 is an explanatory diagram showing an example of the contents stored in the performance information table 220.
FIG. 5 is an explanatory diagram showing an example of the contents stored in the configuration table 230.
FIG. 6 is a block diagram showing a functional configuration example of the storage control device 101.
FIG. 7 is an explanatory diagram showing an example of the specific processing contents of Redundant Copy.
FIG. 8 is a flowchart (part 1) showing an example of the first latent failure detection processing procedure of the storage control device 101.
FIG. 9 is a flowchart (part 2) showing an example of the first latent failure detection processing procedure of the storage control device 101.
FIG. 10 is a flowchart showing an example of the second latent failure detection processing procedure of the storage control device 101.
FIG. 11 is a flowchart showing an example of the specific processing procedure of the new diagnosis processing.

Embodiments of a storage control device and a control program according to the present invention will be described below in detail with reference to the drawings.

(Embodiment)
FIG. 1 is an explanatory diagram of an example of the storage control device 101 according to the embodiment. In FIG. 1, the storage control device 101 is a computer that processes requests for the storage 103 from the host device 102. The host device 102 is a computer that performs information processing, for example, a business server that performs business processing. A request for the storage 103 is, for example, an I/O (Input/Output) request for the storage 103.

The storage 103 includes one or more storage devices D (storage devices D1 to D3 in the example of FIG. 1) that store data. A storage device D is, for example, a hard disk, an optical disk, or a flash memory. The storage control device 101 is applied, for example, to a storage apparatus having a RAID (Redundant Arrays of Inexpensive Disks) configuration.

Redundant Copy is a recovery process that is performed when a sign of disk failure is detected in a storage apparatus. In Redundant Copy, when a sign of disk failure is detected, data is migrated in the background from the suspect disk to a replacement disk (hot spare).

Suspect disks are detected using, for example, statistical point-addition processing. In statistical point-addition processing, points are added for each disk device (for example, a storage device D) every time a response timeout or a medium error occurs, and a disk device whose statistical point value exceeds a threshold within the monitoring period is detected as a suspect disk.
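The statistical point-addition processing described above can be sketched as follows. This is only an illustrative sketch: the class name `DiskScore`, the point values, and the threshold are hypothetical, since the description gives no concrete numbers.

```python
from dataclasses import dataclass

# Hypothetical point values and threshold; the description gives no numbers.
POINTS = {"response_timeout": 10, "medium_error": 20}
SCORE_THRESHOLD = 50


@dataclass
class DiskScore:
    """Statistical point value of one disk device within a monitoring period."""

    disk_id: str
    score: int = 0

    def record_error(self, kind: str) -> None:
        # Only severe errors listed in POINTS add points; anything else adds 0.
        self.score += POINTS.get(kind, 0)

    def is_suspect(self) -> bool:
        # The disk is treated as suspect once its score exceeds the threshold.
        return self.score > SCORE_THRESHOLD

    def reset(self) -> None:
        # Start a new monitoring period.
        self.score = 0
```

In this sketch an event such as a slow but successful response adds no points, which is the blind spot the embodiment addresses.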

As a function for periodically diagnosing the disk devices in a storage apparatus, there is also what is called patrol diagnosis processing. In patrol diagnosis processing, I/O commands for data input/output are issued to all disk devices (including hot spares), asynchronously with I/O requests from the host (for example, the host device 102), to perform failure diagnosis.

The main purpose of patrol diagnosis processing is to prevent data loss and data corruption caused by a double failure, by detecting disk device errors early and detaching the failed disk. However, patrol diagnosis processing takes time. For example, for a 4 TB disk, diagnosing the entire area of the disk takes about two weeks. In patrol diagnosis processing as well, suspect disks are detected using, for example, statistical point-addition processing.

However, in statistical point-addition processing, only high-severity errors such as response timeouts and medium errors are subject to point addition. Therefore, with statistical point-addition processing it is difficult to find a disk device in a latent failure state (a candidate for preventive replacement) whose operation has slowed down even though no response timeout or medium error has occurred.

For example, a response to a disk device access made for an I/O request from the host normally completes in a few milliseconds but may take tens to hundreds of milliseconds. A response that takes several seconds (for example, 5 seconds or more) is subject to statistical point addition, but one that takes tens to hundreds of milliseconds (for example, less than 5 seconds) is not. Even so, if delays that are not subject to statistical point addition (for example, delays of less than 5 seconds) occur routinely, the disk device slows down and the response performance seen by the host degrades.

Factors that put a disk device into a latent failure state include aging of the disk device, damage from external causes, and minute dust or ruts in the lubricant on the disk. For example, when a read fails because of minute dust on the disk and a retry occurs, the response may take a long time even though no response timeout or medium error occurs, provided the data can eventually be read.

To detect a disk device that is slowing down, one could tighten the conditions for the events subject to statistical point addition. For example, setting a lower threshold for detecting a delay error makes it possible to detect, as errors, delays that cause a slowdown. However, merely lowering the error detection threshold does not make it possible to distinguish a response degradation caused by a busy state due to access concentration from one caused by a latent failure state.

Therefore, the present embodiment describes a storage control device 101 that detects a storage device D in a latent failure state, that is, one that has slowed down even though no response timeout or medium error has occurred. A processing example of the storage control device 101 is described below.

(1) The storage control device 101 acquires performance information representing the load status and response status of each of one or more storage devices D accessed in response to I/O requests from the host device 102. Here, the load status of a storage device D represents the load imposed by accesses and is expressed, for example, by a busy rate. The busy rate is an index value (unit: %) indicating the load status of the storage device D over a predetermined period (for example, the most recent hour).

The response status of a storage device D is expressed by the response time (unit: seconds) from when an access command is issued to the storage device D until a response is received. In the example of FIG. 1, performance information 110 representing the load status and response status of each of the storage devices D1 to D3 in the storage 103 is acquired.

(2) Based on the acquired performance information, the storage control device 101 detects, among the one or more storage devices D, a storage device D whose load is lower than the first threshold and whose response time is equal to or greater than the second threshold. The first and second thresholds can be set arbitrarily.

The first threshold is set to a value such that a storage device D can be judged to be in a high-load state when its load is equal to or greater than the first threshold. The high-load state is, for example, a busy state due to access concentration. More specifically, when the load status of the storage device D is expressed by a busy rate, the first threshold is set to a value of about 50%.

The second threshold is a value lower than the timeout value for the storage device D. The timeout value is the value (response time) used to determine a response timeout (I/O timeout). Specifically, for example, the second threshold is lower than the value used to determine a response timeout in the statistical point-addition processing or patrol diagnosis processing for the storage device D. As an example, when the value used to determine a response timeout is 5 seconds, the second threshold is set to a value of about 2 seconds.

In the example of FIG. 1, based on the acquired performance information 110, the storage control device 101 detects, among the storage devices D1 to D3 in the storage 103, a storage device D whose load is lower than the first threshold and whose response time is equal to or greater than the second threshold. Here, assume that among the storage devices D1 to D3, the load of the storage device D3 is lower than the first threshold and its response time is equal to or greater than the second threshold. In this case, the storage device D3 is detected.
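The detection condition of steps (1) and (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `PerformanceInfo` and `detect_latent_failures` are hypothetical, the busy rate stands in for the load, the thresholds use the example values from the description (about 50% and about 2 seconds against a 5-second timeout), and the sample values for D1 to D3 are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List

FIRST_THRESHOLD_BUSY_PCT = 50.0  # load at or above this counts as high load
SECOND_THRESHOLD_SEC = 2.0       # must be lower than the timeout value
TIMEOUT_SEC = 5.0                # example response-timeout value


@dataclass(frozen=True)
class PerformanceInfo:
    device_id: str
    busy_rate_pct: float      # load status
    response_time_sec: float  # response status


def detect_latent_failures(
    infos: Iterable[PerformanceInfo],
    first_threshold: float = FIRST_THRESHOLD_BUSY_PCT,
    second_threshold: float = SECOND_THRESHOLD_SEC,
    timeout: float = TIMEOUT_SEC,
) -> List[str]:
    """Return IDs of devices that respond slowly although their load is low."""
    if second_threshold >= timeout:
        raise ValueError("second threshold must be lower than the timeout value")
    return [
        info.device_id
        for info in infos
        if info.busy_rate_pct < first_threshold
        and info.response_time_sec >= second_threshold
    ]


# Made-up sample corresponding to performance information 110.
performance_info_110 = [
    PerformanceInfo("D1", busy_rate_pct=20.0, response_time_sec=0.005),  # healthy
    PerformanceInfo("D2", busy_rate_pct=80.0, response_time_sec=2.5),    # slow, but busy
    PerformanceInfo("D3", busy_rate_pct=10.0, response_time_sec=2.5),    # slow while idle
]
detected = detect_latent_failures(performance_info_110)
```

With these sample values only D3 satisfies both conditions; D2 has the same response time but is excluded because its load is at or above the first threshold.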

In this way, the storage control device 101 can detect, among the one or more storage devices D accessed in response to I/O requests from the host device 102, a storage device D whose load is lower than the first threshold and whose response time is equal to or greater than the second threshold. This makes it possible to find early a storage device D in a latent failure state, one that has slowed down even though no response timeout or medium error has occurred. It also prevents a storage device D whose response has degraded because of a busy state from being erroneously detected as a storage device D in a latent failure state.

In the example of FIG. 1, the storage device D3 is detected as a storage device D in a latent failure state. Therefore, for example, by performing Redundant Copy on the storage device D3, it is possible to detach the storage device D3, which suffers from a malfunction that affects operation because of a subtle defect that statistical point-addition processing cannot detect. This suppresses the degradation in the response performance of the storage 103 as a whole that would otherwise result from the performance degradation of the storage device D3 in the latent failure state.

(System configuration example of the storage system 200)
Next, a case where the storage control device 101 shown in FIG. 1 is applied to the storage system 200 will be described. The storage system 200 is a redundant system using, for example, RAID 5 or RAID 6.

FIG. 2 is an explanatory diagram showing a system configuration example of the storage system 200. In FIG. 2, the storage system 200 includes a storage apparatus 201 and a host device 202. In the storage system 200, the storage apparatus 201 and the host device 202 are connected via a wired or wireless network 210. The network 210 is, for example, a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet.

The storage apparatus 201 includes the storage control device 101 and a storage ST. The storage ST includes a plurality of HDDs (Hard Disk Drives). However, SSDs (Solid State Drives) may be used instead of HDDs. The storage ST includes one or more hot spares HS. A hot spare HS is a replacement HDD.

In the storage ST, for example, a RAID group is created from one or more HDDs. In the example of FIG. 2, a RAID group G1 is created from HDD1 to HDD4, and a RAID group G2 is created from HDD5 to HDD8. The storage 103 shown in FIG. 1 corresponds, for example, to the storage ST.

The storage control device 101 can access each HDD in the storage ST and processes I/O requests for the storage ST from the host device 202. The storage control device 101 has configuration information and allocation information (neither is shown). The configuration information stores, for example, various management information about the logical volumes created in the storage system 200 and the disks constituting the RAID groups. The allocation information stores, for example, information for each allocation unit (chunk) in a thin provisioning configuration and the correspondence between logical addresses and physical addresses for allocated chunks.

The storage control device 101 also has a performance information table 220 and a configuration table 230. The contents stored in the performance information table 220 and the configuration table 230 will be described later with reference to FIGS. 4 and 5. In the storage system 200, the storage control device 101 and the host device 202 are connected by, for example, FC (Fibre Channel) or iSCSI (Internet Small Computer System Interface).

The host device 202 is a computer that issues I/O requests to the storage ST. Specifically, for example, the host device 202 requests reads and writes of data on a logical volume provided by the storage system 200. For example, the host device 202 is a business server that uses the storage system 200. The host device 102 shown in FIG. 1 corresponds, for example, to the host device 202.

In the example of FIG. 2, only one storage control device 101 and one host device 202 are shown, but the storage system 200 may include a plurality of storage control devices 101 and host devices 202. Also, in the example of FIG. 2, the RAID groups G1 and G2 are created in the storage ST, but one RAID group, or three or more RAID groups, may be created.

(Hardware configuration example of the storage control device 101)
FIG. 3 is a block diagram showing a hardware configuration example of the storage control device 101. In FIG. 3, the storage control device 101 includes a CPU (Central Processing Unit) 301 that is a processor, a memory 302, a communication I/F (Interface) 303, and an I/O controller 304. These components are connected to one another by a bus 300.

Here, the CPU 301 governs overall control of the storage control device 101. The memory 302 includes, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), and a flash ROM. Specifically, for example, the flash ROM and the ROM store various programs, and the RAM is used as a work area for the CPU 301. A program stored in the memory 302 is loaded into the CPU 301 and causes the CPU 301 to execute the coded processing.

The communication I/F 303 is connected to the network 210 through a communication line and is connected to an external device (for example, the host device 202 shown in FIG. 2) via the network 210. The communication I/F 303 serves as the interface between the network 210 and the inside of the device and controls the input and output of data to and from the external device. The I/O controller 304 accesses the storage ST (see FIG. 2) under the control of the CPU 301.

(Contents stored in the performance information table 220)
Next, the contents stored in the performance information table 220 of the storage control device 101 will be described. The performance information table 220 is realized by, for example, the memory 302 shown in FIG. 3.

FIG. 4 is an explanatory diagram showing an example of the contents stored in the performance information table 220. In FIG. 4, the performance information table 220 has fields for RAID group ID, disk ID, command issue count, waiting command count, busy rate, and response time. By setting information in each field, performance information 400-1 to 400-8 is stored as records.

Here, the RAID group ID is an identifier that uniquely identifies a RAID group in the storage ST (see FIG. 2). The disk ID is an identifier that uniquely identifies an HDD (disk device) in the RAID group identified by the RAID group ID. The command issue count (Que-in-proq) is the number of access commands (write commands and read commands) currently issued to the HDD identified by the disk ID. The upper limit of the command issue count is, for example, 30.

The waiting command count (Que-wait) is the number of access commands waiting to be issued to the HDD. A priority is set for each access command. As the priority, for example, one of High, Normal, and Low is set. The priority increases in the order Low, Normal, High. Access commands with a higher priority are processed preferentially.
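The priority handling of waiting access commands can be sketched as follows. This is an illustration under assumptions: the names `Priority` and `WaitQueue` are hypothetical, and first-in first-out ordering among commands of equal priority is an assumption that the description does not state.

```python
import heapq
import itertools
from enum import IntEnum


class Priority(IntEnum):
    LOW = 0
    NORMAL = 1
    HIGH = 2


class WaitQueue:
    """Access commands waiting to be issued; higher priority is issued first."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: arrival order (assumption)

    def push(self, command: str, priority: Priority) -> None:
        # heapq is a min-heap, so the priority is negated to pop High first.
        heapq.heappush(self._heap, (-int(priority), next(self._seq), command))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        # Corresponds to the waiting command count (Que-wait).
        return len(self._heap)
```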

The busy rate is an index value (unit: %) indicating the load status of the HDD over the most recent hour. For example, the busy rate is calculated taking into account the waiting command count of the HDD and the processing capability of the HDD (such as its rotation speed). A busy rate of 0% indicates that there has been no access to the HDD in the most recent hour. A busy rate of less than 50% indicates that the access load on the HDD in the most recent hour is in a normal state. A busy rate of 50% or more indicates that the access load on the HDD in the most recent hour is in a high-load state.
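The interpretation of the busy rate given above can be sketched as follows. The description does not give the formula for computing the busy rate, so this sketch covers only how a given value is classified; the function name `classify_load` and the returned labels are hypothetical.

```python
def classify_load(busy_rate_pct: float) -> str:
    """Classify the load status of an HDD from its busy rate (unit: %)."""
    if not 0.0 <= busy_rate_pct <= 100.0:
        raise ValueError("busy rate must be between 0 and 100")
    if busy_rate_pct == 0.0:
        return "no access"   # no access in the most recent hour
    if busy_rate_pct < 50.0:
        return "normal"      # normal state
    return "high load"       # 50% or more: high-load state
```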

The response time is the time (unit: seconds) from when an access command is issued to the HDD until a response is received. For example, the response time may be the response time of the most recent access command, or it may be the average of the response times of the past several access commands.
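The two ways of deriving the response time mentioned above, the most recent value or an average over the past several access commands, can be sketched as follows. The class name `ResponseTimeTracker` and the window of three commands are assumptions made for illustration.

```python
from collections import deque


class ResponseTimeTracker:
    """Keeps the response times of the most recent access commands."""

    def __init__(self, window: int = 3) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        # A bounded deque discards the oldest sample automatically.
        self._samples = deque(maxlen=window)

    def record(self, response_time_sec: float) -> None:
        self._samples.append(response_time_sec)

    def latest(self) -> float:
        if not self._samples:
            raise ValueError("no response time recorded yet")
        return self._samples[-1]

    def average(self) -> float:
        if not self._samples:
            raise ValueError("no response time recorded yet")
        return sum(self._samples) / len(self._samples)
```

Using the average rather than the most recent value makes a single slow command less likely to dominate the response time, at the cost of reacting more slowly to a real slowdown.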

The performance information table 220 may also hold, for the most recent several access commands (for example, the last three), the command issue count at the time of issue, the waiting command count, and the priority of the access command. The performance information table 220 is updated, for example, periodically or at a predetermined timing. The predetermined timing is, for example, the timing at which an I/O request from the host device 202 is processed or the timing at which a diagnostic command described later is executed.

(Contents stored in the configuration table 230)
Next, the contents stored in the configuration table 230 of the storage control device 101 will be described. The configuration table 230 is realized by, for example, the memory 302 shown in FIG. 3.

FIG. 5 is an explanatory diagram showing an example of the contents stored in the configuration table 230. In FIG. 5, the configuration table 230 has fields for RAID group ID, RAID status, disk ID, and check flag. By setting information in each field, configuration information 500-1 and 500-2 is stored as records.

Here, the RAID group ID is an identifier that uniquely identifies a RAID group in the storage ST (see FIG. 2). The RAID status indicates the state of the RAID group identified by the RAID group ID. As the RAID status, for example, one of Available, Rebuild, and Exposed is set. The RAID status "Available" indicates a state in which the data is redundant. The RAID status "Rebuild" indicates a state in which data redundancy is being restored. The RAID status "Exposed" indicates a state in which the data is not redundant.

ディスクＩＤは、ＲＡＩＤグループ内のＨＤＤを一意に識別する識別子である。チェックフラグは、ＨＤＤが診断対象であるか否かを示す。診断対象とは、後述する新診断処理の処理対象となるＨＤＤである。チェックフラグ「０」は、ＨＤＤが診断対象であることを示す。チェックフラグ「１」は、ＨＤＤが診断対象外であることを示す。チェックフラグは、初期状態では「０」である。 The disk ID is an identifier that uniquely identifies the HDD in the RAID group. The check flag indicates whether the HDD is a diagnosis target. The diagnosis target is an HDD that is a processing target of a new diagnosis process to be described later. The check flag “0” indicates that the HDD is a diagnosis target. The check flag “1” indicates that the HDD is not a diagnosis target. The check flag is “0” in the initial state.
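As an illustration only, the fields of the configuration table 230 described above could be modeled as follows. The class and function names (`RaidStatus`, `ConfigRecord`, `diagnosis_targets`) and the flat one-record-per-disk layout are assumptions made for this sketch and do not appear in the specification.

```python
from dataclasses import dataclass
from enum import Enum


class RaidStatus(Enum):
    AVAILABLE = "Available"  # data redundancy is present
    REBUILD = "Rebuild"      # data redundancy is being restored
    EXPOSED = "Exposed"      # no data redundancy


@dataclass
class ConfigRecord:
    # Hypothetical flat layout: one record per disk in a RAID group.
    raid_group_id: str
    raid_status: RaidStatus
    disk_id: str
    check_flag: int = 0  # 0 = diagnosis target (initial state), 1 = not a target


def diagnosis_targets(table):
    """Return the disk IDs whose check flag still marks them as diagnosis targets."""
    return [record.disk_id for record in table if record.check_flag == 0]


config_table = [
    ConfigRecord("G1", RaidStatus.AVAILABLE, "HDD1"),
    ConfigRecord("G1", RaidStatus.AVAILABLE, "HDD2", check_flag=1),
]
```

In this sketch a newly created record defaults to check flag 0, matching the initial state stated above.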

（ストレージ制御装置１０１の機能的構成例） (Functional configuration example of the storage control apparatus 101)
図６は、ストレージ制御装置１０１の機能的構成例を示すブロック図である。図６において、ストレージ制御装置１０１は、Ｉ／Ｏ処理部６０１と、取得部６０２と、検出部６０３と、診断部６０４と、復旧部６０５と、を含む。Ｉ／Ｏ処理部６０１〜復旧部６０５は制御部となる機能であり、具体的には、例えば、図３に示したメモリ３０２に記憶されたプログラムをＣＰＵ３０１に実行させることにより、または、通信Ｉ／Ｆ３０３、Ｉ／Ｏコントローラ３０４により、その機能を実現する。各機能部の処理結果は、例えば、メモリ３０２に記憶される。 FIG. 6 is a block diagram illustrating a functional configuration example of the storage control apparatus 101. In FIG. 6, the storage control apparatus 101 includes an I/O processing unit 601, an acquisition unit 602, a detection unit 603, a diagnosis unit 604, and a recovery unit 605. The I/O processing unit 601 through the recovery unit 605 function as a control unit. Specifically, for example, their functions are realized by causing the CPU 301 to execute a program stored in the memory 302 illustrated in FIG. 3, or by the communication I/F 303 and the I/O controller 304. The processing result of each functional unit is stored in, for example, the memory 302.

Ｉ／Ｏ処理部６０１は、ホスト装置２０２からのストレージＳＴに対するＩ／Ｏ要求を処理する。Ｉ／Ｏ要求は、ライト要求またはリード要求である。ライト要求は、例えば、ストレージシステム２００により提供される論理ボリュームに対してデータの書き込みを要求するものである。リード要求は、例えば、論理ボリュームに対してデータの読み込みを要求するものである。 The I / O processing unit 601 processes an I / O request for the storage ST from the host device 202. The I / O request is a write request or a read request. The write request is, for example, a request to write data to a logical volume provided by the storage system 200. The read request is a request for reading data from the logical volume, for example.

具体的には、例えば、Ｉ／Ｏ処理部６０１は、ホスト装置２０２からのＩ／Ｏ要求に応じて、ＲＡＩＤグループ内のＨＤＤに対してアクセスコマンドを発行し、当該アクセスコマンドに対する応答コマンドを受け取る。アクセスコマンドは、リードコマンドまたはライトコマンドである。 Specifically, for example, in response to an I/O request from the host device 202, the I/O processing unit 601 issues an access command to an HDD in the RAID group and receives a response command for that access command. The access command is a read command or a write command.

また、Ｉ／Ｏ処理部６０１は、ホスト装置２０２からのＩ／Ｏ要求に対する応答を行う。具体的には、例えば、Ｉ／Ｏ処理部６０１は、ホスト装置２０２からのライト要求に対するライト完了応答や、リード要求に対するリードデータをホスト装置２０２に通知する。 The I / O processing unit 601 responds to an I / O request from the host device 202. Specifically, for example, the I / O processing unit 601 notifies the host device 202 of a write completion response to the write request from the host device 202 and read data for the read request.

取得部６０２は、ストレージＳＴ内のＨＤＤの負荷状況およびレスポンス状況を表す性能情報を取得する。ここで、ＨＤＤの負荷状況は、アクセスにかかる負荷を表しており、例えば、ビジー率によって表される。ＨＤＤのレスポンス状況は、例えば、ＨＤＤに対してアクセスコマンドを発行してから応答があるまでのレスポンスタイムによって表される。 The acquisition unit 602 acquires performance information representing the load status and response status of the HDD in the storage ST. Here, the load status of the HDD represents the load applied to access, and is represented by, for example, a busy rate. The response status of the HDD is represented, for example, by a response time from when an access command is issued to the HDD until there is a response.

具体的には、例えば、取得部６０２は、ホスト装置２０２からのＩ／Ｏ要求が処理されたことに応じて、ストレージＳＴ内の各ＨＤＤの負荷状況およびレスポンス状況を表す性能情報を取得する。より詳細に説明すると、例えば、取得部６０２は、ＨＤＤのコマンド発行数とＨＤＤの処理能力（回転スピードなど）を考慮してビジー率を算出することにより、ＨＤＤの負荷状況を表す性能情報を取得することにしてもよい。 Specifically, for example, the acquisition unit 602 acquires performance information indicating the load status and response status of each HDD in the storage ST in response to processing of an I / O request from the host device 202. More specifically, for example, the acquisition unit 602 acquires performance information representing the load status of the HDD by calculating the busy rate in consideration of the number of HDD commands issued and the processing capacity (rotation speed, etc.) of the HDD. You may decide to do it.

また、取得部６０２は、ＨＤＤにアクセスコマンドを発行してから応答があるまでのレスポンスタイムを計測することにより、ＨＤＤのレスポンス状況を表す性能情報を取得することにしてもよい。この際、取得部６０２は、ＲＡＩＤグループ全体のレスポンスタイムを計測することにしてもよい。ＲＡＩＤグループ内のＨＤＤ間ではアクセスコマンドを発行してから応答があるまでの時間にはばらつきが生じる。ＲＡＩＤグループ全体のレスポンスタイムは、ＲＡＩＤグループ内のＨＤＤにアクセスコマンドを発行してから、最も遅い応答があるまでの時間に相当する。 Further, the acquisition unit 602 may acquire performance information representing the response status of the HDD by measuring a response time from when an access command is issued to the HDD until a response is received. At this time, the acquisition unit 602 may measure the response time of the entire RAID group. Variations occur in the time from when an access command is issued until there is a response between HDDs in the RAID group. The response time of the entire RAID group corresponds to the time from when the access command is issued to the HDD in the RAID group until the latest response is received.
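The relationship between per-HDD response times and the response time of the entire RAID group can be sketched as below. The function name and the dictionary layout are assumptions for illustration; the only rule taken from the text is that the group response time corresponds to the slowest response among the HDDs.

```python
def raid_group_response_time(per_hdd_response_times):
    """Return the response time of the whole RAID group in seconds.

    per_hdd_response_times maps a disk ID to the time from issuing the
    access command until that HDD responded. The group is only done once
    the slowest HDD has answered, so the group time is the maximum.
    """
    if not per_hdd_response_times:
        raise ValueError("at least one HDD response time is required")
    return max(per_hdd_response_times.values())
```

For example, with response times of 2.2 s, 0.4 s, and 0.5 s, the group response time in this sketch is 2.2 s.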

取得された性能情報は、例えば、図４に示した性能情報テーブル２２０に記憶される。これにより、ストレージ制御装置１０１は、ホスト装置２０２からのＩ／Ｏ要求に応じてアクセスされるＲＡＩＤグループ内のＨＤＤの負荷状況およびレスポンス状況を監視することができる。 The acquired performance information is stored in, for example, the performance information table 220 shown in FIG. 4. Thereby, the storage control device 101 can monitor the load status and response status of the HDDs in the RAID group accessed in response to an I/O request from the host device 202.

検出部６０３は、潜在故障ディスクを検出する。ここで、潜在故障ディスクとは、潜在故障状態のＨＤＤである。具体的には、例えば、検出部６０３は、取得部６０２によって取得された性能情報に基づいて、ストレージＳＴ内のＨＤＤのうち、負荷が閾値αより低く、かつ、レスポンスタイムが閾値β以上のＨＤＤを、潜在故障ディスクとして検出する。 The detection unit 603 detects a potential failure disk. Here, a potential failure disk is an HDD in a latent failure state. Specifically, for example, based on the performance information acquired by the acquisition unit 602, the detection unit 603 detects, among the HDDs in the storage ST, an HDD whose load is lower than the threshold value α and whose response time is equal to or greater than the threshold value β as a potential failure disk.

ここで、閾値α、閾値βは、任意に設定可能である。閾値αは、ＨＤＤの負荷が閾値α以上となると、ＨＤＤが高負荷（繁忙状態）であると判断できる値に設定される。例えば、ＨＤＤの負荷状況がビジー率によって表される場合、閾値αは、５０％程度の値に設定される。閾値αは、図１で説明した「第１の閾値」に相当する。 Here, the threshold value α and the threshold value β can be set arbitrarily. The threshold value α is set to a value such that the HDD can be judged to be under high load (busy state) when its load is equal to or greater than α. For example, when the load status of the HDD is represented by a busy rate, the threshold value α is set to about 50%. The threshold value α corresponds to the “first threshold value” described with reference to FIG. 1.

閾値βは、ＨＤＤについての統計加点処理やパトロール診断処理において、レスポンスタイムアウトを判断する値よりも低い値である。例えば、ＨＤＤのレスポンスタイムアウトを判断する値が「５秒」の場合、閾値βは、２秒程度の値に設定される。閾値βは、図１で説明した「第２の閾値」に相当する。 The threshold value β is lower than the value used to determine a response timeout in the statistical point addition process or the patrol diagnosis process for the HDD. For example, when the value for determining a response timeout of the HDD is “5 seconds”, the threshold value β is set to about 2 seconds. The threshold value β corresponds to the “second threshold value” described with reference to FIG. 1.

より具体的には、例えば、検出部６０３は、性能情報テーブル２２０を参照して、ＲＡＩＤグループ内のＨＤＤのうち、ビジー率が閾値αより低く、かつ、レスポンスタイムが閾値β以上のＨＤＤを、潜在故障ディスクとして検出する。ＲＡＩＤグループは、例えば、ホスト装置２０２からのＩ／Ｏ要求に応じてアクセスされたＲＡＩＤグループである。 More specifically, for example, the detection unit 603 refers to the performance information table 220 and detects, among the HDDs in the RAID group, an HDD whose busy rate is lower than the threshold value α and whose response time is equal to or greater than the threshold value β as a potential failure disk. The RAID group is, for example, a RAID group accessed in response to an I/O request from the host device 202.

一例として、閾値αを「５０％」とし、閾値βを「２秒」とする。また、ホスト装置２０２からのＩ／Ｏ要求に応じてＲＡＩＤグループＧ１へのアクセスがあった際のＨＤＤ１のビジー率ｂ１を「３０％」、レスポンスタイムｔ１を「２．２秒」とする。この場合、検出部６０３は、ＨＤＤ１のビジー率ｂ１が閾値αより低く、かつ、ＨＤＤ１のレスポンスタイムｔ１が閾値β以上のため、ＨＤＤ１を潜在故障ディスクとして検出する。また、ＨＤＤ２のビジー率ｂ２を「６０％」、レスポンスタイムｔ１を「３．２秒」とする。この場合、検出部６０３は、ＨＤＤ２のレスポンスタイムｔ２が閾値β以上であるものの、ＨＤＤ２のビジー率ｂ２が閾値α以上のため、ＨＤＤ２を潜在故障ディスクとして検出しない。すなわち、ＨＤＤ２は、繁忙状態に起因してレスポンスが低下していると判断される。 As an example, the threshold value α is set to “50%” and the threshold value β is set to “2 seconds”. Further, assume that when the RAID group G1 is accessed in response to an I/O request from the host device 202, the busy rate b1 of the HDD 1 is “30%” and its response time t1 is “2.2 seconds”. In this case, the detection unit 603 detects the HDD 1 as a potential failure disk because the busy rate b1 of the HDD 1 is lower than the threshold value α and the response time t1 of the HDD 1 is equal to or greater than the threshold value β. Further, assume that the busy rate b2 of the HDD 2 is “60%” and its response time t2 is “3.2 seconds”. In this case, although the response time t2 of the HDD 2 is equal to or greater than the threshold value β, the detection unit 603 does not detect the HDD 2 as a potential failure disk because the busy rate b2 of the HDD 2 is equal to or greater than the threshold value α. That is, the response of the HDD 2 is judged to have degraded because of its busy state.
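The detection condition in the example above can be expressed as a short sketch. The function and constant names are hypothetical; the threshold values of 50% and 2 seconds are the example values given in the text, and both thresholds are described as freely configurable.

```python
ALPHA_BUSY_RATE_PERCENT = 50.0  # example value of threshold alpha (first threshold)
BETA_RESPONSE_TIME_SEC = 2.0    # example value of threshold beta (second threshold)


def is_potential_failure(busy_rate, response_time,
                         alpha=ALPHA_BUSY_RATE_PERCENT,
                         beta=BETA_RESPONSE_TIME_SEC):
    """Return True when the load is below alpha and the response time is at least beta.

    A slow response under high load (busy_rate >= alpha) is attributed to
    the busy state and is therefore not reported.
    """
    return busy_rate < alpha and response_time >= beta


hdd1 = is_potential_failure(busy_rate=30.0, response_time=2.2)  # HDD 1 in the example
hdd2 = is_potential_failure(busy_rate=60.0, response_time=3.2)  # HDD 2 in the example
```

With the example values, `hdd1` evaluates to `True` and `hdd2` to `False`, which matches the outcome described for the HDD 1 and the HDD 2.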

ただし、ＲＡＩＤグループへのアクセスであっても、当該ＲＡＩＤグループ内の一部のＨＤＤへのアクセスが発生しないことがある。例えば、ＲＡＩＤ５では、ＲＡＩＤグループ内のＨＤＤにデータが分散して格納される。しかし、例えば、データサイズが小さいデータの場合、分割データもパリティデータも格納されないＨＤＤ、すなわち、アクセスされないＨＤＤが出てくることがある。このような事象は、ＲＡＩＤグループ内のＨＤＤ数が多くなるほど生じる可能性は高い。 However, even when accessing a RAID group, access to some HDDs in the RAID group may not occur. For example, in RAID 5, data is distributed and stored in HDDs in a RAID group. However, for example, in the case of data having a small data size, an HDD in which neither divided data nor parity data is stored, that is, an HDD that is not accessed may appear. Such an event is more likely to occur as the number of HDDs in the RAID group increases.

また、ホスト装置２０２のアクセス傾向によっては、ある期間全くアクセスされないＲＡＩＤグループが出てくることもある。このため、ホスト装置２０２からのＩ／Ｏ要求に応じて計測される性能だけでは、潜在故障状態となっているＨＤＤを判断できない場合がある。 Depending on the access tendency of the host device 202, a RAID group that is not accessed at all for a certain period may appear. For this reason, there are cases where it is not possible to determine an HDD that is in a potential failure state based only on the performance measured in response to an I / O request from the host device 202.

そこで、ストレージ制御装置１０１は、ストレージＳＴ内のＨＤＤのうち、アクセスがないと判断されるＨＤＤを診断対象ディスクとして抽出し、診断対象ディスクに対してダミーのアクセスを実行して性能診断を行う。以下の説明では、診断対象ディスクに対する診断処理を、既存のパトロール診断処理と区別するために、「新診断処理」と表記する場合がある。 Therefore, the storage control apparatus 101 extracts, from among the HDDs in the storage ST, HDDs that are judged not to be accessed as diagnosis target disks, and performs a performance diagnosis by executing dummy accesses to the diagnosis target disks. In the following description, the diagnosis process for a diagnosis target disk may be referred to as the “new diagnosis process” to distinguish it from the existing patrol diagnosis process.

診断部６０４は、取得された性能情報に基づいて、ストレージＳＴ内のＨＤＤのうち、診断対象ディスクを抽出する。ここで、診断対象ディスクとは、アクセスがないと判断されるＨＤＤである。具体的には、例えば、診断部６０４は、性能情報テーブル２２０を参照して、ストレージＳＴ内のＨＤＤのうち、ビジー率が０％のＨＤＤを、アクセスがないＨＤＤと判断する。そして、診断部６０４は、アクセスがないと判断したＨＤＤを診断対象ディスクとして抽出する。ただし、診断部６０４は、ストレージＳＴ内のＨＤＤのうち、ビジー率が所定値以下（例えば、５％以下）のＨＤＤを、アクセスがないＨＤＤとして判断することにしてもよい。 The diagnosis unit 604 extracts a diagnosis target disk from the HDDs in the storage ST based on the acquired performance information. Here, the diagnosis target disk is an HDD that is determined not to be accessed. Specifically, for example, the diagnosis unit 604 refers to the performance information table 220 and determines, among HDDs in the storage ST, an HDD with a busy rate of 0% as an HDD that is not accessed. Then, the diagnosis unit 604 extracts the HDD that is determined not to be accessed as a diagnosis target disk. However, the diagnosis unit 604 may determine, among HDDs in the storage ST, HDDs having a busy rate that is equal to or less than a predetermined value (for example, 5% or less) as HDDs that are not accessed.

一例として、ＨＤＤ４のビジー率ｂ４を「０％」とすると、診断部６０４は、ビジー率が０％のＨＤＤ４を、診断対象ディスクとして抽出する。なお、コンフィグテーブル２３０（図５参照）内の、診断対象ディスクとして抽出されなかったＨＤＤのチェックフラグには「１」が設定される。 As an example, if the busy rate b4 of the HDD 4 is “0%”, the diagnosis unit 604 extracts the HDD 4 with the busy rate of 0% as a diagnosis target disk. Note that “1” is set to the check flag of the HDD that has not been extracted as the diagnosis target disk in the configuration table 230 (see FIG. 5).
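A minimal sketch of the extraction of diagnosis target disks follows. The function name, the dictionary layout, and the `idle_threshold` parameter are assumptions for illustration; the default of 0% and the alternative of 5% are the example values mentioned above.

```python
def extract_diagnosis_targets(busy_rates, idle_threshold=0.0):
    """Return the IDs of HDDs judged to have no access.

    busy_rates maps a disk ID to its busy rate in percent. An HDD whose
    busy rate is at or below idle_threshold is treated as not accessed.
    """
    return [disk_id for disk_id, busy in busy_rates.items() if busy <= idle_threshold]


busy_rates = {"HDD1": 30.0, "HDD4": 0.0, "HDD5": 4.0}
strict = extract_diagnosis_targets(busy_rates)                       # busy rate of 0% only
relaxed = extract_diagnosis_targets(busy_rates, idle_threshold=5.0)  # 5% or less
```

Under the strict setting only HDD4 is extracted, while the relaxed setting also extracts HDD5.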

また、診断部６０４は、抽出した診断対象ディスクに対して、負荷が閾値αを超えないように、規定量分のアクセスコマンドを発行した際のレスポンスタイムを計測する。ここで、規定量分のアクセスコマンドは、高負荷状態とならないようにＨＤＤに対して適度な負荷をかけるためのアクセスコマンドであり、ＨＤＤの性能に応じて適宜設定される。適度な負荷とは、例えば、ビジー率が４０％程度の負荷である。規定量分のアクセスコマンドは、例えば、コマンド発行数によって指定される。 Further, the diagnosis unit 604 measures the response time when an access command for a specified amount is issued to the extracted diagnosis target disk so that the load does not exceed the threshold value α. Here, the access commands for the specified amount are access commands for applying an appropriate load to the HDD so as not to be in a high load state, and are appropriately set according to the performance of the HDD. The moderate load is, for example, a load having a busy rate of about 40%. The access commands for the specified amount are specified by, for example, the number of command issuances.

一例として、ビジー率が４０％となるコマンド発行数を「３０」とする。この場合、診断部６０４は、例えば、ホスト装置２０２からのＩ／Ｏ要求とは非同期に、診断対象ディスクに対して、コマンド発行数「３０」を維持するように、リード／ライトコマンドを発行する。リード／ライトコマンドは、リードしたデータをそのまま書き戻す診断用コマンドである。なお、診断用コマンドの実行に応じて、性能情報テーブル２２０内の診断対象ディスクの性能情報が更新される。 As an example, assume that the number of issued commands that yields a busy rate of 40% is “30”. In this case, for example, the diagnosis unit 604 issues read/write commands to the diagnosis target disk, asynchronously with I/O requests from the host device 202, so as to keep the number of issued commands at “30”. The read/write command is a diagnostic command that writes the read data back unchanged. Note that the performance information of the diagnosis target disk in the performance information table 220 is updated as the diagnostic commands are executed.

また、診断部６０４は、パトロール診断処理が実行中の場合、診断対象ディスクのうち、パトロール診断済みの領域以外の領域を、診断領域として選定することにしてもよい。そして、診断部６０４は、リード／ライトを実施する範囲に偏りがでないように、診断用コマンドにより、選定した診断領域内をランダムアクセスすることにしてもよい。 In addition, when the patrol diagnosis process is being performed, the diagnosis unit 604 may select an area other than the area on which the patrol diagnosis has been performed, as a diagnosis area, from the diagnosis target disk. Then, the diagnosis unit 604 may randomly access the selected diagnosis area by using a diagnosis command so that the read / write range is not biased.

また、診断対象ディスクは、アクセスがないと判断されたＨＤＤではあるものの、診断用コマンドが、ホスト装置２０２からのＩ／Ｏ要求に応じて発行されるアクセスコマンドと競合する可能性がある。Ｉ／Ｏ要求と競合すると、Ｉ／Ｏ性能に影響を与えるおそれがある。さらに、診断中はＣＰＵ負荷が上がるため、Ｉ／Ｏ性能に影響を与えるおそれがある。 Further, although the diagnosis target disk is an HDD that is determined not to be accessed, there is a possibility that the diagnosis command conflicts with an access command issued in response to an I / O request from the host device 202. Conflicting with I / O requests may affect I / O performance. In addition, the CPU load increases during diagnosis, which may affect I / O performance.

このため、診断部６０４は、診断コマンドに対して、ホスト装置２０２からのＩ／Ｏ要求に応じて発行されるアクセスコマンドよりも低い優先度（例えば、Ｌｏｗ）を設定することにしてもよい。これにより、Ｉ／Ｏ要求と競合した場合に、Ｉ／Ｏ要求に応じて発行されるアクセスコマンドを優先させることができる。 For this reason, the diagnosis unit 604 may set a lower priority (for example, Low) than the access command issued in response to the I / O request from the host device 202 for the diagnosis command. As a result, when there is a conflict with an I / O request, priority can be given to an access command issued in response to the I / O request.
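One possible way to let host access commands take precedence over diagnostic commands is a priority queue, sketched below. The class name, the numeric priority values, and the use of a heap are assumptions for illustration; the text only states that diagnostic commands are given a lower priority (for example, Low).

```python
import heapq
import itertools

HOST_PRIORITY = 0  # hypothetical value: smaller numbers are served first
DIAG_PRIORITY = 1  # stands in for the "Low" priority of diagnostic commands


class CommandQueue:
    """Queue that serves host commands before diagnostic commands."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # keeps FIFO order within one priority

    def submit(self, command, priority):
        heapq.heappush(self._heap, (priority, next(self._sequence), command))

    def next_command(self):
        """Pop the next command to issue, or return None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = CommandQueue()
queue.submit("diag-rw-1", DIAG_PRIORITY)
queue.submit("host-read-1", HOST_PRIORITY)
queue.submit("diag-rw-2", DIAG_PRIORITY)
queue.submit("host-write-2", HOST_PRIORITY)
order = [queue.next_command() for _ in range(5)]
```

In this sketch both host commands are issued before either diagnostic command, and commands of equal priority keep their submission order.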

また、診断部６０４は、診断領域のサイズに応じて診断処理時間Ｔを設定することにしてもよい。具体的には、例えば、診断部６０４は、診断領域のサイズが「１００ＧＢ」の場合、診断処理時間Ｔを「５分」程度に設定する。これにより、新診断処理が行われる時間を制限して、Ｉ／Ｏ性能に与える影響を抑えることができる。 The diagnosis unit 604 may set the diagnosis processing time T according to the size of the diagnosis area. Specifically, for example, when the size of the diagnosis area is “100 GB”, the diagnosis unit 604 sets the diagnosis processing time T to about “5 minutes”. As a result, the time during which the new diagnosis process is performed can be limited to suppress the influence on the I / O performance.

また、診断部６０４は、冗長性がないＲＡＩＤグループや、リカバリ処理中のＲＡＩＤグループについては、負荷が高く、かつ、データ復旧中のため、診断対象外とすることにしてもよい。また、ストレージ制御装置１０１は、診断対象ディスクに対する新診断処理を頻繁に実施させないために、１日に実施する回数を制限することにしてもよい（例えば、１日１回）。 Further, the diagnosis unit 604 may exclude a RAID group without redundancy or a RAID group that is undergoing recovery processing from being diagnosed because the load is high and data is being recovered. In addition, the storage control apparatus 101 may limit the number of times of execution per day (for example, once a day) in order not to frequently perform new diagnosis processing on the diagnosis target disk.

また、検出部６０３は、診断部６０４によって抽出された診断対象ディスクのうち、診断部６０４によって計測されたレスポンスタイムが閾値β以上となるＨＤＤを、潜在故障ディスクとして検出する。なお、負荷が閾値αを超えないように規定量分のアクセスコマンドを発行したとしても、ホスト装置２０２からのＩ／Ｏ要求に応じたアクセスが急激に増加して診断対象ディスクが高負荷状態となる場合がある。 In addition, the detection unit 603 detects an HDD whose response time measured by the diagnosis unit 604 is equal to or greater than the threshold β among the diagnosis target disks extracted by the diagnosis unit 604 as a potential failure disk. Even if an access command for a specified amount is issued so that the load does not exceed the threshold value α, access according to an I / O request from the host device 202 increases rapidly, and the diagnosis target disk becomes in a high load state. There is a case.

このため、例えば、検出部６０３は、性能情報テーブル２２０を参照して、診断対象ディスクのうち、ビジー率が閾値αより低く、かつ、レスポンスタイムが閾値β以上のＨＤＤを、潜在故障ディスクとして検出することにしてもよい。これにより、繁忙状態に起因してレスポンスが低下している診断対象ディスクを、潜在故障ディスクとして検出してしまうのを防ぐことができる。 For this reason, for example, the detection unit 603 refers to the performance information table 220 and detects, among the diagnosis target disks, HDDs whose busy rate is lower than the threshold value α and whose response time is equal to or higher than the threshold value β as potential failure disks. You may decide to do it. As a result, it is possible to prevent a diagnosis target disk whose response is lowered due to a busy state from being detected as a potential failure disk.

例えば、診断対象ディスクとして抽出されたＨＤＤ４のビジー率ｂ４を「４０％」、レスポンスタイムｔ４を「３秒」とする。この場合、検出部６０３は、ＨＤＤ４のビジー率ｂ４が閾値αより低く、かつ、ＨＤＤ４のレスポンスタイムｔ４が閾値β以上のため、ＨＤＤ４を潜在故障ディスクとして検出する。 For example, assume that the busy rate b4 of the HDD 4 extracted as the diagnosis target disk is “40%” and the response time t4 is “3 seconds”. In this case, the detection unit 603 detects the HDD 4 as a potential failure disk because the busy rate b4 of the HDD 4 is lower than the threshold value α and the response time t4 of the HDD 4 is equal to or greater than the threshold value β.

復旧部６０５は、検出部６０３によって検出された潜在故障ディスクに対してリダンダントコピーを実施する。リダンダントコピーは、バックグラウンドで潜在故障ディスクからホットスペアＨＳへのデータ移行を行い、データ移行後のホットスペアＨＳを、潜在故障ディスクに換えてＲＡＩＤグループに組み込む処理である。 The recovery unit 605 performs redundant copy on the latent failure disk detected by the detection unit 603. Redundant copy is a process of migrating data from a potential failed disk to a hot spare HS in the background and incorporating the hot spare HS after data migration into a RAID group in place of the potential failed disk.

なお、リダンダントコピーの具体的な処理内容については、図７を用いて後述する。 The specific processing contents of the redundant copy will be described later with reference to FIG. 7.

また、検出された潜在故障ディスクが、アクセスがないと判断された診断対象ディスクのときは、アクセスがある潜在故障ディスクに比べて、リダンダントコピーを実施する緊急性は低い。このため、復旧部６０５は、検出された潜在故障ディスクが、アクセスがないと判断された診断対象ディスクのときは、ホットスペアＨＳが複数存在する場合に、当該潜在故障ディスクに対してリダンダントコピーを実施することにしてもよい。 Further, when the detected latent failure disk is a diagnosis target disk that is determined not to be accessed, the urgency of performing redundant copy is lower than that of a potentially failed disk that is accessed. For this reason, when the detected potential failed disk is a diagnosis target disk that is determined not to be accessed, the recovery unit 605 performs redundant copy on the potential failed disk when there are multiple hot spares HS. You may decide to do it.
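The optional policy described above can be sketched as a small decision function. The function name and parameters are hypothetical, and the rule that a disk with host access needs at least one free hot spare is an assumption of this sketch rather than a statement in the text; "multiple" is interpreted as two or more.

```python
def should_start_redundant_copy(found_by_new_diagnosis, free_hot_spares):
    """Decide whether to start a redundant copy for a potential failure disk.

    found_by_new_diagnosis is True when the disk was a diagnosis target
    (judged to have no access), which makes the copy less urgent.
    """
    if found_by_new_diagnosis:
        # Less urgent: copy only when multiple hot spares exist.
        return free_hot_spares >= 2
    # Assumption: a disk with host access is copied whenever a hot spare is free.
    return free_hot_spares >= 1
```

The intent of this design is that a single remaining hot spare stays available for a more urgent failure on a disk that is actually in use.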

（リダンダントコピー） (Redundant copy)
つぎに、図７を用いて、潜在故障ディスクに対するリダンダントコピーの具体的な処理内容について説明する。 Next, the specific processing contents of the redundant copy for a potential failure disk will be described with reference to FIG. 7.

図７は、リダンダントコピーの具体的な処理内容の一例を示す説明図である。図７において、ＲＡＩＤグループ＄１内のＨＤＤ＃１，ＨＤＤ＃２のうち、ＨＤＤ＃１が潜在故障ディスクとして検出される場合を想定する。また、ここでは、ＨＤＤ＃１，ＨＤＤ＃２において、データ２重化されているものとする。 FIG. 7 is an explanatory diagram showing an example of the specific processing contents of the redundant copy. In FIG. 7, it is assumed that, of HDD#1 and HDD#2 in RAID group $1, HDD#1 is detected as a potential failure disk. Here, it is also assumed that data is duplicated across HDD#1 and HDD#2.

（ｉ）ストレージ制御装置１０１は、ＨＤＤ＃１を潜在故障ディスクとして検出する。なお、ＨＤＤ＃１は、潜在故障ディスクとして検出されたもののまだ使用可能な状態である。このため、ホスト装置２０２からのＩ／Ｏ要求に伴うＨＤＤ＃１へのアクセスは実施される。ただし、リード要求やデータコピーは、正常状態であるＨＤＤ＃２を主体として行われる。 (i) The storage control apparatus 101 detects HDD#1 as a potential failure disk. Although HDD#1 has been detected as a potential failure disk, it is still usable. For this reason, access to HDD#1 accompanying I/O requests from the host device 202 continues to be performed. However, read requests and data copying are performed primarily using HDD#2, which is in a normal state.

（ｉｉ）ストレージ制御装置１０１は、バックグラウンドでＨＤＤ＃２からホットスペア＃３へのデータコピーを行う。このデータコピーは、ＨＤＤ＃１からホットスペア＃３へのデータ移行に相当する。データコピー中において、ホスト装置２０２からのＩ／Ｏ要求に伴うアクセスは、ホットスペア＃３にも実施される。すなわち、潜在故障ディスクであるＨＤＤ＃１が切り離されるまでは、データ３重化の状態で運用される。なお、ＨＤＤ＃２へのアクセス時にエラーが発生した場合には、ＨＤＤ＃１に切り替えてアクセスが実施される。 (ii) The storage control apparatus 101 copies data from HDD#2 to hot spare #3 in the background. This data copy corresponds to data migration from HDD#1 to hot spare #3. During the data copy, access accompanying I/O requests from the host device 202 is also performed on hot spare #3. That is, until HDD#1, the potential failure disk, is disconnected, the system operates with the data triplicated. If an error occurs when accessing HDD#2, access is switched to HDD#1.

（ｉｉｉ）ストレージ制御装置１０１は、バックグラウンドでＨＤＤ＃２からホットスペア＃３へのデータコピーが完了すると、ＨＤＤ＃１を切り離して、ホットスペア＃３をＲＡＩＤグループ＄１に組み込む。これにより、データの冗長性を確保しつつ、潜在故障状態であるＨＤＤ＃１を切り離すことができる。 (iii) When the background data copy from HDD#2 to hot spare #3 is completed, the storage control apparatus 101 disconnects HDD#1 and incorporates hot spare #3 into RAID group $1. As a result, HDD#1, which is in a latent failure state, can be disconnected while data redundancy is maintained.
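Steps (i) to (iii) can be mimicked with a toy in-memory model of a mirrored group. Everything here is an assumption made for illustration: disks are dictionaries from block number to data, a read error on the healthy disk is modeled as a `None` value, and the function name `redundant_copy` is hypothetical. Concurrent host writes during the copy are not modeled.

```python
def redundant_copy(members, suspect, spare, blocks):
    """Copy a mirrored group onto a hot spare, then swap the suspect disk out.

    members: disk IDs in the RAID group (mirrored data).
    blocks:  disk ID -> {block number: data}; None models a read error.
    Returns the new member list with the spare in place of the suspect.
    """
    # (i)/(ii) Copy primarily from the healthy disk, not from the suspect disk.
    source = next(disk for disk in members if disk != suspect)
    for block_no in sorted(blocks[source].keys() | blocks[suspect].keys()):
        data = blocks[source].get(block_no)
        if data is None:
            # On a read error, fall back to the still-usable suspect disk.
            data = blocks[suspect][block_no]
        blocks[spare][block_no] = data
    # (iii) Disconnect the suspect disk and incorporate the hot spare.
    return [spare if disk == suspect else disk for disk in members]


blocks = {
    "HDD#1": {0: "a", 1: "b"},   # potential failure disk, still readable
    "HDD#2": {0: "a", 1: None},  # healthy disk with one simulated read error
    "HS#3": {},                  # hot spare
}
new_members = redundant_copy(["HDD#1", "HDD#2"], "HDD#1", "HS#3", blocks)
```

After the call, the hot spare holds both blocks and takes the place of HDD#1 in the group, so redundancy is kept throughout.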

（ストレージ制御装置１０１の各種制御処理手順） (Various control processing procedures of the storage control apparatus 101)
つぎに、ストレージ制御装置１０１の各種制御処理手順について説明する。以下の説明では、統計加点処理やパトロール診断処理でＩ／Ｏタイムアウトを判断するためのタイムアウト値を「５秒」とする。また、閾値αを「５０％」とし、閾値βを「２秒」とする。また、アクセスがないＨＤＤを判断するためのビジー率を「０％」とする。 Next, various control processing procedures of the storage control apparatus 101 will be described. In the following description, the timeout value used to determine an I/O timeout in the statistical point addition process and the patrol diagnosis process is “5 seconds”. Further, the threshold value α is set to “50%” and the threshold value β is set to “2 seconds”. Further, the busy rate used to identify an HDD with no access is “0%”.

まず、図８および図９を用いて、ストレージ制御装置１０１の第１の潜在故障検出処理手順について説明する。第１の潜在故障検出処理は、ホスト装置２０２からのＩ／Ｏ要求を処理する際に実行される。 First, the first latent failure detection processing procedure of the storage control apparatus 101 will be described with reference to FIGS. 8 and 9. The first latent failure detection process is executed when an I/O request from the host device 202 is processed.

図８および図９は、ストレージ制御装置１０１の第１の潜在故障検出処理手順の一例を示すフローチャートである。図８のフローチャートにおいて、まず、ストレージ制御装置１０１は、ホスト装置２０２からのＩ／Ｏ要求を処理する（ステップＳ８０１）。なお、Ｉ／Ｏ要求に対するホスト装置２０２への応答は適宜行われる。 FIG. 8 and FIG. 9 are flowcharts showing an example of the first latent failure detection processing procedure of the storage control apparatus 101. In the flowchart of FIG. 8, first, the storage control apparatus 101 processes an I / O request from the host apparatus 202 (step S801). A response to the host device 202 in response to the I / O request is appropriately performed.

そして、ストレージ制御装置１０１は、ストレージＳＴ内のＨＤＤの負荷状況およびレスポンス状況を表す性能情報を取得する（ステップＳ８０２）。取得された性能情報は、性能情報テーブル２２０に記憶される。つぎに、ストレージ制御装置１０１は、ホスト装置２０２からのＩ／Ｏ要求に応じてアクセスされたＲＡＩＤグループ全体のレスポンスタイムが５秒以上であるか否かを判断する（ステップＳ８０３）。 Then, the storage control apparatus 101 acquires performance information indicating the load status and response status of the HDD in the storage ST (step S802). The acquired performance information is stored in the performance information table 220. Next, the storage control apparatus 101 determines whether or not the response time of the entire RAID group accessed in response to the I / O request from the host apparatus 202 is 5 seconds or more (step S803).

ここで、ＲＡＩＤグループ全体のレスポンスタイムが５秒未満の場合（ステップＳ８０３：Ｎｏ）、ストレージ制御装置１０１は、ステップＳ８０５に移行する。一方、ＲＡＩＤグループ全体のレスポンスタイムが５秒以上の場合（ステップＳ８０３：Ｙｅｓ）、ストレージ制御装置１０１は、性能情報テーブル２２０を参照して、アクセスされたＲＡＩＤグループ内のＨＤＤのレスポンスタイムが５秒以上であるか否かを判断する（ステップＳ８０４）。 Here, when the response time of the entire RAID group is less than 5 seconds (step S803: No), the storage control apparatus 101 proceeds to step S805. On the other hand, when the response time of the entire RAID group is 5 seconds or more (step S803: Yes), the storage control apparatus 101 refers to the performance information table 220 and determines whether or not the response time of an HDD in the accessed RAID group is 5 seconds or more (step S804).

ここで、ＨＤＤのレスポンスタイムが５秒未満の場合（ステップＳ８０４：Ｎｏ）、ストレージ制御装置１０１は、アクセスされたＲＡＩＤグループ全体のレスポンスタイムが２秒以上であるか否かを判断する（ステップＳ８０５）。ここで、ＲＡＩＤグループ全体のレスポンスタイムが２秒未満の場合（ステップＳ８０５：Ｎｏ）、ストレージ制御装置１０１は、本フローチャートによる一連の処理を終了する。 Here, when the response time of the HDD is less than 5 seconds (step S804: No), the storage control apparatus 101 determines whether or not the response time of the entire accessed RAID group is 2 seconds or more (step S805). When the response time of the entire RAID group is less than 2 seconds (step S805: No), the storage control apparatus 101 ends the series of processes in this flowchart.

一方、ＲＡＩＤグループ全体のレスポンスタイムが２秒以上の場合（ステップＳ８０５：Ｙｅｓ）、ストレージ制御装置１０１は、図９に示すステップＳ９０１に移行する。 On the other hand, when the response time of the entire RAID group is 2 seconds or more (step S805: Yes), the storage control apparatus 101 proceeds to step S901 shown in FIG. 9.

また、ステップＳ８０４において、ＨＤＤのレスポンスタイムが５秒以上の場合（ステップＳ８０４：Ｙｅｓ）、ストレージ制御装置１０１は、統計加点処理を実行して（ステップＳ８０６）、本フローチャートによる一連の処理を終了する。 When the response time of the HDD is 5 seconds or more in step S804 (step S804: Yes), the storage control apparatus 101 executes the statistical point addition process (step S806) and ends the series of processes in this flowchart.
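The branching of steps S803 to S806 in FIG. 8 can be summarized in the following sketch. The function name, the returned labels, and the argument layout are assumptions for illustration; the 5-second timeout and the 2-second threshold β are the values used in this description.

```python
TIMEOUT_SEC = 5.0  # I/O timeout value used in this description
BETA_SEC = 2.0     # threshold beta used in this description


def first_detection_branch(group_response_time, hdd_response_times):
    """Return which branch of FIG. 8 is taken after steps S801 and S802.

    group_response_time: response time of the entire accessed RAID group.
    hdd_response_times:  disk ID -> response time of each HDD in the group.
    """
    if group_response_time >= TIMEOUT_SEC:                               # S803
        if any(t >= TIMEOUT_SEC for t in hdd_response_times.values()):  # S804
            return "statistical_point_addition"                         # S806
    if group_response_time >= BETA_SEC:                                  # S805
        return "per_hdd_check"                                           # to S901 (FIG. 9)
    return "end"
```

For instance, a group response time of 2.2 seconds leads to the per-HDD check of FIG. 9, while a group response time under 2 seconds ends the process.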

統計加点処理は、アクセスされたＲＡＩＤグループ内のＨＤＤのうちレスポンスタイムが５秒以上であるＨＤＤに加点していき、統計加点値が閾値を超えたＨＤＤを被疑ディスクとして検出する処理である。被疑ディスクとして検出されたＨＤＤに対しては、例えば、リダンダントコピーが実施される。 The statistical point addition process is a process of adding points to HDDs having a response time of 5 seconds or more among the HDDs in the accessed RAID group and detecting HDDs having statistical point values exceeding a threshold as suspect disks. For example, a redundant copy is performed on the HDD detected as the suspect disk.
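The statistical point addition process can be pictured with the sketch below. The class name, the number of points added per timeout, and the suspect threshold are hypothetical, since the text does not give concrete values; only the rule "add points on a response time of 5 seconds or more, and flag the disk once the score exceeds a threshold" is taken from the description.

```python
from collections import defaultdict


class StatisticalScore:
    """Accumulates points per HDD and reports when a disk becomes suspect."""

    def __init__(self, suspect_threshold, points_per_timeout=1):
        self.suspect_threshold = suspect_threshold    # hypothetical value
        self.points_per_timeout = points_per_timeout  # hypothetical value
        self.points = defaultdict(int)

    def record(self, disk_id, response_time, timeout=5.0):
        """Add points on a timeout; return True once the disk exceeds the threshold."""
        if response_time >= timeout:
            self.points[disk_id] += self.points_per_timeout
        return self.points[disk_id] > self.suspect_threshold


score = StatisticalScore(suspect_threshold=2)
results = [score.record("HDD1", t) for t in (5.0, 6.0, 4.9, 7.0)]
```

With a threshold of 2 points in this sketch, the disk is reported as suspect on its third timeout, and the 4.9-second response adds no points.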

図９のフローチャートにおいて、まず、ストレージ制御装置１０１は、アクセスされたＲＡＩＤグループ内のＨＤＤのうち選択されていない未選択のＨＤＤを選択する（ステップＳ９０１）。つぎに、ストレージ制御装置１０１は、コンフィグテーブル２３０を参照して、選択したＨＤＤのチェックフラグが「０」であるか否かを判断する（ステップＳ９０２）。 In the flowchart of FIG. 9, first, the storage control apparatus 101 selects an unselected HDD that has not been selected among the HDDs in the accessed RAID group (step S901). Next, the storage control device 101 refers to the configuration table 230 and determines whether or not the check flag of the selected HDD is “0” (step S902).

ここで、チェックフラグが「０」ではない場合（ステップＳ９０２：Ｎｏ）、ストレージ制御装置１０１は、ステップＳ９０６に移行する。一方、チェックフラグが「０」の場合（ステップＳ９０２：Ｙｅｓ）、ストレージ制御装置１０１は、コンフィグテーブル２３０を参照して、アクセスされたＲＡＩＤグループのＲＡＩＤステータスが「Ａｖａｉｌａｂｌｅ」であるか否かを判断する（ステップＳ９０３）。 If the check flag is not “0” (step S902: No), the storage control apparatus 101 proceeds to step S906. On the other hand, when the check flag is “0” (step S902: Yes), the storage control apparatus 101 refers to the configuration table 230 and determines whether or not the RAID status of the accessed RAID group is “Available” (step S903).

ここで、ＲＡＩＤステータスが「Ａｖａｉｌａｂｌｅ」ではない場合（ステップＳ９０３：Ｎｏ）、ストレージ制御装置１０１は、ステップＳ９０６に移行する。一方、ＲＡＩＤステータスが「Ａｖａｉｌａｂｌｅ」の場合（ステップＳ９０３：Ｙｅｓ）、ストレージ制御装置１０１は、性能情報テーブル２２０を参照して、選択したＨＤＤのビジー率ｂが０％であるか否かを判断する（ステップＳ９０４）。 Here, when the RAID status is not “Available” (step S903: No), the storage control apparatus 101 proceeds to step S906. On the other hand, when the RAID status is “Available” (step S903: Yes), the storage control apparatus 101 refers to the performance information table 220 and determines whether the busy rate b of the selected HDD is 0%. (Step S904).

ここで、ビジー率が０％の場合（ステップＳ９０４：Ｙｅｓ）、ストレージ制御装置１０１は、ステップＳ９０８に移行する。一方、ビジー率が０％ではない場合（ステップＳ９０４：Ｎｏ）、ストレージ制御装置１０１は、性能情報テーブル２２０を参照して、選択したＨＤＤのビジー率ｂが５０％未満であり、かつ、レスポンスタイムｔが２秒以上であるか否かを判断する（ステップＳ９０５）。 Here, if the busy rate is 0% (step S904: Yes), the storage control apparatus 101 proceeds to step S908. On the other hand, when the busy rate is not 0% (step S904: No), the storage control apparatus 101 refers to the performance information table 220 and determines whether or not the busy rate b of the selected HDD is less than 50% and the response time t is 2 seconds or more (step S905).

ここで、ビジー率ｂが５０％未満であり、かつ、レスポンスタイムｔが２秒以上ではない場合（ステップＳ９０５：Ｎｏ）、ストレージ制御装置１０１は、選択したＨＤＤのチェックフラグに「１」を設定して（ステップＳ９０６）、ステップＳ９０８に移行する。 Here, when the busy rate b is less than 50% and the response time t is not 2 seconds or longer (step S905: No), the storage control apparatus 101 sets “1” to the check flag of the selected HDD. (Step S906), the process proceeds to step S908.

一方、ビジー率ｂが５０％未満であり、かつ、レスポンスタイムｔが２秒以上の場合（ステップＳ９０５：Ｙｅｓ）、ストレージ制御装置１０１は、選択したＨＤＤに対してリダンダントコピーを実施する（ステップＳ９０７）。なお、ＨＤＤのリダンダントコピーを実施中は、当該ＨＤＤを含むＲＡＩＤグループのＲＡＩＤステータスは「Ｒｅｂｕｉｌｄ」となる。 On the other hand, when the busy rate b is less than 50% and the response time t is 2 seconds or more (step S905: Yes), the storage control apparatus 101 performs redundant copy on the selected HDD (step S907). While the redundant copy of the HDD is in progress, the RAID status of the RAID group including that HDD is “Rebuild”.

そして、ストレージ制御装置１０１は、アクセスされたＲＡＩＤグループ内のＨＤＤのうち選択されていない未選択のＨＤＤがあるか否かを判断する（ステップＳ９０８）。ここで、未選択のＨＤＤがある場合（ステップＳ９０８：Ｙｅｓ）、ストレージ制御装置１０１は、ステップＳ９０１に戻る。 Then, the storage controller 101 determines whether there is an unselected HDD that is not selected among the HDDs in the accessed RAID group (step S908). If there is an unselected HDD (step S908: Yes), the storage controller 101 returns to step S901.

一方、未選択のＨＤＤがない場合（ステップＳ９０８：Ｎｏ）、ストレージ制御装置１０１は、本フローチャートによる一連の処理を終了する。これにより、レスポンスタイムアウト（Ｉ／Ｏタイムアウト）は発生していないものの、スローダウンしている潜在故障ディスクを検出してリダンダントコピーを実施することができる。 On the other hand, when there is no unselected HDD (step S908: No), the storage control apparatus 101 ends a series of processes according to this flowchart. As a result, although a response timeout (I / O timeout) has not occurred, it is possible to detect a potential failed disk that has been slowed down and perform redundant copy.

また、繁忙状態に起因してレスポンスが低下しているＨＤＤを、潜在故障ディスクとして誤検出するのを防ぐことができる。また、ＲＡＩＤグループが復旧中や冗長性を失っているときに潜在故障ディスクに対するリダンダントコピーが実施されないように制御することができる。また、アクセスがないと判断したＨＤＤ（チェックフラグ「０」のＨＤＤ）を、診断対象ディスクとして抽出することができる。 Further, it is possible to prevent erroneous detection of an HDD whose response is lowered due to a busy state as a potential failure disk. Further, it is possible to perform control so that redundant copying is not performed on a latently failed disk when a RAID group is being restored or has lost redundancy. Further, an HDD that is determined not to be accessed (HDD with check flag “0”) can be extracted as a diagnosis target disk.
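The per-HDD loop of FIG. 9 (steps S901 to S908) can be sketched as follows. The function name and data layout are assumptions for illustration. As a simplification, the RAID status is passed in as a snapshot and is not switched to "Rebuild" inside the loop, and the redundant copy itself is represented only by collecting the selected disk IDs.

```python
def check_raid_group(raid_status, hdds, check_flags):
    """Walk the HDDs of the accessed RAID group as in FIG. 9.

    hdds:        disk ID -> (busy rate in percent, response time in seconds).
    check_flags: disk ID -> 0 (diagnosis target) or 1; updated in place.
    Returns the disk IDs for which a redundant copy would be started (S907).
    """
    to_copy = []
    for disk_id, (busy, response_time) in hdds.items():  # S901 / S908
        if check_flags.get(disk_id, 0) != 0:             # S902: No
            check_flags[disk_id] = 1                     # S906
        elif raid_status != "Available":                 # S903: No
            check_flags[disk_id] = 1                     # S906
        elif busy == 0:                                  # S904: Yes
            continue                                     # flag stays 0 (diagnosis target)
        elif busy < 50 and response_time >= 2:           # S905: Yes
            to_copy.append(disk_id)                      # S907
        else:                                            # S905: No
            check_flags[disk_id] = 1                     # S906
    return to_copy


hdds = {"HDD1": (30, 2.2), "HDD2": (60, 3.2), "HDD3": (10, 0.1), "HDD4": (0, 0.0)}
flags = {}
selected = check_raid_group("Available", hdds, flags)
```

In this example HDD1 is selected for redundant copy, HDD2 and HDD3 are marked as not requiring diagnosis, and the idle HDD4 keeps check flag 0 so that it remains a diagnosis target.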

つぎに、図１０を用いて、ストレージ制御装置１０１の第２の潜在故障検出処理手順について説明する。第２の潜在故障検出処理は、定期的（例えば、毎日２４時）、または、所定のタイミング（例えば、ストレージシステム２００の管理者により指定されるタイミング）で実行される。 Next, the second latent failure detection processing procedure of the storage control apparatus 101 will be described with reference to FIG. 10. The second latent failure detection process is executed periodically (for example, at 24:00 every day) or at a predetermined timing (for example, a timing specified by the administrator of the storage system 200).

FIG. 10 is a flowchart illustrating an example of the second latent failure detection processing procedure of the storage control device 101. In the flowchart of FIG. 10, the storage control device 101 first selects an HDD that has not yet been selected from among the HDDs in the storage ST (step S1001).

Next, the storage control device 101 refers to the configuration table 230 and determines whether the check flag of the selected HDD is “0” (step S1002). If the check flag is not “0” (step S1002: No), the storage control device 101 proceeds to step S1004.

On the other hand, when the check flag is “0” (step S1002: Yes), the storage control device 101 executes the new diagnosis process (step S1003). The specific procedure of the new diagnosis process will be described later with reference to FIG. 11. Then, the storage control device 101 initializes the check flag of the selected HDD to “0” (step S1004).

Next, the storage control device 101 determines whether any HDD in the storage ST has not yet been selected (step S1005). If there is an unselected HDD (step S1005: Yes), the storage control device 101 returns to step S1001.

On the other hand, when there is no unselected HDD (step S1005: No), the storage control device 101 ends the series of processes in this flowchart. As a result, the new diagnosis process can be performed on each diagnosis target disk (HDD with check flag “0”) in the storage ST.
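The loop of FIG. 10 can be sketched roughly as follows. This is only an illustrative sketch: the `Hdd` class, the `new_diagnosis` callback, and the reading that a nonzero check flag means "accessed since the last run" are assumptions made for the example and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Hdd:
    """Hypothetical stand-in for one entry of the configuration table 230."""
    name: str
    check_flag: int  # 0: judged "not accessed"; nonzero: assumed to mean "accessed"


def second_latent_failure_detection(
    hdds: Iterable[Hdd],
    new_diagnosis: Callable[[Hdd], None],
) -> List[str]:
    """Walk every HDD once and diagnose those whose check flag is 0."""
    diagnosed = []
    for hdd in hdds:                 # S1001 / S1005: visit each HDD exactly once
        if hdd.check_flag == 0:      # S1002: diagnosis target disk?
            new_diagnosis(hdd)       # S1003: run the new diagnosis process
            diagnosed.append(hdd.name)
        hdd.check_flag = 0           # S1004: reset the flag for the next period
    return diagnosed
```

Resetting the flag for every HDD, diagnosed or not, means that a disk is diagnosed in the next run only if nothing marks it as accessed in the meantime.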

Next, the specific procedure of the new diagnosis process in step S1003 of FIG. 10 will be described with reference to FIG. 11.

FIG. 11 is a flowchart illustrating an example of the specific procedure of the new diagnosis process. In the flowchart of FIG. 11, the storage control device 101 first refers to the configuration table 230 and determines whether the RAID status of the RAID group that includes the diagnosis target disk is “Available” (step S1101). The diagnosis target disk is the HDD selected in step S1001 of FIG. 10.

If the RAID status is not “Available” (step S1101: No), the storage control device 101 returns to the step that called the new diagnosis process. On the other hand, when the RAID status is “Available” (step S1101: Yes), the storage control device 101 selects, as the diagnosis area, an area of the diagnosis target disk other than the area for which patrol diagnosis has already been completed (step S1102).

Next, the storage control device 101 sets the priority “Low” for the diagnostic commands (read/write commands) (step S1103). Then, the storage control device 101 randomly accesses the selected diagnosis area using a prescribed amount of diagnostic commands (step S1104). At this time, the storage control device 101 measures the response time observed when the prescribed amount of diagnostic commands is issued, and stores the performance information in the performance information table 220.

Next, the storage control device 101 refers to the performance information table 220 and determines whether the busy rate b of the diagnosis target disk is less than 50% and the response time t is 2 seconds or longer (step S1105).

Here, when it is not the case that the busy rate b is less than 50% and the response time t is 2 seconds or longer (step S1105: No), the storage control device 101 determines whether the diagnosis processing time T has elapsed since random access to the diagnosis area started (step S1106).

If the diagnosis processing time T has not elapsed (step S1106: No), the storage control device 101 returns to step S1104. On the other hand, when the diagnosis processing time T has elapsed (step S1106: Yes), the storage control device 101 returns to the step that called the new diagnosis process.

When, in step S1105, the busy rate b is less than 50% and the response time t is 2 seconds or longer (step S1105: Yes), the storage control device 101 determines whether there are two or more hot spares HS (step S1107). If there are fewer than two hot spares HS (step S1107: No), the storage control device 101 returns to the step that called the new diagnosis process.

On the other hand, when there are two or more hot spares HS (step S1107: Yes), the storage control device 101 performs redundant copy on the diagnosis target disk (latent failure disk) (step S1108) and returns to the step that called the new diagnosis process.

As a result, among the diagnosis target disks determined not to be accessed, an HDD in a latent failure state can be detected and subjected to redundant copy. In addition, the new diagnosis process can be kept from running while the RAID group is being restored or has lost redundancy.
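The control flow of FIG. 11 can be sketched as below. This is a minimal sketch under stated assumptions, not the actual implementation: `DiagnosisEnv`, `probe`, `redundant_copy`, the injectable `clock`, and the returned status strings are all hypothetical names introduced for illustration. The selection of the diagnosis area (step S1102) and the “Low” priority setting (step S1103) are assumed to be handled inside `probe`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple

BUSY_LIMIT = 50.0       # percent, from step S1105
RESPONSE_LIMIT = 2.0    # seconds, from step S1105
MIN_HOT_SPARES = 2      # from step S1107


@dataclass
class DiagnosisEnv:
    """Hypothetical bundle of everything the sketch needs from the system."""
    raid_status: str
    hot_spares: int
    # Issues one prescribed batch of low-priority random accesses and
    # returns (busy rate in percent, response time in seconds).
    probe: Callable[[], Tuple[float, float]]
    redundant_copy: Callable[[], None]
    diagnosis_time: float                       # diagnosis processing time T
    clock: Callable[[], float] = time.monotonic


def new_diagnosis(env: DiagnosisEnv) -> str:
    if env.raid_status != "Available":          # S1101: skip unless Available
        return "skipped"
    start = env.clock()
    while True:
        busy, response = env.probe()            # S1104: random access + measure
        if busy < BUSY_LIMIT and response >= RESPONSE_LIMIT:   # S1105
            if env.hot_spares >= MIN_HOT_SPARES:               # S1107
                env.redundant_copy()                           # S1108
                return "redundant_copy"
            return "no_spare"
        if env.clock() - start >= env.diagnosis_time:          # S1106
            return "not_detected"
```

Passing the clock in as a parameter is a choice made only for this sketch so that the time limit T can be exercised without waiting in real time.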

As described above, the storage control device 101 according to the embodiment can acquire performance information representing the load status and response status of the HDDs in the storage ST that are accessed in response to I/O requests from the host device 202. Based on the acquired performance information, the storage control device 101 can then detect, as a latent failure disk, an HDD in the storage ST whose load is lower than the threshold α and whose response time is equal to or greater than the threshold β.

As a result, a latent failure disk that is slowing down can be detected even though no response timeout (I/O timeout) has occurred. Moreover, because the load is considered in addition to the response time, an HDD whose response has degraded because it is busy is prevented from being erroneously detected as a latent failure disk.
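The detection condition itself reduces to a two-threshold test. The following is a minimal sketch; the function name is hypothetical, and the default values 50% and 2 seconds are the example values used for the busy rate b and the response time t in the flowcharts, standing in here for α and β.

```python
def is_latent_failure(busy_rate: float, response_time: float,
                      alpha: float = 50.0, beta: float = 2.0) -> bool:
    """Return True when load is below alpha AND response time is at least beta.

    busy_rate is in percent and response_time in seconds.
    """
    return busy_rate < alpha and response_time >= beta


# A lightly loaded but slow disk is flagged.
lightly_loaded_slow = is_latent_failure(busy_rate=30.0, response_time=2.5)
# A busy disk with the same response time is not flagged, since the delay
# may simply be caused by the load.
busy_slow = is_latent_failure(busy_rate=80.0, response_time=2.5)
```

Note that the load comparison is strict (lower than α) while the response comparison is inclusive (equal to or greater than β), matching the wording of the text.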

Further, based on the performance information, the storage control device 101 can extract, as diagnosis target disks, the HDDs in the storage ST that it has determined not to be accessed. For each extracted diagnosis target disk, the storage control device 101 can measure the response time observed when a prescribed amount of diagnostic commands is issued in such a way that the load does not exceed the threshold α. The storage control device 101 can then detect, as a latent failure disk, any extracted diagnosis target disk whose measured response time is equal to or greater than the threshold β.

As a result, even for an HDD that is not accessed or is rarely accessed, a latent failure disk that is slowing down can be detected by issuing diagnostic commands asynchronously with I/O requests from the host device 202 and diagnosing its performance.

Further, the storage control device 101 can perform redundant copy on the detected latent failure disk. As a result, recovery processing that detaches the HDD in a latent failure state can be performed automatically while data redundancy is maintained, and degradation of the response performance of the entire RAID group caused by the performance deterioration of that HDD can be suppressed.

Further, when the detected latent failure disk is a diagnosis target disk determined not to be accessed, the storage control device 101 can perform redundant copy on that latent failure disk if there are two or more hot spares HS.

As a result, when the latent failure disk is a diagnosis target disk determined not to be accessed, redundant copy is performed only if a plurality of hot spares HS exist. This reduces the likelihood that no hot spare HS is available when redundant copy must be performed on a latent failure disk that is accessed frequently.

Further, the storage control device 101 can assign the diagnostic commands a lower priority than the access commands issued in response to I/O requests from the host device 202. As a result, when a diagnostic command competes with an I/O request from the host device 202, the access command issued in response to the I/O request is processed ahead of the diagnostic command, and the impact on I/O performance can be suppressed.
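The specification does not say how the priority is enforced. One way to picture the effect is a priority queue in which host commands always drain before diagnostic commands. The sketch below is purely illustrative; `CommandQueue`, `HOST`, and `DIAG` are hypothetical names, and a real controller or drive may schedule commands quite differently.

```python
import heapq
import itertools

HOST, DIAG = 0, 1  # smaller value is served first (assumed convention)


class CommandQueue:
    """Toy queue: host commands are dequeued before diagnostic commands."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority

    def submit(self, priority: int, command: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), command))

    def next_command(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = CommandQueue()
queue.submit(DIAG, "diag-read-1")
queue.submit(HOST, "host-write-1")
queue.submit(DIAG, "diag-read-2")
queue.submit(HOST, "host-read-2")
order = [queue.next_command() for _ in range(len(queue))]
```

In this toy model the two host commands are returned first, in submission order, followed by the two diagnostic commands.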

For these reasons, the storage control device 101 according to the embodiment can find, at an early stage, an HDD in a latent failure state that is slowing down even though no response timeout or medium error has occurred. In addition, automatic recovery processing using redundant copy can suppress degradation of the response performance of the entire RAID group caused by the performance deterioration of the HDD in a latent failure state.

The control method described in this embodiment can be realized by executing a program prepared in advance on a computer such as a storage control device. The control program is recorded on a computer-readable recording medium such as a hard disk, a flexible disk, a CD (Compact Disc)-ROM, an MO (Magneto-Optical disk), a DVD (Digital Versatile Disk), or a USB (Universal Serial Bus) memory, and is executed by being read from the recording medium by the computer. The control program may also be distributed via a network such as the Internet.

The following supplementary notes are further disclosed with respect to the embodiment described above.

(Supplementary note 1) A storage control device that controls one or more storage devices accessed in response to a request from a host device, the storage control device comprising a control unit that:
acquires performance information representing the load status and response status of each of the one or more storage devices; and
detects, based on the acquired performance information, a storage device whose load is lower than a first threshold and whose response time is equal to or greater than a second threshold among the one or more storage devices.

(Supplementary note 2) The storage control device according to supplementary note 1, wherein the control unit:
extracts, based on the acquired performance information, a storage device determined not to be accessed from among the one or more storage devices;
measures, for the extracted storage device, the response time observed when a prescribed amount of access commands is issued in such a way that the load does not exceed the first threshold; and
detects, among the extracted storage devices, a storage device whose measured response time is equal to or greater than the second threshold.

(Supplementary note 3) The storage control device according to supplementary note 1 or 2, wherein the control unit performs redundant copy on the detected storage device.

(Supplementary note 4) The storage control device according to supplementary note 2, wherein, when the detected storage device is a storage device determined not to be accessed, redundant copy is performed on that storage device if a plurality of alternative storage devices exist.

(Supplementary note 5) The storage control device according to supplementary note 2, wherein the access command is assigned a lower priority than an access command issued in response to a request from the host device.

(Supplementary note 6) The storage control device according to any one of supplementary notes 1 to 5, wherein the second threshold is lower than a timeout value for each of the one or more storage devices.

(Supplementary note 7) The storage control device according to any one of supplementary notes 1 to 6, wherein the load status of each of the one or more storage devices is represented by a busy rate.

(Supplementary note 8) The storage control device according to any one of supplementary notes 1 to 7, wherein the response status of each of the one or more storage devices is represented by a response time from when an access command is issued until a response is received.

(Supplementary note 9) A control program that causes a computer that controls one or more storage devices accessed in response to a request from a host device to execute a process comprising:
acquiring performance information representing the load status and response status of each of the one or more storage devices; and
detecting, based on the acquired performance information, a storage device whose load is lower than a first threshold and whose response time is equal to or greater than a second threshold among the one or more storage devices.

(Supplementary note 10) The control program according to supplementary note 9, further causing the computer to execute a process comprising:
extracting, based on the acquired performance information, a storage device determined not to be accessed from among the one or more storage devices;
measuring, for the extracted storage device, the response time observed when a prescribed amount of access commands is issued in such a way that the load does not exceed the first threshold; and
detecting, among the extracted storage devices, a storage device whose measured response time is equal to or greater than the second threshold.

(Supplementary note 11) The control program according to supplementary note 9 or 10, further causing the computer to execute a process of performing redundant copy on the detected storage device.

(Supplementary note 12) The control program according to supplementary note 11, wherein, in the process of performing the redundant copy, when the detected storage device is a storage device determined not to be accessed, redundant copy is performed on that storage device if a plurality of alternative storage devices exist.

DESCRIPTION OF SYMBOLS
101 Storage control device
102 Host device
103, ST Storage
110 Performance information
200 Storage system
201 Storage apparatus
202 Host device
210 Network
220 Performance information table
230 Configuration table
300 Bus
301 CPU
302 Memory
303 Communication I/F
304 I/O controller
601 I/O processing unit
602 Acquisition unit
603 Detection unit
604 Diagnosis unit
605 Recovery unit

Claims

A storage control device that controls one or more storage devices accessed in response to a request from a host device, the storage control device comprising a control unit that:
acquires performance information representing the load status and response status of each of the one or more storage devices; and
detects, based on the acquired performance information, a storage device whose load is lower than a first threshold and whose response time is equal to or greater than a second threshold among the one or more storage devices.

The storage control device according to claim 1, wherein the control unit:
extracts, based on the acquired performance information, a storage device determined not to be accessed from among the one or more storage devices;
measures, for the extracted storage device, the response time observed when a prescribed amount of access commands is issued in such a way that the load does not exceed the first threshold; and
detects, among the extracted storage devices, a storage device whose measured response time is equal to or greater than the second threshold.

The storage control device according to claim 1, wherein the control unit performs redundant copy on the detected storage device.

The storage control device described, wherein, when the detected storage device is a storage device determined not to be accessed, the redundant copy is performed on that storage device if a plurality of alternative storage devices exist.

The storage control device according to claim 2, wherein the access command is assigned a lower priority than an access command issued in response to a request from the host device.

A control program that causes a computer that controls one or more storage devices accessed in response to a request from a host device to execute a process comprising:
acquiring performance information representing the load status and response status of each of the one or more storage devices; and
detecting, based on the acquired performance information, a storage device whose load is lower than a first threshold and whose response time is equal to or greater than a second threshold among the one or more storage devices.