JP2006268502A

JP2006268502A - Array controller, media error restoring method and program

Info

Publication number: JP2006268502A
Application number: JP2005086358A
Authority: JP
Inventors: Masanori Tomota; 正憲友田
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2005-03-24
Filing date: 2005-03-24
Publication date: 2006-10-05
Anticipated expiration: 2025-03-24
Also published as: JP4203034B2

Abstract

PROBLEM TO BE SOLVED: To prevent restoration failure by resolving, in advance or as quickly as possible, any media error existing in a normal disk drive.

SOLUTION: In an array controller 20, an HDD failure predicting part 24 predicts the failure of each of HDDs 11-0 to 11-2 configuring a logical disk 10. When the failure of any of the HDDs 11-0 to 11-2 is predicted, an HDD health confirming part 25 performs health confirmation processing that checks for media errors by inspecting the contents of every HDD among the HDDs 11-0 to 11-2 other than the HDD whose failure has been predicted. When the HDD health confirming part 25 detects a media error, a media error restoring part 26 restores the missing data at the location of the media error by using the data or redundant data of all the HDDs among the HDDs 11-0 to 11-2 except the HDD in which the media error has occurred.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to an array controller that controls a redundant disk array composed of a plurality of disk drives, and more particularly to an array controller, a media error repair method, and a program suitable for quickly eliminating media errors in disk drives.

As a technology for improving data reliability by holding redundant data, a redundant disk array configured from a plurality of disk drives (HDDs), namely RAID (Redundant Array of Inexpensive Disks, or Redundant Array of Independent Disks), is known. Several levels (RAID levels) are defined for RAID; RAID 1 (mirroring) and RAID 5 (striping with parity) are well-known examples. Each of them arranges data and redundant data in a redundant disk array composed of a plurality of HDDs so that the data can be restored even if one of the HDDs fails.

A device composed of a disk array and an array controller that controls the disk array is called a disk array device. A host (host computer) that uses the disk array device as an external storage device recognizes the disk array as one logical disk (logical unit) having one storage area. For this reason, the disk array is sometimes called a logical disk (logical unit).

In a disk array device, when one of the HDDs constituting a disk array (logical disk) fails, the failed HDD is generally replaced with another normal HDD (spare HDD) (see, for example, Patent Document 1). The array controller restores the data of the failed HDD from the data of the remaining HDDs constituting the disk array, using the data redundancy of the disk array. The restored data is stored in the replacement HDD. In this way, the data of the failed HDD is restored to the replacement HDD, and the disk array device can continue operating as it did before the HDD failure.

Patent Document 1: JP-A-8-190460 (paragraphs 0020 to 0027)

As described above, in the prior art, when one of the HDDs constituting a disk array fails, the data of the failed HDD can be restored from the data of the remaining HDDs.

However, a media error may exist in one of the HDDs constituting the disk array other than the failed HDD. A media error is an error in which data cannot be read or written because of the disk medium (disk media) of the HDD. If a media error exists, the data or redundant data needed to restore the data of the failed HDD may not be readable from the HDD in which the media error exists. In that case, the data of the failed HDD corresponding to the area containing the media error can no longer be restored.

For example, consider a case where HDD #1 fails in a RAID 1 logical disk composed of HDDs #1 and #2. In this case, the failed HDD #1 is replaced with a new HDD #3. The array controller then restores the data by reading it from the normal HDD #2 and writing the read data to the new HDD #3. However, if a media error occurs in HDD #2 while data is being read from it, the data in that portion cannot be read and therefore cannot be written to HDD #3. As a result, the data stored in the portion where the media error occurred is lost, and the restoration fails.

As described above, in the conventional technology, restoring the data of a failed HDD (disk drive) may fail because of a media error in a normal HDD (disk drive).

The present invention has been made in consideration of the above circumstances, and its object is to provide an array controller, a media error repair method, and a program that can prevent restoration failure by eliminating media errors existing in normal disk drives in advance or as quickly as possible.

According to one aspect of the present invention, there is provided an array controller that controls access to a plurality of disk drives constituting a logical disk, thereby distributing data requested by a host and redundant data of that data across the plurality of disk drives. The array controller comprises: disk drive failure prediction means for predicting a failure of each of the plurality of disk drives constituting the logical disk; soundness confirmation means for executing soundness confirmation processing that inspects the contents of all of the disk drives constituting the logical disk, except a disk drive whose failure has been predicted by the disk drive failure prediction means, to check for media errors; and media error repair means for, when the soundness confirmation means detects a media error, repairing the data loss at the location of the media error by using the data or redundant data of all of the disk drives constituting the logical disk except the disk drive in which the media error has occurred.

In this configuration, when the disk drive failure prediction means predicts that one of the disk drives constituting the logical disk will fail in the near future, the soundness confirmation means inspects the contents of all of those disk drives other than the one whose failure has been predicted. If this inspection detects a media error, that is, a location from which data cannot be read normally, the data loss at that location is repaired using the data or redundant data of all of the disk drives constituting the logical disk except the drive in which the media error has occurred (in other words, the drives used for the repair include the disk drive whose failure has been predicted).

Thus, in the above configuration, a disk drive that may fail in the near future is predicted from among the disk drives constituting the logical disk. Before that drive actually fails, that is, while its data or redundant data can still be read normally and the redundancy of the logical disk is maintained, any location with a media error on the other disk drives constituting the logical disk is repaired, using the data or redundant data of the drive that may fail as well. Consequently, even if that drive later actually fails, its data can be restored using the data or redundant data of the remaining disk drives constituting the logical disk.

Here, it is preferable to add data copy means for executing, after the soundness confirmation processing by the soundness confirmation means is complete, data copy processing that copies the data of the disk drive whose failure has been predicted to a spare disk drive, and logical disk restoration means for, after the data copy processing is complete, substituting the spare disk drive for the disk drive whose failure has been predicted and disconnecting the latter from the logical disk. In this way, a disk drive that may fail in the near future can be reliably replaced with a spare disk drive before it actually fails. Moreover, even if the disk drive whose failure was predicted actually fails during the data copy processing, the soundness of the data on the other disk drives constituting the logical disk has already been confirmed, so the data can be copied to the spare disk more safely.

Alternatively, in place of the above data copy means, it is preferable to use data copy means that executes the data copy processing in the direction opposite to the address direction in which the soundness confirmation processing proceeds, and to end the soundness confirmation processing once it has been executed up to the address corresponding to the area for which the data copy processing has been completed. In this case, the data copy processing need not wait for the soundness confirmation processing to finish before starting, so the time from detecting a disk drive that may fail to completing the data copy processing can be shortened, and the soundness confirmation processing is also performed efficiently without wasted work. This effect is greatest when the data copy processing is started at the same time as the soundness confirmation processing.

According to the present invention, before a disk drive that may fail actually fails, any location with a media error on the other disk drives constituting the logical disk is repaired, using the data or redundant data of the drive that may fail as well. As a result, even if that drive actually fails, its data can be restored using the data or redundant data of the remaining disk drives constituting the logical disk, and unrecoverable data caused by media errors can be avoided.

Embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a disk array device according to an embodiment of the present invention. The disk array device of FIG. 1 is mainly composed of a logical disk (disk array) 10 and a controller (array controller) that controls the logical disk 10. To hold the redundant data needed to improve its reliability, the logical disk 10 is assumed to be composed of a plurality of disk drives, for example three magnetic disk drives (HDDs) 11-0 (#0) to 11-2 (#2). That is, the logical disk 10 is a logical unit having redundancy. RAID 1, RAID 10, RAID 3, RAID 5, and others are known as levels of this redundancy. Here, the logical disk 10 is assumed to be a disk array to which RAID 5 is applied. In RAID 5, data is distributed and written across a plurality of HDDs, and the parity data of that data is written to another HDD as redundant data. In addition, in RAID 5 the HDD to which the redundant data (parity data) is written is not fixed. Therefore, when RAID 5 is applied to the logical disk 10, each of the HDDs 11-0 to 11-2 constituting the logical disk 10 is used to store both data and redundant data (parity data), and the redundant data (parity data) is distributed across the HDDs 11-0 to 11-2.

The disk array device of FIG. 1 further has a spare HDD (SHDD) 11-3 (#3) that is used in place of a failed HDD when any of the HDDs 11-0 to 11-2 constituting the logical disk 10 fails. The HDD group including the HDD 11-3, that is, the HDDs 11-0 to 11-3, is connected to the array controller 20 through, for example, a SCSI interface.

The array controller 20 has a configuration management unit 21, an access control unit 22, a logical disk restoration unit 23, an HDD failure prediction unit 24, an HDD soundness confirmation unit 25, a media error repair unit 26, and a data copy unit 27.

The configuration management unit 21 mainly manages the configuration of the logical disk 10. The access control unit 22 controls access to the logical disk 10 in response to read/write requests from a host (host computer) that uses the disk array device of FIG. 1 as an external storage device. Here, the access control unit 22 converts a read/write request from the host into read/write requests for the individual HDDs on which the data is actually to be read or written, and issues the converted requests to the corresponding HDDs. For example, for a request from the host to write data D, the access control unit 22 divides the data D into data D1 and D2 and issues, for example, a write request for data D1 to the HDD 11-0, a write request for data D2 to the HDD 11-1, and a write request to the HDD 11-2 for the exclusive OR of data D1 and D2 (parity data), that is, the redundant data of D1 and D2. When any of the HDDs constituting the logical disk 10 (HDDs 11-0 to 11-2) fails, the logical disk restoration unit 23 substitutes the spare HDD (SHDD) 11-3 for the failed HDD. At this time, the logical disk restoration unit 23 restores the data of the failed HDD from the data of the normal HDDs in the logical disk 10 and stores it in the HDD 11-3.

These functions of the configuration management unit 21, the access control unit 22, and the logical disk restoration unit 23 are well-known functions that an array controller generally has.
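The write path and the restoration just described can be illustrated with a minimal Python sketch. It assumes a single stripe with two equal-sized data blocks and one parity block; the function names `xor_blocks` and `split_write` are hypothetical and do not come from the publication.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Return the byte-wise exclusive OR of equal-length blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def split_write(data: bytes) -> tuple[bytes, bytes, bytes]:
    """Split data D into D1 and D2 and compute the parity D1 XOR D2.

    Assumption for this sketch: D has an even length, so D1 and D2 are
    the same size. A real controller would work on fixed-size stripes.
    """
    if len(data) % 2:
        raise ValueError("data length must be even in this sketch")
    half = len(data) // 2
    d1, d2 = data[:half], data[half:]
    return d1, d2, xor_blocks(d1, d2)


# D1 would go to HDD 11-0, D2 to HDD 11-1, and the parity to HDD 11-2.
d1, d2, parity = split_write(b"ABCDEFGH")

# If the drive holding D1 becomes unreadable, D1 is recovered from the
# two surviving blocks of the stripe.
restored_d1 = xor_blocks(d2, parity)
```

Because XOR is its own inverse, any one block of the stripe can be recomputed from the other two, which is the property that both the restoration of a failed HDD and the media error repair described in this document rely on.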

The HDD failure prediction unit 24 predicts failures of the HDDs 11-0 to 11-2 constituting the logical disk 10. The prediction method is described later. The HDD soundness confirmation unit 25 executes HDD soundness confirmation processing, which inspects the contents of all of the HDDs 11-0 to 11-2 constituting the logical disk 10 other than the HDD whose failure has been predicted by the HDD failure prediction unit 24 in order to detect media errors. This processing is executed in units of data blocks of a fixed size (here, 64 KB data blocks), for example. When the HDD soundness confirmation unit 25 detects a data block in which a media error has occurred, it causes the media error repair unit 26 to repair that data block. The data copy unit 27 copies the data of the HDD whose failure has been predicted by the HDD failure prediction unit 24 to the spare HDD 11-3 in units of data blocks of a fixed size (here, 64 KB data blocks), for example.

Next, the operation of the array controller 20 in the disk array device of FIG. 1 will be described in the following order: (1) prediction of a failing HDD, (2) processing based on the prediction of a failing HDD, and (3) HDD soundness confirmation processing.

(1) Prediction of a failing HDD

First, the prediction by the HDD failure prediction unit 24 of the array controller 20 of a failure of the HDD 11-i (i = 0, 1, 2) constituting the logical disk 10 will be described.

Many recent HDDs have a function called SMART (Self-Monitoring, Analysis and Reporting Technology). With the SMART function, an HDD monitors and analyzes conditions related to the deterioration of its own reliability and reports the results to the host. The HDD 11-i is assumed to have this SMART function as well. By using the SMART function of the HDD 11-i, that is, the ability of the HDD 11-i to predict its own failure, the HDD failure prediction unit 24 can easily predict a failure of the HDD 11-i.

Events related to the risk of failure of the HDD 11-i include the following:
a) A request from the access control unit 22 of the array controller 20 to the HDD 11-i causes an error, and an error retry occurs in the array controller 20.
b) A request to the HDD 11-i results in a recoverable error (an error recovered by the error recovery function of the HDD 11-i).

Clearly, when the HDD 11-i is at risk of failing in the near future, events a) and b) occur more often for requests to the HDD 11-i. The HDD failure prediction unit 24 therefore counts the number of times event a) or b) occurs for requests to the HDD 11-i, and predicts a failure of the HDD 11-i when the events occur more than a predetermined number of times within a predetermined time. Alternatively, a failure of the HDD 11-i may be predicted simply when event a) or b) has occurred more than a predetermined number of times, without regard to a time window.
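The counting rule above can be sketched as a sliding-window counter. This is only an illustration under stated assumptions: the class name `FailurePredictor`, the use of timestamps in seconds, and the threshold values are hypothetical, and the publication does not specify how the count or the time window is implemented.

```python
from collections import deque


class FailurePredictor:
    """Predict failure when more than max_events occur within window_seconds."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events: deque = deque()  # timestamps of recent events a) or b)

    def record_event(self, timestamp: float) -> bool:
        """Record one occurrence of event a) or b).

        Returns True when the number of events inside the time window
        exceeds the predetermined count, i.e. a failure is predicted.
        """
        self._events.append(timestamp)
        # Discard events that have fallen out of the time window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.max_events


# Hypothetical thresholds: more than 3 events within 60 seconds.
predictor = FailurePredictor(max_events=3, window_seconds=60.0)
results = [predictor.record_event(t) for t in (0.0, 10.0, 20.0, 30.0)]
late_result = predictor.record_event(100.0)  # earlier events have expired
```

The simpler variant mentioned in the text, which ignores the time window, corresponds to never discarding old events.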

(2) Processing based on the prediction of a failing HDD

Next, the processing performed when the HDD failure prediction unit 24 predicts a failure of the HDD 11-i will be described with reference to the flowchart of FIG. 2.

Assume that the HDD failure prediction unit 24 has predicted a failure of one of the HDDs 11-0 to 11-2, for example the HDD 11-0. In this case, the HDD failure prediction unit 24 notifies the HDD soundness confirmation unit 25 and the data copy unit 27 that a failure of the HDD 11-0 has been predicted, and starts each of them (steps S1, S2).

The HDD soundness confirmation unit 25 then executes the HDD soundness confirmation processing and checks the soundness of the disk media mounted in all of the HDDs 11-0 to 11-2 constituting the logical disk 10 at that time, except the HDD 11-0 whose failure has been predicted by the HDD failure prediction unit 24 (that is, the HDDs 11-1 and 11-2). When the HDD soundness confirmation unit 25 detects a location where a media error has occurred, it causes the media error repair unit 26 to repair the data at that location. The repair uses the data or redundant data of all of the HDDs 11-0 to 11-2 constituting the logical disk 10 except the HDD in which the media error has occurred (that is, the HDDs used include the HDD 11-0 whose failure has been predicted).

The HDD soundness confirmation unit 25 and the media error repair unit 26 operate in this way because, once the HDD failure prediction unit 24 has predicted a failure of the HDD 11-0, the HDD 11-0 may really fail in the near future. That is, before the HDD 11-0 actually fails and its media become inaccessible, media errors on the other HDDs 11-1 and 11-2, whose failure has not been predicted, are detected, and the data at the location of each media error is repaired using the data or redundant data of all of the HDDs 11-0 to 11-2 constituting the logical disk 10 except the HDD in which the media error has occurred.

If the HDD 11-0 whose failure has been predicted really did fail in the near future, the media error locations on the HDDs 11-1 and 11-2 could no longer be repaired as described above. In the present embodiment, by contrast, the media error locations on the other HDDs 11-1 and 11-2 are repaired at the stage when the failure of the HDD 11-0 is predicted, so the data or redundant data of the HDD 11-0 (which has not yet failed at this stage) can be used for the repair.

Meanwhile, the data copy unit 27 executes data copy processing that copies the data of the HDD 11-0, whose failure has been predicted by the HDD failure prediction unit 24, to the spare HDD 11-3 in units of data blocks of a fixed size. If, during this data copy processing, there is a location on the HDD 11-0 from which data cannot be read (that is, a location where a media error has occurred), the data copy unit 27 restores the data that should have been read from the HDD 11-0 by using the data of the other HDDs 11-1 and 11-2 constituting the logical disk 10, and copies it to the HDD 11-3. This means that a data block for which the data copy has been performed is a block whose soundness, in the sense of preserving data redundancy, has been confirmed. Even if the HDD 11-0 whose failure has been predicted actually fails during the data copy processing, the data copy unit 27 likewise restores the data that should have been read from the HDD 11-0 by using the data of the other HDDs 11-1 and 11-2, and copies it to the HDD 11-3. If, during the data copy by the data copy unit 27, a write request from the host results in data or redundant data being written to the HDD 11-0, the access control unit 22 writes the same data or redundant data to the HDD 11-3 as well.
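The copy-with-fallback behavior described above can be sketched as follows. This is a simplified model, not the controller's actual implementation: each drive is a Python list of blocks, `None` stands for a block that cannot be read, the three-drive RAID 5 layout is assumed so that any block equals the XOR of the other two blocks in its stripe, and the names `copy_to_spare` and `xor2` are hypothetical.

```python
from typing import List, Optional

Block = Optional[bytes]  # None models a block that cannot be read


def xor2(a: bytes, b: bytes) -> bytes:
    """Byte-wise exclusive OR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def copy_to_spare(failing: List[Block], peers: List[List[Block]]) -> List[bytes]:
    """Copy every block of the failing drive to a new spare image.

    When a block of the failing drive is unreadable, it is rebuilt by
    XOR-ing the blocks at the same address on all peer drives.
    """
    if not peers:
        raise ValueError("at least one peer drive is required")
    spare: List[bytes] = []
    for addr, block in enumerate(failing):
        if block is None:
            rebuilt: Optional[bytes] = None
            for peer in peers:
                peer_block = peer[addr]
                if peer_block is None:
                    # Two unreadable blocks in one stripe cannot be rebuilt.
                    raise IOError(f"stripe {addr} is unrecoverable")
                rebuilt = peer_block if rebuilt is None else xor2(rebuilt, peer_block)
            assert rebuilt is not None
            block = rebuilt
        spare.append(block)
    return spare


# Hypothetical contents of HDD 11-1 and HDD 11-2 (two blocks each).
hdd1 = [b"\x01\x02", b"\x03\x04"]
hdd2 = [b"\x10\x20", b"\x30\x40"]
# Consistent contents of HDD 11-0 under the XOR relationship.
hdd0_full = [xor2(a, b) for a, b in zip(hdd1, hdd2)]
# HDD 11-0 with a media error in its second block.
hdd0_damaged: List[Block] = [hdd0_full[0], None]

spare = copy_to_spare(hdd0_damaged, [hdd1, hdd2])
```

The `IOError` branch shows the situation the soundness confirmation processing is meant to prevent: if a peer block in the same stripe is also unreadable, the block cannot be rebuilt.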

Here, denoting a target HDD by HDD 11-j, the data copy processing by the data copy unit 27 and the HDD soundness confirmation processing by the HDD soundness confirmation unit 25 may each be executed either from address 0 toward the maximum address of the HDD 11-j or from the maximum address toward address 0. In the present embodiment, however, the processing direction of the data copy unit 27 and the processing direction of the HDD soundness confirmation unit 25 are opposite to each other in order to improve processing efficiency. Here, the processing by the HDD soundness confirmation unit 25 (HDD soundness confirmation processing) is executed from the maximum address of the HDD 11-j toward address 0, as indicated by arrow 31 in FIG. 3(a). In contrast, the processing by the data copy unit 27 (data copy processing) is executed from address 0 toward the maximum address of the HDD 11-j, as indicated by arrow 32 in FIG. 3(a). The effect of this difference in processing direction is described later. In the present embodiment, the HDDs 11-j processed by the data copy unit 27 are the HDDs 11-0 and 11-3, whereas the HDDs 11-j processed by the HDD soundness confirmation unit 25 are the HDDs 11-1 and 11-2, so the two sets of target HDDs differ. In FIG. 3, however, they are represented by a single HDD 11-j for convenience.
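The effect of running the two processes in opposite directions can be seen in a small simulation. It is a sketch under simplifying assumptions that are not in the publication: both processes advance in discrete ticks at fixed block rates, addresses are block indices, and the function name `run_copy_and_check` is hypothetical.

```python
from typing import List, Tuple


def run_copy_and_check(
    total_blocks: int, copy_per_tick: int = 1, check_per_tick: int = 1
) -> Tuple[int, List[int]]:
    """Simulate copy (address 0 upward) and soundness check (max address downward).

    Returns the number of blocks copied and the list of block addresses
    that the soundness check actually had to inspect.
    """
    if copy_per_tick < 1:
        raise ValueError("copy_per_tick must be at least 1")
    copied_upto = 0                # blocks [0, copied_upto) have been copied
    check_addr = total_blocks - 1  # next block the check would inspect
    checked: List[int] = []
    while copied_upto < total_blocks:
        # Data copy advances from address 0 toward the maximum address.
        copied_upto = min(total_blocks, copied_upto + copy_per_tick)
        # The check advances from the maximum address toward address 0 and
        # stops once it reaches the area the copy has already covered.
        for _ in range(check_per_tick):
            if check_addr < copied_upto:
                break
            checked.append(check_addr)
            check_addr -= 1
    return copied_upto, checked


copied, checked = run_copy_and_check(total_blocks=8)
```

With equal rates and eight blocks, the check inspects only the upper half of the address range; the lower half is covered by the copy, whose blocks count as confirmed, so no block is processed by both.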

After starting the HDD soundness confirmation unit 25 and the data copy unit 27, the HDD failure prediction unit 24 waits for the data copy processing by the data copy unit 27 to complete (step S3). When the data copy processing is complete, the HDD failure prediction unit 24 starts the logical disk restoration unit 23 (step S4). The logical disk restoration unit 23 then substitutes the spare HDD 11-3 for the HDD 11-0 whose failure has been predicted by the HDD failure prediction unit 24, and disconnects the HDD 11-0 from the logical disk 10. At this point, the data of the HDD 11-0 has already been copied to the spare HDD 11-3 by the data copy processing of the data copy unit 27 in step S2. Therefore, unlike the case where the HDD 11-0 is replaced with the spare HDD 11-3 because it has actually failed, the logical disk restoration unit 23 does not need to restore the data of the HDD 11-0 from the data or redundant data of the other HDDs 11-1 and 11-2 of the logical disk 10. When the HDD 11-0 in the logical disk 10 has been replaced by the spare HDD 11-3 and disconnected from the logical disk 10, the configuration management unit 21 updates the configuration management information of the logical disk 10.

(3) HDD soundness confirmation processing
Next, the HDD soundness confirmation processing by the HDD soundness confirmation unit 25 is described in detail with reference to the flowchart of FIG. 4.

The HDD soundness confirmation unit 25 determines, from the address (logical block address) designating the data block (logical block) to be checked next, whether the entire area of the HDDs 11-j (here, the HDDs 11-1 and 11-2) has been scanned, that is, checked for soundness (step S11). If an area remains to be scanned, the HDD soundness confirmation unit 25 determines whether the address to be checked falls within the area for which the data copy from the failure-predicted HDD 11-0 to the spare HDD 11-3 has been completed (step S12).

If the address to be checked is not within the area for which the data copy from the HDD 11-0 to the spare HDD 11-3 has been completed, the HDD soundness confirmation unit 25 issues a command, for example a read command, to the HDDs 11-1 and 11-2 to inspect the area (data block) designated by the address and confirm its soundness (step S13). Based on the responses of the HDDs 11-1 and 11-2 to this read command, the HDD soundness confirmation unit 25 determines whether a media error has occurred in the HDD 11-1 or 11-2 (step S14). If no media error has occurred, the HDD soundness confirmation unit 25 updates the address to be checked so that it designates the next data block, and returns to step S11.

If, on the other hand, a media error has occurred in the HDD 11-1 or the HDD 11-2, the HDD soundness confirmation unit 25 causes the media error repair unit 26 to repair the location (data block) where the media error occurred (step S15). Suppose the media error occurred in the HDD 11-2. The location is repaired by restoring its data from the data or redundant data of the HDDs 11-0 and 11-1 corresponding to that location (data block) and writing the restored data to the data block (address) where the media error occurred. If the HDD 11-0 had already actually failed, the media error in the HDD 11-2 could not be repaired, and consequently the data of the HDD 11-0 could not be restored from the HDDs 11-1 and 11-2 either. The repair by the media error repair unit 26 may cause substitution processing to be performed inside the HDD 11-2. In that case, the physical location to which the restored data is written differs from the physical location where the media error occurred. The substitute physical location is nevertheless linked to the address at which the media error occurred, so the data is accessed correctly at that address.
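The repair in step S15 can be illustrated with a minimal sketch, assuming single-parity XOR redundancy as in RAID 5. The embodiment itself speaks only of "data or redundant data", so the XOR scheme, the function name `rebuild_block`, and the sample block contents are all assumptions made for illustration.

```python
from functools import reduce
from operator import xor


def rebuild_block(peer_blocks):
    """Rebuild one block by XOR-ing the corresponding blocks of all other drives.

    Assumes XOR parity (hypothetical choice): in a stripe whose blocks XOR to
    zero, any single missing block equals the XOR of the remaining ones.
    """
    if not peer_blocks:
        raise ValueError("at least one peer block is required")
    length = len(peer_blocks[0])
    if any(len(block) != length for block in peer_blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*peer_blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(xor, column) for column in zip(*peer_blocks))


# Hypothetical stripe: data on HDD 11-0 and HDD 11-2, parity on HDD 11-1.
block_hdd0 = bytes([0x12, 0x34, 0x56, 0x78])
block_hdd2 = bytes([0xAB, 0xCD, 0xEF, 0x01])
parity_hdd1 = rebuild_block([block_hdd0, block_hdd2])

# A media error hits this block on HDD 11-2: rebuild it from HDD 11-0 and
# HDD 11-1, after which it would be written back to the same address.
restored = rebuild_block([block_hdd0, parity_hdd1])
```

The sketch also shows why the repair must happen while the HDD 11-0 is still readable: without `block_hdd0`, the lost block cannot be recomputed.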

After having the media error repair unit 26 repair the location of the media error, the HDD soundness confirmation unit 25 updates the address to be checked so that it designates the next data block, and returns to step S11.

Suppose that, as the soundness confirmation processing proceeds in this way on all the HDDs 11-j of the logical disk 10 (HDDs 11-0 to 11-2) other than the failure-predicted HDD 11-0, namely the HDDs 11-1 and 11-2, the address to be checked next becomes Ac shown in FIG. 3(b). Suppose also that, at this point, the data copy processing from the HDD 11-0 to the spare HDD 11-3 by the data copy unit 27 has been completed from address 0 to the address Ac. In other words, the address to be checked next has reached the data-copy-completed area of the spare HDD 11-3 or, more precisely, the area of the HDDs 11-j (HDDs 11-1 and 11-2) corresponding to that data-copy-completed area.

Even if the HDD soundness confirmation unit 25 has not finished checking (scanning) the entire area of the HDDs 11-j (HDDs 11-1 and 11-2) (step S11), it ends the soundness confirmation processing once the address to be checked next reaches the data-copy-completed area (step S12). The reason, as already described, is that even if the area of the HDD 11-1 or 11-2 corresponding to the data-copy-completed area of the HDD 11-3 (addresses 0 to Ac) contains a location with a media error that is left unrepaired, the data at that location can be restored from the data of the other HDDs, including the HDD 11-3, that constitute the new logical disk 10. The soundness confirmation processing may nevertheless be executed on the entire area of the HDDs 11-j (HDDs 11-1 and 11-2).
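The flow of steps S11 to S15, including the early exit at the copy-completed area, might be sketched as follows. This is an illustrative model only: the function `health_check`, its callback parameters, and the toy copy-progress model in the demo are hypothetical and not taken from the embodiment.

```python
def health_check(max_address, copied_through, read_ok, repair):
    """Scan from max_address down toward address 0 (arrow 31 in FIG. 3).

    copied_through() -> highest address already copied to the spare, or -1
    read_ok(addr)    -> False when the read reports a media error
    repair(addr)     -> ask the repair unit to rebuild and rewrite the block
    All three callbacks are hypothetical stand-ins for units 27, 25 and 26.
    Returns the list of addresses that were repaired.
    """
    repaired = []
    addr = max_address
    while addr >= 0:                      # S11: area left to scan?
        if addr <= copied_through():      # S12: reached the copied area
            break                         # stop early; the spare covers the rest
        if not read_ok(addr):             # S13/S14: read and test for error
            repair(addr)                  # S15: repair via the repair unit
            repaired.append(addr)
        addr -= 1                         # advance to the next data block
    return repaired


# Toy demo: 8 blocks (0..7), media errors at addresses 6 and 1, and a copy
# that advances by one block each time its progress is queried.
bad_blocks = {6, 1}
progress = {"copied": -1}


def copied_through():
    progress["copied"] += 1
    return progress["copied"]


repaired = health_check(7, copied_through,
                        lambda a: a not in bad_blocks,
                        bad_blocks.discard)
```

In this toy run the scan and the copy meet near the middle of the address range, so the error at address 6 is repaired while the one at address 1 is left alone, since it lies in the area already copied to the spare.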

In the above embodiment, the soundness confirmation processing by the HDD soundness confirmation unit 25 and the copy processing by the data copy unit 27 are executed almost simultaneously, with their addresses advancing in mutually opposite directions, in order to shorten the time from when the HDD failure prediction unit 24 predicts the failure of the HDD 11-0 until the data copy from the HDD 11-0 to the spare HDD 11-3 is completed. However, from the standpoint of copying the data of the failure-predicted HDD 11-0 to the spare HDD 11-3 before the HDD 11-0 actually fails, while using the data of the HDD 11-0 to repair media errors in the remaining HDDs 11-1 and 11-2, the method is not limited to this. For example, the HDD soundness confirmation unit 25 may first perform the soundness confirmation processing on the entire area of the HDDs 11-1 and 11-2, after which the data copy unit 27 performs the copy processing. Furthermore, if the only concern is repairing media errors in the remaining HDDs 11-1 and 11-2 by using the data or redundant data of the HDD 11-0, the copy processing by the data copy unit 27 is not strictly necessary. In that case, when the HDD 11-0 is replaced with the spare HDD 11-3, the logical disk restoration unit 23 may restore the data of the HDD 11-0 from the data of the HDDs 11-1 and 11-2 and write it to the HDD 11-3.

Also in the above embodiment, the data copy unit 27 copies the data of the failure-predicted HDD 11-0 to the spare HDD 11-3 as is, except when the data cannot be read from the HDD 11-0. Alternatively, the data of the HDD 11-0 may be restored from the data or redundant data of the remaining HDDs 11-1 and 11-2 among the HDDs 11-0 to 11-2 constituting the logical disk 10, and the restored data copied to the HDD 11-3.
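The copy behavior of the embodiment, a plain copy with a rebuild only for unreadable blocks, could look roughly like the sketch below. The function `copy_to_spare` and its callbacks are hypothetical names; signalling an unreadable block by returning `None` is an assumption made for this example.

```python
def copy_to_spare(num_blocks, read_source, rebuild_from_peers, write_spare):
    """Copy blocks 0..num_blocks-1 upward (arrow 32 in FIG. 3) to the spare.

    read_source(addr)        -> block bytes, or None if the block is unreadable
    rebuild_from_peers(addr) -> block bytes restored from the other drives
    write_spare(addr, data)  -> store the block on the spare drive
    Returns the addresses that had to be rebuilt instead of copied.
    """
    rebuilt = []
    for addr in range(num_blocks):
        data = read_source(addr)
        if data is None:
            # Fall back to the data or redundant data of the remaining drives.
            data = rebuild_from_peers(addr)
            rebuilt.append(addr)
        write_spare(addr, data)
    return rebuilt


# Toy demo with dictionaries standing in for HDD 11-0 and the spare HDD 11-3.
source = {0: b"aa", 1: None, 2: b"cc"}     # block 1 is unreadable
peers_view = {0: b"aa", 1: b"bb", 2: b"cc"}  # what the peers can reconstruct
spare = {}

rebuilt = copy_to_spare(3, source.get, peers_view.get, spare.__setitem__)
```

The alternative mentioned above, in which every block is restored from the remaining drives, corresponds to passing a `read_source` that always returns `None`.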

The present invention is not limited to the above embodiment as it stands; at the implementation stage, its constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment.

FIG. 1 is a block diagram showing the configuration of a disk array device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing procedure when the HDD failure prediction unit 24 predicts a failure of the HDD 11-i.
FIG. 3 is a diagram showing the address directions in which the HDD soundness confirmation processing by the HDD soundness confirmation unit 25 and the data copy processing by the data copy unit 27 proceed, and the address at which the HDD soundness confirmation processing ends.
FIG. 4 is a flowchart showing the detailed procedure of the HDD soundness confirmation processing by the HDD soundness confirmation unit 25.

Explanation of symbols

10 ... logical disk, 11-0 to 11-2 ... HDDs (magnetic disk drives), 11-3 ... spare HDD (magnetic disk drive), 20 ... array controller, 21 ... configuration management unit, 22 ... access control unit, 23 ... logical disk restoration unit, 24 ... HDD failure prediction unit, 25 ... HDD soundness confirmation unit, 26 ... media error repair unit, 27 ... data copy unit.

Claims

1. An array controller that controls access to a plurality of disk drives constituting a logical disk and thereby arranges data requested by a host, and redundant data of that data, in a distributed manner across the plurality of disk drives, the array controller comprising:
disk drive failure prediction means for predicting a failure of each of the plurality of disk drives constituting the logical disk;
soundness confirmation means for executing soundness confirmation processing that inspects the contents of all the disk drives constituting the logical disk, other than the disk drive whose failure has been predicted by the disk drive failure prediction means, to confirm the presence or absence of a media error; and
media error repair means for, when the soundness confirmation means detects a media error, repairing the data loss at the location where the media error occurred by using the data or redundant data of all the disk drives constituting the logical disk other than the disk drive in which the media error occurred.

2. The array controller according to claim 1, further comprising:
data copy means for executing, after the soundness confirmation means completes the soundness confirmation processing, data copy processing that copies the data of the failure-predicted disk drive to a spare disk drive; and
logical disk restoration means for, after the data copy means completes the data copy processing, replacing the failure-predicted disk drive with the spare disk drive and disconnecting the failure-predicted disk drive from the logical disk.

3. The array controller according to claim 2, wherein the data copy means normally copies the data of the failure-predicted disk drive to the spare disk drive and, when data of the failure-predicted disk drive cannot be read normally, restores that data from the data or redundant data of all the disk drives constituting the logical disk other than the failure-predicted disk drive and copies the restored data to the spare disk drive.

4. The array controller according to claim 2, wherein the data copy means restores the data of the failure-predicted disk drive from the data or redundant data of all the disk drives constituting the logical disk other than the failure-predicted disk drive, and copies the restored data to the spare disk drive.

5. The array controller according to claim 1, further comprising:
data copy means for executing data copy processing that copies the data of the failure-predicted disk drive to a spare disk drive in the address direction opposite to that in which the soundness confirmation processing by the soundness confirmation means proceeds; and
logical disk restoration means for, after the data copy means completes the data copy processing, replacing the failure-predicted disk drive with the spare disk drive and disconnecting the failure-predicted disk drive from the logical disk,
wherein the soundness confirmation means ends the soundness confirmation processing when that processing has been executed up to an address corresponding to the area for which the data copy processing by the data copy means has been completed.

6. The array controller according to claim 5, wherein the data copy means starts the data copy processing when the soundness confirmation means starts the soundness confirmation processing.

7. The array controller according to claim 1, wherein the disk drive failure prediction means predicts a failure of each of the plurality of disk drives by using a function, provided as standard in the disk drives constituting the logical disk, that monitors and analyzes conditions related to a decline in the reliability of the disk drive and reports the result to the device using the disk drive.

8. The array controller according to claim 1, wherein the disk drive failure prediction means counts, for each of the plurality of disk drives constituting the logical disk, the number of error retries or recoverable errors that occur in accesses to that disk drive, and predicts a failure of each of the plurality of disk drives based on its count value.

9. A media error repair method applied to an array controller that controls access to a plurality of disk drives constituting a logical disk and thereby arranges data requested by a host, and redundant data of that data, in a distributed manner across the plurality of disk drives, the method comprising:
a step of predicting a failure of each of the plurality of disk drives constituting the logical disk;
a step of executing, when a failure of any one of the plurality of disk drives is predicted, soundness confirmation processing that inspects the contents of all the disk drives constituting the logical disk other than the failure-predicted disk drive to confirm the presence or absence of a media error; and
a step of repairing, when the soundness confirmation processing detects a media error, the data loss at the location where the media error occurred by using the data or redundant data of all the disk drives constituting the logical disk other than the disk drive in which the media error occurred.

10. A program for an array controller that controls access to a plurality of disk drives constituting a logical disk and thereby arranges data requested by a host, and redundant data of that data, in a distributed manner across the plurality of disk drives, the program causing the array controller to execute:
a step of predicting a failure of each of the plurality of disk drives constituting the logical disk;
a step of executing, when a failure of any one of the plurality of disk drives is predicted, soundness confirmation processing that inspects the contents of all the disk drives constituting the logical disk other than the failure-predicted disk drive to confirm the presence or absence of a media error; and
a step of repairing, when the soundness confirmation processing detects a media error, the data loss at the location where the media error occurred by using the data or redundant data of all the disk drives constituting the logical disk other than the disk drive in which the media error occurred.