JP2005004733A

JP2005004733A - Arrangement and method for detecting write errors in a storage system

Info

Publication number: JP2005004733A
Application number: JP2004141188A
Authority: JP
Inventors: Ian David Judd; イアン・デビッド・ジャッド
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2003-06-11
Filing date: 2004-05-11
Publication date: 2005-01-06
Also published as: GB0313419D0; CN1324474C; GB2402803B; CN1573703A; GB2402803A

Abstract

PROBLEM TO BE SOLVED: To provide a scheme for detecting write errors in a disk storage system by using a phase field.

SOLUTION: User data blocks D are divided into groups 120, and a check block P is inserted after each group. The check block includes a field that is updated each time the group is written. In the simplest case the field is a single bit that is inverted; for stronger protection it may instead be a multi-bit counter that is incremented. The check block may be an XOR combination of the data blocks of each group, or may be an XOR combination of the LBAs of each group.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to storage systems, and in particular to disk storage systems for electronic data storage.

Advances in recording technology have been doubling hard drive capacity every year. In 2003, areal density is expected to reach 100 GB per square inch, allowing a 3.5-inch drive to store 300 GB.

Hard drive reliability is specified in terms of MTBF and unrecoverable error rate. Typical specifications for current server-class drives are 1,000,000 hours and one unrecoverable error per 10^15 bits read. However, as areal density increases, maintaining reliability becomes more difficult because of lower flying heights, media defects, and similar factors.

To further improve the reliability of storage systems, RAID (Redundant Array of Independent Disks) arrays (for example, RAID-1 or RAID-5) are often used. With high-capacity drives, however, a single level of redundancy is no longer sufficient to reduce the probability of data loss to a negligible level.

A disk drive may also return stale data for a read command because an earlier write command did not write to the correct location on the recording medium, or failed to record on the medium at all. This may be caused by an intermittent hardware fault or a latent design flaw. For example, a drive might write data to the wrong LBA (logical block address) because of a firmware bug, write off-track, or not write at all because a drop of lubricant (commonly called "lube") lifts the head away from the disk surface.

There is also growing interest in using commodity drives such as Advanced Technology Attachment (ATA) drives in server applications, because they cost roughly one third as much in cents/MB. However, these drives were originally intended for intermittent use in PCs and may therefore be less reliable than server-class drives. Because ATA drives support only 512-byte blocks, a block-level LRC (longitudinal redundancy check) cannot be used to detect data corruption.

With a single disk drive, the controller can verify each block by reading it back immediately after writing it.

Any type of redundant RAID array can be implemented in such a way that read data can be checked. For example, with a RAID-5 array the controller can check that the read data is consistent with the other data drives and the parity drive.

However, both approaches have the drawback of dramatically reducing overall throughput in I/O (input/output) commands per second: the first requires an extra disk revolution, and the second requires several drives to be accessed for each read command.

There is therefore a need for write error detection in storage systems that alleviates the drawbacks described above.

According to a first aspect of the present invention, there is provided an arrangement for detecting write errors in a storage system, the arrangement comprising: means for storing data blocks in groups, each group comprising a plurality of data blocks and a check block that is updated each time the group is written to storage; and means for detecting a write error by checking the check block.

Preferably, the check block is a combination of the data blocks of the group.

Preferably, the combination is a logical exclusive OR combination.

Preferably, the check block is a combination of the logical block addresses associated with the group.

Preferably, the check block is a combination of phase fields that are updated each time the group is written.

Preferably, the phase field comprises a single-bit value that is inverted each time the group is written.

Preferably, the phase field comprises a multi-bit value that is updated each time the group is written.

The arrangement preferably further includes a non-volatile table of phase field values.

Preferably, the non-volatile table includes a reserved disk drive area and a working copy of the table cached in a controller of the system.

The arrangement preferably further includes a non-volatile log arranged to record an entry prior to a write operation, the entry being arranged for one of:
A) invalidation; and
B) deletion upon completion of the write operation.

Preferably, the log is arranged to hold updates to the working copy of the table in the controller that have not yet been stored in the non-volatile table.

Preferably, the log is stored in a memory that also holds code for the controller of the system.

Preferably, the storage system comprises a disk storage system.

Preferably, the disk storage system includes an ATA disk drive.

Preferably, the disk storage system includes a RAID system.

In a second aspect, the present invention provides a method for detecting write errors in a storage system, the method comprising: storing data blocks in groups, each group comprising a plurality of data blocks and a check block; updating the check block each time the group is written to storage; and detecting possible write errors by checking the check block.

Preferably, the phase field value is stored in a non-volatile table.

The method preferably further comprises recording an entry in a non-volatile log prior to a write operation, and performing one of the following operations:
A) invalidating the entry; and
B) deleting the entry upon completion of the write operation.

The method preferably further comprises holding in the log updates to the working copy of the table in the controller that have not yet been stored in the non-volatile table.

In a third aspect, the present invention provides a computer program element comprising computer program means for performing substantially the method of the second aspect.

An arrangement and method incorporating the present invention, for detecting write errors in a storage system through the use of a phase field, will now be described by way of example with reference to the accompanying drawings.

Briefly stated, the present invention, in its preferred embodiment, uses interleaved parity blocks containing a phase field (for example, a single-bit flag) to detect nearly all instances of data corruption by a disk drive. The parity blocks also provide an additional level of error correction. These features are particularly useful for ATA drives, which tend to have a higher uncorrectable error rate than server drives. (ATA drives typically specify a hard error rate of 1 in 10^14 bits, so the probability that a 100 GB drive contains a block with a hard read error is 0.8%. If such drives are used to build a 10+P RAID-5 array, the probability of a rebuild failing after a drive replacement is 8%.)
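The two percentages quoted in parentheses can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes that 100 GB means 10^11 bytes, that bit errors are independent, and that a rebuild has to read all ten surviving drives in full. The function name `p_hard_error` is hypothetical.

```python
import math


def p_hard_error(capacity_bytes: float, bit_error_rate: float) -> float:
    """Probability that at least one bit out of the whole capacity is unreadable."""
    bits = capacity_bytes * 8
    # 1 - (1 - r)**bits, evaluated in a numerically stable way for tiny r.
    return -math.expm1(bits * math.log1p(-bit_error_rate))


# One 100 GB ATA drive at 1 error in 10^14 bits.
p_drive = p_hard_error(100e9, 1e-14)

# Rebuilding a 10+P array: the ten surviving drives are read end to end.
p_rebuild = p_hard_error(10 * 100e9, 1e-14)

print(f"single drive: {p_drive:.2%}, rebuild: {p_rebuild:.2%}")
```

Under these assumptions the single-drive figure comes out at about 0.8% and the rebuild figure slightly below 8%, which matches the quoted values once the latter is rounded.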

Referring now to FIG. 1, a magnetic disk storage system 100 includes a disk 110 on which information is stored in blocks D and P, typically of 512 bytes each. When data is stored on the disk, one parity block P is inserted after every N data blocks, for example after every eight 512-byte blocks (4 KB) as shown in the figure. These N+1 blocks are regarded as a group 120. The usable data capacity of the drive is consequently reduced to N/(N+1) of its raw capacity, that is, to 8/9 for N = 8.
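The interleaved layout can be expressed as a simple address mapping. The following is a minimal sketch that assumes groups are packed contiguously from physical LBA 0 and that the parity block directly follows its N data blocks; the function names are hypothetical and do not come from the patent.

```python
N = 8  # data blocks per group, as in the figure


def data_lba(logical_block: int, n: int = N) -> int:
    """Physical LBA of a host-visible data block in the interleaved layout."""
    group, offset = divmod(logical_block, n)
    return group * (n + 1) + offset


def parity_lba(group: int, n: int = N) -> int:
    """Physical LBA of the parity block that closes the given group."""
    return group * (n + 1) + n


def usable_fraction(n: int = N) -> float:
    """Share of the raw capacity left for user data."""
    return n / (n + 1)
```

With N = 8, logical blocks 0 to 7 map to physical LBAs 0 to 7, physical LBA 8 holds the parity block of group 0, and logical block 8 lands on physical LBA 9.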

As shown in FIG. 2, the parity block P contains a group parity that is calculated as follows.

Step 210: XOR together the corresponding bytes of each data block in the group.

Step 220: XOR the physical LBA of the first block in the group into the first few bytes of the result of step 210. This LBA seed allows addressing errors to be detected for nearly all reads and for some writes.

Step 230: XOR the phase field F into the last few bits of the result of step 220. The phase field F may be a single-bit value that is inverted each time the group is written. Alternatively, it may be a multi-bit counter that is updated (for example, incremented) each time the group is written. The phase field detects most of the remaining addressing errors on writes.
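Steps 210 to 230 can be sketched as follows. The text only says "the first few bytes" and "the last few bits", so the 4-byte big-endian LBA seed and the placement of the phase field in the final byte are assumptions made for this illustration, as is the function name `group_parity`.

```python
BLOCK = 512  # bytes per block


def group_parity(blocks: list[bytes], first_lba: int, phase: int) -> bytes:
    """Compute the parity block for one group (steps 210, 220 and 230)."""
    parity = bytearray(BLOCK)

    # Step 210: XOR the corresponding bytes of every data block.
    for blk in blocks:
        if len(blk) != BLOCK:
            raise ValueError("every data block must be exactly 512 bytes")
        for i, b in enumerate(blk):
            parity[i] ^= b

    # Step 220: XOR the physical LBA of the first block into the leading
    # bytes (assumed here: 4 bytes, big-endian).
    for i, b in enumerate(first_lba.to_bytes(4, "big")):
        parity[i] ^= b

    # Step 230: XOR the phase field into the trailing bits
    # (assumed here: the low bits of the last byte).
    parity[-1] ^= phase & 0xFF

    return bytes(parity)
```

Because all three contributions are combined by XOR, flipping the phase flag changes only the bits it occupies, and the data parity can still be recovered by XORing the two seeds back out.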

Except when the drive encounters a hard read error, the disk controller (not shown) reads and writes the drive in complete groups. It performs the above calculation for each group. On a write, the result is written to the parity block. On a read, the result is XORed with the contents of the parity block that was read; a non-zero result means the group contains an error.
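The read-side check amounts to recomputing the parity and testing the XOR with the stored parity block for zero. The sketch below assumes the same illustrative field placement as before (4-byte big-endian LBA seed at the front, phase in the last byte); `expected_parity` and `group_ok` are hypothetical names.

```python
BLOCK = 512


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def expected_parity(data: list[bytes], first_lba: int, phase: int) -> bytes:
    """Parity the controller expects for this group, LBA and phase."""
    acc = bytes(BLOCK)
    for blk in data:
        acc = xor_bytes(acc, blk)
    # Assumed layout: LBA seed in bytes 0..3, phase field in the last byte.
    seed = first_lba.to_bytes(4, "big") + bytes(BLOCK - 5) + bytes([phase & 0xFF])
    return xor_bytes(acc, seed)


def group_ok(data: list[bytes], stored_parity: bytes,
             first_lba: int, phase: int) -> bool:
    """True when the syndrome (expected XOR stored) is all zero."""
    syndrome = xor_bytes(expected_parity(data, first_lba, phase), stored_parity)
    return not any(syndrome)
```

A group read from the wrong address, carrying the wrong phase, or containing a corrupted data block all produce a non-zero syndrome in this model.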

The parity block P allows the controller to handle the following drive errors.

If the drive encounters an unrecoverable media error in one data block of a group, the controller restarts the read at the following block. It then reconstructs the lost block from the group parity, on the assumption that the LBA and phase are correct. Finally, it reassigns the bad LBA and rewrites the block.
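Reconstructing a single unreadable block is the usual XOR-parity recovery, with the LBA seed and phase removed first. This is a sketch under the same illustrative layout assumptions (4-byte big-endian LBA seed, phase in the last byte); `rebuild_missing` is a hypothetical name, and the missing block is marked with `None`.

```python
from typing import List, Optional

BLOCK = 512


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def rebuild_missing(blocks: List[Optional[bytes]], stored_parity: bytes,
                    first_lba: int, phase: int) -> bytes:
    """Recover the one unreadable data block of a group from its parity."""
    if blocks.count(None) != 1:
        raise ValueError("exactly one block must be missing")
    # Remove the assumed LBA seed and phase so that only the data XOR remains.
    seed = first_lba.to_bytes(4, "big") + bytes(BLOCK - 5) + bytes([phase & 0xFF])
    acc = xor_bytes(stored_parity, seed)
    # XOR in every block that could be read; what is left is the lost block.
    for blk in blocks:
        if blk is not None:
            acc = xor_bytes(acc, blk)
    return acc
```

The recovery trusts the LBA and phase it is given, which mirrors the assumption stated in the text: with one block missing there is no redundancy left to verify them.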

If the drive reads the wrong LBA, the group parity check is non-zero because of the LBA seed. The controller then retries the read once and returns a media error if the parity check fails again.

If the drive previously wrote to the wrong LBA, or did not write to the media at all, and the host then requests a read of the correct LBA, the group parity check is non-zero because of the phase field F. The controller then retries the read once and returns a media error if the parity check fails again.

If the drive previously wrote to the wrong LBA and the host then requests a read of that wrongly written LBA, the group parity check fails because of the LBA seed. The controller retries the read once and then returns a media error.

When the controller returns a media error, the data is still recoverable if the drive is a member of a redundant array (not shown).

Because the controller always reads and writes complete groups on the disk, short or unaligned writes require a read-modify-write. RAID-5 incurs a similar penalty, however, so there is no additional overhead in that case.

The disk controller must store the current phase of each group in non-volatile storage 130. For example, with a single-bit phase flag and 4 KB groups, the resulting bit map occupies approximately 2.6 MB for a 100 GB drive. The controller initializes all phase flags to zero when the drive is formatted. The phase flag bit map 130 can be implemented in various ways. Flash memory is not directly suitable, because it would wear out rapidly if the same group were written repeatedly. Battery-backed SRAM (static random access memory) is bulky and expensive. The preferred solution is to store the bit map in a reserved area of the disk drive and to cache a working copy in the controller's SDRAM (synchronous dynamic random access memory). However, to avoid updating the reserved area for every write command, changes must be batched in some way and protected against power failure and drive reset.
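One way to arrive at the quoted bit map size is sketched below. It assumes that 100 GB means 10^11 bytes of raw capacity, that each group occupies nine 512-byte blocks on the media, and that the result is expressed in units of 2^20 bytes; other readings give somewhat different figures. The helper names are hypothetical.

```python
GROUP_BYTES = 9 * 512  # eight data blocks plus one parity block


def bitmap_bytes(drive_bytes: int, group_bytes: int = GROUP_BYTES) -> int:
    """Bytes needed to hold one phase bit per group."""
    groups = -(-drive_bytes // group_bytes)   # ceiling division
    return -(-groups // 8)


def flip_phase(bitmap: bytearray, group: int) -> int:
    """Invert the phase flag of a group in place and return its new value."""
    bitmap[group >> 3] ^= 1 << (group & 7)
    return (bitmap[group >> 3] >> (group & 7)) & 1


size_mib = bitmap_bytes(100 * 10**9) / 2**20
print(f"{size_mib:.2f} MiB")
```

Under these assumptions the bit map comes to roughly 2.6 MB, small enough to cache in controller memory in full.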

In addition, if a disk write is interrupted by a power failure or reset, the state of the phase flag on the disk is unknown. Since nothing is wrong with the drive, this must not cause subsequent reads to fail with a media error. (It is acceptable, however, to return old data, new data, or a mixture of the two, since the controller had not yet reported completion of the write to the host.)

These two problems can be solved by creating an entry in a non-volatile log immediately before issuing a disk write, and deleting (or invalidating) it when the write completes. The same log can also be used to hold updates to the bit map in SDRAM that have not yet been flushed to disk. A typical log entry requires 8 bytes, as follows:

Byte(s)  Description
0:3      Address of the first group to be written
4:5      Number of consecutive groups to be written (non-zero indicates a valid log entry)
6        Initialized to FFh ("h" is the suffix denoting hexadecimal). Set to 00h after the disk write has completed.
7        Initialized to FFh. Set to 00h after the bit map on disk has been updated.
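The 8-byte entry maps directly onto a fixed binary layout. The sketch below uses Python's `struct` module; the big-endian byte order and the helper names are assumptions, and `needs_recovery` encodes only one plausible reading of how an entry left behind by an interrupted write would be recognized.

```python
import struct

# Bytes 0:3 group address, 4:5 group count, 6 write flag, 7 bit map flag.
# Big-endian byte order is an assumption; the text does not specify it.
ENTRY = struct.Struct(">IHBB")


def new_entry(first_group: int, count: int) -> bytearray:
    """Create the entry that is logged just before the disk write is issued."""
    if not 0 < count <= 0xFFFF:
        raise ValueError("count must be non-zero and fit in two bytes")
    return bytearray(ENTRY.pack(first_group, count, 0xFF, 0xFF))


def mark_write_done(entry: bytearray) -> None:
    """Byte 6: FFh -> 00h once the disk write has completed."""
    entry[6] = 0x00


def mark_bitmap_flushed(entry: bytearray) -> None:
    """Byte 7: FFh -> 00h once the on-disk bit map has been updated."""
    entry[7] = 0x00


def needs_recovery(entry: bytearray) -> bool:
    """Valid entry whose disk write never completed (phase on disk unknown)."""
    _, count, write_done, _ = ENTRY.unpack(entry)
    return count != 0 and write_done == 0xFF
```

Both flags only ever change from FFh to 00h, which suits memories where bits can be cleared individually but set again only by erasing.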

The log can be stored in a small battery-backed SRAM, that is, NVRAM (non-volatile RAM).

In some implementations it may be convenient to store the log in an additional sector of the flash memory that holds the controller code. When the log sector has been completely used, it is erased to all FFh. Writing a word to flash typically takes about 500 μs, and each disk write requires three flash writes, which allows nearly 700 disk writes per second. Because the log is written sequentially, wear on the flash memory is evened out automatically. Furthermore, the log entry is formatted so that each byte is written only once per disk write. For example, a 1 MB flash with an endurance of 10^5 cycles will last more than 4 years at 100 disk writes per second.
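The throughput and lifetime figures can be checked with a short calculation. This sketch assumes that 1 MB means 2^20 bytes devoted entirely to the log, that each disk write consumes one 8-byte entry, and that the flash is erased once per full pass; the variable names are illustrative.

```python
FLASH_WORD_WRITE_S = 500e-6        # about 500 microseconds per flash word write
FLASH_WRITES_PER_DISK_WRITE = 3    # flash writes needed for each disk write

# Upper bound on disk writes per second imposed by the flash log.
max_disk_writes_per_s = 1 / (FLASH_WORD_WRITE_S * FLASH_WRITES_PER_DISK_WRITE)

LOG_BYTES = 2**20                  # 1 MB of flash, assumed to be 2**20 bytes
ENTRY_BYTES = 8
ENDURANCE_CYCLES = 1e5             # erase cycles the flash can sustain
DISK_WRITES_PER_S = 100

entries_per_pass = LOG_BYTES // ENTRY_BYTES
lifetime_s = entries_per_pass * ENDURANCE_CYCLES / DISK_WRITES_PER_S
lifetime_years = lifetime_s / (365.25 * 24 * 3600)

print(f"{max_disk_writes_per_s:.0f} writes/s, {lifetime_years:.1f} years")
```

The rate comes out a little under 700 disk writes per second and the lifetime slightly above four years, in line with the figures given in the text.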

To ensure high availability, storage systems often use dual (active-active) controllers. In this environment it is desirable for each controller to maintain a mirror copy of the non-volatile log. This ensures that the protection provided by the phase field is not lost even if a controller fails. The two logs must be kept synchronized by exchanging messages between the controllers. Each controller must notify the other of its log updates before writing a group to disk, and again when the write has completed. In practice this does not usually add much overhead, because higher-level functions such as RAID-5 exchange similar messages anyway.

A means of resynchronizing the two controllers must also be provided, for example when one of the controllers is replaced after a failure. This is most easily achieved by flushing the outstanding updates from the log in the other controller to disk and clearing the log in the replacement controller.

It will be appreciated that the scheme described above for detecting write errors in a storage system using a phase flag provides the following advantages:

Improved data integrity. The scheme is particularly useful with low-cost desktop drives, which are usually limited to 512-byte blocks and therefore leave no room to store a check field in each block. It is, however, also applicable to server-class drives.

Low performance impact, particularly when used with RAID-5 (no additional disk access is required to check read data).

In the simplest case, the phase field is a single bit that is inverted on each write. For stronger protection, however, it may be a multi-bit counter that is updated by a positive value (that is, incremented) or by a negative value (that is, decremented).
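One way to see why a wider counter protects better is to count how many consecutive lost writes to the same group it takes before the stale phase matches the expected one again. The sketch below is an illustrative model of that effect, not a description of the patented implementation; both function names are hypothetical.

```python
def next_phase(phase: int, bits: int = 1) -> int:
    """Advance a phase field of the given width, wrapping at 2**bits."""
    return (phase + 1) % (1 << bits)


def stale_group_detected(lost_writes: int, bits: int) -> bool:
    """True if a group that missed `lost_writes` consecutive writes is caught."""
    on_disk = 0                      # phase actually recorded on the media
    expected = on_disk               # phase held in the controller's table
    for _ in range(lost_writes):     # the controller advances, the media does not
        expected = next_phase(expected, bits)
    return expected != on_disk
```

In this model a single-bit flag misses the case of two consecutive lost writes to one group, whereas a 4-bit counter only aliases after sixteen.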

It will be understood that the method described above for detecting write errors in a storage system will typically be implemented in software running on a processor (not shown) in the system, and that the software may be provided as a computer program element carried on any suitable data carrier (also not shown), such as a magnetic or optical computer disk.

It will also be understood that, although the invention has been described in the context of a magnetic disk storage system, it is alternatively applicable to other storage systems, such as those based on optical disks or magnetic tape.

An arrangement for detecting write errors in a storage system, and a method for detecting write errors in a storage system, substantially as described with reference to the accompanying drawings, are also applicable.

FIG. 1 is a schematic block diagram illustrating a disk drive storage system incorporating the present invention.
FIG. 2 is a schematic block diagram illustrating a method for calculating a parity block using the system of FIG. 1.

Explanation of symbols

100 disk storage system
110 disk
120 group
130 phase flag bit map

Claims

1. An arrangement for detecting a write error in a storage system, comprising:
means for storing data blocks in groups, each group comprising a plurality of data blocks and one check block, the check block being updated each time the group is written to storage; and
means for detecting a write error by checking the check block.

2. The arrangement according to claim 1, wherein the check block is a combination of the data blocks of the group.

3. The arrangement according to claim 2, wherein the combination is a logical exclusive OR combination.

4. The arrangement according to any one of claims 1 to 3, wherein the check block is a combination of logical block addresses associated with the group.

5. The arrangement according to claim 4, wherein the combination is a logical exclusive OR combination.

6. The arrangement according to any one of the preceding claims, wherein the check block is a combination of phase fields that are updated each time the group is written.

7. The arrangement according to claim 6, wherein the combination is a logical exclusive OR combination.

8. The arrangement according to claim 6 or 7, wherein the phase field comprises a single-bit value that is inverted each time the group is written.

9. The arrangement according to claim 6 or 7, wherein the phase field comprises a multi-bit value that is updated each time the group is written.

10. The arrangement according to any one of claims 6 to 9, wherein the arrangement further includes a non-volatile table of phase field values.

11. The arrangement according to claim 10, wherein the non-volatile table includes a reserved disk drive area and a working copy of the table that is cached in a controller of the system.

12. The arrangement according to any one of the preceding claims, further including a non-volatile log arranged to record an entry prior to a write operation, the entry being arranged for one of invalidation and deletion upon completion of the write operation.

13. The arrangement according to claim 12 when dependent on claim 11, wherein the log is arranged to hold updates to the working copy of the table in the controller that are not yet stored in the non-volatile table.

14. The arrangement according to claim 12 or 13, wherein the log is stored in a memory that also holds code for the controller of the system.

15. The arrangement according to any one of the preceding claims, wherein the storage system comprises a disk storage system.

16. The arrangement according to claim 15, wherein the disk storage system includes an ATA disk drive.

17. The arrangement according to claim 15 or 16, wherein the disk storage system includes a RAID system.

18. A method for detecting a write error in a storage system, the method comprising:
storing data blocks in groups, each group comprising a plurality of data blocks and one check block;
updating the check block each time the group is written; and
detecting write errors that may occur by checking the check block.

19. The method according to claim 18, wherein the check block is a combination of the data blocks of the group.

20. The method according to claim 19, wherein the combination is a logical exclusive OR combination.

21. The method according to any one of claims 18 to 20, wherein the check block is a combination of logical block addresses associated with the group.

22. The method according to claim 21, wherein the combination is a logical exclusive OR combination.

23. The method according to any one of claims 18 to 22, wherein the check block is a combination of phase fields that are updated each time the group is written.

24. The method according to claim 23, wherein the combination is a logical exclusive OR combination.

25. The method according to claim 23 or 24, wherein the phase field comprises a single-bit value that is inverted each time the group is written.

26. The method according to claim 23 or 24, wherein the phase field comprises a multi-bit value that is updated each time the group is written.

27. The method according to any one of claims 23 to 26, wherein the phase field value is stored in a non-volatile table.

28. The method according to claim 27, wherein the non-volatile table includes a reserved disk drive area and a working copy of the table that is cached in a controller of the system.

29. The method according to any one of claims 18 to 28, further comprising recording an entry in a log prior to a write operation, and performing one of invalidating the entry and deleting the entry upon completion of the write operation.

30. The method according to claim 29 when dependent on claim 28, further comprising holding in the log updates to the working copy of the table in the controller that are not yet stored in the non-volatile table.

31. The method according to claim 29 or 30, wherein the log is stored in a memory that also holds code for a controller of the system.

32. The method according to any one of claims 18 to 31, wherein the storage system comprises a disk storage system.

33. The method according to claim 32, wherein the disk storage system comprises an ATA disk drive.

34. The method according to claim 32 or 33, wherein the disk storage system comprises a RAID system.

35. A computer program element comprising computer program means for performing substantially the method of any one of claims 18 to 34.