JP3785127B2

JP3785127B2 - Disk array control device and data writing method in disk array control device

Info

Publication number: JP3785127B2
Application number: JP2002265854A
Authority: JP
Inventors: 聡水野
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-09-11
Filing date: 2002-09-11
Publication date: 2006-06-14
Anticipated expiration: 2022-09-11
Also published as: JP2004102822A

Description

【０００１】
【発明の属する技術分野】
本発明は、計算機システムのディスクアレイ制御装置及びこのようなディスクアレイ制御装置におけるデータ書き込み方法に関する。
【０００２】
【従来の技術】
従来、RAID(Redundant Array of Independent Disks)の高速化方式の１つとして、制御対象の論理ディスクの全てのユーザ領域に関してクラスタ単位でのログ化書きこみ（ログ構造化ファイルシステム）を行っていた。
【０００３】
この方式では、特に、RAID5に対する複数のランダムライト要求に対して、それらを一続き、かつクラスタサイズで物理ディスクに対して一括して書きこむということをしていた。この結果、従来パリティの再計算のために必要であった、ライト処理前のデータ（元のデータ、及び書きこむ前のパリティの値）のリードを省くことができ、効率の良いライト処理を実現していた。このような技術としては、例えば、特開平１１−５３２３５号公報、特開２００１−１８４１７２号公報、特開２００２−１４７７６号公報に開示されている。
【０００４】
【発明が解決しようとする課題】
しかしながら、このようなRAID高速化方式では、対象の論理ディスクの全てのユーザ領域のデータブロック管理単位に対してアドレス変換テーブルでアドレス変換を行なう必要があった。このため、制御対象の論理ディスクの容量に比例したサイズのアドレス変換テーブルが必要であり、論理ディスクを構成するディスクのサイズが大きい場合や、ディスク台数が多い時には、論理ディスクの容量が大きくなり、そのアドレス変換テーブルはRAIDコントローラが使えるメモリの容量を越えてしまいかねなかった。
【０００５】
この問題を回避するために、例えば、以下の２つの方法が考えられる。
【０００６】
１．ディスク上にアドレス変換テーブルを置き、現在参照しているエントリを含むアドレス変換テーブルのサブセットのみメモリ上にロードしてアクセスし、変更したアドレス変換テーブルのサブセットは適切なタイミングでディスク上に書き戻す（アドレス変換テーブルのキャッシング）。このようにして、限りあるメモリ容量で大きなアドレス変換テーブルを管理する。
【０００７】
２．使用できるメモリに収まるアドレス変換テーブルのサイズで制御できるだけのサイズに論理ディスクの容量を制限する。
【０００８】
しかし、上記（１）の方法は、アドレス変換テーブルのサイズに制約が無くなり、制御対象の論理ディスクの大きさも制限しなくて良いというメリットがあるものの、一方でディスクからのアドレス変換テーブルのロード・リストアにある程度の時間がかかり、かつ、その処理が終らないとアドレス決定が出来ないため本来のユーザデータのためのI/O処理も止まってしまい、結局ターンアラウンドタイムが大きくなり、性能ダウンの要因になりかねない。
【０００９】
また、上記（２）の方法は、上記のような性能面でのデメリットは無いが、代りに論理ディスクの容量がアドレス変換テーブルを置くためのメモリの大きさによって制限されるというデメリットがある。
【００１０】
本発明は、アドレス変換テーブルに使用するメモリの量を一定にしたまま、制御対象の論理ディスクの容量に制約を課さずにデータの高速書きこみを実現することができるディスクアレイ制御装置を提供することを目的とする。
【００１１】
【課題を解決するための手段】
上記目的を達成するために、本発明の第１の発明は、少なくとも１つの論理ディスクが設けられたディスクアレイを制御するためのディスクアレイ制御装置において、１つの論理ディスクにおけるディスクアレイの物理クラスタの状態を管理する物理クラスタ管理テーブルと、前記１つの論理ディスクにおける論理ブロック番号と、前記論理ブロック番号に対応する物理ブロック番号とを対応付けて記憶するアドレス変換テーブルと、書き込み対象となるデータブロックの物理ブロック番号が前記アドレス変換テーブルに登録されているか否かを判断する手段と、前記書き込み対象となるデータブロックの物理ブロック番号が前記アドレス変換テーブルに登録されていると判断された場合、前記物理クラスタ管理テーブルを参照して、空き物理クラスタを確保することができるか否かを判断する手段と、空き物理クラスタを確保することができると判断された場合、前記空き物理クラスタを確保し、この確保された空き物理クラスタに前記書き込み対象となるデータブロックを書き込む手段とを具備し、アドレス変換を必要とせずにアクセスできるデータブロックのみからなる直接アドレスクラスタと、アドレス変換を必要とするデータブロックを含む間接アドレスクラスタとを混在させて１つの論理ディスクを管理することを特徴とするディスクアレイ制御装置、である。
【００１２】
また、本発明の第２の発明によれば、次の書き出し対象とする空き物理クラスタが存在するか否か、或いは現在書き出し対象の物理クラスタに空き物理ブロックが存在するか否かを判断する手段と、存在すると判断された場合に、書き出し候補の論理ブロックに、要求された論理アドレスとは異なるディスク上の物理アドレスに書き込まれている間接アドレスデータブロックが存在するか否かを判断する手段と、前記書き出し候補の論理ブロックに前記間接アドレスデータブロックが存在すると判断された場合に、最も有効ブロック数の少ない物理クラスタに属している間接アドレスデータブロックを選択する手段とをさらに具備し、前記選択された間接アドレスデータブロックを前記書き込み対象となるデータブロックとして書き込むことを特徴とする請求項１記載のディスクアレイ制御装置、である。
さらに、本発明の第３の発明によれば、１つの論理ディスクにおける物理クラスタの状態を管理する物理クラスタ管理テーブルと、前記１つの論理ディスクにおける論理ブロック番号と、前記論理ブロック番号に対応する物理ブロック番号とを対応付けて記憶するアドレス変換テーブルとを具備する少なくとも１つの論理ディスクが設けられたディスクアレイを制御するためのディスクアレイ制御装置におけるデータ書き込み方法において、書き込み対象となるデータブロックの物理ブロック番号が前記アドレス変換テーブルに登録されているか否かを判断し、前記書き込み対象となるデータブロックの物理ブロック番号が前記アドレス変換テーブルに登録されていると判断された場合、前記物理クラスタ管理テーブルを参照して、空き物理クラスタを確保することができるか否かを判断し、空き物理クラスタを確保することができると判断された場合、前記空き物理クラスタを確保し、この確保された空き物理クラスタに前記書き込み対象となるデータブロックを書き込むステップを具備し、アドレス変換を必要とせずにアクセスできるデータブロックのみからなる直接アドレスクラスタと、アドレス変換を必要とするデータブロックを含む間接アドレスクラスタとを混在させて１つの論理ディスクを管理することを特徴とするディスクアレイ制御装置におけるデータ書き込み方法、である。
【００１３】
【発明の実施の形態】
まず、最初に、本発明の実施の形態において使用する用語の定義について説明する。
【００１４】
１．ストライプセット
RAIDは、複数のディスクから論理ディスクを構成する際にストライピングという方法で論理ディスクのアドレス空間を一定サイズ毎に順番に物理ディスクに割り振る。この一定サイズのデータをストライプと呼ぶ。
【００１５】
特に、RAID5の場合には、データストライプ（複数）とそれに対応するパリティストライプを１つの組にしてパリティデータの生成及びいずれか１つのディスク障害時におけるデータ再生処理を行なう。
【００１６】
RAID5の論理ディスク上で、一続きの領域で「RAIDコントローラが管理するストライプサイズのディスク台数倍（データはストライプサイズの「ディスク台数−１」）」のサイズであり、かつ、そこに含まれる各ディスクのストライプサイズのデータは、それに１つだけ含まれるパリティストライプに対応するようなストライプのグループのことを「ストライプセット」と呼ぶ。
【００１７】
例えば、図１３は、３台のディスクから構成されるRAID5の論理ディスクである。S0、S1、・・・はストライプを示しており、例えば64KBなどの固定長である。P0、P1、・・・はパリティブロックであり、ストライプサイズである。この図で「S0、S1、P0」はストライプセットである。同様に「S2、P1、S3」、「P2、S4、S5」はそれぞれストライプセットである。なお、論理ディスク上では、データストライプの並びは、S0、S1、S2、S3、S4、S5、・・・であり、パリティのストライプは上位にはデータとして見えない。
【００１８】
RAID5に対して、このサイズの整数倍のサイズを一括して書きこむことにより、パリティ再計算のためのディスクからのデータリードを省くことができ、大幅な性能向上を実現できる。これがRAIDの高速化方式の一手法である。
【００１９】
２．クラスタ
ディスクの物理アドレス空間をストライプセットのサイズあるいはその整数倍の大きさ毎、かつストライプセット境界で区分した各データ単位。本方式では間接アドレスデータブロックからなるデータを「クラスタ」の単位でディスクに書き出す。ここで、直接アドレスとはアドレス変換を必要とせずにアクセスできることを意味し、間接アドレスとはアドレス変換を行なわなければアクセスすることができないことを意味する。
【００２０】
３．アドレス変換テーブル
RAID高速化方式では、ホスト計算機のファイルシステムがアクセスを要求する時のブロック番号は、論理ブロック番号であり、仮想的なブロック番号である。この論理ブロック番号は，RAIDコントローラが管理する「アドレス変換テーブル」によって物理ブロック番号（論理ディスク上のブロック番号）と対応づけられる。
【００２１】
アドレス変換テーブルは対象の論理ディスク上に有り、データ部と別の領域が割り当てられる。アドレス変換テーブルは、論理ブロック番号に対する物理ブロック番号が登録されたテーブルである。新たにデータブロックを書き込む時には、相当する論理ブロック番号のエントリに、そのブロックを書き込む物理ブロック番号（＝論理ディスク上のブロック番号）を登録する。逆にデータブロックを参照する際には、その論理アドレスに対して登録されている値を求め、その値を論理ディスク上のブロック番号として実際のアドレスを求め参照する。
【００２２】
４．データブロック管理単位（ブロック）
本発明の実施の形態に係るディスクアレイ制御装置のRAIDコントローラがデータを管理する単位であり固定長（例えば4KB）である。必要に応じて、この単位でアドレス変換テーブルにアドレスを登録して管理する。
【００２３】
５．論理ブロック番号
「論理ブロック番号」とは、RAIDコントローラがホスト計算機など上位から受け取るI/Oのアクセス要求アドレスをRAIDコントローラが管理するデータブロック管理単位で換算した番号である。RAID高速化方式では、ホスト計算機から要求される時のブロック番号は、論理ブロック番号であり、仮想的なブロック番号である。この論理ブロック番号はRAIDコントローラが管理する「アドレス変換テーブル」によって物理ブロック番号（論理ディスク上のブロック番号）と対応づけられる。（物理ブロック番号）×（ブロックサイズ[Byte]）より論理ディスク上でのバイトオフセット値（アドレス）が求まる。
【００２４】
６．「直接アドレスデータブロック」
本発明の実施の形態において、アドレス変換テーブルに登録されていないデータブロック管理単位。その論理ブロック番号をもって物理ブロック番号とする。
【００２５】
すなわち、本来要求されたアドレス通りにディスク上に置かれている（書き込まれる）ブロックである。
【００２６】
７．「間接アドレスデータブロック」
本発明の実施の形態において、アドレス変換テーブルに登録されているデータブロック管理単位である。アドレス変換テーブル上のその論理アドレス番号に対応する（登録されている）物理ブロック番号をもってアクセス時のアドレスを求める。
【００２７】
すなわち、本来要求されたアドレスとは異なるディスク上のアドレスに置かれている（書き込まれる）ブロックである。
【００２８】
８．直接アドレスクラスタ
「直接アドレスデータブロック」からなるクラスタである。
【００２９】
９．間接アドレスクラスタ
「間接アドレスデータブロック」からなるクラスタである。
【００３０】
１０．論理クラスタ
上位からのデータアクセスに用いるアドレス空間を「論理アドレス空間」とよぶ。論理アドレス空間を先頭から「クラスタ」単位に区切った時のデータ単位を「論理クラスタ」と呼ぶ。
【００３１】
１１．物理クラスタ
論理クラスタに対して論理ディスク上のクラスタを「物理クラスタ」と呼ぶ。以降で「クラスタ」とある場合には「物理クラスタ」を指すことにする。
【００３２】
１２．リパック処理
ログ化クラスタの有効なブロックを集めることにより、空き物理クラスタを生成する処理をいう。
【００３３】
本発明の実施の形態に係るディスクアレイ制御装置は、以下の条件を前提としている。
【００３４】
１．ＲＡＩＤコントローラの制御アルゴリズムに関し、ＲＡＩＤコントローラのファームウェアとして実装される。
【００３５】
２．ＲＡＩＤで論理ディスクが構成されている。
【００３６】
３． RAIDコントローラに対するライトデータに関して、高速化モジュールがそのアドレスをアドレス変換テーブルに登録することにより変換し、たとえそれらがアドレスが連続していない一連のライトデータブロックであってもディスク上の連続アドレスとなるようにアドレス変換して書き込む。なお、高速化モジュールとは、本発明の実施の形態をRAIDコントローラのファームウェアの一部としてモジュールの形で実装したものである。この高速化モジュールによりRAIDコントローラの処理性能が向上する。
【００３７】
４．上記ディスク上の連続アドレスとして書き込むサイズは、ディスク上のパリティや元のデータを読みこむ必要がないため、ディスクに効率よく書き込める「ストライプセット（あるいはその整数倍）」である。
【００３８】
以下、図面を参照して、本発明の実施の形態に係る計算機システムのディスクアレイ制御装置について説明する。
【００３９】
図１は、本発明の実施の形態に係る計算機システムの構成を示す図である。
【００４０】
同図に示すように、ＣＰＵバス１には、ＣＰＵ２及びメインメモリ３が接続されている。
【００４１】
ＣＰＵ２は、計算機システム全体の制御を司るものであり、メインメモリ３を作業領域などに使用する。ＣＰＵバス１は、ＰＣＩ−ＰＣＩブリッジ４を介してＰＣＩバス５に接続されている。
【００４２】
ＰＣＩバス５には、ＲＡＩＤコントローラＰＣＩカード１１が接続されている。このＲＡＩＤコントローラＰＣＩカード１１は、ＰＣＩ−ＰＣＩブリッジ１２、内部ＰＣＩバス１３、キャッシュメモリ１４、ＳＣＳＩコントローラ１５、専用ＣＰＵ１６、ローカルＲＡＭ１７及びフラッシュＲＯＭ１８を具備している。
【００４３】
ＰＣＩ−ＰＣＩブリッジ１２は、ＰＣＩバス５と内部ＰＣＩバス１３とのブリッジ制御を行なう。この内部ＰＣＩバス１３には、キャッシュメモリ１４、ＳＣＳＩコントローラ１５及び専用ＣＰＵ１６が接続される。専用ＣＰＵ１６には、ローカルＲＡＭ１７及びフラッシュＲＯＭ１８が接続されている。
【００４４】
キャッシュメモリ１４は、計算機がディスクに対してアクセスするデータが一時的に置かれ、アクセス速度向上の目的に使われる。あるいは、RAID5の論理ディスクに関してはパリティ計算のための領域としても使用される。キャッシュメモリ１４の制御には、例えば一般的にはセットアソシアティブ等の管理方式が取られている。
【００４５】
SCSIコントローラ１５は、RAIDコントローラＰＣＩカード１１からディスクにアクセスする際のインターフェイスである。専用CPU１６が、SCSIコントローラ１５に指示することによりキャッシュメモリ１４と、各ディスク２１−１〜２１−ｎとの間のデータ転送や、各ディスクに対するSCSIコマンドの送信が行なわれる。
【００４６】
ローカルRAM１７は、専用CPU１６の制御プログラムの処理に必要な作業データ領域になる。フラッシュＲＯＭ１８には、専用CPU１６の制御プログラムが格納されている。ＳＣＳＩコントローラ１５は、ＳＣＳＩバス２０を介して、ディスク２１−１〜２１−ｎに接続されている。
【００４７】
図２は、本発明の実施の形態に係る計算機システムのキャッシュメモリ１４の構成を示す図である。このキャッシュメモリ１４は、セットアソシアティブ方式で管理されるRAIDコントローラのキャッシュメモリである。
【００４８】
同図に示すように、このキャッシュメモリ１４は、セット数はｍ、ウェイ数はｎである。升目は１つのキャッシュブロックを示す。ここでは、４ＫＢや１６ＫＢの大きさと仮定する。
【００４９】
キャッシュ管理テーブルは、キャッシュメモリ同様（ｍ、ｎ）の２次元配列になっている。そのエントリには（有効ビット３１、更新ビット３２、その他ビット３３、タグ３４）が入っている。有効ビット３１は、キャッシュメモリの有効／無効（有効なデータが入っているか否か）が１ビットで示されており、更新ビット３２には、キャッシュメモリ内データが更新されているか否かが１ビットで示されており、更にその他の管理に用いるビット33があり（説明省略）、タグ３４にはデータブロック番号がセットされている。
【００５０】
リード／ライトデータをキャッシュメモリに入れる際には、アドレスから以下の式によりセット番号を求める。
【００５１】
（セット番号）＝（ブロック番号）mod（セット数ｍ）
セット番号を決定した後、キャッシュ管理テーブルを参照し、そのセットの空きエントリの有無を確認する。空きエントリがあれば、そのエントリに相当するキャッシュブロックを使う。空きエントリがなければ、同じ「列」（セット）に属するキャッシュブロックを空けてそこを確保するか、あるいはキャッシュを使わずに処理を行う。
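The set selection described above can be sketched as follows. This is a minimal illustration, assuming a cache management table held as an m x n list of entries; the names `Entry`, `set_number`, and `allocate` are hypothetical and do not appear in the patent, and the eviction path (freeing a block in the same set) is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    valid: bool = False  # valid bit 31
    dirty: bool = False  # update bit 32
    tag: int = -1        # tag 34: data block number


def set_number(block_number: int, num_sets: int) -> int:
    """(set number) = (block number) mod (number of sets m)."""
    return block_number % num_sets


def allocate(table: list, block_number: int) -> Optional[tuple]:
    """Claim a free way in the block's set; return (set, way) or None if the set is full."""
    s = set_number(block_number, len(table))
    for way, entry in enumerate(table[s]):
        if not entry.valid:
            entry.valid = True
            entry.tag = block_number
            return s, way
    # No free entry: the caller would evict a block from this set or bypass the cache.
    return None
```

With m = 4 sets and n = 2 ways, blocks 9 and 13 both map to set 1 and fill it, so a third block that maps to the same set gets `None` back.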
【００５２】
図３は、RAID高速化方式におけるディスクの論理ディスクアドレス空間のレイアウトを示す図であり、（ａ）は物理ディスク上のストライプ配置を示す図であり、（ｂ）は論理ディスク上のストライプ配置とレイアウトを示す図である。
【００５３】
図３において、ｄ０、ｄ１、・・・はデータを含むストライプを示す。ストライプは、例えば６４ＫＢ程度のサイズである。また、ｐ０、ｐ１、ｐ２、・・・は、それぞれ（ｄ０、ｄ１）、（ｄ２、ｄ３）、（ｄ４、ｄ５）のパリティストライプである。
【００５４】
ここで、（ｄ０、ｄ１、ｐ０）、（ｄ２、ｐ１、ｄ３）、（ｐ２、ｄ４、ｄ５）はそれぞれ「ストライプセット」である。
【００５５】
この例では、２ストライプセットを１クラスタとする。よって、物理クラスタ０は、ｄ０、ｄ１、（ｐ０、）ｄ２、（ｐ１、）ｄ３、物理クラスタ１は、（ｐ２、）ｄ４、ｄ５、ｄ６、ｄ７（、ｐ３）となる。
【００５６】
論理ディスク全体は、「直接アドレス領域」、「マージン領域」、「管理データ領域」の３つの領域に分かれている。
【００５７】
「直接アドレス領域」は、上位（ホスト計算機）に見せるディスクとして使用可能なサイズ分だけ確保する。例えば、論理ディスクとして１００ＧＢの容量を使用可能とする場合には「直接アドレス領域」のサイズは１００ＧＢとなる。
【００５８】
「マージン領域」は、複数の物理クラスタで構成される。本方式のログ化書き込みを繰り返すことによって、書き込み直後には有効なデータブロックが１００％詰まっていた状態の物理クラスタは、引き続き繰り返される間接アドレス物理クラスタの書き込みにより、論理データブロックが新しい物理ブロック番号の位置に移動する結果として、間接アドレス物理クラスタの有効なブロックの割合はどんどん下がっていく。この結果、直接アドレス領域の物理クラスタだけでは、全ての論理ブロックを収めることができなくなる。予め大目に物理クラスタをマージン領域として用意することにより、ログ化クラスタを書き込んでも空の物理クラスタが存在するようになり、間接アドレス物理クラスタを書き込むことができるようになる。
【００５９】
ただし、いずれ間接アドレス物理クラスタの数が増えて、空クラスタが無くなる状態に至る。この状態を避けるためにも、RAIDコントローラの負荷が低いときなどに有効なデータを含む物理ブロックを集めて、空き物理クラスタを生成する処理を行なう（リパック処理）。
【００６０】
「管理データ領域」は、この論理ディスク全体を管理するためのデータ構造を置くための領域である。この領域には、アドレス変換テーブル、その他管理情報が含まれる。アドレス変換テーブルは、システム起動時に、RAIDコントローラのキャッシュメモリの一部にロードされ、システム停止時にはディスクに書き戻される。他の管理情報に関しても必要に応じて同様にロード／セーブされる。
【００６１】
図４は、図３に相当する「論理アドレス空間」のレイアウトを示す図である。
【００６２】
論理アドレス空間は、「ＲＡＩＤコントローラの外から見える論理ディスクのアドレス空間」である。例えば、ホスト計算機は「論理アドレス空間」に対してファイルシステムを構築してデータをアクセスする。あくまでも１つのディスクとして扱う。
【００６３】
同図において、「論理クラスタ」を構成するデータブロックが全て「直接アドレスデータブロック」の場合には、その「論理クラスタ番号」はそれらデータブロックが置かれる「物理クラスタ」の番号と一致する。このようなクラスタを「直接アドレス物理クラスタ」、「直接アドレス論理クラスタ」、あるいは単に「直接アドレスクラスタ」と呼ぶ。
【００６４】
図５は、アドレス変換テーブルを示す図である。
【００６５】
論理クラスタが「間接アドレスクラスタ」の場合には、それを構成するデータブロック管理単位は「論理ブロック番号」と異なる「物理ブロック番号」に相当するディスク上のアドレスに割り当てられている。
【００６６】
この「論理ブロック番号」と「物理ブロック番号」との対応を記録管理しているのが「アドレス変換テーブル」である。従来のRAID高速化方式では、全ての論理ブロックに対するエントリがアドレス変換テーブルに用意されていた。対して本発明の実施の形態では、アドレス変換テーブルに使えるメモリ量が制限されており全ての論理ブロックがアドレス変換テーブルに登録される訳でなく、一部のみが登録できるという前提で考える。登録できない論理ブロックに関しては、その論理ブロック番号がすなわち物理ブロック番号を示すものとして扱う。
【００６７】
よって、「間接アドレスクラスタ」に属する「データブロック管理単位」のみがアドレス変換テーブルに登録されている。
【００６８】
アドレス変換テーブルは「間接アドレス論理クラスタ」単位で登録される。逆に、「直接アドレス論理クラスタ」に属するデータブロック管理単位はアドレス変換テーブルに登録されていない。
【００６９】
同図においては、１論理クラスタは８データブロックで構成されるものと仮定しており、「論理クラスタ番号」、「論理ブロック番号」、「物理ブロック番号」の３つ組のテーブルエントリになっている。しかし、本来「論理ブロック番号」の項目は不要である。代わりに、以下のように求めることができるからである（アドレス変換テーブルが使用するメモリ量を節約するためには「論理ブロック番号」を省くほうが良い）。
【００７０】
このテーブルの主な使用目的は、特定の「論理ブロック番号」がアドレス変換テーブルに登録されているか否かの判別と、登録されていた場合に、その論理ブロック番号に割り当てられている「物理ブロック番号」の値を知ることである。
【００７１】
今、「論理ブロック番号」LBｎがあるとき、その論理クラスタ番号は、
「論理クラスタ番号」
A ＝ (LBｎ) ／「論理クラスタあたりのデータブロック数」
（ただし、「／」は整数の割り算を示す）
で求められ、「論理クラスタ内のブロックオフセット番号」は
「論理クラスタ内ブロックオフセット番号」
B ＝ (LBｎ) mod 「論理クラスタあたりのデータブロック数」
（ただし、「mod」はmodulo演算を示す）
となる。よって、アドレス変換テーブルを参照し論理クラスタ番号がAで、そのクラスタにB番目に登録されている物理ブロック番号を求めればよい。
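The two formulas above (A by integer division, B by modulo) can be sketched as a table lookup. This is an illustration only: it assumes the address conversion table is held as a dictionary keyed by logical cluster number, and the names `split` and `lookup` are hypothetical.

```python
from typing import Optional

BLOCKS_PER_CLUSTER = 8  # the figure assumes 8 data blocks per logical cluster


def split(lb: int) -> tuple:
    """Return (A, B): logical cluster number and block offset within the cluster."""
    return lb // BLOCKS_PER_CLUSTER, lb % BLOCKS_PER_CLUSTER


def lookup(amt: dict, lb: int) -> Optional[int]:
    """Return the registered physical block number, or None if the cluster is not registered."""
    a, b = split(lb)
    entry = amt.get(a)
    return None if entry is None else entry[b]
```

Because the offset B selects the position inside the cluster's entry, the table does not need to store the logical block number itself, which is the memory saving the text points out.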
【００７２】
次に、本発明の実施の形態に係るディスクアレイ制御装置のライト動作の基本的な処理の流れについて、図６のフロチャートを参照して説明する。
【００７３】
図６は、１つのデータブロック管理単位LBi をディスクに書き出す時の処理の流れを示している。
【００７４】
図６の説明のために、具体的に、図７（ａ）〜図７（ｃ）のような状況を考える。以降、説明のために、ブロック数、テーブルのエントリ数などは実際のシステムより遥かに少ない数で示してある。
【００７５】
図７（ａ）は、論理ディスク上の物理クラスタの様子を示している。ここで、LB0、LB1、・・・は、それぞれ論理ブロックを示しており、論理ブロック番号を添え字で表している。
【００７６】
この例では、物理クラスタは４つのデータブロックで構成される。例えば、物理クラスタ０はLB0〜LB3の４つのデータブロックで構成されている。
【００７７】
この例では、論理ディスク上のデータ領域には１６個の物理クラスタが存在する。ただし、論理ディスクの「論理アドレス空間」の広さは物理クラスタ１３個分であり、この数だけ「論理クラスタ」が存在する。
【００７８】
なお、物理クラスタ０〜１２の範囲には直接アドレスデータブロックが存在しうる。この領域を「直接アドレス領域」と呼ぶ。物理クラスタ１３〜１５は「マージン領域」である。この部分には「間接アドレスデータブロック」のみ配置できる。
【００７９】
図７（ａ）においては、全ての論理ブロックは、その番号に等しい物理ブロック番号の位置に配置されている。すなわち、全ての論理クラスタは「直接アドレスクラスタ」である。また、マージン領域には有効なデータブロックは配置されていない。
【００８０】
図７（ｂ）は、図７（ａ）に示される全ての物理クラスタの状態を管理するテーブルであり、RAIDコントローラのキャッシュメモリに置かれる。インデックスは物理クラスタ番号である。とりうる値は、
直接アドレスクラスタ・・・クラスタ内の全てのブロックが直接アドレスデータブロックである
間接アドレスクラスタ・・・クラスタ内に１つ以上の間接アドレスデータブロックが含まれる
空クラスタ・・・クラスタ内に１つも有効なブロックが含まれない
の３つである。
【００８１】
ここでは、物理クラスタ０〜１２は「直接アドレスクラスタ」、それ以外は「空クラスタ」になっている。
【００８２】
図７（ｃ）は、アドレス変換テーブルであり、RAIDコントローラのキャッシュメモリに置かれる。間接アドレスの論理クラスタの番号と、それを構成する論理ブロックが配置されている物理ブロック番号が４つ分登録される。この状態で、アドレス変換テーブルには何も登録されていない（間接アドレス論理クラスタが存在しないため）。なお、この例では、４つの論理クラスタ分のブロックを登録するだけの大きさのアドレス変換テーブルが用意されているものとする。
【００８３】
ここで、LB9、LB6、LB8、LB0、LB17、LB15、LB2、LB7、LB6…の順番で論理ブロックをディスクに書き出す際の様子を説明する。
【００８４】
図６において、最初LBiは「LB9」である。ステップ６０１では、LB9はAMT（アドレス変換テーブル）に登録されていないので、ステップ６０２に進む。
【００８５】
ステップ６０２で、LB9が属する「直接アドレス論理クラスタ」２を登録するのに必要なエントリがアドレス変換テーブルに存在するので、ステップ６０３に進み、「論理クラスタ」２（の全てのブロック）をアドレス変換テーブルに登録する。すなわち、アドレス変換テーブルの最初の「論理クラスタ番号」に「２」、その物理ブロック番号に「８、９、１０、１１」を登録する（ステップ６０３）。
【００８６】
ステップ６０６に進み、現在書き込み中の物理クラスタＰＣが存在しないので、ステップ６０７に進み、空き物理クラスタを探す。ここで、物理クラスタＰＣとは、ステップ６０９において確保されるクラスタであり、最初の書き込みの段階では存在しない。
【００８７】
図７（ｂ）に示された物理クラスタ管理テーブルより、物理クラスタ１３が「空クラスタ」であることを知り、それを物理クラスタPCとして選択するとともに、物理クラスタ管理テーブルの物理クラスタ１３に「間接アドレスクラスタ」と記す（ステップ６０８、６０９）。
【００８８】
ステップ６１２で物理クラスタ１３の先頭ブロック（物理ブロック番号５２）を、LB9に割り当てて、ステップ６１３でアドレス変換テーブル（ＡＭＴ）の論理ブロック番号９のエントリに「５２」を登録する。なお、図６において、ＰＢnextは、書き込みのためのポインタを意味する。
【００８９】
ステップ６１４で、物理クラスタ１３には、まだ空のブロックが３つ残っているので、ステップ６１６に進み、まだ実際の書き出しは行なわない。例えば、図８（ｄ）に示すように、１クラスタ分のライトバッファをメモリ上に用意し、ここに書き込みが決まった論理ブロックのデータを置いておき、ステップ６１４で物理クラスタに書き込む全てのブロックが決定した時点で、ステップ６１５で一括して物理クラスタに書き出すようにする。
【００９０】
ここまで処理が進んだ状態を、図８に示す。図８（ａ）は、物理クラスタの図で、初めLB9が置かれていた物理ブロック番号９の位置には、既にLB9の最新のデータはない。
【００９１】
新たに書き込まれたLB9のデータは、物理クラスタ１３の先頭ブロックに書かれることが予約されており、実際のデータは、図８（ｄ）に示すように、ライトバッファに収められている。図８（ｂ）に示した物理クラスタ管理テーブル上では、物理クラスタ２、１３が「間接アドレスクラスタ」と記され、図８（ｃ）に示すアドレス変換テーブルには、論理クラスタ２が登録され、特に書き込み対象のLB9の物理ブロック番号は物理クラスタ１３の先頭ブロックの番号である５２が対応付けられる。
【００９２】
同様にして、図６のフローチャートに従って、上記順番（LB9、LB6、LB8、LB0、LB17、LB15、LB2、LB7、LB6…）で論理ブロックを処理した時点の様子を図９に示す。図９（ａ）において、×がついているところは、書き込みによって無効（古くなった）データを示す。LB15は、ステップ６０２でアドレス変換テーブルに空きが無いために現在の物理ブロック番号の位置に重ね書きされていることを示している。
【００９３】
なお、このように現在の論理データブロックが置かれている物理ブロック番号の位置にそのまま新しい書き込みを行うことを本発明の実施の形態では「オーバライト」と呼んでいる。
【００９４】
また、上記順番（LB9、LB6、LB8、LB0、LB17、LB15、LB2、LB7、LB6…）のような書き込みは、ランダムライトと言われる書き込みパターンである。すなわち、論理アドレスがばらばらの書き込み順序になっており、通常、ディスクにこの順番に書き込むとシークが発生するなどして効率が悪い。
【００９５】
図９（ａ）の時点で、ディスクに書き込まれた部分は、物理クラスタ１３、１４である。それぞれストライプセット（の倍数）の単位で一括して書き込むので、パリティ計算のためのリードが発生せず、かつディスク上でシークする範囲が狭いので、非常に効率よく、高速に、書き込みが行える。
【００９６】
本実施の形態においては、間接アドレスクラスタとして配置する論理データブロックの総数は、前記アドレス変換テーブルのエントリの総数の範囲内に制限される。
【００９７】
すなわち、固定サイズのアドレス変換テーブルに登録できる範囲内で、間接アドレスクラスタとして一括書き込みが可能なため、従来のようにアドレス変換テーブルのサイズで、対象の論理ディスクの容量を制限する必要がない。
【００９８】
なお、ステップ６０８で空き物理クラスタを確保できなかった場合には、LBiをオーバライトで、現在の（それまでの）その論理データブロックが置かれている物理ブロックのアドレスに書き出す（ステップ６１０）。また、Ｓ６０２において、アドレス変換テーブルに十分なエントリが存在しないと判断された場合には、書き込み対象となる論理ブロックＬＢｉを直接アドレスデータブロックとして書き込む（ステップ６０４）。
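The walk-through of FIG. 6 above can be reproduced with a small simulation. This is a simplified sketch under stated assumptions, not the controller firmware: it tracks only address assignment (no data, no write buffer contents, no valid-block counts), and the class `Controller`, its fields, and `write` are hypothetical names. The sizes (4 blocks per cluster, 13 logical and 16 physical clusters, room for 4 logical clusters in the table) are the ones used in the FIG. 7 example.

```python
BLOCKS_PER_CLUSTER = 4
EMPTY, DIRECT, INDIRECT = "empty", "direct", "indirect"


class Controller:
    def __init__(self, logical_clusters=13, physical_clusters=16, amt_capacity=4):
        # Physical cluster management table, indexed by physical cluster number.
        self.state = [DIRECT] * logical_clusters + [EMPTY] * (physical_clusters - logical_clusters)
        self.amt = {}            # logical cluster number -> list of physical block numbers
        self.amt_capacity = amt_capacity
        self.pc = None           # physical cluster currently being filled (PC)
        self.pb_next = None      # write pointer (PBnext)
        self.flushed = []        # physical clusters written out in one batch

    def write(self, lb: int) -> int:
        """Decide where logical block lb goes and return its physical block number."""
        cluster, offset = divmod(lb, BLOCKS_PER_CLUSTER)
        if cluster not in self.amt:                      # step 601
            if len(self.amt) >= self.amt_capacity:       # step 602
                return lb                                # step 604: written as a direct block
            first = cluster * BLOCKS_PER_CLUSTER         # step 603: register the whole cluster
            self.amt[cluster] = list(range(first, first + BLOCKS_PER_CLUSTER))
            self.state[cluster] = INDIRECT
        if self.pc is None:                              # step 606
            if EMPTY not in self.state:                  # steps 607, 608
                return self.amt[cluster][offset]         # step 610: overwrite in place
            self.pc = self.state.index(EMPTY)            # step 609
            self.state[self.pc] = INDIRECT
            self.pb_next = self.pc * BLOCKS_PER_CLUSTER
        pb = self.pb_next                                # step 612
        self.amt[cluster][offset] = pb                   # step 613
        self.pb_next += 1
        if self.pb_next % BLOCKS_PER_CLUSTER == 0:       # step 614: cluster is full
            self.flushed.append(self.pc)                 # step 615: batch write
            self.pc = None
        return pb
```

Feeding the sequence LB9, LB6, LB8, LB0, LB17, LB15, LB2, LB7, LB6 into `write` places the blocks at physical blocks 52 to 59 in physical clusters 13 and 14, except LB15, which stays at physical block 15 because the table is already full when it arrives.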
【００９９】
なお、ステップ６１２、６１３では、まだディスク上の物理ブロック番号のアドレスにデータは書き込まれておらず、ライトバッファ上におかれたままとなっている。従って、これらデータをリードしたいときには、ディスクに書き込まれるまでの間はライトバッファを参照する必要がある。
【０１００】
ライトバッファに関しては、この例では、１物理クラスタ分のメモリを確保して、ライトデータを置くようにしたが、書き出すデータがキャッシュメモリ上に存在するときには、そのキャッシュメモリへのポインタ列として管理しても良い。このような実装により、データコピーのオーバヘッドをなくすことができる。
【０１０１】
図１０は、本発明の実施の形態に係るディスクアレイ制御装置の書き出し処理全体の動作を説明するためのフロチャートである。
【０１０２】
この処理は、例えば、RAIDコントローラのDirtyキャッシュのディスクへの書き出し処理などに適用するロジックである。具体的には、定期的にRAIDコントローラのキャッシュからDirtyデータをディスクに書き出す処理を考える。キャッシュ上のDirtyデータの何れかを書き出さなければならない状況であり、書き出す必要がなくなるまで（例えば十分空きのキャッシュブロックが生じるまで）繰り返し実行されるものとする。なお、基本的な１データブロックの書き出しには、図６の処理を用いる。
【０１０３】
先ず、空き物理クラスタが存在しない、かつ、現在書き出し対象の物理クラスタに空き物理ブロックが存在しない場合を考える。
【０１０４】
この場合、以下のように考える。
【０１０５】
「空きクラスタ」がない、及び「現在書き出し対象の物理クラスタが存在しない」場合には、ログ化クラスタとしての書き出しはできない。このため、何れの論理ブロックを書き出し対象に選択しても、現在のその論理ブロックに割り当てられた物理ブロックに書き出す（オーバライト）しか選択肢は無い。
【０１０６】
しかしながら、いずれ空き物理クラスタが生じる可能性がある。例えば、リパック処理が同時に進んでいる可能性があり、あるいはやがて負荷が下がりリパック処理が優先的に動き出して空き物理クラスタが生成されるかもしれない。
【０１０７】
空き物理クラスタが生じた時点では、間接アドレスデータブロックを優先的にログ化クラスタの形式で書き出すのが有利である。間接アドレスデータブロックをログ化クラスタで書き出しても「アドレス変換テーブル」のエントリを新たに消費することは無いからである（既に登録されているため）。よって、直接アドレスデータブロックが書き出し候補として存在するなら、直接アドレスデータブロックを優先的にオーバライトで書き出すべきである。
【０１０８】
上記の考えに基づいて、Ｓ８０１において、空き物理クラスタが存在する或いは現在書き出し対象の物理クラスタに空き物理ブロックが存在するか否かの判断が行なわれ、存在しないと判断された場合には、Ｓ８０９において、オーバライト書き出しを試みる。
【０１０９】
図１１は、オーバライト書き込み処理の動作を説明するためのフロチャートである。
【０１１０】
同図に示すように、まず、ステップＳ９０１において、書き込み候補の直接アドレスデータブロックが存在するか否かの判断が行なわれる（Ｓ９０１）。Ｓ９０１において、直接アドレスデータブロックが存在すると判断された場合には、書き込み対象の直接アドレスデータブロックに関して、アドレスが連続する直接アドレスデータブロックの列のうち、最長のものを選択する（Ｓ９０２）。そして、選択した直接アドレスデータブロックの列をオーバライトで書きだす（Ｓ９０３）。Ｓ９０２において、最長の直接アドレスのブロックの列を選択するのは、オーバライトであっても大きなサイズで書き出すほうが効率が良いためである。また、この時、ストライプセットの境界かつストライプサイズで書き出すと最も効率がよい。
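Step S902 above selects the longest run of address-contiguous direct address data blocks. One way to sketch that selection, assuming the write candidates are given as a list of block numbers (the function name `longest_run` is hypothetical):

```python
def longest_run(block_numbers: list) -> list:
    """Return the longest run of consecutive block numbers (the lowest run wins a tie)."""
    best, current = [], []
    for b in sorted(set(block_numbers)):
        if current and b == current[-1] + 1:
            current.append(b)
        else:
            current = [b]
        if len(current) > len(best):
            best = current[:]
    return best
```

The sketch does not align the run to stripe set boundaries; the text notes that writing on a stripe set boundary is the most efficient case.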
【０１１１】
このように、間接アドレスデータブロックの書き出しを遅らせることにより、空き物理クラスタが生じた時点で、アドレス変換テーブルのエントリ消費を伴わずにログ化ストライプ形式で高速にデータを書き出すことが可能になる。
【０１１２】
一方で、ステップ９０１で書き出し候補の直接アドレスデータブロックが存在しない場合を考える。
【０１１３】
この場合には、以下のように考える。
【０１１４】
１．書き出し対象には間接アドレスデータブロックを選択するしかない。
【０１１５】
２．書き出し処理にはある程度の時間を要する。
【０１１６】
３．書き出し処理の間、書き出し処理対象の物理ストライプはリパック処理の対象にはなれない。
【０１１７】
４．リパック処理の対象にしたい物理クラスタに属する間接アドレスデータブロックをここで書き出し対象に選ぶと、リパック処理の効率が悪化する。
【０１１８】
５．現在は空き物理クラスタが少ないので、リパック処理の効率は悪くしたくない。
【０１１９】
６．リパック対象にしたい物理クラスタとは、有効なデータブロックが少ない物理クラスタである（リパック処理では、有効ブロック数の少ない物理ストライプ同士を融合して空き物理ストライプを作るほうが効率がよい）。
【０１２０】
上記事情を考慮して、ここでは間接アドレスデータブロックのうち、最も有効ブロックを多く含んでいる物理ストライプに属するブロックを選択する。
【０１２１】
すなわち、Ｓ９０１において、直接アドレスデータブロックが存在しないと判断された場合には、候補の間接アドレスデータブロックのうち、最も有効ブロック数が多い物理クラスタに属する間接アドレスデータブロックを選択する（Ｓ９０４）。選択した間接アドレスデータブロックをオーバライトで書きだす（Ｓ９０５）。
【０１２２】
このように処理を進めることにより、リパック対象に有効ブロックの少ない物理クラスタを選択した際に、その物理クラスタに対する書き込みが行われる機会が少なくなるため、処理を続けることができリパックの効率が良くなる。
【０１２３】
図１０のＳ８０１において、空き物理クラスタが存在する、あるいは、現在書き出し対象の物理クラスタに空き物理ブロックが存在する場合を考える。
【０１２４】
この場合、以下のように考える。
【０１２５】
１．ログ化クラスタとしての書き出し方法を選択できる可能性がある。
【０１２６】
２．特に間接アドレスデータブロックは新たにアドレス変換テーブルのエントリを確保する必要がないので、ログ化クラスタとしての書き出し方法が可能である。
【０１２７】
３．特に直接アドレスデータブロックは新たにアドレス変換テーブルのエントリを確保できれば、ログ化クラスタとしての書き出し方法が可能である。
【０１２８】
４．アドレス変換テーブルのエントリは有限の資源であるので、なるべく使うべきではない。
【０１２９】
上記事情を考慮して、間接アドレスデータブロックを優先してログ化クラスタの書き出し候補とすべきである。この結果、アドレス変換テーブルのエントリの消費が多くならないように処理を進めることができる。
【０１３０】
更に、有効ブロック数の最も少ない物理クラスタに属する間接アドレスデータブロックを書き出すことにより、そのブロックが他の物理クラスタに移り、有効ブロック数の少ない物理ストライプの有効ブロック数を更に減らすことができる。このことはリパックの効率を向上する意味がある。
【０１３１】
よって、有効ブロック数の最も少ない物理クラスタに属する間接アドレスデータブロックを書き出す候補とすべきである。この結果、有効ブロック数の少ない物理クラスタが増え、リパックの効率を良くすることができる。
【０１３２】
すなわち、Ｓ８０１において、空き物理クラスタが存在する、あるいは、現在書き出し対象の物理クラスタに空き物理ブロックが存在する場合、書き出し候補のブロックに間接アドレスデータブロックがあるか否かの判断が行なわれる（Ｓ８０２）。
【０１３３】
Ｓ８０２において、書き出し候補のブロックに間接アドレスデータブロックがあると判断された場合、最も有効ブロック数の少ない物理クラスタに属している間接アドレスデータブロックを選択する（Ｓ８１０）。そして、図６のライト処理により、間接アドレスデータブロックをログ化ストライプに書き出す（Ｓ８１１）。
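The two selection rules for indirect address data blocks (S810 here, S904 in the overwrite path of FIG. 11) differ only in direction. A minimal sketch, assuming the caller knows which physical cluster currently holds each candidate and how many valid blocks each cluster has; the function and parameter names are hypothetical:

```python
def pick_indirect_block(candidates: dict, valid_blocks: dict, log_write: bool) -> int:
    """
    candidates:   logical block number -> physical cluster that currently holds it
    valid_blocks: physical cluster number -> number of valid blocks in it
    log_write:    True for the log-structured path (S810), False for overwrite (S904)
    """
    def count(lb: int) -> int:
        return valid_blocks[candidates[lb]]

    if log_write:
        # S810: take the block from the emptiest cluster so that cluster drains further.
        return min(candidates, key=count)
    # S904: take the block from the fullest cluster, leaving repack candidates undisturbed.
    return max(candidates, key=count)
```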
【０１３４】
次に、書き出し候補の間接アドレスデータブロックが存在しない場合を考える。
【０１３５】
この場合、下記のように考える。
【０１３６】
１．書き出し対象のブロックは直接アドレスデータブロックである。
【０１３７】
２．直接アドレスデータブロックを書き出すときにログ化クラスタ形式で書き出す場合は、要する時間は短くて済むが、反面、アドレス変換テーブルのエントリを消費することになる。
【０１３８】
３．更に、直接アドレスデータブロックが間接アドレスデータブロックになることにより、例え論理ブロック番号が連続する一連のブロックでもディスク上に間接アドレスとして不連続に配置されてしまう結果、その後のシーケンシャルアクセス性能が悪くなる、というデメリットがある。
【０１３９】
４．よって、直接アドレスデータブロックは、効率よくオーバライトで書き出せる場合には、なるべくオーバライトで書き出したほうが効率がよい。
【０１４０】
５．特に、ブロック番号（＝アドレス）が連続する一連の直接アドレスデータブロックが書き出し対象として存在する場合には、比較的効率よくオーバライトで書き出すことができる。
【０１４１】
よって、一定個数以上、連続する直接アドレスデータブロックが存在する場合には、オーバライトで直接アドレスデータブロックのまま一括して書き出すべきである。
【０１４２】
すなわち、Ｓ８０２において書き出し候補のブロックに間接アドレスデータブロックが存在しないと判断された場合、書き出し対象の直接アドレスデータブロックに関して、アドレスが連続する直接アドレスデータブロックの列のうち最長のものを選択してＢＬとし、列に含まれるブロック個数をＮとする（Ｓ８０３）。
【０１４３】
次に、Ｓ８０４において、ブロック個数Ｎが所定の値である「Ａ」よりも大きいか否かの判断が行なわれる。ここで、「Ａ」はブロック長の基準値として設定されるものであり、この値には実測により、比較的高速にオーバライトで書き込めるブロック長を指定する。あるいは、動的パラメータとしてシステム全体の状態から値を決めるという実装も可能である。
【０１４４】
Ｓ８０４において、ブロック個数Ｎが所定の値である「Ａ」よりも大きいと判断された場合には、選択されたＢＬを現在のアドレスに一括して書き込む（Ｓ８１３）。なお、Ｓ８１３において、ストライプセットの境界かつストライプサイズで書き出してもよく、この場合、更に効率がよくなる。
【０１４５】
このような処理を行なうことにより、アドレス変換テーブルのエントリをむやみに消費することを防止することができる。また、間接アドレスをむやみに増やすことによりシーケンシャル性能が悪化するのを防ぐ効果が得られる。
【０１４６】
次に、Ｓ８０４において、個数A以上の連続した直接アドレスデータブロックが無いと判断された場合について説明する。この場合、効率よい直接アドレスデータブロックのオーバライトは期待できないので、以下のように考える。
【０１４７】
１．書き出すことの緊急性を評価し、なるべく早く書き出す必要がある場合には、高いスループットで書き出せるログ化クラスタでの書き出しを行なう。
【０１４８】
２．「緊急」の度合いとしては、キャッシュのDirty率が考えられる。Dirty率が高いほど、早めの書き出しが必要である。
【０１４９】
３．緊急性が低い場合には、間接アドレスデータブロックが増えるのを避けるため（アドレス変換テーブルエントリの節約、シーケンシャル性能悪化防止のため）、直接アドレスデータブロックのオーバライト書き出しを行う。特に、間接アドレスデータブロックが既に多く存在しているほど、直接アドレスデータブロックはオーバライト書き出しにする。
【０１５０】
上記考察に基づいて、Ｓ８０５において、書き込み候補の直接アドレスデータブロックの１つを選択し、ログ化クラスタに書くか、オーバライトで書くかを決定する。図１２（ａ）は、Ｓ８０５における処理を説明するためのフロチャートである。
【０１５１】
同図に示すように、書き込み対象の直接アドレスデータブロックのうち、最も使用率の高いキャッシュ列（キャッシュセット）に属するものを１つ選ぶ（Ｓ１００１）。次に、書き込み方式を判定関数ｆ（α，β）を使用して決定する。
【０１５２】
図１２（ｂ）は、判定関数ｆ（α，β）を説明するための図である。同図において、αはキャッシュ全体のDirty率、βはアドレス変換テーブル全体の使用率（％）を意味する。
【０１５３】
この例では、キャッシュのDirty率αが９０％以上の時には「ログ化クラスタへの書き出し」と判断する。それ未満の場合には、アドレス変換テーブルの使用率β（使用中エントリの全エントリ数に対する割合％）とαを比較して判定している。Dirty率が高い程、あるいは間接アドレスデータブロックの数が少ない程「ログ化クラスタへの書き出し」と判定するようになっている。状況や実装に応じて判定関数ｆ（）のロジックを変えることにより、実際の状況に即した判定を行なえる。
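A decision function with the properties described above could look like the following. The 90% threshold comes from the text; the exact comparison used below 90% is shown only in FIG. 12(b), so the rule `alpha > beta` here is an assumption chosen to match the stated tendency (a higher Dirty rate or a lower table usage favors log-structured writing).

```python
def f(alpha: float, beta: float) -> str:
    """
    alpha: Dirty rate of the whole cache, in percent
    beta:  usage rate of the whole address conversion table, in percent
    Returns "log" (write to a log-structured cluster) or "overwrite".
    """
    if alpha >= 90.0:
        return "log"
    # Assumed rule for the remaining region: compare alpha against beta.
    return "log" if alpha > beta else "overwrite"
```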
【０１５４】
Ｓ１００２において、オーバライト書き出しではないと判定された場合には、ログ化クラスタへの書き込み指示を決定する（Ｓ１００３、１００５）。一方、Ｓ１００２において、オーバライト書き出しであると判定された場合には、オーバライトでの書き込みを決定する（Ｓ１００３，Ｓ１００４）。
【０１５５】
結果的に、以上の方法により、キャッシュのDirty使用率が低いときには、無闇に間接アドレスデータブロックを増やしてシーケンシャル性能を悪化することが無いように制御することが可能になる。
【０１５６】
その後、Ｓ８０６において、ログ化ストライプに書くことを決定したか否かの判断が行なわれ、決定していないと判断された場合には、選択した直接アドレスデータブロックをオーバライトで書き出す（Ｓ８０８）。一方、決定した場合には、図６のライト処理により、選択した直接アドレスデータブロックを書き出す。この時、可能であればログ化クラスタで書き出す。
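The branching of FIG. 10 and FIG. 11 described in the preceding paragraphs can be summarized as one pure function. This is a sketch of the decision order only; the action labels, the function name `choose_action`, and its parameters are hypothetical, and the comparison with A uses "greater than" as stated for S804.

```python
def choose_action(has_free_space: bool, has_indirect: bool, has_direct: bool,
                  longest_direct_run: int, threshold_a: int, log_decision: bool) -> str:
    """
    has_free_space:     a free physical cluster exists, or the current cluster has a free block (S801)
    has_indirect:       write candidates include an indirect address data block (S802)
    has_direct:         write candidates include a direct address data block (S901)
    longest_direct_run: N, length of the longest contiguous direct run (S803)
    threshold_a:        A, the reference run length (S804)
    log_decision:       result of the S805 decision (True means log-structured write)
    """
    if not has_free_space:                       # S801 -> S809 (FIG. 11)
        if has_direct:
            return "overwrite_direct_run"        # S902, S903
        return "overwrite_indirect_most_valid"   # S904, S905
    if has_indirect:
        return "log_indirect_fewest_valid"       # S810, S811
    if longest_direct_run > threshold_a:
        return "overwrite_direct_run"            # S813
    return "log_direct" if log_decision else "overwrite_direct"  # S806, S808
```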
【０１５７】
したがって、本発明の実施の形態に係るディスクアレイ制御装置によれば、ログ形式のデータブロック管理方式を採用したディスクシステムにおいて、アドレス変換テーブルの容量を制限しつつ、その利用できる範囲内でログ形式のデータブロックを管理することにより対象の論理ディスクの容量制限を回避することができる。
【０１５８】
また、緊急性が高い時にはデータをログ形式で書き込み、緊急性が低いときには非ログ形式で書き込み、ログ化データが増えることを避け、ディスクアレイ制御装置のシーケンシャルアクセス性能悪化を回避することができる。
【０１５９】
なお、本願発明は、上記各実施形態に限定されるものでなく、実施段階ではその要旨を逸脱しない範囲で種々に変形することが可能である。また、各実施形態は可能な限り適宜組み合わせて実施してもよく、その場合組み合わされた効果が得られる。さらに、上記各実施形態には種々の段階の発明が含まれており、開示される複数の構成要件における適宜な組み合わせにより種々の発明が抽出され得る。例えば実施形態に示される全構成要件から幾つかの構成要件が省略されることで発明が抽出された場合には、その抽出された発明を実施する場合には省略部分が周知慣用技術で適宜補われるものである。
【０１６０】
【発明の効果】
以上詳記したように本発明によれば、アドレス変換テーブルに使用するメモリの量を一定にしたまま、制御対象の論理ディスクの容量に制約を課さず、かつ、従来のRAID高速化方式の高速書きこみを実現することができる。
【図面の簡単な説明】
【図１】本発明の実施の形態に係る計算機システムの構成を示す図である。
【図２】本発明の実施の形態に係る計算機システムのキャッシュメモリ１４の構成を示す図である。
【図３】 RAID高速化方式におけるディスクの論理ディスクアドレス空間のレイアウトを示す図であり、（ａ）は物理ディスク上のストライプ配置を示す図であり、（ｂ）は論理ディスク上のストライプ配置とレイアウトを示す図である。
【図４】図３に相当する「論理アドレス空間」のレイアウトを示す図である。
【図５】アドレス変換テーブルを示す図である。
【図６】本発明の実施の形態に係るディスクアレイ制御装置のライト動作の基本的な処理の流れについて説明するためのフロチャートである。
【図７】（ａ）は、論理ディスク上の物理クラスタの様子を示す図であり、（ｂ）は図７（ａ）に示される全ての物理クラスタの状態を管理するテーブルであり、（ｃ）はアドレス変換テーブルを示す図である。
【図８】（ａ）は、論理ディスク上の物理クラスタの様子を示す図であり、（ｂ）は図８（ａ）に示される全ての物理クラスタの状態を管理するテーブルであり、（ｃ）はアドレス変換テーブルを示す図であり、（ｄ）はライトバッファを示す図である。
【図９】（ａ）は、論理ディスク上の物理クラスタの様子を示す図であり、（ｂ）は図９（ａ）に示される全ての物理クラスタの状態を管理するテーブルであり、（ｃ）はアドレス変換テーブルを示す図であり、（ｄ）はライトバッファを示す図である。
【図１０】本発明の実施の形態に係るディスクアレイ制御装置の書き出し処理全体の動作を説明するためのフロチャートである。
【図１１】オーバライト書き込み処理の動作を説明するためのフロチャートである。
【図１２】（ａ）は、Ｓ８０５における処理を説明するためのフロチャートであり、（ｂ）は、判定関数ｆ（α，β）を説明するための図である。
【図１３】３台のディスクから構成されるRAID5の論理ディスクを示す図である。
【符号の説明】
１…ＣＰＵバス、
２…ＣＰＵ、
３…メインメモリ、
４…ＰＣＩ−ＰＣＩブリッジ、
５…ＰＣＩバス、
１１…ＲＡＩＤコントローラＰＣＩカード、
１２…ＰＣＩ−ＰＣＩブリッジ、
１３…内部ＰＣＩバス、
１４…キャッシュメモリ、
１５…ＳＣＳＩコントローラ、
１６…専用ＣＰＵ、
１７…ローカルＲＡＭ、
１８…フラッシュＲＯＭ、
２０…ＳＣＳＩバス、
２１−１〜２１−ｎ…ディスク装置。

[0001]
[Technical Field of the Invention]
The present invention relates to a disk array control device of a computer system and a data writing method in such a disk array control device.
[0002]
[Prior art]
Conventionally, as one method of speeding up RAID (Redundant Array of Independent Disks), log-structured writing (as in a log-structured file system) has been performed in cluster units for all user areas of the logical disk to be controlled.
[0003]
In this method, in particular, multiple random write requests to RAID5 are gathered into one contiguous run and written to the physical disks in a single batch of cluster size. As a result, the reads of pre-write data (the original data and the parity value before the write) that were conventionally required to recalculate parity can be omitted, which realizes efficient write processing. Such techniques are disclosed in, for example, Japanese Patent Application Laid-Open Nos. 11-53235, 2001-184172, and 2002-14776.
[0004]
[Problems to be solved by the invention]
However, in such a RAID acceleration method, address conversion using the address conversion table must be performed for the data block management units of all user areas of the target logical disk. An address conversion table whose size is proportional to the capacity of the logical disk to be controlled is therefore required. When the disks constituting the logical disk are large, or when there are many of them, the capacity of the logical disk grows, and the address conversion table could exceed the memory capacity available to the RAID controller.
[0005]
In order to avoid this problem, for example, the following two methods are conceivable.
[0006]
1. Place the address conversion table on the disk, load into memory and access only the subset of the table that contains the currently referenced entry, and write modified subsets back to the disk at an appropriate time (address conversion table caching). In this way, a large address conversion table is managed with a limited memory capacity.
[0007]
2. Limit the capacity of the logical disk to a size that can be controlled by an address conversion table small enough to fit in the available memory.
[0008]
However, while method (1) has the advantage that the size of the address conversion table is no longer constrained and the size of the logical disk to be controlled need not be limited, loading the address conversion table from the disk and writing it back takes a certain amount of time. Moreover, addresses cannot be determined until that processing completes, so I/O processing for the actual user data also stalls. Turnaround time therefore increases, which can degrade performance.
[0009]
The method (2) does not have the above-mentioned disadvantages in terms of performance, but instead has the disadvantage that the capacity of the logical disk is limited by the size of the memory for storing the address conversion table.
[0010]
An object of the present invention is to provide a disk array control device capable of realizing high-speed data writing without imposing restrictions on the capacity of the logical disk to be controlled, while keeping the amount of memory used for the address conversion table constant.
[0011]
[Means for Solving the Problems]
In order to achieve the above object, a first aspect of the present invention is a disk array control device for controlling a disk array provided with at least one logical disk, comprising: a physical cluster management table that manages the state of the physical clusters of the disk array in one logical disk; an address conversion table that stores logical block numbers in the one logical disk in association with the physical block numbers corresponding to those logical block numbers; means for determining whether the physical block number of a data block to be written is registered in the address conversion table; means for determining, when the physical block number of the data block to be written is determined to be registered in the address conversion table, whether a free physical cluster can be secured by referring to the physical cluster management table; and means for securing the free physical cluster, when it is determined that a free physical cluster can be secured, and writing the data block to be written to the secured free physical cluster, wherein one logical disk is managed as a mixture of direct address clusters, which consist only of data blocks that can be accessed without address conversion, and indirect address clusters, which include data blocks that require address conversion.
[0012]
According to a second aspect of the present invention, the disk array control device according to claim 1 further comprises: means for determining whether a free physical cluster to be written next exists, or whether a free physical block exists in the physical cluster currently being written; means for determining, when one is determined to exist, whether the write candidate logical blocks include an indirect address data block, that is, a block written at a physical address on the disk different from the requested logical address; and means for selecting, when the write candidate logical blocks are determined to include such an indirect address data block, the indirect address data block belonging to the physical cluster with the fewest valid blocks, wherein the selected indirect address data block is written as the data block to be written.
Further, a third aspect of the present invention is a data writing method in a disk array control device for controlling a disk array provided with at least one logical disk, the device having a physical cluster management table that manages the state of the physical clusters in one logical disk and an address conversion table that stores logical block numbers in the one logical disk in association with the physical block numbers corresponding to those logical block numbers, the method comprising the steps of: determining whether the physical block number of a data block to be written is registered in the address conversion table; determining, when the physical block number of the data block to be written is determined to be registered in the address conversion table, whether a free physical cluster can be secured by referring to the physical cluster management table; and securing the free physical cluster, when it is determined that a free physical cluster can be secured, and writing the data block to be written to the secured free physical cluster, wherein one logical disk is managed as a mixture of direct address clusters, which consist only of data blocks that can be accessed without address conversion, and indirect address clusters, which include data blocks that require address conversion.
[0013]
[Embodiments of the Invention]
First, definitions of terms used in the embodiment of the present invention will be described.
[0014]
1. Stripe set
In RAID, when configuring a logical disk from a plurality of disks, the address space of the logical disk is allocated to the physical disks in order for each fixed size by a method called striping. This fixed size data is called a stripe.
[0015]
In particular, in the case of RAID5, data stripes and a corresponding parity stripe are combined into one set to generate parity data and perform data reproduction processing in the event of any one disk failure.
[0016]
On a RAID5 logical disk, a “stripe set” is a group of stripes that occupies one contiguous area whose size is the stripe size managed by the RAID controller multiplied by the number of disks (the data portion is the stripe size multiplied by “number of disks − 1”), and in which the stripe-sized data on each disk corresponds to the single parity stripe contained in the group.
[0017]
For example, FIG. 13 shows a RAID5 logical disk composed of three disks. S0, S1, ... denote stripes, which have a fixed length such as 64 KB. P0, P1, ... are parity blocks, each of stripe size. In this figure, “S0, S1, P0” is a stripe set. Similarly, “S2, P1, S3” and “P2, S4, S5” are each stripe sets. On the logical disk, the data stripes are ordered S0, S1, S2, S3, S4, S5, ..., and the parity stripes are not visible as data to the upper layer.
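The three stripe sets listed for FIG. 13 can be generated with a short sketch. The rotation rule used here (parity starts on the last disk and moves one disk to the left with each stripe set) is inferred from those three examples and is an assumption for other disk counts; the function name `raid5_layout` is hypothetical.

```python
def raid5_layout(num_disks: int, num_stripe_sets: int) -> list:
    """Return one row per stripe set; each row lists what each disk holds."""
    rows = []
    data_index = 0
    for k in range(num_stripe_sets):
        # Assumed rotation: parity on the last disk first, then one disk to the left.
        parity_disk = (num_disks - 1 - k) % num_disks
        row = []
        for disk in range(num_disks):
            if disk == parity_disk:
                row.append(f"P{k}")
            else:
                row.append(f"S{data_index}")
                data_index += 1
        rows.append(row)
    return rows
```

Calling `raid5_layout(3, 3)` yields the rows S0, S1, P0 / S2, P1, S3 / P2, S4, S5, matching the figure as described.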
[0018]
Writing data whose size is an integral multiple of the stripe set size to RAID5 in a single batch makes it possible to omit the data reads from the disk that parity recalculation would otherwise require, which yields a significant performance improvement. This is one technique for speeding up RAID.
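Why a full stripe set write needs no reads can be seen from the parity arithmetic. The sketch below assumes the usual RAID5 XOR parity, which the text does not spell out; the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_stripe_parity(data_stripes: list) -> bytes:
    """Parity for a whole stripe set, computed from the new data alone (no disk reads)."""
    return reduce(xor_bytes, data_stripes)


def small_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Parity update for a partial write: old data and old parity must first be read."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)
```

Both routes give the same parity, but the partial write costs two extra reads per updated stripe, which is the overhead that batching a whole stripe set avoids.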
[0019]
2. Cluster
A cluster is each data unit obtained by dividing the physical address space of the disk, on stripe set boundaries, into pieces whose size is the stripe set size or an integral multiple of it. In this method, data consisting of indirect address data blocks is written to the disk in units of “clusters”. Here, “direct address” means that a block can be accessed without address conversion, and “indirect address” means that a block cannot be accessed unless address conversion is performed.
[0020]
3. Address conversion table
In the RAID acceleration method, the block number with which the file system of the host computer requests access is a logical block number, that is, a virtual block number. This logical block number is associated with a physical block number (block number on the logical disk) by the “address conversion table” managed by the RAID controller.
[0021]
The address conversion table resides on the target logical disk, in an area allocated separately from the data area. It is a table in which the physical block number for each logical block number is registered. When a data block is newly written, the physical block number (= block number on the logical disk) to which the block is written is registered in the entry for the corresponding logical block number. Conversely, when a data block is referenced, the value registered for its logical address is looked up and used as the block number on the logical disk to obtain the actual address.
[0022]
4. Data block management unit (block)
The unit in which the RAID controller of the disk array control apparatus according to the embodiment of the present invention manages data; it has a fixed length (for example, 4 KB). Where necessary, addresses are registered in the address conversion table and managed in this unit.
[0023]
5. Logical block number
The “logical block number” is the number obtained by converting the access request address of an I/O that the RAID controller receives from a host computer or the like into data block management units managed by the RAID controller. In the RAID acceleration method, the block number requested by the host computer is a logical block number, that is, a virtual block number. This logical block number is associated with a physical block number (block number on the logical disk) by the “address conversion table” managed by the RAID controller. The byte offset (address) on the logical disk is obtained as (physical block number) x (block size [bytes]).
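As a small worked example of the two conversions above, the sketch below assumes a 4 KB block (the example size given for the data block management unit) and assumes that the host request address is a byte address, which the text does not state explicitly. The function names are hypothetical.

```python
BLOCK_SIZE = 4 * 1024  # bytes; example size of one data block management unit


def logical_block_number(request_byte_address: int) -> int:
    """Convert a host request address (assumed to be in bytes) into block units."""
    return request_byte_address // BLOCK_SIZE


def byte_offset_on_logical_disk(physical_block_number: int) -> int:
    """(physical block number) x (block size [bytes])."""
    return physical_block_number * BLOCK_SIZE
```

For instance, with these assumptions physical block number 52 corresponds to byte offset 52 x 4096 = 212992 on the logical disk.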
[0024]
6. "Direct address data block"
In the embodiment of the present invention, a data block management unit that is not registered in the address translation table. Its logical block number is used as its physical block number.
[0025]
That is, it is a block that is placed (written) on the disk according to the originally requested address.
[0026]
7. "Indirect address data block"
In the embodiment of the present invention, a data block management unit that is registered in the address conversion table. The address used at access time is obtained from the (registered) physical block number corresponding to the logical block number in the address conversion table.
[0027]
That is, a block that is placed (written) at an address on the disk that is different from the originally requested address.
[0028]
8. Direct address cluster
A cluster composed of “direct address data blocks”.
[0029]
9. Indirect address cluster
A cluster composed of “indirect address data blocks”.
[0030]
10. Logical cluster
An address space used for data access from the upper level is called a “logical address space”. A data unit when a logical address space is divided into “cluster” units from the top is called a “logical cluster”.
[0031]
11. Physical cluster
A cluster on a logical disk with respect to a logical cluster is called a “physical cluster”. In the following description, “cluster” refers to “physical cluster”.
[0032]
12. Repack processing
A process for creating a free physical cluster by collecting valid blocks of a logging cluster.
[0033]
The disk array control apparatus according to the embodiment of the present invention is based on the following conditions.
[0034]
1. The RAID controller control algorithm is implemented as RAID controller firmware.
[0035]
2. A logical disk is configured by RAID.
[0036]
3. Write data sent to the RAID controller is address-converted by the speed-up module, which registers its address in the address conversion table; even a series of write data blocks whose addresses are not contiguous is written so that it occupies contiguous addresses on the disk. The speed-up module is the embodiment of the present invention implemented as a module that forms part of the RAID controller firmware. This speed-up module improves the processing performance of the RAID controller.
[0037]
4. The size written to contiguous addresses on the disk is a “stripe set” (or an integer multiple thereof), which can be written efficiently because the parity and original data on the disk need not be read.
[0038]
Hereinafter, a disk array control apparatus of a computer system according to an embodiment of the present invention will be described with reference to the drawings.
[0039]
FIG. 1 is a diagram showing a configuration of a computer system according to an embodiment of the present invention.
[0040]
As shown in the figure, a CPU 2 and a main memory 3 are connected to the CPU bus 1.
[0041]
The CPU 2 controls the entire computer system, and uses the main memory 3 as a work area. The CPU bus 1 is connected to the PCI bus 5 via the PCI-PCI bridge 4.
[0042]
A RAID controller PCI card 11 is connected to the PCI bus 5. The RAID controller PCI card 11 includes a PCI-PCI bridge 12, an internal bus 13, a cache memory 14, a SCSI controller 15, a dedicated CPU 16, a local RAM 17, and a flash ROM 18.
[0043]
The PCI-PCI bridge 12 performs bridge control between the PCI bus 5 and the internal PCI bus 13. A cache memory 14, a SCSI controller 15, and a dedicated CPU 16 are connected to the internal PCI bus 13. A local RAM 17 and a flash ROM 18 are connected to the dedicated CPU 16.
[0044]
The cache memory 14 temporarily stores data that the computer reads from or writes to the disks, and is used to improve access speed. For a RAID 5 logical disk, it is also used as a work area for parity calculation. The cache memory 14 is generally controlled by a management method such as set associative.
[0045]
The SCSI controller 15 is an interface for accessing a disk from the RAID controller PCI card 11. The dedicated CPU 16 instructs the SCSI controller 15 to transfer data between the cache memory 14 and each of the disks 21-1 to 21-n and transmit a SCSI command to each disk.
[0046]
The local RAM 17 becomes a work data area necessary for processing of the control program of the dedicated CPU 16. The flash ROM 18 stores a control program for the dedicated CPU 16. The SCSI controller 15 is connected to the disks 21-1 to 21-n via the SCSI bus 20.
[0047]
FIG. 2 is a diagram showing a configuration of the cache memory 14 of the computer system according to the embodiment of the present invention. This cache memory 14 is a cache memory of a RAID controller managed by the set associative method.
[0048]
As shown in the figure, the cache memory 14 has m sets and n ways. A cell indicates one cache block. Here, it is assumed that the size is 4 KB or 16 KB.
[0049]
The cache management table is a two-dimensional array of (m, n) like the cache memory. The entry contains (valid bit 31, update bit 32, other bit 33, tag 34). The valid bit 31 indicates whether the cache memory is valid / invalid (whether or not valid data is contained) by 1 bit, and the update bit 32 indicates whether or not the data in the cache memory is updated. In addition, there is a bit 33 used for other management (description omitted), and a data block number is set in the tag 34.
[0050]
When reading / writing data into the cache memory, the set number is obtained from the address according to the following equation.
[0051]
(Set number) = (Block number) mod (Number of sets m)
After the set number is determined, the cache management table is referenced to check whether the set has an empty entry. If it does, the cache block corresponding to that entry is used. If there is no free entry, a cache block belonging to the same set (“column”) is released and used instead, or the request is processed without using the cache.
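The lookup described above can be sketched in Python as follows. This is an illustrative model only: the class and function names are hypothetical, the “other” management bits are omitted, and the replacement policy used when a set is full is left out because the text does not specify it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheEntry:
    valid: bool = False        # valid bit 31: entry holds valid data
    updated: bool = False      # update bit 32: data in the cache was modified
    tag: Optional[int] = None  # tag 34: data block number


def make_table(m: int, n: int) -> List[List[CacheEntry]]:
    """Cache management table with m sets and n ways."""
    return [[CacheEntry() for _ in range(n)] for _ in range(m)]


def set_number(block_number: int, num_sets: int) -> int:
    """(Set number) = (Block number) mod (Number of sets m)."""
    return block_number % num_sets


def find_free_way(table: List[List[CacheEntry]], block_number: int) -> Optional[int]:
    """Return the first empty way in the block's set, or None if the set is full."""
    ways = table[set_number(block_number, len(table))]
    for way, entry in enumerate(ways):
        if not entry.valid:
            return way
    return None
```

When `find_free_way` returns `None`, the caller would either release a block of the same set or bypass the cache, as the paragraph above describes.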
[0052]
FIG. 3 is a diagram showing the layout of the logical disk address space in the RAID acceleration method: (a) shows the stripe arrangement on the physical disks, and (b) shows the stripe arrangement on the logical disk.
[0053]
In FIG. 3, d0, d1, ... indicate stripes containing data. A stripe has a size of, for example, about 64 KB. Further, p0, p1, p2, ... are the parity stripes of (d0, d1), (d2, d3), and (d4, d5), respectively.
[0054]
Here, (d0, d1, p0), (d2, p1, d3), and (p2, d4, d5) are “stripe sets”, respectively.
[0055]
In this example, it is assumed that two stripe sets are one cluster. Therefore, the physical cluster 0 is d0, d1, (p0,) d2, (p1,) d3, and the physical cluster 1 is (p2,) d4, d5, d6, d7 (, p3).
[0056]
The entire logical disk is divided into three areas: a “direct address area”, a “margin area”, and a “management data area”.
[0057]
The “direct address area” is secured by the size that can be used as a disk to be shown to the host (host computer). For example, when a capacity of 100 GB can be used as a logical disk, the size of the “direct address area” is 100 GB.
[0058]
The “margin area” is composed of a plurality of physical clusters. Immediately after it is written, a physical cluster is 100% filled with valid data blocks. As log writing in this method is repeated, however, subsequent indirect address physical cluster writes move logical data blocks to new physical block numbers, so the proportion of valid blocks in an existing indirect address physical cluster steadily decreases. As a result, the physical clusters in the direct address area alone can no longer accommodate all the logical blocks. By preparing physical clusters in advance as a margin area, free physical clusters remain available when a log cluster is written, and indirect address physical clusters can still be written.
[0059]
However, the number of indirect address physical clusters eventually increases until no empty clusters remain. To avoid this state, when the load on the RAID controller is low, physical blocks containing valid data are gathered together to generate free physical clusters (repack processing).
[0060]
The “management data area” is an area for placing a data structure for managing the entire logical disk. This area includes an address conversion table and other management information. The address conversion table is loaded into a part of the cache memory of the RAID controller when the system is started, and written back to the disk when the system is stopped. Other management information is similarly loaded / saved as necessary.
[0061]
FIG. 4 is a diagram showing a layout of a “logical address space” corresponding to FIG.
[0062]
The logical address space is “the address space of the logical disk as seen from outside the RAID controller”. For example, the host computer builds a file system on this “logical address space” and accesses data through it, treating it as a single disk.
[0063]
In the figure, when all the data blocks constituting the “logical cluster” are “direct address data blocks”, the “logical cluster number” matches the number of the “physical cluster” in which the data blocks are placed. Such clusters are called “direct address physical clusters”, “direct address logical clusters”, or simply “direct address clusters”.
[0064]
FIG. 5 is a diagram showing an address conversion table.
[0065]
When the logical cluster is an “indirect address cluster”, the data block management unit constituting the logical cluster is assigned to an address on the disk corresponding to a “physical block number” different from the “logical block number”.
[0066]
The “address conversion table” records and manages the correspondence between the “logical block number” and the “physical block number”. In the conventional RAID acceleration method, entries for all logical blocks are prepared in the address conversion table. On the other hand, in the embodiment of the present invention, the amount of memory that can be used in the address conversion table is limited, and it is assumed that not all logical blocks are registered in the address conversion table, but only a part can be registered. For a logical block that cannot be registered, the logical block number is treated as a physical block number.
[0067]
Therefore, only “data block management units” belonging to “indirect address clusters” are registered in the address conversion table.
[0068]
The address conversion table is registered in units of “indirect address logical clusters”. Conversely, data block management units belonging to the “direct address logical cluster” are not registered in the address conversion table.
[0069]
In the figure, one logical cluster is assumed to be composed of 8 data blocks, and each table entry is a triple of “logical cluster number”, “logical block number”, and “physical block number”. However, the “logical block number” item is not necessary, because it can be derived as described below (omitting it is preferable, since it reduces the memory used by the address translation table).
[0070]
The main purpose of this table is to determine whether a specific “logical block number” is registered in the address translation table and, if it is, to obtain the “physical block number” assigned to that logical block number.
[0071]
Now, given a “logical block number” LBn, the “logical cluster number” A is
A = (LBn) / (number of data blocks per logical cluster)
(where “/” denotes integer division),
and the “block offset number in the logical cluster” B is
B = (LBn) mod (number of data blocks per logical cluster)
(where “mod” denotes the modulo operation).
Therefore, it is only necessary to refer to the address conversion table and obtain the physical block number registered at offset B of the entry for logical cluster number A.
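The lookup can be sketched as follows. This is a minimal illustration that assumes the table is held as a dictionary keyed by logical cluster number, with one list of physical block numbers per registered cluster; that representation and the function names are assumptions, not the patent's data layout.

```python
from typing import Dict, List, Tuple

BLOCKS_PER_CLUSTER = 8  # as assumed for FIG. 5


def split_logical_block(lbn: int) -> Tuple[int, int]:
    """Return (logical cluster number A, block offset B) for logical block LBn."""
    return lbn // BLOCKS_PER_CLUSTER, lbn % BLOCKS_PER_CLUSTER


def to_physical_block(table: Dict[int, List[int]], lbn: int) -> int:
    """Translate a logical block number to a physical block number."""
    cluster, offset = split_logical_block(lbn)
    entry = table.get(cluster)
    if entry is None:
        # Not registered: a direct address data block, used as-is.
        return lbn
    return entry[offset]
```

Because the logical block number is implied by A and B, the table only needs to store the cluster number and the physical block numbers.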
[0072]
Next, the basic processing flow of the write operation of the disk array control apparatus according to the embodiment of the present invention will be described with reference to the flowchart of FIG.
[0073]
FIG. 6 shows a processing flow when one data block management unit LBi is written to the disk.
[0074]
For the explanation of FIG. 6, the situation as shown in FIGS. 7A to 7C is specifically considered. Hereinafter, for the sake of explanation, the number of blocks, the number of entries in the table, and the like are indicated by numbers far smaller than those of the actual system.
[0075]
FIG. 7A shows a physical cluster on the logical disk. Here, LB0, LB1,... Each indicate a logical block, and the logical block number is represented by a subscript.
[0076]
In this example, the physical cluster is composed of four data blocks. For example, the physical cluster 0 is composed of four data blocks LB0 to LB3.
[0077]
In this example, there are 16 physical clusters in the data area on the logical disk. However, the “logical address space” of the logical disk is as wide as 13 physical clusters, and this number of “logical clusters” exists.
[0078]
Note that direct address data blocks may exist in the range of physical clusters 0-12. This area is called a “direct address area”. The physical clusters 13 to 15 are “margin areas”. Only “indirect address data block” can be arranged in this portion.
[0079]
In FIG. 7A, all logical blocks are arranged at physical block number positions equal to the numbers. That is, all logical clusters are “direct address clusters”. Also, no valid data block is arranged in the margin area.
[0080]
FIG. 7B is a table for managing the states of all the physical clusters shown in FIG. 7A, and is placed in the cache memory of the RAID controller. The index is a physical cluster number. Each entry takes one of the following three values:
Direct address cluster: all blocks in the cluster are direct address data blocks.
Indirect address cluster: one or more indirect address data blocks are included in the cluster.
Empty cluster: no valid block is included in the cluster.
[0081]
Here, the physical clusters 0 to 12 are “direct address clusters”, and the others are “empty clusters”.
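The initial state of the physical cluster management table in this example can be sketched as follows. The enum, constants, and helper names are hypothetical; only the three states and the 13/16 cluster counts come from the description.

```python
from enum import Enum
from typing import List, Optional


class ClusterState(Enum):
    DIRECT = "direct address cluster"
    INDIRECT = "indirect address cluster"
    EMPTY = "empty cluster"


NUM_PHYSICAL_CLUSTERS = 16  # data area of the logical disk in FIG. 7A
NUM_LOGICAL_CLUSTERS = 13   # physical clusters 0 to 12 form the direct address area


def initial_cluster_table() -> List[ClusterState]:
    """Table indexed by physical cluster number, as in FIG. 7B."""
    return [
        ClusterState.DIRECT if pc < NUM_LOGICAL_CLUSTERS else ClusterState.EMPTY
        for pc in range(NUM_PHYSICAL_CLUSTERS)
    ]


def find_empty_cluster(table: List[ClusterState]) -> Optional[int]:
    """Return the lowest-numbered empty physical cluster, or None if there is none."""
    for pc, state in enumerate(table):
        if state is ClusterState.EMPTY:
            return pc
    return None
```

Searching from the lowest number is a choice made for this sketch; the description only requires that some empty cluster be found.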
[0082]
FIG. 7C shows the address conversion table, which is placed in the cache memory of the RAID controller. Each entry registers the number of an indirect address logical cluster and the four physical block numbers at which the logical blocks constituting that cluster are placed. In this state, nothing is registered in the address translation table (because there is no indirect address logical cluster). In this example, it is assumed that an address conversion table large enough to register the blocks of four logical clusters is prepared.
[0083]
Consider now the state when logical blocks are written to the disk in the order LB9, LB6, LB8, LB0, LB17, LB15, LB2, LB7, LB6.
[0084]
In FIG. 6, the initial LBi is “LB9”. In step 601, since LB9 is not registered in the AMT (address translation table), the process proceeds to step 602.
[0085]
In step 602, the address translation table has the free entry needed to register the “direct address logical cluster” 2 to which LB9 belongs. The process therefore proceeds to step 603, where “logical cluster” 2 (all of its blocks) is registered in the address translation table. That is, “2” is registered as the first “logical cluster number” of the address conversion table and “8, 9, 10, 11” as its physical block numbers (step 603).
[0086]
Proceeding to step 606, since there is no physical cluster PC currently being written, the process proceeds to step 607 to search for a free physical cluster. Here, the physical cluster PC is a cluster secured in step 609 and does not exist at the initial writing stage.
[0087]
From the physical cluster management table shown in FIG. 7B, physical cluster 13 is found to be an “empty cluster”; it is selected as the physical cluster PC and marked as an “indirect address cluster” (steps 608 and 609).
[0088]
In step 612, the first block (physical block number 52) of the physical cluster 13 is assigned to LB9, and in step 613, "52" is registered in the entry of logical block number 9 in the address translation table (AMT). In FIG. 6, PBnext means a pointer for writing.
[0089]
In step 614, three empty blocks still remain in physical cluster 13, so the process proceeds to step 616 and no actual write is performed yet. For example, as shown in FIG. 8D, a write buffer for one cluster is prepared in memory and the data of logical blocks scheduled for writing is placed there; once step 614 determines that all blocks to be written to the physical cluster have been decided, the data is written to the physical cluster in one batch in step 615.
[0090]
FIG. 8 shows the state after processing has progressed this far. FIG. 8A is a diagram of the physical clusters; the position of physical block number 9, where LB9 was originally placed, no longer holds the latest data of LB9.
[0091]
The newly written LB9 data is scheduled to be written to the first block of physical cluster 13, and the actual data is held in the write buffer as shown in FIG. 8D. In the physical cluster management table shown in FIG. 8B, physical clusters 2 and 13 are marked as “indirect address clusters”, and logical cluster 2 is registered in the address conversion table of FIG. 8. In particular, 52, the number of the first block of physical cluster 13, is associated with the write target LB9 as its physical block number.
[0092]
Similarly, FIG. 9 shows the state after the logical blocks have been processed in the above order (LB9, LB6, LB8, LB0, LB17, LB15, LB2, LB7, LB6, ...) according to the flowchart. In FIG. 9(a), the portions marked with x indicate data invalidated (made out of date) by later writes. LB15 is overwritten at its current physical block number position because step 602 found no free entry in the address conversion table.
[0093]
In the embodiment of the present invention, performing new writing as it is at the position of the physical block number where the current logical data block is placed is called “overwriting”.
[0094]
Further, writing in the above order (LB9, LB6, LB8, LB0, LB17, LB15, LB2, LB7, LB6, ...) is a pattern called random writing. In other words, the blocks are written in an order unrelated to their logical addresses; normally, writing to the disk in this order causes seeks and is inefficient.
[0095]
The portions written on the disk at the time of FIG. 9A are physical clusters 13 and 14. Since writing is performed in units of stripe sets (or multiples thereof), no reads for parity calculation occur and the seek range on the disk is narrow, so writing can be performed very efficiently and at high speed.
[0096]
In the present embodiment, the total number of logical data blocks arranged as an indirect address cluster is limited within the range of the total number of entries in the address translation table.
[0097]
In other words, batch writing as an indirect address cluster is possible within a range that can be registered in a fixed-size address conversion table, so that it is not necessary to limit the capacity of the target logical disk with the size of the address conversion table as in the prior art.
[0098]
If a free physical cluster cannot be secured in step 608, LBi is overwritten at the address of the physical block where the logical data block is currently located (step 610). If it is determined in step 602 that there are not enough entries in the address conversion table, the logical block LBi to be written is written as a direct address data block (step 604).
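The per-block flow of FIG. 6 can be traced with the small Python simulation below, using the sizes of the FIG. 7 example (4 blocks per cluster, 16 physical clusters of which 13 are the direct address area, and room for 4 logical clusters in the address translation table). It is a sketch under assumptions: the class and attribute names are hypothetical, actual disk I/O is replaced by lists that record what would be written, and invalidation of old physical blocks is not tracked.

```python
from typing import Dict, List, Optional

BLOCKS_PER_CLUSTER = 4
NUM_PHYSICAL_CLUSTERS = 16
NUM_LOGICAL_CLUSTERS = 13   # physical clusters 13..15 form the margin area
AMT_CAPACITY = 4            # logical clusters that fit in the table


class Controller:
    def __init__(self) -> None:
        # Physical cluster management table: "direct", "indirect" or "empty".
        self.state: List[str] = ["direct"] * NUM_LOGICAL_CLUSTERS + \
            ["empty"] * (NUM_PHYSICAL_CLUSTERS - NUM_LOGICAL_CLUSTERS)
        self.amt: Dict[int, List[int]] = {}  # logical cluster -> physical block numbers
        self.pc: Optional[int] = None        # physical cluster currently being filled
        self.pb_next = 0                     # PBnext: next physical block to assign
        self.buffer: List[int] = []          # logical blocks held in the write buffer
        self.flushed: List[int] = []         # physical clusters written in one batch
        self.overwritten: List[int] = []     # logical blocks written in place

    def write_block(self, lb: int) -> None:
        lc, off = divmod(lb, BLOCKS_PER_CLUSTER)
        if lc not in self.amt:                              # step 601
            if len(self.amt) >= AMT_CAPACITY:               # step 602: no free entry
                self.overwritten.append(lb)                 # step 604
                return
            first = lc * BLOCKS_PER_CLUSTER                 # step 603
            self.amt[lc] = list(range(first, first + BLOCKS_PER_CLUSTER))
            self.state[lc] = "indirect"
        if self.pc is None:                                 # step 606
            if "empty" not in self.state:                   # steps 607-608
                self.overwritten.append(lb)                 # step 610
                return
            self.pc = self.state.index("empty")             # step 609
            self.state[self.pc] = "indirect"
            self.pb_next = self.pc * BLOCKS_PER_CLUSTER
        self.amt[lc][off] = self.pb_next                    # steps 612-613
        self.buffer.append(lb)
        self.pb_next += 1
        if len(self.buffer) == BLOCKS_PER_CLUSTER:          # step 614: cluster full
            self.flushed.append(self.pc)                    # step 615: batch write
            self.buffer, self.pc = [], None


ctl = Controller()
for lb in [9, 6, 8, 0, 17, 15, 2, 7, 6]:
    ctl.write_block(lb)
```

Running the write order from the description through this sketch maps LB9 to physical block 52, batch-writes physical clusters 13 and 14, and overwrites only LB15 in place, which agrees with the states described for FIG. 8 and FIG. 9.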
[0099]
In steps 612 and 613, data is not yet written at the address of the physical block number on the disk, and remains on the write buffer. Therefore, when reading these data, it is necessary to refer to the write buffer until it is written to the disk.
[0100]
Regarding the write buffer, this example secures memory for one physical cluster and places the write data there. However, when the data to be written already exists in the cache memory, the buffer may instead be managed as a sequence of pointers into the cache memory. Such an implementation eliminates the overhead of copying data.
[0101]
FIG. 10 is a flowchart for explaining the overall operation of the writing process of the disk array control apparatus according to the embodiment of the present invention.
[0102]
This processing is, for example, the logic applied when dirty cache data of the RAID controller is written to disk. Specifically, consider a process that periodically writes dirty data from the RAID controller cache to disk. In this situation some dirty data in the cache must be written out, and the processing is repeated until no further writing is needed (for example, until enough free cache blocks have been produced). Note that the processing of FIG. 6 is used as the basic operation for writing one data block.
[0103]
First, let us consider a case where there is no free physical cluster and there is no free physical block in the physical cluster currently being written out.
[0104]
In this case, the following is considered.
[0105]
When there is no “free cluster” and “there is no physical cluster to be written out”, writing as a logging cluster cannot be performed. For this reason, even if any logical block is selected as a write target, the only choice is to write (overwrite) the current physical block assigned to that logical block.
[0106]
However, there is a possibility that an empty physical cluster will eventually occur. For example, there is a possibility that the repacking process is proceeding at the same time, or the load is reduced and the repacking process starts preferentially and an empty physical cluster may be generated.
[0107]
When free physical clusters occur, it is advantageous to write indirect address data blocks preferentially in the form of logged clusters. This is because even if the indirect address data block is written in the log cluster, the entry of the “address conversion table” is not consumed again (because it is already registered). Therefore, if a direct address data block exists as a write candidate, the direct address data block should be preferentially written by overwriting.
[0108]
Based on the above idea, S801 determines whether there is a free physical cluster or a free physical block in the physical cluster currently being written. If neither exists, overwriting is attempted in S809.
[0109]
FIG. 11 is a flowchart for explaining the operation of the overwrite writing process.
[0110]
As shown in the figure, it is first determined whether a direct address data block exists among the write candidates (S901). If so, the longest run of write-target direct address data blocks with consecutive addresses is selected (S902), and the selected run of direct address data blocks is overwritten in place (S903). The longest run is selected in S902 because writing in large sizes is more efficient even when overwriting. Here, writing aligned to a stripe set boundary and in units of the stripe size is the most efficient.
[0111]
In this way, by delaying the writing of the indirect address data block, it becomes possible to write data in the logged stripe format at high speed without consuming entries in the address translation table when a free physical cluster occurs.
[0112]
On the other hand, let us consider a case where there is no direct address data block to be written in step 901.
[0113]
In this case, the following is considered.
[0114]
1. Only an indirect address data block can be selected for writing.
[0115]
2. A certain amount of time is required for the writing process.
[0116]
3. During the writing process, the physical stripe to be written cannot be the target of the repack process.
[0117]
4. If an indirect address data block belonging to a physical cluster to be repacked is selected as a write target here, the efficiency of the repacking process deteriorates.
[0118]
5. Since there are currently few free physical clusters, the repacking process should not be made inefficient.
[0119]
6. A physical cluster to be repacked is a physical cluster with a small number of valid data blocks (in repacking, it is more efficient to merge physical stripes with a small number of valid blocks to create a free physical stripe).
[0120]
In consideration of the above circumstances, a block belonging to the physical stripe that contains the most valid blocks is selected from among the indirect address data blocks.
[0121]
That is, if it is determined in S901 that no direct address data block exists, an indirect address data block belonging to a physical cluster having the largest number of valid blocks is selected from candidate indirect address data blocks (S904). The selected indirect address data block is overwritten (S905).
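The selection logic of FIG. 11 can be sketched as below. The input shapes are assumptions made for illustration: direct candidates are given as a list of block numbers, indirect candidates as a mapping from logical block to the physical cluster that currently holds it, and `valid_count` as a mapping from physical cluster to its number of valid blocks. All names are hypothetical.

```python
from typing import Dict, List, Tuple


def longest_consecutive_run(block_numbers: List[int]) -> List[int]:
    """Longest run of consecutive block numbers among the candidates."""
    best: List[int] = []
    run: List[int] = []
    for b in sorted(set(block_numbers)):
        if run and b == run[-1] + 1:
            run.append(b)
        else:
            run = [b]
        if len(run) > len(best):
            best = list(run)
    return best


def choose_overwrite_target(
    direct: List[int],
    indirect: Dict[int, int],
    valid_count: Dict[int, int],
) -> Tuple[str, List[int]]:
    """Pick the blocks to overwrite when no free physical block is available."""
    if direct:                                               # S901
        return "direct", longest_consecutive_run(direct)     # S902, S903
    if not indirect:
        return "none", []
    # S904: prefer the block whose physical cluster has the most valid blocks,
    # so clusters that are likely repack targets are left alone.
    block = max(indirect, key=lambda lb: valid_count[indirect[lb]])
    return "indirect", [block]                               # S905
```

When several runs share the longest length, this sketch keeps the one with the lowest block numbers; the description does not say how ties are broken.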
[0122]
By proceeding in this way, when a physical cluster with few valid blocks is selected for repacking, there are fewer opportunities for writes to hit that physical cluster, so processing can continue and repacking efficiency is improved.
[0123]
Consider a case where there is a free physical cluster in S801 in FIG. 10 or there is a free physical block in the physical cluster to be written out.
[0124]
In this case, the following is considered.
[0125]
1. Writing out as a logging cluster is available as an option.
[0126]
2. In particular, the indirect address data block does not need to secure a new entry in the address conversion table, so a writing method as a logging cluster is possible.
[0127]
3. In particular, a direct address data block can be written as a log cluster if a new address translation table entry can be secured.
[0128]
4. Address translation table entries are a finite resource and should be consumed as little as possible.
[0129]
In consideration of the above situation, the indirect address data block should be given priority as a log cluster write candidate. As a result, it is possible to proceed with processing so that consumption of entries in the address translation table does not increase.
[0130]
Further, by writing out an indirect address data block belonging to the physical cluster with the smallest number of valid blocks, that block moves to another physical cluster, and the number of valid blocks in the physical stripe that already had few valid blocks is reduced further. This improves the efficiency of repacking.
[0131]
Therefore, the indirect address data block belonging to the physical cluster with the smallest number of valid blocks should be the write candidate. As a result, the number of physical clusters with few valid blocks increases, and repacking efficiency improves.
[0132]
That is, if S801 finds a free physical cluster, or a free physical block in the physical cluster currently being written, it is determined whether the write candidate blocks include an indirect address data block (S802).
[0133]
If it is determined in S802 that the write candidate blocks include an indirect address data block, an indirect address data block belonging to the physical cluster with the smallest number of valid blocks is selected (S810). Then, the indirect address data block is written to the log stripe by the write process of FIG. 6 (S811).
[0134]
Next, consider a case where there is no indirect address data block as a write candidate.
[0135]
In this case, the following is considered.
[0136]
1. The block to be written out is a direct address data block.
[0137]
2. When a direct address data block is written out in logged cluster format, the write itself takes little time, but an address translation table entry is consumed.
[0138]
3. Furthermore, because the direct address data block becomes an indirect address data block, a series of blocks with consecutive logical block numbers may be placed discontiguously on the disk, which has the disadvantage of degrading subsequent sequential access performance.
[0139]
4. Therefore, when a direct address data block can be written efficiently by overwriting, it is better to write it by overwriting wherever possible.
[0140]
5. In particular, when a series of direct address data blocks with consecutive block numbers (= addresses) exists as a write target, the data can be written by overwriting relatively efficiently.
[0141]
Therefore, if a run of more than a certain number of consecutive direct address data blocks exists, those blocks should be written together in one overwrite.
[0142]
That is, if it is determined in S802 that no indirect address data block exists among the write candidate blocks, the longest run of direct address data blocks with consecutive addresses is selected from the write-target direct address data blocks and set as BL, and the number of blocks in that run is set as N (S803).
[0143]
Next, in S804, it is determined whether the number of blocks N is greater than a predetermined value “A”. Here, “A” is a reference block length: a run length that actual measurement has shown can be overwritten at relatively high speed. Alternatively, an implementation is possible in which the value is determined dynamically from the state of the entire system.
[0144]
If it is determined in S804 that the number of blocks N is larger than the predetermined value “A”, the selected BL is collectively written to the current address (S813). In S813, writing may be performed at the stripe set boundary and stripe size, and in this case, the efficiency is further improved.
[0145]
Such processing prevents unnecessary consumption of entries in the address conversion table. It also prevents sequential performance from deteriorating through an unnecessary increase in indirect addresses.
[0146]
Next, the case where it is determined in S804 that there are no continuous direct address data blocks equal to or greater than the number A will be described. In this case, since efficient direct address data block overwriting cannot be expected, the following is considered.
[0147]
1. Evaluate the urgency of writing, and if it is necessary to write as soon as possible, write in a logging cluster that can be written at high throughput.
[0148]
2. As the degree of “emergency”, the dirty rate of the cache can be considered. The higher the Dirty rate, the sooner it needs to be exported.
[0149]
3. When the urgency is low, the direct address data block is overwritten in place in order to avoid an increase in indirect address data blocks (to save address conversion table entries and prevent deterioration of sequential performance). In particular, the more indirect address data blocks already exist, the more strongly overwriting of direct address data blocks is preferred.
[0150]
Based on the above considerations, in S805, one of the write candidate direct address data blocks is selected, and it is determined whether to write it to the logging cluster or to overwrite it. FIG. 12A is a flowchart for explaining the processing in S805.
[0151]
As shown in the figure, one of the write candidate direct address data blocks belonging to the cache column (cache set) with the highest usage rate is selected (S1001). Next, the writing method is determined using the determination function f (α, β) (S1002).
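The selection in S1001 can be sketched as below. The `Candidate` record, the `set_usage` mapping from cache set index to usage rate, and the function name are hypothetical; the text does not specify how the usage rate of a cache column is measured or stored.

```python
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    """A write candidate direct address data block (hypothetical record)."""
    logical_block: int
    cache_set: int  # index of the cache column (cache set) holding the block


def select_candidate(candidates: List[Candidate],
                     set_usage: Dict[int, float]) -> Candidate:
    """S1001 (sketch): pick a candidate from the cache set with the highest usage.

    Sets missing from `set_usage` are treated as unused (0.0), an assumption.
    """
    return max(candidates, key=lambda c: set_usage.get(c.cache_set, 0.0))
```

Choosing from the fullest cache set first frees space where the cache is under the most pressure, which is one plausible reading of why the embodiment orders candidates this way.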
[0152]
FIG. 12B is a diagram for explaining the determination function f (α, β). In the figure, α means the dirty rate of the entire cache, and β means the usage rate (%) of the entire address translation table.
[0153]
In this example, when the dirty rate α of the cache is 90% or more, the decision is “write to log cluster”. When α is below 90%, the decision is made by comparing α with the usage rate β (the ratio of the number of entries in use to the total number of entries). The higher the dirty rate, or the smaller the number of indirect address data blocks, the more likely the decision is “write to log cluster”. By changing the logic of the determination function f () according to the situation and the implementation, a decision suited to actual conditions can be made.
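A minimal sketch of one possible determination function follows. The 90% threshold comes from the example above; the rule used below that threshold (log when α exceeds β) is an assumption, since the text only says that α is compared with β and FIG. 12B is not reproduced here.

```python
LOG = "write to log cluster"
OVERWRITE = "overwrite"


def f(alpha: float, beta: float, urgent_threshold: float = 90.0) -> str:
    """Decide the writing method from the cache dirty rate and table usage.

    alpha: dirty rate of the entire cache, in percent.
    beta:  usage rate of the entire address conversion table, in percent.
    """
    if alpha >= urgent_threshold:
        return LOG  # urgent: the cache is almost entirely dirty
    # Assumed comparison: log only while the dirty rate exceeds table usage.
    return LOG if alpha > beta else OVERWRITE
```

With this rule, a higher dirty rate or a lower table usage rate both push the result toward “write to log cluster”, which matches the tendency described above; other rules with the same tendency would fit the text equally well.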
[0154]
If the result of S1002 is not to overwrite, writing to the logging cluster is decided (S1003, S1005). If the result of S1002 is to overwrite, overwriting is decided (S1003, S1004).
[0155]
As a result, when the dirty rate of the cache is low, this method keeps the number of indirect address data blocks from increasing and so prevents sequential performance from deteriorating.
[0156]
Thereafter, in S806, it is determined whether writing to the log stripe was decided. If it was not, the selected direct address data block is written by overwriting (S808). If it was, the selected direct address data block is written by the write processing of FIG. 6. At this time, the block is written to the logging cluster if possible.
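The branch in S806 can be sketched as a small dispatcher. The two callbacks are hypothetical stand-ins for the write processing of FIG. 6 and for the in-place overwrite of S808. Falling back to an overwrite when the logging write is not possible is an assumption based on the phrase "if possible".

```python
from typing import Callable


def write_selected_block(
    block: int,
    log_decided: bool,
    write_via_log: Callable[[int], bool],
    overwrite_in_place: Callable[[int], None],
) -> str:
    """S806 onward (sketch): write one selected direct address data block.

    write_via_log returns True when the block was written to a logging
    cluster, and False when no logging cluster could be used.
    """
    if log_decided and write_via_log(block):
        return "logged"
    overwrite_in_place(block)  # S808, or the assumed fallback
    return "overwritten"
```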
[0157]
Therefore, according to the disk array control apparatus of the embodiment of the present invention, in a disk system adopting the log format data block management method, the capacity of the address conversion table is limited and data blocks are managed in log format only within the range that the table can cover. This avoids any capacity limitation on the target logical disk.
[0158]
Further, when the urgency is high, the data is written in the log format, and when the urgency is low, the data is written in the non-log format, thereby avoiding an increase in the log data and avoiding the deterioration of the sequential access performance of the disk array control device.
[0159]
Note that the present invention is not limited to the above-described embodiments, and various modifications can be made at the stage of implementation without departing from the scope of the invention. In addition, the embodiments may be combined as appropriate wherever possible, in which case the combined effect is obtained. Furthermore, the above embodiments include inventions at various stages, and various inventions can be extracted by appropriately combining the disclosed constituent elements. For example, when an invention is extracted by omitting some constituent elements from all those shown in the embodiments, the omitted portions are supplemented as appropriate by well-known common techniques when the extracted invention is implemented.
[0160]
[Effect of the invention]
As described above in detail, according to the present invention, the high-speed writing of the conventional RAID acceleration method can be realized while the amount of memory used for the address conversion table is kept constant and without restricting the capacity of the logical disk to be controlled.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a computer system according to an embodiment of the present invention.
FIG. 2 is a diagram showing a configuration of a cache memory 14 of the computer system according to the embodiment of the present invention.
FIG. 3 is a diagram showing the layout of the logical disk address space of a disk in the RAID acceleration method; (a) is a diagram showing the stripe arrangement on the physical disks, and (b) is a diagram showing the stripe layout on the logical disk.
FIG. 4 is a diagram showing the layout of the “logical address space” corresponding to FIG. 3.
FIG. 5 is a diagram illustrating an address conversion table.
FIG. 6 is a flowchart for explaining the basic processing flow of a write operation of the disk array control apparatus according to the embodiment of the present invention.
FIG. 7A is a diagram showing the state of physical clusters on a logical disk, FIG. 7B is a table for managing the states of all physical clusters shown in FIG. 7A, and FIG. 7C is a diagram showing an address conversion table.
FIG. 8A is a diagram showing the state of physical clusters on a logical disk, FIG. 8B is a table for managing the states of all physical clusters shown in FIG. 8A, FIG. 8C is a diagram showing an address conversion table, and FIG. 8D is a diagram showing a write buffer.
FIG. 9A is a diagram showing the state of physical clusters on a logical disk, FIG. 9B is a table for managing the states of all physical clusters shown in FIG. 9A, FIG. 9C is a diagram showing an address conversion table, and FIG. 9D is a diagram showing a write buffer.
FIG. 10 is a flowchart for explaining the overall operation of the writing process of the disk array control apparatus according to the embodiment of the present invention.
FIG. 11 is a flowchart for explaining the operation of the overwrite writing process.
FIG. 12A is a flowchart for explaining the processing in S805, and FIG. 12B is a diagram for explaining the determination function f (α, β).
FIG. 13 is a diagram showing a RAID 5 logical disk composed of three disks.
[Explanation of symbols]
1 ... CPU bus,
2 ... CPU,
3 ... main memory,
4 ... PCI-PCI bridge,
5 ... PCI bus,
11 ... RAID controller PCI card,
12 ... PCI-PCI bridge,
13 ... Internal PCI bus,
14 ... cache memory,
15 ... SCSI controller,
16 ... Dedicated CPU,
17 ... Local RAM,
18 ... Flash ROM,
20 ... SCSI bus,
21-1 to 21-n... Disk devices.

Claims

A disk array control device for controlling a disk array provided with at least one logical disk, comprising:
a physical cluster management table for managing the state of the physical clusters in one logical disk;
an address conversion table for storing a logical block number in the one logical disk in association with the physical block number corresponding to that logical block number;
means for determining whether a physical block number of a data block to be written is registered in the address conversion table;
means for determining, when it is determined that the physical block number of the data block to be written is registered in the address conversion table, whether or not a free physical cluster can be secured by referring to the physical cluster management table; and
means for securing the free physical cluster when it is determined that a free physical cluster can be secured, and for writing the data block to be written to the secured free physical cluster,
wherein one logical disk is managed by mixing direct address clusters, each consisting only of data blocks that can be accessed without address translation, and indirect address clusters, each containing data blocks that require address translation.

The disk array control device according to claim 1, further comprising means for writing, when it is determined that a free physical cluster cannot be secured, the data block to be written to the physical block address that is registered in the address conversion table for that data block.

The disk array control device according to claim 1, further comprising:
means for determining, when it is determined that the physical block number of the data block to be written is not registered in the address conversion table, whether or not the address conversion table has entries for registering the logical cluster to which the data block to be written belongs;
means for registering, when it is determined that the address conversion table has entries for registering that logical cluster, all the physical block numbers of the logical cluster to which the data block to be written belongs in the address conversion table; and
means for registering the physical block number of the write-destination data block in the free physical cluster as the physical block number of the newly registered data block to be written.

The disk array control device according to claim 1, wherein, when the logical cluster to which the data block to be written belongs is registered in the address conversion table, access is performed using the physical block number registered in the address conversion table for the logical cluster number to which the data block belongs, and
when the logical cluster to which the data block to be written belongs is not registered in the address conversion table, access is performed using the logical block number of the data block as the physical block number.
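The address resolution recited in this claim can be illustrated with a short sketch. The table layout (a dictionary keyed by logical cluster number whose value lists one physical block number per block of the cluster), the cluster size of four blocks, and the function name are all assumptions chosen for the example.

```python
from typing import Dict, List

BLOCKS_PER_CLUSTER = 4  # assumed cluster size, for illustration only


def resolve_physical_block(logical_block: int,
                           table: Dict[int, List[int]]) -> int:
    """Return the physical block number to access for a logical block.

    table maps a registered logical cluster number to the physical block
    numbers of all blocks in that cluster (hypothetical layout).
    """
    cluster, offset = divmod(logical_block, BLOCKS_PER_CLUSTER)
    entry = table.get(cluster)
    if entry is None:
        # Direct address: the logical block number is the physical one.
        return logical_block
    # Indirect address: look up the registered physical block number.
    return entry[offset]
```

For instance, with `table = {2: [8, 40, 10, 11]}`, logical block 9 resolves to physical block 40, while logical block 5, whose cluster is not registered, resolves to itself.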

The disk array control device according to claim 1, wherein the total number of logical data blocks placed in indirect address clusters is limited to within the total number of entries in the address conversion table.

The disk array control device according to claim 1, further comprising:
means for determining whether there is a free physical cluster to be written next, or whether there is a free physical block in the physical cluster currently being written;
means for determining, when such a free physical cluster or free physical block is determined to exist, whether or not the write candidate logical blocks include an indirect address data block written to a physical address on the disk different from its requested logical address; and
means for selecting, when it is determined that the write candidate logical blocks include an indirect address data block, an indirect address data block belonging to the physical cluster having the smallest number of valid blocks,
wherein the selected indirect address data block is written as the data block to be written.
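The selection step in this claim can be sketched as follows. Representing each write candidate as a pair of logical block number and physical cluster number, and the per-cluster counts as a dictionary, are assumptions for the example; the function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def select_indirect_block(
    candidates: List[Tuple[int, int]],
    valid_counts: Dict[int, int],
) -> Optional[int]:
    """Pick the indirect block whose physical cluster has the fewest valid blocks.

    candidates:   (logical_block, physical_cluster) pairs for the write
                  candidate indirect address data blocks.
    valid_counts: number of valid blocks currently in each physical cluster.
    Returns the chosen logical block number, or None if there is no candidate.
    """
    if not candidates:
        return None
    block, _cluster = min(candidates, key=lambda c: valid_counts[c[1]])
    return block
```

Rewriting a block out of the physical cluster with the fewest valid blocks brings that cluster closer to being entirely free, which is a likely motivation for this ordering, although the claim itself does not state one.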

The disk array control device according to claim 6, further comprising means for writing, when it is determined that there is no free physical cluster or no free physical block in the physical cluster currently being written, the data block to be written to the physical block address that is registered in the address conversion table for that data block.

The disk array control device according to claim 6, further comprising:
means for selecting, when it is determined that the write candidate logical blocks include no indirect address data block written to a physical address on the disk different from its requested logical address, the longest run of direct address data blocks with consecutive addresses from among the write candidate direct address data blocks; and
means for writing the selected run of direct address data blocks when the selected run satisfies a predetermined condition.

The disk array control device according to claim 8, further comprising means for determining, when it is determined that the predetermined condition is not satisfied, a writing method for each write candidate direct address data block based on the dirty rate of the cache memory that temporarily stores data to be written to the disk array and on the usage rate of the address conversion table.

The disk array control device according to claim 9, wherein, when it is determined that the predetermined condition is not satisfied, the write candidate direct address data blocks are selected in order starting from those belonging to the cache column with the highest usage rate.

A data writing method in a disk array control device for controlling a disk array provided with at least one logical disk, the disk array control device comprising a physical cluster management table for managing the state of the physical clusters in one logical disk, and an address conversion table for storing a logical block number in the one logical disk in association with the physical block number corresponding to that logical block number, the method comprising the steps of:
determining whether a physical block number of a data block to be written is registered in the address conversion table;
determining, when it is determined that the physical block number of the data block to be written is registered in the address conversion table, whether or not a free physical cluster can be secured by referring to the physical cluster management table; and
securing the free physical cluster when it is determined that a free physical cluster can be secured, and writing the data block to be written to the secured free physical cluster,
wherein one logical disk is managed by mixing direct address clusters, each consisting only of data blocks that can be accessed without address translation, and indirect address clusters, each containing data blocks that require address translation.

The data writing method in the disk array control device according to claim 11, further comprising a step of writing, when it is determined that a free physical cluster cannot be secured, the data block to be written to the physical block address that is registered in the address conversion table for that data block.

The data writing method in the disk array control device according to claim 11, further comprising the steps of:
determining, when it is determined that the physical block number of the data block to be written is not registered in the address conversion table, whether or not the address conversion table has entries for registering the logical cluster to which the data block to be written belongs;
registering, when it is determined that the address conversion table has entries for registering that logical cluster, all the physical block numbers of the logical cluster to which the data block to be written belongs in the address conversion table; and
registering the physical block number of the write-destination data block in the free physical cluster as the physical block number of the newly registered data block to be written.

The data writing method in the disk array control device according to claim 11, wherein, when the logical cluster to which the data block to be written belongs is registered in the address conversion table, access is performed using the physical block number registered in the address conversion table for the logical cluster number to which the data block belongs, and
when the logical cluster to which the data block to be written belongs is not registered in the address conversion table, access is performed using the logical block number of the data block as the physical block number.

The data writing method in the disk array control device according to claim 11, wherein the total number of logical data blocks placed in indirect address clusters is limited to within the total number of entries in the address conversion table.

The data writing method in the disk array control device according to claim 11, further comprising the steps of:
determining whether there is a free physical cluster to be written next, or whether there is a free physical block in the physical cluster currently being written;
determining, when such a free physical cluster or free physical block is determined to exist, whether or not the write candidate logical blocks include an indirect address data block written to a physical address on the disk different from its requested logical address; and
selecting, when it is determined that the write candidate logical blocks include an indirect address data block, an indirect address data block belonging to the physical cluster having the smallest number of valid blocks,
wherein the selected indirect address data block is written as the data block to be written.

The data writing method in the disk array control device according to claim 16, further comprising a step of writing, when it is determined that there is no free physical cluster or no free physical block in the physical cluster currently being written, the data block to be written to the physical block address that is registered in the address conversion table for that data block.

The data writing method in the disk array control device according to claim 16, further comprising the steps of:
selecting, when it is determined that the write candidate logical blocks include no indirect address data block written to a physical address on the disk different from its requested logical address, the longest run of direct address data blocks with consecutive addresses from among the write candidate direct address data blocks; and
writing the selected run of direct address data blocks when the selected run satisfies a predetermined condition.

The data writing method in the disk array control device according to claim 18, further comprising a step of determining, when it is determined that the predetermined condition is not satisfied, a writing method for each write candidate direct address data block based on the dirty rate of the cache memory that temporarily stores data to be written to the disk array and on the usage rate of the address conversion table.

The data writing method in the disk array control device according to claim 19, wherein, when it is determined that the predetermined condition is not satisfied, the write candidate direct address data blocks are selected in order starting from those belonging to the cache column with the highest usage rate.