JP4234730B2 - RAID blockage determination method, RAID device, its controller / module, program

Info

Publication number: JP4234730B2
Application number: JP2006130737A
Authority: JP
Inventors: 孝一塚田; 悟史矢澤; 章二大嶋; 達彦町田; 宏和松林
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2006-05-09
Filing date: 2006-05-09
Publication date: 2009-03-04
Anticipated expiration: 2026-05-09
Also published as: US7779203B2; US20080010495A1; JP2007304728A

Description

The present invention relates to blockage determination and the like in a RAID device.

FIG. 11 shows the schematic configuration of a conventional RAID system. In FIG. 11, a RAID device 100 has a CM (Centralized Module) 101, BRTs (Backend Routers) 102 and 103, and a RAID group 104 composed of a plurality of disks. Only one RAID group 104 is shown in the figure, but in practice there are often several. A host 110 requests the CM 101, via an arbitrary communication line, to access an arbitrary RAID group.

The CM 101 manages and controls the various disk access processes, error recovery processes, and the like in the RAID device 100. The BRTs 102 and 103 are located between the CM 101 and the RAID group 104 and serve as switches connecting the two. There are two access paths by which the host 110 accesses the RAID group 104 via the CM 101 (although only one is shown in the figure), and the BRTs 102 and 103 are provided on these two access paths, one on each. Therefore, even if one access path becomes unusable for some reason (for example, a BRT failure), access remains possible over the other.

However, both paths (both systems) may become unusable. In the illustrated example, the BRTs 102 and 103 have both failed, and in this case none of the RAID groups 104 can be accessed (the figure shows only one RAID group, but in practice there are often several).

When a RAID group in the RAID device becomes inaccessible in this way and the host 110 keeps issuing access requests, the RAID device 100 judges that a disk has failed, and eventually the RAID group is treated as failed and user data may be lost. In addition, because the host 110 keeps trying to access a group that cannot be accessed, host processing is delayed.

For this reason, unless the cause of the inaccessibility lies in the disks themselves, the RAID group is temporarily put into a blocked state. RAID blockage means that the RAID group that has become inaccessible is held in the same state as before the blockage and host access to it is prohibited. This protects user data and causes any access from the host to terminate abnormally at once.
Host access to a blocked RAID group is accepted again from the point at which the cause of the blockage has been removed.

The problem here is how to determine whether or not to perform RAID blockage.
FIG. 12 shows an example of a conventional RAID blockage determination method. FIG. 12 shows the determination method for the case where the RAID group to be judged is RAID 1.

As the table in FIG. 12 shows, the conventional method registers, for each combination of an “event that may cause RAID blockage” (an event occurring in the device, (3)) and a “state of each disk in DLU units” (2), whether or not the RAID group is to be blocked. When one of the “events that may cause RAID blockage” occurs, this table is consulted to determine whether or not to block the RAID group. The table is stored in, for example, a memory in the CM 101, and the CM 101 makes the determination. In the table, “○” means that the group is blocked and “×” means that it is not.
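As a rough illustration only, the conventional table lookup can be sketched as a dictionary keyed by (event, DLU state). The event key `paired_brt_failure` and the function name are hypothetical. Only the row for a failure of both paired BRTs is filled in, because that is the only row whose contents this description states (every state is blocked); the other rows of FIG. 12 are deliberately left out.

```python
# DLU state abbreviations as used in FIG. 12.
DLU_STATES = ("Av", "Br", "Ex", "Rb", "Spr", "SiU", "Cp")

# Hypothetical event key. Only this row is grounded in the description:
# a failure of both paired BRTs blocks the group whatever the DLU state.
BLOCK_TABLE = {("paired_brt_failure", state): True for state in DLU_STATES}


def lookup_block_decision(event: str, dlu_state: str) -> bool:
    """Return the registered block / do-not-block decision for one pair."""
    try:
        return BLOCK_TABLE[(event, dlu_state)]
    except KeyError:
        # An event with no row cannot be judged until the table is extended.
        raise LookupError(
            f"no table entry for {event!r} in state {dlu_state!r}"
        ) from None
```

An event that has no row in the table cannot be judged at all, which is the maintenance burden this kind of table carries.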

In the illustrated example, the “events that may cause RAID blockage” are “failure of the paired BRTs”, “failure of the ports of the paired BRTs”, “failure of the paired BRTs (BRT straddling)”, “failure of the ports of the paired BRTs (BRT straddling)”, “P1 transitions to Ne (Not Exist)”, and so on, but various other events can also occur.

“Failure of the paired BRTs” means, for example, that both of the BRTs 102 and 103 have failed. In this case, as shown in the figure, every entry is “○” (block) regardless of the state of each disk.

“Failure of the ports of the paired BRTs” refers to the case where, for example, the ports of the BRTs 102 and 103 that connect to one and the same RAID group have both failed. “BRT straddling” refers to the case where disks belonging to the same RAID group are connected to different systems. For example, as shown in FIG. 15, a disk P1 connected to the BRT0/BRT1 system and a disk P2 connected to the BRT2/BRT3 system belong to the same RAID group.

The meaning of each symbol (state symbol) shown in FIG. 12 is described below.
First, the DLU is described. As shown in FIGS. 13(a) and 13(b), the RLU is the RAID group itself, and the DLU is a concept for linking the logical volume called the RLU to the physical volumes called disks. DISK denotes each hard disk itself. As shown in FIG. 13(a), in the case of RAID 1 the DLU and the RLU have the same content, so the DLU shown in FIG. 12 may be read as the RLU.

P1, P2, HS1, and HS2 in FIG. 12 are names provisionally given to the disks constituting one RAID group. As shown in FIG. 13(a), a RAID 1 group is composed of two disks (P1 and P2), and the same data is written to both. In practice, spare disks (called Hot Spares) are also provided; these are HS1 and HS2 in FIG. 12.

The meanings of the symbols such as Av and Br in FIG. 12, which indicate the state of the DLU or of each disk, are as follows:
Av (Available; normal state), Br (Broken; failed state), Fu (Failed Usable; only Read is permitted when the RAID has failed), Ex (Exposed; degraded state), Ne (Not Exist; the disk is temporarily not visible because of a loop down or the like), Rb (Rebuild; Rebuild state), Spr (Sparing; Redundant Copy state), SiU (Spare In Use; a Hot Spare is in use), Cp (Copyback; Copyback state), and SpW (Spr + WF).

As shown in FIG. 12, the state of a DLU/RLU is determined by which of the above states each of its constituent disks is in. For example, if both disks P1 and P2 are in the normal state (Av), the DLU is naturally in the normal state (Av). This is described with reference to FIG. 14.

FIG. 14(a) is the case just described. If either disk enters the failed state (Br) from this normal state, the DLU enters the degraded state (Ex) (FIG. 14(b)).
As shown in FIG. 12, the DLU also enters the degraded state when a disk enters the Ne state rather than the failed state (Br). The DLU likewise enters the degraded state when, for example, disk P1 has become unreliable, so that disk P1 is in the Spr state and disk HS1 is in the Rb state while the data of disk P1 is being copied to disk HS1, and disk P2 then enters Br.
Once the state of FIG. 14(b) is reached, RAID 1 requires the same data to be stored on two or more disks, so the Hot Spare (here HS1) is brought into use and, as shown in FIG. 14(c), the data stored on disk P1 is copied to disk HS1. In this state disk HS1 is in the Rb state and the DLU is also in the Rb state. When the copy completes, disk HS1 enters the normal state as shown in FIG. 14(d), and a DLU operating normally with a Hot Spare in this way is in the SiU state.

Further, when a disk (here P2) has not failed but has shown some trouble, so that continuing to use it as is would be risky, disk P2 is put into the Spr state and the Hot Spare (here HS1) into the Rb state, as shown in FIG. 14(e), and the data of disk P2 is copied to the Hot Spare. The DLU is then in the Redundant Copy (Spr) state.

If, for example, disk P2 returns to normal while normal operation is being carried out using the Hot Spare, disk P2 is put into the Rb state and the data stored on the Hot Spare is copied to disk P2, as shown in FIG. 14(f). The DLU is then in the Copyback (Cp) state.

As shown in FIG. 14(g), when one of the disks P1 and P2 is in the Fu state and the other is in the Br state, the DLU is in the Br state. When the RAID group becomes unusable (here, when both disks P1 and P2 have failed), the disk that fails at that moment is not put into the Br state. In the illustrated example, disk P2 failed first and was put into the Br state; when disk P1 subsequently fails, it is put into Fu rather than Br, in an attempt to rescue as much of the stored data as possible. In RAID 1 the group does not become unusable when only one disk fails, but in RAID 0 the failure of a single disk makes the group unusable, so no disk ever enters the Br state.

The DLU is in the SiU state not only in the state of FIG. 14(d) but also in the state of FIG. 14(h). In FIG. 14(h), disk P1 has additionally become unreliable starting from the state of FIG. 14(d), so disk P1 is put into the Spr state and its data is being copied to disk HS2 (disk HS2 is naturally in the Rb state).
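The cases of FIG. 14(a) to (h) described so far can be collected into a small lookup, shown here only as a reading aid. The tuple layout (P1, P2, HS1, HS2), the use of `None` for a Hot Spare that is not in use, and the function name are assumptions of this sketch. Where the description says "either disk", the sketch picks P2 as the first disk to fail, and for case (f) it assumes P1 and HS1 are both in the Av state.

```python
# Keys are (P1, P2, HS1, HS2) disk states; values are the resulting DLU state.
# None marks a Hot Spare that is not in use (an assumption of this sketch).
FIG14_CASES = {
    ("Av", "Av", None, None): "Av",    # (a) both disks normal
    ("Av", "Br", None, None): "Ex",    # (b) one disk failed: degraded
    ("Av", "Br", "Rb", None): "Rb",    # (c) rebuilding onto HS1
    ("Av", "Br", "Av", None): "SiU",   # (d) rebuild done, spare in use
    ("Av", "Spr", "Rb", None): "Spr",  # (e) redundant copy of P2 to HS1
    ("Av", "Rb", "Av", None): "Cp",    # (f) copyback from HS1 to P2
    ("Fu", "Br", None, None): "Br",    # (g) second failure kept as Fu
    ("Spr", "Br", "Av", "Rb"): "SiU",  # (h) P1 being copied to HS2
}


def dlu_state(p1, p2, hs1=None, hs2=None):
    """Return the DLU state for a listed case, or None if it is not listed."""
    return FIG14_CASES.get((p1, p2, hs1, hs2))
```

Combinations that the description does not cover return `None` rather than a guessed state.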

As shown at the right-hand end of FIG. 12, the DLU is also treated as being in the Spr state when, for example, disk P2 is in the SpW (Spr + WF) state rather than Spr. This “Spr + WF” state is described with reference to FIGS. 14(i) and 14(j).

First, as shown in FIG. 14(i), suppose that a Write occurs while disk P1 is in the Av state, disk P2 is in the Spr state, disk HS1 is in the Rb state, and a copy from disk P1 to disk HS1 is in progress. The Write is performed on all of the disks, as illustrated, and suppose that the Write to disk P2 fails. Disk P2 then still holds the OLD data (the state before the Write), so it can no longer be read from. Disk P2 is therefore put into the “Spr + WF” state, as shown in FIG. 14(j). Even in this state, the copy from disk P2 to disk HS1 continues. This is the state shown at the right-hand end of FIG. 12.

The techniques described in Patent Documents 1, 2, and 3 are also known.
The invention of Patent Document 1 is an error retry method that, when a failure occurs in an input/output operation request to a peripheral device having a disk device, reduces the judgment processing performed by the computer and reduces wasteful retry processing during failure recovery.

The invention of Patent Document 2 is a method in which, when a failure occurs in a system connected by FC-AL, various monitors cooperate to automatically bypass, port by port, the devices connected to the HUB, execute T&D to collect failure information, and manage that information paired with log information.

The invention of Patent Document 3 is a device path blocking scheme in which, if a failure reported by the disk subsystem depends on a device path, only that device path is blocked, preventing any effect on the channel path.
Patent Document 1: JP 2000-132413 A; Patent Document 2: JP 2000-215086 A; Patent Document 3: JP-A-4-291649

The conventional method described with reference to FIG. 12 has problems such as (1) to (4) below.
(1) Whenever the number of “events that may cause RAID blockage” grows, the new event must be added to the table, and whether or not to block must be set for every combination of that event with the “states the RAID group can take”, which is very laborious.

(2) When an event occurred that the table of FIG. 12 alone could not handle, exception processing had to be added.
(3) As exception processing of the kind in (2) accumulates, the logic becomes complicated and hard to maintain.

(4) Because the “events that may cause RAID blockage” are specified individually, the logic cannot be shared, which leads to growth of the source code.
Patent Documents 1 to 3 have no bearing on solving the above problems. The invention of Patent Document 1 concerns retry processing for a disk device and is unrelated to how an error occurring inside the disk device is handled. The invention of Patent Document 2 concerns system-level recovery and the collection of failure information and log information when a failure occurs in a subsystem, and is unrelated to how an error occurring inside the subsystem (here, the disk device) is handled. Patent Document 3 focuses on device paths: it autonomously blocks a device path when that path is abnormal so as to minimize the effect on the system, and does not concern protecting the data of the RAID groups on a device path after that path has been blocked.

An object of the present invention is to provide, in connection with the process of determining whether or not RAID blockage is to be performed, a RAID blockage determination method for a RAID device, a RAID device, a controller module for it, a program, and the like that greatly reduce the setting effort, improve maintainability, reduce the amount of coding, and allow determination logic common to all RAID levels to be used.

A first controller module according to the present invention is a controller module in a RAID device having RAID groups each composed of a plurality of disks, and has RAID management/control means that, each time a specific event requiring a RAID blockage determination occurs in the RAID device, performs the following for each RAID group subject to the determination: it classifies the disks belonging to the RAID group into a plurality of categories on the basis of the state of each disk or the presence or absence of an access path to each disk, counts the number of disks falling into each category, and compares the counts with preset threshold conditions to determine whether or not to block the RAID group.

With the first controller module, whether or not to block can be determined, regardless of which “event that may cause RAID blockage” has occurred, by classifying and counting the disks according to a predetermined classification method and comparing the counts with the threshold conditions. Because the determination logic is shared in this way, no determination logic or exception processing needs to be added even when the number of “events that may cause RAID blockage” grows, and maintainability improves. It is sufficient to set a threshold condition for each RAID level.
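The shape of this shared logic might be sketched as follows. The threshold conditions themselves are not given at this point in the description, so the two predicates below are hypothetical placeholders chosen only to follow the principle stated for the prior art (block when a group is unreachable for reasons other than disk failure); the names `Counts`, `THRESHOLDS`, and `should_block` are likewise assumptions.

```python
from typing import Callable, Dict, NamedTuple


class Counts(NamedTuple):
    """Per-group tallies of the three disk categories."""
    use: int
    unuse: int
    loop_down: int


# Hypothetical placeholder conditions, one per RAID level. The real
# threshold conditions are preset data and are not reproduced here.
THRESHOLDS: Dict[str, Callable[[Counts], bool]] = {
    # Mirror: block only if no copy is usable and a path loss is involved.
    "RAID1": lambda c: c.use == 0 and c.loop_down >= 1,
    # Stripe: block if any member is lost to a path loss and none has failed.
    "RAID0": lambda c: c.unuse == 0 and c.loop_down >= 1,
}


def should_block(raid_level: str, counts: Counts) -> bool:
    """Apply the same comparison whatever event triggered the check."""
    return THRESHOLDS[raid_level](counts)
```

The triggering event does not appear anywhere in the function signature: supporting a new event requires no new code, and supporting a new RAID level requires only one more threshold entry.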

A second controller module according to the present invention is a controller module in a RAID device having RAID groups each composed of a plurality of disks, and has I/O control means serving as the interface between the controller module and an arbitrary external host device, and RAID management/control means that manages and controls the determination of whether to block any RAID group in the RAID device and the execution of the blockage. When blocking a RAID group, the RAID management/control means determines whether or not that RAID group can be recovered in a short time and, if so, notifies the I/O control means to that effect. If the I/O control means has received this notification and the host device requests access to the blocked RAID group, the I/O control means returns a dummy response to the host device.

When a RAID group is blocked, no access from the host device is normally accepted, so the processing that the host device was executing terminates abnormally. However, if the blocked RAID group can be recovered in a short time, returning a dummy response makes the host wait for a while, so that abnormal termination of the processing it was executing can be avoided.
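The decision made by the I/O control means can be sketched as a three-way branch. The class `GroupStatus`, the function name, and the response strings are hypothetical; the description says only that a dummy response is returned and does not specify its form here.

```python
from dataclasses import dataclass


@dataclass
class GroupStatus:
    """What the I/O control side knows about one RAID group (hypothetical)."""
    blocked: bool = False
    short_recoverable: bool = False  # set on notification from RAID management


def respond_to_host(status: GroupStatus) -> str:
    """Choose how to answer a host access request for this RAID group."""
    if not status.blocked:
        return "PROCESS"  # normal handling of the request
    if status.short_recoverable:
        return "DUMMY"    # make the host wait instead of failing its job
    return "ERROR"        # ordinary blockage: terminate abnormally at once
```

The dummy branch is taken only when the group is both blocked and flagged as recoverable in a short time, so an ordinary blockage still fails fast.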

According to the RAID blockage determination method, RAID device, controller module, program, and the like of the present invention, the setting effort for determining whether or not to perform RAID blockage is greatly reduced, maintainability is improved, the amount of coding is reduced, and determination logic common to all RAID levels can be used.

Further, even when a RAID group is blocked, abnormal termination of the processing that the host was executing can be avoided.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 shows the configuration of a RAID device 1 of this example.
The illustrated RAID device 1 has two CMs 10 (10a and 10b), an FRT 3, a BRT 4, a BRT 5, a DE 6, and a DE 7.

The CM 10 and the BRTs 4 and 5 are as described for the prior art: CM stands for Centralized Module and BRT for Backend Router. Here, however, the CM 10a is connected to both the BRT 4 system and the BRT 5 system, and so is the CM 10b. The RAID blockage determination process and other processes described later are executed by the CMs 10a and 10b individually. The FRT 3 relays and controls communication between the CMs 10a and 10b.

The DE (drive enclosure) 6 has PBCs 6a and 6b and a disk group 6c. Similarly, the DE (drive enclosure) 7 has PBCs 7a and 7b and a disk group 7c. For example, the illustrated disk P1 in the disk group 6c and the illustrated disk P2 in the disk group 6c form one RAID group (for example, RAID 1). The configuration is of course not limited to this example. A “BRT straddling” RAID group as described for the prior art may be configured, or a RAID group may be configured within a single disk group. Furthermore, when there are several RAID groups, they do not necessarily all have the same RAID level. The RAID level of each RAID group is stored and managed by the CM 10.

PBC stands for port bypass circuit.
The ports of the BRT 4 are connected to the PBC 6a and the PBC 7a, and the ports of the BRT 5 are connected to the PBC 6b and the PBC 7b. Each CM 10 accesses the disk groups 6c and 7c via the BRT 4 or the BRT 5 and a PBC.
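The connections just listed imply that each drive enclosure is reachable over two independent routes. A minimal sketch, assuming a route is usable only when both its BRT and its PBC are healthy, is shown below; the `PATHS` table mirrors the wiring in the text, while the function name and the set-of-failed-components interface are hypothetical.

```python
# Routes from a CM to each drive enclosure, as (BRT, PBC) pairs taken
# from the wiring described in the text.
PATHS = {
    "DE6": [("BRT4", "PBC6a"), ("BRT5", "PBC6b")],
    "DE7": [("BRT4", "PBC7a"), ("BRT5", "PBC7b")],
}


def has_access_path(enclosure: str, failed: set) -> bool:
    """True if at least one route has neither its BRT nor its PBC failed."""
    return any(
        brt not in failed and pbc not in failed
        for brt, pbc in PATHS[enclosure]
    )
```

For example, losing the BRT 4 alone leaves both enclosures reachable through the BRT 5, whereas losing the BRT 4 together with the PBC 6b cuts off the DE 6 but not the DE 7.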

Each CM 10 is connected to a host 2 (2a, 2b) via an arbitrary communication line. Each CM 10 has an I/O control unit 21, which handles exchanges with the host 2 (accepting host accesses, responding, and so on). Each CM 10 also has a RAID management/control unit 22.

The RAID management/control unit 22 acquires, as needed, the state of each component (BRT, PBC, and so on) and each disk in the RAID device 1 and stores it as configuration information 22a. This is the same as in the prior art, so it is not described further and no specific example of the configuration information 22a is given. The RAID management/control unit 22 refers to the configuration information 22a and the like, additionally checks the access path to each disk, and determines whether or not to perform RAID blockage. The prior art also makes this determination by referring to the state of each disk in the configuration information 22a, as described above, but the determination method of the present invention is different. The features of the present invention lie mainly in the RAID management/control unit 22, that is, in its method of determining whether to perform RAID blockage, its method of determining whether short-time recovery is possible, and so on. Details are given later.

The result of the RAID blockage determination itself, however, is the same as in the prior art. That is, the grounds for judging whether the RAID group should be blocked are the same as before, so the result does not change. With this method, however, the determination is made using the “classification” and “threshold” described later, which solves the conventional problems described above.

The RAID management/control unit 22 also executes RAID blockage, releases blockage, and so on.
Each CM 10 is connected to each of the BRTs 4 and 5 by a back panel, and the I/F (interface) is FC (Fibre Channel). Each of the BRTs 4 and 5 is connected to each of the DEs 6 and 7 by an FC cable, and the I/F is FC. The disks in each of the DEs 6 and 7 are connected by a back panel, and the I/F is FC. Each disk is accessed through an FC loop. Consequently, if the loop is broken by a fault in an upstream disk or the like, the disks downstream of it on the FC loop may become inaccessible.

FIG. 2 shows the hardware configuration of the CM 10.
The CM 10 shown in FIG. 2 has DIs 31, DMAs 32, two CPUs 33 and 34, an MCH (Memory Controller Hub) 35, a memory 36, and a CA 37.

The DI 31 is an FC controller connected to each BRT. The DMA 32 is a communication circuit connected to the FRT 3. The MCH 35 is a circuit that connects the so-called host-side buses, such as the external buses of the CPUs 33 and 34, to the PCI bus so that they can communicate with each other. The CA 37 is an adapter for connecting to the host.

The first embodiment of the present invention is described first.
The processing of the various flowcharts of the first embodiment described later is realized by the CPU 33 or the CPU 34 reading and executing an application program stored in advance in the memory 36. The same applies to the second embodiment described later. The threshold condition data shown in FIG. 3(b), described later, is also stored in the memory 36 in advance and is referred to during the blockage determination process.

FIG. 3 is a diagram for explaining the blockage determination method according to the first embodiment.
The blockage determination process of this example starts when, for example, any of the “events that may cause RAID blockage” occurs. Unlike the conventional method, this method does not register a block/do-not-block decision for each such event. The occurrence of an event is merely the trigger that starts the process. The decision itself is made by counting the disks according to the criteria shown in FIG. 3(a), based on the state of each disk, the presence or absence of an access path to each disk, and so on (each disk is classified into one of a plurality of categories, here the three counting units shown in the figure, and the number of disks in each category is counted), and then comparing the counts with the threshold conditions shown in FIG. 3(b).

First, FIG. 3A is described.
To determine RAID blockage, it is necessary to distinguish, for each RAID group, which disks can be used and which cannot, and to determine whether the group as a whole can still be accessed.

For this purpose, each disk in the RAID group is classified according to its state. As shown in FIG. 3A, each disk is classified as “Use Disk”, “Unuse Disk”, or “Loop Down Disk”. As explained in the description of the prior art, a RAID group here means the RLU. Accordingly, the classification and tallying, and the comparison with the threshold conditions (that is, the RAID blockage determination process), are performed per RLU.

Basically, a “Use Disk” is an accessible disk and an “Unuse Disk” is an inaccessible disk. However, “Unuse Disk” applies only when the disk has become inaccessible because of a disk failure. A disk that has become inaccessible because its access path was lost is classified as “Loop Down Disk” and is distinguished from a failed disk.

FIG. 3A lists the above in terms of specific disk states.
As shown in FIG. 3A, a disk is classified as “Use Disk” when its state is Available (normal state), Failed Usable (only reads are permitted while the RAID has failed), or Sparing (Redundant Copy state). Even in these states, a disk that meets a “Loop Down Disk” condition described later is classified as “Loop Down Disk”.

A disk is classified as “Unuse Disk” when its state does not correspond to “Use Disk”, for example Broken (failed state), Rebuild, or Not Exist (this does not hold when special case 2, described later, is applied). A Not Exist disk that falls under case (5), described later, is counted as “Loop Down Disk”.

However, the disk states are not limited to the types shown in FIG. 12. The following are examples of states that are not shown in FIG. 12 but should be classified as “Unuse Disk”. This description, however, does not take into account states that are not shown in FIG. 12.
・Not Available: no disk is mounted.
・Not Supported: a disk with a smaller capacity than defined is mounted.
・Present: a disk waiting for Rebuild/Copyback.
・Readying: a disk that is being incorporated.
・Spare: a disk in the normal state as a Hot Spare (treated as an Unuse Disk because it is not included in the RAID group).
A disk is classified as “Loop Down Disk” when it meets either of the following conditions (4) and (5).

(4) The disk is in the Available, Failed Usable, or Sparing state, and there is no access path to the disk.
(5) The disk is in the Available, Failed Usable, or Sparing state, but is in the middle of a transition (change) to the Not Exist state.
In some forms, the following conditions are added as special cases.

Special case 1: a Rebuild disk that is the destination of a Redundant Copy in progress is not included in the tally.
Special case 2: even in the Sparing state, a disk with a write failure (“Write Fail”) is classified as “Unuse Disk”, not “Use Disk”.
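The classification rules above, including conditions (4) and (5) and the two optional special cases, can be sketched as a single function. This is only an illustrative sketch: the names `Category`, `classify`, and all parameter names are hypothetical and do not come from the specification, and the disk state is assumed to be available as a plain string.

```python
from enum import Enum


class Category(Enum):
    USE = "Use Disk"
    UNUSE = "Unuse Disk"
    LOOP_DOWN = "Loop Down Disk"


# States that qualify a disk as a "Use Disk" candidate per FIG. 3A.
USABLE_STATES = {"Available", "Failed Usable", "Sparing"}


def classify(state, has_path=True, pending_not_exist=False,
             write_failed=False, redundant_copy_dest=False,
             special_case_1=False, special_case_2=False):
    """Classify one disk; return None when the disk is left out of the tally."""
    if state in USABLE_STATES:
        # Conditions (4) and (5): no access path, or a Not Exist
        # notification has arrived but is not yet recorded.
        if not has_path or pending_not_exist:
            return Category.LOOP_DOWN
        # Special case 2: Sparing with a write failure counts as unusable.
        if special_case_2 and state == "Sparing" and write_failed:
            return Category.UNUSE
        return Category.USE
    # Special case 1: the Redundant Copy destination is not tallied.
    if special_case_1 and redundant_copy_dest:
        return None
    return Category.UNUSE
```

The path check is placed before the write-failure check so that a Sparing disk with no access path is still reported as “Loop Down Disk”, which is the order the flowcharts described later follow.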

Case (5) is explained here. As a premise, although not shown in FIG. 1, the CM 10 conventionally includes a function unit (called the RAS) that monitors and detects the state of each component in the RAID device 1 (BRT, PBC, and so on), the access path to each disk, and the state of each disk. The RAS notifies the RAID management/control unit 22 of each detection result. On receiving a notification, the RAID management/control unit 22 updates the configuration information 22a that it holds and manages. The classification described above is basically performed by the RAID management/control unit 22 with reference to this configuration information 22a.

“In the middle of a transition (change)” in (5) means that the RAID management/control unit 22 has received a notification of a new disk state from the RAS but has not yet updated the configuration information 22a. Case (5) therefore refers to the situation in which the RAS has notified the RAID management/control unit 22 that a disk has changed to the Not Exist state, but the RAID management/control unit 22 has not yet reflected this change in the configuration information 22a.

As described above, the RAID management/control unit 22 of the CM (Centralized Module) basically recognizes the state of each disk by referring to the configuration information 22a, checks the access path to a disk where necessary, classifies the disks based on these results, and counts the number of disks classified as “Use Disk”, “Unuse Disk”, and “Loop Down Disk”. It then compares the counts with the RAID blockage threshold table (threshold conditions) shown in FIG. 3B to determine whether to block. As illustrated, the RAID blockage threshold table stores a threshold condition for each RAID level, and the CM compares the counts with the threshold condition for the RAID level of the RAID group under determination. If the counts satisfy the threshold condition, the CM determines that the RAID group is to be blocked.

As illustrated, when the RAID level of the RAID group under determination is RAID0, the group is determined to be blocked if the number of “Unuse Disk” is 0 and the number of “Loop Down Disk” is 1 or more, regardless of the number of “Use Disk”. As already described for the prior art, an “Unuse Disk” cannot exist under RAID0: if even one disk fails, the group becomes unusable, so the disk is treated as Fu rather than Br, as explained with reference to FIG. 14G. Consequently no disk can be in Rebuild, and none can be in the middle of a copy to a Hot Spare. The condition “the number of ‘Unuse Disk’ is 0” can therefore be regarded as stating this explicitly for confirmation only, and may be omitted.

When the RAID level of the RAID group under determination is RAID1 or RAID0+1, the group is determined to be blocked if the number of “Use Disk” is 0 and the number of “Loop Down Disk” is 1 or more, regardless of the number of “Unuse Disk”.

When the RAID level of the RAID group under determination is RAID5 or RAID0+5, the group is determined to be blocked if either of two threshold conditions is met. Neither condition depends on the number of “Use Disk” (any number is acceptable). One condition is “the number of ‘Unuse Disk’ is 0 and the number of ‘Loop Down Disk’ is 2 or more”; the other is “the number of ‘Unuse Disk’ is 1 and the number of ‘Loop Down Disk’ is 1 or more”.

The thresholds shown in FIG. 3B uniquely determine, for each RAID level, the state in which user data can no longer be guaranteed. For example, RAID0 is striping, so user data is lost if even one disk fails. RAID1 is mirroring, so even if one disk fails the mirror disk remains and user data is guaranteed.
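The threshold conditions described for FIG. 3B can be written as a small lookup table. The following is a minimal sketch that encodes only the conditions stated in the text; the names `THRESHOLDS` and `should_block` and the string keys for the RAID levels are assumptions made for illustration.

```python
def _raid0(use, unuse, loop_down):
    # "Use Disk" count is irrelevant.
    return unuse == 0 and loop_down >= 1


def _mirror(use, unuse, loop_down):
    # RAID1 and RAID0+1: "Unuse Disk" count is irrelevant.
    return use == 0 and loop_down >= 1


def _parity_a(use, unuse, loop_down):
    return unuse == 0 and loop_down >= 2


def _parity_b(use, unuse, loop_down):
    return unuse == 1 and loop_down >= 1


# Each RAID level maps to a list of conditions; meeting any one means "block".
THRESHOLDS = {
    "RAID0": [_raid0],
    "RAID1": [_mirror],
    "RAID0+1": [_mirror],
    "RAID5": [_parity_a, _parity_b],
    "RAID0+5": [_parity_a, _parity_b],
}


def should_block(raid_level, use, unuse, loop_down):
    """Return True when the tally meets a threshold condition for the level."""
    return any(cond(use, unuse, loop_down) for cond in THRESHOLDS[raid_level])


print(should_block("RAID5", use=1, unuse=1, loop_down=1))  # True
print(should_block("RAID5", use=1, unuse=2, loop_down=1))  # False
```

Keeping the conditions in a table means that supporting another RAID level only requires a new entry, without touching the tallying logic.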

The blockage determination method described above provides the following effects (1) to (4).
(1) Less code, because the logic is shared.
(2) Better maintainability, because the logic is shared.
(3) No logic needs to be added or changed when the number of event types increases (only the triggers that start the process increase).
(4) New RAID levels can be supported by adding new threshold conditions.
FIGS. 4 to 7 show flowcharts of the RAID blockage determination process of this example. These flowcharts basically present the classification method and the threshold-based determination method described with reference to FIGS. 3A and 3B as a procedure executed by a computer. The computer here is the CM. The memory in the CM stores a program that causes the CPU 33 or the CPU 34 to execute the RAID blockage determination process shown in FIGS. 4 to 7. Functionally, however, the processing of the flowcharts in FIGS. 4 to 7 is performed by the RAID management/control unit 22. The same applies to the other flowcharts.

First, the process of FIG. 4 is described. This process does not take special cases 1 and 2 into account.
In the illustrated process, the RAID blockage determination process of this example (step S14 onward) is executed when some component failure has occurred (step S11) and the failure has put both systems into the FC Loop Down state (step S12, YES). If the failure does not put both systems into the FC Loop Down state (step S12, NO), the process is not executed (step S13).

The determinations in steps S11 and S12 are the same as in the conventional method. Conventionally, when the determination in step S12 was YES, the “event that can cause a RAID blockage” was identified, and the identified event was looked up in the table of FIG. 12 to determine whether to block the RAID.

In this method, by contrast, when the determination in step S12 is YES, steps S14 to S24 are executed for each of the RAID groups to which the failed component is connected. First, the state of each disk in the RAID group is checked (step S14). Then steps S16 to S21 are executed for each disk.

If the state of the disk being processed is Av (Available), Fu (Failed Usable), or Spr (Sparing) (step S16, YES), it is checked whether there is an access path to the disk (a path on the upper side as seen from the disk) (step S18). If there is a path (step S18, YES), the disk is counted as “Use Disk” (step S19); if there is no path (step S18, NO), it is counted as “Loop Down Disk” (step S20). If the disk is in any state other than Av, Fu, or Spr (step S16, NO), it is counted as “Unuse Disk” (step S17).

When this tallying has been performed for all disks in the RAID group being processed (step S21, YES), the resulting counts of “Use Disk”, “Unuse Disk”, and “Loop Down Disk” are compared with the RAID blockage threshold table shown in FIG. 3B (specifically, with the threshold condition for the RAID level of the RAID group being processed) to determine whether to block the RAID group (step S22). If the determination in step S22 is YES, the RAID group is blocked (step S23); if NO, nothing is done (step S24).
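The flow of FIG. 4 can be outlined as follows. This is a hedged sketch rather than the actual firmware logic: `tally_fig4`, `handle_component_failure`, and `condition_for` are hypothetical names, disks are modeled as `(state, has_path)` tuples, and the threshold table of FIG. 3B is represented by a caller-supplied function.

```python
USABLE = {"Av", "Fu", "Spr"}


def tally_fig4(disks):
    """Count one RAID group's disks; disks is a list of (state, has_path)."""
    counts = {"use": 0, "unuse": 0, "loop_down": 0}
    for state, has_path in disks:      # S14: look at every disk
        if state not in USABLE:        # S16: NO
            counts["unuse"] += 1       # S17
        elif has_path:                 # S18: YES
            counts["use"] += 1         # S19
        else:                          # S18: NO
            counts["loop_down"] += 1   # S20
    return counts


def handle_component_failure(both_loops_down, groups, condition_for):
    """Return the names of the RAID groups to block.

    groups maps a group name to (raid_level, disks). condition_for(level)
    returns a predicate over (use, unuse, loop_down), standing in for FIG. 3B.
    """
    if not both_loops_down:            # S12: NO -> S13, nothing to do
        return []
    blocked = []
    for name, (level, disks) in groups.items():
        c = tally_fig4(disks)
        if condition_for(level)(c["use"], c["unuse"], c["loop_down"]):  # S22
            blocked.append(name)       # S23
    return blocked


# A RAID1 group with one Br disk and one Av disk whose access path is lost.
groups = {"rlu0": ("RAID1", [("Br", True), ("Av", False)])}
raid1_only = lambda level: (lambda use, unuse, loop: use == 0 and loop >= 1)
print(handle_component_failure(True, groups, raid1_only))  # ['rlu0']
```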

The RAID blockage determination process of this method is executed not only when a component failure occurs but also when the state of any disk changes. As described above, the RAS monitors and detects the state of each disk and notifies the RAID management/control unit 22, and the RAID management/control unit 22, referring to the configuration information 22a, starts the process of FIG. 5 when there is a disk whose state has changed.

That is, when the state of any disk has changed (step S81), at least the RAID group to which that disk belongs is processed: the states of all disks belonging to the RAID group are checked, and steps S83 to S93 are executed for each disk to perform the tally.

First, if the state of the disk being processed (that is, the state recorded in the configuration information 22a) is other than Av, Fu, or Spr (step S83, NO), the disk is counted as “Unuse Disk” (step S88). If the state is Av, Fu, or Spr (step S83, YES), it is determined whether the disk is the changed disk (the disk whose state has changed) (step S84). If it is not (step S84, NO), the same processing as steps S18, S19, and S20 is performed (steps S85, S91, and S92): it is checked whether there is an access path to the disk (a path on the upper side as seen from the disk) (step S85); if there is a path (step S85, YES), the disk is counted as “Use Disk” (step S91), and if there is no path (step S85, NO), it is counted as “Loop Down Disk” (step S92).

If the disk being processed is the changed disk (step S84, YES) and it has changed from Av, Fu, or Spr to the Ne (Not Exist) state (step S86, YES), it is counted as “Loop Down Disk” (step S90). If it has changed from Av, Fu, or Spr to a state other than Ne (step S86, NO), it is counted as “Use Disk” (step S89) when the new state is Av, Fu, or Spr (step S87, NO), and as “Unuse Disk” (step S88) when the new state is other than Av, Fu, or Spr (step S87, YES).

When the tally has been performed for all disks to be processed (step S93, YES), steps S94, S95, and S96 are executed. These are the same as steps S22, S23, and S24 and are not described again here.
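The tallying part of FIG. 5 (steps S83 to S93) can be sketched in the same style. The function name `tally_fig5` and the data layout are assumptions for illustration: each disk is keyed by an identifier and carries the state still recorded in the configuration information, while the newly notified state of the changed disk is passed separately.

```python
USABLE = {"Av", "Fu", "Spr"}


def tally_fig5(disks, changed_id, new_state):
    """Tally a RAID group after one disk's state change.

    disks maps a disk id to (recorded_state, has_path); recorded_state is the
    value held in the configuration information before it is updated.
    """
    counts = {"use": 0, "unuse": 0, "loop_down": 0}
    for disk_id, (state, has_path) in disks.items():
        if state not in USABLE:            # S83: NO
            counts["unuse"] += 1           # S88
        elif disk_id != changed_id:        # S84: NO
            if has_path:                   # S85: YES
                counts["use"] += 1         # S91
            else:                          # S85: NO
                counts["loop_down"] += 1   # S92
        elif new_state == "Ne":            # S86: YES (changing to Not Exist)
            counts["loop_down"] += 1       # S90
        elif new_state in USABLE:          # S87: NO
            counts["use"] += 1             # S89
        else:                              # S87: YES
            counts["unuse"] += 1           # S88
    return counts


disks = {"P1": ("Av", True), "P2": ("Av", True), "P3": ("Br", True)}
print(tally_fig5(disks, "P1", "Ne"))  # {'use': 1, 'unuse': 1, 'loop_down': 1}
```

The resulting counts would then be compared with the threshold condition for the group's RAID level, as in steps S94 to S96.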

Next, the process of FIG. 6 is described. This is the process of FIG. 4 with special case 1 taken into account. In FIG. 6, steps that are the same as in FIG. 4 carry the same step numbers and are not described again; only the differences from FIG. 4 are described below. Although the process shown applies special case 1 to the process of FIG. 4, special case 1 may of course also be applied to the process of FIG. 5. That variant is not illustrated, but it adds step S31, described below, to the case where the determination in step S83 is NO.

The process of FIG. 6 differs from that of FIG. 4 in that step S31 is added for the case where the determination in step S16 is NO. Step S31 determines whether the disk being processed is the copy destination disk of a Redundant Copy in progress (the Rebuild disk under Redundant Copy). If it is (step S31, YES), step S17 is not executed. If the determination in step S31 is NO, the disk is counted as “Unuse Disk” (step S17).

In other words, in the process of FIG. 4 a disk was always counted as “Unuse Disk” when the determination in step S16 was NO, whereas in this process a disk that is a Redundant Copy destination is excluded from the tally.

Next, the process of FIG. 7 is described. This process takes both special case 1 and special case 2 into account. In FIG. 7, steps that are the same as in FIG. 6 carry the same step numbers and are not described again; only the differences from FIG. 6 are described below.

The process of FIG. 7 differs from that of FIG. 6 in that, when the determination in step S18 is YES, steps S41 to S43 are executed instead of step S19.

When the determination in step S18 is YES, that is, when the disk being processed is in the Av (Available), Fu (Failed Usable), or Spr (Sparing) state and a path exists, the processes of FIGS. 4 and 6 always counted the disk as “Use Disk”. In this process, if the disk is in the “Sparing with write failure” state (that is, the SpW state shown in FIG. 12) (step S41, YES), it is counted as “Unuse Disk” (step S42). Otherwise (step S41, NO), it is counted as “Use Disk” (step S43).
Although the process shown applies special cases 1 and 2 to the process of FIG. 4, they may of course also be applied to the process of FIG. 5. That variant is not illustrated, but it adds step S31 to the case where the determination in step S83 is NO, and performs steps S41, S42, and S43 instead of step S91.

The processes described above are now explained using specific examples.
FIG. 8A shows one example. Of the two disks P1 and P2 of a RAID1 group, disk P1 was in the Br state (failed) and disk P2 was in the Av state (normal), and then the access path to disk P2 was lost for some reason. When the process of FIG. 4 is executed in this example, disk P1 yields NO in step S16 and is counted as “Unuse Disk”, and disk P2 yields NO in step S18 and is counted as “Loop Down Disk”. A disk in the Br state is normally managed as removed from the RAID group, but as the example of FIG. 8A shows, the tally of this example treats a Br disk as belonging to the RAID group and includes it.

The tally is therefore “Use Disk” = 0, “Unuse Disk” = 1, and “Loop Down Disk” = 1. As described above, the threshold condition for RAID1 in FIG. 3B is “Use Disk” = 0 and “Loop Down Disk” = 1 or more, regardless of the number of “Unuse Disk”, so the blockage condition is met and the RAID group is determined to be blocked.

In an example such as FIG. 8A, the process of FIG. 4 determines blockage correctly. However, that process was designed with RAID groups without redundancy in mind and does not anticipate cases in which a rebuild proceeds while redundancy is maintained, such as Redundant Copy, so it can produce an incorrect determination.

FIG. 8B shows one such example. The RAID level of the RAID group in FIG. 8B is assumed to be RAID5.
As shown in FIG. 8B, disk P2 failed (Br), so a Hot Spare (here, HS1) was put into use and disk HS1 entered the Av state. A problem then arose in disk P1 as well, so disk P1 was placed in the Spr state and a process of copying its data to a Hot Spare (here, HS2) was started (Redundant Copy). The figure shows the case in which the access path to disk P1 is lost for some reason while this copy is in progress.

In such a case, the RAID group must be blocked.
However, under the process of FIG. 4, disk P1 yields NO in step S18 and is counted as “Loop Down Disk”, disk HS1 is counted as “Use Disk”, and disks P2 and HS2 are counted as “Unuse Disk”, so the tally is as follows.

“Use Disk” = 1, “Unuse Disk” = 2, “Loop Down Disk” = 1
In FIG. 3B, on the other hand, there are the following two blockage conditions for RAID5.
・“Unuse Disk” = 0, “Loop Down Disk” = 2 or more
・“Unuse Disk” = 1, “Loop Down Disk” = 1 or more
The tally above meets neither of these two conditions, so the group is incorrectly determined not to be blocked.

For this reason, special case 1 is adopted and the process of FIG. 6 is executed. That is, the copy destination disk of a Redundant Copy in progress (disk HS2 in this example) is excluded from the tally. The tally when the process of FIG. 6 is executed is therefore as follows.

“Use Disk” = 1, “Unuse Disk” = 1, “Loop Down Disk” = 1
This meets the following one of the two blockage conditions:
・“Unuse Disk” = 1, “Loop Down Disk” = 1 or more
The RAID group is therefore determined to be blocked (no incorrect determination occurs).

Next, the reason for using special case 2 is explained with a specific example.
As already described and as shown on the right side of FIG. 12, Redundant Copy (Sparing) has a state called “with write failure” (SpW). This state was introduced to improve the completion rate of Redundant Copy. As explained for the prior art, when a write is issued to all disks, the write may fail on the copy source of the Redundant Copy. In that case the disk may be kept out of the failed state so that the copy can continue. FIG. 8C shows an example. Disks P2 and HS1 are in the Av state, and a Redundant Copy was executed with disk P1 as the copy source and disk HS2 as the copy destination, but a write to disk P1 has failed. As illustrated, disk P1 is then placed in the “Spr+WF” state rather than the Br state, and the copy continues. The figure shows the case in which, in this state, the access path to disk P2 is lost for some reason.

Because old data remains on disk P1, where the write failed, data must not be read from disk P1. In addition, RAID5 requires at least two readable disks (disks in the Av state) to remain. In the state shown in FIG. 8C, unless the group is blocked, the old data may be read, which can lead to data corruption.

However, when the process of FIG. 6 is executed for the state of FIG. 8C, the tally is as follows.
“Use Disk” = 2, “Unuse Disk” = 0, “Loop Down Disk” = 1
This meets neither of the two blockage conditions, so the group would not be blocked. In the process of FIG. 7, therefore, disk P1 is counted as “Unuse Disk” rather than “Use Disk”. In other words, a Redundant Copy source disk on which a write has failed is treated like a failed disk, and the RAID group is treated as being in the same state as one without redundancy.

When the process of FIG. 7 is executed for the state of FIG. 8(c), the tally is as follows:
“Use Disk” = 1, “Unuse Disk” = 1, “Loop Down Disk” = 1
One of the two blocking conditions is therefore met, so the RAID group is determined to be blocked (the appropriate determination is made).
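The two tallies for FIG. 8(c) can be reproduced with a minimal Python sketch. The state strings, the `classify` and `tally` helpers, the `spw_as_unuse` flag, and the rule that a lost access path takes precedence over the disk state are all assumptions made for illustration, not identifiers or rules taken from the embodiment. The RAID5 check uses the two blocking conditions given for RAID5 in this description (“Unuse Disk” = 0 and “Loop Down Disk” of 2 or more, or “Unuse Disk” = 1 and “Loop Down Disk” of 1 or more).

```python
from collections import Counter

# Hypothetical disk records for FIG. 8(c): (name, state, has_access_path).
# "SpW" is a Redundant Copy source with a write failure; "CopyDest" is the copy destination.
FIG_8C = [
    ("P1", "SpW", True),
    ("P2", "Av", False),        # access path lost
    ("HS1", "Av", True),
    ("HS2", "CopyDest", True),
]

def classify(state, has_path, spw_as_unuse):
    """Return the category of one disk, or None if it is excluded from the tally."""
    if state == "CopyDest":
        return None                          # copy destination is not counted
    if not has_path:
        return "Loop Down Disk"              # assumption: path loss is checked first
    if state == "Br":
        return "Unuse Disk"
    if state == "SpW":
        return "Unuse Disk" if spw_as_unuse else "Use Disk"
    return "Use Disk"

def tally(disks, spw_as_unuse):
    """Count the disks of one RAID group per category."""
    counts = Counter({"Use Disk": 0, "Unuse Disk": 0, "Loop Down Disk": 0})
    for _name, state, has_path in disks:
        category = classify(state, has_path, spw_as_unuse)
        if category is not None:
            counts[category] += 1
    return dict(counts)

def raid5_should_block(counts):
    """Apply the two RAID5 blocking conditions to a tally."""
    unuse, loop_down = counts["Unuse Disk"], counts["Loop Down Disk"]
    return (unuse == 0 and loop_down >= 2) or (unuse == 1 and loop_down >= 1)

before = tally(FIG_8C, spw_as_unuse=False)   # P1 counted as "Use Disk"
after = tally(FIG_8C, spw_as_unuse=True)     # P1 counted as "Unuse Disk"
print(before, raid5_should_block(before))
print(after, raid5_should_block(after))
```

With the flag off, the group is left unblocked even though only one readable disk remains; with the flag on, the second RAID5 condition is met and the group is blocked.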

When a RAID group is blocked, whether by any of the processes of FIGS. 4 to 7 described above, by the conventional technique, or by some other existing technique, all access from the host device is naturally rejected, so the processing that the host device was executing ends abnormally. This poses no problem for a mission-critical system (stopping the processing immediately affects the system less than letting access stall). The second embodiment below is proposed with other systems in mind, namely systems in which the host would rather tolerate a longer response time and have recovery of the RAID device take priority. That is, if recovering the blocked RAID group takes a long time there is no alternative, but if it can be recovered in a short time, it is considered better not to inform the host of the blockage even though the RAID group is actually blocked.

For this reason, in the second embodiment, when the RAID group is actually blocked but can be recovered in a short time, a dummy response (here, Busy) is returned to host access. A host that receives Busy retries after a short wait and does not abnormally end the processing it was executing. Busy is also returned to each retry. In this way, the blocked RAID group is restored (recovery processing is performed) while the host repeats its retries. If the recovery fails, however, the host would repeat the retries indefinitely, which would adversely affect the system, so the occurrence of the RAID blockage is reported immediately.

Recovery is possible in a short time in the following cases.
(a) A component failure for which the automatic recovery function of the RAID device operates. (When a failure for which the automatic recovery function operates and a failure for which it does not operate occur at the same time, the case is still treated as recoverable in a short time. This is because access is performed over two routes through the two BRTs, so the RAID group becomes usable if even one route is restored by the automatic recovery function.)
(b) A component failure in which the failure of one disk causes other disks to spin down.

Regarding (a), specifically, when a BRT port has failed because, for example, the CE (an operator, that is, a person) forcibly placed it in the failed state, the automatic recovery function does not operate. On the other hand, when a BRT port has likewise failed but was disconnected as abnormal by the RAID device itself (for example, when the PBC disconnected a disk that it judged abnormal), the automatic recovery function operates.

The disconnection of a disk that the PBC has judged abnormal is explained below. Suppose, for example, that disks A, B, and C are connected to a port of a BRT and form an FC loop in the order disk A → disk B → disk C. If disk A malfunctions and the loop is broken, the BRT port may be judged to have failed even though disk A is the actual cause. In this case, checking each disk with the check function that the PBC conventionally provides reveals that disk A is the cause, so the problem is resolved when the PBC disconnects disk A.

When the RAS notifies the RAID management/control unit 22 that a failure has occurred, it attaches information (a Factor) indicating by which route the failure was determined (in the example above, whether it was caused by an operator or determined by the device). The RAID management/control unit 22 also records this Factor in the configuration information 22a. Whether the failure is one for which the automatic recovery function operates may be determined by referring to the configuration information 22a, or at the time the Factor is reflected in the configuration information 22a. The same applies to (b) above: the Factor is attached and reflected in the configuration information 22a not only for component failures but also when a disk is treated as failed, so the determination is made by referring to this Factor.
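The Factor-based determination can be sketched in Python as follows. The `Factor` enum, its member names, the `FailureRecord` type, and the `recoverable_in_short_time` helper are hypothetical; the description only states that a Factor is attached to each failure notification, recorded in the configuration information 22a, and consulted to decide whether cases (a) or (b) apply.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Factor(Enum):
    """Hypothetical values for the route by which a failure was determined."""
    FORCED_BY_OPERATOR = auto()      # CE forced the failure: no automatic recovery
    DETACHED_BY_DEVICE = auto()      # RAID device (e.g. PBC) detached it: case (a)
    SPINDOWN_BY_OTHER_DISK = auto()  # spun down because another disk failed: case (b)

@dataclass
class FailureRecord:
    """One failure entry as it might be kept in the configuration information."""
    part: str
    factor: Factor

# Factors treated as recoverable in a short time (assumed mapping to cases (a) and (b)).
SHORT_TIME_RECOVERABLE = {Factor.DETACHED_BY_DEVICE, Factor.SPINDOWN_BY_OTHER_DISK}

def recoverable_in_short_time(failures):
    """True if at least one recorded failure falls under case (a) or (b).

    A mix of recoverable and non-recoverable failures still counts, because
    restoring one of the two BRT routes is enough to regain access.
    """
    return any(f.factor in SHORT_TIME_RECOVERABLE for f in failures)

mixed = [
    FailureRecord("BRT 4 port", Factor.FORCED_BY_OPERATOR),
    FailureRecord("BRT 5 port", Factor.DETACHED_BY_DEVICE),
]
print(recoverable_in_short_time(mixed))
```

Using `any` reflects the rule in (a) that a simultaneous recoverable and non-recoverable failure is still handled as recoverable in a short time.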

FIG. 9 is a flowchart of the processing of the second embodiment, based on the process of FIG. 7.
In FIG. 9, processing steps that are the same as those in FIG. 7 are given the same step numbers, and their description is omitted.

In the process of FIG. 7, when it is determined in step S22 that the RAID group is to be blocked (step S22, YES), the process of step S23 (blocking the RAID group) is executed. In the process of FIG. 9, steps S51 to S53 are executed in place of step S23.

That is, when the determination in step S22 is YES, it is determined whether the failure that caused the RAID blockage is a component failure of type (a) or (b) above (step S51). If it is (step S51, YES), the RAID group is blocked and the I/O control unit 21 is notified that recovery is in progress (step S52). On receiving this notification, the I/O control unit 21 returns a dummy response (here, Busy) to host access.

If the failure is not a component failure of type (a) or (b) (step S51, NO), the ordinary RAID blocking process is executed as in FIG. 7 (step S53).
FIG. 10 is a flowchart of the processing performed on recovery from a RAID blockage.

In FIG. 10, steps S61 to S66 are the same as conventional processing. That is, when a failed component becomes ready to be incorporated into the system again through component replacement, automatic recovery, or the like (step S61) and the incorporation succeeds (step S62, YES), the RAID recovery determination process is performed (step S63). If it is determined that the RAID group can be restored (step S64, YES), the RAID group is restored (step S65). If it is determined that the RAID group cannot be restored (step S64, NO), nothing is done.

If the incorporation fails (step S62, NO), normally nothing is done (step S69). However, if the process of step S52 was performed (step S67, YES), the host is still retrying, so the recovery failure must be reported to the I/O control unit 21 so that host access is terminated with an error (step S68).

As described above, in the second embodiment, when a RAID blockage occurs but can be recovered in a short time, the host is kept waiting for a while and host access is permitted as soon as the RAID recovery completes, so access can resume without the host access ending abnormally. If the RAID recovery fails, on the other hand, host access is terminated abnormally at once so that the host does not continue futile retries.
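The behaviour of the second embodiment can be summarized in a small Python model. This is only a sketch: the `RaidGroup` class, its method names, and the `OK` and `ERROR` response values are assumptions introduced for illustration, and the step numbers in the comments indicate the flowchart steps each branch is meant to mirror.

```python
from enum import Enum

class HostResponse(Enum):
    OK = "OK"          # access accepted (group not blocked)
    BUSY = "Busy"      # dummy response: the host retries later
    ERROR = "Error"    # host access ends abnormally

class RaidGroup:
    """Hypothetical view of one RAID group as seen by the I/O control unit."""

    def __init__(self):
        self.blocked = False
        self.recovering = False   # True once the "recovery in progress" notice was sent

    def block(self, short_time_recoverable):
        # Steps S51 to S53: block the group; only cases (a)/(b) are reported as recovering.
        self.blocked = True
        self.recovering = short_time_recoverable

    def on_reintegration(self, succeeded, can_restore=True):
        # Steps S61 to S69 of the recovery flow.
        if succeeded:
            if can_restore:               # S64 YES: restore the group (S65)
                self.blocked = False
                self.recovering = False
            # S64 NO: nothing is done
        elif self.recovering:             # S62 NO and S67 YES
            self.recovering = False       # S68: recovery failure reported, stop returning Busy
        # S62 NO and S67 NO: nothing is done (S69)

    def handle_host_access(self):
        if not self.blocked:
            return HostResponse.OK
        return HostResponse.BUSY if self.recovering else HostResponse.ERROR

group = RaidGroup()
group.block(short_time_recoverable=True)
first = group.handle_host_access()        # host is held off while recovery runs
group.on_reintegration(succeeded=True)
second = group.handle_host_access()       # access resumes after recovery

group.block(short_time_recoverable=True)
group.on_reintegration(succeeded=False)
third = group.handle_host_access()        # recovery failed: stop the retries
print(first, second, third)
```

Keeping the `recovering` flag separate from `blocked` lets the same model cover the ordinary blockage of step S53, where the host is refused immediately.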

(Supplementary Note 1) A controller module in a RAID device having a RAID group composed of a plurality of disks, the controller module comprising:
RAID management/control means for, each time a specific event for which RAID blockage should be determined occurs in the RAID device, classifying, for each RAID group subject to the blockage determination, each disk belonging to the RAID group into one of a plurality of categories based on the state of the disk or the presence or absence of an access path to the disk, counting the number of disks in each category, and determining whether to block the RAID group by comparing the counts with a preset threshold condition.

(Supplementary Note 2) The controller module according to Supplementary Note 1, wherein the plurality of categories are “Use Disk”, “Unuse Disk”, and “Loop Down Disk”,
the threshold condition is set for each RAID level, and the determination is performed using the threshold condition corresponding to the RAID level of the RAID group being processed.

(Supplementary Note 3) The controller module according to Supplementary Note 1 or 2, wherein, basically, “Use Disk” is a disk that can be accessed, “Unuse Disk” is a disk that cannot be accessed because the disk has failed, and “Loop Down Disk” is a disk that cannot be accessed because its access path has been lost.

(Supplementary Note 4) The controller module according to Supplementary Note 2, wherein, when the RAID level is RAID0, the threshold condition is that “Unuse Disk” is 0 and “Loop Down Disk” is 1 or more, and
a RAID group whose RAID level is RAID0 and whose counts satisfy the threshold condition is determined to be blocked.

(Supplementary Note 5) The controller module according to Supplementary Note 2, wherein, when the RAID level is RAID1 or RAID0+1, the threshold condition is that “Use Disk” is 0 and “Loop Down Disk” is 1 or more, and
a RAID group whose RAID level is RAID1 or RAID0+1 and whose counts satisfy the threshold condition is determined to be blocked.

(Supplementary Note 6) The controller module according to Supplementary Note 2, wherein, when the RAID level is RAID5 or RAID0+5, the threshold condition is that “Unuse Disk” is 0 and “Loop Down Disk” is 2 or more, or that “Unuse Disk” is 1 and “Loop Down Disk” is 1 or more, and
a RAID group whose RAID level is RAID5 or RAID0+5 and whose counts satisfy either of these two threshold conditions is determined to be blocked.

(Supplementary Note 7) The controller module according to Supplementary Note 2, wherein the copy destination disk of a Redundant Copy in progress is excluded from the count.
(Supplementary Note 8) The controller module according to Supplementary Note 2, wherein a disk with a “Write failure”, even if it is in the Sparing state, is classified as “Unuse Disk” rather than “Use Disk”.

(Supplementary Note 9) A controller module in a RAID device having a RAID group composed of a plurality of disks, the controller module comprising:
I/O control means serving as an interface between the controller module and an arbitrary external host device; and
RAID management/control means for determining whether to block an arbitrary RAID group in the RAID device and for managing and controlling execution of the blockage,
wherein the RAID management/control means, when blocking the RAID group, determines whether the RAID group can be recovered in a short time and, if so, notifies the I/O control means to that effect, and
the I/O control means, when it has received the notification and the host device requests access to the blocked RAID group, returns a dummy response to the host device.

(Supplementary Note 10) The controller module according to Supplementary Note 9, wherein the dummy response is Busy, and
the Busy response causes the host device to repeat retry processing.

(Supplementary Note 11) The controller module according to Supplementary Note 9, wherein recovery is possible in a short time in the case of a component failure for which the automatic recovery function of the RAID device operates, or in the case where the failure of an arbitrary disk has caused other disks to spin down.

(Supplementary Note 12) A RAID device comprising:
a RAID group composed of a plurality of disks; and
a controller module that, each time a specific event for which RAID blockage should be determined occurs in the RAID device, classifies, for each RAID group subject to the blockage determination, each disk belonging to the RAID group into one of a plurality of categories based on the state of the disk or the presence or absence of an access path to the disk, counts the number of disks in each category, and determines whether to block the RAID group by comparing the counts with a preset threshold condition.

(Supplementary Note 13) A RAID device comprising:
I/O control means serving as an interface between the RAID device and an arbitrary external host device; and
RAID management/control means for determining whether to block an arbitrary RAID group in the RAID device and for managing and controlling execution of the blockage,
wherein the RAID management/control means, when blocking the RAID group, determines whether the RAID group will be restored in a short time and, if so, notifies the I/O control means to that effect, and
the I/O control means, when it has received the notification and the host device requests access to the blocked RAID group, returns a dummy response to the host device.

(Supplementary Note 14) A program for causing a computer in a RAID device having a RAID group composed of a plurality of disks to realize:
a function of, each time a specific event for which RAID blockage should be determined occurs in the RAID device, classifying, for each RAID group subject to the blockage determination, each disk belonging to the RAID group into one of a plurality of categories based on the state of the disk and the presence or absence of an access path to the disk, and counting the number of disks in each category; and
a function of determining whether to block the RAID group by comparing the counts with a preset threshold condition.

(Supplementary Note 15) A program for causing a computer in a RAID device to realize:
a function of determining whether to block an arbitrary RAID group in the RAID device and of managing and controlling execution of the blockage; and
a function of, when blocking the RAID group, determining whether the RAID group will be restored in a short time and, if it is determined that it will be and an external host device requests access to the blocked RAID group, returning a dummy response to the host device.

(Supplementary Note 16) A RAID blockage determination method in a controller module in a RAID device having a RAID group composed of a plurality of disks, the method comprising:
each time a specific event for which RAID blockage should be determined occurs in the RAID device, classifying, for each RAID group subject to the blockage determination, each disk belonging to the RAID group into one of a plurality of categories based on the state of the disk and the presence or absence of an access path to the disk, and counting the number of disks in each category; and
determining whether to block the RAID group by comparing the counts with a preset threshold condition.
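
The per-RAID-level threshold conditions of Supplementary Notes 4 to 6 can be collected into a single lookup, sketched below in Python. The function and table names are hypothetical, and the sketch encodes only the conditions stated in those notes; it takes the three counts as given and does not perform the classification itself.

```python
def _raid0(use, unuse, loop_down):
    # Supplementary Note 4: "Unuse Disk" is 0 and "Loop Down Disk" is 1 or more.
    return unuse == 0 and loop_down >= 1

def _raid1(use, unuse, loop_down):
    # Supplementary Note 5: "Use Disk" is 0 and "Loop Down Disk" is 1 or more.
    return use == 0 and loop_down >= 1

def _raid5(use, unuse, loop_down):
    # Supplementary Note 6: either of the two conditions is sufficient.
    return (unuse == 0 and loop_down >= 2) or (unuse == 1 and loop_down >= 1)

# Threshold condition selected by RAID level (names of the keys are assumed).
BLOCK_CONDITIONS = {
    "RAID0": _raid0,
    "RAID1": _raid1,
    "RAID0+1": _raid1,
    "RAID5": _raid5,
    "RAID0+5": _raid5,
}

def should_block(raid_level, use, unuse, loop_down):
    """Return True if the counts satisfy the blocking condition for the RAID level."""
    return BLOCK_CONDITIONS[raid_level](use, unuse, loop_down)

print(should_block("RAID5", use=1, unuse=1, loop_down=1))
print(should_block("RAID1", use=1, unuse=0, loop_down=1))
```

Keeping the conditions in a table keyed by RAID level mirrors the statement that the threshold condition is set per RAID level and chosen according to the group being processed.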

FIG. 1 is a block diagram of the RAID device of this example.
FIG. 2 is a hardware configuration diagram of the CM shown in FIG. 1.
FIGS. 3(a) and 3(b) are diagrams for explaining the blockage determination method of the present technique.
FIG. 4 is a flowchart (part 1) of the RAID blockage determination process of this example.
FIG. 5 is a flowchart (part 2) of the RAID blockage determination process of this example.
FIG. 6 is a flowchart (part 3) of the RAID blockage determination process of this example.
FIG. 7 is a flowchart (part 4) of the RAID blockage determination process of this example.
FIGS. 8(a) to 8(c) are diagrams for explaining the processes of FIGS. 4 to 7 using specific examples.
FIG. 9 is a processing flowchart (part 1) of the second embodiment of this example.
FIG. 10 is a processing flowchart (part 2) of the second embodiment of this example.
FIG. 11 is a schematic configuration diagram of a conventional RAID system.
FIG. 12 is a diagram for explaining a conventional RAID blockage determination method.
FIGS. 13(a) and 13(b) are diagrams for explaining RLU/DLU.
FIGS. 14(a) to 14(j) are diagrams for explaining the states that a RAID group can take.
FIG. 15 is a diagram for explaining “BRT crossing”.

Explanation of symbols

1 RAID device
2 (2a, 2b) Host
10 (10a, 10b) CM
3 FRT
4, 5 BRT
6, 7 DE
6a, 6b, 7a, 7b PBC
6c, 7c Disk group
21 I/O control unit
22 RAID management/control unit
22a Configuration information
31 DI
32 DMA
33, 34 CPU
35 MCH (Memory Controller Hub)
36 Memory
37 CA

Claims

A controller module in a RAID device having a RAID group composed of a plurality of disks, the controller module comprising:
RAID management/control means that,
each time a specific event for which RAID blockage should be determined occurs in the RAID device, classifies, for each RAID group subject to the blockage determination, each disk belonging to the RAID group into one of a plurality of classification units based on the presence or absence of an access path to the disk and the state of the disk, and then counts, for each of the plurality of classification units, the number of disks of the RAID group in which the specific event occurred that belong to that classification unit,
the plurality of classification units including a classification unit corresponding to the case where there is no access path to a disk belonging to the RAID group,
compares at least the count of disks belonging to the classification unit corresponding to the case where there is no access path with a corresponding preset threshold condition, and
determines that the RAID group is to be blocked when that count satisfies the corresponding threshold condition.

The controller module according to claim 1, wherein the threshold condition is set for each RAID level, and the determination is performed using the threshold condition corresponding to the RAID level of the RAID group being processed.

The controller module according to claim 2, wherein the plurality of classification units include a first classification unit and a third classification unit,
the classification unit corresponding to the case where there is no access path to a disk belonging to the RAID group is the third classification unit,
when the RAID level is RAID1 or RAID0+1, the threshold condition is that the number of disks, in the RAID group in which the specific event occurred, belonging to the first classification unit is 0 and the number of such disks belonging to the third classification unit is 1 or more, and
a RAID group whose RAID level is RAID1 or RAID0+1 and whose counts satisfy the threshold condition is determined to be blocked.

The controller module according to claim 2, wherein the plurality of classification units include a second classification unit and a third classification unit,
the classification unit corresponding to the case where there is no access path to a disk belonging to the RAID group is the third classification unit,
when the RAID level is RAID5 or RAID0+5, the threshold condition is that the number of disks, in the RAID group in which the specific event occurred, belonging to the second classification unit is 0 and the number of such disks belonging to the third classification unit is 2 or more, or that the number of such disks belonging to the second classification unit is 1 and the number of such disks belonging to the third classification unit is 1 or more, and
a RAID group whose RAID level is RAID5 or RAID0+5 and whose counts satisfy either of these two threshold conditions is determined to be blocked.

The controller module according to claim 1, further comprising I/O control means serving as an interface between the controller module and an arbitrary external host device,
wherein the RAID management/control means, when blocking an arbitrary RAID group, determines whether the RAID group can be recovered in a short time and, if so, notifies the I/O control means to that effect, and
the I/O control means, when it has received the notification and the host device requests access to the blocked RAID group, returns a dummy response to the host device.

A RAID device comprising:
a RAID group composed of a plurality of disks; and
a controller module that,
each time a specific event for which RAID blockage should be determined occurs in the RAID device, classifies, for each RAID group subject to the blockage determination, each disk belonging to the RAID group into one of a plurality of classification units based on the presence or absence of an access path to the disk and the state of the disk, and then counts, for each of the plurality of classification units, the number of disks of the RAID group in which the specific event occurred that belong to that classification unit,
the plurality of classification units including a classification unit corresponding to the case where there is no access path to a disk belonging to the RAID group,
compares at least the count of disks belonging to the classification unit corresponding to the case where there is no access path with a corresponding preset threshold condition, and
determines that the RAID group is to be blocked when that count satisfies the corresponding threshold condition.

A RAID blockage determination method executed by a controller module in a RAID device having a RAID group composed of a plurality of disks, the method comprising:
each time a specific event for which RAID blockage should be determined occurs in the RAID device, classifying, for each RAID group subject to the blockage determination, each disk belonging to the RAID group into one of a plurality of classification units based on the presence or absence of an access path to the disk and the state of the disk, and then counting, for each of the plurality of classification units, the number of disks of the RAID group in which the specific event occurred that belong to that classification unit,
the plurality of classification units including a classification unit corresponding to the case where there is no access path to a disk belonging to the RAID group;
comparing at least the count of disks belonging to the classification unit corresponding to the case where there is no access path with a corresponding preset threshold condition; and
determining that the RAID group is to be blocked when that count satisfies the corresponding threshold condition.

A program for causing a computer in a RAID device having a RAID group composed of a plurality of disks to realize:
a function of, each time a specific event for which RAID blockage should be determined occurs in the RAID device, classifying, for each RAID group subject to the blockage determination, each disk belonging to the RAID group into one of a plurality of classification units based on the presence or absence of an access path to the disk and the state of the disk, and then counting, for each of the plurality of classification units, the number of disks of the RAID group in which the specific event occurred that belong to that classification unit, the plurality of classification units including a classification unit corresponding to the case where there is no access path to a disk belonging to the RAID group; and
a function of comparing at least the count of disks belonging to the classification unit corresponding to the case where there is no access path with a corresponding preset threshold condition, and determining that the RAID group is to be blocked when that count satisfies the corresponding threshold condition.