JP2021170261A

JP2021170261A - Storage control device and control program

Info

Publication number: JP2021170261A
Application number: JP2020073527A
Authority: JP
Inventors: 明三瓶; Akira Sanpei; 文夫榛澤; Fumio Hanzawa
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2020-04-16
Filing date: 2020-04-16
Publication date: 2021-10-28

Abstract

To enable a storage system having a redundant path configuration to execute efficient preventive maintenance.

SOLUTION: A storage control device includes: a disconnection processing part that disconnects a connection with a first module 22 in a pseudo manner if a recoverable failure occurs in the first module 22 in a primary path; a connection processing part that enables a connection with a second module 22 in a secondary path and executes an access test; and a specification part that specifies a cause of the failure according to a result of the access test by the connection processing part.

SELECTED DRAWING: Figure 7

Description

The present invention relates to a storage control device and a control program.

In recent enterprise Redundant Arrays of Independent Disks (RAID) controllers, the data transfer communication path between the RAID controller and the disks is sometimes made redundant. As a result, even if a failure occurs in one of the data transfer communication paths, operation can continue on the remaining path.

Japanese Unexamined Patent Publication No. 2009-205316
Japanese Unexamined Patent Publication No. 2006-164304

However, even in a device whose data transfer communication path is made redundant, operation may stop if both data transfer communication paths fail.

FIG. 1 is a diagram illustrating preventive maintenance in a RAID system 600.

Preventive maintenance is the preventive replacement of a field-replaceable component in the device that has fallen into a minor failure state but still allows operation to continue. Preventive maintenance processing includes, for example, disconnecting the failed component and replacing it.

The RAID system 600 includes a plurality of (two in the illustrated example) Controller Modules (CM) 6 (which may be referred to as CM#0 and CM#1) and a plurality of (three in the illustrated example) Disk Enclosures (DE) 7 (which may be referred to as DE#01 to DE#03).

Each DE 7 includes a plurality of disks 71 and two Input Output Modules (IOM) 72 (which may be referred to as IOM#0 and IOM#1). For simplicity, FIG. 1 shows only one disk 71 in each DE 7.

Each CM 6 includes a Central Processing Unit (CPU) 61 and an Expander (EXP) 62. The CPU 61 functions as an Input Output Controller (IOC) 601.

In a redundant back-end path configuration, if a normal component is mistakenly disconnected for maintenance after preventive maintenance of one IOM 72 of a DE 7 has started, both paths of the DE 7 are blocked and all the disks 71 in the DE 7 become inaccessible.

Such a situation can arise from either a single-point failure or a two-point failure. A single-point failure is assumed to be caused by a simple work mistake by maintenance personnel. A two-point failure is assumed to involve abnormalities on both paths: one is a failure that makes operation impossible and whose suspected location cannot be identified, and the other is a minor failure that permits continued operation and whose suspected location can be identified.

FIG. 1 shows the two-point failure case. IOM#0 of DE#02 is failed component 1 and is in a recoverable failure mode (in other words, a minor abnormality) that is subject to preventive maintenance. IOM#1 of DE#02 is failed component 2 and is in an unrecoverable failure mode caused by a silent failure (in other words, a severe abnormality).

The following describes, with reference to FIG. 1, how all paths become blocked as a result of mistakenly disconnecting a normal component for maintenance.

First, as indicated by reference sign A1, a silent failure, which is an abnormality that cannot be detected, occurs in IOM#1 of DE#02. Next, as indicated by A2, a recoverable failure occurs in IOM#0 of DE#02, and preventive maintenance is carried out. As a result, as indicated by A3, the path on the IOM#0 side of each DE 7 becomes unusable. Meanwhile, as indicated by A4, access to the disks 71 starts over the path on the IOM#1 side of each DE 7. Then, as indicated by A5, the abnormality caused by the silent failure in IOM#1 of DE#02 surfaces, and both paths are blocked.

If the IOM 72 in the unrecoverable failure mode is replaced, operation can continue over the path on the side of the IOM 72 in the recoverable failure mode. However, as shown in FIG. 1, if preventive maintenance is performed on the IOM 72 in the recoverable failure mode, both paths become unusable.

In addition, a patrol that involves data transfer degrades host Input Output (IO) performance, so such a patrol cannot always be run continuously.

In one aspect, an object is to enable efficient preventive maintenance in a storage system having a redundant path configuration.

In one aspect, a storage control device is a storage control device to which a plurality of storage device groups are cascade-connected by a primary path and a secondary path, and includes: a disconnection processing unit that pseudo-disconnects the connection with a first module in the primary path when a recoverable failure occurs in the first module; a connection processing unit that enables the connection with a second module in the secondary path and performs an access test; and an identification unit that identifies the cause of the failure according to the result of the access test performed by the connection processing unit.

In one aspect, efficient preventive maintenance can be performed in a storage system having a redundant path configuration.

FIG. 1 is a diagram illustrating preventive maintenance in a RAID system.
FIG. 2 is a block diagram schematically showing a hardware configuration example of a RAID system in an example of the embodiment.
FIG. 3 is a block diagram schematically showing a software configuration example of the CM shown in FIG. 2.
FIG. 4 is a diagram explaining the detection process for a component subject to preventive maintenance in the RAID system shown in FIG. 2.
FIG. 5 is a table showing suspected location management information according to the detection result for the component subject to preventive maintenance shown in FIG. 4.
FIG. 6 is a diagram explaining the disconnection process for the component subject to preventive maintenance in the RAID system shown in FIG. 2.
FIG. 7 is a diagram explaining a first example of the access test process for the redundant path in the RAID system shown in FIG. 2.
FIG. 8 is a diagram explaining a second example of the access test process for the redundant path in the RAID system shown in FIG. 2.
FIG. 9 is a table showing suspected location management information according to the access test result for the redundant path shown in FIG. 8.
FIG. 10 is a flowchart explaining the preventive maintenance process in the RAID system shown in FIG. 2.

Hereinafter, an embodiment will be described with reference to the drawings. The embodiment shown below is merely an example, and there is no intention to exclude modifications or applications of techniques that are not explicitly described. That is, the present embodiment can be modified in various ways without departing from its gist.

Each figure is not intended to show only the components illustrated in it; other functions and the like may be included.

In the figures, the same reference signs denote the same parts, so repeated descriptions are omitted.

[A] Example of Embodiment
[A-1] System Configuration Example
FIG. 2 is a block diagram schematically showing a hardware configuration example of a RAID system 100 in an example of the embodiment.

The RAID system 100 is an example of a storage system and includes a plurality of (two in the illustrated example) CMs 1 (which may be referred to as CM#0 and CM#1) and a plurality of (three in the illustrated example) DEs 2 (which may be referred to as DE#01 to DE#03).

Each DE 2 is cascade-connected to CM#0 and CM#1 by a primary path and a secondary path. The primary path connects CM#0 to DE#01, DE#02, and DE#03 in that order. The secondary path connects CM#1 to DE#03, DE#02, and DE#01 in that order.
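The cabling order described above can be pictured with a minimal Python sketch. The list names and the helper `hops_to` are illustrative assumptions, not part of the disclosure; the sketch only encodes the stated connection order and the fact that the primary path uses IOM#0 and the secondary path uses IOM#1 of each DE.

```python
# Hypothetical model of the cascade order stated in the text.
# PRIMARY_PATH: CM#0 -> DE#01 -> DE#02 -> DE#03 (via IOM#0 of each DE)
# SECONDARY_PATH: CM#1 -> DE#03 -> DE#02 -> DE#01 (via IOM#1 of each DE)
PRIMARY_PATH = ["DE#01", "DE#02", "DE#03"]
SECONDARY_PATH = list(reversed(PRIMARY_PATH))


def hops_to(de, path):
    """Return the DEs whose IOMs a request passes through to reach `de`."""
    return path[: path.index(de) + 1]


print(hops_to("DE#02", PRIMARY_PATH))    # DEs traversed from CM#0
print(hops_to("DE#02", SECONDARY_PATH))  # DEs traversed from CM#1
```

Because the two cascades run in opposite directions, a DE in the middle of the chain is reached through different neighbouring enclosures on each path.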

The DE 2 is an example of a storage device group and includes a plurality of disks 21 and two IOMs 22 (which may be referred to as IOM#0 and IOM#1). For simplicity, FIG. 2 shows only one disk 21 in each DE 2.

The disk 21 is an example of a storage device and stores various information in response to commands from the CM 1. The IOM 22 is an example of a module and relays communication with the CM 1 or another DE 2.

The CM 1 is an example of a storage control device and includes a CPU 11 and an EXP 12.

The EXP 12 relays communication with each DE 2 or with the CM 1 of the other system.

FIG. 3 is a block diagram schematically showing a software configuration example of the CM 1 shown in FIG. 2.

The CPU 11 functions as an IOC 101 as shown in FIG. 2, and also functions as a disconnection processing unit 111, a connection processing unit 112, and an identification unit 113 as shown in FIG. 3.

The IOC 101 controls communication with each DE 2 or with the CM 1 of the other system.

The disconnection processing unit 111 pseudo-disconnects the connection with an IOM 22 in the primary path when a recoverable failure occurs in that IOM 22. In the cascade connection of the primary path, the disconnection processing unit 111 may pseudo-disconnect the connections one at a time, starting from the IOM 22 of the DE 2 connected at the end of the cascade.

The connection processing unit 112 enables the connection with the IOM 22 in the secondary path and performs an access test. The connection processing unit 112 may enable the connections one at a time and perform the access test on the secondary-path-side IOM 22 of each DE 2 whose connection has been pseudo-disconnected by the disconnection processing unit 111. In addition, when the identification unit 113 identifies the IOM 22 of the secondary path as the cause of the failure, the connection processing unit 112 may re-enable the connection with the IOM 22 of the primary path.

The identification unit 113 identifies the cause of the failure according to the result of the access test performed by the connection processing unit 112. For example, when the result of the access test on the IOM 22 of the secondary path is normal, the identification unit 113 identifies the IOM 22 of the primary path as the cause of the failure. When the result of the access test on the IOM 22 of the secondary path is abnormal, the identification unit 113 identifies the IOM 22 of the secondary path as the cause of the failure.
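The decision rule of the identification unit can be sketched in a few lines of Python. The function name `identify_cause` and the tuple representation of an IOM are assumptions made for illustration; only the rule itself (normal secondary test implicates the primary IOM, abnormal secondary test implicates the secondary IOM) comes from the description.

```python
def identify_cause(secondary_test_ok, primary_iom, secondary_iom):
    """Pick the module treated as the cause of the failure.

    secondary_test_ok: True if the access test over the secondary path
    completed normally, False if it reported an abnormality.
    """
    # Normal result: the secondary side works, so the primary IOM is blamed.
    # Abnormal result: the secondary IOM itself is blamed.
    return primary_iom if secondary_test_ok else secondary_iom


primary = ("DE#02", "IOM#0")
secondary = ("DE#02", "IOM#1")
print(identify_cause(True, primary, secondary))
print(identify_cause(False, primary, secondary))
```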

The device that controls the overall operation of the CM 1 is not limited to the CPU 11 and may be, for example, any one of an MPU, a DSP, an ASIC, a PLD, and an FPGA. It may also be a combination of two or more of a CPU, an MPU, a DSP, an ASIC, a PLD, and an FPGA. MPU stands for Micro Processing Unit, DSP for Digital Signal Processor, ASIC for Application Specific Integrated Circuit, PLD for Programmable Logic Device, and FPGA for Field Programmable Gate Array.

FIG. 4 is a diagram explaining the detection process for a component subject to preventive maintenance in the RAID system 100 shown in FIG. 2.

As indicated by reference sign B1, IOM#0 of DE#02 is detected as a failed component in which a recoverable error has occurred. As a result, the disk 21 reached via IOM#0 of DE#01 remains accessible, as indicated by B2, but access to the disk 21 via IOM#0 of DE#02 fails, as indicated by B3 and B4. Access to the disk 21 via IOM#0 of DE#03 also fails, as indicated by B5.
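The effect of a failed IOM on the primary cascade can be sketched as follows. This is a simplified model under the assumption that a failed IOM affects its own DE and every DE further from the controller on the same path; the names `PRIMARY_PATH` and `affected_on_primary` are hypothetical.

```python
# Primary cascade order from CM#0, as described for the RAID system 100.
PRIMARY_PATH = ["DE#01", "DE#02", "DE#03"]


def affected_on_primary(failed_de, path=PRIMARY_PATH):
    """Return the DEs whose primary-side access passes through the failed IOM."""
    return path[path.index(failed_de):]


print(affected_on_primary("DE#02"))
```

For a failure at IOM#0 of DE#02 this yields DE#02 and DE#03, while DE#01 stays reachable, which matches the situation marked B2 to B5.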

FIG. 5 is a table showing suspected location management information according to the detection result for the component subject to preventive maintenance shown in FIG. 4.

The suspected location management information associates each suspected location with a point value and a severe failure flag.

In the example shown in FIG. 5, no error has occurred in IOM#0 and IOM#1 of DE#01, IOM#1 of DE#02, or IOM#1 of DE#03, so their point values are set to the initial value "0" and their severe failure flags are set to "Off". In contrast, access through IOM#0 of DE#02 and IOM#0 of DE#03 is failing because of the recoverable error, so their point values are set to "10" while their severe failure flags remain at the initial value "Off".
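One way to hold the suspected location management information is a mapping from (DE, IOM) to a point value and a severe failure flag. The sketch below is an assumption about representation only; the class `SuspectEntry` and the function `initial_table` are hypothetical names, and the values reproduce the FIG. 5 example described above.

```python
from dataclasses import dataclass


@dataclass
class SuspectEntry:
    points: int = 0       # point value, initial value 0
    severe: bool = False  # severe failure flag, initial value Off


def initial_table(des=("DE#01", "DE#02", "DE#03"), ioms=("IOM#0", "IOM#1")):
    """Build a table with every suspected location at its initial values."""
    return {(de, iom): SuspectEntry() for de in des for iom in ioms}


table = initial_table()
# Recoverable error detected at IOM#0 of DE#02; DE#03 IOM#0 is affected too.
for location in [("DE#02", "IOM#0"), ("DE#03", "IOM#0")]:
    table[location].points = 10

for (de, iom), entry in table.items():
    flag = "On" if entry.severe else "Off"
    print(f"{de} {iom}: points={entry.points}, severe={flag}")
```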

FIG. 6 is a diagram explaining the disconnection process for the component subject to preventive maintenance in the RAID system 100 shown in FIG. 2.

As indicated by reference sign C1, IOM#0 of DE#02 is set as the target of preventive maintenance. As indicated by C2, IOM#0 of DE#03 is set to a pseudo-unusable state. Then, as indicated by C3, access to the disk 21 via IOM#1 of DE#03 is generated.

Here, setting a pseudo-unusable state means, for example, temporarily making the path unusable in the driver layer of the CM 1 firmware, without powering off the IOM 22 or disconnecting the Serial Attached Small Computer System Interface (SAS) connection. For example, the access response of the target EXP in the SAS driver layer is set to SASSTS=28 (Port Unavailable).
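A driver-layer mask of this kind might look like the following Python sketch. It is not the firmware's actual interface: the class `SasDriverStub`, its methods, and the use of 0 as a success status are assumptions for illustration. Only the status value 28 (Port Unavailable) is taken from the text.

```python
SASSTS_PORT_UNAVAILABLE = 28  # status value quoted in the description
SASSTS_OK = 0                 # assumption: placeholder for a normal response


class SasDriverStub:
    """Toy driver layer that can mask a path without touching the hardware."""

    def __init__(self):
        self._masked = set()

    def pseudo_disconnect(self, target):
        # No power-off and no link teardown: only the driver's view changes.
        self._masked.add(target)

    def restore(self, target):
        self._masked.discard(target)

    def access(self, target, send):
        """Return the status for an access; `send` performs the real I/O."""
        if target in self._masked:
            return SASSTS_PORT_UNAVAILABLE
        return send(target)


driver = SasDriverStub()
driver.pseudo_disconnect(("DE#03", "IOM#0"))
print(driver.access(("DE#03", "IOM#0"), lambda t: SASSTS_OK))
print(driver.access(("DE#03", "IOM#1"), lambda t: SASSTS_OK))
```

Keeping the mask in software means the path can be restored immediately if the secondary path turns out to be unhealthy.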

FIG. 7 is a diagram explaining a first example of the access test process for the redundant path in the RAID system 100 shown in FIG. 2.

As indicated by reference sign D1, IOM#0 of DE#02 is set to a pseudo-unusable state, in the same way as IOM#0 of DE#03 indicated by C2 in FIG. 6. Access is then performed via IOM#1 of DE#03 and IOM#1 of DE#02. As indicated by D2 and D3, no abnormality is detected in IOM#1 of DE#03 or IOM#1 of DE#02, so access to the disk 21 of DE#03 and the disk 21 of DE#02 takes place, as indicated by D4 and D5. As a result, as indicated by D6, IOM#0 of DE#02 is set as the first suspected location, that is, the component most likely to be the cause of the failure.

This allows a worker to perform preventive maintenance on the IOM 22 identified as the suspected location on the primary path side while communication on the secondary path side is maintained.

FIG. 8 is a diagram explaining a second example of the access test process for the redundant path in the RAID system 100 shown in FIG. 2.

As indicated by reference sign E1, IOM#0 of DE#02 is set to a pseudo-unusable state, in the same way as IOM#0 of DE#03 indicated by C2 in FIG. 6. Access is then performed via IOM#1 of DE#03 and IOM#1 of DE#02. As indicated by E2, no abnormality is detected in IOM#1 of DE#03, so access to the disk 21 of DE#03 takes place, as indicated by E4. In contrast, as indicated by E3, an abnormality is detected in IOM#1 of DE#02, so access to the disk 21 of DE#02 fails, as indicated by E5. As a result, as indicated by E6, IOM#1 of DE#02 is set as the first suspected location, that is, the component most likely to be the cause of the failure. In addition, as indicated by E7, IOM#0 of DE#02 is set as the second suspected location, that is, the component second most likely to be the cause of the failure.

If preventive maintenance were performed on the IOM 22 identified as the second suspected location on the primary path side, both the primary path side and the secondary path side would be blocked. For this reason, the IOM 22 that has the recoverable error and was identified as the second suspected location may be reconnected to the CM 1.

FIG. 9 is a table showing suspected location management information according to the access test result for the redundant path shown in FIG. 8.

In the suspected location management information shown in FIG. 9, compared with that shown in FIG. 5, the point value for IOM#1 of DE#02, which was set as the first suspected location at reference sign E6 in FIG. 8, is set to "100". When the point value exceeds a threshold, the severe failure flag is set to "On".
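The threshold rule can be sketched as below. The description does not state the threshold, so `SEVERE_THRESHOLD = 50` is purely an assumed value chosen to lie between the two point values that appear in the tables (10 and 100); the function name `record_points` and the dictionary layout are likewise hypothetical.

```python
SEVERE_THRESHOLD = 50  # assumption: the source does not give the threshold


def record_points(table, location, points):
    """Store a point value and derive the severe failure flag from it."""
    entry = table.setdefault(location, {"points": 0, "severe": False})
    entry["points"] = points
    # The flag turns On only when the point value exceeds the threshold.
    entry["severe"] = entry["points"] > SEVERE_THRESHOLD
    return entry


table = {}
print(record_points(table, ("DE#02", "IOM#0"), 10))
print(record_points(table, ("DE#02", "IOM#1"), 100))
```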

[A-2] Operation Example
The preventive maintenance process in the RAID system 100 shown in FIG. 2 will be described with reference to the flowchart (steps S1 to S9) shown in FIG. 10.

When preventive maintenance starts, the disconnection processing unit 111 puts the IOM 22 at the end of the SAS cascade into a pseudo-unusable state (step S1).

The connection processing unit 112 connects access over the reverse path (in other words, the secondary path) and determines whether an abnormality has occurred on the reverse path (step S2).

If no abnormality has occurred on the reverse path (see the NO route of step S2), the connection processing unit 112 determines whether the reverse-path connection test has been performed up to the DE 2 to which the suspected IOM 22 subject to preventive maintenance belongs (step S3).

If the reverse-path connection test has not yet been performed up to the DE 2 to which the suspected IOM 22 belongs (see the NO route of step S3), the disconnection processing unit 111 puts the IOM 22 that is one position closer to the IOC 101 in the SAS cascade into a pseudo-unusable state (step S4). The process then returns to step S2.

If the reverse-path connection test has been performed up to the DE 2 to which the suspected IOM 22 belongs (see the YES route of step S3), the identification unit 113 designates the IOM 22 in which the abnormality occurred on the original path (in other words, the primary path) as the first suspected location (step S5). The preventive maintenance process then ends.

If an abnormality has occurred on the reverse path in step S2 (see the YES route of step S2), the connection processing unit 112 maps the error location in the suspected location management information (step S6).

The connection processing unit 112 restores the IOMs 22 that were put into the pseudo-unusable state (step S7).

The identification unit 113 designates the IOM 22 mapped as the error location as the first suspected location (step S8).

The identification unit 113 designates the IOM 22 in which the abnormality occurred on the original path as the second suspected location (step S9). The preventive maintenance process then ends.
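Steps S1 to S9 can be condensed into the following Python sketch. It is a simplified model, not the firmware: the function `run_preventive_maintenance`, the callback `secondary_access_ok`, and the returned dictionary are hypothetical, and the sketch assumes that the primary path uses IOM#0 and the secondary path uses IOM#1 of each DE, as in the figures.

```python
def run_preventive_maintenance(primary_path, suspect_de, secondary_access_ok):
    """Walk the cascade from its end toward the controller.

    primary_path: DE names ordered from the controller outward.
    suspect_de: DE holding the primary-side IOM with the recoverable error.
    secondary_access_ok: callable(de) -> bool, the reverse-path access test.
    """
    if suspect_de not in primary_path:
        raise ValueError("suspect_de must be on the primary path")
    masked = []
    for de in reversed(primary_path):
        # S1 / S4: pseudo-disconnect the primary-side IOM of this DE.
        masked.append((de, "IOM#0"))
        # S2: access test over the reverse (secondary) path.
        if not secondary_access_ok(de):
            # S6 to S9: record the error location, restore the masked IOMs,
            # and rank the secondary IOM first and the primary IOM second.
            masked.clear()
            return {"first": (de, "IOM#1"),
                    "second": (suspect_de, "IOM#0"),
                    "masked": masked}
        # S3: stop once the DE of the suspected IOM has been tested.
        if de == suspect_de:
            # S5: the primary-side IOM is the first suspected location.
            return {"first": (suspect_de, "IOM#0"),
                    "second": None,
                    "masked": masked}


path = ["DE#01", "DE#02", "DE#03"]
print(run_preventive_maintenance(path, "DE#02", lambda de: True))
print(run_preventive_maintenance(path, "DE#02", lambda de: de != "DE#02"))
```

The first call corresponds to the FIG. 7 case, where the secondary path is healthy and the primary-side IOMs stay masked for maintenance. The second corresponds to the FIG. 8 case, where the secondary IOM of DE#02 fails the test and the masked primary-side IOMs are restored.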

[A-3] Effects
The storage control device and control program in the example of the embodiment described above can provide, for example, the following effects.

The disconnection processing unit 111 pseudo-disconnects the connection with an IOM 22 in the primary path when a recoverable failure occurs in that IOM 22. The connection processing unit 112 enables the connection with the IOM 22 in the secondary path and performs an access test. The identification unit 113 identifies the cause of the failure according to the result of the access test performed by the connection processing unit 112.

As a result, efficient preventive maintenance can be performed in the RAID system 100 having a redundant path configuration. This prevents all paths to a DE 2 from being blocked because a worker disconnects the wrong suspected component for maintenance, so maintenance can be performed while RAID access from the host continues.

The identification unit 113 identifies the IOM 22 of the primary path as the cause of the failure when the result of the access test on the IOM 22 of the secondary path is normal. The identification unit 113 identifies the IOM 22 of the secondary path as the cause of the failure when the result of the access test on the IOM 22 of the secondary path is abnormal. As a result, the location causing the failure can be identified easily.

The connection processing unit 112 re-enables the connection with the IOM 22 of the primary path when the identification unit 113 identifies the IOM 22 of the secondary path as the cause of the failure. This prevents all paths to the DE 2 from being blocked.

In the cascade connection of the primary path, the disconnection processing unit 111 pseudo-disconnects the connections one at a time, starting from the IOM 22 of the DE 2 connected at the end of the cascade. The connection processing unit 112 enables the connections one at a time and performs the access test on the secondary-path-side IOM 22 of each DE 2 whose connection has been pseudo-disconnected by the disconnection processing unit 111. As a result, the cause of the failure can be identified efficiently.

[B] Others
The disclosed technique is not limited to the embodiment described above and can be modified in various ways without departing from the gist of the present embodiment. The configurations and processes of the present embodiment can be selected as needed or combined as appropriate.

[C] Additional Notes
The following additional notes are further disclosed with respect to the above embodiment.

（付記１）
プライマリパスとセカンダリパスとによって複数の記憶装置群がカスケード接続されたストレージ制御装置であって、
前記プライマリパスにおける第１のモジュールでリカバリ可能な障害が発生した場合に、前記第１のモジュールとの接続を擬似的に切断する切断処理部と、
前記セカンダリパスにおける第２のモジュールとの接続を有効にしてアクセス試験を実施する接続処理部と、
前記接続処理部による前記アクセス試験の結果に応じて、前記障害の原因を特定する特定部と、
を備える、ストレージ制御装置。 (Appendix 1)
A storage control device in which multiple storage devices are cascaded by a primary path and a secondary path.
A disconnection processing unit that pseudo-disconnects the connection with the first module when a recoverable failure occurs in the first module in the primary path.
A connection processing unit that enables an access test by enabling the connection with the second module in the secondary path, and
A specific unit that identifies the cause of the failure according to the result of the access test by the connection processing unit, and
A storage control device.

（付記２）
前記特定部は、前記第２のモジュールに対する前記アクセス試験の結果が正常である場合に、前記第１のモジュールを前記障害の原因として特定する、
付記１に記載のストレージ制御装置。 (Appendix 2)
The identification unit identifies the first module as the cause of the failure when the result of the access test for the second module is normal.
The storage control device according to Appendix 1.

（付記３）
前記特定部は、前記第２のモジュールに対する前記アクセス試験の結果が異常である場合に、前記第２のモジュールを前記障害の原因として特定する、
付記１又は２に記載のストレージ制御装置。 (Appendix 3)
The identification unit identifies the second module as the cause of the failure when the result of the access test for the second module is abnormal.
The storage control device according to Appendix 1 or 2.

（付記４）
前記接続処理部は、前記特定部によって前記第２のモジュールが前記障害の原因として特定された場合に、前記第１のモジュールとの接続を再度有効にする、
付記３に記載のストレージ制御装置。 (Appendix 4)
The connection processing unit re-enables the connection with the first module when the second module is identified as the cause of the failure by the identification unit.
The storage control device according to Appendix 3.

（付記５）
前記切断処理部は、前記プライマリパスの前記カスケード接続において、前記複数の記憶装置群のうち末端に接続されている記憶装置群のモジュールから順次接続を擬似的に切断し、
前記接続処理部は、前記切断処理部によって接続を擬似的に切断された前記記憶装置群の前記セカンダリパス側のモジュールについて、順次接続を有効にして前記アクセス試験を実施する、
付記１〜４のいずれか１項に記載のストレージ制御装置。 (Appendix 5)
The storage control device according to any one of Appendices 1 to 4,
wherein, in the cascade connection of the primary path, the disconnection processing unit sequentially pseudo-disconnects connections, starting from the module of the storage device group connected at the end among the plurality of storage device groups, and
the connection processing unit sequentially enables the connection with the secondary-path-side module of each storage device group whose connection has been pseudo-disconnected by the disconnection processing unit, and performs the access test on that module.
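
The ordering described in Appendix 5 can be sketched as below. The sketch rests on assumptions: each storage device group is modeled as a hypothetical pair of module names, `walk_cascade` and `access_test` are names introduced here, and every group is tested even after an abnormal result, since the appendix does not say whether the walk stops early.

```python
from typing import Callable, List, Tuple

# Hypothetical model: each storage device group is a pair of module names,
# (primary-path module, secondary-path module), listed from the group
# nearest the controller to the group at the end of the cascade.
DeviceGroup = Tuple[str, str]


def walk_cascade(groups: List[DeviceGroup],
                 access_test: Callable[[str], bool]
                 ) -> List[Tuple[str, str, bool]]:
    """Visit the groups from the end of the cascade toward the controller."""
    results: List[Tuple[str, str, bool]] = []
    for primary, secondary in reversed(groups):
        # The primary-path module of this group is treated as
        # pseudo-disconnected; its secondary-path module is then
        # enabled and subjected to the access test.
        results.append((primary, secondary, access_test(secondary)))
    return results
```

For example, with three groups and a test that fails only for the middle group's secondary-path module, the returned list starts with the end group and marks only the middle entry as abnormal.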

（付記６）
プライマリパスとセカンダリパスとによって複数の記憶装置群がカスケード接続されたコンピュータに、
前記プライマリパスにおける第１のモジュールでリカバリ可能な障害が発生した場合に、前記第１のモジュールとの接続を擬似的に切断し、
前記セカンダリパスにおける第２のモジュールとの接続を有効にしてアクセス試験を実施し、
前記アクセス試験の結果に応じて、前記障害の原因を特定する、
処理を実行させる、制御プログラム。 (Appendix 6)
A control program that causes a computer, to which a plurality of storage device groups are cascade-connected by a primary path and a secondary path, to execute a process comprising:
pseudo-disconnecting the connection with a first module in the primary path when a recoverable failure occurs in the first module;
enabling the connection with a second module in the secondary path and performing an access test; and
identifying the cause of the failure according to the result of the access test.

（付記７）
前記第２のモジュールに対する前記アクセス試験の結果が正常である場合に、前記第１のモジュールを前記障害の原因として特定する、
処理を前記コンピュータに実行させる、付記６に記載の制御プログラム。 (Appendix 7)
The control program according to Appendix 6, which causes the computer to execute a process of
identifying the first module as the cause of the failure when the result of the access test on the second module is normal.

（付記８）
前記第２のモジュールに対する前記アクセス試験の結果が異常である場合に、前記第２のモジュールを前記障害の原因として特定する、
処理を前記コンピュータに実行させる、付記６又は７に記載の制御プログラム。 (Appendix 8)
The control program according to Appendix 6 or 7, which causes the computer to execute a process of
identifying the second module as the cause of the failure when the result of the access test on the second module is abnormal.

（付記９）
前記第２のモジュールが前記障害の原因として特定された場合に、前記第１のモジュールとの接続を再度有効にする、
処理を前記コンピュータに実行させる、付記８に記載の制御プログラム。 (Appendix 9)
The control program according to Appendix 8, which causes the computer to execute a process of
re-enabling the connection with the first module when the second module is identified as the cause of the failure.

（付記１０）
前記プライマリパスの前記カスケード接続において、前記複数の記憶装置群のうち末端に接続されている記憶装置群のモジュールから順次接続を擬似的に切断し、
接続を擬似的に切断された前記記憶装置群の前記セカンダリパス側のモジュールについて、順次接続を有効にして前記アクセス試験を実施する、
処理を前記コンピュータに実行させる、付記６〜９のいずれか１項に記載の制御プログラム。 (Appendix 10)
The control program according to any one of Appendices 6 to 9, which causes the computer to execute a process of:
in the cascade connection of the primary path, sequentially pseudo-disconnecting connections, starting from the module of the storage device group connected at the end among the plurality of storage device groups; and
sequentially enabling the connection with the secondary-path-side module of each storage device group whose connection has been pseudo-disconnected, and performing the access test on that module.

１００，６００：ＲＡＩＤシステム
１，６：ＣＭ
１１，６１：ＣＰＵ
１０１，６０１：ＩＯＣ
１１１：切断処理部
１１２：接続処理部
１１３：特定部
１２，６２：ＥＸＰ
２，７：ＤＥ
２１，７１：ディスク
２２，７２：ＩＯＭ
100, 600: RAID system
1, 6: CM
11, 61: CPU
101, 601: IOC
111: Disconnection processing unit
112: Connection processing unit
113: Identification unit
12, 62: EXP
2, 7: DE
21, 71: Disk
22, 72: IOM

Claims

A storage control device to which a plurality of storage device groups are cascade-connected by a primary path and a secondary path, the storage control device comprising:
a disconnection processing unit that pseudo-disconnects the connection with a first module in the primary path when a recoverable failure occurs in the first module;
a connection processing unit that enables the connection with a second module in the secondary path and performs an access test; and
an identification unit that identifies the cause of the failure according to the result of the access test performed by the connection processing unit.

The storage control device according to claim 1,
wherein the identification unit identifies the first module as the cause of the failure when the result of the access test on the second module is normal.

The storage control device according to claim 1 or 2,
wherein the identification unit identifies the second module as the cause of the failure when the result of the access test on the second module is abnormal.

The storage control device according to claim 3,
wherein the connection processing unit re-enables the connection with the first module when the identification unit identifies the second module as the cause of the failure.

The storage control device according to any one of claims 1 to 4,
wherein, in the cascade connection of the primary path, the disconnection processing unit sequentially pseudo-disconnects connections, starting from the module of the storage device group connected at the end among the plurality of storage device groups, and
the connection processing unit sequentially enables the connection with the secondary-path-side module of each storage device group whose connection has been pseudo-disconnected by the disconnection processing unit, and performs the access test on that module.

A control program that causes a computer, to which a plurality of storage device groups are cascade-connected by a primary path and a secondary path, to execute a process comprising:
pseudo-disconnecting the connection with a first module in the primary path when a recoverable failure occurs in the first module;
enabling the connection with a second module in the secondary path and performing an access test; and
identifying the cause of the failure according to the result of the access test.