JP5490067B2

JP5490067B2 - Fault management apparatus and program

Info

Publication number: JP5490067B2
Application number: JP2011178036A
Authority: JP
Inventors: 義晴鯨井; 俊志山内; 剛雄江頭; 博文近藤; 弘史西; 貴之山田
Original assignee: Hitachi Ltd; Bank of Tokyo Mitsubishi UFJ Trust Co
Current assignee: Hitachi Ltd; MUFG Bank Ltd
Priority date: 2011-08-16
Filing date: 2011-08-16
Publication date: 2014-05-14
Anticipated expiration: 2029-03-17
Also published as: JP2011227936A

Description

The present invention relates to a failure management apparatus and program. In particular, it relates to a failure management apparatus that manages failures on each path in a configuration where a computer has multiple paths to each of a plurality of external devices for communicating with those devices, and to a failure management program that causes a computer to function as the failure management apparatus.

Consider a system that includes a computer and a storage device, accesses the storage device in response to external requests (for example, reading data stored in the storage device or writing data to it), and returns a response to the requester. When such a system is required to keep operating even if a failure occurs, a redundant configuration is often adopted in which the storage device itself is multiplexed and the communication routes (paths) over which the computer communicates with the storage device are also multiplexed.

As a technique applicable to such a redundant configuration, Patent Document 1, for example, discloses a computer system comprising a host computer and a storage system that has physical disks and a disk controller. The storage system provides the storage area of the physical disks to the host computer as one or more logical units. When the host computer detects a failure on a logical path, which is an access route from the host computer to a logical unit, it identifies the logical paths for accessing the same logical unit as the failed logical path, executes failure detection processing on the identified logical paths to determine whether each is normal, selects a normal logical path from among them, and accesses the logical unit over the selected path.

In a related vein, Patent Document 2 discloses a technique for a distributed computer system in which a master process periodically updates a shared resource control file with a new timestamp. A process seeking access to the shared resource reads the shared resource control file to determine whether another process is designated as master. If the time elapsed since the last timestamp is longer than a preset time, the process invalidates that shared resource control file and creates a new one. In this way the master process that controls the shared resource is determined efficiently.

Patent Document 1: JP 2007-265243 A
Patent Document 2: JP H07-13939 A

In the redundant configuration described above, a single failure is often reported several times in succession. For example, if a failure occurs in a device located partway along the paths from the computer to the storage device, the failure is detected separately on each of the paths that pass through that device. Consequently, if failure handling such as blocking the reported path is triggered by each failure notification, the handling is repeated frequently and places a heavy load on the computer, which can make the computer system unstable or even bring the computer down.

One conceivable countermeasure is to apply the exclusive control described in Patent Document 2 to the failure handling triggered by failure notifications. With that technique, however, a process that performs the failure handling is spawned every time a failure is reported, and each spawned process repeatedly tries to acquire the master right until it succeeds. When a failure that affects many paths occurs, such as a failure in a device partway along the paths, and failures on many paths are reported in succession, many processes are spawned. This places a heavy load on the computer and consumes large amounts of memory and other resources, so the approach is not effective for stabilizing the computer's operation.

The present invention has been made in view of the above facts, and its object is to provide a failure management apparatus and a failure management program that can stabilize the operation of a computer when a failure affecting multiple paths occurs.

To achieve this object, the failure management apparatus according to the invention of claim 1 is realized by a computer provided with a group consisting of a plurality of paths for communicating with mutually different external devices among a plurality of external devices. The plurality of paths physically share a single communication route, and the apparatus comprises control means that, when the occurrence of a failure on any path is reported, performs blocking processing to block all paths belonging to the same group as the path on which the failure was reported.

The failure management apparatus according to the invention of claim 1 is realized by a computer provided with a group consisting of a plurality of paths for communicating with mutually different external devices among a plurality of external devices. A storage device that can store information and rewrite the stored information is a suitable external device for the present invention, but other devices may be used. A group in the present invention can consist of a plurality of paths that share a physical communication route, as shown for example by the label "path group" in FIG. 2. In the invention of claim 1, when a failure on any path is reported, the control means performs blocking processing to block all paths belonging to the same group as the reported path.

The invention of claim 1 thus relies on the observation that paths in the same group tend to be affected in the same way when a failure occurs, so that a failure is often reported for each of them. Because the blocking processing blocks all paths in the same group as the reported path, once it has run in response to the notification for any one path, no further failures are reported for the paths it has blocked, even if the failure affects every path in the group. This reduces the load on the computer when a failure affecting multiple paths occurs and also limits the consumption of memory and other resources. The invention of claim 1 can therefore stabilize the operation of the computer when a failure affecting multiple paths occurs.
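
The group-wide blocking described above can be illustrated with a minimal sketch. The class name, path identifiers, and group identifiers below are hypothetical and chosen only for illustration; the patent does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class PathTable:
    """Maps each path to its group and tracks which paths are blocked."""
    group_of: dict                      # path id -> group id
    blocked: set = field(default_factory=set)

    def on_failure(self, path):
        """Block every path in the failed path's group; return the newly blocked paths."""
        group = self.group_of[path]
        newly = sorted(p for p, g in self.group_of.items()
                       if g == group and p not in self.blocked)
        self.blocked.update(newly)
        return newly


# Hypothetical layout: two HBAs, each carrying one path to LU0 and one to LU1.
table = PathTable({
    "hba0-lu0": "hba0", "hba0-lu1": "hba0",
    "hba1-lu0": "hba1", "hba1-lu1": "hba1",
})
first = table.on_failure("hba0-lu0")    # blocks both paths that share hba0
second = table.on_failure("hba0-lu1")   # a late notification finds nothing left to block
```

In this sketch the second notification does no work because the whole group was already blocked by the first one, which is the load-reduction effect the paragraph describes.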

When a failure occurs on a path, the cause is often a malfunction or breakdown of hardware on that path (for example, an HBA (Host Bus Adapter) or a cable). Maintenance work (work to recover from the failure) is then performed in which the hardware is treated as the suspect part, disconnected from the parts in operation, and replaced if necessary. If another path passes through the suspect hardware, that path must also be blocked so that it is not used for communication before the hardware can be disconnected, regardless of whether a failure has occurred on it. In the invention of claim 1, when a failure on any path is reported, blocking processing blocks all paths in the same group as the reported path. The maintenance worker therefore does not need to block the other paths in that group separately, which reduces the worker's burden.

The invention of claim 2 is the invention of claim 1 further comprising: management means that registers and manages, in a state table, state information representing the state of each individual path in the group, and that performs management processing which includes selecting the path used for communication between the computer and each external device and excluding from selection any path whose state information registered in the state table indicates that it is blocked; and start-up control means that stops the program for blocking paths from being started once the blocking processing by the control means has rewritten, to information indicating a blocked state, the state information in the state table for all paths belonging to the same group as the path on which the failure was reported.

In the invention of claim 2, the program is prevented from being started needlessly when all paths in the same group as the reported path are already blocked, that is, when none of those paths needs blocking processing. This further reduces the load on the computer and further limits the consumption of memory and other resources.
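
As a rough illustration of the state table, the path selection, and the start-up check of claim 2, consider the following sketch. It assumes one path per (group, device) pair; the table layout and all function names are hypothetical.

```python
ONLINE, BLOCKED = "online", "blocked"

# Hypothetical state table: (group, device) -> state, one path per pair.
state_table = {
    ("g0", "LU0"): ONLINE, ("g0", "LU1"): ONLINE,
    ("g1", "LU0"): ONLINE, ("g1", "LU1"): ONLINE,
}


def select_path(device):
    """Return the first unblocked path to `device`, or None if all are blocked."""
    for (group, dev), state in state_table.items():
        if dev == device and state == ONLINE:
            return (group, dev)
    return None


def should_launch_blocking_program(failed_path):
    """Launch only if some path in the failed path's group is still unblocked."""
    group = failed_path[0]
    return any(state == ONLINE
               for (g, _), state in state_table.items() if g == group)


def block_group(group):
    """Mark every path of `group` as blocked in the state table."""
    for key in state_table:
        if key[0] == group:
            state_table[key] = BLOCKED
```

After `block_group("g0")`, `select_path("LU0")` falls back to the `g1` path, and `should_launch_blocking_program(("g0", "LU1"))` returns `False`, so a late notification for the already blocked group starts nothing.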

The failure management program according to the invention of claim 3 causes a computer, which is provided with a group consisting of a plurality of paths for communicating with mutually different external devices among a plurality of external devices, the plurality of paths physically sharing a single communication route, to function as control means that, when a failure on any path is reported, performs blocking processing to block all paths belonging to the same group as the path on which the failure was reported.

Because the failure management program according to the invention of claim 3 causes the computer to function as the control means described above, a computer that executes it functions as the failure management apparatus of claim 1. As with the invention of claim 1, it can stabilize the operation of the computer when a failure affecting multiple paths occurs.

In the above invention, the following configuration may also be adopted (first aspect). A plurality of groups are provided, each consisting of a plurality of paths for communicating with mutually different external devices among the plurality of external devices. The apparatus further comprises determination means that, each time a failure on any path is reported, determines whether file information, for which duplicate creation with the same attribute is restricted, has already been created with a predetermined attribute corresponding to the group to which the reported path belongs, and creation means that attempts to create file information with the predetermined attribute when the determination means determines that it has not been created. The control means performs the blocking processing when the creation means succeeds in creating the file information with the predetermined attribute.

According to the first aspect above, the following state can be avoided as far as possible: for several external devices, some of the paths connecting each device to the computer are blocked (these paths belong to mutually different groups; as an example, FIG. 2 shows all of the paths connecting logical disk LU#0 to the server computer 12), and the group to which the blocked paths belong differs from one device to another. This specification calls this the cross-blocked state. If the cross-blocked state arises and, during maintenance work, one tries to block all the other paths in the same group as the failed path, some external device will have every path to the computer blocked and become unreachable, which makes the maintenance work difficult to carry out. The first aspect avoids this situation as far as possible and raises the probability that maintenance work can be performed.
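
The cross-blocked state can be made concrete with a small check. This is only one reading of the definition above, sketched under the assumption that each path is identified by a (group, device) pair; the function name and table layout are hypothetical.

```python
def is_cross_blocked(state):
    """Return True if blocked paths of different devices belong to different groups.

    `state` maps (group, device) -> True when that path is blocked.
    """
    blocked_groups = {}
    for (group, device), is_blocked in state.items():
        if is_blocked:
            blocked_groups.setdefault(device, set()).add(group)
    # Cross-blocked: at least two devices whose sets of blocked groups differ.
    distinct = {frozenset(groups) for groups in blocked_groups.values()}
    return len(distinct) > 1


# LU0 lost its g0 path while LU1 lost its g1 path: blocking the rest of
# either group for maintenance would cut off one device entirely.
crossed = {
    ("g0", "LU0"): True, ("g1", "LU0"): False,
    ("g0", "LU1"): False, ("g1", "LU1"): True,
}
# Both devices lost only their g0 path: g0 can be serviced safely.
aligned = {
    ("g0", "LU0"): True, ("g1", "LU0"): False,
    ("g0", "LU1"): True, ("g1", "LU1"): False,
}
```

Here `is_cross_blocked(crossed)` is true and `is_cross_blocked(aligned)` is false, matching the maintenance problem the paragraph describes.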

In the first aspect, suppose a failure affecting multiple paths occurs and failures on different paths of the same group are reported almost simultaneously. In the processing for the specific path whose failure was reported first, creation of the file information succeeds and the blocking processing is performed. In the processing for each remaining path, whose failure was reported after that of the specific path, the outcome depends on when the check for existing file information runs. If the check runs after the processing for the specific path has created the file information with the predetermined attribute, the file information is determined to exist already. If the check runs before that creation, the subsequent creation attempt still fails, because duplicate creation of file information with the same attribute is restricted. In either case the blocking processing is not performed. Therefore, even when a failure affecting multiple paths occurs and failures on different paths of the same group are reported almost simultaneously, duplicate execution of the blocking processing is reliably prevented and the load on the computer is reduced.
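
The check-then-create sequence described above maps naturally onto exclusive file creation. The following is a minimal sketch, assuming the "file information" is an ordinary lock file created with `O_CREAT | O_EXCL`; the file name pattern and function names are hypothetical and not taken from the patent.

```python
import os
import tempfile


def try_acquire(lock_dir, group):
    """Try to create the per-group lock file; True only for the first caller."""
    lock_path = os.path.join(lock_dir, f"pathblock.{group}.lock")
    if os.path.exists(lock_path):           # determination step
        return False
    try:
        # O_EXCL makes creation fail if the file appeared after the check above.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def handle_notification(lock_dir, group, block_group):
    """Run block_group(group) only if this caller created the lock file."""
    if not try_acquire(lock_dir, group):
        return False                         # another handler owns this group
    block_group(group)
    return True


lock_dir = tempfile.mkdtemp()
calls = []
results = [handle_notification(lock_dir, "g0", calls.append) for _ in range(3)]
```

Only the first of the three notifications reaches `block_group`; the other two return immediately, whether they lose at the existence check or at the exclusive create.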

In the first aspect, the determination means, creation means, and control means are preferably realized by a single program that is started and executed by the computer each time a failure is reported on any path of any group, with execution of the program terminated when the creation means fails to create the file information with the predetermined attribute (second aspect). Then, even when a failure affecting multiple paths occurs and failures on different paths of the same group are reported almost simultaneously, the instances of processing that fail to create the file information are terminated. Compared with leaving them running, this reduces the load on the computer and limits the consumption of memory and other resources.

In the first or second aspect, the following configuration is preferable (third aspect). When the determination means determines that file information with the predetermined attribute has already been created, it also determines whether the time elapsed since that file information was created is at or above a threshold. The control means deletes the file information created by the creation means after the blocking processing ends. In addition, when the determination means finds that file information with the predetermined attribute already exists and that the elapsed time since its creation is at or above the threshold, the control means deletes that existing file information. When the elapsed time is determined to be at or above the threshold, the creation means attempts to create file information with the predetermined attribute after the control means has deleted the existing file information.

A suitable threshold for the third aspect is a value that exceeds the time needed for the series of processes by the determination means, creation means, and control means by at least a predetermined margin, and that is long enough for a newly reported failure to be judged distinct from the previously reported one (for example, about one minute). In the third aspect, the control means deletes the file information created by the creation means after the blocking processing ends. If the determination means nevertheless finds existing file information whose age is at or above the threshold, that file information is most likely a leftover that the control means failed to delete for some reason. In the third aspect the control means deletes such file information as well, after which the creation means attempts to create the file information. Thus, even if deletion by the control means fails, once the threshold time has passed since the leftover file information was created, the system returns to a state in which the creation means can again create file information with the same attribute and the control means can perform the blocking processing.
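
The stale-file handling of the third aspect might look like the sketch below. It assumes the file information is a lock file and uses the file's modification time as a stand-in for its creation time; the 60-second constant follows the "about one minute" example in the text, and all names are hypothetical.

```python
import os
import tempfile
import time

STALE_AFTER_SEC = 60.0   # the text suggests roughly one minute


def acquire_with_stale_check(lock_path, now=None):
    """Create lock_path exclusively, first removing it if it is stale."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(lock_path)
    except FileNotFoundError:
        age = None                           # no lock file yet
    if age is not None:
        if age < STALE_AFTER_SEC:
            return False                     # a recent handler owns the group
        try:
            os.remove(lock_path)             # leftover from a failed cleanup
        except FileNotFoundError:
            pass
    try:
        os.close(os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
    except FileExistsError:
        return False
    return True


def release(lock_path):
    """Delete the lock file after the blocking processing has finished."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass


lock_path = os.path.join(tempfile.mkdtemp(), "pathblock.g0.lock")
got_first = acquire_with_stale_check(lock_path)
got_second = acquire_with_stale_check(lock_path)                           # lock is fresh
got_later = acquire_with_stale_check(lock_path, now=time.time() + 120)     # lock looks stale
```

The `now` parameter exists only so the stale case can be exercised without waiting; a caller that omits it gets the current time.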

In the third aspect, the determination means, creation means, and control means are preferably realized by a single program that is started and executed by the computer each time a failure is reported on any path of any group, with execution of the program terminated when the determination means determines that file information with the predetermined attribute exists and that the elapsed time is below the threshold (fourth aspect). Then, even when a failure affecting multiple paths occurs and failures on different paths of the same group are reported at intervals shorter than the threshold, the processing for the second and subsequent reported paths is terminated. Compared with leaving it running, this reduces the load on the computer and limits the consumption of memory and other resources.

In a configuration that applies any of the first to fourth aspects to the invention of claim 2, the control means is preferably configured as follows (fifth aspect). Before performing the blocking processing, it refers to the state table and, for each path belonging to the same group as the reported path, checks whether that path is the last unblocked path among the paths to the same external device. If it is the last such path, the control means does not rewrite the state information registered for that path in the state table. This prevents the blocking processing from producing an external device whose paths to the computer are all blocked, so the computer can still communicate with every external device over at least one unblocked path regardless of the blocking processing.
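
The last-path check of the fifth aspect can be sketched as follows, again assuming a hypothetical state table keyed by (group, device) with a boolean blocked flag.

```python
def block_group_keep_last(state, group):
    """Block the group's paths, skipping any that is a device's last open path.

    `state` maps (group, device) -> True if blocked. Returns the paths blocked.
    """
    blocked_now = []
    for (g, device) in sorted(state):
        if g != group or state[(g, device)]:
            continue
        open_paths = [k for k, is_blocked in state.items()
                      if k[1] == device and not is_blocked]
        if len(open_paths) <= 1:
            continue                         # last unblocked path: leave it open
        state[(g, device)] = True
        blocked_now.append((g, device))
    return blocked_now


# Hypothetical table: LU1 has already lost its g1 path.
state = {
    ("g0", "LU0"): False, ("g0", "LU1"): False,
    ("g1", "LU0"): False, ("g1", "LU1"): True,
}
blocked_now = block_group_keep_last(state, "g0")
```

In this example the `g0` path to LU0 is blocked, but the `g0` path to LU1 stays open because it is the only remaining route to that device.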

In a configuration that applies any of the first to fifth aspects to the invention of claim 3, the start-up control means is preferably configured to stop the program from being started on a failure notification also when, as a result of blocking processing by the control means, each of the plurality of external devices has only one unblocked path remaining among its paths (sixth aspect). In that state, where each external device is down to its last unblocked path and blocking further paths is undesirable, the program is prevented from being started needlessly, which further reduces the load on the computer and further limits the consumption of memory and other resources.
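
The start-up condition of the sixth aspect reduces to counting unblocked paths per device. A minimal sketch, assuming the same hypothetical (group, device) state table with a boolean blocked flag:

```python
def every_device_on_last_path(state):
    """True when no device has more than one unblocked path left."""
    open_count = {}
    for (_, device), is_blocked in state.items():
        open_count[device] = open_count.get(device, 0) + (0 if is_blocked else 1)
    return all(n <= 1 for n in open_count.values())


def should_start_blocking_program(state):
    """Suppress start-up once further blocking would be undesirable."""
    return not every_device_on_last_path(state)


healthy = {
    ("g0", "LU0"): False, ("g0", "LU1"): False,
    ("g1", "LU0"): False, ("g1", "LU1"): False,
}
degraded = {
    ("g0", "LU0"): True, ("g0", "LU1"): True,
    ("g1", "LU0"): False, ("g1", "LU1"): False,
}
```

With the `degraded` table every device depends on its single `g1` path, so `should_start_blocking_program(degraded)` is false and a further notification would not start the program.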

As described above, in the present invention, when a failure on any path is reported, creation of file information with a predetermined attribute is attempted, and if the creation succeeds, blocking processing is performed to block all paths belonging to the same group as the reported path. The invention therefore has the excellent effect of stabilizing the operation of the computer when a failure affecting multiple paths occurs.

A schematic configuration diagram of the computer system according to the present embodiment.
A schematic diagram illustrating access to the disk device during operation.
A schematic diagram showing the flow of processing when an I/O delay occurs.
A flowchart showing the path blocking control processing.
A table showing the results of the path blocking control processing for each pattern.

An example of an embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 shows a computer system 10 according to the present embodiment. The computer system 10 comprises a plurality of server computers 12 (only two, 12A and 12B, are shown in FIG. 1) and a disk device 24 that is provided with a plurality of logical disks LU and is connected to each server computer 12 via optical fiber cables 22. The system functions as a DB (database) server. Each server computer 12 corresponds to the computer of the present invention, and each logical disk LU provided in the disk device 24 corresponds to an external device of the present invention.

Each server computer 12 includes a CPU 14, a memory 16, a nonvolatile storage unit 18 consisting of, for example, disk drives driven by a disk drive unit (DKU), and a plurality of HBAs (Host Bus Adapters) 20 for connecting the server computer 12 to the disk device 24 over optical fiber (FIG. 1 shows only two HBAs, HBA#0 and HBA#2). The storage unit 18 of each server computer 12 stores an application program that issues I/O (input/output) requests to the disk device 24 (more precisely, to one of the logical disks of the disk device 24), an operating system (OS) program that controls hardware and so on in response to I/O requests from the application, a link management unit program, a driver program, a path blocking control shell (program), and a log trap program. The link management unit, driver, path blocking control shell, and log trap are described later.

Each program stored in the storage unit 18 is read from the storage unit 18 into the memory 16 when the server computer 12 starts operating, and is executed by the CPU 14 as needed. Of the programs stored in the storage unit 18, the path blockage control shell corresponds to the failure management program according to the present invention, and the server computer 12 functions as the failure management apparatus according to the present invention when the CPU 14 executes the path blockage control shell.

The disk device 24, on the other hand, includes a large number of disk drives driven by a disk drive unit (DKU) 26. The disk drives are interconnected so as to form a plurality of logical disks LU, each of which is redundant (FIG. 1 shows only some of the logical disks provided in the disk device 24: LU#0 to LU#3, LU#m to LU#m+3, and LU#n to LU#n+3). The disk device 24 is also provided with a disk controller (DKC) 28. The disk controller (DKC) 28 has a plurality of clusters 30 (two in FIG. 1), and each cluster 30 has a plurality of channel adapters (CHA) 32 (two in FIG. 1), a plurality of disk adapters (DKA) 34 (two in FIG. 1), and a cache switch (CSW) 36 that connects the channel adapters (CHA) 32 to the disk adapters (DKA) 34.

Each logical disk LU is connected to one of the disk adapters (DKA) 34 of each cluster 30. The channel adapters (CHA) 32 provided in the same cluster 30 are connected, via optical fiber cables 22, to the HBAs 20 of mutually different server computers 12. When a channel adapter (CHA) 32 receives an I/O request from a server computer 12 over the optical fiber cable 22, it forwards the request through the cache switch (CSW) 36 to the disk adapter (DKA) 34 to which the logical disk LU targeted by the request is connected.

Next, as the operation of the present embodiment, the processing realized when the CPU 14 of the server computer 12 executes the programs is described first with reference to FIG. 2, for the state in which the computer system 10 is operating as a DB server and no failure has occurred in the disk device 24 or elsewhere.

In the present embodiment, each server computer 12 has a plurality of HBAs 20, and each HBA 20 is connected via an optical fiber cable 22 to the disk device 24 (specifically, to channel adapters (CHA) 32 of mutually different clusters 30). As indicated by the broken lines in FIG. 2, each server computer 12 can therefore communicate with each logical disk LU of the disk device 24 over a plurality of paths (paths passing through mutually different channel adapters (CHA) 32), so that the paths for communicating with each logical disk LU are redundant (multiplexed). The paths provided between a single server computer 12 and the disk device 24 are grouped in units of paths that pass through the same channel adapter (CHA) 32, that is, paths that share a physical communication route (see also the "path groups" shown in FIG. 2).

An I/O request to the disk device 24 output from the application specifies the logical disk LU to be accessed. The I/O request passes through the OS and is forwarded to the link management unit, which is realized by the CPU 14 of the server computer 12 executing the link management unit program. The link management unit keeps track of the load on each path, selects the path for forwarding the I/O request to the target logical disk LU so that the load is distributed across the paths, and passes the I/O request, together with the selected path, to the driver, which is realized by the CPU 14 of the server computer 12 executing the driver program. In this way the I/O request is transferred over the selected path. The link management unit also holds a path state management table in which state information indicating whether each path is active (online) or inactive (blocked, offline) is registered. When a path is selected for transferring an I/O request, paths whose state information in the path state management table is "inactive (blocked): offline" are excluded from selection.
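The path selection described above can be illustrated with a minimal sketch. The `Path` record, the `pending` counter used as the load measure, and the name `select_path` are assumptions made for illustration; the description states only that the load is distributed and that offline paths are excluded.

```python
from dataclasses import dataclass


@dataclass
class Path:
    path_id: str      # hypothetical identifier, e.g. "HBA0-LU0"
    lu: str           # logical disk reached over this path
    online: bool      # state taken from the path state management table
    pending: int = 0  # outstanding I/O requests (assumed load measure)


def select_path(paths, target_lu):
    """Return the least-loaded online path to target_lu, or None if none exists."""
    candidates = [p for p in paths if p.lu == target_lu and p.online]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.pending)
```

Choosing the path with the fewest outstanding requests is only one possible load-distribution policy; round-robin over the online paths would fit the description equally well.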

Thus, the link management unit corresponds to the management unit recited in claim 2, and the path state management table corresponds to the state table recited in claim 2.

As shown in FIG. 2, the driver has a queue for each path to hold incoming I/O requests. It places each I/O request received from the link management unit in the queue corresponding to the path specified by the link management unit, takes the I/O requests out of each queue in the order in which they were enqueued, and forwards them to the corresponding HBA 20, thereby controlling the execution order of the I/O requests. An I/O request forwarded from the driver to the HBA 20 is transferred to the target logical disk LU over the path previously determined and specified by the link management unit. After the disk access processing corresponding to the I/O request (reading data from or writing data to one of the disk drives constituting the target logical disk LU) has been performed, the response to the I/O request is returned over the same path.
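The per-path queuing of the driver can be sketched as follows. The class name `DriverQueues` and its methods are hypothetical; the sketch shows only the first-in, first-out ordering per path that the description requires.

```python
from collections import defaultdict, deque


class DriverQueues:
    """One FIFO queue per path, as an illustrative model of the driver."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def enqueue(self, path_id, request):
        # The link management unit specifies the path; the request joins that queue.
        self._queues[path_id].append(request)

    def dispatch(self, path_id):
        """Remove and return the oldest request of the path, or None if empty."""
        queue = self._queues[path_id]
        return queue.popleft() if queue else None
```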

Next, the case is described in which some failure occurs in the target logical disk LU of the disk device 24, or in a device located along a particular path, so that the response to a particular I/O request forwarded by the driver to the HBA 20 is delayed (that is, an I/O delay occurs).

When forwarding an I/O request to the HBA 20, the driver starts a timer with a predetermined timer value in preparation for an I/O delay and resets the retry counter to 0. If the timer for a particular I/O request times out before the response to that request is received, the driver increments the retry counter by 1, forwards the request to the HBA 20 again, and restarts the timer. This is repeated until the retry counter reaches a predetermined maximum number of retries. If no response to the request has been received even after the retry counter has reached the maximum number of retries, the driver notifies the link management unit that a failure has occurred.
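The timeout and retry behaviour described above can be sketched as a simple loop. The function name `issue_with_retry` and its callback parameters are assumptions for illustration; `send` stands for one transmission together with its timer and returns `True` only if the response arrives before the timer expires.

```python
def issue_with_retry(send, max_retries, notify_failure):
    """Send a request, retrying on timeout up to max_retries times.

    Returns True if a response was received, False after notifying a failure.
    """
    retries = 0                      # retry counter reset on first transmission
    while True:
        if send():                   # response arrived before the timer expired
            return True
        if retries >= max_retries:   # counter already at the maximum: give up
            notify_failure()         # report to the link management unit
            return False
        retries += 1                 # increment, then retransmit and restart the timer
```

With a maximum of 4 retries, the request is transmitted at most 5 times before the failure is reported.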

The timer value is, for example, 30 to 40 seconds, and the maximum number of retries is, for example, about 3 to 5. If the timer value is 36 seconds and the maximum number of retries is 4, the time from when the driver sends the first I/O request until it notifies the link management unit of a failure due to the I/O delay is 36 seconds × (1 + 4) = 180 seconds. There is therefore a problem in that a relatively long time elapses between the occurrence of a failure and the point at which the link management unit recognizes it and starts the failure handling processing.
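The arithmetic above generalizes to timer value × (1 + maximum number of retries). The helper below is a small illustrative sketch (the function name is an assumption) that reproduces the 180-second figure and shows that the example ranges given in the text span from 120 to 240 seconds.

```python
def worst_case_notification_delay(timer_seconds, max_retries):
    """Seconds from the first transmission until the failure is reported."""
    # One initial transmission plus max_retries retransmissions, each timing out.
    return timer_seconds * (1 + max_retries)


print(worst_case_notification_delay(36, 4))  # the example in the text
print(worst_case_notification_delay(30, 3))  # shortest case in the quoted ranges
print(worst_case_notification_delay(40, 5))  # longest case in the quoted ranges
```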

Moreover, until the driver notifies the link management unit of the failure, the failure is left without any failure handling. If the failure has occurred in a device through which a plurality of paths pass (for example, an HBA 20 or a channel adapter (CHA) 32), I/O delays also occur on the other paths passing through the failed location, and I/O requests are retransmitted on each of them. As a result, the link management unit is notified of failures several times in succession, and the failure handling processing is repeated each time. This places a heavy load on the server computer 12 and may make the operation of the computer system 10 unstable or bring the server computer 12 down.

In contrast, when the driver according to the present embodiment detects that an I/O delay has occurred, that is, when the timer for an I/O request times out before the response to that request is received (see also (1) in FIG. 3), it records an I/O delay message indicating that an I/O delay (timer timeout) has occurred in a log information recording area provided in advance in a storage area of the memory 16 or the storage unit 18 (see also (2) in FIG. 3). To solve the problems described above, the present embodiment uses the recording of the I/O delay message in the log information recording area as a trigger for blocking paths in units of groups of paths that pass through the same channel adapter (CHA) 32.

Specifically, to realize this control, the log trap program and the path blockage control shell are stored in the storage unit 18 of the server computer 12. There are as many log traps and path blockage control shells as there are HBAs 20 in a single server computer 12. Each log trap corresponds to a different HBA 20 (path group), and each path blockage control shell likewise corresponds to a different HBA 20 (path group).

In the present embodiment, the I/O delay message that the driver records in the log information recording area on detecting an I/O delay does not identify the individual path on which the delay occurred; instead, it identifies the path group in which the delay occurred. Each log trap monitors whether an I/O delay message for its own path group has been recorded in the log information recording area. A log trap that detects such a message (see also (3) in FIG. 3) starts the path blockage control shell associated with the same path group (see also (4) in FIG. 3). The path blockage control processing performed by the path blockage control shell started by the log trap is described below with reference to FIG. 4.
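The monitoring performed by a log trap can be sketched as follows. The message format matched by `DELAY_PATTERN` is purely hypothetical, since the description states only that the message identifies the path group; the class `LogTrap` and its callback are likewise illustrative names.

```python
import re

# Hypothetical message format; the description only says the group is identified.
DELAY_PATTERN = re.compile(r"I/O delay detected: group=(?P<group>\S+)")


class LogTrap:
    """Watches log lines for one path group and starts its control shell."""

    def __init__(self, group, launch_shell):
        self.group = group
        self.launch_shell = launch_shell  # callable standing in for the shell
        self.running = True

    def on_log_line(self, line):
        """Return True if the line caused the shell to be started."""
        if not self.running:
            return False
        match = DELAY_PATTERN.search(line)
        if match and match.group("group") == self.group:
            self.launch_shell(self.group)
            return True
        return False

    def stop(self):
        # After stopping, further delay messages no longer start the shell.
        self.running = False
```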

In the path blockage control processing, step 50 first determines whether a lock directory with the name preset for the corresponding path group (the path group in which the I/O delay occurred) has already been created. The present embodiment uses a UNIX (registered trademark) OS, in which creating a second directory with the same name is restricted (prohibited). The lock directory uses this property of directories for exclusive control among path blockage control shells. The name of the directory used as the lock directory is preset for each path group, and the right to update the path state management table is granted only to the path blockage control shell that succeeded in creating the lock directory with the name preset for its path group. This provides exclusive control when path blockage control shells for the same path group are started more than once (that is, when several instances of the path blockage control processing run in parallel).

In most cases the path blockage control processing stops the log trap that started the shell (the log trap for the same path group), as described in detail later. Once the log trap has been stopped, the path blockage control shell for that path group cannot be started in duplicate. However, if the path blockage control shell for the same path group is started again before the log trap has been stopped, for example because I/O delays occurred almost simultaneously on several paths belonging to the same group, duplicate startup does occur. Step 50 corresponds to the determination means according to the present invention.

The lock directory corresponds to the file information according to the present invention. In the present invention the OS is not limited to UNIX (registered trademark) systems. In a WINDOWS (registered trademark) OS, creating a second folder with the same name is likewise restricted (prohibited), so when the OS is a WINDOWS (registered trademark) system, a folder can be used as the file information according to the present invention. Furthermore, the file information according to the present invention is not limited to information whose duplicate creation under the same name is restricted; information for which duplicate creation is restricted on the basis of an attribute other than the name may also be used.

If the determination in step 50 is negative, it can be concluded that the path blockage control shell for the same path group has not been started in duplicate (no other path blockage control processing for the same path group is running in parallel), so the processing proceeds to step 60 and attempts to create the lock directory with the name preset for the corresponding path group (the path group in which the I/O delay occurred). Step 60 corresponds to the creation means according to the present invention. The next step, step 62, determines whether the lock directory was created successfully in step 60. Successful creation means that the right to update the path state management table has been acquired, so if the determination in step 62 is affirmative, the processing proceeds to step 66, and the blocking processing that updates the path state management table is performed from step 66 onward.
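The lock-directory technique of steps 50 to 62 can be sketched in a few lines. The function name `try_acquire_lock` is an assumption; the sketch relies only on the fact that `os.mkdir` raises `FileExistsError` when the directory already exists, so that exactly one caller succeeds in creating a given directory.

```python
import os


def try_acquire_lock(lock_dir):
    """Return True if this caller created lock_dir, False if it already existed."""
    try:
        os.mkdir(lock_dir)       # step 60: attempt to create the lock directory
        return True              # step 62 affirmative: update right acquired
    except FileExistsError:
        return False             # another instance already holds the lock
```

In this sketch the creation attempt itself decides the winner, so a separate existence check corresponding to step 50 serves only as an early exit and is omitted here.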

First, step 66 selects a single path to be processed from the corresponding path group (the path group in which the I/O delay occurred). The next step, step 68, refers to the path state management table and, for the logical disk LU under evaluation, namely the one corresponding to the path selected in step 66 (the logical disk connected to the server computer 12 by that path), checks the state of every path connecting that logical disk LU to the server computer 12. Step 70 then determines whether the number of unblocked paths among all the paths connecting that logical disk LU to the server computer 12 is two or more (as an example, FIG. 2 explicitly shows all the paths connecting logical disk LU#0 to the server computer 12).

If the determination in step 70 is affirmative, the processing proceeds to step 72, where the path being processed is blocked by rewriting its state information in the path state management table from "active: online" to "inactive (blocked): offline", and then proceeds to step 74. If, on the other hand, the number of unblocked paths connecting the logical disk LU under evaluation to the server computer 12 is one, that is, the path being processed is the last unblocked path connecting that logical disk LU to the server computer 12, blocking it would leave the server computer 12 unable to communicate with that logical disk LU. Therefore, if the determination in step 70 is negative, the processing proceeds to step 74 without performing step 72.

Step 74 determines whether the corresponding path group (the path group in which the I/O delay occurred) still contains an unprocessed path (a path not yet selected for processing). If so, the processing returns to step 66, and steps 66 to 74 are repeated until the determination in step 74 is negative. In this way the blocking processing of steps 66 to 74 is applied to every path belonging to the corresponding path group (see also (5) in FIG. 3). Steps 66 to 74 correspond to the control means according to the present invention.

As a result of the blocking processing, consider pattern 1 in FIG. 5, in which all paths between the server computer 12 and the disk device 24 are "active: online" and an I/O delay occurs on some paths of the HBA0-system path group. In this case, as shown under "Contents of the management table after path blockage control", the state information of every path in the HBA0-system path group is rewritten to "inactive (blocked): offline". In pattern 2 in FIG. 5, some paths of the HBA0-system path group are already "inactive (blocked): offline" and an I/O delay occurs on other paths of the same group. In this case, as shown under the same heading, the state information of the HBA0-system paths that were "active: online" is rewritten to "inactive (blocked): offline".

A state in which some paths are already "inactive (blocked): offline", as above, arises when, for example, the HBA detects that a particular path has become physically unable to communicate because of an optical communication failure or temporary attenuation on that path, and the driver rewrites the state information of that path accordingly. In both patterns 1 and 2 of FIG. 5, all HBA0-system paths are "inactive (blocked): offline" after path blockage control has been executed. This avoids a cross-blocked state, in which "active: online" and "inactive (blocked): offline" paths are mixed within the same group, and thus simplifies the work of recovering from the failure.

In pattern 3 in FIG. 5, by contrast, some paths of the HBA2-system path group (the path to logical disk LU#0) are "inactive (blocked): offline" when an I/O delay occurs on some paths of the HBA0-system path group. As shown under "Contents of the management table after path blockage control", the state information of the HBA0-system paths to logical disks other than LU#0 is rewritten to "inactive (blocked): offline". The HBA0-system path to logical disk LU#0, however, is the last unblocked path connecting LU#0 to the server computer 12, and blocking it would leave the server computer 12 unable to communicate with LU#0. The determination in step 70 is therefore negative for this path, and its state information remains "active: online". A cross-blocked state results in this case, but the server computer 12 is prevented from losing communication with logical disk LU#0.
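The blocking loop of steps 66 to 74, and its outcome for patterns 1 and 3, can be sketched as follows. Modelling the path state management table as a dictionary keyed by `(group, lu)` and the name `block_group` are assumptions made for illustration.

```python
ONLINE, OFFLINE = "online", "offline"


def block_group(table, group):
    """Block every path of `group` unless it is the last online path to its LU.

    `table` maps (group, lu) to ONLINE or OFFLINE and is updated in place.
    """
    targets = [key for key in table if key[0] == group]
    for (g, lu) in targets:                       # steps 66 and 74: each path once
        online_paths = sum(                       # step 68: all paths to the same LU
            1 for (_, other_lu), state in table.items()
            if other_lu == lu and state == ONLINE
        )
        if online_paths >= 2:                     # step 70
            table[(g, lu)] = OFFLINE              # step 72
    return table
```

When the other group's path to a logical disk is already offline, the count for that disk is 1, so the remaining path is left online, which reproduces the cross-blocked result of pattern 3.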

A path blocked by the blocking processing is excluded from the paths that can be selected for transferring I/O requests from the server computer 12 to the disk device 24, so the driver no longer detects I/O delays on it and no further I/O delay messages for it are recorded in the log information recording area. Consequently, even if a failure occurs in a device through which all paths of a group pass, once all paths of that group have been blocked, as in patterns 1 and 2 of FIG. 5, the path blockage control processing is not performed again for that path group until the failure is recovered. Furthermore, for a path blocked by the blocking processing, even if an I/O request was sent over the path before it was blocked and the timeout of its response is detected as an I/O delay after the path has been blocked, the I/O request is not retransmitted because the path is blocked. This reduces the load on the server computer 12 when a failure affecting every path in a group occurs, and also limits the consumption of resources such as memory. In addition, the blocking processing is triggered by the recording of the I/O delay message in the log information recording area, that is, by the timeout of the driver's first transmission of the I/O request, so it can be performed earlier than if it were triggered by the driver's failure notification to the link management unit.

When steps 66 to 74 have been performed for every path belonging to the corresponding path group (the path group in which the I/O delay occurred), the determination in step 74 becomes negative and the processing proceeds to step 76. Step 76 refers to the path state management table again and determines whether, as a result of steps 66 to 74, all paths of the corresponding path group have been blocked. Patterns 1 and 2 of FIG. 5, for example, satisfy this condition; if the determination is affirmative, the processing proceeds to step 82. If the determination in step 76 is negative, the processing proceeds to step 78, which refers to the path state management table again and counts the number of unblocked paths for each logical disk of the disk device 24. The next step, step 80, determines whether the number of unblocked paths is one for every logical disk of the disk device 24. Pattern 3 of FIG. 5, for example, is a case in which the determination in step 76 is negative and the determination in step 80 is affirmative; if the determination in step 80 is affirmative, the processing proceeds to step 82.

When the determination in step 76 or step 80 is affirmative, no further blocking processing (rewriting of state information) is needed, at least for the corresponding path group (the path group in which the I/O delay occurred), until the current failure is recovered. Even if the group still contains an unblocked path, such as the HBA0-system path to logical disk LU#0 in pattern 3 of FIG. 5, blocking that path would leave the server computer 12 unable to communicate with some logical disk LU, so performing the blocking processing on the group is undesirable. Step 82 therefore stops the log trap that started this processing (the log trap for the same path group) (see also (6) in FIG. 3 and the "Log trap" column for patterns 1 to 3 in FIG. 5), after which the processing proceeds to step 84.

Once the log trap is stopped, even if an I/O delay message indicating that an I/O delay has occurred in the corresponding path group is recorded in the log information recording area, the log trap does not start the path blocking control shell, and the path blocking control process is not performed. Thus, even when a failure occurs in a device through which the paths of the same group pass, the path blocking control process is not performed again for the group in which the I/O delay occurred from the time the log trap is stopped until the current failure is recovered. This reduces the load on the server computer 12 when a failure affecting the paths of one group occurs, and also suppresses the consumption of resources such as memory. Steps 76 to 82 correspond to the activation control means recited in claim 2.

On the other hand, the determinations in steps 76 and 80 are both negative when, for example, the blocking of one or more paths has failed for some reason. A typical example is shown as pattern 4 in FIG. 5. In pattern 4, all paths between the server computer 12 and the disk device 24 are initially "active: online", and an I/O delay occurs in some of the paths of the HBA0 group. As in pattern 1 described above, the state information of all paths in the HBA0 group is to be rewritten to "inactive (blocked): offline", but, as shown under "Contents of management table after execution of path blocking control", the rewrite fails for the path corresponding to logical disk LU#0 in the HBA0 group because of an error or the like.

In pattern 4, the HBA0 group still contains a path whose state is "active: online" (the path corresponding to logical disk LU#0), so the determination in step 76 is negative. For logical disk LU#0, the corresponding HBA0 and HBA2 paths are both "active: online", so the number of unblocked paths is 2 and the determination in step 80 is also negative. When, as in pattern 4, there is a logical disk with two or more unblocked paths, there is still room to block a path to that logical disk when a failure occurs (blocking any one of its paths does not make it unreachable from the server computer 12). Therefore, when the determinations in steps 76 and 80 are negative, the process proceeds to step 84 without stopping the log trap, skipping step 82 (see also the "Log trap" column for pattern 4 in FIG. 5).

In step 84, the lock directory created in step 60 is deleted, and the path blocking control process ends. Because the lock directory is deleted, if this shell is started again and the path blocking control process for the same path group is performed again, the blocking process that updates the path state management table can be carried out.

In the path blocking control process, as described above, the blocking process of steps 66 to 74 attempts to block all paths in the corresponding path group, and the log trap is normally stopped as well (except in rare cases such as a failed rewrite of path state information). Consequently, even if a failure occurs in a device through which the paths of the same group pass, as long as the times at which the driver detects the I/O delay on each path differ by at least a certain amount, an I/O delay message is recorded in the log information recording area only for the path on which the driver first detected the delay. The log trap then starts the path blocking control shell, and the path blocking control process is executed.

However, the driver may detect I/O delays on the paths of the same group almost simultaneously. More precisely, after the I/O delay message for the first detected path has been recorded in the log information recording area and has triggered the path blocking control process, but before the path state management table is updated and the log trap is stopped, I/O delays may also be detected on other paths of the same group and I/O delay messages recorded for them. In that case, each recorded message acts as a trigger, so multiple path blocking control processes for the same path group are started redundantly and run in parallel at almost the same time. The path blocking control process in this case is described below.

As described above, step 50 of the path blocking control process determines whether a lock directory already exists. Because the path blocking control process deletes the lock directory it created when it finishes (step 84), an affirmative determination in step 50 most likely means that the path blocking control shell for the same path group has been started redundantly. In rare cases, however, an earlier path blocking control process may have failed to delete the lock directory, so that the directory remains even though that process has already ended. This possibility cannot be ruled out.

Therefore, if the determination in step 50 is affirmative, the process proceeds to step 52, where the timestamp (creation date and time) of the existing lock directory is checked. In the next step 54, it is determined whether the time elapsed since the creation of the existing lock directory is equal to or greater than a predetermined threshold. The threshold can be, for example, a value corresponding to a time sufficiently longer than the path blocking control process takes (for example, one minute). If the elapsed time is less than the threshold, it is very likely that the path blocking control shell for the same path group has been started redundantly. Accordingly, when the determination in step 54 is negative, the process proceeds to step 56, outputs the message "stopped due to multiple activation", and the path blocking control process ends.

If the determination in step 54 is affirmative, the existing lock directory can be judged to be left over from an earlier path blocking control process that failed to delete it, and that earlier process itself can be judged to have ended. The process therefore proceeds to step 58, deletes the existing lock directory, and then proceeds to step 60 to attempt to create a new lock directory. If the creation succeeds, the determination in step 62 is affirmative and the process proceeds to step 66, after which the blocking process and the stopping of the log trap described above are performed in order.
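Steps 50 through 64 can be sketched as below. This is only an illustration under stated assumptions: the function name, the return strings, and the use of the directory's modification time (`st_mtime`) as a stand-in for its creation time are choices made here, since creation time is not portably available on UNIX-like systems. The 60-second threshold follows the one-minute example in the text.

```python
import os
import time
from typing import Optional

STALE_AFTER_SEC = 60.0  # example threshold from the text (about one minute)


def acquire_lock_dir(lock_dir: str, now: Optional[float] = None) -> str:
    """Return "acquired", "duplicate", or "mkdir_failed"."""
    now = time.time() if now is None else now
    if os.path.isdir(lock_dir):                      # step 50: lock exists?
        age = now - os.stat(lock_dir).st_mtime       # step 52: check timestamp
        if age < STALE_AFTER_SEC:                    # step 54 negative
            return "duplicate"                       # step 56: stop, multiple activation
        try:
            os.rmdir(lock_dir)                       # step 58: remove the stale lock
        except FileNotFoundError:
            pass                                     # someone else already removed it
    try:
        os.mkdir(lock_dir)                           # step 60: try to create the lock
    except FileExistsError:
        return "mkdir_failed"                        # step 64: another process won
    return "acquired"                                # step 62 affirmative, go to step 66
```

A caller that receives "acquired" would go on to the blocking process and remove the directory with `os.rmdir` at the end, corresponding to step 84.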

When the time difference between the detections of I/O delays on multiple paths of the same group is extremely small, the path blocking control process for the second or a later detected path may run before the process for the first detected path has created the lock directory. The probability that the determination in step 50 is negative in the process for the second or a later detected path is therefore not zero.

However, directories are created through the OS, so even if multiple processes attempt to create a directory through the OS at the same time, the creation attempts are necessarily separated in time. In addition, path blocking control processes run by path blocking control shells for the same path group use the same name for the lock directory, and UNIX (registered trademark) operating systems prohibit the creation of a second directory with the same name. As a result, one of the competing path blocking control processes always fails to create the lock directory. Accordingly, if the determination in step 62 is negative, that is, if the lock directory creation attempted in step 60 has failed, it can be judged that another path blocking control process run by a redundantly started path blocking control shell created the lock directory first. The process then proceeds to step 64, outputs the message "directory creation failed", and the path blocking control process ends.
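The mutual exclusion that this paragraph relies on can be demonstrated with a small sketch. Threads stand in for the redundantly started shells here, and the lock directory name and helper names are hypothetical. `os.mkdir` raises `FileExistsError` when the directory already exists, so only one caller can succeed.

```python
import os
import tempfile
import threading


def try_lock(lock_dir: str) -> bool:
    """Attempt to create the lock directory; True means this caller won."""
    try:
        os.mkdir(lock_dir)
        return True
    except FileExistsError:
        return False


def race(n_workers: int = 8) -> int:
    """Start n_workers contenders at once and return how many acquired the lock."""
    lock_dir = os.path.join(tempfile.mkdtemp(), "path_block.lock")
    results = []
    barrier = threading.Barrier(n_workers)  # release all workers together

    def worker() -> None:
        barrier.wait()
        results.append(try_lock(lock_dir))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Calling `race()` returns the number of winners, which is 1 regardless of how the threads are scheduled, because directory creation either succeeds or fails as a whole.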

As described above, the path blocking control process according to this embodiment ends immediately, judging that the path blocking control shell for the same path group has very likely been started redundantly, in two cases: when the lock directory already exists and the time elapsed since its creation is less than the threshold, and when the lock directory was judged not to exist but its creation failed. As a result, even when the driver detects I/O delays on the paths of the same group almost simultaneously and multiple path blocking control processes for the same path group run in parallel at almost the same time, duplicate execution of the blocking process is reliably prevented. This reduces the load on the server computer 12 and allows consumed resources such as memory to be recovered early.

Therefore, according to the path blocking control process of this embodiment, when a failure affecting multiple paths occurs in the computer system 10, the load on the server computer 12 can be reduced and the consumption of resources such as memory can be suppressed regardless of when the I/O delay on each path is detected. The operation of the server computer 12 can thus be kept stable when a failure affecting multiple paths occurs in the computer system 10.

In the path blocking control process described above, after the blocking process is performed on the corresponding path group, the log trap for that path group is stopped when the number of unblocked paths is 1 for every logical disk of the disk device 24 (when the determination in step 80 is affirmative). The invention is not limited to this. In the above process, log traps for other path groups keep running, so the path blocking control shell is started if an I/O delay occurs on a path belonging to a different group. If the number of unblocked paths is 1 for every logical disk, however, the blocking process is unnecessary, and indeed undesirable, for every path group. Therefore, when the number of unblocked paths is 1 for every logical disk, the log traps for all path groups may be stopped.

The above describes a mode in which a log trap and a path blocking control shell are provided for each path group, but the invention is not limited to this. A single log trap and a single path blocking control shell may be provided instead. In that configuration, the log trap, triggered by the recording of an I/O delay message in the log information recording area, determines the path group corresponding to the recorded message and starts the path blocking control shell with the determined group as an argument. The started shell determines whether blocking is suspended for the notified group and, if it is not, performs the following processes: checking for a lock directory whose name corresponds to the notified group, attempting to create a lock directory of that name, performing the blocking process on each path of the notified group, and setting and storing information indicating that blocking is suspended for the notified group.
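The single-trap variant can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the log message format matched by `DELAY_RE`, the function names, and the in-memory set that records suspended groups. The lock directory handling and the per-path blocking are omitted and marked by a comment.

```python
import re
from typing import Optional, Set

# Hypothetical log format; the source does not specify the message layout.
DELAY_RE = re.compile(r"I/O delay.*path=(HBA\d+):LU\d+")


def group_of(line: str) -> Optional[str]:
    """Extract the path group (here, the HBA name) from an I/O delay message."""
    m = DELAY_RE.search(line)
    return m.group(1) if m else None


def path_block_shell(group: str, suspended: Set[str]) -> str:
    """Stand-in for the single shell, started with the group as its argument."""
    if group in suspended:
        return "skipped"          # blocking already suspended for this group
    # Lock directory check/creation and per-path blocking would happen here.
    suspended.add(group)          # record that blocking is now suspended
    return "blocked"


def log_trap(line: str, suspended: Set[str]) -> Optional[str]:
    """Stand-in for the single log trap: dispatch on the group in the message."""
    group = group_of(line)
    if group is None:
        return None               # not an I/O delay message
    return path_block_shell(group, suspended)
```

In this sketch the suspended-group set plays the role that stopping a per-group log trap plays in the main embodiment: a second I/O delay message for the same group is recognized and ignored without repeating the blocking process.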

The above describes, as an example of the external device according to the invention, the disk device 24 having a large number of disk drives driven by the disk drive unit (DKU) 26. The external device according to the invention is not limited to this configuration. It may be, for example, a storage device made up of semiconductor memory such as flash memory, or a communication device that communicates with a counterpart device other than the computer according to the invention; any device is applicable.

Furthermore, the number of external devices (logical disks LU), the number of path groups (the number of HBAs 20 and channel adapters (CHA) 32), the number of computers (server computers 12), and so on are not limited to the numbers shown in the drawings and can be changed as appropriate without departing from the scope of the invention.

The above describes a mode in which the path blocking control shell, which corresponds to the failure management program according to the invention, is stored (installed) in advance in the storage unit 18 of the server computer 12. The failure management program according to the invention can also be provided in a form recorded on a recording medium such as a CD-ROM or DVD-ROM.

10 Computer system
12 Server computer
14 CPU
16 Memory
18 Storage unit
20 HBA
22 Optical fiber cable
24 Disk device

Claims

A failure management apparatus realized by a computer provided with groups each consisting of a plurality of paths for communicating with mutually different external devices among a plurality of external devices,
wherein the plurality of paths in a group physically share a single communication route, the failure management apparatus comprising control means that, when notified of the occurrence of a failure in one of the paths, performs a blocking process that blocks all paths belonging to the same group as the path for which the failure was notified.

The failure management apparatus according to claim 1, further comprising:
management means for registering and managing, in a state table, state information representing the state of each path in the group, and for performing a management process that includes selecting a path to be used for communication between the computer and each external device while excluding from selection any path whose registered state information indicates blocking; and
activation control means for stopping activation of the program that blocks paths when, as a result of the blocking process by the control means, the state information registered in the state table for all paths belonging to the same group as the path for which the failure was notified has been rewritten to information indicating blocking.

A failure management program that causes a computer, provided with groups each consisting of a plurality of paths for communicating with mutually different external devices among a plurality of external devices,
wherein the plurality of paths in a group physically share a single communication route, to function as control means that, when notified of the occurrence of a failure in one of the paths, performs a blocking process that blocks all paths belonging to the same group as the path for which the failure was notified.