JP2005122409A

JP2005122409A - Failure tracking method and device for system having virtual layer, and computer program

Info

Publication number: JP2005122409A
Application number: JP2003355708A
Authority: JP
Inventors: Peter John Deacon; ピーター・ジョン・ディーコン; Carlos Francisco Fuente; カルロス・フランシスコ・フエンテ; William James Scales; ウィリアム・ジェームス・スケールズ; Barry Douglas White; バリー・ダグラス・ホワイト
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2003-10-15
Filing date: 2003-10-15
Publication date: 2005-05-12

Abstract

PROBLEM TO BE SOLVED: To provide a failure tracking method for a system having a virtualization layer.

SOLUTION: A stack system detects an error at a user application interface, identifies the associated root-cause error at a lower stack level, creates an error trace entry for that error, associates an error log identifier with the error trace entry, and creates, from the combined error log identifier and error trace entry, an error identifier that is unique across the plurality of host systems of the stack system. When a service is likely to have failed because of the root-cause error, the error identifier is communicated to any requester of the service at the user application interface of one or more host systems. The error detected at the user application interface of one or more host systems is thereby associated with the root-cause error at a stack level below the virtualization layer.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to error tracking, and in particular to error tracking in an environment having a virtualization layer between a host application and a device.

Fault detection and isolation (tracing a problem in a complex system to its root cause) is a very important problem. Some environments simply lack error reporting information, but in many enterprise-scale environments considerable effort has gone into surfacing detected faults and logging them. In a fault-tolerant system, such information is essential to ensure continued fault tolerance. Without an effective fault detection and repair mechanism, a fault-tolerant system can only mask a problem until a further fault causes a failure.

When a problem occurs, its impact is often difficult to predict. In a storage controller subsystem, for example, there are many components in the path, or "stack", from the disk drive to the host application. It is difficult to relate the errors actually detected and logged to the impact seen by the application or the user's host system.

When many errors occur at once, it is particularly difficult to determine which of them led to a particular application failure. A solution that simply repairs every reported error would certainly help, but a priority-based approach that repairs the errors affecting the most business-critical applications is more cost-effective and of greater value to the users of the system.

Any lack of traceability also reduces confidence that the right error has been repaired to resolve a particular problem experienced by the user or application.

Today's systems, with RAID arrays, advanced functions such as Flash Copy, and caches, already add considerable confusion to top-down analysis (tracing a failure from the application down to the system components). Picking out the error that is the root cause of a failure takes a great deal of time and knowledge.

The introduction of a virtualization layer in many systems magnifies this problem further. Virtualization not only adds a further layer between the application and the device; many virtualization schemes also allow data to be moved dynamically within the underlying real subsystems, which makes accurate fault tracing even more difficult.

It is known, for example from the teaching of US Pat. No. 5,974,544, to maintain a logical defect list at the RAID controller level of a storage system that uses a redundant array of inexpensive disks. However, a system that uses several such arrays together with other peripheral devices, especially where they form part of a storage area network (SAN), introduces layers of software with functions such as virtualization that make it still harder to trace an error from its external symptom to its root cause.

US Pat. No. 5,974,544

Accordingly, there is a need for a method, system, or computer program that alleviates this problem. The problem should preferably be alleviated while imposing only a minimal burden on the customer in terms of money, processing resources, and time.

Accordingly, in a first aspect, the present invention provides a method in a stack system for associating an error detected at a user application interface of one or more of a plurality of host systems with a root-cause error at a stack level below a virtualization layer, the method comprising the steps of: detecting an error at the user application interface; identifying the associated root-cause error at a lower stack level; creating an error tracking entry for the error; associating an error log identifier with the error tracking entry; creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system; and communicating the error identifier to any requester of a service at the user application interface of one or more of the plurality of host systems when the service is likely to have failed because of the root-cause error.

Preferably, the step of creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system comprises combining the error tracking entry and the error log identifier with an integer value to create the error identifier that is unique within the plurality of host systems.
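
The combining step can be pictured as simple bit-packing. The following is a minimal sketch only, assuming that the error log identifier and the per-entry integer each fit in 32 bits; the function names `make_error_id` and `split_error_id` and the field widths are illustrative assumptions and are not taken from the specification.

```python
from typing import Tuple

LOG_ID_BITS = 32   # assumed width of the error log identifier
ENTRY_BITS = 32    # assumed width of the per-entry integer value


def make_error_id(log_id: int, entry_number: int) -> int:
    """Pack a log identifier and a per-entry integer into one value.

    Two entries collide only if both the log and the entry number match,
    so the result is unique across every host that shares this scheme.
    """
    if not 0 <= log_id < (1 << LOG_ID_BITS):
        raise ValueError("log_id out of range")
    if not 0 <= entry_number < (1 << ENTRY_BITS):
        raise ValueError("entry_number out of range")
    return (log_id << ENTRY_BITS) | entry_number


def split_error_id(error_id: int) -> Tuple[int, int]:
    """Recover (log_id, entry_number) from a packed identifier."""
    return error_id >> ENTRY_BITS, error_id & ((1 << ENTRY_BITS) - 1)
```

Packing keeps the identifier a single integer, which is convenient when it has to travel inside a fixed-size response field; a tuple or string would serve equally well where size is not a constraint.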

Preferably, the root-cause error at the lower stack level is in a peripheral device of the stack system.

Preferably, the peripheral device is a storage device.

Preferably, the stack system includes a storage area network.

In a second aspect, the present invention provides an apparatus for associating an error detected at a user application interface of one or more of a plurality of host systems with a root-cause error at a stack level below a virtualization layer, the apparatus comprising: an error detector for detecting an error at the user application interface; a diagnostic component for identifying the associated root-cause error at a lower stack level; a tracking component for creating an error tracking entry for the error; an identification component for associating an error log identifier with the error tracking entry; a system-wide identification component for creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system; and a communication component for communicating the error identifier to any requester of a service at the user application interface of one or more of the plurality of host systems when the service is likely to have failed because of the root-cause error.

Preferably, the system-wide identification component for creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system comprises a component for combining the error tracking entry and the error log identifier with an integer value to create the error identifier that is unique within the plurality of host systems.

In a third aspect, the present invention further provides a computer program product tangibly embodied in a storage medium which, when loaded into a computer system and executed thereon, causes the computer system to associate an error detected at a user application interface of one or more of a plurality of host systems with a root-cause error at a stack level below a virtualization layer, the computer program product comprising: computer program code means for detecting an error at the user application interface; computer program code means for identifying the associated root-cause error at a lower stack level; computer program code means for creating an error tracking entry for the error; computer program code means for associating an error log identifier with the error tracking entry; computer program code means for creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system; and computer program code means for communicating the error identifier to any requester of a service at the user application interface of one or more of the plurality of host systems when the service is likely to have failed because of the root-cause error.

A preferred embodiment of the present invention, which isolates faults in a virtualized storage subsystem by tagging errors with root-cause information using a unique error identifier, offers the advantage that multiple errors arising from a single fault in the system can be diagnosed at once as being caused by that one fault. This speeds up the diagnostic procedure and reduces potential downtime in what is otherwise a highly available system.

Preferred embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings.

A preferred embodiment of the present invention starts from a conventional error log (170) of the kind already found in many enterprise-scale environments. The error log is used to record faults detected by components of the system. Typically these are the components that interface to the "outside", such as network or driver layers, and are the first to detect and handle an error.

A unique identifier (210) is added to each entry of the existing conventional error log. This can be done using a large (for example 32-bit) integer for each entry. When qualified by the identifier of the log, the unique identifier denotes the specific event that may subsequently cause input/output service or other operations to fail. The error log contains supplementary information detailing the detected fault (220), sufficient for a user or service technician to repair the root-cause fault.
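
As an illustration of such a log, the sketch below keeps one record per detected fault and hands back the per-entry integer when the fault is recorded. The class and field names (`ErrorLog`, `ErrorLogEntry`, `record`, `lookup`) are hypothetical, and the counter is simply assumed to stay within the 32-bit range mentioned above.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, Iterator


@dataclass
class ErrorLogEntry:
    unique_id: int    # per-entry integer, cf. unique identifier (210)
    error_code: str   # what was detected, cf. detected fault (220)
    details: str      # supplementary information for the repair action


@dataclass
class ErrorLog:
    log_id: int       # identifier of this log; qualifies every unique_id
    _entries: Dict[int, ErrorLogEntry] = field(default_factory=dict)
    _next_id: Iterator[int] = field(default_factory=lambda: count(1))

    def record(self, error_code: str, details: str) -> int:
        """Log a detected fault and return its per-entry identifier."""
        uid = next(self._next_id)   # assumed never to exceed 32 bits here
        self._entries[uid] = ErrorLogEntry(uid, error_code, details)
        return uid

    def lookup(self, uid: int) -> ErrorLogEntry:
        """Fetch the entry a technician needs in order to repair the fault."""
        return self._entries[uid]
```

A caller would keep the pair `(log.log_id, uid)` together, since the per-entry integer identifies an event only when qualified by the log it came from.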

The unique identifier is then used as part of the response to any service request (for example an input/output request) that is likely to have failed because of that error. On receiving a failure response to a request, the issuer of the request determines which, if any, of its own services or requests are likely to have failed as a result. The issuer fails those requests of its own and, in doing so, presents again the unique identifier it originally received, which indicates the cause of those failures.
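
The hand-off between a request issuer and the component below it might look like the following sketch. The `Response` type, its `cause_id` field, and the `forward` helper are assumptions made for illustration; the specification does not prescribe a response format.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Response:
    ok: bool
    cause_id: Optional[int] = None   # identifier of the event that caused the failure


def forward(lower: Callable[[str], Response], request: str) -> Response:
    """Issue a request to the lower component and relay any failure upward."""
    reply = lower(request)
    if reply.ok:
        return Response(ok=True)
    # Fail our own request, presenting again the identifier first received
    # rather than inventing a new cause.
    return Response(ok=False, cause_id=reply.cause_id)


def failing_disk(request: str) -> Response:
    """Stand-in for a device that fails every request with event 1234."""
    return Response(ok=False, cause_id=1234)


# Two stacked issuers: the identifier survives both hops unchanged.
top = forward(lambda r: forward(failing_disk, r), "write block 9")
```

Because each issuer copies the identifier it received, the value seen at the top of the chain is the one assigned where the fault was first logged.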

The identity of the event that caused the failure is thus passed up through the whole chain of failing requests until it reaches the originator of each request. The originator then has the information it needs to determine exactly which error events must be repaired for each detected failure. This speeds up the repair process and ensures that the applications most at risk are restored first. It also gives greater confidence that the right error has been repaired, avoiding the time delay and expense of a failed recovery.
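
One way an originator could turn the received identifiers into a repair order is sketched below. The `Failure` record and its numeric `priority` field are hypothetical; the specification says only that the applications most at risk should be restored first, not how priority is expressed.

```python
from typing import Dict, Iterable, List, NamedTuple


class Failure(NamedTuple):
    application: str
    priority: int    # assumption: a larger value means more business-critical
    cause_id: int    # unique identifier received with the failed request


def repair_order(failures: Iterable[Failure]) -> List[int]:
    """Return cause identifiers, most critical affected application first."""
    worst: Dict[int, int] = {}
    for f in failures:
        # Remember the highest priority among applications hit by each cause.
        worst[f.cause_id] = max(worst.get(f.cause_id, f.priority), f.priority)
    return sorted(worst, key=lambda cause: -worst[cause])


failures = [
    Failure("reporting", 1, 101),
    Failure("billing", 9, 202),
    Failure("archive", 2, 101),
]
order = repair_order(failures)
```

Here cause 202 sorts ahead of cause 101 because it affects the higher-priority application, even though cause 101 affects more applications.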

In the presently most preferred embodiment, the components that pass on requests are the layers of a software stack (100), which perform functions such as managing the RAID controller (110), virtualization (120), flash copy (130), caching (140), remote copy (150), and interfacing to the host system (160). The method of the preferred embodiment allows for traceability up through the stack, across the whole system, to the edge of the storage controller.

Each component of the software stack can itself raise an error as a result of the original failing event. As an example, a write operation from an application server (190) may be returned to the SCSI back end (110) as failed, that is, the physical storage has for some reason failed the write. As a result, an error is logged and the unique identifier (210) is returned to the component raising the problem. The failed write is returned to the layer above together with the unique identifier, and so on back up the stack. At each layer this may cause failures within that component. For example, if a flash copy is active against the disk that failed the write, the flash copy operation is suspended and an error is raised. This new error is itself assigned a unique identifier, but it is marked with the unique identifier passed up by the component below, that is, the root cause (230). The same may happen at each layer of the software stack. Finally, the original error is returned, as part of the SCSI sense data, to the application server that requested the write.
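
The failed-write example can be traced in a few lines of code. This is a simplified sketch under stated assumptions: the log structure, the component names, and the helpers `failed_write` and `related` are illustrative, and only the back end and flash copy layers are modelled as logging errors.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional


@dataclass
class LogEntry:
    unique_id: int
    component: str
    message: str
    root_cause: Optional[int]   # None for the originating event, cf. (230)


class ErrorLog:
    def __init__(self) -> None:
        self.entries: List[LogEntry] = []
        self._ids = count(1)

    def record(self, component: str, message: str,
               root_cause: Optional[int] = None) -> int:
        uid = next(self._ids)
        self.entries.append(LogEntry(uid, component, message, root_cause))
        return uid


def failed_write(log: ErrorLog, flash_copy_active: bool) -> int:
    """Walk a failed write up the stack; return the id given to the host."""
    # Back end: the physical write fails and the originating event is logged.
    root = log.record("scsi_back_end", "write failed on physical disk")
    # Flash copy: suspended, and its own error is marked with the root cause.
    if flash_copy_active:
        log.record("flash_copy", "flash copy suspended", root_cause=root)
    # Remaining layers pass the failure up; the host receives the root id.
    return root


def related(log: ErrorLog, root: int) -> List[LogEntry]:
    """All entries that are, or were caused by, the given root event."""
    return [e for e in log.entries if root in (e.unique_id, e.root_cause)]
```

Given the identifier returned to the application server, `related` yields both the disk error and the suspended flash copy, which is the correlation the paragraph above describes.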

The user can then relate the failed write operation to the physical disk that failed the write, and to the operations and functions that failed within the software stack, such as the flash copy operation described above.

It will be appreciated that the method described above will typically be carried out in software running on one or more processors (not shown), and that the software may be provided as a computer program element carried on any suitable data carrier (also not shown) such as a magnetic or optical computer disk. Channels for the transmission of data may likewise include storage media of all kinds as well as signal-carrying media, such as wired or wireless signal media.

The present invention may suitably be embodied as a computer program product for use with a computer system. Such an implementation may comprise a series of computer-readable instructions either fixed on a tangible medium, such as a computer-readable medium, for example a diskette, CD-ROM, ROM, or hard disk, or transmittable to a computer system via a modem or other interface device, over either a tangible medium, including but not limited to optical or analog communication lines, or intangibly using wireless techniques, including but not limited to microwave, infrared, or other transmission techniques. The series of computer-readable instructions embodies all or part of the functionality previously described herein.

Those skilled in the art will appreciate that such computer-readable instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Further, such instructions may be stored using any memory technology, present or future, including but not limited to semiconductor, magnetic, or optical, or transmitted using any communication technology, present or future, including but not limited to optical, infrared, or microwave. It is contemplated that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation, for example as a software package; preloaded with a computer system, for example on a system ROM or fixed disk; or distributed from a server or electronic bulletin board over a network, for example the Internet or World Wide Web.

It will be appreciated that various modifications to the embodiment described above will be apparent to a person skilled in the art.

In summary, the following matters are disclosed regarding the configuration of the present invention.

(1) A method in a stack system for associating an error detected at a user application interface of one or more of a plurality of host systems with a root-cause error at a stack level below a virtualization layer, the method comprising the steps of:
detecting an error at the user application interface;
identifying the associated root-cause error at a lower stack level;
creating an error tracking entry for the error;
associating an error log identifier with the error tracking entry;
creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system; and
communicating the error identifier to any requester of a service at the user application interface of one or more of the plurality of host systems when the service is likely to have failed because of the root-cause error.
(2) The method according to (1) above, wherein the step of creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system comprises combining the error tracking entry and the error log identifier with an integer value to create the error identifier that is unique within the plurality of host systems.
(3) The method according to (1) above, wherein the root-cause error at the lower stack level is in a peripheral device of the stack system.
(4) The method according to (3) above, wherein the peripheral device is a storage device.
(5) The method according to (1) above, wherein the stack system includes a storage area network.
(6) An apparatus for associating an error detected at a user application interface of one or more of a plurality of host systems with a root-cause error at a stack level below a virtualization layer, the apparatus comprising:
an error detector for detecting an error at the user application interface;
a diagnostic component for identifying the associated root-cause error at a lower stack level;
a tracking component for creating an error tracking entry for the error;
an identification component for associating an error log identifier with the error tracking entry;
a system-wide identification component for creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system; and
a communication component for communicating the error identifier to any requester of a service at the user application interface of one or more of the plurality of host systems when the service is likely to have failed because of the root-cause error.
(7) The apparatus according to (6) above, wherein the system-wide identification component comprises a component for combining the error tracking entry and the error log identifier with an integer value to create the error identifier that is unique within the plurality of host systems.
(8) The apparatus according to (6) above, wherein the root-cause error at the lower stack level is in a peripheral device of the stack system.
(9) The apparatus according to (6) above, wherein the peripheral device is a storage device.
(10) The apparatus according to (6) above, wherein the stack system includes a storage area network.
(11) A computer program which, when loaded into a computer system and executed thereon, causes the computer system to associate an error detected at a user application interface of one or more of a plurality of host systems with a root-cause error at a stack level below a virtualization layer, the computer program comprising:
computer program code means for detecting an error at the user application interface;
computer program code means for identifying the associated root-cause error at a lower stack level;
computer program code means for creating an error tracking entry for the error;
computer program code means for associating an error log identifier with the error tracking entry;
computer program code means for creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system; and
computer program code means for communicating the error identifier to any requester of a service at the user application interface of one or more of the plurality of host systems when the service is likely to have failed because of the root-cause error.

A diagram showing an example of the component stack of a virtualization subsystem.
A diagram showing an example of an error log according to the presently preferred embodiment of the present invention.

Explanation of symbols

100 Virtualization subsystem (software stack)
110 SCSI back end
120 Virtualization
130 Flash copy
140 Cache
150 Remote copy
160 SCSI front end
170 Error log
190 Application server
220 Error code
230 Root cause

Claims

A method in a stack system for associating an error detected at a user application interface of one or more of a plurality of host systems with a root-cause error at a stack level below a virtualization layer, the method comprising the steps of:
detecting an error at the user application interface;
identifying the associated root-cause error at a lower stack level;
creating an error tracking entry for the error;
associating an error log identifier with the error tracking entry;
creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system; and
communicating the error identifier to any requester of a service at the user application interface of one or more of the plurality of host systems when the service is likely to have failed because of the root-cause error.

The method of claim 1, wherein the step of creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system comprises combining the error tracking entry and the error log identifier with an integer value to create the error identifier that is unique within the plurality of host systems.

The method of claim 1, wherein the root-cause error at the lower stack level is in a peripheral device of the stack system.

The method of claim 3, wherein the peripheral device is a storage device.

The method of claim 1, wherein the stack system comprises a storage area network.

An apparatus for associating an error detected at a user application interface of one or more of a plurality of host systems with a root-cause error at a stack level below a virtualization layer, the apparatus comprising:
an error detector for detecting an error at the user application interface;
a diagnostic component for identifying the associated root-cause error at a lower stack level;
a tracking component for creating an error tracking entry for the error;
an identification component for associating an error log identifier with the error tracking entry;
a system-wide identification component for creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system; and
a communication component for communicating the error identifier to any requester of a service at the user application interface of one or more of the plurality of host systems when the service is likely to have failed because of the root-cause error.

The apparatus of claim 6, wherein the system-wide identification component comprises a component for combining the error tracking entry and the error log identifier with an integer value to create the error identifier that is unique within the plurality of host systems.

The apparatus of claim 6, wherein the root-cause error at the lower stack level is in a peripheral device of the stack system.

The apparatus of claim 6, wherein the peripheral device is a storage device.

The apparatus of claim 6, wherein the stack system comprises a storage area network.

A computer program which, when loaded into a computer system and executed thereon, causes the computer system to associate an error detected at a user application interface of one or more of a plurality of host systems with a root-cause error at a stack level below a virtualization layer, the computer program comprising:
computer program code means for detecting an error at the user application interface;
computer program code means for identifying the associated root-cause error at a lower stack level;
computer program code means for creating an error tracking entry for the error;
computer program code means for associating an error log identifier with the error tracking entry;
computer program code means for creating, from the combined error log identifier and error tracking entry, an error identifier that is unique within the plurality of host systems of the stack system; and
computer program code means for communicating the error identifier to any requester of a service at the user application interface of one or more of the plurality of host systems when the service is likely to have failed because of the root-cause error.