JP2012059063A - Computer system management method and management system

Info

Publication number: JP2012059063A
Application number: JP2010202274A
Authority: JP
Inventors: Masataka Nagura; 正剛名倉; Takayuki Nagai; 崇之永井; Kimitoku Sugauchi; 公徳菅内
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2010-09-09
Filing date: 2010-09-09
Publication date: 2012-03-22
Anticipated expiration: 2030-09-09
Also published as: JP5432867B2; WO2012032676A1

Abstract

PROBLEM TO BE SOLVED: To provide a failure analysis result classification function capable of reducing the time required to recover from a failure in a monitored device. SOLUTION: A failure analysis result (a failure cause candidate) obtained during failure analysis processing is analyzed to find which other failure cause candidates are related to the device abnormality events that served as the grounds for deriving it, so that the failure cause candidates are classified by the range of effect that handling them would have. The classification result is then displayed on a GUI.

Description

The present invention relates to a management method and a management system for a computer system, for example, to a technique for managing failures of the host computers, network switches, and storage systems that constitute the computer system.

When managing a computer system, a causal event is detected from among a plurality of failures, or signs of failures, detected in the system, as disclosed for example in Patent Document 1. More specifically, in Patent Document 1, management software turns a threshold violation of a performance value in a managed device into an event and accumulates the information in an event DB.

The management software also has an analysis engine for analyzing the causal relationships among a plurality of failure events that have occurred in the managed devices. The analysis engine accesses a configuration DB holding inventory information on the managed devices, recognizes the in-device components on an I/O path, and recognizes the components that can affect the performance of a logical volume on a host as one group called a “topology”. When an event occurs, the analysis engine applies to each topology an analysis rule consisting of a predetermined conditional statement and an analysis result, and thereby constructs an expansion rule. The expansion rule contains a cause event, which is the cause of performance degradation in another device, and the group of related events caused by it. Specifically, the event described in the THEN part of the rule as the cause of the failure is the cause event, and the events described in the IF part other than the cause event are the related events.

Patent Document 1: U.S. Pat. No. 7,107,185

In the failure analysis function of Patent Document 1, combinations of events received from managed devices and the corresponding failure cause candidates are described in advance as rules in IF-THEN form. The failure analysis function calculates the certainty factor of the failure cause candidate described in the THEN part by calculating the occurrence rate of the events described in the IF part of the rule. The calculated certainty factor and the failure cause candidate are displayed on a GUI at the user's request.
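The certainty factor calculation described above can be sketched as follows. This is a minimal illustration, assuming that the "occurrence rate" is the fraction of a rule's IF-part events that have actually been observed; the `Rule` class, the `certainty` function, and the event names are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical IF-THEN rule: condition events imply a cause candidate."""
    condition_events: frozenset  # events listed in the IF part
    cause: str                   # failure cause candidate in the THEN part


def certainty(rule: Rule, observed_events: set) -> float:
    """Return the fraction of the rule's IF-part events that were observed."""
    if not rule.condition_events:
        return 0.0
    matched = rule.condition_events & observed_events
    return len(matched) / len(rule.condition_events)


# Assumed example: two condition events, one of which has occurred.
rule = Rule(
    condition_events=frozenset({"host_volume_slow", "controller_overload"}),
    cause="controller_overload",
)
print(certainty(rule, {"host_volume_slow"}))  # 0.5
```

With one of the two IF-part events observed, the sketch reports a certainty factor of 0.5 for the cause candidate.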

However, with such a conventional failure analysis function, when failures occur frequently within a short period, the number of stored failure analysis results grows large, and the administrator may be unable to judge which failure really needs to be dealt with. As a result, it takes longer to resolve the failure in the monitored device, which may make the situation more serious.

The present invention has been made in view of this situation, and provides a function for shortening the time required to resolve a failure in a monitored device.

To solve the above problem, the present invention classifies the cause candidates obtained by the failure cause analysis processing according to their range of effect. The cause candidates are classified and grouped by the failure events they relate to, and the groups are displayed distinguishably on a GUI. More specifically, when a group of cause candidates is inferred as the result of failure cause analysis, the cause candidates whose derivation is grounded in the same device abnormal state are classified together. The cause candidates derived from the same device abnormal state are then regarded as a set of cause candidates for resolving the same failure, and are classified and displayed on the GUI.

That is, according to the present invention, the management system acquires a processing performance value indicating the processing performance of a node device, and detects from the acquired value that a failure has occurred in the node device. The management system then applies the detected failure to analysis rules, each of which indicates the relationship between a combination of one or more condition events that can occur in the node device and a conclusion event regarded as the failure cause for that combination, and calculates a certainty factor, which is information indicating the likelihood that the failure has occurred in the node device. Furthermore, the management system selects one of the plurality of conclusion events regarded as failure causes as an origin cause candidate, and extracts the condition events related to the origin cause candidate. The management system also selects, as related cause candidates, one or more conclusion events that are related to the extracted condition events and that differ from the conclusion event of the origin cause candidate, and classifies the conclusion event of the origin cause candidate and the conclusion events of the related cause candidates separately from the other conclusion events. The classified conclusion events are displayed on a display screen through a GUI.
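The classification step described above can be sketched as follows. This is an illustrative reading rather than the claimed procedure: it assumes that a conclusion event is "related" to a condition event when that condition event appears in its rule and has actually been observed. The function name `classify_candidates` and all event names are hypothetical.

```python
def classify_candidates(rules: dict[str, set[str]], observed: set[str]) -> list[set[str]]:
    """Group cause candidates that share observed condition events.

    rules maps each conclusion event (cause candidate) to its condition events.
    """
    # Only condition events that actually occurred count as derivation grounds.
    grounds = {cause: conds & observed for cause, conds in rules.items()}
    unclassified = [cause for cause in rules if grounds[cause]]
    groups = []
    while unclassified:
        origin = unclassified.pop(0)      # origin cause candidate
        origin_events = grounds[origin]   # condition events related to the origin
        # Related cause candidates: other conclusions sharing one of those events.
        related = [c for c in unclassified if grounds[c] & origin_events]
        group = {origin, *related}
        unclassified = [c for c in unclassified if c not in group]
        groups.append(group)
    return groups


# Assumed example: two candidates share an event on HOST1, a third does not.
rules = {
    "CTL1 overload": {"HOST1 drive slow", "CTL1 busy"},
    "LU1 overload": {"HOST1 drive slow", "LU1 busy"},
    "CTL2 overload": {"HOST2 drive slow", "CTL2 busy"},
}
observed = {"HOST1 drive slow", "CTL1 busy", "HOST2 drive slow"}
groups = classify_candidates(rules, observed)
print(groups)
```

In this example the first two candidates fall into one group because both are grounded in the same observed event on HOST1, while the third candidate forms its own group.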

Further features of the present invention will become apparent from the following description of embodiments and the accompanying drawings.

According to the present invention, when failure analysis results are presented to the administrator (user), the inferred failure cause candidates are classified and displayed according to the failure events associated with the failures that handling them would resolve. This allows the administrator to easily judge the handling priority of each analysis result, and reduces the load required to check analysis results and respond to failures.

A diagram showing an example of the physical configuration of the computer system.
A diagram showing a detailed configuration example of the host computer.
A diagram showing a detailed configuration example of the storage apparatus.
A diagram showing a detailed configuration example of the management server.
A diagram showing a configuration example of the device performance management table held by the management server.
A diagram showing a configuration example of the volume topology management table held by the management server.
A diagram showing a configuration example of the event management table held by the management server.
A diagram showing configuration example (1) of a general rule held by the management server.
A diagram showing configuration example (2) of a general rule held by the management server.
A diagram showing configuration example (1) of an expansion rule held by the management server.
A diagram showing configuration example (2) of an expansion rule held by the management server.
A diagram showing configuration example (3) of an expansion rule held by the management server.
A diagram showing configuration example (4) of an expansion rule held by the management server.
A diagram showing configuration example (5) of an expansion rule held by the management server.
A diagram showing configuration example (6) of an expansion rule held by the management server.
A diagram showing configuration example (7) of an expansion rule held by the management server.
A diagram showing configuration example (8) of an expansion rule held by the management server.
A diagram showing a configuration example of the analysis result management table held by the management server.
A flowchart outlining the performance information acquisition processing performed by the management server.
A flowchart explaining the failure analysis processing performed by the management server.
A flowchart explaining the cause candidate classification processing performed by the management server.
A diagram showing a configuration example of the failure analysis result screen displayed by the management server in the first embodiment.
A flowchart explaining the processing of the management server when the administrator selects a classified cause candidate in the second embodiment.
A flowchart explaining the cause candidate reclassification processing performed by the management server in the second embodiment.
A diagram showing a configuration example of the failure analysis result screen displayed by the management server in the second embodiment.

Embodiments of the present invention relate to failure cause analysis for resolving failures in IT systems. As described above, the conventional technique also presents failure cause candidates to the administrator so that failures can be dealt with. However, when multiple failure causes give rise to a large number of cause candidates, failures cannot be handled efficiently unless it is known which cause candidate relates to which of the failure causes that have actually occurred. For example, even if the top several candidates by certainty factor are acted on, those candidates may in fact all stem from a failure in the same device. Conversely, if a failure has also occurred in another device and the cause candidates stemming from it are presented with low priority, those candidates should be handled at the same level as the top candidates. Yet software that performs failure cause analysis has no method for grouping the candidates according to their range of effect when multiple failure causes produce many cause candidates. It is therefore difficult for the administrator to judge which cause candidates should be handled first. In other words, because conventional failure analysis results carry no information indicating which failure cause candidates are related to one another, it takes the administrator longer to reach the analysis results that should be acted on first, and consequently longer to resolve the failure.

Accordingly, the embodiments of the present invention provide a function for presenting cause candidates that are more reliable and should be handled preferentially.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. Note, however, that these embodiments are merely examples for realizing the present invention and do not limit its technical scope. Components common to the drawings are given the same reference numbers.

In this specification, the information used in the present invention is described using the expression “aaa table”, but it may also be expressed as an “aaa list”, “aaa DB”, “aaa queue”, and so on, or in a form other than a data structure such as a table, list, DB, or queue. Therefore, to indicate that the information used in the present invention does not depend on the data structure, an “aaa table”, “aaa list”, “aaa DB”, “aaa queue”, and the like may be referred to as “aaa information”.

Further, when describing the content of each piece of information, the expressions “identification information”, “identifier”, “name”, and “ID” are used; these are interchangeable.

Furthermore, in the following description of the processing operations of the present invention, a “program” or “module” may be described as the acting subject. Because a program or module performs its defined processing by being executed by a processor while using memory and a communication port (communication control device), the description may be read as processing performed by the processor. Processing disclosed with a program or module as the subject may also be processing performed by a computer such as a management server, or by an information processing apparatus. Part or all of a program may be realized by dedicated hardware. The various programs may be installed on each computer from a program distribution server or from storage media.

The embodiments described in this specification do not address the scale of the system to be managed. However, the larger the system, the more likely it is that failures will occur at multiple locations at the same time. The effects of the present invention can therefore be enjoyed to a greater extent when it is applied to a large-scale system.

(1) First Embodiment
The first embodiment relates to failure cause candidate display processing performed by management software (included, for example, in a management server).

<System configuration>
FIG. 1 is a diagram showing the physical configuration of a computer system according to the present invention. The computer system 1 has a storage apparatus 20000, a host computer 10000, a management server 30000, a WEB browser activation server 35000, and an IP switch 40000, which are connected by a network 45000.

The host computers 10000 to 10010 receive file I/O requests from, for example, client computers (not shown) connected to them, and access the storage apparatuses 20000 to 20010 based on those requests. The management server (management computer) 30000 manages the operation of the entire computer system.

The WEB browser activation server 35000 communicates with the GUI display processing module 32400 of the management server 30000 via the network 45000 and displays various information on a WEB browser. The user manages the devices in the computer system by referring to the information displayed on the WEB browser on the WEB browser activation server. The management server 30000 and the WEB browser activation server 35000 may, however, be implemented as a single server.

<Internal configuration of host computer>
FIG. 2 is a diagram showing a detailed internal configuration example of the host computer 10000 according to the present invention. The host computer 10000 has a port 11000 for connecting to the network 45000, a processor 12000, and a memory 13000 (it may also include a disk device), which are interconnected via a circuit such as an internal bus.

The memory 13000 stores a business application 13100 and an operating system 13200.

The business application 13100 uses a storage area provided by the operating system 13200 and performs data input/output (hereinafter, I/O) on that storage area.

The operating system 13200 executes processing that causes the business application 13100 to recognize, as storage areas, the logical volumes on the storage apparatuses 20000 to 20010 connected to the host computer 10000 via the network 45000.

The port 11000 is represented in FIG. 2 as a single port that includes both an I/O port for communicating with the storage apparatus 20000 by iSCSI and a management port through which the management server 30000 acquires management information from the host computers 10000 to 10010; however, it may be divided into an I/O port for iSCSI communication and a separate management port.

<Internal configuration of storage apparatus>
FIG. 3 is a diagram showing a detailed internal configuration example of the storage apparatus 20000 according to the present invention. The storage apparatus 20010 has the same configuration.

The storage apparatus 20000 has I/O ports 21000 and 21010 for connecting to the host computer 10000 via the network 45000, a management port 21100 for connecting to the management server 30000 via the network 45000, a management memory 23000 for storing various management information, RAID groups 24000 to 24010 for storing data, and controllers 25000 and 25010 for controlling the data and the management information in the management memory, which are interconnected via a circuit such as an internal bus. Note that the connection of the RAID groups 24000 to 24010 means, more precisely, that the storage devices constituting the RAID groups 24000 to 24010 are connected to the other components.

The management memory 23000 stores a management program 23100 for the storage apparatus. The management program 23100 communicates with the management server 30000 via the management port 21100 and provides the management server 30000 with the configuration information of the storage apparatus 20000.

Each of the RAID groups 24000 to 24010 is composed of one or more of the magnetic disks 24200, 24210, 24220, and 24230. When a RAID group is composed of multiple magnetic disks, those disks may form a RAID configuration. The RAID groups 24000 to 24010 are logically divided into a plurality of volumes 24100 to 24110.

Note that the logical volumes 24100 and 24110 need not form a RAID configuration as long as they are configured using the storage areas of one or more magnetic disks. Furthermore, as long as a storage area corresponding to a logical volume is provided, storage devices using other storage media, such as flash memory, may be used in place of magnetic disks.

Each of the controllers 25000 and 25010 contains a processor that controls the storage apparatus 20000 and a cache memory that temporarily stores data exchanged with the host computer 10000. Each controller sits between the I/O ports and the RAID groups and passes data between them.

Note that the storage apparatus 20000 may have a configuration other than that of FIG. 3 and the above description, as long as it includes a storage controller that provides logical volumes to any of the host computers, receives access requests (that is, I/O requests), and reads from and writes to storage devices in response to the received access requests, together with the aforementioned storage devices that provide the storage areas. For example, the storage controller and the storage devices that provide the storage areas may be housed in separate enclosures. That is, in the example of FIG. 3 the management memory 23000 and the controllers 25000 and 25010 are provided as separate entities, but they may be configured as a single integrated storage controller. In this specification, the storage apparatus may also be called a storage system, as an expression covering both the case where the storage controller and the storage devices are in the same enclosure and the case where they are in separate enclosures.

<Internal configuration of management server>
FIG. 4 is a diagram showing a detailed internal configuration example of the management server 30000 according to the present invention. The management server 30000 has a management port 31000 for connecting to the network 45000, a processor 31100, a memory 32000 such as a cache memory, a secondary storage device (secondary storage area) 33000 such as an HDD, an output device 31200 such as a display device for outputting the processing results described later, and an input device 31300 such as a keyboard with which the storage administrator enters instructions, which are interconnected via a circuit such as an internal bus.

The memory 32000 stores a program control module 32100, a configuration management information acquisition module 32200, a device performance acquisition module 32300, a GUI display processing module 32400, an event analysis processing module 32500, and a rule expansion module 32600. In FIG. 4 each module is provided as a software module in the memory 32000, but it may instead be provided as a hardware module. The processing performed by each module may also be provided as one or more pieces of program code, and there need not be clear boundaries between modules. A module may be read as a program.

The secondary storage area 33000 stores a device performance management table 33100, a volume topology management table 33200, an event management table 33300, a general rule repository 33400, an expansion rule repository 33500, and an analysis result management table 33600. The secondary storage area 33000 consists of either semiconductor memory or magnetic disks, or of both.

The GUI display processing module 32400 displays the acquired configuration management information via the output device 31200 in response to a request made by the administrator via the input device 31300. The input device and the output device may be separate devices, or may be combined into one or more integrated devices.

The management server (management computer) 30000 has, for example, a keyboard and a pointer device as the input device 31300 and a display, a printer, or the like as the output device 31200, but other devices may be used. Alternatively, a serial interface or an Ethernet interface may be used in place of the input/output devices: a display computer having a display, keyboard, or pointer device is connected to the interface, display information is sent to the display computer, and input information is received from it, so that the display computer performs the display and accepts the input in place of the input/output devices.

In this specification, a set of one or more computers that manages the computer system (information processing system) 1 and displays display information may be called a management system. When the management server 30000 displays the display information, the management server 30000 is the management system; a combination of the management server 30000 and a display computer (for example, the WEB browser activation server 35000 of FIG. 1) is also a management system. Processing equivalent to that of the management server may also be realized by a plurality of computers to speed up management processing or make it more reliable, in which case those computers (including the display computer, if it performs the display) constitute the management system.

<Configuration of device performance management table>
FIG. 5 is a diagram showing a configuration example of the device performance management table 33100 held by the management server 30000.

The device performance management table 33100 includes the following items: a field 33110 for registering the apparatus ID that identifies a managed apparatus; a field 33120 for registering the device ID that identifies a device inside the managed apparatus; a field 33130 for storing the metric name of the performance information of the managed device; a field 33140 for registering the OS type of the apparatus that detected a threshold abnormality (meaning a value judged to be abnormal based on the threshold); a field 33150 for storing the performance value of the managed device acquired from the apparatus concerned; a field 33160 for storing, based on user input, the threshold (alert execution threshold) that is the upper or lower limit of the normal range of the performance value of the managed device; a field 33170 for registering whether the threshold is the upper limit or the lower limit of the normal range; and a field 33180 for registering whether the performance value is normal or abnormal.

For example, the first row (first entry) in FIG. 5 shows that the processor utilization of the controller CTL1 in the storage device SYS1 is currently 40% (see 33150), that the management server 30000 judges the controller CTL1 to be overloaded when its utilization exceeds 20% (see 33160), and that in this example the performance value is therefore judged to be abnormal (see 33180).
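The threshold judgment in the FIG. 5 example can be illustrated with a minimal sketch. The class name, attribute names, and status strings below are hypothetical and chosen only for illustration; the field numbers in the comments are the ones given in the text, and the OS type field 33140 is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class PerformanceEntry:
    """One row of the device performance management table (hypothetical layout)."""
    apparatus_id: str    # field 33110
    device_id: str       # field 33120
    metric: str          # field 33130
    value: float         # field 33150
    threshold: float     # field 33160
    threshold_type: str  # field 33170: "upper" or "lower"

    def status(self) -> str:
        """Return the value that would be registered in field 33180."""
        if self.threshold_type == "upper":
            abnormal = self.value > self.threshold
        else:
            abnormal = self.value < self.threshold
        return "threshold abnormality" if abnormal else "normal"


# First entry of FIG. 5 as described in the text: 40% utilization, 20% upper limit.
entry = PerformanceEntry("SYS1", "CTL1", "processor utilization", 40.0, 20.0, "upper")
print(entry.status())
```

Keeping the upper/lower distinction as data (field 33170) rather than in code lets the same comparison serve metrics where a low value is the problem.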

Although the I/O amount per unit time, the utilization, and the response time are given here as examples of the device performance values managed by the management server 30000, the management server 30000 may manage other performance values as well.

<Configuration of volume topology management table>
FIG. 6 is a diagram showing a configuration example of the volume topology management table 33200 held by the management server 30000.

The volume topology management table 33200 includes the following fields as configuration items: a field 33210 for registering an apparatus ID that identifies a storage device; a field 33220 for registering a volume ID that identifies a volume of the storage device; a field 33230 for registering an LU number that identifies the LU (Logical Unit) used by the host computer 10000; a field 33240 for registering the ID of the controller used for communication between a port and the volume; a field 33250 for registering the identifier of the host computer 10000 to which the volume is connected; and a field 33260 for registering the drive name of the logical volume on the host computer 10000 that is backed by the volume.

For example, the first row (first entry) in FIG. 6 shows that the volume VOL1 of the storage device SYS1 is provided to the host computer as the logical unit LU1, is connected to the host computer HOST1 through the storage-side controller CTL1, and is recognized on the host as the logical volume (/var).

<Configuration of event management table>
FIG. 7 is a diagram showing a configuration example of the event management table 33300 held by the management server 30000. The event management table 33300 is referred to as appropriate in the failure cause analysis processing and the cause candidate classification processing described later.

The event management table 33300 includes the following fields as configuration items: a field 33310 for registering an event ID that identifies the event itself; a field 33320 for registering the apparatus ID of the apparatus in which an event, such as a threshold abnormality in an acquired performance value, occurred; a field 33330 for registering the identifier of the part within the apparatus in which the event occurred; a field 33340 for registering the name of the metric for which the threshold abnormality was detected; a field 33350 for registering the OS type of the apparatus in which the threshold abnormality was detected; a field 33360 for registering the state of the part at the time the event occurred; a field 33370 for registering whether the event has been analyzed by the event analysis processing module 32500 described later; and a field 33380 for registering the date and time at which the event occurred.

For example, the first row (first entry) in FIG. 7 shows that the management server 30000 detected a threshold abnormality in the processor utilization of the controller CTL1 of the storage device SYS1, and that the event ID of this event is EV1.

<Configuration of general-purpose rules>
FIGS. 8A and 8B are diagrams showing configuration examples of general-purpose rules in the general-purpose rule repository 33400 held by the management server 30000. A general-purpose rule (and likewise an expansion rule, described later) expresses the relationship between a combination of one or more condition events that can occur in the node devices constituting the computer system 1 and the conclusion event regarded as the cause of failure for that combination. In other words, general-purpose rules and expansion rules indicate that, when the events in the condition part occur, the content described in the conclusion part can be the cause of the failure.

In general, an event propagation model for identifying the cause in failure analysis describes, in "IF-THEN" form, a combination of events expected to occur as a result of a certain failure together with the cause of those events. The general-purpose rules are not limited to those shown in FIGS. 8A and 8B; there may be more rules.

A general-purpose rule includes the following fields as configuration items: a field 33430 for registering a general-purpose rule ID that identifies the rule; a field 33410 for registering the observed events corresponding to the IF part of the rule written in "IF-THEN" form; a field 33420 for registering the cause event corresponding to the THEN part of the rule written in "IF-THEN" form; and a field 33440 for registering the topology that is acquired when the general-purpose rule is expanded onto the actual system to generate expansion rules. The relationship is such that, if the events in the condition part 33410 are detected, the event in the conclusion part 33420 is the cause of the failure, and once the status of the conclusion part 33420 returns to normal, the problems in the condition part 33410 are also resolved. In the examples of FIGS. 8A and 8B, three events are described in the condition part 33410, but the number of events is not limited.

For example, FIG. 8A shows that, under the general-purpose rule whose ID is Rule1, a bottleneck in the utilization (processor usage rate) of the storage device's controller is concluded to be the cause of the failure when the following are detected as observed events: a threshold abnormality in the response time of a logical volume on the host computer (related event), a threshold abnormality in the utilization (processor usage rate) of a controller in the storage device (cause event), and a threshold abnormality in the I/O amount per unit time of an LU in the storage device (related event).

When an expansion rule is generated, topology information is acquired from the volume topology management table. An observed event may also be defined as a certain condition being normal. In the example general-purpose rule shown in FIG. 8B, the processor usage rate of the storage device's controller being normal and the I/O amount per unit time of the LU in the storage device being normal are defined as observed events.

<Configuration of expansion rules>
FIGS. 9A to 9H are diagrams showing configuration examples of expansion rules in the expansion rule repository 33500 held by the management server 30000. These expansion rules are generated by inserting the items of each entry of the volume topology management table (FIG. 6) into the general-purpose rules (FIGS. 8A and 8B).

An expansion rule includes the following fields as configuration items: a field 33530 for registering an expansion rule ID that identifies the rule; a field 33540 for registering the ID of the general-purpose rule on which the expansion rule is based; a field 33510 for registering the observed events corresponding to the IF part of the expansion rule written in "IF-THEN" form; and a field 33520 for registering the cause event corresponding to the THEN part of the expansion rule written in "IF-THEN" form.

For example, the expansion rule of FIG. 9A is generated by inserting the controller name 33240, the host ID 33250, the connection destination drive name 33260, and the LU number 33230 of the first entry of FIG. 6 into the apparatus type and apparatus part type of the general-purpose rule whose ID is Rule1. FIG. 9A shows that the expansion rule whose ID is ExRule1-1 was expanded from the general-purpose rule whose ID is Rule1, and that a bottleneck in the utilization (processor usage rate) of the storage device's controller is concluded to be the cause of the failure when the following are detected as observed events: a threshold abnormality in the response time of the logical volume on the host computer, a threshold abnormality in the utilization (processor usage rate) of the controller in the storage device, and a threshold abnormality in the I/O amount per unit time of the LU in the storage device.
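As an illustration of how a general-purpose rule might be bound to one topology entry, the following minimal sketch expands Rule1 with the first entry of FIG. 6. The dictionary layout, the key names, and the function `expand` are assumptions made for this example and do not appear in the specification; only the identifiers (Rule1, ExRule1-1, SYS1, CTL1, LU1, HOST1, /var) come from the text.

```python
# General-purpose rule Rule1 (FIG. 8A), written with apparatus types and part types.
GENERAL_RULE_1 = {
    "id": "Rule1",
    "condition": [
        ("host", "logical volume", "response time"),
        ("storage", "controller", "processor utilization"),
        ("storage", "LU", "I/O per unit time"),
    ],
    "conclusion": ("storage", "controller", "processor utilization"),
}

# First entry of the volume topology management table (FIG. 6).
TOPOLOGY_ROW = {
    "apparatus_id": "SYS1", "volume_id": "VOL1", "lu": "LU1",
    "controller": "CTL1", "host": "HOST1", "drive": "/var",
}


def expand(rule, row, index):
    """Replace each (apparatus type, part type) pair with concrete IDs from one topology row."""
    lookup = {
        ("host", "logical volume"): (row["host"], row["drive"]),
        ("storage", "controller"): (row["apparatus_id"], row["controller"]),
        ("storage", "LU"): (row["apparatus_id"], row["lu"]),
    }

    def bind(event):
        apparatus_type, part_type, metric = event
        apparatus, part = lookup[(apparatus_type, part_type)]
        return (apparatus, part, metric)

    return {
        "id": f"Ex{rule['id']}-{index}",
        "general_rule_id": rule["id"],
        "condition": [bind(e) for e in rule["condition"]],
        "conclusion": bind(rule["conclusion"]),
    }


expanded = expand(GENERAL_RULE_1, TOPOLOGY_ROW, 1)
print(expanded["id"], expanded["conclusion"])
```

Applying `expand` to every row of the topology table would yield one expansion rule per row, which matches the description that the rules are generated from each entry.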

<Configuration of analysis result management table>
FIG. 10 is a diagram showing a configuration example of the analysis result management table 33600 held by the management server 30000.

The analysis result management table 33600 includes the following fields as configuration items: a field 33610 for registering the apparatus ID of the apparatus in which the event judged to be the cause of failure in the failure cause analysis processing occurred; a field 33620 for registering the identifier of the part within the apparatus in which the event occurred; a field 33630 for registering the name of the metric for which the threshold abnormality was detected; a field 33640 for registering the occurrence ratio of the events described in the condition part of the expansion rule; a field 33650 for registering the ID of the expansion rule on which the judgment that the event is the cause of failure is based; a field 33660 for registering the IDs of the events actually received among those described in the condition part of the expansion rule; a field 33670 for registering whether the administrator (the user) has actually handled the failure based on the analysis result; a field 33680 for registering the group ID assigned by classification; a field 33690 for registering whether classification was started from this analysis result; and a field 33695 for registering the date and time at which the failure analysis processing triggered by the event was started.

For example, the first row (first entry) in FIG. 10 shows that, based on the expansion rule ExRule1-1, the management server 30000 judged the threshold abnormality in the processor utilization of the controller CTL1 of the storage device SYS1 to be the cause of failure, that it received the events with event IDs EV1, EV3, and EV6 as the basis for this judgment, and that the occurrence ratio of the condition events is therefore 3/3.

<Configuration management information acquisition processing and volume topology management table update processing>
The program control module 32100 instructs the configuration management information acquisition module 32200, for example by polling, to periodically acquire configuration management information from the storage device 20000, the host computer 10000, and the IP switch 40000 in the computer system 1.

The configuration management information acquisition module 32200 acquires the configuration management information from the storage device 20000, the host computer 10000, and the IP switch 40000, and updates the volume topology management table 33200.

<Device performance information acquisition processing and event analysis processing>
FIG. 11 is a flowchart for explaining the normal device performance information acquisition processing executed by the device performance acquisition module 32300 of the management server 30000. The program control module 32100 instructs the device performance acquisition module 32300 to execute the device performance information acquisition processing when the program starts and each time a fixed period has elapsed since the previous acquisition. When the execution instruction is issued repeatedly, the interval need not be strictly constant; it suffices that the instruction is repeated.

The device performance acquisition module 32300 repeats the following series of processes for each monitored apparatus.

The device performance acquisition module 32300 first instructs each monitored apparatus to transmit its device performance information (step 61010).

The device performance acquisition module 32300 determines whether the monitored apparatus has responded (step 61020). If the apparatus responds with device performance information (Yes in step 61020), the module stores the acquired information in the device performance management table 33100 (step 61030). If the apparatus does not respond (No in step 61020), the device performance information acquisition processing ends.

Next, the device performance acquisition module 32300 refers to the device performance information stored in the device performance management table 33100 and repeats the processing of steps 61050 to 61070 for each performance value (step 61040). The module checks whether the performance value exceeds the threshold and updates the state registered in the device performance management table 33100 (step 61050). It then determines whether the state has changed from normal to threshold abnormality or from threshold abnormality to normal (step 61060), and if the state has changed (Yes in step 61060), it registers an event in the event management table 33300 (step 61070). If the state has not changed (No in step 61060) and the state check has not yet been completed for all performance values, the processing returns to step 61050.

After the above processing has been completed for all performance values, the device performance acquisition module 32300 determines whether any event was newly added during the series of processes (step 61080). If there is an added event (for example, when a new abnormality occurred during the processing), the program control module 32100 instructs the event analysis processing module 32500 to perform the failure cause analysis processing shown in FIG. 12 (step 61090).
The above is the device performance information acquisition processing performed by the device performance acquisition module 32300.
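The state-change check of steps 61040 to 61080 can be sketched as follows. This is a minimal illustration, not the implementation in the specification: the function names, the tuple layout of `samples`, the in-memory dictionaries standing in for the tables, and the assumption that a target seen for the first time was previously "normal" are all hypothetical.

```python
def check_status(value, threshold, threshold_type):
    """Step 61050: compare one performance value against its threshold."""
    if threshold_type == "upper":
        return "abnormal" if value > threshold else "normal"
    return "abnormal" if value < threshold else "normal"


def collect_events(samples, status_table, event_table):
    """Register an event for every state change; return True if any event was added."""
    added = 0
    for target, value, threshold, threshold_type in samples:  # loop of step 61040
        new_status = check_status(value, threshold, threshold_type)
        old_status = status_table.get(target, "normal")  # assumption: unseen targets start as normal
        status_table[target] = new_status
        if new_status != old_status:  # step 61060
            event_table.append({      # step 61070
                "event_id": f"EV{len(event_table) + 1}",
                "target": target,
                "status": new_status,
                "analyzed": False,
            })
            added += 1
    return added > 0  # step 61080: the caller starts failure cause analysis when True


status_table, event_table = {}, []
samples = [(("SYS1", "CTL1", "processor utilization"), 40.0, 20.0, "upper")]
first = collect_events(samples, status_table, event_table)   # normal -> abnormal
second = collect_events(samples, status_table, event_table)  # no change
```

Because events are registered only on transitions, a value that stays above its threshold across polling cycles produces a single event rather than one per cycle.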

<Details of failure analysis processing (step 61090)>
FIG. 12 is a flowchart for explaining the details of the failure cause analysis processing (step 61090 in FIG. 11) executed by the event analysis processing module 32500 of the management server 30000.

The event analysis processing module 32500 acquires, from the event management table 33300, the events whose analyzed flag is not Yes (step 62010).

Next, the event analysis processing module 32500 repeats the processing of steps 62020 to 62040 for each expansion rule in the expansion rule repository 33500 (step 62020). The event analysis processing module 32500 first calculates, for each event corresponding to the condition part described in the expansion rule, the number of occurrences in a fixed past period (step 62030).

Subsequently, the event analysis processing module 32500 executes the cause candidate classification processing (FIG. 13) (step 62050). The event analysis processing module 32500 then determines whether the number of event occurrences tallied in step 62030 exceeds a fixed ratio of all the events described in the condition part, and if so, instructs the GUI display processing module 32400 to display the event that is the cause of failure, together with the occurrence ratio of the events in the condition statement, according to the classification performed in step 62050 (step 62060). It then refers to the event management table 33300 and sets the analyzed flag 33370 to Yes for the events acquired in step 62010 (step 62070).

Finally, the event analysis processing module 32500 writes those expansion rules in the expansion rule repository whose certainty factor is not 0 to the analysis result management table 33600 (step 62080).

For example, the condition part of the expansion rule ExRule1-1 shown in FIG. 9A defines "a threshold abnormality in the response time of the logical volume (/var) on the host computer HOST1", "a threshold abnormality in the utilization of the controller CTL1 in the storage device SYS1", and "a threshold abnormality in the I/O amount per unit time of the logical unit LU1 in the storage device SYS1".

When "a threshold abnormality in the utilization of the controller CTL1 in the storage device SYS1" (date and time of occurrence: 2010-01-01 15:05:00) is registered in the event management table 33300 shown in FIG. 7, the event analysis processing module 32500 waits for a fixed time, then refers to the event management table 33300 and acquires the events that occurred in a fixed past period.

Next, the event analysis processing module 32500 calculates, for each event corresponding to the condition part described in the expansion rule ExRule1-1 in the expansion rule repository 33500, the number of occurrences in the fixed past period. Since "a threshold abnormality in the response time of the logical volume (/var) on the host computer HOST1" (related event) and "a threshold abnormality in the I/O amount per unit time of the logical unit LU1" (related event) have also occurred in that period, the events corresponding to the condition part of ExRule1-1 (the cause event and the related events) that occurred in the period account for 3/3 of all the events described in the condition part.

When the ratio calculated in this way exceeds a fixed value, the event analysis processing module 32500 instructs the GUI display processing module 32400 to display the event that is the cause of failure together with the occurrence ratio of the events in the condition statement. If the fixed value is, for example, 30%, then in this example the occurrence ratio of the events in the condition part of the expansion rule ExRule1-1 over the fixed past period is 3/3, that is, 100%, so the analysis result is displayed on the GUI.

The above processing is executed for all the expansion rules defined in the expansion rule repository 33500.
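The ratio calculation in the ExRule1-1 example can be sketched as follows. The function names, the event layout, the 10-minute window, and the occurrence times of the two related events are assumptions made for illustration; only the 15:05:00 timestamp of the CTL1 event and the 30% example threshold come from the text.

```python
from datetime import datetime, timedelta


def occurrence_ratio(condition_events, received, now, window):
    """Return (matched, total) for the condition part of one expansion rule."""
    recent = {e["target"] for e in received
              if timedelta(0) <= now - e["time"] <= window}
    matched = sum(1 for c in condition_events if c in recent)
    return matched, len(condition_events)


def should_display(matched, total, minimum_ratio=0.30):
    """True when the occurrence ratio exceeds the fixed value (30% in the example)."""
    return total > 0 and matched / total > minimum_ratio


# Condition part of ExRule1-1 (FIG. 9A).
EXRULE_1_1 = [
    ("HOST1", "/var", "response time"),
    ("SYS1", "CTL1", "processor utilization"),
    ("SYS1", "LU1", "I/O per unit time"),
]

now = datetime(2010, 1, 1, 15, 10, 0)  # hypothetical analysis time
received = [
    {"target": ("SYS1", "CTL1", "processor utilization"), "time": datetime(2010, 1, 1, 15, 5, 0)},
    {"target": ("HOST1", "/var", "response time"), "time": datetime(2010, 1, 1, 15, 6, 0)},    # hypothetical time
    {"target": ("SYS1", "LU1", "I/O per unit time"), "time": datetime(2010, 1, 1, 15, 7, 0)},  # hypothetical time
]

matched, total = occurrence_ratio(EXRULE_1_1, received, now, timedelta(minutes=10))
print(f"{matched}/{total}", should_display(matched, total))
```

With all three condition events inside the window the ratio is 3/3, which exceeds the 30% example value, so the candidate would be passed to the GUI.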

The above is the failure cause analysis processing performed by the event analysis processing module 32500. As described above, with the failure analysis function of Patent Document 1, the number of stored failure analysis results becomes large when multiple failures occur frequently within a short period. However, when many cause candidates are inferred for multiple failures, there is no way to present which cause candidate relates to which of the failures that are actually occurring. In particular, when a large number of failure events occur and many failure cause candidates are inferred, it is difficult for the administrator to infer which cause candidate should be handled in order to promptly resolve the failure occurring at a given location, and it takes longer for the administrator to reach the analysis result that should be addressed first. As a result, the time required to resolve the failure becomes longer.

Therefore, the embodiments of the present invention newly provide cause candidate classification processing so that a large number of analysis results can be classified and displayed.

<Contents of cause candidate classification processing>
To solve the problems of the prior art, the first embodiment of the present invention adds cause candidate classification processing to the management server 30000. The operation of this processing is described in detail below.

The cause candidate classification processing takes the events included in a starting cause candidate (for example, the cause candidate with the highest certainty factor) as a reference and, if there are other cause candidates that include any of those events, presumes them to be candidates for the same failure cause and classifies (groups) them together. Because related cause candidates are gathered into groups, the administrator can tell which candidates should be handled first.

FIG. 13 is a flowchart for explaining the details of the cause candidate classification processing (step 62050) performed by the event analysis processing module 32500 of the management server 30000 in the first embodiment.

The event analysis processing module 32500 selects, from the analysis result management table 33600, the cause candidate with the highest certainty factor within a fixed period (for example, one polling period) (step 63010), and registers Yes in the classification start flag field 33690 of the analysis result management table 33600 for the selected candidate's entry. The module acquires the received event IDs included in the selected candidate from the analysis result management table 33600 (step 63020). It then acquires, from the analysis result management table 33600, the cause candidates that include at least one of the acquired received event IDs (step 63030). After acquiring these cause candidates, the module obtains the list of group IDs already in use from the group ID field 33680 of the analysis result management table 33600, creates a group ID that does not duplicate any of them, and updates the field 33680 to the created group ID for the entries of the cause candidate selected in step 63010 and the cause candidates acquired in step 63030 (step 63040).

Next, the event analysis processing module 32500 checks the analysis result management table 33600 for entries that have no group ID in the field 33680. If such entries exist (No in step 63050), the module selects the cause candidate with the highest certainty factor among them (step 63060) and registers Yes in the classification start flag field 33690 for the selected candidate's entry in the analysis result management table 33600. The processing from step 63020 onward is then performed again for the selected candidate.

If, on referring to the field 33680 of the analysis result management table 33600, a group ID is found in every entry (Yes in step 63050), the event analysis processing module 32500 acquires all received event IDs from the received event ID field 33660 of the analysis result management table 33600. It then acquires the entries that have Yes in the classification start flag field 33690 and checks whether every received event ID is included in the acquired entries.

If one or more received event IDs are not included in those entries (No in step 63070), the event analysis processing module 32500 selects, among the cause candidate entries that include those received event IDs, the cause candidate with the highest certainty factor (step 63080), and registers Yes in the classification start flag field 33690 for the selected candidate's entry in the analysis result management table 33600. The processing from step 63020 onward is then performed again for the selected candidate.

解析結果管理表３３６００の分類起点フラグフィールド３３６９０にYesが記載されているエントリが取得され、全ての受信イベントIDが取得したエントリに含まれていた場合(ステップ６３０７０でYesの場合)、原因候補分類処理は終了する。 If the entries whose classification start flag field 33690 contains Yes have been acquired from the analysis result management table 33600 and every received event ID is included in the acquired entries (Yes in step 63070), the cause candidate classification process ends.
以上が、イベント解析処理モジュール３２５００が実施する原因候補分類処理である。 The above is the cause candidate classification process performed by the event analysis processing module 32500.

以下に、原因候補分類処理の具体例について説明する。なお、処理開始当初の解析結果管理表は図１０、展開ルールは図９、イベント管理表は図７に示す通りのものであるとする。そして、図１２のステップ６２０５０の直前までは、処理が終了しているものとする。 A specific example of the cause candidate classification process is described below. Assume that, at the start of the processing, the analysis result management table is as shown in FIG. 10, the expansion rules are as shown in FIG. 9, and the event management table is as shown in FIG. 7. Also assume that the processing up to immediately before step 62050 in FIG. 12 has been completed.

イベント解析処理モジュール３２５００は、解析結果管理表３３６００より、確信度が最も高いエントリとして、解析結果管理表の第１段目（１つ目のエントリ）から、SYS1の装置の、CTL1の障害原因候補エントリを選択する。つぎに、この候補に含まれる障害イベントである、EV1、EV3、EV6を抽出する。そして、これらの障害イベントを含む他の障害原因候補として、2段目のエントリ(SYS1/CTL2)と5段目のエントリ(IPSW1)を選択する。そして、これら3つのエントリをグループ化し、グループIDとしてGR1を生成してこれらのエントリの解析結果管理表のグループID登録用のフィールド３３６８０に生成したグループIDを登録する。さらに1段目のエントリを、分類を行う際の基準として扱ったので、1段目のエントリの分類起点フラグ３３６９０にはYesを、残りの2エントリの分類起点フラグ３３６９０にはNoを記録する。 From the analysis result management table 33600, the event analysis processing module 32500 selects the entry with the highest certainty: the first row (first entry) of the table, which is the failure cause candidate entry for CTL1 of device SYS1. Next, it extracts the failure events included in this candidate, EV1, EV3, and EV6. It then selects the second-row entry (SYS1/CTL2) and the fifth-row entry (IPSW1) as other failure cause candidates that include any of these failure events. These three entries are grouped, GR1 is generated as the group ID, and that group ID is registered in the group ID registration field 33680 of these entries in the analysis result management table. Because the first-row entry served as the reference for classification, Yes is recorded in the classification start flag 33690 of the first-row entry and No in that of the remaining two entries.

解析結果管理表にはまだグループ化されていない残りのエントリ(3段目、4段目)が存在するため、それらについてここまでの作業を繰り返す。まず、確信度の高いエントリとして、3段目のエントリ(SYS1/CTL3)を選択する。そしてこの候補に含まれる障害イベントである EV2、EV4、EV8を抽出する。これらの障害イベントを含む他の障害原因候補として、5段目のエントリ(IPSW1)を選択する。そして、これら2つのエントリをグループ化し、グループIDとしてGR2を生成してこれらのエントリの解析結果管理表のグループID登録用のフィールド３３６８０に生成したグループIDを登録する。なお、5段目のエントリには既にグループIDが登録されているが、複数のグループに所属していることを示すため、追加して登録する。このために、グループID登録用のフィールド３３６８０は、複数のIDを登録できるような構造にする。さらに3段目のエントリを、分類を行う際の基準として扱ったので、3段目のエントリの分類起点フラグ３３６９０にはYesを記録する。 Since there are remaining entries (third and fourth levels) that are not yet grouped in the analysis result management table, the operations up to this point are repeated. First, the third entry (SYS1 / CTL3) is selected as an entry with a high certainty factor. Then, the failure events EV2, EV4, and EV8 included in this candidate are extracted. The fifth entry (IPSW1) is selected as another failure cause candidate including these failure events. Then, these two entries are grouped, GR2 is generated as a group ID, and the generated group ID is registered in the group ID registration field 33680 of the analysis result management table of these entries. Note that the group ID is already registered in the fifth row entry, but it is additionally registered to indicate that it belongs to a plurality of groups. For this purpose, the group ID registration field 33680 is structured so that a plurality of IDs can be registered. Furthermore, since the third-stage entry is handled as a reference for classification, Yes is recorded in the classification start flag 33690 of the third-stage entry.

さらに解析結果管理表にはまだグループ化されていない残りのエントリ(4段目)が存在する。このエントリについても同様の作業を繰り返す。そしてこの候補に含まれる障害イベントであるEV5、EV9を抽出する。これらの障害イベントを含む他の障害原因候補として、5段目のエントリ(IPSW1) を選択する。そして、これら2つのエントリをグループ化し、グループIDとしてGR3を生成してこれらのエントリの解析結果管理表のグループID登録用のフィールド３３６８０に生成したグループIDを登録する。なお、5段目のエントリには既にグループIDが登録されているため、追加して登録する。さらに4段目のエントリを、分類を行う際の基準として扱ったので、4段目のエントリの分類起点フラグ３３６９０にはYesを記録する。 The analysis result management table still contains one remaining entry (fourth row) that has not been grouped, so the same procedure is repeated for it. The failure events included in this candidate, EV5 and EV9, are extracted. The fifth-row entry (IPSW1) is selected as another failure cause candidate that includes any of these failure events. These two entries are grouped, GR3 is generated as the group ID, and that group ID is registered in the group ID registration field 33680 of these entries in the analysis result management table. Because a group ID is already registered for the fifth-row entry, the new ID is registered in addition. Because the fourth-row entry served as the reference for classification, Yes is recorded in its classification start flag 33690.
ここまでの処理により、解析結果管理表のすべてのエントリはグループ化された。 Through the processing so far, all entries in the analysis result management table have been grouped.

次に、グループ化の際に参照されなかった障害イベントを抽出する。解析結果管理表３３６００の受信イベントIDフィールド３３６６０に含まれる全てのイベントIDのうち、分類起点フラグ３３６９０にYesが記録されているエントリに含まれないものとして、EV7を抽出する。EV7を含む原因候補として、2段目のエントリ(SYS1/CTL2)と5段目のエントリ(IPSW1)が存在する。このうち確信度の高い2段目のエントリ(SYS1/CTL2)を起点に、同様のグループ化を行うと、これら2つのエントリと、1段目のエントリ(SYS1/CTL1)を新たにグループ化できる。なお、ここでこれらのエントリは全てグループGR1に含まれる。GR1に着目して障害対応を行うことを考えると、GR1の起点となった1段目のエントリを解決するためにSYS1/CTL1の障害に対応しても、2段目のエントリに含まれるEV7については解決できない可能性がある。本実施形態では、各グループの一つのエントリについて対処すれば全ての障害を修復できるよう、2段目のエントリ(SYS1/CTL2)を起点にしたグループも、GR1とは別に生成する。そしてグループIDとしてGR4を生成してこれらのエントリの解析結果管理表のグループID登録用のフィールド３３６８０に生成したグループIDを登録する。なお、各エントリには既にグループIDが登録されているため、追加して登録する。さらに2段目のエントリを、分類を行う際の基準として扱ったので、2段目のエントリの分類起点フラグ３３６９０にはYesを記録する。 Next, failure events that were not referenced during grouping are extracted. Of all the event IDs in the received event ID field 33660 of the analysis result management table 33600, EV7 is extracted as the one not included in any entry whose classification start flag 33690 is Yes. The cause candidates that include EV7 are the second-row entry (SYS1/CTL2) and the fifth-row entry (IPSW1). Performing the same grouping with the second-row entry (SYS1/CTL2), which has the higher certainty, as the starting point groups these two entries together with the first-row entry (SYS1/CTL1). All of these entries already belong to group GR1. However, if failure handling focuses on GR1, dealing with the SYS1/CTL1 failure in order to resolve the first-row entry, which was the starting point of GR1, may leave EV7 in the second-row entry unresolved. In this embodiment, so that all failures can be repaired by dealing with one entry of each group, a group starting from the second-row entry (SYS1/CTL2) is therefore generated separately from GR1. GR4 is generated as the group ID and registered in the group ID registration field 33680 of these entries in the analysis result management table. Because each entry already has a group ID registered, the new ID is registered in addition. Because the second-row entry served as the reference for classification, Yes is recorded in its classification start flag 33690.

これにより、解析結果管理表３３６００の受信イベントIDフィールド３３６６０に含まれる全てのイベントIDのうち、分類起点フラグ３３６９０にYesが記録されているエントリに含まれないものが無くなったため、原因候補分類処理を終了する。 As a result, no event ID in the received event ID field 33660 of the analysis result management table 33600 remains outside the entries whose classification start flag 33690 is Yes, so the cause candidate classification process ends.
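The walkthrough above can be sketched in Python. This is a minimal sketch under stated assumptions, not the patent's implementation: the `Candidate` class, the `classify` function, and the numeric certainty values are hypothetical (only their ordering follows the text), and the event set given for SYS1/CTL2 is assumed, since the text only says that it shares an event with SYS1/CTL1 and includes EV7.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str                                       # cause candidate (device/component)
    certainty: int                                  # hypothetical value; only the order matters
    events: frozenset                               # received event IDs (field 33660)
    group_ids: list = field(default_factory=list)   # field 33680 can hold several IDs
    is_origin: bool = False                         # classification start flag (field 33690)


def classify(table):
    """Group candidates by shared events until every event belongs to an origin entry."""
    groups = {}
    all_events = set().union(*(c.events for c in table))
    origin = max(table, key=lambda c: c.certainty)
    while origin is not None:
        gid = f"GR{len(groups) + 1}"
        origin.is_origin = True
        # The origin plus every other candidate sharing at least one event with it.
        members = [origin] + [c for c in table
                              if c is not origin and c.events & origin.events]
        for c in members:
            c.group_ids.append(gid)
        groups[gid] = [c.name for c in members]

        # First exhaust ungrouped entries, then events not covered by any origin.
        ungrouped = [c for c in table if not c.group_ids]
        covered = set().union(*(c.events for c in table if c.is_origin))
        holders = [c for c in table if c.events & (all_events - covered)]
        pool = ungrouped or holders
        origin = max(pool, key=lambda c: c.certainty) if pool else None
    return groups


table = [
    Candidate("SYS1/CTL1", 90, frozenset({"EV1", "EV3", "EV6"})),
    Candidate("SYS1/CTL2", 80, frozenset({"EV1", "EV3", "EV7"})),  # assumed event set
    Candidate("SYS1/CTL3", 70, frozenset({"EV2", "EV4", "EV8"})),
    Candidate("SYS1/CTL4", 60, frozenset({"EV5", "EV9"})),
    Candidate("IPSW1", 50, frozenset({"EV6", "EV7", "EV8", "EV9"})),
]
result = classify(table)
```

With these assumed inputs the sketch produces four groups whose first member is the origin: GR1 (SYS1/CTL1, SYS1/CTL2, IPSW1), GR2 (SYS1/CTL3, IPSW1), GR3 (SYS1/CTL4, IPSW1), and GR4 (SYS1/CTL2, SYS1/CTL1, IPSW1), which matches the walkthrough.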

＜障害解析結果表示画面の構成＞ <Configuration of failure analysis result display screen>
図１４は、管理サーバ３００００がユーザ（管理者）に対して表示する、障害解析結果表示画面の表示例７１０００を示す図である。 FIG. 14 is a diagram illustrating a display example 71000 of a failure analysis result display screen that the management server 30000 displays to the user (administrator).

障害解析結果表示画面７１０００では、解析結果管理表に定義された解析結果をグループIDが一致するものをまとめて表示する。その際に、複数のグループに分類されているエントリは、複数のグループに重複して表示する。また、各グループにおいてグループ化の際に起点とした原因候補を、そのグループの最上位に表示する。そしてそれ以外の候補は、確信度の高い順に表示している。 The failure analysis result display screen 71000 displays the analysis results defined in the analysis result management table grouped by matching group ID. Entries classified into multiple groups are displayed in each of those groups. In each group, the cause candidate used as the starting point for grouping is displayed at the top of that group, and the other candidates are displayed in descending order of certainty.
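The display order described above can be sketched as follows. The function name `layout_groups`, the per-group record of the starting-point candidate, and the certainty values are assumptions made for illustration; the text does not specify how the screen determines which entry was the starting point of a given group.

```python
def layout_groups(groups, certainty):
    """Return {group_id: candidate names in display order}.

    groups: {group_id: (origin_name, [member names, origin included])}
    certainty: {candidate name: certainty value}
    """
    screen = {}
    for gid, (origin, members) in groups.items():
        # Origin first, then the remaining members by descending certainty.
        rest = sorted((m for m in members if m != origin),
                      key=lambda m: certainty[m], reverse=True)
        screen[gid] = [origin] + rest
    return screen


# Hypothetical certainty values; only their ordering is taken from the walkthrough.
certainty = {"SYS1/CTL1": 90, "SYS1/CTL2": 80, "IPSW1": 50}
groups = {
    "GR1": ("SYS1/CTL1", ["IPSW1", "SYS1/CTL2", "SYS1/CTL1"]),
    "GR4": ("SYS1/CTL2", ["IPSW1", "SYS1/CTL1", "SYS1/CTL2"]),
}
screen = layout_groups(groups, certainty)
```

Here IPSW1 appears in both groups, and SYS1/CTL2 heads GR4 even though SYS1/CTL1 has the higher certainty, because the starting point always comes first.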

なお、本実施形態では同一画面に全ての原因候補のグループを表示しているが、グループごとに分割して表示されれば良いので、グループごとに別画面で表示し、タブ等で切り替えられるように実施してもよい。 In this embodiment, all the cause candidate groups are displayed on the same screen. However, since it is only necessary to divide and display each cause group, each group can be displayed on a separate screen and switched by a tab or the like. May be implemented.

以上の障害解析結果表示によれば、例えば、管理者は、管理サーバ３００００の画面に表示された各原因候補グループの最上位の候補から対処していけば効率よく障害原因を取り除ける可能性が高いことを知ることができる。 With the failure analysis result display described above, the administrator can see, for example, that working through the candidates starting from the top candidate of each cause candidate group displayed on the screen of the management server 30000 is likely to remove the failure causes efficiently.

＜変形例＞ <Modification>
上述の分類処理の結果生成されたグループの数が多すぎると、却ってグループ化することにより障害結果の確認が困難になる場合がある。そこで、分類処理で生成されたグループ数が所定数以上の場合（グループ数については管理者が設定可能）、分類結果を自動的にまとめるようにしても良い。その際の処理は、例えば、まず、ある分類結果のグループに含まれる条件イベントのうちの一定割合以上が別の分類結果のグループに含まれるか否かを判断する。そして、一定割合以上の条件イベントが別の分類結果のグループに含まれる場合には、それらのグループに含まれる原因候補を１つのグループにまとめて、グループ化する。このようにするのは、あるグループの条件イベントの一定以上の割合が別のグループにも含まれる場合には、双方のグループに含まれる障害イベントが同じ装置に発生した障害に起因して発生している可能性が高く、同一のグループとして扱っても問題がない可能性が高いからである。 If the classification process above generates too many groups, grouping may actually make the failure analysis results harder to review. Therefore, when the number of groups generated by the classification process is equal to or greater than a predetermined number (which the administrator can set), the classification results may be consolidated automatically. In this processing, for example, it is first determined whether at least a certain proportion of the condition events included in one classification result group is also included in another classification result group. If so, the cause candidates included in those groups are combined into a single group. The reason is that, when at least a certain proportion of one group's condition events is also included in another group, the failure events in both groups are likely to stem from a failure in the same device, and treating them as one group is unlikely to cause a problem.
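One possible reading of this consolidation step is sketched below. The function name `consolidate`, the parameters `max_groups` and `ratio`, the representation of a group as a set of condition events plus a set of candidates, and the choice to keep merging until no pair qualifies are all assumptions; the text only states the overlap criterion.

```python
def consolidate(groups, max_groups, ratio):
    """Merge groups whose condition events largely overlap.

    groups: {group_id: {"events": set, "candidates": set}}
    max_groups: consolidation runs only when len(groups) >= max_groups
    ratio: minimum share of one group's events that must appear in another
    """
    merged = {gid: {"events": set(g["events"]), "candidates": set(g["candidates"])}
              for gid, g in groups.items()}
    if len(merged) < max_groups:
        return merged  # few enough groups; keep the classification as is

    def find_pair():
        # Look for a group `a` whose events are mostly contained in group `b`.
        for a in merged:
            for b in merged:
                if a == b or not merged[a]["events"]:
                    continue
                shared = merged[a]["events"] & merged[b]["events"]
                if len(shared) >= ratio * len(merged[a]["events"]):
                    return a, b
        return None

    while (pair := find_pair()) is not None:
        a, b = pair
        merged[b]["events"] |= merged[a]["events"]
        merged[b]["candidates"] |= merged[a]["candidates"]
        del merged[a]
    return merged


# Hypothetical groups: two of the three events in GR1 also appear in GR2.
example = {
    "GR1": {"events": {"EV1", "EV2", "EV3"}, "candidates": {"A"}},
    "GR2": {"events": {"EV2", "EV3", "EV4"}, "candidates": {"B"}},
    "GR3": {"events": {"EV9"}, "candidates": {"C"}},
}
combined = consolidate(example, max_groups=3, ratio=0.6)
```

In this example GR1 is folded into GR2 while GR3 stays separate. Raising `max_groups` above the number of groups leaves the classification untouched.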

＜原因候補分類処理の効果＞ <Effects of cause candidate classification processing>
以上、第１の実施形態によれば、管理サーバ３００００の管理ソフトウェアは、図１２に示す障害原因解析処理の後、推論した障害原因候補を、それによって解決される障害にかかわる障害イベントによって分類して表示する。第１の実施形態による分類法と、その結果の表示形式では、各グループの上位の１つのエントリについて対処すれば全ての障害を修復できるよう、分類することができる。従来、原因候補分類処理を行わない場合は、推論した障害原因候補として、図１０に示すリストの内容をそのまま表示している。原因候補分類処理を行うことによって、管理者がどの原因候補に優先的に対応すべきか容易に判断でき、解析結果確認と障害対応に要する負荷を軽減することができる。 As described above, according to the first embodiment, after the failure cause analysis processing shown in FIG. 12, the management software of the management server 30000 classifies the inferred failure cause candidates by the failure events related to the failures that each candidate would resolve, and displays them. The classification method of the first embodiment and the resulting display format classify the candidates so that all failures can be repaired by dealing with the one top entry of each group. Conventionally, when the cause candidate classification process is not performed, the contents of the list shown in FIG. 10 are displayed as they are as the inferred failure cause candidates. Performing the cause candidate classification process lets the administrator easily determine which cause candidates to deal with first, and reduces the load of checking analysis results and handling failures.

そして、図１４のようにグループに分類して各原因候補を表示することにより、管理者としては優先度の高い原因候補（優先的に対処すべき候補）をバランスよく検証することができ、よって障害対応の時間を短縮することができるようになる。 And by classifying into groups as shown in FIG. 14 and displaying each cause candidate, it is possible for the administrator to verify cause candidates with high priority (candidates to be dealt with preferentially) in a well-balanced manner. It becomes possible to shorten the time for failure handling.

（２）第２の実施形態 (2) Second Embodiment
第２の実施形態は、第１の実施形態により管理者に原因候補を提示した後、管理者が実施した障害対応手順に基づき、原因候補分類処理を再度実施するものである。システム構成や各装置の構成は第１の実施形態と同じであるので、説明は省略する。以降、第２の実施形態の説明では、第１の実施形態によって図１４のように障害解析結果を画面表示した後で、管理者の操作に基づいて行う処理を記載する。 In the second embodiment, after cause candidates are presented to the administrator according to the first embodiment, the cause candidate classification process is performed again based on the failure handling procedure carried out by the administrator. The system configuration and the configuration of each device are the same as in the first embodiment, so their description is omitted. The following description of the second embodiment covers the processing performed in response to the administrator's operations after the failure analysis result is displayed on the screen as shown in FIG. 14 according to the first embodiment.

＜原因候補対処時の処理＞ <Processing when a cause candidate is dealt with>
図１５は、第２の実施形態において、管理者が障害解析結果を利用して障害対応を行う時の処理を説明するためのフローチャートである。管理者は、例えば、障害解析結果表示画面７１０００から、原因候補を選択して障害対応を行ったことを検知する(ステップ６４０１０)と、イベント解析モジュール３２５００は、管理者が選択した候補の対応済フラグをYesに変更する(ステップ６４０２０)。第１の実施形態では、各グループの上位の一つのエントリについて対処すれば全ての障害を修復できるように分類した。したがって、障害対応時に最初に選択された候補がいずれかのグループの最上位の候補であれば、分類が管理者の意図や実際の構成状況に合致するように行われていることになる。逆にいずれのグループの最上位でもない候補を最初に選択した場合は、分類が適切に行われていなかったことになる。そのため、最初に管理者に選択された候補がいずれのグループの最上位でもなかった場合、イベント解析モジュール３２５００は、原因候補再分類処理を行う(ステップ６４０３０〜６４０４０)。つまり、最上位以外の候補が選択されたということは、管理者が自身の経験等に基づいて１回目の分類結果を信用していないことを示しており、このような事態に対応して再分類を行い、管理者がより効率よく原因候補に対処できるようにしている。 FIG. 15 is a flowchart for explaining the processing performed when the administrator handles a failure using the failure analysis results in the second embodiment. When it is detected, for example, that the administrator has selected a cause candidate on the failure analysis result display screen 71000 and handled the failure (step 64010), the event analysis module 32500 changes the handled flag of the selected candidate to Yes (step 64020). In the first embodiment, candidates are classified so that all failures can be repaired by dealing with the one top entry of each group. Therefore, if the candidate selected first during failure handling is the top candidate of some group, the classification matches the administrator's intention and the actual configuration. Conversely, if the candidate selected first is not the top candidate of any group, the classification was not appropriate. In that case, the event analysis module 32500 performs the cause candidate reclassification process (steps 64030 to 64040). In other words, the selection of a candidate other than a top candidate indicates that the administrator, based on his or her own experience or the like, does not trust the first classification result; reclassification is performed in response so that the administrator can deal with the cause candidates more efficiently.
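The decision in steps 64010 to 64040 can be sketched as a small function. The names `on_candidate_handled`, `handled`, and `group_tops` are hypothetical, and the set of group tops used in the example is taken from the first embodiment's walkthrough; this is an illustration of the described check, not the actual module.

```python
def on_candidate_handled(selected, handled, group_tops):
    """Record that `selected` was dealt with; return True if reclassification is needed.

    handled: set of candidates whose handled flag is already Yes (mutated in place)
    group_tops: set of candidates displayed at the top of some group
    """
    first_selection = not handled       # nothing had been handled before this call
    handled.add(selected)               # step 64020: set the handled flag to Yes
    # Step 64030: reclassify only if the first pick was not the top of any group.
    return first_selection and selected not in group_tops


tops = {"SYS1/CTL1", "SYS1/CTL2", "SYS1/CTL3", "SYS1/CTL4"}
handled = set()
first = on_candidate_handled("IPSW1", handled, tops)       # IPSW1 tops no group
second = on_candidate_handled("SYS1/CTL1", handled, tops)  # not the first selection
```

In this run the first call asks for reclassification because IPSW1 is not at the top of any group, while the second call does not because it is no longer the first selection.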

＜原因候補再分類処理の詳細＞ <Details of cause candidate reclassification processing>
図１６は、第２の実施形態による原因候補再分類処理（ステップ６４０４０）の詳細を説明するためのフローチャートである。本実施形態の原因候補再分類処理は、第１の実施形態での原因候補分類処理(ステップ６３０１０〜６３０８０)に対して行った処理と同等の分類処理を、対応済フラグがYesに設定されている候補から優先的に実施する。 FIG. 16 is a flowchart for explaining the details of the cause candidate reclassification process (step 64040) according to the second embodiment. The cause candidate reclassification process of this embodiment performs classification equivalent to the cause candidate classification process of the first embodiment (steps 63010 to 63080), giving priority to candidates whose handled flag is set to Yes.

イベント解析処理モジュール３２５００は、まず、事前処理として全ての候補のグループIDフィールド３３６８０と、分類起点フラグフィールド３３６９０の値を削除する(ステップ６５００５)。 First, the event analysis processing module 32500 deletes the values of all candidate group ID fields 33680 and the classification start flag field 33690 as pre-processing (step 65005).

次に、イベント解析処理モジュール３２５００は、解析結果管理表３３６００より、対応済フラグフィールド３３６７０がYesに設定されている候補のうちで、確信度が最も高い原因候補を選択する(ステップ６５０１０)。そして、イベント解析処理モジュール３２５００は、選択した原因候補のエントリについて、解析結果管理表３３６００の分類起点フラグフィールド３３６９０に、Yesを登録する。 Next, from the analysis result management table 33600, the event analysis processing module 32500 selects the cause candidate with the highest certainty among the candidates whose handled flag field 33670 is set to Yes (step 65010). The event analysis processing module 32500 then registers Yes in the classification start flag field 33690 of the selected cause candidate's entry in the analysis result management table 33600.

イベント解析処理モジュール３２５００は、選択した候補に含まれる受信イベントIDを、解析結果管理表３３６００より取得する(ステップ６５０２０)。そして、イベント解析処理モジュール３２５００は、取得した受信イベントIDのうち、いずれか一つ以上の同一受信イベントIDを含む原因候補を、解析結果管理表３３６００より取得する(ステップ６５０３０)。 The event analysis processing module 32500 acquires the reception event ID included in the selected candidate from the analysis result management table 33600 (step 65020). Then, the event analysis processing module 32500 acquires a cause candidate including any one or more of the same reception event IDs from the acquired reception event IDs from the analysis result management table 33600 (step 65030).

原因候補の取得後、イベント解析処理モジュール３２５００は、解析結果管理表３３６００のグループIDを登録するフィールド３３６８０より利用されているグループIDのリストを取得し、重複しないグループIDを作成し、ステップ６５０１０で選択した原因候補およびステップ６５０３０で取得した原因候補のエントリに関して、フィールド３３６８０の内容を作成したグループIDに更新する (ステップ６５０４０)。 After acquiring the cause candidates, the event analysis processing module 32500 acquires a list of group IDs used from the field 33680 for registering group IDs in the analysis result management table 33600, creates non-overlapping group IDs, and in step 65010 For the selected cause candidate and the cause candidate entry acquired in step 65030, the contents of the field 33680 are updated to the created group ID (step 65040).

続いて、イベント解析処理モジュール３２５００は、解析結果管理表３３６００より、対応済フラグフィールド３３６７０がYesに設定されている候補のうちで、フィールド３３６８０にグループIDが記載されていないエントリが存在するかどうかチェックする。そのようなエントリが存在した場合(ステップ６５０５０でNoの場合)、そのようなエントリのうち、確信度が最も高い原因候補を選択し(ステップ６５０６０)、解析結果管理表３３６００の選択した原因候補のエントリについて、分類起点フラグフィールド３３６９０に、Yesを登録する。そして、選択した候補に対して、ステップ６５０２０以降の処理を再度行う。 Subsequently, the event analysis processing module 32500 checks, in the analysis result management table 33600, whether any candidate whose handled flag field 33670 is set to Yes has no group ID recorded in field 33680. If such an entry exists (No in step 65050), it selects the cause candidate with the highest certainty among such entries (step 65060) and registers Yes in the classification start flag field 33690 of the selected cause candidate's entry in the analysis result management table 33600. The processing from step 65020 onward is then performed again for the selected candidate.

対応済みフラグYesの原因候補が全て分類済であると判断された場合（ステップ６５０５０でYesの場合）、イベント解析処理モジュール３２５００は、解析結果管理表３３６００より、フィールド３３６８０にグループIDが記載されていないエントリが存在するかどうかチェックする。そのようなエントリが存在した場合(ステップ６５０７０でNoの場合)、イベント解析処理モジュール３２５００は、そのようなエントリのうち、確信度が最も高い原因候補を選択し(ステップ６５０８０)、解析結果管理表３３６００の選択した原因候補のエントリについて、分類起点フラグフィールド３３６９０に、Yesを登録する。そして、選択した候補に対して、ステップ６５０２０以降の処理を再度行う。 When it is determined that all cause candidates whose handled flag is Yes have been classified (Yes in step 65050), the event analysis processing module 32500 checks whether the analysis result management table 33600 contains any entry with no group ID recorded in field 33680. If such an entry exists (No in step 65070), the event analysis processing module 32500 selects the cause candidate with the highest certainty among such entries (step 65080) and registers Yes in the classification start flag field 33690 of the selected cause candidate's entry in the analysis result management table 33600. The processing from step 65020 onward is then performed again for the selected candidate.

さらに、解析結果管理表３３６００のフィールド３３６８０を参照し、全てのエントリにグループIDが記載されていた場合(ステップ６５０７０でYesの場合)、イベント解析処理モジュール３２５００は、解析結果管理表３３６００の受信イベントIDフィールド３３６６０から、全ての受信イベントIDを取得する。 Furthermore, with reference to the field 33680 of the analysis result management table 33600, if the group ID is described in all entries (Yes in step 65070), the event analysis processing module 32500 receives the received event of the analysis result management table 33600. All received event IDs are acquired from the ID field 33660.

次に、イベント解析処理モジュール３２５００は、解析結果管理表３３６００の分類起点フラグフィールド３３６９０にYesが記載されているエントリを取得し、全ての受信イベントIDが取得したエントリに含まれているかどうかをチェックする。 Next, the event analysis processing module 32500 acquires an entry in which Yes is described in the classification start flag field 33690 of the analysis result management table 33600, and checks whether all received event IDs are included in the acquired entries. To do.

エントリに含まれていない１つないし１つ以上の受信IDが存在する場合(ステップ６５０９０でNoの場合)、イベント解析処理モジュール３２５００は、それらの受信IDを含む原因候補を含む原因候補エントリのうち、確信度が最も高い原因候補を選択し(ステップ６５０９５)、解析結果管理表３３６００の選択した原因候補のエントリについて、分類起点フラグフィールド３３６９０に、Yesを登録する。そして、イベント解析処理モジュール３２５００は、選択した候補に対して、ステップ６５０２０以降の処理を再度行う。 If one or more received event IDs are not included in those entries (No in step 65090), the event analysis processing module 32500 selects, from the cause candidate entries that include those received event IDs, the cause candidate with the highest certainty (step 65095), and registers Yes in the classification start flag field 33690 of the selected cause candidate's entry in the analysis result management table 33600. The event analysis processing module 32500 then performs the processing from step 65020 onward again for the selected candidate.

イベント解析処理モジュール３２５００は、解析結果管理表３３６００の分類起点フラグフィールド３３６９０にYesが記載されているエントリを取得し、全ての受信イベントIDが取得したエントリに含まれていた場合(ステップ６５０９０でYesの場合)、原因候補再分類処理を終了する。 The event analysis processing module 32500 acquires an entry in which Yes is described in the classification start flag field 33690 of the analysis result management table 33600, and if all received event IDs are included in the acquired entries (Yes in step 65090). ), The cause candidate reclassification process is terminated.

以上が、イベント解析処理モジュール３２５００が実施する原因候補再分類処理である。なお、図１６では、対処済フラグをYesにしたタイミングと原因候補再分類処理を実行するタイミングとの関係については明記していないが、管理者がいくつかの原因候補について対応し、いくつかの対応済フラグがYesになった後、管理者の指示に従って原因候補再分類処理（図１６）を実行するようにしても良いし、対応済フラグがYesに変更される都度、原因候補再分類処理を実行するようにしても良い。 The above is the cause candidate reclassification processing performed by the event analysis processing module 32500. In FIG. 16, the relationship between the timing at which the handled flag is set to Yes and the timing at which the cause candidate reclassification processing is executed is not specified, but the administrator handles several cause candidates, After the handled flag becomes Yes, the cause candidate reclassification process (FIG. 16) may be executed in accordance with an instruction from the administrator, or each time the handled flag is changed to Yes, the cause candidate reclassification process May be executed.

以下に、原因候補再分類処理の具体例について説明する。なお、第１の実施形態と同様に、処理開始当初の解析結果管理表は図１０、展開ルールは図９、イベント管理表は図７に示す通りのものであるとする。なお、図１５のステップ６４０４０の実行の直前までは処理が終了しているものとする、その過程において、管理者は図１４の結果画面表示で、最初にIPSW1の障害原因を選択しており、図１０の対応済フラグフィールド３３６７０には、5段目のエントリ(IPSW1)の部分にのみYesが記録されているものとする。 A specific example of the cause candidate reclassification process is described below. As in the first embodiment, assume that, at the start of the processing, the analysis result management table is as shown in FIG. 10, the expansion rules are as shown in FIG. 9, and the event management table is as shown in FIG. 7. Also assume that the processing up to immediately before the execution of step 64040 in FIG. 15 has been completed, that in the course of that processing the administrator first selected the IPSW1 failure cause on the result screen of FIG. 14, and that Yes is recorded in the handled flag field 33670 of FIG. 10 only for the fifth-row entry (IPSW1).

イベント解析処理モジュール３２５００は、まず、解析結果管理表３３６００の全ての原因候補のグループIDフィールドと、分類起点フラグフィールドの値を削除する。次に、解析結果管理表３３６００より、対応済フラグがYesの原因候補のうち、確信度が最も高いエントリとして、解析結果管理表の第５段目（5つ目のエントリ）から、IPSW装置の障害原因候補エントリを選択する。 The event analysis processing module 32500 first deletes the values of the group ID field and the classification start flag field for all cause candidates in the analysis result management table 33600. Next, from the analysis result management table 33600, it selects the entry with the highest certainty among the cause candidates whose handled flag is Yes: the fifth row (fifth entry) of the table, which is the failure cause candidate entry for the IPSW device.

次に、イベント解析処理モジュール３２５００は、この候補に含まれる障害イベントである、EV6、EV7、EV8、EV9を抽出する。そして、これらの障害イベントを含む他の障害原因候補として、1段目のエントリ(SYS1/CTL1)、2段目のエントリ(SYS1/CTL2)、3段目のエントリ(SYS1/CTL3)、4段目のエントリ(SYS1/CTL4)を選択する。そして、これら5つのエントリをグループ化し、グループIDとしてGR1を生成してこれらのエントリの解析結果管理表のグループID登録用のフィールド３３６８０に生成したグループIDを登録する。さらに5段目のエントリを、分類を行う際の基準として扱ったので、5段目のエントリの分類起点フラグ３３６９０にはYesを、残りの4エントリの分類起点フラグ３３６９０にはNoを記録する。 Next, the event analysis processing module 32500 extracts the failure events included in this candidate, EV6, EV7, EV8, and EV9. As other failure cause candidates that include any of these failure events, it selects the first-row entry (SYS1/CTL1), the second-row entry (SYS1/CTL2), the third-row entry (SYS1/CTL3), and the fourth-row entry (SYS1/CTL4). These five entries are grouped, GR1 is generated as the group ID, and that group ID is registered in the group ID registration field 33680 of these entries in the analysis result management table. Because the fifth-row entry served as the reference for classification, Yes is recorded in the classification start flag 33690 of the fifth-row entry and No in that of the remaining four entries.
ここまでの処理により、解析結果管理表のすべてのエントリはグループ化された。 Through the processing so far, all entries in the analysis result management table have been grouped.

続いて、イベント解析処理モジュール３２５００は、グループ化の際に参照されなかった障害イベントを抽出する。解析結果管理表３３６００の受信イベントIDフィールド３３６６０に含まれる全てのイベントIDのうち、分類起点フラグ３３６９０にYesが記録されているエントリに含まれないものとしてEV1、EV2、EV3、EV4、EV5が抽出される。それらを含む原因候補として、1段目のエントリ〜4段目のエントリまでの4エントリが存在する。このうち確信度の高い1段目のエントリを起点に、同様のグループ化を行うと、イベント解析処理モジュール３２５００は、障害イベントEV1、EV3、EV6を含む他の障害原因候補として、2段目のエントリ(SYS1/CTL2)と5段目のエントリ(IPSW1)を選択する。そして、イベント解析処理モジュール３２５００は、これら3つのエントリをグループ化し、グループIDとしてGR2を生成してこれらのエントリの解析結果管理表のグループID登録用のフィールド３３６８０に生成したグループIDを登録する。さらに、イベント解析処理モジュール３２５００は、分類を行う際の基準として1段目のエントリを扱ったので、1段目のエントリの分類起点フラグ３３６９０にはYesを記録する。 Subsequently, the event analysis processing module 32500 extracts a failure event that has not been referred to when grouping. Among all event IDs included in the received event ID field 33660 of the analysis result management table 33600, EV1, EV2, EV3, EV4, and EV5 are extracted as those not included in the entry whose Yes is recorded in the classification start flag 33690. Is done. There are four entries from the first entry to the fourth entry as cause candidates including them. If the same grouping is performed starting from the entry with the first level having a high certainty among these, the event analysis processing module 32500 determines the second level as other failure cause candidates including the failure events EV1, EV3, and EV6. Select the entry (SYS1 / CTL2) and the fifth entry (IPSW1). Then, the event analysis processing module 32500 groups these three entries, generates GR2 as a group ID, and registers the generated group ID in the group ID registration field 33680 of the analysis result management table of these entries. Further, since the event analysis processing module 32500 has handled the first-stage entry as a reference for performing classification, Yes is recorded in the classification start flag 33690 of the first-stage entry.

イベント解析処理モジュール３２５００は、解析結果管理表３３６００の受信イベントIDフィールド３３６６０に含まれる全てのイベントIDのうち、分類起点フラグ３３６９０にYesが記録されているエントリに含まれないものとしてEV2、EV4、EV5を抽出する。それらを含む原因候補として、3段目のエントリ、4段目のエントリの2エントリが存在する。このうち確信度の高い3段目のエントリを起点に、同様のグループ化を行うと、イベント解析処理モジュール３２５００は、障害イベントEV2、EV4、EV8を含む他の障害原因候補として、5段目のエントリ(IPSW1)を選択する。そして、イベント解析処理モジュール３２５００は、これら2つのエントリをグループ化し、グループIDとしてGR3を生成してこれらのエントリの解析結果管理表のグループID登録用のフィールド３３６８０に生成したグループIDを登録する。さらに、分類を行う際の基準として3段目のエントリを、扱ったので、イベント解析処理モジュール３２５００は、3段目のエントリの分類起点フラグ３３６９０にはYesを記録する。 Of all the event IDs in the received event ID field 33660 of the analysis result management table 33600, the event analysis processing module 32500 extracts EV2, EV4, and EV5 as those not included in any entry whose classification start flag 33690 is Yes. Two entries, the third-row entry and the fourth-row entry, are cause candidates that include them. Performing the same grouping with the third-row entry, which has the higher certainty, as the starting point, the event analysis processing module 32500 selects the fifth-row entry (IPSW1) as another failure cause candidate that includes any of the failure events EV2, EV4, and EV8. The event analysis processing module 32500 groups these two entries, generates GR3 as the group ID, and registers that group ID in the group ID registration field 33680 of these entries in the analysis result management table. Because the third-row entry served as the reference for classification, the event analysis processing module 32500 records Yes in its classification start flag 33690.

さらに、イベント解析処理モジュール３２５００は、解析結果管理表３３６００の受信イベントIDフィールド３３６６０に含まれる全てのイベントIDのうち、分類起点フラグ３３６９０にYesが記録されているエントリに含まれないものとしてEV5を抽出する。また、イベント解析処理モジュール３２５００は、それらを含む原因候補として、4段目のエントリを起点に同様のグループ化を行うと、障害イベントEV5、EV9を含む他の障害原因候補として、5段目のエントリ(IPSW1)を選択する。そして、イベント解析処理モジュール３２５００は、これら2つのエントリをグループ化し、グループIDとしてGR4を生成してこれらのエントリの解析結果管理表のグループID登録用のフィールド３３６８０に生成したグループIDを登録する。さらに、イベント解析処理モジュール３２５００は、分類を行う際の基準として4段目のエントリを、扱ったので、4段目のエントリの分類起点フラグ３３６９０にはYesを記録する。 Further, of all the event IDs in the received event ID field 33660 of the analysis result management table 33600, the event analysis processing module 32500 extracts EV5 as the one not included in any entry whose classification start flag 33690 is Yes. Performing the same grouping with the fourth-row entry, the cause candidate that includes EV5, as the starting point, the event analysis processing module 32500 selects the fifth-row entry (IPSW1) as another failure cause candidate that includes either of the failure events EV5 and EV9. The event analysis processing module 32500 groups these two entries, generates GR4 as the group ID, and registers that group ID in the group ID registration field 33680 of these entries in the analysis result management table. Because the fourth-row entry served as the reference for classification, the event analysis processing module 32500 records Yes in its classification start flag 33690.

解析結果管理表３３６００の受信イベントIDフィールド３３６６０に含まれる全てのイベントIDのうち、分類起点フラグ３３６９０にYesが記録されているエントリに含まれないものが無くなったため、イベント解析処理モジュール３２５００は、原因候補再分類処理を終了する。 Because no event ID in the received event ID field 33660 of the analysis result management table 33600 remains outside the entries whose classification start flag 33690 is Yes, the event analysis processing module 32500 ends the cause candidate reclassification process.
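The reclassification walkthrough above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the patent's implementation: the function `reclassify`, the dictionary layout, and the certainty values are hypothetical (only their ordering follows the text), and the event set for SYS1/CTL2 is assumed, since the text only implies that it includes EV7 and an event shared with SYS1/CTL1.

```python
def reclassify(cands, handled):
    """Regroup candidates, starting from those the administrator already handled.

    cands: {name: (certainty, set of received event IDs)}
    handled: names whose handled flag (field 33670) is Yes
    Returns {group_id: [origin, other members...]}.
    """
    membership = {name: [] for name in cands}   # group IDs cleared (step 65005)
    origins = []                                # entries whose start flag is Yes
    groups = {}
    all_events = set().union(*(ev for _, ev in cands.values()))

    def best(names):
        return max(names, key=lambda n: cands[n][0], default=None)

    def next_origin():
        ungrouped = [n for n in cands if not membership[n]]
        covered = set().union(*(cands[n][1] for n in origins))
        holders = [n for n in cands if cands[n][1] & (all_events - covered)]
        # Priority: handled and ungrouped, then any ungrouped, then uncovered events.
        return (best([n for n in ungrouped if n in handled])
                or best(ungrouped)
                or best(holders))

    while (origin := next_origin()) is not None:
        gid = f"GR{len(groups) + 1}"
        members = [origin] + [n for n in cands
                              if n != origin and cands[n][1] & cands[origin][1]]
        for n in members:
            membership[n].append(gid)
        groups[gid] = members
        origins.append(origin)
    return groups


cands = {
    "SYS1/CTL1": (90, {"EV1", "EV3", "EV6"}),
    "SYS1/CTL2": (80, {"EV1", "EV3", "EV7"}),   # assumed event set
    "SYS1/CTL3": (70, {"EV2", "EV4", "EV8"}),
    "SYS1/CTL4": (60, {"EV5", "EV9"}),
    "IPSW1": (50, {"EV6", "EV7", "EV8", "EV9"}),
}
regrouped = reclassify(cands, handled={"IPSW1"})
```

With IPSW1 marked as handled, the sketch yields GR1 starting from IPSW1 and containing all five entries, followed by GR2 (SYS1/CTL1, SYS1/CTL2, IPSW1), GR3 (SYS1/CTL3, IPSW1), and GR4 (SYS1/CTL4, IPSW1), in line with the walkthrough. Passing an empty `handled` set makes it behave like the first embodiment's classification.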

<Configuration of failure analysis result display screen>
FIG. 17 is a diagram showing a display example 72000 of the failure analysis result display screen that the management server 30000 displays to the user (administrator) after the cause candidate reclassification process.

As in the first embodiment, the failure analysis result display screen 72000 displays the analysis results defined in the analysis result management table grouped by matching group ID. An entry that is classified into several groups is displayed in each of those groups. Within each group, the cause candidate that served as the starting point for the grouping is displayed at the top, and the remaining candidates are displayed in descending order of certainty factor.
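The display ordering can be sketched as below. The dictionary keys (`cause`, `certainty`, `group_ids`, `origin_of`) are hypothetical; in particular, `origin_of` is an assumed field recording the group for which an entry served as the starting point, which the source tracks only through the classification start flag.

```python
from collections import defaultdict


def layout(rows):
    """Return {group ID: causes in display order} for the result screen."""
    groups = defaultdict(list)
    for row in rows:
        # An entry classified into several groups is listed in each of them.
        for gid in row["group_ids"]:
            groups[gid].append(row)
    screen = {}
    for gid in sorted(groups):
        members = sorted(
            groups[gid],
            # Origin of this group first, then descending certainty factor.
            key=lambda r: (r["origin_of"] != gid, -r["certainty"]),
        )
        screen[gid] = [r["cause"] for r in members]
    return screen


if __name__ == "__main__":
    sample = [
        {"cause": "X", "certainty": 0.5, "group_ids": ["GR1"], "origin_of": "GR1"},
        {"cause": "Y", "certainty": 0.9, "group_ids": ["GR1", "GR2"], "origin_of": "GR2"},
        {"cause": "Z", "certainty": 0.7, "group_ids": ["GR1", "GR2"], "origin_of": None},
    ]
    for gid, causes in layout(sample).items():
        print(gid, causes)
```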

As in the first embodiment, this embodiment displays all cause candidate groups on the same screen. It is sufficient, however, that the groups be displayed separately from one another, so each group may instead be displayed on its own screen, with tabs or a similar control for switching between them.

<Effects of cause candidate reclassification processing>
As described above, according to the second embodiment, when the administrator first selects as the failure cause a candidate that was not displayed at the top of a group in the first embodiment, the management software of the management server 30000 reclassifies the failure cause candidates using that selection as the basis, as shown in FIG. 15. When the candidates have been classified as in the first embodiment (FIG. 14) and the administrator selects IPSW1, which was displayed at the bottom of each group, there may be external circumstances, known to the administrator but not to the management software of the present invention, that lead the administrator to suspect IPSW1 as the cause of the failure. In such a case, the grouping is dynamically reconfigured to match the administrator's selection, as in the second embodiment.

As a result, the screen classifies and displays which other failure causes should be handled with priority once the failure cause occurring in IPSW1 has been dealt with first. Therefore, even if the result presented by the first embodiment differs from the administrator's intent, the classification can be corrected to match it, which reduces the load on the administrator in handling failures.

(3) Summary
In failure cause analysis, after the failure causes have been inferred, the management server obtains, for each inferred failure cause candidate, the failure events that were applied to the analysis rules in the course of deriving it. The failure cause candidates are then classified based on the certainty factor of each candidate and on the failure events from which it was derived. When multiple failures with different causes occur within a short period, however, the number of stored failure analysis results grows, and the administrator may be unable to tell which failure cause candidate was inferred for which actual failure. In such a case, the present invention classifies cause candidates that share actually occurring failure events into the same group. As a result, even when multiple failures with different causes occur, the cause candidates can be classified into plausible combinations.

If a failure event is related to only one cause candidate, the only way to resolve the failure behind that event is to act on that candidate. In this case, because no other cause candidate is related to the failure event, no group is formed on the basis of that event. Consequently, the failure event may remain unresolved even after the failure cause candidates of all groups have been handled. In particular, if the only candidate that can resolve this failure event happens to have been placed in some group because of a different failure event, it is treated the same as the many other candidates in that group even though it is the sole candidate able to resolve the event, and the classification may then delay the response to that event. To prevent this, when there is a failure event that was not used as a basis for classifying the failure cause candidates, the present invention creates a further, separate group for the cause candidate or candidates that resolve it.
That is, the management server repeats the conclusion event classification process while changing the origin cause candidate. After all conclusion events regarded as failure causes have been classified, if a conclusion event other than those selected as origin cause candidates (for example, the second-row entry in FIG. 10) includes a residual condition event, that is, a condition event not included in any conclusion event selected as an origin cause candidate, the management server executes the classification process again with the conclusion event that includes the residual condition event as the origin cause candidate. In this way, the cause candidates can be grouped without omission, and all failures can be repaired.
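The residual condition event check can be sketched as follows. The function name, the mapping layout, and the sample data are assumptions made for illustration only.

```python
def residual_events(candidates, origins):
    """Find condition events not covered by any origin cause candidate.

    candidates: conclusion event -> set of condition event IDs (hypothetical)
    origins:    conclusion events already used as origin cause candidates
    Returns {conclusion event: residual condition events}, omitting
    conclusion events that have none.
    """
    covered = set().union(*(candidates[o] for o in origins))
    leftovers = {}
    for name, events in candidates.items():
        if name in origins:
            continue
        rest = events - covered
        if rest:
            # This conclusion event should become a further origin candidate.
            leftovers[name] = rest
    return leftovers


if __name__ == "__main__":
    table = {
        "A": {"EV1", "EV2"},
        "B": {"EV2", "EV3"},   # EV3 is explained by B alone
        "C": {"EV2"},
    }
    print(residual_events(table, origins=["A"]))
```

Any conclusion event returned here would be used as the origin cause candidate for one more pass of the classification process.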

Furthermore, in the present invention, the management server displays the failure analysis results based on this classification. The display lets the administrator see which other cause candidates each failure cause candidate is grouped with. For example, the cause candidates may be shown on a separate screen for each group; they may be shown on a single screen, reordered by group so that each group can be recognized; or they may be shown on a single screen in an order unrelated to the groups, such as by certainty factor, with the group to which each candidate belongs indicated in its entry.

In this embodiment, an abnormal state is detected from the performance values of each node device, and failure cause candidates are presented to the administrator as the analysis result (obtained by calculating the certainty factor of the abnormal state). Assuming that several events indicating abnormal states can be caused by one particular abnormal state, the failure cause candidates that share abnormal states with the failure cause of highest certainty are classified together. The failure analysis result display screen then displays the analysis results in a way that lets the administrator understand this classification. More specifically, in the computer system of this embodiment, the management server (management system) acquires a processing performance value indicating the processing performance of a node device, detects from the acquired value that a failure has occurred in the node device, selects one of a plurality of conclusion events regarded as failure causes as the origin cause candidate, and extracts the condition events related to the origin cause candidate.
The management server also selects, as related cause candidates, one or more conclusion events that are related to the extracted condition events and that differ from the conclusion event of the origin cause candidate, and classifies the conclusion event of the origin cause candidate and the conclusion events of the related cause candidates separately from the other conclusion events. The management server then displays the classified conclusion events on the display screen. In this way, the administrator can easily judge the handling priority of the analysis results, and the load required for checking the analysis results and handling failures can be reduced.

In addition, the management server displays the conclusion events regarded as failure causes on the display screen, distinguished by classification result, in accordance with the classification of the conclusion events corresponding to the origin cause candidate and the related cause candidates. This makes it easy to determine which analysis results should be handled and makes it possible to manage handled and unhandled results separately.

The management server also classifies, into the same group as the conclusion event of the origin cause candidate, the conclusion event of any related cause candidate whose analysis rule includes at least one condition event identical to a condition event related to the conclusion event of the origin cause candidate. This makes the classification condition explicit and places in one group the cause candidates that may be resolved at the same time when the origin cause candidate is handled, which reduces the burden on the administrator.
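The grouping condition stated above (at least one shared condition event) can be sketched as a small function. The `rules` mapping and the event names are hypothetical placeholders for the analysis rules held by the management server.

```python
def related_cause_candidates(origin, rules):
    """Select conclusion events that share a condition event with the origin.

    rules: conclusion event -> set of condition events in its analysis rule
           (a hypothetical, simplified representation)
    """
    origin_conditions = rules[origin]
    return sorted(
        conclusion
        for conclusion, conditions in rules.items()
        # Same group only if at least one condition event is shared.
        if conclusion != origin and conditions & origin_conditions
    )


if __name__ == "__main__":
    rules = {
        "C1": {"EV1", "EV2"},
        "C2": {"EV2", "EV3"},   # shares EV2 with C1
        "C3": {"EV4"},          # shares nothing with C1
    }
    print(related_cause_candidates("C1", rules))
```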

The conclusion event with the highest certainty factor may be selected as the origin cause candidate. The classification process can then be performed automatically around the analysis result considered to have the highest handling priority, which allows failures to be handled efficiently.

A management server that performs failure analysis cannot always fully grasp the external circumstances surrounding the managed targets. It therefore cannot be ruled out that the failure cause presented by the classification in this embodiment differs from the event the administrator actually regards as the failure cause. For this reason, when the administrator selects a cause candidate with low priority (certainty factor) and performs failure recovery, the grouping is dynamically reconfigured to match the administrator's selection (see the second embodiment). That is, for a classification result containing a plurality of classification groups, the management server decides whether to execute the classification process again based on information about which classification group contains the conclusion event that the administrator selected when handling the failure. If so, it executes the classification process again with the conclusion event selected at the time of failure handling as the origin cause candidate. By dynamically re-executing the classification process in this way, the administrator can handle failures based on experience and can manage the computer system efficiently.
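A reclassification driven by the administrator's selection might look like the sketch below. It assumes, beyond what the source states, that the selected conclusion event becomes the first origin cause candidate and that later origins are chosen by highest certainty factor among candidates containing uncovered events. Apart from IPSW1, the names and all values are hypothetical.

```python
def reclassify(candidates, certainty, selected):
    """Rebuild the groups with the administrator's selection as first origin.

    candidates: conclusion event -> set of condition event IDs
    certainty:  conclusion event -> certainty factor
    selected:   conclusion event chosen by the administrator
    Returns {group ID: members}, origin first, then descending certainty.
    """
    all_events = set().union(*candidates.values())
    origins, groups = [], {}
    origin = selected
    while True:
        origins.append(origin)
        others = [
            c for c in candidates
            if c != origin and candidates[c] & candidates[origin]
        ]
        groups[f"GR{len(origins)}"] = [origin] + sorted(
            others, key=lambda c: -certainty[c]
        )
        covered = set().union(*(candidates[o] for o in origins))
        uncovered = all_events - covered
        if not uncovered:
            return groups
        # Next origin: most certain candidate containing an uncovered event.
        pool = [c for c in candidates if candidates[c] & uncovered]
        origin = max(pool, key=lambda c: certainty[c])


if __name__ == "__main__":
    table = {
        "SERVER1": {"EV1", "EV2"},
        "STORAGE1": {"EV3"},
        "IPSW1": {"EV2", "EV3"},
    }
    factors = {"SERVER1": 0.9, "STORAGE1": 0.7, "IPSW1": 0.4}
    print(reclassify(table, factors, selected="IPSW1"))
```

Passing the highest-certainty candidate as `selected` gives the same behavior as an automatic classification, so the same routine can serve both cases in this sketch.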

The present invention can also be realized by software program code that implements the functions of the embodiments. In this case, a storage medium on which the program code is recorded is provided to a system or apparatus, and the computer (or CPU or MPU) of that system or apparatus reads the program code stored on the storage medium. The program code read from the storage medium then itself realizes the functions of the embodiments described above, and the program code and the storage medium storing it constitute the present invention. Storage media that can be used to supply such program code include, for example, flexible disks, CD-ROMs, DVD-ROMs, hard disks, optical disks, magneto-optical disks, CD-Rs, magnetic tape, nonvolatile memory cards, and ROMs.

An OS (operating system) or the like running on the computer may also perform part or all of the actual processing based on the instructions of the program code, with the functions of the embodiments described above being realized by that processing. Further, after the program code read from the storage medium has been written to memory on the computer, the CPU or the like of the computer may perform part or all of the actual processing based on the instructions of the program code, with the functions of the embodiments described above being realized by that processing.

The software program code that implements the functions of the embodiments may also be distributed over a network and stored in storage means such as a hard disk or memory of a system or apparatus, or on a storage medium such as a CD-RW or CD-R, so that in use the computer (or CPU or MPU) of the system or apparatus reads and executes the program code stored in that storage means or on that storage medium.

10000: Server
20000: Storage device
30000: Management server
35000: Web browser activation server
40000: IP switch
45000: Network

Claims

A management method for a computer system comprising: a node device to be monitored; and a management system connected to the node device via a network and monitoring and managing the node device,
The management system acquires a processing performance value indicating the processing performance of the node device, detects that a failure has occurred in the node device from the acquired processing performance value,
The management system applies the detected failure to an analysis rule indicating a relationship between a combination of one or more condition events that may occur in the node device and a conclusion event regarded as the cause of failure for that combination of condition events, and calculates a certainty factor, which is information indicating the likelihood that a failure has occurred in the node device,
The management system selects one of a plurality of conclusion events regarded as causes of failure as an origin cause candidate, and extracts the condition events related to the origin cause candidate,
The management system selects a conclusion event related to the extracted condition event, which is one or more conclusion events that are different from the conclusion event of the origin cause candidate, as related cause candidates,
The management system classifies the conclusion event of the origin cause candidate and the conclusion event of the related cause candidate separately from other conclusion events,
The management system displays the classified conclusion event on a display screen;
A computer system management method characterized by the above.

In claim 1,
The management system is characterized in that, according to the classification result of the conclusion event corresponding to the origin cause candidate and the related cause candidate, the conclusion event to be the cause of the failure is distinguished for each classification result and displayed on the display screen. Computer system management method.

In claim 1,
A computer system management method, wherein the management system classifies, into the same group as the conclusion event of the origin cause candidate, the conclusion event of a related cause candidate whose analysis rule includes at least one condition event identical to a condition event related to the conclusion event of the origin cause candidate.

In claim 1,
The management system selects the conclusion event having the highest certainty factor as the origin cause candidate, and classifies the conclusion event of the related cause candidate according to the condition event related to the conclusion event of the origin cause candidate. A management method for a computer system.

In claim 1,
A computer system management method, wherein the management system repeats the classification process of the conclusion events while changing the origin cause candidate among the plurality of conclusion events regarded as causes of failure; after classifying all the conclusion events regarded as causes of failure, determines whether a conclusion event other than the conclusion events selected as origin cause candidates includes a residual condition event, which is a condition event other than the condition events included in the conclusion events selected as origin cause candidates; and further executes the classification process with the conclusion event including the residual condition event as the origin cause candidate.

In claim 2,
A computer system management method, wherein, for the classification result including a plurality of classification groups, the management system determines whether to execute the classification process again based on information about which classification group contains the conclusion event selected by the administrator at the time of failure handling.

In claim 6,
The management system re-executes the classification process by using the conclusion event selected at the time of the failure handling as the origin cause candidate.

A management system connected to a monitored node device via a network and managing the node device,
A processor that acquires a processing performance value indicating the processing performance of the node device, and detects a state of the node device from the acquired processing performance value;
A memory for storing an analysis rule indicating a relationship between a combination of one or more condition events that can occur in the node device and a conclusion event that is a cause of a failure in the combination of the condition events;
The processor is
Applying the detected state to the analysis rule, calculating a certainty factor that is information indicating the possibility of failure in the node device,
Selecting one of a plurality of conclusion events regarded as causes of failure as an origin cause candidate, and extracting the condition events related to the origin cause candidate,
A conclusion event related to the extracted condition event, and one or more conclusion events that are different from the conclusion event of the origin cause candidate are selected as related cause candidates;
Classifying the conclusion event of the origin cause candidate and the conclusion event of the related cause candidate separately from other conclusion events;
Displaying the classified conclusion event on a display screen;
Management system characterized by that.

In claim 8,
The processor is characterized in that, according to the classification result of the conclusion event corresponding to the origin cause candidate and the related cause candidate, the conclusion event to be the cause of the failure is distinguished for each classification result and displayed on the display screen. system.

In claim 8,
A management system, wherein the processor classifies, into the same group as the conclusion event of the origin cause candidate, the conclusion event of a related cause candidate whose analysis rule includes at least one condition event identical to a condition event related to the conclusion event of the origin cause candidate.

In claim 8,
The processor selects the conclusion event having the highest certainty factor as the origin cause candidate, and classifies the conclusion event of the related cause candidate according to the condition event related to the conclusion event of the origin cause candidate. Management system.

In claim 8,
The processor repeats the classification process of the conclusion event by changing the origin cause candidate in the plurality of conclusion events that are the cause of failure, classifies all the conclusion events that are the cause of the failure, and then as the origin cause candidate It is determined whether a conclusion event other than the selected conclusion event includes a residual condition event that is a condition event other than the condition event included in the conclusion event selected as the origin cause candidate, and a conclusion including the residual condition event A management system further executing a classification process using an event as the origin cause candidate.

In claim 9,
A management system, wherein, for the classification result including a plurality of classification groups, the processor determines whether to execute the classification process again based on information about which classification group contains the conclusion event selected by the administrator at the time of failure handling, and, when it determines that the classification process is to be executed again, executes the classification process again with the conclusion event selected at the time of failure handling as the origin cause candidate.