JP5651381B2 - Failure cause determination rule verification device and program

Info

Publication number: JP5651381B2
Application number: JP2010136300A
Authority: JP
Inventors: 小林　宏至
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2010-06-15
Filing date: 2010-06-15
Publication date: 2015-01-14
Anticipated expiration: 2030-06-15
Also published as: JP2012003406A

Description

The present invention relates to a device that verifies, based on past event logs, the failure cause analysis rules used in a failure cause analysis system, and to a program that implements the device in software.

Delays in recovery work when a system failure occurs have a major impact on corporate performance and social infrastructure. Rapid recovery from a system failure depends on identifying the cause of the failure early and deciding on the recovery procedure.

For this reason, a failure cause analysis system that supports rapid recovery from failures has been proposed (Patent Document 1). This system holds failure cause determination rules, each of which associates the events that occur during a specific failure with a recovery procedure. By analyzing failure events with these rules, it provides the appropriate recovery procedure to the person in charge of recovery.

However, creating failure cause determination rules by hand is difficult. Methods for automatically generating failure cause determination rules from an event log have therefore been proposed (Patent Documents 2 and 3). Patent Document 2 describes a method that uses the occurrence frequency of specific events. Patent Document 3 describes a method that uses event occurrence patterns.

Patent Document 1: International Publication No. 2004/061681
Patent Document 2: JP 2008-41041 A
Patent Document 3: JP 2006-4346 A

Non-Patent Document 1: Fisher, Douglas H., “Knowledge acquisition via incremental clustering”, Machine Learning 2, 139-172, 1987

However, even after a failure cause determination rule has been created and registered, its registered content needs to be updated for the following reasons.
(1) Addition of a new IT service / discontinuation of an existing IT service
When a new IT service goes into operation, system failures related to that service begin to occur. In this case, failure cause determination rules corresponding to the new system failures must be created and added to the existing failure cause determination rules. Conversely, when an existing IT service is discontinued, system failures related to that service no longer occur. In this case, the failure cause determination rules corresponding to failures that will no longer occur must be deleted from the existing failure cause determination rules.
(2) Change of IT infrastructure configuration
During system operation, the IT infrastructure may change even when the IT services provided do not. For example, hardware may be replaced or the network configuration may be changed. When the system configuration changes in this way, the attribute values of the events that occur and the way the events appear are affected, even for system failures with the same cause. The failure cause determination rules therefore need to be changed.
(3) Change in the understanding of system failures
A failure cause determination rule is naturally created from the information available at the time of its creation. A lack of information about system failures can therefore lead to an erroneous failure cause determination rule. For example, system failure A and system failure B, which had been judged to have the same cause, may later turn out to have different causes. Conversely, failures initially judged to have different causes may later turn out to have the same cause.

The conventional methods, however, do not take subsequent maintenance into account, even when they generate failure cause determination rules automatically. In other words, they give no consideration to continually verifying the validity of a failure cause determination rule after it has been created and updating the rule as necessary.

Therefore, the inventor provides a mechanism for automatically updating failure cause determination rules according to the operating status. Specifically, the mechanism includes a process that automatically sets a time window giving the time interval at which the failure cause determination rules are reviewed, a process that creates a temporary failure classification tree based on the events that occurred within the most recent time window, and a process that compares the created temporary failure classification tree with the failure classification tree corresponding to the failure cause determination rules in operation (the registered failure classification tree) and updates the failure cause determination rules used in operation based on the comparison result.

According to the present invention, the failure cause determination rules used in operation can be automatically optimized in response to changes in the operating status.

A diagram showing a system configuration example of the failure cause analysis system.
A diagram explaining a specific example of the event table held by the log DB.
A diagram explaining the failure classification tree and the failure cause determination rule table held by the failure cause determination rule DB.
A diagram explaining the failure node table held by the failure cause determination rule DB.
A diagram showing a system configuration example of the failure cause determination rule verification computer.
A flowchart showing an outline of the failure cause analysis process.
A diagram showing an example of the feature table of event blocks.
A flowchart showing an example of the execution procedure of the scheduling process for failure cause determination rules.
A flowchart showing an example of the execution procedure of the verification/update process for failure cause determination rules.
A diagram explaining the concept of the update process for failure cause determination rules.
A diagram explaining a specific example of the update process.
A diagram explaining the concept of the addition process for failure cause determination rules.
A diagram explaining the concept of the deletion process for failure cause determination rules.
A diagram explaining the concept of the integration process for failure cause determination rules.
A diagram explaining a specific example of the integration process.
A diagram explaining the concept of the division process for failure cause determination rules.
A diagram explaining a specific example of the division process.

Embodiments of the present invention are described below with reference to the drawings. The device configurations and processing operations described below are examples for explaining the invention. The present invention also encompasses combinations of the device configurations described below with one another, combinations of those device configurations with known techniques, and combinations of parts of those device configurations with known techniques.

(Overall configuration of failure cause analysis system)
FIG. 1 shows a configuration example of a failure cause analysis system that includes the failure cause determination rule verification computer 107. The failure cause analysis system shown in FIG. 1 has a monitoring target server group 101, a monitoring server 102, a log database (DB) 103, a failure cause determination rule generation computer 104, a failure cause analysis computer 105, a failure cause determination rule DB 106, a failure cause determination rule verification computer 107, a recovery procedure manual database (DB) 108, and a recovery procedure manual browsing computer 109.

Of these, the monitoring server 102 provides a function of monitoring the state of the monitoring target server group 101 (for example, whether each server is alive) and generating events according to that state. Events generated by the monitoring server 102 are stored in the log database (DB) 103. The failure cause determination rule generation computer 104 provides a function of reading the event log from the log DB 103 and generating failure cause determination rules. The failure cause determination rules generated by the failure cause determination rule generation computer 104 are stored in the failure cause determination rule DB 106. The failure cause analysis computer 105 provides a function of analyzing events based on the failure cause determination rules stored in the failure cause determination rule DB 106 and identifying the recovery procedure manual for a failure. The failure cause determination rule verification computer 107 analyzes the events generated by the monitoring server 102 and verifies the validity of the failure cause determination rules stored in the failure cause determination rule DB 106. The recovery procedure manual database (DB) 108 stores documents about recovery procedures for failures. These documents include not only manuals that describe troubleshooting when a failure occurs (whether for hardware or software) but also maintenance staff records of responses to past failures, reports, and other documents about procedures for recovering from failures. The recovery procedure manual browsing computer 109 provides a function of displaying on screen the recovery procedure manual identified by the failure cause analysis computer 105.

(Specific example of event table)
FIG. 2 shows a specific example of the event table 200 stored in the log DB 103. The event table 200 consists of an identifier (ID) 201 that uniquely identifies an event, an occurrence date and time 202 that specifies when the event occurred, and an event 203 that is the set of attribute values of the individual event. In this embodiment, the attributes of the event 203 are defined as <type>, <source>, <event number>, <user>, and <computer>. Of these, <type> indicates the severity of the event. <source> indicates the origin of the event, such as the process or application that generated it. <event number> indicates a number that identifies the content of the event. <user> indicates the user who was running the process or application that generated the event. <computer> indicates the server in the monitoring target server group 101 on which the event originated.
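As a rough illustration, one row of the event table 200 could be modeled as shown below. The class name `Event` and the field names are hypothetical and chosen only for this sketch; the description itself specifies only the columns 201 to 203 and the five attributes, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One row of the event table (hypothetical field names)."""
    event_id: int          # identifier (ID) 201
    occurred_at: datetime  # occurrence date and time 202
    type: str              # <type>: severity of the event
    source: str            # <source>: process or application that raised it
    event_number: str      # <event number>: identifies the event content
    user: str              # <user>: user running the source process
    computer: str          # <computer>: server in the monitored group

    def attributes(self) -> tuple:
        """Return the five attribute values that make up the event 203."""
        return (self.type, self.source, self.event_number,
                self.user, self.computer)


# Illustrative values only.
example = Event(1, datetime(2010, 6, 15, 9, 0, 0),
                "alert", "process71", "80", "user2", "server9")
```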

(Specific example of failure cause determination rule DB)
FIG. 3-1 and FIG. 3-2 show a configuration example of the failure cause determination rule DB 106. The failure cause determination rule DB 106 is a DB that stores the failure cause determination rules registered in the failure cause analysis computer 105 together with related information. The failure cause determination rule DB 106 consists of a registered failure classification tree 300, a failure cause determination rule table 310, and a failure node table 320.

The registered failure classification tree 300 is created when the failure cause determination rules registered in the failure cause analysis computer 105 are generated. In the registered failure classification tree 300, failures are classified according to the features shared by the set of one or more events that occurred at the time of each failure (hereinafter referred to as an “event block”), and the classification is expressed as a tree. A node of the registered failure classification tree 300 is called a failure node. Failures classified into the same failure node are similar in the events that occurred and in the way they occurred, so they are considered to have the same failure cause.

The failure cause determination rule table 310 is a data table that stores the failure cause determination rules registered in the failure cause analysis computer 105. The failure cause determination rule table 310 consists of a failure node 311 of the registered failure classification tree 300 and a failure cause determination rule 312 that is applied to failures classified into that failure node. The failure cause determination rule 312 consists of one or more determination events 313, a determination time 314, and a recovery procedure manual 315. The determination event 313 is a set of event attributes that characterizes the target failure node. The determination time 314 is the time interval within which events satisfying the determination events 313 occur. The recovery procedure manual 315 is the document displayed on the recovery procedure manual browsing computer 109 when events satisfying the determination events 313 occur within the determination time 314.

When a plurality of determination events 313 are specified for one failure node 311, the determination events 313 are assumed to appear in the order in which they are listed in the failure cause determination rule table 310. In the case of (2) in FIG. 3-1, the failure cause determination rule 312 for “failure node 1-1” is as follows: if, after an event with the attribute values (“alert”, “process71”, “80”, “user2”, “server9”) occurs, an event with the attribute values (“*”, “process39”, “*”, “user4”, “server8”) occurs within the determination time of “2 minutes 9 seconds”, then “復旧手順A.doc” (recovery procedure A) is displayed on the recovery procedure manual browsing computer 109. Here, the attribute value “*” means that the value is unspecified and may take any value.
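The matching behavior described above can be sketched as follows. This is a minimal illustration, not the implementation of the failure cause analysis computer 105: the function names `attrs_match` and `rule_fires` are hypothetical, and the sketch assumes that the determination time is measured from the event that matches the first determination event, which the description does not state explicitly.

```python
from datetime import datetime, timedelta

WILDCARD = "*"  # attribute value that matches anything


def attrs_match(pattern, attrs):
    """True if every non-wildcard pattern value equals the event's value."""
    return len(pattern) == len(attrs) and all(
        p == WILDCARD or p == a for p, a in zip(pattern, attrs)
    )


def rule_fires(patterns, limit, events):
    """Check whether the determination events appear in order within `limit`.

    patterns: determination events, in the order listed in the rule table.
    limit:    determination time (assumed to start at the first match).
    events:   list of (occurred_at, attrs) tuples sorted by time.
    """
    for i, (start, attrs) in enumerate(events):
        if not attrs_match(patterns[0], attrs):
            continue
        idx = 1  # index of the next determination event to find
        for when, later in events[i + 1:]:
            if idx == len(patterns) or when - start > limit:
                break
            if attrs_match(patterns[idx], later):
                idx += 1
        if idx == len(patterns):
            return True
    return False


# Rule for "failure node 1-1" from the text; the event times are made up.
rule = [("alert", "process71", "80", "user2", "server9"),
        ("*", "process39", "*", "user4", "server8")]
limit = timedelta(minutes=2, seconds=9)
t = datetime(2010, 6, 15, 9, 0, 0)
events = [(t, ("alert", "process71", "80", "user2", "server9")),
          (t + timedelta(seconds=90),
           ("error", "process39", "12", "user4", "server8"))]
fired = rule_fires(rule, limit, events)
```

When `fired` is true, the system would look up the recovery procedure manual registered for the matched failure node.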

The failure node table 320 stores the event blocks used as training data when the registered failure classification tree 300 was constructed. The failure node table 320 consists of a failure node 311, the event blocks 321 classified into that failure node, and the events 203 contained in each event block 321.

(Configuration example of failure cause determination rule verification computer)
FIG. 4 shows a configuration example of the failure cause determination rule verification computer 107. The failure cause determination rule verification computer 107 consists of a computer main body 400, an input device 430, a display device 431, and a communication device 432. The communication device 432 communicates with the monitoring server 102, the log DB 103, and the failure cause determination rule DB 106.

The computer main body 400 consists of a CPU 401 that performs data operations, a ROM 402, a RAM 410, a hard disk drive 420, a CPU bus 407 that carries data between these devices, and interfaces 403 to 406 that connect these devices to the CPU bus 407.

The RAM 410 provides at least an execution area for the failure cause determination rule verification program 411, which causes the CPU 401 to perform the processing, and a work area 412 that stores data generated temporarily during verification. The storage area of the hard disk drive 420 provides at least a program storage unit 421, which stores the failure cause determination rule verification program, and a data storage unit 422, which temporarily stores data acquired from the monitoring server 102 and the failure cause determination rule DB 106.

(Failure cause analysis operation)
FIG. 5 shows an outline of the failure cause analysis process of the failure cause analysis system as a whole.
(Step 501)
The failure cause determination rule generation computer 104 acquires the event log from the log DB 103, generates failure cause determination rules, and stores them in the failure cause determination rule DB 106. The failure cause determination rules are created in the following order: (1) creation of a failure classification tree, (2) discovery of frequent event patterns, and (3) search for recovery procedure manuals.

(1) Creation of failure classification tree
The failure cause determination rule generation computer 104 groups the events acquired from the log DB 103 by failure. Events grouped by failure in this way are called an event block. Next, the failure cause determination rule generation computer 104 extracts features from each event block, performs unsupervised clustering based on the extracted features, and constructs a classification tree. This classification tree corresponds to the registered failure classification tree 300 of the failure cause determination rule DB 106. One clustering method that can be used here is the conceptual clustering method COBWEB described in Non-Patent Document 1.
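COBWEB decides where to place each instance by evaluating category utility, a score that rewards partitions whose clusters make attribute values more predictable. The sketch below computes category utility for nominal attributes as defined in Non-Patent Document 1. It is only the scoring function, not the full incremental algorithm with its merge and split operators. The function name and the sample data are assumptions made for illustration, and the description does not say how event block features are encoded for clustering.

```python
from collections import Counter


def category_utility(clusters):
    """Category utility of a partition over nominal attributes.

    clusters: list of clusters; each cluster is a non-empty list of
              equal-length tuples of attribute values.
    """
    items = [row for cluster in clusters for row in cluster]
    n = len(items)
    n_attrs = len(items[0])

    def sum_sq(rows):
        # Sum over attributes i and values j of P(A_i = V_ij)^2 within rows.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(r[i] for r in rows)
            total += sum((c / len(rows)) ** 2 for c in counts.values())
        return total

    base = sum_sq(items)  # predictability without any clustering
    gain = sum(len(c) / n * (sum_sq(c) - base) for c in clusters)
    return gain / len(clusters)


# A partition that separates the two kinds of rows scores higher than
# one that mixes them (illustrative data only).
separated = [[("a", "x"), ("a", "x")], [("b", "y"), ("b", "y")]]
mixed = [[("a", "x"), ("b", "y")], [("a", "x"), ("b", "y")]]
cu_separated = category_utility(separated)
cu_mixed = category_utility(mixed)
```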

(2) Discovery of frequent event patterns
The failure cause determination rule generation computer 104 finds the one or more events that appear frequently across the event blocks classified into a specified failure node of the classification tree. When there is more than one frequent event, it also obtains the order in which the frequent events appear and the time interval between them. These correspond to the determination event 313 and the determination time 314 in the failure cause determination rule 312 of the failure cause determination rule table 310 in the failure cause determination rule DB 106.

FIG. 6 shows a configuration example of the event block feature table 600, which is created by extracting the features of each event block. The feature table 600 consists of an event block ID 601 that identifies an event block and an attribute list 602 that holds the features of that event block. For each attribute of the event 203, the attribute list 602 holds the attribute value that appears most frequently in the event block and the attribute value that appears second most frequently. Two attribute values are therefore assigned to each of the attributes “type”, “source”, “event”, “user”, and “computer”.
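The construction of one attribute list 602 can be sketched as follows. The function name `block_features` is hypothetical, and the use of `None` when an event block contains only one distinct value for an attribute is an assumption, since the description does not cover that case.

```python
from collections import Counter

ATTRIBUTES = ("type", "source", "event", "user", "computer")


def block_features(events):
    """Return {attribute: [most frequent value, second most frequent value]}.

    events: list of 5-tuples of attribute values belonging to one event block.
    """
    features = {}
    for i, name in enumerate(ATTRIBUTES):
        counts = Counter(e[i] for e in events)
        top = [value for value, _ in counts.most_common(2)]
        if len(top) < 2:
            top.append(None)  # assumption: no second value available
        features[name] = top
    return features


# Illustrative event block (values are made up).
block = [("alert", "process71", "80", "user2", "server9"),
         ("alert", "process39", "80", "user4", "server8"),
         ("error", "process71", "81", "user2", "server9")]
features = block_features(block)
```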

(3) Search for recovery procedure manuals
The failure cause determination rule generation computer 104 generates a search key based on the attribute values of the frequent events obtained in (2). For example, the combination of the most frequent attribute value for each of the five attributes “type”, “source”, “event”, “user”, and “computer” is set as the search key. Next, the failure cause determination rule generation computer 104 searches the recovery procedure manual DB 108 with the generated search key and acquires the appropriate recovery procedure manual. This recovery procedure manual corresponds to the recovery procedure manual 315 in the failure cause determination rule 312 of the failure cause determination rule table 310 (FIG. 3-1).

(Step 502)
When the failure cause analysis computer 105 detects that the failure cause determination rule DB 106 has been updated by the failure cause determination rule generation computer 104, it acquires the failure cause determination rules 312 of the failure cause determination rule table 310 (FIG. 3-1) from the failure cause determination rule DB 106 and registers them.

(Step 503)
The monitoring server 102 monitors the monitoring target server group 101. When the monitoring server 102 finds an abnormality caused by a failure in a server in the monitoring target server group 101, it generates an event corresponding to the state of that server. The monitoring server 102 stores the generated event in the log DB 103 and also transmits the event to the failure cause analysis computer 105 and the failure cause determination rule verification computer 107.

(Step 504)
The failure cause analysis computer 105 matches the received event against the failure cause determination rules 312. If the received event matches any of the failure nodes registered in the failure cause determination rules 312, the failure cause analysis computer 105 acquires from the recovery procedure manual DB 108 the recovery procedure manual 315 registered for the matched failure node and transmits it to the recovery procedure manual browsing computer 109.

(Step 505)
The recovery procedure manual browsing computer 109 displays the recovery procedure manual 315 received from the failure cause analysis computer 105 on its display device.

(Step 506)
When the failure cause determination rule verification computer 107 receives an event from the monitoring server 102, it creates failure cause determination rules from the set of events within the configured time window and updates the failure cause determination rule DB 106. The details of this processing are described later.

(Step 507)
When the failure cause analysis computer 105 detects that the failure cause determination rule DB 106 has been updated by the failure cause determination rule verification computer 107, it acquires the failure cause determination rules 312 from the failure cause determination rule DB 106 and replaces the failure cause determination rules currently in use with them.

(Failure cause determination rule verification operation)
FIG. 7-1 and FIG. 7-2 show an overview of the failure cause determination rule verification/update process executed through the failure cause determination rule verification program 411. First, FIG. 7-1 shows the scheduling of the verification process.

(Detailed operation of scheduling)
(Step 700)
Execution of the failure cause determination rule verification program 411 starts when the failure cause determination rule generation computer 104 detects, via the communication device 432, an update of the failure cause determination rule DB 106.

(Step 701)
The failure cause determination rule verification computer 107 calculates the boundary time tb of the failure occurrence time interval. The boundary time tb is the time interval such that, if no failure with the same failure cause occurs within tb after the most recent failure, no such failure is expected to occur thereafter either. The occurrence time of a failure is taken to be the occurrence date and time of the first event in the corresponding event block. The boundary time tb is determined by the following procedure (1) to (3).

(1) Acquisition of registered failure classification tree
The failure cause determination rule verification computer 107 acquires the registered failure classification tree 300, the failure cause determination rule table 310, and the failure node table 320 from the failure cause determination rule DB 106 via the communication device 432. Based on this information, the failure cause determination rule verification computer 107 calculates the time difference between the occurrence date and time t0 of the first event used as training data when the registered failure classification tree 300 was constructed and the occurrence date and time of the last such event. The failure cause determination rule verification computer 107 treats this time difference as the classification tree construction time Δ, which gives the time range of the events used when a failure classification tree is created.

(2) Failure occurrence time interval of verification failure nodes
The failure cause determination rule verification computer 107 sets as verification targets the failure nodes 311 in the failure cause determination rule table 310 for which a failure cause determination rule 312 is set. A failure node that is a verification target is hereinafter referred to as a “verification failure node”. The failure cause determination rule verification computer 107 then accesses the failure node table 320 and calculates the failure occurrence time intervals for the event blocks 321 corresponding to each verification failure node.

(3) Calculation of boundary time tb
From the distribution of the occurrence time intervals obtained in (2), the failure cause determination rule verification computer 107 uses hypothesis testing to obtain the boundary time tb. The boundary time tb is the boundary value at which the hypothesis “a failure that occurs tb after a failure classified into a certain failure node is classified into that same failure node” is rejected at a significance level of 1%.
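The description does not say which hypothesis test is used. As a stand-in, the sketch below takes the empirical 99% quantile (nearest-rank method) of the observed occurrence intervals as tb, so that only about 1% of observed intervals exceed it. This substitution, the pooling of intervals across all verification failure nodes, and the names `occurrence_intervals` and `boundary_time` are all assumptions made for illustration.

```python
import math
from datetime import datetime, timedelta


def occurrence_intervals(start_times):
    """Intervals between consecutive failures of one failure node.

    start_times: occurrence time of the first event of each event block.
    """
    ts = sorted(start_times)
    return [b - a for a, b in zip(ts, ts[1:])]


def boundary_time(blocks_by_node, alpha=0.01):
    """Empirical (1 - alpha) quantile of the pooled occurrence intervals.

    blocks_by_node: {failure node: [failure occurrence datetimes]}.
    This replaces the unspecified hypothesis test with a simple quantile.
    """
    intervals = []
    for starts in blocks_by_node.values():
        intervals.extend(occurrence_intervals(starts))
    if not intervals:
        raise ValueError("at least one node needs two or more failures")
    intervals.sort()
    rank = math.ceil((1 - alpha) * len(intervals))  # nearest-rank quantile
    return intervals[rank - 1]


# Illustrative occurrence times (made up).
d0 = datetime(2010, 1, 1)
history = {
    "node 1-1": [d0, d0 + timedelta(days=1), d0 + timedelta(days=3)],
    "node 1-2": [d0, d0 + timedelta(days=5)],
}
tb = boundary_time(history)
```

With little history the quantile collapses to the largest observed interval, so a real system would need enough failures per node for the estimate to be meaningful.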

(Step 702)
The failure cause determination rule verification computer 107 acquires, from the log DB 103 via the communication device 432, the events that occurred between date and time (t0 + tb) and date and time (t0 + Δ).

(Step 703)
The failure cause determination rule verification computer 107 creates a temporary failure classification tree using the events acquired in step 702 as training data. The temporary failure classification tree is a failure classification tree created temporarily by the failure cause determination rule verification computer 107, by the same method as the registered failure classification tree 300. As with the failure cause determination rule DB 106, a failure cause determination rule table and a failure node table are created together with the temporary failure classification tree. The created temporary failure classification tree and its failure cause determination rule table and failure node table are stored in the work area 410.

(Step 704)
The failure cause determination rule verification computer 107 calculates the verification start date and time tvs as tvs = t0 + Δ + tb.
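The date arithmetic of steps 701, 702, and 704 can be sketched as follows. This is an illustration only: the function names are hypothetical, and the sample values for t0, Δ, and tb are made up.

```python
from datetime import datetime, timedelta

def construction_time(training_event_times):
    """Classification tree construction time: span from first to last training event."""
    return max(training_event_times) - min(training_event_times)

def initial_windows(t0, delta, tb):
    """Return the step 702 training window and the step 704 verification start tvs."""
    training_window = (t0 + tb, t0 + delta)   # events used for the temporary tree
    tvs = t0 + delta + tb                     # verification start date and time
    return training_window, tvs

# Hypothetical values: training data from 1 January to 1 April 2009, tb of 10 days.
t0 = datetime(2009, 1, 1)
delta = construction_time([t0, datetime(2009, 2, 14), datetime(2009, 4, 1)])
window, tvs = initial_windows(t0, delta, timedelta(days=10))
```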

(Step 705)
The failure cause determination rule verification computer 107 receives, via the communication device 432, an event transmitted by the monitoring server 102 upon detecting a failure in the monitoring target server group 101.

(Step 706)
The failure cause determination rule verification computer 107 compares the occurrence date and time te of the received event with the verification start date and time tvs. If te is earlier than or equal to tvs, the failure cause determination rule verification computer 107 executes step 710. If te is later than tvs, the failure cause determination rule verification computer 107 executes step 707.

(Step 707)
The failure cause determination rule verification computer 107 verifies the validity of the failure cause determination rule 312 set for the failure node under verification in the failure cause determination rule table 310 of the failure cause determination rule DB 106 and, if necessary, updates the contents of the failure cause determination rule DB 106. The details of this processing are described later.

(Step 708)
The failure cause determination rule verification computer 107 sets a new verification start date and time tvs, obtained as follows. First, for the verification failure node of the temporary failure classification tree created in step 709, the failure cause determination rule verification computer 107 calculates the boundary time tb' of the failure occurrence time interval by the same method as in (2) and (3) of step 701. Next, the failure cause determination rule verification computer 107 calculates tvs + tb' and uses the result as the new verification start date and time tvs.

(Step 709)
The failure cause determination rule verification computer 107 acquires the events that occurred in the time range (tvs - Δ to te) from the failure node table of the current temporary failure classification tree stored in the work area 412. Next, using the acquired events as training data, the failure cause determination rule verification computer 107 creates a new temporary failure classification tree, failure cause determination rule table, and failure node table, and stores them in the work area 412. The failure cause determination rule verification computer 107 then deletes the temporary failure classification tree that was current until then, together with its failure cause determination rule table and failure node table.

(Step 710)
The failure cause determination rule verification computer 107 creates or updates an event block from the received event and updates the temporary failure classification tree. At the same time, the failure cause determination rule verification computer 107 also updates the failure cause determination rule table and the failure node table corresponding to this temporary failure classification tree.
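The branching in steps 706 to 710 can be sketched as a single event handler. This is a hedged illustration: the four callables are hypothetical stand-ins for steps 707 to 710, the steps are run in their numbered order, and what follows step 709 is not stated in this passage, so the handler simply returns.

```python
from datetime import datetime, timedelta

def handle_event(te, state, verify, next_boundary_time, rebuild, update):
    """Route one received event through steps 706 to 710.

    `state` holds the current `tvs` and `delta`; the callables are assumptions.
    """
    if te <= state["tvs"]:
        update(te)                                  # step 710: update the temporary tree
        return "710"
    verify()                                        # step 707: verify, update the rule DB
    state["tvs"] += next_boundary_time()            # step 708: tvs <- tvs + tb'
    rebuild(state["tvs"] - state["delta"], te)      # step 709: retrain on (tvs - delta, te)
    return "707-709"
```

Passing the steps in as callables keeps the routing logic independent of how the classification tree itself is built, which the passage leaves to the method used for the registered failure classification tree.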

(Detailed operation of the verification process)
Next, the detailed operation of the verification and update process shown in FIG. 7-2 is described.
(Step 750)
The failure cause determination rule verification computer 107 compares the temporary failure classification tree with the registered failure classification tree 300 in the failure cause determination rule DB 106. That is, it compares the temporarily generated classification tree with the classification tree in operation. If the structures of the two classification trees match, the failure cause determination rule verification computer 107 executes step 760. If they do not match, the failure cause determination rule verification computer 107 executes step 770. The failure nodes of the registered failure classification tree and the temporary failure classification tree are associated as follows. The failure event blocks used as training data when the registered failure classification tree was created are classified with both the registered failure classification tree and the temporary failure classification tree. The failure node of the registered failure classification tree and the failure node of the temporary failure classification tree into which the same event block is classified are treated as corresponding failure nodes. This makes it possible to determine which failure node of the temporary failure classification tree corresponds to each failure node of the registered failure classification tree.
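The node association described in step 750 amounts to classifying the same event blocks with both trees and pairing the resulting nodes. A minimal sketch, assuming each tree is available as a hypothetical function that maps an event block to a failure node name:

```python
from collections import defaultdict

def node_correspondence(event_blocks, classify_registered, classify_temporary):
    """Map each registered failure node to the temporary nodes that receive the same blocks."""
    mapping = defaultdict(set)
    for block in event_blocks:
        mapping[classify_registered(block)].add(classify_temporary(block))
    return dict(mapping)

# Hypothetical classifiers, given here as lookup tables.
registered = {"b1": "1-1", "b2": "1-1", "b3": "2"}.get
temporary = {"b1": "1-1'", "b2": "1-1'", "b3": "2'"}.get
pairs = node_correspondence(["b1", "b2", "b3"], registered, temporary)
```

A registered node that maps to exactly one temporary node, which in turn is reached from no other registered node, is a one-to-one correspondence in this sketch.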

(Step 760)
The failure cause determination rule verification computer 107 determines whether a new failure that is not in the corresponding failure node of the registered failure classification tree has been classified into the verification failure node of the temporary failure classification tree. If no failure exists only in the temporary failure classification tree, the failure cause determination rule verification computer 107 executes step 761. If a failure exists only in the temporary failure classification tree, the failure cause determination rule verification computer 107 executes step 762.

(Step 761)
The failure cause determination rule verification computer 107 updates the failure cause determination rule DB 106. Specifically, it deletes the failure cause determination rule corresponding to the verification failure node from the failure cause determination rule table 310. The temporary failure classification tree used here was created after the boundary time tb of the time interval had elapsed since the registered failure classification tree was created. Therefore, if no failure classified into the same failure node has occurred, it can be judged that no failure classified into this failure node will occur in the future. For this reason, the failure cause determination rule set for the verification failure node can be deleted without causing a problem.

(Step 762)
The failure cause determination rule verification computer 107 generates a failure cause determination rule for the verification failure node of the temporary failure classification tree by the method described in (2) and (3) of step 501. The failure cause determination rule verification computer 107 compares the rule generated for the verification failure node of the temporary failure classification tree with the failure cause determination rule in the failure cause determination rule table 310 and, if the two do not match, executes step 763.

(Step 763)
The failure cause determination rule verification computer 107 updates the failure cause determination rule DB 106. Specifically, it replaces the failure cause determination rule set for the verification failure node with the failure cause determination rule created in step 762.

FIGS. 8-1 and 8-2 illustrate the failure cause determination rule update process. (1) of FIG. 8-1 shows a registered failure classification tree 800 created from events that occurred from January to March 2009; the failure cause determination rule 820 shown in (A-1) of FIG. 8-2 is set for "failure node 1-1" 801, the verification target. (2) of FIG. 8-1 shows a temporary failure classification tree 810 created from events that occurred from October to December 2009; the failure cause determination rule 830 shown in (B-1) of FIG. 8-2 was generated for its "failure node 1-1'" 811. Here, the structures of the registered failure classification tree 800 and the temporary failure classification tree 810 match. Further, assume that all event blocks created from the January to March 2009 events and classified into "failure node 1-1" are classified by the temporary failure classification tree 810, and that "failure node 1-1" 801 and "failure node 1-1'" 811 correspond to each other.

Comparing the failure cause determination rule 820 of "failure node 1-1" 801 with the failure cause determination rule 830 of "failure node 1-1'" 811 shows that the value of the attribute "computer" of the second determination event differs: it is "server8" 821 for "failure node 1-1" but "server25" for "failure node 1-1'". In this case, the failure cause determination rule 820 of "failure node 1-1" 801 registered in the failure cause determination rule table 310 is replaced with the failure cause determination rule 830 of "failure node 1-1'" 811. The failure node table 320 is updated accordingly.
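The decisions in steps 760 to 763 can be condensed into one function. This is a sketch under assumptions: a rule is modeled as a plain dictionary, `generate_rule` is a hypothetical stand-in for the rule generation of step 501, and the "keep" outcome for matching rules is inferred, since the text only describes the mismatch case.

```python
def verify_matching_node(new_failures, current_rule, generate_rule):
    """Return the action for a verification failure node when both trees match."""
    if not new_failures:                      # step 760 -> step 761: no new failure
        return ("delete", None)
    candidate = generate_rule(new_failures)   # step 762: regenerate the rule
    if candidate != current_rule:
        return ("replace", candidate)         # step 763: overwrite the rule in operation
    return ("keep", current_rule)             # assumption: matching rules stay as they are

# Mirrors the FIG. 8-2 example: the "computer" attribute changed from server8 to server25.
rule_820 = {"computer": "server8"}
action = verify_matching_node(["failure"], rule_820, lambda failures: {"computer": "server25"})
```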

(Step 770)
The failure cause determination rule verification computer 107 obtains the difference between the temporary failure classification tree and the registered failure classification tree. If the temporary failure classification tree contains a failure node that is not in the registered failure classification tree 300, the failure cause determination rule verification computer 107 executes step 771. If a failure node that existed in the registered failure classification tree 300 does not exist in the temporary failure classification tree, the failure cause determination rule verification computer 107 executes step 772. If a plurality of failure nodes that existed in the registered failure classification tree 300 are merged into one failure node in the temporary failure classification tree, the failure cause determination rule verification computer 107 executes step 773. If one failure node of the registered failure classification tree 300 is divided into a plurality of failure nodes in the temporary failure classification tree, the failure cause determination rule verification computer 107 executes step 774.
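The four cases of step 770 can be derived from a mapping of registered failure nodes to temporary failure nodes. The sketch below assumes such a mapping is already available, with an empty set for a registered node that has no counterpart; the function name, the result keys, and the sample node names are hypothetical.

```python
def classify_differences(correspondence, temporary_nodes):
    """Group tree differences by the step (771 to 774) that handles them."""
    sources = {}
    for registered, temps in correspondence.items():
        for temp in temps:
            sources.setdefault(temp, set()).add(registered)
    return {
        "771_added": sorted(set(temporary_nodes) - set(sources)),
        "772_removed": sorted(r for r, temps in correspondence.items() if not temps),
        "773_merged": {t: sorted(rs) for t, rs in sources.items() if len(rs) > 1},
        "774_split": {r: sorted(ts) for r, ts in correspondence.items() if len(ts) > 1},
    }

correspondence = {"1-1": {"1'"}, "1-2": {"1'"}, "2": set(), "3": {"3-1'", "3-2'"}}
diff = classify_differences(correspondence, ["1'", "3-1'", "3-2'", "4'"])
```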

(Step 771)
The failure cause determination rule verification computer 107 updates the failure cause determination rule DB 106. Specifically, it creates a failure cause determination rule for the new failure node added to the temporary failure classification tree by the procedure of (2) and (3) of step 501, and adds the created rule to the failure cause determination rule table 310.

FIG. 9 illustrates the failure cause determination rule addition process. (1) of FIG. 9 shows a registered failure classification tree 900 created from events that occurred from January to March 2009, and (2) of FIG. 9 shows a temporary failure classification tree 910 created from events that occurred from October to December 2009. Comparing the two trees shows that "failure node 3'" 911, which does not exist in the registered failure classification tree 900, exists in the temporary failure classification tree 910.

At this time, the failure cause determination rule verification computer 107 creates a failure cause determination rule for "failure node 3'" 911 and registers it in the failure cause determination rule table 310. In connection with this registration, the failure cause determination rule verification computer 107 also updates the registered failure classification tree 300 and the failure node table 320.

(Step 772)
The failure cause determination rule verification computer 107 updates the failure cause determination rule DB 106. Specifically, it deletes from the failure cause determination rule table 310 the failure cause determination rule 312 that is set for a verification failure node 311 of the registered failure classification tree 300 that does not exist in the temporary failure classification tree.

FIG. 10 illustrates the failure cause determination rule deletion process. (1) of FIG. 10 shows a registered failure classification tree 1000 created from events that occurred from January to March 2009, and (2) of FIG. 10 shows a temporary failure classification tree 1010 created from events that occurred from October to December 2009. Comparing the two trees shows that the failure node 1001 that existed in the registered failure classification tree 1000 no longer exists in the temporary failure classification tree 1010.

At this time, the failure cause determination rule verification computer 107 deletes the failure cause determination rule that was set for "failure node 2" 1001 from the failure cause determination rule table 310. In connection with this deletion, the failure cause determination rule verification computer 107 also updates the registered failure classification tree 300 and the failure node table 320.

(Step 773)
The failure cause determination rule verification computer 107 updates the failure cause determination rule DB 106. Specifically, it deletes from the failure cause determination rule table 310 the failure cause determination rules 312 that were set for the plurality of failure nodes of the registered failure classification tree 300 that are being merged. Further, the failure cause determination rule verification computer 107 creates a failure cause determination rule for the merged failure node of the temporary failure classification tree by the procedure of (2) and (3) of step 501 and adds it to the failure cause determination rule table 310. However, for the recovery procedure manual of the rule created for the failure node of the temporary failure classification tree, it reuses the proven recovery procedure 315 of the failure cause determination rule 312 that was set for the failure node 311 of the registered failure classification tree 300.

FIGS. 11-1 and 11-2 illustrate the failure cause determination rule integration process. (1) of FIG. 11-1 shows a registered failure classification tree 1100 created from events that occurred from January to March 2009, and (2) of FIG. 11-1 shows a temporary failure classification tree 1110 created from events that occurred from October to December 2009. Of the event blocks created from the January to March 2009 events, 10 are classified into "failure node 1-1" 1101 and 5 into "failure node 1-2" 1102. When these 15 event blocks are classified with the temporary failure classification tree 1110, all 15 are classified into "failure node 1'" 1111. That is, "failure node 1-1" and "failure node 1-2" have been integrated into "failure node 1'". The failure cause determination rules set for "failure node 1-1" 1101 and "failure node 1-2" 1102 correspond to 1120 in (A-1) of FIG. 11-2 and 1021 in (A-2) of FIG. 11-2, respectively.

The failure cause determination rule verification computer 107 also creates the failure cause determination rule 1130 of "failure node 1'" 1111 from the events that occurred from October to December 2009. However, for the recovery procedure manual 1031 of the failure cause determination rule 1030, it adopts the recovery procedure manual 1021 of the failure cause determination rule 1020 corresponding to "failure node 1-1", into which more event blocks were classified and which has a proven record.

At this time, the failure cause determination rules 1120 and 1130 corresponding to "failure node 1-1" 1101 and "failure node 1-2" 1102 are deleted from the failure cause determination rule table 310, and the newly created failure cause determination rule 1030 of "failure node 1'" is added. The registered failure classification tree 300 and the event block table 320 are updated accordingly.
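When nodes are merged, the recovery procedure manual is inherited from the registered failure node that held the most event blocks (10 versus 5 in the FIG. 11-1 example). A minimal sketch of that choice, in which the function name and the manual identifiers are hypothetical:

```python
def inherited_recovery_procedure(block_counts, recovery_procedures):
    """Pick the recovery procedure of the registered node that held the most event blocks."""
    dominant = max(block_counts, key=block_counts.get)
    return recovery_procedures[dominant]

# Block counts follow the FIG. 11-1 example; the manual names are made up.
chosen = inherited_recovery_procedure(
    {"1-1": 10, "1-2": 5},
    {"1-1": "manual_for_1-1", "1-2": "manual_for_1-2"},
)
```

The text does not say how a tie between registered nodes would be resolved; `max` here simply returns the first of the tied nodes.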

(Step 774)
The failure cause determination rule verification computer 107 updates the failure cause determination rule DB 106. Specifically, it deletes from the failure cause determination rule table 310 the failure cause determination rule 312 that was set for the failure node of the registered failure classification tree 300 that is being divided. Further, the failure cause determination rule verification computer 107 creates failure cause determination rules for the plurality of divided failure nodes of the temporary failure classification tree by the procedure of (2) and (3) of step 501 and adds them to the failure cause determination rule table 310. However, for the recovery procedure manual of a rule created for a failure node of the temporary failure classification tree, it reuses the proven recovery procedure manual 315 of the failure cause determination rule 312 that was set for the failure node 311 of the registered failure classification tree 300.

FIGS. 12-1 and 12-2 illustrate the failure cause determination rule division process. (1) of FIG. 12-1 shows a registered failure classification tree 1200 created from events that occurred from January to March 2009, and (2) of FIG. 12-1 shows a temporary failure classification tree 1210 created from events that occurred from October to December 2009. Of the event blocks created from the January to March 2009 events, 10 are classified into "failure node 1-1" 1201. In the temporary failure classification tree 1210, 6 of these 10 event blocks are classified into "failure node 1-1-1'" 1211 and 4 into "failure node 1-1-2'" 1212. The failure cause determination rule set for "failure node 1-1" 1201 is 1220 in (A-1) of FIG. 12-2, and the failure cause determination rules of "failure node 1-1-1'" 1211 and "failure node 1-1-2'" 1212, generated from the events that occurred from October to December 2009, are 1221 in (B-1) of FIG. 12-2 and 1231 in (B-2) of FIG. 12-2, respectively. However, the proven failure recovery procedure manual specified by the attribute "recovery procedure manual" of the failure cause determination rule 1220 of "failure node 1-1" 1201 is assigned to the attribute "recovery procedure manual" of the failure cause determination rule 1221 of "failure node 1-1-1'" 1211, into which more of the event blocks classified into "failure node 1-1" 1201 were classified.

At this time, the failure cause determination rule verification computer 107 deletes the failure cause determination rule 1220 corresponding to "failure node 1-1" 1201 from the failure cause determination rule table 310 and adds the failure cause determination rules 1221 and 1231 corresponding to "failure node 1-1-1'" 1211 and "failure node 1-1-2'" 1212. The registered failure classification tree 300 and the event block table 320 are updated accordingly.
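For a divided node, the proven recovery procedure manual goes to the split node that received the larger share of the original event blocks (6 versus 4 in the FIG. 12-1 example). The following sketch assumes rules are plain dictionaries with a hypothetical `recovery_procedure` key; how the remaining split node obtains its manual is not described in this passage and is left untouched.

```python
def assign_proven_procedure(split_counts, proven_procedure, generated_rules):
    """Copy the proven recovery procedure into the rule of the largest split node."""
    heir = max(split_counts, key=split_counts.get)
    updated = {node: dict(rule) for node, rule in generated_rules.items()}
    updated[heir]["recovery_procedure"] = proven_procedure
    return updated

# Block counts follow the FIG. 12-1 example; the manual name is made up.
generated = {"1-1-1'": {"recovery_procedure": None}, "1-1-2'": {"recovery_procedure": None}}
rules = assign_proven_procedure({"1-1-1'": 6, "1-1-2'": 4}, "manual_for_1-1", generated)
```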

101 ... Monitoring target server group
102 ... Monitoring server
103 ... Log database (DB)
104 ... Failure cause determination rule generation computer
105 ... Failure cause analysis computer
106 ... Failure cause determination rule DB
107 ... Failure cause determination rule verification computer
108 ... Recovery procedure manual database (DB)
109 ... Recovery procedure manual browsing computer

Claims

A failure cause determination rule verification device comprising:
a first processing unit that acquires events generated by a monitoring server based on the status of a monitoring target server group when a system failure occurs;
a second processing unit that classifies the events that occurred within a preset time window by failure and generates a temporary failure classification tree;
a third processing unit that compares a registered failure classification tree corresponding to the failure cause determination rule in operation with the temporary failure classification tree; and
a fourth processing unit that updates the failure cause determination rule in operation based on a difference between the registered failure classification tree and the temporary failure classification tree.

The failure cause determination rule verification device according to claim 1, wherein the second processing unit calculates the time window for creating the temporary failure classification tree, based on statistics of the occurrence intervals of failures classified into the failure node to be verified in the registered failure classification tree, as the boundary time of a time interval such that, if no failure classified into the same failure node occurs within that time, it can be judged that no similar failure will occur in the future.

The failure cause determination rule verification device according to claim 1, wherein the third processing unit compares the registered failure classification tree with the temporary failure classification tree by classifying the event blocks used as training data for the registered failure classification tree with each of the registered failure classification tree and the temporary failure classification tree, and comparing the failure node of the registered failure classification tree and the failure node of the temporary failure classification tree into which the same event block is classified.

The failure cause determination rule verification device according to claim 1, wherein the fourth processing unit further has a function of updating the failure cause determination rule in operation based on a difference between a failure cause determination rule generated from a failure node of the temporary failure classification tree and the failure cause determination rule in operation, and, when a failure cause determination rule is created for the failure node of the temporary failure classification tree that corresponds to the failure node of the registered failure classification tree used to set the failure cause determination rule in operation and a difference is detected, replaces the failure cause determination rule in operation with the failure cause determination rule created for the failure node of the temporary failure classification tree.

The failure cause determination rule verification device according to claim 1, wherein, when the temporary failure classification tree contains a failure node that does not exist in the registered failure classification tree corresponding to the failure cause determination rule in operation, the fourth processing unit creates a failure cause determination rule for the failure node existing only in the temporary failure classification tree and additionally registers it in the failure cause determination rules in operation.

The failure cause determination rule verification device according to claim 1, wherein, when no failure classified into the failure node to be verified occurs even after the elapse of the boundary time of a time interval, calculated from statistics of the occurrence intervals of failures classified into the failure node to be verified in the registered failure classification tree, such that if no failure classified into the same failure node occurs within that time it can be judged that no similar failure will occur in the future, the fourth processing unit deletes the failure cause determination rule set for that failure node from the failure cause determination rules in operation.

The failure cause determination rule verification device according to claim 1, wherein, when a failure node that existed in the registered failure classification tree corresponding to the failure cause determination rule in operation does not exist in the temporary failure classification tree, the fourth processing unit deletes the failure cause determination rule set for the failure node existing only in the registered failure classification tree from the failure cause determination rules in operation.

The fourth processing unit includes:
If failures that are classified into a plurality of failure nodes in the registered failure classification tree corresponding to the failure cause determination rule in operation are classified into one failure node in the temporary failure classification tree,
The failure cause determination rule verification device according to claim 1, wherein the failure cause determination rules set for the plurality of failure nodes in the registered failure classification tree are deleted from the failure cause determination rule in operation, and the failure cause determination rule created for the one failure node in the temporary failure classification tree is additionally registered in the failure cause determination rule in operation.

The fourth processing unit includes:
If failures that are classified into one failure node in the registered failure classification tree corresponding to the failure cause determination rule in operation are classified into a plurality of failure nodes in the temporary failure classification tree,
The failure cause determination rule verification device according to claim 1, wherein the failure cause determination rule set for the one failure node in the registered failure classification tree is deleted from the failure cause determination rule in operation, and the failure cause determination rules created for each of the corresponding failure nodes in the temporary failure classification tree are additionally registered in the failure cause determination rule in operation.
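
One way to detect the merged and divided failure nodes described in the claims is to check which failures the nodes of the two trees share. The sketch below assumes each tree is a mapping from a node identifier to the set of failure identifiers classified into that node; this representation and the name `detect_merges_and_splits` are hypothetical.

```python
def detect_merges_and_splits(registered, temporary):
    """Find nodes that were merged or divided between the two trees.

    Assumption: both arguments map a node id to the set of failure ids
    classified into that node.
    """
    merges = {}  # temporary node -> registered nodes merged into it
    for t_id, t_failures in temporary.items():
        sources = sorted(r_id for r_id, r_failures in registered.items()
                         if r_failures & t_failures)
        if len(sources) > 1:
            merges[t_id] = sources

    splits = {}  # registered node -> temporary nodes it was divided into
    for r_id, r_failures in registered.items():
        targets = sorted(t_id for t_id, t_failures in temporary.items()
                         if r_failures & t_failures)
        if len(targets) > 1:
            splits[r_id] = targets

    return merges, splits
```

For a merge, the rules of the listed registered nodes would be deleted and one rule added for the temporary node; for a split, the single registered rule would be deleted and one rule added per temporary node.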

A computer program for causing a computer that functions as the failure cause determination rule verification device to execute:
A first process for acquiring an event generated by a monitoring server based on a status of a server group to be monitored when a system failure occurs;
A second process for classifying events that occurred within a preset time window by failure and generating a temporary failure classification tree;
A third process for comparing the registered failure classification tree corresponding to the failure cause determination rule in operation and the temporary failure classification tree; and
A fourth process for updating the failure cause determination rule in operation based on a difference between the registered failure classification tree and the temporary failure classification tree.
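
The four processes can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: events are `(timestamp, event_type)` pairs, a failure node is reduced to the set of event types seen in one time window, the tree is flattened to a set of such nodes, and the names `group_by_time_window` and `update_rules` and the placeholder recovery text are hypothetical.

```python
def group_by_time_window(events, window):
    """Second process (sketch): group events whose timestamps fall within
    `window` of the first event of the group."""
    groups, current, start = [], [], None
    for ts, event_type in sorted(events):
        if start is None or ts - start > window:
            if current:
                groups.append(frozenset(current))
            current, start = [], ts
        current.append(event_type)
    if current:
        groups.append(frozenset(current))
    return groups


def update_rules(rules, events, window):
    """Third and fourth processes (sketch): compare the node sets and update.

    `rules` maps a failure node (frozenset of event types) to a recovery
    procedure; `events` stands in for the output of the first process.
    """
    temporary = set(group_by_time_window(events, window))
    registered = set(rules)
    # keep rules whose node appears in both trees
    updated = {node: rules[node] for node in registered & temporary}
    # add placeholder rules for nodes that exist only in the temporary tree
    for node in temporary - registered:
        updated[node] = "recovery procedure to be defined"
    return updated
```

In this sketch, rules whose node is absent from the temporary tree are simply dropped, which corresponds to only one of the update policies in the dependent claims.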