JP5514643B2

JP5514643B2 - Failure cause determination rule change detection device and program

Info

Publication number: JP5514643B2
Application number: JP2010140846A
Authority: JP
Inventors: 宏至小林
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2010-06-21
Filing date: 2010-06-21
Publication date: 2014-06-04
Anticipated expiration: 2030-06-21
Also published as: JP2012003713A

Description

The present invention relates to a device that can detect changes in the failure cause analysis rules used in a failure cause analysis system, for example on the basis of past event logs, and to a program that implements the device in software.

System failures that seriously affect corporate management and society occur frequently. Behind this lies the growing complexity of integration with external IT services. As a result, system failures propagate over a wider range, and a failure at a single point can have a major impact on society as a whole. A prompt and appropriate initial response is important for preventing the damage caused by such failures from spreading.

For this reason, a failure cause analysis system has been proposed that supports the initial response by detecting failures and presenting an appropriate failure recovery procedure manual (Patent Document 1). In this system, failure cause determination rules are registered in advance, each associating the event pattern that the monitoring system generates when a specific system failure occurs with the recovery procedure manual for that failure. By matching these rules against the event pattern (stream) generated by the monitoring system, the system detects failures and provides the corresponding recovery procedure manual.

However, it is difficult for users to write failure cause determination rules themselves. Methods have therefore been proposed for generating failure cause determination rules automatically from the event logs produced by the monitoring system (Patent Documents 2 and 3). These methods basically identify characteristic events among the events that occur during a system failure and generate rules by analyzing the behavior of those events. Patent Document 2 discloses a method that uses the occurrence frequency of specific events, and Patent Document 3 describes a method that uses event occurrence patterns.

Patent Document 1: International Publication No. 2004/061681
Patent Document 2: JP 2008-41041 A
Patent Document 3: JP 2006-4346 A

Non-Patent Document 1: Fisher, Douglas H., "Knowledge acquisition via incremental conceptual clustering", Machine Learning 2, 139-172, 1987

During operation of an IT service, however, changes can occur that invalidate failure cause determination rules once they have been created. When such a change occurs, the affected rules should be corrected, if possible before a related failure occurs. Such changes include the following cases.

(1) Deletion of an IT service
When an IT service that is no longer needed is deleted, system failures related to that IT service no longer occur. The failure cause determination rules related to those system failures therefore become invalid.

(2) Change in IT infrastructure configuration
When the configuration of the IT infrastructure is changed, for example through a change in network configuration or hardware, failure cause determination rules that contain attribute values of events related to the change become invalid.

Such changes are reflected in failures that occur repeatedly, so a change can be detected by learning the events generated at those times. However, the time required to detect a change depends on the failure. A frequently occurring failure allows detection in a short time, whereas a rarely occurring failure takes longer to detect.

If a change detected through a frequent failure also affects the failure cause determination rule for an infrequent failure, then ideally not only the rule for the frequent failure but also the rule for the affected infrequent failure should be corrected, before that infrequent failure occurs.

However, conventional techniques for automatically generating failure cause determination rules provide no method for detecting the two kinds of change described above and correcting all of the related failure cause determination rules.

The present inventor therefore provides a mechanism for updating failure cause determination rules according to operating conditions. Specifically, the mechanism includes: a process that, when a system failure occurs, acquires the events generated by the monitoring server based on the state of the monitoring target server group, creates an event block, and updates a set of temporary failure classification tree objects using the created event log as training data; a process that selects the temporary failure classification tree object with the greatest weight from the set and compares it with the registered failure classification tree object; a process that, when the two do not match, predicts a change from the difference between them, creates a temporary failure classification tree object based on the prediction, and adds it to the set of temporary failure classification tree objects; and a process that replaces the registered failure classification tree object with the selected temporary failure classification tree object.

According to the present invention, changes affecting failure cause determination rules can be detected. In addition, failure cause determination rules corresponding to failures that have not yet been observed can be updated predictively when a related failure occurs.

A diagram showing a system configuration example of the failure cause analysis system.
A diagram explaining a specific example of the event log held by the log DB.
A diagram showing a specific example of the registered failure classification tree object held by the failure cause determination rule DB.
A diagram showing a specific example of the failure classification tree and the failure cause determination rule table that make up the registered failure classification tree object.
A diagram showing a specific example of the failure node table that makes up the registered failure classification tree object.
A diagram showing a system configuration example of the failure cause determination rule change detection computer.
A diagram showing a screen example of the failure cause determination rule change detection program.
A flowchart outlining the failure cause analysis process.
A flowchart outlining the failure cause determination rule generation process.
A diagram explaining a specific example of event blocks.
A diagram showing an example of the event block feature table.
A flowchart showing the execution procedure of the failure cause determination rule change detection process.
A diagram explaining the set of temporary failure classification tree objects generated by the failure cause determination rule change detection device.
A diagram explaining detection of the disappearance of a failure node from the failure classification tree.
A diagram explaining the failure cause determination rule corresponding to a failure node that has disappeared.
A diagram explaining detection of a change in an attribute value of a failure cause determination rule.
A diagram explaining the change in attribute value in a failure cause determination rule for which an attribute value change was detected.

Embodiments of the present invention are described below with reference to the drawings. The device configurations and processing operations described below are examples for explaining the invention. The present invention also covers any combination of the device configurations or processing operations described below, combinations that add known techniques to them, and combinations that replace part of them with known techniques.

(System configuration of the failure cause analysis system)
FIG. 1 shows a configuration example of a failure cause analysis system that incorporates the failure cause determination rule change detection computer 107. The system shown in FIG. 1 includes a monitoring target server group 101, a monitoring server 102, a log database (DB) 103, a failure cause determination rule generation computer 104, a failure cause analysis computer 105, a failure cause determination rule DB 106, the failure cause determination rule change detection computer 107, a recovery procedure manual database (DB) 108, and a recovery procedure manual browsing computer 109.

Of these, the monitoring server 102 monitors the state of the monitoring target server group 101 (for example, whether each server is alive) and generates events according to that state. The events generated by the monitoring server 102 are stored in the log database (DB) 103. The failure cause determination rule generation computer 104 reads the event log from the log DB 103 and generates failure cause determination rules, which are stored in the failure cause determination rule DB 106. The failure cause analysis computer 105 analyzes events on the basis of the failure cause determination rules stored in the failure cause determination rule DB 106 and identifies the recovery procedure manual for a failure.

The failure cause determination rule change detection computer 107 analyzes the events generated by the monitoring server 102 and detects changes affecting the failure cause determination rules stored in the failure cause determination rule DB 106. Detection here includes predictive detection.

The recovery procedure manual database (DB) 108 stores documents on recovery procedures for failures. These documents include not only manuals (for hardware or software) that describe troubleshooting when a failure occurs, but also maintenance staff records of responses to past failures, reports, and other documents on procedures for recovering from failures. The recovery procedure manual browsing computer 109 displays on its screen the recovery procedure manual identified by the failure cause analysis computer 105.

(Specific example of the event table)
FIG. 2 shows a specific example of the event table 200 stored in the log DB 103. The event table 200 consists of an identifier (ID) 201 that uniquely identifies an event, an occurrence date and time 202 that specifies when the event occurred, and an event 203, which is the set of attribute values of the individual event. In this embodiment, the attributes of the event 203 are <type>, <source>, <event number>, <user>, and <computer>. <type> indicates the importance of the event. <source> indicates the origin of the event, such as the process or application that generated it. <event number> is a number that identifies the content of the event. <user> indicates the user who was running the process or application that generated the event. <computer> indicates the server in the monitoring target server group 101 on which the event originated.
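As an illustration only, one row of the event table described above could be modeled as follows. The class name, field names, and sample values are assumptions made for this sketch; they are not taken from FIG. 2.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One row of the event table (field names are illustrative)."""
    event_id: int            # identifier (ID) 201
    occurred_at: datetime    # occurrence date and time 202
    type: str                # <type>: importance of the event
    source: str              # <source>: process or application that raised it
    event_number: str        # <event number>: identifies the event content
    user: str                # <user>: user running the source process
    computer: str            # <computer>: server in the monitored group

    def attributes(self) -> tuple:
        """Return the attribute values of event 203 in a fixed order."""
        return (self.type, self.source, self.event_number,
                self.user, self.computer)


# Made-up sample row, not taken from FIG. 2.
sample = Event(1, datetime(2010, 6, 21, 9, 0, 0),
               "error", "process71", "80", "user2", "server9")
```

Keeping the five attributes in a fixed order makes it straightforward to compare an event against a rule pattern position by position.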

(Specific example of the failure cause determination rule DB)
FIGS. 3-1 to 3-3 show a configuration example of the failure cause determination rule DB 106. The failure cause determination rule DB 106 stores the failure cause determination rules registered in the failure cause analysis computer 105, together with related information, as a registered failure classification tree object 300. As shown in FIG. 3-1, the registered failure classification tree object 300 consists of a failure classification tree 310, a failure cause determination rule table 320, and a failure node table 330.

The failure classification tree 310 is created when the failure cause determination rules registered in the failure cause analysis computer 105 are generated. In the failure classification tree 310, failures are classified according to the features shared by the sets of one or more events that occurred at the time of each failure (hereinafter, "event blocks"), and the classification is expressed as a tree. A node of the failure classification tree 310 is called a failure node. Failures classified into the same failure node have similar events and similar patterns of occurrence, and are therefore considered to have the same cause. FIG. 3-2 (1) shows a structural example of the failure classification tree 310.

The failure cause determination rule table 320 stores the failure cause determination rules registered in the failure cause analysis computer 105. As shown in FIG. 3-2 (2), it consists of the failure nodes 321 of the failure classification tree 310 and the failure cause determination rules 322 applied to failures classified into each failure node. A failure cause determination rule 322 consists of one or more determination events 323, a determination time 324, and a recovery procedure manual 325. A determination event 323 is a set of event attributes that characterizes the target failure node. The determination time 324 is the time interval within which all of the events specified by the determination events 323 occur. The recovery procedure manual 325 is the document displayed on the recovery procedure manual browsing computer 109 when the determination events 323 occur within the determination time 324.

When multiple determination events 323 are specified for one failure node 321, the determination events are assumed to appear in the order specified in the failure cause determination rule table 320. In FIG. 3-2 (2), the failure cause determination rule 322 for "failure node 1-1" specifies the following: if, after a determination event with the attribute values ("alert", "process71", "80", "user2", "server9") occurs, a determination event with the attribute values ("*", "process39", "*", "user4", "server8") occurs within the determination time of "2 minutes 9 seconds", then "recovery procedure A.doc" is displayed on the recovery procedure manual browsing computer 109. Here the attribute value "*" means that the value is indefinite, that is, the attribute may take any value.
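The ordered, time-limited matching described above can be sketched as follows. This is a minimal illustration, not the matching procedure of the failure cause analysis computer 105: the function names are hypothetical, the timestamps are made up, and the sketch assumes that the determination time is measured from the first matched determination event.

```python
from datetime import datetime, timedelta

WILDCARD = "*"  # indefinite attribute value: matches anything


def event_matches(pattern, attrs):
    """True if every non-wildcard value in pattern equals the event's value."""
    return len(pattern) == len(attrs) and all(
        p == WILDCARD or p == a for p, a in zip(pattern, attrs))


def rule_matches(patterns, window, events):
    """Check whether the patterns occur in order within the determination time.

    events is a list of (timestamp, attrs) pairs sorted by timestamp.
    Assumption: the window is measured from the first matched event.
    """
    if not patterns:
        return False
    for i, (start, attrs) in enumerate(events):
        if not event_matches(patterns[0], attrs):
            continue
        matched = 1
        for ts, later in events[i + 1:]:
            if matched == len(patterns) or ts - start > window:
                break
            if event_matches(patterns[matched], later):
                matched += 1
        if matched == len(patterns):
            return True
    return False


# Rule for "failure node 1-1" as described in the text.
patterns = [("alert", "process71", "80", "user2", "server9"),
            ("*", "process39", "*", "user4", "server8")]
window = timedelta(minutes=2, seconds=9)

t0 = datetime(2010, 6, 21, 9, 0, 0)  # made-up timestamps
first = ("alert", "process71", "80", "user2", "server9")
second = ("error", "process39", "12", "user4", "server8")

in_time = [(t0, first), (t0 + timedelta(seconds=60), second)]
too_late = [(t0, first), (t0 + timedelta(minutes=3), second)]
wrong_order = [(t0, second), (t0 + timedelta(seconds=60), first)]
```

With these inputs, `rule_matches(patterns, window, in_time)` succeeds, while the `too_late` and `wrong_order` streams do not match.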

The failure node table 330 stores the event blocks used as training data when the failure classification tree 310 was built. FIG. 3-3 shows an example of the failure node table 330. The failure node table 330 consists of the failure nodes 321, the event blocks 331 classified into each failure node, and the events 203 contained in each event block 331.
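For orientation, the three parts of the registered failure classification tree object 300 could be held in memory roughly as shown below. All class and field names are assumptions for this sketch, and the tree shape, block IDs, and event IDs are invented; only the rule for "failure node 1-1" follows the description above.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class FailureNode:
    """Node of the failure classification tree 310."""
    name: str
    children: list = field(default_factory=list)


@dataclass
class DeterminationRule:
    """Failure cause determination rule 322."""
    determination_events: list      # determination events 323 ("*" = any value)
    determination_time: timedelta   # determination time 324
    recovery_manual: str            # recovery procedure manual 325


@dataclass
class RegisteredTreeObject:
    """Registered failure classification tree object 300."""
    tree: FailureNode                                # tree 310
    rules: dict = field(default_factory=dict)        # table 320: node name -> rule
    node_blocks: dict = field(default_factory=dict)  # table 330: node name -> {block ID: [event IDs]}


# Hypothetical content; only the rule for "failure node 1-1" follows the text.
leaf = FailureNode("failure node 1-1")
registered = RegisteredTreeObject(
    tree=FailureNode("root", [leaf]),
    rules={leaf.name: DeterminationRule(
        [("alert", "process71", "80", "user2", "server9"),
         ("*", "process39", "*", "user4", "server8")],
        timedelta(minutes=2, seconds=9),
        "recovery procedure A.doc")},
    node_blocks={leaf.name: {1: [1, 2, 3], 4: [10, 11]}},
)
```

Keying both tables by failure node name mirrors the way tables 320 and 330 share the failure node 321 column.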

(Configuration example of the failure cause determination rule change detection computer)
FIG. 4 shows a configuration example of the failure cause determination rule change detection computer 107. The failure cause determination rule change detection computer 107 consists of a computer main body 400, an input device 430, a display device 431, and a communication device 432. The communication device 432 communicates with the monitoring server 102, the log DB 103, and the failure cause determination rule DB 106.

The computer main body 400 consists of a CPU 401 that performs data operations, a ROM 402, a RAM 410, a hard disk drive 420, a CPU bus 407 that transfers data between these devices, and interfaces 403 to 406 that connect the devices to the CPU bus 407.

The RAM 410 provides at least an execution area for the failure cause determination rule change detection program 411, which causes the CPU 401 to perform the processing, and a work area 412 that stores data generated temporarily during computation. The storage area of the hard disk drive 420 provides at least a program storage unit 421, which stores the failure cause determination rule change detection program, and a data storage unit 422, which temporarily stores data acquired from the monitoring server 102 and the failure cause determination rule DB 106.

FIG. 5 shows an example of the GUI screen displayed on the display device 431 connected to the failure cause determination rule change detection computer 107. The failure cause determination rule change detection program screen 500 consists of a minimum event block support rate input field 501, for entering the minimum event block support rate that a failure cause determination rule 322 (FIG. 3-2) must satisfy; a minimum valid attribute count input field 502, for entering the minimum number of valid attributes that the determination events 323 of a failure cause determination rule 322 must have; a time window width input field 503, for entering the time range of events used as training data when a temporary failure classification tree is created to detect changes; and a start button 505 that starts the failure cause determination rule change detection process.

The minimum event block support rate is applied to each failure node subject to detection. It is the proportion, among all event blocks classified into the failure node, of event blocks to which the generated failure cause determination rule can be applied. In the example of FIG. 5, a rule must be applicable to at least 60% of the event blocks in order to be detected as a failure cause determination rule. A valid attribute is an attribute whose value is something other than the indefinite value "*". In FIG. 5, the minimum number of valid attributes is 5.
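The two thresholds can be illustrated with a short sketch. Several points are assumptions rather than statements of the described method: valid attributes are counted across all determination events of a rule, a rule is treated as applicable to an event block when every determination event matches some event in the block (order and timing are ignored here), and all function names and sample blocks are hypothetical.

```python
WILDCARD = "*"


def valid_attribute_count(patterns):
    """Count attribute values other than "*" across all determination events."""
    return sum(1 for pattern in patterns
               for value in pattern if value != WILDCARD)


def pattern_in_block(pattern, block):
    """True if at least one event in the block matches the pattern."""
    return any(all(p == WILDCARD or p == a for p, a in zip(pattern, attrs))
               for attrs in block)


def block_support_rate(patterns, blocks):
    """Fraction of event blocks in which every determination event appears."""
    if not blocks:
        return 0.0
    hits = sum(1 for block in blocks
               if all(pattern_in_block(p, block) for p in patterns))
    return hits / len(blocks)


def passes_thresholds(patterns, blocks, min_support=0.6, min_valid=5):
    """Apply the FIG. 5 settings: support of at least 60%, 5 valid attributes."""
    return (block_support_rate(patterns, blocks) >= min_support
            and valid_attribute_count(patterns) >= min_valid)


patterns = [("alert", "process71", "80", "user2", "server9"),
            ("*", "process39", "*", "user4", "server8")]

# Made-up event blocks: each block is a list of attribute tuples.
hit = [("alert", "process71", "80", "user2", "server9"),
       ("error", "process39", "12", "user4", "server8")]
miss = [("alert", "process71", "80", "user2", "server9")]

mostly_hits = [hit, hit, hit, hit, miss]    # rule applies to 4 of 5 blocks
mostly_misses = [hit, hit, miss, miss, miss]  # rule applies to 2 of 5 blocks
```

Under these assumptions the rule passes for `mostly_hits` and is rejected for `mostly_misses`, because its support falls below the 60% setting.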

(Failure cause analysis operation)
FIG. 6 outlines the failure cause analysis process of the failure cause analysis system as a whole.

(Step 601)
The failure cause determination rule generation computer 104 acquires the event log from the log DB 103, generates failure cause determination rules 322, and stores them in the failure cause determination rule DB 106. The details of this processing are described later.

(Step 602)
When the failure cause analysis computer 105 detects that the failure cause determination rule generation computer 104 has updated the failure cause determination rule DB 106, it acquires the updated failure cause determination rules 322 of the failure cause determination rule table 320 from the failure cause determination rule DB 106. The acquired failure cause determination rules 322 are registered in the data storage unit 422.

(Step 603)
The monitoring server 102 monitors the monitoring target server group 101 at all times. When the monitoring server 102 finds an abnormality caused by a failure in a server in the monitoring target server group 101, it generates an event from the state of that server. The monitoring server 102 stores the generated event in the log DB 103 and also sends it to the failure cause analysis computer 105 and the failure cause determination rule change detection computer 107.

(Step 604)
The failure cause analysis computer 105 matches the received event against the failure cause determination rules 322. If the received event matches a failure cause determination rule 322, the failure cause analysis computer 105 acquires the recovery procedure manual 325 of that rule from the recovery procedure manual DB 108 and sends it to the recovery procedure manual browsing computer 109. If the received event matches no failure cause determination rule 322, the failure cause analysis computer 105 does nothing.

(Step 605)
The recovery procedure manual browsing computer 109 displays the recovery procedure manual 325 received from the failure cause analysis computer 105 on its display device.

(Step 606)
When the failure cause determination rule change detection computer 107 receives an event from the monitoring server 102, it creates a temporary failure classification tree (hereinafter, "temporary failure classification tree") from the set of events that appeared within the configured time window and compares it with the failure classification tree stored in the failure cause determination rule DB 106. The comparison covers not only the tree structure but also the failure node to which each event is assigned. If a difference is detected, the failure cause determination rule change detection computer 107 determines that a change affecting the failure cause determination rules has occurred and updates the registered failure classification tree object 300 in the failure cause determination rule DB 106. The details of this processing are described later.
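As a rough sketch of the two ideas in this step, the code below first restricts events to a time window and then compares a registered rule table with a temporary one to list failure nodes that have disappeared and nodes whose determination events differ. It is a deliberately simplified stand-in: the comparison described in this step works on the classification trees themselves, whereas this sketch assumes that nodes in the two tables can be paired by name. All function names, node names other than "failure node 1-1", and sample values are hypothetical.

```python
from datetime import datetime, timedelta


def events_in_window(events, now, window):
    """Keep (timestamp, attrs) pairs that fall within `window` before `now`."""
    return [(ts, attrs) for ts, attrs in events
            if timedelta(0) <= now - ts <= window]


def diff_rule_tables(registered, temporary):
    """Compare two {node name: determination events} tables.

    Returns (vanished, changed): nodes missing from the temporary table and
    nodes present in both whose determination events differ.
    """
    vanished = sorted(n for n in registered if n not in temporary)
    changed = sorted(n for n in registered
                     if n in temporary and registered[n] != temporary[n])
    return vanished, changed


now = datetime(2010, 6, 21, 12, 0, 0)  # made-up clock value
events = [(now - timedelta(days=40), ("error", "p1", "1", "u1", "server3")),
          (now - timedelta(days=2), ("error", "p1", "1", "u1", "server12"))]
recent = events_in_window(events, now, timedelta(days=30))

registered = {
    "failure node 1-1": (("alert", "process71", "80", "user2", "server9"),),
    "failure node 1-2": (("error", "p1", "1", "u1", "server3"),),
    "failure node 2": (("error", "p7", "5", "u3", "server5"),),
}
temporary = {
    "failure node 1-1": (("alert", "process71", "80", "user2", "server9"),),
    "failure node 1-2": (("error", "p1", "1", "u1", "server12"),),
}
vanished, changed = diff_rule_tables(registered, temporary)
```

Here "failure node 2" is reported as vanished and "failure node 1-2" as changed, loosely corresponding to the IT service deletion and infrastructure change cases discussed earlier.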

(Step 607)
When the failure cause analysis computer 105 detects that the failure cause determination rule change detection computer 107 has updated the failure cause determination rule DB 106, it acquires the failure cause determination rules 322 from the failure cause determination rule DB 106 and replaces the rules it is currently using with them. Through this replacement, the failure cause analysis computer 105 can always analyze events according to the latest failure cause determination rules.

(Failure cause determination rule generation operation)
FIG. 7 outlines the process by which the failure cause determination rule generation computer 104 generates failure cause determination rules.

(Step 701)
The failure cause determination rule generation computer 104 acquires events from the event table 200 of the log DB 103.

(Step 702)
The failure cause determination rule generation computer 104 creates an event block table 800 (FIG. 8), a table of event blocks in which the events acquired in step 701 are grouped by failure. Event blocks are created, for example, according to the following rule. If the difference between the occurrence date and time 202 of an event and the occurrence date and time 202 of the preceding event is within a given threshold, the event is assigned to the same event block as the preceding event. If the difference is at or above the threshold, the event is assigned to a new event block.
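The grouping rule above can be sketched in a few lines. The function name, the 60-second threshold, and the sample timestamps are assumptions for illustration. The description leaves the case of a gap exactly equal to the threshold ambiguous; this sketch keeps such an event in the current block.

```python
from datetime import datetime, timedelta


def build_event_blocks(events, threshold):
    """Group time-sorted (timestamp, payload) pairs into event blocks.

    An event joins the current block when the gap to the preceding event is
    at most `threshold`; otherwise it starts a new block.
    """
    blocks = []
    previous = None
    for ts, payload in events:
        if previous is not None and ts - previous <= threshold:
            blocks[-1].append((ts, payload))
        else:
            blocks.append([(ts, payload)])
        previous = ts
    return blocks


t0 = datetime(2010, 6, 21, 9, 0, 0)  # made-up timestamps
events = [(t0, "e1"),
          (t0 + timedelta(seconds=10), "e2"),
          (t0 + timedelta(seconds=20), "e3"),
          (t0 + timedelta(minutes=5), "e4"),
          (t0 + timedelta(minutes=5, seconds=5), "e5")]
blocks = build_event_blocks(events, timedelta(seconds=60))
```

With a 60-second threshold, the five sample events fall into two blocks: the first three events, then the last two.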

FIG. 8 shows a specific example of the event block table 800 created from the event table 200 of FIG. 2. The event block table 800 consists of an event block ID 801 that uniquely identifies an event block, the event ID 201 of each event contained in the event block, the occurrence date and time 202 of each event, and the event 203, which is the set of attribute values that make up the event.

(Step 703)
The failure cause determination rule generation computer 104 extracts the features of each event block from the event block table 800 created in step 702 and creates an event block feature table 900 (FIG. 9). A feature here means an attribute value that appears frequently in the set of events classified into an event block.

FIG. 9 shows a specific example of the event block feature table 900 created from the event block table 800 of FIG. 8. The event block feature table 900 consists of an event block ID 801 that identifies an event block and an attribute list 901 that holds the features of that event block. In this example, the attribute list 901 holds, for each attribute of the event 203, the most frequent attribute value and the second most frequent attribute value within the event block. In FIG. 9, for example, in the set of events belonging to the event block with event block ID "1", "error" is the most frequent value of the attribute "type" and "fatal" is the second most frequent.
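
A minimal Python sketch of this feature extraction is shown below. The function name, the dictionary layout of an event, and the sample values are assumptions made for illustration; the five attribute names follow the attributes listed for FIG. 2. Ties between equally frequent values are broken by first occurrence, which the specification does not address.

```python
from collections import Counter

# The five event attributes named in the specification (English renderings).
ATTRIBUTES = ("type", "source", "event_number", "user", "computer")

def extract_block_features(block_events, top_n=2):
    """Return, per attribute, the top_n most frequent values in one event block."""
    features = {}
    for attribute in ATTRIBUTES:
        counts = Counter(event[attribute] for event in block_events)
        features[attribute] = [value for value, _ in counts.most_common(top_n)]
    return features

# Hypothetical event block with three events.
block = [
    {"type": "error", "source": "process71", "event_number": 1001, "user": "N/A", "computer": "server8"},
    {"type": "error", "source": "process71", "event_number": 1001, "user": "N/A", "computer": "server8"},
    {"type": "fatal", "source": "process71", "event_number": 1002, "user": "N/A", "computer": "server8"},
]
features = extract_block_features(block)
```

An attribute with only one distinct value in the block yields a one-element list, so downstream code should not assume that a second value always exists.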

(Step 704)
The failure cause determination rule generation computer 104 clusters the event block feature table 900 created in step 703. Clustering creates the failure classification tree 310 of the registered failure classification tree object 300 (FIG. 3-1) and the failure node table 330 (FIG. 3-3), which assigns the event blocks of the event block table 800 to the failure nodes of the created failure classification tree 310.

Event blocks classified into the same failure node of the failure classification tree 310 show similar patterns of event occurrence at the time of failure. Such event blocks can therefore be regarded as failures arising from the same cause. The clustering algorithm is, for example, the conceptual clustering method COBWEB described in Non-Patent Document 1.

(Step 705)
The failure cause determination rule generation computer 104 finds events that appear frequently within each failure node 321 of the failure node table 330 (FIG. 3-3) created in step 704. A frequent event does not need to match on every attribute value; some attribute values may be the indefinite value "*". However, the minimum number of attribute values other than "*" that a frequent event must have is given in advance as the minimum number of valid attributes (FIG. 5).

The frequency of a frequent event is determined by the number of event blocks that contain it. The minimum number of event blocks in which a frequent event must appear is given in advance as the minimum support event block rate, expressed relative to the total number of event blocks in the failure node (FIG. 5).
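
One way to picture the two thresholds of step 705 is the brute-force sketch below. It is an assumption-laden illustration, not the search procedure of the specification: events are reduced to three-attribute tuples (type, source, computer) for brevity, every wildcard generalization of every event is enumerated, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

WILDCARD = "*"

def generalizations(event, min_valid_attributes):
    """Yield every pattern made by replacing attributes of `event` with '*',
    keeping at least `min_valid_attributes` concrete values."""
    n = len(event)
    for keep in range(min_valid_attributes, n + 1):
        for kept_positions in combinations(range(n), keep):
            yield tuple(event[i] if i in kept_positions else WILDCARD for i in range(n))

def frequent_events(blocks, min_support_rate, min_valid_attributes):
    """Return {pattern: block_count} for patterns found in enough event blocks."""
    support = Counter()
    for block in blocks:
        patterns_in_block = set()
        for event in block:
            patterns_in_block.update(generalizations(event, min_valid_attributes))
        support.update(patterns_in_block)  # each pattern counts once per block
    min_blocks = min_support_rate * len(blocks)
    return {p: c for p, c in support.items() if c >= min_blocks}

# Hypothetical failure node with three event blocks.
blocks = [
    [("error", "process71", "server8"), ("fatal", "process39", "server8")],
    [("error", "process71", "server9")],
    [("warning", "process10", "server8")],
]
result = frequent_events(blocks, min_support_rate=0.6, min_valid_attributes=2)
```

With a minimum support rate of 0.6 a pattern must occur in at least two of the three blocks, and only the pattern with "error" and "process71" fixed and the computer left as "*" qualifies. Enumerating all generalizations grows exponentially with the number of attributes, which is tolerable only because events here have a handful of attributes.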

(Step 706)
The failure cause determination rule generation computer 104 discovers the pattern (order of appearance) in which the frequent events obtained in step 705 occur. It stores the discovered frequent events, in order of appearance, in the determination event 323 of the failure cause determination rule table 320. When more than one determination event 323 is found, the computer stores the maximum value of the time interval over which these patterns appear in the determination time 324 of the failure cause determination rule table 320.

(Step 707)
The failure cause determination rule generation computer 104 uses some of the attribute values of the determination event 323 of the failure cause determination rule 322 obtained in step 706 as a search key and retrieves, from the recovery procedure manual DB 108, the failure recovery procedure manual corresponding to the failure classified into the failure node. It stores the file name of the retrieved manual in the recovery procedure manual 325 of the failure cause determination rule table 320 of FIG. 3. This step completes the failure cause determination rule table 320. With the above steps, the registered failure classification tree object 300 is complete.

(Step 708)
The failure cause determination rule generation computer 104 registers the registered failure classification tree object 300 created through step 707 in the failure cause determination rule DB 106.

(Details of the operation for detecting changes in failure cause determination rules)
FIG. 10 shows an overview of the change detection process for failure cause determination rules executed by the failure cause determination rule change detection program 411.

(Step 1000)
The program starts when a click on the start button 505 is detected after a support rate has been entered in the minimum event block support rate input unit 501, a number of valid attributes in the minimum valid attribute number input unit 502, and a time in the time window width input unit 503 of the failure cause determination rule change detection program screen 500 (FIG. 5) displayed on the display device 431. The click is entered by a user operating the input device 430 of the failure cause determination rule change detection computer 107. The failure cause determination rule change detection program 411 runs on the failure cause determination rule change detection computer 107.

(Step 1001)
On detecting this operation input, the failure cause determination rule change detection program 411 reads the values entered in the minimum event block support rate input unit 501, the minimum valid attribute number input unit 502, and the time window width input unit 503, and stores them in the work area 412 of the RAM 410. The program then acquires the registered failure classification tree object 300 from the failure cause determination rule DB 106 via the communication device 432, saves it temporarily in the data storage unit 422 of the hard disk drive 420, and then stores it in the work area 412 of the RAM 410.

(Step 1002)
The failure cause determination rule change detection program 411 creates a temporary failure classification tree object set 1100 (FIG. 11). The temporary failure classification tree object set 1100 is a set of objects created temporarily in order to detect changes in the failure cause determination rules currently in use for failure analysis.

The failure cause determination rule change detection program 411 converts the registered failure classification tree object 300 acquired in step 1001 into a temporary failure classification tree object 1110 (FIG. 11) and stores it in the temporary failure classification tree object set 1100.

The conversion from the registered failure classification tree object 300 to the temporary failure classification tree object 1110 is performed by assigning the failure classification tree 310, the failure cause determination rule table 320, and the failure node table 330 of the registered failure classification tree object 300 to the failure classification tree 1120, the failure cause determination rule table 1140, and the failure node table 1150 of the temporary failure classification tree object 1110, respectively, and by setting the weight 1130 of the temporary failure classification tree object 1110 to "1".
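
The conversion can be sketched with two small data classes. The class and field names are hypothetical stand-ins for objects 300 and 1110, and the use of a deep copy (so that later updates to the temporary object cannot alter the registered one) is an assumption of this sketch rather than something the text states.

```python
import copy
from dataclasses import dataclass

@dataclass
class RegisteredTreeObject:       # stands in for object 300
    classification_tree: dict     # failure classification tree 310
    rule_table: list              # failure cause determination rule table 320
    node_table: dict              # failure node table 330

@dataclass
class TemporaryTreeObject:        # stands in for object 1110
    classification_tree: dict     # failure classification tree 1120
    rule_table: list              # failure cause determination rule table 1140
    node_table: dict              # failure node table 1150
    weight: float = 1.0           # weight 1130, initialised to 1

def to_temporary(registered):
    """Copy the three components and set the initial weight to 1."""
    return TemporaryTreeObject(
        classification_tree=copy.deepcopy(registered.classification_tree),
        rule_table=copy.deepcopy(registered.rule_table),
        node_table=copy.deepcopy(registered.node_table),
        weight=1.0,
    )

# Hypothetical contents, for illustration only.
registered = RegisteredTreeObject(
    classification_tree={"root": ["node1", "node2"]},
    rule_table=[{"node": "node1", "manual": "recovery1.txt"}],
    node_table={"node1": [1, 2], "node2": [3]},
)
temporary = to_temporary(registered)
```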

(Step 1003)
On receiving events from the monitoring server 102, the failure cause determination rule change detection program 411 creates an event block and extracts its features. The event block creation and feature extraction are the same as in steps 702 and 703 of FIG. 7. Events from the monitoring server 102 are acquired through the communication device 432.

(Step 1004)
The failure cause determination rule change detection program 411 updates every temporary failure classification tree object 1110 in the temporary failure classification tree object set 1100, using the features of the event block extracted in step 1003 as training data. The method of updating a temporary failure classification tree object 1110 is described in detail below.

(Update of the failure classification tree, failure cause determination rule table, and failure node table)
First, the failure cause determination rule change detection program 411 acquires the time window width entered in the time window width input unit 503 of the failure cause determination rule change detection program screen 500 of FIG. 5. Next, from the failure node table 1150 of the temporary failure classification tree object 1110, the program acquires the events that fall within the time window width measured from the current date and time. From the acquired events it creates the failure classification tree 1120, the failure cause determination rule table 1140, and the failure node table 1150, following steps 702 to 707 of FIG. 7. In doing so it applies the minimum event block support rate and the minimum number of valid attributes entered on the failure cause determination rule change detection program screen 500 of FIG. 5.
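
The time-window selection might look like the following sketch. It assumes that the window extends backwards from the current date and time and that both ends are inclusive; the function name, the event layout, and the 90-day width are illustrative.

```python
from datetime import datetime, timedelta

def events_in_window(events, now, window_width):
    """Keep events whose occurrence time lies in [now - window_width, now]."""
    start = now - window_width
    return [event for event in events if start <= event["occurred_at"] <= now]

# Hypothetical events from the failure node table.
now = datetime(2009, 12, 31, 0, 0)
events = [
    {"id": 1, "occurred_at": datetime(2009, 3, 1)},
    {"id": 2, "occurred_at": datetime(2009, 11, 15)},
    {"id": 3, "occurred_at": datetime(2009, 12, 30)},
]
recent = events_in_window(events, now, timedelta(days=90))
```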

(Weight update)
Next, the failure cause determination rule change detection program 411 calculates the weight 1130 used to select a temporary failure classification tree object, using the following two indices. At this point, any temporary failure classification tree object 1110 whose weight is less than or equal to a threshold given in advance is deleted.

(1) Category utility
The conceptual clustering method COBWEB builds a concept tree, which is a classification tree, so as to maximize the category utility defined by "Equation 3-3" on page 147 of Non-Patent Document 1. The failure cause determination rule change detection program 411 applies this calculation to the failure classification tree 1120 of the temporary failure classification tree object 1110 and uses the resulting value as the weight.
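
Equation 3-3 of Non-Patent Document 1 is not reproduced in this text. As a rough illustration, the sketch below computes the form of category utility commonly given for COBWEB with nominal attributes, CU = (1/K) Σ_k P(C_k) [Σ_i Σ_j P(A_i = V_ij | C_k)² − Σ_i Σ_j P(A_i = V_ij)²]. Whether this matches the cited equation term by term is an assumption, as are the function name and the two-attribute sample data.

```python
from collections import Counter

def category_utility(clusters):
    """Category utility of a flat partition of instances with nominal attributes.

    clusters: list of clusters; each cluster is a non-empty list of
    equal-length tuples of attribute values.
    """
    all_instances = [instance for cluster in clusters for instance in cluster]
    n_total = len(all_instances)
    n_attributes = len(all_instances[0])

    def sum_squared_probabilities(instances):
        total = 0.0
        for i in range(n_attributes):
            counts = Counter(instance[i] for instance in instances)
            total += sum((c / len(instances)) ** 2 for c in counts.values())
        return total

    baseline = sum_squared_probabilities(all_instances)
    gain = sum(
        len(cluster) / n_total * (sum_squared_probabilities(cluster) - baseline)
        for cluster in clusters
    )
    return gain / len(clusters)

# Hypothetical feature tuples (type, computer) split into two clean clusters.
clusters = [
    [("error", "server8"), ("error", "server8")],
    [("fatal", "server25"), ("fatal", "server25")],
]
cu_split = category_utility(clusters)
cu_merged = category_utility([clusters[0] + clusters[1]])
```

A partition that separates the two kinds of instance scores higher than lumping everything into one cluster, which is the behaviour the weight relies on.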

(2) Degree of conformity to the failure node
However, the category utility is an index that evaluates the failure classification tree 1120 of the temporary failure classification tree object 1110 as a whole, so it is slow to reflect the influence of the latest event block created in step 1003, which may carry a new change. With the category utility alone as the index, detection of the change is delayed.

Therefore, the degree of conformity between the latest event block and the failure node into which that event block was classified is also calculated. If, in the latest event block, the most frequent attribute value of the i-th attribute is $a_1$ and the second most frequent attribute value is $a_2$, the degree of conformity can be calculated by equation (1).

[Equation (1): image in the original publication; not recoverable from this text.]

Here, $N_{ev}$ is the total number of event blocks classified into the failure node, and $N_{att}$ is the number of event attributes. In the example of this specification there are five, as shown in FIG. 2: type, source, event number, user, and computer. $P_a^{(i,j)}(a_1, a_2)$ is the probability that, among the event blocks classified into the failure node, an event block appears in which the most frequent value of the j-th event attribute is $a_1$ and the second most frequent value is $a_2$.

For example, suppose that a failure node contains 10 event blocks and that, for the attribute "type", the pair (most frequent attribute value, second most frequent attribute value) within an event block is ("error", "fatal") in 5 blocks, ("error", "warning") in 3 blocks, and ("fatal", "warning") in 2 blocks. From these counts, $P_a^{(i,j)}(a_1, a_2)$ is given by

$$P_a^{(i,j)}(\text{error}, \text{fatal}) = \frac{5}{10}, \qquad P_a^{(i,j)}(\text{error}, \text{warning}) = \frac{3}{10}, \qquad P_a^{(i,j)}(\text{fatal}, \text{warning}) = \frac{2}{10}.$$
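
The probabilities in this example are relative frequencies over the event blocks of the failure node, which the short sketch below makes explicit. The function name is hypothetical, and exact fractions are used only to keep the arithmetic easy to check.

```python
from collections import Counter
from fractions import Fraction

def pair_probabilities(top2_pairs):
    """Relative frequency of each (most frequent, second most frequent) pair.

    top2_pairs: one pair per event block in the failure node.
    """
    counts = Counter(top2_pairs)
    total = len(top2_pairs)
    return {pair: Fraction(count, total) for pair, count in counts.items()}

# The ten event blocks of the worked example, for the attribute "type".
pairs = (
    [("error", "fatal")] * 5
    + [("error", "warning")] * 3
    + [("fatal", "warning")] * 2
)
probabilities = pair_probabilities(pairs)
```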

(Step 1005)
The failure cause determination rule change detection program 411 selects the temporary failure classification tree object 1110 with the largest weight 1130 from the temporary failure classification tree object set 1100 and compares it with the registered failure classification tree object 300. If the two objects match, the program does nothing and returns to step 1003. If they do not match, the program executes step 1006.
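
The pruning described under the weight update and the selection in step 1005 can be sketched together. The dictionary layout, the sample weights, and the threshold value are assumptions; how the two indices are combined into a single weight is not specified in the text and is not modelled here.

```python
def prune_and_select(temporary_objects, weight_threshold):
    """Drop objects whose weight is at or below the threshold, then pick the heaviest.

    Returns (survivors, best); best is None when nothing survives.
    """
    survivors = [obj for obj in temporary_objects if obj["weight"] > weight_threshold]
    best = max(survivors, key=lambda obj: obj["weight"], default=None)
    return survivors, best

# Hypothetical temporary failure classification tree objects.
candidates = [
    {"name": "A", "weight": 0.42},
    {"name": "B", "weight": 0.05},
    {"name": "C", "weight": 0.71},
]
survivors, best = prune_and_select(candidates, weight_threshold=0.1)
```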

When comparing failure classification tree objects, the failure cause determination rule change detection program 411 first compares the failure classification trees with each other and determines whether the failure nodes correspond. It then compares the failure cause determination rules set for the corresponding failure nodes.

Before this comparison, the failure cause determination rule change detection program 411 acquires the event blocks of each failure node from the failure node table 330 of the registered failure classification tree object 300 and classifies them according to the failure classification tree 1120 of the temporary failure classification tree object 1110. The temporary failure classification tree object is built from the latest event blocks, but classifying the same event blocks as those of the registered failure classification tree object 300 makes it possible to compare whether the correspondence between failure nodes has changed.

(Step 1006)
This step is executed when the registered failure classification tree object 300 and the temporary failure classification tree object 1110 do not match in step 1005.

The failure cause determination rule change detection program 411 identifies the cause of the change between the temporary failure classification tree object 1110 selected in step 1005 and the registered failure classification tree object 300. If the cause of the change is predicted to be the disappearance of a failure node, the program executes step 1007. If the cause of the change is predicted to be a change in a failure cause determination rule, the program executes step 1008. If the cause is neither of these, the program does nothing and proceeds to step 1009.

The failure cause determination rule change detection program 411 detects (1) the disappearance of a failure node and (2) a change in a failure cause determination rule as follows.

(1) Detection of failure node disappearance
FIG. 12-1 illustrates a specific example of detecting the disappearance of a failure node.
Part (1) of FIG. 12-1 is the failure classification tree 1200 of the registered failure classification tree object 300 created from the events of January to March 2009. As shown in (A-1) of FIG. 12-2, the failure cause determination rule 1220 is set for "failure node 2" 1201.

Part (2) of FIG. 12-1 is the failure classification tree 1210 of the temporary failure classification tree object 1110 created from the events of October to December 2009. This temporary failure classification tree object 1110 is the one selected from the temporary failure classification tree object set 1100 in step 1005 for comparison with the registered failure classification tree object 300.

First, the failure cause determination rule change detection program 411 compares the tree structure of the failure classification tree 1200 with that of the failure classification tree 1210. In FIG. 12-1, "failure node 2" 1201, which existed in the failure classification tree 1200, does not exist in the failure classification tree 1210. In this case, the program predicts (determines) that the failure node has disappeared. This means that the event blocks that belonged to "failure node 2" 1201 are not classified into any failure node of the failure classification tree 1210 of the temporary failure classification tree object 1110.

(2) Detection of a change in a failure cause determination rule
FIG. 13-1 illustrates a specific example of detecting a change in a failure cause determination rule.
Part (1) of FIG. 13-1 is the failure classification tree 1300 of the registered failure classification tree object 300 created from the events of January to March 2009. As shown in (A-1) of FIG. 13-2, the failure cause determination rule 1320 is set for "failure node 1-1" 1301.

Part (2) of FIG. 13-1 is the failure classification tree 1310 of the temporary failure classification tree object 1110 created from the events of October to December 2009. As shown in (B-1) of FIG. 13-2, the failure cause determination rule 1330 is set for "failure node 1-1'" 1311. This temporary failure classification tree object 1110 is the one selected from the temporary failure classification tree object set 1100 in step 1005 for comparison with the registered failure classification tree object 300.

First, the failure cause determination rule change detection program 411 compares the failure classification tree 1300 with the failure classification tree 1310. In this example, the two tree structures are the same. The program therefore takes the event blocks belonging to each failure node, classifies them according to the features of the failure classification tree 1310 of the temporary failure classification tree object, and determines which failure node each belongs to. If all the event blocks belonging to a given failure node are classified into one failure node of the failure classification tree 1310 of the temporary failure classification tree object, the program determines that the source failure node and the destination failure node correspond.

In FIG. 13-1, "failure node 1-1" 1301, stored in the failure node table 330 of the registered failure classification tree object 300, and "failure node 1-1'" 1311 of the temporary failure classification tree object 1110 are assumed to have been determined to correspond.
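
The correspondence test, together with the disappearance test described for FIG. 12-1, can be sketched as one pass over the registered failure nodes. Everything named here is hypothetical: `classify` stands for whatever routine assigns an event block to a failure node of the temporary tree (returning `None` when no node accepts it), and the block and node identifiers are invented for the example.

```python
def match_nodes(registered_node_blocks, classify):
    """Relate registered failure nodes to failure nodes of a temporary tree.

    registered_node_blocks: {registered_node_id: [event_block, ...]}
    classify: function mapping an event block to a temporary node id, or None.
    Returns (correspondence, disappeared).
    """
    correspondence = {}
    disappeared = []
    for node_id, blocks in registered_node_blocks.items():
        destinations = {classify(block) for block in blocks}
        if destinations == {None}:
            disappeared.append(node_id)        # no block fits any temporary node
        elif len(destinations) == 1:
            correspondence[node_id] = destinations.pop()  # all blocks agree
        # mixed destinations: neither a clean match nor a disappearance
    return correspondence, disappeared

# Hypothetical data: blocks b4 and b5 are not accepted by any temporary node.
registered_node_blocks = {
    "node1-1": ["b1", "b2"],
    "node1-2": ["b3"],
    "node2": ["b4", "b5"],
}
temporary_assignment = {"b1": "node1-1'", "b2": "node1-1'", "b3": "node1-2'"}
correspondence, disappeared = match_nodes(registered_node_blocks, temporary_assignment.get)
```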

Next, the failure cause determination rule change detection program 411 compares the failure cause determination rule 1320 of "failure node 1-1" 1301 with the failure cause determination rule 1330 of "failure node 1-1'" 1311. If the comparison reveals a difference, the program determines that the failure cause determination rule has changed.

In FIG. 13-2, the attribute "computer" of the fifth determination event of the failure cause determination rule 1320 has the value "server8" 1321, whereas the attribute "computer" of the fifth determination event of the failure cause determination rule 1330 has the value "server25" 1331. That is, the attribute value has changed. In this case, the failure cause determination rule change detection program 411 determines that the failure cause determination rule has changed.
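
A rule comparison of this kind can be sketched as an attribute-by-attribute diff of the determination events. The representation of a rule as a list of dictionaries, the function name, and the values of the first four events are assumptions; only the "server8" to "server25" change in the fifth event comes from the example. The sketch compares rules of equal length and does not handle added or removed events.

```python
def diff_rules(old_rule, new_rule):
    """List (event_position, attribute, old_value, new_value) for each difference.

    Rules are lists of determination events (dicts) in order of appearance;
    positions are 1-based to match the wording "fifth event".
    """
    changes = []
    for position, (old_event, new_event) in enumerate(zip(old_rule, new_rule), start=1):
        for attribute, old_value in old_event.items():
            new_value = new_event.get(attribute)
            if old_value != new_value:
                changes.append((position, attribute, old_value, new_value))
    return changes

# Hypothetical rules; events 1 to 4 are placeholders.
old_rule = [{"type": "error", "computer": f"server{n}"} for n in (1, 2, 3, 4)]
old_rule.append({"type": "error", "computer": "server8"})
new_rule = [dict(event) for event in old_rule[:4]]
new_rule.append({"type": "error", "computer": "server25"})
changes = diff_rules(old_rule, new_rule)
```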

(Step 1007)
This step is executed when step 1006 determines "disappearance of a failure node".

The failure cause determination rule change detection program 411 predicts the cause of the "disappearance of a failure node" and generates a temporary failure classification tree object 1110 based on that prediction.

The method of generating a temporary failure classification tree object 1110 from the prediction is explained using the specific examples of FIGS. 12-1 and 12-2. In FIG. 12-1, "failure node 2" 1201, which belongs to the failure classification tree 1200 of the registered failure classification tree object 300, has disappeared from the failure classification tree 1210 of the temporary failure classification tree object 1110. The predicted reason is that the IT service associated with "failure node 2" 1201 has been removed. In other words, failures involving the values "process71" and "process39" of the attribute "source" in the failure cause determination rule 1220 set for "failure node 2" 1201 are not expected to occur in the future.

In this case, the failure cause determination rule change detection program 411 acquires, from the failure node table 1150 of the temporary failure classification tree object 1110 selected in step 1005, the events whose attribute "source" has a value other than "process71" and "process39". Then, for these events, it creates a failure classification tree, failure cause determination rules, and a failure node table following steps 702 to 707 of FIG. 7. In doing so it applies the minimum event block support rate and the minimum number of valid attributes entered on the failure cause determination rule change detection program screen 500 of FIG. 5.
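
The event selection in this step amounts to a filter on the attribute "source". In the sketch below the function name and the event layout are assumptions, and "process12" is an invented value used only to show an event that survives the filter.

```python
def exclude_sources(events, removed_sources):
    """Keep only events whose 'source' is not among the removed sources."""
    return [event for event in events if event["source"] not in removed_sources]

# Hypothetical events from the failure node table.
events = [
    {"id": 1, "source": "process71"},
    {"id": 2, "source": "process12"},
    {"id": 3, "source": "process39"},
]
kept = exclude_sources(events, {"process71", "process39"})
```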

Finally, the program creates a new temporary failure classification tree object 1110 and stores the created failure classification tree, failure cause determination rules, and failure node table in its failure classification tree 1120, failure cause determination rule table 1140, and failure node table 1150, respectively. The weight 1130 is set to "1". The created temporary failure classification tree object 1110 is added to the temporary failure classification tree object set 1100.

(Step 1008)
This step is executed when step 1006 determines "change in a failure cause determination rule".

The failure cause determination rule change detection program 411 predicts the cause of the change in the failure cause determination rule and generates a temporary failure classification tree object 1110 based on that prediction.

The method of generating a temporary failure classification tree object 1110 from the prediction is explained using the specific examples of FIGS. 13-1 and 13-2. In FIG. 13-1, "failure node 1-1" 1301 of the failure classification tree 1300 of the registered failure classification tree object 300 corresponds to "failure node 1-1'" 1311 of the failure classification tree 1310 of the temporary failure classification tree object 1110. Comparing the failure cause determination rules 1320 and 1330 of these failure nodes shows that the value of the attribute "computer" of the fifth determination event has changed from "server8" 1321 to "server25" 1331. The predicted reason is that the computer "server8" has been replaced by the computer "server25", for example because of a hardware failure.

In this case, the failure cause determination rule change detection program 411 acquires all events from the failure node table 1150 of the temporary failure classification tree object 1110 selected in step 1005 and sets the attribute "computer" to "server25" in every event whose attribute "computer" has the value "server8".
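
The attribute rewrite can be sketched as follows. The function name and the event layout are assumptions, and the sketch returns modified copies rather than editing the stored events in place, a design choice the text does not prescribe.

```python
def replace_attribute_value(events, attribute, old_value, new_value):
    """Return events with `attribute` changed from old_value to new_value."""
    return [
        {**event, attribute: new_value} if event[attribute] == old_value else event
        for event in events
    ]

# Hypothetical events from the failure node table.
events = [
    {"id": 1, "computer": "server8"},
    {"id": 2, "computer": "server3"},
    {"id": 3, "computer": "server8"},
]
updated = replace_attribute_value(events, "computer", "server8", "server25")
```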

Next, for these modified events, the program creates a failure classification tree, failure cause determination rules, and a failure node table following steps 702 to 707 of FIG. 7. In doing so it applies the minimum event block support rate and the minimum number of valid attributes entered on the failure cause determination rule change detection program screen 500 of FIG. 5.

Finally, a new temporary failure classification tree object 1110 is created, and the created failure classification tree, failure cause determination rules, and failure node table are stored in its failure classification tree 1120, failure cause determination rule 1140, and failure node table 1150, respectively. The weight 1130 is set to "1". The created temporary failure classification tree object 1110 is added to the temporary failure classification tree object set 1100.
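The final step of creating the object and adding it to the set might look like the sketch below. The class `TemporaryTreeObject`, its field names, and the function `add_predicted_object` are hypothetical; only the four components (tree 1120, weight 1130, rules 1140, node table 1150) and the initial weight of 1 are taken from the description.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TemporaryTreeObject:
    """Hypothetical container mirroring the components of object 1110."""
    tree: Any                          # failure classification tree (1120)
    weight: float                      # selection weight (1130)
    rules: List[Any]                   # failure cause determination rules (1140)
    node_table: Dict[str, List[Any]]   # failure node table (1150)


def add_predicted_object(object_set: List[TemporaryTreeObject], tree: Any,
                         rules: List[Any],
                         node_table: Dict[str, List[Any]]) -> TemporaryTreeObject:
    """Create a prediction-based object with weight 1 and append it to the set."""
    obj = TemporaryTreeObject(tree=tree, weight=1, rules=rules, node_table=node_table)
    object_set.append(obj)
    return obj


# Illustrative use with placeholder contents.
temporary_set: List[TemporaryTreeObject] = []
created = add_predicted_object(
    temporary_set,
    tree={"root": ["failure node 1-1'"]},
    rules=[{"computer": "server25"}],
    node_table={"failure node 1-1'": []},
)
```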

(Step 1009)
The failure cause determination rule change detection program 411 replaces, via the communication device 432, the registered failure classification tree object 300 in the failure cause determination rule DB 106 with the temporary failure classification tree object 1110 selected in step 1005. The replacement is performed by assigning the failure classification tree 1120, the failure cause determination rule table 1140, and the failure node table 1150 of the temporary failure classification tree object 1110 to the failure classification tree 310, the failure cause determination rule 320, and the failure node table 330 of the registered failure classification tree object 300, respectively.
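The component-wise assignment of step 1009 can be sketched as below. The class and function names are hypothetical, and access to the failure cause determination rule DB 106 through the communication device 432 is omitted. Copying the components with `copy.deepcopy` is an assumption of this sketch, made so that later updates to the temporary object do not alter the registered one; the specification only says the components are assigned.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass
class RegisteredTreeObject:
    """Hypothetical model of registered object 300."""
    tree: Any        # failure classification tree (310)
    rules: Any       # failure cause determination rule (320)
    node_table: Any  # failure node table (330)


@dataclass
class TemporaryTreeObject:
    """Hypothetical model of temporary object 1110."""
    tree: Any        # failure classification tree (1120)
    weight: float    # selection weight (1130), not copied in step 1009
    rules: Any       # failure cause determination rule table (1140)
    node_table: Any  # failure node table (1150)


def replace_registered(registered: RegisteredTreeObject,
                       temporary: TemporaryTreeObject) -> None:
    """Assign the three shared components of the temporary object to the registered one."""
    registered.tree = copy.deepcopy(temporary.tree)
    registered.rules = copy.deepcopy(temporary.rules)
    registered.node_table = copy.deepcopy(temporary.node_table)


registered = RegisteredTreeObject(tree=["old"], rules=["server8"], node_table={})
temporary = TemporaryTreeObject(tree=["new"], weight=1, rules=["server25"],
                                node_table={"n1": []})
replace_registered(registered, temporary)
```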

Note that the temporary failure classification tree objects 1110 that can be selected in step 1005 include those created in step 1007 and those created in step 1008.

(Summary)
The operation of this embodiment proceeds in the following order.
(1) Events generated by the monitoring server that monitors system failures are acquired sequentially, and event blocks that group the events by failure are created (step 1003).
(2) The event blocks acquired within a predetermined time window width from the current processing time are taken as training data, and the set of temporary failure classification tree objects is updated based on those event blocks (step 1004).
(3) The temporary failure classification tree object with the heaviest selection weight is selected from the updated set, and a change between objects is detected by comparing the selected object with the registered failure classification tree object associated with the current failure cause determination rules (step 1005).
(4) If step 1005 finds that the two objects do not match, the cause of the change is predicted from the differences between them, and a temporary failure classification tree object based on the prediction is created and added to the set of temporary failure classification tree objects (steps 1006 to 1008).
(5) If step 1005 finds that the two objects do not match, the registered failure classification tree object is replaced with the temporary failure classification tree object selected in step 1005 or with the one created in step 1007 or 1008 (step 1009).
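Steps (3) to (5) above can be sketched as a single detection cycle. This is a simplified reading, not the claimed implementation: objects are modeled as dictionaries, the comparison of step 1005 is reduced to comparing rules only, and the prediction of steps 1006 to 1008 is passed in as a hypothetical callback `predict` that returns either a new object or `None`. Using the predicted object for the replacement when one exists, and the selected object otherwise, is an interpretation of the description.

```python
from typing import Callable, Dict, List, Optional

Obj = Dict[str, object]  # hypothetical object model with "weight" and "rules" keys


def detection_cycle(temp_set: List[Obj], registered: Obj,
                    predict: Callable[[Obj, Obj], Optional[Obj]]) -> Optional[Obj]:
    """Run steps 1005 to 1009 once; return the object used for replacement, if any."""
    # Step 1005: select the heaviest temporary object and compare it.
    selected = max(temp_set, key=lambda o: o["weight"])
    if selected["rules"] == registered["rules"]:
        return None  # no change detected, nothing to replace

    # Steps 1006 to 1008: try to predict the cause and build a new object.
    predicted = predict(registered, selected)
    if predicted is not None:
        temp_set.append(predicted)

    # Step 1009: replace the registered contents.
    source = predicted if predicted is not None else selected
    registered["rules"] = source["rules"]
    return source


# Illustrative run in which no cause can be predicted.
temp_set = [{"weight": 0.2, "rules": ["a"]}, {"weight": 0.9, "rules": ["b"]}]
registered = {"rules": ["a"]}
used = detection_cycle(temp_set, registered, lambda reg, sel: None)
```

A second call with the same inputs returns `None`, because the registered rules now match those of the heaviest temporary object.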

Thus, in this embodiment, the cause of a change is predicted by comparing the failure classification tree in use with the temporarily created failure classification tree, and by comparing the failure cause determination rules in use with the temporarily created failure cause determination rules. When the cause can be predicted, the specific attribute values of the determination events are rewritten all at once according to that cause. When the cause cannot be predicted, the contents of the registered failure classification tree object are overwritten with the contents of the current temporary failure classification tree object.

As a result, the contents of the registered failure classification tree object can be updated at least at the time a change in the failure classification tree object is detected. Moreover, when the cause of the change can be predicted, the attribute values of all related determination events can be rewritten at once according to the predicted cause. This means that, when a frequently occurring failure is related to a rarely occurring failure, the failure cause determination rule for the rarely occurring failure can be updated in advance, at the occurrence cycle of the frequently occurring failure.

101 ... Monitoring target server group
102 ... Monitoring server
103 ... Log database (DB)
104 ... Failure cause determination rule generation computer
105 ... Failure cause analysis computer
106 ... Failure cause determination rule DB
107 ... Failure cause determination rule change detection computer
108 ... Recovery procedure manual database (DB)
109 ... Recovery procedure manual browsing computer

Claims

A failure cause determination rule change detection device comprising:
a first processing unit that sequentially acquires events generated by a monitoring server that monitors system failures, and creates event blocks in which the events are grouped by failure;
a second processing unit that acquires, as training data, the event blocks acquired within a predetermined time window width from the current processing time, and updates a set of temporarily generated temporary failure classification tree objects based on those event blocks, each temporary failure classification tree object comprising (1) a failure classification tree in which failures are classified by their features, (2) a failure cause determination rule table storing failure cause determination rules, (3) a failure node table storing the event blocks used as training data in constructing the failure classification tree, and (4) a weight for object selection;
a third processing unit that selects the temporary failure classification tree object with the heaviest selection weight from the updated set of temporary failure classification tree objects, and detects a change between objects by comparing the selected temporary failure classification tree object with a registered failure classification tree object associated with the current failure cause determination rules, the registered failure classification tree object comprising (1) a failure classification tree in which failures are classified by their features, (2) a failure cause determination rule table storing failure cause determination rules, and (3) a failure node table storing the event blocks used as training data in constructing the failure classification tree;
a fourth processing unit that, when the third processing unit finds that the two objects do not match, predicts the cause of the difference between the two objects, creates a new temporary failure classification tree object based on the prediction, and adds it to the set of temporary failure classification tree objects; and
a fifth processing unit that, when the third processing unit finds that the two objects do not match, replaces the registered failure classification tree object with the temporary failure classification tree object selected by the third processing unit or the temporary failure classification tree object created by the fourth processing unit.

The failure cause determination rule change detection device according to claim 1, wherein the second processing unit determines the selection weight of each temporary failure classification tree object based on an evaluation of the entire classification tree by category utility and on the degree of matching between a newly acquired event block and the failure node into which that event block is classified.

The failure cause determination rule change detection device according to claim 1, wherein, when a failure node that existed in the failure classification tree of the registered failure classification tree object does not exist at the corresponding position in the failure classification tree of the temporary failure classification tree object, the fourth processing unit predicts that a service has been discontinued, and creates a new temporary failure classification tree object based on an event set obtained by deleting the events related to that service from the events serving as training data of the temporary failure classification tree object.

The failure cause determination rule change detection device according to claim 1, wherein, when the content of the failure cause determination rule set for a failure node of the failure classification tree of the registered failure classification tree object differs from the content of the failure cause determination rule of the corresponding temporary failure classification tree object, the fourth processing unit predicts that the configuration of the system infrastructure has changed, changes the attribute of the events in the training data to the content after the change, and creates a new temporary failure classification tree object based on event blocks containing the events whose attribute has been changed.

A program that causes a computer to execute:
a first process of sequentially acquiring events generated by a monitoring server that monitors system failures, and creating event blocks in which the events are grouped by failure;
a second process of acquiring, as training data, the event blocks acquired within a predetermined time window width from the current processing time, and updating a set of temporarily generated temporary failure classification tree objects based on those event blocks, each temporary failure classification tree object comprising (1) a failure classification tree in which failures are classified by their features, (2) a failure cause determination rule table storing failure cause determination rules, (3) a failure node table storing the event blocks used as training data in constructing the failure classification tree, and (4) a weight for object selection;
a third process of selecting the temporary failure classification tree object with the heaviest selection weight from the updated set of temporary failure classification tree objects, and detecting a change between objects by comparing the selected temporary failure classification tree object with a registered failure classification tree object associated with the current failure cause determination rules, the registered failure classification tree object comprising (1) a failure classification tree in which failures are classified by their features, (2) a failure cause determination rule table storing failure cause determination rules, and (3) a failure node table storing the event blocks used as training data in constructing the failure classification tree;
a fourth process of, when the third process finds that the two objects do not match, predicting the cause of the difference between the two objects, creating a new temporary failure classification tree object based on the prediction, and adding it to the set of temporary failure classification tree objects; and
a fifth process of, when the third process finds that the two objects do not match, replacing the registered failure classification tree object with the temporary failure classification tree object selected in the third process or the temporary failure classification tree object created in the fourth process.