JP5836316B2

JP5836316B2 - Fault monitoring system, fault monitoring method, and fault monitoring program

Info

Publication number: JP5836316B2
Application number: JP2013103976A
Authority: JP
Inventors: 武人小澤; 田中　章仁; 章仁田中
Original assignee: SoftBank Corp
Current assignee: SoftBank Corp
Priority date: 2013-05-16
Filing date: 2013-05-16
Publication date: 2015-12-24
Anticipated expiration: 2033-05-16
Also published as: JP2014225124A

Description

The present invention relates to a system, a method, and a program for monitoring failures occurring in network nodes.

When a failure occurs in any network node connected to a computer network, many network nodes affect one another in response to that single failure, and a plurality of secondary failure events may occur as a result. One known failure analysis technique for identifying the location and cause of a failure is event correlation, which uses the correlation (relationship) between failure events, as described in detail in, for example, Japanese Unexamined Patent Application Publication No. 09-064971.

JP 09-064971 A

However, although the correlation between failure events fluctuates with changes in the computer network configuration, conventional event correlation techniques keep the correlation between failure events fixed. As a result, failure analysis that faithfully reflects the computer network configuration has not been performed.

An object of the present invention is therefore to perform failure analysis that faithfully reflects the computer network configuration.

To solve the above problem, a fault monitoring system according to the present invention includes: calculation means for recalculating the dynamic correlation between a plurality of failure events occurring in a network node based on the recovery history of the failure events; updating means for updating the dynamic correlation between the plurality of failure events based on the result of the calculation; subtracting means for subtracting, at regular intervals, from the static correlation initially set between the plurality of failure events; and limiting means for limiting notification of the occurrence of a predetermined failure event among the plurality of failure events, on condition that there exists a failure event for which the higher of the dynamic correlation and the static correlation with respect to the predetermined failure event is greater than or equal to a threshold. When there exists a failure event (for example, failure event A) for which the higher of the dynamic correlation and the static correlation with respect to a predetermined failure event (for example, failure event B) is greater than or equal to the threshold, there is a statistically high probability that the predetermined failure event (for example, failure event B) will be resolved as a result of resolving that failure event (for example, failure event A). Limiting notification of the occurrence of such a failure event (for example, failure event B) therefore reduces the operator's monitoring burden.

According to the present invention, failure analysis that faithfully reflects the computer network configuration can be performed.

FIG. 1 is an explanatory diagram showing an example of event correlation according to the present embodiment.
FIG. 2 is an explanatory diagram showing an overview of alarm notification by the fault monitoring system according to the present embodiment.
FIG. 3 is a block diagram showing the configuration of the fault monitoring system according to the present embodiment.
FIG. 4 is a flowchart showing the flow of alarm display when a failure occurs according to the present embodiment.
FIG. 5 is a flowchart showing the flow of log file management according to the present embodiment.
FIG. 6 is a flowchart showing the flow of dynamic correlation management according to the present embodiment.
FIG. 7 is a flowchart showing the flow of static correlation management according to the present embodiment.

Embodiments of the present invention will be described below with reference to the drawings.
FIG. 1 is an explanatory diagram showing an example of event correlation according to the present embodiment. In a broad sense, an event is an occurrence that causes a state transition in a computer; an event called a failure event, in particular, is an occurrence that causes a failure in a computer. In this specification, the correlation between failure events occurring in a network node is evaluated quantitatively using the statistical probability that resolving one failure event also resolves the other. In the example shown in FIG. 1, the correlation of failure event A with respect to failure event B is a 95% probability that failure event B is resolved as a result of resolving failure event A, and the correlation of failure event B with respect to failure event A is a 20% probability that failure event A is resolved as a result of resolving failure event B. Similarly, the correlation of failure event B with respect to failure event C is a probability of 80%, and the correlation of failure event C with respect to failure event B is a probability of 10%. The correlation of failure event C with respect to failure event A is a probability of 20%, and the correlation of failure event A with respect to failure event C is a probability of 85%. In this specification, when the correlation of a first failure event with respect to a second failure event is greater than or equal to a threshold (for example, a probability of 80%), the first failure event is defined as the "cause" of the second failure event, and the second failure event is defined as being "under the influence" of the first failure event. In the example shown in FIG. 1, the correlation of failure event A with respect to failure event B is 95%, so failure event A is the cause of failure event B, and failure event B is under the influence of failure event A. For convenience of explanation, FIG. 1 describes the correlation among three failure events, but the correlation between failure events can be defined in the same way when there are two failure events or four or more.
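
The "cause" and "under the influence" definitions can be illustrated with a minimal Python sketch that uses the FIG. 1 probabilities. The dictionary layout and the names `CORRELATION`, `THRESHOLD`, `is_cause`, and `under_influence_of` are assumptions made for illustration; the specification does not prescribe any particular data structure.

```python
from __future__ import annotations

# Hypothetical layout: CORRELATION[(x, y)] is the probability that
# resolving failure event x also resolves failure event y (FIG. 1 values).
CORRELATION = {
    ("A", "B"): 0.95, ("B", "A"): 0.20,
    ("B", "C"): 0.80, ("C", "B"): 0.10,
    ("C", "A"): 0.20, ("A", "C"): 0.85,
}
THRESHOLD = 0.80  # example threshold given in the description


def is_cause(first: str, second: str) -> bool:
    """True if `first` is a "cause" of `second` (correlation >= threshold)."""
    return CORRELATION.get((first, second), 0.0) >= THRESHOLD


def under_influence_of(event: str) -> set[str]:
    """Return every failure event that is a "cause" of `event`."""
    events = {e for pair in CORRELATION for e in pair}
    return {other for other in events
            if other != event and is_cause(other, event)}
```

With these values, failure event A is under the influence of no other event, B is under the influence of A, and C is under the influence of both A and B.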

FIG. 2 is an explanatory diagram showing an overview of alarm notification by the fault monitoring system 10 according to the present embodiment. When a plurality of failure events A, B, and C occur in a network node 20 connected to the computer network, the fault monitoring system 10 checks the correlations among the failure events A, B, and C and notifies the operator 30 of the occurrence of any failure event that is not under the influence of another failure event. A failure event that is under the influence of another failure event has a statistically high probability of being resolved at the same time as the failure event that causes it. Selecting only the failure events that are under no such influence and notifying the operator 30 of them therefore reduces the monitoring burden. In the example shown in FIG. 2, only failure event A among the failure events A, B, and C is not under the influence of another failure event, so the fault monitoring system 10 selects failure event A from among the failure events A, B, and C and notifies the operator 30 of it.

FIG. 3 is a block diagram showing the configuration of the fault monitoring system 10 according to the present embodiment. Because the correlations among the failure events A, B, and C can change at any time with changes in the computer network configuration, the fault monitoring system 10 periodically recalculates those correlations based on the recovery history of the failure events A, B, and C, which enables failure analysis that faithfully reflects the computer network configuration. The fault monitoring system 10 includes a communication interface 11, a processor 12, a display device 13, an input device 14, a storage resource 15, and a bus 16. The communication interface 11 is connected to the network node 20 via the computer network and receives notifications of the occurrence of the failure events A, B, and C from the network node 20. The network node 20 is, for example, a router, a switch, a hub, or a small base station. The processor 12 interprets and executes the fault monitoring program 40 stored in the storage resource 15 to perform fault monitoring processing (for example, the processing shown in FIGS. 4 to 7). The input device 14 is, for example, a keyboard or a mouse that accepts input operations from the operator 30. Part of the storage resource 15 is used as a work area for the fault monitoring program 40 and temporarily stores an in-failure list 51, a recovery-in-progress list 52, and a notification target list 53. The in-failure list 51 lists the failure events whose occurrence has been reported by the network node 20. The recovery-in-progress list 52 lists the failure events that are currently undergoing failure recovery processing. The notification target list 53 lists the failure events to be displayed as alarms on the display device 13. Alarm display means displaying on the display device 13 a message warning that a failure has occurred, and is synonymous with alarm notification.

The storage resource 15 stores a correlation file 61, which holds the calculated correlations among the failure events A, B, and C, and a log file 62, which holds the recovery history of the failure events A, B, and C. The storage resource 15 is, for example, a storage area (logical device) provided by a computer-readable recording medium (physical device). Computer-readable recording media include, for example, flexible disks, magneto-optical disks, ROM, writable nonvolatile memory such as flash memory, portable media such as DVDs, and storage resources such as hard disks built into a computer system. Computer-readable recording media also include media that hold a program temporarily, such as volatile memory inside a computer system that serves as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line. The fault monitoring program 40 may also be transmitted to another computer system via a transmission medium or by transmission waves in the transmission medium. Here, the transmission medium that transmits the fault monitoring program 40 is a medium that has the function of transmitting information, such as a network like the Internet or a communication line like a telephone line. Installing the fault monitoring program 40 on a general-purpose computer allows that computer to function as the fault monitoring system 10. The bus 16 interconnects the communication interface 11, the processor 12, the display device 13, the input device 14, and the storage resource 15 so that data can be read and written among them.

The fault monitoring program 40 calculates the correlation between failure events using two kinds of variables: one called static correlation and one called dynamic correlation. Static correlation is a variable declared in the program so that it is initialized to a given value and then reduced from that initial value by a predetermined probability at a fixed interval. Dynamic correlation is a probability-valued variable declared in the program so that it is periodically recalculated from the recovery history of the failure events and updated.
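
One way to picture the two variables is a small record that holds both values for a single ordered pair of failure events. This is only a sketch: the class name `CorrelationEntry`, its fields, and the use of `None` to mark an invalidated static correlation are assumptions, not details taken from the specification.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class CorrelationEntry:
    """Correlation of one failure event with respect to another (assumed layout)."""
    dynamic: float = 0.0            # recalculated from the recovery history
    static: Optional[float] = None  # operator-set value; None once invalidated

    def effective(self) -> float:
        """Higher of the two values; an invalidated static value is ignored."""
        if self.static is None:
            return self.dynamic
        return max(self.static, self.dynamic)
```

The `effective` method reflects the comparison rule used later in step 406, where the higher of the two correlations is checked against the threshold.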

FIG. 4 is a flowchart showing the flow of alarm display when a failure occurs according to the present embodiment.
In step 401, the processor 12 determines whether a notification of the occurrence of a failure event has been received from the network node 20.
In step 402, the processor 12 refers to the failure events listed in the in-failure list 51. The in-failure list 51 lists the failure events whose occurrence has been reported by the network node 20.
In step 403, the processor 12 determines whether a plurality of failure events are listed in the in-failure list 51.
In step 404, the processor 12 adds the failure event listed in the in-failure list 51 to the notification target list 53 and displays an alarm on the display device 13 for the failure event added to the notification target list 53.
In step 405, the processor 12 selects two failure events from among the plurality of failure events listed in the in-failure list 51.
In step 406, the processor 12 determines whether the higher of the static correlation and the dynamic correlation of one of the two failure events selected in step 405 (for example, failure event A) with respect to the other (for example, failure event B) is greater than or equal to a threshold (for example, 80%). The static correlation may have been invalidated (see step 705 in FIG. 7); in that case, step 406 determines whether the dynamic correlation alone is greater than or equal to the threshold.
In step 407, the processor 12 determines that one of the two failure events selected in step 405 (for example, failure event A) is the "cause" of the other (for example, failure event B).
In step 408, the processor 12 determines whether steps 405 to 407 have been executed for every combination of two failure events among the plurality of failure events listed in the in-failure list 51.
In step 409, the processor 12 adds to the notification target list 53 each failure event, among those listed in the in-failure list 51, that is not under the influence of any other failure event, and displays an alarm on the display device 13 for the failure events added to the notification target list 53. This limits notification of failure events that are under the influence of another failure event and so reduces the operator's monitoring burden. If the failure events listed in the in-failure list 51 all influence one another, however, no failure event is free of influence. In that case, the failure event that is influenced by the fewest failure events may be added to the notification target list 53 and displayed as an alarm on the display device 13.
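
The selection performed in steps 403 to 409 can be sketched as follows. This is an illustrative reading rather than the patented implementation: the function name `select_notifications`, the dictionary keyed by ordered pairs, and the tie handling in the fallback are assumptions. In particular, the fallback interprets the text as "notify the events influenced by the fewest other events".

```python
from __future__ import annotations

from itertools import permutations


def select_notifications(failing: list[str],
                         correlation: dict[tuple[str, str], float],
                         threshold: float = 0.80) -> list[str]:
    """Return the failure events to display as alarms.

    correlation[(x, y)] is assumed to hold the higher of the static and
    dynamic correlation of x with respect to y.
    """
    if len(failing) <= 1:               # steps 403-404: nothing to compare
        return list(failing)

    # Steps 405-408: check every ordered pair and record the "causes".
    causes: dict[str, set[str]] = {event: set() for event in failing}
    for first, second in permutations(failing, 2):
        if correlation.get((first, second), 0.0) >= threshold:
            causes[second].add(first)   # step 407: first is a cause of second

    # Step 409: notify events that are under the influence of no other event.
    independent = [event for event in failing if not causes[event]]
    if independent:
        return independent

    # Fallback (assumed reading): events influenced by the fewest others.
    fewest = min(len(c) for c in causes.values())
    return [event for event in failing if len(causes[event]) == fewest]
```

With the FIG. 1 probabilities and failure events A, B, and C all active, the function returns only A, which matches the FIG. 2 example.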

FIG. 5 is a flowchart showing the flow of management of the log file 62 according to the present embodiment.
In step 501, the processor 12 determines whether a recovery notification for a failure event has been received from the network node 20.
In step 502, the processor 12 adds the failure event to be recovered, as indicated by the recovery notification received in step 501, to the recovery-in-progress list 52.
In step 503, the processor 12 sets the processed flag of the failure event added to the recovery-in-progress list 52 to off.
In step 504, the processor 12 selects, from the failure events in the recovery-in-progress list 52, one failure event whose processed flag is off and records the recovery history for the selected failure event in the log file 62. The recovery history includes, for example, the type of the failure event selected in step 504, the types of all failure events that are resolved as a result of resolving the selected failure event, the types of all failure events that are not resolved even when the selected failure event is resolved, and the date and time at which step 504 was performed.
In step 505, the processor 12 sets the processed flag of the failure event for which step 504 has been completed to on.
In step 506, the processor 12 determines whether steps 504 and 505 have been completed for all failure events listed in the recovery-in-progress list 52.
In step 507, the processor 12 deletes all failure events listed in the recovery-in-progress list 52.
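
Steps 503 to 507 might be sketched as below. The specification does not say how the system decides which other failure events were resolved "as a result of" a given recovery, so this sketch makes a loud assumption: events recovered in the same batch are treated as resolved together, and events still in the in-failure list are treated as unresolved. The function name `log_recoveries` and the record keys are likewise hypothetical.

```python
from __future__ import annotations

from datetime import datetime


def log_recoveries(recovering: list[str],
                   still_failing: list[str],
                   log: list[dict],
                   now: datetime) -> None:
    """Append one recovery-history record per recovered event (steps 503-507)."""
    processed = {event: False for event in recovering}      # step 503
    for event in recovering:                                # steps 504-506
        if processed[event]:
            continue
        log.append({
            "event": event,
            # Assumption: events recovered in the same batch count as
            # resolved together with `event`.
            "resolved": [e for e in recovering if e != event],
            # Assumption: events still failing count as not resolved.
            "unresolved": list(still_failing),
            "timestamp": now,
        })
        processed[event] = True                             # step 505
    recovering.clear()                                      # step 507
```

A real system would more likely persist these records to a file or database; an in-memory list is used here only to keep the sketch self-contained.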

FIG. 6 is a flowchart showing the flow of dynamic correlation management according to the present embodiment.
In step 601, the processor 12 lists the recovery history recorded in the log file 62 within a fixed past period (for example, within the past several months).
In step 602, the processor 12 selects any two failure events (for example, failure events A and B) from the recovery history listed in step 601 and calculates N1, the number of times the other failure event (for example, failure event B) was resolved as a result of resolving the one failure event (for example, failure event A), and N2, the number of times the other failure event (for example, failure event B) was not resolved when the one failure event (for example, failure event A) was resolved.
In step 603, the processor 12 recalculates the dynamic correlation of the one failure event (for example, failure event A) with respect to the other failure event (for example, failure event B) as N1 / (N1 + N2).
In step 604, the processor 12 updates the correlation file 61 so that the dynamic correlation of the one failure event (for example, failure event A) with respect to the other failure event (for example, failure event B) matches the value recalculated in step 603.
In step 605, the processor 12 determines whether steps 602 to 604 have been executed for every combination of any two failure events.
The software module that executes steps 601 to 605 is programmed to be called and executed periodically within the fault monitoring program 40, so that the dynamic correlation values are periodically updated to the latest values.
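
The recalculation in steps 601 to 603 amounts to counting outcomes per ordered pair and taking N1 / (N1 + N2). The sketch below assumes each recovery-history record is a dict with the hypothetical keys `event`, `resolved`, `unresolved`, and `timestamp`; the function name and the optional `since` cutoff are also illustrative.

```python
from __future__ import annotations

from collections import Counter
from datetime import datetime


def recalculate_dynamic(history: list[dict],
                        since: datetime | None = None
                        ) -> dict[tuple[str, str], float]:
    """Return {(x, y): N1 / (N1 + N2)} for every ordered pair in `history`."""
    resolved = Counter()    # N1: y was resolved when x was resolved
    unresolved = Counter()  # N2: y stayed unresolved when x was resolved
    for record in history:
        # Step 601: keep only records inside the look-back period.
        if since is not None and record["timestamp"] < since:
            continue
        x = record["event"]
        for y in record["resolved"]:
            resolved[(x, y)] += 1
        for y in record["unresolved"]:
            unresolved[(x, y)] += 1

    result = {}
    for pair in set(resolved) | set(unresolved):
        n1, n2 = resolved[pair], unresolved[pair]
        result[pair] = n1 / (n1 + n2)   # n1 + n2 >= 1 for every listed pair
    return result
```

Pairs that never appear in the history are simply absent from the result, which avoids a division by zero; how such pairs should be treated is not stated in the specification.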

FIG. 7 is a flowchart showing the flow of static correlation management according to the present embodiment.
In step 701, the processor 12 selects any two failure events (for example, failure events A and B) and sets the initial value of the static correlation of one of the selected failure events (for example, failure event A) with respect to the other (for example, failure event B), together with the decrement that is subtracted from that value at a fixed interval. The operator 30 can specify the initial value of the static correlation and its decrement in advance. The decrement may be zero.
In step 702, the processor 12 determines whether a fixed period has elapsed since the processing in step 706, described later, was completed.
In step 703, the processor 12 reduces the static correlation of the one failure event (for example, failure event A) with respect to the other failure event (for example, failure event B) by the decrement set in step 701.
In step 704, the processor 12 determines whether the static correlation of the one failure event (for example, failure event A) with respect to the other failure event (for example, failure event B) is below a threshold (for example, 80%).
In step 705, the processor 12 invalidates the static correlation of the one failure event (for example, failure event A) with respect to the other failure event (for example, failure event B) and notifies the operator 30 accordingly.
In step 706, the processor 12 determines whether steps 701 to 705 have been executed for every combination of any two failure events.
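
One periodic pass through steps 703 to 705 could look like the following sketch. To avoid floating-point rounding at the threshold boundary, it stores correlations as integer percentage points; that choice, the function name `decay_static`, and the use of `None` for an invalidated value are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Optional


def decay_static(static: dict[tuple[str, str], Optional[int]],
                 decrement: dict[tuple[str, str], int],
                 threshold: int = 80) -> list[tuple[str, str]]:
    """Apply one periodic subtraction (steps 703-705) to every pair, in place.

    Values are integer percentage points. Returns the pairs whose static
    correlation was invalidated in this cycle, for notifying the operator.
    """
    invalidated = []
    for pair, value in static.items():
        if value is None:                       # already invalidated
            continue
        value -= decrement.get(pair, 0)         # step 703
        if value < threshold:                   # step 704
            static[pair] = None                 # step 705: invalidate
            invalidated.append(pair)
        else:
            static[pair] = value
    return invalidated
```

For example, a static correlation of 90 with a decrement of 5 stays valid for two cycles (85, then 80) and is invalidated on the third, when it would drop to 75.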

According to the present embodiment, the correlations among the failure events A, B, and C are recalculated and updated based on the recovery history of the failure events A, B, and C, which enables failure analysis that faithfully reflects the computer network configuration. In addition, at the initial stage after a computer network is built, the static correlation may reflect actual conditions better than the dynamic correlation. Taking the static correlation into account in addition to the dynamic correlation therefore also enables failure analysis that faithfully reflects the computer network configuration.

DESCRIPTION OF SYMBOLS
10 ... Fault monitoring system
11 ... Communication interface
12 ... Processor
13 ... Display device
14 ... Input device
15 ... Storage resource
20 ... Network node
30 ... Operator
40 ... Fault monitoring program
51 ... In-failure list
52 ... Recovery-in-progress list
53 ... Notification target list
61 ... Correlation file
62 ... Log file

Claims

1. A fault monitoring system comprising:
calculation means for recalculating a dynamic correlation between a plurality of failure events occurring in a network node based on a recovery history of the failure events;
updating means for updating the dynamic correlation between the plurality of failure events based on a result of the calculation;
subtracting means for subtracting, at regular intervals, from a static correlation initially set between the plurality of failure events; and
limiting means for limiting notification of the occurrence of a predetermined failure event among the plurality of failure events, on condition that there exists a failure event for which the higher of the dynamic correlation and the static correlation with respect to the predetermined failure event is greater than or equal to a threshold.

2. The fault monitoring system according to claim 1, wherein
the plurality of failure events includes a first failure event and a second failure event, and
the dynamic correlation of the first failure event with respect to the second failure event is a probability that the second failure event is resolved as a result of resolving the first failure event.

3. A fault monitoring method comprising:
recalculating a dynamic correlation between a plurality of failure events occurring in a network node based on a recovery history of the failure events;
updating the dynamic correlation between the plurality of failure events based on a result of the calculation;
subtracting, at regular intervals, from a static correlation initially set between the plurality of failure events; and
limiting notification of the occurrence of a predetermined failure event among the plurality of failure events, on condition that there exists a failure event for which the higher of the dynamic correlation and the static correlation with respect to the predetermined failure event is greater than or equal to a threshold.

4. A fault monitoring program for causing a computer to execute the fault monitoring method according to claim 3.