JP2011176554A

JP2011176554A - Monitoring device, monitoring method and program

Info

Publication number: JP2011176554A
Application number: JP2010038456A
Authority: JP
Inventors: Hiroyoshi Kin; 大善金; Hiroyuki Shinpo; 宏之新保; Hidetoshi Yokota; 英俊横田
Original assignee: KDDI R&D Laboratories Inc
Current assignee: KDDI Research Inc
Priority date: 2010-02-24
Filing date: 2010-02-24
Publication date: 2011-09-08
Anticipated expiration: 2030-02-24
Also published as: JP5505930B2

Abstract

PROBLEM TO BE SOLVED: To provide a monitoring device whose processing time is shorter than that of conventional technology and which does not require "rules" or the like to be created in advance by an experienced person.

SOLUTION: The monitoring device obtains alarm and warning messages from the respective devices and includes: grouping means for storing the obtained messages in message tables, one per transmission source address; compression means for consolidating duplicate messages in a message table into one; clustering means for determining, based on the content information included in the messages, which messages in the respective message tables are related to one another, and grouping each series of related messages into a cluster; and cause analysis means for determining, among the alarm messages included in each cluster, the alarm message in the lowest layer as the cause alarm.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to network monitoring technology, and more specifically to a monitoring device that determines the correlation among the many messages acquired from the component devices of a network in order to identify the message reporting the cause of a failure, and that further predicts the occurrence of a failure based on the warning messages that accompany the alarm message of the failure cause.

As the scale of the managed network grows, the number of alarm messages and warning messages that a monitoring device receives from the devices in the network when a failure occurs becomes enormous. To respond to the failure, it is necessary to determine the interrelationships among the received messages and to identify the message that indicates the root cause of the failure, and various methods have been proposed for this purpose (see, for example, Non-Patent Documents 1 to 11).

Non-Patent Document 1: Hanemann A, et al., "Algorithm Design and Application of Service-oriented Event Correlation", 3rd IEEE/IFIP International Workshop, April 2008, pp. 61-70
Non-Patent Document 2: Risto, et al., "Tools and Techniques for Event Log Analysis", PhD thesis, Tallinn University of Technology, Department of Computer Engineering, Estonia, June 2005
Non-Patent Document 3: Banerjee D, et al., "A Framework for Distributed Monitoring and Root Cause Analysis for Large IP Networks", 28th IEEE International Symposium on Reliable Distributed Systems, September 2009, pp. 246-255
Non-Patent Document 4: White Paper, "Automating Root-Cause Analysis: EMC Ionix Codebook Correlation Technology vs. Rules-based Analysis", November 2000
Non-Patent Document 5: Qiuhua Zheng et al., "An Event Correlation Approach Based on the Combination of IHU and Codebook", Lecture Notes in Computer Science, 2005, Vol. 3802, pp. 757-763
Non-Patent Document 6: M. Steinder et al., "A Survey of Fault Localization Techniques in Computer Networks", Science of Computer Programming, Special Edition on Topics in System Administration, November 2004, Vol. 53, pp. 165-194
Non-Patent Document 7: AL-MAMORY Safaa O, et al., "Intrusion Detection Alarms Reduction Using Root Cause Analysis and Clustering", Journal of Computer Communications, February 2009, Vol. 32, No. 2, pp. 419-430
Non-Patent Document 8: Jukic et al., "Logical Inventory Database Integration into Network Problems Frequency Detection Process", ConTEL 2009, June 2009, pp. 361-365
Non-Patent Document 9: Wu Jian, et al., "A Novel Algorithm for Dynamic Mining of Association Rules", International Workshop on Knowledge Discovery and Data Mining, January 2008, pp. 94-99
Non-Patent Document 10: Qingguo Zheng, et al., "Intelligent Search of Correlated Alarms from Database Containing Noise Data", Network Operations and Management Symposium, April 2002, pp. 405-419
Non-Patent Document 11: Risto Vaarandi, "A Data Clustering Algorithm for Mining Patterns from Event Logs", IPOM 2003, October 2003, pp. 119-126

Non-Patent Documents 1, 2, 3, and 6 describe determining the correlation between messages based on predetermined "rules"; Non-Patent Documents 1, 4, 5, and 6 describe doing so based on a predetermined "codebook"; and Non-Patent Documents 1 and 6 describe doing so based on "cases". However, creating the "rules", "codebook", and "cases" depends heavily on the experience and ability of the people who create them, so the accuracy of the determination varies greatly with the "rules" and the like that were created.

Non-Patent Documents 7 to 11 describe configurations that use data mining technology to determine the message indicating the root cause of a failure. However, all of the proposed data mining algorithms take a long time to run, so the cause of a failure cannot be determined immediately from the large number of received messages.

Therefore, an object of the present invention is to provide a monitoring device, a monitoring method, and a program whose processing time is shorter than that of the prior art and which do not require "rules" or the like created in advance by an experienced person.

A monitoring device according to the present invention is a monitoring device that acquires alarm and warning messages from each device, and is characterized by comprising: grouping means for classifying the acquired messages based on their source addresses and storing them in a message table for each source address; compression means for aggregating duplicate messages in a message table into one; clustering means for determining, based on the content information included in the messages, which messages in the message tables are related to one another, and grouping each series of related messages into a cluster; and cause analysis means for determining, among the messages included in each cluster, the alarm message of the lowest layer as the cause alarm.

According to another embodiment of the monitoring device of the present invention, it is also preferable that the device further comprises pattern analysis means for creating a frequent event table that indicates combinations of an alarm content and the warning contents likely to accompany the failure identified by that alarm content, based on the alarm content of the alarm message determined to be the cause alarm and on the warning contents of those warning messages in the same cluster whose occurrence time differs from that of the alarm message by no more than a predetermined time; and that, when at least a predetermined number of the warning contents of the warning messages in a message table match a combination of warning contents in the frequent event table, the compression means outputs a preliminary warning of the failure indicated by the alarm content corresponding to that combination.

According to another embodiment of the monitoring device of the present invention, it is also preferable that the clustering unit aggregates into one those messages in the same cluster that have different source addresses but the same content information.

A monitoring method according to the present invention is characterized by comprising: a first step in which a grouping unit classifies the messages acquired from each device by source address and stores them in a message table for each source address; a second step in which a compression unit aggregates duplicate messages in a message table into one; a third step in which a clustering unit determines, based on the content information included in the messages, which messages in the message tables are related to one another, and groups each series of related messages into a cluster; and a fourth step in which a cause analysis unit outputs, among the messages included in each cluster, the alarm message of the lowest layer as the cause alarm.

A program according to the present invention is characterized by causing a computer to function as the monitoring device.

The monitoring device according to the present invention does not require "rules", "codebooks", "cases", or the like to be created in advance and does not use a complicated algorithm, so it can determine the cause of a failure quickly.

FIG. 1 is a schematic configuration diagram of a monitoring device according to the present invention.
FIG. 2 is a diagram showing a message table.
FIG. 3 is a diagram showing the message table after compression.
FIG. 4 is a diagram explaining conversion to a cluster table.
FIG. 5 is a diagram showing a cluster table.
FIG. 6 is a diagram showing the cluster table after compression.
FIG. 7 is a system configuration diagram for carrying out the monitoring method according to the present invention.
FIG. 8 is a diagram showing the messages generated in the configuration of FIG. 7.
FIG. 9 is a diagram showing the message tables when the messages shown in FIG. 8 are acquired.
FIG. 10 is a diagram showing the cluster table when the messages shown in FIG. 8 are acquired.
FIG. 11 is a diagram showing the compressed cluster table when the messages shown in FIG. 8 are acquired.
FIG. 12 is a diagram showing the cause alarm list when the messages shown in FIG. 8 are acquired.
FIG. 13 is a diagram showing the event sets when the messages shown in FIG. 8 are acquired.

Modes for carrying out the present invention are described in detail below with reference to the drawings. In the present invention, an alarm message is a message notifying that an endpoint or function monitored in the network has stopped, and a warning message is a message notifying that a monitored endpoint or function has not stopped but is in an unstable or abnormal state; where the two need not be distinguished, they are simply called messages. In addition to the IP address of the device that generated it, a message contains the time at which it was generated (timestamp) and content information indicating the alarm content (for an alarm message) or the warning content (for a warning message). The monitoring device is assumed to be able to determine from the content information the layer in which the failure or abnormal state has occurred.

When another device is involved in the warning or alarm content, the content information also includes counter device information identifying that other device. For example, when a Port 80 Down alarm occurs on a Web server, other node devices communicating with that Web server generate messages whose content is Web Error, and the content information of those messages includes counter device information identifying the Web server. In this embodiment there are five layers, from lowest to highest: system, link, network, transport, and application. The system layer refers to the parts common to a device, such as the CPU and memory.
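The message fields and the layer ordering described above can be modeled roughly as follows. This is only a sketch: the class and field names (`Layer`, `Message`, `peer`, and so on) are hypothetical, and the layer and alarm/warning classification assigned to the two sample messages are assumptions, since the text does not state them.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Layer(IntEnum):
    """The five layers of the embodiment, lowest first."""
    SYSTEM = 0
    LINK = 1
    NETWORK = 2
    TRANSPORT = 3
    APPLICATION = 4


@dataclass(frozen=True)
class Message:
    source_ip: str              # address of the device that generated the message
    timestamp: float            # time at which the message was generated
    kind: str                   # "alarm" (stopped) or "warning" (unstable or abnormal)
    content: str                # alarm or warning content
    layer: Layer                # layer determined from the content information
    peer: Optional[str] = None  # counter device information, if another device is involved


# Sample messages modeled on the Web server example.
# The layers and the "alarm" classification are assumptions for illustration.
port_down = Message("#D", 100.0, "alarm", "Port 80 Down", Layer.TRANSPORT)
web_error = Message("#1", 101.0, "alarm", "Web Error", Layer.APPLICATION, peer="#D")

# Because Layer is an IntEnum, "lowest layer" is simply the minimum value.
lowest = min([port_down, web_error], key=lambda m: m.layer)
```

Using an integer-valued enumeration makes the later "lowest layer" comparison a plain `min` over the messages of a cluster.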

FIG. 1 is a schematic configuration diagram of a monitoring device according to the present invention. As shown in FIG. 1, the monitoring device comprises a grouping unit 1, a compression unit 2, a clustering unit 3, a cause analysis unit 4, a notification unit 5, a pattern analysis unit 6, and a storage unit 7. The storage unit 7 stores a message table 71, a cluster table 72, a cause alarm list 73, and a frequent event table (FET) 74.

The grouping unit 1 acquires messages from each device in the network, classifies the acquired messages by source IP address and layer, and stores them in the message table 71 corresponding to the source IP address. FIG. 2 shows the message table 71. It shows, for example, that messages ML1 and ML2 have been received at the link layer from the device whose IP address is A, once and four times respectively.
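A minimal sketch of this grouping step, assuming messages are plain dictionaries with hypothetical keys `source_ip`, `layer`, `content`, and `timestamp`. The ML1/ML2 counts follow the FIG. 2 example in the text; device B and message MN1 are invented for illustration.

```python
from collections import defaultdict


def group_messages(messages):
    """Build one message table per source IP, with messages filed by layer."""
    tables = defaultdict(lambda: defaultdict(list))
    for msg in messages:
        tables[msg["source_ip"]][msg["layer"]].append(msg)
    return tables


# ML1 received once and ML2 four times from device A at the link layer.
received = (
    [{"source_ip": "A", "layer": "link", "content": "ML1", "timestamp": 0}]
    + [{"source_ip": "A", "layer": "link", "content": "ML2", "timestamp": t}
       for t in range(1, 5)]
    # Hypothetical message from a second device, to show a second table.
    + [{"source_ip": "B", "layer": "network", "content": "MN1", "timestamp": 5}]
)

tables = group_messages(received)
ml2_count = sum(1 for m in tables["A"]["link"] if m["content"] == "ML2")
```

Each entry of `tables` plays the role of one message table 71; duplicates are still present at this stage and are removed by the compression step.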

The compression unit 2 aggregates messages that are identical in everything except the timestamp into a single message; that is, it consolidates messages received in duplicate into one. FIG. 3 shows the message table 71 of FIG. 2 after aggregation by the compression unit 2. For example, at the transport layer, message MT1 had been received three times and message MT2 twice, but each has been consolidated into one.
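The duplicate removal can be sketched as follows, again assuming dictionary messages with a hypothetical `timestamp` key. Keeping the first occurrence (and therefore its timestamp) is an assumption; the text does not say which timestamp survives.

```python
def compress(table):
    """Collapse messages that are identical apart from their timestamp."""
    seen = set()
    compressed = []
    for msg in table:
        # Identity of a message = every field except the timestamp.
        key = tuple(sorted((k, v) for k, v in msg.items() if k != "timestamp"))
        if key not in seen:
            seen.add(key)
            compressed.append(msg)  # assumption: keep the first occurrence
    return compressed


# Transport-layer entries of one message table: MT1 three times, MT2 twice.
transport = [
    {"source_ip": "A", "layer": "transport", "content": "MT1", "timestamp": 1},
    {"source_ip": "A", "layer": "transport", "content": "MT2", "timestamp": 2},
    {"source_ip": "A", "layer": "transport", "content": "MT1", "timestamp": 3},
    {"source_ip": "A", "layer": "transport", "content": "MT1", "timestamp": 4},
    {"source_ip": "A", "layer": "transport", "content": "MT2", "timestamp": 5},
]

compressed = compress(transport)
contents = [m["content"] for m in compressed]
```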

The FET 74 is a table indicating the combinations of warning contents of the warning messages that are likely to be issued along with an alarm message indicating a given alarm content. When the combinations of warning contents held in the FET 74 include one that correlates highly with the combination of warning contents of the warning messages in an aggregated message table 71, the compression unit 2 determines that the failure indicated by the alarm content corresponding to that combination is likely to occur, and outputs a preliminary warning to a display unit (not shown) and/or a network management device.

From the messages in the aggregated message tables 71 provided for the respective IP addresses, the clustering unit 3 determines which messages are related to one another based on the network configuration, and groups each series of related messages into a cluster. For example, as shown in FIG. 4, when link #2 goes down, not only the routers connected to link #2 but also the devices at both ends that communicate over link #2 detect network-layer and transport-layer failures and send messages to the monitoring device. Based on the content information of the messages and/or the network configuration, the clustering unit 3 determines, for example, that the network-layer and transport-layer messages from those end devices belong to the same cluster as the message from the router indicating the failure of link #2. Information about the configuration of the network monitored by the monitoring device is assumed to be known within the network, and the clustering unit 3 obtains it by, for example, accessing a device that holds this information.

The clustering unit 3 stores each group of messages determined to belong to the same cluster as a cluster table 72. Messages that carry the same content information but have different source IP addresses are aggregated into one. For example, suppose that related messages are extracted from the message tables and the cluster table shown in FIG. 5 is created. As shown in FIG. 5, the application layer contains message MA1 received from the device with IP address A and message MA5 received from each of the devices with IP addresses B, C, and D. Because the MA5 messages come from different source devices but indicate the same content, they are aggregated into one, as shown in FIG. 6.
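The aggregation inside a cluster table might look like the sketch below. The dictionary keys are hypothetical, and recording the list of reporting devices in a `sources` field is an assumption; the text only says that such messages are merged into one.

```python
def compress_cluster(cluster):
    """Merge messages with the same layer and content from different sources."""
    merged = {}
    for msg in cluster:
        key = (msg["layer"], msg["content"])
        entry = merged.setdefault(
            key, {"layer": msg["layer"], "content": msg["content"], "sources": []}
        )
        entry["sources"].append(msg["source_ip"])  # assumption: remember who reported it
    return list(merged.values())


# Application-layer part of the FIG. 5 example: MA1 from A, MA5 from B, C, and D.
cluster = [
    {"source_ip": "A", "layer": "application", "content": "MA1"},
    {"source_ip": "B", "layer": "application", "content": "MA5"},
    {"source_ip": "C", "layer": "application", "content": "MA5"},
    {"source_ip": "D", "layer": "application", "content": "MA5"},
]

compressed_cluster = compress_cluster(cluster)
```

After the call, the three MA5 messages are represented by a single entry, which corresponds to the reduction from FIG. 5 to FIG. 6.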

The cause analysis unit 4 takes the alarm message in the lowest layer of each cluster table as the root cause and outputs these to the storage unit 7 as the cause alarm list 73; the notification unit 5 outputs the cause alarm list 73 to a display unit (not shown) and/or sends it to a network management device or the like. The cause analysis unit 4 also outputs to the pattern analysis unit 6, as an event set, the warning contents of the warning messages in a cluster table together with the alarm content of the alarm message selected as the root cause of that cluster, and the pattern analysis unit 6 updates the FET 74 from the existing FET 74 and the new event set using, for example, the Apriori algorithm. The event set includes only the contents of those warning messages whose time difference from the timestamp of the alarm message selected as the root cause is within a predetermined value.
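The selection of the cause alarm and the construction of an event set can be sketched as follows. The dictionary keys, the `window` parameter, and the sample contents "Device Down" and "Memory Warning" are hypothetical; the FET update itself (for example with the Apriori algorithm) is not shown.

```python
LAYER_ORDER = ["system", "link", "network", "transport", "application"]  # lowest first


def find_cause_alarm(cluster):
    """Return the alarm message in the lowest layer, or None if there is no alarm."""
    alarms = [m for m in cluster if m["kind"] == "alarm"]
    if not alarms:
        return None
    return min(alarms, key=lambda m: LAYER_ORDER.index(m["layer"]))


def build_event_set(cluster, window):
    """Pair the cause alarm with warnings issued within `window` of its timestamp."""
    cause = find_cause_alarm(cluster)
    if cause is None:
        return None
    warnings = sorted({
        m["content"] for m in cluster
        if m["kind"] == "warning"
        and abs(m["timestamp"] - cause["timestamp"]) <= window
    })
    return {"alarm": cause["content"], "warnings": warnings}


# Illustrative cluster for a failed router; contents and times are made up.
cluster = [
    {"kind": "warning", "layer": "system", "content": "CPU Warning", "timestamp": 95.0},
    {"kind": "warning", "layer": "system", "content": "Memory Warning", "timestamp": 10.0},
    {"kind": "alarm", "layer": "system", "content": "Device Down", "timestamp": 100.0},
    {"kind": "alarm", "layer": "network", "content": "Device Unreachable", "timestamp": 101.0},
]

cause = find_cause_alarm(cluster)
event_set = build_event_set(cluster, window=30.0)
```

In this sample, "Memory Warning" is excluded from the event set because it occurred 90 time units before the cause alarm, outside the assumed window of 30.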

Next, a specific example of the monitoring method according to the present invention is described. FIG. 7 is the system configuration diagram used in the following description: routers #A, #B, and #C are connected to one another, Web servers #D and #E are connected to router #B, and nodes #1 and #2 are connected to router #C. The probe device is a device that repeatedly accesses each of the other devices to acquire messages about their failures and states. In this example the monitoring device according to the present invention is assumed to be implemented on the same computer as the probe device, but this is only an example; the monitoring device may instead be implemented as a device separate from the probe device and receive messages from each device in the network, including the probe device.

In the configuration shown in FIG. 7, assume that router #A itself and the server function of Web server #D have stopped due to failures and that the monitoring device has acquired the messages shown in FIG. 8. Because the probe device repeatedly monitors the state of each device, the monitoring device has also acquired messages from before router #A stopped, such as the CPU Warning of router #A. In FIG. 8, "Web Error (#D)", for example, is content information, and the (#D) within it is counter device information.

FIG. 9 shows the message tables produced by the grouping unit 1 and aggregated by the compression unit 2 when the messages shown in FIG. 8 are acquired. In FIG. 9, the messages shown in FIG. 8 are classified by source IP address, that is, by device, and recorded by the layer determined from the content information. The clustering unit 3 then creates the cluster table shown in FIG. 10 from the message tables shown in FIG. 9. For example, the content information of the Web Error (#D) messages from node #1 and node #2 shows that they concern Web server #D, so the Port 80 Down message of Web server #D and the Web Error (#D) messages can be determined to belong to the same cluster. The relationship between the link layer and the network layer is determined by referring to the network configuration information, as described above.
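One simple way to approximate the content-based part of this clustering is to key each message by the device it concerns: the counter device if one is named, otherwise the sender. This is a deliberate simplification and an assumption, not the clustering procedure of the embodiment; it ignores the topology lookup used for link-layer and network-layer relationships and would merge unrelated failures on the same device. The dictionary keys are hypothetical.

```python
from collections import defaultdict


def cluster_by_peer(messages):
    """Group messages by the device they concern (counter device, else source)."""
    clusters = defaultdict(list)
    for msg in messages:
        key = msg.get("peer") or msg["source"]
        clusters[key].append(msg)
    return dict(clusters)


# Messages modeled on the example: Web server #D and router #A have failed.
messages = [
    {"source": "#D", "content": "Port 80 Down", "peer": None},
    {"source": "#1", "content": "Web Error", "peer": "#D"},
    {"source": "#2", "content": "Web Error", "peer": "#D"},
    {"source": "#B", "content": "Device Unreachable", "peer": "#A"},
    {"source": "#C", "content": "Device Unreachable", "peer": "#A"},
]

clusters = cluster_by_peer(messages)
```

With these inputs the Port 80 Down message and both Web Error (#D) messages land in one cluster, and the two Device Unreachable (#A) messages in another.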

In the cluster table shown in FIG. 10, the Device Unreachable (#A) messages concerning router #A in cluster #1, for example, are reported by routers #B and #C respectively. Because these messages have different sources but the same content information, the clustering unit 3 aggregates them into one. The clustering unit 3 thus finally outputs the cluster table shown in FIG. 11 to the storage unit 7.

The cause analysis unit 4 extracts, as the root cause, the alarm message in the lowest layer among the alarm messages of each cluster, and therefore outputs the cause alarm list 73 shown in FIG. 12 to the storage unit 7. The cause analysis unit 4 also outputs to the pattern analysis unit 6, as an event set, the warning contents of those warning messages in each cluster that occurred within the predetermined time difference of the occurrence time of the cluster's cause alarm, together with the alarm content of the cause alarm. FIG. 13 shows the event sets output by the cause analysis unit 4 in this example.

Finally, the output of the preliminary warning by the compression unit 2 is described. Suppose, for example, that the FET 74 specifies M warning contents for alarm content X. If the message table corresponding to a given device (IP address) contains warning messages whose warning contents match at least a predetermined proportion or number of these, the compression unit 2 outputs a preliminary warning whose content depends on that proportion or number.

Specifically, suppose that the FET 74 specifies warning contents W1 to W6 as corresponding to alarm content X, and that Warning is output for three matches and Critical for four or more matches. In this case, if the warning messages in a message table indicate warning contents W1, W2, W7, W9, and W10, only two warning contents (W1 and W2) match, and no preliminary warning is issued. If instead the warning messages in the message table indicate warning contents W1, W2, W3, W7, and W9, the three warning contents W1, W2, and W3 match, so a Warning message is output to warn that the failure identified by alarm content X may occur in the device corresponding to the IP address of that message table. Further, if the warning messages in the message table indicate warning contents W1, W2, W3, W4, W5, W7, W9, and W10, at least four warning contents, including W1, W2, W3, and W4, match, so a Critical message is output to warn that the failure identified by alarm content X may occur in the device corresponding to the IP address of that message table.
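The threshold logic of this example can be sketched as follows. The function and parameter names are hypothetical; the thresholds of three and four matches are the ones used in the example above, and a count-based rule is shown rather than a proportion-based one.

```python
def preliminary_warning(fet_warnings, observed, warning_at=3, critical_at=4):
    """Return "Critical", "Warning", or None from the number of matching warning contents."""
    matches = len(set(fet_warnings) & set(observed))
    if matches >= critical_at:
        return "Critical"
    if matches >= warning_at:
        return "Warning"
    return None


# Warning contents that the FET associates with alarm content X.
fet_for_x = ["W1", "W2", "W3", "W4", "W5", "W6"]

no_warning = preliminary_warning(fet_for_x, ["W1", "W2", "W7", "W9", "W10"])
warning = preliminary_warning(fet_for_x, ["W1", "W2", "W3", "W7", "W9"])
critical = preliminary_warning(fet_for_x, ["W1", "W2", "W3", "W4", "W7"])
```

The first call matches only W1 and W2 and yields no preliminary warning, the second matches three contents and yields Warning, and the third matches four and yields Critical.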

As described above, the monitoring device according to the present invention does not require "rules", "codebooks", "cases", or the like to be created in advance and does not use a complicated algorithm, so it can determine the cause of a failure quickly.

Although the embodiment described above applies the present invention to a TCP/IP network, the present invention is not limited to that embodiment and is applicable to any network having a layered structure, such as a sensor network, a home network, or a cloud network.

The monitoring device according to the present invention can be realized by a program that causes a computer to function as each unit in FIG. 1. Such a computer program can be stored on a computer-readable storage medium or distributed over a network. The present invention can also be realized by a combination of hardware and software.

Description of Symbols
1 Grouping unit
2 Compression unit
3 Clustering unit
4 Cause analysis unit
5 Notification unit
6 Pattern analysis unit
7 Storage unit
71 Message table
72 Cluster table
73 Cause alarm list
74 Frequent event table

Claims

1. A monitoring device that acquires alarm and warning messages from each device, comprising:
grouping means for classifying the acquired messages based on their source addresses and storing them in a message table for each source address;
compression means for aggregating duplicate messages in a message table into one;
clustering means for determining, based on content information included in the messages, which messages in the message tables are related to one another, and grouping each series of related messages into a cluster; and
cause analysis means for determining, among the messages included in each cluster, the alarm message of the lowest layer as a cause alarm.

2. The monitoring device according to claim 1, further comprising pattern analysis means for creating a frequent event table that indicates combinations of an alarm content and the warning contents likely to accompany the failure identified by that alarm content, based on the alarm content of the alarm message determined to be the cause alarm and on the warning contents of those warning messages in the same cluster as the alarm message whose occurrence time differs from that of the alarm message by no more than a predetermined time,
wherein, when at least a predetermined number of the warning contents of the warning messages included in a message table match a combination of warning contents in the frequent event table, the compression means outputs a preliminary warning of the failure indicated by the alarm content corresponding to that combination.

3. The monitoring device according to claim 1 or 2, wherein the clustering means aggregates into one those messages in the same cluster that have different source addresses but the same content information.

4. A monitoring method comprising:
a first step in which a grouping unit classifies the messages acquired from each device by source address and stores them in a message table for each source address;
a second step in which a compression unit aggregates duplicate messages in a message table into one;
a third step in which a clustering unit determines, based on content information included in the messages, which messages in the message tables are related to one another, and groups each series of related messages into a cluster; and
a fourth step in which a cause analysis unit outputs, among the messages included in each cluster, the alarm message of the lowest layer as a cause alarm.

5. A program that causes a computer to function as the monitoring device according to any one of claims 1 to 3.