JP4412031B2

JP4412031B2 - Network monitoring system and method, and program

Info

Publication number: JP4412031B2
Application number: JP2004101827A
Authority: JP
Inventors: 到西岡; 伸治加美
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2004-03-31
Filing date: 2004-03-31
Publication date: 2010-02-10
Anticipated expiration: 2024-03-31
Also published as: JP2005285040A

Description

The present invention relates to a network monitoring system, a network monitoring method, and a program, and more particularly to failure monitoring and failure information analysis in a communication network.

With the recent shift toward an advanced information society, servers providing a wide variety of services run continuously in data centers and similar facilities, and a huge number of network devices of many kinds have been introduced to connect them. A failure in one of these network devices not only inconveniences service users but also causes the service provider enormous losses. Administrators therefore need to monitor the network devices continuously with a monitoring device. When a monitored network device fails, the administrator must identify the cause of the failure and restore service quickly.

Network devices are generally monitored using SNMP (Simple Network Management Protocol). Monitoring information is collected in one of two ways: polling, in which the operating state of a device is collected periodically, and traps, in which a threshold is set on the device in advance and the device raises an alarm when the threshold is exceeded. When a failure occurs, the administrator must identify its cause and analyze its scope of impact based on the information the monitoring device has gathered by these two methods. Because this work is done entirely by hand, the analysis takes an enormous amount of time.

To address this problem, Patent Document 1 discloses a technique for analyzing failure information automatically. The technique applies fuzzy rules to multiple pieces of information collected from network devices to determine whether a failure has occurred and, if so, to diagnose in detail which part has failed.

However, as devices have become more complex and networks larger, fine-grained monitoring of network devices requires collecting an enormous amount of monitoring information, and the collection itself places a load on the network. Conversely, reducing the load on the network means reducing the amount of monitoring information, which makes it difficult for the administrator to grasp the state of the network in detail.

To address this problem, Patent Document 2 discloses a scheme that collects only a predefined, limited set of monitoring information. When the judgment of that information indicates an abnormality, the scheme collects monitoring information associated with it in advance and judges again, repeating this operation. As another approach, Patent Document 3 discloses a scheme that preferentially polls devices that have failed frequently in the past.

The techniques of Patent Documents 2 and 3 concentrate monitoring on the failed network device or the failed item, so they can reduce the load on the network. However, because they act only after a failure has occurred, information related to the failure may be unobtainable, and the cause of the failure may be impossible to analyze. Nor do they remove the need for the administrator to perform the analysis by hand.

Patent Document 1: JP-A-7-30540
Patent Document 2: JP-A-8-065302
Patent Document 3: JP-A-4-239242

The problem shared by the three prior-art techniques above is that they act only after a failure has occurred, so monitoring information may be impossible to collect from a network device that has already failed. For example, when data traffic drives the load on a network device very high, the device is too heavily loaded to answer requests for monitoring information. As another example, when a network device restarts for some reason, the information from before the restart is lost, so the administrator cannot obtain enough information to analyze why the device restarted.

An object of the present invention is to provide a network monitoring system, a network monitoring method, and a program that acquire related information before a network device fails, without placing a load on the network device.

Another object of the present invention is to provide a network monitoring system, a network monitoring method, and a program that, in the course of collecting information, also notify the administrator of the analysis results for the cause of a failure and its scope of impact.

The network monitoring system according to the present invention is a monitoring system that collects and monitors information on a plurality of network devices, and is characterized by including:
monitoring rule storage means that stores in advance, as monitoring rules, initial monitoring information to be collected from each of the network devices and monitoring information related to it;
sign detection means that detects a sign of failure by processing the initial monitoring information collected from the network devices;
collection monitoring information determination means that, in response to the detection of a sign by the sign detection means, searches the monitoring rule storage means for monitoring information that is related to the initial monitoring information and identifies the cause of the failure, and collects the monitoring information found; and
post-discovery means that determines the details of the failure based on the monitoring information collected by the collection monitoring information determination means.

The network monitoring method according to the present invention is a monitoring method that collects and monitors information on a plurality of network devices, in which monitoring rule storage means is prepared that stores in advance, as monitoring rules, initial monitoring information to be collected from each of the network devices and monitoring information related to it, and which is characterized by including:
a sign detection step of detecting a sign of failure by processing the initial monitoring information collected from the network devices;
a collection monitoring information determination step of, in response to the detection of a sign in the sign detection step, searching the monitoring rule storage means for monitoring information that is related to the initial monitoring information and identifies the cause of the failure, and collecting the monitoring information found; and
a post-discovery step of determining the details of the failure based on the monitoring information collected in the collection monitoring information determination step.

The program according to the present invention is a program for causing a computer to execute a monitoring method that collects and monitors information on a plurality of network devices, and is characterized by including:
a process of detecting a sign of failure by processing initial monitoring information collected from the network devices;
a process of, in response to the detection of the sign, searching monitoring rule storage means for monitoring information that is related to the initial monitoring information and identifies the cause of the failure, and collecting the monitoring information found; and
a post-discovery process of determining the details of the failure based on the related monitoring information.

The operation of the present invention is as follows. In a network monitoring system having a communication function for acquiring monitoring information from a plurality of network devices, the monitoring information collection unit collects continuous-quantity information as the initial monitoring information, and the monitoring information determination unit monitors the statistical behavior of that information. When behavior that differs from normal is detected, the system treats it as a sign that an abnormality is about to occur: the collection monitoring information determination unit refers to the monitoring rule database and instructs the monitoring information collection unit to collect the multiple related pieces of monitoring information. The monitoring information determination unit then judges their values to identify the cause of the failure.

The first effect of the present invention is that the load placed on the network devices and the network is kept to a minimum while the network monitoring system monitors the devices. This is because the system does not fetch all monitoring information from the network devices at once. Instead, it determines the minimum set of monitoring information related to the alarm that has been raised and collects only that information, and only for the period in which it is needed.

The second effect of the present invention is that the network monitoring system can find network failures quickly. This is because the system detects a sign of failure and then begins to monitor the failures related to that sign dynamically and in detail. Starting to monitor related information dynamically on the basis of a sign reduces the amount of information monitored at any one time. Whereas monitoring all parameters has required a monitoring interval of about 30 minutes, the present invention can shorten the interval to about 1 minute at roughly the same load.

The third effect of the present invention is that the network administrator can respond quickly to network failures. This is because, after detecting a sign and then a failure, the network monitoring system of the present invention identifies the cause of the failure, examines its scope of impact, and reports the results to the network administrator.

Embodiments of the present invention are described in detail below with reference to the drawings. The description assumes that the monitoring information collection unit 101 in FIG. 1 collects information using SNMP (Simple Network Management Protocol), which is standardized by the IETF (Internet Engineering Task Force). In this description, "network monitoring system" refers to an apparatus, and "administrator" refers to a person who manages the network using the network monitoring system.

FIG. 1 is a block diagram showing the network monitoring system of the first embodiment of the present invention together with the monitored network that it monitors. In FIG. 1, the network monitoring system 100 includes: a monitoring information collection unit 101 that collects monitoring information from a plurality of network devices 111; a monitoring information determination unit 102 that judges whether the collected monitoring information is abnormal, using one of the determination functions predefined in the determination function unit 103; a monitoring rule DB (database) 105 that defines the monitoring rules; a collection monitoring information determination unit 104 that refers to the monitoring rule DB 105 to decide which monitoring information to collect next; and a log storage unit 106 that stores the information collected by the monitoring system and whether alarms have occurred.

An administrator monitoring the network can access the information in the log storage unit 106 through the monitoring terminal 121 at the monitoring site 120. In addition, when abnormality information is written to the log storage unit 106, the monitoring terminal 121 is notified automatically.

As shown in FIG. 2, the monitoring rule DB 105 holds a plurality of monitoring objects. Each monitoring object records: a monitoring information name, by which the administrator identifies the information being monitored; a MIB (Management Information Base) object name, which the monitoring information collection unit 101 uses to collect the information via SNMP; a monitoring tree number, which expresses how pieces of monitoring information are related; a monitoring node address, which identifies the network device to be monitored; a timeout time, which gives the monitoring duration; a judgment value, which is used to judge the collected information; and child monitoring tree numbers, which identify the monitoring information to be monitored next.

The monitoring tree number of each piece of monitoring information is written with "." (dot) separators, as in "1.1" or "1.1.1.1". This makes it possible to associate a parent piece of monitoring information with the child monitoring information to be monitored when the parent shows an abnormality. The monitoring tree is assumed to have been built in advance by the administrator, based on experience with past failures, and stored in the monitoring rule DB 105.
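The monitoring object and its tree numbering can be pictured with a small data structure. The following is a minimal sketch only: the class name, field names, the format of the judgment value, and the three sample rules are all hypothetical and are not taken from FIG. 2. The MIB object names (`ifInOctets`, `ifInErrors`, `ifOperStatus`) are standard interface objects chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MonitoringObject:
    """One entry of the monitoring rule DB (field names are hypothetical)."""
    name: str            # monitoring information name shown to the administrator
    mib_object: str      # MIB object name used for SNMP collection
    tree_number: str     # dot-separated position in the monitoring tree
    node_address: str    # address of the monitored network device
    timeout_s: int       # how long this item is monitored, in seconds
    judgment_value: str  # error condition; the textual format is an assumption
    child_tree_numbers: List[str] = field(default_factory=list)


def children_of(rules, parent):
    """Return the rules that `parent` names as the next items to monitor."""
    by_number = {rule.tree_number: rule for rule in rules}
    return [by_number[n] for n in parent.child_tree_numbers if n in by_number]


def is_descendant(child_number, parent_number):
    """True if `child_number` lies below `parent_number` in the dotted tree."""
    return child_number.startswith(parent_number + ".")


# Hypothetical sample rules: a traffic counter at the root, two detail checks below it.
RULES = [
    MonitoringObject("inbound traffic", "ifInOctets", "1", "192.0.2.1", 60,
                     "deviation > 3", ["1.1", "1.2"]),
    MonitoringObject("inbound errors", "ifInErrors", "1.1", "192.0.2.1", 60,
                     "value > 0"),
    MonitoringObject("link status", "ifOperStatus", "1.2", "192.0.2.1", 60,
                     "value != 1"),
]
```

With these sample rules, `children_of(RULES, RULES[0])` yields the two detail checks, which are the items that would be collected only after the root item is judged abnormal.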

The determination function unit 103, which analyzes the collected monitoring information, consists of a time-series information determination function 103a, a multiple time-series information determination function 103b, an integer-type information determination function 103c, and an array-type information determination function 103d. The way one of these determination functions is selected is described next.

Because the monitoring information collected via SNMP is expressed in SMI (Structure of Management Information), the notation used for MIBs, the monitoring information handled in the present invention has the following data types: Counter (a non-negative integer that increases over time), Gauge (a non-negative integer that can increase or decrease and latches at a maximum value), Integer (an integer value), IPAddress (an IP address), PhysicalAddress (a physical address, for example a MAC address), List (a list of multiple values of the other data types), and Table (multiple Lists).

The determination function is selected according to these types. If the data type is Counter or Gauge, the time-series information determination function 103a is selected. If it is a single Integer, IPAddress, or PhysicalAddress, the integer-type information determination function 103c is selected. If the data are Counter or Gauge values collected from a plurality of network devices, the multiple time-series information determination function 103b is selected. If the data are a List or Table of Integer, IPAddress, or PhysicalAddress values, or are such data collected from a plurality of network devices, the array-type information determination function 103d is selected.
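The selection rule above amounts to a small dispatch on the data type and on whether the data are single or multiple. The sketch below is one way to write it down; the function name, the two boolean parameters, and the returned labels are assumptions made for illustration.

```python
TIME_SERIES_TYPES = {"Counter", "Gauge"}
SCALAR_TYPES = {"Integer", "IPAddress", "PhysicalAddress"}


def select_determination_function(data_type, is_list_or_table=False,
                                  from_multiple_devices=False):
    """Pick one of the four determination functions 103a to 103d.

    `data_type` is the SMI type of the values; the two flags say whether the
    values form a List/Table and whether they come from several devices.
    """
    if data_type in TIME_SERIES_TYPES:
        # Counter/Gauge: statistical (time-series) judgment.
        return "103b" if from_multiple_devices else "103a"
    if data_type in SCALAR_TYPES:
        # Integer/IPAddress/PhysicalAddress: direct comparison or table check.
        if is_list_or_table or from_multiple_devices:
            return "103d"
        return "103c"
    raise ValueError(f"unsupported data type: {data_type}")
```

For instance, a single Gauge maps to 103a, while routing tables (Lists of IPAddress values) gathered from several devices map to 103d.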

The determination method for each data type is described below. When the input monitoring information is a single Counter or Gauge, the state of the network is diagnosed with the time-series information determination function shown in FIG. 3, following the operation flow shown in FIG. 4. The time-series information determination function processes past data statistically, compares a new data point with the statistically processed data, calculates how far the new point deviates, and judges whether it is abnormal. The monitoring information A_t collected by the monitoring information collection unit 101 is stored in the monitoring information DB 10 for the retention period W (S10), producing the time-series information A[t]. The statistical processing device 11 then processes A[t] statistically to derive the occurrence distribution function θ (S11).

The abnormality determination device 12 compares the new monitoring information A_{t+1} with the distribution function θ and calculates the difference between them (S12). It compares this difference with the error condition in the monitoring rule DB 105 (S13). If the result is true (abnormal), it notifies the collection monitoring information determination unit 104 of the abnormality and stores the abnormality in the log storage unit 106 (S14). If the result is false (normal), it notifies the collection monitoring information determination unit 104 that the state is normal, and at the same time the monitoring information DB 10 discards the oldest stored value A_{t-W} and stores A_{t+1} (S15).
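Steps S10 to S15 can be sketched as a sliding-window outlier check. The text does not specify the form of the distribution function θ or of the error condition, so the sketch below makes two explicit assumptions: θ is summarized by the mean and standard deviation of the window, and the error condition is a threshold on the deviation measured in standard deviations. The class and parameter names are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class TimeSeriesJudge:
    """Sliding-window judgment for a single Counter/Gauge series (sketch)."""

    def __init__(self, window, threshold):
        self.history = deque(maxlen=window)  # plays the role of monitoring information DB 10
        self.threshold = threshold           # assumed form of the error condition

    def judge(self, value):
        """Return True if `value` is judged abnormal, False otherwise."""
        if len(self.history) < self.history.maxlen:
            self.history.append(value)       # S10: still filling the retention period W
            return False
        mu = mean(self.history)              # S11: summarize the past window
        sigma = pstdev(self.history)
        if sigma > 0:
            deviation = abs(value - mu) / sigma          # S12: distance from the past
        else:
            deviation = 0.0 if value == mu else float("inf")
        if deviation > self.threshold:       # S13: compare with the error condition
            return True                      # S14: caller notifies unit 104 and logs
        self.history.append(value)           # S15: deque drops the oldest value
        return False
```

A Counter grows monotonically, so in practice the values fed to `judge` would more likely be per-interval differences than raw counter readings; that preprocessing is outside this sketch.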

When the input monitoring information consists of multiple Counter or Gauge values, the state of the network is diagnosed with the multiple time-series information determination function shown in FIG. 5, following the operation flow shown in FIG. 6. This function correlates multiple past data series, processes the correlated data statistically, compares the result with the correlation of the new data, calculates how far the new correlation deviates, and judges whether it is abnormal. The multiple pieces of monitoring information A_t, B_t, and C_t collected by the monitoring information collection unit 101 are stored in the monitoring information DB 10 for the retention period W (S20), producing the time-series information A[t], B[t], and C[t] (S21).

The correlation processing device 11 correlates the time-series information A[t], B[t], and C[t] with one another to derive the covariances Γ_AB, Γ_BC, and Γ_CA (S22). It then derives the occurrence distribution functions θ_AB, θ_BC, and θ_CA of these covariances (S23). The abnormality determination device 12 calculates the covariances of the new monitoring information A_{t+1}, B_{t+1}, and C_{t+1} (S24), compares each covariance with the corresponding distribution function θ, and calculates the difference between the new data and the distribution function (S25).

This difference is compared with the error condition in the monitoring rule DB 105 (S26). If the result is true (abnormal), the collection monitoring information determination unit 104 is notified of the abnormality and the abnormality is stored in the log storage unit 106 (S27). If the result is false (normal), the collection monitoring information determination unit 104 is notified that the state is normal. At the same time, the monitoring information DB 10 discards the oldest stored values A_{t-W}, B_{t-W}, and C_{t-W} and stores A_{t+1}, B_{t+1}, and C_{t+1} (S28).
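Steps S22 to S26 can be illustrated as follows. The text does not say how the distribution of a covariance is obtained, so this sketch assumes one possible reading: covariances are computed over short rolling sub-windows of the stored series, their mean and standard deviation stand in for θ, and the covariance of the newest sub-window (which includes the new sample) is flagged when it lies more than `threshold` standard deviations away. All names and the sub-window approach are assumptions.

```python
from itertools import combinations
from statistics import mean, pstdev


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def rolling_covariances(xs, ys, sub):
    """Covariance over every run of `sub` consecutive samples (S22)."""
    return [covariance(xs[i:i + sub], ys[i:i + sub])
            for i in range(len(xs) - sub + 1)]


def abnormal_pairs(series, new_samples, sub, threshold):
    """Return the pairs of series whose newest covariance is an outlier.

    `series` maps a name to its stored past values; `new_samples` maps the
    same names to the newly collected value.
    """
    flagged = []
    for a, b in combinations(sorted(series), 2):
        past = rolling_covariances(series[a], series[b], sub)
        mu, sigma = mean(past), pstdev(past)            # S23: summary of past covariances
        xs = (series[a] + [new_samples[a]])[-sub:]
        ys = (series[b] + [new_samples[b]])[-sub:]
        new_cov = covariance(xs, ys)                    # S24: covariance with new data
        if sigma > 0:
            deviation = abs(new_cov - mu) / sigma       # S25: difference from the past
        else:
            deviation = 0.0 if new_cov == mu else float("inf")
        if deviation > threshold:                       # S26: assumed error condition
            flagged.append((a, b))
    return flagged
```

The appeal of judging covariances rather than raw values is that two series which normally rise and fall together are flagged when they stop doing so, even if each value taken alone still looks ordinary.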

When the input monitoring information is a single Integer, IPAddress, or PhysicalAddress, the state of the network is diagnosed with the integer-type information determination function shown in FIG. 7, following the operation flow shown in FIG. 8. This function judges whether the value of the collected monitoring information is normal. The monitoring information A collected by the monitoring information collection unit 101 is compared with the error condition in the monitoring rule DB 105 (S30, S31). If the result is true (abnormal), the collection monitoring information determination unit 104 is notified of the abnormality and the abnormality is stored in the log storage unit 106 (S32). If the result is false (normal), the collection monitoring information determination unit 104 is notified that the state is normal (S33).

When the input monitoring information consists of multiple IPAddress or PhysicalAddress values, the state of the network is diagnosed with the array-type information determination function shown in FIG. 9, following the operation flow shown in FIG. 10. This function judges whether the logical connections expressed by monitoring information collected from a plurality of network devices (for example, IP routing tables or L2 forwarding tables) are normal. As illustrated in FIG. 11, the table combining device 14 merges the tables of monitoring information A[x], B[x], and C[x] collected by the monitoring information collection unit 101 into a single table (the combined table Ω) that lists, for each destination, the forwarding target at each network device (S40). The abnormality determination device 12 refers to the configuration information DB 13 and inspects the route for each destination (S41). This route inspection can detect both loops and missing routes.

The inspection method is explained using IP routing table routes as an example, with reference to FIG. 11. For the route to Dest1 in the combined table Ω, network device A forwards to interface A-1. The configuration information DB 13 shows that interface A-1 belongs to network device A itself, so this route is judged normal. Next, for the route to Dest1, network device B forwards to interface A-2. The configuration information DB 13 shows that interface A-2 is an interface of network device A, and network device A has already been inspected for Dest1, so this route is also judged normal.

Next, for the route to Dest1, network device C forwards to network device A, just as network device B does, so this route is also judged normal. All routes to Dest1 are therefore judged normal.

For Dest2 in the combined table Ω, network device A forwards packets to interface B-2. The configuration information DB 13 is searched for the network device that has interface B-2, which turns out to be network device B. In the combined table Ω, network device B forwards packets for Dest2 to interface B-1, and interface B-1 belongs to network device B itself, so the route is judged normal. The inspection then moves on to the next network device for Dest2 in the combined table Ω, network device C.

Network device C has no route for Dest2, so a no-route error is recorded. This error information is retained until all routes have been inspected. For Dest3 in the combined table Ω, network device A forwards packets to interface C-3.

The configuration information DB 13 is then searched for the network device that has interface C-3, which turns out to be network device C. In the combined table Ω, network device C forwards packets for Dest3 to interface A-3, and interface A-3 is found to belong to network device A. Network device A has already been visited on the route for Dest3, so a loop error is detected on this route. This error information is retained until all routes have been inspected (S42).

The inspection then moves on to network device B, which has not yet been inspected. Network device B forwards packets to interface C-2. The configuration information DB 13 shows that interface C-2 belongs to network device C, and a loop error has already been detected at network device C for Dest3, so the route inspection ends. When the route inspection is complete (S43), the collection monitoring information determination unit 104 is notified of the abnormality together with the errors detected during the inspection, and the abnormality information is stored in the log storage unit 106 (S44).

Although this description uses route inspection of IP routing tables as its example, the same technique can be applied to route inspection of MAC forwarding tables such as those of Ethernet (registered trademark).
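The route inspection walked through above can be sketched as a per-destination traversal of the combined table. The table contents below are reconstructed from the walkthrough; the interface that network device C uses for Dest1 is not named in the text, so "A-3" there is a placeholder. The function name, the dictionary layout, and the verdict strings are hypothetical.

```python
def check_routes(table, owner):
    """Inspect every destination in the combined table.

    `table` maps destination -> {device: outgoing interface or None};
    `owner` maps interface -> owning device (the configuration information).
    Returns destination -> {device: "ok" | "no_route" | "loop"}.
    """
    results = {}
    for dest, hops in table.items():
        verdicts = {}
        for start in hops:
            path, device = [], start
            while True:
                if device in verdicts:        # already inspected: reuse its verdict
                    verdict = verdicts[device]
                    break
                if device in path:            # revisited on this path: loop
                    verdict = "loop"
                    break
                path.append(device)
                interface = hops.get(device)
                if interface is None:         # device has no entry: no route
                    verdict = "no_route"
                    break
                next_device = owner[interface]
                if next_device == device:     # interface on the device itself: normal
                    verdict = "ok"
                    break
                device = next_device
            for visited in path:              # every device on the path shares the verdict
                verdicts[visited] = verdict
        results[dest] = verdicts
    return results


# Configuration information: which device owns which interface.
OWNER = {"A-1": "A", "A-2": "A", "A-3": "A",
         "B-1": "B", "B-2": "B", "C-2": "C", "C-3": "C"}

# Combined table reconstructed from the walkthrough (Dest1/C is a placeholder).
COMBINED = {
    "Dest1": {"A": "A-1", "B": "A-2", "C": "A-3"},
    "Dest2": {"A": "B-2", "B": "B-1", "C": None},
    "Dest3": {"A": "C-3", "B": "C-2", "C": "A-3"},
}
```

Calling `check_routes(COMBINED, OWNER)` reproduces the three outcomes of the walkthrough: Dest1 is normal everywhere, Dest2 has a no-route error at network device C, and Dest3 has a loop. One design choice differs slightly from the prose: every device on a failing path inherits the error, which is why network device B is also reported as part of the Dest3 loop.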

Next, the way these four determination functions are combined is described. FIG. 12 is a table listing the properties of the four determination functions 103a to 103d. The time-series information determination function and the multiple time-series information determination function are classified as pre-discovery means, while the integer-type information determination function and the array-type information determination function are classified as post-discovery means. The pre-discovery means apply statistical and correlation processing to the monitoring information and detect previously unseen patterns as signs of abnormality. Because signs can be detected, the monitoring system can detect an abnormality in advance; on the other hand, it may also flag cases in which no abnormality subsequently occurs, so detection accuracy is low.

The post-discovery means, by contrast, make their determination from real-time monitoring information supplied by the network devices. Detection therefore happens only after an abnormality has actually occurred in a network device, but detection accuracy is high. Given these characteristics, placing the pre-discovery means on the upstream side of the rule DB tree structure and the post-discovery means on the downstream side makes it possible to find a sign of failure quickly and then to confirm, promptly and for many kinds of failure, whether that sign really corresponds to a failure.

The operation of the network monitoring system, which monitors the state of the network by combining the pre-discovery means and the post-discovery means, is described below. FIG. 13 is a flowchart showing the operating procedure of the network monitoring system 100 shown in FIG. 1. First, the procedure by which monitoring information is sequentially collected from the network devices and judged in response to events is described with reference to FIGS. 1 and 13.

The monitoring information collection unit 101 reads the initial monitoring information (the monitoring information whose collection starts first in the monitoring rule DB) from the monitoring rule DB 105 (S200), and begins collecting monitoring information a (see FIG. 2) by SNMP polling at the interval specified in the monitoring rule DB (S201).

The monitoring information collected from the network device 111 is passed to the monitoring information determination unit 102, which selects the appropriate determination function from the determination function unit 103 according to the data type of the monitoring information and judges the monitoring information (S202). The appropriate function is selected according to the data type indicated for the MIB object name shown in FIG. 2. The response of the monitoring information determination unit is then compared with the judgment value in the monitoring rule DB; the state is judged normal if the response is smaller than the judgment value and abnormal if it is larger (S203).
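Steps S202 and S203 amount to a dispatch on data type followed by a comparison against the judgment value. The sketch below illustrates this under stated assumptions: the type labels "integer" and "time_series", the two scoring functions, and the name is_abnormal are hypothetical, and the time-series score used here (deviation of the latest sample from the mean of earlier samples) is only a placeholder for whatever statistic the determination function actually computes. The text does not say how a response equal to the judgment value is treated; this sketch treats it as normal.

```python
from typing import Callable, Dict, Sequence


def judge_integer(samples: Sequence[float]) -> float:
    """Integer-type judgment: the response is simply the latest value."""
    return float(samples[-1])


def judge_time_series(samples: Sequence[float]) -> float:
    """Placeholder time-series judgment: distance of the latest sample
    from the mean of the earlier samples (assumed statistic)."""
    history = samples[:-1]
    mean = sum(history) / len(history)
    return abs(samples[-1] - mean)


# Hypothetical mapping from the data type recorded for a MIB object
# to the determination function that handles it.
JUDGES: Dict[str, Callable[[Sequence[float]], float]] = {
    "integer": judge_integer,
    "time_series": judge_time_series,
}


def is_abnormal(data_type: str, samples: Sequence[float], judgment_value: float) -> bool:
    """S202: pick the function by data type. S203: compare with the judgment value."""
    response = JUDGES[data_type](samples)
    return response > judgment_value  # larger means abnormal, otherwise normal
```

Keeping the mapping in a table means a further determination function, such as one for array-type information, could be registered without changing is_abnormal.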

If the result is abnormal, the collection monitoring information determination unit 104 consults the rule DB for the child monitoring tree number of the abnormal monitoring information a and notifies the monitoring information collection unit to start collecting monitoring information b (see FIG. 2), which has child monitoring tree number 1.1 (S205). On receiving the notification, the monitoring information collection unit 101 collects monitoring information b at the interval specified in the monitoring rule DB (S201), and the judgment of these items of monitoring information is then repeated in the same way. Each item of monitoring information holds an alarm state; collection and judgment continue with the alarm state of the parent monitoring information a left in the error state. Even if the value of the monitoring information later evaluates to false against the judgment value in the monitoring rule DB, the alarm state remains in the error state.
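The escalation step (S205) and the latched alarm state can be illustrated with a small data structure. This is a sketch under assumptions: the Rule and Monitor classes, the on_judgment method, and the chain a, b, c are hypothetical, and the actual layout of FIG. 2 is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Rule:
    """One item of monitoring information and its children in the rule tree."""
    name: str
    children: List[str] = field(default_factory=list)


# Hypothetical rule tree: a is the initial monitoring information.
RULES: Dict[str, Rule] = {
    "a": Rule("a", ["b"]),
    "b": Rule("b", ["c"]),
    "c": Rule("c"),
}


class Monitor:
    def __init__(self, rules: Dict[str, Rule], initial: str) -> None:
        self.rules = rules
        self.active: List[str] = [initial]  # items currently being collected
        self.alarms: Set[str] = set()       # latched alarm states

    def on_judgment(self, name: str, abnormal: bool) -> None:
        """Handle one judgment result for the item `name`."""
        if not abnormal:
            # A later normal judgment does not clear a latched alarm here;
            # release is a separate procedure.
            return
        self.alarms.add(name)
        for child in self.rules[name].children:
            if child not in self.active:
                self.active.append(child)  # S205: start collecting the child
```

Because on_judgment ignores normal results, an item that has raised an alarm stays in the error state while its children are being collected, as the paragraph above requires.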

Next, the procedure for releasing an active alarm is described with reference to FIGS. 1 and 13. Alarm release begins when the state of the network changes, for example because the network operator has dealt with the problem or the network's self-healing function has acted, and the judgment result for the monitoring information being monitored changes as a consequence.

If the response of the monitoring information determination unit 102 is normal in S203, the collection monitoring information determination unit 104 determines whether the lowest-layer item among the monitoring information currently being monitored is the initial monitoring information (S204). If it is the initial monitoring information (that is, monitoring information a in FIG. 2), no alarm is active, so the collection monitoring information determination unit 104 does nothing. If in S204 the lowest-layer item is not the initial monitoring information (that is, it is monitoring information b or monitoring information c in FIG. 2), the unit notifies the monitoring information collection unit to stop collecting the monitoring information currently being monitored (S206).

Next, if the parent of the monitoring information that was being monitored is in the alarm state, the judgment result of the monitoring information determination unit 102 for that direct parent is watched (S207). While the judgment result is true (abnormal), it continues to be watched until it becomes false (normal) (S207). Once the judgment result becomes false (normal), it is determined whether that monitoring information is the initial monitoring information (S204), and the operations from S206 onward continue until the lowest layer of the monitored information is the initial monitoring information, that is, until all alarms have been released.

After all alarms have been released, only the initial monitoring information is monitored, and the same operation is repeated if an abnormality occurs in the initial monitoring information again. Thanks to this alarm release operation, even if the pre-discovery means falsely detects an abnormality, the system returns to the initial state and can continue normal monitoring.
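The release procedure (S204, S206, S207) can be read as unwinding the chain of monitored items from the deepest one back toward the initial monitoring information. The following is one possible reading, sketched under assumptions: the function release_step, the root-first list chain, and the dictionary latest_abnormal holding the most recent judgment per item are all hypothetical.

```python
from typing import Dict, List, Set


def release_step(chain: List[str], alarms: Set[str],
                 latest_abnormal: Dict[str, bool]) -> List[str]:
    """Unwind the monitored chain while the deepest item judges normal.

    chain           -- monitored items, initial monitoring information first
    alarms          -- latched alarm states, modified in place
    latest_abnormal -- most recent judgment per item (True = abnormal)
    """
    # S204/S206: while the deepest item is not the initial one and is normal,
    # stop collecting it; its parent then becomes the item to watch (S207).
    while len(chain) > 1 and not latest_abnormal[chain[-1]]:
        stopped = chain.pop()
        alarms.discard(stopped)
    # Only the initial monitoring information is left and it is normal:
    # every alarm has been released.
    if len(chain) == 1 and not latest_abnormal[chain[0]]:
        alarms.discard(chain[0])
    return chain
```

Calling release_step after each polling round leaves the chain untouched while a parent still judges abnormal, and collapses it to the initial monitoring information once every level has returned to normal.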

In the embodiment described above, dynamic collection of monitoring information using the monitoring rule DB keeps the load that collection places on the network to a minimum. Placing the pre-discovery means upstream in the tree structure of the monitoring rule DB lets the network administrator find signs of failure quickly, and placing the post-discovery means downstream lets the administrator judge at once what caused a failure and how far its effects extend.

Although the embodiment has been described for the case in which the pre-discovery means are placed upstream and the post-discovery means downstream in the tree structure of the monitoring rule DB, the present invention is not limited to this, and the monitoring rule DB can be constructed in any form.

Next, examples of the present invention are described. The examples below describe in detail how a network administrator might construct the monitoring rule DB 105 for failure monitoring, and how the system operates with it. FIG. 14 shows the network configuration used in the first, second, and third examples. As shown in FIG. 14, the network consists of routers R1 to R3 together with clients H1 and H2, streaming servers H3 and H4, and a hub HUB, which belong to the local networks L1 to L3. The links connecting them are assumed to be 100 Mbit/s Fast Ethernet (registered trademark).

The network monitoring apparatus 100 monitors the network equipment consisting of routers R1 to R3. It is connected to router R2 and can reach each of the other routers through router R2.

The first example of the present invention is described below with reference to FIGS. 14 and 15. FIG. 15 shows the tree describing how the items of monitoring information in the monitoring rule DB of the network monitoring apparatus 100 are linked in the first example. As shown in FIG. 15, the procedure in the first example is to detect a sudden increase in traffic (sign discovery), monitor for any associated packet-loss failure (failure discovery), and, if a failure has occurred, identify the route (interface) whose traffic is causing it (cause identification).

As initial monitoring information, the network monitoring apparatus 100 acquires the MIB information ifOutOctes (M1, M2, M3), the output traffic volume on each router's interface to its local network, and monitors it with the time-series information determination function. Suppose that, while streaming server H3 is streaming to client H1 at 20 Mbit/s, streaming server H4 starts streaming at 60 Mbit/s. When delivery from streaming server H4 begins, a sudden increase in traffic is detected in monitoring information M1.
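As a rough illustration of how a sudden increase might be flagged from a polled octet counter, the sketch below converts counter deltas to a rate and compares the latest rate with the mean of earlier rates plus a multiple of their standard deviation. The statistic, the parameters k and min_sigma, and the function names are assumptions made for this sketch; this section does not specify the statistical processing used by the time-series information determination function. Counter wrap-around is ignored. The figure of 80 Mbit/s in the usage note assumes both streams leave through the interface observed by M1.

```python
import statistics
from typing import Sequence


def rate_mbit_s(prev_octets: int, curr_octets: int, interval_s: float) -> float:
    """Average rate in Mbit/s between two polls of an octet counter."""
    return (curr_octets - prev_octets) * 8 / (interval_s * 1e6)


def is_sudden_increase(history: Sequence[float], latest: float,
                       k: float = 3.0, min_sigma: float = 1.0) -> bool:
    """Flag `latest` if it exceeds mean + k * sigma of the earlier rates.

    min_sigma keeps a very steady history from making the test
    oversensitive (an assumption of this sketch).
    """
    mean = statistics.mean(history)
    sigma = max(statistics.pstdev(history), min_sigma)
    return latest > mean + k * sigma
```

With a history close to 20 Mbit/s, a latest rate of 80 Mbit/s is flagged, while a rate of 20.5 Mbit/s is not.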

Because monitoring information M1 has become abnormal, the network monitoring apparatus 100 moves on to the next monitoring information, packet loss: it acquires the interface MIB information ifOutDiscard (M11) and the router MIB information ipOutDiscard (M12), which correspond to child monitoring tree numbers 1.1 and 1.2 in FIG. 2, and monitors them with the integer-type information determination function.

If packet loss at or above the threshold is detected in either item, the network monitoring apparatus 100 next examines the input traffic volume from routers R2 and R3 by starting to monitor the MIB information ipInOctes (M111, M112) of the respective interfaces with the integer-type information determination function.

Because the traffic from streaming server H4 is 60 Mbit/s, it exceeds the 50 Mbit/s threshold preset for monitoring information M112 in the rule DB and is detected as abnormal. The network administrator can therefore see that the main cause of the packet-loss failure is the traffic entering interface IF: 192.168.31.2/24.
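The cause-identification step reduces to comparing each input interface's rate with its preset threshold. In the sketch below, only the 50 Mbit/s threshold for M112 and the 60 Mbit/s rate come from the text; the threshold for M111, the 20 Mbit/s rate assigned to M111, and the function offending_inputs are assumptions made for illustration.

```python
from typing import Dict, List

# Thresholds in Mbit/s. The value for M112 is stated in the text;
# the value for M111 is assumed to be the same.
THRESHOLD_MBIT_S: Dict[str, float] = {"M111": 50.0, "M112": 50.0}


def offending_inputs(observed_mbit_s: Dict[str, float]) -> List[str]:
    """Return the monitoring items whose input rate exceeds its threshold."""
    return sorted(
        name for name, rate in observed_mbit_s.items()
        if rate > THRESHOLD_MBIT_S[name]
    )


# Assumed observation: M111 carries 20 Mbit/s, M112 carries the
# 60 Mbit/s stream from streaming server H4.
suspects = offending_inputs({"M111": 20.0, "M112": 60.0})
```

Here suspects contains only M112, which matches the conclusion that the traffic arriving on that interface is the main cause of the packet loss.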

In FIG. 15, the IFID is in some places identical to the NodeID. This means that the router, rather than an interface, is checked; the same applies to FIGS. 16 and 17 below.

Next, the second example of the present invention is described with reference to FIGS. 14 and 16. FIG. 16 shows the tree describing how the monitoring information in the monitoring rule DB of the network monitoring apparatus 100 is linked in the second example. As shown in FIG. 16, the procedure in the second example is to detect an increasing trend in packets discarded because of errors (sign discovery), then inspect the routing table held by each router (failure discovery), and, if a route failure has occurred, report the failed route and check whether the route failure was caused by route discarding by a routing protocol (failure cause identification).

As initial monitoring information, the network monitoring apparatus 100 acquires the MIB information icmpOutTimeExcds (M4, M5, M6), the number of packets each router discarded because the TTL (Time To Live) value reached 0, and the MIB information icmpOutDestUnreach (M7, M8, M9), the number of packets discarded because no route existed, and monitors them with the time-series information determination function.

The TTL value is carried in the header of each transmitted IP packet. It is decremented by 1 each time the packet passes through a router, and when it reaches 0 the router handling the packet at that point discards it.
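The effect of the TTL rule on a looping route can be seen in a toy simulation. The function forward, the next_hop dictionaries, and the convention of treating the destination as a router name are simplifications assumed for this sketch.

```python
from typing import Dict, Tuple


def forward(next_hop: Dict[str, str], start: str, dest: str,
            ttl: int) -> Tuple[str, str, int]:
    """Forward one packet hop by hop.

    Returns (outcome, router, hops), where outcome is "delivered" or
    "discarded" and router is where the packet ended up.
    """
    router = start
    hops = 0
    while True:
        if router == dest:
            return ("delivered", router, hops)
        ttl -= 1                      # each router decrements the TTL by 1
        if ttl == 0:
            return ("discarded", router, hops)  # this router drops the packet
        router = next_hop[router]
        hops += 1


# A loop between R1 and R2: the packet never reaches R3 and is
# eventually discarded when its TTL runs out.
looped = forward({"R1": "R2", "R2": "R1"}, start="R1", dest="R3", ttl=4)

# A loop-free path delivers the packet before the TTL runs out.
direct = forward({"R1": "R2", "R2": "R3"}, start="R1", dest="R3", ttl=4)
```

Every packet caught in the loop ends in a discard at one of the looping routers, which is why a rise in the TTL-exceeded counters of those routers serves as a sign of a routing loop.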

Suppose that, in an environment where several routing protocols such as OSPF (Open Shortest Path First) and RIP (Routing Information Protocol) run on each router, a routing loop arises between routers R1 and R2 because different routers adopted routes from different routing protocols when building their routing tables. The network monitoring apparatus 100 then detects a sharp increase in the number of discarded packets in monitoring information M4 and M5. Because M4 and M5 have become abnormal, the network monitoring apparatus 100 moves on to the next monitoring information, route inspection: it acquires the MIB information ipRouteEntry (M41), the routers' route information, from all routers and inspects the routes with the array-type information determination function.

When this inspection finds a loop and identifies its location, the administrator can take appropriate action based on that location information. Next, to determine whether the loop was caused by route discarding, the network monitoring apparatus 100 inspects ipRouteDiscard (M411, M412, M413), the number of routes each router has discarded, with the integer-type information determination function. In this case the loop was caused by adopting routes from different protocols, so monitoring information M411 and M412 do not become abnormal.

If, on the other hand, a routing protocol fault has caused routes to be deleted from a routing table, one of monitoring information M7, M8, and M9 becomes abnormal, the route inspection for monitoring information M41 detects a missing route, and then one of monitoring information M411, M412, and M413 becomes abnormal. The administrator can therefore quickly find out on which router the routing protocol fault has occurred.

Next, the third example of the present invention is described with reference to FIGS. 14 and 17. FIG. 17 shows the tree describing how the monitoring information in the monitoring rule DB of the network monitoring apparatus 100 is linked in the third example. As shown in FIG. 17, the procedure in the third example is to detect an increasing trend in the number of otherwise normal packets that are discarded (sign discovery), monitor for a CPU overload failure or temperature failure that could lead to packet discarding (failure discovery), and then, if the CPU is overloaded, check whether a process has run away, or, if the temperature is abnormal, check the state of the fans (failure cause identification).

As initial monitoring information, the network monitoring apparatus 100 acquires the MIB information ifOutDiscard (MM10, MM12, MM14), the number of normal packets discarded at each interface, and the MIB information ipOutDiscard (MM11, MM13, MM15), the number of normal packets discarded at each router, and monitors them with the time-series information determination function.

Suppose that a runaway protocol process running in router R1 overloads the CPU. Because of the overload, the routing protocol stops working correctly, and packets for routes not in the current routing table are discarded at R1. The network monitoring apparatus 100 then detects a gradual increase in the number of normal packets discarded in monitoring information MM10 or MM11.

Because monitoring information MM10 or MM11 has become abnormal, the network monitoring apparatus 100 moves on to the next monitoring information, CPU overload and temperature abnormality: it acquires the MIB information cpmCPUTotal5sec (MM101), the router's CPU utilization, and the MIB information ciscoEnvMonTemperatureStatusValue (MM111), the temperature state, and inspects each with the integer-type information determination function.

If this inspection finds the CPU utilization above the threshold for monitoring information MM101, the network monitoring apparatus 100 regards a failure as having occurred. To find out which process is responsible, it then acquires the MIB information cpmProcessAverageUSecs (MM1011), the CPU occupancy of each process, and inspects it with the integer-type information determination function. If a value exceeds the threshold for monitoring information MM1011, the network monitoring apparatus 100 regards that process as abnormal and notifies the administrator of its process ID.

The administrator can thus quickly find out which process is abnormal. Likewise, if a temperature failure occurs, the same operation can quickly tell the administrator which fan is responsible.

The operation flows shown in the embodiment and examples above can of course be implemented by recording the operating procedure in advance as a program on a recording medium such as a ROM and having a computer (CPU) read and execute it.

FIG. 1 is a block diagram showing the configuration of the network management system and of the monitored network in the embodiment of the present invention.
FIG. 2 is a diagram showing an example of the monitoring rules in the monitoring rule DB used by the network management system in the embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of the time-series information determination function, one of the determination functions in the embodiment of the present invention.
FIG. 4 is a diagram showing the operation flow of FIG. 3.
FIG. 5 is a block diagram showing the configuration of the multiple time-series information determination function, one of the determination functions in the embodiment of the present invention.
FIG. 6 is a diagram showing the operation flow of FIG. 5.
FIG. 7 is a block diagram showing the configuration of the integer-type information determination function, one of the determination functions in the embodiment of the present invention.
FIG. 8 is a diagram showing the operation flow of FIG. 7.
FIG. 9 is a block diagram showing the configuration of the array-type information determination function, one of the determination functions in the embodiment of the present invention.
FIG. 10 is a diagram showing the operation flow of FIG. 9.
FIG. 11 is a diagram showing the processing flow of the array-type determination function in the embodiment of the present invention.
FIG. 12 is a diagram showing the characteristics of each determination function in the embodiment of the present invention.
FIG. 13 is a flowchart showing the operation flow of the network management system in the embodiment of the present invention.
FIG. 14 is a block diagram showing the example network configuration used to describe the examples of the present invention.
FIG. 15 is a diagram showing the monitoring rules in the first example of the present invention.
FIG. 16 is a diagram showing the monitoring rules in the second example of the present invention.
FIG. 17 is a diagram showing the monitoring rules in the third example of the present invention.

Explanation of symbols

100 Network monitoring system
101 Monitoring information collection unit
102 Monitoring information determination unit
103 Determination function unit
104 Collection monitoring information determination unit
105 Monitoring rule DB (database)
106 Log storage unit
120 Monitoring site
121 Monitoring terminal

Claims

A monitoring system for collecting and monitoring information on a plurality of network devices,
Monitoring rule storage means for storing in advance initial monitoring information to be collected from each of the network devices and related monitoring information as monitoring rules;
Sign detection means for detecting a sign of failure by processing initial monitoring information collected from the network device;
Collection monitoring information determination means for, in response to detection of a sign by the sign detection means, searching the monitoring rule storage means for monitoring information that is related to the initial monitoring information and identifies the cause of the failure, and collecting the retrieved monitoring information;
Post-discovery means for performing detailed failure determination processing based on the monitoring information collected by the collection monitoring information determination means;
A network monitoring system comprising:

The network monitoring system according to claim 1, wherein the initial monitoring information is time-series monitoring information that changes over time, and
the sign detection means comprises means for statistically processing the time-series monitoring information collected so far, and means for detecting a sign of failure by comparing the result of the statistical processing with the latest collected information.

The network monitoring system according to claim 1, wherein the initial monitoring information is time-series monitoring information that changes over time, and
the sign detection means comprises means for statistically processing the correlation among a plurality of items of time-series monitoring information collected so far, and means for detecting a sign of failure by comparing the result of the statistical processing with the latest collected information.

The network monitoring system according to any one of claims 1 to 3, wherein the related monitoring information is route information held by the network device, and the post-discovery means inspects the route information to confirm the normality of the route.

The network monitoring system according to any one of claims 1 to 3, wherein the related monitoring information is integer type information (integer type monitoring information), and the post-discovery means determines an integer value.

The network monitoring system according to any one of claims 1 to 5, wherein the monitoring information related to the initial monitoring information stored in the storage means forms a tree structure of sequentially related monitoring information, the collection monitoring information determination means sequentially searches the tree structure for more detailed related monitoring information and decides on its collection, and the post-discovery means performs detailed failure determination processing based on the collected monitoring information.

The network monitoring system according to claim 1, wherein the monitoring information is collected using SNMP (Simple Network Management Protocol), and the post-discovery means selects the determination processing function for the determination processing based on the data format defined in the MIB (Management Information Base).

The network monitoring system according to claim 1, further comprising means for, when the determination result for monitoring information whose collection was instructed by the collection monitoring information determination means is normal, terminating collection of that monitoring information and releasing the abnormal state of the monitoring information that triggered its monitoring.

A monitoring method for collecting and monitoring information of a plurality of network devices,
Providing monitoring rule storage means that stores in advance, as monitoring rules, the initial monitoring information to be collected from each of the network devices and the monitoring information related to it;
A sign detection step of detecting a sign of failure by processing the initial monitoring information collected from the network device;
A collection monitoring information determination step of, in response to detection of a sign in the sign detection step, searching the monitoring rule storage means for monitoring information that is related to the initial monitoring information and identifies the cause of the failure, and collecting the retrieved monitoring information;
A post-discovery step of performing detailed failure determination processing based on the monitoring information collected in the collection monitoring information determination step;
A network monitoring method comprising:
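
The three steps of the method can be sketched as one monitoring cycle. Everything named here (`MONITORING_RULES`, `run_monitoring_cycle`, the item names, the threshold of 1000) is a hypothetical illustration under assumed inputs, not the implementation described in the patent.

```python
from typing import Callable, Dict, List

# Monitoring rules: initial monitoring information -> related monitoring information.
# The item names are assumptions chosen for illustration.
MONITORING_RULES: Dict[str, List[str]] = {
    "if_in_octets": ["if_oper_status", "if_in_errors"],
}


def run_monitoring_cycle(
    item: str,
    value: float,
    detect_sign: Callable[[str, float], bool],
    collect: Callable[[str], float],
    judge: Callable[[str, float], bool],
    rules: Dict[str, List[str]] = MONITORING_RULES,
) -> Dict[str, bool]:
    """Return {} when no sign is found; otherwise map each related item
    to True if it is judged abnormal."""
    if not detect_sign(item, value):                         # sign detection step
        return {}
    related = rules.get(item, [])                            # search the stored rules
    collected = {name: collect(name) for name in related}    # collect related items
    return {name: judge(name, v) for name, v in collected.items()}  # post-discovery step


def judge(name: str, v: float) -> bool:
    """Toy determination: link status must be 1, error counts must be 0."""
    if name == "if_oper_status":
        return v != 1
    return v > 0


device_values = {"if_oper_status": 2, "if_in_errors": 0}
result = run_monitoring_cycle(
    "if_in_octets", 5000, lambda i, v: v > 1000, device_values.__getitem__, judge
)
```

Related items are collected only after a sign is found, so routine polling in this sketch is limited to the initial monitoring information.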

The network monitoring method according to claim 9, wherein the initial monitoring information is time-series monitoring information that changes over time, and the sign detection step includes a step of statistically processing the time-series monitoring information collected so far, and a step of detecting a sign of failure by comparing the result of the statistical processing with the most recently collected information.
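
The claim does not say which statistic is used. As one possible reading, the sketch below assumes a mean and standard deviation over the history and flags the latest sample when it lies more than `k` standard deviations from the mean; the function name and the default `k = 3.0` are assumptions.

```python
import statistics
from typing import Sequence


def detect_sign(history: Sequence[float], latest: float, k: float = 3.0) -> bool:
    """Compare the latest sample with statistics of the samples collected so far."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean  # constant history: any change is a deviation
    return abs(latest - mean) > k * stdev


history = [100, 102, 98, 101, 99]
quiet = detect_sign(history, 101)
spike = detect_sign(history, 150)
```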

The network monitoring method according to claim 9, wherein the initial monitoring information is time-series monitoring information that changes over time, and the sign detection step includes a step of statistically processing the correlation among a plurality of items of time-series monitoring information collected so far, and a step of detecting a sign of failure by comparing the result of the statistical processing with the most recently collected information.
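
For the correlation variant, one assumed interpretation is to fit a straight line between two metrics over their shared history and flag the latest pair when it falls far from that line. The least-squares fit, the function name, and the threshold `k` are illustrative choices, not details given in the claim.

```python
import statistics
from typing import Sequence


def detect_correlation_break(
    xs: Sequence[float], ys: Sequence[float],
    latest_x: float, latest_y: float, k: float = 3.0,
) -> bool:
    """Fit y = slope * x + intercept to the paired history, then flag the latest
    pair if its residual exceeds k times the spread of the historical residuals."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three paired samples")
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x history is constant; no line can be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    latest_residual = abs(latest_y - (slope * latest_x + intercept))
    if spread == 0:
        return latest_residual > 1e-9  # perfectly linear history
    return latest_residual > k * spread


# Two metrics that normally move together (values are made up).
in_traffic = [10, 20, 30, 40, 50]
out_traffic = [11, 19, 31, 39, 51]
consistent = detect_correlation_break(in_traffic, out_traffic, 60, 61)
broken = detect_correlation_break(in_traffic, out_traffic, 60, 20)
```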

The network monitoring method according to any one of claims 9 to 11, wherein the related monitoring information is route information held by the network device, and the post-discovery step confirms that the route is normal by examining the route information.
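
The claim does not specify how the route information is examined. A minimal sketch, assuming the collected route table and an expected table are both available as prefix-to-next-hop mappings (a hypothetical representation), could compare the two:

```python
import ipaddress
from typing import Dict, List


def check_routes(route_table: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Compare collected routes (prefix -> next hop) with the expected ones
    and return a description of each discrepancy."""
    collected = {
        ipaddress.ip_network(prefix): ipaddress.ip_address(hop)
        for prefix, hop in route_table.items()
    }
    problems = []
    for prefix, hop in expected.items():
        net = ipaddress.ip_network(prefix)
        want = ipaddress.ip_address(hop)
        got = collected.get(net)
        if got is None:
            problems.append(f"missing route to {net}")
        elif got != want:
            problems.append(f"route to {net} uses next hop {got}, expected {want}")
    return problems


# Illustrative data using documentation and private address ranges.
collected_routes = {"10.0.0.0/24": "192.0.2.1", "10.0.1.0/24": "192.0.2.9"}
expected_routes = {
    "10.0.0.0/24": "192.0.2.1",
    "10.0.1.0/24": "192.0.2.2",
    "10.0.2.0/24": "192.0.2.3",
}
problems = check_routes(collected_routes, expected_routes)
```

An empty result would correspond to the route being judged normal.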

The network monitoring method according to claim 9, wherein the related monitoring information is integer-type monitoring information, and the post-discovery step makes its determination on the integer value.

The network monitoring method according to any one of claims 9 to 13, wherein the monitoring information related to the initial monitoring information stored in the storage means forms a tree structure of sequentially related monitoring information, the collection monitoring information determination step sequentially searches the tree structure for progressively more detailed related monitoring information and determines which monitoring information to collect, and the post-discovery step performs failure detail determination processing based on the collected monitoring information.

The network monitoring method according to claim 9, wherein the monitoring information is collected using SNMP (Simple Network Management Protocol), and the post-discovery step selects the determination processing function to apply based on the data format defined in the MIB (Management Information Base).

The network monitoring method according to claim 9, further comprising a step of, when the determination result for the monitoring information whose collection was instructed in the collection monitoring information determination step is normal, terminating collection of that monitoring information and releasing the abnormal state of the monitoring information that triggered its monitoring.

A computer-readable program for causing a computer to execute a monitoring method for collecting and monitoring information from a plurality of network devices, the program comprising:
A process of detecting a sign of failure by processing initial monitoring information collected from the network devices;
A process of, in response to detection of the sign, searching a monitoring rule storage means for monitoring information that is related to the initial monitoring information and identifies the cause of the failure, and collecting the retrieved monitoring information; and
A post-discovery process of determining failure details based on the related monitoring information.