JP2014053658A - Failure site estimation system and failure site estimation program

Info

Publication number: JP2014053658A
Application number: JP2012194743A
Authority: JP
Inventors: Taro Shibahara; 太郎芝原; Kenji Suzuki; 賢治鈴木
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2012-09-05
Filing date: 2012-09-05
Publication date: 2014-03-20

Abstract

PROBLEM TO BE SOLVED: To make it possible to rapidly estimate and narrow down the suspected failure site when a failure, including a logical failure, occurs in a large-scale network system.
SOLUTION: A failure site estimation system comprises: a route information acquisition unit 110 that, while a monitoring target NW 300 is normal, acquires route information composed of the interfaces on the communication route to each node and stores it in a route information DB 130; a hop-by-hop polling unit 121 that, when a failure occurs in the monitoring target NW 300, acquires from the route information DB 130 the route information for each route to a failed node and sequentially polls each interface included in the route information to collect OK or NG results; a suspect pair extraction unit 122 that obtains a suspect pair set by extracting, for each failed node, a suspect pair consisting of the nearest interface whose result is NG and the interface immediately before it whose result is OK; and a failure site output unit 123 that extracts a suspected failure site on the basis of the suspect pair set and the route information and outputs it.

Description

The present invention relates to network management technology, and in particular to a technique that is effective when applied to a failure site estimation system and a failure site estimation program for estimating the failure site when a network failure occurs.

Normally, when a network system is operated and managed, a network monitoring system or the like monitors for and detects failures and identifies the failure site. In general, a network monitoring system consists of software, systems, or devices that are provided or sold by vendors.

However, in a large-scale network system, when a failure occurs in a core network device, for example, other devices are also affected, and the number of network devices that the network monitoring system detects as failed often becomes temporarily enormous, which can make it difficult to identify the exact failure site. Identifying the failure site with a network monitoring system becomes even more difficult when the failed network device has not stopped completely, as it would after a hardware failure, but is in a “half-dead” state in which normal processing and error processing alternate.

In such cases, an engineer or developer who is familiar with the network system, such as an SE (System Engineer), usually analyzes and isolates the failure manually to identify the failure site. However, such failure analysis and failure site identification depend on the individual and are inefficient, and it often takes a long time before a countermeasure (for example, restarting a specific network device) is carried out.

As a mechanism for efficiently identifying a failure site in a network system, Japanese Patent Laid-Open No. 2006-229421 (Patent Document 1), for example, describes a technique that uses a hierarchical structure table. The table represents the topology of a tree-type network composed of branches and terminals as a hierarchy in which the root side of the tree is the upper layer and the tip side is the lower layer, each branch introduces the next lower layer, and each branch is associated with the terminals below it. When failures are detected in all lower-layer terminals from a given branch toward the tips of the tree, that branch is taken as the estimated failure location, which makes it easy to diagnose failures in parts of the network other than terminals.

Japanese Patent Laid-Open No. 2006-238052 (Patent Document 2) describes a technique that enables accurate and fast estimation of quality degradation locations. It comprises: a flow quality information collection unit that collects flow quality information, including the sender address, receiver address, and communication quality, for flows sent by network users; route information collection means that collects network configuration information; a flow quality / traversed link table management unit and a table storage unit that, based on the collected flow quality information and network configuration information, determine the links each flow traverses, judge whether the quality of the flow has degraded, and manage the results as a table; and a quality degradation location estimation unit that, when one or more flows in the managed table show quality degradation, considers the subsets of the set of links traversed by the degraded flows and outputs, as the quality degradation location, the subset with the fewest elements that contains, for any degraded flow, a link that the flow traverses.

Japanese Patent Laid-Open No. 2010-147595 (Patent Document 3) describes a technique that monitors failures at the network layer by delivery confirmation and quickly isolates the device causing a network failure. It comprises a network configuration DB storage unit that holds each managed device in association with route information indicating the managed devices on the route to it, and a network management unit that, when there is no response to a delivery confirmation, extracts the route information of the unresponsive managed device from the stored information, performs delivery confirmation on the managed devices in that route information, and identifies the managed device that did not respond as the failed device.

Patent Document 1: JP 2006-229421 A
Patent Document 2: JP 2006-238052 A
Patent Document 3: JP 2010-147595 A

With the technique described in Patent Document 1, the branch where a failure is located can be estimated from a tree-type network topology. However, this requires information on the network topology to be created in advance, for example with CAD, which can lack simplicity and flexibility when changes to the network configuration are taken into account. In addition, because a failure is judged by the presence or absence of a response to polling, a failure may not be judged accurately when a network device is in a “half-dead” state due to a logical failure or the like.

With the technique described in Patent Document 2, quality degradation of flows is judged on the basis of communication quality such as packet loss and delay, and the quality degradation location can be estimated from information on the set of links traversed by the set of degraded flows. However, a network failure may make it impossible to collect the quality information itself from devices at the edge of the network, so the estimation accuracy may not be maintained for some kinds of failure.

With the technique described in Patent Document 3, when there is no response to a delivery confirmation sent to a managed device, the device causing the failure can be identified by performing delivery confirmation on the managed devices on the route. However, because a failure is again judged by the presence or absence of a response to the delivery confirmation, a failure may not be judged accurately when a network device is in a “half-dead” state due to a logical failure or the like.

Accordingly, an object of the present invention is to provide a failure site estimation system and a failure site estimation program that make it possible, when a failure occurs in a large-scale network system, to quickly estimate and narrow down the suspected failure site, including when the network device that caused the failure is suffering from a logical failure. The above and other objects and novel features of the present invention will become apparent from the description in this specification and the accompanying drawings.

The following is a brief outline of representative aspects of the invention disclosed in this application.

A failure site estimation system according to a representative embodiment of the present invention estimates the suspected failure site when a failure occurs in a monitoring target network in which nodes consisting of network devices are connected in a tree, and has the following features.

Specifically, the system comprises: a route information acquisition unit that, while the monitoring target network is normal, acquires for each node in the monitoring target network route information consisting of the interfaces of the nodes on the communication route to that node, and records it in route information recording means; a sequential polling unit that, when the monitoring target network has failed, acquires from the route information recording means the route information for each failed node, polls each interface included in the route information in sequence, and collects OK or NG results; a suspect pair extraction unit that, among the interfaces included in the route information, takes as a suspect pair the nearest interface for which the polling result is NG and the interface one hop before it for which the polling result is OK, and extracts a suspect pair for each failed node to obtain a suspect pair set; and a failure site output unit that extracts and outputs the suspected failure site based on the suspect pair set and the route information recorded in the route information recording means.

The present invention can also be applied to a program that causes a computer to operate as the failure site estimation system described above.

The following briefly describes the effects obtained by representative aspects of the invention disclosed in this application.

That is, according to the representative embodiment of the present invention, when a failure occurs in a large-scale network system, the suspected failure site can be quickly estimated and narrowed down, including when the network device that caused the failure is suffering from a logical failure.

FIG. 1 is a diagram outlining a configuration example of a network monitoring system that includes a failure site estimation system according to an embodiment of the present invention.
FIG. 2 is a diagram outlining an example of network monitoring according to an embodiment of the present invention.
FIG. 3 is a diagram outlining an example of the process of acquiring route information and quality information according to an embodiment of the present invention.
FIG. 4 is a diagram outlining an example of the process of acquiring route information and quality information according to an embodiment of the present invention.
FIG. 5 is a diagram outlining an example of the process of acquiring route information and quality information according to an embodiment of the present invention.
FIG. 6 is a diagram showing an example of source code for obtaining a hop-by-hop list according to an embodiment of the present invention.
FIG. 7 is a diagram outlining an example of the process of acquiring a suspect pair set for nodes in which a failure has been detected, according to an embodiment of the present invention.
FIG. 8 is a flowchart outlining an example of the process of estimating and outputting the suspected failure site according to an embodiment of the present invention.
FIG. 9 is a diagram outlining an example of the process of estimating the suspected failure site based on the suspect pair set according to an embodiment of the present invention.
FIG. 10 is a diagram outlining an example of network monitoring in the prior art.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In all the drawings for describing the embodiments, the same parts are in principle denoted by the same reference numerals, and repeated description of them is omitted. In the following, the invention is described in comparison with the prior art so that its features are easier to understand.

<Overview>
FIG. 10 is a diagram outlining an example of network monitoring in the prior art. It shows a configuration in which a failure monitoring system 200, built from commercially available tools or the like provided by vendors, is connected to a tree-type monitoring target network (NW) 300 made up of multiple network (NW) devices 310 such as routers, and constantly monitors the monitoring target NW 300 for failures. Here, the failure monitoring system 200 performs alive monitoring of each NW device 310 by, for example, ICMP (Internet Control Message Protocol) / SNMP (Simple Network Management Protocol) polling.

Here, for example, when a failure occurs in the NW device 310a, a typical failure monitoring system 200 also displays each device beneath it in the network configuration (the shaded NW devices 310 in the figure) as failed on the monitoring screen that the operator checks. In particular, when the NW device 310a is in a “half-dead” state due to a logical failure or the like, failed devices are displayed randomly and in large numbers depending on the timing of the alive-monitoring polls by the failure monitoring system 200, and it becomes difficult to tell from the monitoring screen which site is the root cause of the failure.

In such a case, an engineer such as an SE is called in to analyze and isolate the failure and identify the NW device 310a as the failure site. However, in the case of a logical failure rather than a hardware failure, how the failure occurred may remain unclear even after consulting the device logs, and with such manual methods it almost always takes a long time, from several tens of minutes to several hours, to identify the failure site. In large-scale systems in particular, faster identification of the failure site and implementation of countermeasures is desired.

Having the system identify the failure site accurately and automatically requires a correspondingly large-scale monitoring system, analysis system, or the like. A simpler and lower-cost alternative is, for example, to automate only the steps up to narrowing down the suspected failure sites to some extent and reporting them, and then to apply countermeasures to all of the narrowed-down NW devices 310, or to a subset narrowed down further by hand; in some cases this allows rapid recovery.

In the case of a logical failure (for example, an abnormality in a routing table), the general tendency is that the NW device 310 outputs no errors or abnormal log entries and may appear at first glance to be operating normally. However, besides cases in which ping stops getting through altogether, there are cases in which ping gets through only intermittently, and cases in which short-frame pings get through but long-frame pings do not, so a logical failure can be detected by devising how ping is used. For rapid recovery from a logical failure, it is effective to identify the failure site and isolate it (by powering it off, restarting it, and so on).

Accordingly, the failure site estimation system according to an embodiment of the present invention makes it possible to identify the failure site quickly and check logs efficiently even when, for example, a hardware failure causes a large number of failure messages to be displayed on the failure monitoring system 200, and also makes it possible, in the case of a logical failure, to quickly estimate and narrow down the suspected failure site so that countermeasures can be carried out.

To make it possible to estimate and narrow down the failure site simply and quickly, in this embodiment route information and quality information for the monitoring target NW 300 are collected and recorded periodically during normal operation. At the time of a failure or abnormality (when the failure monitoring system 200 detects a failure), ping polling is performed hop by hop along the communication routes acquired during normal operation. A set of failure sites is extracted based on the success or failure of this polling (alive/dead information), and the suspected failure site that is the cause is estimated and extracted from that set by predetermined logic. Here, the success or failure of polling is not judged solely by the presence or absence of a response: a site is also judged to be failed when its quality has degraded by a certain amount or more compared with the quality information from normal operation. This makes it possible to estimate the failure site even in a “half-dead” case such as a logical failure.
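As a rough illustration of the judgment described above, the following Python sketch marks a poll as NG either when no reply arrives or when quality has degraded relative to the normal-time values. The `Quality` record, the `judge_poll` function, and both threshold parameters are hypothetical names and values chosen for illustration; the specification does not state concrete thresholds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quality:
    loss_rate: float     # fraction of lost ping packets, 0.0 to 1.0
    avg_delay_ms: float  # average round-trip delay in milliseconds


def judge_poll(current: Optional[Quality], baseline: Quality,
               max_loss_increase: float = 0.2,
               max_delay_factor: float = 3.0) -> str:
    """Return "OK" or "NG" for one polled interface.

    current is None when no reply was received at all.
    The two thresholds are assumed values, not taken from the specification.
    """
    if current is None:
        return "NG"  # no response: the classic dead case
    if current.loss_rate - baseline.loss_rate > max_loss_increase:
        return "NG"  # replies arrive, but loss is far above the normal level
    if current.avg_delay_ms > baseline.avg_delay_ms * max_delay_factor:
        return "NG"  # replies arrive, but delay is far above the normal level
    return "OK"
```

With these assumed thresholds, a device that still answers some pings but loses half of them is judged NG rather than OK, which is the “half-dead” case that response-only checks miss.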

FIG. 2 is a diagram outlining an example of network monitoring according to an embodiment of the present invention. Here, in addition to the conventional failure monitoring system 200, there is a failure site estimation system 100 that estimates the failure site. When the failure monitoring system 200 or the like detects a failure caused by the NW device 310a (as in FIG. 10, the NW devices 310 beneath it are detected as failed), the failure site estimation system 100 analyzes the route information to each failed NW device 310 and the alive/dead information along the communication routes, which makes it possible to estimate that the NW device 310a is the suspected failure site.

<System configuration>
FIG. 1 is a diagram outlining a configuration example of a network monitoring system that includes the failure site estimation system 100 according to an embodiment of the present invention. As shown in FIG. 2 above, the network monitoring system has a configuration in which the failure monitoring system 200 and the failure site estimation system 100 are connected to the monitoring target NW 300.

The monitoring target NW 300 is a tree-type network composed of many NW devices 310 such as routers, and each NW device 310 has, as necessary, a routing table 311 that holds route information. The failure monitoring system 200 is, as described above, an information processing system built from commercially available tools or the like provided by vendors. It performs alive monitoring of each NW device 310 in the monitoring target NW 300 by, for example, ICMP/SNMP polling to detect failures, and reports them by displaying them on a map representing the network topology or as failure notification messages.

The failure site estimation system 100 is a system that, when the failure monitoring system 200 detects failures at multiple nodes (NW devices 310) in the monitoring target NW 300, identifies the common site on the communication routes that is causing the node failures and estimates it to be the failure site, based on the route information to each node and the alive/dead information of each node on the communication routes. Although this embodiment shows an example in which the failure site estimation system 100 is configured as a system separate from the failure monitoring system 200, the two can of course be configured as a single system.

The failure site estimation system 100 is an information processing system built on, for example, a PC (Personal Computer) or server equipment. It has a route information acquisition unit 110 and a failure site estimation unit 120, implemented as software, and a route information database (DB) 130, implemented as a database, file table, or the like.

The route information acquisition unit 110 has the function of performing, while the monitoring target NW 300 is normal, route discovery by ping/traceroute and SNMP for all nodes (NW devices 310) in the monitoring target NW 300, acquiring the route information for normal operation, and recording it in the route information DB 130. The details of the route information acquisition process are described later.

The failure site estimation unit 120 has the function of estimating and outputting, at the time of a failure or abnormality in the monitoring target NW 300 (when a failure is detected by the failure monitoring system 200 or the like), the suspected failure site that is the cause, starting from the NW devices 310 in which failures have been detected. It includes, for example, a hop-by-hop (sequential) polling unit 121, a suspect pair extraction unit 122, and a failure site output unit 123. The hop-by-hop polling unit 121 has the function of acquiring from the route information DB 130 the normal-time route information for each IP address at which a failure has been detected, polling every IP address on that communication route (the hop-by-hop list) by ping in sequence (hop by hop), and collecting the results (OK/NG) to determine the state of each node.
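The hop-by-hop polling step can be sketched as follows in Python. This is only an illustration under stated assumptions: `poll_paths` and the injected `probe` callable are hypothetical names, the actual ping is abstracted away so the example stays runnable without network access, and polling each shared interface only once is a design choice of this sketch rather than something the specification prescribes.

```python
from typing import Callable, Dict, List


def poll_paths(paths: Dict[str, List[str]],
               probe: Callable[[str], bool]) -> Dict[str, str]:
    """Poll every interface on the stored routes to the failed targets.

    paths maps each failed target to its hop-by-hop list, ordered from the
    estimation system outward. probe(interface) returns True when the poll
    succeeds; in a real system it would wrap a ping plus a quality check.
    """
    results: Dict[str, str] = {}
    for hop_list in paths.values():
        for iface in hop_list:        # nearest hop first
            if iface not in results:  # assumption: poll shared hops once
                results[iface] = "OK" if probe(iface) else "NG"
    return results
```

Injecting `probe` keeps the traversal logic separate from the measurement itself, so the same loop works whether success means "a reply arrived" or "quality stayed near its normal level".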

The suspect pair extraction unit 122 has the function of extracting as a suspect pair, from the polling by the hop-by-hop polling unit 121, the nearest IP address on the communication route (the one closest to the failure site estimation system 100) among those whose ping result was abnormal, together with the IP address of the hop immediately before it on the route (whose ping result was normal), and of doing this for each IP address at which a failure has been detected to obtain a suspect pair set. The failure site output unit 123 has the function of removing duplicates from the suspect pair set extracted by the suspect pair extraction unit 122 and, based on the result, extracting and outputting the suspected failure site by predetermined logic. The details of the process for estimating the suspected failure site are also described later.
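The suspect pair extraction and duplicate removal described above can be sketched in Python as follows. The function names, the interface labels in the usage note, and the handling of a route whose very first hop is NG are assumptions made for illustration; the predetermined logic that turns the pair set into a suspected failure site is not covered here.

```python
from typing import Dict, List, Optional, Set, Tuple

Pair = Tuple[str, str]  # (last OK interface, first NG interface)


def extract_suspect_pair(hop_list: List[str],
                         results: Dict[str, str]) -> Optional[Pair]:
    """Return the suspect pair for one route, or None if there is none.

    hop_list is ordered from the estimation system outward, so the first NG
    entry is the nearest failed interface and its predecessor is OK.
    """
    for idx, iface in enumerate(hop_list):
        if results[iface] == "NG":
            if idx == 0:
                return None  # assumed handling: no preceding hop to pair with
            return (hop_list[idx - 1], iface)
    return None  # every hop answered normally


def suspect_pair_set(paths: Dict[str, List[str]],
                     results: Dict[str, str]) -> Set[Pair]:
    """Collect one pair per failed target; the set removes duplicates."""
    pairs: Set[Pair] = set()
    for hop_list in paths.values():
        pair = extract_suspect_pair(hop_list, results)
        if pair is not None:
            pairs.add(pair)
    return pairs
```

For two failed targets whose routes share the hops `i00`, `i01`, and `i21`, with `i21` and everything beyond it judged NG, both routes yield the same pair `("i01", "i21")`, so the set collapses to a single entry pointing at the boundary between the last healthy hop and the first unhealthy one.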

<Route information acquisition process>
First, the route information acquisition process performed by the route information acquisition unit 110 during normal operation is described. Here, a complete list of the IP addresses on the normal-time communication route to every monitored node (NW device 310) in the monitoring target NW 300 is created as route information, and the normal-time quality information for that communication route (monitored node) is acquired and recorded in the route information DB 130. This process is preferably executed periodically during normal operation, or at least whenever there is a change to the system or network configuration that could affect communication routes or communication quality.

FIGS. 3 to 5 are diagrams outlining an example of the process of acquiring route information and quality information. They show an example of a tree-type network configuration in which the failure site estimation system 100 is node “n00” and the NW devices 310, such as routers, in the monitoring target NW 300 are nodes “n01”, “n21”, “n22”, “n41”, “n42”, and “n43” (devices such as layer 2 switches are omitted). Each node has interfaces (IFs) 312, denoted “i00” to “i43”. The example in FIGS. 3 to 5 describes the case of acquiring the route information and quality information for the interface 312 (IP address) “i41”.

First, to acquire the normal-time quality information, the route information acquisition unit 110 of the failure site estimation system 100 (node “n00”) issues a ping command to the monitored interface 312 (“i41”), as shown in FIG. 3, and obtains the packet loss rate and the average delay time from the responses. In the example of FIG. 3, arrows show an echo-reply packet being received in response to the echo packet that ping sends to the interface 312 “i41”. The acquired quality information is recorded in the route information DB 130 in association with the target interface 312. The quality information may also be accumulated over a certain period and subjected to predetermined statistical processing to obtain a quality baseline.
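The two quantities named above, and one possible form of the statistical processing, can be sketched in Python as follows. The function names are hypothetical, the per-probe round-trip times are assumed to have been collected already (for example by parsing ping output), and the "mean plus k standard deviations" baseline is an assumed choice; the specification leaves the statistical processing unspecified.

```python
import statistics
from typing import List, Optional, Tuple


def summarize_probes(rtts_ms: List[Optional[float]]) -> Tuple[float, Optional[float]]:
    """Return (packet loss rate, average delay in ms) for one ping run.

    Each entry is a round-trip time in milliseconds, or None for a lost probe.
    The average is None when every probe was lost.
    """
    if not rtts_ms:
        raise ValueError("at least one probe is required")
    replies = [r for r in rtts_ms if r is not None]
    loss_rate = 1.0 - len(replies) / len(rtts_ms)
    avg_delay = sum(replies) / len(replies) if replies else None
    return loss_rate, avg_delay


def delay_baseline(avg_delays_ms: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (mean, upper bound) over accumulated normal-time averages.

    The upper bound mean + k * standard deviation is an assumed baseline form.
    """
    mean = statistics.fmean(avg_delays_ms)
    spread = statistics.pstdev(avg_delays_ms)
    return mean, mean + k * spread
```

A run of four probes with one loss gives a loss rate of 0.25, and averages accumulated over many normal-time runs yield a bound that a later failure-time measurement can be compared against.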

Next, to acquire route information for the normal state, as shown in FIG. 4, a traceroute command is issued to the monitored interface 312 ("i41") to obtain information on the nodes traversed on the way to that interface 312. In FIG. 4, arrows show echo packets being sent in turn to the nodes "n01", "n21", and "n41" on the communication route and time-exceeded packets being received in response.

Next, a route search by SNMP is executed for each transit node obtained by traceroute, and the input interface 312 and the output interface 312 of every node along the hops are all acquired. In FIG. 5, arrows show the snmpget command being issued to each of the transit nodes "n01" and "n21" on the way to the destination node "n41". With this command, information on the input and output interfaces 312 can be obtained from the MIB (Management Information Base) management information, which is derived from the routing table 311 and the like of each node.

Based on the transit node information obtained by the processing in FIG. 4 and the input and output interface 312 information for each transit node obtained by the processing in FIG. 5, a list of the interfaces 312 on the communication route (hop-by-hop list 131) is created, as shown in the table in the lower part of FIG. 5. The list runs from the interface 312 ("i00") of the failure site estimation system 100 (node "n00") to the interface 312 ("i41") of the monitored NW device 310 (node "n41"). The created hop-by-hop list 131 is recorded in the route information DB 130 in association with the monitored interface 312. The quality information and the route information need not be acquired in the order described above; the route information may be acquired first.
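The assembly of the hop-by-hop list can be sketched as follows. This is not the source code 111 of FIG. 6, which is not reproduced in the text; it is a minimal Python illustration that assumes the transit nodes (from traceroute) and each node's input and output interfaces (from SNMP) have already been collected. The function name and the assignment of "i10", "i11", and "i31" to particular nodes are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def build_hop_by_hop_list(
    source_if: str,
    transit_nodes: Sequence[str],
    node_interfaces: Dict[str, Tuple[str, str]],
    target_if: str,
) -> List[str]:
    """Return the ordered interfaces from the source to the target.

    transit_nodes   -- nodes in traversal order (as reported by traceroute)
    node_interfaces -- node -> (input interface, output interface) on this
                       route (as obtained from each node via SNMP)
    """
    hops = [source_if]
    for node in transit_nodes:
        in_if, out_if = node_interfaces[node]
        hops.extend([in_if, out_if])
    hops.append(target_if)
    return hops


# Hypothetical interface assignment for the route n00 -> n01 -> n21 -> n41.
route_to_i41 = build_hop_by_hop_list(
    "i00",
    ["n01", "n21"],
    {"n01": ("i10", "i11"), "n21": ("i21", "i31")},
    "i41",
)
print(route_to_i41)
```

The resulting list would be stored under the monitored interface ("i41" here) as its key, so that it can be looked up directly when that interface is reported as failed.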

FIG. 6 shows, as reference information, an example of source code for obtaining the hop-by-hop list 131. The upper diagram shows an example of the target network configuration: the failure site estimation system 100 (node "n00") and the IP address of its interface 312, the target node (NW device 310) and its interface 312, and the relay node (NW device 310) with its input and output interfaces 312 and routing table 311. The lower diagram shows an example of source code 111 that, in a configuration like the one in the upper diagram, obtains the list of interfaces 312 leading to the target interface 312.

<Failure site estimation process>
The following describes the failure site estimation processing performed by the failure site estimation unit 120 when a failure or abnormality occurs. When the failure monitoring system 200 or the failure site estimation system 100 detects a network failure, for example by periodically polling each node in the monitoring target NW 300 with ping, the hop-by-hop polling unit 121 of the failure site estimation unit 120 executes ping hop by hop toward each interface 312 (IP address) for which a failure was detected. That is, it acquires the route information (hop-by-hop list 131) leading to the target interface 312 from the route information DB 130, polls the IP address of each interface 312 in the list with ping, and determines whether communication is OK or NG.
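The hop-by-hop polling step can be sketched in Python as below. The actual ping invocation is deliberately left out: `probe` is a hypothetical callable supplied by the caller that returns True when an interface is judged reachable, and the interface names and failed set are made up for the example.

```python
from typing import Callable, List, Sequence, Tuple


def poll_hop_by_hop(
    hops: Sequence[str],
    probe: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    """Poll every interface on a stored route, in route order.

    Returns (interface, "OK" or "NG") pairs; the order of the stored
    hop-by-hop list is preserved so the OK/NG boundary can be found later.
    """
    return [(iface, "OK" if probe(iface) else "NG") for iface in hops]


# Stand-in probe for illustration: these interfaces are treated as unreachable.
unreachable = {"i21", "i31", "i41"}
results = poll_hop_by_hop(
    ["i00", "i10", "i11", "i21", "i31", "i41"],
    lambda iface: iface not in unreachable,
)
print(results)
```

Passing the probe in as a parameter keeps the sketch independent of how reachability is actually measured, which the description leaves open beyond the use of ping.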

Note that when detecting a failure by ping polling or determining whether communication is OK or NG, the determination is not based solely on whether a ping response was received. Instead, quality information values such as the packet loss rate and the average delay are compared with the normal-time quality information (baseline) recorded in the route information DB 130. For example, the determination may be based on whether the current quality values have degraded from the baseline by a predetermined threshold or more, or a statistical method may be used to infer whether a failure has occurred.
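A threshold-based version of this OK/NG judgment might look like the following Python sketch. The thresholds `loss_margin` and `delay_factor`, their default values, and the names `Quality` and `judge` are assumptions for illustration; the text only says that a predetermined threshold or a statistical method may be used.

```python
from typing import NamedTuple, Optional


class Quality(NamedTuple):
    loss_rate: float      # packet loss as a fraction, 0.0 to 1.0
    avg_delay_ms: float   # average round-trip delay in milliseconds


def judge(
    current: Optional[Quality],
    baseline: Quality,
    loss_margin: float = 0.05,   # hypothetical allowed increase in loss rate
    delay_factor: float = 2.0,   # hypothetical allowed multiple of baseline delay
) -> str:
    """Return "OK" or "NG" by comparing current quality with the baseline."""
    if current is None:
        # No response at all is always NG.
        return "NG"
    if current.loss_rate > baseline.loss_rate + loss_margin:
        return "NG"
    if current.avg_delay_ms > baseline.avg_delay_ms * delay_factor:
        return "NG"
    return "OK"


baseline = Quality(loss_rate=0.0, avg_delay_ms=10.0)
print(judge(Quality(0.0, 11.0), baseline))   # close to normal
print(judge(Quality(0.30, 10.0), baseline))  # responds, but loses packets
print(judge(None, baseline))                 # no response
```

The second case is the kind of "responds but degraded" state that a check on response presence alone would miss.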

Furthermore, the suspect pair extraction unit 122 of the failure site estimation unit 120 extracts, as a suspect pair, the interface 312 that is nearest on the communication route (closest to the polling source) among the interfaces 312 whose polling result was NG, together with the interface 312 of the hop immediately before it on the route. This is done for each interface 312 for which a failure has been detected, yielding the suspect pair set 132.

FIG. 7 outlines an example of the processing for acquiring the suspect pair set 132 for nodes in which a failure has been detected. The example assumes that the interface 312 "i21" has failed in the configuration of the monitoring target NW 300 shown at the upper left of the figure (the same configuration as in FIGS. 3 to 5).

In this case, for each node ("n21", "n41", "n42") for which the failure monitoring system 200 detects a failure (ping polling is NG), the hop-by-hop polling unit 121 polls each interface 312 on the communication route with ping, hop by hop. In the example of FIG. 7, polling is NG for the interfaces 312 "i21", "i31", "i32", "i41", and "i42" (shaded in the upper-left diagram) and OK for the other interfaces 312. The table at the upper right of FIG. 7 is the hop-by-hop list 131 with these polling results added. The OK/NG values in the table are the results of ping polling for the corresponding interfaces 312.

Here, among the interfaces 312 on each route whose hop-by-hop polling result was NG, the nearest interface 312 and the interface 312 of the hop immediately before it are extracted as a suspect pair. In other words, the interfaces 312 at the boundary in the hop-by-hop list 131 where the polling result changes from OK to NG are extracted as a suspect pair, and the suspect pair set 132 (the lower table in FIG. 7) is created.

In the suspect pair set 132, the "NG" field holds the interface 312 at the boundary whose polling result is NG, and the "PREV" field holds the interface 312 of the preceding hop, whose polling result is OK. In the example of FIG. 7, "PREV" is "i11" and "NG" is "i21" for every monitored interface 312.
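The extraction of the OK-to-NG boundary can be sketched in Python as follows. The function name and the routes in the example are assumptions: the tables of FIG. 7 are not reproduced in the text, so the routes below are a plausible reconstruction in which "i10" and "i11" belong to node "n01" and "i21", "i31", and "i32" belong to node "n21".

```python
from typing import Optional, Sequence, Tuple

SuspectPair = Tuple[Optional[str], str]  # (PREV, NG)


def extract_suspect_pair(
    polled: Sequence[Tuple[str, str]],
) -> Optional[SuspectPair]:
    """Find the first NG interface on a route and the hop just before it.

    polled -- (interface, "OK" or "NG") in route order, nearest hop first.
    Returns None when no interface on the route is NG; PREV is None when
    the very first hop is already NG.
    """
    prev: Optional[str] = None
    for iface, result in polled:
        if result == "NG":
            return (prev, iface)
        prev = iface
    return None


# Assumed routes to the three failed targets when "i21" is down.
common = [("i00", "OK"), ("i10", "OK"), ("i11", "OK"), ("i21", "NG")]
routes = {
    "i21": common,
    "i41": common + [("i31", "NG"), ("i41", "NG")],
    "i42": common + [("i32", "NG"), ("i42", "NG")],
}
suspect_pairs = {target: extract_suspect_pair(r) for target, r in routes.items()}
print(suspect_pairs)
```

Under these assumed routes every target yields PREV "i11" and NG "i21", which agrees with the result stated for FIG. 7.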

Next, the failure site output unit 123 of the failure site estimation unit 120 estimates and outputs the suspected failure site based on the suspect pair set 132 and the route information recorded in the route information DB 130. FIG. 8 is a flowchart outlining an example of this processing. First, unique processing (elimination of duplicates) is applied to the interfaces 312 in the NG field of the entries in the suspect pair set 132 (S01). Next, it is determined whether the number of entries remaining after the unique processing (the number of NG-field interfaces 312) is 1 (S02). If it is 1, the NG-field interface 312 of that entry is output as the suspected failure site (pattern 1) (S03). That is, as illustrated, the single NG interface 312 at the boundary between OK and NG is output as the suspected failure site.

If step S02 finds multiple NG-field entries, unique processing is further applied to the PREV-field interfaces 312 of the entries in the suspect pair set 132 (already made unique on the NG field) (S04). Next, it is determined whether the number of entries remaining after this unique processing (the number of PREV-field interfaces 312) is 1 (S05). If it is 1, the section between the PREV-field interface 312 of that entry and the NG-field interfaces 312 is output as the suspected failure site (pattern 2) (S06). That is, as illustrated, the section at the boundary between OK and NG (which may include a network provided by a provider or the like, as illustrated) is output as the suspected failure site.

If step S05 finds multiple PREV-field entries, the unique set of these interfaces 312 is output as the suspected failure sites (pattern 3) (S07). That is, as illustrated, the multiple OK interfaces 312 at the boundary between OK and NG are output as the suspected failure sites.
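Steps S01 to S07 can be condensed into a short Python sketch. The function name, the string labels for the patterns, and the return shape are assumptions for illustration; for pattern 2, the sketch reports one section per distinct NG interface, which is one possible reading of "the section between" the single PREV interface and the NG interfaces.

```python
from typing import List, Sequence, Tuple, Union

Section = Tuple[str, str]  # (PREV interface, NG interface)


def estimate_suspected_site(
    pairs: Sequence[Tuple[str, str]],
) -> Tuple[str, List[Union[str, Section]]]:
    """Apply the three output patterns to a suspect pair set.

    pairs -- (PREV, NG) entries, one per failed target interface.
    """
    if not pairs:
        raise ValueError("suspect pair set is empty")

    ng_unique = sorted({ng for _, ng in pairs})          # S01
    if len(ng_unique) == 1:                              # S02
        return ("pattern1", list(ng_unique))             # S03: the NG interface

    prev_unique = sorted({prev for prev, _ in pairs})    # S04
    if len(prev_unique) == 1:                            # S05
        # S06: the section(s) between the single PREV and the NG interfaces
        return ("pattern2", [(prev_unique[0], ng) for ng in ng_unique])

    return ("pattern3", list(prev_unique))               # S07: the PREV interfaces


# The suspect pair set stated for FIG. 7: every entry is ("i11", "i21").
print(estimate_suspected_site([("i11", "i21")] * 3))
```

For the FIG. 7 pair set the sketch selects pattern 1 with "i21" as the suspected failure site.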

FIG. 9 outlines an example of the processing for estimating the suspected failure site from the suspect pair set 132. It shows the case where the suspected failure site is estimated by the method of FIG. 8 from the suspect pair set 132 obtained in the example of FIG. 7. In the example of FIG. 9, applying the unique processing of step S01 in FIG. 8 to the NG-field interfaces 312 of the suspect pair set 132 leaves a single NG-field record, "i21", so under pattern 1 the interface "i21" is estimated to be the suspected failure site and output.

The form of output is not particularly limited. For example, the suspected failure site may be highlighted so that it can be identified on a map representing the topology of the monitoring target NW 300 on a screen of the failure monitoring system 200 or the like. Alternatively, the IP address, the interface 312, the identification information of the NW device 310, and so on corresponding to the suspected failure site may be displayed as a message. The suspected failure site output here is a site suspected of being the cause of the failure; it may include components other than the exact cause, but it is very important information for responding to failures quickly.

As described above, the failure site estimation system 100 according to an embodiment of the present invention periodically collects and records route information and quality information for the monitoring target NW 300 during normal operation. At the time of a failure or abnormality (when the failure monitoring system 200 detects a failure), it polls with ping hop by hop along the communication routes acquired during normal operation. Based on the success or failure of this polling (alive/dead information), it extracts a set of failure sites and, from that set, estimates and extracts the suspected failure site that is the cause by predetermined logic. The success or failure of polling is not judged by the presence or absence of a response alone: an interface is also judged to be a failure site when its quality has degraded beyond a certain level relative to the normal-time quality information. This makes it possible to estimate the failure site even in a "half-dead" state such as a logical failure.

As a result, even in a large-scale network failure, the failure site estimation system 100 can analyze the route information to each failed node (NW device 310) together with the alive/dead information along the communication routes, and thereby estimate and narrow down the suspected failure site simply and quickly. Moreover, because no difficult operation is required and even an operator can easily estimate the suspected failure site, the suspected failure site can be narrowed down at an early stage and, depending on the situation, countermeasures can be taken immediately.

The invention made by the present inventors has been described above specifically on the basis of an embodiment, but it goes without saying that the present invention is not limited to that embodiment and can be modified in various ways without departing from its gist. For example, the above embodiment has been described in detail to make the present invention easy to understand, and the invention is not necessarily limited to a configuration having all of the elements described. Part of the configuration of the embodiment may also have other elements added, deleted, or substituted. In the figures, only the control lines and information lines considered necessary for the explanation are shown; not all control lines and information lines of an actual implementation are necessarily shown. In practice, almost all components may be considered to be interconnected.

The present invention is applicable to a failure site estimation system and a failure site estimation program that estimate a failure site in the event of a network failure.

1 ... Network (NW) monitoring system,
100 ... Failure site estimation system, 110 ... Route information acquisition unit, 111 ... Source code, 120 ... Failure site estimation unit, 121 ... Hop-by-hop (sequential) polling unit, 122 ... Suspect pair extraction unit, 123 ... Failure site output unit, 130 ... Route information database (DB), 131 ... Hop-by-hop list, 132 ... Suspect pair set,
200 ... Failure monitoring system,
300 ... Monitoring target network (NW), 310, 310a ... Network (NW) device, 311 ... Routing table, 312 ... Interface.

Claims

A failure site estimation system that estimates a suspected failure site when a failure occurs in a monitoring target network configured with nodes, each composed of a network device, connected in a tree shape, the system comprising:
a route information acquisition unit that, when the monitoring target network is normal, acquires, for each node in the monitoring target network, route information composed of the interfaces of the nodes on the communication route leading to that node, and records the route information in route information recording means;
a sequential polling unit that, when a failure occurs in the monitoring target network, acquires from the route information recording means the route information to each failed node, polls each interface included in the route information in sequence, and collects OK or NG results;
a suspect pair extraction unit that takes, as a suspect pair, the nearest interface whose polling result is NG among the interfaces included in the route information and the interface immediately before it whose polling result is OK, extracts a suspect pair for each failed node, and obtains a suspect pair set; and
a failure site output unit that extracts and outputs a suspected failure site based on the suspect pair set and the route information recorded in the route information recording means.

In the failure site estimation system according to claim 1,
the route information acquisition unit, when the monitoring target network is normal, acquires quality information for the communication route leading to each node in the monitoring target network and records the quality information in the route information recording means, and
the sequential polling unit determines whether the result of the sequential polling is OK or NG based on a comparison between the quality information acquired during the sequential polling and the normal-time quality information for the corresponding communication route recorded in the route information recording means.

In the failure site estimation system according to claim 2,
the quality information recorded in the route information recording means is the packet loss rate and/or the average delay time contained in the response to a ping command.

In the failure site estimation system according to any one of claims 1 to 3,
the route information acquisition unit acquires the route information by issuing a traceroute command to each node in the monitoring target network to obtain information on the nodes on the communication route, and by executing a route search by SNMP for each obtained node on the communication route to obtain information on its input and/or output interfaces.

In the failure site estimation system according to any one of claims 1 to 4,
when, in the suspect pair set, the number of entries remaining after eliminating duplicates among the interfaces whose polling result is NG is 1, the failure site output unit outputs the interface whose polling result is NG for that entry as the suspected failure site.

In the failure site estimation system according to any one of claims 1 to 5,
when, in the suspect pair set, a plurality of entries remain after eliminating duplicates among the interfaces whose polling result is NG, and the number of entries remaining after eliminating duplicates among the interfaces whose polling result is OK in those entries is 1, the failure site output unit outputs, as the suspected failure site, the section between the interface whose polling result is OK and the interfaces whose polling result is NG.

In the failure site estimation system according to any one of claims 1 to 6,
when, in the suspect pair set, a plurality of entries remain after eliminating duplicates among the interfaces whose polling result is NG, and a plurality of entries also remain after eliminating duplicates among the interfaces whose polling result is OK in those entries, the failure site output unit outputs, as suspected failure sites, the interfaces whose polling result is OK for those entries.

A failure site estimation program that causes a computer to operate as a failure site estimation system that estimates a suspected failure site when a failure occurs in a monitoring target network configured with nodes, each composed of a network device, connected in a tree shape, the program causing the computer to execute:
route information acquisition processing that, when the monitoring target network is normal, acquires, for each node in the monitoring target network, route information composed of the interfaces of the nodes on the communication route leading to that node, and records the route information in route information recording means;
sequential polling processing that, when a failure occurs in the monitoring target network, acquires from the route information recording means the route information to each failed node, polls each interface included in the route information in sequence, and collects OK or NG results;
suspect pair extraction processing that takes, as a suspect pair, the nearest interface whose polling result is NG among the interfaces included in the route information and the interface immediately before it whose polling result is OK, extracts a suspect pair for each failed node, and obtains a suspect pair set; and
failure site output processing that extracts and outputs a suspected failure site based on the suspect pair set and the route information recorded in the route information recording means.