JP2014239315A

JP2014239315A - Network fault analyzing system and network fault analyzing program

Info

Publication number: JP2014239315A
Application number: JP2013120359A
Authority: JP
Inventors: Yoji Mukoyama (向山 洋司)
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2013-06-07
Filing date: 2013-06-07
Publication date: 2014-12-18
Anticipated expiration: 2033-06-07
Also published as: JP5342082B1

Abstract

PROBLEM TO BE SOLVED: To make it possible to estimate and narrow down suspected places of fault occurrence on a network on the basis of information on the utilization situation and fault situation of a business application or the like, regardless of whether a fault is detected at the network level. SOLUTION: A network fault analyzing system for analyzing a fault in a network system holds information on communication requirements that identify the communication used by each business application, and information on communication paths, which is information on the nodes (network devices) on the path of the communication related to each communication requirement. When notified that a fault has been detected during execution of a business application, the system identifies, on the basis of the contents of the notification, the communication requirement used by the business application and a first communication path related to that communication requirement. If a node exists on both the first communication path and a second communication path related to one or more faults that have already occurred, the system estimates that node to be a suspected place of fault occurrence.

Description

The present invention relates to network failure monitoring technology, and more particularly to a technique that is effective when applied to a network failure analysis system and a network failure analysis program that estimate the location of a failure on a network based on the failure detection status in a system.

In the operation and management of an information processing system, a mechanism is usually built to monitor various devices, such as server devices and network devices, and the operation of the functions and services they provide.

Network failure monitoring devices and systems that identify failure locations generally collect network-level status (up to the fourth of the seven layers of the so-called OSI reference model), such as lost response packets or degraded packet response times, and identify or estimate the failure location based on that information.

As related art, for example, Japanese Unexamined Patent Application Publication No. 2002-271328 (Patent Document 1) describes a technique that detects failure information in a network system and, based on the failure information and a database that hierarchically defines the network elements constituting the network system, extracts the network element presumed to be the cause of the failure.

Japanese Unexamined Patent Application Publication No. 2007-189615 (Patent Document 2) describes a network monitoring support apparatus comprising: a device monitoring unit that collects information on failed devices as failed-device information; a diagnostic file storage unit that stores a diagnostic file containing failure cause information indicating the cause of the failure corresponding to the failed-device information; a failure cause analysis unit that acquires the failure cause information from the diagnostic file based on the failed-device information and diagnoses the cause of the failure; a failure impact range analysis unit that identifies the range affected by the failure based on the failed-device information and network topology information; and a failure recovery unit that calculates, based on device load information, a bypass route, which is a communication route that avoids the affected range.

Japanese Unexamined Patent Application Publication No. 2012-213057 (Patent Document 3) describes a failure analysis system having a failure analysis device that periodically collects connection information from each device constituting the video provider's facilities and each device constituting the communication network facilities, and a video receiving device that transmits an evaluation-criterion-exceeded notification to the failure analysis device when an error occurs while receiving and decoding a video signal from a video server and the device determines that the evaluation criterion indicated by previously acquired evaluation criterion data has been exceeded. In this system, the failure analysis device reads from the connection information the distribution route of the video receiving device identified by the node identification information in the received notification, and identifies the failure location from the devices on that distribution route and the commonality of the channels indicated by the notifications.

Patent Document 1: JP 2002-271328 A
Patent Document 2: JP 2007-189615 A
Patent Document 3: JP 2012-213057 A

With the conventional techniques described above, the location of a failure can be identified or estimated based on failure information in the network system together with network configuration information, route information, and the like. However, these techniques are effective only when a clear failure is detected at the network level (for example, up to the fourth layer of the OSI reference model) or when an event that is a sign of failure, such as communication delay, has occurred.

By contrast, in a "half-dead" state, for example one in which a business application is malfunctioning but no failure has been clearly detected at the network level, it has been difficult to identify, estimate, or narrow down the suspected location of the failure. As a result, even though users are being affected, the system operator or service provider may not recognize the situation as abnormal, and the condition affecting users may persist for a long time. Unless such situations are handled properly, customer satisfaction with the operator or provider may decline, and subsequent business opportunities may be lost.

An object of the present invention is therefore to provide a network failure analysis system and a network failure analysis program that make it possible to estimate and narrow down the suspected locations of a failure on a network based on information on usage status and failure status in business applications and the like, regardless of whether a failure has been detected at the network level.

The above and other objects and novel features of the present invention will become apparent from the description in this specification and the accompanying drawings.

A brief outline of a representative one of the inventions disclosed in the present application is as follows.

A network failure analysis system according to a representative embodiment of the present invention analyzes failures in a network system having information processing apparatuses on which business applications run. It holds communication requirement information for identifying the communication used by each business application, and communication path information, which is information on the nodes (network devices) on the path of the communication for each communication requirement.

When notified that a failure has been detected during execution of a business application, the system identifies, based on the content of the notification, the communication requirement used by that business application and a first communication path for that communication requirement. If a node exists on both the first communication path and a second communication path associated with one or more failures that have already occurred, the system estimates that node to be a suspected location of the failure.

The present invention can also be applied to a program that causes a computer to operate as the network failure analysis system described above.

The effect obtained by a representative one of the inventions disclosed in the present application is briefly as follows.

That is, according to a representative embodiment of the present invention, the suspected locations of a failure on a network can be estimated and narrowed down based on information on usage status and failure status in business applications and the like, regardless of whether a failure has been detected at the network level.

FIG. 1 is a diagram outlining a configuration example of an application monitoring server having a network failure analysis system according to an embodiment of the present invention.
FIG. 2 is a diagram outlining a configuration example of the network of an information processing system in which business applications run in an embodiment of the present invention.
FIG. 3 is a diagram outlining an example of estimating the failure location when a failure is detected in a business application in an embodiment of the present invention.
FIG. 4 is a diagram outlining the data structure of the communication requirement table in an embodiment of the present invention, with specific example data.
FIG. 5 is a diagram outlining the data structure of the route information table in an embodiment of the present invention, with specific example data.
FIG. 6 is a diagram outlining the data structure of the job table in an embodiment of the present invention, with specific example data.
FIG. 7 is a diagram outlining the data structure of the application route information table in an embodiment of the present invention, with specific example data.
FIG. 8 is a diagram outlining an example of the flow of processing for estimating and narrowing down suspected failure locations in an embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings. In all the drawings for describing the embodiments, the same parts are in principle denoted by the same reference numerals, and repeated description of them is omitted.

<Overview>
A network failure analysis system according to an embodiment of the present invention operates in an information processing system composed of one or more information processing apparatuses such as server devices and network devices. It monitors whether the business applications running on each server are operating normally and analyzes the results, which makes it possible to narrow down or estimate the suspected locations of a failure in the network configuration of the information processing system, regardless of whether a failure has been detected at the network level.

More specifically, for the communication each business application performs when executing its processing, the source and destination IP addresses and the communication path between them (the nodes traversed) are registered in advance. When an abnormality or failure is detected, for example through monitoring of the business applications, the system assumes that the suspected location lies among the nodes on the communication path used by the affected business application, and narrows the candidates down.

For example, when failures are detected in multiple business applications and the communication paths used by those applications have nodes in common, those common nodes are likely to be the cause of the failure and are selected as the suspected locations. When there are no common nodes (or a failure has been detected in only one business application), each node on the affected communication path is checked against the communication paths used by all other business applications. A node that overlaps none of them, that is, a node used only for the communication of the affected business application, is likely to be the cause of the failure, and such nodes are selected as the suspected locations.
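The two narrowing rules described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the function name `estimate_suspects`, the application names, the node names, and the dictionary layout are all hypothetical.

```python
from typing import Dict, List, Set


def estimate_suspects(failed: List[str], paths: Dict[str, List[str]]) -> Set[str]:
    """Return suspected nodes for the given failed applications.

    `paths` maps each (hypothetical) application name to the nodes on its path.
    """
    failed_sets = [set(paths[app]) for app in failed]
    # Rule 1: several applications failed and their paths share nodes.
    if len(failed_sets) > 1:
        common = set.intersection(*failed_sets)
        if common:
            return common
    # Rule 2: no shared nodes (or a single failure). Keep the nodes that
    # appear on no other application's path.
    suspects: Set[str] = set()
    for app in failed:
        others: Set[str] = set()
        for other, nodes in paths.items():
            if other != app:
                others.update(nodes)
        suspects |= set(paths[app]) - others
    return suspects


# Illustrative data only: two applications whose paths share node "n2".
PATHS = {
    "app1": ["n1", "n2", "n3"],
    "app2": ["n4", "n2", "n5"],
}
both_failed = estimate_suspects(["app1", "app2"], PATHS)
one_failed = estimate_suspects(["app1"], PATHS)
```

With both applications failing, the shared node is selected; with only `app1` failing, the nodes that `app2` does not use are selected instead.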

FIG. 2 is a diagram outlining a configuration example of the network of an information processing system in which business applications according to the present embodiment run. In the example of FIG. 2, multiple business servers that run business applications (business server A (2a) to business server D (2d)) cooperate to provide services, and the networks to which they belong are connected by network devices such as routers and layer 3 switches (node A (3a) to node E (3e) and others). An application monitoring server 1, which monitors failures and abnormalities in the business applications running on each business server, is also connected.

Here, for example, the figure shows that, for processing performed cooperatively by the business applications on business server A (2a) and business server C (2c), the communication requirement is that communication takes place over the path node A (3a) → node B (3b) → node C (3c). Similarly, for processing performed cooperatively by the business applications on business server B (2b) and business server D (2d), communication takes place over the path node D (3d) → node B (3b) → node E (3e). Other communication paths used by business applications may exist, but their description is omitted here.

In the example of FIG. 2, cooperating business applications run at both ends of a communication path, as with business server A (2a) and business server C (2c), but the configuration is not limited to this. For example, a business application may run only on the business server at one end and thereby perform communication that passes through multiple nodes. The two ends of a communication path are also not limited to server devices such as business servers; they may be information processing apparatuses other than server devices in the narrow sense, such as network devices, client terminals such as PCs (personal computers) running programs such as web browsers, or bank ATMs (automated teller machines).

FIG. 3 is a diagram outlining an example of estimating the failure location when a failure is detected in a business application in the present embodiment. It shows a case in which, in the same system and network configuration as FIG. 2, the application monitoring server 1 detects, at the same time or within a short period of each other, that failures have occurred in the business applications running on business server C (2c) and business server D (2d).

Assume that, even though network monitoring has not detected any clear failure event, some node or device in the network configuration has a problem and is in a "half-dead" state, and that this is causing the failures in the business applications. The problematic node can then be presumed to be one of the nodes on the communication paths used by the failing business applications. Accordingly, when failures occur in multiple business applications at around the same time, the problem is likely to lie in the portion where the communication paths used by those applications overlap (node B (3b) in the example of FIG. 3).
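The FIG. 3 reasoning can be illustrated with a short Python sketch that first selects the failures detected close together in time and then intersects their paths. The time window, the event format, the application names, and the function name `overlapping_nodes` are assumptions made for this illustration; the source only says "at the same time or within a short period".

```python
from typing import Dict, List, Set, Tuple


def overlapping_nodes(
    events: List[Tuple[float, str]],
    paths: Dict[str, List[str]],
    now: float,
    window: float,
) -> Set[str]:
    """Intersect the paths of applications that failed within `window` seconds of `now`."""
    recent = {app for ts, app in events if 0 <= now - ts <= window}
    if not recent:
        return set()
    return set.intersection(*(set(paths[app]) for app in recent))


# Paths follow the FIG. 2 example; application names are hypothetical.
PATHS = {
    "app_AC": ["node A", "node B", "node C"],
    "app_BD": ["node D", "node B", "node E"],
}
# (timestamp in seconds, application name) pairs, made up for the example.
EVENTS = [(100.0, "app_AC"), (130.0, "app_BD")]

suspects = overlapping_nodes(EVENTS, PATHS, now=140.0, window=60.0)
nothing_recent = overlapping_nodes(EVENTS, PATHS, now=140.0, window=5.0)
```

How wide the window should be is an operational choice that the source does not specify.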

On the other hand, when a failure is detected in only one business application rather than in several as in the example of FIG. 3, the reasoning is reversed. Among the nodes on the communication path used by that business application, the problem is likely to lie in a node that overlaps none of the communication paths used by the other business applications (all of them, not only those in which a failure has been detected), or in a node that overlaps only a few such paths. Conversely, a node that overlaps the communication paths used by other business applications (or overlaps many such paths) is unlikely to be the source of the problem.

This inference is possible because, if the problematic node were on a communication path used by another business application, that application would also be likely to fail. The same inference applies when failures are detected in multiple business applications but their communication paths have no overlapping portion.
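For the single-failure case, one way to express "fewer overlapping paths means more suspicious" is to rank the nodes on the failed path by how many other applications' paths use them. The sketch below is only an illustration; the function name `rank_by_overlap` and the sample data are hypothetical, and the source does not prescribe a particular scoring.

```python
from collections import Counter
from typing import Dict, List


def rank_by_overlap(failed_app: str, paths: Dict[str, List[str]]) -> List[str]:
    """Order the failed application's nodes from most to least suspicious.

    A node used by fewer other applications' paths comes first.
    """
    usage: Counter = Counter()
    for app, nodes in paths.items():
        if app != failed_app:
            usage.update(set(nodes))  # count each path at most once per node
    # sorted() is stable, so ties keep their order along the path.
    return sorted(paths[failed_app], key=lambda node: usage[node])


# Illustrative data: only application "x" has reported a failure.
PATHS = {
    "x": ["A", "B", "C"],
    "y": ["D", "B", "E"],
    "z": ["C", "B"],
}
ranking = rank_by_overlap("x", PATHS)
```

Here node A is used by no other path, node C by one, and node B by two, so A is ranked as the most likely location.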

By taking the approach described above, the network failure analysis system of the present embodiment can estimate or narrow down the suspected failure locations in the network configuration when the usage status of a business application shows a failure or degradation, even if a network monitoring device or the like has not detected information on physical faults in network devices, lines, or cables, or signs of failure such as communication delay or packet loss. This makes it possible to take appropriate measures promptly to recover from or work around the failure.

<System configuration>
FIG. 1 is a diagram outlining a configuration example of an application monitoring server having a network failure analysis system according to an embodiment of the present invention. As shown in FIG. 1, the network failure analysis system of the present embodiment is implemented, for example, as a network failure analysis unit 20 on the application monitoring server 1. The application monitoring server 1 also hosts, for example, a business application monitoring unit 10 that monitors failures and abnormalities of the business applications running on one or more business servers 2. A network failure monitoring system (not shown) that monitors network failures and abnormalities may also be implemented on it. Each of these systems can of course be configured as an independent server system.

As shown in FIGS. 2 and 3, the application monitoring server 1 is a server system that monitors failures and abnormalities of business applications (not shown) running on one or more business servers 2 and, when a failure is detected, estimates or narrows down the suspected failure locations and outputs them. The application monitoring server 1 is configured, for example, from server hardware or a virtual server in a cloud computing environment, and implements units (subsystems) such as the business application monitoring unit 10 and the network failure analysis unit 20 as software programs running on an OS (operating system) and middleware such as a DBMS (database management system), neither of which is shown.

The business application monitoring unit 10 monitors failures and abnormalities of the business applications running on each business server 2 and, when it detects one, outputs it by notification or similar means. Whether a failure or abnormality has occurred in a business application can be determined, for example, by acquiring predetermined error messages or warning messages that the business application outputs during execution and examining their content.

Because the purpose of the present embodiment is to estimate suspected failure locations in the network, the predetermined error and warning messages exclude purely business-level or application-level errors (for example, those caused by invalid input data or operations, or by defects in the business application program). Existing or commercially available operation monitoring tools, solutions, packaged programs, and the like can be used as the business application monitoring unit 10 as appropriate.

The network failure analysis unit 20 functions as the network failure analysis system. Based on the results of the monitoring of business application failures and abnormalities by the business application monitoring unit 10, it uses the approach described above to estimate or narrow down one or more nodes (network devices) on the communication paths used by the business applications in which failures were detected as suspected failure locations, and outputs them. The network failure analysis unit 20 has, for example, a failure location estimation unit 21 implemented as a software program, and a communication requirement table 22 and a route information table 23, each implemented as a database, file table, or the like. It also holds, in files or in memory, the suspected failure path information 24 and suspected failure node information 25 output as the processing results of the failure location estimation unit 21.

The failure location estimation unit 21 uses the approach described above to estimate or narrow down, from the nodes on the communication paths used by the business applications in which failures were detected, one or more nodes as suspected failure locations, and outputs them. To that end, the network failure analysis unit 20 acquires, automatically or manually, information on the communication requirements that identify the communication used by each business application (for example, source and destination IP addresses and information on each node on the communication path), and stores it in advance in the communication requirement table 22 and the route information table 23.

When the business application monitoring unit 10 detects a failure, the failure location estimation unit 21 uses the failure notification message and the contents of the communication requirement table 22 and the route information table 23 to identify the communication path on which the failure occurred. It then estimates or narrows down, from the nodes on that communication path, the nodes that are suspected failure locations using the approach described above, and reports the result, for example by displaying it as a list on the screen of a monitoring client terminal.

At this time, information on the communication path on which the failure was newly detected is recorded in the suspected failure path information 24, and information on each node on that path is recorded in the suspected failure node information 25. The entries for communication paths and nodes recorded in the suspected failure path information 24 and the suspected failure node information 25 are deleted, automatically or manually, at timings such as recovery from the corresponding failure or the passage of a fixed time, so that the information always reflects the latest failure status.
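The record-and-expire behaviour described for the suspected failure path information 24 and suspected failure node information 25 could be modelled as below. This is a sketch under assumptions: the class name `SuspectStore`, its methods, the time-to-live policy, and the use of the communication requirement No. as the key are all hypothetical choices, not details given in the source.

```python
from typing import Dict, List, Set, Tuple


class SuspectStore:
    """Holds suspected paths and their nodes until recovery or expiry."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl  # seconds an entry stays valid (assumed policy)
        self._entries: Dict[int, Tuple[float, List[str]]] = {}

    def record(self, path_no: int, nodes: List[str], now: float) -> None:
        """Record a path on which a failure was newly detected."""
        self._entries[path_no] = (now, list(nodes))

    def resolve(self, path_no: int) -> None:
        """Drop an entry once the corresponding failure has been recovered."""
        self._entries.pop(path_no, None)

    def purge(self, now: float) -> None:
        """Drop entries older than the time-to-live."""
        self._entries = {
            no: (ts, nodes)
            for no, (ts, nodes) in self._entries.items()
            if now - ts <= self.ttl
        }

    def suspected_paths(self) -> List[int]:
        return sorted(self._entries)

    def suspected_nodes(self) -> Set[str]:
        return {n for _, nodes in self._entries.values() for n in nodes}


store = SuspectStore(ttl=600.0)
store.record(1, ["node A", "node B", "node C"], now=0.0)
store.record(7, ["node D", "node B", "node E"], now=500.0)
store.purge(now=700.0)  # entry 1 is 700 s old and expires; entry 7 remains
paths_after_purge = store.suspected_paths()
nodes_after_purge = store.suspected_nodes()
store.resolve(7)        # failure on path 7 recovered
paths_after_resolve = store.suspected_paths()
```

Passing `now` explicitly keeps the example deterministic; a real deployment would more likely read a clock.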

<Data structure>
FIG. 4 is a diagram outlining the data structure of the communication requirement table 22 in the present embodiment, with specific example data. The communication requirement table 22 holds information on the requirements of the communication used by each business application, and has items such as No., source address, destination address, protocol, and port number.

The No. item holds information such as a sequence number uniquely assigned to each communication requirement entry. The source address and destination address items hold the IP addresses of the business servers 2 or other devices that are the source and destination of the communication requirement. The protocol and port number items hold the protocol and the port number used for the communication requirement. As illustrated, the protocol item may contain values such as "tcp" and "udp". These items of the communication requirement table 22 are, for example, registered manually in advance by an administrator or the like.

In the example of FIG. 4, concrete data is registered in the entries whose No. item is "1" and "7" for the two communication paths indicated by arrows in the example of FIG. 2 (between business server A (2a) and business server C (2c), and between business server B (2b) and business server D (2d)), respectively.
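As a rough illustration of the communication requirement table 22, the entries can be modelled as records and looked up by source and destination address. The class and function names are hypothetical. The addresses are those given for the two paths in the description of FIG. 5; the protocol and port values are placeholders, because the concrete values of FIG. 4 are not reproduced in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CommRequirement:
    no: int
    src_addr: str
    dst_addr: str
    protocol: str  # e.g. "tcp" or "udp"
    port: int


# Protocol and port values below are placeholders, not taken from FIG. 4.
COMM_REQUIREMENTS = [
    CommRequirement(1, "192.168.1.1", "192.168.11.2", "tcp", 8080),
    CommRequirement(7, "192.168.10.1", "192.168.29.2", "udp", 514),
]


def find_requirement(table, src_addr, dst_addr) -> Optional[CommRequirement]:
    """Return the first requirement matching the address pair, or None."""
    for req in table:
        if req.src_addr == src_addr and req.dst_addr == dst_addr:
            return req
    return None


hit = find_requirement(COMM_REQUIREMENTS, "192.168.10.1", "192.168.29.2")
miss = find_requirement(COMM_REQUIREMENTS, "10.0.0.1", "10.0.0.2")
```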

FIG. 5 is a diagram outlining the data structure of the route information table 23 in the present embodiment, together with an example of concrete data. The route information table 23 holds information on the nodes belonging to the communication path that corresponds to each communication requirement registered in the communication requirement table 22, and has items such as No., source address, destination address, node number, node, previous node, and next node.

The No. item holds information such as a sequence number uniquely assigned to each node entry. The source address and destination address items hold the IP addresses of the business servers 2 or the like that are the source and destination of the communication path to which the target node belongs, and correspond to the source address and destination address items of the communication requirement table 22 shown in FIG. 4.

The node number item holds information such as a sequence number uniquely assigned to each node within the communication path to which the target node belongs. In the example of FIG. 5, the numbers "1", "2", and so on are assigned in order starting from the node closest to the source, but the numbering is not limited to this. The node item holds information, such as a machine name, that identifies the network device or the like constituting the target node. In the present embodiment, nodes are identified by names such as "Node A" and "Node B", corresponding to node A (3a), node B (3b), and so on in the example of FIG. 2. The previous node and next node items hold information, such as a machine name, that identifies the nodes one hop before and one hop after the target node on the communication path to which the target node belongs.

In the example of FIG. 5, as in FIG. 4, concrete data is registered in the entries whose No. item is "1" to "3" and "13" to "14" for the nodes on the two communication paths indicated by arrows in the example of FIG. 2 (between business server A (2a) and business server C (2c), and between business server B (2b) and business server D (2d)), respectively.

For example, among the nodes on the communication path whose source address is "192.168.1.1" and whose destination address is "192.168.11.2" (the entries whose No. item is "1" to "3"), the node whose node number is "1" ("Node A") is the first node on that path, as shown in the configuration example of FIG. 2. It therefore has no previous node, and the value of its previous node item is not set. Similarly, among the nodes on the communication path whose source address is "192.168.10.1" and whose destination address is "192.168.29.2" (the entries whose No. item is "13" to "15"), the node whose node number is "3" ("Node C") is the last node on that path. It therefore has no next node, and the value of its next node item is not set.
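The previous/next node convention described above can be sketched as follows. The record layout mirrors the items of the route information table 23, but the class and function names are hypothetical, and the intermediate and final node names on this path are invented for illustration (the text only states that "Node A" is the first node of the path from 192.168.1.1 to 192.168.11.2).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteRow:
    no: int
    src_addr: str
    dst_addr: str
    node_no: int
    node: str
    prev_node: Optional[str]  # None for the first node on the path
    next_node: Optional[str]  # None for the last node on the path


SRC, DST = "192.168.1.1", "192.168.11.2"
ROUTE_ROWS = [  # "Node B" and "Node C" on this path are illustrative only
    RouteRow(1, SRC, DST, 1, "Node A", None, "Node B"),
    RouteRow(2, SRC, DST, 2, "Node B", "Node A", "Node C"),
    RouteRow(3, SRC, DST, 3, "Node C", "Node B", None),
]


def hops_on_path(rows, src, dst):
    """Rows of one communication path, ordered from source side to destination side."""
    return sorted(
        (r for r in rows if r.src_addr == src and r.dst_addr == dst),
        key=lambda r: r.node_no,
    )


def links_are_consistent(rows, src, dst):
    """Check that each row's previous/next node agrees with its neighbours."""
    hops = hops_on_path(rows, src, dst)
    names = [None] + [r.node for r in hops] + [None]
    return all(
        r.prev_node == names[i] and r.next_node == names[i + 2]
        for i, r in enumerate(hops)
    )


path_nodes = [r.node for r in hops_on_path(ROUTE_ROWS, SRC, DST)]
consistent = links_are_consistent(ROUTE_ROWS, SRC, DST)
```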

The node information for each communication path registered in the route information table 23 can be registered manually. It is also possible to use, as appropriate, a known technique that acquires, collects, and analyzes the routing tables or the like configured on each node and automatically calculates the communication path information on the network. Furthermore, by performing this calculation periodically, for example daily, and/or in a timely manner triggered by a network configuration change or the like, the node information in the route information table 23 can be kept up to date even in a configuration that uses so-called dynamic routing, in which communication paths are changed or optimized automatically.
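The description leaves the automatic calculation to known techniques. Purely as an illustration of the idea, the sketch below follows collected next-hop entries from the first node toward a destination. The `next_hop` mapping, the node names, and the stopping rule are all assumptions; real routing tables use prefix matching and are considerably more involved.

```python
def trace_path(next_hop, first_node, destination, max_hops=32):
    """Follow per-node next-hop entries and return the ordered list of nodes.

    next_hop maps a node name to {destination address: next node name}.
    A node with no entry for the destination is treated as the last hop
    (this sketch does not distinguish "directly attached" from "no route").
    """
    path = [first_node]
    current = first_node
    while True:
        nxt = next_hop.get(current, {}).get(destination)
        if nxt is None:
            return path
        if nxt in path or len(path) >= max_hops:
            raise ValueError("routing loop or path too long")
        path.append(nxt)
        current = nxt


# Hypothetical next-hop data collected from each node.
NEXT_HOP = {
    "Node A": {"192.168.11.2": "Node B"},
    "Node B": {"192.168.11.2": "Node C"},
    "Node C": {},
}
traced = trace_path(NEXT_HOP, "Node A", "192.168.11.2")
```

Re-running such a trace daily, or whenever the network configuration changes, would correspond to the periodic refresh of the route information table 23 described above.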

In the examples of FIGS. 4 and 5, communication paths are identified by registering, in the communication requirement table 22 shown in FIG. 4, communication requirements identified by the source and destination IP addresses, but the method of identifying a communication path is not limited to this. For example, communication path information can also be associated directly with identification information, such as that of jobs or processes, of the business applications monitored by the business application monitoring unit 10, and managed in that form.

FIG. 6 is a diagram outlining the data structure of the job table 22' in the present embodiment, together with an example of concrete data. The job table 22' holds information on the jobs executed as business applications on each business server 2, and has items such as job ID and application identification ID. The job ID item holds information, such as an ID or a number, that uniquely identifies each job. The application identification ID item holds information, such as an ID, that identifies the application program, module, or the like executed in the target job. A single job may execute more than one application program.

FIG. 7 is a diagram outlining the data structure of the application route information table 23' in the present embodiment, together with an example of concrete data. This table has an application identification ID item in place of the source address and destination address items in the example of the route information table 23 shown in FIG. 5. This item identifies the business application that uses the communication path to which the target node belongs, and corresponds to the application identification ID item of the job table 22' shown in FIG. 6.

By using tables structured like the job table 22' shown in FIG. 6 and the application route information table 23' shown in FIG. 7, information on each node on the communication path in use can be obtained directly from the identification information of a job executed on a business server 2. For a job in which a failure has actually been detected, the failure may be ignored if it is minor, judged on the basis of information such as the content of the failure, its business importance, and its scope of impact. It may also be ignored if the job ends normally when it is rerun or the like immediately afterward.
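A minimal sketch of this job-based lookup, assuming the two tables are held as simple mappings. The job IDs, application identification IDs, and node assignments are invented for illustration and do not come from FIGS. 6 and 7.

```python
# Job table 22': job ID -> application identification IDs run by that job.
JOB_TABLE = {
    "JOB001": ["APP01", "APP02"],  # one job may run several applications
    "JOB002": ["APP03"],
}

# Application route information table 23': application ID -> ordered nodes.
APP_ROUTES = {
    "APP01": ["Node A", "Node B"],
    "APP02": ["Node B", "Node D"],
    "APP03": ["Node E"],
}


def nodes_for_job(job_id, job_table=JOB_TABLE, app_routes=APP_ROUTES):
    """Collect the nodes used by every application of a job, without duplicates."""
    nodes = []
    for app_id in job_table.get(job_id, []):
        for node in app_routes.get(app_id, []):
            if node not in nodes:
                nodes.append(node)
    return nodes


job_nodes = nodes_for_job("JOB001")
```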

The data structures (items) of the tables shown in FIGS. 4 to 7 above are merely examples. Other table or data structures may be used as long as they can hold and manage equivalent data.

<Process flow>
FIG. 8 is a diagram outlining an example of the flow of processing for estimating and narrowing down the suspected part of a failure in the network failure analysis unit 20 of the present embodiment. It is assumed that the communication requirements of each business application, and the information on the nodes on the communication path corresponding to each communication requirement, have been registered in advance in the communication requirement table 22 and the route information table 23, respectively.

When processing starts, the failure part estimation unit 21 first repeats, until a failure is detected, the process of determining whether a failure or abnormality has been detected in the results of the monitoring of each business application by the business application monitoring unit 10 (S01). When a failure is detected, it identifies, based on the content of the failure notification message or the like, the communication requirement corresponding to the failure from among the communication requirements registered in the communication requirement table 22, and acquires the source and destination IP addresses of that communication requirement (S02). It then acquires from the route information table 23 the information on each node (network device or the like) on the communication path, as the communication path information corresponding to that communication requirement (S03).

Next, it refers to the failure suspected path information 24 and acquires the information held there on the communication paths related to failures that have already occurred (S04). As described above, the information registered in the failure suspected path information 24 is retained only for a predetermined period after a failure is detected, and is deleted or invalidated once that period has elapsed. In other words, only entries for failures detected close together in time are registered, and these are presumed to be related to the newly detected failure.

Next, based on the result of step S04, it determines whether any already-occurred failure is registered (S05). If one is registered, it starts a loop that repeats processing for every node on the communication path, acquired in step S03, on which the failure was newly detected (S06). In the loop, it first checks whether the target node is included in any of the communication paths related to already-occurred failures acquired in step S04, that is, whether the communication path on which the failure was newly detected and a communication path related to an already-occurred failure overlap at the target node (S07, S08).

If there is no overlap in step S08, processing of the target node ends and processing moves to the next node (S10, S06). If there is an overlap, the target node is extracted as a suspected part (S09). This is because, when multiple failures have been detected, a node shared by the communication paths used by the business applications involved in those failures is considered highly likely to be the cause of the failures. At this point, for example, the number of communication paths related to already-occurred failures with which the node overlaps may be counted and used as one of the parameters representing the strength of the estimate. Information on the node extracted as a suspected part is registered in the failure suspected node information 25. Processing of the target node then ends and processing moves to the next node (S10, S06).
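The loop of steps S06 to S10, including the optional overlap count, can be sketched as below. Paths are represented simply as lists of node names; the function name and the sample paths are illustrative assumptions, not part of the described system.

```python
def suspects_by_overlap(new_path, failed_paths):
    """For each node on the newly failed path, count the earlier failed paths
    that also contain it; nodes with at least one overlap become suspects."""
    counts = {}
    for node in new_path:                                   # S06: loop over nodes
        overlap = sum(1 for p in failed_paths if node in p) # S07
        if overlap > 0:                                     # S08: overlap found?
            counts[node] = overlap                          # S09: record suspect
    return counts


new_path = ["Node A", "Node B", "Node C"]
failed_paths = [
    ["Node D", "Node B", "Node E"],
    ["Node B", "Node C"],
]
suspects = suspects_by_overlap(new_path, failed_paths)

# A higher overlap count is treated as a stronger estimate.
ranked = sorted(suspects, key=lambda node: -suspects[node])
```

In this sample, "Node B" lies on both earlier failed paths and is ranked ahead of "Node C", while "Node A" overlaps with none of them and is not extracted at this stage.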

When the loop over all nodes on the communication path on which the failure was newly detected has finished, the failure suspected node information 25 is referred to in order to determine whether any node was extracted as a suspected part (S11). If so, that node is output (S17) and processing ends.

On the other hand, if no node was extracted as a suspected part, or if no already-occurred failure was registered in step S05 (that is, this is the first failure detected), another loop is started that repeats processing for every node on the communication path, acquired in step S03, on which the failure was newly detected (S12). In this loop, it first checks whether the target node is included in any of the other communication paths registered in the communication requirement table 22 and the route information table 23 (for processing efficiency, these may be limited to paths on which no failure has been detected), that is, whether the communication path on which the failure was newly detected and another communication path overlap at the target node (S13, S14).

If there is an overlap in step S14, processing of the target node ends and processing moves to the next node (S16, S12). If there is no overlap, the target node is extracted as a suspected part (S15). This is because, when a failure has been detected in only one business application, or when there is no overlap with the communication paths related to already-occurred failures, a node that does not overlap with any other communication path, including those in a normal state, is considered highly likely to be the cause of the isolated failure.

A node with few overlaps, not only one with none, may also be the cause of an isolated failure. Therefore, even when there is an overlap in step S14, nodes whose number of overlaps is below a predetermined number, or a predetermined number of nodes taken in ascending order of the number of overlaps, may be extracted as suspected parts. In that case, for example, how few overlaps a node has may be used as one of the parameters representing the strength of the estimate, so that the more overlaps a node has, the weaker the estimate that it is the suspected part.
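The second loop (S12 to S16) and its relaxed variant can be sketched in the same list-of-node-names representation. The function name, the `threshold` parameter, and the sample paths are assumptions for illustration; a threshold of 1 reproduces the basic rule in which only nodes with no overlap are extracted.

```python
def suspects_by_isolation(new_path, other_paths, threshold=1):
    """Extract nodes on the newly failed path whose overlap count with the
    other registered paths is below `threshold`, fewest overlaps first."""
    counts = {
        node: sum(1 for p in other_paths if node in p)  # S13: count overlaps
        for node in new_path                            # S12: loop over nodes
    }
    picked = [(n, c) for n, c in counts.items() if c < threshold]  # S14, S15
    return sorted(picked, key=lambda item: item[1])


new_path = ["Node A", "Node B", "Node C"]
other_paths = [
    ["Node B", "Node D"],
    ["Node B", "Node C"],
    ["Node E"],
]
strict = suspects_by_isolation(new_path, other_paths)                # no overlap only
relaxed = suspects_by_isolation(new_path, other_paths, threshold=2)  # few overlaps
```

Here the ordering encodes the strength of the estimate in the opposite direction to the first loop: the fewer paths share a node, the more likely that node alone explains an isolated failure.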

Information on the node extracted as a suspected part is registered in the failure suspected node information 25. Processing of the target node then ends and processing moves to the next node (S16, S12). When the loop over all nodes on the communication path on which the failure was newly detected has finished, the nodes registered in the failure suspected node information 25 are output as the estimated suspected parts (S17), and processing ends. Before processing ends, the information on the communication path on which the failure was newly detected, acquired in steps S02 and S03, is registered in the failure suspected path information 24 as information on a communication path related to an already-occurred failure.

As described above, the network failure analysis system according to an embodiment of the present invention monitors the operation of the business applications running on each business server 2 in an information processing system. When a failure is detected, it analyzes whether the communication path used by the business application in which the failure was detected overlaps with communication paths related to already-occurred failures or with other, normal communication paths. This makes it possible to estimate or narrow down the nodes in the network configuration that are suspected parts of the failure, regardless of whether a failure has been detected at the network level.

As a result, even when a network monitoring device or the like has not detected information on physical faults in network devices, lines, cables, or the like, or information indicating signs of a failure such as communication delay or packet loss, the suspected part of the failure in the network configuration can be estimated when an event such as a failure or degradation occurs in the usage status of a business application, and appropriate measures can be taken to recover quickly or to avoid the failure.

The invention made by the present inventor has been described above in concrete terms based on an embodiment, but the present invention is not limited to that embodiment, and it goes without saying that various modifications can be made without departing from its gist. For example, the above embodiment has been described in detail to make the present invention easy to understand, and the invention is not necessarily limited to configurations that include every element described. In addition, other elements can be added to, deleted from, or substituted for part of the configuration of the above embodiment.

In the figures above, the control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of an actual implementation are necessarily shown. In practice, almost all components may be considered to be interconnected.

The present invention is applicable to a network failure analysis system and a failure analysis program that estimate the failed part of a network based on the failure detection status in a system.

1 ... application monitoring server, 2 ... business server, 2a to 2d ... business server A to business server D, 3a to 3e ... node A to node E,
10 ... business application monitoring unit,
20 ... network failure analysis unit, 21 ... failure part estimation unit, 22 ... communication requirement table, 22' ... job table, 23 ... route information table, 23' ... application route information table, 24 ... failure suspected path information, 25 ... failure suspected node information.

Claims

A network failure analysis system for analyzing a failure in a network system having an information processing apparatus on which a business application operates, wherein the network failure analysis system:
holds communication requirement information for identifying the communication used by each business application, and communication path information that is information on the nodes, consisting of network devices, on the communication path related to each communication requirement; and
when notified that a failure has been detected during execution of a business application, identifies, based on the content of the notification, the communication requirement information used by that business application and information on a first communication path related to that communication requirement, and, when there is a node that overlaps between the first communication path and a second communication path related to one or more failures that have already occurred, estimates that node as a suspected part of the failure.

The network failure analysis system according to claim 1,
wherein a node on the first communication path that overlaps with a larger number of the one or more second communication paths is estimated to be a suspected part with a higher degree of suspicion.

The network failure analysis system according to claim 1 or 2,
wherein, when there is no overlapping node between the first communication path and the one or more second communication paths, or when no second communication path exists, and there is, among the nodes on the first communication path, a node that does not overlap with one or more third communication paths, which are the communication paths used by all the other business applications, that node is estimated as a suspected part of the failure.

The network failure analysis system according to claim 3,
wherein, among the nodes on the first communication path, a node that overlaps with a smaller number of the one or more third communication paths is estimated to be a suspected part with a higher degree of suspicion.

A network failure analysis program for causing a computer to operate as a network failure analysis system that analyzes a failure in a network system having an information processing apparatus on which a business application operates,
the program causing the computer to execute processing that, when notified that a failure has been detected during execution of a business application, identifies, based on the content of the notification, communication requirement information for identifying the communication used by that business application and information on a first communication path, which is information on the nodes, consisting of network devices, on the communication path related to that communication requirement, and that, when there is a node that overlaps between the first communication path and a second communication path related to one or more failures that have already occurred, estimates that node as a suspected part of the failure.