JP6438875B2

JP6438875B2 - Network monitoring apparatus and network monitoring method

Info

Publication number: JP6438875B2
Application number: JP2015208571A
Authority: JP
Inventors: 高田　篤; 篤高田; 裕司副島
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-10-23
Filing date: 2015-10-23
Publication date: 2018-12-19
Anticipated expiration: 2035-10-23
Also published as: JP2017085220A

Description

The present invention relates to a network monitoring apparatus and a network monitoring method for monitoring failures that occur in the devices constituting a network.

Carrier networks use a variety of devices, such as transfer devices and control devices. One method of displaying a network configuration draws each device at the corresponding location on a map, using the device's geographical position information (see Non-Patent Document 1). This method makes it easy to recognize where a device is located and what geographical range it covers. To make the cause and the affected range of a network anomaly easy to recognize, this display method points out the relevant location to the network administrator: when a device or link issues an alarm, or when an anomaly is suspected (for example, a device does not respond), an additional symbol is superimposed on the corresponding device or link in the configuration diagram, or the color of the device or link is changed. In the example shown in FIG. 7, devices and links are arranged on a map, and marks indicating the anomalous locations are superimposed on them.

Network monitoring operations use this conventional technology. A failure monitoring center, consolidated to some degree, centrally monitors failures of network devices nationwide and, for each device that has raised an alarm, promptly initiates the repair work appropriate to the alarm content. For example, when a monitored device raises a power failure alarm, an on-site worker (for example, in each data center) is instructed to replace the device's power supply package with a new one and to carry out the recovery work. Failures that can be handled remotely may be handled by the failure monitoring center without dispatching an on-site worker. By continuing such failure monitoring and repair operations, communication carriers maintain the quality of their networks.

Non-Patent Document 1: Naoki Tateishi and three others, "大規模ネットワークの情報可視化方式に関する検討" (A Study of Information Visualization Methods for Large-Scale Networks), The Institute of Electronics, Information and Communication Engineers, IEICE Technical Report, Vol. 112, No. 492, ICM2012-74, pp. 89-94, March 2013.

However, the technique described in Non-Patent Document 1 is effective only when the failure occurs in the very device that raised the alarm. When the failed device cannot raise alarms, or when the true cause of the failure cannot be determined from the content of the alarm information sent by the device that raised the alarm, the technique can lead to the wrong repair work. Specific cases are described below.

(Case A)
A device raised a memory error alarm, so an on-site worker replaced its memory. A few days later, however, the device raised memory error alarms again, several times. Analysis of the failure showed that the memory errors were caused by thermal runaway in an adjacent device in the same rack. The adjacent device had raised a thermal runaway alarm at the time, but because the alarm did not occur continuously, handling it had been postponed.

(Case B)
A device raised a power supply error alarm, so an on-site worker replaced its power supply package. A few days later, however, the device raised power supply error alarms again, several times. Analysis of the failure showed that devices on the same floor had reported the same power supply failure several times in the past. Because power failure alarms had been raised by multiple devices on the same floor, the cause was determined to be a failure of the floor's power supply device (distribution panel). This power supply device (distribution panel) was a device that cannot raise alarms.

In events such as Case A and Case B, the failure was handled with attention only to the device that raised the alarm, so the response did not lead to repair work matching the true cause.
In Case A, as shown in FIG. 8(b), thermal runaway in device 2 "D" causes a memory error in device 2 "A" in the same rack 5. As shown in FIG. 8(a), device 2 "D" sends a thermal runaway alarm to the network monitoring device 1a, such as an OpS (Operation System), and device 2 "A" sends a memory error alarm to the network monitoring device 1a. However, the network monitoring device 1a merely forwards the thermal runaway alarm of device 2 "D" and the memory error alarm of device 2 "A" to the operator terminal 3 as alarm information, unchanged. The operator terminal 3 simply sees separate alarm information from device 2 "D" and device 2 "A", so it cannot recognize that the memory error of device 2 "A" is caused by the thermal runaway of device 2 "D" in the same rack 5. In other words, the network administrator cannot directly recognize the true cause of the failure. As a result, identifying the cause could be delayed, or an incorrect failure response that does not address the true cause could be carried out.

In Case B, as shown in FIG. 9(b), the power supply device (distribution panel) 4 becomes unstable, and as a result device 2 "B" and device 2 "C" on the same floor send power supply error alarms to the network monitoring device 1a. However, as shown in FIG. 9(a), the network monitoring device 1a merely forwards the power supply error notifications of device 2 "B" and device 2 "C" to the operator terminal 3 as alarm information, unchanged. Because the power supply device (distribution panel) 4 cannot raise alarms, the network administrator cannot directly recognize the true cause of the failure. As a result, identifying the cause could be delayed, or an incorrect failure response that does not address the true cause could be carried out.

The present invention was made in view of these problems. Its object is to provide a network monitoring apparatus and a network monitoring method that can quickly identify the cause of a failure by unraveling the relationship among multiple alarms from physical location information, such as where the devices are installed, and estimating the cause of the failure.

To solve the above problem, the invention according to claim 1 is a network monitoring device that receives alarm information from each device constituting a network and monitors failures, the network monitoring device comprising: a storage unit that stores (i) facility information in which, in association with the identification information of each device, physical location information on where that device is placed is stored, and (ii) a cause scenario indicating a condition for identifying an estimated cause of a failure, taking into account the physical location information of the device that sent the alarm information, based on the physical location information of devices other than that device and on the alarm content of the alarm information from those other devices; an alarm receiving unit that, each time it receives alarm information containing the identification information of the sending device, stores that alarm information in the storage unit; and a cause analysis unit that uses the device identification information contained in the received alarm information to look up the facility information and detect the physical location information of the sending device, uses the detected physical location information and the alarm content contained in the received alarm information to determine whether alarm information from another device that matches the condition for identifying the estimated cause in the cause scenario is stored in the storage unit, and, if it is stored, identifies the estimated cause indicated by the cause scenario as the cause of the alarm content contained in the received alarm information.

The invention according to claim 3 is a network monitoring method for a network monitoring device that receives alarm information from each device constituting a network and monitors failures. The network monitoring device comprises a storage unit that stores (i) facility information in which, in association with the identification information of each device, physical location information on where that device is placed is stored, and (ii) a cause scenario indicating a condition for identifying an estimated cause of a failure, taking into account the physical location information of the device that sent the alarm information, based on the physical location information of devices other than that device and on the alarm content of the alarm information from those other devices. The network monitoring device executes: a step of storing alarm information in the storage unit each time it receives alarm information containing the identification information of the sending device; and a step of using the device identification information contained in the received alarm information to look up the facility information and detect the physical location information of the sending device, using the detected physical location information and the alarm content contained in the received alarm information to determine whether alarm information from another device that matches the condition for identifying the estimated cause in the cause scenario is stored in the storage unit, and, if it is stored, identifying the estimated cause indicated by the cause scenario as the cause of the alarm content contained in the received alarm information.

In this way, the network monitoring device stores alarm information received from a device constituting the network in the storage unit, and uses the device identification information contained in that alarm information to look up the facility information stored in the storage unit and detect the physical location information of the sending device. Then, using the detected physical location information and the alarm content contained in the received alarm information, the network monitoring device identifies the estimated cause indicated by the cause scenario as the cause of the received alarm content, provided that alarm information from another device that matches the condition for identifying the estimated cause is stored in the storage unit.
The network monitoring device can thus estimate the cause of a failure while taking the physical location information of the devices into account. Therefore, even when the true cause of a failure cannot be identified from the alarm content of the alarm information alone, the true cause can be identified quickly and the risk of an incorrect failure response can be eliminated.

The invention according to claim 2 is the network monitoring device according to claim 1, wherein the physical location information on where each device is placed includes at least data center and floor information, and wherein, on receiving alarm information, the cause analysis unit extracts from the storage unit the alarm information sent by devices located in the same data center and on the same floor, determines whether the extracted alarm information includes alarm information that matches the condition, and, if it does, identifies the estimated cause indicated by the cause scenario as the cause of the alarm content contained in the received alarm information.

With this configuration, the network monitoring device extracts, from all the alarm information stored in the storage unit, only the alarm information sent by devices located in the same data center and on the same floor as the device that sent the received alarm information, and determines whether that subset matches the condition of the cause scenario. This reduces the processing load on the network monitoring device and allows the estimated cause to be identified more quickly.

According to the present invention, it is possible to provide a network monitoring apparatus and a network monitoring method that quickly identify the cause of a failure by unraveling the relationship among multiple alarms from physical location information, such as where the devices are installed, and estimating the cause of the failure.

FIG. 1 is a diagram outlining the cause analysis process that the network monitoring device according to the present embodiment executes in Case A (a memory error caused by thermal runaway).
FIG. 2 is a diagram outlining the cause analysis process that the network monitoring device according to the present embodiment executes in Case B (a power supply error caused by a failure of the power supply device).
FIG. 3 is a functional block diagram showing the configuration of the network monitoring device according to the present embodiment.
FIG. 4 is a diagram showing an example data structure of the facility information stored in the facility information DB according to the present embodiment.
FIG. 5 is a diagram showing an example data structure of the cause scenario information stored in the cause scenario DB according to the present embodiment.
FIG. 6 is a flowchart showing the flow of the processing executed by the network monitoring device according to the present embodiment.
FIG. 7 is a diagram showing a conventional example in which devices and links are arranged on a map and marks indicating anomalous locations are superimposed on them.
FIG. 8 is a diagram illustrating a case in which the true cause of a failure cannot be determined from the content of the alarm information (Case A: a memory error caused by thermal runaway).
FIG. 9 is a diagram illustrating a case in which the true cause of a failure cannot be determined from the content of the alarm information (Case B: a power supply error caused by a failure of the power supply device).

Next, the network monitoring device 1 and the network monitoring method according to a mode for carrying out the present invention (hereinafter, "the present embodiment") are described.

<Outline of the Invention>
First, an outline of the present invention is given.
The network monitoring device 1 according to the present embodiment receives alarm information from each device 2 constituting the network, analyzes the physical location information of each device 2 (data center, floor, rack, and so on), identifies the cause of the failure of the device 2, and sends the result to the operator terminal 3.

Specifically, in Case A described above, as shown in FIG. 1, when the network monitoring device 1 receives memory error alarm information from device 2 "A", for example, it detects the physical location information of device 2 "A" and analyzes the cause together with the alarm information of the other devices 2 placed in the same rack (here, the thermal runaway alarm information from device "D"). This is the cause analysis process. The network monitoring device 1 then sends the operator terminal 3 a cause alarm message indicating the estimated cause obtained by the cause analysis process. In Case A, the cause alarm message states that the memory error of device "A" is estimated to be caused by the thermal runaway of device "D".

Likewise, in Case B described above, as shown in FIG. 2, when the network monitoring device 1 receives power supply error alarm information from device 2 "B", for example, it detects the physical location information of device 2 "B", executes the cause analysis process together with the alarm information of the other devices 2 placed on the same floor (here, the power supply error alarm information from device "C"), and sends the operator terminal 3 a cause alarm message indicating the resulting estimated cause. In Case B, the cause alarm message states that the cause is estimated to be a failure of the power supply device (distribution panel).
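The Case B inference can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Alarm` class, the `estimate_power_error_cause` function, the alarm strings, and the rule that a single power supply error from another device on the same floor is sufficient are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Alarm:
    """Alarm already joined with the sender's location (hypothetical layout)."""
    device_id: str
    data_center: str
    floor: str
    content: str


def estimate_power_error_cause(received: Alarm,
                               stored: Iterable[Alarm]) -> Optional[str]:
    """Return an estimated cause if another device on the same floor
    has also reported a power supply error; otherwise return None."""
    if received.content != "power supply error":
        return None
    for other in stored:
        same_floor = (other.data_center == received.data_center
                      and other.floor == received.floor)
        if (other.device_id != received.device_id
                and same_floor
                and other.content == "power supply error"):
            # Assumed message text; the panel itself cannot raise alarms.
            return "failure of the power supply device (distribution panel)"
    return None
```

Because the distribution panel sends no alarm of its own, the sketch infers its failure purely from the co-location of the devices that do report errors.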

In this way, when the network monitoring device 1 according to the present embodiment acquires alarm information from a device 2, it estimates the cause of the failure while taking into account the physical location information of that device 2 and the other devices. Consequently, even when the true cause of a failure cannot be identified from the alarm content of the alarm information alone, the network monitoring device 1 can quickly identify the true cause and notify the operator terminal 3 of it as a cause alarm.

<Configuration of the Network Monitoring Device>
Next, the configuration of the network monitoring device 1 according to the present embodiment is described.
FIG. 3 is a functional block diagram showing the configuration of the network monitoring device 1 according to the present embodiment.
The network monitoring device 1 is connected to each device 2 constituting the network (device "A", ..., device "D", ...) and to the operator terminal 3. It receives alarm information, such as notifications of failures, from each device 2, executes the cause analysis process, and then sends a cause alarm message to the operator terminal 3.
Each device 2 is a general network device monitored by the network monitoring device 1, for example a processing server, a router, or a switch. The operator terminal 3 is a terminal device that a network administrator operates to manage the network, and is implemented as a general-purpose computer.

As shown in FIG. 3, the network monitoring device 1 according to the present embodiment includes a control unit 10, an input/output unit 20, and a storage unit 30.
The input/output unit 20 inputs and outputs information to and from each device 2 and the operator terminal 3, to which it is communicatively connected. The input/output unit 20 consists of a communication interface (not shown) that sends and receives information over a communication line, and an input/output interface that handles input and output with input means such as a keyboard and output means such as a monitor (neither shown).

The storage unit 30 consists of storage means such as a hard disk, flash memory, or RAM (Random Access Memory). It stores the alarm information DB (database) 31, in which the alarm information received from each device 2 is accumulated, the facility information DB 32 (see FIG. 4, described later), the cause scenario DB 33 (see FIG. 5, described later), and so on.

The control unit 10 controls the network monitoring device 1 as a whole and, as shown in FIG. 3, includes an alarm receiving unit 11, a cause analysis unit 12, and a cause alarm notification unit 13.
The control unit 10 is implemented, for example, by a CPU (not shown) loading a program stored in the storage unit 30 into RAM and executing it.

The alarm receiving unit 11 receives alarm information from each device 2 and stores it in the alarm information DB 31 in the storage unit 30. The alarm information contains the identification information of the device 2 that sent it (for example, a device ID or the device's address, such as an IP address) and the alarm content. When the alarm receiving unit 11 receives alarm information from a device 2, it stores the alarm information in the alarm information DB 31 together with the date and time of reception. The alarm receiving unit 11 may delete alarm information stored in the alarm information DB 31 after a predetermined period has elapsed, which keeps an appropriate amount of storage free.
The alarm receiving unit 11 also outputs the received alarm information to the cause analysis unit 12.
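The storing and expiring behavior described above can be sketched in Python as follows. This is an illustrative sketch only: the `StoredAlarm` and `AlarmStore` names, the in-memory list, and the choice to purge on every reception are assumptions, and the source does not specify the length of the retention period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class StoredAlarm:
    device_id: str         # identification information of the sending device
    content: str           # alarm content, e.g. "memory error"
    received_at: datetime  # date and time of reception


class AlarmStore:
    """In-memory stand-in for the alarm information DB 31 (hypothetical)."""

    def __init__(self, retention: timedelta):
        self.retention = retention
        self.alarms = []

    def purge(self, now: datetime) -> None:
        """Drop alarms older than the retention period."""
        cutoff = now - self.retention
        self.alarms = [a for a in self.alarms if a.received_at >= cutoff]

    def receive(self, device_id: str, content: str, now: datetime) -> StoredAlarm:
        """Store an alarm with its reception time and return it, so that
        the caller can hand it on to the cause analysis step."""
        self.purge(now)
        alarm = StoredAlarm(device_id, content, now)
        self.alarms.append(alarm)
        return alarm
```

Passing `now` explicitly, rather than reading the clock inside the class, keeps the sketch deterministic and easy to test.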

For the acquired alarm information, the cause analysis unit 12 detects the physical location of the device 2 that sent it and executes a cause analysis process for the failure or the like that takes this physical location into account.
Specifically, when the cause analysis unit 12 acquires alarm information from the alarm receiving unit 11, it uses the identification information of the sending device 2 contained in the alarm information (device ID, device address, and so on) to look up the facility information DB 32, detects the physical location of that device 2, and then acquires the identification information of the devices on the same floor as that device 2.

FIG. 4 is a diagram showing an example data structure of the facility information 320 stored in the facility information DB 32 according to the present embodiment.
The facility information 320 stores, in association with the identification information of each device 2 (device ID, device address, and so on), the physical location information on where that device 2 is placed. Specifically, as shown in FIG. 4, the facility information 320 stores the items data center 323, floor 324, rack 325, and unit 326 as physical location information, in association with the device ID 321 and the address 322.
The device ID 321 is identification information unique to the device within the network system.
The address 322 is the address (for example, the IP address) of the device indicated by the device ID 321.
The data center 323 is identification information of the site (facility, building, and so on) where the device 2 is installed.
The floor 324 is identification information indicating the floor of the site where the device 2 is installed.
The rack 325 is identification information of the rack 5 placed on the floor where the device 2 is installed.
The unit 326 is the unit number within the rack 5 in which the device 2 is installed (for example, information indicating which position the unit occupies counting from the top of the rack 5).
The facility information 320 may be stored in the storage unit 30 of the network monitoring device 1, as in the present embodiment, or in an external DB device communicatively connected to the network monitoring device 1.
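As a rough illustration of this data structure, the facility information can be modeled in Python as one record per device, with a lookup that accepts either the device ID or the address. The field names mirror the items of FIG. 4, but the `FacilityRecord` class, the `find_location` function, and all sample values (including the addresses from the 192.0.2.0/24 documentation range) are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FacilityRecord:
    device_id: str    # device ID 321
    address: str      # address 322
    data_center: str  # data center 323
    floor: str        # floor 324
    rack: str         # rack 325
    unit: int         # unit 326 (position counted from the top of the rack)


# Hypothetical sample rows; the real contents of FIG. 4 are not reproduced here.
FACILITY_INFO = (
    FacilityRecord("A", "192.0.2.1", "DC1", "3F", "R05", 4),
    FacilityRecord("D", "192.0.2.4", "DC1", "3F", "R05", 5),
)


def find_location(identifier: str,
                  records: Sequence[FacilityRecord] = FACILITY_INFO
                  ) -> Optional[FacilityRecord]:
    """Look up a device by device ID or by address; None if unknown."""
    for record in records:
        if identifier in (record.device_id, record.address):
            return record
    return None
```

A linear scan is enough for a sketch; a deployment with many devices would more plausibly index the records by ID and by address.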

The cause analysis unit 12 uses the identification information of the device 2 (device ID, device address, and so on) to look up the facility information 320 and detects the physical location of that device 2 (data center 323, floor 324, rack 325, unit 326, and so on).

Then, from the information indicating the detected physical location of the device 2, the cause analysis unit 12 uses the data center 323 and floor 324 information to acquire the identification information (device ID 321 and address 322) of the devices 2 located on the same floor as the detected device 2.
The cause analysis unit 12 then refers to the alarm information DB 31 in the storage unit 30 using the acquired identification information of the devices 2 located on the same floor, and extracts the alarm information transmitted from those devices 2.
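The two lookups described above (same-floor devices, then their alarms) might be sketched as follows. The types Facility and Alarm and both function names are hypothetical. Excluding the alarming device itself from the result is an assumption made here because the cause scenarios concern other devices; the specification does not state it at this point.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Facility:
    device_id: str
    data_center: str
    floor: str


@dataclass(frozen=True)
class Alarm:
    device_id: str
    content: str


def same_floor_device_ids(source_id: str, facilities: List[Facility]) -> Set[str]:
    """IDs of devices in the same data center and on the same floor as source_id.

    Assumption: the source device itself is excluded from the result.
    """
    src = next(f for f in facilities if f.device_id == source_id)
    return {
        f.device_id
        for f in facilities
        if f.device_id != source_id
        and (f.data_center, f.floor) == (src.data_center, src.floor)
    }


def extract_alarms(device_ids: Set[str], alarm_db: Iterable[Alarm]) -> List[Alarm]:
    """Alarms in the alarm store that were transmitted by one of device_ids."""
    return [a for a in alarm_db if a.device_id in device_ids]
```

Comparing the data center together with the floor keeps a "3F" in one site from matching a "3F" in another.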

Subsequently, the cause analysis unit 12 refers to the cause scenario DB 33 stored in the storage unit 30 using the alarm content included in the received alarm information, and performs processing to identify the estimated cause of the failure indicated by that alarm content.

FIG. 5 is a diagram showing an example of the data structure of the cause scenario information 330 stored in the cause scenario DB 33 according to the present embodiment.
The cause scenario information 330 stores cause scenarios, each indicating a condition for identifying the estimated cause of a failure in consideration of the physical location information of the device 2 that transmitted the alarm information. Specifically, as shown in FIG. 5, the cause scenario information 330 stores the items alarm content 331, cause scenario 332, and estimated cause 333.
The alarm content 331 stores the type of alarm content included in the alarm information that the network monitoring device 1 receives from each device 2. For example, alarm contents such as "memory error" and "power error" are stored.

The cause scenario 332 defines a scenario (a condition for estimating the cause of a failure) for estimating the cause of the failure or the like corresponding to the alarm content 331. The condition defined in the cause scenario 332 is based on the physical location information of devices 2 other than the device 2 that transmitted the alarm information and on the alarm content of the alarm information transmitted by those other devices 2.
Specifically, for the case where the alarm content 331 is "memory error", the cause scenario 332 "a thermal runaway alarm has been issued by another device in the same data center, on the same floor, in the same rack, and in an adjacent unit" is defined. When alarm information from another device 2 satisfying this condition exists, the cause is identified as the content shown in the estimated cause 333 (here, "thermal runaway of the adjacent device is the cause (of the memory error)").
Here, an adjacent device means, for example, a device directly above or below when the units are installed vertically in each rack 5, that is, a neighboring device that is physically close.

Similarly, for the case where the alarm content 331 is "power error", the cause scenario 332 "a power error alarm has been issued by another device in the same data center and on the same floor" is defined. When alarm information from another device 2 satisfying this condition exists, the cause is identified as the content shown in the estimated cause 333 (here, "failure of the power supply device (distribution panel) is the cause").

The estimated cause 333 stores information indicating the cause that is estimated when the condition defined in the cause scenario 332 is satisfied.

Using the alarm content of the received alarm information, the cause analysis unit 12 identifies the estimated cause of the failure or the like by determining whether alarm information matching the condition of the cause scenario 332 has been received.
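One possible way to encode the two scenarios of FIG. 5 and the matching performed by the cause analysis unit 12 is sketched below. This is an illustration under stated assumptions, not the implementation of the embodiment: each alarm is assumed to have already been joined with its facility record, adjacency is assumed to mean a unit number differing by exactly one, and the names Alarm, Scenario, SCENARIOS, and estimate_cause are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Alarm:
    """An alarm already joined with its facility record (an assumption)."""
    device_id: str
    content: str
    data_center: str
    floor: str
    rack: str
    unit: int


@dataclass(frozen=True)
class Scenario:
    alarm_content: str                         # alarm content 331
    condition: Callable[[Alarm, Alarm], bool]  # cause scenario 332
    estimated_cause: str                       # estimated cause 333


def same_floor(a: Alarm, b: Alarm) -> bool:
    return (a.data_center, a.floor) == (b.data_center, b.floor)


def adjacent_unit(a: Alarm, b: Alarm) -> bool:
    # Assumption: "adjacent" means directly above or below in the same rack.
    return same_floor(a, b) and a.rack == b.rack and abs(a.unit - b.unit) == 1


SCENARIOS: List[Scenario] = [
    Scenario("memory error",
             lambda src, other: adjacent_unit(src, other)
             and other.content == "thermal runaway",
             "thermal runaway of an adjacent device"),
    Scenario("power error",
             lambda src, other: same_floor(src, other)
             and other.content == "power error",
             "failure of the power supply device (distribution panel)"),
]


def estimate_cause(received: Alarm, stored: List[Alarm]) -> Optional[str]:
    """Return the estimated cause if another device's alarm satisfies a scenario."""
    for scenario in SCENARIOS:
        if scenario.alarm_content != received.content:
            continue
        for other in stored:
            if other.device_id != received.device_id and scenario.condition(received, other):
                return scenario.estimated_cause
    return None
```

Keeping the condition as a predicate over the received alarm and one other alarm makes each row of the table independent, so further scenarios can be added without changing estimate_cause.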

The cause alarm notification unit 13 generates a cause alarm message including information on the estimated cause identified by the cause analysis unit 12, and notifies the operator terminal 3 via the input/output unit 20.
If the cause analysis processing by the cause analysis unit 12 could not identify the cause of the failure or the like because no alarm information matching the cause scenario 332 is stored in the alarm information DB 31, the cause alarm notification unit 13 may transmit the alarm information that the network monitoring device 1 received from the device 2 to the operator terminal 3 as is.

<Process flow>
Next, the processing executed by the network monitoring device 1 will be described with reference to FIG. 6.
FIG. 6 is a flowchart showing the flow of processing executed by the network monitoring device 1 according to the present embodiment.

First, the alarm receiving unit 11 of the network monitoring device 1 receives alarm information from one of the devices 2 (step S1). The alarm receiving unit 11 then stores the received alarm information in the alarm information DB 31 in the storage unit 30 and also outputs it to the cause analysis unit 12.

On acquiring the alarm information, the cause analysis unit 12 of the network monitoring device 1 refers to the facility information DB 32 (FIG. 4) using the identification information (device ID, device address, etc.) of the device 2 that issued the alarm, which is included in the alarm information, and detects the physical location of that device (data center, floor, rack, unit, etc.) (step S2).

Subsequently, based on the information indicating the physical location of the device 2 that issued the alarm, the cause analysis unit 12 refers to the facility information DB 32 and acquires the identification information (device ID, device address information, etc.) of the devices on the same floor as that device (step S3).

Next, using the acquired identification information of the devices 2 located on the same floor, the cause analysis unit 12 refers to the alarm information DB 31 (FIG. 3) in the storage unit 30 and extracts the alarm information transmitted from the devices 2 located on the same floor (step S4).

The cause analysis unit 12 then refers to the cause scenario DB 33 (FIG. 5) in the storage unit 30 and, based on the cause scenario 332 corresponding to the alarm content 331 of the target alarm information (for example, "memory error" or "power error"), determines whether the alarm information extracted in step S4 includes alarm information that satisfies the condition of that cause scenario 332 (step S5).

If alarm information satisfying the condition of the cause scenario 332 is found in step S5 (step S5: Yes), the cause analysis unit 12 acquires the estimated cause 333 corresponding to the cause scenario 332 from the cause scenario information 330 and identifies the estimated cause of the failure or the like (step S6).

The cause alarm notification unit 13 of the network monitoring device 1 then generates a cause alarm message including information on the estimated cause identified in step S6 and notifies the operator terminal 3 (step S7). The network monitoring device 1 then ends the processing.

On the other hand, if no alarm information satisfying the condition of the cause scenario 332 is found in step S5 (step S5: No), the cause alarm notification unit 13 transmits the alarm information received in step S1 to the operator terminal 3 as is (step S8). The network monitoring device 1 then ends the processing.
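Steps S1 to S8 of FIG. 6 can be condensed into a single function as in the sketch below. The in-memory dictionary and list standing in for the facility information DB 32 and the alarm information DB 31, the message layout, and the names handle_alarm and SCENARIOS are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Location = Tuple[str, str, str, int]  # (data center, floor, rack, unit)
# (alarm content, condition(source location, other location, other content), estimated cause)
Scenario = Tuple[str, Callable[[Location, Location, str], bool], str]

SCENARIOS: List[Scenario] = [
    ("memory error",
     lambda src, other, content: src[:3] == other[:3]
     and abs(src[3] - other[3]) == 1 and content == "thermal runaway",
     "thermal runaway of an adjacent device"),
    ("power error",
     lambda src, other, content: src[:2] == other[:2] and content == "power error",
     "failure of the power supply device (distribution panel)"),
]


def handle_alarm(alarm: Dict[str, str],
                 facility: Dict[str, Location],
                 alarm_db: List[Dict[str, str]],
                 scenarios: List[Scenario] = SCENARIOS) -> Dict[str, str]:
    """Process one received alarm and return the message for the operator terminal."""
    source = alarm["device_id"]
    alarm_db.append(alarm)                                    # S1: store the alarm
    src_loc = facility[source]                                # S2: locate the device
    peers = {dev for dev, loc in facility.items()             # S3: same-floor devices
             if dev != source and loc[:2] == src_loc[:2]}
    peer_alarms = [a for a in alarm_db if a["device_id"] in peers]  # S4
    for content, condition, cause in scenarios:               # S5: check scenarios
        if content != alarm["content"]:
            continue
        if any(condition(src_loc, facility[a["device_id"]], a["content"])
               for a in peer_alarms):
            # S6 and S7: estimated cause identified, build the cause alarm message
            return {"type": "cause_alarm", "device_id": source,
                    "content": alarm["content"], "estimated_cause": cause}
    # S8: no scenario matched, forward the received alarm as is
    return {"type": "raw_alarm", "device_id": source, "content": alarm["content"]}
```

In this sketch the first power error on a floor is forwarded unchanged, and a second power error from another device on the same floor produces a cause alarm, since only then does an alarm from another device exist in the store.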

In the description of the processing executed by the network monitoring device 1 shown in FIG. 6, the cause analysis unit 12 acquires the identification information (device ID, device address information, etc.) of the devices on the "same floor" in step S3, and in step S4 refers to the alarm information DB 31 and extracts the alarm information from the devices 2 located on the "same floor".
However, the present embodiment is not limited to this processing flow, and the following may be adopted instead.
The network monitoring device 1 sets in advance, in association with the alarm content included in the acquired alarm information, the range of physical locations for which the cause analysis unit 12 acquires device identification information (device ID, device address information, etc.) in step S3. For example, if the alarm content is "memory error", the range of the identification information of the other devices 2 acquired in step S3 is set to the devices 2 in the "same rack". Then, in step S4, the cause analysis unit 12 refers to the alarm information DB 31 and extracts the alarm information from the devices 2 located in the "same rack". If the alarm content is "power error", the range of the identification information of the other devices acquired in step S3 is set to the devices 2 on the "same floor". Then, in step S4, the cause analysis unit 12 refers to the alarm information DB 31 and extracts the alarm information from the devices 2 located on the "same floor".
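The preset association between alarm content and search range could be held in a small table, as in the following sketch. The names SCOPE_BY_ALARM and devices_in_scope are hypothetical, and falling back to the "same floor" range for alarm contents not listed in the table is an assumption rather than something stated in the specification.

```python
from typing import Dict, Set, Tuple

Location = Tuple[str, str, str, int]  # (data center, floor, rack, unit)

# Search range preset per alarm content, following the two examples in the text.
SCOPE_BY_ALARM: Dict[str, str] = {
    "memory error": "rack",
    "power error": "floor",
}
DEFAULT_SCOPE = "floor"  # assumption: unlisted alarm contents use the same-floor range

# Number of leading location fields that must match for each range.
_SCOPE_DEPTH: Dict[str, int] = {"floor": 2, "rack": 3}


def devices_in_scope(source_id: str, alarm_content: str,
                     facility: Dict[str, Location]) -> Set[str]:
    """IDs of the other devices inside the range preset for alarm_content (step S3)."""
    depth = _SCOPE_DEPTH[SCOPE_BY_ALARM.get(alarm_content, DEFAULT_SCOPE)]
    src = facility[source_id]
    return {dev for dev, loc in facility.items()
            if dev != source_id and loc[:depth] == src[:depth]}
```

Because the rack range compares the data center, floor, and rack together, a rack identifier that is reused on another floor does not produce a false match.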

In this way, out of all the alarm information stored in the alarm information DB 31, the network monitoring device 1 can extract only the alarm information within the range that may be relevant to the cause analysis of the alarm information actually being processed. This reduces the processing load on the network monitoring device 1 and allows the estimated cause to be identified more quickly.

As described above, according to the network monitoring device 1 and the network monitoring method of the present embodiment, when alarm information is acquired from a device 2, the cause of the failure is estimated in consideration of the physical location information of that device 2 and other devices. As a result, even when the true cause of the failure cannot be identified from the alarm content of the alarming device alone, the true cause can be identified quickly and notified to the operator terminal 3.

1 Network monitoring device
2 Device
3 Operator terminal
4 Power supply device (distribution panel)
5 Rack
10 Control unit
11 Alarm receiving unit
12 Cause analysis unit
13 Cause alarm notification unit
20 Input/output unit
30 Storage unit
31 Alarm information DB
32 Facility information DB
33 Cause scenario DB
320 Facility information
330 Cause scenario information

Claims

A network monitoring device that receives alarm information from each device constituting a network and monitors failures, the network monitoring device comprising:
a storage unit that stores facility information, in which physical location information indicating where each device is arranged is stored in association with identification information of that device, and a cause scenario that indicates a condition for identifying an estimated cause of a failure in consideration of the physical location information of the device that transmitted the alarm information, the condition being based on physical location information of another device other than that device and on the alarm content of alarm information from the other device;
an alarm receiving unit that, each time alarm information including the identification information of a device is received from the device that transmitted the alarm information, stores the alarm information in the storage unit; and
a cause analysis unit that refers to the facility information using the device identification information included in the received alarm information to detect the physical location information of the device that transmitted the alarm information, determines, using the detected physical location information and the alarm content included in the received alarm information, whether alarm information of the other device that matches the condition for identifying the estimated cause in the cause scenario is stored in the storage unit, and, if it is stored, identifies the cause of the alarm content included in the received alarm information as the estimated cause indicated in the cause scenario.

The network monitoring device according to claim 1, wherein
the physical location information indicating where the device is arranged includes at least data center and floor information, and
when the alarm information is received, the cause analysis unit extracts from the storage unit the alarm information transmitted from devices located in the same data center and on the same floor, determines whether the extracted alarm information includes alarm information that matches the condition, and, if alarm information that matches the condition exists, identifies the cause of the alarm content included in the received alarm information as the estimated cause indicated in the cause scenario.

A network monitoring method of a network monitoring device that receives alarm information from each device constituting a network and monitors failures, wherein
the network monitoring device includes a storage unit that stores facility information, in which physical location information indicating where each device is arranged is stored in association with identification information of that device, and a cause scenario that indicates a condition for identifying an estimated cause of a failure in consideration of the physical location information of the device that transmitted the alarm information, the condition being based on physical location information of another device other than that device and on the alarm content of alarm information from the other device, and
the network monitoring device performs:
a step of storing, each time alarm information including the identification information of a device is received from the device that transmitted the alarm information, the alarm information in the storage unit; and
a step of referring to the facility information using the device identification information included in the received alarm information to detect the physical location information of the device that transmitted the alarm information, determining, using the detected physical location information and the alarm content included in the received alarm information, whether alarm information of the other device that matches the condition for identifying the estimated cause in the cause scenario is stored in the storage unit, and, if it is stored, identifying the cause of the alarm content included in the received alarm information as the estimated cause indicated in the cause scenario.