JP2013030826A

JP2013030826A - Network monitoring system and network monitoring method

Info

Publication number: JP2013030826A
Application number: JP2011163441A
Authority: JP
Inventors: Sachiko Ishikawa; 佐知子石川; Tomoaki Tsunokai; 友昭角皆; Junichi Tsuchiya; 順一土屋; Jinpei Mori; 仁平森; Yukio Kubota; 行男久保田; Kiyoaki Sakata; 清明坂田
Original assignee: Ricoh Co Ltd; Ricoh Technosystems Co Ltd
Current assignee: Ricoh Co Ltd; Ricoh Technosystems Co Ltd
Priority date: 2011-07-26
Filing date: 2011-07-26
Publication date: 2013-02-07

Abstract

PROBLEM TO BE SOLVED: To provide a network monitoring system which can notify an appropriate responsible person of failure occurrence at appropriate timing according to a failure level. SOLUTION: A network monitoring system comprises: failure detection means 12 for detecting a failure in an apparatus and/or a line; a failure determination table 18; an influence range determination table 20; a failure level table 19; a user table 24 in which notification destination users to be notified of failure states are registered in association with failure levels; influence range determination means 15 which determines the state of each factor by referring to the failure determination table, then determines an influence range by referring to the influence range determination table on the basis of the combination of factor states; failure level determination means 16 for determining a failure level by referring to the failure level table on the basis of the influence range and an elapsed time; and electronic mail transmission means 22 for transmitting an electronic mail reporting failure details to a notification destination user associated with the failure level in the user table.

Description

The present invention relates to a network monitoring system that monitors failures of devices or lines constituting a network and notifies network users of the failure state.

When a network failure occurs, operations at business sites and the like are disrupted, so a network monitoring system that can respond quickly to network failures is desired. Failures that occur in a network include failures of network devices such as servers and routers, line trouble, and excessive network load; a monitoring device connected to the network monitors each kind of failure with a method suited to it (see, for example, Patent Document 1). Patent Document 1 discloses an alarm monitoring system that includes a monitoring device, which executes processing according to information signals from a plurality of monitored devices and generates an alarm e-mail according to those signals, and an e-mail server, which sends and receives e-mail to and from a plurality of e-mail terminals.

The mail that the monitoring device issues upon detecting a failure alert is sent to an appropriate contact (ranging from the person in charge to management). This is because the handling method and the person in charge differ depending on the site of the failure, the time of day at which it occurred, and the importance of the failure (failure level).

Because the failure level changes with the time for which the failure continues, it is conceivable that an administrator measures the duration of the failure until recovery and raises the failure level each time a threshold is exceeded. Once the failure level has been raised, the monitoring device only needs to add the contacts corresponding to the new failure level and send the mail.

Conventionally, however, there has been no clear rule for determining the failure level or for measuring the elapsed time of a failure, and the relationship between the degree of a failure and the failure level has been ambiguous. The relationship between the elapsed time and the timing of raising the level has also been inconsistent. As a result, even for failures of the same degree, the timing at which the failure notification mail reaches the appropriate person in charge varies, and the response may be delayed.

In view of the above problem, an object of the present invention is to provide a network monitoring system capable of notifying an appropriate person in charge of the occurrence of a failure at an appropriate timing according to the degree of the failure.

The present invention is a network monitoring system that monitors failures of devices or lines constituting a network and notifies network users of the failure state, the system comprising: failure detection means for detecting failure information that includes a failure of a device or line connecting sites each provided with a local area network, the detection date and time, and the state of the device or line; a failure determination table that associates the site name, the importance of the site, the business hours of the site, and the failure content of the device or line with a criterion for determining whether the device is available or a criterion for determining whether the failure is wide or narrow in extent; an influence range determination table in which the size of the influence range of a failure of the device or line is registered for each combination of the states of the following factors: the importance, whether the failure detection date and time is within or outside the business hours, the availability state of the device or line, and the wide or narrow extent; a failure level table in which failure levels are registered in association with the size of the influence range and the elapsed time since failure detection; a user table in which notification destination users for the failure state are registered in association with the failure levels; influence range determination means for determining the state of each factor by referring to the failure determination table, and determining the influence range by referring to the influence range determination table based on the combination of the factor states; failure level determination means for determining the failure level by referring to the failure level table based on the influence range and the elapsed time; and e-mail transmission means for transmitting an e-mail reporting the content of the failure to the notification destination users associated with the failure level in the user table.

It is possible to provide a network monitoring system capable of notifying an appropriate person in charge of the occurrence of a failure at an appropriate timing according to the failure level.

A diagram showing an example of the schematic configuration of the network monitoring system.
An example of a hardware configuration diagram of the monitoring device, the failure management server, or the mail server.
An example of a functional block diagram of the network monitoring system.
An example of a diagram outlining the network monitoring system.
A diagram showing an example of the site identification table.
An example of a diagram showing a log detected by the monitoring device.
An example of a diagram visualizing the log detected by the monitoring device.
An example of a diagram explaining the influence range determined from the four factors.
An example of a diagram schematically explaining the influence range at the time of provisional determination.
An example of a diagram schematically explaining the influence range at the time of provisional determination of a wide-area failure.
A diagram showing an example of the range determination table.
An example of the failure level table.
An example of a diagram explaining the generation of the recording mail.
A diagram showing an example of the recording mail.
An example of a diagram explaining the relationship between failure levels and contacts.
An example of a diagram explaining the mail transmitted by the mail server.
An example of a flowchart explaining the procedure by which the network monitoring system monitors network failures.

Hereinafter, embodiments for carrying out the present invention will be described with reference to the drawings.
FIG. 1 is a diagram illustrating an example of a schematic configuration of the network monitoring system according to the present embodiment. The monitoring device 100 monitors the network to check whether any failure has occurred. When the monitoring device 100 detects a failure, the failure management server 200 analyzes the content of the failure and identifies the "influence range". The influence range is, for example, enormous, large, medium, or small.

The failure management server 200 also determines an initial "failure level" according to the influence range. In the example of the figure, the initial failure level is 3 if the influence range is enormous or large, 2 if the influence range is medium, and 1 if the influence range is small.

When the initial failure level has been determined, the failure management server 200 notifies the mail server 300 of the failure status and the failure level, and the mail server 300 sends mail to the contacts predetermined for that failure level.

While a failure continues, the failure management server 200 measures its duration and raises the failure level each time the duration exceeds a threshold. Each time the failure level changes, the failure management server 200 notifies the mail server 300 of the new failure level, and the mail server 300 sends mail to the contacts predetermined for that failure level.
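The initial level assignment and the time-based escalation described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the initial levels follow the example of FIG. 1, whereas the threshold values in `ESCALATION_THRESHOLDS_MIN`, the ceiling `MAX_LEVEL`, and the function name `failure_level` are hypothetical and do not appear in the text.

```python
from datetime import datetime, timedelta

# Initial failure level per influence range, as in the example of FIG. 1.
INITIAL_LEVEL = {"enormous": 3, "large": 3, "medium": 2, "small": 1}

# Hypothetical values (not given in the text): elapsed minutes at which the
# level is raised by one, and an assumed ceiling on the level.
ESCALATION_THRESHOLDS_MIN = (30, 60, 120)
MAX_LEVEL = 5


def failure_level(influence_range: str, detected_at: datetime, now: datetime) -> int:
    """Return the failure level of a failure that is still ongoing at `now`."""
    elapsed_min = (now - detected_at).total_seconds() / 60
    # One step up for every threshold that the duration has exceeded.
    steps = sum(1 for t in ESCALATION_THRESHOLDS_MIN if elapsed_min > t)
    return min(INITIAL_LEVEL[influence_range] + steps, MAX_LEVEL)


if __name__ == "__main__":
    t0 = datetime(2011, 7, 26, 9, 0)
    print(failure_level("medium", t0, t0 + timedelta(minutes=45)))
```

Keeping the thresholds in one table means every failure of the same influence range escalates on the same schedule, which is the uniformity the embodiment aims for.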

As described above, the network monitoring system 500 according to the present embodiment determines the failure level by analyzing the influence range, so the correspondence between the degree of a failure and the failure level can be made uniform. In addition, because the failure level rises each time the duration of the failure exceeds a threshold, the correspondence between the duration of a failure and the failure level can also be made uniform. Mail recipients for a given failure are therefore consistent, and a company or other organization can respond to the failure appropriately.

[Configuration example]
As illustrated in FIG. 1, the network monitoring system 500 includes a monitoring device 100, a failure management server 200, and a mail server 300. This is a functional division; the functions of these devices only need to be implemented on one or more computers.

Various network devices (the monitored devices in the figure) exist in the network. Network devices are necessary for providing services such as data communication, but they need not be components of the network monitoring system 500. A network device is, for example, a router, an L3 switch, or a server. A port of such a network device to which a cable is connected and an IP address is assigned serves as a connection point to another network and is therefore sometimes called an interface.

FIG. 2 shows an example of a hardware configuration diagram of the monitoring device 100, the failure management server 200, or the mail server 300. Since each of the monitoring device 100, the failure management server 200, and the mail server 300 only needs to function as a computer, the configuration is described below using the monitoring device 100 as the example.

The monitoring device 100 is one form of computer. The monitoring device 100 includes a CPU 101, a RAM 102, a ROM 103, a storage medium mounting unit 104, a communication device 105, an input device 106, a drawing control unit 107, and an HDD 108, which are connected to one another by a bus. The CPU 101 provides various functions by reading an OS (Operating System) and programs from the HDD 108 and executing them, and controls the overall processing performed by the monitoring device 100.

The RAM 102 is a working memory (main memory) that temporarily holds the data needed when the CPU 101 executes a program. The ROM 103 stores the BIOS (Basic Input Output System), a program for booting the OS, and static data.

A storage medium 110 can be attached to and detached from the storage medium mounting unit 104, which reads a program recorded on the storage medium 110 and stores it in the HDD 108. The storage medium mounting unit 104 can also write data stored in the HDD 108 to the storage medium 110. The storage medium 110 is, for example, a USB memory or an SD card.

The input device 106 is a keyboard, a mouse, a trackball, or the like, and accepts various operation instructions from the manufacturing manager of the monitoring device 100.

The HDD 108 may instead be a nonvolatile memory such as an SSD, and stores the OS, programs, and various data such as standard values. The monitoring device 100 has a program 111 that detects failures and reports failure information, issues failure alerts, and visualizes the failure status. The failure management server 200 has a program 111 that determines the influence range and the failure level. The mail server 300 has a program 111 that sends mail to the contacts corresponding to the failure level.

The communication device 105 is a NIC (Network Interface Card) for connecting to a network 301 such as the Internet, and is, for example, an Ethernet (registered trademark) card.

The drawing control unit 107 interprets the drawing commands that the CPU 101 writes to the graphics memory by executing the program 111, generates a screen, and draws it on the display 109.

FIG. 3 shows an example of a functional block diagram of the network monitoring system 500. Each of the monitoring device 100, the failure management server 200, and the mail server 300 is realized by a CPU executing a program in cooperation with hardware such as memory.

The monitored devices 400 are the resources needed to provide services over the network, such as the network devices and lines monitored by the monitoring device 100. All devices that make up the network are included, for example routers, servers, layer 3 switches, load balancers, and security devices. Each of these has an IP address for each of its network interfaces, and failures are monitored per interface. Because a network device usually has several interfaces, several failures may be detected from a single network device.

The monitoring device 100 includes a device monitoring unit 12, a site identification table 11, and a transmission/reception unit 13. The device monitoring unit 12 detects the occurrence of a failure in a monitored device 400. The site identification table 11 associates a failed device or line with the identification information of a site. There are various methods for detecting the occurrence of a failure, and the network monitoring system 500 of this embodiment is not limited to any particular one. Specific methods and the notion of a site are described later.

The failure management server 200 includes a range analysis unit 15, a failure level determination unit 16, a mail generation unit 17, a transmission/reception unit 14, a range determination table 18, and a failure level table 19. The range analysis unit 15 determines the influence range of a failure using the range determination table 18. The failure level determination unit 16 refers to the failure level table 19 and determines the failure level from the influence range and the duration of the failure. Details are described later. The mail generation unit 17 accepts the failure information and the mail information entered by the administrator, and generates a recording mail.

The mail server 300 includes a mail transmission unit 22, a transmission/reception unit 21, an incident management DB 23, a contact table 24, and a mail template 25. The mail transmission unit 22 sends mail reporting the failure to the contacts corresponding to the failure level. The incident management DB 23 is a database for recording the progress and history of failures. Although the incident management DB 23 is described here as part of the mail server 300, it may be hosted on an independent device such as a NAS (Network Attached Storage) or another server. The mail server 300 may also be hosted on the same computer as the failure management server 200.

[Outline]
FIG. 4 is an example of a diagram outlining the network monitoring system 500. The network monitoring system 500 includes the monitoring device 100, the failure management server 200, and the mail server 300 described above.
(1) First, the monitoring device 100 monitors network devices and lines for failures, and issues a failure alert when it detects one. The failure alert contains various information such as the site where the failure occurred, the date and time of detection, and the state of the device, for example whether communication is possible. This is hereinafter referred to as failure information. When a failure occurs, the monitoring device 100 turns on a warning light (Patlite) and outputs an alarm sound.
(2) The failure management server 200 acquires the failure information and determines the influence range of the failure based on the combination of the importance of the site, whether the failure detection date and time is within or outside the business hours of the site, the availability state of the network device or line, and whether the range affected by the failure of the network device or line is wide or narrow. The failure management server 200 can also be configured integrally with the monitoring device 100.

The failure management server 200 determines an initial failure level based on the influence range. A failure that continues for a long time is considered to become more serious, so the failure management server 200 raises the failure level as time passes.
(3) The failure management server 200 accepts from the administrator the information needed to send mail, and generates a recording mail using the failure information. The recording mail is stored in the incident management DB 23 for failure management.
(4) The mail server 300 determines the contacts based on the failure level and sends an e-mail reporting the status of the failure and related information. An e-mail is sent each time the failure level rises.

The monitoring device 100, the failure management server 200, and the mail server 300 are described in detail below.

[Monitoring by the monitoring device, etc.]
The monitored devices 400 broadly consist of network devices and lines. Because a line failure prevents the monitoring device 100 from communicating with the network device, failures of network devices and failures of lines need not be specifically distinguished. The monitoring device 100 can nevertheless detect that communication itself is impossible, for example by monitoring the voltage level of the line (L1 and L2 levels).

To monitor IP-level failures of a network device, the device monitoring unit 12 sends a Ping command to, for example, a router, and monitors whether a failure has occurred, and how severe it is, from the presence and number of timeouts. To monitor failures in a higher layer of a network device, the device monitoring unit 12 attempts, for example, to connect to a TCP port. If the network device is a Web server, it generates a TCP packet that specifies TCP port 80 and sends it to the network device. As before, it monitors whether a failure has occurred, and how severe it is, from the presence and number of timeouts. Changing the TCP port number according to the type of server allows monitoring suited to each server.
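A TCP-level check of the kind described above can be sketched with the Python standard library. This is an illustration, not the method of the embodiment: it performs a full TCP connection attempt with `socket.create_connection` rather than crafting individual packets, and the function names, the default timeout, and the number of attempts are assumptions. ICMP ping is not shown because raw sockets usually require elevated privileges.

```python
import socket


def tcp_probe(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers timeouts and refused connections
        return False


def count_failures(host: str, port: int, attempts: int = 3,
                   timeout_s: float = 2.0) -> int:
    """Probe `attempts` times and return how many probes failed."""
    return sum(0 if tcp_probe(host, port, timeout_s) else 1
               for _ in range(attempts))
```

For a Web server the port would be 80, for example `count_failures("192.0.2.10", 80)`, where the address is a placeholder; the returned count can then be compared with whatever criterion the operator uses to judge the degree of the failure.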

To monitor failures in layers above TCP, the device monitoring unit 12 generates a pseudo transaction according to an application-level protocol such as HTTP, SMTP (Simple Mail Transfer Protocol), FTP, or POP3, and sends it to the relevant server. In this case, more sophisticated monitoring is possible because the response code, and not only the presence or absence of a timeout, can be examined.

The device monitoring unit 12 may also monitor network devices using SNMP. The device monitoring unit 12 acts as the SNMP manager and the network device acts as the agent. The SNMP manager queries the contents of the MIB (Management Information Base) managed by the network device. In addition, when a network device detects a predesignated event, it sends an event notification called an SNMP trap to the monitoring device 100. The device monitoring unit 12 can therefore detect a failure that has occurred in a network device from the content of the event.

In this way, the device monitoring unit 12 collects the IP addresses that are closely related to the failure. Once the IP address is known, the site can be identified.

FIG. 5 is a diagram illustrating an example of the site identification table 11. The site identification table 11 has source, interface, and line fields. The source is the identification information of a site. A site is a department, a business office, a sales office, a data center, or the like. A site has a LAN (local area network) connected to a wide-area network and, seen from the communication network, is a small network delimited by a network address. A site is separated from the Internet side and from other sites by, for example, a router or an L3 switch. Sites are connected to one another by lines. The identification information of a site (the source) is in many cases the identification information of a router or an L3 switch. A (latitude, longitude) pair is attached to each source, so its physical location is known.

The interface is the IP address of a LAN-side interface or an Internet-side interface of the site. The line is the identifier, product name, or service name of the line connected to the interface.

With such a site identification table 11, the sites affected by a failed interface or line can be identified. Once the source is known, the common name of the site (such as the name of the business office) is also known.
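The lookups that the site identification table makes possible can be sketched as below. The three fields follow the description of FIG. 5, but the rows, the host names, the documentation-range IP addresses, and the function names are all hypothetical; the actual contents of FIG. 5 are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SiteEntry:
    source: str     # site identifier, often that of a router or L3 switch
    interface: str  # IP address of a LAN-side or Internet-side interface
    line: str       # identifier or service name of the attached line


# Hypothetical rows for illustration only.
SITE_TABLE = [
    SiteEntry("tokyo-rt01", "192.0.2.1", "WAN-Service-A"),
    SiteEntry("tokyo-rt01", "192.0.2.129", "LAN"),
    SiteEntry("osaka-rt01", "198.51.100.1", "WAN-Service-A"),
]


def source_for_interface(ip: str) -> Optional[str]:
    """Return the site (source) that owns the interface, or None if unknown."""
    for entry in SITE_TABLE:
        if entry.interface == ip:
            return entry.source
    return None


def sources_for_line(line: str) -> set[str]:
    """Return every site that has an interface attached to the given line."""
    return {entry.source for entry in SITE_TABLE if entry.line == line}
```

The first lookup answers "which site does this failed interface belong to", and the second answers "which sites are affected if this line fails".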

[Example of alerts generated by the monitoring device]
FIG. 6 is an example of a diagram illustrating a log detected by the monitoring device 100. In FIG. 6, each line of the log describes the operating state of one interface. Each line records the importance, the date and time, the source, and a message.

Some log entries describe normal operation and others describe failure alerts. Failure alerts are further classified by importance into an attention state, an alert state, an important alert state, and a danger state. The state into which a given failure alert is classified is configured in the monitoring device 100 in association with the detected interface and its state.
An important interface has stopped: important alert state
A normal interface has stopped: alert state
The "importance" field thus records the result of the judgment made by the monitoring device 100 based on the content of the failure alert, distinguishing the normal state, the alert state, and the important alert state. The "date and time" field records the date and time at which the failure was detected, obtained from a time server (not shown) or the like. The "source" field records the source name acquired from the network device, or the source read from the site identification table 11 based on the IP address of the interface. The "message" field records the IP address of the interface, the interface name, and the interface state. In addition, monitoring results of the monitoring device 100, such as the content of SNMP traps, are recorded in the log.
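The structure of one log line and the two classification rules above can be sketched as follows. The four fields come from the description of FIG. 6; everything else is an assumption for illustration, including the numeric ordering of the states in `Severity`, the set `IMPORTANT_INTERFACES`, the message layout, and the function names.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ordering, from least to most severe.
    NORMAL = 0
    ATTENTION = 1
    ALERT = 2
    IMPORTANT_ALERT = 3
    DANGER = 4


# Hypothetical set of interfaces registered as important.
IMPORTANT_INTERFACES = {"192.0.2.1"}


@dataclass
class LogEntry:
    severity: Severity   # "importance" field
    timestamp: datetime  # "date and time" field
    source: str          # "source" field
    message: str         # interface IP address, name, and state


def classify(interface_ip: str, is_up: bool) -> Severity:
    """Apply the two rules: important interface down vs. normal interface down."""
    if is_up:
        return Severity.NORMAL
    if interface_ip in IMPORTANT_INTERFACES:
        return Severity.IMPORTANT_ALERT
    return Severity.ALERT


def make_entry(source: str, interface_ip: str, interface_name: str,
               is_up: bool, when: datetime) -> LogEntry:
    """Build one log line for one interface."""
    state = "up" if is_up else "down"
    return LogEntry(classify(interface_ip, is_up), when, source,
                    f"{interface_ip} {interface_name} {state}")
```

Using an ordered enumeration makes a test such as "important alert state or higher" a simple comparison, for example `entry.severity >= Severity.IMPORTANT_ALERT`.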

FIG. 7 is an example of a diagram visualizing the log detected by the monitoring device 100. Region names and the names (service names) of the lines connecting the regions are displayed on a map of Japan. For reasons of display space, each region name stands for all the sites in that region; for example, the region name "Kanto" covers multiple sites in the Kanto region.

管理者が日本地図上の地域名をマウスでクリックすると、この地域名に含まれる都道府県がそれぞれ表示される。図では、関東地方という地域名がクリックされ、東京、茨城、栃木、千葉、という都道府県名が表示されている。 When the administrator clicks a region name on the map of Japan with the mouse, the prefectures included in the region name are displayed. In the figure, the area name Kanto region is clicked, and the prefecture names such as Tokyo, Ibaraki, Tochigi, and Chiba are displayed.

そして、管理者が日本地図上の都道府県名をマウスでクリックすると、クリックした都道府県に含まれる拠点の拠点名がそれぞれ表示される。 Then, when the administrator clicks the name of the prefecture on the map of Japan with the mouse, the names of the bases included in the clicked prefecture are displayed.

この拠点名又は拠点を示すアイコンは、例えば、重要警戒状態以上のアラートが検出されていると赤色で表示される。後述するように、赤色の拠点が例えば５個以上の場合、障害の対象範囲は"広"と判断される。 The base name or the icon indicating the base is displayed in red when an alert of an important alert state or higher is detected, for example. As will be described later, when there are five or more red bases, for example, the target range of the failure is determined to be “wide”.

このため、赤色の拠点が５個以上の場合、都道府県の地図の１つの都道府県名又は都道府県を示すアイコンが、赤色で表示される。１つ以上の都道府県が赤色の場合、日本地図上の地域名又は地域名を示すアイコンが赤色で表示される。 For this reason, when the number of red bases is five or more, an icon indicating one prefecture name or prefecture on the map of the prefecture is displayed in red. When one or more prefectures are red, the region name on the Japan map or an icon indicating the region name is displayed in red.

なお、図では日本を例にしたが各国のネットワークを同様に表示でき、また、世界地図上にネットワークを表示することもできる。 In the figure, Japan is taken as an example, but the network of each country can be displayed in the same manner, and the network can also be displayed on the world map.

〔障害管理サーバ２００〕
〔影響範囲〕
まず、影響範囲の考えたかについて説明する。影響範囲の決定には以下の4つの要素を使用する。
１．対象重要度
２．業務時間
３．提供サービス状態
４．対象範囲
１．対象重要度
対象重要度は、監視対象のネットワーク機器や回線に障害が生じた際に他の機器に与える影響度である。本実施形態では高、中、低の３つに区分する。
高：停止により企業全体の事業活動に大きな影響があるネットワーク機器
中：停止により影響を受ける業務や人員が所定数以上のネットワーク機器
低：その他
対象重要度が高のネットワーク機器は、例えば、メインサーバ（Ｍａｉｌ、ＤＢ、ＨＵＢ等）、共通ドメインサーバ、中のネットワーク機器はバックアップサーバ、パススルーサーバ、対象重要度が低のネットワーク機器はその他のサーバである。 [Fault management server 200]
〔Impact range〕
First, the idea of the range of influence will be described. The following four elements are used to determine the impact range.
1. Target importance
2. Business hours
3. Provided service status
4. Target range
1. Target importance
The target importance is the degree of influence on other devices when a failure occurs in a monitored network device or line. In this embodiment, it is divided into three classes: high, medium, and low.
High: Network devices whose outage has a major impact on the business activities of the entire company
Medium: Network devices whose outage affects at least a predetermined number of operations or personnel
Low: Others
Network devices with high target importance are, for example, main servers (Mail, DB, HUB, etc.) and common domain servers; those with medium target importance are backup servers and pass-through servers; those with low target importance are other servers.

２．業務時間
障害の検知時刻が、社員が業務を実施している時間帯の場合、影響が大きいと考えられる。よって、業務時間内、又は、業務時間外かを、影響範囲の決定要素とする。業務時間は企業によって異なるが、例えば次のようなパターンがある。
パターン１：２４ｈ（３６５日）
パターン２：７：００〜２２：００（月〜土）
パターン３：８：００〜２０：００（祝日を除く月〜金）
なお、月末、締め日（五十日）などだけ業務時間を変更してもよいし、サマータイムなどを反映させてもよい。 2. Business hours
If the failure is detected during the hours in which employees are working, the impact is considered to be large. Therefore, whether the detection time falls within or outside business hours is used as a determining factor of the influence range. Business hours vary from company to company; for example, there are the following patterns.
Pattern 1: 24h (365 days)
Pattern 2: 7:00 to 22:00 (Monday to Saturday)
Pattern 3: 8:00 to 20:00 (Monday to Friday excluding holidays)
The business hours may be changed only for particular days such as the end of the month and closing dates (gotobi, the days of the month that are multiples of five), and daylight saving time or the like may also be reflected.
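The three example patterns above can be expressed as a small lookup, sketched here in Python. The function name `is_business_hours`, the tuple layout, the `holidays` parameter, and the choice to treat the start time as inclusive and the end time as exclusive are assumptions made for illustration; the specification does not define them.

```python
from datetime import date, datetime, time

# Patterns from the text. Layout (an assumption): (start, end, weekdays, skip_holidays),
# with Monday = 0 and end = None meaning "no closing time".
PATTERNS = {
    1: (time(0, 0), None, range(7), False),         # 24h, 365 days
    2: (time(7, 0), time(22, 0), range(6), False),  # 7:00-22:00, Monday to Saturday
    3: (time(8, 0), time(20, 0), range(5), True),   # 8:00-20:00, Monday to Friday, no holidays
}

def is_business_hours(detected_at, pattern, holidays=frozenset()):
    """Return True if the detection time falls within the given business-hours pattern."""
    start, end, weekdays, skip_holidays = PATTERNS[pattern]
    if detected_at.weekday() not in weekdays:
        return False
    if skip_holidays and detected_at.date() in holidays:
        return False
    if end is None:
        return True
    # Assumption: start is inclusive, end is exclusive.
    return start <= detected_at.time() < end
```

The result of such a check would supply the "inside" or "outside" value of the business-hours element; special schedules for month-end or closing dates would need additional entries that are not sketched here.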

３．提供サービス状態
提供サービス状態は例えば提供サービスの利用が可能かどうかの状態である。提供サービス状態は、大きく回線による状態とサーバによるものがある。 3. Provided service status
The provided service status indicates, for example, whether the provided service can be used. It is broadly divided into status determined by the line and status determined by the server.

＜回線による提供サービス状態＞
・利用不可：障害が発生し、二重化された回線の両方が使用不可な状態
障害が発生し、単独回線が使用不可な状態
回線にて瞬断が発生し、１０分以内に再発した状態
初検知から再検知の復旧までの時間、通信が不安定なため提供サービスが利用不可な状態
二重化された回線で１分以下の通信断が複数回、発生し、経路切り換えが頻発している場合も該当する
・利用可能（一部断）：二重化された回線、機器の一方は使用可能だが、他方を使用することで提供サービスは利用可能な状態
＜サーバによる提供サービス状態＞
・利用不可：サーバへのアクセスが不可の障害
システムＤＢの破損障害
メール配信停止となる障害
サーバ運用不可（ミラーリング、正、副障害）となる障害
・利用可能：各種リソース（ディスク、メモリー）通知
メール配信状況に異常があり、滞留が発生している障害
バックアップ再取得不可の障害
障害が発生しているが、サーバ運用可の障害
サーバへのアクセス障害が発生したが、自動再起動処理により復旧した場合
４．対象範囲
１つの原因で障害が同時に多発し、対象数が多い（複数停止）場合は、業務への影響が大きいと推測される。このため、影響範囲を確定する要素として、本実施形態では対象範囲を取り入れる。対象範囲の決定には、同時（例えば数分以内など、ほぼ同時と見なせる時間差）の障害検知件数を利用する。同時の定義は、提供サービス毎に検知タイミングに差異が生じるため、拠点や回線毎に決定する。 <Service provided by line>
・ Unusable: A failure has occurred and both lines of a duplexed pair are unusable
A failure has occurred and a single (non-duplexed) line is unusable
A momentary interruption occurred on the line and recurred within 10 minutes
The provided service is unusable because communication is unstable during the period from the first detection until recovery from the re-detected failure
Also applies when communication interruptions of 1 minute or less occur multiple times on a duplexed line and route switching occurs frequently
・ Usable (partially down): With a duplexed line or device, the provided service remains usable by using one side of the pair
<Service provided by server>
・ Unusable: A failure that makes the server inaccessible
A failure in which the system DB is corrupted
A failure that stops mail delivery
A failure that makes the server inoperable (mirroring, primary, or secondary failure)
・ Usable: Notifications about resources (disk, memory)
A failure in which mail delivery is abnormal and mail is accumulating
A failure in which a backup cannot be retaken
A failure has occurred but the server can still operate
An access failure to the server occurred but was recovered by automatic restart processing
4. Target range
When a single cause produces many simultaneous failures and the number of affected targets is large (multiple outages), the impact on business is presumed to be large. For this reason, this embodiment adopts the target range as an element for determining the influence range. The target range is determined using the number of failures detected simultaneously (with a time difference that can be regarded as nearly simultaneous, for example within a few minutes). Because detection timing differs from one provided service to another, the definition of "simultaneous" is decided for each base or line.

例えば、ある単体の拠点に障害が発生すると、監視装置１００は、１件〜５件程のアラートを検知する。このため、単純な障害発生件数では、対象範囲の広さをはかれない。
そこで、本実施形態では、障害拠点件数を障害発生件数としてカウントし、対象範囲を広域性（障害拠点分布範囲）で表現する。広域とは、拠点が都道府県にまたがる場合、又は、各個別都道府県内でも多くの拠点に障害が検出される場合、と定義する。例えば、各都道府県単位で接続されている拠点数が５拠点以上であるとする。この場合、５拠点以上で同時の障害が発生した場合、各都道府県の多くの接続拠点が障害に影響を受けていると考えられるので、広域の障害とみなすことができる。
・広：５拠点以上
・狭：４拠点以下
想定される障害は、広の場合、地域災害、キャリア網障害、中継拠点障害、中継局障害、狭の場合、収容局障害によりある区内に影響がある障害、アクセス回線障害等で個別の拠点に影響がある障害、が想定される。 For example, when a failure occurs in a single base, the monitoring apparatus 100 detects about 1 to 5 alerts. For this reason, the scope of the target range cannot be measured with a simple number of failures.
Therefore, in the present embodiment, the number of failure bases is counted as the number of failure occurrences, and the target range is expressed by a wide area (failure base distribution range). A wide area is defined as a case where the bases are spread over prefectures, or a case where a failure is detected in many bases within each individual prefecture. For example, it is assumed that the number of bases connected to each prefecture is five or more. In this case, when simultaneous failures occur at five or more bases, it is considered that many connection bases in each prefecture are affected by the faults, and can be regarded as a wide-area fault.
・ Wide: 5 or more bases
・ Narrow: 4 or fewer bases
The failures assumed in the wide case are regional disasters, carrier network failures, relay base failures, and relay station failures. In the narrow case, they are failures that affect a particular ward because of a failure at the accommodating station (local exchange office), and failures that affect individual bases because of an access line failure or the like.
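The counting rule described above, in which failed bases rather than individual alerts are counted, can be sketched as follows. This is an illustrative sketch only: the function name `classify_target_range` and the representation of alerts as `(base_name, interface)` pairs are assumptions, and the caller is assumed to have already selected alerts that count as "simultaneous".

```python
WIDE_THRESHOLD = 5  # from the text: 5 or more bases is "wide", 4 or fewer is "narrow"

def classify_target_range(alerts, threshold=WIDE_THRESHOLD):
    """Classify the target range from alerts detected at about the same time.

    alerts: iterable of (base_name, interface) pairs (hypothetical layout).
    A single base may raise several alerts, so distinct bases are counted.
    """
    failed_bases = {base for base, _interface in alerts}
    return "wide" if len(failed_bases) >= threshold else "narrow"
```

With this approach, five alerts from one base still yield "narrow", while one alert from each of five bases yields "wide".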

図８は、４つの要素に基づき決定される影響範囲を説明する図の一例である。この図は、後述する影響範囲決定テーブル２０の一例となる。障害管理サーバ２００の範囲解析部１５は、対象重要度、業務時間、提供サービス状態、対象範囲の判定結果に応じて以下のように、影響範囲を決定する。
対象重要度：高業務時間：内又は外提供サービス状態：利用不可対象範囲：狭又は広の場合、影響範囲は甚大となる。
対象重要度：中業務時間：内提供サービス状態：利用不可対象範囲：狭の場合、影響範囲は大となる。
対象重要度：高業務時間：内及び外提供サービス状態：利用可能対象範囲：狭及び広の場合、影響範囲は中となる。
対象重要度：中業務時間：内及び外提供サービス状態：利用不可対象範囲：狭及び広の場合、影響範囲は小となる。 FIG. 8 is an example of a diagram illustrating an influence range determined based on four elements. This figure is an example of the influence range determination table 20 described later. The range analysis unit 15 of the failure management server 200 determines the influence range as follows according to the determination result of the target importance, the business time, the provided service state, and the target range.
When the target importance is high, the business hours are inside or outside, the provided service status is unusable, and the target range is narrow or wide, the influence range is enormous.
When the target importance is medium, the business hours are inside, the provided service status is unusable, and the target range is narrow, the influence range is large.
When the target importance is high, the business hours are inside and outside, the provided service status is usable, and the target range is narrow and wide, the influence range is medium.
When the target importance is medium, the business hours are inside and outside, the provided service status is unusable, and the target range is narrow and wide, the influence range is small.
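One way to picture the influence range determination table 20 is as an ordered rule list, sketched below with only the four rows quoted in the text. The names `RULES` and `influence_range` are hypothetical. Because the second and fourth rows as written overlap, this sketch assumes that the first matching row wins; the specification does not state how such overlaps are resolved, and FIG. 8 may contain further rows that are not reproduced here.

```python
# Each rule: accepted values for (importance, hours, status, target range), then the result.
# Only the four combinations quoted in the text are included.
RULES = [
    ({"high"},   {"inside", "outside"}, {"unusable"}, {"narrow", "wide"}, "enormous"),
    ({"medium"}, {"inside"},            {"unusable"}, {"narrow"},         "large"),
    ({"high"},   {"inside", "outside"}, {"usable"},   {"narrow", "wide"}, "medium"),
    ({"medium"}, {"inside", "outside"}, {"unusable"}, {"narrow", "wide"}, "small"),
]

def influence_range(importance, hours, status, target_range):
    """Return the influence range of the first matching rule, or None if no rule matches."""
    for imp, hrs, sts, rng, result in RULES:  # assumption: first match wins
        if importance in imp and hours in hrs and status in sts and target_range in rng:
            return result
    return None  # combination not covered by the rows quoted in the text
```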

＜影響範囲の仮決め＞
より具体的には、範囲解析部１５は、拠点毎、インタフェース毎のアラートを精査し、各要素を求める。しかし、障害の対象範囲が広い場合（広域の場合）、提供サービス状態の判別に時間を要するおそれがある。 <Provisional impact range>
More specifically, the range analysis unit 15 examines alerts for each base and each interface to obtain each element. However, when the target range of failure is wide (in the case of a wide area), it may take time to determine the provided service state.

そのため、まず、要素を対象重要度と業務時間及び提供サービス状態の３つに限定して影響範囲を仮決めし、ある程度の時間が経過した段階で本決定することも有効である。 For this reason, it is also effective to temporarily determine the influence range by limiting the elements to the target importance level, the business time, and the provided service status, and to make the final determination after a certain amount of time has passed.

図９は、仮決め時の影響範囲を模式的に説明する図の一例である。要素は対象重要度、業務時間及び提供サービス状態の３つである。 FIG. 9 is an example of a diagram schematically illustrating the influence range at the time of provisional determination. There are three elements: target importance, business hours, and provided service status.

図９では、対象重要度、業務時間及び提供サービス状態から、影響範囲（甚大大中小）が決定されている。例えば、関西ＤＣ（データセンタ）で平日の午前９時に提供サービスが利用不可となった場合、対象重要度が高、業務時間内であるため、影響範囲は甚大となる。また、例えば、サービスセンタ（ＳＳ）で休日１５時にサービス利用不可となった場合、対象重要度が低、業務時間外であるため、影響範囲は低となる。 In FIG. 9, the influence range (enormous, large, medium, or small) is determined from the target importance, business hours, and provided service status. For example, if the provided service becomes unusable at the Kansai DC (data center) at 9:00 a.m. on a weekday, the influence range is enormous because the target importance is high and the time is within business hours. Likewise, if the service becomes unusable at a service center (SS) at 15:00 on a holiday, the influence range is low because the target importance is low and the time is outside business hours.

また、広域障害の場合、提供サービス状態を確定することに時間がかかることも予想される。しかし、ＳＬＡ（Service Level Agreement）などのサービスの品質保証を維持するには、広域障害か否かを確定するまで待つべきない。そこで、広域障害の場合、対象重要度と業務時間のみから影響範囲を仮決めすることも有効である。 Further, in the case of a wide-area failure, determining the provided service status is also expected to take time. However, to maintain a service quality guarantee such as an SLA (Service Level Agreement), one should not wait until it has been determined whether the failure is a wide-area failure. Therefore, in the case of a wide-area failure, it is also effective to provisionally determine the influence range from the target importance and the business hours alone.

図１０（ａ）は広域障害の仮決め時の影響範囲を模式的に説明する図の一例である。要素は対象重要度及び業務時間の２つになっている。例えば、対象重要度に高が含まれており、業務時間内又は外の場合、影響範囲は甚大になる。また、対象重要度が低であり、業務時間外の場合、影響範囲は低になる。なお、暫定的に影響範囲を仮決定する際は、各要素で選択肢が２つ以上ある場合、影響範囲の確定時に判明する各要素の一番高いものを有効とする。こうすることで、障害レベルを大きい方にバイアスできるので、報告漏れ防止できる。 FIG. 10A is an example of a diagram schematically illustrating the influence range at the time of provisional determination for a wide-area failure. There are two elements: target importance and business hours. For example, if the target importance includes high, the influence range is enormous whether the time is within or outside business hours. If the target importance is low and the time is outside business hours, the influence range is low. When the influence range is provisionally determined and an element has two or more candidate values, the highest of the values that would become known when the influence range is finalized is adopted. This biases the failure level toward the higher side, which prevents omitted reports.
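The "adopt the highest candidate" rule for provisional determination can be sketched as a ranked maximum. The rank tables and the helper name `highest` are assumptions for illustration; in particular, ranking "inside" above "outside" follows the earlier remark that failures detected during working hours have the larger impact.

```python
# Hypothetical rank tables: a larger number means a larger impact.
IMPORTANCE_RANK = {"low": 0, "medium": 1, "high": 2}
HOURS_RANK = {"outside": 0, "inside": 1}

def highest(candidates, rank):
    """Return the candidate value with the highest rank for one element."""
    return max(candidates, key=lambda value: rank[value])
```

For a wide-area failure touching bases of low, medium, and high importance, `highest(["low", "high", "medium"], IMPORTANCE_RANK)` selects "high", so the provisional result errs on the side of over-reporting.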

提供サービス状態が確定すると、範囲解析部１５は影響範囲を修正する。図１０（ｂ）では、提供サービス状態が追加されている。提供サービス状態が追加されることで、影響範囲が変わるのであれば、判明した段階で、影響範囲を再決定する。また、新たな障害の原因が判明した場合、影響範囲は再決定される。 When the provided service state is determined, the range analysis unit 15 corrects the influence range. In FIG. 10B, a provided service state is added. If the range of influence changes due to the addition of the provided service status, the range of influence is re-determined at the stage where it is found. In addition, when the cause of a new failure is found, the affected range is redetermined.

〔影響範囲の決定〕
範囲解析部１５は、以上のような考え方に基づき障害の影響範囲（例えば甚大、大、中、小）を決定する。 [Determination of the scope of impact]
The range analysis unit 15 determines the influence range of the failure (for example, enormous, large, medium, or small) based on the above concept.

図１１は、範囲決定テーブル１８の一例を示す図である。範囲決定テーブル１８は、これまで説明した影響範囲の考え方に基づき、影響範囲を決定するためのテーブルである。まず、図１１（ａ）では、拠点毎に、拠点名、ソース、インタフェース、対象重要度、業務時間、利用可否判定基準、及び、対象範囲の広狭基準が登録されている。拠点名は一般的な呼称である。 FIG. 11 is a diagram illustrating an example of the range determination table 18. The range determination table 18 is a table for determining the influence range based on the concept of the influence range described so far. First, in FIG. 11A, for each base, a base name, a source, an interface, a target importance, a business time, an availability determination criterion, and a target range broadness criterion are registered. The site name is a general name.

ソースは、上述したように拠点で使用される各種のネットワーク機器の識別名である。１つの拠点で１つ以上のルータやレイヤ３スイッチが利用されているが、それぞれに名称を付けるのは困難なのでソースという統一した名称を付与した。 The source is an identification name of various network devices used at the base as described above. Although one or more routers and layer 3 switches are used at a single site, it is difficult to name each of them, so they have been given a uniform name as a source.

インタフェースも上述したものと同じであり、インタフェースを識別するため、例えばルータにおいて、ＷＡＮやインターネット等とを接続する境界のＩＰアドレスを利用する。例えば、拠点内にＬＡＮがあり他の拠点のＬＡＮとＷＡＮを形成している場合、拠点のルータのＬＡＮ側のＩＰアドレスとＷＡＮ側のＩＰアドレスが登録される。 The interface is also the same as described above. To identify an interface, for example, the IP address at the boundary where a router connects to the WAN, the Internet, or the like is used. For example, when a base has a LAN that forms a WAN together with the LANs of other bases, the LAN-side IP address and the WAN-side IP address of the router at that base are registered.

対象重要度は、拠点の重要度であり上記の考え方に基づき予め登録されている。業務時間も、予めパターン化されたいくつかの業務時間のパターンが登録されている。利用可否判定基準は、障害発生時のソース（ネットワーク機器や回線）の利用可否を判定するためのテーブルとのリンク情報が登録されている。 The target importance level is the importance level of the base, and is registered in advance based on the above concept. As for the business hours, some patterns of business hours patterned in advance are registered. In the availability determination criterion, link information with a table for determining availability of a source (network device or line) at the time of occurrence of a failure is registered.

図１１（ｂ）は川崎ＣＣ（MZR009A）の利用可否判定基準の一例を示す図である。利用可否判定基準には、インタフェースの死活状態又はその組み合わせに応じて、利用可否の判定が登録されている。例えば、川崎ＣＣ（MZR009A）の場合、３つのインタフェースFa1、Tu1、Tu2の全てが停止すると利用不可と判定されることが登録されている。 FIG.11 (b) is a figure which shows an example of the usability determination criteria of Kawasaki CC (MZR009A). In the usability judgment criterion, the judgment of usability is registered according to the alive state of the interface or a combination thereof. For example, in the case of Kawasaki CC (MZR009A), it is registered that when all three interfaces Fa1, Tu1, and Tu2 are stopped, it is determined that they cannot be used.

また、WAN側のインタフェースであるインタフェースTu1のみが停止した場合、利用可能（一部）と判定されることが登録されている。 In addition, it is registered that when only the interface Tu1 that is an interface on the WAN side is stopped, it is determined that it can be used (partly).
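The availability criterion for Kawasaki CC (MZR009A) can be pictured as a lookup keyed by the set of stopped interfaces. The sketch below contains only the two combinations stated in the text; the names `KAWASAKI_CC_CRITERIA` and `judge_availability` are hypothetical, and returning `None` for combinations the text does not mention is an assumption.

```python
# Only the two combinations given in the text for Kawasaki CC (MZR009A).
KAWASAKI_CC_CRITERIA = {
    frozenset({"Fa1", "Tu1", "Tu2"}): "unusable",   # all three interfaces stopped
    frozenset({"Tu1"}): "usable (partial)",         # only the WAN-side interface Tu1 stopped
}

def judge_availability(stopped_interfaces, criteria=KAWASAKI_CC_CRITERIA):
    """Look up the availability judgment for a combination of stopped interfaces."""
    # Assumption: combinations not registered in the criteria yield None.
    return criteria.get(frozenset(stopped_interfaces))
```

Keying on a `frozenset` makes the judgment independent of the order in which interface alerts arrive.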

対象範囲の広狭基準は、障害の対象範囲が広か狭の判定基準である。図では５個以上で「広」、４個以下で「狭」となっている。本実施形態では、重要警戒状態以上の拠点数をカウントするものする。よって、例えば、機器監視部１２が重要警戒状態のインタフェースを検出した場合、範囲解析部１５は拠点識別テーブル１１から重要警戒状態のインタフェースが対応づけられた拠点名を特定し、その拠点数をカウントする。そしてこの数を判定基準と比較して対象範囲が広か狭かを判定する。重複しないように重要警戒状態となったソースをカウントすることで拠点数が明らかになる。 The wide/narrow criterion for the target range is the criterion for judging whether the target range of a failure is wide or narrow. In the figure, five or more is “wide” and four or fewer is “narrow”. In this embodiment, the number of bases in the important alert state or higher is counted. Thus, for example, when the device monitoring unit 12 detects an interface in the important alert state, the range analysis unit 15 identifies, from the site identification table 11, the base name associated with that interface and counts the number of such bases. This number is then compared with the criterion to determine whether the target range is wide or narrow. Counting the sources that have entered the important alert state without duplication yields the number of bases.

また、拠点の数だけでなく、障害が検出されたネットワーク機器の数から障害の対象範囲が「広」か「狭」かを判定することができる。電源異常や空調異常などの環境面で障害が発生した場合、複数のネットワーク機器に障害が生じるおそれがある。例えば、ネットワーク機器の１つであるサーバに障害が生じると多くのユーザに影響を与える。このため、障害が検出されたネットワーク機器の数から障害の対象範囲が「広」か「狭」かを判定することも有効である。この場合の「対象範囲の広狭基準」は、拠点数の場合の広狭基準と同じでもよいし、異なっていてもよい。 Further, it is possible to determine whether the target range of the failure is “wide” or “narrow” based not only on the number of bases but also on the number of network devices in which the failure is detected. When an environmental failure such as a power supply abnormality or an air conditioning abnormality occurs, there is a possibility that a plurality of network devices may fail. For example, when a failure occurs in a server that is one of network devices, many users are affected. For this reason, it is also effective to determine whether the target range of the failure is “wide” or “narrow” from the number of network devices in which the failure is detected. In this case, the “wide / narrow reference for the target range” may be the same as or different from the wide / narrow reference for the number of bases.

例えば、広域を影響を受ける業務の範囲が広いことと定義する。この場合、ある拠点のサーバに障害が発生した場合に、どのような業務に影響が生じるかはサーバによって異なる。例えば、サーバに人事システムだけが含まれる場合と、人事システムと販売システムがが含まれる場合とでは、ユーザが影響を受ける業務範囲が異なる。したがって、拠点の数だけでなく、障害が生じたネットワーク機器の数が多い場合に、広域障害が発生したと判断できる。 For example, “wide area” may be defined as a wide range of affected business operations. In this case, which operations are affected when a failure occurs in a server at a given base depends on the server. For example, the range of operations affected for users differs between a server that hosts only a personnel system and a server that hosts both a personnel system and a sales system. Therefore, a wide-area failure can be judged to have occurred not only when the number of bases is large but also when the number of failed network devices is large.

範囲解析部１５は重要警戒状態のソースと業務種別（人事、販売、生産等）が対応づけられたテーブルを参照し、業務に影響のあるソースに障害が発生したか否かを判定する。そして、業務に影響のあるソースの数を重複しないようにカウントし、広狭基準と比較することで障害の対象範囲が「広」か「狭」かを判定する。 The range analysis unit 15 refers to a table in which the source of the critical alert state and the business type (personnel, sales, production, etc.) are associated with each other, and determines whether or not a failure has occurred in the source that affects the business. Then, the number of sources that have an influence on the business is counted so as not to overlap, and it is determined whether the target range of failure is “wide” or “narrow” by comparing with the wide and narrow criteria.

範囲解析部１５は、障害情報から範囲決定テーブル１８を参照し、４つの要素のそれぞれの状態を決定する。そして、上記の影響範囲決定テーブル２０を参照して、４つの要素の状態が適合する影響範囲（甚大、大、中、小）を決定する。 The range analysis unit 15 refers to the range determination table 18 using the failure information and determines the state of each of the four elements. Then, referring to the influence range determination table 20 described above, it determines the influence range (enormous, large, medium, or small) that matches the states of the four elements.

＜障害レベルの決定＞
障害レベル決定部１６は、障害レベルテーブル１９を参照して影響範囲から初期の障害レベルを決定する。
図１２（ａ）は障害レベルテーブル１９の一例を示す。障害レベルテーブル１９は、影響範囲と障害レベルを対応づける。そして、障害レベルは障害の経過時間と共に大きくなっていくことが特徴の1つになっている。 <Determination of failure level>
The failure level determination unit 16 refers to the failure level table 19 and determines an initial failure level from the affected range.
FIG. 12A shows an example of the failure level table 19. The failure level table 19 associates the affected range with the failure level. One of the features is that the failure level increases with the elapsed time of the failure.

なお、障害レベルの意味は例えば次のようになる。
・障害レベル５：経営としての危機管理が必要なレベル
・障害レベル４：業務的影響の経過管理が必要なレベル
・障害レベル３：業務代替手段の検討が必要なレベル
・障害レベル２：問合せ対応が必要なレベル
・障害レベル１：障害として管理が必要な最低レベル
より詳細な障害例として以下を挙げておく。
・障害レベル５：発生から長期化しており、重要なエリアや複数（又は広範囲）提供サービスで甚大な被害に結びつく恐れがあると想定される障害
・障害レベル４：重要なエリアや複数（又は広範囲）提供サービスに影響があり一定時間以上経過している障害
・障害レベル３：業務に何らかの影響が出ており、利用者が体感的に気づき、システム的にも検知している障害
・障害レベル２：重要度の低い特定のエリアに限定され、利用者が体感的に気づきにくいが、システム的に検知している障害
・障害レベル１：業務に実質的な被害が無いが、システム的に検知している障害（冗長化機能が働いている、等）
障害が長期にわたれば被害は大きいものとなるため、障害レベル決定部１６は経過時間を考慮して、障害レベルの変更を行う。ここで、障害レベルが１つレベルアップする閾値の時間は均一ではない。例えば、障害レベル１から障害レベル２に変更するための閾値の時間は、障害レベル２から障害レベル３に変更するための閾値の時間と、同じであるとは限らない（同じであってもよい）。また、閾値の時間は提供サービス毎に適宜設定されるものとする。 The meaning of the failure level is, for example, as follows.
・ Failure level 5: Level requiring crisis management by executive management
・ Failure level 4: Level requiring progress management of the business impact
・ Failure level 3: Level requiring consideration of alternative means of conducting business
・ Failure level 2: Level requiring response to inquiries
・ Failure level 1: Minimum level requiring management as a failure
More detailed failure examples are given below.
・ Failure level 5: A failure that has been prolonged since its occurrence and is expected to possibly lead to enormous damage in an important area or in multiple (or wide-ranging) provided services
・ Failure level 4: A failure that affects an important area or multiple (or wide-ranging) provided services and has continued for a certain time or longer
・ Failure level 3: A failure that has some impact on business, that users notice through experience, and that is also detected by the system
・ Failure level 2: A failure limited to a specific area of low importance, which users are unlikely to notice through experience but which is detected by the system
・ Failure level 1: A failure that causes no substantial damage to business but is detected by the system (e.g., a redundancy function is working)
If the failure lasts for a long period of time, the damage will be great. Therefore, the failure level determination unit 16 changes the failure level in consideration of the elapsed time. Here, the threshold time for the failure level to be increased by one is not uniform. For example, the threshold time for changing from failure level 1 to failure level 2 is not necessarily the same as the threshold time for changing from failure level 2 to failure level 3 (may be the same). ). In addition, the threshold time is appropriately set for each provided service.

図１２（ｂ）は経過時間と障害レベルの関係の一例を示す図である。 FIG. 12B is a diagram illustrating an example of the relationship between the elapsed time and the failure level.

経過時間は、例えば、下記を基準とし、それを境に障害レベルの変更を行う。
・1分・・・システム的に検知される瞬断の時間
・10分・・・現地利用者が体感的に気付く目安の時間
・60分・・・障害の継続により業務への影響が大きくなると判断できる時間
・240分・・・復旧時間（サービスレベル）が守れなくなる時間
なお、経過時間のカウントには、業務時間を考慮することが好ましい。影響範囲の決定に障害の検出日時が業務時間内か外かを判定要素としたが、これは検出日時だけを見ている。しかし、障害の継続時間が長くなり実時間が業務時間となれば障害レベルは大きいとみなすべきである。このため、障害レベル決定部１６は、例えば、実時間が業務時間となれば、障害レベルを１段階大きくする。また、業務時間内は、閾値時間に０．８などの定数を掛けて業務時間外よりも早く障害レベルが大きくなるようにする。 The elapsed time is, for example, based on the following, and the failure level is changed on the basis of the following.
・ 1 minute: duration of a momentary interruption detected by the system
・ 10 minutes: approximate time at which on-site users notice through experience
・ 60 minutes: time at which continuation of the failure can be judged to have a growing impact on business
・ 240 minutes: time beyond which the recovery time (service level) can no longer be met
Note that it is preferable to take business hours into account when counting the elapsed time. In determining the influence range, whether the failure detection date and time falls within or outside business hours was used as a factor, but this looks only at the detection date and time. However, if the failure continues and real time enters business hours, the failure level should be regarded as higher. For this reason, the failure level determination unit 16 raises the failure level by one step when, for example, real time enters business hours. In addition, during business hours, the threshold time is multiplied by a constant such as 0.8 so that the failure level rises earlier than outside business hours.

また、復旧予定時間が確定し、予め障害継続時間が判明している場合は、経過時間がゼロでも、復旧予定時間が確定した時点で、障害レベルを決めてしまう。これにより、障害レベルを早期に確定できる。 Further, when the scheduled recovery time is determined and the failure continuation time is known in advance, even if the elapsed time is zero, the failure level is determined when the scheduled recovery time is determined. Thereby, the failure level can be determined early.

図１２（ｃ）は提供サービスがＡＤＳＬ回線の場合の障害レベルの遷移を説明する図の一例である。ＡＤＳＬ回線はベストエフォート型なので、別影響範囲としている。すなわち、障害レベルは１から始まり、１０分経過後に障害レベルは２となるが、その後は経過時間に関係なく障害レベル２を維持する。 FIG. 12C is an example for explaining the transition of the failure level when the provided service is an ADSL line. Since the ADSL line is a best-effort type, the scope of influence is different. That is, the failure level starts from 1, and after 10 minutes, the failure level becomes 2. After that, the failure level 2 is maintained regardless of the elapsed time.
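The time-based escalation can be sketched as a per-service schedule of (threshold in minutes, level) pairs. `ADSL_SCHEDULE` follows the ADSL transition described above. `EXAMPLE_SCHEDULE` is purely hypothetical: it reuses the 10, 60, and 240 minute reference times, but the levels assigned to them are assumptions, since the contents of the failure level table 19 are not reproduced in the text. The function name `failure_level` and the way the 0.8 factor shortens every threshold during business hours are likewise illustrative.

```python
# From the text: ADSL starts at level 1, becomes level 2 after 10 minutes, then stays there.
ADSL_SCHEDULE = [(0, 1), (10, 2)]
# Hypothetical schedule for another service; the level values are assumptions.
EXAMPLE_SCHEDULE = [(0, 2), (10, 3), (60, 4), (240, 5)]

def failure_level(schedule, elapsed_minutes, in_business_hours=False, factor=0.8):
    """Return the highest level whose threshold time has been reached.

    During business hours each threshold is multiplied by `factor` (0.8 in the
    text's example), so the level rises earlier than outside business hours.
    """
    scale = factor if in_business_hours else 1.0
    level = None
    for threshold_minutes, candidate in schedule:
        if elapsed_minutes >= threshold_minutes * scale:
            level = candidate if level is None else max(level, candidate)
    return level
```

With `EXAMPLE_SCHEDULE`, a failure 50 minutes old is at level 3 outside business hours but already at level 4 inside them, because the 60 minute threshold shrinks to 48 minutes.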

＜メールの生成＞
障害管理サーバ２００のメール生成部１７は、障害情報、対象重要度、業務時間内・外、提供サービス状態、対象範囲、及び、管理者が入力した情報を受け付け、記録用メールを生成する。記録用メールはユーザに送信されるメールとは異なるメールである。 <Generate email>
The mail generation unit 17 of the failure management server 200 receives failure information, target importance, inside / outside business hours, provided service status, target range, and information input by the administrator, and generates a recording mail. The recording mail is different from the mail transmitted to the user.

図１３は、記録用メールの生成を説明する図の一例である。記録用メールの生成に入力される情報は、例えば以下のものである。
一連番号、日時、ソース情報、メッセージ（IPアドレス）
一連番号はメール生成部１７が最後に生成した一連番号に１を足すなどして、重複しない連番が付与される。日時は、管理者が所定のキー（例えば"ctrl"キー＋"："キー）を押下することで自動的に入力される。また、日時には障害情報に含まれる障害の検知時刻も取り込まれる。ソース情報は対象重要度又は対象範囲の決定時に明らかな拠点のソース名である。 FIG. 13 is an example of a diagram illustrating generation of a recording mail. The information input for generating the recording mail is, for example, the following.
Serial number, date / time, source information, message (IP address)
The serial number is assigned a serial number that does not overlap by adding 1 to the last serial number generated by the mail generation unit 17. The date and time are automatically input when the administrator presses a predetermined key (for example, “ctrl” key + “:” key). The date and time of failure detection included in the failure information is also taken in the date and time. The source information is the source name of the base that is apparent when the target importance or target range is determined.

メッセージは、管理者がＩＰアドレスを選択することで入力される。メール生成部１７は、範囲決定テーブル１８にてソースに対応づけられたＩＰアドレスを表示するので、管理者はその中から障害の発生している機器のＩＰアドレスを選択する。このＩＰアドレスからＬＡＮ側、ＷＡＮ側などの障害現象を記録用メールに記述できる。 The message is input by the administrator selecting an IP address. Since the mail generation unit 17 displays the IP address associated with the source in the range determination table 18, the administrator selects the IP address of the device in which the failure has occurred. From this IP address, failure phenomena on the LAN side, WAN side, etc. can be described in the recording mail.

図１４は、記録用メールの一例を示す図である。記録用メールには、１障害発見先、２障害発生地点、３障害日時、４お客様担当者、５障害現象、６障害原因、７処置、８一次切り分け結果、９備考等である。また、対象重要度、業務時間の内・外が含まれる。 FIG. 14 is a diagram illustrating an example of a recording mail. The recording mail includes 1 failure discovery destination, 2 failure occurrence point, 3 failure date and time, 4 customer staff, 5 failure phenomenon, 6 failure cause, 7 treatment, 8 primary classification result, 9 remarks, and the like. In addition, the target importance and internal / external business hours are included.

１障害発見先は障害を検出した装置名である。２障害発生地点は拠点名である。３障害日時は障害の発生日時、復旧日時（復旧した場合）、完了日時（例えば冗長構成によるバックアップ状態から障害前の状態に戻った場合）である。４お客様担当者は、障害のあった拠点を顧客が使用する場合の顧客の担当者名である。５障害現象は、アラートのメッセージから読み取り可能な障害の状態である。６障害原因は、障害の原因である。７処置は、障害に対し施された処置である。８一次切り分け結果は、障害の生じたルータなどを切り離したことによる状況である。 1 Failure discovery destination is the name of the device that detected the failure. 2 Failure occurrence point is the base name. 3 Failure date and time comprises the date and time of occurrence, of recovery (if recovered), and of completion (for example, when the system has returned from a backup state under the redundant configuration to the pre-failure state). 4 Customer staff is the name of the customer's person in charge when a customer uses the base where the failure occurred. 5 Failure phenomenon is the failure state that can be read from the alert message. 6 Failure cause is the cause of the failure. 7 Treatment is the action taken in response to the failure. 8 Primary classification result is the situation resulting from disconnecting the failed router or the like.

このうち、４，６，７，８，９は管理者が入力することができる。５の障害現象は障害アラートと範囲決定テーブル１８から明らかであるが、管理者が追記してもよい。なお、回線や機器名は、拠点識別テーブル１１や、不図示のデータベースから読み出される。 Of these, 4, 6, 7, 8, and 9 can be input by the administrator. The failure phenomenon 5 is apparent from the failure alert and range determination table 18, but may be added by the administrator. The line and device name are read from the site identification table 11 and a database (not shown).

メール生成部１７は作成した記録用メールをインシデント管理ＤＢ２３に登録する。インシデント管理ＤＢ２３は、障害の進捗、履歴を管理するためのデータベースである。また、障害レベル決定部１６が決定した障害レベルはメールサーバ３００に通知される。 The mail generation unit 17 registers the created recording mail in the incident management DB 23. The incident management DB 23 is a database for managing failure progress and history. Further, the failure level determined by the failure level determination unit 16 is notified to the mail server 300.

〔メールサーバ〕
メールサーバ３００のメール送信部２２は、障害レベルに応じて連絡先を決定して、障害の内容を伝えるメールを生成し、送信する。連絡先テーブル２４には、各連絡先に所属するユーザと、そのユーザのメールアドレスが登録されている。また、メールテンプレート２５にはメールの文章やフォーマットのテンプレートが登録されている。 [Mail server]
The mail transmission unit 22 of the mail server 300 determines a contact address according to the failure level, and generates and transmits a mail conveying the content of the failure. In the contact table 24, users who belong to each contact and their mail addresses are registered. The mail template 25 is registered with mail text and format templates.

＜障害レベルと連絡先＞
図１５は、障害レベルと連絡先の関係を説明する図の一例である。図では連絡先をユーザ側とサービス側に区分した。ユーザ側はネットワークシステムを利用する側の人間であり、例えば経営層、各社責任区、ＬＡＮ管理区（責任者）、ＬＡＮ管理区（担当者）、拠点（担当者）などがある。サービス側は、サービスを提供する側の人間であり、例えば、経営層、責任区、案件責任者、案件担当者である。なお、提供区はサービス運営部門であり、主管区はサービス提供側の管理部門である。 <Failure level and contacts>
FIG. 15 is an example of a diagram illustrating the relationship between failure levels and contacts. In the figure, contacts are divided into the user side and the service side. The user side consists of the people who use the network system and includes, for example, executive management, each company's responsible section, the LAN management section (manager), the LAN management section (staff), and the base (person in charge). The service side consists of the people who provide the service, for example, executive management, the responsible section, the project manager, and the project staff. The providing section is the service operation department, and the supervising section is the management department on the service provider side.

Failure level 1: user-side site (person in charge)
Failure level 2: LAN management department and project staff
Failure level 3: LAN management department (responsible person) and project lead
Failure level 4: LAN management department and responsible department
Failure level 5: management

A higher failure level includes the contacts of the lower levels. That is, at failure level 1 the mail is sent to the level 1 contacts. At failure level 2 the mail is sent to the contacts of levels 1 and 2. At failure level 3 the mail is sent to the contacts of levels 1 to 3. At failure level 4 the mail is sent to the contacts of levels 1 to 4. At failure level 5 the mail is sent to the contacts of levels 1 to 5.
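The cumulative rule above (a level includes the contacts of every lower level) can be sketched as follows. This is only an illustration: the dictionary `CONTACTS_BY_LEVEL` and the function `contacts_for_level` are hypothetical names, and the group labels paraphrase the list above rather than reproduce the contact table 24.

```python
# Hypothetical mapping from failure level to contact groups (paraphrased from the list above).
CONTACTS_BY_LEVEL = {
    1: ["user site (person in charge)"],
    2: ["LAN management department", "project staff"],
    3: ["LAN management department (responsible person)", "project lead"],
    4: ["LAN management department", "responsible department"],
    5: ["management"],
}


def contacts_for_level(level: int) -> list[str]:
    """Return the contacts of `level` and of every lower level, without duplicates."""
    if level not in CONTACTS_BY_LEVEL:
        raise ValueError(f"unknown failure level: {level}")
    result: list[str] = []
    for lv in range(1, level + 1):
        for contact in CONTACTS_BY_LEVEL[lv]:
            if contact not in result:  # a group may be listed at several levels
                result.append(contact)
    return result
```

For example, `contacts_for_level(2)` returns the level 1 site contact followed by the two level 2 groups. Mail addresses would then be looked up per group, which this sketch leaves out.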

Once the contacts are determined, the mail transmission unit 22 reads the mail addresses of the corresponding users from the contact table 24 and sets them as the destinations of the e-mail.

<Example of mail contents>
FIG. 16 is an example of a diagram explaining a mail transmitted by the mail server 300. The mail is sent by the mail server 300, rather than by the failure management server 200, so that the failure can be described to the user in appropriate wording. The mail transmission unit 22 has several mail templates 25 and generates a mail by filling in the necessary items in a mail template 25.

The mail template 25 contains "Subject", the message "Thank you for your continued support. This is the monitoring center. We have confirmed an alert notification from the monitoring system and are contacting you promptly.", and, for domestic use, the title "R-WAN Domestic Service Failure Report Form".

The template also contains the item names "1. Failure occurrence site", "2. Failure date and time", "3. Failure effect", "4. Connection network", "5. Failure cause", and "6. Response". In addition, it contains the messages "Details are currently under investigation and will be reported later. Thank you for your understanding." and "<Monitoring center, person in charge>: ○○ (xx-xxxx-xxxx)".

The mail transmission unit 22 completes the mail by adding the above items, the site name, the incident No., and the failure status to the mail template 25. "1. Failure occurrence site" is the failure occurrence site in the recording mail. "2. Failure date and time" is the failure date and time in the recording mail. "3. Failure effect" is content generated from the failure phenomenon, the action taken, and the primary isolation result in the recording mail, or is identical to that content. "4. Connection network" is the line in the recording mail. "5. Failure cause" is the failure cause in the recording mail. "6. Response" reads "in progress" while the failure has not been recovered and "handled" once it has been recovered.
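As a rough illustration of filling a template from the recording mail, the sketch below uses Python's standard `string.Template`. The layout of `MAIL_TEMPLATE`, the function `build_mail`, and the keys of the `record` dictionary are all assumptions made for this example; only the six item names come from the description.

```python
from string import Template

# Hypothetical plain-text layout; only the six item names come from the description.
MAIL_TEMPLATE = Template(
    "Subject: [$status] $site (Incident No. $incident_no)\n"
    "1. Failure occurrence site: $site\n"
    "2. Failure date and time: $detected_at\n"
    "3. Failure effect: $effect\n"
    "4. Connection network: $line\n"
    "5. Failure cause: $cause\n"
    "6. Response: $response\n"
)


def build_mail(record: dict, status: str, recovered: bool) -> str:
    """Fill the template from a recording-mail record (modelled here as a plain dict)."""
    return MAIL_TEMPLATE.substitute(
        status=status,
        site=record["site"],
        incident_no=record["report_no"],
        detected_at=record["detected_at"],
        effect=record["phenomenon"],
        line=record["line"],
        cause=record["cause"],
        # Item 6 depends only on whether the failure has been recovered.
        response="handled" if recovered else "in progress",
    )
```

The fixed greeting and closing messages of the template are omitted to keep the sketch short.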

The site name is the failure occurrence site in the recording mail, and the incident No. is the report No. generated by the mail generation unit 17. The failure status is selected from, for example, "occurrence", "occurrence/recovery", "recurrence", "recurrence/recovery", "occurrence/recovery/complete", and "recurrence/recovery/complete". At least the administrator knows whether the failure has been recovered. The status is "occurrence" if the failure has not been recovered and "occurrence/recovery" if it has. It is "recurrence" if the failure has not been recovered and the same site had a failure within a predetermined period in the past, and "recurrence/recovery" if such a recurrence has been recovered. When the system has returned to its original state, rather than running on a backup or the like, the status becomes "occurrence/recovery/complete" or "recurrence/recovery/complete".
With such a mail, the user who receives it can easily grasp the state of the failure.
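The status selection described above can be sketched as a small function. The names `is_recurrence` and `failure_status`, the shape of `history`, and the `window` parameter (the "predetermined period", whose value the description does not give) are assumptions for illustration.

```python
from datetime import datetime, timedelta


def is_recurrence(site: str, detected_at: datetime,
                  history: list[tuple[str, datetime]], window: timedelta) -> bool:
    """True if the same site had an earlier failure within `window` before `detected_at`."""
    return any(
        past_site == site and timedelta(0) < detected_at - past_time <= window
        for past_site, past_time in history
    )


def failure_status(recovered: bool, recurrence: bool, completed: bool = False) -> str:
    """Build the status string from the three conditions named in the description."""
    status = "recurrence" if recurrence else "occurrence"
    if not recovered:
        return status
    status += "/recovery"
    if completed:  # returned to the original state, not running on a backup
        status += "/complete"
    return status
```

In this sketch `completed` is ignored while the failure is unrecovered, since the description only attaches "complete" to a recovered failure.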

[Operation procedure]
FIG. 17 is an example of a flowchart explaining the procedure by which the network monitoring system 500 monitors a network for failures. The procedure of FIG. 17 is executed repeatedly while the monitoring device 100 is monitoring for failures.

First, the device monitoring unit 12 detects a failure and issues a failure alert (S10). The device monitoring unit 12 notifies the failure management server 200 of the failure information.

Next, the range analysis unit 15 of the failure management server 200 analyzes the influence range, and the failure level determination unit 16 determines the initial failure level (S20). The mail generation unit 17 transmits the recording mail to the mail server 300, and the transmission/reception unit 21 of the mail server 300 stores the recording mail in the incident management DB 23.

Then, the mail generation unit 17 transmits a mail to the contacts corresponding to the failure level (S30).

Because the device monitoring unit 12 continues to monitor the failure, it determines from the monitoring result whether the failure has been recovered (S40).

If the failure has been recovered (Yes in S40), the mail transmission unit 22 notifies the contacts of the recovery by mail (S50).
If the failure has not been recovered (No in S40), the failure level determination unit 16 determines whether to raise the failure level based on whether the threshold time has elapsed (S60). If the level is not raised, the system waits for recovery.

If the failure level is to be raised (Yes in S60), the failure level determination unit 16 increases the failure level by one (S70).

Then, the mail generation unit 17 transmits a mail to the contacts corresponding to the new failure level (S80). At this point it is preferable to also send the mail to contacts that have already been notified, so that they learn that the failure level has risen. Alternatively, the mail need not be sent to contacts that have already been notified, because they have already been informed that the failure occurred.
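The escalation part of the flow (S30 to S80) can be sketched as an event simulation. This is a simplification under stated assumptions: the function `run_incident`, the per-level `thresholds` (minutes since detection, assumed to increase with the level), and the event names are hypothetical, and a real system would poll the monitoring result rather than compute the events up front.

```python
from typing import Optional


def run_incident(initial_level: int, thresholds: dict[int, int],
                 recovered_at: Optional[int], max_level: int = 5) -> list[tuple[int, str, int]]:
    """Simulate S30-S80 for one failure.

    thresholds maps a level to the elapsed minutes at which that level is raised by one.
    recovered_at is the minute at which recovery is observed, or None if it never is.
    Returns (minute, event, level) tuples in chronological order.
    """
    level = initial_level
    events = [(0, "notify", level)]  # S30: first mail at the initial level
    while level < max_level and level in thresholds:
        deadline = thresholds[level]
        if recovered_at is not None and recovered_at <= deadline:
            break  # S40 Yes: recovered before the threshold elapsed
        level += 1  # S60 Yes -> S70: raise the level by one
        events.append((deadline, "escalate", level))  # S80: mail for the new level
    if recovered_at is not None:
        events.append((recovered_at, "recovered", level))  # S50: recovery mail
    return events
```

For instance, with thresholds of 30, 60, and 120 minutes for levels 1 to 3 and recovery at minute 90, the failure escalates at minutes 30 and 60 and the recovery mail goes out at level 3.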

As described above, the network monitoring system 500 of this embodiment determines a failure level from the failure information notified by the monitoring device 100 when a network failure occurs, and determines the contacts to be notified according to that failure level. Because the failure level is raised as the failure continues, the contacts can be controlled according to the severity of the failure.

DESCRIPTION OF SYMBOLS
11 Site identification table
12 Device monitoring unit
13, 14, 21 Transmission/reception unit
15 Range analysis unit
16 Failure level determination unit
17 Mail generation unit
18 Range determination table
19 Failure level table
20 Influence range determination table
22 Mail transmission unit
23 Incident management DB
24 Contact table
25 Mail template
100 Monitoring device
200 Failure management server
300 Mail server
500 Network monitoring system

Japanese Unexamined Patent Application Publication No. 2004-206598 (JP 2004-206598 A)

Claims

A network monitoring system for monitoring a failure of a device or a line constituting a network and notifying a network user of the failure state, comprising:
Failure detection means for detecting failure information including a failure of the device or of the line connecting sites each provided with a local communication network, the detection date and time, and the state of the device or line;
A failure determination table in which the site name of the site, the importance of the site, the business hours of the site, the failure content of the device or the line, and the criteria for determining the availability of the device or the criteria for determining whether the failure is wide-area or narrow-area are registered;
An influence range determination table in which the size of the influence range affected by the failure of the device or line is registered in association with a combination of the states of the elements, the elements being the importance, whether the failure detection date and time is within or outside the business hours, the availability of the device or line, and whether the failure is wide-area or narrow-area;
A failure level table in which a failure level is registered in association with the size of the influence range and the elapsed time from the failure detection;
A user table in which notification destination users to be notified of the failure state are registered in association with the failure level;
Influence range determination means for determining the state of each element based on the failure information with reference to the failure determination table, and determining the influence range with reference to the influence range determination table based on the combination of the states of the elements;
Failure level determination means for determining the failure level by referring to the failure level table based on the influence range and the elapsed time; and
E-mail transmission means for transmitting an e-mail notifying the content of the failure to the notification destination user associated with the failure level in the user table.

The failure level determination means increases the failure level by one step every time the elapsed time exceeds a threshold,
The e-mail transmission means transmits an e-mail informing the notification destination user of the content of the failure associated with the new failure level;
The network monitoring system according to claim 1.

In the failure level table, a threshold for increasing the failure level by one level is registered for each level.
The network monitoring system according to claim 2, wherein:

The influence range determination means determines the states of two or more of the elements, provisionally determines the influence range based on the combination of those determined states, and updates the influence range at the timing when the states of the remaining elements are determined,
The network monitoring system according to any one of claims 1 to 3.

The system further has a site identification table that associates one or more IP addresses assigned to the device, the line to which the device is connected, and site identification information of a site affected by a failure of the device or the line,
The influence range determination means counts, without duplication, the number of sites associated in the site identification table with the IP address of the device in which a failure is detected or the IP address of the device to which the failed line is connected,
If the number of sites is equal to or greater than the criterion, it is determined that the failure is a wide-area failure,
The network monitoring system according to claim 1, wherein:

The influence range determination means counts the number of the devices in which a failure is detected so as not to overlap,
If the number of the devices is equal to or greater than the criterion, determine that the failure is in a wide area,
The network monitoring system according to claim 1, wherein:

A network monitoring method in which a network monitoring system monitors a failure of a device or a line constituting a network and notifies a network user of the failure state, comprising:
A step of detecting failure information including a failure of the device or of the line connecting sites each provided with a local communication network, the detection date and time, and the state of the device or line;
A step in which influence range determination means refers to a failure determination table in which the site name of the site, the importance of the site, the business hours of the site, the failure content of the device or the line, the criteria for determining the availability of the device, and the criteria for determining whether the failure is wide-area or narrow-area are registered, determines the state of each element based on the failure information, and then determines the influence range based on the combination of the states of the elements by referring to an influence range determination table in which the size of the influence range affected by the failure of the device or line is registered in association with the combination of the states of the elements, the elements being the importance, whether the failure detection date and time is within or outside the business hours, the availability of the device or line, and whether the failure is wide-area or narrow-area;
A step in which failure level determination means determines the failure level based on the influence range and the elapsed time by referring to a failure level table in which the failure level is registered in association with the size of the influence range and the elapsed time from the failure detection; and
A step in which e-mail transmission means refers to a user table in which notification destination users to be notified of the failure state are registered in association with the failure level, and transmits an e-mail notifying the content of the failure to the notification destination user associated with the failure level.