JP2014078774A

JP2014078774A - Monitoring system

Info

Publication number: JP2014078774A
Application number: JP2012223807A
Authority: JP
Inventors: Tomohiro Kobori; 智弘小堀
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2012-10-09
Filing date: 2012-10-09
Publication date: 2014-05-01

Abstract

PROBLEM TO BE SOLVED: To appropriately set the threshold value used for warning of a network abnormality according to the capability of an apparatus and the amount of packets it relays. SOLUTION: A server 0107 acquires communication information on a primary-system terminal 0104 and a secondary-system terminal 0105 and stores it in a database 0106, sets a threshold value from the communication information for a prescribed period read out from the database, compares the threshold value with information obtained from either the primary-system terminal or the secondary-system terminal, and issues a warning if the obtained information exceeds the threshold value.

Description

The present invention relates to a system for monitoring communication failures.

As a conventional method for monitoring the amount of traffic on a network, Patent Document 1, for example, describes analyzing network traffic together with the cause of an alert.

Another known method sets a threshold on a server for the traffic volume of a network device and issues a warning when the threshold is exceeded. In outline, the server sets a threshold for the amount of traffic passing through each port or VLAN of the network device. The threshold is a usage rate relative to the logical bandwidth and is mainly an upper limit. The server periodically acquires the MIB of the network device and monitors the bandwidth usage rate. Alternatively, if the network device itself supports threshold settings, it can send a trap or syslog message to the server so that monitoring is done on the server side. Based on the acquired information, the server warns the user when the threshold is exceeded, and in this way monitors the amount of traffic on the network.
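The usage-rate check described above can be illustrated with a minimal sketch. It assumes the server has already read two octet-counter values for a port from the device's MIB at a known interval. The function names, the 32-bit counter width, and the 80% upper limit are assumptions made for illustration and do not come from this document.

```python
def usage_rate(prev_octets: int, curr_octets: int, interval_s: float,
               bandwidth_bps: float, counter_bits: int = 32) -> float:
    """Return bandwidth usage in percent between two counter readings."""
    # Modulo arithmetic tolerates a single counter wrap between readings.
    delta_octets = (curr_octets - prev_octets) % (2 ** counter_bits)
    return delta_octets * 8 / (interval_s * bandwidth_bps) * 100.0


def exceeds_upper_limit(rate_percent: float, upper_percent: float = 80.0) -> bool:
    """Fixed-threshold check: True when usage is above the upper limit."""
    return rate_percent > upper_percent
```

For example, 750,000,000 octets in 60 seconds on a 1 Gbit/s port corresponds to a usage rate of 10%, which is below the assumed limit.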

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2009-231876 (JP 2009-231876)

The threshold set for warning of a network abnormality is generally a fixed value. However, the capability a device needs and the amount of packets it relays differ depending on the performance of the device in operation and the system in which it is installed, so setting an appropriate threshold is a challenge. In addition, in a system that can take a redundant configuration, the primary system switches to the secondary system when a failure occurs, but after the switch, monitoring can no longer be performed under the same conditions as for the primary system.

Fixing the threshold has the following problem. The role of a network device varies greatly depending on the system whose traffic it relays. A device that carries a company's core business differs greatly, in both the traffic it relays and the capability it requires, from a device that connects individual PCs in an office. Even if the two are the same model, monitoring them with the same threshold is not appropriate.

Setting thresholds individually has the following problem. Setting a monitoring threshold for each device allows better monitoring than a fixed threshold, but what can be monitored is only the occurrence of an abnormality in the device. The state of the device relative to the traffic flowing through the network cannot be grasped.

A redundant configuration has the following problem. In a redundant system, traffic flows mainly through the primary system, so the traffic seen by the primary and secondary systems differs. If the primary system fails, the traffic it was carrying flows to the secondary system, but traffic that is normal for the primary system is not necessarily normal under the secondary system's monitoring conditions, so correct monitoring is not possible.

As an example, a system according to the present invention includes a primary system terminal and a secondary system terminal for communicating with a service providing system and a user terminal, a server that communicates with the primary system terminal and the secondary system terminal, and a database in which the server stores communication information of the primary system terminal and the secondary system terminal. The server acquires the communication information of the primary system terminal and the secondary system terminal and stores it in the database, sets a threshold from the communication information for a predetermined period read from the database, compares information acquired from either the primary system terminal or the secondary system terminal with the threshold, and issues a warning when the acquired information exceeds the threshold.

When a threshold excess is detected at a connection port on the user side, information can be obtained about the source of the abnormality and whether it is temporary. In addition, by configuring the system so that the secondary system can share the primary system's thresholds, communication can start in the same environment as the primary system even when the primary system fails, which reduces the operational burden.

FIG. 1: Device monitoring system
FIG. 2: Device monitoring system (overview of threshold setting)
FIG. 3: Device monitoring system (threshold setting flow)
FIG. 4: Device monitoring system (overview of normal processing flow)
FIG. 5: Device monitoring system (report processing flow)
FIG. 6: Device monitoring system (data stored in the database)
FIG. 7: Device monitoring system (overview of threshold takeover processing flow)
FIG. 8: Device monitoring system (threshold takeover processing flow)

In this embodiment, the threshold used to decide whether to issue an alarm as a result of monitoring is set in two stages. First, a fixed threshold is set for each device. Second, a threshold is calculated and set automatically for each device from the data of a predetermined period (for example, the most recent week of observed data).

To calculate the threshold automatically, the server periodically acquires communication information (such as the MIB) from the network devices and accumulates it as statistical data. The bandwidth usage of the traffic is extracted from the acquired data, and upper and lower thresholds for monitoring the traffic volume are set automatically, for example by calculating the mean and standard deviation.
When the network environment has a redundant configuration, the server manages the threshold information so that, when the primary system fails, the secondary system takes over the thresholds set for the primary system. The embodiment is described in detail below.

An overall outline of the device monitoring system is described with reference to FIG. 1. As shown in FIG. 1, there is a network configuration in which a user (0101) and a service providing system (0108) communicate. A server (0107) is connected to system terminal A (0104) and system terminal B (0105). The server monitors the amount of traffic passing through system terminal A (0104) and system terminal B (0105), as well as the CPU usage rate and memory usage rate of each terminal. The monitoring results are stored in the traffic monitoring DB (0106).
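As a rough illustration of how the server might keep these monitoring results, the sketch below stores one row per reading in an in-memory SQLite database and reads back a window of recent days. The table layout, column names, and function names are hypothetical. The actual contents of the traffic monitoring DB are those shown in FIG. 6, which this sketch does not attempt to reproduce.

```python
import sqlite3
from datetime import datetime, timedelta


def open_monitoring_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the traffic monitoring DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples ("
        " terminal TEXT, taken_at TEXT,"
        " traffic REAL, cpu REAL, memory REAL)"
    )
    return conn


def store_sample(conn, terminal: str, taken_at: datetime,
                 traffic: float, cpu: float, memory: float) -> None:
    """Store one reading for one system terminal."""
    conn.execute(
        "INSERT INTO samples VALUES (?, ?, ?, ?, ?)",
        (terminal, taken_at.isoformat(timespec="seconds"), traffic, cpu, memory),
    )


def recent_samples(conn, terminal: str, now: datetime, days: int = 7):
    """Return (traffic, cpu, memory) rows from the last `days` days."""
    # ISO-8601 strings in one fixed format sort in chronological order.
    since = (now - timedelta(days=days)).isoformat(timespec="seconds")
    return conn.execute(
        "SELECT traffic, cpu, memory FROM samples"
        " WHERE terminal = ? AND taken_at >= ? ORDER BY taken_at",
        (terminal, since),
    ).fetchall()
```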

Next, an outline of threshold setting in the device monitoring system is described with reference to FIG. 2. As shown in FIG. 2, for traffic transmitted from the user terminal (0102) shown in FIG. 1, the server (0107) acquires the traffic volume observed at each of system terminal A (0104) and system terminal B (0105), together with the CPU usage rate and memory usage rate of each device. The acquired data is stored in the traffic monitoring DB (0106). Using the stored data for a predetermined past period (for example, 7 days), the server (0107) determines, from the mean and standard deviation for each of system terminal A (0104) and system terminal B (0105), the range that covers 95% of the data, and sets the upper and lower limits of that range as the thresholds. An actual processing example is detailed in the description of FIG. 3. The acquired data and the list of thresholds are stored in the database shown in FIG. 6. The data is updated automatically, for example every minute. The thresholds are set once a day.
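The 95% range can be sketched as follows. The document does not state a multiplier. The factor 1.96 used here is the standard two-sided 95% value for a normal distribution and is an assumption of this sketch, as are the function name and the use of the population standard deviation.

```python
from statistics import mean, pstdev

# Assumption: two-sided 95% interval of a normal distribution.
Z_95 = 1.96


def compute_thresholds(values):
    """Return (lower, upper) thresholds as mean -/+ 1.96 standard deviations."""
    if len(values) < 2:
        raise ValueError("need at least two samples")
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the window
    return mu - Z_95 * sigma, mu + Z_95 * sigma
```

In use, the function would be called once a day per terminal and per metric (traffic volume, CPU usage rate, memory usage rate) on the stored readings for the past period.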

Next, the threshold setting flow in the device monitoring system is described with reference to FIG. 3. For traffic transmitted from the user terminal (0102), the server (0107) uses MIB, trap, and syslog to acquire the traffic volume observed at system terminal A (0104) and system terminal B (0105), together with the CPU usage rate and memory usage rate of each terminal (0301, 0302). Data is acquired once per minute and stored in the traffic monitoring DB (0106) (0303). The server calculates the thresholds from one week of stored data. Although values differ between service hours and non-service hours, the data within each time band is expected to be fairly constant, so the data distribution is taken to be normal. The thresholds are therefore calculated by computing the mean and standard deviation of the data for a predetermined past period (for example, one week), determining the range that covers 95% of the data, and using the upper and lower limits of that range as the thresholds (0305). If fixed thresholds have also been set, the fixed and calculated thresholds are compared: for the upper limit, the smaller value is adopted (0306, 0308, 0309), and for the lower limit, the larger value is adopted (0306, 0311, 0312).
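The rule for combining a fixed threshold with a calculated one amounts to keeping the tighter bound on each side. A minimal sketch, with hypothetical names and a `(lower, upper)` tuple representation chosen only for illustration:

```python
from typing import Optional, Tuple

Range = Tuple[float, float]  # (lower, upper)


def tighten(computed: Range, fixed: Optional[Range] = None) -> Range:
    """Combine calculated and fixed thresholds, keeping the tighter bound."""
    if fixed is None:
        # No fixed threshold configured: use the calculated one as is.
        return computed
    lower = max(computed[0], fixed[0])  # larger lower limit wins
    upper = min(computed[1], fixed[1])  # smaller upper limit wins
    return lower, upper
```

With a calculated range of (10, 70) and a fixed range of (20, 80), the result is (20, 70): the fixed lower limit and the calculated upper limit.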

Next, the normal processing flow in the device monitoring system is described with reference to FIG. 4. The server (0107) acquires the traffic volume, CPU usage rate, and memory usage rate observed at system terminal A (0104) and system terminal B (0105) and compares them with the set thresholds. The information is acquired in the same way as in FIG. 2. If even one of the acquired values exceeds its threshold, the server raises an alarm. An actual processing example follows the flow of FIG. 5. Steps 0301 and 0302 are the same as in FIG. 3. In step 0501, the set thresholds are compared with the acquired information. If the acquired information is judged to exceed a threshold, a warning is sent to the system terminal (0504). If it is judged not to exceed a threshold, no warning is sent (0503).
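The comparison step can be sketched as below. The dictionary layout, metric names, and function names are hypothetical, and treating a value below the lower threshold as an excess (as well as a value above the upper threshold) is an assumption based on the use of upper and lower thresholds elsewhere in this description.

```python
def find_violations(metrics: dict, thresholds: dict) -> list:
    """Return the names of metrics that fall outside their (lower, upper) range."""
    violations = []
    for name, value in metrics.items():
        lower, upper = thresholds[name]
        if value < lower or value > upper:
            violations.append(name)
    return violations


def should_warn(metrics: dict, thresholds: dict) -> bool:
    """A warning is due when at least one metric is outside its thresholds."""
    return len(find_violations(metrics, thresholds)) > 0
```

A caller would run `should_warn` on each new set of readings and send the warning only when it returns `True`.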

Next, the threshold takeover processing flow in the device monitoring system is described with reference to FIG. 7. The system illustrated here assumes that a device failure occurs in system terminal A (0104), that the network switches from the primary system terminal A (0104) to the secondary system terminal B (0105), and that the normal traffic volume on the network of system terminal B (0105) is smaller than that on the network of system terminal A (0104). In this case, more traffic than usual flows to system terminal B (0105) and its threshold is exceeded. Therefore, when system terminal A (0104) fails, the thresholds of system terminal A (0104) stored in the traffic monitoring DB are applied to system terminal B (0105), and monitoring is performed against them. As a result, threshold monitoring that matches the traffic volume can be performed immediately after the network switches.

An actual processing example follows the flow shown in FIG. 8. A failure occurs in a system terminal (0901). The server then acquires the threshold information of system terminal A from the traffic monitoring DB (0902) and applies it to system terminal B (0903). After that, the same processing flow as in FIG. 5 is executed (0904-0906).
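The takeover step can be sketched as a copy of the failed terminal's thresholds onto the standby terminal. The in-memory dictionary standing in for the traffic monitoring DB, the terminal keys, and the function name are all hypothetical.

```python
def take_over_thresholds(thresholds: dict, failed: str, standby: str) -> dict:
    """Return a new mapping in which `standby` uses the thresholds of `failed`."""
    updated = dict(thresholds)
    # Copy so that later changes for one terminal do not affect the other.
    updated[standby] = dict(thresholds[failed])
    return updated
```

After the call, readings from the standby terminal are checked against the thresholds that were calculated for the failed primary terminal, and the stored thresholds of the primary terminal are left unchanged.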

Claims

A monitoring system comprising:
a primary system terminal and a secondary system terminal for communicating with a service providing system and a user terminal;
a server that communicates with the primary system terminal and the secondary system terminal; and
a database in which the server stores communication information of the primary system terminal and the secondary system terminal,
wherein the server acquires the communication information of the primary system terminal and the secondary system terminal, stores the communication information in the database, sets a threshold value from the communication information for a predetermined period read from the database, compares information acquired from either the primary system terminal or the secondary system terminal with the threshold value, and issues a warning if the acquired information exceeds the threshold value.

The monitoring system according to claim 1, wherein the acquired information includes a traffic amount, a CPU usage rate, and a memory usage rate of any one of the primary system terminal and the secondary system terminal.

The monitoring system according to claim 1, wherein the server stores an initial threshold value in advance, compares the initial threshold value with the threshold value to be set, and selects whichever of the two tightens the upper and lower limits.

The monitoring system according to claim 1, wherein, when the primary system terminal is switched to the secondary system terminal, the server acquires the threshold value for the primary system terminal from the database and applies it to the secondary system terminal.