JP7498128B2

JP7498128B2 - Monitoring device, fault detection method and fault detection program

Info

Publication number: JP7498128B2
Application number: JP2021028735A
Authority: JP
Inventors: 淳野崎; 基裕山中; 信二石黒; 大和鈴木
Original assignee: エフサステクノロジーズ株式会社
Priority date: 2021-02-25
Filing date: 2021-02-25
Publication date: 2024-06-11
Anticipated expiration: 2041-02-25
Also published as: JP2022129879A

Description

The present invention relates to a monitoring device and the like.

With the spread of cloud computing, the data center (DC) network that forms the foundation of a DC is required to deliver higher quality than ever. In a DC network, a silent failure may occur, in which a network device such as a switch malfunctions without issuing any failure alert. Because a silent failure is hard to notice, recovery may be delayed and many services may be affected.

FIG. 9 is a diagram for explaining an example of a silent failure. In the example shown in FIG. 9, switches 4 and 5 are connected to a monitoring device 6. The switch 4 has a control plane 4a and a data plane 4b. The control plane 4a is a control unit that controls the switch 4 as a whole. The data plane 4b is an ASIC (Application Specific Integrated Circuit) that handles the actual data communication. Like the switch 4, the switch 5 includes a control plane 5a and a data plane 5b.

For example, a silent failure occurs when an abnormality in the data plane 4b of the switch 4 disrupts communication while the control plane 4a remains normal. Because the control plane 4a is operating normally, no alert indicating the abnormality is returned to the monitoring device 6 even when it sends an SNMP request to the switch 4, so the monitoring device 6 cannot detect the failure of the data plane 4b by means of SNMP requests.
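The situation of FIG. 9 can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Switch` class and its `snmp_poll` and `forward` methods are hypothetical names introduced here, and a real SNMP exchange is reduced to a boolean answered by the control plane.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Switch:
    """Toy switch with independent control plane and data plane health."""
    name: str
    control_plane_ok: bool = True
    data_plane_ok: bool = True

    def snmp_poll(self) -> bool:
        # The request is answered by the control plane only.
        # True means "no alert reported".
        return self.control_plane_ok

    def forward(self, packet: str) -> Optional[str]:
        # Forwarding is handled by the data plane; None models a dropped packet.
        return packet if self.data_plane_ok else None


# Hypothetical scenario: data plane 4b is broken, control plane 4a is fine.
switch4 = Switch("switch 4", control_plane_ok=True, data_plane_ok=False)
looks_healthy = switch4.snmp_poll()
delivered = switch4.forward("payload")
```

In this model the poll reports a healthy switch while the packet is dropped, which is the mismatch that makes the failure silent.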

Conventional techniques 1 and 2 are known for detecting such silent failures. In conventional technique 1, a monitoring device periodically transmits test data to a monitored device and detects an abnormality (such as a silent failure) from the presence or absence of a response.

In conventional technique 2, a monitoring device periodically collects information from each monitored device, a system administrator defines the normal behavior of the network in advance on the basis of the collected information, and an abnormality (such as a silent failure) is detected from deviations from that normal behavior or from signs of trouble.

JP 2020-88786 A; JP 2011-211350 A

The conventional techniques described above have the problem that silent failures cannot be detected efficiently.

For example, if conventional technique 1 is applied as is to a large-scale network, the test data increases the amount of traffic. Conventional technique 2 places a heavy burden on the administrator who must define the normal behavior of the network, and it also incurs operational costs.

In one aspect, an object of the present invention is to provide a monitoring device, a fault detection method, and a fault detection program capable of detecting silent failures efficiently.

In a first proposal, the monitoring device has an acquisition unit and a detection unit. From a monitoring switch that is connected, via respective virtual networks, to a first switch to be monitored, a second switch, and another monitoring switch among the plurality of switches included in a network, the acquisition unit acquires a first communication status with the first switch, a second communication status with the second switch, and a third communication status with the other monitoring switch. The detection unit detects a failed switch from among the first switch and the second switch on the basis of the first communication status, the second communication status, and the third communication status.

Silent failures can be detected efficiently.

FIG. 1 is a diagram showing a monitoring system according to the present embodiment.
FIG. 2 is a diagram for explaining the IP SLA function.
FIG. 3 is a functional block diagram showing the configuration of the monitoring device according to the present embodiment.
FIG. 4 is a diagram showing an example of the data structure of the pattern table.
FIG. 5 is a diagram showing an example of the data structure of the judgment policy table.
FIG. 6 is a diagram for explaining an example of route switching by message transmission.
FIG. 7 is a flowchart showing a processing procedure of the monitoring device according to the present embodiment.
FIG. 8 is a diagram showing an example of the hardware configuration of a computer that realizes the same functions as the monitoring device of the embodiment.
FIG. 9 is a diagram for explaining an example of a silent failure.

Embodiments of the monitoring device, fault detection method, and fault detection program disclosed in the present application are described in detail below with reference to the drawings. The present invention is not limited by these embodiments.

FIG. 1 is a diagram showing an example of a monitoring system according to the present embodiment. As shown in FIG. 1, the monitoring system 1 includes core switches 10A and 10B, a floor switch 20, monitoring switches 30A and 30B, and a monitoring device 100.

The core switches 10A and 10B, the floor switch 20, and the monitoring switches 30A and 30B are connected to one another by a wireless LAN (Local Area Network) or a wired LAN. Although not illustrated, the core switches 10A and 10B, the floor switch 20, and the monitoring switches 30A and 30B are also connected to other switches and terminal devices in the network by a wireless LAN or a wired LAN.

The core switch 10A is a network switch that forwards and relays packets within the network. For example, the core switch 10A holds a routing table, and when it receives a packet from the core switch 10B, another switch, or a terminal device, it forwards and relays the data on the basis of the routing table. The core switch 10A also has a switching function.

The core switch 10B is a network switch that forwards and relays packets within the network. For example, the core switch 10B holds a routing table, and when it receives a packet from the core switch 10A, another switch, or a terminal device, it forwards and relays the data on the basis of the routing table. The core switch 10B also has a switching function.

The floor switch 20 is a network switch that bridges the core of the network and its edge.

The monitoring switch 30A has an IP SLA function, creates a VLAN (Virtual Local Area Network) that reaches the floor switch 20 via the core switches 10A and 10B, and monitors the core switches 10A and 10B, the floor switch 20, and the monitoring switch 30B.

The monitoring switch 30B has an IP SLA function, creates a VLAN that reaches the floor switch 20 via the core switches 10A and 10B, and monitors the core switches 10A and 10B, the floor switch 20, and the monitoring switch 30A.

FIG. 2 is a diagram for explaining the IP SLA function. As an example, the description uses the monitoring switch 30A and, as the monitoring target, the core switch 10A. The monitoring switch 30A transmits a monitoring packet to the core switch 10A and determines, on the basis of the response from the core switch 10A, whether an alert has occurred for the core switch 10A. Although this is not repeated in the description below, the monitoring switch 30A and the core switch 10A exchange the information related to the monitoring packet via the VLAN.

When the monitoring switch 30A transmits a monitoring packet and receives a response from the core switch 10A, it determines that no alert has occurred for the core switch 10A.

On the other hand, when the monitoring switch 30A transmits a monitoring packet to the core switch 10A and receives no response from the core switch 10A, it determines that an alert has occurred for the core switch 10A and transmits alert information to the monitoring device 100. A protocol such as SYSLOG or SNMP trap is used to communicate the alert information.

For the other monitoring targets, namely the core switch 10B, the floor switch 20, and the monitoring switch 30B, the monitoring switch 30A likewise exchanges information related to monitoring packets via the VLAN to determine whether an alert has occurred, and transmits alert information to the monitoring device 100 when an alert has occurred.

The alert information contains information on the monitoring switch 30A that sent it and information on the monitoring target for which the alert occurred. The monitoring switch 30A transmits alert information to the monitoring device 100 each time it detects a monitoring target for which an alert has occurred.

In the same way as the monitoring switch 30A, the monitoring switch 30B transmits monitoring packets to its monitoring targets (the core switches 10A and 10B, the floor switch 20, and the monitoring switch 30A) and determines, on the basis of the responses, whether an alert has occurred for each monitoring target. When the monitoring switch 30B determines that an alert has occurred for a monitoring target, it transmits alert information to the monitoring device 100.
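The probing behaviour described for the monitoring switches 30A and 30B can be sketched as follows. This is a minimal illustration, not the IP SLA implementation of any real switch: the `AlertInfo` record, the `probe_targets` function, and the `responds` callback that stands in for sending a monitoring packet over the VLAN are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AlertInfo:
    source: str  # monitoring switch that sent the monitoring packet
    target: str  # monitoring target that did not respond


def probe_targets(
    source: str,
    targets: Iterable[str],
    responds: Callable[[str], bool],
) -> List[AlertInfo]:
    """Return one AlertInfo for every target that gave no response."""
    return [AlertInfo(source, t) for t in targets if not responds(t)]


# Hypothetical scenario: core switch 10A drops the monitoring packet,
# every other target answers.
alerts_30a = probe_targets(
    "30A",
    ["10A", "10B", "20", "30B"],
    responds=lambda target: target != "10A",
)
```

Each `AlertInfo` carries the two items the text says the alert information contains: the sending monitoring switch and the target for which the alert occurred.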

The monitoring device 100 is a device that, upon receiving alert information from the monitoring switches 30A and 30B, detects the monitored switch in which a silent failure has occurred on the basis of the alert information. When the monitoring device 100 detects a monitored switch in which a silent failure has occurred, it transmits a message to the detected switch to cause the port of that switch to be blocked. For example, if the network is redundant, this processing makes it possible to detect the switch with the silent failure and recover the network from the failure automatically.

Next, an example of the configuration of the monitoring device 100 is described. FIG. 3 is a functional block diagram showing the configuration of the monitoring device according to the present embodiment. As shown in FIG. 3, the monitoring device 100 has a communication unit 110, an input unit 120, a display unit 130, a storage unit 140, and a control unit 150.

The communication unit 110 transmits and receives information to and from the monitoring switches 30A and 30B via the network. For example, the communication unit 110 is realized by an NIC (Network Interface Card) or the like.

The input unit 120 is an input device for entering various types of information. The input unit 120 corresponds to a keyboard, a mouse, a touch panel, or the like.

The display unit 130 is a display device that displays information output from the control unit 150. The display unit 130 corresponds to a liquid crystal display, an organic EL (Electro Luminescence) display, a touch panel, or the like.

The storage unit 140 has a registration table 141, a pattern table 142, and a judgment policy table 143. The storage unit 140 is realized, for example, by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk.

The registration table 141 is a table that holds the alert information transmitted from the monitoring switches 30A and 30B. The alert information includes identification information (an IP (Internet Protocol) address, a MAC (Media Access Control) address, or the like) of the monitoring switch that sent the alert information, and identification information (an IP address, a MAC address, or the like) of the monitored switch for which the alert occurred.

The pattern table 142 is a table that defines the pattern corresponding to each combination of monitoring targets for which an alert has occurred and monitoring targets for which no alert has occurred. FIG. 4 is a diagram showing an example of the data structure of the pattern table. As shown in FIG. 4, the pattern table 142 associates alert locations with patterns. An alert location indicates a switch for which an alert was detected by a monitoring packet. Here, the alert locations used in the description are the monitoring switches (monitoring switches 30A and 30B), the core switch 10A, and the floor switch 20.

For example, when the alert information transmitted from the monitoring switch 30A indicates no alert for the monitoring switch 30B, and the alert information transmitted from the monitoring switch 30B indicates no alert for the monitoring switch 30A, the judgment for the monitoring switches in the pattern table 142 is "○".

On the other hand, when the alert information transmitted from the monitoring switch 30A indicates an alert for the monitoring switch 30B, or the alert information transmitted from the monitoring switch 30B indicates an alert for the monitoring switch 30A, the judgment for the monitoring switches in the pattern table 142 is "×".

When the alert information transmitted from the monitoring switch 30A indicates no alert for the core switch 10A, and the alert information transmitted from the monitoring switch 30B indicates no alert for the core switch 10A, the judgment for the core switch in the pattern table 142 is "○".

When the alert information transmitted from the monitoring switch 30A indicates an alert for the core switch 10A, or the alert information transmitted from the monitoring switch 30B indicates an alert for the core switch 10A, the judgment for the core switch in the pattern table 142 is "×".

When the alert information transmitted from the monitoring switch 30A indicates no alert for the floor switch 20, and the alert information transmitted from the monitoring switch 30B indicates no alert for the floor switch 20, the judgment for the floor switch in the pattern table 142 is "○".

When the alert information transmitted from the monitoring switch 30A indicates an alert for the floor switch 20, or the alert information transmitted from the monitoring switch 30B indicates an alert for the floor switch 20, the judgment for the floor switch in the pattern table 142 is "×".
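The "○" / "×" rules above reduce to a logical OR over the two monitoring switches' reports. The sketch below is one possible encoding, with assumptions labeled: alerts are modelled as `(reporting switch, alerted switch)` pairs, `True` stands for "×" and `False` for "○", and the function names are hypothetical.

```python
from typing import Set, Tuple

Alert = Tuple[str, str]  # (reporting monitoring switch, switch with the alert)
MONITORS = ("30A", "30B")


def target_alerted(alerts: Set[Alert], target: str) -> bool:
    """True ("×") if 30A or 30B reported an alert for target, else False ("○")."""
    return any((monitor, target) in alerts for monitor in MONITORS)


def monitors_alerted(alerts: Set[Alert]) -> bool:
    """True ("×") if 30A reported 30B or 30B reported 30A, else False ("○")."""
    a, b = MONITORS
    return (a, b) in alerts or (b, a) in alerts


# Hypothetical registration-table contents.
alerts = {("30A", "10A"), ("30B", "20")}
judgments = (
    monitors_alerted(alerts),        # monitoring switches
    target_alerted(alerts, "10A"),   # core switch 10A
    target_alerted(alerts, "20"),    # floor switch 20
)
```

A single report from either monitoring switch is enough to mark a location "×"; only the absence of reports from both yields "○".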

As shown in FIG. 4, when the judgment for the monitoring switches is "○", the judgment for the core switch 10A is "○", and the judgment for the floor switch 20 is "×", the pattern is "A". When the judgment for the monitoring switches is "○", the judgment for the core switch 10A is "×", and the judgment for the floor switch 20 is "○", the pattern is "B".

When the judgment for the monitoring switches is "×", the judgment for the core switch 10A is "○", and the judgment for the floor switch 20 is "○", the pattern is "C". When the judgment for the monitoring switches is "×", the judgment for the core switch 10A is "×", and the judgment for the floor switch 20 is "○", the pattern is "D".

When the judgment for the monitoring switches is "○", the judgment for the core switch 10A is "×", and the judgment for the floor switch 20 is "×", the pattern is "E". When the judgment for the monitoring switches is "×", the judgment for the core switch 10A is "○", and the judgment for the floor switch 20 is "×", the pattern is "F". When the judgment for the monitoring switches is "×", the judgment for the core switch 10A is "×", and the judgment for the floor switch 20 is "×", the pattern is "G".
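The seven combinations described for FIG. 4 can be written as a lookup table. This is a sketch of one possible representation, not the data structure of the actual device: `True` stands for "×", `False` for "○", and `PATTERN_TABLE` and `classify` are hypothetical names.

```python
from typing import Dict, Optional, Tuple

# Key order: (monitoring switches, core switch 10A, floor switch 20).
# True stands for "×" (alert), False for "○" (no alert).
PATTERN_TABLE: Dict[Tuple[bool, bool, bool], str] = {
    (False, False, True): "A",
    (False, True, False): "B",
    (True, False, False): "C",
    (True, True, False): "D",
    (False, True, True): "E",
    (True, False, True): "F",
    (True, True, True): "G",
}


def classify(monitors: bool, core: bool, floor: bool) -> Optional[str]:
    """Return the pattern letter, or None when every judgment is "○"."""
    return PATTERN_TABLE.get((monitors, core, floor))


pattern = classify(monitors=False, core=True, floor=True)
```

The all-"○" combination is deliberately absent from the table, since the text treats it as the case in which no silent failure has occurred.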

The pattern table 142 described with FIG. 4 is the pattern table for the core switch 10A; the pattern table for the core switch 10B is defined in the same way. To describe part of it: when the alert information transmitted from the monitoring switch 30A indicates no alert for the core switch 10B, and the alert information transmitted from the monitoring switch 30B indicates no alert for the core switch 10B, the judgment for the core switch in the pattern table for the core switch 10B is "○".

When the alert information transmitted from the monitoring switch 30A indicates an alert for the core switch 10B, or the alert information transmitted from the monitoring switch 30B indicates an alert for the core switch 10B, the judgment for the core switch in the pattern table for the core switch 10B is "×".

The pattern for the core switch 10B is then identified from the combination of "○" and "×" for the monitoring switches, the core switch 10B, and the floor switch 20.

The judgment policy table 143 holds information for determining the cause of a silent failure according to the pattern. FIG. 5 is a diagram showing an example of the data structure of the judgment policy table. As shown in FIG. 5, the judgment policy table 143 associates patterns with causes. The patterns correspond to patterns A to G described with FIG. 4. The cause indicates the cause of the silent failure. Here, as an example, the description uses the patterns for the core switch 10A.

For example, the cause for pattern A is "failure in the floor switch 20 or the core switch 10A (routing function of the core switch 10A)". The cause for pattern B is "failure in the core switch 10A".

The cause for pattern C is "failure in the core switch 10A (switching function of the core switch 10A)". The cause for pattern D is "failure in the core switch 10A".

The cause for pattern E is "failure in the core switch 10A (routing function of the core switch 10A)". The cause for pattern F is "failure in the core switch 10A (routing function and switching function of the core switch 10A)". The cause for pattern G is "failure in the core switch 10A".

FIG. 5 was described using the patterns for the core switch 10A. Although not illustrated, the causes corresponding to the patterns for the core switch 10B are obtained by replacing the core switch 10A with the core switch 10B in the description above.
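The pattern-to-cause association of FIG. 5 can likewise be sketched as a dictionary. The wording of the causes follows the description above; the names `POLICY_TABLE` and `cause_for`, and the use of a string substitution to obtain the entries for the core switch 10B, are assumptions made for illustration.

```python
from typing import Dict

# Causes for the core switch 10A, keyed by pattern letter.
POLICY_TABLE: Dict[str, str] = {
    "A": "failure in floor switch 20 or core switch 10A (routing function)",
    "B": "failure in core switch 10A",
    "C": "failure in core switch 10A (switching function)",
    "D": "failure in core switch 10A",
    "E": "failure in core switch 10A (routing function)",
    "F": "failure in core switch 10A (routing function, switching function)",
    "G": "failure in core switch 10A",
}


def cause_for(pattern: str, core: str = "10A") -> str:
    """Look up the cause; for another core switch, substitute its name for 10A."""
    return POLICY_TABLE[pattern].replace("10A", core)


cause_a = cause_for("A")
cause_c_10b = cause_for("C", core="10B")
```

Note that pattern A is the only entry that leaves two candidates open (the floor switch 20 or the routing function of the core switch); every other pattern points at the core switch alone.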

Returning to the description of FIG. 3, the control unit 150 has an acquisition unit 151, a detection unit 152, and a transmission unit 153. The control unit 150 is realized, for example, by a CPU (Central Processing Unit) or an MPU (Micro Processing Unit). The control unit 150 may also be implemented by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

The acquisition unit 151 acquires alert information from the monitoring switches 30A and 30B and registers the acquired alert information in the registration table 141. The acquisition unit 151 repeats this processing each time it acquires alert information.

The detection unit 152 identifies a pattern on the basis of the combination of alert information registered in the registration table 141 and the pattern table 142. On the basis of the identified pattern and the judgment policy table 143, the detection unit 152 detects the location that is the cause of the silent failure and outputs the detection result to the transmission unit 153. The detection unit 152 may also output the detection result to the display unit 130 for display.

For example, the detection unit 152 refers to each piece of alert information registered in the registration table 141 and judges "○" or "×" for the monitoring switches (30A and 30B), the core switch 10A, the core switch 10B, and the floor switch 20. The judgment of "○" or "×" by the detection unit 152 follows the method described with FIG. 4.

The detection unit 152 identifies a pattern on the basis of the combination of "○" and "×" judgment results and the pattern table 142. The identification of the pattern by the detection unit 152 follows the method described with FIG. 4. When all judgment results are "○", the detection unit 152 regards no silent failure as having occurred and repeats the above processing until one of the judgment results becomes "×".

When the detection unit 152 identifies a pattern (one of patterns A to G described with FIG. 4), it detects the location that is the cause of the silent failure on the basis of the identified pattern and the judgment policy table 143, and outputs the detection result to the transmission unit 153. In addition to the location that is the cause of the silent failure, the detection unit 152 may also output whether the failure lies in the routing function or the switching function.

On the basis of the detection result of the detection unit 152, the transmission unit 153 transmits a message to the switch that is the cause of the silent failure. The identification information of the destination switch is set in the message.

A switch that receives the message from the transmission unit 153 performs processing to stop communication with other switches. For example, the transmission unit 153 transmits the message to the relevant switch via the monitoring switches 30A and 30B. Executing this processing triggers a route switchover between the core switches 10A and 10B.

FIG. 6 is a diagram for explaining an example of route switching by message transmission. As an example, the case is described in which the monitoring device 100 detects that a silent failure has occurred in the core switch 10A and the transmission unit 153 transmits a message to the core switch 10A.

When the monitoring switch 30A receives the message from the transmission unit 153 of the monitoring device 100, it forwards the message to the core switch 10A. When the core switch 10A receives the message, it executes a predetermined script and brings down the ports of the core switch 10A. Once the ports of the core switch 10A are down, packets that had been passing through the core switch 10A are forwarded via the core switch 10B instead, and the route is switched. In this way, the network can be recovered automatically even if a silent failure occurs in one of the core switches.
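The effect of the message in FIG. 6 can be illustrated with a toy route selection. This is a sketch under simplifying assumptions, not the redundancy protocol of a real network: `select_core` and `on_block_message` are hypothetical names, and the "predetermined script" is reduced to marking the port as down.

```python
from typing import Dict, Optional, Sequence


def select_core(
    port_up: Dict[str, bool],
    preference: Sequence[str] = ("10A", "10B"),
) -> Optional[str]:
    """Pick the first core switch in preference order whose port is up."""
    for core in preference:
        if port_up.get(core, False):
            return core
    return None


def on_block_message(port_up: Dict[str, bool], target: str) -> None:
    """Model the script run on receipt of the message: bring the port down."""
    port_up[target] = False


port_up = {"10A": True, "10B": True}
before = select_core(port_up)      # traffic initially goes through 10A
on_block_message(port_up, "10A")   # message from transmission unit 153
after = select_core(port_up)       # traffic moves to 10B
```

The switchover only helps when a second core switch is available, which matches the text's condition that the network be redundant.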

Next, an example of the processing procedure of the monitoring device 100 according to the present embodiment is described. FIG. 7 is a flowchart showing the processing procedure of the monitoring device according to the present embodiment. As shown in FIG. 7, when the acquisition unit 151 of the monitoring device 100 receives alert information from the monitoring switches 30A and 30B, it registers the alert information in the registration table 141 (step S101).

The detection unit 152 of the monitoring device 100 compares each piece of alert information in the registration table 141 with the pattern table 142 to identify a pattern (step S102). The detection unit 152 detects the switch in which the silent failure has occurred on the basis of the pattern and the judgment policy table 143 (step S103).

The transmission unit 153 of the monitoring device 100 transmits a message to the switch in which the silent failure has occurred and causes the port of the destination switch to be blocked (step S104).

The monitoring device 100 determines whether to continue the processing (step S105). When the processing is to be continued (step S105, Yes), the monitoring device 100 returns to step S101. When the processing is not to be continued (step S105, No), the monitoring device 100 ends the processing.
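Steps S101 to S105 can be sketched as a simple loop. This is a minimal illustration with the pattern identification, detection, and message transmission injected as callables, all of which are hypothetical names. Two behaviours are assumptions not stated in the source: the loop continues while input batches remain, and the registration table is cleared after a message has been sent.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple

Alert = Tuple[str, str]  # (reporting monitoring switch, switch with the alert)


def run_monitor(
    alert_batches: Iterable[Set[Alert]],
    identify_pattern: Callable[[Set[Alert]], Optional[str]],
    detect_switch: Callable[[str], str],
    send_block_message: Callable[[str], None],
) -> None:
    registration_table: Set[Alert] = set()
    for batch in alert_batches:                         # S105: continue while input remains
        registration_table |= batch                     # S101: register alert information
        pattern = identify_pattern(registration_table)  # S102: identify the pattern
        if pattern is None:                             # every judgment is "○"
            continue
        failed = detect_switch(pattern)                 # S103: detect the failed switch
        send_block_message(failed)                      # S104: have its port blocked
        registration_table.clear()                      # assumption: start afresh


# Hypothetical stand-ins for the pattern table, policy table, and transmitter.
sent: List[str] = []
run_monitor(
    alert_batches=[set(), {("30A", "10A"), ("30B", "10A")}],
    identify_pattern=lambda table: "B" if table else None,
    detect_switch=lambda pattern: "10A",
    send_block_message=sent.append,
)
```

Injecting the three callables keeps the loop independent of how the tables are stored, so the same skeleton works for either core switch.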

次に、本実施例に係る監視装置１００の効果について説明する。監視装置１００は、監視対象となるスイッチを監視する監視スイッチ３０Ａ，３０Ｂから、アラート情報を取得し、アラートの発生したスイッチの組み合わせを基にして、サイレント障害の発生したスイッチを検知する。これによって、効率的に監視対象となるスイッチのサイレント障害を検知することができる。 Next, the effects of the monitoring device 100 according to this embodiment will be described. The monitoring device 100 acquires alert information from the monitoring switches 30A and 30B that monitor the switches to be monitored, and detects the switch in which a silent failure has occurred based on the combination of switches in which an alert has occurred. This makes it possible to efficiently detect silent failures in the switches to be monitored.

For example, the monitoring device 100 classifies the combination of switches for which alerts have occurred into one of patterns A to G, and detects the switch in which the silent failure has occurred based on the classified pattern and the judgment policy table 143. This makes it possible to accurately identify the location of the silent failure.
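The classification and lookup described above can be sketched as two table lookups. The following Python sketch rests on assumptions: it takes the seven patterns to be the seven non-empty combinations of three monitored communication statuses, the assignment of letters A to G to particular combinations is arbitrary, and the names LINKS, PATTERN_TABLE, classify, and detect are hypothetical. The actual contents of the pattern table 142 and the judgment policy table 143 are not reproduced here; the policy is supplied by the caller.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Optional

# Hypothetical names for the three monitored communication statuses.
LINKS = ("first", "second", "third")

# Stand-in for pattern table 142: each non-empty combination of alerting
# links gets a label. The letter order is an assumption.
PATTERN_TABLE: Dict[FrozenSet[str], str] = {
    frozenset(combo): label
    for label, combo in zip(
        "ABCDEFG",
        (c for size in (1, 2, 3) for c in combinations(LINKS, size)),
    )
}


def classify(alerting_links: Iterable[str]) -> Optional[str]:
    """Return the pattern label for a set of alerting links, or None."""
    return PATTERN_TABLE.get(frozenset(alerting_links))


def detect(alerting_links: Iterable[str], policy: Dict[str, str]) -> Optional[str]:
    """Look up the suspected switch for the classified pattern.

    `policy` stands in for judgment policy table 143 and maps a pattern
    label to a switch identifier; its entries are caller-supplied.
    """
    pattern = classify(alerting_links)
    if pattern is None:
        return None
    return policy.get(pattern)
```

Keying the pattern table on a frozenset makes the lookup independent of the order in which alerts were registered, and a combination with no alerts maps to no pattern.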

When the monitoring device 100 detects a switch in which a silent failure has occurred, it sends a message to the detected switch and causes the switch to block its ports. By executing this processing in a redundant network, the network can be restored automatically even if a silent failure occurs.

Next, an example of the hardware configuration of a computer that realizes the same functions as the monitoring device 100 of the above embodiment will be described. Figure 8 is a diagram showing an example of this hardware configuration.

As shown in Figure 8, the computer 200 has a CPU 201 that executes various types of arithmetic processing, an input device 202 that accepts data input from a user, and a display 203. The computer 200 also has a communication device 204 that exchanges data with external devices and the like via a wired or wireless network, and an interface device 205. The computer 200 further has a RAM 206 that temporarily stores various types of information, and a hard disk device 207. The devices 201 to 207 are connected to a bus 208.

The hard disk device 207 stores an acquisition program 207a, a detection program 207b, and a transmission program 207c. The CPU 201 reads out each of the programs 207a to 207c and loads it into the RAM 206.

The acquisition program 207a functions as an acquisition process 206a. The detection program 207b functions as a detection process 206b. The transmission program 207c functions as a transmission process 206c.

The processing of the acquisition process 206a corresponds to the processing of the acquisition unit 151. The processing of the detection process 206b corresponds to the processing of the detection unit 152. The processing of the transmission process 206c corresponds to the processing of the transmission unit 153.

Note that the programs 207a to 207c do not necessarily have to be stored in the hard disk device 207 from the beginning. For example, each program may be stored on a "portable physical medium" that is inserted into the computer 200, such as a flexible disk (FD), CD-ROM, DVD, magneto-optical disk, or IC card. The computer 200 may then read out and execute each of the programs 207a to 207c from that medium.

REFERENCE SIGNS LIST
100 Monitoring device
110 Communication unit
120 Input unit
130 Display unit
140 Storage unit
141 Registration table
142 Pattern table
143 Judgment policy table
150 Control unit
151 Acquisition unit
152 Detection unit
153 Transmission unit

Claims

A monitoring device comprising:
an acquisition unit that acquires, from a second monitoring switch that connects a first switch, a second switch, and a first monitoring switch to be monitored by a virtual network among a plurality of switches included in a network, a first communication status with the first switch, a second communication status with the second switch, and a third communication status with the first monitoring switch and the second monitoring switch; and
a detection unit that detects a switch in which a failure has occurred from among the first switch and the second switch based on the first communication status, the second communication status, and the third communication status.

The monitoring device according to claim 1, further comprising a transmission unit that performs control to stop communication by sending a message to the switch that the detection unit has detected as having a failure.

The monitoring device according to claim 1 or 2, characterized in that the detection unit detects a failure in the first switch or the second switch when no alert has occurred in the first communication status and the third communication status, and an alert has occurred only in the second communication status.

The monitoring device according to claim 1, 2 or 3, characterized in that the detection unit detects a failure in the first switch when no alert has occurred in the second communication status and the third communication status, and an alert has occurred only in the first communication status.

The monitoring device according to any one of claims 1 to 4, characterized in that the detection unit detects a failure in the first switch when no alert has occurred in the first communication status and the second communication status, and an alert has occurred only in the third communication status.

The monitoring device according to any one of claims 1 to 5, characterized in that the detection unit detects a failure in the first switch when no alert has occurred in the second communication status and an alert has occurred only in the first communication status and the third communication status.

The monitoring device according to any one of claims 1 to 6, characterized in that the detection unit detects a failure in the first switch when no alert has occurred in the third communication status and an alert has occurred only in the first communication status and the second communication status.

The monitoring device according to any one of claims 1 to 7, characterized in that the detection unit detects an abnormality in the first switch when no alert has occurred in the first communication status and an alert has occurred only in the second communication status and the third communication status.

The monitoring device according to any one of claims 1 to 8, characterized in that the detection unit detects a failure in the first switch when an alert has occurred in each of the first communication status, the second communication status, and the third communication status.

A computer-implemented method for fault detection, comprising:
obtaining, from a second monitoring switch that connects a first switch, a second switch, and a first monitoring switch to be monitored by a virtual network among a plurality of switches included in a network, a first communication status with the first switch, a second communication status with the second switch, and a third communication status with the first monitoring switch and the second monitoring switch; and
detecting a switch in which a fault has occurred from among the first switch and the second switch based on the first communication status, the second communication status, and the third communication status.

A fault detection program that causes a computer to execute a process comprising:
obtaining, from a second monitoring switch that connects a first switch, a second switch, and a first monitoring switch to be monitored by a virtual network among a plurality of switches included in a network, a first communication status with the first switch, a second communication status with the second switch, and a third communication status with the first monitoring switch and the second monitoring switch; and
detecting a switch in which a fault has occurred from among the first switch and the second switch based on the first communication status, the second communication status, and the third communication status.