JP5456921B1

JP5456921B1 - Fault recovery device, fault recovery method, and fault recovery program

Info

Publication number: JP5456921B1
Application number: JP2013029215A
Authority: JP
Inventors: 武人小澤; 章仁田中
Original assignee: SoftBank Mobile Corp
Current assignee: SoftBank Corp
Priority date: 2013-02-18
Filing date: 2013-02-18
Publication date: 2014-04-02
Anticipated expiration: 2033-02-18
Also published as: JP2014157567A

Abstract

[Problem] To recover from a failure that has occurred in a network device by using the optimum coping method.
[Solution] A failure recovery apparatus preferentially selects, from among a plurality of coping methods for recovering from a failure that has occurred in any of a plurality of network devices connected to a network, a coping method with a small number of failures within a certain past period (steps 303 and 307), and attempts to recover from the failure by executing the selected coping method (step 308). A coping method with a small number of failures within the certain past period is considered effective for failure recovery, so preferentially selecting and executing such a coping method makes it possible to attempt failure recovery with the optimum coping method.
[Selected drawing] FIG. 3

Description

The present invention relates to an apparatus, a method, and a program for recovering from a failure that has occurred in a network device.

SNMP (Simple Network Management Protocol) is commonly used to monitor, at the IP layer level, a plurality of network devices connected to a network. In SNMP, a manager collects trap information sent by monitored agents and thereby monitors for the occurrence of failures and the like. The details are specified in RFCs (Requests for Comments) 1157, 1441, and others, which are managed by the IETF (Internet Engineering Task Force).

However, conventional network management such as SNMP does not consider which coping method is optimal when there are several candidate coping methods for recovering from a failure that has occurred in a network device. As a result, several coping methods sometimes had to be executed to recover from a single failure.

Accordingly, an object of the present invention is to recover from a failure that has occurred in a network device by using the optimum coping method.

To solve the above problem, the failure recovery apparatus according to the present invention comprises: selection means for preferentially selecting, from among a plurality of coping methods for recovering from a failure that has occurred in any of a plurality of network devices connected to a network, a coping method with a small number of failures within a certain past period; a first counter that counts, for each of the plurality of network devices, how many times each coping method has failed within the certain past period; a second counter that counts, across all of the plurality of network devices, how many times each coping method has failed within the certain past period; and coping execution means for attempting to recover from the failure by executing the selected coping method. The selection means selects, from among the plurality of coping methods, the coping method for which the count value of the first counter is smallest. When there are a plurality of coping methods for which the count value of the first counter is smallest, the selection means selects, from among those coping methods, the coping method for which the count value of the second counter is smallest.

According to the present invention, failure recovery can be attempted with the optimum coping method by preferentially selecting a coping method with a small number of failures within a certain past period.

FIG. 1 is an overall configuration diagram of a network including a failure recovery apparatus according to the present embodiment.
FIG. 2 is an explanatory diagram showing the relationship between a failure event and its coping methods according to the present embodiment.
FIG. 3 is a flowchart showing the flow of the failure recovery procedure according to the present embodiment.

Embodiments of the present invention will be described below with reference to the drawings.
FIG. 1 is an overall configuration diagram of a network 30 including a failure recovery apparatus 10 according to the present embodiment. The failure recovery apparatus 10 is connected to a plurality of network devices 21, 22, and 23 via the network 30. The failure recovery apparatus 10 is, for example, an operation support system that includes hardware resources such as a processor and a memory and executes processing for remote maintenance of the network devices 21, 22, and 23. The network 30 includes, for example, the core network of a mobile communication network and a public IP network. The network devices 21, 22, and 23 are, for example, femtocell base stations, routers, or switches. When a failure occurs in any of the network devices 21, 22, and 23, the device in which the failure occurred sends a failure occurrence notification to the failure recovery apparatus 10. Upon receiving the notification, the failure recovery apparatus 10 preferentially selects, from among a plurality of coping methods for recovering from the failure, a coping method with a small number of failures within a certain past period, and attempts to recover from the failure by executing the selected coping method. When recovery is complete, or when recovery has not succeeded even after the recovery process has been repeated a predetermined number of times, the failure recovery apparatus 10 reports that fact to the operator 40.

Next, the mechanism for preferentially selecting, from among a plurality of coping methods, a coping method with a small number of failures within a certain past period will be described with reference to FIGS. 1 and 2. As shown in FIG. 2, a plurality of coping methods 61, 62, and 63 for recovering from a given failure event 50 are associated with that event in advance. This association is determined according to the type of the failure event 50 (a cause related to an application, the operating system, hardware, communication quality, and so on). The failure recovery apparatus 10 includes: a first counter 11 that counts, for each of the network devices 21, 22, and 23, how many times each coping method has failed within a certain past period; a second counter 12 that counts, across all of the network devices 21, 22, and 23, how many times each coping method has failed within the certain past period; a first history database 13 that holds the past coping history of each of the network devices 21, 22, and 23 (the types of failure events that occurred in the past, the coping methods executed to recover from those failures, and their execution dates and times); and a second history database 14 that holds the past coping history of all of the network devices 21, 22, and 23 (the network devices in which failures occurred in the past, the types of the failure events, the coping methods executed to recover from those failures, and their execution dates and times).

For example, the first counter 11 keeps three separate sets of counts: the number of failures of each of the coping methods 61, 62, and 63 that were executed within the certain past period to recover from a failure in the network device 21 and failed; the corresponding numbers for failures in the network device 22; and the corresponding numbers for failures in the network device 23. The second counter 12 counts the number of failures of each of the coping methods 61, 62, and 63 that were executed within the certain past period to recover from a failure in any of the network devices 21, 22, and 23 and failed.
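
The two counters described above can be sketched as follows. This is only an illustrative sketch, not the implementation in the specification: the record layout `FailedAttempt`, the function name `count_failures`, and identifiers such as "device21" and "method61" are hypothetical, and the length of the "certain past period" is passed in as a parameter because the text does not fix it.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FailedAttempt:
    """One failed execution of a coping method (hypothetical record layout)."""
    device: str            # e.g. "device21"
    method: str            # e.g. "method61"
    executed_at: datetime  # execution date and time


def count_failures(attempts, now, window):
    """Count failures that fall within `window` before `now`.

    Returns (first_counter, second_counter):
      first_counter  is keyed by (device, method)  -> per-device failure count
      second_counter is keyed by method            -> failure count over all devices
    """
    first_counter = Counter()
    second_counter = Counter()
    for attempt in attempts:
        if now - window <= attempt.executed_at <= now:
            first_counter[(attempt.device, attempt.method)] += 1
            second_counter[attempt.method] += 1
    return first_counter, second_counter
```

Keying the first counter by a (device, method) pair keeps the per-device counts separate, while the second counter aggregates the same failures by method only.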

For example, when a failure event 50 occurs in the network device 21, the failure recovery apparatus 10 refers to the first counter 11 and obtains the number of failures of each of the coping methods 61, 62, and 63 that were executed within the certain past period to recover from a failure in the network device 21 and failed. The failure recovery apparatus 10 preferentially selects, from among the coping methods 61, 62, and 63, the coping method with the fewest failures, and attempts to recover from the failure by executing the selected coping method. If there are a plurality of coping methods with the smallest count value (number of failures) of the first counter 11, the failure recovery apparatus 10 selects, from among those coping methods, the one with the smallest count value (number of failures) of the second counter 12. If there are also a plurality of coping methods with the smallest count value of the second counter 12, the failure recovery apparatus 10 randomly selects one of them.
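
The selection rule described above (fewest failures on the affected device, then fewest failures across all devices, then a random choice) can be sketched as follows. This is a minimal sketch under assumptions: the function name `select_method`, the dictionary layout of the two counters, and the identifiers in the usage note are hypothetical and do not come from the specification.

```python
import random


def select_method(methods, device, first_counter, second_counter, rng=random):
    """Pick one coping method for `device` from a non-empty list `methods`.

    first_counter  maps (device, method) -> failures on that device
    second_counter maps method           -> failures over all devices
    Missing entries are treated as zero failures.
    """
    # Keep the methods with the fewest failures on this device.
    per_device = {m: first_counter.get((device, m), 0) for m in methods}
    lowest = min(per_device.values())
    candidates = [m for m in methods if per_device[m] == lowest]
    if len(candidates) == 1:
        return candidates[0]

    # Tie: keep the methods with the fewest failures over all devices.
    overall = {m: second_counter.get(m, 0) for m in candidates}
    lowest = min(overall.values())
    candidates = [m for m in candidates if overall[m] == lowest]

    # Remaining tie (or a single survivor): choose at random.
    return rng.choice(candidates)
```

For instance, with first-counter values of 2, 0, and 0 for "method61", "method62", and "method63" on "device21", and second-counter values of 5, 3, and 1, the first counter leaves "method62" and "method63" tied, and the second counter then picks "method63".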

FIG. 3 is a flowchart showing the flow of the failure recovery procedure according to this embodiment.
In step 301, the failure recovery apparatus 10 determines whether or not a failure occurrence notification has been received from any of the plurality of network devices 21, 22, and 23.
In step 302, the failure recovery apparatus 10 acquires an error code from the network device in which the failure has occurred, and identifies a plurality of coping methods 61, 62, 63 associated with the failure event 50.
In step 303, the failure recovery apparatus 10 selects a coping method with the smallest count value of the first counter 11 from the plurality of coping methods 61, 62, 63.
In step 304, the failure recovery apparatus 10 determines whether or not there are a plurality of countermeasures for which the count value of the first counter 11 is minimum.
In step 305, the failure recovery apparatus 10 determines whether or not, among the plurality of coping methods with the smallest count value of the first counter 11, there are a plurality of coping methods with the smallest count value of the second counter 12.
In step 306, the failure recovery apparatus 10 randomly selects any one of the plurality of countermeasures for which the count value of the second counter 12 is the smallest.
In step 307, the failure recovery apparatus 10 selects, from among the plurality of coping methods with the smallest count value of the first counter 11, the coping method with the smallest count value (number of failures) of the second counter 12.
In step 308, the failure recovery apparatus 10 attempts to recover from the failure by executing the coping method selected in step 303, 306, or 307.
In step 309, the failure recovery apparatus 10 determines whether or not the failure has been recovered.
In step 310, the failure recovery apparatus 10 updates the first history database 13. The update is performed by newly recording in the first history database 13 the type of failure that occurred in the network device, the coping method executed to recover from the failure, and its execution date and time. When the coping method was executed manually by the operator 40, the first history database 13 is updated by acquiring a log from the operator 40.
In step 311, the failure recovery apparatus 10 updates the second history database 14. The update is performed by newly recording in the second history database 14 the network device in which the failure occurred, the type of the failure event, the coping method executed to recover from the failure, and its execution date and time. When the coping method was executed manually by the operator 40, the second history database 14 is updated by acquiring a log from the operator 40.
In step 312, the failure recovery apparatus 10 notifies the operator 40 that recovery is complete.
In step 313, the failure recovery apparatus 10 updates the first counter 11. The update is performed by referring to the execution dates and times of the coping methods recorded in the first history database 13 and incrementing the count value of the first counter 11 when the same coping method has been executed within the certain past period for the network device in which the failure occurred.
In step 314, the failure recovery apparatus 10 updates the second counter 12. The update is performed by referring to the execution dates and times of the coping methods recorded in the second history database 14 and incrementing the count value of the second counter 12 when the same coping method has been executed within the certain past period.
In step 315, the failure recovery apparatus 10 determines whether or not the number of executions of step 308 exceeds a predetermined threshold value.
In step 316, the failure recovery apparatus 10 notifies the operator 40 that the recovery is impossible.
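
The retry loop formed by steps 303 to 316 can be sketched as follows. The text lists the steps but not the branches between them, so the control flow here is an assumed reading: a successful attempt ends the loop, a failed attempt increments both counters and triggers another selection, and the loop gives up after `max_attempts` executions. The names `recover`, `fewest_failures`, `execute`, and `max_attempts` are hypothetical. As simplifications, the history databases are reduced to a returned list, the counters are incremented on every failure without the time-window check, and the default selector breaks remaining ties by list order instead of randomly.

```python
from collections import Counter


def fewest_failures(methods, device, first_counter, second_counter):
    """Simplified selector: per-device failures first, then overall failures.

    Remaining ties go to the earliest entry in `methods` (the text picks randomly).
    """
    return min(methods, key=lambda m: (first_counter[(device, m)], second_counter[m]))


def recover(device, methods, execute, first_counter=None, second_counter=None,
            select=fewest_failures, max_attempts=3):
    """Try coping methods until one succeeds or `max_attempts` is reached.

    `execute(device, method)` must return True when the failure is recovered.
    The counters are collections.Counter objects and are updated in place.
    Returns (recovered, history), where history is a list of (method, succeeded).
    """
    first_counter = Counter() if first_counter is None else first_counter
    second_counter = Counter() if second_counter is None else second_counter
    history = []
    for _ in range(max_attempts):                # bound checked in step 315
        method = select(methods, device, first_counter, second_counter)  # steps 303-307
        succeeded = execute(device, method)      # steps 308-309
        history.append((method, succeeded))      # stands in for the history databases
        if succeeded:
            return True, history                 # caller reports completion (step 312)
        first_counter[(device, method)] += 1     # step 313
        second_counter[method] += 1              # step 314
    return False, history                        # caller reports failure (step 316)
```

Because each failed attempt raises the failure count of the method just tried, the next pass through the selector naturally moves on to a different method when one with fewer failures is available.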

The failure recovery processing according to the present embodiment focuses on the tendency that, among the coping methods 61, 62, and 63, a coping method with a small number of failures within the certain past period is more effective for failure recovery. By preferentially selecting and executing such a coping method, failure recovery can be attempted with the optimum coping method. If there are a plurality of coping methods with the smallest count value of the first counter 11, the one among them with the smallest count value of the second counter 12 is considered the most effective for failure recovery. If there are also a plurality of coping methods with the smallest count value of the second counter 12, any one of them may be selected at random.

Note that the failure recovery processing may be performed by recording a failure recovery program for executing steps 301 to 316 on a computer-readable recording medium and causing a computer system to read and execute the program recorded on that medium. The “computer system” may include an OS and hardware such as peripheral devices. The “computer-readable recording medium” refers to a storage resource such as a flexible disk, a magneto-optical disk, a ROM, a writable nonvolatile memory such as a flash memory, a portable medium such as a DVD, or a hard disk built into a computer system. The “computer-readable recording medium” also includes a medium that holds the program temporarily, such as a volatile memory inside a computer system that serves as a server or client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line. The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium or by a transmission wave in the transmission medium. Here, the “transmission medium” that transmits the program refers to a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line.

DESCRIPTION OF SYMBOLS
10 ... Failure recovery apparatus
11 ... First counter
12 ... Second counter
13 ... First history database
14 ... Second history database
21, 22, 23 ... Network devices
30 ... Network
40 ... Operator
50 ... Failure event
61, 62, 63 ... Coping methods

Claims

A failure recovery device comprising:
selection means for preferentially selecting, from among a plurality of coping methods for recovering from a failure that has occurred in any of a plurality of network devices connected to a network, a coping method with a small number of failures within a certain past period;
a first counter that counts, for each of the plurality of network devices, how many times each coping method has failed within the certain past period;
a second counter that counts, for all of the plurality of network devices, how many times each coping method has failed within the certain past period; and
coping execution means for attempting to recover from the failure by executing the selected coping method,
wherein the selection means selects, from among the plurality of coping methods, the coping method for which the count value of the first counter is smallest, and
wherein, when there are a plurality of coping methods for which the count value of the first counter is smallest, the selection means selects, from among those coping methods, the coping method for which the count value of the second counter is smallest.

The failure recovery device according to claim 1,
wherein, when there are a plurality of coping methods for which the count value of the second counter is smallest, the selection means randomly selects one of those coping methods.

A failure recovery method in which a coping method with a small number of failures within a certain past period is preferentially selected from among a plurality of coping methods for recovering from a failure that has occurred in any of a plurality of network devices connected to a network, and recovery from the failure is attempted by executing the selected coping method, wherein a computer system:
counts, using a first counter, for each of the plurality of network devices, how many times each coping method has failed within the certain past period;
counts, using a second counter, for all of the plurality of network devices, how many times each coping method has failed within the certain past period;
selects, from among the plurality of coping methods, the coping method for which the count value of the first counter is smallest; and
when there are a plurality of coping methods for which the count value of the first counter is smallest, selects, from among those coping methods, the coping method for which the count value of the second counter is smallest.

A failure recovery program for causing a computer system to execute a method in which a coping method with a small number of failures within a certain past period is preferentially selected from among a plurality of coping methods for recovering from a failure that has occurred in any of a plurality of network devices connected to a network, and recovery from the failure is attempted by executing the selected coping method, the program causing the computer system to execute processes of:
counting, using a first counter, for each of the plurality of network devices, how many times each coping method has failed within the certain past period;
counting, using a second counter, for all of the plurality of network devices, how many times each coping method has failed within the certain past period;
selecting, from among the plurality of coping methods, the coping method for which the count value of the first counter is smallest; and
when there are a plurality of coping methods for which the count value of the first counter is smallest, selecting, from among those coping methods, the coping method for which the count value of the second counter is smallest.