JP5459472B2 - Failure recovery apparatus, failure recovery method, and program

Info

Publication number: JP5459472B2
Application number: JP2009184215A
Authority: JP
Inventors: 友宏粉川
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-08-07
Filing date: 2009-08-07
Publication date: 2014-04-02
Anticipated expiration: 2029-08-07
Also published as: JP2011039632A

Description

The present invention relates to a failure recovery apparatus, a failure recovery method, and a program for recovering a computer system when a failure occurs in it.

In recent years, mechanisms that automatically execute recovery from a failure have been proposed in order to restore a computer system quickly when a failure occurs and to reduce the burden on users (see, for example, Patent Documents 1 and 2).

Specifically, Patent Document 1 proposes, as a mechanism for detecting a failure that has occurred in a computer system and recovering from it automatically, a system (an automatic failure recovery system) that includes a database storing information on the causes of failures and a database storing information on recovery methods.

In the system disclosed in Patent Document 1, when a failure occurs in the target computer system, the failure detection means detects the failure, and the database is queried for a recovery method corresponding to the detected failure. An appropriate recovery method is then identified, and the recovery execution means performs the failure recovery.

However, the system disclosed in Patent Document 1, in principle, executes the recovery process automatically as soon as it detects a failure. With this system it is therefore difficult, once recovery has been performed, to collect information that can be collected only at the time of the failure, such as OS dumps, memory information, performance information, and information in logs that are rotated by the recovery process. Consequently, the information available for failure analysis is limited in the automatic failure recovery system disclosed in Patent Document 1, and after recovery the true cause of the failure may remain undetermined, so that the failure cannot be fully addressed.

Patent Document 2, on the other hand, proposes a system that performs recovery after collecting the necessary information when a failure occurs. In the system disclosed in Patent Document 2, when a failure occurs in the target computer system, the information needed to investigate the cause is collected, and then the computer system is restarted.

Patent Document 1: Japanese Patent Laid-Open No. 11-073336
Patent Document 2: Japanese Patent Laid-Open No. 11-120032

However, although the system disclosed in Patent Document 2 can solve the problem of the system disclosed in Patent Document 1, it does not take into account the time required to collect the information needed to investigate the cause, so collecting the information may take a very long time. As a result, when the system disclosed in Patent Document 2 is used, the downtime of the computer system may become very long.

An object of the present invention is to solve the above problems and to provide a failure recovery apparatus, a failure recovery method, and a program that can collect, from a computer system, information that can be collected only at the time of a failure while keeping the time required to recover the computer system from becoming prolonged.

To achieve the above object, a failure recovery apparatus according to the present invention comprises:
a failure detection unit that detects a failure occurring in a target computer system;
a recovery method determination unit that determines a recovery method corresponding to the detected failure;
an analysis information acquisition unit that acquires analysis information specifying one or more pieces of collection information whose collection is required for analysis of the failure, the collection time required to collect each piece of collection information, and the priority assigned to each piece of collection information;
a collection information determination unit that identifies the time available for collecting the collection information and, based on the collection times included in the analysis information, determines, in order of priority, the collection information that can be collected within the identified available time;
an information collection unit that collects the collection information determined by the collection information determination unit; and
a recovery execution unit that, after the information collection unit has executed the collection, recovers the computer system in accordance with the recovery method determined by the recovery method determination unit.

To achieve the above object, a failure recovery method according to the present invention comprises the steps of:
(a) detecting a failure occurring in a target computer system;
(b) determining a recovery method corresponding to the failure detected in step (a);
(c) acquiring analysis information specifying one or more pieces of collection information whose collection is required for analysis of the failure, the collection time required to collect each piece of collection information, and the priority assigned to each piece of collection information;
(d) identifying the time available for collecting the collection information and, based on the collection times included in the analysis information, determining, in order of priority, the collection information that can be collected within the identified available time;
(e) collecting the collection information determined in step (d); and
(f) after the collection in step (e) has been executed, recovering the computer system in accordance with the recovery method determined in step (b).

Furthermore, to achieve the above object, a program according to the present invention is a program for causing a computer to recover a computer system in which a failure has occurred, the program causing the computer to execute the steps of:
(a) detecting a failure occurring in the computer system;
(b) determining a recovery method corresponding to the failure detected in step (a);
(c) acquiring analysis information specifying one or more pieces of collection information whose collection is required for analysis of the failure, the collection time required to collect each piece of collection information, and the priority assigned to each piece of collection information;
(d) identifying the time available for collecting the collection information and, based on the collection times included in the analysis information, determining, in order of priority, the collection information that can be collected within the identified available time;
(e) collecting the collection information determined in step (d); and
(f) after the collection in step (e) has been executed, recovering the computer system in accordance with the recovery method determined in step (b).

With the above features, the failure recovery apparatus, the failure recovery method, and the program of the present invention make it possible to collect, from a computer system, information that can be collected only at the time of a failure while keeping the time required to recover the computer system from becoming prolonged.

FIG. 1 is a block diagram showing the configuration of a failure recovery apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the failure information registered in the failure information database shown in FIG. 1.
FIG. 3(a) is a diagram showing an example of the recovery method information registered in the recovery method database shown in FIG. 1, and FIG. 3(b) is a diagram showing an example of a determined recovery method.
FIG. 4(a) is a diagram showing an example of the analysis information registered in the analysis information database shown in FIG. 1, and FIG. 4(b) is a diagram showing an example of acquired analysis information.
FIG. 5 is a diagram showing an example of the collection information list created by the collection information determination unit shown in FIG. 1.
FIG. 6 is a flowchart showing the operation of the failure recovery apparatus according to the embodiment of the present invention.
FIG. 7 is a flowchart showing the failure detection process of FIG. 6 in detail.
FIG. 8 is a flowchart showing the recovery method determination process of FIG. 6 in detail.
FIG. 9 is a flowchart showing the analysis information acquisition process of FIG. 6 in detail.
FIG. 10 is a flowchart showing the collection information determination process of FIG. 6 in detail.
FIG. 11 is a flowchart showing steps D4 and D5 of FIG. 10 in further detail.
FIG. 12 is a flowchart showing the information collection process of FIG. 6 in detail.
FIG. 13 is a flowchart showing the recovery process of FIG. 6 in detail.

(Embodiment 1)
Hereinafter, a failure recovery apparatus, a failure recovery method, and a program according to an embodiment of the present invention will be described with reference to FIGS. 1 to 13. First, the configuration of the failure recovery apparatus according to the present embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the failure recovery apparatus according to the embodiment of the present invention.

The failure recovery apparatus 1 of the present embodiment shown in FIG. 1 is an apparatus that, when a failure occurs in the computer system 20, executes on the computer system 20 an appropriate recovery process corresponding to the failure. As shown in FIG. 1, the failure recovery apparatus 1 includes a failure detection unit 11, a recovery method determination unit 12, an analysis information acquisition unit 13, a collection information determination unit 14, an information collection unit 15, and a recovery execution unit 16.

The failure detection unit 11 detects a failure occurring in the target computer system 20. The recovery method determination unit 12 determines a recovery method corresponding to the detected failure. The analysis information acquisition unit 13 acquires analysis information. The analysis information specifies one or more pieces of collection information whose collection is required for analysis of the failure, the collection time required to collect each piece of collection information, and the priority assigned to each piece of collection information.

The collection information determination unit 14 identifies the time available for collecting the collection information and, based on the collection times included in the analysis information, determines, in order of priority, the collection information that can be collected within the identified available time. The information collection unit 15 then collects the collection information determined by the collection information determination unit 14. After the information collection unit 15 has executed the collection, the recovery execution unit 16 recovers the computer system 20 in accordance with the recovery method determined by the recovery method determination unit 12.

Thus, in the failure recovery apparatus 1, the collection information that can actually be collected is first selected from the information needed to analyze the failure that has occurred (the collection information), taking into account the time permitted for collection, the collection time of each piece of collection information, and the priorities. The selected collection information is then collected, after which the failure recovery process is performed.

In other words, the failure recovery apparatus 1 collects as much information useful for analysis as possible within a limited time, in descending order of priority, and then immediately executes the recovery process. The failure recovery apparatus 1 therefore makes it possible to collect, from the computer system 20, information that can be collected only at the time of a failure while keeping the time required to recover the computer system 20 from becoming prolonged.

The configuration of the failure recovery apparatus 1 will now be described more specifically with reference to FIGS. 2 to 5 in addition to FIG. 1. In the present embodiment, as shown in FIG. 1, the failure recovery apparatus 1 is connected to a failure information database 111, a recovery method database 122, an analysis information database 132, and an operation policy database 142.

In the present embodiment, when the failure detection unit 11 detects a failure that has occurred in the computer system 20, it accesses the failure information database 111 and identifies the identifier of the failure (the failure information ID).

Specifically, the failure information database 111 stores the failure information shown in FIG. 2. FIG. 2 is a diagram showing an example of the failure information registered in the failure information database shown in FIG. 1. As shown in FIG. 2, the failure information includes failure contents and a failure information ID assigned to each failure content. In the example of FIG. 2, the specific location where the failure occurred is used as the failure information ID, but the present embodiment is not limited to this.

The failure detection unit 11 therefore identifies the content of the detected failure, checks it against the failure information shown in FIG. 2, and identifies the corresponding failure information ID. The failure detection unit 11 also transmits the identified failure information ID to the recovery method determination unit 12 and the analysis information acquisition unit 13.
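The lookup performed by the failure detection unit 11 can be sketched as a simple table query. This is a minimal illustration, not the patented implementation: the dictionary stands in for the failure information database 111, the "unauthorized memory access" to "Memory 1" pair follows the example given later in this description, and the second row, the function names, and the recipient labels are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical stand-in for the failure information database 111 (FIG. 2).
# The first pair follows the example used in this description; the second
# row is invented purely for illustration.
FAILURE_INFORMATION: Dict[str, str] = {
    "unauthorized memory access": "Memory 1",
    "disk read error": "Disk 1",  # hypothetical entry
}


def identify_failure_id(failure_content: str) -> Optional[str]:
    """Return the failure information ID for the detected failure content,
    or None when the content is not registered."""
    return FAILURE_INFORMATION.get(failure_content)


def recipients_of(failure_id: str) -> Dict[str, str]:
    """Model the transmission of one ID to units 12 and 13 (labels assumed)."""
    return {
        "recovery_method_determination_unit_12": failure_id,
        "analysis_information_acquisition_unit_13": failure_id,
    }


failure_id = identify_failure_id("unauthorized memory access")
if failure_id is not None:
    print(recipients_of(failure_id))
```

How an unregistered failure should be handled is not stated in this description; returning `None` here is only one possible choice.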

In the present embodiment, when the recovery method determination unit 12 receives the failure information ID from the failure detection unit 11, it accesses the recovery method database 122. The recovery method determination unit 12 then determines a recovery method using the failure information ID received from the failure detection unit 11 as a key.

Specifically, the recovery method database 122 stores the recovery method information shown in FIG. 3(a). FIG. 3(a) is a diagram showing an example of the recovery method information registered in the recovery method database shown in FIG. 1. As shown in FIG. 3(a), the recovery method information includes, for each failure information ID, a specific recovery process set in advance and the time required to execute that recovery process (the recovery time). Each recovery process and each recovery time corresponds to a particular failure.

Accordingly, when the recovery method determination unit 12 receives the failure information ID from the failure detection unit 11, it identifies the corresponding recovery process and recovery time and determines these as the recovery method. FIG. 3(b) is a diagram showing an example of a determined recovery method. The example of FIG. 3(b) shows the recovery method for the case where the failure information ID of the failure that has occurred is "Memory 1". The recovery method determination unit 12 also transmits information specifying the determined recovery method (decided recovery method information) 121 to the collection information determination unit 14 and the recovery execution unit 16.
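A keyed lookup of this kind might look like the following sketch. The "Memory 1" row uses the values given later in this description ("server restart", 600); treating the recovery time as seconds, the class and function names, and the error handling for an unregistered ID are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RecoveryMethod:
    process: str          # recovery process to execute
    recovery_time_s: int  # recovery time; the unit (seconds) is an assumption


# Hypothetical stand-in for the recovery method database 122 (FIG. 3(a)).
RECOVERY_METHOD_INFORMATION: Dict[str, RecoveryMethod] = {
    "Memory 1": RecoveryMethod(process="server restart", recovery_time_s=600),
}


def determine_recovery_method(failure_id: str) -> RecoveryMethod:
    """Use the failure information ID as the key and return the decided method."""
    try:
        return RECOVERY_METHOD_INFORMATION[failure_id]
    except KeyError:
        # Behavior for an unregistered ID is not described; raising is a guess.
        raise LookupError(f"no recovery method registered for {failure_id!r}") from None


decided = determine_recovery_method("Memory 1")
print(decided.process, decided.recovery_time_s)
```

The returned object plays the role of the decided recovery method information 121: the recovery time is what the collection information determination unit 14 needs, and the recovery process is what the recovery execution unit 16 needs.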

In the present embodiment, when the analysis information acquisition unit 13 receives the failure information ID from the failure detection unit 11, it accesses the analysis information database 132. The analysis information acquisition unit 13 then acquires the analysis information corresponding to the detected failure, using the failure information ID received from the failure detection unit 11 as a key.

Specifically, the analysis information database 132 stores the analysis information shown in FIG. 4(a). FIG. 4(a) is a diagram showing an example of the analysis information registered in the analysis information database shown in FIG. 1. As shown in FIG. 4(a), the analysis information includes, for each failure information ID, one or more pieces of collection information set in advance, the collection time of each piece of collection information, and the priorities. Note that priorities are assigned among the pieces of collection information that correspond to the same failure information ID.

Accordingly, when the analysis information acquisition unit 13 receives the failure information ID from the failure detection unit 11, it identifies the corresponding one or more pieces of collection information, the collection time of each piece of collection information, and the priorities, and acquires these as the analysis information corresponding to the detected failure. FIG. 4(b) is a diagram showing an example of acquired analysis information. The example of FIG. 4(b) shows the analysis information for the case where the failure information ID of the failure that has occurred is "Memory 1". The analysis information acquisition unit 13 also transmits the acquired analysis information to the collection information determination unit 14.

In the present embodiment, the collection information determination unit 14 first accesses the operation policy database 142. The operation policy database 142 stores the operation policy for the recovery process. The operation policy defines the time that can be used for the recovery process, that is, the time that is acceptable between the occurrence of a failure and the completion of recovery (the recovery allowable time). The collection information determination unit 14 thus identifies the recovery allowable time from the operation policy database 142.

Once the collection information determination unit 14 has identified the recovery allowable time, it compares this with the recovery time included in the decided recovery method information 121 (see FIG. 3(b)) and identifies the time available for collecting the collection information for analysis (the analysis allowable time). The collection information determination unit 14 then determines, based on the collection time of each piece of collection information included in the acquired analysis information (see FIG. 4(b)) and in the order of the priorities of the pieces of collection information, the collection information that can be collected within the analysis allowable time.
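The determination described above can be illustrated with a short sketch. Two points are assumptions rather than statements from this description: the analysis allowable time is computed here as the recovery allowable time minus the recovery time, and an item that does not fit in the remaining time is skipped so that lower-priority items are still considered (the detailed rule belongs to FIG. 11, which is not reproduced here). The collection times and priorities are the "Memory 1" values given later in this description; the 800-second recovery allowable time and all function names are hypothetical.

```python
from typing import Iterable, List, Tuple

CollectionItem = Tuple[str, int, int]  # (name, collection time in s, priority)


def analysis_allowable_time(recovery_allowable_s: int, recovery_time_s: int) -> int:
    """Assumed rule: time left for collection once recovery time is reserved."""
    return max(0, recovery_allowable_s - recovery_time_s)


def decide_collection(items: Iterable[CollectionItem], allowable_s: int) -> List[str]:
    """Walk the items in priority order (1 = highest) and keep each one that
    still fits in the remaining time. Skipping items that do not fit, rather
    than stopping at the first one, is an assumption."""
    remaining = allowable_s
    selected: List[str] = []
    for name, collection_time_s, _priority in sorted(items, key=lambda item: item[2]):
        if collection_time_s <= remaining:
            selected.append(name)
            remaining -= collection_time_s
    return selected


items = [
    ("memory dump", 120, 1),
    ("memory performance information", 30, 2),
    ("OS dump", 600, 3),
    ("process startup log", 30, 4),
]
# 800 s is a hypothetical recovery allowable time; 600 s is the recovery time
# given for "Memory 1" (unit assumed to be seconds).
budget = analysis_allowable_time(recovery_allowable_s=800, recovery_time_s=600)
print(budget, decide_collection(items, budget))
```

With these numbers the budget is 200 seconds, so the memory dump, the memory performance information, and the process startup log are chosen (180 seconds in total) and the 600-second OS dump is left out. If the intended rule is to stop at the first item that does not fit, the `if` branch would need an `else: break`.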

As shown in FIG. 5, the collection information determination unit 14 also creates a list specifying the determined collection information (a collection information list) 141 and transmits it to the information collection unit 15. FIG. 5 is a diagram showing an example of the collection information list created by the collection information determination unit shown in FIG. 1.

In the present embodiment, when the information collection unit 15 receives the collection information list 141 from the collection information determination unit 14, it accesses the computer system 20. The information collection unit 15 then acquires, from the computer system 20, the collection information included in the collection information list 141 (see FIG. 5). When acquisition of the collection information is complete, the information collection unit 15 notifies the recovery execution unit 16 of that fact.

In the present embodiment, when the recovery execution unit 16 receives the notification from the information collection unit 15, it identifies the recovery process from the decided recovery method information 121 (see FIG. 3(b)) received from the recovery method determination unit 12 and executes the identified recovery process.

In the present embodiment, however, when there is no collection information that can be collected within the analysis allowable time (for example, when the recovery allowable time is 0 (zero)), the collection information determination unit 14 notifies the information collection unit 15 that collection of the collection information is not to be executed. In this case, the recovery execution unit 16 immediately executes the recovery process for the computer system 20.

Next, the operation of the failure recovery apparatus 1 according to the embodiment of the present invention will be described with reference to FIGS. 6 to 13. First, the overall operation of the failure recovery apparatus 1 will be described with reference to FIG. 6. FIG. 6 is a flowchart showing the operation of the failure recovery apparatus according to the embodiment of the present invention. In the following description, FIG. 1 is referred to as appropriate. In the present embodiment, the failure recovery method of the present embodiment is carried out by operating the failure recovery apparatus 1. The following description of the operation of the failure recovery apparatus 1 therefore also serves as the description of the failure recovery method of the present embodiment.

As shown in FIG. 6, the failure detection unit 11 first detects a failure that has occurred in the computer system 20 (step S1). Next, the recovery method determination unit 12 determines a recovery method corresponding to the detected failure (step S2). The analysis information acquisition unit 13 then acquires the analysis information corresponding to the detected failure (step S3).

Next, the collection information determination unit 14 identifies the time available for collecting the collection information (the analysis allowable time) and determines, from the collection information needed to analyze the detected failure, the collection information to be collected within the analysis allowable time (step S4).

Next, the information collection unit 15 collects, from the computer system 20, the collection information determined in step S4 (step S5). The recovery execution unit 16 then executes the recovery method determined in step S2, and the recovery process for the computer system 20 is performed (step S6). With the execution of step S6, the processing in the failure recovery apparatus 1 ends.
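The order of steps S1 to S6 can be summarized in a small driver function. This is only a sketch of the control flow: each unit is passed in as a plain callable, the function and parameter names are hypothetical, and the stub values in the demonstration are placeholders rather than real detection, collection, or recovery logic. Skipping S5 when nothing can be collected follows the behavior described for that case in this embodiment.

```python
from typing import Callable, List, Sequence


def run_failure_recovery(
    detect: Callable[[], str],
    determine_recovery: Callable[[str], dict],
    acquire_analysis: Callable[[str], Sequence[tuple]],
    decide_collection: Callable[[dict, Sequence[tuple]], List[str]],
    collect: Callable[[List[str]], None],
    recover: Callable[[dict], None],
) -> List[str]:
    """Run S1..S6 in order and return the list of steps actually executed."""
    trace: List[str] = []
    failure_id = detect()                         # S1: failure detection
    trace.append("S1")
    method = determine_recovery(failure_id)       # S2: recovery method
    trace.append("S2")
    analysis = acquire_analysis(failure_id)       # S3: analysis information
    trace.append("S3")
    targets = decide_collection(method, analysis)  # S4: what to collect
    trace.append("S4")
    if targets:                                   # S5 is skipped if nothing fits
        collect(targets)
        trace.append("S5")
    recover(method)                               # S6: recovery, always last
    trace.append("S6")
    return trace


collected: List[str] = []
trace = run_failure_recovery(
    detect=lambda: "Memory 1",
    determine_recovery=lambda fid: {"process": "server restart", "recovery_time_s": 600},
    acquire_analysis=lambda fid: [("memory dump", 120, 1)],
    decide_collection=lambda method, analysis: [name for name, _t, _p in analysis],
    collect=collected.extend,
    recover=lambda method: None,
)
print(trace, collected)
```

Passing the units as callables keeps the ordering constraint (collection strictly before recovery) in one place, independent of how each unit is implemented.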

Each of steps S1 to S6 shown in FIG. 6 will now be described more specifically with reference to FIGS. 7 to 13. First, step S1 (the failure detection process) shown in FIG. 6 will be described with reference to FIG. 7. FIG. 7 is a flowchart showing the failure detection process of FIG. 6 in detail.

As shown in FIG. 7, the failure detection unit 11 first detects a failure that has occurred in the computer system 20 (step A1). The failure detection unit 11 then queries the failure information database 111 and identifies the failure information ID (see FIG. 2) of the detected failure (step A2). For example, when the detected failure is "unauthorized memory access", the failure detection unit 11 refers to the failure information stored in the failure information database 111 and identifies "Memory 1" as the corresponding failure information ID.

Next, the failure detection unit 11 transmits the identified failure information ID, for example "Memory 1", to the recovery method determination unit 12 (step A3) and also transmits it to the analysis information acquisition unit 13 (step A4). With the execution of step A4, step S1 shown in FIG. 6 ends. In the present embodiment, step A3 and step A4 may be executed simultaneously, or step A3 may be executed after step A4.
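Because the order of steps A3 and A4 does not matter, the two transmissions can be issued concurrently. The sketch below models this with Python's standard `concurrent.futures.ThreadPoolExecutor`; the receiver callables and their labels are hypothetical stand-ins for units 12 and 13, and nothing in this description says that threads are actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def send_failure_id(
    failure_id: str, receivers: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    """Deliver the same failure information ID to every receiver concurrently
    and return each receiver's acknowledgement."""
    if not receivers:
        return {}
    with ThreadPoolExecutor(max_workers=len(receivers)) as pool:
        futures = {
            name: pool.submit(receive, failure_id)
            for name, receive in receivers.items()
        }
        # result() blocks until the corresponding delivery has finished.
        return {name: future.result() for name, future in futures.items()}


acks = send_failure_id(
    "Memory 1",
    {
        "recovery_method_determination_unit_12": lambda fid: f"unit 12 received {fid}",
        "analysis_information_acquisition_unit_13": lambda fid: f"unit 13 received {fid}",
    },
)
print(acks)
```

The result is the same whichever delivery completes first, which mirrors the statement that A3 and A4 may run simultaneously or in either order.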

Next, step S2 (the recovery method determination process) shown in FIG. 6 will be described with reference to FIG. 8. FIG. 8 is a flowchart showing the recovery method determination process of FIG. 6 in detail.

As shown in FIG. 8, the recovery method determination unit 12 first receives the failure information ID from the failure detection unit 11 (step B1). Next, the recovery method determination unit 12 queries the recovery method database 122 and determines the recovery method (recovery process and recovery time) corresponding to the received failure information ID (step B2).

For example, suppose that in step B1 the recovery method determination unit 12 has received "Memory 1" as the failure information ID. In this case, the recovery method determination unit 12 uses "Memory 1" as a key and checks it against the recovery method information stored in the recovery method database 122 (see FIG. 3(a)). The recovery method determination unit 12 then identifies, from the recovery method information, the recovery process "server restart" corresponding to "Memory 1" and the recovery time "600" likewise corresponding to "Memory 1", and sets these as the decided recovery method (see FIG. 3(b)).

次に、復旧方法決定部１２は、ステップＢ２で得られた決定復旧方法を特定する情報（決定復旧方法情報）１２１を、採取情報決定部１４に送信し（ステップＢ３）、更に、これを復旧実行部１６にも送信する（ステップＢ４）。ステップＢ４の実行により、図６に示したステップＳ２は終了する。なお、本実施の形態において、ステップＢ３とステップＢ４とは同時に実行されても良いし、ステップＢ４の実行後にステップＢ３が実行されても良い。 Next, the recovery method determination unit 12 transmits the information specifying the determined recovery method obtained in step B2 (determined recovery method information 121) to the collection information determination unit 14 (step B3), and also transmits it to the recovery execution unit 16 (step B4). Execution of step B4 completes step S2 shown in FIG. 6. In the present embodiment, steps B3 and B4 may be executed simultaneously, or step B3 may be executed after step B4.

続いて、図９を用いて、図６に示したステップＳ３（解析情報取得処理）について説明する。図９は、図６に示す解析情報取得処理を具体的に示すフロー図である。 Next, step S3 (analysis information acquisition process) shown in FIG. 6 will be described with reference to FIG. 9. FIG. 9 is a flowchart specifically showing the analysis information acquisition process shown in FIG. 6.

図９に示すように、先ず、解析情報取得部１３は、障害検出部１１から障害情報ＩＤを受信する（ステップＣ１）。次に、解析情報取得部１３は、解析情報データベース１３２に問い合わせを行い、受信した障害情報ＩＤに対応する解析情報を取得する（ステップＣ２）。 As shown in FIG. 9, first, the analysis information acquisition unit 13 receives a failure information ID from the failure detection unit 11 (step C1). Next, the analysis information acquisition unit 13 makes an inquiry to the analysis information database 132 and acquires analysis information corresponding to the received failure information ID (step C2).

例えば、ステップＣ１において、解析情報取得部１３が、障害情報ＩＤとして「メモリ１」を受信していたとする。この場合、解析情報取得部１３は、「メモリ１」に対応する採取情報、各採取情報に対応する採取時間及び優先度を特定する。具体的には、解析情報として、メモリダンプ（採取時間：１２０秒、優先度：１）、メモリ性能情報（採取時間：３０秒、優先度：２）、ＯＳダンプ（採取時間：６００秒、優先度：３）、プロセス起動ログ（採取時間：３０秒、優先度：４）が取得される（図４（ｂ）参照）。 For example, suppose that in step C1 the analysis information acquisition unit 13 has received “memory 1” as the failure information ID. In this case, the analysis information acquisition unit 13 specifies the collection information corresponding to “memory 1” and the collection time and priority corresponding to each piece of collection information. Specifically, the following are acquired as analysis information: memory dump (collection time: 120 seconds, priority: 1), memory performance information (collection time: 30 seconds, priority: 2), OS dump (collection time: 600 seconds, priority: 3), and process activation log (collection time: 30 seconds, priority: 4) (see FIG. 4B).
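The analysis information for one failure information ID can be represented as a list of records. The sketch below assumes a dictionary in place of the analysis information database 132; `CollectionItem`, `ANALYSIS_INFO_DB`, and `acquire_analysis_info` are hypothetical names, while the four rows are the “memory 1” values quoted above.

```python
from typing import List, NamedTuple


class CollectionItem(NamedTuple):
    """One piece of collection information (hypothetical layout)."""
    name: str
    collection_time: int  # seconds
    priority: int         # 1 is the highest priority


# Hypothetical stand-in for the analysis information database 132.
ANALYSIS_INFO_DB = {
    "memory 1": [
        CollectionItem("memory dump", 120, 1),
        CollectionItem("memory performance information", 30, 2),
        CollectionItem("OS dump", 600, 3),
        CollectionItem("process activation log", 30, 4),
    ],
}


def acquire_analysis_info(failure_id: str) -> List[CollectionItem]:
    """Step C2: return the items for the failure ID, highest priority first."""
    items = ANALYSIS_INFO_DB.get(failure_id, [])
    return sorted(items, key=lambda item: item.priority)


for item in acquire_analysis_info("memory 1"):
    print(item.priority, item.name, item.collection_time)
```

Sorting by priority at lookup time is a convenience of this sketch; the text only requires that the priority of each item be available to the collection information determination unit 14.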

次に、解析情報取得部１３は、ステップＣ２で取得された解析情報（採取情報、採取時間、優先度）を、採取情報決定部１４に送信する（ステップＣ３）。ステップＣ３の実行により、図６に示したステップＳ３は終了する。 Next, the analysis information acquisition unit 13 transmits the analysis information (collection information, collection time, priority) acquired in step C2 to the collection information determination unit 14 (step C3). Execution of step C3 completes step S3 shown in FIG.

続いて、図１０を用いて、図６に示したステップＳ４（採取情報決定処理）について説明する。図１０は、図６に示す採取情報決定処理を具体的に示すフロー図である。 Next, step S4 (collection information determination process) shown in FIG. 6 will be described with reference to FIG. 10. FIG. 10 is a flowchart specifically showing the collection information determination process shown in FIG. 6.

図１０に示すように、先ず、採取情報決定部１４は、復旧方法決定部１２から決定復旧方法情報１２１（図３（ｂ）参照）を受信し（ステップＤ１）、更に、解析情報取得部１３から取得解析情報１３１（図４（ｂ）参照）を受信する（ステップＤ２）。 As shown in FIG. 10, first, the collection information determination unit 14 receives the determined recovery method information 121 (see FIG. 3B) from the recovery method determination unit 12 (step D1), and then receives the acquired analysis information 131 (see FIG. 4B) from the analysis information acquisition unit 13 (step D2).

次に、採取情報決定部１４は、運用ポリシーデータベース１４２に問い合わせを行い、運用ポリシーに規定されている復旧許容時間を特定する（ステップＤ３）。次に、採取情報決定部１４は、復旧許容時間と、決定復旧方法情報１２１に含まれる復旧時間とを対比して、解析許容時間を特定する（ステップＤ４）。 Next, the collection information determination unit 14 makes an inquiry to the operation policy database 142 and specifies the allowable recovery time specified in the operation policy (step D3). Next, the collection information determination unit 14 specifies the analysis allowable time by comparing the recovery allowable time with the recovery time included in the determined recovery method information 121 (step D4).

次に、採取情報決定部１４は、ステップＤ４で特定された解析許容時間と、取得解析情報１３１に含まれる各採取情報の採取時間及び優先度とに基づいて、解析許容時間内で採取できる採取情報を決定し、採取情報リスト１４１（図５参照）を作成する（ステップＤ５）。 Next, based on the analysis allowable time specified in step D4 and on the collection time and priority of each piece of collection information included in the acquired analysis information 131, the collection information determination unit 14 determines the collection information that can be collected within the analysis allowable time and creates the collection information list 141 (see FIG. 5) (step D5).

その後、採取情報決定部１４は、ステップＤ５で作成した採取情報リスト１４１を、情報採取部１５に送信する（ステップＤ６）。ステップＤ６の実行により、図６に示したステップＳ４は終了する。 Thereafter, the collection information determination unit 14 transmits the collection information list 141 created in step D5 to the information collection unit 15 (step D6). Execution of step D6 completes step S4 shown in FIG.

ここで、ステップＤ４及びＤ５について、図１１を用いて更に具体的に説明する。図１１は、図１０に示すステップＤ４及びＤ５を更に具体的に示すフロー図である。図１１に示すステップＥ１〜Ｅ８のうち、ステップＥ１がステップＤ４に相当し、ステップＥ２〜Ｅ８がステップＤ５に相当する。 Here, steps D4 and D5 will be described more specifically with reference to FIG. 11. FIG. 11 is a flowchart showing steps D4 and D5 of FIG. 10 in more detail. Of steps E1 to E8 shown in FIG. 11, step E1 corresponds to step D4, and steps E2 to E8 correspond to step D5.

図１１に示すように、採取情報決定部１４は、解析許容時間を特定する（ステップＥ１）。具体的には、採取情報決定部１４は、下記の数１を用い、運用ポリシーデータベース１４２に登録されている復旧許容時間から、決定復旧方法情報１２１に含まれる復旧時間（図３（ａ）及び（ｂ）参照）を減算して、解析許容時間を算出する。 As shown in FIG. 11, the collection information determination unit 14 specifies the analysis allowable time (step E1). Specifically, using Equation 1 below, the collection information determination unit 14 calculates the analysis allowable time by subtracting the recovery time included in the determined recovery method information 121 (see FIGS. 3A and 3B) from the allowable recovery time registered in the operation policy database 142.

（数１） (Equation 1)
解析許容時間＝復旧許容時間−復旧時間
Analysis allowable time = Allowable recovery time − Recovery time

次に、採取情報決定部１４は、算出した解析許容時間が０（ゼロ）より大きいかどうかを判定する（ステップＥ２）。ステップＥ２の判定の結果、解析許容時間が０（ゼロ）より大きくない場合、即ち、０（ゼロ）以下である場合は、採取情報決定部１４は、ステップＥ８を実行する。ステップＥ８の内容については後述する。 Next, the collection information determination unit 14 determines whether or not the calculated analysis allowable time is greater than 0 (zero) (step E2). If the analysis allowable time is not greater than 0 (zero) as a result of the determination in step E2, that is, if it is 0 (zero) or less, the collection information determination unit 14 executes step E8. The contents of step E8 will be described later.

一方、ステップＥ２の判定の結果、解析許容時間が０（ゼロ）より大きい場合は、採取情報決定部１４は、取得解析情報１３１（図４（ｂ）参照）が空であるかどうかを判定する（ステップＥ３）。具体的には、取得解析情報１３１として、採取すべき１又は２以上の採取情報と、各採取情報に対応する採取時間及び優先度とで構成されたリストが存在しているかどうかを判定する。 On the other hand, if the analysis allowable time is greater than 0 (zero) as a result of the determination in step E2, the collection information determination unit 14 determines whether or not the acquired analysis information 131 (see FIG. 4B) is empty. (Step E3). Specifically, it is determined whether or not a list including one or two or more pieces of collection information to be collected and collection times and priorities corresponding to the pieces of collection information exists as the acquisition analysis information 131.

ステップＥ３の判定の結果、取得解析情報１３１が空である場合は、即ち、採取すべき採取情報が存在しない場合は、採取情報決定部１４は、ステップＥ８を実行する。一方、ステップＥ３の判定の結果、取得解析情報１３１が空でない場合は、採取情報決定部１４は、ステップＥ４を実行する。 If the acquisition analysis information 131 is empty as a result of the determination in step E3, that is, if there is no collection information to be collected, the collection information determination unit 14 executes step E8. On the other hand, if the result of determination in step E3 is that the acquisition analysis information 131 is not empty, the collection information determination unit 14 executes step E4.

ステップＥ４では、採取情報決定部１４は、取得解析情報１３１の中から最も優先度が高い採取情報を選択し、下記の数２を用いて解析残時間を算出する。なお、以下の数２において、「該当採取時間」は、ステップＥ４で選択された採取情報についての採取時間を意味している（図４（ａ）及び（ｂ）参照）。また、解析残時間は、一時的な数値であり、図１１に示す処理だけで利用される値である。 In step E4, the collection information determination unit 14 selects the collection information with the highest priority from the acquired analysis information 131 and calculates the remaining analysis time using Equation 2 below. In Equation 2, the “applicable collection time” means the collection time of the collection information selected in step E4 (see FIGS. 4A and 4B). The remaining analysis time is a temporary value used only in the processing shown in FIG. 11.

（数２） (Equation 2)
解析残時間＝解析許容時間−該当採取時間
Remaining analysis time = Analysis allowable time − Applicable collection time

次に、ステップＥ４が終了すると、採取情報決定部１４は、ステップＥ４における算出が終了した採取情報を、取得解析情報１３１中の採取情報のリストから削除する（ステップＥ５）。そして、採取情報決定部１４は、ステップＥ４で算出した解析残時間が０（ゼロ）より大きいかどうかを判定する（ステップＥ６）。 Next, when step E4 ends, the collection information determination unit 14 deletes the collection information for which the calculation in step E4 has ended from the list of collection information in the acquisition analysis information 131 (step E5). Then, the collection information determination unit 14 determines whether or not the remaining analysis time calculated in Step E4 is greater than 0 (zero) (Step E6).

ステップＥ６の判定の結果、解析残時間が０（ゼロ）より大きくない場合は、ステップＥ４で選択された採取情報は解析許容時間との関係から適切でないため、採取情報決定部１４は、再度、ステップＥ３以降を実行する。 As a result of the determination in step E6, if the remaining analysis time is not greater than 0 (zero), the collection information selected in step E4 is not appropriate because of the relationship with the analysis allowable time. Step E3 and subsequent steps are executed.

一方、ステップＥ６の判定の結果、解析残時間が０（ゼロ）より大きい場合は、採取情報決定部１４は、ステップＥ４で選択された採取情報を採取情報リスト１４１に追加し、更に、解析残時間を解析許容時間（解析許容時間＝解析残時間）に設定する（ステップＥ７）。 On the other hand, if the result of the determination in step E6 is that the remaining analysis time is greater than 0 (zero), the collection information determination unit 14 adds the collection information selected in step E4 to the collection information list 141 and then sets the analysis allowable time to the remaining analysis time (analysis allowable time = remaining analysis time) (step E7).

そして、ステップＥ７の実行後、採取情報決定部１４は、再度ステップＥ２以降を実施する。ステップＥ２〜Ｅ７が実行されることにより、解析許容時間内で採取が可能な採取情報が優先順位の順に特定される。 And after execution of step E7, the collection information determination part 14 implements step E2 and subsequent steps again. By executing steps E2 to E7, collection information that can be collected within the analysis allowable time is specified in order of priority.

また、ステップＥ８において、採取情報決定部１４は、採取情報リスト１４１を確定する。確定された採取情報リスト１４１は、情報採取部１５に送信される。 In step E8, the collection information determination unit 14 determines the collection information list 141. The confirmed collection information list 141 is transmitted to the information collection unit 15.
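Steps E1 to E8 amount to a greedy selection in priority order under a time budget. The following is a minimal sketch of that loop, not the patent's implementation; `CollectionItem` and `decide_collection_list` are hypothetical names, times are assumed to be in seconds, and the sample rows are the “memory 1” values of FIG. 4B as quoted in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CollectionItem:
    name: str
    collection_time: int  # seconds
    priority: int         # 1 is the highest priority


def decide_collection_list(allowable_recovery_time: int,
                           recovery_time: int,
                           analysis_info: List[CollectionItem]) -> List[str]:
    """Greedy selection following steps E1 to E8 (a sketch)."""
    # E1 (Equation 1): time left for collection once recovery is budgeted.
    allowable = allowable_recovery_time - recovery_time
    # Work on a copy ordered by priority so that E4 can take the first item.
    pending = sorted(analysis_info, key=lambda item: item.priority)
    selected: List[str] = []
    # E2: stop when no time is left. E3: stop when nothing is pending.
    while allowable > 0 and pending:
        item = pending.pop(0)                          # E4 select, E5 delete
        remaining = allowable - item.collection_time   # Equation 2
        if remaining > 0:                              # E6
            selected.append(item.name)                 # E7: add to the list
            allowable = remaining                      # E7: update the budget
        # Otherwise the item is skipped and the budget is left unchanged.
    return selected                                    # E8: list is fixed


MEMORY_1 = [
    CollectionItem("memory dump", 120, 1),
    CollectionItem("memory performance information", 30, 2),
    CollectionItem("OS dump", 600, 3),
    CollectionItem("process activation log", 30, 4),
]

print(decide_collection_list(1000, 600, MEMORY_1))
```

Because step E6 tests for a remaining time strictly greater than zero, an item whose collection time exactly equals the remaining budget is rejected in this sketch, as in the flow described above. With an allowable recovery time of 1000 seconds, the OS dump is skipped and the other three items are selected.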

ここで、ステップＥ１〜Ｅ８について、具体例に基づいて説明する。先ず、復旧許容時間が「２０００秒」、決定復旧方法情報１２１が図３（ｂ）の例、取得解析情報１３１が図４（ｂ）の例である場合について説明する。 Here, steps E1 to E8 will be described based on specific examples. First, a case where the allowable recovery time is “2000 seconds”, the decision recovery method information 121 is an example of FIG. 3B, and the acquisition analysis information 131 is an example of FIG. 4B will be described.

上記の場合、ステップＥ１において、解析許容時間の値は、復旧許容時間「２０００」から復旧時間「６００」（図３（ｂ）参照）を減算して得られる値「１４００」となる。次に、ステップＥ２で解析許容時間の値を判定すると、「１４００＞０」であるので、ステップＥ３が実行される。 In the above case, in step E1, the value of the analysis allowable time is a value “1400” obtained by subtracting the recovery time “600” (see FIG. 3B) from the recovery allowable time “2000”. Next, when the value of the analysis allowable time is determined in step E2, since “1400> 0”, step E3 is executed.

ステップＥ３では、採取情報決定部１４は、取得解析情報１３１が「メモリダンプ」、「メモリ性能情報」、「ＯＳダンプ」、及び「プロセス起動ログ」を含むため（図４（ｂ）参照）、「空」でないと判断し、ステップＥ４を実行する。 In step E3, because the acquired analysis information 131 includes “memory dump”, “memory performance information”, “OS dump”, and “process activation log” (see FIG. 4B), the collection information determination unit 14 determines that it is not “empty” and executes step E4.

ステップＥ４では、採取情報決定部１４は、最も優先度の高い（優先度＝「１」）採取情報「メモリダンプ」を選択し、解析残時間を計算する。この場合、解析許容時間は「１４００」、「メモリダンプ」の採取時間は「１２０」であるので、解析残時間は「１２８０」となる。 In step E4, the collection information determination unit 14 selects the collection information “memory dump”, which has the highest priority (priority = “1”), and calculates the remaining analysis time. In this case, the analysis allowable time is “1400” and the collection time of “memory dump” is “120”, so the remaining analysis time is “1280”.

また、解析残時間の算出後、ステップＥ５では、採取情報決定部１４は、取得解析情報１３１の中の採取情報のリストから「メモリダンプ」の行を削除する。その後、採取情報決定部１４は、ステップＥ６において、解析残時間の値が０（ゼロ）より大きいかどうかを判定する。 After calculating the remaining analysis time, in step E5, the collection information determination unit 14 deletes the “memory dump” line from the collection information list in the acquired analysis information 131. Thereafter, in step E6, the collection information determination unit 14 determines whether or not the value of the analysis remaining time is greater than 0 (zero).

解析残時間は「１２８０」であり、０（ゼロ）より大きいので、採取情報決定部１４は、ステップＥ７において、採取情報リスト１４１（図５参照）に「メモリダンプ」を追加し、解析許容時間の値を、解析残時間の値、即ち「１２８０」に更新する。その後、採取情報決定部１４は、更新された解析許容時間を用いて、再度、ステップＥ２以降を実行する。 Since the remaining analysis time is “1280” and is greater than 0 (zero), the collection information determination unit 14 adds “memory dump” to the collection information list 141 (see FIG. 5) in step E7, and allows the analysis allowable time. Is updated to the value of the remaining analysis time, that is, “1280”. Thereafter, the collection information determination unit 14 executes Step E2 and subsequent steps again using the updated analysis allowable time.

再度のステップＥ２以降の実行により、図４（ｂ）に示された「メモリ性能情報」が採取情報リスト１４１に追加され、解析許容時間は「１２５０」に更新される。更に、「ＯＳダンプ」が採取情報リスト１４１に追加され、解析許容時間は「６５０」に更新される。続いて、「プロセス起動ログ」も採取情報リスト１４１に追加され、最終的に解析許容時間は「６２０」となる。 By executing again after step E2, the “memory performance information” shown in FIG. 4B is added to the collection information list 141, and the analysis allowable time is updated to “1250”. Furthermore, “OS dump” is added to the collection information list 141, and the analysis allowable time is updated to “650”. Subsequently, the “process activation log” is also added to the collection information list 141, and finally the analysis allowable time becomes “620”.

また、プロセス起動ログの採取情報リスト１４１の追加により、取得解析情報１３１は「空」となるので、プロセス起動ログの追加後のステップＥ３では「Ｙｅｓ」と判断される。よって、ステップＥ８が実行されて、最終的な採取情報リスト１４１が確定する。この場合は、採取情報リスト１４１は、図５の例と異なり、「メモリダンプ」、「メモリ性能情報」、「ＯＳダンプ」、及び「プロセス起動ログ」の全てを含んでいる。このように、運用ポリシーで設定されている復旧許容時間が十分にある場合は、解析に必要な全ての情報の採取が可能となる。 Once the process activation log has been added to the collection information list 141, the acquired analysis information 131 is “empty”, so the determination in step E3 after that addition is “Yes”. Step E8 is therefore executed, and the final collection information list 141 is fixed. In this case, unlike the example of FIG. 5, the collection information list 141 includes all of “memory dump”, “memory performance information”, “OS dump”, and “process activation log”. Thus, when the allowable recovery time set in the operation policy is long enough, all the information needed for analysis can be collected.

次に、復旧許容時間が「１０００秒」、決定復旧方法情報１２１が図３（ｂ）の例、取得解析情報１３１が図４（ｂ）の例である場合について説明する。この場合では、メモリダンプ及びメモリ性能情報を選択した後、更新された解析許容時間の値は「２５０」となる。よって、ステップＥ４において「ＯＳダンプ」を選択すると、「ＯＳダンプ」の採取時間が「６００」と長いため、解析残時間は「−３５０」となる。 Next, a case where the allowable recovery time is “1000 seconds”, the decision recovery method information 121 is an example of FIG. 3B, and the acquisition analysis information 131 is an example of FIG. 4B will be described. In this case, after the memory dump and the memory performance information are selected, the value of the updated analysis allowable time is “250”. Therefore, when “OS dump” is selected in step E4, the collection time of “OS dump” is as long as “600”, so the remaining analysis time is “−350”.

よって、ステップＥ６では、「−３５０＜０」となって、「Ｎｏ」と判断され、ステップＥ３に戻ることになる。このため、「ＯＳダンプ」は採取情報リスト１４１に追加されず、また、解析許容時間の値は更新されずに、「２５０」のままとなる。 Therefore, in step E6, “−350 <0” is established, “No” is determined, and the process returns to step E3. Therefore, the “OS dump” is not added to the collection information list 141, and the value of the analysis allowable time is not updated and remains “250”.

一方、優先順位がＯＳダンプの次に設定されている「プロセス起動ログ」の採取時間は「３０」と短いため、ステップＥ４における解析残時間は「２２０」となる。よって、採取情報決定部１４は、「プロセス起動ログ」を採取情報リスト１４１に追加する。本例では、最終的に、ステップＥ８において、採取情報リスト１４１は、「メモリダンプ」、「メモリ性能情報」、及び「プロセス起動ログ」のみを含むこととなる。 On the other hand, since the collection time of the “process activation log” whose priority is set next to the OS dump is as short as “30”, the analysis remaining time in step E4 is “220”. Therefore, the collection information determination unit 14 adds “process activation log” to the collection information list 141. In this example, finally, in step E8, the collection information list 141 includes only “memory dump”, “memory performance information”, and “process activation log”.
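The intermediate values in this 1000-second example follow directly from Equations 1 and 2. The worked trace below uses only the recovery time (600) and the collection times (120, 30, 600, 30) quoted in the text; the layout and the comments in the right-hand column are added for illustration.

```latex
\begin{align*}
\text{Equation 1:}\quad & 1000 - 600 = 400 && \text{analysis allowable time}\\
\text{memory dump:}\quad & 400 - 120 = 280 > 0 && \text{added, allowable time becomes } 280\\
\text{memory performance information:}\quad & 280 - 30 = 250 > 0 && \text{added, allowable time becomes } 250\\
\text{OS dump:}\quad & 250 - 600 = -350 \le 0 && \text{not added, allowable time stays } 250\\
\text{process activation log:}\quad & 250 - 30 = 220 > 0 && \text{added, allowable time becomes } 220
\end{align*}
```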

このように、運用ポリシーで設定されている復旧許容時間が足りない場合は、復旧許容時間以上にシステムが停止されないようにするため、障害復旧装置１は、優先度が高い情報から順に、可能な限り多くの情報の採取が行われる。 Thus, when the allowable recovery time set in the operation policy is not long enough to collect everything, the failure recovery apparatus 1 collects as much information as possible, starting from the highest-priority information, so that the system is not stopped for longer than the allowable recovery time.

次に、復旧許容時間が「０秒」、決定復旧方法情報１２１が図３（ｂ）の例、取得解析情報１３１が図４（ｂ）の例である場合について説明する。この場合では、ステップＥ１の実行により、解析許容時間は、復旧許容時間「０」から復旧時間「６００」を引いた値、「−６００」となる。 Next, the case where the allowable recovery time is “0 second”, the decision recovery method information 121 is an example of FIG. 3B, and the acquisition analysis information 131 is an example of FIG. 4B will be described. In this case, by the execution of step E1, the analysis allowable time becomes “−600”, which is a value obtained by subtracting the recovery time “600” from the recovery allowable time “0”.

よって、ステップＥ２では、「Ｎｏ」と判断されるので、その後、ステップＥ８が実行される。この場合、採取情報リスト１４１には採取情報は追加されておらず、採取情報リスト１４１は「空」の状態である。 Therefore, since it is determined as “No” in Step E2, Step E8 is executed thereafter. In this case, collection information is not added to the collection information list 141, and the collection information list 141 is in an “empty” state.

この結果、後処理において、情報採取部１５は何ら情報の採取を実施せず、直ちに、復旧実行部１６が復旧処理を実施することになる。このように、運用ポリシーで設定されている復旧許容時間が存在しない場合は、障害復旧装置１は、情報の採取よりも復旧を優先し、直ちに復旧を実施する。 As a result, in the post-processing, the information collection unit 15 does not collect any information, and the recovery execution unit 16 immediately performs the recovery process. As described above, when there is no allowable recovery time set in the operation policy, the failure recovery apparatus 1 prioritizes recovery over information collection and immediately performs recovery.

続いて、図１２を用いて、図６に示したステップＳ５（採取処理）について説明する。図１２は、図６に示す情報採取処理を具体的に示すフロー図である。 Next, step S5 (collection process) shown in FIG. 6 will be described with reference to FIG. 12. FIG. 12 is a flowchart specifically showing the information collection process shown in FIG. 6.

図１２に示すように、先ず、情報採取部１５は、採取情報決定部１４から採取情報リスト１４１（図５参照）を受信する（ステップＦ１）。次に、情報採取部１５は、採取情報リスト１４１に含まれる採取情報の採取を実施する（ステップＦ２）。例えば、採取情報リスト１４１が、図５に示すリストである場合は、情報採取部１５は、コンピュータシステム２０（図１参照）から、メモリダンプとメモリ性能情報とを採取する。その後、情報採取部１５は、復旧実行部１６に復旧の指示を行う（ステップＦ３）。 As shown in FIG. 12, first, the information collection unit 15 receives the collection information list 141 (see FIG. 5) from the collection information determination unit 14 (step F1). Next, the information collection unit 15 collects the collection information included in the collection information list 141 (step F2). For example, when the collection information list 141 is the list shown in FIG. 5, the information collection unit 15 collects a memory dump and memory performance information from the computer system 20 (see FIG. 1). The information collection unit 15 then instructs the recovery execution unit 16 to perform recovery (step F3).

続いて、図１３を用いて、図６に示したステップＳ６（復旧処理）について説明する。図１３は、図６に示す復旧処理を具体的に示すフロー図である。 Next, step S6 (recovery process) shown in FIG. 6 will be described with reference to FIG. 13. FIG. 13 is a flowchart specifically showing the recovery process shown in FIG. 6.

図１３に示すように、先ず、復旧実行部１６は、復旧方法決定部１２から決定復旧方法情報１２１を受信し（ステップＧ１）、更に、情報採取部１５から復旧を実行する旨の指示（復旧指示）を受け取る（ステップＧ２）。なお、ステップＧ１及びＧ２の順序は入れ替わっても良いし、両者は同時に実行されても良い。 As shown in FIG. 13, first, the recovery execution unit 16 receives the determined recovery method information 121 from the recovery method determination unit 12 (step G1), and then receives an instruction to execute recovery (recovery instruction) from the information collection unit 15 (step G2). Steps G1 and G2 may be performed in the reverse order, or both may be executed simultaneously.

その後、復旧実行部１６は、ステップＧ１で受け取った決定復旧方法情報１２１で規定されている復旧処理（図３（ａ）及び（ｂ）参照）を実行する（ステップＧ３）。このように、本実施の形態では、採取情報の採取が終了した後に、復旧処理が実施される。なお、解析許容時間内に採取時間が収まる採取情報が存在しない場合は、復旧実行部１６は、採取情報の採取を待つことなく、ステップＧ１〜Ｇ３を実行する。 Thereafter, the recovery execution unit 16 executes the recovery process (see FIGS. 3A and 3B) defined in the decision recovery method information 121 received in step G1 (step G3). As described above, in the present embodiment, the recovery process is performed after the collection of the collection information is completed. In addition, when there is no collection information within which the collection time is within the analysis allowable time, the recovery execution unit 16 executes Steps G1 to G3 without waiting for collection of the collection information.
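The ordering described here, collection first and recovery afterwards, can be sketched as a small driver function. This is an illustration under stated assumptions: `collect_then_recover` and the two callables are hypothetical, and real collection and recovery would act on the computer system 20 rather than append to a list.

```python
from typing import Callable, List


def collect_then_recover(collection_list: List[str],
                         collect: Callable[[str], None],
                         recover: Callable[[], None]) -> None:
    """Collect every listed item, then trigger the recovery process.

    With an empty list the loop body never runs, so recovery starts
    immediately, matching the case where nothing fits in the time budget.
    """
    for name in collection_list:
        collect(name)
    recover()


# Record the order of operations for the FIG. 5 style example list.
events: List[str] = []
collect_then_recover(
    ["memory dump", "memory performance information"],
    collect=lambda name: events.append("collect: " + name),
    recover=lambda: events.append("recover: server restart"),
)
print(events)
```

Passing the collection and recovery actions as callables keeps the ordering logic separate from how each action is carried out, which is a design choice of this sketch rather than something the text prescribes.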

以上のように、本実施の形態によれば、予め設定されている復旧許容時間の範囲内で、障害発生時にしか採取できない情報をできる限り多く採取することができる。また、採取される情報は、解析に有用かどうか等の点から予め設定された優先度に基づいて選択されており、解析に有効な情報である。また、本実施の形態によれば、システム停止時間を許容以上に延ばすこともなく、障害を自動的に復旧することもできる。 As described above, according to the present embodiment, it is possible to collect as much information as can be collected only when a failure occurs within a preset recovery allowable time range. The collected information is selected based on a preset priority from the viewpoint of whether it is useful for analysis or the like, and is information effective for analysis. Further, according to the present embodiment, it is possible to automatically recover from a failure without extending the system stop time more than allowable.

また、本発明の実施の形態におけるプログラムは、コンピュータに、図６に示すステップＳ１〜Ｓ６を実行させるプログラムであれば良い。このプログラムをコンピュータにインストールし、実行することによって、本実施の形態における障害復旧装置１と障害復旧方法とを実現することができる。この場合、コンピュータのＣＰＵ（Central Processing Unit）は、障害検出部１１、復旧方法決定部１２、解析情報取得部１３、採取情報決定部１４、情報採取部１５及び復旧実行部１６として機能し、処理を行なう。 The program in the embodiment of the present invention may be any program that causes a computer to execute steps S1 to S6 shown in FIG. 6. By installing and executing this program on a computer, the failure recovery apparatus 1 and the failure recovery method in the present embodiment can be realized. In this case, the CPU (Central Processing Unit) of the computer functions as the failure detection unit 11, the recovery method determination unit 12, the analysis information acquisition unit 13, the collection information determination unit 14, the information collection unit 15, and the recovery execution unit 16, and performs the processing.

また、本実施の形態では、障害情報データベース１１１、復旧方法データベース１２２、解析情報データベース１３２、及び運用ポリシーデータベース１４２は、コンピュータに備えられたハードディスク等の記憶装置に、これらを構成するデータファイルを格納することによって実現できる。なお、記憶装置は、ネットワークを介してコンピュータに接続されていても良い。 In the present embodiment, the failure information database 111, the recovery method database 122, the analysis information database 132, and the operation policy database 142 store data files constituting them in a storage device such as a hard disk provided in the computer. It can be realized by doing. Note that the storage device may be connected to a computer via a network.

なお、本実施の形態では、このプログラムがインストールされるコンピュータは、対象となるコンピュータシステム２０（図１参照）を構成するコンピュータであっても良い。この場合は、障害復旧装置１は、コンピュータシステム２０の内部に構築されている。 In the present embodiment, the computer on which this program is installed may be a computer constituting the target computer system 20 (see FIG. 1). In this case, the failure recovery apparatus 1 is built inside the computer system 20.

以上のように、本発明によれば、対象システムからの、障害発生時にしか採取できない情報の採取を実行しつつ、対象システムの復旧にかかる時間の長期化を抑制することができる。よって、本発明は、コンピュータシステムにおける障害の復旧に有用である。 As described above, according to the present invention, it is possible to suppress the lengthening of the time required for recovery of the target system while collecting information from the target system that can be collected only when a failure occurs. Therefore, the present invention is useful for recovery from a failure in a computer system.

１ 障害復旧装置
１１ 障害検出部
１２ 復旧方法決定部
１３ 解析情報取得部
１４ 採取情報決定部
１５ 情報採取部
１６ 復旧実行部
２０ コンピュータシステム
１１１ 障害情報データベース
１２１ 決定復旧方法情報
１２２ 復旧方法データベース
１３１ 取得解析情報
１３２ 解析情報データベース
１４１ 採取情報リスト
１４２ 運用ポリシーデータベース
DESCRIPTION OF SYMBOLS
1 Failure recovery apparatus
11 Failure detection unit
12 Recovery method determination unit
13 Analysis information acquisition unit
14 Collection information determination unit
15 Information collection unit
16 Recovery execution unit
20 Computer system
111 Failure information database
121 Determined recovery method information
122 Recovery method database
131 Acquired analysis information
132 Analysis information database
141 Collection information list
142 Operation policy database

Claims

A failure detection unit for detecting a failure that occurs in the target computer system;
A recovery method determination unit for determining a recovery method corresponding to the detected failure;
An analysis information acquisition unit that acquires analysis information specifying one or more pieces of collection information whose collection is required for analysis of the failure, the collection time required to collect each piece of collection information, and the priority given to each piece of collection information;
A collection information determination unit that specifies the time usable for collecting the collection information from a preset time usable for the recovery process and the time required to execute the determined recovery method, and that determines, in order of priority and based on the collection times included in the analysis information, the collection information that can be collected within the specified usable time;
An information collection unit that performs collection of the collection information determined by the collection information determination unit;
A failure recovery apparatus comprising: a recovery execution unit that recovers the computer system in accordance with the recovery method determined by the recovery method determination unit after execution of collection by the information collection unit.

The recovery method is preset for each possible failure, together with the time required to implement the recovery method,
The failure recovery apparatus according to claim 1, wherein the recovery method determination unit determines a recovery method corresponding to the detected failure from among the recovery methods set in advance.

The failure recovery apparatus according to claim 1 or 2, wherein, when there is no collection information that can be collected within the specified usable time, the collection information determination unit notifies the information collection unit that collection of the collection information is not to be executed and causes the recovery execution unit to recover the computer system.

(A) detecting a failure occurring in the target computer system; and
(B) determining a recovery method corresponding to the failure detected in the step (a); and
(C) acquiring analysis information that specifies one or more pieces of collection information whose collection is required for analysis of the failure, the collection time required to collect each piece of collection information, and the priority given to each piece of collection information; and
(D) specifying the time usable for collecting the collection information from a preset time usable for the recovery process and the time required to execute the determined recovery method, and determining, in order of priority and based on the collection times included in the analysis information, the collection information that can be collected within the specified usable time; and
(E) performing the collection of the collection information determined in the step (d); and
(F) after executing the collection in the step (e), recovering the computer system according to the recovery method determined in the step (b). A failure recovery method comprising the above steps (a) to (f).

The recovery method is preset for each possible failure, together with the time required to implement the recovery method,
The failure recovery method according to claim 4, wherein in the step (b), a recovery method corresponding to the detected failure is determined from the recovery methods set in advance.

The failure recovery method according to claim 4 or 5, wherein, when there is no collection information that can be collected within the specified usable time in the step (d), the step (f) is executed without executing the step (e).

A program for recovering a failed computer system by a computer,
In the computer,
(A) detecting a failure occurring in the computer system; and
(B) determining a recovery method corresponding to the failure detected in the step (a); and
(C) acquiring analysis information that specifies one or more pieces of collection information whose collection is required for analysis of the failure, the collection time required to collect each piece of collection information, and the priority given to each piece of collection information; and
(D) specifying the time usable for collecting the collection information from a preset time usable for the recovery process and the time required to execute the determined recovery method, and determining, in order of priority and based on the collection times included in the analysis information, the collection information that can be collected within the specified usable time; and
(E) performing the collection of the collection information determined in the step (d); and
(F) after executing the collection in the step (e), recovering the computer system according to the recovery method determined in the step (b). The program causes the computer to execute the above steps (a) to (f).

The recovery method is preset for each possible failure, together with the time required to implement the recovery method,
The program according to claim 7, wherein in the step (b), a recovery method corresponding to the detected failure is determined from the recovery methods set in advance.

The program according to claim 7 or 8, wherein, when there is no collection information that can be collected within the specified usable time in the step (d), the step (f) is executed without executing the step (e).