JP2012079212A

JP2012079212A - Information processor and failure recovery method

Info

Publication number: JP2012079212A
Application number: JP2010225653A
Authority: JP
Inventors: Daisuke Fujii; 大介藤井
Original assignee: Hitachi Systems Ltd
Current assignee: Hitachi Systems Ltd
Priority date: 2010-10-05
Filing date: 2010-10-05
Publication date: 2012-04-19
Anticipated expiration: 2030-10-05
Also published as: JP5588295B2

Abstract

PROBLEM TO BE SOLVED: To provide an information processor and a failure recovery method capable of automatically performing failure recovery according to the situation in which the failure occurs. SOLUTION: An information processor that makes a monitored information processor recover from a failure occurring in it includes: a storage unit that stores, as a history, error information indicating failures that occurred in the monitored information processor in the past, in association with failure recovery procedures; an operation check unit that obtains operation information from the monitored information processor and performs an operation check to determine whether the obtained operation information satisfies a predetermined condition; and a recovery unit that, when the operation check unit determines that the operation information does not satisfy the predetermined condition, specifies, on the basis of the operation information and the error information stored in the storage unit, the recovery procedure corresponding to the operation information from among the recovery procedures stored in the storage unit, and makes the monitored information processor recover from the failure according to the specified recovery procedure.

Description

The present invention relates to an information processing apparatus that monitors the operating status of hardware, an OS (Operating System), applications, systems, and the like and automatically performs recovery when a failure occurs, and to a failure recovery method.

Conventionally, when a failure occurs in hardware, an OS, an application, a system, or the like, a monitoring device such as a terminal that monitors for failures checks the failure status of the monitored device at fixed time intervals and executes a command to recover from the failure. Alternatively, when the monitoring device detects an abnormality, it refers to the error code in the log output by the monitored device and recovers from the failure by applying the recovery method corresponding to that error code.

For example, when failure monitoring is started, a failure monitoring tool such as scheduler software installed on the monitoring device checks the operating status of the monitored device at predetermined intervals, monitors the log files output by the monitored device, and receives messages issued by the monitored device (for example, Patent Document 1).

Japanese Unexamined Patent Application Publication No. 2003-150407 (JP 2003-150407 A)

However, it is difficult to vary, according to the operating status of the server, the decision on whether to check the monitored device as described above or whether to accept messages. In addition, failure recovery generally executes a recovery procedure defined in advance for each anticipated error, and in that case it is difficult to respond flexibly, unlike when a person intervenes.

For example, with the technique disclosed in Patent Document 1, even if failure monitoring can be carried out with scheduler software or the like during normal operation of a system, when a failure or abnormality occurs and the situation changes within a short time, it is difficult to recover from the failure with a fixed procedure, and the failure has to be recovered manually. In other words, the failure cannot be recovered automatically according to the situation in which it occurs.

The present invention has been made in view of the above, and an object thereof is to provide an information processing apparatus and a failure recovery method capable of automatically recovering from a failure according to the situation in which it occurs.

To solve the problems described above and achieve the object, an information processing apparatus according to the present invention is an information processing apparatus that recovers a monitored information processing apparatus from a failure that has occurred in it, and includes: a storage unit that stores, as a history, error information indicating past failures that occurred in the monitored information processing apparatus in association with recovery procedures for those failures; an operation check unit that acquires operation information of the monitored information processing apparatus from that apparatus and performs an operation check to determine whether the acquired operation information satisfies a predetermined condition; and a recovery unit that, when the operation check unit determines that the operation information does not satisfy the predetermined condition, specifies, on the basis of the operation information and the error information stored in the storage unit, the recovery procedure corresponding to the operation information from among the recovery procedures stored in the storage unit, and recovers the monitored information processing apparatus from the failure according to the specified recovery procedure.

The present invention is also a failure recovery method performed by the information processing apparatus described above.

According to the present invention, it is possible to provide an information processing apparatus and a failure recovery method capable of automatically recovering from a failure according to the situation in which it occurs.

FIG. 1 is a block diagram showing the configuration of a failure monitoring / automatic recovery system according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the commands and processing results stored in the execution control DB.
FIG. 3 is a diagram showing an example of the execution conditions and determination conditions stored in the result determination table.
FIG. 4 is a diagram showing an example of the specific contents of commands and the determination methods stored in the command table.
FIG. 5 is a diagram showing an example of the specific contents of the keywords stored in the keyword table.
FIG. 6 is a diagram showing an example of the setting information stored in the server information table.
FIG. 7 is a diagram showing an example of the information on notification destinations stored in the notification destination table.
FIG. 8 is a diagram showing an example of the information on command execution timing stored in the schedule table.
FIG. 9 is a diagram showing an example of the performance information stored in the performance data DB.
FIG. 10 is a diagram showing an example of the extraction conditions stored in the result extraction table.
FIG. 11 is a diagram showing an example in which the error management DB stores information on recovery.
FIG. 12 is a diagram showing an example of the information on recovery results stored in the failure management table.
FIG. 13 is a flowchart showing the procedure of the operation check process performed by the operation check processing unit.
FIG. 14 is a flowchart showing the procedure of the performance data collection process performed by the performance data collection unit.
FIG. 15 is a flowchart showing the procedure of the recovery process performed by the recovery unit.

Embodiments of an information processing apparatus and a failure recovery method according to the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the configuration of a failure monitoring / automatic recovery system 1000 according to an embodiment of the present invention. As shown in FIG. 1, the failure monitoring / automatic recovery system 1000 includes a Windows (registered trademark) server 101, a Unix (registered trademark) server 102, a DB (DataBase) server 103, an application server 104, a common monitoring terminal 251, and a network N. The network N is a general communication network such as a LAN (Local Area Network) or a WAN (Wide Area Network). In the following, the failure monitoring / automatic recovery system 1000 monitors the various servers described above, but the monitored devices are not limited to devices equipped with these OSs or databases.

The Windows server 101 and the Unix server 102 are servers monitored by the failure monitoring / automatic recovery system 1000 (hereinafter simply called monitored servers) and, for example, execute the various applications registered in the application server 104. The DB server 103 stores, for example, data generated by external applications. The application server 104 stores, for example, the various applications executed on the Windows server 101 and the Unix server 102.

The common monitoring terminal 251 is a terminal such as a PC (Personal Computer) that monitors each of the monitored servers described above and, when a failure occurs in a monitored server, automatically recovers it from that failure. As shown in FIG. 1, the common monitoring terminal 251 includes a control unit 201 and a storage unit 350. The control unit 201 consists of an arithmetic device such as a CPU (Central Processing Unit) and includes an operation check unit 301, a performance data collection unit 302, a recovery unit 303, and a communication unit 304. The control unit 201 is assumed to have a timer (not shown) that keeps the current date and time. The communication unit 304 is a communication device such as a NIC (Network Interface Card).

The storage unit 350 consists of a storage medium such as an HDD (Hard Disk Drive) and stores an execution control DB 351, a result determination table 352, a command table 353, a keyword table 354, a server information table 355, a notification destination table 356, a schedule table 357, a performance data DB 358, a result extraction table 359, an error management DB 360, and a failure management table 361.

The execution control DB 351 stores the commands to be executed on the monitored servers shown in FIG. 1 and the results of the operation check process, performance data collection process, and recovery process (all described later) carried out by those commands.

FIG. 2 is a diagram showing an example of the commands and processing results stored in the execution control DB 351. As shown in FIG. 2, the execution control DB 351 stores the following items in association with one another: a target server that uniquely identifies the monitored server; a command to be executed on the monitored server; a server type indicating the OS (Operating System) type of the monitored server; an execution date and time indicating when the command was executed; an acquired value indicating the result obtained by the executed command; a status indicating the kind of state the monitored server was in when the command was executed; a time variation flag indicating whether the command execution interval is changed according to the state of the monitored server; and a next execution interval indicating the interval between one execution of the command and the next.

In FIG. 2, for example, the command "UBDF01" was executed on the target server "development server 1", whose server type is "OS1", at the execution date and time "2009/4/10 22:00:00" while the server was in status "A" (described later; the state in which the monitored server is operating stably), and the acquired value "80" (%) was obtained at that time. The interval until the next execution is "10 minutes later", and the execution interval is not changed according to the situation (No). The command "USYL01" is also set for the same target server "development server 1", and each of the items described above is set for it in the same way.

In the example shown in FIG. 2, the execution control DB 351 holds slots (1) to (15) for the execution date and time, acquired value, and status so that the result and value can be stored each time the command is executed, but the number of slots is not limited to this. The items other than the execution date and time, acquired value, and status are set in advance by, for example, an administrator of the failure monitoring / automatic recovery system 1000. Next, returning to FIG. 1, the result determination table 352 is described.
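As a rough illustration of the record layout described for FIG. 2, the sketch below models one row of the execution control DB with a bounded history of (execution date and time, acquired value, status) entries. The class and field names are hypothetical, and dropping the oldest entry once the 15 slots are full is an assumption; the text does not say what happens when the slots run out.

```python
from collections import deque
from dataclasses import dataclass, field

# FIG. 2 keeps slots (1) to (15); the text notes this count is not fixed.
HISTORY_SLOTS = 15


@dataclass
class ExecutionRecord:
    """One row of the execution control DB (names are illustrative only)."""
    target_server: str
    command_id: str
    server_type: str
    time_variation: bool      # whether the interval may change with server state
    next_interval_min: int    # minutes until the next execution
    # Assumption: when all slots are used, the oldest entry is discarded.
    history: deque = field(default_factory=lambda: deque(maxlen=HISTORY_SLOTS))

    def store_result(self, executed_at: str, value: float, status: str) -> None:
        """Append one (execution date/time, acquired value, status) entry."""
        self.history.append((executed_at, value, status))


# The example values given for FIG. 2.
record = ExecutionRecord("development server 1", "UBDF01", "OS1", False, 10)
record.store_result("2009/4/10 22:00:00", 80, "A")
```

A bounded `deque` keeps the per-command history at a fixed size without any explicit trimming code, which matches the fixed number of slots in the figure.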

The result determination table 352 stores the execution conditions for the commands executed on the monitored servers and the determination conditions for judging the state of a monitored server.

FIG. 3 is a diagram showing an example of the execution conditions and determination conditions stored in the result determination table 352. As shown in FIG. 3, the result determination table 352 stores the following items in association with one another: an execution command that uniquely specifies the command to be executed; a target server type indicating the OS type of the monitored server on which the command is executed; status A, indicating the command execution interval when the monitored server is in a normal state (for example, a state with no OS errors) and operating stably; status B, indicating the command execution interval when the monitored server is normal; status C, indicating the command execution interval when the monitored server is not normal (alert state); an initial value, indicating the command execution interval when the monitored server is in its initial state; threshold (A), used in the operation check process and performance data collection process described later to determine that the monitored server is in an operating state that should be judged stable (status A); threshold (B), used to determine that the monitored server is in an operating state that should be judged to need attention (status B); threshold (C), used to determine that the monitored server is in an operating state that should be judged an error (status C); and a status review count, indicating the number of command executions after which the command execution intervals for statuses A to C are reviewed.

In FIG. 3, for example, the execution command "UBDF01" executed on monitored servers of the target server type "OS1" is executed "every 12 hours" when the monitored server is in a normal state and operating stably, "every 6 hours" when the monitored server is normal, and "every 2 hours" when the monitored server is not normal (alert state). The interval at which the execution command "UBDF01" is initially executed is "every 3 hours". Furthermore, if the value acquired in the operation check process, performance data collection process, or recovery process described later is at or above threshold (A) "80" (%), the monitored server is in a normal state and operating stably (status A); if the acquired value is at or above threshold (B) "90" (%), the monitored server needs attention (status B); and if the acquired value is at or above "98" (%), the monitored server is in an error state (status C). The command execution intervals given for statuses A to C are reviewed once the command has been executed "10 times". Each item of the result determination table 352 is set in advance by, for example, an administrator of the failure monitoring / automatic recovery system 1000. Next, returning to FIG. 1, the command table 353 is described.
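The threshold-based judgment described for FIG. 3 can be sketched as follows, using the example values given for "UBDF01" (thresholds 80 / 90 / 98 percent, intervals of 12 / 6 / 2 hours, initial interval 3 hours). The names `JudgmentRule` and `judge` are hypothetical. The text does not say how a value below threshold (A) is treated, so returning no status together with the initial interval is purely an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class JudgmentRule:
    """One row of the result determination table (illustrative names)."""
    threshold_a: float
    threshold_b: float
    threshold_c: float
    interval_a_h: int
    interval_b_h: int
    interval_c_h: int
    initial_interval_h: int


# Example values described for the command "UBDF01" in FIG. 3.
UBDF01_RULE = JudgmentRule(80, 90, 98, 12, 6, 2, 3)


def judge(value: float, rule: JudgmentRule) -> Tuple[Optional[str], int]:
    """Return (status, execution interval in hours) for an acquired value.

    The highest threshold that the value reaches decides the status.
    """
    if value >= rule.threshold_c:
        return "C", rule.interval_c_h
    if value >= rule.threshold_b:
        return "B", rule.interval_b_h
    if value >= rule.threshold_a:
        return "A", rule.interval_a_h
    # Assumption: below threshold (A) no status applies and the
    # initial interval is used; the source leaves this case open.
    return None, rule.initial_interval_h
```

Checking the thresholds from the most severe downward means a value of 98 or more is reported as status C rather than A, even though it also exceeds thresholds (A) and (B).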

The command table 353 stores the specific contents of the commands to be executed and the determination method used to judge the state of the monitored server when a command is executed.

FIG. 4 is a diagram showing an example of the specific contents of commands and the determination methods stored in the command table 353. As shown in FIG. 4, the command table 353 stores the following items in association with one another: a command ID that uniquely identifies the command to be executed; a determination type indicating the method used to judge the state of the monitored server when the command is executed; the target server type described above; the command that is actually executed; and option information (arguments and the like) used when executing that command.

In FIG. 4, for example, for the execution command "UBDF01" executed on monitored servers of the target server type "OS1", the state of the monitored server is judged by "threshold", the command actually executed is "dbf", and no arguments are specified. For the execution command "CDBC01", which is executed on all monitored servers in common, the state is judged by checking the log output when the command is executed ("log check"), the command actually executed is "brconnect", and the arguments "-u / -c -f check" are specified. Like the result determination table 352 and the other tables, each item of the command table 353 is set in advance by an administrator of the failure monitoring / automatic recovery system 1000 or the like. Next, returning to FIG. 1, the keyword table 354 is described.
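The lookup from a command ID to the command line that is actually run, as described for FIG. 4, might look like the sketch below. `CommandEntry`, `COMMAND_TABLE`, and `build_argv` are hypothetical names. The spacing of the "CDBC01" arguments ("-u / -c -f check") is a reading of the original full-width text and should be treated as an assumption, as is the label "common" for commands shared by all server types.

```python
import shlex
from typing import Dict, List, NamedTuple


class CommandEntry(NamedTuple):
    """One row of the command table (illustrative names)."""
    judgment_type: str   # "threshold" or "log check"
    server_type: str     # target server type, e.g. "OS1"
    command: str         # command actually executed
    options: str         # option string; empty when no arguments are given


# The two example rows described for FIG. 4.
COMMAND_TABLE: Dict[str, CommandEntry] = {
    "UBDF01": CommandEntry("threshold", "OS1", "dbf", ""),
    "CDBC01": CommandEntry("log check", "common", "brconnect", "-u / -c -f check"),
}


def build_argv(command_id: str,
               table: Dict[str, CommandEntry] = COMMAND_TABLE) -> List[str]:
    """Build the argument vector for the command registered under command_id."""
    entry = table[command_id]
    # shlex.split returns an empty list for an empty option string.
    return [entry.command, *shlex.split(entry.options)]
```

Keeping the command and its options as data rather than code means an administrator can add or change monitored commands by editing table rows only, which is how the text describes the table being maintained.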

The keyword table 354 stores, as alarms, various pieces of information (keywords) such as numerical values, character strings, and codes contained in the log output when a command is executed.

FIG. 5 is a diagram showing an example of the specific contents of the keywords stored in the keyword table 354. As shown in FIG. 5, the keyword table 354 stores the following items in association with one another: the execution command described above; error keywords, which are the keywords in the log that should be treated as errors; and ignore keywords, which are the keywords in the log that warrant only caution and do not amount to an error.

In FIG. 5, for example, when the execution command "CDBC01" is executed and the output log contains the error keyword "ORA-01555", "ORA-00060", or "ORA-01631", the result is an error, whereas when the output log contains the ignore keyword "ORA-23701", "ORA-03106", or "ORA-01428", the result is not an error and processing continues as it is. Like the result determination table 352 and the other tables, each item of the keyword table 354 is set in advance by an administrator of the failure monitoring / automatic recovery system 1000 or the like. Next, returning to FIG. 1, the server information table 355 is described.
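The log check described for FIG. 5 can be sketched as a simple substring scan over the command's log output. The names `KeywordEntry`, `KEYWORD_TABLE`, and `check_log` are hypothetical, as are the three result labels. The text does not say what happens when a log contains both an error keyword and an ignore keyword, so giving the error keyword precedence is an assumption.

```python
from typing import Dict, NamedTuple, Tuple


class KeywordEntry(NamedTuple):
    """One row of the keyword table (illustrative names)."""
    error_keywords: Tuple[str, ...]
    ignore_keywords: Tuple[str, ...]


# The example row described for the command "CDBC01" in FIG. 5.
KEYWORD_TABLE: Dict[str, KeywordEntry] = {
    "CDBC01": KeywordEntry(
        error_keywords=("ORA-01555", "ORA-00060", "ORA-01631"),
        ignore_keywords=("ORA-23701", "ORA-03106", "ORA-01428"),
    ),
}


def check_log(command_id: str, log_text: str,
              table: Dict[str, KeywordEntry] = KEYWORD_TABLE) -> str:
    """Classify a command's log output as "error", "ignored", or "clean"."""
    entry = table[command_id]
    # Assumption: an error keyword wins if both kinds appear in the same log.
    if any(keyword in log_text for keyword in entry.error_keywords):
        return "error"
    if any(keyword in log_text for keyword in entry.ignore_keywords):
        return "ignored"   # noted, but processing continues
    return "clean"
```

For example, `check_log("CDBC01", "... ORA-03106 ...")` yields `"ignored"`, so the caller would carry on without starting recovery.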

The server information table 355 stores various setting information about the monitored servers.

FIG. 6 is a diagram showing an example of the setting information stored in the server information table 355. As shown in FIG. 6, the server information table 355 stores the following items in association with one another: the host name of the monitored server; the IP address of the monitored server; the OS type of the monitored server; a usage type indicating the purpose of the monitored server; the instance name used when the usage type of the monitored server is "DB"; the account (administrative user account) and password (administrative user password) of an administrator who can access the monitored server; and the account (DB user account) and password (DB user password) of a user who can access the DB.

In FIG. 6, for example, the monitored server with the host name "development server 1" and the OS type "OS1" has the IP address "AAA.AAA.AAA.AAA" and is configured as a DB server with the instance name "BSC". The administrative user account and administrative user password of this server are "root" and "AAAAAAA", respectively, and the DB user account and DB user password are "bscadm" and "AAAAAAA", respectively. Like the result determination table 352 and the other tables, each item of the server information table 355 is set in advance by an administrator of the failure monitoring / automatic recovery system 1000 or the like. Next, returning to FIG. 1, the notification destination table 356 is described.

The notification destination table 356 stores information about the destinations to be notified when a monitored server experiences an error or failure.

FIG. 7 is a diagram showing an example of the information on notification destinations stored in the notification destination table 356. As shown in FIG. 7, the notification destination table 356 stores the following items in association with one another: the target server whose error or failure is to be reported to the notification destination; an error code; a contact means indicating how the notification is made; a contact address to which the error or failure is reported; and a contact registrant name indicating the name of the person or party at that address.

In FIG. 7, for example, when an error with the error code "ORA-1555" occurs on the target server "development server 1", the registrant "AAAA" is notified by telephone at "029-XXX-XXXX". FIG. 7 also stores backup contacts (up to contact 5) in case the registered contact cannot be reached, for example because the person is absent. Like the result determination table 352 and the other tables, the items shown here are set in advance by an administrator of the failure monitoring / automatic recovery system 1000 or the like. Next, returning to FIG. 1, the schedule table 357 is described.
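The use of backup contacts described for FIG. 7 can be sketched as trying each registered contact in order until one is reached. Everything here is illustrative: `Contact`, `NOTIFY_TABLE`, and `notify` are hypothetical names, and the `send` callable stands in for whatever telephone or mail mechanism is actually used, which the text does not describe.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple

MAX_CONTACTS = 5  # FIG. 7 stores up to contact 5


class Contact(NamedTuple):
    """One contact slot of the notification destination table."""
    means: str     # how to notify, e.g. "telephone"
    address: str   # where to notify
    name: str      # registrant name


# The example row described for FIG. 7, keyed by (target server, error code).
NOTIFY_TABLE: Dict[Tuple[str, str], List[Contact]] = {
    ("development server 1", "ORA-1555"): [
        Contact("telephone", "029-XXX-XXXX", "AAAA"),
    ],
}


def notify(server: str, error_code: str,
           send: Callable[[Contact], bool],
           table: Dict[Tuple[str, str], List[Contact]] = NOTIFY_TABLE
           ) -> Optional[Contact]:
    """Try each registered contact in order; return the first one reached.

    `send` is a caller-supplied function returning True when the contact
    was reached. Returns None if nobody could be reached or no row exists.
    """
    for contact in table.get((server, error_code), [])[:MAX_CONTACTS]:
        if send(contact):
            return contact
    return None
```

Passing the delivery mechanism in as a function keeps the lookup logic independent of how a telephone call or message is actually placed.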

The schedule table 357 stores information about when commands are to be executed.

FIG. 8 is a diagram showing an example of the information on command execution timing stored in the schedule table 357. As shown in FIG. 8, the schedule table 357 stores the following items in association with one another: the execution date of the command; the execution time of the command; a target server indicating the monitored server on which the command is executed; and the execution command.

In FIG. 8, for example, the execution command "UBDF01" is executed at "00:00" on "2009/01/01" with "development server 1" as the target server. Like the result determination table 352 and the other tables, the items shown here are set in advance by an administrator of the failure monitoring / automatic recovery system 1000 or the like. Next, returning to FIG. 1, the performance data DB 358 is described.
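One plausible way to use the schedule table of FIG. 8 together with the timer in the control unit is to select the entries whose execution date and time have been reached. The sketch below is only an illustration: `ScheduleEntry`, `SCHEDULE`, and `due_entries` are hypothetical names, and treating an entry as due when its timestamp is at or before the current time is an assumption.

```python
from datetime import datetime
from typing import List, NamedTuple


class ScheduleEntry(NamedTuple):
    """One row of the schedule table (illustrative names)."""
    run_at: datetime      # execution date and execution time combined
    target_server: str
    command_id: str


# The example row described for FIG. 8.
SCHEDULE: List[ScheduleEntry] = [
    ScheduleEntry(datetime(2009, 1, 1, 0, 0), "development server 1", "UBDF01"),
]


def due_entries(now: datetime,
                schedule: List[ScheduleEntry] = SCHEDULE) -> List[ScheduleEntry]:
    """Return the entries whose execution date and time have been reached.

    Assumption: an entry is due when run_at is at or before `now`.
    """
    return [entry for entry in schedule if entry.run_at <= now]
```

A caller would pass the timer's current reading as `now` and execute the returned commands on their target servers.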

The performance data DB 358 stores information about the operating status (performance) of the monitored servers.

FIG. 9 is a diagram showing an example of the performance information stored in the performance data DB 358. As shown in FIG. 9, the performance data DB 358 stores the following items in association with one another: an identification ID that uniquely specifies the kind of performance check performed on the monitored server (that is, what is checked); the content of the check identified by the identification ID; the target server and execution command described above; and the execution interval of the command and its execution results. In the example shown in FIG. 9, execution intervals and execution results 1 to 10 are registered, but the number is not limited to this.

図９では、例えば、識別ＩＤ「ＨＷ００１」で識別される実行コマンド「ＵＢＤＦ０１」が、サーバ「開発サーバ１」を対象として、メモリ使用率を１分ごとに実行されることを示している。そして、その実行結果として、３５（％）あるいは５０（％）等の値が格納されていることを示している。なお、実行結果については、後述するパフォーマンスデータ収集処理において設定され、これら以外の各項目は、結果判定テーブル３５２等と同様、障害モニタリング・自動復旧システム１０００の管理者等によってあらかじめ設定されるものとする。このように、実行結果を定期的に繰り返し登録しておくことによって、例えば、管理者はメモリ使用率の推移を容易に把握することができる。続いて、図１に戻り、結果抽出テーブル３５９について説明する。 In FIG. 9, for example, the execution command “UBDF01” identified by the identification ID “HW001” is executed every minute against the server “development server 1” to check the memory usage rate, and values such as 35 (%) or 50 (%) are stored as its execution results. The execution results are set in the performance data collection process described later; the other items are set in advance by an administrator or the like of the failure monitoring / automatic recovery system 1000, as with the result determination table 352 and the like. Registering the execution results periodically and repeatedly in this way lets the administrator, for example, easily grasp the transition of the memory usage rate. Next, returning to FIG. 1, the result extraction table 359 will be described.
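The repeated registration of execution results described for FIG. 9 can be pictured with a minimal Python sketch. The dictionary layout and the function name `register_result` are assumptions made for illustration; the source only states that results are stored per target server and execution command.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the performance data DB 358:
# (target server, execution command) -> list of execution results.
PERFORMANCE_DB = defaultdict(list)


def register_result(server: str, command: str, value: int) -> None:
    """Append one execution result so the series can be reviewed later."""
    PERFORMANCE_DB[(server, command)].append(value)


# Example values taken from the description of FIG. 9 (memory usage in %).
register_result("development server 1", "UBDF01", 35)
register_result("development server 1", "UBDF01", 50)

trend = PERFORMANCE_DB[("development server 1", "UBDF01")]
print(trend)
```

Keeping the results in insertion order is what allows an administrator to read off the transition of the memory usage rate.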

結果抽出テーブル３５９は、後述する動作チェック処理、パフォーマンスデータ収集処理が実行され、その結果として出力されたログ等に含まれる種々の情報のうち、障害モニタリング・自動復旧システム１０００が必要とする情報のみを抽出するため抽出条件を記憶するものである。 The result extraction table 359 stores extraction conditions for extracting only the information required by the failure monitoring / automatic recovery system 1000 from the various information contained in the logs and the like that are output when the operation check process and the performance data collection process described later are executed.

図１０は、結果抽出テーブル３５９が記憶する抽出条件の例を示す図である。図１０に示すように、結果抽出テーブル３５９は、実行コマンドと、ログ等に含まれる種々の情報の中から必要な情報のみ解析して抽出するための抽出キーワードと、抽出キーワードによって抽出された情報を集計する場合の集計条件とが対応付けて記憶されている。 FIG. 10 is a diagram illustrating an example of the extraction conditions stored in the result extraction table 359. As shown in FIG. 10, the result extraction table 359 stores, in association with each other, an execution command, extraction keywords for analyzing and extracting only the necessary information from the various information included in a log or the like, and an aggregation condition used when the information extracted by the extraction keywords is aggregated.

図１０では、例えば、実行コマンド「ＵＢＤＦ０１」が実行された際に出力されるログ等の情報の中から、「ｃｐｕ」および「ｕｓ」の２つの文字列を含む文字列（あるいは、その文字列を含む段落等のかたまり）が抽出されることを示している。なお、図１０に示した例では、抽出キーワードとして５つのキーワードが設定されている。これらの各項目は、結果判定テーブル３５２等と同様、障害モニタリング・自動復旧システム１０００の管理者等によってあらかじめ設定されるものとする。続いて、図１に戻り、エラー管理ＤＢ３６０について説明する。 In FIG. 10, for example, a character string containing the two character strings “cpu” and “us” (or a block, such as a paragraph, containing that character string) is extracted from information such as the log output when the execution command “UBDF01” is executed. In the example shown in FIG. 10, five keywords are set as extraction keywords. As with the result determination table 352 and the like, each of these items is set in advance by an administrator or the like of the failure monitoring / automatic recovery system 1000. Next, returning to FIG. 1, the error management DB 360 will be described.
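The keyword-based extraction described for FIG. 10 can be sketched as follows. This is only an illustration under stated assumptions: the table is modeled as a hypothetical dictionary, extraction works line by line, and a line is kept only if it contains every keyword. The sample command output is invented.

```python
# Hypothetical stand-in for the result extraction table 359:
# execution command -> extraction keywords (example from FIG. 10).
RESULT_EXTRACTION = {"UBDF01": ["cpu", "us"]}


def extract_lines(output: str, keywords: list[str]) -> list[str]:
    """Keep only the lines that contain every extraction keyword."""
    return [
        line
        for line in output.splitlines()
        if all(keyword in line for keyword in keywords)
    ]


# Invented sample output; only the first line contains both keywords.
sample_output = "\n".join(
    [
        "cpu 12.5 us 3.1 sy",
        "mem 2048 total",
        "cpu idle 84.4",
    ]
)

extracted = extract_lines(sample_output, RESULT_EXTRACTION["UBDF01"])
print(extracted)
```

Plain substring matching is the simplest reading of the text; a short keyword such as `us` would also match words like `status`, so a real table would likely need token-level or pattern-based matching.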

エラー管理ＤＢ３６０は、後述するリカバリ処理において、エラーに応じたリカバリを行うためのリカバリに関する情報を記憶するものである。このエラー管理ＤＢ３６０は、自動的に登録されるものではなく、過去に同様のリカバリ手順が存在しない場合に備えて管理者等によってあらかじめ登録されているものである。 The error management DB 360 stores information related to recovery for performing recovery according to an error in a recovery process described later. The error management DB 360 is not automatically registered, but is registered in advance by an administrator or the like in case a similar recovery procedure does not exist in the past.

図１１は、エラー管理ＤＢ３６０がリカバリに関する情報を記憶する例を示す図である。図１１に示すように、エラー管理ＤＢ３６０は、監視対象サーバから出力されるログ等に含まれるエラーの内容を示すエラーコードと、上述した対象サーバと、リカバリ処理を実行させる優先順位を示す対応順序と、通知先テーブル３５６に登録された宛先に通知するメッセージを示す通知内容と、動作チェック処理において実行されたコマンドを示すチェック時実行コマンドと、エラーをリカバリする手順を含むコマンドを示すリカバリ手順とが対応付けて記憶されている。 FIG. 11 is a diagram illustrating an example of the recovery information stored in the error management DB 360. As shown in FIG. 11, the error management DB 360 stores, in association with each other, an error code indicating the content of an error included in a log or the like output from the monitoring target server, the above-described target server, a correspondence order indicating the priority in which recovery processes are executed, notification content indicating the message sent to the destinations registered in the notification destination table 356, a check-time execution command indicating the command executed in the operation check process, and a recovery procedure indicating a command that contains the procedure for recovering from the error.

図１１では、「ＯＳ１」の対象サーバに対して動作チェック処理時の実行コマンド「ＵＢＤＦ０１」が実行され、そのログ等にエラーコード「ＯＲＡ−０１６３１」が含まれている場合、管理者等に対しては「テーブル・索引がＭａｘＥｘｔｅｎｔ（＊＊＊）に到達」した旨が通知されるとともに、リカバリ手順としてコマンド「Ｒ３３ＴＣＤＢ１３」が実行されることを示している。また、同じ「ＯＳ１」の対象サーバに対して動作チェック処理時の実行コマンド「ＵＰＳＣ０１」が実行され、そのログ等にエラーコード「ＯＲＡ−０１６３１」が含まれている場合、管理者等に対しては「テーブル・索引がＭａｘＥｘｔｅｎｔ（＊＊＊）に到達」した旨が通知されるとともに、リカバリ手順としてコマンド「ＯＲＡＥＸＴＣＨ０１」が実行されることを示している。そして、この２つのエラーが同じタイミングで発生した場合には、対応順序が「１」であるコマンド「Ｒ３３ＴＣＤＢ１３」が実行され、その後、対応順序が「２」であるコマンド「ＯＲＡＥＸＴＣＨ０１」が実行されることを示している。これらの各項目は、結果判定テーブル３５２等と同様、障害モニタリング・自動復旧システム１０００の管理者等によってあらかじめ設定されるものとする。続いて、図１に戻り、障害管理テーブル３６１について説明する。 FIG. 11 shows that, when the execution command “UBDF01” of the operation check process is executed on a target server of “OS1” and its log or the like contains the error code “ORA-01631”, the administrator or the like is notified that “the table / index has reached MaxExtent (***)” and the command “R33TCDB13” is executed as the recovery procedure. Likewise, when the execution command “UPSC01” of the operation check process is executed on the same “OS1” target server and its log or the like contains the error code “ORA-01631”, the administrator or the like is notified that “the table / index has reached MaxExtent (***)” and the command “ORAEXTCH01” is executed as the recovery procedure. If these two errors occur at the same time, the command “R33TCDB13”, whose correspondence order is “1”, is executed first, followed by the command “ORAEXTCH01”, whose correspondence order is “2”. As with the result determination table 352 and the like, each of these items is set in advance by an administrator or the like of the failure monitoring / automatic recovery system 1000. Next, returning to FIG. 1, the failure management table 361 will be described.
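The role of the correspondence order when several errors are detected at once can be illustrated with a short Python sketch. The `ErrorRule` class, the `recovery_plan` function, and the tuple layout of a detected error are assumptions made for illustration; the two rules reuse the example values from FIG. 11.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRule:
    """One hypothetical row of the error management DB 360."""

    error_code: str
    target_server: str
    order: int  # correspondence order: 1 is executed first
    check_command: str
    recovery_command: str


RULES = [
    ErrorRule("ORA-01631", "OS1", 2, "UPSC01", "ORAEXTCH01"),
    ErrorRule("ORA-01631", "OS1", 1, "UBDF01", "R33TCDB13"),
]


def recovery_plan(detected: set[tuple[str, str, str]]) -> list[str]:
    """Return recovery commands for the detected errors, lowest order first.

    Each detected error is a (target server, check command, error code) tuple.
    """
    matched = [
        rule
        for rule in RULES
        if (rule.target_server, rule.check_command, rule.error_code) in detected
    ]
    return [rule.recovery_command for rule in sorted(matched, key=lambda r: r.order)]


plan = recovery_plan(
    {("OS1", "UBDF01", "ORA-01631"), ("OS1", "UPSC01", "ORA-01631")}
)
print(plan)
```

Sorting by the stored order rather than by detection time keeps the execution sequence under the administrator's control, which matches the statement that these items are set in advance.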

障害管理テーブル３６１は、リカバリ処理で実行されたリカバリ結果に関する情報を記憶するものである。 The failure management table 361 stores information related to the recovery result executed in the recovery process.

図１２は、障害管理テーブル３６１が記憶するリカバリ結果に関する情報の例を示す図である。図１２に示すように、障害管理テーブル３６１は、上述したエラーコードと、リカバリ処理が行われた監視対象サーバを示す実施サーバと、リカバリ処理が開始された日時を示す実施日時と、リカバリ処理で行われた手順を示すリカバリ手順と、リカバリ処理が完了した際のチェック手順を示す確認手順と、リカバリ処理が完了した日時を示す完了日時とが対応付けて記憶されている。 FIG. 12 is a diagram illustrating an example of information related to the recovery result stored in the failure management table 361. As shown in FIG. 12, the failure management table 361 includes the error code, the execution server indicating the monitoring target server on which the recovery process is performed, the execution date and time indicating the date and time when the recovery process is started, and the recovery process. A recovery procedure indicating the performed procedure, a confirmation procedure indicating a check procedure when the recovery process is completed, and a completion date and time indicating the date and time when the recovery process is completed are stored in association with each other.

図１２では、例えば、監視対象サーバ「開発サーバ５」に、エラーコード「ＯＲＡ−０１６３１」のエラーが生じた場合のリカバリ処理として、実施日時「２００８／１２／３１１８時３０分」から完了日時「２００８／１２／３１２１時３０分」まで「Ｅｘｔｅｎｔ拡張」が行われたことを示している。また、リカバリ処理の結果を「ＤＢＣｈｅｃｋ」で確認する手順となっていることを示している。なお、リカバリ手順および確認手順は、実際には「ＡＢＣＤ」「１２３４」等のコードとして障害管理テーブル３６１に記憶されている。これらの各項目は、リカバリ処理実行時に設定されるものとする。続いて、図１に戻り、動作チェック部３０１について説明する。 FIG. 12 shows, for example, that as the recovery process for an error with the error code “ORA-01631” on the monitoring target server “development server 5”, “Extent extension” was performed from the execution date and time “2008/12/31 18:30” to the completion date and time “2008/12/31 21:30”. It also shows that the result of the recovery process is confirmed with “DBCheck”. The recovery procedure and the confirmation procedure are actually stored in the failure management table 361 as codes such as “ABCD” and “1234”. Each of these items is set when the recovery process is executed. Next, returning to FIG. 1, the operation check unit 301 will be described.

動作チェック部３０１は、監視対象となる各サーバの稼動状態をチェックするためのヘルスチェックを行うと共に、各サーバから出力される各種ログ情報から動作状況を確認するものである。動作チェック部３０１は、上述した実行制御ＤＢ３５１、結果判定テーブル３５２、コマンドテーブル３５３、キーワードテーブル３５４、サーバ情報テーブル３５５、通知先テーブル３５６、スケジュールテーブル３５７等の各種のテーブルを参照して後述する種々の処理を行う。なお、各種ログ情報とは、例えば、上述した各サーバに記憶されるイベントログ、アクセスログ、システムログ、エラーログ、ＤＢアラートログ、サーバ稼働情報ログ、アプリケーションログ、ジョブログ等がある。 The operation check unit 301 performs a health check for checking the operating state of each server to be monitored, and confirms the operation status from the various log information output from each server. The operation check unit 301 performs the various processes described later with reference to the above-described tables, such as the execution control DB 351, the result determination table 352, the command table 353, the keyword table 354, the server information table 355, the notification destination table 356, and the schedule table 357. The various log information includes, for example, the event log, access log, system log, error log, DB alert log, server operation information log, application log, and job log stored in each of the above-described servers.

パフォーマンスデータ収集部３０２は、監視対象となる各サーバが備えているリソースの使用状況や、各サーバの稼働情報を収集し、収集した各サーバの稼働状況をモニタリングするものである。パフォーマンスデータ収集部３０２は、結果判定テーブル３５２、コマンドテーブル３５３、サーバ情報テーブル３５５、スケジュールテーブル３５７、パフォーマンスデータＤＢ３５８、結果抽出テーブル３５９等の各種のテーブルを参照して、後述する種々の処理を行う。 The performance data collection unit 302 collects the usage status of the resources of each server to be monitored and the operation information of each server, and monitors the collected operating status of each server. The performance data collection unit 302 performs the various processes described later with reference to tables such as the result determination table 352, the command table 353, the server information table 355, the schedule table 357, the performance data DB 358, and the result extraction table 359.

リカバリ部３０３は、動作チェック部３０１やパフォーマンスデータ収集部３０２が監視対象となる各サーバの異常を検知した際に、その異常の状況を解析し、その状況に応じた対応を実行するものである。リカバリ部３０３は、コマンドテーブル３５３と、通知先テーブル３５６と、エラー管理ＤＢ３６０と、障害管理テーブル３６１等の各種のテーブルを参照して、後述する種々の処理を行う。また、リカバリ部３０３は、各サーバの異常を検知した際には、あらかじめ登録されている障害モニタリング・自動復旧システム１０００のサーバ管理者４０１や保守業者４０２宛に電話やメール等で異常が生じた旨の通知を行う。 When the operation check unit 301 or the performance data collection unit 302 detects an abnormality in a server to be monitored, the recovery unit 303 analyzes the state of the abnormality and executes a response suited to that state. The recovery unit 303 performs the various processes described later with reference to tables such as the command table 353, the notification destination table 356, the error management DB 360, and the failure management table 361. In addition, when an abnormality is detected in a server, the recovery unit 303 notifies the server administrator 401 and the maintenance contractor 402 of the failure monitoring / automatic recovery system 1000, who are registered in advance, by telephone, e-mail, or the like that an abnormality has occurred.

続いて、障害モニタリング・自動復旧システム１０００で行われる処理について説明する。まず、動作チェック部３０１が行う動作チェック処理について説明する。 Next, the processing performed in the failure monitoring / automatic recovery system 1000 will be described. First, the operation check process performed by the operation check unit 301 will be described.

図１３は、動作チェック部３０１が行う動作チェック処理の処理手順を示すフローチャートである。なお、以下では、監視対象となる各サーバ等の装置はあらかじめ定められ、また、障害モニタリング・自動復旧システム１０００が所定のタイミングで起動されているものとする。 FIG. 13 is a flowchart illustrating the processing procedure of the operation check process performed by the operation check unit 301. In the following description, it is assumed that the devices to be monitored, such as servers, are determined in advance, and that the failure monitoring / automatic recovery system 1000 has been started at a predetermined timing.

図１３に示すように、動作チェック部３０１は、まず、スケジュールテーブル３５７を参照し（ステップＳ１１５１）、現在時刻と、実行日および時刻とを比較して、対象サーバが実行コマンドを実行すべきタイミングであるか否かを判定する（ステップＳ１１５２）。 As shown in FIG. 13, the operation check unit 301 first refers to the schedule table 357 (step S1151), compares the current time with the execution date and time, and determines whether it is time for the target server to execute the execution command (step S1152).

そして、動作チェック部３０１は、実行コマンドを実行すべき実行タイミングであると判定した場合（ステップＳ１１５２；Ｙｅｓ）、サーバ情報テーブル３５５を参照し、スケジュールテーブル３５７の対象サーバのホスト名を特定し、特定したホスト名と同じレコードの情報（ＩＰアドレス、ホスト名、各種のアカウントやパスワード等）のログイン情報を含む対象サーバ情報を読み込む（ステップＳ１１５４）。 When the operation check unit 301 determines that it is time to execute the execution command (step S1152; Yes), it refers to the server information table 355, identifies the host name of the target server in the schedule table 357, and reads the target server information, including login information, from the record with that host name (IP address, host name, various accounts, passwords, and the like) (step S1154).

一方、動作チェック部３０１は、実行コマンドを実行すべき実行タイミングではないと判定した場合（ステップＳ１１５２；Ｎｏ）、実行タイミングとなるまで待機する。なお、後述するように、この時点でリカバリ部３０３がリカバリ処理を実行していた場合（ステップＳ１１５３）、ステップＳ１１５１〜Ｓ１１５２を行わず（実行タイミングの判定を行わず）に、即時にステップＳ１１５４以降の処理を行う。 On the other hand, when the operation check unit 301 determines that it is not yet time to execute the execution command (step S1152; No), it waits until the execution timing arrives. As described later, if the recovery unit 303 is executing the recovery process at this point (step S1153), steps S1151 to S1152 are skipped (the execution timing is not determined) and the processing from step S1154 onward is performed immediately.
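The timing decision of steps S1151 to S1153 can be sketched in Python as below. The entry layout and the function name `should_run` are illustrative assumptions; the schedule entry reuses the example from FIG. 8, and the recovery flag models the case in which the timing determination is skipped.

```python
from datetime import datetime

# Hypothetical schedule table entry, using the example values from FIG. 8.
ENTRY = {
    "run_at": datetime(2009, 1, 1, 0, 0),
    "target_server": "development server 1",
    "command": "UBDF01",
}


def should_run(entry: dict, now: datetime, recovery_in_progress: bool = False) -> bool:
    """Decide whether the execution command should be run now.

    When a recovery process is in progress, the timing comparison is skipped
    and the check proceeds immediately; otherwise the current time is
    compared with the scheduled execution date and time.
    """
    if recovery_in_progress:
        return True
    return now >= entry["run_at"]


before = should_run(ENTRY, datetime(2008, 12, 31, 23, 59))
on_time = should_run(ENTRY, datetime(2009, 1, 1, 0, 0))
forced = should_run(ENTRY, datetime(2008, 12, 31, 23, 59), recovery_in_progress=True)
print(before, on_time, forced)
```

A caller that receives `False` would simply wait and retry later, which corresponds to waiting until the execution timing arrives.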

続いて、動作チェック部３０１は、スケジュールテーブル３５７の実行コマンドとサーバ情報テーブル３５５のＯＳ種別とをキーとしてコマンドテーブル３５３を参照し、その実行コマンドとＯＳ種別に一致するレコードに含まれるコマンドおよびオプション（すなわち、実際に実行されるコマンドやオプション。以下、コマンド詳細情報と呼ぶ。）を取得する（ステップＳ１１５５）。 Subsequently, the operation check unit 301 refers to the command table 353 using the execution command in the schedule table 357 and the OS type in the server information table 355 as keys, and acquires the command and options contained in the record that matches that execution command and OS type (that is, the command and options that are actually executed; hereinafter referred to as command detailed information) (step S1155).
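The two-key lookup of step S1155 can be sketched as follows. The dictionaries stand in for the server information table 355 and the command table 353; the command name `dbf` appears in the description as an example, while the option `-k` and the second row are purely hypothetical placeholders.

```python
# Hypothetical stand-in for the server information table 355.
SERVER_INFO = {"development server 1": {"os_type": "OS1"}}

# Hypothetical stand-in for the command table 353, keyed by
# (execution command, OS type). The option "-k" and the OS2 row are invented.
COMMAND_TABLE = {
    ("UBDF01", "OS1"): {"command": "dbf", "options": ["-k"]},
    ("UBDF01", "OS2"): {"command": "diskfree", "options": []},
}


def command_details(exec_command: str, target_server: str) -> list[str]:
    """Resolve the command and options actually executed on the server."""
    os_type = SERVER_INFO[target_server]["os_type"]
    row = COMMAND_TABLE[(exec_command, os_type)]
    return [row["command"], *row["options"]]


details = command_details("UBDF01", "development server 1")
print(details)
```

Keying on the OS type lets one logical execution command map to a different concrete command on each platform, which is the apparent purpose of the command table.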

そして、動作チェック部３０１は、取得したコマンド詳細情報を、ステップＳ１１５４で取得した対象サーバ情報に示されたサーバに送信し、サーバに送信したコマンドを実行させ（ステップＳ１１５８）、そのコマンドの実行結果を取得する（ステップＳ１１５９）。 Then, the operation check unit 301 transmits the acquired command detailed information to the server indicated by the target server information acquired in step S1154, causes the server to execute the transmitted command (step S1158), and acquires the execution result of the command (step S1159).

ここで、実行結果は単純にコマンドを実行した結果であるため、サーバの動作チェックには不要な情報も含まれている。したがって、動作チェック部３０１は、実行したコマンドをキーとして結果抽出テーブル３５９を参照し、実行したコマンドに一致する実行コマンドを含むレコードを特定し、特定したレコードに含まれる抽出キーワードに関する情報のみを抽出して加工する（ステップＳ１１６０）。 Here, since the execution result is simply the result of executing the command, unnecessary information is also included in the operation check of the server. Therefore, the operation check unit 301 refers to the result extraction table 359 using the executed command as a key, specifies a record including an execution command that matches the executed command, and extracts only information related to the extracted keyword included in the specified record. To process (step S1160).

各種ログ情報に含まれている正常終了・異常終了・単純な操作等を示す情報が含まれているため、このように、動作チェック部３０１が、上述した抽出キーワードに関する情報のみを抽出した上で動作チェックを行うことによって、より精度の高い結果を得ることができる。 The various log information contains entries indicating normal termination, abnormal termination, simple operations, and the like. By extracting only the information related to the above-described extraction keywords before performing the operation check, the operation check unit 301 can obtain a more accurate result.

ステップＳ１１６０の処理が終了すると、動作チェック部３０１は、実行したコマンドをキーとしてコマンドテーブル３５３を参照し、そのコマンドに一致するコマンドＩＤを含むレコードを特定し、特定したレコードに含まれる判断種別がしきい値であるか否か（すなわち、実行結果をしきい値によって確認するか否か）を判定する（ステップＳ１１６１）。 When the processing in step S1160 ends, the operation check unit 301 refers to the command table 353 using the executed command as a key, identifies a record including a command ID that matches the command, and determines the determination type included in the identified record. It is determined whether or not it is a threshold value (that is, whether or not the execution result is confirmed by the threshold value) (step S1161).

動作チェック部３０１は、実行結果をしきい値によって確認すると判定した場合（ステップＳ１１６１；Ｙｅｓ）、結果判定テーブル３５２のしきい値（Ａ）〜（Ｃ）を読み取り（ステップＳ１１６３）、実行結果が、読み取ったしきい値（Ａ）〜（Ｃ）のうちのいずれの範囲に該当するか否かを判定したのち、しきい値（Ｃ）の値を超えた範囲にあるか否かを判定する（ステップＳ１１６４）。 When the operation check unit 301 determines that the execution result is to be checked against thresholds (step S1161; Yes), it reads thresholds (A) to (C) from the result determination table 352 (step S1163), determines which of the ranges defined by thresholds (A) to (C) the execution result falls into, and then determines whether the result is in the range exceeding threshold (C) (step S1164).

そして、動作チェック部３０１は、実行結果が、しきい値（Ｃ）の値を超えた範囲にあると判定した場合（ステップＳ１１６４；Ｙｅｓ）、そのサーバの稼動状態が異常であると判定し、ステップＳ１１７１に進む。なお、後述するパフォーマンスデータ収集処理で異常と検知した場合も、ステップＳ１１６４と同様に、ステップＳ１１７１に進む（ステップＳ１１６２）。 When the operation check unit 301 determines that the execution result is within the range exceeding the threshold value (C) (step S1164; Yes), the operation check unit 301 determines that the operating state of the server is abnormal, The process proceeds to step S1171. Note that if an abnormality is detected in the performance data collection process described later, the process proceeds to step S1171 as in step S1164 (step S1162).

一方、動作チェック部３０１は、実行結果が、しきい値（Ｃ）の値を超えた範囲にないと判定した場合（ステップＳ１１６４；Ｎｏ）、そのサーバは異常な稼動状態ではないと判定（サーバの動作確認が正常に終了したと判定）し、実行結果を実行制御ＤＢ３５１の取得値およびステータスに格納する（ステップＳ１１６５）。 On the other hand, when the operation check unit 301 determines that the execution result is not in the range exceeding threshold (C) (step S1164; No), it determines that the server is not in an abnormal operating state (that is, that the operation check of the server has completed normally) and stores the execution result in the acquired value and status fields of the execution control DB 351 (step S1165).

その後、動作チェック部３０１は、動作チェックの次回実行時間を設定するため、実行制御ＤＢ３５１と結果判定テーブル３５２とを参照し、そのコマンドを次回実行させるタイミングを決定し、その結果をスケジュールテーブル３５７および実行制御ＤＢ３５１に格納する（ステップＳ１１６６）。 Thereafter, to set the next execution time of the operation check, the operation check unit 301 refers to the execution control DB 351 and the result determination table 352, determines when the command is to be executed next, and stores the result in the schedule table 357 and the execution control DB 351 (step S1166).

例えば、ステップＳ１１６４において、実行コマンド「ＵＢＤＦ０１」の実行結果が「８０％」であった場合、そのサーバのステータスは「Ａ」であるため、図３に示す結果判定テーブル３５２のステータスＡを参照し、コマンド「ＵＢＤＦ０１」を「１２時間毎」に実行させるように、そのタイミングを設定する。 For example, if the execution result of the execution command “UBDF01” is “80%” in step S1164, the status of the server is “A”, so refer to the status A of the result determination table 352 shown in FIG. The timing is set so that the command “UBDF01” is executed “every 12 hours”.

これと同様に、実行コマンド「ＵＢＤＦ０１」の実行結果が「９０％」であった場合には、そのサーバのステータスは「Ｂ」であるため、結果判定テーブル３５２のステータスＢを参照し、コマンド「ＵＢＤＦ０１」を「６時間毎」に実行させるように、そのタイミングを設定し、実行コマンド「ＵＢＤＦ０１」の実行結果が「９８％」であった場合には、そのサーバのステータスは「Ｃ」であるため、結果判定テーブル３５２のステータスＣを参照し、コマンド「ＵＢＤＦ０１」を「３時間毎」に実行させるように、そのタイミングを設定する。このように、サーバの稼動状態が異常な状態（危険な状態）に近づくにつれて、コマンドを実行させる間隔を短くし、サーバの稼動状態を迅速に把握することができるようになっている。 Similarly, when the execution result of the execution command “UBDF01” is “90%”, the status of the server is “B”, so status B of the result determination table 352 is referred to and the timing is set so that the command “UBDF01” is executed “every 6 hours”. When the execution result of the execution command “UBDF01” is “98%”, the status of the server is “C”, so status C of the result determination table 352 is referred to and the timing is set so that the command “UBDF01” is executed “every 3 hours”. In this way, the interval at which the command is executed is shortened as the operating state of the server approaches an abnormal (dangerous) state, so that the operating state of the server can be grasped quickly.
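The status-dependent scheduling described above can be sketched in Python. The thresholds of 80, 90, and 98 (% or more) and the intervals of 12, 6, and 3 hours come from the examples in the description; the function names, and the choice to return `None` for values below the lowest threshold, are assumptions, since the text does not say how such values are scheduled.

```python
from datetime import datetime, timedelta
from typing import Optional

# Lower bounds per status, checked from the most severe status down.
THRESHOLDS = [("C", 98), ("B", 90), ("A", 80)]
INTERVAL_HOURS = {"A": 12, "B": 6, "C": 3}


def classify(value: float) -> Optional[str]:
    """Return the status whose range contains the value, or None below A."""
    for status, lower_bound in THRESHOLDS:
        if value >= lower_bound:
            return status
    return None


def next_run(value: float, now: datetime) -> Optional[datetime]:
    """Compute the next execution time from the status-specific interval."""
    status = classify(value)
    if status is None:
        return None  # assumption: no rescheduling rule is given for this case
    return now + timedelta(hours=INTERVAL_HOURS[status])


start = datetime(2009, 1, 1, 0, 0)
print(classify(80), classify(90), classify(98))
print(next_run(90, start))
```

Checking the most severe threshold first means each value is assigned exactly one status, so the interval shrinks monotonically as the measured value rises.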

その後、動作チェック部３０１は、リカバリ処理が起動して実行された状態で、動作チェック処理を行っているか否か（すなわち、動作チェックが単体で実行されたものであるか、リカバリ処理の一環として実行されたものであるか）を判定し（ステップＳ１１６７）、動作チェックがリカバリ処理の一環として実行されていないと判定した場合（ステップＳ１１６７；Ｎｏ）、処理を終了させる。 Thereafter, the operation check unit 301 determines whether or not the operation check process is performed in a state where the recovery process is activated and executed (that is, whether the operation check is executed alone or as part of the recovery process). (Step S1167). If it is determined that the operation check has not been executed as part of the recovery process (Step S1167; No), the process is terminated.

一方、動作チェック部３０１は、動作チェックがリカバリ処理の一環として実行されていると判定した場合（ステップＳ１１６７；Ｙｅｓ）、図１５に示すリカバリ処理に進む（ステップＳ１１６８）。 On the other hand, when the operation check unit 301 determines that the operation check is being executed as part of the recovery process (step S1167; Yes), the operation check unit 301 proceeds to the recovery process illustrated in FIG. 15 (step S1168).

動作チェック部３０１は、ステップＳ１１６１において、実行結果をしきい値によって確認しないと判定した（すなわち実行結果を各種ログ情報によって確認すると判定した）場合（ステップＳ１１６１；Ｎｏ）、キーワードテーブル３５４を参照し、各種ログ情報の中からステップＳ１１５７において実行されたコマンドに一致する実行コマンドを読み込んで特定し（ステップＳ１１６９）、そのコマンドの実行結果にエラーがあり、そのエラーがキーワードテーブル３５４のエラーキーワードとして登録されているものであるか否か（すなわち、実行結果にエラーが存在し、そのエラーが無視できないものであるか否か）を判定する（ステップＳ１１７０）。 When the operation check unit 301 determines in step S1161 that the execution result is not to be checked against thresholds (that is, that the execution result is to be checked against the various log information) (step S1161; No), it refers to the keyword table 354, reads and identifies, from the various log information, the execution command that matches the command executed in step S1157 (step S1169), and determines whether the execution result of that command contains an error that is registered as an error keyword in the keyword table 354 (that is, whether an error exists in the execution result and that error cannot be ignored) (step S1170).

そして、動作チェック部３０１は、実行結果にエラーが存在せず、もしくはそのエラーが無視できないものではない（例えば、実行結果にエラーが存在しても、そのエラーが無視キーワードである場合等）と判定した場合（ステップＳ１１７０；Ｎｏ）、上述したステップＳ１１６５に進む。 When the operation check unit 301 determines that no error exists in the execution result, or that the error can be ignored (for example, an error exists in the execution result but matches an ignore keyword) (step S1170; No), it proceeds to step S1165 described above.

一方、動作チェック部３０１は、実行結果にエラーが存在し、もしくはそのエラーが無視できないものである（例えば、実行結果にエラーが存在し、そのエラーがエラーキーワードである場合等）と判定した場合（ステップＳ１１７０；Ｙｅｓ）、通知先テーブル３５６を参照し、ステップＳ１１５４で取得した対象サーバ情報に含まれるホスト名に一致する対象サーバを特定し（ステップＳ１１７１）、特定した対象サーバの連絡先として登録された宛先に通報する（ステップＳ１１７２）。 On the other hand, when the operation check unit 301 determines that an error exists in the execution result or the error cannot be ignored (for example, an error exists in the execution result and the error is an error keyword). (Step S1170; Yes), with reference to the notification destination table 356, the target server that matches the host name included in the target server information acquired in Step S1154 is specified (Step S1171), and registered as the contact information of the specified target server The notified destination is notified (step S1172).
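The determination of step S1170 can be sketched as a keyword check over the log text. The keyword lists below are hypothetical examples standing in for the keyword table 354, and giving ignore keywords precedence over error keywords on the same line is an assumption that the description does not spell out.

```python
# Hypothetical contents of the keyword table 354.
ERROR_KEYWORDS = ["ORA-"]
IGNORE_KEYWORDS = ["ORA-00000"]


def has_unignorable_error(log_text: str) -> bool:
    """Return True if any line holds an error keyword that is not ignored."""
    for line in log_text.splitlines():
        if any(keyword in line for keyword in IGNORE_KEYWORDS):
            continue  # assumption: an ignore keyword overrides error keywords
        if any(keyword in line for keyword in ERROR_KEYWORDS):
            return True
    return False


abnormal = has_unignorable_error("job start\nORA-01631: limit reached\njob end")
ignored = has_unignorable_error("ORA-00000: completed")
clean = has_unignorable_error("job start\njob end")
print(abnormal, ignored, clean)
```

A `True` result corresponds to the branch in which the registered contact is notified; `False` corresponds to the branch that records the result as a normal completion.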

その後、動作チェック部３０１は、ステップＳ１１６５、Ｓ１１６６の場合と同様に、実行結果を実行制御ＤＢ３５１に格納し（ステップＳ１１７３）、そのコマンドを次回実行させるタイミングを決定し、その結果をスケジュールテーブル３５７および実行制御ＤＢ３５１に格納する（ステップＳ１１７４）。このステップＳ１１７４の処理、またはステップＳ１１６７の処理が終了すると、図１３に示す動作チェック処理の全ての処理が終了する。続いて、パフォーマンスデータ収集部３０２が行うパフォーマンスデータ収集処理について説明する。 Thereafter, as in steps S1165 and S1166, the operation check unit 301 stores the execution result in the execution control DB 351 (step S1173), determines when the command is to be executed next, and stores the result in the schedule table 357 and the execution control DB 351 (step S1174). When the processing of step S1174 or step S1167 ends, the entire operation check process shown in FIG. 13 ends. Next, the performance data collection process performed by the performance data collection unit 302 will be described.

図１４は、パフォーマンスデータ収集部３０２が行うパフォーマンスデータ収集処理の処理手順を示すフローチャートである。 FIG. 14 is a flowchart illustrating a processing procedure of performance data collection processing performed by the performance data collection unit 302.

図１４に示すように、パフォーマンスデータ収集部３０２は、スケジュールテーブル３５７を読み込み（ステップＳ１２５１）、スケジュールテーブル３５７の実行日と、現在時刻とを比較し、パフォーマンスデータ収集処理の実行タイミングであるか否かを判定する（ステップＳ１２５２）。 As shown in FIG. 14, the performance data collection unit 302 reads the schedule table 357 (step S1251), compares the execution date of the schedule table 357 with the current time, and determines whether it is the execution timing of the performance data collection process. Is determined (step S1252).

そして、パフォーマンスデータ収集部３０２は、パフォーマンスデータ収集処理の実行タイミングであると判定した場合（ステップＳ１２５２；Ｙｅｓ）、さらに、サーバ情報テーブル３５５を読み込んで、スケジュールテーブル３５７の対象サーバと同じホスト名とＯＳ種別と含む対象サーバ情報を取得する（ステップＳ１２５３）。 When the performance data collection unit 302 determines that it is time to execute the performance data collection process (step S1252; Yes), it further reads the server information table 355 and acquires the target server information, including the OS type, for the host name that matches the target server in the schedule table 357 (step S1253).

その後、パフォーマンスデータ収集部３０２は、コマンドテーブル３５３を参照し、ステップＳ１２５１で読み込んだスケジュールテーブル３５７に含まれる実行コマンドと、ステップＳ１２５３で取得したＯＳ種別に一致するレコードに含まれるコマンドを特定し（ステップＳ１２５４）、そのコマンドを実行するための実行ファイル（実行コマンド）を生成する（ステップＳ１２５５）。 Thereafter, the performance data collection unit 302 refers to the command table 353, and identifies the execution command included in the schedule table 357 read in step S1251 and the command included in the record that matches the OS type acquired in step S1253 ( In step S1254, an execution file (execution command) for executing the command is generated (step S1255).

具体的には、パフォーマンスデータ収集部３０２は、ステップＳ１２５１においてスケジュールテーブル３５７の対象サーバ（例えば、開発サーバ１）、実行コマンド（例えば、ＵＢＤＦ０１）を読み込み、読み込んだ対象サーバをキーとして、さらに、ステップＳ１２５３において、サーバ情報テーブル３５５のＯＳ種別（例えば、ＯＳ１）を読み込む。その後、パフォーマンスデータ収集部３０２は、ステップＳ１２５４において、実行コマンドとＯＳ種別とをキーとして、さらに、コマンドテーブル３５３のコマンド（例えば、ｄｂｆ）を取得し、取得したコマンドを実行させるためのファイルを生成する。 Specifically, in step S1251 the performance data collection unit 302 reads the target server (for example, development server 1) and the execution command (for example, UBDF01) from the schedule table 357, and in step S1253 it uses the read target server as a key to read the OS type (for example, OS1) from the server information table 355. Then, in step S1254, the performance data collection unit 302 uses the execution command and the OS type as keys to acquire the command (for example, dbf) from the command table 353, and generates a file for executing the acquired command.

パフォーマンスデータ収集部３０２は、ステップＳ１２５５で生成したファイルを対象となるサーバに送信して処理を実行させ（ステップＳ１２５６、Ｓ１２５７）、その実行結果を取得し、対象サーバおよび実行コマンド（例えば、開発サーバ１、ＵＢＤＦ０１）をキーとして、その実行結果（例えば、３５（％））をパフォーマンスデータＤＢ３５８に登録する（ステップＳ１２５８）。 The performance data collection unit 302 transmits the file generated in step S1255 to the target server and causes it to execute the processing (steps S1256 and S1257), acquires the execution result, and registers that result (for example, 35 (%)) in the performance data DB 358 using the target server and the execution command (for example, development server 1 and UBDF01) as keys (step S1258).

ここで、実行結果は単純にコマンドを実行した結果であるため、サーバのパフォーマンスデータの収集には不要な情報も含まれている。したがって、パフォーマンスデータ収集部３０２は、実行したコマンドをキーとして結果抽出テーブル３５９を参照し、図１３に示したステップＳ１１６０の場合と同様に、実行したコマンドに一致する実行コマンドを含むレコードを特定し、特定したレコードに含まれる抽出キーワードに関する情報のみを抽出して加工する（ステップＳ１２５９）。 Here, since the execution result is simply the result of executing the command, information unnecessary for collecting server performance data is also included. Therefore, the performance data collection unit 302 refers to the result extraction table 359 using the executed command as a key, and specifies a record including an executed command that matches the executed command, as in step S1160 shown in FIG. Only the information related to the extracted keyword included in the specified record is extracted and processed (step S1259).

さらに、パフォーマンスデータ収集部３０２は、取得した実行結果が異常である場合も存在するため、実行コマンド（例えば、ＵＢＤＦ０１）およびＯＳ種別（例えば、ＯＳ１）をキーとして結果判定テーブル３５２を読み込んで、その実行コマンドおよびＯＳ種別に一致する実行コマンドおよび対象サーバタイプを含むレコードを特定し、そのレコードのしきい値（例えば、８０（％以上）、９０（％以上）、９８（％以上））を取得するとともに（ステップＳ１２６０）、ステップＳ１２５８で取得した実行結果を、対象サーバ、コマンド、対象サーバタイプ（例えば、開発サーバ１、ＵＢＤＦ０１、ＯＳ１）をキーとして実行制御ＤＢ３５１に登録する（ステップＳ１２６１）。 Furthermore, because the acquired execution result may be abnormal, the performance data collection unit 302 reads the result determination table 352 using the execution command (for example, UBDF01) and the OS type (for example, OS1) as keys, identifies the record whose execution command and target server type match them, and acquires the thresholds of that record (for example, 80 (% or more), 90 (% or more), and 98 (% or more)) (step S1260). It also registers the execution result acquired in step S1258 in the execution control DB 351, using the target server, command, and target server type (for example, development server 1, UBDF01, and OS1) as keys (step S1261).

その後、パフォーマンスデータ収集部３０２は、ステップＳ１２５８で取得した実行結果が、ステップＳ１２６０で取得したしきい値を満たしているか否か（しきい値を超えているか否か）を判定し（ステップＳ１２６２）、実行結果がしきい値を超えていないと判定した場合（ステップＳ１２６２；Ｎｏ）、実行結果がいずれのしきい値を満たすものであるかを判定し、その結果（取得値、ステータス）を実行制御ＤＢ３５１に登録する（ステップＳ１２６５）。 Thereafter, the performance data collection unit 302 determines whether or not the execution result acquired in step S1258 satisfies the threshold acquired in step S1260 (whether or not the threshold is exceeded) (step S1262). When it is determined that the execution result does not exceed the threshold value (step S1262; No), it is determined which threshold value the execution result satisfies, and the result (acquired value, status) is executed. Register in the control DB 351 (step S1265).

一方、パフォーマンスデータ収集部３０２は、実行結果がしきい値を超えていると判定した場合（ステップＳ１２６２；Ｙｅｓ）、サーバの稼働状況が異常であると判定し、その結果（取得値、ステータス）を実行制御ＤＢ３５１に登録し（ステップＳ１２６３）、サーバの動作をチェックするために、図１３に示した動作チェック処理に進む（ステップＳ１２６４）。このステップＳ１２６４またはステップＳ１２６５の処理が終了すると、図１４に示したパフォーマンスデータ収集処理の全ての処理が終了する。続いて、リカバリ部３０３が行うリカバリ処理について説明する。 On the other hand, if the performance data collection unit 302 determines that the execution result exceeds the threshold (step S1262; Yes), the performance data collection unit 302 determines that the server operating status is abnormal, and the result (acquired value, status). Is registered in the execution control DB 351 (step S1263), and the operation check process shown in FIG. 13 is performed to check the operation of the server (step S1264). When the processing of step S1264 or step S1265 is completed, all the performance data collection processing shown in FIG. 14 is completed. Next, the recovery process performed by the recovery unit 303 will be described.

図１５は、リカバリ部３０３が行うリカバリ処理の処理手順を示すフローチャートである。なお、以下に示すリカバリ処理は、それ自体が単独で起動することはなく、図１３に示した動作チェック処理において実行結果に無視できないエラーが存在し、その異常が連絡先として登録された宛先に通報された場合（図１３に示したステップＳ１１７１〜１１７４までの処理が実行された場合）に、自動的に起動するものとする。 FIG. 15 is a flowchart showing the processing procedure of the recovery process performed by the recovery unit 303. The recovery process described below is never started on its own. It starts automatically when the operation check process shown in FIG. 13 finds an error in the execution result that cannot be ignored and the abnormality has been reported to the destination registered as the contact (that is, when steps S1171 to S1174 shown in FIG. 13 have been executed).

図１５に示すように、リカバリ部３０３は、動作チェック処理からリカバリ処理に処理が遷移したタイミングで、通知先テーブル３５６の情報を読み込み（ステップＳ１３５１）、ステップＳ１１７１において参照した通知先テーブル３５６の連絡先として登録された宛先に、リカバリ処理を開始する旨を通報する（ステップＳ１３５２）。 As illustrated in FIG. 15, the recovery unit 303 reads information in the notification destination table 356 at the timing when the process transitions from the operation check process to the recovery process (step S1351), and contacts the notification destination table 356 referred to in step S1171. The destination registered as the destination is notified that the recovery process is to be started (step S1352).

The recovery unit 303 then acquires the target server that matches the host name included in the target server information acquired in step S1154 of the operation check process shown in FIG. 13 (step S1353). Next, the recovery unit 303 reads the record in which the execution server of the failure management table 361 matches the acquired target server (step S1354), checks, using the error code included in that record as a key, whether an error has occurred in the past with the same command in the failure management table 361 (step S1355), and determines whether the same failure history is stored in the failure management table 361 (step S1356).

If the recovery unit 303 determines that the same failure history is stored in the failure management table 361 (step S1356; Yes), it selects the recovery procedure corresponding to the stored failure history (error code) and adopts that past procedure (step S1357).

For example, if the target server information acquired in step S1154 of the operation check process shown in FIG. 13 is "development server 5", the recovery unit 303 reads the record in the failure management table 361 whose execution server is "development server 5", matching the target server. The recovery unit 303 then refers to the error code "ORA-01631" included in that record and adopts "Extent extension (ABCD)" as the recovery procedure.

On the other hand, if the recovery unit 303 determines that the same failure history is not stored in the failure management table 361 (step S1356; No), it refers to the server information table 355, identifies in the failure management table 361 a name that matches the host name of another server of the same OS type, and checks the recovery procedure of that other server (execution server) (step S1366).

If the recovery unit 303 determines that a recovery procedure exists for another server (step S1366; Yes), it proceeds to step S1357 and adopts the recovery procedure of the other server, in the same way as when a past recovery procedure is adopted.

On the other hand, if the recovery unit 303 determines that no recovery procedure exists for another server (step S1366; No), it refers to the error codes in the error management DB 360 (step S1367) and determines whether the same failure history is stored in the error management DB 360 (step S1368).

If the recovery unit 303 determines that the same failure history is stored in the error management DB 360 (step S1368; Yes), it adopts the recovery procedure stored in the error management DB 360 (step S1369) and proceeds to step S1358.

On the other hand, if the recovery unit 303 determines that the same failure history is not stored in the error management DB 360 (step S1368; No), automatic recovery is difficult, so the recovery unit 303 notifies the relevant parties again, as in step S1172 shown in FIG. 13 (step S1370), and the administrator handles the failure (step S1371).
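
The selection order described in steps S1356, S1366 and S1368 can be read as a three-stage fallback. The following is a minimal Python sketch of that reading, not the implementation disclosed in this publication. The names `FailureRecord`, `select_recovery_procedure`, `os_type_by_host` and `error_db` are hypothetical, and the sketch assumes that the lookup for another server of the same OS type is also keyed on the error code, which the description does not state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FailureRecord:
    """One row of a hypothetical failure management table (cf. table 361)."""
    server: str       # execution server (host name)
    error_code: str   # e.g. "ORA-01631"
    procedure: str    # recovery procedure that was used in the past


def select_recovery_procedure(
    host: str,
    error_code: str,
    failure_table: List[FailureRecord],
    os_type_by_host: Dict[str, str],
    error_db: Dict[str, List[str]],
) -> Optional[str]:
    """Return a recovery procedure, or None when an administrator must act."""
    # Stage 1 (cf. S1356): same server, same error code in the failure history.
    for rec in failure_table:
        if rec.server == host and rec.error_code == error_code:
            return rec.procedure

    # Stage 2 (cf. S1366): another server of the same OS type.
    # Matching on the error code here is an assumption of this sketch.
    host_os = os_type_by_host.get(host)
    if host_os is not None:
        for rec in failure_table:
            if (
                rec.server != host
                and os_type_by_host.get(rec.server) == host_os
                and rec.error_code == error_code
            ):
                return rec.procedure

    # Stage 3 (cf. S1368): procedures registered per error code, in priority
    # order (cf. error management DB 360); take the highest-priority one.
    procedures = error_db.get(error_code)
    if procedures:
        return procedures[0]

    # No candidate found (cf. S1370, S1371): escalate to the administrator.
    return None
```

Returning `None` rather than raising keeps the "notify and hand over to the administrator" branch an ordinary outcome of the lookup, which matches how the flowchart treats it.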

When step S1357 or step S1369 ends, the recovery unit 303 acquires from the command table 353 the command of the recovery procedure adopted in step S1357 or step S1369 (for example, the command used when the recovery procedure "Extent extension" of the failure management table 361 was carried out, or the command "UBDF01" corresponding to "R3RTCDB13", a recovery procedure file in the error management DB 360, and to the error code "ORA-01631") (step S1358) and, as in step S1255 shown in FIG. 14, generates a file for executing the acquired command (step S1359).

The recovery unit 303 then transmits the file generated in step S1359 to the target server, causes it to execute the recovery process (steps S1360, S1361), and acquires the execution result (for example, various log information of that server) (step S1362).

When the processing of step S1362 ends, the operation check unit 301 executes the operation check process of steps S1153 (A) to S1168 (D) shown in FIG. 13 in order to confirm the effectiveness of the recovery procedure. When this operation check process ends, the recovery unit 303 determines from it whether the error has been resolved (for example, whether the value no longer falls in the range exceeding the threshold (C)) (step S1365).

If the recovery unit 303 determines from the operation check process that the error has not been resolved (step S1365; No), it returns to step S1369, adopts the recovery procedure stored in the error management DB 360 (step S1369), and proceeds to step S1358.
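
The execute, re-check and retry cycle of steps S1358 to S1365 can be sketched as a loop over candidate procedures. This is an illustrative Python sketch under stated assumptions: `recover_with_verification`, `run` and `is_healthy` are hypothetical names, and trying each candidate once before giving up is a choice of this sketch, since the description does not say how the flow ends when every stored procedure fails.

```python
from typing import Callable, Iterable, Optional


def recover_with_verification(
    candidates: Iterable[str],
    run: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> Optional[str]:
    """Try candidate procedures in order until the operation check passes.

    run(procedure) stands in for generating the command file, sending it to
    the target server and collecting the result (cf. S1358 to S1362).
    is_healthy() stands in for re-running the operation check (cf. S1365).
    """
    for procedure in candidates:
        run(procedure)
        if is_healthy():
            # Error resolved: report which procedure worked.
            return procedure
    # Every candidate was tried without success (assumed stopping rule).
    return None
```

A caller would typically pass the procedure chosen from the failure history first, followed by the procedures registered for the error code, so that a failed first attempt falls through to the stored alternatives.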

On the other hand, if the recovery unit 303 determines from the operation check process that the error has been resolved (step S1365; Yes), it registers the recovery procedure in the failure management table 361 (step S1372) and, using the error code as a key, changes the handling order in the error management DB 360 so that this recovery procedure is executed preferentially when the error code occurs (step S1373).

Specifically, in the example shown in FIG. 11, when the error code "ORA-01631" occurs on the server whose target server is "OS1", the recovery procedure is to execute the file "R3RTCDB13" first and then the file "ORAEXTCH01". However, if the executed recovery process proves effective, for example if executing the file "ORAEXTCH01" secured more capacity or completed the recovery in a shorter processing time, the handling order of the recovery procedures is swapped and the priority of "ORAEXTCH01" is raised so that the procedure executing that file is run first when a failure occurs in the future.
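
The reordering in step S1373 amounts to moving the procedure that proved effective to the front of the list kept for its error code. A minimal Python sketch follows. The function name `promote_procedure` and the representation of the error management DB as a dictionary of ordered lists are assumptions made for illustration.

```python
from typing import Dict, List


def promote_procedure(
    error_db: Dict[str, List[str]],
    error_code: str,
    procedure: str,
) -> List[str]:
    """Move `procedure` to the highest priority for `error_code`.

    error_db maps an error code to its recovery procedures in handling
    order, highest priority first (a hypothetical stand-in for DB 360).
    """
    procedures = error_db.setdefault(error_code, [])
    if procedure in procedures:
        procedures.remove(procedure)
    # Index 0 is tried first the next time this error code occurs.
    procedures.insert(0, procedure)
    return procedures


if __name__ == "__main__":
    db = {"ORA-01631": ["R3RTCDB13", "ORAEXTCH01"]}
    promote_procedure(db, "ORA-01631", "ORAEXTCH01")
    print(db["ORA-01631"])
```

With the FIG. 11 example, promoting "ORAEXTCH01" for "ORA-01631" places it ahead of "R3RTCDB13", and the relative order of the remaining procedures is preserved.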

When the processing up to step S1373 ends, the recovery unit 303 resets the execution control DB 351 by clearing each of its items (step S1374) and notifies the relevant parties that the recovery process has completed, as in step S1172 shown in FIG. 13 (steps S1375, S1376). When the processing of step S1376 or step S1371 ends, all of the recovery processing shown in FIG. 15 ends.

As described above, in the common monitoring terminal 251, which recovers failures that occur in the servers to be monitored, the storage unit 350 stores, as a history, error codes indicating past failures that occurred in the monitored servers in association with recovery procedures for those failures. The operation check unit 301 acquires operation information from a monitored server and performs an operation check that determines whether the acquired operation information satisfies a predetermined condition (threshold, keyword). When the operation check unit 301 determines that the operation information does not satisfy the predetermined condition (threshold, keyword), the recovery unit 303 identifies, on the basis of the operation information and the error codes stored in the storage unit 350, the recovery procedure corresponding to the operation information from among the recovery procedures stored in the storage unit 350, and recovers the failure in the monitored server according to the identified procedure. Failures can therefore be recovered automatically according to the situation in which they occur.

For example, the conditions for checking server operation are changed automatically according to the operating status of the server, and when a failure occurs in a server, recovery can be performed automatically not only according to predefined scenarios but also according to recovery procedures with a proven track record.
In addition, while a server is operating normally, operation checks and checks of resource usage and performance are executed automatically, and when an abnormality is detected, the whole sequence of work from notifying the administrator to recovering the failure that caused the abnormality is carried out automatically. Operation and maintenance of servers and systems is thus automated, which reduces the burden on operators and administrators.
Furthermore, because the sequence of work up to recovery from the failure is carried out automatically, failures can be recovered quickly. As a result, system downtime can be kept down and a stable service can be provided to users.

Note that the present invention is not limited to the above embodiment as it stands; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment. Furthermore, constituent elements across different embodiments may be combined as appropriate.

1000 Failure monitoring and automatic recovery system
101 Windows server
102 Unix server
103 DB (DataBase) server
104 Application server
251 Common monitoring terminal
201 Control unit
301 Operation check unit
302 Performance data collection unit
303 Recovery unit
304 Communication unit
350 Storage unit
351 Execution control DB
352 Result determination table
353 Command table
354 Keyword table
355 Server information table
356 Notification destination table
357 Schedule table
358 Performance DB
359 Result extraction table
360 Error management DB
361 Failure management table
N Network

Claims

An information processing apparatus that recovers a failure that has occurred in an information processing apparatus to be monitored, comprising:
a storage unit that stores, as a history, error information indicating a past failure that occurred in the information processing apparatus to be monitored in association with a recovery procedure for the failure;
an operation check unit that acquires operation information of the information processing apparatus to be monitored from the information processing apparatus to be monitored and performs an operation check to determine whether the acquired operation information satisfies a predetermined condition; and
a recovery unit that, in response to the operation check unit determining that the operation information does not satisfy the predetermined condition, identifies, on the basis of the operation information and the error information stored in the storage unit, a recovery procedure corresponding to the operation information from among the recovery procedures stored in the storage unit, and recovers the failure that has occurred in the information processing apparatus to be monitored according to the identified recovery procedure.

The information processing apparatus according to claim 1, wherein
when the operation check unit determines that the operation information does not satisfy the predetermined condition, the operation check unit notifies a predetermined administrator that a failure has occurred in the information processing apparatus to be monitored, and
the recovery unit recovers the failure that has occurred in the information processing apparatus to be monitored in response to the operation check unit notifying the predetermined administrator that the failure has occurred in the information processing apparatus to be monitored.

The information processing apparatus according to claim 1, wherein the operation check unit determines whether the operation information satisfies the predetermined condition by determining whether the operation information falls within a predetermined threshold range or whether the operation information contains a predetermined keyword.

The information processing apparatus according to any one of claims 1 to 3, wherein
the information processing apparatus monitors the operating status of a plurality of information processing apparatuses to be monitored,
the storage unit stores, as a history, device identification information for identifying each information processing apparatus to be monitored in association with the error information and the recovery procedure,
the operation check unit acquires the operation information and the device identification information from the plurality of information processing apparatuses to be monitored and determines whether the acquired operation information satisfies the predetermined condition, and
the recovery unit, when the operation check unit determines that the operation information does not satisfy the predetermined condition, uses the device identification information as a key and, on the basis of the operation information and the error information stored in the storage unit, recovers the failure that has occurred in the information processing apparatus to be monitored according to the recovery procedure corresponding to device identification information identical to that device identification information.

The information processing apparatus according to claim 4, wherein, when no recovery procedure corresponding to device identification information identical to the device identification information exists, the recovery unit recovers the failure that has occurred in the information processing apparatus to be monitored according to a recovery procedure corresponding to other device identification information.

The information processing apparatus according to claim 1, wherein
the operation check unit acquires operation information from the information processing apparatus to be monitored after the recovery unit has recovered the failure that occurred in the information processing apparatus to be monitored, and determines whether the acquired operation information satisfies the predetermined condition, and
the recovery unit raises the priority of the recovery procedure when the operation check unit determines, with the failure recovered, that the operation information satisfies the predetermined condition.

The information processing apparatus according to any one of claims 1 to 6, wherein, when the operation check unit determines that the operation information satisfies the predetermined condition, the operation check unit performs the operation check at shorter intervals the closer the operation information comes to a state that does not satisfy the predetermined condition.

The information processing apparatus according to any one of claims 1 to 7, further comprising a performance unit that monitors the operating status of the information processing apparatus to be monitored at predetermined intervals and determines whether the operation information satisfies a predetermined condition, wherein
when the performance unit determines that the operation information does not satisfy the predetermined condition, the operation check unit notifies a predetermined administrator that a failure has occurred in the information processing apparatus to be monitored, and
the recovery unit recovers the failure that has occurred in the information processing apparatus to be monitored in response to the operation check unit notifying the predetermined administrator that the failure has occurred in the information processing apparatus to be monitored.

A failure monitoring and automatic recovery method performed by an information processing apparatus that recovers a failure that has occurred in an information processing apparatus to be monitored, the failure recovery method comprising:
an operation check step of acquiring operation information of the information processing apparatus to be monitored from the information processing apparatus to be monitored and performing an operation check to determine whether the acquired operation information satisfies a predetermined condition; and
a recovery step of, in response to a determination in the operation check step that the operation information does not satisfy the predetermined condition, identifying, on the basis of the operation information and error information that indicates a past failure that occurred in the information processing apparatus to be monitored and is stored in a storage unit as a history in association with a recovery procedure for the failure, a recovery procedure corresponding to the operation information from among the recovery procedures stored in the storage unit, and recovering the failure that has occurred in the information processing apparatus to be monitored according to the identified recovery procedure.