JP2006190153A

JP2006190153A - Method and program for failure restoration

Info

Publication number: JP2006190153A
Application number: JP2005002442A
Authority: JP
Inventors: Hajime Onodera; 一小野寺
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2005-01-07
Filing date: 2005-01-07
Publication date: 2006-07-20

Abstract

PROBLEM TO BE SOLVED: To restore normal processing of a business system in the minimum restoration time when a failure occurs in a computer on which the business system runs, by means of a failure restoration method and a failure restoration program. SOLUTION: In this failure restoration method, the failure status is confirmed when a failure occurs, failure resolution methods for resolving the failure are extracted on the basis of the confirmed failure status, the processing time required to resolve the failure is extracted for each extracted failure resolution method, and the extracted processing times are compared to determine the failure resolution method. In this determination, the failure resolution method with the shortest extracted processing time is selected from among the extracted methods. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a failure recovery method and a failure recovery program, and more particularly to a failure recovery method and a failure recovery program capable of restoring a failure location according to the failure status of a computer.

Conventionally, computer systems have automatically recovered from application software failures. For example, monitoring software watches the status of running application software and restarts the application when an abnormality occurs; if the application does not recover after a predetermined number of restarts, the entire system is recovered by restarting the OS or rebooting the computer (see, for example, Patent Document 1).

Patent Document 1: JP 2002-149437 A

However, the prior art configuration described above still leaves room for improvement in recovering from computer failures more quickly and reliably.

Specifically, when a failure occurs in a computer, restarting the application, restarting the OS, or resetting the hardware may not resolve it, and restoration from backup data may be necessary. It is therefore desirable to apply restore processing from backup data to the failure location.

In addition, because a failure may delay processing, it may be necessary to recover by changing the processing path to a normally operating computer.

Furthermore, for both "data recovery at the failure location" and "recovery by changing the processing path", the recovery time until normal processing resumes must be taken into account. In practice, however, the computer administrator has selected the recovery method case by case according to the situation.

Consequently, failure recovery may depend on the administrator's experience, and there has been a problem that failures are not always resolved reliably and in a short time.

The present invention has been made in view of the problems described above, and an object of the invention is to recover from computer failures more quickly and reliably.

Another object of the present invention is to make it possible to extract the optimal failure recovery method based on the failure status of a computer by constantly monitoring the operation of the computer.

A further object is to compare, for each extracted method, the time required to resolve the failure and to extract the recovery method with the shortest time.

A further object of the failure recovery method of the present invention is to apply restore processing from backup data to the failure location when a failure occurs in a computer and restarting the application, restarting the OS, or resetting the hardware still does not resolve it.

Another object is to enable recovery by changing the processing path to a normally operating computer when a failure delays processing.

The failure recovery method of the present invention that solves the above problems comprises: confirming the failure status when a failure occurs; extracting, based on the confirmed failure status, failure resolution methods for resolving the failure; extracting, for each extracted failure resolution method, the processing time required to resolve the failure; and comparing the extracted processing times to determine the failure resolution method.

In the failure recovery method of the present invention, the failure resolution method may be determined by selecting, from among the extracted methods, the one with the shortest extracted processing time. The failure resolution methods may include at least one selected from the group consisting of restarting an application in the computer, rebooting the computer, repairing data in the computer with backup data, and changing the processing path.

The failure recovery program of the present invention that solves the above problems causes a computer to realize: a function of confirming the failure status when a failure occurs; a function of extracting, based on the failure status, failure resolution methods for resolving the failure; a function of extracting, for each extracted method, the processing time required to resolve the failure; and a function of comparing the extracted processing times to determine the failure resolution method.

In this program, the determination of the failure resolution method may be configurable so that the method with the shortest extracted processing time is selected, and the failure resolution method may be selectable from the group consisting of restarting an application in the computer, rebooting the computer, repairing data in the computer with backup data, and changing the processing path.

According to the present invention, the following effects can be obtained.

Computer failures can be recovered from more quickly and reliably.

When a failure occurs in a computer, restore processing from backup data can be applied to the failure location, in addition to restarting the application, restarting the OS, or resetting the hardware.

Because a failure may delay processing, recovering by changing the processing path to a normally operating computer makes recovery more reliable.

Furthermore, between "data recovery at the failure location" and "recovery by changing the processing path", the method with the shorter recovery time until normal processing resumes is selected as appropriate, so quicker recovery can be applied smoothly.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings. The present invention is not limited to the embodiments described below, and may be modified and combined as appropriate within the scope of its gist.

FIG. 1 is a schematic configuration diagram illustrating a preferred example of a computer system to which the failure recovery method and failure recovery program of the present invention are applied.

In FIG. 1, 1 is a backup system with an operation monitoring function, 5 is an operation monitoring program, 6 is a backup program, 7 is a failure recovery plan creation program, 8 is an information management table, 9 is backup data, 30 is a network such as the Internet or an intranet, 52 is a computer, 60 is a business system, 61 is an application, and 62 is data.

As illustrated, the business system 60 is normally composed of a plurality of computers 52. The backup system 1 with an operation monitoring function has the operation monitoring program 5, the backup program 6, the failure recovery plan creation program 7, the information management table 8, which contains PC management information, process management information, backup information, failure information, affected range information, and the like as needed, and the backup data 9.

The operation monitoring program 5 monitors the operating status of the computers 52 connected through the network 30, which serves as the information communication line and includes the Internet or an intranet, as well as the operating status of the application 61 and of the business system 60. In this embodiment, the operation monitoring program 5 has a function of comparing monitoring information with the failure information in the information management table 8.

The backup program 6 periodically acquires backups of the computer 52, the application 61, and the data 62, stores them as backup data 9, stores backup information in the information management table 8, and provides a restore function. Because the operation monitoring results and the backup results use the same information management table, duplicated management effort is avoided. In addition, when operation monitoring detects an abnormality, restoring the backup data reduces the manual work needed for failure recovery.

FIG. 2 shows a preferred example of the tables stored in the information management table.

The process management master table 150 is used to make clear which applications each business process requires. For example, process No. S-001 is stored with AAA-APP1 and BBB-APP1, and process No. S-002 with AAA-APP1 and AAA-APP2. As illustrated, prefixing each application name listed for a process with the computer name associates which application on which computer is needed for which business process, which is convenient for day-to-day management as well as during failures.
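The naming convention described for the process management master table 150 can be sketched in Python as follows. This is only an illustration: the dictionary layout and the helper names `computer_of`, `processes_using`, and `computers_for` are assumptions, and only the process numbers and application names come from the example above.

```python
# Process management master table 150 (entries taken from the example above).
# Each application name is prefixed with the name of the computer it runs on.
PROCESS_MASTER = {
    "S-001": ["AAA-APP1", "BBB-APP1"],
    "S-002": ["AAA-APP1", "AAA-APP2"],
}


def computer_of(app_name: str) -> str:
    """Return the computer-name prefix of an application name (hypothetical helper)."""
    return app_name.split("-", 1)[0]


def processes_using(app_name: str) -> list[str]:
    """Return the process numbers that require the given application."""
    return sorted(p for p, apps in PROCESS_MASTER.items() if app_name in apps)


def computers_for(process_no: str) -> set[str]:
    """Return the computers that a business process depends on."""
    return {computer_of(app) for app in PROCESS_MASTER[process_no]}
```

With this layout, a failure of AAA-APP1 can be traced to both S-001 and S-002, while a failure of BBB-APP1 touches only S-001.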

The backup management master table 200 stores, for each periodic backup, the computer name (AAA or BBB in the PC column of the figure), the OS (operating system) information on that computer (AAA-OS and BBB-OS in the OS column), the application information (AAA-APP1, APP2, BBB-APP1, and APP2 in the application column), and the data information (AAA-DATA1 and BBB-DATA1 in the data column).

These data are used to restore the backup data when a failure occurs.

The backup management master table 200 can also be used to manage which data corresponds to which PC and application; when a failure occurs, the necessary processing is performed based on this information.

The failure information table 250 stores, for each failure that may occur on the computers, the names of the affected applications (AAA-APP1 and AAA-APP2 in the figure), the data name 270 (AAA-DATA1 in the figure), the recovery procedure (RD-001 or RD-002 in the figure), and the failure recovery time, namely the recovery time 260 by data restoration (60 minutes or 5 minutes in the figure).

When the operation monitoring result matches the information recorded under "PC" 251, "Failure 1" 252, or "Failure 2" 253 in the failure information table, a failure is determined to have occurred and the failure information is identified. The failure information table 250 then gives the applications affected by the detected failure, the recovery method, and the recovery time.
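A minimal sketch of this matching step in Python is given below. Several things here are assumptions rather than content of the figure: the pattern strings in the "Failure 1" and "Failure 2" columns, the pairing of applications, procedures, and times within each row, and the rule that a row matches when the PC name and at least one failure pattern appear in the monitoring text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureRow:
    failure_id: str       # failure information, e.g. D-001
    pc: str               # "PC" column 251
    failure1: str         # "Failure 1" column 252 (pattern text is hypothetical)
    failure2: str         # "Failure 2" column 253 (pattern text is hypothetical)
    applications: tuple   # affected applications
    data_name: str        # data name 270
    procedure: str        # recovery procedure
    restore_minutes: int  # recovery time 260 by data restoration


# Illustrative rows; how the values are paired per row is an assumption.
FAILURE_TABLE = [
    FailureRow("D-001", "AAA", "APP1 no response", "DATA1 read error",
               ("AAA-APP1", "AAA-APP2"), "AAA-DATA1", "RD-001", 60),
    FailureRow("D-002", "AAA", "APP2 no response", "APP2 abnormal exit",
               ("AAA-APP2",), "AAA-DATA1", "RD-002", 5),
]


def identify_failure(log_text: str) -> Optional[FailureRow]:
    """Return the first row whose PC and either failure pattern occur in the text."""
    for row in FAILURE_TABLE:
        if row.pc in log_text and (row.failure1 in log_text or row.failure2 in log_text):
            return row
    return None  # no match: regarded as normal operation
```

A returned row carries the affected applications, the recovery procedure, and the restore time in one place, which is what the later comparison step needs.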

The affected range information table 300 is provided to show, for each piece of failure information (D-001, D-002), the range over which otherwise normal computer processing stops. It stores the processes in the process management master table 150 that are affected by the failure information, together with the recovery procedure and the recovery time 310 by processing path change.

That is, when the failure is determined to be failure information D-001, the affected range information table 300 shows that the affected range is E-001 and that business processes No. S-001 and S-002 are affected. The table also shows that the procedure for recovering the affected business processes by changing the processing path is recovery procedure RD-001 and that the recovery time in that case is 5 minutes.
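The D-001 lookup just described could be represented as follows. The `AffectedRange` type and the function name `lookup_affected_range` are hypothetical; the single row holds exactly the values stated above, and no row is invented for D-002 because its contents are not given in the text.

```python
from typing import NamedTuple, Optional


class AffectedRange(NamedTuple):
    range_id: str         # affected range, e.g. E-001
    processes: tuple      # business processes that stop
    procedure: str        # recovery procedure for the processing path change
    reroute_minutes: int  # recovery time 310 by processing path change


# Affected range information table 300, keyed by failure information.
AFFECTED_RANGE_TABLE = {
    "D-001": AffectedRange("E-001", ("S-001", "S-002"), "RD-001", 5),
}


def lookup_affected_range(failure_id: str) -> Optional[AffectedRange]:
    """Return the affected range for a failure, or None if none is registered."""
    return AFFECTED_RANGE_TABLE.get(failure_id)
```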

In this way, identifying the failure information from the operation monitoring result identifies the affected range in the affected range information table, and thus the processes that stop because of the failure location.

The recovery plan table 400 stores, for each recovery plan (Y-001, Y-002), the associated failure information (D-001, D-002) and affected range (E-001, E-002).

This makes it possible, using the failure information as a trigger, to compare the recovery time by data restoration to the failure location, stored in the failure information table 250, with the recovery time by processing path change, stored in the affected range information table.

FIG. 3 is a flowchart outlining the processing from operation monitoring, by the backup system 1 with an operation monitoring function, of the computers 52 of the business system 60 and the applications 61 running on them, through abnormality detection and identification of the failure location and affected range, to failure recovery. The flowchart of FIG. 3 is described next.

The operation monitoring program monitors the business system composed of a plurality of computers and searches the failure information table for failure information (step 500). If the operation information does not match "PC", "Failure 1", or "Failure 2" in the failure information table, operation is regarded as normal and it is determined that there is no abnormality. If the failure information matches, an abnormality is detected. "Failure 1" and "Failure 2" record types of failure attributable to the PC; a match with either one is treated as abnormality detection.

When the operation monitoring program searches the failure information table and the monitoring result matches "PC", "Failure 1", or "Failure 2", it searches the table for the failed application and the recovery data and identifies the failure information and the recovery procedure (step 510).

Next, the affected range information table is searched, and the affected range is identified from it based on the failure information (step 520).

Next, the recovery time by restoration obtained from the failure information table is compared with the recovery time by processing path change obtained from the affected range information table. If the "recovery time by restoration" is shorter than the "recovery time by processing path change", the decision is Yes; if it is longer, the decision is No. In other words, this step determines which of "restoring backup data to the failure location" and "changing the processing path away from the failure location" to apply as the recovery method with the shorter recovery time (step 530). The direction of the comparison may be reversed, in which case Yes and No are interchanged.

In short, this step only needs to select whichever method has the shorter processing time. When the processing times are equal, the choice may be made according to the nature and circumstances of the related business process. For example, a failure in a process that prioritizes data consistency could be recovered by restoring backup data, while a failure in a process that prioritizes the speed of handoff to the next process could be recovered by changing the processing path away from the failure location.
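The comparison in step 530, including the tie-breaking idea, can be sketched in Python as below. The enum and function names are assumptions for illustration, and the tie-break rule encodes only the two example policies mentioned above; a real deployment could decide ties differently.

```python
from enum import Enum


class Method(Enum):
    RESTORE = "restore backup data to the failure location"        # step 540
    REROUTE = "change the processing path away from the failure"   # step 550


class Priority(Enum):
    DATA_CONSISTENCY = "data consistency"
    HANDOFF_SPEED = "speed of handoff to the next process"


def choose_method(restore_minutes: int, reroute_minutes: int,
                  priority: Priority) -> Method:
    """Step 530: pick the method with the shorter recovery time.

    On a tie, fall back to the priority of the affected business process
    (an illustrative policy, not the only possible one).
    """
    if restore_minutes < reroute_minutes:
        return Method.RESTORE
    if reroute_minutes < restore_minutes:
        return Method.REROUTE
    if priority is Priority.DATA_CONSISTENCY:
        return Method.RESTORE
    return Method.REROUTE
```

For instance, a 60-minute restore against a 5-minute path change selects the path change regardless of the priority, because the priority is consulted only when the two times are equal.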

Once the recovery method has been decided by comparing the recovery times, either the backup data is restored to the failure location (step 540) or the processing path is changed away from the failure location (step 550).

FIG. 4 is a flowchart outlining the processing in which the operation monitoring program generates the computer's operation results as log (LOG) information and searches the failure information with it.

The flowchart of FIG. 4 is described next.

When a computer to be monitored is registered in the monitoring program (step 600), the program polls the monitored computer (step 610) and generates text data that serves as LOG information recording the monitoring result (step 620).

If polling produces no response, an abnormality is determined at this point.

Next, the generated text data, which is the monitoring result of the computer, is used to search the failure information in the failure information table and is checked against "PC", "Failure 1", or "Failure 2" (step 630); a match is treated as detection of a failure. When an abnormality is detected, the process moves to recovery.

If the text data matches no item in the failure information, it is determined that there is no abnormality and the process returns to step 610.
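One pass through steps 610 to 630 might look like the following Python sketch. The `poll` callable, the log text format, the `NO_RESPONSE` marker, and the simplified pattern dictionary are all assumptions made for illustration; the patent does not specify how polling or the text search is implemented.

```python
from typing import Callable, Iterable, Optional

NO_RESPONSE = "NO_RESPONSE"  # hypothetical marker for a computer that did not answer


def monitor_once(computers: Iterable[str],
                 poll: Callable[[str], Optional[str]],
                 failure_patterns: dict[str, str]) -> dict[str, Optional[str]]:
    """Run steps 610-630 once for every registered computer (step 600).

    Returns, per computer, the matched failure ID, NO_RESPONSE when polling
    got no answer, or None when no abnormality was found.
    """
    results: dict[str, Optional[str]] = {}
    for name in computers:
        reply = poll(name)                      # step 610: poll the computer
        if reply is None:
            results[name] = NO_RESPONSE         # no answer is already an abnormality
            continue
        log_text = f"{name}: {reply}"           # step 620: text data used as LOG
        results[name] = next(                   # step 630: search failure information
            (fid for fid, pattern in failure_patterns.items() if pattern in log_text),
            None,
        )
    return results
```

A caller would repeat `monitor_once` on a timer and move to recovery for any computer whose result is not `None`, which corresponds to returning to step 610 when nothing matches.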

FIG. 5 outlines the procedure from failure detection to recovery by restoring backup data. The flowchart of FIG. 5 is described next.

As described for the flowchart of FIG. 4, the failure recovery program searches the failure information in the failure information table using the text data of the monitoring result. When a failure is detected (*1 in FIG. 4 corresponds to *1 in FIG. 5), the backup data in the backup management master table is identified according to the recovery procedure in the failure information table, so that the backup data can be designated from the search result (step 710).

Next, the failure target is identified from the failure information table so that it can be designated from the search result (step 720).

In this flowchart, the failure recovery program is then instructed to perform the restore specified by the recovery procedure in the failure information table (step 730).

The failure recovery program restores the backup data to the failure target (step 740).
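Steps 710 to 740 can be sketched in Python as follows. The table contents follow the description of the backup management master table 200, but the dictionary layout, the shape of the failure row, the function names, and the caller-supplied `restore` callable are assumptions; the actual restore mechanism is not specified in the text.

```python
# Backup management master table 200 (names taken from the description of FIG. 2).
BACKUP_MASTER = {
    "AAA": {"os": "AAA-OS", "applications": ["AAA-APP1", "AAA-APP2"],
            "data": ["AAA-DATA1"]},
    "BBB": {"os": "BBB-OS", "applications": ["BBB-APP1", "BBB-APP2"],
            "data": ["BBB-DATA1"]},
}


def plan_restore(failure_row: dict) -> dict:
    """Steps 710-730: identify backup data and failure target, build the instruction."""
    backup = BACKUP_MASTER[failure_row["pc"]]   # step 710: identify the backup data
    target = failure_row["data"]                # step 720: identify the failure target
    if target not in backup["data"]:
        raise LookupError(f"no backup recorded for {target}")
    return {                                    # step 730: restore instruction
        "procedure": failure_row["procedure"],
        "pc": failure_row["pc"],
        "target": target,
    }


def run_restore(instruction: dict, restore) -> str:
    """Step 740: hand the instruction to a caller-supplied restore function."""
    restore(instruction["pc"], instruction["target"])
    return (f"{instruction['procedure']}: restored "
            f"{instruction['target']} on {instruction['pc']}")
```

Separating planning from execution keeps the table lookups testable without touching real backup storage.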

When "changing the processing path away from the failure location" is selected, the processing path is changed away from the failure location according to the recovery procedure in the affected range information table 300, so that processing continues without being affected by the failure location.

The flowchart for this case is shown in FIG. 6. Steps with the same step numbers as in FIG. 5 have the same content as described for FIG. 5 and are not described again; only the procedure for changing the processing path is described. FIG. 6 outlines the procedure from failure detection to recovery by changing the processing path. The flowchart of FIG. 6 is described next.

As described for the flowchart of FIG. 4, the failure recovery program searches the failure information in the failure information table using the text data of the polling monitoring result, and a failure is detected (*2 in FIG. 4 corresponds to *2 in FIG. 6). Next, the failed server is identified from the failure information table so that it can be designated from the search result (step 820). In this flowchart, the failure recovery program then instructs that the failed server be stopped and that the processing path be changed according to the recovery procedure (step 830).

The failure recovery program changes the processing path to one that avoids the failure location (step 840).

In the failure recovery by processing path change and restoration according to the present invention, the failure information and affected range are identified from the results of monitoring the computers, such as the polling of monitored computers and the monitoring result LOG in FIG. 4, and the optimal recovery means is extracted. Moreover, rather than selecting only one of processing path change and restoration, the present invention may perform failure recovery by combining the two, depending on how the failure information is configured.

In the above embodiment, recovery times are compared for a single failure when deciding on failure recovery. Needless to say, when multiple failures occur, the failure recovery processing may be decided based on the total values for those failures.
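One possible reading of the multiple-failure case is sketched below in Python: the restore times and the path-change times are each summed across the failures and the two totals are compared. That interpretation of "total values", the function name, and the string results are assumptions; the text does not define how the totals are formed.

```python
from typing import Sequence, Tuple


def choose_for_failures(failures: Sequence[Tuple[int, int]]) -> str:
    """Compare summed recovery times across several failures.

    Each item is (restore_minutes, reroute_minutes) for one failure.
    Returns "restore", "reroute", or "tie".
    """
    total_restore = sum(restore for restore, _ in failures)
    total_reroute = sum(reroute for _, reroute in failures)
    if total_restore < total_reroute:
        return "restore"
    if total_reroute < total_restore:
        return "reroute"
    return "tie"  # left to a policy based on the nature of the business process
```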

FIG. 1 is a configuration diagram of the "backup system with an operation monitoring function".
FIG. 2 is a detailed view of the information management table.
FIG. 3 is a flowchart of recovery processing following abnormality detection by operation monitoring.
FIG. 4 is a flowchart of operation monitoring processing.
FIG. 5 is a flowchart of restore processing.
FIG. 6 is a flowchart of processing path change.

Explanation of symbols

1 Backup system with an operation monitoring function
5 Operation monitoring program
6 Backup program
7 Failure recovery plan creation program
8 Information management table
9 Backup data
30 Internet / intranet
52 Monitored computer
60 Business system
61 Application
62 Data
150 Process management master table
200 Backup management master table
250 Failure information table
251 PC
252 Failure 1
253 Failure 2
300 Affected range information table
500 Operation monitoring
510 Search of the failure information table for the failure target and recovery data
520 Search of the affected range information table for the affected range
530 Comparison of recovery times
540 Restoration of backup data to the failure location
550 Processing path change away from the failure location
600 Registration of the monitored computer in the monitoring program
610 Polling of the monitored computer
620 Generation of text data serving as LOG information for the monitoring result
630 Search of failure information using the text data
710 Designation of backup data
720 Designation of the failure target
730 Restore instruction from the recovery procedure
740 Restoration of backup data to the failure target
810 Search of failure information using the text data
820 Designation of the failed server
830 Instruction of processing path change by the failure recovery program
840 Change of the processing path away from the failure location

Claims

1. A failure recovery method comprising: confirming a failure status when a failure occurs; extracting, based on the confirmed failure status, failure resolution methods for resolving the failure; extracting, for each of the extracted failure resolution methods, a processing time required to resolve the failure; and comparing the extracted processing times to determine the failure resolution method.

2. The failure recovery method according to claim 1, wherein the failure resolution method is determined by selecting, from among the extracted methods, the method with the shortest extracted processing time.

3. The failure recovery method according to claim 1 or 2, wherein the failure resolution method includes at least one selected from the group consisting of restarting an application in a computer, rebooting the computer, repairing data in the computer with backup data, and changing a processing path.

4. A failure recovery program for causing a computer to realize: a function of confirming a failure status when a failure occurs; a function of extracting, based on the failure status, failure resolution methods for resolving the failure; a function of extracting, for each extracted method, a processing time required to resolve the failure; and a function of comparing the extracted processing times to determine the failure resolution method.

5. The failure recovery program according to claim 4, wherein the determination of the failure resolution method is configurable so that the method with the shortest extracted processing time is selected.

6. The failure recovery program according to claim 4 or 5, wherein the failure resolution method is selectable from the group consisting of restarting an application in the computer, rebooting the computer, repairing data in the computer with backup data, and changing a processing path.