JP2005157462A

JP2005157462A - System switching method and information processing system

Info

Publication number: JP2005157462A
Application number: JP2003390970A
Authority: JP
Inventors: Masaki Kawashima; 政規川嶋; Satoshi Oikari; 智史大碇
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2003-11-20
Filing date: 2003-11-20
Publication date: 2005-06-16

Abstract

PROBLEM TO BE SOLVED: To provide an information processing system that performs system switching and keeps the service stop period as short as possible even when system switching fails because of a program failure.

SOLUTION: A remote machine 200 holds a database 210 that stores generation-managed software and its definitions, and a database 220 that associates each program with its operation record. A first device serving as the active system and a second device serving as the standby system each have a system switching function (11, 21) with a recovery destination information storage part (14, 24), and a program recovery function part (12, 22) with a failure-specific recovery table (15, 25). When a program failure in the first device triggers system switching to the second device, the failed program is restored on a third device. When the switching fails because the same failure occurs in the second device, processing is switched to the third device, which is the recovery destination of the program.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a system switching method for an information processing system composed of an active system device and a standby system device, and to the information processing system itself. In particular, it relates to a system switching method and an information processing system suitable for use when a software failure occurs in the active system device.

In general, an information processing system that performs system switching provides two devices offering the same function or service. The device that mainly operates is the active system device and the other is the standby system device, and the two recognize each other by communicating with one another. In such a system, when a failure occurs in the active system device, the active system device sends failure information to the standby system device. The standby system device that receives it becomes the active system device and starts operating as the system, while the device in which the failure occurred is controlled so as to become the standby system device. This prevents the service from stopping.

Techniques described in, for example, Patent Document 1 and Patent Document 2 are known as prior art relating to information processing systems that perform system switching.
Patent Document 1: JP 11-96033 A
Patent Document 2: JP-A-6-131208

In an information processing system that performs system switching, the active system device and the standby system device may be switched because of a hardware failure or because of a failure in a program that makes up the software. In the prior art described above, the same version of a program must be installed on both devices so that the active system device and the standby system device provide the same function or service. When the cause of a program failure in the active system device lies on the service request side, such as in user data, a system with the same program version on both devices can resolve the failure by measures such as restarting the program. When the problem lies in the program itself, however, a software failure with the same cause as the one in the active system device also occurs at the switching destination after the standby system device has taken over as the active system device. The switching therefore fails, and the service has to remain stopped until the cause is identified.

An object of the present invention is to solve the problems of the prior art described above and to provide a system switching method and an information processing system that can keep the service stop period as short as possible even when switching fails because the failure is caused by a program failure.

According to the present invention, this object is achieved by a system switching method for an information processing system that is composed of a first device serving as the active system device and a second device serving as the standby system device, and that switches operation to the second device when the first device fails. In this method, when a failure caused by a program failure occurs, the first device performs system switching that transfers its operation to the second device and causes a device that recovers programs to recover the failed program. If, after this system switching, the second device suffers the same failure as the first device and the system switching fails, the second device performs system switching that transfers its operation to the device that recovers the program.

The object is also achieved by an information processing system that is composed of a first device serving as the active system device and a second device serving as the standby system device, and that switches operation to the second device when the first device fails. In this system, the first and second devices each include a system switching function unit and a program recovery function unit. The system switching function unit holds the state of the recovery destination on which a failed program is recovered and has means for performing system switching when a failure occurs. The program recovery function unit has a failure-specific recovery table that associates failure information with the programs to be recovered, accesses a recovery process management database that associates each program with its recovery processing and operation record, issues a recovery command to the device that recovers the program, and has a function that, on receiving a recovery command, sets the program and its definitions in the operating environment.

According to the present invention, even when system switching fails because of a program failure, a program or environment that operates reliably can be deployed to the designated recovery destination device, which is thereby placed in a hot standby state. Switching the system to the recovery destination device then shortens the service stop period.

Embodiments of the system switching method and the information processing system according to the present invention are described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of an information processing system according to an embodiment of the present invention. In FIG. 1, 1 is a communication path; 11, 21, and 31 are switching function units; 12, 22, and 32 are program recovery function units; 13, 23, and 33 are operating environment units; 14, 24, and 34 are recovery destination state holding units; 15, 25, and 35 are failure-specific recovery tables; 110 is a first device; 120 is a second device; 130 is a third device; 200 is a remote machine; 210 is a recovery resource management database; and 220 is a recovery process management database.

The information processing system according to an embodiment of the present invention consists of a first device 110, a second device 120, a third device 130, and a remote machine 200, each of which is a computer, connected by the communication path 1. In the illustrated embodiment, the first device 110 is the active system device, the second device 120 is the standby system device in a hot standby state, and the third device 130 is the device on which programs are recovered. Because the third device 130 is reserved for program recovery, it is in a cold standby state (a state in which it is executing other processing). The remote machine 200 is a device dedicated to program maintenance; the recovery process management database 220 and the recovery resource management database 210 are created and held on it.

The first device 110 and the second device 120 each include a switching function unit 11, 21. The switching function unit has a function for sending and receiving failure information and information on the device that recovers the program, a function for detecting program failures in the operating environment unit 13, 23, and a recovery destination state holding unit 14, 24 that holds whether the device on which the program is recovered can be switched to. With the switching function units 11 and 21, when a program failure occurs in the operating environment unit 13 of the first device 110, for example, the processing that the first device 110 was performing can be switched to the operating environment unit 23 of the second device 120. The first device 110 and the second device 120 also each include a program recovery function unit 12, 22, which has a failure-specific recovery table 15, 25 that stores failure information in association with the programs to be recovered and which uses the information in this table to recover the failed program. The program recovery function units 12 and 22 send a recovery command over a data transfer line 42 to the third device 130, the device on which the program is recovered. Information on the device that recovers the program is set in the program recovery function units 12 and 22.

As explained in the processing operations described later, the third device 130, the device on which the program is recovered, also takes over the processing of the first device 110 or the second device 120 when a program failure has occurred there. Like the first and second devices, it therefore includes a switching function unit 31 with a recovery destination state holding unit 34, a program recovery function unit 32 with a failure-specific recovery table 35, and an operating environment unit 33.

The third device 130 is not essential to the embodiment; the present invention can be configured without it. In that case, as explained in the processing operations described later, when a program failure occurs in the first device 110 serving as the active system device and its processing has been transferred to the second device 120 serving as the standby system device, the first device 110 is made to operate as the device on which the program is recovered.

Although not shown in the figure, the first device 110, the second device 120, the third device 130, and the remote machine 200, each of which is a computer, need only include a CPU that controls the whole device, a memory that stores programs, data, and the like so that the CPU can use them, an external storage device such as a hard disk, input devices such as a keyboard and a mouse, and a display. Each functional unit in the devices shown in FIG. 1 is realized by the CPU executing a program stored in the memory.

FIG. 2 shows the structure of the data stored in the recovery process management database 220 in the remote machine 200, which is described next.

The data in the recovery process management database 220 associates a category 201 that classifies programs by function, a program name 202 classified by version, version information 203, definition information 204 needed to run the program, and a command 205 needed to build the operating environment. An operation record 206 is also managed in this data and is set for each program. In the illustrated example, an operation record 206 of A indicates a recoverable state, B indicates that the program is in operation, C indicates that the program has a failure record, and D (not shown) indicates a program that should not be recovered. The operation record is updated each time the system switching function of the active system detects a program failure. The associated command deploys the program definitions shown in FIG. 3, described later, and recovers the program; executing this command builds the operating environment.
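The record layout described for FIG. 2 can be pictured with a minimal sketch such as the following. The class and field names, the category label, and the definition and command values are assumptions made for illustration; only the A to D codes and the Prog_A version 1.1 row with record B come from the description.

```python
from dataclasses import dataclass
from enum import Enum


class OperationRecord(Enum):
    """Operation record 206 codes as described for FIG. 2."""
    RECOVERABLE = "A"      # can be used as a recovery target
    IN_OPERATION = "B"     # currently running
    FAILED = "C"           # has a failure record
    DO_NOT_RECOVER = "D"   # should not be recovered


@dataclass
class RecoveryProcessRecord:
    category: str            # category 201 (function of the program)
    program_name: str        # program name 202
    version: str             # version information 203
    definition: str          # definition information 204 (hypothetical value)
    command: str             # command 205 (hypothetical value)
    record: OperationRecord  # operation record 206


# Hypothetical rows; only "Prog_A version 1.1 with record B" is taken from the text.
recovery_process_db = [
    RecoveryProcessRecord("Cat_1", "Prog_A", "1.0", "def_A_1.0", "setup_A_1.0",
                          OperationRecord.RECOVERABLE),
    RecoveryProcessRecord("Cat_1", "Prog_A", "1.1", "def_A_1.1", "setup_A_1.1",
                          OperationRecord.IN_OPERATION),
]
```

An enumeration is used here only so that the four record codes cannot be confused with arbitrary strings; the description does not prescribe any particular storage format.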

FIG. 3 shows the structure of the data stored in the recovery resource management database 210 in the remote machine 200, which is described next.

The data stored in the recovery resource management database 210 consists of programs 221 to 224, which are generation-managed by program version, and configuration definitions 225 to 228, which the programs need in order to run. Each of the programs 221 to 224 includes its module configuration, the user application name, and the build (setup) command needed for the program to run. The command depends on the program being recovered; when management is per software product, it may be a setup command supplied by the software or a shell program that combines several commands. The programs 221 to 224 and the configuration definitions 225 to 228 are packaged so that a program runs as soon as the associated program and definitions are deployed. The form of the definitions also depends on the program being recovered; in this embodiment, they are definitions to be deployed to a particular location (deployment destination and configuration).

The switching processing performed when a program failure occurs in this embodiment is described next in terms of four processes: the process in which the first device 110, the active system, detects a program failure, notifies the second device 120, the standby system, of the failure, and sends a program recovery command to the third device 130, the recovery system; the process of identifying the program to be recovered according to the failure information; the process performed when the second device 120 has received the failure information and started providing the service as the active system and a program failure then occurs; and the process in which the third device 130 recovers the program.

FIG. 4 is a flowchart of the switching processing performed when the system switching function of the first device 110 detects a program failure. This is described first.

(1) When the first device 110 detects a program failure, it notifies the second device 120 of the failure information and of information on the third device 130, on which the program is to be recovered, and causes the second device to execute the switching processing (steps 401, 402, 408).

(2) Next, based on the failure information detected in step 401, the first device 110 selects the program to recover and determines whether such a program could be identified. The program is identified from the operation records in the data described with reference to FIG. 2: a recoverable program is selected, or, if none is recoverable, a program with a failure record is selected as a last resort (steps 403, 404).

(3) If step 404 finds no recoverable program, the failure processing of the active system ends here.

(4) If step 404 finds a recoverable program, the first device determines whether the device on which the program is to be recovered is its own machine and whether the program identified in step 403 is running. The recovery device is the own machine when the third device 130 does not exist in the system shown in FIG. 1 or when the third device 130 cannot be used for some reason (step 405).

(5) If step 405 finds that the recovery device is the own machine and that the program identified in step 403 is running, the own device, here the first device 110, stops its operating environment and thereby stops the target program (step 406).

(6) If step 405 finds that the recovery device is not the own machine and that the program identified in step 403 is not running, or after the processing of step 406, a recovery command containing the information on the program to recover obtained in step 403 is sent to the recovery device. The command carries information on the second device 120, which is the switching destination, the identified program, its definitions, and the build command. The device that receives the recovery command is the first device 110 itself when step 406 was performed, and the third device 130 in the cold standby state when step 405 found that the recovery device is not the own machine and the program identified in step 403 is not running (step 407).
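The order of steps 401 to 407 can be sketched as a function that returns the actions the active device would take. This is an illustration under stated assumptions, not the patented implementation: the function name, the action tuples, and the two callbacks (`select_program` for step 403 and `program_running` for the check in step 405) are hypothetical.

```python
def handle_program_failure(failure_id, recovery_device, self_device, standby_device,
                           select_program, program_running):
    """Return the ordered actions of the active device after it detects a failure."""
    # Step 402: tell the standby device which failure occurred and where recovery happens.
    actions = [("notify_standby", standby_device, recovery_device, failure_id)]

    # Steps 403-404: pick the program to recover; stop here if none can be identified.
    program = select_program(failure_id)
    if program is None:
        return actions

    # Steps 405-406: if recovery happens on this machine and the program is running,
    # the operating environment is stopped first.
    if recovery_device == self_device and program_running(program):
        actions.append(("stop_operating_environment", self_device))

    # Step 407: send the recovery command, including the switching destination.
    actions.append(("send_recovery_command", recovery_device, program, standby_device))
    return actions
```

Called with a separate recovery device, the sketch yields the notification followed by the recovery command; when the recovery device is the failing machine itself, the stop action is inserted between them.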

FIG. 5 shows an example of the structure of the failure-specific recovery tables 15 and 25 included in the program recovery function, which is described next.

The failure-specific recovery table shown in FIG. 5 stores failure information in association with the programs to be recovered. Failures A to C in records 501 to 503 of FIG. 5 may be information obtained from log information, as in the prior art, or information reported when a program calls an interface provided by the system switching function. In this embodiment, a failure can also be associated with several programs to be recovered, as in record 503.
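A failure-specific recovery table of this kind can be sketched as a simple mapping. The failure identifiers and program names below are hypothetical, since the contents of FIG. 5 are not reproduced in the text; the only property taken from the description is that one failure may map to several programs.

```python
# Hypothetical contents: failure identifiers mapped to the programs to recover.
failure_recovery_table = {
    "failure_A": ["Prog_A"],
    "failure_B": ["Prog_B"],
    "failure_C": ["Prog_A", "Prog_B"],  # several programs for one failure, as in record 503
}


def programs_for_failure(failure_id):
    """Return the programs associated with a failure, or an empty list if it is unknown."""
    return list(failure_recovery_table.get(failure_id, []))
```

Returning a copy of the list keeps callers from modifying the table by accident.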

FIG. 6 is a flowchart of the processing in which the program recovery function unit 12 of the first device identifies the program to be recovered from the failure information, which is described next. This processing is the detail of step 403 described with reference to FIG. 4.

(1) The program recovery function unit 12 refers to the failure-specific recovery table 15 using the failure information sent from the system switching function 11 and identifies the program in which the failure occurred in the active system (step 601).

(2) Next, so that the failed program will not itself be selected as the program to recover, the recovery process management database 220 shown in FIG. 2 is searched for the record whose operation record is "in operation" to find the corresponding program (step 602).

(3) The operation record associated with that program is then reset over the data transfer line 41 shown in FIG. 1. In this embodiment, as shown in FIG. 2, the operation record of version 1.1 of program Prog_A is B, indicating that it is in operation. This record, which matches the program identified in step 602 and has an operation record of B, is retrieved, and the operation record of this version of the program is changed from B to C, indicating a failure record (step 603).
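Steps 601 to 603 amount to a lookup followed by a state change from B to C. A minimal sketch, assuming the database rows are plain dictionaries with hypothetical keys `program`, `version`, and `record`:

```python
def mark_failed(db, failure_table, failure_id):
    """Change the record of the running version of each failed program from B to C."""
    failed_programs = failure_table.get(failure_id, [])       # step 601
    updated = []
    for row in db:
        # Step 602: find the record of the failed program that is in operation (B).
        if row["program"] in failed_programs and row["record"] == "B":
            row["record"] = "C"                                # step 603
            updated.append((row["program"], row["version"]))
    return updated
```

With a database holding Prog_A version 1.0 (record A) and version 1.1 (record B), a failure mapped to Prog_A changes only the version 1.1 row.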

FIG. 7 is a flowchart of the processing that obtains the version of the program to recover, based on the information on the program to be recovered obtained by the processing of FIG. 6. This is described next.

In this processing, the program recovery function unit 12 obtains the operation record of a candidate program from the recovery process management database 220 and repeats until it confirms either that a recoverable program exists or that no program to recover exists (steps 1001 to 1003).

For example, because the processing described with reference to FIG. 6 changed the operation record of version 1.1 of program Prog_A from B to C, the determination in step 1003 is NO. The preceding version is then identified in step 1002, and its operation record is evaluated in steps 1001 and 1002. In this embodiment, the recovery process management database 220 also classifies programs into categories by function in addition to version, so when no preceding version of the program exists, for example, another program with an equivalent function (the same category) can be identified instead. When a program is judged to be recoverable, the processing that identifies the recoverable program ends and step 407 in the flow of FIG. 4 is executed.
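The search for a recoverable version can be sketched as a walk back through earlier versions with a fallback to the same category. The row keys, the dotted numeric version format, and the order in which fallback candidates are tried are assumptions; the flowchart of FIG. 7 may order its checks differently.

```python
def version_key(version):
    """Turn a dotted version such as '1.10' into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def select_recoverable(db, program, failed_version):
    """Return the row to restore, or None if nothing is recoverable."""
    # Walk back through earlier versions of the same program, newest first.
    older = sorted(
        (r for r in db
         if r["program"] == program
         and version_key(r["version"]) < version_key(failed_version)),
        key=lambda r: version_key(r["version"]),
        reverse=True,
    )
    for row in older:
        if row["record"] == "A":   # A means recoverable
            return row

    # Fall back to another program with an equivalent function (same category).
    categories = {r["category"] for r in db if r["program"] == program}
    for row in db:
        if (row["category"] in categories and row["program"] != program
                and row["record"] == "A"):
            return row
    return None
```

In the Prog_A example, once version 1.1 has been marked C, the sketch returns the version 1.0 row if its record is A.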

FIG. 8 is a flowchart of the processing performed by the second device 120 when it receives the failure information sent from the first device 110, and FIG. 9 shows the data structure of the recovery destination state holding units 14 and 24. These are described next. The processing shown in FIG. 8 is the detail of step 408 described with reference to FIG. 4.

(1) The system switching function 21 of the second device 120 receives the failure information and the information on the third device 130, the recovery destination device, sent in step 402 of the active-system failure processing described with reference to FIG. 4, and sets the received information in the recovery destination state holding unit 24. As shown in FIG. 9, the data held in the recovery destination state holding unit 24 consists of information on the recovery destination device, a flag indicating whether switching to the recovery destination is permitted, and the failure information from the active system. The data set at this point is the recovery destination 801 and the failure information 803 (steps 701, 702).

(2) System switching is performed based on the information set in step 702, and the second device 120, which was the standby system, starts providing the service as the active system in place of the first device 110 (step 703).

(3) When the second device 120, now providing the service as the active system, detects a failure during operation, it checks the switchability flag 802 and the failure information 803 in the recovery destination state holding unit 24 and determines whether the program has already been recovered on the third device 130, the recovery destination, that is, whether switching to the third device 130 is possible (steps 704, 705).

(4) If step 705 finds that switching to the third device 130 is possible, switching to the third device 130 is performed. If switching is not possible, the service of the system is stopped and the failure is investigated (steps 707, 706).

In the processing above, step 706 stops the system, but the program recovery function 22 of the second device 120 may instead be used to recover the program.
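The data of FIG. 9 and the decision of steps 705 to 707 can be sketched as follows. The class, field, and action names are hypothetical, and the sketch decides on the switchability flag alone; the description also mentions checking the failure information 803, which is omitted here for brevity.

```python
from dataclasses import dataclass


@dataclass
class RecoveryDestinationState:
    destination: str          # recovery destination 801
    switchable: bool = False  # switchability flag 802, set once recovery has finished
    failure_info: str = ""    # failure information 803 from the former active device


def on_failure_in_new_active(state):
    """Choose the action when the device that took over fails in turn."""
    if state.switchable:
        return ("switch_to", state.destination)   # step 707
    return ("stop_service", None)                 # step 706
```

The flag starts as False because, right after the first switchover, the recovery destination has not yet reported that the program is restored.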

図１０はプログラムを回復する第３装置１３０が第１装置１１０からプログラムをロードする命令を受信した際の処理動作を説明するフローチャートであり、次に、これについて説明する。 FIG. 10 is a flowchart for explaining the processing operation when the third device 130 for restoring a program receives an instruction to load a program from the first device 110, which will be described next.

（１）第３装置１３０は、図４により説明したフローにおけるステップ４０７の処理で第１装置からプログラムの回復命令を受信すると、第３装置１３０内で自装置の処理のためのプログラムが動作中であるか否かを判定し、プログラムが動作中であった場合、稼働環境部を停止させ、プログラムの動作を停止する（ステップ９０１〜９０３）。 (1) When the third device 130 receives a program recovery command from the first device in the process of step 407 in the flow described with reference to FIG. 4, the program for the processing of the own device is operating in the third device 130. If the program is operating, the operating environment unit is stopped and the operation of the program is stopped (steps 901 to 903).

（２）ステップ９０２の判定で、プログラムが動作中でなかった場合、または、ステップ９０３の処理で動作環境部を停止させた後、ステップ９０１で受信した回復情報に基いて、回復資源管理テーブル２１０からプログラムのセットとプログラムを動作させるための定義情報とを、図１のデータ転送ライン４３を介して取得する（ステップ９０４）。 (2) If it is determined in step 902 that the program is not in operation, or after the operating environment unit is stopped by the processing in step 903, the recovery resource management table 210 is based on the recovery information received in step 901. The program set and definition information for operating the program are acquired via the data transfer line 43 in FIG. 1 (step 904).

(3) Next, the programs and definitions obtained in step 904 are unpacked and distributed to the operating environment unit, and the construction command obtained in step 901 is executed, which puts the recovered program into the hot standby state. At this stage processing can be switched to the third device, so the third device gives notice that switching has become possible, using the switching destination information obtained in step 901. Here, the second device 120 is notified that switching has become possible. On receiving this notification, the second device 120 sets the switchability flag in the recovery destination information holding unit 24 to "possible" (steps 905 to 907).
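The flow of steps 901 to 907 can be sketched in Python as follows. This is only an illustration under stated assumptions: the recovery resource management table 210 is modeled as an in-memory dictionary, executing the construction command is simulated by recording it, and the assignment of step numbers 905 to 907 to deployment, construction, and notification is assumed. The names RecoveryCommand, ThirdDevice, and their fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RecoveryCommand:
    """Recovery information received in step 901 (field names are hypothetical)."""
    program: str
    version: str
    build_command: str
    notify_device: int  # device to notify that switching is possible, e.g. 120


@dataclass
class ThirdDevice:
    # Stand-in for table 210: (program, version) -> (files, definitions)
    resources: dict
    # Callback (target device, recovery device) used to send the notification
    notify: Callable[[int, int], None]
    device_id: int = 130
    running_program: Optional[str] = None
    deployed: Optional[tuple] = None
    executed_command: Optional[str] = None
    hot_standby: bool = False
    steps: list = field(default_factory=list)

    def handle_recovery_command(self, cmd: RecoveryCommand) -> None:
        self.steps.append(901)                 # recovery command received
        self.steps.append(902)                 # is a local program running?
        if self.running_program is not None:
            self.running_program = None        # stop the operating environment
            self.steps.append(903)
        # Fetch the program set and its definitions for the requested version
        files, definitions = self.resources[(cmd.program, cmd.version)]
        self.steps.append(904)
        self.deployed = (files, definitions)   # distribute to operating environment
        self.steps.append(905)
        self.executed_command = cmd.build_command  # simulated construction command
        self.hot_standby = True
        self.steps.append(906)
        self.notify(cmd.notify_device, self.device_id)  # switching now possible
        self.steps.append(907)


if __name__ == "__main__":
    switch_flags = {}  # stand-in for the flag in holding unit 24 of device 120

    def on_notify(target: int, source: int) -> None:
        switch_flags[target] = source

    device = ThirdDevice(
        resources={("app", "v2"): (["app.bin"], {"port": 8080})},
        notify=on_notify,
        running_program="batch",
    )
    device.handle_recovery_command(RecoveryCommand("app", "v2", "build app", 120))
    print(device.steps, device.hot_standby, switch_flags)
```

Passing the notification as a callback keeps the sketch free of any assumed network protocol, since the specification does not describe how the notice is transported.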

Each of the processes in the embodiment of the present invention described above can be implemented as a processing program. This program can be provided stored on a recording medium such as an HD, DAT, FD, MO, DVD-ROM, or CD-ROM, or it can be provided via a communication line.

The embodiment described above applies the present invention to an information processing system with a duplex configuration comprising an active device and a standby device. However, the present invention can be applied to any information processing system that is equipped with at least the programs required for system switching.

According to the embodiment described above, even when system switching fails because of a program failure, the program recovery function 12 deploys a program or environment that operates reliably to the designated recovery destination device, and the recovery destination device enters the hot standby state. Switching the system to the recovery destination device therefore shortens the period during which the service is stopped.

Furthermore, according to the embodiment, even for a failure that makes it difficult for the program recovery function of the active system to operate, the program recovery function of the standby system can be used to deploy a reliably operating program to the recovery destination and put it into the hot standby state. This makes it possible to recover even programs that run close to the hardware, such as an operating system.

FIG. 1 is a block diagram showing the configuration of an information processing system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the structure of the data stored in the recovery processing management database in the remote machine.
FIG. 3 is a diagram showing the structure of the data stored in the recovery resource management database in the remote machine.
FIG. 4 is a flowchart explaining the switching process performed when the system switching function of the first device detects a program failure.
FIG. 5 is a diagram showing an example of the structure of the failure-specific recovery table included in the program recovery function.
FIG. 6 is a flowchart explaining the processing operation by which the program recovery function unit of the first device identifies the program to be recovered that corresponds to the failure information.
FIG. 7 is a flowchart explaining the processing operation for obtaining the version of the program to be recovered, based on the information on that program obtained by the processing of FIG. 6.
FIG. 8 is a flowchart explaining the processing operation of the second device when it receives the failure information sent from the first device.
FIG. 9 is a diagram showing the data structure of the recovery state holding unit.
FIG. 10 is a flowchart explaining the processing operation when the third device, which recovers the program, receives a command from the first device to load the program.

Explanation of symbols

1 Communication path
11, 21, 31 Switching function unit
12, 22, 32 Program recovery function unit
13, 23, 33 Operating environment unit
14, 24, 34 Recovery destination state holding unit
15, 25, 35 Failure-specific recovery table
110 First device
120 Second device
130 Third device
200 Remote machine
210 Recovery resource management database
220 Recovery processing management database

Claims

1. A system switching method in an information processing system that comprises a first device serving as an active device and a second device serving as a standby device, and that switches operation to the second device when a failure occurs in the first device,
the method characterized in that, when a failure caused by a program failure occurs, the first device performs system switching to switch operation to the second device and causes a device that recovers the program to recover the failed program, and that, when the same failure as in the first device occurs in the second device after the system switching and the system switching fails, the second device performs system switching to switch operation to the device that recovers the program.

2. The system switching method according to claim 1, wherein the device that recovers the program is an independent third device or the first device after system switching.

3. The system switching method according to claim 1, wherein, when a program failure occurs, the first device identifies the program to be recovered from the failure information, identifies an appropriate version of that program from among the generation-managed programs based on operation results, and sends a recovery command to the device that recovers the program.

4. The system switching method according to claim 1, wherein, when switching operation to the device that recovers the program, the second device refers to the recovery destination information sent from the first device and performs system switching to the recovery destination device.

5. The system switching method according to claim 1, wherein the device that recovers the program obtains the files necessary for operation from a recovery resource management database in which generations of programs and definitions are managed, constructs an operating environment by executing a command associated in the recovery processing management database, and notifies the first device that the construction has been completed.

6. An information processing system that comprises a first device serving as an active device and a second device serving as a standby device, and that switches operation to the second device when a failure occurs in the first device,
wherein each of the first and second devices comprises: a system switching function unit that holds the state of a recovery destination for recovering a failed program and has means for performing system switching when a failure occurs; and a program recovery function unit that has a failure-specific recovery table associating failure information with the program to be recovered, accesses a recovery processing management database associating programs with their recovery processing and operation results, issues a recovery command to the device that recovers the program, and has a function of, on receiving a recovery command, setting up the program and its definitions as an operating environment.

7. The information processing system according to claim 6, wherein the device that recovers the program is an independent third device or the first device after system switching.

8. The information processing system according to claim 6, wherein, when a program failure occurs, the first device and the second device identify the program to be recovered from the failure information, identify an appropriate version of that program from among the generation-managed programs based on operation results, and send a recovery command to the device that recovers the program.

9. A system switching processing program held by each of a first device and a second device in a system comprising the first device and the second device, in which operation is switched to the other device when a failure occurs in one device,
the program performing system switching by executing the processing steps of: switching operation to the other device when a failure caused by a program failure occurs; identifying the program to be recovered from the failure information when the program failure occurs; identifying an appropriate version of the program from among the generation-managed programs based on operation results; sending a recovery command to a device that recovers the program; and, when a failure occurs in the device after system switching and the system switching fails, switching the system to the device that recovers the program.